CFD 11 – Day 3 – No one cares about backups

Until you experience data loss.

If you don’t care about good backups, you should start. Someone is going to make permanent that thing you thought was temporary, or make important that thing you thought wasn’t.

They are also going to store important data somewhere you thought important data didn’t belong – like Kubernetes.

But those are all stateless workloads, right? Surely there’s a snapshot of the data somewhere in S3, bi-directionally triple replicated with 20 9’s of durability.

Sorry, there isn’t. Not unless you put it there.

When I was first exposed to the “it’s not designed to be backed up, your important data should be elsewhere” logic regarding K8S, I wondered to myself how long THAT would last.

And now that the folks over at Veeam have moved into the adjacent market of Kubernetes data protection through their acquisition of Kasten, we can say that phase lasted about 5 years 🙂

It makes sense. Today’s datacenters are highly, if not exclusively, virtualized. Targeting the virtualization layer for backup makes the most sense, in most situations. And containers are just another form of virtualization.

The cloud? Using virtualization, too. But the underlying hypervisor is no longer exposed and available to capture backups directly from, so an adjustment in approach is needed. There are more efficient means of capturing data anyway – at the container level.

Image credit: Kasten

Kasten deploys directly into your generic Kubernetes environment, and from this layer it is able to interact with containerized workloads on a granular basis. And as a result of this approach, Kasten is able to both protect and restore data to a wide variety of K8S distributions, whether they be located on-premises or in the public cloud.

The disaster recovery implications are particularly important because there is no shortage of complex, active/active, disaster-tolerant reference architectures for a customer to choose from. But for customers with less aggressive recovery timelines, a recovery strategy that involves restore from backup can make more sense, and the options in this area have been lacking to this point.

Being able to utilize the same data protection tool, should your on-premises K8S distribution differ from the one you utilize in the cloud, is also a significant benefit. Having to weave together a series of solutions to meet your backup requirements isn’t a good time. Neither is dealing with a separate set of restore procedures when disaster strikes.

It’s encouraging to see vendors start to offer more mature data protection solutions for Kubernetes. Because, as we expected, yes – that data is going to need to be backed up (even though you said it wasn’t important).

Special thanks to Gestalt IT for assembling another great Cloud Field Day, and thank you to all the presenting sponsors. Until next time!

CFD 11 – Day 2 – It’s all about the ingress

There are a ton of important design decisions to make when creating your Kubernetes architecture, and deciding how the outside world is going to communicate with your hosted services, ingress, is a big part of that. And there can be no ingress without an ingress controller.

I came into Day 2 of Cloud Field Day 11 expecting to be most interested in Kubernetes data protection, but NGINX’s ingress controller function caught my attention, instead.

Image credit: AWS

Commercial Kubernetes offerings come out of the box with the design decision of which ingress controller to use having already been made. OpenShift, for example, utilizes an HAProxy layer deployed within the OCP environment to expose applications and load balance across components.

AWS EKS, on the other hand, implements part of the functionality using its own Load Balancer Controller for Kubernetes, combined with native VPC constructs like the AWS ALB, to achieve the same general outcome.
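
As a rough illustration of that shared outcome, the routing job an ingress controller performs can be sketched in a few lines of Python. The hostnames, paths and service names below are made up for the example, and real controllers (HAProxy, NGINX, the AWS Load Balancer Controller) do far more than this.

```python
# Toy model of ingress routing: map (host, path prefix) to a backend service.
# All hostnames and service names here are hypothetical.
from typing import Optional

RULES = [
    # (host, path prefix, backend service)
    ("shop.example.com", "/api", "shop-api-svc"),
    ("shop.example.com", "/", "shop-web-svc"),
    ("blog.example.com", "/", "blog-svc"),
]


def route(host: str, path: str) -> Optional[str]:
    """Return the backend for a request, preferring the longest matching prefix."""
    candidates = [
        (prefix, backend)
        for rule_host, prefix, backend in RULES
        if rule_host == host and path.startswith(prefix)
    ]
    if not candidates:
        return None  # no rule matched; a real controller would answer with a 404
    # Longest prefix wins, so "/api/orders" goes to the API rather than the web tier.
    return max(candidates, key=lambda c: len(c[0]))[1]


print(route("shop.example.com", "/api/orders"))
print(route("shop.example.com", "/cart"))
print(route("unknown.example.com", "/"))
```

Whichever controller you pick, this mapping of outside requests to inside services is the part that has to stay consistent for your applications.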

However, this decision isn’t set in stone, and the default ingress controller can be replaced using your choice of solutions, if another option better aligns with your requirements. But why would you want to do this?

Any functionality gained as a result of the swap could be offset by the loss of out-of-the-box usability and integrations. It turns out there are some specific benefits to be had by incorporating the NGINX ingress controller instead of an alternative.

Diagram NGINX Ingress Controller
Image credit: NGINX

First is synergy. If you already make use of NGINX elsewhere in your environment, that expertise (and some of your configurations) will carry forward and be useful as you develop your Kubernetes environment. If you use different Kubernetes distributions/services across your hybrid architecture, NGINX also gives you the ability to make your ingress controller configuration consistent across environments.

A second differentiator for the NGINX ingress controller is that web application firewall (WAF) functionality can be implemented at this layer, bringing the protection a WAF provides closer to the actual workload. Because the ingress represents the perimeter of the Kubernetes network, it makes sense to consider implementing security controls at this level.
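
To make the idea of filtering at the ingress concrete, here is a minimal sketch of a request check that runs before anything is handed to a workload. The patterns are simplistic placeholders I chose for illustration; they are not an NGINX rule set, and a real WAF is considerably more sophisticated.

```python
# Toy request filter placed in front of routing, in the spirit of a WAF.
# The patterns are simplistic placeholders, not a real rule set.
import re

BLOCK_PATTERNS = [
    re.compile(r"<script\b", re.IGNORECASE),           # naive cross-site scripting probe
    re.compile(r"\bunion\s+select\b", re.IGNORECASE),  # naive SQL injection probe
    re.compile(r"\.\./"),                              # path traversal attempt
]


def is_allowed(path: str, query: str = "") -> bool:
    """Return False if the request path or query matches any block pattern."""
    target = f"{path}?{query}"
    return not any(p.search(target) for p in BLOCK_PATTERNS)


def handle(path: str, query: str = "") -> int:
    """Return an HTTP-style status: 403 if blocked, 200 if passed to the workload."""
    if not is_allowed(path, query):
        return 403  # rejected at the perimeter, never reaches the pod
    return 200


print(handle("/products", "id=7"))
print(handle("/search", "q=1 UNION SELECT password FROM users"))
```

The point of the sketch is placement: the rejection happens at the edge of the cluster network, before the request consumes any application resources.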

Image credit: NGINX

Ingress controller selection is one of many security and networking decisions that can either contribute to harmony or disorder within your Kubernetes environment, especially in a hybrid context. For those willing to venture outside of the comfort of the manufacturer-default configuration, this solution could be worth a look.

Be sure to check out the live stream tomorrow, Friday 6/25/21 for more Cloud Field Day 11 – I am sure more interesting discussions are to follow.

CFD 11 – Day 1 – Storage is still hard

And Kubernetes makes it that much more so. It’s such a difficult challenge that some vendors’ path to success involves focusing on very specific use cases and trying to be the best at those, leaving the remainder of the overall storage “problem” to other vendors.

After all, if you try to do everything and fail, you are going to be laughed at. Just look at those crusty old storage arrays with bolt-on CSI drivers. Those are cloudy, right? So, I can’t blame some vendors for their desire to specialize. You won’t please everyone, but you may end up with a few solid wins.

MinIO, who presented on 6/23 in the afternoon timeslot of Cloud Field Day 11, is solidly in this camp. They take a focused, software defined approach and rely on a distributed, containerized architecture to aggregate local units of storage into a centralized pool for shared consumption.

Conceptually, this is similar to how virtual storage appliances consume local host storage and re-present to a virtualization cluster for consumption in the HCI world, except the lowest unit of distributed compute is now a container, and the pooled storage is presented using an object storage API.
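
The pooling concept can be sketched as a toy model: several nodes each contribute local storage, and a thin layer presents them as one object store with put/get semantics. Everything here (class names, hash-based placement, the bucket and key) is my own illustration. It is not how MinIO actually places or protects data.

```python
# Toy model: several nodes each contribute local storage to one shared object pool.
# Placement is a plain hash of the key; this is NOT how MinIO places or protects data.
import hashlib


class Node:
    def __init__(self, name: str):
        self.name = name
        self.objects = {}  # stands in for the node's local drives


class ObjectPool:
    def __init__(self, nodes):
        self.nodes = list(nodes)

    def _pick(self, bucket: str, key: str) -> Node:
        """Choose a node deterministically from the object's full name."""
        digest = hashlib.sha256(f"{bucket}/{key}".encode()).digest()
        return self.nodes[int.from_bytes(digest[:8], "big") % len(self.nodes)]

    def put(self, bucket: str, key: str, data: bytes) -> None:
        self._pick(bucket, key).objects[(bucket, key)] = data

    def get(self, bucket: str, key: str) -> bytes:
        return self._pick(bucket, key).objects[(bucket, key)]


pool = ObjectPool(Node(f"node-{i}") for i in range(4))
pool.put("backups", "db/2021-06-23.dump", b"example payload")
print(pool.get("backups", "db/2021-06-23.dump"))
```

The consumer only ever sees buckets and keys. Which node (or which drive) holds the bytes is the pool's problem, which is what makes the approach portable across environments.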

An overview of the software architecture is shown below (view from within a single node)

Cloud Native

Now we can zoom out and see how each logical storage node (containerized itself on K8s) presents its local storage to the network as part of a shared object pool:

Because the hardware and OS layers have been abstracted away, this lends itself to deployment across a variety of heterogeneous environments. This can be especially helpful for customers in hybrid scenarios that are looking to maintain a consistent Kubernetes storage layer across private and public cloud environments.

One such use case would be an Enterprise customer with an on-premises Tanzu deployment and another off-premises in VMC on AWS. Or, alternatively, an on-premises OpenShift deployment with another residing in AWS EC2. The same object storage layer, MinIO, could be maintained in both cases, lending itself to operational efficiencies vs. utilizing siloed storage solutions on both sides of the hybrid architecture.

As an architect, I appreciate being able to achieve simplicity and consistency, when possible. Even though there are ways to manage different Kubernetes distributions in a uniform manner, it can be useful just to use the same technology everywhere. I think the same logic can be applied to Kubernetes storage.

However, as I mentioned in the live session, if my customer requires more than object storage, I am left with additional research to do in order to optimally meet those requirements. MinIO does not attempt to address block and file needs. They can provide whatever storage type you need – as long as it’s object.

Anyway, one aspect of Cloud Field Day 11 that I was most looking forward to, coming in, was the anticipated discussion around persistent container storage. Aside from being generally interesting from a technical perspective, this is an area that I now touch on a daily basis, so I feel I have a bit more to contribute this go-round, compared to Field Days past.

Looking forward to hearing more on this front, especially with respect to the recovery of K8S data in disaster scenarios, in the days ahead.

Be sure to check out the livestream of Cloud Field Day 11, resuming Thursday 6/24 at 8AM PST, and follow us using #TFD11 on Twitter.

Why Design Process Matters – Developing a Concern for “Why”


Earlier in the year I had the opportunity to sit down virtually with Ethan Banks and Chris Wahl on the Datanauts podcast to discuss two of my favorite topics: design process and documentation. Being a Datanauts listener since the launch of the show in 2015, and a Packet Pushers listener since 2013, it was an honor to be able to contribute content to a platform that has done so much to continually encourage my career development.

As this series is meant to be a companion to the podcast, I’d recommend giving the episode a listen using the link or embedded audio below. We had a great discussion and I believe it’s well worth the time invested.

Datanauts 168: Why Design Process Matters For Data Centers And The Cloud

With that out of the way, I imagine a few questions might come to mind:

  • What makes design process and documentation so important?
  • Aren’t there new, cool technologies that should be talked about, instead?

In short, I believe there’s more than enough news-of-the-day type commentary on specific technologies, and instead I thought I’d share my thoughts on the topics that have drastically altered the trajectory of my career.

Aside from maintaining a general curiosity and investing off-hours time in developing relevant skills, I consider attention to design process and documentation responsible for much of my professional progress.

If you are in the IT infrastructure space, creating well-structured designs and effectively communicating your decisions will go a long way toward improving your work and others’ perception of it.

Before we dive into the details of these topics, though, I wanted to provide a bit of background on my career and how I came to understand and appreciate them. Hopefully the context proves useful as we move forward.

Developing a concern for “why”

Stage 1 – User Focus

My career in IT began in Managed Services where I started off as a systems technician deploying, migrating to and supporting Windows-based environments. While this role was far from glamorous, it exposed me to a wide variety of end-users, the applications they used, and the back-end infrastructure that supported their operation.

All the while I was under intense pressure to simultaneously think on my feet, learn quickly and provide a high level of customer service. At this level, being friendly, resourceful and responsive were probably the most useful techniques available to me, and I relied on them to get me through this user-centric phase.

Stage 2 – Technology Focus

As a result of this initial exposure and the growth that accompanied it, I was able to progress through the ranks to an infrastructure engineer role and focus more on the underlying infrastructure I was most interested in. Along with this transition came a separation from users, their needs and day-to-day complaints. It was a very welcome change.

In its place, a concern for the customer-wide impact of technology and the decisions I made developed. I found as my technical skills broadened, so did the scope of my responsibility and perspective. Detailed understanding of technology, impact of changes, overall work output and a can-do attitude were my go-to techniques when navigating this technology-centric phase.

Stage 3 – Business Focus

At some point, being immersed full time in the implementation and support of infrastructure technologies became less appealing, and I pursued a transition to a much more customer-facing pre-sales architecture role. That experience exposed me to organizations of all sizes with varying levels of internal IT expertise, process maturity and infrastructure complexity, which was a (mostly) welcome change.

As I soon discovered, the techniques I relied on in my previous roles were no longer enough. Being friendly, responsive and resourceful are table-stakes attributes for senior level positions.

A high level of work output is also assumed, as efficiency and multi-tasking are required to perform these new duties. And instead of being beneficial, a detailed understanding of technology, and exposing it during conversation with the wrong audience, can actually prove detrimental.

Glazing over a customer executive’s eyes with an improperly timed technical tangent is a quick (and painful) way to learn this lesson. What, then, was required to be successful when handling these new responsibilities of solution design?

A working understanding of the customer’s business, their goals and project-specific requirements was needed, at a minimum. Beyond this, there was still a need for a structured way to communicate decisions and rationale. The customer needs to understand how you intend to provide value and reduce risk, after all.

The answer, as I eventually discovered, was formalized design process and structured documentation, informed by a curiosity for the business side of things and driven by a concern for “why”.


Getting to a functional understanding of design across multiple technology silos wasn’t completely straightforward, though.

For each of the technologies I worked with, including virtualization and cloud, there were different sets of design guidance, significant variations in quality and sometimes conflicting advice to be reconciled.

As I suspect I am not the only one who has had this experience, I am hoping the lessons I learned will be useful to those traveling along the same path.

Throughout the remainder of this series, we’ll take a look at design process in general, specific guidance offered by both VMware and AWS, see if we can come to a working synthesis and provide a few helpful documentation tips along the way.

Stay tuned!


TFDx @ DTW ’19 – Get To Know: Big Switch

In the final post of this series ahead of TFDx @ Dell Technologies World 2019, we will be focusing on Big Switch Networks, their evolving relationship with Dell EMC and their presence here at the show.

I’d like to start out by acknowledging that partnerships are a dime-a-dozen, and many vendors tentatively put their “support” behind things just to check a box and say they have a capability. In addition, I have noticed a not-uncommon discrepancy between the messaging contained in vendor marketing materials and the messaging (or, general enthusiasm) of their SE’s. As a partner peddling vendor wares, this type of scenario is less than inspiring.

Fortunately, that does not appear to be the case with Dell EMC and their embrace of Open Networking. In discussions with multiple levels and types of Dell EMC partner SE’s, it is consistently mentioned as something that gives them an edge vs. other vendors, and it appears to be a point of pride. They are all about it.

Within this context, the recent news of the agreement between Dell EMC and Big Switch to OEM Big Switch products under the Dell EMC name makes a lot of sense. Dell EMC will provide the merchant-silicon based switching, Big Switch will provide the software, and the customer will get an open, mutually-validated and supported solution.

The primary components within this solution are Dell EMC S-Series Open Networking switches and Big Switch Big Cloud Fabric (BCF) software, so let’s talk a bit about those next.

Dell EMC S-Series Open Networking Switches

For purposes of brevity, I am going to focus on the switch type most relevant to the datacenter, the newly released line of 25Gbit+ switches. According to Dell EMC contacts, the per-port price is very competitive compared to the 10Gbit variants, and adoption of 25Gbit (and above) looks to be accelerating.

Within this lineup, there are a number of port densities and uplink configurations available, including the following:

  • S5048F-ON & S5148F-ON: 48x25GbE and 6x100GbE or 72x25GbE
  • S5212F-ON: 12x25GbE and 3x100GbE
  • S5224F-ON: 24x25GbE and 4x100GbE
  • S5248F-ON: 48x25GbE and 6x100GbE
  • S5296F-ON: 96x25GbE and 8x100GbE
  • S5232F-ON: 32x100GbE
  • S6010-ON: 32x40GbE or 96x10GbE and 8x40GbE
  • S6100: 32x100GbE, 32x50GbE, 32x40GbE, 128x25GbE or 128x100GbE (breakout)

Obviously, it’s always impressive to see the specifications associated with the top model in a product line. With up to 128×100 Gbit ports available, the S6100 is no exception.

What stands out to me, though, is the inclusion of a very interesting half-width 12-port model. With this, a customer can power a new all-flash HCI (or other scale-out) environment of up to 12 nodes and occupy only 1U of rack space for networking. All while retaining network switch redundancy.
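
A quick back-of-the-envelope check of that 12-node claim is below. The assumptions are mine rather than anything from Dell EMC documentation: two half-width switches sit side by side in 1U, and each node has a single 25GbE link to each switch.

```python
# Back-of-the-envelope check of the 12-node, 1U claim.
# Assumptions (mine): two half-width switches share 1U, and every node
# has exactly one 25GbE link to each switch for redundancy.
SWITCHES_PER_RU = 2             # half-width form factor
PORTS_PER_SWITCH = 12           # 25GbE ports on the S5212F-ON
LINKS_PER_NODE_PER_SWITCH = 1
PORT_SPEED_GBIT = 25

max_nodes = PORTS_PER_SWITCH // LINKS_PER_NODE_PER_SWITCH
per_node_gbit = SWITCHES_PER_RU * LINKS_PER_NODE_PER_SWITCH * PORT_SPEED_GBIT
per_node_gbit_degraded = (SWITCHES_PER_RU - 1) * LINKS_PER_NODE_PER_SWITCH * PORT_SPEED_GBIT

print(f"nodes supported: {max_nodes}")
print(f"bandwidth per node, both switches up: {per_node_gbit} Gbit/s")
print(f"bandwidth per node, one switch down: {per_node_gbit_degraded} Gbit/s")
```

Under those assumptions every node keeps a working path if one switch fails, at half the aggregate bandwidth.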

With compute and storage densities where they are in 2019, you can house a reasonably-sized environment with 12 HCI nodes. It can also be useful to keep HCI-specific east/west traffic off of the existing switching infrastructure, depending on the customer environment.

Not all customers in need of new compute and storage are ready to bite the bullet on a network refresh or re-architecture, either. This gives solution providers a good tool in the toolbelt for these occasions, and other networking vendors should take note.

The star of the show is…a 12-port switch? In a way, yes.

Common within the Dell EMC S-Series of Open Networking switches is the inclusion of the Open Network Install Environment (ONIE), which enables streamlined deployment of alternative OS’es, including Big Switch Networks BCF. Dell’s own OS10 network OS is also available for deployment, should the customer want to go that direction in some instances.

Underpinning all of this is merchant silicon, so customers don’t need to worry about lack of hardware capability, vendor expertise or R&D as much here. This approach allows specialist vendors like Broadcom and Barefoot to focus on what they do best, chip engineering, while Dell EMC and software vendors like Big Switch can focus on how to get the most from provided capabilities. Hardware parity also brings costs down and encourages innovation through software, which is a beneficial thing.

Although a full analysis of Dell’s use of merchant ASIC’s in their networking gear is outside the scope of this post (and my wheelhouse), I’d recommend checking out this analysis on NextPlatform for more info. I think it’s safe to say the arguments against “whitebox” and for proprietary solutions are beginning to lose their potency, though.

An Open Networking switch equipped with ONIE doesn’t move frames by itself, though. For that, you’ll need an OS like Big Switch BCF, which we’ll touch on next.

Big Switch Networks Big Cloud Fabric

Big Switch Networks Big Cloud Fabric is available in two variants: Public Cloud (BCF-PC) and Enterprise Cloud (BCF-EC). Since we are focusing on the deployment of Big Switch as part of a Dell EMC Open Networking solution, we’ll keep things limited to BCF-EC, for now.

At its foundation, BCF is a controller-based design that moves the control plane off of the switches themselves and onto an intelligent central component (controller). This controller is typically implemented as a highly-available pair of appliances to ensure control services are resistant to failure.

As network changes are needed throughout the environment, these are made in automated fashion through API calls between the controller and subordinate switches. These switches are powered by a combination of merchant silicon and the Switch Light OS and are available from a number of vendors, including Dell EMC.
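
The controller pattern described above can be sketched as follows. The class and method names are hypothetical and do not correspond to the Big Cloud Fabric API. The sketch only shows the shape of the idea: a change is recorded once at the controller and then pushed to every switch.

```python
# Toy controller/switch model. Class and method names are hypothetical and do
# not correspond to the Big Cloud Fabric API.


class Switch:
    def __init__(self, name: str):
        self.name = name
        self.vlans = set()  # forwarding state programmed by the controller

    def apply(self, vlans: set) -> None:
        """Stand-in for an API call from the controller to this switch."""
        self.vlans = set(vlans)


class Controller:
    def __init__(self, switches):
        self.switches = list(switches)
        self.desired_vlans = set()

    def add_vlan(self, vlan_id: int) -> None:
        """Record the change once, then push it to every switch in the fabric."""
        self.desired_vlans.add(vlan_id)
        for switch in self.switches:
            switch.apply(self.desired_vlans)


fabric = Controller([Switch("leaf-1"), Switch("leaf-2"), Switch("spine-1")])
fabric.add_vlan(100)
fabric.add_vlan(200)
print({s.name: sorted(s.vlans) for s in fabric.switches})
```

Nobody logs in to an individual switch here. The switches hold no independent configuration, which is where the consistency and visibility benefits come from.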

Big Switch diagram showing an example leaf-spine architecture powered by Big Cloud Fabric

There are a number of benefits associated with the resulting configuration, including simplified central management, increased visibility into traffic flows and behavior, and improved efficiency through automation. One great use-case for this type of deployment is within a VMware-based SDDC. A solid whitepaper expanding on the benefits of the combined Big Switch and Dell EMC networking solution within a VMware-based virtualization environment can be found here.


All in all, I think this OEM agreement is good news in support of competition and customer choice. It’s also encouraging that Dell EMC appears to be bought-in to Open Networking, both in word and in practice.

Despite this, I still think Dell EMC could do a better job of promoting and selling their network line. It’s not a one-way street, though. It’s also the responsibility of partners (all architects and decision-makers, really) to re-evaluate solutions as they evolve and adjust previous conclusions, as appropriate. Increasingly often, you can come up with a good answer without using the C-word (Cisco).

I look forward to talking more with the Big Switch team about BCF on Dell EMC Open Networking switching during their session at TFDx this Wednesday at 16:30. Be sure to check out the livestream and submit any questions/comments on Twitter to the hashtag #TFDx.

TFDx @ DTW ’19 – Get To Know: Kemp

Next up in our Get To Know series, we have a well-known vendor whose primary solutions many of us are already familiar with: load balancers and application delivery controllers. When I run into these components in the real world, they are typically implemented in front of, or between tiers, in a multi-tier application. However, the use case Kemp is bringing to the table for Dell Technologies World 2019 and Tech Field Day may not be the one you would expect.

The big news coming out of Kemp ahead of the conference is that they are the only load balancing solution to be certified under the Dell EMC Select program for use with Dell EMC’s Elastic Cloud Storage solution. Although it’s easy enough to understand why a load balancer would be useful within the context of a scale-out storage solution, I am not intimately familiar with ECS itself, so let’s take a quick look at how that solution works.

Dell EMC Elastic Cloud Storage

At a high level, the ECS solution consists of object-based storage software available for deployment on-premises, in public cloud or consumption as a hosted service. Nodes are organized and presented under a single, global namespace and the solution is intended to scale horizontally through addition of nodes.

ECS has been designed to accommodate deployment across multiple locations and/or regions simultaneously, which is a key part of a global data management strategy. As you might expect, a number of data protection schemes are possible, and the available storage can be consumed using a number of protocols, supporting the “unifying” part of Dell EMC’s messaging.
More information on the architecture of ECS can be found here.

While ECS is functional as a standalone product, Dell EMC highly recommends that this solution be deployed in conjunction with a load balancer, which brings us to our next subject.

Dell EMC diagram showing high-level services provided by a distributed deployment of ECS.

Kemp LoadMaster

At some level, the challenges with this type of architecture are not dissimilar to the ones seen when scaling a multi-tier app or creating a multi-region design for said application.

As we begin to scale horizontally, it becomes critical to have a central point of communication brokerage so load can be distributed and failure can be handled in a graceful way. Management of traffic across geographic regions according to environment load, failure events and user location, can also be important.
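
A minimal sketch of that brokerage role: requests are spread across nodes in turn, and nodes marked unhealthy are skipped. The node names are placeholders, and this is a generic round-robin illustration rather than anything specific to Kemp LoadMaster. Geographic traffic management is left out to keep it short.

```python
# Toy round-robin balancer that skips nodes marked unhealthy.
# Node names are placeholders; this is not Kemp LoadMaster configuration.
from itertools import cycle


class Balancer:
    def __init__(self, nodes):
        nodes = list(nodes)
        self.health = {node: True for node in nodes}
        self._ring = cycle(nodes)

    def mark(self, node: str, healthy: bool) -> None:
        """Record the result of a health check for one node."""
        self.health[node] = healthy

    def next_node(self) -> str:
        # Try each node at most once per call so a full outage raises
        # an error instead of looping forever.
        for _ in range(len(self.health)):
            node = next(self._ring)
            if self.health[node]:
                return node
        raise RuntimeError("no healthy nodes available")


lb = Balancer(["ecs-node-1", "ecs-node-2", "ecs-node-3"])
lb.mark("ecs-node-2", healthy=False)
picks = [lb.next_node() for _ in range(4)]
print(picks)
```

Clients keep talking to one address while the failed node is quietly taken out of rotation, which is the graceful failure handling a scale-out storage cluster needs.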

This, as you might guess, is where Kemp comes into play. An example of how this joint solution might be deployed is shown below:

Dell EMC diagram showing multi-site deployment of ECS with Kemp load balancing.


The desire to be the single, unifying Object storage platform employed in the cloud and on-premises for broad consumption by customer applications is not unique. Many other vendors are targeting the same goal.

With so many options for scalable object storage available, I will be very interested to hear more about the value proposition of this joint solution, as well as learn how the solution handles issues of scale, availability and performance. I expect Kemp has some differentiators to emphasize here, otherwise they wouldn’t be the only load balancer within the Dell EMC Select program.

If you will be attending Dell Technologies World this year, pay Kemp a visit at booth #1546 to hear more about how the LoadMaster product works within an ECS deployment.

I’d also recommend checking out their TFDx session on Wednesday 5/1/19 at 15:00. The live stream can be accessed here. If you have any questions or comments during their session, feel free to submit them to the hashtags #TFDx and #KempAX4Dell

TFDx @ DTW ’19 – Get To Know: Liqid


It’s been said that innovation begets innovation, and Liqid has developed a very interesting composable platform that builds upon recent developments in the areas of interconnect and fabric technology. But before we get into the technical specifics, let’s quickly touch on a few of the drawbacks of traditional infrastructure that composable solutions look to improve upon:

  • Procuring, deploying, and managing datacenter infrastructure is labor-intensive and can be complex.
  • Bespoke configurations, common lack of centralized management and automation capabilities can impact consistency and reproducibility.
  • Statically-configured resources can be over or under utilized, either leading to performance issues or preventing maximum return on investment.
  • Operations teams responsible for said infrastructure can struggle to be as responsive as their application owners and developers would like.

Composable solutions, on the other hand, take a building-block based approach, where resources are implemented as disaggregated pools and managed dynamically through software.

Depending on which vendor you ask, the definitions of “composable” and “disaggregated”, as well as the types of resources available for composition, will vary. The common theme here is that we are moving away from static configurations toward a systems architecture that is dynamically configurable through software.

Liqid, as you will see, has a very different take on composability than HPE and Dell, but that doesn’t mean HPE and Dell hardware can’t be part of the Liqid solution. Thus their presence at Dell Tech World 2019, I suppose. 🙂

At its core, their solution consists of three primary components: the Fabric, the Resources, and the Manager. We’ll take a closer look at these next.

The Fabric

What is the self-described “holy grail of the datacenter fabric” that makes the Liqid approach to composability possible? Infiniband? No. It’s not Ethernet, either. It’s PCIe.

Liqid argues that because PCIe currently is, and has been, leveraged heavily in modern CPU architectures, it is uniquely positioned to connect compute to peripheral resources across a switched fabric. This architecture decision allows Liqid to avoid additional levels of abstraction or protocol translation, which at a minimum keeps things more elegant.

At its core, the fabric is powered by a 24-port PCI Express switch, with each port being capable of Gen3 x4 speeds. This equates to a per-port bandwidth of 8GB/s full-duplex and a total switch capacity of 192GB/s full-duplex. Devices can be physically connected via copper (MiniSAS) or Photonics, providing some flexibility in connecting the required resources.
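
For anyone who wants to check the math, the figures can be reproduced from the PCIe Gen3 signaling rate (8 GT/s per lane with 128b/130b encoding). The quoted 8GB/s and 192GB/s are rounded; the sketch below shows the slightly lower numbers once encoding overhead is included. Protocol overhead beyond line encoding is ignored.

```python
# Reproducing the per-port and per-switch bandwidth figures.
# PCIe Gen3 signals at 8 GT/s per lane and uses 128b/130b line encoding.
GT_PER_SEC = 8.0
ENCODING = 128 / 130
LANES_PER_PORT = 4
PORTS = 24

lane_gbytes = GT_PER_SEC * ENCODING / 8          # GB/s per lane, one direction
port_one_way = lane_gbytes * LANES_PER_PORT      # one direction of a x4 port
port_full_duplex = port_one_way * 2              # both directions combined
switch_full_duplex = port_full_duplex * PORTS    # all 24 ports

print(f"per port, one direction: {port_one_way:.2f} GB/s")
print(f"per port, full duplex:   {port_full_duplex:.2f} GB/s")
print(f"whole switch:            {switch_full_duplex:.1f} GB/s")
```

Rounding the per-port figure up to 8GB/s and multiplying by 24 ports gives the 192GB/s quoted for the switch.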

Overall, the approach of using a native PCIe fabric allows Liqid to be one step closer to true composability than the bigger players, because a larger number of resource types can be pooled and dynamically allocated. More on this in a moment.

Overview of disaggregated resources and their relationship to the Liqid PCIe fabric.

The Resources

In reading over the benefits of available composable systems, it’s easy to get the impression that compute, network and storage resources are the only relevant resource types. HPE Synergy, as an example, introduces hardware resources in the form of an improved blade chassis (frame) with abstracted, virtual networking and internal storage presented over an internal SAS fabric.

Resources can be dynamically and programmatically managed, but the scope of the sharing domain is limited to the frame. Although this limits flexibility, there are still a number of benefits to HPE Synergy vs. a traditional architecture. This is just one interpretation of what composable should look like.

Liqid takes a different approach and deploys pools of resources using commodity hardware attached to their PCIe switch fabric. Because of the use of PCIe, a number of additional resource types are available for composition, including GPU’s, NVMe storage, FPGA’s and Optane-based memory. Compute resources are provided by commodity x86 servers containing both CPU and RAM. This additional flexibility is a primary differentiator for Liqid vs. the other available composable solutions.
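
To illustrate what composing from disaggregated pools means in practice, here is a toy model. The resource names and counts are invented, and nothing here represents the Liqid Command Center API. It only shows devices being claimed from shared pools for one system and handed back later.

```python
# Toy model of composing a system from disaggregated pools.
# Resource names and counts are illustrative, not Liqid Command Center API calls.


class Fabric:
    def __init__(self, pools: dict):
        self.free = dict(pools)  # e.g. {"gpu": 8, "nvme": 16}

    def compose(self, request: dict) -> dict:
        """Allocate devices for one system, or fail without changing the pools."""
        short = {k: n for k, n in request.items() if self.free.get(k, 0) < n}
        if short:
            raise ValueError(f"insufficient resources: {short}")
        for kind, count in request.items():
            self.free[kind] -= count
        return dict(request)

    def release(self, system: dict) -> None:
        """Return a system's devices to the shared pools."""
        for kind, count in system.items():
            self.free[kind] += count


fabric = Fabric({"gpu": 8, "nvme": 16, "fpga": 2})
ml_box = fabric.compose({"gpu": 4, "nvme": 2})
print(fabric.free)   # pools shrink while the system exists
fabric.release(ml_box)
print(fabric.free)   # devices are available for the next request
```

The contrast with a static server build is that the four GPUs are not stranded in one chassis. Once released, they can be attached to a different compute node.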

Commodity resources attached to x86 compute over the Liqid PCIe fabric

The Manager

Bringing the solution together is the management component, the Liqid Command Center. This provides administrators with a way to graphically and programmatically create systems of the desired configuration using compute, storage, network and other resources present on the fabric. In short, the features you’d expect to be present are here, and it looks like some attention has been paid to the style of the interface. A brief demonstration is available on YouTube and gives a good preview of the look/feel and capabilities:


Although there’s a significant amount of marketing fluff to sift through at times when looking into composable solutions, I don’t believe composability is just another meaningless throw-around term.

There are benefits to be had, both on the technical and operational side of things. Based on my initial research, the Liqid approach appears to be a step in the right direction. However, achieving true composability looks to be a work in progress for all solution vendors.

I look forward to talking with the Liqid team about that point and more this Wednesday 5/1/19 at TFDx. Check out the live stream at 13:30 using the link below, and feel free to send your questions via Twitter using the hashtags #TFDx and #DellTechWorld.