CFD13 – Day 3 – Kubernetes Resource Management

One of the foundational value propositions of Kubernetes was the increase in autonomy and flexibility afforded to developers in deploying their applications. This freedom has proven to be very useful in some instances, but it also places an additional burden on developers: making decisions around lower-level concepts like resource management.

And as it turns out, the same resource management challenges that have previously impacted physical and virtual environments have become entrenched in the container platforms that now reside further up the stack. Except this time, software people instead of infrastructure people are at the controls. It makes for an interesting dynamic.

The fundamental issues are similar, though. Over-provisioning leads to waste. Under-provisioning leads to performance degradation and instability. Avoiding these outcomes requires either understanding the workload before deployment, or, in the absence of that understanding, the ability to adapt to changes in resource requirements as they emerge.
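For anyone who hasn’t had to make this call yet, the decision boils down to a few lines of a pod spec. A minimal sketch – the container name, image, and values below are purely illustrative, not a recommendation:

```yaml
# Illustrative only - the container name, image, and values are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: example-api
spec:
  containers:
    - name: api
      image: example/api:1.0
      resources:
        requests:          # what the scheduler reserves for the container
          cpu: "250m"
          memory: "256Mi"
        limits:            # the ceiling before CPU throttling or memory OOM kills
          cpu: "500m"
          memory: "512Mi"
```

Those few values are where the over- and under-provisioning tradeoff lives.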

This was one of the topics of discussion with StormForge during Day 3 of Cloud Field Day 13. Their solution started out as a platform that facilitated experimentation with various resource values, with the goal of better educating teams about their workloads’ requirements. Driven by customer demand, the solution has since evolved to focus on observing existing deployments, recommending corrections, and automating remediation.

This is a telling transition – understanding the workload is still an intensive process during the design phase, and there are arguments about whether infrastructure-level details like resource management are really a developer’s responsibility. Based on current trends, though, automated remediation via an “easy button” approach appears to be what many teams prefer.
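To be clear, the sketch below is not StormForge’s product – it’s the upstream, add-on Vertical Pod Autoscaler – but it gives a sense of what “automated remediation” looks like as a Kubernetes primitive: observe usage, then apply new requests automatically. The target Deployment name is hypothetical:

```yaml
# Sketch of the upstream (add-on) Vertical Pod Autoscaler - not StormForge's product.
# "example-api" is a hypothetical Deployment.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: example-api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-api
  updatePolicy:
    updateMode: "Auto"   # apply recommendations by evicting and recreating pods
```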

But the question remains, how much design work are we comfortable delegating to artificial intelligence?

Thanks to the team at Gestalt IT for another great Cloud Field Day experience! I hope to see everyone in person next time. Until then!

CFD13 – Day 2 – Cloud-based Recovery Vaults

With ransomware attacks and other similar threats increasing in frequency and severity, data protection architectures across many organizations are being scrutinized more closely.

In many environments, backup repositories exist on-premises and are attached in some way to an accessible network using a standard protocol. This placement and attachment approach gives protected resources an efficient path for depositing backup data.

However, this accessibility is frequently exploited during attacks, resulting in impact to both running workloads and associated backups. One way to combat this vulnerability is to incorporate an additional copy (a vault) of critical backup data that is isolated from both a network and a security perspective. Variations on the theme exist, but this is the basic premise.

Metallic, a Commvault company, presented on their capabilities in this area at Cloud Field Day 13, and it made for an interesting discussion.

The cloud can be a good fit for this type of use-case because of the additional physical and logical security boundaries that separate the impacted environment from the recovery vault. But this placement also presents a few challenges when it comes to recovery scenarios, and high on this list is the potential impact to recovery time.

In the event of an attack impacting a large quantity of on-premises data, where the emergency copy resides in the cloud, the overall recovery time is going to be highly dependent on network throughput between the cloud environment and the impacted datacenter.
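To put rough numbers on it: pulling 100 TB back over a sustained 10 Gbps connection works out to roughly 22 hours of transfer time at theoretical line rate, before accounting for restore processing overhead or the reality that few links sustain line rate for a full day.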

So, organizations dependent on a cloud-based recovery vault should account not just for steady-state backup traffic requirements, but also for the ability to scale network throughput in the event an emergency restore from the cloud is required.

In this example, Metallic is hosted within Microsoft Azure, so this could be as straightforward as planning an adjustment to your ExpressRoute links ahead of time. If there’s a lead time associated with this type of work, it should be accounted for during the backup design process, and not discovered during an emergency.

Simplification of a solution in one area can sometimes lead to an increase in complexity in another, and this is a good example. Selecting a cloud-based platform for hosting a recovery vault can decrease administrative and storage complexity, while introducing network elements that must be accounted for. The tradeoff may be worthwhile for some, in exchange for the convenience afforded.

Day 3 of Cloud Field Day 13 kicks off tomorrow with StormForge, followed by RackN and Fortinet. See everyone there.

CFD13 – Day 1 – “The Cloud” acquiesces to Enterprise Storage

I was as surprised as anyone to hear that in the Fall of 2021, AWS announced the general availability of FSx for NetApp ONTAP. Immediately, I visualized racks of co-located storage arrays with custom NetApp “Cloud” Storage logos hung on each cabinet door.

No, that’s not very cloudlike, that can’t be it, I thought. My mind then went to the possibility of fleets of ONTAP virtual appliances deployed in automated fashion and attached to customer VPC’s. The virtual appliance model isn’t very “cloudy”, either. It’s hard to imagine an official AWS service offering based on that approach, but it’s possible.

Fortunately, as the NetApp team clarified during their presentation and roundtable discussion at Cloud Field Day 13, the partnership with AWS has resulted in a deeper level of integration than the scenarios I was envisioning. The joint engineering effort has apparently resulted in a fully-fledged cloud service, where the functionality of ONTAP is hosted and exposed using the same approach as other core AWS services (although the details couldn’t be shared).
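One practical upshot is that, from the consumer’s perspective, the filesystem is provisioned like any other AWS resource. As a hedged CloudFormation sketch (subnet IDs and sizing are placeholders – verify the property names against the current AWS::FSx::FileSystem documentation before relying on this):

```yaml
# Hedged sketch of provisioning FSx for NetApp ONTAP via CloudFormation.
# Subnet IDs and sizing values are placeholders - check property names
# against the current AWS::FSx::FileSystem documentation.
Resources:
  OntapFileSystem:
    Type: AWS::FSx::FileSystem
    Properties:
      FileSystemType: ONTAP
      StorageCapacity: 1024            # GiB
      SubnetIds:
        - subnet-0123456789abcdef0
        - subnet-0fedcba9876543210
      OntapConfiguration:
        DeploymentType: MULTI_AZ_1
        ThroughputCapacity: 128        # MBps
        PreferredSubnetId: subnet-0123456789abcdef0
```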

I am not going to speculate as to why this specific vendor and technology were chosen, rather than extending the functionality of existing offerings to provide the full suite of expected Enterprise data services, like multiprotocol support, replication, snapshotting, and disaster-recovery orchestration.

It’s worthwhile to simply look at this service-offering strategy as an acknowledgement of the continuing importance of infrastructure-level services, and the potential barrier to cloud adoption that matters of storage represent.

The practicality of historical comfort has clashed with the idealism of what the cloud “should be”, and the cloud has adapted. It’s an interesting development.

Tune in tomorrow for more Cloud Field Day 13.

CFD 11 – Day 3 – No one cares about backups

Until you experience data loss.

If you don’t care about good backups, you should start. Someone is going to make permanent that thing you thought was temporary, or make important that thing you thought wasn’t.

They are also going to store important data somewhere you thought important data didn’t belong – like Kubernetes.

But those are all stateless workloads, right? Surely there’s a snapshot of the data somewhere in S3, bi-directionally triple replicated with 20 9’s of durability.

Sorry, there isn’t. Not unless you put it there.

When I was first exposed to the “it’s not designed to be backed up, your important data should be elsewhere” logic regarding K8S, I wondered to myself how long THAT would last.

And now that the folks over at Veeam have moved into the adjacent market of Kubernetes data protection through their acquisition of Kasten, we can say that phase lasted around 5 years 🙂

It makes sense. Today’s datacenters are highly, if not exclusively, virtualized. Targeting the virtualization layer for backup makes the most sense, in most situations. And containers are just another form of virtualization.

The cloud? Using virtualization, too. But the underlying hypervisor is no longer exposed and available to capture backups directly from, so an adjustment in approach is needed. There are more efficient means of capturing data anyway – at the container level.

Image credit: Kasten

Kasten deploys directly into your Kubernetes environment, and from this layer it is able to interact with containerized workloads on a granular basis. As a result of this approach, Kasten is able to protect and restore data across a wide variety of K8S distributions, whether located on-premises or in the public cloud.
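I won’t attempt to reproduce Kasten’s own policy syntax here, but it’s worth noting that this kind of granular protection can build on standard Kubernetes primitives – CSI volume snapshots in particular – which is part of why it travels well across distributions. Assuming a CSI driver with snapshot support, the underlying building block looks roughly like this (the PVC and snapshot class names are made up):

```yaml
# Generic CSI snapshot primitive that K8s-native backup tools can build on.
# The namespace, PVC, and VolumeSnapshotClass names are hypothetical.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: orders-db-snapshot
  namespace: orders
spec:
  volumeSnapshotClassName: csi-snapclass
  source:
    persistentVolumeClaimName: orders-db-data
```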

The disaster recovery implications are particularly important because there is no shortage of complex, active/active, disaster-tolerant reference architectures for a customer to choose from. But for customers with less aggressive recovery timelines, a recovery strategy that involves restore from backup can make more sense, and the options in this area have been lacking to this point.

Being able to utilize the same data protection tool, should your on-premises K8S distribution differ from the one you utilize in the cloud, is also a significant benefit. Having to weave together a series of solutions to meet your backup requirements isn’t a good time. Neither is dealing with a separate set of restore procedures when disaster strikes.

It’s encouraging to see vendors start to offer more mature data protection solutions for Kubernetes. Because, as we expected, yes – that data is going to need to be backed up (even though you said it wasn’t important).

Special thanks to Gestalt IT for assembling another great Cloud Field Day, and thank you to all the presenting sponsors. Until next time!

CFD 11 – Day 2 – It’s all about the ingress

There are a ton of important design decisions to make when creating your Kubernetes architecture, and deciding how the outside world is going to communicate with your hosted services – ingress – is a big part of that. And there can be no ingress without an ingress controller.

I came into Day 2 of Cloud Field Day 11 expecting to be most interested in Kubernetes data protection, but NGINX’s ingress controller caught my attention instead.

Image credit: AWS

Commercial Kubernetes offerings come out of the box with the design decision of which ingress controller to use having already been made. OpenShift, for example, utilizes an HAProxy layer deployed within the OCP environment to expose applications and load balance across components.

AWS EKS, on the other hand, implements part of the functionality using its own Load Balancer Controller for Kubernetes, combined with native VPC constructs like the AWS ALB, to achieve the same general outcome.
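Regardless of which controller is in play, the Ingress resource the application team writes is a standard Kubernetes object, and the ingress class determines which controller actually satisfies it. A minimal example – the host, service, and class names are illustrative:

```yaml
# Standard Ingress resource - which controller handles it is largely a matter
# of ingressClassName (plus any controller-specific annotations).
# Host, service, and class names here are illustrative.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress
spec:
  ingressClassName: nginx        # could instead be the distribution's default class
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web
                port:
                  number: 80
```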

However, this decision isn’t set in stone, and the default ingress controller can be replaced using your choice of solutions, if another option better aligns with your requirements. But why would you want to do this?

Any functionality gained as a result of the swap could be offset by the loss of out-of-the-box usability and integrations. It turns out, though, that there are some specific benefits to be had by incorporating the NGINX ingress controller instead of an alternative.

Image credit: NGINX

First is synergy. If you already make use of NGINX elsewhere in your environment, that expertise (and some of your configuration) will carry forward and be useful as you develop your Kubernetes environment. And if you use different Kubernetes distributions or services across a hybrid architecture, NGINX gives you the ability to keep your ingress controller configuration consistent across environments as well.

A second differentiator for the NGINX ingress controller is that web application firewall (WAF) functionality can be implemented at this layer, bringing the protection a WAF provides closer to the actual workload. Because the ingress represents the perimeter of the Kubernetes network, it makes sense to consider implementing security controls at this level.

Image credit: NGINX
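As a frame of reference only – the commercial NGINX Ingress Controller pairs with NGINX App Protect for this, and I won’t guess at its exact syntax – the community ingress-nginx project exposes the same idea through its ModSecurity annotations:

```yaml
# Illustration using the community ingress-nginx controller's ModSecurity hooks;
# the commercial NGINX Ingress Controller uses NGINX App Protect for WAF instead.
# The backend service name is a placeholder.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress-waf
  annotations:
    nginx.ingress.kubernetes.io/enable-modsecurity: "true"
    nginx.ingress.kubernetes.io/enable-owasp-core-rules: "true"
spec:
  ingressClassName: nginx
  defaultBackend:
    service:
      name: web
      port:
        number: 80
```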

Ingress controller selection is one of many security and networking decisions that can contribute to either harmony or disorder within your Kubernetes environment, especially in a hybrid context. For those willing to venture outside the comfort of the manufacturer-default configuration, this solution could be worth a look.

Be sure to check out the live stream tomorrow, Friday 6/25/21, for more Cloud Field Day 11 – I am sure more interesting discussions are to follow. https://techfieldday.com/event/cfd11/

CFD11 – Day 1 – Storage is still hard

And Kubernetes makes it that much more so. It’s such a difficult challenge that some vendors’ path to success involves focusing on very specific use cases and trying to be the best at those, leaving the remainder of the overall storage “problem” to other vendors.

After all, if you try to do everything and fail, you are going to be laughed at. Just look at those crusty old storage arrays with bolt-on CSI drivers. Those are cloudy, right? So, I can’t blame some vendors for their desire to specialize. You won’t please everyone, but you may end up with a few solid wins.

MinIO, which presented on 6/23 in the afternoon timeslot of Cloud Field Day 11, is solidly in this camp. They take a focused, software-defined approach and rely on a distributed, containerized architecture to aggregate local units of storage into a centralized pool for shared consumption.

Conceptually, this is similar to how virtual storage appliances consume local host storage and re-present it to a virtualization cluster for consumption in the HCI world, except the lowest unit of distributed compute is now a container, and the pooled storage is presented using an object storage API.
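For the Kubernetes-native deployment model, MinIO provides an operator where the pool is described declaratively. The sketch below paraphrases the operator’s Tenant resource from memory, so treat the field names and values as illustrative rather than copy-paste material:

```yaml
# Sketch of a MinIO Operator Tenant - field names paraphrased from memory,
# values illustrative. Four pods, four volumes each, pooled into one
# S3-compatible namespace.
apiVersion: minio.min.io/v2
kind: Tenant
metadata:
  name: object-pool
  namespace: minio-tenant
spec:
  pools:
    - name: pool-0
      servers: 4               # MinIO pods participating in the pool
      volumesPerServer: 4      # PVCs (local disks) contributed by each pod
      volumeClaimTemplate:
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 1Ti
```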

An overview of the software architecture is shown below (view from within a single node).


Now we can zoom out and see how each logical storage node (containerized itself on K8s) presents its local storage to the network as part of a shared object pool:

Because the hardware and OS layers have been abstracted away, this lends itself to deployment across a variety of heterogeneous environments. This can be especially helpful for customers in hybrid scenarios that are looking to maintain a consistent Kubernetes storage layer across private and public cloud environments.

One such use case would be an Enterprise customer with an on-premises Tanzu deployment and another off-premises in VMC on AWS. Or, alternatively, an on-premises OpenShift deployment with another residing in AWS EC2. The same object storage layer, MinIO, could be maintained in both cases, lending itself to operational efficiencies vs. utilizing siloed storage solutions on both sides of the hybrid architecture.

As an architect, I appreciate being able to achieve simplicity and consistency, when possible. Even though there are ways to manage different Kubernetes distributions in a uniform manner, it can be useful just to use the same technology everywhere. I think the same logic can be applied to Kubernetes storage.

However, as I mentioned in the live session, if my customer requires more than object storage, I am left with additional research to do in order to optimally meet those requirements. MinIO does not attempt to address block and file needs. They can provide whatever storage type you need – as long as it’s object.

Anyway, one of the aspects of Cloud Field Day 11 I was most looking forward to, coming in, was the anticipated discussion around persistent container storage. Aside from being generally interesting from a technical perspective, this is an area that I now touch on a daily basis, so I feel I have a bit more to contribute this go-round compared to Field Days past.

Looking forward to hearing more on this front, especially with respect to the recovery of K8S data in disaster scenarios, in the days ahead.

Be sure to check out the livestream of Cloud Field Day 11, resuming Thursday 6/24 at 8AM PST at https://techfieldday.com/event/cfd11/ and follow us using #TFD11 on Twitter.