What Is Cloud Infrastructure Resilience?

W. Curtis Preston, Chief Technology Evangelist

A term you may have heard lately is Cloud Infrastructure Resilience. Druva has been focused on resilience for a while now, so this term caught the eye of Stephen Manley, our CTO.

What exactly is it? How is it different than cloud availability or cloud redundancy? Is it different? Is this just a new buzzword, or does it have value? Find out in this episode of No Hardware Required.

This week on No Hardware Required, we're talking about a new term: cloud infrastructure resilience. Hope you enjoy the episode.

With me, as always, is our CTO, Stephen Manley. Thanks for joining.

[00:00:13] W. Curtis Preston: Hi, and welcome to Druva's No Hardware Required podcast. I'm your host, W. Curtis Preston, aka Mr. Backup, and I have with me a guy who gets excited when we talk about what things mean: Stephen Manley, our CTO.

How's it going?

[00:00:29] Stephen Manley: I tell you, there are nights where I just go to sleep with a dictionary in bed next to me. Just snuggle up to it. I mean, it's lovely.

[00:00:42] W. Curtis Preston: You know, words mean things. And when people start using them to mean other things, it gets very frustrating for those of us whose job it is to help define things. I'm just saying. And there's this new term that I've been seeing floating around. This happens a lot, especially with analyst firms.

They try to create categories of products, right? And this is one of those categories: cloud infrastructure resilience. Have you been seeing this term floating around a lot?

[00:01:18] Stephen Manley: I have been seeing the term floating around, and I've started to see some customers maybe not use the term exactly, but you can tell they're dancing around it. They're reading some reports, and yeah, it's a mouthful. So people are trying to wrap their heads around what this means.

And should I be worried about what I'm doing in the cloud right now?

Spoiler alert, yes, you should be.

[00:01:44] W. Curtis Preston: Spoiler alert: yes. This reminds me of back in the day, when I was trying to get people to back up stuff and they were saying things like, "Well, I have RAID," right? The discussion we're gonna be having here reminds me a lot of that.

It's like, "Well, I have RAID," or "I have HA," right? That was the term back in the day: "I have high availability." That's one of the terms we should talk about here, both availability and redundancy. "Well, I have a highly redundant system." In fact, one of the things I know you and I talk a lot about is why you should back up Microsoft 365.

And people who are fans of 365 are like, "Well, it's just a really highly redundant system." Yes. Right.

[00:02:39] Stephen Manley: is awesome.

[00:02:40] W. Curtis Preston: Which is awesome. So what do we mean when we talk about availability and redundancy? Should we treat these as separate terms? What do you think?

[00:02:49] Stephen Manley: I think availability and redundancy largely go together, right? And the entire purpose of both of them, regardless of how you say it, is this: there are components of your infrastructure that can go down. Maybe it's hardware. These days, especially in the cloud, maybe it's a software component, right?

And so you make sure that any one component failing, or any one instance of a component failing, doesn't bring your service down. So it says, "Oh, that node went down. That's fine, I've got other nodes that can pick up for it." Or, "That database went down. That's fine,

I've got database high availability set up so that one of the other nodes in the database takes over." So you're always up and running. And I think most of your really top-end cloud services, like Microsoft 365 or Google Workspace, are built with that kind of redundancy in mind, because we all evaluate ourselves first and foremost on whether the service is up and running.

[00:03:54] W. Curtis Preston: Right. It's RAID-adjacent, right? In that it's designed to help keep the system up all the time, to keep it running. But there are things that it doesn't take care of, or doesn't take into account. And generally speaking, those are things that act at a different level.

So availability, redundancy, RAID: all of these are meant to deal with the loss of a portion of the infrastructure, one of its building blocks, right? Like you said, a VM went down, an availability zone went down, a node in a database went down. What none of those deals with is:

I dropped a table in the database,

[00:04:58] Stephen Manley: Right,

[00:04:59] W. Curtis Preston: I went in and just dropped a table, or I deleted a database, or, again, a ransomware attack, right? The boogeyman of the moment. If you get a ransomware attack, availability, redundancy, mirroring, RAID, all of these things are great, but honestly, they will just make the ransomware more effective.

Right. It just helps spread it out. It doesn't

[00:05:29] Stephen Manley: You're gonna mirror that.

[00:05:31] W. Curtis Preston: You're gonna mirror it. Exactly.

[00:05:33] Stephen Manley: You're just gonna spread it. So what I like to tell people is that the availability discussion is important, right? If you're really building a service you care about and you're not building availability in, whether at the application level, the infrastructure level, or a little bit of both, depending on where you want to do it,

then you're making a mistake. But availability, as you're pointing out, is necessary but, as we always say, not sufficient, because above the layer of hardware and software failure, there is user error, there is programmatic error, and there are, of course, bad people doing bad things.

[00:06:11] W. Curtis Preston: Yeah. There are just many things that can take out a portion of the infrastructure, and you need to plan for those, right? And all we are saying as a resilience provider is that you also need to be able to handle these other things. That's when we start talking about this word resilience, which I'm seeing a lot more than I used to.

When I think about it outside of technology, I think this is an accurate use of the word resilient, right? Because when I think of resilient outside of technology, I think of a person. This person just keeps getting beat up, whether it's physically, verbally, or emotionally, and they bounce back, right?

No matter what happens to them, they bounce back. You're like, "Man, that person is resilient." And I think that's the idea of resilience: no matter what happens to you, you bounce back. There may be a period where you're not available, but you will bounce back, right?

Availability and redundancy are about keeping things available all the time, but they only deal with certain kinds of failure. With resiliency, the promise is that no matter what happens, we're going to let you bounce back. What do you think about my analogy?

[00:07:54] Stephen Manley: I love your analogy, and I know you're big on movies. The thing that comes to mind is our old friend Rocky, right? And the line Rocky always had is: it's not about how hard you hit, it's about how hard you can get hit and keep moving forward, how much you can take and keep moving forward.

And to me, that's resilient infrastructure. You're taking body blows and getting hit in the head, and all the bad stuff is happening, but you're not crumbling, right? You're coming back and you're making sure that your service is there for your customer.

[00:08:29] W. Curtis Preston: Yeah. And that's a person resilient enough to have a franchise of, like, nine movies at this point, one of which, Creed III, I think, is in theaters right now. That's resilient. I can't even keep track of how many Rocky movies that is at this point.

So here's another thing. When we talk about backup and DR, how does that relate to resilience?

[00:09:00] Stephen Manley: Yeah, so to me, when I think of backup and DR, they're very similar, right? We get into retention periods and how fast you recover, but the concept goes back to the kinds of issues you talked about: I get compromised by ransomware, or a user makes an error and wipes out a database or part of a file share, something like that.

Again, availability doesn't take care of it. Availability says, "Cool, I'm gonna replicate that destruction everywhere, all at once." What backup and DR say, backup especially, is: I can get you back to yesterday, or to a week ago, or to some point in time before this mistake or this attack happened, so that you're not looking at a smoking crater of destruction; you have alternatives. And I think too often, especially with cloud infrastructure, people get so focused on, "Yeah, yeah, but I've got three levels of availability,"

that they forget the first part of resiliency, which is: if you don't have a backup copy, all those other bad things, which haven't gone away in the cloud, are gonna get you if you're not careful.
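To make the distinction concrete, here is a toy sketch in Python. It is not Druva code and not how any real replication product works internally; the `MirroredStore` class and its method names are invented for illustration. Every write, including a delete or a ransomware-style overwrite, is applied to both the primary and the mirror, while `snapshot` keeps an independent point-in-time copy that later writes cannot touch.

```python
import copy


class MirroredStore:
    """Toy model of a store with a synchronous mirror plus point-in-time snapshots."""

    def __init__(self):
        self.primary = {}
        self.mirror = {}
        self.snapshots = []  # (label, independent copy of primary), in the order taken

    def put(self, key, value):
        self.primary[key] = value
        self.mirror[key] = value  # redundancy: the replica applies every write immediately

    def delete(self, key):
        self.primary.pop(key, None)
        self.mirror.pop(key, None)  # ...including destructive ones

    def snapshot(self, label):
        # Backup: a deep copy that later writes to primary or mirror cannot touch.
        self.snapshots.append((label, copy.deepcopy(self.primary)))

    def restore(self, label):
        for snap_label, data in self.snapshots:
            if snap_label == label:
                self.primary = copy.deepcopy(data)
                self.mirror = copy.deepcopy(data)
                return
        raise KeyError(f"no snapshot named {label!r}")


store = MirroredStore()
store.put("orders", ["o1", "o2"])
store.put("invoices", "paid")
store.snapshot("yesterday")

store.delete("orders")              # user error: "I dropped a table"
store.put("invoices", "ENCRYPTED")  # ransomware overwrites data in place

print(store.mirror)   # {'invoices': 'ENCRYPTED'}: the mirror faithfully copied the damage
store.restore("yesterday")
print(store.primary)  # {'orders': ['o1', 'o2'], 'invoices': 'paid'}
```

The mirror ends up exactly as damaged as the primary, which is the point made above: redundancy keeps the service up but spreads the destruction, and only the independent point-in-time copy gets you back to "yesterday."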

[00:10:14] W. Curtis Preston: Yeah, I'd say without backup, you don't have DR, and without DR, you don't have resiliency. I like this idea of building on each layer. I mean, there was a time, and it wasn't that long ago, when backup and DR were separate, right? If you truly cared about DR, you had something other than a backup system as part of your DR plan.

You were doing it based on replication, et cetera. And we started talking a lot about CDP. There was a whole huge

[00:10:51] Stephen Manley: There was that whole wave of CDP.

[00:10:53] W. Curtis Preston: wave of CDP,

[00:10:55] Stephen Manley: Kashya and Topio and...

[00:10:57] W. Curtis Preston: Yeah, we are definitely deep in the trough of disillusionment on that. But I think we're in a new world where modern backup programs like Druva are an essential part of having a solid DR plan, especially if it's a DR plan that involves restoring into the cloud.

Because, again, we can do a DR of an entire environment, regardless of its size, in about 15 to 20 minutes. If you can do that, that's pretty dang resilient, right? If you get hit really hard with a ransomware attack, you still have to do all the other things we're talking about. You've gotta do the incident response.

You need a ransomware response plan. But when it comes time to actually restore the data, if you can restore your entire environment in 15 to 20 minutes, you're gonna be pretty darn resilient against just about anything that comes your way, I would think.

[00:12:01] Stephen Manley: Yeah. And I think, to your point, there was a point, oh boy, maybe a decade ago, where at previous places I worked, in this case NetApp and EMC, we had lots of internal debates. There was continuous availability, high-end disaster recovery, and backup, and the real question was

which, if any, of these could converge. And of course upper management says, "Well, they should all converge." And I'm like, okay, sure, that's great, but let's actually be realistic. And especially now, working at Druva, where it's a service, it's not just, "Hey, we're selling tech,

we're giving you a box, good luck." We offer the service, and you really do see how different these things are. The teams that work on designing the resiliency of our system, the availability of our system, making sure it's always running, look at the world very differently than the people working on backup and disaster recovery, because the problems are different.

The timelines are different. The threats are different. But what we do see is backup and DR coming closer and closer together. Now, there's still always that last part, which is long-term retention, the 30-year archive kind of thing. That's always a different beast.

But to your point, the fact that I can recover a backup in 15 minutes pretty much makes it DR in most people's eyes.

[00:13:38] W. Curtis Preston: Yeah, exactly. We would've paid a billion dollars for that back in the day.

[00:13:46] Stephen Manley: Right. Just the notion we used to have of, "I've gotta have a separate data center with separate boxes and everything." And you look back and you go, the amount of money we were spending on stuff that literally did not

[00:14:01] W. Curtis Preston: Right. Yeah.

[00:14:02] Stephen Manley: was astounding.

[00:14:04] W. Curtis Preston: Yeah, I, I spent a lot of money back in the day.

So I want to get back to the term we started with. I think we all know what cloud is, right? And we've covered what resilience is. When we talk about cloud infrastructure,

is there anything special there that we need to know about?

[00:14:23] Stephen Manley: I think there is. What I find when I talk to customers who are new to the cloud, and it's okay for them because they're just starting, is that they look at it and say, "I've got some templates or some scripts that help me set up my cloud environment. So if something went wrong, I'd just be able to spin it up off my scripts again."

And you go, yeah, if you're running a handful of apps, you're fine. But as you mature in the cloud, you realize that, just like in your data center, a lot of custom stuff starts to come in, and you can't just run a couple of scripts and have everything up and running again.

So one of the things you've gotta think about with cloud infrastructure resilience is that it's not just about the data, and it's not just about the applications. How are you actually making your cloud environment itself resilient? How are you protecting your security settings, your deployments, and how you've configured

the interactions between IAM, identity, and all those sorts of pieces? So as you evolve, it becomes very important that when we think of cloud, it's everything that was in backup and everything that was in disaster recovery, but now you also need to think about how you're gonna recover your cloud environment as well.
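The point about protecting configuration, not just data, can be sketched in a few lines of Python. This is a hypothetical illustration, not a real cloud provider API or a Druva feature: the environment's configuration is modeled as a plain nested dictionary (the role, policy, and security-group names are invented), a saved deep copy stands in for a configuration backup, and `diff_config` reports what has drifted from it, which is what a recovery would need to put back.

```python
import copy
from typing import Any, List, Tuple

Change = Tuple[str, str, Any, Any]  # (kind, path, saved value, current value)


def diff_config(saved: Any, current: Any, path: str = "") -> List[Change]:
    """Recursively compare a saved configuration snapshot with the live configuration."""
    changes: List[Change] = []
    if isinstance(saved, dict) and isinstance(current, dict):
        for key in sorted(saved.keys() | current.keys()):
            child = f"{path}/{key}"
            if key not in current:
                changes.append(("removed", child, saved[key], None))
            elif key not in saved:
                changes.append(("added", child, None, current[key]))
            else:
                changes.extend(diff_config(saved[key], current[key], child))
    elif saved != current:
        # Leaf values (or a type change) that differ are reported as a single change.
        changes.append(("changed", path, saved, current))
    return changes


# Hypothetical environment configuration; the names are illustrative only.
baseline = {
    "iam_roles": {"backup-operator": {"policies": ["ReadOnly"]}},
    "security_groups": {"web": {"ingress": ["443/tcp"]}},
}
saved = copy.deepcopy(baseline)  # the "backup" of the environment's configuration

live = copy.deepcopy(baseline)
live["iam_roles"]["backup-operator"]["policies"].append("Admin")  # privilege escalation
del live["security_groups"]["web"]                                 # a rule set was deleted

for change in diff_config(saved, live):
    print(change)
```

Run as written, this reports two changes: the `policies` list under `backup-operator` that gained `Admin`, and the removed `web` security group. The design choice to keep an independent saved copy, rather than relying on the scripts that originally built the environment, mirrors the point above: once an environment has accumulated custom changes, the scripts no longer describe what you actually need to restore.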

[00:15:38] W. Curtis Preston: And just to build on that, part of resilience is also being able to take the hit in the first place. Going back to the analogy we were talking about: if you've made your infrastructure resilient, especially in the cloud, a lot of things can try to happen to you, right?

You can take a certain number of attacks of various kinds, and other incidents. I think of the OVH fire that happened at the cloud provider in France: physical bad things that aren't cyberattacks may happen to you. And if you've designed your system to be resilient, availability is part of that.

Right. I think resiliency does include that; we can't dismiss it. But then you also need the ability, if something like what happened at OVH were to happen and you need to pull out the backups, to meet that demand as well.

So when we talk about cloud infrastructure resilience, it's all the things we just talked about.

[00:16:58] Stephen Manley: It really is. And I know, right? Everyone who goes into the cloud says, "Well, the good part here is that all these problems are solved for me." And you look and you go, all right, some problems are solved for you, but there's the shared responsibility model, and we can't say that enough. You are responsible for a lot of what you run in the cloud, so you need to be sure that your application is highly available.

You need to be sure that your data is backed up and that you have a disaster recovery plan. You need to be sure that your disaster recovery plan includes not just your data and your EC2 instances and your RDS, but also your configuration information and how everything ties together.

And of course, the thing, Curtis, that you always say and I always say is: then you should test it. If you haven't tested

[00:17:50] W. Curtis Preston: it.

Test it. Yeah.

[00:17:51] Stephen Manley: Yeah.

[00:17:52] W. Curtis Preston: Yeah, absolutely. Well, we will end on that note: test, test, test. You can never test enough. So thanks for helping me wade through some word salad.

[00:18:07] Stephen Manley: Hey, it's delicious, and it's low in calories.

[00:18:12] W. Curtis Preston: And thanks to our listeners. You are why we do this. We hope you've enjoyed this episode. Be sure to subscribe so that you don't miss other ones. And remember, here at Druva, there's no hardware required.