Episode 124: Beyond the runbook; DR Organization

Episode 124: Beyond the runbook; DR Organization

Episode 124: Beyond the runbook; DR Organization 560 420 Carlos L Chacon

Perhaps you have heard about a runbook–the documented instructions you should follow in the event of a disaster or some situation where an outage occurs.  Instructions are great for IT folks as they give a reference to follow–and they don’t freak out even if we might. They don’t, however, include instructions for everyone–what about the folks that aren’t tapping away on the keyboard?  In this episode, we chat with Greg Moore about his experience in emergency situations outside of IT and how we might apply these principles in our environment.

Whether you are directly in the line of fire or support those that do, we think you will find this conversation interesting.

Episode Quotes

“No battle plan survives first contact with the enemy.”

“When something bad really happens, the first thing you should do is sit down and make yourself a cup of tea. [It will] force you to slow down and stop and think about what’s going on.”

“One of the challenges that we as IT people have is that we tend to think that our way is the best way.”

“I’m a huge fan of people learning the how and the why behind things, not just what to do in X, but why are we doing X, why are we doing Y?”

Listen to Learn

01:34     Compañero Shout-Outs
03:27     SQL Server in the News
04:38     Intro to the guest and topic
06:49     The beginning of the caving example as an object lesson in IT disasters
10:37     Make sure we don’t make things worse
13:27     Recognize your limitations – know what help to offer and when to step back
16:03     What happens when you don’t trust your team – don’t be a micro-manager
18:48     Training helps in most situations but not always
25:03     Always ask “why”
26:40     Don’t be afraid to practice possible disaster scenarios – Jeff Bezos & Delta examples
32:39     How can we manage our managers
33:54     Ask questions to assess the situation before you charge in
36:35     SQL Family Questions
39:47     Closing Thoughts

FEMA’s ICS course link: https://training.fema.gov/is/courseoverview.aspx?code=IS-100.b

About Greg Moore

Greg Moore is a graduate of RPI. There, he majored in CompSci, but probably spent as much time hiking, canoeing, caving and rock-climbing as he did studying. He started working with SQL Server 4.21a in 1995 and has survived numerous upgrades. He’s been a Director and later VP of IT at several startups including PowerOne Media, TownNews and Traffiq and now consults. These days, when he’s not busy with playing with SQL Server or spending time with his family, he can often be found underground caving or teaching cave rescue with the NCRC. While his focus is on the operations side of DBA, his interests include DR, performance and general IT problem solving. He is the author of: IT Disaster Response: Lessons Learned in the Field. You can find it here: https://smile.amazon.com/dp/1484221834/

**Sound effects have been removed from the recording**
*Untranscribed introduction*

Carlos:             Compañeros, welcome back to the SQL Trail! This is Episode 124. Today we’re talking with Greg Moore. He’s the owner of Green Mountain Software, and the author of the book IT Disaster Response: Lessons Learned from the Field. And ultimately, that is the topic of our conversation today, is this idea of getting beyond the runbook. So in a disaster, how we can better coordinate with one another, not so much in simply just the execution of the response. So, we’re excited to have him. Greg is up from the northeast portion of the United States and works with cave rescue teams and so we’re interested in talking with him about some of his experiences.

We do have a few Compañero Shout-Outs we want to get to. The first, John Workman reached out on LinkedIn. He just took a job with Microsoft, helping folks on-board with the various services that are there and wanted to reach out. I think we’re going to try to get John on the program here to talk about some of the things that he’s doing. Also, I’m not sure if we have any Podable listeners here, but if you’re getting your podcast through Podable, welcome. Welcome to the SQL Trail, we’re glad to have you. Want to give a quick shout-out to Dave Mason from the panhandle of Florida, reaching out, saying hello. Dave and I were talking about a few ideas for the podcast as well and I think I’m going to snag him here later on for an episode as well. Now Cristian Satnic, he suggested a topic, although I couldn’t convince him to come on, but he had suggested a topic, and wanted to talk about continuous delivery to the database. And so Christian, I’m happy to say that that episode has been recorded and will be out next month, so we’re looking forward to that. I also want to give a shout-out to Steve Stedman. Unfortunately, through work and busy lives that we lead, Steve is no longer going to be able to be on the podcast with us. That’s a big blow, I know, to all you listeners. I’ve enjoyed having Steve on the program for, gosh, these 50 plus episodes and appreciate all of his contributions to the program. That’s not to say that he won’t be back, but unfortunately, he won’t be a regular guest. Now he is going to join us for today’s conversation and there’s one other episode that he’ll be with us. Then moving forward, you will probably see a variety of other folks that are panels, if you will, that we’ll be doing on the program to try to get a few more people there in as part of the conversation. But I wanted to thank Steve, again, for all of his time and effort, because it does take quite a bit of time to do that many episodes, and so I’m appreciative to him.

So now I think it’s time for a little SQL Server in the News! I want to give a shout-out to some of my friends at Channel Advisor. I think there are a couple of others, but those folks, Tracy, Mark and Brian, have put together a website called We Speak Linux at wespeaklinux.com, surprisingly. This website is directed to Windows administrators and developers who want to learn about Linux. The reality is, I think, most of them are coming from the DBA background and they want to help others who are going to be in dual environments, which Channel Advisor is, and running SQL Server in multiple operating systems and some of the challenges that they’re going to face. I’m not sure if they’ve started yet, but I know they are going to be starting in March. The first week in every month, they’re going to have a webcast and invite folks to come on, ask questions, so there’ll be more content. They are just kind of getting the site up, but there’ll continue to be blog posts and whatnot around that. So if you are interested in finding out a little bit more about running SQL Server in Linux, then that site might be for you.

Our URL for today’s episode is going to be sqldatapartners.com/caving or sqldatapartners.com/124. Let’s go ahead and jump into the conversation with Greg.

 Carlos:             So, Greg, welcome to the program!

Greg:               Thank you guys. Happy to be here.

Steve:              Yeah hey Greg, it’s great to have you.

Carlos:             Yeah, it’s great to have you on the program today. We don’t get too many authors on the program and so it’s nice to see that you’ve put out a book and we’re going to, obviously, be focusing on that a bit today. Our topic is how IT folks can learn, maybe, from some of the other systems or other organizations and how they respond to disasters. And I think this idea of beyond the runbook. You know, as IT people, we know, hey, we have to have a runbook and that’s going to tell me what I need to do, or for those of us who have actually had a disaster, we know that it doesn’t always work quite as smoothly. So you actually give a couple of examples or scenarios and we thought we might chat about those and how do we apply them to IT.

Greg:               Sounds good to me, where do you want to start?

Carlos:             So, we have a caving example and then we have a plane crash example. Which one do you want to start with?

Greg:               Let’s start with caving and as folks can probably tell, I have this fascination with when things go wrong, so I kind of pick the highlights and the lowlights there, I suppose.

Carlos:             Yes, that’s right. What I find really interesting about this, at least from your book, I’m not sure if this is still the case, but at one time you were actually the database administrator for one of the caving teams, like this group of people who would actually do rescues.

Greg:               I still am the database administrator. It’s the National Cave Rescue Commission. We’re not a rescue team, we actually train folks who form the rescue teams. I’m also an instructor, so I like to tell people where to go underground and what to do when they get there.

Carlos:             Yes, so I thought that was an interesting mix of professional and personal interests colliding there.

Greg:               Yep.

Carlos:             So, in the book you tell this story, in the middle of the night you get called in, there’s a caver in a cave who’s been free climbing and they’ve been injured, and your goal is to go in and help them get out.

Greg:               Yes, one of the mantras that we try to teach is get people out as quickly as safely as we can and hopefully in better shape than we found them. He unfortunately was not in great shape after falling about 40 feet. Yeah, it was not a pretty scene, but by the time I got there, actually a lot of the rescue had taken place and there were quite a few other folks there so I was just one cog in the overall machine.

Carlos:             Now one of the challenges in this specific instance was that, so if you’ve never done caving before, if you’ve been on a tour, it’s not the same thing as some of this cave exploring. The holes are very small, so you go on a tour, everything’s very big. I won’t say caving is unsafe, you have to crawl around a bit. I’ve been caving here in Virginia and there’s actually instances where you have to, you’re on your belly almost, trying to get through some of these places. And so that kind of lends itself to this story in the sense that he was not in a big open space and you could just get the stretcher down there and roll him on out.

Greg:               You’re very right. I always love in movies where they walk through caves, there’s plenty of light and the floor’s nice and smooth and everything like that. Especially in New York, our caves don’t tend to have any of those features. Crawling is a big part of it. This particular cave in Vermont is much like that. The entrance is kind of a weird trapezoid shape that you kind of crawl in, crawl over some rocks, and then you get to the top of basically about a 50 foot drop and straight down from there, there is actually a large room. So, where the unfortunate patient had landed was actually quite a large room and there was room to work there. But to tie this back to IT, the cavers in the area had recognized that something could go wrong here, and they had done some measurements and they knew at the top of that drop, to get a patient through in a litter was not going to work. It was basically chest-size, normally, you put them in a litter, it’s not going to work. So they had actually had a pre-plan, kind of think of it as our runbook, that basically said, “hey, if we ever have to evacuate a patient from beyond this point, we need to get in here and remove some rock.” So they actually had spent about 2 to 4 hours removing that rock there and setting up some other stuff, which is why by the time I got there, it was about a 2 hour drive, I kind of got there into the middle of the rescue because the local rescuers had already done a lot of the pre-plan. But you know, as they say in the military, what’s the quote, something along the lines of no battle plan survives first contact with the enemy. That was the situation here. We had a great plan that we got him to the top of that drop, but them getting him out through that trapezoid shape, we realized very quickly that, hm, the plan isn’t going to work. It’s 40 feet long, so we can’t remove 40 feet of rock, we’ve got to come up with Plan B. Or, I like to say in the back of my head I’m thinking Plan B, C, D and E and trying to figure out what’s going to work.

Steve:              So, with that, then, with a 40 foot fall, someone could have pretty significant injuries.

Greg:               Correct.

Steve:              And to just try and drag them out without being on a stretcher or a basket of some kind, could do much more damage than the fall itself. So therein lies your challenge, is how are you trying to get them, you’re trying to get them through this narrow passageway and it just won’t fit because of the size of the stretcher and the patient themselves. So, what did you do at that point, then?

Greg:               Steve, you bring up a great point, and since this is more of a database podcast than a cave rescue podcast, let me tie some of this back. One of my fellow rescuers who got to the patient recognized the injuries, had some basic first aid training, a little beyond that and was able to stabilize the patient including putting a little traction on one of his legs, which immediately calmed the patient down. As you can imagine, having broken legs was a likely result of this and not a comfortable position. But as I mentioned, we had to clear rock out. And one of the first rules we have is, make sure we don’t make things worse. And it’s kind of the same thing in, say a database failed or something like that. Sometimes the temptation is “oh, let’s reboot the server, oh, let’s do this.” I’ve had a boss come in and say “oh, let’s do this” and make things worse because they didn’t stop to think, “hey, what’s the situation, how can we stabilize things, make them better?” In this case it was “let’s move the patient out of the way of the falling rock, let’s get the patient into a litter.” For those who are EMS geeks, we tend to use the big orange Ferno litters. Let’s make the patient comfortable, they’re going to be there for a while, let’s give them some food and water and wrap them up, so we spend a lot of effort stabilizing the patient. So, we have one team kind of doing that while we have the other team putting in the ropes, opening up the rock, things like that. And it’s one of those things that I’ve seen in what I call kind of the large IT disasters a failure sometimes where there isn’t a whole lot of thinking beyond the immediate problem. There’s the “oh, this is crashed, we need to focus on this” but someone’s got to be thinking about, “well, who’s going to handle the flux of emails we’re going to get from our thousands of irate customers that might overwhelm our email server. So maybe I should have a team focusing on that while we have a team focusing on the dead database”, or in this case, a team focusing on the patient and a team focusing on the rockfall.

Steve:              As you say that, I’m thinking back to an example I worked on last week with a corrupt database and with the whole concept of don’t make things worse. And I think that when people run into database corruption, one of the first things they do is they try and reboot the server to see if that’ll fix it and that never fixes it, and in fact, sometimes it makes it worse. But the one that I was working on, it was actually if they had rebooted it in the condition the database was in, it would not have come back up. So that was one of them we got lucky that they followed that same logic, like you were talking about there, don’t make it worse, and they didn’t do the reboot. They just left it running until somebody could look at it. And I think that so many of these things you’re talking about here, really translate over from rescue world to dealing with database issues or IT crises.

Greg:               Exactly, and then the other problem that we run into and I started to mention this, too, is so we get the patient up and we try to get them out through the entrance and realize, huh, our initial plan is not going to work. And this is where, because I knew several members of the rescue team, I had worked with them previously, or I had trained with them, I had actually given instruction to a couple of them. And I also had with me an expert rigger inside the cave and one outside the cave. And I basically came up with a plan, I yelled at them, said “hey, come close so I can tell you, this is what I need.” And then I stepped back and let them make it happen. And it’s something I’ve seen again in IT, my best managers during a disaster have been the ones who know what help to offer and what help to step back. I joke, but it’s a true story, back, I’d say it was probably about 2000 when the internet was still young and we were all not as jaded and grey haired as we are now. I don’t remember what was on, our website was suffering performance issues or something and the CEO of the company who was also a good friend of mine, he came in, looked at us, realized that we knew what we were doing and he said, “I probably can’t help you, but can I order you some pizza?” And it sounds like a small thing, but it really was a big thing because we were going to be there for another hour or two at least, it was close to dinnertime and we were all focused on the problem but in the back of our mind, we’re all thinking, “hey, I’m kind of hungry, here.” So as a manager, he recognized his limitation, he wasn’t going to sit down and start typing out SQL commands or anything like that, but he could remove the small issues and he could keep all the sales people from bugging us every 5 minutes, “when’s the site up, when’s the site up?” So, he worked very well in that role as, let the people who can solve the problem, solve the problem. Back to the cave rescue here, that was my attitude was these two, they know what I want, I trust their rigging, if they can’t make it happen they’ll let me know and when they do make it happen, we’ll go ahead and try it. And it worked very well, and that’s why I say I don’t want to take credit for it. It might have been my idea, let’s try this rope this way, but these guys did all the hard work. And in a disaster, knowing that you can trust your team, you can trust the people below you or alongside you is a big part of that. And I’m going to guess Steve, as an EMT, you probably have that where you work with certain partners where you just know that you get to the patient and you’re zoned in and you just know what to do without having to shoot a hundred questions back and forth.

Steve:              Oh yeah, absolutely. And I think that, the thing I see oftentimes, from flipping that around on the IT side, is you get that IT manager who doesn’t have the knowledge or confidence in the team and their solution of dealing with issues is to come in and sit next to or stand next to the people who are doing the most critical work and ask them questions and basically interrogate every single step that they’re doing, instead of ordering pizza like in your example. Where that actually can make things much worse or slow things down, either in the rescue situation or in the IT side of things.

Greg:               Exactly, and I think we’ve all had that. I know I’ve had that. I’ve actually had, it was a brand-new VP to the company basically come in my office and “well, what were you doing?” and giving me a hard time. I basically said, “while you were asking a thousand questions, I was solving the problem.” It was almost a career limiting event at that point, but it’s kind of what he needed to set him in his place and I would not recommend doing that to most managers. But again, the most successful managers I’ve had have been the ones who will back off and say, they’ll ask the pertinent questions, “hey, do you have an ETA so I can get out an email. Okay, that’s 30 minutes, great. I’ll check back with you in 31 minutes.” But yeah, the ones who hover over you and stand over your shoulder while you’re trying to get stuff done, I don’t think they realize that half the time they’re going to make things take three times as long.

Steve:              Right, and they’ll even add the risk of more mistakes being made, because I know I seem to make more mistakes when somebody’s looking directly over my shoulder. I’m more concerned about what’s their next question going to be, than what it is you’re doing on the system, perhaps.

Greg:               Oh sure, and of course, there’s a rule that the more people standing over your shoulder, the harder it is to type, regardless. And if the high-ranking, your fingers aren’t even going to find the keyboard.

Carlos:             Yeah, and I think some of this ultimately kind of comes back to culture as well. And if we’re not working with our teams, and we tend to be very siloed, when we do get into this situation where we have to, it’s all-hands-on deck kind of situation, if that’s not a culture that we’ve developed, then I think it does make it very challenging. And I think maybe in the manager’s defense, perhaps, is that if the teams are not reaching out to them, so I’m not singling anyone out here, necessarily, but if we’re not reaching out and trying to have those conversations and work together to solve those problems, then when it becomes super important, the pressure of that doesn’t make for a great combination.

Greg:               Agreed.

Steve:              So, one thing I’m really curious about there in your caving example is that, and I see this on the EMS side quite a bit as well as in IT is that it really comes back a lot to training. And there’s things that are sort of the common things that could happen, and those are the things that you train for quite a bit. And then there’s also high-risk things that can happen, but maybe they’re less frequent. And those things you train for because of the risk involved in them. And I think that when you come across a scenario where it’s something you’ve trained for and you’ve seen it lots of times or you’ve trained for it lots of times, it makes it much easier to deal with. But then when you get in a situation where it is way different than anything you’ve trained for or there’s some twist that really changes it up, that’s where I think mistakes end up being made. And I guess I’m curious if you’ve seen that same thing in your background.

Greg:               Yeah, I definitely would agree and this is one area where I think what I’m going to call street EMS vs backcountry EMS or rescue differs a bit. Street EMS, like you say, a lot of it is pretty standard calls, broken limb, broken that and everything like that. And then like you say, you’ve got the very out of the ordinary ones. With cave rescue, there’s a, I don’t want to say a magazine, something that comes out about every two years called the American Cave Accident Report. It’s put out by the NSS that basically covers all the reported cave accidents, and one of the goals is to say, here’s the commonality, here’s the things. But one of the things that strikes me is generally, while there’s often a lot of similar injuries, and a lot of them are, honestly self-rescue, you know, a person twists an ankle and their friends help them out or something like that. When actual calls come out, every cave is unique. It’s really kind of hard to develop standardized plans, so what I really try to teach, and I’d say the NCRC, the Cave Rescue Commissioner I work with tries to teach is, here’s a set of skills, here’s a set of tools to put in your tool box and keep them in mind because you may never know which one you’re going to need. And that’s a big part of it. This one rescue up in Vermont, it was a lot of putting little pieces together of, hey, this will work here, this will work there, but there was obviously nothing we could train for when you have this particular cave, do that. Unless you’re training on a particular cave, which some teams might do because they know that cave is popular. But in terms of making mistakes, you’re right and again it goes back to something I was taught decades ago, literally, was kind of a joke but not entirely was, in the back country when something bad really happens, first thing you should do is sit down and make yourself a cup of tea. And it sounds like “oh my gosh they’re going to”, well, unless they’re going to bleed out in the next five minutes, in which case, take care of that. If it’s 6, 12 hours to a rescue, making that cup of tea is going to force you to slow down and stop and think about what’s going on. And one thing that we all get caught up in is the term Go Fever. We’ve got to go, go, go. Well, sometimes we need to slow, slow, slow. And it’s a hard thing to really do and I think especially with a more critical and often what appeared to be the more larger disasters is the ones where we most often have to slow down and stop. To use an analogy, Steve, if you’ve ever done mass casualty incidents, they always say “the person who’s screaming in pain is probably not the problem.” You know, hey, they’ve got an airway, everything like that. It’s that person sitting quietly that you’ve got to think about. What’s going on with them, why are they not making any noise? And again, we have to slow down and do that slow, slow, and stop and think about it and take that extra 5, 10 seconds to evaluate before we jump in.

Steve:              Yeah, you know what’s interesting about that is that there’s a term that we use oftentimes in EMS, which is called the distracting injury or a distracting patient. Where it’s something that jumps out as sort of obvious and big but it hides the real issue. For instance, it might be a broken leg when someone’s really having a cardiac issue. And people will tend to focus on the obvious thing and sometimes overlook the more serious things and I think that happens quite a bit in the database world as well, because you’ll see one thing and you’ll immediately start dealing with that one distracting issue and not realize that you’ve got a file system failure or something more significant on your database that is really causing all of it and not just causing the distraction.

Greg:               You bring up a great example. Just dealing with a client of mine where “oh my gosh, our backups filled up our disc space, what went wrong. It’s like, okay, we can clean up a few of the backups, no I’m not going to reduce your retention rate because you really need to have this many days’ worth just in case, but maybe we should look into why for three days in a row your backup that normally takes one hour took five hours. Your worried about the disc space, well that’s a symptom of the fact that something went wrong with your file system that your backups now are taking five times longer.

Carlos:             Yeah, I also think that going back to the differences in caves and whatnot, so standards, if we’re going to apply that to an IT environment, so standards might play a role here, and everybody does things a little bit differently. And I think one of the challenges that we as IT people have is that we tend to think that our way is the best way.

Greg:               Well, of course my way’s the best, but that’s beside the point in this discussion.

Carlos:             Yeah, and coming back and say okay, well, do we have any processes, maybe going back to that runbook, how is it that we’re supposed to be tackling some of these things.

Greg:               I’m a huge fan of people learning the how and the why behind things, not just what to do in X, but why are we doing X, why are we doing Y? bringing back again with Steve’s EMS, a lot of it is why is the patient’s heartrate this way. You know, we can treat it, but if we understand why, we can do a better job of it. Same thing with databases, why is this happening, why do I back up the tail of the log? Well, because it’s useful, not just “oh, let me hit a button because the commercial software says to do that.

Steve:              And you know, that’s interesting because I often say that why is perhaps the most important question you can ask and somebody, I’ll get this working with clients where somebody will ask for something and I’ll come back and ask “why do you want to do that” or “why are you asking for this?” Because they just said, “do this” without any detail around it and sometimes when you really understand that why, it changes how you approach it or how you’re going to deal with things.

Carlos:             A question I want to bring up, we’ve talked about preparation, how different might things be training. Now again, I think for a lot of people, getting the runbook would be a huge win, right?

Greg:               Yep.

Carlos:             Like hey, we’re done, right, I’m finished. But what are your thoughts around taking that to the next step and how can we be better, is prepared the right word for working together?

Greg:               I think one of the things that companies are afraid to do sometimes is practice. One of my former employers had a very good setup, you know, very robust network architecture and everything like that. And my network engineer came to me one day and said, “hey, I’d really like to do a failover of our switches, just to make sure it works and everything.” And I said, “yeah, that’s a great idea, what’s the risk” everything like that and he wrote it all up. And it was something like maybe worst-case 30 seconds to rehome the BGP tables and all of that and we would do this at like four in the morning when nobody’s using our site. I said, “great.” So, I went to my managers and one of the other managers and they’re like “oh no, no, we can’t do that, it’s too risky.” And what I couldn’t get through to them and my network engineer felt the same way was, it can’t be too risky. We either trust the setup and prove it or we don’t, in which case we can’t rely on it in an emergency. And I think in my book, and I mentioned in one of my talks, supposedly Jeff Bezos would walk through his data centers and just randomly pull out a network cable or hard drive. And you know, I say that to people and they get this look on their face, “oh my gosh, I could never do that.” And I ask them, “how do you know your DR snare is robust enough to handle that? If you have a saying and it’s supposed to handle a drive down, how do you know it’s going to do that until you actually pull a drive?” Now, that said, I’m not quite that extreme of just walking and pull out hard drives because you might want to check first that you don’t already have a failed hard drive. But you know, practice really is an important part of it.

Carlos:             Yeah, so from that perspective, cause one of the things that, the problems you have, and there have been plenty of examples I think, so what was it, Delta that had the big downtime? They were trying to switch their power supply and they were like, “ok, well, this is a test. Let’s switch from one to the other.” And then it failed and then, so the practice caused an outage, and so there’s that fear, which I think is real. Part of that is because you already have all this infrastructure, we talked about testing, is already in a production state. Users are depending on it, somebody is depending on it. It can get complicated. I’m not saying you could necessarily recreate your entire production environment. Obviously, that would be nice, but you know, routers and switches and things, that gets expensive. But what about maybe trying that in another environment. Okay, yeah, maybe I don’t have all the users on there, but can I at least attempt on these two VMs, and go from this one to that one? Okay, well, at least I know it can do that.

Greg:               I think it’s a brilliant question and a brilliant point. Again, what’s it worth to you. If you’re doing mission-critical systems, say Wall Street where they’re literally trading hundreds of millions of dollars a minute, then investing in basically a duplicate infrastructure is probably worth it. If you’re doing $1000 an hour of business, well, yeah, you probably can’t duplicate your entire data center, but like you say, these days you can pretty much virtualize anything in your environment and taking the time to do that. But I’m going to go back to your Delta comment for a minute and say consider this. In a proper test and everything, you have people in the key places that, oh yeah, normally software would do this, but we’ll have someone there that can manually type in the command if need-be or something like that. So, imagine that the Delta scenario was not a test but had happened for real, that for whatever reason the power had failed and they didn’t have any of their preparation in place? I’m assuming that they had a plan in place of, okay, if something goes wrong, we’ll fall back. Again, think about if it happens in real and I don’t have anything in place, it’s probably going to be worse than if I do it in practice and something does go sideways.

Steve:              Oh yeah, and I think that whatever outage you cause when you’re doing a practice like that is going to be less significant than when it happens in reality if you had not practiced for it. And yeah, if you cause an hour of downtime because of a practice, that might prevent 12 hours of downtime in reality.

Greg:               Exactly.

Steve:              And that’s one of those that I think it’s oftentimes hard to get management to approve.

Greg:               Oh, very hard.

Carlos:             Oh, I agree, I agree.

Steve:              Because oftentimes, it’s running right now and if you touch anything it’s your fault when it breaks.

Greg:               Exactly. You know, I’m reminded, I just finished teaching my son to drive, he’s got his license now, and you know, growing up it the northeast, we get snow storms. So, one of the first things that we did last winter, actually was took him over to the local parking lot with big snow banks, said okay, let’s do some skids, let’s do some slides. And he asked that question, “well, dad, what if I slide into a snowbank?” or there’s like two light poles that are in the middle of the parking lot, “what if I hit that?” and I had to point out to him is “I’d much rather have you do that at 5, 10 miles an hour in the parking lot than on a highway for real where there’s other cars and people around. I mean, yeah, I’ll be upset if my car gets a dent in it, but that’s a far cry from if you’ve never practiced it for real and go into a real skid on a real curve with other people in the car and stuff like that.

Steve:              Very good points.

Carlos:             So, what do you think, I guess maybe the last question here and then we’ll move into SQL Family. So, for the non-managers, for managers, I’m not saying it’s easier, but ultimately you have a little more decision-making authority. So how could we, as the humble IT worker that we are, how could we begin to influence some of those things?

Greg:               How can we manage our managers? I would do two things. I would honestly try to read up a little more on the incident command system, ICS, and then on cockpit resource management or crew resource management, which deals with how management both in terms of structure and then in terms of communications can work during a disaster scenario. ICS was developed to handle everything from a car accident up to wildfires in California. And I honestly would love every IT manager and maybe one step above that to take at least a 100 level ICS course, which by the way, you can go to the FEMA website and click through the whole thing in about half an hour. Maybe us IT folks can convince our manager to take that class once and let me do a little more research on this so I learn more on how to be part of a team and then how to be able to communicate up and down and across my team.

Steve:              Oh yeah, I would completely agree with that. If IT management could learn ICS, it would make disaster response much more organized, I think.

Carlos:             Yeah, interesting. So I guess I’ll leave, maybe another word. I remember one time I was at church and an older gentleman was in the pew and he appears to be, so there’s something wrong with him. He slumps over, I was several pews back but he appears to stop breathing, and so obviously there’s a situation going on here. And so, people around were like “oh gosh, what’s going on?” and they didn’t immediately move him. Obviously, they called the paramedics and the paramedics come over, they get him out. And what was interesting, I did not see this, I was seeing this second hand or hearing this second hand, was that again, we were all concerned, “oh my gosh, do something.” And what the EMT did, he came over, again, he hears this call, guy stopped breathing, this older gentleman, you know, 80 years old, might be having heart attack. And at that time he was alert, and the EMT walks over to him and just starts talking to him, “hey, what’s your name? I’m” whatever. And they’re like “wait a second, aren’t you supposed to be doing something?” But again, I think through that training, he wanted to establish “where am I in the process?”

Greg:               He’s assessing the situation and everything like that. And that’s a big part of it. One last thing with cave rescue, we tell folks and we drill this into our students, when you come across a scene, the first thing you do is stop and look around and see what happened. Why did that person fall down? You don’t want to, if it’s bad air in the cave, which is actually very rare, but you don’t want to go charging in after somebody and find out that, oh, it’s bad air.

Carlos:             Right, and then you’re down there as well.

Greg:               So yeah, really is often just sometimes the simple things of, “hey, let me just ask a few questions” and patients, 9 times out of 10, they’ll tell you everything you need to know. You just have to be listening. With databases, you can’t quite ask them questions in terms of English, but if you’re willing to take the time and look at the logs and spend an extra couple minutes looking at stuff, it’ll answer most of your questions.

Steve:              Yeah, it’s interesting because so often when there’s a problem whether it’s in EMS or in a database, people get tunnel-vision and they just think about a specific issue that’s being reported but they don’t ask the questions like that either from the patient or from a database to understand what the real issues are. I think, yeah, I would completely agree there.

Carlos:             Awesome, shall we do SQL Family?

Greg:               Sounds good to me.

Steve:              So Greg, how did you get started with SQL Server?

Greg:               You know, I joke because when I was in college, I took a class on databases and it discussed the SQL language and how this is starting to be more popular and maybe someday it’ll be a force in the industry. And probably within about 10 years I had a client that did Information Management System for Laboratories. And they started moving from using Informix, for anybody who remembers that, now I’m dating myself, over to SQL Server. And what they needed was their team could install their software, but they wanted someone to go in a day or two beforehand and install the SQL Server setup and everything like that. So, I kind of became their go-to person for that and after that, moved into an internet start-up in 1998 where I became the DBA and stuff like that and just really decided I really enjoy it. I love the fact that there’s so much data at your fingertips and if you just know how to ask the questions, you can get some amazing stuff out of databases.

Steve:              All right, very cool.

Carlos:             If there’s one thing you could change about SQL Server, what would it be?

Greg:               You know it’s funny, I just answered this question on Quora.com. one of my beefs, and it’s a small one, but I’m sure every DBA’s gotten bit by it once is the fact that when you do a Restore Database, it defaults to With Recovery. And if you forget to type With No Recovery and you’ve just restored that database that’s taken 5 hours to restore and you try to restore the logs and discover, ooo, I can’t, you’re going to say “ah, I wish I’d typed no recovery.” So really wish No Recovery was the default because worst case is, if you’ve got nothing else to do, you just then say With Recovery and you’re done.

Steve:              What is the best piece of career advice that you’ve received?

Greg:               I don’t know what’s the best piece I received. The best I can give is do what you enjoy, not what you love, because if you do what you love for a job, it becomes a job and you stop loving it. But if you do what you enjoy, you’ll probably still enjoy it while working.

Carlos:             There you go. Our last question for you today, Greg, if you could have one superhero power, what would it be and why do you want it?

Greg:               I go back and forth between teleportation and flying. I think being able to fly like a bird and swoop and soar and all that would be pretty darn cool. And I think cool is a good enough reason for wanting a super power.

Carlos:             Well, there you go. Yes, I was going to say you must still have hair that the wind can flow through.

Greg:               I do, I just got it trimmed last week, so it’d be pretty short, but I’ll grow it out if I ever get the chance to fly.

Carlos:             Awesome, Greg, thanks so much for being on the program. We do appreciate it.

Greg:               I enjoyed it. Thanks for having me and if you ever want to talk about plane crashes in the future, we can do another one on that.

Steve:              All right, sounds good, thanks Greg.

Carlos:             So, I went ahead and took Greg’s challenge here, and went to the FEMA website, and I found this ICS course that he talked about. While I admit I didn’t get through all of it, I did begin the course. It won’t take too long, I just need to spend a little bit more time with it. But I did find it interesting that this was created and kind of looking back to the origin of FEMA and some of these things is the California wildfires in the 70’s. They just got bigger and bigger and with more homes and people dying as a result of these fires, the government decided to put together kind of a plan of action, or at least some ideas or a framework on how first responders and emergency personnel could get together and coordinate ideas. I do think it harkens a lot to some of the thoughts that Greg had in today’s episode and so of course we’ll have the link to that course up on the website if you’re interested in checking it out. I think we in technology get a little bit spoiled with some of the interactivity that we have. Now there are some videos and whatnot that are in the course, but for the most part it’s going to be PowerPoint types or slides that you’re going to go through and do a bit of reading on. Again, very interesting, I think one of these ideas from a culture perspective, so this is not necessarily technology-driven, but how can we change the culture around disasters? I was actually at an event last night talking with folks, and these are general IT people, and security is always a big issue. But there’s always that question in the back of my mind is, “if I make this change, am I going to break something? Is it going to cause an issue, and then what are the repercussions there?” So, I think part of this idea is that we can more forward if we know how we’re going to interact when emergencies do come. And so, thanks to Greg for that conversation.

As always, compañeros, we’re interested in your thoughts on what we should be talking about on the podcast, or if you’d like to join us, of course we’re very interested in that. Our music for SQL Server in the News is by Mansardian used under Creative Commons. Compañeros, you can always reach out to us on social media. I am interested in connecting with you on LinkedIn. I am @carloslchacon. And compañeros, I’ll see you on the SQL Trail.

2 Comments
  • ROSEMARY new valentine February 27, 2018 at 2:23 pm

    So great Greg!!! Great interview. So well done. Many should listen to this.
    EMS & Caving. Needs are simalar & sometime so different as to know how to respond.
    We all need this knowledge. Up ground & underground are different at times, but both help each other do help to save people.

    Get his bookon Amazon

  • Dew Drop – February 21, 2018 (#2669) – Morning Dew February 21, 2018 at 6:53 am

    […] SQL Data Partners Podcast Episode 124: Beyond the runbook; DR Organization (Carlos L Chacon) […]

Leave a Reply

Back to top