Episode 124: Beyond the runbook; DR Organization

Perhaps you have heard about a runbook–the documented instructions you should follow in the event of a disaster or some situation where an outage occurs.  Instructions are great for IT folks as they give a reference to follow–and they don’t freak out even if we might. They don’t, however, include instructions for everyone–what about the folks that aren’t tapping away on the keyboard?  In this episode, we chat with Greg Moore about his experience in emergency situations outside of IT and how we might apply these principles in our environment.

Whether you are directly in the line of fire or support those that do, we think you will find this conversation interesting.

Episode Quotes

“No battle plan survives first contact with the enemy.”

“When something bad really happens, the first thing you should do is sit down and make yourself a cup of tea. [It will] force you to slow down and stop and think about what’s going on.”

“One of the challenges that we as IT people have is that we tend to think that our way is the best way.”

“I’m a huge fan of people learning the how and the why behind things, not just what to do in X, but why are we doing X, why are we doing Y?”

Listen to Learn

01:34     Compañero Shout-Outs
03:27     SQL Server in the News
04:38     Intro to the guest and topic
06:49     The beginning of the caving example as an object lesson in IT disasters
10:37     Make sure we don’t make things worse
13:27     Recognize your limitations – know what help to offer and when to step back
16:03     What happens when you don’t trust your team – don’t be a micro-manager
18:48     Training helps in most situations but not always
25:03     Always ask “why”
26:40     Don’t be afraid to practice possible disaster scenarios – Jeff Bezos & Delta examples
32:39     How can we manage our managers
33:54     Ask questions to assess the situation before you charge in
36:35     SQL Family Questions
39:47     Closing Thoughts

About Greg Moore

Greg Moore is a graduate of RPI. There, he majored in CompSci, but probably spent as much time hiking, canoeing, caving and rock-climbing as he did studying. He started working with SQL Server 4.21a in 1995 and has survived numerous upgrades. He’s been a Director and later VP of IT at several startups including PowerOne Media, TownNews and Traffiq and now consults. These days, when he’s not busy playing with SQL Server or spending time with his family, he can often be found underground caving or teaching cave rescue with the NCRC. While his focus is on the operations side of DBA, his interests include DR, performance and general IT problem solving. He is the author of: IT Disaster Response: Lessons Learned in the Field. You can find it here: https://smile.amazon.com/dp/1484221834/

 

FEMA’s ICS course link: https://training.fema.gov/is/courseoverview.aspx?code=IS-100.b

Transcription: Beyond the runbook; DR Organization

Carlos:             Compañeros, welcome back to the SQL Trail! This is Episode 124. Today we’re talking with Greg Moore. He’s the owner of Green Mountain Software, and the author of the book IT Disaster Response: Lessons Learned in the Field. And ultimately, that is the topic of our conversation today, this idea of getting beyond the runbook. So in a disaster, how we can better coordinate with one another, not so much in simply just the execution of the response. So, we’re excited to have him. Greg is up from the northeast portion of the United States and works with cave rescue teams and so we’re interested in talking with him about some of his experiences.

We do have a few Compañero Shout-Outs we want to get to. The first, John Workman reached out on LinkedIn. He just took a job with Microsoft, helping folks on-board with the various services that are there and wanted to reach out. I think we’re going to try to get John on the program here to talk about some of the things that he’s doing. Also, I’m not sure if we have any Podable listeners here, but if you’re getting your podcast through Podable, welcome. Welcome to the SQL Trail, we’re glad to have you. Want to give a quick shout-out to Dave Mason from the panhandle of Florida, reaching out, saying hello. Dave and I were talking about a few ideas for the podcast as well and I think I’m going to snag him here later on for an episode as well. Now Cristian Satnic, he suggested a topic, although I couldn’t convince him to come on, and wanted to talk about continuous delivery to the database. And so Cristian, I’m happy to say that that episode has been recorded and will be out next month, so we’re looking forward to that. I also want to give a shout-out to Steve Stedman. Unfortunately, through work and busy lives that we lead, Steve is no longer going to be able to be on the podcast with us. That’s a big blow, I know, to all you listeners. I’ve enjoyed having Steve on the program for, gosh, these 50 plus episodes and appreciate all of his contributions to the program. That’s not to say that he won’t be back, but unfortunately, he won’t be a regular guest. Now he is going to join us for today’s conversation and there’s one other episode that he’ll be with us. Then moving forward, you will probably see a variety of other folks on panels, if you will, that we’ll be doing on the program to try to get a few more people in as part of the conversation. But I wanted to thank Steve, again, for all of his time and effort, because it does take quite a bit of time to do that many episodes, and so I’m appreciative to him.

So now I think it’s time for a little SQL Server in the News! I want to give a shout-out to some of my friends at Channel Advisor. I think there are a couple of others, but those folks, Tracy, Mark and Brian, have put together a website called We Speak Linux at wespeaklinux.com, surprisingly. This website is directed to Windows administrators and developers who want to learn about Linux. The reality is, I think, most of them are coming from the DBA background and they want to help others who are going to be in dual environments, which Channel Advisor is, and running SQL Server in multiple operating systems and some of the challenges that they’re going to face. I’m not sure if they’ve started yet, but I know they are going to be starting in March. The first week in every month, they’re going to have a webcast and invite folks to come on, ask questions, so there’ll be more content. They are just kind of getting the site up, but there’ll continue to be blog posts and whatnot around that. So if you are interested in finding out a little bit more about running SQL Server in Linux, then that site might be for you.

Our URL for today’s episode is going to be sqldatapartners.com/caving or sqldatapartners.com/124. Let’s go ahead and jump into the conversation with Greg.

Carlos:             So, Greg, welcome to the program!

Greg:               Thank you guys. Happy to be here.

Steve:              Yeah hey Greg, it’s great to have you.

Carlos:             Yeah, it’s great to have you on the program today. We don’t get too many authors on the program and so it’s nice to see that you’ve put out a book and we’re going to, obviously, be focusing on that a bit today. Our topic is how IT folks can learn, maybe, from some of the other systems or other organizations and how they respond to disasters. And I think this idea of beyond the runbook. You know, as IT people, we know, hey, we have to have a runbook and that’s going to tell me what I need to do, or for those of us who have actually had a disaster, we know that it doesn’t always work quite as smoothly. So you actually give a couple of examples or scenarios and we thought we might chat about those and how do we apply them to IT.

Greg:               Sounds good to me, where do you want to start?

Carlos:             So, we have a caving example and then we have a plane crash example. Which one do you want to start with?

Greg:               Let’s start with caving and as folks can probably tell, I have this fascination with when things go wrong, so I kind of pick the highlights and the lowlights there, I suppose.

Carlos:             Yes, that’s right. What I find really interesting about this, at least from your book, I’m not sure if this is still the case, but at one time you were actually the database administrator for one of the caving teams, like this group of people who would actually do rescues.

Greg:               I still am the database administrator. It’s the National Cave Rescue Commission. We’re not a rescue team, we actually train folks who form the rescue teams. I’m also an instructor, so I like to tell people where to go underground and what to do when they get there.

Carlos:             Yes, so I thought that was an interesting mix of professional and personal interests colliding there.

Greg:               Yep.

Carlos:             So, in the book you tell this story, in the middle of the night you get called in, there’s a caver in a cave who’s been free climbing and they’ve been injured, and your goal is to go in and help them get out.

Greg:               Yes, one of the mantras that we try to teach is get people out as quickly and as safely as we can and hopefully in better shape than we found them. He unfortunately was not in great shape after falling about 40 feet. Yeah, it was not a pretty scene, but by the time I got there, actually a lot of the rescue had taken place and there were quite a few other folks there so I was just one cog in the overall machine.

Carlos:             Now one of the challenges in this specific instance was that, so if you’ve never done caving before, if you’ve been on a tour, it’s not the same thing as some of this cave exploring. The holes are very small, so you go on a tour, everything’s very big. I won’t say caving is unsafe, you have to crawl around a bit. I’ve been caving here in Virginia and there’s actually instances where you have to, you’re on your belly almost, trying to get through some of these places. And so that kind of lends itself to this story in the sense that he was not in a big open space and you could just get the stretcher down there and roll him on out.

Greg:               You’re very right. I always love in movies where they walk through caves, there’s plenty of light and the floor’s nice and smooth and everything like that. Especially in New York, our caves don’t tend to have any of those features. Crawling is a big part of it. This particular cave in Vermont is much like that. The entrance is kind of a weird trapezoid shape that you kind of crawl in, crawl over some rocks, and then you get to the top of basically about a 50 foot drop and straight down from there, there is actually a large room. So, where the unfortunate patient had landed was actually quite a large room and there was room to work there. But to tie this back to IT, the cavers in the area had recognized that something could go wrong here, and they had done some measurements and they knew at the top of that drop, to get a patient through in a litter was not going to work. It was basically chest-size, normally, you put them in a litter, it’s not going to work. So they actually had a pre-plan, kind of think of it as our runbook, that basically said, “hey, if we ever have to evacuate a patient from beyond this point, we need to get in here and remove some rock.” So they actually had spent about 2 to 4 hours removing that rock there and setting up some other stuff, which is why by the time I got there, it was about a 2 hour drive, I kind of got there into the middle of the rescue because the local rescuers had already done a lot of the pre-plan. But you know, as they say in the military, what’s the quote, something along the lines of no battle plan survives first contact with the enemy. That was the situation here. We had a great plan that got him to the top of that drop, but then getting him out through that trapezoid shape, we realized very quickly that, hm, the plan isn’t going to work. It’s 40 feet long, so we can’t remove 40 feet of rock, we’ve got to come up with Plan B. Or, I like to say in the back of my head I’m thinking Plan B, C, D and E and trying to figure out what’s going to work.

Steve:              So, with that, then, with a 40 foot fall, someone could have pretty significant injuries.

Greg:               Correct.

Steve:              And to just try and drag them out without being on a stretcher or a basket of some kind, could do much more damage than the fall itself. So therein lies your challenge, is how are you trying to get them, you’re trying to get them through this narrow passageway and it just won’t fit because of the size of the stretcher and the patient themselves. So, what did you do at that point, then?

Greg:               Steve, you bring up a great point, and since this is more of a database podcast than a cave rescue podcast, let me tie some of this back. One of my fellow rescuers who got to the patient recognized the injuries, had some basic first aid training, a little beyond that and was able to stabilize the patient including putting a little traction on one of his legs, which immediately calmed the patient down. As you can imagine, having broken legs was a likely result of this and not a comfortable position. But as I mentioned, we had to clear rock out. And one of the first rules we have is, make sure we don’t make things worse. And it’s kind of the same thing when, say, a database has failed or something like that. Sometimes the temptation is “oh, let’s reboot the server, oh, let’s do this.” I’ve had a boss come in and say “oh, let’s do this” and make things worse because they didn’t stop to think, “hey, what’s the situation, how can we stabilize things, make them better?” In this case it was “let’s move the patient out of the way of the falling rock, let’s get the patient into a litter.” For those who are EMS geeks, we tend to use the big orange Ferno litters. Let’s make the patient comfortable, they’re going to be there for a while, let’s give them some food and water and wrap them up, so we spend a lot of effort stabilizing the patient. So, we have one team kind of doing that while we have the other team putting in the ropes, opening up the rock, things like that. And it’s one of those things that I’ve seen in what I call kind of the large IT disasters, a failure sometimes where there isn’t a whole lot of thinking beyond the immediate problem. There’s the “oh, this is crashed, we need to focus on this” but someone’s got to be thinking about, “well, who’s going to handle the influx of emails we’re going to get from our thousands of irate customers that might overwhelm our email server? So maybe I should have a team focusing on that while we have a team focusing on the dead database”, or in this case, a team focusing on the patient and a team focusing on the rockfall.

Steve:              As you say that, I’m thinking back to an example I worked on last week with a corrupt database and with the whole concept of don’t make things worse. And I think that when people run into database corruption, one of the first things they do is they try and reboot the server to see if that’ll fix it and that never fixes it, and in fact, sometimes it makes it worse. But the one that I was working on, it was actually if they had rebooted it in the condition the database was in, it would not have come back up. So that was one of them we got lucky that they followed that same logic, like you were talking about there, don’t make it worse, and they didn’t do the reboot. They just left it running until somebody could look at it. And I think that so many of these things you’re talking about here, really translate over from rescue world to dealing with database issues or IT crises.

Greg:               Exactly, and then the other problem that we run into and I started to mention this, too, is so we get the patient up and we try to get them out through the entrance and realize, huh, our initial plan is not going to work. And this is where, because I knew several members of the rescue team, I had worked with them previously, or I had trained with them, I had actually given instruction to a couple of them. And I also had with me an expert rigger inside the cave and one outside the cave. And I basically came up with a plan, I yelled at them, said “hey, come close so I can tell you, this is what I need.” And then I stepped back and let them make it happen. And it’s something I’ve seen again in IT, my best managers during a disaster have been the ones who know what help to offer and when to step back. I joke, but it’s a true story, back, I’d say it was probably about 2000 when the internet was still young and we were all not as jaded and grey haired as we are now. I don’t remember what was going on, our website was suffering performance issues or something and the CEO of the company who was also a good friend of mine, he came in, looked at us, realized that we knew what we were doing and he said, “I probably can’t help you, but can I order you some pizza?” And it sounds like a small thing, but it really was a big thing because we were going to be there for another hour or two at least, it was close to dinnertime and we were all focused on the problem but in the back of our mind, we’re all thinking, “hey, I’m kind of hungry, here.” So as a manager, he recognized his limitation, he wasn’t going to sit down and start typing out SQL commands or anything like that, but he could remove the small issues and he could keep all the sales people from bugging us every 5 minutes, “when’s the site up, when’s the site up?” So, he worked very well in that role as, let the people who can solve the problem, solve the problem. Back to the cave rescue here, that was my attitude: these two, they know what I want, I trust their rigging, if they can’t make it happen they’ll let me know and when they do make it happen, we’ll go ahead and try it. And it worked very well, and that’s why I say I don’t want to take credit for it. It might have been my idea, let’s try this rope this way, but these guys did all the hard work. And in a disaster, knowing that you can trust your team, you can trust the people below you or alongside you is a big part of that. And I’m going to guess Steve, as an EMT, you probably have that where you work with certain partners where you just know that you get to the patient and you’re zoned in and you just know what to do without having to shoot a hundred questions back and forth.

Steve:              Oh yeah, absolutely. And I think that, the thing I see oftentimes, from flipping that around on the IT side, is you get that IT manager who doesn’t have the knowledge or confidence in the team and their solution of dealing with issues is to come in and sit next to or stand next to the people who are doing the most critical work and ask them questions and basically interrogate every single step that they’re doing, instead of ordering pizza like in your example. Where that actually can make things much worse or slow things down, either in the rescue situation or in the IT side of things.

Greg:               Exactly, and I think we’ve all had that. I know I’ve had that. I’ve actually had, it was a brand-new VP to the company basically come in my office and “well, what were you doing?” and giving me a hard time. I basically said, “while you were asking a thousand questions, I was solving the problem.” It was almost a career limiting event at that point, but it’s kind of what he needed to set him in his place and I would not recommend doing that to most managers. But again, the most successful managers I’ve had have been the ones who will back off and say, they’ll ask the pertinent questions, “hey, do you have an ETA so I can get out an email. Okay, that’s 30 minutes, great. I’ll check back with you in 31 minutes.” But yeah, the ones who hover over you and stand over your shoulder while you’re trying to get stuff done, I don’t think they realize that half the time they’re going to make things take three times as long.

Steve:              Right, and they’ll even add the risk of more mistakes being made, because I know I seem to make more mistakes when somebody’s looking directly over my shoulder. I’m more concerned about what’s their next question going to be, than what it is you’re doing on the system, perhaps.

Greg:               Oh sure, and of course, there’s a rule that the more people standing over your shoulder, the harder it is to type, regardless. And if they’re high-ranking, your fingers aren’t even going to find the keyboard.

Carlos:             Yeah, and I think some of this ultimately kind of comes back to culture as well. And if we’re not working with our teams, and we tend to be very siloed, when we do get into this situation where we have to, it’s all-hands-on deck kind of situation, if that’s not a culture that we’ve developed, then I think it does make it very challenging. And I think maybe in the manager’s defense, perhaps, is that if the teams are not reaching out to them, so I’m not singling anyone out here, necessarily, but if we’re not reaching out and trying to have those conversations and work together to solve those problems, then when it becomes super important, the pressure of that doesn’t make for a great combination.

Greg:               Agreed.

Steve:              So, one thing I’m really curious about there in your caving example is that, and I see this on the EMS side quite a bit as well as in IT is that it really comes back a lot to training. And there’s things that are sort of the common things that could happen, and those are the things that you train for quite a bit. And then there’s also high-risk things that can happen, but maybe they’re less frequent. And those things you train for because of the risk involved in them. And I think that when you come across a scenario where it’s something you’ve trained for and you’ve seen it lots of times or you’ve trained for it lots of times, it makes it much easier to deal with. But then when you get in a situation where it is way different than anything you’ve trained for or there’s some twist that really changes it up, that’s where I think mistakes end up being made. And I guess I’m curious if you’ve seen that same thing in your background.

Greg:               Yeah, I definitely would agree and this is one area where I think what I’m going to call street EMS vs backcountry EMS or rescue differs a bit. Street EMS, like you say, a lot of it is pretty standard calls, broken limb, broken that and everything like that. And then like you say, you’ve got the very out of the ordinary ones. With cave rescue, there’s a, I don’t want to say a magazine, something that comes out about every two years called the American Cave Accident Report. It’s put out by the NSS and basically covers all the reported cave accidents, and one of the goals is to say, here’s the commonality, here’s the things. But one of the things that strikes me is generally, while there’s often a lot of similar injuries, and a lot of them are, honestly self-rescue, you know, a person twists an ankle and their friends help them out or something like that. When actual calls come out, every cave is unique. It’s really kind of hard to develop standardized plans, so what I really try to teach, and I’d say the NCRC, the National Cave Rescue Commission I work with, tries to teach is, here’s a set of skills, here’s a set of tools to put in your tool box and keep them in mind because you may never know which one you’re going to need. And that’s a big part of it. This one rescue up in Vermont, it was a lot of putting little pieces together of, hey, this will work here, this will work there, but there was obviously nothing we could train for when you have this particular cave, do that. Unless you’re training on a particular cave, which some teams might do because they know that cave is popular. But in terms of making mistakes, you’re right and again it goes back to something I was taught decades ago, literally, was kind of a joke but not entirely was, in the back country when something bad really happens, first thing you should do is sit down and make yourself a cup of tea. And it sounds like “oh my gosh they’re going to”, well, unless they’re going to bleed out in the next five minutes, in which case, take care of that. If it’s 6, 12 hours to a rescue, making that cup of tea is going to force you to slow down and stop and think about what’s going on. And one thing that we all get caught up in is the term Go Fever. We’ve got to go, go, go. Well, sometimes we need to slow, slow, slow. And it’s a hard thing to really do and I think especially the more critical and often what appear to be the larger disasters are the ones where we most often have to slow down and stop. To use an analogy, Steve, if you’ve ever done mass casualty incidents, they always say “the person who’s screaming in pain is probably not the problem.” You know, hey, they’ve got an airway, everything like that. It’s that person sitting quietly that you’ve got to think about. What’s going on with them, why are they not making any noise? And again, we have to slow down and do that slow, slow, and stop and think about it and take that extra 5, 10 seconds to evaluate before we jump in.

Steve:              Yeah, you know what’s interesting about that is that there’s a term that we use oftentimes in EMS, which is called the distracting injury or a distracting patient. Where it’s something that jumps out as sort of obvious and big but it hides the real issue. For instance, it might be a broken leg when someone’s really having a cardiac issue. And people will tend to focus on the obvious thing and sometimes overlook the more serious things and I think that happens quite a bit in the database world as well, because you’ll see one thing and you’ll immediately start dealing with that one distracting issue and not realize that you’ve got a file system failure or something more significant on your database that is really causing all of it and not just causing the distraction.

Greg:               You bring up a great example. I was just dealing with a client of mine where it was “oh my gosh, our backups filled up our disk space, what went wrong?” It’s like, okay, we can clean up a few of the backups, no I’m not going to reduce your retention period because you really need to have this many days’ worth just in case, but maybe we should look into why for three days in a row your backup that normally takes one hour took five hours. You’re worried about the disk space, well that’s a symptom of the fact that something went wrong with your file system and your backups now are taking five times longer.
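
Greg’s point about the backup that went from one hour to five can be turned into a simple check. What follows is a minimal sketch in Python and not something from the episode: the function name, the seven-run baseline, the 3x threshold and the sample durations are all assumptions made for illustration. It compares each recent backup duration with the median of the earlier runs and flags the ones that took far longer than usual, which is the symptom worth chasing before the disk fills up.

```python
from statistics import median


def flag_slow_backups(durations_minutes, baseline_runs=7, threshold=3.0):
    """Return the indexes of backups that ran much longer than the baseline.

    durations_minutes: backup durations in chronological order (hypothetical data).
    baseline_runs: how many of the earliest runs define "normal" (an assumption).
    threshold: how many times the baseline median counts as "slow" (an assumption).
    """
    if len(durations_minutes) <= baseline_runs:
        return []  # not enough history to judge anything yet
    baseline = median(durations_minutes[:baseline_runs])
    return [
        i
        for i, minutes in enumerate(durations_minutes[baseline_runs:], start=baseline_runs)
        if minutes > threshold * baseline
    ]


# Seven ordinary one-hour backups, then three days at roughly five hours each.
history = [60, 62, 58, 61, 59, 60, 63, 300, 310, 295]
slow = flag_slow_backups(history)
print(slow)
```

With this made-up history the last three runs are flagged. A check like this only points at the symptom; finding out why the runs slowed down is still the real work.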

Carlos:             Yeah, I also think that going back to the differences in caves and whatnot, so standards, if we’re going to apply that to an IT environment, so standards might play a role here, and everybody does things a little bit differently. And I think one of the challenges that we as IT people have is that we tend to think that our way is the best way.

Greg:               Well, of course my way’s the best, but that’s beside the point in this discussion.

Carlos:             Yeah, and coming back and say okay, well, do we have any processes, maybe going back to that runbook, how is it that we’re supposed to be tackling some of these things.

Greg:               I’m a huge fan of people learning the how and the why behind things, not just what to do in X, but why are we doing X, why are we doing Y? Bringing it back again to Steve’s EMS, a lot of it is why is the patient’s heart rate this way. You know, we can treat it, but if we understand why, we can do a better job of it. Same thing with databases, why is this happening, why do I back up the tail of the log? Well, because it’s useful, not just “oh, let me hit a button because the commercial software says to do that.”

Steve:              And you know, that’s interesting because I often say that why is perhaps the most important question you can ask and somebody, I’ll get this working with clients where somebody will ask for something and I’ll come back and ask “why do you want to do that” or “why are you asking for this?” Because they just said, “do this” without any detail around it and sometimes when you really understand that why, it changes how you approach it or how you’re going to deal with things.

Carlos:             A question I want to bring up, we’ve talked about preparation and how different things might be with training. Now again, I think for a lot of people, getting the runbook would be a huge win, right?

Greg:               Yep.

Carlos:             Like hey, we’re done, right, I’m finished. But what are your thoughts around taking that to the next step and how can we be better, is prepared the right word for working together?

Greg:               I think one of the things that companies are afraid to do sometimes is practice. One of my former employers had a very good setup, you know, very robust network architecture and everything like that. And my network engineer came to me one day and said, “hey, I’d really like to do a failover of our switches, just to make sure it works and everything.” And I said, “yeah, that’s a great idea, what’s the risk,” everything like that and he wrote it all up. And it was something like maybe worst-case 30 seconds to rehome the BGP tables and all of that and we would do this at like four in the morning when nobody’s using our site. I said, “great.” So, I went to my managers and one of the other managers and they’re like “oh no, no, we can’t do that, it’s too risky.” And what I couldn’t get through to them and my network engineer felt the same way was, it can’t be too risky. We either trust the setup and prove it or we don’t, in which case we can’t rely on it in an emergency. And I think in my book, and I mentioned in one of my talks, supposedly Jeff Bezos would walk through his data centers and just randomly pull out a network cable or hard drive. And you know, I say that to people and they get this look on their face, “oh my gosh, I could never do that.” And I ask them, “how do you know your DR scenario is robust enough to handle that? If you have a SAN and it’s supposed to handle a drive down, how do you know it’s going to do that until you actually pull a drive?” Now, that said, I’m not quite that extreme of just walking in and pulling out hard drives because you might want to check first that you don’t already have a failed hard drive. But you know, practice really is an important part of it.

Carlos:             Yeah, so from that perspective, cause one of the things that, the problems you have, and there have been plenty of examples I think, so what was it, Delta that had the big downtime? They were trying to switch their power supply and they were like, “ok, well, this is a test. Let’s switch from one to the other.” And then it failed and then, so the practice caused an outage, and so there’s that fear, which I think is real. Part of that is because you already have all this infrastructure, we talked about testing, is already in a production state. Users are depending on it, somebody is depending on it. It can get complicated. I’m not saying you could necessarily recreate your entire production environment. Obviously, that would be nice, but you know, routers and switches and things, that gets expensive. But what about maybe trying that in another environment. Okay, yeah, maybe I don’t have all the users on there, but can I at least attempt on these two VMs, and go from this one to that one? Okay, well, at least I know it can do that.

Greg:               I think it’s a brilliant question and a brilliant point. Again, what’s it worth to you. If you’re doing mission-critical systems, say Wall Street where they’re literally trading hundreds of millions of dollars a minute, then investing in basically a duplicate infrastructure is probably worth it. If you’re doing $1000 an hour of business, well, yeah, you probably can’t duplicate your entire data center, but like you say, these days you can pretty much virtualize anything in your environment and taking the time to do that. But I’m going to go back to your Delta comment for a minute and say consider this. In a proper test and everything, you have people in the key places that, oh yeah, normally software would do this, but we’ll have someone there that can manually type in the command if need-be or something like that. So, imagine that the Delta scenario was not a test but had happened for real, that for whatever reason the power had failed and they didn’t have any of their preparation in place? I’m assuming that they had a plan in place of, okay, if something goes wrong, we’ll fall back. Again, think about if it happens in real and I don’t have anything in place, it’s probably going to be worse than if I do it in practice and something does go sideways.

Steve:              Oh yeah, and I think that whatever outage you cause when you’re doing a practice like that is going to be less significant than when it happens in reality if you had not practiced for it. And yeah, if you cause an hour of downtime because of a practice, that might prevent 12 hours of downtime in reality.

Greg:               Exactly.

Steve:              And that’s one of those that I think it’s oftentimes hard to get management to approve.

Greg:               Oh, very hard.

Carlos:             Oh, I agree, I agree.

Steve:              Because oftentimes, it’s running right now and if you touch anything it’s your fault when it breaks.

Greg:               Exactly. You know, I’m reminded, I just finished teaching my son to drive, he’s got his license now, and you know, growing up in the northeast, we get snow storms. So, one of the first things that we did last winter, actually, was take him over to the local parking lot with big snow banks and say, okay, let’s do some skids, let’s do some slides. And he asked that question, “well, dad, what if I slide into a snowbank?” or there’s like two light poles that are in the middle of the parking lot, “what if I hit that?” and what I had to point out to him is “I’d much rather have you do that at 5, 10 miles an hour in the parking lot than on a highway for real where there’s other cars and people around. I mean, yeah, I’ll be upset if my car gets a dent in it, but that’s a far cry from if you’ve never practiced it for real and go into a real skid on a real curve with other people in the car and stuff like that.”

Steve:              Very good points.

Carlos:             So, what do you think, I guess maybe the last question here and then we’ll move into SQL Family. So, for the non-managers, for managers, I’m not saying it’s easier, but ultimately you have a little more decision-making authority. So how could we, as the humble IT worker that we are, how could we begin to influence some of those things?

Greg:               How can we manage our managers? I would do two things. I would honestly try to read up a little more on the incident command system, ICS, and then on cockpit resource management or crew resource management, which deals with how management both in terms of structure and then in terms of communications can work during a disaster scenario. ICS was developed to handle everything from a car accident up to wildfires in California. And I honestly would love every IT manager and maybe one step above that to take at least a 100 level ICS course, which by the way, you can go to the FEMA website and click through the whole thing in about half an hour. Maybe us IT folks can convince our manager to take that class once and let me do a little more research on this so I learn more on how to be part of a team and then how to be able to communicate up and down and across my team.

Steve:              Oh yeah, I would completely agree with that. If IT management could learn ICS, it would make disaster response much more organized, I think.

Carlos:             Yeah, interesting. So I guess I’ll leave, maybe another word. I remember one time I was at church and an older gentleman was in the pew and he appears to be, so there’s something wrong with him. He slumps over, I was several pews back but he appears to stop breathing, and so obviously there’s a situation going on here. And so, people around were like “oh gosh, what’s going on?” and they didn’t immediately move him. Obviously, they called the paramedics and the paramedics come over, they get him out. And what was interesting, I did not see this, I was hearing this second hand, was that again, we were all concerned, “oh my gosh, do something.” And what the EMT did, he came over, again, he hears this call, guy stopped breathing, this older gentleman, you know, 80 years old, might be having a heart attack. And at that time he was alert, and the EMT walks over to him and just starts talking to him, “hey, what’s your name? I’m” whatever. And they’re like “wait a second, aren’t you supposed to be doing something?” But again, I think through that training, he wanted to establish “where am I in the process?”

Greg:               He’s assessing the situation and everything like that. And that’s a big part of it. One last thing with cave rescue, we tell folks and we drill this into our students, when you come across a scene, the first thing you do is stop and look around and see what happened. Why did that person fall down? You don’t want to, if it’s bad air in the cave, which is actually very rare, but you don’t want to go charging in after somebody and find out that, oh, it’s bad air.

Carlos:             Right, and then you’re down there as well.

Greg:               So yeah, really is often just sometimes the simple things of, “hey, let me just ask a few questions” and patients, 9 times out of 10, they’ll tell you everything you need to know. You just have to be listening. With databases, you can’t quite ask them questions in terms of English, but if you’re willing to take the time and look at the logs and spend an extra couple minutes looking at stuff, it’ll answer most of your questions.

Steve:              Yeah, it’s interesting because so often when there’s a problem whether it’s in EMS or in a database, people get tunnel-vision and they just think about a specific issue that’s being reported but they don’t ask the questions like that either from the patient or from a database to understand what the real issues are. I think, yeah, I would completely agree there.

Carlos:             Awesome, shall we do SQL Family?

Greg:               Sounds good to me.

Steve:              So Greg, how did you get started with SQL Server?

Greg:               You know, I joke because when I was in college, I took a class on databases and it discussed the SQL language and how this is starting to be more popular and maybe someday it’ll be a force in the industry. And probably within about 10 years I had a client that did Information Management System for Laboratories. And they started moving from using Informix, for anybody who remembers that, now I’m dating myself, over to SQL Server. And what they needed was their team could install their software, but they wanted someone to go in a day or two beforehand and install the SQL Server setup and everything like that. So, I kind of became their go-to person for that and after that, moved into an internet start-up in 1998 where I became the DBA and stuff like that and just really decided I really enjoy it. I love the fact that there’s so much data at your fingertips and if you just know how to ask the questions, you can get some amazing stuff out of databases.

Steve:              All right, very cool.

Carlos:             If there’s one thing you could change about SQL Server, what would it be?

Greg:               You know it’s funny, I just answered this question on Quora.com. One of my beefs, and it’s a small one, but I’m sure every DBA’s gotten bit by it once, is the fact that when you do a RESTORE DATABASE, it defaults to WITH RECOVERY. And if you forget to type WITH NORECOVERY and you’ve just restored that database that’s taken 5 hours to restore and you try to restore the logs and discover, ooh, I can’t, you’re going to say “ah, I wish I’d typed NORECOVERY.” So I really wish NORECOVERY was the default because worst case is, if you’ve got nothing else to do, you just then say WITH RECOVERY and you’re done.
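To make the restore ordering Greg describes concrete, here is a minimal Python sketch that only builds the T-SQL text for a restore sequence. The function name, database name, and file paths are hypothetical, chosen for illustration; the point is that every step keeps the database in the restoring state with NORECOVERY, and recovery happens once, at the very end.

```python
def build_restore_script(database, full_backup, log_backups):
    """Return T-SQL statements restoring a full backup and then log backups.

    Every restore step uses NORECOVERY so later backups can still be applied.
    The database is only brought online by the final WITH RECOVERY statement.
    """
    statements = [
        f"RESTORE DATABASE [{database}] FROM DISK = N'{full_backup}' WITH NORECOVERY;"
    ]
    for log_backup in log_backups:
        statements.append(
            f"RESTORE LOG [{database}] FROM DISK = N'{log_backup}' WITH NORECOVERY;"
        )
    # Explicit last step: nothing more to apply, so recover the database.
    statements.append(f"RESTORE DATABASE [{database}] WITH RECOVERY;")
    return statements


if __name__ == "__main__":
    # Hypothetical database name and backup file paths.
    script = build_restore_script(
        "Sales",
        r"D:\backups\sales_full.bak",
        [r"D:\backups\sales_log1.trn", r"D:\backups\sales_log2.trn"],
    )
    for statement in script:
        print(statement)
```

Generating the script this way means the WITH RECOVERY step is a deliberate final action rather than a default you have to remember to override.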

Steve:              What is the best piece of career advice that you’ve received?

Greg:               I don’t know what’s the best piece I received. The best I can give is do what you enjoy, not what you love, because if you do what you love for a job, it becomes a job and you stop loving it. But if you do what you enjoy, you’ll probably still enjoy it while working.

Carlos:             There you go. Our last question for you today, Greg, if you could have one superhero power, what would it be and why do you want it?

Greg:               I go back and forth between teleportation and flying. I think being able to fly like a bird and swoop and soar and all that would be pretty darn cool. And I think cool is a good enough reason for wanting a super power.

Carlos:             Well, there you go. Yes, I was going to say you must still have hair that the wind can flow through.

Greg:               I do, I just got it trimmed last week, so it’d be pretty short, but I’ll grow it out if I ever get the chance to fly.

Carlos:             Awesome, Greg, thanks so much for being on the program. We do appreciate it.

Greg:               I enjoyed it. Thanks for having me and if you ever want to talk about plane crashes in the future, we can do another one on that.

Steve:              All right, sounds good, thanks Greg.

Carlos:             So, I went ahead and took Greg’s challenge here, and went to the FEMA website, and I found this ICS course that he talked about. While I admit I didn’t get through all of it, I did begin the course. It won’t take too long, I just need to spend a little bit more time with it. But I did find it interesting how this was created; kind of looking back, the origin of FEMA and some of these things is the California wildfires in the ’70s. They just got bigger and bigger and with more homes and people dying as a result of these fires, the government decided to put together kind of a plan of action, or at least some ideas or a framework on how first responders and emergency personnel could get together and coordinate ideas. I do think it harkens back a lot to some of the thoughts that Greg had in today’s episode and so of course we’ll have the link to that course up on the website if you’re interested in checking it out. I think we in technology get a little bit spoiled with some of the interactivity that we have. Now there are some videos and whatnot that are in the course, but for the most part it’s going to be PowerPoint-type slides that you’re going to go through and do a bit of reading on. Again, very interesting, I think one of these ideas from a culture perspective, so this is not necessarily technology-driven, but how can we change the culture around disasters? I was actually at an event last night talking with folks, and these are general IT people, and security is always a big issue. But there’s always that question in the back of my mind, “if I make this change, am I going to break something? Is it going to cause an issue, and then what are the repercussions there?” So, I think part of this idea is that we can move forward if we know how we’re going to interact when emergencies do come. And so, thanks to Greg for that conversation.

As always, compañeros, we’re interested in your thoughts on what we should be talking about on the podcast, or if you’d like to join us, of course we’re very interested in that. Our music for SQL Server in the News is by Mansardian used under Creative Commons. Compañeros, you can always reach out to us on social media. I am interested in connecting with you on LinkedIn. I am @carloslchacon. And compañeros, I’ll see you on the SQL Trail.

Episode 123: Top 5 things to know when getting admin access

Listener Eduardo Cervantes wanted to get our take on what developers should do when they get admin access to a database.  We take on this challenge and I give 5 points you might consider if you are a developer with admin access to the SQL Server.  As Uncle Ben says in Spider-Man, “With great power comes great responsibility.”  We hope you use yours wisely.

Episode Quotes

“The question you should be asking yourself is, why was this not already enabled? What is the downside to implementing this course of action?”

“Who owns the code? We all do. The same could apply to the database. Documentation then comes into play.”

“Rolling back is very hard when you don’t know the original state.”

“You may think you understand the consequence, but then if there’s unintended consequences, give yourself a way to get back.”

“Patching, security, backups, boring. Perhaps, but they need to be taken care of and they do become important.”

“Just because you have admin access doesn’t mean that everybody else should have admin access.”

Listen to Learn

02:09        Compañero Shout-Outs
03:39        Tips & Tricks
06:42        SQL Server in the News
09:48        Intro to the topic
10:12        We are going to assume there is no DBA present.
10:32        This could apply to both production and development
10:50        Do you know the consequence of your action? You have some control of the behavior of the system—just make sure you understand the consequences.
12:28        Who is the owner of the system? Does that person know what you are doing? Shared ownership-new concepts in programming.
14:19        Rollback is hard when you don’t know the original state. Always give yourself a way to get back. Containers. 😊
16:26        Don’t forget the small stuff. (patches, security, backups, etc)  Just because YOU have admin access doesn’t mean everyone should.
18:28        Install Database Health Monitor
19:33        Close-out

Transcription: Top 5 things to know when getting admin access

*Untranscribed Introductory Portion*

Carlos:             Compañeros! Welcome to another edition of the SQL Data Partners Podcast. I am Carlos L Chacon, your host, and this is Episode 123. We would like to excuse Steve Stedman. He’s not with us this episode, so that’s unfortunate. But the good news is, he’ll be back next episode. We’re looking forward to having him. We’re doing something a little different this episode. Of course, if you’re a long-time podcast listener, you won’t notice, however, we are trying to incorporate some video into this episode. So, if you’re joining us via YouTube, welcome. And of course, for you long-time compañeros, welcome back to the program. It’s good to be back with you.

Today’s topic is five things developers should know when they get admin access to a SQL server. For a lot of you developers out there, you don’t have DBAs, we know that there are more and more of you listening to the program. So, actually in 2018, we are trying to gear more of our content to you so we can help bridge that gap and help you be better prepared as you’re trying to tackle some of these administrative tasks while you’re developing. This topic was suggested by a listener, Eduardo Cervantes, so we thank Eduardo for suggesting this topic and sending it our way. So, before we get into that, we do have a couple of compañero shout-outs. We want to give a shout-out to Aaron Hayes. Aaron Hayes is ready for the SQL Trail event. Hooray! Aaron joined us last year, in 2017, and is looking forward to another great event. I know we are still trying to work some of these details out. We haven’t announced the 2018 event just yet, but hopefully that will be coming fairly soon. We’ve been working with some sponsors and looking to do some things with some labs, and so that will be exciting once that finally gets announced. Vivek Patel reaching out on LinkedIn. Hey, Vivek! Nathan Hills chimed in on Episode 114 on Getting Started with Consulting. He thought that was kind of interesting, so thanks for that feedback, Nathan. And Chriss Voss, shouting out, sharing some enthusiasm for having developers on the show. I know we made fun of Bert a little bit in Episode 120 and Chriss said “hey, glad to have developers on the program, ‘cause there’s hope for the rest of us!” So Chriss, thanks for that, and we are trying to do, like I mentioned, a bit more content for you developers. So welcome, and it’s great to have you as compañeros.
And then Davy, all the way from Langholm, Scotland, chimed in, “really enjoyed the podcast, really well structured, surprised how much I know and horrified there is so much that I don’t know.” So, Davy, yes, join the rest of us in not knowing a whole lot and we look forward to having you around for future episodes.

So, we did promise you compañeros that in 2018 we would include a Tips and Tricks segment. This obviously was user-suggested, and we’ve been soliciting some ideas, and I have to admit, I’m a little concerned. Those ideas aren’t coming in quite as rapidly as we would like. So, we’re going to go ahead and start sharing some of the ones that we put together. And I think that maybe you’ll be surprised at potentially how simple they are, and this idea of, again, just sharing the way that we work. And so, for this Tip and Trick, and again, I’m going to be using the power of video here. So, let’s say that we wanted to pull some data from a database, or we wanted to copy some data from the internet, and it’s not formed well. In this example that I’m using, I’m actually just doing SELECT * from sys.databases and I have this, and let’s say I want to put it in a report or something, I want to format this for whatever reason. In this case, I’ve actually put the output to text so that I can get that format, so that it lines up correctly, cause if I do it from grid and copy there, you know, the headers get all messed up from the body. So, I run that SELECT query, I copy the results, and then I want to do something with that, so I’m going to paste that. In this case, I’m pasting it into SQL Server Management Studio, but this will actually also work in Word and the other Office products. And so, what I see here is I have a lot of white space in between some of the columns. Or let’s just say there’s a column that I didn’t want to include in my report. So, for example, I know that by policy, is_auto_shrink_on is not something that I’m ever going to have to take care of, but I don’t want to have to go through each line and delete that data from the line. So, one of the things I can do is I click where I want to start, and I’m going to hold down the ALT key.
If I hold down the ALT key, then I can click and drag, and what that will allow me to do is to get blocks of data, or blocks of columns, if you will, and I can select that and remove it and all of the rows will line up nicely. And so this is particularly useful, again, more for white space, where there’s white space I just want to take out, but it’s in multiple rows. So, I would click there, hold down the ALT key and I can do that and again the size doesn’t really matter. I find that to be really helpful when I want to do some formatting and so I hope that you’ll find that useful if you’re not already using it. And so that is the podcast Tip and Trick for this week.
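For anyone who wants the same result in a script rather than with the mouse, the block selection above amounts to deleting the same range of character positions from every line. A minimal Python sketch, assuming fixed-width text output; the function name, sample rows, and column widths are made up for illustration:

```python
def remove_column_block(text, start, end):
    """Delete character positions [start, end) from every line of text.

    This mirrors an ALT+drag block selection followed by Delete: the same
    rectangular slice is removed from each row, so the columns stay aligned.
    """
    return "\n".join(line[:start] + line[end:] for line in text.splitlines())


if __name__ == "__main__":
    # Hypothetical fixed-width output with three 8-character-wide columns.
    rows = [
        ("name", "shrink", "state"),
        ("master", "0", "ONLINE"),
        ("tempdb", "0", "ONLINE"),
    ]
    sample = "\n".join(a.ljust(8) + b.ljust(8) + c for a, b, c in rows)
    # Drop the middle column, which occupies character positions 8 to 15.
    print(remove_column_block(sample, 8, 16))
```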

And now, time for a little SQL Server in the news. So perhaps not news. It was announced, I know at PASS, I think they introduced it at PASS Summit last year in October, and this is the SQL Operations Studio. What is news is that they have started, very similar to the SQL Server Management Studio, to introduce monthly updates to this program. For those of you who are not familiar with the SQL Operations Studio, the SQL Operations Studio is the new tool and it is ultimately a visual interface to be able to connect to SQL Server, but one that runs on Linux as well as Windows. For those of us who have been using SQL Server Management Studio for some time, you’re going to find some of the features lacking, so don’t get too concerned here. I think there are some very interesting developments that are happening, that are coming, and talking with the product team, again, when I was out in Seattle, there are some things coming, but it’s just like everything, it will be slow coming. So, kicking the tires a little bit on the SQL Operations Studio, again, very similar to SQL Server Management Studio, I can connect to a server, I can see some of my databases, I can write queries, select the database that I want to run the query against. There we go, I can change the database and run it against that database. So again, all very simple, if you will. If you’ve been using Visual Studio to do some SQL Server queries, that’s probably a little bit more familiar in the sense of, you’re specific to writing those queries. A lot of the administrative-type tasks, yeah, not baked in, and I’m not sure that they’ll come. SQL Server Management Studio is probably still going to be the place for you to do that, but there are some different things that they’re trying to do here with the Operations Studio.
One of which I know, and again, talking with the product team, that they’re going to allow us to do is to create these reports, if you will, and then tag them onto the dashboard. So here they have a couple, backup status and search databases, and you can actually go in here and run the query and that’s what this is, and it will show you the data behind the query. What at least this knuckle-dragging Neanderthal hasn’t been able to figure out just yet is how to take a query that I’ve written and then put it into the dashboard. I believe that’s going to be coming. Maybe it’s already there and again, I just don’t know how to do that. I found it difficult. I couldn’t immediately find the documentation to do that. But I know that it is coming. So again, it’ll be exciting to see what happens with the Operations Studio. Not super feature-rich, but particularly if you’re in an environment where you’ve started using Linux but want that visual way to be able to connect to the database and start doing something with it, then again, you have that ability with SQL Operations Studio.

So, with that, we’ll go ahead and get into the episode. The URL for the show notes for today’s episode will be sqldatapartners.com/access. Yes, like the database, because we’re talking about developers getting access to the SQL Server environment, or you can go to sqldatapartners.com/123. For this conversation, the idea is what developers should know as they get admin access. So, we’re assuming that there is no DBA present, and there’s kind of a shared responsibility model. There’s not maybe a group responsible, and so you or the other developers are taking on that responsibility. And this, obviously, would apply to both production and development environments. I’m not necessarily going to get into specifics as to whether one is different than the other. In my mind, these initial steps are going to apply to both scenarios, and then we’ll go from there. So, the top 5 steps are: one, do you know the consequence of your action? We like to make changes, you might run into a problem and you read on the internet, oh, you should do X thing. You should change this parameter or you should use this function or you should enable this trace flag. While the suggestion is that that will help you solve your problem, the question you should be asking yourself is “Why was this not already enabled? What is the downside to implementing this course of action?” You know, there’s a reason they don’t turn certain things on by default. Now, more and more, some of that is just because they don’t want to break the old stuff, so if you’re developing new things, there are lots of, you know, those best practices evolve over time. I get that, but at the same time, when you want to start implementing something, you need to be able to understand what the bad stuff is, what the negative is. So, do you understand what that consequence is and what the trade-off is? What might you be subjecting yourself to, now that you have taken this action, that you wouldn’t have otherwise?
And, you know, could you potentially solve that a different way, based on that knowledge? So that’s the first thing to keep in mind because when developers get that access, we tend to go “oh, it’s exciting”, it’s like “hey, I can finally do what I need to do.” But we want to take a moment, pause, think about what we’re doing.

The next thing is to identify the owner of a system. I’ve been the DBA for many organizations. They’re going to come to me as the owner. So, a lot of DBAs think of themselves as the gatekeeper, but you won’t have that without a DBA and if you’re developers taking care of this, who then is the owner? So, from a developing concept, there’s pair programming and there’s a shared ownership idea. These concepts have been around in programming and so it’s almost similar to say well, who owns the code? Well, we all do, so the same could apply to the database. Now having said that, documentation then comes into play. Source Control does a great job for the stored procedures, the views, even the table structure. All of that you can get into Source Control and then you can see, okay, well, here’s what’s changed and whatnot, who changed it, things like that. What’s very difficult, or what Source Control doesn’t give you, are system setting type changes. So, who changed the trace flag, or who allowed this action in the database. So, from that perspective, you’re going to have to find a way to document what those changes are, who’s making them, and then, are we all okay with making this change? You know, with code it’s easy to “okay, let me work on it, let me commit it and then people can take a peek at it and give feedback.” Again, with those settings it’s a little bit harder and so then coming up with a “here’s how I’m going to manage this” is an important thought process to go through. Cause rolling back is very, very hard when you don’t know the original state. How do we get back to the way we were?

Which brings me then to step number three: always give yourself a way to get back. Obviously, backups are a good way, if you’re changing parameters or system settings, writing that down, what was it, what did I change it to, those kinds of things? That’s no-brainer-type stuff. But we know that database changes, sometimes, they can be problematic, particularly as your database starts getting large. Like oh, do I really want to take a backup, can I just save off a table? Yeah, maybe. Maybe you can. Me personally, I think this is one of those areas where containers are going to start playing a larger role. And obviously if you’re using containers in your development environment, the implementation of containers into the database is going to be a no-brainer. It’s going to give you that ability to be able to spin it up, okay, let me try to make this change implement, right, I’m implementing something here, you could think of that as a code push as well. Is it behaving the way that I think it’s behaving? Okay, yes it is, now let me apply the same change to the database that I want to change. It’s very easy to make that leap because I have a way to test that very, very easily. So I’m interested to hear how containers play a larger role in the database environment, and I think that specific scenario, so in the dev scenario, and being able to get a container of your production environment very, very quickly, without having to do all that restore, allocate all that space. It makes a lot of sense and I think that’s a great way to go, if that’s something that you can pull off. Obviously, the folks over at WinDocs will be very happy to help you out. We are looking to partner with them and work with them and help more organizations take advantage of those containers. And so that all goes back to giving yourself a way to get back, just in case. 
Cause you never know, you may think that you understand the consequence, but if there’s unintended consequences, give yourself a way to get back.
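One lightweight way to write down "what was it, what did I change it to" is a small change log that captures the original value at the moment of the change. The sketch below is an illustration only, not a tool mentioned in the episode; the class and field names are hypothetical, and the setting names are used purely as sample data.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class SettingChange:
    """One recorded change to a server or database setting."""
    setting: str
    old_value: str
    new_value: str
    changed_by: str
    reason: str
    changed_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


class ChangeLog:
    """Keeps setting changes in order so the original state is never lost."""

    def __init__(self):
        self._changes = []

    def record(self, setting, old_value, new_value, changed_by, reason):
        change = SettingChange(setting, old_value, new_value, changed_by, reason)
        self._changes.append(change)
        return change

    def rollback_plan(self):
        """Return (setting, value_to_restore) pairs, newest change first."""
        return [(c.setting, c.old_value) for c in reversed(self._changes)]


if __name__ == "__main__":
    log = ChangeLog()
    # Hypothetical changes made by a developer with admin access.
    log.record("max degree of parallelism", "0", "4", "dev1", "testing a fix")
    log.record("cost threshold for parallelism", "5", "50", "dev1", "testing a fix")
    for setting, value in log.rollback_plan():
        print(f"restore {setting} to {value}")
```

Undoing changes newest first means that if the same setting was changed twice, it ends up back at its true original value.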

The fourth thing is don’t forget the small stuff. So, patching, security, backups, boring. Perhaps, but they need to be taken care of and they do become important. Just recently, just this last week, we’re getting word about the CPU bug, where under certain circumstances, someone gets access and they can actually get access to the memory layer for the CPUs and see in clear text all of the things that we’re trying to encrypt because it’s what the CPU sees. And so again, these are very real problems. You’re patching your software, for example, the things that you develop, you know they’re going to have bugs. Well, okay, the database is no exception and you have to think of a way, “okay, how am I going to keep up with this stuff?” Or again, who is going to take ownership of that or how have I looked at solving this problem? There are lots of third party applications out there from a backups perspective. So again, just understanding what it is that you’re getting from that point in time, how long am I going to keep those backups, those kinds of things are things to think about. And then of course, just because you have admin access doesn’t mean that everybody else should have admin access. Security still plays a role there, particularly from the application perspective. You have a web app, you don’t want to be giving that user admin rights because you’re just inviting the ability to be compromised, frankly, because you have bad security in place.

So, the four have been: do you know the consequences of the action, identifying the owner of the system, and if you have a shared ownership model, detailing who’s going to be making those changes and how they’re going to be documented, always give yourself a way to get back and then don’t forget the small stuff. And the last suggestion I have is to install Database Health Monitor. In my mind it’s a no-brainer. It’s a visual interface, it’s going to give you that ability to be able to get better insights into the database without having to look at all of the logs or know all of the queries. There are lots of reports that are baked in that are going to give you some feedback. The biggest benefit to the Database Health Monitor is the wait stats history. Now you’re going to have to install a small database to keep some of that history, but now when you go back and you’re like “wow, at 10am this morning I was having problems”, you’ll actually be able to have the history to be able to go back in there and start digging around. What was going on? Again, lots of different ways to home-grow that yourself, and if that’s your option, go for it, but for an easy out-of-the box way that’s free, Database Health Monitor, I think, is the way to go there.

So, compañeros, what do you think? Do you agree with my list? I’m very interested in hearing from you. I’d be interested to get your thoughts and feedback. And you’ll note that I didn’t give specific, “hey, you should do this”. There are some best practices out there, but I thought that identifying some of these first steps would really be more important. Again, even using the Database Health Monitor as an example, that would help you understand what the best practices are and you can start implementing that as you see fit. So, you can leave comments in the show notes or reach out to us on social media. We are always interested in hearing from you. That is going to do it for today’s episode. We have quite a bit of music. Our music for SQL Server in the news is Mansardian, used under Creative Commons, as is all of the music that we used today. We hope you’ll reach out to us on social media. A lot of people are connecting with us on LinkedIn and we invite you to reach out to us on LinkedIn. I am @CarlosLChacon and compañeros, I’ll see you on the SQL Trail.