We all make them and sometimes it can be painful. Making and sharing mistakes is one of the most powerful tools for learning; however, most of us do all we can to avoid and hide those mistakes. In this episode, Steve and I talk about making mistakes and share some thoughts around how we might improve how to respond when we don’t do something quite right.
Listen to Learn
- Two rather public mistakes and how the companies addressed them.
- Why we are so quick to point the finger.
- What you should do when you make a mistake.
- Thoughts on ways to avoid mistakes.
Carlos: Compañeros, welcome to Episode 93. Today, we’re switching it up a little bit and it’s just going to be Steve and I talking a little bit about making mistakes.
Steve: Yes, we know we’ve all made mistakes along the way but really how you handle it and how you learn from it is the key there, I think.
Carlos: Yeah, that’s right. Steve and I were talking about this. We had some interesting thoughts and wanted to have a dialogue a little bit about it. And again because we’re recording about this after the fact the conversation kind of took some interesting turns, I thought, so of course we’re interested in getting some of your feedback. We are always interested in knowing what you thought about the episodes.
Carlos: Having said that, Warner Estes from Boston talked about Episode 91.
Steve: Yes, and I think he said, “Awesome SQL Data Partners podcast with the DBA Tools team.”
Carlos: Yes. Thanks again to the DBA Tools team for coming on and talking with us about their tools. It was a very good show and lots of people chimed in there.
Steve: Next on the Compañero Conference we have a speaker who has now joined us in the line up to announce.
Carlos: That’s right, so Kevin Feasel is going to be part of the cast if you will at the Compañero Conference. Kevin has been on the podcast twice as an individual guest and twice as a panelist. And will be actually joining us again in Episode 96.
Steve: Yes, and he will have two sessions at the conference, one on Day 1 and one on Day 2, and good information there.
Carlos: That’s right, so he is our resident Microsoft MVP, so we can say that we will have Microsoft MVPs at the conference. So you can get some additional data at companeroconference.com.
Steve: Yes, indeed. We also have a discounted price for the conference for a limited time. What is the current price and how long does that last for?
Carlos: That’s right, so the first 10 tickets to the conference we’re selling for $400. After that the price will get bumped up to $500. There are still a few tickets left so if you’re one of the first few you can join us for only $400, and we’d love to see you there.
Carlos: Yes. So as this session comes up, this next weekend I’m actually going to be in SQL Saturday – Baltimore so if you are in the area and want to come up and say hi, we’d love to chat with you and thanks to Slava and Ravi for putting on that conference. I’m sure that I’ll have a great time.
Steve: And I’d also like to give just a quick shout out to all the SQL compañeros that I ran into at SQL Saturday in Redmond, just over a week ago.
Carlos: Oh, that’s right.
Steve: I was surprised at how many people there I bumped into that had either been on the podcast or who had come up and introduced themselves as podcast listeners, and I had some of those podcast stickers to hand out. I know that Argenis was the first one to jump on board with a SQL Data Partners Podcast sticker on his laptop.
Carlos: Yes, very nice so we apologize for kind of being slow to have those stickers available and actually thanks to Doug Purnell who kind of said, “Hey, Carlos, where is my sticker?” I’m like, “You know, we do need a sticker.” So as you see those or at different events we will have them to hand out and another reason to come and say hi.
Carlos: Ok, and now for a little SQL Server in the News. So actually, as we record this, yesterday was the Data Amp, I don’t know if conference is the right word, event, news briefing might be something, and lots of interesting information coming out of that. One that I thought was very interesting was a new feature: the ability to migrate your databases to Azure SQL Database. Part of the hurdle previously was that you had to do it in two steps. I had to export the schema, and then I had to migrate the data over. And now basically they’ve created an ETL process. I don’t want to say ETL but it’s a process that will connect to your SQL Server whether that be in house, or in Azure, or wherever that might be and will actually suck in the schema and the data and make it available to you as an Azure SQL Database, so very, very cool.
Steve: Wow. That should really help with the migration if people want to move to Azure.
Carlos: No, I agree and I think part of this announcement came with their knocking down many of the other hurdles that were still in place with Azure SQL database. And so there are still a couple of, I don’t want to say minor limitations. If it’s your limitation then that’s going to be still a show stopper but the amount of functionality that’s now available in Azure SQL Database is really amazing. Very compelling and I think a lot more people are going to start looking at it.
Steve: Another SQL in the News item that was introduced last week was the SQL Server 2017 Community Technology Preview 2.0 is now available.
Steve: Yes, so there’s been a lot of speculation along the way as to what Microsoft is going to call that next version, and they’re going to call it SQL Server 2017. And there’s some really big enhancements coming with SQL Server 2017 and CTP 2.0. Well, and then the vNext released previously, the support for SQL Server on Linux, that’s a huge one. Another one in there is the resumable online index rebuild so if something happens and you have to pause your index rebuild part way through there’s a way to pick it up and finish it up later. Another thing as part of CTP 2.0, Python is now built in as part of the SQL Server install. And I’m not sure if I like that or not.
Carlos: You have to, you know, they’re lumping everything else in there, right? XML, JSON, hey why not, you know, another language.
Steve: Yeah, just wondering if Python in SQL Server sort of makes me think of like snakes on a plane.
Carlos: Yes, that’s going to be interesting. Now, you might say, “Oh my gosh, 2017, are they going to come up with another version, should I wait if I’m considering upgrading to 2016, should I wait for 2017?” My answer to that would be, if you’re on Windows, I would not wait. Obviously, if Linux is your thing then you have to wait but I don’t think it would impede my decision to upgrade waiting for that 2017 release.
Steve: Yup, that’s a good point. I think that with Microsoft increasing the rate of SQL Server releases lately that if you’re waiting for the next version you may be waiting indefinitely because there is always going to be the vNext version coming out.
Carlos: Also true, so the show notes for today’s episode are going to be available at sqldatapartners.com/mistakes, with an S.
Steve: Or at sqldatapartners.com/93.
Carlos: So compañeros, today we wanted to talk about a topic that we’ve all experienced but we all seem to have trouble coping with, potentially. And admittedly, some of this idea was jumpstarted by a post from our previous guest, Andrew Pruski, who was on Episode 80 talking about containers. And he posted something on his blog about failure and coping with failure. And there’s a couple of examples that we want to start with and then we can jump into more of the conversation. So the first one is about a database incident that happened on GitLab. GitLab is a place where they have merge requests and you can put code, so it’s a public repository for code, and they had an incident which caused 6 hours of data to be lost. So ultimately it was a hacker type incident and they were attempting to shut off the hacker and in doing so, because of this kind of almost denial of service attack, the replication or their DR strategy got behind. And so while they were trying to attack this thing, their replication process got behind like I mentioned, and in their attempts to ward off this problem, they inadvertently took down the main or the production server. Now luckily before they had started this, they had taken a full backup, but they had missed the time in between that full backup and the time they took it down and all that data was lost. And so that is the first scenario. The second one, you may be a little bit more familiar with.
Steve: So recently, Amazon had an outage in their S3 system. And what was interesting about this one was the number of well-known websites worldwide that were really impacted by this. It was quite amazing to see how many people are using the Amazon S3 service. What had happened, was it was about a 3-3½ hour outage including the time to get everything back up and going, but they were troubleshooting an issue in the S3 billing system and they issued some commands to remove a small number of servers from the system that was part of that billing process. But what had happened was unfortunately they had the wrong parameters on that command and a much larger set of servers were removed than what they’d intended. So trying to take a server offline and you take a whole pile of servers offline.
Carlos: Right, you take your domain controller offline as well.
Steve: Yup, but beyond the interesting point of how many people are using it, they were very open on this afterwards. They went and did their sort of post mortem and they put a bunch of information out there describing what had happened, and just really being open as to what they did and what they ran into. I think it’s one of those things that there are so many businesses out there that try and hide mistakes and pretend they never happened. But I think that in my experience when you get this kind of a summary from the company that had the problem, it gives me more confidence that they’re honest. They’re willing to tell you what happened. They’re willing to admit that, “Yes we made a mistake and yes, we’re going to learn from it.”
Carlos: So both of these incidents happened in February of 2017. And we wanted to kind of backdrop that against, you know, conventional wisdom, maybe things your mother told you when you were growing up. And there are a couple of quotes here. So Thomas Watson who was the CEO of International Business Machines (IBM) from 1914 into the 50s, he said, “Would you like me to give you a formula for success? It is quite simple really. Double your rate of failure.” And Bill Gates, who some of you may know said, “Success is a lousy teacher. It seduces smart people into thinking they can’t lose.”
Steve: Yeah, I’ve seen that happen, both sides of that.
Carlos: And so we have kind of this backdrop that it’s ok to make mistakes. We learn as we make mistakes and then we’re going to take these two specific examples because they were both very visible to the community. The Internet makes them more visible, particularly with the Amazon one. When they were trying to get it back up a lot of people started asking the question, “Are they going to fire the person responsible?”
Steve: Yes, and that’s one of those things that really seems to be a popular thing to do in today’s world is that, “Oh, somebody made a mistake. We can get off the hook if we just fire them. And make them look like they are the cause of the problem and we won’t have that problem anymore.” But really what it comes down to is that might be one of your most brilliant people on your team and they simply goofed on something. They made a mistake and by getting rid of them, you’re getting rid of that experience that they’ll never make that mistake again.
Carlos: Exactly, and I think how many times even in the case of this Amazon situation, is it a process issue versus someone did something that they did unwittingly, right? Again, we’ll go back even to, so we had our SQL Saturday here in Richmond we had the pre-con. I took my son over, luckily it was in the afternoon, people were kind of just finishing up and my son pulled the fire alarm. So, of course, everybody has to evacuate the building, the alarms go off, you know, whatnot. And immediately, I mean he pushed it, the sound had just started blaring and I looked at the lady who was kind of in charge of the building and she said, “Did he pull it?” And I said, “Yes ma’am, he did.” So immediately they knew this is not a real fire, so she took the precautions of letting people know that a mistake had been made. So then we started the process of trying to turn this thing off, and what was interesting is that even though they knew that it wasn’t a real situation it took them an hour to get that fire alarm off because nobody could remember the steps they needed to execute to restart the system or to turn the system off, basically. And I think a lot of that is that sometimes we can see mistakes or we see problems but is it in the sense that we didn’t give our people enough information so that they could make the best decisions or do the things that would have caused that to go without a hiccup.
Steve: Yup, and that’s one of those reasons that the retrospective or the post mortem meetings are so important, is to be able to learn from those so that you can do better next time and to be able to find out, “Why did that take so long?” “Well, it’s because we didn’t know how to do it.” “Ok, we didn’t know how to do it, so let’s get that documented in a procedure manual or something that can be used or referred to when those type of problems arise.”
Carlos: Exactly, or even adding it into the repertoire of things that we have to do on a semi-regular basis. And again, it can be one of those things, particularly talk about disaster recovery or whatever. Everybody likes to talk about it but how many times do we actually practice that stuff and that can be difficult. I’m not saying there is an easy way to fix this, but I think it is something that we need to do better about creating environments that support and flush out mistakes.
Steve: Yes, in a constructive way.
Carlos: That’s right, exactly.
Steve: Yup, and I think that brings us back to Episode 61 that we had with Russ Thomas on the podcast which was around the debrief or the retrospective and making sure that you have a safe environment there, and the right ways to do the debrief so that you can learn from it and improve.
Carlos: Yeah exactly. I mean, so we talked about that nobody likes to showboat. Nobody likes for them to bring all the attention to themselves. But it seems like when a mistake has been made the knives come out and we’re all kind of looking for blood, and it’s interesting. So Andrew actually mentions in his posts. He says, “Show me a senior technical person who has never made a mistake causing an outage and I will show you a fibber.” Right, so I feel like it’s something we all have to go through or we will all experience at one point in time to some degree, so how can we get better at that?
Steve: Absolutely. Yup, and I think the way to get better with that is to learn every time you make a mistake, to learn from those mistakes. Like the formula of success mentioned by Thomas Watson. Learn from it and improve. And the more mistakes you make, the less likely you are to make them again. Now hopefully, you can make those mistakes in a safe environment, a test environment, things like that, rather than on a production system. But when you do have an outage or mistakes are made, or disasters occur, those kind of things, once you recovered from them they’re great learning experiences.
Carlos: Exactly. And I think one of those old adages that I learned very early that has kept with me is kind of, “Always give yourself a way to get back.” So never climb without a rope type thing. And I know I have told this story before on the podcast, but I was doing some data scrubbing for a health organization. They wanted all the guys to be Harry’s and all the girls to be Sally’s, and everybody kind of live on Main Street, that kind of thing, so they could show some of this data without exposing the sensitive portions of it. And somehow I made a mistake and I was working on a development environment and somehow I got that thing pointed to production and ran it. It took me a minute to realize what I had done, and then the next thing I did was I tried to understand the scope of it. Because this is one of those each client has their own database kind of scenarios. And I’m like, “Ok, how many databases did I just whack and how many tables, how wide was it?” And so once I figured that out, luckily it’s just one table and one database, I went and I reported that to the manager. I said, “Look, I’ve just done this thing. We need to fix it.” And of course they verified it and they’re like, “Holy crap! Not good.” And actually we kind of got, it took about 30 seconds, that the heart starts pumping, everybody is getting excited. Like, “Oh no, something bad has happened.” But something in that environment that I’ll never forget that was super, I mean I’ll appreciate this for the rest of my life was that he looked at me and he’s like, “Look, you’re not fired. Let’s make it right.”
Steve: That’s a good place to start.
Carlos: Exactly, that’s right, because it got one of those like tension pieces out of the way because you’re so concerned about the aftermath when you do make that mistake. And I think this was to his credit as a leader is that he was like, “Look, let’s nip that in the bud, but now let’s figure out and address the real issue.”
Steve: Yeah. Earlier you mentioned that sort of out for blood mentality. And I found that if you are in a position where someone is acting that way, where when a mistake is made they are out for blood, one of the ways to help with that when you make the mistake is to own up to it right away and say, “Here is what happened. Here’s what I did. Here’s the problem, and here’s what I think we can do to fix it.” And owning that can make a huge difference when you’re in that position of people trying to go out for blood and hunt you down and figure out the problem, because if you own it, you’re part of the solution even though you were part of the problem of creating it. But the flip side of that I’ve seen and it just works out horribly is when somebody does something and they don’t own up to it or they try and hide or pretend it wasn’t them. I remember one incident where a developer was connected to a live database to debug a specific scenario that could only be reproduced in the live environment. And they hit a breakpoint in their code and it was inside of a transaction. But while they were stepping through the code their phone rang and then when they got off the phone, they went out and got some coffee, and they eventually came back and realized that their code was still stopped inside of a transaction and that everything was being blocked in the database. And what they did was they just hit the key to keep running through the code and clear it out and pretend it didn’t happen. Then the DBAs working on it spent hours trying to figure out what happened, what went wrong at that point in time. And we eventually were able to determine that it was this specific user and after confronting him on it he confessed that, “Yeah, that’s what I did, but I didn’t tell anyone.”
Carlos: Didn’t think it was a big deal.
Steve: And had that person come right away and said this is what I did, it would have saved hours of our work on trying to figure out what the problem was and thinking that we had something horribly gone wrong with the system. It was just a mistake made by someone.
Carlos: Right, and I guess there is that balance of learning to make those mistakes and I guess the central question is what can we do to make those mistakes a little bit easier. Now, when you have something which you can only test in production, those are tough but at the same time, could you use some of the, again one of those like the best way to learn is by learning from the mistakes of others. And bringing that to your management’s attention and saying, “You can even take this Amazon experience and say look, they were doing something in a production environment that didn’t go according to the way they thought it was. We need to be able to replicate this in an environment that won’t take things down.”
Steve: Right, or the other option that I’ve done when it’s too cost prohibitive to replicate that type of environment is to change the procedure. And one of the ways that I’ve done that is if you’re doing something risky and it’s not something that you’re super comfortable with because you haven’t done it 50 times and it’s in a production environment, it’s good to have someone look over your shoulder. Someone who’s there sitting next to you in the cube or connected with GoToMeeting or something like that, reviewing what you’re doing along the way so that problems can be prevented by sometimes a second set of eyes just catching that typo on a command or the fact that you didn’t highlight the entire script and you left out the WHERE clause or those kind of things. And that’s incredibly valuable to be able to do that. That’s far less expensive than what’s involved in duplicating gigantic production environments.
Carlos: Exactly. I think the other, in addition to that and to use some of the sports analogies. I know I’ve mentioned at the end of the year last year I wanted to start doing some refereeing, which I’ve started doing, refereeing soccer games again. One of the things that helps from a sports perspective to get good at things is to repeat them frequently. And we kind of talked about trying to do some of that idea of repetitive processes and of course we look to automation. So, getting those processes down and then making it so that you remove some of the human interactions. Again, easier said than done in some cases, but ultimately, we can go a long way to make things much smoother that way.
Steve: Yes indeed, and the other part of that is putting in safety checks. So let’s say you have a script that’s going to run to turn off some servers, like in the S3 example. And you might have it so if you’re turning off 1 or 2 servers that it doesn’t take an extra confirmation. But if you’re turning off more than 3 that it asks you, “Are you sure you want to shut down all the servers,” or the hundred servers, or whatever it may be. And I know people really hate those confirmation message boxes or prompts. But sometimes that can make the difference in, are you sure you really want to do this.
Carlos: It’s saved my bacon before.
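The safety check Steve describes can be sketched roughly as follows. This is only an illustration, not how Amazon or anyone on the show implemented it: the function name, the `confirm` callback, and the threshold are all hypothetical. The conversation leaves the exact cutoff open (1 or 2 servers need no prompt, more than 3 do), so this sketch takes the cautious side and prompts for anything above 2.

```python
from typing import Callable, Iterable, List

# Hypothetical threshold: the conversation says 1 or 2 servers need no extra
# confirmation; treating anything above 2 as "large" is an assumption.
CONFIRM_ABOVE = 2


def select_servers_to_remove(
    requested: Iterable[str],
    confirm: Callable[[str], bool],
    confirm_above: int = CONFIRM_ABOVE,
) -> List[str]:
    """Return the servers approved for removal, or an empty list if declined."""
    servers = list(requested)
    if len(servers) > confirm_above:
        question = f"You are about to remove {len(servers)} servers. Continue?"
        if not confirm(question):
            return []  # operator declined, so nothing is removed
    return servers
```

In an interactive script, `confirm` might wrap `input()`, for example `lambda q: input(q + " [y/N] ").strip().lower() == "y"`. Passing it in as a parameter keeps the check easy to test without a terminal.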
Steve: Yeah, so as far as different events that we’ve been part of, I know one I guess as far as the making mistakes go, and I’ve made plenty of mistakes, I’m not gonna try and hide any of those. But what I love to do is to be able to learn from those mistakes over time so that you don’t do them again. And I think one of them I can think of is a couple of years ago, I was with a new client, first day on-site with them and I was doing some server analysis and I was using Database Health Monitor. What was happening was that their SQL Server started crashing. And I thought, well, is that normal for you guys? And no, they don’t normally crash like that. And then I learned that there were some DMVs that had been released in SQL Server 2008 R2 Service Pack 1 which was version 10.5.2500. And that version number is burned into my mind to this day and I probably will never forget it. So from 10.5.2500 to 10.5.4000 these new DMVs had been introduced that weren’t quite entirely stable and a certain combination of calling into those DMVs caused the SQL Server to crash. And by crash it completely dumped out of memory and had to do crash recovery, and startup, and all that. That was one of those that I was certainly embarrassed that I had caused that but it was something that wasn’t part of my normal test scenario and I hadn’t worked on too much for a couple of years. And I then learned from it, fixed the problem on the client side, obviously, and then went into Database Health Monitor and made it much more robust by checking those version numbers and making certain that if someone tried to do, that was in the quick scan report, if someone tried to do the quick scan report on one of those specific versions of SQL Server that it stopped immediately and just gave you a message that says, “Warning: You need to upgrade your SQL Server before trying to do this.” I mean, who would have thought that certain combinations of those DMVs would crash the server but it did.
Carlos: What I like about that scenario is that in a sense you could say, well it wasn’t my fault.
Steve: Oh yeah, it’s your fault, you’re running the bad version of SQL Server.
Carlos: Yeah, right, or how many times and again I’m not saying it’s perfect by any stretch of the imagination, but how many times could we blame a third party, i.e. Microsoft, “Oh, you stupid. If you’d only made your software better, right, I wouldn’t then be having these problems.” I’m getting to a fine line here potentially, right, blurring the line a little here. But I think it’s one of those things that again, if you are going to be interacting with the software that you need to take the time to understand what it’s doing and then be willing to respond to it. And again, I feel like it’s one of those culture ideas as well in getting to know those systems so that you can provide the best care and feeding possible, rather than just saying, “well, it’s not my problem” and trying to pass the buck.
Steve: Yup, absolutely. And with that scenario I first made sure everything was fixed and good for the client, and then when I came home that evening I was up well past midnight building out the test environment on virtual machines in my home office so that I could make certain that I had a way to reproduce that, and that I could fix it and never have to cause that problem again. And I learned from it and I think that my experience grew in that event. I mean, with what had happened it wasn’t a huge issue with that customer but we got them to a better place and we got Database Health Monitor to a much better place as well because of that so, now, I know better and I do much more testing based off of that.
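A version guard like the one Steve added to the quick scan report might look something like the sketch below. It is not the actual Database Health Monitor code, and the function names are hypothetical. It assumes the version string comes from something like `SERVERPROPERTY('ProductVersion')`, which reports SQL Server 2008 R2 builds in the form 10.50.x (so the "10.5.2500" mentioned above appears as 10.50.2500), and it assumes the upper end of the range, 10.50.4000, is itself safe.

```python
from typing import Tuple

# Affected build range from the conversation. Treating the upper bound as
# exclusive (i.e. 10.50.4000 is considered safe) is an assumption.
AFFECTED_FROM = (10, 50, 2500)
AFFECTED_BEFORE = (10, 50, 4000)


def parse_version(version_text: str) -> Tuple[int, ...]:
    """Turn a string such as '10.50.2500.0' into (10, 50, 2500)."""
    return tuple(int(part) for part in version_text.split(".")[:3])


def quick_scan_allowed(version_text: str) -> bool:
    """Return False when the server build falls inside the affected range."""
    version = parse_version(version_text)
    return not (AFFECTED_FROM <= version < AFFECTED_BEFORE)


def quick_scan_warning(version_text: str) -> str:
    """Return the warning to show, or an empty string if the scan may run."""
    if quick_scan_allowed(version_text):
        return ""
    return "Warning: You need to upgrade your SQL Server before trying to do this."
```

Comparing parsed tuples rather than raw strings avoids the trap where "10.50.1600" sorts after "10.50.10000" as text.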
Carlos: Right, so I think if we’re going to try to take a couple of takeaways, if you will, from this, my thoughts are, one, be upfront about a mistake that you’ve made. Two, try to implement a solution or make things better, right? By of course, helping get everybody back up on their feet but also, as you mentioned, ok how do I change this process, or how can I influence this process to help it make sure that this isn’t going to happen to me again or to somebody else in my team? And then three, through repetition and through involving a few more people which at times admittedly can be cumbersome and through process or whatnot but ensuring that everybody knows about those processes and understands them and help create that culture so that when mistakes do happen they can be easily resolved and you can move forward.
Steve: Yup, very good points. I think that’s one of the key differences between someone who is a DBA and someone who is a Senior DBA. True there’s lots of training and other things that go along with that but to me when I see the difference between a DBA and a Senior DBA, it really comes down to experience, and how have you learned along the way, how have you improved, and what mistakes have you made in the past so that you don’t have to make them here again. I think that if someone ever asked the question of how do I go from being a Junior or just a regular DBA to be a Senior DBA, it’s learn, build test environments so that you can make mistakes and learn from those mistakes without impacting production systems, or learn from other people’s mistakes.
Carlos: Right, and move forward. So, awesome. Andrew, again, thanks for that post and kind of spurring this idea. Thanks to you, Steve, of course, for your input. And compañeros for tuning in. Of course, if you have some thoughts around failure, how to deal with failure, how we could get better organizationally-wise, or maybe even as a community-wise, with failure, you can let us know on our show notes at sqldatapartners.com/failure.
Steve: And if you haven’t listened to the Episode 61, which was on the retrospective, that would be a great episode to check out relating to the same topic.
Carlos: That’s right, and that one is at sqldatapartners.com/debrief. Our music for SQL Server in the News is by Mansardian, used under Creative Commons. If you want to reach out to us on social media, you can hit me up at Twitter. I’m @carloslchacon.
Steve: And I’m on Twitter @sqlemt and we’ll see you on the SQL Trail.