Database corruption is something that will eventually come to all data professionals. While I have been fortunate to only come across is a handful of times, I was glad that I was able to get back the corrupted data and get everyone back on track.
In May of 2015, Steve Stedman set the data community ablaze with his database corruption challenge. Some of the biggest names in the community jumped in with solutions to his challenge and I knew I wanted to have Steve on the show.
If you are interested in hearing more about his experience and his suggestions for dealing with database corruption, this episode is for you!
Transcription: Database Corruption
Carlos L. Chacon: Welcome to the SQL Data Partners podcast. My name is Carlos L. Chacon, your host. This is Episode 12. I’m super excited to have Steve Stedman on the show with me today. He’s the Database Administrator from the Pacific Northwest. We’re talking about corruption today.
Certainly, that’s a challenge that all data professionals need to consider. Steve had a corruption challenge earlier this year and it was very heavily attended. Lots of big names threw their hat into the ring there. If they were paying attention, I know it’s something that I can learn from and I hope that you will today, too.
We are always looking for feedback. We hope you’ll give us some feedback either on iTunes or on Stitcher. You can hit us up on Twitter. I’m @carloslcachon or you can shoot me an email at [email protected]. We hope you’ll reach out to us. Let us know how we’re doing. If there’s topics you want to talk about, let us know that too.
If you’re headed to Seattle for the summit, the PASS Summit this October 2015, we are going to have ribbons. For my compañeros who are listening to the podcasts, it will say, “SQL Compañeros.” As you see me, please make sure to get your ribbon at summit. It would be fun to meet up there.
Also, hopefully you’ve noticed we have a new mic. We’ll be using this new mic from here on forward. Hopefully, the music and the words, there will be quite such a distinction. There will be a better listening experience. Again, thanks for tuning in compañeros. As always, welcome to the show.
Recorded Voice: SQL Data Partners.[music]
Carlos: With that Steve, welcome to the program.
Steve Stedman: Thanks, Carlos. Glad to be here.
Carlos: Thanks. I know you’ve talked a little bit about this in some previous forums, but go with us briefly into how you decided to create the Database Corruption Challenge?
Steve: That was interesting. In the beginning, I didn’t really think of it as a challenge. It started out as just a blog post that I was doing on my blog at stevestedman.com, explaining the specific type of corruption. After spending a few minutes on the blog post, I decided that rather than just explaining the concept, or showing how to do it, I thought it might be a lot of fun to throw it out as a challenge to see if anyone would like to participate.I posted that first challenge on a Saturday morning, and announced it on Twitter, and LinkedIn. Nobody responded. I didn’t get any feedback.
Steve: After about six hours, the first solution was submitted by Brett Ozar, who then Tweeted about it, and it spread from there. I think his help really got it moving at that point.
Carlos: OK, very good. He’s known to be a mover and a shaker, that guy. [laughs]
Steve: Who knows, if he hadn’t been the first to submit a winning solution, it may not have gone anywhere.
Carlos: Interesting. You ultimately come up with 10 corruption challenges, and so I’m curious now. I’ll admit that luckily, and I’m knocking on wood here [knocks], I’ve only experienced corruption in a handful of cases. You’ve mentioned that you started out with a blog post. Did you think you were going to end up with 10? How did you know that you would be able to come up with 10 ideas, or different forms of database corruption?
Steve: Honestly, in the beginning I didn’t know I would be able to come up with 10. I just started doing it, and then I think it might have been week two, or week three where I figured out that I’d limit it to 10, and just go until that point.
Carlos: OK, very good. Now that you’ve had the challenge, you’ve had people submit, and of course you’ve graded them and announced a winner, and whatnot. Do you feel like your experience with corruption has increased? What’s been your learning experience through this event?
Steve: I think the learning experience has been great. I think going through it with some of the big names who were actually participating was kind of like being judged each week myself.
Steve: When I would create the challenge, as to “Did I do it well? Did I do it right? Was it worthy of throwing out to the community?” I absolutely learned something every single week. I was forced to go for more and more difficult corruption challenges. I made mistakes going through it.If I made a mistake the people participating certainly caught it, and let me know. One of the things I had to do, because everyone out there has different versions of SQL server, when I put out a corruption challenge I myself had to solve it first to make sure that it was solvable with no data loss, but I also had to make sure that on SQL server 2005, 2008, or to 2012, and 2014 that it was doable.
Steve: Some of them that were using features that didn’t necessarily work on 2005, for instance.
Carlos: I see. Can you give us an example of a database corruption challenge, that maybe you weren’t as familiar with before the challenge started? Do you have one of those?
Steve: Yeah, I am thinking probably week four week five. I forget which week it was. It was a database corruption challenge that involved no backup files. Let me think, the data had been corrupt and the only way that we were able to get the data back or to get the things back once the corruption was fixed was using Change Data Capture, which had been enabled on the database.I had not done a lot with Change Data Capture before that. There were a few things that I learned in doing backups and restore with CDC enabled. I am a lot stronger right now. However, I sure didn’t know what I was doing with them at that point.
Carlos: [laughter] OK. So let’s get into a little bit about that, about how you would go about attacking a database corruption. Again, my experience is that once you have identified that you actually have corruption, you need to then go in and identify how widespread the corruption is. Are there other components like that in the challenge as well?
Steve: Yes, absolutely. The big part of figuring out how to eliminate the corruption without losing any data is first addressing what data is actually corrupt. If that data is corrupt, how can you save some of it off, pull it from an existing backup or bring it in from an index or even extract it out of a DBCC page output, if you have to.How can you pull the data in a way that once you eliminate the corruption and when you run CheckDB or CheckTable for instance, and it just purges pages as they come across corrupt pages, how are going to refill those again? I think first understanding exactly what you’re going to lose, and then figuring out how you are going to get it back is one of the key things in being successful, at fixing that database corruption.
But the other thing that you should keep in mind is, once you are at that known point where you have corruption and you’re going to attempt to go fix it, you need to have a way to get back to the exact same point, in case what you do makes it worse because sometimes you try something, and you realize, “Wow, I didn’t realize half the table was going to go away.”
Steve: “I thought it was just going to be a dozen rows.” If you’ve got a way to roll that back or to get back, even just backing up that corrupt database that can be restored and used again, if you’ve got a way to get back to the stage, it really reduces the risk associated with working on database corruption.
Carlos: Sure. That’s an interesting point. I think potentially, you may have referenced, I think it’s the DBCC command. Is it restore with corruption?
Steve: There is DBCC, CheckDB or CheckTable and with those, there’s different options. For instance, one of them is Repair Allow Data Loss.
Carlos: Repair Allow Data Loss. That’s the thing.
Steve: What that does, when it goes through, it finds pages that are corrupt. Again, a page is an 8K chunk of memory in SQL server. That 8K chunk may contain several to many rows of data out of a given table, and when it finds even a single corrupt row in that page, what it does generally is, it throws out the whole page, then rebuilds the indexes around it as if those pages had never existed.So you end up, even if you have a single corrupt row, you might end up losing 15 to 50, who knows how many rows to clean up that single corrupt row.
Carlos: That would be an example of you missing the data, then maybe you wanting to get some of that back?
Steve: Yes, absolutely.
Carlos: So taking a backup of the database at that point before you start to make changes would be part of your repertoire of fighting database corruption?
Steve: Absolutely. I’ve been making a backup for at least like, stopping the SQL Server, copying the MDF and LDF files off, so you can use them again if you had to as well. [inaudible 10:08] other options, we are just doing a regular backup.
Carlos: Just help me with that. I guess my understanding is wrong there. I was under the impression that if you stop the services when it went through its initial check, to bring the database back online, it would see the corruption and you might not be to get the database back online. I guess this is not your experience.
Steve: It depends on, on the type of corruption. Now, there certain corruptions, like there’s a certain page, page nine, in file one. That’s the boot page for the database. If that happens to be corrupt, you certainly don’t want to take your database offline or anything. There’s other things you want to look at first, before dealing with that one, but generally if it’s in an index or in a table somewhere, you can usually stop the database and restart it without any difficulty.
Carlos: Without problems. OK, very good. That’s good to know. How would you go with backup? We’ve now taken a backup of our database. We are ready to start making some changes to it. How those steps that we take will determine, I guess on the type of corruption that we identify in the database?
Steve: Yes, absolutely. The one thing is you’ve got a backup or a way to get back to the known bad state. You want to assess what good backups you have. There might be a case where the corruption occurred three or four weeks ago and somehow nothing caught it and all of your backups you have for the last three weeks are bad.
Carlos: They’re no good.
Carlos: There’s other cases where you might run into the corruption and you have a backup from yesterday. This is just fine. However, there could be transactions where a lot of data that’s changed in that times. You want to assess what you’ve got as your available tools to bring that data back in, a good backup or a bad backup necessarily.Then once you have a point or when you get back to the point in time that you’re right now, a way to start over if something goes wrong basically, you know what’s missing and you know what other backups or other places you have to pull data in from, then you want to figure out exactly what’s going to be lost when the corruption is fixed.
One way you can do that is you can do the DBCC CheckTable for instance, inside of a transaction. Let it fix the corruption. Look at what’s left in the table. If it’s not what you expected, roll back that transaction.
Carlos: I think that’s a very interesting point but not one that I would have used. That’s interesting that you put that out there. I’m curious, in your challenge you’ve come up with these little nuggets or these little potentially ways of fighting corruption based on the different options.Now I know that there are the parameters of the challenge in there. Did you take the time to write up, “Here’s what I would’ve done to fix it”? I guess the solutions are there as well.
Steve: Yes. For each of the 10 challenges, the way I worked each week was I posted it and I usually gave people from three to five days as an opportunity to solve it themselves, submit their solution and then once all the solutions were submitted, I posted the results of who the winners were each week. In some cases, I would post my solution, “Here’s how I solved it.” In some cases I would even post, “Here’s how I corrupted the database.”
Steve: There’s usually a lot of warnings on that one that says, “Don’t try this at home.”
Carlos: You’re right. [laughs]
Steve: Then I would find when there was either the winning solution from one of the participants each week, I would post that to my blog as well or if there was someone who was not the first one to solve it, but they had a really interesting solution that I hadn’t seen before, I would post those as well. Of the 10 different competitions, for every one of them is at least two posted solutions and sometimes three or four.
Carlos: OK. Very good. So folks that maybe wanted to sharpen their corruption skills, they could still use this to go out, get the corruption challenge and try to solve that by themselves and then take a peek at the answers, if they get stumped?
Steve: Yes, absolutely. If you go to stevestedman.com/corruption challenge, that will take you to a summary page that shows you here is the 10 weeks of the competition and it will include links to the initial challenge as it was posted, and the solutions from each of the winners each week.Then from there you can jump to other solutions. But as long as you don’t really cheat and jump off to that solution before you’ve tried to solve it yourself, you’ll be in good shape.
Carlos: We should say that, those links to Steve’s site and the database corruption challenge will be in the show notes as well. So, if you come across those on the podcast page, there’ll be links there.You mentioned being careful with corrupting the database in your own environments. Any thoughts or parameters about doing that or would you recommend folks just grabbing what you’ve created for them as the best first step?
Steve: As a first step, I would certainly recommend grabbing what I’ve created out there and using that. However, once you’ve mastered all of those you want to go on and try some more, you can give it a try to corrupt your own database. However, it’s much more challenging than you might think.For each of these challenges that I did, some of them took me 20 hours of preparation to put together the challenge, test it, confirm and post all the details on it. Others didn’t take quite as long. They only took six or seven hours.
There’s a lot of work that’s gone into there and you don’t think you can just start messing around in memory in SQL Server and think you’re going to end up with something that’s going to be usable, a corrupt databases that could be recovered in any way.
Carlos: Recoverable. Having said that, how many instances that you start to muck into the SQL Server and not be able to recover?
Steve: Let me think back. There were a couple that I tried where I realized after I got into it that it wasn’t going to work in a specific environment. For instance, the Change Data Capture one that I enabled, that’s a feature that tracks changes on certain tables in SQL Server.That was one that I originally built it on, I think SQL Server 2008R2 and then I realized.”Oh, that features is just not there in the older versions of SQL Server. So, I’m not going to able to make it work.” But going through it, there were certainly ones where I corrupted the database and even knowing how to fix it, it took me two or three different tries to get to a solution that actually worked. But before doing it I always make sure I had a way to get back to that known states, so I could try again if I had to.
Carlos: That’s an interesting point,. I guess. I know we’re going to talk a little bit about generalizations. If folks wanted to deep dive they should got out to your site, take a look at those different instances. But I think if you were to generalize that corruption and being able to repair from corruption, I think a couple of things you talked about are important.One, being able to get back to that state and then two, would you actually recommend trying your change in that restore copy, if you will. You talked about testing your data to make sure that’s what you wanted it to happen or you go ahead. Let’s assume that this is a normal “active transactional system.” People are using the system at the moment. Are you going to be making your changes in that prod environment or you’re going to be making that in the copy environment?
Steve: Eventually you’re going to have to make it in the production environment, to replace that corrupt database. But whenever possible, I would want to be working in an alternate environment and that alternate environment may be a second copy of that same database with a different name, running on the same server. Or maybe just to restore that database running on a different SQL Server depending on what environment you have around to be able to work on those things.But I certainly wouldn’t want to start in production on that live database as the first point. However, there may be certain scenarios where you’re working through and transactions are failing or the system is just not usable because of that corruption, and you may want to just block users out and say the system is down until the corruption can be resolved.
It really depends on the scenario. If it’s something that the corruption’s there but it’s not really posing a problem for other people, it might be OK to let it run for half a day or a day while you actually work out the problem.
Carlos: Yeah. Particularly if that corruption is in older data, for example, archive data, things like that.
Steve: But if that corruption is in a bank account type database where it has recent transactions in it, and every time somebody tries to use it, it fails and money is lost somewhere, you’re going to want to block people out until that gets resolved.Once you’ve got your plan on what you’re going to do, what you think you’re going to do to fix it, and if you can run that plan by someone else who knows little bit about corruption, maybe more than you do, whether it’s somebody who competed in the challenge or even me or somebody you work with, if you can run that plan by someone, often times when people dealing with corruption or this type of issues, it’s not always when you’re well rested and bright-eyed and bushy tailed. It’s when you’re late at night or sleep-deprived.
Sometimes, if you just run it by someone else, they can ask questions or check to make sure you’re not going to do something really foolish. It happens. People make mistakes, but anytime you get someone else’s opinion on that, it might help reduce those mistakes.
Carlos: We are talking a little bit in generic terms, but are there things that as you went through the challenge that you may be found were like, “Oh, my thought process was that I would overwrite the table or let’s say overwrite the data from the backup and I know now that that’s the best way to go about that.” Do you have some no-no’s of database corruption restoration that you learned as a result?
Steve: The one that I would say to stay away from at all costs is the undocumented DBCC WRITEPAGE command. If you want to use it to go create a corrupt database, it might be a good thing to use, but if you’re trying to fix corruption or actually try to bring a system back, there’s a lot of danger in using that command because, really, what it’s doing is it allows you to write data over any chunk of memory inside of the database.One of the tendencies of people who learned about it in the corruption challenge was just to use that command to overwrite what might be broken somewhere, but there are some negative things that can happen with that, and I don’t know it’s for certain, but it is rumored that when you use that, somehow it tags something in your database so that Microsoft is no longer able to support your database, and they’ve kind of got an out if you ever had to call support.
I don’t know if that’s true, that’s only a rumor I’ve heard, but I would recommend staying away from DBCC WRITEPAGE unless you’ve used up every other possible option, but I would never use it on a production database. You want to use it on a copy, and then maybe copy the data out of that other database.
Carlos: There you go, very good point. That’s a good advice for database administrators in general, data professionals. Always give yourself a way to get back.Another component that we wanted to talk about with corruption is finding a source of the corruption. Now this is a little bit different in your scenario because you are the one causing the corruption. [laughs] So what are some of your thoughts about how you go about finding the source of corruption, once corruption has happened?
Steve: I think that probably falls into two different camps to look out there. One would be that one off corruption that hits and you fix it, and then you don’t see it again. Oftentimes, you may not be able to track down when it occurred.
Carlos: Just like [inaudible 23:38], maybe there was hard reboot of the system, something quite yet written quite right.
Steve: Yeah. Then there’s the ones that you get corruption for some reason, it gets fixed, and then a few days to a week later, it comes back again. When those kinds of things happen, generally, there’s something going on that really needs your attention to get fixed.For instance, it could be a networking issue between your SQL Server and your storage on the sand. There’s a story I heard one time about a janitor who, when they came in to do work in an office, they unplugged the SQL Server every night to vacuum around it, and then they plugged it back in. Those kind of things can certainly cause corruption.
But being able to figure out when it occurred and…once you figured out when it occurred, then you can start looking at other events related to that. For instance, if you know when it occurred, you can look at the error logs, you can look at the Windows error logs, as well as the SQL Server error logs, and try and figure out what events are happening at that point.
If you have it occurring again and again, is it happening at the same time everyday or every week or is there something going on at that point in time that’s causing it or is it just completely random? Which is going to be a little bit harder to track down.
Carlos: We’ve talked a little bit about what you do about the corruption, just identifying the source. Now that you have gone through this, any changes into how often you are checking for corruption?
Steve: What I’ve learned with this is that the sooner you can detect corruption, the better off you’re going to be trying to resolve it. For instance, if you have corruption today and you find out about it tomorrow, and you’ve got a backup from yesterday that you can use, that you can pull data back in from, you’re a lot better off than if it’s six months out that you discover you had corruption.My recommendation is that you do DBCC CHECKDB or CHECKTABLEs or certain subsets of those as often as you can without having a negative impact on your system. I don’t mean hourly, but if you can run it on key databases at least once a day, that’s great. In the real world, that doesn’t always happen, so I’ve set up scripts and sometimes with databases when you can run a subset as them every day of the week, and by the end of the week, every one of your databases has been checked.
The other thing I like to do with it is to make sure you run DBCC CHECKDB against the system databases, the non-user databases, Model, Master, TempDB, those type of databases, MSTB, because if those get corrupt, it may have a different impact. If your user data gets corrupt, but you want to know about it so you can fix it as quickly as you can.
Carlos: I had a master database corrupt on me a couple of months ago. [laughs] May be just because it’s master, and we didn’t know…so this was a consulting engagement, I wasn’t with them. They called me up and were like, “Hey, we have this error and it looks like our master database is corrupt.”They didn’t know about it till they restarted SQL Server and then the database servers [inaudible 27:11] because master was corrupt. That’s definitely what you want to know about [laughs] before you start treading things off because you may not be able to come back.
Am I naïve or am I incorrect in my thinking that the master database, if there was a corruption issue, I don’t want it’s more important, but a little bit more bothersome than a user database or is that not good thinking?
Steve: It’s certainly more bothersome, and that you might not be able to restart your SQL Server without quite a bit of difficulty.
Carlos: Yeah, that’s a bit of a bothersome.
Steve: I think, additionally, most people aren’t as experienced at dealing with backups or restores or rebuilding of the master database if they have to. I think people are a lot more familiar with, “Maybe I’ve got corruption, I’ll just pull in a backup from last night on my user database.”
Carlos: Yeah, a little more everyday issue, and having to overwrite that.
Steve: As part of the corruption challenge, I did not issue a corrupt master database to anyone because I didn’t want to destroy their environment. [laughs]
Carlos: The community thanks you for that. [laughs] I think this has been a great conversation, definitely being able to learn more about corruption. I know that I still have a little bit of homework that I need to do and get in there and read some of those posts. Is there anything else that you would like to bring up as a result of challenge during corruption that you think that the listeners would [inaudible 28:57]?
Steve: Plan ahead. I like to think that fixing database corruption is a lot like when people have to do CPR. At the event that you need to do either of those things, if you have to go online and do the research and figure out how you’re going to do it, you’re probably not going to do it as well as if you trained ahead of time.They’re both situations that can have pretty negative outcomes, although CPR would certainly have a more negative outcome than most database issues. At least with database issues, if you do it right, you can try again.
Carlos: That’s true. You get multiple opportunities.
Steve: Train, plan ahead, and act like you’re going to have a corruption tomorrow and train for that.
Carlos: Thank you, Steve. We do appreciate the conversation. We do have a couple of other questions. One of the things that we like to do is try to provide some value to folks and see what other data professionals are using and doing. So one the things we like to talk about is your favorite tool. This could be paid tool, it be a free tool.I know you have actually written one, so I am not sure you want to talk about that. But what is your favorite tool and how are you using it?
Steve: I am a little bit biased on this one. My favorite tool is Database Health Monitor. That’s a tool that I’ve spent the last five years building. Earlier this summer, we got to version 2, finally, and then just a week or so ago, I released version 2.1. It’s a free tool that I’ve make available to the SQL Server community. What it does for me is really a couple things.First, it provides a quick way to get the status of your SQL database, your jobs, your queries, your user connections, general health of the database without having to go through some of the more challenging ways of getting that out of SQL Server Management Studio.
Second thing that it does is you can install a monitoring database that will keep track of historic waits over time, and with the small job, it will track wait statistics and then reports that will show you, historically, back in time from when you first started monitoring what was going on, what was causing SQL Server to be slow, what was causing it to wait on things.
If someone says, “Hey, yesterday at two o’clock, the database was slow. Do you know why?” you can go back and see what queries were running and why it was waiting at that point in time. There’s a lot of other tools out there like that but they are very expensive, and this is a free thing that I did to give back to the SQL community.
Carlos: I think from that history perspective, [inaudible 31:49] the value add of those third party tools is the capturing of the data. It gives you that history perspective, and then some basic reporting on top of that, so to have a tool that folks can use to get both of those things is very, very helpful.A lot of people come to the database profession in a variety of different ways, and we all love it for different reasons. Right? I’m interested to get a story or some background on something that helps illustrate why you enjoy being a database professional.
Steve: One of my favorite things is working on performance tuning on SQL server. I know we talked about corruption, but performance tuning is a lot of fun too. The thing I love about this is when you can solve problems for people where you get immediate feedback at how valuable it is.An example of this was, a couple of years ago I worked on a system that had about 200 users who each had to perform from a repetitive task about 15 different times a day. That repetitive task took about five to seven minutes to complete, waiting on a single query to run. I did some performance tuning on the database, and I was able to reduce the time it took that from five to seven minutes, down to three to four seconds.
Steve: It was a good gig. I got paid for it, and all that, but what made me really excited about it was when the client sent me a response that was basically an equation that showed how much time was saved by their staff every day. It worked out to be something like 20 to 30 full time employees by fixing that one query.The client was happy, I was happy, and it was just a few hours of work. I was able to save them so many hours of work every day. I was able to make a significant impact on their business. That’s the type of work I love to do, where you can do something, and get feedback, and it makes a difference in people’s lives.
Carlos: That’s right. That’s the nice thing, and the scary thing, all at once. That you have access to be able to affect that change, but then you also have the responsibility to come along with it. The corruption that we just talked about is one of those areas that you are then responsible for. Always being able to help, I think that’s a very common theme throughout the data profession.
Carlos: Very good. Before we let you go, Steve, we do have one more question for you. That is, if you could have one superhero power, what would it be, and why do you want it?
Steve: Wow! That’s an interesting one.
Steve: I think I would have to say time travel on that one.
Carlos: There we go.
Steve: The ability to jump back in time, and then to return to the present, kind of a “Doctor Who” type time travel, where when somebody comes to you, and they’ve got database corruption, you can just jump back in time and figure out what happened when it occurred.
Carlos: That’s right. [laughs]
Steve: I think that would be a lot of fun.
Carlos: I live in Richmond, which is a city that has lots of history in it. I’ve always wanted to kind of jump back in time to see what it was like. If all the stories they tell us in the history books are really true. [laughs] I think it would be very interesting.
Steve: There is so much to learn.
Carlos: That’s right. You have a little bit more noble pursuit in avoiding database corruption, but hey…[laughs]
Steve: I would do a bunch of fun stuff with it too, I’m sure.
Carlos: No question. [laughs][background music]
Carlos: Steve, I really appreciate you being on the show today.
Steve: Thanks for having me. It’s been a good time.
Carlos: Yes, and compañeros, I’ll see you on the SQL trail.[music]
Recorded Voice: SQL Data Partners