One of the neat benefits of putting the podcast together is talking to a wide spectrum of folks with varied backgrounds. As we get more feedback from listeners, the idea of the Lone DBA emerged. The Lone DBA is someone who is the only person responsible for database administration at their company. These tasks may have been assigned or they may have simply been required to help support an application. Steve and I thought we should explore this a bit and to help give us some context, we invited Monica Rathbun to chat with us about some of the challenges she’s faced.
Besides the challenges, Monica shares tactics and techniques to juggle many tasks while being a lone DBA. This includes using social media as a way of building connections with different experts, getting the right tools for the organization’s needs and making your boss your supporter. Have your own perspective on this? Let us know in the show notes for the episode.
“To be a lone DBA, you’ve got to have 48 arms attached to you, doing 20 things at a time”
“It is important for the leaders in the community to mentor and pull people in”
“Making network connections gears you towards your future career”
Listen to Learn
- Techniques, tactics and challenges of being a lone DBA
- Using social media in building network connections
- Managing huge amount of task as a lone DBA
- Setting the right tools for the organization
US News DBA Job Outlook
Monica on Twitter
SQL in the news
Near-real-time replication from SQL Server to cloud
About Monica Rathbun
Monica is currently a Sr. Database Administrator/BI Architect at Massimo Zanetti Beverage, USA in Norfolk Virginia. She has been a Lone DBA for 15 years, working with all aspects of SQL Server and Oracle. She is co-leader for the Hampton Roads SQL Server User Group and can be found on Twitter daily. When she’s not busy with work, you will find her playing taxi for her two daughters back and forth to dance classes.
Carlos: This is Carlos Chacon.
Steve: I’m Steve Stedman.
Sean: I’m Sean McCown.
Carlos: Sean, welcome to the program.
Sean: Thanks, guys.
Carlos: Yes, nice to have you back. Episode 41, you were with us and since then you made a couple of changes, added Steve to the program, and it’s nice of you to come back and chat with us.
Sean: Is this my second or third time on?
Carlos: Oh, I thought it was just the second time.
Sean: It’s possible.
Carlos: Yeah, you’re just a popular guy. You’re just everywhere, here and there, everywhere.
Sean: Oh, God, I wish.
Carlos: Well, so today, we’re going to go backwards a little bit and we’re going to be talking about a command and then we’re going to back into the why we want to be using it and some of the issues around it, so our conversation is going to focus on DBCC CheckDB.
Sean: Oh, one of my favorites.
Carlos: There you go.
Sean: An oldie birdie.
Carlos: That’s right, and the infamous or what made, maybe not what made him but definitely what added to what Paul Randall is today. He likes to frequently give credit for building the CheckDB command, at least, the original one, I think it has gone through some iterations since then.
Sean: I mean he rebuilt it on 05. He didn’t build the original one that I’m aware of. He just did the major overhaul in SQL 2005, right?
Carlos: Oh, the way I remember him telling the story was that he and his team built it originally.
Steve: I think what I’d heard was closer to what Sean was saying, that he redo of it in 2005.
Sean: Yeah, because he was always talking about how it was really messy and how bunch of the error messages were kind of fubarred and they didn’t really give much info on how he went through and straighten all that stuff out in 2005.
Carlos: Got you. Ok, well then, there you go. So I stand corrected. So he made it a little bit better and so ultimately the question is why would we be wanting to run the DBCC CheckDB command?
Sean: Wow, that’s a big question. The general answer, the lightest answer is because stuff happens when you’re dealing with media and typically magnetic media. All kinds of things can go wrong in those rights whether those things that go wrong come from the magnetic media themselves or some kind of memory corruption, sometimes CPU errors can cause things since you can’t physically see it, right? It’s not like sending out a tray of food where you can see that the order served is in disarray, right? You need something to tell you when thing aren’t right or you’re going to be making very bad decisions and you need something that helps you fix on to. I mean, people take that kind of thing for granted but it’s just as important now as it ever was.
Steve: So then with the CheckDB process when you say when things aren’t right, I mean, what type of things are going to pop up that aren’t right that you are going to know about because of CheckDB?
Sean: Did you say what are some of the problems that you wouldn’t know about because of it?
Steve: No, no, what are the things that you’re going to find out about when you run CheckDB if you had problems that you maybe hadn’t run it in the past?
Sean: Right, so you can have problems like out of range errors, so you get data that is out of range for a specific column. That used to be really common in the lower version of SQL. I haven’t seen that kind of thing as much, so the data purity. So that used to be a really big thing when I was doing upgrades from like 70 to 2000, and from 2000 to 2005, and maybe even a little bit on 2005 but it was a lot better then. Where you got an upgrade and you start getting all these PK violations because they didn’t do such a good job in checking primary keys and there are things just flat out like page linking problems. Same thing happens, right, two pages don’t get linked properly and therefore when you go to run a query it can’t find the data it needs and you get an error because the link that it has on the page doesn’t point to anywhere. You know, and of course they are the same problems with index pages as well. Those are clearly the easier ones to find. And then of course then your disastrous ones where you’ve got problems and pages for, and the system pages for the database itself. There are different classes of those problems. I mean, Steve, you’re a big CheckDB guy so you know all about that kind of stuff.
Steve: Oh yeah. So I guess one of the things, I mean, when I see that oftentimes people are either not running CheckDB at all or they’re not running it frequent enough. And I guess frequent enough is one of those sort of variable terms. But, I mean, how often should people, and from your perspective, should people be checking their databases with CheckDB?
Sean: Well, you know, there’s actually a third one there. People, they don’t run it at all, they don’t run it often enough, or they run it and they don’t check the results. That’s a really big one. Well, I’m sure we’ll talk about that more later. But as far as how often should they be running it? Man, that varies so much. I mean, what does often enough mean? I would say if you’ve got a small database where you’ve got plenty of resources and really you’re not in danger of affecting a workload. Hell, run it two or three times a day. I mean, you’re not going to hurt anything, right? It’s not like re-indexing where it’s actually going to, or you could actually mess stuff up.
Steve: So then, like let’s say, though you’ve got a much larger database and it’s really not feasible to run it that frequent.
Sean: As often as you can.
Steve: I had conversations sometimes where people say, “Well, we can only run it, we can only afford to run it once a month on each database because things are just too big.” One of the things I commonly see there is that well they’re maybe running it once a month but they only have three weeks of full back ups. If they’re doing something like that they might not have back up history far enough to go back to where that last CheckDB was. How would you come back, I don’t know, how would you address that?
Sean: Are you baiting me?
Steve: No, no, those just come up in conversations. I was just saying what would you take beyond that when somebody can’t run it. I mean, it’s so infrequent that they don’t have a back up period long enough to go back in the database.
Sean: I know that. I’m just saying, I’m just asking if you’re baiting me because that’s one of the big situations that I was adamant about handling in Minion CheckDB. I wondered if you’re baiting me and to see how long I could go without saying Minion CheckDB.
Carlos: So let me jump in here I guess just for a second as the knuckle dragger Neanderthal of the panel here. We’ve talked about some of these scenarios. I don’t think we explicitly called out corruption although I think all of the scenarios that you listed could be classified as a corruption type issue. But the constraint or the reason people don’t want to run it, you talked about impact, because it has to read everything don’t we kind of get into a memory constraint issue? Because generally our databases are going to be larger than the memory that we have available so we’re going to be causing pressure because we’re going to be flushing things out of the buffer cache.
Sean: You know, usually the pressure I’ve seen for CheckDB isn’t as much memory as it is disk and some CPU running there as well depending on how much CheckDB is going to multithread of you’re on Enterprise. But I would say that the bigger issue is probably going to be something like snapshot space for your disk. I haven’t personally ever run out of memory when running a CheckDB.
Carlos: Well then what I mean by that is that the reason I can’t run it I’m going to affect my applications because now I’m flushing things out of the buffer cache that I need for my application because I’m running CheckDB.
Sean: No, because it’s going to take an internal snapshot and it really runs it against that internal snapshot. So really when we say it’s going to affect external application it’s going to be through CPU and disk. Let’s say that, one of the big scenarios especially on a really busy system with a really big database is that it’s going to run for quite some time. Let’s say it takes 5 or 6 hours to run a CheckDB which not only isn’t all that uncommon, it’s really not all that long for a really big database. And if you don’t have enough disk space then you’re going to expand that snapshot. It’s going to keep on expanding until it fills up the disk and then either it’s going to crash or it’s going to keep your database from being able to expand when it needs to and then you’re going to halt your processes and start rolling stuff back so that and some CPU area. Maybe Steve know something about this that I don’t but I just don’t see flushing the buffer as big of a deal because it’s running against that snapshot.
Carlos: Ok, well so then, I guess I need to ask another question because obviously there’s something about this process that I did not understand. When I hear snapshot, right, I generally think about I’m making, not necessarily a copy but as things change I’m going to keep the history of that, and then the database continues, and then my snapshot grows larger as more changes happen to the database.
Carlos: This is what happens when I’m running DBCC CheckDB?
Sean: Yes. Only it’s internal snapshot that SQL itself runs. It creates a snapshot behind the scenes for you.
Carlos: So depending on the frequency of change in my database then that snapshot is going to increase and that is what’s going to cause me the disk outage that you mentioned.
Sean: Correct, and Steve was trying to say something in the background that I’m really interested on what he’s going to say.
Steve: Right, so what I was going to get at there was around the whole squeeze on memory like Carlos is talking about. The thing I’ve seen is when you’ve got a system where the database is significantly larger than what can fit in memory and queries generally being run that are doing the best they can to deal with the page life expectancy and keep, the right things are staying in the memory for them to run. Oftentimes, what I see with that is sort of if you chart the page life expectancy you’re sort of see the chart pattern where it grows and grows and grows until that CheckDB runs. And then that pushes everything out because everything is needed to be loaded in to be scanned, and the page life expectancy drops to zero when that CheckDB runs. I look at that usually as an indication to be good to add more memory to the system. However, that’s not always feasible in all environments.
Carlos: Ok. So it could affect the three biggies, right? It could affect memory, CPU and disk. But disk obviously because we’re pulling that whole entire database off of the disk and be the snapshotting process I’m getting reads and writes so that’s probably the biggest concern, the first concern potentially.
Sean: I really don’t think it pulls in the entire database at once. It probably goes on time.
Carlos: Oh sure.
Steve: No, it will bring it in page by page just as pages would be normally used. But as it’s doing that it’s pushing out something that hasn’t been used for a little bit through the normal buffer pool process. And you end up pushing a lot of pages out in that process that may have been reused had CheckDb not been run.
Sean: Sure, sure but that doesn’t necessarily mean it’s going to lead to memory pressure.
Steve: No, and I guess, I wasn’t saying that it’s going to push it to the point you’re running out of memory. I’m just saying it’s pushing to the point that you’re causing a lot more I/O because things have been pushed out of memory to make room for what’s being brought in when CheckDB gets run.
Carlos: Yeah, exactly. I mean, so in a certain scenario so would it be fair to say in a sense you could compare that similar to almost like a restart? Because all that stuff just gets pushed out. Maybe not that extreme.
Sean: Sure, let’s go with that.
Steve: Ok, so then one of the things, I mean that comes up is a lot of people say, I can’t run DBCC CheckDB in my whole database. It’s not just good enough there. It’s not going to work. I mean, what are the main reasons you see that people say that or people are looking at they just can’t do it.
Sean: Right, so I’d say there are two big reasons I would say that I see for that. One of them is that that disk filling up that we talked about. Right, Because their databases is just too big and too busy and it won’t run in significant time that the snapshot doesn’t fill up. And that’s a really big problem. I’ve talked to lots of people who just flat out don’t run CheckDBs anymore because of that alone. And another one is that it just takes so long for it to run even if they can withstand the disk, it still just takes so long to run that you know, that’s a drain on resources, right, I mean, ultimately that is a drain on resources. And they just can’t have that resources drain for that long, or you know, they have a thing where they need to reboot now and then or something, something, something, right? They just can’t have a process that runs for literally two days.
Steve: Yeah, you got to exceed your maintenance window, right? The shops talk about, okay, you can do all your maintenance kind of in this window, and then, always extra things, you’re taking away from the system, therefore, can’t do it.
Carlos: Okay, so earlier, one of the things you mentioned was that one of the problems is that people who are running CheckDB and they’re not checking the results. I know that if you click off CheckDB and then SQL Server Management Studio Window, you’re going to get results, are all going to show up red on your screen if you’re interactively running it. However, if you’ve got it running as a job what are your best options for knowing the failure or success of that CheckDB process?
Sean: Right, so you’ve got to look at the job itself, right, the job could fail itself and all this depends on how you’ve written your CheckDB routine, right? If you’re like on built in maintenance plans doesn’t it push it to the error log? I think you see those errors in the error log.
Carlos: That’s normally what I see, you run it as a job and that’s how I’m going to know that it failed, is in my job failed.
Steve: Right, and I know that whether it’s a maintenance plan or a job or just interactively running it, whenever a CheckDB hits a problem it does right to the error log.
Sean: Right, but.
Steve: There’s also going to be things that show up in your job history as a failure if it’s running as a job or a maintenance plan there.
Sean: Right, and now, I don’t know about you guys but I come from a lot of big environments where we’ve had, you know, a thousand servers, 2 thousand servers and tens and tens of thousands of databases. I don’t know about you guys but I really really want to search the log for something vaguely CheckDB related every week or every day or every something. And then not only that but then go inside of there and parse the results out and try to figure out exactly what the problem was, right? So that’s what I see most, right? The guys who have a little bit more on the ball will run it with table results flag which is better because now you can at least get the table results but you have to build on a whole process for that thing too because that whole table results thing has to be managed and it didn’t help. For the longest time table results was undocumented. Like, somebody was going to bring down the server with table result or something. I never did, I still don’t know why it was undocumented for so long. But the ones that are really on the ball we will run with table results and at least get in a table but like I said, you know, you have to build the whole process around that and it’s not just as easy as, “I’m going to run this with the table results flag and that’s it because I’ve got tens of thousands of databases now.” And to me, any process that I develop has to work against one and two thousand servers with tens of thousands of databases or it’s not going to do me any good for the five databases I have. Right, I mean, we’re being asked as DBA’s to do more, and more, and more and, you know, to spend my time going around and checking all of these things and then parsing the results. Who’s got time for that crap? You know?
Carlos: Yeah, and that’s really an interesting point there because the thing, I mean, I look at the DB, people talk about, “What’s the DBA job going to be in the future?” And I think that it’s something that’s always changing where we’re going with it. But I look at the people who maybe haven’t change in the last 15 years of SQL Server and they’re the kind of people who their job consist of everyday looking at error logs. And mainly scanning those error logs and determining are there any problems that you have to deal with. Whereas the DBAs who were evolving and going forward in the future are the ones who were building the automated processes so you don’t have to manually do that. That you can focus on real issues rather than reading error logs.
Sean: Yeah, because you know, there are two aspects. There’s discovering the problem, reading the problem, solving the problem. And so you don’t want to have to go out. You want this stuff to be as easy as possible especially with corruption. You want to know about it as soon as you possibly can and you want to know the nature of the issues as soon as you possibly can as well. I think that sending CheckDB results to the log is ridiculous. I think sending deadlock information to the log is ridiculous. The hell am I going to do with deadlock information in the log. I mean, I’ve got this XML graph kind of thing and taking up 37 lines in my log. What am I going to do with that?
Carlos: Sure. You know, it will be a little bit more proactive.
Sean: Just proactive but Microsoft needs to do a better job at making the stuff that we need to monitor accessible. I mean, one of the worst things they did in x events was make the whole thing XML driven. I was one of the worst they have done because the average DBA doesn’t know really anything at all about XML. And so it’s great that I can take that stuff and save it to a file. Now, I’ve got all this XML in a file what the hell am I suppose to do with it. Now, DBAs have to go to a 2-week course just to be able to query the XML data out of there. Thanks for that. I could at least import server side tracing profile data into a table and query it. But here you got to go through some major hoops to get it in the same format. They haven’t done a good job about taking the stuff that we absolutely have to have and making it really accessible to people that are managing on a large scale.
Carlos: So Steve, I’m sensing that we should have asked our SQL Family question little sooner.
Steve: Yeah, that’s good point. We’ll come back to that at the end then.
Carlos: That’s right.
Steve: Yup, so then I guess back to the question around, “Are we baiting you?”, and that’s the time we might start baiting you on. I mean, now that we’ve talked about CheckDB and some of the pros and cons and difficulties and challenges. You’ve come out something here with your Minion CheckDB.
Carlos: Is there something better that we can run?
Sean: Yeah, I like to think it something better. I targeted Minion CheckDB for those large shops with DBAs that are super busy they don’t have time for searching through the logs or for building their own process with table results, and also for the super large databases where you don’t have time to run CheckDB, you can’t run CheckDB because your database is too big. So it was specifically designed with those guys in mind.
Carlos: Well, so now is it different from DBCC CheckDB? Or is it using the same command just kind of slicing it up and giving us more options.
Sean: Yeah, no, it’s all about the process. When the rubber hits the road I’m still running CheckDB and CheckTable but it’s all about the process behind it. For instance, let’s talk about one of the things that I’m particularly proud of process was. We’ll talk about the large database that Steve mentioned where they just don’t run CheckDB because their database is too big. So one of the things you can do with Minion is you can set it up to do a rotating CheckTable instead. So you can say, I want to do 5 CheckTables a night or 500 CheckTables a night until all the tables are done and then start over again. So you’re doing all of the objects. Because what is a CheckDB really? Is it a CheckDB or is it the equivalent operations?
Carlos: Right, all the pieces underneath.
Steve: Although it’s more than just CheckTable and all the tables.
Sean: Of course, it is.
Steve: I think CheckTable and all the tables get’s you started on a big amount of the areas where you’re most likely to have issues.
Sean: Yeah, but you can also run the other smaller operations like as well. But if you want to check your objects and you just can’t because it takes three days then do a hundred at a time, 500 at a time.
Carlos: So how do you get around that? I’ll use smaller numbers so I can wrap my head around it. So I have 10 tables and I can only do 5 at a time. So my maintenance window let’s just say is I can do 5 tables once a week. So it’s going to take me two weekends to get through all ten tables.
Carlos: So when the change happens, so do 1 through 5, weekend number #1. Weekend #2 I’m doing 6 through 10 but corruption happens in Table #1. I’m not going to find that until the next day. I guess going back to Steve’s point, I just take that into consideration and make sure that my backups exceed the time that I can get the total database done.
Steve: Go ahead Sean.
Sean: Another scenario that I do with that is I allow you to automatic remote CheckDBs. So you can set it up to every single night. It will take the latest backup and restore it to a remote box and run a CheckDB on it and send the results back to the prod box.
Carlos: Got you, which is will be awesome particularly maybe not if you have thousands but if you’re in the 20s, 50s, maybe of databases would be a nice way to do it.
Sean: Well, you don’t do that for every database. I mean, every database isn’t that big. But you do it on the ones that can’t do and the other ones you do local. And, you know, so I give you three or four options. Three options I think for restoring your backup remotely and I give you a number of options for the type of remote restore or the type of remote job you want to do. I mean, it’s incredibly configurable with just three, or four, or five columns worth of data in a settings table and then it just goes. I’ve got people doing it and they set it up and once they got the permissions work out and everything that was the biggest thing, right, it was the permissions between the two servers. Once they got that worked out they found that the process itself just runs.
Steve: So then Carlos, going back to your example of small 10 tables but you can only do half of them each week. And I think around that scenario if you’re only able to do half of them per week because that’s your maintenance window. If you were doing full CheckDB previously I think the likelihood of be you never going to actually do CheckDB so even if you’re only finding out a week behind or two weeks behind that something has been encountered that’s a problem, corruption for instance. It’s better to know about it within two weeks rather than never knowing about it.
Carlos: Or when it usually tells you, “Hey, I’m getting this error.”
Steve: Right, right, and you found out you had corruption 18 months ago and it just sort of crept through that amount of time.
Sean: Well, the cool thing is you can even set it up to run your important tables every night. So you could say, I want to run these guys on a schedule but I also want to run my really important tables every single day because these are the guys that I have to have, right? And it’s really easy to exclude tables or exclude schemas or something like that so you can really target it. And really importantly you can order the databases and order the tables as well. So in your maintenance window even if you’re doing all of your tables at once which is another scenario, right? We’ve got the concept of what I call, what the hell do I call it, dynamic snapshots, so getting around that problem of your snapshot file filling up and just killing your CheckDB operation. Well, you can put your on snapshot on there so can create your own disk and put your own snapshot in there but if you don’t have a disk laying around that is big enough, you can switch to CheckTables. But even if you’re doing your own snapshot and CheckTables you’re still going to fill that up. So with dynamic snapshots, what we do is we allow you to say, you know what, every 30 minutes I want you to drop the snapshot and create a new one, and then continue doing the CheckTables. So if you’re going to do a thousand tables and it’s going to take you say 7 hours to get through it. Every half hour it’s going to drop that snapshot assuming that it’s after an operation or if one table takes an hour and a half there is nothing I can do about that. if it’s over that half our it will drop that snapshot, create a new one, and then carry on with the rest of the tables and that’s going to keep you from filling up that snapshot space, that disk drive. That’s another scenario that we threw in there for this big database problem, right.
Steve: So one of the scenarios that I’ve come across a couple of times that has been challenging has been when you have a larger database that takes a while that run CheckDB against it, and so you’re trying to do CheckTables or other options. But then that one large database, although it’s got a few tables in it there’s one table that really takes up like that 95% of the entire database. So then you run CheckTable against that one really big table and it ends up taking almost as long as CheckDB would have taken to begin with. So are there any options or anything that you see of how you would deal with that differently? And then perhaps run it on another server?
Sean: Well, in that case, I mean how important is that table?
Steve: Well, usually when it’s that size it’s something that’s pretty darn important and it’s just that big because some software along the way wasn’t architected to correctly warehouse things or break it out into another tables.
Sean: Right, well I mean it could just be an archive table that’s sitting there and it doesn’t get data but once a day, or once a week, or once a month, or something. It’s not really that important of a table functionally. The data just needs to be there so there are two different scenarios that you’re talking about. You’re talking about the archive table that’s just sitting there getting bigger and bloating your CheckDBs. And the one that is an actually active table that they just refused to archive and, you know, we’ve all seen it, right? So in that first scenario it’s really just to exclude that table from your CheckTable process and only run it let’s say once a month or once every after two months or something. I mean, that’s really easy to do. The second one is a little bit harder because that is an active table and so you’ve got a couple of things you can do there. You could put it in its own file group and run CheckFileGorup. I like that solution a lot less because I don’t think you really gain anything from that. But yeah, you would either run the CheckDB remotely or you could as well run CheckTable against all of the other tables every day and only save that one for like the weekend or one a month or something like that if it’s going to take a really long time then do that one by itself. Or since it is 95% then run a full CheckDB or full Checktable process on the weekend or once every two weeks or something. But for your dailies, or even better yet you can give it, I want to say a perfunctory. But you could run it with physical only more often than doing a full data check. Just make sure that physical only will just make sure that all the page structures and everything are proper. It won’t bother checking any of the other data types stuff. So it’s a much faster check because it’s a lesser check. But at least you’ll know that the tables themselves have integrity and not necessarily the data. And I’ve got somebody doing that and by the way that’s really easy to do. And Minion is to say on Monday through Friday or on Weekdays I want to run this with physical only, on Saturday I want to run a full CheckTable on it.
Steve: Yup, ok. So another area that’s come up in the bait here and there as I’ve done things with the database corruption and what not is around doing DBCC CheckDB on TempDB. And that’s one that I’ve seen a lot of really intense arguments or one or the other whether you should do it or whether you should not do it. And I’m really just curious of what your thoughts are on checking TempDB.
Sean: Of course you should do it. You know, TempDB is central to everything that you do. I mean so much stuff uses TempDB these days. And if you’re pushing all of your Temp tables through there and a lot of your spill results and you’ve got corruption in there or more importantly you’ve got it on a disk that is causing corruption in there. And you’re going to now rely on the results from these operations. You need to know if TempDB is corrupt. I mean, what you do about it is for debate depending on what’s going to happen. I mean, of course, easiest thing or the only thing that you can really do is delete the files and restart SQL. You’ve got no choice, you can’t really run a pair on TempDB. But yeah, I think because that the data isn’t persistent it doesn’t mean that it can’t be creating corrupt results and your entire environment and results and what not that are coming through.
Steve: Perfect. And although I tried to ask that question in a very middle of the road state of trying to not waste my opinion one way or the other. I completely agree with your answer there.
Sean: You did a good job.
Steve: And I think that next time I’m in an argument over this with someone I’ll say refer back to Episode 90 of the sqldatapartners podcast and there’s a brief discussion on that.
Sean: You can tell them I said so, that’s right.
Carlos: I of course agree with you too Steve if that matters but I’m not sure I’ll get a vote.
Sean: You will get a vote we just may not count it.
Steve: Alright, so then I guess back on sort of the topic of the Minion CheckDB side of things. I mean, there’s a lot of people running CheckDB, there’s a lot of people running their own custom CheckDB scripts, there’s people that are running the Ola Hallengren’s database integrity script. So really what I’m trying to understand is who out there is going to get the most value out of the Minion CheckDB product versus one of these other ways of doing it.
Sean: Dude, I hate to sound like that way but absolutely everybody. The amount of logging that we’ve thrown into this product is obscene. And I don’t mean obscene in a figurative. I mean it is obscene.
Carlos: So you mentioned some of that, so kind of downplayed it a little bit the role or the usefulness of going to the error log. So when you talk about reporting what does that mean? I’m getting an email message, there’s a dashboard, what does that mean?
Sean: So, it means that we store everything in log tables instead of in files or in the log. So we make it really easy to get to.
Carlos: So you can then query it. I mean, is there built in processing that’s going to help notify me like, “Hey, your biggest table has corruption in it.”
Sean: So we don’t get into the alerting business at this level. We save that for our Enterprise product. Because I don’t know how you want to alert. But I’ve made it as simple as possible. I mean, there’s even a flag that says to fail the job because we handle all the errors the job won’t fail so you have to tell us specifically I want to fail the job. Some people don’t want the job to fail. Some people have a separate process that go out there so there are network monitors that can consume SQL queries. Or maybe you want to add a separate step to the job that queries the table and says, “Email me if any errors occurred.” But the point is that we give the errors in a very consumable state. We even tell you how many allocation and consistency errors. We tell you how long it took. We tell you what the last status was. We tell you whether it was a remote job or not. We tell you what server it ran on. We give you an obscene amount of information. There are probably 40 or 50 columns in an hour log details table, logs for absolutely everything. And I would say that’s the biggest advantage that we have over everybody else is again in that process, in the processing of the information and the jobs because we even allow you to do things like, let’s say, that you’ve got even, you know, that you’re using Ola and you’ve got 300 database on a server. And they’re all relatively medium size, right? You have to do all those serially. Or you have to create different jobs and then put them in, 10 or 20 at a time in the parameters.
Carlos: Do the leg work to carve it up yourself.
Sean: To split that up so that you can run those concurrently. You can just, I mean, Minion is already multithreaded so you just say run 10 at a time and it will split them up and it will run 10 at a time.
Carlos: That’s bags the question, when do you start seeing, were you introducing problems by doing so many at the same time. Have you even seen any?
Sean: Like resource wise?
Carlos: Yeah. When is too many too many?
Sean: That’s going to depend on your server and everything involved.
Carlos: That’s fair, too shady.
Sean: I gave you a fully automatic machine gun. What you choose to shoot with it is your problem. One thing I despise is tools that try to be smarter than me. I may not ever use a multithreaded process because this one big server won’t support it most of the time. But then there’s one time when we’re down for our quarterly maintenance and they say, “You’ve got 2 hours to do anything you need while we do all of these other stuffs in our app boxes that hit the SQL box.” And I’ll say, “Ok, I’m going to run 5 CheckDBs at once and get them all done in 2 hours.” You know, because I’ve got the box to myself. You may not use that normally but you may use it three or four times a year when something like that comes up. And you know there’s just so much other stuff in there, right? You can do the same thing with that same rotation we talked about with CheckTables. You can do a set of tables and then do another set of tables and then do another set of tables. You can do CheckDBs that way too so you could rotate those 300 databases; 10 a week or 10 a night. Or you could say you can also run it base off of time. You can run CheckDBs for an hour a night and then just pick up where you left off with the next database tomorrow so it will just roll through all those guys on a timed basis as well.
Carlos: In that scenario is it keeping a snapshot. I’m getting a new snapshot, right? If it has to break the database up so I guess one hour a night, my process takes two hours, which means I need two nights to get through the whole database. I’m getting a new snapshot, right?
Sean: It depends.
Carlos: Oh, ok.
Sean: If you’re running time to CheckDBs which only work off of a rotation which makes sense of think about it then it’s only going to run the CheckDBs that it can do in that time. So you can’t CheckDB half a database.
Carlos: Got you, ok got you.
Sean: But if you’re running CheckTables then it will run through as many of the tables in an hour as it can. Then in the snapshot settings if you’re creating your own snapshot that is, right. If it’s an internal snapshot you have no say there. But If you’re creating your own snapshot then you can tell it to leave the snapshot or to delete the snapshot when it’s done, and then you’ll just get another one tomorrow so you’ve got options there.
Carlos: Another question I had and this goes to something. I want to say it was Grant Fritchey said it but it’s giving me a little bit of heartburn and that is when you have corruption his comment was, “You want to find out why it happened.” And this kind of goes back into their TempDB discussion and you mentioned the disks. So my question is, how do you normally go about identifying why corruption happens.
Sean: I would say that depends. There are a number of things that can cause corruption. Almost in my experience, most of the time it’s going to be a spinning disk. But it can be cause by you copying a database over the wire and it gets corrupted and the bits get corrupted over the wire. Or you’ve got bad memory and the memory is starting to go out and that will do it. I would say disk and memory are the top 2 but I see disk pretty more often.
Carlos: I think I’ve also seen one scenario, I guess, in that a client called me up. They were having some corruption and they’re trying to do like a big bulk import. I don’t know exactly what happened but it sounded like somebody just pushed a button and powered down, you know, hard stop the machine. Something like that caused the problem as well.
Sean: Right, and that can usually be boiled down to probably something being written incorrectly to the disk.
Carlos: Right, so some kind of transaction error or something.
Sean: Right. So to answer your question directly, I kind of preface it there. So to answer your question directly usually if I find like a corrupt table I’ll go on and fix it, and just kind of move on but watch it. Because things happen, stuff happens. But if I start seeing another corruption within a reasonable time say another month, maybe two then I’ll say, “Ok, we need to figure out because we’ve gotten two corruptions inside of a few weeks. And this is just too often.” So I’ll have one of the guys run a disk check and then I’ll have one of them run a memory check. You need to specifically check the disk, and specifically check the memory to make sure that they’re ok. Because you’re right, you do need to know where it is coming from. And I want to say most of the time that’s going to be disk. The thing is I’ve had them come back before and say, “Well, the disk came back clean.” I’m like, “Well, that’s great but you’ve corrupted three tables this week.” I had this conversation with somebody six months ago. “But you’ve corrupted three tables this week. The disk is not ok.” We just ran a full disk check. Well, either run it again or run a different check because you’re not ok. And after going back a few rounds with them they would get the vendor who would either show them a different way to do it or give them a different tool, and bam, the bad sector show up.
Carlos: Sure. There are some flag or some option that they didn’t have before that.
Sean: Check the controller. It could be the controller as well. But something there in the disk process, in the disk write process is not right. And if you stick to your guns, I’m saying this to the DBAs who are going to be listening, right. In this case, you got to stick to your guns. Data doesn’t just corrupt on its own. It just doesn’t get tired so you need to stick to your guns and don’t let them come back and say, “Oh well, we didn’t find anything.” Because you know what, I have found quite often that when they say they do something. Like these guys are network guys, “Oh well, we didn’t find anything.” They actually didn’t look.
Steve: That brings up an interesting story. I mean, I work for a client years ago where. Yeah, there was no reason that they could that they were getting corruption and it just happened. And after looking into it a little bit more, well, it turned out that it just happened because the SQL Server was running a desk without any kind of UPS or back up or anything like that. And every time there was a lightning storm, or a power outage or any kind of problem, it would basically just completely power the whole thing off. And surprise, surprise, you have magically a corruption at that point.
Sean: I had that same problem. One of my first DBA gigs, true DBA gigs was at Pilgrim’s Pride and they had SQL boxes and these were 65 boxes, but still. They had SQL boxes in the kill rooms and so it was just a SQL box sitting on a milk crate with a piece of plastic over it covered in splatters of blood.
Sean: And I was just like you’ve got to be kidding me.
Carlos: When you’re talking about kill rooms you’re talking about they’re killing turkeys in this.
Sean: Pilgrim’s Pride, well they process chicken.
Carlos: Chicken, oh my God.
Steve: Oh boy.
Sean: Enterprise wide they process back then about 600,000 chickens an hour, 24/7.
Steve: Oh my gosh! So the server was in the same room.
Sean: The server was in the same room because they have to weigh the chicken at every stage so they want to know what’s the weight of feathers is, and what’s the weight of the guts is, they shift all that stuff out, right? The weight of the feed is, they got to get all that stuff out of this so they got to get the weight and find out exactly what’s what. Yeah, you’ve got to have a SQL box right there by the scales.
Carlos: There you go. So because I feel like I’m going to be left out if I don’t tell a story. Now, I can’t tell you directly back to corruption however being from Richmond, Virginia. Philip Morris, this is the headquarters for tobacco industry and it used to be, I think they’ve finally changed this policy because they were pro tobacco people you could smoke anywhere even in the server rooms. They had pretty good ventilation systems supposedly but that was always the fun story that people would tell is that, “You could walk anywhere and smoke and whatever tobacco product you could use it anywhere in the building.”
Sean: How gross? So to go back to your question a little bit we kind of got sidetrack a little bit and it’s just the nature of how I talk. I would say back to who would benefit from using Minion. You know, it’s out of the gate when you install it. It installs what the job and it installs with schedule. So if you’re not a very good DBA, you just hit the installer and it does everything and it will be up and running and you’ll have CheckDBs. But it’s also configurable for like the world’s greatest masters to come in there and tweak all the bells and whistles and get absolutely everything out of it they want. And one of the big advantages you get especially with all the logging is you get a history at your fingertips of all of the corruption you’ve ever had on that server in that database on that table. And you can say, “Has this ever happen in this table before?” And you can back and look. It gives you a huge advantage of something that you don’t get elsewhere. If you run the query to look at the last time the database was CheckDBed, SQL stores that information in a page. But it doesn’t store anything about CheckTables. So if you’ve got a database where you’re even rotating CheckTables every month and it takes you like 2 weeks or 4 weeks to get through all the tables. That database will show that it has never been CheckDBed eventhough it has. So having all of these here you could easily query, “Oh well, these tables were done here.” And you know when all of the tables have been done you can count that as a CheckDB. So you could write and SP that tells you the last time it had a full CheckDB which was did all the tables get done last month then this was the last time the tables got done. SQL won’t ever know it but Minion will and we’re going to do more for that as we go through the versions. I mean, this is just version 1 and it was a very ambitious version. I mean, we’ve got so many features in there. It’s ridiculous. One of the things I like about it the most is how, I’d have to say something as corny as we crossed the finished line on a lot of these features. But say you’re doing a remote CheckDB and you’re restoring last night’s backup. On the prod box, I’ll show you in the log, in the status column that it’s restoring to that remote server and I’ll even give you the preset complete of the restore. And as it’s doing the CheckDB on that remote server I’ll even give you the preset complete of that CheckDB on the remote server here. So if you have 5 databases on Prod 1 then they are all being remote CheckDBed to five different servers. You will see the status and get the results back on prod one. You don’t have to go around to all those different servers and check all that stuff manually. You get it back on Prod where it belongs because that’s not a CheckDB of Dev 1 or QA 1. It’s a CheckDB of the Prod 1 database. It’s just being run on over there. So the results belong back on Prod 1. And when I want to see status that status belongs back on Prod 1. I shouldn’t have to go all over BFE to find this information. I think we’ve crossed the finish line really well on some of that stuff like that. We really made it as user friendly and as complete as we possibly can.
Steve: Ok, I think that’s a very in-depth answer to the question I asked but definitely good stuff there.
Sean: You know I never shut up about my product. You bought that. Think I was a snake when you picked me up?
Steve: I think that wraps it up for all the questions we have to cover. Anything else you want to hit on before we move in to the SQL Family questions?
Sean: No, I better not. I’ll just go and go and go. I’m still so incredibly in love with CheckDB. Well, I’m in love with two things. I’m in love with CheckDB as a product and I’m in love with not having to be developing it anymore. It took me so long to write this and I’m just thrilled to be working on something else.
Carlos: I thought it was available earlier, and you mentioned like shipping it out like last week or.
Sean: February 1.
Carlos: February 1, yeah, so was that actually the first release?
Carlos: Oh, got you. Ok, I guess I’ve just been seeing the stickers or whatever previous to it.
Sean: Yeah, we tease it a little bit. We had the stickers printed a long time ago because we know it was going to be released. I just didn’t know. It took me like a year, a year and a half to finish it, something like that. I mean, you know when I first started it was like, “Hey, CheckDB. This will be easy because there’s not nearly enough in there. There isn’t really much in there as there was in backup. What could possibly go wrong?” But then when you start adding multithreading and mind you this is T-SQL multithreading. I mean, I’m not breaking out in the .NET or Power Shell or anything. This is all done in T-SQL. So we’re doing multithreading, and then the rotations, and then the remote stuff and it adds up. I way over complicate everything so.
Carlos: Well, I heard an interesting thought and talking with one of the Microsoft guys, and all this rapid release things they talked about. And his comment was, “Releasing is the feature.”
Sean: Yeah, I like that.
Carlos: I’m like, you know, yeah because you just get kind of a trap, “Oh, it’s not good enough, it’s not…” You know what, let’s just release and will fix it later.
Steve: Alright, so let’s go on to the SQL Family questions then.
Sean: Do it!
Steve: So how did you first get started with SQL Server?
Sean: Man, I’ll give you a shorter version of the story as I can. So I was a French chef for many years and followed a French chef all around the world, and the latest bistro that we were in closed.
Carlos: This is in France or this is in the States?
Sean: This was in the States. We were back here at that point. And I was like, “Man, I don’t know what I’m going to do. I’m not really looking forward to moving again.” Because chefs have to move around if you want to get any good experience. And I was kind of tired of moving around and I didn’t want to end up in a subpar hotel or something opening cans on the line. So one of the waiters said, “Well, you know, my sister works for Stream. And I was like, “Ok, great, a field magazine. Great.” And he goes, “No, no, they do tech support.” I was like, “What the hell do I know about tech support? I’m a chef.” And he says, “You know, they’ll train you.” And I was like, “Oh, really?” So I had my own computer and ended up knowing a little bit more that I thought. And being in French kitchens, and being who I am a linguist and all that, I ended up working for HP on the French line. So I was French tech support for Windows for HP. And so I stayed there for a while and then left there and went to this young start up called Health Web and it was a SQL based product for health care where you could put all your patient information up online. And they were like one of the first ones that did this and now it’s everywhere. But they were one of the first ones who did this and I was their tech support. Well, the application was written really well because the guy really knew what he was doing. And so most of the problems that we had ended up being SQL related and this was like SQL 60 and 65. We had one customer at that time. They were hiring me ahead of time and so I had nothing to do. And there was this whole wall of shelves in the back room with all of these computer parts on there. So I built 4 or 5 boxes and put them under my desk and network them and just started learning SQL, and IIS, and Windows. And I didn’t really know that much, right, but I had a guy there who was willing to take the time to show me some stuff. He really mentored me. So I just sat there all day long with the SQL training kits, you know, the official training kits they give for the cert exams back then, and just went through those things and work all the examples and did them different ways. And every day when I came in I would break all my boxes down. All the way down from Windows, and reinstall Windows, reinstall SQL, create databases, load them with data, do their exercises and do the same thing tomorrow because repetition is king. I worked there for like a year and a half or so and every problem we ever had was always SQL related. It was never the application. So they got sold and they offered me a DBA job in Denver. And I was like, “I just bought a house 2 weeks ago. I’m not moving in Denver. I just bought a house. You could have told me that you were selling the company 2 weeks ago.” But it did kind of thought in my head because I had then looking for, you know, when we find out about the sell we have been looking for help desk gigs because I was help desk guy. That’s what I was. But when they came a couple of weeks later and said, “You could move to Denver and be a DBA.” I can’t move but then I was like, “Hey, why don’t I start looking for DBA gigs here.” And I ended up getting hired on at Pilgrim’s and being a helpdesk guy with a lot of lab time under my belt and just a little bit of actual and no real production trouble shooting, no real production database running just running it as a help desk guy answering a question every now and then and I was really pretty green if you ask me. I mean, looking back on it I was a new.
Carlos: We are all were.
Sean: We were all dumb in the mid 90s, right? But he hired me on the spot; he said that I was the best candidate they had ever seen. And I was like, “Really, me?” I guess because I read the book. And the rest is history, I kind of fell into databases. I wasn’t going to be a DBA. I wasn’t even going to be in computers. I was going to be a chef my whole life.
Steve: Very interesting, French chef to SQL DBA.
Sean: Right, so that’s my story.
Carlos: Now, we talked a little bit about this earlier and perhaps we can revisit it. If you could change one thing about SQL Server what would it be?
Sean: Oh jeez, man, I would say the thing that’s most important to me most of the time is they don’t make information as easy to get at as they think they do or as they’d like even when they do something to help somebody, like the new clone database thing. If you ask me that kind of thing, that thing was half asked. I don’t like the way they do most of the stuff. But anyway, I would say make the information easier to get at. Like there is no reason whatsoever why last CheckDB date shouldn’t be in sys databases. There is no reason why you can’t have table properties and say the last CheckTable date or something like that. Some of this information is just so incredibly hard to get at. Why is it so hard to get table size? Why is it so hard to find some of this information? Even the last time an SP was updated or the last time a table was written to. They don’t make the stuff easy. You got to jump through some major hoops to get some of this stuff. So why isn’t this stuff easier to get at? Why do we have to jump through so many hoops or basic information about the database, right? And I think CheckDB information being the theme here is the best example of that. Why do I have to do this ridiculous DBCC page thing and parse and all of that crap to find out when the last time a database was CheckDBed. You know when it was. Put it in the sys databases and move on. Another one is why is it so hard to find out the last time a database was used. The last time somebody used the database or use a table or something like that to find out if the database is even still in use. SQL should be automatically keeping the list of all the applications that hit it so that you could easily query which applications have hit it and from what server? That stuff shouldn’t be a process that I have to write, SQL should just monitor that stuff and keep a history of it for you.
Steve: Yeah, definitely agree on that one.
Sean: And it should automatically have a firewall. You should be automatically be able to say. I only want these applications to be able to query SQL. Be able to lock anybody who’s not DBA out of every single access except through the intended application. I shouldn’t have to buy a third party product or write something really thick to do that for me. You should just have a list of applications that have connected. Let’s say, click, click, click, add one, I want you to do this, I want you to add this one to it and then that’s it. For anybody who’s not a DBA, DBAs can do whatever they want, but don’t let anybody hit it with Excel. Don’t let anybody hit it with Crystal Reports. That’s what the other boxes for. This one is for this application and that’s it.
Steve: Alright, so we’re into our last SQL Family question.
Sean: So you get me riled up about things.
Steve: We’re not trying very hard either.
Sean: I know, right, it’s so easy.
Steve: Yup. If you could have one superhero power what would it be and why would you want it?
Sean: Oh, God, what did I say last time, Carlos? Do you remember?
Carlos: I’ve had to look it up.
Sean: You know, right now, I’m going to say I wish. It’s split between two. And I really can’t decide which one. I would either want to be able to read minds.
Carlos: Oh yes, I’m pretty sure that’s what you chose all the dirty laundry that comes along with it.
Sean: Or I would want to be able to see details by the way. Not just as vague BS that you see in movies but details into the reasonable future. I don’t want to see vague shapes and colors and find out what that means as I get there, that’s ridiculous. I want to be able to see a day or a weekend to the future and know what’s going to happen and be able to be prepared. Imagine in our field how cool that would be, to know that a problem is coming and have seen the fix already and to be able to do something about it. You’d be a superstar in every sense of the world.
Carlos: Awesome, very rich stock picker as well.
Sean: Yes exactly, that hadn’t crossed my mind.
Carlos: Warren Buffet it your heart out.
Sean: I know right.
Carlos: Well, awesome. Sean thanks so much for coming on the program today. It has been a blast as always.
Steve: Definitely been fun. Thanks, Sean.
Sean: I love this topic. I’d like to see you guys again for re-index. I’m starting now. And we didn’t even get to talk about the new stuff in Minion back up but we’ll save that for another one.
Carlos: There you go. You guys got another 45 episodes or so.
Sean: Yeah, exactly. You can’t make this the Sean show. Thanks a lot guys!
Steve: Alright, thank you.
Carlos: Thanks again, Sean.
[…] https://sqldatapartners.com/2017/03/28/episode-89-lone-dba/ […]
[…] SQL Data Partners Podcast Episode 89: Lone DBA (Carlos L. Chacon) […]