Episode 141: Wait, Wait, What?

Episode 141: Wait, Wait, What?

Episode 141: Wait, Wait, What? 560 420 Carlos L Chacon

In this episode, we chat with David Maxwell about wait stats, what they are and why we should care about them. There are so many different directions we can go and if you have a specific use case you’d like us to consider, let us know and we can include it in a future episode.

Episode Quotes

“Context matters. A particular wait type might be a problem in one context, but it might not necessarily be a problem in another context.”

“Whatever monitoring tool you choose, monitor your server over the period of time where you’re seeing that slowness and see which are the wait stats that are being accumulated so that you can better target things.”

“Waits don’t occur in isolation, so you really can’t troubleshoot them in isolation.”

Listen to Learn

00:40     Intro
01:35     Compañero Shout-Outs
02:03     SQL Trail Event announcements
03:34     World Cup Woes
03:57     Intro to the guest and topic
05:07     Why do we care about wait stats?
06:28     All blockers are not equal – Context matters
08:19     Waits you should concentrate on
11:59     How to reduce these waits to make your system faster
14:38     Wait stats are a universal troubleshooting tool
16:49     Tying waits to queries
19:07     Use sp_whoisactive by Adam Machanic and compare those snapshots to the broader view
23:08     CXPACKET and new to SQL Server CXCONSUMER
25:28     Confusing or misleading wait stats
29:07     What does a healthy set of wait stats look like?
31:57     SQL Family Questions
39:52     Closing Thoughts

About David Maxwell

David has been working with SQL Server since the year 2000 and SQL 7.0. He’s been in many different environments from healthcare to manufacturing to finance and education. He loves internals, performance tuning, troubleshooting and really good coffee. David also helps in running the SQL Saturday in Columbus OH with the local PASS user group there. In his spare time he can be found playing video games live on twitch.tv.

David’s SQL Blog: dmmaxwell.wordpress.com

Games on Twitch: twitch.tv/dmmaxwell

Carlos:             Compañeros, welcome to another episode of the SQL Data Partners podcast. I am Carlos L Chacon, your host, and it is good to be with you! I apologize for missing last week, but it’s good to be back on the air. For those of you who are in the United States and gosh, if you’re listening to this podcast, Happy Fourth of July! We’re celebrating the independence of the United States today, as this episode is released, and so I’m sure that I’m having a great time doing whatever it is that I do on the Fourth of July. Our episode today is entitled “Wait, Wait, What?” We’re talking about wait stats and we’re happy to have David Maxwell on the program with us. David is a database consultant out of Columbus, Ohio, and a gamer and while admittedly we don’t get much into the games, we do draw that out a little bit and maybe we’ll circle back around to that in another episode.

We do have a couple of Compañero Shout-outs we want to give. The first being to John Wells from Louisiana. Hope all is well down there. Aaron Hayes, reaching out for a little networking, so it was nice to catch up with you, again, Aaron. And Saul Cruz has a podcast idea that we’re going to come back to here in a few episodes. We’ve been trying to nail down, and so Saul gave it another vote up, so we’ll try to get that on the docket, here.

So compañeros, if you’re new to the program, of course welcome. For those of you who’ve been long time listeners, thanks for sticking around. I do appreciate it. One of the things that we’re doing or putting together in Richmond is the SQL Trail event, so this is my iteration of what a data platform conference should look like. It is going to be two and a half days. We’ll be in Richmond from October 10th through the 12th. We’re going to get together, do some get-to-know-you, actually select some sessions based on what people want to hear about and the speakers that we’ve invited. And then Thursday will be a bit more traditional in the sense that it will be sessions, obviously based on feedback, but we’ll be there together. Depending on the number of people that we have, we may do single track, or at most two tracks. Again, this idea of being able to extend the conversation, and then on Friday, we’re going to have a lab with Melissa and Meagan. And so, if you’re looking for training, if you’ve had trouble finding opportunities that are within your training budget for the year, but are looking for something where you can contribute, it’s relaxed, it’s laid-back, but it’s also very, very in depth. There’s going to be speakers from all over the place, different areas. Again, PowerBI, the reporting stack, the traditional DBA arena and so we invite you out. Sqltrail.com, check it out, and you can use the code SQLTrail to save yourself $100 off the registration.

Okay, those of you kind of keeping score, I’m sure hopefully if you’re in the World Cup and that’s your thing, hopefully your team persists. For me, I’m a big Argentina fan. Spain has also fallen, and so we’ll see what happens. But those European teams are definitely showing their dominance this tournament.

Our show notes for today’s episode are going to be sqldatapartners.com/wait or sqldatapartners.com/141. With that, let’s get into the conversation with David.

Carlos:             David, welcome to the program.

David:             Thank you very much for having me.

Carlos:             Yes, taking a little time from both work and I know you’re a big gamer, as well. I see, from social media, from time to time you’re inviting people. I admit I don’t know enough about it, but kind of join you in some of these online games, so thanks for being here.

David:             Not at all. Not at all.

Carlos:             Have you been doing much gaming, lately?

David:             I have not been playing a whole lot online. I play a lot of single player games and I kind of enjoy it when people jump in and comment and from time to time give me hints and things like that, but otherwise, I’ve actually been spending a lot more time at work, lately. I actually started the gaming channel because I wanted to give myself an excuse to get away from work for a little while, so I figured I can at least schedule an hour a week, sit down and do something that would be fun for me, but maybe have more of a social aspect to it, so that’s why I actually started the gaming channel.

Carlos:             Okay, well very cool, so we’ll have to explore that potentially in another episode. But today’s topic is, we’re talking about wait stats. This is a topic near and dear to many of us and so I guess just to get started, why, as either administrators or those of us who are writing code, why do we care about wait stats?

David:             Wait stats is, I’ll probably best answer that question this way, if we look at ourselves as doctors, we’re treating patients. Our servers, our systems, those are our patients and we are the people that are responsible for keeping those things healthy. Wait stats are essentially how SQL Server tells us, “hey, this is what’s hurting me right now. This is what I need help with.” SQL Server’s tracking a resource, processes are coming in, putting pressure on that resource, asking for more and more resources in a certain area and SQL Server says, “hey, this is where it hurts. This is where I’m having trouble. This is where I’m causing things to slow down. Please pay attention to this.”

Carlos:             Sure, sure. I think more and more of us are also familiar with, and we’ve talked about Agile on this show before, but the Agile methodology and this idea of getting together and being like, “hey, why can’t I move forward’ kind of an idea. “What are my blockers?” And I think wait stats are that same idea. It’s like, “hey, I would like to do something, but I can’t for this reason.” The wait stats is that way to let us as the administrators know, “hey, this is what I’m having problems with.” Then, continuing forward, does every time SQL Server wait on something, or does every blocker really matter? Are all blockers the same?

David:             All blockers are not equal, as a matter of fact, in this particular case. They’re not all the same. SQL Server will break up wait stats into various lists that it manages. One of the lists that I constantly refer to, when I’m talking about waits that shouldn’t necessarily catch your attention is the timer list. When somebody puts something in a piece of code like “wait for a delay X number of seconds”, SQL Server says, “okay well, I need to wait for a certain amount of time, so I’m going to track that wait anyway.” For processes that run on a cyclical schedule, like something that happens like a checkpoint that happens every 60 seconds or so, SQL Server tracks the length of time between checkpoints, as well, and it tracks it as a wait statistic, because it is literally waiting for a certain amount of time to elapse. You’ll see those wait stats show up whenever you’re looking through the wait stats for your specific instance, but those aren’t necessarily ones that would catch your attention, or should catch your attention, in that case.

Carlos:             Right, and this is where it can make things a bit tricky. What’s important, what’s not important and then what do I do about it? Which ones can I safely ignore?

David:             Exactly, there’s a phrase that I use a lot when I’m talking about, well, really just about anything, lately, and that phrase is “context matters”. A particular wait type might be a problem in one context, but it might not necessarily be a problem in another context. For example, OLEDB waits. If you’re tracking those and you know you’ve got linked servers in play, you may be concerned about the OLEDB waits, but if you’re seeing those OLEDB waits while you’re running CHECKDB, then that’s to be expected, and not a concern in that particular case.

Carlos:             So, when the waits occur or under what conditions also plays a role into whether we should be concerned or not?

David:             Exactly.

Carlos:             Gosh, okay, so I know that one of the things you talk about or put some information out there are what are the waits, then, that we should be concerned about? Maybe let’s focus on those and at least if you’re just getting into the waits, you can at least put all the others aside just for a moment and let’s take a peek at these first few and go from there.

David:             Sure. If I was going to give somebody a set of waits that they would want to concentrate on to begin with, they’re going to come in two groups. Number one is what I would call, this is my own terminology for it. I would call them Primary Resource Waits, and those are the things that we would normally track with any other performance monitoring tool. These would be waits that would be related to CPU usage or memory or disc IO. For disc IO, I would say you want to be looking for any wait that has IO in the name like PAGEIOLATCH or WRITELOGs like we’re writing to the transaction log and we’re incurring waits on that. SOS_SCHEDULER_YIELD is one that would I would consider a concerning wait for CPU. RESOURCE_SEMAPHORE for memory waits. If you’re seeing a lot of those, you’re going to want to pay attention to those. The other group of waits that I would tell people to concentrate on are the ones that are showing up most often. Probably every time that I have given a session on wait stats, I ask people to just throw the name of a wait stat at me. Just the first one that comes to mind and invariably, every single time, almost the entire room responds as one with CXPACKET. That’s the one everybody sees all the time, so it’s another one of those ones that I would say you want to pay attention to. But more from a learning perspective than it is from something that might actually be giving you trouble. Because if you’re seeing CXPACKET waits on your system, that means you have more than one cord, you have parallelism. Yay! That’s good. That’s great. But is it actually a problem on your system? That’s where the learning comes in. That’s where learning how to understand what these waits are and what they’re doing really helps out.

Carlos:             Yeah, that’s right. This is one of those ones that’s so tricky. Tim Ford, I remember, coming back from my SQL Cruise days, going there and this topic came up and that was more of his approach. I think his phrase was, “well, when you see CXPACKET waits, you should say, ‘hey, I know my SQL Server is doing work.’”

David:             Yeah, absolutely. There are a lot of other wait types like that, too. For example, one that I see cropping up on a lot of the client’s systems that I’m working with right now is BACKUPIO. Great, you have backups, perfect! That’s what I want to see! Now the question is why are we waiting for BACKUPIO? What else is going on in the background of that, that’s causing wait times for BACKUPIO? We can investigate from there. But yeah, just because you see a wait, doesn’t necessarily mean there’s an issue. Another example that I like to give to people when I’m doing sessions on wait stats is I’ll point out that I have waits for, I keep calling it HEKATON. I think it’s in-memory processing or–

Kevin:              In-memory OLTP.

David:             Right, in-memory OLTP. I think the other term that’s used for it is extreme transaction processing. The reason I bring that up is because that’s how it’s listed as a wait stat. You’re not going to see “in-memory”, you’re not going to see “hekaton”. It’s going to be XTP_something else. That’s literally the name of the wait type that shows up. I’ll point out that I have that on my system, but I’m not using in-memory OLTP, but SQL Server, in the background, is still checking to see, “hey, are you using this feature? Should I be doing anything with this? No? Okay.” And then it’ll wait for a while and check again. That will incur those wait stats even if you’re not actively using that feature on your system.

Carlos:             Wow. So now that you’ve thrown out some names and be like, “ah, you know, but maybe it’s a problem, maybe it’s not a problem”. Ultimately, this is a performance question. What are some of the common-sense ways or steps to approach to start trying to understand or battle with these and reduce them, in a sense, to make your system faster?

David:             Sure, like we were talking about a little bit earlier, context always matters. The first thing I’m going to tell someone is you want to get a picture of what are the waits that are being accumulated on your system over time, and it’s the “over time” part that’s the most important to the whole process. Pick your monitoring tool of choice, whether you’re going with a third-party monitoring tool or you want to use Performance Monitor or you want to use Extended Events, which is what I usually use when I’m tracking wait stats. Whatever that is, monitor your server over the period of time where you’re seeing that slowness and see which are the wait stats that are being accumulated so that you can better target things. Another way to approach it is if you’ve got a particular process that is running particularly slowly or you want to troubleshoot a specific query that’s not performing the way you expect it to; SQL Server actually provides a specific DMV. It’s dm_exec_session_wait_stats that you can query against a specific process ID. If you know your sped’s 75, you query against that DMV for process ID 75 and it’ll show you just the wait stats that have been accumulated for that specific process. Having understood when the waits are occurring, that gives you a better way to target where they’re coming from, which processes are causing those waits. Actually, that’s the reason I normally recommend using the Extended Events method is it’s very, very easy using Extended Events to correlate the wait stats information back to the processes that are causing them. Knowing that, you see what are the wait stats that are being caused. Is it a resource wait? If it’s, for example, an IO wait, am I doing too much IO or is the system actually slow? The best way to determine that is to look at what is your average wait time per wait. Once you’ve figured that out, for example for IO waits, if you’re seeing it for each individual wait longer wait times, that’s when I’m going to say, “okay, we might have a slow IO subsystem. We might need to look at our disc performance.” But if I’m seeing relatively quick wait times, but just a whole lot of waits, that’s when I’m going to say, “okay, I’m asking too much of this particular IO subsystem. I’m doing too many reads or writes to disc. I need to look at the queries that I’m running. I need to look at my processes and see if I can tune it from there.”

Carlos:             Basically, you’re telling me there’s no great Easy Button.

David:             Well, like a lot of troubleshooting, a lot of it is spending your time defining the problem, and that’s essentially what wait stats are for. It’s another tool in the tool bag to help us define the problem.

Carlos:             Yeah, this takes me back to Episode 28 with Erin Stellato and talking about baselines and that idea. I think that’s probably the key point. A lot of times, we’re going to see recommendations to say, “oh, well, if you see X, then do Y.” But again, so many factors come into that and being able to say, “okay, now if you see X, when Z is happening, then you can do Y.” Again, that base line then gives you some reassurance, because you can also then say, “okay, well now I’m going to do Y. What happens in the future? I’ve done that. Am I better? Or are things worse?” How are the needles changing and then how can I go forward?

David:             Yep, very good.

Carlos:             Now I’m curious, Kevin and Eugene are a bit more in the analytics, the ETL, if you will, side of the house. Are wait stats purely a, we’ve used the word OLTP, are they purely an OLTP thing, or how might these guys be impacted by wait stats?

David:             Any time you’re running a query against SQL Server, you’re impacted by wait stats, so this is sort of a universal thing. In fact, one of the first major projects that I did with a client of mine, this is many, many years ago. I want to say this was back, probably on SQL Server 2008, we used wait stats extensively to troubleshoot performance issues in their data warehousing process. They would have a nightly ETL load that would run and we were able to pick out key portions of that load that we could tune and optimize for better performance. A lot of that was based on watching the IO wait stats and figuring out where we could optimize some of the processes that were both writing to and reading from a disc far more than they needed to be at that point. Yeah, it’s a universal troubleshooting tool, absolutely.

Carlos:             There you go. I do want to drill down for a moment. You had mentioned in the Extended Events, this idea or this ability to say, okay, let me tie the wait, so maybe for an easy example, the BACKUPIO. I want to tie that to a specific query or a specific time. Walk us through that a little bit and obviously through the Extended Event, I’m capturing the query that’s getting run. Are you then exporting that and aggregating it somehow, or using some other tool? Take me through that from a process perspective.

David:             This is all completed within Extended Events, so it doesn’t require any specific third-party tools. It does require a little bit of XQuery, which tends to turn people off just a little bit, because as DBAs, we see XML and the hair on the back of our neck stands up and the teeth get bared and it’s like (growl) “I don’t want that anywhere near my database. Get away!” But as far as using Extended Events, correlating queries with their individual performance. We’re getting into a whole ‘nother topic here, but with Extended Events, you can track just about anything within SQL Server and correlate those two things together. I actually have, on my WordPress site, which is DMMaxwell.wordpress.com, if you go to my speaking page, there’s a presentation that I do using Extended Events to track wait stats. You can download and see this in the scripts in there as well. But you would set up an Extended Events session that would capture two specific things. One would be the queries, and the other would be the wait stats in and of themselves. Then using the time stamps saying, “okay, this query started at this time, I also had waits that started at this time.” You can correlate it that way and it’ll show you “while this query was running, I was accumulating X number of waits, this type of waits.” Tie that back to the query and say, “okay, these are the waits I had for this individual query. The cool thing about doing that with wait stats is you don’t have to tie it back to an individual query. With Extended Events, you can tie all of the wait stats back to a database or you can tie it back to an application name or you can tie it back to a user name. It gives you that flexibility to be able to say, “okay, this is where these waits are actually coming from.”

Carlos:             Yeah, so I have another thought there. Again, I’m not sure that it’s necessarily the Easy Button, from a collecting perspective, but I kind of feel like Adam Machanic’s sp_whoisactive, now the downside is that it has to be an active query. If you’re going to go back in time, then you’re out of luck. But that is another, I feel like that’s along the same lines. It’s going to go in there, grab the snapshot and tell you what the wait is and then give you some of that query information. Almost exactly what you just said, and as long as you collect that, again, if you know that you can recreate the workload, you’re going to run on a different time table. If you’re not sure that you can recreate it, then maybe you run it less frequently, just to get an overall sense. But I kind of feel like that’s another way that people could collect some of those and, I don’t want to say more intuitively, but it does feel like a lot less work.

David:             It is a lot less work and I’m glad that you actually brought that up, because that’s one of the things that I recommend that everybody has and uses on their systems is the sp_whoisactive procedure by Adam Machanic. For the remote DBA work that I do, for every system that I manage, I have a job on there that runs sp_whoisactive and logs the results to a table, usually about every 5 minutes or so. Just because, you’re right, it is a good, quick, easy way to gather a lot of good, historical information on the processes that are running on your system. And you don’t have to limit it to the currently active processes, sp_whoisactive will actually show you processes that are at rest, as well. That’s a really, really great way to do that.

Carlos:             Oh, that’s fair, that’s true. I guess I need to use that flag a little bit more often.

David:             Now the difference, though, between capturing something with Extended Events versus capturing it with sp_whoisactive is, it uses kind of the same DMVs that other stored procedures would use for that, meaning it’s only going to show you the last recorded wait information for the process that it’s running. You used the right term for that one, which is a snapshot. It’s a snapshot of what’s currently happening. When you’re doing tracing with something like Extended Events or with Performance Monitor or another tool like that, you’re getting all of the accumulated waits that are occurring for your processes over time. So you might run sp_whoisactive and see the last wait information that you had for this particular process was, 130ms for PAGEIOLATCH_SH. Okay, great, but prior to you running that snapshot, you may have been accumulating large amounts of WRITELOG waits for that particular process and you’re not seeing that at that particular time, just because this PAGEIOLATCH was the last one that happened to occur when you ran that snapshot.

Carlos:             Gotcha. So there could be a few more levels, if you will. I guess even using the whoisactive, you still would want to compare that against, I don’t want to say cumulative, but maybe a broader view of all the waits that are collecting.

David:             Sure, and sometimes you could also look at it from the perspective of where do you want to put the complexity? It goes without saying, Extended Events is far more complex to sit down and set up, especially if you’re doing it through TSQL, which is what I prefer to do. If you’re doing it through TSQL, far more complex to set up and use than just installing sp_whoisactive and hitting the Go button. But the question of where do you want the complexity? Once you’ve collected all of this information with sp_whoisactive, then what do you do with it? You have to go back and you have to write the queries that are going to pull that information back from wherever you’re storing it and depending on what you’re looking for, those could also get fairly complex, as well.

Carlos:             That’s true. That’s a fair point. So again, depending on what it is you’re trying to do, how much history you want to keep, might dictate the solutions that you’ll use to try to solve this problem.

David:             Sure, just like any other problem in SQL Server.

Carlos:             That’s right. Kevin, are we hitting all of your questions? I should have brought you in a little sooner, here.

Kevin:              I’d actually like to go back to one of the earlier comments about CXPACKET and, with the 2016 SP2 and I think it was in CU3 of 2017, there’s a new wait type of CXCONSUMER as well. I wanted to see what your thoughts were on CXPACKET and CXCONSUMER, now that the parallelism part has been split out into benign and less benign.

David:             I have not had a chance to test that, yet, and see how that plays. I do have a couple of client environments that are on 2016 that I have been meaning to look into that, but I have not had the chance, yet, to look into that. That was added, I want to say within the last couple of months, I believe.

Kevin:              Yeah, Pedro Lopes had a blog post on it back in November, but it wasn’t out at that point. It was SQL Server 2017, the CU3 was where we first saw it. It was probably also in Azure SQL Database before that. But the whole idea was that with CXPACKET, they’re trying to make that the wait type which is interesting, that if you see CXPACKET on a newly updated instance, that it can be troublesome, but CXCONSUMER is simply, “hey, we have parallelism, producers need to send rows to parallel processes and thus there’s a wait on something further upstream.” At least that’s how I’m interpreting it, but it does make CXPACKET a bit more useful, I think, in getting down to a single point of focus that, “oh, I see this wait and I actually know that I should respond in this way, rather than panic and set MAXDOP to one.”

David:             Right, and I’m definitely all for it. That’s one thing that I can definitely say about Microsoft is they’ve done a really, really good job over the years of enhancing the wait stats. I want to say when they first introduced wait stats back in 2005, there was maybe a couple hundred of them, and now, last I checked, there’s over 1100 different wait stats that they’re tracking within SQL Server. I’m glad to see that they’re starting to take some of these ones, especially the ones that have common misinterpretations, and breaking them out to try and make them more useful. So yeah, I did see that that wait stat was created and I’m pretty excited about it. I just haven’t had the time to sit down and play with it, yet.

Kevin:              Are there any other misleading or confusing wait stats that, like CXPACKET, that can be very ambiguous and that you want to bring up?

Eugene:           Maybe PAGEIOLATCH.

David:             The thing with PAGEIOLATCH is the subtype of that wait, whether it’s a shared latch, an exclusive latch, an update latch, people just see PAGEIOLATCH and they’re like, “oh, darn it, I’ve got a slow IO subsystem.” Oh, not necessarily, and kind of like what I was talking about before where the big thing is you want to look for your average wait time. You may be seeing a lot of PAGEIOLATCH waits, but are they 1ms, 2ms, 5ms, maybe not necessarily a bad thing. But if you’re seeing some PAGEIOLATCH waits but their average is like 300ms or 400ms, yeah, okay, now you want to start looking at your IO subsystem. Outside of that, again, what is the type of work that you’re trying to do? Is it OLTP, is it OLEP, what performance are you expecting out of your IO subsystem, and how much work are you actually throwing at it? So yeah, those are ones that will show up a lot. Another one that shows up a lot that people tend to interpret is ASYNC_IO_COMPLETION or I’ve also seen NETWORK_IO interpreted this way. It’s when an application is consuming data from SQL Server and SQL Server is waiting for the application to return a confirmation, “okay, I’ve got my data, I’m good to go.” Well, is that really what’s going on? Because some applications don’t do that. That’s another one that people tend to jump real quick and say, “oh, well, that’s the application’s fault. I don’t need to worry about SQL Server.” Well, do you? Is there anything else going on that maybe you should look at? What other waits are occurring at that time? That’s actually another point that I should make, now that I’m thinking about this, is that waits don’t occur in isolation, so you really can’t troubleshoot them in isolation. You see a particular wait type and say, “okay, well this wait type has this answer, this wait type has that answer.” Not necessarily. You may want to look at the other waits that are occurring at that time, as well. If you’re seeing, for example, let’s go ahead and stick with the PAGEIOLATCH waits. If you’re seeing a lot of IO waits, and then you’re also seeing a lot of memory waits, like RESOURCE_SEMAPHORE. That’s more likely to be a very, very intensive workload that’s being run against your system at that time. But if you’re seeing those PAGEIOLATCH waits, but you’re also seeing waits like WRITELOG or IO_COMPLETION, that’s when I would start saying, “okay, maybe this really is an IO subsystem problem, rather than a query problem that I need to solve.”

Eugene:           The thing I was thinking about with the PAGEIOLATCH is that sometimes it can be memory pressure that’s masquerading as an IO problem.

David:             Exactly, and that’s when you’re also going to see those RESOURCE_SEMAPHORE waits. Another thing that I tell people is don’t give up on Performance Monitor when you’re going over to the Dark Side, when you’re going over to wait stats. Because you also want to be looking at things like, in that case you would be looking for memory grants. If you’re seeing a lot of RESOURCE_SEMAPHORE waits, you’re probably also going to be seeing memory grant issues, as well. So, correlating a lot of that information, wait stats is where I would recommend a lot of people get started, but don’t forget to monitor basic performance counters as well. Again, you don’t want to everything in isolation. You don’t want just be looking for one specific thing and not be looking at other parts of your system, as well.

Carlos:             Well, there you go. I think that’s a great line to end on, there. just because we’re mindful of time. Any last thoughts before we transition?

Eugene:           I’ve got a quick question. I was literally at a customer site yesterday doing a health check. Things were kind of terrible, and one of the things I looked at was the wait stats. And like you said, it’s all about context and I saw BACKUPIO at the top and I’m like, “oh, well that’s normal.” But their backups were taking like seven hours every night. But the question that I always struggle with, okay, I’ve done a health check, we’ve done some performance tuning, we fixed all the junk, what does a healthy set of wait stats look like, because it’s always going to be waiting on something, right? Usually what I expect to see is, you’re going to have to have backups, you’re going to have to have WRITELOG, you’re going to have a little bit of CXPACKET, you’re going to have some locks, but ideally it’s all spread out, so you don’t have something that’s like 40, 50 percent of the waits or anything like that. What are your thoughts?

David:             I would be less inclined to look at the percentage of waits and look for a distribution there, than I would for total wait time for your top waits for your system. I would be saying, “okay, how long has this system been running? What are the milliseconds, what are the ticks?” And then look and see what kind of waits have accumulated over those times. Have I been accumulating, for example, WRITELOG waits? And this is where it gets a little bit tricky. This is where I don’t generally use wait stats as an overall indicator of health, because they can be a little bit misleading. The reason for it is, we come back to stuff like the parallelism or having multiple processes or multiple databases on a given system. If I’ve got a system that’s got maybe just a few databases on it, but I’m seeing a whole ton of WRITELOG waits, I’m going to want to dig a little bit deeper. But if I’ve got a system that has hundreds upon hundreds of databases on it, even if it’s a relatively nominal activity for each of those databases, I know I’m going to be writing to the log a whole lot. So I’m going to expect to see some WRITELOG waits on that system. Again, I think it comes back to the idea of nothing in isolation, everything in context. If I was going to describe what a healthy set of wait stats look like? I think a lot of it would depend on what the system is built for and what it’s intended to do in the first place.

Eugene:           Awesome.

Carlos:             Yeah, so again, the context is always going to matter, there.

David:             Absolutely.

Carlos:             Just like with everything else. And that might be an interesting, maybe at some other time we break some of that down and we look at some different patterns. Like you mentioned, OLTP systems with the large database, a handful of very large databases, versus the OLTP systems that have, everybody else, so like a lot of times you see, “oh, here’s the important ones. They get their own server. And then everybody else has got to share in this pool,” and there’s a hundred of them. Versus the data warehouse and things like that. That might be interesting to put together and take a look at sometime.

David:             Sounds like an excuse for me to write a blog post.

Carlos:             There you go. Okay, so should we go ahead and do SQL Family?

David:             Sure.

Carlos:             Okay, so your all-time favorite movie?

David:             The Good, the Bad, and the Ugly with Clint Eastwood, and I wish I could remember the other two names, but for two reasons. One, the soundtrack by Ennio Morricone is probably one of my favorite movie soundtracks of all time. Two is, if you think about the time period they were making that movie in, just in case anybody listening hasn’t seen it, there’s a scene in this movie where they’re in the Civil War and they’re the Blue and the Gray, they’re fighting over this bridge and Clint Eastwood’s character, and another character decide they’re going to blow up the bridge because they need to stop the fighting so that they can get across to the other side of the river. Well, while that bridge is still there, and the two sides are still fighting, there’s nothing they can do, so they decide to blow up the bridge. It’s a huge bridge across a decent sized river and there’s no CGI back in that day, so they had one shot to pull that off. They had to build the bridge and they had one shot to blow the whole thing up and capture it on film and they did it perfectly. It is one of the best movie effects I’ve ever seen.

Carlos:             There you go, interesting. Yeah, we kind of take for granted, now you look at the movie sets and it’s all green screen, I mean, just about. So yeah, interesting.

David:             Yeah, it’s just a beautiful piece of filmmaking. It’s a long move, but it’s a good one.

Carlos:             The city or place you most want to visit?

David:             I’ve got about 15 different answers for this one because there’s just so many places I want to go. But I’m going to go with Akihabara in Japan, and I say that because of just the fact that it’s kind of a tech hub and I’m a total geek. I’m also into anime, so there you have it. But I just feel like going into a culture like that would be so different and so new and such a learning opportunity to immerse yourself in a culture that’s vastly different from your own and just have that experience of being in a place that’s completely foreign and unfamiliar to you. I would really enjoy that.

Carlos:             Very cool. The food that reminds you of your childhood?

David:             When I was in high school, we had these things, I don’t know if anybody else has heard of these, called fiestadas. These were hexagonal shaped, I mean, they couldn’t even make them round. They were like hexagonal shaped TexMex flavored pizza. I don’t know if it was just they put way too much cumin in them, but with the Monterey Jack cheese or whatever it was that they put on them. but they were the greasiest things in the world. You could hold them in your hand and tilt them up and there would just be like this river of grease that would come down on your plate. During high school, every time it was fiestada day, everybody got so excited because they loved them so much. In retrospect, it was probably one of the most disgusting things I’ve ever ate, but back in the day, when it was fiestada day, it was like, “yes, finally we get to eat something cool.”

Carlos:             There you go, yes. It’s interesting what sticks out to us. And the fact that that was a school thing, as well, not even a family thing or a, yeah. Let’s see, tell us how you first got started with SQL Server.

David:             Long, long, long time ago, I was working for a management consulting company by the name of AT Carney and I was a systems administrator there. the previous systems administrator, through no fault of his own, unfortunately, kind of left in a hurry, so things were a bit of a mess when I got there. I showed up and I started going through the drawers, going through the supplies and everything that was there and I found this copy of this disc and it said SQL Server 4.2 on it. I said, “this looks kind of interesting, let’s just load it up and see what it does.” So I installed it and I had absolutely no clue what was going on and I said, “this is going to be something that I would have to sit down and take some time and learn and I can’t do that right now”, so I completely forgot about it. And then I went to a new job and the boss there said, “I’m going to need you to keep an eye on these SQL Servers as well, and I said, “oh hey, I’ve heard of that before.”

Carlos:             I installed it, once.

David:             Yeah, I installed it, once, I know what I’m doing. That was with SQL Server 6.5 and the other administrator there said, “well, we’re trying to upgrade this to this new SQL Server 7.0 but it looks kind of weird. Do you think you can help out with that?” So, that’s when I started really poking around SQL Server and getting used to it and then several years later I decided that that was what I wanted to specialize in and I’ve never looked back, since.

Carlos:             Very nice. If you could change one thing about SQL Server what would it be?

David:             This is a question that I have a hard time answering. Not because there aren’t things that I wish you’d change in SQL Server, but I think back over the years about the things that I did want to see changed in SQL Server and those things are actually coming to fruition right now. One of the big things that I’ve always said is I wish I just had a simple query tool that I could bring up. Once Query Analyzer went away in SQL Server 2000, for years I complained about not having just the simple query tool I could bring up, run a few queries and close it without the overhead of Management Studio. Well, now, SQL Server Ops Studio is out. I used to complain about error messages in SQL Server and those have improved vastly. I think if I was going to suggest something that I would like to see, like an improvement that I would like to see in SQL Server, again, because I’m a performance nut, I would like to see easier access within the native SQL Server tooling to performance information. Like right now I’m doing a lot of stuff with Extended Events trying to pull information out. There are some dashboards that you can download, some community-made dashboards that you can download. There’s an ability to create widgets now, within SQL Server Ops Studio. But again, that’s a lot of stuff that you have to sit down and create on your own. I would like to see some easier tooling for getting performance information out of SQL Server.

Carlos:             Yeah, being able to capture that. I guess that’s the trick, is just how much information are we going to store, versus how much you’re going to look at, because it’s not free to keep all that stuff around. That’s a fine line can be there, sometimes.

David:             Exactly, and I realize it’s kind of a huge request, if you think about it, because everybody’s needs are a little bit different. The way everybody approaches troubleshooting is a little bit different.

Carlos:             What’s the best piece of career advice you have ever received?

David:             Best piece of career advice I’ve ever received is “Stay curious. Don’t assume.”

Carlos:             You’re watching the Dos Equis commercials, right?

David:             Well, I don’t watch a lot of TV, I don’t see a whole lot of commercials, but it doesn’t surprise me that people are using that phrase. This was actually something that a mentor of mine said to me back in 2004, 2005, somewhere around there, but it’s the idea of never stop learning. There’s always another layer. You keep peeling back the onion and there’s another layer underneath that and another layer underneath that. You can always go deeper. There’s always going to be something interesting behind what you’ve found, so stay curious, keep learning. Keep learning new things, different things, keep going deeper in your individual subjects that interest you. That’s what I feel is probably the thing that’s served me best throughout my career with SQL Server.

Carlos:             Very cool. Our last question for you today, if you could have one superhero power what would it be and why do you want it?

David:             Teleportation, without any question or reservation or even much thought. I am an impatient person by nature. I want to be where I want to be when I want to be there, right now, which isn’t to say that I don’t enjoy traveling. I will drive. I think the longest I’ve ever driven for a SQLSaturday was about 10 hours. That would be from here to Kansas City. I enjoy driving, that’s like my thinking time, but at the same time, it would be nice to say, “hm, I need to be in Chicago in, well, right now.” Boom, and I’m there. To me, that would be, that’s I think one of the things that kind of slows us down in life is travel. We have places we have to go, we have things we need to do and if we could cut out some of the travel time, I think that helps. That’s another reason I like remote work so much is I’m already there.

Carlos:             There you go, that’s right. Awesome. David, thanks so much for the conversation today. It was a pleasure having you.

David:             Not at all. Thanks for inviting me and it was great talking with all of you.

Carlos:             I do think this is one of those episodes where it’s hard to nail down, and I’m not sure that we came up with good, hard and fast rules here. We kind of tossed around some ideas, but I think this is one of those topics where context does matter, and so I think coming from a maybe circumstantial approach or better said, you have to include that context and not having that as part of the discussion is probably not in your best interest. So, we didn’t want to throw out a lot of blanket statements, there, as far as “thou shalt always do this”, under these circumstances. However, we are interested, if you do have specific scenarios that you’d like us to explore a bit more, please let us know and we’d be happy to focus, again, on those scenarios and double-down on those, if you will. That’s going to do it for today’s episode, compañeros. Thanks again for joining us. As always, I am interested in connecting with you on social media or you can reach out to me. [email protected] Always interested in what you have to say, what you’re doing, and we’ll see you on the SQL Trail.

Leave a Reply

Back to top