If I were a betting man, I would wager your disk IO system is probably the biggest bottleneck in your SQL Server right now. Sure, there are lots of moving pieces, but if you can tame the IO situation, you are putting yourself in the drivers seat. So how do you go about putting together the pieces you need to identify your IO constraints?
I chatted with Theresa Iserman, a Premier Field Engineer at Microsoft about some of her thoughts and why IO in general is an issue. She gives us some guidelines and even some metrics to check out. Join us in the conversation, and I’ll see you on the SQL trail . . .
Transcription: IO Tuning
Carlos L Chacon: This is the “SQL Data Partners” podcast. My name is Carlos L. Chacon, your host, and this is episode 23. Today we’re talking about IO tuning, and I have as my guest, Theresa Iserman.
Theresa is a SQL Server Premier Field Engineer with Microsoft. I’m always glad to have the Microsoft folks on the podcast. They have quartered the fore front of a lot of what we do as SQL server professionals.
Today we’re going to talk a little bit about IO tuning, how to capture it, how you’d go about deciding what changes you need to make.
Theresa makes a couple of very specific counters or measures in some of her comments. Obviously, you’re in your car, you’re walking your dog. You’re not going to have an opportunity to write those down so we’ll have those short notes available at sqldatapartners.com/podcast. You can check out today’s episode, episode 23.
In addition to that, I’m actually going to put together a template based on some of the counters that she suggested.
That way if you want to go ahead, just put that right in. You just start taking a look at some of those items and begin measuring that over time. Compañeros, it’s always good to have you. I do appreciate you tuning in.
If there is something that you’d like to hear about or something that we might need to improve about one of our episodes, you can reach out to me on twitter, @CarlosLChacon or you can reach me via email at [email protected]. Let’s get into it and welcome to the show.
Children: SQL Data Partners.[music]
Carlos: Theresa, welcome to the program.Theresa Iserman : Thank you Carlos.
Carlos: It’s great to have you. We happen to meet up in Portland, and I saw your session with a very interesting name that I called “Anatomy of an IO tuning exercise,” and I thought, oh, man! A lot of us have IO problems. We know that our systems are limited by the resources that they can run.More often than not, IO or those disks are probably going to be the biggest bottleneck in our systems. I want to have you on the show today, to talk a little bit about this, your experience with tuning some of these IO problems.
Interestingly enough, there’s a side note here, is I’ve done a 20-something of these episodes and you’re our first female guest.
Theresa: Oh, wow.
Carlos: [laughs] You just happened to beat out Jessica Moss. In fact, I think maybe I’ll have to have you ladies back to back because, just from scheduling, we couldn’t quite get her on.So we talk about IO and, looking at how we’re going to go about tuning some of that, what are some of the components? What are some of the system processes that we had to be familiar with to help us really identify what’s going on in our systems?
Theresa: Yes. We want to look to see where we have a problem. A lot of times, when we identify an issue related to IO, we’re looking at latency. Latency is a measure of time in terms of the time it takes to return a page or an extent from disk to memory. The amount of time that that takes really equates into time that our users are waiting.We want to know what’s normal in our systems by performing baselines in terms of — we want to know what’s normal in terms of latency. When we identify problems, many might tell us the server is slow. We want to identify where that problem is.
We would start typically with — everybody has their own tuning methodology, but some of the tools that you could start with would be wait stats. That would give you an idea of what your system is waiting on. If you identify a wait that’s related to IO, like write log or page IO latch, you might want to dig in and look at virtual file stats, which is a dynamic management function that gives you file by file information about your leads, rights, and stalls. You can look at on latency by file.
Carlos: Let’s actually dive into that a little bit, into those waits. Why would my SQL server be waiting? You mentioned a couple of things. It could be waiting on IO, you mentioned a couple of wait examples. Under what circumstances would, my server be waiting?
Theresa: We always have waits in our system. They happen by design because we have lots of things happening in our system and resources are unavailable to process the things that that SQL needs to do at a time. We have a process called scheduling.We all know about processors that we have in our servers. Each of those processors is associated with a scheduler. The scheduler queues up threads, and these threads work their way around different states on the scheduler until all the resources that they need to perform their task are available.
Any time a resource is not available, the thread is not on the processor and running. We’re waiting until, or the thread is waiting until what it needs to perform its task is available.
Carlos: These wait stats, and ultimately we are using some dynamic management views, some DMVs, and SQL servers collecting the amount time that it’s waiting for all of these difference components.
Theresa: Exactly. We have a couple wait tests that are particularly helpful. I mean a couple of DMVs that are particularly helpful when we’re looking at wait stats.We’ve got sys.dm_os waiting tasks. This is, typically, the first DMV that you want to run when somebody tells you that things just started.
This is a picture of what’s happening right now on your server. It’s not cumulative like many of our DMVs are. You can also look at sys.dm_os wait stats which gives you a picture of all your wait stats and since the server was last restarted, or the wait stats were cleared.
Typically we want to look at this DMV with what we call Delta. So we want to take a snap shot at one point and then a snapshot at another point and compare the difference so that we can get an idea of what we’re waiting on that particular period of time.
Theresa: We have to do some computations on this DMV to see if to remove benign wait stats, like if people wait for, command in our code, we don’t want to include that. We don’t necessarily need to include lazywriter or checkpoint waits.I think Paul Randal has a really great query that you can use that includes information about the derivations that are available and the wait stats to filter out. If you search on the sqlskills website, the waits…
Carlos: Yes, that was one of the ones I was going to bring up. Ultimately, with a lot of these DMVs the work has been done or kind of getting to the nuts and bolts of what you’re looking for, is out there available from our community. I think even his post is called “Tell Me Where It Hurts”.
Carlos: We’ll have that available. A couple of these scripts and some information we’ll have available at sqldatapartners.com/podcast. You can look at today’s episode on IO tuning. We’ll have all of that information available so that you can go and grab those DMVs and get started.Now that we have this information, we’ve used the DMVs to collect the wait information. You mentioned ultimately we’re going to be looking for those IO specific waits. Then we looked at another DMV you mentioned was using the file stats comparison and getting that file by file comparison. Why is looking at that information important in your database system?
Theresa: That’s a great question. We want to know exactly where to pinpoint our troubleshooting efforts. We want to know which file in our system is hot, which one is the one that has latency, that’s affecting our workload, and what our storage can deliver.If we identify where the bottleneck is, then we are able to remediate our issue by drilling down further and finding out if there’s a specific query that’s impacting things if a query needs to be tuned, or what we need to do if that query has to stay the way it is, if we need to make some adjustments on our storage side.
We go into virtual file stats. You can look at one individual file or you can look at all of your files in your system to identify reads and writes on each file, IO stalls. We can look at your transfer size. The biggest thing here is latency. You get to look at latency without going to look at perfmon.
Carlos: I found that this is extremely useful. Generally we tend to think, have several databases on our system and we’re not sure which one is being the slow guy here. We tend to think of our larger databases as being the ones that get the most usage or whatnot.I’ve been in a couple of scenarios while, looking at the file stats of the database, I found a smaller database was actually taking up a much larger percentage of the IO. When we actually looked at doing some things including removing that smaller database that we could put somewhere else, the whole system benefited from that move.
The size of the database may not be a great correlation to the IO that that database is using.
Theresa: Right, that’s a great point. That’s a great point. We can’t make any assumptions when we are doing troubleshooting. We really have to let the data determine the direction that we go when we’re looking at things.
Carlos: We collected some waits stats. We’ve looked at our file stats. We try to figure out which of our data files or data buses are calling the most problems. What other collection tools do we have available to help us continue to gather stats?
Theresa: Another tool I like to use Perfmon. Perfmon is a Windows tool that lets you monitor operating system and SQL resources so we can diagnose performance problems. We can use it both in real time and by collecting data that we can use later to do some analysis.Let me say one more thing about Perfmon.
Theresa: With regards to Perfmon, we have all this counters that are available to us and the counters let us monitor specific activity related to a subsystem or a SQL in general.We can look at things like SQL logs, SQL windows processes, process information and logical disc and these are all objects that are available in Perfmon.
We look at these counters and we can correlate information from the counters to tell us, give us a picture of what’s happening on our system and we can use them specifically related to IO to look at the logical disc counter.
There’s hundreds of counters to look at in Perfmon, but when we’re looking at IO, I want to highlight a few in the logical disc object. We were talking a little earlier about how we want to look at latency.
We also want to look at latency. This is the amount of time it takes to deliver an IO from disc to memory. We also want to look at high-ups and we’re going to look at throughput. At least these perfmon objects are not called latency, high-ups and throughput — they’re all part of the logical disc object, and we have counters. You just have to know by getting experience which counters are associated with latency.
For latency, we have three counters that we want to look out and it’s average to seconds per week, write and transfer.
If we’re looking at high-ups, we want to look at disc reads per seconds, disc transfer per second and disc writes per second. We’re looking at here how many IOs can we read right in a second and then transfer as a combination of reads plus writes.
One last one is throughput. Throughput we can think of is the quantity of stocks that were put through the pipe. It’s technically the number of bytes at SQL reading from storage or writing to storage.
The counters that we look at here are disc bytes per second, disc read bytes per second, and disc writes bytes per second.
Carlos: I think that’s where we get a good correlation to the files stats to see which files are moving data in and out of the system more often in others.
Theresa: Right. We can use all three of this metrics to compare to the file stats especially latency that’s going to be the first place that going to give us an indication of what’s problematic.When I think of latency, I think of kind of acceptable values for latency and I know that if I see something that’s higher than what I would consider acceptable for my system then I would want to drill in there.
A general rule of thumb, which might be different on each person’s system, is that for data files, you want to have for all TP system probably less than 10 milliseconds of latency and then for log files you want to have zero to two milliseconds of latency, and then for warehousing workload, geometrics are going to be a little bit different because you’re looking at the quantity of information that could come over that could be delivered.
Theresa: When you see numbers that are higher than these base lines then you know that there’s some pertain place to take in.
Carlos: You mentioned each environment is different, but I know you’re in a lot of different environments. Let’s say there on some older hardware, are this speeds for they are the new faster stuff or you feel like even some of the older disc? Let’s say five-years-old, could still get this kind of numbers?
Theresa: I think that you want to target this for any system that you’re working for. Obviously, faster disc is going to deliver different metrics, but the based off process is that your users are going to be waiting for a certain amount of time, and for each system you have to think about how long is it OK for users to be waiting.It might be OK that your users wait 20 seconds for a particular request and we have to work with our application means to find out what’s acceptable on their side, but if we’re looking at different tiers of disc, this is a general place that would give you an idea where to drill in.
Carlos: Sure. Now, I know opening a perfmon for the first time, there are thousands of cameras that seems like it can be a little bit tricky to get started. Obviously, you’ve pointed us in the right direction from IO perspective, but I’m wondering have you ever or do you use the PAL tool at all?
Theresa: Somebody in my team actually develop the PAL and I haven’t used it a ton, but it’s a great starting place and I guess some people use it more for just starting, but it gives you an opportunity to put the results of your perfmon analysis into this tool and get some recommendation and graphs that help aggregate information to do your troubleshooting.
Carlos: Right. Davis Plus, I think he’s kind of taken ownership of taking the SQL server counters that are use in the PAL tool. David, if you’re listening, we still love to have you on the show. The other nice thing that you mentioned was the charting with that sometimes even just looking at the pure numbers.As we’re the detectives, the numbers can make a little bit more sense to us, but sometimes as we then tried to convince some span to be in this area because we need to increase our server capacity what have you. Having those charts I found can be very, very helpful [laughs] because they provide some thresholds for you like these numbers that you gave and you can give that all over a week or even a day. Then show them, “Look, this is what’s happening during our peak load.”
When you’re running these specific reporting queries things like that, and then giving them that visual representation like it kind of goes, “Oh. I get it now. Let’s go invest in another disc or what have you.”
Theresa: The numbers are what tell the story. I think when you can substantiate what’s going on your system with some hard metrics, it really helps drive decisions that business is making about the investments that they’re putting into hardware and systems.
Carlos: We collected this information. We look at our wait stats, our DMVs. Now, we have some perfmon metrics. We think that we have identified a specific area that is the problem. How do you go about identifying the changes that you’ve made if they’ve helped or not?
Theresa: Great question. Whenever we’re doing performance troubleshooting, we want to have a methodical way that we do our analysis and then we want to methodically go through remediation.We want to analyze with the efforts that we put into remediation, the same way that we analyzed our issue. We have this process that you looked at to find out what your issues were and then after you do your remediation because of the same process and you compare the numbers to see what the improvement or degradation was.
Base lining is something that really helps so that you know what’s normal in your system, but if any absence of base lines if you identify a problem, you look at where your numbers are during the problem identification process and then after remediating you’re going look at what the values are to see if you have some improvement.
Carlos: Maybe honing in on the performance stats, for example that you gave us, is a good place to start, but occasionally you’ll have other metrics that might change because of something that you’ve done and you may see. I guess my question is this, how do you then balance an increase let’s say in CPU, for example, if which you’re after is a decrease in IO? How do you balance seeing one number increase while another number gets lower?
Theresa: Over time you get familiar with what you can correlate, but performance troubleshooting is a process that I heard somebody describe like the game Whack A Mole, where your immediate one problem and then another one comes up so you focus on what you’re trying to remediate first and then it’s expected that you’ll probably see something else comes up because really veer cases are tuned perfectly.
Carlos: That repetition or repeating process one step at a time and continue to measure as you make gradual improvements.
Carlos: What are common scenarios that you have seen specific to IO operations that folks might be able to start digging into or looking at a near systems to see if they are experiencing this?
Theresa: One of the things that I demonstrated in the presentation that you and I were together for was the location of log files. We were seeing log files and data files on the same storage. The characteristics of what you need to do for log filing into right sequentially.If you’re using spinning media, you know you’re writing sequentially to that disc spinning around, but then if you’re accessing your data files for an oil TP workload, you’re typically doing random access on your spinning disc.
That disc had is going to be moving all around that disc for what you’re doing in your data files, but it’s going to need to write sequentially for the log files. Those two things that you’re trying to do were kind of conflict.
Your log writes are going to slow down and so we hear a lot about how we want to separate our log files and data files on spinning media that’s because the characteristics of the workload is different for each file so that’s something that we see a lot as people configure systems.
I would say that’s probably one of the common things we see storage that can’t deliver at the speed that we need it to for the workload.
Carlos: They’ve probably taken some defaults there. It should take a little time to divide those out.
Carlos: Thanks for this information. Of course, we’ll be posting some examples and some access to the DMVs on the website sqldatapartners.com/podcast.Now, we want to switch gears a little bit and we’re going to go to what I’m going to call the SQL family portion of the program. We like to get to know you a little bit better and kind of some of the things that you’re doing in your career.
The first question I have is what’s your favorite tool? Normally, we would like to talk about SQL tools here, but I’ll expand that. This could be a free tool or paid tool. How do you use it? What do you like about it?
Theresa: I love perfmon, we talked about that a little bit so I focus on another tool called SQL diag. SQL diag is a tool that comes with SQL server, and we use it a lot in my job for diagnostic information.When I work with a customer, I don’t have the benefit of going back to the system like we do as a DVA on premise where you have a problem and you say, “I’m going to look at this configuration setting or I’m going to look up what’s happening on this drive.”
In my job, I have to collect all the information upfront. SQL diag is one of the tools that I use that gives me a whole bunch of information in one fell swoop. Doesn’t take a lot of time to gather, it’s fairly lightweight and it’s free and comes with SQL server.
Carlos: It used to take profile traces as well, does it still do that?
Theresa: It still does. You can get a profile or trace. You can get perfmon information. You can get your application logs, your event logs, and your SQL logs. I don’t capture all of that, but some people do. I like to capture the perfmon and trace information correctly because trace can be a little bit of a heavyweight tool.
Carlos: In the Microsoft group, you visit with a lot of customers. You’ve been on a long journey, spent a lot of time in SQL server. Take us back a little bit. What’s one of the best pieces of career advice that you’ve received?
Theresa: I think to rely on people in the community, to find a mentor, to give back. I think I thought when I first started that I could just figure out this all on my own. What I have found is that as I’ve gotten to know more people in the community and had different mentors over the years that we can’t learn it all on our own and each person has experiences that they share that we haven’t had.So we can add so much more value to the customers that we’re working with by knowing what other people experiences are. They help us point us in the direction when we’re solving a problem.
What we learned we can share back with other people so there’s this really nice symbiotic relationship, but finding a mentor is really important.
I still have mentors now, but as I was getting started I had a mentor that helped me with some B6. Now, I have mentors that help me learn about different scenarios I haven’t been exposed to and people asked, “How do you find a mentor?”
You can ask people to be a mentor. I’ve read people stories, people blog a lot in our community about their experiences. One person that just starts speaking at SQL past, for example, I mentioned that she asked to help her out and she’s got a long way from just having that experience.
Carlos: Those in the SQL server data community are extremely fortunate because as you mentioned folks are extremely generous with their time. If you’re willing to put it a little bit of effort to reading the articles at least trying to understand the concepts then they are people who help you, you’ll get into the finish line.
Carlos: Thanks again for being here. We do have one last question for you and that is, if you could have one superhero power, what would it be and why do you want it?
Theresa: Oh, what a great question, Carlos.
Theresa: If I’m thinking career wise, I would love to have a photographic memory…
Carlos: Oh, here we go.
Theresa: Because our industry is changing so fast and I love learning and I wish I could keep up with everything as it comes out. I feel like we really have to focus on one area. If I had a photographic memory, I could know all sorts of things. I could be all knowing.
Carlos: There you go. Lots of MSDN articles rallying around there.
Carlos: You mentioned career focus. I guess there’s another power that you’d want that’s not so career focused?[laughter]
Theresa: I guess to see into the future and I wouldn’t have to think about career at all. We could just know what the lottery numbers were to pick.
Carlos: [laughs] There you go. Yes, you can have too many of those by going around. It would be you’re sharing a lot all of your winnings with [laughs] the same folks. Theresa Iserman, thank you so much for being here.
Theresa: You bet Carlos. It’s a pleasure talking to you. Thanks for having me.
Carlos: It’s been great. So, compañeros again sqldatapartners.com/podcast, we’ll have additional information. If you missed something, you saw the question. I will have Theresa’s Twitter information. Sure, she’ll be willing to answer any questions that you have and we’ll see you from the SQL trail.[music]
Children: SQL Data Partners.