How can you tell when a change is enough of a change? How do you know you didn’t make too big of a change? Steve and I discuss some thoughts around what you might look for when you make certain changes. Some of the components we cover include Max degree of parallelism, cxpackets, virtual log files, the number of log files for a database, backup retention and memory.
“Then I think on the cost threshold for parallelism, again it’s going to take a little bit of knowing your system.”
“What it really depends on oftentimes in most businesses is how much money you are losing for every minute that that database is down.”
“If your system is performing great no reason to even talk about more memory unless it’s growing and you’re planning for the future.”
Listen to Learn
01:37 About the Companero Conference
02:54 Max degreee of parallelism, knowing the sweet spot
06:28 Max degree of paralellism , number of cores
10:08 CXPACKETS Waits
12:05 Cost threshold for parallelism
14:54 Virtual Log Files (VLF), growth and issues
18:53 What makes VLFs a bad thing?
23:23 How do you know if you have enough, not enough or too many?
29:00 Number of log files for a database
30:39 Backup retention periods and scenarios
37:35 More on memory
41:50 Page Life Expectancy
Transcription: When is enough, enough?
*Untranscribed introductory portion*
Carlos: Companeros welcome to Episode 105. It’s good to have you on the program again today. Steve, how is it going?
Steve: It’s good. Today’s topic is on when is enough enough. So Carlos, when is enough enough.
Carlos: Yes, that’s right, so this question was asked to me personally during a session as we were talking a little bit about parallelism and the change made to the cost resolve for parallelism. So the question was, ok now that you’ve made this change how do you know that you’ve raised it too high and so I thought we could talk a little bit about kind of extension of this idea from base lining which we’ve talked a little bit about in previous episodes, kind of best practices and now as we start putting in some of these things some of the sweet spots if you will and how we know that we’re in that sweet spot things to look at. So that’s kind of our topic for today.
Steve: Alright, so do we have any companero shout outs to mention?
Carlos: We do. We want to give a companero shout out to Daniel Temensken. He is a PFE at Microsoft. Daniel says, “Thanks for what you do.” He tries to plug-in when possible for us to keep up the good work, and so Daniel we’re thankful you Microsoft guys are keeping an eye on us and keeping us in line.
Steve: I don’t know about the keeping us in line I haven’t heard any feedback on that side yet. We’ll see.
Carlos: It’s only good and I don’t filter all of that to you, only the bad stuff. Ok, we’re still planning for the Companero Conference in October and one of our speakers, Mindy, she’s been on the podcast earlier and while we were chatting with her she gave us a little information about herself and what she’s going to be presenting at the conference.
Steve: Yes, I think one of the things I like about the conference is the whole single track concept and I’m going to get to see all of the presenters this time. Well, I’m not going to see me unless there is a mirror. But I’ll get to see all the sessions and I’m looking forward to seeing Mindy’s session as well.
Carlos: Yeah, that’s right. I think it’s going to be good. Again, just the ability to be able to review kind of see things from a different perspective. I think it’s going to set up a lot of questions for people to continue their learning. So let’s see as we get into the episode today. The show notes for today’s episode will be at sqldatapartners.com/enough.
Steve: Or at sqldatapartners.com/105 for the episode number.
Carlos: Yeah. I guess let’s go ahead and jump into this conversation. We have a couple of items or setting if you will that we’ll talk about. The first one, let’s go ahead and kind of tackle the max degree of parallelism. How do we know when we’re in a sweet spot?
Steve: Alright, time out for a second.
Steve: I’m getting a bunch of noise coming from somewhere. Hold on I need to close the door or something.
Carlos: Sure, sure.
Steve: Alright. It’s my neighbor has a leaf blower. Can you hear any background noise because of that?
Carlos: Yeah, I was going to say I could not hear it but I was talking at the same time, but nothing really stood out.
Steve: Ok, it sounds like he has moved around the other side of his house now. Sorry about that, so let’s jump back into the max degree of parallelism.
Carlos: Alright, I think this idea so I guess we’ll start with kind of the best practice and that is. What we’re looking at is. I guess we should start this conversation with all of these answers are going to start with it depends, right?
Steve: Yes, it does.
Carlos: And so we’re going to talk about some of these things in kind of lose general generic terms but that doesn’t mean you should necessarily adopt them as gospel. Obviously, your individual scenarios or your environments are going to
dictate some of this and so we’ll try to address a handful of them but we won’t be able to address every scenario.
Steve: Yup, but as far as max degree of parallelism, I don’t like to talk about that alone. I like to always talk about cost threshold for parallelism at the same time because I think the two of these goes so hand in hand. I think that there is a lot of confusion and just sort of downright bad defaults on both of this. And I guess both of these originated somewhere in the 90’s back SQL Server 65 or 7 or somewhere around there. They may have made a lot of sense 25 years ago. But the defaults you have on those today are not so great and I think that it comes down to the fact that hardware has evolved a whole lot since then.
Carlos: Yeah, Adam Machanic has his great line there where he says that these only make sense if you jump to a time machine and go back to guy who developed this at Microsoft, and use his computer to execute your queries.
Steve: Right, right. And unfortunately or maybe fortunately in this case we don’t have a time machine. So the default for the max degree of parallelism is zero which means unlimited parallelism. The default for the cost threshold for parallelism is five which means any query that has an estimated plan of greater than five it will be considered as a parallel query.
Carlos: Sure. The reason for max degree of parallelism is zero is because. As you begin to install SQL Server and you could think about it from their perspective, they don’t know how many CPUs you have, right? And I guess dynamically trying to figure that out and all the different scenarios that you might have as far as how much stuff you have on the box with SQL Server which I think is listening to a degree if you’re probably listening to the show. SQL Servers by itself, again there are going to be exceptions to that and so you’re going to see that conundrum that they are in there.
Steve: Yup, so the max degree of parallelism there is a lot of different opinions out there on what it should be set to. But most of them fortunately are kind of in the same range and typically what I see on this and what I like is close to, and I say close to because there is a lot of opinions there, but close to a max setting of 8. Or a max setting of the maximum numbers of cores you have or the max number of cores per numa node, and whichever number is smaller out of those.
Carlos: Right, so there is a little bit of Math involved there, right? It’s all the variations that are going to play into that. So taking into example that number 8, right, if you only have four CPUs, right you’re number is going to be four.
Steve: Right, but if you have four CPUs that are split amongst two numa nodes which usually numa is not that common with that few of CPUs. But if it was you would want to set it to two in that case. Sort of part of it around there, I mean usually with smaller number of CPUs like less than six or eight CPUs you’re probably going with the number of CPUs 99% of the time unless you have numa enabled. But when you get into SQL Servers with many CPUs this is where it can get really interesting. So if you’ve got a server with 50 CPUs, or 30, or 80 CPUs or even more than that. What could happen here is that a small query that probably shouldn’t go parallel could go parallel and could be run on all of those CPUs. I need to start saying this differently. They may be running all those cores because generally it’s not CPUs. It is multiple cores per CPU. So be run amongst all those cores so and that can be really inefficient. And generally what I see even on
servers with lots of cores is that 8 is a good setting. I mean it’s what a lot of people are recommending. It’s kind of the best practice to use there as a maximum because what you get into is when you have more than 8 there is a lot more work involved in bringing back all of the results from each of those different CPUs or different cores doing the work. There is a lot of work involved in sort of collating the results and waiting and if one of them takes a little bit longer there’s maybe a lot of waste going on there. You’ll see that show up because you may have a lot of CXPACKET Waits which maybe we should take a second to talk about that because CXPACKET is one of those wait types that is oftentimes there is a lot of misinformation out there around it. Some people say, “Yeah, I don’t have to worry about about CXPACKET Waits that’s just means that parallelism is occurring.” For the most part that’s probably true most of the time but you can get into a point where there are things that will cause excessive CXPACKET wait, one of them being incorrect settings on your parallelism. And when I say on your parallelism it’s really on your max degree of parallelism setting and your cost threshold for parallelism setting because those kind of go hand in hand there.
Carlos: I guess if we’re going to jump here so that question is we’ve made the change. How have you decided to make your change? You’ve now made it, how do you know, like how can you determine that you’re in a better place. I think there are a couple of components there. One I think the overall percentage of waits because CXPACKET will probably continue to be in there but you should see other waits to kind of float up. There should be a better mix if that’s the right word if for example CXPACKET waits was taken up 99% of all the waits. That should float down a little bit.
Steve: Yup, and usually what I look at there is as if more than maybe 50% of your waits is based on CXPACKET that maybe there is something there you should take a look at. If it’s more than 90 there’s probably there you should take a look at. If it’s up to 99% then there certainly something you should be looking at.
Carlos: Then I think on the cost threshold for parallelism, again it’s going to take a little bit of knowing your system but you just want to go and take a peek at, which is easier said than done but kind of getting in and knowing those queries what the big queries are and ensuring that they’re still parallelizing, right? And then do you feel like there are queries that are taking up lots of CPU that might benefit from parallelism and then again that’s kind of one of kind of decision as to whether try to tweak that or not.
Steve: Yup, and I think that there is kind of a balance there because you don’t want every query to go parallel. But the flip side of that is you don’t want every query to be single threaded either. And what the cost threshold for parallelism is really doing is it’s saying at what point, at how expensive of a cost are we going to decide that parallel processing in that query maybe better, and the default for that is 5. And if you look at most queries that you are running today I would assume that most people that are running queries or looking at queries for performance, they’re going to have a cost of more than 5. There sure there are some quick ones that don’t, and what that means is most of those queries are going to be processed in parallel. And that may not be a bad thing depending on your server but there are some queries that are low cost but they are greater than 5, so maybe they have a cost somewhere between 5 and 50 or 5 and 60, that they would probably run more efficiently if they weren’t ran in parallel.
Carlos: Well that’s the tricky thing I mean even though they exceed that threshold, parallelism is available to them but that doesn’t mean they will always run in parallel.
Steve: That’s right because you’ll end up with two plans. You will have a parallel plan and a non-parallel plan and with that the query optimizer will decide which one is better. I mean oftentimes from my experience it will go with the parallel plan at that point but not always.
Carlos: Got you. Yeah, again that’s another tough one. The answer there is that you’re
going to have to know a little more about your, watch your environment and see how the changed that you’ve made, how it’s affecting your numbers and then even drilling down to the query level if you will to determine if more changes are needed. So from there we’re going to jump into log files.
Steve: Alright, so the thing that usually comes up around the conversation of log files is virtual log files or VLF. And just a little bit of background before we jump into the enough is enough part of it but the way VLFs work is that as your log file grows, every time the log file grows additional VLFs are created or these are segments inside of that log file that are broken up that can be used individually one at a time. So when SQL Server is writing into the log and it always does this sequentially it will pick a next available VLF chunk that it can write to. It will fill that up and then move on to the next one, and the next one, and the next one after that. And those stay used, well if you’re in full recovery model, they will stay in used until you have a log backup. Now, if you’re in simple recovery model they will stay in used until all of the transactions that are in that like they are using that VLF until all the transactions have completed.
Carlos: And then that can be marked as “deleted” or being able to be over it.
Steve: Yeah, available for reuse basically.
Carlos: Available for reuse that’s a better word.
Steve: So what happens though and this part varies a little bit between different versions of SQL Server. SQL Server 2014 and 2016 did a little bit better job in how virtual log files grow but it used to be that when your log file grew if it was smaller than a certain size that growth would have four virtual log files associated with it. I think that size was around 256MB. Now, if it was between like 256MB and a gig you ended up with 8 virtual log files. And then if the growth was greater than a gig you ended up with 16 virtual log files. So with SQL Server trying to guess based off the size of that file growth how it could chunk that up appropriately so that you get sizes that would be more reasonable to use. And then with SQL Server 2016 and 2014 there was some changes around that so that when log files grow, with the smaller growth sizes you would oftentimes only get one VLF rather than several VLFs at that point. But the problem that you run into is that a lot of people or lot of databases have some growth settings initially and a lot of the defaults would either grow by 1MB or 1% with a starting size of 10MB. And as it grew you would end up with all kinds of really tiny virtual log files. And what that meant is that if you have a big transaction it would have to span multiple of these VLF files that were really tiny.
Carlos: Sure, and again kind of more work because it had to deal with multiple little files to get in and out.
Steve: Yup, I mean depending on how database was created it will oftentimes if you are an accidental DBA you don’t know about VLFs you might never have checked this but I’ve seen some that have had over 100,000 VLFs in a single log file. But why is that bad? Part of understanding when enough is enough is to know well what makes that a bad thing? I mean it’s just a lot of chunks in a big file. But what makes that bad is a couple of things. One is that when a transaction runs it needs to use multiple of those VLF chunks in order to be able to write to the log, and with big transactions you got to have multiple of them in use which may make it harder for the SQL engine to be able to write everything it needs to happen there. But the flip side of that and this is the real killer is that when you’ve got
that many and you try and restore your database SQL Server has to go through and recreate all of those VLF chunks when you do the restore. So part of restoring the database is allocates the data file, allocates the log file, and while it’s allocated in the log file it’s writing all those chunks. And I’ve seen restores of databases that took 8-10 hours because the VLF count, and then in the same database after reducing the VLF to a more reasonable number took an hour to do the restore.
Carlos: Wow! You know, obviously disk speed and all of that comes into play as well, right?
Steve: Oh yeah, and that’s just the same database in the same disk doing the comparison there. I’ve seen it be 8-10 times long because of a lot of VLFs.
Carlos: That’s an interesting point because normally we talk about performance, like we’re talking about application performance. Like my user is trying to get information out of the system, right, that scenario. But in this case one of the bigger performance killers is like an RTO perspective. I now need to restore that thing, “Well, I’m not going to be able to get the performance to meet the expectations of my users”, and that could be a problem.
Steve: Yup, and that’s one of those. I always try and think about that as if the system is down and you have upper management looking over your shoulder or calling you on the phone continuously saying, “Is it done yet?” And you’re just there twiddling your thumbs waiting for the log file allocate and not being able to tell them. I mean you’re thinking, “Oh, it’s got to be done soon.” But it could 6 or 7 or 8 hours before it’s done. I think that’s one of those things that misunderstanding of the VLFs there could lead to what people end up referring to as an RGE or a Resume Generating Event where you oftentimes, if you tell management you got a 2-hour recovery time but it turns out to be 10, that may be the end of your job depending on where you work.
Carlos: Sure. Now I hope that they will give them a little bit of leeway as long as they can get backupped. Now if they can’t get backupped that’s a different story. But yeah, that would be a rough place to be if they were that tight with the leash there.
Steve: And what it really depends on oftentimes in most businesses is how much money you are losing for every minute that that database is down.
Carlos: Exactly, that’s right.
Steve: I remember years ago when I worked at Amazon.com one of the things there that they measure for any outage was how much money did we lose. And if that money is a few hundred dollars it could be very different than if it was a few hundred dollars.
Carlos: Right. Yeah, I know that’s true. I think it helps put things in perspective. And again that kind of goes back to the culture of recognizing the value of things being up and then hopefully if that’s the window and you’re pricing things that way which I think again as administrators we could probably do a better job of saying, “Hey, what, you know what…” And I guess I’ll use the ChannelAdvisor guys those have access to selling things outside is a little bit easier. Customers purchasing products that’s easier to tally that downside or the cost there. But to be able to calculate that and say, “Hey, look you know what guys if we can’t do this then we’re going to lose X number of dollars. It’s going to cost Y to put it in.” The ROI does make sense at that point kind of a thing.
Steve: So then with virtual log files, how do you know if you have enough, not enough or too many?
Carlos: Yeah, that’s a good question. I was thinking about this and you put it on the list. I mean, so in my mind and luckily right, as knuckle dragging Neanderthal I haven’t had too many experiences having problem with this, Maybe at least that I have recognized. I think as long as I have a more consistent size that’s kind of where I feel better about things. What about you, Steve, when do you think enough is enough there?
Steve: Well, ok so really what it comes down to is how long does it take you to restore that database. I mean, that’s the key thing that I look at on VLFs. So if you have an environment where you’re doing a regular restore from production to a test server or a development server, that’s a great way to know. But if you don’t have that hopefully you’re in an environment where you test your backups. And if you have a backup that takes you let’s say an hour to run but 6 hours, assuming it’s similar hardware then that could be an indication that you might have a VLF problem there. However, that alone is not the only indicator. I have a script that I created and you can get to that on stevestedman.com/vlf, and it does the DBCC LOGINFO command and then puts that into a temp table and then does some sort of visualization with sort of a character based bar chart in the query output window.
Carlos: In the result, you could actually see the size of the individual files and kind of gives you an indicator as to how big they are.
Steve: Yup, yup, and with that, and again there’s a lot of opinions out there, it always comes to it depends. But my rule of thumb is that any time it’s over 1,000 VLF files in any one database or any one log file associated with that database that’s something that usually I want to deal with that right away. Anytime it’s over a couple of hundred or maybe 300 VLFs in a single log file that’s something that I like to deal with but it’s not super urgent. And just keeping it somewhere in that range. I think opinions will vary but I think most people who have experience with VLFs would agree that more than 1,000 VLFs can be an issue. And that many will also agree that more than 500 is something that wants attention as well.
Carlos: Right, and I think that’s a great point is that you have a metric to go back on and that is your restore time, right? So that can be your benchmark. That can be your feel good. You know now obviously there’s only so much performance you can squeeze out of that, right? It’s not going to go from 10 hours to 5 minutes I don’t think.
Steve: No. But it may go from 10 hours to 1 hour, and that could make the difference between staying up all night to deal with an issue or at least getting some sleep that night. Now the flip side of that is you don’t want to have too few of VLFs either. I mean, if you have too few like let’s say you only have 16 VLFs in your whole system. That would mean you only had 16 chunks of the log file, and when of them, depending on how things cross over between the VLFs that they would me marked as in use and they would stay as in use until all transactions touching them were complete or until they have been backupped. It’s kind of a balance there.
Carlos: And it also depends on the size, right? I mean if I have a new database, even like your DBA database that you have your own internal stuff in, 16 might be fine.
Steve: Oh yeah, absolutely.
Carlos: Then I have the production system with all my transactions on it that might be a different story. And again, I guess, I lean a little back on that because if I’m growing them in equal chunks, if I grow the log in equal chunks, then I guess I’m kind of trusting a little bit that they are going to grow in the best way, and the equal sizes they have equal number of VLFs so a little protection there and the database kind of indicating how many VLFs there will be for the size that I’ve specified there.
Steve: Right, but I think part of that comes back too to having your log size large enough that it doesn’t have to auto grow over time.
Carlos: Yeah, great point. That’s right.
Steve: The auto grow is kind there as in case of emergency this is what we want to grow but hopefully we’ve got that log file big enough that it’s not regularly growing. So do we want to talk about how we fix those?
Carlos: No, I think we’ve kind of mentioned them in different places or we can comeback on them, I mean, even in your performance tip. I think we’ll say that on different episode.
Carlos: The only thing there on log files is so number of log files. We talk a little bit about sometimes dividing up or creating multiple data files. TempDB is kind of the classic example although it can be use in our other databases as well. But what do you think, number of log files?
Steve: It’s simple, one per database. The reasoning behind that is that log files are written to sequentially and whether you have one or you have 20 it’s still only ever be writing into one log file at a time.
Carlos: Yeah, exactly. So that idea of being able to use thread, parallelize, or how you want to think of that, being able to use each of those files without having to wait on something else won’t apply to the log.
Steve: Right, and I think the misconception there is that if I create two or three or four log files that it will be load balance between them as transactions are written but it doesn’t work that way. It just starts on one and uses until there’s no longer available space on it and we go to the second one if there is a second one. And it really just doesn’t make any sense especially if they are on the same drive. It really just doesn’t make any sense to have multiple files. I don’t believe I have ever come across a use case with SQL Server where I would have recommended multiple log files.
Carlos: Right, interesting.
Steve: So enough is enough, one is enough, two is too many on the number of log files for a database.
Carlos: That’s right. So another I want to go to and that is backup retention periods. We talked about, so ultimately we get into a scenario where our backups might be taking a little bit longer because we have two much, well I guess one history but I guess we’re using the word retention is how long we should keep them for?
Steve: Right, and then there is this sort of pack rat view of that which we keep them forever. And then there is the we’re tight on disk space view which means we keep them for as little as possible.
Carlos: Or just delete the ones that I don’t need so I can make room for the ones that I do need.
Steve: Right, right, so hopefully there is some balance in the middle there that can find. I think that from the DBA perspective, I mean I would like to have as many around as I possibly can because when something goes wrong that you’ll need a backup to recover from for instance corruption, oftentimes people don’t catch it immediately that day. It might be even a week or depending on your monitoring longer than that before you know that there is something wrong that you might need to go pull something out of the backup. Now, usually if you need to do a full recovery from a backup and you’re just going to abort the current database, usually you know those events pretty quick. But it’s the type of event where you realized something was missing, maybe some rows out of a table were deleted three weeks ago and you really need to restore that database and just pull those rows back in. It’s hard to know exactly what that retention should be.
Carlos: Right. I’ve worked for a couple of .coms and another scenario is not just the data but it’s either a data change or the table of table change, right? So a column gets added, a column gets removed, whatever and then all of a sudden because the reality is this change control process maybe got side step or whatever because it’s not good history. Nobody can remember exactly when the change got made and now all of a sudden it’s affecting things, so that ability to be able to go back and restore one from a couple of weeks ago and say, “Well, it was either this way or not this way as of this state.” I feel like been able to at least get things off of my plate by being able to provide that information.
Steve: I think one of the things that comes down to there on the backups and how many is enough is sort of a granularity based on how old the backups are. So for instance for me if I had unlimited space or a lot more space than I ever use for database backups, I probably wouldn’t keep every single backup. And most of the backup solutions that you have built in with SQL Server or other scripts available out there, they don’t really consider the granularity of backups over time, meaning you have a small retention window. After X number of days all backups are going to be deleted, or maybe after X number of days all your full backups are deleted, after different period your differential backups are deleted and after different period log backups are deleted. Hopefully you keep your full backups around the longest because you can’t use the other two types without those. But one of the things that I’ve come across when dealing with corruption is somebody is able to discover a corruption and it’s been in their system for months. And I’ll ask, “Well, do you have a backup from when before the corruption occurred?” And then the answer is oftentimes “No”. But sometimes somebody else will say, “I’ve got a backup from 18 months ago.” Will that help that all? And sometimes depending on how fast your data is growing or where the corruption occurs and 18 month old backup might be nice to at least see what was there or compare to what was there on older data on the database. So one of the things I like to do in a perfect scenario is maybe, like let’s say you’ve got enough space for one month of full backups but rather than keeping around a month of full backups taken let’s say every single day, which most people don’t have the space for that. But let’s say you did I’d rather keep around a week of full backups and then second week would have maybe backups from every other day and then the 2nd to 4th week or the 3rd and 4th week you might have backups once a week and then your retention period beyond that would be maybe a single full backup once a month that you keep around.
Carlos: Something like that, right.
Steve: Yeah, and I know that’s a lot harder to manage, and it might be as simple as you just have a job that once a month it takes one of your backups and copies it off to different network location.
Carlos: Right, we used to do that for auditing like the end of the year backup, you know, database for auditing purposes. So that auditors would come April, we have to keep this thing around for a time.
Steve: And my experience has been working with different areas of management is that oftentimes they don’t always understand exactly what it means when you talk about the backup retention, and when you explain to them that if we have a 2-week backup retention that means we have zero backups that are older than two weeks. Oftentimes that creates a bit of fear there that will lead to possible solutions.
Carlos: And of course one of the classic entrances into the cloud because nobody wants to go out and buy more disk space for just backups. The disk space has become very commoditized, very cheap in the cloud and so being able to store that information for a little bit longer. That’s kind of a lot of people’s first taste with getting a cloud technologies into their environment.
Steve: Yup, and I think that backups are great way to start with it.
Carlos: Well, let’s see we have a couple of others. I think I want to skip to TempDB. Do you want to touch CPU and memory? Or you want to call this an episode.
Steve: Yeah, I think it would be good. I mean you’re saying we want to skip TempDB. Yeah, because enough is enough on CPU and memory, I’d always take more.
Carlos: Yeah, that’s right. So ok, let’s hit on memory then just for a second. Yes, enough is enough, that’s a great question for that and there is this saying I hear somewhere that more memory will cover up a multitude of sins in the coding and all these issues that you could have in the database.
Steve: Yeah, a great way to fix I/O problems is with more memory.
Carlos: Yeah, exactly that’s right and the database just lives in the memory then no I/O problems.
Steve: So interesting on that I worked on a SQL Server on a few years ago where it was having some performance problems and it had, if I remember correctly, 64GB of RAM and they increased the memory from 64GB to 512GB of RAM.
Carlos: Oh, wow, so that’s substantial.
Steve: Yeah, very substantial.
Carlos: Exactly a license leap I think. I feel like 128GB is the.
Steve: We’re already on Enterprise Edition. Yeah, but that kind of a jump and at that point in time I think the cost for the memory was somewhere around $10,000 +/-. But that basically got rid of all the I/O issues that we were having there. Well, it still had a few I/O issues when you restarted the instance because nothing was cached in RAM. But once the instance is up and running for a bit it had so much stuff cached in RAM. I mean it was so much faster because it never had to go to disk to get anything. Of course it had to go to disk when there are rights but most of the performance issues around reads on this database, and it just took care of all the performance issues at that point. That was probably 3 or 4 years that continued to run without any significant performance issues simply by adding that much memory. Now, the side was effect was it also made it so that some of the developers not have to worry about performance tuning. So that when the database eventually grew to the point that that wasn’t enough memory for it, well they may have had more difficult performance issues to deal with at that point.
Carlos: Sure, and then it all comes down crashing down at that point. You’re probably talking about another system to go to the next level of memory at that point because computers are getting more robust. You know, once you start talking about terabytes of memory those are different systems.
Steve: Yup absolutely, so the question on how much enough is enough. Well, on that specific system where we set to 512GB of RAM. The thing I noticed that when it was normally running, for probably the first year that that was running it never exceeded about 400GB of memory used.
Carlos: Oh, that’s interesting because that was probably the size of the database.
Steve: Yeah, when on that one when everything was cached up and whatever the size of the database plus Temp tables and everything it was using, I mean it really didn’t exceed 400GB. But then a few years later it grew and it eventually got up and hit that limit but what that told me, when sort of it had that flat line or right around 400GB just sort of it, was that we perhaps bought too much memory and I kind of bite my tongue as I say that because it’s hard to say too much memory. But the fact that we never used more than 400GB indicated that if we had put 400GB of RAM in there that would have been enough.
Carlos: Sure, the application would have worked just as fine.
Steve: At that point in time.
Carlos: Yup, at that point. Yeah, so how do you know when enough is enough? I mean, obviously I think there is the indicator that people have lots of different thoughts about but the Page Life Expectancy (PLE) which I think has kind of been kicked around a little bit. I think for better or for worst somebody wrote a whitepaper at Microsoft to kind of came up with a recommendation that was quickly adapted as the standard. So in environments where we don’t have 512GB of memory, how do we know when enough is enough?
Steve: Yeah, and I think that comes down to sort of balancing the Page Life Expectancy with the page faults and knowing when something has to be bought in from disk versus when it’s able to be read from memory and looking at the overall performance. I mean if your system is performing great no reason to even talk about more memory at that point unless it’s growing and you’re planning for the future. But if you’re having performance issues there, whatever it is, whether you’re doubling from 4GB to 8GB, or doubling from 32GB to 64GB. Memory is oftentimes a cheap way compared to alternatives to improve that performance. So one of the things I like to look at because you brought the Page Life Expectancy was to watch how that Page Life Expectancy grows over time. One of the things that if you chart it over time and how it’s growing. If you’re running a regular CheckDB job, oftentimes that will skew your numbers because when CheckDB runs it of course has to bring the entire database in the memory bit by bit. And oftentimes when that happens you’ll end up pushing a lot of pages out of memory and it will skew the numbers on your chart there. But if you weren’t running CheckDB, how long does your Page Life Expectancy continue to grow? And if you chart that you could can sort of see pattern where it will grow and grow and grow and then at some event happens. And that event might be CheckDB or it might be nightly ETL process or it might be some job that runs that has to pull in a whole lot of data on a regular basis. But if it continues to grow and grow and grow throughout the day until you hit a few of these events that tells me that most of the time you’ve got enough memory and it’s only those certain events that you hit that would have prevented it from being read from disk if that data was cached in memory. And if those things are happening in the middle of the night or they are not impacting the actual business at that point. Yeah, no big deal, I wouldn’t worry about it. But if it’s the kind of thing where it’s an ETL that kicks off at midnight and it runs through until 10AM and it is impacting business in the morning well you may want to consider more memory to help with that. Let me just finish one thing on that first. So that being said though, you would want to understand that it’s definitely a memory constraint before throwing memory at it. Because I’ve seen people throw memory at long running ETL jobs and then they found out that it has no improvement because the bottleneck is not the memory, it’s something else in the system.
Carlos: I guess I’ll kind of go along with that a little bit in to take a peek at, obviously you don’t want to take a look at memory consumption, but the thing that I found out helpful is also just taking a look at the chattiness of the disk. You know, one of the things that I have found is that databases will kind of get grouped together and you have one database that’s kind of like the mission critical or very important. It’s rising in this rank, maybe not mission critical but it is becoming more and more important. But then you’ve talked on these other databases that are kind of one offs or not as “important” and then you found out that they are the ones that kind of being chatty, or they are taking up more space than they should. Those are the situations where for the system that you’re most interested in maybe you have enough and it’s just a matter of moving that lesser database of somewhere else so that the system that you care about is not impacted.
Steve: Yup, that’s a great example and that’s a great reason to have different instances running in different VMs, and that you could constraint some of those less important databases to not chew up the memory that way.
Carlos: Right. Ok, so I think that’s going to be our episode for today – Enough is Enough.
Steve: Enough is enough.
Carlos: Yes. Do you agree with our list? Give us your thoughts, let us know. You can leave you comments on social media or the website at sqldatapartners.com/enough or on sqldatapartners.com/105.
Steve: Oh, I just totally tripped there. Didn’t I?
Carlos: That’s fine.
Steve: Sorry, Julien. I’m making this hard for you to edit today.
Carlos: So let us know if you agree with our list. If you want to leave us a comment, you can do so on social media or you can do so at sqldatapartners.com/enough.
Steve: Sorry, I am just blowing it today because I was thinking about our LinkedIn and Twitter at the end. Let’s do one more take on that because I just spaced. Oh, sorry.
Carlos: No, that’s fine. I guess is there anything else we need to include before we end or do we want to wrap up there as well?
Steve: No, I think we could just wrap up there because we’ve kind of already included everything.
Carlos: Ok, sounds good. So that’s our list, let us know if you agree. If you have other comments you can reach out to us on social media or you can leave as a comment at sqldatapartners.com/enough.
Steve: Or at sqldatapartners.com/105.
Carlos: You can always connect with us on LinkedIn. We love hearing your comments and connecting with more of you. You can reach me I’m @carloslchacon.
Steve: And you can get me on LinkedIn @stevestedman, and we’ll see you on the SQL trail.