As database administrators, we have to live with virtualization. As a consultant, I can’t think of a single environment I have been in that didn’t have some of the SQL Servers virtualized. Troubleshooting issues in a virtual environment can be a bit tricky and in today’s episode I chat with David Klee about his experience and how you might learn from his experience.
What has been your experience with virtualization? I would love to hear from you in the comments below.
Carlos L. Chacon: This is the SQL Data Partner’s podcast. This is episode 31. I am Carlos L. Chacon, your host. I welcome all my compañeros to the show.
Perhaps you’ve listened to several episodes or perhaps this is your first one. Either way, thanks for joining us on the SQL trail.
This podcast is dedicated to SQL Server-related topics, designed to help you become more familiar with what’s out there and how you might take advantage of them.
Of course, the SQL community is just a great, big family and we like to connect with others as well. Today our topic is virtualization. I am joined by David Klee.
David Klee: Hello.
Carlos: He is the owner of Heraflux Technology, a shop devoted to virtualization and SQL Server infrastructure. So we’re going to be digging into his thoughts on SQL Server virtualization and how he deals with that with other organizations.It’s good to have you on the show again, compañeros. Welcome to the show. David, thanks for being on the program today.
David: Thanks for having me. This is great.
Carlos: Yeah, always exciting. Virtualization’s always one of those things that comes up. I think a lot of people are starting to do that.With your focus on virtualization in the SQL Server space, I thought, “Who better to come and talk with us a little bit more about that topic, and some of your experiences with that.”
Again, some people might be familiar with virtualization, but to set the tone, let’s talk a little bit about virtualization, and what it is, and how people are using it.
David: You bet. In a nutshell, virtualization is, in my opinion, it’s an extension of the physical machine hardware itself.It’s technically an added layer on top of the hardware that allows compartmentalized operating systems, and whatever apps you want to run on there, it allows them to coexist on the same physical machine without them being aware that the other apps or operating systems are there.
The way I look at it, from that perspective it’s a single physical machine, a few VMs, and life is good. The fun part there is when you start to look at in en masse, you start to remove the definition of the physical server.
I start to look at it as a collection of physical machine compute resources that you have available to you, and queues for these operating systems, and their respective apps, to get to it.
If you need more resources in the physical machine virtualization cluster, you add more and tell it to load balance. You can assign these resources to a VM, and every one of these VMs has to go through the queues to get to those resources.
Carlos: That’s interesting. You talk a little bit about those queues. I guess my thinking of virtualization was that I just take a big box and I carve it up into chunks and make little boxes out of it. [laughs] Then I guess there’s a few more components to that, the queues being one of them.
David: Yeah, there’s a few more in there. If you were to just strictly carve it up, and this is what a lot of folks do. Let’s say I’ve got a 16-core physical machine. Does that mean I can carve up eight two vCPU VMs and that’s it?Not necessarily. You might actually be able to get away with 20 vCPUs, or even 30, being allocated on the machine even though you might only have 16 there to use. That’s where the resources in the queue model really works, because it’s not just a one-to-one map or X amount of this.
I can over-provision and over-commit. Sometimes you can do it safely. Sometimes not so much. This is one of the biggest challenges with database server virtualization. It’s that over-commit versus performance.
Carlos: Right, and I think that would be a common scenario in that from the VM side, or the network guys, or whoever is controlling the infrastructure there, the infrastructure guys, they’re like, “Oh, hey, we can get more servers into that,” but the SQL Server guys are like, “Well, hey guys. You’re killing me here.”
David: [laughs] Well, that’s the fun part. The history of virtualization started with leveraging virtualization to consolidate the data center, and it’s real fascinating because that consolidation mindset persists today. When you look at it, the VM admin, they’re told by the organization to squeeze as much onto the infrastructure that we can afford.From an DBA perspective, they generally don’t have a lot of experience with the infrastructure. Conversely, the VM admin don’t usually have a lot of experience with the apps that they’re virtualizing.
What worked back in the day for file servers, print servers, active directory, maybe web servers doesn’t necessarily work for enterprise-level database servers today, so you end up with a conflict of interest where one side doesn’t know the best practices for the other.
Carlos: How do you go about or engage with clients or whatnot? Again, these are SQL Server folks listening to this podcast.As they’re going to engage their VM admins, what would be some of the tactics or some of the questions that they might be asking to insure that they’re getting…And even some of the data that they might need to provide to say, “Hey, VM admin, this is what I’m seeing.” Or, “This is what I need from you to make sure that I’m not being impacted.”
David: Yeah, it’s all about the education first and foremost. What the DBA needs to know is how to translate what they know and what they need into the language that the sys admin really has. The DBA knows transactions per second, disk stall. VM admin doesn’t know that.What you have to do is translate that to disk latency, IOPs, CPU time, things like that. The stats that a DBA need from the VM admin, it’s essentially what VM runs the same host, what’s the performance characteristics on that host?
Am I under memory pressure? What’s going on on this machine in any given time? For example, let’s say I’ve got five SQL Servers on one physical machine.
If CFO comes along and runs a really big financial report, and it spikes the CPU, and the disk is just churning, what is that doing on the other VMs on that same physical machine? The DBAs don’t know, because it’s all isolated and compartmentalized.
All you know is that stuff is just running slower. So they need to ask what else is running at this time. What are the resources doing as they basically get allocated from VM to VM?
Carlos: Would you normally provide that just like some PerfMon metrics?
David: There’s a lot of different layers in there. There’s PerfMon inside Windows, inside each VM. There’s a number of things within side the SQL Server DMV — the DMV scripts, all over the place. Then you’ve got inside from system center with Hyper-V or V-center within VMware. You may also need to get switching infrastructure — networking, or fiber, or both. There’s storage performance characteristics direct from the SAN.If you have something with application-level metrics — some of these utilities like New Relic, you can actually see how long it takes relative to an app click — is it application server, is it backing database server, is it actually the browser? You’ve got to put all these metrics together and once you do that, you can paint a really clear picture of what’s going on.
You let these trends come out for you. I don’t have to make any assumptions. The data there gives you the answer. You just have to put it all together.
Carlos: Right. I think that so many other topics you’ve covered on this podcast is the need to communicate and work with the other groups in your organization.
Carlos: I think virtualization is obviously something you can’t do all by yourself, right? Particularly unless you own the VM stack there as well, but if you’re dependent on that you’ve got to agree and get along with those team members, to share some of that information.You mentioned a couple of things. One VM in the SAN is involved, which may or may not be the same group. Then the database — that’s switching. That’s a lot of organization and orchestration that needs to happen.
David: In the networking group that may be slow for availability group transfer, and backups may have competing schedules. And DR, which may cause other problems. It gets real fascinating real fast.
Carlos: I think this is one of those areas, where you can, as a database administrator, you can begin to separate and provide leadership. That role is changing a little bit and we can go and say, “Well, I’m not just going to take another database, I am going to empower or provide information to these other groups, so that we can all provide a better experience for our clients, for our customers — whoever they may be — whether internal or external.”
David: Exactly. The role’s changing and in some really interesting ways and a lot of DBAs are either resistant to the change or don’t know where to begin with it. It’s really interesting.
Carlos: Having said that, with virtualization and all of these things and the DBA role, where do you see that heading?
David: The DBA is always going to be around. I see it becoming even more important to an organization as time goes on. The challenge here is that, 10 years ago, if I were a DBA, there’s a good chance that I knew the NUMA configuration in my machine.I knew how to configure an HBA. I knew disk configuration. In virtualization, and now I don’t really need to know that. If you’re a DBA, you need to know I need these many CPUs, I need this much RAM, I need the storage presented.
If you go forward, look at how things are moving. You have software to find everything, you have storage, networking, security — all this stuff. A lot of the requirements to be a DBA from back then aren’t around today.
Tomorrow it’s going to change even more. Where I see a DBA going is less from an administrative and an availability standpoint, more towards an orchestration and a performance standpoint. Because the performance challenges will always be there and the goal — in my opinion — is to make things more efficient and not just perform better. If I make it more efficient, I use less SAN, I use less memory.
As a result, there’s a good chance that your app gets faster. But, at the end of the day, as public cloud becomes more and more here, essentially, if I can cut down the amount of CPU that I need and have just by understanding the performance characteristics of a data model, store products, or [inaudible 11:22] things like that, then, if I’m moving that to the public cloud at some point, I might be able to get away with a smaller footprint database or a smaller VM.
David: That saves it’s money. It makes things faster as you save money and that makes you a hero to the organization, [laughs] it makes you even more important.
Carlos: Yeah, exactly. I think taking on or embracing that change will only give you advantages as the organization continues to change.
David: Exactly. And that’s actually how we named our company, believe it or not, Heraflux Technologies. You know that phrase, “The only constant is change”?
David: …that was coined by a Greek philosopher by the name of Heraclitus.
Carlos: There you go. Now you know the rest of the story.[laughter]
Carlos: [inaudible 12:14] Very good. I’m sure that most organizations now have some form of virtualization in their environment. What current hang-ups do you see or…I guess we’ve talked about some of the issues that we can have but what are some ways that organizations are not doing virtualization correctly?
David: A lot of it goes down to those queues that I mentioned. A lot of is I’m trying to do more with less. Common sense would tell you more CPUs is faster, right?
David: Well, not necessarily in virtualization. It sounds weird. It’s counter-intuitive to everything that a DBA has ever learned about infrastructure.One example from about six months ago — a company had a 32 vCPU VM running dog slow and they couldn’t figure out why. I went and did a quick PerfMon analysis trying to figure out what’s going on. I told them, “Take this VM. Make it 3 vCPUs”
So, we went 32 to 3 and they wouldn’t do it. They wouldn’t do it. [laughs] I finally had to say, “If this doesn’t make an improvement, I’ll buy you a steak the next time I see you. [laughs] Conversely, if I was right, I get a steak.
Carlos: There you go.
David: The next maintenance window, we shut it down, we knocked it down to three, and we turned it on. In their nightly run-time and on most application clicks throughout the day, 3.1 times performance improvement.
David: That’s pretty significant.
Carlos: That is significant.
David: When you can get this by reducing the footprint of the VM, there’s less that has to be scheduled. If there’s a lot of scheduling already happening because it’s a busy environment, you cut down the scheduling, things can get faster and some cases they do.
Carlos: So you mentioned you were looking at PerfMon and so, this idea of queues. We have queues in SQL Server and that’s what we get wait stats and things for, these queues. QL Server works through these different items.But, when you talk about those queues, what were you looking for then? Is that a specific metric that people could take and then go talk to their VM admins and say, “Hey, can we take a look at that number of queues that are allocated to this VM?”
David: In some ways you can. It’s a little different. PerfMon, essentially, what I was seeing was that things were slow but the CPU activity was low. It was point-blank low. There was no reason for it to be low given the workload they were under. We actually got into the hypervisor, at this point, sort of poking around.The cool thing is most hypervisors, VMware and Hyper-V, both give you the amount of time that those CPUs spend in that queue. The amount of time spent in that queue is time not spent, actually, doing what you want it to do.
David: If it’s a small amount, no big deal. That’s normal. Everything is…
Carlos: It’s going to happen. It’s just a normal by-product, right?
David: Yeah. And, you know, most environments you’re looking at a quarter of a percent to maybe one to two percent. If you look at the way SQL Server parallelizes it, those values just get inflated to like one percent to three percent or four percent. That’s nothing. It’s measurable, it’s not really noticeable and everything’s OK. In this particular environment, they were at 65 percent.
David: That’s not cool. [laughs]
Carlos: No. That’s right. That’s a lot of time waiting in line.
David: Yeah, right. That’s exactly it. So we reduced the footprint of the schedulers, and that 65 percent dropped to 4 percent.
Carlos: Um. So, then the CPUs, actually, could be doing work instead of waiting around.
David: Exactly. And the 32 percent average CPUs consumption was in the neighborhood of 3 percent. When we went to three, the CPUs were at about 55 to 60 percent, but that’s OK because things were getting executed properly and we actually being efficient with the entire stack there.
Carlos: Interesting. That’s good to think about there. So, now you brought up Hyper-V and I think the landscaping virtualization is changing. As I sip my Microsoft Kool-Aid, right, I think they’re changing a little bit.At one time, VMware, I think was still the king or was the king, maybe still is, right, but obviously, Hyper-V has gained some ground and, I think, Azure are also using Hyper-V, so it’s gaining a little more respectability. Where do you see the landscape shifting there?
David: It’s an interesting time to be a geek. We’ll put it that way.[laughter]
David: VMware has been around for longer than Hyper-V. Hyper-V has come a long way in the short amount of time it’s been around. Honestly, I’m at the point where both hypervisors really work well.The performance overhead is very low. The performance challenges are generally not with the hypervisor at all but it’s more on how it’s configured and how it’s managed. The management tools are constructed on a different philosophy based on other platforms but they both work. There are some buzzwords that are different but in terms of features, in terms of availability options, they really both work well.
Now, that said, they both have direct interaction with Azure now and you can do Amazon. You pick your flavor of cloud. So, when it comes to telling people what to implement in their environment, I turn it back to them and it’s, “What do you have in-house experience managing?”
A lot of folks have Windows Server admins that are in and do a great job with it and then Hyper-V is a logical fit. They already know it. If it’s a VMware environment, if they’ve got folks that background experience with that, go with that. It’s whatever you can manage easiest in-house.
Carlos: Yes. I guess I’m not sure it never would have dawned on me. So, I’m assuming Amazon web services, AWS, is all VMware?
David: Oh, no. It’s their own derivative of, I think, one of the open-sourced variants.
Carlos: The way you said that, I thought, “Oh, wow. Gosh.” I didn’t realize. [laughs]
David: No. No. If they are, that’d be interesting.
David: That would be interesting. Yeah. I guess, one of the last things I want to talk about and that is so, disks. Right, so SQL Server disks are generally our biggest bottlenecks now in all of our systems. Of course, you give a CPU example so that’s not a 100 percent of the cases. Maybe that’s more in the physical world as well.I feel like in the VM environment there is a lot of hoops that I have to go through, and I don’t feel like I have a good understanding of my disk set-up in a VM environment. I guess advice for DBA’s noticing unacceptable disk speed or they have some questions about their disks, what may be some terms and phrases that they need to use to talk to their VM admins about it? [laughs]First of all, you walk into a big conference room and say “Of the DBA’s in the room, how many of you love your SAN admin?”
You get about one percent of the room that kind of quietly tries to raise their hand. Then you say “Of those who raised their hand, how many of you are your storage admin?” and usually most of those folks raise their hands again.[laughter]
Carlos: Oh boy!
David: Oh yeah. Storage has been the most biggest problem with SQL Server performance for as long as I have been a DBA and a sys admin. If you look at the speed of CPU, you look at the speed of RAM, storage is the slowest piece of the puzzle. It is also the most complicated.It is a single point of failure in most environments. They’ve got 20 SQL Servers and probably only have one SAN. It’s tough.
David: So, what do you do? Disk is getting faster and what the advent of all-flash arrays that are now becoming cost effective the infrastructure from the physical machine to get to the SAN now becomes a bottleneck. [laughs]It’s still a storage problem. From a DBA perspective all we see is elevated disk stall inside SQL Server and elevated latencies inside Windows. From a VMWare perspective you get the same boat.
You don’t have a lot of that control over where things go directly but you can still detect it, you can still see it. Those are the things to watch out for as a DBA. Inside SQL Server is disk stall. Actually, if you go out to heraflux.com not to market that but I’ve got a script out that actually lets you collect disk stall from inside SQL Server and turn it into disk latency matrix on a given pulling interval.
You can overlay that with Windows PerfMon and you can start to say “Do I have an IO challenge inside SQL Server that Windows is also seeing?” conversely you do not, SQL Server might have it and Windows might not, and if so you have an IO bottleneck inside the SQL Server engine.
This is one of the reasons why people say create more than one temp DN data file, because you can have an IO contention bottleneck to get to that one single temp DB datafile. So you can detect it pretty easily. Those are the things to look for.
Now what do you do about it? If it’s a SQL Server problem then more data files, more file groups, potentially more logical drives or mappoints in Windows and more scuzzy controllers can help to spread off that work-load.
If it’s still a problem, you got to start looking at the VM layer and beyond that. To have these IO matrix from both SQL Server and Windows to confirm that you have got these challenges it is vital, you have to have those. But then when you go to the infrastructure folks you have to have them really open the playbook, have them show you what’s the SAN doing.
What is the disk configuration like? From a DBA perspective I shouldn’t have to care how the SAN is configured or managed, or what the background load is, it just needs to be faster than what I need.
Carlos: Gathering those metrics is ultimately the key point right?
Carlos: You know what your SQL Server, your operating system is telling you and you can provide that data and say “Hey guys this is what I’m seeing, let’s see if we can make something happen.”
David: Exactly. I’ve got a goofy example of this. There is one company that — good friends of mine, they had invested $6 million in storage. All flash, heck of a set-up. On paper it looked greay. When they put it into practice and moved their VM server to it, the SQL Servers were experiencing over 54,000 long IO alert warnings per day.
David: The SAN was averaging 0.4 milliseconds response time and was idle. It was a fascinating project because for one, it was the only time I’ve ever had a chair thrown in a meeting.[laughter]
David: Yeah. It was basically an argument between the SAN folks and the DBAs. The SAN people were saying, “Your metrics inside SQL Server are invalid because it’s a VM,” which was incorrect to begin with.The SAN is doing this, “So DBA whatever you are talking about, be quiet do not bring this up again” and the DBA was saying, “We are seeing significant performance problems versus where we were at 6 months ago.” They just started arguing, it was fascinating.
What it turned out to be was actually something in the middle. The all-flash array was working great. They had a fiber switch in the middle that had one little setting wrong and everything behind that fiber switch was stacking up.
Carlos: Wow long queues, huh?
David: Yeah. It was just incredible.
Carlos: There we go. I think the idea of working well with others, playing nicely with folks. Yes, you’re a SQL Server administrator, you know a lot of things, but don’t forget that they know a lot of things too.
David: Yeah. It’s get the objective measurements, get the objective data to prove your points. Don’t accuse. You want to work with them, you’re there to support the business.So you get the data and you say, “We have a problem. These are the symptoms, these are the patterns, this is the data. Let us look at the rest of the environment and overlay the data that you have with the data that I have and let’s put this together.”
Carlos: Very cool. We’ll actually put a link to the disk stall script up on the website sequeldatapartners.com/podcast. They will have a link to Heraflux and to that script, so you can get to that on the show notes or that page.
David: Awesome! There’s two scripts out there actually, there is one for the disk stall collector and there is another one for how to set up PerfMon to collect data in the same way that SQL Server is doing.
Carlos: Very cool. Well, David thanks for that conversation. Always interesting to talk with smart folks like yourself, particularly about virtualization. Now we are at what I call the SQL family portion of the program.
Carlos: Here we get to talk more about you and some of your life experiences. We always like to talk about tools, find out what other people are doing. One of the things I like to ask is what is your favorite SQL tool? It could be a free tool, pay tool? Why do you like it and how do you use it?
David: Wow, good question. [laughs] Number-one tool, Glenn Berry’s diagnostics scripts over at sqlskills.com.
Carlos: There we go.
David: These are phenomenal. They provide an exceptional amount of insight into any SQL Server that you’ve got and I leverage them almost daily to be honest, because I do a lot of health checks and efficiency checks for different customers.I don’t ever want to be the IT guy that comes and sits in the corner, does the work and disappears. You have no idea what he did. I love the education side of it.
What I do is, I say, “Hey, go here, let’s download these scripts. We are going to walk through these in your environment, and we are going to add this into the infrastructure health-check. We are going to show you the methodology of how to do the health check and that way you won’t have to call me back to do this on other environments in your server farm.”
Carlos: Yeah, very cool. They can use that and as Adam Machanic says, “Only call you for the good stuff.”
David: Exactly! Exactly. I want them to call me back for other things but not something I have already shown them how to do.
Carlos: David you have had a wealth of experience working for other people and now you are here at Heraflux. You have this consultancy. Which one is the best piece of career advice you have ever received?
David: Hmm. Ooh. Hmm, good question. The best piece of career advice I have ever got was actually by my dad. My parents have been entrepreneurs, serial entrepreneurs for a long time. I have watched them jump from different things to different things and it was always, “Don’t be afraid to go out on a limb and take a calculated gamble.”When you are in college they teach you that you want to be a programmer, that is what they tell you and that’s what you know. I was a sys admin in high school and college. I got out of college with a computer science degree and said “OK, well next up logically is development.” Well, I found out I am absolutely horrible as a software developer.[laughter]
David: I mean, I could do it but it just was not natural you know?
David: It took a gamble and said, “Well OK, I’m going to get out of that, I’m going to go back to what I know and go be the infrastructure guy.” I took that job and then SQL Server was a part of that and I always kind of tinkered with SQL Server.I figured I might as well learn this, it might be interesting and at the end of my first week there I realized that I really enjoy SQL Server. I had no idea that I would. I loved it and didn’t really abandon the sys admin side but definitely put it on the back burner to dive in hard and heavy on the database side of it.
From there I took a job at a performing arts center and had a lot of work with the infrastructure there and data and integratiom and reporting. It was just a blast and the gamble to go independent was fun, it’s honestly the craziest decision I have ever made in my life. I wouldn’t wish it on anybody unless you have that mindset where it’s OK to take a gamble and you’re willing to run with it.
I absolutely love being independent because I can pick the technologies that I want to learn and I want to explore and can figure out how to adapt it to business needs. It’s so much fun to be able to incorporate something new and completely change an attitude on something.
Carlos: Very cool. So tomorrow or later this afternoon we hang out and you inherit $1 million dollars, what are you going to do with it?
David: I’m going to hire five people that I know to help me to continue to build up my business.
Carlos: Hey, there you go.
David: Then I will probably take a couple of weeks to relax and then get right back to it.[laughter]
Carlos: Compañeros, if it happens we’ll make sure we get those postings out to you.[laughter]
David: I’m serious, we were joking a couple of weeks ago with the lottery thing. If that were to have hit for some oddball reason. I got about five people on the shortlist that I would love to pick up the phone and call.Hey, we started something that I feel is special, let’s continue to grow this.
Carlos: Make it happen. Our last question — if you were to have one superhero power, what would it be and why would you want it?
David: Probably time travel, if that’s a thing. I’m a closet history buff, and I would love to be able to travel back and meet Leonardo Da Vinci. He is, in my opinion, probably the most important Renaissance man-type mentality, and it’s something that I’ve been trying to do for a long time.Not just be, you know the phrase, what is it, “Mile-wide and very shallow in a lot of areas.” I want to be a mile wide and a mile deep in a lot of areas. I’ve come to terms with the fact that I can’t learn everything about everything, but I love learning a lot about a lot, and I love being a mile deep in a few areas, because at that point you know what you don’t know. It helps you to stay humble and it helps you to continue pushing.
Carlos: Well, David, thanks again for being here. I do appreciate your time and your willingness to share your knowledge.
David: Awesome, thank you. This has been a lot of fun.
Carlos: So compañeros, if you like today’s episode, I invite you to leave a comment or review on iTunes or Stitcher. This will allow others to more easily find it, so they can enjoy the program as well. We do appreciate you tuning in, and we’ll see you on the SQL trail.