One of the newer features in SQL Server is availability groups, which can help solve a number of business problems. As administrators, availability groups introduce some complexity as we are tasked to make sure the Recovery Point Objective (RPO) and Recovery Time Objective (RTO) can be meet for these servers. The complexity comes because we have different instances that must work together, but they don’t always see eye to eye. In this episode, we chat with Tracy Boggiano of ChannelAdvisor about how they go about monitoring their availability groups and the pros and cons of the out-of-the-box tools. Our discussion touches on the availability group process and Tracy has posted her scripts on her blog for you to use as you look at reviewing your environments. I think you will enjoy this discussion.
Episode Quote
“You just need to allocate the resources and play with the stuff in your staging environment and make sure you have resources”
“I much prefer having a query and using PowerShell and just running in the multiple instances
“We use a third party monitoring solution for our monitoring rather getting a whole bunch of SQL agent alerts”
Listen to Learn
– Monitoring availability groups
– Data synchronization on availability groups
– Asynchronous and synchronous mode
– A review of RTO, RPO and SLA
– Errors and agent alerts
Tracy on Twitter
Tracy on LinkedIn
Tracy’s Blog
About Tracy Boggiano
Tracy is a Database Administrator for ChannelAdvisor. She has spent over 20 years in IT and has been using SQL Server since 1999 and is currently certified as a MCSE Data Platform. She also tinkered with databases in middle school to keep her sports card collection organized. She blogs at tracyboggiano.com. Her passion outside of SQL Server is volunteering with foster children as their advocate in court through casaforchildren.org.
Carlos: Tracy, welcome to the program.
Tracy: Thank you for having me.
Carlos: Yeah, it’s great to have another companera here with us. You are going to join us on October for the Companero Conference and so we appreciate that, and for you being on the show today.
Tracy: I appreciate you having me.
Carlos: Yeah, and I’ve been telling Steve we’re going to get through all those ChannelAdvisor folks. Now, we get to cross another one off our list.
Steve: Are there any DBAs there that we haven’t talk to yet?
Tracy: There is a couple left. Yes.
Steve: Ok.
Carlos: In fact, one of them I actually, so I met when I was down there in. He is a big fan of PSSDiag and so I talk with him about maybe coming on and seeing if he could convert me.
Steve: Oh, that would be interesting.
Carlos: Yeah. I’m like, it seems like a little. Anyway, that’s for another episode.
Carlos: Oh boy! Yeah, I’m not sure. Not today’s episode.
Steve: Speaking of which today’s episode is on monitoring availability groups.
Carlos: Yeah, that’s right. I think this is going to be near and dear to anybody who is trying to setup availability groups and ultimately that idea that you have DR scenario/situation and you want to make sure that the data is getting over to the secondary node. That you’re not going to lose that data that you kind of meeting your SLAs and so being able to ensure that just makes a lot of sense. And so, Tracy, it might be helpful why don’t we take a minute and just review that data synchronization process as we talk about availability groups? What are the components and what are some of the pieces that we need to be looking at?
Tracy: Ok. Well, the first step is that the log has to flush to the disk, and you have a log cache that it caches the records into that it has to send over to your secondary server. And it is stored in this area called the log capture area and it sends across the network, and at some point it gets an acknowledge commit whether you are in synchronous at asynchronous mode and that depends on what mode you’re in. Once the log is received on the other side it is stored in another cache into disk. And on the other side you have a redo thread that sits there and replace the pages back into disk. And you have a performance counters that capture your logs and queues, size, and the rate that it’s sent, and you also have the redo size and redo sync rates that are all captured and performance counters on both sides so you can monitor those.
Carlos: Right. Now, so we talk about asynchronous versus synchronous, right? So when I send it over, writes to that second node and then writes to the log, that hardening process. If it’s in synchronous, once it writes the logs, is that when I get the acknowledgment back? Or does it actually go through the redo process before it will acknowledge that?
Tracy: Once it hardens to the log it is sent back as committed. And when you’re in async it sits and hardens on your primary it’s considered committed.
Carlos: Ok.
Steve: So when we are working with that what are the things that we really want to monitor or keep tracking to make sure things are working well.
Tracy: One of the first things you want to make sure you don’t have is a lot of network latency. A lot of times this technology especially if you’re in async mode you’re looking at a DR situation or you had your secondary site in a secondary location like an AWS or different data center. You don’t want to have too much network latency. The other being that you want to make sure that your secondary storage aren’t slow and not able to process the data as fast as you need them to. Some companies like to go skimpy on their secondary servers and not have them as their primary servers. And they can get behind just because they don’t have
enough memory or CPU resources. So you want to keep an eye on those redo queues and make sure if it’s actually able to keep up or not. And that’s where it’s important to keep up with your SLAs and make sure that you’re actually meeting those or not.
Carlos: Right. Now, this is very interesting concept because you think, “Ok well, all that secondary server has got to do is process the log”, but on busy systems that can be quite a bit of work. And to keep up with that and again, ChannelAdvisor, you guys get to play with all the cool toys that some of the higher transaction rate systems you actually needed a pretty hefty server, secondary server, just to be able to keep up with that even though it wasn’t actually doing any “OLTP” workload.
Tracy: Yes, we had a system with a 256GB of memory on the primary side that we still needed 64GB of memory on the secondary side just to process the redo log to keep it current. I mean, we still are able to take 25% out. You know, we try to go less than that and we were trying to run in the cloud which can be a little bit more expensive and we weren’t able to do it. We started off at 16GB and it just wasn’t performing.
Carlos: Now, was that synchronous or asynchronous?
Tracy: Asynchronous.
Steve: So then when you say it wasn’t performing there, was it just that the continuous load was so great that it couldn’t keep up? Or was it that there were big sets of transactions that were kicking off at some point in time that were causing it to backlog and not catch up?
Tracy: We just had too many transactions going. We were hitting, 20,000-30,000 transactions per second and it just backing up on the secondary side. One we bumped up the memory it’s plainly a memory problem just trying to read all those pages into the buffer pool so it could update it. And once we bumped it up to 64GB it was able to keep up.
Steve: So then with that secondary backing up, if it’s backing up to the point that it can never caught up, what is the outcome there in the availability group? Does it eventually crash because of that or is it just very delayed?
Tracy: It’s just very delayed. It’s behind to the point you are not able to failover if you’re meeting your SLAs. In this instance we were just testing our AGs for ourselves. It wasn’t a production instance for us as far as we wanted to failover to it. It was just us testing our AGs on a production instance but not for a disaster recovery situation. But for us to see what we would need in order to setup a DR situation and we discovered that, “Hey, we’re going to have to allocate some resources to it.”
Carlos: Alright, now that’s another interesting question because you are failing over and it was just your test, so when we talk about not keeping up what is your window there? Is it 1 minute, 5 minutes, an hour, a day?
Tracy: Our service level agreement is set to 4 hours. But we were finding that the server is up to a day behind.
Carlos: So again, just to put in some of the perspective some of that, right? I wasn’t like you were asking for milliseconds or something like that. I mean 4 hours is significant, right?
Tracy: Yeah. I say you just need to allocate the resources and play with the stuff in your staging environment and make sure you have resources. We played in staging but we don’t have the transactions per second on the current staging that we have in production.
Steve: So I know when you’re monitoring presentation there are a few acronyms that you brought up along the way being RTO and RPO and SLA and how those apply in the availability groups scenario. Can we maybe talk a little bit on those and cover what that means to the availability group with the RPO and RTO and how
those apply to SLAs.
Tracy: Yes, the RPO is your Recovery Point Objective and that tells you how much data you are able to lose so that measures how many megabytes, or gigabytes, or data you are allowed to use so how much data has change that you can lose. Your RTO is how much time you can lose, so that’s how long it is going to take to do that redo on the other side. So if it says it’s a day behind that’s a day you have to wait for that redo log to play. These two are measured differently. You could have a gig of data behind but it will only take an hour to replay it or it could be the reverse, it could be a day behind and a gig of data to replay. It depends on how your system is setup. And those two combined to make your SLAs, your Serious Level Agreement, what your businesses agreed to allow you to lose data and how much time they are allowing you to recover that data.
Steve: Ok, great, so when we look at the built-in dashboard for availability groups. How well does that really do the job for you to be able to monitor and know what’s going on?
Tracy: When you are looking at a single AG it pretty much does the job for you. If you’re in our environment where you have 100+ databases, 100+ servers and you’re looking to have AGs on all of those, not so well. But you have to look at, you have to connect everyday and then to see what’s going on.
Carlos: So kind of broad brush strokes, it’s cumbersome. You have to connect to every single one of them.
Tracy: Yeah, you have to connect to every server and bring up that gone down through the dashboard in Studio Manager and bring it up. That’s why I much prefer having a query and using PowerShell and just running in the multiple instances and see which ones are on trouble, and going and checking those out. But overall is you have a couple to manage. The dashboard shows you what you need to know.
Carlos: Ok, so it’s pretty good there. It’s kind of meet your needs from an individual availability group or perspective but just that once you start getting with more than one then you guys are going to start rolling your own.
Tracy: Yes.
Carlos: So I guess talk us through some of the ways or some of the things that you’re looking at to accomplish that?
Tracy: Well, we’ve used the queries that the dashboard runs in the background. And one thing we’ve created is some PowerShell command lines that we can run and then it returns in a data grid and then we can sort and filter. And we can just run that again to any of our AGs that we have and fun data. We also have more focus onto our team who loads a lot of stuff into Grafana for us so we can see data.
Carlos: Yeah, you guys likes to meddle, who likes to tinker.
Tracy: He has our performance monitor, counter talking about before our logs and queues, our redo queues and the rates all logged into Grafana for us so we can go there and view them by instance if we need to. So we’ve got a couple of different solutions that we have that we run with right now.
Steve: Yup, so when we talk about your PowerShell scripts that you’ve got there. Are those something that’s internal and private or is that something that’s available for other people who take a look at.
Tracy: I actually have a blogpost to my website that you can pretty much plug in any SQL script you want into and run against multiple SQL Server and turn it into grid, so that you can run any troubleshooting script you want and it will run across any list of SQL Server that you provide and return it in a grid.
Steve: Ok, we’ll have to include that in our show notes, link to that.
Carlos: Well, another kind of interesting we’ll call it a hack because I like to call things hacks today. But that little cheat if you will of have a SQL Server 2 something. You pull up the manager, or the wizard, or the dashboard, whatever and then you pull up profile, I’m assuming or extended events, if you so choose and then you just figure out how it’s getting all its information and then you use that down the line.
I think that’s a great way to be able to grab that stuff and understand what the SQL Server team has put together to look at.
Tracy: Yeah, I saw what I did. First, I pulled up the dashboard. I’ve seen it only comes with like 5 columns and it’s like that’s not very useful. It has a add/remove columns out to the side so I started looking at the columns names and I was like, “Hmm, these are the things that are useful to me and that’s what I went into the DMVs to find.” And that’s what I added to my queries.
Carlos: Interesting. So I am curious and maybe it’s different because I know you guys are on AWS. So in Episode 76, we had Jimmy May on and he was talking about some of his testing experience and the improvements they’ve made particularly in their redo log and whatnot. It just made me think are there scenarios or real situations or is there anything that you can do so let’s just say maybe the scenario would be, I guess it’s not really a cluster, so patching. This may not work but I want to use the patching scenario. Maybe not exactly what I’m looking for but you need to failover. You need to move from one node to other for whatever reason. And let’s say that you’ve gotten behind, what are you going to do or what kind of things are you looking for to help you get back up to speed? Does that make sense?
Tracy: Yeah. In our situation, the AGs we currently have it run in production are synchronous on premise ones because we don’t have our DR stuff that’s setup currently for AGs. We have a different solution for DR at the moment. So those are synchronous and up to date but if you were doing an async environment because we have done this to do migrations. And to migrate from 2014 to 2016, we use AGs, so we set up AGs as asyncs and then when we got ready to failover we set them in syncs. Like the day before so that we get caught up and then we did the failover. It would be the same though with patching. You would go ahead and patch your server, set it to sync, but it force it to catch up, and once it’s caught up you’d failover to the patch server.
Carlos: Got you, ok.
Steve: Ok, so then I guess one of things that I’m interested in is that with the monitoring that you’re doing it sounds like you’re doing your own beyond what the built-in is just so you can get across multiple servers all at once. But are there additional things that you’re monitoring that aren’t part of what you would see through the normal dashboard there.
Tracy: The only thing I say I’m doing differently is the rate. The rates in the dashboard aren’t updated and laugh because Microsoft only updates those when data is being sent across. But if you’re not actively sending data the rates aren’t updated so you got some monitoring around your rates. It may not always be accurate so I do the monitoring base on the performance counters DMV rather than the DMV that just stores the rates for availability groups. So I hava a as part of my downloads for my presentation I have a thing that captures it, captures it again and actually calculates the rates that you can see the difference between what DMV for availability groups say and what rate actually is. Just in case you need to know what the rate or what the real rate of actually use as occurring. But other than that it’s pretty much straight what the queries are running on the dashboard.
Carlos: Ok, great.
Steve: Now, there is another feature or option that is available to us to be a little more proactive if you will as far as letting SQL Server notify us when it has a problem. A lot of us are going to be familiar with the alerts. I don’t know if they are called agent alerts but these are the 19-25 or 20-25 and then those like 18-32, 18-24 etcetera because of problem breathing from disk and things like that. SQL Server can say, “Hey, I had this problem, just thought you should know.” We have a
couple of those for availability groups, right?
Tracy: Yes, I think there are about 5 of them. There’s an error 1480 that lets you know if the server failed over. There is an alert for when the data movement has suspended for some reason. I’ve seen it suspended because the disk is out of space for example. There is also one that will take when it resumes. So if it decides to resume by itself or somebody actually puts a button and resumes it. It will tell you if the AG goes offline for some reason and if you’re in synchronous mode it will tell if the AG is not ready for an automatic failover. You got to set up for automatic failover. All those numbers are available using my, you can look those up online or download my slides.
Carlos: Sure, and I suppose something like that, not keeping up some tinkering will need to be involved to kind of make sure that you have the right secondary formula because I would have imagine you don’t want to be getting a lot of those emails. You get a couple of like, “Ok, more memory than the secondary”, or something, right?
Tracy: Yeah, that’s why you wouldn’t configure your secondary for automatic failover if you didn’t want to know if it wasn’t ready for automatic failover for example.
Carlos: Well, fair point. And I guess the automatic failover. Those are probably going to be a little more similar in size.
Tracy: Yeah. You definitely don’t want to have those set up exactly the same. That way when it failed over you got the exact same performance going on because those are going to hopefully happen at 3:00 in the morning. And you necessarily wouldn’t get paid when something fails over automatically either because hopefully that’s what you wanted so you could sleep. That’s why we invented AGs, to sleep.
Steve: Yeah. Nice, so when you’re monitoring your availability groups are there any weight statistics that you normally take a look at or keep an eye on just to see how things are going?
Tracy: There are a few that you can keep an eye on. The one to watch out for is there’s one the HADR_SYNC_COMMIT to tell if it’s taken a while to commit on your side or not. I’ve seen that one pop up when it’s just waiting to commit. We know so a lot when we were low on memory. The other one is write log. That will typically occur if it’s taken a long time to write a log on your secondary. Other than that I haven’t seen any in particular there pop up in our environment.
Steve: Ok, so with the sync commit is that when it’s doing the synchronous or asynchronous mode and it’s doing the commit and you wouldn’t see that if you’re running an async mode.
Tracy: I’ve seen it in async and sync. It’s just whenever it’s doing a commit.
Steve: Yup, got it. Alright. And then as far as any extended events, is there anything we should be aware of or keep an eye on there when we’re working with availability groups.
Tracy: Mostly extended events it creates always on health session by default for always on and everything you need is pretty much in there as far as events that you need including all those agent alerts that we’re talking about.
Carlos: It creates a separate event, separate monitor or it just includes them in the default extended event trace.
Tracy: It creates a separate one for you called always on and it includes all the events that you need to know including all those alerts, the agent alerts we were talking about. Those are all included in there and as part of the demos that I do at the presentation. I’ve got queries that query those out and we actually as part of our internal monitor just query those directly rather than receive SQL agent alerts. We use a third party monitoring solution for our monitoring rather getting a whole bunch of SQL agent alerts.
Carlos: Yeah, that’s right. I mean once you get so so many instances you’re going to want kind of a complete solution there.
Tracy: We don’t want so many emails.
Carlos: Yeah, exactly. Well awesome, Tracy, great information on availability groups. We do appreciate it.
Tracy: No problem.
Carlos: Should we go ahead and do SQL Family?
Tracy: Sure thing.
Steve: So Tracy, how did you first get started with SQL Server?
Tracy: Well, I was a developer at a company that wanted some reports and Access wasn’t good in it as far as querying stuff so I imported some data into SQL. Well, I had to learn a lot of stuff. I then I just kind of got hooked ever since. I took some training classes on it and next thing I know I was a DBA. I’m not a developer anymore.
Steve: Very nice. It’s interesting to see how often Access is the lead into SQL Server for people.
Carlos: Yup. Tracy, I know you guys are using a lot of features over there at ChannelAdvisor but if you could change one thing about SQL Server what would it be?
Tracy: Right now, I wish hackathon would quit writing so much to my error logs. I’m getting gigabyte log files every night on my system drives and they’re driving me crazy.
Carlos: Oh wow! That is a lot of data.
Tracy: And it’s just informational messages.
Carlos: And there’s no flag or nothing to do to turn that stuff off.
Tracy: Nope.
Tracy: I’ll stick with that one for right now.
Carlos: And so what’s your say about that? Surely you’ve, he knows about it.
Tracy: I don’t know. Brian has mentioned though. I’m not sure what Microsoft said about it yet.
Steve: Alright. What is the best piece of career advice that you’ve ever received?
Tracy: Just to always keep learning stuff. Feel change a lot in IT. What you know now is going to change. As you can tell SQL Server. We’re getting releases like every year. And now we go learn a new operating system.
Carlos: Yeah, never stop learning. Tracy, our last and favorite question for you today, if you could have one superhero power what would it be and why do you want it?
Tracy: I want to be able to teleport.
Carlos: Teleportation, ok, and why is that?
Tracy: I spend a lot of time in my car doing volunteer work and it will save me a lot of time.
Carlos: There you go, very good, awesome. Tracy thanks again for being in the program with us today.
Tracy: Ok, thanks for having me.
Steve: Yup. Thanks, Tracy.
[…] SQL Data Partners Podcast Episode 102: Monitoring Availability Groups (Carlos L. Chacon) […]