Have you heard about the new High Availability features in SQL Server 2016? Availability groups provide both high availability features and disaster recovery options, but they also have several areas you must be aware so you don’t introduce more risk into your environment. The major advantage is availability groups allow for you to fail over more than one database at a time. In Episode 59 we talked about general data availability options and in this episode we focus on the new features of Availability Groups in 2016 and how data availability options have changed with our guest John Sterrett. John shares his experience getting a large database to a highly available situation along with some other ways to use availability groups.
If you are using availability groups, we’d love to hear about your experience along with any issues you’ve had in the comment below.
Listen to Learn…
- How High Availability has changed since SQL Server 2012
- The new 2016 features and how to use them in your environment
- The new “round robin” fail over option that gives you more control over recovery
- Why your Network Admin might be who keeps you from implementing HA
- The single biggest impediment to HA, according to John
- His best piece of career advice and how PASS helps you with this
Episode 63 Quote:
“So basically, availability groups are sort of taking the best features of both failover clusters and database mirroring and putting the best features go together to get what we know today as availability groups.”
This Week’s Tuning Review Topic
Episode 63’s Tuning Review is about Rebooting SQL Server. Listen and learn over at http://stevestedman.com/reboot.
Today’s guest is John Sterrett, a SQL Server DBA and Microsoft Data Platform MVP out of Austin, Texas. John runs the High Availability and Disaster Recovery Virtual Chapter as well as the local PASS chapter in Austin. He also blogs and speaks regularly at PASS events on . Follow him on Twitter @JohnSterrett and on LinkedIn.
SQL Data Partners: Episode 59
SQL Data Partners: Episode 61
Multi Subnet WFCS Setting
Direct Seeding Benchmark
60 TB database made available with no downtime
Readable Secondary Replicas
Today’s Tuning Review: Rebooting SQL Server
PASS High Availability and Disaster Recovery Virtual Chapter
John’s Posts on SQL Server Central
MSDN: Data Replication and Synchronization Guidance
Episode 63 Transcript
Episode 63: Transcript
Carlos: Well John, welcome to the program!
John: Hey guys, thanks for having me.
Carlos: Yes, always nice chatting with you. We do appreciate you coming on the show today. Ultimately, our topic for today will be expanding on some of the things we talked about in episode 59, talking about some HADR-type scenarios. And we’re focusing on the 2016 implementation of some of these features and so John has been doing a bit of playing around with that. So before we get into the new stuff, let’s take a quick second and review some current features, if you will, of availability groups. So John, why don’t you talk us through what availability groups are, how they get set up, etcetera etcetera.
John: Sure, is this where we hit pause and tell everyone to watch Episode 59?
Carlos: There we go. At SQLDatapartners.com/dataavailability.
John: Nice. Well to get the cliff’s notes, there’s a couple of core high availability features that have been around for a pretty long time in SQL Server. We’ll touch base on a couple of these because they kind of morph into what Availability Groups turned out to be. First of all there’s Andy’s, if anyone’s watching Episode 59, favorite HA feature, Database Mirroring. And this basically was a database-level of high availability that would take your database and make you have a second copy, almost in real-time, where you have options of using asynchronous or synchronous, to send that data over to another server.
John: So, there also is windows failover clustering which would be used for a failover cluster index. What this gave people was a single name that they would always use when they connect, so they can access the data regardless of which server A or server B or whatever server in your SCI topology is hosting the data. So basically, availability groups are sort of taking the best features of both failover clusters and database mirroring and putting the best features go together to get what we know today as availability groups.
Carlos: Merging those features into one. So you’ve got two data stores but then you also get the windows clustering, which is where you get the naming conventions.
John: Exactly. And this came in 2012. So you had to have Enterprise Edition. Same thing in 2014, we get 2016 we’ll talk about how that changes. You now get to have some Availability Group love even in Standard Edition.
Carlos: There you go. And the reason for that is because the mirroring is going away, and they had to find a way for the common folk, if you will, to have some availability.
John: Exactly, and one of the super cool things there is that you can actually use async mode, where you’re mirroring. Before you would have to have enterprise edition for that. Even if you love how you use mirroring today, that’s one huge huge bonus feature of using the Availability Group in Standard Edition with AG 2016.
Carlos: So let’s go into that. So in standard edition, what are the feature sets available to me in 2016.
John: So in 2016 you’re going to be able to use what is the “basic availability group”. I know that doesn’t stand for BAGish either. But this is very similar to what we expect in SQL database mirroring today. You’re going to have a single database that would be in the availability group and really loud 1 replica that you can use there as well so a lot of it is the same if you want to still have your connection strings we got your fill of her partner and you can do that as well or if you want you can have the listener as well
Carlos: No I didn’t realize that was so much I I didn’t realize that was a requirement as well that you can only have one database on availability group
John: So yes, you can only have one single database inside of the availability group
Steve: So wait, just to clarify at that point you don’t mean one SQL Server you just referring gle database on that instance?
John: That is correct. So you’d have an availability group for each database out there.
Steve: So you can do the multiple databases each one just has to be..
John: Oh yeah yeah each one is in its own so you have the same experience as you’re not allowed to prep the databases together unless you went with Enterprise Edition.
Steve: So then if someone out there knows and loves mirroring today do you think they’re going to be happy with the feature set in the basic ability groups in 2016?
John: Well I think there’s some big benefits that you don’t see today. Mainly the biggest one that I really love is the fact that you can be asynchronous. Now you’re not forced to be in synchronous mode to where if for whatever reason you have some kind of latency or you don’t have the fastest disks that you are using the heart of your transactions. You can be in asynchronous mode and a moment of delay will have less impact across your end users.
Carlos: Right I think the big downside, however is that it does come with the windows clustering and there is some management or some potential politicking that you’ll have to do with your network team in order to get that set up.
John: For sure yeah that always becomes a huge challenge especially if you’re going to be doing multi submit or basically having a stretched out Windows failover cluster that would span across multiple data centers. But there’s a really nice, probably my favorite feature in 2016, for availability groups is this new functionality called a Distributed Data Center
Steve: Now I’ve heard that described as being like an availability group of availability groups. Is that the right way to think of it?
Yeah, it really is. So before, currently in 2012 or 2014 , Iwould have to have a Windows failover cluster that would span all the way across the data centers that you want to use inside of that availability group. And what’s really cool now is that it’s no longer the case. You can actually have Windows failover cluster inside of an AG and then have an AG of those ages. So some of the big problems that I’ve had with customers when they’re implementing availability groups the first time, where they’re spanning multiple data centers, it’s just that fact that they don’t have enough time to correspond the default settings across multiple thresholds. So when the heart beats are going across the latency is just too much that it can’t keep up and end up with quite a lot of failovers. So the nice thing here is that you can basically have one Windows failover cluster in your primary data center and you set up your AG there, and then you do the same thing in another Data Center and have an AG on top of them that’s going to use the listener to set the endpoint trap between the AG’s. It’s really really cool not only does it help with that problem, but I see a lot of people are going to use this in the future for helping with upgrades as well because this would actually allow you to completely isolate the windows failover cluster so you can upgrade one that they are independent so you can upgrade you can have a new address or you can go ahead and even upgrades SQL and move it over and make that process a lot more seamless for you.
Steve: So that would be handy if you needed to take the time to completely rebuild the cluster on one hand I would assume.
John: Oh definitely.
Steve: So do you think that distributed availability groups would be more of a high-y or more of a disaster recovery option or kind of combination of both?
John: You could definitely get a combination of both but it’s definitely a lot more there for ecovery then Availability.
Steve: okay. so you lose one Data Center you’ve got it all running at the second Data Center ou can just fail over to that through the distributed AG concept.
Carlos: So what about some of the new features that are coming to us in 2016?
John: So another one that I have used a lot recently, in fact I have a great blog going over kind of beating this with a hammer, is what is known as direct seating. So one of the main points that people have had with implementing availability groups in 2012 or 2014 is the fact that your data has to be completely in sync in order for you to turn on availability groups. Just like you would have with database mirroring. So for example if you’re just going to have a simple concept of two replicas, whether you’re using the Wizard and by default it basically would take a copy-only full backup, log backup for you and send it over to restore from the network share over there, you have that data in sync and order for you to be able to join the replicas together. And while this could be very trivial for a very small databases, it gets pretty complex when you’re working with a really big set of data. And what direct seating can do for you it’ll actually do all of this for you. It’s a fun feature that actually has been in Azure SQL databases for quite a while now with the geo-replication going on there. That basically, you have a database, you want to get highly-available you turn it on it’s going to use that endpoint to basically do a VSS back up that’s going to send over the wire and restore at the same time as it’s going. It’s actually pretty cool and I’ve seen some really good results with that I’ll make sure that we sent over the link to that to some of the benchmarking I’ve done there.
Carlos: Yes will make sure that gets on to the website let’s see what we going to call this episode at SQLdatapartners.com/2016HAfeatures. Now do they give any guidance, John, as to when you both look into the threshold of you’re too big to use direct seating?
John: So not really but I’m glad you mentioned it, Carlos, because the big caveat that people definitely need to know if they’re going to consider using this in production, and it’s going to make a lot of sense once they’ll hear it. It’s the fact that your transactional log on your primary, it’s not going to be allowed to truncate data until the direct seating finishes.
Carlos: Oh wow. So the entire process has to finish, the database has to get over there, before the log can go over there?
John: Exactly, and it will make sense because it’s a VSS backup in real time then you’re going to want to have all the difference of the data and between when it kicks off or when it completes. So, a perfect example for me that I had that I’ve done a good real world test case with this was a 60 gig database where it took about an hour and we were seeing about 1.4 gigabits of consistent throughput through the endpoints. Our actual limitation of that point was for storage. But it’s really really good performance. I’m sorry it was 600 gigabytes not 60.
Carlos: So go ahead and say that one more time?
John: so it was drug testing that I’ve done with a 600 gig data base and I saw that it took what’s 66 minutes and we are running about 1.4 gigabytes per-second of consistent throughput and are bottleneck at that point I actually was our storage that we were using that that was persistent the secondary over to
Steve: So then for that entire 66 minutes, the transaction log backups were not able to truncate or run and truncate the transaction log at that point?
John: that is correct, yes. That is one thing you definitely have to test out there before you do this full-on in production. But it is now at the point where this can be very seamless for you because of that even if you have log backups going through all the time, it’s not going to have any jeopardy of having the two replicas not being synced. So it makes it really really simple and I think a lot of people are going to enjoy using availability groups if they have not ran into a problem with it was 2012 or 2014.
Steve: Okay so then one of things I heard a little bit about is a different load balancing options with the readable secondaries. Is that something you can talk about?
John: Oh yes definitely. so there’s also this new feature of round robin and we’ve had read intent before in 2012 and 2014 and basically what this would allow you to do is basically set up a relationship so every replica when they became primary would find in a thing of think what we’re going to find next and if that doesn’t work. We’re going to keep going down the chain in 2016, we were able to group these replicas now so for example instead of just going to the second one and it going to the third and fourth on a failure you can actually round-robin between replicas. So for example if you had a scenario where you had the five replicas in your AG and we’ll say replica one was primary. You could tell it, okay try this group of replicas 2 and replica 3 first. And what it would do is round-robin in between those two so that if you connect using the read intent with your connection stream, it’ll bounce back and forth between replica 2 and replica 3 which is pretty cool. And from there you have it go down and do the next one whether that would be try replica four or five or group them together where you could bounce between those as well. It’s pretty cool and allows you to do a lot of really nice kind of load down balancing with the read intent with SQL straight out of the box in SQL without having to use any third-party products
Steve: Oh wow that seems like a pretty big win there for that load balancing feature
John: Oh definitely.
Carlos: So I think the caveat, or maybe not the caveat, but in order to use that in your n stream the application has to specify that parameter setting of the read intent.
John: That is correct yes.
Steve: Interesting. So then one of the things I guess that I’m really curious about here is what is the biggest set of data that you’ve been able to make highly-available with availability groups and how did you go about doing that?
John: Good question, Steve. so earlier this past year I actually had a great opportunity of taking a database that was 80 terabytes in size and at this point in time it was basically a stand-alone instance we were able to make it highly available by having basically for replicas to in the primary data center that we added some DR to it as well by adding a third copy to the second Data Center. And we were able to use the log shipping to get the data in sync so that when it was time for us to implement we actually didn’t need any downtime the actual physical cutover took us about 30 seconds for it from start to finish to have the availability group replicas to be able to be joined together. Now all shipping itself took quite a bit of work; it took about a week to get the data in sync and going to fall over to the second Data Center getting log shipping and going so all the data was fully in sync.
Steve: be able to switch over that quick once for 80 terabytes that’s amazing
Carlos: right I think that’s kind of an interesting idea in the sense that you’re using multiple technologies to kind of get to where you want it to be right, so some of these older things well that’s not the primary role I can play, it can be used to help you get to the more highly-available solutions.
John: Yes for sure I mean so yeah we did all kinds of real fun stuff that used to be cutting edge way back in the day. For example looking to see that the data is in sync through log shipping, being able to stand by so you can query the secondary through log shipping to actually look at data and compare it to the data that’s currently in your primary. So you know, you’re definitely right there Carlos, it’s going to take a lot of great features that come with SQL server in order to kind of bridge them together to do some pretty cool stuff.
Carlos: Now in all this coolness, right, is there a reason why we might not consider, wny we might consider not putting availability groups into our environment?
John: Oh for sure yeah one of the biggest reasons that I see out there in the field is that if you today don’t have the infrastructure around or the people to manage the process is for availability groups it’s definitely not the solution for you. So for example if you don’t really have a network administrator and the guy who does that role basically is like [inaudible], you may end up having quite a lot of network issues that make your high availability solution very unstable and unreliable and down more than available.
Carlos: Yeah that’s an important concept, right, as we try to put in this feature we don’t then want to be the cause of the outage.
John: That is definitely correct that reminds me of a client I was working with this year where you know they called me and they said what’s going on and they were wanting some tips. I’m glad to help people so I told them to try a few things here and there. And they did it and they called back a week later saying this is constantly waking me up at 2 a.m. fix it for me. So I ran a script that that Tracy Jordan had presented at the high availability virtual chapter that basically would go through and scan all the errors. And we could see there were over 30 failovers in the last week mainly because of that scenario we talked about a little bit earlier about them having multiple data centers and the heartbeat just wasn’t making it across the latency and with the activity that was going down the pipe. So just by jumping in and changing some settings instantly we were able to eliminate that pain points. That’s just a perfect example of if you don’t understand that working if you don’t understand Windows failover cluster is this can be a pretty hard technology for you to grasp. In fact and going back to your podcast 59, I would strongly actually disagree here I would say that replication is a lot easier to manage than availability groups if you don’t understand this core pieces that are part of your infrastructure to making your availability group be successful.
Steve: Sure and I think to add to that a little bit it might not be that the people don’t understand those networking areas, but it might be that they are not permitted access go to some of those too. Where you got the network person conceptually on the other side of the wall that that sets up your cluster and doesn’t let you in on it.
John: I was going to say yeah you definitely have to have a good team and everyone has to be pretty pretty solid on what they’re doing for availability groups when they work they’re phenomenal. But yeah if you’re missing some pieces there and what you need it cannot be a favorable solution for you.
Carlos: and ironically in episode 61 Russ mentioned a scenario where some firewall settings were changedand he failed over and then couldn’t fail back because of the firewall changes.
Steve: Right and that was a change completely unrelated to anything that the database was doing. The networking guys made that change without even telling them, however it had a big impact on it.
Carlos: Interesting. Well good alright great conversation on some of these features. I think I am looking forward to it. Let me say that I have had some opportunities to put up some availability groups and and you know, again knock on wood, they’ve work pretty well for me. I haven’t run into some of the same networking issues, but the future is pretty neat.
John: Yeah and you know the only way to really get in there is to get your hands dirty. In fact one thing I recommend to a lot of people when they want to get started with availability groups as you know don’t just set it up and leave it like a lot of people are doing. Get a synthetic load tool like Hammer DB or DB store. Get a workload that will just hammer that. Hammer doing lots and lots of transactions with inserts updates and beliefs and then go through and monitor and see exactly, okay how is this working with how you configured it and the infrastructure you have behind it.
Carlos: Yeah. Good points. So I guess last thoughts about Availability Groups or Disaster Recovery features in 2016?
John: Just get in there and get going with it. There are so many added features and availability groups and it’s amazing to me alone how they got through all those, not to mention all the other little parts of the product. There so many things that we didn’t even go into much detail about like group managed service accounts we can now use in our support, or DTC, or even the increased number of auto-failover targets. Yeah there’s a lot of new things that are added over there in 2016 just for availability groups.
Carlos: I’m anxious to see what the future holds there.
Carlos: Should we do SQL Family?
Steve: Alright so with a sequel server features changing so rapidly with each new release what kind of things do you do to keep your SQL Server skills sharp?
John: So I do a lot of different things. I love this question too because there’s so many different things people can do and there’s no right or wrong answer. Just doing something is a great start a great step towards having a good career. For me, I’m highly involved in the PASS community so I actually run the High Availability Virtual chapter and I run my local chapter here. So that forces me to watch a lot of sessions and see a lot of different things, so I always love learning from that group. I also have a pretty nice home lab here as well so I have my own Mac Pro with 64 gigs of RAM, six cores. I have my own Synology storage NAS here that I can use VMware to just beat up and play with any concepts that I want to learn and get my hands dirty on.
Steve: Yeah I can see how valuable that lab would be if you’re doing anything with availability groups.
John: Oh definitely
Carlos: so we been talking about some SQL server features tonight but if there was one thing you could change about sequel server what would it be?
John: So one thing I wish there’s actually a lot that’s kind of hard to focus on one so I’m probably going to have to go with the one that I wish I had right now. And this would be with distributed replays. So I love hammering and replaying workloads I couldn’t pay workload tuning and do it as a way to test things with high availability, like always on availability groups. But one thing I wish it would be added to that is the fact that right now you can only do a one-to-one replay. And I really wish there was a way I could take a workload and say just multiply it by 10 so if you have a bunch of inserts and identities and selects and some updates that it could just scale it for us and allow us to magnify it and hammer that even harder with actual real work loads that we have.
Carlos: I see. Gotcha. it helps with those inserts can be troublesome.
Steve: Alright so John what is the best piece of career advice that you’ve ever received?
John: So the best advice I have ever received is to find good mentors and then latch on to them. So I’ve been very blessed throughout my whole career, even all the way from being the intern, to having some really good mentors that were able to help guide me and help save me from spending more time learning things instead of just going down better paths.
Steve: that’s definitely a good point I think there’s a lot of things out there that could be that people could you better if they had a mentor to help them along the way.
John: I couldn’t agree more. Yeah that’s not really just SQL, that’s life in general. I’ve been a bit blessed to see a lot through being a business owner being also a single professional or just a SQL person. I’ve had a lot of great people surrounded around me too kind of help me become a better person and to be a better SQL person.
Carlos: John our last question for you today: if you could have one superhero power what would it be and why would you want it?
John: My wonderful superpower skill would be memory manipulation. And every year you know we have this wonderful conference called SQL Pass Member Summit. I strongly recommend people go. And there may always be some things that you know accidentally gets done that I wish people would forget and never knew it happened. So if I were able to pick any superpower it would be nice to have that one.
Steve: Would that somehow relate to SQL karaoke?
John: Definitely it could.
Carlos: John, thanks so much for being on the program with us.
John: Yeah thank you guys for having me, it’s been a blast
Steve: Yeah thanks John. A lot of good information today.