Episode 76: Availability Group Improvements

Carlos L Chacon

Availability groups provide some exciting features in mixing high availability and disaster recovery; however, from a performance perspective, there are a few drawbacks.  With the advances in SQL Server 2016, our guest, Jimmy May, set out to test the features and see if he could get some really high performance out of an availability group with two synchronous replicas.  In this episode he talks with us about some of his findings and some of the pains associated with getting there.

Our Guest

Jimmy May

Jimmy May is a SQL Server technologist for SanDisk, a Western Digital brand.  His primary skill set is related to tuning & optimization of SQL Server databases. He formerly worked at Microsoft on the SQL CAT (Customer Advisory Team).  When he is not working on databases, you can find him skiing wherever there might be snow.


“I’ve been a big fan of flipping the faster bit ever since even before I was a member of SQLCAT.”

Jimmy May

  • Carlos:             So Jimmy, welcome to the program.

    Jimmy:            Hey guys, thanks for having me. I really appreciate it.

    Carlos:             Yes, it’s nice of you to come out. It’s always interesting to talk to the former Microsoft folks, guys who have worked on the SQLCAT team, with lots of experience under their belts. So thanks for coming and being willing to share a little bit of your time with us.

    Jimmy:            Sure. Thanks!

    Steve:              And I know, I’ve seen Jimmy present a couple of times and I’ve always had a great time learning along the way so I hopefully will have that same experience today.

    Jimmy:            Crossing our fingers.

    Carlos:             And one last thing just to butter this bread a little bit more. One of the nicest people I think in this SQL community always willing to talk, kind of get some feedback, or share your thoughts, opinions with those at conferences, SQL Saturdays, you know, what have you. And so if you ever get a chance to go and say Hi to Jimmy take advantage of it because you won’t regret the time.

    Jimmy:            I really appreciate that guys.

    Carlos:             Ultimately, what we want to talk about tonight is the experience you had in putting together an Availability Group, trying to understand some of the pain points that customers are experiencing, and then trying to figure out how to get around those pain points, kind of putting the proof in the pudding, if you will, with some of the advancements in SQL Server 2016. So let’s go ahead and dive into that. Give us a quick recap of how you put that together and then we’ll dig into what you found.

    Jimmy:            Ok, well, the goal, as you stated, was to put together an Availability Group architecture that was actionable. We weren’t just trying to set super high world-record numbers; we wanted something we could toss over the transom for an arbitrary customer to read and implement. And along the way, of course, as you know, I work for a company that sells a lot of flash. And I’ve been a big fan of flipping the faster bit ever since even before I was a member of SQLCAT. So that was a big part of the picture, and there are some challenges even on flash: storage latency, things like that. And we have a handful of lessons learned. We proved out the really exciting log transport improvements. Log transport was a bottleneck in Availability Groups in 2012 and 2014. No longer; we get that data across the wire in near real time. However, there are still some issues with the log redo at the high end of performance. Along the way we also implemented some other findings, such as the -k startup trace flag, which is something that virtually any installation can take advantage of. We’ll go into it in more detail later, I hope, but it’s a startup trace flag, fully supported, not well documented, that throttles checkpoints at a level that you designate, and that’s pretty exciting. There are a number of lessons learned.

    Carlos:             So why don’t you take us through I guess what that setup looked like. You have a couple of boxes and ultimately you wanted to use Availability Groups and you wanted synchronous commits from the primary to both secondaries.

    Jimmy:            Yeah, the setup itself was largely pretty vanilla, and we did that by design. The only “exotic” pieces were some of the hardware components to make sure we didn’t run into bottlenecks. We had three 2U boxes, HP DL380 G9s. Our friends wouldn’t like us to call them commodity boxes; they’re super fantastic and amazingly performant, but they were off-the-shelf boxes. Two sockets; we used Broadwell processors. We started with Haswells, but during the project we upgraded. 256GB of RAM, nothing crazy. And we had some pretty high network throughput; we didn’t come close to hitting the bandwidth that we had. For availability, we put in two metal [inaudible – 4:15], 40Gb apiece, and we teamed those using standard Windows teaming. That’s something with which I had not a whole lot of experience, but it turns out to be very easy to set up and configure. Even a non-network geek could do it with ease. So we had a total aggregated bandwidth of 80Gb if we wanted to use it, and that would serve us well if we wanted to put together multiple AGs, Availability Groups. And what else can I tell you? Oh, the storage, of course, the star of the show. A lot of us are used to seeing flash as what used to be Fusion-io PCIe cards, right? Well, flash comes in a standard 2½-inch small form factor flavor now. We started out with eight and ended up with ten 2½-inch flash disks. Only 800GB apiece, relatively small; there is a 1.6TB flavor coming. And in fact, our company makes 4TB 2½-inch disks. It’s amazing to hold a 2½-inch disk in your hand and know you’re holding 4TB. And very, very soon we’re coming out with 8TB disks. By the way, these were a total of ten 800GB disks.

    Carlos:             Inserted right to the box?

    Jimmy:            Right to the box. Yeah, that’s one of the amazing things about these “commodity” servers off the shelf. A variety of vendors, including HP, make these boxes with 24 slots. It’s not a special order. You just say, “Hi, I want the box with 24 2½-inch slots in the front,” and you get it. So think about that: you put 4TB disks in there and you’ve got 96TB of raw storage. Put the new 8TB disks in there and you’ve got almost 200TB of raw storage in a box that’s about as big as two pizza boxes. It’s really amazing, and with the new processors from Intel you have the CPU to drive that capacity and the performance. Anyway, we used six 800GB disks for the data and four 800GB disks for the log.
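
As a quick check on the capacity figures in this part of the conversation, here is a minimal sketch of the arithmetic. The slot count and drive sizes are the ones mentioned above; the function name is ours, and the result is raw capacity only, before any RAID or formatting overhead.

```python
def raw_capacity_gb(slots: int, drive_gb: int) -> int:
    """Raw capacity in decimal gigabytes, before RAID or formatting overhead."""
    return slots * drive_gb


# A 24-slot chassis filled with 4TB or 8TB drives (figures from the conversation).
for drive_gb in (4_000, 8_000):
    total_tb = raw_capacity_gb(24, drive_gb) // 1_000
    print(f"24 x {drive_gb // 1_000}TB drives = {total_tb}TB raw")

# The test rig itself: six 800GB data disks and four 800GB log disks.
print("data GB:", raw_capacity_gb(6, 800), "log GB:", raw_capacity_gb(4, 800))
```

With 24 slots, 4TB drives give 96TB and 8TB drives give 192TB, which matches the "almost 200TB" figure.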

    Carlos:             And then it’s all yours. You’re not going to share those disks.

    Jimmy:            That’s correct, it’s a non-shared storage. Oh my! All yours.

    Steve:              I’m just listening to you describe that, and I’m thinking about configurations we have where we have shared storage or a bunch of virtual machines sharing the same hardware. The configuration you’ve got there is certainly going to be way more performant than what I typically see.

    Jimmy:            No question about it. Well, wait until I start talking about performance numbers. You know, again, I’m a guy who has dedicated his career to flipping the faster bit. We got some latencies that were so low that I had to double-check the math. We will get into that later, hopefully.

    Carlos:             Ok, so I guess let’s go back and set the stage for some of the pain points you managed to resolve there that we’ll talk about. In our podcast we want to make sure that we level the playing field, so we’re going to try to explain some of these concepts. With the Availability Group, you have a server. You’re setting it to synchronous commit, meaning you have a second server. The transaction has to write to the first server, be sent to the second, commit on the second, and get a reply back, and then the transaction is complete.

    Jimmy:            Exactly, except that we have three replicas, not just two: one primary and two secondaries. We actually have to get this ack from two servers before we proceed.

    Carlos:             Right, the ack meaning an acknowledgement?

    Jimmy:           Yes, yes, before the data can be hardened or committed back on the primary log.
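
To make the acknowledgement requirement concrete, here is a rough sketch of how commit latency behaves when every synchronous secondary must acknowledge. It is a toy model, not SQL Server's implementation: the function name and all timings are hypothetical, and it assumes the local log flush and the sends to the secondaries overlap, so the slowest path decides when the commit completes.

```python
from typing import Sequence, Tuple


def sync_commit_latency_ms(local_flush_ms: int,
                           secondaries: Sequence[Tuple[int, int]]) -> int:
    """Commit latency when every synchronous secondary must acknowledge.

    Each secondary is (network_round_trip_ms, remote_harden_ms).
    Assumption: the local flush and the remote sends happen in parallel,
    so the commit completes when the slowest of them is done.
    """
    ack_times = [round_trip + harden for round_trip, harden in secondaries]
    return max([local_flush_ms, *ack_times])


# Hypothetical timings: 1ms local flush and two synchronous secondaries.
print(sync_commit_latency_ms(1, [(1, 1), (2, 3)]))
```

The point of the model is that adding a second synchronous replica never makes a commit faster: the slowest acknowledgement sets the pace.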

    Jimmy:            Well, the way it works in terms of AGs is that the log is flushed from the primary database. Let’s call this the main server, the primary server. The data is flushed to the Availability Group’s logs. In this case we have a simple one-database, one-AG setup. And the data in that log is what is shipped across the wire. It’s received by the secondary AG logs and then it’s applied to the replica data files on that side.

    Steve:              So when it’s applied on that side, that causes the write to the data file appropriately, and then that log gets saved as the local log file. Is that correct?

    Jimmy:            Yeah, it’s the local log file. And on failover, for example, that log file becomes the primary log file. The recovery queue on the secondary is emptied. The secondary AG does a little bit of internal manipulation to say, “I’m preparing myself to be the new primary,” and then what was the secondary becomes the primary, and the user transactions transpire there just like they did originally on the primary.

    Steve:              So then you mentioned some enhancements with SQL Server 2016 around that process of getting the logs transported then the logs redone on that end.

    Jimmy:            Ahh! We were so pleasantly surprised. As far back as CTP1 we were seeing enormous performance improvements in getting the data across the wire. This is part of the “it just runs faster” hashtag: a lot of amazing performance improvements have been implemented in SQL Server 2016. Again, this is Jimmy May, former Microsoft employee, former member of the product team, but a third party now, ok? So I can say this with complete impartiality: I’m a huge fan of what the SQL Server product team has done. SQL Server 2012 is when AGs were introduced, and there was a bottleneck. It was tragic. AGs were pretty cool conceptually, you know, an enhancement of database [inaudible 11:09], but we were limited to 40 or 50MB/s across the wire, no matter what hardware you had, no matter how many CPUs, no matter how much memory, no matter how broad your network pipe was. There was a hard-coded bottleneck built into Availability Groups. It was a vestige of the database [inaudible 11:30] code buried very deeply in the bowels of it. When that code was written, way back when, there was no hardware in the world that required the kind of throughput that you could get 4-5 years ago, and certainly not today. So in SQL Server 2012 and 2014 there was this hard-coded bottleneck that you just could not work around, 40-50MB/s. In 2016, they unleashed the hounds. Again, with no special tuning, on the same server hardware, you just use a different version of SQL Server, 2014 vs. 2016. And I’ve got some great charts. In fact, I’ve got a whitepaper; you guys need to publish the link to it as part of the podcast. It has the documentation for all of this, in what I hope is a very lucid presentation. And the appendices are amongst the most valuable pieces of the whitepaper. You’ve got the usual “yada-yada” in the body of the paper, but the appendices have some very interesting findings that a geek like me and like you guys would be interested in, including charts comparing head-to-head matchups of SQL Server 2014 log transport vs. 2016. Again, just changing the version of SQL Server on the same hardware, we went from that 40-50MB/s up to a quarter of a gigabyte per second. Boom! And that genuinely expands the scope of applications that are suitable: even warehouse, ETL, any kind of index maintenance, etcetera. It was easy to hit that bottleneck in 2012 and 2014. And let me back up a little bit: the problem with this bottleneck isn’t just that you’re bottlenecked and can only get 50MB/s across the wire. While that data is being generated by the application on the primary, it is queued up on the primary, waiting to get across the wire, waiting to get inserted into the log transport mechanism. And if something happens to that primary, you’ve lost data. I mean, it’s possible you can regenerate it depending on your application, but at a fundamental level that data is gone. And that’s not a formula for high availability. Whereas now, easily up to 250MB/s, and with a little tuning you can get more, is sent across the wire in real time via the log transport mechanism and hardened on the secondary logs in real time. It’s just amazing. So on failover, boom, the data is there. No data loss. So not only do you have better performance, you’ve got better HA and DR. Ok, so that’s the log transport mechanism. Do you want me to dive into another aspect of the AG log-related stuff?

    Carlos:             Yeah, I think we are going to talk about the log redo?

    Jimmy:            Yes, exactly. So we’ve just been talking about the log transport and the exciting performance improvements that 2016 gave us. Log redo is the other piece of the Availability Group log-related mechanisms that is required for the process to work properly. And the log redo is simply a continuous restore process. Your data that’s shoveled across the wire is hardened on the secondary log and then, via the redo process, it’s applied to the constituent secondary replica database files. This is something you alluded to a little while ago, Steve. Nothing too exotic about it, just a continuous restore process. The challenge here, though, is that at the high performance levels we are generating, the redo can’t keep up. Now the SQL 2016 team did a great job. They’ve turned it into a parallel process. That helps, but it still just can’t keep up. So at the maximum levels we were driving, the redo was falling behind. That’s not a formula for true high availability. Disaster recovery, yes, but if the recovery queue on the secondary is building up, then when you do try to fail over, that recovery queue has got to empty before the server is actually available to accept transactions. Now the good news is that at this point there aren’t very many applications that require the kind of throughput we were throwing. You know, again, it’s pretty impressive: these 2U boxes, two sockets, two CPUs, a little bit of flash, and a workload generator throwing 250MB/s across the wire. Now how many applications require that kind of performance? Well, not that many, frankly, today. Even among the apps I used to work with at SQLCAT, just a handful of them. So the great news is, as always, the SQL Server team is very receptive. They are aware of the issue and they are actively working on it. So unlike the log transport issue, which took them two versions of SQL Server to remediate, this issue, which again is not a problem for hardly anybody in the world today, is likely to be remediated in the relatively medium term, before it becomes a problem for the rest of the world, for your day-to-day application.

    Steve:              So then just to recap to make sure I understand that bit: if you’re pushing the data through on the log transport at up to that quarter of a gigabyte per second, then your log redo queue is going to fill up on the secondary side and get backlogged to the point that it may take a while to catch up when you fail over.

    Jimmy:            Exactly. Exactly correct. Good summary, very pithy there. I wish I could be that pithy. Yeah, and so the log redo recovery queue just continues to grow at these high rates of performance. Most high-performance applications aren’t going to get even close to that kind of throughput, and you’re going to be ok. But this is a theoretical event. It’s really interesting though. Again, I mentioned the appendices in this whitepaper we just published. The appendices have this in detail. They actually show some of the internals, how you can interrogate what’s going on on the secondary replica, etcetera, to see for yourself what’s going on.
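
The redo backlog Steve summarizes can be estimated with simple arithmetic. The sketch below assumes steady rates; the 250MB/s arrival rate comes from the conversation, while the redo rate and duration in the example are made-up numbers, and the function names are ours.

```python
def redo_backlog_mb(arrival_mb_s: float, redo_mb_s: float, seconds: float) -> float:
    """Log hardened on the secondary but not yet redone after `seconds` of load."""
    return max(arrival_mb_s - redo_mb_s, 0) * seconds


def drain_seconds(backlog_mb: float, redo_mb_s: float) -> float:
    """Time to empty the backlog once no new log is arriving (e.g. at failover)."""
    if redo_mb_s <= 0:
        raise ValueError("redo rate must be positive")
    return backlog_mb / redo_mb_s


# Hypothetical: log arrives at 250MB/s, redo sustains 200MB/s, for 10 minutes.
backlog = redo_backlog_mb(250, 200, 600)
print(backlog, "MB queued;", drain_seconds(backlog, 200), "seconds to drain")
```

Under those assumed rates, ten minutes of peak load leaves a queue that takes a couple of minutes to drain before the new primary can accept transactions.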

    Steve:              And we’ll have to include the links to that in the podcasts notes.

    Carlos:             That’s right which will be at sqldatapartners.com/jimmy.

    Steve:              Alright, so on to another topic. I know you mentioned the -k startup flag. Can you maybe jump into that a little bit and what you’ve found there?

    Jimmy:            Ok, this is one of the most exciting takeaways of this work. I know when we were level setting for the podcast, I asked you guys if you’d heard about it. And the truth is very few people have heard about it. I first heard about the -k startup trace flag in SQLskills training during immersion, IE 01, with Paul, Kim, Jonathan, Erin, Tim, Glenn, etcetera, recently.

    Carlos:             I managed to look it up on the MSDN and it’s not listed on the engine services startup options.

    Jimmy:            It is fully supported. And it has been, believe it or not, since 2005. So we need to get that fixed. You know what, we have a great Connect item. It is hard to find documentation. In fact, I remember first using it when I was involved, again as a third-party person, post-Microsoft. I was invited by the CAT team to come in and assist with a lab about a year ago. And I saw this implemented in real life with one of the customers that do crazy cutting-edge stuff. My friend Arvin, one of the first MCMs in 2008, was implementing it with this customer, and so I got the chance to see the behavior of -k in real life. And it’s amazing. It’s something that I’m pretty sure a lot of places that need cutting-edge performance are going to start adopting. And here is why. Even in conventional shops, you know, as a consultant for Microsoft Services 10 or more years ago, my old day job, with a big, fast database, when a checkpoint occurs... Well, we haven’t talked about it yet. What is -k? It throttles the checkpoint to a user-defined level. Many of us who have any experience in I/O are familiar with the checkpoint spikes that occur. You know, you have dirty pages in the buffer pool, and SQL Server flushes them to disk sufficiently often to keep database recovery short; by recovery I mean on restart, I should say, independent of Availability Groups. Now, -k is important not just for Availability Groups but for any standalone SQL Server instance that experiences challenges with checkpoints overwhelming the disk I/O subsystem. And it’s very, very common; it has been for years. These challenges can range anywhere from a spike when the dirty pages get flushed to disk to a stalagmite, you know, a little bit of a pyramid, where the checkpoint basically never finishes. And so 5, 10, 20ms latencies for spinning media, or 1 or 2ms latencies for flash, can suddenly turn into 100ms or 500ms latencies over a long period of time. It just hammers performance, and you’ve got inconsistency problems too; for the duration of the checkpoint the entire system is throttled. In our case, for example, with the default behavior, without implementing -k, we were getting 99,000 transactions per second between checkpoints. During checkpoints we were only getting 49,000 transactions per second. So if you look at the chart in the whitepaper you’ll see what looks literally like a rollercoaster. Just high, vhoov, high, vhoov. It’s quite a ride, but it’s not something that your users want or your system wants. You implement -k by simply adding it as a startup flag like any other, followed with no space by an integer representing the number of megabytes per second of checkpoint throughput you want SQL Server to provide, but no more. We did several tests, and in our case the sweet spot was 750. -k750 said, “Hey, SQL Server, I don’t care how many dirty pages you have; never shove more than 750MB/s down at my disk.” Doing so turned the rollercoaster ride of the default behavior we all know and hate into remarkably flat system performance. Checkpoint pages per second are throttled at 750MB/s, and because you define it, you do the testing, you find the sweet spot, suddenly CPU is basically flat. Your transactions per second are basically flat; log flushes per second are basically flat and consistent. And the great news is that not only do you get consistent behavior, but the average is far higher than the average of the rollercoaster ride. And so, yeah, I’m telling you, I posted the paper to the MCM distribution list when it was published, and I had several comments back about it: “Oh, where’s this been all my life?” and that kind of stuff. I think you’ll start to see some adoption. So again, one of the biggest takeaways of this Availability Group paper isn’t just the great things that we can do with a 2U box and a little bit of flash and the improvements in 2016; it’s also the first time I’ve ever seen the impact of the -k startup trace flag documented anywhere.
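
As a small illustration of the syntax described here (the flag, no space, then a whole number of megabytes per second), the sketch below builds the startup parameter string and estimates how long a throttled checkpoint would take to flush a given amount of dirty data. The helper names and the 15,000MB of dirty pages are hypothetical; only the 750 value comes from the conversation.

```python
def checkpoint_throttle_flag(mb_per_sec: int) -> str:
    """Build the -k startup parameter: '-k' plus an integer MB/s, no space."""
    if isinstance(mb_per_sec, bool) or not isinstance(mb_per_sec, int) or mb_per_sec <= 0:
        raise ValueError("throttle must be a positive whole number of MB/s")
    return f"-k{mb_per_sec}"


def flush_seconds(dirty_mb: float, throttle_mb_s: int) -> float:
    """Rough time for a checkpoint to write `dirty_mb` at the throttled rate."""
    return dirty_mb / throttle_mb_s


print(checkpoint_throttle_flag(750))
# Hypothetical: 15,000MB of dirty pages flushed at the 750MB/s throttle.
print(flush_seconds(15_000, 750), "seconds")
```

The string would be added to the instance's startup parameters alongside any others; the right number for a given system has to come from testing, as Jimmy describes.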

    Carlos:             And when we start talking about trace flags, one of the things we want to make sure of is that we understand the downside, or at least why it is not enabled by default. There may be a good reason. And if I think about it, and again your mileage may vary depending on your sizes and whatnot, what you’re potentially going to increase is this: when that instance restarts, you may have to go through more of those redo logs to get back. Is that fair?

    Jimmy:            Ah, that’s a good guess. That’s a very good guess, but first let me be clear: -k isn’t directly related to anything involving Availability Groups. It’s basically...

    Carlos:             I guess I said redo log when I meant the log, right? So you’re going through the log and rolling forward or back.

    Jimmy:            Right, so you could set -k too low and you’d basically get a sustained checkpoint that never, ever actually finishes, and that’s what we did; that was part of our testing, seriously. And you would run into exactly that situation: recovery on database restart, on server restart, would take far longer than you want. When I refer to the sweet spot, part of it means limiting the value so that it’s below the limit the disk I/O subsystem can sustain in terms of how many dirty pages it can take per second. But you also want the checkpoint throughput to be sufficiently high that you don’t run into that situation where you restart the database and it takes forever to recover. And so that is a potential downside. You know, we didn’t talk about this beforehand, so I’m impressed that you came up with that. But the reality is that with just a little bit of testing, and we tested it thoroughly, it takes less than a day, half a day probably, you get a number where recovery is normal, you can restart the database in a few seconds, yet you’re not overwhelming the disk I/O subsystem. You asked the question, Steve, why isn’t this on by default? That’s something I hadn’t thought of, but the answer came immediately to my mind. The SQL Server product team by its nature does things conservatively. Do no harm, etcetera. It’s like trace flags 1117 and 1118. How long have we wondered why these aren’t the default? Why don’t they create multiple TempDB files, four or whatever, by default? Well, they were conservative, right? They finally realized in 2016, the case was made, and they said, “We’re going to implement 1117 and 1118, at least in the context of TempDB, by default.” It’s a known bottleneck. It’s been the best practice for years for users to implement it. “We’re going to save them the trouble and finally make it the default,” even though it’s been the best practice for 3, 4, 6 versions, who knows. I’d have to think really hard about how far back we have to go to discover when they were introduced. So I think it’s kind of like that, especially as flash becomes more and more the predominant host for SQL Server data and log files. With the crazy performance we’re able to provide on contemporary servers, it’s a shame that people are going to have to suffer needlessly with the default checkpoint behavior. So it won’t surprise me if in some subsequent version of SQL Server we see that implemented by default.
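
The two bounds on the sweet spot can be written down as a simple check. This is only a framing aid under stated assumptions: it treats the rate at which the workload dirties pages and the rate the disks can absorb as known, steady numbers (both hypothetical here), whereas in practice they come out of the kind of testing Jimmy describes. The function names are ours.

```python
from typing import List


def in_sweet_spot(throttle_mb_s: int, dirty_rate_mb_s: int, disk_limit_mb_s: int) -> bool:
    """True if the throttle keeps up with dirty pages without swamping the disks."""
    return dirty_rate_mb_s <= throttle_mb_s <= disk_limit_mb_s


def candidate_throttles(dirty_rate_mb_s: int, disk_limit_mb_s: int, step: int = 50) -> List[int]:
    """Throttle values, in `step` MB/s increments, that satisfy both bounds."""
    return [t for t in range(step, disk_limit_mb_s + 1, step)
            if in_sweet_spot(t, dirty_rate_mb_s, disk_limit_mb_s)]


# Hypothetical: pages are dirtied at 600MB/s and the disks sustain 900MB/s.
print(candidate_throttles(600, 900))
```

A value below the lower bound gives the never-finishing checkpoint and slow restart recovery; a value above the upper bound brings back the latency spikes.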

    Steve:              However though you would still need to have a way to adjust it to your specific environment even if there was a default for it. I would assume.

    Jimmy:            No question, no question. Or maybe SQL Server can be smart enough. Because, you know, first do no harm; it would be easy to implement a default that in some cases could be harmful. You’re right there, Steve, absolutely. But SQL Server is getting smarter all the time, so who knows what they’ll think of next.

    Steve:              Ok, so on that then, how about any other best practices that you learned while you were going through this process of building this out and testing it?

    Jimmy:            Oh man, I could go on. At least give me time for two, please. Hopefully a bit more. The two next most important ones, I think: first, hardware validation. Again, this is independent of AGs. This is independent of SQL Server 2016. It’s been a long-standing best practice of mine as a geek parachuting into cities all over the planet for Microsoft Consulting Services. I evangelize, evangelize, evangelize: “Validate your hardware before you put it into production.” You know, if you flip that production bit on that big million-dollar SAN before validating it thoroughly and then you realize you have a problem, you’re in trouble. Seriously, I promise, it’s not going to be fixed very easily, and no one is going to be happy about it. So the time to validate performance is before you put your hardware into production, whatever the hardware is, and especially storage. I would say, “When you buy new hardware you need to hit two different thresholds.” One, you need to hit the specs that are documented, which you can download from whoever your vendor is. The second one is to hold the feet of that vendor’s sales geeks to the fire. Make sure you’re able to hit the promises they made, and if you can’t hit them, get their people in there until you do. So, circling back to the work that we did, we were struggling. We complied with our own best practices; believe it or not, the cobbler’s kids have shoes in my shop. I wasn’t able to hit the numbers I needed. I said, “Wait, here are the specs. What’s going on here?” I was consistently 20% below what the nominal specs were. And we struggled for a couple of days, three days, until we realized we had the wrong firmware on the SSDs. We were really scratching our heads and had to call in some bigger geeks than I am. For the record, validating the hardware wasn’t on my plate. It wasn’t my job to do that.

    Steve:              Someone else’s job but it’s your problem, right?

    Jimmy:            Yeah, exactly, but the point is we discovered this not halfway into the hardcore testing, where I would have had to restart all the experiments. This was at a foundational point in the work. We were pretty early on and we got it done. The takeaway from this is not merely, yes, validate your hardware, but also, and this is very important, and this is flash-related: most of us don’t upgrade the firmware or drivers for our spinning media, ok? I certainly never have. Occasionally, when I’m asked, I say, “Yes, yes we do.” But though if answer buts, answer no questions. Flash, like most of the other electronics in our servers, requires updates, both drivers and firmware. Make sure you have the latest and greatest. Test and verify and all of that, of course, but get your latest and greatest stuff out there. And I know from experience working with SanDisk, and now Western Digital, and formerly Fusion-io, that stuff can make a big difference. So two takeaways there. One other one I want to throw out: we have the luxury of working in an all-flash environment and using crazy parameters to do our backups and restores. We back up and restore multiple-terabyte databases routinely, as a matter of course, in a matter of minutes, and it’s pretty cool. Whereas in environments I’ve worked in in the past, a multiple-terabyte database could literally take hours, overnight, multiple days. So it’s cool to be able to sling those bits around in a matter of minutes. And we do this by modifying some long-standing parameters that have been available to us in SQL Server for basically over a decade. Well, gosh, a decade and a half, two decades. The number of devices, max transfer size, and buffer count. My historical default has been to use 8 devices for data files. Max transfer size, you can’t make it more than 4MB; the default is 64KB. So at every request from a data file the backup process will grab a 4MB chunk instead of a 64KB chunk. And then buffer count, which is how much memory the backup process is allowed to use. The default varies, as I understand it, based on the hardware configuration that the database is on. But to maximize backup performance, we multiply the number of logical cores times four. This is something that my old buddy and mentor Thomas Kejser taught me, and I’ve been doing that for years. Well, you may remember, circling back, the purpose of this whitepaper was to provide some actionable guidance for people to implement. And part of a real-life, real-world scenario is doing backups; we chose to implement our log backups every five minutes. And we needed to find values for both database backup and log backup that wouldn’t impact the latency of the application. So instead of tuning purely for backup performance we had to balance that with application performance, which is something I wasn’t used to because I haven’t been a production DBA for well over a decade now. And that was a lot of fun. So if you were to download the whitepaper you would find the parameters we used. I don’t need to go into the details here; we used different parameters for data files vs. log files. Easy to implement, but the point is we did some of the heavy lifting for you and provided a template that you could use in your own installations.
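
To show how the three backup knobs fit together, here is a minimal sketch that assembles a T-SQL `BACKUP DATABASE` statement using the maximize-throughput rule of thumb from the conversation: 8 devices, a 4MB max transfer size, and a buffer count of logical cores times four. The database name, directory, and file naming are hypothetical, and these are not the tuned values from the whitepaper, which were chosen to protect application latency. The helper does no escaping of names, so treat it as an illustration rather than production code.

```python
MAX_TRANSFER_SIZE_BYTES = 4 * 1024 * 1024  # the 4MB ceiling mentioned in the conversation


def buffer_count(logical_cores: int) -> int:
    """Rule of thumb from the conversation: logical cores times four."""
    return logical_cores * 4


def backup_statement(database: str, directory: str, devices: int, logical_cores: int) -> str:
    """Build a striped BACKUP DATABASE statement (hypothetical names and paths)."""
    if devices < 1:
        raise ValueError("at least one backup device is required")
    targets = ",\n    ".join(
        f"DISK = N'{directory}\\{database}_{i}.bak'" for i in range(1, devices + 1)
    )
    return (
        f"BACKUP DATABASE [{database}]\nTO\n    {targets}\n"
        f"WITH MAXTRANSFERSIZE = {MAX_TRANSFER_SIZE_BYTES}, "
        f"BUFFERCOUNT = {buffer_count(logical_cores)};"
    )


# Hypothetical server with 32 logical cores, striping across 8 backup files.
print(backup_statement("SalesDB", r"D:\Backup", devices=8, logical_cores=32))
```

On a system that also has to serve users during the backup, smaller values than these may be the better trade-off, which is the balancing act described above.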

    Steve:              Yeah, that’s great. I would bet that probably 99% of the backups I see out there are just going on with the defaults on those.

    Jimmy:            And it’s tragic.

    Carlos:             Yeah.

    Steve:              Absolutely.

    Jimmy:            I know tragic is probably overkill of a metaphor but we are talking squandered resources. Okay. So there we go.

    Carlos:             Well, so Jimmy, thanks for coming on and chatting with us today. I know that you’re willing to take the pains and kind of take one for the team and then share with us so that we can learn from your experience.

    Jimmy:            Are you kidding? This is the best job I’ve ever had in my life. I mean, SQLCAT was a lot of fun, but this job is just a blast, so I don’t mind. If these are the kind of hits I have to take, fine, keep them coming.

    Carlos:             Before we let you go shall we do SQL family?

    Jimmy:            Sure.

    Steve:              Alright, so the first SQL Family question is on keeping up with technology. How do you go about keeping up with all the changes that are continuously happening? Besides writing whitepapers.

    Jimmy:            That’s really a tough one. You know, because SQL Server is so broad and deep and, you know, the impostor syndrome, a lot of people are. Gosh, I could go 20 minutes on this. I’m keenly aware of my deficits; they are broad and they are deep. Man, I just fake it until I make it. I’m candid; at this point of my career I can be candid about my deficits. And my buddy John Stewart, our shiny new MVP, he told me, I think it was John who told me, “If you stay one chapter ahead of your customer you’re an expert.” That’s part of the secret, and the truth is I’m actually taking a new tack. I live an intentional life. Seriously, I create a vision statement; from month to month, quarter to quarter, year to year, I craft my goals. And one of the reasons I’m very excited about this December is that I have a couple of really hardcore weeks off, and between ski sessions, ski days, I’m going to craft my vision statement. And that includes a hardcore plan, not just “I’m going to read PowerShell [inaudible — 37:18]”; I’m going to plot when I’m going to do those chapters. Michael Fall, another shiny new MVP, highly recommended that book, among other people. And also I’ve enrolled in, and I have just started, the Microsoft Data Sciences professional curriculum. And my goal is to finish it by the end of December. So to answer your question, besides faking it until I make it, besides being candid, besides not buying into impostor syndrome, I live an intentional life and I plot out my goals. And one of the things, speaking of impostor syndrome: a dear, sweet, beautiful gem of a friend, Marilyn Grant, during a training session at Microsoft Consulting Services said the following, and this is very important. It really helped me in my dark days, when I thought I wasn’t smart enough to tie my shoes, surrounded by brilliant people within Microsoft and on the customer side, and that is the following: “You are as smart as they think you are or you wouldn’t be here.” And that’s a very valuable aphorism for me. Okay, I hope that answer, not pithy like you guys, but I hope you enjoyed it.

    Carlos:             So you’ve worked with SQL Server for a long time, Jimmy, and you did talk about that redo log, but if there is one thing you could change about SQL Server, what would it be?

    Jimmy:            That would be it right now. The redo log, it’s the big bottleneck. It’s the thing right now that has the potential to keep SQL Server from actualizing its true capabilities. Now again, not a lot of people need it right now. But it’s limiting us. I can’t invest time documenting AG solutions if we have this bottleneck. If that bottleneck were remediated tomorrow we would be off to the races with various OEMs, documenting one Availability Group solution after another.

    Steve:              So what’s the best piece of career advice you’ve ever received?

    Jimmy:            Man, you guys are hammering me. This is actually a question I’ve heard before, so I know the answer to this. When I was in grad school, there was a guy, I don’t know his name. We were doing our thing that day and he, without solicitation, gave me this advice. He says, “Specialize. Find a niche and dig deep.” And that’s paid off for me. You’ve heard similar things maybe from other people, but it’s really paid off for me. Some of you may know my original “claim to fame” was that I published the stuff related to disk partition alignment. I evangelized it. Hey, you know, I didn’t invent that, but I was, let’s call it clever enough, smart enough, had enough wits about me that when I heard about it I was just gobsmacked and couldn’t believe this existed. And you know, no one really knew about it. And so I searched for an opportunity to actually do the testing, seized that opportunity, validated it for myself, and we were off to the races. I mean, that is what brought me to the attention of the SQLCAT team, and the SQLCAT team has been a very valuable part of my success and put me in a position to be successful at SanDisk, Western Digital, for example with the Fusion-io people. So specialize, find something and go for it. One other thing, I’m sorry, can I interject two other things? They just came to mind.

    Steve:              Of course.

    Jimmy:            Oh, thank you! Focus on the fundamentals. If you notice, you go to your SQL Saturday sessions, etcetera. Some of the sessions on fundamentals are some of the most well attended. And it’s true, fundamentals are important. You can’t get too much of them. And speaking of those sessions: community, community, community. Can’t get too much community. You know, you guys referred to the SQL Family questions. This whole set of questions is based on SQL Family, this close-knit group of folks we hang out with. Pretty cool.

    Carlos:             So you’ve mentioned a little bit about your background and some of the places that you’ve been. But give us the nickel tour: how did you actually get started with SQL Server?

    Jimmy:            Oh my gosh, I could get in trouble with this answer. Okay, you said nickel. I went to grad school, went to California, came home a few years later with my tail between my legs and got a job. That’s the part that gets me in trouble. But I got a job way back in the days of Windows 3.1, and we did everything on paper, and I said, “We can computerize this.” I didn’t know anything about computers but I thought I could figure it out. And the computer that I wanted to use, that had Excel on it, was always busy. So I went to this other computer; it didn’t have Excel but had this thing called Access on it. And I thought, “Well, that looks close enough to a spreadsheet to me. I’m sure I can figure it out.” And I had a hard copy manual of Access 3.0. I smuggled it home with me and curled up with it every night. I never slept alone.

    Steve:              So Access was your gateway to SQL Server.

    Jimmy:            Yeah, Access was my gateway. In fact, I remember the first time I saw SQL Server. I sat down and I thought, “Where’s the GUI. What am I supposed to do now? What is this crap?”

    Steve:              Alright, so I think our last question is if you could have one superhero power. What would it be? And why would you want it?

    Jimmy:            Oh, I wish I had more charisma. I know I’m a nice guy, but I wish I was more charismatic. But the truth is, I want to be Indefatigable Man. I need my sleep, you know. As I approach my middle age, wink wink, I need my 8 hours, or I start turning into Mr. Pumpkin. In fact, I was invited to go to the Microsoft Server holiday party the other night, and my friend actually got an implicit commitment from me that I wouldn’t start whining after 9 o’clock. I would man up and hang out until 11. So that’s my thing: if I could get by with six hours of sleep on a consistent basis instead of 8, I’d be a different man.

    Steve:              I’m with you there. Making it until 11 is much harder than it used to be.

    Jimmy:            Yeah, but I will say, though, related to health is the importance of fitness. I know, Steve, you’ve lost some weight. Carlos, I don’t know what your situation is. I’ve lost 85 pounds three different times. Speaking of a yo-yo, you know, rollercoaster ride, three different times I’ve gained and lost 85 pounds. And I can’t tell you how many times I’ve gained and lost 50 pounds. I have finally gotten myself to a point where I am relatively stable and healthy, and I make fitness a priority. My claim to fame: I mentioned my misspent youth a while ago, where I came back with my tail between my legs. While out there on the left coast, Richard Simmons, you guys have heard of Sweatin’ to the Oldies?

Listen to Learn

  • What improvements have been made to the log transport process
  • Changes to the log redo.
  • A new startup parameter -k and what it does
  • Why hardware validation is still important
  • Jimmy was the guest of what celebrity from the 90’s?
