Episode 97: SQL Server Replication

If you have ever been responsible for a replication environment, you know what a pain that can be.  Many of the guests of the show have pointed out their frustration with replication, and many are loath to use it–myself included; however, there are a few brave souls out in the world who do use replication regularly–and even use it well.  Chuck Lathrope is our guest today and he talks us through some of the components of replication and why we might not need to be so skittish about this long-time feature.

We cover all the basics, including an overview of Transactional Replication, Merge Replication, Peer-to-Peer Replication and Snapshot Replication, as well as the concepts of publisher, distributor and subscriber.  Most importantly, Chuck gives us some good business cases for why we might use replication, and I think you will find the discussion interesting.  Thanks for joining us on the SQL trail!

 Episode Quote

“I like to talk about it to waylay fears and give out some best practice knowledge”

“Stick to people who know what they are doing”

“The real ultimate goal of replication is to offload work from your main source of truth database server”

Listen to Learn

– Why you might use SQL server replication
– Different types of replication
– Use cases of transactional replication
– Replication and trace token
– Troubleshooting tips
– Where to get good information

https://gallery.technet.microsoft.com/scriptcenter/SQL-Server-Transactional-e34ed1e8/

https://www.brentozar.com/archive/2014/07/performance-tuning-sql-server-transactional-replication-checklist/

https://www.sqlskills.com/blogs/kimberly/8-steps-to-better-transaction-log-throughput/

http://github.com/SQLGuyChuck/

http://tribalsql.com/

http://download.red-gate.com/ebooks/SQL/fundamentals-of-sql-server-2012-replication.pdf/

About Chuck Lathrope

Chuck Lathrope is a Data Architect for Limeade, a SaaS health and wellness platform. He is a two-time SQL Saturday event speaker on replication and was a Top 5 nominee for the Red Gate Exceptional DBA Award in 2009. With over 20 years of experience in IT, he has used many different operating systems and worked with many different applications. He was also a Windows System Administrator for 12 years.

Transcription: Transactional Replication

Carlos: Chuck, welcome to the show.

Chuck: Thank you very much guys, glad to be here.

Steve: Yeah, it’s great to have you on the show. I know we’ve kind of been talking about this for a few weeks trying to get together, and I’m happy to have you here as a guest.

Chuck: Yeah, I know. Glad to be here guys.

Carlos: Yeah, even if you are a west coaster you’re still accepted here.

Chuck: Oh, great!

Steve: Anyhow, Carlos, there are more west coasters than there are east coasters right now.

Carlos: Oh man.

Chuck: Yeah, we’re winning.

Carlos: Well, what I lack in numbers I make up for in ego, apparently.

Steve: Alright, so today’s topic is SQL Server replication. I know this is a topic I first heard you talk about at a SQL Saturday in 2012, and I found your presentation there really interesting, though I didn’t pick it up very soon after that. But when it came around to a time that I needed to use replication, I leaned back on some of those things that I learned from you back then, so it’s great to have you on the show to be able to talk about replication today.

Chuck: Yeah, great, thank you! Yeah, I love talking about SQL Server replication at SQL Saturday events. I try to do it as often as I can because I have all the battle wounds from having a very large environment with transactional replication. So whenever I can, I like to talk about it to waylay fears and give out some best practice knowledge.

Carlos: Well that’s interesting because I think there is a lot of fear around replication. We’ve brought it up in Episode 56 and yeah nobody wants to touch it with a 10-foot pole and here you said you have a large environment with replication, so I know I’m interested to kind of get in here and talk about some of your experience and why maybe we have so much fear around it.

Chuck: Yeah. I think the difficult part about replication is that it does a great job of retrying even though there is a failure. So if you’re not monitoring it correctly it will go sideways, and if you’re not monitoring appropriately it can get all kinds of bad, and then you get into stuck situations like, now I’ve got this gigantic log file or distribution database. How am I going to fix it? And then you go on the net and google stuff, and there is so much bad advice out there that would basically destroy your entire environment, and you will get a lot of wounds from trying to follow some of the advice out there. It seems to me like most of the advice out there is for people doing it in a dev environment, not a real production environment, so it’s not great. I usually stick to people like Hilary. Anyways, there are some great people out there, from the Microsoft side and from the consultant side, that give great recommendations for replication. I mean, stick to people who know what they are doing, not random posts on the internet.

Carlos: Well, so I guess shall we go ahead and kind of let’s tee up some of the components around replication and let’s talk about some of the moving pieces.

Chuck: So there are three types of replication. Well, there are actually four, but one variation, updatable subscriptions, has been deprecated. There is Transactional Replication, which is always running, always sending out data to your subscribers. Then there is Merge Replication, which allows you to do bidirectional updates, and then there is Peer-to-Peer Replication, and then there is Snapshot Replication. Snapshot is really something you do on a less frequent basis rather than all the time like transactional. So in a data warehouse scenario I could do it every 6 hours, or every 12, or every 24 hours and get that over to my data warehouse; that’s what snapshot does. And my expertise happens to be in transactional replication.

Steve: Ok, so then what are some of the business cases or real use cases around transactional replication then?

Chuck: Yeah, so the real ultimate goal of replication is to offload work from your main source of truth database server. So you’ve got multiple databases, and they can all be part of different publications that then send out to subscribers, which are your clients that receive that data. The main goal there is just to offload work off of that main production server, and that’s kind of its goal in life.

Steve: So occasionally in high availability or disaster recovery conversations somebody will bring up replication as a possible solution and I’ve never had a good feeling about that and I’m just curious what your thoughts are.

Chuck: Yeah, well, just like you guys were talking about in Episode 59 on high availability solutions, replication wasn’t really designed to be a high availability solution. It was really designed to replicate some data to give to other servers so they can do processing of that same data for whatever purpose you may need it for. What you can do with replication is publish out tables, or if you have Enterprise you can publish out a view, and you can even publish out your stored procedures and functions and whatnot, and your indexes. But you don’t have the security that kind of goes along with that, and there is no automatic failover or anything like that. Your best option there is to stick with Always On availability groups or log shipping or database mirroring to do that high availability.

Carlos: Ok, so if I really want to just move some data off to another server for reporting purposes, that’s where transactional replication or snapshot replication might be a good solution, but not for HA/DR.

Chuck: Yes, and a nice thing about replication is it’s available on all editions and versions of SQL Server, so I can have Standard Edition of SQL out there and use that. I can even use that on my distributor server. So to give the bigger picture of transactional replication: you have the publication server, and that’s your main source of truth data, and then you have the distribution server that holds the distribution database, which then sends that information to the subscribers. So there is always a three-way kind of path that goes from publisher to distributor to subscriber, and you can have many subscribers, and then you have one distribution server that could actually be on the publisher or the subscriber side. You don’t have to have a dedicated one. But when you start getting into bigger environments you definitely want a dedicated distribution server.

Steve: Ok. Now, I know you said it works with all of the latest versions of SQL Server; when was replication first introduced?

Chuck: It’s been around forever. I think it even goes back to the Sybase days. It’s basically just pulling data from your log file, looking to see which of those transactions are flagged as marked for replication, and then it sends that on down the path, so it’s been around for ages.

Steve: Yup, so with replication going from a publisher to the distributor to the subscriber, does the subscriber have to be on the same version of SQL Server as the publisher?

Chuck: No, it doesn’t. You can have a mix. Your distributor just needs to be at least at the level of your publisher, so you could have a 2008 R2 publisher, a 2014 distributor, and a 2016 subscriber. The distributor just needs to be at a level according to your publication, so you can have a higher level on your subscriber, but you can’t go lower on your distribution server than your publication server.

Carlos: So we’re talking about some of these terms, right: publisher, distributor, subscriber. The original documentation release that I saw mentions kind of this newspaper hierarchy, which some of our millennials may not even know what that is. And one of the things that was brought up is just some of the overhead in maintaining some of these. Where has some of the love been? Because it’s been around so long, where has some of the love from Microsoft been with replication?

Chuck: Yeah, good question. So the latest kind of feature set with replication is the support for Always On availability groups, such that if your main active server, your primary, fails over, your primary can move to any of the other secondary servers and replication will go along with it, so that’s kind of the main recent improvement. And that same team also does Change Data Capture and Change Tracking, so some of those features have been receiving a lot of love in the past few editions of SQL Server. But SQL replication has been around for such a long period of time that there aren’t a lot more features they can add that I could really think of, other than maybe helping out with the supportability of replication, so most of the love has gone that route.

Carlos: Got you. And that’s where I would have to agree: in a setup like you mentioned, there’s all kinds of bad advice on fixing or troubleshooting it. I guess I would like to see a little bit more love there because I feel like a lot of the fixes are just “start over again.” Being able to tap into that a little bit easier and feel better about what’s going on would be super helpful.

Chuck: Yeah, it would, especially for people just getting into it. It can be overwhelming. I mean, I was passionate enough about it in past days to help author a chapter of a SQL book from Redgate just on replication: here is how it works, here is how you can monitor it, and here are some best practices, just to get more information out there, because there have been relatively few books out there. I can give a link to that. And then also Sebastian Meine, who I think you guys had on a recent podcast, created Fundamentals of SQL Server 2012 Replication. It’s a free download from Redgate’s books too and is a great starting point for getting into replication.

Steve: Yup. Now, on the front of troubleshooting replication, don’t you have some scripts that you built available on GitHub that can help with that? I know I’ve used them a couple of times; I just forgot where exactly they were.

Chuck: Yes, so my SQLGuyChuck account on GitHub has some replication monitoring scripts for live use, kind of my lessons learned in replication. Replication gives you the ability to create what are called tracer tokens. Basically you insert this tracer command into your log file, it’s stuck at the very end, and it watches it go through the entire process to your subscribers to see the latency that’s involved in there. You can do that with the Replication Monitor tool, which is a GUI tool that you would use to monitor and maintain replication. But the trouble with the tracer token is that, if you’ve got a busy environment that’s kind of overwhelmed currently, that tracer token might take eons of time to get through, and you mainly do it manually, or you can script it automatically. The Microsoft IT guys actually put up, on one of the free Script Center sites, a tool where you can monitor replication. They created a tool that will just continuously insert tracer tokens and you can see kind of the health with it. But I found that a little bit limiting, so my trick to tracking replication is to look at the distribution database and figure out how many commands are waiting to be distributed out to the subscribers, and then I monitor that on a frequent basis, like every 15 minutes or every half an hour, and it sends me an email if it’s gone above some sort of threshold. And that threshold is a variable amount that I can put in as a parameter, because in some environments, you know, 5,000 records behind is a big deal; in another environment it could be hundreds of thousands of records behind if you have a really gigantic installation. So that’s what my monitoring kind of does. It just monitors that queue that’s basically up there on the distributor, and then I also monitor for the kind of pain points that replication has.
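
For reference, one minimal way to check how far behind the distributor is, along the lines Chuck describes (his own scripts on GitHub are more complete). The server, database, and publication names below are placeholders, not anything from his environment.

```sql
-- Run against the distribution database; returns pendingcmdcount (commands waiting)
-- and an estimated processing time for one subscription.
USE distribution;
EXEC sp_replmonitorsubscriptionpendingcmds
    @publisher         = N'PUBSERVER',
    @publisher_db      = N'SalesDb',
    @publication       = N'SalesPub',
    @subscriber        = N'SUBSERVER',
    @subscriber_db     = N'SalesDb_Reporting',
    @subscription_type = 0;   -- 0 = push, 1 = pull
-- Schedule something like this in an Agent job and alert when the count
-- crosses whatever threshold makes sense for your environment.
```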

Carlos: I apologize. Before you get into that, I want to stop and make sure that everybody knows that all the scripts and things we talk about today will be available at sqldatapartners.com/replication, so we’ll make sure to put all those things up there. But you had talked about tracer tokens, and all of a sudden I’m thinking of a trace, and that’s not Profiler, but something like that came to mind. This is new to me, right? So it almost sounds like this is an object that I want to create in SQL Server that’s going to start capturing commands, which again kind of sounds like Profiler or Extended Events. Is it just something that is specific to replication? I guess help me define what the tracer token is.

Chuck: Yeah, it’s very specific to replication. You basically stick in this special command. You can use T-SQL to create this little tracer token, or you can use the Replication Monitor tool to insert the tracer token and just watch it in the GUI. And it shows you the time from your publication to the distributor and then the time from the distributor to the subscriber. Basically it goes through that path. You don’t see it; it just happens behind the scenes, a kind of little hidden thing, but you can programmatically watch that information and see the results of it.
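
A rough sketch of posting and reading a tracer token with T-SQL, run at the publisher in the published database; the publication name is a placeholder.

```sql
USE SalesDb;   -- the published database
DECLARE @tokenId INT;

-- Drop the "dye" into the stream: write a tracer token to the log.
EXEC sp_posttracertoken
    @publication     = N'SalesPub',
    @tracer_token_id = @tokenId OUTPUT;

-- Later, see how long it took publisher -> distributor -> subscriber.
EXEC sp_helptracertokenhistory
    @publication = N'SalesPub',
    @tracer_id   = @tokenId;
```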

Carlos: Ok, so I don’t need to create it or modify it. It’s already there. It’s just what I’m using to collect the information from.

Chuck: Yeah, you have to basically inject it. It’s like putting a dye into a stream. You put this dye, a little drop in, and you can just do one at a time. So you have to manually put that dye drop in the stream and you watch that dye drop go the entire path to the ocean, I guess.

Steve: But in this example, that dye drop as you refer to it is really a record being injected into the transaction log. And then when it’s complete, that means it’s been through the part that reads the transaction log and moves it to the distributor and then actually ends up on the subscriber side. Conceptually it would be almost like if you had a table and you inserted a row into it, and then you knew on the other side that that row actually arrived.

Carlos: Because that’s the one you’re looking at.

Steve: Yup.

Chuck: Some people even use what I call a canary table. You create a generic table and update the date on it, so it’s just a table with one date column set to the current date, and you replicate that out and use it kind of like a canary table. Then you can look at the subscriber and see basically how latent you are, because that value should be as close to the current date as possible. That’s another method you could use for monitoring.
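
One possible shape of the canary table idea Chuck mentions; the table and column names are illustrative, and a primary key is added because transactional replication articles require one.

```sql
-- On the publisher, in the published database:
CREATE TABLE dbo.ReplCanary
(
    CanaryID   INT IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- replication needs a PK
    LastUpdate DATETIME2(0) NOT NULL
);
INSERT INTO dbo.ReplCanary (LastUpdate) VALUES (SYSDATETIME());

-- Add the table to the publication, then have an Agent job run this on a schedule:
UPDATE dbo.ReplCanary SET LastUpdate = SYSDATETIME();

-- On each subscriber, latency is roughly:
SELECT DATEDIFF(SECOND, LastUpdate, SYSDATETIME()) AS seconds_behind
FROM dbo.ReplCanary;
```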

Carlos: Interesting. Yeah, I haven’t heard of that one.

Steve: So with the tracer token, in my experience of using Replication Monitor to go through and inject tracer tokens on a few dozen different publishers, it can be painful clicking through all the screens, and it sounds like you can do the same thing with T-SQL to just inject the tracer token. That’s one I wasn’t aware of, and it seems like it could really speed up the process of troubleshooting what’s working and what’s not working with replication.

Chuck: Yeah, I mean the scripts that I have are pretty useful. They don’t typically go down to the subscriber level though, so I have the typical monitoring of jobs and failures. The standard best practices that all DBAs do for all production databases, you need to do for all the components in replication, including your distribution database and your subscribers. So just the best practices: know when jobs fail, because all the replication jobs are basically kicked off as SQL Agent jobs, so you need to monitor for failures of those and be watching that.
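
A hedged sketch of one way to watch for failed replication agent jobs in msdb; it assumes the jobs sit under the standard REPL- job categories (REPL-LogReader, REPL-Distribution, and so on).

```sql
SELECT j.name AS job_name, c.name AS category, h.run_date, h.run_time, h.message
FROM msdb.dbo.sysjobs        AS j
JOIN msdb.dbo.syscategories  AS c ON c.category_id = j.category_id
JOIN msdb.dbo.sysjobhistory  AS h ON h.job_id = j.job_id
WHERE c.name LIKE N'REPL-%'
  AND h.run_status = 0   -- 0 = failed
  AND h.step_id    = 0   -- job outcome rows only
ORDER BY h.run_date DESC, h.run_time DESC;
```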

Steve: So you mentioned these jobs as SQL Agent jobs, and it seems like there are several of these jobs that are created when you initialize replication. What are all the jobs that are created, and what specifically do they do?

Chuck: So there is the Log Reader Agent job that’s basically sitting on the database you’re doing the publication on, reading the log file. Then there is the Distribution Agent. It sits in one of two locations: at your subscriber or at your distributor. The reason you have that is what we call push and pull scenarios for getting the data to the subscriber. When you do a push, all the work is happening on your distribution server, and when you’re in a pull, it’s done on the subscriber side. So if you have a really gigantic environment and you have this dedicated distribution server, you may want to offload some of the work off of it, so you actually use what are called pull subscriptions. Basically you think of it as, “I’m the subscriber. I’m going to pull the data from the distribution server, so I’m expending all the work to figure out what data I need to grab, and I’ll go to the distributor and grab that information.” And the push scenario is usually in your smaller environments, and by smaller I mean you’ve got 5-10 kind of subscribers. In that scenario the distributor is doing all that work and pushing that data down to the subscribers. So that’s one of the performance tricks: switching from push to pull.
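
A hedged sketch of where the push-versus-pull choice is made when the subscription is created. All server, database, and publication names are placeholders, and several optional parameters (agent credentials, schedules) are omitted.

```sql
-- Push subscription: the Distribution Agent runs at the distributor.
-- Run at the publisher, in the published database:
EXEC sp_addsubscription
    @publication = N'SalesPub', @subscriber = N'SUBSERVER',
    @destination_db = N'SalesReporting', @subscription_type = N'push';
EXEC sp_addpushsubscription_agent
    @publication = N'SalesPub', @subscriber = N'SUBSERVER',
    @subscriber_db = N'SalesReporting';

-- Pull subscription: the Distribution Agent runs at the subscriber.
-- Register it at the publisher with @subscription_type = N'pull',
-- then run these at the subscriber:
EXEC sp_addpullsubscription
    @publisher = N'PUBSERVER', @publisher_db = N'SalesDb', @publication = N'SalesPub';
EXEC sp_addpullsubscription_agent
    @publisher = N'PUBSERVER', @publisher_db = N'SalesDb',
    @publication = N'SalesPub', @distributor = N'DISTSERVER';
```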

Carlos: Right. Now, what’s interesting is we talk about these jobs, and you mentioned reading from the log, right? The log is a sequential operation, the way that it writes and things. Ok, and then you’ve talked about workload. Now, obviously the answer is it depends, but as a generalization, is the reason that replication gets out of whack that it just gets overloaded and you can’t catch up?

Chuck: That is one of the scenarios, and if you’re monitoring properly you should be good. I mean, you could have not picked the proper server for your distribution server, and maybe it’s kind of a poor performer, or doesn’t have enough RAM, or doesn’t have enough CPU. Typically it’s an I/O problem. So you find out quickly that you have tons of transactions you’re trying to replicate and you’re overwhelming your distribution server. In smaller environments you can do the whole setup on one server if you want to. I don’t know why you would ever want to do that, because the whole point is to offload work off of your publication server, but you could overwhelm any of those…

Carlos: So getting back in there, I think you brought up a major point, at least in my mind, and that is the reason that you would use replication: again, the offloading component. I think a lot of times replication gets jumped to because you’re trying to keep two systems in sync or something like that. Again, some of the newer features may be a little bit better at that, but you’re potentially using replication for the wrong reason. And so replication gets installed on that main server, and it just has to do more work to keep up with all of that.

Chuck: Yeah, that’s all true. I think people try to use it because it’s available on all editions. It’s like, “Oh, I can use Standard edition and save a lot of money,” because otherwise you really don’t have many options. I think database mirroring can be done in Standard edition with two nodes, I can’t remember now. But now in 2016 Always On availability groups can do a two-node kind of cluster.

Carlos: Right, though they don’t give you the read-only piece; you can’t read from it, it’s just there for availability.

Steve: So jumping back a minute to when we were talking about the different jobs on the publisher, distributor and subscriber. One of the tips that I heard recently, and I haven’t actually tried this myself, was a suggestion that for those jobs you go in and add a second job schedule, so if something goes wrong and that job fails, there will be a second schedule there that would restart it. Is that something that you’ve heard, or is it a common best practice?

Chuck: No, I mean, replication does retry and retry, and the Distribution Agent will automatically retry when you hit a failure. If you’re automatically scheduling something like a snapshot and it fails, that could be a scenario where you might want these multiple schedules. But it’s like you’ve said before: proper care and feeding of your SQL Server and monitoring those job failures. I mean, I could see it in maybe a snapshot scenario, but I wouldn’t see it in transactional replication because it will just continue to retry. So basically what it does is retry and fail, you know, a minute later retry again and fail, and it will do that forever until you basically fix the problem. And usually the error messages it provides are pretty useful.

Steve:  Okay, great.

Carlos:  Well, so I guess keeping in theme with that, you know, again the sky is the limit, but is there a general rule, percentage-wise, other than just that process getting overloaded? Why does that stuff get out of sync?

Chuck:  Yeah. You know, I think it’s these kinds of random hiccups that happen when you have these bigger environments; something happens. The common error that I see even in my environment is that the row doesn’t exist at the subscriber. Like, well, why doesn’t it exist? So why don’t you just add it?

Carlos:  Right, exactly. Like, you’re subscribing, you should get that row. That’s part of the process, right?

Chuck:  What the heck happened to that row? And so then you have to spend time to figure it out, because everything starts backing up once you have that one error. So you have to do tricks like skipping that one row, or maybe you figure out what row it is. There are sp_repl… commands that can get you information about the transaction that was trying to apply to the subscriber, and you can figure out what row it is and go manually add that row. There’s also a command line tool, tablediff.exe I think. It can basically do a data diff between your subscriber and your publisher, tell you what rows are missing or kind of screwed up, and actually fix them for you.
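
Two hedged troubleshooting sketches related to what Chuck describes; the sequence number, server, database, and table names are placeholders.

```sql
-- 1) Inspect the pending replicated commands for a transaction, from the distribution database.
USE distribution;
EXEC sp_browsereplcmds
    @xact_seqno_start = '0x0000001900001926000400000000',   -- placeholder LSN range
    @xact_seqno_end   = '0x0000001900001926000400000000';

-- 2) The command-line utility Chuck refers to is tablediff.exe, which ships with SQL Server.
--    Run from a command prompt, it compares a publisher and subscriber table, and the -f
--    switch writes a T-SQL "fix" script for the destination, for example:
--    tablediff.exe -sourceserver PUBSERVER -sourcedatabase SalesDb -sourcetable Orders
--                  -destinationserver SUBSERVER -destinationdatabase SalesReporting
--                  -destinationtable Orders -f C:\temp\FixOrders.sql
```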

Carlos:  Interesting. So, I guess to that point, the row being missing on the subscriber is the common one of those that I’ve seen. You’d be okay with us just basically, you know, scripting that out and inserting that row?

Chuck:  Yep, just to make it go past that error, but usually I find it’s a group of commands that somehow didn’t make it to the subscriber. And there are different agent profiles that you can set up on basically all these jobs. The one I typically use in the short term, if I just want to skip errors because I know it’s trying to update a row that got deleted for some reason and doesn’t exist in the publication, is the one that skips errors. So you can set the SQL Agent job to use this other agent profile, as it’s called; I change that agent profile to say skip errors, and it will skip a bunch of errors. But then you really want to use that tablediff tool to figure out what rows are out of sync, and you can just manually update the subscriber. And sometimes it really turns out that somebody on the subscriber actually went in and deleted a row, but in reality I know that’s typically not the case because I make the security read-only on the subscriber side. So it’s like there’s no way it got deleted, but somehow the row is missing, so you do what you have to do to update the subscriber, and you have full rights to do it. You can do whatever you want on the subscriber side and replication won’t stop you. You can even make a setting that says I don’t want to delete any of the data, so when a delete happens on the publication I can turn on a setting in replication that says don’t delete this row, if you want that for your workload.
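
Hedged sketches of the two knobs Chuck mentions; the publication and article names are placeholders.

```sql
-- 1) Skipping consistency errors: there is a built-in Distribution Agent profile,
--    "Continue on data consistency errors", that you can assign in Replication Monitor.
--    Under the covers it adds the -SkipErrors parameter (e.g. -SkipErrors 2601:2627:20598)
--    to the agent, so the agent logs and moves past those errors instead of stopping.

-- 2) Stop replicating DELETEs for one article, so deletes at the publisher
--    do not remove rows at the subscriber:
EXEC sp_changearticle
    @publication = N'SalesPub',
    @article     = N'Orders',
    @property    = N'del_cmd',
    @value       = N'NONE',
    @force_invalidate_snapshot = 1;
```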

Steve:  Okay, something that we talked about, I think at a SQL Saturday recently, and I just want to clarify to make sure I have it right: if a stored procedure was updating a million rows and you didn’t replicate the stored procedure call, it would send all million of those updates across through replication. But you could just push that stored procedure call so it gets called on the other side, and then it’s happening sort of independently on both sides rather than pushing it row by row. Is that right?

Chuck:  Yeah. Yep.

Carlos:  Okay, so now help me kind of connect the dots there, because in my mind I’m thinking replication, that’s by object, right? I go in and I replicate this column with this table and whatnot. How do I set up, or how do I visualize, the process of saying when this stored procedure gets executed, that’s what I want you to take down instead of the updates and deletes to this table?

Chuck:  Yeah. So you basically configure replication and tell it that I want to replicate the calls to these stored procedures. And I honestly haven’t used that use case, because usually in my environment I can’t just describe it as one transaction; it’s related to maybe some other data that’s within the system, or the data comes from something else. I haven’t ever been able to do that, but you basically just tell replication I want to replicate these commands for some stored procedures, and if everything’s right it will use that stored procedure call and replicate that rather than all the individual changes that happen to your table.
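
A hedged sketch of publishing a stored procedure’s execution rather than the row-by-row changes it makes; the publication and procedure names are placeholders.

```sql
EXEC sp_addarticle
    @publication   = N'SalesPub',
    @article       = N'usp_CloseMonth',
    @source_object = N'usp_CloseMonth',
    @source_owner  = N'dbo',
    @type          = N'serializable proc exec';  -- replicate the EXEC call itself
-- 'serializable proc exec' only forwards calls made inside a serializable transaction;
-- plain 'proc exec' forwards every call but is riskier with concurrent changes.
```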

Carlos:  Right. Well, I think that’s a pain point for a lot of people. And going back to the whole subscriber thing and checking when the subscribers have issues: say you have more than a couple, let’s just say you have three. Now I have to troubleshoot, one, the source system; two, my distributor, which you said should also be another system; and then each of the three subscribers. So that’s five servers that I have to go and kind of do something with. And I think that’s maybe one of the pain points around replication: it doesn’t give you a great way to centrally do all of that, does it?

Chuck:  Right. Well, the Replication Monitor allows you to see all those servers and helps you with troubleshooting. You know, I was in an environment where we had like 8 publications and like 30 to 40 subscribers on a dedicated distributor. And we used to have Replication Monitor up on our NOC window, and people would look for errors and stuff happening there, but that was just unrealistic, and that’s why I created those monitoring tools: to look for those kinds of errors, or replication latency where you’ve got tens of thousands of commands that haven’t been pushed to the subscriber. It sends me an alert, and then every 15 minutes I see the trend: if the trend is going down, everything’s good; if the trend is going up, something’s wrong, because replication will continuously retry even though it might have hit a complete error. And if you don’t monitor that well enough, you come to this point of “it’s been broken for 48 hours, how come?” You don’t want to be in that scenario.

Steve:  So with the transactional replication on the subscriber side, if I’m using that as a reporting mechanism to offload some of the work from the publisher, if I need specific indexes on the subscriber to help with whatever the reporting is doing on that side, do I need to add those on the publisher and replicate them over or can they just exist on the subscriber?

Chuck:  Yeah, that’s a great question, because that’s one of the most powerful features of replication: I can say either replicate the indexes that exist on the source publication or don’t replicate them. And the great thing about not replicating them is your publication server has its specific workload and use case, and your subscriber is typically always different. What you can do in the replication setup is use what’s called a post-snapshot script. It’s basically one file that can have any SQL code in there that will then create anything you want on the subscriber side. And it is run after the publication is initialized; initialization is just the method you use to publish all your data and get it to the subscriber. And there are actually two scripts you can do, a pre and a post, around this initialization. So what I typically do is figure out my tuning on the subscriber side, what indexes I need, and I just add them to that script. And that script is, you know, check if it exists already; if not, add this index to this table. And that gets applied after your initialization has happened, so your table’s been populated with BCP (that’s what’s used behind the scenes), and then it creates indexes if you told it to. And if you told it not to do the indexes, then it’ll basically go to your script if you have one. You don’t have to have one. And then you just add them, and I think that’s the super hidden benefit right there for replication.
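
A hedged sketch of the pre/post snapshot hook Chuck describes; the publication name and share path are placeholders, and in practice you would set this when creating the publication (or change it on an existing one).

```sql
-- The script runs at each subscriber after the snapshot (BCP) load finishes,
-- which is a handy place to build subscriber-only indexes.
EXEC sp_addpublication
    @publication          = N'SalesPub',
    @post_snapshot_script = N'\\FILESHARE\repl\PostSnapshot.sql';

-- PostSnapshot.sql might contain something like:
--   IF NOT EXISTS (SELECT 1 FROM sys.indexes
--                  WHERE name = 'IX_Orders_CustomerID'
--                    AND object_id = OBJECT_ID('dbo.Orders'))
--       CREATE NONCLUSTERED INDEX IX_Orders_CustomerID ON dbo.Orders (CustomerID);

-- Whether the source indexes replicate at all is controlled per article through the
-- @schema_option bitmask on sp_addarticle (the 0x40 bit covers nonclustered indexes).
```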

Steve:  Oh yeah. It really seems like that could eliminate a lot of the replication bandwidth by not having to have those indexes in two places and synchronized over.

Chuck:  Yeah. I mean it’s not going to replicate over all your index changes. I think that’s kind of, it’s just whatever happens in the data tables themselves.

Steve:  Right. But on the initialization of it it’s going to have to bring, if you’re replicating the indexes, it’s going to have to bring over the whole index right?

Chuck:  No, it won’t transfer that data. It replicates out the script. Here’s what the index is and it will create that after the fact.

Steve: Okay. So then you’d use this post-script only if you needed, or if the indexes were only needed on the subscriber side.

Chuck:  Yeah.

Steve:  Wow. Lot of great information here. This is good stuff.

Chuck:  Yeah we only scratched the surface.

Steve:  Oh yeah.

Carlos:  Oh yeah, that’s right.

Chuck:  There’s so much. It’s crazy, the amount that you have to get into. And just to carry on that earlier thought of having those three subscribers: you typically would have those behind a load balancer. So I have this common name, like this is my order processing system, and my application stack would just point to this one load-balanced name and get the information it needs from behind the load balancer. And then when I have maintenance, I can say, load balancer, take this one server out. That allows me to create a new snapshot, and then I’ll pause all three of them. So all three of them will become stale, but once I do a new snapshot and a new initialization, everybody tries to grab it and pull it down, so you cause an outage if you don’t pause the distribution agents. So then I work on that one server, get the snapshot applied, and then make sure the indexes are applied. Then I add it back to the load balancer and take my next server out, so it’s up to date. And I may decide that I’m going to take all of the servers out and just use this one that’s now up to date, or I may have one stale and one up to date and one in maintenance. And so that’s kind of what it looks like in a bigger environment.

Steve:  Okay, great. Well shall we wrap it up with the SQL Family questions?

Chuck:  Sure sounds good.

Steve:  So Chuck how did you first get started with SQL server?

Chuck:  Yeah, it’s how it was. You know, I didn’t really pursue databases as a career until that dotcom crash in the early 2000s, when unemployment paid for my MCDBA certification classes. Before that I was completely scared to write SQL when I was doing Access; I avoided it.

Carlos:  Sure.

Chuck:  And somewhere into those classes it was like, oh, this isn’t so bad. So I did some consulting work in SQL 7 in 2000. Then I became a contractor at Microsoft supporting SharePoint, the worldwide internal SharePoint farm, and worked with some guys like Mike Watson, who are kind of well-known players in that space, and they kind of looked to me as the go-to SQL guy. I’m like, “Holy crap, I’ve never done SQL clustering before, and load balancing stuff.” I had only done that MCDBA course before then, and that kind of gave me the bug, so I became a developer after that. And then I had a boss who said, “No, you can’t be the DBA because I don’t think you’ve got the skills to do that, so just keep doing your SQL development stuff.” So I basically quit and found a new job doing SQL development and DBA work. Yeah, that’s a hidden lesson for managers out there: it’s pretty easy to piss off a developer, or an ambitious person looking to learn new things, by saying no. And in my next gig, I worked for 7 years as a DBA and a DBA manager and supported just a gigantic replication farm.

Steve:  So, doing all the things your previous boss told you you weren’t qualified to do?

Chuck:  Yes. And I even got picked as a top 5 finalist for Redgate’s Exceptional DBA Award. You know, it’s kind of the ultimate response to “no, you can’t do that”: okay, sure, I can totally do this, it’s not a problem. So now I’m a data architect, so I’ve kind of gone the spectrum from developer to data architect, but I’m still a DBA as well. It’s tough to find good DBAs.

Steve:  It’s really interesting how telling someone they can’t do it will motivate them to just be wildly successful at doing it somewhere else.

Chuck:  Yeah. And I’m really thankful he did that, because otherwise I don’t think I would be where I am today. Yeah, I love being a DBA. It’s a great job.

Carlos:  Now as great as SQL server is, if there was one thing you could change about SQL server, what would it be?

Chuck:  Only one?

Carlos:  We’re going to take your top one today. Maybe the next time you can, we’ll dig deeper.

Chuck:  Yeah, right. I think that we need to get licensing more in tune with modern hardware specs, because, you know, SQL Express caps you at 1 gigabyte of RAM and a 10 gigabyte database. I mean, you’re kidding me; you might just as well say, “Go to MySQL because we don’t really care about you.”

Steve:  Yes. There are very few environments where SQL Express actually is a good fit for you.

Chuck:  It’s useful for some things, right. I mean, you can even use it in a replication environment, which is kind of cool. But my phone’s got more power and capability than SQL Express does. And then Standard edition’s kind of the same way. I mean, you’re limited to, what is it, 64 gigs of RAM, and they say you can use more with the heap space now in the later editions. But still, it’s like it was designed by untechnical marketing people, with no reason for where these numbers come from. So if we want to limit it to one thing to change, I think we’ve got to fix that.

Carlos:  Yes.

Steve:  Okay.

Episode 96: Building Community Tools

Have you ever created something and wanted to share it with others with the hope they will contribute and make it better? Sites like the now-defunct CodePlex and GitHub are full of functionality people want to share and work on with others; however, many projects remain dormant.  What are the characteristics that create an environment where people want to contribute to your project?  Steve and I discuss a few ideas and we interview two very successful community project leaders, Brent Ozar and Chrissy LeMaire, about how they got started and what it takes to put together a good community project.

 Episode Quote

“We don’t know that building community tools is for everyone . . . but if you have a problem that you’ve solved and you are committed to it, and you like the idea, I’ll say go for it.”

SQL Server in the News

https://azure.microsoft.com/en-us/services/analysis-services/


Episode 95: Polybase

Big data is a term we have been hearing frequently as of late, and this might cause some concern for us in the SQL Server space.  Microsoft has introduced some new functionality to help connect different data stores with PolyBase.  We are happy to have Kevin Feasel from ChannelAdvisor back with us, and Kevin will discuss some of the basics around what PolyBase does. We’ll be discussing a lot about integrations using PolyBase, specifically with Hadoop and Azure Blob Storage. We also touch on some of the technologies that will be supported in the future.

For those looking at implementing both structured and unstructured data stores, PolyBase will be a way to help bring these environments together. Kevin gives us a great overview and we think you will enjoy this episode.

PolyBase

 Episode Quote

“PolyBase gives you this integration and it’s opening this door to possibly getting rid of link servers.”

“PolyBase simplifies that a lot for us by making an assumption that there is a consistent definition for each row.”

“Learn something new… You learn something the first time, you can learn something again.”

Listen to Learn

– What is PolyBase?
– Technologies supported by PolyBase
– PolyBase integration with different data sources
– Some thoughts around which teams are going to own which pieces of the project
– How Hadoop integrators are responding to PolyBase

Kevin on Twitter
Polybase Guide

About Kevin Feasel

Kevin is a database administrator for ChannelAdvisor and the leader of the PASS Chapter in the Raleigh NC area.  Since he was last on the podcast, Kevin has been awarded the Microsoft MVP and will be a speaker at the Compañero Conference.  He also enjoys old German films.

Transcription: Polybase

Kevin: My name is Kevin Feasel. I am a Data Platform MVP. I’m also a manager of a predictive analytics team here in Durham, North Carolina. I’m extremely pleased to be able to speak at Compañero Con, even though I can’t pronounce it. I’m going to be speaking on a couple of topics. One of them is security: really getting an understanding of network security and of what a database administrator can do to help secure a SQL Server instance. I’m also really looking forward to talking about big data solutions, basically how do I get started in that. I’m a database administrator, I’m the only database administrator at this company, and somebody is coming to me talking about big data; where do I start? What do I start looking at? What actually is the benefit? What kinds of workloads would work well under this and which ones don’t? And getting some of the ideas of what’s happening in the industry and seeing how these different technologies are evolving and turning into a full ecosystem. Finally, showing how that ecosystem integrates with SQL Server.

Carlos: Kevin, our all-time podcast episode extraordinaire. Welcome back for another episode.

Kevin: Thank you! It’s good to defend the title.

Carlos: Yes, thank you for coming and talking with us. One of the things, and one of the reasons we continue to have you on, is you’re doing lots of different interesting things, and as database administrators we’ve been hearing this idea of big data for a little while; it’s kind of been at the door. Lots of people, even from a PASS perspective, have opened the doors to analytics to kind of join those two worlds. But for a lot of us it’s still kind of an unknown entity, it’s different technology, and we think that we have something here that will kind of save the day, if you will, in a sense. And so our topic today is PolyBase, and we wanted to talk with you about it. You’ve been talking about it, and presenting on it, and working with it, so why don’t you give us the tour of PolyBase? What is it and why would we be interested in it?

Kevin: Sure, here’s the nickel tour version. PolyBase initially came about, I believe it was actually first introduced, in 2010 as part of SQL Server Parallel Data Warehouse edition, which later became APS, otherwise known as Extremely Expensive edition. Enterprise is expensive; PDW/APS, extremely expensive. In SQL Server 2016 this was brought down to the masses, or at least the masses who could afford Enterprise edition. It’s been around for a few years, but 2016 feels like the first version for the rest of us who didn’t have a chance to play with really expensive hardware. What PolyBase does at a really high level is allow you to integrate with other data sources. So before people start thinking, “Oh no, it is linked servers all over again,” it’s not linked servers. It’s not that bad. So as of today PolyBase supports a few different links: you can connect to a Hadoop cluster, you can connect to Azure blob storage, and you can use PolyBase to migrate data from Azure blob storage into Azure SQL Data Warehouse. At PASS Summit 2016 there were a couple of interesting keynotes where they talked about expanding PolyBase beyond Hadoop and Azure blob storage, looking into Elasticsearch, MongoDB, Teradata, Oracle and other sources as well.

Carlos: Wow, so basically we’re going to have the ability, through SQL Server Management Studio, to be able to interact with, and move data to and from, all of these different systems that you have mentioned?

Kevin: Yes, and be able to query it using just regular T-SQL. So you create what’s called an external table. It’s a concept where the data lives on the source system, like the Hadoop cluster. The data is over in Hadoop, but when you query that table, SELECT * FROM my external table, it’s going to go over, request data from the Hadoop cluster, and pull that data back into SQL Server where you can treat it like it just came from a local table.

Carlos: Got you, so is it going to store that on like a time basis, so that, you know, I run my SELECT * and then 10 minutes later Steve runs his? Is it going to pull that data back over again, or is there some management now that we have to think about because the data is now on my SQL Server?

Kevin: So the data doesn’t really get persisted to SQL Server. It’s an external table, meaning that it will live in blob storage or on your Hadoop cluster. The mechanism that PolyBase uses to allow this to work is that it will pull the data into SQL Server into a temp table, but it’s not a temp table that you should know about as a developer of a T-SQL query. It’s kind of a behind-the-scenes temp table that then acts as the table that you’re going to query against. So you query dbo.myexternaltable; behind the scenes there is a secret temp table that has the form and structure of that external table, and the data gets pulled in, collected and then processed as though it were local. But once it’s done, it’s gone.

Steve: So then that process sounds very similar to the underlying workings behind when you run a query over a linked server, where it issues a command on the other side, brings the results back, and basically stores them in a hidden format so you can use them in the rest of the query locally. And I guess I’m just trying to understand the correlation there. Is there a big difference in how that’s done versus a linked server?

Kevin: So there is one major difference, and that is the concept of predicate pushdown. The idea here is, let’s say that I have a petabyte of data in this Hadoop cluster, a petabyte of data in this folder, and I want to be able to query it. I’m sending a query where maybe I just want a few thousand rows, or I want to aggregate the data in such a way that I don’t get a petabyte back; I just get the few thousand rows I need.

Carlos: Hopefully, because if you’re returning a petabyte of data you’re going to be in trouble.

Kevin: Yeah. I don’t have a petabyte of data on my SQL Server instances. So I write this query, and in my WHERE clause maybe I do summations, GROUP BYs, HAVINGs. All of that predicate will get sent to the Hadoop cluster, and on the Hadoop cluster side PolyBase instigates a MapReduce job or set of MapReduce jobs to perform the operations that you wrote in T-SQL. It generates all of the jobs and creates the data set that comes back and gets pulled into SQL Server. With a linked server, if I were doing a linked server to another SQL instance, well, another SQL instance is a special case, but if I were doing it to Oracle, I have to pull the whole data set back, or if I’m querying out to Hive I have to pull the whole data set back, and then any filters get applied. So predicate pushdown is what lets you get back the rows that you need, only the rows that you need, and it gets around that whole linked server problem of, oh yeah, I’m querying a billion rows, I’ll see you tomorrow.
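
A hedged illustration of what a pushdown-friendly query might look like; the table and column names are placeholders.

```sql
-- The WHERE and GROUP BY work can be turned into MapReduce jobs on the Hadoop side,
-- so only the small aggregated result travels back to SQL Server.
SELECT   Age, COUNT(*) AS People
FROM     dbo.PeopleExternal
WHERE    Age = 14
GROUP BY Age
OPTION (FORCE EXTERNALPUSHDOWN);   -- there is also DISABLE EXTERNALPUSHDOWN
```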

Steve: Sure, very interesting. I’ve heard some people speculate that linked servers are dead or will be going away because of what we can do with PolyBase. Do you think that that’s a fair assessment?

Kevin: I am crossing my fingers hoping that this is so. As soon as they announced at the 2016 PASS Summit what PolyBase is going to do in the future, I got really excited because I thought, “Wait, what if I could connect to another SQL Server instance?” And there is one extra bit of PolyBase that I haven’t talked about yet. That is the concept of head nodes versus compute nodes. There is this concept in massively parallel processing that you have a head node; this is the orchestrator, the server that knows what queries are supposed to come in and out, and then it passes off details to different compute nodes. In Hadoop you have a name node and you have a bunch of data nodes. Over in PolyBase there is actually a similar infrastructure, so there is a head node, which is your SQL Server instance, which must be Enterprise edition, and it controls the jobs. But you can have different compute nodes. They call it a scale-out cluster. These are Standard edition SQL Server instances that I can have sitting there doing work, connecting to the different data nodes on the Hadoop cluster, pulling data back and getting aggregated data back to my head node. So unlike a linked server, where I have to pull all the data over to my one instance, I can now have several PolyBase servers getting data, aggregating it locally, and sending that data, aggregated as finely as they could, up to the head node, where the head node finishes the aggregation and presents the result to the end user.

Steve: Very interesting.

Carlos: Yeah, kind of a scale-out approach. Now I guess at this point it might be worth going back and talking about some of the things that I need to put in place. You mentioned this architecture perspective: I have an Enterprise version, I can have Standard versions, but let’s just scale it down a little bit. I just have one node and I want to start using PolyBase. What are some of the things that I need to create, or steps that I would take, in order to set that up?

Kevin: Ok, so let’s take the easiest example, that’s connecting to Azure blob storage. On my SQL server instance, I have to install PolyBase. That’s part of the setup; there is a little checkbox you can select. But in order to install PolyBase you must install the Oracle Java Runtime Environment.

Carlos: Yes, I cheated, and I was looking at the docs here and I saw that and I thought, “What in the world!” It’s like sleeping with the enemy, right?

Steve: So just to recap then, if I want to query against Azure blob storage with PolyBase, when I install SQL Server I need to also install, and this comes up as part of the install, the Oracle components for the Oracle Java Runtime.

Kevin: Correct. So you install those. There are a couple of configuration steps that are involved, like there is a setting in sp_configure that allows for external queries. Turn all the stuff on; there are configuration guides that can help you with that. Once you’ve got that done, what you do is create three things. The first thing that you want to create is an external data source. The external data source says, this is what I’m connecting to, this is the resource location. If I’m connecting to Azure blob storage there is a type; actually, I think for Azure blob storage you just use a type of Hadoop. If you use a Hadoop cluster you just use a type of Hadoop. If you’re writing an Azure elastic scale query, there is a different data source type for that, but that’s a little bit beyond my ken; I haven’t written those yet. Ok, so you create this data source. This data source says, over here is the WASB address, the Azure blob storage location of the folder or file I’m interested in. So, actually, let me, I may have to rephrase that because I’m now looking at the, oops. Ok, let me, sorry Julien, going to have to cut this part just a little bit because I just said something wrong. Now, I could just keep going and make it sound like I’m an idiot. That wouldn’t be the first time, admittedly, but.

Carlos: No, that’s fine. We’ll make it right.

Kevin: Ok, so let me start over. So the next thing that you do after you’ve configured PolyBase is you want to create an external data source. For Azure blob storage we create this external data source that points to the WASB address of your blob storage container location. So you’ll point to the container and the account name.

Steve: URL right?

Kevin: Yeah, that is a WASB or WASBS address. It’s an Azure blob storage location. You’ll include your credentials, because if it’s a secure blob you’ll need to pass in credentials. So you create this data source. The next thing you want to do is create an external file format. That file format says, any files that are on a source that I specify are going to follow this format. There are a few different formats. One of them is delimited text, so just text, maybe comma delimited or semicolon delimited or however you’ve delimited your text. You can use other formats as well. I would recommend starting out with just delimited text; that is the easiest to understand, and you can grab a file and look at it. But when you start asking about better performance, one of the better formats is ORC, which is an optimized row columnar format that Hive uses to store data. It’s much more efficient for querying, especially aggregating data, but you can just use flat files.
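
A hedged sketch of the first two objects for an Azure blob storage source; the storage account, container, and key are placeholders.

```sql
-- A database master key is needed before a database scoped credential can be created.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';

CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'polybaseuser', SECRET = '<storage account access key>';

-- The external data source: for blob storage the type is HADOOP and the location is a wasbs:// address.
CREATE EXTERNAL DATA SOURCE AzureStorage
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://mycontainer@mystorageaccount.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);

-- The external file format: delimited text is the simplest starting point.
CREATE EXTERNAL FILE FORMAT CsvFileFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', USE_TYPE_DEFAULT = TRUE)
);
```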

Carlos: So, knuckle-dragging Neanderthal that I am, how am I supposed to choose what kind of file I need to use? If I don’t know anything about Hadoop, how would I choose that?

Kevin: Yeah, absolutely. So, knuckle-dragger: delimited file. Keep it easy for yourself. Once you get past that, once you kind of get past the doorway and you say, ok, now how do I do better? You have to think about whether your data is more for aggregation, like what you would find in a warehouse table. In that case, I would use ORC. If I’m storing the data and it’s more of a row-store style of data, I would use Parquet. There are a couple of other formats as well, but those are the main two that are really supported within PolyBase.

Carlos: Well now, in that determination, say again I’m going to use the delimited file. I start, I don’t know, three months in, right, I start writing queries. There are processes that I now have in place, and I decide, “Hey, I think I can do better. I want to change the format.” Am I going to have to start redoing my queries, or what’s all involved if I want to change that format down the line?

Kevin: Great question. What would you have to do? Let’s say you have delimited files now. You’ve created an external file format that’s of delimited type. Later on you say, well, I am actually storing this as Parquet, so you create an external file format that’s Parquet. And now we get to the last portion of PolyBase, the external table. The table has a two-part name; it looks like any other table when you query it, dbo.something or maybe external.mytable. You have the column definitions, so all of the attributes in your table, and at the bottom, the part that’s a little different, there is a WITH clause. Inside that WITH clause you specify the location of your data, so that would be the specific file or folder that you want to point to, the data source, and the file format.
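
A hedged sketch of the external table itself, reusing the placeholder data source and file format from the earlier sketch; the folder path and columns are illustrative.

```sql
CREATE EXTERNAL TABLE dbo.SalesExternal
(
    SaleID   INT,
    Age      INT,
    Amount   DECIMAL(10, 2),
    SaleDate DATE
)
WITH (
    LOCATION    = '/sales/2017/',      -- folder (or file) inside the container
    DATA_SOURCE = AzureStorage,
    FILE_FORMAT = CsvFileFormat
);

-- Queried like any local table:
SELECT COUNT(*) FROM dbo.SalesExternal WHERE Age = 14;
```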

Carlos: Got it. So if I wanted to change file formats, I’m creating a new external table.

Kevin: Yeah or you just drop and recreate the one that’s there. The external table doesn’t have any data. It just has some metadata around it. So if you have a few second downtime you can drop that table, recreate the table, use the new format, maybe point to a new folder that has the data in a different format. All the nasty work of converting those files getting them into the other format, yeah you still have to do that stuff, but you can do that as a back fill process or you can do that kind of off to the side and just switch when you’re done. That way you don’t have to update any of your procedures or any calling code.

Carlos: Got you, ok, so that’s nice.

Steve: So then, when you say the external table doesn’t really have anything more than just the definition there, that’s the definition that sits on your SQL Server and defines where it’s going to go and get that data, for instance out of Azure blob storage. So it’s really just a pointer off to that data, and if you switch it around and point it to a different format of file, you have to give it the format type appropriately.

Kevin: Yeah, so the external table, yeah it’s just metadata. It’s just some basic information.

Steve: Ok, so then with that it’s pointing to a file in Azure blob storage and can you just start out with an empty file and then start filling in with data from there or does that file in Azure blob storage have to have been created somewhere else to meet those formats?

Kevin: That’s a really good question. So you have the ability to insert data into blob storage or into Hadoop. There is another configuration option you have to turn on to allow for inserting, and once you do, each insert operation you do will create some number of files in blob storage or in Hadoop. So you have to have a folder as your write location. But every time you insert, maybe you’re inserting once a month, you’re taking last month’s financial data, all the individual transactions, and you’re writing it over to blob storage for long-term storage. That insert generates 8 files over in Azure blob storage, and then the data is there. You can query it just like it was always there. But you cannot update that data from PolyBase, and you cannot delete that data from PolyBase.
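
A hedged sketch of writing to an external table; the source table name is a placeholder, and the export option has to be enabled first.

```sql
EXEC sp_configure 'allow polybase export', 1;
RECONFIGURE;

-- Each INSERT produces some number of files under the external table's LOCATION folder.
INSERT INTO dbo.SalesExternal (SaleID, Age, Amount, SaleDate)
SELECT SaleID, Age, Amount, SaleDate
FROM   dbo.SalesArchiveCandidates;   -- placeholder local table

-- UPDATE and DELETE against an external table are not supported.
```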

Carlos: Interesting, so now obviously it's going to vary from place to place, but from a setup perspective let's say, right, again I'm the only database administrator in my organization, or I'm not familiar with Hadoop or these others. Well, I guess when the other databases get onboarded then there will be more access, right? But when I think from a big data perspective, generally there's going to be another team, maybe a vendor comes in, installs Hadoop, starts loading data, things like that. We as database administrators are going to create all of those components that you just talked about, but are the Hadoop vendors familiar with PolyBase? Are we talking the same language here, or is this still a very SQL Server centric idea? Does that make sense?

Kevin: I would say that vendors, they're not really going to know a lot of the PolyBase details. They're probably not going to be familiar enough with PolyBase itself to do it. I've had some discussions with people who worked at Hadoop vendors and they're very interested in the concept, but there is not a lot of internalized information around there. These are typically not people who spend a lot of time with SQL Server, so they don't necessarily know how it works, how to set it up, what the positive and negative aspects are, how you can shoot yourself in the foot.

Carlos: Well, so speaking of that so what are the ways we can shoot ourselves in the foot?

Kevin: Oh, you have to go and ask that. There are some assumptions that are built into the way that PolyBase works today. This is not a critique of the PolyBase team, of the developers, of the PMs. This is not at all a critique aimed at them. I like you guys still, don't worry. One issue that you can run into is, let's say you have just text data and your file has new lines in it, but the new lines don't represent new rows of data. Maybe it's a free-form text field where a person typed in new lines to symbolize a new paragraph. Well, PolyBase doesn't understand the idea of ignoring a new line unless you told it that it's a new line. It will just pick up that new line and say, oh yeah, you've got a new line here.

Carlos: A new record basically.

Kevin: Right. There are some assumptions that are built in. You can also burn yourself by defining your result set: you create that external table and maybe you define a value as an integer. Well, if the value comes back as a string because some of the data is malformed coming in, then those rows will be rejected, as they should be. You're going from a non-structured or a semi-structured system into a very structured system in SQL Server. That semi-structured system is okay with you throwing whatever garbage you want into this file, but you have to define structure when you pull it out. Historically, on the Hadoop side that structure was defined in the map and reduce phases, so MapReduce. It was defined by the developer putting together the data in such a way that the developer understood what this data point signifies. PolyBase simplifies that a lot for us by making an assumption that there is a consistent definition for each row, so we say an integer age is the first value. Well, it's going to assume that there is an integer value there and it's going to populate age with that. If maybe every 20th row we have something that's totally different, maybe instead of age it is eye color because something weird happened with our data, well, every 20th row gets rejected. The way you can shoot yourself in the foot: let's go back to you having a few billion rows of data that you want to pull over. Maybe you want to get just everywhere where the person is exactly 14 years of age. So you're scanning through this data and every 20th row, instead of it being integer age, it's actually a string. Every one of those rows gets rejected. There is a cutoff for the number of records that you are allowed to reject before just failing the query. That cutoff can be 0 or it can be as many as you want; it can be a percentage or a numeric value. So let's say 1 billion rows and you have a cutoff of 5,000. You're going to go through quite a few records to get 5,000 rejected rows. Once rejection happens, once failure occurs, the entire transaction rolls back and you don't get to see the data that was there already. It's rolled back. There was an error.
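
[Editor's note: the rejection cutoff lives on the external table definition. A minimal sketch with hypothetical names, showing both flavors of the threshold Kevin describes.]

CREATE EXTERNAL TABLE external.People
(
    PersonId INT,
    Age INT            -- rows where this column holds a string get rejected
)
WITH (
    LOCATION = '/data/people/',
    DATA_SOURCE = MyHadoopDataSource,
    FILE_FORMAT = DelimitedFormat,
    REJECT_TYPE = VALUE,   -- fail after an absolute number of rejected rows
    REJECT_VALUE = 5000    -- the whole query rolls back once 5,000 rows are rejected
);

-- Alternatively, reject by percentage; REJECT_SAMPLE_VALUE controls how many
-- rows are read before each percentage check:
-- REJECT_TYPE = PERCENTAGE, REJECT_VALUE = 5, REJECT_SAMPLE_VALUE = 1000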

Carlos: Oh, got you, that’s right, yeah.

Kevin: So you may be sitting there for an hour waiting for this data to process and it comes back and it fails.

Carlos: Yes, so you might almost think about it, in a sense, again not trying to discount Hadoop. At least in my mind, knuckle-dragger that I am, I think about that almost like an Excel file, right? I want to load it into something that can accept it and then let me take care of finalizing any of that and looking at rejected rows and things like that. Almost like an ETL process, right?

Kevin: Sure. This is a fairly common pattern in the Hadoop ecosystem as well where, okay, we have raw data coming in, it's there, we put it into the data lake. Ideally the data lake has a lot of nice clean data; in reality it's more like a data swamp. It's where you throw in a bunch of old stuff. You've got mattresses in there, just all kinds of dirtiness.

Carlos: Fish with three eyes.

Kevin: Yeah, exactly. And so you pull that stuff out and you try to clean it up in some process. Usually it's going to be a Hadoop process. Maybe that's a Spark job or a MapReduce job that scrubs this data, tries to give it some semblance of sense, and then writes it out to another directory in more of a structured format. That way you can read it in Hive, which is SQL for Hadoop. You can read it with Spark SQL, SQL for Spark, or you could read it with PolyBase, SQL for SQL.

Carlos: Got you, so that kind of takes me back to that idea again of who's working with whom, and it almost sounds like if we wanted to we could push some of that back: hey guys, can we work on this MapReduce? Is that a fair question to ask, hey, can we work on this so that when the data comes back it gets cleansed before I see it? Or do I still need to, as a SQL Server person, assume all responsibility for that kind of thing?

Kevin: I think that depends on your environment. It depends on relative levels of familiarity. But personally my expectation would be that if you are, say, using SQL Server as the engine to see final results, then I believe it makes perfect sense to ask the people on the Hadoop side, "Hey guys, give me the data in a format that I can pull easily." So for example, maybe we are reading a lot of data coming in from IoT devices. We have Kafka set up; Kafka is a big distributed message broker. It's a really fascinating thing, and we're getting tremendous numbers of messages streaming in to our Hadoop cluster. We're cleaning up those results, we're storing the results, and maybe we have some aggregations that we're doing to show hourly results by device type, and then we load that data into a file that PolyBase can read. As part of an ETL process you may pull that data over to SQL Server and persist it in SQL Server. So a query like SELECT from your external table, INSERT into the real SQL Server table, and you're keeping a smaller, streamlined data set that you can use to populate a Power BI grid or a web application. In that scenario, personally I'd argue that yeah, the Hadoop-side people probably should be doing most of the cleanup work. If you are on both sides, it becomes more a question of what am I more comfortable doing; sometimes, if the data's relatively clean to begin with, or if we're willing to accept a certain level of failure, take it, bring it to SQL Server, and I can do really cool things in SQL Server.
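
[Editor's note: a sketch of the persist step Kevin describes, selecting from a hypothetical external table over the Hadoop output and inserting into a regular SQL Server table.]

-- Pull the pre-aggregated hourly results from Hadoop/blob storage into a
-- local table that Power BI or a web application can query cheaply.
INSERT INTO dbo.DeviceHourlySummary (DeviceType, HourBucket, MessageCount)
SELECT DeviceType, HourBucket, MessageCount
FROM external.DeviceHourlySummary
WHERE HourBucket >= DATEADD(DAY, -1, SYSUTCDATETIME());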

Carlos:  So it kind of goes back right to the adage of knowing your data, right?

Kevin:  Absolutely.

Carlos:  Being familiar with it and then making a decision based on that.

Steve: So then back to that example with the age and putting that into an integer column in the table definition. There are lots of things that could be valid for ages in there. You could have "6 mo." to represent someone who's six months old, but then obviously when that gets pulled down and tries to go into an integer, it's got text data in there and it's not going to work. So do you find that people sort of shy away from those restrictive types in their table definitions and maybe just leave it as something that's more open, like a varchar(max) or something like that? Or do you find that people go through the battle of cleaning it up or filtering it ahead of time?

Kevin: Unfortunately, probably more the former. It's more of, well, it's a string, every string works, so we will pull that in as a string and then we'll clean it up here. That is a downside: with a lot of ETL tools I can take a data element and make decisions based off of what that element looks like. Like "6 mo.", I can do a substring, I can parse out, is there a MO or YR or some known value here, and use conditional logic to convert that into something that is consistent across the board. PolyBase isn't going to give you that. It gives you the easy way of pulling data, but it doesn't do the transformations for you.
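
[Editor's note: a sketch of the "land it as a string, clean it up in T-SQL" approach, with a hypothetical AgeRaw column defined as VARCHAR in the external table.]

SELECT
    PersonId,
    CASE
        WHEN AgeRaw LIKE '%mo%' THEN 0   -- "6 mo." rounds down to zero full years
        -- Otherwise keep only the leading digits; NULL rather than an error for garbage.
        ELSE TRY_CONVERT(INT, NULLIF(LEFT(AgeRaw, PATINDEX('%[^0-9]%', AgeRaw + ' ') - 1), ''))
    END AS AgeYears
FROM external.PeopleRaw;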

Steve: Okay. So another area that I've thought a little bit about, and I know this is sort of jumping back to the whole linked server example, is that when you're running a query against an old-school linked server, whatever's going on on the other side really gets hidden from execution plans. It's just blindly calling something on the other side across the linked server, and your execution plan doesn't give you any details other than it was waiting on something on the other side. Now, is there an option for seeing execution plans when you're using PolyBase, to get a better understanding, if a query's taking some time, of where that time is being taken when it's connecting out to Hadoop or Azure blob storage?

Kevin: Yeah. The short answer is yes. The long answer is yes, if you look at the XML. So you look at the query plan XML, and it will give you some details, including a remote query, which is XML inside of the XML. So you have to deserialize the XML, decode the XML, and you'll be able to see what the remote operation looks like. It gives you a few indicators of what's happening; it'll show you the individual steps. Also, there are several dynamic management views that are exposed for PolyBase, and those DMVs will show you a lot of the same information. They'll show you the individual steps that occur for this MapReduce process or for the data retrieval process.
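
[Editor's note: two of the PolyBase DMVs Kevin is alluding to; this query lists recent distributed requests and their individual steps.]

SELECT
    r.execution_id,
    r.status,
    r.total_elapsed_time,
    s.step_index,
    s.operation_type,
    s.location_type,     -- Head, Compute, or DMS
    s.command
FROM sys.dm_exec_distributed_requests AS r
JOIN sys.dm_exec_distributed_request_steps AS s
    ON r.execution_id = s.execution_id
ORDER BY r.end_time DESC, s.step_index;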

Carlos: So, a very interesting topic, and we'll let you give last thoughts here, but one of the things that I feel confident about, or happy about, is that while there are still some unknowns here, right, having Hadoop in my environment or being able to connect to it, Azure blob storage, all these other things that are coming down the pipe, at least I have a tool that I can use to integrate with some of these things on my own turf. And it's not completely foreign, such that I have to go and pick up new technologies right away.

Kevin:  Yes. That’s how I’m thinking of it. This is why I like it so much. This is why, honest I think this was the best feature in SQL Server 2016. A lot of people are going to say query store is the best feature. Query Store is an awesome feature but PolyBase gives you this integration and it’s opening this door to possibly getting rid of link servers. It’s opening a door to distributing queries, distributing those really expensive SQL server queries. Kind of like what you do in Azure SQL data warehouse, hoping that maybe we get something like that locally.

Steve: So I know you talked about how PolyBase is perhaps one of the best features in SQL Server 2016. I know that SQL Server 2017 Community Technology Preview 2, I believe, just came out recently. Is there anything in there that's new with PolyBase that you know about?

Kevin:  Nothing new with PolyBase.

Carlos: Got you.

Steve:  Okay.

Kevin:  There’s a whole bunch of really cool stuff I’m excited about but.

Carlos: Is it a fair question to think, or assume, that it will be supported in the Linux version as well?

Carlos:  Because it’s a core feature if you will, I know they’ve been working and talking with Travis, the PM over there for the Linux migration. That’s what they’ve been trying to accommodate. Again, listening to the AMP conference or event or whatever it was called. They did mention some additional
functionality that would be in the Linux version. I don’t remember them specifically calling up PolyBase but, you know, I had to imagine that it will be there even if it’s not there on day one.

Kevin: The answer that I think is safe to give is that in today's CTP, CTP 2 for SQL Server on Linux, there is no PolyBase support, but there is no reason that PolyBase cannot be there.

Carlos: Got you. There you go. But again, we did mention that this is ultimately an Enterprise-only feature, right?

Kevin: Yeah, for the head node it has to be Enterprise edition. I think even with SQL Server 2016 SP1, it's still required to be Enterprise edition for the head node.

Carlos: Okay, got you. Yeah, I feel like PolyBase was in the list of things that they made available in the lower editions, but I'm not sure if that includes the head node or not.

Kevin:  Yeah, I know that the compute node was available in Standard edition but I’m not sure.

Steve: Yep. So given that it's been a little while since 2016 came out, around a year roughly, and with PolyBase having been mainstream and available since then, do you see that a lot of people are actually adopting this and using it in production environments, or do you see more people just sort of experimenting and trying things out at this point?

Kevin:  It’s more of experimentation. I don’t know of many companies that are doing it. The way that I would put it is okay well you have to have SQL server 2016 which already cuts out large slice with companies. You have to have Enterprise edition and you have to have Hadoop cluster or you could use Azure Blob Storage and get value out of that way, but this is going to be a fairly narrow segment of the population even today.

Carlos:  Got you. Yeah, make sense.

Steve: Well, perhaps after this podcast more people will give it a try.

Kevin:  Yeah, I hope so.

Carlos:  That’s right. Compañeros if you are using PolyBase after what you’ve heard here today, I want to know about it. We’re going to report that to Microsoft. Let them know you heard it here first folks. Okay, so I know you’ve been on the show here before, Kevin, but we’re going to still go through SQL family.

Kevin:  Excellent.

Carlos:  Can we do it?

Kevin:  I think so. I may make up new answers.

Carlos: Well, we do have a couple of new questions that I think have changed since the last time you were an individual guest.

Carlos:  Okay. So the first question is how did you get started with SQL server?

Kevin: I got started as a web developer. It was about a decade ago and I was an ASP.NET Web Forms developer. It was my first real job, and I was the person who was least afraid of databases. I had written SQL queries before, and we had a need for database administration, so I…

Carlos:  How hard could it be?

Kevin: Yeah, pretty much. Like, hey, why is the server failing? Oh, it's because it's out of disk space.

Steve: Alright.

Carlos:  There you go, and now you know the rest of the story.

Steve:  So if you could change one thing about SQL server, what would it be?

Kevin:  That’s a good question because everything that I think of tends to happen which is really cool, I like that. So last time around I said I want PolyBase to support Spark, and I’d like to see that happen still. I’ve wanted Python support for machine learning within R services which is now machine learning services. And we just got that so that’s really cool. The thing that I want most right now is a really good client for Linux. So I want Management Studio for Linux or something Management Studio ask for Linux that does maybe like 70% of what SSMS does.

Carlos:  Interesting. In all flavors of Linux or do you have a particular flavor that you’re interested in?

Kevin:  I’m kind of okay with pretty much any flavor. I mean you can get it to work. Nowadays, I use Ubuntu or Elementary a lot. Previously I’ve done a lot of Redhat. I go back to Mandrake for people in the know.

Steve: Right. Yeah, I know recently we heard that, what was it, sqlcmd was going to be available on the Mac, and that was a big move. And I think we're a long way off from Management Studio being on other platforms. But who knows, I could be wrong there.

Kevin:  Yeah. I’m looking forward to whatever they are able to provide.

Steve:  No, I know that’d be certainly cool.

Carlos: Although, we do have a request in to the PM for SQL Server Management Studio. We haven't quite been able to get them on the show just yet, but when we do, we'll ask them that question.

Kevin:  Put them on the spot.

Carlos:  That’s right. Okay, so best piece of career advice you’ve received.

Kevin:  I’m going to flip this on its head, best career advice I can give.

Carlos:  Well, here we go.

Kevin: Learn something new. Especially if you're in a shop where you're on SQL Server 2005, take some of your own time and learn something new. It doesn't matter that much what it is, but expand out just a little bit. It could be features, it could be new versions of SQL Server, it could be let's learn a new language, let's learn a new part of the stack. But don't get caught in this one little part and find out someday, oh look, your job has been automated away and you've lost all of those skills for learning. You learned something the first time; you can learn something again. So that would be my advice.

Carlos:  And that is why we’re going to have you as a speaker at the Companero Conference. So folks if you want to hang out more with Kevin and learn all of his wisdom, you can come to the conference and hang out with us.

Kevin: Wisdom and $5 gets you a cup of coffee.

Steve:  And on to our last SQL family question, if you could have one superhero power, what would it be and why would you want it?

Kevin:  We’re getting close to episode 100. Nobody else has ever answered that this way. I want phase walking. I want the shadow cat kitty pride be able to phase through walls, phase through objects. Nobody else has answered that so either I’m completely insane and picking the wrong power or I’m the head of the curve. I’ll let the audience decide.

Steve:  Or it could be you’ve just answered the question several times before as well and you’ve had more time to think about it too.

Kevin:  That is also possible.

Steve:  Alright, very good.

Carlos:  Awesome, Kevin. Thanks again for stopping by. We always enjoy it.

Kevin:  I’m glad to come here.