Carlos: Compañeros! Welcome to another edition! Episode 194, as we inch closer to that very exciting episode, at least I hope it’ll be exciting Episode 200. We are joined today by the usual cohorts, Eugene Meidinger.
Carlos: And Kevin Feasel.
Carlos: And we should say, compañeros, that, you know, again we do everything in our power, and I guess when I say we, I mean Eugene and Kevin do everything in their power to make sure this episode or these podcasts continue, as he’s checking in from his vacation in sunny Florida.
Eugene: Well, it’s a working vacation. I’m still getting billable work done, but yeah, I could be outside in the 90 degree weather right now.
Carlos: Yes, so we will make the conversation count for you, Eugene.
Eugene: I appreciate it.
Carlos: So, our topic today, based on a listener question, let’s see, what did we say, George Lopez?
Kevin: We said Fernandez, but I’m sure George Lopez also sent in questions, cause we get like thousands of these questions every day.
Carlos: That’s right.
Kevin: There’s literally a mail bag that gets delivered to Carlos’s house. It’s a giant burlap sack full of listener questions. I don’t know why Eugene prints them all out and sends them in the mail, but he does.
Eugene: You know, it just reminds me of my youth.
Carlos: Yeah, yes, I like getting letters. So, after recording Episode 180, a listener came in and said, “hey, thanks for the review, really appreciate it. I’d like to hear a little bit more about some of the use cases.” And so, compañeros, you have asked, and we will deliver. So that’s what we’re going to talk about today. But first, we want to give a couple shoutouts to those showing a little love for the podcast on social media, and as always, compañeros, you’ll forgive me as I mispronounce your last names. So, we got Nicky Van Vroenhoven and Justin Bahr, so thanks for connecting with us on the podcast. Okay, so our show notes for today’s episode will be at sqldatapartners.com/polybaseusecases. Don’t say that three times fast, compañeros. Polybaseusecases, or sqldatapartners.com/194. If you’d like to know more about PolyBase, like what it is, you need that refresher, you can go back to Episode 180, but what we want to do today is talk about why we might want to implement PolyBase. So I guess I’ll start with one. We actually have a fan, Rob Wagner, who is near and dear to our heart; he has a SQL Server and a Postgres server that he’s doing ETL from, and he’s had problems with it and drivers and blah blah blah blah blah. And so, one of the things that he’d like to do in SQL Server 2019 is to be able to use PolyBase to pull that data in so that he can then work with it in SQL Server. So, Kevin, is that going to be a good use case for Rob?
Kevin: Short answer ‘yes, with an if’, long answer ‘no with a but’.
Kevin: No, actually, there is no ‘no with a but’; I just love that quotation. So, yes, as long as the ODBC driver is one that will work with PolyBase. I have not tried one with Postgres yet. I just added it to my to-do list, so you’ve got that going for you.
Carlos: There you go.
Kevin: A blog post may come out of it. But as long as that ODBC driver works, then absolutely, ELT is a great use case for using PolyBase. So I emphasize ELT over ETL. Quick reminder/primer: ETL, extract, transform, load. You’re pulling data out of a source system, you’re performing all of the transformation as part of your preparatory process, and then you load it into your destination system. Historically, that’d be taking data from a transactional processing system, loading it through Informatica, SQL Server Integration Services, whatever, and finally loading the result into data warehouse facts and dimensions. Kimball style, Inmon style, dealer’s choice. We’ve seen a bit of a change. ETL is still a viable paradigm, but ELT has taken over, particularly when you have an enormous amount of data. One of the problems that we see with Integration Services and other tools like it is the difficulty around scale. And even within Integration Services, we kind of landed on an ELT pattern where you don’t do work in Integration Services. You pull the data in from the source, maybe you do some derived column transforms, non-blocking asynchronous operations, and write those things to a staging table as quickly as possible, and then merge in from the staging table.
Carlos: Yeah, all the fun happens after that, yeah.
Kevin: Yeah, that’s the pattern that I’ve seen and written quite frequently with Integration Services, where the moment you enter in a sort operator or any sort of blocking operator, you can forget about it working with large datasets.
Carlos: Yeah, yeah, cause you generally run into memory limits and things like that because of the architecture of SSIS.
Kevin: Exactly. Memory limits are–
Carlos: And the boxes that they’re on.
Kevin: Yep, CPU limits. Those will be your two main limitations. So instead, we’ve seen a shift toward ELT, where I’m going to extract the data from the source, you know, you’ve got to get it from there, we’ll load the data into a staging table in the destination, and then perform transforms as we get from that staging table into our final set of tables. Now, where PolyBase fits into all of this is that I can have on my– let’s say it’s a warehouse. I can have in my warehouse an external data source that points back to Postgres, again assuming ODBC drivers for Postgres work with PolyBase, TBD. I create external tables that represent what that Postgres data looks like, and then I can write T-SQL queries on my data warehouse to pull the data in and transform it on my warehouse server, as I load that data into a SQL Server instance. Now, it doesn’t have to be just Postgres, and it doesn’t have to be just one data source. This is where PolyBase as a data virtualization technology can really play fairly well, because I may have two facts and four dimensions which come from Postgres, another three dimensions which come from a CosmosDB collection, two more dimensions from SQL Server, three facts from Oracle, and two facts and a dimension from yet another SQL Server. So, I may have all of these different things. I could theoretically join them all together, but as soon as you do that on a significant data set, you can expect a query to run until the cows come home.
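[Editor’s note: to make the ELT pattern Kevin describes a bit more concrete, a minimal sketch of the SQL Server 2019 generic ODBC setup might look like the following. All object names, connection details, and the driver string are hypothetical, a database master key is assumed to exist, and whether the PostgreSQL ODBC driver actually works with PolyBase is, as noted, untested.]

```sql
-- Hypothetical sketch; names, port, and driver string are illustrative.
CREATE DATABASE SCOPED CREDENTIAL PostgresCredential
WITH IDENTITY = 'pguser', SECRET = '<password>';

CREATE EXTERNAL DATA SOURCE PostgresSales WITH
(
    LOCATION = 'odbc://pg-server:5432',
    CONNECTION_OPTIONS = 'Driver={PostgreSQL Unicode}; Database=sales',
    CREDENTIAL = PostgresCredential,
    PUSHDOWN = ON
);

-- An external table is just a schema definition over the remote table;
-- no data moves until you query it.
CREATE EXTERNAL TABLE dbo.PgOrders
(
    OrderID INT,
    OrderDate DATE,
    Amount DECIMAL(18, 2)
)
WITH (LOCATION = 'public.orders', DATA_SOURCE = PostgresSales);

-- The "EL" of ELT: land the remote rows in a local staging table...
INSERT INTO dbo.StageOrders (OrderID, OrderDate, Amount)
SELECT OrderID, OrderDate, Amount
FROM dbo.PgOrders;
-- ...then run the "T" as ordinary T-SQL from StageOrders into facts and dimensions.
```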
Carlos: The cows come home.
Kevin: Yep, I’m glad that we picked the same folksy down-home metaphor. You can tell Carlos and I have been doing this for a hot minute. But I would not recommend that strategy. So, if you are in a situation where you get the feeling that, “I really want to take advantage of data virtualization, I really want to have these multiple data sources feed into my warehouse, or my destination system,” you can certainly do that. And if you load them one by one, like if you load Dimension A, Dimension B, Dimension C, and you’re pulling data from those different sources, you can tune that reasonably well. If it’s like two sources combined, you can sort of get away with tuning it by loading temp tables and joining the temp tables together, but you start to lose some of the ability to push down predicates or to filter queries effectively once you’ve introduced more than a couple of data sources in one query. So, a bit of hesitation on that.
Carlos: So now, we’ve talked about predicate pushdown, and you’ll forgive me that this is kind of getting more into the feature set than maybe I thought we originally would. And I hear about predicate pushdown in Power BI, so does that mean that from a transactional system I’m depending on that source data and that engine to pick up some of that and filter that better, and somehow I’m losing that in translation? Cause the same thing still applies; like, in this case we’ve been talking about Postgres, like I’m expecting Postgres to filter out some of that stuff before it sends it my way?
Kevin: That is exactly correct. Originally, predicate pushdown in PolyBase meant when I’m connecting to a Hadoop cluster, the SQL Server PolyBase engine would translate the T-SQL that I wrote, create a MapReduce job, create a JAR file, send it over to the Hadoop cluster, and schedule a YARN task to execute my JAR file, which meant that the Hadoop cluster would run a MapReduce job, generate all of the data that it needed, and then send the results back to SQL Server. So that was original predicate pushdown. With newer PolyBase, V2, connecting to SQL Server, Oracle, or whatever else, we have a similar notion, but it’s exactly like you stated. We’re going to translate that T-SQL, turn it into something that the other side can filter on. So, with my WHERE clause, we’d be able to push that over to the remote Postgres installation. There are limitations around what can be pushed down and how complicated the query can be for pushdown to be supported, but that’s the general concept, yes. And that’s definitely beneficial once you start dealing with larger data sets, where I don’t want to stream all 20 million records over just to get the 5000 that I want. I’d rather go over and ask Postgres, “hey, you’ve got a table with a lot of rows, but you know what? You’ve probably indexed it pretty well. You’ve probably prepared for this inevitable query, so give me back the rows that I need,” and that’s going to be a lot more efficient.
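[Editor’s note: as a quick illustration of what gets pushed down, imagine a hypothetical external table dbo.PgOrders defined over a remote Postgres table. The filter below is the kind of simple predicate the engine can translate and hand to the remote side, so only matching rows come back over the wire.]

```sql
-- With PUSHDOWN = ON on the external data source, a simple WHERE clause
-- like this can be evaluated remotely; Postgres returns only the qualifying
-- rows instead of streaming the whole table to SQL Server.
SELECT OrderID, Amount
FROM dbo.PgOrders
WHERE OrderDate >= '2019-06-01';
```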
Carlos: Yeah. Is it fair to start thinking of PolyBase as an SSIS, I don’t want to say replacement, I don’t know if equivalent is the right word.
Kevin: Oh, I’d think of it as an SSIS replacement, and in many cases, absolutely. A, if you run SSIS on the same server that you’re running your database engine on, you might be able to use PolyBase, because hey, you’re already– like, you could free up some resources by not running Integration Services.
Carlos: Yeah, yes. You think about like memory allocation, like, “oh, I’ve got to save some for SSIS.” Now you can just let SQL Server handle all that.
Kevin: Yeah, exactly, give some more to the database engine and give it to the buffer pool. B, if your Integration Services packages are similar to what I explained, that basically you’re doing a lot of prep logic in a source query and then you load the data through SSIS by essentially pulling it from the source and writing it straight to the destination, or maybe doing a couple of derived column transforms, or maybe a lookup. Those are the types of things that you can really easily reimagine as straight-up T-SQL queries. The whole reason we have Integration Services there is, “well, I just need to be able to connect these two servers; they’re separate servers.” As long as the two servers can communicate, and you don’t have any network segmentation that would prevent it, or anything along those lines, yeah, this could very easily replace it. Cases where it would be a lot more difficult to replace include where you’re doing more complicated work. So, if you are writing to Excel files, okay, writing to something that is not a SQL Server instance is going to not work with PolyBase, with one exception. That one exception is you are inserting rows into a Hadoop cluster via HDFS or into Azure Blob Storage. But I can’t insert into Postgres. I can’t update rows anywhere using PolyBase. So, this has to be running from the destination: I’m pulling data down and then I’m going to keep that data locally. Now, another case where Integration Services would not have to worry that it’s going to be left at the altar is if you’re looping over a lot of servers. A PolyBase external data source is linked to a particular instance or server. So, what I mean by this is, if I have 150 instances, SQL Server instances, in my environment and I want to loop over all of them, that’s probably not going to work very well, because I’ll have to create external data sources for every single one of them. I could.
You know, there’s nothing that prevents me from doing it, but there’s a bit of complexity, and now I’m writing T-SQL while loops or cursors to iterate over the list of so many. And you start to ask yourself, “why am I doing this here instead of in a set of code that’s more designed for looping activities? Bring it out into .NET or even use Integration Services.”
Carlos: Yeah, now, that many servers– I mean, if you have to loop over that many servers to get data for a single source, my condolences, first of all, but my immediate thought was you’re looking at something like administration at that point. So I have– say I’m at ChannelAdvisor and I have 150 instances that I want to make sure are up or whatever; I’m thinking PowerShell comes to mind as a potentially better choice, although it’s a very, very different use case.
Kevin: Yeah, administration through PowerShell, but also thinking about that, you have 150 instances, you still have some common data across them. So, you may have, say error handling data, error logs that you want to aggregate someplace and display in Power BI, so there’s still a scope for potentially going over a large number of instances. But, that’s ELT. That’s one case that I think is right in PolyBase’s wheelhouse.
Eugene: So, what I’m kind of hearing is situations where you’re really just using something like SSIS primarily for data movement, PolyBase could fit that need pretty well. I know at my last job we had something like that, where we had a linked server to MySQL. Don’t judge me, because our ERP system was in MySQL. But trying to do any kind of significant joins or anything against that was very, very slow, and I forget, but either it didn’t do predicate pushdown, or it didn’t do it particularly well. So, assuming that PolyBase had better support for MySQL, I could see how that could be a lot better, because we were just doing like a full database copy or import every night. And it wasn’t particularly efficient, but thankfully, we were only talking a few gigabytes a night, so it wasn’t as bad as it could be.
Kevin: Right. So yeah, this could potentially have been a better case for that. Also, with PolyBase, you do have the notion of scale-out groups, so even if the predicate pushdown is still just as bad with PolyBase as it is with linked servers, then you could still afford to throw more hardware at the problem. Assuming that you had the hardware to throw at it, you could link them together as a scale-out group and throw additional servers at the problem in order to meet a window, if you were going past it.
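[Editor’s note: for reference, adding a compute node to a PolyBase scale-out group is done with a stored procedure run on the node being joined; a sketch, where the head node name and instance are illustrative.]

```sql
-- Run on the compute node you want to join to the scale-out group.
-- 'HEADNODE' is a hypothetical head-node hostname; 16450 is the default
-- DMS control channel port; 'MSSQLSERVER' is the head node's instance name.
EXEC sp_polybase_join_group 'HEADNODE', 16450, 'MSSQLSERVER';
-- Restart the PolyBase Engine and Data Movement services afterward
-- for the change to take effect.
```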
Kevin: So that’s one scenario. Another scenario, one that I’m currently kicking around, is kind of similar to ELT, but as a replacement for certain types of transactional replication. And admittedly, I have not implemented this. I’ve thought about it just a little bit, just a tiny bit.
Carlos: No, did that make it into the book?
Kevin: No, no, this is actually fairly recent. It came out of discussions with another person who will remain nameless, but he knows who he is. And the idea here is that there’s replication of data which doesn’t change very frequently. I mean, you could call it nearly static data. Sometimes there’s a new row that gets inserted; a dev team enters a new row maybe once or twice a sprint. But this is not the type of stuff that’s constantly updated; however, it needs to be on every server. So this is a case where transactional replication works quite well. You have your publisher, you only write to the publisher, you limit access on the subscribers, and then the publisher pushes writes down to the subscribers, you get the results back pretty fast, and life is good. Until you get to the point where you have so many subscribers that your publisher gets overwhelmed, and then you start looking at republishing. You start looking at more and more complex replication topologies and begin to wonder, “where did I go wrong in life?” So, an area where PolyBase, combined with change tracking or change data capture, could be beneficial would be this: you turn on either change tracking or change data capture, and all of those instances, the subscribers, would have an external data source pointing to your publisher database. Not the replication distribution database, or any sort of pub database, but the actual database you care about, the actual instance you care about. You look at whatever database has your nearly-static data and you have an external data source pointing to that SQL Server instance. And then you have maybe a SQL Agent job running on each instance that says, “go grab the latest data from the publisher.
I don’t need it this second, but maybe I need it this minute.” So, I can have a SQL Agent job running every minute that goes and checks to see, using change tracking or change data capture, if there were rows that have changed, what rows have changed, and depending on which of those two methods you use, how much effort it would take to pull that data over. You can do that and scale it out nearly– well, you can scale it out much, much further than you can with transactional replication. Transactional replication is a bit heavy, especially if you have a push model where now all of those SQL Agent jobs are running on your publisher.
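[Editor’s note: one way the per-minute pull Kevin sketches might look in T-SQL, assuming CDC is enabled on the publisher and an external table dbo.RefData_CT has been defined over the publisher’s CDC change table. Every name here is hypothetical, and delete handling is omitted for brevity.]

```sql
-- Hypothetical subscriber-side sync step, run from a SQL Agent job.
DECLARE @LastLSN BINARY(10);
SELECT @LastLSN = MaxLSN FROM dbo.SyncState WHERE TableName = 'RefData';

-- Pull only the changes since the last sync across the wire.
SELECT __$start_lsn, __$operation, ID, SomeValue
INTO #Changes
FROM dbo.RefData_CT          -- external table over cdc.dbo_RefData_CT
WHERE __$start_lsn > @LastLSN;

-- Apply inserts (operation 2) and update after-images (operation 4);
-- deletes (operation 1) are omitted here for brevity.
MERGE dbo.RefData AS tgt
USING (SELECT ID, SomeValue FROM #Changes WHERE __$operation IN (2, 4)) AS src
    ON tgt.ID = src.ID
WHEN MATCHED THEN UPDATE SET tgt.SomeValue = src.SomeValue
WHEN NOT MATCHED THEN INSERT (ID, SomeValue) VALUES (src.ID, src.SomeValue);

-- Record the new high-water mark for the next run.
IF EXISTS (SELECT 1 FROM #Changes)
    UPDATE dbo.SyncState
    SET MaxLSN = (SELECT MAX(__$start_lsn) FROM #Changes)
    WHERE TableName = 'RefData';
```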
Carlos: Yeah, so I was trying to think from a– so the big knock against replication as well is, okay, so let’s say if something changes, I have to add a column; so the number of touches, that wouldn’t really go down with PolyBase. I’d still have to go to each one, although it’d be in T-SQL, rather than having to mess with–
Kevin: Yeah, actually, so with modifications you’d still have to have a fairly planned-out deployment process, but I would argue modifications actually would be a lot easier with PolyBase. And here’s my reasoning: if you have to add a column, well, you change all the articles, you have to generate new snapshots, you have to push the new snapshots. If you have a reasonably substantial amount of static data, that can take some time. So, adding a new column– removing a column actually is even worse. Some of this stuff, it’s obvious that you can do it, I mean, people do it, so I don’t want to oversell this as always literally impossible to do otherwise, but replication has a well-deserved reputation for being a management pain. Merge replication has a well-deserved reputation as a management nightmare. Ask me how I know, and I still won’t admit that I know anything about merge replication.
Carlos: Just say no.
Kevin: Yeah, yeah. But with PolyBase, if you think about it, you have the publisher side and the subscriber side. On the subscriber side, let’s say I’m adding a column. I can add the column to the subscriber side, make it nullable or have a default, and then I can add it to the publisher side, backfill the publisher side, update the change tracking job or change data capture job to include that new column, and include it in whatever procedure I have that loads my subscriber table from the ELT table. I get my result, so it’s a few steps; it’s not one step. If you wanted one step and you have relatively low-frequency usage, you could simply create an external table that points to the publisher. At that point you’re no longer publishing, you’re just redirecting, so this works if you’re in a not-too-busy environment. If you’re in an environment where you implemented replication because there were too many calls on one server, you would want to have an external table that will be your ELT table, and then you’ll have a real table that will be your static table. And then you check the changes in ELT and merge them into static. Sort of like we were talking about with ELT into a warehouse. So that would be a strong recommendation if you’re at a point where you have just so many hits against that database or against tables in that publication database.
Carlos: Interesting, yeah.
Kevin: But, it’s an option. And like I said, I’m sure I’ll run into some difficulties the first time I try to do something like it, and it also doesn’t solve anything around merge replication.
Carlos: Sure, that’s right, which I almost feel like is a completely different ball of wax you’re trying to deal with there.
Kevin: Yes, a much messier one.
Carlos: Yeah, I agree. Very good. Well, I think that that idea in and of itself– the ELT option– there are people that are excited about that feature for when they can finally go to 2019, assuming that their ODBC connections–
Kevin: Right, exactly, and it’s definitely worth the amount of money you pay for this podcast, dear listeners. Oh, it’s time for Eugene to wake up.
Eugene: Sorry, you said replication like 50 times and I just zoned out.
Kevin: His alarm went off and it was time to flip over so now he gets sun in the back.
Carlos: Yeah, he’s looking at the palm trees and he’s like, “hm.”
Eugene: Look, I haven’t had to manage a database in over a year, so.
Kevin: All right, so another scenario, one that I love. I love the scenario of cold storage.
Eugene: Oh, interesting.
Kevin: And the idea behind this is, everybody has somewhere some fact table that’s time-based, and for legal reasons, or for whatever reasons, you have to keep 7 years of data, 10 years of data, 5 years of data, whatever that number is, but people only ever query the last 3 months. So, you keep all of this data, and you’re probably keeping it on really fast disk, the fastest disk you have, because you do that for your SQL Server so that you can get queries back in a reasonable time. But you have all this extra space that’s used up, you have all of this risk if somebody accidentally does a SELECT * from the table and pulls in 7 years’ worth of data where, you know, you’d normally keep the last 3 months in the buffer pool. And so you may wonder, “well, what can I do about this?” And one of the techniques that I like to show off is where I can store this data in Blob Storage or in a Hadoop cluster, probably a much slower disk, much cheaper disk. I probably save a lot of money by putting it in Blob Storage. Technically, I could save a lot of money putting it into a Hadoop cluster, depending on whether you have the resources to manage that Hadoop cluster. So that’s something you have to work out the math on, but probably could. And what I would do is that I can partition out the table by time period. So, folks who’ve been around the block a few times remember SQL 2000-style partitioning, where we have a table, I’ll call it the facts table. And we would break up the facts table into a bunch of smaller tables, like fact table 2001, fact table 2002, etcetera, etcetera. And then you create a view that is the UNION ALL of all of these, where your date is between the begin and end dates for that table. The where clause in the view is really important, because that’ll help the contradiction detection in SQL Server figure out, oh, well, these tables just aren’t used at all.
These tables, in this query, where the date is between June 1st of 2019 and July 1st of 2020, I don’t need to hit the 2018 table at all. So that concept, going back to SQL Server 2000, before we had table partitioning in Enterprise Edition, actually can come into play here, where I can take the data in my giant facts table and start splitting it out into other tables. Well, those tables don’t have to be on the same hardware. I can write to Blob Storage, so depending on how many rows I have, I may break it out by month, I may break it out by year, I may even break it out by day if I have a large, large amount of data, but I can start writing it to Blob Storage. And the one time of year when an auditor comes in and you have to show data from five years ago, it’s going to take a long time, because you still have to pull it from Blob Storage, you still have to do all the work against it, but at least for the other 364 days of the year, that data’s not taking up really precious, extremely fast disk.
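[Editor’s note: a rough T-SQL sketch of the pattern Kevin describes, assuming an external data source over Azure Blob Storage and an external file format have already been defined; all names are illustrative.]

```sql
-- The cold year lives in cheap Blob Storage behind an external table.
CREATE EXTERNAL TABLE dbo.FactSales2018
(
    SaleDate DATE NOT NULL,
    Amount DECIMAL(18, 2) NOT NULL
)
WITH
(
    LOCATION = '/facts/sales/2018/',
    DATA_SOURCE = ColdBlobStorage,     -- hypothetical, defined elsewhere
    FILE_FORMAT = ParquetFileFormat    -- hypothetical, defined elsewhere
);
GO
-- SQL 2000-style partitioned view; the WHERE clause on each branch is what
-- lets contradiction detection skip the cold branch entirely when a query
-- filters on recent dates.
CREATE VIEW dbo.FactSales AS
    SELECT SaleDate, Amount
    FROM dbo.FactSales2018
    WHERE SaleDate >= '2018-01-01' AND SaleDate < '2019-01-01'
    UNION ALL
    SELECT SaleDate, Amount
    FROM dbo.FactSalesHot              -- local table on fast disk
    WHERE SaleDate >= '2019-01-01';
```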
Carlos: Right, or potentially memory, somebody, you know–
Kevin: Yeah, yeah.
Carlos: Fast fingers something. Right? Like, “ooh, I didn’t mean to do that.”
Kevin: Absolutely, absolutely. And also, these are the types of queries that typically you’re not expecting real-time speed on. Because, “the auditor’s coming in today and needs some data.” “Alright, well, it’s going to take six hours.” “Okay, no problem.” Or more realistically, “the auditor’s coming in next week, here are the queries that they need.” “Okay. I’ll start running them right now. It’ll take three days to finish, but you know, who cares, because the person’s not getting in for a few days.” So we have that kind of scope, but if you have that sort of ability to say, “if you want your data fast, you stick to this lane. If you need older data, well, we can build an extract and get it to you in a day or two.” Sort of like going into Glacier, where, “hey, you put your data in here and if you want it back, it’s going to be 24 hours, or you can pay us extra money and then we’ll surprisingly make it available right now.”
Carlos: Yeah. Yeah, so I think, again, it kind of all depends on architecture needs, cost, all those things; you have another option available to you, because I know we’ve talked about the cold storage option from Microsoft being, yeah, laughable at best.
Kevin: Stretch DB, you mean?
Carlos: Stretch DB, sorry.
Eugene: Is that still a thing? I’m like, “did he just reinvent Stretch DB?”
Kevin: Well, it’s like Stretch DB, but one where people might actually use it, because the problem with Stretch DB, one of the problems with Stretch DB, is that it would cost you a lot more money to store the data in Azure than it would to store it on-prem. I did some quick math, and the rough estimate was I could buy a brand new SAN every month for the amount of money it would cost to store my data in Stretch DB.
Eugene: Oh, fun.
Kevin: So, it’s seriously that– it was that bad of pricing. This is a lot cheaper, because you’re talking about a couple of pennies per gigabyte per month. So again, it’ll be slow. Getting that data back won’t be very snappy, but it is cheap, and if your primary reason for holding this data is simply, “I need it to be out there because one time in a thousand somebody’s going to ask for some of it,” then this may be an option for you. And if you need better performance than Blob storage, and you have a Hadoop cluster, you can shunt it off there and then have PolyBase write MapReduce jobs when you do need that data. Because there’s no way to make Blob storage faster. All Blob storage is going to do is stream all of the data back to SQL Server, and then you do all of the work from SQL Server. There’s no pushdown to Blob storage. There’s nothing to push to.
Carlos: Yeah, so there you go. I mean, maybe it’s only because we just published our episode on DBA work, but this is a great example of all the considerations that you need to think about when working with data or moving data around, hosting data, housing it.
Kevin: Oh yeah, oh yeah, and nothing in this is new. I mean, I mentioned the partitioned view thing goes back to the SQL 2000 days. The idea of moving data between systems, well, we already talked about Integration Services and Informatica. Moving data between systems is nothing new. As a database administrator, hopefully you’re listening to this and saying, A, “oh yeah, that makes a lot of sense,” and B, “yeah, I could do some stuff like that before.” And where PolyBase helps is that you get to stick with T-SQL instead of going out to Integration Services, setting up that server, taking up the CPU and memory for Integration Services, building those packages, and having that extra bit of development. So, it’s potentially simplifying your job instead of giving you something so radically new that you never had any concept of it before and a grizzled database administrator could never understand it. Something like that.
Carlos: Yeah, awesome, well, very good. I think that hopefully, dear podcast listeners, this is a little more helpful in terms of context or reasons why you might be thinking about implementing PolyBase. You know, it is interesting; this came up with Javier when we talked about SQL Server 2019 in general. It was somewhat uncertain what the new features would be, but it does seem like there’s lots of little nuggets in here that we could start leveraging, and I think this is one of them. Okay, well, I guess that’s going to do it for today’s episode. Thanks, compañeros, for tuning in. Again, our show notes are at sqldatapartners.com/polybaseusecases. We cannot let this episode end without mentioning again that our very own Kevin Feasel has put out a book on PolyBase. We’ll have a link in our show notes, but hint hint, it’s available on Apress and wherever technical books are sold.
Kevin: Not the fine technical books though, just the mediocre ones. I do have conclusive proof that at least two people have bought a copy of my book. So, I’ve sold at least two copies, so life is– I’m rolling in dough right now.
Carlos: Yes, literally, because until he gets to 10, they pay him in dough, rather than cash.
Kevin: Even after 10, it’s just dough.
Carlos: Okay, and as always, compañeros, we’re interested in getting your thoughts. Tell us what you want us to be talking about, and you can reach out to us on social media or hit me up at [email protected] if you want to chat. And Eugene, how can people talk with you?
Eugene: Yeah, people can find me on Twitter @sqlgene.
Carlos: Mister Kevin?
Kevin: You can technically find me on Twitch, if you know where to look. Otherwise, you can find me in a hotel room but not in the state of Washington right now.
Carlos: Yes, or the country of Italy.
Kevin: Or Italy or yeah, there’s a list of countries I’m not allowed to be in at the moment.
Carlos: And compañeros, you can reach out to me on LinkedIn. I am at Carlos L Chacon. Thanks again for listening and we’ll see you on the SQL Trail.