Episode 180: PolyBase in SQL Server 2019

Episode 180: PolyBase in SQL Server 2019

Episode 180: PolyBase in SQL Server 2019 560 420 Carlos L Chacon

This week we heard the news of the release for SQL Server 2019 and a major focus of this release is data virtualization. In this episode, we discuss PolyBase–a topic we covered in Episode 95; however it has really come of age with SQL Server 2019. Our very own Kevin Feasel talks about some of the improvements of the feature and we discuss some of the use cases of when it will be useful and some of the pitfalls you might run into as you implement it. We discuss how PolyBase can help you reduce storage and other costs by avoiding the need for ETL processes that duplicate data in order to make it accessible from one source. Kevin is putting the final touches on his book PolyBase Revealed, and if you are looking for a deep dive into the subject, we invite you to check out his book.

PolyBase Revealed by Kevin Feasel

Kevin’s PolyBase in Action blog

Free supported Java in SQL Server 2019 is now available: Travis Wright’s blog post

Episode 95: Polybase

Episode Quotes

“I’d rather have a central server that understands and talks to all of those and allows me to query it using a consistent language and get my results back, and that’s what PolyBase allows you to do.”

“One of the trade-offs of using PolyBase is, don’t expect it to be as fast as if you had designed everything to fit on your SQL Server instance and had appropriately architected all of the pieces.”

“Being able to share out the load of a problem, this, I think, is the ultimate ideal for PolyBase.”

Listen to Learn

00:38     Intro to the team & topic
01:23     Compañero Shout-Outs
02:52     What is PolyBase?
06:42     The world has changed so much, even in a couple years
11:24     The further away your data is, the slower it will be to work with
16:10     It might be worth the extra seconds waiting to save a lot more time and effort
19:46     Let’s talk about Java – do we need it?
24:36     The differences between PolyBase and linked servers
26:30     There is a role for an administrator in security
29:24     Where Kevin sees this going
30:47     Using other technologies with PolyBase
32:06     Last thoughts on PolyBase
33:00     SQL Family Questions
36:18     Closing Thoughts

About Kevin Feasel

Kevin Feasel is a Data Platform MVP and Engineering Manager of the Predictive Analytics team at ChannelAdvisor, where he specializes in T-SQL and R development, fighting with Kafka, and pulling rabbits out of hats on demand. He is the lead contributor to Curated SQL (https://curatedsql.com) and the author of PolyBase Revealed (forthcoming). A resident of Durham, North Carolina, he can be found cycling the trails along the triangle whenever the weather’s nice enough.

*Untranscribed Introduction*

 Carlos:             Compañeros, welcome to another edition of the SQL Data Partners Podcast. My name is Carlos L Chacon, your host, and this is Episode 180. I am joined today by two very special guests. You may have heard of them before. Eugene Meidinger.

Eugene:           Howdy.

Carlos:             And Kevin Feasel.

Kevin:              I deny everything.

Carlos:             Which is going to be interesting for this episode, because actually we’re talking with Kevin about PolyBase, so this could make for a very, very interesting conversation. So, you may remember that we talked about PolyBase in Episode 95, but never fear, compañeros, if you haven’t gone back and listened to that, or maybe it’s been a little while since you’ve listened to it, we’ll forgive you, if that’s the case. We’ll review some pieces, and then talk about what’s new. Before we get into that, however, I do have a couple of shout-outs I want to give. Allen White. I can’t not tell the story when I think about Allen, to tangent for a moment. So, my very first SQLSaturday that I went to, the very first session I was in was Allen White, and he said something that has stuck with me, and admittedly, compañeros, has played a role in one of the reasons that this podcast exists. And he said, “we all have something that we can teach each other. You know something that I don’t know,” and then his thought was that “hopefully in this session I’ve put together something that can help you on your way.” And so, I’ve always been appreciative of that, and that idea around the community, and particularly around this show, has been built on that idea, so, shout-out to Allen. Ahmed Mohammed, Stacey Hamlin, and Cara Ziegel. Thanks, everybody, for connecting with us. Our show notes for today’s episode will be at sqldatapartners.com/polybase, or sqldatapartners.com/180. As this rolls out, Kevin will have hopefully be recovered from the jetlag on his whirl-wind tour. He’s got trips to Australia and New York and Europe, as well.

Kevin:              It’ll be an interesting October.

Carlos:             I’m exhausted already, just thinking about it, and I don’t even have to–

Kevin:              Possibly it will have already been an interesting October.

Carlos:             Yes, that’s right. I think by the time the compañeros listen to this, it will have been. Okay, like I mentioned, at the beginning, PolyBase is our topic, and obviously there’s been lots of changes, and I should say we’ll get into some of the other pieces, or maybe new technologies that PolyBase plays a roll in, but first, we should probably just revisit high level concepts. Tell us what PolyBase is. Why did they implement PolyBase in the first place, Kevin?

Kevin:              Yeah, it’s definitely worth revisiting, because the nature of the product has changed over the years. Going back to its origins, you go back to the original papers, and the intent was to be able to work within parallel data warehouse, massive parallel processing in SQL Server, and then connect that to HDFS, the Hadoop Distributed File System. So, the early editions were all about that integration. When us normal folks, who can’t afford PDW or APS finally got PolyBase, this was SQL Server 2016, and here, the intention was, once again, I’m going to connect to the Hadoop Distributed File System, or to Azure Blob Storage, because in a stroke of luck, it turns out that Azure Blob Storage exposes the same protocol that the Hadoop Distributed File System does for web interactions. It’s called WebHDFS. And therefore, the developers could just integrate directly with Azure Blob Storage, pull that data back to SQL Server and process. So, you’ve got those two pieces together, and originally the idea was, if I’m working with a Hadoop cluster, I can shunt off some of the work to Hadoop. I can run a MapReduce job against that Hadoop cluster, which is probably cheaper than my very expensive core licensed SQL Server instances. And so, dump that work off, get some results back, and display those results to users without them necessarily knowing that you went to Hadoop or that you went up to Blob Storage and pulled data down.

Carlos:             Right, and in theory, not needing new skillsets to be able to do some of that, as well.

Kevin:              Exactly, it’s the same T-SQL that we all know and mostly love. And then with SQL Server 2019, the mission has changed a bit. It’s less about Hadoop, specifically.

Carlos:             Yeah, so MapReduce has lost favor a little bit, because they’re like, there’s got to be better ways to do this, or different ways.

Kevin:              Yeah, the specific implementation of MapReduce in Hadoop, it has the downside that it can be very slow in comparison to other things. Like, for example, Spark, which has taken over that space, where you’re still technically doing MapReduce. It’s not MapReduce, the implementation concept, it’s MapReduce, the specific implementation for Hadoop 2.0.

Carlos:             Yeah, V2, yeah.

Kevin:              Yes. So Spark has taken over, because it’s more in line with how computing is today, where we have much larger amounts of memory. We have the capacity, still, to do this distributive processing, but we don’t have to hit disc as much, and we also don’t necessarily have big boxes filled with direct-attached storage spinning discs, where we can dump a whole bunch of data on them and not care too much. In a world where I’m going with NVMe hopefully direct-attached, or in a world where more realistically, a lot of us have SANs, or we’re storing our data in Blob Storage or S3, the concepts behind MapReduce in Hadoop and how they work don’t play quite as well. It’s still useful, but it’s not as useful as it was in 2012, 2013, when Microsoft research was developing this.

Carlos:             Yeah, exactly. you mentioned in the last podcast that PolyBase ultimately started in 2010 and so it has been quite the journey. I mean, you think about what the landscape of the world was, or what you were doing in 2010, and what we’re doing now, and things have changed. I mean, even we’ve had several episodes about Power BI, and I’m not even sure if that was on the roadmap yet.

Kevin:              Yeah, exactly, and the world changes. PolyBase is also adapting, and in 2019 what we’re seeing is this shift, this mental shift from, “I’m going to integrate with HDFS”, to more of, “I’m going to perform data virtualization.”

Carlos:             Right, you kind of hinted at that. I mean, it seemed like the product team wanted to do that, it just wasn’t quite there, but that idea of, “I want to be able to connect to anything.”

Kevin:              Yes.

Carlos:             And now we’re starting to see that a bit more in reality.

Kevin:              Yes, so a couple of years ago, Microsoft bought a small company. You could see maybe three or four articles on Google about it, and they all just say, “Microsoft bought this small Israeli company that does data virtualization.” And they have a product that would allow you to connect, integrate a whole bunch of different products together. So, SQL Server, Oracle, Teradata, etcetera, etcetera. Microsoft buys them and a couple years later, we see in SQL Server 2019 this move toward data virtualization, meaning, I have on my SQL Server instance, sort of links to other servers, but they’re not linked servers, so you know, we don’t have to worry about linked servers. We should talk about the difference between these as we go on, but the concept of data virtualization is, I frankly don’t care where the data lives. Have it be where it is most useful, have it be where it is best stored, but when it comes time for me, the report writer, to query this data, I don’t want to have to connect to a MongoDB instance in order to grab some data and then connect to an Oracle instance to grab more data and then connect to a Hadoop cluster and grab more data and try to merge it all together on my desktop. Instead, ideally, I’d rather have a central server that understands and talks to all of those and allows me to query it using a consistent language and get my results back, and that’s what PolyBase allows you to do.

Carlos:             So, the approachability of data is a big driver, here. We talked about it only being there in Hadoop, I guess what has been your experience? I mean, is that really more marketing-speak at this point, or is that much more approachable? Do you find that reality and the marketing, or the idea, are they meeting?

Kevin:              In general, I think so, so far. With with 2016 and 2017, it was rather limited in what you could do. You had the ability to connect to a Hadoop cluster and basically was taking what was developed for PDW/APS, bringing it to the masses and trying to get people to understand that, hey, if you have a Hadoop cluster, you don’t have to write Sqoop jobs to pull data over. You can have the data still live in Hadoop and you can operate off of it over in Hadoop and just get the results back. With 2019 now, you know, we’re seeing, “hey, I can connect to SQL Server, Oracle, Teradata, Mongo, Cosmos, Spark, Hive,” even if you want really big data, you can connect to Excel.

Carlos:             Most powerful database? No, I shouldn’t say most powerful.

Eugene:           It has in-memory analytics, right?

Kevin:              Exactly, exactly. It’s a beast. And you can connect to all of those. I know, because I have connected to all of those, because I’m writing a book on the topic.

Carlos:             There you go. We’re glad to know you’re doing your research, there.

Kevin:              We’re a few minutes in and I finally let slip.

Carlos:             And it’s available for sale, and compañeros, you can get it.

Kevin:              Actually, I don’t know if it will be available for sale by the time this comes out, but you can probably find it on Amazon by the time this comes out. It’s going to be called– the working title, as of August 8th, is PolyBase Revealed. So, you can’t find it on Amazon today, August 8th, but–

Carlos:             Compañeros won’t know that. That’s right, they won’t hear about it.

Kevin:              Exactly, by the time you’re listening to this, it’s going to be out. Go make it a best-seller or something. I don’t know, if a hundred people buy the book, I probably get a dollar.

Carlos:             Yeah, so Eugene and Kevin are now racing for dollars.

Kevin:              Well, in my case, it’s for dollar, because I don’t expect to–

Eugene:           Yeah, if you watch half of a course of mine, I’ll get a dollar, so it’s pretty easy to out-beat Kevin, here.

Kevin:              That’s true, that’s true.

Carlos:             Games are afoot.

Eugene:           So, help me. There’s something that just feels off to me about the idea of PolyBase, and it’s probably just ignorance on my part, or trauma scars from working with linked servers, I’m not sure.

Carlos:             Yes, for shame, for shame.

Eugene:           Yeah, well, we needed to connect different SQL instances, including ones that were on the same server. But I think I can see the benefit if your data is in this hodgepodge of various different big data and NoSQL systems, kind of like what you were describing. So basically, your team drank the buzzword Kool-Aid, and so you’ve got MongoDB and Hadoop and maybe like Elasticsearch and CSV files and all of this kind of stuff. So I get that, if you’re like, “alright, I just want one-stop shopping for all of my data and I want to be able to use SQL.” Okay, that makes a ton of sense to me. But the whole idea of data virtualization just seems wrong to me, coming from a SQL perspective. Because in my mind, the further away your data is, the slower you’re going to be able to work with it, but then also, hopefully you have really good support for predicate push-down, because what happens whenever you decide, “oh, I want to do a JOIN between some of that data in Mongo and some of it in HDFS.” Am I now having to pull it all back in and JOIN it all together? It just seems like it’s adding a layer of latency and work that shouldn’t be necessary.

Kevin:              Yeah, you are absolutely correct about that, so I would say that one of the trade-offs of using PolyBase is, don’t expect it to be as fast as if you had designed everything to fit in your SQL Server instance and had appropriately architected all of the pieces. It’s going to be slower, and there are a couple of reasons why it will be slower. Some of them, I hope, get improved in future versions, some of them are just intractable problems of the space. One of the problems that I hope to get improved is the quality of how things connect together. So if I’m pulling data from Oracle and from CosmosDB– let’s say Oracle, Cosmos and Spark. I’ve got three sources and I have a filter in Oracle, and that filter in Oracle can implicitly, subsequently filter down things in Spark, which can then subsequently filter down things in Cosmos. In an ideal world, as humans, we can figure out that, “well, if I connect to this data source first, pull down just the bare minimum data I need, and then use that as an extended filter into Spark and say, ‘well, only grab rows that are like this’, and then pull those back, and then extend the filters further within the course of the query to Cosmos and pull data back, then we get something that is closer to optimal.” In reality, what tends to happen is that if it’s in your WHERE clause, PolyBase can potentially push that filter down, get data from Oracle, but then it’s probably going to pull all of the data back from Spark, all of the data back from Cosmos into temp tables on your SQL Server instance, JOIN that data together and return it to you. I’m being purposefully general and fully acknowledge that there are exceptions to this, but that is the way that you can think about it. And, so it has its subsequent performance cost, that yeah, if I had all of this data on one server to begin with, I can write a better query. The optimizer can come up with a better plan, but as of today, it tends not to.

Eugene:           Yeah, that makes sense. And I know, like, some of this stuff is definitely possible, because I’ve seen it with a similar technology in the Power BI space with query folding. So, one of the demos I always love to do with Power Query is if you join an Excel file and a SQL database, it’s smart enough that it can pull the data from Excel and push that back to the SQL query. So it’s got to be possible, but that helps clear things up, for sure.

Kevin:              Right, and even with that, it is still possible, it’s just, you have to write queries a little more intelligently, I could say, or rather, I suppose it’s writing to the test. So, here’s the way that I can get fast performance pulling this data from Spark. I have to do this extra work, maybe load the Oracle data into a temp table and write a dynamic SQL query that gives me an explicit WHERE clause that I can pass over to Spark.

Eugene:           Right, that makes sense.

Kevin:              There are tricks around that, if you’re aiming for performance. With data virtualization, oftentimes, it’s not so much about immediate performance, and it’s more about, say, pulling data in to a warehouse instead of having Integration Services be your ETL process, virtualizing the data sources and then transforming them to load into a warehouse.

Carlos:             Right, so remembering what that use case is– yeah, that was my thought is some kind of ETL process or overnight process. I mean, you think about, well, what would be faster, trying to port all that data over into SQL Server so you can then report on it, or using something like this so that you can actually start getting some data out of it, even if real-time, it might take a few more seconds than you’d like?

Kevin:              Yeah, exactly, and it also helps if you’re in a space where you’re performing, say data science work or analytics work where I need to run these queries. Frankly, I don’t care if it takes 12 seconds instead of a quarter of a second, or if it takes 15 seconds instead of 5, or if it takes 5 minutes instead of 10 seconds. I can eat that, because I run the query once and I get my data. And so, if you’re thinking of it from the development standpoint of working where this data is the first step in a chain that a human is going to be involved with, and time is not of the essence, then PolyBase is a great solution, because, again, you teach people one language. You don’t have to teach them the variant, the dialect of Spark SQL. You don’t have to teach them PLSQL. You don’t have to teach them how to interact with Cosmos or any of these other sources. You know, Teradata has its own dialect of SQL.

Carlos:             So you’re saying my Zero to SQL book has a longevity to it, still?

Kevin:              Exactly, exactly, because you’re teaching people one language, and you’re just saying, you know what? You don’t even have to think about where that data comes from, because in SQL Server, especially if you’re using Azure Data Studio, it just shows up as, say, dbo.mytable, and dbo.mytable could really be an external table pointing to Oracle. You don’t really care. Within Management Studio, they hide them under a separate folder that says external tables, so you have to drill into that, but again, from my perspective as a query writer, it’s just a table that is local to the database I’m in, which means I can use two part naming, it means I can join to it like anything else, I can write any T-SQL I want, even stuff that’s specific to T-SQL, like the APPLY operator, like certain functions. I can write my user-defined functions, if I’m a masochist. And if I really, really want bad performance, I can write user-defined functions involving these external tables, because at the end of the day, the data’s coming back to SQL Server and it’s being exposed in a way that SQL Server understands.

Eugene:           Yeah, that makes a lot of sense, and I especially like that you bring up the data science piece, because I think, and you’re going to have a better feel for this than I am, but part of the reason people are moving to data lakes and this more like ELT kind of approach is the whole thing with data science is so much data exploration that you don’t know what structure you need the data in or even exactly what you need until you get there. It’s a very iterative process, and so, it’s hard to build out the structure you need, because you haven’t done the experiments to figure out what data is actually, you know, correlates. And I think building on what Carlos said, I think you had a very good point that well, how long is it going to take to build this data warehouse that will perform really, really well, but could easily take a year. Or you may start going in one direction and realize that you’ve wasted a bunch of effort, because now you need these different tables or these different columns, or whatever.

Kevin:              Yeah, exactly, and building out an entire Kimball-style warehouse when you’re still trying to figure out the dimensions is a classic case of premature optimization. And we know how bad that can be, because you end up making the wrong choice, and then you pay for it in the end with extra development time.

Carlos:             So, one of the things that we found interesting with the 2016 version of PolyBase was the need to install JRE, and kind of this integration of Java and we thought, “ooo, that’s kind of weird.” However, in 2019, one of the things that’s changed is that there’s now more Java integration, and actually taking a peek at one of Travis Wright’s, who’s one of the program managers. You know, we were scratching our head, “oh, you know, what the heck?” Well, he makes the case that Java has actually been integrated into SQL Server a little bit longer, because they have the JDBC driver, right, that’s been around for a while. And I’m like, “oh, okay, well, I guess that’s fair.” So, I guess from a needs perspective, to get started, I’m assuming 2019 version still requires Java Runtime environment and that hasn’t changed?

Kevin:              So, let’s talk about Java, because the answer is kind of, but not necessarily. It’s a lot more fun than that.

Eugene:           This is like an intervention. Alright, kids, let’s talk about Java. Don’t do Java.

Kevin:              Exactly, don’t do Java.

Carlos:             Well, we have the SQL Server loves Java photo for this episode.

Kevin:              SQL Server may love it, but I don’t. Coming from a guy who spends a lot of time working with Spark and Hadoop. It’s a necessary evil, not a best friend. So, anyhow, 2016, 2017, you absolutely needed to install the Oracle Java 7 update 51 or later, and they supported Oracle 7 and Oracle 8. You could either get it Oracle SE, so the official edition from Oracle, or OpenJRE, OpenJDK. OpenJRE and OpenJDK are still Oracle-based. Oracle developers still do a lot with it, but it’s open source, it’s basically community supported and that means that if there’s a problem that’s Java-related, you have to get community support instead of paying Oracle for support. Now, this year, there have been changes to the Oracle licensing scheme to where– I should never talk about licensing, because I always get it wrong, but if you are in a production environment using Oracle Java, then you are supposed to pay Oracle for the privilege of licensing for production.

Carlos:             Interesting.

Kevin:              As a result of this, Microsoft, in the latest CTP, as of August 8th, which is 2019 CTP 3.2, Travis had a blog post where he talked about the integration with Azul Systems. So, Azul and Microsoft had been working together in Azure since, I think, last year, and now when you install SQL Server 2019 CTP 3.2 or later, if you select one of the things that requires a Java installation, then you have the ability either to specify where your JRE lives. So if you’ve already installed Oracle SE or if you’ve already installed OpenJRE, you just point to where that lives. Otherwise, you can install Azul Zulu Java, which is a separate commercial platform for Java, handled by Azul, Microsoft has a licensing agreement with them, so if you have a production problem you talk to Microsoft, and then if it’s a Java-related thing, then they communicate. I don’t know the details, really, because I don’t really care. Welcome to me thinking as a developer. But, due to these changes, basically you have this option of a fully-supported license without you shelling out cash to Oracle. Now, answering your question specifically, do we need Java? The answer is, you need a JRE if you have machine learning services Java support, obviously if you’re going to run Java locally on your machine, you have to have Java.

Carlos:             Right, right, makes sense.

Kevin:              Or, if you are using PolyBase and want to get the HDFS connector. Microsoft has separated this out, so you have what I’ll call V1 and V2. V1 was that HDFS, which meant Hadoop and Blob Storage. V2 is more ODBC-based. It’s connectivity with Oracle, Teradata, Mongo, and then SQL Server and generic ODBC. If you only want that V2 stuff, if you’re only connecting to Cosmos and Oracle, and you don’t have Blob Storage, maybe you’re an AWS shop, and you don’t have Hadoop, then you don’t need to install Java, but you still get PolyBase V2. If you do want to connect to Blob Storage, you will need the Java Runtime environment.

Carlos:             There you go. Which I can imagine that, I mean, we’ll see what happens as they continue to integrate here, but I have a sense that more and more of us will be having to make choices like that when we start getting into the newer versions of SQL Server. Okay, so we said we were going to talk a little bit about linked servers. You talked about some of the connectivity, some more ODBC-based, but then you’re like, “well, it’s not a linked server.”

Kevin:              Yes, it’s not a linked server. So, there are some differences between PolyBase and linked servers. Linked servers, by their name, are intended to be server-level objects. You create the linked server connection, and now you have the ability to integrate with many of the sources that I talked about. We’ve been able to connect to Oracle through linked servers for years, SQL Server, even to Excel. I joked about Excel with PolyBase. You can do it, but you’ve been able to do this for a long time. There are a couple of differences. One is that for PolyBase, it is database-specific. I may have a database where I want to pull data in from some remote SQL Server instance, and within that database, I set up an external data source that says, “hey, I’m aware of a remote SQL Server living over here at this host name or this IP address. And now I want to create an external table that mimics a table over on some database on that remote server. So, locally, on my local instance, I have an external table that just looks like any other SQL table, and it is just that table. Users don’t have to try to drill down and figure out which database, which schema, which table.

Carlos:             Sure, okay.

Kevin:              Which makes PolyBase nice for cases where I have a few tables, or where I have a specific number of known tables. If I want to have that–

Carlos:             Right, so yeah, you–

Kevin:              Oh, go ahead.

Carlos:             So you know what you’re going after, in a sense.

Kevin:              Exactly. Now if I want that flexibility of, I have 500 databases on here and each one has 200 tables and I might need to get any one of them, then a linked server’s probably your better bet, but I might question how frequently that comes up to where you’d need that at runtime.

Carlos:             Right. One of the big issues we’ve had with linked servers is just from a security perspective. You set up the linked server with a specific permission, now everybody on that box basically has access to that linked server. At the database level, in essence, you can– security by obfuscation, you have to have access to that database before you can use it. In theory, we’re not giving sysadmin to everybody, and you can help control security that way, as well.

Kevin:              Right, right. Now, with PolyBase, this is one of the things that is currently, I think, a pain point. And that is, the permissions necessary to query a remote external table tend to be a little high, because you need to be able to open a database master key.

Carlos:             Tada.

Kevin:              The reason you need to do that is, that’s where your credentials are stored for the external service. So, let’s say you’re connecting to Azure Blob Storage and you’re connecting to a public blob. In that case, you don’t need any special credentials, it’s just a table, it executes, life is good. But if you’re connecting to a remote SQL Server instance, we need to be able to authenticate. Again, this may change by the time this episode airs, but at present, SQL Authentication is the only method available to connect one SQL Server to another, so I need to store those credentials on my local instance. I’m going to encrypt those using the database master key, I’m going to open up the key and then write my SQL SELECT statement. Now, as an end user, I don’t necessarily need to know what those credentials are, remotely, but I do need to be able to open the database master key, and that’s going to require extended permissions. If I want to create an external table, I need very high-level permissions. So it’s not something that I would give to a business analyst or to a regular developer to start creating those tables. If you have people who do need them, then probably the best solution is creating stored procedures, within the stored procedure, you may have code to open the master key and then execute in the context of some authorized user, using a procedure sign in.

Carlos:             Right. Okay, so there’s still a need for us to get in there. There’s still a roll for an administrator, I guess is what I’m trying to say.

Kevin:              Yes, absolutely,  I’m not a huge fan of the way that this is currently set up, and I hope that it does improve, because that data virtualization experience, in part, I see it as something that is not just administrator-driven, where I need a DBA who has sysadmin access to be able to create this external table and walk through the regular process. I’d like to have something that’s a little bit more lower-level user-friendly, where I can create an external table, and I can do this work, and if I take down my server because of awful performance, it’s my own fault, but where I don’t need to have sysadmin, or granted control on the database to be able to do this kind of work.

Carlos:             So, my question is, where do you think this is going? We’ve seen some evolution where it was going to Hadoop clusters, with the desire to move into some of these other systems. Ultimately, we’re still talking about SQL Server. Maybe my question is, how do you see some of the other technologies, and we’ll use Power BI, potentially, as the example, of now being able to piggyback off of PolyBase?

Kevin:              Where I see it going, I’ll get to third-party tooling, but I think the more important area of where I see it going is massive parallel processing. So we talked about that just a little bit at the beginning with PDW and APS, but being able to take my SQL Server instances, and maybe I have 5 or 6 instances, and I want to work on a larger data problem that one instance, alone, might not be able to handle by itself. Being able to scale out SQL Servers, while keeping the integrity of my ACID processing, all of my constraints, all of the nice things that drive us toward relational databases. But being able to share out the load of a problem, this, I think, is the ultimate ideal for PolyBase, in my mind. That is, scaled-out, fully distributed, but still ACID compliant, working consistently, that’s the needle they need to thread. As far as integration with other tooling goes, Power BI is a little bit of a special case, in that you already can connect to half the world with it.

Carlos:             Right, that’s fair.

Kevin:              So, I don’t think you’re gaining anything, specifically, for Power BI. Even Reporting Services, you can connect to a bunch of different stuff, but if you have something that Reporting Services can’t connect to, or you have a third-party app that is only allowed to connect to SQL Server and maybe a couple other things, here you can use PolyBase as a way to expose data from a different system, pretending that it’s SQL Server, and then that third-party tool just continues along.

Carlos:             Sure, yeah, extend that tool a little bit.

Kevin:              Yeah, and this is most useful, I think, actually less in third-party vendor apps, and instead, more in custom apps that people have written. Where it’s a .NET shop, they’ve worked with SQL Server, so they know how ADO.NET works, they know how to query data from a SQL Server, but might not know how to query from somewhere else. And instead of trying to figure out all that process, you say, “here, I’ll create the external data source, I’ll create this external table. You just query it like it was just a regular SQL table and continue on with life.”

Carlos:             And the world rejoices.

Kevin:              Pretty much.

Carlos:             Awesome. Yeah, I can see that being very useful to that specific situation. Some cool stuff. Last thoughts on PolyBase?

Kevin:              Yeah, I would say, this episode comes with a timestamp, obviously. This is still under active development. Over the course of 2019 releases, we’ve seen, in pretty much every CTP, there have been additions or changes to PolyBase, even if they’re not in the release notes, and so we see that this is an area where Microsoft is dedicating a pretty good amount of time and I’m glad for that. So, that does mean that by the time this airs, there may be some really cool stuff that I didn’t know about, yet. Hopefully. I’ll cross my fingers for that.

Carlos:             There we go. Sounds like some feed for SQL Server in the News. Yeah, it’ll be interesting to see how this plays out. Which I don’t know that we made clear in the beginning, that a lot of what we were talking about is specific to SQL Server 2019.

Kevin:              Yeah, so get on the latest version.

Carlos:             Yeah, that’s right. Okay, should we do SQL Family?

Kevin:              Let’s do it.

Carlos:             Based on compañero feedback, right, the podcast that cares about you, compañeros. We normally wouldn’t do SQL Family, cause Kevin’s gone through the process, we’ve asked him questions, but the compañeros demand new questions of our guests. They want to know more about them, so here we’ve come up with some new thoughts, here. So, Kevin, your hometown? Now, that doesn’t necessarily mean where you were born, but where you feel like you grew up.

Kevin:              Oh, I grew up in Columbus, Ohio.

Carlos:             Well, that explains a lot. No, no.

Kevin:              Exactly. The classic cow town. So, my favorite cow town story: when the Columbus Blue Jackets were going to exist, the city of Columbus was granted an NHL franchise, there was a vote in the city on a team name. I voted for the Columbus Mad Cows. I would have been happy with that team. I’m okay with the Blue Jackets, it’s fine.

Carlos:             There we go, okay. What’s your go-to order at your favorite hometown restaurant?

Kevin:              Now I’m switching hometowns to where I live now. Notice how slippery I am on this.

Carlos:             There you go.

Kevin:              My go-to order, what I’m going to order in about an hour and 15 minutes is fried chicken white meat, spicy macaroni and cheese and fries, along with Cheerwine, the official drink of–

Carlos:             North Carolina.

Kevin:              North Carolina. Now, this is fried chicken that’s made by hand. It takes 20 minutes to make, which means I’m sitting there at this place. It’s Gas Station Fried Chicken, it’s the best fried chicken.

Carlos:             It’s scary how good that is, sometimes.

Kevin:              Oh, this is amazing. Honestly, this is the best chicken I’ve had in my life, and it’s in a strip mall gas station.

Carlos:             Nice, in Raleigh?

Kevin:              Durham.

Carlos:             In Durham. Okay, so what profession, other than your own, would you like to attempt?

Kevin:              When I was growing up, I wanted to be a cardiologist, until I realized that I have no hand-eye coordination, and that I don’t really like the sight of blood.

Carlos:             That’s a problem.

Kevin:              Yeah, I had to change my mind. Later on, went to school for economics, and thought seriously about a PhD in economics. If I could not be in this technical field, I probably would be in economics.

Carlos:             Makes sense to me. I would have thought economics, so math, I would have thought a little bit more math. I feel like–

Kevin:              It depends a lot on what area of economics you get into. You can get very heavily into math with econometrics.

Carlos:             That’s true.

Kevin:              The extremely high-level statistics, yeah.

Carlos:             So we could have a Nobel Peace Prize winner. But instead, compañeros, he is our co-host.

Kevin:              Probably not a Peace Prize.

Eugene:           Oh, my goodness.

Carlos:             Room, desk or car, which do you clean first?

Kevin:              Anybody who’s seen my car knows that I don’t clean it very much. The inside is nice. The outside, I just let the rain clean it. Probably room. I like walking around. I like having that open space, and stuff not in its place bothers me. Desk, I just have too much stuff going on at once.

Carlos:             Oh gosh, yes, I’m looking at my desk right now and I just sigh. Okay, very good. Well, as always, Kevin, thanks for being on our program, today.

Kevin:              It’s what I do.

Carlos:             That’s right, and we’re appreciative. And as always, compañeros, that’s going to do it for today’s episode. If you’d like to reach out to us on social media, you may. And where will you be, today, Kevin?

Kevin:              That’s a good question. I mean, you can find me anywhere, just go on your SQL Server.

Carlos:             There you go, and it won’t be a linked server.

Eugene:           But what’s the licensing?

Kevin:              It’ll be an external table.

Carlos:             Eugene?

Eugene:           The external table you can find me on is Twitter @sqlgene, or my blog sqlgene.com.

Carlos:             And compañeros, you can reach out to me on LinkedIn, and I have to say I do appreciate those of you who put in a message, rather than just connecting. Let me know what you’re thinking about. It’s always fun to chat with you guys. And thanks again for tuning in to today’s episode and we’ll see you on the SQL Trail.

3 Comments
  • SQL Data Partners Episode: PolyBase – 36 Chambers – The Legendary Journeys: Execution to the max! November 11, 2019 at 9:45 am

    […] Episode 180 is now out for all to hear, so give it a listen. […]

  • Thank you for the timely update on Polybase in SQL Server 2019. One point that I appreciate on the new version is that, up until 2019, there was no support for an external table to reference a SQL Server table in another database, even if the databases were on the same instance. This feature was only supported in Azure SQL DB. Now that both on-prem and SQL DB can use this feature I can consider adding the functionality to a data warehouse framework that I work on which has multiple databases and many cross-database queries. Not having parity between the versions prevented me from using Azure SQL DB, as I would have had to fork my project and make a different version for Azure SQL DB vs on-prem or Azure SQL Managed Instance.

  • Dew Drop – November 7, 2019 (#3067) | Morning Dew November 7, 2019 at 6:59 am

    […] SQL Data Partners Podcast Episode 180: PolyBase in SQL Server 2019 (Carlos L. Chacon) […]

Leave a Reply

Back to top