Big data is term we have been hearing frequently as of late and this might cause some concern for us in the SQL Server space. Microsoft has introduced some new functionality to help connect different data stores with PolyBase. We are happy to have Kevin Feasel from ChannelAdvisor back with us and Kevin will discuss some of the basics around what PolyBase does. We’ll be discussing a lot about integrations using PolyBase specifically on Hadoop and Azure SQL Blob Storage. We also touch on some of the technologies that will be supported in the future.
For those looking at implementing both structured and unstructured data stores, PolyBase will be a way to help bring these environments together. Kevin gives us a great overview and we think you will enjoy this episode.
“PolyBase gives you this integration and it’s opening this door to possibly getting rid of link servers.”
“PolyBase simplifies that a lot for us by making an assumption that there is a consistent definition for each row.”
“Learn something new… You learn something the first time, you can learn something again.”
Listen to Learn
– What is PolyBase?
– Technologies supported by PolyBase
– PolyBase integration with different data sources
– Some thoughts around which teams are going to own which pieces of the project
– How Hadoop integrators are responding to PolyBase
About Kevin Feasel
Kevin is a database administrator for ChannelAdvisor and the leader of the PASS Chapter in the Raleigh NC area. Since he was last on the podcast, Kevin has been awarded the Microsoft MVP and will be a speaker at the Compañero Conference. He also enjoys old German films.
*Untranscribed introductory portion*
Kevin: My name is Kevin Feasel, I am a Data platform MVP. I’m also a manager of a predictive analytics team here in Durham North Carolina. I’m extremely pleased to be able to speak at Campanero Con, even though I can’t pronounce it. I’m going to be speaking on a couple of topics one of them is Security and really getting an understanding of that network security, getting an understanding of what a database administrator can do to help secure a SQL server instance. I’m also really looking forward to talk about a big data solution basically how do I get started in that. I’m a database administrator, I’m the only database administrator at this company and somebody is coming to me talking about big data, where do I start? What do I start looking at? What actually is the benefit? What kinds of workloads would work well under this and which ones don’t? And getting some of the ideas of what’s happening in the industry and seeing how this different technologies are evolving and turning into a full ecosystem. Finally, showing how that ecosystem some integrates with SQL server.
Carlos: Kevin, our all-time podcast episode extraordinaire. Welcome back for another episode.
Kevin: Thank you! It’s good to defend the title.
Carlos: Yes, thank you for coming and talking with us. One of the things and one of the reasons we continue to have you on is you’re doing lots of different interesting things, and as database administrators we’ve been hearing this for a little while this idea of big data and it’s kind of been at the door. Lots of people even from a past perspective, they’ve open the doors to analytics to kind of join those two worlds. But for a lot of us it’s still kind of an unknown entity, it’s different technology and we think that we have something here that will kind of save the day if you will in the sense. And so our topic today is PolyBase and we wanted to talk with you about it, you’ve been talking about it, and presenting on it and working with it, so why don’t you give us the tour of PolyBase? What is it and why would we be interested in it?
Kevin: Sure, here’s the nickel tour version. PolyBase initially came about, I believe it was actually first introduced in 2010, so it was part of SQL server parallel data warehouse edition which later became APS otherwise known as Extremely Expensive edition. Enterprise is expensive, PDWAPS extremely expensive, in SQL server 2016 this was brought down to the masses or at least the masses who could afford Enterprise edition. It’s been around for few years but 2016 feels like first version for the rest of us who didn’t have a chance to play with really expensive hardware. The concept of a PolyBase does at a really high level is it allows you to integrate with other data sources, so before people start thinking, “Oh no, it is link servers all over again.”It’s not links servers. It’s not that bad. So as of today PolyBase supports a few different links where you can connect to a Hadoop cluster. We can connect to Azure blob storage and we can use PolyBase to migrate data from Azure blob storage into Azure SQL data warehouse. At PASS Summit 2016 there were a couple of interesting keynotes where they talked about expanding PolyBase beyond Hadoop and Azure blob storage looking into Elasticsearch, MongoDB, Teradata, Oracle and other sources as well.
Carlos: Wow, so basically we’re going to have the ability through a SQL server Management Studio to be able to interact with, move data to and from all of these different systems that you have mentioned?
Kevin: Yes, and be able to query it using just regular T-SQL. So when you create this table you create what’s called an external table. It’s a concept that lives on the source server like the Hadoop cluster. The data is over in Hadoop but when you query that table select star from my external table it’s going to go over, request data from Hadoop cluster, and pull that data back in the SQL server where you can treat it like it just came from a local table.
Carlos: Got you, so now is it going to store that like a time bases so that you know, I run my select star and then 10 minutes later Steve runs his. Is it going to pull that data back over again or there’s some management now that we have to think about because the data is now on my SQL server?
Kevin: So the data doesn’t really get persisted to SQL server. It’s an external table meaning that it will live in the blob storage or on your Hadoop cluster. The mechanic that PolyBase uses to allow this to work is it will pull the data into SQL server into a temp table but it’s not a temp table that you should know about as a developer of a T-SQL query. It’s kind of like behind the scenes temp table that then acts as the table that you’re going to query against, so you query DBO.myexternal table. Behind the scenes there is a secret temp table and it has the form and structure of that external table, data gets pulled in, collected and then processed as though it were local. But once it’s done it’s gone.
Steve: So then that process sounds very similar to sort of underlying workings behind when you run a query over a link server where it issues a command on the other side it brings the results back, it’s basically stores them in a hidden format so you can use them and the rest of the query, locally. And I guess, I mean I’m just trying to understand the correlation there. Is there a big difference on how that’s been done versus the link server?
Kevin: So there is one major difference and that is the concept of predicate push down. So the idea here is let’s say that I have a petabyte of data in this Hadoop cluster and petabyte of data in this folder, I want to be able to query, I’m sending a query that maybe I just want a few thousand rows or I want to aggregate the data in such ways that I don’t get a petabyte back, I just get the few thousand rows I need.
Carlos: Hopefully, because if you’re turning a petabyte of data you’re going to be in trouble.
Kevin: Yeah. I don’t have a petabyte of data on my SQL server instances. So I write this query in my WHERE clause, maybe I do summations, GROUP BY’s, HAVING’s. All of that predicate will get sent to the Hadoop cluster and on the Hadoop cluster side, PolyBase instigates a MapReduce job or set of MapReduce jobs to perform the operations that you wrote in T-SQL. It generates all of the jobs. It creates the data set that comes back and gets pulled into SQL server. So the link server if I were doing a link server to an another SQL instance, well another SQL instance is a special case, but if I were doing it to Oracle, I have to pull the whole data set back or from querying out to Hive I have to pull the whole data set back and then any filters get applied. So predicate push down is what lets you get back the rows that you need, only the rows that you need and gets around that whole links
server problem where, oh yeah I’m querying a billion rows I’ll see you tomorrow.
Steve: Sure, very interesting. I’ve heard some people speculate that link servers are dead or will be going away because of what we can do with PolyBase. Do you think that that’s a fair assessment?
Kevin: I am crossing my fingers hoping that this is so. As soon as they announced at 2016 PASS Summit what PolyBase is going to do in the future, I got really excited because I thought about, “Wait, what if I could connect to another SQL Server instance.” And there is one extra bit of PolyBase that I haven’t talked about yet. That is the concept of head nodes versus compute nodes. This concept in massive parallel processing that you have a head node, this is the orchestrator, this is the server that knows what queries are supposed to come in and out, and then it passes off details to different compute nodes. In Hadoop you have a name node and you have a bunch of data nodes. Over in PolyBase there is actually a similar infrastructure, so there is a head node, that is your SQL Server instance, must be Enterprise edition, and it controls the jobs. But you can to different compute nodes. They call it scale up cluster. These are Standard edition SQL Server instances that I can have sitting there doing work connecting to this Hadoop cluster to the different data nodes on the Hadoop cluster, pulling data back and getting aggregated data back to my head node. So unlike a link server where I have to pull all the data over to my one instance I can now have several PolyBase servers getting data, aggregating it locally, sending it up to the head node, sending up that aggregated as fine as they could data back to the head node where the head node finishes aggregation and presents to the end user the result.
Steve: Very interesting.
Carlos: Yeah, kind of scale out approach. Now I guess at this point it might be worth kind of going back and talking about some of the things that I need to put in place. Now you mentioned kind of this architecture perspective, I have Enterprise version I can have Standard versions but let’s just scale it down a little bit. I just have one node and I want to start using PolyBase. What are some of the things that I need to create or steps that I would take to in order to set that up?
Kevin: Ok, so let’s take the easiest example, that’s connecting to Azure blob storage. On my SQL server instance, I have to install PolyBase. That’s part of the setup; there is a little checkbox you can select. But in order to install PolyBase you must install the Oracle Java Runtime Environment.
Carlos: Yes, I cheated, and I was looking at the docs here and I saw that and I thought, “What in the world!” It’s like sleeping with the enemy, right?
Steve: So just to recap then, if I want to query against Azure blob storage with PolyBase when I install SQL server I need to also install, and again you get this as part of the install, but Oracle components for the Oracle Java Runtime.
Kevin: Correct. So you install those. There is a couple of configuration steps that are involved like there is a setting in SP configure that allows for external queries. Turn all the stuff on. There are configuration guides that can help you with that. Once you’ve got that done what you do is you create three things. The first thing that you want to create is an external data source. So the external data source says, this is what I’m connecting to, this is the resource location. If I’m connecting to Azure blob storage there is a type, actually I think for Azure blob storage you just use a type of Hadoop. If you use a Hadoop cluster you just use a type of Hadoop. If you’re writing in Azure elastic scaling query, there is a different data source type for that. But that’s little bit beyond my kin I haven’t written those yet. Ok, so you create this data source. This data source says, over here is the WASB address, the Azure blob storage location of the folder or file I’m interested in. So, actually let me, I may have to rephrase that because I’m now looking at the, opps. Ok, let me, sorry Julien, going to have to cut this part just a little bit because I just said something wrong. Now, I could just keep going and make it
sound like I’m an idiot. That wouldn’t be the first time admittedly, but.
Carlos: No, that’s fine. We’ll make it right.
Kevin: Ok, so let me start over. So the next thing that you do after you’ve configured PolyBase is you want to create an external data source. For Azure blob storage we create this external data source that points to the WASB address of your blob storage container location. So you’ll point to the container and the account name.
Steve: URL right?
Kevin: Yeah, that is WASB or WASB[s] address. It’s an Azure blob storage location. You’ll include your credentials because if it’s a secure blob you’ll need to pass in credentials. So you create this data source. The next thing you want to do is you create a file, so an external file format. That file format says, any files that are on a source that I specify, any files are going to follow this format. There are a few different formats. One of them is delimited text, so just text maybe colon delimited or semi-colon delimited or however you’ve delimited your text. You can use other formats as well. I would recommend starting out just use delimited text that is easiest to understand. You can grab a file and look at it. But when you start asking about better performance, one of the better formats is ORC, which is row columnar format that high views the store data. So it’s much more efficient for querying especially aggregating data but you can just use flat files.
Carlos: So knuckle-dragging Neanderthal that I am like how am I supposed to choose what kind of file that I need to use. Is there like a, if I’m going to be, I don’t know anything about Hadoop, how would I choose that?
Kevin: Yeah, absolutely, so knuckle-dragger, delimited file. Keep it easy for yourself. Once you get passed that, once you kind of get passed the doorway and you say, ok now how do I get do better? You have to think about whether your data is more of aggregation like what you would find in a warehouse table. In that case, I would use ORC. If I’m storing the data and it’s more of a row store style data I would use Parquet. There are a couple of other formats as well but those are the main two that really supported within PolyBase.
Carlos: Well now, so in that determination, so again I’m going to use the delimited file. I start, I don’t know, three months in, right, I start writing queries. There are processes that I now have in place and I decided, “Hey, I think I can do better. I want to change the format.” Am I going to have to like start redoing my queries or what’s all involved if I wanted to change that format down the line.
Kevin: Great question. What you would have to do? Let’s say you have delimited files now. You’ve created an external file format that’s of delimited type. Later on you say, well I am actually storing this as Parquet, so you create an external file format that’s Parquet. And now we get to the last portion of a PolyBase external table so the table has a two part name. It looks like any other table when you query it dbo. something or maybe external.mytable. You have the column definitions so all of the attributes in your table and at the bottom, the part that’s little a different is there is a WITH clause and inside that WITH clause you specify the location of your data, so those would be the specific file or folder that you want to point to. The data source and the file format.
Carlos: Got it. So when I wanted to do a new, if I wanted to change file formats I’m creating a new external table.
Kevin: Yeah or you just drop and recreate the one that’s there. The external table doesn’t have any data. It just has some metadata around it. So if you have a few second downtime you can drop that table, recreate the table, use the new format, maybe point to a new folder that has the data in a different format. All the nasty work of converting those files getting them into the other format, yeah you still have to do that stuff, but you can do that as a back fill process or you can do that kind of off to the side and just switch when you’re done. That way you don’t have to update any of your procedures or any calling code.
Carlos: Got you, ok, so that’s nice.
Steve: So then when you say the external file doesn’t really have anything more than just for definition there. That’s the definition that sits on your SQL server that’s defining where it’s going to go and get that data for instance out of Azure blob storage. So it’s really just a pointer off to that data and you’re switching it around and if you point it to a different format file you have to give it a format type appropriately.
Kevin: Yeah, so the external table, yeah it’s just metadata. It’s just some basic information.
Steve: Ok, so then with that it’s pointing to a file in Azure blob storage and can you just start out with an empty file and then start filling in with data from there or does that file in Azure blob storage have to have been created somewhere else to meet those formats?
Kevin: That’s a really good question. So you have the ability to insert data into blob storage or into Hadoop. There is another configuration option you have to turn on to allow for inserting and once you do each insert operation you do will create some number of files in blob storage or in Hadoop. So you have to have a folder as your right location. But every time you insert maybe you’re inserting once a month, you’re taking last month’s financial data, all the individual transactions and you’re writing it over to blob storage for long term storage. That insert generates 8 files over in Azure blob storage and then the data is there. You can query it just like it was always there. But you cannot update that data from PolyBase, you cannot delete that data from PolyBase.
Carlos: Interesting, so now obviously it’s going to vary from place to place but for me a setup perspective let’s say, right, so again I’m the only database administrator in my organization or I’m not familiar with Hadoop or these other. Well, I guess when the other databases get on boarded then there will be more access, right? But when I think from a big data perspective generally there’s going to be another team, maybe a vendor comes in, installs Hadoop, starts loading data, things like that. What are we as database administrators, were going to create all of those components that you just talked about, are the Hadoop vendors familiar with PolyBase? Are we talking the same language here or is this still something kind of a very SQL server centric idea? Does that make sense?
Kevin: I would say that vendors, they’re not really going to know a lot of the PolyBase details. They’re probably not going to be familiar enough with PolyBase itself to do it. I’ve had some discussion with people who worked at Hadoop vendors and they’re very interested in the concept but there is not a lot of internalized information around there. These are typically not people who spend a lot of time in SQL server, with SQL server so they don’t necessarily know how it works, how to set it up, what the positive and negative aspect are, how you can shoot yourself in the
Carlos: Well, so speaking of that so what are the ways we can shoot ourselves in the foot?
Kevin: Oh, you have to go and ask that. There are some assumptions that are built into the way that PolyBase works today. This is not a critique of the PolyBase team, of the developers, of the PMs. This is not at all a critique aimed at them. I like you guys still, don’t worry. One issue that you can run into is let’s say you have just text data and your file has new lines in it but the new lines don’t represent new lines of data. Maybe it’s a free form text field where a person typed in new lines to symbolize a new paragraph. Well, PolyBase doesn’t understand this idea of ignore new lines unless I told you that it’s a new line. It will just pick up that new line and say, oh yeah you got a new line here.
Carlos: A new record basically.
Kevin: Right. There are some assumptions that are built in. You can also burn yourself by defining your result set so you create that external table and maybe you define a value as an integer. Well, if the value comes back as a string because some of the data is malformed coming in then those rows will be rejected as they should. So you’re going from a non-structured or a semi-structured system into a very structured system in SQL server. That semi-structured system is ok with you throwing whatever garbage you want into this file but you have to define structure when you pull this out. Historically, on the Hadoop side that structure was defined in the mapping in the reduction phases, so MapReduce. It was defined by the developer putting together the data in such a way that the developer understood what this data point signifies. PolyBase simplifies that a lot for us by making an assumption that there is a consistent definition for each row, so we say an integer age is the first value. Well, it’s going to assume that there is an integer value there and it’s going to populate age with that. If maybe every 20th row we have something that’s totally different. Maybe instead of age it is eye color because something weird happened with our data. Well, every 20th row gets rejected. The way you can shoot yourself in the foot, let’s go back to you have a few billion rows of data that you want to pull over. Maybe you want to get just everywhere were the person is exactly 14 years of age. So you’re scanning through this data and every 20th row instead of it being integer age it’s actually a string. Every one of those rows gets rejected. There is a cutoff for the number of records that you are allowed to reject before just failing a query. That cutoff can be 0 or it can be as many as you want. It can be percentage or a numeric value. So let’s say 1 billion rows and you have a cutoff of 5,000. You’re going to go through quite a few records to get 5,000 rejected rows. Once it’s done, once rejection happens, once failure occurs the entire transactions roll back and you don’t get to see the data that was there already. It’s roll back. There was an error.
Carlos: Oh, got you, that’s right, yeah.
Kevin: So you may be sitting there for an hour waiting for this data to process and it comes back and it fails.
Carlos: Yes, so you might almost think about in a sense, again not try to discount Hadoop. At least in my mind a knuckle-dragger that I am, I think about that almost like an Excel file, right. I want to load that into something that it can accept it and then let me take care of finalizing any of that and look rejected rows and things like that. Almost like an ETL process, right?
Kevin: Sure. This is a fairly common pattern in the Hadoop ecosystem as well where; ok we have a raw data coming in. It’s there we put it into the data lake. So ideally the data lake has a lot of nice clean data in reality it’s more like a data swamp. It’s where you throw in a bunch of old stuff. You got mattresses in there, just all kinds of dirtiness.
Carlos: Fish with three eyes.
Kevin: Yeah, exactly. And so you pull that stuff out and you try to clean it up in some
process. Usually it’s going to be a Hadoop process. Maybe that’s a spark job, MapRecuce job that scrubs this data, tries to give it some symbol of sense and then writes it out to another directory where it’s more of a structured format. In that way you can read it in Hive which is SQL for Hadoop. You can read it with SparkSQL, SQL for Spark, or you could read it with PolyBase, SQL for SQL.
Carlos: Got you, so that kind of almost goes back or takes me back to that idea again of, kind of that who’s working with who type idea, and it almost sounds like if we wanted to we could push some of that to like hey guys can we work on this MapReduce. Is that a fair question to say, hey can we work on this that when the data comes back it gets cleansed before I see it? Or is that still kind of, you know, I need to as a SQL server person assume all responsibility for that kind of thing?
Kevin: I think that depends on your environment. It depends on relative levels of familiarity. But personally my expectation would be that if you are say using SQL server as the engine to see final results, then I believe that it makes perfect sense to ask the people on the Hadoop side, “Hey guys give me the data in a format that I can pull it easily.” So for example, maybe we are reading a lot of data coming in from IoT devices. We have Kafka setup. Kafka’s a big distributed message broker. It’s a really fascinating thing and we’re getting tremendous numbers of messages that are streaming in to our Hadoop cluster. We’re cleaning up those results, we’re storing the results and maybe we have some aggregations that we’re doing to show hourly results by device type. And then load that data in to a file that PolyBase can read. As part of an ETL process you may pull that data over the SQL server, Persistent SQL server. So query like SELECT FROM your table INSERT into the real SQL server table, and you’re keeping a smaller streamlined data set that you can use to populate a PowerBI Grid or that you can use to populate web application. In that scenario, personally I’d argue that yeah the Hadoop side people probably should be doing most of the cleanup work. If you are both sides, it becomes more a question of well what am I more comfortable doing, like sometimes if the data’s relatively clean to begin with, or if we’re willing to accept a certain level of failure, take it, bring it to the SQL server, I can do really cool things in SQL server.
Carlos: So it kind of goes back right to the adage of knowing your data, right?
Carlos: Being familiar with it and then making a decision based on that.
Steve: So then back to that example with the age and putting that into integer column in the table definition, do you see that, I mean, there’s lots of things that could be valid for ages in there. You could have 6 mo. to represent someone who’s six months old but then obviously when that gets pulled down and try to go into integer, it’s got text data in there and it’s not going to work. So do you find that people sort of shy away from those restrictive types in their table definitions or maybe just leave it as something that’s more open like a varchar max or something like that? Or do find that people go through the battle of cleaning it up or filtering it ahead of time?
Kevin: Unfortunately, probably more the former. It’s more of, well it’s a string, every string works so we will pull that in as a string and then we’ll clean it up here. That is a downside where with a lot of ETL, through ETL tools I can take a data element, I can make decisions based off of what that element looks like, like 6 mo. I can do a substring, I can parse out, is there a MO or YR or some known value here, and use conditional logic to convert that into something that is consistent across the board. PolyBase isn’t going to give you that. It’s going to give you the easy way of pulling data but yeah that, it means, it doesn’t do the transformations for you.
Steve: Okay. So another area that I’ve thought a little bit about is that and I know sort of jumping back to the whole link server example is that when you’re running a query in sort of old school link server, whatever’s going on in the other side really gets hidden from execution plans. It’s just blindly calling something on the other side across the link server and your execution plan doesn’t give you any details other than it was waiting on something on the other side. Now, is there an option for seeing execution plans when you’re using PolyBase to get a better understanding of if a query’s taking some time, maybe where that’s time is being taken on when it’s connecting out to Hadoop for Azure blob storage?
Kevin: Yeah. The short answer is yes. The long answer is yes if you look at the XML. So you look at the query plan XML, it will give you some details including there’s a remote query which is XML inside of the XML. So you have to deserialize the XML, decode the XML, and you’ll be able to see what the remote operation looks like. So it gives you a few indicators of what’s happening. It’ll show you the individual steps. Also, there are several dynamic management views that are exposed for PolyBase. And those DMVs will show you a lot of the same information. They’ll show you the individual steps that occur for this MapReduce process or for the data retrieval process.
Carlos: So I think very interesting topic and we’ll let you give last thoughts here but one of the things that I feel, that I’m confident about or happy about is that while there’s still some unknowns here, right? Having the Hadoop, you know, in my environment or being able to connect to it, Azure blob storage, all these other things that are coming down the pipe, at least it’s going to be, I have a tool that I can do or integrate with some of these things on my own turf. And it’s not completely foreign that I have to go and, you know, pickup new technologies right away.
Kevin: Yes. That’s how I’m thinking of it. This is why I like it so much. This is why, honest I think this was the best feature in SQL Server 2016. A lot of people are going to say query store is the best feature. Query Store is an awesome feature but PolyBase gives you this integration and it’s opening this door to possibly getting rid of link servers. It’s opening a door to distributing queries, distributing those really expensive SQL server queries. Kind of like what you do in Azure SQL data warehouse, hoping that maybe we get something like that locally.
Steve: So I know you talked about how PolyBase is perhaps one of the best features in SQL server 2016. I know that SQL Server 2017 community technology preview too I believe just came out recently. And is there anything that’s in there new with PolyBase that you know about?
Kevin: Nothing new with PolyBase.
Carlos: Got you.
Kevin: There’s a whole bunch of really cool stuff I’m excited about but.
Carlos: The fair question to think or assume but it will be supported in Linux version as well.
Carlos: Because it’s a core feature if you will, I know they’ve been working and talking with Travis, the PM over there for the Linux migration. That’s what they’ve been trying to accommodate. Again, listening to the AMP conference or event or whatever it was called. They did mention some additional
functionality that would be in the Linux version. I don’t remember them specifically calling up PolyBase but, you know, I had to imagine that it will be there even if it’s not there on day one.
Kevin: The answer that I think is safe to give is in today’s CTP, CTP 2 for SQL on Linux, there is not PolyBase support but there is no reason that PolyBase cannot be there.
Carlos: Got you. There you go. But again well we did mention that this ultimately Enterprise only feature, right?
Kevin: Yeah, for the head node it has to be Enterprise edition. I think even with the SQL Server 2016 SP1, I think it still required to be Enterprise edition for the head node.
Carlos: Okay, got you. Yeah, I feel like that PolyBase was in the list of things that they made available in the lower editions but I’m not sure if that includes the head node or not.
Kevin: Yeah, I know that the compute node was available in Standard edition but I’m not sure.
Steve: Yep. So given that it’s been a little while since 2016 came out, around a year roughly, and with PolyBase sort of been mainstream available since then, do you see that a lot of people are actually adopting this and using it in production environments or do you see more people just sort of experimenting and trying things out at this point?
Kevin: It’s more of experimentation. I don’t know of many companies that are doing it. The way that I would put it is okay well you have to have SQL server 2016 which already cuts out large slice with companies. You have to have Enterprise edition and you have to have Hadoop cluster or you could use Azure Blob Storage and get value out of that way, but this is going to be a fairly narrow segment of the population even today.
Carlos: Got you. Yeah, make sense.
Steve: Well perhaps after this podcast more people will give it a check.
Kevin: Yeah, I hope so.
Carlos: That’s right. Compañeros if you are using PolyBase after what you’ve heard here today, I want to know about it. We’re going to report that to Microsoft. Let them know you heard it here first folks. Okay, so I know you’ve been on the show here before, Kevin, but we’re going to still go through SQL family.
Carlos: Can we do it?
Kevin: I think so. I may make up new answers.
Carlos: Well would you have a couple of new questions that I think that have changed since last time you’re an individual guest so.
Carlos: Okay. So the first question is how did you get started with SQL server?
Kevin: I got started as a Web Developer. It was about a decade ago and I was an ASP.NET web forms developer. It was my first real job, so I was the person who was least afraid of databases. I’ve written SQL queries before and we had a need for database administration so I.
Carlos: How hard could it be?
Kevin: Yeah, pretty much. Like hey why is the server failing? Oh it’s ’cause it’s not a disk space.
Carlos: There you go, and now you know the rest of the story.
Steve: So if you could change one thing about SQL server, what would it be?
Kevin: That’s a good question because everything that I think of tends to happen which is really cool, I like that. So last time around I said I want PolyBase to support Spark, and I’d like to see that happen still. I’ve wanted Python support for machine learning within R services which is now machine learning services. And we just got that so that’s really cool. The thing that I want most right now is a really good client for Linux. So I want Management Studio for Linux or something Management Studio ask for Linux that does maybe like 70% of what SSMS does.
Carlos: Interesting. In all flavors of Linux or do you have a particular flavor that you’re interested in?
Kevin: I’m kind of okay with pretty much any flavor. I mean you can get it to work. Nowadays, I use Ubuntu or Elementary a lot. Previously I’ve done a lot of Redhat. I go back to Mandrake for people in the know.
Steve: Right. Yeah, I know recently we heard that, what was it SQL command, was going to be available on the Mac and that was a big move. And I think we’re a long ways off from Management Studio being on other platforms. But who knows, I
could be wrong there.
Kevin: Yeah. I’m looking forward to whatever they are able to provide.
Steve: No, I know that’d be certainly cool.
Carlos: Although, and we do have request into the PM for SQL Server Management Studio. We haven’t quite been able to get them on the show just yet, but when we do we’ll ask them that question.
Kevin: Put them on the spot.
Carlos: That’s right. Okay, so best piece of career advice you’ve received.
Kevin: I’m going to flip this on its head, best career advice I can give.
Carlos: Well, here we go.
Kevin: Learn something new. Especially if you’re in a shop where you’re on SQL server 2005, take some more of your own time. Learn something new. It doesn’t matter that much what it is but expand out just a little bit. It could be features, it could be new versions of SQL server, it could be let’s learn a new language, let’s learn a new part of the stack. But don’t get caught in this one little part that you find out someday oh look your job has been animated away and you lost all of those skills to learn. You learn something the first time, you can learn something again. So that would be my advice.
Carlos: And that is why we’re going to have you as a speaker at the Companero Conference. So folks if you want to hang out more with Kevin and learn all of his wisdom, you can come to the conference and hang out with us.
Kevin: Wisdom and $5 gets you a cup of coffee.
Steve: And on to our last SQL family question, if you could have one superhero power, what would it be and why would you want it?
Kevin: We’re getting close to episode 100. Nobody else has ever answered that this way. I want phase walking. I want the shadow cat kitty pride be able to phase through walls, phase through objects. Nobody else has answered that so either I’m completely insane and picking the wrong power or I’m the head of the curve. I’ll let the audience decide.
Steve: Or it could be you’ve just answered the question several times before as well and you’ve had more time to think about it too.
Kevin: That is also possible.
Steve: Alright, very good.
Carlos: Awesome, Kevin. Thanks again for stopping by. We always enjoy it.
Kevin: I’m glad to come here.