Episode 183: SQL Server Big Data Clusters

Carlos L Chacon

SQL Server 2019 is out and one of the most interesting additions is SQL Server 2019 Big Data Clusters. In this episode, Kevin Feasel covers the marriage of SQL Server and Apache Spark. We discuss what Big Data Clusters are (and how they are not a SQL Server feature or an edition of SQL Server, but a thing in themselves), cover some of the architecture behind the solution, and explain how we can use them.

A scalable compute and storage architecture in SQL Server 2019 Big Data Clusters – James Serra

Episode Quotes

“I would probably start by pointing out one of the bigger use cases for big data clusters that kind of makes all the pieces fit, and that is, I need a data lake, but I don’t want to put my data in Azure or in AWS.”

“When Microsoft ported SQL Server to Linux, it made a lot of these Linux-based applications now, all of a sudden, available, or more available.”

“We have one extra pool called the compute pool. These things are our helper nodes. An end user is never going to connect to them directly. An end user isn’t going to care that they exist. They only work to make queries faster.”

Listen to Learn

00:38     Intro to the team and topic
01:41     Compañero Shout-Outs
02:02     SQL Server in the News
02:47     What do big data clusters have to do with data virtualization?
04:20     Circumstances under which you would want to not move your data
07:40     The time- and effort-saving benefit to big data clusters
10:03     Why big data clusters isn’t just Microsoft jumping the shark
16:00     Architecture concepts – don’t look at the diagram if you’re driving
21:57     When you’re spreading the fact out, you’re still paying for some data movement
23:46     There was a reason for Java interop
25:22     Will the Fonz jump over a shark next time?
27:58     Ben Weissman and Enrico van de Laar are working on SQL Server Big Data Clusters Revealed
29:35     Last thoughts on big data clusters
32:14     Closing Thoughts


Music for SQL Server in the News by Mansardian

*Untranscribed Introduction*

 Carlos:             Compañeros! Welcome to another edition of the SQL Data Partners Podcast. My name is Carlos L Chacon. It is good to be with you again, today. I have, on the podcast today, Eugene Meidinger.

Eugene:           Hey!

Carlos:             And Kevin Feasel.

Kevin:              Howdy.

Carlos:             Kevin will be talking with us today about SQL Server Big Data Clusters.

Kevin:              Yeah, it’s weird being on this side of the microphone again.

Carlos:             Yeah, that’s right. If it ain’t broke, don’t try to fix it.

Kevin:              In the giant SQL Data Partners studio, which is like tens of thousands of square feet, it’s amazing. You should see the green room. I’m always on the other side.

Eugene:           He rented the old Amazon warehouse that was in Coffeyville, that they abandoned.

Kevin:              Yeah, it’s a bit of a drive to get there, for us, but you know what, we’ll do it.

Carlos:             Only the best for you, compañeros, and the speakers. Man, all of a sudden, my little office cubicle here, I feel very restricted all of a sudden. Okay, so this will tie– we’re going to have lots of tie-ins to our previous episode, which all of a sudden, I can’t remember the number. But two episodes ago, on PolyBase, but we’ll get to that in just a moment. So first, we do want to give a couple of shout-outs. One to Jim Staley for connecting on LinkedIn, so thanks, Jim. And the other, Mel Vargas. So this is a long time coming, Mel, and I apologize for the delay. We sent Mel a SQL Data Partners t-shirt and he is rocking that t-shirt and put a couple of pictures up on social media. Looking good, Mel, we appreciate that. So, in terms of SQL Server in the News. One of the important pieces in relation to this conversation is that we now have a release candidate for SQL Server 2019. Now, this ties directly into our topic, because big data clusters is ultimately a 2019 feature. And so, we should see some hardening between what we’re going to talk about today and what’s actually in the release candidate. There may be some changes or some movement, but hopefully what we’re talking about is what we’ll get in the release candidate, or when it actually releases, rather. Okay, the show notes for today’s episode are going to be at sqldatapartners.com/bigdataclusters or sqldatapartners.com/183. Okay, Travis Wright, one of the program managers at Microsoft, this was just after the Ignite Conference in the Fall of 2018, over a year into this already, but he talks about this new architecture that will be released over SQL Server that “combines together the SQL Server database engine, Spark, and HDFS into a unified data platform called a ‘big data cluster’.” So, earlier we talked about PolyBase, the concept of data virtualization, so how does a big data cluster– how does that expand the notion of data virtualization for us?

Kevin:              What it does is open up a bit more in the way of massive parallel processing with SQL Server. With PolyBase, something we didn’t really cover that much in the previous episode on PolyBase, we have this concept of scale-out groups, where I have a head node or a control node or whatever you want to call it. It’s the driver, it’s in charge of things, and you have a bunch of compute nodes, which in this case, are other SQL Server instances that are going to do the bidding of that control node. So for PolyBase queries, things that are virtualizing data from external sources, we have this capability of parallel concurrent operations from these compute nodes. Big data clusters is a way of formalizing this concept and expanding it out a bit further.
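The scale-out pattern Kevin describes can be sketched in a few lines of Python. This is a toy model, not PolyBase itself: `scan_partition` and `head_node_query` are hypothetical names, threads stand in for the compute nodes, and a list of numbers stands in for the external data.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_partition(rows):
    """Work done by one compute node: here, sum the values in its slice."""
    return sum(rows)


def head_node_query(external_rows, compute_nodes):
    """Split the external data into one slice per compute node, run the
    slices concurrently, then combine the partial results on the head node."""
    slices = [external_rows[i::compute_nodes] for i in range(compute_nodes)]
    with ThreadPoolExecutor(max_workers=compute_nodes) as pool:
        partials = list(pool.map(scan_partition, slices))
    return sum(partials)


total = head_node_query(list(range(1, 101)), compute_nodes=4)
print(total)  # 5050
```

The caller only ever talks to `head_node_query`; how many workers sit behind it is an infrastructure detail, which is the point of the head node and compute node split.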

Carlos:             Right, and so all of a sudden, we’re starting to put together lots of pieces, even technologies, if you will, when we start thinking about big data clusters, and they have plenty of scripts and then the piece that I’ve looked at as far as an implementation perspective, they’ve tried to make that as straightforward as they can. But talking about massive parallel processing, things like that, one of the things that gets involved is Kubernetes and containers.

Kevin:              Or as Wilford Brimley pronounces it, Kuberneatez.

Carlos:             There you go, so depending on your pronunciation guide. And so you have to put together some of these different pieces. So, I tend to think of that as increasing the complexity in some cases, and so I think maybe it would be worth revisiting or talking about under what circumstances we would want to not move the data, so I guess, kind of going back to data virtualization versus data movement. We talked a little bit about data warehouses and how I’m going to ETL that stuff in there. What are some of the scenarios under which I might just want to virtualize that data and not have to move so much of it around?

Kevin:              Typically, we’re going to want to virtualize data when we have some canonical source system outside of our main platform. I may have an Oracle server that handles this transactional data for invoices, and I want to be able to access that data, but I don’t necessarily want to keep moving it. I don’t want to have to manage Integration Services packages or have to manage those processes. I want a relatively easy-to-maintain solution for situations where data changes frequently, in terms of structure. So, if I have a streaming application and I’m regularly changing what attributes are going to show up, then it’s usually easier for me to virtualize, because if the attribute set changes, I can drop and recreate the external table and that’s done immediately. Versus recreating an Integration Services job, where VS needs new metadata.
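To make the drop-and-recreate point concrete, here is a small Python sketch that regenerates external table DDL from a list of attributes. The table, column, data source, and location names are made up for illustration, and the generated text only follows the general shape of T-SQL's `DROP EXTERNAL TABLE` and `CREATE EXTERNAL TABLE ... WITH (LOCATION, DATA_SOURCE)`; check the documentation for the options your particular source requires.

```python
def external_table_ddl(table, columns, data_source, location):
    """Build the statements that replace an external table definition.

    columns is a list of (name, sql_type) pairs; every value passed in
    below is a hypothetical example, not taken from a real system.
    """
    column_sql = ",\n    ".join(f"[{name}] {sql_type}" for name, sql_type in columns)
    return (
        f"DROP EXTERNAL TABLE {table};\n"
        f"CREATE EXTERNAL TABLE {table} (\n"
        f"    {column_sql}\n"
        f")\n"
        f"WITH (LOCATION = '{location}', DATA_SOURCE = {data_source});"
    )


columns = [("InvoiceId", "INT"), ("Amount", "DECIMAL(18, 2)")]
# A new attribute shows up in the source: add it and regenerate the DDL.
columns.append(("DiscountCode", "NVARCHAR(20)"))
ddl = external_table_ddl("dbo.Invoices", columns, "OracleSales", "XE.SALES.INVOICES")
print(ddl)
```

Because an external table holds only metadata, rerunning the generated statements moves no data, which is why the change is done immediately.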

Carlos:             Sure, like, “hey, I forgot these columns or these pieces.” And now, storage costs are getting cheaper, but there is still the overhead of having to move it and store it and as we get more and more data, having more copies isn’t necessarily in our best interest, either.

Kevin:              Yeah, though with this, the downside to virtualization, which leads into one of the upsides of big data clusters, downside of virtualization is, it tends to be slow. It tends to be a lot slower than having all of your data in one location. And that’s one of the things that big data clusters offers as a capability where I can have that data in Oracle, and I can generate an external table and then materialize that data within SQL Server. So, I’m bringing back this concept of ELT or of ETL, but I get the performance benefits of having everything within one or a set of SQL Server instances.
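The trade-off between virtualizing and materializing can be modelled with two tiny classes. This is a sketch under stated assumptions: `VirtualizedTable`, `MaterializedTable`, and `fetch_invoices` are hypothetical names, and a Python function stands in for the trip to the remote system.

```python
class VirtualizedTable:
    """Every read goes back to the remote source: fresh, but it always pays the trip."""

    def __init__(self, fetch_remote):
        self.fetch_remote = fetch_remote

    def read(self):
        return self.fetch_remote()


class MaterializedTable:
    """Copies the remote rows once and serves later reads locally until refreshed."""

    def __init__(self, fetch_remote):
        self.fetch_remote = fetch_remote
        self.rows = None

    def refresh(self):
        self.rows = list(self.fetch_remote())

    def read(self):
        if self.rows is None:
            self.refresh()
        return self.rows


calls = {"remote": 0}


def fetch_invoices():
    """Stand-in for a slow query against the remote source system."""
    calls["remote"] += 1
    return [("INV-1", 120.0), ("INV-2", 75.5)]


virtual = VirtualizedTable(fetch_invoices)
virtual.read()
virtual.read()
print(calls["remote"])  # 2: each virtualized read hit the remote source

cached = MaterializedTable(fetch_invoices)
cached.read()
cached.read()
print(calls["remote"])  # 3: only the first materialized read hit the remote source
```

The materialized copy is faster on repeat reads but only as fresh as its last `refresh`, so the choice depends on how often the source changes.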

Carlos:             So I tend to think of it, it just popped into my mind, this idea of learning a different language. If you want it to perform fast, you’ve got to learn the language. If you don’t want to take the time to– all the study and the prep work and whatnot, you can use like a Google Translate, but your conversation’s just going to take a little bit longer, but it is possible. I feel like it’s kind of the same thing applies in this case. Okay, so let’s maybe just talk about, from an architecture perspective, and actually you mentioned something before we started recording here. You said SQL Server versions. Now I haven’t seen anything. Are they going to release a big data cluster version and then SQL Server proper? Or when I install SQL Server, will this just be another option that I am configuring, because again, of all of the Kubernetes and the container pieces that I need?

Kevin:              Yeah, so it’s not an edition of SQL Server. At least as of today, what is being built. Instead, think of it as a pattern that they’ve made easy for you to implement. Relatively easy. The secret behind this, which is really not a secret, is that you can do all of this yourself. If you look at the architectural diagrams, you look at the products that are in place there, there’s nothing which is so incredibly unique that you could not replicate it, but the benefit is, I can build this up with one Azure Data Studio notebook, instead of having to learn how to build a Kubernetes cluster, learn how to generate pods, learn how to integrate with HDFS, learn how to spin up a Spark cluster, make sure that I install SQL Server on Linux and Spark together and get everything connected and configured and also learn how to do all of the things like saving that data in distributed SQL Server instances and then using PolyBase to query those. So, the pieces are there, but the benefit that we get from this is, “hey, I’ve got a script that I can run that will generate it all.” So, there may come a time when we learn about the licensing of this, and then there may actually be a big data cluster edition of SQL Server. I know nothing about licensing, and maintain that ignorance on purpose, but it could also just be that, well, a big data cluster is an instance of Enterprise Edition plus however many instances of Standard Edition, plus whatever else you need.

Eugene:           That makes sense.

Carlos:             But red flags here in the sense that you are going to need more than just SQL Server to get this thing up and running.

Kevin:              Right, the whole infrastructure is more than SQL Server.

Eugene:           So, a question from sheer ignorance, I’ll pretend it’s the Socratic Method, but it’s just me being dumb here.

Carlos:             Welcome to the program, Eugene.

Eugene:           Yeah, following Carlos’s lead. No. So, back when they introduced SQL Server on Linux, it would have been easy to look at it and go, “okay, they’ve jumped the shark. Like what, what is going on here?” And when I look at big data clusters, my knee-jerk reaction, and I know it’s knee-jerk, it’s, “oh my goodness, Microsoft has jumped the shark.” What they did is they looked at all of the open source tools or the popular tools in this kind of modern data warehouse, ETL space, and they said, “we need some way to slather SQL Server all over it.” So, help me understand the leap forward, or the workload that this is solving, that this isn’t just jumping over the shark. Because I feel like there’s got to be other existing tools or solutions that are better for this than, “okay, well, everyone’s using Spark and Kubernetes and HDFS, so let’s just like slap on SQL Server on top of that.”

Kevin:              Yeah, so I would probably start by pointing out one of the bigger use cases for big data clusters that kind of makes all the pieces fit, and that is, I need a data lake, but I don’t want to put my data in Azure or in AWS. So, you can build all of this on-premises as well. The notebooks are intended to be built using the Azure Kubernetes Service, and integrate pretty nicely with the rest of the Azure ecosystem. But if you’re a large company that wants a data lake and doesn’t want to be up in the cloud, this is a way to do it. And specifically, a way which still lets your analysts and developers, and everybody write queries using SQL Server, using T-SQL, connecting to one instance, not having to care about the deeper infrastructure. So in this way, it’s like Azure SQL Data Warehouse, also on-prem, where I can have my queries scale out to some number of instances limited only by how much money I’m willing to spend, and get that kind of performance for warehousing or for data marts or for working within a data lake. So, I think that’s probably the place where you’re going to see the biggest benefit, here. It’s less of, “let’s put SQL Server some place where it barely belongs,” and more of, “let’s kind of drag some of the T-SQL people into modern data warehousing.”

Eugene:           So, let me paraphrase just to make sure I’m on the same page, because I think I kind of get it now. So, it’s not so much, “okay, we see this untapped space and we need the like SQL Server branding slapped on so we can charge for Enterprise cores” or that sort of thing, but more so, “okay, we have this really mature tooling but the definition of scale-out has broadened and changed a bit.” Availability groups aren’t going to cover what some people need to be able to do, and instead of saying, “okay, not only do you have to move your data to the cloud, which you may not be able to do because of regulatory issues, but you have to use all these new technologies completely from scratch,” we’re going to say, “we’re going to give you some stepping stones so that you can kind of move into that data lake space or that massively parallel processing space.” But it’s a few steps away from the SQL Server you know and love, as opposed to, “okay, now you have to use all of these technologies that are indistinguishable from Pokémon names.”

Kevin:              Right, and to be fair to this process, Kubernetes, for example, is not absolutely required, but it makes life so much easier. The reason being that you’ll use Kubernetes to say, “I want 16 SQL Server boxes available.” I’m being deliberately vague on boxes, because we can talk architecture. But, “I want 16 boxes available in this pod,” and let the infrastructure manage that for you, instead of saying, “oh, I need to add four more SQL Server instances because it’s month-end processing time. Well, we’ve got to put in a ticket 45 days in advance so that somebody can spend 6 hours spinning up a set of VMs.” So, in that case, that tool is not being used because it’s popular, it’s popular because it’s useful.

Eugene:           That makes sense. I’m starting to appreciate containers a lot better. Reading The Phoenix Project, and then starting to use them for my demos, because whenever you’re demoing on SQL Server 2019, being able to just spend 5 minutes to spin up the latest version is really, really nice.

Carlos:             Super nice, yeah.

Eugene:           I think I’m still getting to the point where I don’t treat Kubernetes like blockchain. Like, any time somebody puts the word blockchain somewhere, I just roll my eyes, and I’m making it past that part of the hype-curve with Kubernetes, at least.

Kevin:              Yeah, I think it deserves to go a bit further than that, yeah.

Carlos:             Now, I will also say, going back to your question, Eugene, that one of the reasons that they can make this leap, and admittedly, Travis talks about this in the announcement, is that when Microsoft ported SQL Server to Linux, it made a lot of these Linux-based applications now, all of a sudden, available, or more available. And so, I don’t know about necessarily jumping the shark, but now it’s like, “okay well now we’ve crossed that bridge, now we’re on the other side, what is here now that we can, I guess, conquer?” If you want to think of it that way or integrating with.

Kevin:              Yeah, exactly, where it fits in a space that Microsoft historically has not really been a player.

Carlos:             That’s right.

Eugene:           This is not the Microsoft of 10 years ago, I’ll tell you that. I remember in high school, which was more than 10 years ago, I think it was like a career thing, and somebody’d come in, and I forget the exact context, but they’re like, “you know, you could work at Microsoft one day,” and I had drunk the Slashdot Kool-Aid at the time, and I was like, (raspberry sound) you know, practically spit on the ground. And it’s just so interesting that all this stuff that we’re talking about just never, ever would have happened 8 or 10 years ago. It’s quite a change for the company.

Kevin:              Yeah. So since I did talk about architecture, it’s probably a good idea to get a little bit more into that.

Carlos:             Yeah, you should probably describe that a little bit more.

Kevin:              Yeah, there’s a few concepts, and I think we can do this over radio reasonably well.

Carlos:             But we will put in a diagram. James Serra has a diagram. I mean, there’s a lot of pieces here, but we’ll make sure we put that up on the show notes as well.

Kevin:              Yeah, don’t worry, I’m not going to read the entire diagram. It’s cool.

Eugene:           But if you’re driving, please don’t look up the picture right now. Be safe.

Kevin:              Yeah, yeah. So, at the top you’ve got this master instance. That’s what we call the driver in Spark, it’s what we call the control node in PolyBase, the name node in Hadoop. In this case, it’s just a SQL Server instance, so you connect to it with your applications, Management Studio, Azure Data Studio, your .NET apps, whatever. It’s just a regular SQL Server, port 1433, you can treat it like a regular SQL Server, everything’s copacetic.

Under the covers, we have this concept of pools, so, you have a storage pool. This is basically pulling in data through Spark, storing it in HDFS using a format called Parquet. Parquet is a file format, which is columnar and compresses very well. It’s great for that columnstore-style feeling where I’m going to aggregate data from a couple of columns and get results. The types of stuff that you would see in fact tables; the types of things that I would drop a clustered columnstore index on if it was in SQL Server. So, we bring that data in, we’re storing it in HDFS and then we can access that data directly, either through the HDFS endpoint, so I can connect through WebHDFS and pull the data in as an external table, or through a SQL Server endpoint, make it look like it’s SQL data, so it has to be structured that way. That data’s stored, it’s going to live there, but HDFS is not necessarily going to be the fastest way of accessing this data. That’s a very polite way of phrasing that. It works well when you have enormous amounts of data. It scales out really well, but there are costs to accessing the data that you don’t get when it’s stored in a B-tree structure.

So, we have a data pool. The data pool is what I’ll call the caching layer. That is a bunch of SQL Server instances whose job it is to store data. These data pools, they tie back to PolyBase. One of the complaints that I have about PolyBase is I can run a query and it pulls data back, and that data’s ephemeral, it’s gone, so I can’t cache it in SQL Server. Which means that the next time I run that query I have to go back and get the data again, which means it’s fresh data at least, but it’s going to be equally slow. So, if I have something that takes a while to retrieve that data, I can cache it in the data pool and again, it’s just a bunch of SQL Servers, so it’s like me taking the data from Oracle and Mongo and CosmosDB and wherever and dumping it into SQL Server, but it’s distributed across all of those instances I’ve set up, sort of like Azure SQL Data Warehouse. If I have 30 instances, I’m spreading that data across the 30 instances, I get the possibility of running, say 30 times faster, if I have a query that will scale linearly.

And then we build up the idea of data marts, so, especially I can see this working well with a Kimball-style model, where I have fact data. If I have that fact data distributed across my 30 instances, and I don’t need to cross instances to write my queries, I can distribute the dimensions everywhere or hold them on the master instance. We’re starting to get into architecture for Azure SQL Data Warehouse, because conceptually it’s the same. So, I’ve got that fact data spread out, I may have the dimension data up on my master instance, and I write my queries, it’s going to go across the separate instances for my data pool, retrieve that data.

We have one extra pool called the compute pool, and that is my PolyBase scale-out group, so it’s just another set of SQL Server on Linux instances. They’re controlled by Kubernetes. They’re going to be able to reach out to my data pool. They’ll retrieve the data and do whatever aggregation I need to do. They have the ability to reach out to the storage pool and pull data from HDFS. They can reach out to the outside world via the master instance, and then retrieve data in, so pull data in parallel. So, these things are our helper nodes. An end user is never going to connect to them directly. An end user isn’t going to care that they exist. They only work to make queries faster.
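The data mart layout Kevin outlines, with fact rows spread across data pool instances and dimensions available everywhere, can be sketched as follows. This is a simplified model, not the product's implementation: the function names and sample data are hypothetical, and round-robin placement is an assumption chosen for simplicity.

```python
from collections import defaultdict


def distribute(fact_rows, instances):
    """Spread fact rows across data pool instances, round-robin."""
    shards = [[] for _ in range(instances)]
    for i, row in enumerate(fact_rows):
        shards[i % instances].append(row)
    return shards


def local_aggregate(shard, dim_customer):
    """Work done on one instance: join its facts to the replicated
    dimension and total the amounts per region, touching no other instance."""
    totals = defaultdict(float)
    for row in shard:
        totals[dim_customer[row["customer_id"]]] += row["amount"]
    return totals


def query(shards, dim_customer):
    """Merge the per-instance partial totals into the final answer."""
    merged = defaultdict(float)
    for shard in shards:
        for region, amount in local_aggregate(shard, dim_customer).items():
            merged[region] += amount
    return dict(merged)


# Hypothetical sample data: a small dimension and a handful of facts.
dim_customer = {1: "East", 2: "West", 3: "East", 4: "West"}
facts = [
    {"customer_id": 1, "amount": 10.0},
    {"customer_id": 2, "amount": 20.0},
    {"customer_id": 3, "amount": 30.0},
    {"customer_id": 1, "amount": 5.0},
    {"customer_id": 4, "amount": 15.0},
]
print(query(distribute(facts, 3), dim_customer))
```

The answer is the same whether the facts sit on one instance or thirty; only the amount of work per instance changes, which is where the "30 times faster, if the query scales linearly" estimate comes from.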

Carlos:             Right. They’ll continue to interface with the master instance, ultimately.

Kevin:              Right, yeah, everybody talks to the master instance and that’s a common pattern in this. In Hadoop, as an end user you pretty much never talk to a data node directly. You just talk to the head node. So, that’s the high level of the architecture. Those are the pools, and the pools, they’re just a bunch of SQL Server instances, so get out your wallet.

Carlos:             That’s right. With the idea of being that this is temporary in nature, like this is, “I want to throw a bunch of stuff together and then I can bring it down.”

Kevin:              So, there is that possibility, and that’s, again, what Kubernetes brings to the table. That I can say, “well, it’s a slow point right now. I only need, say, 4 nodes in my compute pool. But then we’re going to do nightly processing for the warehouse, let’s bump that up to 16. Let’s get this stuff done fast, and then I can scale it back down. And I could do that on-prem just as easily as I can in Azure Kubernetes Service.”
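The "say how many you want and let the infrastructure sort it out" idea can be illustrated with a toy reconciliation function. This is not the Kubernetes API: `reconcile` and the `compute-N` instance names are hypothetical, and a real orchestrator does far more than count replicas.

```python
def reconcile(running, desired):
    """Return the actions needed to move from the running instance count
    to the desired one, scaling down from the highest-numbered instance."""
    if desired > running:
        return [f"start compute-{i}" for i in range(running, desired)]
    return [f"stop compute-{i}" for i in range(running - 1, desired - 1, -1)]


# Nightly warehouse processing: scale the compute pool from 4 to 16, then back.
scale_up = reconcile(running=4, desired=16)
scale_down = reconcile(running=16, desired=4)
print(len(scale_up), scale_up[0], scale_up[-1])
print(len(scale_down), scale_down[0], scale_down[-1])
```

The operator states only the target count; the individual start and stop steps are derived, which is what replaces the 45-day ticket Kevin mentions earlier.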

Carlos:             Sure. Now, the one dot that’s not quite connecting for me, and I think it’s just lack of experience, is that we talked earlier about data virtualization and not moving data around. One of the things that I don’t– I mean the compute perspective makes sense to me, but we talked about spreading the fact out. Isn’t that still moving data around? I mean, I guess I’m not transforming it necessarily, it’s already in that state. But I feel like am I still, in essence, paying for some data movement as I’m scaling up and down?

Kevin:              Yes, and the answer to this is you choose what you want to do based on your circumstances. If I have data that I just want to be virtualized, I don’t want to have it stored locally, then I can do that. I can just connect out to external data sources. I can create an external table on my master instance and retrieve data. If I do want to bring that data in, if I want this to behave more like a data lake, I also have the option of pulling data in, storing that data, reshaping that data and then making it available internally. Meaning, for a warehouse, most of my facts and dimensions, I’m probably going to store locally, because I want to store history, because I want it to be faster, but we may have some dimensions that we just want to virtualize, and okay, this almost never changes, and it’s rarely used, but “something, something, something,” I decided not to do that.

Carlos:             Sure, interesting. Yeah, and then all of this, we can begin to leverage all of the other pieces that have now been put into SQL Server, thinking about R and Python and in 2019, Java, right?

Eugene:           Yay.

Kevin:              And there is a Java play in here, as well. Because when you think about you have all of this data, we have the data in a data lake and perhaps we want to run an ML job. You know, I want to take that data and try to forecast the future results or something. I have a couple of options. I can use SQL Server Machine Learning Services and ML Services can hit that set of data, I can run my R queries or my Python queries to generate a model and then generate predictions from that model. I also have the ability, because the data’s stored in HDFS, and I have a Spark cluster, I can connect to the Spark cluster and run PySpark notebooks or I could run spark.ml, sparklyr, whatever I want to, to generate models in Spark across the entire set of nodes in that storage pool. So, I can run distributed models, build them, and then I can even take those results and feed them back into SQL Server using the Java interop. So, there was a reason for Java interop.

Carlos:             There we go. And we’ve come full-circle, ladies and gentlemen.

Kevin:              Using the entire animal.

Carlos:             That’s right. Waste not want not, right? So, I guess that is the big data cluster, I guess we’ve touched on PolyBase already, so if I don’t necessarily want all of those additional nodes, I can still use PolyBase to go outside. Yeah, I guess what I’m saying is, those are all of my questions. Did you have any other questions, Eugene?

Eugene:           Let me think for a second, here. I guess one other question that comes to mind is– and this involves you just looking into a crystal ball, so I’m sure you’ll have the exact answer. So, for V next, so to speak, so whatever, you know, SQL Server 2021 or whatever.

Carlos:             I was going to say, we haven’t even got 2019 out–

Eugene:           No, I know. Well, no, I mean, so my question is do you expect that for the next version, Microsoft’s going to be focusing on kind of stabilizing this or do we expect that there’s going to be another Fonz jumping over the shark for the next version? Do you know what I mean?

Kevin:              My hope is that what we get out of this, there are going to be problems. It’s a V1 product. There’s always limitations, there are always things where, “eh, I wish you did this instead.” So, my hope is that there is enough adoption that they decide, “we’re going to continue investing in it.” That investment may take a little while. PolyBase, we normal people, the plebeians only got to see it in 2016 and then the next version, 2017 came out, and really not much had changed with PolyBase. But then 2019 comes out and, “hey, guess what, we’ve got a bunch of new stuff.” So, it may take a couple of years for them to figure out where the pain points are and then come up with a way of addressing them. But my expectation is that they’re dumping a lot of time, effort and money into this in the hopes that people are going to start using it. In the hopes that enough large enough companies are going to find a need for this and not want to build it themselves, that it’ll be worth continuing to maintain and invest in. But again, it’s sort of little bit of chicken and egg where they’ll give you the product, but people have to use it for them to want to spend more money on it. But it does fit very nicely in with where the SQL Server leadership team have been talking about, “this is where the product is going.” This is where bringing code to the data, using SQL Server as that central point for virtualization, for machine learning, for data analysis, visualization, kind of having that be the centerpiece of your environment and I think this is another part of that.

Eugene:           Yeah, I guess I’m just hoping it stops going places, so I have a chance to catch up a bit.

Kevin:              Oh, ever since 2005 it’s just been a fool’s errand.

Carlos:             I was going to say you’re in the wrong sector, Eugene.

Eugene:           Wrong business, yeah. They told me databases change slowly. They lied.

Carlos:             Yeah.

Kevin:              Now, one other thing that I do want to mention is that there is a book. Ben Weissman and Enrico van de Laar are working on SQL Server Big Data Clusters Revealed. Currently the release date’s scheduled for February of 2020 and I would be remiss to Ben and Enrico if I didn’t mention it, especially Ben, who probably would hunt me down, and he does come to the US often enough.

Eugene:           It’s true.

Kevin:              So, I definitely want to say go buy that book after you’ve bought my PolyBase book, and whatever book Eugene’s working on. Which is probably neither PolyBase nor big data clusters but still worth buying.

Eugene:           Not even slightly related.

Kevin:              Still worth buying.

Carlos:             Cause you have to have a presentation on all of this, right?

Eugene:           Right?

Kevin:              And now Carlos has to write another book.

Carlos:             Yeah, yikes.

Kevin:              Or you could just buy Carlos’s current book. That one’s also going to work.

Carlos:             There you go, for the way back machine. Yeah, interesting. So I think, the big takeaway from a big data cluster perspective is that you don’t have to have pipelines, you can connect to various types of systems through being able to scale out, you can take on a bit more volume, perhaps, than you could in the past, and now you can integrate all of the other code pieces that you mentioned in that data. Yeah, it’ll be interesting to see what happens.

Kevin:              Yeah, all of this does come at extra complexity, but I think it’s useful if you have the data requirements. If you’re running fine on a single instance of SQL Server, it’s not worth the money to say, “well, let’s now make it 20 instances of SQL Server.” One other thing that I do want to mention is on the administration side, one of the things that I’ve particularly appreciated about big data clusters is that they are heavy in using notebooks for administration. I mentioned the notebook to create the cluster, there are also notebooks to manage the cluster. So, you can run these things and see what are all the instances doing, what are all the nodes looking like. It’s a nice approach, because you have a lot of different pieces, so you have to manage how does Kubernetes look, what does that configuration look like, what is the status of those? On Spark, what do those Spark nodes look like? How are they running? The SQL Server instances, yeah, you can use your existing T-SQL skills, your existing DMVs and get much of that information. But having this all in a central notebook where I can simply run it and get a status check is quite useful. Particularly because we don’t have much in the way of UI for all of this, and I’m guessing that if the notebooks work out for people, there may not be much in the way of a direct UI and it’ll instead be within Azure Data Studio. Currently they have a little dashboard, click this button to get the status, and that pops up a notebook.

Carlos:             Yeah, the management of all of this will be very interesting, and I’m sure there’ll be lots of interesting things that we can talk about as it becomes available and adoption increases.

Kevin:              Yeah, and this may help it get around the classic Service Broker problem, where it was a fairly good technology for its time, just didn’t have a UI and so very few people wanted to dig into the plumbing of Service Broker and it just was underutilized.

Carlos:             Extended Events might be another. Now, obviously they’ve built a UI.

Eugene:           Oh yes, oh Extended Events.

Carlos:             But that’s, yeah.

Eugene:           I had to do an auditing project for a company and some of their servers were on 2008 R2 and so I’m writing these Extended Events without any GUI. It was terrible. So yeah.

Kevin:              And yeah, remember early on, Jonathan Kehayias had to put together a UI because Microsoft didn’t have one. The 2012 UI was a start, but I don’t think it was until 2016 that you could confidently say, “yeah, this is a pretty decent UI.”

Carlos:             Exactly and be able to get in there and filter and you know, feel like you could get what you wanted and all of that stuff.

Kevin:              Right, there’s a reason people still use Profiler. It’s not just curmudgeonliness.

Carlos:             Right. No, that’s right. It’s ease of use, which ties into what we talked about in the beginning is that they’re trying to make these other pieces available to you a little easier.

Kevin:              Yep, so that you don’t have to learn, manage and maintain it all yourself.

Carlos:             That’s right. Okay, compañeros, I think that’s going to do it for today’s episode of the SQL Data Partners Podcast. Thanks, Kevin, for enlightening us on big data clusters.

Kevin:              Or at least befuddling.

Carlos:             There we go. So our music for SQL Server in the News is by Mansardian, used under Creative Commons. And as always, compañeros, we are interested in what you have to say. You can connect with us on social media via various means. Eugene?

Eugene:           Yeah, you can find me on Twitter @sqlgene or sqlgene.com for my blog.

Carlos:             Kevin?

Kevin:              I’ve already created an external table in your database, so just insert a row into that.

Carlos:             And you can reach out to me, compañeros, on LinkedIn. I am at Carlos L Chacon. Thanks again for tuning in to today’s episode, and we’ll see you on the SQL Trail.
