Carlos: Compañeros! Welcome to another edition of the SQL Data Partners Podcast. I am Carlos L Chacon. It is good to be with you today. Today we are doing Episode 188, and as always, I am joined by my legion of cohorts, Kevin Feasel.
Kevin: Hello, there.
Carlos: Yes, recently recovering from a trip out to the Consumer Electronics Show. And Eugene Meidinger.
Carlos: Still from Pittsburgh, Pennsylvania.
Eugene: Maybe in 10 years that will change.
Carlos: Maybe in 10 years it’ll change, yeah. No, Pittsburgh is a nice place. I just have to keep teasing you about it. Okay, so our topic today is Databricks. Data, Databricks. So, lots of, I would say, I feel like new hotness, although it’s been around for a little while. Well, in Azure terms it’s been around for a little while, but we’ll get into that. We want to talk about that, and of course we will try to leverage pieces that are pertinent to those of us working with data and that know and love SQL Server. But before we do that, I want to give a couple of shout-outs. First shout-out to data_relish, the organization. I’m not sure who is doing their social media at Data Relish, but thank you for giving us a little love on the podcast. We appreciate it, guys. And then also, this is the first episode we are recording in the new year, so thanks to everybody who wished us a Happy New Year, we do appreciate that, with special inclusion of Rafael Colón-Garcia and Shanal Singhal. Thank you, compañeros, for reaching out, again. And I have a last-minute entry, here, so, thank you for leaving messages on the website. This is Sachin Gangwar. He really liked the content and enjoys mixing things up with Power BI. A lot more folks, I am hearing, I don’t know if it’s the new year or as they start to move into different roles, but more and more of you are using Power BI or starting to hear more about it. And I know we’ve talked about it, but Sachin wants to hear a little bit more about Power BI Premium. So, we’ll have to get that on the–
Eugene: Oh, interesting.
Carlos: Yep. On the episodes. So, because we have talked about it, Sachin, if you would follow up with let me know maybe what specific questions you might have and we’ll make sure to address those in a future episode. Okay, our show notes for today’s episode will be at sqldatapartners.com/databricks or at sqldatapartners.com/188. Okay, interestingly enough, so the story goes, that Netflix, I guess this was a couple of years ago, now, put out a one million dollar reward for anyone who could create an algorithm for picking better movies.
Kevin: Recommendations for end-users, yes.
Carlos: There you go, recommendations for end-users. And there was a group from Berkeley that put together what ultimately became Apache Spark.
Eugene: Oh, interesting.
Carlos: Now, they did not win, apparently. They came in second.
Eugene: That doesn’t mean much, because the group that won, from my understanding, Netflix didn’t end up even using the algorithm because it was too complicated.
Carlos: Oh, interesting.
Kevin: That is correct.
Carlos: There you go, so things have to tie back to movies on this podcast, we had to show a Netflix connection there. So, they came up with this, and why is this relevant, you ask? Because Databricks is ultimately built on Apache Spark, so this came out of the machine learning, how to do predictive analytics realm. And I feel like I’m stealing a little bit of Kevin’s thunder, here.
Carlos: But maybe let’s get into a little bit of the architecture. Do you want to take us the rest of the way here, Kevin?
Kevin: Sure, but I think I’m going to start with a little bit of storytelling.
Carlos: Here we go. You mean my Netflix story wasn’t good enough? No, I’m just kidding.
Kevin: It was okay. It was like a 6 out of 10. For anybody in the crowd, I am currently going through a phase where I became a 70-year-old woman with a 12 pack a day habit.
Eugene: Life goals.
Kevin: Yeah, yeah, it was a really difficult week to get there, but I’m glad I had a chance to experience it. Anyhow.
Carlos: Yeah, ready to get back into the time machine, right?
Kevin: Two more days I’ll be cleared up, hopefully.
Carlos: There you go.
Kevin: But, to talk about Apache Spark, I like to talk about Hadoop a little bit, first, because that was really the basis of the types of problems that companies wanted to solve. And some of the largest companies on the planet, your Googles and Facebooks and Microsofts, Twitters, had a lot of information that they needed to process, and the types of reporting that they would do could not be done on a single machine. It was just not possible given the constraints available to those companies. So, in about 2003, 2004, there were a couple of Google whitepapers released. One on the Google file system and one on a programming concept called MapReduce. The Google file system stores your data laid out on a whole bunch of different machines. MapReduce allows you to write programs that can query those machines, roughly independently of one another. There are still points where they interconnect, but they can work mostly in parallel. So, if you have an algorithm which supports it, you get approximately linear scale, which is really great, ’cause it means if I need this thing to finish twice as quickly, I’ll throw twice as many machines at it. So, that was good. It was designed around the hardware at the time, the software at the time, the networking at the time, which was 2006, 2007, when Hadoop comes out. So, we’re talking mostly physical machines, direct attached storage, spinning disk, say, 1 gigabit network, 100 megabit network. Not exactly the types of like 64 gig fiber channels that companies will install today. Apache Spark comes out of a research project. It was a lab at UC Berkeley called the Algorithms, Machines and People Lab. AMPLab. And the authors of the product did compete in the Netflix challenge, but this was also a PhD project for Matei Zaharia and a couple of additional co-authors. So, they came out with two of their own papers. Paper number one was on Apache Spark as a product, and this was, let’s see if I can remember the date, 2009.
Then in 2010, Zaharia had 8 co-authors on what is the more popular paper, on a concept called Resilient Distributed Datasets. This allows Spark to be like Hadoop in the sense that you distribute out your work, but–
Carlos: To lots of different places, yeah.
Kevin: Yeah, to a lot of different machines, but all of that data stays in-memory. The thing about Hadoop is, you’re assuming that you have more data than you have memory, so, disk has to be involved. Disk is also the slowest part of those machines outside of maybe network, just because of bandwidth constraints at the time. So, they were trading, basically, memory for disk. On the Hadoop side, they accepted the disk slowness because it got a solution, but eventually people wanted something faster, so Spark comes in with their concept of in-memory distributed computing. Similar set of algorithms, similar set of operations. The concepts of MapReduce still apply, but now we’re talking about, “how do we keep this in-memory as long as possible and solve problems that way?”
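To make the MapReduce shape Kevin describes concrete, here is a minimal pure-Python sketch of the three phases: map each record to key/value pairs, shuffle by key, then reduce each group. This is illustrative stdlib code, not the Hadoop or Spark API; in Spark the same word count would be a few `rdd.flatMap(...).reduceByKey(...)` calls.

```python
from collections import defaultdict
from functools import reduce

# "Map" phase: each document independently emits (word, 1) pairs.
# In a cluster, this runs in parallel on the machine holding each document.
def map_phase(documents):
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

# "Shuffle" phase: group all emitted pairs by key, as the framework
# would before handing each key's values to a reducer.
def shuffle_phase(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# "Reduce" phase: combine each key's values into a final count.
def reduce_phase(groups):
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}

docs = ["spark is fast", "hadoop is durable", "spark is in memory"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
# e.g. counts["spark"] == 2, counts["is"] == 3
```

The key property is the one Kevin mentions: because the map work for each document is independent, you get roughly linear scale by adding machines, with the shuffle as the interconnect point.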
Carlos: There you go. So this is an important point, because, one, I guess if we’re layering Databricks, and if I’m saying this wrong, Kevin, please forgive me, but ultimately, that starts with Apache Spark. Databricks then creates a layer on top of that as the interface between the technology and then being able to get to your reporting, and it makes the data available in different forms. Obviously, you can do the reporting in notebooks, or you can connect like Power BI to it, or these others like Qlik and Tableau, things like that, right?
Kevin: Yeah, so Spark is the open-source project. Databricks has two meanings. First is the company, Databricks. Zaharia and most of those paper co-authors went off to found this company. Their product, the Databricks Unified Analytics Platform, is also generally known as Databricks, and that is Spark plus a few extra niceties.
Carlos: Exactly. So, these guys were interested in making a little bit of money. They had the open-source software, but they’re like, “hey, how can we provide some additional value, here and sell something?”
Eugene: You talk about a little bit extra money, it looks like they’ve raised nearly a billion dollars in funding, so sounds like there’s some big money, there, potentially.
Carlos: Oh yeah, I mean, I think you talk about like industry disruption, Hadoop kind of had it, I don’t know if cornered is the right thing, but you were hearing a lot about it, and then it wasn’t too long after Azure Databricks got introduced that now it’s all anybody’s talking about, right?
Eugene: Yeah. Well, I mean, I think Hadoop as an ecosystem or as like a file format and that sort of thing still lives on pretty strong, it seems like. It’s just the original, like, Hadoop MapReduce piece that is wavering. So, Kevin, I have a question. I don’t understand Apache, in the sense that I had two ideas of what the Apache group was, and it seems like they don’t fit this very well. So, my first understanding was, I think if you’ve worked with Linux anytime before the past, like, 5 years, before Nginx became more common, you knew of Apache as a webserver, right? Like the whole LAMP stack, Linux, Apache, MySQL, and then like PHP or Python, was canonical for a very long time. That’s how I learned to do web development. And so, for the longest time, in my head, Apache was just a webserver. And then later, as I sadly started working with Adobe Flex, which was based off of Adobe Flash, I saw that it got donated to Apache and became Apache Flex, and so in my mind, Apache was where open-source programs went to die. It was basically, “okay, we don’t want this anymore, we need someone to like maybe maintain it, so here, Apache, go ahead and, you know, the Island of Misfit Open-source Projects,” kind of thing. But then I hear about this, about the fact that Apache’s shepherding this project that apparently has some big money sloshing around. So, what exactly is Apache or the Apache Group or whatever you would call it?
Kevin: Yeah, the Apache Foundation. The Apache Software Foundation. It’s actually the opposite of your second notion.
Kevin: It’s a place where you have projects that incubate, so there are a couple of levels of projects. Spark, Hadoop, Hive, a lot of these are what are called top-level projects. And what that means is this is an established project, it has a viable community, it is growing on its own, and so the Software Foundation kind of takes a hands-off approach and says, “these are mature products.” Below that, you have incubator projects. So, these are projects that are not ready for prime time yet. And that could be because it’s only you developing it, so we need more developers, a higher bus factor. It may not be in use very often, very frequently. It may not be fully formed, so that there’s not a stable product. So, the Apache Software Foundation’s goal is that they provide assistance to these types of open-source projects, helping them try to find community, helping them fit into whatever ecosystem they’re trying to play in and making them viable products over time.
Eugene: Nice, okay. No, that’s definitely interesting.
Carlos: So still could be misfits, but they are trying to keep them alive, instead of–
Eugene: Right, yeah. The plan is to get them out of the incubator stage, in a lot of cases, okay.
Kevin: And also following the Apache licensing.
Eugene: Oh, okay.
Kevin: There are licenses for Apache products and that license, they’ll follow different open-source licenses, but there is kind of a default Apache, if you don’t have a better choice. Like if you don’t want to go MIT or Berkeley or GPL.
Eugene: Right, very cool.
Carlos: Okay, so then we get into Databricks, and Kevin had mentioned the in-memory pieces. The other thing that we should consider is that the way that Databricks was set up is that, well, because it has to be in-memory, there is no direct interface, while I can have my data live in a SQL Server, for example, it’s got to go from that SQL Server into the, and I’m just going to call it Databricks, into that memory space. So, there has to be a data movement there. The other interesting piece, now, and this is probably specific to at least the Azure Databricks, although it’s worth mentioning, so Databricks is the company. Both Amazon and Microsoft are ultimately licensing the software from a third party, in this case, Databricks. So, the data has to live in the cloud, you have it there, you would ingest it with Databricks, and then you’re actually going to use that third-party Databricks software to do what you need to do, and then it can go onto the next hop. So, whether that’s the visualizations or you’re going to use it in the notebooks, or what have you. And so, I think that is something that was new to me, in the sense that I’m not going to interface with this thing in SQL Server. I’ve got to interface with it in the Databricks environment.
Kevin: Yes, Databricks is going to be using file systems to store data, so we can use what they call the Databricks File System, and we’re going to store data in there in some format. That could be a delimited file, like a .csv file, it could be in Parquet format, ORC format, those are probably going to be your three major formats for data. You can also pull in data from Blob storage, from data lake storage, and if you’re on Amazon’s side you can pull from S3. You’re not going to pull it directly from SQL Server.
Kevin: You could use a product like PolyBase, push that data into Blob storage, and then read it from Databricks, if you’re–
Carlos: Or Azure Data Factory is another option.
Kevin: Yeah, yeah, you could also use Azure Data Factory. I have my own monetary reasons to use PolyBase, but–
Eugene: I was just thinking that. I hear there’s a great book on PolyBase.
Kevin: I hear there’s a mediocre book on it.
Eugene: Oh, okay.
Kevin: So, yeah, you can get your data from SQL Server or from a relational system, as long as you can get it into a format that Databricks can support, it’s another step in the process.
Carlos: Sure. So, when you think about all of those old SSIS packages that were moving data all around, this is–
Kevin: Well, since you say that, in Azure Data Factory, all of the SSIS-like functionality around mapping data flows and wrangling data flows, if I remember right, that’s using Databricks under the covers.
Carlos: Oh, is that right?
Kevin: So, they’re using Databricks for a lot of the data flow movements, and in addition, you can also just straight-up execute Databricks notebooks in Azure Data Factory as a step.
Carlos: Yeah, and so I think for us, and where we have been, from a SQL Data Partners perspective, generally, because I don’t consider myself a Databricks person, per se, but what we have done a lot of is trying to help map the orchestration of those steps. “Okay, so where is my data now? Where do I want it to live? When it comes out of Databricks, what am I going to be processing, and then where’s it going to go from there?” A bit more architecture stuff, and I do see for those of us who are working with data, so machine learning or getting into those concepts might be one option, but I also think there’s still, I don’t know, I hate to use the word ‘basic’, but just some “where’s the data going to flow, and why do I have to take those steps?” And of course, then there’s cost considerations to all of this. So similar problems that we’ve been experiencing, just using a few different technologies.
Kevin: Sure, and one of the most common uses, here, of Databricks, machine learning is a pretty common use, but ELT: I’m going to take the data, land it in Databricks and I’m going to reshape it and possibly land that somewhere else. So, I can do a lot of heavy processing of data, pulling in a lot of disorganized data sources, do my data cleansing, and I can’t do that in Data Factory. So, I’ll do that in Databricks, finish the processing, and that way I can dump it into my data lake, or I can dump it into Azure Synapse Analytics or into some other end-product, you know, even into Power BI.
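As a rough illustration of that ELT shape — pull in disorganized records, cleanse and reshape them, then land the result somewhere downstream — here is a small stdlib Python stand-in. In Databricks itself this would be Spark DataFrame code (roughly `spark.read` on one end and `df.write` on the other); the field names and rules here are made up for the example.

```python
import csv
import io
import json

# Raw, disorganized input: inconsistent casing, stray whitespace,
# and a blank amount that a cleansing step needs to handle.
raw_csv = """customer , amount , region
  Alice , 120.50 , east
BOB,  , west
carol , 75 , East
"""

def cleanse(row):
    """Normalize one raw row; return None to drop unusable records."""
    amount = row["amount"].strip()
    if not amount:
        return None  # drop rows with no amount
    return {
        "customer": row["customer"].strip().title(),
        "amount": float(amount),
        "region": row["region"].strip().lower(),
    }

reader = csv.DictReader(io.StringIO(raw_csv), skipinitialspace=True)
cleaned = []
for raw in reader:
    # The header row carried trailing spaces, so strip the keys too.
    row = {k.strip(): v for k, v in raw.items()}
    result = cleanse(row)
    if result is not None:
        cleaned.append(result)

# "Land" the reshaped data for the next hop -- here just JSON lines,
# standing in for a write to a data lake, Synapse, etc.
output = "\n".join(json.dumps(r) for r in cleaned)
```

The point is the flow Kevin describes: the heavy reshaping happens inside the engine, and only the cleaned, structured result moves on to the next hop.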
Carlos: Yep, to then do all that reporting and get that back. We mentioned this is a cloud-first technology, so, the question there is, I have my data on-premises, I have my reporting, like my data warehouse, for example, I haven’t quite gone to the Azure tools there, and either through merger or acquisition or whatever, I have to take on additional data sources. Yeah, so that ability to do some of those things, but then still get that back on-premises without having to spin up a whole bunch of stuff, yeah, can be very appealing, can be helpful. Okay, so what else? I guess it’s worth probably talking about that in, again, the Azure components, that there are two models, so they have a Standard and a Premium option or tier, rather. And I know they have different workload options in the engineering and analytics and I’m not sure that I completely understand those pieces. Do you happen to know?
Kevin: Yep. So, there are three levels of pricing for Databricks and they come down to how much interactivity you’re going to have. I’ll start with the pricing for the data engineering versus data scientist or data analytics-style model. And at the bottom end, there is a Data Engineering Light, which is going to be, “you’re just running scheduled tasks.” It’s like the equivalent of a SQL Agent job executing a stored procedure. You have no real control over the cluster. Once the job is done, the cluster shuts down, goes away. You can’t restart it. So, if you have a problem, you can start up a new cluster, but you’re not going to troubleshoot anything within there. So, that’s about 7 cents per Databricks Unit. A Databricks Unit is a measure of processing capability per hour and that has about as much meaning as some of the other pricing units.
Carlos: It means you have to test it a couple of times to figure out what the cost’s going to be and like, “oh, okay, there we go.”
Eugene: Yes, these aren’t metric, where it’s based off the size of a platinum cylinder in France.
Kevin: So, for 15 cents per Databricks Unit, think of it as per server, though certain servers actually take up more than one DBU, so think of it as per small server, if you want to keep that in your mind. Fifteen cents gets you the ability to run those jobs on automated clusters, but you get certain optimizations, certain improvements. So, that would be the type of cluster that you would want to run in a production environment where you have things figured out and you want to run something more than just a Scala job. With the 7 cents per DBU, Data Engineering Light, you only get job scheduling running like a Python script or a Scala script, whereas with Data Engineering, you get notebooks, you get the ML capabilities and the ability to use something called Delta Lake, which we can talk about in a bit. Data Analytics, that’s 40 cents per Databricks Unit per hour and that’s going to net you the ability to have interactive clusters. So, now my data scientists can start up a notebook and play around with the notebook and then have it hit the cluster, so I don’t have something pre-written that I’m executing as a scheduled job. I’m running it ad hoc. Now as far as tiering goes, Premium versus Standard is mostly about the ability to have multiple users. When you’re in that Standard cluster, you get one user, basically. There’s very little in the way of operative security. It’s a lot less expensive, it’s a little easier to manage, but if you have to worry about different data scientists on different teams needing to sometimes cooperate together but really needing their own separate spaces, that’s where you want to put Premium in mind. But if you have, say, one team using a Standard cluster and it’s okay that everybody uses the same account, it’s not a big deal.
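To make that arithmetic concrete, a rough back-of-the-envelope estimate using the per-DBU rates from the discussion. These were the list prices at the time of recording, so check current Azure pricing; the DBU-per-node figure is a made-up example, and a real bill also includes the underlying VM charge, which this sketch ignores.

```python
# Per-DBU-hour rates as discussed on the show (verify against current
# Azure Databricks pricing before relying on them).
RATES = {
    "data_engineering_light": 0.07,
    "data_engineering": 0.15,
    "data_analytics": 0.40,
}

def databricks_cost(tier, nodes, dbu_per_node, hours):
    """Estimated Databricks software cost: DBU-hours consumed * rate.

    Ignores the separate VM/compute charge, which is billed on top.
    """
    dbu_hours = nodes * dbu_per_node * hours
    return dbu_hours * RATES[tier]

# Example: a 4-node interactive cluster at a hypothetical 1 DBU per
# node, running 8 hours at the Data Analytics rate:
# 4 * 1.0 * 8 = 32 DBU-hours * $0.40 = $12.80
cost = databricks_cost("data_analytics", nodes=4, dbu_per_node=1.0, hours=8)
```

This is also why Carlos’s “test it a couple of times” advice is sound: the DBU-per-node figure varies by VM size, so the only reliable way to pin down the rate is to run a representative workload and read the bill.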
Carlos: Right, okay. Yeah, so a couple of the takeaways, at least for me today, is that, one, the data’s got to live in-memory, so it has to get into a storage component that Databricks can read from, which is not SQL Server, at least currently. It’s going to get processed when I’m in Databricks, and I’m actually using third-party software that, in this case, Azure, because we were kind of focused there, and Amazon the same way, is licensing. So, I’m actually going to go and work in azuredatabricks.net to do some of the notebook stuff. And then once the processing is done, I can then take it and do whatever it is that I want to do with it. And so, yeah, it’ll be interesting to see how this continues to evolve, and what they come up with. It sounds like there’s still lots of other pieces that you can, I don’t know if ‘bolt on’ is the right word, but it’s another, I guess, another tool in the toolbox for you to be able to manipulate data, bend it to your will and then make it available, and at scale.
Eugene: Well, so I agree with a lot of that. I have a slightly different view on kind of the value-add for Spark and Databricks, because you–
Carlos: Oh, here we go.
Eugene: Yeah, well, I mean, so one of my goals is to learn some Spark for 2020, because there’s a good chance that I might be doing some courses on that. And you know, you talk about it being a tool in your toolbelt, it feels like Spark and Databricks is an attempt to kind of make more of a Swiss Army Knife, ’cause when I first started reading about Spark, I expected it to basically be, like Kevin said, Hadoop, but in-memory. But also, I’d known a little bit about it using Scala, and about kind of these iterative algorithms where you make some changes and then you kind of, I don’t know if you persist the results, or you basically have these intermediate stages, like you do some work and then you’re like, “okay, that’s a stopping point,” and then you continue. And I’m probably butchering it, but hey. But what I found out, instead, is that Spark, kind of like CosmosDB, supports a bunch of different modalities. So, you’ve got Core Spark, you’ve got Spark SQL, you’ve got Spark Streaming, you’ve got the machine learning stuff and you’ve got graph processing. And so, you combine that with like your basic Apache Spark with the cluster management and the notebooks with Databricks, and what I see this as, and Kevin, you can either validate or invalidate this, is this may have started out as, “hey, memory’s gotten cheaper, let’s do Hadoop in-memory, so to speak.” But it’s turned more into, “okay, we want to do advanced analytics, we want to have a significant maturity in how we handle analytics when we work with our data engineers and data scientists, but gosh darn it, I am sick of using 15 different tools that sound like Pokémon. As an IT admin, I want a single solution. I want to be able to say, this is our analytics cluster. Look at how beefy it is. Here’s where we do our advanced analytics. Here’s where we do our regular kind of MapReduce algorithms. Here’s where we do machine learning. Here’s where we do streaming.
Now we can do a Lambda architecture approach where we say, okay, for old data we’re going to do batch processing, but for the new stuff, we’re going to do streaming so we kind of get the best of both worlds, and just have it all in one place.” That’s the vibe that I’m getting as I start to learn more about Spark.
Kevin: Mostly yes. The only part where I would levy a little contention is, you’re still going to be dealing with a lot of the Pokémon characters.
Eugene: Sure. Yeah, it sounds like it. I mean, I took a look at the Apache projects, and I’m like, “that’s a lot of names.”
Carlos: Yeah, and that’s why I say tool, because it’s not– well, I shouldn’t say it’s not a one-stop shop, but I mean, I don’t know, yeah.
Eugene: Well, I feel like it’s trying to be a front-end, even if it’s not a one-stop shop.
Carlos: Okay, that’s fair, that’s fair.
Eugene: I feel like it’s trying to coordinate some of the chaos that you run into the moment you leave– well, I say leave the Microsoft ecosystem, but obviously Azure Databricks is part of the ecosystem. But I don’t know, it’s just I think a lot of us, you know, as you’d say knuckle-dragging Neanderthals, coming from the Microsoft world are used to, “okay, here’s your 5 flavors of SQL Server. You’ve got Integration Services, Reporting Services, Analysis Services and Database Engine, enjoy.” And then you start looking into anything open-source, start looking into this data engineering, you start looking into anything that’s not Microsoft and you’re like, “there’s 30 different names I have to memorize, now.” And so, I think with Spark, you’re still going to have to learn a lot of these things, but at least, especially from an IT admin standpoint, you can say, “okay, like here’s our front-end. Hey, we just hired a data scientist. Here’s our Spark cluster, go have at it.” I don’t know, that’s my hope, and that’s my impression, but obviously I’m still picking this stuff up.
Carlos: Sure, and I mean I think that’s part of those pieces to look forward to is can it continue to add pieces or components to it to allow it to be that– yeah, I’m not saying it’s not a one-stop shop yet, in terms of processing, but yeah, I still think there’s a lot of things– cause you have to track the data, right? Where is the data going? How is it flowing and–
Eugene: Yeah, well, learning more about Spark helps me have a better appreciation for SQL Server Big Data Clusters. ’Cause I mean, we had an episode about Big Data Clusters, and I’d looked at it initially and I’m like, “what are they trying to do, here? This seems really goofy. Like they’re trying to turn SQL Server inside out and just put it as a light veneer.” But now that I’m learning more about Databricks, it’s like, “oh, maybe Microsoft wants their own Databricks, so they don’t have to pay licensing fees.” I’m sure it’s more complicated than that, but learning about this helps me better understand where Microsoft’s trying to go with Big Data Clusters.
Carlos: Yeah, and it gives you an on-premises option, as well, whereas Databricks currently doesn’t provide that. Interesting, okay. Last thoughts, guys?
Eugene: I am excited and terrified to be branching out beyond the SQL ecosystem, honestly.
Kevin: I would say, “give it a try.” If you have Azure credits, use your Azure credits. Small clusters? They can be pretty cheap. You’re talking a few bucks an hour. You can get started on a one-node cluster or two-node cluster just to try stuff out and that’s a very inexpensive way to get started. You don’t need the Premium tier stuff, because that price adds up pretty fast. Obviously in a corporate environment when you have to do Active Directory pass-through, when you have to do role-based access control, when you have that sort of segmentation, then you go to Premium, but for playing around, stick with the cheapest stuff. If you don’t even have Azure credits, there is a community edition of Databricks. It is the AWS version of Databricks but look for Databricks Community Edition and that will give you the website where you can get a free one-node cluster and store something like up to 10 gigs of data for non-professional usage, so there’s no reason not to. Like you can’t say, “I can’t afford it.” It’s absolutely free.
Eugene: Very cool.
Carlos: There you go. Okay, well, that’s going to do it for today’s episode, compañeros. As always, you can connect with us on social media. Eugene?
Eugene: Yeah, you can find me on Twitter @sqlgene, at sqlgene.com and hopefully in the next month or so I’ll have a course out on Pluralsight on Distributing Excel Files. Very, very exciting topic.
Carlos: There we go. I missed a shout-out, Nigel Foulkes. So Nigel, thanks for wishing us Happy New Year, and he did mention that he is paying your Pluralsight token.
Eugene: Oh, nice.
Kevin: Getting Eugene that cup of coffee.
Carlos: Listening to your Pluralsight courses.
Eugene: Dude, I need it right now. I need to make up about $10,000 in tax revenue between now and April 15th because of all the basement stuff.
Carlos: Oh, gosh.
Eugene: Nah, I’ve got a plan, but–
Carlos: Fun times.
Eugene: I’ll take what I can get.
Kevin: So, everybody go listen to Eugene’s Pluralsight courses.
Eugene: Please. Well, let’s see, I think I make like $2 or $3 per, so if 5000 of you could go watch one of my courses, I would be set and then Uncle Sam won’t have to break my kneecaps.
Carlos: Okay, Kevin?
Kevin: You can find me wherever consumption is treated.
Eugene: I believe that’s the 1800s.
Carlos: Again, that time machine. And, compañeros, you can reach out to me on LinkedIn at Carlos L Chacon. I will say for 2020, if you happen to know, at least in the States, a medical practice using Centricity or you happen to know a printer, printing shops using the EPMS software that might need a little help, we’d love to connect. So, let me know at LinkedIn, Carlos L Chacon. And compañeros, I’ll see you on the SQL Trail.