Big Data. Do you have big data? What does that even mean? In this episode I explore some of the concepts of how organizations can manage their data and what questions you might need to ask before you implement the latest and greatest tool. I am joined by James Serra, Microsoft Cloud Architect, to get his thoughts on implementing cloud solutions, where they can contribute, and why you might not be able to go all cloud. I am interested to see if more traditional DBAs move toward architecture roles and help their organizations manage the various types of data. What types of issues are giving you trouble as you adopt a more diverse data ecosystem?
“People ask what is big data? I say it’s really all data.”
“By having a data lake and also keeping a relational database store, you have the best of both worlds.”
“It’s really important when you’re looking…to build this large data warehouse and a data lake is to add a lot of time in there for data governance.”
“The people who really succeed in their careers were willing to take risks, to jump out of their comfort zone, to do something that they weren’t sure if they can do.”
Listen to Learn
01:07 Compañero Shout-Outs
03:49 Intro to the guest and topic
04:42 What is big data?
06:07 The difference between SMP & MPP – The cloud allows you to scale up and down
10:21 How fast is the data coming at me, and what do I need to do with it?
13:10 What does a data lake do for you?
18:11 Have a data lake and a relational database store – the best of both worlds
20:41 You need to have many tools in your kit to get the results you need
22:05 Data governance is more important than ever
24:37 How to get your data into SQL Server
30:11 Consider your team’s skills when deciding what tools to use
32:42 What is your tolerance for change? Are you open to open source?
35:05 It’s important to pick the right technology
36:24 SQL Family Questions
42:11 Closing Thoughts
About James Serra
James is a big data and data warehousing solution architect at Microsoft. He is a thought leader in the use and application of Big Data and advanced analytics, including solutions involving hybrid technologies of relational and non-relational data, Hadoop, MPP, IoT, Data Lake, and private and public cloud. Previously he was an independent consultant working as a Data Warehouse/Business Intelligence architect and developer. He is a prior SQL Server MVP with over 30 years of IT experience. James is a popular blogger (JamesSerra.com) and speaker, having presented at dozens of PASS events including the PASS Business Analytics conference and the PASS Summit, as well as the Enterprise Data World conference. He is the author of the book “Reporting with Microsoft SQL Server 2012”. He received a Bachelor of Science degree in Computer Engineering from the University of Nevada-Las Vegas.
blog URL: www.jamesserra.com
Carlos: Compañeros, welcome to the SQL Data Partners podcast. This is Episode 131. It is good to have you on the SQL Trail again today. Our topic is Big Data Solutions in the Cloud and our guest is James Serra from Microsoft. I actually met James in Richmond. He came to the SQLSaturday and he really enjoys the Mongolian Grill, so that’s an interesting fact you may not know about James.
Before we get into the conversation today, we do want to give a couple of Shout-Outs. We want to give a shout-out to Alex Branskey talking about Episode 130. He said, “every day that I don’t sound like an idiot is a win.” I definitely agree with you there, Alex. I just wish that I had a few more days, at least consecutively, perhaps. Shout-out to Jeremy Fry, we were talking about SQLSaturday Richmond and failed to mention him. We followed up afterwards and wanted to give a shout-out to him, posting some pictures out there on the race track when the speakers were getting together, doing our speaker event. That was a lot of fun. Shout-out to Henock Fikadu and Alexander Lawson connecting with me on LinkedIn, appreciate that, giving us some thoughts. Compañeros, it is with some sadness that we remember two of our compañeros from the SQL Family. Tom Rush, we lost earlier in the year. I didn’t really know Tom. He was from the West Coast and other than being in the same circles, I didn’t really have a chance to interact with him. I did meet him once. But the other SQL Family member we lost recently was Robert Davis. Robert, I did have a chance to get to know a little bit. In fact, he was a podcast guest on Episode 112 and probably knows more about SQL Server than anybody else that I know about. The saying is, he’s forgotten more about SQL Server than I will ever know. Very soft spoken and had been dealing with some health issues. I don’t know that anybody expected him to pass as early as he did and so we’re thinking about him and his family. It is one of those things that makes you stop and think about what’s important. Of course, we’re grateful, one, to their contributions that they’ve made to the SQL Family, but then two, for their friendship and the things that they’ve passed along, so they both will be missed.
In talking about the SQL Trail Conference, we are happy to announce that Angela Henry has agreed to speak. She’ll be coming up from the North Carolina area. She is one of the newest Microsoft MVPs in the area and so we’re excited to have her. The website is still being put together but by the time this episode will go live, there is some information out there, and again we’re hoping to start taking registrations on May 1st. The conference will be October 10th-12th in Richmond, so if you’re on the East Coast and you’re looking for something to do in the latter part of the year and don’t want to make it out to the West Coast, we invite you to check out the sqltrail.com for more information about the conference that we’re putting together.
I’d like to go ahead and jump into the conversation with James. Our podcast show notes for today’s episode will be at sqldatapartners.com/bigdata or sqldatapartners.com/131.
Carlos: James, welcome to the program. Thanks for coming and chatting with us. Today we have a big topic. We’re going to take on the idea of big data and how we might choose some technologies. I think we’re even specifically going to focus a bit more into the cloud. As a Microsoft person, we know that Microsoft has made huge investments in the Azure infrastructure. Talking a little bit about this today, now, I think in fairness, because there are so many technologies that are out there, we’re not going to get into all of them and we may follow up with those in different episodes. But to kick things off at first here, when we talk about data, we’re really just talking about the entire data ecosystem in all its shapes and varieties.
James: Yeah, and people ask, “well, what is big data?” and I say it’s really all data. It’s data, whether it’s structured or semi-structured or unstructured, relational or non-relational. The idea is, can I capture all this information and make better business decisions with it? That’s been difficult in the past, especially on-prem, because you’re limited by storage space and performance and maintenance windows. The cloud has opened it up to allow you to bring in all this data that you never would have been able to afford, because scalability is unlimited. I always tell customers, big data means you can capture everything.
Carlos: Yeah, that’s right. You think about the evolution, and obviously we’re now used to social media, for example, but even 15 years ago, as an organization, I didn’t have to worry about capturing social media information, but now I do. It may not be pertinent to my business, but maybe I’m interested in sentiment or how people view my brand. So, these are the types of things that I now have to take into account. Now, maybe I don’t want to capture all of that or keep it long-term, but it may be something that I want to take a peek at and look at. So yeah, it is: how do I go about doing that? You’ve mentioned some of the obstacles that we’re faced with. I think to help lay some groundwork, and ultimately the folks that listen to this podcast, most of them are technical in nature, maybe we talk about some of the differences, or I don’t know if criteria is the right word, but sectors or areas in which some of these technologies might come into play. One of the first that we talk about is processing types. You might have heard the terms SMP and MPP. Take us through those: what do they mean, and how do they affect us?
James: Sure. SMP is what we traditionally think of with products like SQL Server or Oracle, where we have this database and it has the ability to scale up if we need more horsepower, whether that’s CPU or memory. It works great until you run out of capacity, and then you hit a ceiling on how much you can continue to scale up. Generally, this is a system that is attached to a SAN, especially if you’re dealing with on-prem. Some of the issues, and this happened to me a lot when I was a DBA for many years: somebody would submit a query and it would take up all the resources, and you’d run the KILL command to clear them out so everyone else could get in there. Now MPP is different in that it can scale out, so if I need more horsepower, I can add additional servers and I can distribute this data in maybe a large database or data warehouse. One thing to know about SMP and MPP: when you get into MPP, you’re talking about a data warehouse and not an OLTP solution. If I have a large data warehouse, I can add more servers so I can scale out. What happens is the data’s distributed among all these servers, and the query is then distributed. I submit the query, it goes out, gets all the data, returns the values, scrunches them together and sends them back. The analogy I like to use, which a customer actually came up with, is: think of if I have a deck of cards and I want to go through and find the King of Diamonds. If I go through one by one, it could take a while. If I take those cards and spread them out and give 26 people two cards each and ask them all at the same time to find that King of Diamonds, it’s going to be a lot faster. That’s what we do when we parallelize these queries. Also, it’s a shared-nothing environment, meaning each query will get its own CPU and memory and disk, so you don’t run into the issue of one query colliding with others and taking all the resources, because they’re each given their own slice of that pie.
So, it comes into play when people are looking to build large data warehouses, whether on-prem or in the cloud. You want to look for an MPP solution, and the end result is queries go 20 to 100 times faster in an MPP solution than in an SMP one.
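The scatter-gather idea behind James’s deck-of-cards analogy can be sketched in a few lines. This is an illustration only, not how a real MPP engine works internally (a cloud data warehouse handles distribution, planning, and result assembly for you), and threads stand in here for the separate shared-nothing servers:

```python
# Sketch of the MPP "deck of cards" idea: spread the data across
# workers, run the same scan on every slice at once, then gather the
# partial results. Threads stand in for the separate shared-nothing
# servers a real MPP engine would use; all names are illustrative.
from concurrent.futures import ThreadPoolExecutor

def scan_slice(data_slice, target):
    # Each "server" scans only its own slice of the data.
    return [row for row in data_slice if row == target]

def mpp_search(data, target, workers=4):
    # Distribute the rows round-robin across the workers.
    slices = [data[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(scan_slice, slices, [target] * workers)
    # Gather step: combine every worker's partial result.
    return [row for partial in partials for row in partial]

deck = [f"{rank} of {suit}"
        for suit in ("Hearts", "Diamonds", "Clubs", "Spades")
        for rank in ("Ace", "2", "Queen", "King")]
print(mpp_search(deck, "King of Diamonds"))  # -> ['King of Diamonds']
```

Adding a worker here is the scale-out move: each slice gets smaller, so each scan finishes faster.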
Carlos: Yeah, it is a slightly different way of looking at that. Traditionally, I think, from on-prem it’s like, “well, how much memory and how much CPU do you think we need?” This is more about, “okay, well, how much data do I have,” to your card example, “and how far out do I want to spread that data? And then I’ll just attach compute to it and then we’ll figure it out. Oh, that wasn’t quite enough, let’s thin that out a little bit more and try it again.”
James: Yeah, and that’s the great thing about the cloud, the ability to scale up or down. There are products that allow you, on the fly, to scale up or down and even pause the compute so you can save money that way. You don’t have to worry about, “hey, let’s get as much as we think we need to handle that max period,” when everything else may be really slow, but we need to handle that big data. In one of my prior lives, I ran several websites and leagues, and Sundays were crazy and the CPU would bump up, and then on Monday and Tuesday and Wednesday, it would be at 2, 3 percent.
Carlos: There’d be nothing.
James: Yeah, and the cloud allows you to scale up or down to save money, and generally you have an unlimited amount of resources, so I can scale up to fit the needs no matter what, and then scale back down after hours and such. And with storage, it’s the same thing: there’s unlimited storage. So, I tell customers, “start out small, and if you find that’s not enough, you just scale up, and within a few minutes you have additional horsepower.”
Carlos: Now that’s interesting, you talk about starting small and then working up. Another consideration for the cloud is this idea of velocity, particularly when we’re talking about data. How fast is the data coming at me, and what do I need to do with it?
James: Yeah, that’s where it becomes a challenge, because as I said, big data is all data. So, it doesn’t matter what the variety or the velocity or the volume is. If we’re talking about velocity, we’re talking about streaming data now, and that’s a big thing with IoT (Internet of Things) devices. I was going through a project today on smart buildings and all the things they can do to check the building capacity, and for security, and for changing the temperatures in the rooms and the lighting. All these things they can do to save a lot of money, in addition to utilizing space and such, because they can capture this stuff with these devices. One of the guys I know did that to his home, and he tracks everything from when somebody opens the mailbox, to when the water level is low in his pond, to when somebody is walking across the driveway, to when his dogs are let out of the backyard because the doors are open. It’s great, this ability to capture all this information, and you need a solution that does that. Now companies can do that for a million different things. Connected cars are a big thing, and predictive maintenance is a big thing: maybe we can capture data and predict when something is going to fail and fix it before it actually fails. You need a solution that’s going to be able to take all this in. Maybe I want alerting, so as this data is coming in, I want to know in real time when the temperature gets too high. So, if you build this big data solution, you want to have the ability to capture this data no matter what volume, velocity or variety is coming in.
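The real-time alerting James mentions can be sketched as a simple filter over a stream of readings. In a cloud pipeline a stream processor plays this role; the field names and the 80-degree threshold here are invented for illustration:

```python
# Minimal sketch of real-time alerting on streaming sensor readings:
# flag any event whose temperature crosses a threshold as it arrives.
# The field names and the threshold value are assumptions.
def temperature_alerts(events, threshold=80.0):
    """Yield an alert for each reading above the threshold."""
    for event in events:
        if event["temperature"] > threshold:
            yield {"device": event["device"], "temp": event["temperature"]}

readings = [
    {"device": "hvac-1", "temperature": 72.5},
    {"device": "hvac-2", "temperature": 84.1},  # too hot -> alert
]
for alert in temperature_alerts(readings):
    print(f"ALERT: {alert['device']} at {alert['temp']}F")
```

Because it is a generator, each reading is checked as it arrives rather than after a batch completes, which is the essence of the velocity requirement.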
Carlos: Another interesting example. I was talking with a facial recognition salesperson. They were working on races, so this idea that you run a race, they take your picture, and you have your bib number. But they wanted to be able to track different things, like did you seem happy, were you interacting with other people, so they could take that, match it, and then do some analytics on it. We started talking about what the future of facial recognition would be. He thought there would be a day when you could actually go into a restaurant and pay for your meal just with your face. I’m not sure that I want that, but you could think about the types of organizations that then need to consume data. Again, that’s a very futuristic deal; he was thinking a handful of years out. But that idea that a restaurant now has to keep track of all of that and be able to gather that information? That’s definitely a different dynamic than just serving up food.
James: Yeah, that’s interesting.
Carlos: It is interesting what people are going to be tasked with. And then of course, this velocity, I think, also begs the question that all of the data can’t always be structured, because we don’t necessarily know what all of it is all of the time, either. We set up the new sensor on the lake or the mailbox, maybe an open and close, that’s fairly straightforward. But what if it’s gathering more information, like time, or how much sun was on the mailbox when it was opened, things like that? That may be something that I don’t want to look at right now, but in the future, I might. I think this segues into this idea of the data lake. Admittedly, the people that listen to this podcast, my compañeros, they know what a knuckle-dragging Neanderthal I am, but it didn’t click. I almost had this lightbulb moment, James. I’m embarrassed to admit that in preparing for this episode, I was like, “oh, a data lake!” I finally get it. The idea that literally your rivers, your streams go into a lake to hold water because they can’t contain it all, that’s this idea of the data lake. It’s like, “you know what, I can’t figure out what to do with it just yet, it’s coming in so quickly, and maybe I can’t process it or I don’t know what questions to ask of it, so let me stick it in my data lake.”
James: Yeah, that lightbulb moment. I have a lot of clients, and sometimes you have to go through examples and the pros of having a data lake. That’s the benefit of the cloud: you can use a data lake product that has unlimited capacity, and I can dump files in there whether I’m going to use them or not. I can put them in there and determine if they have value now or down the road, but at least I’m starting to capture these things, particularly when I look at IoT devices sending us information. The case I like to use is when I was a DBA and somebody said, “I want to use this data.” Because it’s schema on read versus schema on write, in the database world I would have to go and do the schema on write, meaning I have to create the database and the tables and the fields, write the ETL, move it in there, clean it up, and structure it all. Then they go and create the report and come back and say, “well, I didn’t really need that data, it’s not what I thought it was.” And you did all this work for nothing. The data lake, being schema on read, means I can dump the files in there and then have the power users or the data scientists go and see if that data has value. To me, it’s no work. It’s just putting the data in a data lake and letting them go and pull it out. The data lake has so many benefits for preventing the problem of somebody going, “hey, can you put this data in a database or data warehouse?” Yeah, okay, I’ll get to it, but it may take a month or two, and then they go off and get impatient and wind up creating their own data warehouse. Now you can put the data in there and say, “it’s there, go ahead and use it.” I always caution that schema on read means you’ve got to pay the piper somewhere, so when you’re pulling the data out, you’ve got to put the schema on there. So, you need to have power users who understand how to do that.
You can’t just go to the average user and say, “hey, just go use Hive or Pig and spin up a Hadoop cluster and pull the data out of there.” Because the data lake is just a glorified file folder, once you put the data in there it’s great, but it could be semi-structured, non-relational data, and they have to have the skills to pull it out and get value from it.
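The schema-on-read idea James describes can be sketched in a few lines: the raw file lands in the “lake” untouched, and the schema is applied only when somebody pulls the data out. The file contents and field names here are hypothetical:

```python
# Schema-on-read sketch: raw files land in the lake as-is, and the
# schema is imposed only at read time by whoever needs the data.
# The mailbox events and field names are invented for illustration.
import csv
import io

# Writing required no schema work at all -- the file is just stored.
raw_file = io.StringIO("mailbox-1,open,2018-04-01T08:30:00\n"
                       "mailbox-1,close,2018-04-01T08:31:00\n")

# The power user applies the schema when reading:
schema = ["device", "event", "timestamp"]
rows = [dict(zip(schema, record)) for record in csv.reader(raw_file)]
print(rows[0]["event"])  # -> open
```

Compare this with schema on write, where the table, fields, and ETL all have to exist before the first record can be stored; here that work is deferred until read time, which is the “pay the piper somewhere” trade-off.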
Carlos: Sure, somebody has to be able to pick it up and say, “Oh, I know what this is. I can make some sense of this.” I think this comes in handy, and I think it’s the challenge. This is going to be one of the questions that we get a lot on the podcast: “hey, as a DBA, how is my world going to change?” This is a great example: you are the beneficiary a lot of times, and sometimes you created it, but you’re the beneficiary of having those answers. I have an address table, a person, here’s the things I want to know. Now we don’t have those things, so being able to go through the questions and say, “okay, here’s what I don’t know, and here’s how I’m going to solve that. Here’s what I do know.” Putting these things together makes a lot of sense. I have a customer, a trucking company, that’s struggling a bit with this. They have a traditional SQL Server environment. They’re getting all of this data, and because things continue to change, sometimes the analytics of it, or what they’re expecting the data to be, is not quite there. The data actually gets put in the wrong place because it’s not what they were expecting. Then they have to go through and fix that afterwards. Whereas, if they had a data lake: “you know what, let’s just put it all there and then we’ll figure it out later, because we don’t need it right now. It’s not like we’re processing it immediately. We can do that tomorrow, in an hour, whenever it might be. I just need to consume it, let the other end of the application say, yeah, I got it, and then I can do something with it later.” That’s where this evolution is very interesting in how data flows through the cloud.
James: Exactly. By having a data lake and also keeping a relational database store, you have the best of both worlds in there. So, you can satisfy some users who want the data quick and they know how to get value out of it by putting that schema on it and they can use the data lake. Then most of the other users, they need it in a relational format so you can eventually move it to the data warehouse and put it in a nice and pretty format and maybe a cube and they can just click and drag and create their own reports. The data lake gives you so many other options, too, because it’s a giant staging area. You can capture the raw data, so if you find something wrong with the data warehouse, you can go back to the data lake and I have a raw layer, I have a clean layer, a presentation layer, all these different ways to go back and look at the history of the data and fix things. You can use the data lake as storage. Maybe I don’t want to put 10 years’ worth of data in my data warehouse. I want to save some money and put only 5 years in there and have the older data in the data lake and also use it for backup purposes. There’s so much you can do in a data lake and that’s why I stress data lake’s a great concept, but you also want to have your data warehouse in a relational store. People went down the wrong path of using the data lake for everything and it just didn’t work. Now, the general consensus is to have both.
Carlos: Who would have thought? They’re promoting a new feature that everybody gravitates towards and then they’re like, “hey wait a second, this is not exactly what we need.”
James: Yeah, I have stories of CTOs who thought the data lake could handle everything and they were going to replace all their data warehouses, relational and such. One customer came in and that’s what they wanted to do, all data lake. I strongly advised against it, but they said, “no, it’ll work. Our CTO came in and he said that’s what we’re going to do.” A couple of months later, this same company comes back to visit. I said, “well, what are you doing here so quick?” They go, “yeah, the whole data lake thing didn’t work out. They fired the CTO and now we’re going to use the data lake and a relational data store.” I heard that a number of times with my customers. It’s just basically giving them the proper use case for the data lake. It happens with all the technologies: there’s this hype that it can do everything, and then reality sets in. It’s my job to go to customers and explain the proper use case of these products. The fact is, when you create a big data solution, you’re going to use a lot of products. It’s not just, “hey, we have one product and we throw all the data in there.” It’s a lot of products. So, it’s more time upfront to build a solution, but you wind up having a solution that can handle data no matter what the velocity or the variety or the volume is. Anything that comes down the pipe, we can handle it.
Carlos: Right, there is no one-size-fits-all solution or way of doing this. To that point, big data encompasses everything, so you know you have to have many tools in the kit to be able to handle those different things, because each tool has its own specific things that it’s good at, and you want to use it for those pieces.
James: Yeah, and I hear people saying, “well, we don’t need cubes anymore,” which I strongly disagree with, because there are a lot of additional reasons to use a cube. No matter how fast your data warehouse is, a cube aggregates all this data and can return data in milliseconds. So, if I’m looking at attaching a dashboard, I generally don’t want to attach a dashboard to a big data warehouse, an MPP solution, because of concurrency, and because, while it may only take two or three seconds in a data warehouse, that’s too long for a dashboard where somebody’s slicing and dicing. The cube gives you the faster performance, and it creates a semantic layer on top of the star schema, so it makes it much easier for the end user to build a report off of it and to get into things like hierarchies and KPIs, and to roll out the security that can be put into a cube. It oftentimes becomes another part of this big data solution. The confusion is trying to understand when to use a cube, when you wouldn’t need it, and when you create data marts. That’s where I spend a lot of my time whiteboarding these solutions with customers.
Carlos: Right. Having all of these tools doesn’t remove your responsibility to actually understand what you’re trying to do with the data and the objectives that you’re trying to fulfill.
James: Yeah, and that’s a great point. Data governance, you need it more than ever. The other problem with the data lake is people thought, “oh, I don’t need much data governance in there. I’ll just put the data out there.” Again, it’s a glorified file folder, so if you just throw data out there, you’re going to get junk in and junk out. You need to spend a lot of time, I tell customers, on data governance. Creating layers inside the data lake can be a difficult process. How do I break out these folders, by date or by department or by some security boundary I need to have in there? It’s file-based, so what do I do if I have a file that’s got a lot of data in there for different departments and I want to have somebody only access one department? There are a lot of challenges with it, so you need to have that data governance, and part of that is knowing what data is put in the data lake. To avoid becoming a data swamp, you need to keep track of those things so people are not dumping duplicate copies of the data in there. There’s a lot that has to go on around creating a large data lake and data warehouses and data marts and cubes and reports. A lot of data governance needs to come into play, and there are tools you can use to catalog all of that. It’s really important, when you’re building out a project plan to build this large data warehouse and a data lake, to add a lot of time in there for data governance, especially the cleaning and the mastering of the data. Because as you incorporate tons of stuff in there, the problem is companies think, “oh, my data’s pretty clean.” Ninety-nine times out of a hundred, the data has a lot of issues and you need to spend the time cleaning it.
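One governance decision James raises, how to break out the folders, is often handled with a fixed path convention per zone, source, and date, so raw and cleaned copies stay separated and are easy to secure. The zone names below mirror the raw, clean, and presentation layers mentioned earlier in the episode; the source name and date are invented for illustration:

```python
# Sketch of a governed data lake folder convention: zone / source /
# date. Enforcing one path-building function keeps people from
# dumping files into arbitrary locations (the "data swamp" problem).
from datetime import date

VALID_ZONES = ("raw", "clean", "presentation")

def lake_path(zone, source, day):
    """Return the canonical lake folder for a zone, source, and day."""
    if zone not in VALID_ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return f"/lake/{zone}/{source}/{day:%Y/%m/%d}/"

print(lake_path("raw", "sensors", date(2018, 4, 16)))
# -> /lake/raw/sensors/2018/04/16/
```

Partitioning by department instead of (or in addition to) date is the same idea with another path segment; the point is that the layout is decided once, upfront, as part of governance.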
Carlos: Yeah, it may not be major issues, but at least they are issues that have to be considered. There are exceptions, there’s missing data, things like that. Normal things that those of us who work with data see.
James: Yeah, a quick example. When I was a consultant building data warehouse solutions, they would say, “oh, my data’s clean.” And I would say, “I’ll bet you 100 bucks right now I’m going to find problems.” I won that bet every time, because there will be fields, for example, birth date, and you find out people don’t know the birth date, but they have to put a date in there, so they put some date in the future or some date 200 years in the past. All of a sudden you get this data and you go, “well, look at this, this is not valid data. What do we do about it?” That’s where the data cleaning comes up. You can imagine a lot of instances where people just put junk data in there. You’ve got to find a solution for how to convert this data, and that’s where the data governance comes into play.
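James’s birth-date example is easy to turn into a concrete validation rule: flag dates in the future or implausibly far in the past instead of trusting the feed. The 120-year cutoff and field names below are assumptions for illustration, not rules from the episode:

```python
# Data-cleaning sketch for the birth-date example: find rows whose
# birth date is in the future or more than ~120 years ago, which
# usually signals placeholder junk. The cutoff is an assumption.
from datetime import date

def invalid_birth_dates(rows, today=None):
    """Return the rows whose birth_date falls outside a plausible range."""
    today = today or date.today()
    oldest = date(today.year - 120, 1, 1)
    return [r for r in rows
            if r["birth_date"] > today or r["birth_date"] < oldest]

people = [
    {"name": "ok",      "birth_date": date(1980, 5, 1)},
    {"name": "future",  "birth_date": date(2199, 1, 1)},  # placeholder junk
    {"name": "ancient", "birth_date": date(1750, 1, 1)},  # placeholder junk
]
print([r["name"] for r in invalid_birth_dates(people)])
# -> ['future', 'ancient']
```

The harder governance question, what to do with the flagged rows (reject, default, or send back to the source), still has to be decided by a person; the code only surfaces them.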
Carlos: So now, let’s chat a little bit about how life will change. Again, we’ve been focusing a bit on the data lake idea, but we put some of this data into the cloud, and we’re now using a hybrid model. So, I have the data in the data lake, you mentioned file folders, and I still have a traditional SQL Server. Most of us listening to this show, we talk about SQL Server a lot. It’s what we know and love. How am I going to get that data into my system? Let’s talk a little bit about that relationship and how that would work.
James: Sure, and you’re referring to ETL and ELT, different concepts.
Carlos: Yeah, is it as simple, I mean, you mentioned a file or my data is now in a folder. How am I going to get that data into my SQL Server? Can I just query it, if it’s an Excel document? Can I just load up an SSIS package and throw in the Excel data source and bring it in? Or what does that look like? What additional considerations do I now have?
James: Yeah, there are a lot of challenges, especially if you have a lot of data coming in at one time. Just like I used to do in the 1990s with data sitting in a folder, it still applies. I can use SSIS if I’m in the Microsoft world to pull that data in, and I can do that in the cloud. There are various technologies that now allow you to have data sitting in a data lake and use a product like SQL Server with a technology called PolyBase, where I can point to that file sitting in Hadoop or a data lake store and give it structure by creating an external table. I can create that table definition, point it to that file sitting in Hadoop, and I can run SQL on top of it, and the end user doesn’t even know that it’s sitting in Hadoop. There are various ways of using those technologies to load data, or to keep it where it’s at and query it, and even push down the queries so you get sort of a federated query approach, a logical data warehouse. There are a lot of options. Of course, for best performance, you want to load the data in, so all the solutions that we’ve always used on-prem we can use in the cloud, plus additional ones. This is where we get into the question of the format of the data. SQL Server now supports JSON data, and it actually works pretty well up to a certain point. If customers are dealing with IoT devices that are pushing out data in JSON format and there are millions of these, that’s where a NoSQL solution might be better, because it can handle data in JSON as its native format. It’s a totally different world, but we see that when we’re dealing with millions of transactions per second and the data is in JSON, a NoSQL solution may be a better option for a company.
Carlos: Okay, so a standard approach, just to reiterate: if I have data in my data lake, I’m going to ingest that with Hadoop first and then use PolyBase to get it into SQL Server? Is that the most direct route? Let’s say, going back to the mailbox, I want to capture opens and closes of my mailbox, and that’s going to be dates, date formats, and that’s what I’m pulling in. SQL Server can do dates. Can I just connect to that and pull it down, or am I going through those, I’ll call them hops, of Hadoop, PolyBase, to SQL Server?
James: I can put it into Hadoop, a data lake store, which is just storage. Now I’ve got to pile some compute on top of that. I can do things like fire up a Hadoop cluster, and it can go and process that data sitting in the data lake. Maybe I want to clean it and join it and master it and write it back out to the data lake. I can do that outside of my SQL Server, so I’m saving the performance hit on SQL Server, especially if it’s 24/7 and you have users you don’t want to collide with. I can do some of that processing in the data lake using Hadoop technologies. As well, if the file is in a format, say CSV or pipe-delimited, where I can put schema on it fairly easily, I can decide to create that external table pointing to it, keep it in that format, and query it using PolyBase technology without ever having to pull it into the database. Or I can say, “well, I need to pull it into the database,” and use the PolyBase technology to pull it in. Or I can use the various bulk-loading mechanisms that have always existed and pull it into SQL Server. So, the challenge is finding the best way to do it, and a lot of it just comes down to how much data we’re talking about. Because I can put data in a folder and, every few minutes or every hour, query it and pull it in using SSIS or Azure Data Factory. There are so many options out there. What I try to do with customers is go over all of these options, the pros and cons, and give them the best approach that fits their particular situation.
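The “data sitting in a folder, pull it in on a schedule” pattern James describes can be sketched as a small polling loader: scan a drop folder, ingest only files not seen before, and remember what was loaded. In practice SSIS or Azure Data Factory plays this role; the load step here is a stand-in:

```python
# Sketch of a scheduled folder loader: each polling tick ingests only
# the files that have not been loaded yet. The load_fn callback is a
# placeholder for a real bulk load into SQL Server.
import os

def load_new_files(folder, already_loaded, load_fn):
    """Call load_fn on each file in folder not yet ingested; track it."""
    loaded_now = []
    for name in sorted(os.listdir(folder)):
        if name not in already_loaded:
            load_fn(os.path.join(folder, name))  # e.g. bulk load step
            already_loaded.add(name)
            loaded_now.append(name)
    return loaded_now
```

Calling this every few minutes with the same `already_loaded` set means each file is ingested exactly once, which is the property the scheduled-pull approach depends on; a production tool would also handle partially written files and failures mid-load.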
Carlos: Yeah, again, we’re back to square one of what is it that I want to do? What questions am I asking, and how do I want to go about solving those? There are a number of different components to consider: cost of the compute, what my current infrastructure looks like, and all of these things come into play. Unfortunately, there is no easy button, it sounds like.
James: Yeah, I wish there was an easy button on eBay. I looked, I couldn’t find it. Magic wands don’t exist on eBay, either. The other point that you sort of hit on is that there are customers where I can say, this is the ultimate solution performance-wise, but it may be too expensive, so then I’ve got to talk to them about alternatives that won’t give them the performance but will save them the cost. When you get into our technologies, you may have storage, but you may have hot and warm and archive storage. It’s all cheaper storage, but it adds some complexity to the solution. It’s about understanding those options and picking the one that’s going to be best for you when you weigh performance against cost.
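To make that performance-versus-cost trade-off concrete, here is a small Python sketch comparing the monthly bill for data kept entirely in hot storage against the same data spread across tiers. The per-GB prices are hypothetical assumptions for illustration only; real tier pricing varies by provider and region, and archive tiers add retrieval costs and latency that this sketch ignores.

```python
# Hypothetical per-GB monthly prices; real tier pricing varies by provider
# and region, and archive retrieval fees are not modeled here.
PRICE_PER_GB = {"hot": 0.020, "warm": 0.010, "archive": 0.002}

def monthly_storage_cost(gb_by_tier):
    """Sum the monthly storage bill across tiers."""
    return sum(PRICE_PER_GB[tier] * gb for tier, gb in gb_by_tier.items())

# 1 TB kept entirely hot vs. the same 1 TB split across tiers.
all_hot = monthly_storage_cost({"hot": 1024})
tiered = monthly_storage_cost({"hot": 128, "warm": 256, "archive": 640})
print(round(all_hot, 2), round(tiered, 2))  # 20.48 6.4
```

The tiered layout is roughly a third of the cost in this made-up example, which is exactly the kind of saving that has to be weighed against the added complexity and slower access James mentions.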
Carlos: Well, and then I think, again, you posed a couple of questions for folks considering cloud solutions. The one that we haven’t touched on is what skills your team has. You kind of threw Hadoop out there, but what would it take to get a Hadoop person into the organization, or do I have someone who wants to learn that? What kind of learning curve is that going to cause for me? Maybe it’s, and I don’t know if standardize is the right word, standardizing on the Hadoop cluster, but what does that then mean for my organization?
James: It’s a great point. This is one of the major questions I ask customers when they say, “hey, we want to build a solution”. I’ll ask them, “okay, can you use the cloud? Is this a new solution or a migration? Do you have Hadoop skills? Will you use non-relational data? What’s the volume or velocity of this data?” It goes on and on. As I ask those questions, I narrow the solutions down in my head to then whiteboard and discuss with them. The big one is your current skill set, because while Hadoop is a great solution in some respects, most of the companies I deal with, the bigger companies, don’t have Hadoop developers, just SQL Server developers who’ve been using it for years. If I come in and talk about open source, they’re going to say, “no, we don’t want open source in there.” That limits a lot of the options right away, but there are plenty of solutions you can build without using any Hadoop or open source. When I go to user groups and ask how many people know about Hadoop, everybody raises their hand now. When I ask how many have solutions in production, you maybe get a handful, maybe 5 or 10 percent. In particular, you just don’t see it among the larger companies, for a bunch of different reasons. But it tends to find its place in little aspects; maybe for real-time ingestion I’m going to use Kafka. It becomes a question of open source being free, but nothing is free, in that there are other costs associated with it. It could be that you have to pay for support, and that’s where Hortonworks or Cloudera comes into play. Or you have to train your staff, and that can be overwhelming when you add up the costs and the time commitment.
Carlos: Right, and then how long is it taking to get up to speed? Am I missing out on opportunities by not plunging ahead?
James: Yeah, and those are a lot of the indirect costs. What we try to convey to customers is: I can go into the cloud, press a few buttons, and instantly have a server I can start building a solution on, versus the weeks or months it’ll take to do it on-prem. You can’t really put a dollar figure on that in an Excel spreadsheet; you generally have to pull something out of the air, but that goes into the indirect benefits you get.
Carlos: Another question posed is your tolerance for change. You kind of mentioned open source versus, “hey, we’ve been using SQL Server, maybe we don’t want to break out of that stack.” Another consideration is it seems like every day there’s some new feature or technology that’s available on Azure, but a lot of it is first available in preview. Not to say that it’s not functioning, but it also means that you may have to change as the product evolves. Is that something that you’re open to, as well?
James: That’s a challenge we have at Microsoft, because we want to know about all the things that are about to come into private preview so we can inform customers about what’s coming down the road; if they want to be part of that leading-edge technology, it’s something we want to offer them. That’s also where people push back on open source: it changes so frequently, and you have all these products that you may have to put together across different versions. The only reason Hortonworks exists is that they don’t create anything new; they have a data platform with 22 open source products on it that they’ve tested to work together, and every few months they come out with a new version of the platform because they’ve tested all that stuff together so you don’t have to. You’re paying them just for that, and for support. That’s where people don’t quite understand the open source world: making things open also means they can change frequently, and a lot of times that’s too frequent. You look at customers, some of them are still on SQL Server 2005 or older. The pace of change is too fast for them.
Carlos: That’s right. These are all interesting questions that we still have to tackle. Lots of new features, lots of opportunities, but it doesn’t change some of the basic facts about what it is we’re trying to do, what problems we’re trying to solve and what our team knows.
James: Exactly. I just stress that the great thing about big data is that data is the new gold, or the new oil; I hear all these sayings. I can get better insights into my data than I’ve ever been able to before. When I talk to customers, I like to have lightbulbs go on over their heads when I talk about IoT or social media data and ways they can use data they’ve never thought about. I can pull in weather data, competitor data, and that data becomes valuable: I can make better business decisions, get ahead of my competitors, and generate more sales for my company, or whatever it may be, because of the extra data I’m getting. It’s a long road to building solutions in the cloud, and you can educate yourself or you can reach out to, in my case, I’m a solutions architect for Microsoft, but lean on other people to learn about these technologies. I always say I try to make customers look like heroes, because I do a knowledge transfer to them, and then they go talk to their managers and help make better business decisions about what they’re going to build. I’ve been involved in a lot of projects, especially data warehouses, and when I’ve seen them fail, in most cases it was because they chose the wrong product. Not that the product was bad, it’s just that they used it for the wrong use case. So it’s very important up front, when you build a solution, to look through all the products, understand the proper use cases, and make the right decisions that will last for a long time, as opposed to getting 6 months or a year down the road and realizing, “uh oh, we picked the wrong technology.”
Carlos: Right, exactly. The ease of picking a technology may not equal ease of implementation or the usability you’re expecting.
James: Right, SQL Server is great but it can’t be used for everything, so you’ve got to realize its limits. Yeah, you can code just about everything in there, but there may be other solutions, other products that are much easier for your use case to build a solution around.
Carlos: That’s right, and get you there faster.
Carlos: Awesome. Okay, let’s do SQL Family.
James: SQL Family it is.
Carlos: How did you first get started with SQL Server?
James: Well, this will show how old I am, but I remember in 1989 I was working for a government contractor and they brought in this computer that had OS/2 on it and this thing called, I think it was called Sybase Server. I know it was a collaboration between Sybase, Ashton-Tate, and Microsoft. I remember playing with it on OS/2 and thinking, “this is awesome”. This was a step above dBASE IV or III, whatever it was at the time. That was my first foray into it. Shortly after that, we started building solutions on it, and eventually Microsoft became the sole provider and, I think, came out with 4.21. I don’t even want to calculate how long ago 1989 was, but it’s been quite a long time and quite a lot of change. It’s always been a part of my career, and I’ve been very happy with it.
Carlos: Very nice. We have seen a lot of changes over the years. Obviously, you’ve gone onto bigger and wider pastures, to say the least, but if you could change one thing about SQL Server and we can expand that to the ecosystem, potentially, what would it be?
James: One thing I wish they had, and maybe this will come down the road, and working for Microsoft I don’t have any inside knowledge of it, is the ability to scale out. I mentioned before that when you get into the MPP world you can do that, but if you look at the OLTP world, how can I take a system that has massive amounts of writes? I can scale it up, but I can’t scale it out, and I wish there was a way in the future to have that option. Think of it as distributed writes: I could have multiple masters in there. What’s changed in this world of the internet and web apps or phone apps is the ability for millions of users to jump on and use a system, which we never had to worry about before. Scaling out has become a big issue. The only solution has been to jump to the NoSQL world, and it’s very different. I wish there was a way to do that with regular SQL. Companies are trying that with NewSQL, as sort of a go-between between NoSQL and regular SQL, but I wish it was easier to just say, “I want to scale out and add more write nodes,” or whatever it may be, with SQL Server.
Carlos: Interesting. Yeah, I think Cosmos DB, they might come back with that, but again, it’s another technology that you have to then adopt and adapt to.
James: Yeah, your choice is either going to Cosmos DB, a NoSQL solution, which is very different, or creating some sharding mechanism, which, yeah, you can do, but it adds a lot of complexity.
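As a rough illustration of what a do-it-yourself sharding mechanism involves, here is a minimal Python sketch that routes each write key to one of several shards by hashing. The shard names and key format are hypothetical. Even this toy version hints at the complexity James mentions: every writer must agree on the routing function, cross-shard queries need extra plumbing, and adding a shard later forces existing keys to be rebalanced.

```python
import hashlib

# Hypothetical shard names standing in for separate database servers.
SHARDS = ["shard0", "shard1", "shard2", "shard3"]

def shard_for(key: str) -> str:
    """Route a key to a shard by hashing it (a simple modulo scheme)."""
    digest = hashlib.sha256(key.encode()).digest()
    return SHARDS[int.from_bytes(digest[:4], "big") % len(SHARDS)]

# The routing is deterministic, so every writer sends a given customer's
# rows to the same shard; changing len(SHARDS) remaps most keys, which is
# exactly the rebalancing headache that makes sharding complex to maintain.
print(shard_for("customer:42") in SHARDS)  # True
```

Schemes like consistent hashing reduce how many keys move when a shard is added, but they add yet more machinery, which is why managed NoSQL stores handle this for you.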
Carlos: That’s right, and then you’ve got to maintain it all. I hear you. So, James, what’s the best piece of career advice you’ve received?
James: I remember early on, one person that I knew at a job I had, when I was trying to decide if I wanted to go do consulting. His response was, “sometimes in life we’ve just got to take risks.” What I’ve noticed in my career is that the people who really succeed were willing to take risks, to jump out of their comfort zone, to do something they weren’t sure they could do. For many years, when I changed jobs, I always thought, “maybe I don’t know what I’m doing”, that’s imposter syndrome, but I had to get over that and take those risks, and it’s paid off. For the people who don’t want to jump out of their comfort zone, a lot of times their careers stagnate. When people say they have 5 years’ experience, I ask, “do you have 5 years’ experience, or is it one year repeated 5 times?” You always have to look at your career. To get ahead, you have to learn new technologies, and that may mean leaving your current job and switching to another one.
Carlos: Yeah, and that can be such a tough thing to make that leap and particularly painful when it doesn’t pay off. But the willingness to take those chances can definitely pay dividends.
James: Yeah, it doesn’t always pay off, but most times it does. Look at famous scientists: they failed many times before they eventually got things to work. You’ve got to think of careers the same way. In most cases, you’re going to go to a new experience, learn new technologies, meet new people, and network, and even if it doesn’t work out as expected, what you’ve learned during that short period is going to apply and make it easier to find other jobs. That’s why you always have to keep your technical skills up, and that may mean, if you’re not getting it in your current role, moving on to a new role. I frequently give presentations at SQLSaturdays where I talk about building your career and taking those risks. For the people who just can’t get over that, I try to help them look at the brighter side of things and how much better their careers could go if they made that jump.
Carlos: James, our last question for you today. If you could have one superhero power, what would it be and why do you want it?
James: I definitely would want to be the Flash, because traveling is such a time-consuming thing, especially in my case. I live in the New York metro area, and getting around to clients can take forever. Jumping on planes, jumping on trains, jumping on subways, walking, driving. If I could just be at a client within a short period of time, all that extra time I would have to go learn new things or deal with more clients would be awesome. So, if you have the ability to give me that Flash power, I would really like that.
Carlos: Sorry, we’re giving that away next episode.
James: I’ll pay extra for it, if you have it.
Carlos: Well, we’ll see if we can’t work out something. Awesome, James, thanks so much for being with us today. We’ve appreciated the conversation and you joining us.
James: Nothing better than talking technology and then throwing some interesting questions at the end, so I really enjoyed this and really glad to be here.
Carlos: There you go, compañeros, I hope you enjoyed that conversation. I do think it was interesting; we’d talked first about facial recognition, and I’ve actually started to see Apple Pay ads that appear to allow you to do this. They’re obviously very short on details, but they say Pay With a Nod, so somehow they’ve worked out, or are starting to work out, the system, so it’ll be interesting to see how that continues to play out. My big takeaway is that big data is all data. Regardless of its form or its variety, all of the information that you have to manage can be encapsulated under the big data umbrella. I think this is where we, as professionals, have to separate the marketing a bit from the problem that we’re actually trying to solve. James and I talked a little bit about this, and if you’re unsure of where you might go in the future, this is a problem that many organizations are going to have. Being able to think about, “okay, how do I temporarily store my data in a lake, and then what’s that data going to look like as it moves through the organization?” I think that’s going to be a very interesting problem, and one that is going to take more than one person to solve in a lot of cases. I think that’s a very interesting scenario for us as administrators: if you’re looking for a move, that might be a natural fit. It also made me think that we need to have a conversation just on the data lake itself. We got into it a little bit, and I know we’ve talked a bit about Polybase as well, but that idea of, “okay, now my information is in that lake, how do I get it out?” is a topic I’d like to explore a bit more. This also reminds me that this is one of those scenarios where, again, separating out that marketing, we get brought a tool and we’re told, “hey, use this.” Well, okay, we need to ask some questions first.
I think if we can identify the problems that we have, there are tools out there that can help us solve those problems, but if we’re trying to implement a tool without understanding the problem, that’s where we’re going to run into issues. I think James illustrated that when he mentioned organizations saying, “oh, we’ll just go to a data lake and that will solve all our problems.” More often than not, the tool will not solve our problem, but it will allow us to address specific issues that we’re having. I think that’s another area we’re going to have to be cognizant of: as people approach us with things they’ve heard from the marketing side, how are we going to address them, and how do we ask the questions that let us give good feedback or good guidance on what may or may not be appropriate?
That’s going to do it for today’s episode. Thanks again for tuning in. We’re always interested in what you have to say, so reach out to us on social media. You can connect with me on LinkedIn, I’m @carloslchacon, and we’ll see you on the SQL Trail.