Carlos: Compañeros! Welcome to another edition of the SQL Data Partners Podcast. My name is Carlos L Chacon, your host, and it is good to have you on the SQL Trail once again. We’ve been very busy this summer, so thank you, compañeros, for hanging in there, for sticking with us. We’re happy, in this episode, to have Jason Horner and Cathrine Wilhelmsen with us. We are actually going to break this episode up into two pieces. We had some great conversation and I’m sorry that we had to break this up, but I felt like it made the most sense to do so. We are going to be talking about Is SSIS Still Relevant? Really, this is getting more into the tooling and Jason and Cathrine are going to give us some primer on some of the new tooling that’s been available, particularly in the Azure space, about moving data around and patterns around that. That’s ultimately the conversation that we’re going to be having today.
Before we get into that, I would like to give a couple of Shout-Outs. The first to David Pless, Kathir S, from Lucky on Twitter. In addition to Kevin Carter, Bill Slate, and Francisco Tapia. Francisco wants to know when we’re going to be heading out to sunny California. That is a great question, Francisco. Maybe we’ll take the show on the road here, in 2019, and get out there. Admittedly, I’m kind of an East Coast guy and don’t get out there as often as I should. But Francisco, you may have changed my mind, there, so we’ll see next year about doing something out there.
We should bring up the SQL Trail, as well. I was asked this week, as we’re getting closer– October 10th through the 12th in Richmond, Virginia, we’re bringing our version of a data platform event to the area. I was asked, “Are you going to have a focus on Azure?” When I first thought about that question, I thought, “well, we’re going to be talking about topics that the attendees want to talk about.” Then I started thinking about it a little bit more: we are having our Azure Workshop, and it’s been a little while since I mentioned that. So, Melissa Coates and Meagan Longoria are going to be putting on a full-day session, Building Modern Analytics in Azure. They’re going to be covering not everything that we’re going to be talking about today, necessarily, but a lot of the same types of ideas, and we’re going to get some hands-on experience with some of these tools. So I thought that would be a very interesting way, even if that’s not your forte, to get some great exposure into what’s possible, so that, again, at least you can start speaking to it, understanding it, and in this case, have some experience with it. We’re really excited about this session that Melissa and Meagan are going to be giving. The workshop is included in the price of the ticket, or you can purchase a ticket just to the Friday workshop, if you’d like to do that. You can get more information at sqltrail.com.
Again, it’s been a little while, but it’s time for a little SQL Server in the News. This goes back to the July release of SQL Operations Studio. Again, it’s kind of exciting to see all of these changes. Of course, Management Studio continues to get updates regularly, as well. One of the cool things I thought they added to Operations Studio was support for the SQL Server Agent pieces, so you can view and edit operators, proxies, jobs and job steps. As a point of separation, if you will, from some of the other tools, they now actually give you a little bit of graphing on the jobs, so you can see some history as to the duration and whether each run was successful or not. They give you that in color-coding and in the size of the bars, so I thought that was an interesting addition and a neat way for them to continue to enhance the product. There are a couple of other things they’ve added: some profiler extensions and some additional wizards. If you haven’t used it in a little while, it may be time to check out Operations Studio once again.
With that, let’s go ahead and get into our conversation for today. Our show notes will be at sqldatapartners.com/SSIS or sqldatapartners.com/144.
Carlos: Cathrine and Jason, welcome to the program.
Cathrine: Thank you.
Carlos: Yeah, it’s great to have you. So we’re tag-teaming a little bit today, and because we all know that two heads are better than one, it’s nice to have you both on again. You’re both previous guests, so it’s fun to be able to revisit some of these topics. Ultimately, our conversation today focuses on, I would say, the ideas or concepts of integration. Ultimately, we’re data people, and if we haven’t already, it doesn’t take very long to get the request, “I have data in system A or place A or in Excel A and I need to get it over to another place.” So, that’s what we’re talking about, and it’s funny how this process has evolved. It’s almost a little subsystem; I know lots of folks in the community or elsewhere make their niche just in doing this, in moving data around. So, one of the big, I don’t know if it’s a debate, debate’s probably a strong word, but one of these things: we’ve traditionally referred to this as an ETL process. Extract, transform, load. And again, maybe movement is a strong word, but I kind of feel like there are a lot of people talking about ELT: extract, load, and then transform. Maybe let’s jump into this and talk about what the difference is, whether there really is a difference, and why I might choose one over the other.
Jason: Sure, I’ll start that off and then kind of hand it over to Cathrine. You’re absolutely right to call out, one thing I want to tag back to is this notion of data integration, so it’s not just about ETL and ELT. There’s just a need to move data between systems in an OLTP context and in other contexts like that, and traditionally it all fell under this blanket of EIM, Enterprise Information Management. But what we’re really talking about today is more focusing on it from the standpoint of populating a data warehouse, and so there are the two approaches you called out: extract, transform and load, and extract, load and transform. Really, I think the genesis of the different approach is large volumes of data.
Carlos: Yeah, everything is just getting bigger.
Jason: Yes, exactly. The four Vs: the volume, variety, veracity and velocity. And so traditionally you would use SSIS as your transformation tool, and that has some scalability limitations. In 2017, they introduced the scale-out, but you can’t scale out a data flow, so that’s a problem. The issue is that you only have so many resources locally on that machine, and so on and so forth, so if you’re trying to process a terabyte of data in an hour, it’s probably not going to happen in a traditional ETL approach. So Cathrine, maybe you can talk to us a little bit about the ELT story and what that’s about.
Cathrine: With ETL, the transform happens in the middle, meaning that you do the transform before you load the data at the end. Obviously, there are some limitations there on how much you can transform inside one pipeline. And then came, well, you could say some new tools, some new capabilities, better hardware, the cloud, all of these things that kind of changed the way we could do that. Where, instead of doing all of those transformations and then loading the data, we kind of just replicate the data by loading it first, and then we can do the transformations inside of that same process. I wish we had one of our pictures to show now, because that would make the difference slightly easier to explain, I think.
Carlos: I think at the end of the day, it almost seems at first, like a slight step back, because you’re basically just making another copy of the data. You’re bringing it over from the source and you want to store it. Now, maybe you’re not bringing all of it, maybe there’s some filtering going on, but as far as what it looks like and how it’s stored, it’s the same. It feels like, “oh, well, why am I just making another copy?” I think the problem is, or one of the issues has been is that you start transforming it, and then the business approaches you with, “oh, hey, what about column X?” And then you haven’t put that into your transformation and you’re like, “oh yeah, how do I do that now?” With this model, okay, I already have the sourced data, I can now manipulate that a little bit easier, without having to potentially do as much rework.
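The trade-off Carlos describes can be sketched in a few lines of plain Python. This is an illustrative toy with made-up data, not any specific tool: in ELT the raw copy is already landed, so when the business asks "what about column X?", you only widen the transform instead of re-extracting from the source.

```python
# Toy illustration of the ELT idea (hypothetical data, no real tooling).
raw_source = [
    {"id": 1, "amount": 100, "region": "East", "column_x": "a"},
    {"id": 2, "amount": 250, "region": "West", "column_x": "b"},
]

# Load first: land an untouched copy of the source rows.
landed = [dict(row) for row in raw_source]

# Transform later: pick only the columns the business asked for.
def transform(rows, columns):
    return [{c: row[c] for c in columns} for row in rows]

warehouse = transform(landed, ["id", "amount"])

# When the business asks "what about column X?", no re-extract is
# needed -- the raw data is already landed, so just widen the transform.
warehouse = transform(landed, ["id", "amount", "column_x"])
```

The landed copy really is "the same as the source," which is exactly the point: the rework happens only in the transform step.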
Cathrine: Yeah. With traditional ETL, because it was rare that you could actually do everything inside one single pipeline, you still had what we usually refer to as the staging layer. So you would have to make an additional copy of that data before you could process it, in many cases. Even though it might look like a step backwards, it’s similar, but like you mentioned with the ELT process, you can also kind of expose that more raw data and enable new things. For example, a data scientist might want to work on all of that data and look at it without having it aggregated and transformed from a data warehouse perspective. So, there are a lot of different opportunities here, and especially looking into different technologies, we’re talking about data lakes instead of just the data warehouse. There are a lot of new opportunities with this; you just mix a couple of the letters up and you get a whole new world of things you could do.
Carlos: Is that, then, the crux? Now you’ve introduced a new source, potentially, in the data lake, but is it when I get to the point where, “oh, I have a terabyte of data, therefore, I must be going from ETL to ELT”? Or are there other decision points or other factors I need to consider between the two components?
Jason: I think one of the things to consider is the tool that you’re using. SSIS is primarily an ETL tool. It provides those in-process transformation capabilities. The current version of Azure Data Factory, V2, doesn’t support that kind of transformation, but it supports the notion of compute activities, which allow you to transform the data after you load it. So, the general pattern we’re seeing arise in Azure Data Factory is this concept of moving the data, so the loading, and then transforming as a secondary step. What’s interesting about that is it gives us quite a few different options to do the transformation. For example, I can call a stored procedure on a SQL database or a SQL DW and do that in a more traditional, I’ll call it a little-data approach. Or I can use some of the modern big data architectures. I could use HD Insight, so I could kick off a Pig job. I could kick off a USQL task inside of Azure Data Lake Analytics to process my data lake data, or I could use Azure Databricks and use Notebooks to do more of a data science-type engineering approach to a big data pipeline. But it’s an interesting question, because if you think about it, if you go back to a more traditional, old-school Inmon model, really, what we’re talking about with ELT is almost a type 1 ODS, where we’re replicating all of the data. We could even use transactional replication to move all of the data from these operational systems and off-load that reporting workload, but to Cathrine’s point, we’re not massaging the data, we’re not transforming it, we’re just kind of staging it down. Then, if anybody wants that raw data, so Cathrine brought up the concept of this data science use case or exploratory data warehousing, they can just access that data without it being baked. I think that’s a key thing to understand.
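The load-then-transform pattern Jason describes can be sketched as a pipeline definition. The structure below is a simplified, hypothetical shape expressed as a Python dict, not the real Azure Data Factory JSON schema; the activity names and property names are invented for illustration. The point is the dependency: a copy activity lands the raw data, and only then does a compute activity (a stored procedure here, but it could just as well be a U-SQL, HDInsight, or Databricks job) transform it.

```python
# Hypothetical, simplified pipeline in the spirit of ADF V2's
# copy-then-compute pattern. Names and keys are illustrative only.
pipeline = {
    "name": "LoadThenTransform",
    "activities": [
        {
            "name": "CopySalesToStaging",       # E + L: move the raw data
            "type": "Copy",
            "source": "OnPremSqlSales",
            "sink": "SqlDwStaging",
        },
        {
            "name": "TransformStaging",         # T: runs after the copy
            "type": "StoredProcedure",          # could also be U-SQL,
            "procedure": "etl.usp_LoadSales",   # HDInsight, or Databricks
            "dependsOn": ["CopySalesToStaging"],
        },
    ],
}

# The transform only runs once its upstream copy has succeeded.
order = [a["name"] for a in pipeline["activities"]]
```

Swapping the second activity for a different compute target changes where the transform runs without touching the load, which is what makes the ELT split flexible.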
Carlos: Again, so the idea, I don’t know if it’s self-service, but let me just take a peek at some of the data, start asking some questions before I really know what I want in the end, and then that may help me decide, “oh yeah, okay, now I have a good model, now let’s bake this in and make it into something bigger.” More formalized process.
Carlos: Now, you’ve thrown a couple of tools our way. One of the nice things about SSIS, Integration Services, is that it comes with SQL Server. But now we live in the cloud world, Microsoft is trying to host everything and anything, and so there are a couple of other tools available to us. The Pig stuff, I feel like that’s all the HD Insight components, but you bring up Azure Data Factory and then Databricks. Let’s maybe drill into that for a moment. You’ve already talked about there being a version 2. In the beginning, they kind of brought this out as, “hey, this is going to be SSIS for Azure,” but it was very basic, and obviously didn’t give you a lot of the visualization that SSIS gives you. But now they have V2, and I guess there are still some limitations, but maybe, are we a little bit closer to that interface? Is the goal still to have it replace SSIS? Maybe some thoughts around what Azure Data Factory is going to do, and again, under what scenarios people are going to have to pick it up and learn it?
Jason: I think that’s a good question. The question I’d throw out to Cathrine in her experience is, is SSIS still relevant? Is that still a technology? If I’m in the cloud, is that still something I want to use, or do you think just by default, everything should be Azure Data Factory, moving forward?
Cathrine: I think you keyed into that in your question: where is it relevant in the cloud? Well, there are two parts to that. One is that you can execute your SSIS packages in Azure now, through Azure Data Factory, so you have a full hybrid experience there. So, you’re no longer limited to just using Azure Data Factory by itself in the cloud; you can also lift and shift your SSIS packages into the cloud. Then the other side of that question is, is SSIS still relevant? Absolutely yes. If you’re all on-premises, if you’re not in the cloud, if you don’t have a hybrid environment, then SSIS is still the way to go.
Carlos: Sure, and I guess correct me if I’m wrong, but it would make very little sense to me, I have two SQL Servers in my environment, I’m not going to introduce Azure Data Factory to move data from one server to the other.
Jason: Yeah, I think that’s probably pretty accurate. You could do it, but again, it’s not probably a good architecture.
Carlos: Yeah, yeah, exactly, where I gotta suck it up, I’m assuming that there’s going to be some data moving into the Azure Data Factory service, and then it just seems like that would, yeah, convolute things a little bit.
Jason: Yeah, I’m glad you brought that up, because that’s a common misconception. If we take a step back, Azure Data Factory has this concept of what used to be called the data gateway but is now called an integration runtime. That is the footprint, so you set up a VM, or you can install it locally on the SQL Server, although that’s not advisable, but that’s where the compute actually executes. It’s this little client that sits there and allows you to poke a hole through the firewall, so let’s go through kind of a real-world use case. I have an Azure Data Factory deployed in Azure to East US 2. The metadata and the scheduling and all of that occur in East US 2, but if I’m moving data from on-prem server A to on-prem server B, that compute will all actually execute on the integration runtime local to your on-prem network. Now, if I was moving from server A on-prem to, let’s say, Azure Data Lake storage, it’ll obviously use the integration runtime local to your on-premises environment to execute the SQL query to select the data out, and then it will adaptively choose the Azure integration runtime that’s nearest to where your data lake is, and it’ll run it from there. So it’s kind of an interesting concept, because people get confused; they kind of think in that on-prem mentality, where you have an instance of SQL Server running on a box, and it’s a little bit different with Azure Data Factory. You brought up another good point with the transition from V1 Azure Data Factory to V2. V1 was very much about moving time slice or time series data. If you think about immutable data like weblogs, things that are generated on a specific cadence, that was the use case it was really optimized for. I think Microsoft really confused the marketplace, because you had some people saying, “this is SSIS for the cloud.” But then it wasn’t. Well, then they came up with–
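Jason's routing description boils down to a simple rule of thumb. The function below is a loose sketch of that decision with made-up names and return values; the real service's placement logic is more involved, but this captures the point that on-prem-to-on-prem copies never leave your network.

```python
# Rough sketch of where copy compute runs, per Jason's description.
# "self-hosted" stands for the integration runtime installed on your
# own network (the old "data gateway"). Purely illustrative.
def pick_runtime(source_location, sink_location):
    """Return which integration runtime would execute a copy."""
    if source_location == "on-prem" and sink_location == "on-prem":
        # On-prem to on-prem: the data never leaves your network; the
        # self-hosted runtime does the work, ADF only orchestrates.
        return "self-hosted"
    if source_location == "on-prem":
        # Hybrid copy: the self-hosted runtime reads the source, then
        # an Azure runtime near the sink takes over (simplified here).
        return "self-hosted -> azure (region nearest sink)"
    # Cloud to cloud: runs entirely on an Azure integration runtime.
    return "azure (nearest region)"
```

So in Jason's example, server A to server B stays on the local runtime, while server A to Azure Data Lake storage spans both.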
Jason: Right? And then they came up with V2, and with V2 you got visual editing, more robust visual editing, so it’s starting to look more like SSIS. And then Microsoft, again, is like, “well, it is now SSIS for the cloud,” because we’ve got control flow now, where we didn’t have control flow previously. We have a lot more data sources, but at this point, we still can’t do those transformations. We can copy data, and I can call a task afterwards, and I can scale that out using a big compute. So I can do USQL, I can do Databricks, I can do HD Insight, I can do custom tasks of my own using Azure Batch. But what I can’t do is a simple transform inside this pipeline, because it doesn’t exist. I can do basic mapping from a source data set to a destination data set, but I can’t really do any transformation in there. That, in my mind, is the main gap between where Azure Data Factory V2 is right now and calling it a true SSIS replacement.
Cathrine: And then Microsoft goes ahead and makes it just a little more difficult for people to understand the difference between ADF and SSIS by introducing SSIS inside ADF. So now you suddenly have this true hybrid thing, but you can’t really say that ADF is SSIS in the cloud because now you’re running SSIS in the cloud. I would say, for someone who is very familiar with SSIS, it can be a little tricky, actually, to figure out “what do I do? How do I start? How does this fit in?” And then you have all of these other technologies, like we mentioned, with, for example, Databricks being a completely different alternative to that. There are a lot of technologies in there that you can use for similar purposes and it can be slightly difficult. Even for someone like me, who’s been working with this, to fully understand “what is the best option for my use case?”
Carlos: Sure. Am I correct in saying, again, knuckle-dragging Neanderthal that I am, I think that SSIS feels like the transform tool, but Azure Data Factory or ADF feels like the load tool.
Cathrine: Yep, I think some other terms are data movement tool or data orchestration tool, while you’re absolutely correct in saying that SSIS is very much focused on the transformation part.
Carlos: Right, okay. Now, continuing that, we now have these other data sources or other tools, and you’ve mentioned Azure Databricks here a couple of times. Admittedly, I know nothing about it, and I failed to study up on this before the podcast episode today, so educate me and the rest of everybody listening. Talk to me about what Azure Databricks is and some concepts behind it, here.
Jason: Carlos, great question. Azure Databricks is very new and a lot of people are trying to figure out, “well, where does this fit in our data engineering ecosystem?” I think it’s a very good question to answer at a high level. What I would say is Azure Databricks is what they call a first-party offering on the Azure cloud. Databricks is a distribution that is built on top of Apache Spark, and it uses Python as the primary language. I like to think of it as Gemini, like the twins: there’s a data engineering aspect, which is all about using Python or even R Notebooks on top of Spark to do data transformation and to create a data pipeline for processing your data.
Carlos: So, I guess having said that, does that mean that I am generally talking about a Hadoop environment when I am using Azure Databricks?
Jason: Yeah, I mean some people can kind of get into semantic battles about it: is it Hadoop or is it Spark, so on and so forth. I’ll just say it’s Spark-based. The difference there is we’re not doing MapReduce. We’re not constrained on IO, we’re doing things in memory, so the theory is that it should be faster. What’s interesting about it is, I could go in an IaaS, Infrastructure as a Service, model, spin up VMs, build a Spark cluster, and do what Databricks does on top of that. I could control that, I could write a bunch of PowerShell scripts to spin it up, shut it down, even add nodes dynamically, but that’s a lot of work. There’s a lot of stuff to it. What using Databricks on Azure really buys you is the ability to do the cluster on demand. Even with HD Insight, my options are that I can destroy the cluster or I can create the cluster, but I can’t pause the cluster like I can with Azure SQL Data Warehouse, and while I can scale the cluster, that’s going to take some time. With Databricks, what I can do is say, “here’s the notebook that I want to run,” and, oh, by the way, I can either use a persistent cluster, or I can spin that cluster up on demand. I can say, “for this notebook, I want a cluster configured with these libraries pre-loaded and with this amount of cores and this amount of RAM,” and so that’s the data engineering play. I actually have a client that is processing a bunch of point-of-sale files in JSON, and we were doing that with Azure Data Lake Analytics and USQL, and we were having some performance issues. I’m not saying that it’s a bad product, because we’ve been successful with it in a lot of other things, but the JSON shredding was very intensive, for whatever reason.
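"Shredding" JSON means flattening nested documents into rows and columns. On Databricks you would typically do this with Spark (for example, reading the files with `spark.read.json` and exploding the nested arrays), but the core idea can be shown in plain Python with a hypothetical point-of-sale record like the ones Jason mentions:

```python
import json

# A made-up point-of-sale document with nested line items.
doc = json.loads("""
{"store": "042", "ticket": 17,
 "lines": [{"sku": "A1", "qty": 2, "price": 3.50},
           {"sku": "B7", "qty": 1, "price": 9.99}]}
""")

# "Shred" the nested document into flat rows, one per line item --
# repeating the parent fields on each row. Doing this across millions
# of files is the expensive part Jason describes.
rows = [
    {"store": doc["store"], "ticket": doc["ticket"],
     "sku": line["sku"], "qty": line["qty"], "price": line["price"]}
    for line in doc["lines"]
]
```

The Spark version distributes exactly this flattening across the cluster, which is why moving the workload there can be dramatically faster.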
Carlos: Anybody who’s done an XML shredding knows this pain. I’m assuming it’s transformed into, we have a different language, but yeah, shredding is still painful.
Jason: Yeah, exactly. And so what we found is, when we tried it on Databricks, moving to Python as opposed to USQL, all of a sudden we were able to process it orders of magnitude faster. We were able to do the whole shooting match in 15 minutes, where before it was taking like an hour.
Carlos: Oh wow.
Jason: So, that’s kind of the example, and it’s not a knock against .net or Microsoft or anything like that, it’s just the right tool for the job was Python, in that case. It gives us the ability to run Python or even R at scale, so not necessarily bound to a single process space. That’s the magic about Databricks. So, the decision point is, okay, if I’ve got people that know SSIS and they’re C# people, do we want to then spin them up on Python and Databricks, or do we want them to use USQL and ADLA and Azure Data Factory that’s maybe more familiar? The answer to that is really what’s the right tool for the job, I think.
Carlos: Sure, so what are your people familiar with, and then obviously if you’re going to do training, then that’s a different discussion; are you willing to invest in the training to adopt the new tool? I think, yeah, it can be very frustrating, particularly for some of the technology folks, because their managers kind of get the marketing hype and they’re like, “well, hey, I know we’re having this problem, let’s just go use this other tool. It should be simple.” Everything in Azure’s just plug and play now, but there are still some major decisions to be made, even though the data integration tools are becoming, I don’t know if fluid is the right word; maybe interoperable is a better word?
Carlos: But you can’t just willy-nilly decide, “oh hey, I’m going to start using something different.” I’m sure, like in your example, even though you were having those pain points, I guess you probably had to go through and make some decision points as to, “okay, do I really want to adopt this Python, and what’s it going to take for me to convert from the USQL base to the Python base?”
Jason: Yeah, and people often overlook the work to actually operationalize something. It’s one thing for a guy to go off in a closet and hide out for a couple days, and it’s another thing to actually make that a robust, repeatable production process.
Carlos: Sure, that people can support, and like you mentioned, administer that, because it’s not just the guy in the closet anymore who has to know about it. All of the documentation and whatnot has to be there, straight. Interesting. So, tooling is one piece of it, but then I guess, should we also talk about patterns, as well? Even going back to the ELT versus the ETL, does the tool that I choose impact the way that I need to then deal with my data?
Cathrine: Yes, I would absolutely say so. You know, overall, when you think of patterns, there are some things that are so abstract that they’re independent of the tool that you use. But then you can dive into very specific patterns for SSIS, to make SSIS work and run in an optimal way, and the same thing with ADF. I personally haven’t worked with Azure Databricks yet, but it’s the same thing there. It’s a very different tool and a whole new way of thinking for data engineers. One of the things that I often talk about is Biml, the Business Intelligence Markup Language, for SSIS. That came in to solve some of the pain points with SSIS, turning SSIS development away from the manual dragging and dropping and repeating yourself, into thinking more pattern-based, because that wasn’t really an option in SSIS when that was the only tool you had to use. Then you have Azure Data Factory come in, and you might have to rethink some of the way that you work, because it has different capabilities. It has built-in language support, which SSIS didn’t have in the same way. But then you can talk about more the design patterns and the data loading patterns themselves, and they don’t really change, because you still have to go through a lot of these pillars of ETL frameworks: you need some kind of auditing and some logging and change tracking and all of these things that you will need regardless of the tool that you use, really.
Jason: Yep. That’s a good point, Cathrine. I think it’s kind of interesting to note that you talked about Biml, and, I know this is like a SQL-family-oriented broadcast, but think about where patterns first got introduced. We know that from a software engineering standpoint, that is kind of the traditional thing that people did, but the first person to bring it to SSIS, interestingly enough, was a chicken farmer named Andy Leonard. Not a lot of people are familiar with him, but he introduced this concept of “let’s use patterns in SSIS to remove some of the manual drudgery of building all of these things, and let’s create a framework that’s reusable, so that if tomorrow we need to accommodate a new ERP data source, we already know how to solve that.” So, the way I look at it is, patterns are really a license to be lazy, because I don’t have to make any decisions. The decisions have been made for me. I’m following the framework, I’m following the patterns, and it’s less work. But through that, I get consistency across projects, and I can leverage the concept of DRY, or Don’t Repeat Yourself. It’s more testable, it’s less buggy, because I’ve followed the same patterns, and it’s easier to troubleshoot when there is a bug, because I know how it should behave, and it leads to higher quality code with a faster implementation timeline. Then Biml really takes that to the next level. Once you’ve codified your patterns, here’s my incremental load pattern or here’s my staging table pattern or here’s my dimensional loader, here’s how I do a type 1 or type 2, Biml lets you say, “okay, we’re going to use Biml script and we’re going to do that at scale.” So, whether you have one table that you’re extracting or 1000 tables, it’s done. You just run your Biml and generate the packages.
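The "codify the pattern once, generate a thousand packages" idea Jason describes can be sketched with a plain template function. Everything here is invented for illustration; real Biml is XML plus C# script that emits SSIS packages, but the principle is the same: the pattern is written once, and scaling to more tables is just a longer list.

```python
# Toy version of pattern-driven generation. One codified staging-load
# pattern, applied to every table in the list.
STAGING_PATTERN = (
    "TRUNCATE TABLE stg.{table};\n"
    "INSERT INTO stg.{table} SELECT * FROM src.{table};"
)

def generate_packages(tables):
    """Apply the codified pattern to each table -- 1 or 1000, same effort."""
    return {t: STAGING_PATTERN.format(table=t) for t in tables}

packages = generate_packages(["Customer", "Orders", "Product"])
```

If the framework later needs auditing or logging, you change the one pattern and regenerate, instead of hand-editing every package, which is the maintenance win Jason returns to below.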
Carlos: And then it will put all of those pieces in place, yeah.
Cathrine: Yeah, and the same thing kind of goes for ADF, and I’m sure Jason can talk to that a little bit more. You know, you can use Biml for ADF as well, but that might not be the tool that you prefer to use, so there are other ways. But even so, you still think of the patterns and tie that together with the framework, because absolutely, don’t repeat yourself. You don’t want to sit there and do the same thing over and over again, and I’m an extremely lazy developer, which is why I started using these tools to script and automate things for me. I’m glad to see that in some of the newer tools like Azure Data Factory, they’re working on building in that kind of support, which they didn’t have in SSIS.
Jason: Yeah, so one of the things that was frustrating with SSIS, well, the first thing is, you know, you’d start off in SSIS as a junior developer and you’d say, “okay, I’ll create a package and I’m going to load my data warehouse with one package, and then I’ll create multiple data flows within that package,” and so you get these franken-packages. I’ve seen packages that take five minutes just to open up, with 30, 40 data flows in them. So from a pattern standpoint, that’s where we said, “okay, we’re going to do one package per data flow.” Well, now all of a sudden, we’ve got 1000 packages, and so how do we manage that? Let’s say Cathrine came up with this great idea of, “well, we need auditing or logging.” Now you’re going to tell me I’ve got to go back and manually update 1000 packages with this new logic? What I’m kind of saying is that ADF was built with that pain in mind; they’d already realized this was a problem and didn’t want to recreate it, so they made it much easier to iterate over things at scale and not be so tightly bound to the data flow. SSIS was very high performance, because all of the data flow operated on buffers in memory, so it was very much tied to a type system and everything was strongly typed. But that means the SSIS package had to know specifically what data it was getting and what data it was putting out. It would know that at compile time, so that at run time it could make the optimizations it needs, so there’s an execution tree; it’s very similar to how a query plan works in SQL Server. So, in ADF, you can do that tightly coupled data set, or you can do a more loosely coupled data set, so that I can just say, “okay, here’s a list of 70 tables. Go SELECT star from all of those tables and copy it over one by one or in parallel.”
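The loosely coupled "here's a list of 70 tables, copy them one by one or in parallel" approach can be sketched with a thread pool. Everything below is illustrative: the table names are made up, and `copy_table` is a stand-in for what would really be a parameterized copy activity driven by a ForEach over the list.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a copy activity: conceptually "SELECT * FROM <table>"
# into the destination, without needing a per-table mapping up front.
def copy_table(table):
    return f"copied {table}"

tables = [f"Table{i:02d}" for i in range(1, 8)]  # pretend table list

# One by one: a simple sequential loop over the list.
sequential = [copy_table(t) for t in tables]

# In parallel: the same loose pattern, fanned out across workers.
# pool.map preserves the input order of results.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(copy_table, tables))
```

Either way, adding table 71 means appending to the list, not authoring a new strongly typed package, which is the contrast with SSIS that Jason is drawing.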
Carlos: Gotcha. Okay, yeah. So you don’t have to go in, again, to use the SSIS example and do all of that mapping.
Carlos: It’s going to be like, “oh hey, I get the fact that this is what I’m going to be doing and I’ll figure out some of that for you and then when you get your table, it will look like that which you SELECTED from.”
Carlos: Yeah, interesting. Does that mean, then, that we should be spending a whole lot more time architecting than we do these days? I mean, is that where consultants like you spend a lot of your time? I mean, maybe you are still doing some of that execution piece, but I feel like that’s where a lot of that value is, is deciding, “okay, well, how are we going to skin this rabbit and what makes most sense to do it? With what tools?”
Jason: Yeah, I mean, I think a lot of times when I come in, clients are just looking to solve a pain. So, a pain for a current client is that they don’t have any central data warehouse. Right now, all of their reporting is going off of one Access database, and the client has one guy who, every month, does a 180-step process to basically do all of this ETL to produce these reports, and these reports are mission critical. So now we’re coming in and saying, “hey, let’s take a step back, let’s integrate all of your different data sources into a central place. That will solve a lot of this manual work. And then let’s build a dimensional model on top of that to make it easier to enable these self-service scenarios.” And so, providing some guidance around that, and, “oh, by the way, you guys aren’t an IT company, so why don’t we move this into the cloud, where resources are elastic, you can provision things a lot quicker, you can be more agile, and if you do it right, you can control costs a lot better.”
Carlos: Right, sure. Now, it’s interesting that you bring that up, the idea of cost. Is that because you are introducing new options that they didn’t have before, that they didn’t have to go and spend licensing dollars for? Or, what’s the scenario there?
Jason: No, it’s a great question. The analogy I like to make is it’s like my 5-year-old son at the all-you-can-eat buffet. IT was, “well, we’ve got these VMs, whatever you want, we don’t charge anything back, and if you need 8 cores and 64 gigs of RAM for this ETL server, great, knock yourself out.” So, to support this kind of resource binging, we thin provision things. We thin provision IO, we thin provision CPU, we thin provision memory, and then we stop juggling the balls and everything hits the ground, and it all falls apart because we’ve written checks that the physical hardware can’t cash anymore. So, with the cloud, it’s a double-edged sword, because we need to very much figure out what’s the right size for this resource, because everything has a cost.
Carlos: We’re now paying for it.
Jason: Yeah, we’re now paying for that.
Carlos: We can slide the needle, but then that’s where a little more of the pain comes in; our wallet takes a hit, there.
Jason: Yep, and the other thing that I find a lot, too, is that when we start to peel that onion back a little bit and look at it, we see, “oh, well, this process, this stored procedure could be tuned, and if we tune it, then all of a sudden we don’t need all of these buffer pool resources, so then we can scale the SQL Server down. And all of these concurrency issues, you know, if we use memory-optimized tables or all of these other patterns, we can then reduce the overall costs.” But the thing of it is that you as a DBA or a data engineer, you need to be in tune with this process and realize that it’s not just about tuning for tuning’s sake. It’s about tuning to actually save money: to either reduce costs or enhance revenues. That’s what a company wants out of IT. One of those two things. If you’re not doing either of those, then you’re just off in an ivory tower building fiefdoms.
Carlos: Sure, and that’s when they get upset. That cost center, that “hey, what are you guys doing for me? What have you done for me lately?”
Carlos: Yeah, so that’s an interesting idea, and one that, yeah admittedly, a lot of us, I mean licensing, that’s for procurement, right? We haven’t had to look at those costs before, and you know, we tend to think of new technologies that we have to adopt, like Azure Data Factory, like, “oh, I gotta learn that.” But to a certain extent, we’re going to have to learn the business processes or the pricing pieces of it to then be able to communicate that to the business and say, “okay, business, this is why we need to spend another, whatever, $5000 a month. It’s for these reasons, because you want to do X, Y and Z.”
Cathrine: Definitely, and in terms of things like Azure Data Factory, as a developer or a data engineer, you actually do have to go in and think about that when you design your solutions, because once you start paying for activity runs, the number of runs, then you have to optimize for something that you might not even have thought about previously, while working with SSIS, for example. So, it is a new world, even for developers, trying to, or needing to consider new things, where previously, that was left to a completely different department who actually bought the hardware and then you had that hardware. So, that also, obviously affects the patterns that you choose, because you don’t want to choose a pattern that’s expensive, because you’re running all of these activities, if you could do something differently to save on those costs.
Carlos: Yeah, it’s like the infrastructure is coming to you now, and you now have to answer to that, whereas before, we didn’t have to. Interesting.
Jason: And getting back to the patterns discussion, I know Cathrine’s probably worked with a lot of different frameworks and built those, and being a BIML hero, she’s probably used BIML to automate some of that stuff.
Cathrine: A couple of times.
Jason: Maybe just a few. And so, the thing I would see is that a lot of the patterns themselves maybe haven’t changed drastically. Cathrine, if you think about, what are the common patterns that you’ve built in your SSIS frameworks?
Cathrine: Typically, it’s all about how you want to load the data. You have from low volume data with simple truncate and load, like the simplest thing you can do, just yeah, delete all the data and reload it. Obviously, that is not a good strategy, once you start getting a little more data, and then you start thinking about these incremental loads. In terms of data warehousing, you still have facts and dimensions that you need to load with some kind of lookup to find the keys. These are all patterns that you still have to think of when you build your solution, and it doesn’t really matter if it’s ADF or SSIS. Those are still the common patterns that haven’t really changed that much. Like I mentioned previously, as well, all of the framework around ETL, also for ELT, with the auditing, the logging and things that you have to think about with orchestration and scheduling, that kind of ties into that, as well. And all the transformations, those are still relevant, and you still have to think of it in terms of abstract patterns, not really tied to a technology.
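The two load patterns Cathrine contrasts — truncate-and-load versus an incremental load — can be sketched abstractly, independent of SSIS or ADF, just as she describes them. This is a hypothetical illustration with invented column names; a real incremental load would typically key off a modified-date or change-tracking column in the source and persist the watermark between runs.

```python
# Truncate and load: "delete all the data and reload it" -- simple, but
# the cost grows with the full size of the source.
def truncate_and_load(source_rows, target):
    target.clear()
    target.extend(source_rows)

# Incremental load: only pull rows changed since the last successful run,
# tracked by a high-water mark, then advance the watermark.
def incremental_load(source_rows, target, last_watermark):
    new_rows = [r for r in source_rows if r["modified"] > last_watermark]
    target.extend(new_rows)
    return max((r["modified"] for r in new_rows), default=last_watermark)

# Example run: an initial full load, then a later incremental pass that
# picks up only the one row modified after watermark 1.
staging = []
truncate_and_load([{"id": 1, "modified": 1}], staging)
watermark = incremental_load(
    [{"id": 1, "modified": 1}, {"id": 2, "modified": 5}],
    staging,
    last_watermark=1,
)
```

The point Cathrine makes holds here: nothing in either function is tied to a particular tool, which is why the same patterns carry over from SSIS to ADF.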
Carlos: Should we go ahead and do SQL Family?
Carlos: Okay, so like I mentioned, we’re going to go ahead and divide this up and ladies first, we’ll start with Cathrine. So, your all-time favorite movie?
Cathrine: Fight Club.
Carlos: Ooo, interesting. I actually have not seen that movie, still.
Cathrine: That’s not a common answer, I take it.
Carlos: No, I would agree. I haven’t been asking that question that long, but that’s not one that I’ve heard.
Cathrine: I could probably pick from 50 different movies, but that is one I have seen 50 times now, and I’m still not sick of it.
Carlos: Interesting, very good. City or place you most want to visit?
Cathrine: Ooo, does it have to be in the real world?
Carlos: Ah, no, here we go. Let’s hear it.
Cathrine: Yes, well, I would love to visit all of these places that you see in movies that you can’t go to. Any kind of outer space would be cool, but probably not very soon in my lifetime. If I want to pull it back down to Earth, I can combine that and say New Zealand would be an amazing place to visit, and I’m into the Lord of the Rings.
Carlos: There you go. I was going to ask if Middle Earth was a destination. So, you’re not putting a lot of stock in Elon Musk, there. You don’t think that’s going to happen in your lifetime, huh?
Cathrine: I’m not so sure. You know, we’ve got that fancy little car with the dude sitting in it, flying around the sun, but I’m not so sure, if it ever comes to that, if I can afford it. Maybe that’s kind of the thing.
Carlos: That’s right. The price tag to get up there is still pretty high. A food that reminds you of your childhood, and I think you’re the first international guest we’ve had since we started asking this question, so I’m very interested in the answer.
Cathrine: Oh really? Okay. Let’s see if you can spell this. Lapskaus.
Carlos: Yeah, sorry, I don’t think– I think the transcriptionist is going to have a problem with that one.
Cathrine: Yes. So lapskaus is a very traditional dish. It’s basically like a stew, so you have some meat, potatoes, veggies. It’s a very old farmer’s dish, basically. You put everything that you had left over and you let it sit with some yummy spices. And one of the reasons that that reminds me of my childhood is that I love it now, and my mom, she can spend, you know, four or five hours prepping this dish and it’s delicious, and we eat it with flatbread and all yummy. But when I was a kid, I could not stand the thing, because it was a mixed food and I had the very weird quirk where I could not eat mixed food. Everything had to be on a separate place on the plate. Do not mix my food. And then suddenly, from one month to the other, I loved it. And I freaked my parents out because they were like, “are you sure you’re okay, or did you hit your head somewhere?”
Carlos: Yeah, that’s right. Who are you and what have you done with my daughter?
Cathrine: Yes, exactly.
Carlos: Interesting, well, very cool. That’s cool. So, something that you didn’t even like as a child, but that you still remember.
Cathrine: Well, it was a traumatic experience, being served that, the first time that I can remember, and basically almost gagging from the sight of it and my mom was not very happy, because she’d just spent five hours prepping this food for me. So, yep, that definitely reminds me of my childhood.
Carlos: That’s hilarious. Now, getting into a little bit of tech here. How did you first get started with SQL Server?
Cathrine: Purely by accident. My background is all over the place. I went from graphic design to interactive design over to programming because I thought that was kind of cool. Then I ended up with something in SharePoint, and actually when I started working at my first company, I was hired as a SharePoint consultant. The issue they had was they didn’t have any SharePoint projects for me. I had just spent six weeks getting certified through something called Microsoft University in Norway, so six weeks of training and certifications. SharePoint was the big deal for me back then, and then I ended up in a data warehousing project, because that was what they had. And that’s when I first actually used SQL Server and started developing. And I fell in love with it. I’d seen databases previously, but I’d never really worked on them. But then coming into the data warehouse world, where I could take all of my little quirks, that I like grouping and sorting and organizing things, and I do that everywhere in my life, that was perfect for me, with data.
Carlos: Interesting, okay, so now in that time that you’ve spent with SQL Server, there’s one thing you could change, what would it be? And we could expand this. I’ll take some liberties here, and expand this into some of the tooling, the SSIS tooling or Azure Data Factory, if you wanted to go that route as well.
Cathrine: Yeah, so one thing I would change about SQL Server, I think, is better out-of-the-box handling of metadata, and part of that is coming from my BIML background, now. That came in and kind of solved some of it for SSIS, but it’s still something that’s not really out of the box with SQL Server, yet. So other vendors like Amazon, they have Glue and they have all of these data catalog concepts that we kind of sort of have, but it’s not where I would love for it to be yet, in SQL Server.
Carlos: Sure. Like the extended properties stuff? Is that what you’re–
Cathrine: Yeah, more on like metadata handling, storing, the data catalog side of it, really.
Carlos: Oh gotcha, gotcha. Okay, so actually in the (?) processes, all of that stuff, interesting okay. Very good. What’s the best piece of career advice you have received?
Cathrine: Oh, I think I’ve received a lot of good career advice. Don’t be afraid of failing, I think. It’s terrifying if you’re a new person in this field and you don’t really know where to start, and there are so many tools and products and things that you need to consider, but don’t be afraid of failing, because it is a learning experience like nothing else. And once you do something like that, because eventually you will fail in some way, think about what you can learn from it and how you can help others, maybe by sharing what you learned.
Carlos: Our last question for you today, if you could have one superhero power what would it be and why do you want it?
Cathrine: I want to fly. I think I said this the last time I did this podcast with you, as well, and it hasn’t changed. It’s been a dream of mine since I was a little kid. I would love to be able to fly and explore and basically just be completely free.
Carlos: Very good. Well, great. Cathrine, Jason, thanks for joining us today.
Cathrine: Thank you.
Carlos: That’s going to do it for today’s episode, compañeros. One of the very cool things about talking with Jason and Cathrine is the realization that there are simply more options, more tooling that we can use to solve some of the different problems that we have. If we can understand what those problems are from our business perspective, then there are going to be lots of ways for us to solve these problems. It’s an interesting time. We have talked a lot about what the future is going to hold for us in the data profession, and I think this is another area that is very compelling and very, very interesting. So I think as technologists, if we want to continue to learn and we’re open to some of these new ideas, then there are going to be lots of opportunities for us in the future. We are going to come back and finish up the conversation with Cathrine and Jason next week. We have a little bit more we want to get through. But that will do it for today’s episode. Thanks again, compañeros, for being here with us. As always, we’re interested in hearing from you on social media. You can reach out to us on many of the social media platforms, and you can connect with me on LinkedIn. I am @carloslchacon, and we’ll see you on the SQL Trail.