Compañeros! Welcome back to the SQL Trail. If you’ve ever thought about improving your ETL game, this episode is for you. In Episode 43 of the SQL Data Partners Podcast I talk with Rafael Salas. Rafael is a Data Platform MVP and SQL Server Architect with over 12 years of experience building business intelligence solutions. We talk about the nuts and bolts of your data’s back-end and the data warehousing that powers the data-driven decision-making in your organization.
Episode 43 Quote:
“I always recommend that the first stage of any ETL project is to spend the time on very basic things. Understand your requirements and go and profile the data. Get familiar with the data.” - Rafael Salas
Listen to learn…
- Why it’s hard for IT folks to learn ETL
- Why ETL tools alone aren’t enough to get good data
- How to transition from an ETL mindset to a data architecture mindset
- Why the best data architects spend a lot of time learning their dataset
- Where to start when you want to architect your own data
- The data warehousing method Rafael Salas recommends
- Where to find good ETL learning resources
- How Big Data will affect ETL developers
- The SQL tools Rafael uses daily
As always, please join me on the trail by starting a conversation below or connecting on Twitter. If this podcast is helpful to you, please leave a review on iTunes so others can find it as well. Subscribe via iTunes, Stitcher, and Podbean.
This episode is sponsored by COZYROC
About Rafael Salas
Rafael Salas is a Business Intelligence architect with more than 14 years of experience providing data architecture solutions. He is a SQL Server MVP, MCTS, and an active member of PASS and local user groups where he regularly speaks on Power BI, ETL Architecture, and SSIS. He’s also a part time professor and program advisor for the Computer Technology Institute at Central Piedmont Community College (CPCC), participating in the Business Intelligence continuing education program and the Regional Effort to Advance Charlotte Information Technology (REACH IT) Business Intelligence program. Connect with him on Twitter and LinkedIn.
Follow Rafael on Twitter
SSIS: Design principles for robust ETL processes
Implementing a Data Warehouse with SQL Server Jump Start
Designing BI Solutions with Microsoft SQL Server
Red Gate SQL Prompt
Carlos: So Rafael, welcome to the program!
Rafael: Thank you Carlos, happy to be here.
Carlos: Yes, it’s always nice to have an East Coast friend to come on the show with us today. So interestingly enough, and I’m not sure if I’ve ever shared this with you, not that I’ve done a ton of SSIS, but what I’ve started doing is, I’ve started looking out there on the internet. Your blog was actually one of the first that I came across and helped me on more than one problem. So if I’ve never said it before, thank you.
Rafael: [laughing] Thank you again. I’m glad that you find it helpful. I think that takes my readership up to around two people: my wife and you. So thank you.
Carlos: [laughing] Yes. I thought it was great. So that’s one of the reasons why I wanted to have you on the show today. Our topic, ultimately, is around ETL and why ETL is so hard. It seems like a lot of people struggle with it, and I think that it’s changing a little bit. So I guess, let’s go ahead and jump in. What are your thoughts? Why do you think it’s so hard for IT folks or data folks to do ETL?
Rafael: Yeah, that’s an interesting question. I think it’s a combination of things. For one, I believe many of us still underestimate the complexity of bringing data together. It goes like, it’s all about moving data. It’s all about moving data that we already have in house, most of the time. We know where it’s going. So what could go wrong, right? And then at the next level of detail, when the first pass goes through, it turns out the data is coming from different systems, at different frequencies, at different granularities. And it’s not very clear that there is a common key between one table in one system and the table in the other system, and now we’re going to merge those two together. So I think it’s a compound problem that boils down to underestimating the complexity while not giving the right amount of importance to data quality as well. The assumption is, “Hey, the data is already in house, it’s running our CRM system, it’s already running our ERP system, so it should be good.” But the data is actually good for what it does in those systems, not necessarily for what you’re intending to do after you extract it.
Carlos: Yeah, exactly. The downstream, who knows what happens.
Rafael: Yeah, so a lot of it comes down to data quality as well. My last one, my favorite reason for why it’s hard, is, again, people think, “Oh, I have the data, and I have a good ETL tool, say SSIS, so everything downstream should be easy.” And you know what, the tool doesn’t come with an architecture design, and it doesn’t come with a system design. So there is a lot of engineering that has to come into place to make it sustainable.
Carlos: That’s an interesting thought, I think, when you think about data. Particularly from the traditional transactional model in that if I want to connect two pieces of data, it’s a join. Right? And I know that I have a foreign key relationship and I can join them and I can move forward. ETL is not necessarily like that, right? You mentioned that this data could be coming from anywhere, and so do you think we don’t spend enough time getting to know our data sources? Making too many assumptions there? I guess you kind of referenced that in the sense that there’s no architecture, right? We kind of have to figure that out on our own.
Rafael: Yeah, I think you’re hitting the nail on the head with this. I think we fail to plan when building these systems. I always recommend that the first step, the first stage, of any ETL project is to spend the time on very basic things. Understand your requirements and go and profile the data. Get familiar with the data that you’re going to be moving and conforming. Make sure that it’s gonna fit the bill. And you don’t need to spend a month on that. You spend a couple of days, you spend a week, and you come back to say, “Here is what you’re asking me to do in this ETL process, and based on the requirements, based on this sample data and this profile, I think there is a huge gap. So why are we going to spend the time building a system when the data, at the most fundamental level of quality, is not where it needs to be?”
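Rafael’s advice to profile the data before building anything can be sketched as a few basic checks per column: row count, null count, distinct values, and value range. This is a minimal, hypothetical sketch in Python; the sample records and column names are made up for illustration, not from the episode.

```python
def profile_column(rows, column):
    """Basic profile of one column: row count, nulls, distinct values, min/max."""
    values = [r.get(column) for r in rows]
    non_null = [v for v in values if v is not None]
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }

# Hypothetical sample extracted from a source system
sample = [
    {"customer_id": 1, "region": "East"},
    {"customer_id": 2, "region": None},
    {"customer_id": 3, "region": "West"},
]
print(profile_column(sample, "region"))
```

A couple of days running checks like these against real source extracts is often enough to surface the kind of gaps Rafael describes before any package gets built.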
Carlos: Alright, so let’s dive into that just a minute. You mentioned, before we started recording, that you’ve moved a little bit more into the architecture role. And so now you probably have a little bit more experience with saying, “Okay, let me step back and think about how I want to put all this together.” What are some thoughts or some nuggets of wisdom, if you will, that you might share about making that mind switch? The mindset shift of, “Okay, I’m not just going from Point A to Point B. I’m getting to know that data.” And if I wanted to architect from that, where do I start?
Rafael: That’s an excellent point. So, I arrived at this point partly by learning from my own mistakes. That’s obviously one thing. But there is also a lot of good material out there. When I started researching, my thinking was, well, there’s got to be a better way. An easier way. I had always been designing data warehouses using the Kimball methodology: star schemas, fact tables, and dimensions. That’s what Kimball is known for, but if you review the literature and their books, they go very deep into ETL architecture. They actually break the ETL system down into 34 subsystems. It’s very detailed, and there are a lot of those subsystems that you will never use and that won’t apply to every data warehouse. But it will put you in the mindset of, “this is more than just copying, transforming, and writing data from one side to another.” There is a lot more to it, and a lot more moving parts, so that article on architecture and the subsystems I found very helpful. I think it goes back to 2004 and was updated in 2007, but the principles still apply.
Carlos: Interesting, yeah. I think that is a good point. I’m not a data warehouse person per se, but you hear of the Kimball method, and as you mentioned, the dimensions and the way fact tables should be structured, but I don’t hear anyone talk about the Kimball method of ETL.
Rafael: That’s correct. Well, and sadly enough, ETL is probably harder to sell. It’s not as sexy. There’s not as much to show. Maybe we’re a little more geeky here and we enjoy looking at data flows and seeing the colors turn from yellow to green and all those things. But there are no pretty charts and graphs. It’s all about moving data. It’s plumbing. When you show your house, you show the sexy things: the front room, the living room, not so much the back space, the crawl space.
Carlos: Some of that goes back to, you know, the limited training opportunities. You had mentioned that you were teaching an SSIS class at the college, and I admit that I haven’t looked at college curriculum in a few years. Because it’s not seen as the value-add; as you get higher up, managers definitely aren’t concerned about ETL. And I’m not sure that workers, when they sit down for training, go, “Hmm. I’d like to do ETL training. I want to do this class or what-not.” Why is that so limited if it’s such an important thing? You know, your plumbing: if your toilet backs up, all of a sudden it becomes a big deal.
Rafael: Yeah, absolutely. I think you’re right; nowadays it’s probably easier to find such training online. There are a few resources, but exactly: if you compare the amount of resources available on the ETL side versus what might go into, you know, writing queries, performance tuning, creating databases, or building visualizations, reports, and charts, ETL obviously doesn’t have as many resources. So yeah, that’s an interesting point. However, I think there is still a lot of material; you have to dig a little bit in order to find it. I have found things that Microsoft has on the MVA, the Microsoft Virtual Academy. I think they have good self-paced training that will probably go into some of that, but to your point, there is nothing specifically about ETL. Just ETL pieces as part of an overall larger curriculum around SQL Server, for instance.
Carlos: Right. Here are the individual components, right? Not so much like, “Here is how you put it together.” Or, “Here are some common thoughts if you’re going from an Excel file into your database.” Or something like that.
Rafael: Right, right. A lot of it is, “Here’s how you can move the switches or change the properties within the tool.” Not so much, “What’s the best way to do it?” The answer to that probably requires a little more research on your own, you know, making your own mistakes and learning from them.
Carlos: So, we talked about making your own mistakes. One of the first things we hit upon is, we create the package, we test it in our test environment with a handful or a couple hundred rows, and then we push it out into production and it’s like, “Whoa!” We start to have some performance-type issues. So I guess I’m curious what you see as the common performance issues that people have. I think most of our conversation revolves around SSIS; when we talk about the ETL tool for SQL Server folks, SSIS is kind of a no-brainer, right?
Rafael: Yes, absolutely. Well, looking at two specific mistakes. With SSIS, or any ETL tool for that matter, you open the toolbox of modern ETL tools and you will find a component for pretty much every task. But just because it’s there doesn’t mean you should use it. Case in point, the typical mistake of sorting. Why would you sort the data in memory within the data flow, the ETL pipeline, when you are reading data from a highly optimized database engine? You are just an ORDER BY away from pushing that piece of logic, that task, back to the database. Many times you still find people sorting the data thinking that if they sort it before inserting it into the destination, it’s going to be stored in that order, and that it’s going to buy them some benefits. So, there is a little bit of misunderstanding about the responsibilities of your ETL tool of choice, SSIS in this case. You need to learn and understand the pros and cons of using each of those components and tasks. And also, look outside your ETL tool and understand the resources on your destination. What are the things your destination could help with as part of the overall ETL process? Try to have that push-down, push-up type of optimization.
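Rafael’s sorting example applies to any relational source: rather than pulling rows unsorted and sorting them in the pipeline’s memory, you are “just an ORDER BY away” from letting the engine do the work. A minimal sketch using Python’s built-in sqlite3 as a stand-in for the source database; the table and data are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [(3, 9.5), (1, 4.0), (2, 7.2)])

# Anti-pattern: extract unsorted, then sort inside the ETL pipeline's memory
rows = conn.execute("SELECT id, amount FROM sales").fetchall()
sorted_in_etl = sorted(rows, key=lambda r: r[0])

# Better: push the sort down to the engine with an ORDER BY
sorted_in_db = conn.execute("SELECT id, amount FROM sales ORDER BY id").fetchall()

# Same result, but the database engine's optimized sort did the work
assert sorted_in_etl == sorted_in_db
```

The output is identical either way; the difference in memory pressure only shows up at production volumes, which is exactly the trap Carlos describes with small test datasets.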
Carlos: Right. And I think to that point, because with SSIS, when you have SQL Server you get the license for SSIS, right? And it tends to be on the same machine. It’s one of the first things I notice when I’m in different environments and they’re having some SQL issues. Well, are you running ETL workloads on it, and can we potentially move that off to another server?
Rafael: It comes from that old mindset: you buy SQL Server and you get everything else for free, right? And that means you store everything on that SQL Server and do it all on one server. And I think the fine print is, everything is included as long as you’re running everything on the same server. And that is not always a good idea.
Rafael: So I think it comes down to understanding the different workloads, right? The workload of a data warehouse is different than the workload of an OLTP system. And sometimes you have them coexisting. And you top all that off with, oh, I also have my ETL process on the same server. And I also have my OLAP Analysis Services structures on the same server. Now you have a little bit of a mess and you’ve gotten yourself in trouble.
Carlos: Right. And I think if you’re coming from the programming side, another concept that you might be used to is kind of that, not so much looping, but working with a single record set at a time. And then in ETL, that may or may not always work. And I think we talked a little bit about the role of temp tables in our ETL workloads as a performance bugaboo.
Rafael: Absolutely. It’s easy to visualize our ETL process like, hey, I’m moving rows. Maybe I’m moving one row or a few rows at a time, and we fall into that trap. The truth is, SSIS and many of these ETL technologies benefit from in-memory engines, so they are not really going one row at a time. Sometimes we fall into the trap of designing our data flow thinking that’s the way it’s going to process. And we might put an OLE DB Command transformation in that data flow. For those who are not familiar with it, chances are good you should not be using it, because it means you’re issuing T-SQL commands against the database for each record going through the pipeline. If you’re moving a few hundred records, maybe you would not see the difference. But if you are moving a few million records, you will see a lot of difference. So, take into account those two things: you are using an in-memory pipeline and an in-memory engine, and you should be thinking in sets of rows, with a transformation applied to a set of rows as opposed to rows going through transformations one at a time. It requires a little bit of practice and time to visualize that, but once you get it, you’re going to be in a better position to design your data flows in a way that is better optimized.
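Rafael’s point about the OLE DB Command transformation, one statement per record versus one set-based operation, can be sketched outside SSIS too. This hypothetical illustration uses Python’s sqlite3 as a stand-in destination; the table name and row count are made up, and the timings printed are illustrative, not a benchmark.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER, val TEXT)")
records = [(i, f"row-{i}") for i in range(10_000)]

# Row-by-row, like an OLE DB Command transformation:
# one statement issued to the database per record in the pipeline
start = time.perf_counter()
for rec in records:
    conn.execute("INSERT INTO target VALUES (?, ?)", rec)
row_by_row = time.perf_counter() - start

conn.execute("DELETE FROM target")

# Set-based: hand the engine the whole batch at once
start = time.perf_counter()
conn.executemany("INSERT INTO target VALUES (?, ?)", records)
set_based = time.perf_counter() - start

print(f"row-by-row: {row_by_row:.3f}s, set-based: {set_based:.3f}s")
```

With a few hundred rows the two approaches look interchangeable; it is only at the millions-of-rows scale Rafael mentions that the per-record round trips dominate.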
Carlos: Sure. So we’ve already talked about some of the issues that ETL folks have with limited opportunities, but they’re also facing some changes to the IT landscape. Big data was one, right? Like, “Why do I need to transform it?” Potentially even tools like Power BI. How has that affected the role of the ETL developer, if you will?
Rafael: Yeah, interesting question. When we first started thinking about Big Data, we thought we were going to be out of work very soon, right? Because Big Data was taking over, and relational database engines were going away. ETL, data warehouses, everything like that. Well, with a few years and a little bit of better understanding of those technologies, we realized that that’s not the case. They’re here to help out, they’re here to complement.
Carlos: Sure, they have their place.
Rafael: Absolutely. One big promise that’s still there with big data, in terms of ETL, in my opinion, is the whole concept of schema-on-read. Land the data the way it is, and you have the flexibility to define the schema, or multiple schemas, on top of the same data set as you need to query it. If you think about it, a lot of the time that goes into ETL processes goes into understanding the data well. The data types need to be well-defined, and the destination is likely a relational database, so if you say TIME and you try to pass a number, it’s going to blow up on you. And if you say it’s a NUMERIC data type and you pass a letter, it fails. A lot of the time and testing goes into ensuring those little tiny details are in place. With big data technology, one of the buying propositions is that you may be able to land the data in its native format and just define the different schemas as you need to read it. I see a big saving in that. I don’t think it’s going to replace traditional ETL and traditional database engines, which are still needed for certain use cases. But I believe it’s just another tool in your toolbox that might speed up the process in certain areas.
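The schema-on-read idea Rafael describes, land the data as-is and enforce types only when reading, can be sketched in a few lines. This is a hypothetical illustration using JSON lines in Python, not any specific big data platform; the field names and values are made up.

```python
import json
from datetime import date

# Raw records landed as-is: everything is a string, no schema enforced at load time
landed = [
    '{"order_id": "101", "ordered_on": "2016-02-01", "total": "49.90"}',
    '{"order_id": "102", "ordered_on": "2016-02-03", "total": "12.50"}',
]

def read_with_schema(raw_lines):
    """Apply a schema only at read time: types are enforced here, not at landing."""
    for line in raw_lines:
        rec = json.loads(line)
        yield {
            "order_id": int(rec["order_id"]),
            "ordered_on": date.fromisoformat(rec["ordered_on"]),
            "total": float(rec["total"]),
        }

orders = list(read_with_schema(landed))
print(orders[0])
```

As Rafael and Carlos note right after, the data types still matter; the work of defining them has moved later in the process, it has not disappeared.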
Carlos: Isn’t that funny, right? Even in the big data scenario, data types still matter.
Rafael: Oh, absolutely. It probably matters a little later, but it still matters. At some point, when you’re going to read the data and create that visualization, or create that report, or do whatever is next, that information needs to be defined.
Carlos: I was at a presentation with James Phillips, who is the VP of the Microsoft BI stack, if you will, and Power BI was ultimately one of his things. And he talked a little about how that role might change, and I guess we could dive into the Power BI discussion a little bit here. So the thinking is, and again, I think there will still be a place for it because I don’t think the data warehouse is dead, but some of that might change, and the ability to glue separate sources together is going to become more important as self-service reporting continues to evolve.
Rafael: Yeah, that’s something that resonates with me. Now you would have the ability to have two different sides of the house, if you will. The more structured, more curated side of the house, which might use the traditional ETL or traditional data warehouse approaches that we know. And perhaps the less curated, high-volume, or more rapidly changing data on the other side. The technologies now have so many different layers, and they allow you to, as you say, glue them together and bring value a little quicker to the business. So that’s definitely something very interesting to keep an eye on.
Carlos: Right, so if you could go back, you’ve been working in the data space for a while. If you could go back to the younger Rafael and give him some pointers for this journey, what thoughts would you share or what points would you want yourself to know if you could go back and do some things over?
Rafael: Hmm. For this side of the house that I’ve been working in, I think the earlier we can learn to write better SQL, especially better queries, the more it would help us. A lot of the time we get in trouble in our ETL system just because we fail to take a good approach; we rely on techniques we should not be using, techniques that are less performant. That goes along with understanding data modeling better, you know, avoiding creating those tables where the data type is varchar(255) for every column. Yes, storage is cheap and will get cheaper, but writing and reading data from disk is not; that takes a lot of time. So, understanding the internals a little bit more. When we’re younger, we’re just happy to see the data move from A to B, and when we query the data, it looks right and we declare victory, right? I think learning these things earlier would help. Now I’m more into trying to understand the bigger picture, the architecture. Think about this as an overall system that’s not just moving data; it needs to be maintained, it needs to be backed up, it needs to be secure, it needs to be expandable.
Carlos: So Rafael, let’s do SQL Family. So one of the things I always like to ask people is how they get work done. And I know we’ve mentioned SSIS here, but what’s your favorite SQL tool and how do you use it to get work done?
Rafael: Well, I spend a lot of my time querying data, even in my new role as a data architect. The first step that I always take is, “Let me get familiar with the data.” So as I write queries against unknown databases, or databases that are relatively new to me, SQL Prompt from Red Gate is something that I have found very, very helpful. You know, one pet peeve of mine is that I hate when a query is not properly formatted. And properly formatted by my standards.
Carlos: It looks ugly.
Rafael: Exactly. I mean, I tend to think that maybe my SSIS package is running slower because the SQL statement hasn’t been properly formatted. So, I like SQL Prompt. And if you spend some time in Visual Studio creating anything that has to do with BI projects, not necessarily SSIS, I think BIDS Helper is another good tool that’s also free. It comes from CodePlex.
Carlos: The BIML stuff, all of those components. Very handy.
Rafael: Oh absolutely, absolutely.
Carlos: If you could change one thing about SQL Server, what would it be?
Rafael: The hard part of that question is that you’re only allowing one. No, SQL Server is a really great tool and has been feeding my family for many years now. But of course, there is always room for improvement. In the SSIS world, things have improved a lot, especially since 2012. I always wished we had a better repository for executions. We’re getting there, we kind of have a good way now, but I would like to see more enterprise-class ETL: the ability to have clustering, mirroring, better monitoring, better partitioning, a little bit more of the scale-up, scale-out capabilities in SSIS would be on my wish list.
Carlos: So, what’s one piece of career advice that has been helpful to you throughout your career?
Rafael: Wow, several things come to mind, but one thing I heard early on has proved valuable. Somebody told me, “Never stop learning. Don’t be afraid of learning new things.” Sit in the front row, and when somebody asks who can do this, raise your hand and just go for it. I think that has been a big help for me. Don’t be afraid of learning and jumping into new opportunities. Sometimes it looks daunting or scary, or you think you’re going to look bad if you fail. But you know, that’s how you grow, and it’s going to open you up to new opportunities.
Carlos: Yeah, especially in technology. You have to keep up. Our last question for today. If you could have one superhero power what would it be and why would you want it?
Rafael: Hmm. Yeah, I was driving today from work to take this call with you and then I got stuck in traffic. And I wish I could fly. I think I would like to be able to fly and not be stuck in traffic and see if I can get to work on time more often.
Carlos: [laughing] Very good. I’m sure your wife would enjoy that as well.
Carlos: Rafael, thank you so much for taking some time with us today.
Rafael: Thank you, my friend Carlos. Thank you so much for having me. It’s such a pleasure to be here.