In this episode we finish up our conversation with Jason and Cathrine and talk about how they keep up with all the changes in data conversions. The advent of Azure services is great, but can be tricky to navigate. We also get to hear Jason’s answers to the SQL Family questions.
Episode Quotes
“It’s kind of like the Mike Tyson quote, ‘everybody has a plan until they get punched in the face.’ I think that’s really relevant to the data engineer, because we get punched a lot.”
“The world is changing, and the tools are changing, and you really need to make sure that you understand what problem you’re trying to solve before diving into any of these technologies.”
Listen to Learn
00:40 Intro
01:18 Compañero Shout-Outs
01:42 SQL Trail
02:32 The architectural framework, the Agile approach
07:16 Cathrine & Jason’s training strategies – how did they learn this?
11:47 Know the problem before choosing a tool, fight for quality, but be flexible – find, learn and use the right tools
14:48 Prognostications on the future of the ETL/ELT world in the Microsoft Stack
17:39 SQL Family Questions – Jason
24:02 Closing Thoughts
Part 1: https://sqldatapartners.com/2018/08/29/episode-144-is-ssis-still-relevant/
About Jason Horner
Jason Horner is a Microsoft Certified Master of SQL Server and has worked with it for more than 10 years. He began work as a DBA/Application Developer in 1999. He made the transition to delivering BI solutions in 2002 and began working with spatial data in 2008. Jason’s latest focus has been architecting large-scale spatially enabled BI solutions.
About Cathrine Wilhelmsen
Cathrine loves teaching and sharing knowledge. She is based in Norway and works as a consultant, focusing on Business Intelligence and Data Warehouse projects. Her core skills are ETL, SSIS, Biml and T-SQL development, but she enjoys everything from programming to data visualization. Outside of work she’s active in the SQL Server community as a Microsoft Data Platform MVP, BimlHero Certified Expert, author, speaker, blogger, organizer and chronic volunteer.
Cathrine’s blog: https://www.cathrinewilhelmsen.net/
*Untranscribed Introduction*
Carlos: Compañeros! Welcome to another edition of the SQL Data Partners Podcast. My name is Carlos L Chacon, your host, and it is good to once again have you on the SQL Trail. Today’s episode, we’re going to finish up the conversation we started with Jason Horner and Cathrine Wilhelmsen. We were talking about this idea of is SSIS still relevant, so ETL principles, practices, tools. So, if you haven’t listened to that first episode, you may want to go back, take a listen to that one. This episode gets into a bit more of the how do we keep up, more of the training components and so I think that you’ll find this information interesting, as well.
Before we get into that conversation again, we do want to give a couple of shout-outs. One to Luis Fernando Aoki. I hope I said that last name right, saying he’s enjoying the podcast, and thanks, Luis, for reaching out. Eric De La Torre, Shehab El-Najjar, and Jim Jones. Thanks, everyone, for connecting and saying hello. We do appreciate it.
The SQL Trail event is coming in October, the 10th through the 12th and my question for you, while many of you have signed up, if you haven’t signed up, I’m curious as to know why. Travel is one reason, obviously if you’re outside of the United States, I can see how that would be a very high barrier to coming. But for those of you who are state-side, I guess I’m curious, what have been your biggest hang-ups? If you’ve been approaching your managers and they’re saying no, I’m curious. I’d very much like to know why that is. There is still time to register and we hope to see you. Sqltrail.com for more information, there.
Today’s show notes are going to be at sqldatapartners.com/ssis or at sqldatapartners.com/145. With that, let’s go ahead and finish this conversation with Jason and Cathrine.
Carlos: When we think about trying to take on some of those responsibilities, all of a sudden, it can seem pretty daunting, because there are either terminology, technologies, which we’ve already talked about, which are new, and ways of doing things that are just different from us or from what we have been doing. Other than the school of hard knocks, how have you guys developed the skills to get into this? I know, Cathrine, you’ve come from a BIML perspective. I’m assuming there’s still some data warehousing pieces. I know, Jason, I feel like you’ve been in that integration. We’ve talked a lot about data warehousing. Is data warehousing a, I don’t know if a prerequisite is a right word, but is that where we get a lot of the terminology and the ideas for some of this processing, or how do you go about understanding that, better?
Jason: I think if you go back to Kimball and Inman, they kind of created this high level architectural framework. For what are the components of a data warehouse and how do you move data around in the data warehouse, how do you solve business requirements? Then I think you take a step down from that and understand that overall context, but then, “okay, now I need to actually implement this.” Now it meets the real world and there’s problems. I’ve got data quality issues, I’ve got multiple source systems, I have limited access, I’ve got lots of volume of data. So, some of those architectural tenets can start to break down when they hit the real world. It’s kind of like the Mike Tyson quote, “everybody has a plan until they get punched in the face.” I think that’s really relevant to the data engineer, because as data engineers, we get punched a lot. We have to solve all of these problems. A while back you were saying, “do we have to spend more time, big architecture up front? Do we need to spend more time architecting?” I would say that we do, but we want to do it in an agile manner. We want to iterate. I think what this looks like from my personal process is, I will solve one piece of the problem. How do I do an incremental load? The one thing I think people don’t always do is they don’t realize SQL Server provides some great technologies there. There’s change tracking, there’s change data capture, and now, to some extent, we have temporal tables that can be used for that, and even if we go old-school, like back to 7.0, we have timestamps. So there are all these hooks, and when I go and see all these implementations, I see people rolling their own incremental load strategy, using create, update and triggers and stuff like that. The product supports this and they’re very robust and they’re very lightweight, so especially now, since you don’t need to use Enterprise for change data capture, where you used to have to do that. So now that’s enabled a lot of these things, so I think we need to rethink how we’re going about that fundamental incremental load process. We solve that and then they say, “okay, now I’ve got incremental load working, now how do I audit this process? How do I make sure that I know when the data’s being updated? When did it happen? Who did it? What process triggered this update?” I add that, and then logging, how do I go back and figure out how many rows were loaded, how many were updated, how much data went through the data factory or through the SSIS pipeline? All that operational metadata that I want to publish in, perhaps, a PowerBI dashboard or some other tool. I could use Azure Application Insights, for example, to monitor that telemetry. That’s all useful, especially when we start talking about controlling our costs and making things more efficient.
Carlos: I like that idea of the Agile approach, but then does that basically give us license to stub our foot a little bit, as we work through some of these things? Another thought that I have, it feels like there’s the arm-wrestle, if you will, between what we’re being asked to deliver and components we should put in that will eventually make our lives easier. For example, like an audit. How many times is that on the request list versus, “oh gosh, now we have this requirement,” whatever 6 months, 8 months, a year down the road that, “oh yeah, now how come we didn’t think about that earlier? Now we have to put it in.” Does that make sense?
Jason: Yeah, totally, and Cathrine actually wrote a good blog post. It’s always okay to fail. That’s an okay thing, that’s a learning experience, it’s an opportunity. It’s okay to say, “you know what, we’re going to do V1 in a minimum viable product,” or an MVP, you hear that terminology being thrown around by all these software propeller heads. That’s valid in the data engineering space, but it gets very dangerous because an application is very ephemeral. The application is just the app. They don’t have to worry about the data. Data has a lifecycle and so we need to be very cognizant of this fact that if we take this shortcut today, we need to go back and be able to remediate it. The governance side of the house cannot just continually be shortchanged.
Carlos: Interesting. So then, going back to the ‘how did I learn this’? How do you spend your time? I know you guys speak at the SQLSaturdays and whatnot, but how many times did you have to fail? What’s been your training strategy to stay up with some of these things?
Cathrine: There are a lot of stubbed toes. There really is, because some of these things, it doesn’t matter how many sessions you go to, you’re going to run into something and go, “oh my gosh, why did I not see those before?” It’s that trial and error process that I think everyone goes through at some point. But in terms of learning, there’s so many resources online that it can be overwhelming, sometimes, because there’s so much information out there that it can actually be hard to find the right information that you need to take in. But I know that Microsoft, they have these conferences. They have Ignite where they publish all the sessions that are presented there, afterwards. They did the same for Inspire, which has a different audience, but still all the information is there. Then, slightly tangent to this, but the Business Application Summit, where they presented a lot of new features in PowerBI, where it came that they’re now introducing data flows in PowerBI, which is something that we haven’t touched on in this talk at all. That was presented at a completely different conference that doesn’t have the traditional SQL or data engineering crowd attending. You kind of have to stay up to date, and for me, I find that Twitter is a great way to keep up to date on all the news. And also, getting that real-time commentary from the people that I follow who are at that conference or just watching it, which kind of creates that interactive experience, even though I’m sitting home alone in my apartment, back home in Norway. You do have to be very aware of staying up to date, because things are moving so fast, you can’t keep up to date with everything. But I do think that everyone should schedule some time, maybe say 30 minutes every week, just to read some blogs or check the news or go to Channel 9 and watch some videos there, just to kind of try and hang on as best as possible until you get onto a project where you have to really dig into this.
Carlos: Yeah. Jason, thoughts there?
Jason: I read Cathrine’s blog. No, I mean, definitely there’s blogs, there’s the Channel 9 stuff is a great resource. There’s just so much out there. I’m a big fan of Microsoft Virtual Academy, it’s free. PluralSight’s another good one, although it’s paid. A lot of stuff on edx as well. We’re not in this, like it used to be when I was your age, when I first got into this, I had to go down to a bookstore and actually hope that they had the book and crack that open. Now there’s so much out there, it’s choosing the right one and Cathrine brings up a really great point. It’s the veracity. This notion of, well there’s six blog posts and they’re both telling me something slightly different, and so you have to understand your context. This notion of, it depends and best practices, well, it depends on what? It always depends on the context. If I’m in SQL Server 2008, I have a different set of tools that I can use or a different approach that I can take than if I’m in SQL Server 2016, perhaps. I have to consider, when was the resource published. But what I’m a big fan of is actually the docs for a lot of these cloud services are really good. It’s not just like the old booksonline stuff where everything is just kind of buried. They’re actually starting to link to YouTube videos from the project team or Channel 9 videos within the documentation. And then what’s also nice is, a lot of people don’t realize this, but if you don’t like the documentation, you can submit a GitHub pull request and submit additional content or more detail or even grammatical efforts. I know people in the community, like Steve Jones, is really started doing that a lot more and I’ve seen a lot of pull requests come from him in those booksonline documentation and that’s cool, because now we’re helping Microsoft improve their documentation at scale. That’s a powerful concept.
Carlos: Yeah, that’s been very interesting to watch. Treating your documentation almost like an application, in the sense where you can go through and it should be alive and living, people can give feedback and thoughts. Yeah, it’s very empowering to think that, “hey, maybe there’s something I have to add here, that could help someone,” again, given in the certain context or whatever the nuance might be. It’s kind of cool to see. Okay, very good. I think this has been a very interesting conversation. I guess, last thoughts, here as we start to wrap up?
Cathrine: Last thoughts, it’s always a difficult question, isn’t it? I think in terms of when you think of data integration and data engineering, it’s becoming more and more– it has always been that, but it’s more and more understanding the business needs. I think with some of these tools, you might actually be allowed to focus even more on that, than what you had previously. Then, the world changes, and now with new tools, such as Databricks, where you can work and collaborate with data scientists, which I don’t think there were any data scientists when I started working, and that’s not that long ago. The world is changing, and the tools are changing, and you really need to make sure that you understand what problem you’re trying to solve before diving into any of these technologies.
Carlos: Yeah, isn’t that always the rub, is that we have to understand the problem before we get to pick the tool?
Cathrine: Yes. Yes, and if you’re a geek like me, you just want to dive in and play with these shiny new tools, but at least if you’re working for someone, and I hope that most of us do, you do have to solve some kind of problem. That is the core of it.
Jason: Yeah, I’d tag onto that. I think Cathrine had some really good points, but I’d say this quote of “strong opinions, but loosely held.” So in data engineering, there’s so much pressure to get this ETL done or add this data source and so we kind of talked on this, but this is kind of my main thing: fight for the data quality, but also be willing to change your opinion. If a better approach comes along, if today you’re an SSIS developer, don’t be afraid of Databricks, or don’t be afraid of Azure Data Factory. Take some time to learn it and then figure out, “is this better than what I’m doing today? If so, let’s incorporate that into my toolset.” Let’s use the right tool for the job, but also don’t get so overwhelmed because there is so much. There’s so many new tools, new frameworks, so if we look at outside the Microsoft ecosystem, you have things like AWS Glue, NiFi, Luigi, all these Python frameworks for doing ETL, and it’s just kind of funny to me. It’s like they’re all kind of devolving back to a traditional ETL tool, even though they were built as a response to, “oh, these ETL tools suck. They don’t support CICD, you’ve got to use these goofy GUIs that they just don’t integrate. They don’t integrate with Source Control, they don’t do this, they don’t do that. But now we’re just building these frameworks that are solving things that the ETL tool already solves for us. Things like, logging and auditing and everything old is new again.
Carlos: Yeah, it is funny how that come around. I think one of the critical points there is the organizations aren’t paying you, well, even though the job descriptions still point out, “hey, you must know this toolset” and we even talked about this in the beginning, is that how are you going to train your team or what your team knows may dictate some of the tooling that you use. But at the end of the day, the real value is because you can solve those problems and if you can solve a problem with a different tool, then by all means, I think you’ll get some support there.
Jason: Absolutely.
Carlos: That’s interesting. I do have one more question that I guess I want to ask before we dive into SQL Family, here. That is, let’s prognosticate here, a little bit. We’ve seen this idea of Azure Data Factory, we have addressed the idea that SSIS is still going to be around, it’s still relevant. What do you see the future, and we’re going to stick to the Microsoft Stack. You brought up some of the other stacks, but we’re going to stick to the Microsoft Stack. What do you see the future holding for the ETL or ELT world?
Cathrine: Well, Microsoft is all about the AI, so I’m just waiting for them to build some kind of solution where I can tell them, “please build my ETL” and they will.
Carlos: There you go, there’s my data, let me go ask some questions, yeah.
Cathrine: I’m not sure that’s a reality, though, but if you watch anything coming from Microsoft, these days, it’s AI, AI, AI. You might think of it as hype or a buzzword but think about all the things that you’re currently doing that’s using AI. Even things you might not even think of, like flash fill in Excel. It’s AI, behind the scenes. And I hope that some of that will be coming more into the world of data integration as well, because we’re still doing a lot of things manually. Figuring out this and that, and I’m not sure I can predict anything. I think there are way smarter heads out there that can do some of that thought processing around that, but some kind of machine learning that will help us maybe focus more on the business logic and not so much on the implementation side of it. No idea if that is coming, or when it’s coming, but that would be wonderful.
Carlos: Very cool. Jason thoughts?
Jason: Yeah, I’ll probably take a little more shorter-term view of things. I think that’s definitely probably coming at some point, just more AI to handle things like data cleansing and golden record management, things like that. I think, from a tooling standpoint, it seems logical to me that right now, Azure Data Factory is fairly robust. There’s probably a few areas that can be improved. What I’m hoping Microsoft realizes is that they need to tie that back to some type of data governance. I’d like to see more integration with products like Azure Data Catalog and to be able to track some of that stuff. And then it would be great to really see Azure Data Factory introduce some kind of transformation capabilities, so maybe that’s on the horizon, maybe it’s not. I think those are some areas where it has some gaps, but overall, I’d say Azure Data Factory, for a V2 product, it’s very robust. It can talk to a lot of different data sources, and they can solve this integration problem of moving data from A to B very well.
Carlos: Very cool. Thanks again, for that. I think we have some interesting insights and I know I learned a bit, as well, in this conversation, so that’s always nice. It’s one of the fringe benefits of doing this podcast. Shall we go ahead and do SQL Family?
Carlos: First question. Your all-time favorite movie?
Jason: So I’ve got two. Fear and Loathing in Las Vegas and Fantastic Mr. Fox.
Carlos: Oh gosh, I don’t even think I’ve heard of either of those two movies.
Jason: You’re in for an education. Start with Fantastic Mr. Fox and then go to Fear and Loathing in Las Vegas.
Carlos: Okay, now are these like fantasies, sci-fi?
Jason: No, yeah, so Fantastic Mr. Fox is kind of an– I forget the guy’s name. He actually has another movie coming out, but what he does is, it’s kind of a unique film-making style and he characterizes animals in these roles, and then they have famous people do the voice-over. So for Fantastic Mr. Fox, I think it was like George Clooney was the main character’s voice. That was built, that was actually like a children’s book and so they took some liberties with it, but the dialog on these things, it was just so crazy and outlandish, it was really amazing.
Carlos: Oh, that’s funny.
Jason: And then the visual experience was cool. And Fear and Loathing in Las Vegas is based on Hunter S. Thompson and he did some interesting journalistic things back in the 60’s and 70’s, but it’s a very wild ride and there’s a Toby McGuire cameo early on in the movie that a lot of people aren’t aware of, so be on the lookout for that.
Carlos: Interesting, yes, now Kevin will be even more upset that he missed it, because I know that’s– eclectic movies are his specialty.
Jason: Nice.
Carlos: Okay, so the city or place you most want to visit?
Jason: I think in the short term, the one place that I really want to go in the United States I haven’t been is Cedar Point, to hit the amusement park there.
Carlos: Oh, is that right? And this is in Ohio, or is that–
Jason: Yeah, it’s in Ohio, I believe, and so I haven’t been able to align that, yet. But I always like to point out, really right now there’s two places potentially on the calendar this year that I want to go, both SQLSaturdays. So, there’s SQLSaturday in Puerto Rico, which I’m really excited to get back there. It would be my second time if I go. And then there’s another one in Costa Rica, there’s a BI Edition one. I love Costa Rica. I think it’s a beautiful country. Amazing food and just really great outdoors kind of stuff to do. So definitely encourage anybody that is in the SQL Family to support those SQLSaturdays and go to them. Actually, all of the LATAM SQLSaturdays are amazing. The people are very friendly, and they really take care of you well, and they’re amazing places to go.
Carlos: Yes. Yeah, I’ll second that. My dad’s from Costa Rica, so I make it down there quite a bit. Unfortunately, for whatever reason, the schedule never lines up for me to go to the BI Edition, but yeah, it’s a good time. Food that reminds you of your childhood?
Jason: This one’s a little weird, but I’m going to say bananas and oranges chopped up in a bowl.
Carlos: Okay, so kind of a fruit salad of sorts? Yeah.
Jason: Yeah, a little bit. You call it the hillbilly fruit salad, I think.
Carlos: Now is this something that you still regularly eat?
Jason: No, you know I don’t, and it’s funny, because I probably don’t eat enough bananas anymore. I do try to eat oranges pretty frequently, especially when they’re in season, but my grandma used to chop that up in a bowl and that was part of my breakfast. I just always remember eating that, but I haven’t eaten it in years.
Carlos: Yeah, interesting, okay. Now, how did you first get started with SQL Server?
Jason: I was a web administrator for a large ISP and I was the only person who could even spell SQL, so my first project, when I took that job was moving from SQL Server 6.5 to SQL Server 2000 and doing that migration for, I think we had about 3000 customers across 7 different servers. So not only did we migrate all these databases, but we consolidated them to a single server.
Carlos: Oh wow.
Jason: Yeah, it was pretty wild.
Carlos: Yeah, I’m sure that was some fun times. I mean, particularly back in the day. I mean now I think about DataTools.
Jason: Oh, the DBA PowerShell Database Tools, DBATools, yeah.
Carlos: Yeah, the PowerShell, yeah, the DBATools and I’m like I can’t imagine doing my migrations without them anymore.
Jason: No, it was about two or three real late nights of we had all these customers that had weird credential issues in their web.configs. Sorry, back in that day it was global.asx or something like that. So we had to actually go into the customer’s websites and fix connection issues, fix login issues, we had some customers using SQLOS, some using Windows, it was a nightmare.
Carlos: Yeah, sure, wow. Yeah, sorry to bring up the tragic period in your career, there. Now, if you could change one thing about SQL Server what would it be?
Jason: I don’t know. I mean I guess my biggest– I want a better solution around managing database projects. There’s the SSDT stuff.
Carlos: Right, so what don’t you like about that, now? I mean, so you say better, I guess you create the project, just because it seems so individualized, still or what’s the–
Jason: The main issue is I guarantee you any real world database that I point and try to reverse engineer or even just generate a backpack or a dacpac from, it’s going to fail because of cross-database references, orphaned credentials, just all sorts of goofy stuff that people do.
Carlos: I see what you’re saying.
Jason: And it just seems like the tooling just really struggles with that. I think the problem is that the database becomes the golden record and that’s not what you want.
Carlos: I see what you’re saying, so almost a way to infer, like “hey, if you’re referencing an object that’s not in there, create a stub or something for me, so that I can move forward without the whole thing breaking.”
Jason: Absolutely, yep.
Carlos: Okay. Best piece of career advice you have received?
Jason: Bad news is like a diaper, it doesn’t start smelling better with time, so, don’t let things fester and if you’ve got bad news, like, “oh I’m going to be late on this project”, it’s better just to get it out there and move forward.
Carlos: There you go. Interesting, okay. I like that. I like that a lot. Our last question for you today. If you could have one superhero power what would it be and why do you want it?
Jason: Actually, I want Cathrine’s PowerPoint abilities. But from a real superhero, she is a superhero with PowerPoint and just computering in general, I’d say. But I’d say, actually, the ability to clone myself.
Carlos: Oh, do we really need two Jason Horner’s in this world?
Jason: At SQL events, I don’t think they could handle it. I think that would be an extinction-level event.
Carlos: There you go. So now, is one going to be the worker bee and one going to be the vacation?
Jason: Yeah, we’re going to work in shifts and I’ll be the 6pm to 6am and the other guy can deal with going to the conference.
Carlos: Okay, well very good. Well, Jason and Cathrine, thanks so much for being with us, today.
Jason: Thank you.
Carlos: I just think that’s hilarious. “Everybody has a plan until they get punched in the face.” This seems to be appropriate here, as technologists, things are constantly changing. It can be tough to keep up. I think a lot of times, we’ve talked about on this program how the business niches will be the new criteria for what employers are looking at. I think that’s one way to hedge some of these technology advancements. That’s not to say that you’re ever going to be able to stop learning and it’s nice to have things like this podcast or other resources available so that you can try to keep up, or at least be aware of what’s around you. Then when you need to dive further, you have a head-start, if you will. I think that’s going to do it for today’s episode. Compañeros, thanks again for tuning in. we do appreciate you taking us with you wherever that might be. Of course, I’m always interested in you reaching out on social media. You can connect with me on LinkedIn. I am @carloslchacon and we’ll see you on the SQL Trail.
[…] Listen to Episode 145: Is SSIS still relevant? Part 2 […]