As one of the newest Azure offerings, Azure Data Factory can at first thought be compared to SSIS. I don’t think we are quite there yet with the comparisons; however, I talk with MVP Reza Rad and discuss some of the similarities and differences between the two.
A consultant now living in New Zealand, Reza has lots of experience with the data management and gives us the scoop on the initial offering of this new Azure tool. I hope you enjoy the episode. If you have some feedback or comments, feel free to reach out on Twitter–I am @CarlosLChacon
Transcription: Azure Data Factory
Carlos L. Chacon: This is a SQL Data Partners podcast. My name is Carlos L. Chacon, your host, and this is Episode 21. Today we’re talking about Azure Data Factory. As you may know, I am spending a lot more time in the Azure environment and wanted to follow up with some of the Microsoft teams and other people who are doing things in Azure.
I happened to bump into Reza Rad– no, I didn’t happen to– I sought out Reza Rad at the summit. Reza is from New Zealand and a very, very nice and generous guy. He had a session on doing differences between Azure Data Factory and SSIS.
As you might guess, the Data Factory is a transformation tool that you can use in the Azure environment. We sit down and we discuss some of the differences.
As always, comments are welcome, [email protected]. If there are topics you want to hear about, please let me know. It’s always good to have you, compañeros. I hope you’re learning something new every day, and welcome to the show.
Children: SQL Data Partners.
Carlos: Reza, welcome to the show.
Reza: Thank you. Hi, I’m Reza Rad, and it’s an honor to be here and to be in this podcast.
Carlos: Ultimately today, you gave a presentation at the summit, and it piqued my interest, since I personally am trying to do a lot more with Azure. You gave a presentation on comparing SSIS to the Azure Data Factory. That’s the conversation that we wanted to have today. Get us started, first maybe we should talk about what is the Azure Data Factory.
Reza: Yes. Azure Data Factory is a data-ingestion tool. It’s not like SSIS that need an ETL tool. It’s more of like an EL tool. We do extraction and load with Azure Data Factory, and that works quite well with large amount of data, with big data, with HD inside, with all those things.It can transfer massive amount of data from source into destination, but it is not actually a data transformation tool, so it shouldn’t be compared with data transformation tool. It has its own strength. When you compare that with something like SSIS, SSIS has its own strengths, Azure Data Factory has its own. It is a Cloud-based service, so you pay as you use.
Carlos: That’s an interesting concept. I hadn’t thought about taking the “T” out of that equation.I think a lot of people, particularly, will look at the Azure SQL database. You don’t have the SSIS or the jobs component to run some of those things, so I think there might be an inclination to think “Oh, the Azure Data Factory can do that for me.”
Carlos: But it’s not quite there?
Reza: Exactly that. Yes, I have also quite a lot of clients that says they want a Cloud version of SSIS. They have Azure SQL database, they have Azure storage, and they want to do things on Azure. But they don’t want, actually, to have an Azure VM and then set up SSIS on that, because that would be quite a different story.
Carlos: Much more expensive.
Reza: Yes, more expensive as well. Azure Data Factory can do that, but it’s not that much, let’s say, powerful in data transformation. It is actually powerful. It’s not, let’s say, good rich development experience of data transformation. Because when you do it in SSIS, you just drag and drop transformation, this is my look-up, this is my join, all those things. It’s just drag and drop.It’s really rare that you write scripts in SSIS. But in Azure Data Factory, if you want to do data transformation, you should write the scripts. These scripts might be PIG, HIVE, C# or even SQL stored procedure, and not all stored procedure, a special type of stored procedure.
This is really helpful, especially in solutions that are based on Azure, because a specific customer or client doesn’t actually want to pay for our premises SQL server when everything is on Azure. Why should they pay for a SQL service enterprise on bugs and the environment for that when they don’t do data movement on Azure?
So Azure Data Factory is actually helping on that site. So this is why Azure Data Factory invented Cortana Analytics, to actually work with Azure machine learning, Azure event hub, all those Azure components to actually provide that.
Carlos. Yes. I guess you bring up that, the use case, for Azure Data Factory, or the reason it was developed was to help with the integrations of some of the Azure offerings that are in the Cloud. You mentioned machine learning and what-not, right?
Reza: Yes, yes.
Carlos: So while it can do some integrations between my source systems A and B, it’s really more for, correct me if I’m wrong, when I have some source data, data warehouse, that I then want to integrate with Azure machine learning.
Carlos: And those are the components that’s going to wear its function and use?
Reza: Yeah, that’s right. Yeah. Because in most of the cases, if you are doing only things on, let’s say on premises, or even you just do things on premises, and then at the end, you deploy things or transfer those things on Cloud to just assure that I’m part of the [inaudible 06:21] or something like that, then you can do most of those things with SSIS.Then we talk about Azure Data Factory. We talk about something like Azure SQL data warehouse, big data warehouse that actually meet processing or compute engine that transfer these things into engine that can analyze that later on or it can be used for Azure Machine Learning batch scoring, those kind of thing, or Azure Data Lake, and all of these things.
This is actually the purpose of building Azure Data Factory, to do these functions rather than doing more data transformation as its size does [inaudible 07:12]
Carlos: Well, integrating from on premise to Azure, that ultimately, the recommendation, or your thoughts are, “I could continue to use SSIS for that.”
Reza: You can use either SSIS or Azure Data Factory. Actually, each of these has its own [inaudible 07:29] . If you SSIS, you will get rich list of data transformation. You can do whatever you want. In Azure Data Factory, you can do whatever you want as well, but you have to SQL thing.But pricing-wise, environmental-wise, those kind of things, Azure Data Factory, because you don’t pay for SQL Server enterprise version for doing such things, or you don’t need big server, real-time administrator who actually install SQL Server and take care of all those kinds of things. It’s just the server that you use.
Also, if you have massive amount of data, Azure Data Factory usually works better in this size.
Carlos: Sure. Because it’s the ability to scale, right? So that’s one of the things that I had for over and over and over, talking with the Microsoft folks, or folks like yourself that are dealing with that. Well, truly, we’re dealing with hybrid approaches, so you’re individual scenarios are going to determine tools you should use to go forward.You bring up a good point based with the licensing. For SSIS, if I’m moving things into Azure, I may not have the cost of SQL Server license to keep that. I do have Azure Data Factory to help me, but I may not have all the bells and whistles that I’m used to.
Reza: Yeah, then I have to write more SKUs to actually do the same BI.
Carlos: You mentioned, let’s get a little bit with the scripting. You said that there was a stored procedure option, that you’ll let me use that special stored procedure.
Carlos: So tell me what that means, special.
Reza: When I say special, because when you write stored procedure in SQL Server, you can do everything. That’s the ultimate power in the stored procedure. It functions if you’re doing things or something. But in this type of stored procedure, you should return something.
Carlos: So that’s a little more like a function.
Carlos: Like a raw result or something should get returned?
Reza: I’m not exactly sure about that, but it should be actually, I think, bullion type of result that should be returned as well as some other results that it can return. This is actually other’s case, so I’m not up to that point to say, but it is right now at this point of time.That’s one part of it. There are some, let’s say, developmental issues around that as well. So then you have your function, and that returns something like that in the stored procedure. Then you return your stored procedure later on, a couple of months later. The return function of that changed, return data part of something else.
Then you have to do some changes in your script in your activity pipeline in Azure Data Factory as well. It doesn’t actually pick that change automatically. It’s not that good development experience yet. It needs more development effort.
At the end, it [inaudible 10:46] the stored procedure as long as you provide the article three prior, then you can write whatever you want and do the transformation, whatever you want.
Carlos: I know that some of the Azure technologies are maturing, and they’re starting to be very comparable, particularly the Azure database, SQL Azure database version 2016. They’re baking things into the part of Azure product that will then be released for the on-premise product.You mentioned in your presentation yesterday, it seemed like Azure Data Factory we’re still in V1 a little bit.
Reza: Yeah. This is going to be much better from now, I believe. This has been released in August 2015, just three months, something like that.Matt Masson, who was the SSIS program lead before and then moved to Master Data Services, assigned back to Azure Data Factory now, so we should expect to have a lot of big changes in Azure Data Factory, and big improvements, because those guys know what they do, and they have big pipeline of things to be done.
Carlos: We could go into, maybe let’s do that, go into some of the nuances between SSIS and Azure Data Warehouse. I think we’ve picked up on a couple of them. First of all, obviously, it’s the GUI, or the GUI experience.
Reza: By the GUI experience, you mean?
Carlos: The development options in SSIS, you can drag and drop, I don’t know. I am not a developer, but I can go in there and create an ETL package.
Reza: That’s right. That’s the main thing. In both administration side or the development side, SSIS is quite powerful, because it used SQL Server data tools.
Carlos: That’s right. You have that suite of…
Reza: You can even create import-export package from management studio. That’s easy, right-click, import, export from this to that. But it’s not that easy in Azure Data Factory, because you have to write JSON. JSON is not. It is easy, but you have to be familiar with it, and you have to write a SKU, so you cannot expect a DBA to come up here and say, “OK, let’s write this JSON.”
Carlos: So the DBAs may have to get cozy with their JSON developers, right?
Reza: Yes, that’s right. But I believe there will be a very fast enhancement on JSON writing of that. Probably, there will be something that will generate that JSON from some result, something like that.There are visual studio templates for Azure Data Factory, but that template doesn’t do that much. It just create the project with folders, link services. And when you right-click on folder, and you say, “OK, I want to create a link service,” this will put the JSON escape template for you, and you have to fill that in.
Carlos: It’s almost like using the templates in SQL Server Management Studio, the parameters, so it gives you a basic script, and then you have to fill it in, fill in the blanks.
Reza: That’s right. In terms of development tools or administration tools, Azure Data Factory is still far behind to SSIS.
Carlos: There is one nifty feature that I’m going to group into the GUI category and Azure Data Warehouse, and that is the hierarchy or dependency function. Why don’t you tell us about that?
Reza: Data lineage, you mean.
Carlos: Data lineage.
Reza: Data lineage. In terms of data lineage, that’s the strong, the main power of Azure Data Factory against SSIS. In SSIS, we can have many data flows, many packages all around the world, and these can talk with each other, datas that can be filled in one data flow and can be empty in other data flow and join with something else in third data flow.It’s really hard in SSIS that you track down a specific table and say, “I want to know where this table is coming from.” You can’t track if you are double to track that package. You could, “Oh, I’m filling that in that data flow because I’m in this.” You don’t have to track down of that, or you don’t have the data lineage that you want to see where this is coming from.
In Azure Data Factory, on the other hand, we have the data lineage, so we can have a massive data pipeline, which is actually equivalent of data flow in SSIS. In that massive pipeline, if we click on the specific data set, this will only highlight the whole path down to only those data sets and pipeline that actually produced this specific data set, and it’s much easier to track.
For example, if there’s a troubleshooting activity, you want to know why this is wrong, you can just go to those pipelines and data sets. That’s the strong point. Microsoft team wants to implement something like that in SSIS, but that’s still in their pipeline of things to do yet.
Carlos: The visual experience, what you demonstrated was very cool. It would highlight that for you, so you have several different steps in your package. You click on that visualization, and then the other ones will go grey, and so you can more visually see that. That’s going to be nice once it’s a bit more usable. That’s going to be handy.
Reza: I’ve heard some rumors about, I’m not sure if this is going to be available. But about hat Azure Data Factory will support data lineage for SSIS as well, I don’t know how. Maybe importing those into Cloud or something like that, but this is just a rumor that I heard.
Carlos: Sure, we can only hope. What are the other comparisons we want to talk about?
Reza: In terms of security agenda, it all look like each other in terms of security. They both support role-based security. So as you do in SSIS, you can say in SSIS, I actually define a role that can deploy packages and other role that reap the packages or execute packages or whatever. Same in Azure Data Factory. In Azure Data Factory, you can do the same role-based security.One of the other differences, I have to say, is compatibility with HDInsight and Azure Data Lake. HDInsight is something that SSIS supports as well in their latest extension, Azure Pack, SSIS Azure Pack. So they support Azure HDInsight tasks, that you can do some part of the work, but it’s more built in and embedded in Azure Data Factory.
You can just define a compute. In this compute, I want to have a cluster HDInsight engine. In this engine, I want 16 clusters. You can define all of these in your JSON SKU, and this will build that HDInsight cluster for your compute, for your data factory, only for that, and this will close that at the end of that as well.
It’s inverted engine to work with HDInsight with PIG and HIVE, and all those other thing. You can do all of these things in SSIS, but you need to do some manual things.
Carlos: How do they determine the costs of Azure Data Factory in Azure?
Reza: The cost of Azure Data Factory, let’s say, compared to its SSIS, the cost is actually based on features that you use. If you use just import/export, we can express those things. If you use Lookup, those kinds of default transformation joins or all those transformation, then actually, you should pay for BI edition of that SQL server.If you use a special type of transformation, like CDC, for doing incremental load or those kind of things, then you have to pay enterprise, which is quite expensive, depends on the core and CPU and all those things.
Carlos: When we say CDC, we’re talking about change in data capture?
Reza: Change in data capture, yes.
Carlos: Those features are only available in the enterprise version?
Reza: Yeah. There are different ways of, let’s say, implementing incremental load in data warehousing. One of the good payoffs doing that is doing that through the CDC. If you want to use that feature, usually, you need enterprise edition SQL server.
Carlos: Personally, correct me if I’m wrong, I always thought of change data capture as being something on my database. I have enterprise version of SQL server. I have changed data capture. Now I have SSIS, and I guess a lot of time, those packages are going to run on a second server.You’re saying that a SQL instance that runs my SSIS would have pulled in and change into capture or do something with it or also need to be enterprise edition?
Reza: The database in the server has to be enterprise, because that’s about CDC or change into capture.
Carlos: My source data? My own TP?
Reza: Yes. For the SSIS side of that, yes. It has to be that. You can develop something with CDC source or also in SQL source in SSIS, or you can run that perfectly. But if you deploy that to production, it has to be enterprise edition that works with that.
Carlos: I did not know that.
Reza: There are also some other special component like Fuzzy, Lookup, those things that do. Sometime Lookup, text mining, those things that they are only available in enterprise.
Carlos: I’m not an SSIS guy, but I have done just enough packages. I’m sure there are certain features, but I guess I feel like Fuzzy Lookup was available to me. But I’m in so many different environments. I can never remember if I’m on enterprise or standard.
Reza: It might be. I’ll have to check. But yeah, there are some features that are only available in enterprise. So if you go with that fully feature version of that, usually, you need the enterprise.
Carlos: To extend that, a lot of the third party add-ons, for example, there are several out there that will help you with SSIS deployment in building things like so FTP-ing and special email function and what-not, because the SSIS is not simple grade at. Do you know of any? I guess there’s no third party integration tools to help you in Azure Data Factory?
Reza: No, it’s not.
Carlos: It’s super new?
Reza: That’s another strong point of SSIS, because there are good luck components that cause luck or fragmented works. They have their own SSIS component that you can plug into your package and use them and good luck. But the main reason for that is that SSIS has been there from 2005.
Carlos: That’s right. We have 10 Years development experience versus three months.
Reza: Yeah. And I believe there are quite a lot of things for Azure Data Factory in the future aspect.
Carlos: Sure. That’s always the downside for folks, for compañeros that are listening, that are tuning in. We’re here in November of 2015. If you’re listening to this in the middle June of 2016, you are going to want to revisit this, because things will have changed between them. I don’t want to report downside of working with Azure. Things are still in constant flux.
Reza: Continue with data pricing, data pricing of SQL server you pay for feature. I hear data factory, you don’t pay for feature. You actually pay for usage. You have all features available. It depends on how many activities you use and frequency of activities. Because in Azure Data Factory, we have pipeline, which is data flow. In each pipeline, we have activity.We can say activities are like transformation. Not exactly transformation, an activity can be just a copy. But some things are transformation. So it depends on how many activities you use in your Azure Data Factory and how frequent you run this.
Let’s say, for example, you run 10 activity in your data flow, and that 10 activity you run 10 times a day or something like that, so that’s 100 times a day. And then you run that for a month, so that makes 3,000 activities per month.
Then you fill into one of the tires of Azure Data Factory pricing. Let’s say for example, if you run this amount of activities per month, this is the price that you pay. It does not depend on the actual size of the data as well, so you can transfer massive amount of the data.
But usually for massive amount of the data, you need HDInsight cluster. HDInsight cluster take space, so you pay that size.
Carlos: Storage cost.
Reza: Yes, but it’s not that much, again.Comprising of these two. Usually, you get quite a lot of more pricing and costing bit SSIS or services like that against Azure Data Factory, which is really low cost based on the activities that you use. You can actually, the activities, some of those that you don’t use and say, “OK, I have this pipeline, but I don’t use that’s.”
The activity that you pay, I think, 80 cents per month for pipelines that are not in use, which is fine. It’s quite lower cost, much, much lower cost against SSIS.
Carlos: The upfront costs aren’t there. We use its cost, and that’s more than nice things. You can let it go dormant, and you’re not continuing to use it.
Reza: That’s right.
Carlos: Well, great. I think we’re all through. There are few more items so we can go in here. If folks want to know a little bit more, we’ll point them to your presentation to that information. On the website, of course, you go to sqldatapartners.com/podcast and look at today’s episode on Azure Data Factory, and we’ll have those notes available with additional information should you want to take a pick at that.Before we let you go, we have a couple of questions we want to ask you. We’ll listen to one additional way that folks can learn about SQL Server.
Host: Have you considered mixing SQL Server training with great friends on a boat in beautiful destinations. If you have, I invite you to check out sqlcruise.com for more information about the great training, the team, and the aim for putting together. There will be a standing sale for the Caribbean in January of 2015, and they want you on board.They’ve offered $100 discount to listen with all these podcasts, so you can check out sqldatapartners.com/sqlcruise for more information there, my experience on SQL Cruise and who knows, maybe we’ll see you on board.
Carlos: Reza, thank you for being here. Before we let you go, we have a couple of questions for you. We’ve talked about Azure Data Factory and SSIS. If that’s your tool, that’s OK. But we want to know, what’s your favorite SQL tool?
Reza: I like SSIS. Everyone knows that I like SSIS. I’ve been working with that since 2005, and before that with DTS. Why I like that, I think it’s more of my passion about data movement and data transformation. I like, actually, to do these things.
Carlos: I know you’ve been working with this for a long time. You have lots of experience in the industry. Take us back, if you will, what’s the best piece of clear advice that you’ve received and would want to share with others?
Reza: I have been in IT for I think 15 years, something like that. One of the very first advices that I have received and was useful to me was from very first manager of mine in a company. It wasn’t a consulting company, it was in health company that I was a developer, and I had a manager.At that point of time, I didn’t know about many of the technologies, about SQL server, about lots of things, and I hadn’t learn.
He said to me that you’re learning these things, and this is great. But after you learn these things, you might think that you know these things, and you don’t need to learn anymore. But this is not how IT works. In IT, you should always learn. If you don’t learn anything, you’re losing anything.
That’s really good piece of advice. Because IT technology, everything SQL Server, SSIS, Azure, all of these things, they are new things coming up.
Carlos: Constant change.
Reza: Yeah. If you don’t learn, you’re losing the new features that are available, and you’re still sticking to traditional way. I remember this past key note, Joseph Sirosh also said something like that. He said, I don’t think I can actually remember the sentence, but he said industry doesn’t respect traditions, our industry respects innovation.
Carlos: Compañeros, that’s a great piece of advice. Our last question, if you could have one superhero power, what would it be, and why do you want it?
Reza: I can tell it in this way rather than thinking and changing the question. One of my favorite superheroes is Batman. Batman is an ordinary person who doesn’t have any special superhero power like Superman or those things.He uses some normal things like special, I don’t know, custom, and he can fly and those things. But these are all based on normal things. So what I do usually like in my career’s role is doing that. I like to show people how they can do really good and powerful things with some tools which are available, and I try to do that in my presentations, those things as well.
Carlos: Very cool. That’s what I like about you, Reza Rad, always trying to reach out, super friendly. You befriend me and my wife. We met last year, and so I do appreciate your time in being here with us.Compañeros, again as well, take a peak, sqldatapartners.com/podcast for today’s show notes and additional information. Thanks for tuning in. If you like what you heard, let us know. Leave us feedback. Of course, we’re on Twitter. I’m CarlosLChacon, and we’ll see you on the SQL training.
Children: SQL Data Partners.