Welcome back to the SQL Trail! In Episode 52, I chat with Ginger Grant, also known as “Desert Isle SQL” in the #SQLFamily circle. We talk about using Revolution Analytics R with SQL Server: its origins, how to get started, and what to look out for as your organization starts looking at implementing R. If you’ve ever wanted to know how using R in PowerBI or SQL Server will affect you, this episode is for you.
Episode 52 Quote:
“The process that they do is called “chunking”. What they do is they break the processing into chunks and then combine it together. And that’s different from what open source R will do. Open source R will crash. And there are several commands of course you use to make it happen that are not a part of open source R.” – Ginger Grant
Listen to learn…
- How one language from 1996 is transforming data science today
- How R got started and why it’s built the way it is
- What C has to do with R
- How Revolution Analytics transformed R into a useful data science tool
- How the open source version differs from the Revolution Analytics version in SQL Server
- How to use R alongside SQL Server
I want to know – what have your experiences with R been? Have you tried it yet? And if this podcast is helpful to you, please leave a review on iTunes so others can find it as well. Subscribe to the podcast via iTunes, Stitcher, and Podbean.
This episode is sponsored by
About Ginger Grant
Ginger is a Pragmatic Works Senior Consultant, a Microsoft SQL Server MVP, and a regular speaker at SQL Saturday events on data management, PowerBI, and using R. She’s also a trainer for Pragmatic Works and writes about data topics on her blog.
SQL Saturday: Administration, Security and Data Management in Power BI
SQL Saturday: Combining R with SQL Server 2016
T-SQL Tuesday #79 – Creating R Code to run on SQL Server 2016
(MSDN) SQL Server R Services
*Untranscribed introductory portion*
Carlos: So Ginger, welcome to the program!
Ginger: Thank you very much! I really appreciate you asking me to be on, Carlos.
Carlos: Yes, thanks for being here from Arizona. I think you’re probably our first guest from Arizona, if I remember correctly.
Ginger: Probably so.
Carlos: And here we are, we are in June recording this. Have you melted? So are things melting in Arizona yet?
Ginger: No, it’s only gotten up to 119 degrees, not 120.
Carlos: [laughing] Only 119, right? So interesting that you mention the heat, because it seems like one of the “hot things” in the SQL Server space and in the analytics space now is R. That’s why we wanted to have you on and wanted to talk a little bit about R. Maybe help us define what it is. We know it’s coming into SQL Server 2016. Talk about how we might be prepared for it, and have some conversations around where it might be going.
Ginger: Sure, well R is not new. It was actually named when one-letter languages were really trendy, like C. So first they came up with C, and then they came up with S, and S was proprietary. When it became not proprietary it started being called R, because it’s named after the two guys who made it open source. And that was in 1996. And it’s pretty much been exploding since then. And one of the things we’re interested in, of course, is that it’s part of SQL Server now. Last year Microsoft acquired a company called Revolution Analytics, who made a commercial version of R, and they worked to incorporate a lot of R not only into SQL Server but into PowerBI as well.
Carlos: Oh interesting. So it’s starting to get into some of the other products as well.
Carlos: So talk about Revolution Analytics and now that seems different from the open source.
Ginger: It is.
Carlos: What are some of those differences and is there a difference even further between what’s in Revolution Analytics and what’s going to be baked into SQL Server?
Ginger: Yeah, well, good point. So the difference is that R as an open source language was created to be very fast and to run in memory. And in 1996 they weren’t talking, and nobody was thinking about, “I don’t know, let’s do analysis on a terabyte’s worth of data!” But that’s not necessarily an uncommon thing that people want to do now.
Ginger: Also, the guys who created it were statisticians. And they wrote a lot of it in FORTRAN, which isn’t terribly quick.
Carlos: Right. I’ve heard that it’s not like a regular programming language. It’s these math guys that went and got together, and approached it from that perspective.
Ginger: Right. It’s really procedural, as in you run it line by line. And there’s a bunch of libraries for it. What Revolution Analytics did was say, “Yeah, that’s kind of cute, but we want to run large files up against it.” So to do that they did what pretty much everybody else does in the programming world: they figured out how to swap things in and out of memory.
Ginger: That’s what Revolution Analytics brought to it. So you can possibly run out of memory, but you’ll have to work a lot harder at it to make that happen.
Carlos: So they’re doing some of the memory management, kind of like putting in a maximum threshold. Like, “Hey, my server’s got 16 GBs of memory and I’m approaching 10 GBs, I need to start swapping.”
Ginger: Yeah, the process that they do is called “chunking”. What they do is they break the processing into chunks and then combine it together. And that’s different from what open source R will do. Open source R will crash. And there are several commands, of course, you use to make it happen that are not a part of open source R. All of the Revolution commands start with “Rx”, which is kind of cute. It’s like a prescription for Revolution Analytics.
Carlos: [makes drum crash sound]
Ginger: I know! Somebody told me that who used to work for Revolution Analytics and I thought that was kind of clever.
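For readers who want to see what that chunking looks like, here is a minimal sketch using SQL Server 2016’s `sp_execute_external_script` with the RevoScaleR `rx` functions. The file paths, chunk size, and data are hypothetical examples, not values from the episode.

```sql
-- Hypothetical sketch (SQL Server 2016 R Services): the rx functions
-- read a configurable number of rows per chunk instead of loading the
-- whole file into memory the way open-source read.csv() would.
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
        # rowsPerRead sets the chunk size; both paths are hypothetical
        bigData <- rxImport(inData = "C:/data/big.csv",
                            outFile = "C:/data/big.xdf",
                            rowsPerRead = 100000)
        print(rxSummary(~., data = bigData))
    ';
```

The same script pasted into open-source R would fail at the `rxImport` call, which is the quick tell Ginger describes.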
Carlos: Very nice. So now we’re kind of baking this into SQL Server, and of course SQL Server has this in-memory OLTP objects where you can store table and other things in there. Is that going to be a part of the in-memory process or are they still in two different camps?
Ginger: Well, it is an in-memory process. I don’t remember where, but in extended events you can see where it starts using memory. If you’re running it on the server, it uses the same memory space as SQL Server does. So DBAs are going to have to start looking at it in the future, looking at overall memory usage in SQL Server, because it’s a part of the environment and it can starve out your other processes.
Carlos: So let’s take a look at some of those use cases really quick. So it was written by statisticians and we looked at some of the analytics of it. Why would people be bringing in R when we have a data warehouse and ETL? And even on the Azure side, there’s stream analytics and things like that. Why would we be looking at R?
Ginger: Well one of the reasons for it is that R was originally written for statisticians. So they wrote some code and said, “Great! I’ve determined that the probability of this happening is X.” That’s big in insurance. The probability that a hurricane is coming through is X, and their losses are roughly this. And that’s great, but let’s run that against a different data set. And they’re like, “Oh, well, I guess so.” Because traditionally, R is run against a CSV. Somebody creates that CSV and then they write a little program to go against that CSV. So there’s a lot of extracting going on to make this happen. Well, if you have data scientists and you have people that are looking to make this code more production-ready, putting it into SQL Server is a great thing. Also, you’re not tied to that CSV data set that you have lying around.
Carlos: So does that mean that I’m going to have a table that is defined? Do I have to have a different kind of table, or can I take any of the tables that I’m used to and now I have this R program and analyze that?
Ginger: Think about it like a stored procedure. You have a defined set of inputs and a defined set of outputs. And instead of running a stored proc, you’d run R. And not only can you get a dataset that would give you a range of probability, but you can also get graphics out of SQL Server. It’ll generate a graphic file for you. Because R is really big on data visualization which is why it’s being incorporated into PowerBI.
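A minimal sketch of that stored-procedure-like pattern, assuming a hypothetical dbo.Sales table: the T-SQL query is the defined input, and the R data frame is the defined output.

```sql
-- Defined input (the T-SQL query) and defined output (a result set),
-- just like a stored procedure. Table and column names are hypothetical.
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'OutputDataSet <- data.frame(avg_amount = mean(InputDataSet$Amount));',
    @input_data_1 = N'SELECT Amount FROM dbo.Sales;'
WITH RESULT SETS ((avg_amount FLOAT));
```

Inside the script, `InputDataSet` holds the query results as an R data frame, and whatever is assigned to `OutputDataSet` comes back to the caller as a normal result set.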
Carlos: There you go. Well that makes sense. So if I’m using this stored procedure, from a security perspective then, does that mean that I’m going to be granting access to people at the R level, if you will? Are they going to have access right to the tables to be able to read them and what-not?
Ginger: They’re going to need data access required to read what they need to read. So you can actually just paste your R code there so it’s part of your dev set for an installation.
Carlos: So when you say “pasting your R code”, am I in SQL Server Management Studio, and I have a window, and I paste it in there?
Carlos: Very nice! We can get rid of SQL Server Management Studio.
Ginger: Oh no, because one of the things is that R doesn’t have any management tools. There’s no like, “Oh, how long did this run?” It’s just, “Ah, when did you start it?” There’s just a timer at the bottom. But if you’re talking SQL Server, you’ve got all the features and functionalities that you have with Management Studio. How long has this proc been running? How much data is this consuming? All of that you have because it’s in SQL Server.
Carlos: Okay, so now as people start using it, you mentioned it’s in memory. Let’s look at these other scenarios. When I hear about R from folks, I hear it from analytic people that think R is going to help them answer questions they can’t get answers to. The ETL is not working as fast as they want to, things like that. In thinking about memory, and I’m thinking about an administrator, am I going to have to be cognizant and say, “Look guys, these tables now work with R, and these tables won’t.” Or what are the practicalities of implementing this and being able to support it?
Ginger: I wouldn’t look at it as a table-specific thing. You’re going to need some data. Is it going to come from one table or a couple of tables? Are you going to do some joins to get that data? Wherever the data is that’s needed for your inputs, you’re going to need to be able to provide that as an input into R. Whether that’s a single set of variables, or it could be all of the data from three tables and you only want these five fields. So just like you would a stored procedure. Now the one thing I always bring up is that Revolution Analytics is unique in having these in-memory tools, and if you have R code that is lying around somewhere, odds are it wasn’t written in Revolution Analytics and was not written to swap in and out of memory. So that’s another thing too. Now those functions are optional. They’re a part of the Microsoft version of SQL Server, but nobody’s making you use them. And if you don’t use them, then you will run out of memory. So you need to be able to look at the code that’s being offered to you and ask them, “Did you write this with Revolution Analytics functions to chunk this in and out of memory, or is this going to blow up?” Because nobody’s making you write it well.
Carlos: Interesting. So let’s talk a little bit about that learning perspective and how do people go about learning R? I’m seeing more presentations about it. I’m seeing people talk about it. But I understand that it’s kind of a closed community?
Ginger: Not only is it a closed community, but let’s just say that they’re not terribly nice. Let’s just say that forum postings are, um, entertaining if you have no heart.
Carlos: Post at your own risk.
Ginger: Oh yeah, because if someone thinks that it’s stupid and you should know it already, that’s the kind of comment you’ll get back. Another thing too is, for example, they’ll have their big convention. They have it one year in the US and one year in Europe, and this year it’s in the US. And they’re sold out. And, you know, that’s it. And so that’s how they are. Now one person that I’ll give a lot of credit to for trying to work on changing this is Steph Locke. Steph Locke is single-handedly trying to open up the R community and get it to be more like the SQL community, because that’s where she’s from. And she’s developing “satRdays”, kind of like SQL Saturdays. And she’s just kicking it off now. I don’t remember which location won worldwide, but she’s trying to kick that off in a way to get it out of the very academic, very closed-shop mentality that it currently lives in.
Carlos: Interesting. So to circle it back for a second, you were talking about having to use Revolution Analytics to do that swapping. So does that mean that my stored procedure should start with the rx? And if I don’t see any rx, then I know that?
Ginger: If you are looking, as a DBA if you are looking at the code and you don’t see anything that starts with rx, they’re not using it. So as a quick DBA check on the code, if you don’t see rx in front of the function call anywhere, they’re not using it.
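One way a DBA might automate that check is to search the stored procedure definitions on the server for embedded R that never mentions an rx function. This is only a rough sketch: it will miss R code passed in as a variable, and “rx” can of course appear in other contexts.

```sql
-- Rough DBA check (a sketch, not exhaustive): find procedures that call
-- sp_execute_external_script but whose definition never mentions "rx".
SELECT OBJECT_SCHEMA_NAME(m.object_id) AS schema_name,
       OBJECT_NAME(m.object_id)        AS procedure_name
FROM sys.sql_modules AS m
WHERE m.definition LIKE '%sp_execute_external_script%'
  AND m.definition NOT LIKE '%rx%';
```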
Carlos: So now to talk about that community. Microsoft has purchased Revolution Analytics. Is that getting any better? I mean, is Microsoft providing better support there? I know 2016 just came out.
Ginger: Well, I will say that, you know, you can find support. The problem that I have found is that Revolution Analytics is a relatively small company. And so there’s just not that many people who know the product really well. I don’t know if you know this, but Buck Woody works with a lot of people who used to work with Revolution Analytics. He’s part of the team trying to get more people trained on Revolution Analytics, R, and also on trying to get a better integration of them with Microsoft. But there’s just not that many people.
Carlos: Sure. And again, from a different perspective, these are statisticians. They’re not developers. So I think it is kind of a different audience.
Ginger: It is, and one of the things about learning R is the first introductory class, which is on Coursera from Johns Hopkins. And I mentioned this at SQL Saturday Atlanta: I recommend that nobody take that class.
Ginger: And the reason why is that one of the things I think is the big strength of R is the visualizations. And this class doesn’t have you do any. edX has a number of much better classes if people are interested in teaching themselves R. But once you know R, you’re like, “Great! Now what do I do?” Well, then you learn stats. Because what people want to know with R is variances, groupings, and basically you end up learning a lot of stats if you’re going to be writing R, because that’s what you use it for.
Carlos: So now is that a fair pushback? I’m a DBA, and I have an analyst come up to me and say, “I think we should start adopting R.” Is it fair to ask, “Okay, who’s the statistician? Who’s the data scientist? Who’s going to be putting all of this together?” Because don’t come to me to write your R stuff!
Ginger: Yeah, that’s a fair question, because it reminds me of a Dilbert cartoon, you know. “I want a relational database.” “Well, MOM has the most RAM.” Why do you want to do R? Because it’s trendy, or what are you trying to do? Do you have a problem that R would be a good fit for? There’s that old saying: if you have a hammer, everything looks like a nail. SQL Server is not going away. SQL is really good at providing information that is aggregated in some form; if you want to get into some kind of really interesting, nitty-gritty variance analysis or regression analysis, R is really good for that. Microsoft is positioning R as their machine learning on-premises solution. So, I do Azure ML as well, and I know quite a few people who are like, “I ain’t doing that in the cloud. No. I don’t care that it’s in the cloud, I’m not going there.” The idea being that anything that you can do in machine learning you can do in R. So think of that kind of analysis.
Carlos: If you’re looking for trends, that’s what I think about with machine learning. Here’s my data, and then point to me where I see similarities and things.
Ginger: Yeah, the simple one being regression analysis. I mean here in Arizona, it gets hot here. As it gets hotter, my air conditioner bill goes up. Those are two points of data that are always in a line. You know, when the temperature drops my bill goes down. That’s one kind of very common thing that’s being done with machine learning that can also be done in R. Another one that’s really common that people don’t necessarily think about is anomaly processing. For example, your credit card. You ever gotten a credit card call from your bank saying, “Hey, did you make this charge?” This doesn’t seem like you. That’s machine learning.
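Ginger’s temperature-versus-bill example can be sketched as a regression run inside SQL Server. The table and column names here are hypothetical, and the script uses open-source lm() for brevity; the chunking rx equivalent would be rxLinMod().

```sql
-- Hypothetical regression: does the electric bill rise with temperature?
-- dbo.UtilityHistory and its columns are made-up names for illustration.
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
        model <- lm(bill ~ temperature, data = InputDataSet)
        OutputDataSet <- data.frame(intercept = coef(model)[["(Intercept)"]],
                                    slope     = coef(model)[["temperature"]])
    ',
    @input_data_1 = N'SELECT temperature, bill FROM dbo.UtilityHistory;'
WITH RESULT SETS ((intercept FLOAT, slope FLOAT));
```

A positive slope would confirm the pattern she describes: as the temperature goes up, so does the bill.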
Carlos: Gotcha. And for that I’m grateful, because I’ve had a couple of instances where a purchase had happened out of state and it was definitely not me.
Ginger: Yeah, I used to work near the fraud department of a very large bank, and they told me that for a while, if you had an address change to Hollywood, Florida, they would call you. Because they had so much fraud in that area that the machine learning picked up on it: “If I see one of those, it’s probably fraud.” So it’s pattern recognition.
Carlos: Very good, very good. So I guess, any last thoughts on R? Or Steve, if you want to jump in here?
Steve: Yep. You know, I had a couple of questions from the DBA perspective on R. So, if someone starts using R on the SQL Server, you mentioned that it can take up more memory or even starve SQL Server’s memory. You also mentioned PowerBI as an alternative way to run R. So if somebody’s running R in PowerBI, is that taking up memory beyond their local client machine? Or is that passing it along to SQL Server?
Ginger: Well, if you’re talking about using PowerBI you’re talking about a cloud deployment, so that kind of pushes it out there. To be honest with you, you should be able to do it with an existing gateway. I’ve not looked at memory problems with PowerBI and accessing data, but if you had an on-prem gateway I assume that it would tax your memory to get the data. But that’s all it would do, because the R is running in the cloud. From a PowerBI perspective, it would get the data and then your processing would run in the cloud, so you wouldn’t take a hit on your local on-premises SQL Server except to get the data.
Steve: Okay, great. And then one of the other things around the load and the usage there with SQL Server Enterprise there’s the resource governor features which you can control the amount of memory and CPU and things like that allocated. So do you know anything about how resource governor on SQL Server would affect the memory or CPU utilization associated with R?
Ginger: I don’t know the specifics, but I really think that R code would be a good place to put a resource governor.
Ginger: Because it’s relatively new, it’s too early to know how much memory these processes are going to take. So I see it as a good use case for it, but as far as the specifics I couldn’t address that.
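For reference, SQL Server 2016 does expose this through Resource Governor: external scripts run in an external resource pool, which can cap the memory and CPU available to the R processes. The pool name and limits below are arbitrary example values.

```sql
-- External resource pools (SQL Server 2016, Resource Governor) cap the
-- R runtime processes launched by sp_execute_external_script.
-- The pool name and percentages are arbitrary example values.
CREATE EXTERNAL RESOURCE POOL [r_pool]
WITH (MAX_CPU_PERCENT = 50,
      MAX_MEMORY_PERCENT = 30);

ALTER RESOURCE GOVERNOR RECONFIGURE;
```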
Carlos: I don’t recall, Steve, off the top of my head. Did they change the max memory setting in 2016? So I think this would also be in your area where that setting would become that much more important.
Steve: Yes, and that’s one I haven’t looked at specifically with 2016, but certainly if someone is using R you’d want to have a better understanding of how that memory is being used by different users.
Ginger: And another way of using R: one of the other things that’s a part of Revolution Analytics with SQL Server 2016 is they include an R server. And you would want to use this R server if you’re going to be processing, say, an HDFS cluster. Or if you wanted to make an OLE DB connection to, say, Oracle. Then it would run in your R server. And the reason you would want to do that, again, is to take advantage of the ability to swap things in and out of memory. That’s what you would need your R server for. So not only with SQL Server 2016 do you get R incorporated as part of SQL Server, but you also have the ability to install an R server installation for connections to data sources other than SQL Server.
Steve: Interesting. Okay.
Carlos: Very good. Ginger, let’s do SQL Family.
[SQL Family transition]
Steve: So one of the things we always ask on the show that is a favorite when it comes up. What is your favorite SQL tool? And this can really be anything that helps you with SQL Server or anything you’re doing with SQL Server.
Ginger: I really like SQL Prompt. It’s a great tool.
Carlos: That’s probably the most popular answer to that question.
Ginger: Yeah, I also like one other tool as well. That’s a good one.
Carlos: You’re talking about SSMSBoost?
Carlos: Very good. If you could change one thing about SQL Server, what would it be?
Ginger: This is on SSIS, and this is also a major beef right now with Microsoft. I respect the fact that they want to make things useful for people who are colorblind, but did they have to take the color out of everything? There’s no joy in Mudville. SSIS is now totally monochromatic. It’s so much prettier in 2012. Why does it have to look so ugly?
Ginger: And oh by the way, they use that same monochromatic “we don’t want any color here” to AzureML. Which makes it demo very badly.
Carlos: Ah, there you go. And there’s no themes you could apply there?
Ginger: You can get light blue, light, dark what have you, but… ehh.
Carlos: Bring back our colors!
Steve: Okay, so over your career, what’s the best piece of career advice that you’ve received?
Ginger: So actually this came from Steve Jones. I was at SQL Saturday in San Diego; I drove out there from Arizona with a friend of mine, and we made a weekend of it. And he said, “You know what, there’s gonna be stuff out there on the internet about you. You can either choose to contribute to it and make what they find your stuff, or you can live with what other people are willing to put out there about you.” And I thought that was pretty profound. And I put up my blog site a little bit after that.
Carlos: There you go. Kind of write your own story.
Carlos: Very cool. Ginger, our last question for you today: if you could have one superhero power, what would it be and why do you want it?
Ginger: Oh I want to fly. I have always wanted to fly. That would be so cool.
Carlos: I agree. Very cool. Well Ginger, thanks so much for being on the show today.
Ginger: Thanks for having me.
Carlos: Yes. We’ll have the show notes up on sqldatapartners.com/r.
Carlos: I was going to make some pirate joke.
Ginger: Yeah, I know that Jamie Johnson always makes pirate jokes when he does R. So hey Jamie- arrrrr!
Carlos: Thanks again for being here.
Ginger: Thank you. I really appreciate you having me.
Carlos: Okay everybody, well thanks again for joining us today. Kind of an interesting conversation and some things to consider as organizations begin to look to R to bring that in-house. Big takeaways for you Steve?
Steve: Well, it’s one of those things where I really have to jump in and give it a try. I’d really like to maybe get an online demo or something and try it out at some point. It’s a ways off from what I’m used to.
Carlos: See, I had the opposite reaction. I was thinking, “Where’s my ten-foot pole?” Right? I want to keep that as far away as possible. And one of the things that I do think you need to consider is making sure that it’s the right version in the database, using those Revolution Analytics functions that start with “rx”. And then along with that, making sure that your memory settings are right, so that you’re not going to let it take over the database. Which it very easily could with a larger data set.
Steve: Yep, absolutely and I think that leads us into our Tuning Minute for this session.
Steve: So SQL Server memory is one of those things where people talk about setting the max server memory setting, or making sure that SQL Server’s not going to take up too much memory and squeeze the operating system out. Quite often, it can be at the other end of the spectrum, where there may be other things running on your server and SQL Server isn’t given enough memory. Because of that it’s just assumed to be slow, but it might be slow because it doesn’t have enough memory. This is something I’ve come across in performance tuning work quite a bit. And you really need to get in and get an understanding of the database sizing and the overall memory footprint for your database. So take the example of a SQL Server with 8 GBs of RAM on the server, but because of other things running on there SQL Server only gets 1 gigabyte of memory. Due to other applications, it may be Exchange or other server applications that are taking up memory there.
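The setting Steve mentions is configured with sp_configure. The 6144 MB value below is just an example for an 8 GB server, sized to leave headroom for the OS and other services.

```sql
-- Cap SQL Server's memory with sp_configure. The 6144 MB value is an
-- example for an 8 GB server; size it for your own workload and OS.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 6144;
RECONFIGURE;
```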
Carlos: Even applications that are kind of in the SQL Server wheelhouse, I see commonly. Like Reporting Services or even Analysis Services is the other big culprit.
Steve: Yep, and that’s a great reason to run those applications on a different server than your database. But they often get lumped together. Imagine that you’ve got a 300 gigabyte table that you’re trying to query that requires a full table scan. How’s that going to perform if your database only has 1 gig of memory to work with?
Steve: So then, take it a little bit further than that. On that SQL Server you’ve got ten databases running, perhaps, splitting up that 1 gigabyte of memory. And if you’re lucky and it’s evenly split, that’s only about 100 MB of RAM per database. And that really isn’t much in the SQL Server world. Now throw a few dozen users at that database. And I say a few dozen because it probably won’t handle more than that with that limited amount of memory. All of those users want to query larger tables with basically no memory available, because there’s none there. Things can get really slow. So this is an example where you might be able to tune queries all day long and have no impact on their performance. The only thing you can really do for performance is add more memory or move other applications off to other servers.
Carlos: And then there is an interesting idea here. So I’m thinking some of our development servers might suffer a little bit from this. One time in an engagement, there was this test box. I think the box had been given 4 gigs of memory total. Like you said, SQL Server only had 1 gig because it was running something else, and they were having some issues there. So I said, you know what guys? You actually have some sizeable data here. Why don’t we increase the memory of the server? And they’re like, “I don’t think that’s going to do it.” Because again, it was one of those things where they had created the development server like a year before and kind of forgot about it. Data grows. And I was like, “Let’s try it. We’ll double the memory and if it makes no impact then we can switch it back.” And we did that and recycled the service, and they went back into the application and they’re like, “Whoa, that’s wicked fast.”
Steve: Yeah, and looking at that, comparing the price of that memory to the cost of the developer’s time for one or two weeks to be fighting with it, or three weeks to be fighting with it trying to make those queries run faster and not having them run fast. Memory is the cheap way to go there and oftentimes one of the best solutions.
Carlos: Well awesome. Well thanks again, compañeros, for tuning in. Of course, you can reach out to us on Twitter. I am @carloslchacon.
Steve: And I’m on Twitter @SQLEMT.
Carlos: If you have questions or things you want to talk to us about with R as well, you can hit that up on the website. The show notes for today’s episode will be at sqldatapartners.com/r.