Creating reports in Power BI is awesome, but once you are ready to publish the reports, where are you going to store all the data? In this episode, we discuss storage options for Power BI and how the version of Power BI you have can affect storage.
“You can go to the gear icon and depending on whether you’re a personal workspace or an app workspace, you can manage your personal storage or the group storage.”
“Power BI Desktop Aggregations, so basically pre-Aggregations, because it’s still in preview, it’s only supported if your detail table is in that DirectQuery mode, and then your dimensions are in this like dual storage mode, so Import and DirectQuery.”
“If you haven’t run into some of the limits of Import mode and you can’t articulate why you want to use DirectQuery, other than, “well, I think it’ll be faster”, then you really shouldn’t use DirectQuery, because as you said, you might shoot yourself in the foot.”
Listen to Learn
00:38 Intro to the team and topic
01:16 Compañero Shout-Outs
02:14 Workspaces – shared limit or individual limit?
05:19 You don’t have to worry about storage limits if you can do DirectQuery
08:15 The ‘minimum viable product’ approach
09:46 Scenarios where you might want to use Dual
12:58 There are some limitations with DirectQuery to help you avoid shooting yourself in the foot
14:50 Do you have to understand the query before implementation?
15:42 Issues with keeping the different storage options in sync
16:22 You can switch from DirectQuery to Import easily, but not the other way around
17:45 Closing Thoughts
Carlos: Compañeros! Welcome to another edition of the SQL Data Partners Podcast. I am Carlos L Chacon and this is Episode 176. I am joined today by my good friends Eugene Meidinger.
Carlos: And Kevin Feasel.
Carlos: Today our topic is Storage Options in Power BI, and so lots of new announcements around this one or changing pieces here, so again, depending on when you’re listening to this, you may want to check as the ground is literally shifting underneath our feet.
Kevin: Literally speaking.
Carlos: Literally speaking, yes. Okay, so before we get into that, I have a couple of shout-outs. First shout-out to Natalie McNamara, Elvin Kubicki, Stephen Tyndall, and Vishal Patel. Thanks everybody, for connecting and chatting with me. I do appreciate it. We, unfortunately, compañeros, this is going to be two weeks in a row, so we’re light. If you want a SQL Data Partners t-shirt, I’m going to have to start figuring out a different way to distribute these things. #sqltrail on social media and we’ll send you a t-shirt.
Kevin: I’m telling you, a t-shirt cannon. That’s the best way to do it.
Eugene: Yeah, I was thinking the same thing. You get a solid velocity and everything.
Carlos: Yes, we’ve been giving these away as our raffle at SQLSaturdays and most of those venues are a little too small for the cannon, although we’ve been very tempted to get it out, but we don’t want to cause injury to folks. Okay, our show notes for today’s episode are going to be at sqldatapartners.com/powerbi or sqldatapartners.com/176. Okay, so as we get into this storage options, there are actually a couple of different things that we can talk about, and I think first, we should probably jump into, I want to say the ‘workspace’ category. So we had, in the past, Power BI Pro license, and again, this is kind of one of those ‘depending on the version that you have, your experience will vary’, but in the past, we’ve had the Power BI Pro license, which has a maximum of 10GB of data storage that can be assigned to that license, and I think that’s regardless of workspaces, right?
Eugene: Well, I think workspaces have their own limit. Yeah, the same limit. The part that I’m not sure about is if you put in a shared workspace, does it still count towards your personal limit or not.
Carlos: So the way I read this sentence, “at the tenant level, total storage can’t exceed 10GB per Pro user across all Pro user and app workspaces in the tenant.”
Eugene: It’s probably like SharePoint where you get a certain amount of storage allocated per user or something like that.
Carlos: Right, that’s what it sounds like.
Eugene: Okay, that makes sense.
Carlos: But you mentioned that they had been changing this up a bit.
Eugene: Well, so for the Pro piece, it hasn’t really changed too much. It’s been 1GB for free and 10GB for Pro for quite a while. What changed is that they’ve increased the model sizes they’re going to support in Power BI Premium. Yeah, they had an announcement that they’re going to be able to support 4TB models, theoretically.
Carlos: That’s a lot of data.
Eugene: The thing is they’re not going to be renting out 4TB of RAM to you. From what I understand, I think it’s going to be like a paging kind of situation, so, they’ll page in and out the data that you need. But it is interesting now that Power BI Premium has parity with SQL Server Analysis Services in terms of limits and all of that.
Carlos: Oh, interesting.
Eugene: Yeah, I think they had said at PASS Summit or something, somewhere around there, that their goal is for Power BI to be a superset of Analysis Services, so it’s interesting to see them bring the Premium in line with that.
Carlos: Yeah, there you go, so step-by-step getting a little bit closer there. So yeah, I guess going back to the Pro and being able to– you have that 10GB limit, so in there if you go to the My Workspaces and that gear icon, it actually gives you the Manage Personal Storage option and you can actually see how much data is assigned to you.
Eugene: Yeah, and I’m taking a look at that right now and it looks like it may be that there’s still an issue of ownership in the sense that even if it’s in a workspace– oh no, there’s group storage and then there’s– ok. Oh no, all right, it is separate, interesting, okay. But yeah, you’re right, you can go to the gear icon and depending on whether you’re kind of a personal workspace or an app workspace, you can manage your personal storage or the group storage.
Carlos: Yeah, depending on your permissions, yeah. There you go. Okay, so that is at the workspace level and I think what’s interesting here is that traditionally the data sets have wanted to– I guess I should say, from a lot of the marketing, if you will, the idea is, “hey, you just upload the data into the Power BI report, into the pbix file. It then goes up into the Power BI Pro Service.” And a lot of folks immediately “well, let me just do live query, so that way I don’t have to worry about that limit.” So that’s definitely a possibility, potentially as the data sets do get a bit larger, or you want to start looking at older data, but I think we’ve found that it’s quite the hit to go the live query route. Would you agree?
Eugene: Yeah, no I would agree. It depends a lot on what you’re trying to do. I think in our case, some of the stuff that we were doing for a customer was, okay, they want to be able to do this complex year over year kind of logic. They have a lot of data they want to look at all at once, and so you don’t really get great filtering with that. Whereas, if it was something that was, “okay, I’m looking at a very limited slice of time,” maybe current quarter or month-to-date, something like that, and I’m using a lot of filters and slicers to say, “okay, I want black products and I want sales in EMEA and that kind of stuff,” then in theory, you can take good advantage of any indexes you might have at the data source. For what we were doing, yeah, it wasn’t a good fit and we had tried it, too, because we wanted to take advantage of Power BI Aggregations. And right now, Power BI Desktop Aggregations, so basically pre-Aggregations, because it’s still in preview, it’s only supported if your detail table is in that DirectQuery mode that you mentioned, and then your dimensions are in this like dual storage mode, so Import and DirectQuery. So yeah, we had played around with that. I think there’s situations where it makes sense, especially right now the file limit is 1GB.
Carlos: Right, or the data set limit, yeah.
Eugene: Well, I mean I think technically it might be the pbix file as a whole. I don’t know, but it used to be a quarter of a gig, then they upped it to a gig, and honestly, if your pbix file’s any bigger than a gigabyte, you’re doing something wrong. Because I mean, at a certain point you should just buy SSAS and just slap everything in there, buy Azure Analysis Services or whatever. But yeah, I think DirectQuery, there’s very specific scenarios where that’s ideal.
Carlos: Sure. Well, now that’s interesting that you say “hey, we want Power BI to have parity with SSAS” and then you’re like, “oh no, no, at some point you just need to go buy SSAS.” It kind of feels like you’re talking out of both sides of your mouth, there.
Eugene: Sure, well I guess the distinction is Power BI Premium is probably implicitly what they mean.
Carlos: Okay, fair enough.
Eugene: You know what I mean? So yeah, if you’re–
Carlos: Because I was like, “from $10 a month to Analysis Services licenses, that’s quite the jump”, but I see, Premium, thank you for the distinction.
Eugene: Yeah, if you’re willing to pay $5000 per month, yes, they’ll be able to support your giant data model; that’s not going to be an issue.
Carlos: Fair, fair.
Eugene: Well, it’s interesting, too, because I’ve talked before about how Microsoft kind of did the whole minimum viable product approach. They said, “okay, we’re going to take like the smallest, most useless product we can that someone will pay for and then iterate every month.” And one of the negative consequences of that– there’s a bunch of good consequences, I mean they’re rapidly iterating; Salesforce bought Tableau, Google bought Looker, clearly people are getting spooked– but one of the downsides is their deployment story. Their whole product story is a mess. And they haven’t quite solved that yet, because first it’s like, “okay, it’s buffet style. $10 per user per month, as much data as you want, do whatever within 10GB.” And then they’re like, “oh wow, these Enterprises don’t want to pay $10 per month for a user that reads a quarterly report. Alright, we’re going to offer you Premium starting at $5000 per month.” And they’ve started to backfill in between, so you’ve got the embedded SKU for Premium that doesn’t give you all the bells and whistles, but it’s cheaper and it’s something in between, but–
Carlos: Yeah, in the middle, uh huh.
Eugene: There’s still that disconnect of, “okay, software as a service per user, here’s these hard limits.” And then, “alright, you want all the bells and whistles, we’re going to need $5000 per month.” So, I think there is– it leads to that disconnect that you just called out where you’re like, “you’re kind of contradicting yourself” and it’s, “well, we really have two products that just happen to have the same name.”
Carlos: Okay, yeah, that’s interesting. Yeah, I mean, like anything, trying to mesh a lot of this stuff together can be tricky.
Eugene: Yeah, sure.
Carlos: So we’ve jumped in here, we’ve talked about the way that you can access that data. You’re going to store it, and so then in Power BI it’s called the Storage Mode. So in your properties of that data set, you get to select whether I want to do Import, DirectQuery, or we have something called Dual.
Eugene: Yeah, so actually, nowadays you can select it on a per-table basis. It used to be, for a while, that if you wanted to do DirectQuery, everything had to be DirectQuery, and there are situations where that doesn’t make sense. What would be ideal in a lot of cases is, “okay, we want DirectQuery whenever we’re looking at a data set that needs to be updated frequently,” or “we want DirectQuery whenever we’re at a granular level. So maybe we’re fine if the top aggregate level’s only updated every day, but whenever I start looking at specific product BL-XY2, I want to see the actual current inventory numbers,” or whatever. So now you can have kind of a composite mode where you can have Import, which means you’re pulling everything into memory and every time you refresh you have to pull the data. And then you can also have maybe that detail table– maybe your sales is nightly, but your inventory stock levels you want to have updated all the time. That’s DirectQuery, and DirectQuery is converting the DAX straight into SQL. It’s not hitting a cached copy of the data. Well, largely. DirectQuery, if I remember, still technically does some level of caching, up to, I think, around 15 minutes, and I think you can change it. But the idea is that it’s real-time-ish. So Dual, to my knowledge, there’s only one situation where you would do this. Dual means you’re basically doing both, which is kind of weird. So you’re saying, “okay, this table, I’m importing a copy, but I also want to be able to treat it as DirectQuery.” So the only reason I know of why you would do that is Power BI Aggregations, where right now– eventually you’re going to be able to do desktop aggregations on everything on Import, but right now your detail level table has to be DirectQuery, which means all of your dimensions have to be in Dual mode so they can kind of flip back and forth. So that’s kind of my understanding.
I don’t know of any other reason right now why you would do that. I guess maybe if you’re not even using aggregations, but you are using Import and DirectQuery, and in that case you may need to put all your dimensions that talk to both in Dual mode.
Carlos: Right, so Microsoft gives the example of using multiple visualizations for the same source, and so in some cases you may want the sum from an aggregation table, but then you have another visualization that is giving you data by calendar date, and that would be a scenario where potentially you’d want to do that, because depending on your visual, you may want it to behave a certain way.
Eugene: Yeah, and it used to be that there was a point in time where they had composite mode but they didn’t have desktop aggregations yet, and so you would kind of do your poor man’s aggregation where you’d have a big IF statement and you’d say, “okay, if this table isn’t filtered, then just pull from our Import pre-ag. If it is filtered, though, let’s hit the DirectQuery table” kind of thing.
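The “poor man’s aggregation” Eugene describes is a measure that routes between tables by checking the filter context. A minimal DAX sketch of that pattern, with hypothetical table and column names (an imported pre-aggregate `SalesAgg` alongside a DirectQuery detail table `SalesDetail`):

```dax
-- Hypothetical model: SalesAgg is in Import mode (pre-aggregated),
-- SalesDetail is in DirectQuery mode. If the user has filtered down
-- to a granular column, fall through to DirectQuery; otherwise serve
-- the cheap imported pre-aggregate.
Total Sales :=
IF (
    ISFILTERED ( 'Product'[ProductName] )
        || ISFILTERED ( 'Date'[Date] ),
    SUM ( SalesDetail[Amount] ),    -- granular: hit the DirectQuery table
    SUM ( SalesAgg[Amount] )        -- unfiltered: hit the import pre-ag
)
```

Desktop Aggregations now do this routing automatically, but the manual version shows what is happening under the hood.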
Carlos: Yeah. But of course, we have to be a little bit careful with some of that because we can go from trying to help ourselves to shooting ourselves in the foot pretty quickly.
Eugene: Yeah. There is one safety on that data gun. With DirectQuery, it will not do queries over a million rows. For your own safety. And in SSAS you can change that setting; in Power BI, you can’t. But basically, if you run Profiler or Extended Events and you say, “okay, what’s it doing?” you’ll see that it’ll do a TOP one million and one rows. Yeah, it’s interesting, and so if it gets a million and one rows back, it knows that, “okay, your query was too broad and too detailed, so we’re going to throw an error.” So there are some limitations that try to protect you from shooting yourself in the foot. DirectQuery has some other limitations, too. I don’t think time intelligence functions work. Those are mostly syntactic sugar, honestly. I think the DirectQuery tables that you pull can only come from one database. There may be a couple of other things, but yeah, I would only use DirectQuery if you have a clear and justifiable reason to do so.
Carlos: Clear and justifiable. I guess that could be the eye of the beholder a little bit, there.
Eugene: Well, I mean, so in my mind, that’s something that you can articulate to your boss. It’s not like legal standard, like clear and present danger or something like that. But just more that, “okay, we’re feeling a pain point” so that’s what I mean by ‘clear’ is like, “okay, Import is not working for us.” Either it’s too slow or the data model’s too big.
Carlos: Too large.
Eugene: You need to have enough of an understanding, the justifiable part, that you can send an email to your boss and say, “here’s why we should do X, Y and Z.” If you haven’t run into some of the limits of Import mode and you can’t articulate why you want to use DirectQuery, other than, “well, I think it’ll be faster”, then you really shouldn’t use DirectQuery, because as you said, you might shoot yourself in the foot.
Carlos: Gotcha. Okay, now do you have to go as far as to be able to understand the query that’s going to get run underneath before you implement DirectQuery?
Eugene: I don’t think so. I mean, certainly you can look at that. You can enable logging for your Data Gateway, in Power Query– well, no, Power Query’s different, sorry. This is a DAX layer. You can run Profiler and see what’s going on behind it. The big thing that you just need to understand is, “okay, do I have the proper indexes to support what I want to do here, and do I know at what level of detail and how much data it’s going to be accessing?” So I don’t think you have to be able to articulate what the query’s going to look like on a T-SQL basis, but you should have an idea of the shape of it, is what I would say.
Eugene: You should be able to say, “okay, is this going to hit 10 million rows?” And you’ll be able to go, “well, no, because that’s 10 years of data and we’re just looking at a month.”
Carlos: Right, right. Okay, so because we have these various options that we can now put at the table, our data model could then have a series of different storage options.
Eugene: Right, yep.
Carlos: So, issues with keeping those models in sync?
Eugene: I mean, yeah. I don’t know. It’s one of those things that feels like it’s a self-evident problem. Some of your data’s going to be updated maybe every hour or nightly or whatever, and then some of your data’s going to be real-time.
Carlos: It’s dependent on the Gateway, as an example.
Eugene: Yeah, exactly. Yeah, so I think it’s one of those things that it should be pretty self-evident what the problem is going to be, there. One other issue to be aware of: if you start in Import mode, and then later you want to go to DirectQuery, it is a ginormous pain in the butt. I had to do this for a customer. Because the thing is, you can switch from DirectQuery to Import fine. You just say, “okay, I want this to be Import”. They don’t give you the option the other direction, to say, “okay, no, no, I changed my mind. I want this to be DirectQuery.”
Eugene: Yeah, so what you have to do, it’s a pain. I’m going to butcher his name, so please forgive me, Gilbert. Gilbert Quevauvilliers has a blogpost on some of the issues with Desktop Aggregations, and part of that is the whole DirectQuery/Import mode. And so what you have to do is you have to make a new table in DirectQuery mode, even though it’s the exact same Power Query, there’s some little flag that we’re not allowed to see that says, “okay, this started out as DirectQuery.” And then you have to copy over all of the Power Query stuff, and then you’re going to copy over all of your DAX measures if you have any on there. If you have them elsewhere, like a few dummy tables that are just measure groups, you’re fine. But yeah, it took me like an hour or two to switch that over for a customer, just so we could test and see if Power BI Aggregations was going to make things faster.
Carlos: Right, wow. Yeah, so that’s a bit painful, then.
Eugene: Yeah, that is one scenario where okay, maybe you start with DirectQuery, but still I would say just– you should still be able to evaluate it and it’s doable if you need to switch it over.
Carlos: Sure. Okay, so I think ultimately the idea here with this conversation is that we want to start with Import generally. If we know that our data’s going to be too big, we can look at DirectQuery, and then in those instances where we might need a little bit of both, we have the Dual option.
Eugene: Yeah. Something we didn’t talk about because it’s not listed as a storage mode is live connections, and that’s similar to DirectQuery, but that’s hitting SSAS. And so basically, you’ve got your SSAS model somewhere and you’re just hitting it that way. So that’s also technically an option, but yeah, I think that’s a pretty good summary of things.
Carlos: Well, there you go. And with that, I think that’s going to do it for today’s episode, compañeros.
Eugene: Cool beans.
Carlos: Thanks again, everybody for tuning in. We do appreciate it and again, if you want that t-shirt, hit me up, #sqltrail on social media and I will send you a shirt. Well, I guess I should say I’m sending shirts in the United States to this point, although I’m quickly thinking about expanding that. Okay, and as always, we want to hear what you have to say, compañeros and you can hit us up on social media. Let’s see, Eugene, you want to go first?
Eugene: Yeah, you can find me on Twitter @sqlgene or on my blog sqlgene.com.
Carlos: Okay, and Kevin has gone into the last state of storage that we haven’t talked about and that is secret state, incognito state and he had to drop off. But compañeros, you can reach out to me on LinkedIn. I’m at Carlos L Chacon and I hope to see you on the SQL Trail.
Hi guys, thank you for the episode. I want to mention an extra use case for DirectQuery, and that is the aggregation of a big fact table.
It involves bringing the fact table into the underlying model in DirectQuery mode (only a preview is actually fetched), then creating a copy of it with Power Query (by using a reference), aggregating it, and then switching the aggregate table to Import mode. This way it is possible to use the aggregate table for visualizations, which is much faster than direct-querying an aggregate table, which involves a SQL aggregation command every time.