Episode 187: Power BI Performance Tuning

Make it go faster! Visualizations are great, but when users start complaining about the load times on reports, it can be a tough exercise to figure out the issues. In this episode, we discuss ways you can measure the speed of your Power BI reports. We also get into some of the fundamentals of data storage for Power BI.

Eugene’s Performance Tuning blog post

Microsoft’s Performance Tuning document

Episode Quotes

“You can get away with having a lot of data if it’s in a shape that works with the VertiPaq engine, the DAX engine, in a way that aligns with what Power BI is optimized for.”

“The less data it has to work with, generally speaking, unless you have some complex filtering going on, that’s going to be faster.”

“We have to work on our data model a little bit. There could be some work that needs to be done there before we worry about what the visualizations look like.”

Listen to Learn

00:38     Intro to the team and topic
01:21     Compañero Shout-Outs
02:20     Sqlgene is #1, above Microsoft, in the search “performance tuning power bi”
03:20     This whole conversation is predicated on your using Import Mode
06:07     How you limit the data that comes into your report
09:29     The formula engine and the storage engine
12:24     We need to work on our data model first, even when it’s not immediately gratifying
13:25     Eugene’s rule of thumb of number of “things” on a page
15:37     Organizational ideas for keeping the amount of data represented lower
17:01     Verify that query folding is working – it’s really effective
21:38     Tools Eugene uses for measuring
24:58     Closing Thoughts

*Untranscribed Introduction*

Carlos:            Compañeros! Welcome to another edition of the SQL Data Partners Podcast. This is Episode 187. I am Carlos L Chacon, your host, and I am joined today by Eugene Meidinger.

Eugene:        Hey everybody.

Carlos:            Kevin Feasel is out today, and Eugene says he’s feeling a little under the weather. I don’t know that I can tell the difference, necessarily.

Eugene:          That’s good, because I’m hoping to do some Pluralsight recording today, so, hopefully my voice is– except it’s just going to be like half an octave lower, which probably makes me more attractive, so it all works out.

Carlos:            There we go. Okay, well, we’re glad that you could be with us today, Eugene. Ultimately, we are talking about Power BI Performance Tuning, and so we have some ideas around that. Before we get into our topic, we do have a couple of compañeros shout-outs, and as always, compañeros, you’ll forgive me if I get these wrong. So, Uche Abarah and Naveen Srinivas, thanks guys, for reaching out and connecting. I do appreciate it. It’s always nice to chat with people. You know, I feel like we should be keeping up, like there’s plenty of SQL Server things coming out but we’re going to punt that to another episode. There’s lots of pieces actually in the cloud that I think are going to make relational instances a lot more interesting. And the data lake area is getting a lot of love right now, so it’ll be interesting to see how things change for us in the next couple of years.

Eugene:          Yeah, definitely.

Carlos:            So, I guess that’s kind of a teaser to some of the things I’ve been looking at, I suppose. But as always, compañeros, the show notes for today’s episode will be at sqldatapartners.com/powerbi or at sqldatapartners.com/187. Now, as we mentioned, Power BI Performance Tuning is our topic, and I did a little searching on the internets, on the interwebs, beforehand, and lo and behold, the very first ranking, let’s see, now let me make sure.

Eugene:          Oh, am I at the top? Oh, my goodness.

Carlos:            Is this on Google? This is on Google.

Eugene:          I gotta do Incognito Mode and make sure it’s not messing with the rankings.

Carlos:            There you go, Performance Tuning Power BI is what I searched for. The very first result that comes up is–

Eugene:          Yes! I’m above Microsoft! Oh, my goodness!

Carlos:            That’s right.

Eugene:          Take that, Satya, take that! Oh, that’s awesome. That’s beautiful. That’s made my day.

Carlos:            Yeah, it is sqlgene’s link, here.

Eugene:          Yeah, absolutely. Ooo,

Carlos:            So, I guess we can dive into that, obviously you have some ideas on this.

Eugene:          We don’t know if they’re good ones, but I’ve written a bunch of words.

Carlos:            I will also include, there is the link from Microsoft and then BlueGranite also has one. I guess I’m going to start off with one that– so it seems fairly straightforward, but less is more.

Eugene:          Right.

Carlos:            Particularly when it comes to performance. So, one of the questions, or the thoughts was, okay, Power BI is ultimately a visualization tool, or self-service. One of the things that we want to give the users is the ability to, you know, in SSRS, it would be like parameterizing some of these reports. So, you could think of that in filters or slicing, or however you decide to implement that, but be able to take a look at data and then look at that in different segments based on however they want, and I don’t need to change my report every time they decide they want something else. I can add this lookup and then it can kind of flow into the report and be available. Now, that’s all great and whatnot, the idea, the premise there, however, that then means that I’m looking at quite a bit more data. So how then do I go about limiting the amount of data that actually comes into my report?

Eugene:          Yeah, so, obviously we’ve run into this issue with customers, where they start out with a report and they say, “okay, we just need three years of data.” And then they show the dashboard and customer’s like, “actually, we need 10 years of data, and we want to add these dimensions,” and all of this kind of stuff.

Carlos:            Yeah, it generally never gets smaller.

Eugene:          Right, they’re never like, “oh, we didn’t actually need that.”

Carlos:            Yeah, yeah.

Eugene:          So, it’s interesting, because I think there’s two or three different ways that we should be thinking of data size, and I think normally we just think in terms of row count, or maybe even just raw kilobyte volume, what have you. And there’s some nuance and some subtlety here. You can get away with having a lot of data if it’s in a shape that works with the VertiPaq engine, the DAX engine, in a way that aligns with what Power BI is optimized for. You can have millions and millions and millions of rows and it’s not inherently going to be slow just because you have tens of millions of rows of data.

Carlos:            Well, I will caveat slightly there: with the assumption that you’ve done Import Mode.

Eugene:          Yes.

Carlos:            Fair?

Eugene:          Yeah, we’re talking about Import Mode, correct.

Carlos:            That’s right.

Eugene:          Yep.

Carlos:            So that is a small distinction. So, if you think you’re going to get that same performance with Direct Query–

Eugene:          Oh, no, no.

Carlos:            Yeah, you’re–

Eugene:          Well, and in fact, like it’ll throttle– well, not necessarily throttle you, but if you accidentally run a query through Power BI in Direct Query Mode that returns more than a million rows, it’ll just say, “no, no, no, you’ve screwed up. I’m not even going to do it.”

Carlos:            Yeah, yeah. Error. Cannot process.

Eugene:          Right. There’s some minor things you can do to optimize for Direct Query, but yeah, this whole conversation is predicated on your using Import Mode, which is the default and the most common. So yeah, it’s complicated, so the first thing you could do is certainly, you could try and just limit the sheer volume of rows that you’re including, kind of vertically limit your data, and a lot of times that’s going to come down to limiting by time, your time horizon. So, often it’s as simple as, do you really need 5 years of data, 10 years of data? Can you do a rolling filter? Can you just pull in the last 90 days, what have you. And that’s usually the simplest, conceptually, to work with.
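
For reference, the rolling import filter Eugene describes might look something like this in Power Query. It is a minimal M sketch rather than anything from the episode; the server, database, table, and TransactionDate column names are assumptions, and it presumes the column is date-typed.

    let
        // Connect to the source table (names here are illustrative)
        Source = Sql.Database("MyServer", "MyDatabase"),
        Sales = Source{[Schema = "dbo", Item = "Sales"]}[Data],

        // Only pull the last 90 days instead of the full history
        Cutoff = Date.AddDays(Date.From(DateTime.LocalNow()), -90),
        Last90Days = Table.SelectRows(Sales, each [TransactionDate] >= Cutoff)
    in
        Last90Days

Because the filter runs at refresh time, the older rows never land in the model at all, which is different from just filtering them out of the visuals afterwards.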

Carlos:            And that’s probably fair, I guess, with that, you mentioned the last 90 days. So, you can have 10 years worth of data, but filter the report when it comes up, to only look at the last 90.

Eugene:          Yeah, you can do that. I meant more just like what you import initially.

Carlos:            Oh, gotcha.

Eugene:          Because, the more data you have, even if it’s filtered out, you’re still potentially going to see some performance impacts, if nothing else, from compression and that sort of thing.

Carlos:            Okay, gotcha.

Eugene:          But yeah, it does help if you can– the less you show on-screen, the faster it’s going to be.

Carlos:            Right, it will render faster. That’s right.

Eugene:          Yeah, yeah, the fastest report is just a blank page. But the less data it has to work with, generally speaking, unless you have some complex filtering going on, that’s going to be faster. So yeah, having fewer rows. Another thing is having fewer tables or fewer columns, especially. The more columns you add, the worse the compression. And the reason for this is that it uses a type of compression called run-length encoding. There’s some other ones, but this is the most important one. And that’s basically, let’s say I took 2010 census data, and one of the values was State. Well, there’s what, I think 20 million people that live in Pennsylvania? And so, if I sorted all the data by state first, instead of having Pennsylvania repeated in my data 20 million times, I could basically store it as “Pennsylvania,20million”. And that works really, really great for the first few columns that are sorted. And Power BI’s doing this underneath the hood, you’re not telling it what sort order to use or anything. I believe with SSAS, you can hint at it. I think you can also hint at it by pre-sorting your data, if you want. But with Power BI, it’s making all of those decisions. So, the first few that you sort by, those are going to be in kind of a linear order and that’s going to work well, but by the time you get to the last few columns, they’re going to be super fragmented, because you had to sort by other ones, first. You know, so again, as an example, if, for whatever reason, I don’t know, let’s say that you sorted that census data by assigned gender, last name, and then finally by state. Well, there’s Smiths all over the United States, so you’re not going to get a run of 20 million Pennsylvanias. You’re going to get a run of ten thousand, and then another ten thousand later, and another ten thousand later, and that sort of thing.
So, you want to get rid of columns that you’re not using. You also want to avoid very unique columns, unless you absolutely need them. So, no primary keys, no GUIDs, no social security numbers. You’re going to need some for relationships, but the problem is, you can think about it, if you try to do run-length encoding on social security number, which is, in theory supposed to be unique, in practice not really, then for the vast majority of your entries, you’re going to have a run of length one over and over again, so you’re not going to get any compression out of that. Yeah, so you want to get rid of columns you don’t need and especially columns that are very unique. So that’s the second way of thinking about kind of reducing your data horizontally.
But then, really, a benefit that’s– it’s hard to explain, but something you want to think about is you want your logic to be simple enough that the VertiPaq engine can do scans and basic math to get your answer, because it’s going to be able to use a different component that’s a lot more efficient. So, there’s two pieces underneath; there’s the formula engine and the storage engine, they’re called. The formula engine is kind of like a scripting language in the sense that it can do nearly anything, but it’s very slow. Whereas, you can think of the storage engine kind of like C or assembly language or something, where it’s much more constrained, but it’s very, very fast and very efficient. And generally speaking, the storage engine, it’s limited to basic math, scans, a couple other things.
So, if you’re doing something like simple sums or averages or what have you, it can just scan through that data super fast, get you the answer very quickly. But the moment that you start adding conditionals or switches because you want to be able to change the currency from USD to local and all this stuff, then–

Carlos:            Sure, i.e. case statements or things like that.

Eugene:          Yeah, yeah, again, from personal experience. And if you use a tool called DAX Studio, you can actually see how much of that time’s going to the formula engine versus the storage engine stuff. So, this is where I kind of hammer on, just because you have more data doesn’t necessarily mean it’s going to be a lot slower, if you can shape your data in a way that’s going to take advantage of this. So, this is one of the reasons why you’ll hear people hammer on using star schema when it comes to Power BI. Some of that is organizational and logical and UX, just understanding better that whenever the filters only have one direction to flow, you’re not going to need to worry about loops, you’re not going to need to worry about confusion. So, star schema’s simple from a user perspective, but it’s also faster, because generally speaking, star schema leads you towards, “I’ve got a single transaction table, my measure’s based on a simple aggregate on that transaction table, plus some basic filtering.” DAX and the VertiPaq engine are really, really fast at performing those types of tasks. So, you want to shape your data in such a way that you reduce the number of unique columns, you get it into star schema, and your measures, most of the time, consist of simple filtering through relationships and simple aggregates. The storage engine is fine if you say, “okay, I’m going to filter by date. I’m going to filter by customer. I’m going to do something where I have a dimension table that’s tied to my transaction table. I’m going to filter the dimension table, and that filter’s going to flow down to the transaction.” That’s still really fast. Whenever you have to do weird stuff like, “I’m going to copy the filter from this one table to this other table manually,” then you start to run into issues, but if you can get your data in the right shape, you can handle millions and millions of rows and still have good performance.
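
For a sense of the difference Eugene is describing, here is a minimal DAX sketch; the Sales and Geography tables, the column names, and the currency rates are assumptions for illustration rather than anything from the episode. The first two measures are the kind of simple aggregate and relationship-driven filtering the storage engine can answer with fast scans, while the third pushes per-row conditional work toward the formula engine.

    -- A plain aggregate over the fact table: storage-engine friendly
    Total Sales = SUM ( Sales[Amount] )

    -- A filter on a dimension column that flows through the relationship: still fast
    PA Sales = CALCULATE ( [Total Sales], Geography[State] = "Pennsylvania" )

    -- Per-row conditional logic (illustrative rates): more time in the formula engine
    Total Sales Local =
        SUMX (
            Sales,
            Sales[Amount]
                * SWITCH ( Sales[Currency], "USD", 1, "EUR", 1.1, "GBP", 1.3, 1 )
        )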

Carlos:            Yeah, so I think this is back to the nod of, or the idea that we have to work on our data model a little bit, that there could be some work that needs to be done there before we worry about what the visualizations look like.

Eugene:          Yeah, it’s hard, too, because the thing is, it’s that iceberg problem. The value-add there is diffuse and hard to see and kind of delayed. If I create a visual, the user or the customer can immediately see the value there. If I design a poor data model, what will happen is what happened with one of the customers we had where it wasn’t until months later when the scope kept expanding that we felt the pain. And so, it’s– yeah, it’s important and it’s hard to make that a priority, because it’s so much easier to be like, “hey, look at this visual I made,” as opposed to, “hey, I architected in this way that you’re not going to feel pain in 6 years when your data size doubles.”

Carlos:            Sure. So, we’ve talked a bit about the data and the model, if you will, thinking about the VertiPaq engine and whatnot. But what about the visualizations themselves? It’s fun to start dragging and dropping charts on there. Any thoughts around the number or the types of visualizations that we should have on a report?

Eugene:          Yeah, so, my rule of thumb, and this isn’t based on any kind of empirical data is–

Carlos:            Oh boy. We’re going rogue here, compañeros.

Eugene:          I’m just shooting from the gut. No, I mean my rule of thumb is you don’t want to have more than 20 things on a page.

Carlos:            Things?

Eugene:          Things, so that includes like shapes and text boxes and all of that kind of stuff. So, you should be able to pull up the selection pane and count how many things are there. Obviously, visuals do take more time, but it’s just that, from my experience, once you start getting into, say, like 40 or 50 objects on the screen, it really starts to bog down quite a bit. A lot of the simpler aggregate visualizations do better, which makes sense, given what we talked about with the engine. So, if you’ve got your bar charts, it’s doing simple aggregates, you’re fine. But if you’ve got a table and it’s trying to show– whenever it shows ten thousand rows, if you look at the code that’s actually being run by it, using DAX Studio or the Performance Analyzer, you’ll see a lot of times it’s pre-filtering to just 500, but still. If you have a table that’s trying to render ten thousand and then paging the first 500, that’s going to be slower. I’ve run into an interesting issue with a customer where they wanted to use, like, I think it was the Power KPI Matrix, and that’s a very detail-heavy matrix. So, it’s got a line chart, it’s got these different measures, it’s got indicators for how that compares to the target, and if you’ve got 12 measures, it’s great, it’s beautiful. But they were like, “well, we want to have every combination of all these different dimensions. So, we want to be able to see how we’re performing for each region, for each– not customer, but whatever.” And whenever you try to shove 500 rows in there, it’s very slow. So, if you’re dealing with some of the more complex custom visuals, you can potentially see some performance issues there.

Carlos:            So along with that, then, so how do you then, I don’t know, because again, from the user perspective, it makes a lot of sense, like, “hey, I am now at this page and I want to see all the combinations.” So then, just from a layout perspective, or kind of maybe guiding the user experience, is there a better way to say like, “okay, well, here on this page we’re going to see this. Oh, you want to filter by region? That means you have to go over to this report, and I can figure out a navigation or a way to get you back and forth between some of those pieces.” So that way I am behind the scenes filtering all the different regions, but in essence, I’m seeing, like if I have five regions, I’m ultimately seeing five different reports.

Eugene:          Yeah, I mean, as we talked about before, the less data that’s trying to be represented on screen, the better your performance is generally going to be. And so, ways that you can accomplish that, I’m really a big fan of contextual filtering. And so what I mean by that is, an example is report tooltips. You hover over a bar and it gives you more detail. Well, that detail is being filtered by whatever the bar is being filtered by. So, you’re just looking at that specific region, or what have you, or drill-throughs, or, like you said, having a different page that you go to, or even just slicers on the page or something like that. There’s a bunch of different ways to accomplish it, but yeah, if you can avoid having pages that show all of the data that aren’t simple aggregate measures, you’re going to see some performance improvement, there.

Carlos:            Yeah. So, we talked a little bit about visualizations, we’ve talked about some of the data, we’ve talked about, using the compression pieces. But one of the ideas you have here is, “verify that query folding is working.”

Eugene:          Oh yeah, that’s a big deal. Well, it’s worth saying, the query folding applies to Power Query and M, which means that it only applies to the load step for import. And that’s one of the important things whenever you’re– performance tuning is understanding where’s the problem. Is it in the load, is it in the model or is it in the visualization? And the thing is, if your users are saying a report is slow, 95% of the time it has nothing to do with the import, because they’re not doing the loading. They’re not doing the data load. That’s going to be something that either you’re doing manually while you’re developing a report, or you have a scheduled refresh, but it’s still important to deal with. I have a customer I’ve worked with where, oh my goodness, he does work, not for telecoms, for companies that deal with telecoms. So, you know, you’re Amazon and all of your employees have like a Verizon account or something like that, and he helps identify, “okay, this person was making a bunch of calls to India and they didn’t have an international plan, so it cost like $600.” And his data is just gigabytes of CSV files and Excel files, cause he’s a one-man show, he’s not going to set up SQL Server. And so, getting the import speed on that to be faster is important. Now, I don’t think query folding applies here, but I say all that because usually, this isn’t a user-facing problem, but query folding is really interesting and really powerful. So, Microsoft has done a great job of saying, “okay, let’s say that I’m filtering products by color. I want to look at all the red products.” Well, I’ve got two options. I could import all of the products in the products table and then filter in memory with Power Query. Or, if my data’s in SQL Server or certain other types of services, I can append in my query a WHERE clause that says, “WHERE color = red” and it turns out that if your database is managed by a competent DBA–

Carlos:            Oh boy.

Eugene:          Then hopefully you have indexes on that table, and so you can get even better performance, or what have you. Or even then, it’s just generally speaking, the machine running SQL’s going to be a lot beefier, a lot more powerful than the machine that’s doing the import. So, query folding is really, really effective, and in fact, sometimes it can be too effective. A demo that I love to do about data source privacy levels is, let’s say that you have an Excel file. You put in a table and you put in the color red and then you load the Excel file and you JOIN it to the SQL table in Power Query, so as far as Power Query is concerned, you’re doing an actual Merge Join kind of thing. Well, Power Query is smart enough that it’ll take that value from the Excel file and push it back into the query to SQL. Which, in my mind, is amazing.

Carlos:            Yeah, that’s right.

Eugene:          Like it’s one thing to fold steps that are actually in the M code, it’s another thing to fold data from a different data source into a SQL query. The reason that can be a problem is if instead of color, you have Social Security number or credit card number or something, well, now you’ve just exported sensitive information to a different data source, potentially.

Carlos:            Gotcha.

Eugene:          Yeah, again, it’s kind of a contrived scenario, but there are situations where privacy levels are really important, and that disables query folding, essentially, depending on the source mixing. So, the important thing, going back to kind of where you started with this, what’s really important is you want to go into Power Query and you want to look at each step and right click and say, “view native query,” and if it’s grayed out, 90% of the time that means that you’ve broken query folding. And not in the sense that you’ve broken it for the whole thing, but at that step, it’s no longer able to query fold, and a lot of times, that’s because there’s some functionality you’re taking advantage of that’s not available natively in SQL. An example is there’s like a sentence case capitalization, and there’s no sentence case function in SQL Server to date, and so it can’t query fold that. So, what you want to do is take all the steps that can be folded, so filters, removing columns, changing names, and put those as far up at the beginning of all the steps as possible, so those still happen before you get to the point where query folding doesn’t work anymore. That’s a huge performance improvement, and one that most people don’t have to think about. Like, I love the fact that you’ve got business users working with this stuff, and they can see performance improvements without doing a single thing differently. It’s really powerful.
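
To make that step ordering concrete, here is a minimal Power Query sketch; the server, table, and column names are assumptions. The filter, column removal, and rename typically fold into the generated SQL, which you can confirm by right-clicking a step and choosing View Native Query, while a transformation like Text.Proper has no native SQL Server equivalent, so it is kept last.

    let
        Source = Sql.Database("MyServer", "MyDatabase"),
        Products = Source{[Schema = "dbo", Item = "Products"]}[Data],

        // These steps typically fold: View Native Query shows them as a WHERE clause,
        // a narrower column list, and an aliased column in the generated SQL
        RedOnly = Table.SelectRows(Products, each [Color] = "Red"),
        Trimmed = Table.RemoveColumns(RedOnly, {"RowGuid", "ModifiedDate"}),
        Renamed = Table.RenameColumns(Trimmed, {{"EnglishProductName", "Product"}}),

        // Text.Proper has no SQL Server equivalent, so folding stops here;
        // steps like this belong at the end, after everything that can fold
        ProperCased = Table.TransformColumns(Renamed, {{"Product", Text.Proper, type text}})
    in
        ProperCased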

Carlos:            It’s almost an order of operations kind of idea, yep. Yeah, so in terms of measuring, you talked about a couple of different things already, but I guess it’s probably worth reviewing a couple of those.

Eugene:          Sure.

Carlos:            Yeah, I guess you want to hit some of the highlights in terms of tools that you use to go and measure some of these pieces?

Eugene:          Yeah, so the two things that I use the most are, I mentioned DAX Studio, which I feel like is the SSMS of DAX, but better, or at least prettier, I’ll say that. So, I mean this as a positive; some of you may be, you know, frenemies with SQL Server Management Studio and I understand, but it is amazingly detailed. It allows you to monitor what queries are hitting your Power BI Desktop file. I believe it lets you see if you’re hitting the cache. It allows you to clear out the cache so you can get consistent tests. It allows you to manually re-run DAX queries against that model from Power BI or SSAS, which is awesome because you can make small modifications and do kind of, I want to call it A/B testing, but really, my last boss would have called it Poke and Hope kind of thing. So, that’s a phenomenal tool. And then one that’s built into Power BI Desktop that came out, I want to say at the beginning of this year, is the Power BI Performance Analyzer, and it’s awesome, because you turn it on, you run it, and it tracks every render, and so for every visual it will say, “okay, here’s the time spent rendering, here’s the time spent with DAX, and here’s the time spent in Other.” And Other’s usually waiting on other stuff, because last I read, the rendering process is single-threaded. And the great thing is you can then go to that DAX and say, “okay, well, what query did it run?” And it’ll give you the query and then you copy it into DAX Studio, and you run it and you test it and you make changes. So, those are the two big things. I know with M, I forget the name of the function, it’s like trace.diagnostic or something like that, but there is a way to kind of write to the log how long each of the different steps are taking. I haven’t gotten to the point where I’ve needed to do that, because again, generally speaking, the refresh is not user-facing, and so, usually that’s not the thing that’s on fire. And so when it comes to the user-facing stuff, the Performance Analyzer and DAX Studio are going to be your best friends, cause otherwise, you’re just lost in the woods. Like it’s very, very painful to try and do performance tuning without those tools.
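
For context, the query that Performance Analyzer copies out for a visual is just a DAX query, so you can paste it into DAX Studio, turn on Server Timings, and re-run it as you experiment. A simplified example of the kind of query a bar chart might generate is below; the Date table, Calendar Year column, and Total Sales measure are assumptions, and real captured queries also carry whatever filters and slicers apply to the visual.

    -- Roughly what a "Total Sales by year" visual asks the model for
    EVALUATE
    SUMMARIZECOLUMNS (
        'Date'[Calendar Year],
        "Total Sales", [Total Sales]
    )
    ORDER BY 'Date'[Calendar Year]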

Carlos:            Right. Okay. And again, compañeros, just like everything, your experience will vary depending on any number of factors. So, these are some of the things I think that, we’ve run into or dealt with and we’re always interested in getting your thoughts or feedback. In fact, I’m taking a peek here at some of the comments on your site, and one of them deals with time issues, and there’s some–

Eugene:          Yeah. The link’s not on there, but someone actually translated this article into German, they found it so helpful.

Carlos:            There you go, hey. Very nice. Okay, well, I think that’s going to do it for today’s episode. Yeah, thanks, Eugene, for the conversation. And as always, compañeros, we are interested in having you reach out to us, let us know what you’re thinking, and in fact, the folks up in Boston at the SQLSaturday said that they would like to see us up there in March. So, it always, or generally it has in the past, conflicted with the Richmond event, which I am partial to, seeing as that is my home city.

Eugene:          Lazy, yeah.

Carlos:            But if that is no– yeah, that’s right. But if that’s not the case, then I will try to make it up to the Bostonians this next March, so we’ll see. And I know, Eugene, that was something I saw on Twitter that you were trying to plan where you were going to go next year.

Eugene:          Yeah, I mean, so the thing is, this first year of doing like consulting, I signed up for more things than maybe I should have.

Carlos:            Oh yeah, it’s not hard.

Eugene:          Like I had a great time presenting at the Richmond SQL User Group, but maybe the 12 hours driving wasn’t a good use of my time my first year of consulting, right? Hopefully I don’t offend anyone in Richmond. So, this year I’m like, “okay, well, how many SQLSaturdays do I actually want to go to?” And the problem is, there’s like 8 of them within a 6-hour radius. Let’s see, (counting) there’s 8 that I’ve gone to, and then I found out, like, I did a radius map and it’s like, “oh, Detroit’s within that range, too.” And it’s like, “oh my goodness.” So yeah, I’m trying to figure out what’s a healthy number. Not like a Kevin Feasel or a Tracy Boggiano number, but what’s a healthy number of SQLSaturdays to do.

Carlos:            Yeah, a more realistic, mere mortal number.

Eugene:          Yeah, not every other week. Yep.

Carlos:            It must be something in the water down there, you know?

Eugene:          I think so, yeah.

Carlos:            Anyway, okay, compañeros, you can always reach out to us on social media. Eugene?

Eugene:          Yeah, you can find me at sqlgene.com or on Twitter @sqlgene.

Carlos:            And I am most likely on LinkedIn at Carlos L Chacon and thanks, compañeros, for listening in, we do appreciate it. We are coming up on Episode 200, all of a sudden, pretty close. Like, we’re getting to the 190’s here, so if you have thoughts or ideas on what we should do for Episode 200, let us know, as we’re starting to– well, I guess I need to start thinking more about that. Thanks for tuning in, again, compañeros, and we’ll see you on the SQL Trail.
