Power BI Desktop privacy levels whitepaper
“Data Privacy levels affect whether query folding happens with your data sources, and that’s it. It doesn’t affect who can see what, it doesn’t affect security, it doesn’t affect anything else. All that it affects is query folding.”
“It’s not going to fold organizational information into your queries towards a public source and it’s not going to fold private information into queries towards organizational or public sources.”
“My baseline recommendation is, if it’s anything internal, mark it as organizational and if it’s anything external, mark it as public.”
Listen to Learn
00:38 Intro to the team & topic
01:08 Compañero Shout-Outs & recording from Mala
03:59 SQL Trail Information & request for opinions
05:25 SQL Server in the News
06:37 Data privacy levels only affect query folding
08:24 Explanation of what query folding is – hint: it’s not KonMari
10:52 Sometimes query folding is too smart
12:41 The three tiers of privacy levels
13:20 The “hey, ignore all of this” gateway – potential performance improvement
15:24 Eugene’s advice on how to mark your data
17:09 Read Microsoft’s security whitepaper
17:55 You have to set levels yourself or ‘none’ is the default setting
18:32 Kevin’s 10 second summary of this episode
19:04 Closing Thoughts
Music for SQL Server in the News by Mansardian
Carlos: Compañeros, welcome to another edition of the SQL Data Partners Podcast. I am Carlos L Chacon, your host. I am joined today by Kevin Feasel.
Carlos: And Eugene Meidinger.
Carlos: This is Episode 168 and thanks everybody for being here, today. Today we’re talking about Data Privacy Settings in Power BI. So that’s what we’re going to do and you might get out a little early today, compañeros. It remains to be seen, but we’ll see. As always, before we get into the episode, we do have a couple of shout-outs we want to give out. First, to Brett Burnam, Jonathon Lipke, Lumina Gatbonton, and Chavi Arora, and as always, compañeros, you’ll forgive me for butchering your last names. Someone with the last name of Chacon, I guess I’m just used to it now, and so I am one of you. And we don’t get this very often, hint, hint, compañeros, at sqldatapartners.com/podcast. However, you can record a message that we will play here on the podcast. And we have just a message and so we’d like to go ahead and play that now.
Mala: Hey guys, this is Mala. Carlos know who I am. Been a big fan of your podcast and just want to say thank you for everything you’re doing. I really like most of the shows that have been in. I just had a suggestion for a new one. Maybe you can talk about different data types and the right type of database technology to use. It’s a question we have to face at work many times and it’s getting harder for those of us who work exclusively on SQL Server, so maybe Kevin Feasel would be somebody you can talk to in this regard. Just different database technologies and what is the appropriate choice. I would love to hear something like that. Thank you again and look forward to more shows. Bye.
Carlos: So Mala, first, thanks for using the recording. It’s good to hear from you. Glad to know that you enjoy the podcast and thank you for the suggestion. Great suggestion, I’m not sure how much information Kevin will be able to give us, but we’ll–
Kevin: All of it.
Carlos: We’ll try. All of it.
Eugene: He can certainly fill the air. We know that. There’s no dead time.
Kevin: Whether anything will actually be correct or interesting, eh, different story.
Carlos: Yeah, different story. So you are welcome to check out Episode 47 and of course if you have other specific questions, we’d be happy to circle back around to that. But yeah, thanks for using that little voice feature. We appreciate it.
Eugene: Oh, I’ve got a quick shout-out. Speaking of Mala, Mala’s looking into Power BI Embedded and I’d like to give a shout-out to Katie Novotny. I’m hoping I’m getting her–
Carlos: For shame, for shame.
Eugene: I know. But she’s here in the Pittsburgh area and she’s presented on Power BI Embedded and so she was able to give some helpful information to Mala on Twitter about where to get started with that. So shout-out to Katie.
Carlos: Very nice.
Kevin: Great, I’m not going to say any last names in this episode.
Eugene: Well, we can’t all be like Cher, Kevin.
Kevin: That’s fair.
Eugene: He just goes by Kevin.
Carlos: Okay, so a quick tidbit or update on SQL Trail. I know last episode I promised the website would be up and that may or may not be true. As we record this episode, it’s a little bit more slow-going than I would like. A couple of things that I’m trying to juggle all at the same time. One of the– I don’t know if shifts is the right word. As we look to try to find the right mix of things, last year we introduced a full-day workshop and while the content was very, very good, from a flow perspective, there were some things we could improve. What we’re looking to do this year is to have some smaller workshops, so maybe two or three hour-type workshops. And one of the things we’re looking for, compañeros, is ideas or things or topics or technologies or focuses that you’d like us to include that week. So, a couple of ideas that we have, that have been floating around, BIML, doing a workshop there. Power BI, there’s a couple of different options there. Obviously we haven’t quite decided what we’re going to do, and admittedly, we’re waiting to see about Eugene’s availability, hint, hint.
Eugene: No pressure.
Carlos: Yeah, no pressure. And containers is another one. But if you have other ideas or things that you think you’d like to add, please let us know and yeah, we’ll start reaching out to folks and see if we can make that happen. The website is at sqltrail.com.
And now for a little SQL Server in the News. It came out just a couple weeks ago, well, again depending on this recording, it’s been a couple weeks ago. So Azure SQL Managed Instance now has a new tier, so the Business-Critical Tier is generally available. It had been extended, although Managed SQL Instance had been available, now what you get is this business tier, which ultimately means that you get more SSD options and they’re going to give you basically what’s a standard version, so a read-only replica of an availability group as part of that service. That doesn’t mean that you can’t do other availability groups and spread that out, but as far as the base package, you get that out of the box. So particularly as SQL Server 2008 reaches End Of Life here in July, I think more and more folks are going to start taking a look at Managed Instance and I don’t think it’s by coincidence that they’re trying to make this available before that happens, because I think for a lot of folks, making the leap to that service can make a lot of sense.
Okay, so today’s episode will be available on sqldatapartners.com/powerbi or sqldatapartners.com/168. Okay, so to kick this off, in our previous episode, in our Data Visualizations episode, we talked about visualizations phoning home. What information am I sending out? As we talk about security levels or privacy settings, ultimately there’s a setting in Power BI that allows you to configure, again, the privacy level of a data source. Yeah, Eugene, you want to kick us off?
Eugene: Yeah, so when you’re dealing with data sources and Power Query to manipulate your data in Power BI, a lot of times you’ll be asked to specify a privacy level for a type of data source. So that may be a specific SQL Server, it may be a folder on your file system, it may be a web service or something like that. You may see this and think that there’s a lot of complexity, a lot of implications. “Okay, do I want organizational or private or public?” but in reality, data privacy levels are very, very narrowly defined, narrowly implicated, however you want to describe it. They have a very simple cause and effect. So the cause and effect is data privacy levels affect whether query folding happens with your data sources, and that’s it. It doesn’t affect who can see what, it doesn’t affect security, it doesn’t affect anything else. It doesn’t have any implication once the data hits your data model or the service or any of that stuff. All that it affects is query folding. So do you want me to explain query folding or do you have any questions so far, Carlos?
Carlos: So this is when you need to tidy up your queries and get them into your dresser so your mom doesn’t bother you.
Eugene: Yeah, well, so Marie Kondo is a big fan of folding. She doesn’t like it whenever you roll up your socks and I disagree with that. I don’t have any issue. I always lump them together, so yes, if you want to Marie Kondo your data, you need to have good query folding. So query folding, it is one of the ways that Power Query can get really good performance without the user having to do anything different, and so I think this is one of the secret weapons of Power Query. Query folding is basically, the M language is super lazy in a number of ways. One, it does lazy evaluation. It’s literally called lazy evaluation, which just means that it only pulls data whenever it’s needed by the end result. And so I’ve done demos where I’ll make a fake stored procedure that takes 20 seconds to run. I’ll literally type and wait for 20 seconds and then it returns the number 2. And then, if later on, I’m pulling data from that stored procedure and I decide, “you know what? I don’t need that column,” or “I don’t need that information” and say, “I decide that my query is going to return 3 instead,” it’ll stop calling that stored procedure, because it knows it doesn’t need the information from that anymore.
Carlos: Oh, interesting.
Eugene: Yeah, so that’s the lazy evaluation piece, but M gets even more lazy with query folding, where basically M says, “why am I doing all of this work? Why don’t I push some of this back to the SQL database?” And so let’s say you filter on a product color. You say, “I only want black products.” As long as you have things configured correctly, Power Query will say, “well, why don’t I push that color ‘black’ back to my WHERE clause for SQL?” So there’s a lot of different operations that can fold back to the data source. It works for a lot of relational engines, even other things, like if you’re querying Active Directory, potentially, and if you’re making a custom data connector, you can write in your own query folding, so if you want it to talk to a web API or something like that. So query folding is really, really powerful and really great, because a lot of stuff just goes back to the data source and SQL Server, hopefully, if you’ve got a DBA who’s doing their job, has indexes and structures to support faster queries. So, in comes data privacy levels, because the problem is, sometimes query folding is too smart. So in Chris Webb’s series on privacy levels, he has a demo that I’ve done before, which is really, really interesting where let’s say that I want to be able to change what I’m filtering on dynamically. So instead of in Power Query, clicking on the little down arrow and selecting ‘black’ for the color, let’s say instead I have an Excel sheet, and I do a JOIN on that Excel sheet and that has all the colors that I want to filer on. Power Query is smart enough that if I’m pulling all my products from the database and I have an Excel sheet that just says ‘black’ for product, it’ll fold from the Excel sheet into the SQL query, which is really, really cool until instead of filtering on ‘black’, you’re filtering on Social Security numbers or credit card numbers or something like that. So that’s where the data privacy levels come in is you may have information that will leak out to other systems because of query folding if you don’t specify your data privacy levels.
Carlos: Now in that scenario, is the privacy– because I’m assuming my filter on the Social Security number, right, weird example to have colors on Social Security numbers, but–
Eugene: Sure, sure.
Carlos: But that information is in the Excel spreadsheet, right? What we have to set is the privacy level on the data source for the Excel?
Eugene: Well, we have to set it on both.
Carlos: On both.
Eugene: Because what we would do, we would say that the data privacy level on the Excel sheet is Private and the data privacy level on SQL is Organizational. And so it’s basically a pretty simple tier kind of system.
Carlos: And maybe we should go ahead and jump ahead. The three settings you mention, there’s public, organizational, and private, so those are the three choices.
Eugene: Yeah, correct. And so basically, it’s just a simple kind of tiered system. So if you’re trying to mix organizational and public, it’s not going to fold organizational information into your queries towards a public source and it’s not going to fold private information into queries towards organizational or public sources. And that’s basically all that it’s doing is depending on how you’re mixing your data, it’s disabling query folding so that you don’t have data accidentally leaking to other systems, because maybe you don’t trust the DBA running your SQL Server or something like that. Another thing that’s worth mentioning is you can disable this functionality on a data gateway level. I have no idea why you would do that, but I mean maybe you’re testing something, or you have a dev data gateway but you can turn on ‘fast combine’, which is code for ‘ignore privacy levels’.
Carlos: Well, now another option Chris actually talks about it, potentially because the Power BI service doesn’t recognize this or they apply it slightly differently, his thought was that if for whatever reason you didn’t like the way it got applied, by the Power BI service, that the gateway would be a way for you to say “hey, no, this is what I really want to do.”
Eugene: Oh, okay, that makes sense. But yeah, that’s kind of the bulk of it is Power Query’s really powerful because it can use query folding to save a lot of work and effort but sometimes it can be a little bit too smart and you end up with data potentially leaking to other systems, and so you can run into issues there. Like I said, that’s all that it does. It doesn’t affect who can see what, it’s not like data masking, it’s not like row-level security, it’s not like any of that kind of stuff.
Carlos: Right, sure. Now, it says “ignore the privacy levels and potentially improve performance”.
Eugene: Right, that would be the ‘fast combine’ kind of thing. Are you able to set that on the file itself or are we still on the data gateway kind of piece?
Carlos: This is one of the other options, I guess, so it’s not necessarily a privacy level, it’s the “hey, ignore all of this”.
Eugene: Oh, okay, yeah, yeah, that makes sense.
Carlos: Yeah. And I guess they do give the caveat, “potentially”. So I guess it all it depends on ‘your experience may vary’ as to what you’re actually going to get there.
Eugene: Yeah, also when it relates to files, my understanding is that the levels are hierarchical. So, you may have ‘none’ set as a privacy level for a specific folder or a specific file, but then if you have a different privacy level set on the C: drive, if you’ve got ‘none’, it’ll pick up whatever’s higher up on your drive, if I remember correctly. So that can be confusing sometimes, where your privacy level is ‘none’ and you’re not sure what’s going on there.
Carlos: Right, so now, I’m curious, as you’ve started using this, and again, like all security, I guess there are two thoughts. So from the admin perspective, “I don’t know what permissions we’re going to need, let’s just make it all public, that way we don’t have any issues.” Or there’s the, “no, you can’t have anything and you’re going to have to ask for every little permission in triplicate.” So how do you recommend us taking an approach to figuring out, “how much data–”
Eugene: So my baseline recommendation is, if it’s anything internal, mark it as organizational and if it’s anything external, mark it as public.
Carlos: External meaning stuff you can find on Wikipedia.
Eugene: Anything that’s leaving your company’s network, so even if maybe you’re talking to a web API or something like that. And I think, in most cases, you’re going to know if you’ve got HIPAA compliance, or personally identifiable information, or PCI compliance. I mean, generally in my experience, people are intensely aware of those kinds of regulations that they have to deal with, and internally, a lot of the potential use cases seem kind of silly to me, in the sense that again, okay, if you don’t trust your DBA to your own SQL Server, that seems a little weird to me, to worry about–
Carlos: Sure, sure, yeah, I was going to say, by the way, you’ve already made that decision. It seems like if you have a new project, people might want to reevaluate that, but some of those decisions have already been made, in a sense. But it is important, particularly as you think about going to the cloud. I think that’s kind of what gets people riled up a little bit is that now data is “leaving the organization”.
Eugene: Yeah, something we should put in the show notes is there’s a security whitepaper for Power BI and I think, as we start to talk about the cloud piece, really should give some more thought to, “okay, how is that data being secured once it leaves the organization?” And Microsoft has a really good whitepaper about how they handle that data and how they encrypt things and what is kind of stored in rest and what’s just stored in flight and all of that kind of stuff. But I think most of the time, that’s going to be a much, much bigger concern and thing to figure out than necessarily, “okay, do I set this to private or organizational” or whatever.
Carlos: Sure, sure, okay. So there are bigger fish to fry beyond just clicking this.
Eugene: I would definitely say so. But it’s a good setting to know about and you’re going to run into it the moment you want to start combining data sources.
Carlos: Now what’s interesting here is that it gives us the option– we have to go in and set this. So if we don’t set it, is there a default?
Eugene: I believe the default is ‘none’– or I forget, honestly. But I think it’s ‘none’ and–
Carlos: That’s what I was thinking as well.
Eugene: And it’ll prompt you once you start trying to combine different types of data sources. So if I remember correctly, if you’re just playing around with a single SQL database, it doesn’t ask you to set this, but the moment you’re like, “hey, I want to add in this flat file”, if I remember, it’ll prompt you and say, “hey, what kind of levels do you want for this stuff?”
Carlos: Okay. Good deal. Kevin, you got anything for us today?
Kevin: No, you guys hit all the questions I was going to ask, basically about defaults and what the value of it is, and it sounds like default is ‘who cares’ and the value is “eh, some”.
Eugene: ‘Some’, that is an accurate summation.
Carlos: Thank you for that description of our episode today.
Eugene: We can just put that at the beginning of the podcast and be like, “alright, here’s the ‘too long; didn’t read’.” Every episode needs just a 10 second summary from Kevin.”
Carlos: There we go, that sounds like a process change.
Carlos: Okay, well compañeros, that’s going to do it for today’s episode. Thanks again for tuning in. You can reach out to us on social media.
Eugene: You can find me at sqlgene on Twitter and sqlgene.com for my blog.
Kevin: You can find me only if you’re organizational privacy setting. Can’t find me on private or public.
Carlos: And compañeros, thanks, for those of you who have reached out to me on LinkedIn. I enjoy connecting with you. I can be reached at CarlosLChacon and we’ll see on the SQL Trail.
[…] SQL Data Partners Podcast Episode 168: Data Privacy Settings (Carlos L. Chacon) […]