Episode 84: MVP

The Microsoft Most Valuable Professional, or MVP, award: it takes hard work and dedication to be named an MVP, but what are the benefits? Kathi Kellenberger is a great member of the SQL community, traveling to many SQL Saturdays and speaking at conferences about SQL Server, and this time she is here with us to tell us what it really means to be an MVP and what a great journey it has been becoming one.

Do you know the ratio of men to women MVPs worldwide and in the US? How many posts do MVPs make on forums, helping others? How do they keep track of all the SQL Saturdays, user groups, and conferences they have been to? Kathi shares her story: the process she went through to become an MVP, how you get nominated, and how to get renewed!

Episode Quote

“Do these things because you love to do them. Don’t do them because you want to be an MVP.”

“Women, you’ve got to get over your reluctance to promote yourself.”

“It’s not a job, nothing paid […] We’re kind of a free support for people in a way. A lot of what I like about it is that I can share my enthusiasm about SQL Server.”

Listen to Learn

  • What it means to be an MVP
  • The benefits of being an MVP
  • MVP categories
  • The process and criteria for being awarded MVP status

Kathi’s Blog
Kathi’s Kerberos and SSRS course
Kathi on Twitter
Kathi on Pluralsight

About Kathi Kellenberger

Kathi is an independent SQL Server MVP and Teammate with Linchpin People. She is author of “Beginning T-SQL 2008” and co-author of “Professional SQL Server 2005 Integration Services,” “Beginning T-SQL 2012,” and “SQL Server MVP Deep Dives.” Kathi enjoys speaking and writing about SQL Server and teaches the occasional SQL Server class.

Transcription: MVP

Carlos: This is Carlos Chacon.

Steve: This is Steve Stedman.

Kathi: This is Kathi Kellenberger.

Carlos: So Kathi, welcome to the show.

Kathi: Thank you and thanks so much for inviting me. I’m really excited about it.

Carlos: Yes, I’m glad that we finally were able to put this together, and of course, you know, the rest, as they say, is history. This is something that we’ve been trying to get together since October of 2015. It’s been a long time coming, and we’re glad to finally have you on the show.

Kathi: Yeah.

Carlos: And as your Twitter handle suggests @Kathi, right. You’re a family member to all and a great member of our community and so we’re excited to chat with you.

Kathi: Thank you.

Steve: We have an interesting, oh, go ahead Carlos.

Carlos: No, no, go please.

Steve: And we have an interesting topic to cover today and that topic is?

Carlos: Yeah, we want to talk a little bit about the MVP, the Microsoft MVP. What it means to be an MVP and what it’s meant to you, so I guess let’s go ahead and jump into that, and give us a little bit of your story there. How long have you been an MVP first, I guess?

Kathi: Ok. I was first awarded the MVP in 2008, and then in 2010 I took a detour in my career and ended up at Microsoft for about a year and a half. So I lost it during that time and so then I got it back again in 2013 and have had it every year since then. Kind of a weird back and forth but once I was eligible again I got it again. That’s how long.

Carlos: Now, ultimately, looking at the MVP website, Microsoft is looking for professionals to provide support and guidance, if you will, to the community, kind of taking their time to help others along their way, particularly with the Microsoft technology stack. I guess, what do you see as kind of the qualities of being an MVP?

Kathi: Yeah, again, I probably should clarify that. It’s not a job, nothing paid. It’s an award that Microsoft gives, so I feel like in some ways we’re an extension of Microsoft sales, because the things that we do I’m sure help companies use SQL Server more effectively. We’re kind of a free support for people in a way. A lot of what I like about it is that I can share my enthusiasm about SQL Server. I have this problem where when I learn something, the first thing that I want to do is go tell somebody else about it. And this is a great way to be recognized for that, because really it’s like a 20-hour-per-week unpaid job in some respects; I feel like that’s about how much time I spend a week on MVP type activities. Some of it you get paid for, because if you write a book you get paid a little bit, but for the most part it’s like a part-time unpaid job.

Carlos: Interesting. I had never heard that number before, 20 hours a week, that’s quite a bit. But when you think about some of those other things, I mean, you mentioned about writing a book, blog posts, speaking in front of people, preparing a presentation.

Kathi: Right, yeah, you know, Pluralsight courses. I volunteer at a non-profit to teach people about SQL Server, user groups, SQL Saturdays. Yeah, you know, some weeks are more than others, but being an independent consultant right now, if I’m only billing 15 or 20 hours that week I am going to fill in the rest of the week with the MVP type stuff.

Steve: So then when you’re doing that, your 15 or 20 hours that’s focused on that, are you keeping track of it and somehow having to report it to Microsoft, or is that just what you do to be out there and be recognized as an MVP?

Kathi: Right, right. You don’t have to keep track of it on an hourly basis. I did keep track of it for about two months, just because I was interested in how much time I was spending. But then that timekeeping, you know, gets to be exorbitant in itself, so I just wanted an idea, so I kept track for a while. What you have to do as a current Microsoft MVP: there is a website where you upload your activities. So if I speak at a SQL Saturday, I fill out a form and say that I spoke at a SQL Saturday, there were 30 people in the room, you know, with a link to it. So you have to record your activities, and that’s for renewal. So let’s say you are awarded the MVP; you don’t have to do anything for the next year. You’re an MVP, you could drop off the face of the earth as far as community stuff is concerned. There are no requirements, but to get it renewed you have to do things. You do report, but it’s not an hourly type thing.

Steve: Ok, so then the title MVP stands for Microsoft Most Valuable Professional, is that right?

Kathi: Something like that, yeah. Most Valued Professional I think. 

Steve: Yup. And so you’ve done all this work, and you need to keep doing that work in order to keep getting it year after year. I mean, that’s a lot of things you need to do there, and what are the benefits that you gain for doing that?

Kathi: My favorite is a free MSDN subscription, and it includes Azure credits. That’s probably my favorite, because I can really use, and have available to me very easily, any of the developer tools, SQL, Windows operating systems, things like that for training or experimenting, that type of thing. And probably my second favorite, which I don’t always take advantage of, is the MVP Summit up in Redmond. The last couple of years that’s been, one year it was overlapping the PASS Summit, and the last couple of years it has been like a week after, a couple of weeks after. But that’s just really awesome because you are on the Microsoft campus. You’re surrounded by, you know, in my case, lots of my friends who are MVPs. I’m pretty lucky. I get to hang out with friends a lot more and learn about the new things that are coming up. That’s probably the third thing that’s a really great benefit: you get a lot of insider information about what’s coming up, what SQL version is next, all of the things that are not public knowledge at this point. Obviously you’ve got to keep those things to yourself, but that’s pretty cool. They also have webinars from time to time about things that are coming up. In fact, in the last year or so they’ve really increased the frequency of those webinars, so if there is a topic you’re interested in, just watch, and chances are there will be information about what’s coming up, or maybe what’s in the next service pack, or what the new features are in the next version of SQL. There are really a lot of benefits that way.

Steve: Ok. Now, I’ve seen that there’s like different categories of MVPs in each country like you might have a Data Platform MVP or an Azure MVP. Am I understanding that correctly?

Kathi: Yes. In fact, last year they revamped all their categories. We were SQL Server MVPs and now we are Data Platform MVPs. Another really interesting thing is that you can now be awarded MVP in multiple areas so for example if somebody is really focused on SQL Server but they’re also really focused on Windows Server they might get an award in that area.

Carlos: Got you. Yeah, PowerShell is probably another category that overlaps quite a bit as well.

Kathi: Yeah, yeah, and I don’t know all the categories. I just look for what I do, so at this point everything has been SQL Server. However, I’ve overlapped a tiny bit into the security area, because I did a course for Pluralsight and I’ve spoken at a few places on Kerberos, so I went ahead and put those under security. Whenever you add your activities you have to kind of put them in a category, so there are people that have different areas. I talked about the tangible benefits, but probably another thing is if you want to be a consultant. This is probably not necessarily that great if you are just a database administrator at one company. I was, when I became an MVP originally, and my company did not care. They couldn’t have cared less. But as a consultant, or somebody with your own company, it is gold. It’s going to get you noticed and really be beneficial.

Carlos: One of the things, or criteria, they talk about, that is included if you will in contributions, is forum replies and answers. You know, you’ve got somebody who searches forums and whatnot, or puts questions out on forums. I am always amazed at the people who can respond; the #sqlhelp hashtag is probably another good example, right? It’s like they’re just waiting. And what’s also interesting is that they have really good answers. A couple of people that kind of jump to mind: Aaron Bertrand on Stack Overflow, I don’t know how many points he’s got. Robert Davis is another one; he’s probably on the SQL Help hashtag. Seems like he’s always chiming in every so often. I have never been able to get into that, because I guess maybe I’m a knuckle dragger, I guess I don’t know all that much.

Kathi: Yeah, I’m sure you do. 15 years ago I was on the SQLServerCentral forums all the time, answering lots of questions. In the 90s, I was on whatever was around back then for Visual Basic, constantly answering questions, and I don’t know if, as I’ve gotten older, I can’t multitask as well. Not only forums, but the same people are on Twitter and on Facebook, all these things all day long, and I’m like, “Man, I’ve got work to do”, so that is one area where I do not contribute much. I’m very grateful for the people that do contribute, but I do not. I’m more likely to be writing, recording, or presenting. Those are the things that I’m going to do, but as far as answering questions, man. I really liked doing the T-SQL questions at one time, and actually at Pragmatic Works, when I was there, they really encouraged us to answer questions when we were on the bench. The problem is, I wasn’t really on the bench that much, luckily. But the problem is, for me, someone puts a question out there. Maybe they have the schema and the example, and maybe they don’t, so you have to ask them about it, and it becomes such a commitment to the question. You know, getting the information out of them, and maybe you’ve got a solution. Well then you’ve got to, you know, create a little database, not a temporary database, create a little database and populate it and try to work it out. I just feel like it’s too much of a commitment, so I don’t do it. But yeah, I pretty much stick to what I do, and obviously Microsoft does look at that. That’s one of the things they look at, but luckily I do enough writing, recording, presenting, and volunteering that it doesn’t hurt me that I don’t answer questions.

Carlos: Yeah, the metric they give on the MVP site is this guy from Brazil. He had 728 helpful posts.

Kathi: That’s awesome.

Carlos: That must be total, over the year, 185 answers and 407 posts. Like, holy cow.

Kathi: I guess you can do that. You know, I do remember at one time, when I was at jobs where I was getting paid, I was able to do some of that, but, you know, I can’t now.

Carlos: Sure, that’s true.

Kathi: I can’t, so I don’t know maybe I just can’t multitask anymore.

Steve: You know, I feel that same way too. I think that if I’m going to go and answer questions on Stack Overflow or whatever out there, I’ve got to focus on that. And that’s something that I can’t really do when I’m doing work that I’m getting paid to do; there’s got to be that separation there. That bench time, as you mentioned, if someone has a lot of bench time, well, they might have more opportunity there, but I think it’s a balance figuring that out.

Kathi: Yeah, absolutely.

Carlos: Part of putting together those presentations, or even the courses, I mean, you mentioned your book, and having just put a book out myself, I know the time and energy that it takes to do that. But it also has a side benefit: you become more familiar with that topic and can talk to it more, rather than the specific T-SQL problem somebody is having, or the loop they’re trying to write, or, you know, whatever the specific implementation is.

Kathi: Right, right, yeah. I just like what I do to scale. Occasionally people will send me a question personally. I will try to help them if I can.

Carlos: Right. And that’s different, if they send it to you personally that’s different.

Kathi: Right, yeah, and sometimes. I just had a gentleman, who I think was not familiar with SSIS, who had written a comment on my blog site about SSIS and had a couple of questions, and I was able to help him very easily with that question, and I felt good about that. But sometimes, and I haven’t had anything like this for quite a while, sometimes I feel like they’re trying to get free consulting out of you, and it is just too much. There are times, whenever I’m asked personally, when I say, you know, this is bigger than we can do over email. Maybe you want to put this question out on a forum for people. You know, if somebody sends me a question about replication or something, I mean, I’ve used it, but I’m not an expert on replication. That would be better to go ask on a replication forum.

Carlos: Yeah, there’s a fine line there, right? You know, particularly as an MVP, you mentioned the time you have to put in, and then kind of chasing some of these specific problems, with all the hundreds of variations that could take place, can be difficult.

Kathi: Yeah, and that time just to clarify, that’s just an estimate of what I do. I don’t know what other people do and there is no requirement. That’s just pretty much what I do on average.

Steve: Ok now, you’ve mentioned that you’ve been an MVP twice, that you originally had it, and then you went to Microsoft, and you became an MVP again after that.

Kathi: Right.

Steve: Now, how did you go about getting that? Is that something that somebody nominated you for or what is that process and how does that happen?

Kathi: Right, so what usually happens, and this is kind of funny, but you will be at PASS Summit or a SQL Saturday and people start asking you, “Hey, are you going to the MVP Summit this year?” And then you will say, “Well, no, because I’m not an MVP.” And then they’ll say, “Well, you should be.” And then they will nominate you, so that happens a lot. I guess I can say who nominated me originally, because this happened around 2007, people started saying this to me. Andy Leonard, and there were like three or four people who said, “I’m going to nominate you.” Well, one person did, and that was Andy Leonard. And I know you just had him on the podcast the other day, because I’ve listened to part of it. He’s such a sweetheart, but he nominated me. Of course I didn’t get it the first time; it took a couple of times for it to go through. But that’s usually it: you fill out this form. You can nominate yourself, but that’s kind of weird, so usually somebody who thinks that you deserve it will nominate you, and then you may or may not get contacted by Microsoft. And if you do, then they’re going to ask you to fill out a form listing all your activities for the last year, and then you may or may not hear anything back again. It is kind of weird. If you’re not having a conversation back and forth, don’t expect to get the email on the MVP award day. I would think you’d probably have some conversation back and forth with them first. But even if you do, that is no guarantee that you’re going to get it.

Steve: Yup, so you mentioned the MVP award day, and I saw recently, right after the first of the year, I think there were a bunch of MVP renewals and new MVPs out there. Is that an annual thing or a quarterly thing? How does that work?

Kathi: It’s quarterly. So it’s January 1st, April 1st, July 1st and October 1st. The really funny one is, not October 1st, but April 1st, because it’s April Fools’ Day, you know, that’s the really funny one.

Steve: We could all be MVPs for a day on April 1st.

Kathi: We could be, we could be, yeah. So it’s a pretty exciting day for you if you’re up for renewal or you’ve been nominated. You’re going to be checking your email every 10 minutes. But the frustrating thing is it comes out late morning or early afternoon, usually because they’re coming out of Redmond, probably, so that’s Pacific time, so, you know, you’re waiting. It is an exciting day if you get renewed; so far so good. I need to tell you something really cool about me in particular with MVP. So I first received the MVP in 2008, and at that time I asked the MVP lead, kind of a liaison at Microsoft that deals with it, “How many women are SQL Server MVPs?” He gave me these numbers. He knew at that time, this is 2008, there were 4,000 MVPs in the world of any area.

Carlos: Total?

Kathi: Total. At that time I think there were around 250 or 260 SQL Server MVPs, and I don’t know what it is now. And there were 60 SQL Server MVPs in the U.S., and 6 of those were women. So I was like, “Wow!” The numbers are just fluctuating constantly, partly because Microsoft keeps hiring MVPs. They’ve hired a bunch of them recently.

Carlos: That’s true. They have been on the hiring spree.

Kathi: Right, but I think women, you’ve got to get over your reluctance to promote yourself. I’ve had that. I feel like women, or girls, are taught from early on, you know, to work really hard and be really smart, but don’t let people know about it. I feel like we’ve been taught that from childhood, and so I had to get over that and start learning how to really promote myself.

Carlos: Yeah, that’s interesting. As the father of four girls, that’s probably something I need to key in on because, yeah.

Kathi: Yeah, we were really taught to be exceptional but hide it. You know, I saw that in my own daughter when she was a teenager. Especially if boys were around, all of a sudden she was not quite as smart. Yes, you know.

Carlos: Interesting.

Kathi: Yes, because she doesn’t want to be intimidating. But I think at one of the luncheons at PASS Summit one year, they talked about becoming an MVP, and they started out saying, “Hey, you need to let people know what you’re doing.” And then there were quite a few of us that became MVPs around that time. I don’t remember what year it was. I think the number is still pretty small compared to the guys, but I think we’re represented at probably about the same percentage that’s actually working in the field, so I guess that’s not bad.

Steve: When you talk about letting people know what you’re doing or promoting yourself, what type of things did you start to do differently at that point in time?

Kathi: Yeah, I think it was a gradual process. But what I would do is, say for example, I have a book coming out. I would contact Steve Jones and say, “Hey, I want to write an article that’s going to promote my book.” So I started to think of ways to promote myself. I’m not much on Twitter, but if I have a new blog post, or a new Pluralsight course, or whatever, I’m going to get out there and promote that. I think part of it is speaking. Trust me, I got a C in Speech in college, and I was very shy, terribly shy. Speaking in front of people is not something I would do, but I made myself do it. Those are the ways that, I think, at least, I put myself out there. I look for opportunities that can help me in multiple ways, is really what I do.

Steve: Ok, so how many times in a year would you say that you generally speak at a SQL Saturday or other big events?

Kathi: Let’s see, darn.

Carlos: Put you on the spot there.

Kathi: Yeah, yeah, I’m going to say at least 15. You know, I’d have to go back to my MVP site, because I have everything tracked there. But I try to go to at least 6 SQL Saturdays a year. I’ve spoken at PASS Summit every year for the last four. I don’t know how long my luck is going to hold out there, but somehow I’ve managed to do that. I speak a lot for the local user group, plus I do a lot of remote sessions, so I’m going to say between 15 and 20 presentations a year.

Steve: Oh, that’s really good. I mean, that’s more than one a month, and that’s a big commitment.

Kathi: Oh yeah, but you know what, to me once I started speaking it became a little bit addictive. It’s thrilling, you know.

Steve: I think I have that same feeling on the addictiveness of speaking. It’s just the travel side of it that slows me down.

Carlos & Kathi: Yeah.

Steve: Ok, so I guess at this point then is there anything else that you want to share with us?

Kathi: Yeah, I think that making MVP, there’s a pretty high bar to make MVP, and I don’t want people to be discouraged if they don’t make it. Keep doing the things that you’re doing. When I was at Microsoft, even though I wasn’t eligible to be an MVP, I continued to do the same kinds of things I was doing because I love it. Do these things because you love to do them. Don’t do them because you want to be an MVP. The MVP award has got to be a side effect of doing the things you love. I don’t love posting on forums, but I’ll do it. If for some reason I was not going to be an MVP anymore, you know, that’s always a possibility, would I stop writing books? Maybe. But I’d still be writing blogs. I’d still be doing Pluralsight courses and speaking, just because I love to do those things.

Steve: Ok, that sounds great. Shall we move on to the SQL Family questions then?

Kathi: Sure.

Steve: Can you share with us how you first got started using SQL Server?

Kathi: Yeah, it was kind of crazy. I had just become a developer. I was a midlife career changer, and this was 1997, and I was working in a really tiny company; there were only four of us at this little startup. And they had bought a package of training courses for this other guy that worked there. And he was wanting to take, now they call it System Center, but one of those type classes, Operations Manager, one of those type classes, and the course just kept getting canceled, and the expiration date was coming up, and they said, “Kathi, how would you like to take one of these courses at this training company, because it doesn’t look like this guy is going to use it?” So I said, “Sure.” And I saw there was this SQL Server class, and I went ahead and took that class, so that was early 1998. So if it wasn’t for the fact that that guy couldn’t get his Operations Manager class in, I don’t know how my life would have changed. I’m sure I would have gotten into it eventually, you know. So that was a SQL Server 6.5 class, and that was my first time actually getting to do anything or seeing it. And then shortly after that, I started actually using it, implementing it at customers.

Carlos: Sure, interesting. Now, in all that time when you were on SQL Server, if there was one thing you could change about SQL Server what would it be?

Kathi: Well, right now, since Service Pack 1 came out and all of these cool features have come down, I don’t know what you want to call it, into Standard, there is one feature I’d love to see backported down to Standard Edition, and that’s read-only replicas in AGs. You can do availability groups in Standard, and you could actually do that in 2016 before Service Pack 1, but they’re only for failover. You know, I would just love to have read-only replicas in Standard. That’s what I would love to see.

Steve: Well, perhaps we’ll see that in the next Service Pack or the next version, who knows.

Kathi: Yeah, I’m not going to hold my breath though.

Steve: Ok, what is the best piece of career advice that you’ve received?

Kathi: Yeah, so like I said, I was a midlife career changer. I had automated an auction at my kids’ grade school. I was a pharmacist at that time, but I automated it with Microsoft Access. And this other parent saw what I was doing, and he worked for a company that produces training materials for technology. And he saw what I was doing and he said, “You could be a developer.” That was really the first time anybody, this was like the early 90s, had said something like that to me: “You could be a developer.” And he, you know, got me some training materials and encouraged me to take a certification test. So I would say that was probably the best advice, just someone saying, “Hey, you could actually do this.” That really got me started down this path.

Steve: And that actually triggered the change from pharmacist to the IT side of things, developer, SQL Server?

Kathi: Yeah, there are a lot of things that happened along the way. Probably the first thing that happened was when I was in my last year of pharmacy college, working on my Bachelor’s in Pharmacy back then. There was a computer course where they had TRS-80s in the room, and we had to type little programs in and record them on a tape recorder. The first time I actually typed a little program into that TRS-80, I was hooked from that moment on. It was like a light switch went off in my brain. I had programming and things like that as a hobby, just writing some things for home, up until I started doing some things for my kids’ school. And then really, automating that auction was the project that got me started down the path. You know, before that I felt like it was something that was out of reach.

Steve: Very interesting.

Kathi: Yeah.

Carlos: Our last question for you today, Kathi. If you could have one superhero power, what would it be and why would you want it?

Kathi: So I’m going to change this up a little bit. I’m not really into superheroes that much, but I love Star Trek. I love Star Trek, and I watched the entire Star Trek: The Next Generation while I was on the treadmill last year. And this year I’m double dipping: I’m watching Voyager and Deep Space Nine both at the same time. I always loved Star Trek, but I didn’t get a chance to watch it for a while when my kids were little, so busy, so now I’m actually getting to do it. I’d see an episode here and there, but my favorite technology on Star Trek is the replicator. If I could have a replicator, and it’s going to look like a little microwave, I could just say to it, “Tea, Earl Grey, hot.” You know, it can get you any food that you want, or any piece of clothing that you want, just like that. I just think that would be amazing. That’s what I want.

Carlos:  There you go. I think Amazon would pay a lot of money to have that not be built.

Kathi: You’re right. But we have 3D printers now, which are not quite the same as replicator technology, but we’re heading down that road. People are printing all kinds of things, you know, so someday we will be there. In fact, Apress has a book, I don’t have the title in front of me, but Apress has a book about Star Trek technology.

Carlos: Really?

Kathi: Yes. And how they came up with the ideas, where the science is today, where it’s going. There’s a chapter on each technology.

Steve: Oh yeah, it’s really quite amazing how much technology has come out of things that were purely science fiction. Like the communicator, and then we had the flip phone, and now flip phones aren’t even new technology anymore.

Kathi: Oh yeah, these little computers in our hands that we walk around with are better than the servers I was using in the late 90s. You know, they’re amazing. They’re better than what they had on Star Trek, you know. Everything you could ever want to know is right in there. And you can talk to it if you want to.

Carlos: Well Kathi, thank you so much for joining us. It’s been a pleasure to have you.

Kathi: It’s been a pleasure for me. It was a lot of fun.

Steve: Yeah, thanks Kathi, this has been fun.

Kathi:  Yup. Thank you so much.

Episode 83: Data Visualizations

We’ve all been there: you have some data and you need to make it pretty for the report. With all the new reporting options, you have probably been tempted to find a new way to visually present information. In this episode of the podcast, we are talking about data visualizations, and our guest, Jonathan Stewart, has some issues with the way we currently go about it. Jonathan has been on the show as a panelist, and I have really enjoyed his perspective on things, so we invited him on the program to discuss some ways you can go about improving the data visualization of your reports without adding additional burden to the process.

Psychology plays a big role in data visualization. To make a good report, before even opening your tool, you should be able to summarize what the report presents. One of the major aspects of psychology is color and how we perceive it, and Jonathan explains how we can use this while visualizing data. We also discuss color blindness, something I haven’t heard discussed much elsewhere, and I was curious to know whether there is an easy button for solving this problem.

Episode Quotes

“One of the major aspects of psychology is just color, right? What colors are you using? And this goes back to my storyboard, and you see everything comes back to the storyboard, right? Actually, I have everything done before I ever open a tool. Even the colors picked out should be done before I ever pick a tool.”

“Don’t think of it as, you know, I have to do this extra work. Think of it as you’re doing it to prevent extra work. If you develop a framework, you develop your own storyboard and you get your stuff there.”

“Out of 30 people in the room, there’s 3 people at least in that room that are colorblind. And then too, there’s not just one type of colorblindness. There are multiple spectrums.“

“We’re datafying everything. We datafy everything from our footsteps, to our calories, to our heart rate, to traffic on the road, we datafy everything because we like to be able to see and analyze things.“

“Are we using the numbers to prove our bias, or are we using the numbers to disprove our bias and to show the proper thing?”

Listen to Learn

  • How to do storyboarding before even opening a tool to create a report
  • How to include storyboarding in the data visualization process without investing too much time
  • How you can prevent additional time spent on report modifications
  • The color psychology behind data visualizations
  • How to create a report that a colorblind person can understand
  • How the quality of the source data affects the end result
  • Predictions on how reports and data will be used in the future

Jonathan on Twitter
SQL Locks LLC
Edward Tufte and data visualization
Colour Blind Awareness
See your charts as a color blind person would
Building a data visualization storyboard

About Jonathan Stewart

Jonathan is a SQL Server professional of 18 years. His experience ranges from BI architecture, data modeling, data warehousing, database architecture, and database administration to strategic IT planning for Fortune 100 companies. He sees data in everything and is constantly looking for ways to organize, read, and analyze whatever data set he can get his hands on. He is passionate about data and making sense of it, and about using data to effect positive change in his local community of Columbus, Ohio. When not doing data things, you can find him mentoring youth in his community.

Transcription: Episode 83

Carlos: This is Carlos Chacon.

Steve: I’m Steve Stedman.

Jonathan: I’m Jonathan Stewart.

Carlos: Jonathan, welcome to the program.

Jonathan: Thank you for having me. I appreciate you having me on. It’s a great honor.

Carlos: Yes, always interested in chatting with you. And ultimately our topic of discussion today is data visualization. Now, this isn’t necessarily a reporting discussion, but ultimately, I guess, the way we present and potentially consume reports, or as we say, data visualizations. So I guess let’s jump into that. And you have an interesting story. Why is this an important topic for you?

Jonathan: Well, first off, one of the issues that I’ve always seen is that whenever we look at data visualization, the first thing we think about is a report, a chart, the tool, whether it’s Excel or PowerBI, Tableau, Qlik, it doesn’t matter.

Carlos: Sure, SSRS, a lot of our people using SSRS.

Jonathan: Yeah, Reporting Services. And that’s how we see it, but we end up with problems like, was the report right, and stuff like that. All those problems we end up having can be hedged upfront if we look at it from a different view. I began to look at things from the focal point, the first point. Look at it from the problem, and then begin to look at it as a story versus just a report. How can we tell the story better? I’ve always been a psychology junkie, so I always love stuff like how these colors make us feel, things like that. All these things began to build in me about visualization, so I began to study it more and it became almost my focus. I was already a business intelligence consultant, so I was already building reports, and I got to see how people receive certain reports. You know, why pie charts suck. I got to see all these things as I was building out my career and my focus, and it became a huge interest to me. Everything from how we perceive the world to color blindness began to take shape, and I got to see it. And as I learn more, and this is just me in general, as I learn more I want to share more with people. I was like, “Oh, this is amazing. Let me share this with them.” So then somebody was like, “You should just speak about it”, and that’s actually how it started.

Carlos: There you go. Help us here, because we’ve had discussions on this show about many of the different reporting options that are available to us, right? And ultimately, while there are different needs, that’s why there are so many different tools. In what ways are we fumbling when we actually talk about the presentation layer, if you will?

Jonathan: One of the major issues that I see in visualization is that the current tools are sold as self service BI, self service this, self service that, like self service pumping your gas. The pumping your gas joke is something Edward Tufte made a comment about in one of his blogs. He has a huge issue with products like QlikView, Tableau, even PowerBI saying, you have our tool, you can make this. He’s like, it’s not like pumping gas. You can’t just take data and just show it, because if it has no context, if you don’t know what it means, then it still is a bad visualization.

Carlos: Got you, so we need to be able to better represent what the data means when we present it.

Jonathan: Right, so one of my focuses is that when I go to a client, and they’ve never seen things like this before, I actually storyboard before I even open up the tool. If I get a request, I need this, I’ll go and I’ll sit and talk with them. And it will be a legit interview, almost like I’m building a full-blown data warehouse. I’m interviewing the business and stuff like that. One of the first questions that I ask is, “What problem are you trying to solve with this report?” Because one of the things we do is, as soon as we open up a tool and start coding something or writing something, whether it takes five minutes, or an hour, or 5 days, whatever, if it’s wrong we really don’t want to change it because it’s our baby, right? We write a certain procedure and we really don’t want to just throw it away. We want to modify it. So even before I open any tool, it doesn’t matter what the tool is, I want to make sure that I understand the problem.

Steve: So one of the things you mentioned there was Tufte, and I’ve looked at some of his books and I’ve seen some truly amazing visualizations. And one of the disconnects that I’ve had is going from here are the tools we’re given, SSRS or whatever we’re using, to some of these amazing things that you see there. And I mean, even with PowerBI there are a lot more features in there. But to really go to some of the stuff that he’s shown, it’s a big jump.

Jonathan: Right, right. And I think that’s one of the things: if I can get more people to think in terms of a story, storyboarding, you’ll get to that. So you’d better ask: What’s the problem we’re trying to solve? Who’s the audience? What action are we trying to take? And then, what type of mechanism? Are we doing PowerPoint or PowerBI? Then understanding your tone, stuff like that. Are we showing success or failure? Once you have all these questions answered, then you can say, “Ok, how can I display that properly?” And then you can start getting to those amazing visualizations that Tufte and all those people show. But you have to have all that stuff answered upfront. I think one of the problems is that we try to jump straight to that. We say, “Oh ok, I think I can show that in this amazing histogram.”
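
The pre-tool questions Jonathan lists can be captured in a small structure. Here is a minimal, illustrative sketch (the class and field names are our own, not anything Jonathan prescribes): gather the answers first, read the one-sentence story back to the requester, and only then open a tool.

```python
from dataclasses import dataclass

@dataclass
class Storyboard:
    """Answers gathered in the interview, before any tool is opened."""
    problem: str    # What problem is this report trying to solve?
    audience: str   # Who is it for?
    action: str     # What action should it drive?
    mechanism: str  # PowerPoint, PowerBI, SSRS, ...
    tone: str       # Are we showing success or failure?

    def summary(self) -> str:
        # The one-sentence story to read back to the requester.
        return (f"For {self.audience}: show {self.problem} so they can "
                f"{self.action} ({self.tone} tone, via {self.mechanism}).")

# Hypothetical example request, for illustration only:
board = Storyboard(
    problem="weekly store attendance vs. sales",
    audience="regional managers",
    action="adjust staffing levels",
    mechanism="PowerBI",
    tone="neutral",
)
print(board.summary())
```

If the requester reads that sentence and says "no, wait, that's not it," you just saved yourself the rework Jonathan describes below.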

Carlos: Yeah, exactly. I’ve got some data now let me just throw it in this visualization tool.

Jonathan: Right, and try to make it look pretty, and then it ends up not being what we wanted. It may not be exactly what we want, and we try to tweak it, and it can still end up being wrong. One of the examples that I use is, you have somebody panning for gold out in California. Over the course of a week they have 10,000 tries, and over the 10,000 tries let’s say they find 147 pieces of gold. And so now somebody says, “I want a report of that.” And you say, “Ok, well, what do you want in the report? Do you want me to show all 10,000 tries?” You may not want to see all 10,000 tries. You may only want to see the 147 successes. But you may be somebody who’s developing the tools that did the tries, so yeah, you do want to see all 10,000 tries. So which is the right answer? And a lot of times, too, when somebody gives us a report requirement, it may be two or three reports that we have to develop. But we don’t know that until we actually go back, do this interviewing, and figure out who the audience is, because at the end of the day it’s not just “show us the data”.

Carlos: Right, and we’ve talked about this before on the show, and for kind of the new wave of technologists, another differentiator is that I think a lot of times, like you say, we’re technologists: I have the data, I’m going to toss it into the report, I’ve done my job, kind of a thing. The new wave, or I think what executives are looking for, and we’ve had CEOs on this show talking about what they’re looking for in data people, is people who understand the data. They want people who understand the business, right? Why are we collecting data this certain way? And I think this could be a major differentiator, to be able to go in and say, to use your Gold Rush example, “I am the guy who understands how to develop this gold-panning software or process; therefore I can give you better insights into those 10,000 tries, or know how to display them,” kind of a thing.

Jonathan: Right, and then, too, know who your audience is, because if you’re doing a report for another person who’s going to be out there panning for gold, that’s a whole different report than one for your CEO. Understanding all those things upfront, before you ever open up the tool, will help ensure that your visualization becomes useful, because at the end of the day that’s the whole point of it.

Carlos: Right, that’s not the only problem.

Steve: So one of the things that I see working with clients, specifically around reports, is that oftentimes you’ll get a request that comes in from one of the client’s customers, or an internal customer of the client, that says, “Take an existing report, like a simple grid type report, and add to it or modify it or do these things to it.” And oftentimes what we run into there is that people know what they’ve seen, and they know how to ask for alterations to what they’ve seen, but they don’t necessarily know how to ask for what they’re really looking for. Do you run into something similar?

Jonathan: I do, and I have a couple of tricks that I will share. Like you said, you see that a lot: we have Report A, we want to extend Report A with this and this. Once I have all those requirements, can I make a one-sentence summary of it? Kind of make a three-minute story of it. If I take what I have there and I give it back to them, they may say, “No, no, wait. That’s not really what I want.” That’s what you will get when you start engaging the user by putting the requirements into story form, because we understand stories; we like stories. So once you start explaining it back to them, they say, “Oh, wait, wait, that’s not it.”

Jonathan: One of the things that we always run across is that we think we know something, or we assume the other person knows something that they don’t. It happens with all of us. We say, “Do such and such,” and we think that everybody knows all of the answers, and they don’t. By putting it into story form and presenting it back, we say, “Ok, is this…” and they’ll say, “Oh, wait, I forgot to tell you such and such.” So you’re prompting them to give you more information as well, before you ever modify that chart.

Carlos: You’re now assuming that they know something that they may not know, or you know you’re working on different assumptions.

Jonathan: Right, because at the end of the day, regardless of what kind of visualization you’re doing, whether it’s an infographic, a simple chart, a histogram, it doesn’t matter, you should be able to summarize it in words real quickly to whoever requested it, and they should say, “Yeah, that’s exactly what I want.” Whether it’s the grain or anything else, everything should be summarized quickly back to your user, so they can say, “Ok, yeah, that’s exactly what I want.” And at the end, you can open your tool.

Steve: Sounds like some great advice there. You mentioned psychology earlier, can you tell us a little bit about how psychology plays into what you do with reporting and visualizations?

Jonathan: Yes, definitely. One of the major aspects of psychology is just color, right? What colors are you using? And this goes back to my storyboard, and you see everything comes back to the storyboard. I have everything done before I ever open a tool. Even the colors should be picked out before I ever pick a tool. So you go back to the tone of your visualization. Did your company lose $10 million last quarter? Do you really want to show that with happy colors? Maybe, right? But you probably don’t want the whole report to be red either. Think about the color connotations of red, how red makes us feel, the innate biases we have to certain things. The color red has connotations of power, energy, war, strength, and stuff like that. We can use it in that aspect, and that’s great. But it can also read as violent, brutal, overbearing, angry. It’s all about understanding the color that we’re using when we’re using it. And you see the psychology of it in things like company logos, right? There’s a reason why the majority of company logos have red, blue, or white in them, if you’ve ever thought about that. The companies with red want to show power, passion, strength, desire. The companies with white want to show purity, innocence, and simplicity. And the blue, like Microsoft obviously, they want to show confidence, intelligence, and trust. So they’re already doing these things, showing you these types of things. Even in movies there are certain directors that use color to show feeling, because we automatically have these innate feelings about a color when we see it. So, not going too far from the subject of visualization.
We want to go back and make sure that we’re using the right color. We are not using green because we like the color green. We’re using green because we want to show nature, being down to earth, growth, vitality. But we’ve got to be careful with it, too, because it can also read as devious, selfish, and greedy. So we want to make sure we’re cautious about how we use colors at certain times. You can take a perfect visualization that shows exactly what the user wants, change the colors, and it changes the whole meaning of the report.

Steve: It makes me think, I worked for a company years ago where one of the marketing directors had basically a rule that nowhere on the company website would green ever be used. I never understood why, but thinking about some of the things you’re talking about here, maybe they had some kind of psychology behind not using green, and they failed to share that with the development team.

Jonathan: Well, what was the industry they were in?

Steve: It was sort of self service websites, back in around 2000-2005, sort of early on.

Jonathan: Yeah, I was trying to think of what kind of negative meaning green would have for them, or whether it was something else they were thinking about. There are all kinds of reasons why people want to do that. One of the funny things is the color orange. Orange is historically a color that you either love or hate. But kids are drawn to it; that’s why Nickelodeon’s logo is orange, right? It signifies happiness and stuff like that. You also get things like Amazon’s logo, too.

Carlos: Well, it’s funny you mention that, because people would ask me growing up what my favorite color was, and I could never really pick one. But I always enjoyed orange candies and flavors, so I chose orange.

Jonathan: Because kids love orange. It’s one of those things people don’t think about until you say it, and then they’re like, “Oh yeah, you know what, that’s true.”

Carlos: There’s one other area that I want to jump into, but I guess I want to stop and say: look, all of this sounds well and good, and I think, yes, it makes sense. We want to present the data in the best way, because at the end of the day we want to make sure that our users are happy, that people are understanding the information well. But all of a sudden this sounds like a lot of work, an additional task that I now have to take on, right, in addition to everything else. Where’s my easy button, right?

Jonathan: Alright, so here’s your easy button for you.

Carlos: Ok, here we go.

Jonathan: Preventing rework, number one, because at the end of the day, if your visualization is wrong, you have to do it again and you’ve just wasted everything, right? Don’t think of it as, you know, I have to do this extra work. Think of it as you’re doing it to prevent extra work. You develop a framework, you develop your own storyboard, you develop your go-to 5 questions. For me, I have maybe 8-10 charts and graphs that I use all the time. I already have those in my head. I can throw stuff into them and just go with them. I don’t have to think about them because I already have them. And anybody who has done reporting, I’m sure you guys both have your favorite charts and graphs that you like. So you already have those; you don’t really have to develop them because you already know them. Now, if you develop your own…

Carlos: I’ll say, you know, showing the lazy man that I am, 99% of the time I’m taking the default, the default theme, whether that’s in Word or whatever. So whatever the chart is, Excel might give me a couple of options, but I’m just like, “Ok, I’ll take that one, and then if they don’t like it I’ll change it,” right? I admit that. And I think, maybe to your point, that idea of developing a process around how you’re going to create them, and then coming up with some of your favorites, once you’ve gotten a couple of wins there, you can go back and make that process a little bit faster.

Jonathan: And you know what, thinking of that too, here’s what I’ll do, and I’m committing to this now, right on air, so I have to do it. I’ll write a blog post of, let’s say, 5 questions: a quick process that a professional can go through, that may take them 20 minutes to gather the answers, and that will help frame a quick storyboard to build into their platform. It would be basic, so you can extend it yourself, but I’ll create a couple of questions that you can take and use to help you get to that point, so that you can eliminate a lot of the rework that we typically have. I’ll do that and put it on my blog.

Steve: Yeah, that would be great. If you give us the link to that we can include that on the show notes which will be at sqldatapartners.com/visualization.

Jonathan: Ok, yeah, definitely.

Steve: Ok, so now that we’ve talked about colors. What about people who are color blind? How do you handle that?

Jonathan: Ok, that’s actually a big, big thing. And it’s dear to me, too, because I grew up with people in my family who suffer from Color Vision Deficiency. There are things that you would never think about unless you actually live with them or experience them in day-to-day life. One of them is cakes, birthday cakes. I remember we were doing a birthday cake, and we had to actually stop and think about the color difference between the cake and the letters on it, because the person we were creating the cake for wouldn’t be able to see the difference. I was probably 8 or 9 years old, and I was like, “Wow.” So at an early age that was impactful for me.

Carlos: What are the stats on colorblindness?

Jonathan: I believe it’s something like 1 out of every 12 men, about 8%, and 1 out of every 200 women in general are colorblind. So if you take a SQL Saturday with 30 people in the room, there are at least 3 people in that room who are colorblind. And then, too, there’s not just one type of colorblindness. There are multiple spectrums of Color Vision Deficiency: protanopia, deuteranopia, and tritanopia are the three major ones, but there are others, like green cone deficiency. One of the misconceptions when we think of colorblindness is that the person just can’t see the color red; really, they lose the whole spectrum. Somebody with a green deficiency loses the whole green spectrum. Somebody else could lose the whole red spectrum. And one of the things that I show in one of my talks, and I do this purposely because, as I said, it’s dear to me, is I show people what other people see, because until we actually see how other people see things, we can’t really empathize with them. I’ll take a picture and I’ll show it in the different types of spectrum. I’ll say, “Ok, this is how we see the world. This is how this person sees the world.” And they’re like, “Oh my God, I didn’t think about that.” There are actually some cool things that we can do, though. Right off the bat, we can use shading and gradients, which are tools already available to us, to be able to show differences. We can avoid using reds and greens together when possible. But there are other tools, too. One is the ColorLab, and I will give you guys the link so that you can put it with the notes. The ColorLab is actually pretty cool because you can go through and take your company’s color palette, because that’s a big thing, too: a lot of companies have color palettes, and they have to put their reports in those palettes.
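
To make the shading-and-gradients advice concrete, here is a minimal matplotlib sketch (assuming matplotlib is available; the data is made up for illustration). It uses the Okabe-Ito palette, a widely used colorblind-safe color set, and adds hatch patterns as a redundant, non-color encoding, so the series stay distinguishable even if the colors collapse for a reader with CVD or the chart is printed in grayscale.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Okabe-Ito palette: a widely used colorblind-safe set of colors.
OKABE_ITO = ["#E69F00", "#56B4E9", "#009E73", "#F0E442",
             "#0072B2", "#D55E00", "#CC79A7", "#000000"]

# Hypothetical data for illustration only.
regions = ["North", "South", "East", "West"]
sales = [120, 95, 143, 88]

fig, ax = plt.subplots()
# Redundant encoding: vary the hatch pattern as well as the color.
hatches = ["//", "..", "xx", "--"]
for i, (region, value) in enumerate(zip(regions, sales)):
    ax.bar(region, value, color=OKABE_ITO[i], hatch=hatches[i],
           edgecolor="black", label=region)
ax.set_ylabel("Sales ($K)")
ax.legend()
fig.savefig("sales.png")
```

The same idea applies in any tool: pick a CVD-safe palette once, store it with your storyboard framework, and reuse it.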

Carlos: Right, they have to meet the criteria, sure.

Jonathan: Right, so they’d probably better use shading and gradients. But you can put the colors in there, and you can go through and change the CVD spectrum and see how people truly see those colors. And there’s another site, color-blindness.com, with a tool called COBLIS, the Colorblind Simulator. What I do when I get done with my visualization, whether it’s Reporting Services, PowerBI, Tableau, it doesn’t matter, is I’ll screenshot it, upload it there, and look through the simulator to make sure that I can still see what I want to show. Does my visualization still tell the story that I want to tell? Am I losing the tone because I lose the color? And if I am, then I can go back and change it. Those are quick things that we can do to make sure we include people on the CVD spectrum, because as time goes on we have more and more data and everything is becoming “datafied”. You’re going to have more and more people depending on data. And one thing to think about: right now it’s not written into the ADA, the Americans with Disabilities Act, but I can see it being used later on, because somebody could say they can’t do their job because you didn’t show them a CVD-friendly report. And that’s something as simple as changing the colors. You could potentially be liable in the future for stuff like that. I see those types of cases coming, because I see lawyers chasing stuff like that; that’s a whole other avenue to chase and make money. It’s easy to hedge against, though. What you don’t want to do is be the person who developed the report that got the company sued for $10 million. Right, I mean.
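
The COBLIS-style check can also be approximated in code. The matrix below is a commonly circulated rough approximation of deuteranopia, applied directly to sRGB values without gamma linearization, so treat it as a quick sanity check rather than a faithful simulation; for real work, use COBLIS or a vision-science library.

```python
# Approximate deuteranopia (green-deficiency) simulation matrix.
# Assumption: this is the simplified matrix often quoted online;
# proper simulation works in linear RGB after gamma correction.
DEUTERANOPIA = [
    (0.625, 0.375, 0.0),
    (0.700, 0.300, 0.0),
    (0.000, 0.300, 0.700),
]

def simulate_deuteranopia(rgb):
    """Map an (r, g, b) color to roughly what a deuteranope sees."""
    return tuple(
        round(min(255.0, max(0.0,
            row[0] * rgb[0] + row[1] * rgb[1] + row[2] * rgb[2])))
        for row in DEUTERANOPIA
    )

# Pure red and pure green collapse toward similar muddy tones,
# which is why red/green encodings are risky.
print(simulate_deuteranopia((255, 0, 0)))  # pure red
print(simulate_deuteranopia((0, 255, 0)))  # pure green
```

Running your report's palette through a function like this (or uploading a screenshot to COBLIS) quickly shows whether two series that look distinct to you are distinct to everyone.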

Carlos: No, that wasn’t me, that was the last guy, right?

Jonathan: You know, that’s stretching it, that’s an extreme outlier, but the possibility exists. So, just to prevent it, let’s include them anyway. Because you never know who you’re affecting. If you can make somebody’s day by including them, without actively going out and searching for them, that makes somebody feel good. At the end of the day you can help somebody out. That’s what this is all about anyway, right? So 5 more minutes of your time to help include somebody who has CVD, that’s cool.

Steve: Who knows, that person you’re helping out by making it look good for the colorblind may be the decision maker on whether you come back for that next project as a consultant.

Jonathan: Definitely, yeah.

Steve: That could make a big difference to your future as well as mine.

Jonathan: Right.

Carlos: So what are some of the tools doing to help us with some of that, right? Again, I’m a lazy, knuckle-dragging Neanderthal. I want it to kind of work that way out of the box. What are the tools doing to help me there?

Jonathan: Well, I’m interested to see how PowerBI is going to directly address that. One of the things you can do right now, in Excel and PowerPoint, and PowerBI as well, is change gradients and stuff like that. But right now you’ve got to be actively thinking about it. I could see in the future having a switch: if you have CVD, you could hit the switch and it would change for you. Actually, that would probably make a lot of sense, because that’s your easy button, right? I don’t know if they’re going to do that, but I could see them doing it because it’s simple. They could say, for this RGB color we show this other color, because they have the budget to delve into that and figure out what works for whom and what doesn’t. I could see them doing something like that in the future.

Carlos: Right. Yeah, just translating the colors, like you mentioned the reds and greens, so that they become gradients or shades of blue or whatever.

Jonathan: Right, because the nature of the PowerBIs and the Tableaus, and now with HTML5, enables you to do stuff on the fly, because it would then just be a quick button that lets you change the styling or the shading, change the colors on the fly. Obviously, a Reporting Services report is static; it wouldn’t be able to do that. But the new HTML5 reports would be able to. You know what I mean? I could see that coming in the future. We don’t know yet, but they’re actively working on it, so we know that they know what the problem is, and we will see something soon.

Carlos: Right. Now, will you talk a little bit about, again, kind of data everywhere, right? You know, SQL Server 2016, the integration with PowerBI and Mobile Reports. We may already feel that there are enough reports out there, but it seems like that’s only going to expand, right? You mentioned the dangers of datafication. Do you feel like the role of reporting, and the need to go through this process, is just expanding? Are we only now starting to get into some of this?

Jonathan: Oh yeah, this is the forefront of it. We’re datafying everything. We datafy everything from our footsteps, to our calories, to our heart rate, to traffic on the road. We datafy everything because we like to be able to see and analyze things. We want a quick answer. We want to be able to say, “Oh, I get this.” And it’s good, though. It lets us make changes and decisions, we can plan. They’re using it to track diseases; they can track the spread of the Zika virus using data. That’s amazing stuff you can do, but there are dangers in it, too. Extreme dangers. We need to be careful, especially with us being the people who are actually showing the data, because it falls back on us. I’ll give you two quick examples. One is a mechanical issue and the other is a human issue. At the end of the day, as data professionals, we love the term garbage in, garbage out. We live by that. I’m actually writing a blog post about this now, so the story will be in there as well. I had a retail client that wanted to track store attendance with a mechanical device. Every time someone came in, it would count, so they would know how many people came in and out of the store. And they would base things on that number: store bonuses and stuff like that. They would even base their finances on that attendance, because they could do store attendance divided by sales and that kind of thing. So as I was working on one of the analytical projects there, I was interviewing the business people, and I began to hear them call it the random number generator. And I was like, what? It was known that the numbers were wrong, and it was a joke. I’m like, wait a minute. You’re basing your business on something you know is wrong.
They were like, “Yeah, we know it’s wrong.” I’m like, that’s not a problem? Everything they were doing for the future was based on these incorrect assumptions, so the analytics were right, the visualization was right, but the source was wrong. You have garbage on the end because your source was wrong. That’s one of the dangers: having an incorrect source, having low quality input. The other one is the human factor, and this is a little touchy because of how it transpired, but we’ll use the election. Up until a week before the election, most major polls showed Hillary Clinton winning by 3-6 points. And she didn’t even have a concession speech, because the numbers said she could not lose. Obviously we know how that turned out. So the first thing people went to was, “The polls are wrong, the polls are garbage.” That reflects back on people like us writing the report; that would be the analog of “we wrote the report wrong.” No, we didn’t write the report wrong; the data came in wrong. What ended up happening was something called social desirability bias: people say what they think you want to hear, not what they’re really going to do. So people were saying that they were going to vote for her when obviously they didn’t. But like I said, we won’t go too far into that; that’s psychology as well. I told you I’m a psychology junkie. But see, even with that, the people collecting the data did full random sampling. You have places like the Gallup poll and the USA Today poll doing all the stuff that’s correct: the proper distribution of the population to get accurate results, the proper margin of error, everything, right? And they showed, “Ok, 3-6%, she should win.
Ok, great, let’s work on that.” And then that doesn’t end up happening, because the source was wrong. So one thing I want to caution is that, as data professionals, we need to make sure that we’re comfortable with our sources. Is our source correct? Was there a bias in collecting it? Were there errors in the machine that collected it? Was there a possibility of the data coming in corrupted, all kinds of stuff like that? Because at the end of the day…

Carlos: Are we looking for the number that will help us prove our point, rather than looking at the numbers a little more objectively to help us answer the question?

Jonathan: Yeah, I think so, and that’s a great way to put it. In essence, that’s another bias. Are we using the numbers to prove our bias, or are we using the numbers to disprove our bias and show the proper thing? Like you said, they were using the numbers to prove their bias: they wanted her to win, the numbers proved that, so that works, let’s go with it. What nobody ever asked was, are people truthfully telling the story that’s actually happening? Are they lying to us? Are they truthfully going to vote for her?

Carlos: Right. Man, well, some interesting stuff. This is not going away, right? The reporting options that are available to us are not decreasing.

Jonathan: Not at all.

Carlos: And so I think it will be kind of interesting to see how some of this plays out. I’m interested to start playing with even some of the colorblindness components. Because when I heard those numbers I thought, “Wow, that is much higher than I would have guessed.”

Jonathan: One of the things too, with the influx of data and how it’s going to grow, I could see these principles being used by everybody from the Administrative Assistant to the Executive Director and CEO and stuff like that, because at some point it’s going to be natural for everybody to be creating reports. We need to make sure we are comfortable with the source, because it’s not just going to be us creating them in the future. As more and more self-service tools come out, you know, the iPhone 15 is going to have a way to show reports of your tweets and stuff in real time. You know, Facebook is going to be like, “Hey, show my data.” I mean, I can see that. That’s coming, because it is just the next natural evolution. As more and more people understand how valuable data is, more and more people are going to show their data. More and more people are going to show what they’re doing with their data. I mean, think about it. Even your FitBit lets you get your own reports so that you can share with your friends. You know, “Look, I had 10,000 steps today. I’m averaging 9,200 steps a week.” You know, stuff like that. So even with those small things, even now we’re beginning to see people wanting to show their own data. So it’s just going to get more and more in the future as we get further along.

Carlos: So that’s an interesting point. And I think ultimately as data professionals we can continue to provide value there if we’re willing to educate people on how to do that and share best practices, rather than saying “I am the creator of the reports.” It becomes “here’s some guidance on how to do that.”

Jonathan: And I think that’s the big thing at the end of the day: being willing to cede control for the greater good. Because at the end of the day you’re more valuable when you provide true benefit to the enterprise than when it’s just, you know, “I created this.” If you can show more people how to do it, that’s better, because you’re a bottleneck no matter how good you are. You and me, we are bottlenecks. But if ten of us can do it, then we’re better. That’s how I see that.

Carlos: The value there again is in understanding the data and kind of where it comes from, and then the presentation, that’s like the final piece. You can push that down a little bit and just have that be taken off your plate as the tools become easier and whatnot. Yeah, good stuff.

Steve: I think what is interesting on that too is that oftentimes the visualizations or the reports that are shown will drive behavior. And when you’re asking someone to build a report that’s going to drive behavior, oftentimes they don’t have a complete understanding of that behavior. And I think that goes back to what Jonathan said in the beginning around the story, building the story and the storyboard. But if that is pushed out to everyone who’s able to get to the data and build their own visualizations, then people really have to build their own stories in that case so that they’re showing the right data. Otherwise, you’re going to end up with data that’s driving business decisions that’s all over the spectrum. One person’s report may contradict another’s based off of their biases or approach to creating it.

Jonathan: Correct. And another thing too with reports, this is something that I learned a long time ago, is that a report does not have to answer your question. Sometimes a report raises more questions, which eventually lead you to the right answer. That’s something a lot of us do: we want a report to solve a problem. Of course, it doesn’t always have to solve a problem. If it leads to more questions, that’s progress. That’s what I was thinking about.

Steve: Yeah, that’s one of the things I love with reports, is when somebody looks at one and says, “Well, why is that number that way?” And you have to do the research and figure out, “Oh, well, it’s because sales are off in that region,” or whatever it may be. And then you’re able to fix a business problem because of those questions that came up.

Jonathan: As a consultant, when I’m talking to clients that’s what I tell them too. Don’t be afraid to have your results pose more questions, because that’s good. You’re getting further into your data. You’re learning more about yourself. Everything doesn’t have to have a quick answer. You know, that’s our society, we want quick answers. But everything doesn’t have a quick answer.

Carlos: Well, awesome Jonathan. Thank you for the discussion.

Jonathan: Yes, thank you. I enjoyed it. I love talking about stuff like this. You could geek out with me for hours on psychology and stuff.

Steve: This has been a lot of fun.

Carlos: Shall we go ahead and do SQL Family?

Jonathan: Sure.

Steve: So Jonathan how did you first get started with SQL Server?

Jonathan: I have a funny story with that. I was working on a help desk and I was trying to figure out what I wanted. I knew I wanted a career in IT, because I wrote my first program when I was 6 years old. I was one of those weird kids. I had that Radio Shack TRS-80. I’m taking it back, right, showing my age. So at 6 years old I wrote my first program, so I had been doing IT stuff all through school and stuff like that, then got a job at a help desk and I knew that wasn’t really what I wanted to do. So I was doing research and research and ended up finding, you know, I was like, “I think this database thing is kind of cool.” So I went and sought out the DBA at the company, and I worked 2nd shift. He worked the regular 8 to 5. So I would come in every day at 8 o’clock and sit with him until I had to work at 2, to learn SQL Server from him. And he was willing to, actually, his name was Carlos too, which is funny.

Carlos: Great name.

Jonathan: Yeah. And he was willing to sit with me and teach me SQL Server, and that was SQL Server 6, 6.5, that’s taking it back too. They saw my drive and determination with that and they created a Junior DBA role for me. And that was my first job as a true SQL Server professional, I was a Junior DBA at that company, and that’s how I got started. That was my first exposure to SQL Server, and ever since then I have just loved SQL Server. I’ve been doing it for, well, if I tell you how many years you’ll know how old I am, so 17 years later this is where we are today.

Steve: That’s impressive, going out of your way to work the extra hours just so you can learn something new. If more people did that I think we’d be in a whole different world.

Carlos: That’s right. Not the typical path, right, of it kind of falling into your lap, like with me. It’s more of a “hey, that’s pretty cool, let me go check that out.” Now, in all that time with SQL Server, if there is one thing that you could change about SQL Server, what would it be?

Jonathan: The one thing I would change is something they’re actually changing: the staticness between versions and fixing things. We went through that whole dead zone, the desert between 2008 R2 and 2012, and that was like, “Oh my god.” You know, trying to figure things out, and there were some great things professionals found, ways around, solutions around problems, but that was one of the biggest gaps between versions. And then 2012 to 2014, ok, that’s cool, but to see how fast they change stuff now, like vNext is in CTP1 now. Is that what it is? They are at a rate where they could possibly even release vNext this year. That’s crazy, and what’s enabling them to make changes faster on the fly is the community. And you know, users will say, “Hey, do this. Can it do that?” We’re getting updates for Power BI seemingly every week. It seems like every time I open up Management Studio there is another update. I mean, they’re doing the things that I would like to change, and changing faster. So that’s really cool and exciting to see. I can’t wait to see the future. I mean, imagine SQL Server, with the changes they’re making, in just 2-3 years it’s going to be amazing what’s coming.

Steve: Alright, so what is the best piece of career advice that you’ve received?

Jonathan: So for me, the best piece of career advice relates to the fact that I had struggled personally with just being comfortable with who I am. And I remember one of my managers, his name was Keith. And me and Keith are still really good friends. I talk to him all the time, I still do. You know, Keith told me, “Just embrace who you are. Be who you are. Don’t try to be somebody else.” And what’s funny is that’s actually the origin of my logo. You know, you guys see my logo, it’s my hair. That’s me literally embracing myself. And so that was the best piece of advice that I got, because once I truly embraced who I am, my strengths and my weaknesses, and stopped trying to fit into a box that I thought somebody else wanted me to be in, that’s when I began to flourish. That’s when I began to be truly happy. I’m truly happy because I’m being who I am. I love to share, I love to talk, I love to teach, and I’m just embracing who I am. The best piece of advice I could give back to somebody is to tell them, “It doesn’t matter who you are. Whether you’re short or tall, good at this or good at that. Be who you are, embrace who you are.” And that will be your gift. That will be what makes people gravitate towards you.

Steve: Sounds like great advice.

Carlos: You know, that’s right, that whole idea about networking, kind of finding your own people. I think all too often when we come out of college it’s just like, “Oh, I need a job, right, so I’ll go anywhere.” Instead, we can identify what it is we want to do or the place we want to go after. I mean, even this idea of colorblindness and making reports better to help that segment, that kind of speaks to the work that you want to do, and then it’s just a matter of finding where that’s important to other people as well.

Jonathan: Right, definitely. I think, actually, people could even take that project and run with it. Someone selling a service to a business, saying, “Hey, you know, we can convert your reports.” Someone could sell that as a service.

Carlos: Our last question for you today Jonathan. If you could have one superhero power, what would it be and why do you want it?

Jonathan: So my favorite superhero doesn’t even have any superpowers. Can you guess who it is?

Carlos: Batman?

Jonathan: It’s Batman, yeah. And what I always loved about him is the detective comics, because he uses his mind, and that was his superpower. He just outsmarted people. Obviously he was physically fit, right. I mean, the ability to get to the depth of stuff. I love that investigative side, and that’s kind of what I want to do anyway. Get to the problem, get to the root problem. Figure out what’s really going on. If anything I would want to be just Batman, that’s my pick. I don’t need super speed and super strength. Nah, I just like the super mind, and call it a day.

Steve: You know the other cool thing with Batman is he had pretty much unlimited budget, and he had all the toys, and tools, and gadgets that he wanted to do whatever he needed to.   

Jonathan: That is true.

Carlos: That will be a lot of fun.

Jonathan: Yes, I guess you could say that’s also super power too, right, the unlimited money.

Steve: Yup, right.

Carlos: Having a keen business sense, or at least having a dad maybe who had one.

Jonathan: Right, right. That would definitely help.

Carlos: Well, Jonathan, thanks so much for being with us today. We do appreciate it.

Jonathan: Yeah, once again I thank both of you guys. This has been an honor to be able to do this. I’d like to thank the community too for having you guys available to do this, and I look forward to what’s coming in the future. I look forward to seeing you guys again at SQL Saturday and being able to be on a panel and stuff. And I thank you for the opportunity and the time. You guys have a great day!

Carlos: Yup.

Steve: Thanks Jonathan, it’s been great.

Episode 82: Indexes Part 2

In the previous episode we discussed the basics of using indexes, what they are, how to use them, along with some common problems and solutions. One of the problems you will certainly face as time passes is index fragmentation. While indexes will certainly help your database perform well, they do require care and feeding–meaning there is some administrative overhead.

In this episode, we continue our conversation with Randolph West and discuss what we can do to take care of indexes and keep them well maintained over time. There is an option to use default maintenance plans and to do fragmentation and updating statistics, but should you use them? What are the alternatives?  We didn’t forget to consider strategies for deciding which indexes to keep and which can be dropped.

Episode Quote

“That’s purely Microsoft’s fault because generally, people are using the maintenance wizard are using it because they don’t know how to do it with [other] tools.“

“You still need to update the statistics after you’ve done a reorganization which is something that people tend to forget”

Listen to Learn

  • General maintenance needed to take care of indexes and keep them well structured on an ongoing basis
  • Good strategies for deciding which indexes get included and which ones we need to ignore
  • Should you use the index maintenance wizard?
  • What column store indexes are

About Randolph West

Randolph West solves technology problems with a focus on SQL Server and C#. He is a Microsoft Data Platform MVP who has worked with SQL Server since the late 1990s. When not consulting, he can be seen acting on the stage and screen or doing voices for independent video games.

Transcription: Episode 82

Steve: So then let’s jump a little bit more into the care and feeding side of things. I know we touched a little bit on rebuilds and a little bit on statistics, but if we look at it, what are the general things that we need to do to take care of indexes and keep them well structured on an ongoing basis?

Randolph: I’d be interested to hear your side, because I’ve been speaking a lot. I’ll chime in if I have any feedback.

Steve: Ok, I mean, so really I think on that there’s the concept of reorganizing versus rebuilding as you get fragmentation over time, and then there’s statistics. Let me start with a very bad way of doing it that I see quite often, which is if you go and use one of the default maintenance plans. In the maintenance plan you just go and click “let’s rebuild my indexes, let’s reorganize my indexes, then let’s rebuild statistics.” One of the things you end up doing is a rebuild, which basically rebuilds the entire index from scratch. If you’re on Enterprise Edition you can do that in online mode, but if you’re not on Enterprise Edition that’s an offline operation. What people sometimes check in the default maintenance plan is, “Let’s rebuild them.” And that gives you the best structured index you’re probably going to end up with, and it also rebuilds your statistics. And then they reorganize them, and then they rebuild statistics with a 5% sampling or something like that. In that example you end up with a whole lot of work that’s actually giving you a negative impact. Whereas if you had just rebuilt them to begin with, and not done the reorg, and not done the statistics rebuild afterwards, you would be in a lot better shape.

Randolph: So, you know, those defaults that you’re talking about, the reason that happens is because the maintenance plan wizard has rebuild, then reorg, then statistics, in that order in the list.

Steve: Absolutely, yeah, people just click the checkboxes or select them and that’s what you get.

Randolph: So when I see that I blame Microsoft. That’s purely Microsoft’s fault, because generally people using the maintenance wizard are using it because they don’t know how to do it with, say, the Ola Hallengren or MinionWare tools, just to name two, there are many. So that’s Microsoft’s fault, and every time I see that I do not blame the DBA, or the accidental DBA, or whoever is managing it. It’s not their fault. But it is causing a whole bunch of work that is unnecessary.

Steve: Yes, it may not be their fault, but it’s their pain they have to deal with. When I see that I generally do a little bit of education to explain why that’s bad and turn those off. And like you said, replace it with the MinionWare or Ola Hallengren scripts, both of which are far better solutions than the default maintenance plans.

Randolph: So the one thing that I wanted to mention, Steve, is that an index rebuild is, as I mentioned before, a size-of-data operation. So if your table is 10GB and you need to rebuild it, if you’re doing a clustered index rebuild it’s going to require 10GB of space in the data file and then it’s going to require at least 10GB in the transaction log. And if you’re rebuilding a clustered index, it is by default going to rebuild all of the non-clustered indexes on that table that refer to that clustered index, and that means all of them. So if you’ve got a 10GB table and you’ve got 30GB of non-clustered indexes on that table, it’s going to require at least 40GB in the data file and 40GB in the transaction log, because it’s going to rebuild everything. So if you’ve got a maintenance plan that says let’s do a clustered index rebuild, and before that you had the non-clustered indexes rebuilding, it’s going to do them again anyway, so it’s a nightmare. This is why I say that more regular statistics updates are going to be a lot more effective than index rebuilds over the long term. You still do need index maintenance, but let’s not do it every night.

Steve: And that’s where I like the third party solutions that we mentioned, for the ability to rebuild those where you have options to say, given a certain level of fragmentation, I want to choose to reorganize this, or if it’s even more fragmented then we’re going to choose to rebuild it. And some of them even have the option of only doing this for a certain amount of time.

Randolph: Can you go into detail on what the difference between a rebuild and a reorg is?

Steve: Ok, so the rebuild is basically, technically, the equivalent of dropping the index and recreating it. It’s going to be completely rebuilt from scratch, and it will take up the log and disk I/O like you talked about a moment ago. With the reorg, what that’s going to do is go into the index and basically restructure or compact the index pages as needed. It will go through and shuffle them around to compact them down into a smaller amount of space. And it will do that page by page as it goes through, and that can be done while things are online, even in Standard Edition. But it’s slow, and the end outcome is not as good as a complete rebuild. It might get you really close, but it’s not going to be as good as a complete rebuild; the benefit is you can keep it online.
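The rebuild-versus-reorganize choice Steve describes maps to two forms of the same T-SQL statement. A minimal sketch (the table and index names are made up for illustration):

```sql
-- Rebuild: recreates the index from scratch (a size-of-data operation).
-- ONLINE = ON requires Enterprise Edition on the versions discussed here;
-- as a side effect, the index's statistics are updated with a full scan.
ALTER INDEX IX_Orders_CustomerID ON dbo.Orders
REBUILD WITH (ONLINE = ON);

-- Reorganize: compacts and reorders leaf-level pages in place.
-- Always an online operation, but it does NOT update statistics.
ALTER INDEX IX_Orders_CustomerID ON dbo.Orders
REORGANIZE;
```

A common rule of thumb, and the default behavior in the third-party maintenance scripts mentioned in this episode, is to reorganize at moderate fragmentation (roughly 5-30%) and rebuild above that.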

Randolph: And you still need to update the statistics after you’ve done a reorganization, which is something that people tend to forget as well. The statistics need to be up to date so that the query optimizer knows where the data is, how much of it is in there, and what the distribution of the data is in that index, so that it can make the right choice about what type of plan to create, because a lot of query performance comes down to sorts and joins, and that’s where you find your biggest performance bottlenecks. A sort is a blocking operation, so everything in the query has to stop until the sort is finished. So if your index is up to date and it’s sorted correctly, then it doesn’t have to do that operation, it’s already done. There’s a whole bunch of things about loops we could talk about here as well, which I’m not going to. An up-to-date statistic is much more valuable to the query optimizer than a defragmented index.
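Following a reorganization with a statistics update, as Randolph recommends, can be done per index, per table, or database-wide. A sketch (object names are hypothetical):

```sql
-- Update the statistics behind one index, reading every row
UPDATE STATISTICS dbo.Orders IX_Orders_CustomerID WITH FULLSCAN;

-- Or update all statistics on a table with the default sampling rate
UPDATE STATISTICS dbo.Orders;

-- sp_updatestats walks every table in the database, but only touches
-- statistics that have had row modifications since their last update
EXEC sys.sp_updatestats;
```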

Carlos: There’s another point there, even beyond that. We’ve talked about the care and feeding of the indexes, right, rebuilds vs. reorgs. However, we can also have statistics that are outside of the indexes, right, on the table. Those are also providing feedback to SQL Server, to the optimizer. And if we neglect those we’re still in kind of the same peril, because the optimizer may be using those statistics but we’re not giving them that care and feeding.

Randolph: Correct. So it’s valuable to have a, I hate this word, holistic approach to your care and feeding: you need to look at not just indexes, and not just fragmentation, and not just statistics, but all of them, and in a way that is effective and efficient for your maintenance window. You don’t want to go and rebuild indexes every night, because it doesn’t make any sense. But you also don’t want to stop doing statistics updates.

Steve: Yup. One thing I commonly see with that is updating the statistics where somebody will have a job that goes through and blindly updates every single statistic in the entire database every night. And that can have a pretty big performance impact, depending on the size of your database. A better approach is to look at which tables or indexes actually have changes, and only update those that have a fair amount of changes on that frequent basis. And then maybe on a regular basis, once a month or once a week, you do an update across all of them.

Randolph: Yeah, that’s a very good point. The amount of data that changes in a table matters if you’ve got automatic statistics updates enabled. We’re going to go on a little bit of a segue now: by default, automatic statistics updates are enabled. There are other settings that you can choose in 2016; there’s a whole bunch of new per-database options. But let’s talk about everything before 2016, because that’s what most people are still using. Statistics will only update themselves if a certain percentage of the table has changed. The threshold has changed across versions recently, but the fact remains that statistics will only update themselves once a certain percentage has changed. So if you’ve got a really large table with tens of millions of rows and only one million have changed recently, statistics will not be updated automatically. So it does pay to use one of the third party tools we’ve mentioned, like MinionWare or Ola Hallengren’s maintenance solution, where it will go and check how many rows have changed, and then you can update the statistics there, and that will help your query optimizer as well.
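The row-modification check Randolph describes can be done directly against the DMVs on SQL Server 2008 R2 SP2 / 2012 SP1 and later. A sketch (the 1000-row threshold is an arbitrary example):

```sql
-- Find statistics with a significant number of row modifications
-- since they were last updated
SELECT OBJECT_NAME(s.object_id) AS table_name,
       s.name                   AS stats_name,
       sp.last_updated,
       sp.rows,
       sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE sp.modification_counter > 1000
ORDER BY sp.modification_counter DESC;
```

This also explains the speaker’s example: before 2016 (and without trace flag 2371), a large table’s auto-update threshold is roughly 20% of rows plus 500, so one million changed rows out of ten million (10%) never triggers the automatic update.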

Steve: The other side effect you can get with that is that if you do hit the threshold where it decides it’s time to update those statistics, that might happen in the middle of your peak load during the day. And right when you’ve got a lot of traffic coming to your database, an automatic statistics update occurs, and that can have an impact on things too.

Randolph: That in fact is one of the recommendations for SharePoint-style databases. There are a lot of things I don’t like about SharePoint, but the fact is a lot of people use it, so one of their recommendations is to turn off automatic statistics updates on any SharePoint-style database. That includes CRM, it includes Great Plains, all of those even. What’s the one that I used? Whatever it is, there’s a whole bunch of them where you should turn off automatic statistics updates and then include statistics rebuilds in your maintenance plans. So just keep in mind that each database is different.

Steve: It will probably apply to Navision and Dynamics as well.                     

Randolph: Yes that’s the one. Yeah, Dynamics is the one I’m trying to remember. Thank you! So any SharePoint style database they do recommend turning off statistics updates automatically and to do them in a separate maintenance window. So it pays to do your research to make sure that you’re doing the right kind of maintenance plans for your database and your instance.

Steve: Yup, very good point.

Carlos: So we’ve talked about all of these components, and I guess now let’s talk about some of the benefits. All of these things require us to go in, right, take kind of the holistic approach, having to know our data, getting more familiar with it. Ultimately, to what end are we doing that? We’ve talked a little bit about performance, so I guess we should probably talk about how that performance gets into the system, if you will, or the mechanisms that cause the improvements.

Randolph: Ok, so when I do my performance session, what I say is, “You’ve got expensive items like your CPU. You’ve got less expensive items like your RAM, and you’ve got even less expensive items like your hard drives.” So your CPU is the most valuable thing to you because the SQL license is per CPU core and you want to make sure you’re using the most efficiencies of your CPU and memory as you can. What an index does is, we spoke about this before, it is a copy of the data so you want to keep your indexes as efficient as possible so that if you’ve got a large table you don’t want your index to be large as well. You want it to be smaller so that less of it is in memory because that’s what this game is about. SQL Server is all about being in memory as much data as possible in memory. So for Standard Edition up to 2016 even you’ve only got a certain amount of memory that you can access. 2016 Service Pack 1, the limit is still there but that’s your buffer pool that is in memory as opposed to everything. But the fact is that there are limits to the amount of memory you can use for SQL Server specifically with Standard Edition because not everybody can afford Enterprise. So you have to manage the amount of data that’s going into memory as much as you can and that is most effectively done by the right kind of indexes for your queries. And that’s also why you don’t want to have duplicate indexes because it will also be in memory. Also, you don’t want to have wide indexes because they will take up more memory than they need to. And that’s why included columns are very handy way to reduce the size of your indexes that’s why we have filtered indexes. All of these things to try and reduce the amount of data that is in memory so that we can do a lot more with what’s in the buffer pool.
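Included columns and filtered indexes, the two techniques Randolph names, can be combined in a single definition. A sketch (the schema is hypothetical):

```sql
-- Narrow key, extra columns carried only at the leaf level, and a
-- filter so only the "open" subset of rows is stored in the index
CREATE NONCLUSTERED INDEX IX_Orders_Open_CustomerID
ON dbo.Orders (CustomerID)
INCLUDE (OrderDate, TotalDue)
WHERE Status = 'Open';
```

A query like `SELECT OrderDate, TotalDue FROM dbo.Orders WHERE CustomerID = @id AND Status = 'Open'` can then be answered entirely from this index, and the index only consumes buffer pool space for the filtered rows.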

Carlos: So then how do we go about it, or what are the best ways to determine that? We talked about looking at the execution plan, which kind of gives the recommended index there. So what are some good strategies for deciding which indexes get created and which ones we need to ignore?

Steve: One approach I take on that is, if I’m looking at an execution plan and it suggests an index, and it’s a development environment, one of the things I’ll try initially is to just create the index and see how it helps, and then I’ll drop it and go look and see if there’s a similar index. I mean, if the index that it suggested was beneficial, I’ll go and see if there’s a similar existing index that could be modified or added to that would give an effect similar to what the suggested index was doing. And sometimes that means taking an existing index and just adding an include to it, or taking an existing index and adding another column to it.

Carlos: Ok, so that’s kind of a dev environment, one query at a time.

Steve: Yup, and the other approach I use there is to go and look at some of the missing index DMVs to figure things out. Actually, you look at DMVs for a handful of things. One is to figure out what the missing indexes are and the overall cost associated with those missing indexes, and then come up with some good ones to add that are not going to be duplicates of other indexes, and then see how that improves performance. You can also, using DMVs, figure out what your duplicate indexes are. If you have duplicate indexes, oftentimes you’ll see that one of them might not be used at all while the other one is being used a lot. You can go in and drop one of those. However, you need to be careful when you’re looking at unique indexes. You don’t want to drop a unique index by mistake; whether it’s a clustered or non-clustered index, you want to look at it and drop the right one. Another thing to track down is big clustered indexes, clustered indexes that are really wide, because that extra width ripples through all the non-clustered indexes. You can also track down unused indexes: what are the indexes on your system that are not being used? Now, when you look at that you need to be very careful, because that only covers the period since the SQL Server instance restarted. Also, if you rebuild an index, I believe it clears those index usage stats.
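A sketch of the kind of DMV queries Steve describes, one for missing index suggestions and one for candidate unused indexes (the filters are illustrative, and unique indexes are deliberately excluded per the caution above):

```sql
-- Missing index suggestions recorded by the optimizer
SELECT d.statement AS table_name,
       d.equality_columns,
       d.inequality_columns,
       d.included_columns,
       gs.user_seeks,
       gs.user_scans
FROM sys.dm_db_missing_index_details AS d
JOIN sys.dm_db_missing_index_groups AS g
  ON g.index_handle = d.index_handle
JOIN sys.dm_db_missing_index_group_stats AS gs
  ON gs.group_handle = g.index_group_handle
ORDER BY gs.user_seeks DESC;

-- Non-unique indexes with no reads since the last instance restart
SELECT OBJECT_NAME(i.object_id) AS table_name,
       i.name                   AS index_name,
       us.user_updates          AS writes_maintaining_it
FROM sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS us
       ON us.object_id   = i.object_id
      AND us.index_id    = i.index_id
      AND us.database_id = DB_ID()
WHERE i.type_desc = 'NONCLUSTERED'
  AND i.is_unique = 0
  AND ISNULL(us.user_seeks + us.user_scans + us.user_lookups, 0) = 0;
```

As the transcript notes, both result sets reset on instance restart, so they are only meaningful on a server that has been running long enough to see its full workload.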

Randolph: Sometimes.

Steve: Sometimes, exactly. So it’s something that if you see there’s an index that is not being used you might want to track it over a few weeks to a month and confirm it, yeah it really isn’t being used and then go and take a look dropping those.

Randolph: That goes back to one of the best practices, which is to have a baseline. Know what your system is doing and track that over time, and then you can refer back to that baseline and say, “Well, this is exactly what’s happening. This is different.” And go from there. Yeah, that’s a good point.

Steve: Yup, and that reminds me of a project, Carlos, that you and I worked on, where we set up a monitoring component that ran for a month and kept track of unused index details, logged them every few hours, and then we went back at the end of the month, reviewed the ones that had no utilization over the entire month, and suggested those as possible candidates to be dropped.

Randolph: That reminded me of Pinal’s comment as well, that he’ll only look at a database that’s been running for a week. The one thing I wanted to mention about the DMVs you’re referring to is that there’s one in particular, there’s a group of diagnostic queries that Glenn Berry, Glenn Alan Berry, from SQLskills, uses and maintains. He’s got a couple, and so does Brent Ozar and a whole bunch of other people. Brent Ozar’s is called sp_BlitzIndex, and Glenn Berry’s is included in his diagnostic scripts. It’s a very similar query, and what it does is weight your missing indexes according to the number of scans, the number of seeks, and something called average user impact. Now, that average user impact number can be, I think it’s unitless, I’m not quite sure how that number is calculated. But if it shows a really high impact, like tens or hundreds of thousands, then usually I will go and look at that first and say, “Well, how many times has this index been used, in terms of the number of seeks and scans?” And if it’s a high-usage or high-impact index that is missing, then I will usually create it without too many modifications, if it’s not too wide or doesn’t have too many include columns.

Carlos: I think it’s the number of times requested multiplied by the number of hits.
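For reference, the weighting these diagnostic scripts commonly use multiplies the recorded query cost, the expected percentage improvement, and the number of times the index would have been used, which matches Carlos’s description. A sketch:

```sql
-- A commonly used "improvement measure" for missing index suggestions:
-- cost of the queries that wanted the index, times the fraction of that
-- cost the index should remove, times how often it was wanted.
SELECT d.statement AS table_name,
       gs.avg_total_user_cost
         * (gs.avg_user_impact / 100.0)
         * (gs.user_seeks + gs.user_scans) AS improvement_measure
FROM sys.dm_db_missing_index_details AS d
JOIN sys.dm_db_missing_index_groups AS g
  ON g.index_handle = d.index_handle
JOIN sys.dm_db_missing_index_group_stats AS gs
  ON gs.group_handle = g.index_group_handle
ORDER BY improvement_measure DESC;
```

This is why the number is effectively unitless: it is a cost-weighted usage count intended for ranking suggestions, not an absolute measurement.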

Steve: Yup, and just to jump in there, I’d feel a little bit left out if I didn’t have the chance to mention Database Health Monitor on that. In Database Health Monitor there are many indexing reports similar to what Randolph has described.

Randolph: Certainly. Steve Stedman’s Solutions has a magnificent and free tool called Database Health Monitor, which I have used. I don’t even get paid for this statement. It’s a good product. It’s free, which is even more amazing. This is the great thing about the SQL Server community: there are a lot of free tools out there that are adding value all the time. And all that people ask is that you recognize them, and I recognize Steve as a valuable member of our community. That ends the sponsorship message.

Steve: Thank you for the shameless plug.

Carlos: So it's interesting, I guess, and maybe I'm being a little bit paranoid, but when I'm looking at that impact, I'll take it into account along with the indexes I'm interested in adding. Of course I want to do what Steve mentioned, looking for duplicates and what's already there, right? Then, am I going to get over my threshold of ten or whatever, just taking into consideration the additional load I'm putting on this table by creating the index? And while you'd want to test that out, most of the environments I've worked on just don't have a good mechanism for creating a load similar to what I have in production. So when I go to implement that index, one of the first things I'll start to look at is those usage stats, right? Because I want the usage stats on the index that I just created to be going up. If they aren't, I know, "Ok, wait a second, what did I do? Is this still a really good candidate?"
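Checking whether a new index is earning its keep, as Carlos describes, can be done against sys.dm_db_index_usage_stats. A minimal sketch (note that these counters reset when the instance restarts):

```sql
-- Seeks/scans/lookups show reads; user_updates shows write overhead
-- the index adds. A high-update, low-read index may not be worth keeping.
SELECT OBJECT_NAME(s.object_id) AS table_name,
       i.name AS index_name,
       s.user_seeks,
       s.user_scans,
       s.user_lookups,
       s.user_updates
FROM sys.dm_db_index_usage_stats AS s
JOIN sys.indexes AS i
    ON s.object_id = i.object_id
   AND s.index_id  = i.index_id
WHERE s.database_id = DB_ID()
ORDER BY s.user_seeks + s.user_scans DESC;
```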

Randolph: Yeah, that’s a very good point.

Carlos: So I guess, other than how we go about creating them, a couple of things we didn't really get into are some of the newer index types, like columnstore, or even XML indexes. Other thoughts about when to start looking at these? Should we even try to approach that now? I feel like maybe we should punt on some of those.

Randolph: I can briefly mention about XML indexes that the way they work, they're going to be larger than your table or your XML column. The reason is that SQL Server will create an internal table according to the structure of the XML document or XML field that you're indexing. So it actually expands the XML data out into an internal table and then indexes that, so you could have a column that is, say, 100KB maximum, that's a small one, and you can end up with an index that is 500KB, or a gigabyte, or tens of gigabytes, because it's creating an internal table under the covers, a system table, that it is then indexing. So be wary of XML columns; if you're going to index them, make sure there are sensible indexes that only index certain nodes inside that XML document. And also I'd like to add here that if you use the XML data type, the data going into that XML row or column is not going to be the same as what you get out. It does modify the XML data going in, for whatever reason.
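A sketch of what creating XML indexes looks like, with hypothetical table and index names. The primary XML index shreds the whole document into an internal table, which is why it can dwarf the base column, while secondary XML indexes target specific query patterns:

```sql
-- A primary XML index requires a clustered primary key on the table.
CREATE TABLE dbo.OrderDocuments
(
    OrderID  int IDENTITY PRIMARY KEY,
    OrderDoc xml NOT NULL
);

-- Primary XML index: shreds the entire document into an internal table.
CREATE PRIMARY XML INDEX PXML_OrderDocuments_OrderDoc
    ON dbo.OrderDocuments (OrderDoc);

-- Secondary XML index: narrows things to a query pattern (PATH here).
CREATE XML INDEX IXML_OrderDocuments_OrderDoc_Path
    ON dbo.OrderDocuments (OrderDoc)
    USING XML INDEX PXML_OrderDocuments_OrderDoc FOR PATH;
```

SQL Server 2012 SP1 and later also offer selective XML indexes, which index only the XPath expressions you specify and help with the size problem Randolph describes.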

Carlos: Say that one more time.

Randolph: If you have XML data that you put in for auditing reasons, for example, when you pull it out it's going to have been modified somehow. Either the tags will be slightly different or the spacing will be slightly different, so do not use an XML data type to store audit records of XML documents. If you want to keep an absolute record of what your XML looked like as it came out of a web service, for whatever reason, store it in a varchar or nvarchar column instead, because then it is identical. If you put it into an XML data type, it will modify the data. It is still the same data semantically, but it will be slightly modified; the tags will be slightly different or whatever.
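A quick way to see the xml type normalizing its input, as Randolph describes; the literal here is illustrative:

```sql
-- Same text stored two ways: the xml type rewrites attribute quoting,
-- collapses whitespace inside tags, and self-closes empty elements.
DECLARE @asXml  xml           = N'<a  b=''1''><c></c></a>';
DECLARE @asText nvarchar(max) = N'<a  b=''1''><c></c></a>';

SELECT CAST(@asXml AS nvarchar(max)) AS from_xml_type,  -- e.g. <a b="1"><c/></a>
       @asText AS from_nvarchar;                        -- byte-for-byte unchanged
```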

Steve: So then as far as column store indexes I think that’s something we could probably dedicate an entire episode just to talk about.

Randolph: We definitely can.

Carlos: I think you’re right.

Randolph: A column store index is neither an index nor clustered; it's that complicated. In fact, there is a series of posts, I think it's Niko Neugebauer who has done over one hundred posts, on how column store indexes work. We could talk for days on that. They are fascinating, and they completely subvert what you think you know about indexes, and data, and row-level storage and all that kind of stuff. It's fascinating stuff.

Steve: Yeah, and just a few notes on that: I think it's one of those things that is great for data warehousing or OLAP-type workloads, and may not always be the best option for your OLTP side.

Randolph: At the same time, you could have reporting-style queries in your database, and with 2016's optimizations for column store, which can now be updated and can be clustered and all sorts of stuff, you could have some stuff in your OLTP environment that uses OLAP-style indexes. Yeah, there's so much to think about there.

Carlos: We saw that in Episode 78 with Brian Carrig. They talked about using their column store index in a transactional environment.

Steve: Yup, and then one of the misunderstandings I've seen around column stores is that because column store does the compression on each column, people think, "I don't have to have any non-clustered indexes if I have a column store index." And that's not true at all. If you're looking at a bigger table and you're using column store, you may still need some non-clustered indexes on it as well.
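A sketch of Steve's point, assuming SQL Server 2016 or later and hypothetical names: a clustered columnstore index and a non-clustered rowstore index can coexist on the same table, each serving a different access pattern:

```sql
CREATE TABLE dbo.Sales
(
    SaleID     bigint        NOT NULL,
    CustomerID int           NOT NULL,
    SaleDate   date          NOT NULL,
    Amount     decimal(18,2) NOT NULL
);

-- Columnstore for large analytic scans and aggregations.
CREATE CLUSTERED COLUMNSTORE INDEX CCI_Sales ON dbo.Sales;

-- Rowstore b-tree still useful for selective "one customer" lookups.
CREATE NONCLUSTERED INDEX IX_Sales_CustomerID
    ON dbo.Sales (CustomerID, SaleDate);
```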

Randolph: Oh yeah, so do your research. Read all one hundred and whatever posts. Become an expert, then implement them.

Carlos: So we've talked a little bit about implications and how we know what to use. Space considerations we talked about in the rebuild process, right? We're going to need additional space in our data file and our log file, things like that. I think we've had a pretty good conversation there. I guess one last idea I'll throw out that we can bat around a little bit. We talked a lot about having multiple TempDB files, and one of the thoughts I've seen out there, the value of it, is that if you're looking to spread out or move your indexes away from where your data resides, one way to do that is to create another data file, potentially on separate storage, and then rebuild those indexes into it.

Randolph: Ok, there are two things I can talk about there. The first thing I wanted to mention is if you do need to move your table into a different file or filegroup, all you have to do is a clustered index rebuild and target that filegroup as the new location for the clustered index, because remember, the clustered index is your data. So that's a handy way to move your table into a different filegroup. The other thing is, by virtue of that definition, you could have your non-clustered indexes in a different file from your data, and you might want that for improved I/O performance, or if you've got your read-only stuff in a different filegroup, or all sorts of reasons. It's very handy for splitting your load at the I/O level. It's less of a problem these days, but for large tables it's a nice thing to split your non-clustered indexes from your clustered index so they're not reading from the same portion of the disk, or the I/O subsystem, or both.
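The filegroup move Randolph describes looks roughly like this, assuming a hypothetical table whose clustered index is named CIX_BigTable and a filegroup called SECONDARY that already exists:

```sql
-- Rebuilding the clustered index onto another filegroup moves the table
-- itself, since the clustered index *is* the data. DROP_EXISTING avoids
-- the drop-then-recreate dance (and the heap rebuild in between).
CREATE CLUSTERED INDEX CIX_BigTable
    ON dbo.BigTable (BigTableID)
    WITH (DROP_EXISTING = ON)
    ON [SECONDARY];
```

As Steve notes just below, off-row LOB data may stay in the original filegroup after this operation.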

Carlos: Right.

Steve: Just a note on that index rebuild for a clustered index to move it to a different filegroup: although that will move the table and the entire index, one of the things that can be left behind when you do that is any of the varchar(max) or nvarchar(max) or other large items that are stored outside of the data pages for that table.

Randolph: Yeah, off-row data will be affected. Steve, I don't know, because I've never tried: what happens if you tell it to do the index rebuild with LOB compaction enabled?

Steve: You know, that’s a good question. I have not tried that for a long time.

Randolph: I smell a blog post.

Steve: Yup.

Carlos: Ok, very good. Well, awesome. I think that was a great conversation, and obviously there's a lot more we could say about indexes, but I think this was a noble attempt at covering some of the basics and getting into some of the nitty gritty as well.

Randolph: Yeah, the fact is that indexes don't stand by themselves. They are a very important part of everything, including statistics and everything else, so don't think that once you become an index expert you've become a performance tuning expert, because that's not true. You have to have broad knowledge of how things work across a number of different areas to get the best performance out of your system. And there's nothing wrong with good enough. You don't have to have 100% defragmented indexes. You don't have to have indexes rebuilt; you can have them reorganized. You don't have to have them reorganized at all if your statistics are up to date, in certain contexts. There are a lot of tradeoffs to think about when doing your maintenance plans, and indexes form just a small part of that.

Steve: Yup.  

Carlos: Great.

Steve: Very good point.

Carlos: So shall we do SQL Family?

Steve: So how did you first get started with SQL Server, Randolph?

Randolph: Steve, that's a good question, because when I originally got it I had to think very hard about the answer. I have been fascinated on a personal level with organizing data. Back in my youth I would create lists and lists of things, CDs, books, all that kind of stuff, and I would do them in the paper equivalent of Microsoft Excel. Then when Lotus 1-2-3 came along I started playing with that, because my dad worked at the bank and he was a 1-2-3 expert. If I'd known about VisiCalc I probably would have used it. Then I got into Microsoft Access because it was there. It was a very old version of Access, I think it was version 2 that I started with, and then I started working at a PeopleSoft implementation partner in South Africa, where I'm from. That's where I first worked with SQL Server and Oracle at the same time, and I was not a fan of SQL Server, I have to be honest. At that time it was version 6.5 and version 7, that's around the time I joined, so there were issues, because as some of your listeners may know, SQL Server's engine was rebuilt around the 6.5-to-7 timeframe. In fact, if you look on MSDN you can download 6.5 and you can download 2000, but you cannot download version 7. It is very difficult to come by. There are maybe good reasons for that. So I got exposed to SQL Server, that would have been in 1997, around there. That's where I first got exposed, but then I didn't really play with it until I got to the bank. I was more of an Oracle guy. I got to the bank in 2006, so there was a large gap of not playing with SQL Server, and then I couldn't figure out why DESCRIBE, the keyword DESCRIBE, wasn't working in SQL Server. All of you Oracle folks will notice that. I prefer SQL Server now because, as I said before, a lot has changed. It is a much better product than it was in 1997. In fact, I think it has surpassed Oracle. I think it's better than anything else as well, and that's because I'm trying to keep abreast of all the new stuff.
I don't want to be stuck in the past and have assumptions about the product. I want to play around with the new stuff. So that was a long way of saying 20 years or so.

Steve: Ok. Well I know that in the late 90’s Oracle was a much better product than SQL Server.

Randolph: It really was.

Steve: Yup, and that has changed. That has changed significantly in the last 17 years.

Randolph: Yeah, it has.

Carlos: If you could change one thing about SQL Server what would it be?

Randolph: I had a pat answer for this, but I'm going to change my mind. The pat answer is in-memory processing in SQL Server, in-memory OLTP, which was code-named Hekaton. There is a transaction log operation that happens even for stuff that you don't want kept. It's still being backed by disk somewhere; it may not be in the transaction log itself, but it is still being backed by disk somewhere. I would like to have in-memory objects that do not have anything on disk backing them at all. They must just be there, and if the server goes down, oh well, tough luck.
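The closest thing that exists today to what Randolph is asking for is a non-durable memory-optimized table. A sketch with hypothetical names, assuming the database already has a memory-optimized filegroup (which is the disk footprint Randolph would like to eliminate entirely):

```sql
-- DURABILITY = SCHEMA_ONLY: the table definition survives a restart,
-- but the data does not ("if the server goes down, oh well, tough luck").
CREATE TABLE dbo.SessionCache
(
    SessionID int NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 100000),
    Payload   nvarchar(4000) NULL
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_ONLY);
```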

Carlos: Alright. Yes, I remember seeing you in the Bob Ward presentation at SQL PASS talking about the in-memory OLTP.

Randolph: There's some very, very cool stuff with in-memory OLTP that I've been playing with. I've been blogging about it as well on my blog, bornsql.ca, where I talk about some of the in-memory things I've found. Interestingly, I found another issue, which I've spoken to Jonathan Kehayias about, where temp tables are slower than table variables in certain cases. So I'm investigating that at the moment, and it had to do with my in-memory investigations.

Steve: Oh interesting, I’d like to see what you find there.

Randolph: So that was my pat answer about what I would change in SQL Server. What I would actually change is having a TempDB per user database as a configurable extra.

Steve: That would be nice.

Randolph: In other words, it doesn't have to be the default, but if I need a TempDB for a particular database, I would like to have it as a separate entity so that I could manage it separately and keep track of the objects being created for that particular database. The architecture for that is going to be really difficult, so I don't know if Microsoft will do it, but that's what I would like.

Steve: Alright, so what's the best piece of career advice that you've received?

Randolph: Oh, it's going to be a long answer as well, sorry. "Let your work speak for itself" is the first and foremost one. It doesn't matter what people say about you; if your work can speak for itself, you don't have to say anything. That's the best piece of advice I've had. But the other is "Everything you say and everything you do is being recorded somewhere," so please treat people with respect. Treat people as you would have them treat you, and don't say things on voicemails that can be played back at meetings with the CEO, with someone saying, "Randolph, don't do that."

Steve: I like that too.

Carlos: We won't ask about the personal experience that led you to that advice.

Randolph: Every now and then, at the end of the year or when it's a quiet time on my blog, I write about career limiting moves, and that was one of them. So there are a bunch of career limiting moves I can recommend you avoid, not because it's abstract good advice, but because I've actually done them. Don't do them because I've done them, and that voicemail will be one of them.

Steve: Alright.

Carlos: Our last question for you today. If you could have one superhero power what would it be and why do you want it?

Randolph: I thought about this and I honestly don't know. I liked Pinal's answer, where he could touch a server and know what's wrong with it. The problem is, whenever I touch a server, if it's going to break, it's going to break then and there. Maybe I want the reverse of that superpower.

Steve: To just know without touching the server.

Randolph: Yeah, knowing without touching the server is one thing. But if I'm in a room and it is going to break, it is going to break while I'm there, which is good, I guess, in a way, because then you know you can figure it out then and there. But I'd like the reverse of that; that would be cool.

Steve: Ok.

Carlos: Well, awesome. Randolph thanks so much for being with us today.

Randolph: Well, thanks for having me. Indexing and maintenance and stuff like that are topics near and dear to my heart, so I'm glad I got the opportunity to speak to you guys about it.

Episode 81 Indexes Part I

Part of what we like to do on the podcast is to present topics and discuss different ways you might use features. In this episode, we go back to basics and discuss indexes. Like most topics, we made sure to explain all the important aspects. Because this is such a broad topic, we decided to carry it over into our next episode as well. Using the phone book as an example, we chat with our guest Randolph West and explain heaps, non-clustered indexes, and clustered indexes. We also cover best practices for using them when creating new tables.

We didn't forget to cover the problems you might run into. For example, you might not think about indexes until your data grows very large. Sure, you can attempt to resolve the problem by moving to SSDs and increasing memory, but this only pushes the issue further into the future. In this episode, we discuss considerations for indexes and why database architecture is so important.

We would like to hear your opinions. What are your strategies when using indexes? Use the hashtag #sqlpodcast and let us know!

Episode Quote

“The clustered index should be chosen very carefully because your entire table is going to be sorted according to that column or set of columns, and what you’re looking for is a unique value for every row.”

“All that an SSD is doing is making your problems go away until a later time. The problem is still there. It’s just being hidden by the fact that you’ve got faster drives. In fact, the same goes for adding more memory. It’s just that happens to be cheaper than trying to fix your indexes.”

Listen to Learn

  • Heaps and indexes types explained
  • Randolph’s thoughts on clustered indexes vs. non-clustered indexes
  • Similarity between clustered indexes and primary keys
  • The results of using Management Studio Designer for creating new tables
  • How to create covering indexes
  • The negatives of having too many indexes and what you can do about it
  • How are default fill factors affecting indexes? Is faster storage good reason to avoid fixing the indexes?

Randolph on IMDB
SQL Skills article on BTree
Randolph’s Blog
Randolph on Twitter
Clustered and Non-clustered indexes described
Heaps
Creating Indexes with Included Columns

About Randolph West

Randolph West solves technology problems with a focus on SQL Server and C#. He is a Microsoft Data Platform MVP who has worked with SQL Server since the late 1990s. When not consulting, he can be seen acting on the stage and screen or doing voices for independent video games.

 

Transcription: Indexes Part 1

Carlos: Ok, so let’s, I guess we’ll go ahead and kick it off. So Randolph, welcome to the show.

Randolph: Thank you very much for having me.

Carlos: Yeah, it’s great. I know that you and Steve have done a little bit of work together and some of your adventures have come up from time to time as we’ve talked about it, and so it’s great to have you here on the show with us today.

Randolph: Well, thank you! It’s been great fun working with Steve. He is a consummate professional, and I would hire him. I’m just saying.

Carlos: There you go.

Steve: And technically I would say I would hire you too because we’ve done that on projects.

Randolph: Yes we have.

Steve: I think we first met on the Database Corruption Challenge in 2015.

Randolph: That's where it was, yes.

Steve: Oh yeah, yes, and then, yeah you did an amazing job on some of the corruption we worked on there. And then you and I have worked on fixing a couple of corruption issues for clients since then, and then we also worked together on another project sort of a longer term disaster recovery moving to Azure type project for our client.

Carlos: So today our topic is indexes, and this can be a pretty broad topic. In some of our pre-planning we thought, "Gosh, there's no way we're going to cover it all." Some of the components are just a little too complex to discuss fully in an audio format, so for some of that we'll be pointing people to additional documentation in the show notes. But let's go ahead and start that conversation, maybe even by defining an index and how it works. The first structure to talk about is a heap. Basically, a heap is a table with no clustered index on it, so there's no structure or order to the way the data is saved. Obviously you still have columns, right? You can have data types and all those kinds of constraints, but the order in which the data is actually saved is not part of the definition, if you will, of the table. That makes it a heap.

Steve: Yup, so then really from there we look at how we get to the data faster, because with a heap, to find any specific piece of data you've got to search through the entire list, the entire table, and look at every single row and say, "Is this what I'm looking for? Is that what I'm looking for?" Eventually, after you've looked through the entire pile, the heap in this case, you come back with the result. But another way to do this is by reordering some of the data. The way I like to think of it is sort of like a phonebook or a phone list. Growing up, I was used to seeing the yellow pages and the white pages. The yellow pages were the business listings, and the white pages were the listing of every person and business with a phone number, organized by name. Now, if you were to take everyone on that phone list and put them in a random order, perhaps the order in which they signed up or purchased their phone service, that's sort of the equivalent of a heap. There's no specific order to it that's useful for searching, unless of course you're searching by the order in which they got their phone service installed. It would be practically impossible for anyone using a phonebook in that essentially random order to find anybody's phone number. So what they've done with the white pages is organize all of the people by last name; when there are multiple people with the same last name, they're ordered by first name, and when there are multiple people with the same first name, by middle initial. And that's really the equivalent of creating a clustered index on last name, first name, and middle initial.
What that effectively does is order the entire table, or phonebook in this case, by those columns, so that when you need to do a lookup and you're looking for someone whose name is Smith, you don't have to start at the As and go all the way through to the Zs to confirm you've found all the Smiths. You can jump straight to the S section and track down where Smith is really quickly. Then we look at the yellow pages, the business listings, where each business is organized by business type. If you're looking for an accountant, you can jump to the A section, find where all the accountants are, and look up every accountant listed in the yellow pages. But if you're looking for an exterminator, you go to the E section for exterminators and find the list there. Now, the yellow pages are sort of like a filtered non-clustered index, in that they're a copy of the data that exists in the white pages, but ordered differently, and the reason it's "filtered" is that it only includes businesses. So we're kind of throwing two things in there: one, it's a non-clustered index, a copy of the data, and two, it's filtered, because the individual people have been thrown out and only businesses are shown.
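Steve's white-pages analogy, translated into a sketch with hypothetical names:

```sql
-- The phone list itself.
CREATE TABLE dbo.PhoneBook
(
    LastName      varchar(50) NOT NULL,
    FirstName     varchar(50) NOT NULL,
    MiddleInitial char(1)     NULL,
    PhoneNumber   varchar(20) NOT NULL
);

-- The "white pages": the whole table physically ordered by name.
CREATE CLUSTERED INDEX CIX_PhoneBook_Name
    ON dbo.PhoneBook (LastName, FirstName, MiddleInitial);
```

(As the later discussion points out, a name is not actually a good clustering key for a real table, since it's wide and not unique; the analogy is about ordering, not best practice.)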

Carlos: One thing that's kind of important there is that that filter won't necessarily change, right? The category of business is not something you're going to take out of your system. And while you can filter on dates, it has to be a specific date. You're not going to keep, for example, a rolling 90 days in an index; that wouldn't be a good use of a filtered index. But if you only wanted to see rows from the beginning of 2017 forward, then that might be an option, something like that.

Steve: Right, yup. So then next, imagine what's often referred to as a reverse phone directory. It's typically not included with your white pages or yellow pages, but it would be ordered by phone number so you can look up a name. That's the equivalent of a non-clustered index: a copy of some of the data that's been ordered in a different way to make it quicker to find something. So if you want to find out who has the phone number 555-1212, you can look it up by the phone number and find the name. That would be just a plain old non-clustered index with no filter, in that case, because it would have the entire list. I'm not sure if the yellow pages and white pages are international or available in all countries, but hopefully everyone has understood where we're going with that as an example of how those indexes work.
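The reverse directory and the yellow pages translated the same way; the table and column names here are hypothetical:

```sql
-- A phone list with a flag distinguishing businesses from people.
CREATE TABLE dbo.PhoneList
(
    LastName         varchar(50) NOT NULL,
    FirstName        varchar(50) NOT NULL,
    PhoneNumber      varchar(20) NOT NULL,
    IsBusiness       bit         NOT NULL,
    BusinessCategory varchar(50) NULL
);

-- The "reverse directory": same data, reordered by number.
CREATE NONCLUSTERED INDEX IX_PhoneList_PhoneNumber
    ON dbo.PhoneList (PhoneNumber)
    INCLUDE (LastName, FirstName);

-- The "yellow pages": a filtered index containing only businesses.
CREATE NONCLUSTERED INDEX IX_PhoneList_Business
    ON dbo.PhoneList (BusinessCategory, LastName)
    WHERE IsBusiness = 1;
```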

Randolph: Yeah, that makes sense.

Carlos: So let’s see, Randolph, what about thoughts on clustered indexes vs. non-clustered indexes?

Randolph: Well, the clustered index should be chosen very carefully, because your entire table is going to be sorted according to that column or set of columns, and what you're looking for is a unique value for every row. You want to make sure that the clustering key is narrow so that it doesn't take up much space, because that clustering key is reused in every non-clustered index. You have to have a way for that copy of the data to match back to the original table, so the database stores the clustering key, the clustered index columns, inside the non-clustered index so that it can match back. Your clustering key has to be narrow, preferably unique, and ever increasing. What you don't want is a random value, for a number of reasons which we can get into later. A random value causes something called page splits, and it kills performance, and it hurts the storage mechanism in ways that just make things ugly. So you want a number that is ever increasing, like an integer, or a big integer if you're going to have more than 2 or 4 billion rows. You just have to choose carefully.

Carlos: One of the things I like about that idea, and I guess we can jump in there, is you mentioned page splits. Steve's phone book analogy almost works as an example here too. Think about taking pieces of paper, writing on them, and filling them up. That's kind of the way SQL Server works, and these pages, at least in SQL Server, are 8KB chunks, right? So it has to sort them. When you talk about that sorting and the clustered index, it's going to store rows in that order, and I think to your point, when you're inserting in a kind of random order it has to re-sort every time, and those pages can get out of whack, because all of a sudden I've got to fit in data that wasn't there originally.

Randolph: Exactly. What happens there is you have to wait for the index to be re-sorted, and that can take time. It can kill your performance badly, so that's why we try to steer clear of wide, random columns for a clustering key, or clustered index. For example, GUID columns are not a good choice for a clustering key. That doesn't mean they make a bad index, but they make a bad clustered index, because remember, the entire table is sorted according to that index. The clustering key is how your data is sorted, and that is your actual data. It's not a copy of the data.
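One way to see the page-split damage Randolph describes is to compare fragmentation across indexes (a GUID-clustered table will typically fragment far faster than an identity-clustered one). A sketch against the standard DMF:

```sql
-- LIMITED mode is the cheapest scan level; use SAMPLED/DETAILED for more depth.
SELECT OBJECT_NAME(ips.object_id) AS table_name,
       i.name AS index_name,
       ips.avg_fragmentation_in_percent,
       ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
    ON ips.object_id = i.object_id
   AND ips.index_id  = i.index_id
WHERE ips.index_id > 0                   -- skip heaps
ORDER BY ips.avg_fragmentation_in_percent DESC;
```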

Carlos: So it's worth asking at this point, a small little poll, if you will. We've been talking about clustered indexes, which are separate and different from, but kind of go hand in hand with, primary keys most of the time, right?

Randolph: There's an interesting conversation here. Microsoft, for what I think was a good reason, even if it wasn't the right choice, decided that if you create a primary key inside the Management Studio designer, and in fact even before that, in the previous version of the tool, Enterprise Manager I think it was called in the old days, it will automatically make that primary key a clustered index. I think the reason for that default was that they wanted you to have a clustered index at least. And because the primary key is unique by definition, it made sense from a designer point of view, and for people who are not intimately familiar with internals, to have at least a unique value as your primary key and also as your clustering key. It makes sense from that point of view, but it can run into trouble, because let's face it, a GUID is unique. It's a very good candidate for a primary key, but it may not be a good candidate for a clustered index. In fact, it isn't. There are reasons why you would not want to use the designer on particular tables. I prefer not to use the designer anymore when I'm creating tables, because it has a bunch of defaults that may not be appropriate for the design I'm implementing. That doesn't mean I never use the designer; I use it a lot for creating relationships and stuff because it's easier. But in terms of clustered indexes and primary keys, the designer will make them the same set of columns, so it's something to watch out for.
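What Randolph describes, keeping a GUID primary key without clustering on it, might be sketched like this (hypothetical names); it's exactly the split the designer's default won't make for you:

```sql
CREATE TABLE dbo.SalesOrder
(
    OrderGuid uniqueidentifier NOT NULL DEFAULT NEWID(),
    OrderSeq  bigint IDENTITY(1, 1) NOT NULL,
    OrderDate datetime2 NOT NULL,
    -- GUID stays the logical key, but as a NONCLUSTERED constraint.
    CONSTRAINT PK_SalesOrder PRIMARY KEY NONCLUSTERED (OrderGuid)
);

-- Cluster on the narrow, unique, ever-increasing column instead.
CREATE UNIQUE CLUSTERED INDEX CIX_SalesOrder_OrderSeq
    ON dbo.SalesOrder (OrderSeq);
```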

Carlos: Well, now, having said that, kind of from a poll perspective: what percentage of the time are your clustered indexes different from your primary key?

Randolph: Well in my case, 1%.

Steve: Yeah, I think I would put that into two camps. One of them is if it's a new database that I'm architecting from scratch, or new tables that I'm building out where I have control over it. Oftentimes the clustered index and the primary key may end up being different. But if it's a, I don't know, third-party application database that I'm jumping into to do some maintenance on or something, then 99.9% of the time they're the same.

Carlos: Ok, and that's been my experience as well, most of the time, that 99.9%. It's pretty much only when we've identified an issue, like they have, for instance, a GUID as their primary key and clustered index, that we decide, "Hey, maybe we should make a change here." I got some feedback related to this. I just put out this book, "Zero to SQL," and I put a sentence in there. When talking about non-clustered indexes, I said, "We don't have to include the primary key of the table in the non-clustered index because it automatically gets included in the index." Somebody took me to task, and they're ultimately correct in the sense that it's not the primary key that gets added, it's the clustered index key that gets added. It just made me curious: how often is your primary key not your clustered index? That's where I was going with that.

Randolph: Yeah, when you're designing from scratch, as Steve said, you might have a reason for that.

Steve: So I guess you had just gone into it a little bit: you said in the book something about not putting the same columns that are in the primary key into your non-clustered indexes because they're included there as well. What was the feedback you got on that?

Carlos: That it's technically correct in a different way: it's not the primary key that gets added, it's the clustered index key that gets added to the non-clustered index.

Randolph: And in fact, to go a little deeper on that: it gets into the leaf level guaranteed, but if it's a unique index then it's not in the intermediate levels, if that matters.
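A sketch of how the clustering key rides along in a non-clustered index: it doesn't need to be listed explicitly, while other columns a query needs can go in INCLUDE to make the index covering (hypothetical names):

```sql
CREATE TABLE dbo.Person
(
    PersonID  int IDENTITY PRIMARY KEY CLUSTERED,  -- the clustering key
    LastName  varchar(50)  NOT NULL,
    FirstName varchar(50)  NOT NULL,
    Email     varchar(100) NULL
);

-- PersonID is carried in the leaf level automatically; listing it again
-- in the key or INCLUDE would be redundant.
CREATE NONCLUSTERED INDEX IX_Person_LastName
    ON dbo.Person (LastName)
    INCLUDE (FirstName, Email);
```

A query like `SELECT FirstName, Email FROM dbo.Person WHERE LastName = 'Stedman'` can then be answered entirely from this index, with no lookup back to the clustered index.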

Carlos: Ok, and here we start talking about the part of the index structure that's a little hard to describe in an audio format. One of the ideas we came up with was the game show "The $10,000 Pyramid," or even Pyramids, the card game, if you're familiar with that: the idea that you have a structured top and the data follows that pyramid format. That's one of the things Randolph mentioned; there are leaf nodes and other names for some of those sections, so that as SQL Server is looking through the data it can decide, "Do I need to keep going or not?"

Steve: Yup, so really, when the index is being used, it's that pyramid, or tree-type, structure, and SQL Server uses it to quickly jump to the pages you're looking for rather than scanning every leaf node. It goes through that tree to get to the data quickly. Randolph, did you want to talk about the B-tree and how that part works?

Randolph: Well, it’s difficult to speak about with just audio. But if you think of the top of the pyramid there is only ever one page there and that is called the root node. Only ever one page and it’s also an 8K page. And it would be a type of page instead of being a normal data page. It’s called an index page. But otherwise it’s exactly the same as a table structure or a clustered index because remember your clustered index is your data. So your non-clustered index and your clustered index have a similar structure. You’ve got the single page at the top and then inside that page are references to other pages. And those pages contain the index information. So if you need to go to Stedman, you would look on the root page for where the S starts and then tell you which page to should go to where the actual data is and then inside there it will be able to find the data much quicker than having to read the entire table to find Steve Stedman’s number in there.

Carlos: Almost like an index of sorts, right, like a database.

Randolph: It’s exactly what it is, right? It’s just an implementation detail of exactly what you’re talking about. You flip to the back of the book, you look for S, there’s Stedman and you know which page to jump to. That’s exactly what’s happening.

Steve: So then one of the differences with that is the amount of time it takes to get there. If you're going through a heap, or a clustered index that isn't clustered on last name, and you're searching for me, Stedman: if there's a million pages in your table, you have to touch all million of those to check and find the rows with my last name on them. But with the index and that tree structure, you might have to read two, or three, or four pages, depending on how deep that index is, to get to the S section, and then scan through just a few pages once you're there to find Stedman. And the difference is you're touching maybe 8, maybe 10 pages rather than a million.

Randolph: Yeah, and ten is a lot for a seek. What you've just described, if you're checking the entire white pages or the entire index, that's called a "scan", and if you're just looking for a particular record, because you know how to get there through the index, then that is called a "seek". There's a whole conversation here about which is better. The answer is neither, because it depends on what you're doing. A seek would be more efficient if you're just looking for one or two records. But if you're looking for a million out of your 10 million, then a scan would be much more efficient, because it would need to read fewer pages overall.
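
A minimal sketch of the seek-versus-scan distinction in the phone book example; the table and index names here are illustrative, not from the episode.

```sql
-- Hypothetical phone book table with a non-clustered index on LastName.
CREATE TABLE dbo.PhoneBook (
    PhoneBookID int IDENTITY(1,1) PRIMARY KEY CLUSTERED,
    LastName    varchar(50) NOT NULL,
    FirstName   varchar(50) NOT NULL,
    PhoneNumber varchar(20) NOT NULL
);

CREATE NONCLUSTERED INDEX IX_PhoneBook_LastName
    ON dbo.PhoneBook (LastName);

-- A selective predicate on the index key can navigate the B-tree:
-- typically an index seek in the execution plan.
SELECT LastName, FirstName
FROM dbo.PhoneBook
WHERE LastName = 'Stedman';

-- No useful predicate means every row must be read: a scan.
SELECT LastName, FirstName
FROM dbo.PhoneBook;
```

As Randolph says, neither operator is "better" in the abstract; the optimizer's choice depends on how many rows the predicate is expected to return.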

Steve: Yup, good point there. I think that’s one of those that it comes back to really what is it you’re trying to do. Sometimes the query optimizer has to guess at what is it you’re attempting to do with your query. Usually it gets it pretty close but sometimes it’s not quite right and you end up with a seek when a scan may have been better and vice versa.

Randolph: Yeah, and that gets into conversations about are your indexes correct? Are they stale? Do they need to be updated? Do they need to be maintained? There’s a whole bunch of stuff here that we can talk about. There’s statistics as well.

Carlos: I guess we can jump into some of that. Before we do that I think one of those ideas is, are my indexes good enough? Because the optimizer can do some different things based on the information that it has, right? And then there’s this idea of a covering index. Take us through that.

Randolph: What happens there is, we spoke about the structure of the index briefly, where you've got your root node, you've got your leaf node where your data is, and then there are these intermediate levels. Say you have a table where you need to bring back the last name, the first name, and the telephone number, but there's a whole bunch of other data in there as well; then your index would only need to contain those three columns. But let's say that some of your queries that happen quite often would also require a middle name. You could add the middle name as an included column. That middle name would appear in the leaf level, so the bottom level of the index, but it wouldn't appear in the intermediate levels, and that's called an included column. Your index doesn't have to have a huge key of four columns, because you wouldn't be using all of them for all of your queries, but it would be a suitable covering index for the query that requires the middle name. I hope I haven't confused everybody by saying it that way.

Steve: Just a little bit more detail on the covering index, because sometimes when I talk about covering indexes people say, "Oh, how do I create that? Is that a different create statement than a non-clustered or clustered index?" And the answer to that is really, no. A covering index is simply an index where all of the data that you're asking for in your query happens to be in that non-clustered index. It can return it all back to the query without ever touching the heap or the clustered index. But if it's not covering, what that means is it uses that non-clustered index to find the rows that you need, and then it jumps back to the heap or the clustered index to bring in the data that you're looking for. So that last step to jump to the heap or the clustered index is skipped with a covering index, and that can save quite a bit of time too.
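
The included-column pattern Randolph and Steve describe can be sketched like this; the table and column names (including a hypothetical `MiddleName` column on a `dbo.PhoneBook` table) are illustrative.

```sql
-- Key column: LastName (used to navigate the B-tree).
-- Included columns: stored only at the leaf level, so the intermediate
-- levels of the index stay narrow.
CREATE NONCLUSTERED INDEX IX_PhoneBook_LastName_Covering
    ON dbo.PhoneBook (LastName)
    INCLUDE (FirstName, PhoneNumber, MiddleName);

-- Every column this query needs lives in the index, so it is "covering":
-- no key lookup back to the clustered index is required.
SELECT LastName, FirstName, PhoneNumber, MiddleName
FROM dbo.PhoneBook
WHERE LastName = 'Stedman';
```

If `MiddleName` were left out of the `INCLUDE` list, the plan would show the key lookup operator Carlos mentions next.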

Randolph: Yup.

Carlos: And that’s in your execution plans you’ll see that lookup, right. And that’s what’s doing. It is basically, I don’t have the information I need so I need to jump and grab it, right. So whether you should then create a “covering index” will depend on how many times a query gets run, performance to that query and all those kind of things.

Randolph: And how many columns there are in the lookup and all sorts of stuff. Yeah, it’s an art. It’s not a science.

Carlos: Right.

Steve: So then with that when you’re looking at the query plan and there is a suggested index. And I see this quite often where it says suggested index with two or three columns and then include, and it includes like 25 columns in it. That kind of thing would certainly give you a covering index but what are your thoughts when SQL Server is suggesting those types of indexes?

Randolph: Don’t do that.

Steve: And why?

Randolph: Ok so, when SQL Server recommends an index, a missing index, it is only specific to that query. If we know anything about query plans, you can add a space to a query and it will suddenly have a different query plan, maybe. Don't just create missing indexes as they come up. What you need to do is evaluate them against other missing indexes or other recommendations for indexes on that table, and maybe combine the different recommendations into one or two new indexes, or modify existing indexes, to try and accommodate as many of those recommendations as possible. If an index includes all of the columns of the table, you might as well just be recreating the clustered index. It's just not efficient.
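
Randolph's point about combining recommendations might look like this in practice; the suggested indexes and names here are hypothetical.

```sql
-- Suppose two different queries produced these two suggestions:
--   1) ON dbo.PhoneBook (LastName)            INCLUDE (PhoneNumber)
--   2) ON dbo.PhoneBook (LastName, FirstName) INCLUDE (MiddleName)
--
-- Rather than creating both, one consolidated index can often satisfy
-- both queries: suggestion 1 can seek on the leading LastName key,
-- and suggestion 2 is covered outright.
CREATE NONCLUSTERED INDEX IX_PhoneBook_LastName_FirstName
    ON dbo.PhoneBook (LastName, FirstName)
    INCLUDE (PhoneNumber, MiddleName);
```

The trade-off is a slightly wider single index versus the write and memory cost of maintaining two overlapping ones.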

Steve: So one of the things that I see oftentimes when I'm doing performance tuning and looking at indexes that are inefficient or not used: often you see them named something like the suggested missing index name, which is a clue that somebody just copied and pasted from the suggested missing index and ran it, and they end up with an index that may not really be what they're looking for there.

Randolph: In fact, this goes back to Pinal's interview that you did in December, where he said his reference point is, "Do you have more than ten indexes on a table?" He's going to start looking into that and see why. Because on a legacy system, or even a third-party vendor system that you're not allowed to touch, people have over the years gone and added indexes because it helps with a particular query. You can end up with duplicate indexes, or indexes that are only different by, say, an included column. You've got all of that maintenance that's going into keeping these indexes up to date, and statistics and all the other stuff that goes with that, when in fact you can consolidate and reduce your indexes quite easily by just analyzing the system and how those queries are running.

Steve: Ok, so then let’s say I’ve got too many indexes like you’re talking about there and some of them are not efficient, or they’re not being used. They may help someday with some query but what are the negatives to having too many indexes specifically what’s going to slow down when you have too many of those?

Randolph: Well, there are two things that I would like to address. The first thing is, if an index is useful once a month, then create it once a month and then drop it again. If you've got a reporting-style query that has to run once a month to generate a report, and without the index it would take 10x as long, go ahead and build that index just before the report runs and drop it afterwards, because it's not in use the other 28 or 29 days of the month. There's nothing wrong with creating and dropping an index when you need it, as long as it doesn't impact business hours, obviously. That's the first thing. The second thing, to answer the question: every time you modify the table, it has to make sure that the indexes are up to date. So if you've got 10 or 15 indexes on a table, every time you insert, update, or delete a row (a delete, in fact, is the same as an update, because it's changing the data), it's got to make sure that the change to the row is replicated in each one of those indexes. So it has to lock the table and it has to lock the indexes while it's updating them, and that can cause performance problems in and of itself. It can also cause problems because you are locking data, so you cannot make any changes, you cannot select from it, unless you start using tricks that you shouldn't be using, query hints, which I'm not even going to mention. All because you have too many indexes and you're trying to satisfy a possibility that might happen in the future, when in fact you've got statistics that tell you, DMVs that tell you, exactly what's going on with your database. Use the information available. It's not SQL 2000 anymore. We have better ways of analyzing how your queries are running.
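
Randolph's create-it-when-you-need-it pattern for a monthly report might be sketched like this; the table, index, and procedure names are all hypothetical.

```sql
-- Build the reporting index just before the monthly job runs...
CREATE NONCLUSTERED INDEX IX_Sales_SaleDate_Report
    ON dbo.Sales (SaleDate)
    INCLUDE (CustomerID, Amount);

-- ...run the report (illustrative stored procedure name)...
EXEC dbo.usp_MonthlySalesReport;

-- ...and drop the index so it isn't maintained on every insert,
-- update, and delete for the rest of the month.
DROP INDEX IX_Sales_SaleDate_Report ON dbo.Sales;
```

Scheduled as a single job step, this keeps the write overhead of the index off the table except during the maintenance window in which it earns its keep.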

Steve: Ok, so then one of the things that often comes up, and it's highly debatable in places that I've been recently, is fill factor. What is a good default fill factor setting, and what does that really mean to your indexes?

Randolph: Ok well, fill factor, in theory, is about leaving enough space in the data pages themselves so that if you do need to add rows, and it is a random type of key like we spoke about before with GUIDs, it leaves space in the data page so that it doesn't have to split as often, and it can go and insert data into those pages without having to re-sort the entire index every time. That's kind of what it's about. Do you want to add anything to that definition?

Steve: Well, I was just thinking, I heard a debate awhile back with someone arguing that with SSDs and faster storage, your fill factor choices change versus spinning storage. I don't know if I entirely agree with that.

Randolph: I completely disagree because all that an SSD is doing is making your problems go away until a later time. The problem is still there. It’s just being hidden by the fact that you’ve got faster drives. In fact, the same goes for adding more memory. It’s just that happens to be cheaper than trying to fix your indexes.

Steve: Yup. Ok then, so with fill factor, I think what we end up seeing oftentimes is that people have a default fill factor setting on their database. They think that it's going to cover whatever they need, that it's going to be right for all occasions. But really it comes down to this: the fill factor needs to vary depending on the data type, the frequency of change, and whether the data going in is sequential or ever-growing, or whether it's completely random. Fill factor really depends on the data.

Carlos: Do you have different fill factor settings based on your clustered vs. non-clustered indexes? Because that kind of would be a good sorting factor, in the sense that, in theory, we talked about our clustered indexes; assuming that they're unique and ordered, you shouldn't be seeing as much fragmentation there, because again, they're kind of in that order. Kind of going back to the idea of them being the primary key or unique, ascending, ever-increasing; an identity is a good example of that. Is the non-clustered index going to be a little bit more random, if you will, where you could see a little bit more shuffling?

Randolph: The magic word you used there was fragmentation. This used to be a huge problem, as Steve mentioned, with spinning drives. While I'm not worrying about fill factor that much, I do think you should worry about fragmentation. You should worry about these things purely from the point of view of knowing that your database is in a good condition and well maintained. At the same time, though, index fragmentation with faster drives doesn't matter as much as, for example, making sure that your statistics are up to date. In fact, I now say in my performance tuning session: if you want to get the best benefit from your maintenance window, it is far better time spent updating your statistics than, for example, rebuilding an entire index. Firstly, from an I/O point of view, a rebuild is a size-of-data operation: for large tables it's going to need transaction log space, it's going to need space in the data file itself, it's going to need a lot of resources, and memory, and all of those things. And if you're replicating, it's going to replicate it all. It's a big deal to rebuild a large table, whereas it's a much more effective use of your time to do a statistics update. And compared to the benefit of your query plan being more effective from up-to-date statistics, fragmentation is less of a problem now. So from that point of view, Steve, yes, I will agree with your statement before about fill factor. But yeah, the one thing I wanted to mention on fill factor is that there was a confusion a few years ago: fill factor of zero and fill factor of 100, what's the difference? There is no difference. They're exactly the same thing. A zero fill factor means there is zero space left over in the page; a fill factor of a hundred means it's 100% used, so it's the same thing. I don't know why they decided to distinguish them, but who knows, it is Microsoft.
A fill factor is a percentage from 1 to 100, so zero is a hundred. What you want to do there, as you mentioned, is it depends on the type of column you're addressing. Less so the type of index you're using, but more the type of column. So a GUID, as we spoke about before, is a random value, so you would want a lower fill factor on the index for that type of data, because that reduces the number of page splits, which reduces the amount of fragmentation. However, you will still need to maintain that index from time to time to make sure that the fragmentation is reduced.
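
Fill factor is set per index in the `WITH` clause of `CREATE INDEX`. A sketch of the distinction Randolph draws, with hypothetical table and column names:

```sql
-- A random GUID key benefits from free space in each leaf page,
-- so inserts land in the gaps instead of causing page splits.
CREATE NONCLUSTERED INDEX IX_Orders_OrderGuid
    ON dbo.Orders (OrderGuid)
    WITH (FILLFACTOR = 80);   -- leave roughly 20% free per page

-- An ever-increasing key only ever inserts at the "end" of the index,
-- so full pages are fine; 100 (or the default 0) is equivalent.
CREATE NONCLUSTERED INDEX IX_Orders_OrderDate
    ON dbo.Orders (OrderDate)
    WITH (FILLFACTOR = 100);
```

The 80 here is illustrative, not a recommendation; as Steve says, the right number depends on the data and its rate of change.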

Steve: One of the interesting things I've come across there is that if you're in a scenario where somebody has a GUID as their primary key and their clustered index, you might be in a position where it's too big, or too time-consuming, to change that clustered index in a given maintenance window. One of the workarounds I've found is to use NEWSEQUENTIALID() rather than NEWID(), so that you can turn GUIDs into sort of a pseudo ever-incrementing value.
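
Steve's workaround hinges on the fact that `NEWSEQUENTIALID()` can only be used as a column default, not called directly in a query. A sketch with a hypothetical table:

```sql
-- CustomerID values are still 16-byte GUIDs, but each new value is
-- greater than the previous one generated on this machine, so inserts
-- land at the end of the clustered index instead of at random pages.
CREATE TABLE dbo.Customer (
    CustomerID uniqueidentifier NOT NULL
        CONSTRAINT DF_Customer_CustomerID DEFAULT NEWSEQUENTIALID(),
    Name nvarchar(100) NOT NULL,
    CONSTRAINT PK_Customer PRIMARY KEY CLUSTERED (CustomerID)
);

-- Rows inserted without specifying CustomerID pick up sequential GUIDs.
INSERT INTO dbo.Customer (Name) VALUES (N'Stedman');
```

One caveat worth knowing: the sequence restarts from a new starting point after a server restart, so values are only "mostly" ascending over the life of the table.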

Randolph: Right, and that’s 16 bytes wide so.

Steve: Yup, it’s still big.

Randolph: It’s still big but it is sequential as you mentioned.

Steve: Yeah, it’s better than random at that point.

Randolph: It’s better than random but the reason people use GUIDs is usually from a development first perspective so the developers want to manage the index, sorry the primary key in the application as opposed to the database handling it. For number of legacy reasons that was a good idea 15 years ago. It’s not a good idea now. So there are a lot of habits and this is the way we do things happening from a development side that are not being, well the new stuff in SQL Server is not being adapted as effectively because there is 15 year old news on the Internet that says SQL is not good at this. The other thing is we have to educate people to say, listen there is a reason these RDBMS is in the top 3 in the world. There’s Oracle, there’s MySQL and there’s SQL Server, those are the top 3 RDBMSs on the world, Relational Database Management System. So SQL Server is in the top 3 for a reason. It’s a good product. It’s in fact one of the best products out there to handle your data so you should be using the newer features and the newer compatibility levels and things like that. And get away from the old habits of 15 years ago because they’re bad habits and things have gotten better in the last few years.

Steve: Yup, and that is oftentimes a challenging debate to have. And I've found that coming in and showing examples of the performance difference between one way of doing it versus the other is a good way to show what's better there. Ok, so then another one of the different options around indexing that we have is the concept of filtered indexes. And I know I threw this term out when we were talking about the phone book directory, around the business names in the yellow pages. So filtered indexes: any thoughts on the pros and cons, and sort of how that works, that you want to talk about here?

Randolph: Well, the idea of a filtered index is that you have a very particular condition in the WHERE clause of your query. Let's say that you want to grab all of the Stedmans in the phone book where the middle name equals J or starts with a J. If that WHERE condition happens often, then you could create a filtered index that uses that WHERE condition, and then the index will only include those rows that satisfy the condition. The problem with the filtered index is that if you use a parameter in your WHERE clause, it's not going to use the filtered index. So it's solving a very specific problem, in my opinion. It's a good problem that it solves, but it's a very particular way of solving it. And you have to be very careful when you create your indexes and your queries that you're using them correctly.

Steve: Yup, so a real-world example I came across where filtered indexes were very handy was around a very large table and needing to purge data beyond a certain date. I found that when I just ran a DELETE statement to delete anything older than a given date, it was forcing a scan of a large index to find those rows, and it was taking like 40 minutes to delete a thousand rows. So what I did instead was, deleting one day at a time that was beyond the date range, I added a filtered index just for that one day. That filtered index gave me the shortcut to be able to delete a thousand rows at a time in a split second, less than a second, rather than an extended amount of time. I found in that case I could add that filtered index, do the work I needed to do, then drop it and add a new one when I was working on a different day, and that was a really handy way to use it to improve performance there.
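
A sketch of the purge pattern Steve describes; the table name, dates, and batch size are hypothetical. Note the DELETE uses the same literal predicate as the index filter, which matters because, as Randolph noted, a parameterized predicate would not match the filtered index.

```sql
-- Filtered index covering only the single day being purged.
CREATE NONCLUSTERED INDEX IX_EventLog_Purge
    ON dbo.EventLog (EventDate)
    WHERE EventDate >= '20170101' AND EventDate < '20170102';

-- Delete in small batches; each batch can seek the tiny filtered
-- index instead of scanning the whole table.
DELETE TOP (1000)
FROM dbo.EventLog
WHERE EventDate >= '20170101' AND EventDate < '20170102';

-- When the day is fully purged, drop the index and create the next
-- one for the following day.
DROP INDEX IX_EventLog_Purge ON dbo.EventLog;
```

The batched `DELETE TOP (1000)` would typically run in a loop until no rows remain for that day; it is shown once here for brevity.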

Randolph: Yeah, that’s an excellent example of when a filtered index would make a lot of sense. Another example is if you have a very large table and you have a bit typed field where you’ve got a nullable values, it’s a nullable column, and you’ve only got a few of those rows that have a value of true or false and you could have an index, a filtered index on either of the null values or the populated values depending on what type of query you’re writing. And it can identify those columns or those rows that have that bit column set much more easily.

Steve: Yup, yup, another good example there.

Carlos: Exactly, yeah, that’s been my experience as well.

Steve: So then let’s jump a little bit more into the care and feeding side of things, and I know we touched a little bit on rebuild and a little bit on statistics but if we look at what are the general things that we needed to do to take care of indexes and keep them well structured on an ongoing basis?

Randolph: I’d be interested to hear your side because I’ve been speaking a lot. I’ll open if I had any feedback.

Steve: Ok, I mean, so really I think on that there's the concept of reorganizing versus rebuilding as you get fragmentation over time, and then there's statistics. Let me start with a very bad way of doing it that I see quite often, which is when you go and use one of the default maintenance plans. In the maintenance plan you just go and click: let's rebuild my indexes, let's reorganize my indexes, then let's rebuild statistics. The rebuild basically rebuilds the entire index from scratch; if you're on Enterprise Edition you can do that in online mode, but if you're not on Enterprise Edition that's an offline operation. So what the default maintenance plan people sometimes check first is, "Let's rebuild them." And that gives you the best-structured index you're probably going to end up with, and it also rebuilds your statistics. And then they reorganize them, and then they rebuild statistics with like a 5% sampling or something like that. In that example you end up with a whole lot of work that's actually giving you a negative impact. Whereas if you had just rebuilt them to begin with, and then not done the reorg, and not done the statistics rebuild afterwards, you would be in a lot better shape.

Randolph: So, you know, those defaults that you're talking about, the reason that happens is because the maintenance plan wizard has got rebuild, then reorg, then statistics, in that order in the list.

Steve: Absolutely, yeah, people just click the checkboxes or select them and that’s what you get.

Randolph: So when I see that, I blame Microsoft. That's purely Microsoft's fault, because generally the people using the maintenance plan wizard are using it because they don't know how to do it with, say, Ola Hallengren's or MinionWare's tools, just to name two; there are many. So that's Microsoft's fault, and every time I see that I do not blame the DBA, or the accidental DBA, or whoever is managing it. It's not their fault. But it is causing a whole bunch of work that is unnecessary.

Steve: Yes, it may not be their fault, but it's their pain they have to deal with. When I see that, I generally do a little bit of education to explain why that's bad and turn those off. And like you said, replace it with the MinionWare or Ola Hallengren scripts, both of which are far better solutions than the default maintenance plans.

Randolph: So the one thing that I wanted to mention, Steve, is that an index rebuild is, as I mentioned before, a size-of-data operation. So if your table is 10GB and you need to rebuild it, if you're doing a clustered index rebuild, it's going to require 10GB of space in the data file and then it's going to require at least 10GB in the transaction log. And if you're rebuilding a clustered index, it is going to, by default, rebuild all of the non-clustered indexes on that table that refer to that clustered index, and that means all of them. So if you've got a 10GB table and you've got 30GB of non-clustered indexes on that table, it's going to require at least 40GB in the data file and 40GB in the transaction log, because it's going to rebuild everything. So if you've got a maintenance plan that says let's do a clustered index rebuild, and before that you had the non-clustered indexes rebuilding, it's going to do them again anyway, so it's a nightmare. This is why I say more regular statistics updates are going to be a lot more effective than index rebuilds over the long term. You still do need index maintenance, but let's not do it every night.

Steve: And that’s where I like the third party solutions that we mentioned there for the ability to rebuild those where you have options to say given a certain level of fragmentation I want to choose to reorganize this or if it’s even more fragmented then we’re going to choose to rebuild it. And some of them even have the option of only do this for a certain amount of time.

Randolph: Can you go into detail on what the difference between a rebuild and a reorg is?

Steve: Ok, so the rebuild is basically, technically, the equivalent of dropping the index and then recreating it. It's completely rebuilt from scratch, and it will take up the log and disk I/O like you talked about a moment ago. With the reorg, what that's going to do is go into the index and basically restructure or compact the index pages as needed. It will go through and shuffle them around to compact them down into a smaller amount of space, and it does that page by page as it goes through, and that can be done while things are online, even in Standard Edition. But it's slower, and the end outcome is not as good as a complete rebuild. It might get you real close, but it's not going to be as good as a complete rebuild; the benefit is you can keep it online.

Randolph: And you still need to update the statistics after you've done a reorganization, which is something that people tend to forget as well. The statistics need to be up to date so that the query optimizer knows where the data is, how much of it is in there, and what the distribution of the data in that index is, so that it can make the right choice about what type of query plan to create, because a lot of query performance comes down to sorts and joins, and that's where you find your biggest performance bottlenecks. A sort is a blocking operation, so everything in the query has to stop until the sort is finished. So if your index is up to date and sorted correctly, then it doesn't have to do that operation; it's already done. There's a whole bunch of stuff about loop joins we could talk about here as well, which I'm not going to. An up-to-date statistic is much more valuable to the query optimizer than a defragmented index.
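
The rebuild-versus-reorg distinction, plus the follow-up statistics update Randolph insists on, can be sketched like this; the index and table names are hypothetical, and `ONLINE = ON` assumes Enterprise Edition.

```sql
-- Reorganize: compacts leaf pages in place, always an online operation,
-- but does NOT update statistics, so follow it up explicitly.
ALTER INDEX IX_PhoneBook_LastName ON dbo.PhoneBook REORGANIZE;
UPDATE STATISTICS dbo.PhoneBook IX_PhoneBook_LastName;

-- Rebuild: recreates the index from scratch (size-of-data operation)
-- and refreshes its statistics as a side effect. Offline unless the
-- Enterprise-only ONLINE option is specified.
ALTER INDEX IX_PhoneBook_LastName ON dbo.PhoneBook
    REBUILD WITH (ONLINE = ON);
```

This is also why the maintenance plan ordering discussed above is wasteful: a rebuild followed by a reorg and a low-sample statistics update throws away work the rebuild already did.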

Carlos: There’s another point there even that, we’ve talked about current feeding of the indexes, right, rebuilds vs. reorgs. However, we can also have statistics that are outside of the indexes, right, on the table. Those are also providing feedback to SQL Server to the optimizer. And if we neglect those we’re still kind of the same peril because the optimizer maybe using the statistics but we’re not giving them that current feeding.

Randolph: Correct so, it’s valuable to have a, I hate this word, holistic approach to your current feeding is that you need to look at not just indexes, and not just fragmentation, and not just statistics but all of them. And in a way that is effective and efficient for your maintenance window. You don’t want to go and rebuild indexes every night because it doesn’t make any sense. But you also don’t want to stop doing statistics updates.

Steve: Yup. One thing I commonly see with that is updating the statistics, where somebody will have a job that goes through and blindly updates every single statistic in the entire database every night. And that can have a pretty big performance impact on things, depending on the size of your database. A better approach is to look at which tables or indexes actually have changes and only update those that have a fair amount of changes on that frequent basis, and then maybe on a regular basis of once a week or once a month do an update across all of them.

Randolph: Yeah, that’s a very good point. The amount of data that changes in a table if you’ve got automatic statistics enabled, we’re going to go a little bit of a segway now, by default automatic statistics are enabled. There are other settings that you can choose in 2016. There’s a whole bunch of new stuff per database which you can choose. Let’s talk about everything before 2016 because that’s what most people are still using. Statistics will only update themselves if a certain percentage of the table has changed. The value has changed across versions recently but the fact is that statistics will only update themselves if a certain percentage has been changed. So if you’ve got a really large table with tens of millions of records or rows and you have only one million that have changed recently statistics will not be updated automatically. So it does pay you to use one of the third party tools we’ve mentioned like MinionWare or Ola Hallengren’s maintenance solution where it will go and check, has any number changed and then you can update the statistics there and that will help your query optimizer as well.

Steve: The other side effect you can get with that is that if you do hit that threshold where it decides it's time to update those statistics, that might be in the middle of your peak load during the day. And right when you've got a lot of traffic coming to your database, a statistics update occurs, and that can have an impact on things too.

Randolph: That in fact is one of the recommendations for SharePoint-style databases. There's a lot of things I don't like about SharePoint, but the fact is a lot of people use it, so one of their recommendations is to turn off automatic statistics updates on any SharePoint-style database. That includes CRM, it includes Great Plains, all of those. What's the one that I used? Whatever it is, there's a whole bunch of them where you should turn off automatic statistics updates and instead include statistics maintenance in your maintenance plans. So just keep in mind each database is different.

Steve: It will probably apply to Navision and Dynamics as well. 

Randolph: Yes that’s the one. Yeah, Dynamics is the one I’m trying to remember. Thank you! So any SharePoint style database they do recommend turning off statistics updates automatically and to do them in a separate maintenance window. So it pays to do your research to make sure that you’re doing the right kind of maintenance plans for your database and your instance.

Steve: Yup, very good point.

Carlos: So I guess, a couple of different things: we've talked about all of these components. Let's now talk about some of the benefits. So all of these things we have to go and do, right, kind of the holistic approach, kind of having to know our data, getting more familiar with it. Ultimately, to what end are we going to do that? We've talked a little bit about performance; I guess we should probably talk about how that performance gets into the system, if you will, or the mechanisms that cause the improvements.

Randolph: Ok, so when I do my performance session, what I say is: you've got expensive items like your CPU; you've got less expensive items like your RAM; and you've got even less expensive items like your hard drives. Your CPU is the most valuable thing to you, because the SQL license is per CPU core, and you want to use your CPU and memory as efficiently as you can. An index, as we spoke about before, is a copy of the data, so you want to keep your indexes as efficient as possible: if you've got a large table, you don't want your index to be large as well. You want it to be smaller, so that less of it is in memory, because that's what this game is about. SQL Server is all about keeping as much data as possible in memory. For Standard Edition, up to 2016 even, you've only got a certain amount of memory that you can access. In 2016 Service Pack 1 the limit is still there, but it applies to your buffer pool as opposed to everything. The fact is that there are limits to the amount of memory you can use for SQL Server, specifically with Standard Edition, because not everybody can afford Enterprise. So you have to manage the amount of data that's going into memory as much as you can, and that is most effectively done by having the right kind of indexes for your queries. That's also why you don't want duplicate indexes, because they will also be in memory. You don't want wide indexes, because they will take up more memory than they need to. And that's why included columns are a very handy way to reduce the size of your indexes, and that's why we have filtered indexes. All of these things try to reduce the amount of data that is in memory so that we can do a lot more with what's in the buffer pool.

Carlos: So then how do we go about it, or what are the best ways to decide? We talked about looking at the execution plan, which kind of gives you a recommended index. So what are some good strategies for deciding which indexes get created and which ones we need to ignore?

Steve: One approach I take on that is, if I’m looking at an execution plan and it suggests an index, and it’s a development environment, one of the things I’ll try initially is to just create the index, see how it helps, and then drop it. Then I’ll go look and see if there’s a similar index. I mean, if the index that it suggested was beneficial, I’ll go and see if there’s a similar existing index that could be modified or added to that would give a similar effect to what the suggested index was doing. Sometimes that means taking an existing index and just adding an include to it, or taking an existing index and adding another column to it.

Carlos: Ok, so that’s kind of a dev environment, one query at a time.

Steve: Yup, and the other approach I use there is to go and look at some of the missing index DMVs. Actually, you look at DMVs for a handful of things. One is to figure out what the missing indexes are and the overall cost associated with those missing indexes, and then come up with some good ones to add that are not going to be duplicates of other indexes, and then see how that improves performance. You can also use DMVs to figure out what your duplicate indexes are. If you have duplicate indexes, oftentimes you’ll see that one of them might not be used at all while the other one is used a lot, and you can go in and drop one of them. However, you need to be careful when you’re looking at unique indexes: whether it’s a unique index, a clustered index, or a non-clustered index, you want to look carefully and drop the right one. Another thing to track down is big clustered indexes, the clustered indexes that are really wide, because that extra width ripples through all the non-clustered indexes. You can also track down unused indexes: what are the indexes on your system that are not being used? When you look at that, you need to be very careful, because that only covers the ones that haven’t been used since the SQL Server instance restarted. Also, if you rebuild an index, I believe it clears those index usage stats.
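The unused-index check Steve describes can be sketched against the usage-stats DMV like this (a starting point, not a drop list; as he notes, the counters reset on instance restart):

```sql
-- Non-clustered indexes with no reads since the last instance restart.
-- Be careful: unique indexes may enforce constraints even if never read.
SELECT OBJECT_NAME(i.object_id) AS table_name,
       i.name                   AS index_name,
       i.is_unique,
       u.user_updates           -- writes still paid on every data change
FROM sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS u
       ON u.database_id = DB_ID()
      AND u.object_id   = i.object_id
      AND u.index_id    = i.index_id
WHERE i.type_desc = N'NONCLUSTERED'
  AND OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
  AND COALESCE(u.user_seeks + u.user_scans + u.user_lookups, 0) = 0
ORDER BY u.user_updates DESC;
```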

Randolph: Sometimes.

Steve: Sometimes, exactly. So if you see there’s an index that is not being used, you might want to track it over a few weeks to a month, confirm that it really isn’t being used, and then take a look at dropping it.

Randolph: That goes back to one of the best practices, which is to have a baseline. Know what your system is doing and track that over time, and then you can refer back to that baseline and say, “Well, this is exactly what’s happening. This is different.” And go from there. Yeah, that’s a good point.

Steve: Yup, and that reminds me of a project, Carlos, that you and I worked on, where we set up a monitoring component that ran for a month and kept track of unused index details, logging them every few hours. Then we went back at the end of the month, reviewed the ones that had no utilization over the entire month, and suggested those as possible candidates to be dropped.

Randolph: That reminded me of Pinal’s comment as well, that he’ll only look at a database that’s been running for a week. The one thing I wanted to mention about the DMVs you’re referring to: there’s a group of diagnostic queries that Glenn Berry, Glenn Alan Berry from SQLskills, uses and maintains. Brent Ozar and a whole bunch of other people have similar ones. Brent Ozar’s is called sp_BlitzIndex, and Glenn Berry’s is included in his diagnostic scripts. It’s a very similar query, and what it does is weight your missing indexes according to number of scans, number of seeks, and something called average user impact. Now, that average user impact number, I think it’s unitless; I’m not quite sure how that number is calculated. But if it’s a really high impact, like tens or hundreds of thousands, then usually I will go and look at that first and say, “Well, how many times has this index been requested, in terms of number of seeks and scans?” And if it’s a high-usage or high-impact missing index, then I will usually create it without too many modifications, provided it’s not too wide or doesn’t have too many included columns.

Carlos: I think it’s the number of times requested multiplied by the number of hits.

Steve: Yup, and just to jump in there, I’d feel a little bit left out if I didn’t have the chance to mention Database Health Monitor. In Database Health Monitor there are many indexing reports similar to what Randolph has described.

Randolph: Certainly. Steve Stedman’s Solutions has a magnificent and free tool called Database Health Monitor, which I have used. I don’t even get paid for this statement. It’s a good product, and it’s free, which is even more amazing. This is the great thing about the SQL Server community: there are a lot of free tools out there that are adding value all the time, and all that people ask is that you recognize them. I recognize Steve as a valuable member of our community. That ends the sponsorship message.

Steve: Thank you for the shameless plug.

Carlos: So it’s interesting, and maybe I’m being a little bit paranoid here, but when I’m looking at that impact, I’ll take it into account along with the indexes I’m interested in adding. Of course, I want to do what Steve mentioned and look for duplicates, or at what’s already there, right? Then, am I going to go over my threshold of ten or whatever, taking into consideration the additional load I’m putting on this table by creating the index? And while you’d want to test that out, most of the environments I’ve worked on just don’t have a good mechanism for creating a load similar to what I have in production. So when I go to implement that index, one of the first things I’ll start to look at is those usage stats, because I want the usage stats on the index I just created to be going up. If they aren’t, I know, “Ok, wait a second, what did I do? Is this still really a good candidate?”
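The follow-up check Carlos describes, watching whether a newly created index is actually earning its keep, might look like this (table and index names are hypothetical):

```sql
-- Reads should be climbing after deployment; user_updates shows
-- the ongoing write cost the new index adds to the table.
SELECT i.name AS index_name,
       u.user_seeks, u.user_scans, u.user_lookups,  -- reads
       u.user_updates,                              -- write overhead
       u.last_user_seek
FROM sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS u
       ON u.database_id = DB_ID()
      AND u.object_id   = i.object_id
      AND u.index_id    = i.index_id
WHERE i.object_id = OBJECT_ID(N'dbo.Orders')
  AND i.name = N'IX_Orders_Open_CustomerID';
```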

Randolph: Yeah, that’s a very good point.

Carlos: So I guess that’s how we go about creating them. A couple of other things we didn’t really get into are some of the newer index types, like columnstore, or even XML indexes. Any thoughts about when to start looking at these others, or should we even try to approach that now? Maybe I feel like we should punt on some of those.

Randolph: I can briefly mention XML indexes. The way they work, they’re going to be larger than your table, or your XML column. The reason is that an XML index creates an internal table according to the structure of the XML document or XML field you’re indexing. It actually expands the XML data out into an internal table and then indexes that. So you could have a column that is, say, 100KB maximum, and that’s a small one, and end up with an index that is 500KB, or a gigabyte, or tens of gigabytes, because it’s creating an internal system table under the covers and indexing that. So be wary of XML columns: if you’re going to index them, make sure the indexes are sensible and only index certain nodes inside the XML document. I’d also like to add here that if you use the XML data type, the data going into that XML row or column is not going to be the same as what you get out. It modifies the XML data going in, for whatever reason.

Carlos: Say that one more time.

Randolph: If you have XML data that you put in for auditing reasons, for example, when you pull it out it will have been modified somehow. Either the tags will be slightly different or the spacing will be slightly different, so do not use an XML data type to store audit records of XML. If you want to keep an absolute record of what your XML looked like as it came out of a web service, for whatever reason, store it in a varchar or nvarchar column instead, because then it is identical. If you put it into an XML data type, it will modify the data. It is still the same data semantically, but it will be slightly modified; the tags will be slightly different or whatever.
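Randolph’s warning is easy to demonstrate: the xml type stores a parsed, normalized form rather than the original bytes, so round-tripping can change the text:

```sql
DECLARE @raw   nvarchar(max) = N'<order  id="1"><item></item></order>';
DECLARE @typed xml           = @raw;

-- The round-tripped value can differ in whitespace and tag form
-- (e.g. <item></item> may come back as <item/>), so keep the raw
-- string in an nvarchar column if you need byte-for-byte fidelity.
SELECT @raw AS original,
       CAST(@typed AS nvarchar(max)) AS round_tripped;
```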

Steve: So then as far as column store indexes I think that’s something we could probably dedicate an entire episode just to talk about.

Randolph: We definitely can.

Carlos: I think you’re right.

Randolph: A columnstore index is neither an index nor clustered, so it’s complicated. In fact, there’s a series of posts, I think it’s Niko Neugebauer who has written over one hundred of them, on how columnstore indexes work. We could talk for days on that. They are fascinating, and they completely subvert what you think you know about indexes, and data, and row-level storage, and all that kind of stuff. It’s fascinating stuff.

Steve: Yeah, and just a few notes on that: I think it’s one of those things that is great for data warehousing or OLAP-type workloads, and may not always be the best option on your OLTP side.

Randolph: At the same time, you could have reporting-style queries in your OLTP database, and with 2016’s optimizations for columnstore, where columnstore indexes can be updatable and even clustered, you could have some OLAP-style indexes in your OLTP environment. Yeah, there’s so much to think about there.

Carlos: We saw that in Episode 78 with Brian Carrig. They talked about using their column store index in a transactional environment.

Steve: Yup, and one of the misunderstandings I’ve seen around columnstore, because columnstore does the compression on each column, is that people think, “I don’t have to have any non-clustered indexes if I have a columnstore index.” And that’s not true at all. If you’re looking at a bigger table and you’re using columnstore, you may still need some non-clustered indexes on it as well.
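Steve’s point can be sketched like this (hypothetical names; b-tree non-clustered indexes on a clustered columnstore table require SQL Server 2016 or later):

```sql
-- Clustered columnstore handles large scans and aggregations;
-- the b-tree non-clustered index handles selective point lookups.
CREATE TABLE dbo.Sales
(
    SaleID   bigint NOT NULL,
    SaleDate date   NOT NULL,
    StoreID  int    NOT NULL,
    Amount   money  NOT NULL,
    INDEX CCI_Sales CLUSTERED COLUMNSTORE
);

CREATE UNIQUE NONCLUSTERED INDEX IX_Sales_SaleID
    ON dbo.Sales (SaleID);
```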

Randolph: Oh yeah, so do your research. Read all one hundred and whatever posts. Become an expert, then implement them.

Carlos: So we’ve talked a little bit about the implications and how we know what to use. Space considerations we talked about more in the rebuild process, right? We’re going to need additional space in our data file and our log file, things like that; I think we’ve had a pretty good conversation there. I guess one last idea I’ll throw out that we can bat around a little bit. We talked a lot about having multiple TempDB files, and one of the thoughts I’ve seen out there concerns moving your indexes away from where your data resides. One way to do that is to create another data file, potentially on separate storage, and then rebuild those indexes into it.

Randolph: Ok, there are two things I can talk about there. The first thing I wanted to mention is that if you need to move your table into a different file or filegroup, all you have to do is a clustered index rebuild and target that filegroup as the new location for the clustered index, because remember, the clustered index is your data. That’s a handy way to move your table into a different filegroup. The other thing is that, by virtue of that definition, you could have your non-clustered indexes in a different file from your data, and you might want that for improved I/O performance, or if you’ve got your read-only stuff in a different filegroup, or all sorts of reasons. It’s very handy for splitting your load at the I/O level. It’s less of a problem these days, but for large tables it’s a nice thing to split your non-clustered indexes from your clustered index, so that you’re not reading from the same portion of the disk, or the I/O subsystem, or both.
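Randolph’s filegroup move can be sketched like this (table, index, and filegroup names are hypothetical; if the clustered index backs a primary key constraint, you would move it via the constraint instead):

```sql
-- Rebuilding the clustered index onto another filegroup moves the table,
-- because the clustered index is the data.
CREATE UNIQUE CLUSTERED INDEX CIX_Orders
    ON dbo.Orders (OrderID)
    WITH (DROP_EXISTING = ON)
    ON [FG_Data2];

-- Non-clustered indexes can live on a different filegroup from the data
-- to split the I/O load.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
    ON dbo.Orders (CustomerID)
    WITH (DROP_EXISTING = ON)
    ON [FG_Indexes];
```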

Carlos: Right.

Steve: Just a note on that index rebuild for a clustered index to move it to a different filegroup: although that will move the table and the entire index, one of the things that can be left behind is any of the varchar(max) or nvarchar(max) or other large items that are stored outside of the data pages for that table.

Randolph: Yeah, off-row data will be affected. Steve, I don’t know, because I’ve never tried: what happens if you tell it to do the index rebuild with LOB compaction enabled?

Steve: You know, that’s a good question. I have not tried that for a long time.

Randolph: I smell a blog post.

Steve: Yup.

Carlos: Ok, very good. Well, awesome. I think great conversation and obviously there’s a lot more that we could talk about indexes but I think this was a noble attempt at covering some of the basics and getting into some of the nitty gritty as well.

Randolph: Yeah, the fact is that indexes don’t stand by themselves. They are a very important part of everything, including statistics and everything else, so don’t think that once you become an index expert you’ve become a performance tuning expert, because that’s not true. You have to have a very broad knowledge of how things work across a number of different areas to get the best performance out of your system. And there’s nothing wrong with good enough. You don’t have to have 100% defragmented indexes. You don’t have to rebuild indexes; you can reorganize them, or in certain contexts not even reorganize them at all, if your statistics are up to date. There are a lot of tradeoffs that you have to think about when doing your maintenance plans, and indexes form just a small part of that.

Steve: Yup.  

Carlos: Great.

Steve: Very good point.

Carlos: So shall we do SQL Family?

Steve: So how did you first get started with SQL Server, Randolph?

Randolph: Steve, that’s a good question, because when I originally got it I had to think very hard about it. I have been fascinated on a personal level with organizing data. Back in my youth I would create lists and lists of things, CDs, books, all that kind of stuff, and I would do them in the paper equivalent of Microsoft Excel. Then when Lotus 1-2-3 came along I started playing with that, because my dad worked at the bank and he was a 1-2-3 expert. If I’d known about VisiCalc, I probably would have used it. Then I got into Microsoft Access because it was there. It was a very old version of Access, I think version 2, that I started with, and then I started working at a PeopleSoft implementation partner in South Africa, where I’m from. That’s where I first worked with SQL Server and Oracle at the same time, and I was not a fan of SQL Server, I have to be honest. At that time it was version 6.5 and version 7; that’s around the time I joined, so there were issues, because as some of your listeners may know, SQL Server’s engine was rebuilt around the 6.5-to-7 timeframe. In fact, if you look on MSDN you can download 6.5 and you can download 2000, but you cannot download version 7. It is very difficult to come by. There are maybe good reasons for that. I got exposed to SQL Server, that would have been in 1997, around there. So yeah, that’s where I first got exposed, but then I didn’t really play with it until I got to the bank; I was more of an Oracle guy. I got to the bank in 2006, so there was a large gap of not playing with SQL Server, and then I couldn’t figure out why DESCRIBE, the keyword DESCRIBE, wasn’t working in SQL Server. All of you Oracle folks will notice what happened there. I prefer SQL Server now because, as I said before, a lot has changed. It is a much better product than it was in 1997. In fact, I think it has surpassed Oracle. I think it’s better than anything else as well, and that’s because I’m trying to keep abreast of all the new stuff.
I don’t want to be stuck in the past and have assumptions about the product. I want to play around with the new stuff. So, that was a long way of saying 20 years or so.

Steve: Ok. Well I know that in the late 90’s Oracle was a much better product than SQL Server.

Randolph: It really was.

Steve: Yup, and that has changed. That has changed significantly in the last 17 years.

Randolph: Yeah, it has.

Carlos: If you could change one thing about SQL Server what would it be?

Randolph: I had a pat answer for this, but I’m going to change my mind. The pat answer is in-memory processing in SQL Server, in-memory OLTP, which was code-named Hekaton. There is a transaction log operation that happens even for stuff that you don’t want kept. It’s still being backed by disk somewhere; it may not be in the transaction log itself, but it is still being backed by disk somewhere. I would like to have in-memory objects that have nothing on disk backing them at all. They must just be there, and if the server goes down, oh well, tough luck.

Carlos: Alright. Yes, I remember seeing you in the Bob Ward presentation at SQL PASS talking about the in-memory OLTP.

Randolph: There’s some very, very cool stuff with in-memory OLTP that I’ve been playing with, and I’ve been blogging about it as well on my blog, bornsql.ca, which covers some of the in-memory things I’ve found. Interestingly, I found another issue, which I’ve spoken to Jonathan Kehayias about, where temp tables are slower than table variables in certain cases. So I’m investigating that at the moment; it came out of my in-memory investigations.

Steve: Oh interesting, I’d like to see what you find there.

Randolph: So that was my pat answer about what I would change in SQL Server. What I would actually change is having a TempDB per user database as a configurable extra.

Steve: That would be nice.

Randolph: In other words, it doesn’t have to be the default, but if I need a TempDB for a particular database, I would like to have it as a separate entity so that I can manage it separately and keep track of the objects that are happening for that particular database. The architecture for that is going to be really difficult to do, so I don’t know if Microsoft is going to do it, but that’s what I would like.

Steve: Alright, so what’s the best piece of career advice that you’ve received?

Randolph: Oh, it’s going to be a long answer as well, sorry. “Let your work speak for itself” is the first and foremost one. It doesn’t matter what people say about you; if your work can speak for itself, then you don’t have to say anything. That’s the best piece of advice that I’ve had. But the other is “Everything you say and everything you do is being recorded somewhere,” so please treat people with respect. Treat people as you would have them treat you, and don’t say things on voicemails that can be played back to you at meetings with the CEO, with a “Randolph, don’t do that.”

Steve: I like that too.

Carlos: We won’t ask for the personal experience that led you to that advice.

Randolph: At the end of the year, or every now and then when it’s a quiet time on my blog, I write about career limiting moves, and that was one of them. So there are a bunch of career limiting moves that I can recommend you avoid, not because it’s abstract good advice, but because I’ve done them. So don’t do that, because I’ve done it, and that voicemail will be one of them.

Steve: Alright.

Carlos: Our last question for you today. If you could have one superhero power what would it be and why do you want it?

Randolph: I thought about this and I honestly don’t know. I liked Pinal’s answer, where he could touch a server and know what’s wrong with it. The problem is that whenever I touch a server, if it’s going to break, it’s going to break then and there. Maybe I want the reverse of that superpower.

Steve: To just know without touching the server.

Randolph: Yeah, preventative maintenance is one thing. But if I’m in a room and it is going to break, it is going to break while I’m there, which is good, I guess, in a way, because then you know we can figure it out then and there. But I’d like the reverse of that; that would be cool.

Steve: Ok.

Carlos: Well, awesome. Randolph thanks so much for being with us today.

Randolph: Well, thanks for having me. It’s a topic near and dear to my heart, indexing and maintenance and stuff like that, so I’m glad I got an opportunity to speak to you guys about it.