Storage testing and validation is something we should add under our roles and responsibilities as DBAs. Every database we manage needs storage, but how often do we kick the tires? Many times we're simply told to go verify that array, or we're doing a POC and testing new storage, but are we really testing anything beyond connectivity? In this episode of the podcast, we chat with Argenis Fernandez about how he goes about testing a SAN array and the important metrics you should consider for your storage. If you are still using SQLIO or even DiskSpd to test the I/O for your SQL Server, don't miss today's episode.
Episode Quote
“Storage testing and validation is actually something we end up doing as DBAs.”
Listen to Learn
- Why the now-deprecated SQLIO tool was poorly named
- Why you should question your storage vendor about allocation unit sizes
- Why you should consider garbage collection when you test SAN arrays
- How compression and deduplication make testing more complex
- Why testing storage in the cloud is difficult
Argenis on Twitter
Argenis' blog at Pure Storage
Hyper-convergence
DiskSpd Utility
VDBench
About Argenis Fernandez
Argenis is a Solutions Architect with Pure Storage, a Microsoft SQL Server MVP, a VMware vExpert, and a well-known face in the #SQLFamily community, especially since he's now a Director-at-Large for PASS. He's a regular speaker at PASS events, including the PASS Summit. He also founded the PASS Security Virtual Chapter.
Untranscribed introductory portion*
Carlos: Argenis, again, welcome back to the podcast.
Argenis: Thank you so much for having me again. You guys, it's always awesome. You know, I actually requested this one. So, to be upfront with everyone in the audience, I did tell you guys to please go ahead and schedule me again because I wanted to go on yet another rant. I went on one rant on the previous episode, and I'm ready to go on another one, man. Let's go!
Carlos: Very good, and as you all know, compañeros, what Argenis wants, Argenis gets. And of course we're happy to have him back here on the show today. So last time we talked it was kind of just an SSD conversation, talking a little bit about disk. And today our topics might be wide-ranging, but we want to kick off with testing: testing the arrays or the disks that you're using in your servers.
Argenis: Yeah, so storage testing and validation, which is actually something we end up doing as DBAs. You know, a lot of times we're basically told, "Yeah, you need to go verify that array." Or we're doing this POC, we're testing this new storage, you need to go ahead and test it. And so, you guys answer this for me: what's your favorite tool to use when validating storage?
Carlos: Yeah, so they deprecated SQLIO, so DiskSpd is the one that comes to mind there.
Argenis: Ding, ding, ding, right, that's what everyone reaches for. So SQLIO. I loved SQLIO, rest in peace now, because it was the wrong name to use, quite possibly the most awful name for that tool, because it actually had nothing to do with SQL Server whatsoever. It just so happens that somebody on the SQL team wrote that utility, but it was never meant to validate SQL Server patterns. It was just a storage benchmarking tool, basically. It wasn't even good for baselining, because you couldn't take, well, I guess you could save the outputs and then compare them to some other output that you took in the future, right, to see how you're doing, etcetera.
Carlos: Exactly
Argenis: But you don’t see those things in SQL Server like that. For example, you would get all these fancy scripts from people out there, you know, that would run SQLIO at multiple block sizes and multiple frets and. You know, you will get some number so, what’s funny is that as DBAs we would actually never know if that was good or not. We would be like, “Yeah, we got some numbers. They looked alright. Let’s go.” It’s actually kind of funny at least that was my experience way back when I was giving a, I can’t remember who the manufacturer of that array was, and I was basically told here, “You go test this stuff.” I was like, “Ok, what do I do?” So naturally, I found SQLIO and then SQLIO is what I ran and I got a whole bunch of numbers and I’ve got a bunch of pretty graphs and then I showed them to the storage guy and the storage guy is like, “Yeah, yeah, yeah.” Ok, what does that mean? Is that good, is that bad? Are we, you know, if I put my databases on this thing are they going to work? Or you know, what is it? So on the era of magnetic storage and this is changing super fast like way faster than any of us expected. As you guys know, I work for the flash arrays all the time so I’m not really motivated or particularly interested on validating storage on magnetic anymore. But back in the day when we have the old HDD comprised arrays, magnetic arrays or spinning rust or whatever you want to call them. We wanted to test using something that made sense so, SQLIO would be a thing because they would actually generate some workload against the array, you know, regardless of the fact that it would be a patterned dataset which is actually very important to understand. SQLIO would generate a pattern of data to be send down to the storage array, not even close to real dataset, not even close. At that point, whatever performance characteristics you would see on your storage array at that point you will be happy with because you would basically ask the storage guy, “Hey, I’m getting, I don’t know, 800 a second and that thing. Is that good?” The storage guy would be like, “Yup, that’s good.” “Alright I’m done. I validated that my performance was ok.” You will look at the latency and see if the latency will be acceptable of different block sizes. And you would commit the most frequent mistakes of all which would be tying your allocation unit size on NTFS to the actual block size that gave you the best latency. That’s what everyone would do, like they would actually make that mistake right there and then. You would go ahead and see what SQLIO told you in terms of latency for a given IO block size. And you would say, “Yup, I need to format my volume at this particular allocation unit size because that’s what’s going to give me the best performance.” That couldn’t be any further from the truth, like literally I have no idea why people got stuck in on that myth. And I actually have a blog post that I’m sure we can reference from the podcast here that mentions that I/O block sizes and SQL Server in general.
Carlos: So Argenis, just to be a knuckle dragger here.
Argenis: Yeah, please.
Carlos: Because everybody knows, isn't it that you just run this test and it's telling me, "Hey, this is the allocation unit size that you should be using." And then you're like, "Ok, well that..."
Argenis: That’s not what the test is telling you. The test is telling you that a given I/O block size your storage array behave in a certain way. The I/O bock size as you all know has nothing to do with the actual allocation unit size of that of an NTFS users. They are two completely two different things so it makes no sense for you. So if you got the lowest latency at 4K you will not going to format your NTFS allocation unit of 4K. That’s just not it, right? Because allocation units are meant to be used for allocation, that’s what they are for. And so larger allocation units sizes, so 64K, and quite possibly larger than that with newer file systems, like ReFS which is becoming a thing nowadays. You would not consider using smaller allocation unit sizes because you want less metadata for your file system as long as that metadata doesn’t become a contention point for allocations. This should not be a contention point because this doesn’t work like DFS [inaudible –] You’re not going to be continuously allocating new expanse for your database all the time like one by one, right? Doesn’t become a contention point for your database in terms of growth. Yeah, Steve, you wanted to ask something go ahead.
Steve: So then with that, I mean, it used to be you would run SQLIO and then format the partition on your disk to match whatever gave you the best throughput. And then along came the rule that in general you always format with 64K.
Argenis: Just go 64K. Yeah, that's what you should do. There are other storage arrays that still tell you to format at a different allocation unit size. I literally have no idea why they ask you to do that, because I don't even know if these people are guided by the same things that I'm mentioning here, right? At my company, we just told everyone: format at 64K, you'll be fine. It's going to perform just fine. Other storage array vendors tell their customers to go 4K on the transaction log and 64K on data files. I have no idea why they say that. I don't have any idea how they architected it at such a granular level that it actually matters. To me it shouldn't matter on any storage array, for the most part. Now, would you see differences with SQLIO? Yes, because you are allocating a file upfront for SQLIO. That's one of the things that matters, right? SQLIO actually takes time to allocate that file, and DiskSpd does that as well. So were we actually swayed by the fact that SQLIO was creating a file? I think that is actually part of the problem.
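For anyone who wants to check or apply that 64K guidance, here is a minimal PowerShell sketch. It assumes a Windows data volume mounted at E: whose contents you can destroy and a vendor who has confirmed 64K is appropriate; verify the commands in your own environment before running them.

```powershell
# Check the current NTFS allocation unit size ("Bytes Per Cluster") for E:
fsutil fsinfo ntfsinfo E:

# Reformat E: with a 64K (65536-byte) allocation unit size -- this destroys everything
# on the volume. Run from an elevated PowerShell session.
Format-Volume -DriveLetter E -FileSystem NTFS -AllocationUnitSize 65536 `
    -NewFileSystemLabel "SQLData" -Confirm:$false
```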
Carlos: Other examples might include: if I'm moving or upgrading from one server to the other, I already have a large file and I have to allocate all of that space, right? In the case of growth, I start with a very small database and I'm growing in chunks. Those patterns are also different.
Argenis: Well, I mean, at that point you need to start thinking in terms of the file system, right? You have a set of database files that are sitting on a given file system. If you're migrating a database from one server to the other and you're just using a restore, the destination file system doesn't have to match the allocation unit of the source file system by any means. If you're using something like Storage vMotion, or Storage Live Migration on Hyper-V, or whatever it's called in the Microsoft world, and you were just to migrate something, you would end up with that same allocation unit size, because you're performing basically a block-by-block copy of the entire thing, including the file system. Alright, so, new ways of doing things, etcetera, etcetera, but what matters in the end is, me personally, I'd ask your storage vendor what it is that they like. If they like 64K, it's probably safe to end the conversation right there and format at 64K. If they ask you to format at 4K, ask them why. Why is that a thing? Are they doing the right thing by telling you to format at 4K? Anyway, rant over on that one, guys, let's not go into that. Let's take a step back. We started talking about SQLIO, right? And we said rest in peace, it had the wrong name to begin with, it generates patterned data, and it was replaced by DiskSpd. But guess what, I love DiskSpd and I kind of hate DiskSpd at the same time, because you know what DiskSpd does? It also generates patterned data. Why does that matter? Most of the storage arrays they sell today, pretty much every single one out there, has intelligence built into the controllers, so whenever you send a patterned dataset to one, it will basically just mark a little metadata entry for it and nothing else, so you actually wouldn't be stressing, and this is super important for everyone to understand, you would not be stressing your final media. So if you're laying data down, pushing data down to your hard drives, you wouldn't be stressing them. If you're pushing data down to SSDs, you wouldn't be stressing those either, because the controllers will basically drop all that data at the front end and just say, "Yup, I got some activity going on here, but it wasn't enough to merit using the SSDs or the hard drives at all." That's kind of very important, right?
Steve: So then in that case DiskSpd may still be good for testing, say, local drives that are installed in a server, but not a storage array? Is that right?
Argenis: Here is where DiskSpd is good: yeah, if you have local storage, obviously DiskSpd is good. However, if you have something like, I don't know, Storage Spaces Direct, or whatever might have some other backend behind the scenes, you still have to think through all of that, right? You may have an intelligent controller on the other side managing disks for you. You don't know that. As a DBA you're not told exactly how your storage stack is laid out. So if you're connected to some storage array, they're not giving you the specifics of whether that stuff is doing something a certain way or not. So your job as a DBA is to make sure that you test with the best dataset possible and the best workload possible. So is DiskSpd the thing that's going to generate the most realistic workload and the most realistic dataset for you? I think the answer to that is a flat, resounding no. This is why I only use DiskSpd to validate plumbing, meaning: I'm getting good throughput, I'm getting good latency really quick, my channel is good, I have a good line of communication between my host and my storage. That's all I am testing with DiskSpd. If you have an old-fashioned, just a dumb SSD, for example, that you attached directly to a server, and you want to run DiskSpd against that, yeah, that's good, but is that actually going to test everything on that SSD? I would dare say no, because there's more to it. An SSD is actually comprised of NAND, right, NAND chips, and that NAND actually undergoes garbage collection. The SSD has a controller, and that controller will decide at some point to perform garbage collection on that NAND, because NAND is not byte-addressable. NAND is actually addressed kind of like database pages: entire pages are written at a time. You don't write on a byte-by-byte basis. So that actually makes a huge difference, right? When you're testing something that is byte-addressable, like a regular hard drive used to be, byte-addressable, right, like sector-addressable, you wouldn't trigger garbage collection on an HDD, because there is no need for it. But on an SSD you will trigger garbage collection, and the only way to stress an SSD so that you make sure you trigger garbage collection on it is by priming it: filling it with data and then starting to write data to it again. That way you trigger the garbage collection mechanisms, ok? Do you see how storage testing is a complex story? It's not something that's straightforward. It's not just running DiskSpd and seeing what kind of numbers you get. It is so much more than that. It's running a continuous workload at different I/O block sizes with a given reducibility of the data. Actually, we haven't even talked about this. Reducibility of the data is a thing on all-flash arrays, and it matters a whole lot, because most of the storage arrays out there are doing deduplication and compression of some sort. Ours does, right? The company I work for does it. EMC does it, SolidFire does it, Nimble Storage does it, THL does it. Everyone does compression, right? So when you want to validate that a storage array is going to do the things you want, it's going to show certain performance characteristics for a given workload and a given dataset. You want to test with the real workload and the real dataset. So again, DiskSpd doesn't take you there. DiskSpd only helps you test the plumbing.
Make sure that whatever is between you and your final storage device is not a bottleneck somewhere. It is really good for that, and it is also really good for testing dumb, byte-addressable storage. Here is another thing that a lot of people don't know: DiskSpd actually has a way to generate random datasets. You can tell DiskSpd, go ahead and generate a random buffer, but getting it to work, there's actually a big bug in DiskSpd, getting it to work is really complicated. I don't remember exactly all the conditions you have to meet. You have to use a file of a certain size, and then the buffer that you pick has to be evenly divisible by the size of the file, or something like that; I can't remember what it was. Because it's buggy and not quite perfect, you have to meet all these conditions for it to generate a random workload. But then that random workload will be so random that it won't actually be representative of your real dataset either. So it will be the complete opposite of what I said before. You could start with a patterned dataset, which is very easy for the controller to drop, right? And at the complete opposite end of the spectrum you're sending a completely garbled, high-entropy dataset that makes no sense whatsoever and doesn't actually reflect your real workload. This is why I keep telling people: you want to test with a real workload and a real dataset. So restore your database and replay your queries against it. That's the best way to actually validate that your storage array is going to do the things you want it to do. And you can see how it reacts to your workload, and keep it on a continuous loop, right? Run it over and over and over. Increase the frequency of your workload if you can. There are multiple ways, multiple things that you can do, to replay your workload and kick off more threads and stress the storage array a little bit. Or if you can't do that, there's synthetic testing you can do against the storage array that will kind of mimic database access patterns, but it won't be your real workload. In the end, what I want people to do is forget about running DiskSpd, forget about running SQLIO. If you want, run DiskSpd really quick just to make sure that you've got good connectivity to your storage array. But in the end what you want is to replay your production data, your production datasets. So restore your production dataset and replay your production workload against that storage array. That's really what's going to give you the best picture.
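As a concrete illustration of the "test the plumbing only" use of DiskSpd, here is a hedged sketch of the kind of command line being described; the file path, sizes, thread count, and queue depth are arbitrary placeholders, and the switches should be double-checked against the DiskSpd documentation for the build you have.

```powershell
# Quick throughput/latency sanity check of the path between host and storage.
# -c64G  create a 64 GB test file      -d120  run for 120 seconds
# -b64K  64 KB I/Os                    -o32   32 outstanding I/Os per thread
# -t8    8 worker threads              -r     random I/O
# -w30   30% writes                    -Sh    bypass software/hardware caching
# -L     collect latency statistics
.\diskspd.exe -c64G -d120 -b64K -o32 -t8 -r -w30 -Sh -L E:\disktest\plumbing.dat
```

Recent builds also have an option to fill the write source buffer with random data (the -Z&lt;size&gt; switch) so an intelligent controller cannot simply dedupe the pattern away, although, as noted above, even that is still a synthetic workload rather than your real dataset.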
Steve: So at that point, when you say replay your queries, replay your workload, I assume you're talking about the Distributed Replay controller at that point?
Argenis: That’s one way of doing it, right? You also have, Orca which is also another way of doing it. There are other tools out there that, you know, from vendors that allow you to do that. I believe Quest Software has one if I’m not mistaken. There’s others right.
Carlos: What’s that second tool that you mentioned?
Argenis: I’ll just mention three that I know about. And even if you don’t have time to do that you can just run an ETL, right. Which most people have an ETL or you can write very intensive read queries or write queries provided that you do things cold cache. So you flash your buffer with DBCC Drop Clean Buffers then run your queries or your I/O intensive queries against that restored copy of your database and see what kind of behavior you see. But more importantly when you’re testing and validating a shared storage array it’s not one workload that’s going to be hitting that storage array so things get even more complicated. And this is why, you know, I could be talking about this stuff literally for weeks.
Carlos: Even then, Argenis, you just flushed the buffer cache, but that's on the database server. And you mentioned even, like, the caching at the SAN level. Now, the first time you pull that data it's not going to be in that cache. But how do you...?
Argenis: Well, that’s another thing, right? If you have caches on the SAN side how do you invalidate those caches, right? If you’re going to have a cache into your SAN do you want to use all that cache for writes or do you want to use all that cache for reads. So it depends on your workload. So what I tell people is, you know, what kind of workload do you have? If you have a data warehouse, there are going to be times during the day when you do ETL and you load data into that data warehouse that you want to have that cache to be just for writes as much as possible. And then at some point, you’re going to start pruning that database and at that point that cache be better used for reads. Is that the best way to set it up? It depends on your storage array. It really really does and this is why working with your system engineers from your storage vendor matters a whole lot. You need to understand what knobs are available to you or even if you want to tweak them or not. When you’re validating a storage array it’s very important that you understand that there’s a cache or not. So ask that question from your storage guy, “Hey, is there a cache involved in this thing?” Because if there is a cache you want to make sure that you do enough testing that you over run that cache and see what the behavior will be after you overrun that cache and just start operating it a different speed. Because that’s something that storage people love to do. They love to abstract you from the fact that you’re running on this low storage by throwing another to your storage in there. Is that solving the problem? Maybe, right. Maybe it does fix your problem for you. But is it a long term solution for your workload that’s continuously growing and continuously getting more aggressive? Maybe not.
Carlos: Well, it’s the old adage. I mean, it’s just a new version of throwing more CPU and memory at the server, right?
Argenis: Right. Tiers of storage have always been a thing. And even at the CPU level you have different kinds of caches: you have an L1, an L2, and an L3 cache for instructions and data on the CPUs. So this is not a new concept, right, by any means. It's just, how many times do you have to change your storage so you can get a solution going? Which is also a very big thing, right? How many times do I have to migrate my database before we actually run on something that actually supports my workload? It's a pretty common problem out there. A lot of people hearing this podcast will probably identify with this: I migrated my database once when we switched to new storage, six months into it we realized that it sucked, and we had to migrate it again. So data migrations are very much a thing. But in the end, it's all the storage testing and validation that you did that's going to give you the confidence that you're using that storage array the right way and making the right purchasing decisions. It's very important. And going back to the point I was making before, reducibility of the workload is very much a thing, right? If you just create one database that's completely empty, what are you testing there? You just created a database that's completely empty. What are you testing there? Absolutely nothing, right? If you're validating the performance characteristics of a TDE-encrypted database that you're placing on an array that performs deduplication, compression, and things like that, they're going to be different than if you send a database that's in the clear, without any compression whatsoever. The performance characteristics will be different, so you want to validate both. You want to make sure that your array reacts a certain way to your datasets. So if you have TDE-encrypted databases, make sure that you restore TDE-encrypted databases to that array and run your performance tests on top of that. The future is filled with arrays that perform data reduction all the time, everywhere. And even if your array only offers certain features, and you know you're going to use some of those features, for example, there are some arrays out there that have data services turned off but let you choose to do compression only, then make sure that you enable compression on the volumes you are going to use for your databases and drop your data in there in whatever final form it's going to take. So if you place already compressed data on a volume that will perform compression, what kind of performance characteristics are you going to see out of that? In general, it can be kind of eye-opening to see how storage arrays react to different datasets. A lot of people don't think about this because they just think that running a quick DiskSpd will get them out of the way quickly. It's not what you want.
Steve: So let’s say you’ve got your production server running on a new storage array that’s shared and you start out with SQL Server there and everything is running great. And then overtime the SAN administrator decide to load that storage up with additional systems using it. Bunch of backups right in there, or file stores or other systems using it. One of the things that I’ve seen when this happens is that your SQL Server performance degrades over time. And I guess as far as testing that one of the things that I’m really curious about is how do you kind of test on a regular basis to know that your production performance is not degrading there on the actual I/O side of the storage.
Argenis: So, I mean, you nailed it when you mentioned that this happens very frequently, and that's just because people are not doing their homework, right? Every single storage array has limits. There's a given amount of workload that you can fit on it at first, and it works just fine. As you add more workloads to that array, you start seeing different performance characteristics from it, because you're adding different workloads with different patterns. That is just natural. All of a sudden you have your data warehouse running at full throttle on that array, you start adding a bunch of VMs that are going to do, I don't know, you're running Exchange on those, or SharePoint, or whatever, and you're going to see a different kind of behavior from that storage array. So how do you test that? The only way to test it that you can be comfortable with is mimicking what it is actually going to be as much as possible, and as much work as that takes, and as painful as that sounds, it's the only way to make sure that your storage array is performing a certain way. So take your real workloads with your real datasets and hit your array with that. If you want to fit an additional workload and you don't even know what that workload looks like, then you do have some homework to do. You need to understand how busy your storage device currently is and whether it will accept that additional workload without tipping over. Or will it actually start generating a lot more latency, or will your throughput drop because of that additional workload? Because now, you know, we had a big fire hose before, and now we have two people drinking from it and not just one. It kind of works like that. Every single computing element in your data center has limits, and the storage array or your storage devices are just one more. You need to understand what those limits are, and when you are going to feed more workloads into it, you need to make sure that you do your homework, understand how busy that storage device is, and know that when you are ready to drop that new workload in, it fits nicely and it's not going to tip it over.
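One way to do that homework on the SQL Server side, not mentioned in the episode but standard practice, is to snapshot the per-file latency numbers the engine already tracks in sys.dm_io_virtual_file_stats and watch for drift as more workloads land on the shared array; the instance name and output path below are placeholders.

```powershell
# Append a point-in-time sample of per-file I/O latency to a CSV baseline.
# The DMV counters are cumulative since instance startup, so compare deltas
# between samples rather than the raw averages.
Import-Module SqlServer

$latencyQuery = @"
SELECT  GETDATE()                                             AS sample_time,
        DB_NAME(vfs.database_id)                              AS database_name,
        mf.physical_name,
        vfs.num_of_reads,
        vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)   AS avg_read_latency_ms,
        vfs.num_of_writes,
        vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0)  AS avg_write_latency_ms
FROM    sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN    sys.master_files AS mf
        ON mf.database_id = vfs.database_id AND mf.file_id = vfs.file_id;
"@

Invoke-Sqlcmd -ServerInstance "PRODSQL01" -Query $latencyQuery |
    Export-Csv -Path "C:\baselines\io_latency.csv" -Append -NoTypeInformation
```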
Carlos: Yeah, it’s interesting even recently we put out our SQL Server checklist and you know lots of administrators are trying to standardize our processes and. You know cookie cutter in a sense, repeatability, automation, right that’s the big mantra. And it almost sounds like, “Eerrk, you know, hold the presses here.” There are certain areas where, not that you can’t automate all of it, but at least you’re going to have to understand. You’re going to do some homework first and then choose your own adventure if you will.
Argenis: You nailed it. That’s exactly what happens. And do you know what a lot of people think that they can get away with not doing this in the cloud. “Oh, it’s just a cloud so it’s fine.” “Oh it’s not fine alright”, let me tell you. And the cloud you have a bunch of, you know, Amazon may not tell you, Microsoft may not tell you, Google may not tell you but they do all these things that we talked about, right? So they do their caching, the tiers, and they have this devices on the behind the scenes that do a whole lot more than just storage. They might have actually data services involved in all that. So it’s not as straight forward to test storage on the cloud either. What I like to do is just tell people remember the basics. You could get cache some idea with the throughput and what the latency will be like by using kind of dump synthetic tools like Diskspeed or Iometer or, I don’t know, CrystalDiskMark which in the end uses Diskspeed underneath the hood, or Atol, or whatever. There’s like a hundred of them. You can definitely use those. We’ll give you an idea but testing with real workloads and real datasets is going to be the thing. That’s why I get people, I literally have a friend who was spanking me on Twitter the other day. They bought this hyper-converge solution. It looked great when they run SQLIO on Diskspeed against it. But guess what that thing was doing? Exact same thing that I mentioned before, right, it was dropping the patterns of the controller side. So they never actually saw it perform with real workloads and real datasets. They put their workload on it, they started seeing all this latency and the workload was actually kind of in the same way that it was when it was running on magnetic. Because they didn’t test the fundamentals which is test with your real workloads and real datasets. I keep hearing hyper-converge by the way. It’s a thing now, like everyone is talking about hyper-converge. Need to remind people that hyper-converge is great but not necessarily for databases. One big problem with databases is that they need low latency. You’re going to be doing transaction log writes. You’re going to be doing a lot of reads and those reads better be performed quick. Because that quick turnaround from storage is the thing that’s going to give you better performance and so.
Carlos: And so I’ve been hiding under a rock Argenis. Describe to me what hyper-convergence is?
Argenis: So hyper-convergence, it's kind of the term now for describing a new kind of architecture where compute and storage are all the same thing, and even the networking is kind of hidden. On the same compute nodes you can have storage attached, and everything is kind of flat, and it gets you that kind of cloud experience when it comes to provisioning storage. But in the end you're sharing resources between storage nodes and compute nodes. So you can end up in a situation where your databases are running on the same exact node as your storage, and the actual serving of storage takes compute resources, and those compute resources end up colliding with your databases. There are obviously other kinds of designs or features in hyper-converged where you can have storage-only nodes, but that's really no different than having your own storage device out there; they're basically spinning storage back out into its own node. It's basically having fewer things to configure on your network, so it's more like the appliance approach to computing, where you just buy something, hook it up to your network, and two seconds later you have it up and running. Well, provisioning is only part of it, right? Provisioning and getting things to run fast is one part of it, of course, but it's the ongoing operation and ongoing performance that you're going to need out of that thing that really matters a whole lot. So if you're looking at hyper-converged solutions, please, please, please make sure you test with the right tools. And if I can mention one synthetic tool that actually works really, really well: I personally hate Oracle, but this is one thing that comes from Oracle that's quite decent. It's called Vdbench, V-D-Bench, so Victor David Bench. That is quite possibly the best storage testing and validation tool, and it will give you a better idea of how your workload and your dataset are going to behave on whatever storage device you're testing. It actually allows you to specify deduplication and compression settings for your workloads. So you can say, oh, this is like a database, so it will dedupe 1.2 to 1 and it will compress 2 to 1, or it will compress 5 to 1. Or, I'm testing VDI, so I'm going to have a lot of reducibility in my workload and it's going to reduce 19 to 1, and I can test it that way. And then you can also generate workloads by saying: I have this kind of I/O block size at this time of day, I have these peaks and valleys. You can also specify that the target device needs to be filled with data before it can actually be tested. So there's a whole bunch of sweet, sweet features that you would love to leverage when testing and validating your storage.
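To make the VDBench idea concrete, here is a sketch of a parameter file created and launched from PowerShell. The ratios, paths, and workload numbers are made-up placeholders, and parameter names such as dedupratio and compratio should be checked against the VDBench documentation for your version.

```powershell
# Hypothetical VDBench run: a 70/30 read/write, 64K, fully random workload against a
# test volume, with data that dedupes about 1.2:1 and compresses about 2:1.
$paramFile = @"
dedupratio=1.2
dedupunit=4k
compratio=2

sd=sd1,lun=\\.\E:,size=100g,threads=16
wd=wd1,sd=sd1,xfersize=64k,rdpct=70,seekpct=100
rd=rd1,wd=wd1,iorate=max,elapsed=600,interval=5
"@

Set-Content -Path "C:\tests\sqlish.vdb" -Value $paramFile

# VDBench ships as a script/batch file in its install folder; adjust the path to your setup.
& "C:\tools\vdbench\vdbench.bat" -f "C:\tests\sqlish.vdb"
```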
Steve: Ok, great. So one of the questions I have here is that it seems like, when we talk about testing with an actual real workload, that oftentimes in my experience happens after someone has already made a decision on purchasing the storage, and it arrives in the environment and it's "here's what you have, figure out how to best use it." I mean, is there a process that you generally go through, like pre-sales, when people are looking at different storage, where you can do this kind of testing?
Argenis: So most of the storage vendors out there will want you to perform a POC. Some of them actually don't like it because it takes resources away from them, right; they have to ship you an array, they have to be on top of you, and you only have so many days to complete that POC, etcetera, etcetera. Most of the storage array vendors, or storage vendors in general, not just storage arrays, will happily give you something to try, and it's during that period that you're able to validate that storage array or that storage device. If your storage vendor doesn't do that, and your management was sold on that storage array and you get it right away, I need to remind you that most things you can actually return. So if you get something, test it right away. Drop everything you're doing and test it, because you don't want to pay for this in the long run, especially if you're the DBA who's going to be babysitting the processes that run on top of this thing. Because imagine you end up with a storage array or a storage device that does not give you the performance you need. You are going to pay for that dearly. You're going to pay for that with your weekends. You are going to be suffering. You are going to have to watch those backups, and watch that maintenance, and watch that update stats statement. It's going to be painful, right? So make sure you test things as much as possible, that you inject yourself into the process of acquiring the storage as much as possible, and that you become knowledgeable on the storage side as much as possible, because storage is fundamental to every database out there. Every single database is backed by some sort of storage, and if you don't understand how that storage works, and you don't get into it even a little bit, then you're going to pay for it down the road.
Steve: And I think that’s where if you’re working in a shop where there’s a wall that’s been built between the storage administrators and the DBAs. That’s where you get the most trouble but when they’re the same person doing both sides of it or that there’s no wall and there’s very good communication between the DBAs and the storage admins then you’re able to make those kind of things happen.
Argenis: This is something that you have to foster. You have to make sure that whatever walls exist in your environment today, you can overcome them. Become best friends with the storage people. You know exactly what they have. The storage people know why you care about certain things. They understand why databases are the pain in the butt that they are for storage. They are, they are a pain in the butt; the storage people hate DBAs for a reason, because databases are nasty, especially, you know, taking backups, right? Who has a storage guy that loves to see the performance of the storage array when backups are running? Nobody, right? Because backups are just nasty on every single thing. Backups are nasty on the source storage array, backups are nasty on the network, and backups are nasty on the target storage array, or a JBOD or whatever; it may not be an array, but whatever you end up backing up to. So it's a thing. People need to understand that if you don't talk to the people who work in your company who do different things than you do, and who in the end have power over and oversee the things that matter so much to you, you're just hurting yourself. We could go on another rant about why you should be best friends with your developers as well, but it's basically the same thing, right? Everyone is working towards the same objective, which is making sure that your company continues to operate at the highest level and you can crank out features and new functionality as fast as your business wants. So don't become an obstacle right there. That's actually something that I advocate to everyone: don't become an obstacle on the database side, don't become an obstacle on the storage side, don't become an obstacle on the virtualization side. Offer solutions. Tell people, "Hey, this isn't working, let's go and see what we can do to make it better." Engage more people and make sure that everyone understands what's actually going on in your environment, because the last thing that anyone wants is surprises: "Oh, we never told anyone that this actually wasn't working." You need to make sure that everyone knows the status of your environment out there.
Steve: Right, right.
Carlos: Sounds like very good career advice.
Steve: Absolutely.
Argenis: Alright, man, we went on a little rant over there. Man, I need to catch my breath here.
Steve: Oh, it’s all good stuff.
Carlos: Well, thanks again for being with us today, Argenis.
Argenis: It’s my pleasure. Thanks for the time you guys. I really really really wanted to do this because you know having, I make sure people kind of got a taste of what it’s like to pay so much attention to the little details on the storage side. A lot of us happen kind of complacent in the past and said, “Oh, I just got some storage array that I own, some storage device that I got from this guy. I’m just going to quickly run something. Ah, numbers look good. I’m moving out.” That’s not it. It’s so much more to it. So you need to understand how things work and why testing in a different way matters so much.
Carlos: That’s right. So before we let you go we’ve had you on the shows, we’ve done our SQL family questions with you already. I guess for those who may not know all of your history. Give us the brief story how did you first get started with SQL Server?
Argenis: The year was 1998. I remember that Google was barely getting started back then. I remember when Google came out I was like, "Oh, I'm not using Yahoo anymore. I'm not using AltaVista anymore," or whatever we were using back then. That felt awesome. I remember it was SQL Server 6.5 that was powering the backend for this product that I was working on. I was working as a data administrator at an ISP, an internet service provider, the largest one in Venezuela, and the mandate was to migrate from a FreeBSD environment to a Windows NT 4.0 environment. So you guys can imagine, right, that was really, really controversial, so we got a lot of pushback on it. But in the end that was the mandate that came from management, and we purchased this product from Microsoft called Microsoft Commercial Internet System, MCIS, right? And the version of that was 2.0. So that suite of products was NT 4.0 with the Option Pack, so IIS and all of that, and you had an additional set of binaries for running an SMTP server, a POP server, an LDAP server. And that LDAP server was powered by Site Server Commerce Edition, Site Server 3.0. So, you know, ancient folks like me that worked with that technology way back when will remember all of this stuff. Actually, I can't remember if it was Site Server 3.0 or an earlier version on the first version of MCIS that I used. The truth is that it was powered by SQL Server 6.5 at first, and then they added support for SQL Server 7.0. So that's kind of how I got started with it, maintaining that database, and I was basically one of the, there were many of us, accidental DBAs for that database. And so we were sent up here to Redmond, that was January 1999, when we came here to Redmond for the first time. We spent a month here training on it, and we were basically given a crash course on Windows administration and SQL Server administration. So that's how I got started with SQL way back when. And, you know, I changed shops a hundred different times, and I made my way through sysadmin for a little bit, and I did development for a little bit, and I did management for a little bit. I even did consulting for Microsoft at some point. But through all of those positions I was always working with SQL Server in some way or another. So, you know, it's been 18 years now. It's been a great, great career, and I just went to the MVP Summit. I know some of you guys were there. Man, the MVP Summit was awesome, because they laid out in front of us the future of SQL Server and what it looks like. It's now going to run on Linux. So if you guys out there haven't seen that, you need to go take a look, because I think it's going to be all the rage. SQL Server on Linux is going to be a huge, huge thing.
Steve: Oh yeah, I think so too. I know I’ve got it running on a virtual machine here on Linux and it’s pretty nice.
Argenis: It is, it is. So that's kind of the backstory of how I got started with SQL Server way back when.
Steve: Excellent.
Carlos: Awesome. Well, again, Argenis thanks so much for being here and taking some time with us today.
Argenis: Hey guys, thank you so much for letting me do this again. I really, really appreciate that you gave me another slot on your podcast, which is wonderful, by the way, to get some more of my opinions out there. And I know the previous podcast actually generated a lot of controversial discussion out there about where my stance is. I actually owe some people blog posts to follow up on some of the points that we talked about, and they will probably ask me for a blog post on the things that I talked about today. So I promise, I promise, I promise I will make those blog posts happen, and we'll get people really, really interested in testing their storage the right way.
Steve: Sounds good. We’ll definitely look forward to reading those.