It is not often that a product launches under its internal codename; however, when you think about a database designed with global distribution in mind, the name CosmosDB makes sense. I was a bit shocked to hear Microsoft was releasing another database technology in 2017; however, the team leading the charge was very impressive and it looked like it was solving problems other database engines could not. In this episode, we talk with Sidney Andrews about the Cosmos platform, what we can expect from the product, and how it differs from SQL Server. I think you will find the conversation very interesting.
“[CosmosDB] is a different offering for people whose requirements are slightly different than the traditional relational database.”
“I like to say scale is a configuration setting. It’s not a project anymore.”
“Even for me, as a developer, it took a lot of time and a lot of changing my perspective to really understand what’s going on, but from a DBA perspective, I think your interface is going to be the Data Explorer.”
“The intention is not for Cosmos to be an island, but to be just as much an integrated part of your workflow.”
“You have to know the purpose of the tool and what your needs are and then try to match them appropriately.”
Listen to Learn
01:49 Introduction to Kevin Feasel
02:57 Compañero Shout-Outs
03:33 SQL Trail Conference
05:39 Intro to the guest and topic
06:35 Why did Microsoft introduce a new database technology in 2017?
07:43 Who are the primary users of CosmosDB?
11:00 CosmosDB is multi-model
15:00 Different consistency levels with CosmosDB
18:12 Consistency models typically seen in the real world
19:01 Discussion on pricing
23:56 You can dynamically scale your RUs
25:12 Interfacing SQL Server with CosmosDB
28:16 Use a retry model or pattern just in case you’re not successful on the first try
31:27 Life of DBAs and administrators with CosmosDB
40:07 Difference between Data Lake and CosmosDB
41:38 Match your needs to the appropriate tool
44:40 SQL Family Questions
47:09 Closing Thoughts
About Sidney Andrews
Sidney Andrews is a Microsoft Certified Trainer and Azure MVP with SeeSharpRun.NET. His developer background is in XAML, ASP.NET and Entity Framework development. Sidney has driven efforts to develop and deliver Azure readiness training through channels such as Ignite, Microsoft Tech Summit, Microsoft Virtual Academy, whitepapers, and courseware. Sidney is also the author of multiple free, online courses on edx.org.
Carlos: Compañeros, welcome to the SQL Data Partners podcast. This is Episode 134 and it is good to have you on the SQL Trail again. Today we’re talking about CosmosDB, the database platform that Microsoft introduced in 2017, which admittedly took me aback a bit, but is very cool and obviously, they’re looking to integrate with the Azure services that are available. Today we have with us Sidney Andrews. Sidney’s been doing quite a bit of work on the training side of Cosmos and he is the owner and operator of seesharprun.net. He is actually locally here in Richmond, Virginia, which is how I know him. He’s been doing quite a bit on the training front, so he’s been speaking at Ignite, he does many courses on the Microsoft Virtual Academy and has several courses on the edX site. We’re excited to have him and to be talking with him about Cosmos, what Microsoft is doing with it, and how it might relate to those of us who know and love SQL Server. We also have Eugene Meidinger and Kevin Feasel on the program today.
Carlos: So, Kevin, one of the things we’re doing is we’re inviting our–
Kevin: The best of the best?
Carlos: Yes. We are trying to get the best of the best here on the SQL Data Partners podcast.
Kevin: But you couldn’t find them, so you had to settle for me. That’s fine.
Carlos: And so, we wanted to talk a bit about, remind folks who you are, what you’re doing and things that you might be talking about on the podcast.
Kevin: Sure. I’m Kevin Feasel. I’m a data platform MVP. I run a predictive analytics team out of Durham, North Carolina. I’m focused heavily on Hadoop, R, and basically everything in the broader data platform, which means I don’t have enough hours in the day to do much of anything, it seems.
Carlos: There you go. You’ve come on the program to ensure that you get more episodes than Jonathan Stewart.
Kevin: This is a big part of it, and also to evangelize F#. Everybody go learn F#, because it’s a great language.
Carlos: Awesome, Kevin, we look forward to having you on more episodes as a guest.
Kevin: Absolutely, looking forward to it.
Carlos: We’re still working out this panel system, if you will, so unfortunately Kevin doesn’t get a lot of air time this episode, but we look forward to having him in the future.
We do have a couple of Compañero Shout-Outs, as well. The first is Jeff Moden. He’s very active on the internet forums, SQL Server Central in particular. If you ever hear the term RBAR, Row By Agonizing Row, then more than likely you have gotten into a discussion with Jeff Moden. We are happy to have scheduled some time to have him on the program in the future, so we’re looking forward to that. Also want to give a shout-out to Aaron Hayes, Douglas Kemp, and Clark Hahn, who I met in West Virginia.
I want to talk a little bit about the SQL Trail. What is the SQL Trail, you might ask? For those of you who are new to the program, tuning in for the first time, the SQL Trail is the East Coast SQL Server conference. We are going to have the conference October 10th through the 12th in Richmond, Virginia. I’m hoping to have registration open by the time this podcast is available; if not, it will be opening in the next couple of days. It’s a two-and-a-half-day conference. We have a mix of MVPs that will be coming in, but it is different from traditional conferences in that it will be much more engaged and you will have an opportunity to both select and contribute to the content. Again, this kind of goes back to my Allen White quote in that we all have something that we can learn from each other, and that is one of the goals of the SQL Trail. We will more than likely be breaking up into tracks this year; we’ll see what the registration counts are. We are going to close registration at 40 registrants this year, and we’re looking forward to having some hands-on labs that Friday. Unfortunately, those details are not yet final, but I think that you’ll be very excited about what we have lined up for everyone. So, two and a half days: we’re going to start Wednesday at 2pm, do some get-to-know-you, and have a great dinner that evening. Thursday will be, I won’t say more traditional, but it will be sessions, content-selected sessions that we chose on Wednesday in addition to some of the content that our speakers, our guests, are going to be bringing to us, along with any other conversations that we want to be bringing up, questions that you might have, challenges that you’re having, things like that. Then Friday, we’ll be devoting a bit more to the hands-on labs. If you’d like to find out more, you can take a peek at sqltrail.com and again, we hope to see you in Richmond in October.
Sidney did give us quite a few notes and links to courses and other material, for those of you who are interested in getting started with Cosmos. We’re going to include that on our show notes for today’s episode. That’s going to be found at sqldatapartners.com/cosmos or at sqldatapartners.com/134. So, with that, let’s go ahead and jump into the conversation.
Kevin: Hi, this is Kevin Feasel, professional podcast guest.
Eugene: Hi, this is Eugene Meidinger, business intelligence consultant.
Sidney: Hello, this is Sidney. New to podcasts, quite interested.
Carlos: So, Sidney, welcome to the program! It’s good to have you!
Sidney: Thanks, Carlos. I’ve been wanting to come on for a while and things have just been so hectic. That’s what happens when you work with a new database platform.
Carlos: Yeah, that’s right, we’ve been trying to get you on here for a little while. Of course, we’re joined today by Eugene and Kevin. It’s nice to have them on. Ultimately, our topic today is Cosmos Database. I think we’ll get into a little bit of the history, here, in a minute, but let me go ahead and start off by asking why did Microsoft release a new database technology in 2017?
Sidney: Okay, that’s a really interesting question. I’m not going to call it loaded, because I like it. The biggest thing here is, it’s not necessarily new, it’s new to us. Internally, Microsoft has been using this database for a long time. It had some codenames, but usually, in traditional Microsoft fashion, the codename gets changed into something unreadable or undecipherable. But somehow the codename Cosmos made it out into the public and that’s the name of the database. Cosmos really is a different way of looking at databases and at accessing databases. I’d almost say it’s complementary to what you have with SQL. It’s strong in some things that SQL’s not strong at. It has some weaknesses. SQL’s a very mature product with a lot of strengths where Cosmos just isn’t there yet. I would say it’s not new as much as it’s a different offering for people whose requirements are slightly different than the traditional relational database.
Carlos: Okay, we know that Cosmos is what they call multi-model, but who’s going to be using Cosmos? Who are the primary users of this database?
Carlos: You’re taking advantage of the cloud offering, in particular, Azure here, and you get to choose your own adventure, if you will, from a hardware perspective?
Sidney: Boom, you got it. The big thing is, I like to say scale is a configuration setting. It’s not a project anymore. I’ve been using Entity Framework and SQL Server for years, but when I start talking about sharding and partitioning SQL Server, some people in the room, their eyes roll over, because we know this is going to be a big project and involve a lot of people, a lot of planning. And nothing against SQL Server, again, I love the platform, it’s just a different approach. Here, if I want to scale Azure CosmosDB, I go into the portal and say, “you know what, instead of a thousand request units per second, I need about ten thousand.” I cut a different check. It’s the same promise we got from platform as a service, just for the database world. That’s one audience. The other audience, I think, that’s going to be really interested in this is these big corporations who are taking in massive amounts of data. So, you think about IoT, gaming, e-commerce, where they’re taking in ridiculous quantities of data and they want to be able to do stuff with it.
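Sidney’s “scale is a configuration setting” idea can be sketched as a toy throttling model: bump one number and work that used to be rejected now fits. Everything here is illustrative; `ProvisionedContainer` is a made-up class for this sketch, not part of any Azure SDK.

```python
class ProvisionedContainer:
    """Toy model: a container serves up to its provisioned RU/s and throttles the rest."""

    def __init__(self, throughput_rus: int):
        self.throughput_rus = throughput_rus  # the "configuration setting"

    def scale(self, new_throughput_rus: int) -> None:
        # In Cosmos DB this is a portal slider or one API call, not a re-sharding project.
        self.throughput_rus = new_throughput_rus

    def run_second_of_work(self, requested_rus: int) -> int:
        """Return how many RUs were served this second; the remainder is throttled."""
        return min(requested_rus, self.throughput_rus)


container = ProvisionedContainer(throughput_rus=1_000)
print(container.run_second_of_work(requested_rus=10_000))  # only 1,000 fit; the rest throttle

container.scale(10_000)  # "cut a different check"
print(container.run_second_of_work(requested_rus=10_000))  # the full burst now fits
```

The contrast with re-partitioning a relational database is the point: here the capacity change is a single assignment.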
Carlos: Okay. Let’s talk for a minute about multi-model. So, my understanding is, it basically means that I can have graph data, I can have documents, I can have key-value pairs, and all of these types or sets of data I can throw into Cosmos and it can consume those.
Sidney: It’s really interesting. Let’s take a step backwards, because I know for a lot of the audience, this might be your first time hearing about Cosmos. Traditionally, if we have a database, we store the data in a certain way, and the way that we store the data is what’s surfaced to our endpoints. You have your TDS, or tabular data stream, protocol endpoint, and it surfaces the data in the relational structure that you have. Well, Cosmos is a little different. Cosmos stores data in a specialized atom-record-sequence format, and then when you request data and decide to interact with it, two things happen. The first is that it actually shapes that data into a model. Think about it like real-time shaping. If you say, “I need to access this data as a JSON document”, it’ll shape it into a JSON document. If I need to access it as a graph, it’ll shape that data into vertices and edges. If I need to access it as Azure Table storage, it’ll shape it into that key-value dictionary that you’re used to. Then, the API you use accesses the data in that shape. Those shapes I’m talking about, that’s what we mean by models. So, the document model, the key-value model, the column-family model, and the graph model. Those are the shapes that CosmosDB shapes data into.
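The real-time shaping Sidney describes can be mimicked with a toy example: one record surfaced as a document, as a key-value row, and as a graph vertex plus edge. These shaping functions are illustrative stand-ins, not the actual atom-record engine.

```python
# One internal record, surfaced in the different "shapes" Sidney describes.
record = {"id": "p1", "name": "Ada", "manager": "p0"}

def as_document(rec):
    # Document model: the record is already a JSON-like document.
    return dict(rec)

def as_key_value(rec):
    # Key-value/table model: a key pointing at a property bag.
    return (rec["id"], {k: v for k, v in rec.items() if k != "id"})

def as_graph(rec):
    # Graph model: a vertex, plus an edge derived from the reference field.
    vertex = {"id": rec["id"], "label": "person", "name": rec["name"]}
    edge = {"from": rec["id"], "to": rec["manager"], "label": "reportsTo"}
    return vertex, edge

print(as_document(record))
print(as_key_value(record))
print(as_graph(record))
```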
Carlos: Okay, interesting. Then I’m just interacting with it based on the data type that I want to work with at the moment, and I’m just going to feed it in and then Cosmos is going to store it as it needs to. So, if I wanted to put it in as graph, but then you wanted to pull it out as one of the other types, as a document or JSON, then you could do that, but we’re still looking at the same data?
Sidney: Yes and no. Today, you can’t do that quite yet. You can kind of see glimpses of it. I’m dating the podcast here, but one of the things you can do in 2018 is go and create a SQL API, document model instance of Azure CosmosDB, and then access it using the Azure Storage Explorer as a table API endpoint. I can open up MongoDB Compass and access that whole SQL API or DocumentDB instance. All I’m doing is taking the primary keys and URLs and rebuilding them into whatever connection string format that database expects. That’s the future of CosmosDB. The product team has been very talkative about this at conferences. That’s the goal. The dream at the end of the tunnel is that you can dynamically switch between models and types and APIs; we’re just not quite there yet. Because you’ve got to imagine, a really good example is with Mongo. Mongo stores data as BSON, but the SQL API, which used to be called DocumentDB, stores data as JSON, and JSON and BSON don’t have a one-to-one correlation. What are the differences, how do you serialize and deserialize between one another? The engineering team has a lot of really smart people on staff who, literally, are trying to solve that problem now. That’s the goal, yeah.
Carlos: Sure, okay. I think I just learned something. That storage type, I guess one, Mongo’s storing it as BSON and then I always thought of JSON as, I don’t know, just a different format, almost. But you’re saying that there is a, I don’t know if protocol is the right word.
Sidney: I think you had the right word the first time, data format. BSON is a superset of JSON that’s specific to MongoDB. I’ll give you a really good example. If you ever go to the Hello World tutorial for MongoDB, you’ll notice that instead of putting in an ID field that’s a string or integer, Mongo’s ID field is of type ObjectID, which is a very special data type unique to Mongo. So, BSON is really JSON with a whole bunch of Mongo stuff added on.
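A small sketch of the gap Sidney is describing: plain JSON has no ObjectID type, so round-tripping one takes an extended representation. The `ObjectId` class below is a stand-in for the real type in Mongo’s bson package, and the `{"$oid": ...}` form follows MongoDB’s Extended JSON convention.

```python
import json

class ObjectId:
    """Stand-in for Mongo's ObjectID type (the real one lives in the bson package)."""
    def __init__(self, hex_str: str):
        self.hex = hex_str

doc = {"_id": ObjectId("5f1d7f0e8b3e4a0001a1b2c3"), "title": "hello world"}

# Plain JSON has no ObjectID type, so serialization fails out of the box...
try:
    json.dumps(doc)
except TypeError as err:
    print("plain JSON:", err)

# ...unless you map it to Extended JSON's {"$oid": ...} representation.
def to_extended_json(obj):
    if isinstance(obj, ObjectId):
        return {"$oid": obj.hex}
    raise TypeError(f"unsupported type: {type(obj)!r}")

print(json.dumps(doc, default=to_extended_json))
```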
Carlos: So, lots of different decisions, but I feel like all of a sudden, I’m well out of my comfort zone.
Kevin: Speaking of decisions, let’s put it to one of the other major choices that you can make with setting up CosmosDB, and that is different consistency levels.
Eugene: Yeah, they’ve got like five of them.
Sidney: Yeah, and it’s one of the craziest things, because I’ve been using CosmosDB since back when it was DocumentDB and it was a little bit different platform. It’s been years now, but only two months ago did I really understand all the consistency levels, and that’s because a product team member just sat me down and drew it for me and I was, “oh!” But essentially, strong and eventual are what we all know. If you go to the SQL world, that’s sync and async replication, so strong and eventual are easy. The ones in the middle are compromises between strong and eventual, where you need strong in some situations and eventual in others. A really good example is session consistency. Session consistency means within the context of this connection or session, we have strong consistency, but outside of this session, we have eventual consistency. So, if you have strong consistency globally for your database, everything is synchronous and you have to wait for all replicas to commit before your data’s available, so you have less performance, relatively. You have multiple regions, multiple replicas. Session basically says, “just this one user in their session or connection is going to have slower performance, because they have to wait for all replicas to commit”, but that’s an acceptable sacrifice, because we have scenarios where a user needs to see the same data. The example I always give is let’s say you have a mobile application and it’s like a Zillow or something. You’re trying to upload your house, let’s say it’s for sale. If you put in your house information and hit the submit button, and this is a fully asynchronous, eventual database, and we refresh the list of houses and your house doesn’t appear, you think the application’s broken. So, what do you do? You fill out the form again, and all of a sudden you have three listings where you meant one.
For that user in their session, I want strong consistency, but nobody else should see a sacrifice in performance outside of this session. You see, that’s a really good example. It’s compromises.
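Sidney’s Zillow example can be modeled as a toy two-replica store: the writer’s session sees its own listing immediately, while other readers catch up only after replication. This is a thought experiment to illustrate the consistency levels, not how Cosmos actually implements session tokens.

```python
# Toy two-replica store illustrating session vs. eventual consistency.
class ReplicatedListings:
    def __init__(self):
        self.primary = []    # takes writes immediately
        self.secondary = []  # lags behind until replicate() runs

    def write(self, listing, session):
        self.primary.append(listing)
        session.append(listing)        # session token: "I have seen this write"

    def read(self, replica, session=None):
        data = list(replica)
        if session is not None:
            # Session consistency: read-your-own-writes within the session,
            # even if the replica you hit hasn't caught up yet.
            data += [l for l in session if l not in data]
        return data

    def replicate(self):
        self.secondary = list(self.primary)


store = ReplicatedListings()
my_session = []
store.write("123 Main St", my_session)

print(store.read(store.secondary, session=my_session))  # I see my own listing
print(store.read(store.secondary))                      # eventual reader: still stale
store.replicate()
print(store.read(store.secondary))                      # everyone catches up
```

Without the session token, the writer would refresh, see nothing, and submit again, which is exactly the triple-listing failure mode Sidney describes.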
Eugene: I have a quick question. On the strong consistency, I forget, does that impact your options for geo-replication and that sort of thing?
Sidney: No, you still have your same options for geo-replication, backup, and restore, but you hit an interesting point on the head. If you have strong consistency and you have geo-replicated to multiple sites, it will have an impact on the performance of that request. Without getting into the 400-level detail, you’re going to eat it as far as transaction costs.
Carlos: That is interesting, and as you bring it up, it sounds like I can actually have different types of consistency within the database. From a SQL Server perspective, I guess you can specify it per query, so at the query level I could use READ UNCOMMITTED or something like that. I suppose that’s very similar to some of that.
Sidney: Yeah, you’re absolutely right. This is one I just learned about a week ago. You can even specify different consistency levels for a specific transaction. So, if you’re using the SDK from .NET, Node, or Java, you could say, “for this individual request, I want a different consistency level.” It hasn’t come up in the real world for me yet, but it’s there.
Kevin: What kinds of consistency models do you typically see in the real world?
Sidney: In almost all cases, strong is more than good enough. I try to push people towards eventual, but it’s too much of a culture change. I like eventual, but I can’t convince any customers to go there yet. In my world, we’re talking seconds, minutes at most. “This report is accurate as of two minutes ago” is perfectly acceptable to me. The larger the corporation is, the more I can push them towards eventual consistency and understanding the nature of asynchronous data. Session consistency seems to be my pit stop, so if I can get a company halfway convinced, I can get them to session and then eventual. It’s kind of like moving along the path. I do warn them, “hey, if we’re going to do strong consistency, understand that each request is going to transact more RUs. In the aggregate, you may have to have a higher ceiling of RUs per second.”
Eugene: Since you brought up RUs, just a quick comment on that. At least back when it was DocumentDB, I found the pricing to be a little bit weird for what’s now called CosmosDB. I did a personal app to keep track of to-dos and that sort of thing. One of the things that was strange to me was that everything went in a collection and you had to price out the entire collection, what level of performance you wanted, that sort of thing. But it didn’t seem like there was the idea of tables within that collection, where you would think database and table. It seemed like it was just this big giant bag of data. So, the problem that I ran into was I’m coming from the SQL world and I was making this app to just keep track of to-dos and other things, and I wanted to have the equivalent of tables to separate out my data. At least when I was poking around with it, it looked like what I’d have to do was have some sort of flag or field in my documents that says what type of object it is. I’d hoped that I could have different collections and treat them like tables, but the pricing model just didn’t make sense for that. I don’t know if that’s changed at all. It looks like you’re still paying for RUs and data storage, and so you kind of have to put everything in one big collection, is what it seems like.
Sidney: Two answers to your question. The first answer is, in the Cosmos world, we have this whole hierarchical structure where you have your account, within your account you have a database, and the database is just a logical container. That database can have zero or many containers. In the DocumentDB, SQL, Mongo world, we call those collections. A container is closer to your table than it is to your database, so what ends up happening in the Mongo world a lot is that Mongo developers will have 20 or 30 collections, because there’s a one-to-one correlation there between a collection and a table. You noticed already that in the Cosmos world, you’re billed based on containers or collections. Containers are what they’re called behind the scenes when you’re API-agnostic; collection is just the name. The problem you run into is that the floor of a container is too high. There’s a minimum cost per container, so if you need 30 containers, you’re really eating it, pricing-wise. That’s the number one question brought up on UserVoice; it’s brought up at every conference. I’m really hoping they do something about it. Obviously, if I know something, I can’t say anything about it. But I’ll say this, the product team is extremely aware that this is a huge limitation. The way they found out, and this is really interesting, is there’s a huge market of people, like I talked about earlier, who do MongoDB and are ready to scale. They figured out that to scale MongoDB, you have to bring in like a half-a-million-dollar consulting firm and do a one- or two-year project for partitioning and setting everything up. It’s pretty complex. Cosmos is really interesting to them because they can say, “oh, I can just use this service and let this service scale me out and I don’t have to change any code”, because it just uses the Mongo wire protocol. The problem they ran into is just like we talked about.
They have these containers in Cosmos, but in MongoDB, there’s really no limit to the number of collections you can have, so they have hundreds of collections, hundreds of tables. What do you do? The guidance up until today has been to just combine your collections, combine your tables, and use a field to separate between them. From a technical perspective, there’s no downside to doing that; it’s just the easiest way to solve that problem. Usually we have JSON documents and we have a field in a JSON document called Type. I’ve seen people in .NET land serialize the name of their plain old CLR object and use that as a type, and then they get that serialization back and that’s how they know to get all records of a certain type. I’ve seen people partition based upon that type. There are lots of ways to tackle that. I think the answer is that the product team has to come up with a way for you to maybe pool RU resources to spread them out amongst multiple containers. I think that’s the answer. Maybe a lower floor, so there’s an easier cost of entry to using Cosmos. There are lots of ways to tackle that problem, but it’s kind of the whole Microsoft thing where you create a new database and you have to serve these Fortune 500 customers, and then once the database matures a bit, you realize, “oh wow, this is too expensive for the average person.” How many times have you seen that in Azure? Now you’ve got to figure out how to price this differently and all of a sudden you come up with like 20 or 30 different pricing schemes. I think we’re at that phase of the growth where Cosmos is now trying to figure out how to be cost-effective for small to medium-size operations.
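The “type field” guidance looks roughly like this in practice: one container, a discriminator field, and queries that filter on it. The documents and the `query_by_type` helper here are hypothetical.

```python
# Combining what would be separate tables into one container, discriminated
# by a "type" field, per the guidance Sidney describes.
container = [
    {"id": "1", "type": "todo",    "text": "buy milk"},
    {"id": "2", "type": "todo",    "text": "file taxes"},
    {"id": "3", "type": "project", "name": "kitchen remodel"},
]

def query_by_type(docs, doc_type):
    """The equivalent of SELECT * FROM c WHERE c.type = @type."""
    return [d for d in docs if d["type"] == doc_type]

print(query_by_type(container, "todo"))     # the "todos table"
print(query_by_type(container, "project"))  # the "projects table"
```

.NET shops often fill that field with the serialized CLR type name, which makes the round trip back to typed objects mechanical.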
Carlos: What does one of those containers go for, then? Do you remember off the top of your head?
Eugene: I feel like it was like $10 or $20 a month, back when I was looking at it. I forget.
Sidney: Yeah, that’s about right. That’s for the minimum. If you’re doing something like, “we’re importing Mongo”, we recommend 10,000 RUs per second for the import operation. A 10,000 RU per second container goes for about, let’s say, $200 a month. One of the things you can do, though, is dynamically scale your RUs. That’s something a lot of people don’t realize. You can assign 10,000 RUs per second to do the import operation, then scale back down to 1,000 after you’re done, and that’s your regular running database. So, if you need to do one-off operations, you can blow up the amount of throughput you have, perform that crazy, massive operation all in parallel, and then just bring it back down to normal. I had a customer who did that because they do OCR scanning. They brought in all this raw OCR data and they wanted to pop it into Cosmos, so we bumped them up to 1 million RUs per second and flooded the database with requests. They got a $200 bill for the couple of hours of operations and then we just brought it back down to normal. So, pricing is a little bit of a challenge right now, because the database is definitely tuned towards customers who have millions of records and tens or hundreds of thousands of transactions per second. I think they’ve got to take a step back and think about the small or medium-size applications.
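The burst-scaling arithmetic works out as follows, assuming an illustrative rate of $0.008 per hour per 100 RU/s purely for the sake of the math (check current Azure pricing for real numbers):

```python
def burst_cost(rus_per_sec: int, hours: float, rate_per_100rus_hour: float) -> float:
    """Hourly throughput billing: (RU/s / 100) * hourly rate * hours."""
    return rus_per_sec / 100 * rate_per_100rus_hour * hours

RATE = 0.008  # illustrative $/hour per 100 RU/s; not an official price

# A two-hour, 1 million RU/s import burst vs. a month of steady 1,000 RU/s.
print(round(burst_cost(1_000_000, hours=2, rate_per_100rus_hour=RATE), 2))
print(round(burst_cost(1_000, hours=730, rate_per_100rus_hour=RATE), 2))
```

The shape of the result is what matters: a huge short-lived burst costs on the same order as a modest always-on baseline, which is why scaling up for one-off operations and back down afterwards is economical.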
Eugene: Quick question. I don’t know if anyone knows this. With SQL Database, aren’t you able to pool your DTUs? Does anyone remember?
Carlos: Yeah, with elastic pools.
Eugene: Exactly, that’s what I was thinking of. So, hopefully at some point, they have something similar.
Sidney: I hope we get something similar for Cosmos.
Carlos: Okay, so let’s talk about interfacing here. From the SQL Server perspective and our migration to the cloud, there was Azure SQL Database at first. In the very first versions you could only really interface with it through the web browser; now they allow you to connect through SQL Server Management Studio. How might we go about connecting to CosmosDB?
Sidney: One of the interesting things is it’s based on your API, so if you’re using the Mongo API, you use MongoDB Compass or hundreds of other Mongo IDEs. If you’re using the Gremlin API, you use graph-based IDEs.
Carlos: We want to talk about, I’m a SQL Server guy, so what am I going to be using to connect to CosmosDB?
Sidney: See, that’s where it gets really interesting.
Carlos: You keep using that phrase, Sidney.
Sidney: Because it is! Even for me, as a developer, it took a lot of time and a lot of changing my perspective to really understand what’s going on, but from a DBA perspective, I think your interface is going to be the Data Explorer. The Data Explorer is in the Azure portal and it gives you the ability to run queries, to view your documents, to manage your index policies, to manage your scale, your RUs per second, all of that within the portal. Even though all of that’s available via developer SDKs and stuff like that, I think the portal’s by far the best experience for this. They’re very portal-first here, rather than building a desktop tool to manage this, they just say, “well, everybody else is in the portal, why not just build it in the portal?”
Carlos: I’m interested, Eugene, what was your experience when you were building your little app?
Eugene: Yeah, it was interesting. My little web app was like asp.net. What I did was for all of the testing, I was using the portal, just to test out queries and to play around with it.
Carlos: Using the Data Explorer that Sidney mentioned?
Eugene: Yeah, I was using the Data Explorer. Whenever I was adding in to-dos initially for stuff that was recurring and just playing around with the data, I was using the Data Explorer that he talked about in the portal. But then for the actual code, I was using, actually I think I played around with C# and I think I may have also used Angular a little bit, I forget. I think if you’re coming from the DBA world, you probably have a developer friend in your team that’s very comfortable with C#, so I have a feeling that’s a very likely path to go down.
Sidney: Every engineer I’ve met on the product team is a developer first, so it’s a developer’s database. Matter of fact, we all had to be convinced, including me, to create labs and document examples in .NET, because they wanted to do it in Python and I wanted to do Node. We were doing 400-level lab roadshows and somebody came in and was like, “are you all going to run some .NET labs?” We were like, “oh yeah, this is Microsoft, sorry.” So, yeah, I use the C# SDK, and one of the things you’ll find if you go to docs.microsoft.com and look at the documentation, they actually have a webpage that not a lot of people I’ve talked to have ever seen until I mention it. It was the same thing for me; the product team mentioned it to me. I was like, “oh, this is here?”
Eugene: They also have a good quick start tutorial. Whenever I was trying to get started with it, they basically had a, I think it was on GitHub, just a whole project you download and it has code for a bunch of different operations. So, it was easy to get started with. Something related to that, Sidney, I’m wondering if you’ve had any experience with this, but I ran into some trouble where I don’t know if I was just coding wrong and I’d get an error or I’d get throttled or something like that. Because it’s cloud-first, it seems like you kind of have to take a cloud-first mentality with design patterns and build in kind of a retry model just because you may not always be successful on the first try. Have you run into that issue at all?
Sidney: Yes, absolutely. There is a retry pattern baked into the SDK, but we highly recommend you roll your own retry logic. There are a couple of errors you’re going to run into. One of the most interesting ones is that CosmosDB has its own HTTP status codes. They indicate certain things about the database, like you’re hitting your throughput limit for transactions, or, since you can have up to 10GB per partition key, there’s a specific error code for hitting the maximum for your partition key. So, when you get your response object back, it actually tells you which status code you hit. It’s going to be in the show notes, I think. There’s this cosmosdb.github.io site where we built these architect-level, 400-level labs. These architect labs specifically go into how you troubleshoot for this. What we do is we actually have a collection and you spin up to 10,000 RUs per second to put in your documents. Once your documents are in there, we knock it down to 400 RUs per second. Then we try to simultaneously insert 5,000 documents by spinning off 5,000 .NET tasks in full parallel to kill the database. The whole point is just so you can see it’s returning certain error codes indicating that you’re hitting throughput limits or error codes indicating that you’re hitting storage limits. Again, that’s on the docs.microsoft.com website; there’s documentation around that. You’re right, you do have to roll retry logic in. The problem with retry logic is that, wow, this is going to be very controversial to say: Entity Framework’s made us lazy. It’s baked in, it’s built in, you don’t ever have to think about it. So now we’re thinking about stuff we didn’t have to think about, and it feels like, “oh, CosmosDB should do that for us”, and the engineering team is more of the perspective of, “we don’t want to be that opinionated in our SDK.”
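A minimal version of the hand-rolled retry logic Sidney recommends might look like this. `ThrottledError` and the fake client are stand-ins for the SDK’s throttling exception and a real container; the one Cosmos-specific fact assumed is that throttled requests come back as HTTP 429 with a suggested retry delay.

```python
import time

class ThrottledError(Exception):
    """Stand-in for the SDK exception raised on HTTP 429 (request rate too large)."""
    def __init__(self, retry_after_ms: int):
        self.retry_after_ms = retry_after_ms

def with_retries(operation, max_attempts=5):
    """Hand-rolled retry loop: honor the server's suggested wait, then try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ThrottledError as err:
            if attempt == max_attempts:
                raise
            time.sleep(err.retry_after_ms / 1000)

# Fake client that throttles twice, then succeeds.
calls = {"n": 0}
def insert_document():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ThrottledError(retry_after_ms=1)
    return "created"

print(with_retries(insert_document))  # "created" after two retried 429s
```

In production you would also cap the total wait and distinguish throttling (retryable) from storage-limit errors (not retryable).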
Eugene: Right. Carlos, you’ve proclaimed yourself as a knuckle-dragging Neanderthal DBA, right?
Carlos: That’s right.
Eugene: And so, I’d imagine for you the idea of having to write code that has to try multiple times is a bit of a foreign concept, because with SQL, we just expect it to just work.
Carlos: That’s right. If it errors out, the transaction’s going to roll back and I know that it didn’t work and then I’m going to get an error and have to do something with it.
Carlos: Just to interject for a moment, at one point we kind of hit on a couple of different topics. I didn’t have a great way to edit that back in, so we’ll apologize and we’ll start back in with Sidney talking about roles of the DBA.
Sidney: The DBA’s job becomes more about performance, indexing, and tuning. I’m trying my best not to get into 400-level here, but here’s a really good example of what a really good Cosmos DBA is doing, just talking about the SQL API and JSON. Cosmos indexes everything by default, which is both wonderful and bad. It’s great from a read perspective, because every query is amazing and fast, but any time you write to the database, you have more things to update in the index. You become the arbiter of, “okay, which paths do we really need to index? Which paths should we exclude from the index?” The example I always give is: you have a JSON document with 10 properties. Writes are atomic in Cosmos, so if you index every property, then every time you write to that JSON document, all 10 properties are going to need to be updated in the index, even if they didn’t change. But if you’re only indexing 4 properties in that JSON document, you’re doing less than half the amount of work when you’re updating the index, and that significantly reduces the amount of RUs consumed by your update operations.
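As a rough sketch of the index-4-of-10 example above, here is what a selective indexing policy looks like for the SQL API. The property names are made up for illustration, but the `includedPaths`/`excludedPaths` shape follows the documented policy format.

```python
# A Cosmos DB (SQL API) indexing policy that opts out of
# index-everything and includes only the four paths that queries
# actually filter or sort on. Property names are illustrative.
indexing_policy = {
    "indexingMode": "consistent",
    "automatic": True,
    "includedPaths": [
        {"path": "/category/?"},
        {"path": "/price/?"},
        {"path": "/region/?"},
        {"path": "/createdAt/?"},
    ],
    # "/*" excludes everything not explicitly included above, so the
    # other six properties no longer cost index writes on every update.
    "excludedPaths": [
        {"path": "/*"},
    ],
}

# Which properties still pay an index-maintenance cost on each write:
indexed = {p["path"].strip("/?") for p in indexing_policy["includedPaths"]}
```

The trailing `/?` in each included path scopes the index to that property's scalar value, which is the common case for equality and range filters.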
Carlos: Yeah, that will be kind of interesting to get into. I admit I’m not super familiar with Mongo, either. That’s something we still need to talk about on the program, as well. But coming from the idea of tracing where the data is while we do indexing in SQL Server, it still seems like a bit of a leap to think, “okay, now where are those documents and how am I going to traverse that?” Maybe it shouldn’t, maybe it’s because I’m just so used to the B tree idea, it’s just a little bit more ingrained. But it will be interesting to see how that works with the other platforms.
Sidney: To give you a really good example of that, I had a really long conversation with somebody at a partner event in Redmond, near Microsoft’s campus, where we brought in all these people who are working on delivering solutions with Cosmos. We had this long conversation on how you represent dates in JSON. Do you represent it as an epoch date using a numeric value, or do you use a string with the ISO date? If you use the string with the ISO date, are you doing a hash index, where, like a hash table, any exact-match lookup is extremely fast? Or a range index, where you’re comparing strings, looking for values between Tuesday and Thursday? Well, do we care about just the date, or do we care about the hours and minutes? If we care about only the date itself and no hours or minutes, maybe we can get away with the string and do a range index. These become really interesting conversations where you can actually be the winner in your organization and say, “I was able to reduce our consumption in Cosmos by half because of all the wonderful index tuning I’m doing. Give me a raise.”
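The epoch-versus-ISO trade-off can be illustrated quickly. One property that makes the string form workable with a range index is that ISO 8601 strings sort lexicographically in chronological order, so string comparisons behave like date comparisons. The timestamp below is an arbitrary example.

```python
from datetime import datetime, timezone

# Two ways to store the same timestamp in a JSON document.
ts = datetime(2018, 3, 14, 9, 30, tzinfo=timezone.utc)
as_epoch = int(ts.timestamp())               # numeric value, range-indexable as a number
as_iso = ts.strftime("%Y-%m-%dT%H:%M:%SZ")   # "2018-03-14T09:30:00Z"

# ISO 8601 strings sort lexicographically in chronological order,
# which is what lets a range index over the string form answer
# BETWEEN-style queries with plain string comparison.
days = ["2018-03-13", "2018-03-15", "2018-03-14"]
in_range = [d for d in sorted(days) if "2018-03-13" < d <= "2018-03-15"]
```

If you only care about the date, truncating the string to `YYYY-MM-DD` keeps the index smaller while preserving that ordering property.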
Carlos: There you go. So maybe we don’t want those containers to drop in price just yet, so we can–
Sidney: Yeah, let’s get our raises first, and then we can drop the containers.
Kevin: I want to talk about the life of the administrator. You mentioned already one thing that administrators would have to do, which is around tuning, but how about some of the other stuff that we, I say we, even though I’m not an administrator anymore. But, we tend to do with data, like a user comes in and says, “hey, I accidentally deleted all of our data for the last three days. Can you recover it in the next 20 minutes, because I’ve got a meeting?”
Carlos: So, data recovery.
Sidney: Oh yeah. Before I answer the data recovery question, Kevin, did you want to throw any more examples at me? Because I’m actually taking notes here, so: data recovery.
Kevin: Yeah, data recovery is a big one. Just backups in general, not necessarily, I have two or three or 500 thousand data points I want to recover, but I want to make sure that everything is there and that I can revert to a previous point in time if, say, a release goes catastrophically wrong. The ability to copy data off to bring it to some other site. A lot of this becomes less important if you’re using CosmosDB as a secondary caching store, where the data lives for real in a primary source system somewhere else and you’re using it here for enhancing speed. But what if you want to use this as a primary system, where you don’t have the data anywhere else?
Carlos: Right, so a good example that people might be asked about a lot now is your data science folks, or I guess the BI folks are the more traditional case, right? They want to get access to that data, but you don’t want it stomping on the transactional system, so you have to spread that out so that everybody can be happy.
Sidney: I’ve got two answers to that question. The very first one, around backup and recovery: that’s kind of the gift and the curse of platform as a service. Think about Azure SQL Database and the backup and recovery story there, where it’s mostly automated. There’s some stuff you can do, but if you want to do anything beyond what the service provides, it involves setting up manual replication, setting up manual options. It’s the same thing with Cosmos. There’s built-in backup and recovery, but if you want to go a little bit more advanced, there are solutions there. That gets me to the second answer, where you’re talking about BI integrations. CosmosDB has two really interesting ways you can set up integrations, and that allows you to create more interesting BI scenarios, even backup and recovery scenarios. The service itself has built-in backup and recovery, which is great because it’s built in, but it sucks because, what do I manage, what do I configure? It’s a set backup retention period. There’s a set process for recovering. It’s not very customizable, so I have a lot of customers who say, “that’s not acceptable. I want something more elegant, more sophisticated. I want to be able to report out.” A really common one: I have my transactional data in Azure CosmosDB. I’d like to aggregate it and dump it into SQL and then use PowerBI to report off of SQL for my reporting needs. A relational database is excellent for reporting and analytical workloads, so why not use a relational database there? The way you do that in CosmosDB land is with two mechanisms: you have the change feed and the connectors. Those connectors feed into the various Hadoop services; obviously, if you have an HDInsight instance, there’s a Spark connector, and there are other connectors available, and what you can do is use those workloads to actually pop data out.
A really common scenario you’re going to hear is feeding data from Cosmos into Azure Data Factory and using Azure Data Factory as kind of your modern SSIS to pop the data into wherever you want it to be. If you want to aggregate any kind of average data, or you want to highlight certain records, not pulling everything, but pulling the records that are appropriate and putting them into Data Lake, SQL Data Warehouse, or even just SQL Server or the Azure SQL Database service, that is very common. I see a lot of DBAs in that space, because what you’re doing is really crafting a solution, a little bit more intense backup and recovery solution. Then on top of that, it opens up additional workloads. If you actually look at the diagrams, and I know they throw diagrams everywhere in conference talks, a lot of the examples show Azure CosmosDB as the transactional database, but then they’re actually shipping data for reporting to Data Warehouse. They’re using Data Factory to pop data into other systems. I’ve seen scenarios where Data Factory ships data that fires off a logic app and the logic app is putting data into SharePoint or Salesforce. I’ve seen scenarios popping data into Data Warehouse. What is the name of the service that gives you data packages that users can consume on their own, so they can create their own ad hoc queries in Excel or PowerBI? It’s kind of like a marketplace of data.
Eugene: Oh, shoot, are you talking about Data Catalog or something like that?
Sidney: Yep! You got it, Data Catalog. I’ve seen that, where Cosmos is the transactional database, but they’re using Data Factory to take specific records out of Cosmos and put them into, I can’t remember if it was a SQL Database instance or a Data Warehouse instance. And then they surface that data using Data Catalog, so the sales team can go look at sales data without actually seeing any PII about the actual customers themselves. The intention is not for Cosmos to be an island, but to be just as much an integrated part of your workflow. It’s a lot of… I don’t know, are you all really familiar with the concept of Lambda Architecture?
Eugene: Yeah, oh yeah.
Sidney: A lot of Lambda Architecture diagrams you see now use Cosmos for the hot path and then use Data Warehouse for the cold path. It’s super common now because Cosmos is great for super-fast queries, especially since they guarantee reads for kilobyte-sized JSON documents. By the way, that’s a massive JSON document. For a 1KB JSON document, they guarantee a read in under 10 milliseconds if you have geo-replication enabled. For a transactional database hot path, that’s amazing. But for the cold path, where you’re trying to do deeper analysis, HDInsight, SQL Data Warehouse, these are much better solutions. So it’s not meant to replace your stack, but to be a really integrated component of your stack.
Carlos: We’ve talked about Data Lake, and we’ve talked about Cosmos consuming all of this data. Is there really a difference, then, between the Data Lake and Cosmos?
Sidney: You know what, I’ll put it this way. What’s the difference between SQL Database and Cosmos?
Eugene: I think there is a difference between Data Lake and Cosmos. They’re both similar in that they’re kind of, shoot, what’s the term? You worry about the schema later.
Eugene: Yeah, semi-structured. I think that’s definitely a big deal. But I think one of the differences with Data Lake, if I understand it correctly, is that it’s got more of that scale-out capability, or at least it’s designed for that, and it’s based on HDFS or WebHDFS, if I remember, at least for the Azure version. So, in my mind, even though CosmosDB and Data Lake both have scale-out capabilities, Data Lake’s more for really, really big kinds of scale-out than maybe Cosmos is. Now you can correct me, I know you were talking about how capable Cosmos is for that sort of thing.
Sidney: Oh, you want me to do the sales pitch? No, no, no. I’ll put it this way. Data Lake is great for storing massive quantities of data, and the amount of data that can be stored in Data Lake is orders of magnitude higher than what a single partition key in Cosmos can hold. There are limits in Cosmos, so you will hit technical limits in Cosmos before you hit them in Data Lake, but then the opposite is true for transactions. Transactions off of Data Lake just aren’t nearly as performant as transactions off of Cosmos, so a lot of times you see Data Lake storing the data, then you aggregate it out and perform transactions elsewhere.
Carlos: I think it’s interesting, one of those examples where you have to know the purpose of the tool and what your needs are and then try to match them appropriately, rather than, well, lots of different services will have similar features, but then what are the scale-out, cost implications, things like that, to help you make the right decision?
Sidney: Yeah, if I had to put a one-quick-phrase explanation on Cosmos: Cosmos is a database for unstructured data that doesn’t really commit to any one transactional API. We were joking, we were talking about Data Lake and I thought you were talking about analytics, that’s why I made the joke, “well, what’s really the difference between Cosmos and SQL?” The answer becomes: SQL is committed to an API, Cosmos is not. There’s nothing stopping the Cosmos team from implementing a relational structure in the SQL API. That’s a lot closer to SQL Server. It’s just they probably don’t want to compete with their own product.
Carlos: Yeah, it is interesting. From a marketing perspective, one of the keynotes at PASS last year was CosmosDB. We’d been hearing a little bit about it as DocumentDB. As we come back to the evolution of the DBA and what’s going to be happening, one of the advantages that Cosmos has is that it, for lack of a better word, shards way better than SQL Server. And you think, “oh gosh, what am I going to be doing?” I think there’s always that balance there between what tool you should be using and what your teams feel comfortable with.
Eugene: I think that CosmosDB has a lot of promise, and it’s just so funny that they Frankenstein-ed this database engine where they just keep slapping on new data models. It’ll be interesting to see where it goes, and I really hope that they add a lower floor on the price so that amateurs like me can do side projects and learn about it without having to spend $20 a month just to play with it.
Carlos: Should we go ahead and do SQL Family?
Kevin: Do it.
Eugene: Sounds good.
Carlos: Okay, Sidney, how did you first get started with SQL Server?
Sidney: Oh wow, when I was in college at VCU, we had a .NET class and I had to learn this new database called SQL Server, and I thought this was a rip-off of MySQL, and how would anybody ever use a knock-off product? So, I learned it around 2007.
Carlos: Okay, very good. Now, if you could change one thing about SQL Server what would it be?
Sidney: Today, I think the biggest thing I would change is just– that’s a really good question. The elastic scale capabilities in Azure SQL Database, I feel like it’s really difficult to figure out how to use that and to get started. It’s something I’m really interested in and after researching it and doing examples, I still don’t have my mind wrapped around it, so I think that learning curve needs to drop significantly.
Carlos: What’s the best piece of career advice you’ve received?
Sidney: When I was in high school, I got a piece of career advice from a leader in our youth program. We were driving down the highway and he said, “when you’re talking about yourself and selling yourself, you’re a used car salesman for you, so you need to get out there, wear the ugly jacket, and really convince people to invest in you, that you’re the greatest person in the world, and get the car off the lot. Get that job.” So yeah, I’ve really taken that advice seriously.
Carlos: Our last question for you today, Sidney, if you could have one superhero power, what would it be and why do you want it?
Sidney: Oh wow, the power to redistribute time. What that means is, I’d like to be able to take time from the future and use it now, and take time from now and save it for the future. I work from home, I have my own consultancy, I just became a Microsoft Partner. I’m a Learning Partner now, delivering my first courses a week from this podcast, so it’s really exciting. I’m about to be an application dev Silver Partner within a month, so a lot of what happens for me is that my days get slammed, hour to hour. For example, the day we’re recording this podcast, I had a three hour webinar in the morning and I have a user group meeting talk two hours after the podcast is done. I’d like to be able to borrow some time from the rest of the week where I have nothing else going on and redistribute it to today.
Carlos: There you go, interesting. Sidney, thanks so much for being on the program.
Sidney: Thank you for having me. I put tons of links in the show notes for tutorials, labs, white paper scenarios, plenty of reading for everybody in the audience to get deeper.
Carlos: And thanks, Eugene and Kevin, as well, for jumping in today, for today’s program.
Kevin: Glad to be here.
Eugene: Definitely, it’s a pleasure.
Carlos: So, this is going to be exciting, to see what the adoption is. Again, I think there are just so many different cases or use cases that organizations have around data, and they’re going to need multiple options to be able to handle them. I think it was pretty obvious from the conversation that the first customers are going to be those larger organizations that have customers all over the world, so you can think about banks or global organizations that have dispersed geographic needs. And then it will probably make its way down into other organizations as time goes on and as they’re able to make things more affordable, things like that. I am looking forward to seeing how this evolves, how it continues to interact with those of us who are using SQL Server. I definitely don’t think that we need to be abandoning ship with SQL Server anytime soon. It is interesting, however, for those who are considering adopting Mongo, that this may be a better option, or a migration strategy if you already have Mongo, to be able to get back into the Microsoft stack and play a bit nicer, if you will, with SQL Server, for example. That will be interesting to see.
Again, you can check out the show notes for today’s episode. They will be at sqldatapartners.com/cosmos and as always, compañeros, we’re very interested in what you have to say. Thoughts or comments are always appreciated. You can reach out to me on social media, particularly LinkedIn. You can connect with me. I’d love to connect with you and chat. I am @carloslchacon and we’ll see you on the SQL Trail.