Most SQL Server DBAs don’t even like MySQL, let alone NoSQL. They cringe when they hear people say things like, “I don’t need transactions” or “The data will eventually be consistent” or “Indexes will be supported in the next version.” SQL Server and Oracle people are constantly trying to out-feature each other, and the abject lack of features in the NoSQL world comes as a complete shock to both parties.
Thing is, NoSQL and cloud databases do fill some very real business needs. Today, I’m going to give you a peek into the NoSQL/cloud-database world. You can be a better database professional if you understand what these tools are trying to achieve, who wants to use them, and how your database compares with theirs.
Let’s start at the beginning: who needs this stuff? To answer the question, let’s look at a few common projects that are probably happening right now in your company with conventional databases.
Use Case #1: Logging High-Volume, Low-Value Data
Most companies have a web filtering system that makes sure employees aren’t surfing [dangerous sites](http://achewood.com/index.php?date=04292004) during business hours. They want to reduce their Internet bandwidth needs by eliminating streamed music and video, and they want to avoid sexual harassment lawsuits caused by employees surfing for porn at work. The filtering system maintains two kinds of data: configuration data, such as employees, departments, and blacklisted web sites, and monitoring data that covers who surfed for what, and when. The config data has to be perfect – we can’t have our configuration data disappearing – but that isn’t too big a problem, because it’s a fairly small number of records.
*[Photo: I can always tell who surfs the dirty sites.]*
The monitoring data is important, but not all that important. We want to capture as much history as possible, and we’re going to be inserting data fast and furious, but it wouldn’t kill us if we missed a few rows. Odds are someone won’t surf for illicit material just once – they’ll do it repeatedly over time. We don’t make money on this data (although we could make an argument that it helps us avoid risk) so the CIO isn’t prepared to spend a fortune archiving Johnny’s Google searches. The only time we query this data is when HR wants to check Johnny’s web activity – we don’t need to run fancy reports to see the average web use by department.
This system could live without fancy transactional backups. We just need a full backup periodically – even once a week or once a month would be fine if it saved us a lot of money. When it goes down, we don’t need to restore it quickly – maybe not at all, if HR doesn’t ask for a report. We just need to set up a new database to start taking inserts again.
What if we had a way to store multiple terabytes of this data on commodity hardware that would have a very low cost, pretty high availability, but might not give us very good reporting options or transactional consistency?
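The whole pipeline is simple enough to sketch. Here’s a minimal, hypothetical Python version of that monitoring store – an append-only log with no transactions and no indexes, where a dropped row is an acceptable loss. The function and field names are mine for illustration, not from any real product:

```python
import json
import time

def log_request(log_file, user, url, action):
    """Append one monitoring record. Losing the odd row is acceptable,
    so there is no transaction, no index, and no read-back check."""
    record = {
        "ts": time.time(),    # when the request happened
        "user": user,         # who surfed
        "url": url,           # what they surfed for
        "action": action,     # allowed / blocked
    }
    try:
        # One JSON document per line: cheap to append, trivially
        # splittable across commodity boxes, easy to grep later.
        log_file.write(json.dumps(record) + "\n")
    except OSError:
        pass  # best effort: a missed row won't hurt this use case

def history_for(log_path, user):
    """The only query we ever run: one employee's web history."""
    with open(log_path) as f:
        records = (json.loads(line) for line in f)
        return [r for r in records if r["user"] == user]
```

No schema, no joins, no reporting layer – which is exactly the point: the feature list matches the business need and nothing more.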
Use Case #2: One-Time, One-Way, Big-Load Analysis
When a company’s data gets large enough, they start asking questions about what the data might reveal. I’ve worked with PhDs and analysts who say things like, “I need a copy of the data warehouse to solve a problem I’m working on. I’ve got a hunch about the numbers, and I’ll need to do a week or two of slicing and dicing to figure things out.” It’s just one or two rocket scientists working on the data for a relatively short amount of time.
They don’t want to synchronize the company’s incoming data with this research dataset, and they don’t need to back it up. They just want you to hand over the data, and then you won’t hear from them again. If it goes down, that’s okay – it doesn’t affect sales. Sometimes they use their own SQL Servers, sometimes analytic tools, sometimes it’s plain old Excel or Access. They won’t write production reports on this data – they’re just going to dive into specific queries and analytics, and then write PowerPoint decks or white papers on the results.
They might do similar projects like this over and over with different datasets from different areas in the company, but they use fresh sets of data every time. They could use a completely different toolset each time without training worries – these are rocket scientists, after all.
What if you could give them a tool that could handle rapid loads of completely different data each time, perform wild and crazy queries, but it might not use traditional query tools or database languages? These users aren’t proficient with SQL queries or RDBMS technologies anyway – they could use anything as long as it’s cheap and scales well.
Why Not Use SQL Server or Oracle?
*[Photo: Mozilla Labs Test Pilot Project]*
I bet you’ve run into projects just like these in your career, and until now, you’ve solved them with the only tool you really know well – relational databases like SQL Server. SQL Server, when properly configured and managed, doesn’t blink at the requirements of most NoSQL projects. Take the [Mozilla Test Pilot experiment](http://blog.mozilla.com/data/2010/05/18/riak-and-cassandra-and-hbase-oh-my/):
- Expected amount of data stored per experiment: 1.2 TB
- Expected peak traffic: approximately 75 GB per hour for two 8-hour periods following the conclusion of an experiment window. This two-day period will result in collection of approximately 90% of the total data.
- Remain highly available under load
- Provide necessary validation and security constraints to prevent bad data from polluting the experiment or damaging the application
Here’s my personal take on it – NoSQL doesn’t mean NoSQLServer, but rather NoMySQL. MySQL doesn’t scale as easily to these levels, whereas it’s not much of an event anymore for a single-node SQL Server to scale to multiple terabytes. I’ve worked with plenty of SQL Server data warehouses with much larger data volumes and load rates than this. In the last month, I’ve worked with two in particular that ran on less than $50k of commodity hardware.
It’s not that SQL Server or Oracle can’t scale to these levels.
It’s that NoSQL and cloud databases have a few crucial differences about how they scale.
NoSQL Has a Different Cost Model
If you’re doing multi-terabyte databases, you probably need SQL Server Enterprise Edition, which is [licensed in one of two ways](http://www.microsoft.com/sqlserver/2008/en/us/pricing.aspx):
- $27,495 per CPU socket for unlimited users, or
- $8,592 per server plus $164 per user (unless the user already has a SQL CAL)
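To see where the two models cross over, the arithmetic is simple. A quick sketch using the list prices above – the break-even figure assumes a one-socket server and users who don’t already own CALs:

```python
# 2010 list prices for SQL Server Enterprise Edition, as quoted above
PER_SOCKET = 27495   # per CPU socket, unlimited users
PER_SERVER = 8592    # per server
PER_CAL    = 164     # per user (unless the user already owns a SQL CAL)

def socket_license(sockets):
    """Cost of the unlimited-user, per-socket model."""
    return sockets * PER_SOCKET

def server_cal_license(users):
    """Cost of the server-plus-CALs model."""
    return PER_SERVER + users * PER_CAL

# On a one-socket box, server+CAL licensing stays cheaper until
# roughly this many users close the gap to one socket license.
break_even_users = (PER_SOCKET - PER_SERVER) // PER_CAL
```

At four sockets the per-socket model alone is already six figures – the kind of number that makes business managers start reading NoSQL press releases.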
When you start a SQL Server project, you need to have an accurate CPU count and stick to it. If you decide halfway into the project that you need a lot more power to get the job done, you have to buy more licenses. These licenses stick around, and that long-term overhead becomes a business problem. Businesses don’t like to see expensive software licenses not in use when a project isn’t doing number-crunching.
In addition, even just four sockets of Enterprise at $35-120k is a lot of money that could be spent on hardware – and therefore performance – instead. Even if SQL Server were 50% faster than a competing solution, that advantage is negated when the competitor’s users spend the licensing savings on 50% more hardware and get the same speed at a lower overall cost. Tough sell.
The advantage of buying an industry standard database with a long history (whether it be DB2, Oracle, SQL Server, or Sybase) is that it’s relatively easy to find staff and software that interact with that database. NoSQL and cloud databases, on the other hand, come with no up-front licensing – but because there’s no established industry standard, it’s very tough to find experienced staff off the street.
*[Photo: An early build of MulletDB]*
How do I hire experienced NoSQL-savvy staff when the NoSQL databases themselves keep changing? Here’s a sampling of announcements from one 24-hour period:
- [New Type of Database Gets High-Profile Funding (FluidDB)](http://www.readwriteweb.com/start/2010/05/fluiddb-backing.php)
- [Postgres and Ingres Father Announces VoltDB](http://www.theregister.co.uk/2010/05/25/voltdb_cloud_database_nosql/)
- [Announcing MulletDB (SQL in the front, NoSQL in the back)](http://sheddingbikes.com/posts/1274762986.html)
NoSQL solutions have different costs. Immature platforms suffer from a lack of documentation, tooling, and commonly-needed features, thereby driving implementation costs up.
But is that really a problem? There’s an abundance of documentation, tools, and features for Microsoft SQL Server, but I almost never see it implemented correctly. Just because there’s a manual doesn’t mean anybody’s going to read it or heed it, and just because there’s a tool doesn’t mean users will use it appropriately. It’s easy for me to hire SQL Server people off the street, but that doesn’t mean that they’re good, either.
The best NoSQL solutions solve the human cost problem by building in the things you need in order to scale, store data, and consume data. It’s not all that hard, either – the scenarios I described in the beginning of the post had one thing in common: a very limited number of features. Some NoSQL and cloud database users need to insert a lot, and they need to run very specific queries, but that’s about it. This is where open source products can really shine – they need even fewer features than MySQL has.
NoSQL Makes Smart Compromises to Achieve Scale
Right from the start, cloud databases and NoSQL projects aimed to scale out. Whether you’re using pizza boxes or virtual servers in the cloud, you can get more capacity by throwing more of them into your resource pool and letting the platform spread the load around. The [Visual Guide to NoSQL Systems](http://blog.nahurst.com/visual-guide-to-nosql-systems) illustrates the compromises – between consistency, availability, and partition tolerance, you can pick any two. If every client needs to see an identical copy of the data, there’s an app for that. If you can tolerate different query results in exchange for higher availability, there are plenty of choices for that too.
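That trade-off is easy to see in miniature. Here’s a toy Python model – not any real product’s replication protocol – where writes are acknowledged by one node immediately and reach the replica asynchronously, so a replica read can return stale data until replication catches up:

```python
from collections import deque

class EventuallyConsistentStore:
    """Toy model of the availability-over-consistency trade-off:
    writes land on the primary immediately and reach the replica
    only when replication 'catches up'."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.backlog = deque()  # pending replication events

    def write(self, key, value):
        # The write is acknowledged as soon as the primary has it...
        self.primary[key] = value
        self.backlog.append((key, value))

    def read_primary(self, key):
        return self.primary.get(key)

    def read_replica(self, key):
        # ...so a replica read may return stale or missing data.
        return self.replica.get(key)

    def replicate(self):
        # Drain the backlog; after this, the two nodes agree.
        while self.backlog:
            key, value = self.backlog.popleft()
            self.replica[key] = value
```

Two clients querying different nodes can briefly disagree – that’s the “eventually consistent” behavior that makes transactional DBAs cringe, and it’s a deliberate design choice, not a bug.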
[MongoDB’s](http://www.mongodb.org/) feature set is a good example. They tout:
- Auto-Sharding – Scale horizontally without compromising functionality.
- Replication & High Availability – Mirror across LANs and WANs for scale and peace of mind.
- GridFS – Store files of any size without complicating your stack.
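The idea behind auto-sharding is simpler than it sounds: a shard key deterministically maps each document to a node, so the platform can spread inserts across boxes without you deciding where anything lives. This toy Python sketch shows the concept only – it is nothing like MongoDB’s actual implementation, and the function names are mine:

```python
import hashlib

def shard_for(shard_key, num_shards):
    """Map a document's shard key to a node number. md5 keeps the
    mapping stable across processes (unlike Python's builtin hash())."""
    digest = hashlib.md5(shard_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def insert(shards, doc, key_field="user"):
    """Route a schema-free document to whichever node owns its key."""
    node = shard_for(doc[key_field], len(shards))
    shards[node].append(doc)
    return node
```

Because the routing is deterministic, every document for a given key lands on the same node, and queries on the shard key only have to touch one box – which is how these systems scale out on cheap hardware.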
If I were a business manager, I’d take those bullet points straight to my database team and ask, “Can our database do this? If not, why don’t we try MongoDB for our next project?” If the use case matches one of the two I described earlier, it might be a better fit than SQL Server.
No, you can’t have your cake and eat it too, and NoSQL systems don’t support anything even remotely near SQL Server’s feature set. Forget things like joins, stored procedures, triggers, spatial indexes, CLR in the database, foreign keys, and so on. As a performance tuner, though, that list makes me grin – because for years, I’ve been telling programmers to keep CPU-intensive work out of the database server if they want to scale. The NoSQL camp understands the importance of that. These systems are about storing data – period.
No Licensing Revenue Means No Buzzword Features
*[Photo: my poker face, from my SQL Server 2008 R2 review]*
In [my SQL Server 2008 R2 review](http://www.brentozar.com/archive/2010/05/sql-server-r-review/), I lamented that the last couple of versions of SQL Server have focused heavily on Business Intelligence because it’s a sexy buzz-phrase that sells licenses. I imagine Microsoft marketing thinks of customers as saying, “We’re a business and we need intelligence – get me some of that SQL Server!” Don’t get me wrong – I’m all for BI – but sometimes it feels like buzzwords get emphasized over basics, like working IntelliSense for all versions of SQL.
Since NoSQL tools are built by the users, they’re fat-free. If nobody needs it, it doesn’t get built. On the other hand, if one very determined person needs it badly, that person builds it or pays someone else to build it.
This leads to some interesting design decisions. Many of the NoSQL tools have abandoned any sense of schema – they want to store chunks of data, but that data layout can change at any time with no rhyme or reason. Take our web filtering example – if we design our application to suddenly start saving more information, like the browser agent or the remote site’s web server type, we can do that without a problem. When we pull the data back out, we need to handle whatever kind of data we find, but we’re not worried about joining this data to another table. It’s just flat web history data without relationships.
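In Python terms, a schema-free record is just a dictionary: old rows have the old fields, new rows have the new ones, and readers cope with either. A sketch with made-up field names for the web filtering example:

```python
# Early records captured only the basics...
old_record = {"user": "johnny", "url": "http://example.com", "action": "blocked"}

# ...and later the app started saving more, with no schema change,
# no ALTER TABLE, and no migration window.
new_record = {
    "user": "johnny",
    "url": "http://example.com",
    "action": "blocked",
    "user_agent": "Mozilla/5.0",   # added later
    "server_type": "Apache",       # added later
}

def describe(record):
    """Readers take whatever fields they find and shrug at the rest."""
    agent = record.get("user_agent", "unknown browser")
    return f'{record["user"]} -> {record["url"]} ({record["action"]}, {agent})'
```

The cost of that flexibility is that every consumer of the data has to defend itself against missing fields – the schema discipline didn’t disappear, it just moved from the database into the application code.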
No Licensing Revenue Means Different Development Models
Five years ago, Microsoft wanted to conquer the VLDB market, so they poured money into making SQL Server scale better. SQL Server 2005 took on the “SQL Server can’t scale” challenge and knocked it out of the park. We got things like [partitions for fast data loads](http://msdn.microsoft.com/en-us/library/dd537533.aspx), DMVs for easy management, and online piecemeal restore for quicker disaster recovery. Microsoft’s business decision to go after large databases paid off for SQL Server DBAs everywhere because we could use our skills on progressively larger databases.
When you pick a NoSQL or cloud database provider, the first question to ask yourself is, “Who will be developing this thing next year?” If the platform doesn’t have a really strong developer infrastructure, and your company runs into a data storage problem, the answer might be you. I dunno about you, but I’m pretty good at administering databases – but absolutely useless at coding them.
That means long-term customers should either:
- Choose a database with a commercial company supporting the development
- Have coders on staff who start getting familiar with the project’s source code
- Consider sponsoring development of the database. Sponsored open source development is a foreign concept to most businesses, but I’ve seen it work well even in small dollar amounts.
On the other hand, short-term customers have a completely different decision-making process. Consider the second use case I described – repeated one-time, one-way analysis projects. In that case, who cares if the database platform is still being developed a year from now? Your database team is probably already used to repeatedly learning new systems for data import/export, and they’re skilled at techniques, not just technologies. They’ll be comfortable switching to another platform if development dies off or if another more attractive platform comes out. So how do you pick the most attractive platform?
How I Would Choose a NoSQL or Cloud Database
Ignore throughput performance benchmarks. The market and the tools are changing way too fast – by the time a well-written benchmark comes out, it’s already outdated. Poorly-done benchmarks can come out much quicker, but they’re not good for making decisions.
Focus on developer performance. How fast can you get started deploying and storing data in it? Shaving days off the development time will probably outweigh the computing gains, especially since you’re going to be throwing your most skilled people at these problems.
Ask the team who will do the work. If you’re a manager on the project rather than a hands-on coder, let the hands-on people decide. If they have personal connections to someone else using a particular tool, they might be able to get faster answers or code help.
Prioritize difficult-to-code features that work right now. If you need your reporting team to be able to write SQL-style queries against the data, only pick platforms that offer that support today. If transaction support is important to you, rule out any solutions without it. Don’t gamble on these challenging features possibly coming out later, because it’s tough to adapt these into existing databases.
Prefer systems that tolerate a node failure. If the database is stored across a dozen cheap pizza box servers, one of the servers is going to go down. Your data loads and queries should still run while you provision a new pizza box and add it to the system. Ideally, you want this systems management function to be performed by your infrastructure team, not the database people. The closer this system comes to being an invisible managed appliance, the lower it will keep your manpower costs. This feature is extremely difficult to add to a database system – just ask Microsoft, who still haven’t done it yet for SQL Server, and believe me, we’ve all been asking.
I want to end this with a recommendation for a specific NoSQL or cloud database solution, but I can’t bring myself to do it. After all, the market is still evolving so fast, and I just told you to ask your team who will do the work. If you’re looking to play around with one, though, I’d recommend trying [MongoDB](http://www.mongodb.org/). It’s the closest to a conventional database while still meeting most of the criteria I’d want for a scaling solution, and there’s even an easy-to-play-with online tutorial so you don’t have to install anything.
But don’t do it from your work computer.
You don’t want SQL Server to know you were cheating on her.