One of the most fundamental and fraught decisions in technology is deciding on the right way to store your data. Get it wrong and you could be handcuffed forever, get it just right and answers will seem to spring from your data before you knew the question. We at Seamgen know the struggle and are here to help make sure you choose the right database engine.
How do you begin to navigate the confusing and crowded space: relational or noSQL? Schema or no Schema? Columnar, Document, unstructured? What about graph databases?
Here are some key questions to ask. The answers can guide you to a reasonable solutions. The most important thing is to understand some key attributes of your data, and to compare these with the problems that the various storage technologies were created to solve.
Before getting into that, let me say that the answer is almost always a traditional RDBMS using SQL. With good reason this type of database engine has been around since the early 1970s and is still the most used way to store data today.
You may think you have a lot of data, but there are SQL databases with a Terabyte of data running today and doing their job well. For most use cases an RDBMS can be designed and tuned to meet your application’s needs.
Factors to Consider Choosing the Right Database Engine
Disclaimers aside, what are the attributes of your system to consider when deciding on a database?
You should consider the natural structure of your data, how it is structured when you write it, and how you will read it. You will need to have an idea, or at least an opinion, about how large your data set will be, both in terms of total volume and the working sets you will need.
While thinking about size you should also think about related scale considerations like concurrent users, throughput, and latency. Finally you must consider the types of computation that will be performed on the data.
When looking at the structure and use cases for your data, ask which of these storage models most closely matches the intention and needs of your application: row store, column store, key value, document, graph, time series, or unstructured.
Row stores model transactional data exceptionally well. Traditional row-based databases excel when you need to do joins or you need to group operations in transactions. While an RDBMS can support very large data volumes, this structure performs poorly with very large working sets. Transactional data is the canonical example of row-oriented data.
Column stores are often used in analysis, and can be very space- and compute-efficient as they can take advantage of repeating data within columns. Columnar databases are a good choice for unbounded data sets and analytic operations.
Document or Key-Value data stores work well when you can’t be tied down by a fixed schema, or when you want to search across different types of entities. Like most No-SQL databases these work well with very large data sets, and excel at write efficiency.
Graph databases are for the special case when you are more interested in the relationships between your entities than the entities themselves. Using SQL to traverse nodes and edges becomes very slow very quickly.
The last type of database is no database at all. If your data is completely heterogeneous you may ask nothing more of your storage than a way to put data in and get it out. With such a system you can bring computing power to the data, rather than the other way around, and only categorize the data at read time, structuring it differently for each new question. Unstructured object storage is often a good choice when you want to store binary data alongside the metadata that describes it.
When considering the types of computation to be performed on your data, you should first ask how much of that computation will be pushed down to the database. Traditional RDBMS systems generally have robust support for computation, whereas NoSQL database generally rely on outside compute systems.
Document, Key-Value, and Columnar databases work well for aggregate analysis, while SQL databases perform well in operational settings.
So what is the right tool for your job? Historically there was only one answer: you chose one of the general purpose databases, and you threw all your data into them. Because of this, there are many very mature, feature-rich, robust RDBMS systems on the market.
Now that so many different databases are available a new approach is possible. Now we can think about treating disparate data sources independently, using the best tools for the job. To some extent this has been happening for some time, with daily transactional data being stored in a traditional RDBMS and then transferred via ETL to columnar store for analysis. When choosing your database, don’t limit yourself to one and live with the trade-offs.
Instead, look at all of your data and all of your application needs, and use all of the tools at your disposal. With the maturation of cloud providers you can have scalable, robust, fault-tolerant, managed databases of practically any variety living together in the same private subnet in the cloud, alongside other services your application might need like object storage, API hosting, and compute capability. Don’t architect your system for the last thirty years; architect it for the next 30 years using all the tools at your disposal.
Thanks for Reading!
Enjoyed this article on how to choose the right database engine? Check out some more of our recent, related articles in the area of data:
Data is the Oil & Analytics is the Refinery of the 21st Century
Which Cloud Platform is Right for You?
Seamgen’s Recap of AWS re:Invent 2017