🇬🇧🇺🇸 What's the deal with NoSQL?
NoSQL sometimes feels like an overused buzzword. Sometimes it feels like it's the shiny new thing that everybody uses in their projects. A usual debate I heard too often is about whether NoSQL is better than regular SQL (which, by the way, is the wrong thing to ask anyway). The correct question is: In what cases NoSQL shine?
In this post I want to talk a little about NoSQL as a whole category of databases, the flavors they come in, give some examples, and their specific use cases. I will try to shift the popular focus from whether NoSQL or SQL is better (they are both great with their own use cases) to which tool is better suited for which job.
SQL and NoSQL
But first, let's settle on the definitions for each: what is SQL and what is considered NoSQL?
SQL stands from Structured Query Language and represents a way to query big chunks of data, that is stored in a structured manner (in tables). There are different software applications that are capable of understanding this kind of languages, and are commonly referred to as RDBMSs ( R elational D ata B ase M anagement S ystems). Most popular such systems are PostgreSQL, MySQL, and Microsoft SQL Server (there would also be SQLite, but this one is only suitable for development, not real applications that are used by real people).
NoSQL stands for N ot O nly SQL and are systems that share some similarities with traditional RDBMSs but they have some very special capabilities. There is no set definition for such systems, as they are grouped in multiple categories which have different use-cases:
- Key value stores (eg. Redis, Memcached) which are suitable for caching and simple and fast simple data manipulation.
- Column stores (eg. Cassandra, HBase) which are suitable for working with big data quantities, as they optimize data querying and retrieval, as long as we don't work on many columns at the same time.
- Document stores (eg. MongoDB) which are suitable for storing arbitrary shaped (unstructured) data.
- Graph databases (eg. Neo4j) which facilitates storing and working with graph shaped data (nodes and edges).
It is important to understand that no database is a "one size fits all" solution. Each database is a tool, and we need to figure out which tool is best suited for our specific use cases.
There are certain well-known, tried and tested technology picks that you can't go wrong with though for specific problems that occur pretty often in the industry (eg, caching, log aggregation, timeseries storage, analytics/aggregations).
Key value stores
When to use them? The most common use-case for these kinds of NoSQL databases are caching, temporary storage of simple data types and sharing information and configurations between distributed services.
Examples of tech The most popular such software are:
- Redis - usually used for caching frequently used data, and sometimes as message queue in cases where message delivery doesn't have to be guaranteed.
- Memcached - same as Redis, for caching, but it also has a more interesting use case as a shared configuration storage, to be shared between distributed services. A real-life example of this use-case is inside a Kubernetes cluster's control plane: all the metadata and state of the cluster is usually stored inside a distributed Memcached cluster (to prevent downtime and data loss).
Column stores
Column stores are databases that are usually built with big data in mind. They are called like that because of the way they store the data: they put the information stored for a column closer together, so that any operation that is done on a single column is faster than usual.
They often must be deployed in a distributed manner, in a cluster, so that they can scale horizontally easier than other more traditional solutions for storing tabular data (for example, PostgreSQL and MySQL don't have very good clustering/distributed support. These capabilities were added afterwards, and they are pretty hard to manage and scale because of that).
The most frequent use case for this kind of databases are for applications that need to do a lot of aggregations (eg. analytics platforms) and no range filters (filtering on ranges require special indexing and access patterns which are hard and inefficient to do in a column store. They are not impossible in some cases, but they are discouraged). For example, Cassandra is used at Facebook for storing the Messenger messages, because the only filtering they need to do is by sender or receiver, and the time based ordering is done by creation time (a combination of clustering and partitioning keys to achieve the desired result).
One big drawback of such databases is the inability to filter by ranges (greater than, inside an interval, etc) and arbitrary ordering at runtime.
Document stores
This family of databases specialize in storing, indexing and querying unstructured data, which are also commonly referred to as documents. We don't have a specific schema for our "tables" (they are in fact called collections), and we can put any kind of arbitrary shaped data in them. The keys can be different, we can have nested data (nested arrays and objects) and when we do queries, we can "traverse" these nested objects.
It's a powerful feature, indeed, but nowadays relational databases also gained the capability to work natively with arbitrary shaped and nested data.
Honestly, I consider document stores to be a dying breed of database, because they don't offer unique enough features to justify using them in complex applications. You can safely replace them with a modern RDBMS (eg PostgreSQL) and not lose any flexibility. You will even be able to do more nice things such as joins and foreign key constraints.
One of the biggest drawbacks (in my opinion) is the inability to have foreign keys. There are high chances that you will need foreign keys anyway. It's rare nowadays to have entities that exists in void that don't reference other entities and are not referenced by any other entity.
Graph databases
They specialize in storing heavily relational data with accent on connections between the entities it stores. Just like a graph (duuh).
They are a pretty obscure or niche type of databases that are not very used in the industry, mostly because they have very specific use cases. It's rare to need to store and work with data with so much focus on graphs. In most cases, using traditional RDBMSs is enough, as long as you don't need to traverse more that 2-3 edges at the same time (simple table joins make that possible with ease).
In most modern systems, this is enough.
Other reasons for the lack of popularity of these kind of databases are the lack of tooling, unfamiliarity of the developers with this technology (there are not many developers out there specialized in working with graph based databases) so hiring talent to develop your application that uses a graph database is harder.
They are a nice solution to a nice problem, but unfortunately, the complexity of modern apps don't require this kind of control and insights into graph modeled data. Foreign keys and joins are usually more than enough for 99% of the use cases.
What should I use for my project?
My 2 cents regarding choosing a database for your project, no matter the size and complexity: go with RDBMSs as the primary database, you can't go wrong with that. They are versatile and powerful enough for most of the use cases, and they give you all the features you need.
Applications nowadays are built around relational data, so there makes little sense to pick as the primary data store something that doesn't work well with relational data out of the box.
Later on, as you uncover more cases, start to hit the limits of RDBMSs and find yourself designing your schemas around these limitations, you can start attaching NoSQLs to your app, to handle these specific use-cases separately (as complementary solutions).
Using a NoSQL database as the primary database of a full fledged database is a bad choice, and shouldn't be encouranged. Yeah, they are cool, hip and are considered bleeding edge tech, but they lack the most common capabilities any application needs. You will end up reinventing a lot of wheels and patch together a lot of custom application code to handle these cases (eg. cascading deletes, foreign key constraints so we don't end up with inconsistent data).
Conclusion
NoSQL are an interesting kind of databases that have very cool use cases and resolve very specific problems.
Unfortunately they are not suitable for the most common use cases of an application. They shine as complementary tech that works together with a main RDBMS to handle those specific cases.
More than often, RDBMSs are enough for any application, as they offer some features that make life easier: foreign keys and everything nice that comes with it (joins, constraints, cascading deleted, etc).