Relational Database Management Systems (RDBMS) and Structured Query Language (SQL) associated with them represent a mature technology that existed for over 30 years. However, a group of data storage technologies based on a completely different paradigm, collectively known as NoSQL, is establishing itself as a popular alternative, especially within performance-critical systems that are intended to handle large quantities of data. Despite this, SQL is far from being obsolete. There are situations where traditional RDBMS systems still remain a much better choice than NoSQL.
There are many good articles on the web that provide comparison between SQL-based and NoSQL systems. This article, however, aims to take a slightly different approach by outlining a number of fairly common use cases and assessing suitability of different database management systems in each one of them. What is important to note is that there is no such thing as a generic NoSQL. Unlike SQL-based RDBMS, NoSQL consists of a number of completely different and unrelated technologies. Therefore we are not only comparing RDBMS against NoSQL, but also different types of NoSQL systems against each other.
Brief introduction to different database types
As I have mentioned previously, SQL-based RDBMS software of various makes was traditionally used as the most common data storage technology. All of RDBMS systems consist of tables that can be joined to each other via common values in certain columns or foreign keys . All of the systems that belong in this category use SQL as the language that retrieves and manipulated the data and dictates the structure of the databases themselves. Although different RDBMS systems use distinct dialects of SQL (e.g. Oracle's PL/SQL vs Microsoft's T-SQL), the core of the syntax is exactly the same. Most of the database design best practices are also applicable to all makes of RDBMS. The best-known examples of RDBMS are Microsoft SQL Server, Oracle Database, MySQL and PostgreSQL.
NoSQL, on the other hand, consists of several completely unrelated technologies, each consisting of its own data manipulation language, capabilities and best practices. They can broadly be split into 4 distinct categories. However, this categorisation is very broad, as no two NoSQL systems from the same category are similar enough to each other that knowing one of them would imply that only relatively little learning effort would be required to become proficient in the other one, like it is the case with RDBMS. The categories are as follows:
- Key-Value store: The data is stored in a hashtable where each unique key corresponds to a particular data object. Examples include DynamoDB, InfinityDB and Redis.
- Document Store: Data is stored in a form an object written in a declarative language, such as JSON or XML. Examples include MongoDB, CounchDB and DocumentDB.
- Column-oriented DBMS: Data is stored in tables, just like in traditional RDBMS, but it is partitioned by columns rather than rows. Examples include HBase, MariaDB and Metakit.
- Graph database: Database is expressed in a form of network that can be visualised. Examples include Neo4j, InfiniteGraph and ArangoDB.
To complicate the matter even further, some database management systems have capabilities of both traditional RDBMS and various types of NoSQL. Microsoft SQL Server, for example, has the ability to manage columnar databases in data warehousing scenarios and, as of 2017, it has capabilities of managing graph databases. When it comes to open source, PostgreSQL has the ability to manage document store.
When you need a small-scale system or a website
If you are building something small that is intended to remain relatively small, classic RDBMS is the best solution by far. This is because the technology is very mature and there is a large number of support available on the web. Many NoSQL databases evolve rapidly, so the information about a particular software package that you may find online may be out of date already. However, with RDBMS, you can be sure that the tutorial you will read today will not be obsolete tomorrow. Also, the performance of RDBMS software only noticeably degrades when you start having tables several gigabytes in size. Therefore, on a small system, you are unlikely to ever hit the limitations of RDBMS.
Many technologies that are intended to act as a framework for a small-scale systems are already using RDBMS. For example, SQLite is used as a data storage mechanisms for mobile apps and embedded systems, while WordPress, a popular website-building platform, comes with MySQL.
When you would expect your data structure to change fairly often
This is a scenario where RDBMS is not the best choice. Relational databases come with a rigid pre-defined schema and operations that significantly change the data structure can be computationally expensive; especially when your system already contains large quantities of data. Most of NoSQL databases are schema-less; therefore data can be stored in any format. However, the exact type of NoSQL database to be used depends on a more specific use case scenario.
When you need a system that is frequently used and is expected to grow significantly
The major limitations of RDBMS systems are degrading performance as the tables grow and difficulty in scaling systems up.
The data is retrieved slowly from large tables because selection queries do table scans and, as the data grows, greater chunks of disk space need to be scanned. This problem can be mitigated by adding indexes, so a much smaller collection of data is scanned before the right data is retrieved. However, indexes have their own problem. Every time new data is inserted, index needs to be updated. This will make insertion operations slow, which may not be so good for multi-user interactivity scenario. Although it is possible to scale SQL-based systems up as they grow, the process is not usually very easy. There are RDBMS tools that enable high-availability, such as AlwaysOn availability group for SQL Server and there are methods of sharding SQL-based databases, so it is split into several smaller context-specific databases. However, those techniques and tools were developed for the systems that weren't designed for sharding or redundancy in the first place. Therefore the processes are not always easy to implement.
Many NoSQL database management systems, on the other hand, were specifically designed for high availability and performance with many of such actions done for you in the background at a press of a button. Document-based store, such as MongoDB, would probably be the best choice for a large-scale user-interactivity scenario. An individual article or a conversation would be stored as a document with all associated data items, such as comments and replies, being contained within the document itself rather than separate tables. As the document is already stored in a format hat many programming languages understand, modifying the document becomes fast. As everything is stored in the same entity, the lookup is really fast. This is what allows conversations on social media and messaging apps to be updated instantaneously, even when the number of simultaneous users is very large.
When you need a transactional system where consistency is critical
One of major disadvantages of NoSQL systems is that they employ eventual consistency. This means that, when a data update is made, it will not be instantly applied to all related documents and all database replicas. Also, as not all data is updated at the same time, some data loss may occur; although this is rare on well-designed systems. This, however, makes a system that relies on eventual consistency unsuitable as a back-end for a system where data consistency and durability are critical, such as accounting software or payment record systems.
RDBMS, on the other hand, employs ACID transactions which guarantee that, if update has occurred in one part of the system, all related parts of the system are updated in the same transaction. The disadvantage of this is that the transaction may take some time to execute on a large system. This is one of the reasons why online payments are not processed instantly. However, the transaction speed in such scenario is of lesser importance compared to the integrity of the data.
When you are adding business analytics to your system
Business analytics would rely on long-running report-producing processes; therefore the processes aren't expected to be particularly quick. The data that is being analysed is either the kind of data that doesn't change frequently, such as user information, or historic data that is stored away in a data warehouse. If the analysis relies on complex queries, then RDBMS is the most suitable candidate for business intelligence data store. In such systems, data redundancy (and, therefore, storage space and inconsistency) can be reduced by applying normalisation, so each table would contain an atomic dataset and multiple datasets would be queried simultaneously by using joins.
If, however, business analysis is intended to be fairly simple, then columnar database is far superior candidate. For example, if you would want to know how many purchases were made where customers have spent between £100 and £1,000 in a single purchase, the process of querying a row-based table in RDBMS would be slow if you have many purchase records. In such scenarios, we would need to scan the entire table, including the columns that aren't related to the data that we are interested in and only then extract purchase value information from it and perform all of the necessary calculations. In a large table, separate rows would be stored in separate chunks of physical storage; therefore multiple read operations would be necessary for the scan of the entire table, which is computationally expensive. In columnar database, however, all data from the purchase value column will be stored close together. As this is the only data we are interested in, we would not even need to look at any other columns. Therefore this type of query would be much faster on a columnar database compared to a classic row-based database from RDBMS.
When you are building a social network
Graph storage is by far the best choice of database management technology for building the core functionality of a social network. This database type consists of nodes and relationships, the latter of which is also known as graphs and edges. Nodes are similar to entities in any other database type. However, while RDBMS also has a concept of relationships, in an SQL-based systems relationships are nothing more than links that connect multiple tables by specific columns. In a graph database, relationships are entities in their own right.
This type of arrangement allows user profiles to be defined as nodes and various relationships between them (e.g. proposed a friend request, accepted a friend request, updated a relationship status, etc.) are defined as graph relationships. Each of the graphs would contain all the data about a given relationship, decoupling it from a user profile itself.
Of course, this will only apply to the most fundamental functionality of a social network. Additional things, such as messaging and data analytics, would probably be handled by different database types. This is why Facebook has a diverse range of back end technologies.
When you are working on a fraud detection system
Fraud investigation, or any other criminal investigation for that matter, relies on establishing a network of related people and various entities. This is why, as you may have seen in documentaries or movies, briefing room of a police detective team usually has a whiteboard where pictures of suspects and various pieces of information are linked together and visualised as a network. As graph database natively stores data in a format similar to this, this type of database would be the most suitable choice for any kind of a criminal investigation.
When your goal is to build a multiplayer game
Depending on its scale, a multiplayer game may have a variety of data storage technologies implemented for different purposes. However, one particular set of data that any multiplayer game would need to maintain is the current state of each user. Key-value datastore is the best suited database technology for this purpose.
Key-value stores work in a similar manner to a file system where filename would act as the key and the content of the file would act as a value. Value can contain absolutely anything, while the key needs to be relatively short and consist of a simple data type, such as a string of alpha-numeric characters. In a multiplayer game scenario, key may be a unique user identifier, while the value can contain absolutely any information relevant to this user. As the user key is well-known and there is only a single list of user keys that needs to be queried to retrieve user-state information, any queries on such database would be fast, regardless of its size.
When you are developing an e-commerce web app
E-commerce web app would definitely require different types of data store for different purposes. The product data would probably be stored in a document database to allow reliability and large-scale user interaction, payment-processing part will be done in RDBMS to guarantee consistency and business intelligent part would use either RDBMS or columnar data store.
Recommendation engine of an e-commerce app, i.e. the type of software that would suggest potentially suitable purchases for you while you are doing your online shopping, can be most effectively built by using graph database. The query would look at how the products that are being purchased relate to other purchases in a typical shopping session and would show those products that are most frequently purchased together.
Written by Fiodar Sazanavets
Posted on 5 Jan 2018