Data management has been always at the heart of IT applications, as most of them base their operation on the processing of input data and the subsequent production of output data. For several years, transactional databases based on the relational model and the Structured Query Language (SQL) have dominated the data management landscape and helped supporting the most prominent enterprise applications like ERP, CRM, Business Intelligence (BI), knowledge management and reporting systems. During the past decade, this landscape has radically changed, due to the advent of Big Data applications and the introduction of new types of data management infrastructures such as Data Lakes and NoSQL databases. Nevertheless, both the old and the new data management landscapes share a common characteristic: The availability and popularity of open source databases.
Open source databases provide a compelling option for data management in the Big Data era: They come at an affordable cost and incorporate a wide range of novel data management capabilities that enable innovative applications. The landscape of open source databases provides a wide array of possible choices, which could make the selection of a proper open source database challenging. In this context, it’s always good to understand the main features of open source databases and when it’s a good idea to use each one of them. Thus a high-level classification of open source databases in SQL, noSQL and special-purpose databases is useful. Databases falling in each of these categories serve a different purpose at the first place.
Open Source SQL Databases
Despite the emergence of Big Data that include various sources of unstructured and semi-structured data, a large number of companies are still in-need of handling conventional, structured, relational datasets. That’s for example the case for all traditional enterprise applications that need to offer structured reporting functionalities. Luckily, the open source community has made available several open source relational databases, which offer features comparable to the large-scale enterprise databases. Prominent mentions include:
- MySQL, which is the relational database that the vast majority of IT professionals know about. It is widely deployed and suitable for most transactional use cases. It has a long history, along with strong support from the open source community. Moreover, its ecosystem is currently supported by Oracle, a global leader in relational database systems. Unless you have special relational databases management requirements, MySQL is an open source option that will fulfill your needs.
- PostgreSQL, which is another popular relational database that has been around for over twenty years. It is very popular within certain developer communities, such as Ruby, Python and Go developers. Moreover, it is common for several developers to adopt PostgreSQL after starting with some other relational database. This usually happens due to the fact that PostgreSQL is very stable, while offering some unique features like native support for storing documents (e.g., JSON and XML), fine integration with scripting languages (such as Perl and Python), as well as its full text search capabilities. Despite being an SQL database, PostgreSQL comes with some noSQL features such as document management and support for key-value storage and management. This enables a hybrid data management approach that is preferred by several developers.
- SQLite, which is a very lightweight, “serverless” relational database. Its lightweight nature makes it very appropriate for some applications like apps running in CPU constrained devices. It supports advanced SQL features like transaction management and subqueries, while at the same time providing document support. Nevertheless, it has inevitably some limitations that make it inappropriate for large scale applications. For example, it offers a maximum database size of 140 TBs, while putting limits on the size of columns and rows that it supports.
Open Source noSQL Databases
In the Big Data era, noSQL databases have been gaining momentum, as they offer virtually infinite scalability. However, this scalability comes at the expense of sacrificing strict transaction guarantees and lack of support for structured schemas that facilitate reporting. Prominent example of open source noSQL databases include:
- MongoDB, which is probably the most popular noSQL databases and already used in the scope of full stack development as part of the so-called MEAN (Mongo Express Angular, NodeJS) JavaScript stack. It is a very good choice for applications that require flexible, document-oriented schemas, rather that rigorously structured relational schemes. Hence, it is also a good option for several Web 2.0 applications like blogs. MongoDB is probably your first choice when you need to get started fast with noSQL databases e.g., based on a migration from mySQL to a noSQL alternative.
- Neo4j, which falls within a special class of noSQL databases known as graph databases. Graph databases are an excellent choice for representing connected data and relationships between entities, which are very typical in the scope of social networks and social graphs. Likewise, Neo4j is an excellent choice for supporting graph queries and analytics, traversing social graphs and visualizing graph-like relationships. Moreover, it can support the transformation of tabular data into graph data.
- Cassandra, which is an example of a Java based columnar noSQL database. As such it abstracts and stores data in columns rather than in rows, as a means of minimizing seek time. It supports use cases that entail high loads and zero downtime tolerance, while offering fast write performance. As such is it commonly used for logging and analytics on audit trails.
Open Source Special Purpose Databases
Beyond the SQL and noSQL distinction, there are also databases that are designed for specialized purposes and applications, including applications that did not exist few years before. For example:
- RethinkDB is a database for real-time applications. Its main value proposition lies in the fast pushing of updates to the data from the database to the application (e.g., its front end). This enables applications to get notified of changes to the data in real time. As such RethinkDB is a very good choice for games and real-time analytics.
- Redis is an in-memory database that enables the implementation of middleware functionalities like caching. In practice it’s a key value database with data living in random access memory, which makes it extremely fast. Beside caching, consider using it in applications that comprise distributed data-centric components that need to communicate very fast with each other.
- TimeScaleDB serves the needs of Internet of Things (IoT) data and applications. As such it can be tailored to storing, analyzing and visualizing data from sensors and other internet connected objects. It is used for reporting and analytics over IoT data, while offering excellent support for large amounts of time-based data (i.e. timestamped data series).
There are tens of other open source databases that could be added on this list. However, the given list is overall representative of the open source choices that are offered to Big Data developers. The latter have to consider various trade-offs when selecting their databases. They might have to consider other criteria like the level of support for their preferred programming platform, the learning curve associated with each database, as well as the compatibility of the target database with the overall context of the project at hand. Nevertheless, the wealth of available choices is certainly a “pleasant headache” that empowers developers to ultimately identify what suits their needs best.