MongoDB 8.0 – Performance Test? Done and explained!

Introduction

Since 2009, MongoDB has revolutionised the way data is managed, offering an alternative to traditional relational databases and over the years becoming a market standard and one of the most adopted solutions for modern applications, from start-ups to large companies.
MongoDB is a document-oriented NoSQL database, a feature that distinguishes it from traditional relational databases. It is particularly appreciated for its flexibility: documents do not require a fixed structure, allowing for a more dynamic and adaptable data management.

The documents, structured according to JSON syntax, consist of key-value pairs. The value can be a number, a string, but also an array or an object, thus offering great flexibility in data representation.
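As a minimal sketch of this structure, here is a hypothetical insurance-policy document expressed as a plain JavaScript object (all names are illustrative, not from any real schema):

```javascript
// A document is a set of key-value pairs; values can be scalars,
// arrays, or nested objects.
const policy = {
  _id: "POL-2024-0001",                   // unique identifier
  holder: "Jane Doe",                     // string value
  premium: 540.0,                         // numeric value
  riders: ["theft", "hail"],              // array value
  vehicle: { make: "Fiat", year: 2021 }   // embedded object value
};

// In the database this is stored as BSON; in application code it is
// just a JSON-style object.
console.log(JSON.stringify(policy));
```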

Data are stored as BSON documents (a binary representation of JSON), grouped into an object called a collection. A database consists of several collections, and a single instance can host several databases.


To give a comparative example, let us imagine that we want to create a database relating to cars.
With a relational database, we could structure a main (master) table containing basic model information, such as name, colour code, engine code, gearbox code, etc.

Each code would correspond to a dedicated table (e.g. engine, gearbox, colour table), linked to the master table via foreign keys. This approach requires strong normalisation and the use of relationships between tables in order to reconstruct the complete model.

With a non-relational database, on the other hand, each car can be represented as a complete, self-contained document, which directly includes all relevant information (colour, engine, gearbox, etc.), without the need to retrieve it from separate collections. This allows for easier and more flexible reading of data, particularly when the relationships between elements are driven by content rather than structured references.
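The contrast between the two approaches can be sketched in a few lines of JavaScript (all table, code, and field names here are hypothetical illustrations):

```javascript
// Relational sketch: the master row only holds codes (foreign keys),
// and the complete model must be reconstructed by joining three
// lookup tables.
const engines = { E42: { type: "V6", cc: 2998 } };
const gearboxes = { G7: { type: "automatic", gears: 8 } };
const colours = { C3: { name: "rosso corsa" } };
const carRow = { name: "Roadster", colourCode: "C3", engineCode: "E42", gearboxCode: "G7" };

// Reconstructing the car requires one lookup per foreign key:
const joined = {
  name: carRow.name,
  colour: colours[carRow.colourCode],
  engine: engines[carRow.engineCode],
  gearbox: gearboxes[carRow.gearboxCode]
};

// Document sketch: the same car as one self-contained document,
// readable in a single access with no joins at all.
const carDocument = {
  name: "Roadster",
  colour: { name: "rosso corsa" },
  engine: { type: "V6", cc: 2998 },
  gearbox: { type: "automatic", gears: 8 }
};
```

Both shapes describe the same car; the document form simply pays the "join cost" once, at write time, instead of on every read.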

MongoDB potential

There are three deployment architectures in MongoDB. The simplest is standalone mode, i.e. a single instance. To ensure resilience and high availability, MongoDB supports replication: a replica set consists of a primary, which accepts reads and writes, flanked by several secondaries that continuously replicate it. In the event of a failure of the primary, one of the secondaries is automatically elected to take over, ensuring the system's business continuity.
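As a minimal sketch (mongosh, with hypothetical host names), a three-member replica set can be initiated like this; one member is elected primary and the others replicate it:

```javascript
// Sketch only: requires three running mongod instances.
rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "db1.example.net:27017" },
    { _id: 1, host: "db2.example.net:27017" },
    { _id: 2, host: "db3.example.net:27017" }
  ]
});

rs.status();   // reports which member is PRIMARY and which are SECONDARY
```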

Now, let us imagine that a client wants to read data. Mongod (MongoDB’s main process) receives the request and starts working: data is indexed in B-Tree structures (more on this later), and the default index is on _id; every document is indexed by _id even if no index is explicitly created. In the absence of other suitable indexes, mongod uses the _id index to locate the matching documents as quickly as possible. If the requested documents are already present in the cache, they are retrieved directly from there; otherwise, mongod loads them from the data files mapped in memory. Once found, it returns them to the client.
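This routing through the _id index can be observed with explain() — a hedged mongosh sketch against a hypothetical cars collection:

```javascript
// Sketch only: inspect how mongod plans a lookup by _id.
db.cars.find({ _id: "c-001" }).explain("queryPlanner");

// With no other suitable index, the winning plan should show an
// IXSCAN on the _id index (followed by a FETCH of the document),
// rather than a full COLLSCAN of the collection.
```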

One of the smartest features of mongod is the way it manages memory. It uses an internal cache to keep the most frequently used documents ready for use, which speeds up read operations considerably. As for data files on disk, on the other hand, MongoDB uses a system called memory-mapped files: data files on disk are mapped directly into memory, allowing fast access to data. This is the secret of its speed: the intelligent use of memory, combined, of course, with correct data modelling.

If, on the other hand, the client wishes to write data, mongod first updates the cache in memory and then the data files on disk. Each write operation is also recorded in a journal to ensure that nothing is lost, even in the event of a system crash. Mongod uses database- and collection-level locks to ensure that multiple operations can take place in parallel without interfering with each other. With the WiredTiger engine, acquired in 2014, MongoDB uses a method called MVCC (Multi-Version Concurrency Control) to ensure that reads are always consistent, even while writes occur.

If MongoDB is configured for replication, mongod takes care of replicating the data on multiple servers, ensuring that if one of them fails, replicas can take over without causing interruptions.

The third type of architecture is the Sharded Cluster, designed to handle large volumes of data and traffic. This architecture distributes data across several servers, or ‘shards’. Each shard contains a portion of the data and works together with the other shards to form a single logical database. Mongos is the routing process of MongoDB: it receives queries from applications and uses metadata from the Config Replicaset (a special replica set that hosts metadata about the placement of data in the shards) to route queries to the underlying mongods. Although mongos can serve any query in a partitioned environment, the chosen shard key has a profound effect on query performance.

If the query includes the shard key, mongos routes it directly to a single shard, ensuring the best performance. Otherwise, it forwards the query to all the underlying mongods and gathers the result set from each before sending the response to the application. Such ‘scatter/gather’ queries can result in long-running operations.
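The difference between the two routing paths can be sketched in mongosh (namespace and field names are hypothetical):

```javascript
// Sketch only: shard the "insurance.policies" collection on customerId.
sh.shardCollection("insurance.policies", { customerId: "hashed" });

// Targeted query: the filter contains the shard key, so mongos can
// route it to exactly one shard.
db.policies.find({ customerId: 42 });

// Scatter/gather query: no shard key in the filter, so mongos must
// broadcast it to every shard and merge the partial result sets.
db.policies.find({ policyType: "auto" });
```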

This brings with it some important considerations. If data modelling is a very important step in order to make the most of MongoDB’s immense performance potential, it is even more so if sharding is to be used. A poorly designed data model forces developers to use the $lookup operator, which allows joins between multiple collections. In addition, it may prevent the use of shard keys in queries, resulting in a scatter/gather mechanism that would greatly degrade performance and negate the benefits of sharding.

Innovation often lies not in inventing something radically new, but in critically reviewing existing limitations and overcoming them with intuitive and effective solutions. Supporting $lookup on sharded collections in transactions, allowing a specific shard to be chosen for an unsharded collection so as to avoid scatter/gather mechanisms, and allowing application data to be hosted in the config Replica Set: together these represent a real Columbus's egg, obvious in hindsight yet decisive in practice.

MongoDB Server 8.0 introduces all these (and many other) innovations, yet the novelties concerning sharding seem to have gone largely unnoticed.

A practical example

Let us now analyse this use case.

Enterprise Corporation operates in the insurance sector and manages a huge amount of data via a MongoDB-based application. In order to cope with the expected growth and to guarantee data availability for at least 15 years, both for audit purposes and to offer operators a complete historicity in a single dashboard, the company chose a Sharded Cluster with three shards.

During the design phase, the DBAs chose an efficient data model, but with some limitations resulting from the inability to use the $lookup operator in aggregations, since all collections were sharded. This led to an imbalance in the size of three main collections: while the number of insurance positions and policies grew significantly, the number of saleable products remained almost unchanged.

In order to maintain high performance, it was necessary to size the machines so that the memory was adequate for the larger collections, which increased costs. With the arrival of MongoDB Server 8.0, the scenario changed radically.
MongoDB 8.0 introduces a new feature that allows unsharded collections to be moved between shards without sharding them. This allows the DBAs of the Enterprise Corporation to reposition data within the cluster, for instance by isolating the three largest collections or, conversely, the three lightest ones. In this way, hardware resources can be scaled down wherever one or more shards end up managing a reduced volume of data.

Furthermore, as of version 8.0, config servers can also host application data, in addition to cluster metadata, effectively becoming config shards. This makes it possible to relocate application configuration collections – which are usually little used – to the config shard, further lightening the load on the main shards and optimising the use of resources cluster-wide.
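Both operations can be sketched in mongosh (MongoDB 8.0+; namespace and shard names are hypothetical):

```javascript
// Sketch only: relocate an unsharded collection to a chosen shard
// without sharding it (new in MongoDB 8.0).
db.adminCommand({ moveCollection: "insurance.products", toShard: "shard02" });

// Sketch only: allow the config server replica set to also hold
// application data, turning it into a config shard.
db.adminCommand({ transitionFromDedicatedConfigServer: 1 });
```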

Reading the release notes of the new MongoDB version, the SORINT.lab DBAs realised that $lookup is now supported in transactions even when sharded collections are involved. This will be particularly useful when more complex data extractions need to be performed, for example at the auditors' request.
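A hedged mongosh sketch of such an extraction (database, collection, and field names are hypothetical): a transaction whose aggregation joins two sharded collections with $lookup.

```javascript
// Sketch only: requires a sharded cluster running MongoDB 8.0+.
const session = db.getMongo().startSession();
session.startTransaction();
try {
  const sdb = session.getDatabase("insurance");
  // Join each insurance position with its policy document.
  const result = sdb.positions.aggregate([
    { $lookup: {
        from: "policies",        // a sharded collection
        localField: "policyId",
        foreignField: "_id",
        as: "policy"
    } }
  ]).toArray();
  session.commitTransaction();
} catch (e) {
  session.abortTransaction();
  throw e;
} finally {
  session.endSession();
}
```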

In conclusion, MongoDB proves to be a solid and flexible solution for managing large volumes of data. The combination of high performance, optimised memory management and advanced support for replication and sharding makes it a mature and reliable choice for even the most complex scenarios.