Database Design Patterns


A database is an essential component of any real-world application, and a good database design is imperative for a good application design. To enable good database design, experts have identified several useful database design patterns.

SQL or NoSQL


This is the first and foremost question when we think of databases: what are the advantages and disadvantages of a relational v/s a document-oriented database?

It is a common misconception that relational databases are outdated, old-fashioned, and meant only for legacy applications. Relational databases are old (and mature), but certainly not outdated. Many use-cases still need a relational database.

Different applications deal with different types of data, at different stages of the data lifecycle. The raw data that our applications receive has very high Volume-Velocity-Variety. Such data is best suited to a document-oriented database, where we just dump the data under an accessor key.
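
As a minimal sketch of that dump-and-key pattern - assuming MongoDB with pymongo, and a hypothetical collection and key format - this is all the ingestion layer needs to do:

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    raw = client["ingest"]["raw_events"]  # hypothetical database/collection

    # Dump the incoming payload as-is, keyed by an accessor key.
    # No schema is enforced; the document store accepts any shape.
    raw.insert_one({
        "_id": "device-42/2024-01-01T00:00:00Z",  # hypothetical accessor key
        "payload": {"temp": 21.5, "unit": "C"},   # raw, unprocessed data
    })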

As the data is processed and information is extracted, it crystallizes into a more structured form. Some data is intrinsically structured - for example, application logs. We can structure it further with some processing. On the other hand, some data is intrinsically unstructured - for example, a relationship graph. We can stretch our imagination to cast it into structured data, but that would defeat the purpose.

Similarly, the data in candidate CVs is intrinsically document-oriented. It can come in a variety of formats, including images, PDFs, and plain text.

Similar v/s Related Data Together


Let's get back to the CV data stored by a recruitment firm. For simplicity, let us assume it is plain text (no files or images). A CV can have details about education (school, graduation, post-graduation, and so on), experience, roles, technologies, and personal details. Although this is not perfectly structured, we can accommodate this information in a relational database. We can have a master table that contains a key for the candidate, along with keys to the records in different tables. We can have a table for all experience records - with columns for the company id, the start date, the end date, and the designation. Similarly, we can have a table for all companies, indexed by the company id.

This way, we can have 20+ tables to contain all the data about all the candidates. Note that this layout puts "similar data together". The Experience table holds the experience records of all the candidates, the Company table holds the details of all the companies, and so on.
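
A minimal sketch of this layout, using SQLite for illustration (the table and column names are hypothetical, not prescribed here):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        -- Master table: one row per candidate.
        CREATE TABLE candidate (id INTEGER PRIMARY KEY, name TEXT);

        -- All companies together, indexed by the company id.
        CREATE TABLE company (id INTEGER PRIMARY KEY, name TEXT);

        -- All experience records of all candidates together.
        CREATE TABLE experience (
            candidate_id INTEGER REFERENCES candidate(id),
            company_id   INTEGER REFERENCES company(id),
            start_date   TEXT,
            end_date     TEXT,
            designation  TEXT
        );
    """)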

On the other hand, we can also keep related data together. We can have a document structure, where each record gives us the data for a given candidate.
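
In the "related data together" layout, the same information becomes one self-contained document per candidate. A sketch, again with hypothetical field names:

    from pymongo import MongoClient

    candidates = MongoClient("mongodb://localhost:27017")["hr"]["candidates"]

    # One record carries everything about a single candidate.
    candidates.insert_one({
        "_id": 1001,
        "name": "A. Candidate",
        "education": [{"stage": "graduation", "school": "State University"}],
        "experience": [{"company": "Acme Corp", "start": "2019-04",
                        "end": "2021-03", "designation": "Developer"}],
        "personal": {"city": "Pune"},
    })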

Neither approach is right or wrong; they are simply different, with different implications and different use-cases. In the above example, "similar data together" makes it very easy to structure the data; that is almost impossible with "related data together". The first approach lets us fetch the list of all the companies in a simple, short query. In the second approach, that requires a lot of effort (for the developer as well as the database engine). On the other hand, it is very easy to get all the data for a candidate with the second approach; that requires a lot of effort with the first.
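
The trade-off shows up directly in the queries. Continuing the two hypothetical sketches above (con is the SQLite connection, candidates the pymongo collection):

    # "Similar data together": listing all companies is one short query.
    companies = con.execute("SELECT name FROM company").fetchall()

    # "Related data together": the same list means scanning every candidate
    # document and de-duplicating the embedded company names.
    companies = {
        exp["company"]
        for doc in candidates.find({}, {"experience": 1})
        for exp in doc.get("experience", [])
    }

    # Conversely, all data for one candidate is a single fetch here,
    # but a series of joins in the relational layout.
    profile = candidates.find_one({"_id": 1001})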

In most applications, either extreme carries a lot of overhead. A good solution identifies which related data should go together and which similar data should go together.

This choice has little to do with relational or non-relational databases. A MongoDB-based solution can use embedded or normalized data models, and an Oracle-based solution can use CLOBs. The important point to check is the kind of queries we intend to fire. A good solution picks an appropriate point between the two extremes.
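
For instance, the earlier embedded document can be partially normalized within MongoDB itself, by pulling companies out into their own collection (a sketch; the names are hypothetical):

    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017")["hr"]

    # Normalized model: experience entries hold references, not full companies.
    db.companies.insert_one({"_id": "acme", "name": "Acme Corp"})
    db.candidates.insert_one({
        "_id": 1001,
        "experience": [{"company_id": "acme", "start": "2019-04"}],
    })

    # Resolving the reference costs an extra lookup at query time.
    doc = db.candidates.find_one({"_id": 1001})
    company = db.companies.find_one({"_id": doc["experience"][0]["company_id"]})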

Atomicity and Transactions


A well-designed, scalable solution should not need atomicity. Atomicity implies an intrinsic hold-and-wait, which is disastrous for an application that plans to scale horizontally. Yet there are times when we just cannot eliminate the need for an atomic transaction.

Some relational databases, like Oracle, provide an explicit commit, along with atomic increments and updates. But many other databases only ensure atomicity of a single update. In such cases, it is the developer's responsibility to ensure consistency of the data. MongoDB provides document-level atomicity; but if the data is normalized and split across documents, an update cannot be atomic.
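
A sketch of what document-level atomicity buys, with pymongo and a hypothetical accounts collection: an update within one document is atomic, while a change spanning two documents is not, and the developer must keep the pair consistent.

    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017")["bank"]

    # Atomic: both changes touch a single document, so they land together
    # or not at all.
    db.accounts.update_one(
        {"_id": "acct-1"},
        {"$inc": {"balance": -100}, "$set": {"status": "debited"}},
    )

    # Not atomic: two documents, two separate updates. A crash in between
    # leaves the data inconsistent unless the application compensates.
    db.accounts.update_one({"_id": "acct-1"}, {"$inc": {"balance": -100}})
    db.accounts.update_one({"_id": "acct-2"}, {"$inc": {"balance": 100}})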

On the other hand, DynamoDB makes much weaker claims about atomicity and consistency. It assures "read after write" consistency for new records, but only "eventual consistency" for updates. That means it is possible to read the old value after we put a new one.
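
With boto3 this surfaces as a per-read choice (the table name is hypothetical): the default read is eventually consistent, and a strongly consistent read has to be requested explicitly.

    import boto3

    table = boto3.resource("dynamodb").Table("candidates")  # hypothetical table

    table.put_item(Item={"id": "1001", "status": "updated"})

    # Default read: eventually consistent - may still return the old value.
    maybe_stale = table.get_item(Key={"id": "1001"})

    # Strongly consistent read: reflects all prior successful writes.
    fresh = table.get_item(Key={"id": "1001"}, ConsistentRead=True)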

Either way, atomicity should become a thing of the past. A good database design - relational or non-relational - should not require atomicity.

Indexes


This is one of the most important aspects of database design. Indexes need to be identified based on the structure of the data and the pattern of queries on it. Fully unstructured data would have just one index, based on the data key. Semi-structured data could afford one more index, based on some field it holds. Indexes can improve query performance up to a limit; but too many indexes can actually cause performance problems instead of solving them.

An index pulls out a part of the data into a smaller searchable subset. It helps performance by providing a smaller dataset to search: instead of going through the entire collection or table, we just scan through the index. But if we have too many indexes, the index dataset itself can grow larger than the main data.

Also, the major cost of an index is paid when updating the data: every write must update the indexes too, and as an index gets stale or fragmented, it leads to performance issues.

There are several types of indexes that can help in different use cases:

  • Single Field Index: Index on just one field
  • Multi Field Index: Index on a combination of multiple fields
  • Hashed Index: Index on the hash value of the data rather than the data itself. This is useful when working with huge document data.
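
In MongoDB terms, for example, each of these maps onto create_index (pymongo; the collection and field names are hypothetical):

    from pymongo import ASCENDING, DESCENDING, HASHED, MongoClient

    candidates = MongoClient("mongodb://localhost:27017")["hr"]["candidates"]

    # Single field index.
    candidates.create_index([("name", ASCENDING)])

    # Multi field (compound) index: the order of fields matters.
    candidates.create_index([("company_id", ASCENDING),
                             ("start_date", DESCENDING)])

    # Hashed index: indexes a hash of the value rather than the value itself.
    candidates.create_index([("resume_text", HASHED)])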