Data is growing bigger every day, and it will grow even bigger and faster as IoT matures. The natural choice for storing and processing data at such scale is a cloud service - AWS being the most popular among them.
AWS provides several ways of working with data at every step in the data analytics pipeline - collecting, storing, processing and analyzing the data to obtain meaningful insights. Different use cases demand different speeds, volumes and varieties of data processing.
For example, an application that provides real-time insights about network security has to be fast - there is no point knowing that the network was hacked yesterday! At the same time, such data is fairly uniform. On the other hand, getting insights from posts on Facebook may not require an instant response, but the data has a huge variety. Other use cases carry huge volume, variety and velocity, and also require an instant response. For example, a defense drone that monitors a border generates huge amounts of video, images and audio, along with information about geo-location, temperature, humidity, etc. - and all of it requires instant processing.
An interesting video recording of a presentation at AWS re:Invent 2018 can be found on YouTube. It covers the subject in great depth. This blog is an extract from that video.
AWS provides several services for each step in the data analytics pipeline - collect, store, process and analyze - and there are different architecture patterns and implementation approaches for the different use cases.
AWS recommends some architecture principles that can improve the deployment of a data analytics pipeline on the cloud. They are tailored to the AWS cloud, but can be extended to other cloud providers such as Azure.
Decoupling is perhaps the most important architectural principle irrespective of the domain and architecture style. It is equally true when we implement a data analytics pipeline on AWS.
The six steps of the analytics pipeline - Data -> Store -> Process -> Store -> Analyze -> Answers - should be decoupled enough that each can be replaced or scaled independently of the others.
AWS recommends different services for each step, based on the kind of data being processed - its structure, latency, throughput and access patterns. These aspects are detailed below. Following these recommendations can significantly reduce the cost and improve the performance of the pipeline, so it is important to understand each of these services and its use case.
The fundamental architecture principle for any application on the AWS cloud is: prefer service to server. Nobody stops us from provisioning a fleet of EC2 instances and deploying an open-source analytics framework on it - that might still be easier than keeping everything on premises. But the idea is to leverage what AWS provides us.
Managed and serverless services give us a great advantage in cost, management and scalability, and are therefore recommended wherever possible. As mentioned above, the only problem with serverless services is that they could lock us in to AWS.
A decoupled system naturally calls for an event-journal based design. Usually, the data is accumulated into an S3 bucket that remains the source of truth - never modified by any other service. That allows us to decouple the different components that read from it. Because of the high velocity of data, maintaining such a source of truth is important - it takes care of any component that drops out for any reason.
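A minimal sketch of that idea (the bucket name and key layout below are hypothetical): an ingesting component only ever appends new objects to the journal, never overwrites them.

```python
import json
import uuid
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def journal_event(event: dict, bucket: str = "analytics-source-of-truth") -> str:
    """Append an event to the S3 journal under a date-partitioned key.

    Objects are only ever created, never overwritten, so the bucket
    remains an immutable source of truth for downstream consumers.
    """
    now = datetime.now(timezone.utc)
    key = f"events/{now:%Y/%m/%d}/{now:%H%M%S}-{uuid.uuid4()}.json"
    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(event).encode("utf-8"))
    return key
```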
S3 provides an efficient data lifecycle, allowing us to transition data to Glacier over time. That gives us a significant cost reduction.
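Such a lifecycle can be configured directly on the bucket. A sketch, where the rule name, prefix and periods are illustrative only:

```python
import boto3

s3 = boto3.client("s3")

# Transition journal objects to Glacier after 90 days and expire them
# after 5 years; bucket name and periods are hypothetical.
s3.put_bucket_lifecycle_configuration(
    Bucket="analytics-source-of-truth",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-old-events",
                "Filter": {"Prefix": "events/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 1825},
            }
        ]
    },
)
```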
Again, this has nothing to do with AWS or big data specifically: any application architecture has to treat cost saving as an important design constraint. AWS helps us with several techniques for doing so - auto scaling, pay-as-you-go pricing and serverless services among them - and these should be leveraged when working with AWS.
Data is meaningless if we cannot learn from and act on it. AWS provides a range of machine-learning based services - from SageMaker to Comprehend and Alexa. Each has a use case in data analytics and can be leveraged to obtain meaningful insights and actions out of the data being analyzed.
Using these services may tie us down to AWS, but they have great utility and can add a lot of value to the pipeline.
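For example, a one-call sentiment analysis with Comprehend might look like this (a sketch; the input text is made up):

```python
import boto3

comprehend = boto3.client("comprehend")

# Ask Comprehend for the sentiment of a piece of user-generated text.
response = comprehend.detect_sentiment(
    Text="The new dashboard makes spotting anomalies so much easier!",
    LanguageCode="en",
)
print(response["Sentiment"])        # e.g. "POSITIVE"
print(response["SentimentScore"])   # confidence per sentiment class
```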
Often, the data being processed is classified as hot, warm or cold, based on various factors including the volume, the speed and the latency required:
| | Hot | Warm | Cold |
|---|---|---|---|
| Latency | Microseconds to milliseconds | Milliseconds to seconds | Minutes to hours |
| Request rate | Very high | High | Low |
| Cost / GB | $$ - $ | $ - ¢¢ | ¢ |
The analytics pipeline can be defined in terms of six steps - Collect, Store, Process, Store, Analyze, Answer. Let us now look into each of these and the different AWS services relevant to each.
Data input comes from three types of sources: transactional data from databases, files such as logs and media, and event streams from devices and applications.
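For the streaming kind of source, ingestion typically goes through Kinesis. A minimal producer sketch (the stream name, record fields and choice of partition key are all hypothetical):

```python
import json

import boto3

kinesis = boto3.client("kinesis")

def publish_reading(reading: dict, stream: str = "sensor-events") -> None:
    """Push one IoT sensor reading onto a Kinesis stream.

    The partition key controls how records are distributed across shards;
    here we use the device id so each device's readings stay ordered.
    """
    kinesis.put_record(
        StreamName=stream,
        Data=json.dumps(reading).encode("utf-8"),
        PartitionKey=reading["device_id"],
    )

publish_reading({"device_id": "drone-42", "temp_c": 21.5, "humidity": 0.4})
```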
The next step is to store the data. AWS provides a wide range of storage options, each with its own pros and cons for a given use case.
S3 is perhaps the most popular of the lot.
Apart from S3, AWS also provides several types of databases - managed as well as serverless - to store our data.
With such a wide range of solutions available, the natural question is: which one should I use? AWS recommends the below criteria for identifying the right solution. The volume, variety and velocity of the data, along with its access patterns, are the primary points to consider in this analysis.
We can summarize this in the two tables below. Based on the data structure:
| Data structure | Database |
|---|---|
| Fixed schema | SQL, NoSQL |
| No schema | NoSQL, Search |
And based on the data access patterns:
| Data access patterns | Database |
|---|---|
| Put/Get (key-value) | In-memory, NoSQL |
| Simple relationships (1:N, M:N) | NoSQL |
| Multi-table joins, transactions | SQL |
The database choice based on the data structure may not always match the one based on the access patterns. In such a case, the more prominent of the two criteria should win.
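As an illustration of the put/get access pattern, a key-value lookup on DynamoDB takes a single call per item (the table name and attributes below are hypothetical):

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("DeviceState")  # hypothetical table keyed on device_id

# Put/Get by key is the access pattern where NoSQL stores shine.
table.put_item(Item={"device_id": "drone-42", "status": "patrolling", "battery": 87})
item = table.get_item(Key={"device_id": "drone-42"}).get("Item")
print(item)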
Based on the use case, we can choose a particular data store using the below chart:
| Service | ElastiCache | DynamoDB | Aurora | RDS | Elasticsearch | Neptune | S3 |
|---|---|---|---|---|---|---|---|
| Use cases | In-memory caching | Key/value lookups, document store | OLTP, transactional | OLTP, transactional | Log analysis, reverse indexing | Graph | File store |
| Performance | Ultra-high request rate, ultra-low latency | Ultra-high request rate, ultra-low latency | Very high request rate, low latency | High request rate, low latency | Medium request rate, low latency | Medium request rate, low latency | High throughput |
| Data shape | Key/value | Key/value and document | Relational | Relational | Documents | Nodes/edges | Files |
| Data size | GB | TB, PB (no limits) | GB, mid TB | GB, low TB | GB, TB | GB, mid TB | GB, TB, PB, EB (no limits) |
| Availability | 2 AZ | 3 AZ | 3 AZ | 2 AZ | 1-2 AZ | 3 AZ | 3 AZ |
| VPC support | Inside VPC | VPC endpoint | Inside VPC | Inside VPC | Inside VPC | Inside VPC | VPC endpoint |
The next step in the pipeline is to process the data. Here too, AWS provides a wide range of options.
We have three major use cases when we process big data - interactive and batch analytics, stream / real-time analytics, and predictive analytics.
When working on interactive or batch analytics, we can expect the data to be less hot. One might expect interactive analytics to be hot, but the data volumes of an interactive session are so low that they are not considered hot. Also, a response that feels quick to a user is not really fast from a data analytics perspective. For such use cases, AWS recommends its interactive and batch processing services, such as Athena, EMR and Redshift.
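For instance, an interactive SQL query over data sitting in S3 can be kicked off with a single Athena API call (the database, table and output location below are hypothetical):

```python
import boto3

athena = boto3.client("athena")

# Start an interactive query; Athena writes results to the given S3 location.
execution = athena.start_query_execution(
    QueryString="SELECT device_id, avg(temp_c) FROM readings GROUP BY device_id",
    QueryExecutionContext={"Database": "telemetry"},
    ResultConfiguration={"OutputLocation": "s3://analytics-query-results/"},
)
print(execution["QueryExecutionId"])
```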
On the other hand, when we have data streaming in (e.g. from IoT devices and sensors) and we need to process it in real time, we have to consider a different set of processing services:
| | EMR (Spark Streaming) | KCL application | Kinesis Data Analytics | Lambda |
|---|---|---|---|---|
| Scale / throughput | No limits, depends on number of nodes | No limits, depends on number of nodes | No limits, scales automatically | No limits, scales automatically |
| Availability | Single AZ | Multi-AZ | Multi-AZ | Multi-AZ |
| Sliding window functions | Built-in | App needs to implement | Built-in | No |
| Reliability | Spark checkpoints | KCL checkpoints | Managed by Kinesis Data Analytics | Managed by Lambda |
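As the simplest of these options, a Lambda function subscribed to a Kinesis stream receives batches of base64-encoded records. A minimal sketch (the payload fields and threshold are made up):

```python
import base64
import json

def handler(event, context):
    """AWS Lambda handler invoked with a batch of Kinesis records.

    Records arrive base64-encoded; note that, per the table above,
    sliding windows would have to be implemented by hand here.
    """
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        if payload.get("temp_c", 0) > 80:
            print(f"ALERT: overheating device {payload['device_id']}")
```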
Either of the above could require predictive analytics based on the data provided. AWS provides a wide range of AI services that can be leveraged at different levels.
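At the lowest level, a model trained and hosted on SageMaker can be called from within the pipeline. A sketch, assuming a hypothetical endpoint name and feature set:

```python
import json

import boto3

runtime = boto3.client("sagemaker-runtime")

# Score a fresh record against a (hypothetical) hosted SageMaker model.
response = runtime.invoke_endpoint(
    EndpointName="network-anomaly-detector",
    ContentType="application/json",
    Body=json.dumps({"bytes_in": 10234, "bytes_out": 987, "port": 443}),
)
prediction = json.loads(response["Body"].read())
print(prediction)
```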
Naturally, the question is: Which analytics should I use?
Finally, we get to analyzing the data we have gathered. The first step here is to prepare the data for consumption, using ETL / ELT. AWS provides a variety of tools for this; the below table gives a top-level view of these services and their implications.
| | Glue | Data Pipeline | Data Migration Service | EMR | Apache NiFi | Partner solutions |
|---|---|---|---|---|---|---|
| Use case | Serverless ETL | ETL data workflow | Migrate databases (to/from data lakes) | Custom-developed Hadoop/Spark | Automate the flow of data between systems | Rich partner ecosystem for ETL |
| Scale / throughput | ~DPUs | ~Nodes, through EMR cluster | EC2 instance type | ~Nodes | Self-managed | Self-managed or through partner |
| Managed service | Clusterless | Managed | Managed, EC2 on your behalf | Managed, EC2 on your behalf | Self-managed on EMR or Marketplace | Self-managed or through partner |
| Data sources | S3, RDBMS, Redshift, DynamoDB | S3, JDBC, custom RDBMS, data warehouses | S3, various | Managed Hadoop/Spark | Various, through rich processor framework | Various |
| Skills needed | Wizard for simple mapping, code snippets for advanced ETL | Wizard and code snippets | Wizard and drag/drop | Hadoop/Spark coding | NiFi processors and some coding | Self-managed or through partner |
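As a taste of what a Glue job looks like, here is a sketch of a minimal ETL script (the catalog database, table and output path are hypothetical, and the awsglue library is only available inside a Glue job run):

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from the Glue Data Catalog, drop an unused field, write Parquet to S3.
events = glue_context.create_dynamic_frame.from_catalog(
    database="telemetry", table_name="raw_events"
)
trimmed = events.drop_fields(["debug_payload"])
glue_context.write_dynamic_frame.from_options(
    frame=trimmed,
    connection_type="s3",
    connection_options={"path": "s3://analytics-curated/events/"},
    format="parquet",
)
job.commit()
```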
Finally, this data is consumed by services that can provide meaningful insights from the processed data. These could be AI services that act on the data to generate a decision, or tools that present the insights back in a friendly format. Thus the consuming service may be one of AI apps, Jupyter, Anaconda, RStudio, Kibana, QuickSight, Tableau, Looker, MicroStrategy, Qlik, etc.
The following diagram sums up the entire process of data analytics, along with the various choices available to us.
One can choose the services to be used based on the temperature of the data.
Let us now look at a sample architecture for a real-time streaming analytics pipeline.
This uses a variety of services for processing and storing the data. As the data stream is gathered, Kinesis Data Analytics performs the initial processing. The stream is then fed into different stream-processing applications, which extract and classify different aspects of the data. These feed the AI services that make any necessary real-time predictions.
The rest is stored into a variety of data storage services, based on the type of data extracted and segregated out of the input stream. This is finally used to generate notifications and insights. The purified data stream is forwarded to any downstream application that might want to process it.
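A downstream application could consume that purified stream directly. A simplified sketch (the stream name is hypothetical; a production consumer would typically use the KCL instead, which handles checkpointing and multi-shard coordination, as the table above noted):

```python
import json

import boto3

kinesis = boto3.client("kinesis")

# Read one batch of records from the first shard of the purified stream.
stream = "purified-events"
shard_id = kinesis.describe_stream(StreamName=stream)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=stream, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
)["ShardIterator"]

for record in kinesis.get_records(ShardIterator=iterator, Limit=100)["Records"]:
    print(json.loads(record["Data"]))
```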