Big Data on AWS


Data is growing bigger every day, and it will grow even bigger and faster as IoT matures. The natural choice for storing and processing data at this scale is a cloud platform - and AWS is the most popular among them.

AWS provides several ways of working with data at every step in the data analytics pipeline - collecting, storing, processing and analyzing the data to obtain meaningful insights. Different use cases demand different speeds, volumes and varieties of data to be processed.

For example, an application that provides real-time insights about network security has to be fast - there is no point knowing that the network was hacked yesterday! At the same time, such data is fairly uniform. On the other hand, extracting insights from posts on Facebook may not require an instant response, but the data has a huge variety. Other use cases carry huge volume, variety and velocity and also require an instant response. For example, a defense drone that monitors a border would generate huge amounts of video, images and audio, along with information about geo-location, temperature, humidity, etc. - and all of it requires instant processing.

An interesting video recording of a presentation at AWS re:Invent 2018 can be found on YouTube. It covers the subject in great depth. This blog is an extract from that video.

AWS provides several services for each step in the data analytics pipeline - collect, store, process and analyze. We have different architecture patterns for different use cases, including:

  • Batch processing
  • Interactive
  • Stream processing
  • Machine Learning

There are different approaches to implement the pipelines:

  • Virtualized: The least encouraged one is to use your favorite open source analytics framework and deploy it on EC2 instances.
  • Managed Services: AWS provides several services that run on provisioned servers managed by AWS. Some of these are based on popular open source frameworks; others are proprietary to AWS.
  • Containerized: These are better than the previous two - we do not have to manage the underlying EC2 instances ourselves, which gives us a good amount of cost savings.
  • Serverless: These are highly cost effective and scalable. AWS encourages us to switch over to native serverless architectures. The only downside of this approach is that it locks us in to AWS. If you don't mind that (I don't), a serverless architecture is the best choice.

Architecture Principles


AWS recommends some architecture principles that can improve the deployment of a data analytics pipeline on the cloud. They are tailored towards the AWS cloud, but may be extended to any other cloud provider like Azure.

Build decoupled systems


Decoupling is perhaps the most important architectural principle irrespective of the domain and architecture style. It is equally true when we implement a data analytics pipeline on AWS.

The six steps of the pipeline - Data -> Store -> Process -> Store -> Analyze -> Answers - should be decoupled enough that each can be replaced or scaled independently of the others.

Right Tool for Right Job


AWS recommends different services for each step, based on the kind of data being processed - its structure, latency, throughput and access patterns. These aspects are detailed below. Following these recommendations can significantly reduce the cost and improve the performance of the pipeline. Hence it is important that we understand each of these services and its use case.

Leverage Managed & Serverless services


The fundamental architecture principle for any application on the AWS cloud is: prefer service to server. Nobody stops us from provisioning a fleet of EC2 instances and deploying an open source analytics framework on it. It might still be easier than running everything on premises. But the idea is to leverage what AWS provides us.

The managed and serverless services give us a great advantage in cost, management and scalability, and are therefore recommended. As mentioned above, the only problem with serverless services is that they could lock us in to AWS.

Use event-journal design pattern


A decoupled system practically requires an event-journal based design. Usually, the data is accumulated in an S3 bucket that remains the source of truth and is not modified by any other service. That allows us to decouple the components that read from it. Because of the high velocity of data, maintaining a source of truth is important - it protects us when a downstream component drops out for any reason.

S3 also provides efficient data lifecycle management, allowing us to archive data to Glacier over time. That helps us with a significant cost reduction.
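
For illustration, here is a minimal sketch of such a lifecycle rule using boto3. The bucket name, prefix and retention periods are assumptions made for the example, not part of the original presentation.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical event-journal bucket: raw events move to Glacier after 90 days
# and expire after two years. Adjust the rule to your own retention needs.
s3.put_bucket_lifecycle_configuration(
    Bucket="analytics-event-journal",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-events",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 730},
            }
        ]
    },
)
```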

Be cost conscious


Again, this has nothing to do with AWS or Big Data in particular. Any application architecture has to treat cost as an important design constraint. AWS helps us with different techniques for doing that - auto scaling, pay-as-you-go pricing, serverless, and more. These have to be leveraged when working with AWS.

Enable AI Services


Data is meaningless if we cannot learn from it and act on it. AWS provides a range of Machine Learning based services - from SageMaker to Comprehend and Alexa. Each has a use case in data analytics and can be leveraged to obtain meaningful insights and actions out of the data being analyzed.

Using these services may tie you to AWS, but they have great utility and can add a lot of value to the pipeline.

Temperature of Data


Often, the data being processed is classified as hot, warm or cold. This is based on various factors, including the volume, the item size, the request rate and the latency required.

|              | Hot                         | Warm                   | Cold            |
|--------------|-----------------------------|------------------------|-----------------|
| Volume       | MB-GB                       | GB-TB                  | PB-EB           |
| Item Size    | B-KB                        | KB-MB                  | KB-TB           |
| Latency      | Microseconds - milliseconds | Milliseconds - seconds | Minutes - hours |
| Durability   | Low                         | High                   | Very high       |
| Request Rate | Very high                   | High                   | Low             |
| Cost/GB      | $$-$                        | $-cc                   | c               |

Analytics Pipeline


The analytics pipeline can be defined in terms of six steps - Collect, Store, Process, Store, Analyze, Answer. Let us now look at each of these steps, the different AWS services available for it, and the relevance of each.

Collect Data


Data input is classified into three types of sources:

  • Web/Mobile Apps, Data Centers: Such data is generally structured and transactional in nature. It can be pushed to and stored in SQL/NoSQL databases. One could also use in-memory databases like Redis.
  • Migration, Logs: These are files - typically media or log files. They can be huge and need to be saved in bulk. S3 is the ideal solution for such data.
  • IoT, Device Sensors: This is typically streaming data, pouring in as data streams. It is managed with events, pushed into stream storage like Kafka, Kinesis Streams or Kinesis Firehose. Kafka is ideal as a high-throughput distributed platform, Kinesis Streams provides managed stream storage, and Kinesis Firehose is good for managed data delivery. A small ingestion sketch follows this list.
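
As a small illustration of the streaming ingestion path, here is a sketch of pushing one sensor reading into a Kinesis stream with boto3. The stream name and record fields are assumptions made for the example.

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Hypothetical sensor payload; in practice this would come from the device.
record = {"device_id": "sensor-42", "temperature": 21.7, "ts": "2024-01-01T00:00:00Z"}

response = kinesis.put_record(
    StreamName="iot-sensor-stream",           # assumed stream name
    Data=json.dumps(record).encode("utf-8"),  # payload must be bytes
    PartitionKey=record["device_id"],         # distributes records across shards
)
print(response["SequenceNumber"])
```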

Store


The next step is to store the data. AWS provides a wide range of storage options, each with its own pros and cons for a given use case.

S3 is perhaps the most popular of the lot.

  • It is natively supported by big data frameworks (Spark, Hive, Presto, and others)
  • It decouples storage and compute. There is no need to run compute clusters just for storage (unlike HDFS). We can run transient EMR clusters on EC2 Spot instances, and multiple heterogeneous analysis clusters and services can use the same data
  • S3 provides very high durability - eleven nines (99.999999999%)
  • S3 is very cost effective when used within a region. There is no need to pay for data replication within a region.
  • And above all, S3 is secure. It provides SSL/TLS encryption in transit as well as encryption at rest. (A minimal usage sketch follows this list.)
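
To make the decoupling of storage and compute concrete, here is a minimal boto3 sketch of writing an object to S3 and reading it back; the bucket and key names are assumptions.

```python
import boto3

s3 = boto3.client("s3")

# Any producer can drop raw data into the bucket...
s3.put_object(
    Bucket="analytics-data-lake",      # assumed bucket name
    Key="raw/logs/2024-01-01.json",
    Body=b'{"event": "login", "user": "alice"}',
)

# ...and any number of independent consumers can read it later.
obj = s3.get_object(Bucket="analytics-data-lake", Key="raw/logs/2024-01-01.json")
print(obj["Body"].read().decode("utf-8"))
```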

Apart from S3, AWS also provides several types of databases - managed as well as serverless - to store our data.

  • ElastiCache - Managed Memcached or Redis service
  • DynamoDB - Managed Key-Value / Document DB
  • DynamoDB Accelerator (DAX) - Managed in memory cache for DynamoDB
  • Neptune - Managed Graph DB
  • RDS - Managed Relational Database

With such a wide range of solutions available, the natural question is: which one should I use? AWS recommends the criteria below for identifying the right solution for our problem. The volume, variety and velocity of the data, along with the access patterns, are the primary points to consider in this analysis.

  • Relational Databases provide strong referential integrity with strongly consistent transactions and hardened scale. We can make complex queries via SQL.
  • Key-Value Databases are useful for achieving low latency. They provide key-based queries with high throughput and fast data ingestion. We can make simple queries with filters (see the sketch after this list).
  • Document Databases are useful for indexing and storing documents. They provide support for querying on any property. We can make simple queries with filters, projections and aggregates.
  • In-memory databases and caches provide microsecond latency. We can make key-based queries. They provide specialized data structures and simple query methods with filters.
  • Graph Databases are useful for creating and navigating relations between data. We can easily express queries in terms of relations.
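
As a small illustration of the key-value pattern described above, here is a sketch of a write and a key-based read against DynamoDB with boto3; the table name and attributes are assumptions.

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("user-sessions")  # assumed table with partition key "user_id"

# Write an item keyed by user_id...
table.put_item(Item={"user_id": "alice", "last_login": "2024-01-01T00:00:00Z", "visits": 17})

# ...and fetch it back with a simple key-based lookup.
response = table.get_item(Key={"user_id": "alice"})
print(response.get("Item"))
```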

We can summarize this in the two tables below. First, based on the data structure:

| Data Structure | Database         |
|----------------|------------------|
| Fixed schema   | SQL, NoSQL       |
| No schema      | NoSQL, Search    |
| Key-value      | In-memory, NoSQL |
| Graph          | GraphDB          |

And based on the data access patterns:

| Data Access Pattern             | Database         |
|---------------------------------|------------------|
| Put/Get (key-value)             | In-memory, NoSQL |
| Simple relationships (1:N, M:N) | NoSQL            |
| Multi-table joins, transactions | SQL              |
| Faceting, search                | Search           |
| Graph traversal                 | GraphDB          |

The database choice based on the data may not always match that based on the data access patterns. In such a case, the more prominent of the two should be used.

Based on the use case, we can choose a particular database using the table below:

| | ElastiCache | DAX | Aurora | RDS | Elasticsearch | Neptune | S3 + Glacier |
|---|---|---|---|---|---|---|---|
| Use Cases | In-memory caching | Key/value lookups, document store | OLTP, transactional | OLTP, transactional | Log analysis, reverse indexing | Graph | File store |
| Performance | Ultra high request rate, ultra low latency | Ultra high request rate, ultra low latency | Very high request rate, low latency | High request rate, low latency | Medium request rate, low latency | Medium request rate, low latency | High throughput |
| Data Shape | Key/value | Key/value and document | Relational | Relational | Documents | Nodes/edges | Files |
| Data Size | GB | TB, PB (no limits) | GB, mid TB | GB, low TB | GB, TB | GB, mid TB | GB, TB, PB, EB (no limits) |
| Cost/GB | $$ | cc-$$ | cc | cc | cc | cc | cc |
| Availability | 2 AZ | 3 AZ | 3 AZ | 2 AZ | 1-2 AZ | 3 AZ | 3 AZ |
| VPC Support | Inside VPC | VPC endpoint | Inside VPC | Inside VPC | Inside VPC | Inside VPC | VPC endpoint |

Process


The next step in the pipeline is to process the data. Here too, AWS provides us a wide range of options to process the data available to us.

We have three major use cases when processing big data:

Interactive & Batch Processing


When working on interactive or batch analytics, we are generally dealing with the cooler end of the data-temperature scale. One might expect interactive analytics to be hot, but the data volume of an interactive session is small enough that it is not considered hot, and a response that feels quick to a user is not really fast from a data analytics perspective. For such use cases, AWS recommends one of these services:

  • Amazon Elasticsearch Service - Managed service for Elasticsearch
  • Redshift & Redshift Spectrum - Managed data warehouse, Spectrum enables querying S3
  • Athena - Serverless interactive query service
  • EMR - Managed Hadoop framework for running Apache Spark, Flink, Presto, Tez, Hive, Pig, HBase and others
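
For instance, a serverless interactive query against data sitting in S3 can be kicked off with a few lines of boto3 against Athena; the database, table and result location below are assumptions made for the sketch.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Start a query against a (hypothetical) Athena table backed by S3.
response = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) AS hits FROM web_logs GROUP BY status",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print("Query started:", response["QueryExecutionId"])
```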

Streaming and Realtime Analytics


On the other hand, when data is streaming in (e.g. from IoT devices and sensors) and we need to process it in real time, we have to consider a different set of processing services.

  • Spark Streaming on EMR
  • Kinesis Data Analytics - Managed service for running SQL on Streaming Data
  • Kinesis Client Library
  • Lambda - Run code serverless. Services such as S3 can publish events to Lambda, and Lambda can poll events from a Kinesis stream (a handler sketch follows the comparison table below)

| | EMR (Spark Streaming) | KCL Application | Kinesis Data Analytics | Lambda |
|---|---|---|---|---|
| Managed Service | Yes | No | Yes | Yes |
| Serverless | No | No | Yes | Yes |
| Scale / Throughput | No limits, depends on number of nodes | No limits, depends on number of nodes | No limits, scales automatically | No limits, scales automatically |
| Availability | Single AZ | Multi-AZ | Multi-AZ | Multi-AZ |
| Sliding Window Functions | Built-in | App needs to implement | Built-in | No |
| Reliability | Spark checkpoints | KCL checkpoints | Managed by Kinesis Data Analytics | Managed by Lambda |
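
As a sketch of the Lambda option from the comparison above, the handler below consumes a batch of Kinesis records; the payload fields and alerting threshold are assumptions for the example.

```python
import base64
import json

def handler(event, context):
    """Hypothetical Lambda handler wired to a Kinesis event source mapping."""
    alerts = 0
    for record in event["Records"]:
        # Kinesis record payloads arrive base64-encoded.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        if payload.get("temperature", 0) > 80:   # assumed alerting rule
            alerts += 1
            print(f"High temperature from {payload.get('device_id')}")
    return {"processed": len(event["Records"]), "alerts": alerts}
```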

Predictive Analysis


Any of the above use cases could require predictive analysis based on the data provided. AWS provides a wide range of AI services that can be leveraged at different levels.

  • Application Services: High-level SaaS-style services like Rekognition, Comprehend, Transcribe, Polly, Translate and Lex can process the input data and provide the output in a single service call (a sketch follows this list).
  • Platform Services: AWS provides platforms that help us build our own AI models - Amazon SageMaker, Amazon Mechanical Turk, the Deep Learning AMIs. These help us create tailored models that can process the input data and provide good insights. Any of the generic AI frameworks like TensorFlow, PyTorch or Caffe2 can power these platforms.
  • Infrastructure: At the lowest level, AWS also lets us choose the hardware beneath the containers or EC2 instances that support the platform - NVIDIA Tesla V100 GPU-accelerated instances for AI/ML training, compute-intensive instances for AI/ML inference, Greengrass ML, and more.
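
As an example of the application-services level, a single call to Comprehend returns the sentiment of a piece of text; the sample text is, of course, an assumption.

```python
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")

# One service call: no model to train, no infrastructure to manage.
result = comprehend.detect_sentiment(
    Text="The delivery was late and the package arrived damaged.",
    LanguageCode="en",
)
print(result["Sentiment"], result["SentimentScore"])
```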

Naturally, the question is: Which analytics should I use?

  • Batch - takes minutes to hours - e.g. daily, weekly or monthly reports - EMR (MapReduce, Hive, Pig, Spark)
  • Interactive - takes seconds - e.g. self-service dashboards - Redshift, Athena, EMR (Presto, Spark)
  • Stream - milliseconds to seconds - e.g. fraud alerts, one-minute metrics - EMR (Spark Streaming), Kinesis Data Analytics, KCL, Lambda and others
  • Predictive - milliseconds (real time) to minutes (batch) - e.g. fraud detection, demand forecasting, speech recognition - SageMaker, Polly, Rekognition, Transcribe, Translate, EMR (Spark ML), Deep Learning AMI (MXNet, TensorFlow, Theano, Torch, CNTK, Caffe2)

Analysis


Finally, we get to analyzing the data we have gathered. The first step here is to prepare the data for consumption, which is done using ETL/ELT. AWS provides a variety of tools for ETL/ELT. The table below gives a top-level view of these services and their implications.

| | Glue ETL | Data Pipeline | Database Migration Service | EMR | Apache NiFi | Partner Solutions |
|---|---|---|---|---|---|---|
| Use Case | Serverless ETL | Data workflow | Migrate databases (to/from data lakes) | Custom-developed Hadoop/Spark ETL | Automate the flow of data between systems | Rich partner ecosystem for ETL |
| Scale / Throughput | ~DPUs | ~Nodes, through EMR cluster | EC2 instance type | ~Nodes | Self managed | Self managed or through partner |
| Managed Service | Clusterless | Managed | Managed EC2 on your behalf | Managed EC2 on your behalf | Self managed on EMR or Marketplace | Self managed or through partner |
| Data Sources | S3, RDBMS, Redshift, DynamoDB | S3, JDBC, custom | RDBMS, data warehouses, S3 | Various (managed Hadoop/Spark) | Various, through rich processor framework | Various |
| Skills Needed | Wizard for simple mapping, code snippets for advanced ETL | Wizard and code snippets | Wizard and drag/drop | Hadoop/Spark coding | NiFi processors and some coding | Self managed or through partner |
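
As a rough sketch of what such an ETL step might look like on EMR (or on Glue's Spark runtime), here is a minimal PySpark job that reads raw JSON events from S3, cleans them and writes partitioned Parquet back; the bucket paths and column names are assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("clickstream-etl").getOrCreate()

# Read raw events from the event-journal bucket (assumed path).
raw = spark.read.json("s3://analytics-event-journal/raw/clicks/")

clean = (
    raw.filter(F.col("user_id").isNotNull())               # drop incomplete records
       .withColumn("event_date", F.to_date("timestamp"))   # derive a partition column
)

# Write back to S3 in a columnar format, partitioned for downstream analytics.
clean.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3://analytics-curated/clicks/"
)
```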

Consume


Finally, this data is consumed by services that can provide meaningful insights based on what has been processed. These could be AI services that use the data to generate a decision, or tools that present the insights back to us in a friendly format. Thus, the consuming service could be one of the AI applications, Jupyter, Anaconda, R Studio, Kibana, QuickSight, Tableau, Looker, MicroStrategy, Qlik, etc.
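
As a tiny example of notebook-style consumption, the curated Parquet data from the previous step could be pulled straight into a Jupyter session with pandas (this assumes the s3fs package is installed so pandas can read s3:// paths, and reuses the hypothetical path from the ETL sketch).

```python
import pandas as pd

# Load the curated dataset written by the ETL step (assumed path).
df = pd.read_parquet("s3://analytics-curated/clicks/")

# A quick aggregation that might feed a dashboard or report.
print(df.groupby("event_date").size())
```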

Sum Up


The following diagram sums up the entire process of data analytics, along with the various choices available to us.

One can choose the services to be used based on the temperature of the data:

Sample Architecture


Let us now look at a sample architecture for a realtime streaming analytics pipeline.

This uses a variety of services for processing and storing the data. As the data stream is gathered, it first goes through Kinesis Data Analytics for initial processing. From there, it is fed into streaming data processing by different applications, which extract and classify different aspects of the data. This, in turn, is fed into the AI services to make any necessary real-time predictions.

The rest is stored in a variety of data storage services, based on the type of data extracted and segregated from the input stream. This is finally used to generate notifications and insights. The purified data stream is forwarded to any other downstream application that might want to process it.