
Big Data with AWS Series - Data Ingestion


Data ingestion is the process of transporting data from one or more sources to a target site for further processing and analysis. This data can originate from a range of sources, including data lakes, IoT devices, on-premises databases, and SaaS apps, and end up in different target environments, such as cloud data warehouses or data marts.

Now, let us look at a few important concepts in data ingestion:


1 . AWS IoT :

AWS IoT provides the cloud services that connect your IoT devices to other devices and to AWS cloud services. It also provides device software that helps you integrate your IoT devices into AWS IoT-based solutions. If your devices can connect to AWS IoT, AWS IoT can connect them to the cloud services that AWS provides.


Features :

AWS IoT Device SDK :
The AWS IoT Device SDK helps you connect your hardware device or mobile application to AWS IoT Core. It enables your devices to connect, authenticate, and exchange messages with AWS IoT Core using the MQTT, HTTP, or WebSockets protocols.

Device Advisor :
Device Advisor is a fully managed, cloud-based test capability for validating IoT devices during development. It provides pre-built tests that help developers check whether their IoT devices connect reliably and securely to AWS IoT Core, interoperate with it, and follow security best practices.

Device Gateway :
The Device Gateway serves as the entry point for IoT devices connecting to AWS. It manages all active device connections and implements semantics for multiple protocols to ensure that devices are able to securely and efficiently communicate with AWS IoT Core. Currently the Device Gateway supports the MQTT, WebSockets, and HTTP 1.1 protocols.

Message Broker :
The Message Broker is a high-throughput pub/sub message broker that securely transmits messages to and from all of your IoT devices and applications with low latency. Its flexible nature allows you to send messages to, or receive messages from, as many devices as you would like.
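
To make the pub/sub idea concrete, here is a minimal in-memory sketch of how a broker could route a published message to subscribers. It is not AWS IoT code: `topic_matches` and `InMemoryBroker` are hypothetical names, and the matching follows the standard MQTT wildcard rules (`+` for one topic level, `#` for all remaining levels) in simplified form.

```python
def topic_matches(topic_filter, topic):
    """Return True if `topic` matches an MQTT-style subscription filter.

    '+' matches exactly one level; '#' matches all remaining levels
    and is only valid as the last level of the filter.
    """
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            # Multi-level wildcard: valid only in the final position.
            return i == len(filter_levels) - 1
        if i >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[i]:
            return False
    return len(filter_levels) == len(topic_levels)


class InMemoryBroker:
    """Toy broker (illustration only): keeps subscriptions in a list."""

    def __init__(self):
        self._subscriptions = []  # (topic_filter, callback) pairs

    def subscribe(self, topic_filter, callback):
        self._subscriptions.append((topic_filter, callback))

    def publish(self, topic, payload):
        """Deliver payload to every matching subscriber; return the count."""
        delivered = 0
        for topic_filter, callback in self._subscriptions:
            if topic_matches(topic_filter, topic):
                callback(topic, payload)
                delivered += 1
        return delivered
```

A device publishing to a topic such as `sensors/dev1/temperature` would reach both a subscriber on `sensors/+/temperature` and one on `sensors/#`, without the publisher knowing who is listening.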

Registry :
The Registry establishes an identity for devices and tracks metadata such as the devices' attributes and capabilities. The Registry assigns a unique identity to each device that is consistently formatted regardless of the type of device or how it connects.

Advantages :


2 . Lambda Function :

AWS Lambda is a compute service that lets you run code without provisioning or managing servers. You can create Lambda functions and add them as actions in your pipelines. Because Lambda allows you to write functions to perform almost any task, you can customize the way your pipeline works.
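
As an illustration, a Python Lambda function is a handler that receives an event and a context object. The sketch below assumes a hypothetical event shape with a `readings` list; a real event depends on whichever service invokes the function.

```python
import json


def lambda_handler(event, context):
    """Entry point that Lambda calls once per invocation.

    Assumption: the event carries a 'readings' list of numbers.
    Non-numeric entries are ignored rather than raising an error.
    """
    readings = event.get("readings", [])
    valid = [r for r in readings if isinstance(r, (int, float))]
    summary = {
        "count": len(valid),
        "max": max(valid) if valid else None,
    }
    return {"statusCode": 200, "body": json.dumps(summary)}
```

Because the handler is a plain function, it can be called locally with a sample event (and `None` for the context) before it is deployed.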


Scaling : Lambda manages the infrastructure that runs your code, and scales automatically in response to incoming requests.

Concurrency Controls : Use concurrency settings to ensure that your production applications are highly available and highly responsive. To prevent a function from using too much concurrency, and to reserve a portion of your account's available concurrency for it, use reserved concurrency. Reserved concurrency splits the pool of available concurrency into subsets. A function with reserved concurrency only uses concurrency from its dedicated pool.

Function URLs : Lambda offers built-in HTTP(S) endpoint support through function URLs. With function URLs, you can assign a dedicated HTTP endpoint to your Lambda function.

Asynchronous Invocation : When you invoke a function asynchronously, Lambda queues the event for processing and returns a response immediately, without waiting for the function to finish.
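
The concurrency and asynchronous invocation features above can be sketched with the boto3 Lambda client. The helper names below are hypothetical, and the client is passed in as a parameter (for example `boto3.client("lambda")`, which is assumed to be installed and configured) so that the code can also be tried against a stub.

```python
import json


def unreserved_pool(account_limit, reservations):
    """Concurrency left for functions without a reservation.

    `reservations` maps function name -> reserved concurrency.
    """
    return account_limit - sum(reservations.values())


def reserve_concurrency(lambda_client, function_name, reserved):
    """Dedicate `reserved` concurrent executions to one function."""
    return lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=reserved,
    )


def invoke_async(lambda_client, function_name, payload):
    """Queue an event for the function and return without waiting.

    InvocationType="Event" asks Lambda for asynchronous invocation;
    the returned status code is 202 when the event was accepted.
    """
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="Event",
        Payload=json.dumps(payload).encode("utf-8"),
    )
    return response["StatusCode"]
```

For example, with an account limit of 1000 and reservations of 100 and 50, `unreserved_pool` reports 850 units left for all other functions.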

Advantages :


3 . Kinesis Data Streams :

Kinesis Data Streams is a serverless streaming data service that makes it easy to capture, process, and store data streams at any scale.


Serverless : There are no servers to manage with Amazon Kinesis Data Streams. The on-demand mode further removes the need to provision or manage throughput by automatically scaling capacity when there is an increase in workload traffic. You can get started with Kinesis Data Streams with a few clicks from the AWS Management Console.

Highly available and durable : Synchronously replicate your streaming data across three Availability Zones (AZs) in an AWS Region, and store that data for up to 365 days to provide multiple layers of data loss protection.

Low latency : Make your streaming data available to multiple real-time analytics applications, to Amazon Kinesis Data Analytics, or to AWS Lambda within 70 milliseconds of being collected.

Dedicated throughput per consumer : With enhanced fan-out, you can attach up to 20 consumers to your Kinesis data stream, each with its own dedicated read throughput.
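
As a rough sketch of the producer side, the function below writes one record to a stream through the boto3 Kinesis client. The function name, the stream name, and the choice of the device ID as partition key are assumptions for illustration; the client is passed in (for example `boto3.client("kinesis")`) so that the code can be exercised with a stub.

```python
import json


def put_reading(kinesis_client, stream_name, device_id, reading):
    """Write one JSON-encoded reading to a Kinesis data stream.

    Records with the same partition key go to the same shard, so using
    the device ID keeps each device's readings in order.
    """
    return kinesis_client.put_record(
        StreamName=stream_name,
        Data=json.dumps(reading).encode("utf-8"),
        PartitionKey=device_id,
    )
```

The response from a real client includes the shard ID and sequence number assigned to the record.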

Advantages :

Applications: Although you can use Kinesis Data Streams to solve a variety of streaming data problems, a common use is the real-time aggregation of data followed by loading the aggregate data into a data warehouse or map-reduce cluster. Data is first put into Kinesis data streams, which ensures durability and elasticity. The delay between the time a record is put into the stream and the time it can be retrieved (put-to-get delay) is typically less than 1 second, so a Kinesis Data Streams application can start consuming data from the stream almost immediately after it is added. Because Kinesis Data Streams is a managed service, it relieves you of the operational burden of creating and running a data intake pipeline. You can create streaming map-reduce-type applications, and the elasticity of Kinesis Data Streams lets you scale the stream up or down so that you never lose data records before they expire.
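
The real-time aggregation step described above could look like the following sketch, written as the body of a Lambda function that consumes a Kinesis stream. It relies on the Kinesis event format that Lambda delivers, where each record's payload is base64-encoded under `kinesis.data`. The `value` field inside the payload and the per-partition-key grouping are assumptions for illustration.

```python
import base64
import json
from collections import defaultdict


def aggregate_records(event):
    """Compute count and mean of 'value' per partition key for one batch."""
    totals = defaultdict(lambda: {"count": 0, "sum": 0.0})
    for record in event.get("Records", []):
        kinesis = record["kinesis"]
        # Kinesis delivers the payload base64-encoded inside the event.
        payload = json.loads(base64.b64decode(kinesis["data"]))
        bucket = totals[kinesis["partitionKey"]]
        bucket["count"] += 1
        bucket["sum"] += float(payload["value"])
    return {
        key: {"count": b["count"], "mean": b["sum"] / b["count"]}
        for key, b in totals.items()
    }
```

The aggregated result for each batch could then be loaded into a data warehouse, which is the second half of the pattern described above.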