Top AWS Services for Data Engineers

Deepanshu tyagi
6 min read · Jul 16, 2022


AWS offers a large number of services, and you do not need all of them to work as a data engineer. Beginners are often unsure which services to learn first, so in this blog we are going to discuss the ones a data engineer actually needs.

Concepts a data engineer focuses on:

  1. Data sources
  2. Data Ingestion
  3. Application Integration
  4. Data Lake
  5. Data Warehouse
  6. External Transformation and Business Rules
  7. Data Analytics
  8. Monitoring

Data Sources : Where the data comes from, for example Amazon RDS or Amazon DynamoDB.

  • AWS RDS : Amazon Relational Database Service (Amazon RDS) is a managed relational database service by Amazon Web Services. It provides affordable relational databases in the cloud and is easy to use.
https://aws.amazon.com/
  • AWS DynamoDB : Amazon DynamoDB is a fully managed NoSQL database service provided by AWS. It supports key-value and document data models.
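
To make the key-value model more concrete, here is a minimal sketch of how a record might look in DynamoDB's low-level attribute-value format, where every value is wrapped in a type tag such as `S` (string), `N` (number) or `BOOL`. The helper `to_dynamodb_item` and the sample order are hypothetical and only illustrate the shape of the data; boto3 ships its own serializer for real use.

```python
def to_dynamodb_item(record):
    """Convert a flat dict into DynamoDB's typed attribute-value format.

    Simplified sketch: handles only str, bool, int and float values.
    """
    item = {}
    for name, value in record.items():
        if isinstance(value, bool):  # check bool first: bool is a subclass of int
            item[name] = {"BOOL": value}
        elif isinstance(value, (int, float)):
            item[name] = {"N": str(value)}  # numbers are transmitted as strings
        elif isinstance(value, str):
            item[name] = {"S": value}
        else:
            raise TypeError(f"unsupported type for {name!r}: {type(value).__name__}")
    return item


# Hypothetical record; "order_id" would act as the partition key.
order = {"order_id": "A-1001", "amount": 250, "paid": True}
item = to_dynamodb_item(order)
print(item)
```

The partition key (`order_id` in this made-up example) is what DynamoDB uses to look an item up, which is why it is described as a key-value store.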

Data Ingestion : How data is fed into the system.

It can be fed in as:

  • Batch
  • Streaming

Batch : Batch processing means processing a large amount of data at once.

There are three services that will help you ingest batch data.

  • AWS Lambda : AWS Lambda is an event-driven, serverless computing platform provided by Amazon as a part of Amazon Web Services. It is a computing service that runs code in response to events and automatically manages the computing resources required by that code.
  • AWS Glue : AWS Glue is a scalable, serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development.
  • Amazon EMR: Amazon EMR is a web service that makes it easy to process vast amounts of data efficiently using Apache Hadoop and services offered by Amazon Web Services.
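
As a sketch of event-driven batch ingestion, the handler below shows what a Lambda function triggered by an S3 "object created" notification might do first: work out which files have arrived. The event layout follows the S3 notification format; the bucket and file names in the sample event are made up.

```python
from urllib.parse import unquote_plus


def lambda_handler(event, context):
    """Collect the S3 objects referenced by an S3 notification event."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded (a space is sent as "+").
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append({"bucket": bucket, "key": key})
    # A real job would now read each object and load it downstream.
    return {"objects": objects}


# Local usage with a hand-written sample event (hypothetical names).
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "example-raw-bucket"},
                "object": {"key": "daily/sales+2022-07-16.csv"},
            }
        }
    ]
}
result = lambda_handler(sample_event, None)
print(result)
```

Because the handler is a plain Python function, it can be tried locally with a sample event like this before it is deployed.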

Streaming : Stream processing means processing data that is generated continuously by various sources, in real time.

There are three services that will help you ingest streaming data.

  • Amazon Kinesis Data Streams: You can use Amazon Kinesis Data Streams to collect and process large streams of data records in real time.
  • Amazon Kinesis Data Analytics: Amazon Kinesis Data Analytics enables you to quickly author SQL code that continuously reads, processes, and stores data in near real time.
  • Amazon Kinesis Data Firehose: Amazon Kinesis Data Firehose is a fully managed service for delivering real-time streaming data to destinations such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Amazon OpenSearch Service, Splunk, and any custom HTTP endpoint or HTTP endpoints owned by supported third-party service providers, including Datadog, Dynatrace, LogicMonitor, MongoDB, New Relic, and Sumo Logic.
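
To illustrate what "streaming records" look like in practice, here is a small sketch that turns application events into the entry format used by the Kinesis Data Streams `PutRecords` call (a `Data` payload plus a `PartitionKey`). The function name, the click events and the default batch size of 500 are assumptions for illustration; check the current service quotas before relying on that number.

```python
import json


def to_kinesis_batches(events, partition_field, batch_size=500):
    """Group events into batches of PutRecords-style entries."""
    entries = [
        {
            "Data": json.dumps(event).encode("utf-8"),
            # Records that share a partition key are routed to the same shard.
            "PartitionKey": str(event[partition_field]),
        }
        for event in events
    ]
    return [entries[i:i + batch_size] for i in range(0, len(entries), batch_size)]


# Hypothetical click events keyed by user.
clicks = [{"user_id": n % 3, "page": "/home"} for n in range(1200)]
batches = to_kinesis_batches(clicks, "user_id")
print([len(batch) for batch in batches])  # [500, 500, 200]
```

With boto3 installed and credentials configured, each batch could then be sent with `boto3.client("kinesis").put_records(StreamName=..., Records=batch)`, where the stream name is whatever you created.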

Data Lake : Where all types of data are stored.

  • AWS S3 Storage : Amazon Simple Storage Service (Amazon S3) is the cloud object storage provided by Amazon. It can store all types of data, which makes it a common foundation for a data lake. You can upload an unlimited amount of data to an S3 bucket.
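
Objects in a bucket are addressed by a key, so a data lake is usually organized through key prefixes. The sketch below builds such a key; the zone and dataset names and the Hive-style `year=/month=/day=` layout are a common convention assumed here for illustration, not something S3 requires.

```python
from datetime import date


def data_lake_key(zone, dataset, day, filename):
    """Build an S3 object key with Hive-style date partitions."""
    return (
        f"{zone}/{dataset}/"
        f"year={day.year}/month={day.month:02d}/day={day.day:02d}/"
        f"{filename}"
    )


key = data_lake_key("raw", "orders", date(2022, 7, 16), "orders.json")
print(key)  # raw/orders/year=2022/month=07/day=16/orders.json
```

A layout like this lets query engines skip the prefixes they do not need when a query filters on the date.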

External Transformation & Business Rules : In this phase you build a pipeline that transforms the data according to the business rules and saves the result back to the data lake or to a data warehouse.

Three services are used to transform the data (all of them are explained in the batch section above).

  • AWS Glue
  • Amazon EMR
  • AWS Lambda
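
Whichever of these services runs it, the transformation step boils down to a function from raw records to cleaned records. The sketch below shows that idea in plain Python; the three rules, the field names and the threshold are hypothetical examples of business rules, not rules from any real pipeline.

```python
def apply_business_rules(rows, min_amount=0.0):
    """Drop invalid rows, normalize IDs and add a derived column."""
    cleaned = []
    for row in rows:
        amount = float(row["amount"])
        if amount <= min_amount:
            continue  # rule 1: ignore zero or negative orders
        cleaned.append({
            "order_id": row["order_id"].strip().upper(),  # rule 2: normalize IDs
            "amount": amount,
            "is_large": amount >= 1000,  # rule 3: flag large orders
        })
    return cleaned


raw_rows = [
    {"order_id": " a-1 ", "amount": "1500"},
    {"order_id": "a-2", "amount": "-20"},
]
curated = apply_business_rules(raw_rows)
print(curated)
```

In a real job, the `curated` rows would then be written back to the data lake or loaded into the data warehouse.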

Data Warehouse : A data warehouse is a type of data management system that is designed to enable and support business intelligence (BI) activities, especially analytics.

AWS offers one dedicated data warehouse service, and it is a powerful one.

  • Amazon Redshift: Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to efficiently analyze all your data using your existing business intelligence tools. It is optimized for datasets ranging from a few hundred gigabytes to a petabyte or more and costs less than $1,000 per terabyte per year, a tenth the cost of most traditional data warehousing solutions.
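
A common way to get data from the data lake into Redshift is the SQL `COPY` command. The sketch below only builds the statement as a string; the table name, S3 path and IAM role ARN are placeholders, and the values are assumed to come from trusted configuration because they are interpolated directly into the SQL.

```python
def build_copy_statement(table, s3_path, iam_role_arn):
    """Return a Redshift COPY statement that loads Parquet files from S3."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET;"
    )


sql = build_copy_statement(
    "sales.orders",  # hypothetical target table
    "s3://example-curated-bucket/orders/",  # hypothetical S3 prefix
    "arn:aws:iam::123456789012:role/ExampleRedshiftRole",  # placeholder role
)
print(sql)
```

The printed statement can be run from any SQL client connected to the cluster.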

Data Analytics: Data analytics (DA) is the process of examining data sets in order to find trends and draw conclusions about the information they contain.

There are two services that will help in this:

  • Amazon Athena: Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.
  • Amazon QuickSight : Amazon QuickSight allows everyone in your organization to understand your data by asking questions in natural language, exploring through interactive dashboards, or automatically looking for patterns and outliers powered by machine learning.
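
For a feel of how an Athena query is submitted from code, this sketch assembles the arguments for Athena's `StartQueryExecution` call. The helper name, database, table and result bucket are hypothetical; the snippet only builds the request, so it runs without AWS access.

```python
def athena_query_request(sql, database, output_location):
    """Build keyword arguments for Athena's StartQueryExecution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results as files to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


request = athena_query_request(
    "SELECT order_id, amount FROM orders WHERE amount >= 1000 LIMIT 10",
    database="example_lake_db",
    output_location="s3://example-athena-results/",
)
print(request)
# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("athena").start_query_execution(**request)
```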

Application Integration: Application integration is the merging and optimization of data and workflows between two disparate software applications.

The following services will help in this:

  • Amazon EventBridge : Amazon EventBridge is a serverless event bus that makes it easier to build event-driven applications at scale using events generated from your applications, integrated Software-as-a-Service (SaaS) applications, and AWS services. EventBridge delivers a stream of real-time data from event sources such as Zendesk or Shopify to targets like AWS Lambda and other SaaS applications.
  • Amazon SNS : Amazon Simple Notification Service (Amazon SNS) is a fully managed messaging service for both application-to-application (A2A) and application-to-person (A2P) communication.
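
As a small illustration of event-driven integration, the sketch below builds one entry in the shape expected by EventBridge's `PutEvents` call. The source name, detail type and payload are made-up examples; the helper `build_event_entry` is hypothetical.

```python
import json


def build_event_entry(source, detail_type, detail, bus_name="default"):
    """Build one PutEvents entry for a custom application event."""
    return {
        "Source": source,
        "DetailType": detail_type,
        "Detail": json.dumps(detail),  # the detail must be passed as a JSON string
        "EventBusName": bus_name,
    }


entry = build_event_entry(
    "example.orders",  # hypothetical event source
    "OrderPlaced",
    {"order_id": "A-1001", "amount": 250},
)
print(entry)
# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("events").put_events(Entries=[entry])
```

A rule on the event bus can then match on `Source` or `DetailType` and route the event to a target such as a Lambda function or an SNS topic.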

Monitoring: This means keeping track of all your infrastructure, applications, and services.

  • Amazon CloudWatch: Amazon CloudWatch monitors your Amazon Web Services (AWS) resources and the applications you run on AWS in real time. You can use CloudWatch to collect and track metrics, which are variables you can measure for your resources and applications.
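
Besides the metrics AWS publishes on its own, a pipeline can report custom metrics. The sketch below builds one data point in the shape used by CloudWatch's `PutMetricData` call; the metric name, dimension and namespace are hypothetical.

```python
def build_metric(name, value, unit="Count", dimensions=None):
    """Build one MetricData entry for a custom CloudWatch metric."""
    return {
        "MetricName": name,
        "Value": float(value),
        "Unit": unit,
        "Dimensions": [
            {"Name": key, "Value": val} for key, val in (dimensions or {}).items()
        ],
    }


metric = build_metric("RowsLoaded", 1200, dimensions={"Pipeline": "orders_daily"})
print(metric)
# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("cloudwatch").put_metric_data(
#       Namespace="Example/DataPipeline", MetricData=[metric])
```

An alarm on a metric like this could, for example, notify you when a daily load writes far fewer rows than usual.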
