Apache Airflow is a popular open-source tool that helps teams create, schedule, and monitor sequences of tasks, known as “workflows.”
In particular, data engineers use Airflow to manage many of their data pipelines because it makes it easy to deploy and manage complex tasks in what is often referred to as a DAG (which we will discuss shortly).
Managed Workflows for Apache Airflow, abbreviated as MWAA, is a managed AWS service for running Apache Airflow as an orchestrator. With MWAA, users can easily operate data pipelines at scale, setting up and managing them from end to end.
What isn’t always discussed in articles about Airflow is the struggle of managing Airflow systems. They can be challenging to scale, they can overload VMs, and the scheduler can get stuck.
When Airflow isn’t managed by a service like MWAA or Astronomer.io, it can create a lot of extra work.
In this community update, we want to discuss MWAA and its benefits so you can see whether it’s the right choice for your team.
What Is Apache Airflow?
Apache Airflow is a powerful platform that helps teams manage their workflows programmatically. With Airflow, sequences of tasks are effortlessly turned into Directed Acyclic Graphs (DAGs). Using the Airflow scheduler, users can set up dependencies to delegate tasks automatically to workers, ensuring reliable and fast execution.
Even when working with complex DAGs, changes are made easily thanks to a robust set of command line utilities. With an advanced user interface, visualizing pipelines is simple, allowing teams to monitor production, progress and problems easily. Because workflows are defined as code, it is simpler to maintain, version, test, and collaborate on them. All of that makes Apache Airflow a fantastic platform for workflow orchestration, and MWAA only makes it better.
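To make the DAG idea concrete, here is a minimal plain-Python sketch of tasks with dependencies and a valid run order. It uses the standard library's graphlib module rather than Airflow's own classes, and the task names are hypothetical, so treat it as an illustration of the concept and not as how Airflow's scheduler is implemented.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical task names; each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}


def execution_order(deps):
    """Return one valid run order, or raise CycleError if deps contain a cycle."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)

# "Acyclic" matters: two tasks that depend on each other can never be ordered.
try:
    execution_order({"a": {"b"}, "b": {"a"}})
except CycleError:
    print("cycle detected: not a valid DAG")
```

In Airflow itself you would declare the same relationships between operators inside a DAG definition, and the scheduler would work out when each task is ready to run.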
What Is Managed Workflows for Apache Airflow?
Managed Workflows lets developers quickly deploy an Airflow instance on AWS that uses a combination of other AWS services to optimize the overall setup.
Scalability, availability and security are handled by Managed Workflows’ automatic orchestration, which scales capacity to meet your needs with limited intervention.
This is unlike a more manual approach, which would be to use an EC2 instance to run DAGs that you store either in S3 or on the EC2 instance itself.
The managed service means that you no longer need to monitor and manually scale your Celery workers to meet your workflows’ demand.
MWAA is set up to use only the CeleryExecutor and, more importantly, has an autoscaling mechanism implemented under the hood. In my consulting work, I have come across plenty of projects where a client had a broken Airflow instance that likely wouldn’t have broken if they had been on MWAA.
Combined with integrated AWS security, this will allow teams to enjoy fast data access with complete peace of mind.
Now let’s take a closer look at some benefits that come along with using Managed Workflows for Apache Airflow.
Rapid Airflow Deployment at Scale
Deploying at scale is fast and easy. You can choose from the AWS Management Console, AWS CloudFormation, the AWS SDK or the CLI. Once you create an account, you can begin deploying DAGs (Directed Acyclic Graphs) directly to your Airflow environment. There’s no need to wait for someone to provision infrastructure or gather development resources.
Improved Logging
On one consulting project, I came across an Airflow scheduler that had stopped working. The reason?
Their logs were being stored locally, which caused the VM to crash.
One, you shouldn’t store your logs locally. Two, why should you even have to worry about where your logs are stored?
MWAA is configured to use CloudWatch for logging. Setting this up manually would be tedious. It would require a developer to configure a CloudWatch Agent on all instances to stream the logs to CloudWatch, and to ensure proper log groups exist for all components. That leaves a lot of room for error.
But what if it was just set up automatically?
Well, that’s what MWAA does.
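As a rough sketch of what that configuration looks like, the helper below builds a logging settings dictionary with one entry per Airflow component. The function name is hypothetical. The key names and log levels reflect my understanding of the LoggingConfiguration argument accepted by the MWAA API, so confirm them against the current AWS documentation before relying on them.

```python
# Component names and log levels as I understand the MWAA API to accept them;
# both are assumptions to confirm against the current documentation.
LOG_COMPONENTS = (
    "DagProcessingLogs",
    "SchedulerLogs",
    "TaskLogs",
    "WebserverLogs",
    "WorkerLogs",
)
VALID_LEVELS = {"CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG"}


def build_logging_configuration(default_level="INFO", overrides=None):
    """Build a LoggingConfiguration-style dict (hypothetical helper)."""
    overrides = overrides or {}
    unknown = set(overrides) - set(LOG_COMPONENTS)
    if unknown:
        raise ValueError(f"unknown log components: {sorted(unknown)}")
    config = {}
    for component in LOG_COMPONENTS:
        level = overrides.get(component, default_level)
        if level not in VALID_LEVELS:
            raise ValueError(f"invalid log level: {level}")
        # Every component is enabled; only the level differs per component.
        config[component] = {"Enabled": True, "LogLevel": level}
    return config


# Example: keep most components quiet, but capture task logs at INFO.
logging_config = build_logging_configuration(
    default_level="WARNING",
    overrides={"TaskLogs": "INFO"},
)
```

The point is less the helper itself and more the contrast: a handful of settings per component, instead of installing and maintaining a logging agent on every machine.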
Built-In Airflow Security
Security is a critical concern to modern businesses, but Managed Workflows gives teams peace of mind. With Managed Workflows, you can be confident knowing that your workloads are secure by default. All workloads run in an isolated and secure environment in the cloud, making use of Amazon Virtual Private Cloud (VPC). Additionally, all data is encrypted automatically with AWS Key Management Service (KMS).
If you want to control authentication and authorization based on user role, you can do so through AWS Identity and Access Management (IAM), which governs access to Apache Airflow’s user interface. Security is further assured by giving users Single-Sign-On (SSO) access when they need to schedule or view a workflow execution.
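For a sense of what role-based access looks like in practice, here is a small sketch that builds an IAM policy document granting one Airflow role on one environment. The helper name and the example values are hypothetical. The action name, the ARN layout and the list of Airflow roles reflect my understanding of the MWAA documentation, so verify them before using the policy.

```python
import json

# Default Airflow roles that an MWAA web login token can be issued for
# (an assumption based on my reading of the MWAA documentation).
AIRFLOW_ROLES = {"Admin", "Op", "User", "Viewer", "Public"}


def web_login_policy(region, account_id, environment_name, airflow_role):
    """Return an IAM policy dict for one Airflow role (hypothetical helper)."""
    if airflow_role not in AIRFLOW_ROLES:
        raise ValueError(f"unknown Airflow role: {airflow_role}")
    resource = (
        f"arn:aws:airflow:{region}:{account_id}"
        f":role/{environment_name}/{airflow_role}"
    )
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "airflow:CreateWebLoginToken",
                "Resource": resource,
            }
        ],
    }


# Placeholder account ID and environment name, for illustration only.
policy = web_login_policy("us-east-1", "123456789012", "my-environment", "Viewer")
print(json.dumps(policy, indent=2))
```

Attaching a policy like this to a user or group would let them open the Airflow UI with read-only access, while a separate policy could grant Admin to the people who maintain the environment.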
Reduced Operational Costs
As a managed service, Managed Workflows helps to cut back on the intensive manual labor that’s typically associated with running Apache Airflow at scale. By cutting back on the heavy lifting, Managed Workflows helps bring down engineering overhead and reduce overall operational costs through quicker deployment, less manual input and on-demand monitoring, all of which are necessary to orchestrate an end-to-end data pipeline well.
Choose The Best Plugin
Managed Workflows gives teams flexibility in that they can choose to use a pre-existing plugin or use their own. You can always connect to any AWS plugins available, but Managed Workflows also gives you the opportunity to use on-premises resources to run your workflows. Athena, Fargate, Lambda and Redshift are among the most popular examples, but you’ll also find Batch, CloudWatch, Firehose, SNS and countless others on the list.
How It Works
Once your team is set up with Apache Airflow, making the move to Managed Workflows for Apache Airflow should be straightforward. MWAA uses Python, the language in which Directed Acyclic Graphs (DAGs) are written. It’s those DAGs that help orchestrate and schedule the workflows your team creates.
To set up Managed Workflows, you simply need to assign an S3 bucket to it, which is where the Python dependencies list, DAGs and plugins will be stored. You can upload to the bucket manually or through a code pipeline in order to describe and automate your ETL and learning processes. Lastly, you can begin running and monitoring your DAGs using the Airflow user interface, the CLI, or the SDK.
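The setup described above can be sketched as a request that ties an environment to its S3 bucket. The helper name, the default paths and the example ARNs and IDs are all hypothetical. The parameter names follow my understanding of the create_environment call on boto3's MWAA client, so check them against the boto3 documentation before using them.

```python
def build_environment_request(
    name,
    bucket_arn,
    execution_role_arn,
    subnet_ids,
    security_group_ids,
    min_workers=1,
    max_workers=5,
):
    """Assemble create_environment-style parameters (hypothetical helper)."""
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    return {
        "Name": name,
        "SourceBucketArn": bucket_arn,
        # Paths are relative to the bucket; these names are illustrative.
        "DagS3Path": "dags",
        "RequirementsS3Path": "requirements.txt",
        "PluginsS3Path": "plugins.zip",
        "ExecutionRoleArn": execution_role_arn,
        "NetworkConfiguration": {
            "SubnetIds": list(subnet_ids),
            "SecurityGroupIds": list(security_group_ids),
        },
        # Bounds for the autoscaling of workers.
        "MinWorkers": min_workers,
        "MaxWorkers": max_workers,
    }


# Placeholder identifiers, for illustration only.
request = build_environment_request(
    name="my-environment",
    bucket_arn="arn:aws:s3:::my-airflow-bucket",
    execution_role_arn="arn:aws:iam::123456789012:role/my-mwaa-execution-role",
    subnet_ids=["subnet-aaaa1111", "subnet-bbbb2222"],
    security_group_ids=["sg-cccc3333"],
)
```

With boto3 installed and AWS credentials configured, a dictionary like this could be passed along as boto3.client("mwaa").create_environment(**request). The same settings can also be expressed in CloudFormation or entered in the console.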
So Is MWAA For You?
With all of that in mind, the question to ask is simple: Is Managed Workflows right for your team?
There are some limitations when using MWAA compared with running an unmanaged Airflow system. For example, MWAA is currently limited to Airflow version 2.0.2, whereas Airflow itself is on version 2.1.4.
Using any managed service will always mean giving up some flexibility, but the trade-off is reduced costs: you no longer have to plan how to scale out larger systems or hire more employees just to manage operations.
If you’re considering adopting Apache Airflow for the first time, or if you’re already using Airflow and are looking for a way to automate your workflows, MWAA could be a wise addition to your stack.
Thank you for reading!