Modern Data Pipelines With Apache Airflow

The document discusses modern data pipelines using Apache Airflow. It provides an overview of Airflow concepts like DAGs, tasks, the Airflow web interface and scaling options. It also demonstrates example DAGs for GitHub stats and loading clickstream data into Redshift. Finally, it shows how to quickly get started with Airflow using the Astro CLI.


Modern Data Pipelines

with Apache Airflow


Andy Cooper & Taylor Edmiston @ Astronomer.io
Momentum Dev Con 2018
About Us
Andy Cooper
● Data Engineer
● 6 years of experience developing software and data pipelines
● Began career developing traditional data warehouses with Microsoft stack
● Using Airflow since 1.7

Taylor Edmiston
● Backend software engineer building the Airflow platform at Astronomer.io
● 9 years with Python, 6 years as a professional developer
● Top 20% all time on Stack Overflow with a reach of 750k developers
● Enjoys travel - 9 countries / 4 continents
What is Astronomer?
● Astronomer is a data engineering platform built on Apache Airflow and clickstream analytics
● Building tools that make data engineers' lives easier
● Seed-stage startup, founded ~3 years ago, located in Cincinnati (OTR)
● AngelPad #9 batch
● https://www.astronomer.io
● https://www.crunchbase.com/organization/astronomer
What do we do?
Airflow
● Astronomer Cloud (Managed Airflow)
○ Get up and running with Airflow quickly
● Astronomer Enterprise (docs)
○ Keep your data and workflows in your private cloud
○ Astronomer Spacecamp - Enterprise support & training available (https://www.astronomer.io/blog/announcing-astronomer-spacecamp/)
● Astronomer Open (docs)
○ The core of our platform is open source; try our Docker images on your machine

Clickstream
● A clickstream analytics pipeline and router for user events
● Client-side (web, native mobile) or server-side
● Not an analytics service! We integrate with 50+
● Free tier
● astronomer.io/clickstream
● 2-min demo video - https://www.youtube.com/watch?v=ru7VMe5MXZk
Outline (~40 min)
● (5 min) Intro
● (10 min) Part I - Airflow overview & concepts
● (10 min) Part II - Example DAGs
● Midpoint Q&A?
● (10 min) Part III - Getting started with Airflow + Astro CLI demo
● (5 min) Summary / Outro
● Q&A
What We’ll Cover
● Airflow Concepts
● Getting Started with Airflow
● Astro CLI
● Preview and Discussion Of Airflow UI
● Q&A
What is Apache Airflow?
● “Airflow is a platform to programmatically author, schedule and monitor workflows.”
● Open source, currently in the Apache Incubator phase
○ 7,500 stars
○ 4,000 commits
○ 400 contributors
● Written in Python
● Leverages Flask web framework
Airflow Concepts
What is a DAG?
Directed Acyclic Graph
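A minimal sketch of why the "acyclic" part matters, using only the Python standard library (`graphlib`, Python 3.9+) rather than Airflow itself. The task names are made up for illustration. Because the graph has no cycles, a valid run order exists; adding an edge that closes a loop makes ordering impossible.

```python
from graphlib import CycleError, TopologicalSorter

# Map each task to the set of tasks it depends on (its upstream tasks).
# These task names are hypothetical.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}

# Directed + acyclic means a valid run order always exists.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)

# Making "extract" depend on "notify" closes a loop, so no order exists.
cyclic = dict(dependencies, extract={"notify"})
try:
    list(TopologicalSorter(cyclic).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```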
Define Your Pipelines in Code
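To give a feel for what "pipelines in code" can look like, here is a toy model in plain Python. It is not Airflow's API: the `Task` class and `run_pipeline` helper are hypothetical stand-ins, and only the `a >> b` style of declaring "b runs after a" is borrowed from Airflow.

```python
from graphlib import TopologicalSorter


class Task:
    """Toy stand-in for a task; not one of Airflow's operator classes."""

    def __init__(self, task_id, fn):
        self.task_id = task_id
        self.fn = fn
        self.upstream = set()

    def __rshift__(self, other):
        # `a >> b` records that b depends on a, and returns b so chains work.
        other.upstream.add(self.task_id)
        return other


def run_pipeline(tasks):
    """Run every task once, upstream tasks first; return the order used."""
    by_id = {t.task_id: t for t in tasks}
    graph = {t.task_id: t.upstream for t in tasks}
    order = list(TopologicalSorter(graph).static_order())
    for task_id in order:
        by_id[task_id].fn()
    return order


log = []
fetch = Task("fetch_commits", lambda: log.append("fetched"))
count = Task("count_commits", lambda: log.append("counted"))
report = Task("post_report", lambda: log.append("reported"))

# The whole pipeline shape is one readable line of code.
fetch >> count >> report

# Tasks are passed in a scrambled order; dependencies decide the run order.
order = run_pipeline([report, fetch, count])
print(order)
```

Since the pipeline is ordinary code, it can be reviewed, versioned, and tested like any other module.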
A Centralized Web App for All Workflows
Web App Features
● A quick look into DAG and task progress
● Error Logging
● Connections & Variables
● Connection Pooling
Hooks and Operators
Hooks
● An interface to an external system
● Often a wrapper for an API client
● Examples
○ DbApiHook
○ S3Hook
○ SlackHook
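A rough sketch of the hook idea, using SQLite from the standard library so it runs anywhere. `SqliteHookSketch` is a hypothetical class written for this example; its method names (`get_conn`, `run`, `get_records`) loosely echo those on Airflow's `DbApiHook`, but the behavior here is simplified and should not be read as Airflow's implementation.

```python
import os
import sqlite3
import tempfile


class SqliteHookSketch:
    """Hypothetical hook: hides connection details behind a few methods."""

    def __init__(self, database):
        self.database = database  # path to the SQLite file

    def get_conn(self):
        # Return a fresh DB-API connection to the external system.
        return sqlite3.connect(self.database)

    def run(self, sql, parameters=()):
        # Execute a statement that returns no rows, then commit.
        conn = self.get_conn()
        try:
            conn.execute(sql, parameters)
            conn.commit()
        finally:
            conn.close()

    def get_records(self, sql, parameters=()):
        # Execute a query and return all rows as a list of tuples.
        conn = self.get_conn()
        try:
            return conn.execute(sql, parameters).fetchall()
        finally:
            conn.close()


db_path = os.path.join(tempfile.mkdtemp(), "events.db")
hook = SqliteHookSketch(db_path)
hook.run("CREATE TABLE events (name TEXT, clicks INTEGER)")
hook.run("INSERT INTO events VALUES (?, ?)", ("signup", 3))
rows = hook.get_records("SELECT name, clicks FROM events")
print(rows)
```

Callers only see `run` and `get_records`; how the connection is opened and closed stays inside the hook.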
Operators
● Sensor Operators
○ S3KeySensor
○ S3PrefixSensor
○ HttpSensor
● Action Operators
○ BashOperator
○ PythonOperator
○ EmailOperator
● Transfer Operators
○ SalesforceToRedshiftSchemaSync
○ SalesforceToS3
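The difference between a sensor and an action operator can be sketched in plain Python. Both classes below are hypothetical toys, not Airflow classes: the sensor polls a cheap `poke()` check until it passes or a timeout expires, while the action operator simply runs a callable. The names `poke_interval` and `timeout` mirror the parameters Airflow sensors use. Transfer operators, which move data between two systems, are not shown.

```python
import os
import tempfile
import time


class FileSensorSketch:
    """Toy sensor: waits until a condition on an external system holds."""

    def __init__(self, path, poke_interval=0.01, timeout=1.0):
        self.path = path
        self.poke_interval = poke_interval  # seconds between checks
        self.timeout = timeout              # give up after this many seconds

    def poke(self):
        # One cheap check; True means the wait is over.
        return os.path.exists(self.path)

    def execute(self):
        deadline = time.monotonic() + self.timeout
        while not self.poke():
            if time.monotonic() >= deadline:
                raise TimeoutError(f"{self.path} did not appear in time")
            time.sleep(self.poke_interval)
        return True


class PythonActionSketch:
    """Toy action operator: runs a callable and returns its result."""

    def __init__(self, python_callable):
        self.python_callable = python_callable

    def execute(self):
        return self.python_callable()


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "export.csv")

# Action: create the file. Sensor: succeeds once the file exists.
PythonActionSketch(lambda: open(target, "w").close()).execute()
found = FileSensorSketch(target).execute()

# A sensor whose condition never holds gives up after its timeout.
missing = FileSensorSketch(os.path.join(workdir, "nope.csv"), timeout=0.05)
try:
    missing.execute()
    timed_out = False
except TimeoutError:
    timed_out = True
print(found, timed_out)
```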
DAG Runs & Task Instances
Dynamic DAGs
Executors & Scaling
Executors
● SequentialExecutor
● LocalExecutor
○ No additional dependencies
○ Runs tasks in parallel out of the box (local worker processes)
● CeleryExecutor
● MesosExecutor
● KubernetesExecutor (future)
Plugins
What can a plugin do?
● Extend the Airflow API
● Build new dashboards
● Create custom Hooks and Operators
● Astronomer maintains the most comprehensive collection of Airflow Plugins
○ github.com/airflow-plugins
● Code reuse, composition, good software engineering practices, etc
● Examples
○ Salesforce To Redshift Plugin
○ airflow-api-plugin
○ Airflow DAG Creation Manager Plugin
Example DAGs
DAG Examples
● GitHub stats DAG
● Clickstream Redshift loader DAG
○ ~200 million events per month from customer apps
○ ~2 million Airflow task instances per month
● https://github.com/airflow-plugins/Example-Airflow-DAGs
GitHub Issue and Commit Tracking Example
Clickstream Redshift DAG
● Your Website → Astronomer Clickstream → S3 → [S3 sensor → Redshift copy via Apache Spark]
● Dynamic DAGs configured via API → Scheduler (cached) → Variable
Astro CLI
The fastest way to get started with Airflow
How can I get started with Airflow?
● Source Code
○ https://github.com/astronomerio/astro-cli
● Install CLI
○ $ curl -sL https://install.astronomer.io | sudo bash
● Start a Project
○ $ mkdir test-project && cd test-project
○ $ astro airflow init
○ $ astro airflow start
Takeaway
● Part I - Airflow overview & concepts
● Part II - Example DAGs
● Part III - Getting started with Airflow + Astro CLI demo
Resources
● Official
○ https://github.com/apache/incubator-airflow
○ https://airflow.apache.org
○ Airflow Dev Mailing List
○ Apache Airflow meetups
● Community
○ https://github.com/airflow-plugins
○ https://soundcloud.com/the-airflow-podcast
○ https://github.com/jghoman/awesome-apache-airflow
● Related Talks
○ https://blog.tedmiston.com/talks/
Contact Info
● Andy
○ https://twitter.com/andscoop
○ https://www.linkedin.com/in/andscoop/
○ https://andscoop.com/
● Taylor
○ https://twitter.com/kicksopenminds
○ https://www.linkedin.com/in/tedmiston/
○ https://blog.tedmiston.com