July 6, 2020

Get Airflow while Docked

Key Apache Airflow concepts for beginners and how to get it running using Docker.

Jaques Bray

I’ve been working with Apache Airflow for a little more than a year. It took me a day or two to get Airflow working properly with the way it’s set up in our environment, and the setup errors took a while to figure out and fix. Eventually I had it set up and working the way I expected. Then I got a new laptop and had to do the setup again, and by that time I had forgotten exactly what I’d done to get it working the way I wanted.

Key concepts for Apache Airflow:

DAG (or Directed Acyclic Graph)

A DAG is a collection of all the tasks you want to run, organised in a way that reflects their relationships and dependencies.
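To make "organised by dependencies" concrete, here is a minimal sketch in plain Python (no Airflow involved) that uses the standard library's graphlib module to work out a valid run order. The task names are made up for illustration, and this is only a model of the idea, not how Airflow schedules tasks internally. It needs Python 3.9 or newer.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical tasks: each key maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}


def run_order(deps):
    """Return one execution order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


def is_acyclic(deps):
    """A DAG must not contain cycles; graphlib raises CycleError if it does."""
    try:
        run_order(deps)
        return True
    except CycleError:
        return False


order = run_order(dependencies)
print(order)
print(is_acyclic({"a": {"b"}, "b": {"a"}}))  # two tasks that wait on each other
```

The second call shows why the "acyclic" part matters: two tasks that depend on each other can never be given a valid order.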

DAG Runs

A DAG run is a physical instance of a DAG, containing task instances that run for a specific execution_date. A DAG run is usually created by the Airflow scheduler, but can also be created by an external trigger.

Operators

While DAGs describe how to run a workflow, Operators determine what actually gets done by a task.

An operator describes a single task in a workflow. Operators are usually (but not always) atomic, meaning they can stand on their own and don’t need to share resources with any other operators.

Airflow provides operators for many common tasks, including:

  • BashOperator - executes a bash command
  • PythonOperator - calls an arbitrary Python function
  • EmailOperator - sends an email
  • SimpleHttpOperator - sends an HTTP request
  • MySqlOperator, SqliteOperator, PostgresOperator, MsSqlOperator, OracleOperator, JdbcOperator, etc. - executes a SQL command
  • Sensor - an Operator that waits (polls) for a certain time, file, database row, etc.

Tasks

Once an operator is instantiated, it is referred to as a “task”. The instantiation defines specific values when calling the abstract operator, and the parameterised task becomes a node in a DAG.

Task instances

A task instance represents a specific run of a task and is characterised as the combination of a DAG, a task, and a point in time. Task instances also have an indicative state, which could be “running”, “success”, “failed”, “skipped”, “up for retry”, etc.
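The relationship between DAG runs, task instances, and states can be sketched in plain Python. This is only a toy model under my own assumptions: the TaskInstanceKey class and start_dag_run function are hypothetical names, not Airflow's real classes, and starting every task as “running” is a simplification.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TaskInstanceKey:
    """Identifies one task instance: a DAG, a task, and a point in time."""
    dag_id: str
    task_id: str
    execution_date: datetime


def start_dag_run(dag_id, task_ids, execution_date):
    """Model a DAG run as a mapping from task-instance key to state."""
    return {
        TaskInstanceKey(dag_id, task_id, execution_date): "running"
        for task_id in task_ids
    }


# Two runs of the same hypothetical DAG for different execution dates.
tasks = ["print_date", "say_hello"]
run_1 = start_dag_run("Helloworld", tasks, datetime(2020, 7, 5))
run_2 = start_dag_run("Helloworld", tasks, datetime(2020, 7, 6))

# The same task in a different run is a different task instance,
# so each one carries its own state.
run_1[TaskInstanceKey("Helloworld", "say_hello", datetime(2020, 7, 5))] = "success"
run_2[TaskInstanceKey("Helloworld", "say_hello", datetime(2020, 7, 6))] = "failed"

for key, state in {**run_1, **run_2}.items():
    print(key.dag_id, key.task_id, key.execution_date.date(), state)
```

The point of the sketch is that a state belongs to the (DAG, task, execution date) combination, not to the task itself.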

Getting Airflow up and running using Docker:

Now it’s time to get our hands dirty with some Airflow setup using Docker. We’ll assume that you are somewhat familiar with Docker.

So this article will not cover how to install Docker or how to get a Docker Hub account.

Go to Docker Hub and log in. Once logged in, search for puckel/docker-airflow; we’ll be using this as our Docker image.

You can do a simple docker pull of this image:

docker pull puckel/docker-airflow

After a bit the pull should complete, which means the Docker image is downloaded and usable. You can verify this with the following:

docker images -a | grep docker-airflow

Now that the image is downloaded, you can start a running container with the following command:

docker run -d -p 8080:8080 puckel/docker-airflow webserver

Now Airflow is running on your local machine using Docker. You can access the Airflow UI at:

http://127.0.0.1:8080/admin/

To find the running container, you can do the following:

docker ps

This should list the running container, including its container ID and name.

You can now enter the Docker container using the following:

docker exec -ti <container name> bash

Running your first DAG

Now that our container and webserver are up and running, we can run a DAG on our Airflow instance.

In Airflow, DAG definition files are Python scripts (“configuration as code” is one of the advantages of Airflow). You create a DAG by defining the script and simply adding it to a folder ‘dags’ within the $AIRFLOW_HOME directory. In our case, the directory we need to add DAGs to in the container is:

/usr/local/airflow/dags

You could copy DAG files into the running container, but they would be lost whenever the container is recreated. Instead, one solution is to use “volumes”, which allow you to share a directory between your local machine and the Docker container. Anything you add to the local directory will also appear in the directory you connect it with in the container. In our case, we’ll create a volume that maps the directory on our local machine where we’ll hold DAG definitions to the location where Airflow reads them in the container.

You will need to stop the container you created earlier so that you can recreate it with a volume attached. This can be achieved with:

docker stop <container id>

Then you will be able to recreate the docker container with a volume attached to it:

docker run -d -p 8080:8080 -v /some/path/to/keep/dags/locally:/usr/local/airflow/dags puckel/docker-airflow webserver

Then inside your /some/path/to/keep/dags/locally directory you can create a HelloWorld.py with the code found here: HelloWorld
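If you just want something to drop into the folder, a minimal DAG file might look like the sketch below. This is my own stand-in rather than the linked code: it assumes the Airflow 1.10.x import paths used by the puckel/docker-airflow image, and the dag_id, task ids, and say_hello function are illustrative names.

```python
from datetime import datetime

# Import paths assume Airflow 1.10.x; later major versions moved these modules.
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator


def say_hello():
    """Callable run by the PythonOperator; the output ends up in the task log."""
    print("Hello world from Airflow")


# schedule_interval=None means the DAG only runs when triggered manually.
dag = DAG(
    dag_id="Helloworld",
    start_date=datetime(2020, 7, 1),
    schedule_interval=None,
    catchup=False,
)

# Instantiating an operator creates a task, which becomes a node in the DAG.
print_date = BashOperator(task_id="print_date", bash_command="date", dag=dag)
hello = PythonOperator(task_id="say_hello", python_callable=say_hello, dag=dag)

# print_date must finish successfully before say_hello starts.
print_date >> hello
```

Because schedule_interval is None in this sketch, nothing runs until you trigger the DAG yourself, which keeps a first experiment predictable.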

Then, after waiting a few minutes, you should see a new DAG displayed in your Airflow UI (http://127.0.0.1:8080/admin/).

You can test individual tasks in your DAG by entering the container using the docker exec command described earlier and running airflow test <dag_id> <task_id> <execution_date>. Once you’re in, you can also see all of your DAGs by running airflow list_dags. Our Helloworld DAG should appear in that list.

You can now check that Airflow is working by manually triggering the DAG through the UI.

You will see it start to run and work through its tasks.


That’s all we will cover in this article. I found this much simpler than setting Airflow up manually, and it’s a quick way to start a brand new Airflow instance with only a few commands.
