I’ve been working with Apache Airflow for a little over a year. It took me a day or two to get Airflow working properly with the way it’s set up in our environment, and the setup errors took a while to figure out and fix. I finally had it running the way I expected, only to get a new laptop, repeat the setup, and realise I had forgotten exactly what I’d done the first time to get it working the way I wanted.
Key concepts for Apache Airflow:
DAG (or Directed Acyclic Graph)
A DAG is a collection of all the tasks you want to run, organised in a way that reflects their relationships and dependencies.
A DAG run is a physical instance of a DAG, containing task instances that run for a specific execution_date. A DAG run is usually created by the Airflow scheduler, but can also be created by an external trigger.
While DAGs describe how to run a workflow, Operators determine what actually gets done by a task.
An operator describes a single task in a workflow. Operators are usually (but not always) atomic, meaning they can stand on their own and don’t need to share resources with any other operators.
Airflow provides operators for many common tasks, including:
- BashOperator - executes a bash command
- PythonOperator - calls an arbitrary Python function
- EmailOperator - sends an email
- SimpleHttpOperator - sends an HTTP request
- MySqlOperator, SqliteOperator, PostgresOperator, MsSqlOperator, OracleOperator, JdbcOperator, etc. - executes a SQL command
- Sensor - an Operator that waits (polls) for a certain time, file, database row, etc.
Once an operator is instantiated, it is referred to as a “task”. The instantiation defines specific values when calling the abstract operator, and the parameterised task becomes a node in a DAG.
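To make the idea of instantiation concrete, here is a minimal sketch assuming the Airflow 1.x API that ships with the puckel/docker-airflow image; the dag_id, task_ids, and bash commands below are invented purely for illustration:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# An illustrative DAG; the dag_id and dates are made-up values.
dag = DAG(
    dag_id="example_tasks",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

# Instantiating the abstract BashOperator with concrete parameters
# turns it into a task, i.e. a node in the DAG.
print_date = BashOperator(task_id="print_date", bash_command="date", dag=dag)
sleep = BashOperator(task_id="sleep", bash_command="sleep 5", dag=dag)

# The dependency between the two tasks defines an edge of the graph:
# sleep runs only after print_date succeeds.
print_date >> sleep
```

Each `BashOperator(...)` call above is one task; the scheduler will later create task instances of them for each DAG run.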
A task instance represents a specific run of a task and is characterised as the combination of a DAG, a task, and a point in time. Task instances also have an indicative state, which could be “running”, “success”, “failed”, “skipped”, “up for retry”, etc.
Getting Airflow up and running using Docker:
Now it’s time to get our hands dirty with some Airflow setup using Docker. We’ll assume you are somewhat familiar with Docker, so this article will not cover how to install Docker or how to get a Docker Hub account.
So go to Docker Hub and log in. Once logged in, search for puckel/docker-airflow; this is the image we’ll be using.
You can do a simple docker pull of this image:
docker pull puckel/docker-airflow
After a bit the pull should finish. This means the Docker image is pulled and usable, which you can verify with the following:
docker images -a | grep docker-airflow
Now that the image is downloaded, you can start a running container with the following command:
docker run -d -p 8080:8080 puckel/docker-airflow webserver
Now Airflow is running on your local machine using Docker. Since we published port 8080, you can access the Airflow UI at http://127.0.0.1:8080.
Now, to find the running container, you can use:
docker ps
The output will include the container’s ID and name.
You can now enter the Docker container using the following:
docker exec -ti <container name> bash
Running your first DAG
Now that our container and webserver are up and running, we can start running our first DAG.
In Airflow, DAG definition files are Python scripts (“configuration as code” is one of the advantages of Airflow). You create a DAG by writing the script and simply adding it to a ‘dags’ folder within the $AIRFLOW_HOME directory. In our case, the directory we need to add DAGs to in the container is /usr/local/airflow/dags.
Copying files into a running container is inconvenient, though. Instead, one solution is to use “volumes”, which allow you to share a directory between your local machine and the Docker container. Anything you add to the local directory will appear in the directory you map it to inside the container. In our case, we’ll create a volume that maps a directory on our local machine, where we’ll keep our DAG definitions, to the location where Airflow reads them in the container.
But first you will need to stop the previous container you created so it can be recreated with a volume attached. This can be achieved with:
docker stop <container id>
Then you will be able to recreate the docker container with a volume attached to it:
docker run -d -p 8080:8080 -v /some/path/to/keep/dags/locally:/usr/local/airflow/dags puckel/docker-airflow webserver
Then inside your /some/path/to/keep/dags/locally directory you can create a HelloWorld.py with the code found here: HelloWorld
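The linked HelloWorld code isn’t reproduced here, but a minimal DAG along those lines, sketched against the Airflow 1.x API bundled with puckel/docker-airflow (the dag_id, task names, and schedule are illustrative assumptions, not the linked author’s exact code), could look like this:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator


def hello():
    # The task simply returns a greeting, which Airflow logs.
    return "Hello world!"


# schedule_interval=None means the DAG only runs when triggered manually.
dag = DAG(
    dag_id="Helloworld",
    start_date=datetime(2019, 1, 1),
    schedule_interval=None,
)

start = DummyOperator(task_id="start", dag=dag)
say_hello = PythonOperator(task_id="say_hello", python_callable=hello, dag=dag)

# say_hello runs after start completes.
start >> say_hello
```

Once a file like this is saved into the mapped dags directory, the scheduler running inside the container should pick it up.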
Then, after waiting a few minutes, you should see the new DAG appear in your Airflow UI (http://127.0.0.1:8080/admin/).
You can test individual tasks in your DAG by entering the container using the docker exec command described earlier and running the command airflow test. Once you’re in, you can also list all of your DAGs by running airflow list_dags; our Helloworld DAG should show up in that list.
You can now test that Airflow is working by manually triggering the DAG through the UI:
You will see it starting to run and work through tasks:
That’s all we will cover in this article. I found this approach much simpler than the manual installation, and it’s a quick way to stand up a brand new Airflow instance with only a few commands.