May 8, 2020

How to Orchestrate workflows using Google Cloud Composer

What is Cloud Composer?

Google Cloud Composer is essentially a managed instance of Apache Airflow. It allows the user to schedule, manage and monitor pipelines.

What is Apache Airflow?

Apache Airflow is an open source tool that allows the user to create, schedule and monitor their workflows. These workflows are defined as DAGs. DAG stands for Directed Acyclic Graph, which is a mouthful, so we just say DAG. Breaking the term down: directed means the dependencies between tasks point one way, so a task only runs after the tasks it depends on have finished, and that order can’t be reversed. Acyclic means the dependencies can’t loop back on themselves, so no task can end up depending, directly or indirectly, on itself.
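To make the two terms concrete, here is a small sketch that uses Python's standard library (graphlib, Python 3.9+) rather than Airflow itself. The task names are made up for illustration. It shows that a directed, acyclic set of dependencies always has a valid run order, and that adding a loop makes ordering impossible.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical task names; each key maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"transform"},
}

# Directed: a valid run order always places a task after everything it depends on.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)

# Acyclic: making "extract" depend on "load" closes a loop, so no run order exists.
cyclic = dict(dependencies, extract={"load"})
try:
    list(TopologicalSorter(cyclic).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```

Airflow performs its own cycle check when it loads a DAG; the sketch only illustrates why such a check is needed.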

In Airflow, and hence in Cloud Composer, DAGs are defined using Python scripts.

Why use Cloud Composer over Airflow?

Using Google Cloud Composer rather than a self-managed Airflow install gives the user all the benefits of using Airflow to manage workflows without having to install and maintain Airflow themselves. This is ideal for projects with less DevOps experience available.

Creating a Cloud Composer Environment

Creating a Cloud Composer environment is really simple. Navigate to the Cloud Composer UI in the Google Cloud Console (it’s under the Big Data section) and select the option to create an environment.

A range of customisation options is available, but the only ones you are required to set are the name of the environment and the location (which should be the same location you use for your other data storage and processing requirements).

You can also customise the number of nodes available for processing and the specification of those nodes. The defaults should be fine for most workflows, but consider increasing the number of nodes if you have many tasks that need to be processed simultaneously. Similarly, the type of node should suit your workflows: more hardware-intensive calculations will require larger, more powerful nodes.

Once you have clicked create, give it some time to spin up the Kubernetes cluster. You can then go to the environments page from the Composer UI in Cloud Console. From here you should see the environment you just created, and when you select it you can see the details of the Composer environment you just set up. Two important pieces of information here are the DAGs folder, which gives you a link to the Google Cloud Storage bucket where you can upload your DAGs, and the Airflow web UI, which gives you a link to the Airflow front end where you can monitor the progress of any DAGs you’ve created.

Uploading a new DAG

To get your workflow into Airflow you need to upload a copy of your DAG into the DAGs folder listed in the environment details of your Composer instance. From there Airflow will do its magic and your DAG will appear in the Airflow UI. Note that this can take a minute or two: the Airflow scheduler periodically scans the DAGs folder and updates the UI.
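If you would rather script the upload than use the console, something along these lines should work. It is only a sketch: it assumes the google-cloud-storage client library is installed and that credentials are already configured, and the bucket name and file path in the usage note are made up. The helper functions simply work out where the file should land inside the DAGs folder.

```python
from pathlib import Path
from urllib.parse import urlparse


def split_gcs_uri(uri):
    """Split 'gs://bucket/prefix' into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a gs:// URI: {uri!r}")
    return parsed.netloc, parsed.path.strip("/")


def dag_object_name(dags_folder_uri, local_path):
    """Return (bucket, object name) for a DAG file placed in the DAGs folder."""
    bucket, prefix = split_gcs_uri(dags_folder_uri)
    file_name = Path(local_path).name
    object_name = f"{prefix}/{file_name}" if prefix else file_name
    return bucket, object_name


def upload_dag(dags_folder_uri, local_path):
    """Upload one DAG file to the environment's DAGs folder."""
    # Imported here so the helpers above work without the client library installed.
    from google.cloud import storage

    bucket_name, object_name = dag_object_name(dags_folder_uri, local_path)
    blob = storage.Client().bucket(bucket_name).blob(object_name)
    blob.upload_from_filename(str(local_path))
    return f"gs://{bucket_name}/{object_name}"
```

For example, upload_dag("gs://example-bucket/dags", "pipelines/my_dag.py") would copy the file to gs://example-bucket/dags/my_dag.py, where the first argument is the DAGs folder link from your environment details.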

Managing your DAGs

Your DAG will now run on the schedule you defined in the DAG definition. You can also trigger your DAG manually by using the play button in the links column of your DAG.

This column also contains various views and tools you can use to monitor and debug your workflow. One of the most useful is the graph view, which presents a block diagram of your workflow and shows the dependencies between tasks as arrows. When a DAG is running, completed tasks have a dark green outline and in-progress tasks have a light green outline. If a task fails, it gets a yellow outline: the task is now in a retrying state and will be retried according to the retry policy defined in the DAG definition. Once a task has used all the retries defined in the policy and still fails, it receives a red outline. When that happens the DAG run fails too, and this is shown on the DAGs screen, which has a summary of all the DAGs.
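The retry behaviour described above can be sketched in a few lines of plain Python. The retries and retry_delay keys are real Airflow task arguments that are usually passed through a DAG's default_args; the values chosen here and the state_after_failures helper are illustrative only, and the colours are simply the ones described in this post.

```python
from datetime import timedelta

# Illustrative retry policy: retry a failed task twice, five minutes apart.
default_args = {
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

# Outline colours from the graph view, keyed by Airflow task state name.
OUTLINE = {
    "success": "dark green",
    "running": "light green",
    "up_for_retry": "yellow",
    "failed": "red",
}


def state_after_failures(failed_attempts, retries):
    """Return the task state once the given number of attempts have failed."""
    if failed_attempts < 1:
        raise ValueError("failed_attempts must be at least 1")
    # The first attempt plus `retries` further attempts are allowed in total.
    return "up_for_retry" if failed_attempts <= retries else "failed"


for attempt in (1, 2, 3):
    state = state_after_failures(attempt, default_args["retries"])
    print(attempt, state, OUTLINE[state])
```

With two retries a task gets three attempts in total, so it only turns red after the third failure.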

On the DAGs screen the user can toggle DAGs on and off. DAGs that are toggled off are hidden from the DAGs screen by default. “OFF” DAGs will not be scheduled; however, any tasks that were already running when the DAG was turned off will run to completion. To see the full list of DAGs, including those that have been turned off, select the show paused DAGs option at the bottom of the DAGs screen.

I hope you find this guide useful. If you have any questions, or know an easier way - drop us a message!