July 23, 2020

Creating Your First Pipeline With Kubeflow

Working with ML pipelines and how to build them using Kubeflow.

Aibaki Tembo

What is Kubeflow?

Kubeflow is an open-source platform that lets you build and automate your ML workflows on top of Kubernetes infrastructure. When I first started working with Kubeflow, I thought it was just a showy, overhyped version of Apache Airflow using the KubernetesPodOperator, but I was more than mistaken. Kubeflow is built and designed for the ML and AI engineer. It lets a user run notebook servers, run complex TensorFlow model training tasks, and build and serve ML pipelines, to name a few.

In this guide, I will focus strictly on ML pipelines and show you how to build them using lightweight components. Hopefully, in another article, I can show you how to build pipelines using container operations.

Creating the Pipeline:

Before building the pipeline, one needs to understand how to build Docker containers, how containers interact with each other, and Kubernetes in general.

Another thing that helps is to assume your architecture is stateless, i.e. data does not automatically propagate to other tasks/jobs in the pipeline. Kubeflow works this way by default: files written in one job's pod cannot be read from another, so as a best practice you should upload any large amounts of data to cloud storage and pass its location between tasks.

Before you proceed to the next section, install the Kubeflow Pipelines SDK (the kfp package on PyPI). The examples in this guide use func_to_container_op, which belongs to the 1.x releases of the SDK.

Using Lightweight Python Component Functions

As stated above, Kubeflow runs on top of Kubernetes. When building components from functions, each function runs in its own container. Therefore one needs to ensure that any packages the function needs are available in the underlying container, either by baking them into the image beforehand or by making a subprocess call within the Python function to pip install the required packages.
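To make the subprocess approach concrete, here is a minimal sketch of the install-on-demand logic. The helper name install_and_import is my own (it is not part of the SDK), and in a real lightweight component this logic would sit inside the component function itself, since the function has to be standalone.

```python
def install_and_import(module_name: str, pip_name: str = ""):
    """Import module_name, pip-installing it first if it is missing.

    Hypothetical helper for illustration; not part of the Kubeflow SDK.
    """
    # Imports live inside the function so the body stays standalone.
    import importlib
    import subprocess
    import sys

    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Install with the same interpreter that is running this code.
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", pip_name or module_name]
        )
        importlib.invalidate_caches()
        return importlib.import_module(module_name)


# A module that is already present is returned without calling pip.
math_module = install_and_import("math")
print(math_module.sqrt(16))
```

If you are on the 1.x SDK, func_to_container_op also accepts a packages_to_install argument, which may be a simpler route than calling pip yourself.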

To create a lightweight component using a function, one needs to import the following from the SDK:

from kfp.components import func_to_container_op

In this example, we will create two functions: one calculates the mean of a list of values, and the other calculates the standard deviation given that mean and the same list.

The code snippet below shows how to create the two functions using func_to_container_op as a decorator:
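The original embedded snippet is not reproduced here, so what follows is a sketch of what the two components might look like, assuming the 1.x SDK. The function names, the printed messages, and the choice of the population formula for the standard deviation are my assumptions. The try/except fallback is also my addition; it only exists so the functions can be run locally when the SDK is not installed.

```python
try:
    # Available in the 1.x releases of the Kubeflow Pipelines SDK.
    from kfp.components import func_to_container_op
except ImportError:
    def func_to_container_op(func):
        """Local stand-in (assumption): return the function unchanged."""
        return func


@func_to_container_op
def calculate_mean(values: list) -> float:
    """Return the arithmetic mean of a list of numbers."""
    mean = sum(values) / len(values)
    print(f"Mean: {mean}")  # shows up in the task logs
    return mean


@func_to_container_op
def calculate_standard_deviation(values: list, mean: float) -> float:
    """Return the population standard deviation, given the mean."""
    # Imported inside the function so it is available in the container.
    import math

    variance = sum((value - mean) ** 2 for value in values) / len(values)
    std_dev = math.sqrt(variance)
    print(f"Standard deviation: {std_dev}")
    return std_dev
```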

Two things to note in the above. First, the two functions have a decorator; this is an alternative to calling func_to_container_op with the function as a parameter. Second, the math library is imported inside the function that uses it; this goes back to the earlier point that every library a function needs must be imported within that function.

Now that we have created the components, below is the code to create the Kubeflow Pipeline:
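Since the original snippet is missing, here is a sketch of how the pipeline could be wired and compiled, again assuming the 1.x SDK. The wrapper function compile_demo_pipeline, its parameters, the pipeline name, and the sample values are all my assumptions; the mean and standard deviation components are passed in as arguments so the sketch stays self-contained.

```python
def compile_demo_pipeline(mean_op, std_dev_op, package_path="demo.zip"):
    """Wire two components into a pipeline and compile it to package_path.

    mean_op and std_dev_op are the component factories produced by
    func_to_container_op (hypothetical parameter names).
    """
    # Imported lazily so this module loads even without the SDK installed.
    from kfp import compiler, dsl

    @dsl.pipeline(
        name="Demo Pipeline",
        description="Calculates the mean and standard deviation of a list.",
    )
    def demo_pipeline():
        values = [1, 2, 3, 4, 5]  # sample input, chosen for illustration
        mean_task = mean_op(values)
        # Passing mean_task.output makes this task run after the first one.
        std_dev_op(values, mean_task.output)

    compiler.Compiler().compile(demo_pipeline, package_path)
    return package_path
```

With the SDK installed, calling this with the mean and standard deviation components as the two arguments should produce the demo.zip file.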

Running the above script generates the demo.zip file. Upload this file to the Kubeflow Dashboard under the Pipelines section. Once uploaded, the pipeline looks like this:

Figure: Kubeflow Demo Pipeline

After a run is created, the pipeline will execute and under the logs section you can see the printed mean and standard deviation:

Figure: Calculate Mean Task Output
Figure: Calculate Standard Deviation Output

Conclusion

In the above example, I calculated the mean in one task and used a second task to calculate the standard deviation from that mean. This is not an ML pipeline per se, but it illustrates the flow of tasks. In an ML pipeline, the first component could pre-process the data, the next could train the model, and the final ones could test and deploy it. Going forward, to further understand Kubeflow and its best practices, the following links will be of help:

I hope you find this guide useful. If you have any questions, or know an easier way - drop us a message!