Ashill Chiranjan | Chris Nelson | Zibusiso Bhango
Data governance is an important discipline in any organisation that is ultimately responsible for data. It provides information about data sources, the schemas of those sources, the processes that read the data, the transformations applied, when data was last updated, the classifications applied to data and the restrictions placed on it. Data governance gives you the ability to understand metadata and take appropriate action as required. It is meant to give a complete end-to-end picture, so you can keep up with the constantly changing and growing datasets in your environment.
Apache Atlas is a one-stop solution for data governance and metadata management. It facilitates gathering, processing and maintaining metadata, and it monitors data processes, data stores and files, recording updates in a metadata repository. Apache Atlas is typically used with Hadoop environments but can be integrated into other environments as well. It has a scalable, extensible architecture that can be plugged into many components to manage their metadata in a central repository. Because of its extensible type system, any arbitrary component can be modelled to capture metadata about its datasets and events. Atlas provides open metadata management and governance capabilities that let organisations build a catalogue of their data assets, classify and govern those assets, and collaborate around them with data stakeholders.
Apache Atlas features include metadata types and instances, classification, lineage and search/discovery. Atlas ships with pre-defined types for various metadata, and new types can be created and managed. Types can have primitive attributes, complex attributes and object references, and can inherit from other types. Entities are instances of types that capture the details of metadata objects and their relationships.
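As a sketch of how a new type might be registered, the snippet below builds a payload for the Atlas v2 typedefs endpoint (`POST /api/atlas/v2/types/typedefs`). The type name `my_dataset` and its attributes are made up for illustration; inheriting from the built-in `DataSet` type is a common pattern.

```python
# Sketch: build a custom entity-type definition for the Atlas v2 REST API.
# The type name "my_dataset" and its attributes are illustrative, not built-in.
import json

def make_entity_typedef(name, super_types, attributes):
    """Build a typedef payload for POST /api/atlas/v2/types/typedefs."""
    return {
        "entityDefs": [{
            "name": name,
            "superTypes": super_types,       # inherit attributes from these types
            "attributeDefs": [
                {
                    "name": attr_name,
                    "typeName": attr_type,   # primitive type, e.g. "string", "int"
                    "isOptional": True,
                    "cardinality": "SINGLE",
                }
                for attr_name, attr_type in attributes
            ],
        }]
    }

typedef = make_entity_typedef(
    "my_dataset", ["DataSet"], [("retentionDays", "int"), ("owner", "string")]
)
print(json.dumps(typedef, indent=2))
# POST this body to http://<atlas-host>:21000/api/atlas/v2/types/typedefs
```

Entities created with `typeName: "my_dataset"` would then carry the inherited `DataSet` attributes plus the custom ones defined here.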
Atlas allows for dynamic creation of classifications, which are tags associated with entities. Classifications can include attributes, and entities can be associated with multiple classifications, enabling easier discovery and helping enforce security and compliance. Classifications are also propagated automatically via lineage, ensuring they remain intact as data goes through various processing stages.
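A minimal sketch of what this looks like over the REST API: first a classification type is defined (again via `POST /api/atlas/v2/types/typedefs`), then it is attached to an entity (`POST /api/atlas/v2/entity/guid/<guid>/classifications`). The classification name `PII` and the `level` attribute are illustrative.

```python
# Sketch: classification definition and assignment payloads for Atlas v2.
# "PII" and its "level" attribute are illustrative examples.

def make_classification_def(name, attributes=()):
    """Payload for POST /api/atlas/v2/types/typedefs (classificationDefs)."""
    return {
        "classificationDefs": [{
            "name": name,
            "attributeDefs": [
                {"name": a, "typeName": "string",
                 "isOptional": True, "cardinality": "SINGLE"}
                for a in attributes
            ],
        }]
    }

def make_classification_assignment(name, attribute_values=None):
    """Payload for POST /api/atlas/v2/entity/guid/<guid>/classifications."""
    return [{"typeName": name, "attributes": dict(attribute_values or {})}]

pii_def = make_classification_def("PII", ["level"])
pii_tag = make_classification_assignment("PII", {"level": "high"})
```

Once attached, the tag follows the entity through lineage, so downstream datasets derived from a `PII`-tagged table pick up the classification without manual effort.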
Atlas tracks the lineage of datasets: when a dataset is derived from another dataset, the event can be registered and Atlas will capture the lineage relationship. It provides an intuitive UI to view lineage as data moves through various processes, and REST APIs can be used to access and update lineage. Atlas can also be used effectively as a search tool: the UI can search entities by type, classification, attribute value or free text, and REST APIs can be used to search on more complex criteria.
How to use Apache Atlas
Apache Atlas can be set up locally on a desktop/PC and used to store file and table metadata. Note that Apache Atlas does not actually store the file or table data itself; it merely stores data about the file or table, i.e. metadata. You would still need to store your file or table somewhere, be it HDFS or cloud storage.
Apache Atlas allows you to:
- Create files, tables or schemas
- Classify files, tables or schemas for purposes of grouping data together
- Link schemas to files or tables to provide better information on the underlying file or table structure
- Create lineage (a flow) between files/tables and processes, to see visually how a file/table changed over time and what it is expected to look like.
- Create relationships between files/tables and other files/tables.
Overall, Apache Atlas helps with the central organisation and tracking of data, and comes with a nice UI.
Once set up locally or on a cluster, Apache Atlas provides a pleasant and somewhat intuitive UI that lets you interact with metadata for tracking, monitoring or debugging. The UI covers common interactions, but for more complex operations you would have to use the RESTful API or Kafka messages to manipulate data in Atlas.
Once Apache Atlas is set up, you can start by creating instances of files/tables and populating the metadata fields with values relevant to your data.
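For example, a file could be registered using the built-in `hdfs_path` type via `POST /api/atlas/v2/entity`. This is a sketch: the path and the `@mycluster` suffix on `qualifiedName` are illustrative (the qualified name just has to be unique within your Atlas instance).

```python
# Sketch: entity-creation payload for POST /api/atlas/v2/entity,
# using the built-in hdfs_path type. Paths and cluster name are illustrative.

def make_hdfs_path_entity(path, name, qualified_name):
    """Build the request body for creating an hdfs_path entity."""
    return {
        "entity": {
            "typeName": "hdfs_path",
            "attributes": {
                "qualifiedName": qualified_name,  # must be unique in Atlas
                "name": name,
                "path": path,
            },
        }
    }

entity = make_hdfs_path_entity(
    "/data/raw/sales.csv", "sales.csv", "/data/raw/sales.csv@mycluster"
)
```

The response to the POST contains the GUID Atlas assigned to the entity, which is what you use later for classifications, lineage and deletes.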
In the above image, we see metadata about a file in the Atlas UI. Here we can search for other files, classifications, schemas or tables and create lineage between them.
To create lineage between files/tables and other files/tables, you need to link the files or tables using their unique names or generated UUIDs. These values are passed in via the API or Kafka messaging payloads, and the resulting lineage shows the initial data and how it was transformed over its lifecycle in a flow diagram.
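In the v2 REST API this is done by creating a process entity (the built-in `Process` type, or a subtype of it) whose `inputs` and `outputs` reference the dataset entities by GUID; Atlas derives the lineage graph from those references. A sketch, with placeholder GUIDs:

```python
# Sketch: lineage via a Process entity (POST /api/atlas/v2/entity).
# Input/output GUIDs below are placeholders for real entity GUIDs.

def make_process_entity(name, qualified_name, input_guids, output_guids):
    """Build a Process entity linking input datasets to output datasets;
    Atlas renders the lineage graph from these object references."""
    return {
        "entity": {
            "typeName": "Process",
            "attributes": {
                "qualifiedName": qualified_name,
                "name": name,
                "inputs": [{"guid": g} for g in input_guids],
                "outputs": [{"guid": g} for g in output_guids],
            },
        }
    }

etl_step = make_process_entity(
    "clean_sales", "clean_sales@mycluster",
    input_guids=["<raw-file-guid>"], output_guids=["<clean-file-guid>"],
)
```

After posting this, the lineage tab of either dataset shows the flow raw file → process → cleaned file, and `GET /api/atlas/v2/lineage/<guid>` returns the same graph as JSON.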
Searching & Filtering
The UI also allows for searching and filtering.
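The same searches can be issued programmatically. A sketch of a basic-search body for `POST /api/atlas/v2/search/basic`, filtering by entity type and classification (the values are illustrative):

```python
# Sketch: basic-search payload for POST /api/atlas/v2/search/basic.
# Type and classification values are illustrative.

def make_basic_search(type_name=None, classification=None, query=None, limit=25):
    """Build a basic-search body; only the filters you pass are included."""
    body = {"excludeDeletedEntities": True, "limit": limit}
    if type_name:
        body["typeName"] = type_name
    if classification:
        body["classification"] = classification
    if query:
        body["query"] = query  # free-text search term
    return body

search_body = make_basic_search(type_name="hdfs_path", classification="PII")
```

Setting `excludeDeletedEntities` to `False` would also return soft-deleted entities, which ties in with the delete behaviour described below.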
Apache Atlas by default only performs soft deletes. If you delete files/tables, it will only mark them with a DELETED status and will not actually remove the metadata. This is a useful feature, as history should always be maintained when working with big data.
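A soft delete is issued with `DELETE /api/atlas/v2/entity/guid/<guid>`. The sketch below builds such a request with Python's standard library; the host, GUID and the `admin`/`admin` credentials are assumptions matching a default local setup.

```python
# Sketch: build a soft-delete request for the Atlas v2 REST API.
# Host, GUID and admin/admin credentials are illustrative defaults.
import base64
from urllib import request

def soft_delete_request(base_url, guid, user="admin", password="admin"):
    """Build a DELETE request for /api/atlas/v2/entity/guid/<guid>.
    By default Atlas only flips the entity's status to DELETED."""
    req = request.Request(
        f"{base_url}/api/atlas/v2/entity/guid/{guid}", method="DELETE"
    )
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    return req  # pass to urllib.request.urlopen(req) to execute it

req = soft_delete_request("http://localhost:21000", "sample-guid")
```

The deleted entity remains visible (greyed out) in the UI and in searches unless they exclude deleted entities.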
Setting up Atlas on your machine and playing around
- The best way to set up Atlas locally is to use the Docker image and steps listed here: https://github.com/sburn/docker-apache-atlas. This is a step-by-step guide on how to set up and run Atlas on your local machine. (You do need Docker set up and running on your machine.)
- The main website for Apache Atlas: https://atlas.apache.org/#/
- The technical user guide is about as good as you will get for using Atlas: https://atlas.apache.org/0.7.1-incubating/AtlasTechnicalUserGuide.pdf
This includes some JSON payloads for posting to and playing around with Atlas.