Azure Data Factory

The Azure Data Factory Collector is a robust tool designed to enhance data observability within your Azure Data Factory pipelines. By gathering metadata from Data Factory pipeline runs and Azure Data Lake Storage, it provides valuable insights into data lineage, schema changes, and performance metrics.

Purpose

The collector's primary purpose is to make Azure Data Factory data observable by generating data observations from Data Factory pipeline executions. It lets users trace data flow, understand dependencies, and check data quality and compliance. With this collector, you get a consolidated view of your data pipelines on which to base decisions.

How It Works

The Azure Data Factory Collector uses the Azure Data Factory Python SDK to interact with Azure resources.

It retrieves pipeline and activity run information directly from Azure Data Factory and extracts lineage and contextual information (pipeline name, Azure Data Factory project, environment, timestamp).
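The retrieval step described above can be sketched as follows. This is a minimal illustration, not the collector's actual code: it assumes a client shaped like `DataFactoryManagementClient` from the `azure-mgmt-datafactory` package, whose `pipeline_runs.query_by_factory` and `activity_runs.query_by_pipeline_run` operations return a response with a `value` list. The function name `collect_runs`, the summary layout, and the stand-in client are hypothetical and exist only so the sketch runs without Azure credentials.

```python
from types import SimpleNamespace


def collect_runs(client, resource_group, factory_name, filter_parameters):
    """Return one summary dict per pipeline run, with its activity runs attached."""
    response = client.pipeline_runs.query_by_factory(
        resource_group, factory_name, filter_parameters
    )
    summaries = []
    for run in response.value:
        # Each pipeline run is queried separately for its activity runs.
        activities = client.activity_runs.query_by_pipeline_run(
            resource_group, factory_name, run.run_id, filter_parameters
        )
        summaries.append({
            "pipeline_name": run.pipeline_name,
            "run_id": run.run_id,
            "status": run.status,
            "run_start": run.run_start,
            "run_end": run.run_end,
            "activities": [
                {"name": a.activity_name, "type": a.activity_type, "status": a.status}
                for a in activities.value
            ],
        })
    return summaries


# Hypothetical stand-in client that mimics the shape of the SDK responses.
fake_run = SimpleNamespace(
    pipeline_name="copy_orders", run_id="run-001", status="Succeeded",
    run_start=None, run_end=None,
)
fake_activity = SimpleNamespace(
    activity_name="CopyToLake", activity_type="Copy", status="Succeeded"
)
fake_client = SimpleNamespace(
    pipeline_runs=SimpleNamespace(
        query_by_factory=lambda rg, factory, params: SimpleNamespace(value=[fake_run])
    ),
    activity_runs=SimpleNamespace(
        query_by_pipeline_run=lambda rg, factory, run_id, params: SimpleNamespace(
            value=[fake_activity]
        )
    ),
)

summaries = collect_runs(fake_client, "my-rg", "my-factory", filter_parameters=None)
```

Against a real factory, the client would typically be built with a credential and a subscription ID, and `filter_parameters` would be a `RunFilterParameters` time window. The sketch does not follow continuation tokens, so it would only see the first page of results.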

This lets you see which pipelines produce specific outputs and which consume particular inputs. To enrich the observations, the collector also retrieves schema details and computes statistical metrics on the data sources (see Data Source Connections).
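To illustrate the producer and consumer relationships mentioned above, here is a small sketch that indexes datasets by the pipelines that write and read them. The record layout (`pipeline`, `inputs`, `outputs`) and the dataset names are assumptions made for the example; the collector's internal representation is not documented here.

```python
from collections import defaultdict


def build_lineage(activity_records):
    """Map each dataset to the pipelines that write it and the pipelines that read it."""
    producers = defaultdict(set)
    consumers = defaultdict(set)
    for record in activity_records:
        for dataset in record["inputs"]:
            consumers[dataset].add(record["pipeline"])
        for dataset in record["outputs"]:
            producers[dataset].add(record["pipeline"])
    return dict(producers), dict(consumers)


# Hypothetical, simplified records: one entry per pipeline execution.
records = [
    {"pipeline": "ingest_orders", "inputs": ["raw/orders.csv"], "outputs": ["staging.orders"]},
    {"pipeline": "build_report", "inputs": ["staging.orders"], "outputs": ["mart.daily_sales"]},
]

producers, consumers = build_lineage(records)
print(producers["staging.orders"])  # {'ingest_orders'}
print(consumers["staging.orders"])  # {'build_report'}
```

Keeping producers and consumers in separate maps makes both questions ("who writes this dataset?" and "who reads it?") a single lookup.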

Features

Data Collection

The Azure Data Factory Collector gathers a wealth of information from your Data Factory pipeline runs, activity runs, and tables used. This encompasses pipeline run specifics like start time, end time, and status, as well as the input and output datasets used by each pipeline. Furthermore, it captures schema information and performance metrics related to the datasets employed in your Data Factory pipelines.

Integration with Azure Data Factory

This collector integrates with Azure Data Factory, using the Azure Data Factory Python SDK to retrieve pipeline run metadata and other information about Azure resources.

With this integration in place, you can effortlessly fetch data observations for all your Azure Data Factory pipelines.

Circuit Breaker Feature

The Azure Data Factory Collector comes with an agent that integrates with the Circuit Breaker.

This lets you automatically stop data factory executions when an incident occurs.

With this feature, the collector constantly monitors the observations in Azure Data Factory to identify any data-related issues with pipeline executions. If a data issue is detected, the circuit breaker will halt the affected pipeline run, safeguarding the integrity of your data and preventing potential downstream issues.

To achieve this, the agent consists of an Azure webhook that triggers the Circuit Breaker.
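The decision at the heart of a circuit breaker can be sketched in a few lines. This is an illustrative assumption about the logic, not the agent's implementation: the function name `break_circuit`, the check names, and the `cancel_run` callback are all hypothetical. In a real setup the callback might wrap the `pipeline_runs.cancel` operation of the `azure-mgmt-datafactory` client.

```python
def break_circuit(run_id, checks, cancel_run):
    """Cancel the run if any check failed; return the names of the failed checks."""
    failed = [name for name, passed in checks.items() if not passed]
    if failed:
        # At least one data check failed, so stop the affected run.
        cancel_run(run_id)
    return failed


# Hypothetical usage: record cancelled run IDs in a list instead of calling Azure.
cancelled = []

failed_checks = break_circuit(
    "run-001",
    {"row_count_nonzero": True, "no_null_ids": False},
    cancelled.append,
)
healthy_checks = break_circuit(
    "run-002",
    {"row_count_nonzero": True, "no_null_ids": True},
    cancelled.append,
)
```

Passing the cancel action in as a callable keeps the decision logic testable without access to an Azure subscription.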

Workflow

[Figure: workflow diagram]


Configuration of the Azure Data Factory collector

The Azure Data Factory collector relies on the Azure SDK and direct connections to the data sources used in the Azure Data Factory pipelines.

To register an Azure Data Factory Connection, please follow these steps.