Matillion
The Kensu collector for Matillion is a powerful tool designed to enhance data observability in your Matillion data pipelines. By collecting metadata from Matillion job runs and Snowflake tables, it provides valuable insights into data lineage, schema changes, and metrics.
The main purpose of the Kensu collector for Matillion is to generate data observations based on Matillion job runs. It enables users to track the flow of data, understand dependencies, and ensure data quality and compliance. With Kensu, you can gain a holistic view of your data pipelines and make informed decisions.
The Kensu collector works by fetching job run JSON data from the Matillion API. It then extracts lineage and contextual information (job name, Matillion project, environment, timestamp) from this data, allowing you to understand which jobs produce which outputs and consume which inputs. To enrich the observations, the collector also retrieves schema information and metrics from the associated Snowflake tables.
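As a rough illustration, the snippet below sketches how job run JSON could be pulled from the Matillion REST API; the host, credentials, group/project names and resource path are assumptions for the example, not the collector's actual implementation.

```python
# Minimal sketch: fetch job run (task) JSON from the Matillion REST API using
# basic authentication. Host, credentials, group/project names and the resource
# path are illustrative assumptions; consult the Matillion API documentation for
# the exact endpoints available in your version.
import requests

MATILLION_URL = "https://matillion.example.com"     # hypothetical instance
GROUP, PROJECT = "analytics", "sales_pipeline"      # hypothetical group/project

def fetch_task_run(task_id: int) -> dict:
    """Return the raw JSON document describing one Matillion task run."""
    url = (f"{MATILLION_URL}/rest/v1/group/name/{GROUP}"
           f"/project/name/{PROJECT}/task/id/{task_id}")
    resp = requests.get(url, auth=("api_user", "api_password"), timeout=30)
    resp.raise_for_status()
    return resp.json()
```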
The Kensu collector collects a wealth of information from your Matillion job runs and Snowflake tables. This includes job run details, such as start time, end time, and status, as well as input and output tables used by each job. Additionally, it captures schema information and metrics associated with the tables used in your Matillion jobs.
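For illustration only, the information gathered for a single job run could be organised along these lines; the field names below are assumptions for the sketch, not Kensu's internal model.

```python
# Hedged sketch of the kind of record assembled for one Matillion job run.
from dataclasses import dataclass, field

@dataclass
class JobRunObservation:
    job_name: str
    project: str
    environment: str
    started_at: str            # e.g. "2024-01-15T10:05:00Z"
    finished_at: str
    status: str                # e.g. "SUCCESS" or "FAILED"
    input_tables: list[str] = field(default_factory=list)   # e.g. ["DB.SCHEMA.ORDERS"]
    output_tables: list[str] = field(default_factory=list)  # e.g. ["DB.SCHEMA.ORDERS_CLEAN"]
    schemas: dict = field(default_factory=dict)             # table -> {column: data type}
    metrics: dict = field(default_factory=dict)             # table -> {metric: value}
```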
The collector seamlessly integrates with Matillion, leveraging the Matillion API to retrieve job run data. This integration requires a Matillion connection to be configured, enabling the collector to access the necessary information. With this connection, you can easily fetch data observations for all your Matillion jobs.
Ensure segregation: the collector can be configured to handle only a specific set of Matillion jobs, for which a different polling period can be defined (for example, to relax the collector) and for which dedicated Matillion connection details, or a connection other than the default Kensu connection, can be used. Each Matillion job can also have its own Kensu connection in the Hub, so jobs can be assigned to different Kensu Application Groups via their tokens.
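A hypothetical segregated setup, written here as a Python dict purely for readability (the real collector configuration format and key names may differ), could look like this:

```python
# Illustrative configuration for two segregated job sets. Each set gets its own
# polling period, Matillion connection, and Kensu ingestion token, which maps
# its jobs to a different Kensu Application Group. All names are assumptions.
JOB_SETS = {
    "finance_jobs": {
        "matillion_jobs": ["load_invoices", "load_payments"],
        "polling_period_seconds": 300,           # poll every 5 minutes
        "matillion_connection": "matillion_prod",
        "kensu_ingestion_token": "<finance-app-group-token>",
    },
    "marketing_jobs": {
        "matillion_jobs": ["load_campaigns"],
        "polling_period_seconds": 1800,          # relaxed: poll every 30 minutes
        "matillion_connection": "matillion_prod",
        "kensu_ingestion_token": "<marketing-app-group-token>",
    },
}
```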
To provide comprehensive insights, the Kensu collector also requires Snowflake connections. These connections are used to fetch schema information and metrics from the Snowflake tables associated with your Matillion jobs. By combining data from Matillion and Snowflake, you can gain a complete understanding of your data pipelines.
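As a sketch of this enrichment step, assuming the snowflake-connector-python package and illustrative connection parameters, schema and simple metrics could be fetched along these lines (the exact metrics the collector computes may differ):

```python
# Minimal sketch: column/type information from INFORMATION_SCHEMA plus simple
# per-table metrics. Connection parameters and metric names are assumptions.
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345", user="collector_user", password="***",
    warehouse="COMPUTE_WH", database="ANALYTICS",
)

def table_schema(schema: str, table: str) -> dict:
    cur = conn.cursor()
    cur.execute(
        "SELECT column_name, data_type FROM information_schema.columns "
        "WHERE table_schema = %s AND table_name = %s",
        (schema, table),
    )
    return {name: dtype for name, dtype in cur.fetchall()}

def table_metrics(schema: str, table: str, numeric_col: str) -> dict:
    cur = conn.cursor()
    cur.execute(
        f"SELECT COUNT(*), COUNT(*) - COUNT({numeric_col}), "
        f"MIN({numeric_col}), MAX({numeric_col}), AVG({numeric_col}) "
        f"FROM {schema}.{table}"
    )
    rows, nulls, mn, mx, avg = cur.fetchone()
    return {"nrows": rows, f"{numeric_col}.nullrows": nulls,
            f"{numeric_col}.min": mn, f"{numeric_col}.max": mx,
            f"{numeric_col}.mean": avg}
```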
The Kensu collector includes a powerful circuit breaker feature that adds an extra layer of control to your data processing. With this feature, the collector checks the observations in Kensu to identify any job-related data issues. If a data issue is detected, the circuit breaker will prevent the affected job from continuing its execution, safeguarding the integrity of your data.
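One way to picture the circuit breaker, as a non-authoritative sketch, is a script step inside the Matillion job that asks Kensu whether any data issue is open for the job and fails the run if so; the endpoint path, header, and response fields below are assumptions, not Kensu's actual API.

```python
# Hedged sketch of a circuit-breaker step run inside a Matillion job. A non-zero
# exit fails the script component, which stops the job from continuing.
import sys
import requests

KENSU_API = "https://kensu.example.com/api"     # hypothetical Kensu host
TOKEN = "<kensu-ingestion-token>"

def circuit_breaker(process_name: str) -> None:
    resp = requests.get(
        f"{KENSU_API}/circuit-breaker/{process_name}",   # illustrative endpoint
        headers={"X-Auth-Token": TOKEN},
        timeout=30,
    )
    resp.raise_for_status()
    if resp.json().get("open_issues", 0) > 0:
        sys.exit(f"Circuit breaker tripped for {process_name}: open data issues")

circuit_breaker("sales_pipeline/load_invoices")
```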
The Kensu collector for Matillion comes with a few limitations that you should be aware of:
- The SQL parser used by the collector, sqloxide, may not fully parse complex SQL statements, which could result in incomplete lineage information. Please verify the completeness of the observed lineage in such cases (a quick parsing check is sketched after this list).
- Unsupported cases produce empty lineages, which are not sent to Kensu Core.
- The Kensu collector for Matillion has been tested with Snowflake data sources only, with jobs containing the following activities:
  - Table Input
  - Calculator
  - Rename
  - Rewrite Table
  - Table Update
  - Distinct
  - Join
  - Create View
  - Extract Nested Data
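To anticipate the SQL parsing limitation, you can check locally whether sqloxide parses a given statement; it raises a ValueError on SQL it cannot handle. The statement below is only an example and may or may not parse depending on the sqloxide version.

```python
# Quick check of whether sqloxide can fully parse a statement used by a job.
# If parsing fails, lineage for that statement cannot be derived automatically.
from sqloxide import parse_sql

sql = """
SELECT o.order_id, f.value:sku::string AS sku
FROM orders o, LATERAL FLATTEN(input => o.items) f
"""

try:
    ast = parse_sql(sql=sql, dialect="snowflake")
    print("parsed", len(ast), "statement(s)")
except ValueError as err:
    print("sqloxide could not parse this statement:", err)
```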
The Kensu collector for Matillion follows a series of steps to collect, process, and aggregate the necessary data for observability and governance. Here's an overview of the inner workings of the collector:
- Data Retrieval: The collector retrieves the latest job run data from the Matillion API at regular intervals, typically every 5 minutes. This ensures that the observations are up to date and capture the most recent changes in your Matillion data pipelines.
- JSON Processing: Once the job run data is obtained, the collector processes the JSON response, extracting relevant information such as data sources, lineage details, and job information. This step involves parsing and structuring the JSON data for further analysis.
- Snowflake Querying: For each data source identified in the job run data, the collector queries the corresponding Snowflake table to retrieve additional metadata. This includes the schema information, such as column names and data types, as well as a set of metrics. The metrics collected may include statistics like missing values, number of rows, and distributions for numerical columns. By fetching this additional information from Snowflake, the collector enriches the observations with valuable insights about the underlying data.
- Aggregation and Sending: After retrieving the necessary data from both the Matillion API and Snowflake, the collector aggregates and processes the collected information. It organizes the observations, aligning them with the appropriate job runs, data sources, and associated metadata. Finally, the collector sends the aggregated data to Kensu Core, where it can be utilized for comprehensive data observability, governance, and analysis.
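Putting the four steps together, a deliberately simplified, self-contained sketch of the loop might look like this; every function body is a placeholder standing in for the behaviour described above, and all names are assumptions rather than the collector's real API.

```python
# Highly simplified sketch of the collection loop: poll, parse, enrich, send.
import time

POLL_SECONDS = 300  # poll the Matillion API every 5 minutes

def fetch_recent_runs() -> list[dict]:
    return []          # placeholder: GET the latest job runs from the Matillion API

def extract_observation(run: dict) -> dict:
    # placeholder: parse the job-run JSON into lineage and context fields
    return {"job": run.get("jobName"), "inputs": [], "outputs": []}

def enrich_from_snowflake(obs: dict) -> dict:
    obs["schemas"], obs["metrics"] = {}, {}   # placeholder: schema + metrics queries
    return obs

def send_to_kensu_core(obs: dict) -> None:
    pass               # placeholder: ship the aggregated observation to Kensu Core

while True:
    for run in fetch_recent_runs():
        send_to_kensu_core(enrich_from_snowflake(extract_observation(run)))
    time.sleep(POLL_SECONDS)
```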