The integration between the Databricks and Kensu platforms brings mutual customers enhanced data observability and metadata automation, complementing the capabilities of Databricks Unity Catalog.
This integration lets data teams deploying Databricks jobs automatically harvest metadata, lineage (traces), and data metrics while the deployed Spark jobs execute, supporting data quality, performance, and compliance. Key capabilities include:
- Automated metadata harvesting and lineage
- Comprehensive data source support
- Automatic computation of data metrics
- Support for batch and streaming jobs
- Runtime discrepancy detection and recommendations
- Streamlined root cause analysis
- Native circuit-breaking support
Installing the integration takes only the following few steps:
1 - Log in to your Kensu instance
2 - On the sidebar, navigate to Collectors > Configure a connection
3 - Click on the Databricks Logo
4 - Provide your Databricks information requested in the previous section
5 - Then click on Connect
6 - The Kensu interface will list all clusters of your Databricks account, so you can select the cluster you want Kensu to observe the data usages. Click on Configure to validate your choice.
7 - You can also set Kensi on all new clusters that will be created in your Databricks account, to ensure all data usage are observed by Kensu. For this, switch the button Enable Kensu on every new cluster by default, then click on Configure.
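If you prefer to identify the cluster programmatically before step 6, a minimal sketch using the standard Databricks REST API is shown below; the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables are placeholders for your own workspace URL and personal access token.

```python
# Illustrative sketch: lists the clusters in your workspace so you can find
# the ID and name of the cluster you want Kensu to observe (step 6).
# Uses the standard Databricks REST API; host and token values are placeholders.
import os
import requests

host = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]  # a Databricks personal access token

resp = requests.get(
    f"{host}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {token}"},
)
resp.raise_for_status()

for cluster in resp.json().get("clusters", []):
    print(cluster["cluster_id"], cluster["cluster_name"], cluster["state"])
```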
If you selected an existing cluster, you need to restart it for the Kensu integration to be enabled. On restart, the appropriate Kensu JAR (agent) is attached to the cluster and the kensu Python module is installed.
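Once the cluster has restarted, you can optionally verify from a notebook cell that the kensu Python module was installed; a minimal sanity check:

```python
# Minimal sanity check, run in a notebook cell on the restarted cluster:
# verifies that the kensu Python module installed by the integration is importable.
import importlib.util

if importlib.util.find_spec("kensu") is not None:
    print("kensu module is available on this cluster")
else:
    print("kensu module not found; check that the cluster was restarted after configuration")
```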
Running your notebooks on these clusters automatically tracks the usage and metrics of the data involved in those notebooks.
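For example, a typical notebook such as the following sketch needs no Kensu-specific code; the agent attached to the cluster observes its reads, transformation, and write (the table names and paths here are placeholder examples, not part of the integration):

```python
# A typical notebook: the Kensu agent attached to the cluster observes the
# two reads, the join/aggregation, and the write below, and reports their
# lineage and data metrics. Paths and column names are placeholders.
from pyspark.sql import functions as F

# `spark` is the SparkSession predefined in Databricks notebooks.
orders = spark.read.parquet("/mnt/raw/orders")          # observed input
customers = spark.read.parquet("/mnt/raw/customers")    # observed input

daily_revenue = (
    orders.join(customers, "customer_id")
    .groupBy("order_date", "country")
    .agg(F.sum("amount").alias("revenue"))
)

daily_revenue.write.mode("overwrite").parquet("/mnt/gold/daily_revenue")  # observed output
```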
Kensu will notify you of any unexpected behavior in your data.
Like most Databricks customers, you have probably started using Unity Catalog.
In that case, the agent Kensu installs on the cluster needs a dedicated hint to ensure that metrics are computed for the data sources involved in your Spark jobs.
This is done by adding a single cell at the end of your notebooks.
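The exact contents of this cell are provided in the Kensu documentation for your agent version; the snippet below is only an illustrative sketch, in which the import path and the helper name report_unity_metrics are hypothetical placeholders, not the actual Kensu API.

```python
# Illustrative sketch only: the import path and helper name below are
# hypothetical placeholders for the hint cell described above. Use the exact
# cell given in the Kensu documentation for your agent version.
from kensu.pyspark import report_unity_metrics  # hypothetical import

# Hint the Kensu agent to compute metrics for the Unity Catalog data sources
# used by this notebook's Spark jobs.
report_unity_metrics(spark)  # hypothetical call
```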