Find the root cause with Kensu
In this section, we run the pipeline again now that the rule exists. This run creates a ticket for the Min-Max rule defined in Create a monitoring rule in Kensu.
A ticket is like a notification or work item. The Kensu platform creates one when an observed value falls outside a rule's thresholds.
Tickets draw attention to data events worth reviewing because they may indicate issues.
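To make the idea of a rule producing a ticket concrete, here is a minimal sketch of a min-max check. The MinMaxRule class, the evaluate function, and the bounds are hypothetical names and values chosen for illustration. They are not Kensu's API or data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MinMaxRule:
    """Hypothetical stand-in for a Min-Max rule; not Kensu's data model."""
    metric: str
    lower: float
    upper: float


def evaluate(rule: MinMaxRule, observed: float) -> Optional[dict]:
    """Return a ticket-like dict when the observation is out of bounds, else None."""
    if rule.lower <= observed <= rule.upper:
        return None
    return {
        "metric": rule.metric,
        "observed": observed,
        "bounds": (rule.lower, rule.upper),
    }


# Illustrative bounds and values only; they are not taken from the tutorial data.
rule = MinMaxRule(metric="Intraday_Delta.std", lower=0.0, upper=1.0)
in_bounds = evaluate(rule, 0.4)      # inside the range: no ticket
out_of_bounds = evaluate(rule, 2.5)  # outside the range: ticket-like dict
```

In this sketch, a ticket is simply the record of which metric was observed and which bounds it broke.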
Detect a data issue
1️⃣ Run the following command, this time using data from December 2021.
2️⃣ Look at the Home page. The ticket count is now 1. Click Tickets to review it.
Analyze the issue
1️⃣ On the Tickets page:
- To see the reason for the ticket, click the + icon.
- Click the Data Source name report_buzzfeed.csv.
2️⃣ Display the Min/Max rule by clicking the chart icon.
The chart displays Intraday_Delta.std.
The red exclamation mark flags the observation that violated the Min-Max rule.
The table below the chart shows the observation that is out of bounds.
Find the root cause
Explore the statistics of the data source
Now, investigate the drop in quality of the data set.
To do so, we drill into values for Intraday_Delta.
1️⃣ Click +Select Attributes.
2️⃣ Click on the checkbox Intraday_Delta. Click OK.
Something looks wrong here. You would expect the data set to have around 20 rows, one per business day in the month, yet the last run has only 3 rows.
If you use PySpark, the count statistic is named nrows.
The row count is a useful metric to follow. To avoid similar issues in the future, you could add a rule on the count to ensure it always stays around 20.
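As a rough illustration of such a rule, the sketch below checks whether a row count stays close to the expected 20. The function name row_count_ok and the 25% tolerance are assumptions made for this example. They are not Kensu settings.

```python
def row_count_ok(nrows: int, expected: int = 20, tolerance: float = 0.25) -> bool:
    """True when nrows is within +/- tolerance (a fraction) of the expected count."""
    return abs(nrows - expected) <= expected * tolerance


# A month with about 20 business days passes; the 3-row run does not.
typical_month = row_count_ok(21)
last_run = row_count_ok(3)
```

With these assumed defaults, any count between 15 and 25 passes, so normal month-to-month variation in business days would not raise a ticket.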
Explore the lineage to find the origin of the issue
Now that we have found an issue, we look for its root cause.
To support this, Kensu collects the technical data lineage. This lets you browse the data sources backward, from the faulty data source to the origins of the pipeline.
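Browsing lineage backward amounts to walking a graph from the faulty data source toward its inputs. The sketch below shows that walk on a small hand-written graph. Only the link from report_buzzfeed.csv to monthly_assets.csv comes from this tutorial. The raw_feed.csv node, the UPSTREAM dictionary, and the upstream_of function are hypothetical and do not reflect how Kensu stores lineage.

```python
from collections import deque

# Edges point from a data source to the inputs it was created from.
# "raw_feed.csv" is a hypothetical origin added purely for illustration.
UPSTREAM = {
    "report_buzzfeed.csv": ["monthly_assets.csv"],
    "monthly_assets.csv": ["raw_feed.csv"],
    "raw_feed.csv": [],
}


def upstream_of(source: str, graph: dict) -> list:
    """Breadth-first walk from a data source back to its origins."""
    seen = set()
    order = []
    queue = deque(graph.get(source, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already visited through another path
        seen.add(node)
        order.append(node)
        queue.extend(graph.get(node, []))
    return order


ancestors = upstream_of("report_buzzfeed.csv", UPSTREAM)
```

The nearest inputs come first in the result, which matches the order you would inspect them in when looking for a root cause.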
1️⃣ In the Creation Of panel, you see the data sources that were used to create the report_buzzfeed.csv file. They represent its upstream lineage. Click the bar next to monthly_assets.csv, toward the bottom, to select that upstream data source node.
2️⃣ Click on View Data Source Details.
3️⃣ You see the monthly_assets.csv data source page, where you can explore its statistics.
4️⃣ Now, go back up the page to the Statistics area. Click +Select Attribute.
In the table of observations, you can see both runs of the pipeline. They are sorted by execution time (Timestamp), most recent first. Observe that the number of unique Symbols, num_categories, has increased by one.
Looking at other columns, notice:
- The ENFA stock symbol shows 3 rows in the latest run, compared with 21 in the first execution.
- A new stock symbol, BZFD, appears.
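The comparison described above can be sketched as a per-symbol row count diff between two runs. The row counts (21, 3, and 19) are the ones reported in this section. The symbol_changes function is a hypothetical helper, and the other symbols in monthly_assets.csv are left out for brevity.

```python
from collections import Counter


def symbol_changes(previous: list, current: list) -> dict:
    """Map each symbol whose row count changed to (count before, count after)."""
    before, after = Counter(previous), Counter(current)
    return {
        symbol: (before.get(symbol, 0), after.get(symbol, 0))
        for symbol in sorted(set(before) | set(after))
        if before.get(symbol, 0) != after.get(symbol, 0)
    }


# Symbol column of each run, reduced to the two symbols discussed here.
first_run = ["ENFA"] * 21
second_run = ["ENFA"] * 3 + ["BZFD"] * 19

changes = symbol_changes(first_run, second_run)
```

A symbol that goes from 0 rows to some rows is new, and one that shrinks sharply in the same run is a hint that the two may be related.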
Based on these observations, you have discovered a change of ticker symbol.
On the 6th of December, the ENFA ticker changed its name to BZFD. Therefore, we have 3 records for ENFA and the remaining 19 days under BZFD.
Because the report now covers only the 3 ENFA rows, the standard deviation changed dramatically compared to last month, which triggered the Min-Max rule.
To prevent future issues, you could add a rule on the number of categories.
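A rule on the number of categories could be as simple as comparing the value between two runs. The sketch below assumes a hypothetical categories_stable function and made-up counts of 5 and 6. The tutorial only states that num_categories increased by one, not its actual value.

```python
def categories_stable(previous: int, current: int, max_change: int = 0) -> bool:
    """True when the number of distinct categories moved by at most max_change."""
    return abs(current - previous) <= max_change


# Illustrative counts only: an increase of one, as observed in this section.
unchanged = categories_stable(previous=5, current=5)
one_more = categories_stable(previous=5, current=6)
```

With max_change left at 0, any new or missing symbol would be flagged, which would have surfaced the ticker change directly at the monthly_assets.csv data source.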
Next, go to Create a monitoring rule programmatically.