Preview: PySpark Remote Configuration
Using this method to initialize the Kensu (Py)Spark collector offers several advantages:
- Seamless Integration: There's no need to alter the customer's existing Spark job code, making the process straightforward and reducing the risk of errors. This simplicity results in minimal manual integration effort.
- Comprehensive Coverage: This approach can be applied to all Spark jobs, ensuring a complete view of the Spark job landscape. It addresses the challenge of potentially overlooking the activation of tracking for certain jobs, especially when managing numerous Spark jobs.
- Dynamic Configuration via Kensu UI: Any modifications to the Kensu agent can be achieved directly through the Integrations tab in the Kensu UI. This eliminates the need to redeploy the customer's Spark job every time there's a change in agent settings.
- Download and install the public preview jar. For Spark 2.4.0, the jar is available at https://public.usnek.com/n/repository/kensu-public/releases/kensu-spark-collector/alpha/kensu-spark-collector-1.4.0-alpha231018_1132_47-fffc395_spark-2.4.0.jar and is installed by following the PySpark instructions.
- Add the mandatory config options that enable remote configuration (e.g. via spark-submit conf properties or spark-defaults.conf); the rest of the configuration is fetched from the Kensu UI. These options are required to:
  - Load the Kensu Spark listeners
  - Configure the Kensu host & token
- Note: installing the kensu-pyspark / kensu-py Python library is optional when using this method.
To test a single spark-submit job without affecting others, you may provide the conf via --conf arguments:
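As a minimal sketch of such a per-job invocation, the Python snippet below assembles a spark-submit command line that carries the four mandatory properties as --conf arguments. The listener class names, host, token, and script name are placeholders (this page does not state the actual values), and the jar is assumed to be installed already.

```python
import shlex

# Placeholders: the listener class names, host and token are NOT taken from
# this page; substitute the values given in the Kensu installation instructions.
QUERY_LISTENER = "<kensu-query-execution-listener-class>"
SPARK_LISTENER = "<kensu-spark-listener-class>"


def build_spark_submit(script, api_host, token):
    """Return a spark-submit argv carrying the mandatory remote-config options."""
    conf = {
        "spark.sql.queryExecutionListeners": QUERY_LISTENER,
        "spark.extraListeners": SPARK_LISTENER,
        "spark.kensu.agentApiHost": api_host,
        "spark.kensu.agentToken": token,
    }
    argv = ["spark-submit"]
    for key, value in conf.items():
        # Each property becomes its own "--conf key=value" pair.
        argv += ["--conf", f"{key}={value}"]
    argv.append(script)
    return argv


cmd = build_spark_submit(
    script="my_job.py",                    # hypothetical job script
    api_host="https://kensu.example.com",  # hypothetical Kensu API host
    token="<your-PAT>",
)
# Print a shell-quoted command that can be pasted into a terminal.
print(shlex.join(cmd))
```

Because the options travel with this one command, other jobs on the same cluster keep their existing configuration.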
spark.sql.queryExecutionListeners and spark.extraListeners must be set via Spark configuration (either spark-defaults.conf or --conf parameters to spark-submit), while the other mandatory parameters can be passed via Spark configuration, an environment variable, or a Java system property:
Description | Property name | Env var |
---|---|---|
Enable Kensu query execution listener | spark.sql.queryExecutionListeners | - |
Enable Spark listener | spark.extraListeners | - |
Kensu API host | spark.kensu.agentApiHost | KSU_AGENT_API_HOST |
Kensu API External Application token (PAT) | spark.kensu.agentToken | KSU_AGENT_AGENT_TOKEN |

After running the job, go to the Kensu UI; your application should appear in the Integrations tab.
You may fine-tune the Kensu agent configuration for that application from there by clicking "Configure":
- Setting Application group & Token controls who sees the data ingested from this application
- Parameters configure the default Spark agent behaviour, e.g. whether to compute statistics and which ones
One way to add the config to ALL Spark jobs automatically is via spark-defaults.conf. Apache Spark loads this file automatically, so there is no need to modify the spark-submit command of each job.
E.g. in Cloudera VM I had to modify the /opt/cloudera/parcels/CDH-6.3.0-1.cdh6.3.0.p0.1279813/etc/spark/conf.dist/spark-defaults.conf file:
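The edit to that file could be scripted roughly as follows. This is only a sketch: the helper `add_spark_defaults` is hypothetical, the listener class names, host, and token are placeholders not given on this page, and the demo writes to a temporary file rather than the real Cloudera path.

```python
import re
import tempfile
from pathlib import Path


def add_spark_defaults(conf_path, properties):
    """Append properties missing from a spark-defaults.conf file; return the added lines."""
    path = Path(conf_path)
    lines = path.read_text().splitlines() if path.exists() else []
    existing = set()
    for line in lines:
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            # Keys are separated from their values by whitespace or "=".
            existing.add(re.split(r"[\s=]+", stripped, maxsplit=1)[0])
    added = [f"{key} {value}" for key, value in properties.items() if key not in existing]
    if added:
        path.write_text("\n".join(lines + added) + "\n")
    return added


# Placeholder values: replace with the real listener classes, host and token.
kensu_props = {
    "spark.sql.queryExecutionListeners": "<kensu-query-execution-listener-class>",
    "spark.extraListeners": "<kensu-spark-listener-class>",
    "spark.kensu.agentApiHost": "https://kensu.example.com",
    "spark.kensu.agentToken": "<your-PAT>",
}

# Demo against a temporary copy instead of the cluster-wide file.
demo_conf = Path(tempfile.mkdtemp()) / "spark-defaults.conf"
demo_conf.write_text("spark.master yarn\n")
first = add_spark_defaults(demo_conf, kensu_props)
second = add_spark_defaults(demo_conf, kensu_props)  # nothing left to add
print(demo_conf.read_text())
```

The helper skips keys that are already present, so a file that already sets spark.extraListeners or spark.sql.queryExecutionListeners for another tool would need those values merged by hand.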
The following settings are optional, but can be used to provide extra information:
Spark property | Environment variable | Default | Description |
---|---|---|---|
spark.kensu.application_id | KSU_APPLICATION_ID | PySpark: file:/full/path/script_name.py | Each application must have a UNIQUE and STABLE application id. If none is provided, one is inferred by parsing the spark-submit command and trying to extract the PySpark .py script name |
spark.kensu.process_name | KSU_PROCESS_NAME | PySpark: file:/full/path/script_name.py | Similar to application_id, but uniqueness matters less: it is only used to display the name and does not affect the logic |
spark.kensu.project_name | KSU_PROJECT_NAME | - |  |
spark.kensu.code_location | KSU_CODE_LOCATION | - |  |
spark.kensu.code_version | KSU_CODE_VERSION | current datetime, e.g.: Thu Jun 01 17:08:00 EEST 2023 | If not provided explicitly in the Kensu UI, the current datetime is used |
example (via environment variables):
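A possible way to supply the optional settings as environment variables is sketched below in Python. All values are illustrative placeholders, and the sketch assumes the Spark driver inherits the environment of the process that launches spark-submit (as in client deploy mode); other deploy modes may need the variables propagated to the driver by other means.

```python
import os
import shlex

# Illustrative placeholder values; only the variable names come from the table above.
kensu_env = {
    "KSU_APPLICATION_ID": "orders/daily-orders-job",
    "KSU_PROCESS_NAME": "Daily orders job",
    "KSU_PROJECT_NAME": "Orders",
    "KSU_CODE_LOCATION": "https://git.example.com/orders/jobs",
    "KSU_CODE_VERSION": "1.2.3",
}

# Environment for a child process, e.g. subprocess.run(["spark-submit", ...], env=env).
env = {**os.environ, **kensu_env}

# Equivalent shell lines, for pasting into a launcher script.
for name, value in kensu_env.items():
    print(f"export {name}={shlex.quote(value)}")
```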
example (via spark properties):
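The same optional settings expressed as Spark properties might look like the sketch below, which turns them into --conf arguments for spark-submit. The values and the script name are illustrative placeholders; only the property names come from the table above.

```python
import shlex

# Illustrative placeholder values for the optional Kensu properties.
optional_conf = {
    "spark.kensu.application_id": "orders/daily-orders-job",
    "spark.kensu.process_name": "Daily orders job",
    "spark.kensu.project_name": "Orders",
    "spark.kensu.code_location": "https://git.example.com/orders/jobs",
    "spark.kensu.code_version": "1.2.3",
}

# Flatten the mapping into "--conf key=value" pairs.
conf_args = []
for key, value in optional_conf.items():
    conf_args += ["--conf", f"{key}={value}"]

# "my_job.py" is a hypothetical job script.
print(shlex.join(["spark-submit", *conf_args, "my_job.py"]))
```

The same key/value pairs could instead be placed in spark-defaults.conf when they should apply to every job, although application_id should then stay out of the shared file, since it must be unique per application.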