
Preview: PySpark Remote Configuration


Initializing the Kensu (Py)Spark Collector

Using this method to initialize the Kensu (Py)Spark collector offers several advantages:

  • Seamless Integration: There's no need to alter the customer's existing Spark job code, making the process straightforward and reducing the risk of errors. This simplicity results in minimal manual integration effort.
  • Comprehensive Coverage: This approach can be applied to all Spark jobs, ensuring a complete view of the Spark job landscape. It addresses the challenge of potentially overlooking the activation of tracking for certain jobs, especially when managing numerous Spark jobs.
  • Dynamic Configuration via Kensu UI: Any modifications to the Kensu agent can be achieved directly through the Integrations tab in the Kensu UI. This eliminates the need to redeploy the customer's Spark job every time there's a change in agent settings.

Installation and configuration

  1. Download and install the public preview jar. The public preview jar for Spark 2.4.0 is available here: https://public.usnek.com/n/repository/kensu-public/releases/kensu-spark-collector/alpha/kensu-spark-collector-1.4.0-alpha231018_1132_47-fffc395_spark-2.4.0.jar Please follow the PySpark instructions for the installation.
  2. Add the mandatory config options to enable remote configuration (e.g. via spark-submit conf properties, spark-defaults.conf, etc.); the rest of the configuration is fetched from the Kensu UI. These options are required to:
    1. Load the Kensu Spark listeners
    2. Configure the Kensu host & token

Note that installing the kensu-pyspark / kensu-py Python library is not required when using this method.

Enabling a single job for testing

To test a single spark-submit job without affecting others, you may provide the configuration via --conf arguments, for example (in the sketch below the listener class names, host, and token are placeholders; use the classes shipped with the collector jar and the values from your Kensu installation):

Shell
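# Sketch of a one-off test run. The listener class names, host, token, paths and
# script name are placeholders: substitute the classes shipped with the collector
# jar you downloaded and the values from your Kensu installation.
spark-submit \
  --jars /path/to/kensu-spark-collector-1.4.0-alpha231018_1132_47-fffc395_spark-2.4.0.jar \
  --conf "spark.sql.queryExecutionListeners=<Kensu query execution listener class>" \
  --conf "spark.extraListeners=<Kensu Spark listener class>" \
  --conf "spark.kensu.agentApiHost=https://<your-kensu-host>" \
  --conf "spark.kensu.agentToken=<your PAT>" \
  your_job.py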


Mandatory config options

spark.sql.queryExecutionListeners and spark.extraListeners must be set via Spark conf (either spark-defaults.conf or --conf parameters to spark-submit), while the remaining mandatory parameters can be passed via Spark conf, an environment variable, or a Java system property.

| Description | Property name | Env var |
|---|---|---|
| Enable the Kensu query execution listener | spark.sql.queryExecutionListeners | - |
| Enable the Kensu Spark listener | spark.extraListeners | - |
| Kensu API host | spark.kensu.agentApiHost | KSU_AGENT_API_HOST |
| Kensu API External Application token (PAT) | spark.kensu.agentToken | KSU_AGENT_AGENT_TOKEN |


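As noted above, the host and token can also be supplied as Java system properties instead of Spark conf entries. A minimal sketch, assuming they are passed to the driver JVM via --driver-java-options (the listener classes, host, token, and script name are placeholders):

Shell

spark-submit \
  --conf "spark.sql.queryExecutionListeners=<Kensu query execution listener class>" \
  --conf "spark.extraListeners=<Kensu Spark listener class>" \
  --driver-java-options "-Dspark.kensu.agentApiHost=https://<your-kensu-host> -Dspark.kensu.agentToken=<your PAT>" \
  your_job.py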

Verify installation

After running the job, go to the Kensu UI; you should see your application in the Integrations tab:

[Screenshot: the application listed in the Integrations tab]


You may fine-tune the Kensu agent configuration for that application from there by clicking "Configure":

  • Setting the Application group & Token lets you control who sees the data ingested from this application
  • Parameters configure the default Spark agent behaviour, e.g. whether to compute statistics and which ones
[Screenshot: the Kensu agent configuration panel opened via "Configure"]


Sharing config for all jobs

One way to automatically add the configuration to ALL Spark jobs is via spark-defaults.conf: this file is loaded automatically by Apache Spark, so there is no need to modify the spark-submit command for each job.

E.g. in the Cloudera VM, the /opt/cloudera/parcels/CDH-6.3.0-1.cdh6.3.0.p0.1279813/etc/spark/conf.dist/spark-defaults.conf file had to be modified along these lines (the listener class names, host, and token below are placeholders; use the classes shipped with the collector jar and your own values):

spark-defaults.conf
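# Sketch of the entries to add; listener class names, host, and token are placeholders.
spark.sql.queryExecutionListeners   <Kensu query execution listener class>
spark.extraListeners                <Kensu Spark listener class>
spark.kensu.agentApiHost            https://<your-kensu-host>
spark.kensu.agentToken              <your PAT>
# The collector jar must also be available to every job (see the installation step above).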


Optional config properties

These properties are optional, but can be used to provide extra information:

| Spark property | Environment variable | Default | Description |
|---|---|---|---|
| spark.kensu.application_id | KSU_APPLICATION_ID | PySpark: file:/full/path/script_name.py | Each application must have a UNIQUE and STABLE application id. If not provided, one is inferred by parsing the spark-submit command and extracting the PySpark .py script name. |
| spark.kensu.process_name | KSU_PROCESS_NAME | PySpark: file:/full/path/script_name.py | Similar to application_id, but it is less important for it to be unique: it is only used as the display name and does not affect the logic. |
| spark.kensu.project_name | KSU_PROJECT_NAME | - | - |
| spark.kensu.code_location | KSU_CODE_LOCATION | - | - |
| spark.kensu.code_version | KSU_CODE_VERSION | current datetime, e.g. Thu Jun 01 17:08:00 EEST 2023 | If not provided explicitly in the Kensu UI, the current datetime is used. |



Example (setting the optional properties via environment variables; the values below are only illustrative):

Shell
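# Illustrative values only; pick identifiers that are meaningful for your job.
# Exported in the shell that launches spark-submit, so the driver sees them.
export KSU_APPLICATION_ID="daily_sales_aggregation"
export KSU_PROCESS_NAME="Daily sales aggregation"
export KSU_PROJECT_NAME="Sales"
export KSU_CODE_LOCATION="https://git.example.com/acme/sales-jobs"
export KSU_CODE_VERSION="v1.2.3"

# Then run spark-submit with the mandatory options described above.
spark-submit ... your_job.py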


Example (setting the optional properties via spark properties; the values below are only illustrative):

Shell
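# Illustrative values only; combine with the mandatory options described above.
spark-submit \
  --conf "spark.kensu.application_id=daily_sales_aggregation" \
  --conf "spark.kensu.process_name=Daily sales aggregation" \
  --conf "spark.kensu.project_name=Sales" \
  --conf "spark.kensu.code_location=https://git.example.com/acme/sales-jobs" \
  --conf "spark.kensu.code_version=v1.2.3" \
  your_job.py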







