
package sparkmeasure


Type Members

  1. class FlightRecorderStageMetrics extends StageInfoRecorderListener

    FlightRecorderStageMetrics - uses the Spark listeners defined in stagemetrics.scala to record task metrics data aggregated at the stage level, without changing the application code. The resulting data can be saved to a file and/or printed to stdout.

    Use: add the following to the spark-submit (or Spark Session) configuration: --conf spark.extraListeners=ch.cern.sparkmeasure.FlightRecorderStageMetrics

    Additional configuration parameters:
    --conf spark.sparkmeasure.outputFormat=<format>, valid values: java, json, json_to_hadoop; default "json". Note: the json and java serialization formats write to the driver's local filesystem; json_to_hadoop writes JSON-serialized metrics to HDFS or to a Hadoop-compatible filesystem, such as s3a.
    --conf spark.sparkmeasure.outputFilename=<output file>, default "/tmp/stageMetrics_flightRecorder"
    --conf spark.sparkmeasure.printToStdout=<true|false>, default false. Set to true to print JSON-serialized metrics to stdout.
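    The same flight-recorder settings can also be supplied programmatically when the Spark Session is created, instead of on the spark-submit command line. A minimal sketch using the configuration keys listed above (the application name is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: attach the flight-recorder listener and set its output options
// at session creation time; the application code itself is unchanged.
val spark = SparkSession.builder()
  .appName("MyApp")  // placeholder name
  .config("spark.extraListeners", "ch.cern.sparkmeasure.FlightRecorderStageMetrics")
  .config("spark.sparkmeasure.outputFormat", "json")  // java | json | json_to_hadoop
  .config("spark.sparkmeasure.outputFilename", "/tmp/stageMetrics_flightRecorder")
  .config("spark.sparkmeasure.printToStdout", "false")
  .getOrCreate()
```

    The same pattern applies to FlightRecorderTaskMetrics, with the corresponding listener class and output filename.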

  2. class FlightRecorderTaskMetrics extends TaskInfoRecorderListener

    FlightRecorderTaskMetrics - uses a Spark listener to record task metrics data and save them to a file.

    Use: add the following to the spark-submit (or Spark Session) configuration: --conf spark.extraListeners=ch.cern.sparkmeasure.FlightRecorderTaskMetrics

    Additional configuration parameters:
    --conf spark.sparkmeasure.outputFormat=<format>, valid values: java, json, json_to_hadoop; default "json". Note: the json and java serialization formats write to the driver's local filesystem; json_to_hadoop writes JSON-serialized metrics to HDFS or to a Hadoop-compatible filesystem, such as s3a.
    --conf spark.sparkmeasure.outputFilename=<output file>, default "/tmp/taskMetrics_flightRecorder"
    --conf spark.sparkmeasure.printToStdout=<true|false>, default false. Set to true to print JSON-serialized metrics to stdout.

  3. class InfluxDBSink extends SparkListener

    InfluxDBSink: writes Spark metrics and application info in near real time to InfluxDB v1.x. Use this mode to monitor the Spark execution workload, for example with Grafana dashboards and analytics of job execution. Note: this is for InfluxDB v1.x.

    How to use: attach the InfluxDBSink to a Spark Context using the extra listener infrastructure. Example: --conf spark.extraListeners=ch.cern.sparkmeasure.InfluxDBSink

    Configuration for InfluxDBSink is handled with Spark conf parameters:

    spark.sparkmeasure.influxdbURL (default "http://localhost:8086")
    spark.sparkmeasure.influxdbUsername (default ""; can be empty if InfluxDB is configured with no authentication)
    spark.sparkmeasure.influxdbPassword (default "")
    spark.sparkmeasure.influxdbName (default "sparkmeasure")
    spark.sparkmeasure.influxdbStagemetrics (boolean, default false)
    spark.sparkmeasure.influxdbEnableBatch (boolean, default true). Note: batching improves write performance, but it requires explicitly stopping the Spark Session for a clean exit: spark.stop(). Consider setting it to false if this is an issue.

    This code depends on influxdb-java; you may need to add the dependency: --packages org.influxdb:influxdb-java:2.14. Note: currently version 2.14 is needed, as newer versions generate jar conflicts (tested up to Spark 3.3.0).

    InfluxDBSinkExtended provides additional, verbose info on task execution. Use: --conf spark.extraListeners=ch.cern.sparkmeasure.InfluxDBSinkExtended

    With InfluxDBSink the amount of data generated is relatively small in most applications: O(number_of_stages). InfluxDBSinkExtended can generate a large amount of data, O(number_of_tasks); use with care.
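    The conf parameters above can also be set when building the Spark Session. A minimal sketch (endpoint and database values are placeholders; requires the influxdb-java dependency noted above):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: enable InfluxDBSink at session creation time.
// Run with: --packages org.influxdb:influxdb-java:2.14
val spark = SparkSession.builder()
  .appName("MyApp")  // placeholder name
  .config("spark.extraListeners", "ch.cern.sparkmeasure.InfluxDBSink")
  .config("spark.sparkmeasure.influxdbURL", "http://localhost:8086")
  .config("spark.sparkmeasure.influxdbName", "sparkmeasure")
  .config("spark.sparkmeasure.influxdbStagemetrics", "true")
  .getOrCreate()

// With influxdbEnableBatch left at its default (true), stop the
// session explicitly for a clean exit:
// spark.stop()
```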

  4. class InfluxDBSinkExtended extends InfluxDBSink

    InfluxDBSinkExtended extends the basic InfluxDBSink functionality with a verbose dump of task metrics and task info into InfluxDB. Note: this can generate a large amount of data, O(number_of_tasks). For configuration parameters and usage, see InfluxDBSink.

  5. class KafkaSink extends SparkListener

    KafkaSink: writes Spark metrics and application info in near real time to a Kafka stream. Use this mode to monitor the Spark execution workload, for example with Grafana dashboards and analytics of job execution.

    How to use: attach the KafkaSink to a Spark Context using the extra listener infrastructure. Example: --conf spark.extraListeners=ch.cern.sparkmeasure.KafkaSink

    Configuration for KafkaSink is handled with Spark conf parameters:

    spark.sparkmeasure.kafkaBroker = Kafka broker endpoint URL. Example: --conf spark.sparkmeasure.kafkaBroker=kafka.your-site.com:9092
    spark.sparkmeasure.kafkaTopic = Kafka topic. Example: --conf spark.sparkmeasure.kafkaTopic=sparkmeasure-stageinfo

    This code depends on kafka-clients; you may need to add the dependency: --packages org.apache.kafka:kafka-clients:3.2.1

    Output: each message contains the metric name and its value. Note: the amount of data generated is relatively small in most applications: O(number_of_stages)
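    As with the other sinks, the parameters above can be set when building the Spark Session. A minimal sketch (broker and topic values are placeholders; requires the kafka-clients dependency noted above):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: enable KafkaSink at session creation time.
// Run with: --packages org.apache.kafka:kafka-clients:3.2.1
val spark = SparkSession.builder()
  .appName("MyApp")  // placeholder name
  .config("spark.extraListeners", "ch.cern.sparkmeasure.KafkaSink")
  .config("spark.sparkmeasure.kafkaBroker", "kafka.your-site.com:9092")
  .config("spark.sparkmeasure.kafkaTopic", "sparkmeasure-stageinfo")
  .getOrCreate()
```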

  6. class KafkaSinkExtended extends KafkaSink

    KafkaSinkExtended extends the basic KafkaSink functionality with a verbose dump of task metrics. Note: this can generate a large amount of data, O(number_of_tasks). For configuration parameters and usage, see KafkaSink.

  7. case class PushGateway(config: PushgatewayConfig) extends Product with Serializable

    config: case class with all required configuration parameters for the Push Gateway (see PushgatewayConfig)

  8. class PushGatewaySink extends SparkListener

    PushGatewaySink: writes Spark metrics and application info in near real time to a Prometheus Push Gateway. Use this mode to monitor the Spark execution workload, for example with Grafana dashboards and analytics of job execution. Limitation: only metrics with numeric values are reported to the Push Gateway.

    How to use: attach the PushGatewaySink to a Spark Context using the extra listener infrastructure. Example: --conf spark.extraListeners=ch.cern.sparkmeasure.PushGatewaySink

    Configuration for PushGatewaySink is handled with Spark conf parameters:
    spark.sparkmeasure.pushgateway = SERVER:PORT // Prometheus Push Gateway URL
    spark.sparkmeasure.pushgateway.jobname // value for the job label, default "pushgateway"
    Example: --conf spark.sparkmeasure.pushgateway=localhost:9091

    Output: each message contains the metric name and value; only numeric values are used. Note: the amount of data generated is relatively small in most applications: O(number_of_stages)
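    The same parameters can be set when building the Spark Session. A minimal sketch (the gateway endpoint and job name are placeholders):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: enable PushGatewaySink at session creation time.
val spark = SparkSession.builder()
  .appName("MyApp")  // placeholder name
  .config("spark.extraListeners", "ch.cern.sparkmeasure.PushGatewaySink")
  .config("spark.sparkmeasure.pushgateway", "localhost:9091")
  .config("spark.sparkmeasure.pushgateway.jobname", "myjob")
  .getOrCreate()
```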

  9. case class PushgatewayConfig(serverIPnPort: String, jobName: String, connectionTimeoutMs: Int = 5000, readTimeoutMs: Int = 5000) extends Product with Serializable

    serverIPnPort: String with the Prometheus Push Gateway hostIP:Port

    jobName: the name of the Spark job

    connectionTimeoutMs: connection timeout for the HTTP client, default 5000 ms

    readTimeoutMs: read timeout for the HTTP client, default 5000 ms

  10. class StageInfoRecorderListener extends SparkListener

    StageInfoRecorderListener: this listener gathers metrics at stage execution granularity. It is based on the Spark Listener interface. Stage metrics are stored in memory and used to produce a report that aggregates resource consumption; they can also be consumed "raw" (transformed into a DataFrame and/or saved to a file). See StageMetrics.

  11. case class StageMetrics(sparkSession: SparkSession) extends Product with Serializable

    StageMetrics: collects stage-level metrics and provides aggregation and reporting functions for the end user.

    Example:
    val stageMetrics = ch.cern.sparkmeasure.StageMetrics(spark)
    stageMetrics.runAndMeasure(spark.sql("select count(*) from range(1000) cross join range(1000) cross join range(1000)").show)

    The tool uses Spark listeners as the data source and collects metrics into a ListBuffer of a case class that encapsulates Spark task metrics. The ListBuffer may optionally be transformed into a DataFrame for ease of reporting and analysis.

    Stage metrics are stored in memory and used to produce a report that shows aggregated resource consumption over the measured period.
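    Besides runAndMeasure, StageMetrics supports explicit instrumentation around a code block. A sketch assuming the begin/end/printReport methods described in the sparkMeasure documentation (a SparkSession `spark` is assumed):

```scala
import ch.cern.sparkmeasure.StageMetrics

// Sketch: explicit begin/end instrumentation, as an alternative to
// runAndMeasure, useful when the measured code does not fit in one closure.
val stageMetrics = StageMetrics(spark)
stageMetrics.begin()
spark.sql("select count(*) from range(1000) cross join range(1000)").show()
stageMetrics.end()
stageMetrics.printReport()  // aggregated resource consumption for the measured period
```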

  12. case class StageVals(jobId: Int, jobGroup: String, stageId: Int, name: String, submissionTime: Long, completionTime: Long, stageDuration: Long, numTasks: Int, executorRunTime: Long, executorCpuTime: Long, executorDeserializeTime: Long, executorDeserializeCpuTime: Long, resultSerializationTime: Long, jvmGCTime: Long, resultSize: Long, diskBytesSpilled: Long, memoryBytesSpilled: Long, peakExecutionMemory: Long, recordsRead: Long, bytesRead: Long, recordsWritten: Long, bytesWritten: Long, shuffleFetchWaitTime: Long, shuffleTotalBytesRead: Long, shuffleTotalBlocksFetched: Long, shuffleLocalBlocksFetched: Long, shuffleRemoteBlocksFetched: Long, shuffleLocalBytesRead: Long, shuffleRemoteBytesRead: Long, shuffleRemoteBytesReadToDisk: Long, shuffleRecordsRead: Long, shuffleWriteTime: Long, shuffleBytesWritten: Long, shuffleRecordsWritten: Long) extends Product with Serializable
  13. class TaskInfoRecorderListener extends SparkListener

    TaskInfoRecorderListener: this listener gathers metrics at task execution granularity. It is based on the Spark Listener interface. Task metrics are stored in memory and used to produce a report that aggregates resource consumption; they can also be consumed "raw" (transformed into a DataFrame and/or saved to a file).

  14. case class TaskMetrics(sparkSession: SparkSession) extends Product with Serializable

    TaskMetrics: collects metrics data at task granularity and provides aggregation and reporting functions for the end user.

    Example of how to use task metrics:
    val taskMetrics = ch.cern.sparkmeasure.TaskMetrics(spark)
    taskMetrics.runAndMeasure(spark.sql("select count(*) from range(1000) cross join range(1000) cross join range(1000)").show)

    The tool uses Spark listeners as the data source and collects metrics into a ListBuffer of a case class that encapsulates Spark task metrics.
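    The collected task metrics can be turned into a DataFrame for ad hoc analysis. A sketch assuming the createTaskMetricsDF method from the sparkMeasure documentation (a SparkSession `spark` is assumed; the column name used for ordering is illustrative):

```scala
import ch.cern.sparkmeasure.TaskMetrics

// Sketch: measure a query, then analyze raw task-level metrics as a DataFrame.
val taskMetrics = TaskMetrics(spark)
taskMetrics.runAndMeasure(
  spark.sql("select count(*) from range(1000) cross join range(1000)").show)

// Transform the in-memory ListBuffer of task metrics into a DataFrame
val df = taskMetrics.createTaskMetricsDF()
df.orderBy("duration").show()  // e.g. inspect task duration skew
```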

  15. case class TaskVals(jobId: Int, jobGroup: String, stageId: Int, index: Long, launchTime: Long, finishTime: Long, duration: Long, schedulerDelay: Long, executorId: String, host: String, taskLocality: Int, speculative: Boolean, gettingResultTime: Long, successful: Boolean, executorRunTime: Long, executorCpuTime: Long, executorDeserializeTime: Long, executorDeserializeCpuTime: Long, resultSerializationTime: Long, jvmGCTime: Long, resultSize: Long, diskBytesSpilled: Long, memoryBytesSpilled: Long, peakExecutionMemory: Long, recordsRead: Long, bytesRead: Long, recordsWritten: Long, bytesWritten: Long, shuffleFetchWaitTime: Long, shuffleTotalBytesRead: Long, shuffleTotalBlocksFetched: Long, shuffleLocalBlocksFetched: Long, shuffleRemoteBlocksFetched: Long, shuffleLocalBytesRead: Long, shuffleRemoteBytesRead: Long, shuffleRemoteBytesReadToDisk: Long, shuffleRecordsRead: Long, shuffleWriteTime: Long, shuffleBytesWritten: Long, shuffleRecordsWritten: Long) extends Product with Serializable

Value Members

  1. object IOUtils

    The object IOUtils contains helper code for the sparkMeasure package. The methods readSerializedStageMetrics and readSerializedTaskMetrics are used to read data serialized into files by the "flight recorder" mode. Two serialization modes are currently supported: Java serialization and JSON serialization with the Jackson library.
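    A sketch of reading back flight-recorder output using the methods named above (exact signatures may differ by sparkMeasure version; the path matches the default output filename of FlightRecorderStageMetrics):

```scala
import ch.cern.sparkmeasure.IOUtils

// Sketch: read stage metrics serialized by the flight-recorder mode
// and print each recorded StageVals entry.
val stageMetrics = IOUtils.readSerializedStageMetrics("/tmp/stageMetrics_flightRecorder")
stageMetrics.foreach(println)
```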

  2. object Utils

    The object Utils contains helper code for the sparkMeasure package. The methods formatDuration and formatBytes are used for printing stage metrics reports.
