When the application stops, serialize the contents of stageMetricsData into a file on the driver's filesystem.
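A minimal sketch of such a shutdown step, assuming plain Java serialization; the case class, its fields, and the output path are illustrative placeholders, not the tool's actual names:

```scala
import java.io.{FileOutputStream, ObjectOutputStream}
import scala.collection.mutable.ListBuffer

// Hypothetical record standing in for the real flattened stage metrics entry
case class StageVals(stageId: Int, executorRunTime: Long, executorCpuTime: Long)

val stageMetricsData = ListBuffer[StageVals]()
stageMetricsData += StageVals(1, 1200L, 900L)

// On application end, write the buffer to a file on the driver's filesystem
// (the path here is just an example)
val out = new ObjectOutputStream(new FileOutputStream("/tmp/stageMetrics.serialized"))
try out.writeObject(stageMetricsData) finally out.close()
```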
This method fires at the end of each stage and collects the metrics, flattened, into the stageMetricsData ListBuffer. Note: all times are reported in ms; CPU time and shuffle write time are originally in nanoseconds, so the code divides them by 1e6.
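A sketch of what such a listener callback can look like, using the public Spark listener API; the record type and field selection are simplified assumptions, not the tool's full schema:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}
import scala.collection.mutable.ListBuffer

// Illustrative record type; the real tool flattens many more metrics
case class StageVals(stageId: Int, executorRunTime: Long,
                     executorCpuTime: Long, shuffleWriteTime: Long)

class StageMetricsListener extends SparkListener {
  val stageMetricsData: ListBuffer[StageVals] = ListBuffer.empty

  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    val info = stageCompleted.stageInfo
    val tm = info.taskMetrics
    stageMetricsData += StageVals(
      info.stageId,
      tm.executorRunTime,                           // already in ms
      tm.executorCpuTime / 1000000L,                // ns -> ms
      tm.shuffleWriteMetrics.writeTime / 1000000L   // ns -> ms
    )
  }
}
```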
sparkMeasure package: a proof-of-concept tool for measuring Spark performance metrics. It uses Spark Listeners as the data source and collects the metrics into a ListBuffer; the ListBuffer is then transformed into a DataFrame for analysis.
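The ListBuffer-to-DataFrame step can be sketched as follows; the case class and column names are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession
import scala.collection.mutable.ListBuffer

// Hypothetical metrics record collected by a listener
case class StageVals(stageId: Int, executorRunTime: Long)

val spark = SparkSession.builder.appName("demo").master("local[*]").getOrCreate()
import spark.implicits._

val stageMetricsData = ListBuffer(StageVals(1, 1200L), StageVals(2, 800L))

// A ListBuffer of case classes converts directly to a DataFrame
val df = stageMetricsData.toDF()
df.createOrReplaceTempView("stageMetrics")
spark.sql("select stageId, executorRunTime from stageMetrics").show()
```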
Stage Metrics: collects and aggregates metrics at the end of each stage. Task Metrics: collects data at task granularity.
Use modes: interactive mode from the REPL; flight recorder mode, which records data and saves it for later processing.
Supported languages: the tool is written in Scala, but it can be used from both Scala and Python.
Example usage for stage metrics:
val stageMetrics = ch.cern.sparkmeasure.StageMetrics(spark)
stageMetrics.runAndMeasure(spark.sql("select count(*) from range(1000) cross join range(1000) cross join range(1000)").show)
Example usage for task metrics:
val taskMetrics = ch.cern.sparkmeasure.TaskMetrics(spark)
spark.sql("select count(*) from range(1000) cross join range(1000) cross join range(1000)").show()
val df = taskMetrics.createTaskMetricsDF()
To use in flight recorder mode add: --conf spark.extraListeners=ch.cern.sparkmeasure.FlightRecorderStageMetrics
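For example, the configuration above can be passed on the spark-submit command line; the application class and jar names here are placeholders:

```shell
spark-submit \
  --conf spark.extraListeners=ch.cern.sparkmeasure.FlightRecorderStageMetrics \
  --class MyApp \
  myApp.jar
```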
Created by Luca.Canali@cern.ch, March 2017