The Scope of “Monitoring Apache Spark”
“Apache Spark is an open-source distributed general-purpose cluster-computing framework.” It’s primarily written in Scala and runs on the Java Virtual Machine. Spark creates multiple JVM processes to execute and manage applications, batch jobs, or a stream’s lifecycle. “Spark applications run as independent sets of JVM processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).”
OK, so we have a bunch of different JVM processes running as infrastructure: a cluster manager/master moving things around, and a driver pushing tasks out to separate JVM processes (executors) that live on worker nodes, and so on…
To consider how to approach monitoring Spark, let’s decompose the architecture into three distinct areas within the Spark landscape that will need monitoring:
- Spark Applications
- Spark Infrastructure - the Master (Standalone, Kubernetes, Mesos, YARN, etc.) and the Worker node(s)
- Host machines - disk, CPU, native memory, etc.
For the first item, Spark Applications, you’ll need to use the Java Agent’s custom instrumentation API for any application-level tracing you want to see (likely the async token API). At this time, we don’t have a built-in module to trace anything from a SparkContext or SparkSession.
Skipping the second item briefly: you can gain insight into your host machines (the third item) with the New Relic Infrastructure agent, of course.
Back to item two on our list (Spark Infrastructure): if you’re using the Infrastructure agent and Kubernetes, you can use New Relic’s Kubernetes integration to gain insight into the Master and Worker nodes, along with the JMX integration for the drivers and executors. You also have the option of the JMX integration with the Infrastructure agent for other cluster manager implementations, e.g. Spark Standalone. Similarly, you can use the Java Agent to gather Spark metrics over JMX (which is the example provided below).
Spark also comes with a couple of web UIs that provide some operational data. With the New Relic Java agent installed, you’ll see the Netty server transactions for those UIs.
The remainder of this level-up post will run through the basics of how to get Spark infrastructure metrics into New Relic. Apache Spark documents the metrics it makes available and the various methods to get at them. The metrics are exposed through a “sink” concept; we’ll be using the JmxSink.
## Configure jmx.yml file for NR Extensions
There are multiple ways to confirm the JMX bean names, e.g. jconsole, jmxterm, the docs, etc. I found jmxterm to be really helpful.
Pretty much everything is under the Spark-defined “metrics” domain and namespace. I’ve included just a small sample of attributes.
```yaml
name: SparkMetrics
version: 1.0
enabled: true
jmx:
  - object_name: "metrics:name=*"
    metrics:
      - attributes: Count, Value, 75thPercentile, 95thPercentile, Mean, Min, Max, StdDev
        type: monotonically_increasing
```
You’ll likely want to limit the number of metrics you’re bringing in. I’ve just used the wildcard on name=* for simplicity in this demo.
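To narrow things down, you can replace the wildcard with a more specific `object_name` pattern. The pattern below is illustrative only (an assumption, not taken from a real cluster) — confirm the exact bean names in your own environment with jmxterm before using it:

```yaml
name: SparkMetrics
version: 1.0
enabled: true
jmx:
  # Hypothetical narrowing: only DAGScheduler job gauges, not every Spark metric
  - object_name: "metrics:name=*.DAGScheduler.job.*"
    metrics:
      - attributes: Value
        type: simple
```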
This file gets saved into the newrelic/extensions/ directory.
Okay, all of that is pretty straightforward…
You can download Apache Spark here:
Use the current build (2.4.0 as of this writing) and the “Pre-built for Apache Hadoop 2.7 and later” package.
In your Spark directory, under spark/conf, there are a number of templates. We’ll cp spark-defaults.conf.template and metrics.properties.template to new files:
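Assuming Spark was unpacked into a directory named `spark` (the directory name is from this demo’s layout; the target file names are the standard ones Spark looks for), the copies look like this:

```shell
# Copy the bundled templates to live config files that Spark reads at startup
cd spark/conf
cp spark-defaults.conf.template spark-defaults.conf
cp metrics.properties.template metrics.properties
```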
The spark-defaults.conf is where you can add your custom configuration and override default settings. This is where we need to add our Java options: there are extraJavaOptions properties for both spark.executor AND spark.driver. Here is where we pass in the -javaagent flag as well as enable remote and local access for JMX. The key thing to note is that you’ll be enabling JMX for the driver and the executors separately (likewise, the -javaagent flag needs to be passed for both).
```
# Example:
# spark.master                     spark://master:7077
# spark.eventLog.enabled           true
# spark.eventLog.dir               hdfs://namenode:8021/directory
# spark.serializer                 org.apache.spark.serializer.KryoSerializer
# spark.driver.memory              5g
# spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"

spark.driver.extraJavaOptions    -javaagent:/path/to/files/newrelic/newrelic.jar -Djava.rmi.server.hostname=127.0.0.1 -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.local.only=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=8009

# Port 0 lets each executor JVM pick its own free JMX port
spark.executor.extraJavaOptions  -javaagent:/path/to/files/newrelicExecutor/newrelic.jar -Djava.rmi.server.hostname=127.0.0.1 -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.local.only=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=0
```
Spark provides metrics “sinks,” including one for JMX. You can read about the various sinks here:
For our purposes, we want to enable the JmxSink and specify that we want JMX data for the Master, Worker, Executor, and Driver.
```
# Enable JmxSink for all instances by class name
*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink

# Enable JvmSource for instance master, worker, driver and executor
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
```
The next step is to start up our cluster so that we can submit a job to be processed.
Spark Standalone Mode - Spark 2.4.0 Documentation
This is easy enough…
In the sbin directory there are a number of startup scripts. We need to run:
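For a 2.4.0 standalone cluster, the relevant scripts are start-master.sh and start-slave.sh. A minimal sketch (the 127.0.0.1 host below is an assumption for a single-machine demo — substitute the master URL from your own logs):

```shell
# Start the standalone master; it logs the spark:// URL it is listening on
./sbin/start-master.sh

# Start a worker, passing the master's URL as the startup argument
./sbin/start-slave.sh spark://127.0.0.1:7077
```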
You can grep the log output for the location of the master, but the default will be spark://<hostname>:7077. The start-slave.sh script takes the master location as a startup argument.
## Submit The Application
Once these are running, you can submit your job to the cluster. The key gotcha I hit was the --files switch on the submit script. My application hard-codes the file location, but the executors can theoretically be launched on any machine, so you have to provide some facility for letting them know where to get whatever they’ll need. That could be an S3 location, an HDFS directory, etc. There are switches to provide the appropriate means to access the files.
Likewise, the paths to the newrelic.jar files need to be provided with the --files switch. These files can actually be zipped into a single archive, and Spark will unpack and use them as needed once it knows where they are.
Note that --files takes a comma-separated list (repeating the flag would overwrite the earlier values):

```shell
./spark/bin/spark-submit --class com.jobreadyprogrammer.spark.Application \
  --files /path/to/files/large.json,/path/to/files/newrelic/newrelic.jar,/path/to/files/newrelicExecutor/newrelic.jar \
  --master spark://<IP address>:7077 \
  project-0.0.1-SNAPSHOT.jar
```
You can then build out a Dashboard in Insights to view your metrics. There’s not much data in this screenshot because it’s from a single batch run, but it’s here to demonstrate some of the metrics that can be collected.
Hopefully, this helps you reason about how to monitor the Apache Spark environment. The key takeaways: the Infrastructure agent can provide much of what is needed; JMX metric information can be gathered via the Java Agent (or the Infrastructure JMX integration); and for your custom, Spark-application-specific instrumentation, you’ll need to use the Java Agent’s custom instrumentation APIs (@Trace annotations).