Your data. Anywhere you go.

New Relic for iOS or Android


Download on the App Store    Android App on Google play


New Relic Insights App for iOS


Download on the App Store


Learn more

Close icon

Monitoring Apache Spark

jmx
cluster
clustering
javaagent
spark
apachespark
dataframe
apm
infrastructure

#1

The Scope of “Monitoring Apache Spark”

“Apache Spark is an open-source distributed general-purpose cluster-computing framework.” It’s primarily written in Scala and uses the Java Virtual Machine. Spark runs on and creates multiple JVM processes to execute and manage applications, batch jobs, or a stream’s lifecycle. "Spark applications run as independent sets of JVM processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program )." [1]

Ok, we have a bunch of different JVM processes running as infrastructure and a cluster manager/master moving things around and a driver pushing tasks out to separate JVM processes (executors) that live on worker nodes, etc…

To consider how to approach monitoring Spark, let’s decompose the architecture into three distinct areas within the Spark landscape that will need monitoring:

Spark Applications

  • Driver
  • Executors

Spark Infrastructure

  • Master - Standalone, Kubernetes, Mesos, Yarn, etc.
  • Worker node(s)

Hosting Machines

  • Disk, CPU, native memory, etc.

For the first item, Spark Applications, you’ll need to use the Java Agent’s Custom Instrumentation API for any application level tracing that you want to see (likely, the async token api). At this time, we don’t have a built-in module to trace anything from a SparkContext or SparkSession.

Skipping the second item briefly, you can gain insight into your host machines with the New Relic Infrastructure agent, of course.

Back to item two on our list (Spark Infrastructure), if using the Infrastructure agent and Kubernetes, you can use New Relic’s Kuberentes Integration to gain insight to the Master and Worker nodes along with JMX Integration for the drivers and executors. You also have the option of JMX Integration with the Infrastructure agent for other manager implementations, i.e. the Spark Standalone. Similarly, you can use the Java Agent to gain JMX integration for Spark Metrics (which is the example provided below).

Spark also comes with a couple UIs that provide some operational data. With the New Relic agent installed you’ll see the Netty Server transactions for those UIs.

The remainder of this level-up post will run through the basics on how to get Spark Infrastructure metrics into New Relic. Apache Spark provides some information about the metrics they make available and the various methods to get them. The metrics are provided through a “sink” concept. We’ll be using the JMXSInk.

##Configure jmx.yml file for NR Extensions

There are multiple ways to confirm the JMX Bean names, i.e. jconsole, jmxterm, docs, etc. I found jmxterm to be really helpful.

jmxterm quickstart - Apache Kafka - Apache Software Foundation

Pretty much everything is under the Spark defined “metrics” domain and namespace. I’ve included just a small sample of attributes.

Custom JMX instrumentation by YAML | New Relic Documentation

name: SparkMetrics
version: 1.0
enabled: true
jmx:
  - object_name: metrics:name=*
  metrics:
    - attributes: Count, Value, 75thPercentile, 95thPercentile, Mean, Min, Max, StdDev
      type: monotonically_increasing

You’ll likely want to limit the number you’re bringing in. I’ve just used the wildcard on name=* for simplicity in this demo.

This file gets saved into the newrelic/extensions/ directory.

Okay, all of that is pretty straight forward…

Example Implementation:

Configuring Spark

You can download Apache Spark here:

Downloads | Apache Spark

Use the current build (this one is 2.4.0) and the Pre-built for Hadoop 2.7 and later

In your Spark directory in spark/conf there are a number of templates. We’ll cp

spark-defaults.conf.template and metrics.properties.template to new files:

spark-defaults.conf

and

metrics.properties

The spark-defaults.conf is where you can add in your custom configuration and override default settings. This is where we will need to add our JAVA_OPTS. There’s is a system property for the spark.executor AND spark.driver . Here is where we will pass in the -javaagent flags as well as enable remote and local access for JMX. The key thing here is to note that you’ll be enabling JMX for both the driver and executors separately (likewise the -javaagent flag needs to passed for both ).

# Example:
# spark.master                     spark://master:7077
# spark.eventLog.enabled           true
# spark.eventLog.dir               hdfs://namenode:8021/directory
# spark.serializer                 org.apache.spark.serializer.KryoSerializer
# spark.driver.memory              5g
# spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
spark.driver.extraJavaOptions=    -javaagent:/path/to/files/newrelic/newrelic.jar - 
Djava.rmi.server.hostname=127.0.0.1 -Dcom.sun.management.jmxremote=true - 
Dcom.sun.management.jmxremote.local.only=false - 
Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false​ - 
Dcom.sun.management.jmxremote.port=8009‍‍‍‍‍

spark.executor.extraJavaOptions=  -javaagent:/path/to/files/newrelicExecutor/newrelic.jar - 
Djava.rmi.server.hostname=127.0.0.1 -Dcom.sun.management.jmxremote=true - 
Dcom.sun.management.jmxremote.local.only=false - 
Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false - 
Dcom.sun.management.jmxremote=0‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

Spark provides “sinks” for JMX and metrics. You can read about the various sinks here:
https://spark.apache.org/docs/latest/monitoring.html

For our purposes we want to enable the JMXSink and specify that we want JMX data for Master, Worker, Executor, and Driver.

Excerpted:


# Enable JmxSink for all instances by class name
*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink

...
...
...

# Enable JvmSource for instance master, worker, driver and executor
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource

worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource

driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource

executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

The next step is to startup our cluster so that we can submit a job to be processes.

Startup:

Spark Standalone Mode - Spark 2.4.0 Documentation
This is easy enough…

In the sbin directory there are a number of startup scripts. We need to run:

. /start-master.sh
You can grep the log output for the location of the master…But the default will be spark:/:7077. The spark-slave.sh script takes the master location as a startup argument.

The spark-slave.sh script takes the master location as a startup argument.

./start-slave.sh spark://<master-ip-address>:7077‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

Submit The Application

Submitting Applications - Spark 2.4.0 Documentation

Once these are running you can submit your job to the cluster. The key gotcha I had with this was that there is a switch (–files) for the the submit script. My application hard codes the file location, but the executors can theoretically be launched on any machine, so you have to provide some facility for letting them know where to get whatever it is they’ll need. So, that could be an S3 location, hdfs directory, etc. There are switches to provide the appropriate means to access the file.

Likewise, the path to the newrelic.jar files need to be provided with the --files switch. This files can actually be zipped into a single file and Spark will unpack them and use as needed once it knows where they are.

./spark/bin/spark-submit --class com.jobreadyprogrammer.spark.Application \
--files /path/to/files/large.json \
--files /path/to/files/newrelic/newrelic.jar \
--files /path/to/files/newrelicExecutor/newrelic.jar \
--master spark://<IP address>:7077 project-0.0.1-SNAPSHOT.jar
‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

You can then build out a Dashboard in Insights to view your metrics. There’s not much data in this screenshot because it’s from a single batch run, but it’s here to demonstrate some of the metrics that can be collected.

Hopefully, this helps you reason about how to monitor in the Apache Spark environment. The key takeaways are that Infrastructure agent can provide much of what is needed. JMX metric information can be gathered via the Java Agent (or Infrastructure integration). For your custom Spark Application specific instrumentation, you’ll need to use the Java Agent’s Custom Instrumentation APIs (@Trace annotations).


Newrelic java agent for Profiling Spark drivers and executors