How to add NVIDIA GPU metrics to the New Relic infra agent (feat. New Relic Flex)


If you want to see your instance’s GPU metrics in New Relic, you can simply add a .yml file. This uses New Relic Flex, an application-agnostic, all-in-one tool that lets you collect metric data from a wide variety of services. You can then easily query the GPU status (temperature, utilization, memory, etc.) in New Relic and, for instance, build a GPU dashboard like the one below.

Benefit (for AWS CloudWatch Users)

In CloudWatch, you can also see GPU metrics, but you are charged each time you request (query) them. In New Relic, querying is free.

Also, you can customize the GPU and related metrics using New Relic Flex and build your own dashboard.


In your on-premise or cloud environment, you must install the NVIDIA GPU driver so that the metrics command works.
This example was tested on Ubuntu on an AWS p2 instance, which has GPU resources.
I installed the NVIDIA driver manually on my AWS p2 instance, but you can also use an AMI with the driver preinstalled. For more detail, refer to the links below.

New Relic Flex integration (for GPU Metrics)

  1. Suppose that you installed the newrelic infra agent and connected the instance which has the GPU resource.

    $ cd /etc/newrelic-infra/integrations.d
    $ sudo vim flex-nvidia-gpu.yml

    You can name it flex-WHAT-YOU-WANT.yml.

  2. Copy and paste the YML script below into flex-nvidia-gpu.yml.

      integrations:
        - name: nri-flex
          config:
            name: nvidiaGpuMetric
            apis:
              - name: NvidiaGpuMetric
                commands:
                  - run: echo "$(hostname), $(nvidia-smi --query-gpu=name,driver_version,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used,timestamp --format=csv,noheader)"
                    split: horizontal
                    split_by: \,
                    set_header: [hostname,name,driverVersion,temperatureGpu,utilizationGpu,utilizationMemory,memoryTotal,memoryFree,memoryUsed,timestamp]
                perc_to_decimal: true

    The number of values produced by the run command must match the number of elements in the set_header (column) field. Also, be careful with the split_by field. If either is wrong, you won’t see any results in the New Relic query builder.

    The run field concatenates the output of two commands with echo. Because the hostname is included, you can group by hostname (FACET) when you query New Relic.
    $ hostname
    $ nvidia-smi --query-gpu=name,driver_version,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used,timestamp --format=csv,noheader
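    Before wiring the command into Flex, you can verify the field count matches the header yourself. This is a minimal sketch: the CSV line below is made-up sample data in the shape nvidia-smi prints for one GPU, since the real values depend on your hardware.

    ```shell
    # Simulated single-GPU output line (made-up sample values, not real output).
    gpu_csv='Tesla K80, 470.57.02, 62, 35 %, 12 %, 11441 MiB, 10932 MiB, 509 MiB, 2021/01/01 00:00:00.000'

    # Prepend the hostname, exactly as the Flex run command does with echo.
    row="$(hostname), $gpu_csv"

    # Count the comma-separated fields; this must equal the 10 set_header entries.
    echo "$row" | awk -F',' '{print NF}'
    # → 10
    ```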

  3. In the New Relic UI, click Query your data at the top right, then click the Query builder tab.

  4. Enter the NRQL queries below and test them.

    SELECT hostname, name, driverVersion, temperatureGpu, utilizationGpu
    FROM NvidiaGpuMetricSample
    SINCE 1 day ago

    SELECT average(numeric(temperatureGpu)) AS 'Temperature'
    FROM NvidiaGpuMetricSample
    SINCE 1 day ago

    SELECT average(numeric(utilizationGpu)) AS 'utilizationGpu'
    FROM NvidiaGpuMetricSample
    SINCE 1 day ago
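    Once the test queries return data, a faceted timeseries variant works well on a dashboard. This is a sketch: the chart title and time window are arbitrary choices, not part of the original setup.

    ```
    SELECT average(numeric(utilizationGpu)) AS 'GPU Utilization'
    FROM NvidiaGpuMetricSample
    FACET hostname TIMESERIES SINCE 1 day ago
    ```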

Test Environment

  • Ubuntu 18.04 LTS (or 20.04 LTS)
  • AWS p2 instance
  • NVIDIA Tesla driver



Thank you for the detailed steps. I’ve followed them with newrelic infra agent 1.14 on Kubernetes and tried all the steps, but the data is not getting generated.

No chart data available

No events found – do you have the correct event type and time range?

Hi @satish.kumar.kuchipu,

  1. Did you test with the same .yml script above?
  2. If not, can you check your command (run:) and delimiter (split_by:)?
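One more way to debug this is to run the Flex binary directly in verbose mode, which prints the parsed samples to the terminal instead of sending them to New Relic. The binary path below assumes a default Linux infra-agent install and will differ inside Kubernetes pods, so treat it as a sketch to adapt.

```shell
# Run nri-flex against the config file and print the samples it would emit.
# Path assumes a standard Linux install of the infra agent; adjust as needed.
sudo /opt/newrelic-infra/newrelic-integrations/bin/nri-flex \
  -verbose -pretty \
  -config_file /etc/newrelic-infra/integrations.d/flex-nvidia-gpu.yml
```

If the samples look wrong here, the problem is in the config (run/split_by); if they look right, the problem is between the agent and New Relic.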

Just as a note, this is an absolutely great topic, and it needs to be kept around. The only key issue is that on Windows 11 I had difficulty with escaping/evaluating quotes properly. Adding “shell: powershell” resolved the issue entirely.


Hi @ecummings,

Welcome to the Explorers Hub, and thank you for sharing this. I am certain it will be helpful to other members of the community should they experience this issue.
