How to add NVIDIA GPU metrics to the New Relic infra agent (feat. New Relic Flex)

Background

If you want to see your instance's GPU metrics in New Relic, you can simply add a .yml file. This uses New Relic Flex, an application-agnostic, all-in-one tool that lets you collect metric data from a wide variety of services. You can then easily query the GPU status (temperature, utilization, memory, etc.) in New Relic and build a GPU dashboard from the results.

Benefit (for AWS CloudWatch Users)

In CloudWatch, you can see the GPU metrics, but you are charged when you request (query) them. On New Relic, it's free.

You can also customize the GPU and related metrics using New Relic Flex and then build your own dashboard.

Prerequisites

In your on-premises or cloud environment, you have to install the GPU driver so that the command can return GPU metrics.
This example was tested on Ubuntu on an AWS p2 instance, which has GPU resources.
I installed the NVIDIA driver manually on my AWS p2 instance, but you can also use an AMI that has the NVIDIA driver preinstalled. For more detail, refer to the links below.
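
Before configuring Flex, it can help to confirm that the driver tools are actually available on the instance. The following is a minimal sketch, assuming a POSIX shell; `has_command` is a hypothetical helper name used only for illustration, not part of any NVIDIA or New Relic tooling.

```shell
#!/bin/sh
# has_command is a hypothetical helper: succeeds if the given command is on PATH.
has_command() {
    command -v "$1" >/dev/null 2>&1
}

if has_command nvidia-smi; then
    # Same query style that the Flex config uses later; prints one CSV line per GPU.
    nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
else
    echo "nvidia-smi not found: install the NVIDIA driver first" >&2
fi
```

If this prints the GPU name and driver version, the command in the Flex config should be able to run as well.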

New Relic Flex integration (for GPU Metrics)

  1. This assumes that you have already installed the New Relic infra agent and connected the instance that has the GPU resource. Create the Flex config file:

    $ cd /etc/newrelic-infra/integrations.d
    $ sudo vim flex-nvidia-gpu.yml
    

    You can name the file flex-WHAT-YOU-WANT.yml.

  2. Copy and paste the YAML below into flex-nvidia-gpu.yml.

    integrations:
      - name: nri-flex
        config:
          name: nvidiaGpuMetric
          apis:
            - name: NvidiaGpuMetric
              commands:
                - run: echo "$(hostname), $(nvidia-smi --query-gpu=name,driver_version,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used,timestamp --format=csv,noheader)"
                  split: horizontal
                  split_by: \,
                  set_header: [hostname,name,driverVersion,temperatureGpu,utilizationGpu,utilizationMemory,memoryTotal,memoryFree,memoryUsed,timestamp]
              perc_to_decimal: true
    

    The output of the run field and the set_header (column) field must have the same number of elements. Also, be careful with the split_by field. If either is wrong, you won't see any results in the New Relic query builder.

    The run field concatenates the output of two commands with echo. Because hostname is included, you can group by hostname (FACET) when you query New Relic.
    $ hostname
    $ nvidia-smi --query-gpu=name,driver_version,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used,timestamp --format=csv,noheader

  3. Go to one.newrelic.com > click Query your data at the top right > click the Query builder tab.

  4. Enter the NRQL queries below and test them.

    SELECT hostname,name,driverVersion,temperatureGpu,utilizationGpu,
           utilizationMemory,memoryTotal,memoryFree,memoryUsed 
    FROM NvidiaGpuMetricSample 
    Since 1 day ago
    

    SELECT average(numeric(temperatureGpu)) as 'Temperature' 
    FROM NvidiaGpuMetricSample 
    TIMESERIES 
    Since 1 day ago
    

    SELECT average(numeric(utilizationGpu)) as 'utilizationGpu' 
    FROM NvidiaGpuMetricSample 
    TIMESERIES 
    Since 1 day ago
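
Step 2 requires the output of run and set_header to contain the same number of elements. One way to check this before looking for data in the query builder is to count the comma-separated fields yourself. The sketch below is only an illustration under stated assumptions: `count_fields` is a hypothetical helper, not part of Flex, and the sample line uses made-up values rather than real nvidia-smi output.

```shell
#!/bin/sh
# count_fields is a hypothetical helper: prints how many comma-separated
# fields the given line has.
count_fields() {
    printf '%s\n' "$1" | awk -F',' '{ print NF }'
}

# Headers copied from set_header in flex-nvidia-gpu.yml.
HEADERS="hostname,name,driverVersion,temperatureGpu,utilizationGpu,utilizationMemory,memoryTotal,memoryFree,memoryUsed,timestamp"

# Illustrative sample with made-up values; on a GPU host, replace it with the
# real output of the echo command from the run field.
SAMPLE="my-host, Tesla K80, 470.57.02, 45, 0 %, 0 %, 11441 MiB, 11441 MiB, 0 MiB, 2021/09/01 12:00:00.000"

if [ "$(count_fields "$SAMPLE")" -eq "$(count_fields "$HEADERS")" ]; then
    echo "field count matches set_header"
else
    echo "field count does NOT match set_header" >&2
fi
```

Note that nvidia-smi prints one line per GPU, so on a multi-GPU instance only the first line of the echo output starts with the hostname; checking each line separately would show the difference.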
    

Test Environment

  • Ubuntu 18.04 LTS (or 20.04 LTS)
  • AWS p2 instance
  • NVIDIA Tesla driver

References


Thank you for the detailed steps. I've followed all of them, and I'm using New Relic infra agent 1.14 on Kubernetes; however, no data is being generated.

No chart data available

No events found – do you have the correct event type and time range?

Hi @satish.kumar.kuchipu,

  1. Did you test the same .yml script above?
  2. If not, can you check your command (run:) and delimiter (split_by:)?

Just as a note, this is an absolutely great topic, and it needs to be kept around. The only key issue is that I had difficulty on Windows 11 with escaping quotes and getting the command to evaluate properly. Adding “shell: powershell” resolved the issue entirely.


Hi @ecummings,

Welcome to the Explorers Hub, and thank you for sharing this. I am certain it will be helpful to other members of the community should they experience this issue.