Infra Alert via Terraform

Hi all,

What’s the recommended way to set up an alert on container restarts?
We have been using NRQL, but we had an instance where we didn’t get any alerts when a container restarted in prod :frowning_man: Here is what my NRQL alert resource looks like.

#---------------------------------------------------
# Add dynamic newrelic infra alert condition
#---------------------------------------------------

locals {
  l_newrelic_infra_alert_condition_defaults = {
    for obj in var.newrelic_infra_alert_condition : "${obj.name} ${obj.policy_id} ${obj.type}" => merge(var.newrelic_infra_alert_condition_defaults, obj)
  }
}


resource "newrelic_infra_alert_condition" "dynamic_infra_alert" {

  for_each = local.l_newrelic_infra_alert_condition_defaults

  policy_id  = each.value.policy_id
  type       = each.value.type
  name       = each.value.name
  event      = each.value.event
  select     = each.value.select
  comparison = each.value.comparison
  where      = each.value.where

  critical {
    duration      = each.value.critical.duration
    value         = each.value.critical.value
    time_function = each.value.critical.time_function
  }
  dynamic "warning" {
    for_each = try([each.value["warning"]], [])
    content {
      duration      = warning.value.duration
      value         = warning.value.value
      time_function = warning.value.time_function
    }
  }
}

Input values to the resource here:

   {
      "policy_id"            = "1111111"
      "type"                 = "static"
      "name"                 = "Container Restarted"
      "description"          = "Alert when containers restart"
      "runbook_url"          = "tbd"
      "enabled"              = true
      "value_function"       = "single_value"
      "violation_time_limit" = "one_hour"
      "nrql" = {
        "query"             = "SELECT max(restartCount) - min(restartCount) as 'Restarts' FROM K8sContainerSample WHERE clusterName IN ('gke_xxx') AND (containerName  like '%xxx%') FACET clusterName, podName, containerName"
        "evaluation_offset" = 3
      },
      "critical" = {
        "operator"              = "above"
        "threshold"             = 0
        "threshold_duration"    = 60
        "threshold_occurrences" = "ALL"
      }
    }

We suspected that the restart didn’t last long enough to register an incident or trigger a notification. However, I confirmed that the JVM took slightly over 2 minutes to restart. Additionally, I found that Infra alerts also support a condition on restart count, so I tried this:

   {
      "policy_id"             = "1111111"
      "type"                  = "infra_metric"
      "name"                  = "Container Restarted - UK Region"
      "event"                 = "K8sContainerSample"
      "select"                = "restartCount"
      "comparison"            = "equal"
      "violation_close_timer" = "one_hour"
      "where"                 = "clusterName IN ('gke-uk') AND (containerName like '%xxxxx%')"
      "critical" = {
        "value"         = 1
        "duration"      = 1
        "time_function" = "all"
      }
    }

I tested it and it works. However, I see multiple issues:

  • I can’t override the default violation_close_timer of 24 hours to 1 hour.

  • If I manually close the incident, the alert reappears within a couple of minutes. I tried updating the violation_close_timer to 1 hour from the UI, but the behaviour remained unchanged.

Please let me know if you have any recommendations. It’s urgent, and I have escalated this to our NR support team too.

Thanks

Cc: @sblue

Hi @deshdeep,

Since there are several moving parts and I lack knowledge about your stack, technical support will likely be a better resource for assistance with creating an alert condition that fits your needs in this scenario.

That being said, I did notice one thing about the Infra condition. For newrelic_infra_alert_condition, the value of violation_close_timer should be an integer, expressed in hours per the docs (acceptable values: 0, 1, 2, 4, 8, 12, 24, 48, 72). So "violation_close_timer" = "one_hour" would need to be changed.
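One more observation, offered as a sketch rather than a confirmed fix: the resource block posted above never passes violation_close_timer through to the provider, so a value set in the input object would have no effect. Assuming the same locals and input shape as in the original post, the argument could be forwarded like this. The try(..., null) fallback is my assumption, so that inputs without the key keep the provider default.

```hcl
resource "newrelic_infra_alert_condition" "dynamic_infra_alert" {
  for_each = local.l_newrelic_infra_alert_condition_defaults

  policy_id  = each.value.policy_id
  type       = each.value.type
  name       = each.value.name
  event      = each.value.event
  select     = each.value.select
  comparison = each.value.comparison
  where      = each.value.where

  # Assumption: forward the timer only when the input supplies it;
  # null leaves the provider default in place for inputs that omit the key.
  violation_close_timer = try(each.value.violation_close_timer, null)

  critical {
    duration      = each.value.critical.duration
    value         = each.value.critical.value
    time_function = each.value.critical.time_function
  }

  # Emit a warning block only for inputs that define one.
  dynamic "warning" {
    for_each = try([each.value["warning"]], [])
    content {
      duration      = warning.value.duration
      value         = warning.value.value
      time_function = warning.value.time_function
    }
  }
}
```

The input object would then carry an integer from the accepted list, for example "violation_close_timer" = 1 in place of "one_hour".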

  • If I manually close the incident the alert reappears in a couple of minutes. I tried updating the violation_close_timer to 1 hour from UI but the behaviour remained unchanged.

As for the behavior you mentioned above, that sounds like an issue outside the scope of Terraform. Hopefully technical support can help with that as well, especially since changing that value in the UI didn’t affect the behavior.

Hope this helps at least a little bit :slightly_smiling_face: