When reviewing our own systems, we see the same behavior the original poster describes. Our systems run most efficiently when the cache is full; however, a full cache trips the alert thresholds unnecessarily.
For example, we received an alert for a system reported at >90% utilization. With buffers and cache included, the system was indeed at 97% utilization; however, that cache exists to speed up application performance. Using the correct (Linux-preferred) calculation, which excludes buffers and cache, the system was at 40% RAM utilization.
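For anyone who wants to reproduce the "correct" calculation, the idea is simply to subtract Buffers and Cached back out of the used total. Below is a minimal sketch in Python that reads /proc/meminfo; the field handling is simplified, and on newer kernels MemAvailable is arguably the better signal when it is present.

```python
#!/usr/bin/env python3
"""Sketch: report RAM utilization with and without buffers/cache,
using /proc/meminfo (values are in kB). Simplified on purpose."""

def meminfo():
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.strip().split()[0])
    return info

m = meminfo()
total = m["MemTotal"]
used_incl_cache = total - m["MemFree"]
# "Real" usage once buffers and page cache are subtracted back out:
used_excl_cache = used_incl_cache - m["Buffers"] - m["Cached"]

print(f"incl. buffers/cache: {100 * used_incl_cache / total:.0f}%")
print(f"excl. buffers/cache: {100 * used_excl_cache / total:.0f}%")
```

On the system described above, the first figure is what tripped the alert (97%) and the second is the 40% that actually matters.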
Restarting the application cleared roughly 37% of reported RAM utilization, dropping the cache-inclusive total from 97% to 61%; however, actual RAM usage fell only 7 points, from 40% to 33%. The main effect of the restart was to force the system to re-read the cache it relies on for performance.
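To make the before/after numbers concrete, here is the same arithmetic written out; the 16 GiB total is a hypothetical figure used purely for illustration.

```python
# Percentages taken from the example above; 16 GiB is an assumed total.
total_gib = 16

before_with_cache = 0.97 * total_gib   # reported "used" including cache
before_actual     = 0.40 * total_gib   # genuine process usage
after_with_cache  = 0.61 * total_gib
after_actual      = 0.33 * total_gib

# Most of what the restart removed was cache, not real usage:
cache_cleared = (before_with_cache - before_actual) - (after_with_cache - after_actual)
actual_freed  = before_actual - after_actual
print(f"cache cleared: ~{cache_cleared:.1f} GiB, real usage freed: ~{actual_freed:.1f} GiB")
```

With these assumed numbers, roughly 4.6 GiB of cache was thrown away to free about 1.1 GiB of real usage, which is exactly the trade-off the restart should not have been forced to make.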
I would like to see the alert metrics fixed. Degrading performance this way is unwarranted, and it also affects the teams that rely on alerts behaving correctly.
P.S. Background on how cache is used and how to interpret the amount of "free" RAM can be found at http://serverfault.com/questions/85470/meaning-of-the-buffers-cache-line-in-the-output-of-free.
P.P.S. After re-reading the post and examining our alerts further, I see that our alerts are not related to server RAM utilization at all but to the PHP OPcache plugin. The alert messages are confusing: the title is "% Memory Used > 90", and it requires further drill-down in the New Relic interface to determine that the metric comes from a plugin monitoring application usage, not the server itself.
Hopefully this information helps others who are confused by RAM-usage alerts whose source is not apparent.