Linux agent problem with some EC2 instance

linux
agent
aws
infrastructure-beta
newrelic-infra

#1

Hi Everyone!

I’m having a strange issue with the newrelic-infra agent. I deployed the agent on 5 instances, but in the console I can see only 2 of the 5. I checked, and the agent is running on all of them. I really can’t understand why I don’t see the information in the dashboard.

I’m attaching part of the log from one of the instances that currently doesn’t show up in our console.

Oct 28 16:31:52 saps4qa newrelic-infra: 2016/10/28 16:31:52 Processing partial delta: /var/db/newrelic-infra/data/services/upstart.json
Oct 28 16:31:59 saps4qa init: ttyS0 (/dev/ttyS0) main process (22597) terminated with status 1
Oct 28 16:31:59 saps4qa init: ttyS0 (/dev/ttyS0) main process ended, respawning
Oct 28 16:32:09 saps4qa init: ttyS0 (/dev/ttyS0) main process (22617) terminated with status 1
Oct 28 16:32:09 saps4qa init: ttyS0 (/dev/ttyS0) main process ended, respawning
Oct 28 16:32:19 saps4qa init: ttyS0 (/dev/ttyS0) main process (22643) terminated with status 1
Oct 28 16:32:19 saps4qa init: ttyS0 (/dev/ttyS0) main process ended, respawning
Oct 28 16:32:22 saps4qa newrelic-infra: 2016/10/28 16:32:22 Processing partial delta: /var/db/newrelic-infra/data/services/upstart.json
Oct 28 16:32:29 saps4qa init: ttyS0 (/dev/ttyS0) main process (22712) terminated with status 1
Oct 28 16:32:29 saps4qa init: ttyS0 (/dev/ttyS0) main process ended, respawning
Oct 28 16:32:39 saps4qa init: ttyS0 (/dev/ttyS0) main process (22732) terminated with status 1
Oct 28 16:32:39 saps4qa init: ttyS0 (/dev/ttyS0) main process ended, respawning
Oct 28 16:32:49 saps4qa init: ttyS0 (/dev/ttyS0) main process (22757) terminated with status 1
Oct 28 16:32:49 saps4qa init: ttyS0 (/dev/ttyS0) main process ended, respawning
Oct 28 16:32:53 saps4qa newrelic-infra: 2016/10/28 16:32:53 Processing partial delta: /var/db/newrelic-infra/data/services/upstart.json
Oct 28 16:32:59 saps4qa init: ttyS0 (/dev/ttyS0) main process (22782) terminated with status 1
Oct 28 16:32:59 saps4qa init: ttyS0 (/dev/ttyS0) main process ended, respawning
Oct 28 16:33:09 saps4qa init: ttyS0 (/dev/ttyS0) main process (22801) terminated with status 1
Oct 28 16:33:09 saps4qa init: ttyS0 (/dev/ttyS0) main process ended, respawning
Oct 28 16:33:19 saps4qa init: ttyS0 (/dev/ttyS0) main process (22828) terminated with status 1
Oct 28 16:33:19 saps4qa init: ttyS0 (/dev/ttyS0) main process ended, respawning
Oct 28 16:33:23 saps4qa newrelic-infra: 2016/10/28 16:33:23 Processing partial delta: /var/db/newrelic-infra/data/services/upstart.json
Oct 28 16:33:29 saps4qa init: ttyS0 (/dev/ttyS0) main process (22897) terminated with status 1
Oct 28 16:33:29 saps4qa init: ttyS0 (/dev/ttyS0) main process ended, respawning
Oct 28 16:33:39 saps4qa init: ttyS0 (/dev/ttyS0) main process (22918) terminated with status 1
Oct 28 16:33:39 saps4qa init: ttyS0 (/dev/ttyS0) main process ended, respawning
Oct 28 16:33:49 saps4qa init: ttyS0 (/dev/ttyS0) main process (22944) terminated with status 1
Oct 28 16:33:49 saps4qa init: ttyS0 (/dev/ttyS0) main process ended, respawning
Oct 28 16:33:53 saps4qa newrelic-infra: 2016/10/28 16:33:53 Processing partial delta: /var/db/newrelic-infra/data/services/upstart.json
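
The excerpt above is dominated by ttyS0 respawn messages from init, which makes the agent's own lines hard to spot. A minimal sketch for keeping only the lines of one syslog program follows. The regular expression assumes the classic `Mon DD HH:MM:SS host program: message` layout seen in the excerpt, and `lines_for_program` is a hypothetical helper name, not part of any New Relic tooling.

```python
import re

# Classic syslog layout: "Oct 28 16:31:52 host program: message"
# (an optional "[pid]" after the program name is tolerated).
SYSLOG_LINE = re.compile(
    r"^(\w{3}\s+\d+\s[\d:]{8})\s(\S+)\s([^:\[\s]+)(?:\[\d+\])?:\s(.*)$"
)


def lines_for_program(log_text, program):
    """Return (timestamp, host, message) tuples for one syslog program name."""
    entries = []
    for line in log_text.splitlines():
        match = SYSLOG_LINE.match(line.strip())
        if match and match.group(3) == program:
            entries.append((match.group(1), match.group(2), match.group(4)))
    return entries


# A few lines copied from the excerpt in this post.
sample = """\
Oct 28 16:31:52 saps4qa newrelic-infra: 2016/10/28 16:31:52 Processing partial delta: /var/db/newrelic-infra/data/services/upstart.json
Oct 28 16:31:59 saps4qa init: ttyS0 (/dev/ttyS0) main process (22597) terminated with status 1
Oct 28 16:31:59 saps4qa init: ttyS0 (/dev/ttyS0) main process ended, respawning
Oct 28 16:32:22 saps4qa newrelic-infra: 2016/10/28 16:32:22 Processing partial delta: /var/db/newrelic-infra/data/services/upstart.json
"""

agent_entries = lines_for_program(sample, "newrelic-infra")
for timestamp, host, message in agent_entries:
    print(timestamp, host, message)
```

Run against the full excerpt, this leaves only the periodic "Processing partial delta" entries, which suggests the agent itself is not logging an error at the default log level.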

Let me know if I can provide more information from our side, or if you know a way to set the logging to DEBUG to show more information.

The instances use the same security group, the same Red Hat release, and the same AWS region.

Regards
DIB


#2

Hi @dbellini,

Could you please paste a link here to your infrastructure account where you’re seeing this behavior?

Are your instances cloned from each other? Do they all have distinct hostnames?

Cheers,


#3

Hi @ccastro thanks so much for your response!

This is the link to our infrastructure account: https://infrastructure.newrelic.com/accounts/727611/compute

The instances aren’t cloned. All were deployed with a manual process, but they are really similar to each other.

Another odd thing I noticed is a constant message in the events saying “Service restarted: ttyS0|” on just one of the boxes that are working.

All the boxes have different hostnames, but they are similar because they all start with the word SAP.

For example:

saphanadesa
saphanaqa
saphanaprod
saps4qa
sapwdqa
sapgwqa

and the only ones that work are

saphanadesa
saphanaqa

All were deployed with the same process.

Let me know if I can help with any other information from our side.

Regards
DIB


#4

I found something really strange. In the dashboard I see that the instances where the agent is running are “saphanadesa” and “saphanaqa”, but if I restart the service on “saphanaqa”, the event never shows up in the “Events” tab. Stranger still, if I restart the service on another instance that currently isn’t displayed, “saps4qa”, the event is reported in the “Events” tab under the instance “saphanaqa”.

It’s as if both instances were merged, or something like that.

I’m not sure if I’m being clear; let me know if I’m not.

I’ve attached some screenshots to try to make this clearer:

  1. The screen where I restarted the service on “saps4qa”.

  2. The screen that shows the service restart in the Events tab in New Relic Infrastructure (the process ID in the first picture is the same as the process ID shown in the Events tab).

  3. The tab where I can see the agents that are reporting information to our dashboard.

  4. The “saphanaqa” instance in our AWS account. The instance is an r3.8xlarge.

  5. The “saps4qa” instance in our AWS account. The instance is an m3.2xlarge.

  6. The “saphanaqa” instance in the New Relic dashboard, where the instance appears as a t1.micro when the real instance size is r3.8xlarge, as you can see in the 4th screen.
    The strangest part: I never deployed the infrastructure agent on our “NAT-INSTANCE”. I don’t know how the dashboard reached this instance.

Sorry for the long ticket and the poor English; I’m trying to be as clear as I can.

Let me know if you want more information or anything else from our side.

Regards
DIB


#5

After reviewing all this information again, I think I’ve found our problem with the agent, but not the solution. Apparently the problem arises because all our instances except “saphanadesa” use a NAT-INSTANCE to route all traffic to the internet, and I think your API treats all the agents as the same host. That would explain why I’m seeing the tag Name = NAT-INSTANCE and the tag InstanceSize = t1.micro in the agent information.

Do you know if it’s possible to fix this issue? Is my explanation clear?


#6

@dbellini,

Thanks very much for the detailed explanation and don’t worry, your English is perfect…

So I noticed you have an active AWS Integration, and all those “EC2 Tags” are coming from there. That’s how we get information like the instance type and the Tag Name.
Those are captured by the integration and not through the agent itself. The Agent captures the instance ID by querying the AWS API when it starts and uses that ID to stitch everything together with information coming from the AWS Integration.
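
As a rough illustration of the lookup described above, here is a minimal sketch of how a process running on an EC2 host can ask for its own instance ID. It assumes the lookup goes through the EC2 instance metadata endpoint at the link-local address `169.254.169.254`; the thread only says the agent queries "the AWS API", so the exact mechanism the agent uses is an assumption. `read_instance_id` and `http_get` are hypothetical names, and instances that enforce token-based metadata requests would need an extra step not shown here.

```python
import urllib.request

# Link-local EC2 instance metadata endpoint (plain, token-less request).
METADATA_URL = "http://169.254.169.254/latest/meta-data/instance-id"


def http_get(url, timeout=2.0):
    """Fetch a URL and return the body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def read_instance_id(fetch=http_get):
    """Return the instance ID, or None when no metadata service answers."""
    try:
        value = fetch(METADATA_URL).strip()
    except OSError:  # covers URLError, timeouts, and refused connections
        return None
    return value or None


if __name__ == "__main__":
    # On a host where the endpoint is unreachable this prints None.
    print(read_instance_id())
```

Whatever answers at that address determines the ID the host reports, so running the same lookup on a working and a non-working host is a quick way to see whether they identify themselves differently.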

I would expect the agent to grab each hostname locally from your different servers and use that as the identifier for each server in your dashboard.

To run a quick test on what we are grabbing, could you have a look at the files in /var/db/newrelic-infra/data/metadata? You should have a host_aliases.json there.
cat that file on at least one working and one non-working system, and please let me know its contents.

If that doesn’t reveal any useful information, I will bring you into a ticket so we can collect verbose logs and have a deeper look at the issue.

We definitely want to understand what’s going on in there, so we really appreciate your time in debugging this with us.

Cheers,


#7

Hi @ccastro,

Thanks so much for your response. I think the problem is definitely in the IP of our gateway. Our gateway is an EC2 instance named “NAT-INSTANCE”, and almost all of our instances have their internet traffic routed through this instance. This is the name I’m seeing on our dashboard for one of the two hosts that I see running now.
I think I’m seeing just 2 hosts in our dashboard because all the instances, except one of the two that I see running, reach your collector from the same IP. Since you know our AWS tags, the dashboard assumes that the agent is running on our “NAT-INSTANCE” (our default gateway), which really doesn’t have the agent deployed.

Anyway, I’m attaching the output of cat for the file you mentioned, so you can check the real name of each instance.

This is the info from the server that is currently working fine. This server has a public IP and doesn’t use our default gateway.

SAPHANADESA

[root@saphanadesa ~]# cat /var/db/newrelic-infra/data/metadata/host_aliases.json
{
	"hostname": {
		"alias": "saphanadesa.turner.com",
		"id": "hostname"
	},
	"hostname_short": {
		"alias": "saphanadesa",
		"id": "hostname_short"
	},
	"instance-id": {
		"alias": "i-4438dbd0",
		"id": "instance-id"
	}
}[root@saphanadesa ~]# initctl status newrelic-infra
newrelic-infra start/running, process 82849
[root@saphanadesa ~]# ifconfig
eth0      Link encap:Ethernet  HWaddr 06:C7:B0:6B:4E:41  
          inet addr:10.161.118.165  Bcast:10.161.118.255  Mask:255.255.255.128
          inet6 addr: fe80::4c7:b0ff:fe6b:4e41/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:9001  Metric:1
          RX packets:81513320 errors:0 dropped:0 overruns:0 frame:0
          TX packets:164346223 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:26770330595 (24.9 GiB)  TX bytes:237054590753 (220.7 GiB)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:5981407143 errors:0 dropped:0 overruns:0 frame:0
          TX packets:5981407143 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:9145007773657 (8.3 TiB)  TX bytes:9145007773657 (8.3 TiB)

This is the info from the servers that currently aren’t working (the agent is actually running, but I don’t see it in our dashboard). Both of these servers use our default gateway. I have 3 more instances with the same issue.

SAPHANAQA

[root@saphanaqa ~]# cat /var/db/newrelic-infra/data/metadata/host_aliases.json
{
	"hostname": {
		"alias": "saphanaqa",
		"id": "hostname"
	},
	"hostname_short": {
		"alias": "saphanaqa.turner.com",
		"id": "hostname_short"
	},
	"instance-id": {
		"alias": "i-d8f6724c",
		"id": "instance-id"
	}
}[root@saphanaqa ~]# initctl status newrelic-infra
newrelic-infra start/running, process 37480
[root@saphanaqa ~]# ifconfig
eth0      Link encap:Ethernet  HWaddr 06:6A:F7:89:C3:E7  
          inet addr:10.161.118.185  Bcast:10.161.118.255  Mask:255.255.255.128
          inet6 addr: fe80::46a:f7ff:fe89:c3e7/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:9001  Metric:1
          RX packets:698538210 errors:0 dropped:0 overruns:0 frame:0
          TX packets:697588707 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:369278165168 (343.9 GiB)  TX bytes:457111085961 (425.7 GiB)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:358090104 errors:0 dropped:0 overruns:0 frame:0
          TX packets:358090104 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:603681655421 (562.2 GiB)  TX bytes:603681655421 (562.2 GiB)

SAPS4QA

[root@saps4qa metadata]# cat /var/db/newrelic-infra/data/metadata/host_aliases.json 
{
	"hostname": {
		"alias": "saps4qa",
		"id": "hostname"
	},
	"hostname_short": {
		"alias": "saps4qa",
		"id": "hostname_short"
	},
	"instance-id": {
		"alias": "i-d8f6724c",
		"id": "instance-id"
	}
}[root@saps4qa metadata]#initctl status newrelic-infra
newrelic-infra start/running, process 16349
[root@saps4qa metadata]# ifconfig 
eth0      Link encap:Ethernet  HWaddr 06:33:22:39:85:0F  
          inet addr:10.161.118.166  Bcast:10.161.118.255  Mask:255.255.255.128
          inet6 addr: fe80::433:22ff:fe39:850f/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:9001  Metric:1
          RX packets:28083087 errors:0 dropped:0 overruns:0 frame:0
          TX packets:30076344 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:24553117093 (22.8 GiB)  TX bytes:55292743940 (51.4 GiB)
          Interrupt:34 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:1539729 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1539729 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:51191690 (48.8 MiB)  TX bytes:51191690 (48.8 MiB)
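
One detail stands out in the three outputs above: saphanaqa and saps4qa report the same instance-id alias, while saphanadesa reports a different one. A minimal sketch for checking this across any number of hosts follows. It assumes each host's host_aliases.json has been collected as text, `shared_instance_ids` is a hypothetical helper name, and the sample data is trimmed to the instance-id entry of each file pasted in this post.

```python
import json
from collections import defaultdict


def shared_instance_ids(alias_files):
    """Map each instance-id alias reported by more than one host to those hosts.

    alias_files: dict of {label: JSON text of that host's host_aliases.json}.
    """
    hosts_by_id = defaultdict(list)
    for label, raw in alias_files.items():
        aliases = json.loads(raw)
        instance_id = aliases.get("instance-id", {}).get("alias")
        if instance_id:
            hosts_by_id[instance_id].append(label)
    return {iid: hosts for iid, hosts in hosts_by_id.items() if len(hosts) > 1}


# Trimmed to the instance-id entry of each file pasted in this post.
pasted = {
    "saphanadesa": '{"instance-id": {"alias": "i-4438dbd0", "id": "instance-id"}}',
    "saphanaqa": '{"instance-id": {"alias": "i-d8f6724c", "id": "instance-id"}}',
    "saps4qa": '{"instance-id": {"alias": "i-d8f6724c", "id": "instance-id"}}',
}

duplicates = shared_instance_ids(pasted)
print(duplicates)
```

If the instance ID is what ties agent data to the AWS Integration data, as described earlier in the thread, two hosts sharing one ID would plausibly show up as a single entity in the dashboard.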

Let me know if you need any other information from our side, or if any part of my explanation is unclear. I’m glad to help with the troubleshooting.

Regards
DIB


#8

Hi guys!

Do you have any update on this issue? If you want, I can run any kind of test or check that you need to debug the client. I think the dashboard could be really helpful for us, which is why I want to troubleshoot this issue.

Let me know if I can do anything else from our side.

Regards
DIB


#9

Hi @dbellini,

Apologies for the radio silence here.
I’m going to open a ticket for you so we can collect some logs and have a deeper look at what might be happening in there.
I’m still not convinced that the fact that several hosts are being routed through your NAT-INSTANCE is the key here, so I want to have a closer look.

Watch for an email from me about this ticket.

Cheers,


#10

Hello, I am having the same issue. Did you manage to find a solution?

Thanks,


#11

Hi @aurelien11 - thanks for chiming in. The problem that’s described here is very specific to the customer’s environment and a lot has changed since it was first reported. Would you be able to provide additional information about what’s wrong? A link to your account, your OS, agent version, and any relevant verbose logs would be helpful.


#12

Hi,

I thought this was related, so I posted here. I am sorry if it was not appropriate, but I still believe it was the same issue.
I wrote a post, which I have since solved, here: Infra is showing one host instead of two, with all the details you asked for. No need to answer. I just think that for newcomers to New Relic it would be nice to specify in the documentation that if you use an EC2 instance as a gateway together with the AWS EC2 plugin, you should disable cloud metadata on the hosts behind the gateway so they can be identified by New Relic Infrastructure.
I don’t have a solution if those hosts are also on EC2 (which is not my case).
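
For anyone wanting to script the workaround described above, here is a minimal sketch that sets one top-level key in a flat YAML-style config text. The option name `disable_cloud_metadata` is an assumption on my part (the post only says "disable cloud metadata"), so please check the agent documentation for your version before relying on it. `ensure_setting` is a hypothetical helper that only handles simple `key: value` lines, not nested YAML.

```python
def ensure_setting(config_text, key, value):
    """Return config_text with a top-level `key: value` line present exactly once."""
    wanted = f"{key}: {value}"
    result = []
    replaced = False
    for line in config_text.splitlines():
        is_top_level = not line.startswith((" ", "\t", "#"))
        if is_top_level and line.split(":", 1)[0].strip() == key:
            # Replace the first occurrence and drop any later duplicates.
            if not replaced:
                result.append(wanted)
                replaced = True
            continue
        result.append(line)
    if not replaced:
        result.append(wanted)
    return "\n".join(result) + "\n"


# Placeholder config; the option name below is an assumption, see above.
original = "license_key: YOUR_LICENSE_KEY\n"
updated = ensure_setting(original, "disable_cloud_metadata", "true")
print(updated, end="")
```

The agent would normally need a restart after its config file changes for a new setting to take effect.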