Transactions vs Page Views + Bot Detection

Insight

https://insights.newrelic.com/accounts/1050354/dashboards/217484?now=1478617475650

NRQL:

View 1 (CountUnique Transaction Guest IDs)
SELECT uniqueCount(guestID) FROM Transaction SINCE yesterday WHERE appName = '[NEMESIS] Remington [PHP 7]'

View 2 (CountUnique Session from PageView)
SELECT uniqueCount(session) FROM PageView SINCE yesterday WHERE appName = '[NEMESIS] Remington [PHP 7]'

Issue:

Hi, I am having the following issue with the views created above.
In Google Analytics the number of recorded sessions is only off by a couple of hundred compared to "View 2" (CountUnique Session from PageView).

When I created "View 1" I expected the figure to be less than or equal to "View 2", but I was confused to find it was 20,000 higher.
I created a cookie that holds a unique guest tracking ID and expires in 1 year; one is generated the first time a user visits the site. I tested revisiting the site and confirmed that I always kept the same ID from my first visit. I push this ID in PHP through newrelic_add_custom_parameter('guestID', $guestID); on every page request.
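The flow described above might look roughly like this. This is a hedged sketch, not the original code; the cookie name and ID format are assumptions.

```php
<?php
// Sketch of the guest-ID cookie described above (assumed names/values).
const GUEST_COOKIE = 'guest_id'; // hypothetical cookie name

if (isset($_COOKIE[GUEST_COOKIE])) {
    // Returning visitor: reuse the ID from the first visit.
    $guestID = $_COOKIE[GUEST_COOKIE];
} else {
    // First visit: generate an ID and keep it for one year.
    $guestID = bin2hex(random_bytes(16));
    setcookie(GUEST_COOKIE, $guestID, time() + 365 * 24 * 60 * 60, '/');
}

// Attach the ID to the current transaction (requires the New Relic PHP agent).
if (extension_loaded('newrelic')) {
    newrelic_add_custom_parameter('guestID', $guestID);
}
```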

Can anyone clarify why I have such a large difference between my unique Transaction guest ID count and my unique PageView session count?

Hi @marc.newton - session management is always a thorny issue and can be the result of several different factors. In this case, New Relic only tracks sessions, not users, from the script embedded in the page. Multiple browser tabs or application windows result in more sessions, as this is what the browser creates internally.

Tracking users with cookies will be more accurate, unless users block or regularly delete the cookies.

That’s my summary and I’m sure I have read more detail on this in the forums but I’m already late leaving for an appointment.

Yes, that is what I thought, but there are 20,000 more cookies recorded than sessions, so something more than the expected behaviour is going on.

I added an image to the original post to show how the different queries got various results.

My point is that I am not really sure which of the following are true:

  • Is New Relic's uniquecount() working correctly on custom Transaction data?
  • Is the extra recorded traffic bots & crawlers?
  • Does New Relic filter out bot traffic when counting sessions?
  • Can a bot or crawler actually cause a cookie to be set?

I have done a little more investigation. What I found is that there is no simple way to tell on the first hit whether the request came from a bot, and on a second request from a bot the cookie value would still not be set. So a fresh guest ID is being generated for every bot request in a Transaction, and this type of request does not generate a page view/session, hence the large difference between the counters.

I put in a regex on the user agent so that if I can pick up that it is a bot/crawl/slurp or spider, the ID will be set to "bot".
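An illustrative version of that check, assuming a simple keyword pattern (the exact regex in the original post is not shown, so this pattern is a guess):

```php
<?php
// Hypothetical user-agent bot check based on the keywords mentioned above.
function isLikelyBot(string $userAgent): bool
{
    return (bool) preg_match('/bot|crawl|slurp|spider/i', $userAgent);
}

$userAgent = $_SERVER['HTTP_USER_AGENT'] ?? '';

// Collapse all bot traffic onto a single guest ID so it no longer
// inflates uniqueCount(guestID) in Insights.
if (isLikelyBot($userAgent)) {
    $guestID = 'bot';
}
```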

To test this theory I want to be able to basically:
SELECT uniquecount(guestID) FROM Transaction WHERE count(guestID) > 1 AND guestID != 'bot'

This is so that I only bring back a tally of unique guest IDs that have two or more transactions recorded against them, for a billboard total.

This would also allow continued tracking of a returning visitor's transaction activity as long as the cookie remains (it is set to expire in 1 year).

I'm not sure how to do this in NRQL. Any ideas?

I have added in some degree of crawler/spider detection so I can see how many of the total transactions and unique cookies are bots.

I am still trying to identify why, in the image below, the unique cookie guests count is much larger than total PageViews in the same time period.
If there were 700 unique (actual human) cookies set in the last full hour, why are there not 700+ page views and sessions to accompany that volume?

I am still looking for a way to do a unique count of guestID FROM Transaction where the number of transactions for each ID is greater than one (to find out how many guestIDs have > 1 transaction).

Hi @marc.newton - with regards to the number recorded in Total Page Views being too low: it could be that some of your pages are not instrumented correctly with New Relic Browser. For example, if you use multiple servers for your website, then one or more might have auto-instrumentation disabled.

If you consider the number of page views that are currently being recorded, does that meet your expectations? And does that number match other tools such as Google Analytics? If not then that might mean not all pages are instrumented correctly with New Relic Browser. Going through these troubleshooting steps might in that case help you identify why too few page views are being recorded.

Hi @nsiekman
The number of unique users in NR 1,790 vs GA (Google Analytics) 1,734.

The number of Page Views is totally off, however: GA ~4,000 vs NR ~3,000.

We have one server with the auto-instrumentation enabled.

I have just been reading up on an option to pass transactions to PageViews, I might see what happens there.

I think the unique cookie guests count is too high rather than Page Views too low, but if you compare GA page views against the NR guest ID count, you correctly have more views than guests. In that case you could argue that perhaps NR Page Views is too low.

Overall the GA stats are not drastically out from the NR PageView data, but compared with the unique cookie ID count from Transactions, both are way off.

Hey @marc.newton - it sounds like you’ve done a ton of investigation here already - that’s great. Based on what I’m understanding, it sounds like you have two questions:

  • Why might there be a smallish discrepancy between NR Page Views and GA page views when the number of sessions is very close?
  • Why is the number of guestIDs attached to your Transaction events so much larger than either?

In general, NR and GA can have slightly varying page view counts for a couple reasons, usually configuration differences, or incomplete instrumentation (something @nsiekman mentioned as a possibility). More on this in Browser data doesn’t match other analytics tools.

The second question is more specific to your data, since you’re creating the guestIDs. I created a query along the lines you mentioned

I am still looking for a way to only do a Unique Count of guestID FROM Transaction where the number of transactions for each ID is greater than one. (To find out how many guestIDs have > 1 transaction).

If I understand what you meant, we can do this with FACET:

SELECT count(guestID) FROM Transaction SINCE 2 hours ago UNTIL 1 hour ago FACET guestID limit 1000

So there are a few IDs with many transactions, and many with 1.

To get more detail about how Transaction events and PageView events are distributed over different transaction names, I ran two queries:

SELECT count(*) FROM Transaction where guestID is not null SINCE 2 hours ago UNTIL 1 hour ago FACET name limit 1000

SELECT count(*) FROM PageView SINCE 2 hours ago UNTIL 1 hour ago FACET name limit 1000

When comparing these queries, there are many fewer page views overall, especially for one particular transaction (hair-woahs). So it seems like there are more Transactions than PageViews, which might be expected or not. We cover this in our document App server requests greatly outnumber Browser PageView transactions.

Hope that provides some avenues for exploration!

Thank you Alexis, this does indeed confirm what I thought was going on. I am now making alterations to the custom transaction data to enable NRQL to exclude those anonymous transactions, which should hopefully bring the unique guest IDs down by more than half.

I will let you know as soon as Insights has collected enough data on the new metrics for a valid result.

I have completed my investigation sooner than expected. After I implemented a theory this morning I got near-perfect results. There will still be the odd bot that can't be identified, but I believe I now have the number of actual humans viewing the site.

All of the figures I raised concerns over now fall into line.

I can happily confirm that, in my opinion, Google Analytics was recording more page views than I expected. All implementations were checked to ensure they fire only once per site instance.

I am getting the following results, as I expected before I started this investigation:

  • The number of Page Views is greater than the number of sessions.

  • The number of users & bots online is equal to or slightly greater than the number of sessions, but the session count is never less than the number of users alone. The discrepancy, as expected, is always with bots (traceable vs untraceable).
    This is expected, since some bots may not cause a traceable session to be created.

  • The number of transactions is always equal to or slightly greater than the number of guests online (CRON jobs etc. contribute to Transactions).

  • The total number of transactions should be equal to or greater than the number of page views. Most of the time they are: when looking at longer view periods, or after about a 1 minute lag, they fall into line (the image attached is real time). I put this down to a small expected latency in the collection of PageView data vs Transaction data; they are two very different collection sources, so some latency is always expected in a real-time view, since PageViews come from JavaScript in the client browser and Transactions come from the server daemon.
    The server daemon is more likely to deliver its data faster than the user's end point, unless your server can somehow invert time :slight_smile:

Total Page Views & Active Sessions are counted from the PageView table, whereas the other 4 blocks are counted from the Transaction table. The two separate sources of data marry up.

In another comparison, Google Analytics reported 15 users online; the New Relic chart reported 12 users & 3 bots! :sunglasses:


Love the detail and so glad you resolved your query.

I'm interested in the approach you took for:

"All implementations were checked to ensure they fire only once per site instance."

Did you have to make changes to Google Analytics?


We did revamp how GA was done.

Tests were done on a new release of a site on a new custom CMS we are building. Support for tracking tools was built in very early, so we could ensure that the logic of the code could only ever include tracking scripts once. Also, NR was added via the auto-injection method for PHP, to rule out human error when manually adding the embed code.

To double-check both NR and GA, I used Google's Tag Assistant plugin for Chrome. The assistant clearly alerts you to any duplicated or resubmitted data, including GTM (Google Tag Manager) and even GTM data layer events.
If you have a direct integration of GA and one of GTM, and someone then also adds GA to GTM, you would be notified of the two separate calls made by both implementations firing at the same time.
We ensured that the only implementation was GTM, that there was only one account, and that GA was fired by GTM only.

I also kept an eye on the Network tab in the Chrome Developer Tools window to check whether any communications were duplicated in the browser, taking a closer look at XHR and JS for outside calls.
Are the GTM/GA/NR scripts attempted only once? Is there any DOM duplication of injected scripting visible in the final source view?

That’s how I checked implementation.


Thanks for the details again @marc.newton. I suspect that this will be a highly useful reference.

No worries. I will share later on how to collect the data I have been adding into Transactions, once I have finalized whether any additional data is needed while setting up the dashboards to display some appropriate charts, as I want to use the Transaction data to power the insights previously built with PageViews :slight_smile:


Thanks so much for sharing all of this amazing detail @marc.newton! This is now an incredible resource for all of our other community members! :unicorn:

Hey everyone, just a progress update. I am working on an object that handles the collection and parsing of raw data. I am still testing and expanding this object as I research the data that is collected.

The object creates a single instance of itself, so repeated calls to the method return the same singleton.

Here is what the object does so far:

Application Naming

It uses a setting from a project configuration file.
My use case is that I have a single application on a scalable server serving multiple brands, which in turn serve multiple dedicated-language websites. I group the languages by brand, using the brand name as the application name; in Insights I then filter by "header.requests.host", as each domain's format is suitable to filter by (country_code.my-brand.com). What I have is 3 applications listed, each collecting stats from up to 50 websites.

\NewRelic::track()->appName(BRAND_NAME);

And to consolidate recorded transactions, to maintain the efficiency of NR, I use method names or custom names; for example, in APM /product/my-product-name is recorded as \shop\product.view

\NewRelic::track()->nameTransaction(__CLASS__ . '.' . __FUNCTION__);
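A hypothetical sketch of what such a wrapper might look like. The method names follow the snippets above, but the internals are assumptions, not the author's code; it only assumes the standard New Relic PHP agent functions.

```php
<?php
// Hedged sketch of a singleton tracker wrapping the New Relic PHP agent API.
class NewRelicTracker
{
    /** @var self|null */
    private static $instance;

    public static function track(): self
    {
        // Single instance, so repeated calls share the same state.
        if (self::$instance === null) {
            self::$instance = new self();
        }
        return self::$instance;
    }

    public function appName(string $name): self
    {
        if (extension_loaded('newrelic')) {
            newrelic_set_appname($name);
        }
        return $this; // allow daisy-chaining
    }

    public function nameTransaction(string $name): self
    {
        if (extension_loaded('newrelic')) {
            // e.g. "\shop\product.view" instead of the raw URL
            newrelic_name_transaction($name);
        }
        return $this;
    }
}
```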

Guest Tracking
Every visitor is given a tracking ID that lasts for a lifetime; the idea is to track a returning user long after a session has closed. Transactions and custom events passively apply the guest ID. Hopefully in the future NRQL will allow more complex joins, so we can join data from one table to another on the guestID.

Passive Bot Detection & Manual Scheduled Task Flagging
It can track whether the visitor is a robot or a human, and a daisy-chain function (isTask()) allows you to flag an operation as a background_job.
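One way such an isTask() flag could work, sketched under the assumption that it leans on the agent's background-job support (this is a guess at the mechanism, not the author's implementation):

```php
<?php
// Hypothetical task-flagging helper for the daisy-chain described above.
function flagAsTask(): void
{
    if (extension_loaded('newrelic')) {
        // Tell the agent this is a background job rather than a web request...
        newrelic_background_job(true);
        // ...and record a custom attribute so NRQL can filter on it,
        // e.g. WHERE background_job = 'true'
        newrelic_add_custom_parameter('background_job', 'true');
    }
}
```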

Passive Marketing Analytics (Referrals)
For the marketing team I experimented with tracking referrals & the ability to monitor the effectiveness of campaigns.
Referral event captures have two additional options:

$internal (bool), default false
Toggles whether or not to log internal traffic on the sites. Turning this on allows you to monitor page throughput across transactions. We could use this data to visualize the journey of a human or a bot.

$bot (bool), default false
When left false, bots are simply ignored and no referral data is recorded for them.
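The two options above could be wired together roughly like this. The $internal/$bot flags come from the post; the function name, event name, and body are assumptions.

```php
<?php
// Hedged sketch of referral capture with the $internal and $bot options.
function recordReferral(bool $isBotVisitor, bool $internal = false, bool $bot = false): void
{
    if (!extension_loaded('newrelic')) {
        return;
    }

    $referrer = $_SERVER['HTTP_REFERER'] ?? '';
    if ($referrer === '') {
        return; // nothing to record
    }

    // $bot = false: ignore bots, recording no referral data for them.
    if ($isBotVisitor && !$bot) {
        return;
    }

    // $internal = false: skip referrals from our own pages; enabling it
    // lets you monitor page-to-page throughput within the site.
    $sameHost = parse_url($referrer, PHP_URL_HOST) === ($_SERVER['HTTP_HOST'] ?? '');
    if ($sameHost && !$internal) {
        return;
    }

    // Record the referral as a custom event, queryable in Insights.
    newrelic_record_custom_event('Referral', [
        'referrer' => $referrer,
        'internal' => $sameHost,
        'bot'      => $isBotVisitor,
    ]);
}
```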

Here is a result of referral capture:

I added unique guests from Transactions to the screen. The low number of guests against the high number of referrals suggests that the same guest is revisiting the site from another search, or the same search again later in the day; there are a number of contributing reasons why a single user would be re-referred multiple times. Seeing this, I could produce another copy of all the charts using uniqueCount(guestID) to get a referred visitor count rather than a referred hit count.

Campaign Monitoring
I started an experiment to also track specific campaigns in a dedicated campaign dashboard.
How many referrals the campaign brought in and where the traffic was sent to.

There is still plenty for me to do and refine, I will come back with another update in a week.


Quick update: I have done a variable and function name revision pass on the NR PHP class and started documenting it, getting it ready for public release.

Today I have gone over the bot detection data I have collected and made a couple of minor tweaks to fix some agents reported as bots that other developers have previously thought to be bots; one of them was the browser name for the "Kindle Fire", and the other a browser called "Titan" on a non-mainstream Android phone.

Also today I started doing more in-depth, humanized identification of bots so I can do specific filtering of bot activity. The data can be used to identify, much earlier than before, pages being crawled by bots that you do not want crawled. In the image below you can see a bot has crawled basket/add; I can filter to search a list of all bot agents that crawled /basket/add and then set about resolving the problem much more quickly.

From the image above I can see that there are some strange URIs that need to be investigated. I have yet to check whether referrer data is collectible from a crawled link; that will conclude in next week's update, when I have collected the data.


Hi @marc.newton. What you are doing is ridiculously awesome, and the community is definitely benefiting from it!! We look forward to hearing about your next update, and if you have any scoped questions, let us know.