Introduction to Bot Traffic — Part One of our Bot Analytics Series

As much as 40% of internet traffic is from non-human sources. This traffic ranges from legitimate bots and crawlers to nefarious automated programs that may be controlled by hackers, fraudsters, or competitors. The high volume and wide variety of bad bots make defending against this threat difficult. Find out more about bad bot traffic and the role DeviceAtlas can play in your bot management strategy.

Bot management is becoming an increasingly vital part of any digital strategy. As much as 40% of internet traffic is automated, yet accurately detecting bot traffic is harder than it has ever been. Many web analytics solutions do not detect or identify bot traffic, but instead log it with traffic generated by humans, skewing aggregated data.
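To illustrate how unlabelled bot traffic can skew an aggregate metric, here is a minimal sketch. The session records, the `is_bot` flag, and the numbers are all invented for illustration; they stand in for whatever bot detection your own analytics pipeline provides.

```python
from statistics import mean

# Hypothetical session records: (pages_viewed, is_bot).
# The values are invented purely to show the effect.
sessions = [
    (5, False), (3, False), (4, False),  # human visitors
    (1, True), (1, True),                # bots requesting one page each
]

def average_pages(records, include_bots):
    """Average pages per session, optionally leaving out bot sessions."""
    pages = [count for count, is_bot in records if include_bots or not is_bot]
    return mean(pages)

blended = average_pages(sessions, include_bots=True)       # bots mixed in
humans_only = average_pages(sessions, include_bots=False)  # bots filtered out
print(f"blended: {blended:.1f}, humans only: {humans_only:.1f}")
```

In this made-up sample, two single-page bot sessions are enough to pull the blended average well below the figure for human visitors alone.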

Often, bot traffic is harmless, and is used by organisations to gather data for useful purposes. For example, Google’s web crawler, Googlebot, continuously visits websites to gather up-to-date information for Google’s search index. Googlebot visiting your site is beneficial when you want your site to be seen and visited via Google Search.

However, some bots are designed specifically to behave in a harmful way, such as probing for security weaknesses or generating fake ad clicks. Malicious bot developers constantly devise new ways to operate, thanks to easily accessible and inexpensive crawling services.

But what exactly is malicious bot traffic, and how can you prepare against it? Fortunately, DeviceAtlas offers a solution that can detect certain unwanted bots and let you take action against them. Continue reading to find out more about bot traffic and how bad bots can negatively impact your website, mobile apps, and APIs.

What are internet bots?

Internet bots are programs that operate on the internet and perform automated tasks. Some bots mimic human activity, but carry it out at a much higher rate than a person could. Like them or not, bots represent a fundamental part of the web ecosystem.

Good bots

A "good" bot performs tasks that are useful or helpful to internet users. Examples are:

  • Search engine bots: bots that retrieve and index web content so that it can show up in search engine results for relevant user searches, e.g. Google or Bing
  • Copyright bots: bots that crawl websites looking for content that may violate copyright law, such as plagiarised text, music, images, or videos, e.g. YouTube’s copyright bot
  • Site monitoring bots: These bots monitor website metrics and can alert users of downtime or drops in page load speeds, e.g. Pingdom
  • Content monitoring bots: Bots that crawl the Internet to monitor information, such as news reports or customer reviews
  • Feed bots: These bots crawl the Internet looking for newsworthy content to add to a platform’s news feed and are often operated by content aggregator sites or social media networks

Other types of programs, such as chatbots or personal assistant bots, are sometimes referred to as bots, but they are different in that they interact with humans instead of crawling web content.

Malicious bot activity

Bad bots interact with applications in the same way a legitimate user would, making them harder to detect and prevent. They enable high-speed abuse, misuse, and attacks on websites, mobile apps with server side components, and APIs. They allow bot operators, attackers and fraudsters to perform a wide array of nefarious activities, such as:

  • probing for unpatched security weaknesses, e.g. vulnerabilities that allow attackers to run malicious code by exploiting a known security bug that has not been patched. The bot probes your environment looking for unpatched systems, and then either attacks them directly or logs the existence of the vulnerability for later exploitation
  • web scraping, i.e. bots that harvest content and data from a website. Web scrapers extract the underlying HTML code as well as the data stored in a database. The bot can then replicate the content elsewhere
  • personal and financial data harvesting, i.e. bots that automatically extract large amounts of information from websites about users such as location, gender, age to create profiles that are auctioned online
  • brute-force login attacks, i.e. using trial-and-error to guess login details or encryption keys, or to find a hidden web page
  • digital ad fraud, i.e. representing online advertising impressions, clicks, conversions, or data events, in fraudulent ways, in order to generate revenue
  • forum spam, i.e. posting unsolicited content in forums such as fake reviews or advertising
  • transaction fraud, i.e. when a stolen payment card or data is used to generate an unauthorised transaction
  • scalper bots, i.e. bots that scan retail websites at the beginning of a sale ahead of individual buyers to buy up stock that will later be resold at a large profit
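As one illustration of how such high-speed abuse might be spotted, brute-force login attempts tend to produce many failures from one client in a short time. The sketch below counts failed logins in a sliding time window. The class name, thresholds, and the use of an IP address as the client identifier are all assumptions made for the example, not a description of any particular product.

```python
from collections import defaultdict, deque

class FailedLoginTracker:
    """Flags clients whose failed logins exceed a limit within a time window."""

    def __init__(self, max_failures=5, window_seconds=60.0):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._failures = defaultdict(deque)  # client id -> failure timestamps

    def record_failure(self, client_id, timestamp):
        """Record one failed login; return True if the limit is exceeded."""
        events = self._failures[client_id]
        events.append(timestamp)
        # Drop failures that have fallen out of the sliding window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        return len(events) > self.max_failures

# Example: allow at most 3 failures per 10 seconds (arbitrary values).
tracker = FailedLoginTracker(max_failures=3, window_seconds=10)
flags = [tracker.record_failure("203.0.113.7", t) for t in (0, 1, 2, 3)]
print(flags)  # [False, False, False, True]
```

A counter keyed on IP address alone is easy to evade by spreading attempts across many addresses, so treat this as one signal among several rather than a complete defence.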

Advanced Persistent Bots (APBs)

Advanced Persistent Bots (APBs) plague websites with malicious activity such as:

  • account takeover attacks, i.e. a type of identity theft where a bad actor gains unauthorised access to an account belonging to someone else
  • credential stuffing, i.e. where bots use usernames and passwords stolen from one organisation to try to access user accounts at another organisation
  • content and price scraping, i.e. extracting content and pricing information from a website for analysis. This can involve the use of a headless browser, which is a standard browser adapted to be operated by a bot rather than a human. This improves the bot’s ability to masquerade as a human.

APBs often avoid detection by cycling through ranges of IP addresses, tunnelling through anonymous proxies, changing their identities, and mimicking human behaviour. According to the 2020 edition of Imperva's annual "Bad Bot" report, which covers 2019, bad bot traffic rose to its highest ever share, 24.1% of all internet traffic. APBs accounted for 73.7% of this total.
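Taken together, the two percentages quoted above imply the share of all internet traffic attributable to APBs. A quick calculation using only those two figures (the variable names are illustrative):

```python
bad_bot_share = 0.241     # bad bots as a fraction of all internet traffic
apb_share_of_bad = 0.737  # APBs as a fraction of bad bot traffic

# Multiply the two fractions to get APBs as a fraction of all traffic.
apb_share_of_all = bad_bot_share * apb_share_of_bad
print(f"APBs: {apb_share_of_all:.1%} of all traffic")  # APBs: 17.8% of all traffic
```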

The impact of bots on publishers and advertisers

Most of the time, good crawlers will declare themselves clearly in the user agent string, which they send to a website when they request a page. Here are some examples:

"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
"Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
"Mozilla/5.0 (compatible; DuckDuckBot-Https/1.1; https://duckduckgo.com/duckduckbot)"

These three lines show user agent strings provided by bots operated by three popular search engines. These good bots identify themselves clearly.
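Because these crawlers announce themselves, a simple pattern match on the user agent string is enough to pick them out. The following is a minimal sketch, assuming a token list built only from the three examples above; the function and constant names are hypothetical, and a real deployment would need a far larger, maintained list.

```python
import re

# Tokens taken from the three example strings above (illustrative only).
DECLARED_BOT_PATTERN = re.compile(r"googlebot|bingbot|duckduckbot", re.IGNORECASE)

def declared_bot_name(user_agent):
    """Return the matched crawler token in lower case, or None if absent."""
    match = DECLARED_BOT_PATTERN.search(user_agent)
    return match.group(0).lower() if match else None

examples = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; DuckDuckBot-Https/1.1; https://duckduckgo.com/duckduckbot)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
]
for ua in examples:
    print(declared_bot_name(ua))
```

This only recognises bots that choose to declare themselves. The user agent string is supplied by the client, so a match says nothing about whether the request really comes from the named operator.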

Malicious bots can defeat standard IVT (invalid traffic) detection. Scrapers and other bad bots disguise themselves so that they are difficult to detect and block. They deliberately identify themselves as Chrome, Safari, Firefox, etc. in order to masquerade as legitimate users. They can even disguise their IP address using proxy services, making it very hard for publishers to block them. Some of these malicious bots collect cookies to make themselves appear to be part of specific audiences that advertisers want to target.

Other bots set out to attack legitimate publishers. There are documented cases of fraudsters attempting to extort publishers by threatening to send them bot traffic that would get them banned from Google AdSense unless they paid. Even though some publishers paid, they were still hit with large volumes of bot traffic, resulting in suspension and the loss of all their ad revenue.

A particular type of malicious bot is designed to generate fake ad revenue by masquerading as users clicking on ads. If such bots are not detected and blocked, an advertiser might end up paying large sums for ads that are never actually seen by a human.

Ad networks and exchanges are supposed to filter these bots, so that no ad calls or bids for ads are made. However, it is hard to be certain that they are doing so, or doing it correctly. That is why you need your own analytics to monitor bot traffic and confirm that filtering is being carried out properly on your behalf. Ad tech vendors often block the bots named in the deny lists that trade associations provide to their members. But considering there are tens of thousands of named bots, the industry-supplied lists are probably only catching a small fraction of them, which means the ad tech vendors are only blocking a small fraction of bot traffic in campaigns. The vendors do not have any incentive to block more, because that would mean reduced volumes and profits.

A growing problem

The high volume and wide variety of bad bots make defending against this threat difficult. Furthermore, the increasingly complex nature of IT infrastructure can make it difficult to keep track of potential vulnerabilities. Good bots can share characteristics with malicious bots, and bad bots can even masquerade as well-known good bots such as Googlebot. This makes it challenging, when putting together a bot management strategy, to ensure that good bots are not inadvertently blocked.
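One widely used way to check a client that claims to be a well-known crawler is a reverse DNS lookup on its IP address followed by a forward lookup to confirm the result. The sketch below is a hedged illustration of that idea, not a description of how DeviceAtlas works. The function name and the injectable lookup parameters are hypothetical, the hostname suffixes are assumptions that should be confirmed against the crawler operator's current documentation, and the default lookups handle IPv4 only.

```python
import socket

# Assumed hostname suffixes for Google's crawlers; confirm against the
# operator's current documentation before relying on them.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Reverse-resolve the IP, check the domain, then forward-confirm it."""
    # Default to real DNS lookups; callers may inject their own for testing.
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        hostname = reverse_lookup(ip)
        if not hostname.lower().endswith(GOOGLE_SUFFIXES):
            return False
        # The hostname must resolve back to the same address.
        return ip in forward_lookup(hostname)
    except OSError:
        # Covers failed lookups (socket.herror and socket.gaierror).
        return False
```

The forward confirmation matters because the owner of an IP range controls its reverse DNS records and could point them at any hostname, whereas the forward record for a crawler hostname is controlled by the crawler's operator. Since each check involves DNS round trips, results would normally be cached rather than looked up on every request.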

Malicious bots can incur a considerable cost in terms of sheer traffic. It has been estimated that 20 to 30% of infrastructure and bandwidth costs can be attributed to bot traffic. While it can be tempting for a site administrator to implement a very stringent bot management strategy, bot defence that is overzealous can lead to unhappy users. For example, CAPTCHAs are often used to block bots, but they can easily result in user frustration. Also, an overly stringent bot policy might block legitimate bots and thereby prevent indexing of a website, with a resulting drop in search rankings. For these reasons, bot traffic needs to be managed carefully.

How can DeviceAtlas help?

DeviceAtlas contains extensive information on bots. This is made possible through a global network of honeypots that provide visibility of traffic behaviour. DeviceAtlas detects and identifies bot traffic, and improves user experience by minimising the impact of bots on website and/or campaign performance, by:

  • Identifying bots based on their headers in real time to ensure ads are served to real users. If a bot imperfectly masquerades as a desktop or mobile device in its headers, DeviceAtlas will be able to distinguish between it and legitimate traffic
  • Understanding bot traffic levels to reconcile ad revenues and web traffic reports

The benefits of this include:

  • Improving user experience
  • Reducing your infrastructure costs

Request a trial to start gaining visibility on your bot traffic now.