One unresolved problem on the web is the detection of bots: programs that drive a web browser to emulate human behavior. Bot detection matters in web development because automated tools can degrade the user experience, such as a bot that replies to Twitter posts with angry messages. During a research internship at Brave, a privacy-focused web browser, we worked on a new client-side approach to detecting user humanity.
In addition to degrading the user experience, bot tools can negatively impact a business by automatically scraping and republishing all of the content on its website. Bots can also be used for fraud or website hacking, automatically crawling a site's perimeter and probing its security. In response, most websites have deployed some form of bot detection, ranging from visible, obtrusive countermeasures (e.g., CAPTCHAs) to heuristic detection methods. The prevalence of botting tools has led to widespread adoption of CDNs, such as Cloudflare, which automatically block clients that make repeated or suspicious HTTP requests, such as those containing SQL fragments. In the case of Brave, fraud is a considerable concern, as one goal of the company is to reward users for viewing adverts without revealing user identity or sending network requests back to the client.
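To make the "suspicious HTTP requests" heuristic concrete, here is a deliberately naive sketch of a request filter that flags query strings carrying common SQL-injection fragments. This is purely illustrative: real services such as Cloudflare use far richer signals (rate, reputation, fingerprints) than a single regular expression, and the pattern below is an assumption of ours, not their rule set.

```python
import re

# Illustrative only: a tiny pattern for classic SQL-injection probes.
# Production filters are vastly more sophisticated than this sketch.
SQL_PATTERN = re.compile(
    r"(\bunion\b.*\bselect\b|\bor\b\s+1=1|--|;\s*drop\b)",
    re.IGNORECASE,
)

def looks_suspicious(query_string: str) -> bool:
    """Flag query strings containing common SQL-injection fragments."""
    return bool(SQL_PATTERN.search(query_string))

print(looks_suspicious("page=2&sort=asc"))  # benign request: False
print(looks_suspicious("id=1 OR 1=1 --"))   # classic injection probe: True
```

A CDN acting on such a signal would typically block or challenge the client rather than serve the page.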
This work applies anomaly detection to identify client humanity through runtime behavior. The biggest concerns raised by this approach are privacy and verifiability: the proof of humanity is generated client-side, but must be verifiable by a third party (Brave) without exposing private user data. We approached the problem with anomaly detection, a technique that requires only positively labeled data to infer which behavior is normal (in our case, human) and which is abnormal (produced by a bot). Anomaly detection was particularly appealing for our use case because we did not want to rely on observed bot behavior when developing our model.
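The core idea of training on positively labeled data only can be sketched with a minimal one-class detector: fit a per-feature Gaussian baseline from human samples alone, then flag any sample that deviates strongly from that baseline. The feature choice below (mean and standard deviation of inter-event delays) and all numbers are hypothetical, not the model we actually deployed.

```python
import statistics

def fit_normal_model(samples):
    """Fit a per-feature (mean, stdev) baseline from human-only samples."""
    columns = list(zip(*samples))
    return [(statistics.mean(c), statistics.stdev(c)) for c in columns]

def is_anomalous(model, sample, z_threshold=3.0):
    """Flag a sample whose z-score on any feature exceeds the threshold."""
    for (mu, sigma), x in zip(model, sample):
        if sigma > 0 and abs(x - mu) / sigma > z_threshold:
            return True
    return False

# Hypothetical human samples: [mean inter-event delay (ms), delay stdev (ms)].
human = [[180, 60], [210, 75], [160, 50], [200, 80], [190, 65]]
model = fit_normal_model(human)

print(is_anomalous(model, [185, 70]))  # human-like timing: False
print(is_anomalous(model, [5, 0.5]))   # inhumanly fast and regular: True
```

Note that no bot data appears anywhere in training; the detector only knows what "human" looks like.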
The lack of a need for observed bot data is appealing because we did not want to limit the effectiveness of our model to the behavior of known malware samples. If we required observed bot behavior as part of model development, an attacker would only need to differentiate their behavior sufficiently from the training dataset in order to circumvent detection. As our approach requires only samples of human behavior, this is no longer sufficient: the attacker must make their behavior appear human, rather than merely differ from other automation tools. Since our model includes event timing, this is also likely to slow down any bot tools, as they have to emulate human-like latency. The added runtime cost of emulating human-like latency is likely to further disincentivize the development of bots.
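Event timing is a useful signal because scripted input tends to be both fast and unnaturally regular. The sketch below (our illustration, not the deployed feature set) reduces a stream of input-event timestamps to two features, where the coefficient of variation sits near zero for a fixed-delay replay but is clearly positive for human input:

```python
import statistics

def timing_features(timestamps_ms):
    """Reduce event timestamps to (mean gap, coefficient of variation)."""
    gaps = [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]
    mean_gap = statistics.mean(gaps)
    # Coefficient of variation: ~0 for fixed-delay replays, higher for humans.
    return mean_gap, statistics.stdev(gaps) / mean_gap

# Human keystrokes arrive with irregular gaps; a naive bot fires every 50 ms.
human_ts = [0, 130, 310, 420, 660, 790]
bot_ts = [0, 50, 100, 150, 200, 250]

print(timing_features(human_ts))  # non-zero variation
print(timing_features(bot_ts))    # variation is exactly 0
```

A bot that wants to defeat this feature must inject randomized, human-scale delays between events, which is exactly the slowdown described above.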
To test our idea in practice, we developed a proxy for data collection and a custom variant of the Brave web browser to use as a testbed. We used the proxy to collect a set of human behaviors in-house, across a variety of web browsers. We then generated a model of human behavior and implemented it both in Brave and through a custom proxy. To evaluate our approach, we used two popular bot-development tools, ZennoPoster and uBot, as well as several custom bots built with the Puppeteer framework.
Though the work was exploratory and not thoroughly evaluated, we found that this approach can distinguish all evaluated automation tools from humans with a high degree of accuracy, especially when the classification is made over several samples. Our results show promise for the approach, and we are confident that, with further data, it can provide a new method of bot detection once the issue of verifying client-side classifications is solved.