OpenWPM 1 million site tracking measurement


Online Tracking: A 1-million-site Measurement and Analysis

Steven Englehardt, Princeton University
Arvind Narayanan, Princeton University

This is an extended version of our paper that appeared at ACM CCS 2016.

ABSTRACT

We present the largest and most detailed measurement of online tracking conducted to date, based on a crawl of the top 1 million websites. We make 15 types of measurements on each site, including stateful (cookie-based) and stateless (fingerprinting-based) tracking, the effect of browser privacy tools, and the exchange of tracking data between different sites (“cookie syncing”). Our findings include multiple sophisticated fingerprinting techniques never before measured in the wild.

This measurement is made possible by our open-source web privacy measurement tool, OpenWPM, which uses an automated version of a full-fledged consumer browser. It supports parallelism for speed and scale, automatic recovery from failures of the underlying browser, and comprehensive browser instrumentation. We demonstrate our platform’s strength in enabling researchers to rapidly detect, quantify, and characterize emerging online tracking behaviors.

1. INTRODUCTION

Web privacy measurement (observing websites and services to detect, characterize and quantify privacy-impacting behaviors) has repeatedly forced companies to improve their privacy practices due to public pressure, press coverage, and regulatory action [5, 15]. On the other hand, web privacy measurement presents formidable engineering and methodological challenges. In the absence of a generic tool, it has been largely confined to a niche community of researchers.

We seek to transform web privacy measurement into a widespread practice by creating a tool that is useful not just to our colleagues but also to regulators, self-regulators, the press, activists, and website operators, who are often in the dark about third-party tracking on their own domains. We also seek to lessen the burden of continual oversight of web tracking and privacy, by developing a robust and modular platform for repeated studies.

OpenWPM (Section 3) solves three key systems challenges faced by the web privacy measurement community. It does so by building on the strengths of past work, while avoiding the pitfalls made apparent in previous engineering efforts. (1) We achieve scale through parallelism and robustness by utilizing isolated measurement processes similar to FPDetective’s platform [2], while still supporting stateful measurements. We’re able to scale to 1 million sites, without having to resort to a stripped-down browser [31] (a limitation we explore in detail in Section 3.3). (2) We provide comprehensive instrumentation by expanding on the rich browser extension instrumentation of FourthParty [33], without requiring the researcher to write their own automation code. (3) We reduce duplication of work by providing a modular architecture to enable code re-use between studies.

Solving these problems is hard because the web is not designed for automation or instrumentation. Selenium, the main tool for automated browsing through a full-fledged browser, is intended for developers to test their own websites. As a result it performs poorly on websites not controlled by the user and breaks frequently if used for large-scale measurements. Browsers themselves tend to suffer memory leaks over long sessions. In addition, instrumenting the browser to collect a variety of data for later analysis presents formidable challenges. For full coverage, we’ve found it necessary to have three separate measurement points: a network proxy, a browser extension, and a disk state monitor. Further, we must link data collected from these disparate points into a uniform schema, duplicating much of the browser’s own internal logic in parsing traffic.

A large-scale view of web tracking and privacy. In this paper we report results from a January 2016 measurement of the top 1 million sites (Section 4). Our scale enables a variety of new insights. We observe for the first time that online tracking has a “long tail”, but we find a surprisingly quick drop-off in the scale of individual trackers: trackers in the tail are found on very few sites (Section 5.1). Using a new metric for quantifying tracking (Section 5.2), we find that the tracking-protection tool Ghostery is effective, with some caveats (Section 5.5). We quantify the impact of trackers and third parties on HTTPS deployment (Section 5.3) and show that cookie syncing is pervasive (Section 5.6).

Turning to browser fingerprinting, we revisit an influential 2014 study on canvas fingerprinting [1] with updated and improved methodology (Section 6.1). Next, we report on several types of fingerprinting never before measured at scale: font fingerprinting using canvas (which is distinct from canvas fingerprinting; Section 6.2), and fingerprinting by abusing the WebRTC API (Section 6.3), the Audio API (Section 6.4), and the Battery Status API (Section 6.5). Finally, we show that in contrast to our results in Section 5.5, existing privacy tools are not effective at detecting these newer and more obscure fingerprinting techniques.

Overall, our results show cause for concern, but also encouraging signs. In particular, several of our results suggest that while online tracking presents few barriers to entry, trackers in the tail of the distribution are found on very few sites and are far less likely to be encountered by the average user. Those at the head of the distribution, on the other hand, are owned by relatively few companies and are responsive to the scrutiny resulting from privacy studies.

We envision a future where measurement provides a key layer of oversight of online privacy. This will be especially important given that perfectly anticipating and preventing all possible privacy problems (whether through blocking tools or careful engineering of web APIs) has proved infeasible. To enable such oversight, we plan to make all of our data publicly available (OpenWPM is already open-source). We expect that measurement will be useful to developers of privacy tools, to regulators and policy makers, journalists, and many others.

2. BACKGROUND AND RELATED WORK

Background: third-party online tracking. As users browse and interact with websites, they are observed by both “first parties,” which are the sites the user visits directly, and “third parties,” which are typically hidden trackers such as ad networks embedded on most web pages. Third parties can obtain users’ browsing histories through a combination of cookies and other tracking technologies that allow them to uniquely identify users, and the “referer” header that tells the third party which first-party site the user is currently visiting. Other sensitive information such as email addresses may also be leaked to third parties via the referer header.

Web privacy measurement platforms. The closest comparisons to OpenWPM are other open web privacy measurement platforms, which we now review. We consider a tool to be a platform if it is publicly available and there is some generality to the types of studies that can be performed using it. In some cases, OpenWPM has directly built upon existing platforms, which we make explicit note of.

FPDetective is the most similar platform to OpenWPM. FPDetective uses a hybrid PhantomJS and Chromium based automation infrastructure [2], with both native browser code and a proxy for instrumentation. In the published study, the platform was used for the detection and analysis of fingerprinters, and much of the included instrumentation was built to support that. The platform allows researchers to conduct additional experiments by replacing a script which is executed with each page visit, which the authors state can be easily extended for non-fingerprinting studies.

OpenWPM differs in several ways from FPDetective: (1) it supports both stateful and stateless measurements, whereas FPDetective only supports stateless measurements; (2) it includes generic instrumentation for both stateless and stateful tracking, enabling a wider range of privacy studies without additional changes to the infrastructure; (3) none of the included instrumentation requires native browser code, making it easier to upgrade to new or different versions of the browser; and (4) OpenWPM uses a high-level command-based architecture, which supports command re-use between studies.

Chameleon Crawler is a Chromium based crawler that utilizes the Chameleon browser extension for detecting browser fingerprinting. Chameleon Crawler uses similar automation components, but supports a subset of OpenWPM’s instrumentation.

FourthParty is a Firefox plug-in for instrumenting the browser and does not handle automation [33]. OpenWPM has incorporated and expanded upon nearly all of FourthParty’s instrumentation (Section 3).

WebXray is a PhantomJS based tool for measuring HTTP traffic [31]. It has been used to study third-party inclusions on the top 1 million sites, but as we show in Section 3.3, measurements with a stripped-down browser have the potential to miss a large number of resource loads.

TrackingObserver is a Chrome extension that detects tracking and exposes APIs for extending its functionality such as measurement and blocking [48].

XRay [27] and AdFisher [9] are tools for running automated personalization detection experiments. AdFisher builds on similar technologies as OpenWPM (Selenium, xvfb), but is not intended for tracking measurements.

Common Crawl uses an Apache Nutch based crawler. The Common Crawl dataset is the largest publicly available web crawl, with billions of page visits. However, the crawler used does not execute Javascript or other dynamic content during a page visit. Privacy studies which use the dataset [49] will miss dynamically loaded content, which includes many advertising resources.

Crowd-sourcing of web privacy and personalization measurement is an important alternative to automated browsing. $heriff and Bobble are two platforms for measuring personalization [35, 65]. Two major challenges are participant privacy and providing value to users to incentivize participation.

Previous findings. Krishnamurthy and Wills [24] provide much of the early insight into web tracking, showing the growth of the largest third-party organizations from 10% to 20-60% of top sites between 2005 and 2008. In the following years, studies show a continual increase in third-party tracking and in the diversity of tracking techniques [33, 48, 20, 2, 1, 4]. Lerner et al. also find an increase in the prevalence and complexity of tracking, as well as an increase in the interconnectedness of the ecosystem by analyzing Internet Archive data from 1996 to 2016 [29]. Fruchter et al. studied geographic variations in tracking [17]. More recently, Libert studied third-party HTTP requests on the top 1 million sites [31], providing a view of tracking across the web. In this study, Libert showed that Google can track users across nearly 80% of sites through its various third-party domains.

Web tracking has expanded from simple HTTP cookies to include more persistent tracking techniques. Soltani et al. first examined the use of Flash cookies to “respawn” or re-instantiate HTTP cookies [53], and Ayenson et al. showed how sites were using cache E-Tags and HTML5 localStorage for the same purpose [6]. These discoveries led to media backlash [36, 30] and legal settlements [51, 10] against the companies participating in the practice. However, several follow-up studies by other research groups confirmed that, despite a reduction in usage (particularly in the U.S.), the technique is still used for tracking [48, 34, 1].

Device fingerprinting is a persistent tracking technique which does not require a tracker to set any state in the user’s

browser. Instead, trackers attempt to identify users by a combination of the device’s properties. Within samples of over 100,000 browsers, 80-90% of desktop and 81% of mobile device fingerprints are unique [12, 26]. New fingerprinting techniques are continually discovered [37, 43, 16], and are subsequently used to track users on the web [41, 2, 1]. In Section 6.1 we present several new fingerprinting techniques discovered during our measurements.

Personalization measurement. Measurement of tracking is closely related to measurement of personalization, since the question of what data is collected leads to the question of how that data is used. The primary purpose of online tracking is behavioral advertising: showing ads based on the user’s past activity. Datta et al. highlight the incompleteness of Google’s Ad Settings transparency page and provide several empirical examples of discriminatory and predatory ads [9]. Lécuyer et al. develop XRay, a system for inferring which pieces of user data are used for personalization [27]. Another system by some of the same authors is Sunlight, which improves upon their previous methodology to provide statistical confidence of their targeting inferences [28].

Many other practices that raise privacy or ethical concerns have been studied: price discrimination, where a site shows different prices to different consumers for the same product [19, 63]; steering, a gentler form of price discrimination where a product search shows differently-priced results for different users [32]; and the filter bubble, the supposed effect that occurs when online information systems personalize what is shown to a user based on what the user viewed in the past [65].

Web security measurement. Web security studies often use similar methods as web privacy measurement, and the boundary is not always clear. Yue and Wang modified the Firefox browser source code in order to perform a measurement of insecure Javascript implementations on the web [67]. Headless browsers have been used in many web security measurements, for example: to measure the amount of third-party Javascript inclusions across many popular sites and the vulnerabilities that arise from how the script is embedded [40], to measure the presence of security seals on the top 1 million sites [62], and to study stand-alone password generators and meters on the web [60]. Several studies have used Selenium-based frameworks, including: to measure and categorize malicious advertisements displayed while browsing popular sites [68], to measure the presence of malware and other vulnerabilities on live streaming websites [46], to study HSTS deployment [21], to measure ad-injecting browser extensions [66], and to emulate users browsing malicious web shells with the goal of detecting client-side homephoning [55]. Other studies have analyzed Flash and Javascript elements of webpages to measure security vulnerabilities and error-prone implementations [42, 61].

3. MEASUREMENT PLATFORM

An infrastructure for automated web privacy measurement has three components: simulating users, recording observations (response metadata, cookies, behavior of scripts, etc.), and analysis. We set out to build a platform that can automate the first two components and can ease the researcher’s analysis task. We sought to make OpenWPM general, modular, and scalable enough to support essentially any privacy measurement.

OpenWPM is open source and has already been used for measurement by several published studies. Section 3.4 in the supplementary materials examines the advanced features used by each study. In this paper we present, for the first time, the design and evaluation of the platform and highlight its strengths through several new measurements.

[Figure 1 diagram: a Task Manager connected to several Browser Managers, each driving a Browser through Selenium to reach the WWW; an Instrumentation Layer feeds a Data Aggregator, whose output is consumed by Analysis Scripts.]

Figure 1: High-level overview of OpenWPM. The task manager monitors browser managers, which convert high-level commands into automated browser actions. The data aggregator receives and pre-processes data from instrumentation.

3.1 Design Motivations

OpenWPM builds on similar technologies as many previous platforms, but has several key design differences to support modular, comprehensive, and maintainable measurement. Our platform supports stateful measurements while FPDetective [2] does not. Stateful measurements are important for studying the tracking ecosystem. Ad auctions may vary based on cookie data. A stateless browser always appears to be a new user, which skews cookie syncing measurements. In addition to cookie syncing studied in this paper, stateful measurements have allowed our platform to be used to study cookie respawning [1] and replicate realistic user profiles [14].

Many past platforms rely on native instrumentation code [39, 52, 2], which has a high maintenance cost and, in some cases, a high cost-per-API monitored. In our platform, the cost of monitoring new APIs is minimal (Section 3.3) and APIs can be enabled or disabled in the add-on without recompiling the browser or rendering engine. This allows us to monitor a larger number of APIs. Native codebase changes in other platforms require constant merges as the upstream codebase evolves and complete rewrites to support alternative browsers.

3.2 Design and Implementation

We divided our browser automation and data collection infrastructure into three main modules: browser managers, which act as an abstraction layer for automating individual browser instances; a user-facing task manager, which serves to distribute commands to browser managers; and a data aggregator, which acts as an abstraction layer for browser instrumentation. The researcher interacts with the task manager via an extensible, high-level, domain-specific language for crawling and controlling the browser instance. The entire platform is built using Python and Python libraries.

Browser driver: Providing realism and support for web technologies. We considered a variety of choices to drive measurements, i.e., to instruct the browser to visit a set of pages (and possibly to perform a set of actions on each). The two main categories to choose from are lightweight browsers like PhantomJS (an implementation of WebKit), and full-fledged browsers like Firefox and Chrome. We chose to use Selenium, a cross-platform web driver for Firefox, Chrome, Internet Explorer, and PhantomJS. We currently use Selenium to drive Firefox, but Selenium’s support for multiple browsers makes it easy to transition to others in the future.

By using a consumer browser, all technologies that a typical user would have access to (e.g., HTML5 storage options, Adobe Flash) are also supported by measurement instances. The alternative, PhantomJS, does not support WebGL, HTML5 Audio and Video, CSS 3-D, and browser plugins (like Flash), making it impossible to run measurements on the use of these technologies [45]. In retrospect this has proved to be a sound choice. Without full support for new web technologies we would not have been able to discover and measure the use of the AudioContext API for device fingerprinting as discussed in Section 6.4.

Finally, the use of real browsers also allows us to test the effects of consumer browser extensions. We support running measurements with extensions such as Ghostery and HTTPS Everywhere as well as enabling Firefox privacy settings such as third-party cookie blocking and the new Tracking Protection feature. New extensions can easily be supported with only a few extra lines of code (Section 3.3). See Section 5.3 and Section 5.5 for analyses of measurements run with these browser settings.

Browser managers: Providing stability. During the course of a long measurement, a variety of unpredictable events such as page timeouts or browser crashes could halt the measurement’s progress or cause data loss or corruption. A key disadvantage of Selenium is that it frequently hangs indefinitely due to its blocking API [50], as it was designed to be a tool for webmasters to test their own sites rather than an engine for large-scale measurements. Browser managers provide an abstraction layer around Selenium, isolating it from the rest of the components.

Each browser manager instantiates a Selenium instance with a specified configuration of preferences, such as blocking third-party cookies. It is responsible for converting high-level platform commands (e.g. visiting a site) into specific Selenium subroutines. It encapsulates per-browser state, enabling recovery from browser failures. To isolate failures, each browser manager runs as a separate process.

We support launching measurement instances in a “headless” container, by using the pyvirtualdisplay library to interface with Xvfb, which draws the graphical interface of the browser to a virtual frame buffer.

Task manager: Providing scalability and abstraction. The task manager provides a scriptable command-line interface for controlling multiple browsers simultaneously. Commands can be distributed to browsers either synchronized or first-come-first-serve. Each command is launched in a per-browser command execution thread.

The command-execution thread handles errors in its corresponding browser manager automatically. If the browser manager crashes, times out, or exceeds memory limits, the thread enters a crash recovery routine. In this routine, the manager archives the current browser profile, kills all current processes, and loads the archive (which includes cookies and history) into a fresh browser with the same configuration.

Data Aggregator: Providing repeatability. Repeatability can be achieved by logging data in a standardized format, so research groups can easily share scripts and data. We aggregate data from all instrumentation components in a central and structured location. The data aggregator receives data during the measurement, manipulates it as necessary, and saves it on disk keyed back to a specific page visit and browser. The aggregator exists within its own process, and is accessed through a socket interface which can easily be connected to from any number of browser managers or instrumentation processes.

We currently support two data aggregators: a structured SQLite aggregator for storing relational data and a LevelDB aggregator for storing compressed web content. The SQLite aggregator stores the majority of the measurement data, including data from both the proxy and the extension (described below). The LevelDB aggregator is designed to store de-duplicated web content, such as Javascript or HTML files. The aggregator checks if a hash of the content is present in the database, and if not compresses the content and adds it to the database.

Instrumentation: Supporting comprehensive and reusable measurement. We provide the researcher with data access at several points: (1) raw data on disk, (2) at the network level with an HTTP proxy, and (3) at the Javascript level with a Firefox extension. This provides nearly full coverage of a browser’s interaction with the web and the system. Each level of instrumentation keys data with the top-level site being visited and the current browser id, making it possible to combine measurement data from multiple instrumentation sources for each page visit.

Disk Access: We include instrumentation that collects changes to Flash LSOs and the Firefox cookie database after each page visit. This allows a researcher to determine which domains are setting Flash cookies, and to record access to cookies in the absence of other instrumentation.

HTTP Data: After examining several Python HTTP proxies, we chose to use Mitmproxy to record all HTTP Request and Response headers. We generate and load a certificate into Firefox to capture HTTPS data alongside HTTP.

Additionally, we use the HTTP proxy to dump the content of any Javascript file requested during a page visit. We use both Content-Type and file extension checking to detect scripts in the proxy. Once detected, a script is decompressed (if necessary) and hashed. The hash and content are sent to the LevelDBAggregator for de-duplication.

Javascript Access: We provide the researcher with a Javascript interface to pages visited through a Firefox extension. Our extension expands on the work of FourthParty [33]. In particular, we utilize FourthParty’s Javascript instrumentation, which defines custom getters and setters on the window.navigator and window.screen interfaces. [Footnote 7: In the latest public version of FourthParty (May 2015), this instrumentation is not functional due to API changes.] We updated and extended this functionality to record access to the prototypes of the Storage, HTMLCanvasElement, CanvasRenderingContext2D, RTCPeerConnection, and AudioContext objects, as well as the prototypes of several children of AudioNode. This records the setting and getting of all

object properties and calls of all object methods for any object built from these prototypes. Alongside this, we record the new property values set and the arguments to all method calls. Everything is logged directly to the SQLite aggregator.

In addition to recording access to instrumented objects, we record the URL of the script responsible for the property or method access. To do so, we throw an Error and parse the stack trace after each call or property intercept. This method is successful for 99.9% of Javascript files we encountered, and even works for Javascript files which have been minified or obfuscated with eval. A minor limitation is that the function calls of a script which gets passed into the eval method of a second script will have their URL labeled as the second script. This method is adapted with minor modifications from the Privacy Badger Firefox Extension.

In an adversarial situation, a script could disable our instrumentation before fingerprinting a user by overriding access to getters and setters for each instrumented object. However, this would be detectable since we would observe access to the __define{G,S}etter__ or __lookup{G,S}etter__ methods for the object in question and could investigate the cause. In our 1 million site measurement, we only observe script access to getters or setters for the HTMLCanvasElement and CanvasRenderingContext2D interfaces. All of these are benign accesses from 47 scripts total, with the majority related to an HTML canvas graphics library.

Example workflow.

1. The researcher issues a command to the task manager and specifies that it should synchronously execute on all browser managers.
2. The task manager checks all of the command execution threads and blocks until all browsers are available to execute a new command.
3. The task manager creates new command execution threads for all browsers and sends the command and command parameters over a pipe to the browser manager process.
4. The browser manager interprets this command and runs the necessary Selenium code to execute the command in the browser.
5. If the command is a “Get” command, which causes the browser to visit a new URL, the browser manager distributes the browser ID and top-level page being visited to all enabled instrumentation modules (extension, proxy, or disk monitor).
6. Each instrumentation module uses this information to properly key data for the new page visit.
7. The browser manager can send returned data (e.g. the parsed contents of a page) to the SQLite aggregator.
8. Simultaneously, instrumentation modules send data to the respective aggregators from separate threads or processes.
9. Finally, the browser manager notifies the task manager that it is ready for a new command.

3.3 Evaluation

Stability. We tested the stability of vanilla Selenium without our infrastructure in a variety of settings. The best average we were able to obtain was roughly 800 pages without a freeze or crash. Even in small-scale studies, the lack of recovery led to loss or corruption of measurement data. Using the isolation provided by our browser manager and task manager, we recover from all browser crashes and have observed no data corruption during stateful measurements of 100,000 sites. During the course of our stateless 1 million site measurement in January 2016 (Section 5), we observe over 90 million requests and nearly 300 million Javascript calls. A single instrumented browser can visit around 3500 sites per day, requiring no manual interaction during that time. The scale and speed of the overall measurement depends on the hardware used and the measurement configuration (see “Resource usage” below).

Completeness. OpenWPM reproduces a human user’s web browsing experience since it uses a full-fledged browser. However, researchers have used stripped-down browsers such as PhantomJS for studies, trading off fidelity for speed.

To test the importance of using a full-fledged browser, we examined the differences between OpenWPM and PhantomJS (version 2.1.1) on the top 100 Alexa sites. We averaged our results over 6 measurements of each site with each tool. Both tools were configured with a time-out of 10 seconds and we excluded a small number of sites that didn’t complete loading. Unsurprisingly, PhantomJS does not load Flash, HTML5 Video, or HTML5 Audio objects (which it does not support); OpenWPM loads nearly 300 instances of those across all sites. More interestingly, PhantomJS loads about 30% fewer HTML files, and about 50% fewer resources with plain text and stream content types. Upon further examination, one major reason for this is that many sites don’t serve ads to PhantomJS. This makes tracking measurements using PhantomJS problematic.

We also tested PhantomJS with the user-agent string spoofed to look like Firefox, so as to try to prevent sites from treating PhantomJS differently. Here the differences were less extreme, but still present (10% fewer requests of HTML resources, 15% for plain text, and 30% for stream). However, several sites seem to break when PhantomJS presents the incorrect user-agent string. This is because sites may expect certain capabilities that PhantomJS does not have or may attempt to access APIs using Firefox-specific names. One site redirected PhantomJS (with either user-agent string) to an entirely different landing page than OpenWPM. These findings support our view that OpenWPM enables significantly more complete and realistic web and tracking measurement than stripped-down browsers.

Resource usage. When using the headless configuration, we are able to run up to 10 stateful browser instances on an Amazon EC2 “c4.2xlarge” virtual machine. This virtual machine costs around $300 per month using price estimates from May 2016. Due to Firefox’s memory consumption, stateful parallel measurements are memory-limited while stateless parallel measurements are typically CPU-limited and can support a higher number of instances. On the same machine we can run 20 browser instances in parallel if the browser state is cleared after each page load.

Generality. The platform minimizes code duplication both across studies and across configurations of a specific study. For example, the Javascript monitoring instrumentation is about 340 lines of Javascript code. Each additional API monitored takes only a few additional lines of code. The instrumentation necessary to measure canvas fingerprinting (Section 6.1) is three additional lines of code, while the WebRTC measurement (Section 6.3) is just a single line of code.

Similarly, the amount of code needed to add support for new extensions or privacy settings is relatively low: 7 lines of code were required to support Ghostery, 8 lines of code to support HTTPS Everywhere, and 7 lines of code to control Firefox’s cookie blocking policy.

Even measurements themselves require very little additional code on top of the platform. Each configuration listed in Table 2 requires between 70 and 108 lines of code. By comparison, the core infrastructure code and included instrumentation is over 4000 lines of code, showing that the platform saves a significant amount of engineering effort.

3.4 Applications of OpenWPM

Seven academic studies have been published in journals, conferences, and workshops, utilizing OpenWPM to perform a variety of web privacy and security measurements. [Footnote 10: We are aware of several other studies in progress.] Table 1 summarizes the advanced features of the platform that each research group utilized in their measurements.

Table 1: Seven published studies which utilize our platform. An unfilled circle (◦) indicates that the feature was useful but application-specific programming or manual effort was still required.

Study                                            Year
Persistent tracking mechanisms [1]               2014
FB Connect login permissions [47]                2014
Surveillance implications of web tracking [14]   2015
HSTS and key pinning misconfigurations [21]      2015
The Web Privacy Census [4]                       2015
Geographic Variations in Tracking [17]           2015
Analysis of Malicious Web Shells [55]            2016
This study (Sections 5 & 6)                      2016

Feature columns: Browser automation, Stateful crawls, Persistent profiles, Fine-grained profiles, Advanced plugin support, Automated login, Detect tracking cookies, Monitor state changes, Javascript instrumentation, Content extraction.
[The per-study feature marks (•, ◦) are not recoverable from the flattened table layout.]

In addition to browser automation and HTTP data dumps, the platform has several advanced capabilities used by both our own measurements and those in other groups. Measurements can keep state, such as cookies and localStorage, within each session via stateful measurements, or persist this state across sessions with persistent profiles. Persisting state across measurements has been used to measure cookie respawning [1] and to provide seed profiles for larger measurements (Section 5). In general, stateful measurements are useful to replicate the cookie profile of a real user for tracking [4, 14] and cookie syncing analysis [1] (Section 5.6). In addition to recording state, the platform can detect tracking cookies.

The platform also provides programmatic control over individual components of this state, such as Flash cookies through fine-grained profiles, as well as plug-ins via advanced plug-in support. Applications built on top of the platform can monitor state changes on disk to record access to Flash cookies and browser state. These features are useful in studies which wish to simulate the experience of users with Flash enabled [4, 17] or examine cookie respawning with Flash [1].

Beyond just monitoring and manipulating state, the platform provides the ability to capture any Javascript API call with the included Javascript instrumentation. This is used to measure device fingerprinting (Section 6).

Finally, the platform also has a limited ability to extract content from web pages through the content extraction module, and a limited ability to automatically log into websites using the Facebook Connect automated login capability. Logging in with Facebook has been used to study login permissions [47].

4. WEB CENSUS METHODOLOGY

We run measurements on the homepages of the top 1 million sites to provide a comprehensive view of web tracking and web privacy. These measurements provide updated metrics on the use of tracking and fingerprinting technologies, allowing us to shine a light onto the practices of third parties and trackers across a large portion of the web. We also explore the effectiveness of consumer privacy tools at giving users control over their online privacy.

Measurement Configuration. We run our measurements on a “c4.2xlarge” Amazon EC2 instance, which currently allocates 8 vCPUs and 15 GiB of memory per machine. With this configuration we are able to run 20 browser instances in parallel. All measurements collect HTTP Requests and Responses, Javascript calls, and Javascript files using the instrumentation detailed in Section 3. Table 2 summarizes the measurement instance configurations. The data used in this paper were collected during January 2016.

All of our measurements use the Alexa top 1 million site list, which ranks sites based on their global popularity with Alexa Toolbar users. Before each measurement, OpenWPM retrieves an updated copy of the list. When a measurement configuration calls for fewer than 1 million sites, we simply truncate the list as necessary. For each site, the browser will visit the homepage and wait until the site has finished loading or until the 90 second timeout is reached. The browser does not interact with the site or visit any other pages within the site. If there is a timeout we kill the process and restart the browser for the next page visit, as described in Section 3.2.

Stateful measurements. To obtain a complete picture of tracking we must carry out stateful measurements in addition to stateless ones. Stateful measurements do not clear

the browser’s profile between page visits, meaning cookies and other browser storage persist from site to site. For some measurements the difference is not material, but for others, such as cookie syncing (Section 5.6), it is essential.

Configuration       # Sites     # Success   Timeout %   Time to Crawl
Default Stateless   1 Million   917,261     10.58%      14 days
Default Stateful    100,000     94,144      8.23%       3.5 days
Ghostery            55,000      50,023      5.31%       0.7 days
Block TP Cookies    55,000      53,688      12.41%      0.8 days
HTTPS Everywhere    55,000      53,705      14.77%      1 day
ID Detection 1*     10,000      9,707       6.81%       2.9 days
ID Detection 2*     10,000      9,702       6.73%       2.9 days

Feature columns: Flash Enabled, Stateful, Parallel, HTTP Data, Javascript Files, Javascript Calls, Disk Scans.

Table 2: Census measurement configurations. An unfilled circle indicates that a seed profile of length 10,000 was loaded into each browser instance in a parallel measurement. “# Success” indicates the number of sites that were reachable and returned a response. A Timeout is a request which fails to completely load in 90 seconds. *Indicates that the measurements were run synchronously on different virtual machines.

Making stateful measurements is fundamentally at odds with parallelism. But a serial measurement of 1,000,000 sites (or even 100,000 sites) would take unacceptably long. So we make a compromise: we first build a seed profile which visits the top 10,000 sites in a serial fashion, and we save the resulting state. To scale to a larger measurement, the seed profile is loaded into multiple browser instances running in parallel. With this approach, we can approximately simulate visiting each website serially. For our 100,000 site stateful measurement, we used the “ID Detection 2” browser profile as a seed profile.

This method is not without limitations. For example, third parties which don’t appear in the top sites of the seed profile will have different cookies set in each of the parallel instances. If these parties are also involved in cookie syncing, the partners that sync with them (and appear in the seed profile) will each receive multiple IDs for each one of their own. This presents a trade-off between the size of the seed profile and the number of third parties missed by the profile. We find that a seed profile which has visited the top 10,000 sites will have communicated with 76% of all third-party domains present on more than 5 of the top 100,000 sites.

Handling errors. In presenting our results we only consider sites that loaded successfully. For example, for the 1 Million site measurement, we present statistics for 917,261 sites. The majority of errors are due to the site failing to return a response, primarily due to DNS lookup failures. Other causes of errors are sites returning a non-2XX HTTP status code on the landing page, such as a 404 (Not Found) or a 500 (Internal Server Error).

Detecting ID cookies. Detecting cookies that store unique user identifiers is a key task that enables many of the results that we report in Section 5. We build on the methods used in previous studies [1, 14]. Browsers store cookies in a structured key-value format, allowing sites to provide both a name string and a value string. Many sites further structure the value string of a single cookie to include a set of named parameters. We parse each cookie value string assuming the format:

$(name_1=)value_1 \,|\, \ldots \,|\, (name_N=)value_N$

where | represents any character except a-zA-Z0-9-=. We determine a (cookie-name, parameter-name, parameter-value) tuple to be an ID cookie if it meets the following criteria: (1) the cookie has an expiration date over 90 days in the future, (2) 8 ≤ length(parameter-value) ≤ 100, (3) the parameter-value remains the same throughout the measurement, (4) the parameter-value is different between machines and has a similarity less than 66% according to the Ratcliff-Obershelp algorithm [7]. For the last step, we run two synchronized measurements (see Table 2) on separate machines and compare the resulting cookies, as in previous studies.

What makes a tracker? Every third party is potentially a tracker, but for many of our results we need a more conservative definition. We use two popular tracking-protection lists for this purpose: EasyList and EasyPrivacy. Including EasyList allows us to classify advertising related trackers, while EasyPrivacy detects non-advertising related trackers. The two lists consist of regular expressions and URL substrings which are matched against resource loads to determine if a request should be blocked.

Alternative tracking-protection lists exist, such as the list built into the Ghostery browser extension and the domain-based list provided by Disconnect. Although we don’t use these lists to classify trackers directly, we evaluate their performance in several sections.

Note that we are not simply classifying domains as trackers or non-trackers, but rather classify each instance of a third party on a particular website as a tracking or non-tracking context. We consider a domain to be in the tracking context if a consumer privacy tool would have blocked that resource. Resource loads which wouldn’t have been blocked by these extensions are considered non-tracking.

While there is agreement between the extensions utilizing these lists, we emphasize that they are far from perfect. They contain false positives and especially false negatives. That is, they miss many trackers, new ones in particular. Indeed, much of the impetus for OpenWPM and our measurements comes from the limitations of manually identifying trackers. Thus, tracking-protection lists should be considered an underestimate of the set of trackers, just as considering all third parties to be trackers is an overestimate.

Limitations. The analysis presented in this paper has several methodological and measurement limitations. Our platform did not interact with sites in ways a real user might; we did not log into sites nor did we carry out actions such

as scrolling or clicking links during our visit. While we have performed deeper crawls of sites (and plan to make this data publicly available), the analyses presented in the paper pertain only to homepages.

For comparison, we include a preliminary analysis of a crawl which visits 4 internal pages in addition to the homepage of the top 10,000 sites. The analyses presented in this paper should be considered a lower bound on the amount of tracking a user will experience in the wild. In particular, the average number of third parties per site increases from 22 to 34. The 20 most popular third parties embedded on the homepages of sites are found on 6% to 57% more sites when internal page loads are considered. Similarly, fingerprinting scripts found in Section 6 were observed on more sites. Canvas fingerprinting increased from 4% to 7% of the top sites while canvas-based font fingerprinting increased from 2% to 2.5%. An increase in trackers is expected as each additional page visit within a site will cycle through new dynamic content that may load a different set of third parties. Additionally, sites may not embed all third-party content into their homepages.

The measurements presented in this paper were collected from an EC2 instance in Amazon’s US East region. It is possible that some sites would respond differently to our measurement instance than to a real user browsing from a residential or commercial internet connection. That said, Fruchter, et al. [17] use OpenWPM to measure the variation in tracking due to geographic differences, and found no evidence of tracking differences caused by the origin of the measurement instance.

Although OpenWPM’s instrumentation measures a diverse set of tracking techniques, we do not provide a complete analysis of all known techniques. Notably absent from our analysis are non-canvas-based font fingerprinting [2], navigator and plugin fingerprinting [12, 33], and cookie respawning [53, 6]. Several of these Javascript-based techniques are currently supported by OpenWPM, have been measured with OpenWPM in past research [1], and others can be easily added (Section 3.3). Non-Javascript techniques, such as font fingerprinting with Adobe Flash, would require additional specialized instrumentation.

Finally, for readers interested in further details or in reproducing our work, we provide further methodological details in the Appendix: what constitutes distinct domains (13.1), how to detect the landing page of a site using the data collected by our Platform (13.2), how we detect cookie syncing (13.3), and why obfuscation of Javascript doesn’t affect our ability to detect fingerprinting (13.4).

5. RESULTS OF 1-MILLION SITE CENSUS

5.1 The long but thin tail of online tracking

During our January 2016 measurement of the Top 1 million sites, our tool made over 90 million requests, assembling the largest dataset on web tracking to our knowledge. Our large scale allows us to answer a rather basic question: how many third parties are there? In short, a lot: the total number of third parties present on at least two first parties is over 81,000.

What is more surprising is that the prevalence of third parties quickly drops off: only 123 of these 81,000 are present on more than 1% of sites. This suggests that the number of third parties that a regular user will encounter on a daily basis is relatively small. The effect is accentuated when we consider that different third parties may be owned by the same entity. All of the top 5 third parties, as well as 12 of the top 20, are Google-owned domains. In fact, Google, Facebook, Twitter, and AdNexus are the only third-party entities present on more than 10% of sites.

[Bar chart. Y-axis: % First-Parties. Series: Tracking Context, Non-Tracking Context.]

Figure 2: Top third parties on the top 1 million sites. Not all third parties are classified as trackers, and in fact the same third party can be classified differently depending on the context. (Section 4).

Further, if we use the definition of tracking based on tracking-protection lists, as defined in Section 4, then trackers are even less prevalent. This is clear from Figure 2, which shows the prevalence of the top third parties (a) in any context and (b) only in tracking contexts. Note the absence or reduction of content-delivery domains.

We can expand on this by analyzing the top third-party organizations, many of which consist of multiple entities. As an example, Facebook and Liverail are separate entities but Liverail is owned by Facebook. We use the domain-to-organization mappings provided by Libert [31] and Disconnect [11]. As shown in Figure 3, Google, Facebook, Twitter, Amazon, AdNexus, and Oracle are the third-party organizations present on more than 10% of sites. In comparison to Libert’s [31] 2014 findings, Akamai and ComScore fall significantly in market share to just 2.4% and 6.6% of sites. Oracle joins the top third parties by purchasing BlueKai and AddThis, showing that acquisitions can quickly change the tracking landscape.

[Bar chart. Y-axis: % First-Parties. Series: Tracking Context, Non-Tracking Context. Organizations shown: Adnexus, Adobe, Amazon, AOL, Automattic, Cloudflare, comScore, Datalogix, Facebook, Google, MaxCDN, Media Math, Neustar, OpenX, Oracle, Rubicon Project, The Trade Desk, Twitter, Yahoo!, Yandex.]

Figure 3: Organizations with the highest third-party presence on the top 1 million sites. Not all third parties are classified as trackers, and in fact the same third party can be classified differently depending on the context. (Section 4).

Larger entities may be easier to regulate by public-relations pressure and the possibility of legal or enforcement actions, an outcome we have seen in past studies [1, 6, 34].
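The organization-level view in Section 5.1 amounts to mapping each third-party domain to its owner and counting the distinct first-party sites on which any of that owner's domains appear. The following is a minimal sketch under stated assumptions: the function name, the `site_third_parties` input format, and the example domains are hypothetical and are not part of OpenWPM, and `domain_to_org` merely stands in for a mapping in the spirit of those provided by Libert [31] and Disconnect [11].

```python
from collections import defaultdict


def organization_prevalence(site_third_parties, domain_to_org):
    """Return the fraction of first-party sites on which each organization appears.

    site_third_parties: dict mapping a first-party site to the set of
        third-party domains observed on it (hypothetical input format).
    domain_to_org: dict mapping a third-party domain to its owning
        organization; unmapped domains are treated as their own organization.
    """
    sites_per_org = defaultdict(set)
    for site, domains in site_third_parties.items():
        for domain in domains:
            org = domain_to_org.get(domain, domain)
            # A set ensures a site is counted once per organization,
            # even if it embeds several domains owned by the same entity.
            sites_per_org[org].add(site)

    total_sites = len(site_third_parties)
    return {org: len(sites) / total_sites for org, sites in sites_per_org.items()}


# Illustrative data only; these domains and organizations are made up.
example_sites = {
    "news.example": {"cdn.one.example", "ads.two.example"},
    "shop.example": {"ads.two.example"},
    "blog.example": {"solo.example"},
}
example_orgs = {"cdn.one.example": "OrgA", "ads.two.example": "OrgA"}
prevalence = organization_prevalence(example_sites, example_orgs)
```

Counting sites rather than requests is a deliberate choice here: an organization that serves many resources to one page should still count as present on that site only once.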

5.2 Prominence: a third party ranking metric

In Section 5.1 we ranked third parties by the number of first party sites they appear on. This simple count is a good first approximation, but it has two related drawbacks. A major third party that’s present on (say) 90 of the top 100 sites would have a low score if its prevalence drops off outside the top 100 sites. A related problem is that the rank can be sensitive to the number of websites visited in the measurement. Thus different studies may rank third parties differently.

We also lack a good way to compare third parties (and especially trackers) over time, both individually and in aggregate. Some studies have measured the total number of cookies [4], but we argue that this is a misleading metric, since cookies may not have anything to do with tracking.

To avoid these problems, we propose a principled metric. We start from a model of aggregate browsing behavior. There is some research suggesting that website traffic follows a power law distribution, with the frequency of visits to the $N$-th ranked website being proportional to $1/N$ [3, 22]. The exact relationship is not important to us; any formula for traffic can be plugged into our prominence metric below.

Definition:

$$\mathrm{Prominence}(t) = \sum_{\mathrm{edge}(s,t)=1} \frac{1}{\mathrm{rank}(s)}$$

where $\mathrm{edge}(s,t)$ indicates whether third party $t$ is present on site $s$. This simple formula measures the frequency with which an “average” user browsing according to the power-law model will encounter any given third party.

The most important property of prominence is that it de-emphasizes obscure sites, and hence can be adequately approximated by relatively small-scale measurements, as shown in Figure 4. We propose that prominence is the right metric for:

1. Comparing third parties and identifying the top third parties. We present the list of top third parties by prominence in Table 14 in the Appendix. Prominence ranking produces interesting differences compared to ranking by a simple prevalence count. For example, Content-Distribution Networks become less prominent compared to other types of third parties.
2. Measuring the effect of tracking-protection tools, as we do in Section 5.5.
3. Analyzing the evolution of the tracking ecosystem over time and comparing between studies. The robustness of the rank-prominence curve (Figure 4) makes it ideally suited for these purposes.

[Plot. X-axis: Rank of third-party (0 to 1000). Y-axis: Prominence (log). Series: 1K-site measurement, 50K-site measurement, 1M-site measurement.]

Figure 4: Prominence of third party as a function of prominence rank. We posit that the curve for the 1M-site measurement (which can be approximated by a 50k-site measurement) presents a useful aggregate picture of tracking.

5.3 Third parties impede HTTPS adoption

Table 3 shows the number of first-party sites that support HTTPS and the number that are HTTPS-only. Our results reveal that HTTPS adoption remains rather low despite well-publicized efforts [13]. Publishers have claimed that a major roadblock to adoption is the need to move all embedded third parties and trackers to HTTPS to avoid mixed-content errors [57, 64].

              55K Sites   1M Sites
HTTP Only     82.9%       X
HTTPS Only    14.2%       8.6%
HTTPS Opt.    2.9%        X

Table 3: First party HTTPS support on the top 55K and top 1M sites. “HTTP Only” is defined as sites which fail to upgrade when HTTPS Everywhere is enabled. “HTTPS Only” are sites which always redirect to HTTPS. “HTTPS Optional” are sites which provide an option to upgrade, but only do so when HTTPS Everywhere is enabled. We carried out HTTPS-everywhere-enabled measurement for only 55,000 sites, hence the X’s.

Mixed-content errors occur when HTTP sub-resources are loaded on a secure site. This poses a security problem, leading browsers to block the resource load or warn the user depending on the content loaded [38]. Passive mixed content, that is, non-executable resources loaded over HTTP, causes the browser to display an insecure warning to the user but still load the content. Active mixed content is a far more serious security vulnerability and is blocked outright by modern browsers; it is not reflected in our measurements.

[Image: address-bar security UI in Firefox 47 and Chrome 47 for HTTP, HTTPS, and HTTPS w/ Passive Mixed Content.]

Figure 5: Secure connection UI for Firefox Nightly 47 and Chrome 47. Clicking on the lock icon in Firefox reveals the text “Connection is not secure” when mixed content is present.

Third-party support for HTTPS. To test the hypothesis that third parties impede HTTPS adoption, we first characterize the HTTPS support of each third party. If a third party appears on at least 10 sites and is loaded over HTTPS on all of them, we say that it is HTTPS-only. If it is loaded over HTTPS on some but not all of the sites, we say that it supports HTTPS. If it is loaded over HTTP on all of them, we say that it is HTTP-only. If it appears on fewer than 10 sites, we do not have enough confidence to make a determination.

Table 4 summarizes the HTTPS support of third party domains. A large number of third-party domains are HTTP-only (54%). However, when we weight third parties by prominence, only 5% are HTTP-only. In contrast, 94% of prominence-weighted third parties support both HTTP and HTTPS. This supports our thesis that consolidation of the third-party ecosystem is a plus for security and privacy.

HTTPS Support   Percent   Prominence weighted %
HTTP Only       54%       5%
HTTPS Only      5%        1%
Both            41%       94%

Table 4: Third party HTTPS support. “HTTP Only” is defined as domains from which resources are only requested over HTTP across all sites on our 1M site measurement. “HTTPS Only” are domains from which resources are only requested over HTTPS. “Both” are domains which have resources requested over both HTTP and HTTPS. Results are limited to third parties embedded on at least 10 first-party sites.

Impact of third-parties. We find that a significant fraction of HTTP-default sites (26%) embed resources from third-parties which do not support HTTPS. These sites would be unable to upgrade to HTTPS without browsers displaying mixed content errors to their users, the majority of which (92%) would contain active content which would be blocked.

Similarly, of the approximately 78,000 first-party sites that are HTTPS-only, around 6,000 (7.75%) load with mixed passive content warnings. However, only 11% of these warnings (around 650) are caused by HTTP-only third parties, suggesting that many domains may be able to mitigate these warnings by ensuring all resources are being loaded over HTTPS when available. We examined the causes of mixed content on these sites, summarized in Table 5. The majority are caused by third parties, rather than the site’s own content, with a surprising 27% caused solely by trackers.

Top 1M Top 55k % FP % FP Class Own 25.4% 24.9% 2.6% 2.1% Favicon Tracking 10.4% 20.1% 2.6% CDN 1.6% Non-tracking 44.9% 35.4% 15.6% 6.3% Multiple causes

Table 5: A breakdown of causes of passive mixed-content warnings on the top 1M sites and on the top 55k sites. “Non-tracking” represents third-party content not classified as a tracker or a CDN.

5.4 News sites have the most trackers

The level of tracking on different categories of websites varies considerably, by almost an order of magnitude. To measure variation across categories, we used Alexa’s lists of top 500 sites in each of 16 categories. From each list we sampled 100 sites (the lists contain some URLs that are not home pages, and we excluded those before sampling).

In Figure 6 we show the average number of third parties loaded across 100 of the top sites in each Alexa category. Third parties are classified as trackers if they would have been blocked by one of the tracking protection lists (Section 4).

[Bar chart. Y-axis: average number of third parties (0 to 50). Series: Tracker, Non-Tracker. Categories: adult, arts, average, business, computers, games, health, home, kids and teens, news, recreation, reference, regional, science, shopping, society, sports.]

Figure 6: Average # of third parties in each Alexa category.

Why is there so much variation? With the exception of the adult category, the sites on the low end of the spectrum are mostly sites which belong to government organizations, universities, and non-profit entities. This suggests that websites may be able to forgo advertising and tracking due to the presence of funding sources external to the web. Sites on the high end of the spectrum are largely those which provide editorial content. Since many of these sites provide articles for free, and lack an external funding source, they are pressured to monetize page views with significantly more advertising.

5.5 Does tracking protection work?

Users have two main ways to reduce their exposure to tracking: the browser’s built in privacy features and extensions such as Ghostery or uBlock Origin.

Contrary to previous work questioning the effectiveness of Firefox’s third-party cookie blocking [14], we do find the feature to be effective. Specifically, only 237 sites (0.4%) have any third-party cookies set during our measurement set to block all third-party cookies (“Block TP Cookies” in Table 2). Most of these are for benign reasons, such as redirecting to the U.S. version of a non-U.S. site. We did find exceptions, including 32 that contained ID cookies. For example, there are six Australian news sites that first redirect elsewhere before redirecting back to the initial domain, which seems to be for tracking purposes. While this type of workaround to third-party cookie blocking is not rampant, we suggest that browser vendors should closely monitor it and make changes to the blocking heuristic if necessary.

Another interesting finding is that when third-party cookie blocking was enabled, the average number of third parties per site dropped from 17.7 to 12.6. Our working hypothesis for this drop is that deprived of ID cookies, third parties curtail certain tracking-related requests such as cookie syncing (which we examine in Section 5.6).

[Plot. X-axis: Prominence of Third-party (log). Y-axis: Fraction of TP Blocked (0.0 to 1.0).]

Figure 7: Fraction of third parties blocked by Ghostery as a function of the prominence of the third party. As defined earlier, a third party’s prominence is the sum of the inverse ranks of the sites it appears on.

We also tested Ghostery, and found that it is effective at reducing the number of third parties and ID cookies (Figure 11 in the Appendix). The average number of third-party includes went down from 17.7 to 3.3, of which just 0.3 had third-party cookies (0.1 with IDs). We examined the prominent third parties that are not blocked and found almost all of these to be content-delivery networks or widgets, which Ghostery does not try to block. So Ghostery works well at achieving its stated objectives.

However, the tool is less effective for obscure trackers (prominence < 0.1). In Section 6.6, we show that less prominent fingerprinting scripts are not blocked as frequently by blocking tools. This makes sense given that the block list is manually compiled and the developers are less likely to have encountered obscure trackers. It suggests that large-scale measurement techniques like ours will be useful for tool developers to minimize gaps in their coverage.

5.6 How common is cookie syncing?

Cookie syncing, a workaround to the Same-Origin Policy, allows different trackers to share user identifiers with each other. Besides being hard to detect, cookie syncing enables back-end server-to-server data merges hidden from public view, which makes it a privacy concern.

Our ID cookie detection methodology (Section 4) allows us to detect instances of cookie syncing. If tracker A wants to share its ID for a user with tracker B, it can do so in one of two ways: embedding the ID in the request URL to tracker B, or in the referer URL. We therefore look for instances of IDs in referer, request, and response URLs, accounting for URL encoding and other subtleties. We describe the full details of our methodology in the Appendix (Section 13.3), with an important caveat that our methodology captures both intentional and accidental ID sharing.

Most third parties are involved in cookie syncing. We run our analysis on the top 100,000 site stateful measurement. The most prolific cookie-syncing third party shares 108 different cookies with 118 other third parties (this includes both events where it is a referer and where it is a receiver). We present details of the top cookie-syncing parties in Appendix 13.3.

More interestingly, we find that the vast majority of top third parties sync cookies with at least one other party: 45 of the top 50, 85 of the top 100, 157 of the top 200, and 460 of the top 1,000. This adds further evidence that cookie syncing is an under-researched privacy concern.

We also find that third parties are highly connected by synced cookies. Specifically, of the top 50 third parties that are involved in cookie syncing, the probability that a random pair will have at least one cookie in common is 85%. The corresponding probability for the top 100 is 66%.

Implications of “promiscuous cookies” for surveillance. From the Snowden leaks, we learnt that the NSA “piggybacks” on advertising cookies for surveillance and exploitation of targets [56, 54, 18]. How effective can this technique be? We present one answer to this question. We consider a threat model where a surveillance agency has identified a target by a third-party cookie (for example, via leakage of identifiers by first parties, as described in [14, 23, 25]). The adversary uses this identifier to coerce or compromise a third party into enabling surveillance or targeted exploitation.

We find that some cookies get synced over and over again to dozens of third parties; we call these promiscuous cookies. It is not yet clear to us why these cookies are synced repeatedly and shared widely. This means that if the adversary has identified a user by such a cookie, their ability to surveil or target malware to that user will be especially good. The most promiscuous cookie that we found is synced or leaked to 82 other parties which are collectively present on 752 of the top 1,000 websites! In fact, each of the top 10 most promiscuous cookies is shared with enough third parties to cover 60% or more of the top 1,000 sites.

6. FINGERPRINTING: A 1-MILLION SITE VIEW

OpenWPM significantly reduces the engineering requirement of measuring device fingerprinting, making it easy to update old measurements and discover new techniques. In this section, we demonstrate this through several new fingerprinting measurements, two of which have never been measured at scale before, to the best of our knowledge. We show how the number of sites on which font fingerprinting is used and the number of third parties using canvas fingerprinting have both increased considerably in the past few years. We also show how WebRTC’s ability to discover local IPs without user permission or interaction is used almost exclusively to track users. We analyze a new fingerprinting technique utilizing AudioContext found during our investigations. Finally, we discuss the use of the Battery API by two fingerprinting scripts.

Our fingerprinting measurement methodology utilizes data collected by the Javascript instrumentation described in Section 3.2. With this instrumentation, we monitor access to all built-in interfaces and objects we suspect may be used for fingerprinting. By monitoring on the interface or object level, we are able to record access to all method calls and property accesses for each interface we thought might be useful for fingerprinting. This allows us to build a detection criterion for each fingerprinting technique after a detailed analysis of example scripts.

Although our detection criteria currently have a negligibly low false positive rate, we recognize that this may change as new web technologies and applications emerge. However, instrumenting all properties and methods of an API provides a complete picture of each application’s use of the interface, allowing our criteria to also be updated. More importantly, this allows us to replace our detection criteria with machine learning, which is an area of ongoing work (Section 7).

                 % of First-parties
Rank Interval    Canvas   Canvas Font   WebRTC
[0, 1K)          5.10%    2.50%         0.60%
[1K, 10K)        3.91%    1.98%         0.42%
[10K, 100K)      2.45%    0.86%         0.19%
[100K, 1M)       1.31%    0.25%         0.06%

Table 6: Prevalence of fingerprinting scripts on different slices of the top sites. More popular sites are more likely to have fingerprinting scripts.

6.1 Canvas Fingerprinting

Privacy threat. The HTML Canvas allows web applications to draw graphics in real time, with functions to support drawing shapes, arcs, and text to a custom canvas element. In 2012 Mowery and Shacham demonstrated how the HTML Canvas could be used to fingerprint devices [37]. Differences in font rendering, smoothing, anti-aliasing, as well as other device features cause devices to draw the image differently. This allows the resulting pixels to be used as part of a device fingerprint.

Detection methodology. We build on a 2014 measurement study by Acar et al. [1]. Since that study, the canvas API has received broader adoption for non-fingerprinting purposes, so we make several changes to reduce false positives. In our measurements we record access to nearly all of the properties and methods of the HTMLCanvasElement interface

12 CanvasRenderingCon- and of the interface. We filter CanvasRenderingContext2D Detection methodology. The method, which re- text2D interface provides a measureText scripts according to the following criteria: turns several metrics pertaining to the text size (including width properties must 1. The canvas element’s height and 12 its width) when rendered with the current font settings of not be set below 16 px. the rendering context. Our criterion for detecting canvas 2. Text must be written to canvas with least two colors or font font fingerprinting is: the script sets the property to at least 10 distinct characters. measure- at least 50 distinct, valid values and also calls the 3. The script should not call the addE- , or restore save , Text method at least 50 times on the same text string. We methods of the rendering context. ventListener manually examined the source code of each script found this 4. The script extracts an image with or with a toDataURL way and verified that there are zero false positives on our 1 that specifies an area with a getImageData single call to million site measurement. minimum size of 16px × 16px. This heuristic is designed to filter out scripts which are Results. We found canvas-based font fingerprinting present unlikely to have sufficient complexity or size to act as an on 3,250 first-party sites. This represents less than 1% of identifier. We manually verified the accuracy of our detec- sites, but as Table 6 shows, the technique is more heavily tion methodology by inspecting the images drawn and the used on the top sites, reaching 2.5% of the top 1000. The source code. We found a mere 4 false positives out of 3493 vast majority of cases (90%) are served by a single third scripts identified on a 1 million site measurement. Each of party, The number of sites with font finger- the 4 is only present on a single first-party. printing represents a seven-fold increase over a 2013 study [2], although they did not consider Canvas. 
See Table 12 in We found canvas fingerprinting on 14,371 (1.6%) Results. the Appendix for a full list of scripts. sites. The vast majority (98.2%) are from third-party scripts. These scripts come from about 3,500 URLs hosted on about 6.3 WebRTC-based fingerprinting 400 domains. Table 7 shows the top 5 domains which serve canvas fingerprinting scripts ordered by the number of first- WebRTC is a framework for peer-to- Privacy threat. parties they are present on. peer Real Time Communication in the browser, and acces- sible via Javascript. To discover the best network path be- Domain # First-parties tween peers, each peer collects all available candidate ad- dresses, including addresses from the local network inter- 7806 faces (such as ethernet or WiFi) and addresses from the 2858 public side of the NAT and makes them available to the 904 without explicit permission from the user. web application 499 This has led to serious privacy concerns: users behind a 303 proxy or VPN can have their ISP’s public IP address ex- 2719 407 others posed [59]. We focus on a slightly different privacy concern: 14371 unique ) TOTAL 15089 ( users behind a NAT can have their local IP address revealed, Table 7: Canvas fingerprinting on the Alexa Top 1 Million which can be used as an identifier for tracking. A detailed sites. For a more complete list of scripts, see Table 11 in description of the discovery process is given in Appendix the Appendix. Section 11. To detect WebRTC local IP Detection methodology. Comparing our results with a 2014 study [1], we find three RTCPeerConnection interface discovery, we instrument the important trends. First, the most prominent trackers have prototype and record access to its method calls and property by-and-large stopped using it, suggesting that the public access. After the measurement is complete, we select the backlash following that study was effective. 
Second, the and createDataChannel scripts which call the createOffer overall number of domains employing it has increased con- 14 APIs, and access the event handler . We onicecandidate siderably, indicating that knowledge of the technique has manually verified that scripts that call these functions are spread and that more obscure trackers are less concerned in fact retrieving candidate IP addresses, with zero false about public perception. As the technique evolves, the im- positives on 1 million sites. Next, we manually tested if ages used have increased in variety and complexity, as we de- such scripts are using these IPs for tracking. Specifically, we tail in Figure 12 in the Appendix. Third, the use has shifted check if the code is located in a script that contains other from behavioral tracking to fraud detection, in line with the known fingerprinting techniques, in which case we label it ad industry’s self-regulatory norm regarding acceptable uses tracking. Otherwise, if we manually assess that the code of fingerprinting. has a clear non-tracking use, we label it non-tracking. If neither of these is the case, we label the script as ‘unknown’. 6.2 Canvas Font Fingerprinting We emphasize that even the non-tracking scripts present a Privacy threat. The browser’s font list is very useful privacy concern related to leakage of private IPs. for device fingerprinting [12]. The ability to recover the list We found WebRTC being used to discover lo- Results. of fonts through Javascript or Flash is known, and existing cal IP addresses without user interaction on 715 sites out tools aim to protect the user against scripts that do that [41, of the top 1 million. The vast majority of these (659) were 2]. But can fonts be enumerated using the Canvas interface? done by third-party scripts, loaded from 99 different loca- The only public discussion of the technique seems to be a Tor 13 tions. A large majority (625) were used for tracking. The . 
To the best of our knowledge, Browser ticket from 2014 we are the first to measure its usage in the wild. 14 Although we found it unnecessary for current scripts, 12 × 150px. The default canvas size is 300px instrumenting localDescription will cover all possible IP 13 address retrievals.

13 top 10 scripts accounted for 83% of usage, in line with our Destination GainAnalyser Oscillator other observations about the small number of third parties responsible for most tracking. We provide a list of scripts in FFT =0 Table 13 in the Appendix. The number of confirmed non-tracking uses of unsolicited Triangle Wave IP candidate discovery is small, and based on our analysis, none of them is critical to the application. These results eb8a30ad7... [-121.36, -121.19, ...] SHA1( ) have implications for the ongoing debate on whether or not unsolicited WebRTC IP discovery should be private by de- Dynamics fault [59, 8, 58]. Oscillator Destination Compressor Classification # Scripts # First-parties Bu er ff 625 (88.7%) 57 Tracking 40 (5.7%) 10 Non-Tracking Sine Wave 40 (5.7%) Unknown 32 [33.234, 34.568, ...] ad60be2e8... MD5( ) Table 8: Summary of WebRTC local IP discovery on the top 1 million Alexa sites. Figure 8: AudioContext node configuration used to gen- Used by erate a fingerprint. Top: 6.4 AudioContext Fingerprinting in an Used by client.a.pxi. Bottom: . AudioContext The scale of our data gives us a new way to systemati- pub/*/main.min.js and in an cally identify new types of fingerprinting not previously re- . Full details in Appendix 12. OfflineAudioContext ported in the literature. The key insight is that fingerprint- ing techniques typically aren’t used in isolation but rather -80 Chrome Linux 47.0.2526.106 in conjunction with each other. So we monitor known track- -100 Firefox Linux 41.0.2 ing scripts and look for unusual behavior (e.g., use of new -120 Firefox Linux 44.0b2 APIs) in a semi-automated fashion. Using this approach we -140 dB AudioContext found several fingerprinting scripts utilizing -160 and related interfaces. 
-180 15 In the simplest case, a script from the company Liverail -200 Oscilla- and AudioContext checks for the existence of an -220 torNode to add a single bit of information to a broader fin- 700 800 850 750 1000 950 900 gerprint. More sophisticated scripts process an audio signal Frequency Bin Number OscillatorNode to fingerprint the device. generated with an Figure Oscilla- Visualization of processed 9: This is conceptually similar to canvas fingerprinting: audio torNode from the fingerprinting script output signals processed on different machines or browsers may have for three different browsers slight differences due to hardware or software differences be- on the same machine. We found these values to remain tween the machines, while the same combination of machine constant for each browser after several checks. and browser will produce the same output. Figure 8 shows two audio fingerprinting configurations found in three scripts. The top configuration utilizes an for the current battery level or charging status of a host AnalyserNode to extract an FFT to build the fingerprint. device. Olejnik et al. provide evidence that the Battery Both configurations process an audio signal from an Oscil- API can be used for tracking [43]. The authors show how latorNode before reading the resulting signal and hashing the battery charge level and discharge time have a sufficient it to create a device audio fingerprint. Full configuration number of states and lifespan to be used as a short-term details are in Appendix Section 12. identifier. These status readouts can help identify users who We created a demonstration page based on the scripts, take action to protect their privacy while already on a site. which attracted visitors with 18,500 distinct cookies as of For example, the readout may remain constant when a user this submission. These 18,500 devices hashed to a total of clears cookies, switches to private browsing mode, or opens 713 different fingerprints. 
We estimate the entropy of the fin- a new browser before re-visiting the site. We discovered two gerprint at 5.4 bits based on our sample. We leave a full eval- fingerprinting scripts utilizing the API during our manual uation of the effectiveness of the technique to future work. analysis of other fingerprinting techniques. We find that this technique is very infrequently used as heartbeat.js, re- One script, of March 2016. The most popular script is from Liverail, trieves the current charge level of the host device and com- present on 512 sites. Other scripts were present on as few bines it with several other identifying features. These fea- as 6 sites. This shows that even with very low usage rates, tures include the canvas fingerprint and the user’s local IP we can successfully bootstrap off of currently known finger- address retrieved with WebRTC as described in Section 6.1 printing scripts to discover and measure new techniques. and Section 6.3. The second script, BatteryManager score.min.js, queries all properties of the 6.5 Battery API Fingerprinting interface, retrieving the current charging status, the charge As a second example of bootstrapping, we analyze the level, and the time remaining to discharge or recharge. As Battery Status API, which allows a site to query the browser with the previous script, these features are combined with 15 other identifying features used to fingerprint a device.
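
The detection criteria of Sections 6.1 to 6.3 can be read as predicates over the list of instrumented accesses made by one script. The following Python sketch illustrates that reading; it is not the measurement code. The Access record layout and all function names are hypothetical, the "valid font value" check of Section 6.2 is omitted, text color is approximated by values written to fillStyle, and criterion 4 is interpreted as "at least one getImageData call covering 16 px × 16 px or more".

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple

CTX = "CanvasRenderingContext2D."
CANVAS = "HTMLCanvasElement."
RTC = "RTCPeerConnection."


@dataclass
class Access:
    """One instrumented access by a script (hypothetical record layout)."""
    symbol: str          # e.g. "CanvasRenderingContext2D.fillText"
    operation: str       # "call", "set" or "get"
    value: str = ""      # value written, for "set"
    args: Tuple = ()     # positional arguments, for "call"


def _calls(log: List[Access], symbol: str) -> List[Access]:
    return [a for a in log if a.operation == "call" and a.symbol == symbol]


def _sets(log: List[Access], symbol: str) -> List[str]:
    return [a.value for a in log if a.operation == "set" and a.symbol == symbol]


def is_canvas_fingerprinting(log: List[Access]) -> bool:
    # 1. Neither canvas dimension may be set below 16 px.
    sizes = _sets(log, CANVAS + "width") + _sets(log, CANVAS + "height")
    if any(float(s) < 16 for s in sizes):
        return False
    # 2. Text is written with >= 2 colors or >= 10 distinct characters.
    texts = _calls(log, CTX + "fillText") + _calls(log, CTX + "strokeText")
    chars = {ch for t in texts if t.args for ch in str(t.args[0])}
    colors = set(_sets(log, CTX + "fillStyle"))
    if not texts or (len(colors) < 2 and len(chars) < 10):
        return False
    # 3. No save, restore or addEventListener calls on the context.
    if any(_calls(log, CTX + m) for m in ("save", "restore", "addEventListener")):
        return False
    # 4. Image read back via toDataURL, or via a getImageData call
    #    covering at least 16 x 16 px (arguments: sx, sy, sw, sh).
    if _calls(log, CANVAS + "toDataURL"):
        return True
    return any(len(c.args) >= 4 and c.args[2] >= 16 and c.args[3] >= 16
               for c in _calls(log, CTX + "getImageData"))


def is_canvas_font_fingerprinting(log: List[Access], threshold: int = 50) -> bool:
    # >= 50 distinct font values and >= 50 measureText calls on one string.
    fonts = set(_sets(log, CTX + "font"))
    measured = Counter(str(c.args[0])
                       for c in _calls(log, CTX + "measureText") if c.args)
    return len(fonts) >= threshold and max(measured.values(), default=0) >= threshold


def is_webrtc_ip_discovery(log: List[Access]) -> bool:
    # createDataChannel and createOffer called, onicecandidate accessed.
    called = {a.symbol for a in log if a.operation == "call"}
    touched = {a.symbol for a in log}
    return (RTC + "createDataChannel" in called
            and RTC + "createOffer" in called
            and RTC + "onicecandidate" in touched)
```

Keeping each criterion as a separate predicate over the raw access log mirrors the point made above: because every property and method access is recorded, a criterion can be revised later without re-crawling.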

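The entropy figure quoted in Section 6.4 (18,500 devices, 713 distinct fingerprints, 5.4 bits) is a statement about how unevenly devices are spread over fingerprints. The text does not specify the estimator; the sketch below assumes it is the Shannon entropy of the empirical fingerprint distribution, and the function name is hypothetical.

```python
import math
from collections import Counter
from typing import Iterable


def shannon_entropy_bits(fingerprints: Iterable[str]) -> float:
    """Shannon entropy, in bits, of the observed fingerprint frequencies."""
    counts = Counter(fingerprints)
    total = sum(counts.values())
    # H = -sum(p * log2(p)) over the distinct fingerprints seen.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

Under this assumption, 713 distinct values could carry at most log2(713), roughly 9.48 bits, and only if every fingerprint were equally common. An estimate of 5.4 bits would therefore indicate that a small number of fingerprints account for a large share of devices.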
6.6 The wild west of fingerprinting scripts

In Section 5.5 we found the various tracking protection measures to be very effective at reducing third-party tracking. In Table 9 we show how blocking tools miss many of the scripts we detected throughout Section 6, particularly those using lesser-known techniques. Although blocking tools detect the majority of instances of well-known techniques, only a fraction of the total number of scripts are detected.

               Disconnect             EL + EP
Technique      % Scripts   % Sites    % Scripts   % Sites
Canvas         17.6%       78.5%      25.1%       88.3%
Canvas Font    10.3%       97.6%      10.3%       90.6%
WebRTC         1.9%        21.3%      4.8%        5.6%
Audio          11.1%       53.1%      5.6%        1.6%

Table 9: Percentage of fingerprinting scripts blocked by Disconnect or the combination of EasyList and EasyPrivacy for all techniques described in Section 6. Included is the percentage of sites with fingerprinting scripts on which scripts are blocked.

Fingerprinting scripts pose a unique challenge for manually curated block lists. They may not change the rendering of a page or be included by an advertising entity. The script content may be obfuscated to the point where manual inspection is difficult and the purpose of the script unclear.

[Figure 10 plot: Fraction of Scripts Blocked (0.0 to 1.0) versus Prominence of Script (log scale, 10^-6 to 10^-2).]

Figure 10: Fraction of fingerprinting scripts with prominence above a given level blocked by Disconnect, EasyList, or EasyPrivacy on the top 1M sites.

OpenWPM's active instrumentation (see Section 3.2) detects a large number of scripts not blocked by the current privacy tools. Disconnect and a combination of EasyList and EasyPrivacy both perform similarly in their block rate. The privacy tools block canvas fingerprinting on over 78% of sites, and block canvas font fingerprinting on over 90%. However, only a fraction of the total number of scripts utilizing the techniques are blocked (between 10% and 25%) showing that less popular third parties are missed. Lesser-known techniques, like WebRTC IP discovery and Audio fingerprinting have even lower rates of detection.

In fact, fingerprinting scripts with a low prominence are blocked much less frequently than those with high prominence. Figure 10 shows the fraction of scripts which are blocked by Disconnect, EasyList, or Easyprivacy for all techniques analyzed in this section. 90% of scripts with a prominence above 0.01 are detected and blocked by one of the blocking lists, while only 35% of those with a prominence above 0.0001 are. The long tail of fingerprinting scripts are largely unblocked by current privacy tools.

7. CONCLUSION AND FUTURE WORK

Web privacy measurement has the potential to play a key role in keeping online privacy incursions and power imbalances in check. To achieve this potential, measurement tools must be made available broadly rather than just within the research community. In this work, we've tried to bring this ambitious goal closer to reality.

The analysis presented in this paper represents a snapshot of results from ongoing, monthly measurements. OpenWPM and census measurements are two components of the broader Web Transparency and Accountability Project at Princeton. We are currently working on two directions that build on the work presented here. The first is the use of machine learning to automatically detect and classify trackers. If successful, this will greatly improve the effectiveness of browser privacy tools. Today such tools use tracking-protection lists that need to be created manually and laboriously, and suffer from significant false positives as well as false negatives. Our large-scale data provide the ideal source of ground truth for training classifiers to detect and categorize trackers.

The second line of work is a web-based analysis platform that makes it easy for a minimally technically skilled analyst to investigate online tracking based on the data we make available. In particular, we are aiming to make it possible for an analyst to save their analysis scripts and results to the server, share it, and for others to build on it.

8. ACKNOWLEDGEMENTS

We would like to thank Shivam Agarwal for contributing analysis code used in this study, Christian Eubank and Peter Zimmerman for their work on early versions of OpenWPM, and Gunes Acar for his contributions to OpenWPM and helpful discussions during our investigations, and Dillon Reisman for his technical contributions.

We're grateful to numerous researchers for useful feedback: Joseph Bonneau, Edward Felten, Steven Goldfeder, Harry Kalodner, and Matthew Salganik at Princeton, Fernando Diaz and many others at Microsoft Research, Franziska Roesner at UW, Marc Juarez at KU Leuven, Nikolaos Laoutaris at Telefonia Research, Vincent Toubiana at CNIL, France, Lukasz Olejnik at INRIA, France, Nick Nikiforakis at Stony Brook, Tanvi Vyas at Mozilla, Chameleon developer Alexei Miagkov, Joel Reidenberg at Fordham, Andrea Matwyshyn at Northeastern, and the participants of the Princeton Web Privacy and Transparency workshop. Finally, we'd like to thank the anonymous reviewers of this paper.

This work was supported by NSF Grant CNS 1526353, a grant from the Data Transparency Lab, and by Amazon AWS Cloud Credits for Research.

9. REFERENCES

[1] G. Acar, C. Eubank, S. Englehardt, M. Juarez, A. Narayanan, and C. Diaz. The web never forgets: Persistent tracking mechanisms in the wild. In Proceedings of CCS, 2014.
[2] G. Acar, M. Juarez, N. Nikiforakis, C. Diaz, S. Gürses, F. Piessens, and B. Preneel. FPDetective: dusting the web for fingerprinters. In Proceedings of CCS. ACM, 2013.
[3] L. A. Adamic and B. A. Huberman. Zipf's law and the internet. Glottometrics, 3(1):143–150, 2002.
[4] H. C. Altaweel I, Good N. Web privacy census. Technology Science, 2015.

[5] J. Angwin. What they know. The Wall Street Journal. public/page/what-they-know-digital-privacy.html, 2012.
[6] M. Ayenson, D. J. Wambach, A. Soltani, N. Good, and C. J. Hoofnagle. Flash cookies and privacy II: Now with HTML5 and ETag respawning. World Wide Web Internet And Web Information Systems, 2011.
[7] P. E. Black. Ratcliff/Obershelp pattern recognition, Dec. 2004.
[8] Bugzilla. WebRTC Internal IP Address Leakage. bug.cgi?id=959893.
[9] A. Datta, M. C. Tschantz, and A. Datta. Automated experiments on ad privacy settings. Privacy Enhancing Technologies, 2015.
[10] W. Davis. KISSmetrics Finalizes Supercookies Settlement. 191409/kissmetrics-finalizes-supercookies-settlement.html, 2013. [Online; accessed 12-May-2014].
[11] Disconnect. Tracking Protection Lists.
[12] P. Eckersley. How unique is your web browser? In Privacy Enhancing Technologies. Springer, 2010.
[13] Electronic Frontier Foundation. Encrypting the Web.
[14] S. Englehardt, D. Reisman, C. Eubank, P. Zimmerman, J. Mayer, A. Narayanan, and E. W. Felten. Cookies that give you away: The surveillance implications of web tracking. In 24th International Conference on World Wide Web, pages 289–299. International World Wide Web Conferences Steering Committee, 2015.
[15] Federal Trade Commission. Google will pay $22.5 million to settle FTC charges it misrepresented privacy assurances to users of Apple's Safari internet browser. https://www. pay-225-million-settle-ftc-charges-it-misrepresented, 2012.
[16] D. Fifield and S. Egelman. Fingerprinting web users through font metrics. In Financial Cryptography and Data Security, pages 107–124. Springer, 2015.
[17] N. Fruchter, H. Miao, S. Stevenson, and R. Balebako. Variations in tracking in relation to geographic location. In Proceedings of W2SP, 2015.
[18] S. Gorman and J. Valentino-Devries. New Details Show Broader NSA Surveillance Reach, 2013.
[19] A. Hannak, G. Soeller, D. Lazer, A. Mislove, and C. Wilson. Measuring price discrimination and steering on e-commerce web sites. In 14th Internet Measurement Conference, 2014.
[20] C. J. Hoofnagle and N. Good. Web privacy census. Available at SSRN 2460547, 2012.
[21] M. Kranch and J. Bonneau. Upgrading HTTPS in midair: HSTS and key pinning in practice. In NDSS '15: The 2015 Network and Distributed System Security Symposium, February 2015.
[22] S. A. Krashakov, A. B. Teslyuk, and L. N. Shchur. On the universality of rank distributions of website popularity. Computer Networks, 50(11):1769–1780, 2006.
[23] B. Krishnamurthy, K. Naryshkin, and C. Wills. Privacy leakage vs. protection measures: the growing disconnect. In Proceedings of W2SP, volume 2, 2011.
[24] B. Krishnamurthy and C. Wills. Privacy diffusion on the web: a longitudinal perspective. In Conference on World Wide Web. ACM, 2009.
[25] B. Krishnamurthy and C. E. Wills. On the leakage of personally identifiable information via online social networks. In 2nd ACM workshop on Online social networks. ACM, 2009.
[26] P. Laperdrix, W. Rudametkin, and B. Baudry. Beauty and the beast: Diverting modern web browsers to build unique browser fingerprints. In 37th IEEE Symposium on Security and Privacy (S&P 2016), 2016.
[27] M. Lécuyer, G. Ducoffe, F. Lan, A. Papancea, T. Petsios, R. Spahn, A. Chaintreau, and R. Geambasu. Xray: Enhancing the web's transparency with differential correlation. In USENIX Security Symposium, 2014.
[28] M. Lecuyer, R. Spahn, Y. Spiliopolous, A. Chaintreau, R. Geambasu, and D. Hsu. Sunlight: Fine-grained targeting detection at scale with statistical confidence. In Proceedings of CCS. ACM, 2015.
[29] A. Lerner, A. K. Simpson, T. Kohno, and F. Roesner. Internet jones and the raiders of the lost trackers: An archaeological study of web tracking from 1996 to 2016. In Proceedings of USENIX Security, 2016.
[30] J. Leyden. Sites pulling sneaky flash cookie-snoop. http:// cookies/, 2009.
[31] T. Libert. Exposing the invisible web: An analysis of third-party http requests on 1 million websites. International Journal of Communication, 9(0), 2015.
[32] D. Mattioli. On Orbitz, Mac users steered to pricier hotels. SB10001424052702304458604577488822667325882, 2012.
[33] J. R. Mayer and J. C. Mitchell. Third-party web tracking: Policy and technology. In Security and Privacy (S&P). IEEE, 2012.
[34] A. M. McDonald and L. F. Cranor. Survey of the use of Adobe Flash Local Shared Objects to respawn HTTP cookies, a. ISJLP, 7, 2011.
[35] J. Mikians, L. Gyarmati, V. Erramilli, and N. Laoutaris. Detecting price and search discrimination on the internet. In Workshop on Hot Topics in Networks. ACM, 2012.
[36] N. Mohamed. You deleted your cookies? think again. http://www.wired.com/2009/08/you-deleted-your-cookies-think-again/, 2009.
[37] K. Mowery and H. Shacham. Pixel perfect: Fingerprinting canvas in html5. Proceedings of W2SP, 2012.
[38] Mozilla Developer Network. Mixed content - Security. https:// content.
[39] C. Neasbitt, B. Li, R. Perdisci, L. Lu, K. Singh, and K. Li. Webcapsule: Towards a lightweight forensic engine for web browsers. In Proceedings of CCS. ACM, 2015.
[40] N. Nikiforakis, L. Invernizzi, A. Kapravelos, S. Van Acker, W. Joosen, C. Kruegel, F. Piessens, and G. Vigna. You are what you include: Large-scale evaluation of remote javascript inclusions. In Proceedings of CCS. ACM, 2012.
[41] N. Nikiforakis, A. Kapravelos, W. Joosen, C. Kruegel, F. Piessens, and G. Vigna. Cookieless monster: Exploring the ecosystem of web-based device fingerprinting. In Security and Privacy (S&P). IEEE, 2013.
[42] F. Ocariza, K. Pattabiraman, and B. Zorn. Javascript errors in the wild: An empirical study. In Software Reliability Engineering (ISSRE). IEEE, 2011.
[43] L. Olejnik, G. Acar, C. Castelluccia, and C. Diaz. The leaking battery. Cryptology ePrint Archive, Report 2015/616, 2015.
[44] L. Olejnik, C. Castelluccia, et al. Selling off privacy at auction. In NDSS '14: The 2014 Network and Distributed System Security Symposium, 2014.
[45] Phantom JS. Supported web standards, 2016.
[46] M. Z. Rafique, T. Van Goethem, W. Joosen, C. Huygens, and N. Nikiforakis. It's free for a reason: Exploring the ecosystem of free live streaming services. In Network and Distributed System Security (NDSS), 2016.
[47] N. Robinson and J. Bonneau. Cognitive disconnect: Understanding Facebook Connect login permissions. In 2nd ACM conference on Online social networks. ACM, 2014.
[48] F. Roesner, T. Kohno, and D. Wetherall. Detecting and Defending Against Third-Party Tracking on the Web. In Symposium on Networking Systems Design and Implementation. USENIX, 2012.
[49] S. Schelter and J. Kunegis. On the ubiquity of web tracking: Insights from a billion-page web crawl. arXiv preprint arXiv:1607.07403, 2016.
[50] Selenium Browser Automation. Selenium faq. com/p/selenium/wiki/FrequentlyAskedQuestions, 2014.
[51] R. Singel. Online Tracking Firm Settles Suit Over Undeletable Cookies. http://, 2010.
[52] K. Singh, A. Moshchuk, H. J. Wang, and W. Lee. On the incoherencies in web browser access control policies. In Proceedings of S&P. IEEE, 2010.
[53] A. Soltani, S. Canty, Q. Mayo, L. Thomas, and C. J. Hoofnagle. Flash cookies and privacy. In AAAI Spring Symposium: Intelligent Information Privacy Management, 2010.
[54] A. Soltani, A. Peterson, and B. Gellman. NSA uses Google cookies to pinpoint targets for hacking. http://www.washingtonpost.com/blogs/the-switch/wp/2013/12/10/nsa-uses-google-cookies-to-pinpoint-targets-for-hacking, December 2013.
[55] O. Starov, J. Dahse, S. S. Ahmad, T. Holz, and N. Nikiforakis. No honor among thieves: A large-scale analysis of malicious web shells. In International Conference on World Wide Web, 2016.
[56] The Guardian. 'Tor Stinks' presentation - read the full document. 04/tor-stinks-nsa-presentation-document, October 2013.
[57] Z. Tollman. We're Going HTTPS: Here's How WIRED Is Tackling a Huge Security Upgrade. 2016/04/wired-launching-https-security-upgrade/, 2016.
[58] J. Uberti. New proposal for IP address handling in WebRTC. https://www.
[59] J. Uberti and G. wei Shieh. WebRTC IP Address Handling Recommendations. https://
[60] S. Van Acker, D. Hausknecht, W. Joosen, and A. Sabelfeld. Password meters and generators on the web: From large-scale empirical study to getting it right. In Conference on Data and Application Security and Privacy. ACM, 2015.
[61] S. Van Acker, N. Nikiforakis, L. Desmet, W. Joosen, and F. Piessens. Flashover: Automated discovery of cross-site scripting vulnerabilities in rich internet applications. In Proceedings of CCS. ACM, 2012.
[62] T. Van Goethem, F. Piessens, W. Joosen, and N. Nikiforakis. Clubbing seals: Exploring the ecosystem of third-party security seals. In Proceedings of CCS. ACM, 2014.
[63] T. Vissers, N. Nikiforakis, N. Bielova, and W. Joosen. Crying wolf? on the price discrimination of online airline tickets. HotPETS, 2014.
[64] W. V. Wazer. Moving the Washington Post to HTTPS. 2015/12/10/moving-the-washington-post-to-https/, 2015.
[65] X. Xing, W. Meng, D. Doozan, N. Feamster, W. Lee, and A. C. Snoeren. Exposing inconsistent web search results with bobble. In Passive and Active Measurement, pages 131–140. Springer, 2014.
[66] X. Xing, W. Meng, B. Lee, U. Weinsberg, A. Sheth, R. Perdisci, and W. Lee. Understanding malvertising through ad-injecting browser extensions. In 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2015.
[67] C. Yue and H. Wang. A measurement study of insecure javascript practices on the web. ACM Transactions on the Web (TWEB), 7(2):7, 2013.
[68] A. Zarras, A. Kapravelos, G. Stringhini, T. Holz, C. Kruegel, and G. Vigna. The dark alleys of madison avenue: Understanding malicious advertisements. In Internet Measurement Conference. ACM, 2014.

APPENDIX

[Figure 11 plot: y-axis % First-Parties, 0 to 35.]

Figure 11: Third-party trackers on the top 55k sites with Ghostery enabled. The majority of the top third-party domains not blocked are CDNs or provide embedded content (such as Google Maps).

Figure 12: Three sample canvas fingerprinting images created by fingerprinting scripts, which are subsequently hashed and used to identify the device.

10. MIXED CONTENT CLASSIFICATION

To classify URLs in the HTTPS mixed content analysis, we used the block lists described in Section 4. Additionally, we include a list of CDNs from the WebPagetest Project. The mixed content URL is then classified according to the first rule it satisfies in the following list:

1. If the requested domain matches the landing page domain, and the request URL ends with favicon.ico, classify as a "favicon".
2. If the requested domain matches the landing page domain, classify as the site's "own content".

3. If the requested domain is marked as “should block” by the blocklists, classify as “tracker”.
4. If the requested domain is in the CDN list, classify as “CDN”.
5. Otherwise, classify as “non-tracking” third-party content.

Content-Type                          Count
binary/octet-stream                       8
image/jpeg                            12664
image/svg+xml                           177
image/x-icon                            150
image/png                              7697
image/                                   41
text/xml                                  1
audio/wav                                 1
application/json                          8
application/pdf                           1
application/x-www-form-urlencoded         8
application/unknown                       5
audio/ogg                                 4
image/gif                              2905
video/webm                               20
application/xml                          30
image/bmp                                 2
audio/mpeg                                1
application/x-javascript                  1
application/octet-stream                225
image/webp                                1
text/plain                               91
text/javascript                           3
text/html                              7225
video/ogg                                 1
image/*                                  23
video/mp4                                19
image/pjpeg                               2
image/small                               1
image/x-png                               2

Table 10: Counts of responses with given Content-Type which cause mixed content errors. NOTE: Mixed content blocking occurs based on the tag of the initial request (e.g. image src tags are considered passive content), not the response Content-Type. Thus it is likely that the Javascript and other active content loads listed above are the result of misconfigurations and mistakes that will be dropped by the browser. For example, requesting a Javascript file with an image tag.

11. ICE CANDIDATE GENERATION

It is possible for a Javascript web application to access ICE candidates, and thus access a user’s local IP addresses and public IP address, without explicit user permission. Although a web application must request explicit user permission to access audio or video through WebRTC, the framework allows a web application to construct an RTCDataChannel without permission. By default, the data channel will launch the ICE protocol and thus enable the web application to access the IP address information without any explicit user permission. Both users behind a NAT and users behind a VPN/proxy can have additional identifying information exposed to websites without their knowledge or consent.

Several steps must be taken to have the browser generate ICE candidates. First, an RTCDataChannel must be created as discussed above. Next, RTCPeerConnection.createOffer() must be called, which generates a Promise that will contain the session description once the offer has been created. This is passed to RTCPeerConnection.setLocalDescription(), which triggers the gathering of candidate addresses. The prepared offer will contain the supported configurations for the session, part of which includes the IP addresses gathered by the ICE Agent (footnote 17). A web application can retrieve these candidate IP addresses in one of two ways: by using the event handler RTCPeerConnection.onicecandidate() and retrieving the candidate IP address from RTCPeerConnectionIceEvent.candidate, or by parsing the resulting Session Description Protocol (SDP) string from RTCPeerConnection.localDescription after the offer generation is complete. In our study we only found it necessary to instrument RTCPeerConnection.onicecandidate() to capture all current scripts.

12. AUDIO FINGERPRINT CONFIGURATION

Figure 8 in Section 6.4 summarizes one of two audio fingerprinting configurations found in the wild. This configuration is used by two scripts (*/main.min.js and [...]). These scripts use an OscillatorNode to generate a sine wave. The output signal is connected to a DynamicsCompressorNode, possibly to increase differences in processed audio between machines. The output of this compressor is passed to the buffer of an OfflineAudioContext. The script uses a hash of the sum of values from the buffer as the fingerprint.

A third script, *[...], utilizes AudioContext to generate a fingerprint. First, the script generates a triangle wave using an OscillatorNode. This signal is passed through an AnalyserNode and a ScriptProcessorNode. Finally, the signal is passed through a GainNode with gain set to zero, to mute any output, before being connected to the AudioContext’s destination (e.g. the computer’s speakers). The AnalyserNode provides access to a Fast Fourier Transform (FFT) of the audio signal, which is captured using the onaudioprocess event handler added by the ScriptProcessorNode. The resulting FFT is fed into a hash and used as a fingerprint.

13. ADDITIONAL METHODOLOGY

All measurements are run with Firefox version 41. The Ghostery measurements use version 5.4.10 set to block all possible bugs and cookies. The HTTPS Everywhere measurement uses version 5.1.0 with the default settings. The Block TP Cookies measurement sets the Firefox setting to “block all third-party cookies”.

17 RTCPeerConnection-createOffer-Promise-RTCSessionDescription--RTCOfferOptions-options
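To illustrate the SDP-parsing route mentioned in Section 11, the following Python sketch extracts candidate addresses from an SDP string during offline analysis. It is not part of the measurement instrumentation. The function name, the sample SDP text, and the addresses in it are assumptions made for this example; the only thing taken as given is the standard layout of an `a=candidate:` line (foundation, component, transport, priority, address, port, the literal `typ`, and the candidate type).

```python
import ipaddress
from typing import List, Tuple

# Hypothetical SDP text, shaped like what a script could read back from
# RTCPeerConnection.localDescription once candidate gathering has finished.
SAMPLE_SDP = """v=0
o=- 46117317 2 IN IP4 127.0.0.1
s=-
t=0 0
a=candidate:1 1 UDP 2122252543 192.168.1.23 54321 typ host
a=candidate:2 1 UDP 1686052607 203.0.113.7 54321 typ srflx raddr 192.168.1.23 rport 54321
"""


def candidate_addresses(sdp: str) -> List[Tuple[str, str]]:
    """Return (ip_address, candidate_type) pairs found in an SDP string."""
    found = []
    for raw_line in sdp.splitlines():
        line = raw_line.strip()
        if not line.startswith("a=candidate:"):
            continue
        fields = line[len("a=candidate:"):].split()
        # Expected order: foundation component transport priority
        #                 address port "typ" candidate-type [extensions...]
        if len(fields) < 8 or fields[6] != "typ":
            continue
        address, candidate_type = fields[4], fields[7]
        try:
            ipaddress.ip_address(address)
        except ValueError:
            continue  # not a literal IP address (e.g. a hostname), skip it
        found.append((address, candidate_type))
    return found


if __name__ == "__main__":
    for address, candidate_type in candidate_addresses(SAMPLE_SDP):
        print(candidate_type, address)
```

In this sketch a `host` candidate corresponds to a local interface address, while a `srflx` (server reflexive) candidate corresponds to the address seen from outside a NAT, which matches the local and public addresses discussed above.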

13.1 Classifying Third-party content

In order to determine if a request is a first-party or third-party request, we utilize the URL’s “public suffix + 1” (or PS+1). A public suffix “is one under which Internet users can (or historically could) directly register names. [Examples include] .com, and [...]” A PS+1 is the public suffix with the section of the domain immediately preceding it (not including any additional subdomains). We use Mozilla’s Public Suffix List in our analysis. We consider a site to be a potential third-party if the PS+1 of the site does not match the landing page’s PS+1 (as determined by the algorithm in the supplementary materials, Section 13.2). Throughout the paper we use the word “domain” to refer to a site’s PS+1.

13.2 Landing page detection from HTTP data

Upon visiting a site, the browser may be redirected either by a response header (with a 3XX HTTP response code or “Refresh” field), or by the page content (with javascript or a “Refresh” meta tag). Several redirects may occur before the site arrives at its final landing page and begins to load the remainder of the content. To capture all possible redirects we use the following recursive algorithm, starting with the initial request to the top-level site. For each request:
1. If the request is an HTTP redirect, follow it, preserving referrer details from the previous request.
2. If the previous referrer is the same as the current referrer, we assume content has started to load and return the current referrer as the landing page.
3. If the current referrer is different from the previous referrer, and the previous referrer is seen in future requests, assume it is the actual landing page and return the previous referrer.
4. Otherwise, continue to the next request, updating the current and previous referrer.

This algorithm has two failure states: (1) a site redirects, loads additional resources, then redirects again, or (2) the site has no additional requests with referrers. The first failure mode will not be detected, but the second will be. From manual inspection, the first failure mode happens very infrequently. For example, we find that only 0.05% of sites are incorrectly marked as having HTTPS as a result of this failure mode. For the second failure mode, we find that we can’t correctly label the landing pages of 2973 first-party sites (0.32%) on the top 1 million sites. For these sites we fall back to the requested top-level URL.

13.3 Detecting Cookie Syncing

We consider two parties to have cookie synced if a cookie ID appears in specific locations within the referrer, request, and location URLs extracted from HTTP request and response pairs. We determine cookie IDs using the algorithm described in Section 4. To determine the sender and receiver of a synced ID we use the following classification, in line with previous work [44, 1]:
• If the ID appears in the request URL: the requested domain is the recipient of a synced ID.
• If the ID appears in the referrer URL: the referring domain is the sender of the ID, and the requested domain is the receiver.
• If the ID appears in the location URL: the original requested domain is the sender of the ID, and the redirected location domain is the receiver.

This methodology does not require reverse engineering any domain’s cookie sync API or URL pattern. An important limitation of this generic approach is the lack of discrimination between intentional cookie syncing and accidental ID sharing. The latter can occur if a site includes a user’s ID within its URL query string, causing the ID to be shared with all third parties in the referring URL.

The results of this analysis thus provide an accurate representation of the privacy implications of ID sharing, as a third party has the technical capability to use an unintentionally shared ID for any purpose, including tracking the user or sharing data. However, the results should be interpreted only as an upper bound on cookie syncing as the practice is defined in the online advertising industry.

13.4 Detection of Fingerprinting

Javascript minification and obfuscation hinder static analysis. Minification is used to reduce the size of a file for transit. Obfuscation stores the script in one or more obfuscated strings, which are transformed and evaluated at run time using the eval function. We find that fingerprinting and tracking scripts are frequently minified or obfuscated, hence our dynamic approach. With our detection methodology, we intercept and record access to specific Javascript objects, which is not affected by minification or obfuscation of the source code.

The methodology builds on that used by Acar et al. [1] to detect canvas fingerprinting. Using the Javascript calls instrumentation described in Section 3.2, we record access to specific APIs which have been found to be used to fingerprint the browser. Each time an instrumented object is accessed, we record the full context of the access: the URL of the calling script, the top-level URL of the site, the property and method being accessed, any provided arguments, and any properties set or returned. For each fingerprinting method, we design a detection algorithm which takes the context as input and returns a binary classification of whether or not a script uses that method of fingerprinting when embedded on that first-party site.

When manual verification is necessary, we have two approaches which depend on the level of script obfuscation. If the script is not obfuscated we manually inspect the copy which was archived according to the procedure discussed in Section 3.2. If the script is obfuscated beyond inspection, we embed a copy of the script in isolation on a dummy HTML page and inspect it using the Firefox Javascript Deobfuscator extension (footnote 20). We also occasionally spot check live versions of sites and scripts, falling back to the archive when there are discrepancies.

20 javascript-deobfuscator/
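The four steps of the landing page algorithm in Section 13.2 can be sketched in Python as follows. This is one possible reading of the prose, not the code used for the measurements: it is written as a loop rather than recursively, the `Request` record and its field names are hypothetical, redirects are assumed to be flagged in the recorded data, and requests without a referrer are simply skipped.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    """Hypothetical record of one HTTP request made during a site visit."""
    url: str
    referrer: Optional[str] = None
    is_redirect: bool = False


def detect_landing_page(requests: List[Request]) -> str:
    """Return the inferred landing page URL for an ordered list of requests."""
    if not requests:
        raise ValueError("need at least the initial top-level request")
    previous = None  # referrer of the most recent non-redirect request
    for index, request in enumerate(requests):
        # Step 1: follow redirects, keeping the referrer state unchanged.
        if request.is_redirect:
            continue
        current = request.referrer
        if current is None:
            continue  # assumption: no referrer means nothing to compare
        if previous is not None:
            # Step 2: same referrer twice in a row, so content is loading.
            if current == previous:
                return current
            # Step 3: referrer changed, but the earlier one shows up again.
            later_referrers = (r.referrer for r in requests[index + 1:])
            if previous in later_referrers:
                return previous
        # Step 4: move on, updating the previous referrer.
        previous = current
    # Second failure mode: fall back to the requested top-level URL.
    return requests[0].url


if __name__ == "__main__":
    visit = [
        Request("http://example.com/", is_redirect=True),
        Request("https://www.example.com/"),
        Request("https://cdn.example.net/a.js", referrer="https://www.example.com/"),
        Request("https://t.example.org/p.gif", referrer="https://www.example.com/"),
    ]
    print(detect_landing_page(visit))
```

The fallback on the last line of the function corresponds to the second failure mode, where no later requests carry referrers. The first failure mode (redirect, load resources, redirect again) is not detected by this sketch either.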

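The sender and receiver classification of Section 13.3 can be illustrated with a short Python sketch. The function names and example URLs are hypothetical. In particular, `naive_domain` keeps only the last two host labels as a stand-in for the PS+1, which is wrong for multi-label public suffixes; an actual analysis would use the Public Suffix List as described in Section 13.1.

```python
from typing import List, Optional, Tuple
from urllib.parse import urlsplit

# A flow is (sender, receiver); the sender is None when it cannot be
# determined from the URL in which the ID was found.
Flow = Tuple[Optional[str], str]


def naive_domain(url: str) -> str:
    """Crude PS+1 stand-in: the last two labels of the hostname (assumption)."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def classify_sync(cookie_id: str,
                  request_url: str,
                  referrer_url: Optional[str] = None,
                  location_url: Optional[str] = None) -> List[Flow]:
    """Apply the three URL rules to one HTTP request and response pair."""
    flows: List[Flow] = []
    requested = naive_domain(request_url)
    # ID in the request URL: the requested domain is the recipient.
    if cookie_id in request_url:
        flows.append((None, requested))
    # ID in the referrer URL: referring domain sends, requested domain receives.
    if referrer_url and cookie_id in referrer_url:
        flows.append((naive_domain(referrer_url), requested))
    # ID in the location URL: requested domain sends, redirect target receives.
    if location_url and cookie_id in location_url:
        flows.append((requested, naive_domain(location_url)))
    return flows


if __name__ == "__main__":
    print(classify_sync(
        "abc123XYZ",
        "https://tracker-a.example/sync",
        location_url="https://match.tracker-b.example/in?partner_uid=abc123XYZ",
    ))
```

Because the rules only test for the presence of the ID string, the sketch shares the limitation noted in the text: it cannot tell intentional syncing from an ID that leaks through a referrer by accident.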
Fingerprinting Script | Count
src internal24.js 4588 internal23.js src 2963 2653 src.js 2093 1208 894 v2.js 498 303 180 173 140 127 118*/platform.min.js 97 85 72 71 56 56 55
685 others | 1853
TOTAL | 18283 (14371 unique) (note 1)

Table 11: Canvas fingerprinting scripts on the top Alexa 1 Million sites.
**: Some URLs are truncated for brevity.
1: Some sites include fingerprinting scripts from more than one domain.

Fingerprinting script | # of sites | Text drawn into the canvas
2941 mmmmmmmmmmlli 243 abcdefghijklmnopqr[snip] 1 gMcdefghijklmnopqrstuvwxyz0123456789 75 * mmmmmmmmmMMMMMMMMM=llllIiiiiii‘’. 2* 1 mmmmmmmmmmlli session.js 1 mimimimimimimi[snip]
TOTAL | 3263 (3250 unique) (note 2) | -

Table 12: Canvas font fingerprinting scripts on the top Alexa 1 Million sites.
**: Some URLs are truncated for brevity.
1: The majority of these inclusions were as subdomain of the first-party site, where the DNS record points to a subdomain of [...].
2: Some sites include fingerprinting scripts from more than one domain.

Fingerprinting Script | First-party Count | Classification
147 Tracking Tracking 115*/jsEngine.js 72 Tracking * Tracking 72 45 Tracking Tracking 45 Non-Tracking 31 Tracking 27 Tracking 16 15 Tracking 14 Tracking 6 Unknown 6 Tracking 3 Tracking 3 Tracking Unknown 2 2 Unknown 2 Tracking 2 Unknown
80 others present on a single first-party | 80 | -
TOTAL | 705 | -

Table 13: WebRTC Local IP discovery on the Top Alexa 1 Million sites.
**: Some URLs are truncated for brevity.

Site               Prominence    # of FP    Rank Change
[not preserved]          6.72    447,963             +2
[not preserved]          6.20    609,640             −1
[not preserved]          5.70    461,215             −1
[not preserved]          5.57    397,246              0
[not preserved]          4.20    309,159             +1
[not preserved]          3.27    176,604             +3
[not preserved]          3.02    233,435              0
[not preserved]          2.76    133,391             +4
[not preserved]          2.68    370,385             −4
[not preserved]          2.37     59,723            +13
[not preserved]          2.37     94,281             +2
[not preserved]          2.11    143,095             −1
[not preserved]          2.00    172,234             −3
[not preserved]          1.84    210,354             −6
[not preserved]          1.83     71,725             +5
[not preserved]          1.63     45,333            +17
[not preserved]          1.60     59,613             +7
[not preserved]          1.52     39,673            +24
[not preserved]          1.45     81,118             −3
[not preserved]          1.45     49,080             +9

Table 14: Top 20 third-parties on the Alexa top 1 million, sorted by prominence. The number of first-party sites each third-party is embedded on is included. Rank change denotes the change in rank between third-parties ordered by first-party count and third-parties ordered by prominence.
