1 One Sketch to Rule Them All: Rethinking Network Flow Monitoring with UnivMon † † ? ? † Zaoxing Liu , Vladimir Braverman , Gregory Vorsanger , Antonis Manousis , Vyas Sekar † ? Carnegie Mellon University Johns Hopkins University 1 Introduction ABSTRACT Network management is multi-faceted and encompasses a Network management requires accurate estimates of met- range of tasks including traffic engineering [11, 32], attack rics for many applications including traffic engineering (e.g., and anomaly detection [49], and forensic analysis [46]. Each heavy hitters), anomaly detection (e.g., entropy of source such management task requires accurate and timely statis- addresses), and security (e.g., DDoS detection). Obtain- tics on different application-level metrics of interest; e.g., the ing accurate estimates given router CPU and memory con- flow size distribution [37], heavy hitters [10], entropy mea- straints is a challenging problem. Existing approaches fall sures [38, 50], or detecting changes in traffic patterns [44]. in one of two undesirable extremes: (1) low fidelity general- At a high level, there are two classes of techniques to esti- purpose approaches such as sampling, or (2) high fidelity mate these metrics of interest. The first class of approaches but complex algorithms customized to specific application- relies on generic flow monitoring , typically with some form level metrics. Ideally, a solution should be both general of packet sampling (e.g., NetFlow [25]). While generic flow (i.e., supports many applications) and provide accuracy com- monitoring is good for coarse-grained visibility, prior work parable to custom algorithms. This paper presents Univ- has shown that it provides low accuracy for more fine-grained Mon , a framework for flow monitoring which leverages re- metrics [30, 31, 43]. These well-known limitations of sam- cent theoretical advances and demonstrates that it is possible pling motivated an alternative class of techniques based on to achieve both generality and high accuracy. UnivMon uses algorithms. Here, custom online al- or sketching streaming an application-agnostic data plane monitoring primitive; dif- gorithms and data structures are designed for specific met- ferent (and possibly unforeseen) estimation algorithms run rics of interest that can yield provable resource-accuracy trade- in the control plane, and use the statistics from the data plane offs (e.g., [17, 18, 20, 31, 36, 38, 43]). to compute application-level metrics. We present a proof- of-concept implementation of UnivMon using P4 and de- While the body of work in data streaming and sketching velop simple coordination techniques to provide a “one-big- has made significant contributions, we argue that this trajec- switch” abstraction for network-wide monitoring. We eval- tory of crafting special-purpose algorithms is untenable in uate the effectiveness of UnivMon using a range of trace- the long term. As the number of monitoring tasks grows, this driven evaluations and show that it offers comparable (and entails significant investment in algorithm design and hard- sometimes better) accuracy relative to custom sketching so- ware support for new metrics of interest. While recent tools lutions across a range of monitoring tasks. like OpenSketch [47] and SCREAM [41] provide libraries to reduce the implementation effort and offer efficient resource CCS Concepts allocation, they do not address the fundamental need to de- sign and operate new custom sketches for each task. Fur- • Networks → Network monitoring; Network measure- thermore, at any given point in time the data plane resources ment; have to be committed (a priori) to a specific set of metrics to monitor and will have fundamental blind spots for other Keywords metrics that are not currently being tracked. Ideally, we want a monitoring framework that offers both Flow Monitoring, Sketching, Streaming Algorithms by delaying the binding to specific applications generality Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or fidelity of interest but at the same time provides the required distributed for profit or commercial advantage and that copies bear this notice for estimating these metrics. Achieving generality and high and the full citation on the first page. Copyrights for components of this work fidelity simultaneously has been an elusive goal both in the- owned by others than ACM must be honored. Abstracting with credit is per- mitted. To copy otherwise, or republish, to post on servers or to redistribute to ory [33] (Question 24) as well as in practice [45]. lists, requires prior specific permission and/or a fee. Request permissions from In this paper, we present the UnivMon (short for Univer- [email protected] sal Monitoring) framework that can simultaneously achieve SIGCOMM ’16, August 22–26, 2016, Florianopolis, Brazil both generality and high fidelity across a broad spectrum of c 2016 ACM. ISBN 978-1-4503-4193-6/16/08. . . $15.00 © monitoring tasks [31, 36, 38, 51]. UnivMon builds on and http://dx.doi.org/10.1145/2934872.2934906 DOI:

2 body of work in data streaming or sketching. This derives extends recent theoretical advances in , universal streaming from a rich literature in the theory community on stream- where a single universal sketch is shown to be provably ac- ing algorithms starting with the seminal “AMS” paper [9] curate for estimating a large class of functions [15, 16, 19, and has since been an active area of research (e.g., [19, 24, 21, 22]. In essence, this generality can enable us to delay 26, 28]). At the high level, the problem they address is as the binding of the data plane resources to specific monitor- follows: Given an input sequence of items, the algorithm ing tasks, while still providing accuracy that is comparable is allowed to make a single or constant number of passes (if not better) than running custom sketches using similar re- over the data stream while using sub-linear (usually poly- sources. logarithmic) space compared to the size of the data set and While our previous position paper suggested the promise the size of the dictionary. The algorithm then provides an ap- of universal streaming [39], it fell short of answering several proximate estimate of the desired statistical property of the practical challenges, which we address in this paper. First, stream (e.g., mean, median, frequency moments). Streaming we demonstrate a concrete switch-level realization of Uni- is a natural fit for network monitoring and has been applied vMon using P4 [12], and discuss key implementation chal- to several tasks including heavy hitter detection [31], entropy lenges in realizing UnivMon. Second, prior work only fo- estimation [38], change detection [36], among others. cused on a single switch running UnivMon for a specific A key limitation that has stymied the practical adoption of feature (e.g., source addresses) of interest, whereas in prac- streaming approaches is that the algorithms and data struc- tice network operators want a panoramic view across multi- tures are tightly coupled to the intended metric of interest. ple features and across traffic belonging to multiple origin- This forces vendors to invest time and effort in building spe- destination pairs. To this end, we develop lightweight-yet- cialized algorithms, data structures, and corresponding hard- effective coordination techniques that enable UnivMon to ef- ware without knowing if these will be useful for their cus- fectively provide a “one big switch” abstraction for network- tomers. Given the limited resources available on network wide monitoring [34], while carefully balancing the moni- routers and business concerns, it is difficult to support a wide toring load across network locations. spectrum of monitoring tasks in the long term. Moreover, at We evaluate UnivMon using a range of traces [1, 2] and any instant the data plane resources are committed before- operating regimes and compare it to state-of-art custom sketch- hand to the application-level metrics and other metrics that ing solutions based on OpenSketch [47]. We find that for a may be required in the future (e.g., as administrators start single network element, UnivMon achieves comparable ac- some diagnostic tasks and require additional statistics) will curacy, with an observed error gap 3.6% and average er- ≤ fundamentally not be available. ror gap ≤ 1%. Furthermore, UnivMon outperforms OpenS- The efforts closest in spirit to our UnivMon vision is the ketch in the case of a growing application portfolio. In a minimalist monitoring work of Sekar et al. [45] and OpenS- network-wide setting, our coordination techniques can re- ketch by Yu et al., [47]. Sekar et al. showed empirically duce the memory consumption and communication with the that flow sampling and sample-and-hold [31] can provide control plane by up to three orders of magnitude. comparable accuracy to sketching when equipped with sim- In summary, this paper makes Contributions and roadmap: ilar resources. However, this work offers no analytical basis the following contributions: for this observation and does not provide guidelines on what A practical architecture which translates recent theoretical • metrics are amenable to this approach. advances to serve as the basis for a general-yet-accurate OpenSketch [47] addresses an orthogonal problem of mak- monitoring framework (§3, §4). ing it easier to implement sketches. Here, the router is equipped with a library of predefined functions in hardware (e.g., hash- An effective network-wide monitoring approach that pro- • maps or count-min sketches [26]) and the controller can re- vides a one-big switch abstraction (§5). program these as needed for different tasks. While OpenS- A viable implementation using emerging programmable • ketch reduces the implementation burden, it still faces key switch architectures (§6). shortcomings. First, because the switches are programmed A trace-driven analysis which shows that UnivMon pro- • to monitor a specific set of metrics, there will be a fundamen- vides comparable accuracy and space requirements com- tal lack of visibility into other metrics for which data plane pared to custom sketches (§7). resources have not been committed, even if the library of We begin with background and related work in the next functions supports those tasks. Second, to monitor a portfo- section. We highlight outstanding issues and conclude in §8. lio of tasks, the data plane will need to run many concurrent sketch instances, which increases resource requirements. 2 Background and Related Work In summary, prior work presents a fundamental dichotomy: Many network monitoring and management applications de- generic approaches that offer poor fidelity and are hard to pend on sampled flow measurements from routers (e.g., Net- reason about analytically vs. sketch-based approaches that Flow or sFlow). While these are useful for coarse-grained offer good guarantees but are practically intractable given metrics (e.g., total volume) they do not provide good fidelity the wide range of monitoring tasks of interest. unless these are run at very high sampling rates, which is Our recent position paper makes a case for a “RISC” ap- undesirable due to compute and memory overhead. proach for monitoring [39], highlighting the promise of re- cent theoretical advances in universal streaming [19,21]. How- This inadequacy of packet sampling has inspired a large

3 App 2 App 1 N App set of metrics to be estimated, and yet offer strong guar- ... antees on accuracy. UnivMon [R2.] There One-big-switch abstraction for monitoring: • . Metric Estimation 3 Topology Control Routing Plane may be several network-wide management tasks interested Manifest computation in measuring different dimensions of traffic; e.g., source IPs, destination ports, IP 5-tuples. UnivMon should pro- 1. Distribute #Sketches , Manifests vide a “one big switch” abstraction for monitoring to the Dimension , emory M management applications running atop UnivMon, so that UnivMon the estimates appear as if all the traffic entering the net- Data work was monitored at a giant switch [34]. P lane [R3.] While pure soft- • Feasible implementation roadmap: 2. Collect Sketch counters ware solutions (e.g., Open vSwitch [42]) may be valu- The data plane Overview of UnivMon: Figure 1: able in many deployments, for broader adoption and per- nodes perform the monitoring operations and report formance requirements, the UnivMon primitives used to sketch summaries to the control plane which calculates achieve [R1] and [R2] must have a viable implementation application-specific metric estimates. in (emerging) switch hardware [12, 13]. Given the trajectory of prior efforts that offer high gener- ever, this prior work fails to address several key practical ality and low fidelity (e.g, packet sampling) vs. low general- challenges. First, it does not discuss how these primitives ity and high fidelity (e.g., custom sketches), [R1] may appear can actually be mapped into switch processing pipelines. In infeasible. To achieve [R2], we observe that if each router fact, we observe that the data-control plane split that they acts on the traffic it observes independently, it can become suggest is impractical to realize as they require expensive difficult to combine the measurements and/or lead to signif- sorting/sifting primitives (see §6). Second, this prior work icant imbalance in the load across routers. Finally, for [R3], takes a narrow single-switch perspective. As we show later, we note that even emerging flexible switches [3, 12, 13] have naively extending this to a network-wide context can result constraints on the types of operations that they can support. in inefficient use of compute resources on switches and/or result in inaccurate estimates (see §5). This paper develops Next, we briefly outline how the Uni- Approach Overview: network-wide coordination schemes and demonstrate an im- vMon control and data plane designs address these key re- plementation in P4 [12]. Further, we show the fidelity of quirements and challenges: UnivMon on a broader set of traces and metrics. • The UnivMon plane uses sketching UnivMon data plane: primitives based on recent theoretical work on universal 3 UnivMon architecture [19, 21]. By design, these so-called universal streaming In this section, we provide a high-level overview of the Uni- sketches require no prior knowledge of the metrics to be vMon framework. We begin by highlighting the end-to-end estimated. More specifically, as long as these metrics sat- workflow to show the interfaces between (a) the UnivMon isfy a series of statistical properties discussed in detail in control plane and the management applications and (b) be- §4, we can prove theoretical guarantees on the memory- tween the UnivMon control and data plane components. We accuracy tradeoff for estimating these metrics in the con- discuss the key technical requirements that UnivMon needs trol plane. to satisfy and why these are challenging. Then, we briefly • Given that the data plane sup- UnivMon control plane: give an overview of the control and data plane design to set ports universal streaming, the control plane needs no ad- up the context for the detailed design in the following sec- ditional capabilities w.r.t. [R1] once it collects the sketch 1 tions. information from the router. It runs simple estimation al- Figure 1 shows an end-to-end view of the UnivMon frame- gorithms for every management application of interest as work. The UnivMon data plane nodes run general-purpose we discuss in §4 and provides simple APIs and libraries monitoring primitives that process the incoming stream of for applications to run estimation queries on the collected packets it sees and maintains a set of counter data structures counters. To address [R2], the UnivMon control plane associated with this stream. The UnivMon control plane as- sketching manifests that specify the monitoring generates signs monitoring responsibilities across the nodes. It peri- responsibility of each router. These manifests specify the odically collects statistics from the data plane, and estimates set of universal sketch instances for different dimensions the various application-level metrics of interest. of interest (e.g., for source IPs, for 5-tuples) that each router needs to maintain for different origin-destination There are three natural re- Requirements and challenges: (OD) pair paths that it lies on. This assignment takes quirements that UnivMon should satisfy: into account the network topology and routing policies Fidelity for a broad spectrum of applications: [R1.] • Ide- and knowledge of the hardware resource constraints of its ally UnivMon should require no prior knowledge of the network elements. 1 In the following sections, we begin by providing the back- We use the terms routers, switches, and nodes interchange- ground on universal streaming that forms the theoretical ba- ably.

4 For function sis for UnivMon. Then, in §5, we describe the network- -heavy el- g be the set containing core - G , let g wide coordination problem that the UnivMon control plane ements. g -heavy elements can be defined as, for 0 < γ < 1 , ∑ any element ] ∈ [ n solves. In §6, we show how we implement this design in such that g ( f i ) > γ . ) g f ( j i j P4 [7, 12]. Now, let us consider two cases: g -heavy hitter in the stream: 1. There is one sufficiently large 4 Theoretical Foundations of UnivMon If the frequency vector has one (sufficiently) large heavy In this section, we first describe the theoretical reasoning be- hitter, then most of mass is concentrated in this value. hind universal streaming and the class of supported func- L norm Now, it can be shown that a heavy hitter for the 2 tions [19, 21]. Then, we present and explain the underlying of the frequency vector is also a heavy hitter for com- algorithms from universal streaming which serve as a basis core , we can putable g [14, 19]. Thus, to compute G - for UnivMon. We also show how several canonical network L simply find heavy hitters (L2-HH) using some known 2 monitoring tasks are amenable to this approach. G - sum . techniques (e.g., [9, 24]) and use it to estimate 4.1 Theory of Universal Sketching 2. There is no single g -heavy hitter in the stream and no sin- gle element contributes significantly to the G - sum : For the following discussion, we consider an abstract stream When there is no single large heavy hitter, it can be shown ) with m of length m,n ( denote D f unique elements. Let n i sum - G that can be approximated w.h.p. by finding heavy -th unique element in the stream. i the frequency of the hitters on a series of sampled substreams of increasingly The intellectual foundations of many streaming algorithms smaller size. The exact details are beyond the scope of can be traced back to the celebrated lemma by Johnson and this paper [19] but the main intuition comes from tail Lindenstrauss [27]. This shows that N points in Euclidean bounds (Chernoff/Hoeffding). Each substream is defined space can be embedded into another Euclidean space with an recursively by the substream before it, and is created by exponentially smaller dimension while approximately pre- sampling the previous frequency vector by replacing each serving the pairwise distance between the points. Alon, Ma- coordinate of the frequency vector with a zero value with tias, and Szegedy used a variant of the Johnson-Lindenstrauss . Repeating this procedure k times reduces probability 0 5 . lemma to approximately compute the second moment of the √ ∑ ∑ k 2 2 2 the dimensionality of the problem by a factor of . Then, frequency vector = f L norm = (or the f ) in 2 i i i i summing across heavy hitters of all these recursively de- the streaming model [9], using a small (polylogarithmic) fined vectors, we create a single “recursive sketch” which amount of memory. The main question that universal stream- G - gives a good estimate of sum [21]. ing seeks to answer is whether such methods can be extended ∑ for an arbi- g ( f to more general statistics of the form ) i 4.2 Algorithms for Universal Sketching - G sum . . We refer to this statistic as the trary function g Using the intuition from the two cases described above, we Class of Stream-PolyLog Functions: Informally, streaming now have the following universal sketch construction using algorithms which have polylogarithmic space complexity, an online sketching stage and an offline estimation stage. is mono- are known to exist for G - sum functions, where g The proof of the theorems governing the behavior of these 2 2 tonic and upper bounded by the function O ( f ) [14, 19]. i algorithms is outside the scope of this paper and we refer Note that this only guarantees that some (possibly custom) readers to the previous work of Braverman et al [19, 21]. and sketching algorithm exists if G - sum ∈ Stream-PolyLog In this section, we focus on providing a conceptual view does not argue that a single “universal” sketch can compute of the universal sketching primitives. As we will discuss G s. all such - sum later, the actual data plane and control plane realization will Intuition Behind Universality: The surprising recent the- be slightly different to accommodate switch hardware con- oretical result of universal sketches is that for any function straints (see §6). de- Stream-PolyLog belongs to the class sum - G where () g In the online stage, as described in Algorithm 1, we main- single universal fined above can now be computed by using a L -heavy hitter” (L2-HH) log( tain ) n parallel copies of a “ 2 sketch . sketch (e.g., [24]), one for each substream as described in th The intuition behind universality stems from the follow- parallel instance, the algorithm j case 2 above. For the ing argument about heavy hitters in the stream. Informally, processes each incoming packet 5-tuple and uses an array if changing its frequency item f is a heavy hitter w.r.t. i g i : [ j } pairwise independent hash functions of 1 , 0 → { ] n h i significantly affects the - sum value as well. For instance, G to decide whether or not to sample the tuple. When 5-tuple √ ( consider the frequency vector 1 , 1 ,..., 1) n, n ; of size to , h ( tup h , arrives in the stream, if for all tup h ) = 1 i 1 j heavy hitter since its frequency L here the first item is a 2 then the tuple is added to , the sampled substream. Then, D j is a large fraction of the norm of the frequency vector. L 2 , we run an instance of L2-HH as shown D for substream j in Algorithm 1, and visualized in Figure 2. Each L2-HH in- 2 This is an informal explanation; the precise characteriza- L heavy hitters and their that contains Q stance outputs j 2 tion is more technically involved and can be found in [19]. estimated counts from . This creates substreams of de- D sum - G While streaming algorithms are also known for j 2 -th instance is expected to have all creasing lengths as the j f g when its [17] they can- grows monotonically faster than i not be computed in polylogarithmic space due to the lower of the hash functions agree to sample half as often as the /k − 2 1 -th instance. This data structure is all that is required ( j − 1) k > Ω( ) bound n [23]. 2 where

5 Algorithm 2 UnivMon Offline Estimation Algorithm ,...,Q Q { Q Input: Set of heavy hitters = } 0 n ) log( in j = 0 ... log( n ) , call g() on all counters w ) ( i 1. For j is Q . After . )) i ( g w ( g () , the Q -th entry in i j j j ∑ 2. Compute . )) i = Y ( g ( w log( ) n log( ) n i from j , compute: log( n 3. For each − 1 to 0 ) ∑ Figure 2: High-level view of universal sketch = ( - 2 h i )) ( i )) Y w (1 2 Y ( g + +1 j j j +1 j ∈ i Q j Algorithm 1 UnivMon Online Sketching Step Output: Y 0 ( m,n D { a ,a ,...,a Input: Packet stream } ) = m 1 2 For the following discussion, we consider network traffic as pairwise independent hash func- 1. Generate log( n ) m packets and at most D unique n a stream ( n,m ) with : [ ] n . ...h h } 1 , 0 →{ tions 1 n ) log( flows. When referring to the definitions of Heavy Hitters, . Q 2. Run L2-HH sketch on D and maintain HH set 0 L heavy hitters are a stronger notion that sub- note that 2 heavy hitters. sumes L 1 3. For , in parallel: j = 1 to log( n ) To detect heavy hitters in the network traf- Heavy Hitters: h × a ) in D arrives, if all (a) when a packet a ( 1 i i fic, our goal is to identify the flows that consume more than to h ( a ) ···× h ( a ) = 1 , sample and add a 2 i i j i 3 a fraction γ of the total capacity [31]. Consider a function . D sampled substream j ( outputs a list core G such that the corresponding x ) = - g x and maintain heavy D (b) Run L2-HH sketch on j (1 -approximation of their frequen- of heavy hitters with ) ± Q hitters j cies. For this case, these heavy hitters are -heavy hitters L 1 } Output: Q = { ,...,Q Q 2 0 log( n ) is upperbounded by x and . Thus we have an algorithm ( x g ) . This is technically a special case of that provides G - core for the online portion of our approach. sum the universal sketch; we are not ever computing a - G In the offline stage, we use Algorithm 2 to combine the - core directly in all cases. function and using G results of the parallel copies of Algorithm 1 to estimate dif- DDoS Victim Detection: Suppose we want to identify if ferent G - sum functions of interest. This method is based on a host X is experiencing a Distributed Denial of Service the Recursive Sum Algorithm from [21]. The input to this (DDoS) attack. We can do so using sketching by check- } { algorithm is the output of Algorithm 1; i.e., a set of Q j unique flows from different sources are k ing if more than buckets maintained by the L2-HH sketch from parallel in- communication with X [47]. To show that the simple DDoS . Let stance j -th bucket (heavy w i ( i ) be the counter of the j victim detection problem is solvable by the universal sketch, -th bucket i is the hash of the value of the ) i ( hitter) in h . Q j j 0 g consider a function g that and g ( (0) = 0 . Here ) = x x is the hash function described in Algorithm h where Q in j j 2 x x ( f is upper bounded by g and sketches already ex- ) = 1 Step 1. It can be shown that the output of Algorithm 2 is sum ist to solve this exact problem. Thus, we know - is G - [19, 21]. In this algorithm, G an unbiased estimator of sum in polylog- in G - and we approximate Stream-PolyLog sum ap- g is function Y each Y is recursively defined, where j arithmic space using the universal sketch. In terms of in- , the L2-HH sketch for substream Q plied to each bucket of j sum is esti- - terpreting the results of this measurement, if G , and the sum taken on the value of those buckets and all D j ′ mated to be larger than k , a specific host is a potential DDoS ′ Y > j . Note that Q is the set of heavy hitters from ,j j ) n log( victim. the sparsest substream D in Algorithm 1, and we begin n ) log( by computing Y Y . Thus, can be viewed as computing 0 ) n log( Change Detection: Change detection is the process of iden- G - in parts using these sampled streams. sum tifying flows that contribute the most to traffic change over The key observation here is that the online component, two consecutive time intervals. As this computation takes Algorithm 1, which will run in the UnivMon data plane is place in the control plane, we can store the output of the g to the specific choice of in the offline stage. This agnostic universal sketches from multiple intervals without impact- is in stark contrast to custom sketches where the online and ing online performance. Consider two adjacent time inter- offline stages are both tightly coupled to the specific statistic vals x . If the volume for a flow t in interval and t t B A A we want to compute. ] is S . The difference signal [ t over interval ] x [ x S and B B A x [ D is defined as | for ] x [ x S − ] x [ . A flow is a S | ] = A B 4.3 Application to Network Monitoring heavy φ change flow if the difference in its signal exceeds , G sum As discussed earlier, if a function - ∈ Stream-PolyLog percentage of the total change over all flows. The total dif- ∑ then it is amenable to estimation via the universal sketch. ference is . A flow ] x [ D = D is defined to be a x [ ] n ∈ x Next, we show that a range of network measurement tasks D heavy change iff · D . The task is to identify these φ ≥ ] x [ G - sum ∈ Stream-PolyLog . can be formulated via a suitable heavy change flows. We assume the size of heavy change 3 c over the total capacity . T flows is above some threshold ) ...D In this way, we obtain log( n ,D streams D ; 1 2 n ) log( heavy hit- We can show that the heavy change flows are L 1 ... = 1 , the number of unique items i.e., for in n n log j , t ) b ··· b ( ters on interval t and interval ( a ) ··· a . D D , is expected to be half of 1 A B 1 j +1 j n/ 2 2 n/

6 ∑ . ) = where | a − b | ,t G - sum here is L norm, L ( t i 1 1 B i A G - core can be solved which belongs to Stream-PolyLog , and sum - by universal sketch. The outputs the estimated size G G - core outputs the possible heavy of the total change D and sum and - change flows. By comparing the outputs from G G - core , we can detect and determine the heavy change flows that are above some threshold of all flows. We define entropy with the expres- Entropy Estimation: ∑ n f f i i sion H ≡− log( 0 log 0 = 0 ) [38] and we define =1 i m m H for source here. The entropy estimation task is to estimate IP addresses (but could be performed for ports or other fea- ∑ n f f i i tures). To compute the entropy, H ) = log( = − =1 i m m ∑ 1 4 . As can be easily obtained, log( f m ) f − log( m ) i i i m ∑ ) log( f f the difficulty lies in calculating . Here the func- i i i 2 x ) = x log( x ) is bounded by tion ( x ) = x g and thus its ( g - sum is in Stream-PolyLog and H can be estimated by uni- G versal sketch. Consider a system or network Global Iceberg Detection: Figure 3: Example topology to explain the one-big- N that consists of distributed nodes (e.g., switches). The switch notion and to compare candidate network-wide , contains a stream of tuples < item j data set S at node id j solutions U is an item identity from a set c = > μ where ...μ item } { n 1 id 5.1 Problem Scope and c is an incremental count. For example, an item can be a packet or an origin-destination (OD) flow. We define We begin by scoping the types of network-wide estimation ∑ ∑ c , the frequency of the item μ when = fr tasks we can support and formalize the one-big-switch ab- i i j <μ S ,c> ∈ i j aggregated across all the nodes. We want to detect the pres- straction that we want to provide in UnivMon. To illustrate ence of items whose total frequency across all the nodes this, we use the example in Figure 3 where we want to mea- T adds up to exceed a given threshold . In other words, we sure statistics over two dimensions of interest: 5-tuple and U μ would like to find out if there exists an element ∈ source-IP. i . (In §5, we will explain a solution to T ≥ fr such that In this example, we have four OD-pairs; suppose the set i gain a network-wide universal sketch. Here, we assume here , , of traffic flows on each of these is denoted by P P , P 11 12 21 that we maintain an abstract universal sketch across all nodes and P . We can divide the traffic in the network into four 22 by correctly combining all distributed sketches.) Consider a partitions, one per OD-pair. Now, imagine we abstract away out- x core ) = - x ( g function G such that the corresponding the topology and consider the union of the traffic across these ) − puts a list of global heavy hitters with approximation (1 ± partitions flowing through one logical node representing the g of their frequencies. For this case, since -heavy hitters are entire network; i.e., computing some estimation function L heavy hitters, we have an algorithm that provides G - core . P ] ] denotes the disjoint set ] , where ) ] P P ( F P 1 11 12 21 22 union operation. For this work, we restrict our discussion to network-wide 5 Network-wide UnivMon functions where we can independently compute the F esti- In a network-wide context, we have flows from several origin- mates on each OD-pair substream and add them up. In other words, we restrict our problem scope to estimation functions destination (OD) pairs, and applications may want network- F s such that: wide estimates over multiple packet header combinations of interest. For instance, some applications may want per- )+ P ( F )+ P ( F )+ P ( F ) = P ] ] ) P ( F ( P P ] P F 11 12 11 21 22 12 21 22 source IP estimates, while others may want characteristics in terms of the entire IP-5-tuple. We refer to these different Note that this still captures a broad class of network-wide packet header features and feature-combinations as dimen- tasks such as those mentioned in section 4.3. One such ex- . sions ample measurement is finding heavy hitters for destination In this section, we focus on this network-wide monitor- IP addresses. An important characteristic of the UnivMon ing problem of measuring multiple dimensions of interest approach is that in the network-wide setting the output of traversing multiple OD-pairs. Our goal is to provide equiva- sketches in the data plane can then be added together in the lent coverage and fidelity to a “one big switch abstraction”, control plane to give the same results as if all of the packets providing the same level of monitoring precision at the net- passed through one switch. The combination of the sepa- work level as at the switch level. We focus mostly for the rate sketches is a property of the universal sketch primitive case where each OD-pair has a single network route and de- used in the data plane and is independent of the final statistic scribe possible extensions to handle multi-pathing. monitored in the control plane, allowing the combination to work for all measurements supported by UnivMon. We do 4 however acknowledge that some tasks fall outside the scope sum e.g., a single counter or estimated as a G . -

7 remaining sketches to run. For instance, in Figure 3, GDC of this partition model; an example statistic that is out of , and the 2 O and 1 would maintain the source IP sketch at O scope would be measuring the statistical independence of A 5-tuple sketch at . This method is correct, as each sketch source and destination IP address pairs (i.e. if a source IP is for each OD pair is maintained once. However, it is difficult likely to appear with a given destination IP, or not), as this in- to properly balance resources as nodes at the intersection of troduces cross-OD-pair dependencies. We leave extensions multiple paths could be burdened with higher load. to support more general network-wide functions for future work (see §8). An alternative approach Reactive Query and Sketch (QS): efficiency and The challenge here is to achieve correctness is to use the controller to ensure better sketch assignment. bal- (e.g., switch memory, controller overhead) while also For instance, whenever a new flow is detected at a node, we across the data plane elements. Informally, ancing the load query the controller to optimally assign sketches. In Figure we seek to minimize the total number of sketches instanti- 3, the controller would optimally put the source IP sketch ated in the network and the maximum number of sketches at A (or vice versa). With this B and the 5-tuple sketch at that any single node needs to maintain. method, we can be assured of correctness. However, the reactive nature means that QS generates many requests to 5.2 Strawman Solutions and Limitations the controller. Next, we discuss strawman strategies and argue why these 5.3 Our Approach fail to meet one or more of our goals w.r.t. correctness, effi- ciency, and load balancing. We observe that we can combine Next, we present our solution, which uses the UnivMon con- the underlying sketch primitives at different switches as long troller to coordinate switches to guarantee correctness and as we use the same random seeds for our sketches, as the efficiency but without incurring the reactive query load of counters are additive at each level of the UnivMon sketch. the QS strategy described above. With this, we only need to add the guarantee that we count Periodically, the UnivMon controller gives each switch a each packet once to assure correctness. In terms of resource A . For each switch sketching manifest and for each OD- usage, our goal is to minimize the number of sketches used. lies on, the manifest specifies the dimen- pair’s route that A sions for which A needs to maintain a sketch. When a packet Redundant Monitoring (RM): Suppose for each of di- k arrives at a node, the node uses the manifest to determine mensions of interest, we maintain a sketch on every node, the set of sketching actions to apply. When the controller with each node independently processing traffic for the OD- needs to compute a network-wide estimate, we pull sketches pairs whose paths it lies on. Now, we have the issue of com- from all nodes and for each dimension, combine the sketches bining sketches to get an accurate network-wide estimate. In across the network for that dimension. This method mini- particular, adding all of the counters from the sketches would mizes communication to the control plane while still making be incorrect, as packets would be counted multiple times. In use of the controller’s ability to optimize resource use. the example topology, to correctly count packets we would The controller solves a simple constrained optimization need to either only use the sketches at A or B , or, conversely, problem that we discuss below. Note that maintaining two or D O combine the sketches for source IP at O 1 and 1 and 2 sketches uses much more memory than adding twice as many D . In terms of efficiency, this RM strategy maintains a 2 elements to one sketch. Thus, a key part of this optimiza- dimensions at each node and thus we main- k sketch for all tion is to ensure that we try to reuse the same sketch for a nodes. Our goal, is to kN tain a total of N sketches across given dimension across multiple OD pairs. In Figure 3, we total sketches, where maintain s s << kN . the 5-tuple B the source IP sketch, then A would first assign Ingress Monitoring (IM): An improvement over the RM 1 . When choosing where to 1) O ( sketch for the OD pair ,D method is to have only ingress nodes maintaining every sketch. ,D O , the algorithm place the sketches for the OD pair 2 2) ( Thus, for each OD pair, all sketch information is maintained matches the manifests such that the manifest for ( O 2 ,D 2) in a single node. By not having duplicate sketches per OD and the 5-tuple sketch A uses the source IP sketch already at pair, we will not double count and therefore can combine B already at . sketches together. This gives us the correctness guarantee We formulate the controller’s decision to place sketches missing in RM. In Figure 3, IM would maintain sketches at as an integer linear program (ILP) shown in Figure 4. Let . However, for O ingress nodes, we would run N and O 2 1 i s j be a binary decision variable denoting if the switch is jk N N sketches, and if we spend a similar amount kN ≈ i i maintaining a sketch for dimension j . The goal of the op- of resources to RM, which is still high. Additionally, these timization is to ensure that every OD-pair is suitably “cov- sketches woul be would all be present on a small number of ered” and that the load across the switches is balanced. Let nodes, where other nodes with available compute resources r be the amount of memory for a sketch for dimension k k would not run any sketches. denote maximum amount of memory available on and let R a single node. Note that the amount of memory for a sketch Greedy Divide and Conquer (GDC): To overcome the can be chosen in advance based on the accuracy required. concentration of sketches in IM above, one potential solution As a simple starting point, we focus primarily on the mem- is to evenly divide sketch processing duties across the path. ory resource consumption assuming that all UnivMon opera- Specifically, each node has a priority list of sketches, and tions can be done at line-rate; we can extend this formulation tags packets with the current sketches that are already main- to incorporate processing load as well. tained for this flow so that downstream nodes know which

8 red O1-D1 path and the blue O1-D1 path. By computing the Minimize: × N Maxload + Sumload , subject to ILP with multiple paths per OD pair as needed, sketches are ∑ distributed across nodes, and single counting is guaranteed. s ∀ ≥ 1 (1) : i,k jk This method works best when the total number of paths per p ∈ j i OD pair is constant relative to the total number of nodes, and ∑ larger numbers of paths will cause the sketches to concen- × j ∀ r (2) s : Load = j jk k trate on the source or destination nodes, possibly requiring k ∑ additional solutions. : R r ∀ s ≤ j (3) jk k The second issue occurs when multi-path routing causes : Maxload ∀ Load j (4) ≥ the frequency of an item to be split between too many sketches. j ∑ In the single-path setting, if an OD pair has a globally heavy ∀ j : Sumload (5) = Load j feature, then it will be equally heavy or heavier in the sketch j where it is processed. However in the multi-path case, it is Figure 4: ILP to compute sketching manifests possible for some OD pairs to have more paths than others, Eq (1) captures our coverage constraint that we maintain and thus it becomes possible for items that are less frequent 5 We model the per-node each sketch once for each OD-pair. but have fewer routes to be incorrectly reported heavy, and in load in Eq (2) and ensure that it respects the router capac- turn fail to report true heavy elements in the control plane. ity in Eq (3). Our objective function balances two compo- This problem is shown in Figure 5. In this case, we have nents: the maximum load that any one node faces and the 10,000 packets from node O1 to D1 split across two paths, 6 total number of sketches maintained. and 6,000 packets from O2 to D2. For simplicity, assume we are only looking for the "heaviest" source IP, the source 5.4 Extension to Multi-path IP with the highest frequency, and that the nodes have a sin- IP gle IP address, (i.e. Packets go from to IP and IP O D O 1 1 1 IP ). For this metric, the sketch at A will report IP as D O 2 1 a heavy source address with count 5,000, and B will report IP as a heavy source address with count 6,000. At the data O 2 plane these values are compared again, and the algorithm IP would return IP , a false positive, and miss , a false O O 2 1 negative. To solve this issue, instead of sending the heavy hitter report from individual sketches as described in Algo- rithm 1, the counters from each sketch must be sent directly to the control plane to be added, reconstructing the entire Figure 5: Example topology to showcase difficulty of sketch and allowing the correct heavy hitters to be identi- multi-path fied. In our example, the counters for the O1 at A and B would be added, and IP would be correctly identified as O Adapting the above technique to multi-path requires some 1 the heavy hitter. This approach is already used in the P4 modification, but is feasible. For simplicity, we still as- implementation discussed below, but is not a requirement of sume that packets are routed deterministically (e.g., by pre- UnivMon in general. We note that when using the method fix rules), but may have multiple routes. We defer settings described below in Section 6.2, identifying the true IP ad- that depend on randomized or non-deterministic routing for dress of the heavy item is harder in the multi-path setting, future work. but is solved by increasing γ relative to the maximum num- Even in this deterministic setting, there are two potential ber of sketches per true OD pair, which is naturally limited problems. First, ensuring packets are only counted once is by the ILP. With these modifications, the heavy hitters are important to avoid false positives, as in the single path case. correctly found from the combined sketch, and the one big Second, if the packets with a heavy feature (e.g., the desti- switch abstraction are maintained in a multi-path setting. nation address is heavy) are divided over many routes, it can increase the difficulty of accurately finding heavy hitters, re- 6 UnivMon Implementation moving false positives and preventing false negatives. The first issue, guaranteeing packets are counted only once, In this section, we discuss our data plane implementation in is solvable by the ILP presented above. For each path used P4 [7, 12]. We begin by giving an overview of key design by an OD pair, we create a unique sub-pair which we treat tradeoffs we considered. Then, we describe how we map as an independent OD pair. This is shown in Figure 5 by the UnivMon into corresponding P4 constructs. 5 Our coverage constraint allows multiple sketches of the 6.1 Implementation overview same kind to be placed in the same path. This is because in some topologies, it may not be feasible to have an equal- At a high level, realizing the UnivMon design described in ity constraint. In this case, the controller post-processes the the previous sections entails four key stages: solution and removes duplicates before assigning sketches stage which decides whether an incoming sampling 1. A for a given OD pair. 6 helps to normalize the values. MaxLoad term for N The packet will be added to a specific substream.

9 Implementation Stages is k (assuming ≤ 20 ) is only a few KBs per measurement Top-k HH App-Estimation Sketching Sampling between the two epoch. Thus, this decision to split the top- k planes computation is practical and simplifies the data plane Option 1 Option 2 requirements. Data Plane Control Plane Control Plane Data Plane 6.2 Mapping UnivMon data plane to P4 Sk S Top-k App S Top-k Sk App Based on the above discussion, the UnivMon data plane im- table sampling1 { plements sampling, sketching, and “heavy” flowkey storage Pros: Storage -Comm actions { Overhead sample_1; in P4. In a P4 program, packet processing is implemented } table Sket_1 { through Match+Action tables, and the control flow of the Cons: HW Complexity } actions { sket_1; program dictates the order in which these tables are applied } to incoming packets. Given the sketching manifests from } the control plane, we generate a control program that defines Figure 6: An illustration of UnivMon’s stages along with the different pipelines that a packet needs to be assigned to. the two main implementation options. These pipelines are specific to the dimension(s) (i.e., source IP, 5-tuple) for which the switch needs to maintain a univer- sketching stage which calculates sketch counters from 2. A sal sketch. We begin by explaining how we implemented input substreams and populates the respective sketch counter these functions and then describe a sample control flow. arrays. Sampling: P4 enables programmable calculations on spe- 3. A computation stage which identifies (approximately) k top- cific header fields using user-defined functions. We use this the heaviest elements of the input stream. k to sample incoming packets, with a configurable flowkey that can be any subset of the 5-tuple (srcIP, dstIP, srcPort, stage which collects the heavy element 4. An estimation dstPort, protocol). We define l pairwise-independent hash frequencies and calculates the desired metrics. functions, where l is the number of levels from §4. These Let us now map these stages to our data and control plane functions take as input the flowkey and output a binary value. modules from Figure 1. Our delayed binding principle im- We store this output bit as packet metadata. A packet is sam- plies that the estimation stage maps to the UnivMon con- pled at level if the outputs of the hash functions of all levels i trol plane. Since the sampling and sketching are processing ≤ i is equal to 1. We implement sampling for each level as packets, they naturally belong in the data plane to avoid con- a table that matches all packets and whose action is to apply trol plane overhead. the sampling hash function of that level. The hash metadata One remaining question is whether the top- k computation in the packets are used in conditional statements in the con- stage is in the data or control plane (Figure 6). Placing the substreams. Pack- trol flow to append the packet to the first i top- stage in the data plane has two advantages. First, the k ets that are not sampled are not subject to further UnivMon communication cost between the data and control plane will 7 processing. k rather than raw counters need to be low, as only the top- Sketching: The sketching stage is responsible for maintain- be transferred. Second, the data plane only needs to keep ing counters for each one of the substreams. From these l k track of the flowkeys (e.g., source IP) of the heaviest el- sketch counters, we can estimate the L2-HH for each stage ements at any given point in time, and thus not incur high k and then the overall top- heavy hitters and their counts. memory costs. However, one stumbling block is that real- While UnivMon does not impose any constraints on the L2- izing this stage requires (i) sorting counter values and (ii) HH algorithm to be used, in our P4 implementation we use storing information about the heavy elements in some form Count Sketch [24]. The sketching computation for each level of a priority queue. Unfortunately, these primitives may be is implemented as a table that matches every packet belong- hard to implement in hardware and are not supported in P4 ing to that level’s substream and its actions update the coun- yet. Thus, we make a pragmatic choice to split the top- k ters, stored in the sketch counter arrays. Similar to the sam- stage between the control and the data planes. We identify pling stage, we leverage user-defined hash functions that take the top- k heavy flowkeys in the dataplane and then we use as input the same flowkey as in the sampling stage. We use the raw data counters to calculate their frequencies in the their output to retrieve the indexes of the sketch register ar- control plane. The consequence is that we incur higher com- rays cells that correspond to a particular packet and update munication overhead to report the raw counter data structure, their value as dictated by the Count Sketch algorithm. but the number of flowkeys stored in the data plane remains abstraction which offers a form of register P4 provides a low. stateful memory that can store user-defined data and that UnivMon’s raw counters and flowkeys are stored on the can be arranged into one dimensional arrays of user-defined target’s on-chip memory (TCAM and SRAM). We argue that length. Register cells can be read or written by P4 action in practice the storage overhead of UnivMon is manageable statements and are also accessible through the control plane even for hardware targets with limited SRAM [4, 8, 47]. We API. Given that our goal is to store sketch counter values show that for the largest traces that we evaluate and with- 7 out losing accuracy, the total size of the raw counters can There may be other routing/ACL actions to be applied to be less than 600 KB whereas the cost of storing flowkeys the packet but this is outside our scope.

10 6.3 Control plane which do not represent byte or packet counts, we use reg- ister arrays to store and update sketch counters. The size of We implement the UnivMon control plane as a set of cus- the array and the bitlength of each array cell are user-defined tom C++ modules and libraries. We implement modules and can be varied based on the required memory-accuracy for (1) Assigning sketching responsibilities to the network tradeoff as well as on the available on-chip memory of the elements, and (2) implementing the top- k and estimation w t rows and hardware target. Each sketch is an array of stages. The P4 framework allows us to define the API for ∗ w , and columns. We instantiate register arrays of length t control-data plane communication. We currently use a sim- the bitlength of each cell is based on the maximum expected ple RPC protocol that allows us to import sketching mani- value of a counter. fests and to query the contents of data plane register arrays The one remaining issue is storing flowkeys correspond- After the heavy flowkeys and their respective counters ing to the “heavy” elements since these will be needed by the have been collected, the frequencies of the k -most frequent estimation stage running in the control plane. One option is elements in the stream are extracted. The heavy elements heavy hitters on- k to use a priority queue to maintain the top along with the statistical function of the metric to be esti- line, as it is probably the most efficient and accurate choice mated are then fed to the recursive algorithm of UnivMon’s to maintain heavy flowkeys. However, this can incur more estimation stage. than constant update time for each element, which makes it difficult to implement on hardware switches. To address 7 Evaluation the issue, we use an alternative approach which is to main- We divide our evaluation into two parts. First, we focus tain a fixed sized table of heavy keys and use constant time on a single router setup and compare UnivMon vs. custom updates for each operation. It is practical and acceptable sketches via OpenSketch [47]. Second, we demonstrate the when the size of the table is small (e.g., 10-50) and the ac- benefits of our network-wide coordination mechanisms. tual number of heavy flows doesn’t greatly exceed this size. The lookup/update operations could be very fast (in a single 7.1 Evaluation setup clock cycle) when leveraging some special types of memory We begin by describing our trace-driven evaluation setup. (e.g., TCAM) on hardware switches. Applications and error metrics: We have currently imple- Another scheme we use is as follows, and we leave im- mented translation libraries for five monitoring tasks: Heavy proved sketches for finding heavy flowkeys as future work. Hitter detection (HH), DDoS detection (DDoS), Change De- of them. For γ 1 /γ -threshold heavy hitters, there are at most tection (Change), Entropy Estimation (Entropy), and Global While packets are being processed, we maintain an up-to- Iceberg Detection (Iceberg). For brevity, we only show re- value (of the frequency vector), specifically L = L date 2 2 2 1 2 2 / 2 sults for metrics computed over one feature, namely the source is each flow’s current c ( + 1) c + ( ) L ( , where c ) − i i i 2 IP address; our results are qualitatively similar for other di- log(1 /γ ) buckets of size k . In the online count and we create mensions too. c stage, when updating the counters in L2-HH, is obtained i For Heavy Hitters and Global Iceberg detection, we set by reading current sketch counters. = 0 T a threshold 05% of the link capacity and identify all . 4 / 2 / L We then maintain buckets marked with ,..., ,L 2 2 large flows that consume more traffic than that threshold. We . For each element that arrives, if its counter is greater γL 2 obtain the average on the counts of each identi- relative error than , insert it into the L / / L 2 bucket using a simple hash; 2 2 2 | True | Estimate − fied large flow; i.e., . For Change Detection, 4 otherwise, if its counter is greater than / , insert it into the L 2 True of the φ whose frequency has changed more than a threshold L bucket, and so forth. When the value of 4 / L doubles 2 2 total change over all flows across two monitoring windows. L itself, we delete the last γL 2 / bucket and we add a new 2 2 05% 0 We chose this threshold to be and calculate the aver- . k bucket. This scheme ensures that O ( log(1 /γ )) flowkeys age relative error similar to HH. For Entropy Estimation and are stored, and at the end of the stream we can return most DDoS, we evaluate the relative error on estimated entropy top γL . k items heavier than 2 value and the number of distinct source IPs. P4 Control Flow: As a simple starting point, we use a se- We normalize UnivMon’s memory usage Configuration: quential control flow to avoid cloning every incoming packet with the custom sketches by varying three key parameters: (i.e., number of levels) times. This means that every packet l and number of columns t number of rows in Count-Sketch w is processed by a sketching, a storage and a sampling table l in the universal sketch. In tables, and the number of levels sequentially until the first level where it doesn’t get sampled. l total UnivMon uses t × w × counters. In OpenSketch, More specifically, after a packet passes the parsing stage dur- we configure the memory usage in a similar way by varying ing which P4 extracts its header fields, it is first processed number of rows t and counters per row w in all the sketches by the sketching table of level_0. The “heavy” keys for that they use. When comparing the memory usage with OpenS- stage are updated and then it is processed by the sampling ketch, we calculate the total number of sketch counters as- table of level_1. If the packet gets sampled at level_1, it is suming that each integer counter occupies 4 bytes. Both Uni- sketched at this level, the “heavy” keys are updated and the vMon and OpenSketch use randomized algorithms; we run procedure continues until the packet reaches the last level or the experiment 10 times with random hash seeds and report until it is not sampled. the median cross these runs.

11 10 Heavy Hitter Trace Loc Date and Time Change Detection DDoS Equinix-Chicago 1. CAIDA’15 2015/02/19 2. CAIDA’15 2015/05/21 Equinix-Chicago 2015/09/17 Equinix-Chicago 3. CAIDA’15 5 4. CAIDA’15 Equinix-Chicago 2015/12/17 2014/06/19 5. CAIDA’14 Equinix-Sanjose Error Rate (%) Table 1: CAIDA traces in the evaluation 1 0.01 Traces: For this evaluation, we use five different one-hour OC192-P4 OC192-1 OC192-2 OC192-3 OC192-4 OC192-5 backbone traces (Table 1) collected at backbone links of a Tier1 ISP between (i) Chicago, IL and Seattle, WA in 2015 (a) UnivMon 3 and (ii) between San Jose and Los Angeles in 2014 [1, 2]. Heavy Hitter Change Detection DDoS We split the traces into different representative time inter- vals (5s, 30s, 1min, 5min). For example, each one hour trace 2 min 25% , , contains 720 5s-epoch data points and we report 75% max , and median , on whisker bars. By default, we re- 1 port results for a 5-second trace. Each 5s packet-trace con- Error Rate (%) ∼ tains 155k to 286k packets with 55k distinct source IP ad- 40k distinct destination IP addresses. The link dresses and ∼ 0.01 OC192-2 OC192-1 OC192-4 OC192-5 OC192-3 speed of these traces is 10 Gbps. Experiment Setup: For our P4 implementation prototype, (b) OpenSketch we used the P4 behavioral simulator, which is essentially a Figure 7: Error rates of HH, Change and DDoS for Uni- P4-enabled software switch [6]. To validate the correctness vMon and OpenSketch of our P4 implementation, we compare it against a software implementation of the data plane and control plane algo- sion offers better accuracy vs. memory tradeoff than OpenS- rithms, written in C++. We evaluate P4 prototype on Trace ketch’s original method [44]. For completeness, we also re- 1 and run software implementation in parallel on Trace 1- 5. port the memory usage of OpenSketch’s original design (us- The results between the two implementations are consistent ing the k-ary sketch). From Figure 8c, we see UnivMon as the relative error between the results of the two imple- provides comparable accuracy even though UnivMon has a mentations does not exceed 0.3%. To evaluate OpenSketch, much smaller sketch table on each level of its hierarchical we use its simulator written in C++ [5]. structure. This is because the “diff” across sketches are well preserved in UnivMon’s structure. 7.2 Single Router Evaluation Next, we evaluate the memory needed Fixed Target Errors: Comparison under fixed memory setting: First, we com- ≤ ). In Figures 9 and 10 as 1% to achieve the same error rates ( pare UnivMon and OpenSketch on the applications that OpenS- we vary the monitoring window, we can see that only small ketch supports: HH, Change, and DDoS. In Figures 7a and amount of memory increase is required for both UnivMon 7b, we assign 600KB memory and use all traces in order to and OpenSketch to achieve 1% error rates. In fact, we find estimate the error when running UnivMon and OpenSketch. that UnivMon does not require more memory to maintain a We find that the absolute error is very small for both ap- stable error rate for increased number of flows in the traffic. proaches. We observe that OpenSketch provides slightly bet- This is largely because sketch-based approaches usually just ter results for all three metrics. However we note that Uni- take logarithmic memory increase in terms of input size to vMon uses 600KB memory to run three tasks concurrently maintain similar error guarantees. Furthermore, the nature while OpenSketch is given 600KB to run each task. Figure of traffic distribution also helps as there are only a few very 7a and 7b confirm that this observation holds on multiple heavy flows and the entire distribution is quite ‘flat’. traces; the error gap between UnivMon and OpenSketch is Other metrics: We also considered metrics not in the OpenS- 3 . . ≤ 6% ketch library in Figure 11 to confirm that UnivMon is able to The previous result considered a Accuracy vs. Memory: calculate a low-error estimate. Specifically, we consider the fixed memory value. Next, we study the sensitivity of the entropy of the distribution and the second frequency moment 2 9 2 2 error to the memory available. Figure 8a and 8b shows that m for Again, we distinct elements. + f ··· f + f = F 2 m 2 1 the error is already quite small for all the HH and DDoS ap- 500KB) find that with reasonable amounts of memory ( ≥ plications and that the gap is almost negligible with slightly the error of UnivMon is very low. increased memory 1MB. ≥ Impact of Application Portfolio: Next, we explore how Figure 8c shows the results for the Change Detection task. UnivMon and OpenSketch handle a growing portfolio of For this task, the original OpenSketch paper uses a stream- same hash functions; combine two sketches by one sketch ing algorithm based on reversible k-ary sketches [44]. We “subtracts” the other; and use reversible sketch to trace back implement an extension to OpenSketch using a similar idea the keys. 8 as UnivMon. Our evaluation results show that our exten- 9 This is a measure of the “skewness” and is useful to calcu- 8 We maintain two recent Count-Min sketches using the late repeated rate or Gini index of homogeneity.

12 15 15 1 UnivMon UnivMon UnivMon OpenSketch(CM) OpenSketch OpenSketch OpenSketch(K-ary) 5 5 2 2 Error Rate (%) Error Rate (%) 1 Error Rate (%) 0.1 0.1 0.01 0.4 0.1 2 0.8 1 2 3 0.8 1 0.25 0.4 0.8 1 0.2 0.6 Memory Usage (MB) Memory Usage (MB) Memory Usage (MB) (b) DDoS (c) Change (a) HH Figure 8: Error vs. Memory for HH, DDoS, Change 10 OS-trace1 UM-trace1 Heavy Hitter DDoS OS-trace2 UM-trace2 Change Detection OS-trace3 UM-trace3 5 OS-trace4 UM-trace4 UM-trace5 OS-trace5 300 1 -1 -5 Error Gap (%) 200 Memory Usage (KB) -10 Appset1 Appset2 Appset3 1m 30s 5s 5m Figure 12: The impact of a growing portfolio of monitor- Monitoring Time Interval ing applications on the relative performance Figure 9: HH: average memory usage to achieve a 1% UnivMon with Different Data Structures (600KB) error rate for different time intervals 100 Count-Sketch Pick-and-Drop OS-trace1 UM-trace1 Count-Min-Sketch UM-trace2 OS-trace2 OS-trace3 UM-trace3 UM-trace4 OS-trace4 2.0 UM-trace5 OS-trace5 1.5 Error Rate(%) 10 1.0 0.2 Entropy HH DDoS Change Memory Usage (KB) 0.5 0.3 Figure 13: Analyzing different HH data structures 5s 30s 1m 5m Monitoring Time Interval L algorithm that identifies heavy hitters as a building block. 2 Two natural questions arise: (1) How do different heavy Figure 10: Change: average memory usage to achieve a hitter algorithms compare and (2) Can we use other popu- 1% error rate for different time intervals lar heavy hitter identifiers, such as Count-Min sketch? We 15 implemented and tested the Pick-and-Drop algorithm [20] UnivMon(Entropy) and Count-Min sketch [26] as building blocks for UnivMon. UnivMon(F2) Figure 13 shows that Pick-and-Drop and CM sketch lose the generality of UnivMon as they can provide accurate re- 5 sults only for HH and Change tasks. This is because, intu- 2 Error Rate (%) 1 0.1 p 3) heavy hitters are identi- p = 1 ≥ ( itively, only L or p 0.1 0.4 0.8 1 1.5 fied. The technical analysis of universal sketch shows that Memory Usage (MB) only L heavy hitters contribute significantly to the G - Sum 2 Figure 11: Error rates of Entropy and F2 estimation - Sum when G L norm. As dis- is upper bounded by some 2 - G cussed in Section 4.3, the functions corresponding Sum monitoring tasks with a fixed memory. We set the switch L norms. Therefore, the es- to HH and Change are actually 1 memory to 600KB for both UnivMon and OpenSketch and L heavy hitters output by Count-Min or Pick-and- timated 1 run three different application sets: AppSet1={HH}, AppSet2 Drop work well for HH and Change tasks, but not Entropy ={HH,DDoS}, and AppSet3={HH,DDoS,Change}. We as- or DDoS. When combining heavy hitter counters in the re- sume that OpenSketch divides the memory uniformly across cursive step of calculation, we will simply miss too many the constituent applications; i.e., in AppSet1 600KB is de- significant heavy elements for all tasks. voted to HH, but in AppSet2 and Appset3, HH only gets 300KB and 200KB respectively. Figure 12 shows the “er- Processing Overhead: One concern might be the computa- − ror gap” between UnivMon and OpenSketch (UnivMon tional cost of the UnivMon vs. custom sketch primitives. We OpenSketch); i.e., positive values imply UnivMon is worse used the Intel Performance Counter Monitor [29] to evalu- and vice versa. As expected, we find that when running con- ate compute overhead (e.g., Total cycles on CPU) on Uni- current tasks, the error gap decreases as each task gets less vMon and OpenSketch’s software simulation libraries. For memory in OpenSketch. That is, with more concurrent and any given task, our software implementation was only 15% supported tasks, UnivMon can still provide guaranteed re- more expensive than OpenSketch. When we look at all three sults on each of the applications. applications together, however, the UnivMon takes only half the compute cycles as used by OpenSketch in total. While Choice of Data Structures: UnivMon uses a a sketching

13 Network Wide Evaluation (600KB per sketch) Network Wide Evaluation (600KB per sketch) Network Wide Evaluation (600KB per sketch) 2k 1 2000 Ingress Ingress Ingress Greedy-D.&C. Greedy-D.&C. Greedy-D.&C. Q.&S. Q.&S. Q.&S. 1500 UnivMon UnivMon UnivMon 1k 1000 Error Rate (%) 500 Average Memory(KB) 0.1 Total requests to controller 0.1k 0 BellSouth GEANT ATT-N.A. BellSouth ATT-N.A. GEANT BellSouth ATT-N.A. GEANT (c) Total number of requests to controller (a) Error rates of global iceberg detection (b) Average memory consumption Figure 14: Network-wide evaluation on major ISP backbone topologies Topology OD Pairs Dim. Time (s) Total Sketches 2. When comparing sensitivity to error and available mem- 4 68 1560 Geant2012 0.09 ory, we observe that UnivMon provides comparable ac- 0.10 60 Bellsouth 2550 4 curacy with OpenSketch with similar, or smaller memory 2.8 252 18906 4 Dial Telecom requirements. 0.22 1560 Geant2012 136 8 Bellsouth 2550 120 8 0.28 3. The network-wide evaluation shows that UnivMon pro- 8 12.6 504 18906 Dial Telecom vides an even distribution of resources on each node while Table 2: Time to compute sketching manifests using ILP providing results with high accuracy. we acknowledge that we cannot directly translate into ac- 8 Conclusions and Future Work tual hardware processing overheads, this suggests that Uni- vMon’s compute footprint will be comparable and possibly In contrast to the status quo in flow monitoring that can of- better. fer generality or fidelity but not both simultaneously, Univ- Mon offers a dramatically different design point by leverag- 7.3 Network-wide Evaluation ing recent theoretical advances in universal streaming. By For the network-wide evaluation, we consider different topolo- delaying the binding of data plane primitives to specific (and gies from the Topology Zoo dataset [35]. As a specific network- unforeseen) monitoring UnivMon provides a truly software- wide task, we consider the problem of estimating source IP defined monitoring approach that can fundamentally change and destination IP “icebergs”. We report the average relative network monitoring. We believe that this “minimality” of errors across these two tasks. the UnivMon design will naturally motivate hardware ven- Figure 14a, Figure 14b, and Benefits of Coordination: dors to invest time and resources to develop optimized hard- Figure 14c present the error, average memory consumption, ware implementations, in the same way that a minimal data and total controller requests of four solutions: Ingress Moni- plane was key to get vendor buy-in for SDN [40]. toring(IM), Greedy Divide and Conquer(GDC), Query and Our work in this paper takes UnivMon beyond just a the- Sketch(QS), and our approach(UnivMon). We pick three oretical curiosity and demonstrates a viable path toward a representative topologies: AT&T North America, Geant, and switch implementation and a network-wide monitoring ab- Bell South. We see that UnivMon provides an even distribu- straction. We also demonstrate that UnivMon is already very tion of resources on each node while providing results with competitive w.r.t. custom solutions and that the trajectory high accuracy. Furthermore, the control overhead is several (i.e., as the number of measurement tasks grows) is clearly orders of magnitude smaller than purely reactive approaches. biased in favor of UnivMon vs. custom solutions. One potential concern is the time to solve ILP solving time: UnivMon already represents a substantial improvement the ILP. Table 2 shows the time to compute the ILP solution over the status quo, That said, we identify several avenues on a Macbook Pro with a 2.5 GHz Intel Core i7 processor us- for future work to further push the envelope. First, in terms k allowing at most ing glpsol sketches per switch, where of the data plane, while the feasibility of mapping UnivMon k is the number of dimensions maintained. We see that the to P4 is promising and suggests a natural hardware mapping, ILP computation takes at most a few seconds which suggest we would like to further demonstrate an actual hardware im- that updates can be pushed to switches with reasonable re- plementation on both P4-like and other flow processing plat- sponsiveness as the topology or routing policy changes. forms. Second, in terms of the one-big-switch abstraction, we need to extend our coordination and sketching primitives 7.4 Summary of Main Findings to capture other classes of network-wide tasks that entail Our analysis of UnivMon’s performance shows that: cross-OD-pair dependencies. Third, while the ILP is quite scalable for many reasonable sized topologies, we may need 1. For a single router with 600KB of memory, we observe other approximation algorithms (e.g., via randomized round- comparable median error rate values between UnivMon ing) to handle even larger topologies. Fourth, in terms of the 6% and OpenSketch, with a relative error gap ≤ 3 . . The various dimensions of interest to track, we currently main- relative error decreases significantly with a growing ap- tain independent sketches; a natural question if we can avoid plication portfolio.

14 explicitly creating a sketch per dimension. Finally, while be- [26] G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. J. Algorithms , ing application agnostic gives tremendous power, it might be 2005. useful to consider additional tailoring where operators may [27] S. Dasgupta and A. Gupta. An elementary proof of a theorem of want the ability to adjust the granularity of the measurement , Jan. 2003. Random Struct. Algorithms johnson and lindenstrauss. to dynamically focus on sub-regions of interest [48]. [28] M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining stream SIAM J. Comput. statistics over sliding windows. , June 2002. We thank our shepherd Mohammad Acknowledgments: [29] R. Dementiev, T. Willhalm, O. Bruggeman, P. Fay, P. Ungerer, Alizadeh and the SIGCOMM reviewers for their comments A. Ott, P. Lu, J. Harris, P. Kerly, P. Konsor, A. Semin, M. Kanaly, R. Brazones, and R. Shah. Intel performance counter monitor - a that helped improve the paper. This work was supported in better way to measure cpu utilization. http://goo.gl/tQ5gxa. part by NSF awards CCF-1536002, IIS-1447639, Raytheon [30] N. Duffield, C. Lund, and M. Thorup. Estimating flow distributions BBN Technologies, and by a Google Faculty Award. Proc. from sampled flow statistics. In , SIGCOMM, 2003. [31] C. Estan and G. Varghese. New directions in traffic measurement and 9 References , SIGCOMM, 2002. accounting. In Proc. [32] A. Feldmann, A. Greenberg, C. Lund, N. Reingold, J. Rexford, and [1] Caida internet traces 2014 sanjose. http://goo.gl/uP5aqG. F. True. Deriving traffic demands for operational ip networks: [2] Caida internet traces 2015 chicago. http://goo.gl/xgIUmF. Methodology and experience. , June 2001. IEEE/ACM Trans. Netw. [3] Intel flexpipe. http://goo.gl/H5qPP2. [33] P. Indyk, A. McGregor, I. Newman, and K. Onak. Open problems in [4] Netfpga technical specifications. http://netfpga.org/1G_specs.html. data streams, property testing, and related topics. 2011. [5] Opensketch simulation library. https://goo.gl/kyQ80q. [34] N. Kang, Z. Liu, J. Rexford, and D. Walker. Optimizing the "one big [6] P4 behavioral simulator. https://github.com/p4lang/p4factory. Proc. switch" abstraction in software-defined networks. In , CoNEXT, 2013. [7] P4 specification. http://goo.gl/5ttjpA. [8] Why big data needs big buffer switches. https://goo.gl/ejWUIq. [35] S. Knight, H. Nguyen, N. Falkner, R. Bowden, and M. Roughan. The Selected Areas in Communications, IEEE internet topology zoo. [9] N. Alon, Y. Matias, and M. Szegedy. The space complexity of Journal on , october 2011. Proc. approximating the frequency moments. In , STOC, 1996. [36] B. Krishnamurthy, S. Sen, Y. Zhang, and Y. Chen. Sketch-based [10] N. Bandi, A. Metwally, D. Agrawal, and A. El Abbadi. Fast data change detection: methods, evaluation, and applications. In Proc., stream algorithms using associative memories. In , SIGMOD, Proc. ACM SIGCOMM IMC , 2003. 2007. [37] A. Kumar, M. Sung, J. J. Xu, and J. Wang. Data streaming [11] T. Benson, A. Anand, A. Akella, and M. Zhang. Microte: Fine algorithms for efficient and accurate estimation of flow size grained traffic engineering for data centers. In Proc. , CoNEXT, 2011. Proc. , SIGMETRICS, 2004. distribution. In [12] P. Bosshart, D. Daly, G. Gibb, M. Izzard, N. McKeown, J. Rexford, [38] A. Lall, V. Sekar, M. Ogihara, J. Xu, and H. Zhang. Data streaming C. Schlesinger, D. Talayco, A. Vahdat, G. Varghese, and D. Walker. Proc. algorithms for estimating entropy of network traffic. In , P4: Programming protocol-independent packet processors. SIGMETRICS/Performance, 2006. SIGCOMM Comput. Commun. Rev. , July 2014. [39] Z. Liu, G. Vorsanger, V. Braverman, and V. Sekar. Enabling a "risc" [13] P. Bosshart, G. Gibb, H.-S. Kim, G. Varghese, N. McKeown, approach for software-defined monitoring using universal streaming. M. Izzard, F. Mujica, and M. Horowitz. Forwarding metamorphosis: , ACM HotNets, 2015. In Proc. Fast programmable match-action processing in hardware for sdn. In [40] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, Proc. , SIGCOMM, 2013. L. Peterson, J. Rexford, S. Shenker, and J. Turner. Openflow: [14] V. Braverman and S. R. Chestnut. Universal Sketches for the Enabling innovation in campus networks. SIGCOMM Comput. Frequency Negative Moments and Other Decreasing Streaming , Mar. 2008. Commun. Rev. Sums. In APPROX/RANDOM , 2015. [41] M. Moshref, M. Yu, R. Govindan, and A. Vahdat. SCREAM: Sketch [15] V. Braverman, S. R. Chestnut, R. Krauthgamer, and L. F. Yang. Proc., Resource Allocation for Software-defined Measurement. In , 2015. CoRR Streaming symmetric norms via measure concentration. , 2015. CoNEXT [16] V. Braverman, S. R. Chestnut, D. P. Woodruff, and L. F. Yang. [42] B. Pfaff, J. Pettit, T. Koponen, K. Amidon, M. Casado, and Streaming space complexity of nearly all functions of one variable on S. Shenker. Extending networking into the virtualization layer. In frequency vectors. In , 2016. Proc., PODS , 2009. Proc., HotNets [17] V. Braverman, J. Katzman, C. Seidell, and G. Vorsanger. An optimal [43] A. Ramachandran, S. Seetharaman, N. Feamster, and V. Vazirani. algorithm for large frequency moments using o(nˆ(1-2/k)) bits. In , IMC, 2008. Proc. Fast monitoring of traffic subpopulations. In APPROX/RANDOM , 2014. [44] R. Schweller, A. Gupta, E. Parsons, and Y. Chen. Reversible sketches [18] V. Braverman, Z. Liu, T. Singh, N. V. Vinodchandran, and L. F. for efficient and accurate change detection over network data Yang. New bounds for the CLIQUE-GAP problem using graph Proc. streams. In , IMC, 2004. decomposition theory. In In Proc., MFCS , 2015. [45] V. Sekar, M. K. Reiter, and H. Zhang. Revisiting the case for a Proc. , [19] V. Braverman and R. Ostrovsky. Zero-one frequency laws. In , IMC, minimalist approach for network flow monitoring. In Proc. STOC, 2010. 2010. [20] V. Braverman and R. Ostrovsky. Approximating large frequency [46] Y. Xie, V. Sekar, D. A. Maltz, M. K. Reiter, and H. Zhang. Worm , APPROX/ROMDOM moments with pick-and-drop sampling. In S&P origin identification using random moonwalks. In . IEEE 2013. Computer Society, 2005. [21] V. Braverman and R. Ostrovsky. Generalizing the layering method of [47] M. Yu, L. Jose, and R. Miao. Software defined traffic measurement indyk and woodruff: Recursive sketches for frequency-based vectors , NSDI, 2013. Proc. with opensketch. In APPROX/RAMDOM on streams. In . 2013. [48] L. Yuan, C.-N. Chuah, and P. Mohapatra. Progme: towards [22] V. Braverman, R. Ostrovsky, and A. Roytman. Zero-one laws for , 2011. IEEE/ACM TON programmable network measurement. sliding windows and universal sketches. In APPROX/RANDOM , [49] Y. Zhang. An adaptive flow counting method for anomaly detection 2015. , CoNEXT, 2013. Proc. in sdn. In [23] A. Chakrabarti, S. Khot, and X. Sun. Near-optimal lower bounds on [50] H. C. Zhao, A. Lall, M. Ogihara, O. Spatscheck, J. Wang, and J. Xu. the multi-party communication complexity of set disjointness. In A data streaming algorithm for estimating entropies of od flows. In , 2003. IEEE CCC , IMC, 2007. Proc. [24] M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items [51] H. C. Zhao, A. Lall, M. Ogihara, and J. J. Xu. Global iceberg . 2002. Automata, Languages and Programming in data streams. In Proc., ICDE , 2010. detection over distributed data streams. In [25] B. Claise. Cisco systems netflow services export version 9. RFC 3954.