WWW 2008 / Refereed Track: Performance and Scalability
April 21-25, 2008. Beijing, China

A Comparative Analysis of Web and Peer-to-Peer Traffic

Naimul Basher (1), Aniket Mahanti (1), Anirban Mahanti (2), Carey Williamson (1), and Martin Arlitt (1)
(1) University of Calgary, Canada
(2) Indian Institute of Technology Delhi, India

ABSTRACT

Peer-to-Peer (P2P) applications continue to grow in popularity, and have reportedly overtaken Web applications as the single largest contributor to Internet traffic. Using traces collected from a large edge network, we conduct an extensive analysis of P2P traffic, compare P2P traffic with Web traffic, and discuss the implications of increased P2P traffic. In addition to studying the aggregate P2P traffic, we also analyze and compare the two main constituents of P2P traffic in our data, namely BitTorrent and Gnutella. The results presented in the paper may be used for generating synthetic workloads, gaining insights into the functioning of P2P applications, and developing network management strategies. For example, our results suggest that new models are necessary for Internet traffic. As a first step, we present flow-level distributional models for Web and P2P traffic that may be used in network simulation and emulation experiments.

Categories and Subject Descriptors
C.2.2 [Computer-Communications Networks]: Network Protocols; I.6.6 [Simulation and Modeling]: Model Development

General Terms
Measurement, Performance

Keywords
Web, Peer-to-Peer, Traffic Characterization, Traffic Models

1. INTRODUCTION

In the mid-1990s, a significant proportion of Internet traffic was from applications that used HTTP, the standard protocol for exchanging Web documents. The distinguishing characteristics of Web-dominated Internet traffic include small-sized flows, short-lived connections, asymmetric flow volumes, and well-defined port usage.
For the past decade, these characteristics have underpinned the traffic models used in network simulation and emulation experiments.

The introduction of Peer-to-Peer (P2P) file sharing applications, such as Napster in 2000, triggered a paradigm shift in Internet data exchange. P2P applications typically share large multimedia files with individual hosts (called peers), which act as both content providers and consumers. A peer can obtain portions of a file concurrently from multiple peers and/or obtain portions of the same file from a single peer using one or more persistent connections. P2P usage has grown steadily since its inception, and recent empirical studies indicate that Web and P2P together dominate today's Internet traffic [17, 21].

Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2008, April 21-25, 2008, Beijing, China. ACM 978-1-60558-085-2/08/04.

In this paper we use recent packet traces, collected at the gateway of a large university, to extensively characterize and compare traffic generated by Web and P2P applications. Our focus is on characterizing the behaviors of these applications at the flow-level and host-level. The goal of this characterization is to develop flow-level distributional models that may be used to refine models of Internet traffic for use in network simulation and emulation experiments, to provide insights into the similarities and differences between Web and P2P traffic, and to obtain insights into how current P2P applications work.

A distinguishing aspect of our work is the use of recent full-payload packet traces. Popular P2P applications, including BitTorrent, Gnutella, and eDonkey, are known to use dynamic ports, in addition to well-known ports [6, 11, 20]. Identification of P2P traffic by default port numbers is likely to miss a significant portion of this type of traffic.
In fact, our data suggests that as much as 90% of P2P traffic may be on random ports. In this work, we utilize payload-based signature matching to accurately identify P2P traffic.

Our study highlights the evolving nature of Internet traffic due to growing P2P traffic. In addition to studying the aggregate P2P traffic, we also analyze and compare two popular P2P applications: Gnutella and BitTorrent. This study of individual P2P applications aids in understanding the aggregate P2P traffic trends and also helps in understanding how these two applications work. We consolidate our understanding of these traffic types by developing distributional models for each type of traffic; these models can help refine models of Internet traffic.

We present high-level results and key observations from our study in Tables 1 and 2. Table 1 summarizes the similarities/dissimilarities between Web and P2P traffic, while Table 2 summarizes the similarities/dissimilarities between Gnutella and BitTorrent traffic.

The remainder of this paper is structured as follows. Our trace collection, traffic identification, and analysis methodologies are described in Section 2. Sections 3 and 4 present flow-level and host-level characterization results, respectively. Section 5 reviews related work. Issues related to trace data collection and analysis are discussed in Section 6. Section 7 summarizes our contributions and lists future work.

Table 1: Key results: Comparing Web and P2P traffic

Flow Size (Section 3.1)
  Web: Introduces many mice but few elephant flows. Model: hybrid Weibull-Pareto distribution.
  P2P: Introduces many mice and elephant flows. Model: hybrid Weibull-Pareto distribution.
Flow Inter-arrival Time (Section 3.2)
  Web: Typically short inter-arrival time. Distribution is long-tailed. Model: two-mode Weibull distribution.
  P2P: Typically long inter-arrival time. Distribution is heavy-tailed. Model: hybrid Weibull-Pareto distribution.
Flow Duration (Section 3.3)
  Web: Typically short-lived. Model: two-mode Pareto distribution.
  P2P: Typically long-lived. Model: hybrid Weibull-Pareto distribution.
Flow Concurrency (Section 4.1)
  Web: Most hosts maintain more than one concurrent flow. Hosts maintain concurrent flows with a few distinct hosts.
  P2P: Many hosts maintain only one flow at a time. Hosts that maintain more than one flow do so by connecting with many distinct hosts.
Transfer Volume (Section 4.2)
  Web: Large transfers are dominated by downstream traffic. Heavy-hitters account for a large portion of total transfer and their transfers follow a power-law distribution.
  P2P: Large transfers happen in either upstream or downstream direction. Heavy-hitters account for a huge portion of total transfer and their transfers follow a power-law distribution.
Geography (Section 4.3)
  Web: Most external hosts are located primarily in the same geographic region.
  P2P: External peers are globally distributed.

Table 2: Key results: Comparing Gnutella and BitTorrent traffic

Flow Size
  Gnutella: Both small and large flows are observed. Elephants are relatively more frequent. Distribution is heavy-tailed. Model: hybrid Lognormal-Pareto distribution.
  BitTorrent: Small flows are prevalent. Elephants are less frequent, but comparatively large. Distribution is heavy-tailed. Model: hybrid Lognormal-Pareto distribution.
Flow Duration
  Gnutella: Typically short-lived. Distribution is heavy-tailed.
  BitTorrent: Typically long-lived. Distribution is long-tailed.
Flow Concurrency
  Gnutella: Peers mostly connect to a single host at a time.
  BitTorrent: Peers maintain many concurrent flows with a large number of distinct hosts.
Transfer Volume
  Gnutella: Transfers are extremely asymmetric and dominated by single direction traffic. Heavy hitters account for less volume of traffic.
  BitTorrent: Transfers are comparatively less asymmetric and more balanced. Heavy-hitters contribute more traffic volume.
Geography
  Gnutella: External peers are mostly concentrated in the same geographic region.
  BitTorrent: External peers are from regions with broadband connectivity.

2. METHODOLOGY

2.1 Trace Collection and Traffic Identification

The network traffic traces used in this work were collected from the commercial Internet link (footnote 1) of the University of Calgary, a large research-intensive university with 28,000 students and 5,000 employees. We used lindump (footnote 2) running on a dual processor 1.4 GHz Pentium system with 2 GB memory and 70 GB disk space to capture TCP/IP packets via port mirroring.

Footnote 1: At the time of trace collection, the Internet link was a 100 Mbps full-duplex connection.
Footnote 2: http://awgn.antifork.org/codes/lindump.c

Identifying P2P traffic correctly in the traces is a challenge. One approach, which has been used in some recent P2P characterization studies [17, 21, 24], is to map network traffic to applications using well-known port numbers. However, many P2P applications including BitTorrent and Gnutella use dynamic port numbers. This necessitated the use of payload signatures [11, 20] to identify applications. We used Bro [15], an open source Network Intrusion Detection System, to perform the payload signature matching. The built-in payload "signature matching engine" in Bro was used to perform the mapping of network flows to application types. We used the signatures described by Sen et al. [20] and Karagiannis et al. [11]; details of our payload-based identification scheme can be found in [6]. We identify the start of a TCP flow using connection establishment semantics (i.e., SYN-SYNACK-ACK packet transmissions) or by the first packet transmission observed between hosts, and the end of a TCP flow after observing a FIN or RST packet. By default, Bro considers a flow terminated if it is idle for more than 900 seconds.

The payload-based identification technique requires traces with relevant application-layer headers.
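To make the idea of payload signature matching concrete, the following is a minimal sketch; it is not the Bro signature engine and not the signature set of [11, 20]. The BitTorrent and Gnutella handshake prefixes are well-known protocol strings, while the `SIGNATURES` table, the `classify_payload` helper, and the coarse `HTTP/1.` marker for Web traffic are assumptions made purely for illustration.

```python
# Illustrative payload signatures (assumed for this sketch; not the
# signature set of Sen et al. [20] or Karagiannis et al. [11]).
SIGNATURES = [
    ("BitTorrent", b"\x13BitTorrent protocol"),  # handshake: length byte 19 + protocol string
    ("Gnutella", b"GNUTELLA CONNECT/"),          # Gnutella connection request
    ("Gnutella", b"GNUTELLA/0.6"),               # Gnutella 0.6 handshake response
    ("Web", b"HTTP/1."),                         # coarse marker for HTTP requests and responses
]


def classify_payload(payload: bytes) -> str:
    """Return the first application whose signature occurs anywhere in the payload."""
    for application, pattern in SIGNATURES:
        # Search the whole payload, since a signature may not sit at offset 0.
        if pattern in payload:
            return application
    return "Unknown"


if __name__ == "__main__":
    print(classify_payload(b"\x13BitTorrent protocol" + bytes(8)))
    print(classify_payload(b"GET / HTTP/1.1\r\nHost: example.org\r\n\r\n"))
```

Because the sketch searches the entire payload rather than a fixed prefix, it only works on captures that retain the application-layer bytes, which is the storage cost discussed in this section.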
The signature strings for some P2P applications (e.g., Gnutella) can be buried deep inside a packet [6]; therefore, successful string matching requires full-packet payloads. This poses another challenge: the huge storage space required for full-packet trace collection from a high-speed Internet connection for an extended interval (e.g., a day or a week). For our work, we used noncontiguous one-hour traces collected between April 6 and April 30, 2006. The traces were collected each morning (9-10 am) and evening (9-10 pm) on Thursday through Sunday every week (i.e., eight one-hour traces per week). Although discontinuous traces limit the analysis of long-term traffic behavior, we expect the traces to capture morning/evening and weekday/weekend trends. Our methodology also captured behavioral aspects related to the academic calendar.

2.2 Trace Summary

The traces contain 1.12 billion IP packets totalling 639.4 Gigabytes (GB) of data. In this paper, attention is restricted to only TCP/IP packets because these account for 84.4% of the total packets and 92% of the total bytes in the traces. Furthermore, Web and P2P applications such as Gnutella and BitTorrent use TCP in most cases. In total, we consider 23.3 million TCP flows with 946 million IP packets and 588.3 GB of data.

Table 3 shows the breakdown by application type. Web and P2P dominate in terms of bytes. Although P2P accounts for only 2.8% of the total flows, it accounts for 33.1% of the total bytes. The Unknown category includes HTTPS (port 443), flows without payloads, and flows unclassified by Bro. The Others category bundles together the remaining traffic; the main contributors (by bytes) are email (5%), file transfer (3%), and streaming (2%) applications.

Table 3: Flow and byte count by applications

Application         Flows   % Flows   Bytes (GB)   % Bytes
Web             9,213,424     39.51       204.32     34.73
P2P               646,082      2.77       194.96     33.14
Unknown         9,275,013     39.77        68.42     11.62
Others          4,186,232     17.95       120.61     20.51
Total          23,320,751    100.00       588.31    100.00

Table 4: Flow and byte count for P2P

P2P System        Flows   % Flows   Bytes (GB)   % Bytes
Gnutella        137,024     21.21       151.51     77.71
BitTorrent      393,641     60.93        31.88     16.36
eDonkey          79,796     12.35         2.64      1.35
Other-P2P        35,621      5.51         8.93      4.58
Total           646,082    100.00       194.96    100.00

Table 4 categorizes the P2P flows present in our traces by P2P application type. There are approximately 646,000 P2P flows; these account for nearly 195 GB of traffic data. From the table, we notice that BitTorrent has a lower byte-to-flow ratio than Gnutella. Table 4 also reveals that although eDonkey accounts for many P2P flows, the cumulative traffic volume in bytes was relatively small. The Other-P2P category consists of P2P applications that each contributed less than 1% of the identified P2P flows.

2.3 Characterization Metrics

We consider three flow-level characterization metrics:

Flow Size - the total bytes transferred during a TCP flow. Flows can be categorized as mice [25], buffalo [22] and elephants [13]. We label flows as mice if they transfer less than 10 Kilobytes (KB), and as elephants if they transfer more than 5 Megabytes (MB) of data. The rest are labeled as buffalo.

Flow Duration - the time between the start and the end of a TCP flow.

Flow Inter-arrival Time - the time interval between two consecutive flow arrivals.

We consider three host-level characterization metrics:

Flow Concurrency - the maximum number of TCP flows a single host uses concurrently to transfer content.

Transfer Volume - the total bytes transferred to and from a host during its activity period. Upstream transfer volume is measured as the total bytes transmitted from an internal host to the external hosts. Downstream transfer volume is the total bytes received by an internal host from the hosts external to the network.

Geographic Distribution - the distribution of the shortest distance between individual hosts and our campus along the surface of the Earth. This distance measure is known as the great-circle distance (footnote 3).

Footnote 3: http://en.wikipedia.org/wiki/Great-circle_distance

2.4 Statistical Measures and Models

We use statistical measurements such as mean, median, standard deviation, inter-quartile range (IQR), and skewness to summarize trends of the sample data. Where necessary, we also use the probability density function (PDF), cumulative distribution function (CDF), and complementary CDF (CCDF) of the sample data to obtain further insights. References to the "tail" of the CCDF refer to those values in the upper 10% of the empirical distribution; the remaining 90% of the distribution is referred to as the body. CCDF tails are often studied to determine how quickly or slowly they decay. A distribution where the tail decays more slowly than an exponential distribution is called long-tailed. A distribution is heavy-tailed if the tail asymptotically follows a hyperbolic shape (i.e., shape parameter $0 < \alpha < 2$).

We present statistical models that capture the salient features seen in our data sets. We use the following distributional models: Pareto (CDF: $1 - (\beta/x)^{\alpha}$), Weibull (CDF: $1 - e^{-(x/\beta)^{\alpha}}$), and Lognormal (CDF: $\Phi\left(\frac{\ln x - \mu}{\sigma}\right)$), where $\alpha$ and $\beta$ are shape and scale parameters, $\mu$ and $\sigma$ are mean and standard deviation of the distribution, and $\Phi$ is the Laplace Integral; we also present models that are hybrids of the aforementioned distributions, where the model thresholds were determined manually such that the hybrid distribution passed a goodness-of-fit test.

We tested the statistical models for accuracy using the Kolmogorov-Smirnov (K-S) goodness-of-fit test. If the statistical model passed the K-S test at the 5% significance level, we considered it to model our empirical data well (footnote 4). Only these models are presented in the paper.

Footnote 4: We validated the models using a distribution fitting tool called EasyFit: http://www.mathwave.com/products/easyfit.html

3. FLOW-LEVEL CHARACTERIZATION

In order to conduct realistic network simulations, models of flow size, inter-arrival time, and duration are needed. In this section, we present our flow-level characterization results and derive distributional models from the characterization results. Summary statistics for Web and P2P traffic are presented in Table 5. The corresponding statistics for Gnutella and BitTorrent are shown in Table 6.

Table 5: Flow-level summary statistics of Web and P2P

Web
Characteristic                 Mean   Median   Std. Dev.      IQR   Skewness
Flow size (KB)                21.50     2.53      341.92     7.38      44.03
Flow inter-arrival (sec)       0.11    0.007        3.53    0.016      26.05
Flow duration (sec)           13.32     0.40       56.71     1.80      14.48

P2P
Characteristic                 Mean   Median   Std. Dev.      IQR   Skewness
Flow size (KB)               362.40     1.17       12470     1.89     192.13
Flow inter-arrival (sec)       1.77     0.18       17.21     0.39      48.69
Flow duration (sec)          123.54    24.80      274.37    93.30       7.61

Table 6: Flow-level summary statistics of Gnutella and BitTorrent

Gnutella
Characteristic                 Mean   Median   Std. Dev.      IQR   Skewness
Flow size (KB)              1159.40     1.89       15549     2.73      94.68
Flow inter-arrival (sec)       2.30     0.21       22.22     0.51      30.15
Flow duration (sec)           89.35     9.70      386.22    25.60       8.12

BitTorrent
Characteristic                 Mean   Median   Std. Dev.      IQR   Skewness
Flow size (KB)                84.95     0.96       11189     2.10     292.31
Flow inter-arrival (sec)       2.46     0.42       20.25     0.99      49.78
Flow duration (sec)          135.43    33.20      221.41   180.90       3.03

3.1 Flow Size

3.1.1 Web and P2P Flow Sizes

Table 5 shows that P2P flows have a higher mean flow size and lower median flow size than Web flows. These observations suggest that P2P applications generate many small and many very large-sized flows compared to Web. The CDF of Web and P2P flow sizes in Figure 1(a) corroborates the aforementioned observation. The preponderance of small-sized P2P flows is somewhat unexpected as P2P applications are typically used to share large audio and video files. There are at least three sources of small-sized flows: extensive signalling, aborted transfers, and connection attempts with non-responsive peers. We also find some very large-sized P2P flows. These few P2P flows are much larger than the occasional large Web transfer. Our analysis indicates that P2P applications contribute many mice and elephant flows, and possibly alter the mix of these flow types in today's IP networks. We elaborate on this phenomenon in Section 3.1.3.

[Figure 1: CDF of flow sizes. (a) Web and P2P; (b) Gnutella and BitTorrent. Axes: P[X<=x] versus log10(Flow Size in KB); empirical and model curves.]

We examined the tails of the flow size distributions using CCDF plots. Figure 2(a) presents the CCDF of flow sizes for Web and P2P. In the body of the distribution, P2P flows are smaller than Web flows, but in the tail (specifically, the upper 3.5% of flows after the "crossover" point) P2P flows are larger than Web flows. Also, the tail of the Web flow size distribution decays more quickly than the corresponding P2P distribution. These observations provide further evidence of P2P's large elephant-sized flows.

[Figure 2: CCDF of flow sizes. (a) Web and P2P; (b) Gnutella and BitTorrent. Axes: log10(P[X>x]) versus log10(Flow Size in KB); empirical and model curves.]

3.1.2 Gnutella and BitTorrent Flow Sizes

Table 6 indicates that Gnutella flow sizes are larger and more dispersed than BitTorrent flow sizes. The empirical CDF for the two P2P variants in Figure 1(b) shows that both applications generate a similar percentage of small-sized flows (e.g., 5 KB or less). Many of these smaller flows are the result of control information exchanged between peers, which is a byproduct of the distributed nature of P2P protocols. The ratio of large-sized to total flows for BitTorrent is, however, less than that for Gnutella. For example, approximately 5% of BitTorrent flows are larger than 10 KB, whereas 17% of Gnutella flows exceed this size. The characteristics of these large-sized flows are analyzed next.

Figure 2(b) shows the CCDF of flow sizes of Gnutella and BitTorrent applications. Gnutella appears to generate more large-sized flows than BitTorrent. BitTorrent uses file segmentation to split an object into multiple equal-sized "pieces" (256 KB each by default), and downloads these pieces from either the same or different peers using parallel flows. In contrast, Gnutella typically downloads the entire object from a single peer. As a result, we observe fewer large flows in BitTorrent than Gnutella.

3.1.3 Mice and Elephant Phenomenon

Table 7 shows the percentage of mice and elephant flows among the total flows contributed by different applications.

Table 7: Mice and elephant flow breakdown

Application   Mice % Flows   Mice % Bytes   Elephants % Flows   Elephants % Bytes
Web                  75.78           8.89                0.04               15.35
P2P                  92.93           0.47                0.81               93.43
Gnutella             83.41           0.14                3.05               93.14
BitTorrent           94.96           1.94                0.08               94.87

We observe that both categories of application generate many mice flows. Although the mice flows originating from Web applications are less prevalent than those from P2P applications, Web mice flows account for a relatively higher proportion of the total Web bytes than P2P mice flows account for the total P2P bytes. For example, approximately 9% of total Web bytes are from Web mice flows, whereas only 0.4% of total P2P bytes are transferred by P2P mice flows.

Both applications generate a small proportion of elephant flows. Nevertheless, these few elephant flows contribute a significant fraction of the total bytes; the elephant-sized Web flows contributed about 15% of the total Web-generated bytes, while the elephant-sized P2P flows contributed as much as 93% of the total P2P bytes. Network operators may be interested in bandwidth-limiting these long-duration "elephant" flows, or may be interested in assigning these flows lower priority. As P2P applications become more popular, we can expect networks to carry increasingly more elephant flows. Our results also indicate that P2P elephant flows are significantly larger than Web elephant flows.

We next analyze mice and elephant flows generated by Gnutella and BitTorrent. While both P2P applications have a similar proportion of mice flows, the BitTorrent mice flows account for a much higher percentage of byte transfers than Gnutella mice flows; that is, Gnutella mice flows are smaller, on average, than BitTorrent mice flows. As mentioned earlier, signalling between peers is a major contributor to the pool of P2P mice flows. Our data suggests that BitTorrent applications have more intense signaling activities compared to Gnutella, resulting in relatively larger mice flows.

In our data, Gnutella has a much higher percentage of elephant flows than BitTorrent, even though both Gnutella and BitTorrent elephant flows account for a comparable proportion of byte transfers. Thus, on average, BitTorrent elephant flows are larger than Gnutella elephant flows. We believe that the type of files exchanged using these P2P systems can provide an explanation for our observation. A 2005 study by CacheLogic (footnote 5) showed that a majority of Gnutella users shared mostly audio files (70%), whereas BitTorrent users shared more video files (47%). Video files are, on average, significantly larger than audio files. We believe that the extremely large BitTorrent flows are due to the transfer of multiple pieces of large video files over a single TCP flow.

Footnote 5: CacheLogic. Peer-to-Peer File Type Study, http://www.cachelogic.com/home/pages/research/filetypestudy.php

3.1.4 Flow Size Models

In this section, we present statistical models that describe the body and the tail of the flow size (S) distribution. These models may be used to generate transfer sizes of TCP flows in network simulations. Figures 1 and 2 plot the statistical models in addition to the empirical distributions. Web flow sizes are well-modeled by a concatenation of bounded Weibull and Pareto distributions:

\[
F_{Web}(S) =
\begin{cases}
1 - e^{-(S/2.7)^{0.38}} & : S < 30\ \mathrm{KB} \\
1 - (3/S)^{1.05} & : 30\ \mathrm{KB} \le S \le 5\ \mathrm{MB} \\
1 - (200/S)^{2.35} & : S > 5\ \mathrm{MB}
\end{cases}
\]

We find that the tail of the Web flow size distribution is a mix of heavy-tailed and long-tailed distributions. Similarly, we find that P2P flow sizes are well-modeled by hybrid bounded Weibull and Pareto distributions:

\[
F_{P2P}(S) =
\begin{cases}
1 - e^{-(S/1.36)^{0.81}} & : S < 4\ \mathrm{KB} \\
1 - (0.005/S)^{0.35} & : 4\ \mathrm{KB} \le S \le 10\ \mathrm{MB} \\
1 - (400/S)^{1.42} & : S > 10\ \mathrm{MB}
\end{cases}
\]

From the above-mentioned model, we can conclude that P2P flow sizes are heavy-tailed. Both the BitTorrent and Gnutella flow sizes are well-modeled by combining bounded Lognormal and Pareto distributions:

\[
F_{BT}(S) =
\begin{cases}
\Phi\left(\frac{\ln S - 0.03}{0.95}\right) & : S < 2\ \mathrm{KB} \\
1 - (1.07/S)^{1.4} & : 2\ \mathrm{KB} \le S \le 50\ \mathrm{KB} \\
1 - (3 \times 10^{-9}/S)^{0.25} & : 50\ \mathrm{KB} < S \le 7\ \mathrm{MB} \\
1 - (0.95/S)^{0.78} & : S > 7\ \mathrm{MB}
\end{cases}
\]

\[
F_{Gnu}(S) =
\begin{cases}
\Phi\left(\frac{\ln S - 0.44}{3.07}\right) & : S < 3\ \mathrm{KB} \\
1 - (0.04/S)^{0.3} & : 3\ \mathrm{KB} \le S \le 10\ \mathrm{MB} \\
1 - (1800/S)^{1.61} & : S > 10\ \mathrm{MB}
\end{cases}
\]

We find that both BitTorrent and Gnutella flow size distributions are heavy-tailed; BitTorrent flow sizes, however, are less heavy-tailed than Gnutella flows.

3.2 Flow Inter-arrival Times

3.2.1 Web and P2P Inter-arrival Times

Analysis of our data (see Table 5) shows that P2P flow inter-arrival times (IAT) are much longer and more dispersed than Web flow IAT. Figure 3 shows the CDF and CCDF of flow IAT for Web and P2P. Web flow IAT are much shorter than those of P2P flows. For example, approximately 97% of Web flow IAT are less than 0.1 second, whereas only 25% of P2P flow IAT are this short.

[Figure 3: Web and P2P flow inter-arrival. (a) CDF; (b) CCDF. x-axis: log10(Flow Inter-arrival in seconds); empirical and model curves.]

Another way to understand the difference between the IAT of Web and P2P flows is to study their corresponding flow arrival rates. Web traffic has a higher arrival rate of approximately 80 flows/second, compared to P2P traffic, which has an arrival rate of only 6 flows/second. Another factor contributing to the lower arrival rate and the longer IAT values for P2P flows is the persistent nature of their TCP connections. How these persistent connections are used is discussed in Section 4.1.

We examine the tails of flow IAT for Web and P2P in Figure 3(b). Flow IAT from both applications show similar decay throughout the tails. At the upper tail, we observe sharp decay due to the limited duration of our traces. Flow IAT from individual P2P applications are found to follow similar patterns, and thus are not shown here.

3.2.2 Inter-arrival Time Models

We find that Web flow IAT can be modeled by a two-mode bounded Weibull distribution:

\[
F_{Web}(IAT) =
\begin{cases}
1 - e^{-(IAT/0.01)^{0.76}} & : IAT \le 0.06\ \mathrm{sec} \\
1 - e^{-(IAT/(3 \times 10^{-5}))^{0.15}} & : IAT > 0.06\ \mathrm{sec}
\end{cases}
\]

In contrast, P2P flow IAT are well-modeled by a hybrid Weibull-Pareto distribution:

\[
F_{P2P}(IAT) =
\begin{cases}
1 - e^{-(IAT/0.35)^{0.87}} & : IAT \le 0.1\ \mathrm{sec} \\
1 - e^{-(IAT/0.45)^{0.65}} & : 0.1 < IAT \le 1\ \mathrm{sec} \\
1 - (0.18/IAT)^{0.97} & : IAT > 1\ \mathrm{sec}
\end{cases}
\]

These distribution models indicate that Web IAT are long-tailed, whereas P2P IAT are heavy-tailed. Our models provide evidence of the inapplicability of memoryless Poisson models for Web and P2P flow arrivals [16].

3.3 Flow Duration

3.3.1 Web and P2P Flow Durations

Our statistical analysis (cf. Table 5) indicates the presence of many short-duration flows. Figure 4 shows the CDF of flow durations. From Figure 4(a) we observe that approximately 30% of P2P flows are shorter than 10 seconds in duration. Some of these short-duration transfers are either failed or aborted flows, while other short-duration flows are a byproduct of the P2P applications' signaling behavior. Note that short-duration flows typically transfer a small amount of data, but the converse does not always hold. There are a few long-duration mice flows; these flows arose due to repeated unsuccessful connection attempts by peers. We also observe that a large proportion, approximately 40%, of P2P flow durations are between 20 and 200 seconds. We found that some P2P connections are bandwidth-limited, and thus of long duration. Bandwidth limitations reflect the available bandwidth between peers (e.g., peers with asymmetric Internet access have limited uplink capacity) as well as flow management on our network (cf. Section 6).

[Figure 4: CDF of flow duration. (a) Web and P2P; (b) Gnutella and BitTorrent. Axes: P[X<=x] versus log10(Flow Duration in seconds); empirical and model curves.]

Approximately 70% of the Web flows last no longer than 1 second. End users have excellent Internet connectivity in our campus network, and most Web servers are also well-provisioned. Thus, we expect low response times for Web requests. The remaining Web flows that are longer than 1 second are typically responsible for either downloading large objects (e.g., streaming video from youtube.com) or transferring multiple objects from Web pages using persistent HTTP/1.1 connections.

[Figure 5: CCDF of flow duration. (a) Web and P2P; (b) Gnutella and BitTorrent. Axes: log10(P[X>x]) versus log10(Flow Duration in seconds); empirical and model curves.]

In Figure 5 we analyze the tail of the flow duration distributions. Figure 5(a) shows the CCDF of Web and P2P flow durations. We find that the probability of long-duration flows is higher for P2P than Web.
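The tail comparisons in this section rest on empirical CCDFs plotted on log-log axes. As a rough sketch of how such points could be computed from a list of flow durations (the `empirical_ccdf` and `log_log` helpers and the sample values are hypothetical and are not taken from the traces):

```python
import math
from typing import List, Sequence, Tuple


def empirical_ccdf(samples: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (x, P[X > x]) for each distinct sample value x, in increasing order."""
    ordered = sorted(samples)
    n = len(ordered)
    points = []
    for i, x in enumerate(ordered):
        # Emit one point per distinct value, at its last occurrence.
        if i + 1 == n or ordered[i + 1] != x:
            points.append((x, (n - (i + 1)) / n))
    return points


def log_log(points: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Map CCDF points to log10 axes, dropping points that cannot be logged."""
    return [(math.log10(x), math.log10(p)) for x, p in points if x > 0 and p > 0]


if __name__ == "__main__":
    durations = [0.2, 0.4, 0.4, 1.5, 12.0, 300.0]  # hypothetical durations in seconds
    for x, p in empirical_ccdf(durations):
        print(x, p)
```

The largest sample always has an empirical CCDF of zero, so it is dropped before taking logarithms; this is one reason the plotted tails end one point short of the maximum observed value.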
[Figure 6: CDF of host flow concurrency. (a) Web and P2P; (b) Gnutella and BitTorrent. Axes: P[X<=x] versus log10(Maximum # of Concurrent Flows).]

3.3.2 Gnutella and BitTorrent Flow Durations

Summary statistics in Table 6 show that, on average, BitTorrent flows last longer than Gnutella flows; furthermore, the flow durations are dispersed over a wide range of values. Figure 4(b) shows the CDF of Gnutella and BitTorrent flow durations. This graph reaffirms the aforementioned point. We find that these relatively longer BitTorrent flows result from its protocol architecture. BitTorrent utilizes a rarest-first piece selection policy to exchange data. At any given time, a fixed number of concurrent uploads/downloads are permitted. The BitTorrent architecture allows persistent connections between peers and controls downloads/uploads using its piece selection policy, which results in connections periodically being idle. Furthermore, concurrent download from a single BitTorrent peer splits the bandwidth available at uploaders for downloading. In contrast, Gnutella can use a single flow for downloading an object and thus does not need to share bandwidth. Occasionally, Gnutella peers may share bandwidth, for example, when the same object is requested by other peers or when different objects are requested by the same peer.

Figure 5(b) shows the CCDF of Gnutella and BitTorrent flow duration. Two observations can be drawn. First, before the crossover point, BitTorrent shows a higher percentage of long-duration flows than Gnutella; however, following the crossover point (upper 2% of flows), the probability of long-duration flows in Gnutella is higher than that in BitTorrent. Second, at the distribution tail, BitTorrent flow durations decay more quickly than Gnutella flow durations. We found earlier that extremely large transfers are not very common in BitTorrent, due to its file segmentation feature. We also found a positive correlation (correlation coefficient is 0.69) between BitTorrent flow size and duration, and therefore, observe a lower proportion of extremely long-duration flows in BitTorrent. Other factors such as file size, swarm population, and availability of pieces in the swarm can also influence the duration of BitTorrent flows. These factors result in the BitTorrent tail being long-tailed instead of heavy-tailed.

3.3.3 Flow Duration Models

This section outlines the statistical models of flow durations (D) (see Figures 4 and 5). Web flow duration is well-modeled using two bounded Pareto distributions:

\[
F_{Web}(D) =
\begin{cases}
1 - (0.1/D)^{0.43} & : D \le 60\ \mathrm{sec} \\
1 - (10/D)^{1.5} & : D > 60\ \mathrm{sec}
\end{cases}
\]

The preceding model shows that Web flow durations are heavy-tailed. A similar analysis shows that P2P flow durations can be well-modeled by a concatenation of bounded Weibull and heavy-tailed Pareto distributions:

\[
F_{P2P}(D) =
\begin{cases}
1 - e^{-(D/88.3)^{0.35}} & : D < 20\ \mathrm{sec} \\
1 - e^{-(D/57.2)^{0.55}} & : 20 \le D \le 300\ \mathrm{sec} \\
1 - (65/D)^{1.53} & : D > 300\ \mathrm{sec}
\end{cases}
\]

BitTorrent flow durations are well-modeled by hybrid bounded Weibull and Pareto distributions, whereas Gnutella flow durations are well-modeled by hybrid bounded Lognormal and Pareto distributions:

\[
F_{BT}(D) =
\begin{cases}
1 - e^{-(D/83.5)^{0.48}} & : D \le 300\ \mathrm{sec} \\
1 - (200/D)^{3} & : D > 300\ \mathrm{sec}
\end{cases}
\]

\[
F_{Gnu}(D) =
\begin{cases}
\Phi\left(\frac{\ln D - 2.1}{2.7}\right) & : D \le 10\ \mathrm{sec} \\
1 - (5/D)^{0.73} & : D > 10\ \mathrm{sec}
\end{cases}
\]

The above-mentioned statistical distributions show that BitTorrent flow durations are long-tailed (tail fits a Pareto distribution with $\alpha > 2$) but not heavy-tailed. In contrast, Gnutella flow durations are heavy-tailed.

4. HOST-LEVEL CHARACTERIZATION

This section presents a host-level characterization of Web and P2P traffic. This characterization provides information to network administrators for tasks such as bandwidth management and capacity planning, and also provides insights into the functioning of modern P2P systems. The results presented here may also be used to develop synthetic workloads and design realistic network simulations.
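As one hedged illustration of how distributional models could feed a synthetic workload generator, the sketch below draws flow durations from the two-part Web flow duration model of Section 3.3.3 (a Pareto with scale 0.1 and shape 0.43 up to 60 seconds, and a Pareto with scale 10 and shape 1.5 beyond). The weighting scheme is an assumption of this sketch rather than something stated in the paper: because the two fitted pieces do not meet exactly at 60 seconds, each piece is weighted by the probability mass its own formula assigns to its range, and the weights are renormalized. All function names are hypothetical.

```python
import random

THRESHOLD = 60.0      # seconds; boundary between body and tail
BODY = (0.1, 0.43)    # (scale, shape) of the body Pareto, D <= 60 s
TAIL = (10.0, 1.5)    # (scale, shape) of the tail Pareto, D > 60 s


def pareto_ccdf(x, scale, shape):
    """P[X > x] = (scale / x) ** shape, for x >= scale."""
    return (scale / x) ** shape


def pareto_inverse_ccdf(p, scale, shape):
    """Return x such that P[X > x] = p, for 0 < p <= 1."""
    return scale * p ** (-1.0 / shape)


def sample_web_duration(rng):
    """Draw one flow duration (seconds) by per-segment inverse transform."""
    body_hi = pareto_ccdf(THRESHOLD, *BODY)    # body CCDF at 60 s
    body_mass = 1.0 - body_hi                  # mass the body puts on [0.1, 60]
    tail_mass = pareto_ccdf(THRESHOLD, *TAIL)  # mass the tail puts on (60, inf)
    # Assumption of this sketch: renormalize the two masses to pick a segment.
    if rng.random() * (body_mass + tail_mass) < body_mass:
        p = 1.0 - rng.random() * body_mass     # CCDF value in (body_hi, 1]
        return pareto_inverse_ccdf(p, *BODY)
    p = tail_mass * (1.0 - rng.random())       # CCDF value in (0, tail_mass]
    return pareto_inverse_ccdf(p, *TAIL)


if __name__ == "__main__":
    rng = random.Random(2008)
    print([round(sample_web_duration(rng), 2) for _ in range(5)])
```

Other ways of reconciling the mismatch at the threshold are possible (for example, rescaling the tail so the CDF is continuous); the choice affects only a small fraction of the samples here, since the two pieces differ by well under one percentage point at 60 seconds.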
Figure 7: Flow concurrency vs distinct IPs, panels (a) Web and (b) P2P (x-axis: log10 (Maximum # of Concurrent Flows); y-axis: log10 (Distinct # of IPs))

Figure 8: CDF of transfer volume for P2P and Web (x-axis: log10 (Volume Transfer in KB); y-axis: P[X <= x])

Table 8: Fair-share ratio in P2P systems

4.1 Flow Concurrency

Figure 6 shows the CDF of host flow concurrency for Web, P2P, Gnutella, and BitTorrent. From Figure 6(a), we observe (surprisingly) that many P2P hosts in our network maintain only a single TCP connection. We explain this observation later in this section by analyzing flow concurrency for individual P2P applications.

While analyzing the flow concurrency for Web hosts, we ignore the Web servers internal to our network. From the analysis, we find that a significant proportion of the internal Web hosts maintain more than one concurrent TCP connection. Web browsers often initiate multiple concurrent connections to transfer content in parallel. This parallel download feature increases the degree of flow concurrency in HTTP-based applications. However, a high degree of flow concurrency (e.g., above 30) is not typically observed for general Web clients; rather, Web proxies and content distribution nodes account for this high degree of flow concurrency.

The CDF of host flow concurrency for Gnutella and BitTorrent is shown in Figure 6(b). We observe that most Gnutella hosts connect with only one host at a time. As discussed earlier, Gnutella applications typically download a whole object from another Gnutella host using a single TCP flow. We observed a few Gnutella hosts that maintained more than 10 concurrent TCP connections. These hosts likely acted as "super peers" in Gnutella's peer hierarchy. In contrast, most BitTorrent hosts exhibit a high degree of flow concurrency.
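One way to compute the per-host flow concurrency discussed in this section is a sweep over flow start and end events. The sketch below is an illustration only, not the procedure used in this study; the (host, start, end) record layout and the function name are assumptions.

```python
from collections import defaultdict


def max_concurrency(flows):
    """Return {host: maximum number of simultaneously open flows}.

    flows is an iterable of (host, start_time, end_time) tuples
    (hypothetical record layout).
    """
    events = defaultdict(list)
    for host, start, end in flows:
        events[host].append((start, 1))   # flow opens
        events[host].append((end, -1))    # flow closes

    peaks = {}
    for host, host_events in events.items():
        # At equal timestamps, (t, -1) sorts before (t, +1), so a flow that
        # ends exactly when another starts is not counted as overlapping.
        host_events.sort()
        current = peak = 0
        for _, delta in host_events:
            current += delta
            peak = max(peak, current)
        peaks[host] = peak
    return peaks


if __name__ == "__main__":
    sample = [("a", 0, 10), ("a", 5, 15), ("a", 10, 20), ("b", 0, 1)]
    print(max_concurrency(sample))
```

Treating back-to-back flows as non-overlapping is a design choice; counting them as concurrent would only require sorting open events ahead of close events at equal timestamps.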
Approximately 24% of the BitTorrent hosts use more than 100 concurrent flows. This high degree of concurrency is a natural occurrence in BitTorrent. BitTorrent clients obtain a peer list from a tracker, and then attempt to connect with these peers. Once connections are established, BitTorrent uses its rarest-first piece selection policy and tit-for-tat fairness mechanisms to determine how pieces are shared [3]. Typically, only a small number of these concurrent connections actively transfer file pieces.

We also study the correlation between the maximum number of concurrent flows seen at a host and the number of distinct hosts connected at that time. Figure 7 shows scatter plots of flow concurrency versus distinct hosts for Web and P2P hosts. (The plots for Gnutella and BitTorrent are similar to that of P2P, and thus not shown here.) From Figure 7(a) we observe that most of the points are well below the diagonal. In other words, the number of concurrent Web flows far exceeds the number of Web hosts concurrently contacted. From Figure 7(b), we observe that P2P hosts use concurrent flows to connect to many distinct hosts, as illustrated by the concentration of points along the diagonal. This behavior is not unexpected, since P2P protocols such as BitTorrent and eDonkey encourage connectivity with multiple hosts to facilitate widespread sharing of data.

Downstream (MB)    Minimum Fair-share Ratio
< 1                none
1 - 20             0.01
20 - 40            0.25
40 - 60            0.35
60 - 80            0.45
80 - 100           0.55
> 100              0.65

4.2 Transfer Volume

This section studies the transfer activity of hosts in terms of their transfer volume. Figure 8 shows the CDF of the transfer volume for Web and P2P hosts. We observe that approximately half of the distinct P2P and Web hosts transfer small amounts of data (e.g., less than 1 MB); these hosts are typically active for less than 100 seconds. We find that these P2P hosts repeatedly yet unsuccessfully attempt to connect with serving peers.
Connection requests are unsuccessful for a variety of reasons, including insufficient resources or no useful content at the contacted peers. In contrast, Web transfers in this region result from Web browsing, widgets that retrieve information from the Web periodically (e.g., weather updates, stock prices), and downloading small files. We find that approximately 35% of Web hosts and 15% of P2P hosts transfer data ranging from 1 to 10 MB, and are active mostly for 100 to 1000 seconds. These P2P host transfers are due to sharing small objects, whereas these Web host transfers are due to prolonged Web browsing, downloading software/multimedia files, and HTTP-based streaming. The proportion of hosts that transfer large amounts of data (e.g., 10 MB or more) and are active for over 1000 seconds is significantly higher in P2P than in Web.

4.2.1 Transfer Symmetry in P2P Systems

Transfer symmetry is a major concern for P2P system developers, who want to encourage fair sharing among participating peers. Many content sharing portals require that users maintain a minimum ratio of upstream to downstream transfer volume, which we refer to as the minimum fair-share ratio. Table 8 shows the minimum ratios of fair-sharing we defined for different levels of downstream traffic. Note that hosts transferring less than 1 MB of data in total are not sharing any content and thus are excluded from our transfer symmetry calculation. In most cases, we used equal-sized bins to assign minimum fair-share ratios; however, for above 100 MB of data transfer, we used a single bin, as only 10% of P2P hosts fall in this category.

We divide P2P hosts into three categories (freeloaders, fair-share, and benefactors) according to their transfer ratios (i.e., upstream/downstream ratios) and corresponding minimum fair-share ratios from Table 8. We define freeloaders as those hosts that have a transfer ratio less than the minimum fair-share ratio. Benefactors are hosts that have a transfer ratio of 2 or greater. The remaining hosts are in the "fair-share" range.

Table 9: Transfer symmetry in P2P systems

System        Freeloader    Fair-share    Benefactor
Gnutella      56.93%        10.00%        33.07%
BitTorrent    10.30%        39.91%        49.79%

Table 9 shows the percentage of Gnutella and BitTorrent hosts that are freeloaders, fair-share hosts, and benefactors. We find that approximately 10% of BitTorrent hosts are acting as freeloaders, whereas 57% of Gnutella hosts are freeloaders. Benefactors are common among both BitTorrent (50%) and Gnutella (33%) hosts. Therefore, Gnutella host behavior appears to be dominated by extreme downstream and upstream transfers. We find that approximately 40% of BitTorrent peers and 10% of Gnutella peers reside in the fair-share zone. BitTorrent introduced a "tit-for-tat" mechanism to encourage fair sharing among the peers [3]. Every peer in the BitTorrent system is encouraged to upload in order to obtain the opportunity to download. Therefore, we observe more freeloaders in Gnutella and better fairness in BitTorrent.

4.2.2 Heavy-hitters

Figure 9 plots the CDF of hosts ranked by transfer volume (the higher the amount of data transferred, the higher the rank). We find that a few hosts account for much of the volume transferred; we call these hosts heavy-hitters. Figure 9(a) shows that the top 0.1% of Web hosts account for 14% (28 GB) of the total Web transfer. Similarly, the top 0.1% of P2P hosts transfer 12% (24 GB) of the total P2P data. Moreover, the top 1% of Web and P2P hosts account for 70 GB (34%) and 82 GB (42%) of the total Web and P2P bytes, respectively. Clearly, heavy-hitters are present in both Web and P2P. Examination of the upstream to downstream transfer ratio for the P2P heavy-hitters shows that most P2P heavy-hitters are either freeloaders or benefactors.

Figure 10 shows the transfer volume of ranked Web and P2P hosts. We observe that the total amount of data transferred by the top 10% of Web and P2P hosts follows a power-law distribution (with \alpha = 0.27); we emphasize that the power-law does not apply to the body and the tail of the ranked distribution. The only difference seen between the applications is the total transfer volume; top-ranked P2P hosts transfer an order of magnitude more data than top-ranked Web hosts.

Figure 9(b) shows the CDF of ranked transfer volume for Gnutella and BitTorrent hosts. We find that the top 1% of BitTorrent hosts transfer 20 GB (60% of total BitTorrent traffic), whereas the top 1% of Gnutella hosts account for 53 GB (35% of total Gnutella traffic). Our data suggests that BitTorrent heavy-hitters account for a much larger fraction of that application's total bytes than Gnutella heavy-hitters do for their total bytes. We also found that the transfer volume of top-ranked Gnutella and BitTorrent hosts did not follow a power-law distribution.

Figure 9: CDF of ranked hosts, panels (a) Web and P2P and (b) Gnutella and BitTorrent (x-axis: log10 (Ranked Hosts %); y-axis: P[X <= x])

Figure 10: Transfer volume of ranked host, P2P and Web (x-axis: log10 (Ranked Hosts %); y-axis: log10 (Volume Transfer in MB); fitted slope \alpha = 0.27)

4.3 Geographic Distribution

This section discusses the geographic distribution of hosts external to the campus network. We calculated the great-circle distance between individual hosts and our campus using a geolocation database (see footnote 6). This database provides the geographic coordinates, country name, and city name for an IP address range. Figure 11 shows the geographical distribution of the external hosts. Note that the plateau between 3,500 and 7,000 kilometers represents the Atlantic Ocean.

Footnote 6: MaxMind: GeoIP City Database, http://www.maxmind.com/app/city

Figure 11: Geographic distribution of hosts, panels (a) Web and P2P and (b) Gnutella and BitTorrent (x-axis: Distance (km); y-axis: Percentage of Hosts; annotated regions: North America, Atlantic, Other Continents)

Figure 11(a) shows the geographical distribution of the external Web and P2P hosts. Most of the external Web hosts, approximately 75%, are in North America; Asia and Europe each account for 10% of the external Web hosts. The results here are not surprising. We know that most of the external Web hosts are Web servers. O'Neill et al. [14] showed that in 1999 and 2002, 49% and 55% of the public Web sites, respectively, were associated with entities located in the United States. In addition, we believe that cultural peculiarities may also affect the results. A majority of our campus Web users are English-speaking, and thus they are more likely to visit Web sites located in predominantly English-speaking countries.

In contrast to the geographic distribution of external Web hosts, we found that approximately 40% of P2P hosts are located in North America, 30% in Europe, 18% in Asia, 6% in Australia, and 5% in South America. This indicates that connectivity between P2P hosts does not rely strongly on host locality; rather, it depends on resource availability during the connection establishment phase. The non-interactive nature of P2P applications makes latency only a secondary concern; the primary goal is to find the requested file. In addition, our results suggest that files being shared using these systems transcend geographic divides.

The host geographical distribution for the P2P variants is shown in Figure 11(b). It shows that the majority of external Gnutella hosts (70%) are from North America. Approximately 18% of the Gnutella hosts are located in Europe and the rest are in Asia (6%), Australia (2.3%), and South America (2.3%).
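The great-circle distance used in this section can be computed from two latitude/longitude pairs with the haversine formula. The following is a minimal sketch under a spherical-Earth assumption; the paper does not state which formula was used, and the function name, the mean radius constant, and the coordinate inputs (which would come from the geolocation database lookup, not shown) are assumptions.

```python
import math

# Mean Earth radius in kilometers; an assumption of the spherical model.
EARTH_RADIUS_KM = 6371.0


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two points given in decimal degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2.0) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2.0) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2.0 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


if __name__ == "__main__":
    # A quarter of the equator, for a quick sanity check.
    print(round(great_circle_km(0.0, 0.0, 0.0, 90.0), 1))
```

Binning the resulting distances per external host would then yield a distribution of the kind plotted in Figure 11.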
This suggests that either Gnutella peers prefer to connect with hosts that are in close proximity or that Gnutella clients are widely used in North America for file-sharing. In contrast, only 30% of external BitTorrent hosts are located in North America. Among the rest, approximately 40% of BitTorrent hosts are located in Europe, 18% in Asia, 6% in Australia, and 3% in South America. We know BitTorrent hosts connect to peers from a peer list provided by trackers. We believe that the list from trackers is created based on host bandwidth availability in a swarm and thus, we see a bias towards regions with high broadband penetration. We did observe, however, that although BitTorrent peers connect to other distant peers for obtaining content, most of the successful transfers originate from the peers located in the same geographic region.

Figure 12: Flow size versus flow duration (x-axis: log10 (Flow Duration in seconds); y-axis: log10 (Flow Size in KB); reference line at 56 Kbps)

5. RELATED WORK

Web traffic has been extensively characterized. Many studies concentrate on user-level behavior, such as the size and number of request/response messages, and on Web application-specific properties such as page complexity and document referencing (e.g., [1, 2]). Flow-level properties of Web traffic have also been studied (e.g., [4, 16]). One key observation from prior work is that a Poisson arrival process may not be appropriate for Web flows [4, 16]. Our data reaffirms this observation, and also shows that Poisson models may not be appropriate for modeling P2P flow arrivals.

There are many studies of popular P2P systems in the literature, including Napster [19], KaZaA [8], Gnutella [19, 21, 26], BitTorrent [9, 10, 18], and eDonkey [17, 24]. These studies have focused on different aspects of P2P systems such as query traffic [12], data traffic [8, 21], flow characteristics [17, 24], peer behavior [21], system architecture [9, 10, 18], and system dynamics (e.g., churn) [23, 26]. In this section, we discuss closely related prior work.

Saroiu et al. [19] studied Gnutella and Napster systems using traces collected using crawling techniques. They observed that Gnutella hosts had high-bandwidth, high-latency, and low user-activity periods when compared to Napster hosts. Sen and Wang [21] studied DirectConnect, Gnutella, and FastTrack traces from a large ISP's network. They found that the traffic volume, peer connectivity, and mean bandwidth usage distributions are extremely skewed, which is similar to our observations. Recently, Zhao et al. [26] analyzed traffic from modern Gnutella systems. They observed a significant decrease in free-riders over the past few years. Our results, however, indicate pronounced free-riding in Gnutella. We believe free-riding needs to be further studied.

Guo et al. [9] analyzed and modeled BitTorrent systems based on traces collected from a popular tracker site. They found that swarm popularity decreases exponentially over time, and that the distribution of swarm population is heavily skewed. Pouwelse et al. [18] studied performance, robustness, and content integrity of BitTorrent systems.

Tutschku [24] and Plissonneau et al. [17] analyzed eDonkey traffic observed on the protocol's standard port. Tutschku found that eDonkey flow sizes follow the lognormal distribution, that flow IATs are exponentially distributed, and that eDonkey flows do not appear to alter the mice-elephant mix of flows. Similar to our observations, Plissonneau et al. found that eDonkey systems generate many short-duration flows, have significant unfairness, and do not exploit geographic locality when exchanging data. Plissonneau et al. did not present any traffic models in their work.

Our study complements prior work on Web and P2P traffic analysis. We used recent traces that reflect the emerging traffic trends in a large edge network, and employed application signature matching to identify Web and P2P traffic accurately. We explored the similarities and differences in flow-level and host-level characteristics of Web and P2P flows, and developed models for both types of traffic.

6. DISCUSSION

In this section, we discuss two related issues: identification of P2P traffic and the impact of network traffic management.

Many recent P2P characterization studies (e.g., [17, 21, 24]) have relied on identification by port numbers. Our full-payload packet traces allow us to apply application signature matching to identify P2P traffic that would not have been identified had we relied only on port numbers. We believe that future characterization of P2P traffic should not rely solely on port numbers for identification of this traffic. Because collection of traces with payloads poses unique challenges (e.g., processing cost, longer-term data collection) and such traces are often difficult to obtain, alternative approaches are necessary. For example, recently proposed machine-learning techniques that use only flow statistics (see [6, 7] and the references therein) or heuristics-based techniques [5, 11] that leverage characteristic behavior of P2P applications may be suitable candidates for identifying P2P traffic.

A consequence of increased use of P2P applications is the deployment of bandwidth management solutions in edge networks. Any analysis of network traffic, therefore, needs to be aware of the potential implications of traffic management, as some characteristics of interest such as flow duration and flow concurrency may be affected by flow management. At the University of Calgary, traffic is managed using a commercial packet shaping device.
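To make the idea of application signature matching concrete, the following is a deliberately simplified sketch that inspects the first payload bytes of a flow. It is not the signature set used in this study or by the packet shaper; the prefixes are the widely documented handshake openings of BitTorrent and Gnutella 0.6 and a few common HTTP tokens, and the function and constant names are assumptions.

```python
# Widely documented protocol openings (illustrative, not exhaustive).
BT_HANDSHAKE = b"\x13BitTorrent protocol"            # length byte 19 + protocol string
GNUTELLA_PREFIXES = (b"GNUTELLA CONNECT/", b"GNUTELLA/")
HTTP_PREFIXES = (b"GET ", b"POST ", b"HEAD ", b"HTTP/1.")


def classify_payload(payload: bytes) -> str:
    """Label a flow from the first bytes of its payload (hypothetical helper)."""
    if payload.startswith(BT_HANDSHAKE):
        return "bittorrent"
    if payload.startswith(GNUTELLA_PREFIXES):
        return "gnutella"
    if payload.startswith(HTTP_PREFIXES):
        # Caveat: some P2P transfers are carried over HTTP-style requests,
        # so a prefix check alone can mislabel them as Web.
        return "web"
    return "unknown"


if __name__ == "__main__":
    print(classify_payload(b"\x13BitTorrent protocol" + bytes(8)))
    print(classify_payload(b"GET /index.html HTTP/1.1\r\n"))
```

A flow on a non-standard port is still labeled correctly by such a check, which is the advantage over port-based identification noted above; encrypted or obfuscated payloads would defeat it.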
The packet shaper (to the best of our knowledge) employs a combination of application signatures and port numbers to identify traffic. At the time of trace capture, the network policy in place was to group together all identified P2P flows (except those from the student residences) and collectively limit their bandwidth to 56 Kbps. Figure 12 shows a scatter plot of P2P flow size and duration for our trace. The scatter plot includes a straight line that marks the 56 Kbps boundary; P2P flows (i.e., points) above this line represent an achieved flow throughput exceeding 56 Kbps. We should also note that points below this line do not necessarily imply that the flow's bandwidth was limited by the traffic shaping device. Flow rates may be below this line for other reasons such as multiplexing of flows, flow control, or congestion control mechanisms. The key observation from the plot is that we do not observe a strong positive correlation between flow size and duration. This suggests that some P2P flows are indeed identified and limited by the packet shaping device. Nevertheless, we do see many points above the 56 Kbps threshold; these P2P flows clearly escaped detection by the traffic shaper.

The final comment we make is regarding the representativeness of our observations and models. Our study is based on observations from one vantage point, and on a network that employs some form of bandwidth management. Clearly, there is a need to study traffic from different networks to validate the models we propose and also to develop general models for Web and P2P traffic. Nevertheless, we believe that our results are still useful as they provide a snapshot of Web and P2P traffic characteristics from a large edge network, and thus should be representative of other large edge networks with similar population and network management policies.
In cases where the network differs significantly in design or management policy, our methodology can be applied to develop representative models.

7. CONCLUSIONS

This paper presented an extensive characterization of Web and P2P traffic using full packet traces collected at a large edge network. We considered three flow-level metrics, namely flow size, flow IAT, and flow duration, and three host-level metrics, specifically flow concurrency, transfer volume, and geographic distance.

We observed a number of contrasting features between Web and P2P traffic. Typically, Web flows are short-lived whereas P2P flows are long-lived. Both Web and P2P host transfers are asymmetric; however, P2P host transfers are dominated by either upstream or downstream traffic, but not both. Web hosts maintain a high degree of flow concurrency, whereas many P2P hosts maintain a single flow at a time. Finally, P2P traffic exacerbates the "mice and elephants" phenomenon in Internet traffic. Flow-level distributional models were developed for Web and P2P traffic; these models can be used in network simulation and emulation experiments.

We believe much work remains. Traffic from other networks should be studied to facilitate development of general models for Web and P2P traffic. Similarly, traffic from other non-Web applications, for example P2P streaming applications such as PPLive, P2P VoIP, and other P2P applications, should be examined, and their impact on Web-based applications studied.

Acknowledgements

The authors thank Jeffrey Erman, the anonymous WWW reviewers, iCORE, and NSERC.

8. REFERENCES

[1] M. Arlitt and C. Williamson. Internet Web Servers: Workload Characterization and Performance Implications. ToN, 1997.
[2] F. Campos, K. Jeffay, and F. Smith. Tracking the Evolution of Web Traffic: 1995-2003. In MASCOTS, 2003.
[3] B. Cohen. Incentives Build Robustness in BitTorrent. In P2PECON, 2003.
[4] M. Crovella and A. Bestavros. Self-Similarity in World Wide Web Traffic: Evidence and Possible Causes. ToN, 1997.
[5] T. Dang, M. Perenyi, A. Gefferth, and S. Molnar. On the Identification and Analysis of P2P Traffic Aggregation. In Networking, 2006.
[6] J. Erman, A. Mahanti, M. Arlitt, I. Cohen, and C. Williamson. Offline/Realtime Traffic Classification Using Semi-Supervised Learning. In Performance, 2007.
[7] J. Erman, A. Mahanti, M. Arlitt, and C. Williamson. Identifying and Discriminating Between Web and Peer-to-Peer Traffic in the Network Core. In WWW, 2007.
[8] K. Gummadi, R. Dunn, S. Saroiu, S. Gribble, H. Levy, and J. Zahorjan. Measurement, Modeling, and Analysis of a Peer-to-Peer File-Sharing Workload. In SOSP, 2003.
[9] L. Guo, S. Chen, Z. Xiao, E. Tan, X. Ding, and X. Zhang. Measurements, Analysis, and Modeling of BitTorrent-like Systems. In IMC, 2005.
[10] M. Izal, G. Urvoy-Keller, E. Biersack, P. Felber, A. Hamra, and L. Garcés-Erice. Dissecting BitTorrent: Five Months in a Torrent's Lifetime. In PAM, 2004.
[11] T. Karagiannis, K. Papagiannaki, and M. Faloutsos. BLINC: Multilevel Traffic Classification in the Dark. In SIGCOMM, 2005.
[12] A. Klemm, C. Lindemann, M. K. Vernon, and O. P. Waldhorst. Characterizing the Query Behavior in Peer-to-Peer File Sharing Systems. In IMC, 2004.
[13] T. Mori, M. Uchida, R. Kawahara, J. Pan, and S. Goto. Identifying Elephant Flows through Periodically Sampled Packets. In IMC, 2004.
[14] E. O'Neill, B. Lavoie, and R. Bennett. Trends in the Evolution of the Public Web: 1998 - 2002. D-Lib Mag., 2003.
[15] V. Paxson. Bro: A System for Detecting Network Intruders in Real-Time. Com. Net., 1999.
[16] V. Paxson and S. Floyd. Wide-Area Traffic: The Failure of Poisson Modeling. ToN, 1995.
[17] L. Plissonneau, J. Costeux, and P. Brown. Analysis of Peer-to-Peer Traffic on ADSL. In PAM, 2005.
[18] J. Pouwelse, P. Garbacki, D. Epema, and H. Sips. The Bittorrent P2P File-sharing System: Measurements and Analysis. In IPTPS, 2005.
[19] S. Saroiu, P. Gummadi, and S. Gribble. Measuring and Analyzing the Characteristics of Napster and Gnutella Hosts. Multi. Sys., 2003.
[20] S. Sen, O. Spatscheck, and D. Wang. Accurate, Scalable In-Network Identification of P2P Traffic. In WWW, 2004.
[21] S. Sen and J. Wang. Analyzing Peer-to-Peer Traffic Across Large Networks. ToN, 2004.
[22] A. Soule, K. Salamatian, N. Taft, R. Emilion, and K. Papagiannaki. Flow Classification by Histograms: or How to go on Safari in the Internet. In SIGMETRICS, 2005.
[23] D. Stutzbach and R. Rejaie. Understanding Churn in Peer-to-Peer Networks. In IMC, 2006.
[24] K. Tutschku. A Measurement-Based Traffic Profile of the eDonkey Filesharing Service. In PAM, 2004.
[25] Z. Zhang, V. Ribeiro, S. Moon, and C. Diot. Small-time Scaling Behaviors of Internet Backbone Traffic: An Empirical Study. In INFOCOM, 2003.
[26] S. Zhao, D. Stutzbach, and R. Rejaie. Characterizing Files in the Modern Gnutella Network: A Measurement Study. In MMCN, 2006.