Understanding Internet Video Sharing Site Workload: A View from Data Center Design

Xiaozhu Kang (Columbia University, xk2001@columbia.edu), Hui Zhang (NEC Laboratories America, huizhang@nec-labs.com), Xiaoqiao Meng (NEC Laboratories America, xqmeng@nec-labs.com), Guofei Jiang (NEC Laboratories America, gfj@nec-labs.com), Haifeng Chen (NEC Laboratories America, haifeng@nec-labs.com), Kenji Yoshihira (NEC Laboratories America, kenji@nec-labs.com)

ABSTRACT
In this paper we measure and analyze the workload on Yahoo! Video, the 2nd largest U.S. video sharing site, to understand its nature and its impact on online video data center design. We discover interesting statistical properties along both the static and temporal dimensions of the workload, including file duration and popularity distributions, arrival rate dynamics and predictability, and workload stationarity and burstiness. Complemented with queueing-theoretic techniques, we further extend our understanding of the measurement data through a virtual design of the workload and capacity management components of a data center serving the measured workload, which reveals key results on how Service Level Agreements (SLAs) and workload scheduling schemes affect the design and operation of such large-scale video distribution systems.

Categories and Subject Descriptors
C.2.0 [Computer-Communication Networks]: General

General Terms
Measurement, Design, Performance

Keywords
Measurement, Data Center Design, Online Video, Workload Management, Capacity Planning, SLA, Queueing Model

Copyright is held by the author/owner(s). WWW 2008, April 21-25, 2008, Beijing, China. ACM 978-1-60558-085-2/08/04.

1. INTRODUCTION
Internet video sharing web sites such as YouTube [1] have attracted millions of users at a dazzling speed over the past few years. Massive workload accompanies these web sites along with their business success. To understand the nature of this unprecedented massive workload and its impact on online video data center design, in this paper we analyze Yahoo! Video, the 2nd largest U.S. video sharing site. The main contribution of our work is an extensive trace-driven analysis of Yahoo! Video workload dynamics. We crawled all 16 categories on the Yahoo! Video site for 46 days (from July 17 to August 31, 2007), collecting data every 30 minutes. This measurement rate was chosen as a tradeoff between analysis requirements and resource constraints. Due to the massive scale of the Yahoo! Video site, we limited the data collection to the first 10 pages of each category. Since each page contains 10 video objects, each measurement round collects dynamic workload information for 1,600 video files in total. Over the whole collection period, we recorded 9,986 unique videos and a total of 32,064,496 video views. This translates into a daily video request rate of 697,064, and gives approximately 5.54% coverage of the total Yahoo! Video workload in July 2007, based on [2].

2. WORKLOAD STATISTICS

2.1 Static Properties
Video Duration: We recorded 9,986 unique videos in total, with durations ranging from 2 to 7,518 seconds. Among them, 76.3% are shorter than 5 minutes, 91.82% are shorter than 10 minutes, and 97.66% are shorter than 25 minutes. The mean video duration is 283.46 seconds, and the median is 159 seconds.
File Popularity: File popularity is defined as the distribution of stream requests over video files during a measurement interval. We performed goodness-of-fit tests with several distribution models, and found that a Zipf distribution with an exponential cutoff fits the file popularity best, and fits it well, at four time scales: 30 minutes, 1 hour, 1 day, and 1 week.
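The paper reports the fitted family but not the fitting procedure or parameter values. As a minimal sketch of how such a fit could be reproduced, the snippet below fits rank-ordered request counts from one measurement interval to the standard Zipf-with-exponential-cutoff form c * r^(-a) * exp(-r / b) using SciPy; the function names, the log-space least-squares formulation, and the starting values are illustrative assumptions, not the authors' method.

```python
import numpy as np
from scipy.optimize import curve_fit

def zipf_with_cutoff(rank, c, a, b):
    """Zipf law with an exponential cutoff: c * rank**(-a) * exp(-rank / b)."""
    return c * np.power(rank, -a) * np.exp(-rank / b)

def fit_popularity(request_counts):
    """Fit rank-ordered request counts to a Zipf-with-exponential-cutoff model.

    request_counts: per-file request counts from one measurement interval
    (e.g., 30 minutes), in any order.  Returns the fitted (c, a, b).
    """
    counts = np.sort(np.asarray(request_counts, dtype=float))[::-1]
    counts = counts[counts > 0]          # files with zero requests carry no rank signal
    ranks = np.arange(1, len(counts) + 1, dtype=float)
    # Fit in log space so the long tail of rarely requested files is not ignored.
    params, _ = curve_fit(
        lambda r, c, a, b: np.log(zipf_with_cutoff(r, c, a, b)),
        ranks,
        np.log(counts),
        p0=(counts[0], 1.0, len(counts) / 2.0),
        bounds=([1e-9, 0.0, 1e-9], [np.inf, np.inf, np.inf]),
    )
    return params

# Hypothetical usage, one 30-minute interval of per-file request counts:
# c, a, b = fit_popularity(interval_counts)
```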
2.2 Temporal Properties
Job Size Stationarity: The job size distribution is defined as the distribution of stream requests over video durations during a measurement interval. We use the histogram intersection distance [4] to measure the change between two job size distributions, and computed this distance for every pair of adjacent data points over the measurement period. Figure 1 shows the CDFs of the histogram intersection distance at three time scales. At the 30-minute and one-hour scales the distance is very small most of the time; for example, 90% of the time it is no more than 0.15. From day to day, however, the difference between request size distributions is obvious. This indicates that short-term dynamic provisioning only needs to track request arrival rate dynamics, while capacity planning at a daily or longer horizon has to take into account both arrival rate and job size dynamics.
Arrival Rate Predictability: We calculated the autocorrelation coefficient of the arrival rates at the 30-minute time scale; Figure 2 shows that the workload is highly correlated in the short term. We also used Fourier analysis to look for periodicity in the workload dynamics after removing a few workload spikes. As shown in Figure 3, the location of the spectral maximum indicates a period of one day. Given such strong periodic components, well-known statistical prediction approaches can be applied to further improve the accuracy of capacity planning.
Burstiness: While we cannot predict unexpected spikes in the workload, it is necessary to understand the nature of the burstiness and find an efficient way to handle it once a bursty event happens. Figure 4 compares the request (popularity) distribution during one spike interval with that of the preceding interval. The bursty workload can be viewed as two parts: a base workload similar to the workload in the preceding normal period, and an extra workload driven by a few very popular files.

Figure 1: Histogram intersection distance distribution (CDF of the histogram intersection score for 30-minute, one-hour, and one-day data).
Figure 2: Workload autocorrelation coefficient (autocorrelation coefficient versus lag k at the 30-minute scale).
Figure 3: Workload periodicity (spectral power versus period, in units of 30 minutes).
Figure 4: Comparison of a bursty interval and its preceding interval (number of requests within 30 minutes versus file rank, stable versus bursty period).
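The histogram intersection distance of [4] used for Figure 1 can be computed as one minus the intersection of two normalized histograms. The sketch below illustrates the adjacent-interval comparison, assuming job sizes are binned video durations; the helper names and bin choices are hypothetical, not taken from the paper.

```python
import numpy as np

def histogram_intersection_distance(h1, h2):
    """Distance between two histograms, following Swain & Ballard [4].

    Both histograms are normalized to sum to 1; the intersection is the
    bin-wise sum of minima, and the distance is 1 - intersection (in [0, 1]).
    """
    h1 = np.asarray(h1, dtype=float) / np.sum(h1)
    h2 = np.asarray(h2, dtype=float) / np.sum(h2)
    return 1.0 - np.minimum(h1, h2).sum()

def adjacent_interval_distances(per_interval_durations, bins):
    """Distances between job-size histograms of adjacent measurement intervals.

    per_interval_durations: list of arrays, one array of requested video
    durations (seconds) per interval (e.g., per 30 minutes).
    bins: common bin edges shared by all duration histograms.
    """
    hists = [np.histogram(d, bins=bins)[0] for d in per_interval_durations]
    return [histogram_intersection_distance(a, b)
            for a, b in zip(hists, hists[1:])]

# Hypothetical usage: 30-minute intervals, 60-second duration bins up to ~2.1 hours.
# dists = adjacent_interval_distances(per_interval_durations,
#                                     bins=np.arange(0, 7600, 60))
```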
3. WORKLOAD AND CAPACITY MANAGEMENT: A VIRTUAL DESIGN

3.1 Methodology
System model: We model a single video server as a group of virtual servers, each with a First Come First Served (FCFS) queue. The number of virtual servers corresponds to the physical server's capacity, defined as the maximum number of concurrent streams the server can deliver without losing stream quality. In the analysis, a capacity of 300 streams per video server is chosen based on the empirical results in [3]. In this way, we model the video service center as a queueing system with multiple FCFS servers.
Workload Scheduling Schemes: We study two well-known scheduling schemes: random dispatching, which uses no server-state information and sends each incoming job to one of the s servers uniformly at random with probability 1/s; and Least Workload Left (LWL), which aims for load balance by using per-server workload information and assigning each job to the server with the least workload left at the arrival instant.
Service Level Agreements: We consider two QoS metrics for SLAs: the stream quality of an accepted connection, and the waiting time of a video request in the queue before it is accepted for streaming. Assuming sufficient network bandwidth, QoS on stream quality within the data center can be guaranteed through admission control based on server capacity. For the waiting time W, we consider a bound on the tail of the waiting time distribution (called Wtail), defined as Pr[W > x] < y with x > 0 and 0 < y < 1. For example, an SLA could require that 90% of requests experience no more than 5 seconds of delay, i.e., x = 5 and y = 0.1.

3.2 Results
Taking the one-week measurement data from August 13th to August 20th, we numerically calculated the server demands of the random and LWL dispatching schemes under Poisson arrivals (based on results in [5]), with the SLA requirement set as Wtail: Pr[W > mean service time] < 0.3. Figure 5 shows the results; on average, 69.9% of the servers could be saved with the LWL scheme compared to random dispatching.

Figure 5: Server demands with different scheduling schemes (servers needed over time, LWL versus random dispatching).

4. CONCLUSIONS
In this paper, we presented a measurement study of a large Internet video sharing site, Yahoo! Video. With the clear goal of facilitating data center design, the paper gives a comprehensive workload characterization and proposes a set of guidelines for workload and capacity management in a large-scale video distribution system. The immediate next step is to expand the measurement to cover a larger portion of the Yahoo! Video workload.

5. REFERENCES
[1] YouTube. http://www.youtube.com.
[2] comScore Video Metrix report: U.S. Viewers Watched an Average of 3 Hours of Online Video in July, 2007.
[3] L. Cherkasova and L. Staley. Measuring the Capacity of a Streaming Media Server in a Utility Data Center Environment. In MULTIMEDIA '02, New York, NY, USA, 2002. ACM.
[4] M. J. Swain and D. H. Ballard. Color Indexing. Int. J. Comput. Vision, 7(1):11-32, 1991.
[5] W. Whitt. Partitioning Customers into Service Groups. Management Science, 45(11):1579-1592, Nov. 1999.
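As a closing illustration of the dispatching comparison in Section 3, the sketch below simulates random and LWL dispatching over FCFS stream slots under Poisson arrivals and estimates the Wtail violation probability Pr[W > mean service time]. It is a simplified stand-in for the numerical evaluation based on [5], not the authors' method: the event structure, parameter values, and names (simulate, durations, arrival_rate) are assumptions, and each simulated slot corresponds to a single concurrent stream, so physical server demand would be this count divided by the per-server capacity of 300.

```python
import random

def simulate(num_slots, arrival_rate, service_times, policy="lwl", seed=0):
    """Simulate FCFS stream slots fed by Poisson arrivals under one dispatching policy.

    policy: "random" sends each job to a uniformly chosen slot;
            "lwl" sends it to the slot with the least work left at arrival.
    Returns the estimated Pr[W > mean service time], the Wtail metric of Section 3.
    """
    rng = random.Random(seed)
    mean_service = sum(service_times) / len(service_times)
    busy_until = [0.0] * num_slots      # time at which each slot finishes its assigned work
    now, violations = 0.0, 0
    for job_service in service_times:
        now += rng.expovariate(arrival_rate)            # Poisson arrival process
        if policy == "random":
            i = rng.randrange(num_slots)
        else:                                           # least work left at arrival instant
            i = min(range(num_slots), key=lambda k: busy_until[k])
        start = max(now, busy_until[i])                 # FCFS within the chosen slot
        if start - now > mean_service:
            violations += 1
        busy_until[i] = start + job_service
    return violations / len(service_times)

# Hypothetical usage: sweep capacity until Pr[W > mean service time] < 0.3 holds,
# using measured video durations (seconds) as service times and an arrival rate
# of roughly 8 requests per second (the measured daily average).
# for n in range(300, 600000, 300):
#     if simulate(n, arrival_rate=8.0, service_times=durations, policy="lwl") < 0.3:
#         print("LWL demand:", n // 300, "servers"); break
```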