WWW 2008 / Alternate Track: Industrial Practice and Experience April 21-25, 2008 ˇ Beijing, China Planetary-Scale Views on a Large Instant-Messaging Network Carnegie Mellon University Jure Leskovec jure@cs.cmu.edu horvitz@microsoft.com Microsoft Research Eric Horvitz ABSTRACT We present a study of anonymized data capturing a month of high-level communication activities within the whole of the Microsoft Messenger instant-messaging system. We examine characteristics and patterns that emerge from the collective dynamics of large numb ers of p eople, rather than the actions and characteristics of individuals. The dataset contains summary prop erties of 30 billion conversations among 240 million p eople. From the data, we construct a communication graph with 180 million nodes and 1.3 billion undirected edges, creating the largest social network constructed and analyzed to date. We rep ort on multiple asp ects of the dataset and synthesized graph. We find that the graph is well-connected and robust to node removal. We investigate on a planetary-scale the oft-cited rep ort that p eople are separated by "six degrees of separation" and find that the average path length among Messenger users is 6.6. We also find that p eople tend to communicate more with each other when they have similar age, language, and location, and that cross-gender conversations are b oth more frequent and of longer duration than conversations with the same gender. Categories and Sub ject Descriptors: H.2.8 Database Management: : Database applications ­ Data mining General Terms: Measurement; Exp erimentation. Keywords: Social networks; Communication networks; User demographics; Large data; Online communication. 1. INTRODUCTION Large-scale web services provide unprecedented opp ortunities to capture and analyze b ehavioral data on a planetary scale. We discuss findings drawn from aggregations of anonymized data representing one month (June 2006) of high-level communication activities of p eople using the Microsoft Messenger instant-messaging (IM) network. We did not have nor seek access to the content of messages. Rather, we consider structural prop erties of a communication graph and study how structure and communication relate to user demographic attributes, such as gender, age, and location. The data set provides a unique lens for studying patterns of human b ehavior on a wide scale. Jure Leskovec p erformed this research during an internship at Microsoft Research. Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2008, April 21­25, 2008, Beijing, China. ACM 978-1-60558-085-2/08/04. We explore a dataset of 30 billion conversations generated by 240 million distinct users over one month. We found that approximately 90 million distinct Messenger accounts were accessed each day and that these users produced ab out 1 billion conversations, with approximately 7 billion exchanged messages p er day. 180 million of the 240 million active accounts had at least one conversation on the observation p eriod. We found that 99% of the conversations occurred b etween 2 p eople, and the rest with greater numb ers of participants. To our knowledge, our investigation represents the largest and most comprehensive study to date of presence and communications in an IM system. A recent rep ort [6] estimated that approximately 12 billion instant messages are sent each day. Given the estimate and the growth of IM, we estimate that we captured approximately half of the world's IM communication during the observation p eriod. We created an undirected communication network from the data where each user is represented by a node and an edge is placed b etween users if they exchanged at least one message during the month of observation. The network represents accounts that were active during June 2006. In summary, the communication graph has 180 million nodes, representing users who participated in at least one conversation, and 1.3 billion undirected edges among active users, where an edge indicates that a pair of p eople communicated. We note that this graph should b e distinguished from a buddy graph where two p eople are connected if they app ear on each other's contact lists. The buddy graph for the data contains 240 million nodes and 9.1 billion edges. On average each account has approximately 50 buddies on a contact list. To highlight several of our key findings, we discovered that the communication network is well connected, with 99.9% of the nodes b elonging to the largest connected comp onent. We evaluated the oft-cited finding by Travers and Milgram that any two p eople are linked to one another on average via a chain with "6-degrees-of-separation" [17]. We found that the average shortest path length in the Messenger network is 6.6 (median 6), which is half a link more than the path length measured in the classic study. However, we also found that longer paths exist in the graph, with lengths up to 29. We observed that the network is well clustered, with a clustering coefficient [19] that decays with exp onent -0.37. This decay is significantly lower than the value we had exp ected given prior research [11]. We found strong homophily [9, 12] among users; p eople have more conversations and converse for longer durations with p eople who are similar to themselves. We find the strongest homophily for the language used, followed by conversants' geographic lo- 915 WWW 2008 / Alternate Track: Industrial Practice and Experience cations, and then age. We found that homophily does not hold for gender; p eople tend to converse more frequently and with longer durations with the opp osite gender. We also examined the relation b etween communication and distance, and found that the numb er of conversations tends to decrease with increasing geographical distance b etween conversants. However, communication links spanning longer distances tend to carry more and longer conversations. 10 10 April 21-25, 2008 ˇ Beijing, China 10 10 = 3.6 count 10 5 count Login every 20 minutes Login every 15 seconds 10 5 = 2.2 10 0 10 0 10 10 10 number of Login events per user 2 4 6 10 0 10 0 10 10 number of AddBuddy events per user 2 4 (a) Login (b) AddBuddy 2. INSTANT MESSAGING The use of IM has b een b ecome widely adopted in p ersonal and businesss communications. IM clients allow users fast, near-synchronous communication, placing it b etween synchronous communication mediums, such as real-time voice interactions, and asynchronous communication mediums like email [18]. IM users exchange short text messages with one or more users from their list of contacts, who have to b e online and logged into the IM system at the time of interaction. As conversations and messages exchanged within them are usually very short, it has b een observed that users employ informal language, loose grammar, numerous abbreviations, with minimal punctuation [10]. Contact lists are commonly referred to as buddy lists and users on the lists are referred to as buddies. Figure 1: Distribution of the number of events per user. (a) Number of logins per user. (b) Number of buddies added per user. ˇ Presence events: These include login, logout, first ever login, add, remove and block a buddy, add unregistered buddy (invite new user), change of status (busy, away, b e-right-back, idle, etc.). Events are user and time stamp ed. ˇ Communication: For each user participating in the session, the log contains the following tuple: session id, user id, time joined the session, time left the session, numb er of messages sent, numb er of messages received. ˇ User data: For each user, the following self-rep orted information is stored: age, gender, location (country, ZIP), language, and IP address. We use the IP address to decode the geographical coordinates, which we then use to p osition users on the glob e and to calculate distances. We gathered data for 30 days of June 2006. Each day yielded ab out 150 gigabytes of compressed text logs (4.5 terabytes in total). Copying the data to a dedicated eightprocessor server with 32 gigabytes of memory took 12 hours. Our log-parsing system employed a pip eline of four threads that parse the data in parallel, collapse the session join/leave events into sets of conversations, and save the data in a compact compressed binary format. This process compressed the data down to 45 gigabytes p er day. Processing the data took an additional 4 to 5 hours p er day. A sp ecial challenge was to account for missing and dropp ed events, and session "id recycling" across different IM servers in a server farm. As part of this process, we closed a session 48 hours after the last leave session event. We closed sessions automatically if only one user was left in the conversation. 2.1 Research on Instant Messaging Several studies on smaller datasets are related to this work. Avrahami and Hudson [3] explored communication characteristics of 16 IM users. Similarly, Shi et al. [13] analyzed IM contact lists submitted by users to a public website and explored a static contact network of 140,000 p eople. Recently, Xiao et al. [20] investigated IM traffic characteristics within a large organization with 400 users of Messenger. Our study differs from the latter study in that we analyze the ful l Messenger p opulation over a one month p eriod, capturing the interaction of user demographic attributes, communication patterns, and network structure. 2.2 Data description To construct the Microsoft Instant Messenger communication dataset, we combined three different sources of data: (1) user demographic information, (2) time and user stamp ed events describing the presence of a particular user, and (3) communication session logs, where, for all participants, the numb er of exchanged messages and the p eriods of time sp ent participating in sessions is recorded. We use the terms session and conversation interchangeably to refer to an IM interaction among two or more p eople. Although the Messenger system limits the numb er of p eople communicating at the same time to 20, p eople can enter and leave a conversation over time. We note that, for large sessions, p eople can come and go over time, so conversations can b e long with many different p eople participating. We observed some very long sessions with more than 50 participants joining over time. All of our data was anonymized; we had no access to p ersonally identifiable information. Also, we had no access to text of the messages exchanged or any other information that could b e used to uniquely identify users. We focused on analyzing high-level characteristics and patterns that emerge from the collective dynamics of 240 million p eople, rather than the actions and characteristics of individuals. The analyzed data can b e split into three parts: presence data, communication data, and user demographic information: 3. USAGE & POPULATION STATISTICS We shall first review several statistics drawn from aggregations of users and their communication activities. 3.1 Levels of activity Over the observation p eriod, 242,720,596 users logged into Messenger and 179,792,538 of these users were actively engaged in conversations by sending or receiving at least one IM message. Over the month of observation, 17,510,905 new accounts were activated. As a representative day, on June 1 2006, there were almost 1 billion (982,005,323) different sessions (conversations among any numb er of p eople), with more than 7 billion IM messages sent. Approximately 93 million users logged in with 64 million different users b ecoming engaged in conversations on that day. Approximately 1.5 million new users that were not registered within Microsoft Messenger were invited to join on that particular day. 916 WWW 2008 / Alternate Track: Industrial Practice and Experience 10 8 10 7 10 6 10 5 10 4 10 3 10 2 10 1 10 0 10 0 10 9 April 21-25, 2008 ˇ Beijing, China World population MSN population Count x Count -3.5 10 Number of users per session 1 10 2 10 10 Conversation duration 1 2 Figure 2: (a) Distribution of the number of people participating in a conversation. (b) Distribution of the durations of conversations. The spread of durations can be described by a power-law distribution. 10 6 age 20 10 10 10 9 10 8 10 7 10 6 10 5 10 4 10 3 10 2 10 0 10 11 x -3.67 100+ 95-99 90-94 85-89 80-84 75-79 70-74 65-69 60-64 55-59 50-54 45-49 40-44 35-39 30-34 25-29 20-24 15-19 10-14 5-9 0-4 Female Male 0.1 0.05 0 0.05 proportion of the population 0.1 Data = 9.7e5 x-1.77 R2=1.00 10 6 10 5 Figure 4: World and Messenger user population age pyramid. Ages 15­30 are overrepresented in the Messenger population. 10 10 10 count 5 10 4 Data -3.70 2 = 1.5e11 x R =0.99 10 8 Data -1.53 2 = 3.9e9 x R =0.99 1 day 2 days count 10 10 3 4 count 10 0 10 2 Figure 3: (a) Distribution of login duration. (b) Duration of times when people are not logged into the system (times between logout and login). We consider event distributions on a p er-user basis in Figure 1. The numb er of logins p er user, displayed in Figure 1(a), follows a heavy-tailed distribution with exp onent 3.6. We note spikes in logins at 20 minute and 15 second intervals, which corresp ond to an auto-login function of the IM client. As shown in Figure 1(b), many users fill up their contact lists rather quickly. The spike at 600 buddies undoubtedly reflects the maximal allowed length of contact lists. Figure 2(a) displays the numb er of users p er session. In Messenger, multiple p eople can participate in conversations. We observe a p eak at 20 users, the limit on the numb er of p eople who can participate simultaneously in a conversation. Figure 2(b) shows the distribution over the session durations, which can b e modeled by a p ower-law distribution with exp onent 3.6. Next, we examine the distribution of the durations of p eriods of time when p eople are logged on to the system. Let (tij , toj ) denote a time ordered (tij < toj < tij +1 ) sequence of online and offline times of a user, where tij is the time of the j th login, and toj is the corresp onding logout time. Figure 3(a) plots the distribution of toj - tij over all j over all users. Similarly, Figure 3(b) shows the distribution of the p eriods of time when users are logged off, i.e. tij +1 - toj over all j and over all users. Fitting the data to p ower-law distributions reveals exp onents of 1.77 and 1.3, resp ectively. The data shows that durations of b eing online tend to b e shorter and decay faster than durations that users are offline. We also notice p eriodic effects of login durations of 12, 24, and 48 hours, reflecting daily p eriodicities. We observe similar p eriodicities for logout durations at multiples of 24 hours. 10 login duration 1 10 2 10 0 10 3 Data = 6.9e5 x-1.34 R2=1.00 10 logout duration 1 10 5 count 10 6 3 days 10 2 10 10 0 4 10 0 10 conversation duration [min] 2 10 4 10 0 time between conversations [min] 10 5 Figure 5: Temporal characteristics of conversations. (a) Average conversation duration per user; (b) time between conversations of users. strongly overrepresented in the active Messenger p opulation. Focusing on the differences by gender, females are overrepresented for the 10­14 age interval. For male users, we see overall matches with the world p opulation for age spans 10­ ° 14 and 35U39; for women users, we see a match for ages in the span of 30­34. We note that 6.5% of the p opulation did not submit an age when creating their Messenger accounts. 4. COMMUNICATION CHARACTERISTICS We now focus on characteristics and patterns with communications. We limit the analysis to conversations b etween two participants, which account for 99% of all conversations. We first examine the distributions over conversation durations and times b etween conversations. Let user u have C conversations in the observation p eriod. Then, for every conversation i of user u we create a tuple (tsu,i , teu,i , mu,i ), where tsu,i denotes the start time of the conversation, teu,i is the end time of the conversation, and mu,i is the numb er of exchanged messages b etween the two users. We order the conversations by their start time (tsu,i < tsu,i+1 ). Then, for every user u, wei calculate the average conversation du1 ¯ ration d(u) = C teu,i - tsu,i , where the sum goes over all the u's conversations. Figure 5(a) shows the distribution ¯ of d(u) over all the users u. We find that the conversation length can b e describ ed by a heavy-tailed distribution with exp onent -3.7 and a mode of 4 minutes. Figure 5(b) shows the intervals b etween consecutive conversations of a user. We plot the distribution of tsu,i+1 - tsu,i , where tsu,i+1 and tsu,i denote start times of two consecutive conversations of user u. The p ower-law exp onent of the distribution over intervals is - 1.5. This result is similar to the temp oral distribution for other kinds of human communication activities, e.g., waiting times of emails and letters b efore a reply is generated [4]. The exp onent can b e explained by a priority-queue model where tasks of different 3.2 Demographic characteristics of the users We compared the demographic characteristics of the Messenger p opulation with 2005 world census data and found differences b etween the statistics for age and gender. The visualization of this comparison displayed in Figure 4 shows that users with rep orted ages in the 15­35 span of years are 917 WWW 2008 / Alternate Track: Industrial Practice and Experience 60 55 50 45 40 35 30 25 20 15 10 10 20 30 40 50 60 60 55 50 45 40 35 30 25 20 15 10 10 20 30 40 50 60 April 21-25, 2008 ˇ Beijing, China U 1.3 F 3.6 21.3 M 3.7 49.9 20.2 M 6.7 7.6 5.9 (b) U F M (d) U F M U 277 F 301 275 M 277 304 252 M 1.38 1.50 1.42 (a) U F M (c) U F M U 5.7 (a) Numb er of conversations 60 55 50 45 40 35 30 25 20 15 10 10 20 30 40 50 60 (b) Conversation duration 60 55 50 45 40 35 30 25 20 15 10 10 20 30 40 50 60 F 7.1 6.6 U 1.25 F 1.42 1.43 (c) Messages p er conversation (d) Messages p er unit time Table 1: Cross-gender communication, based on all two-person conversations during June 2006. (a) Percentage of conversations among users of different self-reported gender; (b) average conversation length in seconds; (c) number of exchanged messages per conversation; (d) number of exchanged messages per minute of conversation. serve a similar phenomenon when plotting the average numb er of exchanged messages p er conversation, computed as i 1 Ca,b mi , displayed in Figure 6(c). Again, we find |Ca,b | that older p eople exchange more messages, and we observe a dip for ages 25­45 and a slight p eak for ages 15­25. Figure 6(d) displays the numb er of exchanged messageis p er unit mi time; for each age pair, (a, b), we measure |C 1 | Ca,b di . a,b Here, we see that younger p eople have faster-paced dialogs, while older p eople exchange messages at a slower pace. We note that the younger p opulation (ages 10­35) are strongly biased towards communicating with p eople of a similar age (diagonal trend in Figure 6(a)), and that users who rep ort b eing of ages 35 years and ab ove tend to communicate more evenly across ages (rectangular pattern in Fig. 6(a)). Moreover, older p eople have conversations of the longest durations, with a "valley" in the duration of conversations for users of ages 25­35. Such a dip may represent shorter, faster-paced and more intensive conversations associated with work-related communications, versus more extended, slower, and longer interactions associated with social discourse. Figure 6: Communication characteristics of users by reported age. We plot age vs. age and the color (zaxis) represents the intensity of communication. priorities arrive and wait until all tasks with higher priority are addressed. This model generates a task waiting time distribution describ ed by a p ower-law with exp onent -1.5. 5. COMMUNICATION DEMOGRAPHICS Next we examine the interplay of communication and user demographic attributes, i.e., how geography, location, age, and gender influence observed communication patterns. 5.1 Communication by age We sought to understand how communication among p eople changes with the rep orted ages of participating users. Figures 6(a)-(d) use a heat-map visualization to communicate prop erties for different age­age pairs. The rows and columns represent the ages of b oth parties participating, and the color at each age­age cell captures the logarithm of the value for the pairing. The color sp ectrum extends from blue (low value) through green, yellow, and onto red (the highest value). Because of p otential misrep orting at very low and high ages, we concentrate on users with self-rep orted ages that fall b etween 10 and 60 years. Let a tuple (ai , bi , di , mi ) denote the ith conversation in the entire dataset that occurred among users of ages ai and bi . The conversation had a duration of di seconds during which mi messages were exchanged. Let Ca,b = {(ai , bi , di , mi ) : ai = a bi = b} denote a set of all conversations b etween users of ages a and b, resp ectively. Figure 6(a) shows the numb er of conversations among p eople of different ages. For every pair of ages (a, b) the color indicates the size of set Ca,b , i.e., the numb er of different conversations b etween users of ages a and b. We note that, as the notion of a conversation is symmetric, the plots are symmetric. Most conversations occur b etween p eople of ages 10 to 20. The diagonal trend indicates that p eople tend to talk to p eople of similar age. This is true esp ecially for age groups b etween 10 and 30 years. We shall explore this observation in more detail in Section 6. Figure 6(b) displays a heat map for the average converi sation duration, computed as |C 1 | Ca,b di . We note a,b that older p eople tend to have longer conversations. We ob- 5.2 Communication by gender We rep ort on analyses of prop erties of pairwise communications as a function of the self-rep orted gender of users in conversations in Table 1. Let Cg,h = {(gi , hi , di , mi ) : gi = g hi = h} denote a set of conversations where the two participating users are of genders g and h. Note that g takes 3 p ossible values: female, male, and unknown (unrep orted). Table 1(a) relays |Cg,h | for combinations of genders g and h. The table shows that approximately 50% of conversations occur b etween male and female and 40% of the conversations occur among users of the same gender (20% for each). A small numb er of conversations occur b etween p eople who did not reveal their gender. Similarly, Table 1(b) shows the average conversation length in seconds, broken down by the gender of conversant, comi puted as |C 1 | Cg,h di . We find that male­male converg,h sations tend to b e shortest, lasting approximately 4 minutes. Female­female conversations last 4.5 minutes on the average. Female­male conversations have the longest durations, taking more than 5 minutes on average. Beyond taking place over longer p eriods of time, more messages are exchanged in female­male conversations. Table 1(c) lists 918 WWW 2008 / Alternate Track: Industrial Practice and Experience i values for |C 1 | Cg,h mi and shows that, in female­male g,h conversations, 7.6 messages are exchanged p er conversation on the average as opp osed to 6.6 and 5.9 for female­female and male­male, resp ectively. Table 1(d) sihows the commi 1 munication intensity computed as |Cg,h | Cg,h di . The numb er of messages exchanged p er minute of conversation for male­female conversations is higher at 1.5 messages p er minute than for cross-gender conversations, where the rate is 1.43 messages p er minute. We examined the numb er of communication ties, where a tie is established b etween two p eople when they exchange at least one message during the observation p eriod. We computed 300 million male­male ties, 255 million female­ female ties, and 640 million cross-gender ties. The Messenger p opulation consists of 100 million males and 80 million females by self rep ort. These findings demonstrate that ties are not heavily gender biased; based on the p opulation, random chance predicts 31% male­male, 20% female­ female, and 49% female­male links. We observe 25% male­ male, 21% female­female, and 54% cross-gender links, thus demonstrating a minor bias of female­male links. The results rep orted in Table 1 run counter to prior studies rep orting that communication among individuals who resemble one other (same gender) occurs more often (see [9] and references therein). We identified significant heterophily, where p eople tend to communicate more with p eople of the opp osite gender. However, we note that link heterogeneity was very close to the p opulation value [8], i.e., the numb er of same- and cross-gender ties roughly corresp onds to random chance. This shows there is no significant bias in linking for gender. However, we observe that cross-gender conversations tend to b e longer and to include more messages, suggesting that more effort is devoted to conversations with the opp osite sex. April 21-25, 2008 ˇ Beijing, China Figure 7: Number of users at a particular geographic location. Color of dots represents the number of users. Figure 8: Number of Messenger users per capita. Color intensity corresponds to the number of users per capita in the cell of the grid. Figure 9 shows a heat map that represents the intensities of Messenger communications on an international scale. To create this map, we place the world map on a fine grid, where each cell of the grid contains the count of the numb er of conversations that pass through that p oint by increasing the count of all cells on the straight line b etween the geolocations of pairs of conversants. The color indicates the numb er of conversations crossing each p oint, providing a visualization of the key flows of communication. For example, Australia and New Zealand have communications flowing towards Europ e and United States. Similar flows hold for Japan. We see that Brazilian communications are weighted toward Europ e and Asia. We can also explore the flows of transatlantic and US transcontinental communications. 5.3 World geography and communication We now focus on the influence of geography and distance among participants on communications. Figure 7 shows the geographical locations of Messenger users. The general location of the user was obtained via reverse IP lookup. We plot all latitude/longitude p ositions linked to the p osition of servers where users log into the service. The color of each dot corresp onds to the logarithm of the numb er of logins from the resp ective location, again using a sp ectrum of colors ranging from blue (low) through green and yellow to red (high). Although the maps are built solely by plotting these p ositions, a recognizable world map is generated. We find that North America, Europ e, and Japan are very dense, with many users from those regions using Messenger. For the rest of the world, the p opulation of Messenger users app ears to reside largely in coastal regions. We can condition the densities and b ehaviors of Messenger users on multiple geographical and socioeconomic variables and explore relationships b etween electronic communications and other attributes. As an example, harnessed the United Nations gridded world p opulation data to provide estimates of the numb er of p eople living in each cell. Given this data, and the data from Figure 7, we calculate the numb er of users p er capita, displayed in Figure 8. Now we see transformed picture where several sparsely p opulated regions stand out as having a high usage p er capita. These regions include the center of the United States, Canada, Scandinavia, Ireland, Australia, and South Korea. 5.4 Communication among countries Communication among p eople within different countries also varies dep ending on the locations of conversants. We examine two such views. Figure 10(a) shows the top countries by the numb er of conversations b etween pairs of countries. We examined all pairs of countries with more than 10 million conversations p er month. The width of edges in the figure is prop ortional to the logarithm of the numb er of conversations among the countries. We find that the United States and Spain app ear to serve as hubs and that edges app ear largely b etween historically or ethnically connected countries. As examples, Spain is connected with the Spanish sp eaking countries in South America, Germany links to Turkey, Portugal to Brazil, and China to Korea. Figure 10(b) displays a similar plot where we consider country pairs by the average duration of conversations. The 919 WWW 2008 / Alternate Track: Industrial Practice and Experience 6 5 x 10 7 April 21-25, 2008 ˇ Beijing, China Raw data Smooted conversations per friendship 7.2 7.1 7 6.9 6.8 6.7 6.6 Raw data Smooted number of conversations 4 3 2 1 0 0.5 1 distance [km] 1.5 x 10 4 2 6.5 5000 10000 distance [km] 15000 (a) Numb er of conversations (b) Conversations p er link Figure 11: Communication with the distance. Figure 9: A communication heat map. Attribute Age Gender ZI P County Language Correlation Rnd Comm -0.0001 0.297 0.0001 -0.032 -0.0003 0.557 0.0005 0.704 -0.0001 0.694 Probability Rnd Comm 0.030 0.162 0.434 0.426 0.001 0.23 0.046 0.734 0.030 0.798 Venezuela Chile Argentina Mexico Peru Colombia Belgium Morocco France Portugal Egypt Brazil Malaysia United States Spain China Italy Taiwan Korea Yugoslavia Poland Canada Croatia Thailand Netherlands Russia Hong Kong SAR Bahrain Israel Oman France Sudan U.A.R Qatar Algeria Belgium Egypt Jordan Syria Cameroon United Kingdom snia Bo Canada Australia Turkey Germany Netherlands Dominican Republic Austria Germany Lebanon Palestinian Auth. Azerbaijan Kuwait Pakistan Yemen U.S.A. Saudi Arabia Australia U.K. Iraq Table 2: Correlation coefficients and probability of users sharing an attribute for random pairs of people versus for pairs of people who communicate. versation lengths do not increase with distance (see plots in [7]). Conversation duration decreases with the distance, while the numb er of exchanged messages remains constant b efore decreasing slowly. Figure 11(b) shows the communications p er link versus the distance among participants. The plot shows that longer links, i.e., connections b etween p eople who are farther apart, are more frequently used than shorter links. We interpret this finding to mean that p eople who are farther apart use Messenger more frequently to communicate. In summary, we observe that the total numb er of links and associated conversations decreases with increasing distance among participants. The same is true for the duration of conversations, the numb er of exchanged messages p er conversation, and the numb er of exchanged messages p er unit time. However, the numb er of times a link is used tends to increase with the distance among users. This suggests that p eople who are farther apart tend to converse with IM more frequently, which p erhaps takes the place of more exp ensive long-distance voice telephony; voice might b e used more frequently in lieu of IM for less exp ensive local communications. In Morocco dia Libya Turkey Thailand Figure 10: (a) Communication among countries with at least 10 million conversations in June 2006. (b) Countries by average length of the conversation. Edge widths correspond to logarithms of intensity of links. width of the edges are prop ortional to the mean length of conversations b etween the countries. The core of the network app ears to b e Arabic countries, including Saudi Arabia, Egypt, United Arab Emirates, Jordan, and Syria. 5.5 Communication and geographical distance We were interested in how communications change as the distance b etween p eople increases. We had hyp othesized that the numb er of conversations would decrease with geographical distance as users might b e doing less coordination with one another on a daily basis, and where communication would likely require more effort to coordinate than might typically b e needed for p eople situated more locally. We also conjectured that, once initiated, conversations among p eople who are farther apart would b e somewhat longer as there might b e a stronger need to catch up when the lessfrequent conversations occurred. Figure 11 plots the relation b etween communication and distance. Figure 11(a) shows the distribution of the numb er of conversations b etween conversants at distance l. We found that the numb er of conversations decreases with distance. However, we observe a p eak at a distance of approximately 500 kilometers. The other p eaks and drops may reveal geographical features. For example, a significant drop in communication at distance of 5,000 km (3,500 miles) may reflect the width of the Atlantic ocean or the distance b etween the east and west coasts of the United States. The numb er of links rapidly decreases with distance. This finding suggests that users may use Messenger mainly for communications with others within a local context and environment. We found that the numb er of exchanged messages and con- 6. HOMOPHILY OF COMMUNICATION We p erformed several exp eriments to measure the level at which p eople tend to communicate with similar p eople. First, we consider all 1.3 billion pairs of p eople who exchanged at least one message in June 2006, and calculate the similarity of various user demographic attributes. We contrast this with the similarity of pairs of users selected via uniform random sampling across 180 million users. We consider two measures of similarity: the correlation coefficient and the probability that users have the same attribute value, e.g., that users come from the same countries. Table 2 compares correlation coefficients of various user attributes when pairs of users are chosen uniformly at random with coefficients for pairs of users who communicate. We can see that attributes are not correlated for random pairs of p eople, but that they are highly correlated for users 920 WWW 2008 / Alternate Track: Industrial Practice and Experience 80 70 60 50 40 30 20 10 10 20 30 40 50 60 70 80 80 70 60 50 40 30 20 10 10 20 30 40 50 60 70 80 April 21-25, 2008 ˇ Beijing, China (a) Random (b) Communicate Figure 12: Numbers of pairs of people of different ages. (a) Randomly selected pairs of people; (b) people who communicate. Correlation between age and communication is captured by the diagonal trend. 10 10 9 5.4 8 10 10 10 10 10 7 time per conversation [min] number of conversations 5.2 5 6 5 4.8 4 4.6 3 0 20 40 60 age difference 80 100 4.4 0 20 40 age difference 60 80 (a) Numb er of conversations (b) Conversation duration versation duration p eaks at an age difference of 20 years b etween participants. We sp eculate that the p eak may corresp ond roughly to the gap b etween generations. The plots reveal that there is strong homophily in the communication network for age; p eople tend to communicate more with p eople of similar rep orted age. This is esp ecially salient for the numb er of buddies and conversations among p eople of the same ages. We also observe that the links b etween p eople of similar attributes are used more often, to interact with shorter and more intense (more exchanged messages) communications. The intensity of communication decays linearly with the difference in age. In contrast to findings of previous studies, we observe that the numb er of cross-gender communication links follows a random chance. However, cross-gender communication takes longer and is faster paced as it seems that p eople tend to pay more attention when communicating with the opp osite sex. Recently, using the data we generated, Singla and Richardson further investigated the homophily within the Messenger network and found that p eople who communicate are also more likely to search the web for content on similar topics [14]. Figure 13: Communication characteristics and age difference between conversants. who communicate. As we noted earlier, gender and communication are slightly negatively correlated; p eople tend to communicate more with p eople of the opp osite gender. Another method for identifying association is to measure the probability that a pair of users will show an exact match in values of an attribute, i.e., identifying whether two users come from the same country, sp eak the same language, etc. Table 2 shows the results for the probability of users sharing the same attribute value. We make similar observations as b efore. People who communicate are more likely to share common characteristics, including age, location, language, and they are less likely to b e of the same gender. We note that the most common attribute of p eople who communicate is language. On the flip side, the amount of communication tends to decrease with increasing user dissimilarity. This relationship is highlighted in Figure 11, which shows how communication among pairs of p eople decreases with distance. Figure 12 further illustrates the results displayed in Table 2, where we randomly sample pairs of users from the Messenger user base, and then plot the distribution over rep orted ages. As most of the p opulation comes from the age group 10­30, the distribution of random pairs of p eople reaches the mode at those ages but there is no correlation. Figure 12(b) shows the distribution of ages over the pairs of p eople who communicate. Note the correlation, as represented by the diagonal trend on the plot, where p eople tend to communicate more with others of a similar age. Next, we further explore communication patterns by the differences in the rep orted ages among users. Figure 13(a) plots on a log-linear scale the numb er of conversations in the social network with participants of varying age differences. Again we see that links and conversations are strongly correlated with the age differences among participants. Figure 13(b) shows the average conversation duration with the age difference among the users. Interestingly, the mean con- 7. THE COMMUNICATION NETWORK So far we have examined communication patterns based on pairwise communications. We now create a more general communication network from the data. Using this network, we can examine the typical social distance b etween p eople, i.e., the numb er of links that separate a random pair of p eople. This analysis seeks to understand how many p eople can b e reached within certain numb ers of hops among p eople who communicate. Also, we test the transitivity of the network, i.e., the degree at which pairs with a common friend tend to b e connected. We constructed a graph from the set of all two-user conversations, where each node corresp onds to a p erson and there is an undirected edge b etween a pair of nodes if the users were engaged in an active conversation during the observation p eriod (users exchanged at least 1 message). This results in a network which contains 179,792,538 nodes, and 1,342,246,427 edges. Note that this is not simply a buddy network; we only connect p eople who are buddies and have communicated during the observation p eriod. Figures 14­15 show the structural prop erties of the communication network. The network degree distribution shown in Figure 14(a) is heavy tailed but does not follow a p owerlaw distribution. Using maximum likelihood estimation, we fit a p ower-law with exp onential cutoff p(k) k-a e-bk with fitted parameter values a = 0.8 and b = 0.03. We found a strong cutoff parameter and low p ower-law exp onent, suggesting a distribution with high variance. Figure 14(b) displays the degree distribution of a buddy graph. We did not have access to the full buddy network; we only had access to data on the length of the user contact list which allowed us to create the plot. We found a total of 9.1 billion buddy edges in the graph with 49 buddies p er user. We fit the data with a p ower-law distribution with exp onential cutoff and identified parameters of a = 0.6 and b = 0.01. The p ower-law exp onent now is even smaller. This model describ ed the data well. We note a spike at 600 which is the limit on the maximal numb er of buddies imp osed by the Messenger software client. The maximal numb er of buddies was increased to 300 from 150 in March 921 WWW 2008 / Alternate Track: Industrial Practice and Experience 100 10-1 10-2 10-3 10-4 10-5 10-6 10-7 10-8 10 0 April 21-25, 2008 ˇ Beijing, China 109 108 k-0.8 exp(-0.03k) 10-1 10-2 100 p(b) (Probability) p(k) (probability) p(l) (Probability) 10-3 10-4 10-5 10-6 10-7 10-8 b-0.6 exp(-0.01b) 0 10-4 10-6 10-8 10-10 10-12 0 5 10 15 20 25 l, (Path length in hops) 30 Number of nodes 10-2 107 106 105 104 10 3 k=20 102 10 10 10 k (number of conversants) 1 2 3 10 4 10 10 10 b (Number of buddies) 1 2 10 3 101 0 10 k=60-68, n=79 101 Core of order K 102 (a) Communication (b) Buddies (a) Diameter (b) k-cores Figure 14: (a) Degree distribution of communication network (number of people with whom a person communicates). (b) Degree distribution of the buddy network (length of the contact list). 100 107 c k-0.37 10 -1 Figure 16: (a) Distribution over the shortest path lengths. Average shortest path has length 6.6, the distribution reaches the mode at 6 hops, and the 90% effective diameter is 7.8; (b) distribution of sizes of cores of order k. c (Clustering coefficient) 106 105 Count 104 103 10 2 10 -2 largest component (99.9% of the nodes) 101 10 -3 101 k (Degree) 102 100 0 10 101 102 103 104 105 106 107 108 109 Weakly connected component size (a) Clustering (b) Comp onents Figure 15: (a) Clustering coefficient; (b) distribution of connected components. 99.9% of the nodes belong to the largest connected component. 2005, and was later raised to 600. With the data from June 2006, we see only the p eak at 600, and could not identify bumps at the earlier constraints. Social networks have b een found to b e highly transitive, i.e., p eople with common friends tend to b e friends themselves. The clustering coefficient [19] has b een used as a measure of transitivity in the network. The measure is defined as the fraction of triangles around a node of degree k [19]. Figure 15(a) displays the clustering coefficient versus the degree of a node for Messenger. Previous results on measuring the web graph as well as theoretical analyses show that the clustering coefficient decays as k-1 (exp onent -1) with node degree k [11]. For the Messenger network, the clustering coefficient decays very slowly with exp onent -0.37 with the degree of a node and the average clustering coefficient is 0.137. This result suggests that clustering in the Messenger network is much higher than exp ected--that p eople with common friends also tend to b e connected. Figure 15(b) displays the distribution of the connected comp onents in the network. The giant comp onent contains 99.9% of the nodes in the network against a background of small comp onents, and the distribution follows a p ower law. Figure 16(a) displays the distribution over the shortest path lengths. To approximate the distribution of the distances, we randomly sampled 1000 nodes and calculated for each node the shortest paths to all other nodes. We found that the distribution of path lengths reaches the mode at 6 hops and has a median at 7. The average path length is 6.6. This result means that a random pair of nodes in the Messenger network is 6.6 hops apart on the average, which is half a link longer than the length measured by Travers and Milgram. The 90th p ercentile (effective diameter [16]) of the distribution is 7.8. 48% of nodes can b e reached within 6 hops and 78% within 7 hops. So, we might say that, via the lens provided on the world by Messenger, we find that there are ab out "7 degrees of separation" among p eople. We note that long paths, i.e., nodes that are far apart, exist in the network; we found paths up to a length of 29. 7.2 Network cores We further study connectivity of the communication network by examining the k-cores [5] of the graph. The concept of k-core is a generalization of the giant connected comp onent. The k-core of a network is a set of vertices K , where each vertex in K has at least k edges to other vertices in K . The distribution of k-core sizes gives us an idea of how quickly the network shrinks as we move towards the core. The k-core of a graph can b e obtained by deleting from the network all vertices of degree less than k. This process will decrease degrees of some non-deleted vertices, so more vertices will have degree less than k. We keep pruning vertices until all remaining vertices have degree of at least k. We call the remaining vertices a k-core. Figure 16 plots the numb er of nodes in a core of order k. We note that the core sizes are remarkably stable up to a value of k 20; the numb er of nodes in the core drops for only an order of magnitude. After k > 20, the core size rapidly drops. The central part of the communication network is comp osed of 79 nodes, where each of them has more than 68 edges inside the set. The structure of the Messenger communication network is quite different from the Internet graph; it has b een observed [2] that the size of a k-core of the Internet decays as a p ower-law with k. Here we see that the core sizes remains very stable up to a degree 20, and only then start to rapidly degrease. This means that the nodes with degrees of less than 20 are on the fringe of the network, and that the core starts to rapidly decrease as nodes of degree 20 or more are deleted. 7.1 How small is the small world? Messenger data gives us a unique opp ortunity to study distances in the social network. To our knowledge, this is the first time a planetary-scale social network has b een available to validate the well-known "6 degrees of separation" finding by Travers and Milgram [17]. The earlier work employed a sample of 64 p eople and found that the average numb er of hops for a letter to travel from Nebraska to Boston was 6.2 (mode 5, median 5), which is p opularly known as the "6 degrees of separation" among p eople. We used a p opulation sample that is more than two million times larger than the group studied earlier and confirmed the classic finding. 922 WWW 2008 / Alternate Track: Industrial Practice and Experience April 21-25, 2008 ˇ Beijing, China 7.3 Strength of the ties It has b een observed by Alb ert et al. [1] that many realworld networks are robust to node-level changes or attacks. Researchers have showed that networks like the World Wide Web, Internet, and several social networks display a high degree of robustness to random node removals, i.e., one has to remove many nodes chosen uniformly at random to make the network disconnected. On the contrary, targeted attacks are very effective. Removing a few high degree nodes can have a dramatic influence on the connectivity of a network. Let us now study how the Messenger communication network is decomp osed when "strong," i.e., heavily used, edges are removed from the network. We consider several different definitions of "heavily used," and measure the typ es of edges that are most imp ortant for network connectivity. We note that a similar exp eriment was p erformed by Shi et al [13] in the context of a small IM buddy network. The authors of the prior study took the numb er of common friends at the ends of an edge as a measure of the link strength. As the numb er of edges here is too large (1.3 billion) to remove edges one by one, we employed the following procedure: We order the nodes by decreasing value p er a measure of the intensity of engagement of users; we then delete nodes associated with users in order of decreasing measure and we observe the evolution of the prop erties of the communication network as nodes are deleted. We consider the following different measures of engagement: ˇ Average sent: The average numb er of sent messages p er user's conversation ˇ Average time: The average duration of user's conversations ˇ Links: The numb er of links of a user (node degree), i.e., numb er of different p eople he or she exchanged messages with ˇ Conversations: The total numb er of conversations of a user in the observation p eriod ˇ Sent messages: The total numb er of sent messages by a user in the observation p eriod ˇ Sent p er unit time: The numb er of sent messages p er unit time of a conversation ˇ Total time: The total conversation time of a user in the observation p eriod At each step of the exp eriment, we remove 10 million nodes in order of the sp ecific measure of engagement b eing studied. We then determine the relative size of the largest connected comp onent, i.e., given the network at particular step, we find the fraction of the nodes b elonging to the largest connected comp onent of the network. Figure 17 plots the evolution of the fraction of nodes in the largest connected comp onent with the numb er of deleted nodes. We plot a separate curve for each of the seven different measures of engagement. For comparison, we also consider the random deletion of the nodes. The decomp osition procedure highlighted two typ es of dynamics of network change with node removal. The size of the largest comp onent decreases rapidly when we use as measures of engagement the numb er of links, numb er of conversations, total conversation time, or numb er of sent messages. In contrast, the size of the largest comp onent decreases very slowly when we use as a measure of engagement the average 1 0.9 0.8 Component size 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 Avg. sent Avg. time Links Conversations Sent messages Sent per unit time Total time Random 2 4 6 8 10 Deleted nodes 12 14 16 x 10 7 Figure 17: Relative size of the largest connected component as a function of number of nodes removed. 14 12 10 Deleted edges x 10 8 8 6 4 2 0 0 Avg. sent Avg. time Links Conversations Sent messages Sent per unit time Total time Random 2 4 6 8 10 Deleted nodes 12 14 16 x 10 7 Figure 18: Number of removed edges as nodes are deleted by order of different measures of engagement. time p er conversation, average numb er of sent messages, or numb er of sent messages p er unit time. We were not surprised to find that the size of the largest comp onent size decreases most rapidly when nodes are deleted in order of the decreasing numb er of links that they have, i.e., the numb er of p eople with whom a user at a node communicates. Random ordering of the nodes shrinks the comp onent at the slowest rate. After removing 160 million out of 180 million nodes with the random p olicy, the largest comp onent still contains ab out half of the nodes. Surprisingly, when deleting up to 100 million nodes, the average time p er conversation measure shrinks the comp onent even more slowly than the random deletion p olicy. Figure 18 displays plots of the numb er of removed edges from the network as nodes are deleted. Similar to the relationships in Figure 17, we found that deleting nodes by the inverse numb er of edges removes edges the fastest. As in Figure 18, the same group of node ordering criteria (numb er of conversations, total conversation time or numb er of sent messages) removes edges from the networks as fast as the numb er of links criteria. However, we find that random node removal removes edges in a linear manner. Edges are removed at a lower rate when deleting nodes by average time p er conversation, average numb ers of sent messages, or numb ers of sent messages p er unit time. We b e- 923 WWW 2008 / Alternate Track: Industrial Practice and Experience lieve that these findings demonstrate that users with long conversations and many messages p er conversation tend to have smaller degrees--even given the findings displayed in Figure 17, where we saw that removing these users is more effective for breaking the connectivity of the network than for random node deletion. Figure 18 also shows that using the average numb er of messages p er conversation as a criterion removes edges in the slowest manner. We b elieve that this makes sense intuitively: If users invest similar amounts of time to interacting with others, then p eople with short conversations will tend to converse with more p eople in a given amount of time than users having long conversations. April 21-25, 2008 ˇ Beijing, China We hop e that our studies with Messenger data serves as an example of directions in social science research, highlighting how communication systems can provide insights ab out high-level patterns and relationships in human communications without making incursions into the privacy of individuals. We hop e that this first effort to understand a social network on a genuinely planetary scale will emb olden others to explore human b ehavior at large scales. Acknowledgments We thank Dan Liebling for help with generated world map plots, and Dimitris Achlioptas and Susan Dumais for helpful suggestions. 8. CONCLUSION 9. REFERENCES We have reviewed a set of results stemming from the generation and analysis of an anonymized dataset representing the communication patterns of all p eople using a p opular IM system. The methods and findings highlight the value of using a large IM network as a worldwide lens onto aggregate human b ehavior. We describ ed the creation of the dataset, capturing highlevel communication activities and demographics in June 2006. The core dataset contains more than 30 billion conversations among 240 million p eople. We discussed the creation and analysis of a communication graph from the data containing 180 million nodes and 1.3 billion edges. The communication network is largest social network analyzed to date. The planetary-scale network allowed us to explore dep endencies among user demographics, communication characteristics, and network structure. Working with such a massive dataset allowed us to test hyp otheses such as the average chain of separation among p eople across the entire world. We discovered that the graph is well connected, highly transitive, and robust. We reviewed the influence of multiple factors on communication frequency and duration. We found strong influences of homophily in activities, where p eople with similar characteristics tend to communicate more, with the exception of gender, where we found that crossgender conversations are b oth more frequent and of longer duration than conversations with users of the same rep orted gender. We also examined the path lengths and validated on a planetary scale earlier research that found "6 degrees of separation" among p eople. We note that the sheer size of the data limits the kinds of analyses one can p erform. In some cases, a smaller random sample may avoid the challenges with working with terabytes of data. However, it is known that sampling can corrupt the structural prop erties of networks, such as the degree distribution and the diameter of the graphs [15]. Thus, while sampling may b e valuable for managing complexity of analyses, results on network prop erties with partial data sets may b e rendered unreliable. Furthermore, we need to consider the full data set to reliably measure the patterns of age and distance homophily in communications. In other directions of research with the dataset, we have pursued the use of machine learning and inference to learn predictive models that can forecast such prop erties as communication frequencies and durations of conversations among p eople as a function of the structural and demographic attributes of conversants. Our future directions for research include gaining an understanding of the dynamics of the structure of the communication network via a study of the evolution of the network over time. [1] R. Alb ert, H. Jeong, and A.-L. Barabasi. Error and attack tolerance of complex networks. Nature, 406:378, 2000. [2] J. I. Alvarez-Hamelin, L. Dall'Asta, A. Barrat, and A. Vespignani. Analysis and visualization of large scale networks using the k-core decomp osition. In ECCS '05: European Conference on Complex Systems, 2005. [3] D. Avrahami and S. E. Hudson. Communication characteristics of instant messaging: effects and predictions of interp ersonal relationships. In CSCW '06, pages 505­514, 2006. [4] A.-L. Barabasi. The origin of bursts and heavy tails in human dynamics. Nature, 435:207, 2005. [5] V. Batagelj and M. Zaversnik. Generalized cores. ArXiv, (cs.DS/0202039), Feb 2002. [6] IDC Market Analysis. Worldwide Enterprise Instant Messaging Applications 2005­2009 Forecast and 2004 Vendor Shares: Clearing the Decks for Substantial Growth. 2005. [7] J. Leskovec and E. Horvitz. Worldwide Buzz: Planetary-Scale Views on an Instant-Messaging Network. Tech. rep ort MSR-TR-2006-186, 2006. [8] P. V. Marsden. Core discussion networks of americans. American Sociological Review, 52(1):122­131, 1987. [9] M. McPherson, L. Smith-Lovin, and J. M. Cook. Birds of a feather: Homophily in social networks. Annual Review of Sociology, 27(1):415­444, 2001. [10] B. A. Nardi, S. Whittaker, and E. Bradner. Interaction and outeraction: instant messaging in action. In CSCW '00: Proceedings of the 2000 ACM conference on Computer supported cooperative work, pages 79­88, 2000. [11] E. Ravasz and A.-L. Barabasi. Hierarchical organization in complex networks. Physical Review E, 67(2):026112, 2003. [12] E. M. Rogers and D. K. Bhowmik. Homophily-heterophily: Relational concepts for communication research. Public Opinion Quarterly, 34:523­538, 1970. [13] X. Shi, L. A. Adamic, and M. J. Strauss. Networks of strong ties. Physica A Statistical Mechanics and its Applications, 378:33­47, May 2007. [14] P. Singla and M. Richardson. Yes, there is a correlation - from social networks to p ersonal b ehavior on the web. In WWW '08, 2008. [15] M. P. Stumpf, C. Wiuf, R. M. May. Subnets of scale-free networks are not scale-free: sampling prop erties of networks. PNAS, 102(12), 2005. [16] S. L. Tauro, C. Palmer, G. Siganos, and M. Faloutsos. A simple conceptual model for the internet top ology. In GLOBECOM '01, vol. 3, pages 1667 ­ 1671, 2001. [17] J. Travers and S. Milgram. An exp erimental study of the small world problem. Sociometry, 32(4), 1969. [18] A. Voida, W. C. Newstetter, and E. D. Mynatt. When conventions collide: the tensions of instant messaging attributed. In CHI '02, pages 187­194, 2002. [19] D. J. Watts and S. H. Strogatz. Collective dynamics of 'small-world' networks. Nature, 393:440­442, 1998. [20] Z. Xiao, L. Guo, and J. Tracey. Understanding instant messaging traffic characteristics. In ICDCS '07, 2007. 924