WWW 2008 / Refereed Track: Social Networks & Web 2.0 - Discovery and Evolution of Communities

Statistical Properties of Community Structure in Large Social and Information Networks
Jure Leskovec
 

Kevin J. Lang


Anirban Dasgupta


Michael W. Mahoney


jure@cs.cmu.edu ABSTRACT

Carnegie Mellon University

{langk, anirban, mahoney}@yahoo-inc.com 1.1 Overview of our approach
At the risk of oversimplifying the large b ody of work on community detection in complex networks, the following five-part story describ es the general methodology: (1) Data are modeled by an "interaction graph." In particular, part of the world gets mapp ed to a graph in which nodes represent entities and edges represent some kind of interaction b etween pairs of those entities. For example, nodes may represent individual p eople and edges may represent friendships, interactions or communication b etween pairs of those p eople. (2) The hyp othesis is made that the world contains groups of entities that interact more strongly amongst themselves than with the outside world, and hence the interaction graph should contain sets of nodes, i.e., communities, that have more and/or b etter-connected "internal edges" connecting memb ers of the set than "cut edges" connecting the set to the rest of the world. (3) A ob jective function or metric is chosen to formalize this idea of groups with more intra-group than intergroup connectivity. (4) An algorithm is then selected to find sets of nodes that exactly or approximately optimize this or some other related metric. Sets of nodes that the algorithm finds are then called "clusters," "communities," "groups," "classes," or "modules". (5) The clusters (communities) are then evaluated in some way. For example, one may map the sets of nodes back to the real world to see whether they app ear to make intuitive sense as a plausible social community. Alternatively, one may attempt to acquire some form of "ground truth," in which case the set of nodes output by the algorithm may b e compared with it. With resp ect to p oints (1)­(4), we will follows the usual path in this pap er. For p oint (3), we choose a natural and widely-adopted notion of community goodness called conductance, also known as the normalized cut metric [6, 31, 16]. Since there exist a rich suite of b oth theoretical and practical algorithms to optimize this quantity [32, 20, 4, 17, 37, 10], we can for p oint (4) compare and contrast several methods to approximately optimize it. However, it is in p oint (5) that we deviate from previous work. Instead of focusing on individual groups of nodes and trying to interpret them as "real" communities, we investigate statistical prop erties of a large numb er of communities over a wide range of size scales in real-world social and information networks. We take a step back and ask questions

Yahoo! Research

A large b ody of work has b een devoted to identifying community structure in networks. A community is often though of as a set of nodes that has more connections b etween its memb ers than to the remainder of the network. In this pap er, we characterize as a function of size the statistical and structural prop erties of such sets of nodes. We define the network community profile plot, which characterizes the "b est" p ossible community--according to the conductance measure--over a wide range of size scales, and we study over 70 large sparse real-world networks taken from a wide range of application domains. Our results suggest a significantly more refined picture of community structure in large real-world networks than has b een appreciated previously. Our most striking finding is that in nearly every network dataset we examined, we observe tight but almost trivial communities at very small scales, and at larger size scales, the b est p ossible communities gradually "blend in" with the rest of the network and thus b ecome less "community-like." This b ehavior is not explained, even at a qualitative level, by any of the commonly-used network generation models. Moreover, this b ehavior is exactly the opp osite of what one would exp ect based on exp erience with and intuition from expander graphs, from graphs that are well-emb eddable in a low-dimensional structure, and from small social networks that have served as testb eds of community detection algorithms. We have found, however, that a generative model, in which new edges are added via an iterative "forest fire" burning process, is able to produce graphs exhibiting a network community structure similar to our observations. Categories and Sub ject Descriptors: H.2.8 Database Management: Database applications ­ Data mining General Terms: Measurement; Exp erimentation. Keywords: Social networks; Graph partitioning; Community structure; Conductance; Random walks.

1.

INTRODUCTION

In this pap er, we explore from a novel p ersp ective several questions related to identifying meaningful communities in social and information networks, and we come to several surprising conclusions that have theoretical and practical implications for community detection.
Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2008, April 21­25, 2008, Beijing, China. ACM 978-1-60558-085-2/08/04.

695


WWW 2008 / Refereed Track: Social Networks & Web 2.0 - Discovery and Evolution of Communities

Figure 1: (a) Typical NCP plot. (b) Network structure as suggested by our experiments.

1.2 Summary of our results
Main Empirical Findings: Our results suggest a rather detailed and somewhat counterintuitive picture of the community structure in large networks. Several qualitative properties of community structure are nearly universal: · Up to a size scale, which empirically is roughly 100 nodes, there not only exist well-separated communities, but also the slop e of the network community profile plot is generally sloping downward. (See Fig. 1(a).) This latter p oint suggests, and empirically we often observe, that smaller communities can b e combined into meaningful larger communities. · At size scale of 100 nodes, we often observe the global minimum of the network community profile plot. (Although these are the "b est" communities in the entire graph, they are usually connected to the remainder of the network by just a single edge.) · Ab ove the size scale of roughly 100 nodes, the network community profile plot gradually increases, and thus there is a nearly inverse relationship b etween community size and community quality. (See Fig. 1(a).) This upward slop e suggests, and empirically we often observe, that as a function of increasing size, the b est p ossible communities as they grow b ecome more and more "blended into" the remainder of the network. This last p oint is particularly significant, and it is our main empirical finding: at larger and larger size scales the b est p ossible communities gradually "blend in" more and more with the rest of the network and thus gradually b ecome less and less community-like (less well-expressed/separated). Eventually, even the existence of large well-defined communities is quite questionable if one models the world with an interaction graph, as in p oint (1) ab ove, and if one also defines good communities as densely linked clusters that are weakly-connected to the outside, as in hyp othesis (2) ab ove. This is imp ortant if one asserts that cut and density based intuitions will find "true" communities.

We have also examined in detail the structure of our social and information networks. We have observed that an "jellyfish" or "octopus" model [33, 7] provides a rough first approximation to structure of many of the networks we have examined. That is, most networks may b e viewed as having a "core," with no obvious underlying geometry and which contains a constant fraction of the nodes, and then there are a large numb er of relatively small "whiskers" that are only tenuously connected to the core. (See Fig. 1(b).) Main Modeling Results: The observed prop erties of the network community profile plot are not reproduced, at even a qualitative level, by any of the commonly-used network generation models we have examined, including but not limited to preferential attachment, copying, and hierarchical network models. Moreover, this b ehavior is qualitatively different than what is observed in networks with an underlying mesh-like or manifold-like geometry (which is significant as these structures are often used as a scaffolding up on which to build other models), in networks that are good expanders (which may b e surprising, since it is often observed that large social networks are expander-like), and in small social networks often used as testb eds for community detection algorithms (which may have implications for the applicability of these methods to detect large communitylike structures in networks). For the commonly-used network generation models, as well as for expander-like, lowdimensional, and small social networks, the network community profile plots are generally downward sloping or relatively flat. We, however, make the following modeling observations: · Very sparse random graph models with no underlying geometry have relatively deep cuts at small size scales, the b est cuts at large size scales are very shallow, and there is a relatively abrupt transition in b etween. This is a consequence of the extreme sparsity of the data. · A "forest fire" generative model [21], in which edges are added in a manner that imitates a fire-spreading process, reproduces not only the deep cuts at small size scales and the absence of deep cuts at large size scales but other prop erties as well: the small barely connected pieces are significantly larger and denser than random; and for appropriate parameter settings the network community profile plot increases relatively gradually as the size of the communities increases. Intuitively, the structure of the whiskers (See Fig. 1(b).), which are not unlike small social networks that have b een extensively studied, are resp onsible for the downward part of the network community profile plot, while the core of the network and the manner in which the whiskers root themselves to the core helps to determine the upward part of the network community profile plot.

696

sr eksihW

eroC

such as: How well do real graphs split into communities? What is a good way to measure and characterize presence or absence of communities in networks? What are typical community sizes and typical community qualities? To address these and related questions, we introduce the concept of a network community profile (NCP) plot. Intuitively, the network community profile plot measures the quality of "b est" community as a function of community size in a network. To measure the quality of a community we use conductance [6]. By this measure, the b est communities are densely linked sets of nodes attached to the rest of the network via few edges. Fig. 1(a) gives a typical NCP plot. We compare our results across over 70 large social and information networks, numerous commonly-studied small social networks, and also expanders and low-dimensional meshlike ob jects. We also compare our results on each network with what is known from the field from which the network is drawn. To our knowledge, this makes ours the most extensive such analysis of the community structure in large real-world social and information networks. By comparing and contrasting these plots for a large numb er of networks, and by computing other related structural prop erties, we obtain results that suggest a significantly more refined picture of the community structure in large real-world networks than has b een appreciated previously.


WWW 2008 / Refereed Track: Social Networks & Web 2.0 - Discovery and Evolution of Communities
· Social nets Nodes Edges LiveJournal 4,843,953 42,845,684 Epinions 75,877 405,739 CA-DBLP 317,080 1,049,866 · Information (citation) networks Cit-hep-th 27,400 352,021 AmazonProd 524,371 1,491,793 · Web graphs Web-google 855,802 4,291,352 Web-wt10g 1,458,316 6,225,033 · Bipartite affiliation (authors-to-papers) Atp-DBLP 615,678 944,456 AtM-Imdb 2,076,978 5,847,693 · Internet networks AsSkitter 1,719,037 12,814,089 Gnutella 62,561 147,878 Description Blog friendships [5] Trust network [28] Co-authorship [5] Arxiv hep-th [14] Amazon products [8] Google web graph TREC WT10G networks DBLP [21] Actors-to-movies Autonom. sys. P2P network [29]

the sum of degrees of nodes in S , and let s b e the numb er of edges with one endp oint in S and one endp oint in S , where S denotes the complement of S . Then, the conductance of S is  = s/v , or equivalently  = s/(s + 2e), where e is the numb er of edges with b oth endp oints is S . More formally, if A is the adjacency matrix of the graph G, then: i S,j S Aij /  (S ) = (6) min{A(S ), A(S )} i j where A(S ) = S V Aij , in which case the conductance of the graph G is G = min (S ).
S V

(7)

Table 1: Some of the network datasets we studied.

2.

BACKGROUND AND OVERVIEW

In this section, we will provide background on our data and methods. There exist a large numb er of reviews on topics related to those discussed in this pap er. For example, see the reviews on community identification [24, 9], graph and sp ectral clustering [13, 30], and the monographs on sp ectral graph theory and complex networks [6, 7].

2.1 Network datasets
We have examined a large numb er of real-world complex networks. Table 1 gives a subset of the networks that we use in this pap er. (We refer to the extended version of the pap er [23] for a complete list of networks.) In all cases, we consider networks as undirected, and we extract the largest connected comp onent. We have group ed the networks into 5 categories: social networks which consist of on-line social networks and co-authorship networks of computer science (DBLP) and various areas of physics; information networks which contain citation networks of physics and blogosphere; web-graphs which contain networks with nodes representing web-pages and hyp erlinks b eing the edges; bipartite social affiliation networks which contain mainly authors-to-pap ers networks of computer science and physics; and finally, internet networks which consist of autonomous systems network and Gnutella P2P file sharing network. Table 1 also shows the numb er of nodes and edges in each network. The sizes of the networks we have studied range from ab out 5, 000 nodes up to nearly 14 million nodes, and from ab out 6, 000 edges up to more than 100 million edges [23]. In addition, all of the networks are quite sparse-- their densities range from an average degree of ab out 2.5 for the blog p ost network, up to an average degree of ab out 400 in a network of movie ratings from Netflix [23]--and most of the other networks, including the purely social networks, have average degree around 10 (median degree of 6). In total, we have examined over 100 different networks, including over 70 large real-world social and information networks, making this, to our knowledge, the largest and most comprehensive study of such networks. (We will make data available via a link from the first author's web page.)

Thus, the conductance of a set provides a measure for the quality of the cut (S, S ), or relatedly the goodness of a community S . Indeed, it is often noted that communities should b e thought of as sets of nodes with more and/or b etter intraconnections than inter-connections. When interested in detecting communities and evaluating their quality, we prefer sets with small conductances, i.e., sets that are densely linked inside and sparsely linked to the outside. Although numerous measures have b een prop osed for how communitylike is a set of nodes, it is commonly noted--e.g., see [31] and [16]--that conductance captures the "gestalt" notion of clustering [36], and so it has b een widely-used for graph clustering and community detection [13, 30].

3. NETWORK COMMUNITY PROFILE PLOT
In this section, we discuss the network community profile plot (NCP plot), which measures the quality of network communities at different size scales.

3.1 The network community profile plot
In order to resolve more finely community structure in large networks, we introduce the network community profile plot (NCP plot). Intuitively, the NCP plot measures the quality of the b est p ossible community in a large network, as a function of the size of the purp orted community. Formally, we may define it as the conductance value of the b est conductance set of cardinality k in the entire network, as a function of k. That is, (k) =
S V , |S |= k

min

 (S ).

(8)

2.2 Clusters and communities in networks
If G = (V , E ) denotes a graph, then the conductance  of a set of nodes S  V , (where S is assumed to contain no more than half of all the nodes), is defined as follows. Let v b e

where |S | denotes the cardinality of the set S and where the conductance (S ) of S is given by (6). Since this quantity is intractable to compute, we employ well-studied approximation algorithms for the Minimum Conductance Cut Problem to compute different approximations to the NCP plot. We employ two procedures: first, Metis+MQI, i.e., the graph partitioning package Metis [17] followed by the flow-based MQI p ost-processing procedure MQI [19], which taken together returns sets that have very good conductance values; and second, the Local Sp ectral Algorithm [3], which returns sets that are somewhat "regularized" (more internally "coherent") but that often have worse conductance values. Just as the conductance of a set of nodes provides a quality measure of that set as a community, the shap e of the NCP plot provides insight into the community structure of a graph. For example, the magnitude of the conductance tells us how well clusters of different sizes are separated from the rest of the network. One might hop e to obtain some sort of

697


WWW 2008 / Refereed Track: Social Networks & Web 2.0 - Discovery and Evolution of Communities
10  (conductance)
0

100 Clique, -1/d0  (conductance) Cube, -1/d-.33  (conductance) 10
-1

1

10

-1

10-2
-3

Grid, -1/d-.50 Chain, -1/d-1.0
0

10

-2

10

10

10 10 10 10 10 10 k (number of nodes in the cluster)

1

2

3

4

5

6

10

-3

cut A 0.1 1

cut A+B

10

0

10 10 10 10 k (number of nodes in the cluster)

1

2

3

4

(a) Low-dimensional meshes
100  (conductance)  (conductance) 10 10 10 10
-1 -2 -3 -4

(b) PowerGrid
100

(a) Zachary's karate club

(b) . . . and it's NCP plot
1  (conductance)

10 k (number of nodes in the cluster)

10

-1

0.1

10-5 0 10

10 10 10 10 10 10 k (number of nodes in the cluster)

1

2

3

4

5

6

10-2 0 10

Average degree 4 Average degree 6 Average degree 8 10 10 10 10 k (number of nodes in the cluster)
1 2 3 4

0.01

A

B

C

D C+E

0.001

1

(c) RoadNet-CA

(d) Expander: dense Gnm

(c) Network science

(d) . . . and it's NCP plot

10 100 k (number of nodes in the cluster)

Figure 2: NCP plots for networks that "live" in lowdimensional spaces and for an expander-like graph.

Figure 3: Depiction of several small social networks that are common test sets for community detection algorithms and their network NCP plots. communities are sparsely emb edded in larger communities. Empirically we observe that local minima in the NCP plot corresp ond to sets of nodes that are plausible communities. Consider, e.g., Zachary's karate club [35], an extensivelyanalyzed social network [24, 26]. Figure 3(a) depicts the karate club network, and Figure 3(b) shows its NCP plot. Note that Cut B , which separates the graph roughly in half, has b etter conductance value than Cut A (note also community A is included in B ). This corresp onds with the intuition ab out the NCP plot derived from studying low-dimensional graphs. The karate network corresp onds well with the intuitive notion of a community, where nodes of the community are densely linked among themselves and there are few edges b etween nodes of different communities. In a similar manner, Figure 3(c) depicts Newman's network of 379 scientists who conduct research on networks [25]. In this latter case, we see a hierarchical structure, in which the community defined by Cut C is included in a larger community that has b etter conductance value.

"smoothed" measure of the notion of the b est community of size k, e.g., by considering a 95-th p ercentile, rather than a minimum. We have not defined such a measure since there is no obvious way to average meaningfully over all subsets of size k. Although Metis+MQI finds sets of nodes with extremely good conductance value, empirically we observe that they often have little or no internal structure--they can even b e disconnected; on the other hand, since sp ectral methods in general tend to confuse long paths with deep cuts [32], the Local Sp ectral Algorithm finds sets that are "tighter" and more "well-rounded" and thus in many ways more community-like.

3.2 Community profile plots for expander, lowdimensional, and small social networks
The NCP plot b ehaves in a characteristic manner for graphs that are "well-emb eddable" into a low-dimensional geometric structure. To illustrate this, consider Figure 2. The NCP plot is steadily downward sloping as a function of the numb er of nodes in the smaller cluster. Moreover, the curves are straight lines with a slop e equal to -1/d, where d is the dimensionality of the underlying grids. In particular, as the underlying dimension increases then the slop e of the NCP plot gets less steep. Of course, this is a manifestation of the isop erimetric (i.e., surface area to volume) phenomenon. A steadily downward sloping NCP plot is quite robust for networks that "live" in a low-dimensional structure, e.g., on a manifold or the surface of the earth. For example, Figure 2(b) shows the NCP plot for a p ower grid network of Western States Power Grid [34], and Figure 2(c) shows the NCP plot for a road network of California. Finally, in contrast, Figures 2(d) shows NCP plots for a Gnm graph with 100, 000 nodes and average degrees of 4, 6, and 8, i.e., graphs that are very good expanders. The NCP plot is roughly flat, which we also observed in Figure 2(a) for a clique, which is to b e exp ected since the minimum conductance cut in the entire graph cannot b e too small for a good expander [15]. Interestingly, a steadily decreasing downward NCP plot is also seen for small social networks that have b een extensively studied for validating community detection algorithms. Two examples are shown in Figures 3. For these networks, the interpretation is the hierarchical organization, where smaller

3.3 Community profile plots of large social and information networks
We have examined NCP plots for over 70 real-world social and information networks, and in Figure 4 we present NCP plots for six of these. The most striking feature is that the NCP plot is steadily increasing for nearly its entire range. Consider, the NCP plot for the LiveJournal social network in Figure 4(a), and focus first on the red curve, which presents the results of Local Sp ectral Algorithm. Up to a size scale, which empirically is roughly 100 nodes, the slop e of the NCP plot is generally sloping downward. At that size scale, we observe the global minimum of the NCP plot (denoted by a purple square). This set of nodes as well as others achieving local minima of the NCP plot in the same size range are the "b est" communities, according to the conductance measure, in the entire graph. Moreover, they are barely connected to the rest of the graph, e.g., they are typically connected to the rest of the nodes by 1 (or 2, or p erhaps 3--we will return to this issue in Section 4) edges. Ab ove the size scale of roughly 100 nodes, the NCP plot gradually increases over several orders of magnitude.

698


WWW 2008 / Refereed Track: Social Networks & Web 2.0 - Discovery and Evolution of Communities
10  (conductance) 10 10 10
0

10  (conductance)

0

-1

10-1

-2

-3

10-2

10-4 0 1 2 3 4 5 6 7 10 10 10 10 10 10 10 10 k (number of nodes in the cluster)

10-3 0 10

(a) LiveJournal

(b) Epinions
10  (conductance) 10
0

10 10 10 10 10 k (number of nodes in the cluster)

1

2

3

4

5

10  (conductance)

0

-1

10-1

10-2 10-3 10-4 0 10

10-2 0 10

(c) Cit-hep-th
100  (conductance)

10 10 10 10 10 k (number of nodes in the cluster)

1

2

3

4

5

(d) Web-google
100  (conductance)

10 10 10 10 10 10 k (number of nodes in the cluster)

1

2

3

4

5

6

10-1

10

-1

10

-2

10

-3

10

0

(e) Atp-DBLP

10 10 10 10 10 10 k (number of nodes in the cluster)

1

2

3

4

5

6

10

-2

10

0

(f ) Gnutella

10 10 10 10 10 k (number of nodes in the cluster)

1

2

3

4

5

We have observed qualitatively similar results in other large social and information networks we have examined. Several additional examples are presented in Figure 4: another social network, (Epinions, in Fig. 4(b)); an information/citation network (Cit-hep-th, in Fig. 4(c)); a Web graph (Web-google, in Fig. 4(d)); a Bipartite affiliation network (Atp-DBLP, in Fig. 4(e)); and an Internet network (Gnutella, in Fig. 4(f )). Qualitative observations are consistent across the range of network sizes, densities and different domains from which the networks are drawn. Of course, these six networks are very different than each other--some of these differences are hidden due to the definition of the NCP plot, whereas others are evident. An example of the latter is that even the b est cuts in Gnutella are not significantly smaller or deep er than in the corresp onding rewired network, whereas for Web-google we observe cuts that are orders of magnitude deep er. These findings mean that b est-expressed network communities are rather small, their size b eing practically indep endent of network size (ca. 100 nodes). Moreover, as the community size grows the community blends into the rest of the network, which makes them very difficult to detect using cut-based ideas. (We come back to this in Section 7.)

4. MORE STRUCTURAL OBSERVATIONS
Next we describ e the results of examining the networks in greater detail to understand which structural prop erties are resp onsible for the observed prop erties of the NCP plot.

Figure 4: [Best viewed in color.] NCP plots for a representative sample of large networks. Red curves plot the Local Spectral Algorithm; green curves plot Metis+MQI; blue curves plot the Bag of Whiskers Heuristic; and black curves plot the Local Spectral Algorithm applied to a randomly rewired network. The "b est" communities in the entire graph are quite good (in that they have size roughly 102 nodes and conductance scores less than 10-3 ) whereas the "b est" communities of size 105 or 106 have conductance scores of ab out 10-1 . In b etween these two size extremes, the conductance scores get gradually worse, although there are numerous local dips. (The green curve plots the Metis+MQI, and the blue curve the results of Bag of Whiskers Heuristic, as describ ed in Section 4.3.) Note that b oth axes in Figure 4 are logarithmic, and thus the upward trend of the NCP plot is over a wide range of size scales. The black curve in Figure 4(a) plots the Local Sp ectral Algorithm applied to a rewired version of the LiveJournal network, i.e., to a random graph conditioned on the same degree distribution as the original network. Interestingly, the rewired network also has an initially decreasing and then increasing/flattening NCP plot. Several things should b e noted. (1) The original LiveJournal network has considerably more structure, i.e., deep er/b etter cuts, than its rewired version, even up to the largest size scales. (2) Relative to the original network, the "b est" community in the rewired graph, i.e., the global minimum of the conductance curve, shifts upward and towards the left. This means that in rewired networks the b est conductance clusters get smaller and have worse conductance scores. (3) The sets at and near the minimum are small trees that are connected to the core of the random graph by a single edge. (4) After the small dip at a very small size scale ( 10 nodes), the NCP plot increases to its high level rather quickly. This is due to the absence of structure in the (expander-like) core.

4.1 General statistics on our network datasets
In nearly every network we have examined, there is a substantial fraction of nodes that are barely connected to the main part of the network, i.e., that are part of a small cluster of around 100 nodes that are attached to the remainder of the network via a small numb er of edges. In particular, a large fraction of the network is made out of nodes that are not in the (2-edge-connected) core, i.e., they are in comp onents attached to the core of the network via a single edge. For example, the core of Epinions network contains only 47% of the nodes and 80% of the edges. Averaging over all our networks, we see that the network core contains around only 60% of the nodes and 80% of the edges of the original network. This is somewhat akin to the so-called "Jellyfish" model [33] and "Octopus" models [7], which we describ e in more detail in Section 6.2. Moreover, the global minimum of the NCP plot is nearly always one of these pieces that is connected to the rest of the network by only a single edge. Since these small barely-connected pieces seem to have a disprop ortionately large influence on the community structure of our network datasets, we examine them in greater detail.

4.2 "Whiskers" and the "core" in our networks
We define whiskers, or more precisely 1-whiskers, to b e maximal subgraphs that can b e detached from the rest of the network by removing a single edge. To find 1-whiskers, we employ the following algorithm. Using a depth-first search algorithm, we find the largest 2-edge-connected comp onent B of the graph G. (A graph is 2-edge-connected if the removal of any single edge does not disconnect the graph.) We then delete all the edges that have one of the end p oints in 1 B . We call the connected comp onents of this new graph G -whiskers, since they corresp ond to largest subgraphs that can b e disconnected from G by removing just a single edge.

699


WWW 2008 / Refereed Track: Social Networks & Web 2.0 - Discovery and Evolution of Communities
100  (conductance)  (conductance) Original network Whiskers removed
0

100

10

-1

10

-1

10-2 10 10
-3

10

-2

Figure 5: Five largest whiskers of Epinions network.

-4

10

10 10 10 10 10 10 k (number of nodes in the cluster)

1

2

3

4

5

6

10

-3

Original network Whiskers removed
0

10

10 10 10 10 k (number of nodes in the cluster)

1

2

3

4

(a) LiveJournal Not surprisingly, there is a wide range of whisker sizes and shap es. Empirically, 1-whisker distribution is heavytailed, with the largest whisker size ranging from around less than 10 to well ab ove 100. (See extended version [23] for plots.) The largest whiskers in co-authorship and citation networks have around 10 nodes, whiskers in bipartite graphs also tend to b e small, and very large whiskers are found in a web graph. In rewired networks the whiskers tend to b e much smaller than in the original network. A particularly noteworthy exception is found in the Autonomous systems networks and the Gnutella network. Here, whiskers are so small that even the rewired version of the network has more and larger whiskers. This makes sense, given how those networks were designed: many large whiskers would have bad effects on the Internet connectivity in case of link failures. Figure 5 shows the five largest whiskers of the Epinions social network. The whiskers have on the order of 50 nodes, and they are seen to have a rich internal structure. Similar but substantially more complex figures could b e generated for networks with larger whiskers. In general, the results we observe are consistent with a knowledge of the fields from which the particular datasets have b een drawn. For example, in Web-google we see very large whiskers. This probably represents a well-connected network b etween the main categories of a website (e.g., different pro jects), while the individual pro ject websites have a main index page that then p oints to the rest of the documents.

(b) Epinions

Figure 6: [Best viewed in color.] NCP plots with (in red) and without (in green) 1-whiskers, for two of the six networks shown Figure 4. ing good cuts then b est cuts in these large sparse graphs are obtained by comp osing unrelated disconnected pieces, which suggests that community goodness scores need to b e reevaluated by also considering the community "coherence".

4.4 Networks with no whiskers
One might wonder whether we see something different if we consider a network in which these barely-connected pieces have b een removed. Thus, we found all whiskers and removed them from the network, using procedure describ ed in Sec. 4.2, i.e., we kept the largest 2-edge-connected comp onent. Again, we computed the NCP plots in Figure 6. Notice that whisker removal does not change the NCP plot much: the plot shifts slightly upward, but the general trends remain the same. Up on examination, the global minimum occurs with a "whisker" that is connected by two edges to the rest of the network. Intuitively, the network core has a large numb er of barely connected pieces--connected now by two edges rather than by a single edge. Since the "volume" for these pieces is similar to that for the original whiskers, whereas the "surface area" is a factor of two larger, the conductance value is roughly a factor of two worse. Thus, although we have b een discussing 1-whiskers in this section, one should really view them as the simplest example of weakly-connected pieces that exert a significant effect on the community structure in large real-world networks.

4.3 Bags and communities of whiskers
Empirically, if one looks at the sets of nodes achieving the minimum in the NCP plot (usually the green Metis+MQI curve), then b efore the global NCP minimum communities are whiskers and ab ove that size scale they are often unions of disjoint whiskers. To understand the extent to which these whiskers and unions of them are resp onsible for the "b est" conductance sets of different sizes, we have develop ed the Bag-of-Whiskers Heuristic. Supp ose we have a set W = {w1 , w2 , . . .} of whiskers. In order to construct the optimal conductance cluster of size k, we need to solve the following i problem: find a set C of whiskers such that  C N ( wi ) = k i d(wi ) and C |C | is maximized, where N (wi ) is the numb er of nodes in wi and d(wi ) is its total internal degree. We then use a dynamic programming heuristic to get an approximate solution to this problem. This way, we find a cluster of a particular size that is comp osed solely from whiskers. Figure 4 (blue curve) shows the results of Bag-of-Whiskers. First, notice that the largest whisker (denoted with purple square) is the lowest p oint in all plots. This means that the b est conductance community is in a sense trivial as it is connected via just a single edge, and in addition a very simple heuristic can find it. Second, note that ab ove that size scale the Bag-of-Whiskers finds sets of extremely good conductance. Third, this heuristic often agrees with the results from Metis+MQI. This means that the b est communities are indeed disconnected. Thus, if one only cares ab out find-

5. RESULTS FROM OTHER ALGORITHMS
We we have employed a range of other algorithmic techniques to b e confident that we are computing quantities fundamental to the networks we are considering, rather than artifacts of the heuristics and approximation algorithms we employ. Due to space limitations, much of this technical material and its associated discussion is omitted from this conference pap er, but full details may b e found in the journal version of this pap er [23].

6. MODELS FOR NETWORK COMMUNITY STRUCTURE
In this section, we address modeling issues in order to understand the prop erties of generative models sufficient to reproduce the phenomena we have observed.

6.1 Commonly-used network models
We have studied a wide range of commonly-used network generative models in an effort to reproduce the upwardsloping NCP plots and to understand the structural prop erties of the real-world networks that are resp onsible for this phenomenon. In each case, we have exp erimented with a

700


WWW 2008 / Refereed Track: Social Networks & Web 2.0 - Discovery and Evolution of Communities
10  (conductance)
0

10  (conductance)

0

10

-1

10

-1

10-2 0 10

Original network Rewired network 10 10 10 k (number of nodes in the cluster)
1 2 3

10-2 0 10

Original network Rewired network 10 10 10 10 k (number of nodes in the cluster)
1 2 3 4

(a) Pref. attachment
10  (conductance)
0

(b) Copying model
10  (conductance)
0

tween nodes i and j is added, indep endently, with probabilk wk . We use G(w) to denote a random ity pij = wi wj / graph generated in this manner. The sp ecial case of the G(w) model in which w has a p ower law distribution is of interest to us here. Given the numb er of nodes n, the p ower-law exp onent  , and the parameters w and wmax , Chung and Lu [7] give the degree sequence for a p ower-law graph: wi = ci-1/( -1) for i s.t. i0  i < n + i0 , (9)

10-1

10-1

10-2 0 10

Original network Rewired network 10 10 10 10 10 k (number of nodes in the cluster)
1 2 3 4 5

10-2 0 10

Original network Rewired network 10 10 10 10 k (number of nodes in the cluster)
1 2 3 4

(c) Barabasi Hierarchical

(d) Geometric PA

where, for the sake of consistency with their notation, we index the nodes from i0 to n + i0 - 1, and where c = c( , w, n) and i0 = i0 ( , w, n, wmax ) are as follows:  -1  w c = wn1/( -1) and i0 = n , (10) wmax where we have defined  =  -2 . In this case, one can verify  -1 that the numb er of vertices that have exp ected degree in the range (k - 1, k] is prop ortional to k- . The following theorem will characterize the shap e of the NCP plot for this G(w) model when the degree distribution follows Equation (9), with   (2, 3). The theorem makes two complementary claims: (1) the model has clusters of log size with logarithmically deep cuts; (2) once we get b eyond this size scale there do not exist any such deep cuts. Theorem 1. Consider the random power-law graph model G(w), where w is given by Equation (9), where w > 5.88, and the power-law exponent  satisfies 2 <  < 3. Then, then with probability 1 - o(1): 1. There exists. a cut of size (log n) whose conductance 1 is  log n

Figure 7: [Best viewed in color.] NCP for networks from commonly network generation models. Red curves are Local Spectral Algorithm on the original network, and black curves are Local Spectral Algorithm applied to a randomly rewired network. range of parameters, and in no case have we b een able to reproduce our empirical observations, at even a qualitative level. In Figure 7, we summarize these results. Figure 7(a) shows the NCP plot for a 10, 000 node network generated according to the original preferential attachment model [1], where at each time step a node joins the graph and connects to m = 2 existing nodes. Note that the NCP plot is very shallow and flat (even more than the corresp onding rewired graph), and thus the network that is generated is very expander-like at all size scales. In a different typ e of generative model edges are added via a copying mechanism [18]. Figure 7(b) shows the results for a network with 50, 000 nodes, generated with m = 2 and  = 0.05. Although the copying model aims to produce communities by linking a new node to neighb ors of a existing node, this does not seem to b e the right mechanism to reproduce the NCP plot since p otential attachment nodes are all treated equally and since new nodes always create same numb er of edges. Next, in Figure 7(c), we consider a network that was designed to have a recursively hierarchical community structure [27]. In this case, however, the NCP plot is sloping downwards, and the local dips in the plot corresp ond to multiples of the size of the basic module of the graph. Finally, Figure 7(d) shows the NCP plot for a geometric preferential attachment model [12]. This model aims to achieve a heavytailed degree distribution as well as deep cuts, and it does so by making the connection probabilities dep end b oth on the two-dimensional geometry and on the preferential attachment scheme. As we see, the effect of the underlying geometry eventually dominates the NCP plot since the b est bi-partitions are fairly well-balanced [12].

2. There exists c , > 0 such that there are no sets of size larger than c log n having conductance smal ler than . Proof. See the journal version of this pap er [23]. Recall that when w  4 and   (2, 3) then a typical e graph in this model is not fully connected but does have a giant comp onent [7]. (The well-studied Gn,p random graph model also has a similar regime when p  (1/n, log n/n).) In addition, under certain conditions, the average distance b etween nodes is in O (log log n) and yet the diameter of the graph is  (log n). Thus, in this case, the graph has an "octopus" structure, with a subgraph containing nc/(log log n) nodes constituting a deep core of the graph [7], and numerous "whiskers" attached.

6.3 A more realistic model of network community structure
We have seen that commonly-studied models, including preferential attachment models, copying models, simple hierarchical models, and models in which there is an underlying mesh-like or manifold-like geometry are not the right way to think ab out the network community structure. We have also seen that the extreme sparsity of the networks might b e resp onsible for the deep cuts at small sizes. The question arises as to whether we can find a simple generative model that can explain b oth the existence of small well-separated whisker-like clusters and also an expanderlike core whose b est clusters get gradually worse as the purp orted communities increase in size. Intuitively, a satisfactory network generation model must successfully take into account the following two mechanisms:

6.2 A very sparse random graph model
We have studied a random graph model with given exp ected degrees, as describ ed by Chung and Lu [7]. Let n, the numb er of nodes in the graph, and a vector w = (w1 , . . . , wn ), which will b e the exp ected degree k equence s 2 vector (where we will assume that maxi wi < wk ), b e given. Then, in this random graph model, an edge b e-

701


WWW 2008 / Refereed Track: Social Networks & Web 2.0 - Discovery and Evolution of Communities
100  (conductance)  (conductance) 100

10

-1

10

-1

10

-2

10

-2

10

-3

10
0

0

10 10 10 10 k (number of nodes in the cluster)

1

2

3

4

10

-3

10
0

0

10 10 10 10 k (number of nodes in the cluster)

1

2

3

4

10  (conductance)

10  (conductance)

10

-1

10-1

10-2

10-3 0 10

10 10 10 10 k (number of nodes in the cluster)

1

2

3

4

10-2 0 10

10 10 10 10 k (number of nodes in the cluster)

1

2

3

4

Figure 8: [Best viewed in color.] NCP plots for the Forest Fire Model at various parameter settings. The backward burning probability is pb = 0.3, and we increase (left to right, top to bottom) the forward burning probability pf = {0.26, 0.33, 0.35, 0.40}. Note that the largest and smallest values for pf lead to less realistic community profile plots. (a) The model should produce a relatively large numb er of relatively small--but still large when compared to random graphs--well connected and distinct whisker-like communities. (This should reproduce the downward part of the community profile plot and the minimum at small size scales.) (b) The model should produce a large expander-like core, which may b e thought of as consisting of intermingled communities, p erhaps growing out from the whiskerlike communities, the b oundaries of which get less and less well-defined as the communities get larger and larger and as they gradually blend in with rest of the network. (This should reproduce the gradual upward sloping part of the community profile plot.) The so-called Forest Fire Model [21, 22] captures exactly these two comp eting phenomena. The Forest Fire Model is a model of graph generation (that generates directed graphs-- an effect we will ignore) in which new edges are added via a recursive "burning" mechanism in an epidemic-like fashion. Two prop erties of this model are particularly significant. First, although many nodes might form one or a small numb er of links, certain nodes can produce large conflagrations, burning many edges and thus forming a large numb er of out-links b efore the process ends. Such nodes will help generate a skewed out-degree distribution, and they will also serve as "bridges" that connect formerly disparate parts of the network. Second, there is a locality structure in that as each new node v arrives over time, it is assigned a "center of gravity" in some part of the network, i.e., at the ambassador node w, and the manner in which new links are added dep ends sensitively on the local graph structure around node w. See [21, 22] for details. The Forest Fire Model is parameterized by a forward burning probability pf and a backward burning probability pb , and, not surprisingly, the b ehavior of the model is sensitive to the choice of pf and pb . We have exp erimented with a wide range of network sizes and values for these parameters, and in Figure 8, we show the community profile plots of several 10, 000 node Forest Fire networks generated with pb = 0.3 and several different values of pf . The first thing

to note is that since we are varying pf the four plots in Figure 8, we are viewing networks with very different densities. Next, notice that if, e.g., pf = 0.33 or pf = 0.35 then we observe a very natural b ehavior: the conductance nicely decreases, reaches the minimum somewhere b etween 10 and 100 nodes, and then slowly but not too smoothly increases. Not surprisingly, it is in this parameter region where the Forest Fire Model has b een shown to exhibit realistic time evolving graph prop erties such as densification and shrinking diameters [21, 22]. Next, notice that if pf is too low or too high, then we obtain qualitatively different results. For example, if pf = 0.26, then the community profile plot gradually decreases for nearly the entire plot. For this choice of parameters, the forest fire does not spread well since the forward burning probability is too small, the network is extremely sparse and is tree-like with just a few extra edges, and so we get large well separated "communities" that get b etter as they get larger. On the other hand, when burning probability is too high, e.g., pf = 0.40, then the NCP plot has a minimum and then rises extremely rapidly. For this choice of parameters, if a node which initially attached to a whisker successfully burns into the core, then it quickly establishes many successful connections to other nodes in the core. Thus, the network has relatively large whiskers that failed to establish such a connection and a very expanderlike core, with no intermediate region, and the increase in the community profile plot is quite abrupt.

7. DISCUSSION 7.1 Comparison to ground truth communities
A common practice when evaluating community detection algorithms is to compare extracted communities with some notion of "ground truth" (in a hop e that extracted and true communities corresp ond). We have examined four networks in which we have access to some notion of "ground truth". · LiveJournal [5] is an online blogging community where users create and then join groups. We view each such group as defining a "ground truth" community. · CA-DBLP [5] is a network in which nodes are authors and edges connect authors co-authoring at least one pap er. Here, publication venues (e.g., journals, conferences) play the role of "ground truth" communities. · AmazonProd [8] is a network linking products often purchased together at amazon.com. Each item b elongs to one or more hierarchically organized categories, and products from the same category define a group which is a "ground truth" community. · AtM-IMDB is a bipartite actors-to-movies network. For each movie we also know the language and the country where it was produced. Countries and languages may b e taken as "ground truth" communities. To examine the quality of "ground truth" communities in the these network datasets, one can take all groups and measure the conductance of the cut that separates the group from the rest of the network. Thus, we generated NCP plots in the following way. For every "ground truth" community, we measure its conductance, from which we obtain a scatter plot of community size versus conductance. Then, we take the lower-envelop e of this plot, i.e., for every k we find the conductance value of the community of size k that has

702


WWW 2008 / Refereed Track: Social Networks & Web 2.0 - Discovery and Evolution of Communities
10  (conductance) 10 10 10 10
0

10  (conductance)

0

-1

10

-1

-2

-3

10

-2

-4

100

101 102 103 104 105 k (number of nodes in the cluster)

10

-3

100

101 102 103 104 105 k (number of nodes in the cluster)

(a) LiveJournal
100  (conductance)  (conductance) 100

(b) CA-DBLP

10

-1

10

-1

10

-2

10

-2

10

-3

10

0

10 10 10 10 10 k (number of nodes in the cluster)

1

2

3

4

5

10

-3

10

0

10 10 10 10 10 10 k (number of nodes in the cluster)

1

2

3

4

5

6

(c) AmazonProd

(d) AtM-Imbd Language

Figure 9: [Best viewed in color.] NCP plots for explicitly "ground truth" communities (green), compared with that for the original network (red) and a rewired version of the network (black). the lowest conductance. Figure 9 shows the results for these network datasets; the figure also shows the NCP plot obtained from using the Local Sp ectral Algorithm on b oth the original network (red) and on the rewired network (black). Several things should b e noted. First, the conductance of "ground truth" communities follows that for the network communities up to until size 10-100 nodes, i.e., communities get successively more community-like. As "ground truth" communities get larger, their conductance values tend to get worse and worse, in agreement with network communities discovered with graph partitioning approximation algorithms. Thus, the qualitative trend we observed in nearly every large sparse real-world network (of the b est communities blending in with the rest of the network as they grow in size) is seen to hold for "ground truth" communities. Second, one might exp ect that the NCP plot for the "ground truth" communities (the green curves) will b e somewhere b etween the NCP plot of the original network (red curve) and that for the rewired network (black), and this is seen to b e the case in general. The NCP plot for network communities goes much deep er and rises more gradually than for "ground truth" communities. This is also very consistent with our general observation that only small communities tend to b e dense and well separated, and to separate large groups one has to cut disprop ortionately many edges. Third, for the two social networks we studied (LiveJournal and CA-DBLP), larger "ground truth" communities have conductance scores that get quite "random", i.e., they are as well separated as they would b e in a randomly rewired network (green and black curves overlap). This is likely associated with the relatively weak and overlapping notion of "ground truth" we associated with those two network datasets. On the other hand, for AmazonProd and AtM-IMDB networks, the general trend still remains but large "ground truth" communities have conductance scores that lie well b elow the rewired network curve.

size of ab out 100, the "quality" of communities get worse and worse and communities more and more "blend into" the the graph. Eventually, even the existence of communities (at least when viewed as sets with stronger internal than external connectivity) is rather questionable. This seems to agree with Dunbar [11] who predicted that 150 is the upp er limit on the size of a human community. Moreover, Allen [2] gives evidence that on-line communities have around 60 memb ers, and on-line discussion forums start to break down at ab out 80 active contributors. Church congregations, military companies, divisions of corp orations, all are close to the magic sum of 150 [2]. We are thus led to ask: Why is community quality inversely prop ortional to its size? And why are NCP plots of small and large networks so different? Previous studies mainly focused on small networks (e.g., see [9]), which are simply not large enough for the clusters to gradually blend into one another as one looks at larger size scales. Our results do not disagree with literature at small sizes. But it seems that in order to make our observations one needs to look at large networks. Probably it is only when Dunbar's limit is passed that we find large communities blurring and eventually vanishing. A second reason is that previous work did not measure and examine the network community profile of cluster size vs. cluster quality. Another explanation could b e that in small, carefully collected networks, the semantics of edges is very precise while in large networks we know much less ab out each particular edge, e.g., esp ecially in when online p eople have very different criteria for calling someone a friend. Traditionally social scientists through questionnaires "normalized" the links by making sure each link has the same semantics/strength. There has also b een some evidence that hints towards the findings we make here. For example, Clauset et al. [8] analyzed community structure of the AmazonProd, and found that 50% of the nodes b elonged to the largest "miscellaneous" community. This agrees with the typical size of the network core (as defined in Section 4.1), and one could conclude that the largest community they found corresp onds to the intermingled core of the network, and the rest of the communities are whisker-like. Our work also raises an imp ortant question of what is a natural community size, and whether larger communities (in a network sense) even exist. It seems that when community size surpasses some threshold it b ecomes so diverse, that it stops existing as a traditionally understood "network community". It blends with the network, and intuitions based on connectivity and cuts seem to fail to identify it. Approaches that consider b oth the network structure and node attribute data might detect communities in these cases. Also, conductance seems like a very reasonable measure that satisfies intuition ab out community quality, but we have seen that if one only worries ab out conductance, then bags of whiskers and other internally disconnected sets have the b est scores. This raises interesting questions ab out cluster coherence, regularization and smoothness: what is a good definition of coherence, and how should this b e connected to the notion of community separability.

7.2 Broader implications
In contrast to numerous studies of community structure, we find that the b est communities are relatively small with sizes only up to ab out 100 nodes. We also find that ab ove

8. CONCLUSION
We investigated statistical prop erties of sets of nodes in large real-world social and information networks that could plausibly b e interpreted as good communities, and we discovered that community structure in these networks is very

703


WWW 2008 / Refereed Track: Social Networks & Web 2.0 - Discovery and Evolution of Communities
different than what we exp ected from the literature and from what commonly-used models would suggest. The most striking example of this is that, in nearly every network dataset we examined, the conductance score of the b est p ossible set of nodes gets gradually worse and worse as those sets increase in size. This suggests that that larger and larger clusters are "blended in" more and more with the rest of the network. Our interpretation is that if a concept like conductance captures our intuitive notion of community goodness and if we model large networks with interaction graphs, then the b est p ossible communities get less and less communitylike as they grow in size. Our work op ens several new questions ab out the structure of large social and information networks in general, and it has implications for the use of graph partitioning algorithms on real-world networks and for detecting communities in them.
[15] S. Hoory, N. Linial, and A. Wigderson. Expander graphs and their applications. Bul letin of the American Mathematical Society, 43:439­561, 2006. [16] R. Kannan, S. Vempala, and A. Vetta. On clusterings: Good, bad and spectral. Jour. of the ACM, 51(3), 2004. [17] G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20:359­392, 1998. [18] R. Kumar, P. Raghavan, S. Ra jagopalan, D. Sivakumar, A. Tomkins, and E. Upfal. Stochastic models for the web graph. In FOCS '00: Proceedings of the 41st Annual Symposium on Foundations of Computer Science, 2000. [19] K. Lang and S. Rao. A flow-based method for improving the expansion or conductance of graph cuts. In IPCO '04: Proceedings of the 10th International Conf. on Integer Programming and Combinatorial Optimization, 2004. [20] T. Leighton and S. Rao. Multicommodity max-flow min-cut theorems and their use in designing approximation algorithms. Journal of the ACM, 46(6):787­832, 1999. [21] J. Leskovec, J. Kleinberg, and C. Faloutsos. Graphs over time: densification laws, shrinking diameters and possible explanations. In KDD '05: Proceeding of the 11th ACM SIGKDD International Conference on Know ledge Discovery in Data Mining, pages 177­187, 2005. [22] J. Leskovec, J. Kleinberg, and C. Faloutsos. Graph evolution: Densification and shrinking diameters. ACM Transact. on Know ledge Discovery from Data, 1(1), 2007. [23] J. Leskovec, K.J. Lang, A. Dasgupta, and M.W. Mahoney. Statistical properties of community structure in large social and information networks. Manuscript. [24] M.E.J. Newman. Detecting community structure in networks. The European Physical J. B, 38:321­330, 2004. [25] M.E.J. Newman. Finding community structure in networks using the eigenvectors of matrices. Phys. Rev. E, 74, 2006. [26] M.E.J. Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences of the United States of America, 103(23):8577­8582, 2006. [27] E. Ravasz and A.-L. Barab´si. Hierarchical organization in a complex networks. Physical Review E, 67:026112, 2003. [28] M. Richardson, R. Agrawal, and P. Domingos. Trust management for the semantic web. In ISWC '03: Proceedings of the 2nd International Semantic Web Conference, pages 351­368, 2003. [29] M. Ripeanu, I. Foster, and A. Iamnitchi. Mapping the gnutella network: Properties of large-scale peer-to-peer systems and implications for system design. IEEE Internet Computing, 6(1):50­57, 2002. [30] S.E. Schaeffer. Graph clustering. Computer Science Review, 1(1):27­64, 2007. [31] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transcations of Pattern Analysis and Machine Intel ligence, 22(8):888­905, 2000. [32] D.A. Spielman and S.-H. Teng. Spectral partitioning works: Planar graphs and finite element meshes. In FOCS '96: Proceedings of the 37th Annual IEEE Symposium on Foundations of Computer Science, pages 96­107, 1996. [33] S.L. Tauro, C. Palmer, G. Siganos, and M. Faloutsos. A simple conceptual model for the internet topology. In GLOBECOM '01: Global Telecommunications Conference, pages 1667­1671, 2001. [34] D.J. Watts and S.H. Strogatz. Collective dynamics of small-world networks. Nature, 393:440­442, 1998. [35] W.W. Zachary. An information flow model for conflict and fission in small groups. Journal of Anthropological Research, 33:452­473, 1977. [36] C.T. Zahn. Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Transactions on Computers, C-20(1):68­86, 1971. [37] Y. Zhao and G. Karypis. Empirical and theoretical comparisons of selected criterion functions for document clustering. Machine Learning, 55:311­331, 2004.

Acknowledgement
We thank Reid Andersen, Christos Faloutsos and Jon Kleinb erg for discussions, Lars Backstrom for data, and Arpita Ghosh for assistance with the proof of Theorem 1.

9.

[1] R. Z. Albert and A-L. Barab´si. Emergence of scaling in a random networks. Science, 286(5439):509­512, 1999. [2] Christopher Allen. Life with alacrity: The Dunbar number as a limit to group sizes, http://www.lifewithalacrity. com/2004/03/the_dunbar_numb.html, 2004. [3] R. Andersen, F.R.K. Chung, and K. Lang. Local graph partitioning using PageRank vectors. In FOCS '06: Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, pages 475­486, 2006. [4] S. Arora, S. Rao, and U. Vazirani. Expander flows, geometric embeddings and graph partitioning. In STOC '04: Proceedings of the 36th annual ACM Symposium on Theory of Computing, pages 222­231, 2004. [5] L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan. Group formation in large social networks: membership, growth, and evolution. In KDD '06: Proceedings of the 12th ACM SIGKDD International Conference on Know ledge Discovery and Data Mining, pages 44­54, 2006. [6] F.R.K. Chung. Spectral graph theory, volume 92 of CBMS Regional Conference Series in Mathematics. AMS, 1997. [7] F.R.K. Chung and L. Lu. Complex Graphs and Networks, volume 107 of CBMS Regional Conference Series in Mathematics. AMS, 2006. [8] A. Clauset, M.E.J. Newman, and C. Moore. Finding community structure in very large networks. arXiv:cond-mat/0408187, August 2004. [9] L. Danon, J. Duch, A. Diaz-Guilera, and A. Arenas. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 29(09):P09008, 2005. [10] I.S. Dhillon, Y. Guan, and B. Kulis. Weighted graph cuts without eigenvectors: A multilevel approach. IEEE Transactions on Pattern Analysis and Machine Intel ligence, 29(11):1944­1957, 2007. [11] Robin Dunbar. Grooming, Gossip, and the Evolution of Language. Harvard Univ Press, October 1998. [12] A.D. Flaxman, A.M. Frieze, and J. Vera. A geometric preferential attachment model of networks. In WAW '04: Proceedings of the 3rd Workshop On Algorithms And Models For The Web-Graph, pages 44­55, 2004. [13] M. Gaertler. Clustering. In U. Brandes and T. Erlebach, editors, Network Analysis: Methodological Foundations, pages 178­215. Springer, 2005. [14] J. Gehrke, P. Ginsparg, and J. Kleinberg. Overview of the 2003 KDD Cup. SIGKDD Explorations, 5(2), 2003.

REFERENCES

704