WWW 2008 / Refereed Track: Rich Media, April 21-25, 2008, Beijing, China

Graph Theoretical Framework for Simultaneously Integrating Visual and Textual Features for Efficient Web Image Clustering

Manjeet Rege, Ming Dong, Jing Hua
Machine Vision Pattern Recognition Lab / Graphics and Imaging Lab
Department of Computer Science, Wayne State University, Detroit, MI 48202, USA
{rege, mdong, jinghua}@wayne.edu

ABSTRACT

With the explosive growth of the Web and the recent developments in digital media technology, the number of images on the Web has grown tremendously. Consequently, Web image clustering has emerged as an important application. Some of the initial efforts along this direction revolved around clustering Web images based on the visual features of images, or on textual features obtained from the text surrounding the images. However, not much work has been done on using multimodal information for clustering Web images. In this paper, we propose a graph theoretical framework for simultaneously integrating visual and textual features for efficient Web image clustering. Specifically, we model visual features, images and words from the surrounding text using a tripartite graph. Partitioning this graph leads to clustering of the Web images. Although the graph partitioning approach has been adopted before, the main contribution of this work lies in a new algorithm that we propose, Consistent Isoperimetric High-order Co-clustering (CIHC), for partitioning the tripartite graph. Computationally, CIHC is very fast as it only requires a simple solution to a sparse system of linear equations. Our theoretical analysis and extensive experiments performed on real Web images demonstrate the performance of CIHC in terms of the quality, efficiency and scalability in partitioning the visual feature-image-word tripartite graph.

Categories and Subject Descriptors
I.5.3 [Pattern Recognition]: Clustering-algorithms

General Terms
Algorithms, Performance, Experimentation, Theory.

1. INTRODUCTION

Over the last decade, the Web has evolved from its initial days of infancy into the unstructured database that we know of today. The initial search engines applied the text information retrieval model, wherein a Web document was retrieved based on the relevance matching between the user-provided query and the words appearing in the document.

Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2008, April 21-25, 2008, Beijing, China. ACM 978-1-60558-085-2/08/04.

However, with the recent developments in the field of digital media technology, which have resulted in the rapid generation of a huge number of images, efficient Web image search and browsing have become an important application. Consequently, Web image clustering has recently drawn significant attention in the research community. For example, properly grouped Web images can provide a very neat bird's eye view of the retrieved images to the user. Much of the earlier effort on image clustering was based solely on low-level visual features of images [14, 25]. To cluster images, visual features such as color histograms or wavelet-based texture [13] extracted from the images were subjected to traditional data clustering algorithms [17]. The major drawback of this approach is that the state-of-the-art visual features are unable to represent the image content on a semantic level.
As a result, image clustering suffers from the semantic gap between visual features and high-level semantic concepts. A low-tech and rather naive solution adopted by search engines to overcome this problem to a certain extent has been to treat image clustering as a text clustering problem: Web images are represented by textual features derived from the surrounding texts and captions, and images clustered on these textual features are then retrieved accordingly. But since images are not actually text documents, this approach is hardly a solution to the problem at hand. A better approach is to incorporate both visual and textual features together for efficient Web image clustering. In [3], a single index vector for an image is formed by combining textual and visual statistics. The textual statistics are captured in vector form using latent semantic indexing (LSI) [5] based on the text in the containing HTML document, while the visual statistics are also stored in vector form using color and orientation histograms. A similar approach was adopted in [29], where the textual and visual features were first combined into a global vector, after which the LSI technique was applied. In both these works, the two different kinds of image representations were simply combined in a rather rigid way, without any theoretical basis. Cai et al. [2] used three image representations, viz., visual, textual, and link information, to construct a relationship graph of Web images. In that work, images were first clustered into different semantic groups by employing the textual and link features; this step was then followed by visual feature-based clustering of the images within each semantic group. A problem with this two-step process is that errors in the first step can propagate to the second step, leading to poor image clustering. The use of visual, textual, and link features was also employed in [21]. Here, an iterative algorithm was applied to combine the co-clustering between images and surrounding text with a one-sided clustering of images based on visual features. The convergence of this algorithm was not proven, and the combination treats the visual and textual features asymmetrically.

From the above discussion, it is clear that the earlier efforts were directed at combining the information from visual and textual features rather than integrating them synchronously under a sound theoretical framework. In this paper, we propose the Consistent Isoperimetric High-Order Co-clustering (CIHC) framework for the simultaneous integration of visual and textual features for efficient Web image clustering under a graph theoretical approach. Specifically, visual features, images and textual features are modelled as the three types of vertices of a tripartite graph. This tripartite graph is treated as two bipartite graphs: that of visual features & images, and that of images & the textual features from the surrounding texts of the images. Co-clustering is then achieved by simultaneously partitioning these two bipartite graphs together, such that the information from both kinds of features is optimally utilized. Note that the simultaneous partitioning of the two bipartite graphs is performed in such a way that the local clustering of each individual graph need not be optimal, under the constraint that the fusion of the two results yields optimum image clustering.
A similar concept was presented by Gao et al. [11], where the Consistent Bipartite Graph Co-partitioning (CBGC) was proposed under the spectral graph partitioning paradigm. There, an iterative algorithm based on semi-definite programming (SDP) [1] is used to partition the tripartite graph, which is computationally expensive and does not work well on large data sets. The methodology proposed here, on the other hand, only requires a simple solution to a sparse, overdetermined system of linear equations. Moreover, the CIHC framework is derived from the isoperimetric graph partitioning approach, which has been shown to achieve superior results to the spectral approach in terms of the quality, efficiency and stability of the partition [15, 16, 26]. Experimental results on images extracted from real Websites demonstrate the advantage of CIHC over CBGC in clustering Web images.

2. RELATED WORK

In this Section, we introduce some essential background on graph theory and review related work in the literature.

2.1 Homogeneous Graphs for clustering

An undirected homogeneous graph G = {V, E} consists of a set of vertices V = {v_1, v_2, ..., v_{|V|}} and a set of edges E = {e_{ij} | edge between v_i and v_j, i, j <= |V|}, where |V| is the number of vertices. In a weighted graph, each edge e_{ij} has a positive weight denoted by w(e_{ij}). The weight of an edge signifies the level of association between its two vertices, and an edge weight of zero denotes the absence of an edge between the two respective vertices. Given a vertex numbering and the edge weights between the vertices, graphs can be represented by matrices. We begin with definitions of a few graph terminologies that play an essential role in the paper.

The adjacency matrix J of the graph is defined as

    J_{ij} = \begin{cases} w(e_{ij}) & \text{if } e_{ij} \text{ exists} \\ 0 & \text{otherwise} \end{cases}    (1)

The degree of a vertex v_i, denoted by d_i, is defined as

    d_i = \sum_{e_{ij} \in E} w(e_{ij})    (2)

The degree matrix D of the graph is a diagonal matrix with the vertex degrees along the diagonal, while the degree vector d of a graph is a vector consisting of the degrees of all the vertices. The Laplacian matrix L of a graph is a symmetric matrix with one row and column for each vertex, such that

    L_{v_i, v_j} = \begin{cases} d_i & \text{if } i = j \\ -w(e_{ij}) & \text{if } e_{ij} \text{ exists} \\ 0 & \text{otherwise} \end{cases}    (3)

Suppose we bipartition the set V into subsets V_1 and V_2; the corresponding graph cut is then defined as

    cut(V_1, V_2) = \sum_{i \in V_1, j \in V_2} J_{ij}    (4)

The above definition can be extended to a k-partitioning of the graph, in which case the cut is defined as

    cut(V_1, V_2, ..., V_k) = \sum_{n} cut(V_n, V \setminus V_n)    (5)

A graph partitioning algorithm assigns a set of values to the vertices of the graph. We will refer to the vector consisting of these values, one per vertex, as the indicator vector of the graph. Cutting the graph amounts to dividing the indicator vector, based on the value associated with each vertex, using a splitting value. If u denotes the indicator vector of the graph and s is the splitting value, then the vertices are partitioned into the set of i such that u_i > s and the set such that u_i <= s.

Spectral graph theory [4], which is based on performing an eigendecomposition of matrices associated with the graph, has been one of the most popular and widely applied graph partitioning methods. In [27], Shi and Malik applied the spectral method to image segmentation. The objective function used in that work is

    \min_x \frac{x^T L x}{x^T D x}, \quad \text{subject to } x^T D e = 0, \; x \neq 0    (6)

where e is a vector of all ones and x is a column vector such that x_i = c_1 if i \in V_1 and x_i = -c_2 if i \in V_2, where c_1 and c_2 are constants derived from the degree matrix D. By relaxing x_i from discrete to continuous values, it can be shown that the solution to Equation (6) is the eigenvector corresponding to the second smallest eigenvalue \lambda_2 of the generalized eigenvalue problem [4, 12]

    L x = \lambda D x    (7)

Partitions are then obtained by running a clustering algorithm such as k-means [17] on the eigenvector x corresponding to \lambda_2.
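As a concrete illustration of Equations (1)-(7), the following is a minimal sketch (in Python with NumPy/SciPy, not the MATLAB used later in the paper) that builds the degree and Laplacian matrices of a small weighted graph and bipartitions it with the relaxed spectral objective. The median split below is a simple stand-in for the k-means step mentioned above, and the toy graph and function names are purely illustrative.

```python
import numpy as np
from scipy.linalg import eigh

def degree_and_laplacian(J):
    """Degree matrix D (Equation (2)) and Laplacian L = D - J
    (Equation (3)) of a weighted graph with adjacency matrix J."""
    d = J.sum(axis=1)
    return np.diag(d), np.diag(d) - J

def spectral_bipartition(J):
    """Relaxed bipartition per Equations (6)-(7): solve the generalized
    eigenproblem L x = lambda D x and threshold the eigenvector of the
    second smallest eigenvalue."""
    D, L = degree_and_laplacian(J)
    vals, vecs = eigh(L, D)          # eigenvalues returned in ascending order
    x = vecs[:, 1]                   # eigenvector for lambda_2
    return x > np.median(x)          # simple splitting value instead of k-means

# toy graph: two triangles joined by one weak edge
J = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    J[i, j] = J[j, i] = 1.0
J[2, 3] = J[3, 2] = 0.1
print(spectral_bipartition(J))       # expected: one triangle per cluster
```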
2.2 Bipartite Graph Model for pair-wise co-clustering

An undirected bipartite graph G = {M, W, E} has two sets of vertices, viz., M and W, and a set of graph edges E. Let B be a |W| by |M| graph weight matrix. An entry B_{ij} of this matrix is the weight of the edge between a vertex w_i \in W and a vertex m_j \in M. There are no edges between vertices of the same group. The adjacency matrix of the bipartite graph is then expressed as

    J = \begin{pmatrix} 0 & B \\ B^T & 0 \end{pmatrix}    (8)

where the first |W| vertices index W and the last |M| vertices index M.

Figure 1: The square and circular vertices (m and w, respectively) denote the two data types in the co-clustering problem that are represented by the bipartite graph. Partitioning this bipartite graph leads to co-clustering of the two data types.

The two data types in a co-clustering problem can be represented by the two vertex sets of the weighted bipartite graph, and co-clustering of the data is achieved by partitioning the bipartite graph. In Figure 1, we show a bipartite graph partitioned using a dotted line. The two partitions obtained are {m_1, m_2, m_3, w_1, w_2, w_3} and {m_4, m_5, m_6, w_4, w_5, w_6, w_7}, respectively. Therefore, the objects in M are clustered into {m_1, m_2, m_3} and {m_4, m_5, m_6}, while those in W are simultaneously clustered into {w_1, w_2, w_3} and {w_4, w_5, w_6, w_7}. In order to compute these partitions using the spectral approach, we again need to solve a generalized eigenvalue problem as in Equation (7). However, due to the bipartite nature of the problem, the eigenvalue problem reduces to a much more efficient Singular Value Decomposition (SVD) [12] problem, which has found application for co-clustering in varied fields [6, 19, 7]. Recently, the Isoperimetric Co-clustering Algorithm (ICA) [26] was proposed to achieve pair-wise co-clustering by partitioning a bipartite graph. ICA bears resemblance to the spectral approach in the sense that it does not require the coordinate information of the vertices of the graph, and it allows us to find partitions of an optimal cardinality instead of a predefined cardinality. It has been shown that ICA outperforms the spectral approach in terms of the quality, efficiency and stability in partitioning a bipartite graph.
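For reference, the SVD reduction mentioned above can be sketched as follows, in the spirit of [6]: scale the bipartite weight matrix by the vertex degrees, take the second singular vector pair, and threshold the joint embedding of the two vertex sets. The median split stands in for k-means, the matrix orientation (|W| by |M|) follows the definition of B given above, and this is an illustrative sketch rather than the authors' implementation.

```python
import numpy as np

def svd_cocluster(B):
    """Bipartite co-clustering sketch in the spirit of [6]: scale the
    |W| x |M| weight matrix B by the vertex degrees, take the second
    singular vector pair of the scaled matrix, and threshold the joint
    embedding of the W and M vertices."""
    d1 = B.sum(axis=1)                           # degrees of the W vertices
    d2 = B.sum(axis=0)                           # degrees of the M vertices
    Bn = B / np.sqrt(np.outer(d1, d2))           # D1^{-1/2} B D2^{-1/2}
    U, s, Vt = np.linalg.svd(Bn)
    z = np.concatenate([U[:, 1] / np.sqrt(d1),   # embedding of the W vertices
                        Vt[1, :] / np.sqrt(d2)]) # embedding of the M vertices
    split = np.median(z)                         # stand-in for the k-means step
    return z[:B.shape[0]] > split, z[B.shape[0]:] > split
```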
2.3 Tripartite Graph Model for triplet co-clustering

Figure 2: Tripartite graph of visual features, web images and the words coming from the surrounding text of the images.

The integration of multimodal information for efficient Web image clustering can be represented using a star-structured k-partite graph. The structure of this weighted graph is such that the central data type (images) is connected to all the other data types (the different features). There are no direct edges between the feature vertices. The images are able to utilize multimodal information by being connected to the different kinds of feature vertices simultaneously. At an abstract level, a star-structured k-partite graph can be considered a generalized version of a tripartite graph. As a preliminary attempt, in this work we integrate visual and textual features simultaneously using the tripartite graph shown in Figure 2. An undirected tripartite graph G = {F, M, W, E} has three sets of vertices, viz., F, M and W, with E as the set of edges. If A and B represent the weight matrices of the feature-image and image-word bipartite graphs, respectively, then the adjacency matrix of the graph is defined as

    J = \begin{pmatrix} 0 & A & 0 \\ A^T & 0 & B \\ 0 & B^T & 0 \end{pmatrix}    (9)

Every entry of A and B represents the importance of a particular feature and word, respectively, for that image. For B, the word frequency in the surrounding text of the images is used. Partitioning this graph lets us achieve Web image clustering by integrating information from visual and textual features synchronously.

Gao et al. [11] have used a similar tripartite graph model in their Consistent Bipartite Graph Co-partitioning (CBGC) framework. The tripartite graph is considered to be a fusion of two bipartite graphs that are partitioned simultaneously using the spectral approach. Let q = [f m]^T and p = [m w]^T denote the indicator vectors of the two bipartite graphs, and let D^{(fm)}, D^{(mw)}, L^{(fm)} and L^{(mw)} represent the degree and Laplacian matrices of the two bipartite graphs. Then the objective function to minimize for the tripartite graph is expressed as a linear combination of the objective functions of the two bipartite graphs as follows:

    \min \; \beta \frac{q^T L^{(fm)} q}{q^T D^{(fm)} q} + (1 - \beta) \frac{p^T L^{(mw)} p}{p^T D^{(mw)} p}    (10)

    subject to q^T D^{(fm)} e = 0, \; q \neq 0; \quad p^T D^{(mw)} e = 0, \; p \neq 0; \quad 0 < \beta < 1

where the parameter \beta specifies the weight given to each bipartite graph in the linear combination. Inspired by this work and the recent results of ICA for pair-wise co-clustering [26], we propose to partition the visual feature-image-word tripartite graph using isoperimetric graph partitioning.
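A small sketch of how the adjacency matrix of Equation (9) can be assembled from the two weight matrices; it assumes A is |F| x |M| and B is |M| x |W|, matching the block structure above.

```python
import numpy as np

def tripartite_adjacency(A, B):
    """Assemble the (|F|+|M|+|W|) square adjacency matrix of Equation (9)
    from the feature-image weights A (|F| x |M|) and the image-word
    weights B (|M| x |W|)."""
    nf, nm = A.shape
    nw = B.shape[1]
    J = np.zeros((nf + nm + nw, nf + nm + nw))
    J[:nf, nf:nf + nm] = A
    J[nf:nf + nm, :nf] = A.T
    J[nf:nf + nm, nf + nm:] = B
    J[nf + nm:, nf:nf + nm] = B.T
    return J
```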
3. ISOPERIMETRIC GRAPH PARTITIONING FOR WEB IMAGE CLUSTERING

We partition the tripartite graph of visual features, web images and surrounding text words by extending the ICA framework. To proceed, we first provide a brief overview of ICA for partitioning a bipartite graph. ICA is motivated by the combinatorial formulation of the classic isoperimetric problem [8, 24, 15, 16]: for a fixed area, find the shape with minimum perimeter. It provides a polynomial-time heuristic for the NP-hard problem of finding a region with minimum perimeter for a fixed area. Let V = {M \cup W} be the set of vertices of the bipartite graph. ICA partitions V into sets S and S^c such that S \cup S^c = V and S \cap S^c = \emptyset. Like other graph partitioning algorithms, ICA seeks the optimum partitioning by finding S and S^c such that the isoperimetric ratio of the graph, h_G, defined as

    h_G = \frac{|\partial S|}{Vol\,S}    (11)

is minimized. The numerator and denominator represent the boundary area and the volume of S, respectively. The boundary of S is defined as \partial S = {e_{ij} | edges between a vertex in S and a vertex in S^c}. Consequently,

    |\partial S| = \sum_{e_{ij} \in \partial S} w(e_{ij})    (12)

The combinatorial volume [8] can be defined as

    Vol\,S = |S|    (13)

After a few mathematical deductions, ICA achieves the minimization of Equation (11) by solving a sparse system of linear equations,

    L \begin{pmatrix} m \\ w \end{pmatrix} = e    (14)

where L is the Laplacian matrix of the bipartite graph, [m w]^T is the indicator vector indicating the partitioning of the two types of vertices, and e is a vector of ones of size |M| + |W|. Solving this system of equations yields a real-valued [m w]^T. In order to get partitions, this solution needs to be cut using a splitting value (as explained in Section 2).
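The following sketch illustrates Equation (14) for a bipartite graph, along the lines of the isoperimetric algorithm in [15, 16, 26]: one vertex is grounded (fixed to zero) so that the singular Laplacian system becomes uniquely solvable, a step glossed over in the summary above, and the real-valued solution is then thresholded. A median cut is used purely for brevity; the matrix orientation and function names are illustrative assumptions.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import spsolve

def ica_bipartition(B):
    """Isoperimetric co-clustering sketch (Equation (14)): build the
    Laplacian of the bipartite graph with |M| x |W| weight matrix B,
    ground one vertex so that L x = e has a unique solution, and
    threshold the real-valued solution."""
    nm, nw = B.shape
    J = np.zeros((nm + nw, nm + nw))
    J[:nm, nm:] = B
    J[nm:, :nm] = B.T
    L = np.diag(J.sum(axis=1)) - J                      # Laplacian, Equation (3)
    e = np.ones(nm + nw)
    x = np.zeros(nm + nw)
    x[:-1] = spsolve(csr_matrix(L[:-1, :-1]), e[:-1])   # last vertex grounded at 0
    split = np.median(x)                                # median cut, for brevity
    return x[:nm] > split, x[nm:] > split               # image and word partitions
```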
To partition the visual feature-image-word tripartite graph, it might intuitively seem obvious to perform a traditional extension of ICA (abbreviated as TICA) by solving a similar system of linear equations corresponding to the adjacency matrix defined in Equation (9), as follows:

    L \begin{pmatrix} f \\ m \\ w \end{pmatrix} = e    (15)

where L, the indicator vector [f m w]^T and e are defined analogously for the tripartite graph. However, by doing so, the visual feature-image-word tripartite graph actually ends up being treated as a bipartite graph of images versus visual features & words. This can be seen by shifting the visual feature vertices over to the side of the word vertices, as illustrated in Figure 3. As a result, we are unable to distinguish between cutting a visual feature-image edge and an image-word edge, which is a conceptual misrepresentation of the basic structure of visual features, images and words shown in Figure 2.

Figure 3: Traditional extension of ICA (TICA) for partitioning the feature-image-word tripartite graph ends up actually partitioning a bipartite graph of images versus features & words.

An alternative approach is to use a weighting parameter \lambda to discourage the mixing of visual feature and word vertices, by defining the adjacency matrix of the tripartite graph as follows:

    J = \begin{pmatrix} 0 & \lambda A & 0 \\ \lambda A^T & 0 & B \\ 0 & B^T & 0 \end{pmatrix}    (16)

However, as demonstrated in Sections 3.1 and 5, the numerical weighting is unable to prevent the problem. Moreover, even if it works on a particular dataset, it is not possible to decide the numerical weight a priori, and the same value will not work across other datasets. To overcome the ill-partitioning of Figure 3, we propose Consistent Isoperimetric High-order Co-clustering (CIHC), which partitions the visual feature-image-word tripartite graph by considering it as two bipartite graphs coupled together.

It is easy to see that applying ICA separately to the two bipartite graphs will, in general, result in two different partitionings of the images. In order to achieve consistent results, we need to partition the two bipartite graphs simultaneously. That is, the visual feature-image bipartite graph needs to be partitioned under the constraints enforced on the images by the words, while the partitioning of the image-word bipartite graph has to be carried out under the constraints enforced on the images by the visual features. In other words, we achieve a consistent partitioning of the images under the proviso that the partitioning of the visual feature-image or image-word graph individually need not be optimal. By doing so, we can consistently integrate the visual and textual features simultaneously for clustering the Web images.

Applying ICA to the visual feature-image bipartite graph, we get

    L^{(fm)} \begin{pmatrix} f \\ m \end{pmatrix} = e^{(fm)}    (17)

Similarly, the image-word bipartite graph yields

    L^{(mw)} \begin{pmatrix} m \\ w \end{pmatrix} = e^{(mw)}    (18)

We combine the above two systems of linear equations as

    \begin{pmatrix} L^{(fm)} & 0 \\ 0 & L^{(mw)} \end{pmatrix} \begin{pmatrix} f \\ m \\ w \end{pmatrix} = \begin{pmatrix} e^{(fm)} \\ e^{(mw)} \end{pmatrix}, \quad \text{i.e., } F r = v    (19)

where v is a vector of ones of size |F| + 2|M| + |W| and the image variables m are shared by the two blocks. Note that F is not a square matrix, i.e., this is an overdetermined system of linear equations in which the number of equations exceeds the number of variables. An overdetermined system of linear equations is usually inconsistent and has no exact solution; however, many least squares methods exist to approximate the solution [20]. We adopted the QR decomposition method to solve Equation (19) owing to its simplicity and efficiency, although any other method can be employed as well. Among the common methods for cutting the indicator vector are the median cut and the ratio cut. The median cut uses the median of the indicator vector r as the splitting value, producing equally sized partitions, while the ratio cut chooses the splitting value such that the resulting partitions have the lowest isoperimetric ratio, indicating optimal partitioning. As our goal is not necessarily to produce equally sized clusters, we employ the ratio cut to obtain a bipartition.
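Putting Equations (17)-(19) and the ratio cut together, a minimal CIHC sketch might look as follows. NumPy's least-squares routine stands in for the QR step; the matrix orientations, the O(n^2) threshold scan, and the use of the smaller side as the volume are illustrative simplifications rather than the authors' implementation.

```python
import numpy as np

def bipartite_laplacian(W):
    """Laplacian L = D - J of a bipartite graph with weight matrix W."""
    n1, n2 = W.shape
    J = np.zeros((n1 + n2, n1 + n2))
    J[:n1, n1:] = W
    J[n1:, :n1] = W.T
    return np.diag(J.sum(axis=1)) - J

def ratio_cut(J, r):
    """Scan thresholds taken from the sorted indicator values and keep the
    one whose partition has the lowest isoperimetric ratio |dS|/VolS
    (Equations (11)-(13)) on the graph with adjacency J; the volume is
    taken on the smaller side of the cut."""
    best_s, best_ratio = np.median(r), np.inf
    for s in np.sort(r)[:-1]:
        S = r <= s
        vol = min(S.sum(), (~S).sum())
        ratio = J[S][:, ~S].sum() / vol          # cut weight / combinatorial volume
        if ratio < best_ratio:
            best_s, best_ratio = s, ratio
    return best_s

def cihc(A, B):
    """CIHC sketch, assuming A is the |F| x |M| visual feature-image weight
    matrix and B the |M| x |W| image-word weight matrix.  Equations (17)-(18)
    are stacked into the overdetermined system F r = v of Equation (19) and
    solved in the least-squares sense."""
    nf, nm = A.shape
    nw = B.shape[1]
    F = np.zeros((nf + 2 * nm + nw, nf + nm + nw))
    F[:nf + nm, :nf + nm] = bipartite_laplacian(A)   # L^(fm) acting on [f; m]
    F[nf + nm:, nf:] = bipartite_laplacian(B)        # L^(mw) acting on [m; w]
    v = np.ones(F.shape[0])
    r, *_ = np.linalg.lstsq(F, v, rcond=None)
    # adjacency of the whole tripartite graph (Equation (9)) for the ratio cut
    J = np.zeros((nf + nm + nw, nf + nm + nw))
    J[:nf, nf:nf + nm] = A;      J[nf:nf + nm, :nf] = A.T
    J[nf:nf + nm, nf + nm:] = B; J[nf + nm:, nf:nf + nm] = B.T
    s = ratio_cut(J, r)
    return r[nf:nf + nm] > s                          # cluster labels of the images
```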
3.1 Illustrative Toy problem

Above, we discussed the need to partition the tripartite graph by simultaneously partitioning the two bipartite graphs. We now illustrate on a toy problem that TICA does not yield optimum results. For this, we created the tripartite graph shown in Figure 4, having 3 F, 8 M and 6 W vertices with uniform weights along all the edges. It is easy to infer the ideal partitioning of this graph, shown in the Figure using a dotted line. We partitioned this graph using TICA with the adjacency matrix defined in Equation (16). In Figure 5, we show the corresponding results achieved when the value of \lambda is varied from 0.01 to 1000. The X-axis lists the vertices in the order F, M and W, with dotted lines separating each group, and the embedding value of each vertex in the indicator vector is plotted along the Y-axis. The two marker patterns represent the two clusters obtained. These results clearly show that, in spite of drastically increasing the edge weights of one bipartite graph over the other, TICA is still unable to distinguish between cutting an F-M edge and an M-W edge, even on a simple graph. On the other hand, Figure 6 shows the results achieved using CIHC: CIHC achieves perfect clustering for all the vertices. Through this toy problem, we have shown the need to apply isoperimetric co-clustering simultaneously and consistently. More results, on real Web images, are presented in Section 5.

Figure 4: Toy problem of 3, 8 and 6 vertices of F, M and W, respectively, with uniform weights along all edges. The dotted line shows the ideal cut for partitioning this graph.

Figure 5: TICA results for partitioning the toy problem graph of Figure 4. Each sub-figure shows the results for \lambda = 0.01, 1, 100 and 1000.

Figure 6: CIHC results for partitioning the toy graph. We are able to get perfect partitioning for each of the F, M and W vertices.

3.2 Algorithm Summary

The main steps of CIHC can be summarized as follows:

1. Using the weight matrix A of the visual feature-image bipartite graph, construct the Laplacian matrix L^{(fm)}.
2. Similarly, using the weight matrix B of the image-word bipartite graph, construct L^{(mw)}.
3. Using Equation (19), construct F, r and v, and solve the system of linear equations F r = v.
4. Employ the ratio cut on r to get the partitions.

3.3 Advantages of CIHC over CBGC

Both CIHC and CBGC partition the visual feature-image-word tripartite graph by considering it as a fusion of two bipartite graphs. However, as we explain below, the CIHC framework has a number of advantages over CBGC. In CBGC, the spectral objective functions of the two bipartite graphs are transformed into the single objective function of Equation (10) by expressing it as a weighted linear combination. As argued in [9], this is an ad-hoc approach rather than a principled one. The two objective functions represent two different kinds of information; hence, instead of converting the original dual-objective problem into a single-objective problem, one should use a dual-objective algorithm directly.

Another drawback of CBGC is that its performance depends on three parameters, viz., \beta, \theta_1 and \theta_2. \beta is the weighting parameter used in the linear combination of objective functions, while \theta_1 and \theta_2 are parameters used to put constraints on the SDP bound controllers. The problem is that these three parameters have to be predetermined. In their work, the authors test the performance of CBGC on a few image categories by varying the values of the parameters and choosing the best values. As we show in our results (Section 5), this approach for tuning the CBGC parameters works only on those few categories and performs poorly on other datasets. In a real-world Web image clustering application, it is impossible to know in advance what parameter values should be chosen for clustering the retrieved images. CIHC, on the other hand, is a completely parameter-less approach and does not require a priori specification of any parameters. In terms of computational complexity, CIHC is also much more efficient than CBGC, as it only requires a simple solution to a sparse system of linear equations.
4. THEORETICAL ANALYSIS OF CIHC

4.1 Time Complexity

The computational time required by CIHC depends on the solution of Equation (19). In particular, the time complexity depends on the number of non-zero entries in L^{(fm)} and L^{(mw)}, which asymptotically is O(|E|), where E is the set of edges in the tripartite graph. Note that this only measures the time complexity of computing the indicator vector. We also need to include the time complexity of employing the ratio cut, which is of the order O(h log h) with h = |F| + |M| + |W|. Factoring this in, the time complexity of CIHC is O(|E| + h log h). Empirical results on computational speed are presented in Section 5.3.

4.2 Sensitivity Analysis

In this section, we analyze the sensitivity of CIHC and CBGC with respect to a general parameter \alpha.

4.2.1 CIHC

Recall from Equation (19) that CIHC requires the solution of a sparse system of linear equations represented by

    F r = v    (20)

Differentiating this equation with respect to \alpha,

    F \frac{\partial r}{\partial \alpha} = -\frac{\partial F}{\partial \alpha} r + \frac{\partial v}{\partial \alpha}    (21)

For a given solution of Equation (20), F and r are known and \partial F / \partial \alpha can be determined analytically. In order to determine the derivative at the point r, \partial r / \partial \alpha can then be solved for as a system of linear equations. We can analyze the effect of a specific parameter, e.g., an edge weight, by substituting it for the general parameter \alpha.

4.2.2 CBGC

The CBGC objective function is

    \min \; \beta \frac{q^T L^{(fm)} q}{q^T D^{(fm)} q} + (1 - \beta) \frac{p^T L^{(mw)} p}{p^T D^{(mw)} p}, \quad \text{subject to certain constraints}    (22)

If we drop the constant weighting parameter and consider only the first term, we get

    \min \; \frac{q^T L^{(fm)} q}{q^T D^{(fm)} q}    (23)

With reference to the previous discussion of spectral graph partitioning in Section 2, we know from Equations (6) and (7) that the solution minimizing this term is the eigenvector corresponding to the second smallest eigenvalue \lambda_2 of the generalized eigenvalue problem

    L^{(fm)} q = \lambda_2 D^{(fm)} q    (24)

Differentiating this equation with respect to \alpha,

    \frac{\partial L^{(fm)}}{\partial \alpha} q + L^{(fm)} \frac{\partial q}{\partial \alpha} = \frac{\partial \lambda_2}{\partial \alpha} D^{(fm)} q + \lambda_2 \frac{\partial D^{(fm)}}{\partial \alpha} q + \lambda_2 D^{(fm)} \frac{\partial q}{\partial \alpha}    (25)

Using the Rayleigh quotient and the chain rule, it is possible to calculate \partial \lambda_2 / \partial \alpha. The Rayleigh quotient is

    \lambda_2 = \frac{q^T L^{(fm)} q}{q^T q}    (26)

Applying the chain rule,

    \frac{\partial \lambda_2}{\partial \alpha} = \frac{\partial \lambda_2}{\partial q} \frac{\partial q}{\partial \alpha}    (27)

In the above equation, \partial \lambda_2 / \partial q can be calculated from Equation (26) as

    \frac{\partial \lambda_2}{\partial q} = 2 L^{(fm)} q (q^T q)^{-1} - 2 q^T L^{(fm)} q (q^T q)^{-2} q    (28)

From Equation (25), since all the terms are either known or can be calculated analytically, we obtain a system of linear equations which may be solved for \partial q / \partial \alpha. If we now consider the second term in Equation (22) and proceed similarly, we get

    \frac{\partial L^{(mw)}}{\partial \alpha} p + L^{(mw)} \frac{\partial p}{\partial \alpha} = \frac{\partial \lambda_2}{\partial \alpha} D^{(mw)} p + \lambda_2 \frac{\partial D^{(mw)}}{\partial \alpha} p + \lambda_2 D^{(mw)} \frac{\partial p}{\partial \alpha}    (29)

Equations (21), (25) and (29) show that the derivative of the CIHC solution is never degenerate. On the other hand, the CBGC solution may be degenerate depending on the value of \lambda_2 and the state of its corresponding eigenvector.
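A sensitivity computation in the sense of Equation (21) can be sketched as follows: given a solution r of F r = v and the analytically determined derivative of F with respect to a parameter (e.g., a single edge weight), the derivative of r is obtained from one more least-squares solve. Argument names are illustrative.

```python
import numpy as np

def cihc_sensitivity(F, r, dF, dv=None):
    """Sketch of Equation (21): solve F (dr/dalpha) = -(dF/dalpha) r + dv/dalpha
    in the least-squares sense, where dF is the derivative of the stacked
    Laplacian with respect to the parameter of interest."""
    rhs = -dF @ r
    if dv is not None:
        rhs = rhs + dv
    dr, *_ = np.linalg.lstsq(F, rhs, rcond=None)
    return dr
```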
5. EXPERIMENTS AND RESULTS

5.1 Data Preparation

For dataset preparation we followed the same approach as in [11]. A web crawler was first sent out to crawl some of the Webpages in the Yahoo directory (http://dir.yahoo.com/Arts/Visual Arts/Photography/) to extract images and their surrounding texts. Images with width-height ratios larger than 5 or smaller than 1/5, and images with both width and height less than 60 pixels, were removed. After this, the image database consisted of 15,000 images, which were manually assigned to 42 categories. The extracted surrounding text was filtered to remove common stop words such as conjunctions, articles, etc., and the words left over were taken as the textual features of the image. We observed that the quality of the surrounding text varies depending on the source Webpage from which an image was extracted. Correspondingly, we refer to the textual features of each category as "poor", "average" or "good". For the experiments, we randomly selected 10 of these categories, shown in Table I.

Table I: Image categories used in the experiments

    Category Name     Category Size   Text Quality
    Owls              71              Average
    Flowers           64              Average
    Lions             56              Average
    Elephants         85              Good
    Horses            76              Average
    Snow Mountains    82              Average
    Flying Eagle      67              Average
    Dusk              58              Poor
    Plants            79              Poor
    Railways          77              Good

To ensure that the image-text bipartite graph is not disconnected when two categories are mixed together, we added an extra dummy word to the graph that connects to all the images; the weight of each of these edges was set to the reciprocal of the number of images in the graph. For the visual features, we used the PCA-SIFT image descriptors [18] (http://www.cs.cmu.edu/~yke/pcasift/).
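The dummy-word construction described above amounts to appending one extra column to the image-word matrix, e.g. (assuming B stores images as rows and words as columns):

```python
import numpy as np

def add_dummy_word(B):
    """Append the dummy word of Section 5.1: one extra word connected to
    every image with weight 1/|M|, keeping the image-word bipartite graph
    connected when two categories are mixed."""
    n_images = B.shape[0]
    dummy = np.full((n_images, 1), 1.0 / n_images)
    return np.hstack([B, dummy])
```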
5.2 Web Image clustering results

We first demonstrate the need to integrate both the visual and the textual features simultaneously. We present the results of clustering Web images using only textual features and only visual features, and then show that the performance can be improved by integrating the two simultaneously under the CIHC framework. We also compare the image clustering results of CIHC with TICA and CBGC.

For Web image clustering using only the surrounding text, we selected 4 image categories, viz., Elephants, Trains, Dusk and Plants. We mixed Elephants & Trains, which both have "good" textual features, and Dusk & Plants, which have "poor" ones. We applied the ICA algorithm to the Image-Text bipartite graph to get the partitions. For all the experiments, we have used the volume as per Equation (13). In Figure 7, we show the embedding values of the images and the partitioning obtained; the dotted line separates the image categories. As expected, in the case of Elephants & Trains we get perfect clustering results. However, the Dusk & Plants images were extracted from photo gallery webpages that hardly had any worthwhile text in them. Images from these two categories had a lot of noisy text such as photographer names, photographer addresses, copyrights, etc. Consequently, the clustering result on these categories using only the surrounding text is very poor. This experiment shows that if the textual features of the image categories are very good, they alone are enough to yield optimum results. However, that is not always true of the surrounding text around Web images.

Figure 7: Image-Text Co-clustering. (Left) Both Elephants and Railways have "good" textual features, leading to perfect clustering. (Right) Dusk and Plants have "poor" textual features that are not sufficient for separating the two categories. Most Web images have noisy surrounding text that is not by itself sufficient for clustering.

With regard to Web image clustering relying only on visual features, we selected 4 categories, viz., Horses, Snow Mountains, Owls and Flying Eagle. For this experiment, we first mixed Horses and Snow Mountains by selecting 35 visually dissimilar images from each of them. In Figure 8, we show sample images from both these categories and the image clustering results achieved by applying ICA to the Image-Feature bipartite graph. Images from these two categories are mainly dissimilar, with few similarities in some images such as the sky or greenery in the background. Due to this fact, we are able to separate the two categories to a great extent using only the visual features. On the other hand, although Owls and Flying Eagle belong to two different semantic categories, their images resemble each other on a number of grounds, as shown in Figure 9. Images from these categories have a bird-like object with similar backgrounds. Owing to the semantic gap, the visual features are unable to represent the semantics accurately.

Figure 8: Image-Feature Co-clustering. Most of the images selected from Horses and Snow Mountains are separable using only the visual features. The Horses images that were misclassified were due to similar backgrounds such as sky or snow.

Figure 9: Image-Feature Co-clustering. Owls and Eagles are inseparable using only the visual features due to the semantic gap. Images from both categories have similar backgrounds and a bird-like structure at the center of the image.

Figures 7, 8 and 9 show that simultaneously integrating the low-level visual feature representation and the high-level semantics coming from the textual features may yield better clustering results.
"Average" textual features with tolerable amount of noise along with visual features is good enough to get decent clustering results. Secondly, the capability of CIHC to achieve the same can b e seen. We have compared CIHC with CBGC in terms of the image clustering results, scalability and computational sp eed. As mentioned in Section 3.3, CBGC is a very parameterdep endent framework which relies heavily on the values of 1 , 2 and . To decide on the values for these 3 parameters, we followed the approach suggested by the authors [11]. We randomly selected two image categories (E lephants and S nowM outains) and varied the values of all the parameters b etween 0 and 1 for the partitioning. Values that gave b est clustering result were chosen and used for the rest of the categories. The problem with CBGC is that these values are category sp ecific and have to b e retuned for the other categories. As an example, in Figure 12, we show the clustering results for E lephants and S nowM ountains using CBGC and CIHC. Based on the ab ove parameter values, CBGC is able to get p erfect clustering results. On the same dataset, CIHC gets similar results. In another case involving F ly ing E ag le and Lions shown in Figure 13, CBGC has misclassified a numb er of Lions images. CIHC on the other hand was able to generate very good clusters. The same is true in the case of F lower s and H or ses shown in Figure 14 where CBGC produces disastrous results and is completely outp erformed by CIHC. We observed similar results in the clustering of other categories. To summarize, in CBGC, parameters tuned for partitioning certain categories produce excellent results for those particular categories and may not work well for others. CIHC framework instead does not require any parameter values to b e pre-determined. 6 x 10 -3 CBGC 9 8 x 10 -16 CIHC 4 Embedding values of the images Embedding values of images 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 Elephants and Snow Mountains images 7 6 5 4 3 2 -6 1 2 0 -2 -4 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 Elephants and Snow Mountains images In the clustering of E lephants and S now M outains, CBGC is able to get perfect clustering due to the fact that the values used for the parameters , 1 and 2 are suitable for separating these two categories. CIHC is devoid of any parameters and is able to get comparable results on the same dataset. Figure 12: To evaluate the image clustering p erformance of CIHC and CBGC across all the categories, we have used the crossaccuracy metric [11]. If two image categories having n1 and n2 images resp ectively are mixed, then the ground truth 324 WWW 2008 / Refereed Track: Rich Media 3 2 1 Embedding values of images 0 -1 -2 -3 -4 -5 -6 Embedding values of images x 10 -3 April 21-25, 2008. Beijing, China Table II: Average clustering performance Category Name Owls Flowers Lions Elephants Horses Snow Mountains Flying Eagle Dusk Plants Railways CBGC 0.5898 0.6248 0.6544 0.8706 0.6363 0.6050 0.6728 0.5412 0.6386 0.6631 CIHC 0.8241 0.8342 0.8132 0.8941 0.8244 0.7746 0.8608 0.5614 0.6543 0.8070 CBGC -2 -3 -4 -5 -6 -7 -8 -9 -10 x 10 -16 CIHC 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 Eagles and Lions images 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 Eagles and Lions images Figure 13: Parameters previously set for CBGC are unable to separate F ly ing E ag les and Lions. A number of Lions images are misclassified. On the other hand, CIHC performs well. 
To evaluate the image clustering performance of CIHC and CBGC across all the categories, we used the cross-accuracy metric [11]. If two image categories having n_1 and n_2 images, respectively, are mixed, then the ground truth Boolean vector r_t can be written as

    r_t = (1, 1, ..., 1, 0, 0, ..., 0)    (30)

where the first n_1 elements are set to 1 and the remaining n_2 elements are set to 0. The image clustering result can be represented as a Boolean vector r_c having the same ordering of elements as r_t. Cross-accuracy is then defined as

    accuracy = \max\left( \frac{\sum_i (r_{t_i} \oplus r_{c_i})}{n_1 + n_2}, \; 1 - \frac{\sum_i (r_{t_i} \oplus r_{c_i})}{n_1 + n_2} \right)    (31)

where \oplus represents the exclusive-OR operation. We mixed every image category with each of the remaining categories and measured the accuracy of the clustering. In Figure 15, we plot the accuracy of CIHC (Y-axis) against that of CBGC (X-axis). Each circle in the plot represents a possible image category pair. It can be seen that most of the circles fall in the upper part of the diagonal, which indicates that in the clustering of most image category pairs, CIHC outperforms CBGC. The few circles in the lower part of the diagonal are the category pairs for which CBGC has been properly tuned with the parameter values. In Table II, we show the mean accuracy between each category and all other categories for both algorithms. CIHC has a higher mean accuracy and outperforms CBGC on all the categories.

Figure 15: Clustering performance of CBGC and CIHC on all image category pairs. Each circle represents a possible category pair. Most of the circles fall in the upper part of the diagonal.

Table II: Average clustering performance

    Category Name     CBGC      CIHC
    Owls              0.5898    0.8241
    Flowers           0.6248    0.8342
    Lions             0.6544    0.8132
    Elephants         0.8706    0.8941
    Horses            0.6363    0.8244
    Snow Mountains    0.6050    0.7746
    Flying Eagle      0.6728    0.8608
    Dusk              0.5412    0.5614
    Plants            0.6386    0.6543
    Railways          0.6631    0.8070
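A direct transcription of the cross-accuracy of Equation (31); the orientation of the Boolean vectors follows Equation (30), and the example values are made up.

```python
import numpy as np

def cross_accuracy(r_t, r_c):
    """Cross-accuracy of Equation (31): since the two cluster labels are
    interchangeable, take the better of the exclusive-OR fraction and its
    complement.  r_t is the ground-truth Boolean vector of Equation (30)
    and r_c is the Boolean clustering result."""
    r_t = np.asarray(r_t, dtype=bool)
    r_c = np.asarray(r_c, dtype=bool)
    frac = np.mean(r_t ^ r_c)          # fraction of exclusive-OR mismatches
    return max(frac, 1.0 - frac)

# e.g. two categories with n1 = 3 and n2 = 2 images
print(cross_accuracy([1, 1, 1, 0, 0], [0, 0, 1, 1, 1]))   # 0.8
```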
5.3 Computational Speed

We now compare the computational speed of CIHC with that of CBGC. The time taken by both algorithms depends on the sparseness of the two data matrices; in other words, it takes more time to partition a densely connected tripartite graph than a sparsely connected one. For this reason, we considered the worst-case scenario of a fully connected tripartite graph (with uniform weights), in which every vertex of both bipartite graphs is connected to all the vertices of the other type. Since the time required to cut the indicator vector is the same for both algorithms, we compare on the basis of the time required to compute the indicator vector. The algorithms were implemented in MATLAB 7.0 (http://www.mathworks.com). For the CBGC implementation, we made use of the SDP library SDPA-M [10] (http://grid.r.dendai.ac.jp/sdpa/). The experiment was performed on a machine with a 3 GHz Intel Pentium 4 processor and 1 GB of RAM. In Figure 16, we plot the time required by the algorithms as the number of vertices in the fully connected tripartite graph increases. The time for CIHC increases gradually with the number of vertices; for the largest graph we tried, with almost 4,000 vertices, CIHC required only about 98 seconds. CBGC, on the other hand, was unable to keep up with CIHC: as can be seen, the time required by CBGC shoots up already for a few hundred vertices in the graph. Moreover, CBGC is unscalable and unable to handle larger graphs, which was also verified by [23, 22]; in our experiment, CBGC was unable to handle graphs with more than 1,500 vertices. This experiment clearly demonstrates the computational efficiency of CIHC and its potential applicability in large-scale real-world applications.

Figure 16: Computational speed comparison of CIHC with CBGC. The time required by each algorithm to compute the indicator vector is displayed for an increasing number of vertices in the tripartite graph.
6. CONCLUSIONS AND FUTURE WORK

In this paper, we addressed the problem of Web image clustering by simultaneous integration of visual and textual features from a graph partitioning perspective. In particular, we modelled visual features, images, and words from the surrounding text of the images using a tripartite graph. This graph is considered as a fusion of two bipartite graphs, which are partitioned simultaneously by the proposed CIHC framework. Although a similar approach has been adopted before, the main contribution of this work lies in the computational efficiency, the quality of Web image clustering, and the scalability to large image repositories that CIHC is able to achieve. We demonstrated this through experiments performed on real Web images. In future work, there are a number of directions we are actively pursuing. Currently, in order to get more than two partitions, we apply CIHC recursively, which is a common approach in many other graph partitioning algorithms [11, 26, 15, 28]; we are investigating methods to obtain more than two clusters directly. Another extension of this work is to obtain flexible clusterings for each of the vertex types in the tripartite graph. That is, we currently have a hard partitioning framework in which there is a one-to-one association between the features, images and words belonging to one cluster. It would be interesting to discover associations between visual features grouped in one cluster and words or images grouped in another. We are also working on allowing a different number of partitions for each of the visual features, images and words, instead of all having the same number. To this end, we have found the recent work on matrix factorization by Long et al. [23, 22] very interesting and helpful.

7. ACKNOWLEDGMENTS

This research was partially funded by the 21st Century Jobs Fund Award, State of Michigan, under grant 06-1P1-0193, and by the National Science Foundation, under grant IIS-0713315.

8. REFERENCES

[1] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[2] D. Cai, X. He, Z. Li, W.-Y. Ma, and J.-R. Wen. Hierarchical clustering of WWW image search results using visual, textual and link information. In Proc. of ACM Multimedia, 2004.
[3] M. Cascia, S. Sethi, and S. Sclaroff. Combining textual and visual cues for content-based image retrieval on the world wide web. In Proc. of IEEE CBAIVL, 1998.
[4] F. R. K. Chung. Spectral Graph Theory. American Mathematical Society, 1997.
[5] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, 1990.
[6] I. S. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. In Proc. of ACM KDD, 2001.
[7] C. H. Q. Ding. Unsupervised feature selection via two-way ordering in gene expression analysis. Bioinformatics, 19:1259-1266, 2003.
[8] J. Dodziuk. Difference equations, isoperimetric inequality and the transience of certain random walks. Transactions of the American Mathematical Society, 284:787-794, 1984.
[9] A. A. Freitas. A critical review of multi-objective optimization in data mining: a position paper. SIGKDD Explorations Newsletter, 6(2):77-86, 2004.
[10] K. Fujisawa, Y. Futakata, M. Kojima, S. Matsuyama, S. Nakamura, K. Nakata, and M. Yamashita. SDPA-M (SemiDefinite Programming Algorithm in MATLAB) user's manual, version 6.2.0. Technical report, Dept. of Mathematical and Computing Sciences, Tokyo Institute of Technology, http://grid.r.dendai.ac.jp/sdpa/publication.html, 2000.
[11] B. Gao, T.-Y. Liu, T. Qin, X. Zheng, Q.-S. Cheng, and W.-Y. Ma. Web image clustering by consistent utilization of visual features and surrounding texts. In Proc. of ACM Multimedia, 2005.
[12] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, 1989.
[13] R. C. Gonzalez and R. E. Woods. Digital Image Processing. Prentice Hall, Upper Saddle River, NJ, 2002.
[14] S. Gordon, H. Greenspan, and J. Goldberger. Applying the information bottleneck principle to unsupervised clustering of discrete and continuous image representations. In Proc. of IEEE ICCV, 2003.
[15] L. Grady and E. L. Schwartz. Isoperimetric graph partitioning for image segmentation. IEEE Transactions on PAMI, 28(3):469-475, 2006.
[16] L. Grady and E. L. Schwartz. Isoperimetric partitioning: A new algorithm for graph partitioning. SIAM Journal on Scientific Computing, 27(6):1844-1866, 2006.
[17] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264-323, 1999.
[18] Y. Ke and R. Sukthankar. PCA-SIFT: A more distinctive representation for local image descriptors. In Proc. of IEEE CVPR, 2004.
[19] R. Kumar, U. Mahadevan, and D. Sivakumar. A graph-theoretic approach to extract storylines from search results. In Proc. of ACM KDD, 2004.
[20] C. L. Lawson and R. J. Hanson. Solving Least Squares Problems. Society for Industrial and Applied Mathematics, 1995.
[21] Z. Li, G. Xu, M. Li, W.-Y. Ma, and H.-J. Zhang. Grouping WWW image search results by novel inhomogeneous clustering method. In Proc. of MMM, 2005.
[22] B. Long, X. Wu, Z. Zhang, and P. S. Yu. Unsupervised learning on k-partite graphs. In Proc. of ACM KDD, 2006.
[23] B. Long, Z. Zhang, X. Wu, and P. S. Yu. Spectral clustering for multi-type relational data. In Proc. of ICML, 2006.
[24] B. Mohar. Isoperimetric numbers of graphs. Journal of Combinatorial Theory, Series B, 47:274-291, 1989.
[25] G. Qiu. Image and feature co-clustering. In Proc. of IEEE ICPR, 2004.
[26] M. Rege, M. Dong, and F. Fotouhi. Co-clustering documents and words using bipartite isoperimetric graph partitioning. In Proc. of IEEE ICDM, 2006.
[27] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on PAMI, 22(8):888-905, 2000.
[28] H. Zha, X. He, C. H. Q. Ding, H. Simon, and M. Gu. Bipartite graph partitioning and data clustering. In Proc. of ACM CIKM, 2001.
[29] R. Zhao and W. I. Grosky. Narrowing the semantic gap: improved text-based web document retrieval using visual features. IEEE Transactions on Multimedia, 4(2):189-200, 2002.