WWW 2008 / Poster Paper, April 21-25, 2008, Beijing, China

Automatic Web Image Selection with a Probabilistic Latent Topic Model
Keiji Yanai
The University of Electro-Communications, Chofu, Tokyo 182-8585, Japan
yanai@cs.uec.ac.jp

ABSTRACT
We propose a new method to select images relevant to given keywords from images gathered from the Web, based on the Probabilistic Latent Semantic Analysis (PLSA) model, a probabilistic latent topic model originally proposed for text document analysis. The experimental results show that the results of the proposed method are almost equivalent to, or outperform, the results of existing methods. In addition, our method selects a wider variety of images than the existing SVM-based method.

Categories and Subject Descriptors: I.4 [Image Processing and Computer Vision]: Miscellaneous
General Terms: Algorithms, Experimentation
Keywords: Web image mining, image recognition

1. INTRODUCTION
Because of the recent growth of the World Wide Web, we can easily gather a huge amount of image data. However, the raw outputs of Web image search engines contain many irrelevant images, since they do not employ image analysis and rely only on HTML text analysis to rank images. Our goal is to gather a large number of images relevant to given keywords. In particular, we wish to build a large-scale generic image database consisting of many highly relevant images for each of thousands of concepts, which can be used as large ground-truth data for generic object recognition research. To realize this, we have proposed several Web image gathering systems employing image recognition methods [5, 6, 7].

In this paper, we apply Probabilistic Latent Semantic Analysis (PLSA) to the Web image gathering task. Recently, PLSA has been applied to object recognition as a probabilistic generative model [4]. However, PLSA has not been applied to Web images, with the exception of [1]. The difference between this paper and [1] is that [1] selects just one topic as the relevant topic, while our method selects relevant images based on a mixture of positive topics. This can be regarded as an extension of our previous work [6], which employed region segmentation and a probabilistic model based on a Gaussian mixture model (GMM). In [6], an image is represented as a set of region feature vectors such as color, texture and shape, while in this paper we use the bag-of-visual-words representation [2] to represent an image. A method to recognize images based on a mixture of topics has already been proposed in [4]; our work can be regarded as the Web-image version of that work.

In this paper, we propose a fully automated PLSA-based Web image selection method for the Web image-gathering task. The method employs the bag-of-visual-words as the image representation and a PLSA-based topic mixture model as the probabilistic model. Our main objective is to examine whether the bag-of-visual-words model and the PLSA-based model are also effective for the Web image gathering task, where training images always contain some noise.

2. OVERVIEW OF THE METHOD
We assume that the method we propose in this paper is used in the image selection stage of the Web image-gathering system [6, 7]. The system gathers images associated with the keywords given by a user fully automatically. The input of the system is therefore just keywords, and the output is several hundred or thousand images associated with the keywords. The system consists of two stages: the collection stage and the selection stage.

In the collection stage, the system carries out HTML-text-based image selection based on the method we proposed before [5]. The basic idea of this stage is to gather as many images related to the given keywords as possible from the Web with Web text search engines such as Google and Yahoo, and to select candidate images which are likely to be associated with the given keywords by analyzing the surrounding HTML text with simple heuristics. Particularly high-scored images among the candidate images are selected as pseudo-training images for training the probabilistic model. To explain the HTML analysis briefly: if the ALT tag, the words of the HREF link, or the image file name includes the given keywords, the image is regarded as a pseudo-training image; if the other tags or text words surrounding an image link include the given keywords, the image is regarded as a normal candidate image. Although the former rule for selecting training images is strongly restrictive, this simple rule can find highly relevant images usable as pseudo-training samples by examining a great many images gathered from the Web. The details of the collection stage are described in [5].
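To make the collection-stage heuristics concrete, the following Python sketch classifies a Web image link as a pseudo-training image, a candidate image, or an irrelevant one. The field names and the exact matching rule are illustrative assumptions, not the precise rules or scoring of [5].

```python
# A minimal sketch of the collection-stage heuristics described above.
# The ImageLink fields and classify() logic are illustrative assumptions;
# the exact rules and scoring are those of [5] and are not reproduced here.

from dataclasses import dataclass

@dataclass
class ImageLink:
    file_name: str         # e.g. "sunset_beach.jpg"
    alt_text: str          # content of the ALT attribute
    href_words: str        # words of the HREF link around the image
    surrounding_text: str  # other tags / text near the image link

def classify(link: ImageLink, keyword: str) -> str:
    """Classify a Web image link as pseudo-training, candidate, or irrelevant."""
    kw = keyword.lower()
    strong_fields = (link.alt_text, link.href_words, link.file_name)
    weak_fields = (link.surrounding_text,)
    if any(kw in f.lower() for f in strong_fields):
        return "pseudo-training"   # restrictive rule -> pseudo-training sample
    if any(kw in f.lower() for f in weak_fields):
        return "candidate"         # looser rule -> normal candidate image
    return "irrelevant"

# Example:
link = ImageLink("lion_safari.jpg", "a lion resting", "wildlife photos", "African animals")
print(classify(link, "lion"))      # -> "pseudo-training"
```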
In the selection stage, the proposed model is trained with the pseudo-training images selected automatically in the collection stage, and is applied to select relevant images from the candidate images. Note that all pseudo-training images remain part of the candidate images, since pseudo-training images are themselves Web images and contain several irrelevant images which should be removed.

As the image representation, we adopt the bag-of-visual-words representation [2]. Despite its simplicity, it has been shown to represent image concepts very well in the context of visual object recognition. The basic idea of the bag-of-visual-words representation is that a set of local image patches is sampled by an interest point detector or on a grid, and a visual descriptor is computed for each patch with the Scale Invariant Feature Transform (SIFT) [3]. The resulting descriptor vectors are then quantized against a pre-specified codebook, and the resulting visual-word histogram is used as the characterization of the image.
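As a rough illustration of this pipeline, the sketch below quantizes SIFT-style descriptors against a codebook built with plain k-means and returns a normalized visual-word histogram. The codebook size and the naive k-means implementation are illustrative choices, not values or algorithms fixed by the paper; descriptors are assumed to be already extracted.

```python
# A minimal numpy sketch of the bag-of-visual-words representation described above.
# Descriptors are assumed to be 128-dimensional SIFT vectors already extracted
# from sampled patches; the codebook size of 1000 is an illustrative assumption.

import numpy as np

def build_codebook(all_descriptors: np.ndarray, n_words: int = 1000,
                   n_iter: int = 20, seed: int = 0) -> np.ndarray:
    """Build a visual-word codebook with a naive k-means (clarity over efficiency)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(all_descriptors), n_words, replace=False)
    centers = all_descriptors[idx].astype(float)
    for _ in range(n_iter):
        # assign each descriptor to its nearest center, then recompute the centers
        dist = ((all_descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dist.argmin(1)
        for w in range(n_words):
            members = all_descriptors[labels == w]
            if len(members) > 0:
                centers[w] = members.mean(0)
    return centers

def bovw_histogram(descriptors: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Quantize one image's descriptors against the codebook; return a normalized histogram."""
    dist = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = dist.argmin(1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)
```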
The proposed model is based on Probabilistic Latent Semantic Analysis (PLSA), which is originally an unsupervised latent topic model. First, we apply the PLSA method to the candidate images with the given number of topics, and obtain the probability of each topic for each image, P(z|I). Next, we calculate the probability of each topic being positive or negative, P(pos|z) and P(neg|z), using the pseudo-training images, assuming that all candidate images other than the pseudo-positive images are negative samples. Here, a "positive topic" is a latent topic that generates images relevant to the given keywords, and a "negative topic" is a latent topic that generates irrelevant images. Finally, the probability of being positive for each candidate image, P(pos|I), is calculated by marginalizing over topics:

    P(pos|I) = \sum_{z \in Z} P(pos|z) P(z|I),    (1)

where z \in Z represents a latent topic and the number of topics is set to the given number k. We rank all the candidate images by this probability, P(pos|I), and obtain the final result.
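The sketch below turns this selection step into code: it estimates P(pos|z) from the pseudo-training labels and ranks candidates with Eq. (1). The PLSA fitting itself is assumed to be done elsewhere (only its output P(z|I) is used), and the particular estimator of P(pos|z), the share of each topic's probability mass contributed by pseudo-training images, is one plausible choice rather than the paper's exact formulation.

```python
import numpy as np

def positive_topic_probs(p_z_given_img: np.ndarray,
                         is_pseudo_positive: np.ndarray) -> np.ndarray:
    """
    Estimate P(pos|z) for each latent topic z.
    p_z_given_img:       (n_images, n_topics) array of P(z|I) output by PLSA
                         over all candidate images.
    is_pseudo_positive:  boolean (n_images,), True for pseudo-training images;
                         all other candidates are treated as negative samples.
    """
    pos_mass = p_z_given_img[is_pseudo_positive].sum(0)   # topic mass from positives
    total_mass = p_z_given_img.sum(0)                     # topic mass from all candidates
    return pos_mass / np.maximum(total_mass, 1e-12)       # P(pos|z)

def rank_candidates(p_z_given_img: np.ndarray,
                    is_pseudo_positive: np.ndarray) -> np.ndarray:
    """Rank candidate images by P(pos|I) = sum_z P(pos|z) P(z|I), as in Eq. (1)."""
    p_pos_given_z = positive_topic_probs(p_z_given_img, is_pseudo_positive)
    p_pos_given_img = p_z_given_img @ p_pos_given_z       # marginalize over topics
    return np.argsort(-p_pos_given_img)                   # indices, most relevant first
```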
3. EXPERIMENTAL RESULTS
We carried out experiments on the following eight concepts independently: sunset, mountain, waterfall, beach, flower, lion, apple and Chinese noodle. The first four are "scene" concepts, and the rest are "object" concepts. In the collection stage, we obtained around 5000 URLs for each concept from several Web search engines, including Google Search and Yahoo Web Search.

Table 1 shows the precision of the top 100 output images of Google Image Search, the number and the precision of positive images and candidate images, and the results of image selection by the region-based probabilistic method employing GMM [6] and the bag-of-visual-words-based method employing SVM [7] for comparison. In the experiments, all the precisions except those of positive and candidate images are evaluated at 15% recall. The 7th to 11th columns of Table 1 show the precision of the PLSA-based image selection when the number of topics k is varied from 10 to 100.

Table 1: The precision of the top 100 output images of Google Image Search; the number and the precision (at 15% recall) of positive images and candidate images selected automatically in the collection stage; the results of image selection by the region-based probabilistic method employing GMM [6] and the bag-of-visual-words-based method employing SVM [7] for comparison; and the results of the proposed PLSA-based method with five different values of k, the number of topics.

concept          Google  positive imgs  candidate imgs   GMM    SVM    k=10   k=20   k=30   k=50   k=100  BEST
sunset             85      790 (67)      1500 (55.3)    100.0   98.0   95.1   96.0   96.0   95.1   97.0   97.0
mountain           57     1950 (88)      5837 (79.2)     96.5  100.0   93.9   96.5   96.5   96.5   96.5   96.5
waterfall          78     2065 (71)      4649 (70.3)     82.0   90.7   75.3   78.1   75.3   76.8   74.5   78.1
beach              67      768 (69)      1923 (65.5)     75.0   99.0   92.5   94.2   96.1   94.2   93.3   96.1
flower             71      576 (72)      1994 (69.6)     78.5   91.9   83.9   82.3   80.8   81.3   81.3   83.9
lion               52      511 (87)      2059 (66.0)     74.6   85.7   82.5   66.7   64.7   84.6   85.7   85.7
apple              49     1141 (78)      3278 (64.3)     81.0   90.7   88.2   82.7   84.8   87.0   83.8   88.2
Chinese noodle     68      901 (78)      2596 (66.6)     70.9   95.3   93.8   90.9   89.5   95.2   95.2   95.2
TOTAL/AVG.       65.9     8702 (76)     23836 (66.5)     82.4   93.9   88.2   85.9   85.5   88.8   88.4   90.1
(The columns k=10 through k=100 and BEST are the proposed PLSA-based method.)

In terms of the best results, the precision for each keyword is almost equivalent to the precision of SVM and outperforms GMM and Google Image Search. As shown in Table 1, the average precision of the positive images is 76%, while the average precision of the candidate images is 65%. Although the difference is about 10%, which is not so large, our proposed strategy of estimating positive and negative topics worked well in most cases. Regarding the number of topics k at which the best result was obtained, there is no prominent tendency. For future work, we need to study how to decide the number of topics, which sometimes influences the result greatly. For example, in the case of "apple", the precision was 85.7% for k = 100, while it was 64.7% for k = 30.

The biggest difference from [7] is that our higher-ranked results include a variety of images, as shown in Fig. 1, while those by SVM [7] include similar and uniform images, as shown in Fig. 2. This is because our proposed method is based on a mixture of topics.

Figure 1: "Mountain" by PLSA.
Figure 2: "Mountain" by SVM.

The experimental results are available on the Web: http://mm.cs.uec.ac.jp/yanai/www08/

4. REFERENCES
[1] R. Fergus, L. Fei-Fei, P. Perona, and A. Zisserman. Learning object categories from Google's image search. In Proc. of IEEE International Conference on Computer Vision, pages 1816-1823, 2005.
[2] G. Csurka, C. Bray, C. Dance, and L. Fan. Visual categorization with bags of keypoints. In Proc. of ECCV Workshop on Statistical Learning in Computer Vision, pages 59-74, 2004.
[3] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91-110, 2004.
[4] F. Monay and D. Gatica-Perez. Modeling semantic aspects for cross-media image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(10):1802-1817, 2007.
[5] K. Yanai. Generic image classification using visual knowledge on the web. In Proc. of ACM International Conference on Multimedia, pages 67-76, 2003.
[6] K. Yanai and K. Barnard. Probabilistic Web image gathering. In Proc. of ACM SIGMM International Workshop on Multimedia Information Retrieval, pages 57-64, 2005.
[7] K. Yanai. Image Collector III: A Web image-gathering system with bag-of-keypoints. In Proc. of the International World Wide Web Conference (poster paper), 2007.