WWW 2008 / Alternate Track: WWW in China - Mining the Chinese Web, April 21-25, 2008, Beijing, China

Can Chinese Web Pages be Classified with English Data Source?

Xiao Ling, Gui-Rong Xue, Wenyuan Dai, Yun Jiang, Qiang Yang, Yong Yu
Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai 200240, China
{shawnling, grxue, dwyak, yunjiang, yyu}@apex.sjtu.edu.cn
Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong
qyang@cs.ust.hk

Gui-Rong Xue is the corresponding author.

Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2008, April 21-25, 2008, Beijing, China. ACM 978-1-60558-085-2/08/04.

ABSTRACT
As the World Wide Web in China grows rapidly, mining knowledge in Chinese Web pages becomes more and more important. Mining Web information usually relies on machine learning techniques, which require a large amount of labeled data to train credible models. Although the number of Chinese Web pages increases quite fast, Chinese labeled data are still scarce. However, there are relatively sufficient labeled English Web pages. These labeled data, though in a different linguistic representation, share a substantial amount of semantic information with the Chinese ones, and can be utilized to help classify Chinese Web pages. In this paper, we propose an information bottleneck based approach to address this cross-language classification problem. Our algorithm first translates all the Chinese Web pages into English. Then, all the Web pages, both Chinese and English, are encoded through an information bottleneck which allows only limited information to pass. Therefore, in order to retain as much useful information as possible, the common part between Chinese and English Web pages tends to be encoded to the same code (i.e. class label), which makes the cross-language classification accurate. We evaluated our approach using Web pages collected from the Open Directory Project (ODP). The experimental results show that our method significantly improves several existing supervised and semi-supervised classifiers.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms
Algorithms, Experimentation

Keywords
Cross-Language Classification, Information Bottleneck

1. INTRODUCTION
A dramatic development of the Internet in China has been witnessed in recent years. The number of Internet users in China now exceeds 160 million, and the Chinese Web pages are numbered in billions (according to the report by CNNIC in January 2007: http://www.cnnic.net.cn/uploadfiles/pdf/2007/2/13/95522.pdf). As the most commonly used language second only to English, Chinese is expected to enjoy such a rocketing increase in scale. Because the Web pages written in Chinese are becoming a major information source on the Internet, more research efforts are now devoted to organizing and mining the Chinese Web pages via Web mining techniques, such as Chinese blog mining [20] and query log analysis [19].

A potential problem in mining the Chinese Web pages is the lack of sufficient labeled data. As we know, classification requires a large amount of labeled training data. Generally speaking, the more labeled training data one can obtain, the better the classification accuracy and robustness are. Fortunately, due to many reasons, there exists a lot of labeled Web-page information in English, in particular in the machine learning community. Examples of these resources are Reuters-21578 [16], 20 Newsgroups [15], and the Open Directory Project [22]. It is thus useful and intriguing to fully utilize the labeled documents in English to help classify the Web pages in Chinese. This problem is called cross-language Web-page classification.

In this paper, we address this important problem using a novel information theory based technique. Although the training and test documents are in different languages, one can use a translation tool to translate the test data sets into the English language before a classifier trained on English pages is applied. While such a method may be feasible, we observe that a simple-minded application of this method may result in serious problems because of the following reasons:

· First, due to the difference in language and culture, there exists a topic drift when we move from the English Web pages to the Chinese Web pages. This corresponds to the situation in machine learning where the training and test data have different distributions in terms of the class labels. This topic-imbalance problem needs to be overcome in our research.

· Second, due to the errors introduced in the translation process, there may be different kinds of errors in the translated text. For example, some errors may result from Chinese phrase segmentation, while others are due to ambiguities introduced by a dictionary. This translation noise problem must be addressed effectively.

· Finally, the feature spaces of the English and Chinese Web pages may be different, resulting in a situation where the training and test data may have different feature sets. We therefore must be innovative in our solution to this problem by carefully extracting the common semantic parts of the two data sets, and using these parts as a bridge to propagate the class labels.

To solve the above problems, we develop a novel approach for classifying the Web pages in Chinese using the training documents in English. Our key observation is that, despite the problems listed above, the pages in Chinese and English may share the same semantic information, although they are in different representation forms, i.e., Chinese characters and English words, respectively. This is reasonable, because people with a good command of both English and Chinese can convey the same information in both languages. Also, it is noted that the same meaning might be expressed in different ways due to cultural and linguistic differences.

Based on these observations, we propose to tackle the cross-language classification problem using information bottleneck theory [29]. Figure 1 illustrates our idea intuitively.

[Figure 1: The model of our information bottleneck cross-language classifier. English Web pages and Chinese Web pages, with their common part, are coded through an information bottleneck to produce the output labels.]

The training and translated target texts are encoded together, allowing all the information to be put through a "bottleneck" and represented by a limited number of codewords (i.e. labels in the classification problem). The information bottleneck based approach can maintain most of the common information and disregard the irrelevant information. Thus we can approximate the ideal situation where similar training and translated test pages, which form the common part, are encoded into the same codewords.

In the experiments, we collect the Web pages in English and Chinese from the Open Directory Project as the data sets. Five binary and three multi-class tasks have been set up. Our method gives significant improvement over several existing classifiers, and converges very well.

In summary, in this paper we make the following contributions:

· In addressing the cross-language Web page classification problem, we observe that there is a common part between Chinese and English documents and develop a novel method for addressing the problem of topic drift between Chinese and English documents, thus improving the classification performance.

· We propose to handle the noisy-feature and different-feature problems in cross-language Web-page classification by the information bottleneck technique. This method allows the common part in the two languages to be extracted and used for classification, despite their differences. We show experiments on real English and Chinese Web documents. Our method is shown to improve other classification algorithms.

The rest of our paper is organized as follows. In Section 2 we briefly discuss the related work. Following the basic concepts reviewed in Section 3, we introduce the information bottleneck theory in Section 4. Section 5 describes our proposed method in detail. The experiments and results are presented in Section 6. In the end, we conclude this paper with a discussion of future work in Section 7. The detailed proofs of the lemmas and theorems in this paper are given in the Appendix.

2. RELATED WORK
In this section, we review several prior research efforts most related to our work, including traditional classification, cross-language classification, and information theoretic learning.

2.1 Traditional Classification
The traditional classification formulation is built on the foundation of statistical learning theory. Two schemes are generally considered: one is supervised classification and the other is semi-supervised classification. Supervised classification focuses on the case where the labeled data are sufficient, and where the learning objective is to estimate a function that maps examples to class labels using the labeled training instances. Examples of supervised classification algorithms include decision trees [25], K-nearest-neighbor methods [6], naive Bayes classifiers [17], support vector machines [5], and so on. Semi-supervised classification [32] addresses the problem that the labeled data are too few to build a good classifier. It makes use of a large amount of unlabeled data, together with a small amount of labeled data, to enhance the classifiers. Many semi-supervised learning techniques have been proposed, e.g., co-training [4], EM-based methods [21], transductive learning [13], etc.

As we claimed, there are two main difficulties in cross-language classification, namely errors introduced by translation and bias caused by topic drift, which traditional classifiers cannot handle well. Our proposed method tries to tackle these difficulties via the information bottleneck technique.

2.2 Cross-Language Text Classification
There are several research works addressing the cross-language classification problem. Bel et al. [3] studied the English-Spanish cross-language classification problem. Two scenarios are considered in their work. One scenario assumes that training documents are available in both languages.
The other scenario is to learn a model from the text in one language and classify the data in another language via translation. In our work, we focus on the second scenario. [26] gave good empirical results on English-Italian cross-language text categorization using an EM-based learning method. Note that, to avoid trivial partitions, it applies feature selection before each iteration. [23] employed a general probabilistic English-Czech dictionary to translate Czech text into English and then classified the Czech documents using a classifier built on English training data. Other cross-language text classification research includes [11] (English-Spanish), [18] (English-Japanese), etc. In addition to text categorization, there are some other specific cross-language applications: Named Entity Recognition [30], Question Answering [8], etc.

However, most existing algorithms are based on traditional supervised or semi-supervised classification techniques. As we stated, there are two difficulties in cross-language classification, translation error and topic drift, which lead to a difference in distributions between the Web pages in the two languages. Since traditional supervised classification techniques assume an identical distribution for training and test data, and in the cross-language setting this assumption is hardly met, most existing algorithms do not cope with cross-language text classification well. In this work, our method tries to handle the Chinese-English cross-language categorization problem by the information bottleneck. We will show that our algorithm can better alleviate the impact of translation error and topic drift, and improve the (English to Chinese) cross-language classification performance over existing methods.

2.3 Information Theoretic Learning
Another related line of research is information theory based learning. Information theory is widely used in machine learning, e.g. decision trees [25], feature selection [31], etc. The information bottleneck theory (IB) was first proposed by Tishby et al. [29]. They constructed a model that uses information theory to solve the clustering problem. In their work, a rate distortion function is introduced as a loss function. They also presented a converging iterative algorithm for this self-consistent determination problem. After that, a lot of interesting works have been conducted, cf. [27]. As is known, IB is an information theoretic formulation of the clustering problem, while maximum likelihood of mixture models is a standard statistical method for clustering. Interestingly, Slonim and Weiss [28] have proved that under a certain mapping, these two approaches are strongly related. Moreover, when the input data is large enough, they are statistically equivalent. Several extensions to the information bottleneck method have been investigated recently. [9] proposed a word clustering method which minimizes the loss in mutual information between words and class labels, before and after clustering. Using a similar strategy, mutual information based [10] and Bregman divergence based [2] co-clustering were proposed.

In contrast to these works, we focus on solving the cross-language classification problem via an information theoretic approach. More specifically, the IB technique is used to mine the common part of the pages in different languages for classification.

3. PRELIMINARY
In this section, some preliminary knowledge of information theory is briefly introduced, including information entropy, mutual information and the Kullback-Leibler divergence [14]. For more details, please refer to [7].

The term entropy is used to measure the uncertainty associated with a random variable X. Formally,

    H(X) = -\int_x p(x) \log p(x) \, dx,    (1)

where x enumerates each value X may take. In a communication system, the entropy quantifies the information contained in a piece of data, i.e. its minimum average message length in bits. This means the best possible lossless data compression is limited by this measure.

Let X and Y be random variables with a joint distribution p(X, Y) and marginal distributions p(X) and p(Y). The mutual information I(X;Y) is defined as

    I(X;Y) = \int_x \int_y p(x,y) \log \frac{p(x,y)}{p(x)p(y)} \, dx \, dy.    (2)

The mutual information is a measure of the dependency between random variables. It is always non-negative, and it is zero if and only if the variables are statistically independent. Higher mutual information values indicate more certainty that one random variable depends on another.

The mutual information is related to the Kullback-Leibler (KL) divergence, or relative entropy, defined for two probability mass functions p(x) and q(x) as

    D(p \| q) = \int_x p(x) \log \frac{p(x)}{q(x)} \, dx,    (3)

where q(x) is a reference distribution. The KL-divergence can be considered as a kind of distance between the two probability distributions, although it is not a real distance measure because it is not symmetric. In addition, the KL-divergence is always non-negative due to Gibbs' inequality [7].
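As a concrete illustration of these quantities, the following minimal sketch (not part of the original paper; it assumes finite distributions stored as numpy arrays and natural logarithms) computes the entropy, mutual information and KL-divergence of Equations (1)-(3) for discrete random variables.

    import numpy as np

    def entropy(p):
        """Entropy H(X) of a discrete distribution p (Equation (1), in nats)."""
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    def mutual_information(pxy):
        """Mutual information I(X;Y) of a joint distribution table pxy (Equation (2))."""
        px = pxy.sum(axis=1, keepdims=True)
        py = pxy.sum(axis=0, keepdims=True)
        ratio = np.divide(pxy, px * py, out=np.ones_like(pxy), where=pxy > 0)
        return np.sum(np.where(pxy > 0, pxy * np.log(ratio), 0.0))

    def kl_divergence(p, q):
        """KL-divergence D(p || q) between two discrete distributions (Equation (3))."""
        mask = p > 0
        return np.sum(p[mask] * np.log(p[mask] / q[mask]))

    if __name__ == "__main__":
        # A toy 3x2 joint distribution p(X, Y).
        pxy = np.array([[0.2, 0.1],
                        [0.1, 0.3],
                        [0.2, 0.1]])
        print("H(X)    =", entropy(pxy.sum(axis=1)))
        print("I(X;Y)  =", mutual_information(pxy))
        print("D(p||u) =", kl_divergence(pxy.sum(axis=1), np.full(3, 1.0 / 3)))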
4. INFORMATION BOTTLENECK
In this section, we present the basic concepts of the information bottleneck, and show how it can be applied to cross-language classification.

4.1 Basic Concepts
The information bottleneck (IB) method is a distributional learning algorithm proposed by Tishby et al. [29]. In this theory, the clustering and classification problems can be treated as a coding process. Let X be the signals to be encoded, and \tilde{X} be the set of codewords. In classification, the codewords \tilde{X} are defined as the class labels, and classifying X can then be seen as using \tilde{X} to encode X. Usually \tilde{X} is not able to contain as much information as X. Therefore, when the signals X are encoded into codewords \tilde{X}, part of the information contained in X will be lost. This can be imagined as the information in X passing through a bottleneck \tilde{X}, as shown in Figure 2, and thus \tilde{X} is called the information bottleneck.

[Figure 2: The information bottleneck. Here, X is the signals to be encoded, and \tilde{X} is the codewords. In classification, X is the set of instances, and \tilde{X} is the prediction labels.]

A good clustering should guarantee the encoding effectiveness (the lower the rate the better) as well as meaningful information, which can be formulated as a trade-off objective I(X;\tilde{X}) - I(\tilde{X};Y) [29], where Y is the feature set with respect to X. Note that the coding rate I(X;\tilde{X}) is not as important in classification as in clustering problems, since classification focuses on prediction accuracy. Therefore, in this task, we need to concentrate mainly on improving the classification accuracy, so -I(\tilde{X};Y) is optimized instead of I(X;\tilde{X}) - I(\tilde{X};Y), and the coding rate I(X;\tilde{X}) is ignored. Moreover, in order to make the optimization easier, in this work we optimize I(X;Y) - I(\tilde{X};Y) instead of -I(\tilde{X};Y). We will show the optimization details in the next section. Note that I(X;Y) is a constant when X and Y are fixed, and thus optimizing I(X;Y) - I(\tilde{X};Y) is equivalent to optimizing -I(\tilde{X};Y).

The meaning of the objective function I(X;Y) - I(\tilde{X};Y) can also be understood in another way. In classification, the instances are described by the features, and thus there should be mutual information between data instances and their features, i.e. I(X;Y) > 0. In addition, the category information is also described by the features, and thus I(\tilde{X};Y) > 0. Therefore, the information in X and \tilde{X} is contained in the forms of I(X;Y) and I(\tilde{X};Y). Note that I(X;Y) is always greater than or equal to I(\tilde{X};Y), which will be derived in Lemma 1 in Section 5.2. I(X;Y) - I(\tilde{X};Y) > 0 means there is some loss in mutual information after categorization. To sum up, a good categorization should keep the mutual information between data and features, and minimize the information loss. In other words, I(\tilde{X};Y) should be close to I(X;Y). Therefore, in the information bottleneck classification setting, the quality of the categorization should be judged by the loss in mutual information between the original instances and the categorized instances. The following remark lays out the objective function formally.

Remark 1. In the information bottleneck (IB) classification setting, a qualified hypothesis h : X \to \tilde{X} approximately satisfies:

    h = \arg\min_h \left( I(X;Y) - I(\tilde{X};Y) \right).    (4)

4.2 Applying to Cross-Language Classifier
Before the formal problem formulation, we would like to present a brief idea of how to address the cross-language classification problem by the information bottleneck. Suppose we have the union set X of English and Chinese Web pages, Y is the set of words contained in X, and \tilde{X} stands for the class labels. It is observed that there is a common part of English and Chinese Web pages which shares similar semantic information. This observation inspired our work. If the texts are put through the "bottleneck", the labeled data (English Web pages) will, to some extent, guide the classification on Chinese pages by encoding the common part into the same codewords. This is because, during the information theoretic compression, IB tends to encode similar pages into the same code (label), in that it can reduce the code length without much loss of information. If one Chinese page is in the common part, i.e. similar to a labeled English page, this page will be classified into the same category as that of the English page.

To summarize, cross-language Web pages carry a common part of semantic information. According to the information bottleneck theory, this part of the information is expected to help encode similar Web pages in different languages, since it is highly relevant to the class labels and is contained in the Web pages of both languages. Therefore, the information bottleneck technique is suitable for addressing the cross-language classification problem.

5. CROSS-LANGUAGE CLASSIFIER VIA INFORMATION BOTTLENECK
In this section, the problem is carefully formulated and an objective function is proposed to build a classification model. The classification algorithm (IB) is then presented. Also, the convergence and time complexity are theoretically analyzed.

5.1 Problem Formulation
Let X_e be the set of Web pages in English with class labels and X_c be the set of Web pages in Chinese without labels. Usually, it is assumed that the English Web pages X_e and the Chinese Web pages X_c share some common information with each other, as shown in Figure 3.

[Figure 3: The cross-language classification problem. Here, X_e is the set of the Web pages in English and X_c is the set of the Web pages in Chinese. It is assumed that there is some common part between English and Chinese Web pages.]

The feature space for X_e is the English words Y_e and that for X_c is the Chinese words Y_c. The objective is to classify the pages in X_c into C, the predefined class label set, which is the same as for the pages in X_e. To sum up, labeled training documents are available only in one language, and we want to estimate a hypothesis h : X_c \to C which classifies documents written in another language. This is called cross-language text classification.
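To make this formulation concrete, the sketch below (illustrative only; the example documents, class list and the translate() stub are hypothetical placeholders, not part of the original system) shows the shape of the inputs and of the hypothesis we want to estimate.

    from typing import List, Tuple

    # Predefined class label set C (here a two-class example).
    C: List[str] = ["Games", "News"]

    # X_e: labeled English pages as (text, label); X_c: unlabeled Chinese pages.
    X_e: List[Tuple[str, str]] = [
        ("online game strategy guide ...", "Games"),
        ("breaking news report politics ...", "News"),
    ]
    X_c: List[str] = ["在线 游戏 攻略 ...", "新闻 报道 ..."]

    def translate(zh_text: str) -> str:
        """Stub for the machine translation tool T; a real translator must be plugged in."""
        raise NotImplementedError

    def cross_language_classify(labeled_en: List[Tuple[str, str]],
                                unlabeled_zh: List[str]) -> List[str]:
        """The hypothesis h : X_c -> C we want to estimate, using only English labels."""
        translated = [translate(d) for d in unlabeled_zh]   # X_c^T, the translated test set
        raise NotImplementedError("the learning procedure is developed in Section 5")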
In contrast to traditional text categorization, where the training and test pages are in the same language, cross-language text classification requires that the training and test data be unified into one single feature space. Otherwise, it is not possible for existing machine learning techniques to obtain results. The most common way is to translate the pages in one language into the other. However, this common approach brings error and bias into the subsequent classification. Linguistically speaking, machine translation techniques are far from satisfactory. What is worse, the simple translation preprocessing does not give accurate information: the topics of the original data may drift under translation. Empirically, these claims are justified; the experimental details are presented in Section 6.1.2.

5.2 Objective Function
As stated in Section 4.2, we propose to address these problems via the information bottleneck technique. The test Web page set is translated into English and denoted as X_c^T. Let X = X_e \cup X_c^T be the original signal for the bottleneck. The class label set is the output of the bottleneck \tilde{X}. Y refers to the features of all the documents. To fully utilize the common part for classification, we define the objective function as

    I(X;Y) - I(\tilde{X};Y),    (5)

which is exactly the objective function in Remark 1. Note that I(X;Y) - I(\tilde{X};Y) is always non-negative, which will be derived in Lemma 1. Based on Remark 1, the objective function value should be minimized, since we aim to draw I(\tilde{X};Y) close to I(X;Y).

Before proposing the optimization approach for minimizing the objective function, we first define some notations used in the subsequent analysis.

Definition 1. We use \tilde{X} = \{\tilde{x}\} to denote a categorization of X under the hypothesis h, where \tilde{x} = \{x' \mid h(x') = h(x)\}. Clearly, |\tilde{X}| is equal to |C|, since h maps the instances in X to the class labels in C.

Definition 2. The joint probability distribution of X and Y under the categorization \tilde{X} is denoted by \tilde{p}(X,Y), where

    \tilde{p}(x,y) = p(\tilde{x},y)\, p(x \mid \tilde{x}) = p(\tilde{x},y)\, \frac{p(x)}{p(\tilde{x})},    (6)

where x \in \tilde{x}, and p(x \mid \tilde{x}) = p(x)/p(\tilde{x}), since x totally depends on \tilde{x}.

In the following, we transform the objective function in Equation (5) into another representation via the KL-divergence [14].

Lemma 1. For a fixed categorization \tilde{X}, we can write the objective function in Equation (5) as

    I(X;Y) - I(\tilde{X};Y) = D(p(X,Y) \| \tilde{p}(X,Y)),    (7)

where D(\cdot \| \cdot) is the KL-divergence defined in Equation (3).

Note that, based on the non-negativity of the KL-divergence, the objective function I(X;Y) - I(\tilde{X};Y) is always non-negative.
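As a sanity check on Definition 2 and Lemma 1 (an illustrative sketch, not from the paper; the toy joint distribution and the hard labeling are invented, and the mutual-information helper repeats the one from the earlier sketch so the block is self-contained), the code below builds \tilde{p}(X,Y) from a categorization and verifies numerically that I(X;Y) - I(\tilde{X};Y) equals D(p(X,Y) \| \tilde{p}(X,Y)).

    import numpy as np

    def mutual_information(pxy):
        """I(X;Y) for a discrete joint distribution table."""
        px = pxy.sum(axis=1, keepdims=True)
        py = pxy.sum(axis=0, keepdims=True)
        ratio = np.divide(pxy, px * py, out=np.ones_like(pxy), where=pxy > 0)
        return np.sum(np.where(pxy > 0, pxy * np.log(ratio), 0.0))

    def p_tilde(pxy, labels, n_classes):
        """Definition 2: p~(x,y) = p(x~,y) * p(x)/p(x~) under the categorization `labels`."""
        px = pxy.sum(axis=1)
        p_cy = np.zeros((n_classes, pxy.shape[1]))            # p(x~, y)
        np.add.at(p_cy, labels, pxy)
        p_c = p_cy.sum(axis=1)                                # p(x~)
        return p_cy[labels] * (px / p_c[labels])[:, None]

    def kl(p, q):
        mask = p > 0
        return np.sum(p[mask] * np.log(p[mask] / q[mask]))

    # Toy joint distribution over 4 instances and 3 features, with a 2-class labeling.
    pxy = np.array([[0.10, 0.05, 0.05],
                    [0.15, 0.05, 0.05],
                    [0.02, 0.08, 0.15],
                    [0.03, 0.07, 0.20]])
    labels = np.array([0, 0, 1, 1])

    p_cy = np.zeros((2, 3))
    np.add.at(p_cy, labels, pxy)                              # joint p(x~, y)
    lhs = mutual_information(pxy) - mutual_information(p_cy)  # I(X;Y) - I(X~;Y)
    rhs = kl(pxy, p_tilde(pxy, labels, 2))                    # D(p || p~)
    print(lhs, rhs)                                           # the two values agree (Lemma 1)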
5.3 Optimization
From Equation (7), the loss in mutual information in the objective function equals the KL-divergence between p(X,Y) and \tilde{p}(X,Y). To minimize the objective function in Equation (5), we only need to find a categorization \tilde{X} which minimizes the KL-divergence value

    D(p(X,Y) \| \tilde{p}(X,Y)).    (8)

However, the objective function in Equation (7) is in a joint probability form that is difficult to optimize. We therefore rewrite it into a conditional probability form, which will allow our algorithm to reduce the objective function value.

Lemma 2. The objective function in Equation (7) can be expressed in a conditional probability form as

    D(p(X,Y) \| \tilde{p}(X,Y)) = \sum_{\tilde{x} \in \tilde{X}} \sum_{x \in \tilde{x}} p(x)\, D(p(Y|x) \| \tilde{p}(Y|\tilde{x})).    (9)

Lemma 2 gives a straightforward explanation of cross-language text classification via the information bottleneck technique. If one Chinese page is most similar to the English pages of one class \tilde{x}, i.e. the distance D(p(Y|x) \| \tilde{p}(Y|\tilde{x})) is the smallest, then assigning this page to \tilde{x} leads to a lower value of the objective function. This assignment is therefore preferred during the optimization process, which means the common part between English and Chinese Web pages works as stated in Section 4.2.

Also, Lemma 2 provides an alternative way to reduce the objective function value. From Equation (9), we know that minimizing D(p(Y|x) \| \tilde{p}(Y|\tilde{x})) for a single instance x reduces the global objective function D(p(X,Y) \| \tilde{p}(X,Y)). As a result, if we iteratively optimize the corresponding D(p(Y|x) \| \tilde{p}(Y|\tilde{x})) for each instance x, the objective function decreases monotonically. Thus, based on Lemma 2, the information bottleneck cross-language text classification algorithm (IB) is derived as Algorithm 1.

Algorithm 1: The Cross-Language Text Classification (IB) Algorithm
Input: English Web pages X_e; Chinese Web pages X_c; an existing translator T; the number of iterations N.
Output: the final hypothesis h_f : X_c \cup X_e \to C.
 1: Translate X_c into English: X_c^T = T(X_c). Let X = X_e \cup X_c^T.
 2: Train an initial hypothesis h^{(0)} on X_e by a supervised learning method (e.g. naive Bayes classifiers [17]).
 3: Initialize the probability distribution \tilde{p}^{(0)} based on h^{(0)} and Equation (6).
 4: for t = 1, ..., N do
 5:     for each x \in X_c^T do
 6:         h^{(t)}(x) = \arg\min_{\tilde{x} \in \tilde{X}} D(p(Y|x) \| \tilde{p}^{(t-1)}(Y|\tilde{x}))
 7:     end for
 8:     for each x \in X_e do
 9:         h^{(t)}(x) = h^{(t-1)}(x)
10:     end for
11:     Update \tilde{p}^{(t)} based on h^{(t)} and Equation (6).
12: end for
13: Return h^{(N)} as the final hypothesis h_f.

In Algorithm 1, in each iteration the algorithm keeps the prediction labels for the English Web pages X_e unchanged, since their true labels are already known, while choosing the best category \tilde{x} \in \tilde{X} for each data instance x in X_c^T so as to minimize the function D(p(Y|x) \| \tilde{p}^{(t-1)}(Y|\tilde{x})). As we have discussed above, this process decreases the objective function in Equation (7). The whole algorithm is illustrated in Figure 4.

[Figure 4: The scheme of the IB-based cross-language classification algorithm. The unlabeled Chinese Web pages are translated into English and, together with the labeled English Web pages, fed to a basic classifier; the objective function is then optimized to give new labels for the Chinese Web pages, repeating N times before the labels for the Chinese Web pages are output.]
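The following Python sketch is one possible reading of Algorithm 1 (an illustrative implementation, not the authors' code; the smoothing constant and the term-count representation are assumptions). Documents are rows of a document-term count matrix, p(Y|x) is the normalized row of a translated Chinese page, and \tilde{p}^{(t)}(Y|\tilde{x}) is estimated from the pooled counts of all pages currently assigned to class \tilde{x}.

    import numpy as np

    def ib_cross_language(counts_e, labels_e, counts_c, init_labels_c,
                          n_classes, n_iter=10, alpha=1e-3):
        """Sketch of Algorithm 1.  counts_e / counts_c are document-term count
        matrices for the English pages X_e and the translated Chinese pages X_c^T;
        labels_e are the known English labels; init_labels_c comes from a basic
        classifier such as naive Bayes (step 2); alpha is an assumed smoothing
        constant to keep log-probabilities finite."""
        labels_c = init_labels_c.copy()
        # p(Y|x): per-document word distributions of the Chinese pages.
        p_y_given_x = counts_c / counts_c.sum(axis=1, keepdims=True)
        for _ in range(n_iter):
            # Update p~(Y|x~) from all pages under the current labeling (Equation (6)).
            class_counts = np.zeros((n_classes, counts_e.shape[1]))
            np.add.at(class_counts, labels_e, counts_e)
            np.add.at(class_counts, labels_c, counts_c)
            p_y_given_c = class_counts + alpha
            p_y_given_c /= p_y_given_c.sum(axis=1, keepdims=True)
            # Steps 5-7: reassign each Chinese page to the class minimizing
            # D(p(Y|x) || p~(Y|x~)); the entropy of p(Y|x) is constant per page,
            # so it suffices to maximize sum_y p(y|x) log p~(y|x~).
            cross = p_y_given_x @ np.log(p_y_given_c).T      # shape (n_c, n_classes)
            labels_c = np.argmax(cross, axis=1)
            # Steps 8-10: the English labels stay fixed (labels_e is never changed).
        return labels_c

    # Tiny usage example with hypothetical count matrices:
    if __name__ == "__main__":
        counts_e = np.array([[5, 1, 0], [4, 2, 0], [0, 1, 6], [1, 0, 5]])
        labels_e = np.array([0, 0, 1, 1])
        counts_c = np.array([[3, 1, 1], [0, 1, 4]])
        print(ib_cross_language(counts_e, labels_e, counts_c,
                                init_labels_c=np.array([0, 0]),
                                n_classes=2, n_iter=5))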
5.4 Convergence
Since our algorithm IB is iterative, it is necessary to discuss its convergence property. The following theorem shows that the objective function in our algorithm monotonically decreases, which establishes that the algorithm eventually converges.

Theorem 1. The objective function in Equation (7) monotonically decreases in each iteration of Algorithm IB:

    D(p(X,Y) \| \tilde{p}^{(t)}(X,Y)) \ge D(p(X,Y) \| \tilde{p}^{(t+1)}(X,Y)).    (10)

Note that, although the algorithm is able to minimize the objective function value in Equation (5), it is only able to find a local minimum. Finding the global optimal solution is NP-hard. From Theorem 1, we can straightforwardly derive that the algorithm IB converges in a finite number of iterations, since the hypothesis space is finite.

5.5 Computational Complexity
Regarding the computational cost of IB, suppose the number of nonzeros in p(X,Y) is N. In each iteration, IB needs to calculate h^{(t)} in O(|C| \cdot N) and update \tilde{p}^{(t)}(Y|\tilde{X}) in O(|C| \cdot |Y|). Therefore, the time complexity of IB is O(|C| \cdot (|Y| + N)). Usually, |C| is not large and can be considered a constant, while |Y| is usually not larger than N. Thus, the time complexity of IB is O(N) in general, so our algorithm IB has good scalability and is capable of handling large data sets.

6. EXPERIMENTS
In this section, we evaluate our cross-language classification algorithm based on the information bottleneck, and compare it with several state-of-the-art supervised and semi-supervised classifiers.

6.1 Data Sets
We conduct our evaluation on the Web pages crawled from the Open Directory Project (ODP) [22] during August 2006. Each Web page in ODP was classified by human experts into one of 17 top level categories (Arts, Business, Computers, Games, Health, Home, Kids and Teens, News, Recreation, Reference, Science, Shopping, Society, Sports, Regional, Adult and World). We removed the Regional category because the Web pages in the Regional category also appear in other categories. The Web pages in the Adult category were not crawled by our crawler, because most of them are banned by our internet service provider, and thus the Adult category is not included in our data collection. Moreover, the Web pages in the World category are in languages other than English. We selected all the Chinese pages from the World category as Chinese test data. For the Chinese Web pages, there are also 14 top categories, each of which can be mapped to a top category in the English ODP. Therefore, we have 14 categories for both Chinese and English Web pages in these experiments. Table 1 presents the detailed description of our data collection. From the table, it can be seen that the number of English labeled Web pages is much larger than that of Chinese ones in ODP, which indicates that English labeled Web pages (1,271,106) are much more abundant than Chinese ones (18,243).

Table 1: The descriptions for all the categories in ODP, including Chinese and English ones. Note that all the Chinese Web pages as well as their category labels were translated into English using Google Translator.

    Category         Chinese Web Pages   English Web Pages
    Arts                       1,942             186,307
    Business                   6,503             203,569
    Computers                  1,907             102,571
    Games                        296              39,269
    Health                       518              47,607
    Home                         203              23,117
    Kids and Teens               292              27,323
    News                         359              96,510
    Recreation                   681              77,901
    Reference                  2,338              48,231
    Science                      914              75,434
    Shopping                     488              86,736
    Society                    1,481             185,466
    Sports                       321              71,065
    Total                     18,243           1,271,106

6.1.1 Data Preparation
Data preprocessing has been applied to the raw data. First, all the Chinese Web pages were translated by Google Translator [1]. Then, we converted all the letters to lower case, and stemmed the words using Porter's stemmer [24]. After that, stop words were removed. In order to reduce the size of the feature space, we used a simple feature selection method, document frequency (DF) thresholding [31], to cut down the number of features and speed up the classification. Based on [31], DF thresholding, which has performance comparable to information gain (IG) or CHI, is suggested since it is the simplest with the lowest computational cost. In our experiments, we set the DF threshold to 3. After feature selection, the vocabulary size becomes 512,896.

In order to evaluate our cross-language classifier, we set up eight cross-language classification tasks. Five of them are binary classification tasks, and the others are multiple-class classification tasks. Table 2 presents the detailed composition of each classification task.

Table 2: The composition of each data set.

    Data Set                 Categories
    Games vs News            Games, News
    Arts vs Computers        Arts, Computers
    Recreation vs Science    Recreation, Science
    Computers vs Sports      Computers, Sports
    Reference vs Shopping    Reference, Shopping
    3 Categories             Arts, Computers, Society
    4 Categories             Business, Health, News, Sports
    5 Categories             Games, Home, News, Shopping, Sports
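A minimal sketch of this preprocessing pipeline is shown below (illustrative only; it assumes the pages have already been translated into English, uses NLTK's Porter stemmer as a stand-in for the stemmer cited above, substitutes a tiny inline stop-word list, and applies the DF >= 3 threshold).

    import re
    from collections import Counter
    from nltk.stem import PorterStemmer   # Porter's stemmer [24]

    STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are"}  # tiny stand-in list

    def preprocess(doc: str, stemmer=PorterStemmer()) -> list:
        """Lower-case, tokenize, remove stop words, and stem one (translated) page."""
        tokens = re.findall(r"[a-z]+", doc.lower())
        return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]

    def df_threshold(tokenized_docs, min_df=3):
        """Document frequency (DF) thresholding [31]: keep terms occurring in >= min_df documents."""
        df = Counter()
        for toks in tokenized_docs:
            df.update(set(toks))
        return {term for term, count in df.items() if count >= min_df}

    # Usage: translated_pages is a list of English strings (output of the translator).
    # tokenized = [preprocess(d) for d in translated_pages]
    # vocabulary = df_threshold(tokenized, min_df=3)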
Table 3 presents the top 5 frequent features in the Chinese and English Web pages for each of the 14 ODP categories. We b elieve there is topic drift b etween Chinese and English Web pages. For example, there are several Chinese words which are frequently app ears in the Chinese Web pages, such as "qiyuan", "mufurong", "pingqiu" and so on. In the Games, "qiyuan", one of the famous online game in China, is the most frequent keyword, while it hardly app ears in the English Web pages. This is due to the difference in culture b etween the Chinese and western societies. "pingqiu", which means "draw" in English, hardly app ears in English Web pages, b ecause the translator fails to provide a mapping b etween "draw" and it. The observations demonstrate the claim we made previously that there are two main obstacles for the cross-language classification: one is the difference in topic focus b etween the two languages; the other is the translation error. 6.3 Classification Performance We now present the classification p erformance for each comparison methods, and show advantages of our information b ottleneck cross-language classifier IB. 6.3.1 Evaluation Metrics The metrics used in this exp eriments are macro-average precision, recall and F1 -measure. Let f b e the function which maps from document d to its true class lab el c = f (d), and h b e the function which maps from document d to its prediction lab el c = h(d) given by the classifiers. The macroaverage precision P and recall R are defined as 1 X |{d|d Xc h(d) = f (d) = c}| P= , (11) |C | cC |{d|d Xc h(d) = c}| R= 1 X |{d|d Xc h(d) = f (d) = c}| . |C | cC |{d|d Xc f (d) = c}| (12) 6.1.3 Density Analysis Figure 5 shows the instance-feature co-occurrence distribution on the Games vs News data set. In this figure, documents 1 to 484 are from Xe , while documents 485 to 853 are from Xc . Within a data set, Xe or Xc , the documents are ordered by their categories (Games or News). The words are sorted by ng (w)/nn (w), where ng (w) and nn (w) represent the numb er of word p ositions w app ears in Games and News documents, resp ectively. From Figure 5, it can b e found that the distributions of English and Chinese Web pages are somewhat different, however the figure also shows large commonness exists b etween the two data sets. The density divergence b etween two data sets in the figure makes the cross-language classification difficult, b ecause most classification techniques rely on the basic assumption that the training data should b e drawn from the same distribution as the test data. However, the common part b etween the two data sets can help increase the feasibility of the classification. F1 -measure is a harmonic mean of precision and recall defined as follows 2P R F1 = . (13) P +R 6.3.2 Experimental Results Table 4 presents the p erformance on each binary classification data set given by NBC, SVM, TSVM and our algorithm IB. The implementation details of the algorithms have already b een presented in the last subsection. The evaluation metrics are macro-average precision, recall and F1 -measure, of which we have just given the definitions. From the table, we can see that IB significantly improves the other three methods. Although SVM and TSVM is slightly b etter than IB on the Arts vs Computers data set, IB is still comparable. But, on some of the other data sets, e.g. Computers vs Sports and Reference vs Shopping, b oth SVM and TSVM fail, while IB is much b etter than the two discriminative methods. 
6.3.2 Experimental Results
Table 4 presents the performance on each binary classification data set given by NBC, SVM, TSVM and our algorithm IB. The implementation details of these algorithms have already been presented in the last subsection. The evaluation metrics are macro-average precision, recall and F1-measure, whose definitions have just been given.

Table 4: Macro-average precision, recall and F1-measure for each classifier on each binary classification data set. The training documents for NBC, SVM and IB are X_e and the test data for them are X_c^T. TSVM is trained on labeled X_e and unlabeled X_c^T, and tested on X_c^T.

    Precision
    Data Set                 NBC     SVM     TSVM    IB
    Games vs News            0.749   0.737   0.747   0.798
    Arts vs Computers        0.731   0.783   0.768   0.767
    Recreation vs Science    0.836   0.874   0.883   0.903
    Computers vs Sports      0.783   0.669   0.611   0.844
    Reference vs Shopping    0.911   0.743   0.650   0.929
    Average                  0.802   0.761   0.732   0.848

    Recall
    Data Set                 NBC     SVM     TSVM    IB
    Games vs News            0.747   0.739   0.779   0.817
    Arts vs Computers        0.728   0.801   0.785   0.782
    Recreation vs Science    0.832   0.877   0.891   0.906
    Computers vs Sports      0.800   0.840   0.759   0.873
    Reference vs Shopping    0.827   0.858   0.766   0.859
    Average                  0.787   0.823   0.796   0.847

    F1-Measure
    Data Set                 NBC     SVM     TSVM    IB
    Games vs News            0.748   0.738   0.762   0.807
    Arts vs Computers        0.730   0.792   0.776   0.774
    Recreation vs Science    0.834   0.876   0.887   0.905
    Computers vs Sports      0.792   0.745   0.677   0.858
    Reference vs Shopping    0.867   0.797   0.703   0.893
    Average                  0.794   0.790   0.761   0.847

From the table, we can see that IB significantly improves over the other three methods. Although SVM and TSVM are slightly better than IB on the Arts vs Computers data set, IB is still comparable. On some of the other data sets, e.g. Computers vs Sports and Reference vs Shopping, both SVM and TSVM fail, while IB is much better than these two discriminative methods. In addition, NBC is always worse than IB, but never fails badly. On average, IB gives the best performance in all three evaluation metrics.

Table 5 presents the performance on each multiple-class classification data set given by NBC and our algorithm IB.

Table 5: Macro-average precision, recall and F1-measure for each classifier on each multiple-class classification data set. The training documents for NBC and IB are X_e and the test data for them are X_c^T. Note that SVM and TSVM are not included here, because they are designed for binary classification.

    Data Set        Precision (NBC / IB)   Recall (NBC / IB)   F1-Measure (NBC / IB)
    3 Categories    0.647 / 0.661          0.630 / 0.665       0.638 / 0.663
    4 Categories    0.445 / 0.592          0.570 / 0.648       0.500 / 0.619
    5 Categories    0.550 / 0.608          0.488 / 0.582       0.517 / 0.627
    Average         0.547 / 0.620          0.563 / 0.632       0.552 / 0.636
SVM and TSVM are not included in Table 5 because they are designed for binary classification and cannot cope well with the multiple-class classification problem. In this table, IB still gives significant improvements over the baseline method NBC. Therefore, we believe our algorithm IB is not only effective for English-Chinese cross-language classification, but also extensible to multiple-class classification.

6.4 Convergence
Since our algorithm IB is an iterative algorithm, an important issue for IB is its convergence property. Theorem 1 has already proven the convergence of IB theoretically. Now, let us empirically show the convergence property of IB. Figure 6 shows the F1-measure curves as a function of the number of iterations on three data sets, Arts vs Computers, Recreation vs Science and 3 Categories. From the figure, it can be seen that IB almost always reaches its convergence point within 10 iterations. This indicates that IB converges very fast. We believe that 10 iterations is empirically enough for IB.

[Figure 6: The F1-measure curves after each iteration on the three data sets Arts vs Computers, Recreation vs Science and 3 Categories, respectively (F1-measure from 0.65 to 1.00 on the vertical axis, number of iterations from 0 to 30 on the horizontal axis).]

7. CONCLUSIONS AND FUTURE WORK
The tremendous growth of the World Wide Web in China has raised the need for classifying and organizing the Chinese Web space via classification techniques. In this paper we put forward a technique for the Chinese Web mining task that exploits the abundant labeled information in English. In particular, we have developed a novel method based on the information bottleneck technique to address the topic drift and different-feature-space problems across two languages. Our method brings out a common part between the Chinese and English Web pages, which can be used to encode similar pages in different languages into the same codewords (class labels). An iterative algorithm is presented to optimize the objective function and thereby solve this problem. The experimental results show that our method can effectively improve existing methods in general, on five binary and three multi-class problems.

To extend our work, we wish to modify our method to achieve a globally optimal value. It is also interesting to conduct more experiments on other language pairs (e.g. French and English, which does not suffer from the word segmentation problem). Moreover, our method has the potential to be effective for the cross-language information retrieval problem.

8. ACKNOWLEDGEMENTS
Qiang Yang would like to thank the support of Hong Kong RGC Grant 621307. We also thank the anonymous reviewers for their helpful comments.

9. REFERENCES
[1] Google translator. http://www.google.com/language_tools.
[2] A. Banerjee, I. Dhillon, J. Ghosh, S. Merugu, and D. S. Modha. A generalized maximum entropy approach to Bregman co-clustering and matrix approximation. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 509-514, 2004.
[3] N. Bel, C. Koster, and M. Villegas. Cross-lingual text categorization. In Proceedings of ECDL, pages 126-139, 2003.
[4] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92-100, 1998.
[5] B. E. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144-152, 1992.
[6] T. M. Cover and P. E. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13:21-27, 1967.
[7] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience, New York, NY, USA, 1991.
[8] M. Day, C. Ong, and W. Hsu. Question classification in English-Chinese cross-language question answering: An integrated genetic algorithm and machine learning approach. In Proceedings of the 2007 IEEE International Conference on Information Reuse and Integration, pages 203-208, 2007.
[9] I. S. Dhillon, S. Mallela, and R. Kumar. Enhanced word clustering for hierarchical text classification. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 191-200, 2002.
[10] I. S. Dhillon, S. Mallela, and D. S. Modha. Information-theoretic co-clustering. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.
[11] A. Gliozzo and C. Strapparava. Cross language text categorization by acquiring multilingual domain models from comparable corpora. In Proceedings of the ACL Workshop on Building and Using Parallel Texts (in conjunction with ACL-05), 2005.
[12] T. Joachims. SVMlight. Software available at http://svmlight.joachims.org.
[13] T. Joachims. Transductive inference for text classification using support vector machines. In Proceedings of the Sixteenth International Conference on Machine Learning, pages 200-209, 1999.
[14] S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22(1):79-86, 1951.
[15] K. Lang. Newsweeder: Learning to filter netnews. In Proceedings of the Twelfth International Conference on Machine Learning, pages 331-339, 1995.
[16] D. D. Lewis. Reuters-21578 test collection. http://www.daviddlewis.com/.
[17] D. D. Lewis. Representation and Learning in Information Retrieval. PhD thesis, Amherst, MA, USA, 1992.
[18] Y. Li and J. Shawe-Taylor. Advanced learning algorithms for cross-language patent retrieval and classification. Information Processing and Management, 43(5):1183-1199, 2007.
[19] Y. Liu, Y. Fu, M. Zhang, S. Ma, and L. Ru. Automatic search engine performance evaluation with click-through data analysis. In Proceedings of the 16th International Conference on World Wide Web, pages 1133-1134, 2007.
[20] X. Ni, G. Xue, X. Ling, Y. Yu, and Q. Yang. Exploring in the weblog space by detecting informative and affective articles. In Proceedings of the 16th International Conference on World Wide Web, pages 281-290, 2007.
[21] K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2-3):103-134, 2000.
[22] ODP. Open directory project. http://www.dmoz.com/.
[23] J. Olsson, D. Oard, and J. Hajič. Cross-language text classification. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 645-646, 2005.
[24] M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130-137, 1980.
[25] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[26] L. Rigutini, M. Maggini, and B. Liu. An EM based training algorithm for cross-language text categorization. In Proceedings of the 2005 IEEE/WIC/ACM International Conference on Web Intelligence, pages 529-535, 2005.
[27] N. Slonim. The Information Bottleneck: Theory and Applications. Doctoral dissertation, Hebrew University, Jerusalem, Israel, 2002.
[28] N. Slonim and Y. Weiss. Maximum likelihood and the information bottleneck. In Advances in Neural Information Processing Systems, 15, 2002.
[29] N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. In Proceedings of the Thirty-seventh Annual Allerton Conference on Communication, Control and Computing, pages 368-377, 1999.
[30] Y.-C. Wu, K.-C. Tsai, and J.-C. Yang. Two-pass named entity classification for cross language question answering. In Proceedings of the NTCIR-6 Workshop Meeting, pages 168-174, 2007.
[31] Y. Yang and J. O. Pedersen. A comparative study on feature selection in text categorization. In Proceedings of the Fourteenth International Conference on Machine Learning, pages 144-152, 1997.
[32] X. Zhu. Semi-supervised learning literature survey. Technical Report 1530, University of Wisconsin-Madison, 2006.

APPENDIX
In this appendix, we provide the detailed proofs of Lemmas 1 and 2, and Theorem 1.

A. PROOF OF LEMMA 1
Proof.

    I(X;Y) - I(\tilde{X};Y)
      = \sum_{\tilde{x} \in \tilde{X}} \sum_{x \in \tilde{x}} \sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}
        - \sum_{\tilde{x} \in \tilde{X}} \sum_{y \in Y} p(\tilde{x},y) \log \frac{p(\tilde{x},y)}{p(\tilde{x})p(y)}
      = \sum_{\tilde{x} \in \tilde{X}} \sum_{x \in \tilde{x}} \sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}
        - \sum_{\tilde{x} \in \tilde{X}} \sum_{x \in \tilde{x}} \sum_{y \in Y} p(x,y) \log \frac{p(\tilde{x},y)}{p(\tilde{x})p(y)}
      = \sum_{\tilde{x} \in \tilde{X}} \sum_{x \in \tilde{x}} \sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(\tilde{x},y)\, \frac{p(x)}{p(\tilde{x})}}
      = \sum_{\tilde{x} \in \tilde{X}} \sum_{x \in \tilde{x}} \sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{\tilde{p}(x,y)}
      = D(p(X,Y) \| \tilde{p}(X,Y)).

Here the second equality uses \sum_{x \in \tilde{x}} p(x,y) = p(\tilde{x},y), and the fourth uses Definition 2.

B. PROOF OF LEMMA 2
Proof. Since

    \tilde{p}(x,y) = p(\tilde{x},y)\, p(x|\tilde{x}) = p(\tilde{x},y)\, \frac{p(x)}{p(\tilde{x})} = p(x)\, \tilde{p}(y|\tilde{x})
    and p(x,y) = p(x)\, p(y|x),

we have

    D(p(X,Y) \| \tilde{p}(X,Y))
      = \sum_{\tilde{x} \in \tilde{X}} \sum_{x \in \tilde{x}} \sum_{y \in Y} p(x)p(y|x) \log \frac{p(x)p(y|x)}{p(x)\tilde{p}(y|\tilde{x})}
      = \sum_{\tilde{x} \in \tilde{X}} \sum_{x \in \tilde{x}} p(x) \sum_{y \in Y} p(y|x) \log \frac{p(y|x)}{\tilde{p}(y|\tilde{x})}
      = \sum_{\tilde{x} \in \tilde{X}} \sum_{x \in \tilde{x}} p(x)\, D(p(Y|x) \| \tilde{p}(Y|\tilde{x})).

C. PROOF OF THEOREM 1
Proof. Based on Lemma 2, we have

    D(p(X,Y) \| \tilde{p}^{(t)}(X,Y)) = \sum_{\tilde{x} \in \tilde{X}} \sum_{x:\, h^{(t)}(x) = \tilde{x}} p(x) \sum_{y \in Y} p(y|x) \log \frac{p(y|x)}{\tilde{p}^{(t)}(y|\tilde{x})}.

From Steps 6 and 9 in Algorithm 1,

    h^{(t+1)}(x) = \arg\min_{\tilde{x} \in \tilde{X}} D(p(Y|x) \| \tilde{p}^{(t)}(Y|\tilde{x}))   for x \in X_c^T,
    h^{(t+1)}(x) = h^{(t)}(x)                                                                     for x \in X_e.

Thus every instance x is assigned under h^{(t+1)} to a category whose divergence D(p(Y|x) \| \tilde{p}^{(t)}(Y|\tilde{x})) is no larger than under h^{(t)}, and therefore

    D(p(X,Y) \| \tilde{p}^{(t)}(X,Y))
      \ge \sum_{\tilde{x} \in \tilde{X}} \sum_{x:\, h^{(t+1)}(x) = \tilde{x}} p(x) \sum_{y \in Y} p(y|x) \log \frac{p(y|x)}{\tilde{p}^{(t)}(y|\tilde{x})}
      = \sum_{\tilde{x} \in \tilde{X}} \sum_{x:\, h^{(t+1)}(x) = \tilde{x}} p(x) \sum_{y \in Y} p(y|x) \left( \log p(y|x) + \log \frac{1}{\tilde{p}^{(t)}(y|\tilde{x})} \right).

For each \tilde{x}, the second term satisfies

    \sum_{x:\, h^{(t+1)}(x) = \tilde{x}} \sum_{y \in Y} p(x)\, p(y|x) \log \frac{1}{\tilde{p}^{(t)}(y|\tilde{x})}
      = \tilde{p}^{(t+1)}(\tilde{x}) \sum_{y \in Y} \tilde{p}^{(t+1)}(y|\tilde{x}) \log \frac{1}{\tilde{p}^{(t)}(y|\tilde{x})}
      \ge \tilde{p}^{(t+1)}(\tilde{x}) \sum_{y \in Y} \tilde{p}^{(t+1)}(y|\tilde{x}) \log \frac{1}{\tilde{p}^{(t+1)}(y|\tilde{x})},

where the last inequality follows from the non-negativity of the Kullback-Leibler divergence:

    \sum_{y \in Y} \tilde{p}^{(t+1)}(y|\tilde{x}) \log \frac{1}{\tilde{p}^{(t)}(y|\tilde{x})}
      - \sum_{y \in Y} \tilde{p}^{(t+1)}(y|\tilde{x}) \log \frac{1}{\tilde{p}^{(t+1)}(y|\tilde{x})}
      = D(\tilde{p}^{(t+1)}(Y|\tilde{x}) \| \tilde{p}^{(t)}(Y|\tilde{x})) \ge 0.

Thus,

    D(p(X,Y) \| \tilde{p}^{(t)}(X,Y))
      \ge \sum_{\tilde{x} \in \tilde{X}} \sum_{x:\, h^{(t+1)}(x) = \tilde{x}} p(x) \sum_{y \in Y} p(y|x) \left( \log p(y|x) + \log \frac{1}{\tilde{p}^{(t+1)}(y|\tilde{x})} \right)
      = \sum_{\tilde{x} \in \tilde{X}} \sum_{x:\, h^{(t+1)}(x) = \tilde{x}} p(x) \sum_{y \in Y} p(y|x) \log \frac{p(y|x)}{\tilde{p}^{(t+1)}(y|\tilde{x})}
      = D(p(X,Y) \| \tilde{p}^{(t+1)}(X,Y)).