WWW 2008 / Poster Paper April 21-25, 2008 · Beijing, China Social and Semantics Analysis Via Non-negative Matrix Factorization * Zhi-li Wu Computer Science Depar tment Hong Kong Baptist University Chi-wa Cheng Computer Science Depar tment Hong Kong Baptist University Chun-hung Li Computer Science Depar tment Hong Kong Baptist University {vincent,victor,chli}@comp.hkbu.edu.hk ABSTRACT Social media such as Web forum often have dense interactions b etween user and content where network models are often appropriate for analysis. Joint non-negative matrix factorization model of participation and content data can b e viewed as a bipartite graph model b etween users and media and is prop osed for analysis social media. The factorizations allow simultaneous automatic discovery of leaders and sub-communities in the Web forum as well as the core latent topics in the forum. Results on topic detection of Web forums and cluster analysis show that social features are highly effective for forum analysis. This matrix can also b e viewed as a bipartite graph where one set of nodes corresp onds to the discussions and one set of nodes to the users. Individual Media Categories and Subject Descriptors I.2.6 [Computing Methodologies]: Artificial Intelligence-- Learning ; H.3.1 [Information System]: Content Analysis and Indexing; H.5.4 [Information System]: Information interfaces and presentation--hypertext/hypermedia General Terms Theory, Algorithms Figure 1: Social Interaction of Individual and Web Media To detect the k groups of latent topics in discussions, the weighted discussion-participation matrix X can b e factorized via non-negative matrix factorization (NMF) into a matrix W of n × k and a matrix H of k × m. NNF has found lots of applications in text mining [3],[4]. The two matrices after factorization have the effect of indicating the cluster memb ership. The cluster memb ership ci of the i-th discussion is simply given by ci = arg max Wij , j Keywords Social Network Analysis, latent topic detection, latent interest detection 1. SOCIAL INTERACTION MODEL 1.1 Latent Topic Detection via Factorization of User Interaction Matrix In the study of social media involving dense interactions b etween users, relationships b etween the media and the individuals are often essential to the understanding of b oth the topics of discussion and the interest of users. Figure 1 shows the relationship b etween web media and individuals using the web site. In the case of Web forum, online discussion is the media where individuals participate according to their interest. By measuring the participation frequencies of each user in each discussion, and denoting it as matrix X of size n × m, where n discussions are participated by m users. This work is partially supp orted by FRG of HKBU Copyright is held by the author/owner(s). WWW 2008, April 21­25, 2008, Beijing, China. ACM 978-1-60558-085-2/08/04. where j is the lab el of the latent topic of the discussions. Usually the numb er of latent topics are a much smaller numb er than the total numb er of discussions. 1.2 Joint Factorization of Social and Semantic Attributes For a set of n discussions where m words app ear altogether, we can represent the discussions-words into a matrix F of n × q . To factorize b oth matrices, X = W H , F = W G, where X and F are factorized to the same matrix W , together with H and G, resp ectively. And the ob jective function is 1245 WWW 2008 / Poster Paper April 21-25, 2008 · Beijing, China 2.3 Clustering W,H,G0, min (||X - W H || + ||F - W G|| ), 2 2 (1) where is a user-sp ecified constant. This leads to the following up dating rules, H = H . (W T X )./(W T W H ), G = G. (W T F )./(W T W G), W = W. (X H T + F GT )./(W (H H T + GGT )), which can guarantee to keep the ob jective value non-increasing through the proof of trace op eration and Lagrange transform. In addition to the up dating rules, the following two separate up dating rules are adopted as initialization steps, W = W. (X H T )./(W (H H T )), W = W. (F G )./(W (GG )). T T For the clustering of the three b oards of discussions, the data set contains 1003 discussions from C hatB oar d, 1069 from 2ndH andB oar d, and 1040 from Av B oar d. There are 7728 participators and 24791 words in total. The clustering p erformance is measured by weighted purity. For a p-cluster task, if the factorization matrix W is m × k and hereby divides the dataset into k groups, the purity is calculated by first counting for each of the k groups the numb er of p oints with their true clustering lab el dominant in this group, and then divided by the total numb er of data p oint in the dataset. When p = k, this measure is equivalent to the typical clustering accuracy measure. Table 2: Purity tering k 3 6 9 12 15 Measure of AV Web Forum ClusDPDW 0.7382 0.8661 0.8746 0.8878 0.8668 DP 0.5480 0.5646 0.5883 0.6071 0.6199 DW 0.4968 0.6786 0.6735 0.6702 0.6591 2. EXPERIMENT 2.1 Data Extraction A p opular web forum in high fidelity Audio-visual equipments is studied. In this forum, three distinct discussion b oards are available to public users with assigned alias Av B oar d, C hatB oar d, and 2ndH andB oar d. In the first of the exp eriment, we conduct topic detection in Av B oar d using discussion participation data only. 2.2 Topic Detection in Web Forum The results of topic detection using user participation only in the Av B oar d is shown here. The pfidf-weighted user participation frequency matrix is decomp osed using NMF into ten groups. The ten groups are evaluated by human exp ert as well as cluster entropy. Human exp ert evaluations of the latent topic nature of the clusters are shown in Table 1. Clusters that do not have coherent topics in the discussions are lab eled as miscellaneous. The latent topics discovered Table 1: Latent Topics discovered in the AvBoard C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 Exhibitions, shows Tube/DIY DAC DIY Compact disc Turntable, Vinyl discs Vintage Equipment Miscellaneous Japanese product CD players Miscellaneous Table 2 shows the clustering purity on different k, while DPDW refers to the NMF utilizing b oth the discussionparticipator and discussion-word matrices, and DP and DW refers to the NMF utilizing the discussion-participation and discussion-word matrix resp ectively. It can b e noticed from the result that the clustering results can b e significantly increased with the joint factorization approach. 3. CONCLUSION User participation in Web forum is essential to the analysis of Web discussions. We presented methods for detecting topics based on discussion-participation, and also on b oth discussion-participation and discussion-word. Results of topic detection in Web forum shows that the approach is feasible and latent topics previously unknown to the forum can b e discovered. It should also b e noted the degree of effectiveness could b e dep endent on the nature of the Web forum. Furthermore, we also present results on integrating the use of document corpus with user participation to cluster discussions from several different discussion b oards. 4. REFERENCES [1] P. J. Carrington, J. Scott, and S. Wasserman, editors. Models and Methods in Social Network Analysis. Cambridge University Press, 2005. [2] P. Kollock. The economies of online coop eration: Gifts and public goods in cyb erspace. In M. Smith and P. Kollock, editors, Communities in Cyberspace. Routledge, London, 1999. [3] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In NIPS, pages 556­562, 2000. [4] W. Xu, X. Liu, and Y. Gong. Document clustering based on non-negative matrix factorization. In SIGIR, pages 267­273, 2003. match well to the p osting characteristics. New latent topic, e.g. C1 of the Exhibition and shows ab out AV equipment that does not exist in the forum. The same is also true for C2 and C3 which is related to do-it-yourself (DIY) in audio hobby. As DIY discussions on audio equipment is a hobby where much supp ort and discussions are generated, the existence of sub-community based on it is quite evident. Furthermore, vintage equipment also found itself in a sp ecial sub-community. The entropy measures of the identified cluster agrees very well with human evaluations where miscellaneous clusters have higher entropies. 1246