Size Matters: Word Count as a Measure of Quality on Wikipedia

Joshua E. Blumenstock
School of Information, University of California at Berkeley
jblumenstock@berkeley.edu

WWW 2008 Poster Paper, April 21-25, 2008, Beijing, China

ABSTRACT
Wikipedia, "the free encyclopedia", now contains over two million English articles and is widely regarded as a high-quality, authoritative encyclopedia. Some Wikipedia articles, however, are of questionable quality, and it is not always apparent to the visitor which articles are good and which are bad. We propose a simple metric, word count, for measuring article quality. In spite of its striking simplicity, we show that this metric significantly outperforms the more complex methods described in related work.

Categories and Subject Descriptors: H.5.3 [Information Interfaces]: Group and Organization Interfaces - Collaborative computing; I.2.7 [Artificial Intelligence]: Natural Language Processing - Text analysis.

General Terms: Measurement, algorithms.

Keywords: Wikipedia, information quality, word count.

1. INTRODUCTION
As user-generated content grows in prominence, many web sites have employed complex mechanisms to help visitors identify high-quality content. Wikipedia maintains a list of "featured" articles that serve as exemplars of good user-generated content. For an article to be featured, it must survive a rigorous nomination and peer-review process; only one article in every thousand makes the cut. Unfortunately, this is a laborious process, and many worthy articles never get the official stamp of approval. Thus, it might be useful to have an automatic means of detecting articles of unusually high quality.

A substantial amount of work has been done to automatically evaluate the quality of Wikipedia articles. At a qualitative level, Lih [3] proposed using the total number of edits and unique editors to measure article quality, and Cross [2] suggested coloring text according to its age so that visitors could immediately discern its quality. Both studies, however, were primarily qualitative, and did not measure the discriminative value of such heuristics. The quantitative approaches found in related work tend to be quite complex. Zeng et al. [6], for instance, used a dynamic Bayesian network to develop a measure of trust and quality based on the edit history of an article. Adler and de Alfaro [1] devised similar metrics to quantify the reputation of authors and editors. In the work closest to our own, Stvilia et al. [5] computed 19 quantitative metrics, then used factor analysis and k-means clustering to differentiate featured from random articles.

2. METHODS
In contrast to the complex quantitative methods found in related work, we propose a much simpler measure of quality for Wikipedia articles: the length of the article, measured in words. While there are many limitations to such a metric, there is good reason to believe that it will be correlated with quality (see Figure 1). The simplicity of this metric presents several advantages:

· article length is easy to measure (see the sketch following Figure 1);
· many of the approaches mentioned in section 1 require information that is not easily obtained (such as the revision history used in [6], [4], and [1]);
· other approaches typically operate in a black-box fashion, with arcane parameters and results that are not easily interpreted by the average visitor to Wikipedia;
· article length performs significantly better than other, more complex methods.

[Figure 1: Word counts for featured vs. random articles. Two panels ("Featured Articles" and "Random Articles") plot probability density against log(word_count).]
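To make the first advantage concrete, the following is a minimal sketch of how article length might be measured. The paper does not describe its markup-stripping procedure, so the crude regex cleanup and the helper names (strip_markup, word_count) below are illustrative assumptions rather than the authors' actual pipeline.

```python
import re

def strip_markup(wikitext: str) -> str:
    """Crudely remove common MediaWiki markup (a rough approximation,
    not the cleaning procedure used in the paper)."""
    text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)                 # {{templates}}
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)  # [[link|label]] -> label
    text = re.sub(r"<[^>]+>", "", text)                            # HTML tags
    text = re.sub(r"'{2,}", "", text)                              # ''italic'' / '''bold'''
    text = re.sub(r"={2,}[^=]+={2,}", "", text)                    # == headings ==
    return text

def word_count(wikitext: str) -> int:
    """The proposed quality metric: the number of whitespace-separated
    words remaining after markup is stripped."""
    return len(strip_markup(wikitext).split())

sample = "'''Wikipedia''' is a [[wiki|collaborative]] encyclopedia. {{citation needed}}"
print(word_count(sample))  # -> 5
```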
To test the performance of article length as a discriminant between high- and low-quality articles, we followed the approach taken by Zeng et al. [6] and Stvilia et al. [5]. That is, instead of comparing our metric against a scalar measure of article quality, we assume that featured articles are of much higher quality than random articles, and recast the problem as a classification task. The goal is thus to maximize precision and recall of featured and non-featured articles.

To build a corpus, we extracted the full 5,654,236 articles from the 7/28/2007 dump of the English Wikipedia. After stripping all Wikipedia-related markup, we removed specialized files (such as images and templates) and articles containing fewer than fifty words. This cleaned dataset contained 1,554 articles classified as "featured"; we randomly selected an additional 9,513 cleaned articles to serve as a non-featured "random" corpus. Our corpus thus contained a total of 11,067 articles. In the experiments described below, we used 2/3 of the articles for training (7,378 articles) and 1/3 for testing (3,689 articles), with a similar ratio of featured to random articles in each set.

3. RESULTS
By classifying articles with more than 2,000 words as "featured" and those with fewer than 2,000 words as "random," we achieved 96.31% accuracy in the binary classification task. (A slightly higher accuracy of 96.46% was achieved with a threshold of 1,830 words.) The threshold was found by minimizing the error rate on the training set (see Figure 2); the reported accuracy results from testing on the held-out test set. A sketch of this threshold search appears below.

[Figure 2: Accuracy of different thresholds. Error rate as a function of the word count threshold (0 to 6,000 words), with separate curves for featured articles, random articles, and the two combined.]
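As a concrete restatement of this procedure, the sketch below picks the word-count cutoff that minimizes training error and then reports held-out accuracy. The function names and the brute-force grid search are illustrative assumptions; the paper describes the procedure but not an implementation.

```python
from typing import List, Tuple

def learn_threshold(train: List[Tuple[int, bool]]) -> int:
    """Choose the word-count cutoff that minimizes classification error
    on the training set, mirroring the search behind Figure 2.
    Each example is a (word_count, is_featured) pair."""
    candidates = range(0, 6001, 10)  # coarse grid over plausible cutoffs
    def errors(t: int) -> int:
        # An article is predicted "featured" iff its word count exceeds t.
        return sum((wc > t) != featured for wc, featured in train)
    return min(candidates, key=errors)

def accuracy(test: List[Tuple[int, bool]], t: int) -> float:
    """Held-out accuracy of the rule 'featured iff word_count > t'."""
    return sum((wc > t) == featured for wc, featured in test) / len(test)

# Toy usage with made-up data (not the paper's corpus); on the real
# corpus the paper reports a learned cutoff of roughly 2,000 words.
train = [(3500, True), (4200, True), (600, False), (900, False)]
test = [(2500, True), (400, False)]
t = learn_threshold(train)
print(t, accuracy(test, t))
```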
Modest improvements could be produced by more sophisticated classification techniques. A multi-layer perceptron, for instance, achieved an overall accuracy of 97.15%, with an F-measure of 0.902 for featured articles and 0.983 for random articles (see Table 1). Similar results were obtained with a k-nearest neighbor classifier (96.94% accuracy), a logit model (96.74% accuracy), and a random-forest classifier (95.80% accuracy). All techniques represent a significant improvement over the more complex methods in Stvilia et al. [4] and Zeng et al. [6], which produced 84% and 86% accuracy, respectively.

Table 1: Performance of word count in classifying featured vs. random articles.

class      n      TP rate   FP rate   Precision   Recall   F-measure
Featured   1554   0.936     0.023     0.871       0.936    0.902
Random     9513   0.977     0.064     0.989       0.977    0.983

Given the high accuracy of the word count metric, we naturally wondered whether other simple metrics might increase classification accuracy. In other contexts, features such as part-of-speech tags, readability metrics, and n-gram bag-of-words models have been moderately successful. In the context of Wikipedia quality, however, we found that word count was hard to beat. N-gram bag-of-words classification, for instance, produced a maximum of 81% accuracy (tested with n = 1, 2, 3 on both SVM and Bayesian classifiers). Even using a "kitchen sink" of thirty features such as those listed in Table 2, no classifier achieved greater than 97.99% accuracy, a modest improvement given the considerable effort required to produce these metrics and build the classifiers.

Table 2: Features from "kitchen sink" classification.

Frequency counts:     character count, complex word count, token count, one-syllable word count, sentence count, total syllables
Readability indices:  Gunning fog index, FORCAST formula, Coleman-Liau index, Automated Readability Index, Flesch-Kincaid, SMOG index
Structural features:  internal links, external links, category count, image count, citation count, table count, reference links, reference count, section count

4. DISCUSSION AND CONCLUSIONS
We have shown that article length is a very good predictor of whether an article will be featured on Wikipedia. Word count is a simple metric that is considerably more accurate than the complex methods proposed in related work, and it performs well independent of classification algorithm and parameters. We do not, however, mean to exaggerate the importance of this metric. By assuming that "featured" status is an accurate proxy for quality, we have implied that quality can be measured via article length. If our assumption does not hold, however, then we can only conclude that long articles are featured, and featured articles are long. Future work will explore alternative standards for quality on Wikipedia.

Acknowledgments
We thank Marti Hearst for guidance and feedback.

5. REFERENCES
[1] B. T. Adler and L. de Alfaro. A content-driven reputation system for the Wikipedia. Proc. 16th Intl. Conf. on the World Wide Web, pages 261-270, 2007.
[2] T. Cross. Puppy smoothies: Improving the reliability of open, collaborative wikis. First Monday, 11, 2006.
[3] A. Lih. Wikipedia as participatory journalism: Reliable sources? Metrics for evaluating collaborative media as a news resource. 13th Asian Media Information and Communications Centre Annual Conference, 2004.
[4] B. Stvilia, M. Twidale, L. Gasser, and L. Smith. Information quality discussions in Wikipedia. Proc. 2005 ICKM, pages 101-113, 2005.
[5] B. Stvilia, M. B. Twidale, L. C. Smith, and L. Gasser. Assessing information quality of a community-based encyclopedia. Proc. ICIQ, pages 442-454, 2005.
[6] H. Zeng, M. Alhossaini, L. Ding, R. Fikes, and D. L. McGuinness. Computing trust from revision history. Intl. Conf. on Privacy, Security and Trust, 2006.