WWW 2008 / Poster Paper

April 21-25, 2008 ˇ Beijing, China

Finding Similar Pages in a Social Tagging Repository
Alex Penev
NICTA & University of NSW, Australia

Raymond K. Wong
NICTA & University of NSW, Australia

alexpenev@cse.unsw.edu.au

wong@cse.unsw.edu.au

ABSTRACT
Social tagging describ es a community of users lab eling web content with tags. It is a simple activity that enriches our knowledge ab out resources on the web. For a computer to help users search the tagged rep ository, it must know when tags are good or bad. We describ e TagScore, a scoring function that rates the goodness of tags. The tags and their ratings give us a succinct synopsis for a page. We `find similar' pages in Del.icio.us by comparing synopses. Our approach gives good correlation to the full cosine similarity but is hundreds of times faster. Categories and Sub ject Descriptors: H.3.5 [Information Retrieval]: Online Services--Web-based services General Terms: Algorithms, Exp erimentation Keywords: Web, tagging, social b ookmarking, del.icio.us

TagScore. It rates the goodness of a tag and helps us decide how much weight to give tags in a multi-tag search. Given a page, we have a basic synopsis from Del.icio.us: its tags, how often they were used, and the total numb er of b ookmarks. TagScore enriches this synopsis by also giving each tag a numeric rating (contribution) and giving the page itself an overall `confidence' score. The overall score is used to prune very large result sets to desired sizes. The contributions are used for the synopsis similarity measure.

2. APPROACH
TagScore. We rate tags with a weighted TFIDF-based approach. It is simple, fast1 , works well for tagging [1, 2] and allows us to score pages indep endently of each other. The raw scores are weighted to reflect where in the page tags match and the community's notion of their imp ortance. We address tagging's common criticisms by lessening the impact of idiosyncratic tags using frequencies and combatting word sense ambiguity by also matching related words. Using pre-built global lookups for cooccurrence and WordNet, we obtain up to 100 related words for any word in our dataset (DMOZ100k06 [3]). These are rated the same way as the tags, and their contributions are comp ounded into a single alb eit abstract tag. Related terms are imp ortant for two reasons. An average page has only 5.0 tags and its tag bag is coarse and not an exhaustive description; the terms significantly improve the match rate and overall score (21%). They also smoothen the distribution by reducing the numb er of ties. The final contributions of the abstract and user tags are combined (using several normalizations and geometric sums that insist on quality over quantity) to give a single overall `confidence' score in [0,1). Although we score pages indep endently, Fig 1(a) shows that TagScore gives a near-uniform score distribution for a randomly-sampled dataset. The ab ove computations are reasonably fast and scalable. The bulk (71%) of execution time is in disk IO, which can b e cut in half with the lookups in memory. Synopses. Tags are a succinct synopsis of a page. Finding similar pages can b e considered a multi-tag search where tag overlap is maximized. We enrich the basic synopsis by attaching a few extra bytes p er tag for its rating. There is an obvious space vs accuracy trade-off here, and we use a minimal amount. A comparison b etween enriched instead of basic synopses is slower, but only by 4% since the enrichment is a matter of bytes. The accuracy gains, however, are far larger (Fig 1(b)).
1

1.

INTRODUCTION

Conceptually, a tag is any simple idea describing an object. A combination of tags describ es the ob ject in higher detail. Today's tagging systems are built around users, tags and resources. Their tags are implemented as free-form text keywords and the resources may b e any web ob ject (URLs, photos, videos, research articles, blogs, etc). We are interested in URLs tracked by Del.icio.us, whose users b ookmark pages that they find interesting. It is `collab orative' tagging b ecause users can see the tags used for the URL by the whole community or by other individuals. So far tagging systems outpace our understanding of how to b est supp ort their socially-driven annotation model for searching the rep ository. In terms of search, a single tag acts as a filter b ecause it selects a subset of all ob jects (a combination of tags selects fewer). As Del.icio.us's rep ository grows, search b ecomes more difficult due to noise, inaccuracy, spam or lack of navigational functionality. TagScore can help in each case. However, we focus this work on currently missing functionality, in particular to `find similar' pages. Using a page's tags as filters to find similar pages will not work well: a b oolean `and' query will fail for pages with a reasonable numb er of tags, and `or' queries retrieve too many results. The user is forced to pick and choose which filters to apply. Unfortunately tags do not supp ort intuitive query refinement and such a b ehavior fails to make use of tags in intelligent ways. Instead, we apply
Copyright is held by the author/owner(s). WWW 2008, April 21­25, 2008, Beijing, China. ACM 978-1-60558-085-2/08/04.

We determine the enriched synopsis for a page in 0.006s.

1091


WWW 2008 / Poster Paper
Label source T = P 's meta T = P 's title T = User tags Samples 2664 4275 4278 Avg/Median TagScore 0.415 / 0.39 ( = 0.22) 0.555 / 0.59 ( = 0.22) 0.460 / 0.53 ( = 0.24) Ź CO 0.395 (-5%) 0.543 (-2%) 0.450 (-2%)

April 21-25, 2008 ˇ Beijing, China
Ź WN 0.417 (+0%) 0.554 (-0%) 0.454 (-1%) No related terms 0.361 (-13%) 0.510 (-8%) 0.388 (-16%)

Table 1: Summary of Fig 1(a). Also shows TagScore when not using COoccurrences or WordNet.

TagScore distribution for our dataset 1
0.6

Correlation with cosine similarity, top-25 results
0.7 E1: +enriched; avg. of 126 measures E2: +enriched; min. of 126 measures N1: -enriched; max. of 30 measures N2: -enriched; naive (overlap/tag-freq)

Correlation with cosine similarity, ranking similar pairs from large set
enriched best non-enriched

Score for labels T against page P

0.5

0.6

0.8
0.4

0.5

Correlation

Correlation
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8

0.6

0.4

0.3

0.4

0.3

0.2 0.2

0.2

T = P's title T = user tags T = P's meta 0 0.2 0.4 Percentile 0.6 0.8 1

0.1

0.1

0

0

0 0.4 0.5 0.6 0.7 0.8 0.9 1

TagScore percentile [higher = more low-scoring pages pruned]

TagScore percentile [higher = more low-scoring pages pruned]

(a) TagScore distribution for using the title, tags and meta as page P 's lab els.

(b) Kendall  for "given x, find similar to x" compared against cosine.

(c) Kendall  for "given a large p ool, find similar x,y pairs" against cosine.

3.

TAGSCORE PROPERTIES

Fig 1(a) shows the TagScore distribution. The smoothness indicates few ties. The linearity indicates a uniform spread. A summary is in Table 1. We can see that title is a stronger content descriptor than user tags, and b oth are b etter than meta-keywords. This has long b een susp ected by web develop ers. TagScore captures this sentiment and even quantifies the differences. Not matching a page's meta reduces the TagScore mean by 15%. Correlation with the original list is very strong (0.74) but does not justify ignoring meta completely b ecause it is inexp ensive. We found that taking the first-25 terms is a good shortcut (2% drop in mean,  > 0.9). Ignoring title gives a similar mean to ignoring meta, suggesting they are interchangeable for b eing matched against by user tags. Correlation was very strong (0.76), but title achieves this accuracy with fewer terms and is more ubiquitous, therefore it is the more-useful comp onent of a page. Taking only the first 2.5KB of the b ody content produced almost identical mean, median and  as using the full b ody and had a very strong (0.77) correlation against the original list, yet it is on average one-fifth of the page source. We find that this is a good shortcut in practice.

The result is clear: even the worst of the enriched measures was a closer approximation to cosine. We also see that  increases with p ercentile cut-off as low-scoring pages are prune. This suggests that under TagScore, low-scoring pages are indeed p oorly represented by their tags. Fig 1(c) shows a many-to-many comparison where we find the most similar pairs in a large p ool of synopses. We trialled one of the b etter enriched measures against the b est nonenriched measure. The result is clear: the ranking produced by the enriched measure is closer to cosine. Again, we see that  increases with p ercentile.

5. CONCLUSION
Social tagging systems have large rep ositories that require new methods for search. We created a scoring function to rate the goodness of tags and summarized how it was used to fill a gap in currently missing functionality: to `find similar' pages in a tagged rep ository. We contribute: ˇ TagScore, a scoring function to rate the goodness of tags and give the page an overall confidence score. ˇ Exp eriments on real Del.icio.us data to show TagScore's distribution and b ehavior. TagScore allows us to quantify various comparisons, such as the author's description of a page (title, meta) compared to users' tags. We also documented the effects of matching related terms. ˇ Two exp eriments on finding similar pages in a tagged rep ository. Our approximation has good correlation with cosine but is 490 times faster. Our implementation compared ab out 70,000 synopsis pairs p er second.

4.

APPROXIMATE SIMILARITY

We p erformed a one-to-many comparison by taking a random synopsis, comparing against the others and ranking the results. From 13 values (or combinations of values) obtained from enriched synopses, we trialled 156 different similarity measures for producing a ranked list by using the 13 as primary sort criteria and the other 12 to break any ties. Of these, 126 used the enriched tag contribution variable in some way; the others used data Del.icio.us currently has. Fig 1(b) shows a summary of the results. The curve N2 is the naive approach. The curve N1 plots the b est instance among any of the 30 non-enriched measures (averaged over 500 runs). It was similar to E2, the worst instance among the 126 enriched measures. The average of the enriched measures, E1, was much b etter.

6. REFERENCES

[1] Brooks, C., and Montanez, N. Improved annotation of the blogosphere via autotagging and hierarchical clustering. In WWW (2006). [2] Chirita, P. A., Costache, S., Nejdl, W., and Handschuh, S. P-tag: large scale automatic generation of personalized annotation tags for the web. In WWW (2007). [3] Noll, M., and Meinel, C. Authors vs. Readers - A Comparative Study of Document Metadata and Content in the WWW. ACM DocEng (2007).

1092