Detecting Image Spam using Visual Features and Near Duplicate Detection

Bhaskar Mehta
Google Inc.
Brandschenkestrasse 110, Zurich, Switzerland
bmehta@google.com

Saurabh Nangia*
IIT Guwahati
Guwahati 781039, Assam, India
s.nangia@iitg.ernet.in

Manish Gupta*
IIT Guwahati
Guwahati 781039, Assam, India
m.gupta@iitg.ernet.in

Wolfgang Nejdl
L3S Forschungszentrum
Appelstrasse 4, Hannover, Germany
nejdl@L3S.de

* This work was done while the author was at L3S Forschungszentrum.

ABSTRACT
Email spam is a much studied topic, but even though current email spam detecting software has been gaining a competitive edge against text based email spam, new advances in spam generation have posed a new challenge: image-based spam. Image based spam is email which includes embedded images containing the spam message in binary format. In this paper, we study the characteristics of image spam and propose two solutions for detecting image-based spam, while drawing a comparison with existing techniques. The first solution, which uses visual features for classification, offers an accuracy of about 98%, i.e. an improvement of at least 6% over existing solutions. SVMs (Support Vector Machines) are used to train classifiers on judiciously chosen color, texture and shape features. The second solution offers a novel approach for near duplicate detection in images. It involves clustering of image GMMs (Gaussian Mixture Models) based on the Agglomerative Information Bottleneck (AIB) principle, using the Jensen-Shannon (JS) divergence as the distance measure.

Categories and Subject Descriptors
H.3 [Information Storage And Retrieval]: Information Search and Retrieval; G.3 [Mathematics of Computing]: Probability And Statistics

General Terms
Security

Keywords
Email spam, Image analysis, Machine learning

Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2008, April 21-25, 2008, Beijing, China. ACM 978-1-60558-085-2/08/04.

1. INTRODUCTION
Email spam has been around for more than a decade, with many covert companies offering mass emailing services for product advertisements. It has become usual for people to receive more spam email than relevant email. The design of the SMTP protocol is one main reason for email spam: it is possible to send millions of fake email messages via computer software without any authentication. Lately, authentication policies have been adopted by various web servers, where webmasters can maintain a list of blocked web domains. However, spammers keep registering millions of domain names at a pace far greater than what system administrators can cope with. The biggest step in the fight against spam has been the successful design of spam filtering software. Such software typically trains classifiers to identify spam and ham. For the past decade, much progress has been made in designing accurate classifiers for text emails. These classifiers are trained to use two kinds of rules: (a) rules based on the connection and relay properties of the email, and (b) rules exploiting features extracted from the content of emails. The former category of rules utilizes publicly shared blacklists of servers which are known to transmit spam emails. This information is frequently updated, requiring spammers to change domain names and servers frequently. The second type of rules uses features extracted from spam text (e.g. keywords, frequent words, etc.) which are compared with corresponding features from ham text (normal emails), and classifiers are learnt [6]. Such classifiers have proven to be very useful and highly accurate, even though spammers continue to devise novel ways of fooling them.
2. WHAT IS IMAGE SPAM?
Image based spam is a breakthrough from the spammers' point of view: it is a simple and effective way of deceiving spam filters, since they can process only text. An image spam email is formatted in HTML, which usually contains only non-suspicious text along with an embedded image (sent either as an attachment or via self-referencing HTML with image data in its payload). The embedded image carries the target message, and most email clients display the message in its entirety.

[Figure 1: Examples of Image Spam. Notice the high amount of text used in combination with pictures, used to make images resemble web pages.]

Since many ham emails have similar properties to image-based spam emails (using HTML, carrying embedded images, with normal text), existing spam filters can no longer distinguish between image-based spam and image ham. A whitepaper released in November 2006 [12] shows the rise of image spam from 10% of all email spam in April to 27% in October 2006, totaling 48 billion emails every day. It is thus imperative to devise new approaches for identifying image-based spam in email traffic which are both fast and accurate.

Clearly, identification of image-based email spam (referred to as I-spam from here on) requires content analysis of the attached image. Since images are a large and heterogeneous data format, it is essential to understand the characteristic properties of I-spam in order to identify it successfully. I-spam has been studied very recently by some researchers, and it is possible to identify some common traits of currently visible spam images; exploiting these properties forms the basis of our solutions for detecting I-spam.

· I-spam contains text messages: almost all images in spam emails contain text messages conveying the intent of the spammer. This text is usually an advertisement and often contains text which has been blacklisted by spam filters (e.g. Cialis pills, drug store, stock tip, etc.).

· I-spam is usually noisy, and instances differ from one another: it takes significant effort to design a spam image; however, spammers reduce the involved effort by generating spam images using algorithms which ensure that each generated image is distinct. Techniques used to achieve this include rearranging the items in spam emails, adding random noise, changing background or font colors, using different fonts and sizes, using random patterns like lines or circles, adding borders, and so on. While such changes might go unnoticed by a human at first glance, they cause significant changes in the properties of the images, resulting in statistically different data. This obfuscation of the original data makes trivial content analysis unsuitable; in fact, even sophisticated techniques like Optical Character Recognition (OCR) can fail to recognize text in images, a fact that is often used in designing CAPTCHAs. This line of spam generation renders early I-spam detection techniques [6] ineffective.
· I-spam messages use HTML effectively: I-spam uses MIME to transport the attached image data along with HTML formatting and non-suspicious text. However, the text contained in I-spam images and the text contained in the HTML body usually have no correlation, a property which can be exploited to classify I-spam. The regular text is usually taken from books or standard documents and is used to fool text-based email classifiers. Thus, the HTML text offers almost no help to text classifiers, unless appearing too harmless can itself be exploited as a feature.

· I-spam messages differ from natural images: natural images tend to have a smoother distribution in RGB/Lab color-space than I-spam. This property leads to a better approximation of natural images by artificial distributions (e.g. Gaussian) than of I-spam, which often includes clear and sharp objects. This property is difficult to exploit in practice; nevertheless, it is significant, since ham emails with attached images very often contain natural images (e.g. photographs).

· I-spam messages are template based: a large number of I-spam images are near-duplicates, i.e. a base pattern is permuted to form a large number of similar-looking but distinct I-spam images. Spammers exploit an existing pattern for a certain time period before investing in a new I-spam pattern. Such patterns are common in the world of text email spam, and are normally tackled by collaborative feedback from the user community. We envision the application of similar methods to new kinds of audio-visual spam in systems like YouTube.

Available figures for 2006 suggest that more than 30% of all spam emails are image based, and most of these emails reach email inboxes undetected. Humans can easily distinguish spam images from normal images; moreover, they can abstract from various versions of the same base image, even though these versions may have very different properties like backgrounds, fonts, or different arrangements of subimages. Current feature based approaches have had some success in identifying image spam [5]; extracted features include file type and size, average color, color saturation, edge detection, and a random pixel test. However, these features do not seem to provide good generalization, since spammers can easily modify them; thus, further feature engineering is required. In particular, we observe that computer-generated images do not have the same texture as natural images, so natural images like photographs would likely be classified as ham. We therefore postulate that low-level features which capture how an image is perceived are likely to be better discriminants between spam and ham. This is the first part of our detection strategy: to find features which capture how humans perceive spam images vs. ham images.

While some extracted features may provide good generalization, a spammer can clearly realize this and mutate spam images by using more natural/photographic content. However, creating a completely different spam image each time is a costly activity; thus, any spam image is likely to have thousands of similar (though not identical) counterparts, each generated from the same template image but modified using the usual spam tricks. Even though image spam may evolve to defeat obvious give-away features (like text), there are likely to be many near duplicates.
This forms the second part of our spam detection strategy: to detect near-duplicates of spam images which have been labeled as spam.

Table 1: Features used for I-spam classification by Dredze et al. [5]

    Feature Type               No. of Features
    Average Color                     44
    Color Saturation                  10
    Edge Detection                    32
    File Format                      332
    File Size                          6
    Image Metadata                  7195
    Image Size                        18
    Prevalent Color Coverage          26
    Random Pixel Test                 11

Table 2: Randomization methods used for I-spam generation, as surveyed by Wang et al. [16]

    Method       Description
    wave         using wavy text to fool OCR
    animate      using animated frames
    deform       deforming text using irregular fonts
    rotate       rotate text to defeat OCR
    shift        move template image
    crop         cropping template image randomly
    size         same image on different canvas sizes
    dots         adding random pixels
    bars         adding random lines in the image
    frame        add a frame to the image
    font         use different fonts or sizes
    line color   adding lines of random color
    shape        use random shapes in background
    metadata     randomize metadata
    url          use different urls in image

2.1 Low Level Features of Image Spam
As noted earlier, recent work has used feature-based classification to detect I-spam. Dredze et al. [5] have investigated the use of high-level features in classifying I-spam. Table 1 lists the features used by the authors in that work; the core idea is that high level features of the image, for example the file format, can be extracted much more quickly than lower level features like edges, so there is an intrinsic ordering of features to consider when building fast classifiers. The approach is general, and our work can easily be extended to this framework. Other researchers (cf. [16]) have discussed another important aspect which we capture in our methodology: they have investigated common obfuscation methods like rotation, scaling, partial translation, font changes, etc. They correctly identify the procedure used by spammers to generate image spam, postulating that the two steps involved are template generation and randomization of the template. This procedure makes the images sent to different users differ in nature, hence defeating simple techniques for checking exact duplicates (e.g. hashing). However, their method is limited in the features used, achieving an accuracy below 85%. In this work, we explore visual image features, while using a simple and novel mechanism for dealing with various obfuscation techniques. Table 2 lists the obfuscation techniques that the survey in [16] has identified.

2.2 Other Methods for I-spam filtering
Early methods for I-spam exploited the heavy use of text in I-spam, a feature that is still present. However, spammers hit back by exploiting the deficiencies of Optical Character Recognition and modifying I-spam accordingly. The shortcomings of OCR have been aptly demonstrated by the design of CAPTCHAs¹, which remain an effective mechanism for telling humans and computer agents apart. To demonstrate the ineffectiveness of OCR for (current) I-spam filtering, we compare our algorithms with OCR.

¹ www.captcha.net

3. STRATEGIES FOR I-SPAM DETECTION

3.1 Near-duplicate Detection in Images
While technology may not be advanced enough to recognize spam from random images without any explicit instructions or rules, recent research in machine learning has led to efficient and accurate pattern recognition algorithms. Our near duplicate detection algorithm is based on the intuition that we can recognize many images similar to an identified spam image; since I-spam is usually generated from a template (see Sec. 2), near duplicates should be easy to detect. Given enough training data, we should be able to detect large volumes of I-spam, while remaining open to further training. We should additionally provide better generalization performance than pure similarity search and abstract from observed positive and negative samples. Near-duplicate detection for images is an extreme form of similarity search [7], which is a well studied topic.
Recent work in probabilistic image clustering has focused on similar issues, but from a different perspective, where a query image is used to search for similar images in a large database. Probabilistic image modeling is particularly suited to the task at hand, since we want to classify a whole family of images as spam while having observed only a few samples from the family. We have chosen Gaussian Mixture Models (GMM) as the starting point for our approach. Gaussian Mixture Models model an image as coherent regions in feature space. First, features are extracted for each pixel, projecting the image to a high-dimensional feature space. For each pixel, we extract a seven-tuple feature vector: two parameters for the pixel coordinates (x, y) (to include spatial information), three for color attributes, and two for texture attributes (anisotropy and contrast [3]). We chose the (L*, a*, b*) color space (henceforth called Lab), since it models most closely how humans perceive colors. The Lab color space is perceptually linear, meaning that a change of the same amount in a color value produces a change of about the same visual importance. For texture, anisotropy and contrast are extracted from the second moment matrix, defined as follows:

    M_σ = G_σ(x, y) ∗ (∇I)(∇I)^T,    (1)

where G_σ(x, y) is a separable binomial approximation to a Gaussian smoothing kernel with variance σ², and ∇I is the gradient of the L channel in the Lab color-space. At every pixel, M_σ is a 2 × 2 matrix and can be eigen-decomposed very simply. Considering a fixed scale σ, we compute the eigenvalues {λ_1, λ_2} of M_σ at (x, y). Anisotropy is consequently defined as A = 1 − λ_2/λ_1, and contrast as c = 2·sqrt(λ_1 + λ_2). Given this transformation, we now represent an image as a set of 7-tuples:

    I = {p_1, p_2, ..., p_n},  p_i = (x_i, y_i, L_i, a_i, b_i, A_i, C_i),    (2)

where n is the number of pixels in the image.
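The following is a minimal sketch of this per-pixel extraction (Eqs. (1)-(2)). The concrete library choices (scikit-image for the RGB-to-Lab conversion, a SciPy Gaussian filter standing in for the separable binomial approximation G_σ) are illustrative assumptions, not the original implementation:

    import numpy as np
    from scipy.ndimage import gaussian_filter
    from skimage.color import rgb2lab

    def pixel_features(rgb, sigma=1.0):
        """Return an (n_pixels, 7) array: (x, y, L, a, b, anisotropy, contrast)."""
        lab = rgb2lab(rgb)                      # perceptually linear color space
        L = lab[..., 0]
        gy, gx = np.gradient(L)                 # gradient of the L channel
        # Entries of the smoothed second moment matrix M = G_sigma * (grad I)(grad I)^T
        mxx = gaussian_filter(gx * gx, sigma)
        mxy = gaussian_filter(gx * gy, sigma)
        myy = gaussian_filter(gy * gy, sigma)
        # Eigenvalues of the 2x2 matrix [[mxx, mxy], [mxy, myy]] at each pixel
        tr, det = mxx + myy, mxx * myy - mxy * mxy
        disc = np.sqrt(np.maximum(tr * tr / 4 - det, 0))
        lam1, lam2 = tr / 2 + disc, tr / 2 - disc
        aniso = 1 - lam2 / np.maximum(lam1, 1e-12)          # A = 1 - l2/l1 (guarded)
        contrast = 2 * np.sqrt(np.maximum(lam1 + lam2, 0))  # c = 2*sqrt(l1 + l2)
        ys, xs = np.mgrid[0:L.shape[0], 0:L.shape[1]]
        feats = np.stack([xs, ys, L, lab[..., 1], lab[..., 2], aniso, contrast], -1)
        return feats.reshape(-1, 7)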
Given this representation, we model the probability distribution of I with a mixture of K Gaussians with parameters Θ:

    f(p|Θ) = Σ_{j=1}^{K} α_j f_j(p|θ_j),  s.t.  Σ_{j=1}^{K} α_j = 1,    (3)

where p is a feature vector representing each point of an image, Θ = (α_1, ..., α_K, θ_1, ..., θ_K), and each f_j is a multivariate Gaussian density function parameterized by θ_j, i.e. (μ_j, Σ_j). The probability density function f_j(p|θ_j) can then be written as

    f_j(p|θ_j) = 1 / sqrt((2π)^d |Σ_j|) · exp{ −½ (p − μ_j) Σ_j^{−1} (p − μ_j)^T }.

Additional constraints on the parameters of the above model include α_j > 0 and diagonal covariances Σ_j; this simplifies the model and avoids inversions of the covariance matrix. We are now interested in finding the maximum likelihood estimate (MLE) of the model parameters Θ, such that

    Θ_ML = argmax_Θ f(p_1, p_2, ..., p_n | Θ).    (4)

Since no closed form solution exists for the above maximization, Expectation Maximization [4] is used to find the MLE. The EM procedure alternates between E-steps and M-steps, where the parameters are updated in the direction of maximum gain in log-likelihood. The EM equations for the GMM are presented below.

E-step:

    w_ij = α_j f(p_i | μ_j, Σ_j) / Σ_{l=1}^{K} α_l f(p_i | μ_l, Σ_l)    (5)

M-step:

    α̂_j = (1/n) Σ_{i=1}^{n} w_ij    (6)

    μ̂_j = Σ_{i=1}^{n} w_ij p_i / Σ_{i=1}^{n} w_ij    (7)

    Σ̂_j = Σ_{i=1}^{n} w_ij (p_i − μ̂_j)(p_i − μ̂_j)^T / Σ_{i=1}^{n} w_ij    (8)

The GMM framework allows us to learn probabilistic models of images, but a hidden assumption is that the images are similar in size. While this is not true for our data, it offers an opportunity as well: at lower resolutions, many obfuscation techniques are ineffective. Using this key observation, we scale every image to a resolution of 100×100. The computational advantage of this method is obvious. This scale has been selected empirically and strikes a balance between loss of information and gain in performance. We explore the effect of scale when extracting visual image features in Section 4.1. In order to ensure faster GMM fitting, we optimized the model fitting procedure; the EM procedure was initialized using K-means clustering. For all experiments, the number of mixture components was fixed at K = 4. A smaller value of K would have resulted in the omission of important components of an image, while a larger value would have also taken insignificant portions of an image into account. After sufficient experimentation, the value of K was fixed at 4 to ensure that noise was filtered out of the model.

After all the GMMs have been learnt (one for every image), we need to cluster similar images together. Clustering requires a similarity measure, and our task requires modeling the distance between two probability distributions. A commonly used distance measure is the Kullback-Leibler (KL) divergence, defined as follows:

    D_KL(P||Q) = Σ_i P(i) log( P(i) / Q(i) ).    (9)

Even though it gives a measure of the closeness of two distributions, this measure is not symmetric. We therefore choose the Jensen-Shannon divergence as the distance measure. The distance measure between clusters c_1 and c_2 takes into account both the dissimilarity between the probability distributions and the size of the two clusters, as shown by the equation

    D_JS(P||Q) = ½ ( D_KL(P||M) + D_KL(Q||M) ),    (10)

where M = ½(P + Q). Using this distance, we can perform clustering (Labeled Agglomerative Clustering); the idea is to find images which are similar enough in the training set, and to replace them with one signature. New images from the test set can then be compared to the already identified signatures, and if there is a close match, a positive identification can be made.
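A minimal sketch of both building blocks follows, under our own library choice of scikit-learn: EM fitting of a diagonal-covariance, K-means-initialized GMM with K = 4 components per image, and a Monte Carlo approximation of Eq. (10), since the JS divergence between Gaussian mixtures has no closed form:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def fit_image_gmm(feats, k=4, seed=0):
        # Diagonal covariances and K-means initialization, as described above;
        # feats is the (n_pixels, 7) matrix of a rescaled 100x100 image
        return GaussianMixture(n_components=k, covariance_type="diag",
                               init_params="kmeans", random_state=seed).fit(feats)

    def js_divergence(p, q, n_samples=2000):
        """Monte Carlo estimate of D_JS(P||Q) for two fitted GaussianMixtures."""
        def kl_to_mixture(a, b):
            x, _ = a.sample(n_samples)                       # x ~ A
            log_a, log_b = a.score_samples(x), b.score_samples(x)
            log_m = np.logaddexp(log_a, log_b) - np.log(2)   # density of M = (A+B)/2
            return np.mean(log_a - log_m)                    # ~ D_KL(A||M)
        return 0.5 * (kl_to_mixture(p, q) + kl_to_mixture(q, p))

Here js_divergence(fit_image_gmm(f1), fit_image_gmm(f2)) would serve as the clustering distance between two images.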
3.1.1 Labeled Agglomerative Clustering
In [8], the authors describe an agglomerative clustering algorithm where clusters are learnt from observed data in an unsupervised fashion, i.e. the number of clusters is initially unknown. To aid in clustering, the Information Bottleneck (IB) principle is used. The IB principle states that, among all possible clusterings, the desired clustering should minimize the loss of mutual information between the objects and the extracted features. Using the IB principle, clustering of the object space X is done while preserving the relevant information about another space Y. We assume, as part of the IB approach, that X̂ → X → Y forms a Markov chain, i.e. given X, the clustering X̂ is independent of the feature space Y. Consider the following distortion function:

    d(x, x̂) = D_KL( p(y|X = x) || p(y|X̂ = x̂) ),    (11)

where D_KL is as defined above in Eq. (9). Note that p(y|x̂) = Σ_x p(x|x̂) p(y|x) is a function of p(x̂|x). Hence, d(x, x̂) is not predetermined, but depends on the clustering; therefore, as we search for the best clustering, we also search for the most suitable distance measure. Since the number of clusters may not be known a priori, we use agglomerative clustering. The Agglomerative Information Bottleneck (AIB) algorithm is a greedy algorithm based on a bottom-up merging procedure [13] which exploits the IB principle. The algorithm starts with the trivial clustering, where each cluster consists of a single point. Since we want to minimize the overall information loss caused by the clustering, in every greedy step we merge the two classes such that the loss in mutual information caused by merging them is minimized. Note, however, that we are still operating in the classification domain, i.e. the data available to us is labeled (ham/spam). The original AIB approach does not consider data labels as input; we therefore extend it here to handle labeled data. We can use the label information to restrict the clustering to data with the same label, thus speeding up cluster formation by comparing distances only between data of the same label. Since the base AIB algorithm is quadratic, this information can result in a speedup of up to 4 times (assuming labels are equally likely to be positive or negative). Further optimizations are possible to make the unsupervised clustering faster (e.g. merging more than two clusters per iteration, or merging everything below a threshold); this part is a placeholder for replacement with better Labeled AIB algorithms. Our presented algorithm consists of two phases, GMM training and Labeled AIB clustering, which are summarized in Algorithms 1 and 2; a condensed sketch of the merging loop follows the listings.

Algorithm 1 TrainGMM(I_i, i = {1..N}, k)
Require: Images I_i, i = {1..N}, no. of mixture components k
 1: for i = 1 to N do
 2:   Resize I_i to 100 × 100 pixels.
 3:   Extract (x, y, L, a, b, A, c) features for each pixel.
 4:   Perform K-means clustering on I_i.
 5:   Choose α_j randomly, s.t. Σ_{j=1}^{k} α_j = 1.
 6:   Initialize μ_j to the center of cluster j.
 7:   Initialize Σ_j to 1.
 8:   while convergence not reached do
 9:     Repeat E-steps and M-steps (Eqs. (5)-(8))
10:   end while
11: end for
Ensure: Θ = (α_j, μ_j, Σ_j)

Algorithm 2 LabeledAIB(Θ_i, L_i)
Require: GMMs Θ_i, i = {1..N}, label L_i for each image
 1: C = {c_1, ..., c_N}  {N clusters}
 2: for i = 1 to N do
 3:   f(p|c_i) = {Θ_i}  {each image in its own cluster}
 4: end for
 5: for each pair i, j = 1 to N do
 6:   Distance(i, j) = JSdist(f(p|c_i), f(p|c_j))
 7: end for
 8: for i = 1 to N − 1 do
 9:   repeat
10:     Find {m, n} = argmin_{m,n} |Distance(m, n)|
11:   until L_m == L_n, else find next minimum
12:   if Distance(m, n) ≥ Threshold then
13:     break
14:   end if
15:   Merge {c_m, c_n} = ĉ
16:   f(ĉ) = (|c_m| / |c_m ∪ c_n|) f(c_m) + (|c_n| / |c_m ∪ c_n|) f(c_n)
17:   Update Distance(·, ·)
18: end for
Ensure: f(p|c_l), c_l, l ≤ no. of clusters
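The condensed sketch below reproduces only the control flow of Algorithm 2, assuming a distance such as js_divergence above. For brevity, a merged cluster is represented here by its member indices and compared via a representative member; a faithful implementation would instead merge the GMM densities with the size weights of line 16:

    def labeled_aib(models, labels, dist, threshold):
        """Greedy label-constrained agglomerative merging (cf. Algorithm 2)."""
        # Each cluster is (member indices, label); every image starts alone
        clusters = [([i], l) for i, l in enumerate(labels)]
        while len(clusters) > 1:
            best = None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    if clusters[i][1] != clusters[j][1]:
                        continue              # lines 10-11: same-label pairs only
                    d = dist(models[clusters[i][0][0]], models[clusters[j][0][0]])
                    if best is None or d < best[0]:
                        best = (d, i, j)
            if best is None or best[0] >= threshold:
                break                         # stopping rule of lines 12-14
            _, i, j = best
            merged = (clusters[i][0] + clusters[j][0], clusters[i][1])
            clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
            clusters.append(merged)
        return clusters                       # one signature per cluster

This naive version recomputes pairwise distances and is cubic; the paper's quadratic cost assumes the distance matrix is maintained incrementally (line 17).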
3.1.2 Prediction Phase
New images, whose labels have to be predicted, are fitted to a GMM and compared to the labeled signatures already known from the trained cluster model. The closest cluster to a new image is found using the JS divergence, and the label of the new image is the label of that cluster. As a result, images similar to spam-labeled signatures will be marked as spam, and images similar to ham-labeled signatures will be marked as ham. An optimization of this algorithm is to introduce a new label called unidentified. Images that are not similar to any of the existing signatures (i.e. the JS divergence with the closest cluster is larger than a particular threshold) can be labeled as unidentified and may require user intervention to be labeled appropriately as either ham or spam. This way, more accurate results are obtained, while also ensuring that training continues even during the testing phase.

3.2 Visual Features for Classification
Near-duplicate detection is likely to perform well in abstracting base templates when given enough examples of the various spam templates in use. However, the generalization ability of this method is limited, since it does not exploit global features of images; as previous work has shown, there are many likely give-away features. We feel, however, that the previously explored feature sets miss a key aspect: I-spam is artificially generated. In this section, we explore visual features like texture, shape and color, and learn classifiers using these selected features. Notice that we extract global features, i.e. each feature represents a property of the entire spam image.

3.2.1 Color Features
Color models are used to classify colors and to qualify them according to attributes such as hue, saturation, chroma, lightness, or brightness. The Red-Green-Blue (RGB) model is the most basic and well-known color model, and is also the basic color model for on-screen display. Using this color model, we chose the following features (a sketch of the first three extractors follows the list):

· Average RGB: the average values of the R, G and B channels over all pixels in an image. The range of RGB is normalized to (0, 1).

· Color Histogram: for a given image, the color histogram H_I is a compact summary of the image. A color histogram H is a vector (h_1, h_2, ..., h_n), in which each bucket h_j contains the number of pixels of color j in the image. Typically, images are represented in the RGB color-space, and a few of the most significant bits are used from each color channel. We chose a 6-bit color space, leading to 64 features.

· Color Moments: the use of color moments is based on the assumption that the distribution of color in an image can be interpreted as a probability distribution. Probability distributions are characterized by a number of unique moments; [14] uses three central moments of an image's color distribution: mean, standard deviation and skewness. Using the RGB channels and 3 moments per channel, we get 9 features.

· Color Coherence Vector: an image's color coherence is the degree to which pixels of a given color are members of large similarly-colored regions. We refer to these significant regions as coherent regions and aim to classify pixels as coherent or incoherent. We first blur the image slightly by replacing each pixel value with the average value in a small local neighborhood (including the 8 adjacent pixels). This eliminates small variations between neighboring pixels. We then discretize the color space, such that there are only n distinct colors in the image. The next step is to classify the pixels within a given color bucket as either coherent or incoherent. A coherent pixel is part of a large group of pixels of the same color, while an incoherent pixel is not. We determine the pixel groups by computing connected components, only within a given discretized color bucket; this effectively segments the image based on the discretized color-space. We then classify each pixel as coherent or incoherent depending on the size in pixels of its connected component C: a pixel is coherent if the size of its connected component exceeds a fixed value T (e.g. 1% of the number of pixels); otherwise, the pixel is incoherent. 128 features are chosen in this manner.
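A minimal sketch of the first three extractors follows (the color coherence vector, which needs connected-component analysis, is omitted for brevity); the input is assumed to be an 8-bit RGB array:

    import numpy as np

    def color_features(rgb):
        """Average RGB (3), 64-bucket histogram (2 bits per channel, i.e. a
        6-bit color space), and 9 color moments: mean, standard deviation
        and skewness per channel. `rgb` is an (H, W, 3) uint8 array."""
        px = rgb.reshape(-1, 3)
        pxf = px.astype(np.float64) / 255.0          # normalize range to (0, 1)
        avg = pxf.mean(axis=0)                       # average RGB
        q = px >> 6                                  # keep 2 most significant bits
        idx = q[:, 0] * 16 + q[:, 1] * 4 + q[:, 2]   # bucket id in 0..63
        hist = np.bincount(idx, minlength=64) / len(px)
        mean, std = pxf.mean(axis=0), pxf.std(axis=0)
        skew = ((pxf - mean) ** 3).mean(axis=0) / np.maximum(std ** 3, 1e-12)
        return np.concatenate([avg, hist, mean, std, skew])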
[Figure 2: Examples of Ham images chosen by us for training the classifier. Note that we chose some ham images with text as well; other categories of ham images include company logos, cartoons and wallpapers.]

3.2.2 Texture features
Image texture, defined as a function of the spatial variation in pixel intensities (gray values), is useful in a variety of applications and has been a subject of intense study by many researchers (cf. [15]). The intuition behind choosing texture features for classification is that natural images have a different quality of texture compared to the textures of computer-generated images. We extract the following features from the images in our data set (a sketch of two of the extractors follows the list):

· Autocorrelation: autocorrelation measures the coarseness of an image by evaluating the linear spatial relationships between texture primitives. Large primitives give rise to coarse texture (e.g. a rock surface) and small primitives give rise to fine texture (e.g. a silk surface). If the primitives are large, the autocorrelation function decreases slowly with increasing distance, whereas it decreases rapidly if the texture consists of small primitives. If the primitives are periodic, the autocorrelation function increases and decreases periodically with distance.

· Edge Frequency: a number of edge detectors can be used to yield an edge image from an original image. We can compute an edge-dependent texture description function E as follows:

    E(d) = |f(i, j) − f(i, j + d)| + |f(i, j) − f(i, j − d)| + |f(i, j) − f(i + d, j)| + |f(i, j) − f(i − d, j)|

This function is inversely related to the autocorrelation function. Texture features can be evaluated by choosing specified distances d and averaging over the entire image. We vary the distance parameter d from 1 to 25, giving a total of 25 features.

· Primitive Length (Run Length): a primitive is a continuous set of the maximum number of pixels in the same direction that have the same gray level. Each primitive is defined by its gray level, length and direction. Primitive length (run length) uses the lengths of texture primitives in different directions as the texture description. Coarse textures contain more long texture primitives, and fine textures contain more short texture primitives.

· Co-occurrence Matrices: whether considering the intensity or gray-scale values of the image or various dimensions of color, the co-occurrence matrix can measure the texture of the image. Because co-occurrence matrices are typically large and sparse, various metrics of the matrix are often taken to obtain a more useful set of features; features generated using this technique are usually called Haralick features [9]. Co-occurrence matrices are based on the repeated occurrence of some gray-level configuration in the texture; this configuration varies rapidly in fine textures and more slowly in coarse textures.
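Below is a minimal sketch of two of these extractors: the edge-frequency descriptor E(d) for d = 1..25, and a few Haralick-style co-occurrence statistics via scikit-image (our own library substitution; the function is spelled graycomatrix in scikit-image >= 0.19). The input is an 8-bit grayscale array, assumed already rescaled to 100×100 so that all shifts up to d = 25 stay inside the image:

    import numpy as np
    from skimage.feature import graycomatrix, graycoprops

    def edge_frequency_features(gray, dmax=25):
        """E(d) averaged over the image interior for d = 1..dmax (25 features)."""
        g = gray.astype(np.float64)
        feats = []
        for d in range(1, dmax + 1):
            c = g[d:-d, d:-d]                  # interior where all four shifts exist
            e = (np.abs(c - g[d:-d, 2 * d:]) + np.abs(c - g[d:-d, :-2 * d]) +
                 np.abs(c - g[2 * d:, d:-d]) + np.abs(c - g[:-2 * d, d:-d]))
            feats.append(e.mean())
        return np.array(feats)

    def cooccurrence_features(gray):
        """A few scalar summaries of the gray-level co-occurrence matrix."""
        glcm = graycomatrix(gray, distances=[1], angles=[0],
                            levels=256, symmetric=True, normed=True)
        props = ("contrast", "homogeneity", "energy", "correlation")
        return np.array([graycoprops(glcm, p)[0, 0] for p in props])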
[Figure 3: Classification accuracy of the visual-features-based SVM at different resolutions (images downscaled to 50×50, 100×100 and 400×400), as a function of the fraction of training data. Notice how higher resolutions provide better results, but at much higher cost, since most texture and shape features have non-linear extraction times.]

3.2.3 Shape features
· Geometric Moments: image moments are certain weighted averages (moments) of the image pixels' intensities, or functions of those moments, usually chosen to have some attractive property or interpretation. When normalized, they can be considered as a probability distribution. If f(x, y) is a digital image, the central moment is defined as follows:

    μ_pq = Σ_x Σ_y (x − x̄)^p (y − ȳ)^q f(x, y)    (12)

The central moment is useful in representing the orientation of the image content.

· Eccentricity: an approximate measure of the eccentricity or elongation of an object is given by

    e = ( (μ_20 − μ_02)² + 4 μ_11² ) / (μ_20 + μ_02)²,    (13)

where μ_ij is the (i, j)-th central moment as defined in Eq. (12).

· Legendre and Zernike Moments: global moments are employed as invariant global features of an image in pattern recognition, due to their ability to recognize shapes. The use of non-geometric moments has been advocated in image reconstruction due to their better resistance to noise and more accurate reconstruction. Legendre moments use Legendre polynomials, while Zernike moments use Zernike polynomials.

3.3 An algorithm for Classification using Visual Features
Many approaches exist to train classifiers using extracted features; however, Support Vector Machines (SVM) [2] have been established as the best performing technique. In particular, the use of the kernel trick allows an SVM to explore non-linear combinations of features by considering dot products in a high dimensional Hilbert space. The SVM classification problem is formalized as follows: we consider data points of the form

    {(x_1, c_1), (x_2, c_2), ..., (x_n, c_n)},    (14)

where c_i ∈ {−1, 1} denotes the class that x_i belongs to (e.g. spam/ham). We can view this as training data denoting the correct classification which we would like the SVM to learn, by means of the dividing (or separating) hyperplane, which takes the form

    w · x − b = 0,  |w| = 1.    (15)

The kernel trick is used in the dual form of the quadratic program solving the above optimization. The SVMlight package [10] offers an efficient implementation of SVMs with many supported kernels. Since the data might not be separable, soft margin classification may be required. We have used the radial basis function as the kernel, since the corresponding Hilbert space is of infinite dimension. Algorithm 3 summarizes the overall approach.

Algorithm 3 SVM-classify(I_i, L_i)
Require: Images I_i, labels L_i, i = {1..N}
 1: for i = 1 to N do
 2:   F_color_i = ExtractColorFeatures(I_i)
 3:   F_shape_i = ExtractShapeFeatures(I_i)
 4:   F_texture_i = ExtractTextureFeatures(I_i)
 5:   F_i = F_color_i ∪ F_shape_i ∪ F_texture_i
 6: end for
 7: Classifier C = LearnSVMClassifier(F, L)
Ensure: Classifier C

3.4 Optical Character Recognition
Optical Character Recognition (OCR) was the first proposed solution to I-spam; there are various commercial and open source solutions (e.g. SpamAssassin's FuzzyOCR plugin) using OCR libraries. We chose the well known Tesseract OCR suite, developed by HP Labs and recently open sourced by Google². The OCR based algorithm we use is a simple one: we classify an image as spam if the OCR module can find more than 2 characters in the image. This has a low accuracy of around 80%; increasing the threshold to 5 characters leads to accuracies below 65%.

² Tesseract is available at http://code.google.com/p/tesseract-ocr/
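A minimal sketch of this OCR baseline, assuming the pytesseract bindings to Tesseract are installed; the default min_chars = 2 reproduces the "more than 2 characters" rule:

    import pytesseract
    from PIL import Image

    def ocr_flags_spam(path, min_chars=2):
        """Flag an image as spam if OCR recovers more than min_chars characters."""
        text = pytesseract.image_to_string(Image.open(path))
        n_chars = sum(not ch.isspace() for ch in text)   # count recognized glyphs
        return n_chars > min_chars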
4. EVALUATION
To evaluate our algorithms, we have chosen three recent public datasets. The first is due to [6], who collected over 13000 I-spam emails from the erstwhile SpamArchive³. The second was created by Dredze et al. [5] (who also used the SpamArchive dataset for evaluation) from their personal emails. This is called the Personal Spam dataset by the authors; they have also created a Personal Ham dataset from their personal emails, representing true ham emails. This is an important collection, since most other researchers have used general images and photos from the web (examples of ham images are shown in Fig. 2). Both these datasets are now publicly distributed at http://www.seas.upenn.edu/~mdredze/datasets/image_spam/. The SpamArchive collection consists of 13621 files, out of which only 10623 are in image format; the other files are unreadable by image processors, due to deliberate manipulation by spammers exploring alternatives to images for e-mail spam. The Personal Spam collection consists of 3300 images. In addition, we have also used the Princeton Image Spam Benchmark⁴, which contains 1071 images with category information. The utility of this dataset is that each category contains permutations of the same base I-spam template (see Fig. 6).

³ The site was earlier available at http://www.spamarchive.org/.
⁴ http://www.cs.princeton.edu/cass/spam/spam_bench/

[Figure 4: Classification accuracy of the visual-features-based SVM compared with previous results (Personal Spam vs Ham, Spam Archive vs Ham, the corresponding splits reported by Dredze et al., and OCR on SpamArchive), as a function of the fraction of training data. Notice a significant improvement over other approaches and good generalization at training sizes as low as 10%.]

[Figure 5: Comparison of the classification accuracy of the visual-features-based SVM, GMM based Labeled AIB, and OCR on the Princeton dataset, as a function of the fraction of training data.]

Baseline: For the sake of comparison, we use the same datasets used previously by other researchers. In particular, we use the testing approach of Dredze et al., since their results are the best reported so far in comparison to other work. The algorithms presented above are compared under conditions identical to the experiments in [5]; a sketch of this protocol follows. OCR results using the open source Tesseract are also provided for comparison.
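A sketch of the protocol, substituting scikit-learn's RBF-kernel SVC for the SVMlight package used in the paper; X is the matrix of concatenated color, texture and shape features, and y the ham/spam labels:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    def accuracy_sweep(X, y, fractions=np.arange(0.1, 1.0, 0.1), seed=0):
        """Held-out accuracy as a function of the training fraction (cf. Fig. 4)."""
        scores = {}
        for f in fractions:
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, train_size=float(f), stratify=y, random_state=seed)
            clf = SVC(kernel="rbf").fit(X_tr, y_tr)   # soft-margin RBF SVM
            scores[round(float(f), 1)] = clf.score(X_te, y_te)
        return scores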
4.1 Results
We first discuss the more general approach of binary classification using visual features. We use our homegrown ham data set, consisting of images, photographs, logos, wallpapers, emoticons and similar items used in real-world email exchanges. In all, this amounts to 5373 ham images created from collected images, including a large chunk of Dredze et al.'s personal ham (which has not been released in its entirety for privacy reasons). We first resize the images to 100 × 100 pixels; this simple mechanism makes our method robust to random pixels, simple translation and scaling. It also helps greatly with computational requirements, as texture and shape extraction are time-consuming non-linear routines. We then extract features from all the images in the positive and negative sets. A fraction of the images is used to train the SVMlight-based classifier, while tests are carried out on the remaining images; Fig. 4 reports the prediction accuracy over the positive and negative sets as a function of this training fraction. We observe that visual features are highly indicative of spam images; our method reports a prediction accuracy of over 95% in all cases. Even with large sets of over 15,000 emails and only 10% of the data used for training, the prediction accuracy is higher than the best numbers reported by [5, 6]. At a training/test fraction comparable to that used by Dredze et al., we achieve almost 98% accuracy, an improvement of over 6%. On the Personal Spam dataset, we achieve results comparable to [5].

We also investigated the impact of resolution on our approach; in particular, we are interested in whether lower resolutions can provide similar results at a cheaper computational cost. Fig. 3 shows the results of this evaluation: we notice small losses at lower resolutions and small gains at higher resolutions. At 400 × 400, the approach is near perfect, achieving more than 99.6%; higher accuracies will be practically impossible due to human errors in labeling data. It is important to point out that our results are nearly, though not completely, comparable with the previous papers: all researchers have used their own ham datasets, including us, and the choice of ham can have a big impact on the outcome of prediction. However, the important issue is labeling spam correctly over a large heterogeneous spam set. We are currently procuring more spam by creating honeypots, with the aim of collecting over 100,000 spam emails. We are also working on the creation of a SpamAssassin⁵ plug-in using our framework.

⁵ www.spamassasin.org

[Figure 6: Examples of two I-spam images from the same category in the Princeton dataset. In the figure on top, (b) is the 100×100 image, and (c) is the reproduced image. In the figure below, the small figures are GMM reconstructions of 100×100 thumbnails. Notice how close the reproduction of the low-res image is to the actual low-res image. Even though the spam images differ in resolution and content (see the figure below), the low-res reproductions are almost identical to the GMM approximation.]

Table 3: Clustering accuracy of GMM based Labeled AIB on the Princeton Spam Benchmark

    Original Clusters             178
    Clusters Found                140
    Total Images                 1004
    Images Correctly classified   760
    Images Misclassified          244
    % Error                       24%
Results from GMM based AIB: The Labeled AIB approach is unlikely to reach the same generalization performance, as it is designed to identify permutations of a base template. Since positive images do not necessarily follow this pattern and are potentially infinite, Labeled AIB is expected to be clueless about previously unseen images, and early experiments in the settings above lead to weak results. However, the idea is still powerful: collaborative approaches to spam filtering like Spamato [1] encourage users to share manually-classified email spam, which can be used to train the spam filters of other users in the community. Our approach fits very well into this framework, since AIB based on GMMs can identify the base template. This also makes our approach well suited to server deployment, as an identified pattern can be protected against for a large user population before spam reaches individual inboxes. The goal is to detect a new pattern, in a manner similar to viruses, and make the pattern available to all subscribed email servers. Further investigation of this idea is in progress.

To explore how effective the GMM approach is in identifying patterns pre-classified as spam, we find the Princeton dataset very helpful. This dataset has various categories of I-spam images clustered according to the base template; descriptions of the different strategies used by spammers are also provided. Fig. 6 demonstrates that two spam images produced by the same template are indeed modeled very similarly. Notice how well our chosen features model the low resolution thumbnails with a simple parametric model. This provides the perfect opportunity to ask: can AIB clustering based on GMMs identify the clusters correctly and demonstrate that it recognizes the base templates? We performed a run of our algorithm to test this hypothesis. The Princeton dataset contains images in 178 categories; however, this categorization is very strict. Many of these clusters can be merged, and manual clustering resulted in fewer than 150 clusters. Our approach found 140 clusters (see Table 3); clearly, some misclassification was also found. However, the performance is better than the numbers indicate: clusters that were wrongly merged were counted as wrong, even though they are similar when manually inspected. When constrained to find exactly 178 clusters, the misclassification rate falls below 16%. This indicates that the predictive performance for spam detection should still be high, and Fig. 5 confirms this empirically: GMM based Labeled AIB has high accuracy when predicting spam under comparable conditions. OCR does much worse, while the trained SVM performs marginally better.

5. CONCLUSIONS AND FUTURE WORK
In this paper, we have presented two novel approaches to fight the menace of image-based email spam by exploiting visual features and the template-based nature of I-spam. Experimental results strongly support our hypothesis that low-level feature analysis can provide a more accurate detection mechanism than previously known. Our work complements earlier work, and it should be easy to incorporate the visual features extracted by us into the framework of [5]; the likely gain is better accuracy, with much improved running time via the JIT feature extraction suggested there.
Since research in I-spam is recent, with less than one year since the emergence of the problem, more work is likely to happen in the future. In particular, spammers will notice the anti-spam measures taken and innovate to produce new attacks. The emergence of PDF based spam is one such innovation; clearly, spammers try to exploit all popular document formats to get their messages through. Until more principled shifts in email (e.g. postage for email, cf. [11]) or improvements in the underlying protocols happen, anti-spam research will remain a firefighting operation.

6. REFERENCES
[1] K. Albrecht, N. Burri, and R. Wattenhofer. Spamato - An Extendable Spam Filter System. 2nd Conference on Email and Anti-Spam (CEAS), Stanford University, Palo Alto, California, USA, 2005.
[2] C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998.
[3] C. Carson, M. Thomas, S. Belongie, J. Hellerstein, and J. Malik. Blobworld: A system for region-based image indexing and retrieval. Third International Conference on Visual Information Systems, pages 509-516, 1999.
[4] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1):1-38, 1977.
[5] M. Dredze, R. Gevaryahu, and A. Elias-Bachrach. Learning Fast Classifiers for Image Spam. In Proceedings of the Conference on Email and Anti-Spam (CEAS), 2007.
[6] G. Fumera, I. Pillai, and F. Roli. Spam Filtering Based On The Analysis Of Text Information Embedded Into Images. The Journal of Machine Learning Research, 7:2699-2720, 2006.
[7] J. Goldberger, S. Gordon, and H. Greenspan. An efficient image similarity measure based on approximations of KL-divergence between two Gaussian mixtures. In Proceedings of the Ninth IEEE International Conference on Computer Vision, pages 487-493, 2003.
[8] J. Goldberger, H. Greenspan, and S. Gordon. Unsupervised image clustering using the Information Bottleneck method. In Proc. DAGM, 2002.
[9] R. Haralick, I. Dinstein, and K. Shanmugam. Textural features for image classification. IEEE Transactions on Systems, Man, and Cybernetics, 3:610-621, 1973.
[10] T. Joachims. Making large-scale SVM learning practical. In Advances in Kernel Methods - Support Vector Learning. MIT Press, 1999.
[11] R. Kraut, J. Morris, R. Telang, D. Filer, M. Cronin, and S. Sunder. Markets for attention: will postage for email help? In Proceedings of the 2002 ACM Conference on Computer Supported Cooperative Work, pages 206-215, 2002.
[12] Secure Computing. Image spam: The latest attack on the enterprise inbox. Secure Computing whitepaper, available online, Nov 2006.
[13] N. Slonim and N. Tishby. Agglomerative information bottleneck. Advances in Neural Information Processing Systems, 12:617-623, 2000.
[14] M. Stricker and M. Orengo. Similarity of color images. In Proc. SPIE Storage and Retrieval for Image and Video Databases, 2420:381-392, 1995.
[15] M. Tuceryan and A. Jain. Texture analysis. Handbook of Pattern Recognition and Computer Vision, pages 235-276, 1993.
[16] Z. Wang, W. Josephson, Q. Lv, M. Charikar, and K. Li. Filtering Image Spam with Near-Duplicate Detection. In Proceedings of the 4th Conference on Email and Anti-Spam (CEAS), 2007.