WWW 2008 / Poster Paper April 21-25, 2008 · Beijing, China Online Change Detection in Individual Web User Behaviour Peter I. Hofgesang hpi@few.vu.nl Jan Peter Patist jpp@few.vu.nl VU University Amsterdam, Depar tment of Computer Science De Boelelaan 1081A, 1081 HV Amsterdam, The Netherlands ABSTRACT Based on our field studies and consultations with field experts, we identified three main problems that are of key importance to online web personalization and customer relationship management: 1) detecting changes in individual behaviour, 2) reporting on user actions that may need special care, and 3) detecting changes in visitation frequency. We propose solutions to these problems and experiment on real-world data from an investment bank collected over 1.5 years of web traffic. These solutions can be applied on any domains where individuals tend to revisit the website and can be identified accurately. Categories and Subject Descriptors E.1 [Data]: Data Structures - Trees; H.2.8 [Database Management]: Data Mining General Terms Algorithms, Experimentation Keywords Incremental Web Mining, User Profiling, Change Detection Goal 1: Detect changes in individual b ehaviour A solution to this problem should include incremental maintenance of a compact user profile for each individual and the output of changes. Change signals can be used to update personalized content or for marketing actions while the base profiles can help selecting the proper personalized content. Goal 2: Rep ort "interesting" sessions From time to time online users may not be able to find some content or they may have technical difficulties. The idea behind the second problem is to identify such points, "interesting" sessions, in a stream of user actions. Based on such alerts, an assistant may offer instant online help (e.g. chat or voice call) or the system may take prompt automated action. Goal 3: Detect changes in user activity Detecting changes in an individual's visitation frequency seems to relate to our first problem; however, it requires different treatment both technically and in the types of output alert. For example, early identification of a decrease in an individual's visitation frequency may support actions for customer retention. Based on the different types of change in visitation frequency, we may label changes as "runner up" or "slower down" based on the increase or decrease in frequency. The contribution our work makes is the identification of three relevant problems in individual, on-the-fly web personalization and their proposed space and computationally efficient solutions. 1. INTRODUCTION Web personalization techniques allow companies to customize their web services according to their customer's needs. However, changing customer behaviour and frequent structural changes and content updates of websites are likely to impair the underlying models. Current personalization techniques require periodic updates and human interaction to cope with this problem and thus maintain up-to-date personalization. To automate this process, we need to identify points of change (see e.g. [1]) in user behaviour automatically as the data flows in. Identifying such changes - whether they are seasonal or temporal, or a long-term behavioural trend shift - and promptly taking appropriate action in response leverages business advantage. During our analysis of numerous web usage datasets, including data from online retail shops and an online bank, we identified three main problems that we summarize in the following three goals: Copyright is held by the author/owner(s). W W W 2 0 0 8 , April 21­25, 2008, Beijing, China. ACM 978-1-60558-085-2/08/04. 2. INDIVIDUAL USER PROFILES We choose to use a tree-like structure to represent user profiles. Each node may contain more than one page id, a frequency counter, a reference to its parent and a vector of references to its children. Adding a session (we keep visited pages as sets, discarding order and cardinality information) to a profile tree can be broken down as a simpler recursive procedure. The procedure is called, in each step, by a given tree node n and the remaining pages to be inserted, P . As a first step, we select the children of n that have the highest overlap with P . If there are more nodes with equal, non-zero overlap we select the first one. If P has no overlap with any of the children nodes we insert P as a new node under n. Otherwise, we identify the type of overlap. In case the children node is equivalent to P or is fully part of P , we increase the counter of the children and invoke another recursive step on the children node and the non-overlapping remainder of P . In the other two cases, we need to split the children node by subtracting and raising the overlap directly above it as a new parent node, update the frequency counters and insert 1157 WWW 2008 / Poster Paper the non-overlapping part, if there is any, as a new children node under the newly created node of common pages. In our methods, we maintain the profiles over a sliding window. This can be easily done by maintaining references to leaf nodes in the tree in a "round" vector. Here we define a distance metric over a profile tree and an incoming session (S ). In fact, we define a similarity metric and we transform a distance by dist = 1 - sim. The similarity measure basically selects a branch with the highest overlap to the input session and calculates a score on it. Once again, we can define this metric recursively as each node returns the number of its overlapping pages with the subset of pages it received plus the total overlap on the remainder of pages returned from its children nodes. Each recursive step also returns a subscore and, among identical overlaps, we return the highest score as similarity. We calculate the scores at each node by taking the minimum of r eq u the normalised node frequency ( n.fnormency ) and the page T probability (normP ). Both norms are known upfront to the algorithm. normT is simply the number of pages added to the tree and normP = 1/|S |. April 21-25, 2008 · Beijing, China tion (pdf ) of the data by the poisson distribution. For a given user we incrementally estimate the probability p of ^ visitation on a given day. The estic ate p is compared to m ^ p alt h alternatives palt via the Sc = t log ( pdfp ) sums, where ^ t is the last change point and c is the current time. Each Sc , for every h, is maintained incrementally. The h alternatives are fixed upfront. Given that p is more likely than ^ the alternatives, Sc has a downward trend. An increasing Sc indicates a change in the trend and a change is detected if Sc - mint...c Sc > Talarm . The complexity is linear in the number of alternatives, which is O(|H1 |) per session. 4. EXPERIMENTAL RESULTS 3. CHANGE DETECTION FRAMEWORK Goal 1: Detect changes in individual b ehaviour A user profile is maintained for each individual and compared, with a window of most recent sessions. First, the tree is initialized with n sessions. The score of an incoming session is calculated using the distance function defined in Section 2. If the session is not an outlier (score < toutlier ), it is added to a buffer (ref ), excluding the last m scores, which are kept in a buffer W . A change is detected when the distance (d) between the most recent scores (W ) and ref is larger than a threshold, Tchange . The distance is defined as w f f d(ref , W ) = W ecd r ef (w ), where ecd (x) is the empirical cumulative distribution function of ref . If we maintain the ecdfref using a balanced tree algorithm, the complexity of the algorithm is O(log (|ref |)) and the distance function, dist, can be calculated efficiently as well. Goal 2: Rep ort "interesting" sessions The problem of finding "interesting" sessions is rather sub jective and cannot be clearly formulated. We choose to rate pages as "highly interesting" when they are popular for a specific individual but rarely visited by others. Interesting sessions are those containing interesting pages. We consider only outlier sessions (calculated in Goal 1) as potential interesting sessions. We calculate a new score over the outlier session based on an automatically maintained popularity matrix. The components of this matrix are weights calculated by n, | the T F - I DFi,j = k inj ,j log |{dj :|tDdj }| weighting applied i k to our data, where i refers to a term (web pages) and j to documents (individual sessions) and nij is the number of occurrences of term i in document j . We label a session as interesting if its overall score is higher than a user-defined threshold. Goal 3: Detect changes in user activity While the first two problems, Goals 1 and 2, can be included in a common framework, we need another layer of data abstraction and a different system for the third problem. The problem at hand is detecting change in a stream of binary data representing whether the user was active on a given day. Our algorithm is based on the CUSUM (cumulative sum) algorithm [3] and we model the probability density func- Our experiment included results on server-side web access log data of an investment bank collected over a 1.5year period. To evaluate our change detection methods, we randomly selected two sets each of 200 individuals with their sessions, and labelled them together with field experts. We looked at the individual sessions evolving over time and marked the points of change. Goal 1: evaluation The accuracy of our method is 69.57% with 67 false alarms and the mean detection latency is 5.12 sessions. Goal 2: evaluation While experimenting with the TFIDF scoring, we found that the high popularity of a relatively few pages among all individuals resulted in extreme high term frequency (T Fi,j ) components that overweighed the I DFi,j scores and, even though popular pages were present in most profiles, their final T F - I DFi,j scores were much higher for common pages than the much more interesting rare pages. To avoid the effect of the power law distribution of web pages, we ignored the T Fi,j weight and used only the I DFi,j component in final scoring. Goal 3: evaluation The accuracy of our method is 77.36% with 117 false alarms, and the mean detection latency is 6.37 days. Figure 1 shows an example of score evolution and change detection both for Goal 1 and Goal 3. For more experimental results we refer the reader to [2]. Change scores 1 2.5 Change scores 0 -1 -2 -3 -4 -5 -6 -7 -8 -9 0 2 1.5 User activity data Change score Change detected Change point Change score Change point Change detected 50 100 150 200 250 300 350 400 1 0.5 0 0 20 40 60 80 100 120 140 160 180 Sessions over time User activity over time Figure 1: Example Goal 1 (left) and Goal2 (right): Score evolution and change detection for individuals 5. REFERENCES [1] M. Basseville and I. V. Nikiforov. Detection of Abrupt Changes - Theory and Application. Prentice-Hall, Inc., 1993. [2] P. I. Hofgesang and J. P. Patist. Online change detection in individual web user behaviour. In http: // www. few. vu. nl/ ~hpi/ papers/ IndivChangeDet. pdf . Technical Report, Vrije Universiteit Amsterdam, 2007. [3] E. Page. On problems in which a change in a parameter occurs at an unknown point. Biometrika, 44:248­252, 1957. 1158