Detection of near-duplicate documents is an important problem in many data mining and information filtering applications. When faced with massive quantities of data, traditional duplicate detection techniques that rely on direct inter-document similarity computation (e.g., using the cosine measure) are often infeasible given time and memory constraints. Fingerprint-based methods, on the other hand, are computationally very attractive but are brittle with respect to small changes in document content. This talk focuses on the history of duplicate detection algorithms and on a general technique for increasing fingerprint robustness via multiple lexicon randomizations.
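To make the idea concrete, here is a minimal sketch of how a lexicon-based fingerprint and its randomized variants might look. It is an illustration under stated assumptions, not necessarily the method presented in the talk: the function names, the whitespace tokenization, the choice of SHA-1, and the `drop_fraction` parameter are all hypothetical. A single fingerprint hashes the document terms that fall inside a lexicon, so one changed lexicon term changes the hash. Computing additional fingerprints over randomly thinned copies of the lexicon gives two documents several chances to agree.

```python
import hashlib
import random


def fingerprint(text, lexicon):
    """Hash the sorted set of document terms that also appear in the lexicon."""
    # Assumption: lowercase whitespace tokenization stands in for a real tokenizer.
    terms = sorted(set(text.lower().split()) & set(lexicon))
    return hashlib.sha1(" ".join(terms).encode("utf-8")).hexdigest()


def randomized_lexicons(lexicon, k, drop_fraction, seed=0):
    """Return the original lexicon plus k copies with a random fraction of terms removed."""
    rng = random.Random(seed)
    ordered = sorted(lexicon)  # fixed order keeps the sampling reproducible
    keep = len(ordered) - int(len(ordered) * drop_fraction)
    return [set(lexicon)] + [set(rng.sample(ordered, keep)) for _ in range(k)]


def signatures(text, lexicons):
    """One fingerprint per lexicon, in lexicon order."""
    return [fingerprint(text, lex) for lex in lexicons]


def near_duplicates(text_a, text_b, lexicons):
    """Treat two documents as near duplicates if any per-lexicon fingerprint agrees."""
    pairs = zip(signatures(text_a, lexicons), signatures(text_b, lexicons))
    return any(a == b for a, b in pairs)
```

In this sketch, two documents that differ by a single lexicon term disagree on the original fingerprint but agree on every thinned lexicon that happens to omit that term. Documents containing no lexicon terms all hash the empty string, so a practical system would need to handle that case separately.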
Dr. Abdur Chowdhury has over 12 years of research and development experience in computer science. He has worked as both a researcher and a developer, which has given him a valuable perspective on balancing software design with innovation. During this time he has launched over 20 commercial search products, filed more than 20 patent applications, and written over 70 publications in magazines, scientific conferences, journals, and book chapters covering many computer science topics, including AI, networking, operating systems, system scaling, and information retrieval. Dr. Chowdhury served as the Chief Architect for Search at AOL and has held positions at the IIT Information Retrieval Lab and at Georgetown as an adjunct professor. Currently, he is the founder of Summize.com, which works on organizing opinions from the web.
This talk is part of the CLIP Colloquium Series, organized by Jimmy Lin (jimmylin -at- umd .dot. edu). For the complete schedule, please visit http://www.umiacs.umd.edu/research/CLIP/colloq/.