Users of internet search engines often enter queries with misspellings in one or more search terms. Several web search engines suggest corrections for misspelled words, but to our knowledge the methods they use are proprietary and unpublished. Here I will describe the methodology we have developed to perform spelling correction for the PubMed search engine. Our approach is based on the noisy channel model for spelling correction and uses statistics harvested from user logs to estimate the probabilities of the different types of edits that lead to misspellings. I will discuss the unique problems encountered in correcting search engine queries and outline our solutions.
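As a rough illustration of the noisy channel idea mentioned in the abstract, the sketch below picks the candidate term c that maximizes P(c) * P(typed | c). It is not the PubMed implementation: the term counts, the per-edit probabilities, the error rate, and the function names are all hypothetical placeholders for values that would be estimated from real query logs, and the channel score is left unnormalized for simplicity.

```python
from collections import Counter

# Hypothetical term counts standing in for the prior P(c).
TERM_COUNTS = Counter({"cancer": 900, "cancel": 50, "career": 50})

# Hypothetical edit-type probabilities standing in for log-derived estimates.
EDIT_PROBS = {"delete": 0.3, "insert": 0.2, "substitute": 0.3, "transpose": 0.2}
P_ANY_ERROR = 0.05  # assumed chance that a typed term contains one edit

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def single_edits(word):
    """Yield (variant, edit_type) for every string one edit away from word."""
    for i in range(len(word) + 1):
        left, right = word[:i], word[i:]
        if right:
            yield left + right[1:], "delete"
        if len(right) > 1:
            yield left + right[1] + right[0] + right[2:], "transpose"
        for ch in ALPHABET:
            if right and ch != right[0]:
                yield left + ch + right[1:], "substitute"
            yield left + ch + right, "insert"


def channel_prob(typed, intended):
    """Unnormalized P(typed | intended), allowing at most one edit."""
    if typed == intended:
        return 1.0 - P_ANY_ERROR
    best = 0.0
    for variant, edit_type in single_edits(intended):
        if variant == typed:
            best = max(best, P_ANY_ERROR * EDIT_PROBS[edit_type])
    return best


def correct(typed):
    """Return the known term maximizing P(c) * P(typed | c), else typed itself."""
    total = sum(TERM_COUNTS.values())
    best_term, best_score = typed, 0.0
    for term, count in TERM_COUNTS.items():
        score = (count / total) * channel_prob(typed, term)
        if score > best_score:
            best_term, best_score = term, score
    return best_term
```

With these made-up numbers, correct("cancr") returns "cancer", while a correctly typed but rarer term such as "cancel" is left alone because the no-error probability outweighs the more frequent neighbor.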
John Wilbur is a Senior Scientist in the Computational Biology Branch of the National Center for Biotechnology Information. He is a principal investigator leading a research group in the study and development of statistical text processing algorithms. While at NCBI he has developed the algorithm that produces PubMed related documents and the algorithm that allows fuzzy phrase matching in PubMed. Most recently his group has developed algorithms for phrase identification in natural language text that are used in NCBI's electronic textbook project and allow for easy reference from MEDLINE documents to related textbook material. He has a strong interest in machine learning and natural language processing techniques, and a focus of his current research is improving named entity recognition in the fields of molecular biology and medicine.
This talk is part of the CLIP Colloquium Series, organized by Jimmy Lin (jimmylin -at- umd .dot. edu). For the complete schedule, please visit http://www.umiacs.umd.edu/research/CLIP/colloq/.