Entity resolution in graph data

TitleEntity resolution in graph data
Publication TypeReports
Year of Publication2005
AuthorsBhattacharya I, Getoor L
Date Published2005///
InstitutionTechnical report, University of Maryland, College Park
Abstract

The goal of entity resolution is to reconcile data references corresponding to the same realworld entity. Here we introduce the problem of entity resolution in graphs, where the nodes
are the references in the data and the hyper-edges represent the relations that are observed
to hold between the references. The goal then is to reconstruct a 'cleaned' entity graph
that captures the relations among the true underlying entities from the reference graph.
This is an important first step in any graph mining process; mining an unresolved graph
will be inefficient and result in inaccurate conclusions. We also motivate collective entity
resolution in graphs where references sharing hyper-edges are resolved jointly, as opposed to
independent pair-wise resolution of the references. We illustrate the problem of graph-based
entity resolution in bibliographic datasets. We discuss several interesting issues such as mul-
tiple entity types, local and global resolution and different kinds of graph-based evidence.
We formulate the graph-based entity resolution problem as an unsupervised clustering task,
where each cluster represents references that map to the same entity, and the similarity
measure between two clusters incorporates the similarity of the references attributes and,
more interestingly, the similarity between their relations. We explore two different measures
of relational similarity. One approach, which we call 'edge detail similarity', explicitly com-
pares the individual edges that each cluster participates in, but is expensive to compute. A
less computationally intensive alternative is measuring 'neighborhood similarity', which only
compares the multi-set of neighboring clusters for each cluster. We perform an extensive
empirical evaluation of the two relational similarity measures for author resolution using co-
author relations in two real bibliographic datasets. We show that both similarity measures
improve performance over unsupervised algorithms that consider only reference attributes.
We also describe an efficient implementation and show that these algorithms scale gracefully
with increasing size of the data.