Database and Multi-database Design
- Geoffrey Fox gfc@npac.syr.edu
- Bill Braithwaite Bill%TROPIQ@VAXF.Colorado.edu
- Alton Brantley bab2@mhg.edu
- Marina Chen mcchen@cs.bu.edu
- Jerry Cox jrc@hobbs.wustl.edu
- Dave O'Hallaron Dave.Ohallaron@n2.sp.cmu.edu
- Judy Ozbolt ozbolt@virginia.edu
- Robert Sanders bobs@aplcomm.jhuapl.edu
- Paul Silverstein pauls@aplcomm.jhuapl.edu
Introduction
Currently there is not much medical data available in a form where
it can be used for large-scale comparative analyses. However, we expect
this to change as current practice improves and we move from computer
databases largely oriented towards billing to databases aimed at
recording and improving the patient's care and health. We expect this
new data to be encoded in a reasonably uniform fashion using the
standard vocabularies developed by the industry and medical informatics
community with the National Library of Medicine.
The pervasive
preparation of such patient data will allow one to aggregate the records
meaningfully and to compare individual records with averages or, more generally,
with average templates (care maps) formed from subsets of the data.
The databases will be provided by many vendors and use many
different internal technologies. However, information interchange will be
enabled by the use of standard interface and transport protocols. National
standards will probably require federal mandates, but one can nevertheless
expect standardization within extensive hospital groupings.
These would be large HMOs, consortia including those of Health Science
centers, and Medicare and Medicaid providers. These individual datasets
will be large enough to support significant comparative analysis activities
whose complexity and size will demand HPCC technologies.
A set of individual patient records will be the basis of this care
oriented dataset. This will then generate other databases such as those
used in billing. Further, one would link the patient database with auxiliary
databases used to define such items as hospital facilities and procedures.
We anticipate some 100 million, and eventually many more, patient records.
For example, a full database size of 10 terabytes corresponds to 100
text pages of information for each of 100 million patients (one text page
is about a kilobyte in size). The databases will have growing amounts of
image and video information which will imply major storage and
processing requirements. Note that we expect that the information and
entertainment component of the future National Information Infrastructure,
whose size will be measured in petabytes, will largely consist of video
data. This can be contrasted with medical databases which will probably
consist of a relatively larger fraction of images and text.
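The 10 terabyte figure above can be checked with a few lines of arithmetic. This is only a back-of-envelope sketch: the constant and function names are illustrative, and "about a kilobyte" is taken here as exactly 1,000 bytes.

```python
# Back-of-envelope check of the text-only database size quoted above.
# Assumption: one text page is exactly 1,000 bytes ("about a kilobyte").
BYTES_PER_PAGE = 1_000
PAGES_PER_PATIENT = 100
PATIENTS = 100_000_000


def database_size_bytes(patients=PATIENTS,
                        pages_per_patient=PAGES_PER_PATIENT,
                        bytes_per_page=BYTES_PER_PAGE):
    """Total size in bytes of a text-only patient database."""
    return patients * pages_per_patient * bytes_per_page


size_tb = database_size_bytes() / 1e12  # 1 terabyte = 10**12 bytes
print(f"{size_tb:.0f} TB")  # prints "10 TB"
```

Image and video content is not included in this estimate, so it is a lower bound on the storage the text anticipates.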
These large scale national databases with uniform patient oriented
medical data will be produced in an evolutionary fashion. We can expect
them to become a significant component of medical information systems
over the next ten years. Hence they appear as a very suitable target for
NSF basic research leading to deployable systems on that time scale.
Health Care Applications
Imagine a set of
longitudinal patient records recording the care and health of every patient.
These records will be most interesting for analysis if they are large and of
national scope so as to allow statistically meaningful comparative analysis
for clinical care as well as for health science and clinical research. A
critical application of this database is the identification of medically
distinct models or templates for diagnosis and care maps. Although there
are important unresolved policy and practice issues on data sharing,
it is likely that either federal mandates or natural medical
collaboration will generate such major medical databases of national
scope. This data, although accessible in a uniform fashion, will be stored
nationwide in geographically distinct distributed heterogeneous
databases residing on a rich variety of computers. This distributed
heterogeneous characteristic is shared with the general National
Information Infrastructure. However, a key complicating feature of medical
data is the richness of the internal structure and the correlations between
features both internal to and between records.
Medical Informatics Research Issues
- Functionalities needed in the Preparation of the Distributed Medical Database
- Patient records are very complex and subject to data entry error.
Quality control in the data entry function is very important and will be
aided by an automatic comparison of each new entry with the
expectations of suitable averages from the existing national database.
This will flag a possibly incorrect entry and allow operator correction.
- As patients move around, it will be very helpful to build a
complete medical record for each individual by computer-aided matching of
the record fragments stored in the different databases of what are usually
different care providers. This function has interesting analogies with the
matching of genome strands now done during the creation of the Human Genome
database.
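The data-entry check described above, comparing each new entry with expectations drawn from existing data, might be sketched as follows. This is one possible approach rather than a method given in this report: the function name, the z-score rule, the threshold of 3 and the sample readings are all assumptions made for illustration.

```python
from statistics import mean, stdev


def is_suspect(new_value, reference_values, z_threshold=3.0):
    """Return True if new_value lies far from the reference average.

    Hypothetical rule: flag any entry more than z_threshold sample
    standard deviations away from the mean of comparable entries.
    """
    mu = mean(reference_values)
    sigma = stdev(reference_values)
    if sigma == 0:
        # All reference values are identical; any different value is suspect.
        return new_value != mu
    return abs(new_value - mu) / sigma > z_threshold


# Made-up systolic blood pressure readings standing in for the
# "suitable averages from the existing national database".
reference = [118, 122, 130, 115, 125, 121, 119, 127]

print(is_suspect(1200, reference))  # a likely keying error for 120
print(is_suspect(124, reference))   # an unremarkable reading
```

A flagged entry would be returned to the operator for confirmation or correction rather than rejected outright, since a genuine but unusual value is also possible.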
- Typical Functionalities needed in the Use and Analysis of the Distributed Medical Database
- Detection of anomalies or outlying entries in either individual
records or, more usually, dynamic groupings of records. This function can
be used in approaches to fraud detection and the identification of
epidemics, as in the "Colorado Vignette" presented at the meeting. The
fraud detection capability has already been demonstrated by Booz Allen and
Hamilton, and similar techniques can be applied to credit card and other
financial fraud.
- Segmentation of medical data into typical models or templates.
This is analogous to the market segmentation problem, where HPCC is
already being used by the mail order industry.
- Comparison of individual patients with templates to aid
diagnosis and establish canonical care maps.
- Analysis of the effectiveness of particular care plans, including
the results of deliberate or accidental deviation from the recommended
care.
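As an illustration of comparing an individual patient with templates, the sketch below picks the template closest to a sparse patient record. Everything in it is assumed for the example: the attribute names, the template values, and the choice of mean squared difference over shared attributes as the distance. A real system would at least rescale attributes that are measured in different units.

```python
def distance(record, template):
    """Mean squared difference over the attributes both dicts define.

    Records are sparse, so only shared attributes are compared. If no
    attribute is shared, the distance is treated as infinite.
    """
    shared = record.keys() & template.keys()
    if not shared:
        return float("inf")
    return sum((record[k] - template[k]) ** 2 for k in shared) / len(shared)


def closest_template(record, templates):
    """Name of the template with the smallest distance to the record."""
    return min(templates, key=lambda name: distance(record, templates[name]))


# Hypothetical templates: average attribute values for two patient groups.
templates = {
    "template_a": {"systolic_bp": 120.0, "glucose": 90.0},
    "template_b": {"systolic_bp": 150.0, "glucose": 160.0, "bmi": 31.0},
}
patient = {"systolic_bp": 148.0, "glucose": 155.0}

print(closest_template(patient, templates))  # prints "template_b"
```

The same distance could also serve anomaly detection: a record that is far from every template is a candidate outlier.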
HPCC Issues in Database Systems and Datamining
The patient record database scenario and the selection of
interesting analysis functionalities suggest many important computer
science issues. Many of these would be best tackled by a collaboration
between computer science and medical informatics researchers.
- The architecture and systems issues for the distributed, dynamic, sparse, irregular patient database.
- Information is continuously updated;
each patient has a sparse selection of the many possible attributes, and
the different records are very heterogeneous or irregular. Study of the
relevance of relational, object and hybrid database structures is included
here.
- Aggregation of the distributed database for global studies
- There is a range of choices. At one extreme is pure Web
technology, with knowledge agents accessing the database, which itself
stays put in distributed form. The other extreme involves occasional
collection of all needed information into a central aggregated database,
which is then mined. Intermediate solutions correspond to generalizations
of caching strategies familiar in parallel and distributed computing.
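The two extremes can be contrasted with a toy computation of a global average. In this sketch, which uses made-up site names and values, the distributed version moves only a small summary (count and sum) from each site, while the centralized version first copies every value to one place. Both give the same answer for this simple statistic; more complex analyses may not decompose so neatly.

```python
# Hypothetical per-site measurements; in practice each list would live
# in a separate database at a separate care provider.
sites = {
    "site_a": [5.0, 7.0, 6.0],
    "site_b": [9.0],
    "site_c": [4.0, 8.0],
}


def local_summary(values):
    """Computed at the site: only (count, sum) leaves the local database."""
    return len(values), sum(values)


def global_mean_from_summaries(summaries):
    """Distributed extreme: combine the small per-site summaries."""
    total_count = sum(count for count, _ in summaries)
    total_sum = sum(subtotal for _, subtotal in summaries)
    return total_sum / total_count


def global_mean_centralized(site_data):
    """Centralized extreme: gather every value, then compute."""
    all_values = [v for values in site_data.values() for v in values]
    return sum(all_values) / len(all_values)


summaries = [local_summary(values) for values in sites.values()]
print(global_mean_from_summaries(summaries))  # prints 6.5
print(global_mean_centralized(sites))         # prints 6.5
```

An intermediate, cache-like strategy would keep such summaries at a central point and refresh them only when a site's data changes.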
- Security and privacy
- These are critical issues in many parts of the
health care problem and indeed the full National Information
Infrastructure. Datamining and extraction of templates raise a special
need: namely, the protection of individuals whose medical data is used to
form a rare template from which it may be possible to identify the
contributing individual records.
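One simple safeguard for rare templates, which this report does not propose and which is shown only as an assumption, is to withhold any template built from fewer than some minimum number of records. The threshold of 5 and all names below are hypothetical, and a threshold alone would not guarantee privacy.

```python
def releasable_templates(template_sizes, min_records=5):
    """Keep only templates built from at least min_records records.

    template_sizes maps a template name to the number of patient
    records that contributed to it. Rare templates are withheld
    because their contributors may be identifiable.
    """
    return {name: n for name, n in template_sizes.items() if n >= min_records}


# Hypothetical counts of contributing records per template.
sizes = {"common_template": 12000, "rare_template": 2}

print(releasable_templates(sizes))  # prints {'common_template': 12000}
```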
- Benchmarks and collaborations
- Although there are important general research issues, we
believe it would be very helpful to identify some benchmark problems.
These can be used to quantify and motivate the proposed health care database
research. This use of "real" medical records will be needed even for the
generic precompetitive systems we expect to be built as part of NSF
research programs. Further, we note that some of the proposed research areas
would benefit from linking to other (national challenge) areas. For instance,
the distributed database system and searching issues have much in
common with the challenges of building the World Wide Web or more
generally the National Information Infrastructure. Several of the
datamining tools are applicable to a wide range of business problems.