|
The RDF EXtractor (T-REX for short), being developed at LCCD, will work as follows when complete.
- A user will be able to specify a schema of interest. For instance, a user interested in terror events can specify a schema that includes the type of event, where it took place, the perpetrators, the number of fatalities, the number of injuries, and so on.
- The T-REX system will find all "instances" of such schemas and fill in as many of the schema attributes as possible by browsing millions of web pages and extracting the relevant information from each page. Instances will be represented in RDF, a World Wide Web Consortium (W3C) standard for expressing subject-property-object relationships. These relationships naturally form a labelled graph in which each relationship is an edge from a "subject" node to an "object" node, labelled with the given property.
- The T-REX system will store all such data in a massive annotated database management system that will allow users to visualize the graph generated by the RDF data and to query this information in a variety of ways.
- T-REX will handle documents not just in English but in a wide variety of languages, including Indo-European languages, Romance languages, and languages of special interest such as Urdu and Arabic.
- At this time (Oct. 5, 2006), T-REX has preliminary algorithms that achieve the first three goals above. The fourth, multilingual support, will be addressed in the coming year.
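
To make the schema-to-graph idea above concrete, here is a minimal Python sketch of how a filled-in schema instance could be turned into subject-property-object triples, grouped into a labelled graph, and queried by pattern. It is an illustration only and not T-REX's actual data model or code: the schema attribute names, the `event:1` identifier, the helper functions, and the sample values are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    """One subject-property-object relationship."""
    subject: str
    prop: str
    obj: str


# Hypothetical schema for terror events (attribute names are assumptions).
TERROR_EVENT_SCHEMA = ("event_type", "location", "perpetrator",
                       "fatalities", "injuries")


def instance_to_triples(instance_id, schema, values):
    """Emit one triple per schema attribute that was actually filled in."""
    return [
        Triple(instance_id, attr, str(values[attr]))
        for attr in schema
        if values.get(attr) is not None
    ]


def build_graph(triples):
    """Group triples into a labelled graph: subject -> [(property, object)]."""
    graph = defaultdict(list)
    for t in triples:
        graph[t.subject].append((t.prop, t.obj))
    return dict(graph)


def match(triples, subject: Optional[str] = None,
          prop: Optional[str] = None, obj: Optional[str] = None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (prop is None or t.prop == prop)
        and (obj is None or t.obj == obj)
    ]


# Made-up extraction result: only some attributes could be filled in.
extracted = {"event_type": "bombing", "location": "ExampleCity", "fatalities": 3}

triples = instance_to_triples("event:1", TERROR_EVENT_SCHEMA, extracted)
graph = build_graph(triples)

for prop, obj in graph["event:1"]:
    print(f"event:1 --{prop}--> {obj}")
```

Attributes the extractor could not fill in (here `perpetrator` and `injuries`) simply produce no triple, which mirrors the "as many attributes as possible" behaviour described above. A production system would presumably use full RDF identifiers (URIs) and a dedicated triple store rather than in-memory lists.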

|