Overall Architecture and Main Components
Design Philosophy
Our prototype registry was developed to support the following preservation functions:
- Automatic identification of file formats - given a digital object, what format is it?
- Verification of a digital object's compliance with the relevant file format specification - given an object that is supposed to be of format F, is it?
- Delivery - given an object of format F, how can it be rendered?
- Transformation - given an object of format F, to what formats can it be converted?
- Risk assessment - given an object of format F, is it at risk of obsolescence?
- Characterization - given a format F, what are the representation specifications of F?
Overall Picture
The figure below provides a high-level overview of the architecture of FOCUS. It consists of three major components:
- A registry implemented with LDAP (Lightweight Directory Access Protocol) technology.
- A Web-Service Agent (WSA) that handles interactions between users and the registry and includes a format identification module.
- Supplementary components used in validation, rendering, and conversion

Registry Design
The design of FOCUS must ensure that the file format registry itself can survive technology changes, and our design choices reflect this requirement. The registry is intended to store frequently read information that must be distributed to support the functions described previously. Because directories are optimized for reading rather than writing, a directory is a well-suited physical structure for file format information, which changes infrequently. In addition, directories incorporate standard schemas that can be extended, replicated, and distributed.
The format registry in FOCUS uses LDAP directory technology, which provides a mature and widely used directory service with proven security support. We have organized the registry as a hierarchical structure with two main subtrees. The left subtree holds information about all the application software associated with any of the formats in the registry, while the right subtree contains detailed information about each format. The overall structure of our registry is shown in the figure below.

Each leaf node in the applications subtree contains detailed information about the corresponding application, including its description, the formats it supports, and its download site and/or service location. The first figure below shows the output of a request for information on a given application. Each leaf node in the formats subtree contains a description, external signature information such as the file extension, the specification, the owner, and the available software tools related to the corresponding format. A sample of such format information is shown in the second figure below.
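The two-subtree layout described above can be sketched as a small in-memory model. This is only an illustration: the distinguished names, attribute names, and sample values below are hypothetical placeholders and are not taken from the actual FOCUS schema.

```python
# Hypothetical in-memory model of the two-subtree registry layout.
# All DNs, attribute names, and values are illustrative assumptions.
APPLICATIONS = "ou=applications,dc=focus,dc=example"
FORMATS = "ou=formats,dc=focus,dc=example"

REGISTRY = {
    APPLICATIONS: {
        "cn=ExampleValidator": {
            "description": "Example validation tool",
            "formatsSupported": ["PDF", "TIFF"],
            "serviceLocation": "http://localhost/validator",
        },
    },
    FORMATS: {
        "cn=PDF": {
            "description": "Portable Document Format",
            "fileExtension": ["pdf"],
            "owner": "Example owner",
            "softwareTools": ["cn=ExampleValidator"],
        },
    },
}


def tools_for_format(format_name):
    """Follow a format leaf's tool references into the applications subtree."""
    entry = REGISTRY[FORMATS].get(f"cn={format_name}")
    if entry is None:
        return []
    apps = REGISTRY[APPLICATIONS]
    return [apps[ref] for ref in entry["softwareTools"] if ref in apps]
```

In this sketch a format leaf points to application leaves by name, so a single lookup on a format yields the service locations of the tools registered for it.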


Web-Service Agent (WSA)
WSA is a web-service that acts as a mediator between the end user and the LDAP format registry. It provides two interfaces. The first is an outgoing LDAP interface to the digital format registry, which is used to query for information on a given format and can also be used to determine potential format obsolescence. The second is a SOAP-based [16] incoming web interface that allows end users and other external components to access our registry indirectly.
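As a rough illustration of the outgoing interface, the sketch below builds an LDAP search filter string for a format lookup. The escaping follows the standard rules for LDAP filter strings (RFC 4515); the object class name fileFormat and the use of cn as the lookup attribute are assumptions made for the example, not the FOCUS schema.

```python
def escape_filter_value(value):
    """Escape characters that are special in LDAP search filters (RFC 4515)."""
    replacements = {
        "\\": r"\5c",
        "*": r"\2a",
        "(": r"\28",
        ")": r"\29",
        "\x00": r"\00",
    }
    return "".join(replacements.get(ch, ch) for ch in value)


def format_query(format_name):
    """Build a search filter for one format entry.

    'fileFormat' and 'cn' are assumed names used only for illustration.
    """
    return f"(&(objectClass=fileFormat)(cn={escape_filter_value(format_name)}))"
```

Escaping the user-supplied format name before placing it in the filter keeps a request such as "a*(b)" from being interpreted as filter syntax.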
Separate service modules can be independently developed and plugged into WSA to perform more complicated tasks. WSA currently contains an internal format identification module, a Java module called Fider, that identifies the format of a file using its internal signature (magic numbers). This method, when applicable, offers fast and reasonably accurate output. However, when a format specification does not include magic numbers, the only way to identify the format is to parse the entire file. In our current implementation, full parsing is done only to identify the US-ASCII format type. Parsing provides the most accurate identification result, but the operation is very expensive.
Fider uses a sequential algorithm: it checks magic numbers in order until a match is found. The order runs from more specific to more generic format types. Fider has an extensible architecture; it can be configured at the time of invocation to add modules that identify more file format types. At present, Fider identifies most widely used formats, including PDF, JPEG2000, GIF, TIFF, WAVE, and MP3.
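The sequential strategy described above can be sketched as follows. This is not Fider's code (Fider is a Java module); it is a minimal Python sketch that assumes a flat signature table, uses well-known magic numbers for the listed formats, and simplifies MP3 detection to files that begin with an ID3v2 tag.

```python
# Signature table ordered from more specific to more generic.
# Each entry: (format name, list of (offset, magic bytes) that must all match).
SIGNATURES = [
    ("JPEG2000", [(0, b"\x00\x00\x00\x0c\x6a\x50\x20\x20\x0d\x0a\x87\x0a")]),
    ("PDF", [(0, b"%PDF-")]),
    ("GIF", [(0, b"GIF87a")]),
    ("GIF", [(0, b"GIF89a")]),
    ("TIFF", [(0, b"II*\x00")]),   # little-endian TIFF
    ("TIFF", [(0, b"MM\x00*")]),   # big-endian TIFF
    ("WAVE", [(0, b"RIFF"), (8, b"WAVE")]),
    ("MP3", [(0, b"ID3")]),        # simplification: ID3v2-tagged files only
]


def identify(data):
    """Return the first matching format name, or None if nothing matches."""
    for name, parts in SIGNATURES:
        if all(data[off:off + len(magic)] == magic for off, magic in parts):
            return name
    # No magic number matched: fall back to scanning the whole content,
    # which is the expensive path reserved here for US-ASCII.
    if data and all(byte < 0x80 for byte in data):
        return "US-ASCII"
    return None
```

Because the table is walked in order, a PDF file (whose header is itself ASCII text) is reported as PDF before the generic US-ASCII scan is ever reached.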
The main advantage of WSA is that it can provide custom-tailored services to meet the end user's more complicated needs. For example, a client can contact WSA through its SOAP-based web interface to identify a file and obtain information on its format, provided the format is supported. The client simply submits the digital file to WSA. Upon receiving the file, WSA identifies its format using the format identification module and queries the registry, behind the scenes, for information on the identified format type. Moreover, WSA makes the FOCUS system more flexible. From the system's perspective, if new operational requirements arise in the future, new service modules can be implemented independently and easily plugged into WSA.
Another significant advantage of WSA, as with other web applications, is the ease with which it can be updated. Existing services can be modified and new services created without affecting any client component, which eliminates the need to redistribute client-side components after each update. WSA therefore offers advantages to both system administrators and clients.
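The plug-in idea can be illustrated with a toy dispatcher. The class name, method names, and the two sample modules below are hypothetical stand-ins chosen for the example; the actual WSA exposes its services through SOAP rather than direct function calls.

```python
class WebServiceAgent:
    """Toy stand-in for WSA: dispatches requests to pluggable service modules."""

    def __init__(self):
        self._modules = {}

    def register(self, name, handler):
        # A new service module is added without touching existing ones.
        self._modules[name] = handler

    def handle(self, name, payload):
        if name not in self._modules:
            raise KeyError(f"no service module named {name!r}")
        return self._modules[name](payload)


def identify_and_describe(agent, data):
    """Composite service: identify the file, then look up its format."""
    fmt = agent.handle("identify", data)
    return fmt, agent.handle("describe", fmt)


# Hypothetical modules: a one-format identifier and a one-entry registry lookup.
wsa = WebServiceAgent()
wsa.register("identify", lambda data: "PDF" if data.startswith(b"%PDF-") else None)
wsa.register("describe", lambda fmt: {"PDF": "Portable Document Format"}.get(fmt))
```

The client calls only the composite service; the identification step and the registry lookup stay hidden behind it.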
Supplementary Software Components
Once the format is tentatively identified, WSA connects to the FOCUS registry via LDAP to gather information on the validation software available for that format. If the registered software is available as a web-service, the client contacts the validation web-service for this file format to verify that the tentative identification is correct. We currently make available a validation service through JHOVE, developed by JSTOR and Harvard University Library. We have implemented JHOVE as a web-service locally and included the service location as well as the JHOVE module documentation in our format registry. Afterwards, the registry can be consulted for available (and reliable) conversion, rendering, or emulation services for that particular format. Any of these services can be invoked as necessary.
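The identify-then-validate sequence can be sketched as a short pipeline. The function and status names are hypothetical, the registry lookup is replaced by a plain callable, and the sample validator is a toy check that does not represent JHOVE's behaviour.

```python
def confirm_format(data, identify, find_validator):
    """Return (format, status).

    status is 'unidentified', 'unverified', 'validated', or 'rejected'.
    """
    fmt = identify(data)                 # tentative identification
    if fmt is None:
        return None, "unidentified"
    validator = find_validator(fmt)      # stands in for the registry lookup
    if validator is None:
        return fmt, "unverified"         # no validation service registered
    status = "validated" if validator(data) else "rejected"
    return fmt, status


# Toy stand-ins, assumed for illustration only.
def toy_identify(data):
    return "PDF" if data.startswith(b"%PDF-") else None


def toy_find_validator(fmt):
    if fmt == "PDF":
        return lambda data: b"%%EOF" in data   # crude check, not real validation
    return None
```

Keeping the tentative result separate from the validation status lets a caller decide whether an unverified identification is good enough before requesting conversion or rendering services.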