M. Heller
Many organizations have started to publish historical documents in
electronically readable formats, preferably following XML-oriented standards.
As a result, corpora of considerable size have been compiled, for which plain,
unindexed searches are impractical because they take more time than users are
willing to tolerate.
Applying modern information retrieval techniques, in particular indexing these
corpora, seems the method of choice. Plain indexers, however, still do not make
use of the special markup structures introduced by the editors. For this reason
we have decided to implement a special algorithm developed at the Centrum für
Informations- und Sprachverarbeitung (Munich University), called
Content-Aware Data Guide (CADG), in the form of a Perl module.
This (planned) CPAN contribution is able to index large amounts of structured
document data with good performance and allows structured access to the
indexed corpus. It is the backbone technology of our project, in which we
implement a search engine for XML-encoded historical documents. The search
engine follows a modular design that is suited to running on cluster
architectures in a distributed environment and, by design, provides
near-linear scalability.
The engine is being implemented in close collaboration with the Department of
History at Munich University, where Georg Vogeler has developed an XML schema
called CEI ("Charters Encoding Initiative") for encoding historical
documents. Our engine is generic in that it supports any XML standard, but it
is specifically designed to provide a web-based GUI for structure-oriented
search access to corpora that follow the CEI encoding.
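
To illustrate what structure-oriented access means in practice, the following
minimal sketch builds an inverted index keyed by element path and word, so that
a query can be restricted to one part of the markup. It is not the CADG
algorithm and not the planned Perl module; it is written in Python for brevity,
and the element names (charter, issuer, text) and the function names are
hypothetical and are not taken from the CEI schema.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def tokens(text):
    """Split text into lowercased words, dropping surrounding punctuation."""
    for raw in (text or "").split():
        word = raw.strip(".,;:!?()\"'").lower()
        if word:
            yield word


def build_path_index(documents):
    """Return a mapping (element path, word) -> set of document ids.

    `documents` maps a document id to an XML string. Each word is filed
    under the path of the element that directly contains it.
    """
    index = defaultdict(set)

    def walk(element, parent_path, doc_id):
        path = parent_path + "/" + element.tag
        for word in tokens(element.text):
            index[(path, word)].add(doc_id)
        for child in element:
            walk(child, path, doc_id)
            # Text following a child still belongs to the enclosing element.
            for word in tokens(child.tail):
                index[(path, word)].add(doc_id)

    for doc_id, xml_text in documents.items():
        walk(ET.fromstring(xml_text), "", doc_id)
    return index


def search(index, path, word):
    """Return the sorted ids of documents containing `word` under `path`."""
    return sorted(index.get((path, word.lower()), set()))


# Two made-up documents with hypothetical element names.
DOCS = {
    "doc1": "<charter><issuer>Ludwig</issuer>"
            "<text>Ludwig grants land to the abbey.</text></charter>",
    "doc2": "<charter><issuer>Otto</issuer>"
            "<text>Otto confirms the grant made by Ludwig.</text></charter>",
}

if __name__ == "__main__":
    idx = build_path_index(DOCS)
    print(search(idx, "/charter/issuer", "Ludwig"))  # ['doc1']
    print(search(idx, "/charter/text", "Ludwig"))    # ['doc1', 'doc2']
```

A plain indexer would report both documents for the word "Ludwig"; keying the
index by path is what lets the first query return only the document in which
the word appears inside the issuer element.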