Temporal language models for the disclosure of historical text

F. de Jong


Historical and heritage collections consist to a large extent of
text and may incorporate diverse text types such as journals, archival
documents, and catalogue descriptions. Due to the historical distance,
access to this content is not straightforward. Historical variants of
languages are often more complex than modern variants due to less
standardized spelling, the effects of ongoing language change, and
different word (de)compounding principles. In addition, more words are
ambiguous because one or more meaning shifts may have occurred. Common
full-text search tools can be applied successfully only by users who are
able to formulate queries with (a) knowledge of historical language and
(b) insight into the relevant time span from which the words have evolved.
This paper describes search technology that may compensate for these
linguistic obstacles by linking contemporary search terms to their
historical equivalents. For this purpose, statistical language models
will be applied that support the automatic detection of word
similarities and ambiguities obscured by language evolution and usage,
and that allow the 'dating' of a text. This involves building temporal
profiles of words as longitudinal sections in a reference corpus and
temporal language models as cross sections. The approach can be seen as
a step in the direction of a diachronic WordNet. Detailed examples
will be presented of the added value of this approach for both the
accessibility of historical content and the detection of language change
in relatively recent corpora from the news domain. Experimental
retrieval results will be included.
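
The idea of temporal language models as cross sections and temporal
word profiles as longitudinal sections can be illustrated with a minimal
sketch. It is not the method of the paper: it assumes plain unigram
models with add-alpha smoothing, whitespace tokenization, and a tiny
invented two-period corpus, and all names (build_models, date_text,
temporal_profile, the period labels) are hypothetical. A text is 'dated'
by picking the period whose model gives it the highest likelihood.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is good enough for a sketch.
    return text.lower().split()


def build_models(corpus_by_period):
    """Cross sections: one unigram count table per time period."""
    return {
        period: Counter(tok for doc in docs for tok in tokenize(doc))
        for period, docs in corpus_by_period.items()
    }


def log_likelihood(tokens, counts, vocab_size, alpha=1.0):
    """Log-probability of the tokens under an add-alpha smoothed unigram model."""
    total = sum(counts.values())
    return sum(
        math.log((counts[t] + alpha) / (total + alpha * vocab_size))
        for t in tokens
    )


def date_text(text, models):
    """Return the period whose language model best explains the text."""
    vocab = set().union(*models.values())
    vocab_size = len(vocab) + 1  # one extra slot for unseen words
    tokens = tokenize(text)
    return max(
        models,
        key=lambda p: log_likelihood(tokens, models[p], vocab_size),
    )


def temporal_profile(word, models):
    """Longitudinal section: relative frequency of one word in each period."""
    return {p: c[word] / sum(c.values()) for p, c in models.items()}


# Invented toy reference corpus; the period labels are illustrative only.
corpus = {
    "1900s": [
        "the telegraph office sent the wire",
        "a carriage waited by the telegraph office",
    ],
    "2000s": [
        "the website sent an email",
        "an email arrived from the website server",
    ],
}
models = build_models(corpus)
guess = date_text("send an email to the website", models)
profile = temporal_profile("telegraph", models)
```

Here guess holds the best-matching period label for the query text, and
profile maps each period to the relative frequency of "telegraph". A
realistic setup would need a large dated reference corpus, finer time
slices, and a better smoothing scheme.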

 


Last modified: 16-09-2005 08:48