A syntactically-based preprocessor for a limited experimental Arabic document retrieval system
Abstract
This thesis describes and discusses an experimental document retrieval
system for Arabic texts that uses linguistic methods of analysis.
Specifically, Arabic
presents difficulties for the efficient retrieval of information because
it is an agglutinative language, thus rendering the stop list method (as
commonly used for English texts) near to useless.
The system has two stages: the creation of the retrieval lexicon and the
search program. The latter is done using a limited on-line search
which allows for partial matching. The former has four stages. Texts in
the form of abstracts are processed by morphological analysis, syntactic
analysis, term extraction and term manipulation modules. Each stage
produces a new representation of the source text. The morphological
analyser attempts to recognise any prefixes and/or suffixes attached to
the words in the corpus being processed. It also assigns grammatical
labels specifying the part of speech using a contextual analysis of
individual words (assuming that the inflectional features of a word are
indicative of its syntactic role). An augmented transition network
grammar and parser have been built for this purpose. The same parser has
been developed and used in the second stage which is syntactic analysis.
It takes as its input the representation of the text created by the
morphological analysis, and uses a separate grammar file defined as a
recursive transition network. The aim of syntactic analysis is the
definition of the relations of the different constituents in the
individual sentences being processed. The information added by the
morphological and syntactic analysers is used in the term extraction
module. This module uses a traversal algorithm to negotiate the
structure built by syntax, utilising a set of rules, kept on a file,
specifying the type of constructs needing to be selected. The
manipulative module generates new entries for each term selected by
rotating its elements.
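The rotation step in the manipulative module can be illustrated with a minimal sketch. The abstract does not specify the rotation scheme, so the code below assumes a simple cyclic rotation of a term's words, in the manner of a permuted index. It is written in Python for brevity rather than in the Pascal of the actual implementation. The function names `rotate_term` and `lexicon_entries` and the English sample term are hypothetical and are not taken from the thesis.

```python
def rotate_term(words):
    """Return every cyclic rotation of a list of words, original order first.

    Assumption: 'rotating its elements' means cyclic rotation; the thesis
    may use a different scheme.
    """
    return [words[i:] + words[:i] for i in range(len(words))]


def lexicon_entries(term):
    """Build one lexicon entry string per rotation of a multi-word term."""
    words = term.split()  # split on whitespace
    return [" ".join(rotation) for rotation in rotate_term(words)]


# Illustrative English term; the real system processes Arabic terms.
entries = lexicon_entries("document retrieval system")
for entry in entries:
    print(entry)
```

Under this assumption a term of n words yields n entries, so it can be reached in the lexicon through any of its component words.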
The system has been implemented using the Hull V-mode Pascal compiler
available on the L.U.T. Prime System. It has been tested using 40
abstracts selected from a conference proceedings in the field of
computer applications. The results obtained were encouraging,
particularly in the identification of affixes (a success rate of 89% for
suffixes and 85% for prefixes) and in the identification of syntactic
categories (a success rate of 98% for nouns and 79% for verbs). These
figures have strongly influenced the identification of the syntagmatic
relations underlying the word order in the sentence.
The research concludes that although successful results have been
obtained without the aid of a pre-constructed Arabic lexicon, many
errors can be avoided by the inclusion of small dictionaries. The
thesis identifies the particular cases concerned.