A Novel Transliteration Approach in an English-Arabic Cross Language Information Retrieval System, pp. 229-242
Authors: Ghita Amor-Tijani and Abdelghani Bellaachia, Department of Computer Science, The George Washington University, Washington DC
Abstract: One of the main issues facing Cross Language Information Retrieval (CLIR) is the handling of untranslatable words, i.e., words not found in dictionaries, which are usually referred to as Out Of Vocabulary (OOV) words. Bilingual dictionaries generally do not cover most proper nouns (e.g., names of places, people, and countries), which constitute a large proportion of OOV words. Because proper nouns are often primary keys in a query, translating them correctly is often necessary to maintain good retrieval performance. Since proper nouns are spelling variants of each other in most languages, an approximate string matching technique is usually applied against the target database index to find the target language correspondents of the original query key. Among approximate string matching techniques, the n-gram technique has proven to be the most effective. A more complicated issue arises when the languages involved have different alphabets. The usual approach is then transliteration, which is based on phonetic similarities between the languages involved. However, transliteration by itself cannot guarantee that the transliterated word is spelled exactly as it appears in the document collection: a transliterated word can be spelled in a variety of ways, despite any conventions that might exist. Because there is no single correct spelling of a transliterated word, a technique is needed that can generate the different spellings found in the document collection. In this study, we combine transliteration with the n-gram technique in an English-Arabic CLIR system, in which Arabic documents are searched using English queries. We evaluated the effectiveness of this approach and compared it with other transliteration approaches. Experimental results showed that our transliteration approach improves retrieval over existing approaches.
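The idea of matching a transliterated query key against the spellings that actually occur in the target index can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes character bigrams padded with one boundary space and the Dice coefficient as the similarity measure, neither of which is specified in the abstract. The function names, the candidate transliteration, and the index terms are all hypothetical.

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams of a word.

    Assumption: the word is padded with one space on each side so that
    its first and last letters also take part in boundary n-grams.
    """
    padded = f" {word} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def dice_similarity(a, b, n=2):
    """Dice coefficient over the n-gram sets of two words (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def best_matches(candidate, index_terms, n=2, top_k=3):
    """Rank index terms by n-gram similarity to a transliteration candidate."""
    scored = [(dice_similarity(candidate, term, n), term) for term in index_terms]
    # Highest score first; ties are broken by the term itself for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top_k]


# Hypothetical example: one rough transliteration of "Clinton" and a toy
# index containing two attested spellings plus an unrelated term.
candidate = "كلينتن"
index_terms = ["كلينتون", "كلنتون", "لندن"]
ranked = best_matches(candidate, index_terms)

for score, term in ranked:
    print(f"{score:.3f}  {term}")
```

In this sketch the transliteration step only has to produce a spelling that is close to the indexed forms. The n-gram ranking then recovers the variants present in the collection, which can be added to the target language query.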