So far, most experiments in cross-language querying in digital libraries have employed a multilingual lexicon of some sort. As mentioned in the previous section, general purpose electronic dictionaries are generally inadequate for this purpose as they tend to lack the necessary technical vocabulary. The disadvantages of multilingual thesauri include the fact that they are expensive to construct, they need continual maintenance and updating as new terms enter the vocabulary, and they require the use of a highly controlled vocabulary, which puts a heavy constraint on searching. On the other hand, the problem with most corpus-based cross-language systems is that acquiring a suitable set of relevant documents on which to train the retrieval system is extremely resource-intensive.
At Pisa, we are now working on the design of a cross-language query system for a digital library containing documents in more than one language. We propose to implement a system which will integrate a dictionary/thesaurus-type search with a corpus-based strategy in which the corpus is extracted from the collection of documents contained in the digital library itself. The aim is to be able to match a query formulated in one language against documents stored in other languages, even when the query terms themselves are not included in the multilingual lexicon. With this approach, we hope to overcome some of the problems listed above. The implementation of the system is dependent on a number of the components of an integrated set of tools for mono- and bilingual lexicon and text processing, known as the PiSystem, which has been developed in Pisa.
Our corpus-based strategy is based on the concept of comparable corpora. Comparable corpora are sets of texts in pairs (or multiples) of languages with the same communicative function, i.e. generally on the same topic or domain. We first began to analyse corpora of this type from a linguistic perspective; they are sources of natural language lexical equivalences across languages and as such can provide much useful data for contrastive language studies. For this reason, language scholars are now beginning to acquire such collections of texts for particular linguistic or terminological studies. However, a digital library system which contains document archives on the same domain but in different languages is actually a real world implementation of the comparable corpora principle. We thus decided to adapt our comparable text system to meet the requirements of a multilingual digital library.
The corpus query system is based on the assumptions that (i) words acquire sense from their context, and (ii) words used in a similar way throughout a sub-language or special domain corpus will be semantically similar. It follows that, if it is possible to establish equivalences between several items contained in two different contexts, there is a high probability that the two contexts themselves are to some extent similar. We thus use lexical and linguistic knowledge extracted from a domain-specific corpus in one language and project it onto a comparable corpus in the other, i.e. given a particular term or set of terms in the texts in one language (L1), the aim is to be able to identify contexts which contain equivalent or related expressions in the texts of the other (L2). To do this, we attempt to isolate the vocabulary related to that term in the L1 corpus -- hypothesising that lexically equivalent terms will be associated with a similar vocabulary in L2. Below, we give just a brief outline of how the system operates. For a more complete description, see Picchi and Peters (1996)[18].
For any term of interest, T, the system automatically constructs a context window containing T and up to 'n' lexically significant words (nouns, verbs and adjectives can be accepted) to the right and left of T; the value for 'n' can be varied. For each of these co-occurrences of T, morphological procedures identify the source lemma(s). The significance of the correlation between these items and T is then calculated using a statistical procedure. We are currently using Church and Hanks' Mutual Information Index (1990)[19], although we are also testing a different measure based on the likelihood ratio as formulated by Dunning (1993)[20]. The set of most significant collocates derived makes up the vocabulary, V1, that is considered to characterize our term T in this particular subdomain corpus. To exemplify what we mean by this, Figure 1 shows the 20 most significant collocates found in a set of comparable English and Italian documents for two Italian nouns: assistenza (assistance) and accordo (agreement). In the figure, the first column shows the MI value, and the number associated with the collocate gives its frequency value, i.e. the number of times the collocate was found in a context window where n=5.
420 assistenza (assistance) | 801 accordo (agreement)
MI | frequency|lemma(s) | MI | frequency|lemma(s)
500.000 | 420|ASSISTENZA | 500.000 | 801|ACCORDO|ACCORDARE
10.756 | 4|MUTUA|MUTUARE | 10.518 | 37|MULTIFIBRE
10.488 | 19|MEDICARE | 10.408 | 5|INTERSTATALE
10.445 | 87|TECNICA|TECNICO | 9.992 | 3|SCATURIRE
9.886 | 9|UMANITARIO | 9.603 | 4|INTERINALE
9.347 | 3|PRESTARE | 8.554 | 22|CONCLUSO
9.326 | 12|LEGALE | 8.483 | 7|STIPULARE
8.903 | 20|FINANZIARIA | 7.784 | 4|INTERISTITUZIONALE
8.367 | 21|SANITARIO | 7.747 | 10|RAGGIUNGERE
7.541 | 32|FORNIRE | 7.161 | 8|FIRMARE
7.145 | 4|DIRIGERE | 7.113 | 7|SPAZIO|SPAZIARE
7.120 | 3|RIFUGIARE|RIFUGIATO | 6.926 | 4|RATIFICARE
6.122 | 3|PROFUGO | 6.793 | 10|ATTO
5.949 | 3|CONCEDERE | 6.726 | 12|DERIVARE
5.853 | 7|ALIMENTARE | 6.576 | 7|POLITICO
5.784 | 11|SETTORE | 6.573 | 3|ACCIAIO
5.439 | 3|FINANZIARE | 6.539 | 10|BASARE
5.439 | 3|PROPRIARE | 6.287 | 3|CONCLUDERE
5.218 | 6|PROGRAMMA|PROGRAMMARE | 6.179 | 23|COMUNA|COMUNE
5.139 | 3|RUOLO | 6.146 | 15|NUOVO
4.934 | 4|DESTINARE | 6.141 | 3|ENTRARE
Figure 1: Significant collocates for assistenza and accordo
It is important to stress that these lists give words identified as the significant collocates for the two terms in this particular corpus; if the same terms appear in a corpus for a different type of sub-language, we would expect to find different collocates. Looking at these lists, it can be seen that there is not a lot of noise; most of the terms given have a strong semantic relationship with the term being examined. For example, with assistenza we have associated adjectives meaning "sanitary", "legal", "financial", "humanitarian", and verbs such as "provide", "take refuge in", "give (help to)", and with accordo we find verbs such as "reach", "stipulate", "sign", "ratify", "conclude" and nouns like "act" or "document" (our test corpus has been extracted from a series of parliamentary debates). When there is more than one source lemma, all are listed.
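The collocate extraction described above can be sketched as follows. This is only a minimal illustration, not the PiSystem implementation: it assumes that the corpus has already been lemmatised and reduced to lexically significant words, that context windows do not cross sentence boundaries, and that collocates seen fewer than `min_freq` times (a hypothetical parameter) are discarded. The score is the mutual information of Church and Hanks, I(x,y) = log2(P(x,y) / (P(x)P(y))), estimated from raw counts.

```python
import math
from collections import Counter


def collocates_by_mi(sentences, term, n=5, min_freq=3):
    """Rank the lemmas found within n positions of `term` by mutual information.

    `sentences` is a list of lists of lemmas (assumed input format).
    Returns (mi, co-occurrence frequency, lemma) tuples, best first.
    """
    lemma_freq = Counter()  # f(y): corpus frequency of each lemma
    pair_freq = Counter()   # f(T, y): co-occurrences of y with the term
    total = 0               # N: number of lemma tokens in the corpus

    for sent in sentences:
        lemma_freq.update(sent)
        total += len(sent)
        for i, lemma in enumerate(sent):
            if lemma != term:
                continue
            # Up to n words to the left and n words to the right of the term.
            window = sent[max(0, i - n):i] + sent[i + 1:i + 1 + n]
            pair_freq.update(w for w in window if w != term)

    scored = []
    for lemma, f_xy in pair_freq.items():
        if f_xy < min_freq:
            continue
        # log2( (f_xy / N) / ((f_x / N) * (f_y / N)) )
        mi = math.log2(f_xy * total / (lemma_freq[term] * lemma_freq[lemma]))
        scored.append((mi, f_xy, lemma))
    return sorted(scored, reverse=True)
```

The top-ranked tuples returned for a term would play the role of the vocabulary V1; replacing the scoring line with a likelihood-ratio statistic would give the alternative measure mentioned above.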
Next, using our lexical resources (e.g. English/Italian morphological procedures, a bilingual lexical database), we construct an L2 vocabulary of translation equivalents (V2). Words or expressions that can be considered lexically equivalent to our selected term in the L1 texts are then sought in the L2 corpus: we search for those contexts in L2 in which there is a significant presence of the L2 vocabulary for T. The significance is determined by a statistical procedure that assesses the probability that different sets of L2 co-occurrences represent lexically equivalent contexts for T. The L2 contexts retrieved are written to a file and listed in descending order of relevance to our L1 term.
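As a rough sketch of the first of these steps, the construction of V2 can be pictured as a dictionary lookup over V1. The function below is an assumption-laden illustration rather than a description of our procedure: the bilingual dictionary is a plain mapping with hypothetical entries, each translation simply inherits the MI value of its source lemma (keeping the highest value when several lemmas share a translation), and direct translations of T receive the arbitrarily high value of 500 seen at the top of Figure 1.

```python
DIRECT_TRANSLATION_MI = 500.0  # arbitrarily high weight for translations of T


def build_v2(term, v1, bilingual):
    """Map the L1 collocate vocabulary V1 to a weighted L2 vocabulary V2.

    `v1` maps L1 lemmas to MI values; `bilingual` maps an L1 lemma to a
    list of L2 translations (both are assumed data structures).
    """
    v2 = {}
    for lemma, mi in v1.items():
        for translation in bilingual.get(lemma, []):
            # Assumption: a translation inherits the best MI of its sources.
            v2[translation] = max(mi, v2.get(translation, float("-inf")))
    for translation in bilingual.get(term, []):
        v2[translation] = DIRECT_TRANSLATION_MI
    return v2


# MI values taken from Figure 1; the dictionary entries are hypothetical.
v1 = {"firmare": 7.161, "ratificare": 6.926}
bilingual = {
    "accordo": ["agreement"],
    "firmare": ["sign"],
    "ratificare": ["ratify"],
}
v2 = build_v2("accordo", v1, bilingual)
```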
Figures 2.1 and 2.2 show examples of comparable contexts that have been found in the L2 corpus (English) for our two terms: assistenza and accordo, as characterized by the L1 corpus. The contexts are listed in descending order of the number of items from the V2 vocabulary that they contain (first column) and then of the sum of the MI values associated with those items (second column); the third column gives the sum of their frequency values, and the fourth gives the ranking of the context in the list of results. Direct translations of the term being searched are assigned an arbitrarily high MI value and thus, for the same number of V2 items, are listed before contexts which do not contain direct translations of the term. For example, in the set of contexts for accordo in Figure 2.1, contexts 2-5 include translation equivalents of accordo, whereas contexts 6-10 do not, although they each contain the same number of V2 items. It can be seen that contexts 6-10 still reflect the concept represented lexically by accordo even though they do not contain direct dictionary-derived translations.
Search for Comparable Contexts for ACCORDO
V2 items | Sum of MI | Sum of frequencies | Rank | Context and document reference
6 | 522.726 | 828 | 1) | Commission *proposal* on transitional *arrangements* in *respect* of the *international* *textile* *agreement* 1. How does the Commission =FE"FXAC93086ENC.0035.01.00".11 |
5 | 520.973 | 827 | 2) | the territory of a Member State illicitly. The *Council* *reached* a *political* *agreement* on these two *proposals* at its meeting on =FE"FXAC93297ENC.0010.01.00".40 |
5 | 519.782 | 825 | 3) | in its proposal for two-year transitional *arrangements* in *respect* of the *international* *textile* *agreement* (uplift or maintenance =FE"FXAC93086ENC.0035.01.00".13 |
5 | 518.143 | 823 | 4) | make this possible, the central European *countries* will *apply* Community competition *rules*. Europe *Agreements*, *signed* but not yet ratified =FE"FXAC93145ENC.0016.01.00".33 |
5 | 517.224 | 839 | 5) | The reform of the *common* agricultural *policy* which was *agreed* by the *Council* of *Ministers* will have a major impact on both the economic =FE"FXAC93099ENC.0019.01.00".13 |
5 | 25.304 | 46 | 6) | Therefore, all efforts which have to be made in *order* to *achieve* the *common* *objective* of an *area* without internal borders have to be intensified. =FE"FXAC93032ENC.0014.01.00".25 |
5 | 24.386 | 25 | 7) | by successive meetings of the General Affairs *Council*. The *understanding* *reached* with the US *Trade* *Representative* on public procurement =FE"FXAC93264ENC.0034.02.00".48 |
5 | 22.262 | 43 | 8) | recalls that the Treaty on European Union *foresees* that asylum *policy* should become a *matter* of *common* *interest* and, in a separate statement =FE"FXAC93101ENC.0039.02.00".26 |
5 | 22.188 | 35 | 9) | Cooperation between sportsmen and women from *different* *countries* does a great *deal* to promote *international* *understanding*. Particularly for ... =FE"FXAC93145ENC.0043.03.00".18 |
5 | 21.295 | 23 | 10) | GATT, making it possible to agree on negotiated *rules* to *clarify* the *issues* arising in the *international* *trade* and environment interface. =FE "FXAC93283ENC.0051.01.00".34 |
Figure 2.1: DBT (Comparable Corpus) - English Texts
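The ordering used in Figure 2.1 can be reproduced with a simple two-part sort key. The sketch below is illustrative only: it assumes that each L2 context is available as a list of lemmas, that V2 is a mapping from L2 lemmas to MI values, and that each V2 item is counted once per context; the names are hypothetical.

```python
def rank_contexts(contexts, v2):
    """Order L2 contexts by (number of V2 items, sum of their MI values).

    Returns (item count, MI sum, context) tuples, best first; contexts
    with no V2 item are dropped.
    """
    scored = []
    for ctx in contexts:
        hits = {lemma for lemma in ctx if lemma in v2}  # distinct V2 items
        if hits:
            scored.append((len(hits), sum(v2[h] for h in hits), ctx))
    # Python's sort is stable, so ties keep their original corpus order.
    scored.sort(key=lambda s: (s[0], s[1]), reverse=True)
    return scored


# Hypothetical V2 for accordo; 500.0 marks the direct translation.
v2 = {"agreement": 500.0, "sign": 7.161, "ratify": 6.926, "reach": 7.747}
contexts = [
    ["council", "reach", "understanding"],
    ["agreement", "sign"],
    ["sign", "ratify"],
    ["weather"],
]
ranked = rank_contexts(contexts, v2)
```

Because a direct translation carries a very large MI value, a context containing it comes before any other context with the same number of V2 items, which is the behaviour visible in contexts 2-5 versus 6-10.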
In order to test the system, when retrieving the second set of contexts, given in Figure 2.2 for assistenza, we eliminated the direct translations given by our bilingual electronic dictionary ("assistance" and "aid") from the L2 vocabulary for assistenza. However, in a number of cases (see contexts number 3 and 58, 59, 60) we still retrieve contexts that do contain these direct translations of assistenza, which suggests that the system is performing well. For reasons of space, we show just the first five results in order of ranking, and then numbers 57-60.
Search for Comparable Contexts for ASSISTENZA
V2 items | Sum of MI | Sum of frequencies | Rank | Context and document reference
4 | 25.653 | 45 | 1) | the planning of return-home programmes, and are *leading* *roles* encouraged for women in *food* distributions in *refugee* camps?) \NOT\(1) Source: =FE "FXAC93333ENC.0013.01.00".18 |
4 | 25.615 | 60 | 2) | Measures (SPS) texts, which specifically *address* measures taken for the *protection* of *health*, the *environment* or the consumer. The Commission ..... =FE "FXAC93333ENC.0003.03.00".23 |
4 | 22.390 | 23 | 3) | Assistance in the form of Community loans and *grants* under the structural *fund* *programmes* operating in the *area*, notably the Regional Operational ... =FE"FXAC93283ENC.0017.01.00".34 |
4 | 20.364 | 16 | 4) | through the Structural *Funds*, *programmes* developed to *protect* the *environment*. .. =FE "FXAC93065ENC.0011.01.00".51 |
4 | 20.364 | 16 | 5) | biological depuration at Lixourion within a *funding* *programme* so as to *protect* the natural *environment* in the Gulf of Argostolion? 3. Will it provide =FE "FXAC93095ENC.0003.01.00".23 |
3 | 19.659 | 31 | 57) | establishing a European food industry training *fund* to facilitate public *health* and consumer confidence in the *food* industry and which would ensure ... =FE "FXAC93065ENC.0022.01.00".13 |
3 | 19.369 | 38 | 58) | assistance for better public administration *planning* and coordination (including the *health* *sector*). In addition, both the nutritional and sanitation .... =FE "FXAC93016ENC.0025.01.00".60 |
3 | 19.369 | 38 | 59) | The Commission is currently implementing major *programmes* on AIDS in the *areas* of public *health*, research and assistance to developing countries. =FE "FXAC93137ENC.0005.02.00".30 |
3 | 19.369 | 38 | 60) | humanitarian aid, welfare-related projects and *programmes* in *areas* such as *health* and education, and projects and programmes for rural development. =FE"FXAC93283ENC.0028.01.00".31 |
Figure 2.2: DBT (Comparable Corpus) - English Texts
Discussion
This approach to the problem of identifying cross-language lexical equivalences over homogeneous sets of texts for different languages has several merits: it allows us to disambiguate, to a considerable extent, both the L1 term being searched and the target language terms provided by the dictionary; it permits us to retrieve lexically equivalent cross-language expressions even when the L2 context does not contain a dictionary-derived translation of the L1 term; and it provides a ranking of our results.
Query Term Disambiguation: Although the problem of polysemy is greatly reduced in a domain-specific corpus, it is still present -- to a varying degree depending on the type of texts being treated. The construction of the L1 vocabulary which characterizes our term T will permit us to obtain a clustering of the most relevant terms connected to T. If the corpus contains a predominant sense for the term then the vocabulary should represent this sense -- secondary senses that appear rarely will not cause a representative vocabulary of collocates to be constructed. If, in the corpus, there is more than one relevant sense for T then we would expect two or more distinct clusterings of significant collocates. For example, the Italian noun accordo has two distinct senses in our bilingual dictionary: the general sense which is translated by "agreement", and the very specific musical sense translated by "chord". Our corpus of parliamentary debates contained no examples of the second sense. However, if it had done, we would expect to obtain two distinct clusterings of significant collocates with little or no overlap. Thus, using this method, it is possible to distinguish between common technical terms which are used with different meanings in different scientific areas. Think, for example, of the different usages of "protocol" in the medical and software engineering domains. Very different sets of collocates would be constructed for the different senses of this term and thus searching for the appropriate sense would be facilitated.
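One simple way to picture such clusterings is to group the significant collocates of T according to the context windows they share. The sketch below is an illustration under stated assumptions, not a procedure we have implemented: two collocates are linked whenever they appear in the same window of T, and each connected group is read as one candidate sense. The musical collocates (nota, suonare) are invented for the example, since our corpus contains no instance of that sense.

```python
from collections import defaultdict


def sense_clusters(windows, collocates):
    """Group significant collocates that share at least one context window.

    `windows` is a list of context windows of T (lists of lemmas);
    `collocates` is the set of significant collocates of T.
    """
    neighbours = defaultdict(set)
    for window in windows:
        present = [w for w in window if w in collocates]
        for w in present:
            neighbours[w].update(present)

    clusters, seen = [], set()
    for start in sorted(neighbours):
        if start in seen:
            continue
        # Depth-first walk over collocates linked through shared windows.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


windows = [
    ["firmare", "ratificare"],
    ["ratificare", "stipulare"],
    ["nota", "suonare"],  # invented windows for the musical sense
]
collocates = {"firmare", "ratificare", "stipulare", "nota", "suonare"}
clusters = sense_clusters(windows, collocates)
```

On real data a single shared window would be enough to merge two senses, so a minimum number of shared windows would probably be needed before linking two collocates.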
Target Term Disambiguation: When constructing the L2 vocabulary of significant collocates for the L1 term being searched, our procedure takes as input all the translation equivalents listed in the bilingual dictionary, regardless of sense distinctions. Spurious or inappropriate translations are eliminated by the fact that we normally do not find them together with a significant number of items from the L2 vocabulary for the term being searched. This makes it possible for us to perform a sense disambiguation on the target terms proposed. For example, if we examine all the occurrences of the Italian noun sicurezza in our parliamentary corpus, we find that the sense is that of "safety", or "security" (one sense of "security" is a synonym of "safety"). This is confirmed by the set of significant collocates for this term; the top ten are the Italian equivalents of toy, hygiene, reactor, health, nuclear, maritime, council, road, provisions, Euratom. The bilingual dictionary gives us four separate senses for sicurezza translated by safety, security, certainty, confidence. On the English side of the corpus, we find 17 occurrences of "confidence" and just one of "certainty". However, the context for "certainty" does not appear in the list of comparable contexts for sicurezza as it contains no other L2 vocabulary items; and the contexts for "confidence" are ranked very low as they never contain more than two L2 significant collocates for sicurezza. Thus, our approach helps us to identify the correct sense of the target terms offered by the bilingual dictionary and to provide a ranking of the best L2 matches for the L1 term searched.
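The filtering effect described above can be illustrated by counting, for each candidate translation, the largest number of L2 significant collocates found alongside it in any one context. This is a minimal sketch: the collocate set is the English top ten quoted above, but the contexts are invented for the example and the function name is hypothetical.

```python
def best_support(candidate, contexts, v2_collocates):
    """Largest number of V2 collocates co-occurring with `candidate`."""
    return max(
        (len(set(ctx) & v2_collocates) for ctx in contexts if candidate in ctx),
        default=0,  # candidate never occurs in the L2 contexts
    )


v2_collocates = {
    "toy", "hygiene", "reactor", "health", "nuclear",
    "maritime", "council", "road", "provisions", "Euratom",
}
# Invented contexts, given as lists of lemmas.
contexts = [
    ["nuclear", "reactor", "safety", "provisions"],
    ["road", "safety", "council"],
    ["consumer", "confidence", "health"],
    ["legal", "certainty"],
]
candidates = ["safety", "security", "certainty", "confidence"]
ranking = sorted(
    candidates,
    key=lambda c: best_support(c, contexts, v2_collocates),
    reverse=True,
)
```

Candidates whose best context contains few or no collocates fall to the bottom of the ranking, mirroring the treatment of "confidence" and "certainty" in our corpus.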