Interactive Generalization of a Translation Example Using Queries Based on a Semantic Hierarchy

Yasuhiro Akiba    Hiromi Nakaiwa    Satoshi Shirai+    Yoshifumi Ooyama

NTT Communication Science Laboratories
2-4, Hikaridai, Seika-cho, Soraku-gun, Kyoto, 619-0237, JAPAN
{akiba,nakaiwa,shirai,ooyama}@cslab.kecl.ntt.co.jp


Abstract

This paper addresses the issue of acquiring translation rules for machine translation (MT) systems that adopt a transfer approach. These rules are semantic pattern pairs (SPPs) of source and target languages. Practical MT systems must additionally contain a huge number of SPPs corresponding to rarely-used predicates and predicate usages. Such SPPs are difficult to automatically acquire with corpus-based methods [1,2,10,11]. To solve this difficulty, this paper proposes a method to acquire SPPs by using queries based on a semantic hierarchy. The proposed method asks the lexicographer for the necessary information in order to generalize the conditions of SPPs and then gradually generalizes these conditions. Experimental results show that the proposed method allows the acquisition of more plausible conditions within almost the same time spent for manual generalization.



[ In Proceedings of ICTAI00, pp.326-332 (November, 2000). ]



+ He is affiliated with ATR Spoken Language Translation Research Labs. Address: 2-4, Hikaridai, Seika-cho, Soraku-gun, Kyoto, 619-0237, JAPAN. E-mail: shirai@itl.atr.co.jp


INDEX

1 Introduction
2 Acquisition Task
3 Proposed Method
  3.1 Generating Sentences for Queries
  3.2 Presenting Queries and Getting Answers
  3.3 Approaches for searching for an appropriate category
4 Experimental Work
5 Conclusion
  References



1 Introduction

The World Wide Web (WWW) has become very popular throughout the world. Due to this popularity, more and more people have come to realize that the opportunity is growing for them to read documents written in unfamiliar languages. Accordingly, there are higher hopes for more efficient machine translation (MT) systems.

One approach to building MT systems is the semantic-based transfer approach [4,6,12]. This approach focuses on the collocation between verbs and nouns. Semantic-based transfer MT systems provide if-then rules (called semantic pattern pairs or SPPs), whose if-parts and then-parts are collocation patterns in the source language (SL) and the target language (TL), respectively.

Figure 1 shows a simplified example [5] of an SPP in a Japanese-English MT system, ALT-J/E [6]. The if-part is a Japanese pattern, and the then-part is its English equivalent. This SPP indicates that if the Japanese verb is yaku, its subject noun is categorized as People, and its object noun is categorized as Bread or Confectionery, then the corresponding English verb is bake and its subject and object are the English translations of the Japanese subject and object, respectively. Each slot such as J-SUBJ or J-OBJ contains the meanings of nouns (semantic categories) like People, Bread, or Confectionery (footnote 1).

IF     J-VERB = yaku
       J-SUBJ = N1:[People]
       J-OBJ  = N2:[Bread or Confectionery]
THEN   E-SUBJ = N1
       E-VERB = bake
       E-OBJ  = N2

Figure 1. SPP for Japanese verb yaku.
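To make the if-then structure of Figure 1 concrete, the following is a minimal sketch of how such a rule could be represented and matched against a hierarchy. It is an illustration only, not the ALT-J/E implementation: the class SPP, the helper is_a, and the small PARENT table (including the Food and Fish nodes) are hypothetical names and data introduced here.

```python
from dataclasses import dataclass

# Hypothetical fragment of a semantic hierarchy: child category -> parent category.
# The real ALT-J/E hierarchy has about 2,800 categories; these links are assumptions.
PARENT = {
    "Male": "People",
    "Female": "People",
    "Confectionery": "Food",
    "Bread": "Food",
    "Fish": "Food",
}


def is_a(category, ancestor):
    """Return True if category equals ancestor or lies below it in PARENT."""
    while category is not None:
        if category == ancestor:
            return True
        category = PARENT.get(category)
    return False


@dataclass
class SPP:
    j_verb: str   # source-language verb (if-part)
    slots: dict   # slot name -> list of allowed semantic categories
    e_verb: str   # target-language verb (then-part)

    def matches(self, verb, noun_categories):
        """Check the if-part: same verb, and every slot filled by an allowed category."""
        if verb != self.j_verb:
            return False
        return all(
            any(is_a(noun_categories.get(slot), allowed) for allowed in allowed_list)
            for slot, allowed_list in self.slots.items()
        )


# The rule of Figure 1.
yaku_bake = SPP(
    j_verb="yaku",
    slots={"J-SUBJ": ["People"], "J-OBJ": ["Bread", "Confectionery"]},
    e_verb="bake",
)

print(yaku_bake.matches("yaku", {"J-SUBJ": "Male", "J-OBJ": "Confectionery"}))
print(yaku_bake.matches("yaku", {"J-SUBJ": "Male", "J-OBJ": "Fish"}))
```

Under these toy assumptions the first call satisfies the if-part and the second does not, because Fish is neither Bread nor Confectionery nor a descendant of either.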

ALT-J/E has about 2,800 semantic categories [5], which constitute a hierarchy with a maximum depth of 12 (Figure 2). Each of about 400,000 nouns in the ALT-J/E lexicon has one or more semantic categories as its meanings.

Practical MT systems based on a semantic-based transfer approach must additionally contain a huge number of SPPs corresponding to rarely-used predicates and predicate usages. In the case of ALT-J/E, Shirai et al. [9] reported that about 10,000 SPPs, corresponding to rarely-used predicates and predicate usages, have to be generated for ALT-J/E to cover nearly all of the predicates in Japanese.

The existing approaches to acquire SPPs include corpus-based approaches [1,2,10,11]. As Church indicates in [3], many possible co-occurrences cannot be observed even in a very large corpus. That is, a sufficient number of illustrative sentences cannot be prepared for the above kinds of SPPs. These approaches are, therefore, of limited applicability.

Figure 2. Upper levels of the semantic hierarchy in ALT-J/E.

To date, the above kinds of SPPs have been acquired by manual approaches like those in [9]. The task of generating SPPs involves inputting the appropriate (neither over-general nor over-specific) semantic categories into each slot. It is, however, difficult for lexicographers to manually generate SPPs for the following reasons.

Large number of candidates:
Searching for appropriate categories requires a lot of effort on the part of lexicographers. This is because the number of candidates for appropriate categories ranges from several thousands [5] to several tens of thousands [7], depending on the semantic hierarchy used.

Necessity of experience:
To specify appropriate semantic categories, lexicographers must be very familiar with all of the semantic categories and the lexicon of an MT system.

Different quality of generalization:
Lexicographers treat minor translation examples differently according to their own generalization standards. Some lexicographers specify more general semantic categories so that minor examples also meet them. Other lexicographers ignore minor examples.

To overcome the difficulty of generating SPPs and the sheer quantity of SPPs, methods to support lexicographers are required.

This paper proposes a new method that searches for appropriate semantic categories to be inputted into each slot by using queries based on a semantic hierarchy. The proposed method adopts three different approaches to searching for appropriate semantic categories. At each search point, these approaches generate translation examples and ask two questions: whether the noun corresponding to the target slot can be semantically collocated with the verb of the SL side in the acquired SPP, and whether that noun can also give a context where the most plausible equivalent for the verb of the SL side is the verb of the TL side. The only task that lexicographers have to do is to answer each query. Consequently, the resulting categories follow a specific generalization standard.

The authors experimentally evaluated the proposed method by acquiring the semantic categories for ten SPPs [5] with ALT-J/E. Experimental results showed that the proposed method was able to acquire, in five cases, the same categories as, or in the other cases, more plausible categories than, those specified manually within almost the same time spent for the manual generalization.

The next section describes the acquisition task in this study. The proposed method is presented in Section 3. Experimental results are shown and a discussion is provided in Section 4. Finally, our conclusions are presented in Section 5.




2 Acquisition Task

This section characterizes the appropriate semantic categories that the proposed method should input into each slot in SPPs and describes the acquisition task that the proposed method handles.

Let us assume that lexicographers attempt to acquire the appropriate semantic categories for the slots of the SPP shown in Figure 1 while they derive (1) as a sample sentence that meets the conditions of the acquired SPP.

   (1)   Taro-ga   appurupai-wo   yaku
taro-SUBJ apple pies-OBJ bake-PRESENT
'Taro bakes apple pies'

where the nouns such as Taro "a typical first name of Japanese males" for the J-SUBJ slot or appurupai "apple pies" for the J-OBJ slot are called the sample nouns, and the semantic categories of the sample nouns are called the sample categories. Let the sample categories of Taro and appurupai be Male and Confectionery, respectively.

To investigate what kinds of linguistic phenomena can be observed when semantic categories different from the sample categories are specified for the slots, let us replace one of the sample nouns with another noun as follows.

   (2)   Meari-ga   appurupai-wo   yaku
mary-SUBJ apple pies-OBJ bake-PRESENT
'Mary bakes apple pies'

   (3)   Mangetsu-ga   appurupai-wo   yaku
a full moon-SUBJ apple pies-OBJ bake-PRESENT
'A full moon bakes apple pies'

   (4)   Taro-ga   sake-wo   yaku
taro-SUBJ a salmon-OBJ grill-PRESENT
'Taro grills a salmon'

In the case of (2), the sentence is natural as a Japanese sentence, so the substituted noun Meari "Mary" can be semantically collocated with yaku "bake". In the case of (3), the situation described by the sentence Mangetsu-ga appurupai-wo yaku is not realistic and cannot normally happen. In this sense, Mangetsu "a full moon" cannot be semantically collocated with yaku. In the case of (4), the sentence is natural as a Japanese sentence, so the substituted noun sake "a salmon" can be semantically collocated with yaku. However, the most plausible equivalent for the Japanese verb yaku is then grill rather than bake. This is because, although the source sentence gives no definite clue for selecting the English equivalent, the most plausible situation it describes is that Taro grills a salmon, as long as one takes account of the culture of the SL, in this case Japanese culture.

As seen above, the following linguistic phenomena can be observed through the replacement of nouns:

(P1)  Some nouns (or semantic categories) can be semantically collocated with the verb of the SL side in the acquired SPP.
(P2) Out of all of the semantically collocated nouns (or semantic categories), some nouns (or semantic categories) can give a context where the most plausible equivalent for the verb of the SL side is the target verb of the TL side.

These phenomena are good indicators for finding the appropriate categories.

As the semantic category specified for the slot being processed by lexicographers (the target slot) is changed step by step from the sample category for the target slot toward the root of the semantic hierarchy, the number of nouns able to meet the acquired SPP gradually increases. Let Ci denote the i-th semantic category on the path from the root of the semantic hierarchy to the sample category for the target slot (Figure 3). When Ci is specified for the target slot instead of Ci+1, the ratio of the collocated nouns in (P1) to the additionally covered nouns is called the acceptable rate of Ci; it is related to the occurrence probabilities of the SL sentences generated by the replacement. The ratio of the nouns that can give the context in (P2) to all collocated nouns in (P1) is called the translatable rate of Ci (Figure 4); it is related to the probabilities that such substituted nouns can give the context in (P2). Because both rates should be sufficiently large for an appropriate semantic category specified for the target slot, let us focus on the lower thresholds of both rates and call them the minimal acceptable rate and the minimal translatable rate, respectively. A semantic category whose acceptable rate is at least the minimal acceptable rate and whose translatable rate is at least the minimal translatable rate is called an ok-category. A category that is not an ok-category is called an ng-category.
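The two rates and the ok/ng decision can be sketched in a few lines. This is a minimal illustration, assuming each additionally covered noun has been judged for (P1) and, if collocated, for (P2); the function names rates and is_ok_category are hypothetical, and returning 0.0 for an empty denominator is an assumption made here, not something the paper specifies. The thresholds 0.02 and 0.80 are the values reported for the experiment in Section 4.

```python
def rates(judgements):
    """Compute (acceptable rate, translatable rate) for one search point Ci.

    judgements holds one (collocated, gives_context) pair of booleans per
    additionally covered noun: collocated is the (P1) judgement, and
    gives_context is the (P2) judgement, meaningful only when collocated is True.
    """
    covered = len(judgements)
    collocated = sum(1 for p1, _ in judgements if p1)
    translatable = sum(1 for p1, p2 in judgements if p1 and p2)
    # Assumption: an empty denominator yields a rate of 0.0.
    acceptable_rate = collocated / covered if covered else 0.0
    translatable_rate = translatable / collocated if collocated else 0.0
    return acceptable_rate, translatable_rate


def is_ok_category(judgements, min_acceptable, min_translatable):
    """A category is ok when both rates reach their minimal thresholds."""
    acceptable_rate, translatable_rate = rates(judgements)
    return acceptable_rate >= min_acceptable and translatable_rate >= min_translatable


# Judgements mirroring sentences (2), (3), and (4):
# Meari: collocated and translatable; Mangetsu: not collocated;
# sake: collocated, but "grill" is more plausible than "bake".
toy = [(True, True), (False, False), (True, False)]
print(rates(toy))
print(is_ok_category(toy, min_acceptable=0.02, min_translatable=0.80))
```

With these three toy judgements, two of three nouns are collocated and one of those two is translatable, so the translatable rate of one half falls below the 80% threshold and the category would count as an ng-category.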

Figure 3. Proposed method: M denotes the depth of the sample category for the target slot.

Figure 4. Acceptable rate and translatable rate of Ci

This paper characterizes each appropriate semantic category as an ok-category that is located at the highest level on the semantic hierarchy when the minimal acceptable rate and the minimal translatable rate are given.

Consequently, given the minimal acceptable rate and the minimal translatable rate, the acquisition task that the proposed method handles is to search for the highest ok-category for the target slot of the SPP acquired by the method (Figure 5).

Input:   (A) the skeleton of a sentence that meets the acquired SPP,
         (B) a combination of nouns such that, for each slot in the if-part of the SPP, one of the nouns should meet the condition of the slot (the sample nouns),
         (C) the semantic category for each sample noun (the sample categories).
Output:  the highest ok-category among the semantic categories on the path between the root of the semantic hierarchy and the sample category for the target slot.

Figure 5. Acquisition task

For example, to make the proposed method acquire the appropriate semantic category for the J-SUBJ slot of the SPP shown in Figure 1, the inputted skeleton of the sentence that meets the acquired SPP is N1-ga N2-wo yaku "N1 bake N2", where Ni is a variable. Taro and Male are inputted as the sample noun for the J-SUBJ slot and the semantic category, respectively. At the same time, appurupai and Confectionery are inputted as the sample noun for the J-OBJ slot and the semantic category, respectively. Lexicographers can easily input this information. The output is the highest ok-category in the eight semantic categories on the path between the root in Figure 2, Anything (C1), and the leaf, Male (C8).




3 Proposed Method

This section describes the proposed method that adopts three approaches to search for the highest ok-category described in Section 2. The approaches use the same strategy for generating queries and presenting them to lexicographers in order to estimate the acceptable rate and the translatable rate of the current search point.




3.1 Generating Sentences for Queries

When the current search point is Ci, the approaches for searching generate sentences in the following way: (i) initially generate a sentence by filling each variable Ni in the skeleton with the corresponding sample noun; (ii) then generate further sentences by replacing the sample noun for the target slot with other nouns in Clusteri, which hereafter denotes the set of nouns categorized as Ci or descendants of Ci but not categorized as Ci+1 or descendants of Ci+1. For example, assume that, in order to acquire the SPP shown in Figure 1, the input to the proposed method is the same as presented at the end of Section 2: N1-ga N2-wo yaku as the skeleton, Taro and appurupai as the sample nouns, and Male and Confectionery as the sample categories. Assume also that the target slot is J-SUBJ and that the current search point is C2 (Concrete in Figure 2). Then the substituted nouns are categorized as either Places or Object or as descendants of them.

The approaches use the generated sentences to estimate the acceptable rate and the translatable rate of the current search point Ci. Two issues arise when both rates must be estimated within sufficiently small errors from only a limited number of generated sentences: how to select the nouns substituted for the sample noun, and how to estimate both rates.
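The definition of Clusteri amounts to a set difference over subtrees of the hierarchy. The sketch below illustrates it on the C2 example, under stated assumptions: the CHILDREN table covers only the categories named in this section (the sibling of Agents and Object that is subdivided into Natural, Regions, and Facilities is written Places), the LEXICON entries and their category assignments are toy data invented for the illustration, and a noun with several categories is excluded as soon as any of them falls under Ci+1, which is one literal reading of the definition.

```python
# Fragment of the hierarchy around C2 = Concrete; only nodes named in the text.
CHILDREN = {
    "Concrete": ["Agents", "Places", "Object"],
    "Places": ["Natural", "Regions", "Facilities"],
    "Object": ["Animate", "Inanimate"],
}


def subtree(category):
    """Return the category together with all of its descendants."""
    result = {category}
    for child in CHILDREN.get(category, []):
        result |= subtree(child)
    return result


def cluster(lexicon, c_i, c_next):
    """Nouns categorized under c_i but not under c_next (Ci+1 on the search path)."""
    inside = subtree(c_i)
    excluded = subtree(c_next) if c_next is not None else set()
    return {
        noun
        for noun, categories in lexicon.items()
        if categories & inside and not (categories & excluded)
    }


# Toy lexicon: noun -> set of semantic categories (hypothetical assignments).
LEXICON = {
    "Taro": {"Agents"},
    "mangetsu": {"Natural"},
    "mise": {"Facilities"},
    "sake": {"Animate"},
}

# Cluster2: everything under Concrete except what lies under C3 = Agents.
print(sorted(cluster(LEXICON, "Concrete", "Agents")))
```

Passing None as c_next covers the sample category CM itself, which has no deeper category on the path to exclude.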

To resolve the two issues, the approaches employ stratified sampling [8], a sampling survey technique in statistics, as follows. Clusteri is separated into subsets of nouns, from each of which some substituted nouns are collected; the number of substituted nouns drawn from each subset is decided according to the total number of nouns in the subset, as will be seen later. Then, the acceptable rate of Ci is estimated as the weighted average of the acceptable rates for the subsets, where the weight for each subset is its number of substituted nouns. For example, assume that Clusteri is separated into two subsets, that the ratio of the substituted nouns for the subsets is 3:1, and that the acceptable rates for the subsets are 100% and 50%, respectively. Then, the acceptable rate of Ci is estimated as (100*3 + 50*1)/(3 + 1) = 87.5%. The translatable rate of Ci is estimated in the same way.
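The weighted average used for the estimate can be written out directly. The following is a small sketch; the function name stratified_rate and the (sample size, rate) pair representation are choices made here for illustration.

```python
def stratified_rate(strata):
    """Weighted average of per-subset rates, weighted by sample size.

    strata is a list of (sample_size, rate_in_percent) pairs, one per subset.
    """
    total = sum(size for size, _ in strata)
    if total == 0:
        raise ValueError("no substituted nouns were sampled")
    return sum(size * rate for size, rate in strata) / total


# The example from the text: sample sizes in ratio 3:1, rates 100% and 50%.
print(stratified_rate([(3, 100.0), (1, 50.0)]))  # 87.5
```

The same function applies to the translatable rate when the per-subset translatable rates are passed in, provided the weights are chosen as the text describes.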

Stratified sampling does not itself provide a way to separate Clusteri. The proposed method, therefore, adopts an original technique for separating Clusteri into subsets, which are decided step by step as follows. For example, suppose that the total number of substituted nouns in Cluster2 is 40 under the same input as seen at the beginning of this section. First, allocate the number of substituted nouns (the sample size) in Cluster2 among all of the siblings of C3 (Agents in Figure 2) according to the ratio of the number (footnote 2) of leaves that are descendants of each sibling of C3, for example, 3:1. In this example, the sample sizes for Places and Object become 30 and 10, respectively. For each sibling, allocate the sample size of the sibling among all of its children in the same way, unless the sample size is too small, for example, equal to or less than 15. If the ratio of the number of leaf-level descendants of Natural, Regions, and Facilities is 2:1:3, then the sample sizes (footnote 3) for Natural, Regions, and Facilities become 10, 5, and 15. Since the sample size of Object is less than 15, it is not allocated among its children, Animate and Inanimate. After this, for each child, allocate the sample size of the child among all of its own children recursively, until the sample size is too small. Let S(i, j) (1 <= j <= Li) hereafter denote the semantic categories whose sample sizes are not allocated to their children. In the example, L2 = 4 and S(2, j) (1 <= j <= L2) correspond to Natural, Regions, Facilities, and Object.

S(i, j) (1 <= j <= Li) are used as the subsets for the stratified sampling. The substituted nouns are selected from S(i, j) or descendants of S(i, j) in order of frequency of use. The number of selected nouns is the sample size for S(i, j).
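The recursive allocation of sample sizes can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, the LEAF_COUNT values are invented so that the leaf ratios match the 3:1 and 2:1:3 ratios of the example, the sibling subdivided into Natural, Regions, and Facilities is written Places, and fractional shares are rounded up as footnote 3 describes.

```python
# Fragment of the hierarchy below C2 = Concrete, excluding C3 = Agents.
CHILDREN = {
    "Places": ["Natural", "Regions", "Facilities"],
    "Object": ["Animate", "Inanimate"],
}

# Hypothetical numbers of leaf-level descendants, chosen to give the ratios
# 2:1:3 under Places and 3:1 between Places and Object.
LEAF_COUNT = {"Natural": 2, "Regions": 1, "Facilities": 3, "Animate": 1, "Inanimate": 1}


def leaves_under(category):
    """Number of leaf-level descendants used as the allocation weight."""
    children = CHILDREN.get(category)
    if not children:
        return LEAF_COUNT.get(category, 1)
    return sum(leaves_under(child) for child in children)


def allocate(categories, sample_size, threshold):
    """Split sample_size among categories in proportion to their leaf counts.

    A category whose share exceeds threshold and that has children is split
    again among those children; otherwise it becomes one stratum S(i, j).
    """
    weights = [leaves_under(category) for category in categories]
    total = sum(weights)
    strata = {}
    for category, weight in zip(categories, weights):
        # Integer ceiling division: fractional shares are rounded up.
        share = (sample_size * weight + total - 1) // total
        children = CHILDREN.get(category)
        if share > threshold and children:
            strata.update(allocate(children, share, threshold))
        else:
            strata[category] = share
    return strata


# Cluster2 example: 40 substituted nouns, lower threshold 15.
print(allocate(["Places", "Object"], 40, 15))
# {'Natural': 10, 'Regions': 5, 'Facilities': 15, 'Object': 10}
```

Because of the rounding up, the strata can sum to slightly more than the requested sample size when the shares are not whole numbers; in this example they sum to exactly 40.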




3.2 Presenting Queries and Getting Answers

For each S(i, j) (1 <= j <= Li), the approaches for searching simultaneously present the generated sentences to lexicographers. This simultaneous presentation prevents the lexicographers from misunderstanding the meanings of the substituted nouns. Since all of the substituted nouns in the presented sentences are categorized in a certain semantic category, the lexicographers can easily guess the correct meaning of a substituted noun.

The lexicographers judge whether each substituted noun can be semantically collocated with the verb of the SL side in the acquired SPP (Q1). If and only if this answer is positive, they also judge whether the substituted noun can give a context where the most plausible equivalent for the verb of the SL side is the verb of the TL side (Q2). The lexicographers make their determinations by answering queries: (Q1) and (Q2). For example, as explained in detail in Section 2, they answer positive to both (Q1) and (Q2) for (2) and negative to (Q1) for (3). Moreover, they answer positive to (Q1) and negative to (Q2) for (4).




3.3 Approaches for searching for an appropriate category

The three approaches for searching differ in the order in which they search for an appropriate semantic category (Figure 3). The first two approaches, the Bottom-up approach and the Top-down approach, correspond to a linear search. The last approach, the Dichotomy approach, corresponds to a binary search. For convenience in the following explanation, let us define LM = 1 and S(M, 1) = CM.

The Bottom-up approach applies the above query strategy to each semantic category in reverse order of depth, CM , CM-1, ... . When an ng-category is found, this approach stops searching and outputs the latest ok-category.

The Top-down approach applies the above query strategy to each semantic category in order of depth, C1, C2, ... . When an ok-category is found, this approach stops searching and outputs that ok-category.

The Dichotomy approach initially applies the above query strategy to the leaf and the root in order. Next, this approach prepares a candidate list, (C1, C2, ... , Ci , ... , CM ) and applies the above query strategy to the semantic category in the middle of the candidate list or to the lower semantic category closest to the middle if a precisely central semantic category does not exist. According to whether the semantic category is an ok-category or not, the approach revises the candidate list in the same way as a binary search does. Then the above procedure by using the updated candidate list is repeated. Consequently, the first and last elements of the candidate list are always an ng-category and an ok-category, respectively, after the root and leaf semantic categories are processed. When the length of the candidate list becomes 2, the approach stops searching and outputs the last element.
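The three search orders can be sketched as follows. This is a minimal illustration under stated assumptions: path is the list (C1, ..., CM) from root to sample category, is_ok is a hypothetical callback standing in for the whole query round of Sections 3.1 and 3.2, "the lower semantic category closest to the middle" is read as the deeper of the two central categories, and the handling of an ok root or an ng leaf in the Dichotomy approach is an assumption, since the text does not cover those cases.

```python
def bottom_up(path, is_ok):
    """Walk from the sample category CM toward the root; stop at the first ng-category."""
    latest_ok = None
    for category in reversed(path):
        if not is_ok(category):
            break
        latest_ok = category
    return latest_ok


def top_down(path, is_ok):
    """Walk from the root C1 toward the sample category; stop at the first ok-category."""
    for category in path:
        if is_ok(category):
            return category
    return None


def dichotomy(path, is_ok):
    """Binary search between an ng-category (front) and an ok-category (back)."""
    if not is_ok(path[-1]):   # leaf is queried first (assumption: no answer if ng)
        return None
    if is_ok(path[0]):        # then the root (assumption: output it if ok)
        return path[0]
    low, high = 0, len(path) - 1   # path[low] is ng, path[high] is ok
    while high - low > 1:
        middle = (low + high + 1) // 2   # deeper of two central elements
        if is_ok(path[middle]):
            high = middle
        else:
            low = middle
    return path[high]


# Toy run: eight categories on the path, with C4..C8 answering as ok-categories.
path = ["C1", "C2", "C3", "C4", "C5", "C6", "C7", "C8"]
ok_set = {"C4", "C5", "C6", "C7", "C8"}
for search in (bottom_up, top_down, dichotomy):
    asked = []
    result = search(path, lambda c: asked.append(c) or c in ok_set)
    print(search.__name__, result, len(asked))
```

When the ok-categories do not form one unbroken run from the sample category upward, the approaches can return different categories, since the Top-down approach stops at the first ok-category from the root while the other two stay close to the sample category.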

As mentioned above, the only task that lexicographers have to do is to answer each query. Consequently, the resulting categories follow a specific generalization standard. After repeatedly applying one of the approaches for searching to each corresponding input, the proposed method can specify all of the appropriate semantic categories for each slot in the SPP.




4 Experimental Work

The authors evaluated the proposed method on the three following points.

In order to evaluate the above points, the authors attempted to acquire semantic categories for SPPs whose if-parts corresponded to the skeletons in Table 1. The sample nouns inputted to the proposed method are shown under the skeletons. The target of the generalization is indicated by underline. The semantic hierarchy used was that of ALT-J/E as shown in Figure 2. The minimal acceptable rate and the minimal translatable rate were fixed at 2% and 80%, respectively. The sample size for Ci and the lower threshold of the sample sizes were 50 and 10, respectively.

Table 1. Skeletons and sample sentences for interactive generalization
No  Skeleton                                          Category specified manually
    Sample sentence

1   N1-ga N2-wo yomu "N1 read N2"                     Agents
    Chichi-ga hon-wo yomu "My father reads a book"
2   N1-ga N2-wo yomu "N1 read N2"                     Abstract Thing (Idea)
    Chichi-ga hon-wo yomu "My father reads a book"
3   N1-ga N2-wo yomu "N1 read N2"                     Spirit/Soul/Mind
    Chichi-ga houkokusho-wo yomu "My father reads a report"
4   N1-ga N2-wo N3-ni erabu "N1 elect N2 N3"          Chief/President/Manager
    Juhmin-ga kare-wo kaichou-ni erabu "Residents elect him their head"
5   N1-ga N2-de nyushou-suru "N1 win a prize in N2"   Abstract Thing (Behavior)
    Kare-ga Konkuhru-de nyushou-suru "He wins a prize in the contest"
6   N1-ga N2-wo tatamu "N1 close N2"                  Facilities
    Chichi-ga mise-wo tatamu "My father closes his shop"
7   N1-ga N2-wo unten-suru "N1 run N2"                Machinery
    Chichi-ga hatsudouki-wo unten-suru "My father runs an electric dynamo"
8   N1-ga N2-wo nageru "N1 throw N2"                  Objects
    Chichi-ga bohru-wo nageru "My father throws a ball"
9   N1-ga hanpatsu-suru "N1 rebound"                  Economic system
    Kabusiki-ga hanpatsu-suru "Shares rebound"
10  N1-ga N2-ni tassuru "N1 rise to N2"               Price/Cost
    Doru-ga saitakane-ni tassuru "The dollar rises to the highest level"

Two lexicographers, who generated SPPs of ALT-J/E, participated in the experiments. Each semantic category in the 3rd column indicates a semantic category specified manually for the SPP corresponding to the skeleton in the 2nd column.

Table 2. Experimental results: the relative position of the acquired semantic category to that specified manually, the number of paired queries: (Q1) and (Q2) (See Section 3.2) and the time spent
      Difference        # of paired queries      Time (min.)
No    B    T    D       B      T      D          B     T     D
1     0    0    0       259    53     53         23    6     5
2     0    0    0       158    155    155        17    16    18
3     -1   +3   +3      119    51     51         24    10    5
4     0    0    0       55     262    158        9     34    15
5     -2   -2   -2      153    259    204        8     21    11
6     -1   -1   -1      147    208    157        10    13    12
7     -1   -1   -1      104    310    208        10    54    8
8     -2   -2   -2      208    207    156        55    75    13
9     0    0    0       158    212    160        7     12    8
10    0    0    0       50     316    155        16    4     10
Ave.  --   --   --      141.1  203.3  145.7      17.9  24.5  10.5

Table 2 reports the experimental results. The 2nd to 4th columns show the position of an acquired semantic category relative to the semantic category specified manually. For example, +1 or -1 indicates that the acquired semantic category is one semantic category above or below the semantic category specified manually, respectively. Each number in the 5th to 7th columns shows the number of paired queries, i.e., (Q1) and (Q2) in Section 3.2, presented to the lexicographers. Each number in the 8th to 10th columns shows the time spent on generalizing the sample noun for the target slot. B, T, and D on the 2nd line indicate that the approach for searching is Bottom-up, Top-down, and Dichotomy, respectively. Through this experiment, the following things could be found:




5 Conclusion

This paper proposed a method to acquire appropriate semantic categories to be inputted into each slot of an SPP by using queries based on a semantic hierarchy. The queries ask whether the noun corresponding to the target slot in presented sentences can be semantically collocated with the verb in SL and ask whether the noun can also give a context where the most plausible equivalent for the verb in SL is the verb of the TL side in the acquired SPP.

The method allows lexicographers to acquire more plausible semantic categories for SPPs by simply answering the queries presented by the method.




References

[1]
Y. Akiba, M. Ishii, H. Almuallim, and S. Kaneda. Learning English verb selection rules from hand-made rules and translation examples. In Proc. TMI-95, pages 206-220, 1995.

[2]
H. Almuallim, Y. Akiba, and S. Kaneda. On handling tree-structured attributes in decision tree learning. In Proc. ICML-95, pages 12-20, 1995.

[3]
K. W. Church and R. L. Mercer. Introduction to the special issue on computational linguistics using large corpora. Computational Linguistics, 19:1-24, 1993.

[4]
M. Dorna and M. C. Emele. Efficient implementation of a semantic-based transfer approach. In Proc. ECAI-96, pages 567-571, 1996.

[5]
S. Ikehara, M. Miyazaki, S. Shirai, A. Yokoo, H. Nakaiwa, K. Ogura, Y. Ooyama, and Y. Hayashi. Goi-Taikei: A Japanese Lexicon (in Japanese). Iwanami Shoten Publisher, Tokyo, 1997.

[6]
S. Ikehara, S. Shirai, A. Yokoo, and H. Nakaiwa. Toward an MT system without pre-editing - effects of new methods in ALT-J/E. In Proc. MT Summit-3, pages 101-106, 1991.

[7]
G. A. Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39-41, 1995.

[8]
R. L. Scheaffer, W. I. Mendenhall, and R. L. Ott. Elementary Survey Sampling (5th ed.). Duxbury Press, California, 1996.

[9]
S. Shirai, S. Ikehara, A. Yokoo, and H. Inoue. The quantity of valency pattern pairs required for Japanese to English MT and their compilation. In Proc. NLPPS-95, pages 443-448, 1995.

[10]
H. Tanaka. Decision tree learning algorithm with structured attributes: application to verbal case-frame acquisition. In Proc. COLING-96, pages 943-948, 1996.

[11]
T. Utsuro. Sense classification of verbal polysemy based on bilingual class/class association. In Proc. COLING-96, pages 968-973, 1996.

[12]
J. Yang. Towards the automatic acquisition of lexical selection rules. In Proc. MT Summit-7, pages 397-403, 1999.




Footnote
1 Some nouns like zeri "jelly" are categorized as Confectionery; consequently, semantically strange sentences such as watasi-wa zeri-wo yaku "I bake jelly" can meet the SPP shown in Figure 1. It does not matter that sentences like this meet an SPP since they are not inputted into MT systems.
2 These numbers are used as convenient indicators of semantic diversity under each sibling.
3 When the sample size becomes a decimal such as 13.3 (= 20*2/3), the decimal is rounded up.