To realize the valency pattern method, which is used in the semantic analysis of co-occurrence between verbs and nouns, this paper tested several methods of collecting these patterns comprehensively and clarified the number of pattern pairs needed for machine translation. The experiments showed that most of the pattern pairs needed can be collected by using example sentence pairs prepared manually by analysts who rely on existing knowledge compiled in dictionaries for human use and on their own personal knowledge.
Specifically, three methods were examined. The results showed that Japanese to English machine translation requires about 7,500 pattern pairs to cover the 1,000 Japanese-origin verbs that are critical to differentiated translation, and that about 15,000 example sentence pairs need to be prepared to collect these pattern pairs. It was also predicted that about 25,000 pattern pairs would be required to cover all Japanese predicates, including verbs of Chinese origin and idiomatic expressions of the declinable word type. Furthermore, the method of preparing example sentences through human knowledge was shown to be entirely feasible.
To improve the quality of machine translation, further development of semantic analysis technologies is expected. As one of these technologies, the valency pattern method is known to be effective for analyzing the meaning relations between verbs and nouns. However, realization of this method is hindered by two problems: the inaccuracy of written patterns and the large number of pattern pairs that must be collected.
Regarding the problem of pattern-writing accuracy, it has already been clarified [Ikehara et al. 93] that the semantic attributes of Japanese nouns need to be classified into 2,000 or more categories in order to translate Japanese verbs according to their meanings.
For the problem of pattern pair collection, the quantity of pattern pairs required has remained unknown. Recently, many methods based on various heuristics or learning technologies have been proposed. Yet it is very difficult to gather sufficient example sentences from existing documents to learn valency patterns automatically. For example, Kurohashi et al. proposed a method that can learn valency patterns by using example sentence pairs and a thesaurus. They pointed out the possibility of achieving accuracy levels equivalent or superior to manual preparation [Kurohashi et al. 92]. Also, Almuallim et al. developed an automatic translation rule extraction method based on automatic learning techniques and showed that several valency patterns were extracted for each of 6 verbs from 27 to 80 example sentence pairs [Almuallim et al. 94a, 94b].
These methods assume the existence of a number of example sentences sufficient for learning. For example, in the case of Japanese to English machine translation, it is said that 10 million pairs of example sentences are required in order to automatically generate the valency pattern pairs needed for verb translation [Kaneda et al. 94]. Furthermore, each sentence pair must have a simple sentence structure consisting of one verb and one or more nouns. Example sentences obtained from existing documents tend to have complicated structures, which necessitates their simplification. Collecting such a volume of simplified examples from actual documentation is all but impossible*1.
This paper therefore focuses on three manual pattern pair preparation methods and clarifies the number of pattern pairs required for Japanese to English machine translation. We examine (1) the method of making valency patterns from Japanese to English dictionaries, (2) the method based on example sentences that correspond to the meanings of Japanese verbs defined in Japanese dictionaries, and (3) the method based on example sentences prepared from human knowledge. Thirty-six Japanese verbs (in this paper, Japanese verbs refer to verbs not of Chinese origin) are taken up as examples, and the number of pattern pairs obtained for these verbs by the three methods is evaluated.
Finally, based on the results of these experiments, the number of pattern pairs required for Japanese to English machine translation is estimated, and how they can be collected is discussed.
In order to compile the co-occurrence relations between declinable words and nouns into valency pattern pairs for machine translation, the types of declinable words involved and the method used for semantically categorizing nouns must be determined. Particularly in the semantic categorization of nouns, the granularity required for writing pattern pairs depends on the language pair to be translated. In the case of Japanese to English machine translation, it is said that Japanese noun meanings need to be categorized into over 2,000 types in order to write the valency pattern pairs that are needed for the differentiated translation of Japanese verbs into corresponding English expressions [Ikehara 93].
This paper deals with pattern pair preparation methods under the framework of the Japanese to English Machine Translation System ALT-J/E [Ikehara 89], which is regarded as satisfying the above-mentioned requirements. ALT's framework for pattern pair writing is as follows.
ALT's framework for the valency pattern method consists of a semantic attribute system and two semantic dictionaries (the semantic word dictionary and the semantic structure dictionary). In the semantic attribute system, the semantic use of Japanese nouns is classified into a tree structure with 12 levels having some 3,000 types of attribute names. The semantic word dictionary holds the meanings of some 400,000 words described using semantic attributes (one or more meanings per word). The semantic structure dictionary pairs the Japanese valency patterns with the corresponding English sentence structures. These dictionaries are used to disambiguate the results of syntax analysis, to select verb translations, and for other semantic analysis including the selection of noun translations.
Pattern pairs in ALT usually consist of declinable words (verbs, adjectives), noun elements (nouns, noun phrases or their attributes), adverb elements and aspect information. Noun elements are described by using semantic attributes of the minimum depth that still allows differentiated Japanese to English translation of verbs [Ikehara 93]. When a noun in a case element (noun + joshi (post-positional word)) cannot be represented by semantic attributes, the noun itself is used. Patterns in which all of the noun elements in cases are represented by semantic attributes are called general patterns. Patterns in which one or more nouns of the case elements are fixed are called idiomatic patterns*1. Idiomatic patterns are used for fixed forms of Japanese expressions such as figurative expressions. This paper deals with the collection of general pattern pairs.
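The distinction between general and idiomatic patterns can be illustrated with a minimal Python sketch. This is not ALT's actual data format: the class names, the field layout, and the attribute names `PERSON` and `EQUIPMENT` are hypothetical and serve only to show how the definition above could be encoded.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CaseElement:
    """One noun case element (noun + joshi) of a Japanese valency pattern."""
    particle: str       # post-positional word, e.g. "ga" or "wo"
    filler: str         # a semantic attribute name, or the literal noun
    is_attribute: bool  # True when filler is a semantic attribute


@dataclass(frozen=True)
class PatternPair:
    """A Japanese valency pattern paired with an English sentence structure."""
    predicate: str                  # index word (declinable word)
    cases: Tuple[CaseElement, ...]
    english: str

    def kind(self) -> str:
        # General: every case element is a semantic attribute.
        # Idiomatic: one or more case elements are fixed nouns.
        if all(case.is_attribute for case in self.cases):
            return "general"
        return "idiomatic"


# Hypothetical entries for the verb hiku (attribute names are assumptions).
general = PatternPair(
    "hiku",
    (CaseElement("ga", "PERSON", True), CaseElement("wo", "EQUIPMENT", True)),
    "N1 install N2",
)
fixed = PatternPair(
    "hiku",
    (CaseElement("ga", "PERSON", True), CaseElement("wo", "denwa", False)),
    "N1 install a telephone",
)
```

Under this encoding, `general.kind()` yields `"general"` and `fixed.kind()` yields `"idiomatic"`, mirroring the two pattern classes defined in the text.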
Valency patterns are prepared for each declinable word functioning as a predicate. The Japanese language allows nouns to become predicates. In such cases, patterns are prepared with nouns as the predicates. For example, the "Noun + da だ (or desu です)" form of Japanese predicate nouns is generally translated into English as a noun complement. In contrast, there are instances where predicate nouns cannot be translated into a noun complement in English, such as "Kyo-wa hare-da.: 今日は晴れだ。(It is fine today.)" or "Anata-ni shitsumon-desu.: あなたに質問です。 (I ask you a question.)". There are also instances where predicates happen to be compound words, such as "X-wa Y-shidai-da.: XはY次第だ。 (X depends on Y)". Pattern pairs are also prepared in such instances.
When making patterns from scratch, we must find pattern pair candidates from example sentences. When adding new pattern pairs to existing patterns, we must find pattern shortages and, after adding new patterns, check for inconsistencies between the added patterns and the existing ones. A computer support system is required to perform these processes efficiently.
(1) Support for Pattern Pair Generation
It is known that most pattern pair structures for Japanese to English machine translation can be described using 10 templates [Yokoo et al. 94]. Therefore, by using these templates and specifying the Japanese and English pattern elements from example sentences, most valency patterns can easily be prepared. Unfortunately, in preparing high quality patterns with a high level of generality, describing the noun elements that determine the scope of pattern application poses a major problem. To alleviate this problem, a computer support system was developed in ALT. This system combines the noun elements used in the examples with the semantic word dictionary to generate candidate semantic attributes to be specified as noun elements, and displays them to the analyst.
For example, when the example sentence "Kare-wa denwa-wo hiita.: 彼は電話を引いた。 (He installed a telephone)" is given, the sentence pattern "X (subject) install a telephone" is generated. The support system looks up the semantic word dictionary and also displays the semantic attribute of the noun "telephone". The analyst observes this and can prepare a pattern by replacing "telephone" with a more generally used semantic attribute, but can also register the pattern in the dictionary without change. If it is registered in the original form, then when another example of the Japanese verb hiku 引く with the meaning of the English verb "install" is added afterward, the support system will display the semantic attributes of the accusative case nouns once again so that the analyst can convert the patterns to more general forms at that time. As the number of examples increases, the accuracy of the semantic attribute candidates will improve.
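The candidate generation described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the attribute tree fragment, the dictionary entries, and the second noun `suido` are hypothetical and are not taken from ALT's dictionaries. The sketch proposes, most specific first, every semantic attribute that covers all nouns seen so far in the same case slot.

```python
# Hypothetical fragment of a semantic attribute tree: child -> parent.
PARENT = {
    "telephone_equipment": "equipment",
    "utility_line": "equipment",
    "equipment": "artifact",
    "artifact": "concrete",
    "concrete": None,
}

# Hypothetical semantic word dictionary: noun -> semantic attributes.
WORD_DICT = {
    "denwa": ["telephone_equipment"],
    "suido": ["utility_line"],
}


def ancestors(attr):
    """Return attr followed by its ancestors up to the root."""
    chain = []
    while attr is not None:
        chain.append(attr)
        attr = PARENT[attr]
    return chain


def candidates(nouns):
    """Attributes covering every noun in the slot, most specific first."""
    common = None
    for noun in nouns:
        covered = set()
        for attr in WORD_DICT[noun]:
            covered.update(ancestors(attr))
        common = covered if common is None else common & covered
    # A longer ancestor chain means a deeper, more specific attribute.
    return sorted(common or [], key=lambda a: (-len(ancestors(a)), a))


one_example = candidates(["denwa"])
two_examples = candidates(["denwa", "suido"])
```

With a single example the most specific candidate is the attribute of the noun itself; once a second accusative noun is observed, only the shared ancestors remain, which is one way the candidates could become more reliable as examples accumulate.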
(2) Support for Mutual Checking of Pattern Pairs
Valency patterns are registered using predicates as index words, so there can be no mutual interference between patterns with differing index words. Thus, mutual inconsistencies between patterns can be checked by translation experiments on examples having identical index words. ALT has therefore developed the following semi-automatic mechanism to support mutual inconsistency checks between patterns.
After the process mentioned in (1), the example sentences used in pattern preparation and the results of the related machine translation are kept in a file. When a new pattern is later generated through process (1), the new pattern is provisionally registered, and translation experiments are then conducted with the existing examples having identical index words. The results are compared with the translation results achieved in the past, and the examples showing differences are output together with the pattern pairs used for the translation in question. The analyst observes these and decides whether to finally register the new pattern. In some cases, the inconsistency check shows the need for revision not only of new patterns but also of existing ones. Revisions of patterns are conducted by reverting to process (1).
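The checking loop can be sketched as a small regression test in Python. The pattern representation and the toy `translate` function below are assumptions made for illustration and do not reflect ALT's actual translation engine; only the procedure (provisional registration, re-translation of stored examples with the same index word, and reporting of differences) follows the text.

```python
# A pattern is (index word, object noun or "*" for any noun, English verb).
# This representation is a hypothetical simplification.

def translate(sentence, patterns):
    """Toy translator: the first pattern matching (verb, object) wins."""
    verb, obj = sentence
    for index, slot, english in patterns:
        if index == verb and slot in ("*", obj):
            return english
    return None


def find_conflicts(stored, patterns, new_pattern):
    """Provisionally register new_pattern and list examples whose output changes."""
    trial = [new_pattern] + patterns  # the new pattern is tried first
    index_word = new_pattern[0]
    conflicts = []
    for sentence, old_result in stored.items():
        if sentence[0] != index_word:
            continue  # patterns with other index words cannot interfere
        new_result = translate(sentence, trial)
        if new_result != old_result:
            conflicts.append((sentence, old_result, new_result))
    return conflicts


patterns = [("hiku", "denwa", "install")]
stored = {("hiku", "denwa"): "install"}  # past example and its translation

too_general = find_conflicts(stored, patterns, ("hiku", "*", "pull"))
harmless = find_conflicts(stored, patterns, ("hiku", "kaze", "catch"))
```

Here the overly general new pattern changes the translation of the stored telephone example and is reported to the analyst, whereas the more specific pattern for kaze-wo hiku leaves existing results untouched.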
(1) Pattern Pair Collection Method
The first method uses conventional Japanese-English dictionaries. Japanese-English dictionaries for human use list the meanings of Japanese declinable words together with the corresponding verbs, phraseology, and example sentences in English. Therefore, by analyzing the phraseology and example sentences listed in these dictionaries and re-arranging the restrictive conditions on case elements, adverb elements and other factors of the Japanese sentence, pattern pairs of Japanese and English can be prepared manually.
For example, in one Japanese-English dictionary [Kenkyusha 84], the following is listed as an example sentence of agaru: 上がる.
Kare-no | gakko-no | seiseki-ga | agatta.
彼の | 学校の | 成績が | 上がった。
(His school record has improved.)
Analysis of the elements of this sentence, together with certain additional information, results in the pattern pairs shown in Fig.1.
Several Japanese-English dictionaries [Kenkyusha 74, 84] were used in this study for the preparation of pattern pairs*1.
(2) The Quantity of Collected Pattern Pairs
Pattern pairs obtained by the above method initially amounted to 10,000 general patterns and 5,000 idiomatic patterns. Subsequent reviews showed that some of the general patterns could be unified, while some of the idiomatic patterns could be converted into general patterns. Consequently, the total number of pattern pairs collected from dictionaries amounted to 10,000 patterns for general expressions and 3,000 patterns for idiomatic expressions.
(3) Sufficiency Check by Translation Experiments
Using the patterns described above, translation experiments were conducted on a specification document (1,361 sentences) for information processing devices. The results showed that the test sentences contained 142 different declinable words and that 201 valency patterns were needed to translate them. However, the dictionary contained only 120 of these declinable words and 154 of the valency patterns. No pattern pair had been prepared for 22 declinable words (22/142 = 15%) in the test sentences, and there was a shortfall of 25 patterns for 23 of the declinable words that had been prepared. The overall shortfall of pattern pairs amounts to 23% ((201-154)/201).
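The coverage figures quoted above can be reproduced with a few lines of Python. The variable names are illustrative; the counts are those reported for the 1,361-sentence test document.

```python
# Counts reported for the translation experiment.
words_needed, words_covered = 142, 120
patterns_needed, patterns_covered = 201, 154

# Declinable words with no pattern pair at all.
missing_words = words_needed - words_covered
word_gap = missing_words / words_needed

# Valency patterns needed by the test sentences but absent from the dictionary.
missing_patterns = patterns_needed - patterns_covered
pattern_gap = missing_patterns / patterns_needed

print(f"words without patterns: {missing_words} ({word_gap:.0%})")
print(f"missing patterns: {missing_patterns} ({pattern_gap:.0%})")
```

The 47 missing patterns are consistent with the breakdown in the text if each of the 22 uncovered words needs one pattern, in addition to the 25 patterns missing for words already in the dictionary.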
(1) Method of Example Sentence Collection
As observed in the previous section, some Japanese verbs have so many meanings that collecting a sufficient number of translation patterns is no easy task. For this kind of verb, Japanese philologists (some 20 specialists in all) have been researching the collection and analysis of example sentences corresponding to verb meanings. Example sentences for each meaning of 861 verbs (Japanese example sentences only, however) have already been compiled into the IPAL Verb Dictionary [IPAL 87]*2. In this section, we deal with the collection of pattern pairs from this dictionary as the second method.
First, for the example sentences shown in this dictionary, translators were requested to prepare perfectly acceptable English translations that were as faithful as possible to the original Japanese. Pattern pairs were then collected from the Japanese example sentences and their corresponding translations.
(2) The Quantity of Collected Pattern Pairs
For a total of 861 Japanese verbs, a total of 5,243 example sentence pairs (37,500 Japanese words and 40,000 English words) were obtained. Up to now, besides the pattern pairs obtained by the first method, 1,290 new pattern pairs were collected from 4,500 example sentences for 740 verbs, while 410 of the pattern pairs created by the first method have undergone revision.
(3) Correspondence between Meanings and Patterns
The IPAL Verb Dictionary yields examples based on categories of meanings of the Japanese verbs prepared therein. Thus, when considering pattern pairs for Japanese to English translation, it was found that the relation between Japanese verb meanings and pattern pairs was not always one to one.
For example, 4 Japanese verbs that typically have numerous meanings were selected. The correspondence between their meanings and pattern pairs is shown in Table 1.
Table 1. Relation between Meanings and Patterns

Verb | 1→1 | 1→n | m→1 | m→n | ? | Total
あがる "agaru" | 8 | 5 | 1 | 3 | 1 | 18
あげる "ageru" | 14 | 2 | 1 | 1 | 3 | 21
だす "dasu" | 8 | 9 | 5 | 4 | 1 | 27
でる "deru" | 13 | 3 | 10 | 4 | 2 | 32
Total | 43 | 19 | 17 | 12 | 7 | 98
% | 43.9 | 19.4 | 17.3 | 12.2 | 7.1 | 100
This table shows that only about 44% of the cases had a one to one relationship. This means that, from the viewpoint of Japanese to English translation, the categorization of meanings in the IPAL dictionary is not necessarily appropriate.
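As a cross-check of the totals and percentages in Table 1, the following Python sketch recomputes them from the per-verb counts. The variable names are illustrative; the counts are those listed in the table.

```python
COLUMNS = ["1→1", "1→n", "m→1", "m→n", "?"]

# Per-verb counts from Table 1, in the column order above.
TABLE1 = {
    "agaru": [8, 5, 1, 3, 1],
    "ageru": [14, 2, 1, 1, 3],
    "dasu": [8, 9, 5, 4, 1],
    "deru": [13, 3, 10, 4, 2],
}

# Column totals over the four verbs, then the grand total.
totals = [sum(column) for column in zip(*TABLE1.values())]
grand_total = sum(totals)

# Share of each relation type, in percent with one decimal place.
shares = [round(100 * t / grand_total, 1) for t in totals]

for name, total, share in zip(COLUMNS, totals, shares):
    print(f"{name}: {total} ({share}%)")
```

The recomputed shares agree with the percentage row of the table, with the one to one relation accounting for 43 of the 98 cases.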
(1) Method of Example Sentence Collection
The above results show that using examples taken from both Japanese-English dictionaries and Japanese dictionaries for human use yields an insufficient number of pattern pairs. Considering the relationship between Japanese example sentences and pattern pairs, it can be seen that when there is a difference in the nuance of verb usage, a new and separate English pattern becomes necessary even for the same verb. Thus, we propose a third method in which a native Japanese speaker with a competent understanding of English refers to various dictionaries, draws upon his or her own knowledge, and tries to write down a full collection of example sentences by listing Japanese usages with differing nuances.
The range of examples produced depends on the time allowed for thinking, but it was decided that production would continue until further effort became unproductive. Based on the results of a trial, the target number of example sentences was set at 3 to 4 times the number of meanings in the IPAL Verb Dictionary. The English translations of the Japanese example sentences were entrusted to translation specialists.
(2) The Quantity of Collected Pattern Pairs
We applied the above method and collected 5,200 example sentences (33,000 Japanese words, 17,000 English words) for 300 verbs (these verbs correspond to 450 words when written in Kanji ideograms and express 1,700 meanings); the time spent amounted to about six man-months*1.
From the example sentences collected for 36 verbs (1,100 example sentences), 300 new patterns were generated. This means that, on average, roughly 10 additional pattern pairs per verb could be generated by the third method.
Thirty-six Japanese verbs were selected at random from the results obtained by the three methods discussed in Chapter 3. Table 2 shows a comparison of the number of examples obtained and the number of resulting pattern pairs for these verbs. This table reveals the following.
1) When the second method is used in addition to the first method, about double the number of pattern pairs can be collected.
2) When the third method is used in addition to the first and second methods, the number of pattern pairs doubles again.
Table 2 abbreviations: PO: Number of Patterns Obtained; MW: Number of Meanings per Word; ES: Number of Example Sentences; AP: Number of Added Patterns; Ptrn.: Patterns
These results indicate that by using human knowledge, we can quadruple the number of pattern pairs that can be collected by the first method.
(1) The Estimated Quantity of Pattern Pairs for Japanese origin Verbs
Some experiments using the third method were conducted by different analysts and revealed that, as long as the analysts were competent, the recall factor of pattern pairs obtained from example sentences prepared by different analysts exceeded 90%. Thus, the number of pattern pairs for individual verbs obtained by the third method can be regarded as the number of pattern pairs necessary for each verb. Based on this result, the number of pattern pairs required for Japanese verbs in Japanese to English machine translation is predicted as follows.
First, the number of pattern pairs obtained from the first method is plotted as the bold line in Fig.2. Next, for the 36 verbs taken up in the preceding chapter, the numbers of pattern pairs resulting from the second and third methods are plotted and joined smoothly to form the dotted and dash-dotted lines, respectively.
From this figure, the number of pattern pairs necessary for Japanese origin verbs is estimated to be 7,500.
(2) Estimated Quantity of Pattern Pairs for MT
Up to the preceding chapter, discussion has concentrated on general patterns for Japanese verbs, but pattern pairs should also be prepared for idiomatic expressions with a declinable word. And in addition to Japanese verbs, Chinese-origin verbs, adjectives and other types of predicates must also be considered for forming pattern pairs.
Adjective type predicates have characteristics similar to Japanese verbs and the method prescribed in this paper is considered to be appropriate for the generation of pattern pairs. The same is applicable also for idiomatic patterns. In the case of Chinese-origin verbs, 1 word relates to about 1 to 2 patterns making it comparatively easy to collect the patterns needed from dictionaries.
An estimate of the number of pattern pairs required for all declinable words is shown in Table 3. This table also shows an estimate of the number that can be collected using the three methods described in this paper.
Table 3 abbreviations: NOIW: Number of Index Words; NOPP: Number of Pattern Pairs; NOCW: Number of Corresponding Words; NOES: Number of Example Sentences; V.: Verb; Equiv.: Equivalents
* The number of words as inscribed in Kanji is displayed.
From this table, it may be observed that some 25,000 pattern pairs are estimated to be necessary for general and idiomatic patterns in Japanese to English machine translation. These patterns can be collected by relying on the third method, over and above the first and second methods.
The number of valency pattern pairs needed for translating the meanings of declinable words (verbs, adjectives) in Japanese to English machine translation, together with the ways and means of collecting them, have been clarified.
Japanese verbs (rather than those of Chinese origin) were considered because their pattern pair preparation has been the most difficult, owing to the number of meanings for each word. The collection of pattern pairs was examined using three methods: (1) the method of collecting patterns from Japanese to English dictionaries, (2) the method of using example sentences prepared based on the meanings of verbs in Japanese dictionaries, and (3) the method of preparing example sentences by referring to the above dictionaries and human knowledge. The example sentences and pattern pairs obtained for 36 Japanese verbs were compared. The results show that about 7,500 valency patterns are required in order to translate some 1,000 major verbs of Japanese origin into English based on their meanings. It was also clarified that adding the second method doubles the number of pattern pairs collected, and that adding the third method doubles the number again. It was further shown that, for the comprehensive collection of the necessary pattern pairs, the third method would be the most appropriate considering the manpower and working hours involved.
It was also predicted that for all pattern pairs in their entirety, including Chinese-origin verbs and adjective type predicates, some 25,000 patterns would be required.
Currently, the second and third methods described in this paper are being used in parallel. For Japanese verbs, Chinese-origin verbs and adjective type predicates, 5,000, 4,000, and 2,000 patterns respectively (totaling 11,000) have been collected. The remaining pattern pairs (about 9,000 instances of general patterns and 2,000 instances of idiomatic patterns) will be prepared in due course.