Characteristics of Colloquial Expressions in a Bilingual Travel Conversation Corpus

Toshiyuki Takezawa+, Satoshi Shirai+ and Yoshifumi Ooyama++

+ ATR Spoken Language Translation Research Laboratories
2-2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0288, Japan
E-mail: {takezawa, shirai}@slt.atr.co.jp

++ NTT Communication Science Laboratories
2-4 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0237, Japan
E-mail: ooyama@cslab.kecl.ntt.co.jp


Abstract

In order to develop a speech translation system that has a large vocabulary and accepts various expressions, we have carried out a practical and quantitative investigation of the similarities and differences between the tasks of machine translation for written language, such as newspaper articles, and for spoken language, such as conversations in daily life. First, we describe the characteristics of vocabularies for colloquial expressions in conversations. Next, we report the characteristics of basic sentence patterns, which consist of one predicate and essential case phrases, from the viewpoint of translation from Japanese to English. Finally, we discuss state-of-the-art technology based on a case study of a Japanese-to-English speech translation system, ATR-MATRIX, built by ATR Interpreting Telecommunications Research Laboratories.



Key words:

Bilingual corpus, conversation, spoken language, speech translation.



[ In Proceedings of ICCPOL 2001, pp.384-389 (May, 2001). ]



INDEX

1 Introduction
2 Corpus and dictionary
  2.1 A bilingual travel conversation corpus
  2.2 Goi-Taikei: A Japanese Lexicon
3 Vocabulary characteristics of travel conversations
4 Basic sentence pattern characteristics of travel conversations
5 Discussions and future works
  5.1 State-of-the-art speech translation technology
  5.2 Future works
6 Conclusions
  Acknowledgments
  References



1 Introduction

Current speech translation research assumes a limited task such as hotel room reservations and carries out medium vocabulary size experiments of about ten thousand words [1,2,3]. In order to enlarge the application area of speech translation systems, we must develop a large vocabulary speech translation system that accepts more expressions. Machine translation systems for written language such as newspaper articles can deal with a larger vocabulary than speech translation systems. However, there seem to be different problems for machine translation of texts and that for conversations [4]. Despite one study on a machine translation system for spoken language [5], there have not been enough studies comparing the tasks and expressions of machine translation for text and that for conversations.

NTT has recently developed a Japanese-to-English machine translation system, ALT-J/E [6], which can deal with texts such as newspaper articles. Part of its semantic dictionary has been published as a book entitled "Goi-Taikei: A Japanese Lexicon" [7] and its CD-ROM is also available [8]. On the other hand, a bilingual travel conversation corpus built by ATR Interpreting Telecommunications Research Laboratories [9,10] has already been released outside of ATR and used for research at many research institutes and universities.

A practical and quantitative analysis is necessary for good guidelines or helpful knowledge to make a machine translation system for text deal with spoken language. Therefore, we investigated characteristics of a digitized bilingual travel conversation corpus in comparison with "Goi-Taikei: A Japanese Lexicon" which is a machine-readable large vocabulary dictionary for Japanese-to-English translation.

Section 2 describes the bilingual travel conversation corpus and "Goi-Taikei: A Japanese Lexicon." Section 3 reports the characteristics of vocabularies for colloquial expressions in conversations. Section 4 presents the characteristics of basic sentence patterns which consist of one predicate and essential case phrases from the viewpoint of translation from Japanese to English. Section 5 discusses state-of-the-art technology based on a case study of a Japanese-to-English speech translation system ATR-MATRIX built by ATR Interpreting Telecommunications Research Laboratories. Finally, section 6 offers conclusions.




2 Corpus and dictionary




2.1 A bilingual travel conversation corpus

ATR Interpreting Telecommunications Research Laboratories have built a bilingual travel conversation corpus for speech translation research [9,10]. To gather good-quality data, one interpreter was assigned to each translation direction (J to E or E to J) for every conversation collected. The task of the bilingual corpus involves travel conversations between a tourist and a front desk clerk at a hotel. This task was selected because of its familiarity to people and its expected use in future speech translation systems. The interpreters speak English and Japanese in all of the conversations and serve as a speech translation system. The human interpreters interpreted each utterance in succession, so we could gather basic data for developing a speech translation system. Table 1 shows an overview of the bilingual travel conversation corpus. You can find release information at http://results.atr.co.jp/products_e/.

Table 1: Overview of a bilingual travel conversation corpus
Number of collected conversations 618
Speaker participants 71
Interpreter participants 23
Total number of utterances 16,107
Total number of Japanese words 301,961




2.2 Goi-Taikei: A Japanese Lexicon

NTT has developed a Japanese-to-English machine translation system, ALT-J/E [6], which can deal with texts such as newspaper articles. A part of its semantic dictionary has been published as a book entitled "Goi-Taikei: A Japanese Lexicon" [7] and its CD-ROM is also available [8]. It contains 300,000 Japanese words tagged with 3,000 semantic categories and 14,000 Japanese-to-English valency patterns of 6,000 Japanese verbs. Table 2 shows an overview of the items used for this study. You can find useful information at http://www.kecl.ntt.co.jp/icl/mtg/.

Table 2: Overview of "Goi-Taikei: A Japanese Lexicon"
Category | Number of items
The word dictionary | 300,000 words
The valency dictionary | 6,000 verbs, 14,000 valency patterns




3 Vocabulary characteristics of travel conversations

Using utterances translated from Japanese to English in the bilingual travel conversation corpus, words were extracted and their frequency calculated. Figure 1 shows an example of the Japanese particle "no." The items separated by the symbol | are surface form, reading, standard form, and part-of-speech, respectively, and the number in () is frequency.

Surface form | Reading | Standard form | Part-of-speech | (Frequency)
No | no | no | Case particle | ( 69 )
No | no | no | Sentence final particle | ( 4 )
No | no | no | Nominal particle | ( 337 )
No | no | no | Prenominal particle | ( 4135 )

Figure 1: Example of particle "no"
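As a minimal sketch of how records in the format of Figure 1 could be read programmatically, the following Python fragment splits each line on the symbol | and strips the parentheses from the frequency. The names WordEntry and parse_entry are hypothetical and are not part of the corpus tools; the sample data is copied from Figure 1.

```python
from typing import NamedTuple


class WordEntry(NamedTuple):
    """One record: surface form, reading, standard form, part-of-speech, frequency."""
    surface: str
    reading: str
    standard: str
    pos: str
    frequency: int


def parse_entry(line: str) -> WordEntry:
    """Parse 'surface | reading | standard | part-of-speech | ( freq )'."""
    fields = [field.strip() for field in line.split("|")]
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    surface, reading, standard, pos, freq = fields
    # The frequency is written in parentheses, e.g. "( 4135 )".
    frequency = int(freq.strip("() "))
    # Lower-case the tag so that capitalization differences do not matter.
    return WordEntry(surface, reading, standard, pos.lower(), frequency)


# Sample data copied from Figure 1.
FIGURE_1 = """\
No | no | no | case particle | ( 69 )
No | no | no | Sentence final particle | ( 4 )
No | no | no | Nominal particle |( 337 )
No | no | no | Prenominal particle | ( 4135 )
"""

entries = [parse_entry(line) for line in FIGURE_1.splitlines() if line.strip()]
total = sum(entry.frequency for entry in entries)
print(len(entries), total)
```

Here the four records for "no" are four distinct word entries, while total is the number of running words they account for.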

As shown in Table 3, the part-of-speech system in the bilingual travel conversation corpus is different from that of "Goi-Taikei: A Japanese Lexicon." Therefore, we made mapping tables such as those described between () in Table 3, in which a nominal particle in the bilingual travel conversation corpus is mapped to the formal keishiki noun in "Goi-Taikei: A Japanese Lexicon," a topic particle is mapped to the adverbial particle, and parallel particles, prenominal particles and quotational particles in the bilingual travel conversation corpus are mapped to case particles in "Goi-Taikei: A Japanese Lexicon."

Table 3: Differences of part-of-speech information (example of particle)
A bilingual travel conversation corpus | "Goi-Taikei: A Japanese Lexicon" or ALT-J/E
Case particle | Case particle
Nominal particle | N.A. (Formal keishiki noun)
Topic particle | N.A. (Adverbial particle)
Adverbial particle | Adverbial particle
Parallel particle | N.A. (Case particle)
Conjunctive particle | Conjunctive particle
Sentence final particle | Sentence final particle
Prenominal particle | N.A. (Case particle)
Quotational particle | N.A. (Case particle)

In the example shown in Figure 1, the case particle, nominal particle and prenominal particle "no" are covered by "Goi-Taikei: A Japanese Lexicon" because the same lexical items are included through the mapping table. However, the sentence final particle "no" is not covered by "Goi-Taikei: A Japanese Lexicon." Therefore, coverage based on the word entries is 3/4, i.e. 75.0%. Coverage based on the total number of words is 4541/4545, i.e. 99.9%.
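The two coverage figures for "no" can be reproduced with a short Python sketch. It is an illustration only: the lexicon set below is a hypothetical stand-in for the relevant entries of "Goi-Taikei: A Japanese Lexicon", POS_MAP encodes just the mapped rows of Table 3, and the function name coverage is our own.

```python
# Frequencies of the particle "no" from Figure 1, keyed by (word, corpus tag).
corpus_counts = {
    ("no", "case particle"): 69,
    ("no", "sentence final particle"): 4,
    ("no", "nominal particle"): 337,
    ("no", "prenominal particle"): 4135,
}

# Mapped rows of Table 3: corpus tag -> lexicon tag.
POS_MAP = {
    "nominal particle": "formal keishiki noun",
    "topic particle": "adverbial particle",
    "parallel particle": "case particle",
    "prenominal particle": "case particle",
    "quotational particle": "case particle",
}

# Hypothetical lexicon contents (assumption): "no" is listed as a case
# particle and as a formal keishiki noun, but not as a sentence final particle.
lexicon = {("no", "case particle"), ("no", "formal keishiki noun")}


def coverage(counts, lexicon, pos_map):
    """Return (coverage by word entries, coverage by total number of words)."""
    matched_entries = 0
    matched_tokens = 0
    for (word, pos), freq in counts.items():
        # Tags without a mapping are looked up unchanged.
        key = (word, pos_map.get(pos, pos))
        if key in lexicon:
            matched_entries += 1
            matched_tokens += freq
    return matched_entries / len(counts), matched_tokens / sum(counts.values())


entry_cov, token_cov = coverage(corpus_counts, lexicon, POS_MAP)
print(f"{entry_cov:.1%} {token_cov:.1%}")  # 75.0% 99.9%
```

The gap between the two numbers arises because the one uncovered entry is rare (4 occurrences out of 4,545).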

The bilingual travel conversation corpus contains word fragments because of self-repairs and disfluencies. Such fragments are tagged with the part-of-speech "others." Moreover, inflected words are divided into a stem part and an inflection ending part in the bilingual travel conversation corpus. Therefore, we disregard the parts-of-speech "others" and "inflection ending." Table 4 shows the result. Figure 2 shows the coverage per part-of-speech based on the word entries counting.

Table 4: Coverage of "Goi-Taikei: A Japanese Lexicon" to the bilingual travel conversation corpus (vocabulary)
Item | The number of total words | Word entries
Matched words | 80,685 | 2,493
Unmatched words | 18,828 | 1,269
Coverage | 81.1% | 66.3%

Figure 2: Coverage per part-of-speech based on the word entries counting (axis label: Number of words)

Words that are not covered by "Goi-Taikei: A Japanese Lexicon" are divided into two groups. One is colloquial expressions in conversational language and the other is vocabulary dependent on the travel domain. Examples are shown in the following.

Among the words that are not covered by "Goi-Taikei: A Japanese Lexicon," nouns account for 888 words based on word entries and 7,261 words based on the total number of words. If we roughly estimate the vocabulary dependent on the travel domain using the information of noun usage, it accounts for about 70% of the uncovered words based on word entries and about 40% based on the total number of words. The remainder consists of colloquial expressions in conversational spoken language.




4 Basic sentence pattern characteristics of travel conversations

Using utterances translated from Japanese to English in the bilingual travel conversation corpus, we extracted basic sentence patterns, each consisting of one predicate and essential case phrases, and calculated their frequencies. So-called dabun (assertive sentences), a kind of fragmental sentence, are frequently used in conversations. Figure 3 shows examples. The items are the frequency count, the basic Japanese sentence pattern, and the basic English sentence pattern, respectively. The frequency count is calculated per pair of Japanese and English sentence patterns. A basic Japanese sentence pattern consists of a da predicate (a noun with da, an auxiliary verb of assertion) and essential case phrases. Da predicates are flanked by the symbol /. The symbol # is added just before the da. The surface form desu, a polite form of da, is extracted in the same form as da. The notation ga((wa)) indicates that the surface form was the topic particle wa and that its function is similar to that of the case particle ga. Particles are sometimes dropped in conversational spoken language. Where a particle can be inferred for a case phrase that lacks one, we flank it with the symbol *. The basic English sentence patterns are enclosed in the symbol ". Parts corresponding to Japanese da predicates are flanked by the symbol /. If a basic English sentence pattern contains a pronoun, we replaced it with the word "one." English constituents separated by | are those necessary for translation from the basic Japanese sentence patterns.

1 /Ame#da/ "it /rain/"
3 /Hajimete#da/ "/be/ one's first time"
1 /Hajimete#da/ "/be/ one's first visit"
2 /Hajimete#da/ "/be/ the first time"
1 Nara ga ((wa)) /hajimete#da/ "this /be/ one's first trip | to Nara"
1 Kimono ga ((wa)) /hajimete#da/ "/be/ one's first time | to put kimono on"
1 Kimono ga ((wa)) /hajimete#da/ "/be/ one's first time | wearing a kimono"
1 Sore wa /o komari#da/ "that /be/ a problem"
1 Basu wo /go riyô#da/ "/take/ the bus"
1 Kochira ga /chiketto#da/ "here /be/ one's tickets"
1 Sâbisuryô *ga* /komi#da/ "/include/ service charges"
1 Tôkyônaritakûkô wo /shuppatsu#da/ "/leave/ Tokyo Narita Airport"
1 Hitotsume ga /Tôjieki#da/ "the first stop /be/ Toji Station"
1 Chashitsu ga /iriyô#da/ "/need/ a tearoom"
3 /Go zonji#da/ "/know/"
6 /O tomari#da/ "/stay/"

Figure 3: Examples of "dabun (assertive sentences)"
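The notation of Figure 3 is regular enough to be parsed mechanically. The following Python sketch is one possible reading of it, not the extraction tool used in this study: the regular expressions and the name parse_pattern are assumptions. It splits a line into the frequency count, the Japanese pattern and the English pattern, picks out the predicates flanked by /, and groups the English patterns observed for each Japanese pattern.

```python
import re
from collections import defaultdict

# <count> <Japanese pattern> "<English pattern>"
LINE_RE = re.compile(r'^(?P<count>\d+)\s+(?P<ja>.*?)\s*"(?P<en>.*)"$')
# A predicate is the text flanked by the symbol /.
PREDICATE_RE = re.compile(r"/\s*([^/]+?)\s*/")


def parse_pattern(line):
    """Split one Figure 3 line into its count, patterns, and predicates."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a pattern line: {line!r}")
    ja, en = match["ja"], match["en"]
    ja_pred = PREDICATE_RE.search(ja)
    en_pred = PREDICATE_RE.search(en)
    return {
        "count": int(match["count"]),
        "japanese": ja,
        "english": en,
        "ja_predicate": ja_pred.group(1) if ja_pred else None,
        "en_predicate": en_pred.group(1) if en_pred else None,
        # Constituents separated by | in the English pattern.
        "en_constituents": [part.strip() for part in en.split("|")],
    }


# A few lines copied from Figure 3.
FIGURE_3 = """\
1 /Hajimete#da/ "/be/ one's first visit"
2 /Hajimete#da/ "/be/ the first time"
1 Nara ga ((wa)) /hajimete#da/ "this /be/ one's first trip | to Nara"
6 /O tomari#da/ "/stay/"
"""

# Japanese pattern -> {English pattern: frequency}
translations = defaultdict(dict)
for line in FIGURE_3.splitlines():
    parsed = parse_pattern(line)
    translations[parsed["japanese"]][parsed["english"]] = parsed["count"]

for japanese, english in translations.items():
    print(japanese, english)
```

Grouping in this way makes visible that a single Japanese pattern such as /Hajimete#da/ is paired with several English patterns, which is why the frequency is counted per pair.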

In the same manner, we have extracted basic sentence patterns for general predicates such as verbs and adjectives. Table 5 shows the result. Only the percentages based on word entries are shown because the percentages based on the number of total words are almost the same.

Table 5: Coverage of "Goi-Taikei: A Japanese Lexicon" to the bilingual travel conversation corpus (basic sentence patterns)
Category | General: number of items | General: percentage | Dabun (assertive sentences): number of items | Dabun (assertive sentences): percentage
Correct English can be obtained. | 327 | 9.1% | 1 | 0.2%
Correct English cannot be obtained but a single English predicate can be inferred. | 1,413 | 39.5% | 173 | 26.1%
Correct English cannot be obtained and a single English predicate cannot be inferred. | 1,840 | 51.4% | 488 | 73.7%
Total | 3,580 | 100.0% | 662 | 100.0%
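The percentages in Table 5 follow directly from the item counts. As a small check, the Python sketch below recomputes them; the short row labels and the helper name column_percentages are ours, and only the counts are taken from the table.

```python
# Item counts from Table 5: label -> (general predicates, dabun).
TABLE_5 = {
    "correct": (327, 1),
    "predicate inferable": (1413, 173),
    "predicate not inferable": (1840, 488),
}


def column_percentages(rows, column):
    """Return the column total and each row's share of it in percent."""
    total = sum(values[column] for values in rows.values())
    shares = {label: 100 * values[column] / total for label, values in rows.items()}
    return total, shares


general_total, general = column_percentages(TABLE_5, 0)
dabun_total, dabun = column_percentages(TABLE_5, 1)

for label in TABLE_5:
    print(f"{label}: {general[label]:.1f}% {dabun[label]:.1f}%")
print(general_total, dabun_total)
```

Rounded to one decimal place, the shares match the table, including the single covered dabun pattern out of 662 (0.2%).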

As for the general predicates such as verbs and adjectives, approximately 10% of the basic sentence patterns can be translated correctly by "Goi-Taikei: A Japanese Lexicon." For example, "Chikatetsu Karasumasen ni /noru/" can be translated into "/take/ the Karasuma subway line" if the compound word "Chikatetsu Karasumasen" is recognized as a railway. The remaining approximately 90% of the basic sentence patterns cannot be translated correctly by "Goi-Taikei: A Japanese Lexicon." However, for approximately 40% of the basic sentence patterns, we can easily infer single English predicates. Typical examples are as follows.

For the remaining approximately 50% of the basic sentence patterns, we cannot infer single English predicates. Typical examples are as follows.

On the other hand, as for so-called dabun (assertive sentences), only one pattern, "Ame da (it rains)," is covered by "Goi-Taikei: A Japanese Lexicon." In other words, almost no dabun patterns can be translated by "Goi-Taikei: A Japanese Lexicon." We can infer single English predicates for approximately one fourth of them, such as "Chashitsu ga/iriyô#da/," which means "/need/ a tearoom." We cannot infer a single English predicate for the remaining three fourths. An example is "Shibikku wo nidai da," which literally means "two Civic," where an English predicate such as "rent" or "reserve" must be output.




5 Discussions and future works




5.1 State-of-the-art speech translation technology

We discuss state-of-the-art technology based on a case study of a Japanese-to-English speech translation system ATR-MATRIX built by ATR Interpreting Telecommunications Research Laboratories [1]. Real examples output from the system are as follows.

ATR-MATRIX can output correct English from Japanese expressions such as "Chashitsu ga iriyô desu (The tea ceremony room is necessary)." "Shiharai wa kâdo desu (I will pay for it by card)" could be translated into "The payment is by the card" because some translation examples were prepared and successfully selected. However, "Shibikku ga nidai desu (Two Civic)" was translated into "Civic is two" because content words were substituted into the English syntactical structure pattern that directly corresponds to the source Japanese syntactical structure.

Previous research on speech translation systems assumed cooperative dialogues between two people, such as conversations between a tourist and a front desk clerk at a hotel. Sugaya et al. confirmed that users could achieve tasks such as hotel room reservations through state-of-the-art speech translation systems because users sometimes uttered similar expressions after mis-recognitions, or the other party sometimes uttered confirmation questions after mis-translations [3]. ATR-MATRIX could sometimes translate some constituents such as important key words based on the partial translation mechanism [11]. Important key words help the user of a speech translation system to understand the other party's message in the situation. Therefore, ATR-MATRIX could help communication between Japanese and English speakers, even though state-of-the-art speech translation could not avoid recognition and translation errors.

For a large vocabulary speech translation system that accepts various expressions, translating key words does not always help users. High quality translation is indispensable for announcements because a quick response from the other party cannot be expected. Therefore, we must improve the quality of translation, in particular for the category "Correct English cannot be obtained and a single English predicate cannot be inferred" in Table 5.




5.2 Future works

We are planning to build a basic sentence pattern dictionary of conversational language for Japanese-to-English translation based on the data obtained by this study. Such a dictionary is expected to offer broader coverage than the original data.

According to Table 4, the coverage based on the total number of words is broader than that based on the word entries, which suggests that the vocabulary size may be limited by assuming a limited domain. However, as for the basic sentence patterns, the percentage based on the number of total words is almost the same as that based on the word entries. Even if the vocabulary size is limited, the basic sentence patterns, which are combinations of lexical items, may not be limited. Alternatively, the corpus size may be too small to obtain the necessary information for building a broad coverage translation dictionary. We will study the density or sparseness of the corpus in both practical and theoretical manners.




6 Conclusions

Using "Goi-Taikei: A Japanese Lexicon" we have analyzed the characteristics of a bilingual travel conversation corpus built by ATR Interpreting Telecommunications Research Laboratories. There seem to be different problems for machine translation for texts and that for conversations, but there have not been enough studies comparing the two. Therefore, we investigated characteristics of a digitized bilingual travel conversation corpus in comparison with a machine-readable large vocabulary dictionary for text, both of which are available to the public. Such kinds of practical and quantitative analysis are expected to yield good guidelines or helpful knowledge to make a machine translation system for text deal with conversational language.




Acknowledgments

The authors wish to thank members at the Natural Language Processing Systems Department, Service Systems Division, NTT Advanced Technology Corporation for their contributions to the analysis and investigation.




References

[1]
Takezawa, T., Morimoto, T., Sagisaka, Y., Campbell, N., Iida, H., Sugaya, F., Yokoo, A. and Yamamoto, S. A Japanese-to-English speech translation system: ATR-MATRIX. In Proceedings of International Conference on Spoken Language Processing. 1998, pp.2779-2782.

[2]
Sumita, E., Yamada, S., Yamamoto, K., Paul, M., Kashioka, H., Ishikawa, K. and Shirai, S. Solutions to problems inherent in spoken-language translation: the approach of ATR-MATRIX. In Proceedings of Machine Translation Summit VII. 1999, pp.229-235.

[3]
Sugaya, F., Takezawa, T., Yokoo, A. and Yamamoto, S. End-to-end evaluation in ATR-MATRIX: speech translation system between English and Japanese. In Proceedings of EUROSPEECH. 1999, pp.2431-2434.

[4]
Nagao, M. Toward the new era of information technology. IPSJ Magazine, 2000, 41(1), pp.48-49. (in Japanese).

[5]
Iida, H., Sumita, E. and Furuse, O. Spoken-language translation method using examples. In Proceedings of COLING. 1996, pp.1074-1077.

[6]
Ikehara, S., Shirai, S., Yokoo, A. and Nakaiwa, H. Toward an MT system without pre-editing --effects of new methods in ALT-J/E. In Proceedings of Machine Translation Summit III. 1991, pp.101-106.

[7]
Ikehara, S., Miyazaki, M., Shirai, S., Yokoo, A., Nakaiwa, H., Ogura, K., Ooyama, Y. and Hayashi, Y. (Eds). Goi-Taikei: A Japanese Lexicon. Iwanami Shoten Publisher, Tokyo, Japan, 1997. (in Japanese).

[8]
Ikehara, S., Miyazaki, M., Shirai, S., Yokoo, A., Nakaiwa, H., Ogura, K., Ooyama, Y. and Hayashi, Y. (Eds). Goi-Taikei: A Japanese Lexicon CD-ROM. Iwanami Shoten Publisher, Tokyo, Japan, 1999. (in Japanese).

[9]
Morimoto, T., Uratani, N., Takezawa, T., Furuse, O., Sobashima, Y., Iida, H., Nakamura, A., Sagisaka, Y., Higuchi, N. and Yamazaki, Y. A speech and language database for speech translation research. In Proceedings of International Conference on Spoken Language Processing. 1994, pp.1791-1794.

[10]
Takezawa, T. Building a bilingual travel conversation database for speech translation research. In Proceedings of 2nd International Workshop on East-Asian Language Resources and Evaluation --Oriental COCOSDA Workshop '99--. 1999, pp.17-20.

[11]
Wakita, Y., Kawai, J. and Iida, H. Correct parts extraction from speech recognition results using semantic distance calculation, and its application to speech translation. In Proceedings of ACL/EACL Workshop on Spoken Language Translation. 1997, pp.24-31.