Yamato Takahashi, Satoshi Shirai & Francis Bond, NLPRS'97, December 2-4, 1997

A Method of Automatically Aligning Japanese & English Newspaper Articles

Yamato TAKAHASHI, Satoshi SHIRAI and Francis BOND

NTT Communication Science Laboratories
1-1 Hikari-no-oka, Yokosuka-shi, Kanagawa-ken, JAPAN 239
{yamato,shirai,bond}@cslab.kecl.ntt.co.jp

Abstract

Bilingual corpora are very useful in natural language processing. Unfortunately, they are difficult to compile. We have developed a method in which numerical values and proper nouns are used as keywords to align Japanese and English newspaper articles automatically in order to develop a corpus. In addition, we have developed a way to evaluate the results. We automatically and correctly align an average of 38 out of 90 pairs of articles daily.

[ In Proceedings of NLPRS'97, pp.657-660 (December, 1997). ]

INDEX

	1 Introduction
	2 Characteristics of news articles
	3 Proposed Method of aligning Japanese and English articles
	3.1 Extracting keywords from English Articles
	3.1.1 Extracting numerical keywords
	3.1.2 Extracting proper noun keywords
	3.2 Extracting keywords from Japanese articles
	3.2.1 Extracting numerical keywords
	3.3 Aligning articles
	3.3.1 Aligning using numerical keywords
	3.3.2 Aligning using proper noun keywords
	3.3.3 A method using proper noun and numerical keywords
	3.3.4 Summary
	4 Conclusion

	References

1 Introduction

Bilingual corpora are very useful in natural language processing. There are two difficulties in compiling them: first, it is hard to gather very large corpora; second, it is hard to align them.

In order to obtain a large volume of data, we use newspaper articles from an on-line source. This allows us to continually gather a very large amount of data.

Shirai et al. (1995a) showed that it is possible to align articles from the Nikkei Telecom/Japan News & Retrieval (English news) and NIKKEI TELECOM BIZ (Japanese news) on-line electronic information services provided by Nihon Keizai Shimbun, Inc. In terms of content, sentences basically corresponded between Japanese and English, although there was considerable difference in how the content was expressed. If partial correspondence is included, almost all the sentences in the English articles align with some Japanese sentence. In addition, in half of the aligned sentences the Japanese nominative and accusative cases correspond to the English subject and object.

To make a useful bilingual corpus we need to align articles and sentences between Japanese and English articles from the sources mentioned earlier. Various methods have been proposed for aligning sentences (Brown et al., 1991; Utsuro et al., 1994), most of which assume that the data is already reasonably well aligned. In this paper we look at the preliminary question of efficiently aligning articles.

We propose a method of aligning articles using proper nouns and numerical keywords, and a method of selecting those that we are confident are correctly aligned, and show its performance. In this paper, we consider accuracy to be extremely important, even at the cost of aligning fewer articles, because we wish to fully automatically align articles.

2 Characteristics of news articles

We use Japanese and English news articles from Nihon Keizai Shimbun, Inc. They are downloaded by modem from an on-line database. Japanese news articles are from Nihon Keizai Shimbun, Nikkei Industrial Daily, Nikkei Marketing Journal and Nikkei Financial Daily. English news articles are from the Nikkei Telecom/Japan News & Retrieval. English articles are provided with news flashes from four Japanese newspapers.

The number of articles in one week is shown in Table 1. There are about seven times as many Japanese articles as English ones, so it is most efficient to search for corresponding Japanese articles using the English articles as a key.

Table 1: The number of Japanese & English articles

Date	Japanese articles	Japanese with numeral	English articles	English with numeral
2	842	703	120	111
3	485	360	39	34
4	738	647	97	85
5	504	358	29	26
6	167	127	12	8
7	629	519	128	116
8	929	671	152	132
Total	4294	2781	577	512

There are many approaches that use bilingual dictionaries and statistics to align Japanese and English sentences. However, using translated keywords for each English article, it takes a lot of time to search for the one corresponding article among many hundreds of Japanese articles.

So, we use numerical keywords in articles. Numerical keywords are easy to extract and normally appear in both languages. In particular, as the four newspapers we are using as sources deal with financial topics, numerical keywords are numerous. We found that 80% of Japanese articles and 90% of English articles include one or more numerical keywords.

Shirai et al. (1995b) showed that we can align with an accuracy of 50% by statistical methods using the frequency of numerical keywords only, and 80% using the combined frequency of numerical keywords and proper nouns. But judgment of confidence is difficult. We propose a method that uses the number of different corresponding keywords in an article as the basis of judgment. Numbers that appear in both Japanese and English are part of the information conveyed by the article. Using them makes it easy to decide whether an article is correctly aligned or not. Moreover, by combining numerical keywords and a bilingual dictionary, we can identify even more aligned articles.

3 Proposed Method of aligning Japanese and English articles

In this section we give an explanation of our method of extracting keywords from articles and then aligning Japanese and English articles using these extracted keywords.

3.1 Extracting keywords from English Articles

3.1.1 Extracting numerical keywords

We list the numerical keywords from one day's English articles as follows.

First, extract all strings of numerals, including decimal points [0-9.], and all measure units. We have prepared a list of frequently occurring measure units, including some numerical words:
dollar, yen, %, trillion, billion, million
Convert combinations of numerals and numerical words to strings of numerals, so that numerical keywords are easy to match:
4.3 trillion dollar => 4300000000000 dollar
Then delete duplicated numerical values in each article, leaving a single instance of each, so as not to match the same keyword more than once. An article often has several numerical keywords, and each is usually used twice or more. (Here we analyze only the character strings, so we may not distinguish numerical expressions that have different meanings.)
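
The steps above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation used in the experiments: the name `extract_numeric_keywords`, the regular expression, and the tolerance of a plural "s" on "dollar" are all assumptions, and numerals with thousands separators (such as 1,200) are not handled.

```python
import re
from decimal import Decimal

# Numerical words taken from the list of measure units in the text.
MULTIPLIERS = {"trillion": 10**12, "billion": 10**9, "million": 10**6}

# A number, then an optional numerical word, then an optional unit.
# Accepting the plural "dollars" is an assumption of this sketch.
PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)"
    r"(?:\s*(trillion|billion|million))?"
    r"(?:\s*(dollars?|yen|%))?",
    re.IGNORECASE,
)


def extract_numeric_keywords(text):
    """Return the unique normalized numerical keywords of one article."""
    keywords = []
    for number, word, unit in PATTERN.findall(text):
        value = Decimal(number)
        if word:
            value *= MULTIPLIERS[word.lower()]
        # Write whole values as a plain string of numerals; keep written
        # decimals (e.g. "35.50") as they appear.
        if value == value.to_integral_value():
            digits = str(int(value))
        else:
            digits = format(value, "f")
        unit = unit.lower().rstrip("s")
        keyword = f"{digits} {unit}".strip()
        if keyword not in keywords:  # keep one instance of each keyword
            keywords.append(keyword)
    return keywords


print(extract_numeric_keywords(
    "Sales reached 4.3 trillion dollar, up 12% from 4.3 trillion dollars."
))
```

Using `Decimal` rather than floating point keeps the scaled value exact, which matters because the keywords are later compared as strings.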

3.1.2 Extracting proper noun keywords

Extract proper nouns from the headline and body of the news article. The nouns (including compound nouns) that satisfy the following conditions are considered as proper noun keywords.
- Words including capital letters:
  SL-enhanced Intel i486SX
- Words including capital letters connected by the possessive clitic ('s), "of" or "&":
  Japan Federation of Employers' Associations
  
  If the words following the possessive clitic ('s) or "of" do not include capital letters, then delete the clitic or "of" together with the words that follow:
  NTT's line => NTT, Bank of city => Bank
- Delete the definite article "The", because we suppose that "The U.S." and "the U.S." are equivalent:
  The U.S. => U.S.
Look for a Japanese equivalent of each extracted proper noun keyword in a bilingual dictionary. If there is a Japanese equivalent, add it to the list of proper noun keywords. If not, delete the word. If one Japanese word sequence includes another Japanese word sequence in the same article, delete the included one:

tokyô "Tokyo"

tokyô-ginkou "Bank of Tokyo"

Delete "Tokyo" from the proper noun keywords in this case.
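
The dictionary lookup and the deletion of included word sequences can be sketched as below. This is a minimal illustration, assuming the English proper nouns have already been extracted; the toy dictionary, its kanji entries, and the function names are hypothetical and are not taken from the dictionary used in the experiments.

```python
# Toy bilingual dictionary; the entries are illustrative assumptions.
TOY_DICTIONARY = {
    "Tokyo": "東京",
    "Bank of Tokyo": "東京銀行",
    "U.S.": "米国",
}


def strip_article(noun):
    """Remove a leading definite article: "The U.S." -> "U.S."."""
    for article in ("The ", "the "):
        if noun.startswith(article):
            return noun[len(article):]
    return noun


def proper_noun_keywords(english_nouns, dictionary):
    """Map extracted English proper nouns to Japanese keywords."""
    japanese = []
    for noun in english_nouns:
        equivalent = dictionary.get(strip_article(noun))
        # Nouns without a Japanese equivalent are deleted.
        if equivalent is not None and equivalent not in japanese:
            japanese.append(equivalent)
    # Delete any Japanese sequence that is included in another one.
    return [
        word for word in japanese
        if not any(word != other and word in other for other in japanese)
    ]


print(proper_noun_keywords(
    ["Tokyo", "Bank of Tokyo", "The U.S.", "Intel i486SX"], TOY_DICTIONARY
))
```

Here "Tokyo" is dropped because its equivalent is contained in that of "Bank of Tokyo", and "Intel i486SX" is dropped because the toy dictionary has no entry for it.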

3.2 Extracting keywords from Japanese articles

We try to align English articles with Japanese articles from the previous, same and next days, because Shirai et al. (1995a) showed that aligned articles come from the same day (30.6%), previous day (63.5%) and next day (5.9%).

So we extract all numerical keywords from three days of Japanese news articles. But it is hard to extract Japanese proper nouns automatically from Japanese news articles (there is no equivalent to English capitalization), so we only align proper nouns in the header and first paragraph of the Japanese news articles. Most important proper noun keywords appear in the lead sentences.
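
Selecting the three days of Japanese candidates for one English article might look like the following sketch. The article identifiers, the dates, and the function name `candidate_articles` are hypothetical; only the window of previous, same and next day comes from the text.

```python
from datetime import date, timedelta


def candidate_articles(english_date, japanese_articles):
    """Select Japanese articles dated within one day of the English article.

    `japanese_articles` maps a (hypothetical) article id to its date.
    """
    window = {english_date + timedelta(days=offset) for offset in (-1, 0, 1)}
    return sorted(
        article_id
        for article_id, day in japanese_articles.items()
        if day in window
    )


japanese = {
    "J1": date(1997, 12, 1),
    "J2": date(1997, 12, 2),
    "J3": date(1997, 12, 3),
    "J4": date(1997, 12, 4),
}
print(candidate_articles(date(1997, 12, 2), japanese))
```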

3.2.1 Extracting numerical keywords

We consider sequences of Japanese and English numerals, decimal points, and the following units to form numerical keywords: doru "dollar", en "yen", pâsento "%". We regularize the extracted sequences into strings of numerals followed by a unit, as we did for the English keywords:
san-ju-go en go-ju sen "35 yen 50 sen" => 35.50 yen
Delete duplicated numerical values in each article, leaving one of each.
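
As a rough illustration of this regularization, the sketch below converts kanji numerals to Arabic numerals and normalizes yen and sen amounts. It covers only the yen/sen case, and it writes sen as a two-digit decimal fraction of yen, which is an assumption about the intended normalized form; the function names and the regular expression are likewise assumptions.

```python
import re

DIGITS = {ch: value for value, ch in enumerate("〇一二三四五六七八九")}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
LARGE_UNITS = {"万": 10**4, "億": 10**8, "兆": 10**12}
NUMERAL = "[〇一二三四五六七八九十百千万億兆0-9]+"
# An amount in yen (円), optionally followed by an amount in sen (銭).
YEN_SEN = re.compile(f"({NUMERAL})円(?:({NUMERAL})銭)?")


def kanji_to_int(text):
    """Convert a string of kanji and/or Arabic numerals to an integer."""
    total = group = digit = 0
    for ch in text:
        if ch in DIGITS:
            digit = digit * 10 + DIGITS[ch]
        elif ch in SMALL_UNITS:
            group += (digit or 1) * SMALL_UNITS[ch]
            digit = 0
        elif ch in LARGE_UNITS:
            total += (group + digit) * LARGE_UNITS[ch]
            group = digit = 0
        else:
            digit = digit * 10 + int(ch)  # Arabic digit
    return total + group + digit


def normalize_yen(text):
    """Return the unique normalized yen keywords found in one article."""
    keywords = []
    for yen, sen in YEN_SEN.findall(text):
        keyword = str(kanji_to_int(yen))
        if sen:
            keyword += f".{kanji_to_int(sen):02d}"
        keyword += " yen"
        if keyword not in keywords:  # keep one instance of each keyword
            keywords.append(keyword)
    return keywords


print(normalize_yen("終値は三十五円五十銭、前日も三十五円五十銭だった。"))
```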

3.3 Aligning articles

We experimented with various combinations of alignment using numerical keywords or proper noun keywords.

3.3.1 Aligning using numerical keywords

[EXPERIMENT 1]

For our first attempt, we examined the numerical keyword list for each day, in both Japanese and English, and deleted any keyword that appeared in more than one article. We then aligned all articles with matching keywords. If there were two articles in the same language with the same keyword (from different days) we did not align them. We call this method 1-to-1 alignment.
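
A minimal sketch of this 1-to-1 alignment is given below. It assumes each article is represented as a set of normalized keywords and pools all candidate days together; the article identifiers, keyword strings, and function names are hypothetical.

```python
from collections import Counter


def unique_keyword_owner(articles):
    """Map each keyword found in exactly one article to that article's id."""
    counts = Counter(k for keywords in articles.values() for k in keywords)
    return {
        k: article_id
        for article_id, keywords in articles.items()
        for k in keywords
        if counts[k] == 1
    }


def one_to_one_align(english, japanese):
    """Pair articles that share a keyword unique within each language."""
    english_owner = unique_keyword_owner(english)
    japanese_owner = unique_keyword_owner(japanese)
    pairs = {}
    for keyword in english_owner.keys() & japanese_owner.keys():
        pair = (english_owner[keyword], japanese_owner[keyword])
        pairs.setdefault(pair, set()).add(keyword)
    return pairs


english = {
    "E1": {"4300000000000 dollar", "12 %"},
    "E2": {"12 %", "35 yen"},
}
japanese = {
    "J1": {"4300000000000 dollar", "7 %"},
    "J2": {"35 yen"},
    "J3": {"7 %"},
}
print(one_to_one_align(english, japanese))
```

In this toy data "12 %" and "7 %" are discarded because each occurs in two articles of the same language, so only the two remaining keywords produce alignments.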

As a result, the average accuracy of alignment is about 70% with one matching keyword, but 100% with two or more matching keywords. This shows that a single rare numerical keyword is not a good marker. In addition, the frequency of a numerical value is influenced by the number of articles on a given day, so judging from the frequency is very difficult.

Therefore, we think that aligning using the number of different matching keywords may be better than using the frequency of aligned keywords. We tested this in experiment 2.

[EXPERIMENT 2]

In this experiment we judged to be aligned any pair that had more matching keywords than any other pair. If there are multiple candidates with the same number of matches, we align the English article with none of them. We call this method best match.
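
The best match rule can be sketched as follows, assuming each article is a set of normalized keywords. The function name, the article identifiers, and the decision to return `None` when the best candidate has no matching keyword at all are assumptions of this sketch.

```python
def best_match(english_keywords, japanese):
    """Return the id of the Japanese candidate with the most matching
    keywords, or None if there is a tie or no match at all."""
    counts = {
        article_id: len(english_keywords & keywords)
        for article_id, keywords in japanese.items()
    }
    if not counts:
        return None
    top = max(counts.values())
    winners = [article_id for article_id, n in counts.items() if n == top]
    if top == 0 or len(winners) > 1:
        return None  # tied candidates: align with none of them
    return winners[0]


japanese = {
    "J1": {"4300000000000 dollar", "12 %", "35 yen"},
    "J2": {"12 %", "35 yen"},
    "J3": {"7 %"},
}
print(best_match({"4300000000000 dollar", "12 %"}, japanese))
print(best_match({"12 %", "35 yen"}, japanese))
```

The first call selects J1 (two matches against one), while the second returns `None` because J1 and J2 both match two keywords.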

Using this method, we were able to align 265 articles, 254 of them correctly, an accuracy of 95.8%. Most of the errors occurred when there were only a few matching keywords, but there were errors even with as many as five matches.

In order to improve the accuracy, we next consider the difference in the number of matching keywords between the top and second ranked candidates in the next experiment.

[EXPERIMENT 3]

In this experiment we align only those articles where the top candidate has at least two more keyword matches than any other candidate. We call this method superlative match.

Using this measure, all aligned articles are correctly aligned. However, we now only align 176 articles out of 577 articles (30.5%) in 7 days. An upper bound for this method is the number of articles with three or more numerical keywords, in this case, 373 (64.6%).
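
A sketch of the superlative match rule is shown below, again assuming articles are sets of normalized keywords. The function name and the `margin` parameter are assumptions; `margin=2` corresponds to the lead of two or more matches described above.

```python
def superlative_match(english_keywords, japanese, margin=2):
    """Return the top Japanese candidate only if it leads the runner-up
    by at least `margin` matching keywords; otherwise return None."""
    ranked = sorted(
        ((len(english_keywords & keywords), article_id)
         for article_id, keywords in japanese.items()),
        reverse=True,
    )
    if not ranked:
        return None
    top_count, top_id = ranked[0]
    runner_up = ranked[1][0] if len(ranked) > 1 else 0
    if top_count - runner_up >= margin:
        return top_id
    return None


japanese = {
    "J1": {"4300000000000 dollar", "12 %", "35 yen"},
    "J2": {"12 %"},
    "J3": {"7 %"},
}
print(superlative_match({"4300000000000 dollar", "12 %", "35 yen"}, japanese))
print(superlative_match({"4300000000000 dollar", "12 %"}, japanese))
```

The first call returns J1 (three matches against one), while the second returns `None` because J1 leads J2 by only one match.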

3.3.2 Aligning using proper noun keywords

Next, we attempt to align using the proper noun keywords extracted in 3.2. Takahashi et al. (1996) used a typical machine-translation dictionary of company and place names (1345 items). However, people's names and product names are also good keywords, and companies are often referred to by abbreviated forms of their names. So we built a translation dictionary by hand from the results of the numerical alignment. We intend to automate this step later on, possibly using the automatic transliteration method outlined by Matsuo and Shirai (1996).

[EXPERIMENT 1]

We attempted to align articles using the English proper noun keyword list, the Japanese lead sentence list and our handmade dictionary. We judged to be aligned any pair that had more matching keywords than any other pair. If there are multiple candidates with the same number of matches, we align the English article with none of them.

As a result, we align 134 articles, 122 of them correctly (91.0%). Because the bilingual dictionary is imperfect, both the number of articles aligned and the accuracy are lower than with the numerical method. In particular, when only two keywords are aligned, the accuracy is as low as 88.7%. Some company names and place names are very frequent, so they do not give a good measure of alignment.

[EXPERIMENT 2]

In our second experiment, we align only those articles where the top candidate has at least two more keyword matches than any other candidate.

In this case 50 articles are aligned, 49 of them correctly, an accuracy of 98.0%; there is only one false positive. However, the number of extracted article pairs decreases to about 37.3% of that in the first experiment. Using only proper noun keywords, we do not get very good data.

3.3.3 A method using proper noun and numerical keywords

Finally, we align only those articles where the top candidate has at least two more keyword matches than any other candidate, using both proper noun and numerical keywords.

As a result, the number of aligned articles increases by 73 (41.5%) to 249, compared with using only numerical keywords. So using proper nouns is worthwhile. An upper bound for this method is the number of articles with three or more numerical keywords and proper nouns, in this case 492 (85.3%).

In an additional experiment, we added to the translation dictionary new words found in the results of using both proper noun and numerical keywords.

The translation dictionary increased by 11 items, but only one more article could be aligned. Remaking the translation dictionary is not very useful, because financial news articles do not deal with companies or events so much.

3.3.4 Summary

We summarize the results of the various methods in Table 2.

Table 2: Comparison of the various methods.

Keywords	Method of alignment	Aligned articles	Correctly aligned	Accuracy
No	Statistical approach	577	298	51.6%
No	1-to-1 alignment	174	150	86.2%
No	Best match	265	254	95.8%
No	Superlative match	176	176	100%
PN	Best match	134	122	91.0%
PN	Superlative match	50	49	98.0%
N+P	Statistical approach	577	463	80.2%
N+P	Superlative match	249	249	100%
All English articles		577

No: Using only numerical keywords.
PN: Using only proper noun keywords.
N+P: Using both proper noun and numerical keywords.

The statistical method produces the most correct alignments, but with a lot of false hits. For an application that requires high accuracy we recommend the superlative match, using both numerical and proper noun keywords.
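
The percentages in Table 2 can be rechecked from the counts with a few lines of Python. The counts below are copied from the table; the helper names `accuracy` and `coverage` (the share of all 577 English articles that were aligned) are introduced here only for illustration.

```python
# (keywords, method, aligned articles, correctly aligned), from Table 2.
TABLE_2 = [
    ("No", "Statistical approach", 577, 298),
    ("No", "1-to-1 alignment", 174, 150),
    ("No", "Best match", 265, 254),
    ("No", "Superlative match", 176, 176),
    ("PN", "Best match", 134, 122),
    ("PN", "Superlative match", 50, 49),
    ("N+P", "Statistical approach", 577, 463),
    ("N+P", "Superlative match", 249, 249),
]
TOTAL_ENGLISH = 577


def accuracy(aligned, correct):
    """Percentage of aligned articles that are correctly aligned."""
    return round(100 * correct / aligned, 1)


def coverage(aligned):
    """Percentage of all English articles that were aligned."""
    return round(100 * aligned / TOTAL_ENGLISH, 1)


for keywords, method, aligned, correct in TABLE_2:
    print(f"{keywords:4}{method:22}"
          f"accuracy {accuracy(aligned, correct):5.1f}%  "
          f"coverage {coverage(aligned):5.1f}%")
```

Printing accuracy and coverage side by side makes the trade-off visible: the superlative match reaches 100% accuracy while aligning fewer than half of the English articles.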

4 Conclusion

In this report, we propose and evaluate a method of automatically aligning Japanese and English articles using both numerical keywords and proper noun keywords. The merit of this method is that we can compile a large bilingual corpus automatically using simple analysis. As a result, we align 249 articles out of 577 (43.2%) in 7 days, with all aligned articles correctly aligned.

In the future, we intend to align sentences within these aligned articles; to use them as examples in a translation database (Takahashi et al., 1997); to make translation dictionaries; to compile an SGML-tagged bitext corpus (Bond et al., 1996); and so on.

References

S. Shirai, S. Fujinami, S. Ikehara, H. Ueda and H. Inoue. 1995a. Constructing an aligned Japanese/English corpus of newspaper articles (1) -basic structure and discussion-. In Record of the 1995 Joint Conference of Electrical and Electronics Engineers in Kyushu, p.855. (in Japanese).
P. F. Brown, J. C. Lai and R. L. Mercer. 1991. Aligning sentences in parallel corpora. In Proc. of the 29th Annual Meeting of the ACL, pp.169-176.
T. Utsuro, H. Ikeda, M. Yamane, Y. Matsumoto and M. Nagao. 1994. Bilingual text matching using bilingual dictionary and statistics. In 15th COLING, pp.1076-1082.
S. Shirai, H. Ueda, S. Abe, S. Fujinami and S. Ikehara. 1995b. Constructing an aligned Japanese/English corpus of newspaper articles (2) -aligning articles taken from a database-. In Record of the 1995 Joint Conference of Electrical and Electronics Engineers in Kyushu, p.856. (in Japanese).
Y. Takahashi, S. Shirai, S. Fujinami, S. Ikehara, H. Ueda and H. Matsusima. 1996. Automatically Aligning Japanese & English Newspaper Articles. In IEICE Technical Report, NLC96-17, pp.55-62. (in Japanese).
Y. Matsuo and S. Shirai. 1996. Using Pronunciation to Automatically Extract Bilingual Word Pairs. In IPSJ SIG Notes, 96-NL-116, pp.101-106. (in Japanese).
Y. Takahashi, S. Shirai, M. Tachibana, M. Nisigaki and S. Ikehara. 1997. Outline of an Example Based Method for Japanese-to-English Machine Translation. In Proc. of The 3rd Annual Meeting of The Association for Natural Language Processing, A2-1, pp.145-148. (in Japanese).
F. Bond, Y. Takahashi, S. Yamada and M. Nisigaki. 1996. Still tagging an aligned Japanese/English corpus. In Proc. of The 2nd Annual Meeting of The Association for Natural Language Processing, pp.205-208.
