Bilingual corpora are very useful in natural language processing. Unfortunately, they are difficult to compile. We have developed a method that uses numerical values and proper nouns as keywords to automatically align Japanese and English newspaper articles in order to build a corpus. In addition, we have developed a way to evaluate the results. On average, we automatically and correctly align 38 out of 90 pairs of articles daily.
Bilingual corpora are very useful in natural language processing. There are two difficulties in compiling them: first, it is hard to gather very large corpora; second, it is hard to align them.
In order to obtain a large volume of data, we use newspaper articles from an on-line source. This allows us to continually gather a very large amount of data.
Shirai et al. (1995a) showed that it is possible to align articles from the Nikkei Telecom/Japan News & Retrieval (English news) and NIKKEI TELECOM BIZ (Japanese news) on-line electronic information services provided by Nihon Keizai Shimbun, Inc. In terms of content, sentences basically corresponded between Japanese and English, although there was considerable difference in how the content was expressed. If partial correspondence is included, almost all the sentences in the English articles align with some Japanese sentence. In addition, in half of the aligned sentences the Japanese nominative and accusative cases correspond to the English subject and object.
To make a useful bilingual corpus we need to align articles and sentences between Japanese and English articles from the sources mentioned earlier. Various methods have been proposed for aligning sentences (Brown et al., 1991; Utsuro et al., 1994), most of which assume that the data is already reasonably well aligned. In this paper we look at the preliminary question of efficiently aligning articles.
We propose a method of aligning articles using proper nouns and numerical keywords, and a method of selecting those that we are confident are correctly aligned, and show its performance. In this paper, we consider accuracy to be extremely important, even at the cost of aligning fewer articles, because we wish to fully automatically align articles.
We use Japanese and English news articles from Nihon Keizai Shimbun, Inc. They are downloaded by modem from an on-line database. Japanese news articles are from Nihon Keizai Shimbun, Nikkei Industrial Daily, Nikkei Marketing Journal and Nikkei Financial Daily. English news articles are from the Nikkei Telecom/Japan News & Retrieval. English articles are provided with news flashes from four Japanese newspapers.
The number of articles in one week is shown in Table 1. There are seven times as many Japanese articles as there are English ones, so it is most efficient to search for corresponding Japanese articles using the English articles as a key.
Date | Japanese articles | Japanese with numeral | English articles | English with numeral |
2 | 842 | 703 | 120 | 111 |
3 | 485 | 360 | 39 | 34 |
4 | 738 | 647 | 97 | 85 |
5 | 504 | 358 | 29 | 26 |
6 | 167 | 127 | 12 | 8 |
7 | 629 | 519 | 128 | 116 |
8 | 929 | 671 | 152 | 132 |
Total | 4294 | 2781 | 577 | 512 |
There are many approaches that use bilingual dictionaries and statistics to align Japanese and English sentences. However, it takes a lot of time to search for the one corresponding article among many hundreds of Japanese articles using translated keywords for each English article.
So, we use numerical keywords in articles. Numerical keywords are easy to extract and normally appear in both languages. In particular, as the four newspapers we are using as sources deal with financial topics, numerical keywords are numerous. We found that 80% of Japanese articles and 90% of English articles include one or more numerical keywords.
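A minimal sketch of how numerical keywords might be extracted and compared across the two languages is given here. The regular expression, the function name `extract_numerical_keywords`, and both example sentences are illustrative assumptions, not the extraction rules used in this paper; in particular, the sketch ignores numbers written in kanji and unit differences such as 億.

```python
import re

# Hypothetical pattern: digits with optional thousands separators and decimals.
NUMBER_RE = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")

def extract_numerical_keywords(text):
    """Return the set of distinct numbers in text, with commas removed."""
    return {m.group().replace(",", "") for m in NUMBER_RE.finditer(text)}

# Invented example sentences, for illustration only.
english = "The company will hire 1,200 workers in 1996, up 12.5% from the previous year."
japanese = "同社は1996年に前年比12.5%増の1,200人を採用する。"

# Numbers that occur in both articles act as language-independent keywords.
shared = extract_numerical_keywords(english) & extract_numerical_keywords(japanese)
```

Because only the digits are compared, the same pattern can be applied to both languages without a dictionary.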
Shirai et al. (1995b) showed that we can align with an accuracy of 50% by statistical methods using the frequency of numerical keywords only, and 80% using the combined frequency of numerical keywords and proper nouns. However, judging confidence is difficult. We propose a method that uses the number of different corresponding keywords in an article as the basis of judgment. Numbers that appear in both Japanese and English are part of the information conveyed by the article. Using them makes it easy to decide whether an article is correctly aligned or not. Moreover, by combining numerical keywords and a bilingual dictionary, we can identify even more aligned articles.
In this section we give an explanation of our method of extracting keywords from articles and then aligning Japanese and English articles using these extracted keywords.
Numerical keywords are listed as follows from one day's English articles.
tokyô | "Tokyo" |
tokyô-ginkou | "Bank of Tokyo" |
We try to align English articles with Japanese articles from the previous, same and next days, because Shirai et al. (1995a) showed that aligned articles come from the same day (30.6%), previous day (63.5%) and next day (5.9%).
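The three-day search window can be sketched as follows. The function name `candidate_japanese_articles` and the dictionary layout (publication date mapped to a list of article ids) are assumptions made for illustration only.

```python
from datetime import date, timedelta

def candidate_japanese_articles(english_date, japanese_by_date):
    """Collect Japanese articles from the previous, same and next day."""
    candidates = []
    for offset in (-1, 0, 1):
        day = english_date + timedelta(days=offset)
        candidates.extend(japanese_by_date.get(day, []))
    return candidates

# Hypothetical article ids grouped by publication date.
japanese_by_date = {
    date(1996, 1, 1): ["J1"],
    date(1996, 1, 2): ["J2", "J3"],
    date(1996, 1, 3): ["J4"],
    date(1996, 1, 4): ["J5"],
}
window = candidate_japanese_articles(date(1996, 1, 2), japanese_by_date)
```

Restricting the candidates this way keeps the number of comparisons per English article small while still covering all three cases reported by Shirai et al. (1995a).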
So we extract all numerical keywords from three days of Japanese news articles. However, it is hard to extract Japanese proper nouns automatically from Japanese news articles (there is no equivalent to English capitalization), so we only align proper nouns in the header and first paragraph of the Japanese news articles. Most important proper noun keywords appear in the lead sentences.
We experimented with various combinations of alignment using numerical keywords or proper noun keywords.
[EXPERIMENT 1]
For our first attempt, we examined the numerical keyword list for each day, in both Japanese and English, and deleted any keyword that appeared in more than one article. We then aligned all articles with matching keywords. If there were two articles in the same language with the same keyword (from different days) we did not align them. We call this method 1-to-1 alignment.
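One possible reading of the 1-to-1 procedure is sketched here. The data layout (article id mapped to a pair of day and keyword set) and the function names are assumptions, and the handling of duplicates across days follows our interpretation of the description rather than a confirmed implementation.

```python
from collections import Counter, defaultdict

def unique_keywords_per_day(articles):
    """Drop every keyword that occurs in more than one article on the same day."""
    counts = Counter()
    for day, keywords in articles.values():
        for kw in set(keywords):
            counts[(day, kw)] += 1
    return {
        art_id: (day, {kw for kw in keywords if counts[(day, kw)] == 1})
        for art_id, (day, keywords) in articles.items()
    }

def one_to_one_align(english, japanese):
    """Return {(english_id, japanese_id): number of matching keywords}."""
    owners = {}
    for lang, articles in (("en", english), ("ja", japanese)):
        index = defaultdict(list)
        for art_id, (_, keywords) in unique_keywords_per_day(articles).items():
            for kw in keywords:
                index[kw].append(art_id)
        owners[lang] = index
    pairs = Counter()
    for kw, en_ids in owners["en"].items():
        ja_ids = owners["ja"].get(kw, [])
        # Skip keywords still shared by several articles of one language
        # (these come from different days).
        if len(en_ids) == 1 and len(ja_ids) == 1:
            pairs[(en_ids[0], ja_ids[0])] += 1
    return dict(pairs)

# Hypothetical data: day numbers and keywords are invented.
english = {"E1": (2, {"1200", "12.5"}), "E2": (2, {"1200", "77"})}
japanese = {"J1": (1, {"12.5", "1200"}), "J2": (2, {"77"}), "J3": (3, {"77"})}
pairs = one_to_one_align(english, japanese)
```

In this invented data, "1200" is discarded because two English articles from the same day share it, and "77" is discarded because two Japanese articles from different days share it, leaving a single pair with one matching keyword.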
As a result, the average alignment accuracy was about 70% with one matching keyword, but 100% with two or more matching keywords. This shows that a single rare numerical keyword is not a reliable marker. Moreover, the frequency of a numerical value is influenced by the number of articles on a given day, so judging from frequency is very difficult.
Therefore, we think that aligning using the number of different matching keywords may be better than using the frequency of aligned keywords. We tested this in experiment 2.
[EXPERIMENT 2]
In this experiment we judged a pair to be aligned if it had more matching keywords than any other pair. If there are multiple candidates with the same number of matches, we align the English article with none of them. We call this method best match.
Using this method, we were able to align 265 articles, 254 of them correctly, an accuracy of 95.8%. Most of the errors occurred when there were only a few matching keywords, but there were errors even with as many as five matches.
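The best match rule can be sketched as follows, assuming keywords have already been extracted into sets. The function name `best_match` and the decision to reject candidates with zero matches are assumptions for illustration.

```python
def best_match(english_keywords, japanese_candidates):
    """Return the id of the candidate sharing the most keywords.

    Returns None when no candidate shares a keyword or when the top score is tied.
    """
    scores = {jid: len(english_keywords & kws) for jid, kws in japanese_candidates.items()}
    ranked = sorted(scores.values(), reverse=True)
    if not ranked or ranked[0] == 0:
        return None
    if len(ranked) > 1 and ranked[1] == ranked[0]:
        return None  # tie between candidates: align with none of them
    return max(scores, key=scores.get)

# Hypothetical keyword sets.
english_keywords = {"1200", "1996"}
clear = best_match(english_keywords, {"J1": {"1200", "1996", "5"}, "J2": {"1996"}})
tied = best_match(english_keywords, {"J1": {"1200"}, "J2": {"1996"}})
```

Here `clear` selects "J1", while `tied` is rejected because both candidates share exactly one keyword.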
In order to improve the accuracy, we next consider the difference in the number of matching keywords between the top and second ranked candidates in the next experiment.
[EXPERIMENT 3]
In this experiment we align only those articles where the top candidate has at least two more keyword matches than any other candidate. We call this method superlative match.
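A minimal sketch of the superlative match follows. The function name, the `margin` parameter, and the choice to treat a missing second candidate as having zero matches are assumptions; the source does not specify how a lone candidate is handled.

```python
def superlative_match(english_keywords, japanese_candidates, margin=2):
    """Return the top candidate only if it leads the runner-up by at least `margin` matches."""
    scores = sorted(
        ((len(english_keywords & kws), jid) for jid, kws in japanese_candidates.items()),
        reverse=True,
    )
    if not scores:
        return None
    top_score, top_id = scores[0]
    # Assumption: with a single candidate, the runner-up score is taken to be zero.
    runner_up = scores[1][0] if len(scores) > 1 else 0
    return top_id if top_score - runner_up >= margin else None

# Hypothetical keyword sets.
english_keywords = {"1200", "1996", "12.5"}
accepted = superlative_match(english_keywords, {"J1": {"1200", "1996", "12.5"}, "J2": {"1996"}})
rejected = superlative_match(english_keywords, {"J1": {"1200", "1996", "12.5"}, "J2": {"1996", "12.5"}})
```

The first call accepts "J1" (three matches against one), while the second is rejected because the lead over the runner-up is only one keyword.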
Using this measure, all aligned articles are correctly aligned. However, we now only align 176 articles out of 577 articles (30.5%) in 7 days. An upper bound for this method is the number of articles with three or more numerical keywords, in this case, 373 (64.6%).
Next, we attempt to align using the proper noun keywords extracted in 3.2. Takahashi et al. (1996) used a typical machine-translation dictionary of company and place names (1345 items). However, people's names and product names are also good keywords, and companies are often referred to by abbreviated forms of their names. We therefore built a translation dictionary by hand from the results of the numerical alignment. We intend to automate this step later on, possibly using the automatic transliteration method outlined by Matsuo et al. (1996).
[EXPERIMENT 1]
We attempted to align articles using the English proper noun keyword list, the Japanese lead sentence list and our handmade dictionary. We judged a pair to be aligned if it had more matching keywords than any other pair. If there are multiple candidates with the same number of matches, we align the English article with none of them.
As a result, we align 134 articles, 122 of them correctly (91.0%). Because the bilingual dictionary is imperfect, the number of articles aligned and the accuracy are both lower than with the numerical method. In particular, when only two keywords are aligned, the accuracy is as low as 88.7%. Some company names and place names are very frequent, so they do not give a good measure of alignment.
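Counting proper noun matches against the Japanese lead sentences might look like the sketch below. The two dictionary entries, the example sentence, and the use of plain substring search are assumptions for illustration; the actual dictionary format and matching procedure are not described in this paper.

```python
# Hypothetical dictionary entries: English proper noun -> Japanese form.
DICTIONARY = {
    "Bank of Tokyo": "東京銀行",
    "Tokyo": "東京",
}

def proper_noun_matches(english_proper_nouns, japanese_lead):
    """Count English proper nouns whose translation occurs in the Japanese lead text."""
    return sum(
        1
        for noun in set(english_proper_nouns)
        if noun in DICTIONARY and DICTIONARY[noun] in japanese_lead
    )

# Invented lead sentence, for illustration only.
lead = "東京銀行は新しい支店を開設した。"
count = proper_noun_matches(["Bank of Tokyo", "Osaka"], lead)
```

Note that substring search also finds 東京 inside 東京銀行, so a frequent place name can inflate the count; this is one way in which very common names become poor indicators of alignment.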
[EXPERIMENT 2]
In our second experiment, we align only those articles where the top candidate has at least two more keyword matches than any other candidate.
In this case, 50 articles are aligned, 49 of them correctly, an accuracy of 98.0%; there is only one false positive. The number of extracted article pairs decreases to about 37.3% of the number aligned in the previous experiment. Using only proper noun keywords, we do not obtain very good data.
Finally, we align only those articles where the top candidate has at least two more keyword matches than any other candidate, using both proper noun and numerical keywords.
As a result, the number of aligned articles increases by 73 (41.5%), from 176 using only numerical keywords to 249. So using proper nouns is worthwhile. An upper bound for this method is the number of articles with three or more numerical keywords and proper nouns, in this case, 492 (85.3%).
In an additional experiment, we added to the translation dictionary new words found from the result of using both proper noun and numerical keywords.
The translation dictionary increases by 11 items, but only one more article could be aligned. Rebuilding the translation dictionary is not very useful, because financial news articles do not deal much with companies or events.
We summarize the results of the various methods in Table 2.
No: Using only numerical keywords.
PN: Using only proper noun keywords.
N+P: Using both proper noun and numerical keywords.
The statistical method produces the most correct alignments, but with a lot of false hits. For an application that requires high accuracy we recommend the superlative match, using both numerical and proper noun keywords.
In this report, we propose and evaluate a method of automatically aligning Japanese and English articles using both numerical keywords and proper noun keywords. The merit of this method is that we can compile a large bilingual corpus automatically using only simple analysis. As a result, we align 249 articles out of 577 (43.2%) in 7 days, with all aligned articles correctly aligned.
In the future, we intend to align sentences within these aligned articles; to use them as examples in a translation database (Takahashi et al., 1997); to make translation dictionaries; to compile an SGML-tagged bitext corpus (Bond et al., 1996); and so on.