Chengqing Zong, Yujie Zhang, Kazuhide Yamamoto, Masashi Sakamoto & Satoshi Shirai, NLPRS-2001, November 27-29, 2001

Approach to Spoken Chinese Paraphrasing Based on Feature Extraction

Chengqing Zong^+*, Yujie Zhang^*, Kazuhide Yamamoto^*, Masashi Sakamoto^* and Satoshi Shirai^*

^+ National Laboratory of Pattern Recognition, Institute of Automation, CAS
P. O. Box 2728, Beijing 100080, China
cqzong@nlpr.ia.ac.cn

^* ATR Spoken Language Translation Research Laboratories
2-2-2 Hikaridai Seika-cho, Soraku-gun, Kyoto 619-0288, Japan
{yujie.zhang, masashi.sakamoto, satoshi.shirai}@atr.co.jp

Abstract

This paper presents an approach to spoken Chinese language paraphrasing based on feature extraction and techniques of language generation. In this approach, an input utterance is first analyzed in terms of phrase structure, dependency of chunks, etc., by using multiple methods. Then, the main features of the input utterance are extracted, and the extraction results are represented by a frame. Finally, other possible expressions of the input are generated based on the analysis results by different methods. Preliminary results are shown in the paper.

1 Introduction

Although many approaches have been proposed to cope with spoken language phenomena and many strategies for translation have been developed, spoken language translation (SLT) systems still suffer from performance limitations. One of the key problems involves deciding how to robustly parse the input utterances. If we examine the techniques employed by human interpreters, we can see that paraphrasing is unavoidable at times. When an interpreter is unable to directly translate an utterance due to an ill-formed expression or an even worse problem, he or she may have to paraphrase the utterance into other expressions in his/her mind before translating the utterance.

Motivated by this observation, Yamamoto et al. (2001) proposed the Sandglass SLT paradigm. This paradigm separates the complicated parsing procedure from the translation module and expresses the meaning of an input utterance in the source language itself. Accordingly, a simple transfer is enough to convert the paraphrased input utterance into the target language. The main goal of the paraphrasing module is to make a correct translation easier to obtain, especially for complicated utterances.

In this paper, we present an approach to Chinese utterance paraphrasing based on feature extraction. In Section 2, related work on paraphrasing is briefly reviewed, and the problems and our countermeasures are introduced. In Section 3, the implementation of an experimental system is described in detail. In Section 4, the experimental results are shown. Finally, Section 5 gives concluding remarks.

2 Problems in Paraphrasing and Our Countermeasures

2.1 Related Works on Paraphrasing

Chandrasekar et al. (1996) presented ways to simplify long and complicated sentences by a Finite State Grammar (FSG) based approach and the Supertagging (DSM) model. In their method, punctuation marks and relative pronouns are necessary to define a set of rules that map the given sentence patterns to simpler sentence patterns. Unfortunately, in SLT systems there are no punctuation marks to use, because all of the sentences that the paraphraser processes come from the system's speech recognizer, which does not generally provide punctuation marks. Furthermore, the Chinese language does not use relative pronouns to indicate articulation points.

Dras (1997) introduced several methods to represent paraphrases by using synchronous TAGs. These methods, however, closely depend on the synchronous TAGs and require a fairly well parsed syntactic structure of the input. But in SLT systems, it is usually very difficult to parse an utterance into an adequate syntactic structure, especially when the input contains noisy words.

Boguslavsky et al. (2000) introduced synonymous paraphrasing of sentences, but did not address structure rewriting. McKeown (1983) described a paraphraser for a natural language question-answering system (CO-OP), but the system was purely syntax-based. In addition, the method was found to have limitations in spoken language paraphrasing. Sato (1999) and Kondo et al. (2001) described methods to paraphrase the titles of Japanese technical papers and simple Japanese sentences, respectively.

All of the research mentioned above has provided us with very useful cues for paraphrasing spoken Chinese. Unfortunately, before we started our work, there was no reported work that specifically addressed Chinese paraphrasing.

2.2 Problems and Countermeasures

In a paraphrasing system (see Fig. 1), the input and output should comply with the following policy:

The same semantics: the output should have (almost) the same meanings as the input.

Simplification: in general, the output should have simple and well-formed expressions, especially when the input is ill-formed.

As in the development of a machine translation system, several candidate approaches can be used. The pattern-based approach is one choice. This method is easy to implement and fast, but its generality is often limited: if the types of input utterances vary greatly and there are many unseen types of utterances, the performance of a system typically degrades. The statistical approach is another choice. In this method, paraphrasing is treated as statistical translation, the only difference being that it is carried out within one language rather than between two different languages. Unfortunately, the statistical method needs very large-scale tagged corpora. Moreover, it is not practical to employ a costly statistics-based paraphraser to rewrite input in a real-time SLT system.

Based on the analysis above, we proposed an approach to paraphrasing Chinese utterances based on feature extraction. The main ideas of the approach may be described as follows:

	1)	Segment the complex utterances into simple parts. Each part is separately paraphrased by using the following steps.
	2)	Jointly parse each separated part by multiple analyzers including a phrase parser, chunk dependency analyzer, and special chunk recognizer.
	3)	Extract the main features of the part under analysis, including the expression type (interrogative or declarative expressions, etc.), syntactic features, semantic features and so on.
	4)	Generate other possible expressions of each part by using different methods.

2.3 Why We Use the Approach

Our approach based on feature extraction is mainly grounded on the following points:

a) Almost all analysis results in the approach are useful and beneficial for the subsequent translation module in an SLT system. For example, the expression type, syntactic structures, and the relations among the chunks, are all necessary pieces of information for the translation module. This means that the paraphraser could not only provide alternative expressions of an input utterance to the translation module but also reduce much of the translation module's analysis work.

b) The possible expressions are generated under the guidance of analysis results. To a certain extent, the expressions may be generated with high correctness and well-formed structure.

c) The approach is not limited by any conditions. That is, the input utterances may be any possible expressions. This ensures that the approach is capable of processing the spoken Chinese language.

d) The expressions are generated by different methods. The input is not only paraphrased based on the parsing results but also based on the other features. This helps the input to be paraphrased correctly even if the input is parsed incorrectly.

3 System Implementation

Based on the ideas presented above, we have implemented an experimental system for paraphrasing Chinese utterances in the domain of hotel reservation. This section describes the key points of the implementation of the experimental system in detail.

3.1 Analyzer

In our approach, the analyzer includes six modules: (1) time chunk recognizer implemented by an FST (Finite State Transducer); (2) phrase parser employing PCFG (Probabilistic Context Free Grammar) rules; (3) chunk dependency analyzer achieved by an FST; (4) utterance type analyzer based on recognition rules; (5) keyword spotting analyzer achieved by an FST; and (6) tense analyzer also achieved by an FST. In this sub-section, we describe (1), (2) and (3).

In the spoken Chinese language used in the hotel reservation domain, time phrases and quantifiers appear very frequently. According to our statistical results of 64,800 utterances in the domain of hotel reservation, 21.98% of utterances contain quantifiers or time phrases. Furthermore, the time phrases and quantifiers may act as different constituents in different contexts, such as an adverbial adjunct, object, or predicate. Accordingly, the time phrases and quantifiers of each input are recognized first by our analyzer before parsing.

The time phrase here mainly refers to a number-related time expression, such as '3:30 in the afternoon' or 'before July 16'. Other time expressions are recognized by the phrase parser.

The number-related time expression is recognized by an FST(time), which accepts only the following three types of words: (a) temporal noun (NT); (b) cardinal number (CD); and (c) classifier (Mt). The set of these three types of words is denoted $W_a$. When a word $\mathrm{Word}_i \in W_a$ appears in the utterance under analysis, the FST(time) starts to work. When a word $\mathrm{Word}_i \notin W_a$ is encountered, the FST(time) stops and the time chunk is marked.
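The start and stop conditions of FST(time) can be illustrated with a minimal two-state scan. This is only a sketch under stated assumptions, not the authors' transducer: it marks every maximal run of words tagged NT, CD or Mt, whereas the real FST presumably also constrains the order of the words. The function names, the tag `PN`/`VV` labels and the English glosses in the example are hypothetical. The second function shows how n marked chunks split the input into n+1 parts for the phrase parser.

```python
from typing import List, Tuple

# Tags accepted by the recognizer: temporal noun, cardinal number, classifier.
W_A = {"NT", "CD", "Mt"}

Tagged = List[Tuple[str, str]]  # (word, tag) pairs


def mark_time_chunks(tagged: Tagged) -> List[Tuple[int, int]]:
    """Return (start, end) spans, end exclusive, of maximal runs of W_A words."""
    chunks = []
    start = None
    for i, (_word, tag) in enumerate(tagged):
        if tag in W_A:
            if start is None:
                start = i                  # a word in W_a starts the scan
        elif start is not None:
            chunks.append((start, i))      # a word outside W_a stops it
            start = None
    if start is not None:                  # utterance ended inside a chunk
        chunks.append((start, len(tagged)))
    return chunks


def split_by_chunks(tagged: Tagged, chunks: List[Tuple[int, int]]) -> List[Tagged]:
    """Split the input into n+1 parts around n marked chunks (parts may be empty)."""
    parts, prev = [], 0
    for start, end in chunks:
        parts.append(tagged[prev:start])
        prev = end
    parts.append(tagged[prev:])
    return parts


# Hypothetical tagged utterance, written with English glosses.
EXAMPLE = [("I", "PN"), ("afternoon", "NT"), ("3", "CD"),
           ("o'clock", "Mt"), ("arrive", "VV")]
CHUNKS = mark_time_chunks(EXAMPLE)
PARTS = split_by_chunks(EXAMPLE, CHUNKS)
```

Here `CHUNKS` holds a single span covering the three time-related words, and `PARTS` holds the two remaining segments that would be passed to the phrase parser.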

According to our experimental results, the FST(time) processing of the time phrases and quantifiers was 65.6% completely correct, 17.2% partially correct but with no error, and 16.1% not processed. The error ratio was 1.1%, and the accuracy of the parser improved by 6.5 percentage points when the FST(time) was used (see Section 4).

After the recognition of the time phrases, the input is segmented into n+1 parts by the n time phrases (n is an integer and $n \geq 0$). Each part is parsed by employing PCFG rules. Although there are large differences between the spoken Chinese language and the written Chinese language, we think these differences are mainly reflected at the sentence level, e.g., different orders of constituents containing redundant words in spoken Chinese expressions. Phrase construction in spoken Chinese and that in written Chinese follow the same principles. Accordingly, the PCFG rules employed in our system are directly extracted from the Penn Chinese Treebank (Xia, 2000). All of the rules comply with the condition $\sum_i P(\mathrm{LHS} \rightarrow \mathrm{RHS}_i) = 1$. Sample rules with their probabilities are given in the rule listing at the end of the paper.

In our system, the target of the parser is to recognize phrases rather than whole sentences.
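The normalization condition on the PCFG rules can be checked mechanically. The sketch below uses the three sample rules listed at the end of the paper, read here as 'NN NN → NP', 'MSP VP → VP' and 'MSP VP → NP'; the arrow direction is an assumption, since the arrows did not survive in this copy, and the function names are hypothetical.

```python
from collections import defaultdict
from math import isclose

# (left-hand side, right-hand side, probability), read as LHS -> RHS.
RULES = [
    (("NN", "NN"), "NP", 1.00),
    (("MSP", "VP"), "VP", 0.94),
    (("MSP", "VP"), "NP", 0.06),
]


def lhs_totals(rules):
    """Sum the probabilities of all rules that share the same left-hand side."""
    totals = defaultdict(float)
    for lhs, _rhs, prob in rules:
        totals[lhs] += prob
    return dict(totals)


def is_normalized(rules, tol=1e-9):
    """True if the probabilities for every left-hand side sum to 1 within tol."""
    return all(isclose(total, 1.0, abs_tol=tol)
               for total in lhs_totals(rules).values())


NORMALIZED = is_normalized(RULES)
```

Under this reading, the two rules for 'MSP VP' carry probabilities 0.94 and 0.06, which together satisfy the condition.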

The chunk dependency is analyzed by the FST(chunk), which treats a predicate as the center of an analyzed part. The dependencies between the predicate and other chunks are divided into nine types, as shown in Table 1.

Since the task of the paraphraser is not translation, the dependency is not divided to produce fine details. Some chunks are distinguished by their positions, e.g., far/near adverbial adjunct and pre-/post-SPV.

In the system, the predicate is recognized first, and the recognizer gives the most plausible candidate as the predicate. The dependencies between the predicate and other chunks are analyzed by the following algorithm.

Step 1. If there is only one predicate candidate, search for the subject and adverbial adjunct at the left of the predicate candidate and then determine the complement, object, quantifier, etc., at the right of the candidate predicate.

Step 2. If there are n (n>1) predicate candidates VP_i (i =1 .. n), perform the following operations:

	1)	Determine the subject and adverbial adjunct at the left of VP₁;
	2)	If VP₁ cannot take an object, treat the part after VP₁ as another processing unit, denoted PART-X;
	3)	If VP₁ is allowed to take an object, determine the object, complement, quantifier, etc., after VP₁ but before VP₂ and treat the other parts as another processing unit PART-X;
	4)	For other cases: (a) VP₁ may take two objects; (b) VP₁ may take a clause as its object; (c) VP₁ may take a noun as its object but the noun (pivot word) can act as the agent of another following verb; and (d) VP₁ is the judgment verb. Determine the possible sentence according to the different situations and treat the remainder of the input as another processing unit PART-X.

Step 3. Treat PART-X as the input and repeat Step 1 and Step 2 until all chunks have been analyzed.

Step 4. Record all possible dependencies and fill the Frame (see Sub-section 3.2).
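The control flow of Steps 1 to 3 can be sketched as a loop that peels off one predicate-centered unit at a time. This is a simplified illustration, not the authors' FST(chunk): it covers only subject, adverbial and object slots, ignores the special cases in item 4), and assumes chunks already carry one of the hypothetical categories "NP", "ADVP" or "VP". The set `takes_object` stands in for lexical knowledge about which predicates can take an object.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple

Chunk = Tuple[str, str]  # (text, category); the categories are illustrative


@dataclass
class Unit:
    predicate: str
    subject: Optional[str] = None
    adverbials: List[str] = field(default_factory=list)
    objects: List[str] = field(default_factory=list)


def analyze(chunks: List[Chunk], takes_object: Set[str]) -> List[Unit]:
    units: List[Unit] = []
    part = list(chunks)
    while part:                                     # Step 3: repeat on PART-X
        candidates = [i for i, (_, cat) in enumerate(part) if cat == "VP"]
        if not candidates:
            break                                   # no predicate candidate left
        first = candidates[0]
        unit = Unit(predicate=part[first][0])
        for text, cat in part[:first]:              # left of the predicate
            if cat == "NP" and unit.subject is None:
                unit.subject = text
            elif cat == "ADVP":
                unit.adverbials.append(text)
        if len(candidates) == 1:                    # Step 1: single candidate
            unit.objects = [t for t, c in part[first + 1:] if c == "NP"]
            part = []
        elif unit.predicate not in takes_object:    # Step 2, item 2)
            part = part[first + 1:]
        else:                                       # Step 2, item 3)
            second = candidates[1]
            unit.objects = [t for t, c in part[first + 1:second] if c == "NP"]
            part = part[second:]
        units.append(unit)
    return units


# Hypothetical input with two predicate candidates, in English glosses.
EXAMPLE = [("he", "NP"), ("yesterday", "ADVP"), ("came", "VP"),
           ("reserved", "VP"), ("a single room", "NP")]
UNITS = analyze(EXAMPLE, takes_object={"reserved"})
```

In this example the first unit is centered on "came" with its subject and adverbial, and the remainder becomes PART-X, which yields a second unit centered on "reserved" with one object.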

3.2 Frame Representation

According to the description above, an input utterance is analyzed into n ($n \geq 1$) parts, and each part is mapped into a frame.

A Frame consists of two parts, which we call Head and Body. The Head records the main features of the analyzed part, including the part's type, keywords, tense, and attribute. Type here refers to: (1) interrogative, (2) declarative, (3) greeting, or (4) simple reply. The attribute indicates the role that the part plays in the entire input. It may be a condition marked by Chinese words meaning 'if', 'suppose', etc., or a reason clause marked by Chinese words meaning 'because'.

Frame:	HEAD	: Type {Interrogative/...}
		: Keywords {Word₁/Position, ...}
		: Tense {Present/Past/...}
		: Attribute {Condition/...}
	BODY	: Subject;
		: Adverbial₁ / Adverbial₂ ...;
		: Predicate;
		: Object₁ Sub-Body;
		: Object₂;
		: Quantity;
		: Complement;
		: PW;
		: CPW;
		: SPV₁ / SPV₂;

PW, CPW, and SPV have the same meanings as in Table 1. If the object is a simple clause, the object is represented by a sub-Body that has the same structure as the main-Body. Some of the slots in the Frame may be null.
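One way to hold this Frame in a program is a pair of record types, sketched below with Python dataclasses. The slot names follow the Frame listing above; the class layout, the helper `non_null_slots`, and the example values (English glosses of input I-1, with an assumed past tense) are illustrative assumptions rather than the authors' data structure.

```python
from dataclasses import dataclass, field, fields
from typing import Dict, List, Optional, Tuple, Union


@dataclass
class Head:
    type: str                                   # interrogative, declarative, greeting, simple reply
    keywords: List[Tuple[str, int]] = field(default_factory=list)  # (word, position)
    tense: Optional[str] = None
    attribute: Optional[str] = None             # e.g. condition or reason


@dataclass
class Body:
    subject: Optional[str] = None
    adverbials: List[str] = field(default_factory=list)
    predicate: Optional[str] = None
    object1: Optional[Union[str, "Body"]] = None  # a clause object is a sub-Body
    object2: Optional[str] = None
    quantity: Optional[str] = None
    complement: Optional[str] = None
    pw: Optional[str] = None                    # pivot word
    cpw: Optional[str] = None                   # complement of pivot word
    spv: List[str] = field(default_factory=list)  # sequential predicates


@dataclass
class Frame:
    head: Head
    body: Body


def non_null_slots(body: Body) -> Dict[str, object]:
    """Return the Body slots that are filled, in declaration order."""
    filled = {}
    for slot in fields(body):
        value = getattr(body, slot.name)
        if value is not None and value != []:
            filled[slot.name] = value
    return filled


FRAME = Frame(
    head=Head(type="declarative", tense="past"),
    body=Body(subject="he", adverbials=["yesterday", "in Beijing"],
              predicate="reserved", object1="a single room"),
)
FILLED = non_null_slots(FRAME.body)
```

A helper such as `non_null_slots` is convenient for generation methods that walk over the non-null constituents of a Frame.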

3.3 Generating Expressions

Based on the Frame representation, the possible expressions are generated by different methods. If the input is just a simple reply or a greeting phrase, like 'OK' or 'no problem', there is no generation. Otherwise, the expressions are generated by using the following methods.

Method 1: Change the positions of adverbials. For example, the input 'Yesterday he reserved a single room in Beijing.' (I-1) is parsed, and the phrases 'yesterday' and 'in Beijing' are recognized as two adverbials. The positions of these two adverbials are then changed in the output expressions.
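A minimal sketch of Method 1 is to enumerate the placements of the recognized adverbials before the predicate and to drop the original order. The function name, the English glosses, and the assumed original chunk order are hypothetical; the sketch does not reproduce the outputs of the experimental system, and a real generator would still have to filter out unnatural orders.

```python
from itertools import permutations
from typing import List


def reorder_adverbials(subject: str, adverbials: List[str], predicate: str,
                       obj: str, original: List[str]) -> List[List[str]]:
    """Enumerate chunk orders that move adverbials around the subject."""
    variants: List[List[str]] = []
    for perm in permutations(adverbials):
        # The first k adverbials go before the subject, the rest after it.
        for k in range(len(perm) + 1):
            tokens = list(perm[:k]) + [subject] + list(perm[k:]) + [predicate, obj]
            if tokens != original and tokens not in variants:
                variants.append(tokens)
    return variants


# Assumed chunk order of input I-1, in English glosses.
ORIGINAL = ["yesterday", "he", "in Beijing", "reserved", "a single room"]
VARIANTS = reorder_adverbials("he", ["yesterday", "in Beijing"],
                              "reserved", "a single room", ORIGINAL)
```

With two adverbials there are six candidate orders, so `VARIANTS` contains the five that differ from the input.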

Method 2: Generate the expressions by using the Head information in the Frame and also explain each constituent separately. For example, after the analysis of the input I-1, the Head information of the Frame is determined and an expression is generated according to it. Then, the non-null constituents in the Frame are individually explained.

Method 3: Change the interrogative expression by using fixed patterns, in which one interrogative pattern containing a slot 'X' is rewritten into an equivalent interrogative pattern. Here, 'X' is any word, phrase, or even a simple clause.

Method 4: Generate expressions by using phrase-based patterns. Here, we assume that the input has already been parsed into phrases, so that phrase-based patterns may be extracted to generate other expressions of the input, e.g., patterns may be extracted from the sentence 'I want a bigger room.'

4 Experimental Results

At present, the experimental system employs 244 PCFG rules, 43 sentence type recognition rules, 14 rules for complicated utterance segmentation at the shallow level, and a dictionary with 6,500 Chinese words extracted from 64,480 utterances. Some test results are presented in this section.

4.1 Parsing Results

Here, we mainly test the effect of the time phrase and quantifier recognizer on the phrase parser by using 54 utterances, which contain 89 time phrases or numerals. Table 2 shows the parsing results both when the time and numeral phrases are pre-processed (Case 1) and when they are not pre-processed (Case 2).

From Table 2 we can see that the parsing accuracy improved by 6.5 percentage points (from 83.1% to 89.6%) when the time and numeral phrase recognizer was used.
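The ratios in Table 2 follow directly from the phrase counts, as the short check below shows. The variable and function names are illustrative; the counts are those reported in Table 2.

```python
def correct_ratio(correct: int, output: int) -> float:
    """Percentage of correct phrases among the output phrases, to one decimal."""
    return round(100.0 * correct / output, 1)


CASE1 = correct_ratio(147, 164)    # time and numeral phrases pre-processed
CASE2 = correct_ratio(123, 148)    # no pre-processing
GAIN = round(CASE1 - CASE2, 1)     # difference in percentage points
```

The improvement is a difference between two percentages, so it is measured in percentage points rather than as a relative change.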

4.2 Dependency Analysis Results

The dependency analyzer was tested by using 100 utterances, which include 107 simple sentences. The analysis results are divided into two types. The first type consists of the completely correct results. If any relation is analyzed incorrectly, the result belongs to the second type. The test results show that 61 simple sentences were analyzed completely correctly. This number amounts to about 57% of the total input. On the other hand, 46 simple sentences were analyzed incorrectly, which was about 43% of the total input. The wrong results could be attributed to three causes: (a) wrong parsing results, (b) wrong word segmentation, and (c) wrong dependency analysis. Table 3 gives the distribution of the three error types. Wrong parsing results are clearly the main cause of incorrect dependency analysis.

4.3 Paraphrasing Results

The entire paraphrasing system was tested by using the same 100 input utterances mentioned above. Sixty simple sentences were not paraphrased, and 47 simple sentences were paraphrased into 90 expressions. That is, one simple sentence was paraphrased into 1.91 possible expressions on average. The paraphrased results are divided into three types: (A) the results are correct and well expressed, (B) the results are understandable and acceptable, and (C) the results are wrong. Table 4 presents the three types of paraphrased results.

Table 4 shows that about 77.8% of the paraphrased results were good or acceptable. The wrong results were mainly caused by dependency analysis errors.
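The paraphrasing figures can be recomputed from the reported counts in the same way. The names below are illustrative; the counts come from the text of this sub-section and from Table 4.

```python
PARAPHRASED_SENTENCES = 47
COUNTS = {"A": 56, "B": 14, "C": 20}       # Table 4: results per type

TOTAL = sum(COUNTS.values())               # number of generated expressions
AVERAGE = round(TOTAL / PARAPHRASED_SENTENCES, 2)
RATIOS = {k: round(100.0 * v / TOTAL, 1) for k, v in COUNTS.items()}
GOOD_OR_ACCEPTABLE = round(100.0 * (COUNTS["A"] + COUNTS["B"]) / TOTAL, 1)
```

The counts sum to the 90 generated expressions, and types (A) and (B) together give the 77.8% quoted above.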

5 Conclusion

Chinese paraphrasing is a new research effort, although many of the techniques employed in our approach are not new. Although the performance of our experimental system is not yet satisfactory, the preliminary results have given us confidence in our development of a practical paraphraser for SLT systems. We believe that the ideas proposed in this paper are beneficial not only for paraphrasing tasks but also for robust spoken language understanding, information extraction, and other related tasks.

However, the approach faces many complicated problems, including robust parsing, dependency analysis, and natural language generation. The following two problems remain for further research: (1) how to judge whether the generated results are simpler and better formed than the input (at the very least, they should not be worse than the input); and (2) how to rank the generated results and then output the 'best' result to the transfer (translation) module.

6 Acknowledgements

The authors especially thank Professor Shiwen Yu, Professor Fuji Ren, Dr. Kiyonori Ohtake, and Ms. Lan Yao for their very useful help.

	1 Introduction
	2 Problems in Paraphrasing and Our Countermeasures
	2.1 Related Works on Paraphrasing
	2.2 Problems and Countermeasures
	2.3 Why We Use the Approach
	3 System Implementation
	3.1 Analyzer
	3.2 Frame Representation
	3.3 Generating Expressions
	4 Experimental Results
	4.1 Parsing Results
	4.2 Dependency Analysis Results
	4.3 Paraphrasing Results
	5 Conclusion
	6 Acknowledgements

	References

Table 1. Types of dependency between the predicate and other chunks
Mark	Type
SUB	Subject
Q-NUM	Quantifier
COMP	Complement
D-OBJ	Direct object
I-OBJ	Indirect object
ADV	Adverbial adjunct
SPV	Sequential predicates
PW	Pivot word
CPW	Complement of pivot word

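Table 2. Phrase parsing results with (Case 1) and without (Case 2) pre-processing of time and numeral phrases
	Case 1	Case 2
Output phrases	164	148
Correct phrases	147	123
Correct ratio	89.6%	83.1%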

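Table 3. Causes of incorrect dependency analysis: (a) wrong parsing results, (b) wrong word segmentation, (c) wrong dependency analysis
Cause	(a)	(b)	(c)
Number	38	4	4
Ratio (%)	82.6	8.7	8.7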

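Table 4. Paraphrasing results: (A) correct and well expressed, (B) understandable and acceptable, (C) wrong
Type	(A)	(B)	(C)
Number	56	14	20
Ratio (%)	62.2	15.6	22.2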

Example PCFG rules with their probabilities (Section 3.1):
	NN NN → NP, 1.00
	MSP VP → VP, 0.94
	MSP VP → NP, 0.06
