Case Study: Porting an NLP Application to Unicode

Nicolas AUCLERC
Yves LEPAGE
SHIRAI Satoshi



[ In Proceedings of IUC19, pp.395-401 (September, 2001). ]



INDEX

1 Case Study: Porting an NLP Application to Unicode
2 Plan
3 Presentation: ATR
4 Presentation: ATR SLT
5 Presentation: Translation
6 Presentation: Parsing
7 Presentation: Treebanks
8 Presentation: Tree Banking
9 Goals: a user-friendly tool for tree banking
10 Existing Applications
11 New Goals
12 Client Specifications
13 Server Specifications
14 Overview of the New Application
15 Plan
16 Implementation Choices
17 Benefits of having migrated to Unicode
18 Demonstration
19 Goals: Completion Status
20 Future Work
21 Conclusion
22 Questions/Comments?

Case Study:
Porting an NLP Application to Unicode


Nicolas AUCLERC
Yves LEPAGE
SHIRAI Satoshi

Good morning,

My name is Nicolas Auclerc and I work as a researcher at ATR, an organization comprising four Japanese laboratories whose main research field is telecommunications. More specifically, my laboratory is called ATR-SLT, and it is concerned with automatic translation over the telephone, also called automatic interpretation. Our techniques rely on usage examples, which explains our need for large amounts of linguistic resources. Consequently, we also need tools to create and update those linguistic resources.


Plan



  Presentation
BoardEdit
Conversion/Porting: Benefits/Drawbacks
Future Work & Conclusion

I shall start this presentation by explaining the context with an introduction to ATR-SLT, and then I shall explain the different steps in building a machine translation system. Next, I shall present the previous version of a tool designed to help tree-bankers. We defined new goals for this tool, which included porting the application to Unicode. This will lead me to speak about the benefits and the drawbacks of porting this Natural Language Processing application to Unicode.


Presentation: ATR


  Advanced Telecommunications Research
  Founded in March 1986 in the Kansai area (Kyoto-Osaka-Kobe)
Capitalized at 22.0325 billion yen ($180 million)
250 researchers (20% are foreigners)
4 projects
Adaptive Communications Research (ACR)
Multimedia Integration & Communications (MIC)
Human Information Processing (ISD)
Spoken Language Translation (SLT)

ATR International is the structure at the head of ATR's 4 research laboratories in telecommunications. It was founded in March 1986 in the Kansai area. The research center has four main projects, and employs 250 researchers. I am participating in the Spoken Language Translation project.


Presentation: ATR SLT


  Spoken Language Translation Research Laboratories
  Multi-lingual spoken language translation using large-scale parallel corpora
60 researchers
4 departments
Dept 1 : Recognition & speech synthesis
Dept 2 : Adaptive micro
Dept 3 : Machine translation
Dept 4 : Exploration of linguistic corpora

The Spoken Language Translation Research Laboratories started operation in March 2000 and continues the work of a previous laboratory, called ITL, which stands for Interpreting Telecommunications Laboratories. SLT is divided into 4 departments with 60 researchers. I work in the machine translation department.


Presentation: Translation



[Diagram: Sentence (source language) -> Parsing -> Linguistic structure (source language) -> Transfer -> Linguistic structure (target language) -> Generation -> Sentence (target language). Direct Translation links the source sentence straight to the target sentence.]

This slide shows the standard architecture of a translation system. The goal is to translate a sentence in a source language into one in a target language. Direct translation, which means translating word by word and then reordering, is an old technique that does not deliver very good results. The second generation of machine translation systems uses another technique, which involves three steps: parsing, transfer and generation. The first step, parsing, analyzes each sentence of the source language and returns a linguistic structure that represents the parsed sentence. The second step, transfer, converts the linguistic structure of the sentence in the source language into a linguistic structure in the target language. The third step, generation, generates a sentence in the target language from the linguistic structure obtained after the transfer.
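As a minimal sketch of how the three steps chain together, the Java fragment below composes a parser, a transfer step and a generator into one translation function. The names `Structure` and `TransferPipeline` are hypothetical, and the placeholder steps only tag the data so that the flow is visible; they are not the components used at ATR.

```java
import java.util.function.Function;

/** Toy stand-in for a linguistic structure; a real system would use a tree. */
final class Structure {
    final String language;
    final String content;

    Structure(String language, String content) {
        this.language = language;
        this.content = content;
    }
}

final class TransferPipeline {
    /** Chains the three steps: parsing, then transfer, then generation. */
    static Function<String, String> build(Function<String, Structure> parser,
                                          Function<Structure, Structure> transfer,
                                          Function<Structure, String> generator) {
        return parser.andThen(transfer).andThen(generator);
    }

    public static void main(String[] args) {
        // Placeholder steps: each one only relabels its input.
        Function<String, Structure> parser = s -> new Structure("source", s);
        Function<Structure, Structure> transfer = t -> new Structure("target", t.content);
        Function<Structure, String> generator = t -> "[" + t.language + "] " + t.content;

        Function<String, String> translate = build(parser, transfer, generator);
        System.out.println(translate.apply("hello"));
    }
}
```

Direct translation would correspond to a single function from sentence to sentence, with no intermediate structure.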


Presentation: Parsing

In this presentation, my main concern is parsing. As I mentioned, we start with a sentence, here in Japanese, and we want to obtain the linguistic structure of this sentence. Delivering such a structure is the job of a parser. In this slide, we see that the structure obtained is a tree. The parsers that we build in our lab exploit a large amount of already parsed data. A collection of such data is called a tree bank.


Presentation: Treebanks


  A set of sentences with their linguistic description
Excerpt of the ATR NEC treebank
. . .
( IT (I , + ( ( O ) ) ) )
IT ( I , + ( ( O ) ) )
( IT ( ET ( I ( ( O ) , + ) , ) )
. . .
Treebanks are mandatory data for example-based Natural Language Processing (NLP)

A treebank is a set of sentences together with their linguistic description. Here you see an excerpt of a treebank in the domain of travel situations. The sentences presented here are: "Can you bring food in?", "You can't bring food in", and "Do you have a sleeping bag?"
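To make the notion concrete, here is a small Java sketch of a treebank as a mapping from sentences to trees, with a parenthesized rendering of each tree. The class names `Node` and `Treebank` and the labels S, NP and VP are assumptions chosen for illustration; they do not reproduce the annotation scheme of the ATR treebank.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** A tree node: a label plus zero or more children (leaves carry words). */
final class Node {
    final String label;
    final List<Node> children = new ArrayList<Node>();

    Node(String label, Node... children) {
        this.label = label;
        this.children.addAll(Arrays.asList(children));
    }

    /** Renders the tree in parenthesized form; a leaf is rendered as its label. */
    String toBracketed() {
        if (children.isEmpty()) {
            return label;
        }
        StringBuilder sb = new StringBuilder("(").append(label);
        for (Node child : children) {
            sb.append(' ').append(child.toBracketed());
        }
        return sb.append(')').toString();
    }
}

/** A treebank: sentences stored together with their linguistic description. */
final class Treebank {
    private final Map<String, Node> entries = new LinkedHashMap<String, Node>();

    void add(String sentence, Node tree) {
        entries.put(sentence, tree);
    }

    /** Returns the stored tree, or null when the sentence is not in the bank. */
    Node lookup(String sentence) {
        return entries.get(sentence);
    }

    public static void main(String[] args) {
        Treebank bank = new Treebank();
        // Invented structure for one of the sentences quoted in the talk.
        bank.add("You can't bring food in",
                 new Node("S",
                          new Node("NP", new Node("You")),
                          new Node("VP", new Node("can't"), new Node("bring"),
                                   new Node("food"), new Node("in"))));
        System.out.println(bank.lookup("You can't bring food in").toBracketed());
    }
}
```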


Presentation: Tree Banking


  Collect a corpus of sentences
For each sentence, create corresponding linguistic structures
-  For that, do parsing by hand or
- use parsing aids like
  Search engine
Prototypes of parsers
Etc.

Here we show the general process of building a tree bank. The first step is to collect a set of sentences. Then, for each sentence, we try to obtain the linguistic structure by using parsing aids, such as exact matching or completion by analogy, or we simply build it by hand.


Goals: BoardEdit
a user-friendly tool for tree banking

When augmenting a tree bank, we start with a new sentence, and we want to get a linguistic structure for it. As our final goal is to build a parser, we do not yet have a parser at our disposal. In this case, primitive tools, called parsing aids, help us. As the structures delivered by the parsing aids may not be exactly right, they may need to be modified. For that reason, a tree editor is integrated with the parsing aids.


Existing Applications


  Mono-platform
-  Solaris
Standalone application
- wxWindows (C/C++)
Parsing aids compatible with
- English (ASCII)
- Japanese (EUC-JP)

In 1995, the previous specifications, shown in this slide, guided the implementation of a tool called BoardEdit under Mac OS. In 1999, a new implementation of BoardEdit was done for Solaris. Both implementations were thus mono-platform applications. Moreover, the parsing aids developed for these applications supported only English and Japanese.


New Goals


  Share in-house tree banks and parsing aids
-  Client/server application
>  Multi-platform or intranet
- Unicode
> Multi-lingual support
> Universal parsing aids
Support for at least 5 languages
- Japanese, Chinese, Korean, French and English

At the beginning of 2000, our company, ATR SLT, decided to use Unicode as the character set for all linguistic databases. Consequently, we decided to redesign BoardEdit and to include a port to Unicode in the new design.


Client Specifications


  Interface with the tree banker
-  Multi-platform
- Light
 > should work under an Internet browser
- Input method support
- Localization
 Japanese, Chinese, Korean, French and English

Here are the main points of this new design. The previous implementations were quite heavy stand-alone applications usable only for English and Japanese, as I mentioned previously. To make everything lighter, we adopted a client-server model. We also planned for localization in several languages. To merge the different previous implementations, we decided on a multi-platform implementation. Here are the client specifications.


Server Specifications


  Search engine (host for parsing aids)
-  Keep current implementation in C for efficiency reasons
- Unicode support
>  C code: Adapt parsing aids to Unicode
> Data: Conversion of in-house tree banks to Unicode

And here are the server specifications. Please note that, on the server side, the adoption of Unicode implied adapting some existing code.


Overview of the New Application


This is an overview of the new BoardEdit. You can see what we discussed previously: parsing aids and tree editors for the programs, and tree banks for the data. The new design allowed us to separate things more clearly: parsing aids now run on the server side only, and tree editors are located on the client side. Tree banks need not be loaded on the user machine: they are used by the parsing aids on the server. In addition, in this new design, several clients can run on different workstations, possibly under different operating systems, and they just communicate with the server.


Plan


  Presentation
BoardEdit
Conversion/Porting: Benefits/Drawbacks
Future work & Conclusion

Of course, for practical reasons, the new implementation should not require a complete rewriting of the existing code. This existing source code was entirely in C/C++. However, we wanted a language which included graphical components with Unicode features. Hence, a trade-off had to be found between reusing the existing C code for the parsing aids and supporting Unicode.


Implementation Choices


  Java 1.3
-  Unicode
  Transparent
- Input method support
Easy thanks to Input Method Framework (IMF)
- Client/Server communication
Managed by Remote Method Invocation (RMI)
- Link between C (parsing aids) and Java
Direct call through Java Native Interface (JNI)

The right choice for our new design, and the answer to the previous trade-off, seems to have been Java. On the client side, Java simplified the use of Unicode and the implementation of the user interface. On the server side, our parsing aids, which are primitive tools, do not manipulate the semantics associated with characters; they only perform basic character operations like testing equality. Hence we could clearly separate code that was concerned with the meaning of characters from code that was not. The former was rewritten in Java, which transparently supports Unicode, and the latter was kept in C, with few modifications.
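The following Java sketch illustrates what a remote contract between the tree-editor client and the server could look like with RMI. The interface `ParsingAidService`, the class `InMemoryParsingAid` and the method `exactMatch` are hypothetical names; in this sketch the server answers from an in-memory map, whereas in the tool described here the server would hand the request to the C parsing aids through JNI.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical remote contract between a tree-editor client and the server. */
interface ParsingAidService extends Remote {
    /** Returns the stored structure for the sentence, or null if it is absent. */
    String exactMatch(String sentence) throws RemoteException;
}

/** Server-side implementation; a real server would delegate to native code here. */
final class InMemoryParsingAid implements ParsingAidService {
    private final Map<String, String> treebank = new HashMap<String, String>();

    void store(String sentence, String structure) {
        treebank.put(sentence, structure);
    }

    @Override
    public String exactMatch(String sentence) {
        // Java strings are Unicode, so one lookup serves every language.
        return treebank.get(sentence);
    }

    public static void main(String[] args) {
        InMemoryParsingAid aid = new InMemoryParsingAid();
        // A Japanese word written with Unicode escapes to keep the source ASCII-only.
        aid.store("\u98df\u3079\u7269", "(N \u98df\u3079\u7269)");
        System.out.println(aid.exactMatch("\u98df\u3079\u7269") != null);
    }
}
```

To serve real clients, such an object would typically be exported with `UnicastRemoteObject.exportObject` and bound in an RMI registry; that wiring is left out to keep the sketch short.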


Benefits of having migrated to Unicode


  Data: Single encoding for all languages
-  All treebanks in UTF-8
Server: Single binary for all languages
- Universal parsing aids
- Smaller binary
- No loss in processing time
Client: Language-independent interface
- I/O for all languages handled by IME (not interface)

Let us now speak about data, meaning tree banks. We simply had to apply converters to our data. For the server, using a single character set, Unicode, had a positive consequence on the programs from the static point of view: we reduced the size of the binaries. It also had no negative consequence: we kept the same processing time for all languages, Japanese, English, etc. Moreover, by using Unicode, we made the binary code universal: any new language will be handled by the same program without any need to add anything whatsoever. Lastly, a benefit for the client side was that, thanks to the Java Input Method Framework, we did not need to be concerned with character input.
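A converter of this kind can be very short in Java. The sketch below decodes bytes from a legacy charset and re-encodes them as UTF-8; the class name `TreebankConverter` is hypothetical, and this is not the converter actually used for the in-house tree banks. The demonstration uses ISO-8859-1 because every Java runtime supports it; a name such as "EUC-JP" can be passed in the same way when `Charset.isSupported` reports it as available.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class TreebankConverter {
    /** Decodes bytes from a legacy charset and re-encodes them as UTF-8. */
    static byte[] toUtf8(byte[] legacyBytes, String legacyCharsetName) {
        Charset legacy = Charset.forName(legacyCharsetName);
        String text = new String(legacyBytes, legacy);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The French word "cafe" with an acute accent on the e, in ISO-8859-1:
        // the accented letter is the single byte 0xE9.
        byte[] latin1 = {0x63, 0x61, 0x66, (byte) 0xE9};
        byte[] utf8 = toUtf8(latin1, "ISO-8859-1");
        // In UTF-8 the accented letter takes two bytes, 0xC3 0xA9.
        System.out.println(latin1.length + " bytes -> " + utf8.length + " bytes");
    }
}
```

Note that `new String(bytes, charset)` silently replaces malformed input; a stricter converter could use a `CharsetDecoder` configured with `CodingErrorAction.REPORT` instead.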


Demonstration

(Running the demonstration to be shown during the conference) Given a new sentence, we first look for the same one in the tree bank by exact matching. Here exact matching is universal because the character set is Unicode. If the sentence is found, the user just copies and pastes the output. If it is not found, we apply a much more complicated technique called completion by analogy, which relies entirely on character comparison for equality. It is thus insensitive to the character set. Hence the use of Unicode, or of any other character set, is transparent. If this technique fails, approximate matching, a generalization of exact matching, is used.
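The fallback from exact to approximate matching can be sketched in Java as follows. Two points are assumptions: the talk does not say how approximate matching measures closeness, so the sketch uses edit distance, and the completion-by-analogy step is left out entirely. The class name `SentenceLookup` is hypothetical. The distance is computed on Unicode code points using equality tests only, which is why the same code works for any language.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SentenceLookup {
    private final Map<String, String> treebank = new LinkedHashMap<String, String>();

    void add(String sentence, String structure) {
        treebank.put(sentence, structure);
    }

    /** Edit distance on code points, using only equality tests between characters. */
    static int editDistance(String a, String b) {
        int[] x = a.codePoints().toArray();
        int[] y = b.codePoints().toArray();
        int[] prev = new int[y.length + 1];
        int[] curr = new int[y.length + 1];
        for (int j = 0; j <= y.length; j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= x.length; i++) {
            curr[0] = i;
            for (int j = 1; j <= y.length; j++) {
                int substitution = prev[j - 1] + (x[i - 1] == y[j - 1] ? 0 : 1);
                int insertOrDelete = Math.min(prev[j], curr[j - 1]) + 1;
                curr[j] = Math.min(substitution, insertOrDelete);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[y.length];
    }

    /** Tries exact matching first, then falls back to the closest stored sentence. */
    String find(String sentence) {
        String exact = treebank.get(sentence);
        if (exact != null) {
            return exact;
        }
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<String, String> entry : treebank.entrySet()) {
            int distance = editDistance(sentence, entry.getKey());
            if (distance < bestDistance) {
                bestDistance = distance;
                best = entry.getValue();
            }
        }
        return best; // null only when the tree bank is empty
    }
}
```

A caller would present the returned structure in the tree editor so that the tree banker can correct it when the match was only approximate.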


Goals: Completion Status


  Share in-house tree banks and parsing aids
v  Client/server application
v  Multi-platform (tested under Windows and Linux)
Intranet (just a matter of time)
v Unicode
v Multi-lingual support
v Universal parsing aids
v Support for at least 5 languages (localized for 4)
v Japanese, Chinese, Korean, French and English

In conclusion, we have reached almost all of our goals, which we list here.


Future Work


  Parsing aids
-  Do not support bi-directional writing
 Hebrew and Arabic
Client
- Find free input methods
- Implement BoardEdit as an applet
Useable under an Internet browser (intranet use)

By using Java for the client, we made the port to Unicode simpler. However, let me mention that future extensions of the tool may need more information, like the direction of the text (left to right, right to left, or top to bottom). Also, to address a different but related problem, under the Java Input Method Framework, there are no reliable or efficient input method tools for Asian languages like Japanese and Chinese.


Conclusion


BoardEdit has been successfully ported to support Unicode
    > Greater universality
> Code made simpler
> Laid foundations for extensions

In conclusion, we may claim that, for our purposes, the use of Unicode facilitated our work on the data side as well as on the programming side. Moreover, the use of Unicode had no influence on the processing speed, which can be a real problem with large linguistic resources.


Questions/Comments?