| SARAKI Masashi | OSADA Tetsuo | WATANABE Yuhki | SHIRAI Satoshi | |||
| Saraki R & D | Waseda University | Toyohashi University of Technology | NTT CS Laboratories |
Etymology is the study of the origin of words. It is a remarkable attribute of words, indicating as it does where words have come from or from which language they have been borrowed. In this sense, the etymology of a word never changes with the times and, in fact, has never changed throughout the history of languages. The signification of individual words, however, has changed and will continue to do so with the times. By setting up etymology as one attribute of every word, we can create another tool for syntactical and rhetorical analysis.
The historical process of the development of English is reflected in its contemporary vocabulary and expressions. English vocabulary and expressions consist of three layers, as in geologic stratification: a primal layer, an intermediate layer, and a modern layer, referred to as Saxon, French, and Latin respectively. The primal, Saxon layer is psycholinguistically the deepest in the mind of the native speaker and is tied to images of the soul; the French layer includes concepts of social consciousness; and the Latin layer involves logos, or reason.
The Dictionary of English Etymology for Natural Language Processing (hereafter simply referred to as DEE) includes the origin and borrowing process of English words, but does not record semantic history, that is, the original meaning and the changes in significance the words have undergone. The historical approach enables the user of DEE to view the historical process of individual words. DEE currently has approximately 20,000 words and will be further increased to 25,000.
Referring to Figure 1, an entry consists of items which appear in the following order, although, apart from the serial number and head word, not all of these will necessarily be found in any particular entry:
|
Chronology indicates the year of first usage and the time division of English. The marking for Etymology indicates the particular foreign language in which a word originated or from which it was borrowed. The etymological item gives the origin and word formation of the head word and, if it is a loan word, traces its historical process back to its true origin.
The DEE database includes a main database as shown in Figure 1 and supplementary databases of irregular verbs, postposed adjectives and affixes. Additional databases of other word classes will be incorporated.
When used in conjunction with the DEE database, the tagging utility allows users to make maximum use of the data. The utility attaches the corresponding etymological label to each word of the user's text, so that the etymological arrangement of the words can be viewed.
Some examples are shown below:
| OE, ME, ModE, PE;1 | |
| F, OF, AF, NF2; | |
| L, late L, med L, mod L, VL;3 Gk4 | |
| Ir, Ital., Sp, Ar5; |
The utility includes a tagging program, a user interface, and database files of the etymology. The tagging program is implemented in Perl and the user interface is provided through CGI, as shown in Figure 2. The CGI interface allows the user to select a desired text and run the tagging program. The user can select optional labels by marking check boxes. Thus, the user can choose which word classes, such as Nouns, Verbs, Adjectives, Adverbs, Pronouns, Prepositions, Articles, Auxiliaries, Copula, Coordinates, Subordinates, and Interrogates, are to be tagged with the etymological labels.
|
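The tagging step described above can be sketched as follows. This is a minimal illustration in Python, not the Perl program of the utility itself: the lexicon layout, the function name `tag_text`, and the word classes assigned to each word are assumptions made for the sketch, while the labels are copied from the tagged excerpt in this paper.

```python
import re

# Hypothetical miniature lexicon: word -> (word class, etymological label).
# The labels come from the tagged excerpt in this paper; the word classes
# and the dictionary layout are assumptions of this sketch, not DEE's format.
LEXICON = {
    "people": ("noun", "AF"),
    "bother": ("verb", "ON"),
    "matter": ("noun", "OF"),
    "admit": ("verb", "L"),
    "language": ("noun", "OF"),
    "bad": ("adjective", "OE"),
    "way": ("noun", "OE"),
}


def tag_text(text, selected_classes):
    """Append "(LABEL)" to each known word whose class was selected."""

    def tag(match):
        word = match.group(0)
        entry = LEXICON.get(word.lower())
        # Leave unknown words and unselected word classes untouched.
        if entry is None or entry[0] not in selected_classes:
            return word
        return f"{word}({entry[1]})"

    return re.sub(r"[A-Za-z]+", tag, text)


print(tag_text("the language is in a bad way", {"noun", "adjective"}))
```

The set passed as `selected_classes` plays the role of the check boxes in the CGI interface: only words belonging to a selected class receive a label.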
The following original text is an excerpt from an English novelist's work6, shown together with the text tagged according to the selected labels, for example, nouns, verbs, adjectives, adverbs, and subordinates.
| Original Text | |
| "Most people who bother with the matter at all would admit that the English language is in a bad way, but it is generally assumed that we cannot by conscious action do anything about it. Our civilization is decadent and our language, so the argument runs, must inevitably share in the general collapse." | |
| Tagged Text | |
| "Most(OE) people(AF) who bother(ON) with the matter(OF) at all would admit(L) that the English(OE) language(OF) is in a bad(OE) way(OE), but it is generally(L-OE) assumed(L) that we cannot by conscious(L) action(F) do anything about it. Our civilization(F) is decadent(L) and our language(OF), so(OE) the argument(F) runs(OE), must inevitably(L) share(OE) in the general(L) collapse(L)." |
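Once a text is tagged in this way, the labels can be counted to see how the three layers are distributed. The following Python sketch shows one way to do this. The grouping of labels into layers, the treatment of compound labels such as L-OE by their first element, and the name `layer_profile` are assumptions of this sketch rather than part of DEE.

```python
import re
from collections import Counter

# Assumed grouping of labels into the three layers named in this paper.
# Labels not listed here (for example ON or Gk) are counted as "other".
LAYER_OF_LABEL = {
    "OE": "Saxon",
    "F": "French", "OF": "French", "AF": "French", "NF": "French",
    "L": "Latin", "lateL": "Latin", "medL": "Latin",
    "modL": "Latin", "VL": "Latin",
}


def layer_profile(tagged_text):
    """Count how many tagged words fall into each layer."""
    counts = Counter()
    # A label is the parenthesised string directly following a word.
    for label in re.findall(r"\w\(([A-Za-z-]+)\)", tagged_text):
        first = label.split("-")[0]  # compound label: use first element (assumption)
        counts[LAYER_OF_LABEL.get(first, "other")] += 1
    return counts


sample = "Most(OE) people(AF) who bother(ON) with the matter(OF) at all would admit(L)"
print(layer_profile(sample))
```

For the opening clause of the excerpt this counts one Saxon word, two French words, one Latin word, and one word (the Old Norse "bother") outside the three layers.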
Peter Mark Roget proposed a methodology for compiling his thesaurus in the introduction to the original edition in 1852[6]. The principle of Roget's classification is the same as that employed in the various branches of Natural History: the sectional divisions Roget formed correspond to natural families in botany and zoology, and the filiation of words presents a network analogous to the natural filiation of plants or animals. Thus, Roget established a "tabular synopsis of categories" and accordingly classified English vocabulary into six primary classes with further subdivisions. Words are arranged under several topics or heads of signification. A portion of Roget's thesaurus, which has been revised, is cited below (the words have been tagged with the etymological labels):
| Existence(OF). | |
|
N. existence(OF), being(OE), entity(modL); absolute(F)
being(OE), the absolute(F) 965 divineness(L);
aseity(medL), self(OE)-existence(OF), monad(lateL),
a being(OE), an entity(medL), ens(OF), essence(L),
quiddity(medL), Platonic(L) idea(Gk), universal(OF);
subsistence(lateL) 360 life(OE); survival(AF),
eternity(OF), 115 perpetuity(OF), preexistence(OF)
119 priority(OF); this life(OE) 121 present(OF)
time(OE), existence(OF) in space(OF), prevalence(F)
189 presence(OF), entelechy(lateL), realization(F),
becoming(OE), evolution(L) 147 conversion(OF);
creation(OF) 164 production(OF); potentiality(medL)
469 possibility(MF); ontology(modL),
metaphysics(Gk); realism(F), materialism(modL),
idealism(F), existentialism(D) 449 philosophy(L)
reality(OF), realness(OF-OE)7, actuality(medL),
entelechy(lateL), Dasein(D); actual(OF) existence(OF),
material(lateL) e. 319 materiality 1(OF-OF); thatness
80 speciality(OF); positiveness(OF-OE);
historicity(L-OF)7, factuality(L-OF), factualness(L-OF-OE) 494
truth(OE); fact(L), fact(L) of life(OE),
undeniable(OE-ME-OF) f., positive(OF) f.,
stubborn(OE) f., matter(OF) of f., fait accompli(F) 154
event(L), real(AF) thing(OE), not a dream(OE), no
joke(L); realities(OF), nitty-gritty(America),
basics(PE), fundamentals(modL), bedrock; nuts(OE)
and(OE) bolts(OE), brass(OE) tacks(AF) 638
important(medL) matter(OF) essence(L), nature(natura < nasci), very(OF) n., essential(medL) n., quiddity(medL), hypostasis(Gk) 3 substance(L); constitutive(L-OF) principle(OF), inner(OE) being(OE), sum(L) and substance(L) 5 essential(medL) part(L); prime(OF) constituent(L), soul(OE), heart(OE), core(OF), centre(L) 224 interiority(L-OF). |
Further, by tagging the whole of Roget's thesaurus, we will be able to find the etymological characteristics of English words expressing general ideas.
Secondly, Roget mentioned the notion of correlative words: "For the purpose of exhibiting with greater distinctness the relations between words expressing opposite and correlative ideas, I have, whenever the subject admitted of such an arrangement, placed them in two parallel columns in the same page, so that each group of expressions may be readily contrasted with those which occupy the adjacent column, and constitute their antithesis."[6] Roget also suggested a method of arranging correlative terms in the form of a triad, as follows:
α Two ideas which are completely opposed to each other, admitting of an intermediate or neutral idea. In the following examples, the words in the first and third columns express opposite ideas, and the terms in the middle column have a neutral sense with reference to the former.
|
β The intermediate word is simply the negative to each of two opposite positions:
|
γ The intermediate word is properly the standard with which each of the extremes is compared.
|
The same word has several correlative words according to the different relations.
|
|
The origins in each triad are represented with a set of etymological labels. According to the tagged triads, we can work on the presumption that correlative ideas are expressed with corresponding correlative words having a similar origin.
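The presumption above can be checked mechanically once each member of a triad carries a label. The Python sketch below is one possible reading of "similar origin", namely that all three labels fall into the same layer. The layer grouping, the function names, and the sample triads are assumptions of this sketch: the labels for "beginning", "middle", and "end" are supplied here for illustration and are not taken from DEE, and the second triad uses placeholder words.

```python
# Assumed grouping of labels into layers (not defined by DEE itself).
LAYER_OF_LABEL = {
    "OE": "Saxon",
    "F": "French", "OF": "French", "AF": "French", "NF": "French",
    "L": "Latin", "lateL": "Latin", "medL": "Latin",
    "modL": "Latin", "VL": "Latin",
}


def layer(label):
    """Map a label to its layer, using the first element of compound labels."""
    return LAYER_OF_LABEL.get(label.split("-")[0], "other")


def shares_origin(triad):
    """True if all three words of the triad belong to one known layer."""
    layers = {layer(label) for _, label in triad}
    return len(layers) == 1 and "other" not in layers


# Illustrative triads; labels are assumptions of this sketch.
native_triad = [("beginning", "OE"), ("middle", "OE"), ("end", "OE")]
mixed_triad = [("word_a", "OE"), ("word_b", "OF"), ("word_c", "L")]

print(shares_origin(native_triad), shares_origin(mixed_triad))
```

A looser criterion, for example treating the French and Latin layers as one group, would only require changing the mapping.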
Postposed adjectives have a prepositional phrase or nonfinite verb phrase as complement. English syntax allows adjectives to be both prepositive and postpositive in a noun phrase and, further, to be grouped immediately before a noun. This unique syntactical characteristic is an intermediate step between Romance and Germanic aspects. By tagging postposed adjectives with the etymological labels, we can show the data in tabular fashion, as in Table 1, and as a concordance, as in Table 2. We can presume that postmodification by adjectives was devised according to Latin grammar and derive the following hypothesis: if "S+P+N1(French/Latin)+Adj.(French/Latin)+prep.", then "Adj.+prep." is a predicate of N1.
|
|
|
|
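The noun, adjective, and preposition part of the hypothesis above can be searched for in a tagged text. The following Python sketch assumes that each token is available as a (word, word class, label) triple. That token format, the set `ROMANCE_LABELS`, the function name, and the labels in the sample phrase are assumptions of this sketch, and the subject and predicate (S+P) part of the pattern is not checked.

```python
# Labels treated as French or Latin in this sketch (an assumed grouping).
ROMANCE_LABELS = {"F", "OF", "AF", "NF", "L", "lateL", "medL", "modL", "VL"}


def find_postposed(tokens):
    """Return (noun, adjective, preposition) sequences where both the noun
    and the adjective that follows it carry a French or Latin label."""
    hits = []
    # Slide a window of three consecutive tokens over the text.
    for noun, adj, prep in zip(tokens, tokens[1:], tokens[2:]):
        if (noun[1] == "noun" and adj[1] == "adjective"
                and prep[1] == "preposition"
                and noun[2] in ROMANCE_LABELS
                and adj[2] in ROMANCE_LABELS):
            hits.append((noun[0], adj[0], prep[0]))
    return hits


# Illustrative phrase; the labels are supplied for the sketch, not from DEE.
phrase = [
    ("the", "article", "OE"),
    ("person", "noun", "OF"),
    ("responsible", "adjective", "F"),
    ("for", "preposition", "OE"),
    ("it", "pronoun", "OE"),
]
print(find_postposed(phrase))
```

Each hit is a candidate for the reading in which the adjective and its preposition act as a predicate of the noun; the hits would still need to be checked by hand.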
When DEE and its utility work in conjunction with a text processing system, for example WRAPL[7][8], which can convert complex English sentences into simple sentences, they will help widen access to English texts throughout the World Wide Web for people to whom English is a foreign language and assist people who have language disabilities to read with greater ease. DEE and WRAPL will simplify complex English sentences into simple sentences and replace abstruse or abstract words (loan words) with plain or concrete ones (native words). In this way, we intend to contribute to PSET (Practical Simplification of English Texts)[9].