
Korean Analysis and Transfer in Multilingual Machine Translation System

Sung-Kwon

Choi

Systems Engineering Research Institute

skchoi@seri.re.kr

Tae-Wan

Kim

Systems Engineering Research Institute

twkim@seri.re.kr

Soo-Hyun

Lee

Systems Engineering Research Institute

shlee@seri.re.kr

Dong-In

Park

Systems Engineering Research Institute

dipark@seri.re.kr

1997

ACH/ALLC 1997

editor

the secretarial staff in the Department of French Studies at Queen's University

Greg

Lessard

encoder

Sara

Schmidt

machine translation

multilingualism

common grammatical knowledge

Abstract

Multilingual machine translation means translation among more than two languages. Existing multilingual machine translation systems can be classified as transfer-based or interlingua-based. In the former, the analysis and generation rules were written differently from one another, so that what the languages have in common was ignored and the overall memory requirement grew. The latter faced the difficulty of implementing a universal linguistic model that holds for many languages. To overcome the shortcomings of these existing systems, this paper describes a multilingual MT system built on common rules, which capture the commonness of languages and can be shared by many languages.

1 Introduction

The analysis and generation rules in existing transfer-based multilingual machine translation systems (SYSTRAN, EUROTRA, METAL, LOGOS, GETA etc.) are independent of one another and differ according to the target language [Hutchins 1992]. In other words, these systems do not acknowledge the commonness of languages. As a result they take the form of a bundle of bilingual MT systems, which increases the size of the system. Some transfer-based multilingual machine translation systems use an interlingual method to reduce the number of transfer processes (CETA, SALAT, DLT, KANT etc.); however, they face the difficult problem of completing a universal linguistic model [Lewis 1992]. From this point of view, this paper describes a new multilingual machine translation method based on common rules and constraint rules, designed to overcome the problems of the existing systems. Common rules are rules that more than two languages have in common. Because the common rules are loaded into memory only once, they reduce memory use, improve the consistency of grammatical information, and standardize the information structure of the lexicon. They have a further merit for MT: when a new language is added to the existing system and translated into the existing languages, new grammar modules can be created easily by combining common rules. Constraint rules are rules that control the linguistic characteristics of individual languages. The rest of this paper consists of three parts. Chapter 2 introduces the construction of the whole system. Chapter 3 describes the modules that consist of common rules, that is, the common grammatical rules, the common lexicon information structure, the common structural transfer rules, and the common information transfer rules. Chapter 4 explains the analysis and transfer of Korean by means of the parameterized common rules and the constraint rules.

2 System construction

Figure 1 shows the construction of the multilingual machine translation system based on common rules and constraint rules:

Figure 1: Construction of multilingual machine translation system

The middle field of Figure 1 represents the common module. Each 'rn' is a file of common rules belonging to the common module. These files are called by the grammar modules of the individual languages and, together with the constraint rules for a language, constitute the grammar rules of that language. For example, Korean, Japanese, English and German in Figure 1 have the rule file r3 in common, but Korean and Japanese additionally share the rule files r2 and r4 because they are typologically more similar to each other than to English and German.
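The sharing of rule files described above can be sketched in a few lines of Python. This is only an illustration, not the CAT2 implementation: the file names r2, r3 and r4 come from the text, while the table `COMMON_RULE_FILES` and the helpers `files_to_load` and `shared_by` are hypothetical, and only the rule files named in the text are listed.

```python
# Hypothetical sketch: which common rule files each grammar module calls.
# Only r2, r3 and r4 are taken from the text; the rest of Figure 1 is not modelled.
COMMON_RULE_FILES = {
    "Korean": ["r2", "r3", "r4"],
    "Japanese": ["r2", "r3", "r4"],
    "English": ["r3"],
    "German": ["r3"],
}


def files_to_load(modules):
    """Return each common rule file once, however many languages call it."""
    loaded = set()
    for files in modules.values():
        loaded.update(files)
    return sorted(loaded)


def shared_by(modules, *languages):
    """Return the rule files that all of the given languages have in common."""
    sets = [set(modules[lang]) for lang in languages]
    return sorted(set.intersection(*sets))
```

Under these assumptions, four grammar modules reference seven rule files in total, but only three distinct files have to be held in memory.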

3 Common rules

This chapter shows the construction of the common rules. The common rules for analysis consist of the common grammar rules and the common lexicon information structure; those for transfer consist of the common structural transfer rules and the common information transfer rules.

3.1 Common grammatical rules

To handle many languages in a multilingual machine translation system, the common grammatical rules should account for the linguistic phenomena of as many languages as possible. To cover both configurational languages (e.g. English) and nonconfigurational languages whose word order is relatively free (e.g. Korean, Japanese, German), we have written new grammar rules that combine X-bar syntactic theory [Jackendoff 1977] and HPSG [Pollard 1994]. The rules are binary, except for the coordination structure, which is ternary.

Table 1. Common grammar rules

head-final-structure    head-first-structure    head-middle-structure
PRED => ARG PRED        PRED => PRED ARG        COORD => ARG1 COORD ARG2
MODED => MOD MODED      MODED => MODED MOD
FUNCT => ARG FUNCT      FUNCT => FUNCT ARG

The common grammar rules of Table 1 are given in Appendix 1 in the notation of the CAT2 machine translation system.
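For readers without access to the CAT2 notation, the rules of Table 1 can also be written down as plain data. The following Python sketch is an assumption-laden illustration, not the CAT2 encoding: the class `Rule`, the list `COMMON_GRAMMAR_RULES` and the helper `head_index` are hypothetical names, and the head is identified simply as the daughter that carries the same label as the mother.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    mother: str
    daughters: tuple
    structure: str  # "head-final", "head-first" or "head-middle"


# The seven rules of Table 1, grouped by the position of the head.
COMMON_GRAMMAR_RULES = [
    Rule("PRED", ("ARG", "PRED"), "head-final"),
    Rule("MODED", ("MOD", "MODED"), "head-final"),
    Rule("FUNCT", ("ARG", "FUNCT"), "head-final"),
    Rule("PRED", ("PRED", "ARG"), "head-first"),
    Rule("MODED", ("MODED", "MOD"), "head-first"),
    Rule("FUNCT", ("FUNCT", "ARG"), "head-first"),
    Rule("COORD", ("ARG1", "COORD", "ARG2"), "head-middle"),
]


def head_index(rule):
    """Position of the head daughter, i.e. the daughter labelled like the mother."""
    return rule.daughters.index(rule.mother)
```

In this representation every rule is binary except the coordination rule, and the head position follows directly from the column of Table 1 in which the rule appears.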

3.2 Common lexicon information structure

A lexicon information structure is needed in order to enter, manage and correct the lexicon information of the multilingual machine translation system consistently. It is desirable to build not a monotonic but a multi-level structure, so that the lexicon information structure can represent all the relevant linguistic information and so that groups of information can be moved together. For this reason we have chosen the feature structure as the multilingual lexicon information structure and made the attributes the same across languages. Appendix 2 shows an example of the multilingual lexicon information structure.

3.3 Common structural transfer rules

There is also a part of the transfer process that many languages can share: compositional transfer, which copies a node of the source language to a node of the target language when the analysis structure of the former and the generation structure of the latter are the same. In order to transfer structures that differ between languages compositionally, our system deletes the functional words and then transforms the syntactic nodes into 'predicate-argument-modifier' nodes. The noncompositional structural rules, which cannot be handled by the common structural transfer rules, are recorded in the transfer lexicon because they depend on individual lexemes. The transfer rules are applied in priority order: first the noncompositional structural transfer rules, second the common structural transfer rules, and last the lexical transfer rules in the lexicon. The common structural transfer rule is as follows:

(1) common_structural_transfer_rule = {}.[+node] <=> {}.[+node].

Rule (1) says that all compositional transfer trees, that is, trees marked '+node', are transferred unchanged from the source language to the target language.
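The priority order of the transfer rules and the node-copying behaviour of rule (1) can be illustrated with a small Python sketch. It is a simplification under stated assumptions: trees are plain dictionaries, the function `transfer` and its arguments are hypothetical, noncompositional rules are modelled as (test, rewrite) pairs, and the lexeme forms and their translations in the usage below are illustrative only.

```python
def transfer(node, noncompositional, lexicon):
    """Transfer a source tree to a target tree in the stated priority order."""
    # 1. Noncompositional structural rules (from the transfer lexicon) come first.
    for matches, rewrite in noncompositional:
        if matches(node):
            return rewrite(node)
    # 2. Common structural transfer: copy the node and recurse into its daughters,
    #    so the shape of the tree is left unchanged.
    target = {
        "label": node["label"],
        "children": [
            transfer(child, noncompositional, lexicon)
            for child in node.get("children", [])
        ],
    }
    # 3. Lexical transfer applies last, at nodes that carry a lexeme.
    if "lex" in node:
        target["lex"] = lexicon.get(node["lex"], node["lex"])
    return target


# Illustrative input: the segmentation and the lexicon entries are assumptions.
source = {
    "label": "PRED",
    "children": [
        {"label": "ARG", "lex": "cengpwu"},
        {"label": "PRED", "lex": "malyenha"},
    ],
}
lexicon = {"cengpwu": "government", "malyenha": "make"}
result = transfer(source, [], lexicon)
```

The target tree has the same labels and the same arrangement of daughters as the source tree; only the lexemes differ.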

3.4 Common information transfer rules

The transfer process in multilingual machine translation can also be simplified by separating structure from information. In existing transfer-based machine translation systems the structural transfer has included the information transfer, which has led to duplicated information and increased memory use. Separating structure from information removes these shortcomings. The common information transfer rules transfer the information that is common to many languages; that is, they copy semantic information from the source language to the target language. The semantic information is produced during analysis by mapping each form to its meaning. The common information transfer rules are as follows (we use the notation of the CAT2 system):

(2) Common information transfer rules

Lexical_semantic_transfer = {head:{ehead:{sem:SEM}}}.[*] <=> {head:{ehead:{sem:SEM}}}.[*].

Transfer_of_semantic_roles = {role:ROLE}.[*] <=> {role:ROLE}.[*].

The lexical semantic transfer rule says that the lexical semantic information of the source language is copied to that of the target language on the same node level, and vice versa ('<=>' indicates bidirectionality). The transfer of semantic roles rule copies the semantic role information between the source language and the target language.
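The effect of the two rules in (2) can be approximated in Python as copying two features between corresponding nodes. This is a sketch, not a CAT2 rule interpreter: nodes are nested dictionaries, the function names merely mirror the rule names, and the feature values "plan" and "theme" in the usage are hypothetical.

```python
import copy


def lexical_semantic_transfer(source, target):
    """Copy head.ehead.sem from one node to the corresponding node."""
    sem = source.get("head", {}).get("ehead", {}).get("sem")
    if sem is not None:
        target.setdefault("head", {}).setdefault("ehead", {})["sem"] = copy.deepcopy(sem)
    return target


def transfer_of_semantic_roles(source, target):
    """Copy the semantic role from one node to the corresponding node."""
    if "role" in source:
        target["role"] = source["role"]
    return target


# Hypothetical feature values, for illustration only.
korean_node = {"head": {"ehead": {"sem": "plan"}}, "role": "theme"}
english_node = {}
lexical_semantic_transfer(korean_node, english_node)
transfer_of_semantic_roles(korean_node, english_node)
```

Because the rules are bidirectional, the same functions can be called with the arguments swapped when the translation direction is reversed.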

4 Korean analysis and transfer by constraint rules

The grammar of an individual language consists of universal rules and their parameters [Chomsky 1981]. Languages can be classified typologically by these parameters [Greenberg 1963]. An example of machine translation that uses universal principles and their parameters is [Dorr 1993]. According to Greenberg's parameterized word order, the standard word order of Korean can be stated as follows:

(3) Standard Word Order of Korean

SOV

Number-Noun

Demonstrative-Noun

Adjective-Noun

Possessive Pronoun-Noun

Relative clause-Noun

This standard word order gives a clue to the parameter settings of an individual language. In the next section we will see the parameterized common grammatical rules for Korean.

4.1 Korean analysis by parameterized common rules

According to the standard word order of Korean, the head word must always follow its argument or modifier. We can therefore select the head-final common rules for Korean from the multilingual common grammatical rules in Table 1. The head-final rules of Table 1 and the Head Feature Principle (HFP), which percolates the information of a lexical head to its phrase, are as follows. (The coordination structure of Korean could be treated as part of the 'argument-functional word' structure; we keep it as a ternary structure for efficient analysis.)

Table 2. Parameterized common grammar rules for Korean

head-final-structure    head-middle-structure
PRED => ARG PRED        COORD => ARG1 COORD ARG2
MODED => MOD MODED
FUNCT => ARG FUNCT

(4) Head_Feature_Principle = {head:HEAD}.[{},{head:HEAD}].
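Rule (4) can be read as follows: in a binary head-final structure, the mother shares its 'head' value with the second daughter. A minimal Python sketch of this reading is given here. The function `project_head_final` is a hypothetical name, structure sharing is modelled by letting mother and head daughter point to the same dictionary, and the category values "adj" and "noun" are assumptions.

```python
def project_head_final(non_head, head_daughter):
    """Build the mother of a head-final structure.

    By the Head Feature Principle the mother's 'head' value is the very same
    object as the 'head' value of the head daughter (the last daughter).
    """
    return {"head": head_daughter["head"], "daughters": [non_head, head_daughter]}


# MODED => MOD MODED applied to two words of the example sentence;
# the 'cat' values are hypothetical.
modifier = {"lex": "saylowun", "head": {"cat": "adj"}}
modified = {"lex": "kyeyhoykanul", "head": {"cat": "noun"}}
phrase = project_head_final(modifier, modified)
```

The resulting phrase carries the head features of its second daughter and not those of the modifier.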

A Korean sentence analysed by the parameterized common grammar rules and the HFP yields the following result:

(5) cengpwunun saylowun kyeyhoykanul malyenhayessta.
government+SUBJ new plan+OBJ make+PAST+DECL
The government made a new plan.

In (5) the fine line shows the application of 'FUNCT => ARG FUNCT', the dotted line that of 'MODED => MOD MODED' and the thick line that of 'PRED => ARG PRED'.

4.2 Korean analysis by grammatical constraint rules

In analysing Korean for machine translation, we must pay special attention to the following characteristics [Oh 1994]:

(6) Korean Characteristics

Phonological peculiarity
sonyen-i, sonye-ka
boy-SUBJ, girl-SUBJ
boy, girl

Double objects
kunun seoulul yehayngul hayessta.
He-SUBJ Seoul-OBJ trip-OBJ make-PAST-DECL
He made a trip to Seoul.

Honorifics
kyoswunimkkeyse osipnita.
professor-SUBJ(HON) come-HON-DECL
The professor comes.

These peculiarities of Korean can be handled by the constraint rules. Table 3 shows the relation between the common rules and their constraint rules.

Table 3. Common rules and constraint rules

Korean characteristics        Common rules          Constraint rules
Phonological peculiarities    FUNCT => ARG FUNCT    Phonological rule
Double objects                PRED => ARG PRED      Argument exchange
Honorifics                    HFP                   Context information

- Phonological rule

Every morpheme records its last phoneme, which is subcategorized for and predicted by the functional word that attaches to it.

example) sonyen{phon:con} i{phon:voc,frame:{arg1:{phon:con}}}
boy{phon:con} SUBJ{phon:voc,frame:{arg1:{phon:con}}}
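The phonological rule can be sketched as a feature check between a stem and the 'arg1' frame of the functional word. In the Python sketch below, the entries for 'sonyen' and 'i' follow the example above; the entries for 'sonye' and 'ka' are assumptions made by analogy with the pair sonyen-i, sonye-ka, and the names `LEXICON`, `satisfies` and `can_attach` are hypothetical.

```python
# 'phon' records the last phoneme of a morpheme: consonant ('con') or vowel ('voc').
LEXICON = {
    "sonyen": {"phon": "con"},
    "i": {"phon": "voc", "frame": {"arg1": {"phon": "con"}}},
    # The next two entries are assumed by analogy; they are not given in the text.
    "sonye": {"phon": "voc"},
    "ka": {"phon": "voc", "frame": {"arg1": {"phon": "voc"}}},
}


def satisfies(features, constraint):
    """True if every attribute required by the constraint has the required value."""
    return all(features.get(attr) == value for attr, value in constraint.items())


def can_attach(stem, functional_word):
    """FUNCT => ARG FUNCT applies only if the stem fits the functional word's arg1."""
    return satisfies(LEXICON[stem], LEXICON[functional_word]["frame"]["arg1"])
```

With these entries the subject marker 'i' is accepted after the consonant-final stem 'sonyen' and rejected after the vowel-final stem 'sonye', and the reverse holds for 'ka'.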

- Argument exchange

The subcategorization structure of the functional verb 'hata (= do/make)' and that of the predicate noun are exchanged with each other in the lexicon:

Table 4. Lexicon of 'hata (do/make)'

lex      hata
frame    arg1    ARG1
         arg2    ARG2
         arg3    cat      noun
                 frame    arg1    ARG1
                          arg2    ARG2

example) kunun(arg1) seoulul(arg2) yehayngul(arg3(arg1,arg2)) ha(arg1,arg2,arg3)yessta.
He-SUBJ Seoul-OBJ trip-OBJ make-PAST-DECL
He made a trip to Seoul.
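The argument exchange amounts to structure sharing: the variables ARG1 and ARG2 occur both in the frame of 'hata' and in the frame of its predicate noun (arg3). A Python sketch of this idea, under the assumption that shared variables can be modelled as shared dictionary objects, is given here; `make_hata_entry` is a hypothetical helper and the stems 'ku' and 'seoul' are an illustrative segmentation of 'kunun' and 'seoulul'.

```python
def make_hata_entry():
    """Lexicon entry of 'hata' with ARG1 and ARG2 shared between two frames."""
    arg1, arg2 = {}, {}  # the shared variables ARG1 and ARG2
    return {
        "lex": "hata",
        "frame": {
            "arg1": arg1,
            "arg2": arg2,
            "arg3": {"cat": "noun", "frame": {"arg1": arg1, "arg2": arg2}},
        },
    }


entry = make_hata_entry()
# Filling the arguments of 'hata' from the sentence ...
entry["frame"]["arg1"]["lex"] = "ku"
entry["frame"]["arg2"]["lex"] = "seoul"
# ... makes the same fillers visible in the frame of the predicate noun.
noun_frame = entry["frame"]["arg3"]["frame"]
```

In this way both objects of the double object construction are licensed: 'seoulul' fills arg2, which the predicate noun 'yehayngul' (arg3) shares with 'hata'.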

- Context information

The context information of the sentence subject must agree with that of the verb phrase.

example) kyoswunimkkeyse(context:honor) osi(context:honor)pnita.
professor-SUBJ(HON) come-HON-DECL
The professor comes.
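The agreement of context information can be sketched as an equality check on a 'context' feature. The Python fragment below is an illustration only: the function `context_agrees` is a hypothetical name, the value 'honor' follows the example above, and the non-honorific verb form 'opnita' with the value 'plain' is an assumption introduced for contrast.

```python
def context_agrees(subject, verb_phrase):
    """True if subject and verb phrase carry the same context information."""
    return subject.get("context") == verb_phrase.get("context")


subject = {"lex": "kyoswunimkkeyse", "context": "honor"}
honorific_vp = {"lex": "osipnita", "context": "honor"}
# Hypothetical non-honorific counterpart, used only to show a failed agreement.
plain_vp = {"lex": "opnita", "context": "plain"}
```

An analysis that combines the honorific subject with the plain verb phrase would be rejected by this constraint.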

4.3 Transfer constraint rules

The syntactic tree of Korean is converted into a semantic tree through tree transformation. The semantic tree has a 'predicate-argument-modifier' arrangement, and the HFP also applies to its nodes. The transformation rules convert the Korean syntactic tree (5) into the following semantic tree:

(7) cengpwunun saylowun kyeyhoykanul malyenhayessta.
government-SUBJ new plan-OBJ make-PAST-DECL
The government made a new plan.

The semantic tree becomes the input of transfer. All semantic trees that can be transferred compositionally are transferred to the target language by the 'common structural transfer rules' and the 'common information transfer rules'. There are, however, cases of compositional transfer to which the common information transfer rules cannot be applied. Idiomatic expressions with the functional verbs 'hata (do/make)' or 'toyta (be done/be made)' are an example. During the transformation from syntactic tree to semantic tree we delete 'hata' and copy its information to the feature 'functional verb' of the predicate noun, so that the predicate noun becomes the predicate of the sentence. But there is no multilingual rule that controls the relation between the predicate noun of the source language and the predicate noun of the target language, or between the predicate noun of the source language and a verb or adjective of the target language. For this reason we need rules that constrain the common transfer rules. These are the transfer constraint rules for the common information transfer rules:

(8) Constraint rule of predicate noun

idiomatic expression vs idiomatic expression

Copy the information of the Korean functional verb to the functional verb of the target language if the lexeme of the target language has a functional verb equivalent to that of the Korean idiomatic expression with 'hata'.

ex.) sanpolul hata => take a walk, einen Spaziergang machen, sanpowo suru
ilul hata => sikotowo suru

idiomatic expression vs verb or adjective

Copy the information of the Korean functional verb to the lexeme of the target language if the lexeme of the target language has no functional verb equivalent to that of the Korean idiomatic expression with 'hata'.

ex.) ilul hata => work, arbeiten
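The two cases of constraint rule (8) can be sketched in Python as a lookup followed by one of two copy operations. The sketch rests on several assumptions: the table `TRANSFER_LEXICON` and the function `transfer_predicate_noun` are hypothetical, the stems 'sanpo' and 'il' are an illustrative segmentation of the examples, and the feature 'tense' stands in for whatever information the functional verb actually carries.

```python
# Target entries taken from the examples in (8); the table layout is hypothetical.
TRANSFER_LEXICON = {
    ("sanpo", "English"): {"lex": "walk", "functional_verb": "take"},
    ("sanpo", "German"): {"lex": "Spaziergang", "functional_verb": "machen"},
    ("sanpo", "Japanese"): {"lex": "sanpo", "functional_verb": "suru"},
    ("il", "Japanese"): {"lex": "sikoto", "functional_verb": "suru"},
    ("il", "English"): {"lex": "work"},
    ("il", "German"): {"lex": "arbeiten"},
}


def transfer_predicate_noun(node, target_language):
    """Transfer a predicate noun whose 'functional_verb' feature holds 'hata'."""
    entry = TRANSFER_LEXICON[(node["lex"], target_language)]
    # Information of the Korean functional verb, without its lexeme.
    verb_info = {k: v for k, v in node["functional_verb"].items() if k != "lex"}
    target = {"lex": entry["lex"]}
    if "functional_verb" in entry:
        # Case 1: idiomatic expression vs idiomatic expression.
        target["functional_verb"] = {"lex": entry["functional_verb"], **verb_info}
    else:
        # Case 2: idiomatic expression vs verb or adjective.
        target.update(verb_info)
    return target


# 'tense' is a stand-in feature for the information carried by 'hata'.
walk = {"lex": "sanpo", "functional_verb": {"lex": "hata", "tense": "past"}}
work = {"lex": "il", "functional_verb": {"lex": "hata", "tense": "past"}}
```

For German, for instance, the first node receives the functional verb 'machen' together with the copied information, while the second node is mapped to the single verb 'arbeiten', which receives that information directly.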

5 Conclusion

In this paper we have proposed a new philosophy of multilingual machine translation that accepts the commonness of languages in order to reduce the memory space of the multilingual machine translation system and to simplify the transfer process. This philosophy rests on common rules for many languages and constraint rules for individual languages. For example, the analysis of Korean is handled by the parameterized common rules and the constraint rules, and the transfer from Korean to other target languages is handled by the common structural transfer rules, the common information transfer rules, and the transfer constraint rules. The following table shows the size of the common and constraint rules used for the analysis and transfer of Korean in the translation of 300 Korean sentences into English or German.

Syntactic Analysis        Semantic Analysis         Transfer
Common    Constraint      Common    Constraint      Common    Constraint

- Further work

Although multilingual machine translation by common rules and constraint rules performs reasonably well, reducing the number of analysis rules and simplifying the transfer process, many problems remain to be solved:

Truncation of the number of the parse trees

Conflict between the old and the new lexical information

Recognizing the idiomatic expressions and collocations

Disambiguation of polysemy

In order to solve the problems we are testing the following methods:

Usage of the probabilistic method

Information processing by the multiple inheritance

Implementation of the compound unit recognizer

Usage of the domain

References

[Oh 1994] Kil-Lok Oh, Key-Sun Choi, Sey-Young Park. Korean Language Engineering. Tae-Young-Sa, 1994. (in Korean)

[Chomsky 1981] Chomsky. Lectures on Government and Binding. The Pisa Lectures. Studies in Generative Grammar 9. Dordrecht, Holland & Cinnaminson, U.S.A.: Foris Publications, 1981.

[Greenberg 1963] Greenberg. Some universals of grammar with particular reference to the order of meaningful elements. In: Joseph Greenberg (ed.), Universals of Language, 2nd edition. Cambridge, Massachusetts: The M.I.T. Press, 1963.

[Dorr 1993] Dorr. Machine Translation: A View from the Lexicon. Cambridge, Massachusetts and London, England: MIT Press, 1993.

[Hutchins 1992] Hutchins, Somers. An Introduction to Machine Translation. Academic Press, 1992.

[Jackendoff 1977] Jackendoff. X-bar Syntax: A Study of Phrase Structure. Cambridge: MIT Press, 1977.

[Lewis 1992] Lewis. Computers and Translation. In: Christopher Butler (ed.), Computers and Written Texts. Blackwell, 1992, 75-114.

[Pollard 1994] Pollard, Sag. Head-Driven Phrase Structure Grammar. Studies in Contemporary Linguistics. Chicago & London: The University of Chicago Press, 1994.

[Sharp 1994] Sharp. CAT2 Reference Manual Version 3.6. IAI Working Papers N.27. Saarbruecken, Germany, 1994.

Endnote

This paper summarizes an experiment with the multilingual machine translation system CAT2 [Sharp 1994]. The CAT2 system currently runs on a UNIX workstation. It is implemented in PROLOG and uses a 'constraint bottom-up chart' parser. We currently translate Korean into English and German, and are testing translation from Korean into French, Chinese, Russian, and Japanese.

Appendix 1. Multilingual common grammar rules written in CAT2 notation

Appendix 2. Multilingual Lexicon Information Structure