On Consensus between Tree-representations of Linguistic Data

Michel Juillard

Université Nice-Sophia Antipolis, France

Xuan Luong

Université Nice-Sophia Antipolis, France

2000

University of Glasgow

Glasgow

ALLC/ACH 2000

Editors: Jean Anderson, Amal Chatterjee, Christian Kay, Margaret Scott

Encoder: Sara Schmidt

Computational / Corpus Linguistics

1. Introduction

One of the aims of modern linguistics, particularly of the computational persuasion, is to infer, from the ever-growing mass of available data, the implicit, virtual organization underlying the apparent disorder and diversity of surface phenomena.

This ever-present crucial duality is also at work in computational linguistics where the chief question is how to reach, beyond the teeming, bristling surface of observed individual facts, for the latent abstract organisation, thus enabling the observer (i.e. the linguist) to gain access to knowledge that can be generalized.

2. Trees

Tree-representation is a powerful means of evincing the inherent structure of mutually dependent data.

Scholars in the main fields of taxonomy regularly and successfully avail themselves of tree-structures, e.g. genealogies, pedigrees and phylogenies.

Chomsky's syntagmatic trees have flourished in every climate, but the hierarchic tree, in which clearly outlined categories are paired and embedded, is far from being the sole way of picturing the dependence or independence of linguistic objects.

Frequently enough, modern linguists tend to be interested more in the relative closeness of objects than in their belonging to this or that closed class. Additive, as opposed to hierarchic, trees do away with watertight partitions between objects and lay the stress on notions such as proximity and opposition. Figure 1 illustrates this new way of representing textual data. The linguistic units under scrutiny here are modals and auxiliaries in a body of contemporary English poetry.

The information contained in figure 1 is rich and clear. The organisation of the whole structure rests on the notions of proximity and opposition. The present auxiliaries have and be are closely associated, while their past-form counterparts, had and was/were, form another distinct pair in the opposite part of the tree. More generally, the top of the tree can be seen as gathering the past forms, whereas the present-tense forms congregate in the bottom part. Deleting any one edge of the structure breaks the tree into two connected components. The case of should and would is interesting in that the two modals occupy an intermediate position between past and present, which reflects their specificities in the actual texts.
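The edge-deletion property mentioned above can be checked on a small example. The following is a minimal Python sketch; the tree `toy_edges`, its internal nodes `n1` and `n2`, and the function name `components_after_cut` are hypothetical and do not reproduce the tree of figure 1.

```python
from collections import deque

def components_after_cut(edges, cut):
    """Return the connected components left after removing one edge from a tree."""
    adjacency = {}
    for u, v in edges:
        if {u, v} == set(cut):
            continue  # skip the deleted edge
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    nodes = sorted({n for edge in edges for n in edge})
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        # breadth-first traversal collects everything reachable from `start`
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in adjacency.get(node, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components

# Hypothetical toy tree (NOT figure 1): two leaves hang from each internal node.
toy_edges = [("have", "n1"), ("be", "n1"), ("n1", "n2"), ("had", "n2"), ("was", "n2")]
parts = components_after_cut(toy_edges, ("n1", "n2"))
```

In this toy case, cutting the central edge separates the present forms from the past forms, which is the kind of opposition the figure is read for.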

3. Going further with trees

The unrooted-tree representation (figure 5) makes conspicuous properties of coordinating conjunctions that were, of course, impossible to discern in the table of occurrences, let alone in the lines of the original text (Day Lewis, complete poetry).

The figure opposes the very tightly-knit pair but-and to the rest of the data which, in turn, form three other groups of coordinators (respectively either-or-which, neither-nor, then-yet-than) that are similar in behaviour, although more independent of each other than the previous two (and-but).

Figure 6 illustrates the behaviour of the same grammatical units in a body of contemporary English novels.

Without going into unnecessary detail, it is clear that this tree imposes unity or, at any rate, very close proximity on elements that evinced more independence in the previous tree (figure 5): neither-nor and either-or now form more conspicuous pairs, while than becomes more closely associated with the structure as it teams up with but.

3.1. Fusion of the previous two trees

The question naturally arises whether the two distinct sets of original data can be represented in a single tree, considering that they correspond to the same grammatical units at work in two provinces of literature (poetry and the novel), in the same language and in the same period of time.

Since these two original sets of data are technically disparate, it is impossible to start from the initial numbers as such - for instance, by adding them up.

The only procedure available is to attempt to achieve a consensus by fusion of the original two trees by means of a new algorithm which we have just devised. Figure 7 is the product of this fusion algorithm.

It is interesting to observe first of all that this representation does sum up the information of trees 5 and 6. Not only are the properties of each single separate tree preserved, which is indeed a prerequisite, but there also emerges a more legible picture of the actual syntagmatic roles and affinities of the function-words under scrutiny here. The correlative conjunctions either-or and neither-nor are more satisfactorily grouped together, then enters into a close set with but and and, while the proximity of than and then on the tree is evocative of their common etymology, although they do not form a set stricto sensu.

3.2. The fusion algorithm

The fusion algorithm is derived from the topological properties of the tree.

Consider two trees A and B. Let VA and VB be the matrices of the corresponding neighbourhoods.

We denote by VAB the Cartesian product whose elements are the pairs (x,y) with x ∈ VA and y ∈ VB. On VAB we build a preorder induced by the preorders of the neighbourhood levels of VA and of VB, and define a neighbourhood relation on VAB compatible with the topologies of A and B. The fusion of the two trees follows.
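As an illustration of how VA, VB and VAB might be built, here is a minimal Python sketch. It rests on two assumptions that the text does not spell out: the neighbourhood level of two objects is taken to be the number of edges on the path joining them in the tree, and VAB is built by pairing, for each couple of objects, its level in A with its level in B. The toy trees and all function names are hypothetical.

```python
from collections import deque

def neighbourhood_matrix(edges, leaves):
    """Map each pair of leaves to the edge count of the path joining them
    (assumed reading of 'neighbourhood level')."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)
    matrix = {}
    for source in leaves:
        # breadth-first search gives path lengths from `source` to every node
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in distance:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
        for target in leaves:
            if target != source:
                matrix[frozenset((source, target))] = distance[target]
    return matrix

def product_matrix(va, vb):
    """Pair the level in A with the level in B for every couple of objects."""
    return {pair: (va[pair], vb[pair]) for pair in va}

# Hypothetical toy trees over the same four objects a, b, c, d.
leaves = ["a", "b", "c", "d"]
tree_a = [("a", "p"), ("b", "p"), ("p", "q"), ("c", "q"), ("d", "q")]
tree_b = [("a", "r"), ("c", "r"), ("r", "s"), ("b", "s"), ("d", "s")]

va = neighbourhood_matrix(tree_a, leaves)
vb = neighbourhood_matrix(tree_b, leaves)
vab = product_matrix(va, vb)
```

Storing each matrix as a dictionary keyed by unordered pairs keeps the symmetry of the neighbourhood relation implicit; a square array would serve equally well.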

Preorder: Two elements (x,y) and (u,v) of VAB are:

- ordered by the relation < if and only if x+y < u+v

- equivalent by the relation ≅ if and only if x+y = u+v

Neighbours: A set G of elements of X (the set of objects represented in both trees) is made up of neighbours if and only if all the pairs of distinct elements of G are:

- minimal by the relation < in VAB and

- equivalent by the relation ≅ in VAB.
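The two relations and the neighbour condition above can be sketched in Python as follows. This is only one possible reading: VAB is assumed to map each unordered pair of objects to its pair of levels (x,y), and the greedy grouping used to collect the sets G is an assumption of the sketch, since the text does not say how overlapping candidates should be resolved. The names `rank`, `neighbour_sets` and `toy_vab` are hypothetical.

```python
def rank(levels):
    """Sum x+y on which both the order and the equivalence are defined."""
    x, y = levels
    return x + y

def neighbour_sets(vab):
    """Group objects whose pairwise entries in VAB are all minimal
    (and therefore all equivalent, since they share the lowest sum)."""
    lowest = min(rank(levels) for levels in vab.values())
    minimal = {pair for pair, levels in vab.items() if rank(levels) == lowest}
    items = sorted({item for pair in minimal for item in pair})
    groups, used = [], set()
    for item in items:
        if item in used:
            continue
        group = [item]
        for other in items:
            if other in used or other in group:
                continue
            # admit `other` only if it is minimal with every member so far
            if all(frozenset((other, member)) in minimal for member in group):
                group.append(other)
        if len(group) > 1:
            groups.append(set(group))
            used.update(group)
    return groups

# Hypothetical values for four objects a, b, c, d.
toy_vab = {
    frozenset("ab"): (2, 2), frozenset("cd"): (2, 3),
    frozenset("ac"): (3, 3), frozenset("ad"): (3, 3),
    frozenset("bc"): (3, 3), frozenset("bd"): (3, 3),
}
groups = neighbour_sets(toy_vab)
```

With these values only the pair (a, b) reaches the lowest sum, so it forms the single set of neighbours; (2,3) and (3,2) would be equivalent to each other, and both would precede (3,3).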

Algorithm

- Calculate the matrices of the neighbourhoods of A and of B.

- Build VAB.

- (iter): Look for the minimal elements of VAB.

- Use VAB to determine the sets of neighbours.

- Each set of neighbours G is represented by a single one of its elements, z.

- For each set G of k elements, remove from X the k-1 elements other than z.

- Delete the corresponding rows and columns in VAB.

- If the number of rows and columns is larger than 3, go to (iter); otherwise the algorithm ends.
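The steps above can be put together in a short Python sketch. It is an illustrative reading under stated assumptions, not the authors' implementation: the neighbourhood matrices are supplied directly as hypothetical values, objects carry one-letter names, ties are resolved greedily, and the function only records which objects are merged at each pass rather than drawing the fused tree.

```python
def fuse(va, vb, items):
    """Return (merge steps, final clusters) for two neighbourhood matrices
    given as dicts mapping frozenset({i, j}) to a level."""
    vab = {pair: (va[pair], vb[pair]) for pair in va}   # build VAB
    active = sorted(items)
    members = {item: {item} for item in active}
    steps = []
    while len(active) > 3:                              # (iter)
        alive = set(active)
        sums = {p: x + y for p, (x, y) in vab.items() if p <= alive}
        lowest = min(sums.values())
        minimal = {p for p, s in sums.items() if s == lowest}
        used = set()
        for z in active:
            if z in used:
                continue
            group = [z]
            for other in active:
                if other in used or other in group:
                    continue
                if all(frozenset((other, g)) in minimal for g in group):
                    group.append(other)
            if len(group) > 1:
                used.update(group)
                for other in group[1:]:
                    # z stands for the whole set; the others are dropped,
                    # which removes their rows and columns from later passes
                    members[z] |= members.pop(other)
                    alive.discard(other)
                steps.append(set(members[z]))
        active = [item for item in active if item in alive]
    return steps, [members[item] for item in active]

def levels(spec):
    """Helper: turn {'ab': 2, ...} into a matrix keyed by unordered pairs."""
    return {frozenset(key): value for key, value in spec.items()}

# Hypothetical neighbourhood levels for five objects a..e in trees A and B.
va = levels({"ab": 2, "ac": 3, "ad": 4, "ae": 4, "bc": 3,
             "bd": 4, "be": 4, "cd": 3, "ce": 3, "de": 2})
vb = levels({"ab": 2, "ac": 3, "ad": 3, "ae": 3, "bc": 3,
             "bd": 3, "be": 3, "cd": 2, "ce": 2, "de": 2})
steps, clusters = fuse(va, vb, "abcde")
```

Here the pairs (a, b) and (d, e) share the lowest sum in the first pass, so both are collapsed at once and the loop stops with three representatives left.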