On Consensus between Tree-representations of Linguistic Data

Michel Juillard, Université Nice-Sophia Antipolis, France
Xuan Luong, Université Nice-Sophia Antipolis, France

ALLC/ACH 2000, University of Glasgow, Glasgow, 2000
Editors: Jean Anderson, Amal Chatterjee, Christian J. Kay, Margaret Scott
Encoder: Sara A. Schmidt
Computational / Corpus Linguistics

1. Introduction

One of the aims of modern linguistics, particularly of the computational
persuasion, is to infer from the ever-growing mass of actual data available,
the implicit, virtual organization underlying the apparent disorder and
diversity of surface phenomena.

This ever-present crucial duality is also at work in computational
linguistics where the chief question is how to reach, beyond the teeming,
bristling surface of observed individual facts, for the latent abstract
organisation, thus enabling the observer (i.e. the linguist) to gain access
to knowledge that can be generalized.

2. Trees

Tree-representation is a powerful means of evincing the inherent structure of
mutually dependent data.

Scholars in the main fields of taxonomy regularly and successfully avail
themselves of tree-structures, e.g. genealogies, pedigrees and
phylogenies.

Chomsky's syntagmatic trees have grown under every clime but they are far
from being the sole way of imaging linguistic dependence or independence of
the represented objects by means of a hierarchic tree where clearly outlined
categories are paired and embedded.

Frequently enough, modern linguists tend to be interested more in the
relative closeness of objects than in their belonging to this or that closed
class. Additive, as opposed to hierarchic, trees do away with watertight
partitions between objects and lay the stress on notions such as proximity
and opposition. Figure 1 illustrates this new way of representing textual
data. The linguistic units under scrutiny here are modals and auxiliaries in
a body of contemporary English poetry.

The information contained in figure 1 is rich and clear. The organisation of
the whole structure rests on the notions of proximity and opposition. The
present auxiliaries have and be are closely associated, while their past-form counterparts
(had; was and were) form another distinct pair in the opposite part
of the tree. More generally, the top of the tree can be seen as gathering
the past forms whereas the present tense forms congregate in the bottom
part. Deleting any one of the edges of the structure breaks the tree down into
two connected components. The case of should and would is interesting in that the two modals occupy an
intermediate position between past and present which reflects their
specificities in the actual texts.

3. Going further with trees

The unrooted-tree representation (figure 5) makes conspicuous properties of
coordinating conjunctions that were, of course, impossible to discern in the
table of occurrences, let alone in the lines of the original text (Day
Lewis, complete poetry).

The figure sets the very tightly-knit pair but-and against the rest of the data, which in turn forms three other
groups of coordinators (respectively either-or-which, neither-nor, then-yet-than) that are similar in behaviour,
although more independent of each other than the previous two (and-but).

Figure 6 illustrates the behaviour of the same grammatical units in a body of
contemporary English novels.

Without going into unnecessary detail, it is clear that this tree imposes
unity or, at any rate, very close proximity on elements that evinced more
independence in the previous tree (figure 5), neither-nor and either-or now forming
more conspicuous pairs, while than becomes more
closely associated with the structure as it teams up with but.

3.1. Fusion of the previous two trees

The question of course arises of the possibility of representing the two
distinct sets of original data in one tree-figure, considering that they
correspond to the same grammatical units at work in two provinces of
literature (poetry and the novel) but in the same language and in the
same period of time.

Since these two original sets of data are technically disparate, it is
impossible to start from the initial numbers as such - for instance, by
adding them up.

The only procedure available is to attempt to achieve a consensus by
fusion of the original two trees by means of a new algorithm which we
have just devised. Figure 7 is the product of this fusion algorithm. It is interesting to observe first of all that this representation does
sum up the information of the trees in figures 5 and 6. Not only are the properties of
each single separate tree preserved, which is indeed a prerequisite, but
there also emerges a more legible picture of the actual syntagmatic
roles and affinities of the function-words under scrutiny here. The
correlative conjunctions either-or and neither-nor are more satisfactorily grouped
together, then enters into a close set with
but and and, while
the proximity of than and then on the tree is evocative of their common etymology,
although they do not form a set stricto
sensu.

3.2. The fusion algorithm

The fusion algorithm is derived from the topological properties of the
tree.

Consider two trees A and B. Let VA and VB be the matrices of the corresponding
neighbourhoods.

Let VAB denote the Cartesian product whose
elements are (x,y) with x ∈ VA and y ∈
VB. We shall build on VAB a preorder induced by the preorders of the
neighbourhood levels of VA and of VB and define a neighbourhood relation on
VAB compatible with the topologies of
A and B. The
fusion of the two trees shall ensue.

Preorder:
Two elements (x,y) and (u,v) of VAB are:
- ordered by the relation < if and only if x+y < u+v
- equivalent by the relation ≅ if and only if x+y = u+v

Neighbours:
A set G of elements of X is made up of neighbours if and only if all
the pairs of distinct elements of G are:
- minimal by the relation < in VAB and
- equivalent by the relation ≅ in VAB.

Algorithm
Calculate the matrices of the neighbourhoods of A and of B.
Build VAB.
(iter): Look for the minimal elements of VAB. Use VAB in order to determine the neighbours.
- Each set of neighbours G is represented by a single one of its elements, z.
- For each set G of k elements, remove from X the k-1 elements other than z. Delete the corresponding rows and columns in VAB.
If the numbers of rows and columns are larger than 3, go to (iter); otherwise the algorithm ends.
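
The procedure can be illustrated by a minimal Python sketch, offered as one possible reading of the steps and not as the authors' implementation. Several points are assumptions: X is taken to be the set of objects shared by the two trees; the element of VAB attached to a pair of objects is taken to be the pair of entries found at the same position in VA and VB; the entries themselves are toy edge counts between leaves of two invented six-leaf trees; and sets of neighbours are grown greedily in label order. The names fuse, LABELS, VA and VB are hypothetical.

```python
from itertools import combinations


def fuse(labels, va, vb):
    """Return, for each pass of (iter), the sets of neighbours that were found.

    labels: the objects X; va, vb: symmetric neighbourhood matrices indexed
    like labels (assumed layout, not specified in the source).
    """
    # Preorder on VAB: the pair (x, y) is ranked by the sum x + y.
    score = {
        (a, b): va[i][j] + vb[i][j]
        for i, a in enumerate(labels)
        for j, b in enumerate(labels)
        if i != j
    }
    active = list(labels)
    steps = []
    # Iterate while more than 3 rows and columns remain.
    while len(active) > 3:
        lowest = min(score[a, b] for a, b in combinations(active, 2))
        grouped, groups = set(), []
        for a in active:
            if a in grouped:
                continue
            group = [a]
            for b in active:
                if b == a or b in grouped:
                    continue
                # b joins only if it is minimal with every current member,
                # so all pairs inside the group are minimal and equivalent.
                if all(score[g, b] == lowest for g in group):
                    group.append(b)
            if len(group) > 1:
                groups.append(group)
                grouped.update(group)
        steps.append(groups)
        # Keep the representative z (first member), drop the k-1 others.
        removed = {b for group in groups for b in group[1:]}
        active = [a for a in active if a not in removed]
    return steps


# Toy data (invented for illustration): edge counts between leaves.
LABELS = ["a", "b", "c", "d", "e", "f"]
VA = [
    [0, 2, 3, 4, 5, 5],
    [2, 0, 3, 4, 5, 5],
    [3, 3, 0, 3, 4, 4],
    [4, 4, 3, 0, 3, 3],
    [5, 5, 4, 3, 0, 2],
    [5, 5, 4, 3, 2, 0],
]
VB = [
    [0, 2, 4, 3, 5, 5],
    [2, 0, 4, 3, 5, 5],
    [4, 4, 0, 3, 3, 3],
    [3, 3, 3, 0, 4, 4],
    [5, 5, 3, 4, 0, 2],
    [5, 5, 3, 4, 2, 0],
]

steps = fuse(LABELS, VA, VB)
print(steps)
```

On these toy matrices the first pass groups a with b and e with f, and the second pass groups c with d, after which three rows and columns remain and the loop stops. The representative of a group keeps its own row unchanged, which follows the wording of the steps literally; whether the entries should be recomputed after each grouping is not stated in the text.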