LEXSTATS: A program for the statistical analysis of word frequency distributions

Harald Baayen

University of Nijmegen

Max Planck Institute for Psycholinguistics

baayen@mpi.nl

Fiona Tweedie

Department of Statistics, University of Glasgow

fiona@stats.gla.ac.uk

1999

University of Virginia

Charlottesville, VA

ACH/ALLC 1999

Encoder: Sara Schmidt

Various computationally intensive statistical models are available for the analysis of word frequency distributions (e.g., Carroll, 1967; Sichel, 1975; Chitashvili and Baayen, 1993). These models provide linguists and lexicographers with elegant means for obtaining sample-size invariant characteristic textual measures, for extrapolating the development of the vocabulary to sample sizes larger than the observed text size, and for estimating the population vocabulary size.
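As a rough illustration of the quantities these models work with, the following Python sketch computes a frequency spectrum (the number of word types occurring exactly m times) and uses simple binomial interpolation to estimate the expected vocabulary size at a smaller sample size. This is not LEXSTATS code and not one of the parametric models cited above; the function names `frequency_spectrum` and `expected_vocabulary` and the toy sentence are assumptions made for the example. Binomial interpolation only covers sample sizes up to the observed text size, whereas extrapolation beyond it is what requires the cited models.

```python
from collections import Counter


def frequency_spectrum(tokens):
    """Return a mapping m -> V(m, N): the number of types occurring exactly m times."""
    type_frequencies = Counter(tokens)          # type -> frequency
    return Counter(type_frequencies.values())   # frequency -> number of types


def expected_vocabulary(spectrum, n):
    """Expected number of types in a random subsample of n tokens.

    Uses binomial interpolation: a type with frequency m in the full sample
    of N tokens is missing from the subsample with probability (1 - n/N)**m.
    Valid only for 0 <= n <= N (interpolation, not extrapolation).
    """
    total_tokens = sum(m * v for m, v in spectrum.items())
    if not 0 <= n <= total_tokens:
        raise ValueError("n must lie between 0 and the observed sample size")
    total_types = sum(spectrum.values())
    p_missing = 1 - n / total_tokens
    return total_types - sum(v * p_missing ** m for m, v in spectrum.items())


# Hypothetical toy text, used only to exercise the functions.
tokens = "the cat sat on the mat and the dog sat on the log".split()
spectrum = frequency_spectrum(tokens)

for n in (0, 4, 8, len(tokens)):
    print(n, round(expected_vocabulary(spectrum, n), 3))
```

At the full sample size the interpolated value equals the observed number of types, and it shrinks towards zero as the subsample size decreases, which is the sample-size dependence that the invariant measures are meant to avoid.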

Thus far, these models have not been widely used, which is not surprising given the absence of software implementing them. At the conference, we will present the beta version of LEXSTATS, a user-friendly graphical interface to a series of C programs that implement a wide range of word frequency analyses. LEXSTATS and the underlying C code will become available as freeware under the GNU software license.

We will illustrate LEXSTATS by applying it to word frequency distributions of various kinds of texts as well as to word frequency distributions of a range of morphological categories.

References

Carroll (1967). On Sampling from a Lognormal Model of Word Frequency Distribution. In Kucera and Francis, Computational Analysis of Present-Day American English. Providence: Brown University Press, 406-424.

Chitashvili and Baayen (1993). Word Frequency Distributions. In Altmann and Hrebicek, Quantitative Text Analysis. Trier: Wissenschaftlicher Verlag Trier, 54-135.

Sichel (1975). On a Distribution Law for Word Frequencies. Journal of the American Statistical Association, 542-547.