What can Hyperplane-Classifiers tell us about Texts?

Edda Leopold, GMD German National Research Center for Information Technology, Institute for Autonomous Intelligent Systems
Jörg Kindermann, GMD German National Research Center for Information Technology, Institute for Autonomous Intelligent Systems

2001

Keywords: vector space representation, text classification

We want to report on our results with Support Vector Machines for text
classification in order to promote interdisciplinary dialogue. Our research
group consists mainly of statisticians and computer scientists and focuses on
the algorithmic side of text classification. We want to discuss our
experiences with researchers working in other fields of linguistic computing
and to ask about the implications of our results for linguistic approaches
that use vector space representations, such as "semantic spaces" and "latent
semantic indexing".

The algorithm known as "Support Vector Machines" can briefly be described as
follows (a more detailed description can be found in Vapnik 1998):

1. A set of labeled documents is needed for training. Documents are mapped to
their type-frequency vectors. These vectors span a high-dimensional input
space (every type represents one dimension). This kind of abstraction from
syntagmatic structures is often referred to as the "bag-of-words" approach.

2. The algorithm searches for a hyperplane in the input space which optimally
separates the training documents.

3. Documents of a test set are attributed to one of the classes depending on
which side of the hyperplane they are located on.
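To make these three steps concrete, the following Python sketch reconstructs
them with the scikit-learn library. This is our illustration under assumed
tooling: the library, the toy corpus, and all identifiers are chosen for the
example and are not the software used in our experiments.

    # Step 1: map labeled training documents to type-frequency vectors.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC

    train_docs = [
        "stocks fell sharply on wall street",
        "the central bank raised interest rates",
        "the striker scored twice in the final",
        "the team won the championship match",
    ]
    train_labels = ["finance", "finance", "sports", "sports"]

    vectorizer = CountVectorizer()                  # one dimension per type
    X_train = vectorizer.fit_transform(train_docs)  # sparse frequency vectors

    # Step 2: search for a hyperplane that separates the training documents.
    classifier = LinearSVC()
    classifier.fit(X_train, train_labels)

    # Step 3: attribute test documents to a class according to the side of
    # the hyperplane on which their frequency vectors are located.
    test_docs = ["interest rates and stocks", "the final match of the season"]
    print(classifier.predict(vectorizer.transform(test_docs)))

Note that X_train is stored as a sparse matrix; this is what allows the
attribute space to grow to hundreds of thousands of dimensions, a point we
return to below.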
SVMs have proven to provide an effective means for text classification across
different languages (English and German), textual domains (English: Reuters
news, Ohsumed medical abstracts, e-mail newsgroups; German: the newspapers
taz, FR, and BZ, e-mail newsgroups), and tasks (topic identification,
authorship attribution, classification according to newspaper issues of
different years) (Joachims 1997; Joachims 1998; Drucker et al. 1999; Dumais et
al. 1998; Diederich, Kindermann, Leopold & Paaß 2000; Leopold & Kindermann
2001).

The great advantage of SVMs is that they can manage a very large number of
attributes (in our experiments we have worked with up to half a million
attributes), given that the attribute vectors are sparse. This makes it
possible to perform document classification directly on the frequency spectra
of documents without any kind of feature selection. This is why we think that
results on the precision/recall performance of Support Vector Machines can be
interpreted as statements about the frequency spectra of document collections,
and thus constitute a kind of linguistic evidence.

Another advantage of SVMs is that various kernel functions can be used. Kernel
functions correspond to a mapping of input vectors to an even
higher-dimensional feature space and can heuristically be interpreted as
different geometries in the input space ((hyper)planes may be substituted by,
e.g., (hyper)spheres).

The choice of the kernel function is crucial to most applications of support
vector machines. However, in the case of text classification, kernel functions
only slightly affect performance, although they imply completely different
geometries of the input space. So from the standpoint of retrieval performance
it is nearly irrelevant whether topic boundaries are defined by planes or by
spheres.
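To make this comparison concrete, the sketch below trains SVMs with a linear
and a radial-basis-function kernel on the same type-frequency vectors. The
dataset and parameters are illustrative assumptions, not our experimental
setup; the point is only that the two geometries can be evaluated on identical
input.

    # Compare two kernel geometries on the same bag-of-words input space:
    # a separating hyperplane (linear kernel) versus curved decision
    # surfaces (RBF kernel).
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import SVC

    categories = ["sci.med", "rec.autos"]
    train = fetch_20newsgroups(subset="train", categories=categories)
    test = fetch_20newsgroups(subset="test", categories=categories)

    vectorizer = CountVectorizer()
    X_train = vectorizer.fit_transform(train.data)
    X_test = vectorizer.transform(test.data)

    for kernel in ("linear", "rbf"):
        clf = SVC(kernel=kernel)
        clf.fit(X_train, train.target)
        print(kernel, clf.score(X_test, test.target))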
What does this mean for the bag-of-words approach, which represents documents
in the form of type-frequency vectors, and what does it mean for the quality
of co-occurrence of types within the context of a document? We will try to
give an answer in terms of the stochastic dependency of types.

Another observation we made is that lemmatization does not affect performance in
terms of precision and recall. In English, our results on the Reuters news
corpus obtained without any linguistic preprocessing do not differ
significantly from those obtained by Joachims (1998), who used the Porter
stemmer. In German, lemmatization also did not yield an improvement in
performance, which is surprising given the morphological richness of German.
However, our results agree with those obtained with neural nets on French news
data (Stricker 2000); neural nets, however, need a reduction of
dimensionality, whereas SVMs do not. An explanation of this finding is that
lemmatization leads to a loss of information, because different word forms are
mapped to the same lemma. A surprising result is that author identification is
also best done on the basis of word forms rather than on the basis of bigrams
of grammatical tags.
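A hypothetical way to test the effect of such preprocessing is sketched below,
with NLTK's Porter stemmer standing in for the lemmatization step; the data
and the setup are illustrative assumptions, not the pipeline used in our
experiments.

    # Compare classification on raw word forms against Porter-stemmed input.
    from nltk.stem import PorterStemmer
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC

    stemmer = PorterStemmer()

    def stem_tokens(doc):
        # Map each word form to its stem, e.g. "filters"/"filtering" -> "filter".
        return [stemmer.stem(tok) for tok in doc.split()]

    train = fetch_20newsgroups(subset="train", categories=["sci.med", "rec.autos"])
    test = fetch_20newsgroups(subset="test", categories=["sci.med", "rec.autos"])

    for name, tokenizer in (("word forms", None), ("stems", stem_tokens)):
        vectorizer = CountVectorizer(tokenizer=tokenizer)
        X_train = vectorizer.fit_transform(train.data)
        clf = LinearSVC().fit(X_train, train.target)
        print(name, clf.score(vectorizer.transform(test.data), test.target))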
We are currently working on multi-class classification using Support Vector
Machines (Kindermann et al. 2000). The problem here is to group the classes of
documents in an appropriate way. To this end we explore the inter- and
intra-class distances of type-frequency distributions.
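The sketch below illustrates the general technique of error-correcting output
codes, using scikit-learn's OutputCodeClassifier as a generic stand-in; it is
our illustration of the idea, not the implementation described in Kindermann
et al. (2000), and the dataset is again an assumption.

    # Each class receives a binary codeword; one binary SVM is trained per
    # code bit, and a document is assigned to the class whose codeword is
    # closest to the vector of bit predictions.
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.multiclass import OutputCodeClassifier
    from sklearn.svm import LinearSVC

    categories = ["sci.med", "rec.autos", "comp.graphics", "talk.politics.misc"]
    train = fetch_20newsgroups(subset="train", categories=categories)
    test = fetch_20newsgroups(subset="test", categories=categories)

    vectorizer = CountVectorizer()
    X_train = vectorizer.fit_transform(train.data)

    ecoc = OutputCodeClassifier(estimator=LinearSVC(), code_size=2, random_state=0)
    ecoc.fit(X_train, train.target)
    print(ecoc.score(vectorizer.transform(test.data), test.target))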
References

Diederich, Joachim / Kindermann, Jörg / Leopold, Edda / Paaß, Gerhard (2000):
"Authorship Attribution with Support Vector Machines". Poster presented at The
Learning Workshop, 4-7 April 2000, Snowbird, Utah.

Drucker, H. / Wu, D. / Vapnik, V. (1999): "Support vector machines for spam
categorization". IEEE Transactions on Neural Networks 10 (5): 1048-1054.

Dumais, Susan / Platt, John / Heckerman, David / Sahami, Mehran (1998):
"Inductive Learning Algorithms and Representations for Text Categorization".
In: Proceedings of ACM-CIKM-98, 7th International Conference on Information
Retrieval and Knowledge Management, 148-155.

Joachims, Thorsten (1998): "Text categorization with support vector machines:
learning with many relevant features". In: Proceedings of ECML-98, 10th
European Conference on Machine Learning (Lecture Notes in Computer Science
1398). Heidelberg: Springer, 137-142.

Kindermann, Jörg / Leopold, Edda / Paaß, Gerhard (2000): "Multiclass
Classification with Error Correcting Codes". In: Leopold, Edda / Kirsten,
Mathias (eds.): Treffen der GI-Fachgruppe 1.1.3 Maschinelles Lernen [Meeting
of GI Special Interest Group 1.1.3, Machine Learning], 56-64.

Leopold, Edda / Kindermann, Jörg: "Text Categorization with Support Vector
Machines. How to Represent Texts in Input Space?" Machine Learning, accepted
for publication.

Stricker, M. (2000): Réseaux de neurones pour le traitement automatique du
langage : conception et réalisation de filtres d'informations [Neural networks
for natural language processing: design and implementation of information
filters]. Thèse de Doctorat, Université Pierre et Marie Curie - Paris VI.

Vapnik, Vladimir (1998): Statistical Learning Theory. Wiley & Sons.