

Software tools for processing Georgian documents

Author: Levan Iobashvili
Keywords: document, term, model, classification, language, algorithm
Annotation:

This work examines methods for classifying unstructured documents and applies them to Georgian texts. Classification is one of the main initial steps in information retrieval. Search relies on different retrieval models; we discuss the Boolean, SVM and probabilistic models along with their advantages and disadvantages. These models depend on the calculation of term weights. The weight of a term is a statistical value derived from the frequency of its occurrence in a text, and it determines the importance of the term. We also describe natural language processing methods, since texts are processed according to their language, and we discuss the text preprocessing performed in the first steps of classification, including stemming and lemmatization. We describe algorithms frequently used to solve classification problems, such as KNN, SVM and Bayes, and review their characteristics at the different stages of operation. We applied these algorithms in practice to Georgian medical documents: 25,000 documents had to be classified into 3 main groups and 13 subgroups. We tested the SVM and KNN algorithms on them. The results showed that both perform very well, although SVM was slightly more precise. It is worth mentioning that this was the first attempt to classify documents of this kind.
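
To make the idea of frequency-based term weights concrete, the following is a minimal sketch and not the implementation used in this work. It assumes TF-IDF as the weighting scheme and cosine similarity for KNN, neither of which is specified in the abstract. The function names (`tf_idf`, `cosine`, `knn_predict`) and the English tokens are hypothetical placeholders; real Georgian input would first need tokenization and stemming or lemmatization.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed scheme: relative term frequency times log(N / document frequency).
    """
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return weights


def cosine(a, b):
    """Cosine similarity between two sparse weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_predict(query, train_vecs, labels, k=3):
    """Majority vote among the k training vectors most similar to the query."""
    ranked = sorted(
        zip(train_vecs, labels),
        key=lambda pair: cosine(query, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy, already-tokenized documents (placeholders, not data from this work).
docs = [
    ["heart", "pain", "heart"],
    ["heart", "surgery"],
    ["skin", "rash"],
    ["skin", "cream", "rash"],
]
labels = ["cardio", "cardio", "derm", "derm"]

weights = tf_idf(docs)
# Classify the last document using the first three as training data.
predicted = knn_predict(weights[3], weights[:3], labels[:3], k=1)
```

In this sketch a term that occurs in every document receives weight zero, which reflects the intuition that such a term does not help distinguish one group from another.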