Authorisation
Software tools for initial processing of Georgian texts
Author: Giorgi Giorkhelidze
Annotation:
This work describes the initial processing of Georgian-language documents for text classification, a necessary step toward developing a Georgian search system. Stemming and lemmatization are essential parts of text classification. The work reviews the well-known Lovins, Porter, and Paice/Husk stemming algorithms. These existing algorithms are not suitable for Georgian because of the complexity of the language. A new stemming algorithm was therefore created for Georgian, for use in text classification; it relies on a database of words and suffixes. The work also reviews methods for calculating term weight. The weight of a term is a statistical value, determined by how frequently the term appears in a text, that expresses the term's importance. The work describes natural language processing (NLP), understood here as extracting information from documents written in natural languages on the basis of syntax and semantics. Three popular classification algorithms are described: K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Bayes. Finally, the work explains how existing data can be used in calculating term weight.
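To make the idea of a frequency-based term weight concrete, the following is a minimal sketch. It assumes TF-IDF as the weighting scheme, which is one common choice; the annotation does not say which method the work itself adopts. The function names `term_frequencies` and `tf_idf` and the sample tokens are hypothetical and serve only as illustration.

```python
import math
from collections import Counter


def term_frequencies(tokens):
    """Relative frequency of each term within a single tokenized document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


def tf_idf(documents):
    """TF-IDF weights for a list of tokenized documents.

    Assumption: weight = tf * log(N / df), where N is the number of
    documents and df is the number of documents containing the term.
    """
    n_docs = len(documents)

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in documents:
        tf = term_frequencies(tokens)
        weights.append(
            {term: freq * math.log(n_docs / doc_freq[term])
             for term, freq in tf.items()}
        )
    return weights


# Hypothetical, already stemmed tokens for two short documents.
docs = [["search", "text", "search"], ["text", "class"]]
weights = tf_idf(docs)
```

Under this scheme a term that occurs in every document (here `text`) receives weight zero, while a term concentrated in one document (here `search`) receives a positive weight. Stemming matters for this step because inflected forms of one word are otherwise counted as separate terms.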