Abstract:
This paper compares two PPM (Prediction by Partial Matching) methods for automatic content-based text classification: one based on letters and one based on words. The investigation was motivated by the idea that words, and especially word combinations, are more relevant features than letters and letter combinations for many text classification tasks. The experimental results confirmed that PPM models are applicable to content-based text classification, although the word-based PPM model did not perform better than the letter-based model.