Class-index corpus-index measure: A novel feature selection method for imbalanced text data
Abstract
In text classification, some datasets are imbalanced, and for these datasets the feature selection stage is important for improving performance. Although there are many studies in this area, existing methods are based only on the intra-class document frequency of a feature. This study proposes a new feature selection method for imbalanced text classification, the class-index corpus-index measure (CiCi), which considers the situation of a feature in both the class and the corpus. CiCi is a probabilistic method calculated from the feature distribution in both the class and the corpus. Multinomial Naive Bayes and support vector machines were used as classifiers in the experiments, which were carried out on three imbalanced benchmark datasets: Reuters-21578, Ohsumed, and Enron1. The experimental results show that the proposed method achieves higher performance than successful methods in the literature in terms of three different success measures.