Class-index corpus-index measure: A novel feature selection method for imbalanced text data

Parlak, Bekir

Access

info:eu-repo/semantics/closedAccess

Date

2022

Author

Parlak, Bekir

Abstract

In text classification, some datasets are imbalanced. For these datasets, the feature selection stage is important for improving performance. There are many studies in this area; however, existing methods are based only on intra-class document frequency. This study proposes a new feature selection method for imbalanced text classification, the class-index corpus-index measure (CiCi). CiCi is a probabilistic method calculated from the distribution of a feature in both the class and the corpus. Multinomial Naive Bayes and support vector machines were used as classifiers in the experiments, which were run on three imbalanced benchmark datasets: reuters-21578, ohsumed, and enron1. The experimental results show that the proposed method outperforms successful methods from the literature in terms of three different success measures.

Link

https://doi.org/10.1002/cpe.7140
https://hdl.handle.net/20.500.12450/2034

Collections

Scopus Indexed Publications Collection [1574]
WoS Indexed Publications Collection [2182]