Web-Based Economic Activity Classification

Comparing semi-supervised text classification methods to deal with noisy labels

Abstract

Classifying enterprises by economic activity is an important task for national statistical institutes, because accurate industry statistics depend on it. The economic activity codes in the Dutch business register are less accurate for small enterprises, since small enterprises are not labelled manually. To increase the quality of the register, automatic classification of enterprises based on their websites has been tried with supervised text mining techniques. The performance of these techniques is limited by the amount of accurately labelled training data available. Since inaccurate labels are available for all enterprises, the current study investigates how to leverage this noisily labelled data to improve the economic activity classification of small enterprises based on their webpage texts. It compares the performance of various semi-supervised methods that enlarge the training data by drawing on the abundance of noisily labelled data. The methods are compared against a supervised baseline, which uses all noisy data as is. The proposed proportional weakly self-training method queries noisily labelled instances through high probability sampling and filters out mispredicted instances. Results showed that proportional weakly self-training improves upon the supervised baseline while requiring far fewer training instances. From qualitative analyses, we conclude that the filter of proportional weakly self-training reduces error propagation compared to classic self-training. Additional experimental results showed that large enterprises are less suitable as training data for predicting the economic activity of small enterprises, and that top-k predictions improve performance scores but are not yet sufficient for semi-automatic classification. Further examination of error detection methods is recommended to improve web-based economic activity classification.
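
The general idea of self-training with a filter on the noisy labels can be illustrated with a minimal sketch. This is not the implementation from the study: the abstract does not give the details, so the code below assumes that "high probability sampling" means querying the pool instances the current model is most confident about, and that "filtering mispredicted instances" means dropping queried instances whose predicted label disagrees with their noisy register label. The "proportional" aspect of the method is not specified in the abstract and is not modelled here. The tiny naive Bayes classifier, the function names (`train_nb`, `predict_proba`, `weakly_self_train`), the parameters `rounds` and `per_round`, and the toy data are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(examples):
    """Fit a multinomial naive Bayes model on (tokens, label) pairs."""
    class_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_proba(model, tokens):
    """Return {label: posterior probability}, using add-one smoothing."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    log_scores = {}
    for label, n in class_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(n / total)
        for tok in tokens:
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        log_scores[label] = score
    top = max(log_scores.values())
    exp = {label: math.exp(s - top) for label, s in log_scores.items()}
    z = sum(exp.values())
    return {label: v / z for label, v in exp.items()}


def weakly_self_train(clean, noisy, rounds=3, per_round=2):
    """Grow the training set from a noisy pool (assumed interpretation).

    Each round: score the pool with the current model, query the
    `per_round` most confident instances, and keep only those whose
    prediction agrees with the noisy label. Queried instances leave the
    pool whether or not they pass the filter.
    """
    train = list(clean)
    pool = list(noisy)
    for _ in range(rounds):
        if not pool:
            break
        model = train_nb(train)
        scored = []
        for tokens, noisy_label in pool:
            proba = predict_proba(model, tokens)
            pred = max(proba, key=proba.get)
            scored.append((proba[pred], pred, tokens, noisy_label))
        scored.sort(key=lambda s: s[0], reverse=True)
        queried, rest = scored[:per_round], scored[per_round:]
        # Filter step: discard instances where model and noisy label disagree.
        train += [(tokens, pred) for _, pred, tokens, noisy_label in queried
                  if pred == noisy_label]
        pool = [(tokens, noisy_label) for _, _, tokens, noisy_label in rest]
    return train_nb(train), train


# Toy data (hypothetical): webpage tokens with activity labels.
clean = [(["bread", "bakery", "cake"], "bakery"),
         (["software", "cloud", "code"], "it")]
noisy = [(["fresh", "bread", "cake"], "bakery"),
         (["cloud", "software", "hosting"], "bakery"),  # wrong register label
         (["code", "software", "app"], "it")]

model, train = weakly_self_train(clean, noisy)
```

In this toy run, the pool instance whose register label contradicts the model's prediction is never added to the training set, which is the mechanism the abstract credits with reducing error propagation relative to classic self-training, where confident predictions are added without such a check.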