Browsing by Author "Huang, Biyun"

  • Huang, Biyun (2018)
    Text classification, also known as text categorization, is the task of assigning documents to predefined categories. With the growth of social networks, the volume of unstructured text being generated has grown exponentially. Social media text, because of its limited length, extreme class imbalance, high dimensionality, and multi-label nature, needs special processing before being fed to machine learning classifiers. Many statistical, machine learning, and natural language processing approaches address the problem, and two families of machine learning algorithms currently represent the state of the art. One is large-scale linear classification, which handles large sparse data and is especially suited to short social media text; the other is deep learning, whose techniques take advantage of word order. This thesis provides an end-to-end solution for large-scale, multi-label, and extremely imbalanced text data, compares the two approaches, and discusses the effect of balanced learning. The results show that deep learning does not necessarily work well in this context. Well-designed large linear classifiers can achieve the best scores. Also, when the data is large enough, the simpler classifiers may perform better.