Browsing by Author "Wachirapong, Fahsinee"

  • Wachirapong, Fahsinee (2023)
    Topic modeling matters for the analysis of extensive textual data because manual analysis is too time-consuming to be practical. Data preprocessing is a critical step before text data is analyzed: it ensures that irrelevant information is removed and that the remaining text is suitably formatted for topic modeling. However, the absence of standard rules often leads practitioners to adopt undisclosed or poorly understood preprocessing strategies, which can undermine the reproducibility and comparability of research findings. This thesis examines text preprocessing steps, including lowercase conversion, non-alphabetic character removal, stopword elimination, stemming, and lemmatization, and explores their influence on data quality, vocabulary size, and the interpretation of the topics generated by the topic model. It also examines variations in the order of preprocessing steps and their impact on the topic model's outcomes. Our examination spans 120 diverse preprocessing approaches applied to the Manifesto Project Dataset. The results underscore the substantial impact of preprocessing strategies on perplexity scores and demonstrate the challenges of determining the optimal number of topics and interpreting the final results. Importantly, our study raises awareness of the role of data preprocessing in shaping the perceived themes and content of the identified topics, and it proposes recommendations for researchers to consider before performing data preprocessing.
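
    The order dependence described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the stopword list, the lemma table, the crude suffix-stripping stemmer, and all function names are hypothetical stand-ins, not the thesis's actual tools. The abstract does not say how its 120 approaches were constructed; this sketch simply enumerates every ordering of five toy steps, which yields 5! = 120 pipelines, and is not claimed to match the thesis's design.

```python
import re
from itertools import permutations

# Toy resources (assumptions for illustration only).
STOPWORDS = {"the", "a", "of", "and", "is", "in", "to", "we"}
LEMMAS = {"policies": "policy", "taxes": "tax"}


def lowercase(tokens):
    return [t.lower() for t in tokens]


def remove_non_alpha(tokens):
    # Strip non-alphabetic characters and drop tokens left empty.
    cleaned = [re.sub(r"[^A-Za-z]", "", t) for t in tokens]
    return [t for t in cleaned if t]


def remove_stopwords(tokens):
    # Case-sensitive match, so the result depends on whether
    # lowercasing has already run.
    return [t for t in tokens if t not in STOPWORDS]


def toy_stem(tokens):
    # Crude suffix stripping; a stand-in for a real stemmer.
    stemmed = []
    for t in tokens:
        for suffix in ("ing", "es", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed


def toy_lemmatize(tokens):
    # Dictionary lookup; a stand-in for a real lemmatizer.
    return [LEMMAS.get(t, t) for t in tokens]


STEPS = {
    "lowercase": lowercase,
    "non_alpha": remove_non_alpha,
    "stopwords": remove_stopwords,
    "stem": toy_stem,
    "lemmatize": toy_lemmatize,
}


def run_pipeline(text, order):
    """Apply the named steps to whitespace-split tokens, in the given order."""
    tokens = text.split()
    for name in order:
        tokens = STEPS[name](tokens)
    return tokens


def vocabulary_by_order(text):
    """Map each ordering of the five steps to the vocabulary it produces."""
    return {
        order: frozenset(run_pipeline(text, order))
        for order in permutations(STEPS)
    }


if __name__ == "__main__":
    sample = "The Taxes and the 2023 policies, rising in Europe!"
    vocabularies = vocabulary_by_order(sample)
    print(len(vocabularies), "orderings,",
          len(set(vocabularies.values())), "distinct vocabularies")
```

    Even on a single sentence, different orderings produce different vocabularies: removing stopwords before lowercasing leaves a capitalized "The" in place to be lowercased later, and stemming before lemmatizing prevents the lemma lookup from matching. A real study would substitute an established stemmer, lemmatizer, and stopword list for these toy versions.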