Abstract

This paper presents a way to validate topic modelling results by comparing them with the results of network analysis. Validation is needed because the topic structure obtained from topic modelling is not guaranteed to be free of errors, whether introduced during interpretation or caused by the properties of topic models themselves. The empirical base of the research was a corpus of texts from YouTube, consisting of comments on the “Chaika” documentary produced by the Anti-Corruption Foundation. Before either method was applied, the raw data were cleaned and prepared in a preprocessing stage. This stage was carried out in a limited form because few linguistic tools for Russian are openly available. The first method of text analysis was topic modelling. Two models, a base model and an extended model, were generated and used to obtain the topic structure. The second method, used to validate the resulting topics, was a semantic network built on bigrams. This method proved effective both as a validation tool and as an instrument for extending the topic set. Among its advantages is the possibility of visualizing the topic structure. Moreover, constructing and analysing a semantic network does not require much time, provided that the text corpus does not contain tens or hundreds of thousands of items.

Keywords: Big Data, topic modelling, network analysis, semantic network, text analysis