Abstract: Filtering out irrelevant documents and classifying the relevant ones into topical categories is a de facto task in many applications. However, supervised learning solutions require extensive human effort for document labeling. In this paper, we propose a novel seed-guided topic model for dataless short text classification and filtering, named SSCF. Without using any labeled documents, SSCF takes a few “seed words” for each category of interest and conducts short text filtering and classification in a weakly supervised manner. To overcome the issues of data sparsity and imbalance, the short text collection is mapped to a collection of pseudo-documents, one for each word. SSCF infers two kinds of topics on the pseudo-documents: category-topics and general-topics. Each category-topic is associated with one category of interest and covers the meaning of that category. In SSCF, we devise a novel word relevance estimation process, based on the seed words, for hidden topic inference. The dominating topic of a short text is identified through post-inference and then used for filtering and classification. On two real-world datasets in two languages, experimental results show that the proposed SSCF consistently achieves better classification accuracy than state-of-the-art baselines. We also observe that SSCF can even outperform the supervised classifiers supervised latent Dirichlet allocation (sLDA) and support vector machine (SVM) on some testing tasks.
Keywords: dataless text classification, short text, topic modeling, seed word, pseudo-document
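
The mapping from short texts to pseudo-documents, one for each word, can be illustrated with a minimal sketch. The abstract does not specify how a pseudo-document is built, so the code assumes that a word's pseudo-document is the bag of other words sharing a short text with it. The function name `build_pseudo_documents` and the toy texts are hypothetical and are not taken from SSCF.

```python
from collections import Counter, defaultdict


def build_pseudo_documents(short_texts):
    """Map tokenized short texts to one pseudo-document per word.

    Assumption (not stated in the abstract): a word's pseudo-document is
    the bag of other tokens that appear in the same short text as it.
    """
    pseudo_docs = defaultdict(Counter)
    for tokens in short_texts:
        for i, word in enumerate(tokens):
            # Every other token position in the same short text is context.
            for j, other in enumerate(tokens):
                if i != j:
                    pseudo_docs[word][other] += 1
    return dict(pseudo_docs)


# Hypothetical toy collection of already-tokenized short texts.
texts = [
    ["cheap", "flight", "deals"],
    ["flight", "delayed", "again"],
    ["new", "phone", "deals"],
]
pseudo = build_pseudo_documents(texts)
print(pseudo["flight"])
```

Under this assumption, a word that occurs in several short texts receives a pseudo-document longer than any single short text it came from, which is one plausible way such a mapping could ease data sparsity.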