Text data account for more than 80% of all data in organizations and play a critical role in countless domains. Yet the success stories of existing text mining and natural language processing tools still depend on large amounts of labeled data, which are often too costly to obtain in practice. The goal of this project is to develop next-generation text mining methods that turn massive text data into actionable knowledge with limited human supervision. We study an array of fundamental text mining tasks, such as text classification, event extraction, and taxonomy construction. Departing from the prevailing supervised models for these tasks, our methods require little human supervision yet still achieve strong performance.
- Weakly-Supervised Neural Text Classification, CIKM 2018
- Doc2Cube: Automated Document Allocation to Text Cube via Dimension-Aware Joint Embedding, ICDM 2018
- TaxoGen: Unsupervised Topic Taxonomy Construction by Adaptive Term Embedding and Clustering, KDD 2018
- HiExpan: Task-Guided Taxonomy Construction by Hierarchical Tree Expansion, KDD 2018
- GeoBurst: Real-Time Local Event Detection in Geo-Tagged Tweet Streams, SIGIR 2016
- TrioVecEvent: Embedding-Based Online Local Event Detection in Geo-Tagged Tweet Streams, KDD 2017
Building on the work above, we have also developed the TextCube system, which supports on-demand text mining and learning with little human labeling effort. The following video provides a detailed introduction to the system.