The long-term goal of my research is to enable machines to acquire knowledge from data and make intelligent predictions faster, more accurately, and with less supervision. My current work focuses on developing models that improve our ability to mine knowledge from unstructured data, especially text and spatiotemporal data. My current research projects include: (1) multimodal learning on spatiotemporal data; and (2) low-resource text mining.

Multimodal Learning on Spatiotemporal Data

Modern big data applications are witnessing a confluence of text data and rich spatiotemporal contexts: millions of social media records contain not only textual messages but also user locations and creation timestamps; every scientific paper is stamped with its publication year; billions of SMS messages are sent with location and time information and are being collected by phone carriers. Existing text mining and spatiotemporal data mining techniques operate on each modality separately and fall short when dealing with such complex data. How do we unveil the subtle correlations between different modalities? How do we discover interesting patterns in the multidimensional space? Can we integrate different modalities to make more accurate predictions? Answering these questions requires algorithms that handle multimodality in a principled way, thus enabling the discovery of interesting patterns and more accurate predictions. In pursuit of this goal, I have investigated three fundamental problems for mining text-rich spatiotemporal data:

  • Multimodal Representation Learning

  • Sequential Pattern Discovery

  • Multimodal Sequential Prediction

For the above problems, we developed: (1) the first multimodal embedding method for text-rich spatiotemporal data, which learns general-purpose vector representations for location, time, and text; (2) a group-level sequential model that learns an ensemble of hidden Markov models (HMMs) by jointly clustering sequences and training the HMMs; (3) a recurrent neural network model that integrates rich contexts (location, time, text, user) for sequential prediction.
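To give a flavor of the multimodal embedding idea, the following is a deliberately minimal, illustrative sketch: units from different modalities (spatial regions, hours of day, keywords) are mapped into one shared vector space and trained so that co-occurring units end up close together. All names and data here are hypothetical toys, and the training recipe (pairwise updates with one negative sample) is a simplification, not our actual model.

```python
import math
import random

random.seed(0)

# Hypothetical cross-modal "units": spatial regions, hours of day, and keywords,
# all embedded into one shared space.
units = ["loc_downtown", "loc_campus", "hour_9", "hour_21", "coffee", "concert"]
idx = {u: i for i, u in enumerate(units)}
dim = 8
emb = [[random.gauss(0.0, 0.1) for _ in range(dim)] for _ in units]

# Toy co-occurrence pairs extracted from (location, time, text) records.
pairs = [("loc_campus", "hour_9"), ("loc_campus", "coffee"), ("hour_9", "coffee"),
         ("loc_downtown", "hour_21"), ("loc_downtown", "concert"),
         ("hour_21", "concert")]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

lr = 0.1
for _ in range(500):
    for a, b in pairs:
        i, j = idx[a], idx[b]
        # Co-occurring units: pull their embeddings together.
        g = 1.0 - sigmoid(dot(emb[i], emb[j]))
        for d in range(dim):
            ei, ej = emb[i][d], emb[j][d]
            emb[i][d] += lr * g * ej
            emb[j][d] += lr * g * ei
        # One random negative sample: push unrelated units apart.
        k = random.randrange(len(units))
        if k != i and k != j:
            g = -sigmoid(dot(emb[i], emb[k]))
            for d in range(dim):
                ei, ek = emb[i][d], emb[k][d]
                emb[i][d] += lr * g * ek
                emb[k][d] += lr * g * ei

def cos(a, b):
    va, vb = emb[idx[a]], emb[idx[b]]
    return dot(va, vb) / math.sqrt(dot(va, va) * dot(vb, vb))
```

After training, units that co-occur in the records (e.g., the 9am hour and "coffee") sit closer in the shared space than units that never do, which is what makes the learned vectors useful as general-purpose representations across modalities.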

Representative publications:

You may also be interested in playing with the Urbanity system we have developed. It uses our proposed multimodal embedding technique to learn from massive social media and model human activities in the physical world.

Low-Resource Text Mining

Much of human knowledge is encoded in textual form. The amount of text data has grown dramatically in this era, making it increasingly difficult for humans to consume text data and gain insights from it in a timely manner. The successes of existing text mining and NLP tools are often limited to tasks where massive human-labeled training data are available. In many practical text analytics tasks, however, obtaining labeled training data is too costly, and this has become the de facto bottleneck for making sense of text data. My research on low-resource text mining aims to develop methods that acquire useful knowledge from text with little supervision. Specific problems I investigate in this thrust include:

  • Document Classification

  • Event Detection

  • Taxonomy Construction

We have developed: (1) weakly-supervised document classification methods that classify documents using only label names as seed information, instead of an extensively labeled corpus; (2) unsupervised event detection methods that detect events from text streams based on non-parametric clustering; (3) unsupervised taxonomy construction methods that automatically organize a given collection of terms and entities into a concept taxonomy.
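To illustrate the label-names-only setting, here is a minimal, hypothetical sketch: each class starts with just its label name as a seed term, confident documents are pseudo-labeled by seed overlap, and each class's seed set is then expanded with frequent terms from its pseudo-labeled documents. The corpus and expansion heuristic below are toys; the actual methods are considerably more sophisticated.

```python
from collections import Counter

# Label names are the only supervision; each seed set starts as the name itself.
seeds = {"sports": {"sports"}, "politics": {"politics"}}

# Toy unlabeled corpus (hypothetical).
docs = [
    "sports fans watched the game as the team tied the score",
    "the team played a second game and fans cheered",
    "politics dominated the election debate on policy",
    "voters discussed the election and the new policy",
]

def classify(doc):
    """Assign the class whose seed set overlaps the document most, or None."""
    tokens = set(doc.split())
    scores = {c: len(s & tokens) for c, s in seeds.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

# Two rounds of self-training: pseudo-label confident documents, then grow
# each class's seed set with its most frequent terms.
for _ in range(2):
    counts = {c: Counter() for c in seeds}
    for doc in docs:
        c = classify(doc)
        if c is not None:
            counts[c].update(doc.split())
    for c in seeds:
        seeds[c] |= {w for w, _ in counts[c].most_common(5)}

print(classify("fans enjoyed the game"))
```

In the first round only the two documents containing a literal label name get pseudo-labeled; the expanded seed sets then cover the remaining documents, so an unseen sentence like "fans enjoyed the game" is classified without any labeled examples.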

Representative publications:

Besides the above machine learning techniques, we have developed a system, TextCube, by integrating these pieces. TextCube organizes a given text corpus into a multidimensional, multi-granular structure that facilitates on-demand text mining and learning. For example, the following figure shows a three-dimensional <Location, Time, Topic> cube with documents residing in its cells. From the cube, users can easily retrieve relevant data with simple queries (e.g., <disaster, USA, 2017>) and then apply any text mining tools, such as sentiment analysis and text summarization.
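The cube-query idea above can be sketched in a few lines: documents are indexed by their cell along each dimension, and a query fixes some dimensions while leaving others open. The documents and dimension values below are hypothetical, and this dictionary index is only a toy stand-in for TextCube's actual multidimensional, multi-granular structure.

```python
from collections import defaultdict

# Toy corpus over the <Location, Time, Topic> dimensions (all values hypothetical).
docs = [
    {"location": "USA", "time": 2017, "topic": "disaster",
     "text": "hurricane relief efforts"},
    {"location": "USA", "time": 2017, "topic": "sports",
     "text": "championship game recap"},
    {"location": "UK", "time": 2016, "topic": "disaster",
     "text": "flood damage report"},
]

# Index every document by its cube cell.
cube = defaultdict(list)
for d in docs:
    cube[(d["location"], d["time"], d["topic"])].append(d["text"])

def query(location=None, time=None, topic=None):
    """Retrieve documents in cells matching a (possibly partial) query."""
    return [t for (loc, yr, top), texts in cube.items()
            for t in texts
            if (location is None or loc == location)
            and (time is None or yr == time)
            and (topic is None or top == topic)]

print(query(location="USA", time=2017, topic="disaster"))
# -> ['hurricane relief efforts']
```

Leaving a dimension unspecified (e.g., `query(topic="disaster")`) retrieves a whole slice of the cube, and the returned documents can then be fed to downstream tools such as sentiment analysis or summarization.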