This project aims to improve our ability to mine information previously locked in unstructured natural language text. It focuses on developing novel statistical models for information extraction and data mining that are so tightly integrated that the boundaries between them disappear, resulting in a powerful unified framework for extraction and mining.
Current information extraction methods populate slots in a database by identifying relevant subsequences of text, but they are usually unaware of the emerging patterns and regularities in the database. Current data mining methods begin from a populated database, and they are often unaware of where the data came from, or its inherent uncertainties. The result is that the accuracy of both suffers, and significant mining of complex text sources is beyond reach.
This project uses probabilistic graphical models that make extraction and mining decisions in the same probabilistic currency, with a common inference procedure. Such models promise significant gains in accuracy and capability, as well as an opportunity for deeper understanding of the role of high-level, top-down patterns in natural language processing, and the role of low-level, bottom-up language data in symbolic processing.
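As a toy illustration of what a "common probabilistic currency" can mean, the sketch below compares a pipeline that commits to the extractor's top guess with a joint decision that multiplies the extractor's probabilities by a compatibility score derived from patterns already in the database. It is not the project's actual model: the function names, the candidate values ("Univ. A", "Univ. B"), and all numbers are hypothetical, chosen only to show how a database-level regularity can overturn a locally preferred extraction.

```python
# Toy sketch only: all names and numbers are hypothetical, not the project's model.

def pipeline_decision(extraction_probs):
    """Pipeline approach: commit to the extractor's single best guess."""
    return max(extraction_probs, key=extraction_probs.get)


def joint_posterior(extraction_probs, database_compat):
    """Joint approach: multiply the extraction probability by a database-level
    compatibility factor for each candidate, then normalize."""
    unnormalized = {
        value: extraction_probs[value] * database_compat[value]
        for value in extraction_probs
    }
    total = sum(unnormalized.values())
    return {value: score / total for value, score in unnormalized.items()}


# Hypothetical extractor output for an author's affiliation field.
extraction_probs = {"Univ. A": 0.55, "Univ. B": 0.45}

# Hypothetical compatibility with the database, e.g. most of this author's
# co-authors already stored in the database are at "Univ. B".
database_compat = {"Univ. A": 0.2, "Univ. B": 0.8}

pipeline_choice = pipeline_decision(extraction_probs)
posterior = joint_posterior(extraction_probs, database_compat)
joint_choice = max(posterior, key=posterior.get)

print("pipeline choice:", pipeline_choice)
print("joint choice:", joint_choice)
print("joint posterior:", posterior)
```

In this example the pipeline keeps "Univ. A" because it only sees the text, while the joint decision prefers "Univ. B" once the database pattern is weighed in. The joint posterior also retains the remaining uncertainty, which a downstream mining step could use instead of treating the extracted value as certain.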
The project grounds this work in two real-world application domains: scientific research and government information. The extraction and mining of large-scale databases in these domains will have broad impacts by providing useful, constantly updated Web resources, by enabling insights into government efficiency and the flow of scientific ideas, and by making databases, analyses, and source code publicly available.