Active Learning and Language Models for Web Information Extraction
This project studies how to automatically extract large knowledge bases from the Web. We aim to develop techniques that can integrate the Web's tabular and textual data into a coherent knowledge base.
Questions we're interested in include:
- How can we integrate extraction from both text data and Web tables?
- How can statistical language models trained over large corpora help improve extraction accuracy?
- How can an extraction system actively solicit well-selected human input to improve the extraction process?
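To make the last two questions concrete, here is a minimal illustrative sketch (hypothetical data and scoring function, not the project's actual system) of how corpus statistics can score candidate extractions, and how uncertainty sampling can then pick the candidates most worth showing to a human:

```python
from typing import List, Tuple

def phrase_count(corpus_tokens: List[str], phrase: List[str]) -> int:
    """Count occurrences of a contiguous token phrase in the corpus."""
    n = len(phrase)
    return sum(1 for i in range(len(corpus_tokens) - n + 1)
               if corpus_tokens[i:i + n] == phrase)

def lm_confidence(corpus_tokens: List[str], entity: str, cls: str) -> float:
    """Crude language-model-style score for an (entity, class) candidate:
    how often 'entity is a cls' occurs, relative to how often the entity
    occurs at all. A real system would use a statistical LM over Web text."""
    e = phrase_count(corpus_tokens, [entity])
    if e == 0:
        return 0.0
    return phrase_count(corpus_tokens, [entity, "is", "a", cls]) / e

def select_for_labeling(candidates: List[Tuple[str, str]],
                        scores: dict, k: int) -> List[Tuple[str, str]]:
    """Uncertainty sampling: pick the k candidates whose confidence is
    closest to 0.5, i.e. the ones the model is least sure about."""
    return sorted(candidates, key=lambda c: abs(scores[c] - 0.5))[:k]

# Tiny stand-in corpus; a real system would use Web-scale text.
tokens = ("chicago is a city . chicago is a band . "
          "paris is a city . paris is a city .").split()
candidates = [("chicago", "city"), ("paris", "city"), ("python", "language")]
scores = {c: lm_confidence(tokens, *c) for c in candidates}
# "paris is a city" always holds here, so its confidence is 1.0;
# "chicago is a city" holds half the time (0.5), making it the most
# uncertain candidate and the best use of a human annotator's time.
to_label = select_for_labeling(candidates, scores, k=1)
```

The design point is that the language model supplies cheap, broad-coverage evidence, while the active-learning step spends scarce human effort only where that evidence is ambiguous.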
Try our DEMO of Wikipedia-based table extraction, along with its associated data and other resources.
Publications and associated resources:
- Using Natural Language to Integrate, Evaluate, and Optimize Extracted Knowledge Bases. Doug Downey, Chandra Sekhar Bhagavatula, Alexander Yates. (AKBC 2013)
- Methods for Exploring and Mining Tables on Wikipedia. Chandra Sekhar Bhagavatula, Thanapon Noraset, Doug Downey. (IDEA 2013) Data and Code
- Overcoming the Memory Bottleneck in Distributed Training of Latent Variable Models of Text. Yi Yang, Alex Yates, Doug Downey. (NAACL-HLT 2013) Code
- Explanatory Semantic Relatedness and Explicit Spatialization for Exploratory Search. Brent Hecht, Samuel H. Carton, Mahmood Quaderi, Johannes Schöning, Martin Raubal, Darren Gergle, Doug Downey. (SIGIR 2012) Atlasify 240 data set
This material is based upon work supported by the National Science Foundation under Grant Number 1016754. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.