This page describes the NSF-funded project IIS-1351029, CAREER: Web Information Extraction: Integration and Scaling.

This project studies Web Information Extraction (WIE), the task of automatically extracting computer-understandable knowledge bases (KBs) from the World Wide Web. The project addresses two key challenges in WIE. First, many different teams in academia and industry are pursuing WIE, but they lack methods for combining their KBs into a more powerful whole. This project explores how to integrate knowledge automatically across WIE systems and approaches. Second, a long-standing goal for WIE is to construct systems that can scale to billions of facts by continually improving themselves over time. This project investigates new methods that continually optimize a WIE system with limited human intervention. The project's goal of scaling and integrating WIE systems promises to address needs in the research community, the computing industry, and the public. Methods that allow different WIE systems to seamlessly exchange knowledge could dramatically hasten the progress of Web extraction efforts currently underway in academia and industry. For the public, advances in Web extraction promise to enable improved search engines that can assist users with tasks and answer complex questions. Further, through application prototypes, the project will provide public-facing information retrieval tools that promise to help users retrieve, understand, and analyze the Web's knowledge more rapidly. The project's research is also integrated with an education plan that includes outreach to underrepresented groups.

The technical solutions pursued in the project utilize probability distributions over natural language. For the integration challenge, the project is developing new Application Programming Interfaces (APIs) that leverage the expressiveness of natural language to automatically integrate current and future WIE systems, even when the systems extract from different types of corpora and represent knowledge in different ways. For the scaling challenge, the project is developing ways to continually optimize new Statistical Language Models (SLMs) over text on the Web. The project investigates the SLM approach for WIE theoretically, asking what types of knowledge different SLMs can encode, and how much text is required to obtain the knowledge. Further, the project introduces new SLM capabilities, including methods for scaling to larger corpora and more semantic classes, and novel models that incorporate collocations, quantitative attributes, sense disambiguation, and actively-selected human input. The project web site provides additional information and access to results, including software, corpora, and evaluation data sets.
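
As a toy illustration of what a "probability distribution over natural language" can look like, the sketch below fits a bigram model with add-one smoothing to a few sentences. It is not the project's SLMs or software; the function names, the smoothing choice, and the example sentences are all assumptions made for illustration.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Count contexts and bigrams over whitespace-tokenized, lowercased text.

    Hypothetical helper for illustration only.
    """
    context_counts = Counter()
    bigram_counts = Counter()
    vocab = {"</s>"}  # every word that can follow a context, plus end marker
    for sentence in sentences:
        words = sentence.lower().split()
        vocab.update(words)
        tokens = ["<s>"] + words + ["</s>"]
        context_counts.update(tokens[:-1])            # tokens that have a successor
        bigram_counts.update(zip(tokens, tokens[1:]))  # (previous, next) pairs
    return context_counts, bigram_counts, vocab


def bigram_prob(model, prev, word):
    """P(word | prev) with add-one (Laplace) smoothing over the vocabulary."""
    context_counts, bigram_counts, vocab = model
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))


if __name__ == "__main__":
    # Made-up sentences; a real system would train on Web-scale text.
    model = train_bigram_model([
        "Chicago is a city",
        "Boston is a city",
        "Evanston is a city in Illinois",
    ])
    print(bigram_prob(model, "a", "city"))
    print(bigram_prob(model, "a", "illinois"))
```

Even in this tiny model, contexts such as "is a" assign more probability to "city" than to unrelated words. Distributional regularities of this kind are what a language-model approach to extraction can draw on.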

PI: Doug Downey

PhD Students:



Data and Code

This material is based upon work supported by the National Science Foundation under Grant Number 1351029. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Last Updated 8/5/2019