Web Information Extraction: Integration and Scaling

This page describes the NSF-funded project IIS-1351029, CAREER: Web Information Extraction: Integration and Scaling.

This project studies Web Information Extraction (WIE), the task of automatically extracting computer-understandable knowledge bases (KBs) from the World Wide Web. The project addresses two key challenges in WIE. First, many different teams in academia and industry are pursuing WIE, but they lack methods for combining their KBs into a more powerful whole. This project explores how to integrate knowledge automatically across WIE systems and approaches. Second, a long-standing goal for WIE is to construct systems that can scale to billions of facts by continually improving themselves over time. This project investigates new methods that continually optimize a WIE system with limited human intervention.

The project's goal of scaling and integrating WIE systems promises to address needs in the research community, the computing industry, and the public. Methods that allow different WIE systems to seamlessly exchange knowledge could dramatically hasten the progress of Web extraction efforts currently underway in academia and industry. For the public, advances in Web extraction promise to enable improved search engines that can assist users with tasks and answer complex questions. Further, through application prototypes, the project will provide public-facing information retrieval tools that promise to help users retrieve, understand, and analyze the Web's knowledge more rapidly. The project's research is also integrated with an education plan that includes outreach to underrepresented groups.

The technical solutions pursued in the project utilize probability distributions over natural language. For the integration challenge, the project is developing new Application Programming Interfaces (APIs) that leverage the expressiveness of natural language to automatically integrate current and future WIE systems, even when the systems extract from different types of corpora and represent knowledge in different ways. For the scaling challenge, the project is developing ways to continually optimize new Statistical Language Models (SLMs) over text on the Web. The project investigates the SLM approach for WIE theoretically, asking what types of knowledge different SLMs can encode, and how much text is required to obtain the knowledge. Further, the project introduces new SLM capabilities, including methods for scaling to larger corpora and more semantic classes, and novel models that incorporate collocations, quantitative attributes, sense disambiguation, and actively-selected human input. The project web site (http://websail.eecs.northwestern.edu/wie/) provides additional information and access to results, including software, corpora, and evaluation data sets.

PI: Doug Downey

PhD Students:

Publications

Yiben Yang, Ji-Ping Wang, Doug Downey. Using Large Corpus N-gram Statistics to Improve Recurrent Neural Language Models. NAACL 2019.
Michael Chen, Mike D'Arcy, Alisa Liu, Jared Fernandez, Doug Downey. CODAH: An Adversarially Authored Question-Answer Dataset for Common Sense. RepEval Workshop, NAACL 2019.
Yiben Yang, Larry Birnbaum, Ji-Ping Wang, Doug Downey. Extracting Commonsense Properties from Embeddings with Limited Human Guidance. ACL 2018.
Thanapon Noraset, Dave Demeter, Doug Downey. Controlling Global Statistics in Recurrent Neural Network Text Generation. AAAI 2018.
Zheng Yuan, Doug Downey. OTyper: A Neural Architecture for Open Named Entity Typing. AAAI 2018.
Jared Fernandez, Doug Downey. Sampling Informative Training Data for RNN Language Models. ACL 2018 (Student Paper).
Jared Fernandez, Zhaocheng Yu, Doug Downey. VecShare: A Framework for Sharing Word Representation Vectors. EMNLP 2017.
Thanapon Noraset, Chen Liang, Larry Birnbaum, Doug Downey. Definition Modeling: Learning to Define Word Embeddings in Natural Language. AAAI 2017.
Nishant Subramani, Doug Downey. PAG2ADMG: A Novel Methodology to Enumerate Causal Graph Structures. AAAI 2017 (Student Paper).
Yuji Mo, Stephen Scott, Doug Downey. Learning Hierarchically Decomposable Concepts with Active Over-Labeling. ICDM 2016.
Yi Yang, Doug Downey, Jordan Boyd-Graber. Efficient Methods for Incorporating Knowledge into Topic Models. EMNLP 2015.
Chandra Sekhar Bhagavatula, Thanapon Noraset, Doug Downey. TabEL: Entity Linking in Web Tables. ISWC 2015.
Doug Downey, Chandra Sekhar Bhagavatula, Yi Yang. Efficient Methods for Inferring Large Sparse Topic Hierarchies. ACL 2015.
Chandra Sekhar Bhagavatula, Thanapon Noraset, Doug Downey. TextJoiner: On-demand Information Extraction with Multi-pattern Queries. AKBC 2014.

Demos

Data and Code

This material is based upon work supported by the National Science Foundation under Grant Number 1351029. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Last Updated 8/5/2019