3W

What is 3W?

Wikipedia’s links are very valuable for many NLP tasks, but only a fraction of the text is annotated with hyperlinks. Our goal is to add links to the articles at high precision so that other NLP systems can use them. 3W is a system that identifies phrases in Wikipedia articles and links them to their referent concepts. 3W leverages the rich information present in Wikipedia articles to achieve high precision while yielding radically more new links than the baseline.

Links

We provide two versions of the new links: Baseline and 3W. Please refer to our Publication for further explanation of both versions.

Baseline (1.1 GB)
3W (1.3 GB) (from ~2.6 million articles)

Both versions use the same tab-separated format, one link per line:

<source article id>\t<start offset>\t<end offset>\t<link target article id>\t<confidence>

The offsets are computed from the parsed Wikipedia articles provided in the Wikipedia Resources section.
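
As a rough illustration of the format above, a line of a link file could be read in Python along these lines. This is only a sketch under stated assumptions: the names `Link` and `parse_link_line` are hypothetical, article IDs are kept as strings, and the offsets are assumed to be integers. The page does not say whether offsets count characters or bytes, or whether the end offset is inclusive, so the sketch does not slice any article text.

```python
from typing import NamedTuple


class Link(NamedTuple):
    """One row of a Baseline or 3W link file (hypothetical container)."""
    source_id: str      # <source article id>
    start: int          # <start offset>, assumed to be an integer
    end: int            # <end offset>, assumed to be an integer
    target_id: str      # <link target article id>
    confidence: float   # <confidence>


def parse_link_line(line: str) -> Link:
    """Split one tab-separated line into its five documented fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    source_id, start, end, target_id, confidence = fields
    return Link(source_id, int(start), int(end), target_id, float(confidence))
```

A file can then be processed one line at a time, for example with `for line in open(path, encoding="utf-8")`, which avoids loading a file of more than 1 GB into memory at once.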

Publication

The paper that describes this project is Adding High-Precision Links to Wikipedia.

Paper
Poster

BibTeX:

@InProceedings{noraset-bhagavatula-downey:2014:EMNLP2014,
	author    = {Noraset, Thanapon  and  Bhagavatula, Chandra  and  Downey, Doug},
	title     = {Adding High-Precision Links to Wikipedia},
	booktitle = {Proceedings of the 2014 Conference on Empirical Methods 
			in Natural Language Processing (EMNLP)},
	month     = {October},
	year      = {2014},
	address   = {Doha, Qatar},
	publisher = {Association for Computational Linguistics},
	pages     = {651--656},
	url       = {http://www.aclweb.org/anthology/D14-1072}
}

Experimental Data

We provide the data generated for Adding High-Precision Links to Wikipedia. The data include:

Parsed articles: ~2,000 randomly selected Wikipedia articles.
Extracted mentions: Phrases that are considered potential links.
Hand-labeled links: A subset of Extracted mentions that we manually annotated and used in our experiments.
Baseline links: Links produced by the baseline approach.
3W links: Links produced by our system.

The format of the mention and link files is described in the Links section. Note that <confidence> is not included in Extracted mentions or Hand-labeled links, and <link target article id> is not included in Extracted mentions. The link files might differ slightly from the results reported in the paper because we re-ran the experiments. The confidence threshold is 0.934 for 3W and 0.90 for Baseline.
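
The thresholds above can be applied to a link file with a small filter such as the following sketch. The names `THRESHOLDS` and `keep_confident` are hypothetical, and the sketch assumes a link is kept when its confidence is greater than or equal to the threshold; the page does not state whether the comparison is inclusive. Rows with fewer than five fields (mention or hand-labeled rows, which carry no confidence) are skipped.

```python
from typing import Iterable, Iterator, List

# Thresholds quoted on this page; the dictionary itself is illustrative.
THRESHOLDS = {"3W": 0.934, "Baseline": 0.90}


def keep_confident(lines: Iterable[str], threshold: float) -> Iterator[List[str]]:
    """Yield the fields of each link row whose confidence meets the threshold."""
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 5:
            # Extracted mentions and hand-labeled links have no <confidence>.
            continue
        # Assumption: the threshold comparison is inclusive.
        if float(fields[4]) >= threshold:
            yield fields
```

For example, `keep_confident(open(path, encoding="utf-8"), THRESHOLDS["3W"])` would stream only the 3W links at or above 0.934, where `path` is wherever the downloaded file was saved.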

Wikipedia Resources

In addition to links and experimental data, we think that it will be useful to provide Wikipedia-related data that we preprocess and use in many of our projects. Warning: these files are very large.

Parsed articles (4.8 GB): All articles, parsed with a custom-made Sweble parser. Each file contains one article and is named by its article ID.
Article ID Map (108 MB): <article id>\t<title>
Dependency-parsed articles (15 GB): Dependency parses of all articles, produced with the Stanford Dependency Parser. There are 2 files: a dependency file (each line is an article) and a position file (<article id>\t<start byte offset>).
Wikipedia Links (0.8 GB): All links in the articles in the link file format described in Links section, but there is no <confidence>.
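
Because the dependency file is large, the position file can be used to jump directly to one article instead of scanning the whole file. The sketch below shows one way this might look; `load_positions` and `read_article` are hypothetical names, and it assumes that each byte offset points at the first byte of that article's line in the dependency file and that lines end with a newline. The in-memory bytes at the end stand in for the real files.

```python
import io
from typing import BinaryIO, Dict, Iterable


def load_positions(lines: Iterable[str]) -> Dict[str, int]:
    """Map each article ID to its start byte offset in the dependency file."""
    positions: Dict[str, int] = {}
    for line in lines:
        article_id, offset = line.rstrip("\r\n").split("\t")
        positions[article_id] = int(offset)
    return positions


def read_article(dep_file: BinaryIO, positions: Dict[str, int], article_id: str) -> bytes:
    """Seek to the article's offset and return its single line, without the newline."""
    dep_file.seek(positions[article_id])
    return dep_file.readline().rstrip(b"\r\n")


# Tiny stand-in for the real files: two one-line "articles".
demo_dependencies = io.BytesIO(b"first article\nsecond article\n")
demo_positions = load_positions(["10\t0\n", "20\t14\n"])
demo_line = read_article(demo_dependencies, demo_positions, "20")
```

The dependency file is opened in binary mode here (for real data, `open(path, "rb")`) so that the byte offsets are interpreted as bytes and not as decoded characters.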

All resources are built from English Wikipedia as of September 2013. The resources do not include information about templates and tables in the articles. For table-related data, please check out WikiTables.

Team

Also visit our group website for other projects.

Acknowledgement

This work was supported in part by DARPA contract D11AP00268 and the Allen Institute for Artificial Intelligence.