Wikipedia’s links are very valuable for many NLP tasks, but only a fraction of the text is annotated with hyperlinks. Our goal is to produce additional high-precision links to articles to facilitate other NLP systems. 3W is a system that identifies phrases in Wikipedia articles and links them to their referent concepts. 3W leverages the rich information present in Wikipedia articles to achieve high precision, yet yields radically more new links than the baseline.
We provide 2 versions of new links: Baseline and 3W. Please refer to our publication for further explanation of both versions.
Both versions use the following format:
<source article id>\t<start offset>\t<end offset>\t<link target article id>\t<confidence>
The offsets are computed from the parsed Wikipedia articles provided in the Wikipedia Resources section.
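As a minimal sketch of reading this format, the Python snippet below splits each line on tabs and converts the fields. The names `Link`, `parse_link_line`, and `read_links` are hypothetical, and the sample values are made up for illustration. The sketch assumes the offsets are integers and the confidence is a decimal number; article ids are kept as strings because their exact type is not stated here.

```python
from typing import Iterable, Iterator, NamedTuple


class Link(NamedTuple):
    """One line of a link file (field names are illustrative)."""
    source_id: str     # <source article id>, kept as a string
    start: int         # <start offset>, assumed to be an integer
    end: int           # <end offset>, assumed to be an integer
    target_id: str     # <link target article id>, kept as a string
    confidence: float  # <confidence>, assumed to be a decimal number


def parse_link_line(line: str) -> Link:
    """Parse one tab-separated line into a Link."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    source_id, start, end, target_id, confidence = fields
    return Link(source_id, int(start), int(end), target_id, float(confidence))


def read_links(lines: Iterable[str]) -> Iterator[Link]:
    """Yield a Link for every non-blank line."""
    for line in lines:
        if line.strip():
            yield parse_link_line(line)


# Made-up sample lines, for illustration only.
sample = ["12\t100\t108\t345\t0.97\n", "\n", "12\t250\t261\t678\t0.91\n"]
links = list(read_links(sample))
```

Because `read_links` accepts any iterable of strings, an open file handle can be passed to it directly, which avoids loading a large link file into memory at once.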
The paper that describes this project is Adding High-Precision Links to Wikipedia.
@InProceedings{noraset-bhagavatula-downey:2014:EMNLP2014,
  author    = {Noraset, Thanapon and Bhagavatula, Chandra and Downey, Doug},
  title     = {Adding High-Precision Links to Wikipedia},
  booktitle = {Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  month     = {October},
  year      = {2014},
  address   = {Doha, Qatar},
  publisher = {Association for Computational Linguistics},
  pages     = {651--656},
  url       = {http://www.aclweb.org/anthology/D14-1072}
}
We provide data generated in Adding High-Precision Links to Wikipedia. The data include:
The format of the mention and link files is described in the Links section. Note that <confidence> is not included in Extracted mentions or Hand-labeled links, and <link target article id> is not included in Extracted mentions. The link files might differ slightly from those reported in the paper because we re-ran the experiments. The confidence threshold is 0.934 for 3W and 0.90 for Baseline.
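The differing column counts and the thresholds can be handled along the following lines. This is a sketch under stated assumptions: the names `THRESHOLDS`, `parse_record`, and `keep_confident` are hypothetical, the missing columns are assumed to be dropped from the end of each line, and a link is assumed to pass when its confidence is greater than or equal to the threshold (whether the comparison is strict is not stated here).

```python
from typing import Dict, List, Optional, Union

Record = Dict[str, Union[str, int, float, None]]

# Confidence thresholds quoted above for each version.
THRESHOLDS = {"3W": 0.934, "Baseline": 0.90}


def parse_record(line: str) -> Record:
    """Parse a mention line (3 fields), a hand-labeled link line (4 fields),
    or a scored link line (5 fields). Absent fields become None.
    Assumption: absent columns are omitted from the end of the line."""
    fields = line.rstrip("\n").split("\t")
    if not 3 <= len(fields) <= 5:
        raise ValueError(f"expected 3 to 5 tab-separated fields, got {len(fields)}")
    target_id: Optional[str] = fields[3] if len(fields) >= 4 else None
    confidence: Optional[float] = float(fields[4]) if len(fields) == 5 else None
    return {
        "source_id": fields[0],
        "start": int(fields[1]),
        "end": int(fields[2]),
        "target_id": target_id,
        "confidence": confidence,
    }


def keep_confident(records: List[Record], version: str) -> List[Record]:
    """Keep scored records at or above the threshold for the given version.
    Assumption: the comparison is inclusive (>=)."""
    threshold = THRESHOLDS[version]
    return [
        r for r in records
        if r["confidence"] is not None and r["confidence"] >= threshold
    ]


# Made-up sample lines, for illustration only.
records = [
    parse_record("12\t100\t108\t345\t0.97"),  # scored link
    parse_record("12\t250\t261\t678\t0.91"),  # scored link
    parse_record("12\t300\t310\t901"),        # hand-labeled link, no confidence
    parse_record("12\t400\t405"),             # extracted mention only
]
confident_3w = keep_confident(records, "3W")
confident_baseline = keep_confident(records, "Baseline")
```

Records without a confidence value are skipped by the filter rather than treated as errors, so the same function can be run over mixed input.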
In addition to links and experimental data, we think that it will be useful to provide Wikipedia-related data that we preprocess and use in many of our projects. Warning: these files are very large.
All resources are built using English Wikipedia as of September 2013. The resources do not include information about templates and tables in the articles. For table-related data, please check out WikiTables.
This work was supported in part by DARPA contract D11AP00268 and the Allen Institute for Artificial Intelligence.