Semantically Tagged Glosses

Word forms from the definitions ("glosses") in WordNet's synsets are manually linked to the context-appropriate sense in WordNet. Thus, the glosses are a sense-disambiguated corpus and WordNet version 3.0 is the dictionary against which the corpus was annotated.

Release Contents

Once extracted, this release comprises three subdirectories:

/WordNet-3.0/glosstag/merged      WordNet glosses in merged format
/WordNet-3.0/glosstag/standoff    WordNet glosses in standoff format
/WordNet-3.0/glosstag/dtd         DTD describing the markup for the merged annotations

When using this freely available resource, we ask that you refer to it as the "Princeton WordNet Gloss Corpus."


Readme File



Tokenized text (word and collocation forms)

Types     47334
Tokens  1621129

Multi-word forms (globs)

man      7168
auto    45967
all     53135

Taggable lemmas (potential lemmas)

Types       55561
Tokens    1504077

Sense tags (sense keys on sense tags)

Kind    Types    Tokens
man     33862    339969
auto    26139    118856
all     59250    458825

Taggable tokens (word forms and globs)

Kind       wf     glob       all
man    317812    12687    330499
auto    82238    36618    118856
un     202881     3830    206711
ignore 457502        0    457502


wf      word form
man     manually-inserted sense tag or collocation
auto    automatically generated sense tag or collocation
un      taggable item that has not been tagged
ignore  stoplist item
glob    collocation/multi-word term
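
As a quick check on the "Taggable tokens" table, the short Python sketch below recomputes the "all" column from the wf and glob counts and derives one illustrative figure: the share of non-stoplist taggable tokens that carry a manual or automatic tag. The counts are copied from the table; the names taggable, row_total and tagged_share are hypothetical helpers introduced here, and the derived share is not a statistic reported by the release itself.

```python
# Counts copied from the "Taggable tokens" table above.
# The dictionary layout and function names are illustrative assumptions.
taggable = {
    "man":    {"wf": 317812, "glob": 12687},
    "auto":   {"wf": 82238,  "glob": 36618},
    "un":     {"wf": 202881, "glob": 3830},
    "ignore": {"wf": 457502, "glob": 0},
}


def row_total(kind):
    """Return wf + glob for one row, i.e. the table's 'all' column."""
    return sum(taggable[kind].values())


def tagged_share():
    """Fraction of non-stoplist taggable tokens with a man or auto tag."""
    tagged = row_total("man") + row_total("auto")
    candidates = tagged + row_total("un")  # 'ignore' rows are excluded
    return tagged / candidates


if __name__ == "__main__":
    for kind in taggable:
        print(f"{kind:<7}{row_total(kind):>8}")
    print(f"tagged share of non-stoplist tokens: {tagged_share():.1%}")
```

Under these assumptions the row totals match the "all" column, and roughly 68.5% of the non-stoplist taggable tokens are tagged.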


While standoff annotations have many benefits, particularly the ability to isolate annotations of choice, the format is not well supported. Our standoff encoding is based heavily on the ANC format but is not identical to it, because our markup is necessarily different. As a result, some tools that work with the ANC data may work with ours, but not all. We supply the data in this format as a service to users who are accustomed to working with standoff annotations and who will build or modify software to work with it. We do not support the ANC standoff annotation format or any software that uses or manipulates it, and we do not provide any tools ourselves. The standoff annotations do not contain more, or better, information than the merged files: the annotations are identical to those in the merged data, just expressed in a different form. If you have any doubts about which format to use, use the merged files.
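
To illustrate in general terms why standoff and merged encodings can carry the same information, here is a minimal Python sketch that folds a list of standoff spans back into inline markup. Everything in it is a toy assumption: the (start, end, tag, sense) tuple layout, the "tag" and "sense" attribute names, and the placeholder sense keys KEY_A and KEY_B are hypothetical and do not reproduce the corpus's actual standoff files or its DTD.

```python
from xml.sax.saxutils import escape, quoteattr


def merge(text, annotations):
    """Inline standoff spans into the text they point at.

    Each annotation is a hypothetical (start, end, tag, sense) tuple of
    character offsets plus labels. Spans are assumed not to overlap.
    """
    out = []
    cursor = 0
    for start, end, tag, sense in sorted(annotations):
        out.append(escape(text[cursor:start]))  # untouched text before the span
        out.append(
            f"<wf tag={quoteattr(tag)} sense={quoteattr(sense)}>"
            f"{escape(text[start:end])}</wf>"
        )
        cursor = end
    out.append(escape(text[cursor:]))  # trailing text after the last span
    return "".join(out)


if __name__ == "__main__":
    text = "feline mammal"
    standoff = [(0, 6, "man", "KEY_A"), (7, 13, "auto", "KEY_B")]
    print(merge(text, standoff))
```

In this toy model the standoff list and the merged string describe exactly the same spans and labels; only the place where the annotations are stored differs.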


This work was sponsored by ARDA/DTO through the AQUAINT Program.