SemCor: semantically annotated English corpus
The SemCor corpus is an English corpus of semantically annotated texts. The semantic annotation was done manually with WordNet 1.6 senses (SemCor version 1.6) and later mapped automatically to WordNet 3.0 senses (SemCor version 3.0). The SemCor corpus consists of 352 texts from the Brown corpus.
The sense-tagged corpus SemCor 3.0 was created automatically from SemCor 1.6 by mapping WordNet 1.6 senses to WordNet 3.0 senses. SemCor 1.6 was created by, and is the property of, Princeton University. The automatic mapping was performed by Rada Mihalcea (rada@cs.unt.edu).
The corpus also contains multi-word expressions (MWEs), whose parts are joined with an underscore (_), e.g. manor_house. These multi-word units were annotated by Siva Reddy.
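The underscore convention can be handled with a few lines of code once tokens have been exported from the corpus. The following is a minimal sketch, assuming the tokens are already available as plain Python strings; the function names split_mwe and is_mwe and the sample token list are hypothetical and not part of SemCor or Sketch Engine.

```python
def split_mwe(token: str) -> list[str]:
    """Split an underscore-joined multi-word expression into its parts."""
    # Empty pieces are dropped so that a stray leading or trailing
    # underscore does not produce empty strings.
    return [part for part in token.split("_") if part]


def is_mwe(token: str) -> bool:
    """Return True if the token consists of more than one underscore-joined part."""
    return len(split_mwe(token)) > 1


# Hypothetical sample tokens, for illustration only.
tokens = ["the", "manor_house", "stood", "empty"]
mwes = [t for t in tokens if is_mwe(t)]
```

Here mwes holds only the tokens that follow the multi-word convention, and split_mwe recovers the individual words when they are needed separately.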
Part-of-speech tagset
SemCor was tagged by TreeTagger using the Penn Treebank tagset.
License
Tools to work with the SemCor corpus
A complete set of tools is available to work with this sense-annotated English corpus to generate:
- keywords – terminology extraction of one-word units
- word lists – lists of English nouns, verbs, adjectives etc. organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
- text type analysis – statistics of metadata in the corpus
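To illustrate what a word list and an n-gram frequency list contain, here is a minimal sketch on a toy token list. It is not how Sketch Engine computes these lists; the ngrams helper and the sample sentence are assumptions made for illustration only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical sample; a multi-word expression counts as a single token.
sample = "the manor_house stood near the manor_house gate".split()

# Word list: tokens ordered by descending frequency.
word_list = Counter(sample).most_common()

# N-gram frequency list, here for bigrams.
bigram_counts = Counter(ngrams(sample, 2))
```

Because manor_house is one token, it appears as a single entry in the word list and as one element of each bigram it takes part in.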
Search the sense-tagged annotated corpus
Sketch Engine offers a range of tools to work with the SemCor corpus.
Use Sketch Engine in minutes
Generate collocations, frequency lists, examples in context and n-grams, or extract terms. Use our Quick Start Guide to get started in minutes.