OANC: Open American National Corpus

The OANC-MASC Corpus

The Open American National Corpus (OANC), together with its subcorpus, the Manually Annotated Sub-Corpus (MASC), is a text corpus of American English. The corpus includes texts of all genres and transcripts of spoken data, produced from 1990 onward. The whole corpus comprises 11 million words.

The MASC subcorpus consists of 480k words with manually validated annotations for sentence boundaries, tokens, lemmas, part-of-speech (POS) tags, noun and verb chunks, named entities (person, location, organization, date), coreference, and discourse structure.

The OANC-MASC corpus merges data from the OANC and MASC corpora. Because MASC is a subcorpus of OANC, the MASC portion of OANC was replaced by the MASC data in the resulting OANC-MASC corpus to remove duplicate documents.
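The replacement step can be pictured as a keyed merge. The following is only a minimal sketch, assuming each document carries a unique identifier shared between OANC and MASC. The function name, the dictionary layout, and the sample IDs are hypothetical and do not describe the actual build process.

```python
def merge_oanc_masc(oanc_docs, masc_docs):
    """Merge two corpora so that MASC versions replace their OANC duplicates.

    Both arguments are hypothetical dicts mapping a document ID to its text.
    """
    # Keep only the OANC documents that have no MASC counterpart.
    merged = {
        doc_id: text
        for doc_id, text in oanc_docs.items()
        if doc_id not in masc_docs
    }
    # Add every MASC document (the manually annotated version wins).
    merged.update(masc_docs)
    return merged


# Illustrative data only; these IDs and texts are made up.
oanc = {"doc1": "oanc text 1", "doc2": "oanc text 2", "doc3": "oanc text 3"}
masc = {"doc2": "masc annotated text 2"}
merged = merge_oanc_masc(oanc, masc)
```

In this sketch each document appears exactly once in the result, and any document present in both sources is taken from MASC.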

The OANC-MASC corpus has two separate parts: OANC-MASC Written and OANC-MASC Spoken.

For more information visit http://www.anc.org

Part-of-speech tagset

This OANC corpus is tagged by the TreeTagger tool using the Penn Treebank tagset with Sketch Engine modifications.

Available tools

A complete set of tools is available to work with this English corpus to generate:

  • word sketch – English collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of English nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
  • keywords – terminology extraction of one-word units
  • text type analysis – statistics of metadata in the corpus
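To illustrate what an n-gram frequency list contains, here is a minimal sketch in plain Python. It is not how Sketch Engine computes n-grams; the function name and the sample sentence are assumptions chosen for illustration, and tokenization is reduced to splitting on whitespace.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Count every run of n consecutive tokens in the token list."""
    # zip over n shifted copies of the list yields consecutive n-token tuples.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)


# Illustrative input only; a real corpus would be tokenized properly.
tokens = "the cat sat on the mat and the cat slept".split()
bigrams = ngram_frequencies(tokens, 2)
top = bigrams.most_common(1)
```

Here `top` holds the most frequent bigram with its count, which for this sample sentence is the pair "the cat", occurring twice.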

Open American National Corpus (OANC)

Ide, N. (2008). The American National Corpus: Then, Now, and Tomorrow. In Michael Haugh, Kate Burridge, Jean Mulder and Pam Peters (eds.), Selected Proceedings of the 2008 HCSNet Workshop on Designing the Australian National Corpus: Mustering Languages, Cascadilla Proceedings Project, Somerville, MA.

The Manually Annotated Sub-Corpus (MASC)

Ide, N., Baker, C., Fellbaum, C., Fillmore, C., Passonneau, R. (2008). MASC: The Manually Annotated Sub-Corpus of American English. Proceedings of the Sixth Language Resources and Evaluation Conference (LREC), Marrakech, Morocco.

Search this English corpus

Sketch Engine offers a range of tools to work with this American English corpus.
