Corpus of Elsevier Open Access Journals
The Elsevier OA CC-BY Corpus is an English corpus of 40,000 scientific research papers that form a representative sample from across scientific disciplines. It consists of open access articles published under the CC-BY 4.0 (Creative Commons) license in journals of Elsevier, a Dutch publishing company specializing in scientific, technical, and medical content. The articles were published between 2014 and 2020.
The original data of the Elsevier OA CC-BY corpus were prepared by Daniel Kershaw and Rob Koeling. More information about the corpus can be found in the Digital Commons (Elsevier) deposit.
Part-of-speech tagset
The Elsevier Open Access Journals corpus is annotated with part-of-speech tags from the TreeTagger tagset.
Basic information
Frequency | |
Tokens | 43,125,207,462 |
Words | 36,561,273,153 |
Sentences | 2,008,143,278 |
Web pages | 78,373,887 |
Elsevier OA CC-BY Corpus – year distribution
The English corpus of Elsevier Open Access Journals contains 40,000 scientific articles from 2014 to 2020.
Search the Elsevier OA CC-BY Corpus
Sketch Engine offers a range of tools to work with this English corpus of Elsevier Journals.
Tools to work with the Elsevier OA CC-BY Corpus
A complete set of Sketch Engine tools is available to work with this English corpus of scientific papers to generate:
- word sketch – English collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- keywords – terminology extraction of one-word and multi-word units
- word lists – lists of English nouns, verbs, adjectives etc. organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
- text type analysis – statistics of metadata in the corpus
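To make two of the outputs listed above more concrete, here is a minimal Python sketch of an n-gram frequency list and a keyword-in-context concordance. It is an illustration only and not how Sketch Engine computes these: the sample sentence, the simple regex tokenizer, and the function names `tokenize`, `ngram_counts`, and `concordance` are all assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Hypothetical tokenizer for this sketch: keeps letters and digits,
    and treats hyphenated forms such as "cc-by" as a single token.
    """
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def concordance(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Made-up sample text, not taken from the corpus.
sample = (
    "Open access articles are free to read. "
    "Open access articles can be reused under the CC-BY license."
)

tokens = tokenize(sample)
bigrams = ngram_counts(tokens, 2)

# Most frequent two-word units, then every occurrence of "articles" in context.
for gram, freq in bigrams.most_common(3):
    print(" ".join(gram), freq)
for left, kw, right in concordance(tokens, "articles", width=2):
    print(f"{left:>15} | {kw} | {right}")
```

A real corpus tool works on tokenized, lemmatized, and tagged text rather than raw strings, so results on the actual corpus will differ from this toy example.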
Citation & Reference
Kershaw, Daniel; Koeling, Rob (2020), “Elsevier OA CC-BY Corpus”, Mendeley Data, V1, doi: 10.17632/zm33cdndxs.1