Elsevier OA CC-BY Corpus

Corpus of Elsevier Open Access Journals

The Elsevier OA CC-BY Corpus is an English corpus of 40,000 scientific research papers that form a representative sample from across scientific disciplines. It comprises open access articles published under the CC-BY 4.0 (Creative Commons) license in journals of Elsevier, a Dutch publishing company specializing in scientific, technical, and medical content. The articles were published between 2014 and 2020.

The original data of the Elsevier OA CC-BY Corpus were prepared by Daniel Kershaw and Rob Koeling. More information about the corpus can be found in the Digital Commons (Elsevier) deposit.

Part-of-speech tagset

The Elsevier Open Access Journals corpus is part-of-speech tagged with the TreeTagger part-of-speech tagset.

Basic information

	Frequency
Tokens	43,125,207,462
Words	36,561,273,153
Sentences	2,008,143,278
Web pages	78,373,887

Elsevier OA CC-BY Corpus – year distribution

The English corpus of Elsevier Open Access Journals contains 40,000 scientific articles from 2014 to 2020.


Search the Elsevier OA CC-BY Corpus

Sketch Engine offers a range of tools to work with this English corpus of Elsevier Journals.


Tools to work with the Elsevier OA CC-BY Corpus

A complete set of Sketch Engine tools is available to work with this English corpus of scientific papers to generate:

word sketch – English collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
keywords – terminology extraction of one-word and multi-word units
word lists – lists of English nouns, verbs, adjectives etc. organized by frequency
n-grams – frequency list of multi-word units
concordance – examples in context
text type analysis – statistics of metadata in the corpus
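To make three of the outputs listed above more concrete (word lists, n-grams, and concordance), here is a minimal Python sketch on a toy sentence. It is an illustration only, not how Sketch Engine computes these results: the tokenizer is a simple regular expression, and the function names `tokenize`, `word_list`, `ngrams`, and `concordance` are hypothetical names chosen for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (toy tokenizer)."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def word_list(tokens):
    """Frequency list of single tokens, most frequent first."""
    return Counter(tokens).most_common()


def ngrams(tokens, n):
    """Frequency list of contiguous n-token sequences, most frequent first."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams).most_common()


def concordance(tokens, keyword, width=3):
    """Keyword-in-context rows: up to `width` tokens on each side of a hit."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows


# Toy input, invented for this example (not taken from the corpus).
sample = (
    "Open access articles are free to read. "
    "Open access journals publish open access articles."
)
tokens = tokenize(sample)

print(word_list(tokens)[:3])
print(ngrams(tokens, 2)[:2])
print(concordance(tokens, "journals", width=2))
```

A real corpus tool would additionally rely on the part-of-speech tags and lemmas to filter these lists, for example to restrict a word list to nouns.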

Citation & Reference

Kershaw, Daniel; Koeling, Rob (2020), “Elsevier OA CC-BY Corpus”, Mendeley Data, V1, doi: 10.17632/zm33cdndxs.1

Other English corpora

Explore our largest Timestamped English corpus with 80+ billion words.


Use Sketch Engine in minutes

Generate collocations, frequency lists, examples in context, and n-grams, or extract terms. Use our Quick Start Guide to learn how in minutes.

