Transhistorical Corpus of Written English
The Transhistorical Corpus of Written English (TCWE) is a diachronic text corpus developed at Edge Hill University as part of a project which has taken place between 2019–2021, directed by Dr. Imogen Marcus and with the assistance of Dr. Ursula Maden-Weinberger. The corpus contains five sub-corpora: sermons, statutes, letters, emails, and instant messages. Please see the infographic below:
The texts within the corpus range in date from the fifteenth to the twenty-first century. See Table 1 below, which details how many tokens are in each sub-corpora by century, and Table 2, which details how many words are in each sub-corpora by century. Sermons, statutes, and letters stretch back to the medieval period, whilst email and instant messaging are confined to the twenty-first century.
Grammatical tagging
Since Sketch Engine can apply a tagger without any additional effort, all Modern English texts in the corpus, i.e. those dating from the 20th and 21st Centuries, were tagged and lemmatized using TreeTagger. This tagging is useful for anyone working with these Modern English documents. However, it should be noted that this tagging only applies to these modern texts. It cannot be reliably used in relation to texts in the corpus dating from before the 20th Century, because there is a much higher degree of spelling variation in these texts. There are also words used which have fallen out of use in Modern English. These words are usually tagged as nouns and are not lemmatized.
Annotation consistency
We have made sure that the annotation across the Transhistorical Corpus of Written English is consistent. For more information, please see the table below.
Annotation table
Characteristic | Specific character | CEECS | Innsbruck | EEBO/ECCO | CLEP | SketchEngine | ||
FIND | REPLACE | This affects ONLY texts | ||||||
Early letters | + | represented as characters | converted into modern equivalents | n/a | ||||
Ash | +A | Æ | +A | AE | L15, L16 | |||
ash | +a | æ | +a | ae | L15, L16 | |||
Eth | +D | Ð | +D | Th | L15, L16 | |||
eth | +d | ð | +d | th | L15, L16 | |||
Yogh | +G | 3 | +G | Ȝ | L15, L16 | |||
yogh | +g | 3 | +g | ȝ | L15, L16 | |||
Thorn | +T | Þ | +T | Th | L15, L16 | |||
thorn | +t | þ | +t | th | L15, L16 | |||
Crossed Thorn | +TT | Ꝥ | +TT | Th | L15, L16 | |||
Crossed Thorn | +Tt | +Tt | th | L15, L16 | ||||
crossed thorn | +tt | ꝥ | +tt | th | L15, L16 | |||
+e | e caudata | ae | +e | ae | L15, L16 | |||
+L | £ (pound sign) | £ | +L | £ | L15, L16 | |||
3 | ȝ | S15_001 | ||||||
Þ | Th | S15_001 | ||||||
þ | th | S15_001 | ||||||
Abbreviations | ||||||||
tilde or dash above letter, flourish, apostrophe within word | letter followed by ~ (e.g. p~vided) | followed editors, either extending or printing as is | dash above letter (e.g. declaracōn for declaracion) | ō | o~ | S15, S16, S17, S18 | ||
ū | u~ | S15, S16, S17, S18 | ||||||
n̄ | n~ | S15, S16, S17, S18 | ||||||
ē | e~ | S15, S16, S17, S18 | ||||||
ȳ | y~ | S15, S16, S17, S18 | ||||||
ā | a~ | S15, S16, S17, S18 | ||||||
m̄ | m~ | S15, S16, S17, S18 | ||||||
p̄ | p~ | S15, S16, S17, S18 | ||||||
q̄ | q~ | S15, S16, S17, S18 | ||||||
ī | i~ | S15, S16, S17, S18 | ||||||
ꝓ | p~ | S15, S16, S17, S18 | ||||||
fecimꝰ = abbreviation “us” –> fecimus | ꝰ | us | S15, S16 | |||||
& | & = and | & = and | & = and | & | and | |||
y. | ||||||||
Superscripts | ||||||||
superscript e.g. t, r etc | between == (e.g. w=t=) | between == | ^ (e.g. w^t) | between == | ^[any letters] | =any letters= | S15, S16, S17, S18 | |
Accents | ||||||||
é, è etc. | any accent replaced by accent grave ` after letter | on letter (e.g. ô dearely beloved) | é | e` | ||||
ô | o` | |||||||
Text Level Codes | ||||||||
editors’ comments | [\…\] | |[…] or […] within words | […], also page numbers | [\…\] | delete | L15-L19 | ||
|[…] | delete | L15-L19 | ||||||
font other than basic font | (^…^) | (^ | delete | L15-L19 | ||||
^) | delete | L15-L19 | ||||||
foreign language | (\…\) | (\…\) | (\ | delete | L15-L19 | |||
\) | delete | L15-L19 | ||||||
emendations | [{…..{] | [{ | delete | L15-L19 | ||||
{] | delete | L15-L19 | ||||||
[} | delete | L15-L19 | ||||||
heading | [}…}] | |… (also page/line numbers, remarks) | }] | delete | L15-L19 | |||
Corpus Coder comment | [^…^] | [^…^] | [^…^] | leave as is | L15-L19 | |||
metadata | <…> | |<...> | <…> (metadata) | |||||
special initials | | | |||||||
folio references | |r[f.8v] | |||||||
paragraph marker | ¶ | |||||||
deviant word joining | % (e.g. Iam = I %am) | % | delete | L18, L19 | ||||
uncertain letters | {…} | { | delete | L18, L19 | ||||
unreadable letters | {**…} | } | delete | L18, L19 | ||||
omissions | ^…^ | ^…^, e.g. I had the pleasure of seeing ^you^ but it | ^…^ | leave as is | L18, L19 |
Corpus rationale
The corpus has been designed to investigate innovation in digital written language, in particular the way it has been previously been conceptualised as a hybrid of speech and writing, in a historical context. It is for this reason that the corpus contains sermons (towards the speech end of a conceptual speech-writing continuum), statutes (towards the writing end of a conceptual speech-writing continuum), as well as letters, email and instant messages. However, the corpus does not need to be used for just this purpose.
The corpus contains a large amount of metadata and each user can therefore use many search criteria, including text type, century, in the case of letters, author name, author gender, recipient name and recipient gender. Below is a table which outlines what each metadata label means, and which text type sub-corpus it applies to.
Tools to work with the Transhistorical Corpus of Written English
A complete set of tools is available to work with this English transhistorical corpus to generate:
- word sketch – English collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- keywords – terminology extraction of one-word and multi-word units
- word lists – lists of English nouns, verbs, adjectives, etc. organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
- trends – diachronic analysis automatically identifies neologisms and changes in use
- text type analysis – statistics of metadata in the corpus
Bibliography - Copyright and permissions
Copyright and permissions
The texts within the corpus have come from a range of sources. Many of them, such as the emails, taken from the freely available online ENRON corpus, are within the public domain. It was not therefore necessary to seek permission to reproduce these texts. The data in the instant messaging sub-corpus was collected by the project leader Imogen Marcus. Consent was gained in advance from everyone who donated their messaging data for this sub-corpus. Other parts of the TCWE, predominantly the correspondence sub-corpus but also parts of the sermon sub-corpus, are texts from other corpora and websites which have been reproduced here, with the permission of the creators. Further details can be found below.
Copyright and permissions pertaining to the correspondence sub-corpus
Copyright agreement pertaining to the 15-17th Century correspondence included in the corpus:
The TCWE incorporates a sample of 15th-17th Century letters from Corpus of Early English Correspondence Sampler (CEECS) within its correspondence sub-corpus. This document recognizes and acknowledges that, as copyright holders of CEECS, the CEEC team (led by Professor Terttu Nevalainen) have agreed to allow the inclusion of these letters in the TCWE corpus, and for the same to be made available on the Sketch Engine platform. The full references and credit lines for these letters are listed below:
CEEC = Corpus of Early English Correspondence. Compiled by the CEEC team under Terttu Nevalainen at the Department of Modern Languages, University of Helsinki. https://varieng.helsinki.fi/CoRD/corpora/CEEC/
CEECS = Corpus of Early English Correspondence Sampler. 1998. Compiled by Terttu Nevalainen, Helena Raumolin-Brunberg, Jukka Keränen, Minna Nevala, Arja Nurmi and Minna Palander-Collin at the Department of Modern Languages, University of Helsinki.
Nurmi, Arja (ed.). 1998. Manual for the Corpus of Early English Correspondence Sampler CEECS. Department of Modern Languages. University of Helsinki. clu.uni.no/icame/manuals/CEECS/INDEX.HTM.
PCEEC = Parsed Corpus of Early English Correspondence. 2006. Compiled by Terttu Nevalainen, Helena Raumolin-Brunberg, Jukka Keränen, Minna Nevala, Arja Nurmi and Minna Palander-Collin. Annotated by Arja Nurmi, Ann Taylor, Anthony Warner, Susan Pintzuk, and Terttu Nevalainen. Helsinki: University of Helsinki and York: University of York.
Copyright agreement pertaining to the 18th Century letters in the corpus:
The sample of 18th Century letters from the Corpus of Late Eighteenth Century Prose have been reproduced with the permission of Professor David Denison, University of Manchester and Dr. Linda van Bergen, University of Edinburgh. This document recognizes and acknowledges the John Rylands University Library of Manchester, where the originals of the texts are held, as well as the ‘The English language of the north-west in the late Modern English period’ project, directed by David Denison, with Linda van Bergen as principal collaborator.
Copyright agreement pertaining to the 19th Century letters in the corpus:
The sample of 19th Century letters from the Corpus of Late Modern English Prose have been reproduced with the permission of Professor David Denison. The Corpus of Late Modern English Prose was constructed between 1992 and 1994 by Prof. David Denison, Department of English Language & Literature, University of Manchester, with the very considerable assistance of Graeme Trousdale and Linda van Bergen.
Copyright agreement pertaining to the 20th Century letters in the corpus:
The sample of 20th Century letters from the British Telecom Correspondence Corpus (BTCC) have been reproduced with the permission of its creator Dr. Ralph Morton (Birmingham City University).
Copyright and permissions pertaining to the sermon sub-corpus
Copyright agreement pertaining to the 21st Century sermons in the corpus:
The 21st Century sermons have been taken from the Lancaster Priory website and reproduced with the permission of their authors: Joel Love, Kara Cooper, Chris Newlands, John Rodwell and Kevin Huggett.
Search the Transhistorical Corpus of Written English
Sketch Engine offers a range of tools to search and analyze this diachronic corpus.
Use Sketch Engine in minutes
Generate collocations, frequency lists, examples in contexts, n-grams or extract terms. Use our Quick Start Guide to learn it in minutes.