Transhistorical Corpus of Written English

The Transhistorical Corpus of Written English (TCWE) is a diachronic text corpus developed at Edge Hill University as part of a project that ran from 2019 to 2021, directed by Dr. Imogen Marcus with the assistance of Dr. Ursula Maden-Weinberger. The corpus contains five sub-corpora: sermons, statutes, letters, emails, and instant messages. Please see the infographic below.

The texts within the corpus range in date from the fifteenth to the twenty-first century. Table 1 below gives the number of tokens in each sub-corpus by century, and Table 2 gives the number of words in each sub-corpus by century. Sermons, statutes, and letters stretch back to the medieval period, whilst email and instant messaging are confined to the twenty-first century.

Table 1. Number of tokens in each sub-corpus (text type) by century

Century | Sermons | Letters | Instant messaging | Email | Statutes
15th C  |   24579 |   24507 |                 0 |     0 |    21532
16th C  |   23752 |    2365 |                 0 |     0 |    22447
17th C  |   24157 |   26904 |                 0 |     0 |    21621
18th C  |   24079 |   28056 |                 0 |     0 |    22719
19th C  |   22404 |   23134 |                 0 |     0 |    21359
20th C  |   23055 |   22503 |                 0 |     0 |    25308
21st C  |   23021 |       0 |             83196 | 51072 |    25377
Total   |  165047 |  148756 |             83196 | 51072 |   160363
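
The Total row can be cross-checked against the per-century rows. The following is a minimal sketch in Python, not part of the TCWE project: the figures are transcribed from the token table above, and the names `ROWS`, `STATED_TOTALS`, `column_sums` and `mismatches` are hypothetical, introduced only for this illustration.

```python
# Token counts transcribed from the token table (one list per century,
# in the column order given by COLUMNS). All names here are illustrative.
COLUMNS = ["Sermons", "Letters", "Instant messaging", "Email", "Statutes"]

ROWS = {
    "15th C": [24579, 24507, 0, 0, 21532],
    "16th C": [23752, 2365, 0, 0, 22447],
    "17th C": [24157, 26904, 0, 0, 21621],
    "18th C": [24079, 28056, 0, 0, 22719],
    "19th C": [22404, 23134, 0, 0, 21359],
    "20th C": [23055, 22503, 0, 0, 25308],
    "21st C": [23021, 0, 83196, 51072, 25377],
}

# The Total row as stated in the table.
STATED_TOTALS = [165047, 148756, 83196, 51072, 160363]


def column_sums(rows):
    """Sum each sub-corpus column across all centuries."""
    return [sum(column) for column in zip(*rows.values())]


def mismatches(rows, stated):
    """Return {column name: (computed sum, stated total)} where they differ."""
    result = {}
    for name, computed, total in zip(COLUMNS, column_sums(rows), stated):
        if computed != total:
            result[name] = (computed, total)
    return result


if __name__ == "__main__":
    for name, (computed, total) in mismatches(ROWS, STATED_TOTALS).items():
        print(f"{name}: rows sum to {computed}, stated total is {total}")
```

With the figures as transcribed here, the Sermons, Instant messaging, Email and Statutes columns match their stated totals, while the Letters rows sum to 127469 against a stated total of 148756, which may indicate a transcription slip in one of the Letters entries.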

Table 2. Number of words in each sub-corpus (text type) by century

Century | Sermons | Letters | Instant messaging | Email | Statutes
15th C  |   20456 |   20819 |                 0 |     0 |    20307
16th C  |   19846 |   20139 |                 0 |     0 |    20666
17th C  |   19930 |   22718 |                 0 |     0 |    20599
18th C  |   20336 |   21657 |                 0 |     0 |    20182
19th C  |   19393 |   19880 |                 0 |     0 |    18949
20th C  |   19823 |   20122 |                 0 |     0 |    21225
21st C  |   19966 |       0 |             50990 | 44354 |    19273

Grammatical tagging

Because Sketch Engine can apply a tagger without any additional effort, all Modern English texts in the corpus, i.e. those dating from the 20th and 21st Centuries, were tagged and lemmatized using TreeTagger. This tagging is useful for anyone working with these Modern English documents, but it is reliable only for these modern texts. It cannot be relied on for texts in the corpus dating from before the 20th Century, because these texts show a much higher degree of spelling variation and contain words that have fallen out of use in Modern English. Such words are usually tagged as nouns and are not lemmatized.

Annotation consistency

We have made sure that the annotation across the Transhistorical Corpus of Written English is consistent. For more information, please see the table below.

Characteristic | Specific character | CEECS | Innsbruck | EEBO/ECCO | CLEP | Sketch Engine: FIND | Sketch Engine: REPLACE | This affects ONLY texts
Early letters   + represented as characters converted into modern equivalents n/a
Ash +A Æ +A AE L15, L16
ash +a æ +a ae L15, L16
Eth +D Ð +D Th L15, L16
eth +d ð +d th L15, L16
Yogh +G 3 +G Ȝ L15, L16
yogh +g 3 +g ȝ L15, L16
Thorn +T Þ +T Th L15, L16
thorn +t þ +t th L15, L16
Crossed Thorn +TT +TT Th L15, L16
Crossed Thorn +Tt +Tt th L15, L16
crossed thorn +tt +tt th L15, L16
+e e caudata ae +e ae L15, L16
+L £ (pound sign) £ +L £ L15, L16
3 ȝ S15_001
Þ Th S15_001
þ th S15_001
Abbreviations          
tilde or dash above letter, flourish, apostrophe within word letter followed by ~ (e.g. p~vided) followed editors, either extending or printing as is dash above letter (e.g. declaracōn for declaracion) o~ S15, S16, S17, S18
u~ S15, S16, S17, S18
n~ S15, S16, S17, S18
e~ S15, S16, S17, S18
y~ S15, S16, S17, S18
a~ S15, S16, S17, S18
m~ S15, S16, S17, S18
p~ S15, S16, S17, S18
q~ S15, S16, S17, S18
i~ S15, S16, S17, S18
p~ S15, S16, S17, S18
fecimꝰ = abbreviation “us” –> fecimus us S15, S16
& & = and & = and & = and & and
y.
Superscripts
superscript e.g. t, r etc between == (e.g. w=t=) between == ^ (e.g. w^t) between == ^[any letters] =any letters= S15, S16, S17, S18
Accents        
é, è etc. any accent replaced by accent grave ` after letter on letter (e.g. ô dearely beloved) é e`
ô o`
Text Level Codes
editors’ comments [\…\] |[…] or […] within words […], also page numbers [\…\] delete L15-L19
|[…] delete L15-L19
font other than basic font (^…^) (^ delete L15-L19
^) delete L15-L19
foreign language (\…\) (\…\) (\ delete L15-L19
\) delete L15-L19
emendations [{…..{] [{ delete L15-L19
{] delete L15-L19
[} delete L15-L19
heading [}…}] |… (also page/line numbers, remarks) }] delete L15-L19
Corpus Coder comment [^…^] [^…^] [^…^] leave as is L15-L19
metadata <…> |<...> <…> (metadata)
special initials |
folio references |r[f.8v]
paragraph marker
deviant word joining % (e.g. Iam = I %am) % delete L18, L19
uncertain letters {…} { delete L18, L19
unreadable letters {**…} } delete L18, L19
omissions ^…^ ^…^, e.g. I had the pleasure of seeing ^you^ but it ^…^ leave as is L18, L19
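
The FIND and REPLACE columns of the "Early letters" rows can be read as a simple substitution table. The sketch below illustrates that reading in Python; it is not the project's actual conversion procedure. The code pairs are copied from the table, while the function name `normalise`, the constant names, and the assumption that letter text IDs look like `L15_001` (by analogy with the `S15_001` example) are hypothetical.

```python
import re

# FIND -> REPLACE pairs copied from the "Early letters" rows of the table,
# which lists them as affecting only L15 and L16 texts.
EARLY_LETTER_CODES = {
    "+A": "AE", "+a": "ae",                    # ash
    "+D": "Th", "+d": "th",                    # eth
    "+G": "Ȝ", "+g": "ȝ",                      # yogh
    "+T": "Th", "+t": "th",                    # thorn
    "+TT": "Th", "+Tt": "th", "+tt": "th",     # crossed thorn
    "+e": "ae",                                # e caudata
    "+L": "£",                                 # pound sign
}

# Try the longest codes first, so that "+tt" is matched before "+t".
_PATTERN = re.compile(
    "|".join(
        re.escape(code)
        for code in sorted(EARLY_LETTER_CODES, key=len, reverse=True)
    )
)

# Assumption: a text ID starts with its sub-corpus label, e.g. "L15_001".
AFFECTED_PREFIXES = ("L15", "L16")


def normalise(text, text_id):
    """Apply the early-letter replacements, but only to L15 and L16 texts."""
    if not text_id.startswith(AFFECTED_PREFIXES):
        return text
    return _PATTERN.sub(lambda m: EARLY_LETTER_CODES[m.group(0)], text)
```

For example, `normalise("+tat +te", "L15_001")` yields `that the`, whereas the same string with a sermon ID such as `S15_001` is returned unchanged, since the table restricts these replacements to the letter texts.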

Corpus rationale

The corpus has been designed to investigate innovation in digital written language in a historical context, in particular the way such language has previously been conceptualised as a hybrid of speech and writing. It is for this reason that the corpus contains sermons (towards the speech end of a conceptual speech-writing continuum), statutes (towards the writing end of that continuum), as well as letters, email and instant messages. However, the corpus does not need to be used for just this purpose.

The corpus contains a large amount of metadata, so users can search by many criteria, including text type and century and, in the case of letters, author name, author gender, recipient name and recipient gender. Below is a table which outlines what each metadata label means, and which text type sub-corpus it applies to.

Metadata

Metadata label | What it means | Which text type sub-corpus it applies to
Text ID | The ID number of each individual text file in the corpus. See the text ID key below this table. | Every sub-corpus
Text type | You can search for each of the five text types in the corpus: sermons, statutes, letters, email and instant messaging. | Every sub-corpus
Century | The century the text dates from: 15th Century, 16th Century, 17th Century, 18th Century, 19th Century, 20th Century or 21st Century. | Every sub-corpus
Year | Specific year in which the text was composed. | Every sub-corpus
Date | Specific day and month on which the text was composed, if known. | Email and 15th, 16th, 18th, 19th Century letters
Source | Source from which the data is taken. This may be a previously established corpus, website, archive or transcriptions made by the project members. See the ‘copyright’ section below for more information. | Every sub-corpus
Collection | Particular collections, e.g. letter collections, that the data has been sourced from. | 15th Century statutes, 20th Century sermons, 15th-17th Century letters
Original ID | If the text files in a particular sub-corpus are taken from a previously created corpus, e.g. the Corpus of Early English Correspondence (CEEC), these are the text IDs that were originally assigned to them in that source corpus. | 15th-17th Century letters, 20th Century letters
Author name | The author’s name | Instant messaging (anonymised in this sub-corpus), email, all letters, all sermons
Author gender | The author’s gender | Instant messaging, email, all letters, all sermons
Recipient name | The recipient’s name | Instant messaging, email, all letters
Recipient gender | The recipient’s gender | Instant messaging, email, all letters
Location | Where the text was composed/written | 18th-19th Century letters
Title | The title of the text, if applicable | All statutes, all sermons

Text ID label | What it refers to
S15, e.g. S15_001 | 15th Century sermon, first text file (text file number included for information)
S16 | 16th Century sermon
S17 | 17th Century sermon
S18 | 18th Century sermon
S19 | 19th Century sermon
S20 | 20th Century sermon
S21 | 21st Century sermon
T15 | 15th Century statute
T16 | 16th Century statute
T17 | 17th Century statute
T18 | 18th Century statute
T19 | 19th Century statute
T20 | 20th Century statute
T21 | 21st Century statute
L15 | 15th Century letter
L16 | 16th Century letter
L17 | 17th Century letter
L18 | 18th Century letter
L19 | 19th Century letter
L20 | 20th Century letter
E21 | 21st Century email
I21 | 21st Century instant message
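
The text ID key follows a regular pattern: one letter for the text type, two digits for the century and, optionally, an underscore and a file number. As an illustration only, the Python sketch below decodes an ID on that basis. The names `parse_text_id`, `describe` and `TextId` are hypothetical, and the assumption that every file number is appended as `_NNN` is inferred from the single `S15_001` example in the key.

```python
import re
from typing import NamedTuple, Optional

# Prefix letters and the centuries listed for each in the text ID key.
TEXT_TYPES = {
    "S": ("sermon", range(15, 22)),
    "T": ("statute", range(15, 22)),
    "L": ("letter", range(15, 21)),
    "E": ("email", range(21, 22)),
    "I": ("instant message", range(21, 22)),
}

# Assumed shape: prefix letter, two-digit century, optional "_<file number>".
_ID_RE = re.compile(r"([STLEI])(\d{2})(?:_(\d+))?")


class TextId(NamedTuple):
    text_type: str
    century: int
    file_number: Optional[int]


def parse_text_id(text_id: str) -> TextId:
    """Split a text ID such as 'S15_001' into type, century and file number."""
    match = _ID_RE.fullmatch(text_id)
    if match is None:
        raise ValueError(f"not a recognised text ID: {text_id!r}")
    prefix, century, number = match.groups()
    text_type, centuries = TEXT_TYPES[prefix]
    if int(century) not in centuries:
        raise ValueError(f"label not in the text ID key: {text_id!r}")
    return TextId(text_type, int(century), int(number) if number else None)


def describe(text_id: str) -> str:
    """Render the label in the wording used by the key."""
    parsed = parse_text_id(text_id)
    suffix = "st" if parsed.century == 21 else "th"
    return f"{parsed.century}{suffix} Century {parsed.text_type}"
```

Here `describe("S15_001")` gives `15th Century sermon` and `describe("I21")` gives `21st Century instant message`; a label that the key does not list, such as `E15`, raises a `ValueError`.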

Tools to work with the Transhistorical Corpus of Written English

A complete set of tools is available to work with this English transhistorical corpus to generate:

  • word sketch – English collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • keywords – terminology extraction of one-word and multi-word units
  • word lists – lists of English nouns, verbs, adjectives, etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
  • trends – diachronic analysis automatically identifies neologisms and changes in use
  • text type analysis – statistics of metadata in the corpus

Copyright and permissions

The texts within the corpus have come from a range of sources. Many of them, such as the emails, which were taken from the freely available online ENRON corpus, are in the public domain. It was therefore not necessary to seek permission to reproduce these texts. The data in the instant messaging sub-corpus was collected by the project leader, Imogen Marcus. Consent was gained in advance from everyone who donated their messaging data for this sub-corpus. Other parts of the TCWE, predominantly the correspondence sub-corpus but also parts of the sermon sub-corpus, consist of texts from other corpora and websites, reproduced here with the permission of their creators. Further details can be found below.

Copyright and permissions pertaining to the correspondence sub-corpus

Copyright agreement pertaining to the 15th-17th Century correspondence included in the corpus:

The TCWE incorporates a sample of 15th-17th Century letters from the Corpus of Early English Correspondence Sampler (CEECS) within its correspondence sub-corpus. This document recognizes and acknowledges that, as copyright holders of CEECS, the CEEC team (led by Professor Terttu Nevalainen) have agreed to allow the inclusion of these letters in the TCWE corpus, and for the same to be made available on the Sketch Engine platform. The full references and credit lines for these letters are listed below:

CEEC = Corpus of Early English Correspondence. Compiled by the CEEC team under Terttu Nevalainen at the Department of Modern Languages, University of Helsinki. https://varieng.helsinki.fi/CoRD/corpora/CEEC/

CEECS = Corpus of Early English Correspondence Sampler. 1998. Compiled by Terttu Nevalainen, Helena Raumolin-Brunberg, Jukka Keränen, Minna Nevala, Arja Nurmi and Minna Palander-Collin at the Department of Modern Languages, University of Helsinki.

Nurmi, Arja (ed.). 1998. Manual for the Corpus of Early English Correspondence Sampler CEECS. Department of Modern Languages. University of Helsinki. clu.uni.no/icame/manuals/CEECS/INDEX.HTM.

PCEEC = Parsed Corpus of Early English Correspondence. 2006. Compiled by Terttu Nevalainen, Helena Raumolin-Brunberg, Jukka Keränen, Minna Nevala, Arja Nurmi and Minna Palander-Collin. Annotated by Arja Nurmi, Ann Taylor, Anthony Warner, Susan Pintzuk, and Terttu Nevalainen. Helsinki: University of Helsinki and York: University of York.

Copyright agreement pertaining to the 18th Century letters in the corpus:

The sample of 18th Century letters from the Corpus of Late Eighteenth Century Prose has been reproduced with the permission of Professor David Denison, University of Manchester, and Dr. Linda van Bergen, University of Edinburgh. This document recognizes and acknowledges the John Rylands University Library of Manchester, where the originals of the texts are held, as well as the project ‘The English language of the north-west in the late Modern English period’, directed by David Denison, with Linda van Bergen as principal collaborator.

Copyright agreement pertaining to the 19th Century letters in the corpus:

The sample of 19th Century letters from the Corpus of Late Modern English Prose has been reproduced with the permission of Professor David Denison. The Corpus of Late Modern English Prose was constructed between 1992 and 1994 by Prof. David Denison, Department of English Language & Literature, University of Manchester, with the very considerable assistance of Graeme Trousdale and Linda van Bergen.

Copyright agreement pertaining to the 20th Century letters in the corpus:

The sample of 20th Century letters from the British Telecom Correspondence Corpus (BTCC) has been reproduced with the permission of its creator, Dr. Ralph Morton (Birmingham City University).

Copyright and permissions pertaining to the sermon sub-corpus

Copyright agreement pertaining to the 21st Century sermons in the corpus:

The 21st Century sermons have been taken from the Lancaster Priory website and reproduced with the permission of their authors: Joel Love, Kara Cooper, Chris Newlands, John Rodwell and Kevin Huggett.

Search the Transhistorical Corpus of Written English

Sketch Engine offers a range of tools to search and analyze this diachronic corpus.
