frWaC: French corpus from the .fr domain
The frWaC corpus is a French text corpus collected from the .fr domain, using medium-frequency words from the Le Monde Diplomatique corpus and basic French vocabulary lists as seeds. The corpus consists of French websites with a total size of 1.3 billion words.
Part-of-speech tagset
The corpus texts were POS tagged with TreeTagger using the following tagset.
Tools to work with the French web corpus
A complete set of Sketch Engine tools is available for working with the frWaC corpus. These tools generate:
- word sketch – French collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- word lists – lists of French nouns, verbs, adjectives etc. organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
- keywords – terminology extraction of one-word units
- text type analysis – statistics of metadata in the corpus
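To make a few of the outputs listed above more concrete, here is a minimal Python sketch of what a frequency word list, an n-gram list and a concordance are. It is an illustration only, not how Sketch Engine computes them: the function names, the whitespace tokenizer and the sample sentence are all assumptions made for this example.

```python
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase whitespace tokenization.
    # A real corpus pipeline uses language-aware tokenization.
    return text.lower().split()


def word_list(tokens):
    # Word list: (word, count) pairs sorted by descending frequency.
    return Counter(tokens).most_common()


def ngrams(tokens, n):
    # N-grams: counts of every run of n consecutive tokens.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def concordance(tokens, keyword, width=3):
    # Concordance: each hit of the keyword with up to `width` tokens
    # of left and right context.
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


# Hypothetical sample text, not taken from frWaC.
sample = "le chat dort et le chat mange"
tokens = tokenize(sample)

if __name__ == "__main__":
    print(word_list(tokens))
    print(ngrams(tokens, 2).most_common(3))
    for left, kw, right in concordance(tokens, "chat", width=2):
        print(f"{left:>12} [{kw}] {right}")
```

On a corpus the size of frWaC, such counts would be computed over indexed data rather than an in-memory token list; the sketch only shows the shape of the results.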
Changelog
version 1.1 (2012/04/13)
- retagged with UTF-8 TreeTagger models to fix lemmatization
- improved sentence segmentation
version 1.0
- POS tagged and lemmatized with the TreeTagger tool
initial version
- 100-million-word corpus
- gathered using a list of URLs provided by Serge Sharoff (University of Leeds), as described in "A Corpus Factory for Many Languages"
Bibliography
BARONI, Marco, et al. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language resources and evaluation, 2009, 43.3: 209-226.