HindiWaC – Hindi Corpus from the web


The Hindi Web corpus (HindiWaC) is a Hindi corpus made up of texts collected from the Internet. It contains more than 100 million words crawled from the Hindi-language web in 2009, 2011 and 2014.

Texts in the corpus are lemmatized and morphologically tagged. The corpus has a word sketch grammar that enables users to explore the grammatical and collocational behavior of Hindi words. The whole process of corpus preparation is described in the Corpus factory method document (Kilgarriff et al., LREC 2010).

The corpus contains a special attribute, cpos, which is a coarse POS tag that is not derived from the attribute tag.

Part-of-speech tagset

See the Hindi part-of-speech tagset describing POS tags used in the corpus.

Special positional attributes in the 3rd version of the corpus

Attributes only in the 3rd version of the corpus

hlemma/hword (heuristic) – attributes in which all vowels are stripped so that only the consonants remain. Most spelling variations are due to the use of different vowels, so hlemma and hword come in handy for finding similarly spelt words, e.g. ka (क) + e -> ki (की)
Tags with the suffix “:?” mark words that could not be classified into the target tag on linguistic grounds but were assigned to it on the basis of context
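
The vowel-stripping idea behind hlemma and hword can be illustrated with a minimal sketch. It assumes that "vowels" means the Devanagari independent vowel letters and dependent vowel signs (matras); the exact heuristic used to build the corpus attributes is not documented here and may differ. The function name strip_vowels is hypothetical.

```python
# Assumption: "vowels" = Devanagari independent vowel letters plus dependent
# vowel signs (matras). The corpus's real heuristic may treat more characters.
INDEPENDENT_VOWELS = range(0x0904, 0x0915)  # U+0904 .. U+0914
VOWEL_SIGNS = range(0x093E, 0x094D)         # U+093E .. U+094C (virama excluded)

_VOWELS = {chr(cp) for cp in (*INDEPENDENT_VOWELS, *VOWEL_SIGNS)}


def strip_vowels(word: str) -> str:
    """Return the word with Devanagari vowels removed (hypothetical helper)."""
    return "".join(ch for ch in word if ch not in _VOWELS)


# Spelling variants that differ only in their vowels collapse to the same key.
variants = ["की", "कि", "का"]
print({w: strip_vowels(w) for w in variants})
```

Under these assumptions, word forms that differ only in vowel length or vowel choice map to the same consonant skeleton, which is what makes such an attribute useful for finding similarly spelt words.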

Tools to work with the Hindi corpus

A complete set of tools is available to work with this HindiWaC corpus to generate:

word sketch – Hindi collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
keywords – terminology extraction of one-word and multi-word units
word lists – lists of Hindi nouns, verbs, adjectives etc. organized by frequency
n-grams – frequency list of multi-word units
concordance – examples in context
text type analysis – statistics of metadata in the corpus
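
As a rough illustration of what an n-gram frequency list is, the sketch below counts bigrams over a tiny, made-up tokenized sample. It is not Sketch Engine's implementation; the function name ngram_counts and the sample sentences are assumptions introduced only for this example.

```python
from collections import Counter


def ngram_counts(sentences, n=2):
    """Count n-grams within each tokenized sentence (hypothetical helper)."""
    counts = Counter()
    for tokens in sentences:
        # Slide a window of length n; n-grams never cross sentence boundaries.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Made-up sample, already split into tokens.
sample = [
    ["वह", "घर", "जा", "रहा", "है"],
    ["मैं", "घर", "जा", "रही", "हूँ"],
]

for gram, freq in ngram_counts(sample).most_common(3):
    print(" ".join(gram), freq)
```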

Changelog

v4.0 (10th Feb 2017)

added data from 2014, bringing the total size to 107 million words
improved sketch grammar
removed special positional attributes: hlemma and hword

v3.0 (17th Jan 2012)

recollected in 2011, size 58 million tokens
tagged using the shallow tagging legend
afterward, retagged using a new POS tagger (91.31% accuracy) and lemmatized; lemmatizer and POS analyzer available at http://sivareddy.in/downloads
- tagger uses the POS tags listed in POS guidelines for Indian languages
wrote a simple sketch grammar for Hindi and generated the first Hindi word sketches
in 2014, the sketch grammar was revised with new rules that make use of postposition markers (which are crucial in Hindi dependency parsing), and more rules were added (see the bibliography)
added lempos attribute
special positional attributes: hlemma, hword, and cpos

v1.0 (Dec 2009)

initial size 27 million words
created by Siva Reddy
no part-of-speech tagging

Bibliography

Eragani, A. K., Kuchibhotla, V., Sharma, D. M., Reddy, S., & Kilgarriff, A. (2014). Hindi Word Sketches. In Proceedings of the 11th International Conference on Natural Language Processing (ICON).
