Chinese Gigaword: Corpus of Mainland and Traditional Chinese

The Chinese Gigaword Corpus is a corpus of Chinese journalism. It contains data from news agency archives and was prepared by the Linguistic Data Consortium (LDC), with source data covering the period 1990–2002. Chinese Gigaword comprises almost 600 million words belonging to two separate corpora:

Chinese GigaWord 2 Corpus: Mainland, simplified characters

  • source data: journalism from the Xinhua News Agency, Beijing, from 1991 to 2002
  • size: more than 200 million words

Chinese GigaWord 2 Corpus: Taiwan, traditional characters

  • source data: journalism from the Central News Agency, Taiwan, from 1990 to 2002
  • size: more than 380 million words

More information can be found at https://catalog.ldc.upenn.edu/LDC2003T09

Part-of-speech tagset

The Chinese Gigaword corpus is POS-tagged using the Chinese part-of-speech tagset.

Tools to work with the Chinese Gigaword corpus

A complete set of Sketch Engine tools is available for working with these Mainland and Traditional Chinese corpora. The tools generate:

  • word sketch – Chinese collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of Chinese nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
  • keywords – terminology extraction of one-word units
  • text type analysis – statistics of metadata in the corpus
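
To make two of the items above concrete, here is a minimal sketch of what an n-gram frequency list and a concordance compute. It is plain Python, not the Sketch Engine implementation or API. The function names `ngram_counts` and `concordance` and the sample sentence are hypothetical and chosen only for illustration. The sample is assumed to be already segmented into words, since Chinese text has no spaces between words.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens (a simple n-gram frequency list)."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)


def concordance(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for each occurrence of keyword."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, tok, right))
    return lines


# Hypothetical pre-segmented sample, not taken from the corpus.
tokens = ["我们", "学习", "中文", "他们", "学习", "中文", "我们", "喜欢", "中文"]

bigrams = ngram_counts(tokens, 2)
kwic = concordance(tokens, "学习", width=1)

# Print the bigrams from most to least frequent, then each keyword in context.
for gram, freq in bigrams.most_common():
    print(" ".join(gram), freq)
for left, word, right in kwic:
    print(" ".join(left), f"[{word}]", " ".join(right))
```

On a real corpus the same counts would be taken over hundreds of millions of tokens, which is why a dedicated tool with prebuilt indexes is used instead of a loop like this one.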

Citation

Graff, David, and Ke Chen. Chinese Gigaword LDC2003T09. Web Download. Philadelphia: Linguistic Data Consortium, 2003.

Bibliographical references about the corpus

Hong, J. F., & Huang, C. R. (2006, November). Using Chinese Gigaword Corpus and Chinese Word Sketch in Linguistic Research. In PACLIC.

Ma, W. Y., & Huang, C. R. (2006, May). Uniform and effective tagging of a heterogeneous giga-word corpus. In 5th International Conference on Language Resources and Evaluation (LREC2006) (pp. 24-28).

Chinese word sketches

Kilgarriff, A., Huang, C. R., Rychlý, P., Smith, S., & Tugwell, D. (2005). Chinese word sketches.

