Gujarati web corpus (guWaC)

guWaC (Gujarati Web as Corpus) is a corpus of Gujarati, an Indo-Aryan language of the Indo-European language family, crawled from the web in 2013. It contains almost 18 million words and is encoded in UTF-8, without tagging.