Hebrew General corpus
This corpus was crawled from the Internet and consists mostly of newspaper material. It contains more than 150 million words. The corpus was developed and donated by Prof Ari Rappoport and Daphna Shezaf from the Computer Science and Engineering Department at the Hebrew University of Jerusalem.
HebWaC
A web corpus that was crawled and deduplicated, covering multiple domains: blog posts, newspapers, commercial pages, … The corpus contains ca. 50 million words.
Part-of-speech tagset
See the Hebrew POS tagset summary.