Corpus of Spontaneous Japanese
Jointly created by the Communications Research Laboratory, the Tokyo Institute of Technology, and the National Institute for Japanese Language and Linguistics, the Corpus of Spontaneous Japanese (CSJ) consists predominantly of spontaneous monologues (academic presentations and public speaking), with smaller amounts of spontaneous dialogue and read speech. In total it provides 658 hours of speech (over 7 million words) from over 1,400 speakers, whose ages range from their twenties to their eighties. The audio is accompanied by several layers of annotation: transcriptions, part-of-speech tags, and labels for phonetic segmentation and intonation.
These annotations are provided both as text files and in XML format.