The Scottish Corpus of Texts and Speech (SCOTS) is a corpus of over 1300 written and spoken texts, of which 77% is written and 23% is spoken. This page describes how to prepare the corpus so that it can be used with the Montreal Forced Aligner and imported for the SPADE project.
This data can be found on Havarti at
The following files are in Gaelic, not any dialect of English. They should be placed in their own sub-directory and removed from the top-level directory containing the wav files.
Before treatment, the following files, whose encoding is ambiguous as distributed in the corpus, should be re-encoded as UTF-8. This can be done by opening each file in a text editor and re-saving it with UTF-8 encoding.
0598John_lecture_1.TextGrid
0799_Niall_Rob.TextGrid
1387_-_Doreen_Waugh_Linda_Riddell.TextGrid
1428_-_LESSER_CONTENT_scotland-glasgow.TextGrid
1450_-_kyle_bettley_cheryl_campbell.wav
1470_Mairead_Mackechnie.wav
1471_Three_Shetlandic_Ladies.wav
1485_-_David_and_Rosalyn_Sweeney.wav
1521_-_Jim_and_June_Anderson.wav
1544_-_Matthew_Fitt.wav
1545_Christine_and_Greg_-_CHILD_disk_1.wav
1546_Ross_and_Shona_-_CHILD_disk_1.wav
1547_Andrea_and_Marcus_disk_2.wav
1548_Jamie_and_Linda_-_CHILD_disk_1.wav
1549_Fiona_and_Liam_-_CHILD_disk_1.wav
1550_Kelly_and_Abbey_-_CHILD_disk_3.wav
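The manual re-encoding step described above could also be scripted. The following is a minimal sketch, not part of the SCOTS tooling: the function name `reencode_to_utf8` is hypothetical, and the detection rule (UTF-16 if a byte-order mark is present, otherwise UTF-8, otherwise Latin-1) is an assumption, since this page does not say which encodings the affected files actually use. Check the result in a text editor before relying on it.

```python
import codecs
from pathlib import Path


def reencode_to_utf8(path):
    """Rewrite the file at `path` as UTF-8 and return the encoding that was assumed.

    Assumption: the file is UTF-16 (with a byte-order mark), UTF-8, or Latin-1.
    Latin-1 is only a last-resort guess, because it never fails to decode.
    """
    path = Path(path)
    raw = path.read_bytes()

    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        source_encoding = "utf-16"      # the BOM selects the byte order
    else:
        try:
            raw.decode("utf-8-sig")     # also accepts UTF-8 with a BOM
            source_encoding = "utf-8-sig"
        except UnicodeDecodeError:
            source_encoding = "latin-1"

    text = raw.decode(source_encoding)
    # Write bytes rather than text so that line endings are left untouched.
    path.write_bytes(text.encode("utf-8"))
    return source_encoding
```

Calling `reencode_to_utf8("0799_Niall_Rob.TextGrid")` would overwrite that file in place, so it is safer to run it on a copy of the directory.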
The transcripts of the SCOTS audio already come in TextGrid format, but they need some treatment before they can be used with the Montreal Forced Aligner.
The following files are formatted incorrectly for TextGrid and are in Gaelic, not any dialect of English. They should be placed in their own sub-directory and removed from the top-level directory containing the TextGrid files.
The SCOTS transcripts are already in TextGrid format, but they need some treatment before alignment. This treatment is done by clean_scots.py (add link). It needs the following prerequisites to run:
Ensure that all the TextGrids are in their own directory and that an output directory exists for the fixed transcripts. Then run python clean_scots.py textgrid_input_dir textgrid_output_dir. This writes all the fixed TextGrid transcripts to the output directory.
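The invocation above can be wrapped in a small helper that checks the directories first. This is only a sketch: the helper names `build_clean_command` and `run_clean_scots` are hypothetical, it assumes clean_scots.py sits in the current working directory, and it passes only the two positional arguments shown on this page.

```python
import subprocess
import sys
from pathlib import Path


def build_clean_command(input_dir, output_dir, script="clean_scots.py"):
    """Return the argument list for the clean_scots.py call described above."""
    return [sys.executable, str(script), str(input_dir), str(output_dir)]


def run_clean_scots(input_dir, output_dir, script="clean_scots.py"):
    """Check the directories, then run clean_scots.py on them."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    if not input_dir.is_dir():
        raise NotADirectoryError(f"input directory not found: {input_dir}")
    # Create the output directory if it does not exist yet.
    output_dir.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if the script exits with an error.
    subprocess.run(build_clean_command(input_dir, output_dir, script), check=True)
```

For example, `run_clean_scots("textgrid_input_dir", "textgrid_output_dir")` runs the same command as above with the current Python interpreter.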
Not yet done.
The following files are missing a corresponding audio file, and thus should not be included in the directory to be used for alignment:
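One way to check for transcripts without audio is to compare file names between the TextGrid and wav directories. The sketch below is an illustration under assumptions, not the procedure used to compile the list: the function name `textgrids_without_audio` is hypothetical, and it assumes a transcript and its recording share the same name apart from the extension, so files with misspelled names are reported as well.

```python
from pathlib import Path


def textgrids_without_audio(textgrid_dir, wav_dir):
    """Return the sorted names of TextGrids with no same-named .wav in wav_dir."""
    # Path.stem drops only the final extension, e.g. "0799_Niall_Rob".
    wav_stems = {p.stem for p in Path(wav_dir).glob("*.wav")}
    return sorted(
        p.name
        for p in Path(textgrid_dir).glob("*.TextGrid")
        if p.stem not in wav_stems
    )
```

The two directories may be the same if transcripts and audio are stored together.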
The following files have typos in their names, and should be renamed accordingly:
0579_james_austinDOUBLE.TextGrid (to 0579_james_austin.TextGrid)
0598John_lecture_1.TextGrid (to 0598_john_lecture_1.TextGrid)
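The renames listed above can be applied with a short script. This is a minimal sketch with hypothetical names (`RENAMES`, `apply_renames`); it assumes the second target keeps the .TextGrid extension, since renaming a transcript should not change its file type. Files that are not present are skipped, and an existing target is never overwritten.

```python
from pathlib import Path

# Old name -> corrected name, taken from the list above.
# Assumption: the corrected names keep the .TextGrid extension.
RENAMES = {
    "0579_james_austinDOUBLE.TextGrid": "0579_james_austin.TextGrid",
    "0598John_lecture_1.TextGrid": "0598_john_lecture_1.TextGrid",
}


def apply_renames(directory, renames=RENAMES):
    """Rename the listed files inside `directory`; return the (old, new) pairs applied."""
    directory = Path(directory)
    applied = []
    for old, new in renames.items():
        source, target = directory / old, directory / new
        if not source.exists():
            continue  # already renamed, or not in this directory
        if target.exists():
            raise FileExistsError(f"refusing to overwrite {target}")
        source.rename(target)
        applied.append((old, new))
    return applied
```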