This module contains classes to read and write corpora from the filesystem in a wide range of formats. They can also be used to convert between formats.


Get a mapping of all available downloaders.

Returns:A dictionary that maps the name of each downloader to its downloader class.
Return type:dict


>>> available_downloaders()
{
    "voxforge": ...
}

Get a mapping of all available readers.

Returns:A dictionary that maps the name of each reader to its reader class.
Return type:dict


>>> available_readers()
{
    "default": ...,
    "kaldi": ...
}

Get a mapping of all available writers.

Returns:A dictionary that maps the name of each writer to its writer class.
Return type:dict


>>> available_writers()
{
    "default": ...,
    "kaldi": ...
}

Create an instance of the downloader with the given name.

Parameters:type_name – The name of a downloader.
Returns:An instance of the downloader with the given type.

Create an instance of the reader with the given name.

Parameters:type_name – The name of a reader.
Returns:An instance of the reader with the given type.

Create an instance of the writer with the given name.

Parameters:type_name – The name of a writer.
Returns:An instance of the writer with the given type.
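
The available_* functions and their create_*_of_type counterparts describe a lookup from a type name to a class. The following is a minimal sketch of that pattern, not the library's code: ToyDefaultReader, ToyKaldiReader and the function bodies are hypothetical stand-ins, and raising ValueError for an unknown name is an assumption of this sketch.

```python
class ToyDefaultReader:
    """Stand-in for a reader class; not part of the library."""

    @classmethod
    def type(cls):
        return 'default'


class ToyKaldiReader:
    """Stand-in for a second reader class; not part of the library."""

    @classmethod
    def type(cls):
        return 'kaldi'


# Map the identifier returned by type() to the class itself.
_READERS = {cls.type(): cls for cls in (ToyDefaultReader, ToyKaldiReader)}


def available_readers():
    """Return a copy of the name-to-class mapping."""
    return dict(_READERS)


def create_reader_of_type(type_name):
    """Instantiate the reader registered under type_name."""
    readers = available_readers()
    if type_name not in readers:
        # Assumption of this sketch: unknown names are rejected explicitly.
        raise ValueError('Unknown reader type: {!r}'.format(type_name))
    return readers[type_name]()


reader = create_reader_of_type('kaldi')
print(sorted(available_readers()))  # ['default', 'kaldi']
print(type(reader).__name__)        # ToyKaldiReader
```

Keying the mapping by the value of type() keeps the name of a format defined in exactly one place, the class itself.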

Base Classes


Abstract class for downloading a corpus.

To implement a downloader for a custom format, programmers are expected to subclass this class and to implement all abstract methods. The documentation of each abstract method details the requirements that an implementation has to meet.


Performs the actual downloading of the corpus.

Parameters:target_path (str) – Path to a directory where the data should be saved to.

Downloads the data of the corpus and saves it to the given path. The data has to be saved in such a way that the corresponding CorpusReader can load the corpus.

Parameters:target_path (str) – The path to save the data to.
classmethod type()[source]

Returns a string that uniquely identifies the downloader. This is usually the name of the corpus, for example musan or timit. Users can use this string to obtain an instance of the desired downloader through create_downloader_of_type() or get a list of all built-in downloaders with available_downloaders().

Returns:Name of the downloader
Return type:str
class …(url, ark_type=<ArkType.AUTO: 3>, move_files_up=False)

Convenience base class for a downloader of a corpus that consists of a single archive.

Parameters:
  • url (str) – URL from which to download the archive.
  • ark_type (ArkType) – The type of the archive. If AUTO, the type is detected automatically.
  • move_files_up (bool) – If True, moves all files/folders from subfolders to the root folder.
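
To illustrate what the move_files_up description amounts to, here is a minimal sketch that lifts the content of every direct subfolder into the root folder. The helper flatten_one_level is hypothetical and not the library's implementation; it assumes that only one level of nesting is flattened and it does not handle name clashes.

```python
import os
import shutil
import tempfile


def flatten_one_level(root):
    """Move the content of each direct subfolder of root into root itself."""
    for name in sorted(os.listdir(root)):
        subfolder = os.path.join(root, name)
        if not os.path.isdir(subfolder):
            continue
        for child in os.listdir(subfolder):
            # Name clashes between subfolders are not handled in this sketch.
            shutil.move(os.path.join(subfolder, child), os.path.join(root, child))
        os.rmdir(subfolder)


with tempfile.TemporaryDirectory() as root:
    # Simulate an extracted archive that wraps everything in one top folder.
    os.makedirs(os.path.join(root, 'archive-v1', 'wav'))
    with open(os.path.join(root, 'archive-v1', 'README'), 'w') as f:
        f.write('demo')
    flatten_one_level(root)
    result = sorted(os.listdir(root))

print(result)  # ['README', 'wav']
```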

Performs the actual downloading of the corpus.

Parameters:target_path (str) – Path to a directory where the data should be saved to.

Downloads the data of the corpus and saves it to the given path. The data has to be saved in such a way that the corresponding CorpusReader can load the corpus.

Parameters:target_path (str) – The path to save the data to.
classmethod type()

Returns a string that uniquely identifies the downloader. This is usually the name of the corpus, for example musan or timit. Users can use this string to obtain an instance of the desired downloader through create_downloader_of_type() or get a list of all built-in downloaders with available_downloaders().

Returns:Name of the downloader
Return type:str

Abstract class for reading a corpus.

To implement a reader for a custom format, programmers are expected to subclass this class and to implement all abstract methods. The documentation of each abstract method details the requirements that an implementation has to meet.


Tests whether all files required to read the corpus successfully (such as annotations) are present. If files are missing, a list with the paths of the missing files is returned. All paths are relative to path. If no files are missing, an empty list is returned.

Parameters:path (str) – Path to the root directory of the data set
Returns:Paths of all the missing files, relative to the path of the root directory of the data set.
Return type:list

Performs the actual reading of the corpus.

Implementations do not have to call _check_for_missing_files() themselves. This is automatically done by load().

Parameters:path (str) – Path to a directory where the data set resides.
Returns:The loaded corpus
Return type:Corpus

Load and return the corpus from the given path.

Parameters:path (str) – Path to the data set to load.
Returns:The loaded corpus
Return type:Corpus
Raises:IOError – When the data set is invalid, for example because required files (annotations, …) are missing.
classmethod type()[source]

Returns a string that uniquely identifies the reader. This is usually the name of the corpus, for example musan or timit. Users can use this string to obtain an instance of the desired reader through create_reader_of_type() or get a list of all built-in readers with available_readers().

Returns:Name of the reader
Return type:str
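
The contract described above (load() checks for missing files, raises IOError if any are reported, and only then delegates to _load()) can be sketched with a stand-in base class. SketchReader and TranscriptReader are hypothetical and do not derive from the library's classes; a plain dict takes the place of the Corpus object, and the transcripts.txt layout is invented for the example.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class SketchReader(ABC):
    """Stand-in base class mirroring the contract described above."""

    @classmethod
    @abstractmethod
    def type(cls):
        """Return the unique name of the reader."""

    @abstractmethod
    def _check_for_missing_files(self, path):
        """Return the missing files, relative to path."""

    @abstractmethod
    def _load(self, path):
        """Read the data; only called when no files are missing."""

    def load(self, path):
        missing = self._check_for_missing_files(path)
        if missing:
            raise IOError('Missing files: {}'.format(', '.join(missing)))
        return self._load(path)


class TranscriptReader(SketchReader):
    """Hypothetical format: a folder with a single transcripts.txt file."""

    REQUIRED = ['transcripts.txt']

    @classmethod
    def type(cls):
        return 'transcript-demo'

    def _check_for_missing_files(self, path):
        return [f for f in self.REQUIRED
                if not os.path.isfile(os.path.join(path, f))]

    def _load(self, path):
        # A dict stands in for the Corpus object of the real library.
        with open(os.path.join(path, 'transcripts.txt')) as f:
            return dict(line.rstrip('\n').split(' ', 1) for line in f if line.strip())


with tempfile.TemporaryDirectory() as root:
    try:
        TranscriptReader().load(root)
    except IOError as error:
        print(error)  # Missing files: transcripts.txt

    with open(os.path.join(root, 'transcripts.txt'), 'w') as f:
        f.write('utt-1 hello world\nutt-2 good morning\n')
    corpus = TranscriptReader().load(root)

print(corpus)  # {'utt-1': 'hello world', 'utt-2': 'good morning'}
```

Because the check lives in load(), a subclass only has to report what is missing and never has to raise the error itself.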

Abstract class for writing a corpus.

To implement a writer for a custom format, programmers are expected to subclass this class and to implement all abstract methods. The documentation of each abstract method details the requirements that an implementation has to meet.

_save(corpus, path)[source]

Writes the corpus to disk to the given path.

Parameters:
  • corpus (Corpus) – Corpus to write to disk.
  • path (str) – Path of the target directory.
save(corpus, path)[source]

Save the dataset at the given path.

Parameters:
  • corpus (Corpus) – The corpus to save.
  • path (str) – Path to save the corpus to.
classmethod type()[source]

Returns a string that uniquely identifies the writer. This is usually the name of the format, for example default or kaldi. Users can use this string to obtain an instance of the desired writer through create_writer_of_type() or get a list of all built-in writers with available_writers().

Returns:Name of the writer
Return type:str


Support for Reading and Writing by Format (x = supported, - = not supported)

Format                           Download  Read  Write
Acoustic Event Dataset           x         x     -
AudioMNIST                       x         x     -
Broadcast                        -         x     -
Common Voice                     -         x     -
Default                          -         x     x
ESC-50                           x         x     -
Free-Spoken-Digit-Dataset        x         x     -
Folder                           -         x     -
Fluent Speech Commands Dataset   -         x     -
Google Speech Commands           -         x     -
GTZAN                            x         x     -
Kaldi                            -         x     x
Mozilla DeepSpeech               -         -     x
MUSAN                            x         x     -
M-AILABS Speech Dataset          x         x     -
LITIS Rouen Audio scene dataset  x         x     -
Tatoeba                          x         x     -
TIMIT                            -         x     -
TUDA German Distant Speech       -         x     -
Urbansound8k                     -         x     -
VoxForge                         x         x     -
Wav2Letter                       -         -     x

Acoustic Event Dataset


Downloader for the Acoustic Event Dataset.

classmethod type()[source]

Returns a string that uniquely identifies the downloader. This is usually the name of the corpus, for example musan or timit. Users can use this string to obtain an instance of the desired downloader through create_downloader_of_type() or get a list of all built-in downloaders with available_downloaders().

Returns:Name of the downloader
Return type:str

Reader for the Acoustic Event Dataset.

See also

Download page
classmethod type()[source]

Returns a string that uniquely identifies the reader. This is usually the name of the corpus, for example musan or timit. Users can use this string to obtain an instance of the desired reader through create_reader_of_type() or get a list of all built-in readers with available_readers().

Returns:Name of the reader
Return type:str



AudioMNIST


Downloader for the AudioMNIST dataset.

Parameters:url (str) – The url to download the dataset from. If not given the default URL is used. It is expected to be a zip file.
classmethod type()[source]

Returns a string that uniquely identifies the downloader. This is usually the name of the corpus, for example musan or timit. Users can use this string to obtain an instance of the desired downloader through create_downloader_of_type() or get a list of all built-in downloaders with available_downloaders().

Returns:Name of the downloader
Return type:str

Reader for the audioMNIST Corpus.

See also

Download page
classmethod type()[source]

Returns a string that uniquely identifies the reader. This is usually the name of the corpus, for example musan or timit. Users can use this string to obtain an instance of the desired reader through create_reader_of_type() or get a list of all built-in readers with available_readers().

Returns:Name of the reader
Return type:str



Broadcast


Reads corpora in the Broadcast format.

classmethod type()[source]

Returns a string that uniquely identifies the reader. This is usually the name of the corpus, for example musan or timit. Users can use this string to obtain an instance of the desired reader through create_reader_of_type() or get a list of all built-in readers with available_readers().

Returns:Name of the reader
Return type:str



Common Voice


Reader for the Common Voice Corpus.

See also

Project Page
static create_assets_if_needed(corpus, path, entry)[source]

Create File/Utterance/Issuer if they do not already exist, and return the utt-idx.

static get_subset_ids(path)[source]

Return a list with ids of all available subsets (based on existing csv-files).

static load_subset(corpus, path, subset_idx)[source]

Load subset into corpus.

static map_age(age)[source]

Map age to correct age-group.

static map_gender(gender)[source]

Map gender to correct value.

classmethod type()[source]

Returns a string that uniquely identifies the reader. This is usually the name of the corpus, for example musan or timit. Users can use this string to obtain an instance of the desired reader through create_reader_of_type() or get a list of all built-in readers with available_readers().

Returns:Name of the reader
Return type:str



Default


Reads corpora in the Default format.

classmethod type()[source]

Returns a string that uniquely identifies the reader. This is usually the name of the corpus, for example musan or timit. Users can use this string to obtain an instance of the desired reader through create_reader_of_type() or get a list of all built-in readers with available_readers().

Returns:Name of the reader
Return type:str

Writes corpora in the Default format.

classmethod type()[source]

Returns a string that uniquely identifies the writer. This is usually the name of the format, for example default or kaldi. Users can use this string to obtain an instance of the desired writer through create_writer_of_type() or get a list of all built-in writers with available_writers().

Returns:Name of the writer
Return type:str



ESC-50


Downloader for the ESC-50 dataset.

Parameters:url (str) – The url to download the dataset from. If not given the default URL is used.
classmethod type()[source]

Returns a string that uniquely identifies the downloader. This is usually the name of the corpus, for example musan or timit. Users can use this string to obtain an instance of the desired downloader through create_downloader_of_type() or get a list of all built-in downloaders with available_downloaders().

Returns:Name of the downloader
Return type:str

Reader for the ESC-50 dataset (Environmental Sound Classification).

See also

Download page
classmethod type()[source]

Returns a string that uniquely identifies the reader. This is usually the name of the corpus, for example musan or timit. Users can use this string to obtain an instance of the desired reader through create_reader_of_type() or get a list of all built-in readers with available_readers().

Returns:Name of the reader
Return type:str



Folder


Loads all WAV files from the given folder and creates a corpus from them.

classmethod type()[source]

Returns a string that uniquely identifies the reader. This is usually the name of the corpus, for example musan or timit. Users can use this string to obtain an instance of the desired reader through create_reader_of_type() or get a list of all built-in readers with available_readers().

Returns:Name of the reader
Return type:str



Free-Spoken-Digit-Dataset


Downloader for the Free-Spoken-Digit dataset.

Parameters:url (str) – The url to download the dataset from. If not given the default URL is used. It is expected to be a zip file.
classmethod type()[source]

Returns a string that uniquely identifies the downloader. This is usually the name of the corpus, for example musan or timit. Users can use this string to obtain an instance of the desired downloader through create_downloader_of_type() or get a list of all built-in downloaders with available_downloaders().

Returns:Name of the downloader
Return type:str

Reader for the Free-Spoken-Digit Corpus.

See also

Download page
classmethod type()[source]

Returns a string that uniquely identifies the reader. This is usually the name of the corpus, for example musan or timit. Users can use this string to obtain an instance of the desired reader through create_reader_of_type() or get a list of all built-in readers with available_readers().

Returns:Name of the reader
Return type:str

Fluent Speech Commands Dataset


Reader for the Fluent Speech Commands Dataset.

See also

Fluent Speech Commands Dataset
Download page
classmethod type()[source]

Returns a string that uniquely identifies the reader. This is usually the name of the corpus, for example musan or timit. Users can use this string to obtain an instance of the desired reader through create_reader_of_type() or get a list of all built-in readers with available_readers().

Returns:Name of the reader
Return type:str

Google Speech Commands


Reads the Google Speech Commands dataset.

See also

Launching Speech Commands DS
Blog-Entry on the release of the speech commands dataset.
classmethod type()[source]

Returns a string that uniquely identifies the reader. This is usually the name of the corpus, for example musan or timit. Users can use this string to obtain an instance of the desired reader through create_reader_of_type() or get a list of all built-in readers with available_readers().

Returns:Name of the reader
Return type:str



GTZAN


Downloader for the GTZAN Corpus.

Parameters:url (str) – The url to download the dataset from. If not given the default URL is used. It is expected to be a tar.gz file.
classmethod type()[source]

Returns a string that uniquely identifies the downloader. This is usually the name of the corpus, for example musan or timit. Users can use this string to obtain an instance of the desired downloader through create_downloader_of_type() or get a list of all built-in downloaders with available_downloaders().

Returns:Name of the downloader
Return type:str

Reader for the GTZAN music/speech corpus. The corpus consists of 64 music and 64 speech tracks that are each 30 seconds long. The WAV files are 16-bit mono and have a sampling rate of 22050 Hz.
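
The figures above give a rough idea of the total duration and size of the corpus. The following back-of-the-envelope calculation assumes that every track is exactly 30 seconds long and ignores WAV headers, so the result is an estimate only.

```python
tracks = 64 + 64           # music + speech
seconds_per_track = 30
sample_rate = 22050        # Hz
bytes_per_sample = 2       # 16 bit
channels = 1               # mono

total_minutes = tracks * seconds_per_track / 60
bytes_per_track = sample_rate * bytes_per_sample * channels * seconds_per_track
total_mib = tracks * bytes_per_track / 2 ** 20

print(total_minutes)          # 64.0
print(bytes_per_track)        # 1323000
print(round(total_mib, 1))    # 161.5
```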

See also

Download page
classmethod type()[source]

Returns a string that uniquely identifies the reader. This is usually the name of the corpus, for example musan or timit. Users can use this string to obtain an instance of the desired reader through create_reader_of_type() or get a list of all built-in readers with available_readers().

Returns:Name of the reader
Return type:str


Kaldi


class …(main_label_list_idx='word-transcript', main_feature_idx='default')

Supports reading data sets in Kaldi format.

See also

Kaldi: Data preparation
Describes how a data set has to be structured to be understood by Kaldi and the format of the individual files.
classmethod type()[source]

Returns a string that uniquely identifies the reader. This is usually the name of the corpus, for example musan or timit. Users can use this string to obtain an instance of the desired reader through create_reader_of_type() or get a list of all built-in readers with available_readers().

Returns:Name of the reader
Return type:str
class …(main_label_list_idx='word-transcript', main_feature_idx='default', use_utt_idx_if_no_speaker_available=True, create_spk2gender=False, default_gender='m', prefix_utterances_with_speaker=True, use_absolute_times=False)

Supports writing data sets in Kaldi format.

Parameters:
  • main_label_list_idx (str) – The idx of the label-list to use for writing the transcriptions file.
  • main_feature_idx (str) – The idx of the feature-container to export.
  • use_utt_idx_if_no_speaker_available (bool) – If True, the utterance-idx is used as speaker-idx in the utt2spk file if no speaker exists for an utterance.
  • create_spk2gender (bool) – If True, creates the file spk2gender.
  • default_gender (str) – If create_spk2gender==True and the gender of an issuer is not known, this default value is used (default ‘m’).
  • prefix_utterances_with_speaker (bool) – If True, adds a prefix in the form of the issuer-idx to every utterance.
  • use_absolute_times (bool) – If True, doesn’t use -1 for segment ends, but reads the audio to get the absolute duration.
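
As an illustration of how the two speaker-related options could affect the utt2spk file, here is a minimal sketch. Kaldi's utt2spk file holds one "<utterance-id> <speaker-id>" pair per line; everything else is an assumption of the sketch and not taken from the writer: the helper utt2spk_lines is hypothetical, the "-" separator is a guess, and skipping utterances without a speaker when the fallback is disabled is a guess as well.

```python
def utt2spk_lines(utt_to_speaker,
                  use_utt_idx_if_no_speaker_available=True,
                  prefix_utterances_with_speaker=True):
    """Build the lines of a Kaldi utt2spk file ("<utterance-id> <speaker-id>")."""
    lines = []
    for utt_idx, speaker_idx in utt_to_speaker.items():
        if speaker_idx is None:
            if not use_utt_idx_if_no_speaker_available:
                continue  # assumption: utterances without a speaker are skipped
            speaker_idx = utt_idx
        out_idx = utt_idx
        if prefix_utterances_with_speaker and not utt_idx.startswith(speaker_idx):
            # Assumed separator; the writer may join the parts differently.
            out_idx = '{}-{}'.format(speaker_idx, utt_idx)
        lines.append('{} {}'.format(out_idx, speaker_idx))
    return sorted(lines)


print(utt2spk_lines({'utt-2': 'anna', 'utt-1': 'bob', 'utt-3': None}))
# ['anna-utt-2 anna', 'bob-utt-1 bob', 'utt-3 utt-3']
```

Prefixing utterance ids with the speaker id keeps the utterances of one speaker adjacent after sorting, which is the ordering Kaldi's data preparation guide recommends.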

See also

Kaldi: Data preparation
Describes how a data set has to be structured to be understood by Kaldi and the format of the individual files.
static extended_filename(file_track)[source]

Create extended filename. Kaldi only supports wav. Therefore other files have to be converted using sox.

static feature_scp_generator(path)[source]

Return a generator over all feature matrices defined in a scp.

static read_float_matrix(rx_specifier)[source]

Return float matrix as np array for the given rx specifier.

classmethod type()[source]

Returns a string that uniquely identifies the writer. This is usually the name of the format, for example default or kaldi. Users can use this string to obtain an instance of the desired writer through create_writer_of_type() or get a list of all built-in writers with available_writers().

Returns:Name of the writer
Return type:str
static write_float_matrices(scp_path, ark_path, matrices)[source]

Write the given dict matrices (utt-id/float ndarray) to the given scp and ark files.

Mozilla DeepSpeech


Writes files to use for training with Mozilla DeepSpeech.

Since every utterance is expected to be in a separate file, any utterances that are not in a separate file in the original corpus are extracted into a separate file in the subfolder audio of the target path.

Parameters:transcription_label_list_idx (str) – The transcriptions are used from the label-list with this id.
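
For orientation, DeepSpeech's training tools read CSV files with the columns wav_filename, wav_filesize and transcript. The sketch below renders such a file with the standard library; the helper deepspeech_csv, the file names and the sizes are made up for illustration, and whether this writer emits exactly this layout is not stated here.

```python
import csv
import io


def deepspeech_csv(rows):
    """Render (wav_filename, wav_filesize, transcript) rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(['wav_filename', 'wav_filesize', 'transcript'])
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical utterances that were extracted into the audio subfolder.
csv_text = deepspeech_csv([
    ('audio/utt-1.wav', 88244, 'hello world'),
    ('audio/utt-2.wav', 132344, 'good morning'),
])
print(csv_text)
```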
classmethod type()[source]

Returns a string that uniquely identifies the writer. This is usually the name of the format, for example default or kaldi. Users can use this string to obtain an instance of the desired writer through create_writer_of_type() or get a list of all built-in writers with available_writers().

Returns:Name of the writer
Return type:str



MUSAN


Downloader for the MUSAN Corpus.

Parameters:url (str) – The url to download the dataset from. If not given the default URL is used. It is expected to be a tar.gz file.
classmethod type()[source]

Returns a string that uniquely identifies the downloader. This is usually the name of the corpus, for example musan or timit. Users can use this string to obtain an instance of the desired downloader through create_downloader_of_type() or get a list of all built-in downloaders with available_downloaders().

Returns:Name of the downloader
Return type:str

Reader for the MUSAN corpus. MUSAN is a corpus of music, speech, and noise recordings.

See also

MUSAN: A Music, Speech, and Noise Corpus
Paper explaining the structure and characteristics of the corpus
Download page
classmethod type()[source]

Returns a string that uniquely identifies the reader. This is usually the name of the corpus, for example musan or timit. Users can use this string to obtain an instance of the desired reader through create_reader_of_type() or get a list of all built-in readers with available_readers().

Returns:Name of the reader
Return type:str

M-AILABS Speech Dataset


Downloader for the M-AILABS Speech Dataset.

Parameters:tags (list) – List of tags for different parts to download. Corresponds to the tags in the Statistics & Download Links on the webpage. If None, all parts are downloaded.
classmethod type()[source]

Returns a string that uniquely identifies the downloader. This is usually the name of the corpus, for example musan or timit. Users can use this string to obtain an instance of the desired downloader through create_downloader_of_type() or get a list of all built-in downloaders with available_downloaders().

Returns:Name of the downloader
Return type:str

Reader for the M-AILABS Speech Dataset.

See also

M-AILABS Speech Dataset
Project Page
static get_folders(path)[source]

Return a list of all subfolder-paths in the given path.

static load_books_of_speaker(corpus, path, speaker)[source]

Load all utterances for the speaker at the given path.

static load_speaker(corpus, path)[source]

Create a speaker instance for the given path.

static load_tag(corpus, path)[source]

Iterate over all speakers and load them. Collect all utterance-idx and create a subset of them.

classmethod type()[source]

Returns a string that uniquely identifies the reader. This is usually the name of the corpus, for example musan or timit. Users can use this string to obtain an instance of the desired reader through create_reader_of_type() or get a list of all built-in readers with available_readers().

Returns:Name of the reader
Return type:str

LITIS Rouen Audio scene dataset


Downloader for the LITIS Rouen Audio scene dataset.

classmethod type()[source]

Returns a string that uniquely identifies the downloader. This is usually the name of the corpus, for example musan or timit. Users can use this string to obtain an instance of the desired downloader through create_downloader_of_type() or get a list of all built-in downloaders with available_downloaders().

Returns:Name of the downloader
Return type:str

Reader for the LITIS Rouen Audio scene dataset.

See also

Download page
classmethod type()[source]

Returns a string that uniquely identifies the reader. This is usually the name of the corpus, for example musan or timit. Users can use this string to obtain an instance of the desired reader through create_reader_of_type() or get a list of all built-in readers with available_readers().

Returns:Name of the reader
Return type:str


Tatoeba


class …(include_languages=None, include_licenses=None, include_empty_licence=False)

Downloader for audio files from the tatoeba platform.

Parameters:
  • include_languages (list) – List of languages to download. If None, all are downloaded.
  • include_licenses (list) – Sentences are downloaded only if their license is in this list. If None, all licenses are included.
  • include_empty_licence (bool) – Sentences with an empty license are not meant to be reused. If False, these sentences are ignored.
classmethod type()[source]

Returns a string that uniquely identifies the downloader. This is usually the name of the corpus, for example musan or timit. Users can use this string to obtain an instance of the desired downloader through create_downloader_of_type() or get a list of all built-in downloaders with available_downloaders().

Returns:Name of the downloader
Return type:str

Reader for audio data downloaded with the Tatoeba downloader.

classmethod type()[source]

Returns a string that uniquely identifies the reader. This is usually the name of the corpus, for example musan or timit. Users can use this string to obtain an instance of the desired reader through create_reader_of_type() or get a list of all built-in readers with available_readers().

Returns:Name of the reader
Return type:str

TIMIT DARPA Acoustic-Phonetic Continuous Speech Corpus


Reader for the TIMIT Corpus.

See also

Download page
classmethod type()[source]

Returns a string that uniquely identifies the reader. This is usually the name of the corpus, for example musan or timit. Users can use this string to obtain an instance of the desired reader through create_reader_of_type() or get a list of all built-in readers with available_readers().

Returns:Name of the reader
Return type:str

TUDA German Distant Speech


Reader for the TUDA German Distant Speech corpus (german-speechdata-package-v2.tar.gz).


It only loads files ending in -beamformedSignal.wav.

static get_ids_from_folder(path, part_name)[source]

Return all ids from the given folder, which have a corresponding beamformedSignal file.
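
A minimal sketch of such a lookup, assuming that the id of a recording is the file name with the -beamformedSignal.wav suffix removed. The helper ids_with_beamformed_signal and the file names in the demo are hypothetical; the reader may derive ids differently.

```python
import os
import tempfile

SUFFIX = '-beamformedSignal.wav'


def ids_with_beamformed_signal(folder):
    """Return the sorted ids of all recordings that have a beamformed file."""
    return sorted(
        name[:-len(SUFFIX)]
        for name in os.listdir(folder)
        if name.endswith(SUFFIX)
    )


with tempfile.TemporaryDirectory() as folder:
    # Made-up file names; only two recordings have a beamformed signal.
    for name in ('rec-a-beamformedSignal.wav', 'rec-a_Kinect-RAW.wav',
                 'rec-b_Kinect-RAW.wav', 'rec-c-beamformedSignal.wav'):
        open(os.path.join(folder, name), 'w').close()
    ids = ids_with_beamformed_signal(folder)

print(ids)  # ['rec-a', 'rec-c']
```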

static load_file(folder_path, idx, corpus)[source]

Load speaker, file, utterance, labels for the file with the given id.

classmethod type()[source]

Returns a string that uniquely identifies the reader. This is usually the name of the corpus, for example musan or timit. Users can use this string to obtain an instance of the desired reader through create_reader_of_type() or get a list of all built-in readers with available_readers().

Returns:Name of the reader
Return type:str



Urbansound8k


Reader for the Urbansound8k dataset.

See also

Download page
classmethod type()[source]

Returns a string that uniquely identifies the reader. This is usually the name of the corpus, for example musan or timit. Users can use this string to obtain an instance of the desired reader through create_reader_of_type() or get a list of all built-in readers with available_readers().

Returns:Name of the reader
Return type:str


VoxForge


class …(lang='de', url=None)

Downloader for audio files from VoxForge. All .tgz files that are linked from the given URL are downloaded and extracted.

Parameters:
  • lang (str) – If no URL is given, the predefined URL for the given language is used, if one is defined.
  • url (str) – The URL to check for available .tgz files.
static available_files(url)[source]

Extract and return urls for all available .tgz files.
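
One way to collect such links is to parse the anchors of the index page. The sketch below uses only the standard library; TgzLinkParser, tgz_urls and the sample page are hypothetical, and the downloader itself may work differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TgzLinkParser(HTMLParser):
    """Collect the absolute URLs of all links that end in .tgz."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        if href and href.endswith('.tgz'):
            # Resolve relative links against the URL of the index page.
            self.urls.append(urljoin(self.base_url, href))


def tgz_urls(html, base_url):
    parser = TgzLinkParser(base_url)
    parser.feed(html)
    return parser.urls


page = '<a href="anonymous-20080101-abc.tgz">a</a> <a href="../">up</a>'
print(tgz_urls(page, 'http://example.org/files/'))
# ['http://example.org/files/anonymous-20080101-abc.tgz']
```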

static download_files(file_urls, target_path)[source]

Download all files and store to the given path.

static extract_files(file_paths, target_path)[source]

Unpack all files to the given path.

classmethod type()[source]

Returns a string that uniquely identifies the downloader. This is usually the name of the corpus, for example musan or timit. Users can use this string to obtain an instance of the desired downloader through create_downloader_of_type() or get a list of all built-in downloaders with available_downloaders().

Returns:Name of the downloader
Return type:str

Reader for collections of voxforge audio data. The reader expects extracted .tgz files in the given folder.

See also
Download page
static data_folders(path)[source]

Generator which yields a list of valid data directories (corresponds to the content of one .tgz).

static parse_prompts(etc_folder)[source]

Read prompts and prompts-original and return them as a dictionary (id as key).

static parse_speaker_info(readme_path)[source]

Parse speaker info and return tuple (idx, gender).

classmethod type()[source]

Returns a string that uniquely identifies the reader. This is usually the name of the corpus, for example musan or timit. Users can use this string to obtain an instance of the desired reader through create_reader_of_type() or get a list of all built-in readers with available_readers().

Returns:Name of the reader
Return type:str



Wav2Letter


Writes files to use for training/testing/decoding with wav2letter.

Since every utterance is expected to be in a separate file, any utterances that are not in a separate file in the original corpus are extracted into a separate file in the subfolder audio of the target path.

Parameters:transcription_label_list_idx (str) – The transcriptions are used from the label-list with this id.
classmethod type()[source]

Returns a string that uniquely identifies the writer. This is usually the name of the format, for example default or kaldi. Users can use this string to obtain an instance of the desired writer through create_writer_of_type() or get a list of all built-in writers with available_writers().

Returns:Name of the writer
Return type:str