audiomate.corpus

This module contains all parts needed for using a corpus. Aside from the main corpus class audiomate.Corpus, it provides the loaders in audiomate.corpus.io and the assets used in a corpus in audiomate.corpus.assets.

CorpusView

class audiomate.corpus.CorpusView[source]

This class defines the basic interface of a corpus. It is not meant to be instantiated directly. It only describes the methods for accessing data of the corpus.

Notes

All paths to files should be held as absolute paths in memory.

all_label_values(label_list_ids=None)[source]

Return a set of all label-values occurring in this corpus.

Parameters:label_list_ids (list) – If not None, only labels from label-lists with an id contained in this list are considered.
Returns:A set of distinct label-values.
Return type:set
feature_containers

Return the feature-containers in the corpus.

Returns:A dictionary containing audiomate.corpus.assets.FeatureContainer objects with the feature-idx as key.
Return type:dict
files

Return the files in the corpus.

Returns:A dictionary containing audiomate.corpus.assets.File objects with the file-idx as key.
Return type:dict
issuers

Return the issuers in the corpus.

Returns:A dictionary containing audiomate.corpus.assets.Issuer objects with the issuer-idx as key.
Return type:dict
label_count(label_list_ids=None)[source]

Return a dictionary containing the number of times each label-value occurs in this corpus.

Parameters:label_list_ids (list) – If not None, only labels from label-lists with an id contained in this list are considered.
Returns:A dictionary containing the number of occurrences with the label-value as key.
Return type:dict
label_durations(label_list_ids=None)[source]

Return a dictionary containing the total duration of each label-value occurring in this corpus.

Parameters:label_list_ids (list) – If not None, only labels from label-lists with an id contained in this list are considered.
Returns:A dictionary containing the total duration with the label-value as key.
Return type:dict
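
The counting and summing described for label_count and label_durations can be sketched without the library. The following is a minimal illustration, not audiomate's implementation: the nested dictionary layout and the helper iter_labels are assumptions made for this example, and only the label_list_ids filter and the shape of the returned dictionaries follow the descriptions above.

```python
from collections import Counter, defaultdict

# Stand-in data, not audiomate objects: each utterance maps a label-list id
# to a list of (value, start, end) labels, with times in seconds.
utterances = {
    "utt-1": {
        "words": [("hello", 0.0, 0.5), ("world", 0.5, 1.25)],
        "domain": [("speech", 0.0, 1.25)],
    },
    "utt-2": {
        "words": [("hello", 0.0, 0.75)],
        "domain": [("music", 0.0, 2.0)],
    },
}


def iter_labels(utterances, label_list_ids=None):
    """Yield every label, optionally restricted to the given label-list ids."""
    for label_lists in utterances.values():
        for list_id, labels in label_lists.items():
            if label_list_ids is None or list_id in label_list_ids:
                yield from labels


def label_count(utterances, label_list_ids=None):
    """Count how often each label-value occurs."""
    values = (value for value, _, _ in iter_labels(utterances, label_list_ids))
    return dict(Counter(values))


def label_durations(utterances, label_list_ids=None):
    """Sum the duration (end - start) of each label-value."""
    durations = defaultdict(float)
    for value, start, end in iter_labels(utterances, label_list_ids):
        durations[value] += end - start
    return dict(durations)


# Restrict to the hypothetical "words" label-list, then count everything.
word_counts = label_count(utterances, label_list_ids=["words"])
word_durations = label_durations(utterances, label_list_ids=["words"])
all_counts = label_count(utterances)
```

Passing label_list_ids=None considers labels from every label-list, which mirrors the default of the documented methods.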
name

Return the name of the dataset (equals the basename of the path, if the path is not None).

num_feature_containers

Return the number of feature-containers in the corpus.

num_files

Return the number of files in the corpus.

num_issuers

Return the number of issuers in the corpus.

num_subviews

Return the number of subviews in the corpus.

num_utterances

Return the number of utterances in the corpus.

stats()[source]

Return statistics calculated over all samples of all utterances in the corpus.

Returns:A DataStats object containing statistics over all samples in the corpus.
Return type:DataStats
stats_per_utterance()[source]

Return statistics calculated for all samples of each utterance in the corpus.

Returns:A dictionary containing a DataStats object for each utterance.
Return type:dict
subviews

Return the subviews of the corpus.

Returns:A dictionary containing audiomate.corpus.Subview objects with the subview-idx as key.
Return type:dict
total_duration

Return the total amount of audio summed over all utterances in the corpus in seconds.

utterances

Return the utterances in the corpus.

Returns:A dictionary containing audiomate.corpus.assets.Utterance objects with the utterance-idx as key.
Return type:dict

Corpus

class audiomate.corpus.Corpus(path=None)[source]

The Corpus class represents a single corpus. It extends audiomate.corpus.CorpusView with the functionality for loading and saving. Furthermore it provides the functionality for adding/modifying assets of the corpus like files and utterances.

Parameters:path (str) – Path where the corpus is stored. (Optional)
feature_containers

Return the feature-containers in the corpus.

Returns:A dictionary containing audiomate.corpus.assets.FeatureContainer objects with the feature-idx as key.
Return type:dict
files

Return the files in the corpus.

Returns:A dictionary containing audiomate.corpus.assets.File objects with the file-idx as key.
Return type:dict
classmethod from_corpus(corpus)[source]

Create a new modifiable corpus from any other CorpusView. This can be used, for example, to create an independent modifiable corpus from a subview.

Parameters:corpus (CorpusView) – The corpus to create a copy from.
Returns:A new corpus with the same data as the given one.
Return type:Corpus
import_files(files)[source]

Add the given files/file to the corpus. If any of the given file-ids already exists, a suffix is appended so it is unique.

Parameters:files (list) – Either a list of or a single audiomate.corpus.assets.File.
Returns:
A dictionary containing file idx mappings (old-file-idx/file-instance).
If a file is imported whose id already exists, this mapping can be used to look up the new id.
Return type:dict
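
The collision handling and the returned mapping can be illustrated with a small stand-in. This is a sketch under stated assumptions, not the library's code: the File class below is a placeholder rather than audiomate.corpus.assets.File, the "_1", "_2" suffix format is assumed (the text above only says that a suffix is appended), and the sketch stores a copy of each file, which the documentation does not specify.

```python
class File:
    """Stand-in for an asset with an ``idx`` attribute (not audiomate's File)."""

    def __init__(self, idx, path):
        self.idx = idx
        self.path = path


def import_files(existing, files):
    """Add files to ``existing`` (dict idx -> File), renaming on collisions.

    Returns a dict mapping each original idx to the imported instance.
    """
    if isinstance(files, File):
        files = [files]  # accept a single file as well as a list

    mapping = {}
    for file in files:
        old_idx = file.idx
        new_idx = old_idx
        counter = 1
        # Assumed scheme: append "_1", "_2", ... until the id is free.
        while new_idx in existing:
            new_idx = "{}_{}".format(old_idx, counter)
            counter += 1
        imported = File(new_idx, file.path)  # the sketch stores a copy
        existing[new_idx] = imported
        mapping[old_idx] = imported
    return mapping


corpus_files = {"a": File("a", "/data/a.wav")}
mapping = import_files(
    corpus_files,
    [File("a", "/other/a.wav"), File("b", "/other/b.wav")],
)

# The mapping is keyed by the old id, so the new id can be looked up.
new_id_of_a = mapping["a"].idx
```

Looking up mapping[old_idx].idx is how a caller would find out whether an id was changed during the import.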
import_issuers(issuers)[source]

Add the given issuers/issuer to the corpus. If any of the given issuer-ids already exists, a suffix is appended so it is unique.

Parameters:issuers (list) – Either a list of or a single audiomate.corpus.assets.Issuer.
Returns:
A dictionary containing issuer idx mappings (old-issuer-idx/issuer-instance).
If an issuer is imported whose id already exists, this mapping can be used to look up the new id.
Return type:dict
import_subview(idx, subview)[source]

Add the given subview to the corpus.

Parameters:
  • idx (str) – An idx that is unique in the corpus for identifying the subview. If a subview with the given id already exists, it is overwritten.
  • subview (Subview) – The subview to add.
import_utterances(utterances)[source]

Add the given utterances/utterance to the corpus. If any of the given utterance-ids already exists, a suffix is appended so it is unique.

Parameters:utterances (list) – Either a list of or a single audiomate.corpus.assets.Utterance.
Returns:
A dictionary containing utterance idx mappings (old-utterance-idx/utterance-instance).
If an utterance is imported whose id already exists, this mapping can be used to look up the new id.
Return type:dict
issuers

Return the issuers in the corpus.

Returns:A dictionary containing audiomate.corpus.assets.Issuer objects with the issuer-idx as key.
Return type:dict
classmethod load(path, reader=None)[source]

Load the corpus from the given path, using the given reader. If no reader is given, audiomate.corpus.io.DefaultReader is used.

Parameters:
  • path (str) – Path to load the corpus from.
  • reader (str, CorpusReader) – The reader or the name of the reader to use.
Returns:The loaded corpus.
Return type:Corpus
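
Accepting either a reader instance or the name of a reader, with a default when none is given, is a common pattern. The sketch below shows one way it could work. Everything in it is hypothetical: the classes are stand-ins that only share names with the library's, and the READERS registry, the resolve_reader helper and the reader names "default" and "other" are invented for illustration.

```python
class CorpusReader:
    """Stand-in base class for readers (not audiomate's CorpusReader)."""

    def load(self, path):
        raise NotImplementedError


class DefaultReader(CorpusReader):
    def load(self, path):
        return {"path": path, "format": "default"}


class OtherReader(CorpusReader):
    def load(self, path):
        return {"path": path, "format": "other"}


# Hypothetical registry mapping reader names to reader classes.
READERS = {"default": DefaultReader, "other": OtherReader}


def resolve_reader(reader=None):
    """Turn None, a reader name, or a reader instance into a reader instance."""
    if reader is None:
        return DefaultReader()
    if isinstance(reader, str):
        if reader not in READERS:
            raise ValueError("Unknown reader: {}".format(reader))
        return READERS[reader]()
    return reader


def load(path, reader=None):
    """Load from ``path`` with the resolved reader."""
    return resolve_reader(reader).load(path)


by_default = load("/data/corpus")
by_name = load("/data/corpus", reader="other")
by_instance = load("/data/corpus", reader=OtherReader())
```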

classmethod merge_corpora(corpora)[source]

Merge a list of corpora into one.

Parameters:corpora (Iterable) – An iterable of audiomate.corpus.CorpusView.
Returns:A corpus with the data from all given corpora merged into one.
Return type:Corpus
merge_corpus(corpus)[source]

Merge the given corpus into this corpus. All assets (files, utterances, issuers, …) are copied into this corpus. If any ids (utt-idx, file-idx, issuer-idx, subview-idx, …) occur in both corpora, the ids from the corpus being merged are suffixed with a number (starting from 1 and increasing until the id is unique).

Parameters:corpus (CorpusView) – The corpus to merge.
name

Return the name of the dataset (equals the basename of the path, if the path is not None).

new_feature_container(idx, path=None)[source]

Add a new feature container with the given data.

Parameters:
  • idx (str) – A unique identifier within the dataset.
  • path (str) – The path to store the feature file. If None a default path is used.
Returns:The newly added feature-container.
Return type:FeatureContainer

new_file(path, file_idx, copy_file=False)[source]

Adds a new file to the corpus with the given data.

Parameters:
  • path (str) – Path of the file to add.
  • file_idx (str) – The id to associate the file with.
  • copy_file (bool) – If True the file is copied to the data set folder, otherwise the given path is used directly.
Returns:The newly added File.
Return type:File

new_issuer(issuer_idx, info=None)[source]

Add a new issuer to the dataset with the given data.

Parameters:
  • issuer_idx (str) – The id to associate the issuer with. If it is None or already exists, a new one is generated.
  • info (dict, list) – Additional info of the issuer.
Returns:The newly added issuer.
Return type:Issuer

new_utterance(utterance_idx, file_idx, issuer_idx=None, start=0, end=-1)[source]

Add a new utterance to the corpus with the given data.

Parameters:
  • utterance_idx (str) – The id to associate with the utterance. If it is None or already exists, a new one is generated.
  • file_idx (str) – The id of the file the utterance is in.
  • issuer_idx (str) – The issuer id to associate with the utterance.
  • start (float) – Start of the utterance within the file [seconds].
  • end (float) – End of the utterance within the file [seconds]. -1 equals the end of the file.
Returns:The newly added utterance.
Return type:Utterance
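
The meaning of start and end, including the special value -1, can be made concrete with a short calculation. This is a minimal sketch that does not use the library: the helper utterance_duration, the file durations and the utterance tuples are all made up for the example, and only the convention that -1 stands for the end of the file comes from the parameter description above.

```python
def utterance_duration(start, end, file_duration):
    """Duration in seconds; ``end == -1`` means the utterance runs to the end of the file."""
    effective_end = file_duration if end == -1 else end
    return effective_end - start


# Hypothetical file durations in seconds, keyed by file id.
file_durations = {"file-1": 10.0, "file-2": 4.0}

# Hypothetical utterances as (utterance_idx, file_idx, start, end).
utterances = [
    ("utt-1", "file-1", 0, -1),     # whole file (the default start/end values)
    ("utt-2", "file-2", 1.0, 3.5),  # explicit segment
    ("utt-3", "file-2", 3.5, -1),   # from 3.5 s to the end of the file
]

durations = {
    utt_idx: utterance_duration(start, end, file_durations[file_idx])
    for utt_idx, file_idx, start, end in utterances
}

# Summing over all utterances gives a total in seconds.
total = sum(durations.values())
```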

save(writer=None)[source]

If self.path is defined, try to save the corpus at that path.

save_at(path, writer=None)[source]

Save this corpus at the given path. If the path differs from the current path set, the path gets updated.

Parameters:
  • path (str) – Path to save the data set to.
  • writer (str, CorpusWriter) – The writer or the name of the writer to use.
subviews

Return the subviews of the corpus.

Returns:A dictionary containing audiomate.corpus.Subview objects with the subview-idx as key.
Return type:dict
utterances

Return the utterances in the corpus.

Returns:A dictionary containing audiomate.corpus.assets.Utterance objects with the utterance-idx as key.
Return type:dict