audiomate.corpus.subset

This module contains functionality for creating any kind of subset of a corpus. A subset of a corpus is represented by a Subview. The data contained in a subview is defined by one or more FilterCriterion instances.

There are additional classes for creating subviews: Splitter can be used to divide a corpus into subsets according to given proportions, while SubsetGenerator can be used to create subsets with specific settings. A typical workflow is sketched below.
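
Example (a rough sketch; the corpus path, seed and proportions are illustrative):

>>> import audiomate
>>> from audiomate.corpus import subset
>>>
>>> corpus = audiomate.corpus.load('path/to/corpus')
>>> splitter = subset.Splitter(corpus, random_seed=42)
>>> subsets = splitter.split(proportions={'train': 0.8, 'test': 0.2})
>>> train = subsets['train']  # a Subview that behaves like a read-only corpus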

Subview

class audiomate.corpus.subset.Subview(corpus, filter_criteria)[source]

A subview is a read-only layer representing some subset of a corpus. The assets a subview contains are defined by filter criteria. An utterance is contained in the subview only if it passes all filter criteria.

Parameters:
  • corpus (CorpusView) – The corpus this subview is based on.
  • filter_criteria (list) – A list of FilterCriterion instances that define which utterances are contained in the subview.

Example:

>>> corpus = audiomate.corpus.load('path/to/corpus')
>>> corpus.num_utterances
14
>>> criterion = subset.MatchingUtteranceIdxFilter(utterance_idxs={'utt-1', 'utt-3'})
>>> filtered = subset.Subview(corpus, filter_criteria=[criterion])
>>> filtered.num_utterances
2
all_label_values(label_list_ids=None)

Return a set of all label-values occurring in this corpus.

Parameters: label_list_ids (list) – If not None, only labels from label-lists with an id contained in this list are considered.
Returns: A set of distinct label-values.
Return type: set
all_tokens(delimiter=' ', label_list_ids=None)

Return a set of all tokens occurring in one of the labels in the corpus.

Parameters:
  • delimiter (str) – The delimiter used to split labels into tokens. (see audiomate.annotations.Label.tokenized())
  • label_list_ids (list) – If not None, only labels from label-lists with an idx contained in this list are considered.
Returns: A set of distinct tokens.
Return type: set
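
Example (illustrative only; the corpus contents, and therefore the returned values, are made up):

>>> corpus.all_label_values()
{'good morning', 'hello world'}
>>> corpus.all_tokens(delimiter=' ')
{'good', 'hello', 'morning', 'world'}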

contains_issuer(issuer)

Return True if the given issuer is in the corpus already, False otherwise.

contains_track(track)

Return True if the given track is in the corpus already, False otherwise.

feature_containers

Return the feature-containers in the corpus.

Returns: A dictionary containing audiomate.container.FeatureContainer objects with the feature-idx as key.
Return type: dict
issuers

Return the issuers in the corpus.

Returns: A dictionary containing audiomate.issuers.Issuer objects with the issuer-idx as key.
Return type: dict
label_count(label_list_ids=None)

Return a dictionary containing the number of occurrences of every label-value in this corpus.

Parameters: label_list_ids (list) – If not None, only labels from label-lists with an id contained in this list are considered.
Returns: A dictionary containing the number of occurrences with the label-value as key.
Return type: dict
label_durations(label_list_ids=None)

Return a dictionary containing the total duration for which every label-value in this corpus occurs.

Parameters: label_list_ids (list) – If not None, only labels from label-lists with an id contained in this list are considered.
Returns: A dictionary containing the total duration with the label-value as key.
Return type: dict
name

Return the name of the dataset (equals the basename of the path, if not None).

num_feature_containers

Return the number of feature-containers in the corpus.

num_issuers

Return the number of issuers in the corpus.

num_subviews

Return the number of subviews in the corpus.

num_tracks

Return number of tracks.

num_utterances

Return number of utterances.

classmethod parse(representation, corpus=None)[source]

Create a subview from a string representation (as created with serialize()).

Parameters: representation (str) – The string representation.
Returns: The created subview.
Return type: Subview
serialize()[source]

Return a string representing the subview with all of its filter criteria.

Returns: String with subview definition.
Return type: str
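
Example (a sketch showing how serialize() and parse() fit together; it assumes the corpus and criterion from the example above):

>>> filtered = subset.Subview(corpus, filter_criteria=[criterion])
>>> representation = filtered.serialize()
>>> restored = subset.Subview.parse(representation, corpus=corpus)
>>> restored.num_utterances == filtered.num_utterances
True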
split_utterances_to_max_time(max_time=60.0, overlap=0.0)

Create a new corpus, where all utterances are at most of the given maximal duration. Utterances longer than max_time are split up into multiple utterances.

Warning

Subviews and FeatureContainers are not added to the newly created corpus.

Parameters:
  • max_time (float) – Maximal duration for target utterances in seconds.
  • overlap (float) – Amount of overlap in seconds. The overlap is measured from the center of the splitting. (The actual overlap of two segments is 2 * overlap)
Returns: A new corpus instance.
Return type: Corpus
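
Example (a sketch; the max_time and overlap values are illustrative):

>>> chunked = corpus.split_utterances_to_max_time(max_time=30.0, overlap=0.5)
>>> chunked.num_utterances >= corpus.num_utterances
True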

stats()

Return statistics calculated over all samples of all utterances in the corpus.

Returns: A DataStats object containing statistics over all samples in the corpus.
Return type: DataStats
stats_per_utterance()

Return statistics calculated for all samples of each utterance in the corpus.

Returns: A dictionary containing a DataStats object for each utterance.
Return type: dict
subviews

Return the subviews of the corpus.

Returns: A dictionary containing audiomate.corpus.Subview objects with the subview-idx as key.
Return type: dict
total_duration

Return the total amount of audio summed over all utterances in the corpus in seconds.

tracks

Return the tracks in the corpus.

Returns: A dictionary containing audiomate.track.Track objects with the track-idx as key.
Return type: dict
utterances

Return the utterances in the corpus.

Returns: A dictionary containing audiomate.corpus.assets.Utterance objects with the utterance-idx as key.
Return type: dict

Filter

class audiomate.corpus.subset.FilterCriterion[source]

A filter criterion decides whether a given utterance contained in a given corpus matches the filter.

match(utterance, corpus)[source]

Check if the utterance matches the filter.

Parameters:
  • utterance (Utterance) – The utterance to match.
  • corpus (CorpusView) – The corpus that contains the utterance.
Returns: True if the filter matches the utterance, False otherwise.
Return type: bool

classmethod name()[source]

Returns a name identifying this type of filter criterion.

classmethod parse(representation)[source]

Create a filter criterion based on a string representation (created with serialize).

Parameters: representation (str) – The string representation.
Returns: The filter criterion from that representation.
Return type: FilterCriterion
serialize()[source]

Serialize this filter criterion to write to a file. The output needs to be a single line without line breaks.

Returns: A string representing this filter criterion.
Return type: str
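
Example (a minimal sketch of a custom criterion implementing the full interface; the class name, serialization format and duration threshold are made up, and it assumes utterances expose a duration in seconds):

>>> from audiomate.corpus import subset
>>>
>>> class MinDurationFilter(subset.FilterCriterion):
>>>     """Accept only utterances that are at least min_duration seconds long."""
>>>
>>>     def __init__(self, min_duration=1.0):
>>>         self.min_duration = min_duration
>>>
>>>     def match(self, utterance, corpus):
>>>         # Assumption: the utterance provides a duration property in seconds.
>>>         return utterance.duration >= self.min_duration
>>>
>>>     @classmethod
>>>     def name(cls):
>>>         return 'min_duration'
>>>
>>>     def serialize(self):
>>>         return str(self.min_duration)
>>>
>>>     @classmethod
>>>     def parse(cls, representation):
>>>         return cls(min_duration=float(representation))
>>>
>>> long_enough = subset.Subview(corpus, filter_criteria=[MinDurationFilter(min_duration=2.0)])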

MatchingUtteranceIdxFilter

class audiomate.corpus.subset.MatchingUtteranceIdxFilter(utterance_idxs, inverse=False)[source]

A filter criterion that matches utterances based on utterance-ids.

Parameters:
  • utterance_idxs (set) – A set of utterance-ids. Only utterances with an id in this set pass the filter.
  • inverse (bool) – If True, only utterances with an id not in the set pass the filter.
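
Example (a sketch; the utterance-ids are illustrative). The same id-set can be used to carve out a subset and its complement:

>>> keep = subset.MatchingUtteranceIdxFilter(utterance_idxs={'utt-1', 'utt-3'})
>>> drop = subset.MatchingUtteranceIdxFilter(utterance_idxs={'utt-1', 'utt-3'}, inverse=True)
>>> selected = subset.Subview(corpus, filter_criteria=[keep])
>>> remainder = subset.Subview(corpus, filter_criteria=[drop])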

MatchingLabelFilter

class audiomate.corpus.subset.MatchingLabelFilter(labels, label_list_ids=None)[source]

A filter criterion that accepts only utterances whose labels are all contained in the given set of labels.

Parameters:
  • labels (set) – A set of labels which are accepted.
  • label_list_ids (set) – Only check label-lists with these ids. If None, checks all label-lists.
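
Example (a sketch; the label values and the label-list id are illustrative):

>>> criterion = subset.MatchingLabelFilter(labels={'music', 'speech'}, label_list_ids={'domain'})
>>> music_or_speech = subset.Subview(corpus, filter_criteria=[criterion])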

Splitter

class audiomate.corpus.subset.Splitter(corpus, random_seed=None)[source]

A splitter provides methods for splitting a corpus into different subsets. It offers several approaches (see the split_* methods). These methods take a proportions parameter, which defines the relative sizes of the resulting subsets. The subsets are returned as audiomate.corpus.Subview objects.

Parameters:
  • corpus (Corpus) – The corpus that should be split.
  • random_seed (int) – Seed to use for random number generation.
split(proportions, separate_issuers=False)[source]

Split the corpus based on the number of utterances. The utterances are distributed to len(proportions) subsets, according to the ratios proportions[subset].

Parameters:
  • proportions (dict) – A dictionary containing the relative size of the target subsets. The key is an identifier for the subset.
  • separate_issuers (bool) – If True it makes sure that all utterances of an issuer are in the same subset.
Returns: A dictionary containing the subsets with the identifier from the input as key.
Return type: dict

Example:

>>> spl = Splitter(corpus)
>>> corpus.num_utterances
100
>>> subsets = spl.split(proportions={
>>>     "train" : 0.6,
>>>     "dev" : 0.2,
>>>     "test" : 0.2
>>> })
>>> print(subsets)
{'dev': <audiomate.corpus.subview.Subview at 0x104ce7400>,
'test': <audiomate.corpus.subview.Subview at 0x104ce74e0>,
'train': <audiomate.corpus.subview.Subview at 0x104ce7438>}
>>> subsets['train'].num_utterances
60
>>> subsets['dev'].num_utterances
20
>>> subsets['test'].num_utterances
20
split_by_audio_duration(proportions, separate_issuers=False)[source]

Split the corpus based on the total duration of audio. The utterances are distributed to len(proportions) subsets. Utterances are split up in a way that each subset contains audio with a duration proportional to the given proportions.

Parameters:
  • proportions (dict) – A dictionary containing the relative size of the target subsets. The key is an identifier for the subset.
  • separate_issuers (bool) – If True it makes sure that all utterances of an issuer are in the same subset.
Returns: A dictionary containing the subsets with the identifier from the input as key.
Return type: dict

Example:

>>> spl = Splitter(corpus)
>>> corpus.num_utterances
100
>>> subsets = spl.split_by_audio_duration(proportions={
>>>     "train" : 0.6,
>>>     "dev" : 0.2,
>>>     "test" : 0.2
>>> })
>>> print(subsets)
{'dev': <audiomate.corpus.subview.Subview at 0x104ce7400>,
'test': <audiomate.corpus.subview.Subview at 0x104ce74e0>,
'train': <audiomate.corpus.subview.Subview at 0x104ce7438>}
>>> subsets['train'].num_utterances
55
>>> subsets['dev'].num_utterances
35
>>> subsets['test'].num_utterances
10
split_by_label_duration(proportions, separate_issuers=False)[source]

Split the corpus based on the total duration of labels (end - start). The utterances are distributed to len(proportions) subsets. Utterances are split up in a way that each subset contains labels with a duration proportional to the given proportions.

Parameters:
  • proportions (dict) – A dictionary containing the relative size of the target subsets. The key is an identifier for the subset.
  • separate_issuers (bool) – If True it makes sure that all utterances of an issuer are in the same subset.
Returns: A dictionary containing the subsets with the identifier from the input as key.
Return type: dict

Example:

>>> spl = Splitter(corpus)
>>> corpus.num_utterances
100
>>> subsets = spl.split_by_label_duration(proportions={
>>>     "train" : 0.6,
>>>     "dev" : 0.2,
>>>     "test" : 0.2
>>> })
>>> print(subsets)
{'dev': <audiomate.corpus.subview.Subview at 0x104ce7400>,
'test': <audiomate.corpus.subview.Subview at 0x104ce74e0>,
'train': <audiomate.corpus.subview.Subview at 0x104ce7438>}
>>> subsets['train'].num_utterances
55
>>> subsets['dev'].num_utterances
35
>>> subsets['test'].num_utterances
10
split_by_label_length(proportions, label_list_idx=None, separate_issuers=False)[source]

Split the corpus based on the total length of the label-list. The utterances are distributed to len(proportions) subsets. Utterances are split up in a way that each subset contains labels summed up to a length proportional to the given proportions. Length is defined as the number of characters.

Parameters:
  • proportions (dict) – A dictionary containing the relative size of the target subsets. The key is an identifier for the subset.
  • label_list_idx (str) – The idx of the label-list to use for computing the length. If None, all label-lists are used.
  • separate_issuers (bool) – If True it makes sure that all utterances of an issuer are in the same subset.
Returns: A dictionary containing the subsets with the identifier from the input as key.
Return type: dict

Example:

>>> spl = Splitter(corpus)
>>> corpus.num_utterances
100
>>> subsets = spl.split_by_label_length(proportions={
>>>     "train" : 0.6,
>>>     "dev" : 0.2,
>>>     "test" : 0.2
>>> })
>>> print(subsets)
{'dev': <audiomate.corpus.subview.Subview at 0x104ce7400>,
'test': <audiomate.corpus.subview.Subview at 0x104ce74e0>,
'train': <audiomate.corpus.subview.Subview at 0x104ce7438>}
>>> subsets['train'].num_utterances
55
>>> subsets['dev'].num_utterances
35
>>> subsets['test'].num_utterances
10
split_by_label_occurence(proportions, separate_issuers=False)[source]

Split the corpus based on the total number of occurrences of labels. The utterances are distributed to len(proportions) subsets. Utterances are split up in a way that each subset contains label occurrences proportional to the given proportions.

Parameters:
  • proportions (dict) – A dictionary containing the relative size of the target subsets. The key is an identifier for the subset.
  • separate_issuers (bool) – If True it makes sure that all utterances of an issuer are in the same subset.
Returns: A dictionary containing the subsets with the identifier from the input as key.
Return type: dict

Example:

>>> spl = Splitter(corpus)
>>> corpus.num_utterances
100
>>> subsets = spl.split_by_label_occurence(proportions={
>>>     "train" : 0.6,
>>>     "dev" : 0.2,
>>>     "test" : 0.2
>>> })
>>> print(subsets)
{'dev': <audiomate.corpus.subview.Subview at 0x104ce7400>,
'test': <audiomate.corpus.subview.Subview at 0x104ce74e0>,
'train': <audiomate.corpus.subview.Subview at 0x104ce7438>}
>>> subsets['train'].num_utterances
55
>>> subsets['dev'].num_utterances
35
>>> subsets['test'].num_utterances
10

SubsetGenerator

class audiomate.corpus.subset.SubsetGenerator(corpus, random_seed=None)[source]

This class is used to generate subsets of a corpus.

Parameters:
  • corpus (Corpus) – The corpus to create subsets from.
  • random_seed (int) – Seed to use for random number generation.
maximal_balanced_subset(by_duration=False, label_list_ids=None)[source]

Create a subset of the corpus that is as big as possible, while keeping the labels approximately balanced. The label with the shortest duration (or with the fewest utterances if by_duration=False) is taken as reference. All other labels are selected so that they match the shortest one as closely as possible.

Parameters:
  • by_duration (bool) – If True the size measure is the duration of all utterances in a subset/corpus.
  • label_list_ids (list) – List of label-list ids. If none is given, all label-lists are considered for balancing. Otherwise only the ones that are in the list are considered.
Returns: The subview representing the subset.
Return type: Subview
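
Example (a sketch; the seed and label-list id are illustrative):

>>> gen = subset.SubsetGenerator(corpus, random_seed=42)
>>> balanced = gen.maximal_balanced_subset(by_duration=True, label_list_ids=['word-transcript'])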

random_subset(relative_size, balance_labels=False, label_list_ids=None)[source]

Create a subview of random utterances with an approximate size relative to the full corpus. By default, x random utterances are selected, with x equal to relative_size * corpus.num_utterances.

Parameters:
  • relative_size (float) – A value between 0 and 1. (0.5 will create a subset with approximately 50% of the full corpus size)
  • balance_labels (bool) – If True, the labels of the selected utterances are balanced as far as possible, so that the count/duration of every label within the subset is equal.
  • label_list_ids (list) – List of label-list ids. If none is given, all label-lists are considered for balancing. Otherwise only the ones that are in the list are considered.
Returns: The subview representing the subset.
Return type: Subview
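
Example (a sketch; the seed and relative size are illustrative):

>>> gen = subset.SubsetGenerator(corpus, random_seed=42)
>>> half = gen.random_subset(0.5, balance_labels=True)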

random_subset_by_duration(relative_duration, balance_labels=False, label_list_ids=None)[source]

Create a subview of random utterances with an approximate duration relative to the full corpus. Random utterances are selected so that the sum of their durations corresponds to the given relative duration of the full corpus.

Parameters:
  • relative_duration (float) – A value between 0 and 1. (e.g. 0.5 will create a subset with approximately 50% of the full corpus duration)
  • balance_labels (bool) – If True, the labels of the selected utterances are balanced as far as possible, so that the count/duration of every label within the subset is equal.
  • label_list_ids (list) – List of label-list ids. If none is given, all label-lists are considered for balancing. Otherwise only the ones that are in the list are considered.
Returns: The subview representing the subset.
Return type: Subview
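
Example (a sketch; the relative duration is illustrative):

>>> gen = subset.SubsetGenerator(corpus, random_seed=42)
>>> quarter = gen.random_subset_by_duration(0.25)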

random_subsets(relative_sizes, by_duration=False, balance_labels=False, label_list_ids=None)[source]

Create multiple subsets with the given sizes relative to the size or duration of the full corpus. This is basically the same as calling random_subset or random_subset_by_duration multiple times with different values, but this method makes sure that every subset contains only utterances that are also contained in the next bigger subset.

Parameters:
  • relative_sizes (list) – A list of numbers between 0 and 1 indicating the sizes of the desired subsets, relative to the full corpus.
  • by_duration (bool) – If True the size measure is the duration of all utterances in a subset/corpus.
  • balance_labels (bool) – If True the labels contained in a subset are chosen to be balanced as far as possible.
  • label_list_ids (list) – List of label-list ids. If none is given, all label-lists are considered for balancing. Otherwise only the ones that are in the list are considered.
Returns: A dictionary containing all subsets with the relative size as key.
Return type: dict
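
Example (a sketch; per the description above, the keys of the result are the requested relative sizes):

>>> gen = subset.SubsetGenerator(corpus, random_seed=42)
>>> nested = gen.random_subsets([0.1, 0.25, 0.5], by_duration=True)
>>> sorted(nested.keys())
[0.1, 0.25, 0.5]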

Utils

audiomate.corpus.subset.utils.absolute_proportions(proportions, count)[source]

Split a given integer count into len(proportions) parts, so that the parts sum up to count and match the given proportions.

Parameters:
  • proportions (dict) – Dict of proportions, with an identifier as key.
  • count (int) – The integer value to split up.
Returns: Dictionary with absolute proportions and the same identifiers as key.
Return type: dict

Example:

>>> absolute_proportions({'train': 0.5, 'test': 0.5}, 100)
{'train': 50, 'test': 50}
audiomate.corpus.subset.utils.get_identifiers_splitted_by_weights(identifiers, proportions, seed=None)[source]

Divide the given identifiers based on the given proportions. Instead of splitting the identifiers randomly, the split is based on category weights: every identifier has a weight for any number of categories. The goal is to split the identifiers so that, for every category, the sum of its weights within a part is proportional to the sum of that category over all parts, according to the given proportions. This is done by greedily inserting the identifiers one by one into a part that still has free capacity (weight). If no part has free capacity left, the one whose weight is exceeded the least is used. This function is deterministic for a given seed: the identifiers are first sorted and then shuffled using the seed.

Parameters:
  • identifiers (dict) – A dictionary containing the weights for each identifier (key). Per item a dictionary of weights per category is given.
  • proportions (dict) – Dict of proportions, with an identifier as key.
  • seed (int) – Seed to use for random operations.
Returns: Dictionary containing a list of identifiers per part with the same key as the proportions dict.
Return type: dict

Example:

>>> identifiers = {
>>>     'a': {'music': 2, 'speech': 1},
>>>     'b': {'music': 5, 'speech': 2},
>>>     'c': {'music': 2, 'speech': 4},
>>>     'd': {'music': 1, 'speech': 4},
>>>     'e': {'music': 3, 'speech': 4}
>>> }
>>> proportions = {
>>>     "train" : 0.6,
>>>     "dev" : 0.2,
>>>     "test" : 0.2
>>> }
>>> get_identifiers_splitted_by_weights(identifiers, proportions)
{
    'train': ['a', 'b', 'd'],
    'dev': ['c'],
    'test': ['e']
}
audiomate.corpus.subset.utils.select_balanced_subset(items, select_count, categories, select_count_values=None, seed=None)[source]

Select items so the summed category weights are balanced. Each item has a dictionary containing the category weights. Items are selected until select_count is reached. The value that is added to select_count for an item can be defined in the dictionary select_count_values. If this is not defined it is assumed to be 1, which means select_count items are selected.

Parameters:
  • items (dict) – Dictionary containing items with category weights.
  • select_count (float) – Value to reach for selected items.
  • categories (list) – List of all categories.
  • select_count_values (dict) – The select_count values to be used. For example, for an utterance with multiple labels, the category weights (label-lengths) are used for balancing, while the utterance-duration is used for reaching the select_count.
  • seed (int) – Seed to use for random operations.
Returns: List of item-ids, containing select_count items (or len(items) if that is smaller).
Return type: list

Example:

>>> items = {
>>>    'utt-1' : {'m': 1, 's': 0, 'n': 0},
>>>    'utt-2' : {'m': 0, 's': 2, 'n': 1},
>>>    ...
>>> }
>>> select_balanced_subset(items, 5, ['m', 's', 'n'])
['utt-1', 'utt-3', 'utt-9', 'utt-33', 'utt-34']
audiomate.corpus.subset.utils.split_identifiers(identifiers, proportions, seed=None)[source]

Split the given identifiers by the given proportions. This function is deterministic for a given seed: the identifiers are first sorted and then shuffled using the seed.

Parameters:
  • identifiers (list) – List of identifiers (str).
  • proportions (dict) – A dictionary containing the proportions, with the identifier from the input as key.
  • seed (int) – Seed to use for random operations.
Returns: Dictionary containing a list of identifiers per part with the same key as the proportions dict.
Return type: dict

Example:

>>> split_identifiers(
>>>     identifiers=['a', 'b', 'c', 'd'],
>>>     proportions={'melvin' : 0.5, 'timmy' : 0.5}
>>> )
{'melvin' : ['a', 'c'], 'timmy' : ['b', 'd']}