audiomate.corpus.subset

This module contains functionality for creating subsets of a corpus. A subset of a corpus is represented by a Subview. The data contained in a subview is defined by one or more FilterCriterion instances.

Additional classes help with creating subviews. Splitter divides a corpus into subsets according to given proportions. SubsetGenerator creates subsets with given settings.

Subview

class audiomate.corpus.subset.Subview(corpus, filter_criteria=[])[source]

A subview is a read-only layer representing a subset of a corpus. The assets the subview contains are defined by filter criteria. An utterance is contained in the subview only if it passes all filter criteria.

Parameters:

Example:

>>> import audiomate
>>> from audiomate.corpus.subset import MatchingUtteranceIdxFilter, Subview
>>> utt_filter = MatchingUtteranceIdxFilter(utterance_idxs={'utt-1', 'utt-3'})
>>> corpus = audiomate.corpus.load('path/to/corpus')
>>> corpus.num_utterances
14
>>> subset = Subview(corpus, filter_criteria=[utt_filter])
>>> subset.num_utterances
2
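
The rule that an utterance must pass every filter criterion can be sketched in plain Python. This is an illustrative sketch that does not use audiomate itself: IdxCriterion and select_utterances are hypothetical names, and utterances are reduced to their id strings.

```python
class IdxCriterion:
    """Hypothetical stand-in for a filter criterion that matches on ids."""

    def __init__(self, idxs, inverse=False):
        self.idxs = set(idxs)
        self.inverse = inverse

    def match(self, utt_idx):
        # With inverse=True, only ids NOT in the set pass.
        return (utt_idx in self.idxs) != self.inverse


def select_utterances(utt_idxs, criteria):
    """Keep the ids that pass every criterion (logical AND)."""
    return [u for u in utt_idxs if all(c.match(u) for c in criteria)]


all_idxs = ['utt-1', 'utt-2', 'utt-3', 'utt-4']
criteria = [
    IdxCriterion({'utt-1', 'utt-3', 'utt-4'}),
    IdxCriterion({'utt-4'}, inverse=True),
]
selected = select_utterances(all_idxs, criteria)
print(selected)
```

With an empty list of criteria every utterance passes, which matches the default filter_criteria=[] in the signature above.
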
all_label_values(label_list_ids=None)

Return a set of all label-values occurring in this corpus.

Parameters:label_list_ids (list) – If not None, only labels from label-lists with an id contained in this list are considered.
Returns:A set of distinct label-values.
Return type:set
feature_containers

Return the feature-containers in the corpus.

Returns:A dictionary containing audiomate.corpus.assets.FeatureContainer objects with the feature-idx as key.
Return type:dict
files

Return the files in the corpus.

Returns:A dictionary containing audiomate.corpus.assets.File objects with the file-idx as key.
Return type:dict
issuers

Return the issuers in the corpus.

Returns:A dictionary containing audiomate.corpus.assets.Issuer objects with the issuer-idx as key.
Return type:dict
label_count(label_list_ids=None)

Return a dictionary containing the number of times each label-value occurs in this corpus.

Parameters:label_list_ids (list) – If not None, only labels from label-lists with an id contained in this list are considered.
Returns:A dictionary containing the number of occurrences with the label-value as key.
Return type:dict
label_durations(label_list_ids=None)

Return a dictionary containing the total duration for which each label-value occurs in this corpus.

Parameters:label_list_ids (list) – If not None, only labels from label-lists with an id contained in this list are considered.
Returns:A dictionary containing the total duration with the label-value as key.
Return type:dict
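
The difference between label_count and label_durations can be illustrated with a small sketch in plain Python. The tuple layout (value, start, end) and the helper names are assumptions made for this illustration, not audiomate's data model.

```python
from collections import Counter, defaultdict

# Hypothetical labels: (label value, start in seconds, end in seconds).
labels = [
    ('music', 0.0, 4.0),
    ('speech', 4.0, 9.5),
    ('music', 9.5, 12.0),
]


def count_labels(labels):
    """Number of occurrences per label value."""
    return dict(Counter(value for value, _, _ in labels))


def sum_label_durations(labels):
    """Total duration in seconds per label value."""
    durations = defaultdict(float)
    for value, start, end in labels:
        durations[value] += end - start
    return dict(durations)


counts = count_labels(labels)
durations = sum_label_durations(labels)
print(counts, durations)
```
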
name

Return the name of the dataset (equals the basename of the path, if the path is not None).

num_feature_containers

Return the number of feature-containers in the corpus.

num_files

Return the number of files.

num_issuers

Return the number of issuers in the corpus.

num_subviews

Return the number of subviews in the corpus.

num_utterances

Return the number of utterances.

classmethod parse(representation, corpus=None)[source]

Creates a subview from a string representation (created with self.serialize).

Parameters:representation (str) – The representation.
Returns:The created subview.
Return type:Subview
serialize()[source]

Return a string representing the subview with all of its filter criteria.

Returns:String with subview definition.
Return type:str
stats()

Return statistics calculated over all samples of all utterances in the corpus.

Returns:A DataStats object containing statistics over all samples in the corpus.
Return type:DataStats
stats_per_utterance()

Return statistics calculated for all samples of each utterance in the corpus.

Returns:A dictionary containing a DataStats object for each utterance.
Return type:dict
subviews

Return the subviews of the corpus.

Returns:A dictionary containing audiomate.corpus.Subview objects with the subview-idx as key.
Return type:dict
total_duration

Return the total amount of audio summed over all utterances in the corpus in seconds.

utterances

Return the utterances in the corpus.

Returns:A dictionary containing audiomate.corpus.assets.Utterance objects with the utterance-idx as key.
Return type:dict

Filter

class audiomate.corpus.subset.FilterCriterion[source]

A filter criterion decides whether a given utterance contained in a given corpus matches the filter.

match(utterance, corpus)[source]

Check if the utterance matches the filter.

Parameters:
  • utterance (Utterance) – The utterance to match.
  • corpus (CorpusView) – The corpus that contains the utterance.
Returns:

True if the filter matches the utterance, False otherwise.

Return type:

bool

classmethod name()[source]

Returns a name identifying this type of filter criterion.

classmethod parse(representation)[source]

Create a filter criterion based on a string representation (created with serialize).

Parameters:representation (str) – The string representation.
Returns:The filter criterion from that representation.
Return type:FilterCriterion
serialize()[source]

Serialize this filter criterion to write to a file. The output needs to be a single line without line breaks.

Returns:A string representing this filter criterion.
Return type:str
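
The serialize/parse contract (a single line out, an equivalent criterion back in) can be sketched as a round trip. The comma-separated line format and the class IdxListCriterion are assumptions made for this illustration; they are not audiomate's actual serialization format.

```python
class IdxListCriterion:
    """Hypothetical criterion with an assumed one-line text format."""

    def __init__(self, idxs, inverse=False):
        self.idxs = set(idxs)
        self.inverse = inverse

    def serialize(self):
        # Sort the ids so the output is deterministic and stays on one line.
        mode = 'exclude' if self.inverse else 'include'
        return ','.join([mode] + sorted(self.idxs))

    @classmethod
    def parse(cls, representation):
        mode, *idxs = representation.split(',')
        return cls(idxs, inverse=(mode == 'exclude'))


original = IdxListCriterion({'utt-3', 'utt-1'}, inverse=True)
line = original.serialize()
restored = IdxListCriterion.parse(line)
print(line)
```
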

MatchingUtteranceIdxFilter

class audiomate.corpus.subset.MatchingUtteranceIdxFilter(utterance_idxs=set(), inverse=False)[source]

A filter criterion that matches utterances based on utterance-ids.

Parameters:
  • utterance_idxs (set) – A set of utterance-ids. Only utterances in the set pass the filter.
  • inverse (bool) – If True, only utterances not in the set pass the filter.

MatchingLabelFilter

class audiomate.corpus.subset.MatchingLabelFilter(labels=set(), label_list_ids=set())[source]

A filter criterion that accepts only utterances whose labels are all among the given labels.

Parameters:
  • labels (set) – A set of labels which are accepted.
  • label_list_ids (set) – Only check label-lists with these ids. If empty, all label-lists are checked.
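
The matching rule described above can be sketched as a subset test per label-list. This is a plain-Python illustration under the assumption that an utterance's labels are given as a dictionary from label-list id to label values; has_only_accepted_labels is a hypothetical helper, not part of audiomate.

```python
def has_only_accepted_labels(label_lists, accepted, label_list_ids=()):
    """Return True if every checked label-list contains only accepted values.

    label_lists maps a label-list id to the label values it contains.
    An empty label_list_ids means that all label-lists are checked.
    """
    accepted = set(accepted)
    for list_id, values in label_lists.items():
        if label_list_ids and list_id not in label_list_ids:
            continue  # this label-list is not part of the check
        if not set(values) <= accepted:
            return False
    return True


utterance_labels = {'domain': ['music', 'speech'], 'quality': ['noisy']}
print(has_only_accepted_labels(utterance_labels, {'music', 'speech'}))
print(has_only_accepted_labels(utterance_labels, {'music', 'speech'}, {'domain'}))
```
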

Splitter

class audiomate.corpus.subset.Splitter(corpus, random_seed=None)[source]

A splitter provides methods for splitting a corpus into different subsets. It offers several approaches (the methods prefixed with split_by_). Most of these methods take a proportions parameter, which defines the relative sizes of the subsets. The subsets are returned as audiomate.corpus.Subview.

Parameters:
  • corpus (Corpus) – The corpus that should be split.
  • random_seed (int) – Seed to use for random number generation.
split_by_length_of_utterances(proportions={}, separate_issuers=False)[source]

Split the corpus into subsets so that the total duration of each subset is proportional to the given proportions. The corpus is split into len(proportions) parts, so that the total duration of the utterances is distributed according to the proportions.

Parameters:
  • proportions (dict) – A dictionary containing the relative size of the target subsets. The key is an identifier for the subset.
  • separate_issuers (bool) – If True it makes sure that all utterances of an issuer are in the same subset.
Returns:

A dictionary containing the subsets with the identifier from the input as key.

Return type:

(dict)

Example:

>>> spl = Splitter(corpus)
>>> corpus.num_utterances
100
>>> subsets = spl.split_by_length_of_utterances(proportions={
...     "train" : 0.6,
...     "dev" : 0.2,
...     "test" : 0.2
... })
>>> print(subsets)
{'dev': <audiomate.corpus.subview.Subview at 0x104ce7400>,
'test': <audiomate.corpus.subview.Subview at 0x104ce74e0>,
'train': <audiomate.corpus.subview.Subview at 0x104ce7438>}
>>> subsets['train'].num_utterances
60
>>> subsets['test'].num_utterances
20
split_by_number_of_utterances(proportions={}, separate_issuers=False)[source]

Split the corpus into subsets so that the number of utterances in each subset is proportional to the given proportions. The corpus is split into len(proportions) parts, with the utterances distributed according to the proportions.

Parameters:
  • proportions (dict) – A dictionary containing the relative size of the target subsets. The key is an identifier for the subset.
  • separate_issuers (bool) – If True it makes sure that all utterances of an issuer are in the same subset.
Returns:

A dictionary containing the subsets with the identifier from the input as key.

Return type:

(dict)

Example:

>>> spl = Splitter(corpus)
>>> corpus.num_utterances
100
>>> subsets = spl.split_by_number_of_utterances(proportions={
...     "train" : 0.6,
...     "dev" : 0.2,
...     "test" : 0.2
... })
>>> print(subsets)
{'dev': <audiomate.corpus.subview.Subview at 0x104ce7400>,
'test': <audiomate.corpus.subview.Subview at 0x104ce74e0>,
'train': <audiomate.corpus.subview.Subview at 0x104ce7438>}
>>> subsets['train'].num_utterances
60
>>> subsets['test'].num_utterances
20
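
The effect of separate_issuers=True can be sketched as splitting at the issuer level instead of the utterance level. The following is a simplified illustration, not necessarily how Splitter implements it: the function name, the input mapping, and the largest-issuer-first strategy are all assumptions.

```python
from collections import defaultdict


def split_keeping_issuers_together(utt_to_issuer, proportions):
    """Assign whole issuers to parts so no issuer is spread over two parts."""
    # Group utterance ids by issuer.
    by_issuer = defaultdict(list)
    for utt, issuer in utt_to_issuer.items():
        by_issuer[issuer].append(utt)

    total = len(utt_to_issuer)
    total_prop = sum(proportions.values())
    targets = {k: total * p / total_prop for k, p in proportions.items()}
    parts = {k: [] for k in proportions}

    # Place the largest issuers first, each into the part that is
    # currently furthest below its target number of utterances.
    for issuer in sorted(by_issuer, key=lambda i: (-len(by_issuer[i]), i)):
        part = max(parts, key=lambda k: targets[k] - len(parts[k]))
        parts[part].extend(by_issuer[issuer])
    return parts


utt_to_issuer = {'utt-{}'.format(i): 'spk-{}'.format(i % 5) for i in range(20)}
parts = split_keeping_issuers_together(utt_to_issuer, {'train': 0.6, 'test': 0.4})
print({name: len(utts) for name, utts in parts.items()})
```

Because whole issuers are moved, the resulting sizes only approximate the requested proportions.
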
split_by_proportionally_distribute_labels(proportions={}, use_lengths=True)[source]

Split the corpus into subsets, so the occurrence of the labels is distributed amongst the subsets according to the given proportions.

Parameters:
  • proportions (dict) – A dictionary containing the relative size of the target subsets. The key is an identifier for the subset.
  • use_lengths (bool) – If True the lengths of the labels are considered for splitting proportionally, otherwise only the number of occurrences is taken into account.
Returns:

A dictionary containing the subsets with the identifier from the input as key.

Return type:

(dict)

SubsetGenerator

class audiomate.corpus.subset.SubsetGenerator(corpus, random_seed=None)[source]

This class is used to generate subsets of a corpus.

Parameters:
  • corpus (Corpus) – The corpus to create subsets from.
  • random_seed (int) – Seed to use for random number generation.
random_subset(relative_size, balance_labels=False)[source]

Create a subview of random utterances with an approximate size relative to the full corpus. By default, x random utterances are selected, with x equal to relative_size * corpus.num_utterances.

Parameters:
  • relative_size (float) – A value between 0 and 1. (e.g. 0.5 will create a subset with approximately 50% of the full corpus size)
  • balance_labels (bool) – If True, the labels of the selected utterances are balanced as far as possible, so that the count/duration of every label within the subset is roughly equal.
Returns:

The subview representing the subset.

Return type:

Subview

random_subset_by_duration(relative_duration, balance_labels=False)[source]

Create a subview of random utterances with an approximate duration relative to the full corpus. Random utterances are selected so that the sum of their durations approximately equals relative_duration times the total duration of the full corpus.

Parameters:
  • relative_duration (float) – A value between 0 and 1. (e.g. 0.5 will create a subset with approximately 50% of the full corpus duration)
  • balance_labels (bool) – If True, the labels of the selected utterances are balanced as far as possible, so that the count/duration of every label within the subset is roughly equal.
Returns:

The subview representing the subset.

Return type:

Subview
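
Selecting random utterances up to a relative duration can be sketched as follows. This is a minimal illustration in plain Python; random_ids_by_duration is a hypothetical helper, and the rule of skipping utterances that would overshoot the target is an assumption, not necessarily the library's strategy.

```python
import random


def random_ids_by_duration(durations, relative_duration, seed=None):
    """Pick random ids whose summed duration stays at or below the target."""
    rng = random.Random(seed)
    ids = sorted(durations)
    rng.shuffle(ids)

    target = relative_duration * sum(durations.values())
    selected, total = [], 0.0
    for utt in ids:
        if total + durations[utt] > target:
            continue  # skip utterances that would overshoot the target
        selected.append(utt)
        total += durations[utt]
    return selected


durations = {'utt-{}'.format(i): 1.0 + i for i in range(10)}
selected = random_ids_by_duration(durations, 0.5, seed=42)
print(sum(durations[u] for u in selected), sum(durations.values()))
```
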

random_subsets(relative_sizes, by_duration=False, balance_labels=False)[source]

Create several subsets with the given sizes relative to the size or duration of the full corpus. This is basically the same as calling random_subset or random_subset_by_duration multiple times with different values, except that this method makes sure that every subset contains only utterances that are also contained in the next bigger subset.

Parameters:
  • relative_sizes (list) – A list of numbers between 0 and 1 indicating the sizes of the desired subsets, relative to the full corpus.
  • by_duration (bool) – If True the size measure is the duration of all utterances in a subset/corpus.
  • balance_labels (bool) – If True the labels contained in a subset are chosen to be balanced as far as possible.
Returns:

A dictionary containing all subsets with the relative size as key.

Return type:

dict
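
The nesting guarantee (every subset is contained in the next bigger one) can be obtained by shuffling once and taking prefixes of the same order. The sketch below illustrates that idea for the count-based case only; nested_random_subsets is a hypothetical name, and the prefix approach is an assumption rather than the library's documented method.

```python
import random


def nested_random_subsets(utt_ids, relative_sizes, seed=None):
    """Return {relative size: ids}, where smaller subsets are prefixes of bigger ones."""
    rng = random.Random(seed)
    shuffled = sorted(utt_ids)
    rng.shuffle(shuffled)

    subsets = {}
    for size in relative_sizes:
        count = round(size * len(shuffled))
        # Every subset is a prefix of the same shuffled order.
        subsets[size] = shuffled[:count]
    return subsets


utt_ids = ['utt-{}'.format(i) for i in range(100)]
subsets = nested_random_subsets(utt_ids, [0.1, 0.5, 0.25], seed=7)
print({size: len(ids) for size, ids in subsets.items()})
```
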

Utils

audiomate.corpus.subset.utils.absolute_proportions(proportions, count)[source]

Split the integer count into len(proportions) parts that sum up to count and match the given proportions.

Parameters:proportions (dict) – Dict of proportions, with an identifier as key.
Returns:Dictionary with absolute proportions and same identifiers as key.
Return type:dict

Example:

>>> absolute_proportions({'train': 0.5, 'test': 0.5}, 100)
{'train': 50, 'test': 50}
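
When count is not evenly divisible, the parts still have to be integers that sum up to count. One way to achieve this is the largest-remainder method, sketched below. Using this method is an assumption made for illustration; absolute_proportions may resolve rounding differently, and absolute_proportions_sketch is a hypothetical name.

```python
import math


def absolute_proportions_sketch(proportions, count):
    """Turn relative proportions into integers that sum up to count."""
    total = sum(proportions.values())
    exact = {k: count * p / total for k, p in proportions.items()}

    # Start with the rounded-down values ...
    result = {k: math.floor(v) for k, v in exact.items()}
    leftover = count - sum(result.values())

    # ... and hand the remaining units to the largest fractional parts.
    by_remainder = sorted(exact, key=lambda k: exact[k] - result[k], reverse=True)
    for k in by_remainder[:leftover]:
        result[k] += 1
    return result


even = absolute_proportions_sketch({'train': 0.5, 'test': 0.5}, 100)
uneven = absolute_proportions_sketch({'train': 0.6, 'dev': 0.2, 'test': 0.2}, 11)
print(even, uneven)
```
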
audiomate.corpus.subset.utils.get_identifiers_splitted_by_weights(identifiers={}, proportions={})[source]

Divide the given identifiers based on the given proportions. Instead of splitting the identifiers randomly, the split is based on category weights. Every identifier has a weight for any number of categories. The goal is to split the identifiers so that the sum of category k within part x is proportional to the sum of category k over all parts, according to the given proportions. This is done by greedily inserting the identifiers one by one into a part that still has free space (weight). If no fitting part is left, the part whose weight is exceeded the least is used.

Parameters:
  • identifiers (dict) – A dictionary containing the weights for each identifier (key). Per item a dictionary of weights per category is given.
  • proportions (dict) – Dict of proportions, with an identifier as key.
Returns:

Dictionary containing a list of identifiers per part with the same key as the proportions dict.

Return type:

dict

Example:

>>> identifiers = {
...     'a': {'music': 2, 'speech': 1},
...     'b': {'music': 5, 'speech': 2},
...     'c': {'music': 2, 'speech': 4},
...     'd': {'music': 1, 'speech': 4},
...     'e': {'music': 3, 'speech': 4}
... }
>>> proportions = {
...     "train" : 0.6,
...     "dev" : 0.2,
...     "test" : 0.2
... }
>>> get_identifiers_splitted_by_weights(identifiers, proportions)
{
    'train': ['a', 'b', 'd'],
    'dev': ['c'],
    'test': ['e']
}
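
The greedy strategy described above can be sketched in plain Python. This is a simplified illustration: greedy_weighted_split is a hypothetical name, identifiers are processed in sorted order, and ties are broken by remaining free space. These choices are assumptions, so the result may differ from what the library function returns.

```python
def greedy_weighted_split(identifiers, proportions):
    """Greedily assign identifiers to parts based on per-category weights."""
    categories = sorted({c for w in identifiers.values() for c in w})
    total_prop = sum(proportions.values())
    totals = {c: sum(w.get(c, 0) for w in identifiers.values()) for c in categories}

    # Target weight per part and category, derived from the proportions.
    targets = {
        part: {c: totals[c] * p / total_prop for c in categories}
        for part, p in proportions.items()
    }
    filled = {part: {c: 0 for c in categories} for part in proportions}
    result = {part: [] for part in proportions}

    def overflow(part, weights):
        # Total weight by which the part would exceed its targets.
        return sum(
            max(0, filled[part][c] + weights.get(c, 0) - targets[part][c])
            for c in categories
        )

    def free_space(part):
        return sum(targets[part][c] - filled[part][c] for c in categories)

    for identifier in sorted(identifiers):
        weights = identifiers[identifier]
        # Prefer a part that still fits; otherwise the least exceeded one.
        best = min(proportions, key=lambda p: (overflow(p, weights), -free_space(p)))
        result[best].append(identifier)
        for c in categories:
            filled[best][c] += weights.get(c, 0)
    return result


identifiers = {
    'a': {'music': 2, 'speech': 1},
    'b': {'music': 5, 'speech': 2},
    'c': {'music': 2, 'speech': 4},
    'd': {'music': 1, 'speech': 4},
    'e': {'music': 3, 'speech': 4},
}
parts = greedy_weighted_split(identifiers, {'train': 0.6, 'dev': 0.2, 'test': 0.2})
print(parts)
```
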
audiomate.corpus.subset.utils.select_balanced_subset(items, select_count, categories, select_count_values=None, seed=None)[source]

Select items so the summed category weights are balanced. Each item has a dictionary containing the category weights. Items are selected until select_count is reached. The value that is added to select_count for an item can be defined in the dictionary select_count_values. If this is not defined it is assumed to be 1, which means select_count items are selected.

Parameters:
  • items (dict) – Dictionary containing items with category weights.
  • select_count (float) – Value to reach for selected items.
  • categories (list) – List of all categories.
  • select_count_values (dict) – The select_count values to be used.
Returns:

List of item ids, containing select_count items (or len(items) if smaller).

Return type:

list

Example:

>>> items = {
...    'utt-1' : {'m': 1, 's': 0, 'n': 0},
...    'utt-2' : {'m': 0, 's': 2, 'n': 1},
...    ...
... }
>>> select_balanced_subset(items, 5, categories=['m', 's', 'n'])
['utt-1', 'utt-3', 'utt-9', 'utt-33', 'utt-34']
audiomate.corpus.subset.utils.split_identifiers(identifiers=[], proportions={})[source]

Split the given identifiers by the given proportions.

Parameters:
  • identifiers (list) – List of identifiers (str).
  • proportions (dict) – A dictionary containing the proportions with the identifier from the input as key.
Returns:

Dictionary containing a list of identifiers per part with the same key as the proportions dict.

Return type:

dict

Example:

>>> split_identifiers(
...     identifiers=['a', 'b', 'c', 'd'],
...     proportions={'melvin' : 0.5, 'timmy' : 0.5}
... )
{'melvin' : ['a', 'c'], 'timmy' : ['b', 'd']}
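
A proportional split of identifiers can be sketched by shuffling the list and cutting it at cumulative boundaries. Whether split_identifiers shuffles, and how it rounds, is not stated here, so both are assumptions in this illustration; split_identifiers_sketch and its seed parameter are hypothetical.

```python
import random


def split_identifiers_sketch(identifiers, proportions, seed=None):
    """Shuffle the identifiers and slice them according to the proportions."""
    rng = random.Random(seed)
    shuffled = list(identifiers)
    rng.shuffle(shuffled)

    total = sum(proportions.values())
    result, start, cumulative = {}, 0, 0.0
    for key, proportion in proportions.items():
        cumulative += proportion
        # Cut at the cumulative boundary so the parts cover the whole list.
        end = round(len(shuffled) * cumulative / total)
        result[key] = shuffled[start:end]
        start = end
    return result


parts = split_identifiers_sketch(
    ['a', 'b', 'c', 'd'],
    {'melvin': 0.5, 'timmy': 0.5},
    seed=3,
)
print(parts)
```
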