audiomate.corpus.subset¶
This module contains functionality for creating subsets of a corpus.
A subset of a corpus is represented with a Subview
.
The data contained in a subview is defined by one or more FilterCriterion
.
For creating subviews there are additional classes.
Splitter
can be used to divide a corpus into subsets according to given proportions.
SubsetGenerator
can be used to create subsets with given settings.
Subview¶
-
class
audiomate.corpus.subset.
Subview
(corpus, filter_criteria)[source]¶ A subview is a read-only layer representing a subset of a corpus. The assets the subview contains are defined by filter criteria. An utterance is contained in the subview only if it passes all filter criteria.
Parameters: - corpus (CorpusView) – The corpus this subview is based on.
- filter_criteria (list, FilterCriterion) – List of
FilterCriterion
Example:
>>> filter = subview.MatchingUtteranceIdxFilter(utterance_idxs=['utt-1', 'utt-3'])
>>> corpus = audiomate.corpus.load('path/to/corpus')
>>> corpus.num_utterances
14
>>> subset = subview.Subview(corpus, filter_criteria=[filter])
>>> subset.num_utterances
2
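The rule that an utterance must pass all filter criteria can be illustrated without audiomate. The following is a minimal sketch, not the library's implementation: `IdFilter` and `filter_utterances` are hypothetical names, and utterances are reduced to plain id strings.

```python
class IdFilter:
    """Hypothetical stand-in for a filter criterion: accepts the listed ids."""

    def __init__(self, ids, inverse=False):
        self.ids = set(ids)
        self.inverse = inverse

    def match(self, utt_id):
        # With inverse=True, only ids NOT in the set pass.
        return (utt_id in self.ids) != self.inverse


def filter_utterances(utt_ids, criteria):
    """Keep the ids that pass every criterion (logical AND)."""
    return [u for u in utt_ids if all(c.match(u) for c in criteria)]


all_ids = ['utt-1', 'utt-2', 'utt-3', 'utt-4']
selected = filter_utterances(all_ids, [IdFilter(['utt-1', 'utt-3', 'utt-4']),
                                       IdFilter(['utt-4'], inverse=True)])
print(selected)  # ['utt-1', 'utt-3']
```

Combining the criteria with a logical AND means that adding a criterion can only shrink the selection, never grow it.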
-
all_label_values
(label_list_ids=None)¶ Return a set of all label-values occurring in this corpus.
Parameters: label_list_ids (list) – If not None
, only labels from label-lists with an id contained in this list are considered.
Returns: A set of distinct label-values.
Return type: set
-
all_tokens
(delimiter=' ', label_list_ids=None)¶ Return a set of all tokens occurring in one of the labels in the corpus.
Parameters: - delimiter (str) – The delimiter used to split labels into tokens.
(see
audiomate.annotations.Label.tokenized()
) - label_list_ids (list) – If not
None
, only labels from label-lists with an idx contained in this list are considered.
Returns: A set of distinct tokens.
Return type: set
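As a rough illustration of how labels are broken into distinct tokens, here is a minimal sketch that works on plain label strings. The function name `collect_tokens` is hypothetical, and dropping empty tokens is an assumption; `audiomate.annotations.Label.tokenized()` may treat repeated delimiters differently.

```python
def collect_tokens(label_values, delimiter=' '):
    """Collect the distinct tokens from an iterable of label strings."""
    tokens = set()
    for value in label_values:
        # Assumption: empty strings caused by repeated delimiters are dropped.
        tokens.update(t for t in value.split(delimiter) if t)
    return tokens


print(sorted(collect_tokens(['a b c', 'b  d'])))  # ['a', 'b', 'c', 'd']
```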
-
contains_issuer
(issuer)¶ Return
True
if the given issuer is in the corpus already, False
otherwise.
-
contains_track
(track)¶ Return
True
if the given track is in the corpus already, False
otherwise.
-
feature_containers
¶ Return the feature-containers in the corpus.
Returns: - A dictionary containing
audiomate.container.FeatureContainer
objects with the feature-idx as key.
Return type: dict
-
issuers
¶ Return the issuers in the corpus.
Returns: - A dictionary containing
audiomate.issuers.Issuer
objects with the issuer-idx as key.
Return type: dict
-
label_count
(label_list_ids=None)¶ Return a dictionary containing the number of times each label-value occurs in this corpus.
Parameters: label_list_ids (list) – If not None
, only labels from label-lists with an id contained in this list are considered.
Returns: A dictionary containing the number of occurrences with the label-value as key.
Return type: dict
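Counting label-values with an optional restriction to certain label-lists can be sketched as follows. This is an illustration only: the function name `count_label_values` and the simplified data layout (a dict mapping a label-list id to its label-values) are assumptions, not audiomate's data model.

```python
from collections import Counter


def count_label_values(label_lists, label_list_ids=None):
    """Count label-values across label-lists.

    label_lists maps a label-list id to a list of label-values
    (a simplified, hypothetical layout).
    """
    counts = Counter()
    for list_id, values in label_lists.items():
        # None means "consider every label-list".
        if label_list_ids is None or list_id in label_list_ids:
            counts.update(values)
    return dict(counts)


label_lists = {'words': ['hi', 'there', 'hi'], 'phones': ['h', 'ai']}
print(count_label_values(label_lists, label_list_ids=['words']))  # {'hi': 2, 'there': 1}
```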
-
label_durations
(label_list_ids=None)¶ Return a dictionary containing the total duration of each label-value occurring in this corpus.
Parameters: label_list_ids (list) – If not None, only labels from label-lists with an id contained in this list are considered.
Returns: A dictionary containing the total duration with the label-value as key.
Return type: dict
-
name
¶ Return the name of the dataset (Equals basename of the path, if not None).
-
num_feature_containers
¶ Return the number of feature-containers in the corpus.
-
num_issuers
¶ Return the number of issuers in the corpus.
-
num_subviews
¶ Return the number of subviews in the corpus.
-
num_tracks
¶ Return number of tracks.
-
num_utterances
¶ Return number of utterances.
-
classmethod
parse
(representation, corpus=None)[source]¶ Creates a subview from a string representation (created with
self.serialize
).
Parameters: representation (str) – The representation.
Returns: The created subview.
Return type: Subview
-
serialize
()[source]¶ Return a string representing the subview with all of its filter criteria.
Returns: String with subview definition. Return type: str
-
split_utterances_to_max_time
(max_time=60.0, overlap=0.0)¶ Create a new corpus in which every utterance has at most the given maximal duration. Utterances longer than
max_time
are split up into multiple utterances.
Warning
Subviews and FeatureContainers are not added to the newly created corpus.
Parameters: - max_time (float) – Maximal duration for target utterances in seconds.
- overlap (float) – Amount of overlap in seconds. The overlap is measured from the center of the splitting. (The actual overlap of two segments is 2 * overlap)
Returns: A new corpus instance.
Return type:
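The relation between cut points and overlap can be made concrete with a small sketch. It assumes that an utterance is cut into equally long pieces and that each cut point is widened by overlap on both sides; the function name `split_interval` is hypothetical and the library's actual cutting strategy may differ. Note that, in this sketch, a widened segment can be slightly longer than max_time.

```python
import math


def split_interval(start, end, max_time=60.0, overlap=0.0):
    """Split [start, end] into equally long segments of at most max_time.

    Each inner cut point is widened by `overlap` on both sides, so two
    neighbouring segments share 2 * overlap seconds.
    """
    duration = end - start
    num_parts = max(1, math.ceil(duration / max_time))
    part = duration / num_parts
    cuts = [start + i * part for i in range(1, num_parts)]

    segments = []
    seg_start = start
    for cut in cuts:
        segments.append((seg_start, min(end, cut + overlap)))
        seg_start = max(start, cut - overlap)
    segments.append((seg_start, end))
    return segments


print(split_interval(0.0, 150.0, max_time=60.0, overlap=1.0))
# [(0.0, 51.0), (49.0, 101.0), (99.0, 150.0)]
```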
-
stats
()¶ Return statistics calculated over all samples of all utterances in the corpus.
Returns: A DataStats object containing statistics over all samples in the corpus.
Return type: DataStats
-
stats_per_utterance
()¶ Return statistics calculated for all samples of each utterance in the corpus.
Returns: A dictionary containing a DataStats object for each utterance.
Return type: dict
-
subviews
¶ Return the subviews of the corpus.
Returns: - A dictionary containing
audiomate.corpus.Subview
objects with the subview-idx as key.
Return type: dict
-
total_duration
¶ Return the total amount of audio summed over all utterances in the corpus in seconds.
-
tracks
¶ Return the tracks in the corpus.
Returns: - A dictionary containing
audiomate.track.Track
objects with the track-idx as key.
Return type: dict
-
utterances
¶ Return the utterances in the corpus.
Returns: - A dictionary containing
audiomate.corpus.assets.Utterance
objects with the utterance-idx as key.
Return type: dict
Filter¶
-
class
audiomate.corpus.subset.
FilterCriterion
[source]¶ A filter criterion decides whether a given utterance contained in a given corpus matches the filter.
-
match
(utterance, corpus)[source]¶ Check if the utterance matches the filter.
Parameters: - utterance (Utterance) – The utterance to match.
- corpus (CorpusView) – The corpus that contains the utterance.
Returns: True if the filter matches the utterance, False otherwise.
Return type: bool
-
classmethod
parse
(representation)[source]¶ Create a filter criterion based on a string representation (created with
serialize
).
Parameters: representation (str) – The string representation.
Returns: The filter criterion from that representation.
Return type: FilterCriterion
-
MatchingUtteranceIdxFilter¶
-
class
audiomate.corpus.subset.
MatchingUtteranceIdxFilter
(utterance_idxs, inverse=False)[source]¶ A filter criterion that matches utterances based on utterance-ids.
Parameters: - utterance_idxs (
set
) – A set of utterance-ids. Only utterances in the set pass the filter.
- inverse (bool) – If True, only utterances not in the set pass the filter.
MatchingLabelFilter¶
-
class
audiomate.corpus.subset.
MatchingLabelFilter
(labels, label_list_ids=None)[source]¶ A filter criterion that accepts only utterances whose labels are all among the given labels.
Parameters: - labels (
set
) – A set of labels which are accepted.
- label_list_ids (
set
) – Only check label-lists with these ids. If None
, checks all label-lists.
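The acceptance rule can be read as a subset test: every label-value of the utterance (in the checked label-lists) has to be in the accepted set. The sketch below illustrates that reading; `matches_labels` and the dict-based layout are hypothetical and not part of audiomate.

```python
def matches_labels(utt_label_lists, accepted, label_list_ids=None):
    """Return True if every checked label-value is in `accepted`.

    utt_label_lists maps a label-list id to the label-values of one
    utterance (a simplified, hypothetical layout).
    """
    for list_id, values in utt_label_lists.items():
        if label_list_ids is not None and list_id not in label_list_ids:
            continue  # this label-list is not checked
        if not set(values) <= set(accepted):
            return False
    return True


utt = {'genre': ['music'], 'lang': ['de']}
print(matches_labels(utt, {'music', 'speech'}))             # False, 'de' is not accepted
print(matches_labels(utt, {'music', 'speech'}, {'genre'}))  # True
```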
Splitter¶
-
class
audiomate.corpus.subset.
Splitter
(corpus, random_seed=None)[source]¶ A splitter provides methods for splitting a corpus into different subsets, using different approaches (methods prefixed with
split_by_
). Most of these methods take a proportions parameter, which defines the relative sizes of the subsets. The subsets are returned as audiomate.corpus.Subview
.
Parameters: - corpus (Corpus) – The corpus that should be split.
- random_seed (int) – Seed to use for random number generation.
-
split
(proportions, separate_issuers=False)[source]¶ Split the corpus based on the number of utterances. The utterances are distributed to len(proportions) subsets, according to the ratios proportions[subset].
Parameters: - proportions (dict) – A dictionary containing the relative size of the target subsets. The key is an identifier for the subset.
- separate_issuers (bool) – If True it makes sure that all utterances of an issuer are in the same subset.
Returns: - A dictionary containing the subsets with the identifier
from the input as key.
Return type: (dict)
Example:
>>> spl = Splitter(corpus)
>>> corpus.num_utterances
100
>>> subsets = spl.split(proportions={
...     "train" : 0.6,
...     "dev" : 0.2,
...     "test" : 0.2
... })
>>> print(subsets)
{'dev': <audiomate.corpus.subview.Subview at 0x104ce7400>,
'test': <audiomate.corpus.subview.Subview at 0x104ce74e0>,
'train': <audiomate.corpus.subview.Subview at 0x104ce7438>}
>>> subsets['train'].num_utterances
60
>>> subsets['dev'].num_utterances
20
>>> subsets['test'].num_utterances
20
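What separate_issuers=True implies can be sketched by splitting whole issuer groups instead of single utterances. This is only an illustration under simplifying assumptions (a plain dict from utterance id to issuer id, a greedy largest-group-first assignment); `split_keeping_issuers_together` is a hypothetical name and the Splitter may use a different strategy.

```python
from collections import defaultdict


def split_keeping_issuers_together(utt_to_issuer, proportions):
    """Assign all utterances of one issuer to the same subset."""
    groups = defaultdict(list)
    for utt_id, issuer in sorted(utt_to_issuer.items()):
        groups[issuer].append(utt_id)

    share_sum = sum(proportions.values())
    total = len(utt_to_issuer)
    targets = {name: total * share / share_sum for name, share in proportions.items()}
    subsets = {name: [] for name in proportions}

    # Largest issuers first; each goes to the subset with the most room left.
    for issuer in sorted(groups, key=lambda i: (-len(groups[i]), i)):
        name = max(subsets, key=lambda n: targets[n] - len(subsets[n]))
        subsets[name].extend(groups[issuer])
    return subsets


utt_to_issuer = {'u1': 'A', 'u2': 'A', 'u3': 'A', 'u4': 'B', 'u5': 'B', 'u6': 'C'}
print(split_keeping_issuers_together(utt_to_issuer, {'train': 0.5, 'test': 0.5}))
# {'train': ['u1', 'u2', 'u3'], 'test': ['u4', 'u5', 'u6']}
```

Because whole groups are moved, the resulting sizes can only approximate the requested proportions.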
-
split_by_audio_duration
(proportions, separate_issuers=False)[source]¶ Split the corpus based on the total duration of audio. The utterances are distributed to len(proportions) subsets so that each subset contains audio with a duration proportional to the given proportions.
Parameters: - proportions (dict) – A dictionary containing the relative size of the target subsets. The key is an identifier for the subset.
- separate_issuers (bool) – If True it makes sure that all utterances of an issuer are in the same subset.
Returns: - A dictionary containing the subsets with the identifier
from the input as key.
Return type: (dict)
Example:
>>> spl = Splitter(corpus)
>>> corpus.num_utterances
100
>>> subsets = spl.split_by_audio_duration(proportions={
...     "train" : 0.6,
...     "dev" : 0.2,
...     "test" : 0.2
... })
>>> print(subsets)
{'dev': <audiomate.corpus.subview.Subview at 0x104ce7400>,
'test': <audiomate.corpus.subview.Subview at 0x104ce74e0>,
'train': <audiomate.corpus.subview.Subview at 0x104ce7438>}
>>> subsets['train'].num_utterances
55
>>> subsets['dev'].num_utterances
35
>>> subsets['test'].num_utterances
10
-
split_by_label_duration
(proportions, separate_issuers=False)[source]¶ Split the corpus based on the total duration of labels (end - start). The utterances are distributed to len(proportions) subsets. Utterances are split up in a way that each subset contains labels with a duration proportional to the given proportions.
Parameters: - proportions (dict) – A dictionary containing the relative size of the target subsets. The key is an identifier for the subset.
- separate_issuers (bool) – If True it makes sure that all utterances of an issuer are in the same subset.
Returns: - A dictionary containing the subsets with the identifier
from the input as key.
Return type: (dict)
Example:
>>> spl = Splitter(corpus)
>>> corpus.num_utterances
100
>>> subsets = spl.split_by_label_duration(proportions={
...     "train" : 0.6,
...     "dev" : 0.2,
...     "test" : 0.2
... })
>>> print(subsets)
{'dev': <audiomate.corpus.subview.Subview at 0x104ce7400>,
'test': <audiomate.corpus.subview.Subview at 0x104ce74e0>,
'train': <audiomate.corpus.subview.Subview at 0x104ce7438>}
>>> subsets['train'].num_utterances
55
>>> subsets['dev'].num_utterances
35
>>> subsets['test'].num_utterances
10
-
split_by_label_length
(proportions, label_list_idx=None, separate_issuers=False)[source]¶ Split the corpus based on the total length of the label-list. The utterances are distributed to len(proportions) subsets so that each subset contains labels summed up to a length proportional to the given proportions. Length is defined as the number of characters.
Parameters: - proportions (dict) – A dictionary containing the relative size of the target subsets. The key is an identifier for the subset.
- label_list_idx (str) – The idx of the label-list to use for computing the length. If None all label-lists are used.
- separate_issuers (bool) – If True it makes sure that all utterances of an issuer are in the same subset.
Returns: - A dictionary containing the subsets with the identifier
from the input as key.
Return type: (dict)
Example:
>>> spl = Splitter(corpus)
>>> corpus.num_utterances
100
>>> subsets = spl.split_by_label_length(proportions={
...     "train" : 0.6,
...     "dev" : 0.2,
...     "test" : 0.2
... })
>>> print(subsets)
{'dev': <audiomate.corpus.subview.Subview at 0x104ce7400>,
'test': <audiomate.corpus.subview.Subview at 0x104ce74e0>,
'train': <audiomate.corpus.subview.Subview at 0x104ce7438>}
>>> subsets['train'].num_utterances
55
>>> subsets['dev'].num_utterances
35
>>> subsets['test'].num_utterances
10
-
split_by_label_occurence
(proportions, separate_issuers=False)[source]¶ Split the corpus based on the total number of occurrences of labels. The utterances are distributed to len(proportions) subsets so that each subset contains a number of label occurrences proportional to the given proportions.
Parameters: - proportions (dict) – A dictionary containing the relative size of the target subsets. The key is an identifier for the subset.
- separate_issuers (bool) – If True it makes sure that all utterances of an issuer are in the same subset.
Returns: - A dictionary containing the subsets with the identifier
from the input as key.
Return type: (dict)
Example:
>>> spl = Splitter(corpus)
>>> corpus.num_utterances
100
>>> subsets = spl.split_by_label_occurence(proportions={
...     "train" : 0.6,
...     "dev" : 0.2,
...     "test" : 0.2
... })
>>> print(subsets)
{'dev': <audiomate.corpus.subview.Subview at 0x104ce7400>,
'test': <audiomate.corpus.subview.Subview at 0x104ce74e0>,
'train': <audiomate.corpus.subview.Subview at 0x104ce7438>}
>>> subsets['train'].num_utterances
55
>>> subsets['dev'].num_utterances
35
>>> subsets['test'].num_utterances
10
SubsetGenerator¶
-
class
audiomate.corpus.subset.
SubsetGenerator
(corpus, random_seed=None)[source]¶ This class is used to generate subsets of a corpus.
Parameters: - corpus (Corpus) – The corpus to create subsets from.
- random_seed (int) – Seed to use for random number generation.
-
maximal_balanced_subset
(by_duration=False, label_list_ids=None)[source]¶ Create a subset of the corpus that is as big as possible while keeping the labels approximately balanced. The label with the shortest duration (or with the fewest utterances if by_duration=False) is taken as reference. All other labels are selected so they match the shortest one as far as possible.
Parameters: - by_duration (bool) – If True the size measure is the duration of all utterances in a subset/corpus.
- label_list_ids (list) – List of label-list ids. If none is given, all label-lists are considered for balancing. Otherwise only the ones that are in the list are considered.
Returns: The subview representing the subset.
Return type:
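The idea of using the rarest label as reference can be sketched for the count-based case. The sketch assumes exactly one label per utterance, which is a strong simplification; `balanced_by_count` is a hypothetical name and does not reproduce the library's selection logic.

```python
import random
from collections import defaultdict


def balanced_by_count(utt_labels, seed=None):
    """Pick equally many utterances per label, as many as the rarest label allows.

    utt_labels maps an utterance id to its single label (simplifying
    assumption: one label per utterance, balancing by count).
    """
    by_label = defaultdict(list)
    for utt_id in sorted(utt_labels):
        by_label[utt_labels[utt_id]].append(utt_id)

    # The label with the fewest utterances is the reference.
    reference = min(len(ids) for ids in by_label.values())
    rng = random.Random(seed)
    selected = []
    for label in sorted(by_label):
        selected.extend(rng.sample(by_label[label], reference))
    return sorted(selected)


utt_labels = {'u1': 'music', 'u2': 'music', 'u3': 'music', 'u4': 'speech'}
print(len(balanced_by_count(utt_labels, seed=0)))  # 2
```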
-
random_subset
(relative_size, balance_labels=False, label_list_ids=None)[source]¶ Create a subview of random utterances with an approximate size relative to the full corpus. By default x random utterances are selected with x equal to
relative_size * corpus.num_utterances
.
Parameters: - relative_size (float) – A value between 0 and 1. (0.5 will create a subset with approximately 50% of the full corpus size)
- balance_labels (bool) – If True, the labels of the selected utterances are balanced as far as possible. So the count/duration of every label within the subset is equal.
- label_list_ids (list) – List of label-list ids. If none is given, all label-lists are considered for balancing. Otherwise only the ones that are in the list are considered.
Returns: The subview representing the subset.
Return type:
-
random_subset_by_duration
(relative_duration, balance_labels=False, label_list_ids=None)[source]¶ Create a subview of random utterances with an approximate duration relative to the full corpus. Random utterances are selected so that the sum of their durations approximately equals the given fraction of the full corpus duration.
Parameters: - relative_duration (float) – A value between 0 and 1. (e.g. 0.5 will create a subset with approximately 50% of the full corpus duration)
- balance_labels (bool) – If True, the labels of the selected utterances are balanced as far as possible. So the count/duration of every label within the subset is equal.
- label_list_ids (list) – List of label-list ids. If none is given, all label-lists are considered for balancing. Otherwise only the ones that are in the list are considered.
Returns: The subview representing the subset.
Return type:
-
random_subsets
(relative_sizes, by_duration=False, balance_labels=False, label_list_ids=None)[source]¶ Create several subsets with the given sizes relative to the size or duration of the full corpus. Basically the same as calling
random_subset
or random_subset_by_duration
multiple times with different values. But this method makes sure that every subset contains only utterances that are also contained in the next bigger subset.
Parameters: - relative_sizes (list) – A list of numbers between 0 and 1 indicating the sizes of the desired subsets, relative to the full corpus.
- by_duration (bool) – If True the size measure is the duration of all utterances in a subset/corpus.
- balance_labels (bool) – If True the labels contained in a subset are chosen to be balanced as far as possible.
- label_list_ids (list) – List of label-list ids. If none is given, all label-lists are considered for balancing. Otherwise only the ones that are in the list are considered.
Returns: A dictionary containing all subsets with the relative size as key.
Return type: dict
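One simple way to obtain nested subsets is to shuffle the utterance ids once and take prefixes of increasing length. The sketch below shows that idea for the count-based case only; `nested_random_subsets` is a hypothetical name, and whether the library proceeds this way is an assumption.

```python
import random


def nested_random_subsets(utt_ids, relative_sizes, seed=None):
    """Return {relative_size: ids}; smaller subsets are contained in bigger ones."""
    order = sorted(utt_ids)  # sort first so the result depends only on the seed
    random.Random(seed).shuffle(order)

    subsets = {}
    for size in sorted(relative_sizes):
        count = round(size * len(order))
        subsets[size] = order[:count]  # prefixes of one shuffled order are nested
    return subsets


ids = ['utt-{}'.format(i) for i in range(10)]
subsets = nested_random_subsets(ids, [0.5, 0.2], seed=42)
print(len(subsets[0.2]), len(subsets[0.5]))  # 2 5
print(set(subsets[0.2]) <= set(subsets[0.5]))  # True
```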
Utils¶
-
audiomate.corpus.subset.utils.
absolute_proportions
(proportions, count)[source]¶ Split a given integer count into len(proportions) parts so that they sum up to count and match the given proportions.
Parameters: proportions (dict) – Dict of proportions, with an identifier as key.
Returns: Dictionary with absolute proportions and same identifiers as key.
Return type: dict
Example:
>>> absolute_proportions({'train': 0.5, 'test': 0.5}, 100) {'train': 50, 'test': 50}
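When the proportions do not divide count evenly, some rounding rule is needed so the parts still sum to count. The sketch below uses the largest-remainder method as one possible rule; this choice and the name `absolute_proportions_sketch` are assumptions, not a description of the library's rounding.

```python
import math


def absolute_proportions_sketch(proportions, count):
    """Turn relative proportions into integer sizes that sum to `count`."""
    total = sum(proportions.values())
    exact = {k: count * v / total for k, v in proportions.items()}
    sizes = {k: math.floor(x) for k, x in exact.items()}

    # Hand out the units lost by rounding down, largest fractional part first.
    leftover = count - sum(sizes.values())
    by_remainder = sorted(exact, key=lambda k: (sizes[k] - exact[k], k))
    for k in by_remainder[:leftover]:
        sizes[k] += 1
    return sizes


print(absolute_proportions_sketch({'train': 0.5, 'test': 0.5}, 100))  # {'train': 50, 'test': 50}
print(absolute_proportions_sketch({'a': 1, 'b': 1, 'c': 1}, 10))      # {'a': 4, 'b': 3, 'c': 3}
```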
-
audiomate.corpus.subset.utils.
get_identifiers_splitted_by_weights
(identifiers, proportions, seed=None)[source]¶ Divide the given identifiers based on the given proportions. Instead of splitting the identifiers randomly, the split is based on category weights. Every identifier has a weight for any number of categories. The goal is to split the identifiers so that the sum of category k within part x is proportional to the sum of category k over all parts, according to the given proportions. This is done by greedily inserting the identifiers one by one into a part that has free space (weight). If no part fits anymore, the one whose weight is exceeded the least is used. This function is deterministic for a given seed: the identifiers are first sorted and then shuffled using that seed.
Parameters: - identifiers (dict) – A dictionary containing the weights for each identifier (key). Per item a dictionary of weights per category is given.
- proportions (dict) – Dict of proportions, with an identifier as key.
- seed (int) – Seed to use for random operations.
Returns: Dictionary containing a list of identifiers per part with the same key as the proportions dict.
Return type: dict
Example:
>>> identifiers = {
...     'a': {'music': 2, 'speech': 1},
...     'b': {'music': 5, 'speech': 2},
...     'c': {'music': 2, 'speech': 4},
...     'd': {'music': 1, 'speech': 4},
...     'e': {'music': 3, 'speech': 4}
... }
>>> proportions = {
...     "train" : 0.6,
...     "dev" : 0.2,
...     "test" : 0.2
... }
>>> get_identifiers_splitted_by_weights(identifiers, proportions)
{
    'train': ['a', 'b', 'd'],
    'dev': ['c'],
    'test': ['e']
}
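The greedy procedure described above can be sketched roughly as follows. This is one possible reading of the description, not the library's code: `split_by_weights_sketch` is a hypothetical name, and measuring "exceeded the least" as the summed overflow across all categories is an assumption.

```python
import random


def split_by_weights_sketch(identifiers, proportions, seed=None):
    """Greedily assign identifiers to parts so category sums follow proportions."""
    categories = sorted({c for w in identifiers.values() for c in w})
    totals = {c: sum(w.get(c, 0) for w in identifiers.values()) for c in categories}
    share_sum = sum(proportions.values())

    # Free space per part and category, derived from the proportions.
    free = {part: {c: totals[c] * share / share_sum for c in categories}
            for part, share in proportions.items()}
    parts = {part: [] for part in proportions}

    order = sorted(identifiers)  # sort, then shuffle: deterministic per seed
    random.Random(seed).shuffle(order)

    for ident in order:
        weights = identifiers[ident]

        def overflow(part):
            # Total weight by which this item would exceed the part's free space.
            return sum(max(0, weights.get(c, 0) - free[part][c]) for c in categories)

        # min() picks a fitting part (overflow 0) if there is one, otherwise
        # the part that is exceeded the least.
        target = min(sorted(parts), key=overflow)
        parts[target].append(ident)
        for c in categories:
            free[target][c] -= weights.get(c, 0)
    return parts


identifiers = {
    'a': {'music': 2, 'speech': 1},
    'b': {'music': 5, 'speech': 2},
    'c': {'music': 2, 'speech': 4},
    'd': {'music': 1, 'speech': 4},
    'e': {'music': 3, 'speech': 4},
}
proportions = {'train': 0.6, 'dev': 0.2, 'test': 0.2}
print(split_by_weights_sketch(identifiers, proportions, seed=1))
```

The exact assignment depends on the seed, so the output is not shown here; every identifier ends up in exactly one part.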
-
audiomate.corpus.subset.utils.
select_balanced_subset
(items, select_count, categories, select_count_values=None, seed=None)[source]¶ Select items so the summed category weights are balanced. Each item has a dictionary containing the category weights. Items are selected until
select_count
is reached. The value that an item contributes towards select_count
can be defined in the dictionary select_count_values
. If this is not defined it is assumed to be 1, which means select_count items are selected.
Parameters: - items (dict) – Dictionary containing items with category weights.
- select_count (float) – Value to reach for selected items.
- categories (list) – List of all categories.
- select_count_values (dict) – The select_count values to be used. For example an utterance with multiple labels: The category weights (label-lengths) are used for balance, but the utterance-duration is used for reaching the select_count.
Returns: List of item ids, containing
number_of_items
(or len(items)
if smaller).
Return type: list
Example
>>> items = {
...     'utt-1' : {'m': 1, 's': 0, 'n': 0},
...     'utt-2' : {'m': 0, 's': 2, 'n': 1},
...     ...
... }
>>> select_balanced_subset(items, 5, categories=['m', 's', 'n'])
['utt-1', 'utt-3', 'utt-9', 'utt-33', 'utt-34']
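A balanced selection can be approximated greedily: at each step, pick the item that keeps the gap between the heaviest and the lightest category as small as possible. The sketch below follows that idea with every item counting 1 towards select_count; `select_balanced_sketch` and the gap criterion are assumptions and may differ from the library's approach.

```python
import random


def select_balanced_sketch(items, select_count, categories, seed=None):
    """Greedily pick items so the summed category weights stay balanced."""
    order = sorted(items)
    random.Random(seed).shuffle(order)

    sums = {c: 0 for c in categories}
    selected = []
    remaining = list(order)
    while remaining and len(selected) < select_count:

        def spread_after(item_id):
            # Gap between heaviest and lightest category if the item is added.
            new = [sums[c] + items[item_id].get(c, 0) for c in categories]
            return max(new) - min(new)

        best = min(remaining, key=spread_after)
        remaining.remove(best)
        selected.append(best)
        for c in categories:
            sums[c] += items[best].get(c, 0)
    return selected


demo_items = {
    'a': {'m': 1, 's': 0},
    'b': {'m': 0, 's': 1},
    'c': {'m': 1, 's': 0},
    'd': {'m': 0, 's': 1},
}
chosen = select_balanced_sketch(demo_items, 2, ['m', 's'], seed=3)
print(sum(demo_items[i]['m'] for i in chosen), sum(demo_items[i]['s'] for i in chosen))  # 1 1
```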
-
audiomate.corpus.subset.utils.
split_identifiers
(identifiers, proportions, seed=None)[source]¶ Split the given identifiers by the given proportions. This function is deterministic for a given seed: the identifiers are first sorted and then shuffled using that seed.
Parameters: - identifiers (list) – List of identifiers (str).
- proportions (dict) – A dictionary containing the proportions with the identifier from the input as key.
- seed (int) – Seed to use for random operations.
Returns: Dictionary containing a list of identifiers per part with the same key as the proportions dict.
Return type: dict
Example:
>>> split_identifiers(
...     identifiers=['a', 'b', 'c', 'd'],
...     proportions={'melvin' : 0.5, 'timmy' : 0.5}
... )
{'melvin' : ['a', 'c'], 'timmy' : ['b', 'd']}
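The sort-then-shuffle step is what makes the result independent of the input order. A minimal sketch of that idea, slicing the shuffled list at cumulative proportions, could look like this; `split_identifiers_sketch` is a hypothetical name and the rounding at the slice boundaries is an assumption.

```python
import random


def split_identifiers_sketch(identifiers, proportions, seed=None):
    """Split identifiers into parts whose sizes follow the proportions."""
    order = sorted(identifiers)  # sorting removes any dependence on input order
    random.Random(seed).shuffle(order)

    share_sum = sum(proportions.values())
    names = list(proportions)
    parts = {}
    start = 0
    cumulative = 0.0
    for i, name in enumerate(names):
        cumulative += proportions[name] / share_sum
        # The last part takes everything that is left, so nothing is lost to rounding.
        end = len(order) if i == len(names) - 1 else round(cumulative * len(order))
        parts[name] = order[start:end]
        start = end
    return parts


parts = split_identifiers_sketch(['d', 'b', 'a', 'c'], {'melvin': 0.5, 'timmy': 0.5}, seed=7)
print(len(parts['melvin']), len(parts['timmy']))  # 2 2
```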