audiomate.encoding

The encoding module provides functionality to encode labels, for example to produce targets for training a DNN.

Encoder

class audiomate.encoding.Encoder[source]

Base class for an encoder. The goal of an encoder is to extract encoded targets for an utterance. The base class provides the functionality to encode a full corpus. A concrete encoder only has to implement encode_utterance, which encodes a single utterance.

For example, to train a frame classifier, an encoder extracts one-hot encoded vectors from a label-list.
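
The split between the base class and a concrete encoder can be illustrated with a minimal, library-independent sketch. It is not audiomate's implementation: SketchEncoder and LengthEncoder are hypothetical names, the corpus is assumed to be a plain dict of utterance ids to strings, and the result is a dict rather than a Container written to output_path.

```python
from abc import ABC, abstractmethod


class SketchEncoder(ABC):
    """Hypothetical stand-in for the Encoder base class (not audiomate code)."""

    @abstractmethod
    def encode_utterance(self, utterance, corpus=None):
        """Return the encoded targets for a single utterance."""

    def encode_corpus(self, corpus):
        # Assumption: the corpus is a dict mapping utterance ids to utterances.
        # The real method stores the results in a Container instead.
        return {
            utt_idx: self.encode_utterance(utt, corpus=corpus)
            for utt_idx, utt in corpus.items()
        }


class LengthEncoder(SketchEncoder):
    """Toy encoder: encodes an utterance (here a string) as its word count."""

    def encode_utterance(self, utterance, corpus=None):
        return [len(utterance.split())]


toy_corpus = {'utt-1': 'down the road', 'utt-2': 'up'}
encoded = LengthEncoder().encode_corpus(toy_corpus)
print(encoded)  # {'utt-1': [3], 'utt-2': [1]}
```

Only encode_utterance differs between encoders in this sketch; the loop over the corpus lives in the base class.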

encode_corpus(corpus, output_path)[source]

Encode all utterances of the given corpus and store them in an audiomate.container.Container.

Parameters:
  • corpus (Corpus) – The corpus to process.
  • output_path (str) – The path to store the container with the encoded data.
Returns:

The container with the encoded data.

Return type:

Container

encode_utterance(utterance, corpus=None)[source]

Encode the given utterance.

Parameters:
  • utterance (Utterance) – The utterance to encode.
  • corpus (Corpus) – The corpus the utterance is from.
Returns:

Encoded data.

Return type:

np.ndarray

Frame-Based

class audiomate.encoding.FrameHotEncoder(labels, label_list_idx, frame_settings, sr=None)[source]

The FrameHotEncoder encodes the labels per frame. It creates a matrix of shape num-frames x len(labels). Each row (indexed along the second dimension) has one entry for every label in the passed labels list. An entry is set to 1 if the corresponding label occurs within that frame.

Parameters:
  • labels (list) – List of labels (str) which should be included in the vector representation.
  • label_list_idx (str) – The name of the label-list to use for encoding. Only labels of this label-list are considered.
  • frame_settings (FrameSettings) – Frame settings to use.
  • sr (int) – The sampling rate used. If None, the native sampling rate of the file is assumed.

Example

>>> from audiomate import annotations
>>> from audiomate.encoding import FrameHotEncoder
>>> from audiomate.utils import units
>>>
>>> # utt is an existing Utterance (not created in this example).
>>> ll = annotations.LabelList(idx='test', labels=[
...     annotations.Label('music', 0, 2),
...     annotations.Label('speech', 2, 5),
...     annotations.Label('noise', 4, 6),
...     annotations.Label('music', 6, 8)
... ])
>>> utt.set_label_list(ll)
>>>
>>> labels = ['speech', 'music', 'noise']
>>> fs = units.FrameSettings(16000, 16000)
>>> encoder = FrameHotEncoder(labels, 'test', frame_settings=fs, sr=16000)
>>> encoder.encode_utterance(utt)
array([
    [0, 1, 0],
    [0, 1, 0],
    [1, 0, 0],
    [1, 0, 0],
    [1, 0, 1],
    [0, 0, 1],
    [0, 1, 0],
    [0, 1, 0]
])

encode_utterance(utterance, corpus=None)[source]

Encode the given utterance.

Parameters:
  • utterance (Utterance) – The utterance to encode.
  • corpus (Corpus) – The corpus the utterance is from.
Returns:

Encoded data.

Return type:

np.ndarray
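
The frame-wise one-hot encoding described for FrameHotEncoder can be sketched without the library. This is an illustration of the idea, not audiomate's code: frame_hot_encode is a hypothetical helper, labels are given as (label, start, end) tuples in seconds, and frames are assumed to be non-overlapping and one second long, which corresponds to a frame size and hop size of 16000 samples at 16 kHz.

```python
def frame_hot_encode(segments, labels, num_frames, frame_length=1.0):
    """Return a num_frames x len(labels) matrix of 0/1 entries.

    segments: list of (label, start, end) tuples, times in seconds.
    Assumption: frames do not overlap (hop size equals frame length).
    """
    matrix = [[0] * len(labels) for _ in range(num_frames)]

    for frame_idx in range(num_frames):
        frame_start = frame_idx * frame_length
        frame_end = frame_start + frame_length

        for label, start, end in segments:
            # A label is active in a frame if the two intervals overlap.
            if label in labels and start < frame_end and end > frame_start:
                matrix[frame_idx][labels.index(label)] = 1

    return matrix


segments = [
    ('music', 0, 2),
    ('speech', 2, 5),
    ('noise', 4, 6),
    ('music', 6, 8),
]
hot = frame_hot_encode(segments, ['speech', 'music', 'noise'], num_frames=8)
for row in hot:
    print(row)
```

With the label-list from the example above, the fifth row is [1, 0, 1] because speech and noise both overlap the frame from second 4 to second 5.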

class audiomate.encoding.FrameOrdinalEncoder(labels, label_list_idx, frame_settings, sr=None)[source]

The FrameOrdinalEncoder encodes the labels per frame. It creates a vector of length num-frames. For every frame, it stores the index of the label that is present in that frame. If multiple labels are present, the one that covers the longest part of the frame is selected. If multiple labels cover the same length, the one with the smaller index is selected; hence the order of the passed labels list acts as a priority.

Parameters:
  • labels (list) – List of labels (str) which should be included in the vector representation.
  • label_list_idx (str) – The name of the label-list to use for encoding. Only labels of this label-list are considered.
  • frame_settings (FrameSettings) – Frame settings to use.
  • sr (int) – The sampling rate used. If None, the native sampling rate of the file is assumed.

Example

>>> from audiomate import annotations
>>> from audiomate.encoding import FrameOrdinalEncoder
>>> from audiomate.utils import units
>>>
>>> # utt is an existing Utterance (not created in this example).
>>> ll = annotations.LabelList(idx='test', labels=[
...     annotations.Label('music', 0, 2),
...     annotations.Label('speech', 2, 5),
...     annotations.Label('noise', 4, 6),
...     annotations.Label('music', 6, 8)
... ])
>>> utt.set_label_list(ll)
>>>
>>> labels = ['speech', 'music', 'noise']
>>> fs = units.FrameSettings(16000, 16000)
>>> encoder = FrameOrdinalEncoder(labels, 'test', frame_settings=fs)
>>> encoder.encode_utterance(utt)
array([1, 1, 0, 0, 0, 2, 1, 1])

encode_utterance(utterance, corpus=None)[source]

Encode the given utterance.

Parameters:
  • utterance (Utterance) – The utterance to encode.
  • corpus (Corpus) – The corpus the utterance is from.
Returns:

Encoded data.

Return type:

np.ndarray
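
The selection rule of the FrameOrdinalEncoder (longest label wins, ties go to the smaller index) can be sketched as follows. This is a library-independent illustration, not audiomate's code: frame_ordinal_encode is a hypothetical helper, frames are assumed to be non-overlapping and one second long, and a frame without any label is encoded as None, which is a choice made only for this sketch.

```python
def frame_ordinal_encode(segments, labels, num_frames, frame_length=1.0):
    """Return one label index per frame.

    segments: list of (label, start, end) tuples, times in seconds.
    Assumption: frames do not overlap; empty frames are encoded as None.
    """
    encoded = []

    for frame_idx in range(num_frames):
        frame_start = frame_idx * frame_length
        frame_end = frame_start + frame_length
        best = None  # (overlap in seconds, label index)

        for label, start, end in segments:
            if label not in labels:
                continue
            overlap = min(end, frame_end) - max(start, frame_start)
            if overlap <= 0:
                continue
            index = labels.index(label)
            # Prefer the longer overlap; on a tie prefer the smaller index.
            if (best is None or overlap > best[0]
                    or (overlap == best[0] and index < best[1])):
                best = (overlap, index)

        encoded.append(None if best is None else best[1])

    return encoded


segments = [
    ('music', 0, 2),
    ('speech', 2, 5),
    ('noise', 4, 6),
    ('music', 6, 8),
]
ordinal = frame_ordinal_encode(segments, ['speech', 'music', 'noise'], 8)
print(ordinal)  # [1, 1, 0, 0, 0, 2, 1, 1]
```

In the frame from second 4 to second 5, speech and noise each cover the full frame, so the tie is resolved in favour of speech, which has index 0 in the labels list.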

Utterance-Based

class audiomate.encoding.TokenOrdinalEncoder(label_list_idx, tokens, token_delimiter=' ')[source]

Class to encode the labels of a given label-list. Every token of the labels is mapped to a number. For the full utterance, a sequence (array) of numbers is computed, where each number corresponds to a token.

Tokens are extracted from labels by splitting using a delimiter (by default space). See audiomate.annotations.Label.tokenized(). Hence a token can be word, phone, …, depending on the label and the delimiter.

Parameters:
  • label_list_idx (str) – The name of the label-list to use for encoding. Only labels of this label-list are considered.
  • tokens (list) – List of tokens that defines the mapping. The first token is mapped to 0 in the encoding, the second to 1, and so on.
  • token_delimiter (str) – Delimiter to split labels into tokens.

Example

>>> from audiomate.annotations import Label, LabelList
>>> from audiomate.encoding import TokenOrdinalEncoder
>>> from audiomate.tracks import Utterance
>>>
>>> ll = LabelList(idx='words', labels=[Label('down the  road')])
>>> utt = Utterance('utt-1', 'file-x', label_lists=ll)
>>>
>>> tokens = ['up', 'down', 'road', 'street', 'the']
>>> encoder = TokenOrdinalEncoder('words', tokens, token_delimiter=' ')
>>> encoder.encode_utterance(utt)
array([1, 4, 2])

encode_utterance(utterance, corpus=None)[source]

Encode the given utterance.

Parameters:
  • utterance (Utterance) – The utterance to encode.
  • corpus (Corpus) – The corpus the utterance is from.
Returns:

Encoded data.

Return type:

np.ndarray
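
The token mapping used by the TokenOrdinalEncoder can be sketched in a few lines. This is a library-independent illustration, not audiomate's code: token_ordinal_encode is a hypothetical helper that takes the label values as plain strings, drops empty tokens produced by repeated delimiters, and raises a KeyError for tokens that are not in the mapping (an assumption of this sketch, not documented behaviour).

```python
def token_ordinal_encode(label_values, tokens, delimiter=' '):
    """Map every token of the given label values to its index in tokens."""
    token_to_index = {token: index for index, token in enumerate(tokens)}
    encoded = []

    for value in label_values:
        for token in value.split(delimiter):
            token = token.strip()
            # Repeated delimiters produce empty strings, which are skipped.
            if token:
                encoded.append(token_to_index[token])

    return encoded


tokens = ['up', 'down', 'road', 'street', 'the']
sequence = token_ordinal_encode(['down the  road'], tokens)
print(sequence)  # [1, 4, 2]
```

The double space in 'down the  road' does not produce an extra entry, because the empty token between the two spaces is discarded before the lookup.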