audiomate.processing

The processing module provides tools for processing audio data in a batch-wise manner. The idea is to set up a predefined tool that can process all the audio from a corpus.

The basic component is the audiomate.processing.Processor. It provides the functionality to reduce any input component, such as a corpus, feature-container, utterance or file, to the abstraction of frames. A concrete implementation then only has to provide the proper method to process these frames.

Often in audio processing the same components are used in combination with others. For this purpose a pipeline can be built that processes the frames in multiple steps. The audiomate.processing.pipeline module provides the audiomate.processing.pipeline.Computation and audiomate.processing.pipeline.Reduction classes. These abstract classes can be extended to create processing components of a pipeline. The different components are then coupled to create custom pipelines.
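
For illustration, a minimal sketch of such a pipeline, using step classes documented below (the corpus loading call and the paths are assumptions about the wider audiomate API):

import audiomate
from audiomate.processing import pipeline

# Two chained computation steps: mel-spectrogram extraction followed by
# a conversion to dB. A step with no parent reads the raw input frames.
mel = pipeline.MelSpectrogram(n_mels=64)
mel_db = pipeline.PowerToDb(parent=mel)

# The last step of the pipeline acts as a processor for a whole corpus.
corpus = audiomate.Corpus.load('/path/to/corpus')
features = mel_db.process_corpus(corpus, '/path/to/mel_db_features.hdf5',
                                 frame_size=400, hop_size=160)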

Processor

class audiomate.processing.Processor[source]

The processor base class provides the functionality to process audio data on different levels (Corpus, Utterance, Track). For every level there is an offline and an online method. In offline mode the data is processed in one step (i.e. the whole track/utterance at once). This means the process_frames method is called with all frames of the track/utterance. In online mode the data is processed in chunks, so the process_frames method is called multiple times per track/utterance with different chunks.

To implement a concrete processor, the process_frames method has to be implemented. This method is called in both online and offline mode, so it is up to the implementation to decide whether a processor can be used in online mode, offline mode, or both; this differs between use cases.

If the implementation of a processor changes the frame-size or hop-size, it is expected to provide the transformation via the frame_transform method. Frame-size and hop-size are measured in samples with regard to the original audio signal (i.e. its sampling rate).
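
A minimal sketch of a concrete processor (the ScaleProcessor class is hypothetical and only for illustration); since it works frame-wise, it behaves the same in online and offline mode and keeps the default frame_transform:

from audiomate.processing import Processor

class ScaleProcessor(Processor):
    """Hypothetical processor that multiplies every frame by a constant."""

    def __init__(self, factor=2.0):
        self.factor = factor

    def process_frames(self, data, sampling_rate, offset=0, last=False,
                       utterance=None, corpus=None):
        # data has shape (num-frames, frame-dimensions)
        return data * self.factor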

frame_transform(frame_size, hop_size)[source]

If the processor changes the number of samples that build up a frame or the number of samples between two consecutive frames (hop-size), this function needs to transform the original frame- and/or hop-size.

This is used to store the frame-size and hop-size in a feature-container. In the end one can calculate start and end time of a frame with this information.

By default it is assumed that the processor doesn’t change the frame-size and the hop-size.

Parameters:
  • frame_size (int) – The original frame-size.
  • hop_size (int) – The original hop-size.
Returns:

The (frame-size, hop-size) after processing.

Return type:

tuple
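
For example, a hypothetical processor that averages pairs of consecutive frames could report the changed sizes roughly like this (a sketch; the exact transformation depends on the concrete processor):

from audiomate.processing import Processor

class PairPoolProcessor(Processor):
    """Hypothetical processor that averages every two consecutive frames."""

    def process_frames(self, data, sampling_rate, offset=0, last=False,
                       utterance=None, corpus=None):
        usable = (data.shape[0] // 2) * 2
        pairs = data[:usable].reshape(-1, 2, data.shape[1])
        return pairs.mean(axis=1)

    def frame_transform(self, frame_size, hop_size):
        # An output frame spans two input frames (one extra hop of samples)
        # and output frames are two input hops apart.
        return frame_size + hop_size, 2 * hop_size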

process_corpus(corpus, output_path, frame_size=400, hop_size=160, sr=None)[source]

Process all utterances of the given corpus and save the processed features in a feature-container. The utterances are processed in offline mode, i.e. every utterance in one go.

Parameters:
  • corpus (Corpus) – The corpus to process the utterances from.
  • output_path (str) – A path to save the feature-container to.
  • frame_size (int) – The number of samples per frame.
  • hop_size (int) – The number of samples between two frames.
  • sr (int) – Use the given sampling rate. If None, the native sampling rate of the underlying data is used.
Returns:

The feature-container containing the processed features.

Return type:

FeatureContainer
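
Typical usage might look like the following (the corpus loading call and the paths are assumptions):

import audiomate
from audiomate.processing import pipeline

corpus = audiomate.Corpus.load('/path/to/corpus')

# Extract 13 MFCCs per frame for every utterance, resampled to 16 kHz,
# and store them in a feature-container on disk.
mfcc = pipeline.MFCC(n_mfcc=13, n_mels=64)
container = mfcc.process_corpus(corpus, '/path/to/mfcc.hdf5',
                                frame_size=400, hop_size=160, sr=16000)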

process_corpus_online(corpus, output_path, frame_size=400, hop_size=160, chunk_size=1, buffer_size=5760000)[source]

Process all utterances of the given corpus and save the processed features in a feature-container. The utterances are processed in online mode, so chunk by chunk.

Parameters:
  • corpus (Corpus) – The corpus to process the utterances from.
  • output_path (str) – A path to save the feature-container to.
  • frame_size (int) – The number of samples per frame.
  • hop_size (int) – The number of samples between two frames.
  • chunk_size (int) – Number of frames to process per chunk.
  • buffer_size (int) – Number of samples to load into memory at once. The exact number of loaded samples depends on the block-size of the audioread library, so it can be up to one block-size higher; the block-size is typically 1024 or 4096.
Returns:

The feature-container containing the processed features.

Return type:

FeatureContainer

process_features(corpus, input_features, output_path)[source]

Process all features of the given corpus and save the processed features in a feature-container. The features are processed in offline mode, all features of an utterance at once.

Parameters:
  • corpus (Corpus) – The corpus to process the utterances from.
  • input_features (FeatureContainer) – The feature-container to process the frames from.
  • output_path (str) – A path to save the feature-container to.
Returns:

The feature-container containing the processed features.

Return type:

FeatureContainer

process_features_online(corpus, input_features, output_path, chunk_size=1)[source]

Process all features of the given corpus and save the processed features in a feature-container. The features are processed in online mode, chunk by chunk.

Parameters:
  • corpus (Corpus) – The corpus to process the utterances from.
  • input_features (FeatureContainer) – The feature-container to process the frames from.
  • output_path (str) – A path to save the feature-container to.
  • chunk_size (int) – Number of frames to process per chunk.
Returns:

The feature-container containing the processed features.

Return type:

FeatureContainer

process_frames(data, sampling_rate, offset=0, last=False, utterance=None, corpus=None)[source]

Process the given chunk of frames. Depending on online or offline mode, the given chunk is either the full data or just part of it.

Parameters:
  • data (np.ndarray) – nD Array of frames (num-frames x frame-dimensions).
  • sampling_rate (int) – The sampling rate of the underlying signal.
  • offset (int) – The index of the first frame in the chunk, relative to the first frame of the utterance/sequence. In offline mode this is always 0.
  • last (bool) – True indicates that this is the last chunk of frames of the sequence/utterance. In offline mode this is always True.
  • utterance (Utterance) – The utterance the frame is from, if available.
  • corpus (Corpus) – The corpus the frame is from, if available.
Returns:

The processed frames.

Return type:

np.ndarray
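
The following sketch illustrates the calling convention, using the hypothetical ScaleProcessor sketched earlier: in offline mode everything is passed at once, in online mode consecutive chunks are passed with the matching offset and the last flag set on the final chunk.

import numpy as np

processor = ScaleProcessor(factor=0.5)   # hypothetical processor from above
frames = np.random.rand(10, 400)         # 10 frames of 400 samples each

# Offline: a single call, offset=0, last=True.
offline = processor.process_frames(frames, 16000, offset=0, last=True)

# Online: the same frames, split into chunks of 4 frames.
chunks = []
for start in range(0, len(frames), 4):
    part = frames[start:start + 4]
    chunks.append(processor.process_frames(part, 16000, offset=start,
                                           last=start + 4 >= len(frames)))
online = np.concatenate(chunks)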

process_track(track, frame_size=400, hop_size=160, sr=None, start=0, end=inf, utterance=None, corpus=None)[source]

Process the track in offline mode, in one go.

Parameters:
  • track (Track) – The track to process.
  • frame_size (int) – The number of samples per frame.
  • hop_size (int) – The number of samples between two frames.
  • sr (int) – Use the given sampling rate. If None, uses the native sampling rate from the underlying data.
  • start (float) – The point within the track, in seconds, at which to start processing.
  • end (float) – The point within the track, in seconds, at which to end processing.
  • utterance (Utterance) – The utterance that is associated with this track, if available.
  • corpus (Corpus) – The corpus this track is part of, if available.
Returns:

The processed features.

Return type:

np.ndarray

process_track_online(track, frame_size=400, hop_size=160, start=0, end=inf, utterance=None, corpus=None, chunk_size=1, buffer_size=5760000)[source]

Process the track in online mode, chunk by chunk. The processed chunks are yielded one after another.

Parameters:
  • track (Track) – The track to process.
  • frame_size (int) – The number of samples per frame.
  • hop_size (int) – The number of samples between two frames.
  • start (float) – The point within the track, in seconds, at which to start processing.
  • end (float) – The point within the track, in seconds, at which to end processing.
  • utterance (Utterance) – The utterance that is associated with this track, if available.
  • corpus (Corpus) – The corpus this track is part of, if available.
  • chunk_size (int) – Number of frames to process per chunk.
  • buffer_size (int) – Number of samples to load into memory at once. The exact number of loaded samples depends on the type of track, so it can be up to one block-size higher; the block-size is typically 1024 or 4096.
Returns:

A generator that yields processed chunks.

Return type:

Generator
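
Consuming the generator might look like this (the corpus loading call and the corpus.tracks access are assumptions about the wider audiomate API):

import audiomate
from audiomate.processing import pipeline

corpus = audiomate.Corpus.load('/path/to/corpus')
track = next(iter(corpus.tracks.values()))

mfcc = pipeline.MFCC(n_mfcc=13)
for chunk in mfcc.process_track_online(track, frame_size=400, hop_size=160,
                                       chunk_size=8):
    print(chunk.shape)  # one array of processed frames per chunk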

process_utterance(utterance, frame_size=400, hop_size=160, sr=None, corpus=None)[source]

Process the utterance in offline mode, in one go.

Parameters:
  • utterance (Utterance) – The utterance to process.
  • frame_size (int) – The number of samples per frame.
  • hop_size (int) – The number of samples between two frames.
  • sr (int) – Use the given sampling rate. If None, the native sampling rate of the underlying data is used.
  • corpus (Corpus) – The corpus this utterance is part of, if available.
Returns:

The processed features.

Return type:

np.ndarray

process_utterance_online(utterance, frame_size=400, hop_size=160, chunk_size=1, buffer_size=5760000, corpus=None)[source]

Process the utterance in online mode, chunk by chunk. The processed chunks are yielded one after another.

Parameters:
  • utterance (Utterance) – The utterance to process.
  • frame_size (int) – The number of samples per frame.
  • hop_size (int) – The number of samples between two frames.
  • chunk_size (int) – Number of frames to process per chunk.
  • buffer_size (int) – Number of samples to load into memory at once. The exact number of loaded samples depends on the block-size of the audioread library, so it can be up to one block-size higher; the block-size is typically 1024 or 4096.
  • corpus (Corpus) – The corpus this utterance is part of, if available.
Returns:

A generator that yields processed chunks.

Return type:

Generator

Pipeline

This module contains classes for creating frame processing pipelines.

A pipeline consists of steps of two types. A computation step takes data from a previous step (or the pipeline input) and processes it. A reduction step is used to merge the outputs of multiple previous steps: it takes the outputs of all incoming steps and produces a single data block.

The steps are managed as a directed graph, which is built by passing the parent steps to the __init__ method of a step. Every step that is created has its own graph, but inherits all nodes and edges of the graphs of its parent steps.

Every pipeline represents a processor and implements the process_frames method.
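
For instance, a common graph stacks MFCCs with their deltas; the reduction step at the end of the graph is again a full processor (a sketch using the steps documented below):

from audiomate.processing import pipeline

mfcc = pipeline.MFCC(n_mfcc=13)               # reads the raw input frames
delta = pipeline.Delta(parent=mfcc, order=1)  # computed from the MFCC output

# Merge both branches into a single feature matrix per utterance.
stacked = pipeline.Stack([mfcc, delta])

# stacked can now be used like any processor, e.g.:
# stacked.process_corpus(corpus, '/path/to/features.hdf5')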

class audiomate.processing.pipeline.Chunk(data, offset, is_last, left_context=0, right_context=0)[source]

Represents a chunk of data. It is used to pass data between different steps of a pipeline.

Parameters:
  • data (np.ndarray or list) – A single array of frames or a list of separate chunks of frames of equal size.
  • offset (int) – The index of the first frame in the chunk within the sequence.
  • is_last (bool) – Whether this is the last chunk of the sequence.
  • left_context (int) – Number of frames that act as context at the beginning of the chunk (left).
  • right_context (int) – Number of frames that act as context at the end of the chunk (right).
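
For illustration, a chunk of eight 13-dimensional frames starting at frame index 16 of a sequence could be created like this (values are arbitrary):

import numpy as np
from audiomate.processing.pipeline import Chunk

data = np.random.rand(8, 13)
chunk = Chunk(data, offset=16, is_last=False)
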
class audiomate.processing.pipeline.Step(name=None, min_frames=1, left_context=0, right_context=0)[source]

This class is the base class for a step in a processing pipeline.

It handles the procedure of executing the pipeline. It makes sure the steps are computed in the correct order. It also provides the correct inputs to every step.

Every step has to provide a compute method which is the actual processing.

If the implementation of a step changes the frame-size or hop-size, it is expected to provide the transformation via the frame_transform_step method. Frame-size and hop-size are measured in samples with regard to the original audio signal (i.e. its sampling rate).

Parameters: name (str, optional) – A name for identifying the step.
compute(chunk, sampling_rate, corpus=None, utterance=None)[source]

Do the computation of the step. If the step uses context, the result has to be returned without context.

Parameters:
  • chunk (Chunk) – The chunk containing data and info about context, offset, …
  • sampling_rate (int) – The sampling rate of the underlying signal.
  • corpus (Corpus) – The corpus the data is from, if available.
  • utterance (Utterance) – The utterance the data is from, if available.
Returns:

The array of processed frames, without context.

Return type:

np.ndarray

frame_transform(frame_size, hop_size)[source]

If the processor changes the number of samples that build up a frame or the number of samples between two consecutive frames (hop-size), this function needs to transform the original frame- and/or hop-size.

This is used to store the frame-size and hop-size in a feature-container. In the end one can calculate start and end time of a frame with this information.

By default it is assumed that the processor doesn’t change the frame-size and the hop-size.

Parameters:
  • frame_size (int) – The original frame-size.
  • hop_size (int) – The original hop-size.
Returns:

The (frame-size, hop-size) after processing.

Return type:

tuple

frame_transform_step(frame_size, hop_size)[source]

If the processor changes the number of samples that build up a frame or the number of samples between two consecutive frames (hop-size), this function needs to transform the original frame- and/or hop-size.

This is used to store the frame-size and hop-size in a feature-container. In the end one can calculate start and end time of a frame with this information.

By default it is assumed that the processor doesn’t change the frame-size and the hop-size.

Note

This function applies only to this step, whereas frame_transform() computes the transformation for the whole pipeline.

Parameters:
  • frame_size (int) – The original frame-size.
  • hop_size (int) – The original hop-size.
Returns:

The (frame-size, hop-size) after processing.

Return type:

tuple

process_frames(data, sampling_rate, offset=0, last=False, utterance=None, corpus=None)[source]

Execute the processing of this step and all dependent parent steps.

class audiomate.processing.pipeline.Computation(parent=None, name=None, min_frames=1, left_context=0, right_context=0)[source]

Base class for a computation step. To implement a computation step for a pipeline, the compute method has to be implemented. This method gets the frames from its parent step, including context frames if defined. It has to return the same number of frames, but without the context frames.

Parameters:
  • parent (Step, optional) – The parent step this step depends on.
  • name (str, optional) – A name for identifying the step.
class audiomate.processing.pipeline.Reduction(parents, name=None, min_frames=1, left_context=0, right_context=0)[source]

Base class for a reduction step. It gets the frames of all its parent steps as a list. It has to return a single chunk of frames.

Parameters:
  • parents (list) – List of parent steps this step depends on.
  • name (str, optional) – A name for identifying the step.
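
A sketch of custom steps (both classes are hypothetical and only need to implement compute):

import numpy as np
from audiomate.processing import pipeline

class Log(pipeline.Computation):
    """Hypothetical computation step: element-wise log of the parent output."""

    def compute(self, chunk, sampling_rate, corpus=None, utterance=None):
        return np.log(chunk.data + 1e-10)

class Mean(pipeline.Reduction):
    """Hypothetical reduction step: average the outputs of all parent steps."""

    def compute(self, chunk, sampling_rate, corpus=None, utterance=None):
        # For a reduction step, chunk.data is a list of arrays of equal size.
        return np.mean(chunk.data, axis=0)

mel = pipeline.MelSpectrogram(n_mels=64)
log_mel = Log(parent=mel, name='log-mel')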

Implementations

Some processing pipeline steps are already implemented.

Implementations of processing pipeline steps.
Name – Description
MeanVarianceNorm – Normalizes features with given mean and variance.
MelSpectrogram – Extracts mel-spectrogram features.
MFCC – Extracts MFCC features.
PowerToDb – Converts a power spectrogram to dB.
Delta – Computes delta features.
AddContext – Adds previous and subsequent frames to the current frame.
Stack – Reduces multiple features into one by stacking them on top of each other.
AvgPool – Computes the average (per dimension) over a given number of sequential frames.
VarPool – Computes the variance (per dimension) over a given number of sequential frames.
OnsetStrength – Computes onset strengths.
Tempogram – Computes tempogram features.
class audiomate.processing.pipeline.MeanVarianceNorm(mean, variance, parent=None, name=None)[source]

Pre-processing step to normalize mean and variance.

frame = (frame - mean) / sqrt(variance)

Parameters:
  • mean (float) – The mean to use for normalization.
  • variance (float) – The variance to use for normalization.
compute(chunk, sampling_rate, corpus=None, utterance=None)[source]

Do the computation of the step. If the step uses context, the result has to be returned without context.

Parameters:
  • chunk (Chunk) – The chunk containing data and info about context, offset, …
  • sampling_rate (int) – The sampling rate of the underlying signal.
  • corpus (Corpus) – The corpus the data is from, if available.
  • utterance (Utterance) – The utterance the data is from, if available.
Returns:

The array of processed frames, without context.

Return type:

np.ndarray
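
As a quick check of the formula above, with mean 2.0 and variance 4.0 a single frame [2, 4, 6] should become [0, 1, 2] (a sketch; Chunk is documented above):

import numpy as np
from audiomate.processing.pipeline import Chunk, MeanVarianceNorm

step = MeanVarianceNorm(mean=2.0, variance=4.0)
chunk = Chunk(np.array([[2.0, 4.0, 6.0]]), offset=0, is_last=True)
print(step.compute(chunk, 16000))   # expected: [[0. 1. 2.]]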

class audiomate.processing.pipeline.MelSpectrogram(n_mels=128, parent=None, name=None)[source]

Computation step that extracts mel-spectrogram features from the given frames.

Based on http://librosa.github.io/librosa/generated/librosa.feature.melspectrogram.html

Parameters: n_mels (int) – Number of mel bands to generate.
compute(chunk, sampling_rate, corpus=None, utterance=None)[source]

Do the computation of the step. If the step uses context, the result has to be returned without context.

Parameters:
  • chunk (Chunk) – The chunk containing data and info about context, offset, …
  • sampling_rate (int) – The sampling rate of the underlying signal.
  • corpus (Corpus) – The corpus the data is from, if available.
  • utterance (Utterance) – The utterance the data is from, if available.
Returns:

The array of processed frames, without context.

Return type:

np.ndarray

class audiomate.processing.pipeline.MFCC(n_mfcc=13, n_mels=128, parent=None, name=None)[source]

Computation step that extracts MFCC features from the given frames.

Based on http://librosa.github.io/librosa/generated/librosa.feature.mfcc.html

Parameters:
  • n_mels (int) – Number of mel bands to generate.
  • n_mfcc (int) – Number of MFCCs to return.
compute(chunk, sampling_rate, corpus=None, utterance=None)[source]

Do the computation of the step. If the step uses context, the result has to be returned without context.

Parameters:
  • chunk (Chunk) – The chunk containing data and info about context, offset, …
  • sampling_rate (int) – The sampling rate of the underlying signal.
  • corpus (Corpus) – The corpus the data is from, if available.
  • utterance (Utterance) – The utterance the data is from, if available.
Returns:

The array of processed frames, without context.

Return type:

np.ndarray

class audiomate.processing.pipeline.PowerToDb(ref=1.0, amin=1e-10, top_db=80.0, parent=None, name=None)[source]

Convert a power spectrogram (amplitude squared) to decibel (dB) units.

See http://librosa.github.io/librosa/generated/librosa.core.power_to_db.html

Note

The output can differ between offline and online processing, since it depends on statistics over all values: in online mode only the values of a single chunk are considered, while in offline mode all values of the whole sequence are considered.

compute(chunk, sampling_rate, corpus=None, utterance=None)[source]

Do the computation of the step. If the step uses context, the result has to be returned without context.

Parameters:
  • chunk (Chunk) – The chunk containing data and info about context, offset, …
  • sampling_rate (int) – The sampling rate of the underlying signal.
  • corpus (Corpus) – The corpus the data is from, if available.
  • utterance (Utterance) – The utterance the data is from, if available.
Returns:

The array of processed frames, without context.

Return type:

np.ndarray

class audiomate.processing.pipeline.Delta(width=9, order=1, axis=0, mode='interp', parent=None, name=None)[source]

Compute delta features.

See http://librosa.github.io/librosa/generated/librosa.feature.delta.html

compute(chunk, sampling_rate, corpus=None, utterance=None)[source]

Do the computation of the step. If the step uses context, the result has to be returned without context.

Parameters:
  • chunk (Chunk) – The chunk containing data and info about context, offset, …
  • sampling_rate (int) – The sampling rate of the underlying signal.
  • corpus (Corpus) – The corpus the data is from, if available.
  • utterance (Utterance) – The utterance the data is from, if available.
Returns:

The array of processed frames, without context.

Return type:

np.ndarray

class audiomate.processing.pipeline.AddContext(left_frames, right_frames, parent=None, name=None)[source]

For every frame, add context frames from the left and/or right. For frames at the beginning and end of a sequence, where no context is available, zeros are used.

Parameters:
  • left_frames (int) – Number of previous frames to prepend to a frame.
  • right_frames (int) – Number of subsequent frames to append to a frame.

Example

>>> import numpy as np
>>> from audiomate.processing.pipeline import AddContext, Chunk
>>>
>>> frames = np.array([
...     [1, 2, 3],
...     [4, 5, 6],
...     [7, 8, 9]
... ])
>>> chunk = Chunk(frames, offset=0, is_last=True)
>>> AddContext(left_frames=1, right_frames=1).compute(chunk, 16000)
array([[0, 0, 0, 1, 2, 3, 4, 5, 6],
       [1, 2, 3, 4, 5, 6, 7, 8, 9],
       [4, 5, 6, 7, 8, 9, 0, 0, 0]])
compute(chunk, sampling_rate, corpus=None, utterance=None)[source]

Do the computation of the step. If the step uses context, the result has to be returned without context.

Parameters:
  • chunk (Chunk) – The chunk containing data and info about context, offset, …
  • sampling_rate (int) – The sampling rate of the underlying signal.
  • corpus (Corpus) – The corpus the data is from, if available.
  • utterance (Utterance) – The utterance the data is from, if available.
Returns:

The array of processed frames, without context.

Return type:

np.ndarray

class audiomate.processing.pipeline.Stack(parents, name=None, min_frames=1, left_context=0, right_context=0)[source]

Stack the features from multiple inputs. All input matrices have to be of the same length (same number of frames).

compute(chunk, sampling_rate, corpus=None, utterance=None)[source]

Do the computation of the step. If the step uses context, the result has to be returned without context.

Parameters:
  • chunk (Chunk) – The chunk containing data and info about context, offset, …
  • sampling_rate (int) – The sampling rate of the underlying signal.
  • corpus (Corpus) – The corpus the data is from, if available.
  • utterance (Utterance) – The utterance the data is from, if available.
Returns:

The array of processed frames, without context.

Return type:

np.ndarray

class audiomate.processing.pipeline.AvgPool(size, parent=None, name=None)[source]

Average a given number of sequential frames into a single frame. At the end of a stream, the remaining frames are pooled, no matter how many are left.

Parameters: size (float) – The maximum number of frames to pool by taking the mean.
compute(chunk, sampling_rate, corpus=None, utterance=None)[source]

Do the computation of the step. If the step uses context, the result has to be returned without context.

Parameters:
  • chunk (Chunk) – The chunk containing data and info about context, offset, …
  • sampling_rate (int) – The sampling rate of the underlying signal.
  • corpus (Corpus) – The corpus the data is from, if available.
  • utterance (Utterance) – The utterance the data is from, if available.
Returns:

The array of processed frames, without context.

Return type:

np.ndarray

frame_transform_step(frame_size, hop_size)[source]

If the processor changes the number of samples that build up a frame or the number of samples between two consecutive frames (hop-size), this function needs to transform the original frame- and/or hop-size.

This is used to store the frame-size and hop-size in a feature-container. In the end one can calculate start and end time of a frame with this information.

By default it is assumed that the processor doesn’t change the frame-size and the hop-size.

Note

This function applies only to this step, whereas frame_transform() computes the transformation for the whole pipeline.

Parameters:
  • frame_size (int) – The original frame-size.
  • hop_size (int) – The original hop-size.
Returns:

The (frame-size, hop-size) after processing.

Return type:

tuple

class audiomate.processing.pipeline.VarPool(size, parent=None, name=None)[source]

Compute the variance over a given number of sequential frames to form a single frame. At the end of a stream, the remaining frames are pooled, no matter how many are left.

Parameters: size (float) – The maximum number of frames to pool by taking the variance.
compute(chunk, sampling_rate, corpus=None, utterance=None)[source]

Do the computation of the step. If the step uses context, the result has to be returned without context.

Parameters:
  • chunk (Chunk) – The chunk containing data and info about context, offset, …
  • sampling_rate (int) – The sampling rate of the underlying signal.
  • corpus (Corpus) – The corpus the data is from, if available.
  • utterance (Utterance) – The utterance the data is from, if available.
Returns:

The array of processed frames, without context.

Return type:

np.ndarray

frame_transform_step(frame_size, hop_size)[source]

If the processor changes the number of samples that build up a frame or the number of samples between two consecutive frames (hop-size), this function needs to transform the original frame- and/or hop-size.

This is used to store the frame-size and hop-size in a feature-container. In the end one can calculate start and end time of a frame with this information.

By default it is assumed that the processor doesn’t change the frame-size and the hop-size.

Note

This function applies only to this step, whereas frame_transform() computes the transformation for the whole pipeline.

Parameters:
  • frame_size (int) – The original frame-size.
  • hop_size (int) – The original hop-size.
Returns:

The (frame-size, hop-size) after processing.

Return type:

tuple

class audiomate.processing.pipeline.OnsetStrength(n_mels=128, parent=None, name=None)[source]

Compute a spectral flux onset strength envelope.

Based on http://librosa.github.io/librosa/generated/librosa.onset.onset_strength.html

Parameters: n_mels (int) – Number of mel bands to generate.
compute(chunk, sampling_rate, corpus=None, utterance=None)[source]

Do the computation of the step. If the step uses context, the result has to be returned without context.

Parameters:
  • chunk (Chunk) – The chunk containing data and info about context, offset, …
  • sampling_rate (int) – The sampling rate of the underlying signal.
  • corpus (Corpus) – The corpus the data is from, if available.
  • utterance (Utterance) – The utterance the data is from, if available.
Returns:

The array of processed frames, without context.

Return type:

np.ndarray

class audiomate.processing.pipeline.Tempogram(n_mels=128, win_length=384, parent=None, name=None)[source]

Computation step to compute tempogram features.

Based on http://librosa.github.io/librosa/generated/librosa.feature.tempogram.html

Parameters:
  • n_mels (int) – Number of mel bands to generate.
  • win_length (int) – Length of the onset autocorrelation window (in frames/onset measurements). The default setting (384) corresponds to 384 * hop_length / sr ~= 8.9 s.
compute(chunk, sampling_rate, corpus=None, utterance=None)[source]

Do the computation of the step. If the step uses context, the result has to be returned without context.

Parameters:
  • chunk (Chunk) – The chunk containing data and info about context, offset, …
  • sampling_rate (int) – The sampling rate of the underlying signal.
  • corpus (Corpus) – The corpus the data is from, if available.
  • utterance (Utterance) – The utterance the data is from, if available.
Returns:

The array of processed frames, without context.

Return type:

np.ndarray