audiomate.processing

The processing module provides tools for processing audio data in a batch-wise manner. The idea is to set up a predefined tool that can process all the audio data of a corpus.

The basic component is audiomate.processing.Processor. It reduces any input component (corpus, feature-container, utterance, file) to the abstraction of frames. A concrete implementation then only has to provide the method to process these frames.

In audio processing the same components are often used in combination with others. For this purpose a pipeline can be built that processes the frames in multiple steps. The audiomate.processing.pipeline module provides the audiomate.processing.pipeline.Computation and audiomate.processing.pipeline.Reduction classes. These abstract classes can be extended to create processing components of a pipeline. The components are then coupled to create custom pipelines, as sketched below.
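
A minimal sketch of how such a pipeline could be assembled from the classes documented below. Paths and parameter values are placeholders, and the corpus is assumed to be loadable with audiomate.Corpus.load:

    from audiomate import Corpus
    from audiomate.processing import pipeline

    # Mel-spectrogram -> dB conversion, plus first-order deltas,
    # stacked into a single feature matrix per utterance.
    mel = pipeline.MelSpectrogram(n_mels=64)
    power_to_db = pipeline.PowerToDb(parent=mel)
    delta = pipeline.Delta(order=1, parent=power_to_db)
    stack = pipeline.Stack(parents=[power_to_db, delta])

    # Every step is itself a processor, so the last step can process
    # a whole corpus and store the result in a feature-container.
    corpus = Corpus.load('/path/to/corpus')
    features = stack.process_corpus(corpus, '/path/to/mel_db_delta.hdf5',
                                    frame_size=400, hop_size=160)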

Processor

class audiomate.processing.Processor[source]

The processor base class provides the functionality to process audio data on different levels (Corpus, Utterance, File). For every level there is an offline and an online method. In the offline mode the data is processed in one step (e.g. the whole file/utterance at once). This means the process_frames method is called with all the frames of the file/utterance. In online mode the data is processed in chunks, so the process_frames method is called multiple times per file/utterance with different chunks.

To implement a concrete processor, the process_frames method has to be implemented. This method is called in both online and offline mode, so it is up to the implementer to decide whether a processor supports online mode, offline mode, or both; this depends on the use case. A minimal implementation is sketched below.
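
A minimal sketch of a concrete processor. The FrameEnergy class is hypothetical; since every frame is processed independently of its neighbours, it works in both online and offline mode:

    import numpy as np

    from audiomate.processing import Processor

    class FrameEnergy(Processor):
        """Toy processor that computes the energy of every frame."""

        def process_frames(self, data, sampling_rate, offset=0, last=False,
                           utterance=None, corpus=None):
            # data has shape (num-frames x frame-size),
            # the result has shape (num-frames x 1).
            return np.sum(data ** 2, axis=1, keepdims=True)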

process_corpus(corpus, output_path, frame_size=400, hop_size=160, sr=None)[source]

Process all utterances of the given corpus and save the processed features in a feature-container. The utterances are processed in offline mode, i.e. every utterance in one go.

Parameters:
  • corpus (Corpus) – The corpus to process the utterances from.
  • output_path (str) – A path to save the feature-container to.
  • frame_size (int) – The number of samples per frame.
  • hop_size (int) – The number of samples between two frames.
  • sr (int) – The sampling rate to use. If None, the native sampling rate of the underlying data is used.
Returns:

The feature-container containing the processed features.

Return type:

FeatureContainer
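
For example, a sketch with placeholder paths, using the hypothetical FrameEnergy processor from above:

    from audiomate import Corpus

    corpus = Corpus.load('/path/to/corpus')
    processor = FrameEnergy()

    # Every utterance is processed in one go and the result is saved
    # in a feature-container at the given output path.
    fc = processor.process_corpus(corpus, '/path/to/energy.hdf5',
                                  frame_size=400, hop_size=160)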

process_corpus_online(corpus, output_path, frame_size=400, hop_size=160, sr=None, chunk_size=1, buffer_size=5760000)[source]

Process all utterances of the given corpus and save the processed features in a feature-container. The utterances are processed in online mode, so chunk by chunk.

Parameters:
  • corpus (Corpus) – The corpus to process the utterances from.
  • output_path (str) – A path to save the feature-container to.
  • frame_size (int) – The number of samples per frame.
  • hop_size (int) – The number of samples between two frames.
  • sr (int) – The sampling rate to use. If None, the native sampling rate of the underlying data is used.
  • chunk_size (int) – Number of frames to process per chunk.
  • buffer_size (int) – Number of samples to load into memory at once. The exact number of loaded samples depends on the block-size of the audioread library, so it can be up to one block-size higher. The block-size is typically 1024 or 4096.
Returns:

The feature-container containing the processed features.

Return type:

FeatureContainer
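
For example, the same setup as the offline sketch above, but processed chunk by chunk:

    # Each utterance is read in blocks of roughly buffer_size samples
    # and passed to process_frames in chunks of 10 frames.
    fc = processor.process_corpus_online(corpus, '/path/to/energy_online.hdf5',
                                         frame_size=400, hop_size=160,
                                         chunk_size=10)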

process_features(corpus, input_features, output_path)[source]

Process all features of the given corpus and save the processed features in a feature-container. The features are processed in offline mode, all features of an utterance at once.

Parameters:
  • corpus (Corpus) – The corpus to process the utterances from.
  • input_features (FeatureContainer) – The feature-container to process the frames from.
  • output_path (str) – A path to save the feature-container to.
Returns:

The feature-container containing the processed features.

Return type:

FeatureContainer
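
For example, previously extracted features could be normalized with the MeanVarianceNorm step documented below. The mean and variance values are placeholders; fc is the feature-container returned by an earlier process_corpus call:

    from audiomate.processing import pipeline

    norm = pipeline.MeanVarianceNorm(mean=0.0, variance=1.0)
    normalized = norm.process_features(corpus, fc, '/path/to/normalized.hdf5')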

process_features_online(corpus, input_features, output_path, chunk_size=1)[source]

Process all features of the given corpus and save the processed features in a feature-container. The features are processed in online mode, chunk by chunk.

Parameters:
  • corpus (Corpus) – The corpus to process the utterances from.
  • input_features (FeatureContainer) – The feature-container to process the frames from.
  • output_path (str) – A path to save the feature-container to.
  • chunk_size (int) – Number of frames to process per chunk.
Returns:

The feature-container containing the processed features.

Return type:

FeatureContainer

process_file(file_path, frame_size=400, hop_size=160, sr=None, start=0, end=-1, utterance=None, corpus=None)[source]

Process the audio-file in offline mode, in one go.

Parameters:
  • file_path (str) – The audio file to process.
  • frame_size (int) – The number of samples per frame.
  • hop_size (int) – The number of samples between two frames.
  • sr (int) – The sampling rate to use. If None, the native sampling rate of the underlying data is used.
  • start (float) – The point within the file in seconds to start processing from.
  • end (float) – The point within the file in seconds to end processing.
  • utterance (Utterance) – The utterance that is associated with this file, if available.
  • corpus (Corpus) – The corpus this file is part of, if available.
Returns:

The processed features.

Return type:

np.ndarray
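
For example, a sketch with a placeholder path that processes only the part between 1.5 and 4.0 seconds, reusing the processor from the earlier sketches:

    frames = processor.process_file('/path/to/audio.wav',
                                    frame_size=400, hop_size=160,
                                    start=1.5, end=4.0)
    print(frames.shape)  # (num-frames x feature-dimensions)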

process_file_online(file_path, frame_size=400, hop_size=160, sr=None, start=0, end=-1, utterance=None, corpus=None, chunk_size=1, buffer_size=5760000)[source]

Process the audio-file in online mode, chunk by chunk. The processed chunks are yielded one after another.

Parameters:
  • file_path (str) – The audio file to process.
  • frame_size (int) – The number of samples per frame.
  • hop_size (int) – The number of samples between two frames.
  • sr (int) – The sampling rate to use. If None, the native sampling rate of the underlying data is used.
  • start (float) – The point within the file in seconds to start processing from.
  • end (float) – The point within the file in seconds to end processing.
  • utterance (Utterance) – The utterance that is associated with this file, if available.
  • corpus (Corpus) – The corpus this file is part of, if available.
  • chunk_size (int) – Number of frames to process per chunk.
  • buffer_size (int) – Number of samples to load into memory at once. The exact number of loaded samples depends on the block-size of the audioread library, so it can be up to one block-size higher. The block-size is typically 1024 or 4096.
Returns:

A generator that yields the processed chunks.

Return type:

Generator
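
For example, a sketch with a placeholder path; every yielded item is assumed to be an array of processed frames:

    gen = processor.process_file_online('/path/to/audio.wav',
                                        frame_size=400, hop_size=160,
                                        chunk_size=20)

    # Chunks are yielded one after another, which keeps memory usage
    # low for long recordings.
    for chunk in gen:
        print(chunk.shape)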

process_frames(data, sampling_rate, offset=0, last=False, utterance=None, corpus=None)[source]

Process the given chunk of frames. Depending on online or offline mode, the given chunk is either the full data or just part of it.

Parameters:
  • data (np.ndarray) – nD Array of frames (num-frames x frame-dimensions).
  • sampling_rate (int) – The sampling rate of the underlying signal.
  • offset (int) – The index of the first frame in the chunk, relative to the first frame of the utterance/sequence. In offline mode this is always 0.
  • last (bool) – True indicates that this is the last chunk of frames of the sequence/utterance. In offline mode this is always True.
  • utterance (Utterance) – The utterance the frame is from, if available.
  • corpus (Corpus) – The corpus the frame is from, if available.
Returns:

The processed frames.

Return type:

np.ndarray

process_utterance(utterance, frame_size=400, hop_size=160, sr=None, corpus=None)[source]

Process the utterance in offline mode, in one go.

Parameters:
  • utterance (Utterance) – The utterance to process.
  • frame_size (int) – The number of samples per frame.
  • hop_size (int) – The number of samples between two frames.
  • sr (int) – The sampling rate to use. If None, the native sampling rate of the underlying data is used.
  • corpus (Corpus) – The corpus this utterance is part of, if available.
Returns:

The processed features.

Return type:

np.ndarray
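
For example, a sketch reusing the corpus and processor from the earlier sketches; 'utt-1' is a placeholder utterance id, assuming corpus.utterances maps ids to utterances:

    utt = corpus.utterances['utt-1']
    frames = processor.process_utterance(utt, frame_size=400, hop_size=160,
                                         corpus=corpus)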

process_utterance_online(utterance, frame_size=400, hop_size=160, sr=None, chunk_size=1, buffer_size=5760000, corpus=None)[source]

Process the utterance in online mode, chunk by chunk. The processed chunks are yielded one after another.

Parameters:
  • utterance (Utterance) – The utterance to process.
  • frame_size (int) – The number of samples per frame.
  • hop_size (int) – The number of samples between two frames.
  • sr (int) – The sampling rate to use. If None, the native sampling rate of the underlying data is used.
  • chunk_size (int) – Number of frames to process per chunk.
  • buffer_size (int) – Number of samples to load into memory at once. The exact number of loaded samples depends on the block-size of the audioread library, so it can be up to one block-size higher. The block-size is typically 1024 or 4096.
  • corpus (Corpus) – The corpus this utterance is part of, if available.
Returns:

A generator that yields the processed chunks.

Return type:

Generator
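
For example, the same placeholder utterance as above, processed chunk by chunk:

    for chunk in processor.process_utterance_online(utt, frame_size=400,
                                                    hop_size=160,
                                                    chunk_size=10,
                                                    corpus=corpus):
        print(chunk.shape)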

Pipeline

This module contains classes for creating frame processing pipelines.

A pipeline consists of two types of steps. A computation step takes data from a previous step or the input and processes it. A reduction step is used to merge the outputs of multiple previous steps: it takes the outputs of all incoming steps and outputs a single data block.

The steps are managed as a directed graph, which is built by passing the parent steps to the __init__ method of a step. Every step that is created has its own graph, but inherits all nodes and edges of the graphs of its parent steps.

Every pipeline represents a processor and implements the process_frames method.

class audiomate.processing.pipeline.Chunk(data, offset, is_last, left_context=0, right_context=0)[source]

Represents a chunk of data. It is used to pass data between different steps of a pipeline.

Parameters:
  • data (np.ndarray or list) – A single array of frames or a list of separate chunks of frames of equal size.
  • offset (int) – The index of the first frame in the chunk within the sequence.
  • is_last (bool) – Whether this is the last chunk of the sequence.
  • left_context (int) – Number of frames that act as context at the beginning of the chunk (left).
  • right_context (int) – Number of frames that act as context at the end of the chunk (right).
class audiomate.processing.pipeline.Step(name=None, min_frames=1, left_context=0, right_context=0)[source]

This class is the base class for a step in a processing pipeline.

It handles the procedure of executing the pipeline. It makes sure the steps are computed in the correct order. It also provides the correct inputs to every step.

Every step has to provide a compute method which performs the actual processing.

Parameters:
  • name (str, optional) – A name for identifying the step.
compute(chunk, sampling_rate, corpus=None, utterance=None)[source]

Do the computation of the step. If the step uses context, the result has to be returned without context.

Parameters:
  • chunk (Chunk) – The chunk containing data and info about context, offset, …
  • sampling_rate (int) – The sampling rate of the underlying signal.
  • corpus (Corpus) – The corpus the data is from, if available.
  • utterance (Utterance) – The utterance the data is from, if available.
Returns:

The array of processed frames, without context.

Return type:

np.ndarray

process_frames(data, sampling_rate, offset=0, last=False, utterance=None, corpus=None)[source]

Execute the processing of this step and all dependent parent steps.

class audiomate.processing.pipeline.Computation(parent=None, name=None, min_frames=1, left_context=0, right_context=0)[source]

Base class for a computation step. To implement a computation step for a pipeline, the compute method has to be implemented. This method gets the frames from its parent step, including context frames if defined. It has to return the same number of frames, but without the context frames. A minimal sketch follows the parameter list below.

Parameters:
  • parent (Step, optional) – The parent step this step depends on.
  • name (str, optional) – A name for identifying the step.
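
A minimal sketch of a custom computation step; the Log class is hypothetical:

    import numpy as np

    from audiomate.processing import pipeline

    class Log(pipeline.Computation):
        """Toy step that takes the element-wise logarithm of the frames."""

        def compute(self, chunk, sampling_rate, corpus=None, utterance=None):
            # chunk.data holds the frames of the parent step. This step
            # requests no context frames, so the result can be returned as-is.
            return np.log(chunk.data + 1e-10)
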
class audiomate.processing.pipeline.Reduction(parents, name=None, min_frames=1, left_context=0, right_context=0)[source]

Base class for a reduction step. It gets the frames of all its parent steps as a list. It has to return a single chunk of frames. A minimal sketch follows the parameter list below.

Parameters:
  • parents (list) – List of parent steps this step depends on.
  • name (str, optional) – A name for identifying the step.
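
A minimal sketch of a custom reduction step; the Average class is hypothetical:

    import numpy as np

    from audiomate.processing import pipeline

    class Average(pipeline.Reduction):
        """Toy step that averages the outputs of all its parent steps."""

        def compute(self, chunk, sampling_rate, corpus=None, utterance=None):
            # For a reduction step, chunk.data is a list with one array
            # of frames per parent step.
            return np.mean(np.stack(chunk.data), axis=0)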

Implementations

Some processing pipeline steps are already implemented.

Implementations of processing pipeline steps.
  • MeanVarianceNorm – Normalizes features with given mean and variance.
  • MelSpectrogram – Extracts mel-spectrogram features.
  • MFCC – Extracts MFCC features.
  • PowerToDb – Converts a power spectrogram to dB units.
  • Delta – Computes delta features.
  • Stack – Reduces multiple features into one by stacking them on top of each other.
class audiomate.processing.pipeline.MeanVarianceNorm(mean, variance, parent=None, name=None)[source]

Pre-processing step to normalize mean and variance.

frame = (frame - mean) / sqrt(variance)

Parameters:
  • mean (float) – The mean to use for normalization.
  • variance (float) – The variance to use for normalization.
compute(chunk, sampling_rate, corpus=None, utterance=None)[source]

Do the computation of the step. If the step uses context, the result has to be returned without context.

Parameters:
  • chunk (Chunk) – The chunk containing data and info about context, offset, …
  • sampling_rate (int) – The sampling rate of the underlying signal.
  • corpus (Corpus) – The corpus the data is from, if available.
  • utterance (Utterance) – The utterance the data is from, if available.
Returns:

The array of processed frames, without context.

Return type:

np.ndarray

class audiomate.processing.pipeline.MelSpectrogram(n_mels=128, parent=None, name=None)[source]

Computation step that extracts mel-spectrogram features from the given frames.

Based on http://librosa.github.io/librosa/generated/librosa.feature.melspectrogram.html

Parameters:
  • n_mels (int) – Number of mel bands to generate.
compute(chunk, sampling_rate, corpus=None, utterance=None)[source]

Do the computation of the step. If the step uses context, the result has to be returned without context.

Parameters:
  • chunk (Chunk) – The chunk containing data and info about context, offset, …
  • sampling_rate (int) – The sampling rate of the underlying signal.
  • corpus (Corpus) – The corpus the data is from, if available.
  • utterance (Utterance) – The utterance the data is from, if available.
Returns:

The array of processed frames, without context.

Return type:

np.ndarray

class audiomate.processing.pipeline.MFCC(n_mfcc=13, n_mels=128, parent=None, name=None)[source]

Computation step that extracts mfcc features from the given frames.

Based on http://librosa.github.io/librosa/generated/librosa.feature.mfcc.html

Parameters:
  • n_mels (int) – Number of mel bands to generate.
  • n_mfcc (int) – Number of MFCCs to return.
compute(chunk, sampling_rate, corpus=None, utterance=None)[source]

Do the computation of the step. If the step uses context, the result has to be returned without context.

Parameters:
  • chunk (Chunk) – The chunk containing data and info about context, offset, …
  • sampling_rate (int) – The sampling rate of the underlying signal.
  • corpus (Corpus) – The corpus the data is from, if available.
  • utterance (Utterance) – The utterance the data is from, if available.
Returns:

The array of processed frames, without context.

Return type:

np.ndarray

class audiomate.processing.pipeline.PowerToDb(ref=1.0, amin=1e-10, top_db=80.0, parent=None, name=None)[source]

Convert a power spectrogram (amplitude squared) to decibel (dB) units.

See http://librosa.github.io/librosa/generated/librosa.core.power_to_db.html

Note

The output can differ between offline and online processing, since the conversion depends on statistics over all values. In online mode only the values of a single chunk are considered, while in offline mode the values of the whole sequence are considered.

compute(chunk, sampling_rate, corpus=None, utterance=None)[source]

Do the computation of the step. If the step uses context, the result has to be returned without context.

Parameters:
  • chunk (Chunk) – The chunk containing data and info about context, offset, …
  • sampling_rate (int) – The sampling rate of the underlying signal.
  • corpus (Corpus) – The corpus the data is from, if available.
  • utterance (Utterance) – The utterance the data is from, if available.
Returns:

The array of processed frames, without context.

Return type:

np.ndarray

class audiomate.processing.pipeline.Delta(width=9, order=1, axis=0, mode='interp', parent=None, name=None)[source]

Compute delta features.

See http://librosa.github.io/librosa/generated/librosa.feature.delta.html

compute(chunk, sampling_rate, corpus=None, utterance=None)[source]

Do the computation of the step. If the step uses context, the result has to be returned without context.

Parameters:
  • chunk (Chunk) – The chunk containing data and info about context, offset, …
  • sampling_rate (int) – The sampling rate of the underlying signal.
  • corpus (Corpus) – The corpus the data is from, if available.
  • utterance (Utterance) – The utterance the data is from, if available.
Returns:

The array of processed frames, without context.

Return type:

np.ndarray

class audiomate.processing.pipeline.Stack(parents, name=None, min_frames=1, left_context=0, right_context=0)[source]

Stacks the features of all parent steps. All input matrices have to be of the same length (same number of frames).

compute(chunk, sampling_rate, corpus=None, utterance=None)[source]

Do the computation of the step. If the step uses context, the result has to be returned without context.

Parameters:
  • chunk (Chunk) – The chunk containing data and info about context, offset, …
  • sampling_rate (int) – The sampling rate of the underlying signal.
  • corpus (Corpus) – The corpus the data is from, if available.
  • utterance (Utterance) – The utterance the data is from, if available.
Returns:

The array of processed frames, without context.

Return type:

np.ndarray