audiomate.processing¶
The processing module provides tools for processing audio data in a batch-wise manner. The idea is to set up a predefined tool that can process all the audio from a corpus.
The basic component is the audiomate.processing.Processor
. It provides the functionality
to reduce any input component, like a corpus, feature-container, utterance, or file, to the abstraction of frames.
A concrete implementation then only has to provide the proper method to process these frames.
Often in audio processing the same components are used in combination with others.
For this purpose a pipeline can be built that processes the frames in multiple steps.
The audiomate.processing.pipeline
provides the audiomate.processing.pipeline.Computation
and audiomate.processing.pipeline.Reduction
classes.
These abstract classes can be extended to create processing components of a pipeline.
The different components can then be coupled to create custom pipelines.
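To illustrate this coupling, here is a minimal, self-contained sketch with toy stand-ins. These are NOT the actual audiomate.processing.pipeline classes; all names and behavior below are invented for the example:

```python
import numpy as np

# Toy stand-ins for the computation/reduction idea described above.
class ToyComputation:
    def __init__(self, parent=None):
        self.parent = parent

    def process(self, frames):
        # Pull data through the parent step first, then compute.
        if self.parent is not None:
            frames = self.parent.process(frames)
        return self.compute(frames)

class Scale(ToyComputation):
    def compute(self, frames):
        return frames * 2.0

class Offset(ToyComputation):
    def compute(self, frames):
        return frames + 1.0

class ToyReduction:
    def __init__(self, parents):
        self.parents = parents

    def process(self, frames):
        # Merge the outputs of all incoming steps into a single block.
        outputs = [p.process(frames) for p in self.parents]
        return np.hstack(outputs)

frames = np.ones((3, 2))
a = Scale()
b = Offset(parent=a)           # graph: a -> b
merged = ToyReduction([a, b])  # reduction over both branches
out = merged.process(frames)   # shape (3, 4): branch a yields 2s, branch b 3s
```

A reduction step takes the outputs of all incoming steps and produces a single data block; the toy reduction above simply stacks the two branches side by side.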
Processor¶
-
class
audiomate.processing.
Processor
[source]¶ The processor base class provides the functionality to process audio data on different levels (Corpus, Utterance, Track). For every level there is an offline and an online method. In offline mode the data is processed in one step (e.g. the whole track/utterance at once), which means the
process_frames
method is called with all the frames of the track/utterance. In online mode the data is processed in chunks, so the
process_frames
method is called multiple times per track/utterance with different chunks.
To implement a concrete processor, the
process_frames
method has to be implemented. It is called in both online and offline mode, so it is up to the implementer whether a given processor supports online mode, offline mode, or both; this differs between use cases.
If the implementation of a processor changes the frame-size or hop-size, it is expected to provide a transform via the
frame_transform
method. Frame-size and hop-size are measured in samples of the original audio signal (i.e. at its sampling rate). -
frame_transform
(frame_size, hop_size)[source]¶ If the processor changes the number of samples that make up a frame or the number of samples between two consecutive frames (hop-size), this function needs to transform the original frame- and/or hop-size.
This is used to store the frame-size and hop-size in a feature-container, from which the start and end time of a frame can later be calculated.
By default it is assumed that the processor doesn’t change the frame-size or the hop-size.
Parameters: - frame_size (int) – The original frame-size.
- hop_size (int) – The original hop-size.
Returns: The (frame-size, hop-size) after processing.
Return type: tuple
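Since frame-size and hop-size are stored in samples of the original signal, the start and end time of any frame follow directly from the sampling rate. A small sketch (the helper `frame_times` is not part of audiomate):

```python
def frame_times(frame_index, frame_size, hop_size, sr):
    # Frame- and hop-size are in samples of the original signal,
    # so times are obtained by dividing by the sampling rate.
    start = frame_index * hop_size / sr
    end = (frame_index * hop_size + frame_size) / sr
    return start, end

# With the documented defaults (frame_size=400, hop_size=160) at 16 kHz:
frame_times(0, 400, 160, 16000)   # (0.0, 0.025): a 25 ms window
frame_times(10, 400, 160, 16000)  # (0.1, 0.125): frames advance in 10 ms hops
```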
-
process_corpus
(corpus, output_path, frame_size=400, hop_size=160, sr=None)[source]¶ Process all utterances of the given corpus and save the processed features in a feature-container. The utterances are processed in offline mode, so each full utterance is processed in one go.
Parameters: - corpus (Corpus) – The corpus to process the utterances from.
- output_path (str) – A path to save the feature-container to.
- frame_size (int) – The number of samples per frame.
- hop_size (int) – The number of samples between two frames.
- sr (int) – Use the given sampling rate. If None, uses the native sampling rate from the underlying data.
Returns: The feature-container containing the processed features.
Return type: FeatureContainer
-
process_corpus_online
(corpus, output_path, frame_size=400, hop_size=160, chunk_size=1, buffer_size=5760000)[source]¶ Process all utterances of the given corpus and save the processed features in a feature-container. The utterances are processed in online mode, so chunk by chunk.
Parameters: - corpus (Corpus) – The corpus to process the utterances from.
- output_path (str) – A path to save the feature-container to.
- frame_size (int) – The number of samples per frame.
- hop_size (int) – The number of samples between two frames.
- chunk_size (int) – Number of frames to process per chunk.
- buffer_size (int) – Number of samples to load into memory at once. The exact number of loaded samples depends on the block-size of the audioread library, so it can be up to a block-size higher, where the block-size is typically 1024 or 4096.
Returns: The feature-container containing the processed features.
Return type: FeatureContainer
-
process_features
(corpus, input_features, output_path)[source]¶ Process all features of the given corpus and save the processed features in a feature-container. The features are processed in offline mode, all features of an utterance at once.
Parameters: - corpus (Corpus) – The corpus to process the utterances from.
- input_features (FeatureContainer) – The feature-container to process the frames from.
- output_path (str) – A path to save the feature-container to.
Returns: The feature-container containing the processed features.
Return type: FeatureContainer
-
process_features_online
(corpus, input_features, output_path, chunk_size=1)[source]¶ Process all features of the given corpus and save the processed features in a feature-container. The features are processed in online mode, chunk by chunk.
Parameters: - corpus (Corpus) – The corpus to process the utterances from.
- input_features (FeatureContainer) – The feature-container to process the frames from.
- output_path (str) – A path to save the feature-container to.
- chunk_size (int) – Number of frames to process per chunk.
Returns: The feature-container containing the processed features.
Return type: FeatureContainer
-
process_frames
(data, sampling_rate, offset=0, last=False, utterance=None, corpus=None)[source]¶ Process the given chunk of frames. Depending on online or offline mode, the given chunk is either the full data or just part of it.
Parameters: - data (np.ndarray) – nD Array of frames (num-frames x frame-dimensions).
- sampling_rate (int) – The sampling rate of the underlying signal.
- offset (int) – The index of the first frame in the chunk, relative to the first frame of the utterance/sequence. In offline mode always 0.
- last (bool) – True indicates that this is the last frame of the sequence/utterance. In offline mode always True.
- utterance (Utterance) – The utterance the frame is from, if available.
- corpus (Corpus) – The corpus the frame is from, if available.
Returns: The processed frames.
Return type: np.ndarray
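To make the contract concrete, a hypothetical `process_frames` implementation might compute a per-frame log-energy. The function below is invented for illustration; what it demonstrates is the (num-frames x frame-dimensions) input and the preserved leading frame axis:

```python
import numpy as np

def process_frames(data, sampling_rate, offset=0, last=False,
                   utterance=None, corpus=None):
    # Hypothetical processor: per-frame log-energy.
    # data is (num-frames x frame-dimensions); the output keeps
    # one row per input frame.
    energy = np.sum(data ** 2, axis=1, keepdims=True)
    return np.log(energy + 1e-10)

frames = np.ones((5, 400))        # 5 frames of 400 samples each
out = process_frames(frames, 16000)
out.shape                         # (5, 1): one value per frame
```

Because the same method is called with the full data (offline) or with chunks (online), a stateless per-frame computation like this works in both modes without changes.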
-
process_track
(track, frame_size=400, hop_size=160, sr=None, start=0, end=inf, utterance=None, corpus=None)[source]¶ Process the track in offline mode, in one go.
Parameters: - track (Track) – The track to process.
- frame_size (int) – The number of samples per frame.
- hop_size (int) – The number of samples between two frames.
- sr (int) – Use the given sampling rate. If None, uses the native sampling rate from the underlying data.
- start (float) – The point within the track, in seconds, at which to start processing.
- end (float) – The point within the track, in seconds, at which to end processing.
- utterance (Utterance) – The utterance that is associated with this track, if available.
- corpus (Corpus) – The corpus this track is part of, if available.
Returns: The processed features.
Return type: np.ndarray
-
process_track_online
(track, frame_size=400, hop_size=160, start=0, end=inf, utterance=None, corpus=None, chunk_size=1, buffer_size=5760000)[source]¶ Process the track in online mode, chunk by chunk. The processed chunks are yielded one after another.
Parameters: - track (Track) – The track to process.
- frame_size (int) – The number of samples per frame.
- hop_size (int) – The number of samples between two frames.
- start (float) – The point within the track, in seconds, at which to start processing.
- end (float) – The point within the track, in seconds, at which to end processing.
- utterance (Utterance) – The utterance that is associated with this track, if available.
- corpus (Corpus) – The corpus this track is part of, if available.
- chunk_size (int) – Number of frames to process per chunk.
- buffer_size (int) – Number of samples to load into memory at once. The exact number of loaded samples depends on the type of track. It can be up to a block-size higher, where the block-size is typically 1024 or 4096.
Returns: A generator that yields the processed chunks.
Return type: Generator
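The online iteration can be pictured as a generator over chunks of frames. The sketch below only illustrates the offset/last-flag bookkeeping; buffering, context frames, and the actual track reading are omitted, and the function is not part of audiomate:

```python
import numpy as np

def iter_chunks(frames, chunk_size):
    # Yield (offset, chunk, is_last) triples, where offset is the
    # index of the first frame in the chunk within the sequence.
    num = frames.shape[0]
    for start in range(0, num, chunk_size):
        chunk = frames[start:start + chunk_size]
        yield start, chunk, start + chunk_size >= num

frames = np.arange(10).reshape(5, 2)          # 5 frames, 2 dims
pieces = list(iter_chunks(frames, chunk_size=2))
# offsets 0, 2, 4; only the final (shorter) chunk has is_last=True
```

In the real online methods, each yielded chunk is what gets passed to `process_frames` with the matching `offset` and `last` arguments.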
-
process_utterance
(utterance, frame_size=400, hop_size=160, sr=None, corpus=None)[source]¶ Process the utterance in offline mode, in one go.
Parameters: - utterance (Utterance) – The utterance to process.
- frame_size (int) – The number of samples per frame.
- hop_size (int) – The number of samples between two frames.
- sr (int) – Use the given sampling rate. If None, uses the native sampling rate from the underlying data.
- corpus (Corpus) – The corpus this utterance is part of, if available.
Returns: The processed features.
Return type: np.ndarray
-
process_utterance_online
(utterance, frame_size=400, hop_size=160, chunk_size=1, buffer_size=5760000, corpus=None)[source]¶ Process the utterance in online mode, chunk by chunk. The processed chunks are yielded one after another.
Parameters: - utterance (Utterance) – The utterance to process.
- frame_size (int) – The number of samples per frame.
- hop_size (int) – The number of samples between two frames.
- chunk_size (int) – Number of frames to process per chunk.
- buffer_size (int) – Number of samples to load into memory at once. The exact number of loaded samples depends on the block-size of the audioread library, so it can be up to a block-size higher, where the block-size is typically 1024 or 4096.
- corpus (Corpus) – The corpus this utterance is part of, if available.
Returns: A generator that yields the processed chunks.
Return type: Generator
-
Pipeline¶
This module contains classes for creating frame processing pipelines.
A pipeline consists of steps of two types. A computation step takes data from a previous step or from the input and processes it. A reduction step is used to merge the outputs of multiple previous steps: it takes the outputs of all incoming steps and produces a single data block.
The steps are managed as a directed graph,
which is built by passing the parent steps to the __init__
method of a step.
Every step that is created has its own graph, but inherits all nodes and edges of the graphs of its parent steps.
Every pipeline represents a processor and implements the process_frames
method.
-
class
audiomate.processing.pipeline.
Chunk
(data, offset, is_last, left_context=0, right_context=0)[source]¶ Represents a chunk of data. It is used to pass data between different steps of a pipeline.
Parameters: - data (np.ndarray or list) – A single array of frames or a list of separate chunks of frames of equal size.
- offset (int) – The index of the first frame in the chunk within the sequence.
- is_last (bool) – Whether this is the last chunk of the sequence.
- left_context (int) – Number of frames that act as context at the beginning of the chunk (left).
- right_context (int) – Number of frames that act as context at the end of the chunk (right).
-
class
audiomate.processing.pipeline.
Step
(name=None, min_frames=1, left_context=0, right_context=0)[source]¶ This class is the base class for a step in a processing pipeline.
It handles the procedure of executing the pipeline. It makes sure the steps are computed in the correct order. It also provides the correct inputs to every step.
Every step has to provide a
compute
method which performs the actual processing. If the implementation of a step changes the frame-size or hop-size, it is expected to provide a transform via the
frame_transform_step
method. Frame-size and hop-size are measured in samples of the original audio signal (i.e. at its sampling rate).
Parameters: name (str, optional) – A name for identifying the step. -
compute
(chunk, sampling_rate, corpus=None, utterance=None)[source]¶ Do the computation of the step. If the step uses context, the result has to be returned without context.
Parameters: - chunk (Chunk) – The chunk of frames to process.
- sampling_rate (int) – The sampling rate of the underlying signal.
- corpus (Corpus) – The corpus the data is from, if available.
- utterance (Utterance) – The utterance the data is from, if available.
Returns: The array of processed frames, without context.
Return type: np.ndarray
-
frame_transform
(frame_size, hop_size)[source]¶ If the processor changes the number of samples that make up a frame or the number of samples between two consecutive frames (hop-size), this function needs to transform the original frame- and/or hop-size.
This is used to store the frame-size and hop-size in a feature-container, from which the start and end time of a frame can later be calculated.
By default it is assumed that the processor doesn’t change the frame-size or the hop-size.
Parameters: - frame_size (int) – The original frame-size.
- hop_size (int) – The original hop-size.
Returns: The (frame-size, hop-size) after processing.
Return type: tuple
-
frame_transform_step
(frame_size, hop_size)[source]¶ If the processor changes the number of samples that make up a frame or the number of samples between two consecutive frames (hop-size), this function needs to transform the original frame- and/or hop-size.
This is used to store the frame-size and hop-size in a feature-container, from which the start and end time of a frame can later be calculated.
By default it is assumed that the processor doesn’t change the frame-size or the hop-size.
Note
This function applies only to this step, whereas
frame_transform()
computes the transformation for the whole pipeline.
Parameters: - frame_size (int) – The original frame-size.
- hop_size (int) – The original hop-size.
Returns: The (frame-size, hop-size) after processing.
Return type: tuple
-
-
class
audiomate.processing.pipeline.
Computation
(parent=None, name=None, min_frames=1, left_context=0, right_context=0)[source]¶ Base class for a computation step. To implement a computation step for a pipeline, the
compute
method has to be implemented. This method gets the frames from its parent step, including context frames if defined. It has to return the same number of frames, but without the context frames.
Parameters: - parent (Step, optional) – The parent step this step depends on.
- name (str, optional) – A name for identifying the step.
-
class
audiomate.processing.pipeline.
Reduction
(parents, name=None, min_frames=1, left_context=0, right_context=0)[source]¶ Base class for a reduction step. It gets the frames of all its parent steps as a list. It has to return a single chunk of frames.
Parameters: - parents (list) – List of parent steps this step depends on.
- name (str, optional) – A name for identifying the step.
Implementations¶
Some processing pipeline steps are already implemented.
Name | Description |
---|---|
MeanVarianceNorm | Normalizes features with given mean and variance. |
MelSpectrogram | Extracts mel-spectrogram features. |
MFCC | Extracts MFCC features. |
PowerToDb | Converts a power spectrogram to dB. |
Delta | Computes delta features. |
AddContext | Adds previous and subsequent frames to the current frame. |
Stack | Reduces multiple features into one by stacking them on top of each other. |
AvgPool | Computes the average (per dimension) over a given number of sequential frames. |
VarPool | Computes the variance (per dimension) over a given number of sequential frames. |
OnsetStrength | Computes onset strengths. |
Tempogram | Computes tempogram features. |
-
class
audiomate.processing.pipeline.
MeanVarianceNorm
(mean, variance, parent=None, name=None)[source]¶ Pre-processing step to normalize mean and variance.
frame = (frame - mean) / sqrt(variance)
Parameters: - mean (float) – The mean to use for normalization.
- variance (float) – The variance to use for normalization.
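The normalization formula above can be applied directly with NumPy; this is a standalone sketch of the computation, not the audiomate implementation:

```python
import numpy as np

def mean_variance_norm(frames, mean, variance):
    # frame = (frame - mean) / sqrt(variance), per the formula above.
    return (frames - mean) / np.sqrt(variance)

frames = np.array([[1.0, 2.0],
                   [3.0, 4.0]])
normed = mean_variance_norm(frames, mean=2.5, variance=4.0)
# normed == [[-0.75, -0.25], [0.25, 0.75]]
```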
-
class
audiomate.processing.pipeline.
MelSpectrogram
(n_mels=128, parent=None, name=None)[source]¶ Computation step that extracts mel-spectrogram features from the given frames.
Based on http://librosa.github.io/librosa/generated/librosa.feature.melspectrogram.html
Parameters: n_mels (int) – Number of mel bands to generate.
-
class
audiomate.processing.pipeline.
MFCC
(n_mfcc=13, n_mels=128, parent=None, name=None)[source]¶ Computation step that extracts MFCC features from the given frames.
Based on http://librosa.github.io/librosa/generated/librosa.feature.mfcc.html
Parameters: - n_mels (int) – Number of mel bands to generate.
- n_mfcc (int) – Number of MFCCs to return.
-
class
audiomate.processing.pipeline.
PowerToDb
(ref=1.0, amin=1e-10, top_db=80.0, parent=None, name=None)[source]¶ Convert a power spectrogram (amplitude squared) to decibel (dB) units.
See http://librosa.github.io/librosa/generated/librosa.core.power_to_db.html
Note
The output can differ between offline and online processing, since it depends on statistics over all values: in online mode only the values of a single chunk are considered, while in offline mode all values of the whole sequence are.
-
class
audiomate.processing.pipeline.
Delta
(width=9, order=1, axis=0, mode='interp', parent=None, name=None)[source]¶ Compute delta features.
See http://librosa.github.io/librosa/generated/librosa.feature.delta.html
-
class
audiomate.processing.pipeline.
AddContext
(left_frames, right_frames, parent=None, name=None)[source]¶ For every frame, add context frames from the left and/or right. For frames at the beginning and end of a sequence, where no context is available, zeros are used.
Parameters: - left_frames (int) – Number of previous frames to prepend to a frame.
- right_frames (int) – Number of subsequent frames to append to a frame.
Example
>>> input = np.array([
>>>     [1, 2, 3],
>>>     [4, 5, 6],
>>>     [7, 8, 9]
>>> ])
>>> chunk = Chunk(input, offset=0, is_last=True)
>>> AddContext(left_frames=1, right_frames=1).compute(chunk, 16000)
array([[0, 0, 0, 1, 2, 3, 4, 5, 6],
       [1, 2, 3, 4, 5, 6, 7, 8, 9],
       [4, 5, 6, 7, 8, 9, 0, 0, 0]])
-
class
audiomate.processing.pipeline.
Stack
(parents, name=None, min_frames=1, left_context=0, right_context=0)[source]¶ Stack the features from multiple inputs. All input matrices have to be of the same length (same number of frames).
-
class
audiomate.processing.pipeline.
AvgPool
(size, parent=None, name=None)[source]¶ Average a given number of sequential frames into a single frame. At the end of a stream, just the remaining frames are used, no matter how many are left.
Parameters: size (float) – The maximum number of frames to pool by taking the mean. -
compute
(chunk, sampling_rate, corpus=None, utterance=None)[source]¶ Do the computation of the step. If the step uses context, the result has to be returned without context.
Parameters: - chunk (Chunk) – The chunk of frames to process.
- sampling_rate (int) – The sampling rate of the underlying signal.
- corpus (Corpus) – The corpus the data is from, if available.
- utterance (Utterance) – The utterance the data is from, if available.
Returns: The array of processed frames, without context.
Return type: np.ndarray
-
frame_transform_step
(frame_size, hop_size)[source]¶ If the processor changes the number of samples that make up a frame or the number of samples between two consecutive frames (hop-size), this function needs to transform the original frame- and/or hop-size.
This is used to store the frame-size and hop-size in a feature-container, from which the start and end time of a frame can later be calculated.
By default it is assumed that the processor doesn’t change the frame-size or the hop-size.
Note
This function applies only to this step, whereas
frame_transform()
computes the transformation for the whole pipeline.
Parameters: - frame_size (int) – The original frame-size.
- hop_size (int) – The original hop-size.
Returns: The (frame-size, hop-size) after processing.
Return type: tuple
-
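The pooling behavior described for AvgPool, including the shorter remainder at the end of a stream, can be sketched in plain NumPy (not the actual implementation):

```python
import numpy as np

def avg_pool(frames, size):
    # Pool every `size` consecutive frames by taking the mean;
    # a shorter remainder at the end is averaged as-is.
    pooled = [frames[i:i + size].mean(axis=0)
              for i in range(0, frames.shape[0], size)]
    return np.stack(pooled)

frames = np.array([[1.0], [3.0], [5.0], [7.0], [9.0]])
pooled_out = avg_pool(frames, size=2)
# rows: 2.0, 6.0, 9.0 (the last "pool" is the single remaining frame)
```

VarPool works the same way, except each group of frames is reduced with the variance instead of the mean.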
-
class
audiomate.processing.pipeline.
VarPool
(size, parent=None, name=None)[source]¶ Compute the variance over a given number of sequential frames to form a single frame. At the end of a stream, just the remaining frames are used, no matter how many are left.
Parameters: size (float) – The maximum number of frames to pool by taking the variance. -
compute
(chunk, sampling_rate, corpus=None, utterance=None)[source]¶ Do the computation of the step. If the step uses context, the result has to be returned without context.
Parameters: - chunk (Chunk) – The chunk of frames to process.
- sampling_rate (int) – The sampling rate of the underlying signal.
- corpus (Corpus) – The corpus the data is from, if available.
- utterance (Utterance) – The utterance the data is from, if available.
Returns: The array of processed frames, without context.
Return type: np.ndarray
-
frame_transform_step
(frame_size, hop_size)[source]¶ If the processor changes the number of samples that make up a frame or the number of samples between two consecutive frames (hop-size), this function needs to transform the original frame- and/or hop-size.
This is used to store the frame-size and hop-size in a feature-container, from which the start and end time of a frame can later be calculated.
By default it is assumed that the processor doesn’t change the frame-size or the hop-size.
Note
This function applies only to this step, whereas
frame_transform()
computes the transformation for the whole pipeline.
Parameters: - frame_size (int) – The original frame-size.
- hop_size (int) – The original hop-size.
Returns: The (frame-size, hop-size) after processing.
Return type: tuple
-
-
class
audiomate.processing.pipeline.
OnsetStrength
(n_mels=128, parent=None, name=None)[source]¶ Compute a spectral flux onset strength envelope.
Based on http://librosa.github.io/librosa/generated/librosa.onset.onset_strength.html
Parameters: n_mels (int) – Number of mel bands to generate.
-
class
audiomate.processing.pipeline.
Tempogram
(n_mels=128, win_length=384, parent=None, name=None)[source]¶ Computation step to compute a tempogram.
Based on http://librosa.github.io/librosa/generated/librosa.feature.tempogram.html
Parameters: - n_mels (int) – Number of mel bands to generate.
- win_length (int) – Length of the onset autocorrelation window (in frames/onset measurements). The default setting (384) corresponds to 384 * hop_length / sr ~= 8.9s.
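The ~8.9 s figure follows from the window length times the hop duration; the calculation below assumes librosa's default hop_length=512 and sr=22050, neither of which is stated in this documentation:

```python
# 384 onset measurements, each hop_length samples apart, at sr samples/s.
# hop_length=512 and sr=22050 are assumed librosa defaults.
duration = 384 * 512 / 22050
print(round(duration, 1))  # 8.9
```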