Add Dataset/Format

In this section it is described how to a downloader, reader or writer for a new dataset or another corpus format. The implementation is pretty straight-forward. For examples checkout some of the existing implementations at audiomate.corpus.io.

Important

  • Use the same name (type() method) for downloader/reader/writer.
  • Import your components in audiomate.corpus.io.__init___. So all components are available from the io module.
  • Checkout Data Mapping on what and how to add info/data when reading a corpus.

Corpus Downloader

If we aim to load some specific dataset/corpus, a downloader can be implemented, if it is possible to automate the whole download process. First we create a new class that inherits from audiomate.corpus.io.CorpusDownloader. There we have to implement two methods. The type method just has to return a string with the name of the new dataset/format. The _download method will do the heavy work of download all the files to the path target_path.

from audiomate.corpus.io import base


class MyDownloader(base.CorpusDownloader):

    @classmethod
    def type(cls):
        return 'MyDataset'

    def _download(self, target_path):
        # Download the data to target_path

In the module audiomate.corpus.io.downloader, common base classes for downloaders are implemented. This is useful since for a lot of corpora the way of downloading is similar.

Corpus Reader

The reader is the one component that is mostly used. Either for a specific dataset/corpus or a custom format, a reader is most likely to be required. First we create a new class that inherits from audiomate.corpus.io.CorpusReader. There we have to implement three methods. The type method just has to return a string with the name of the new dataset/format. The _check_for_missing_files method can be used to check if the given path is a valid input. For example if the format/dataset requires some specific meta-files it can be check here if they are available. Finally in the _load method the actual loading is done and the loaded corpus is returned.

from audiomate.corpus.io import base


class MyReader(base.CorpusReader):

    @classmethod
    def type(cls):
        return 'MyDataset'

    def _check_for_missing_files(self, path):
        # Check the path for missing files that are required to read with this reader.
        # Return a list of missing files
        return []

    def _load(self, path):
        # Create a new corpus
        corpus = audiomate.Corpus(path=path)

        # Create files ...
        corpus.new_file(file_path, file_idx)

        # Issuers ...
        issuer = assets.Speaker(issuer_idx)
        corpus.import_issuers(issuer)

        # Utterances with labels ...
        utterance = corpus.new_utterance(file_idx, file_idx, issuer_idx)
        utterance.set_label_list(annotations.LabelList(idx='transcription', labels=[
            annotations.Label(str(digit))
        ]))

        return corpus

Testing

For testing a reader the tests.corpus.io.reader_test.CorpusReaderTest can be used. It provides base test methods for checking the correctness/existence of the basic components (tracks, utterances, labels, …).

from tests.corpus.io import reader_test as rt

class TestMyReader(rt.CorpusReaderTest):

   #
   # Define via EXPECTED_* variables, what components are expected to be loaded
   #
   EXPECTED_NUMBER_OF_TRACKS = 3
   EXPECTED_TRACKS = [
      rt.ExpFileTrack('file-id', '/path/to/file'),
   ]

   #
   # Override the load method, that loads the sample-corpus.
   #
   def load(self):
      return MyReader().load('/path/to/sample/corpus')

For testing any custom functionality specific test-methods can be added as well.

Corpus Writer

A writer is only useful for custom formats. For a specific dataset a writer is most likely not needed. First we create a new class that inherits from audiomate.corpus.io.CorpusWriter. There we have to implement two methods. The type method just has to return a string with the name of the new dataset/format. The _save method does the serialization of the given corpus to the given path.

from audiomate.corpus.io import base


class DefaultWriter(base.CorpusWriter):

    @classmethod
    def type(cls):
        return 'MyDataset'

    def _save(self, corpus, path):
        # Do the serialization