Dataset Base Classes

Abstract base classes for the dataset hierarchy.

Hierarchy

Dataset (ABC)
├── LabeledDataset (ABC)
└── UnlabeledDataset (ABC)

Concrete subclasses live in image_dataset.py and audio_dataset.py.

class src.dataset.Dataset(root: str, lazy: bool = True)

Bases: ABC

Abstract base class for all datasets.

Subclasses must implement _scan_files(), _load_file(), and __getitem__().

Attributes (private):

_root: Root folder path where data files are stored.
_lazy: Whether to load data lazily (on access) or eagerly (at init).
_file_paths: Paths discovered by _scan_files().
_data: Pre-loaded data objects when eager; None when lazy.
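As a rough illustration of how these attributes could interact, the sketch below uses a hypothetical stand-in class, SketchDataset, rather than the real src.dataset.Dataset. The helper _get_data(), the __len__() method, and the toy InMemoryDataset subclass are assumptions made for the example and are not documented parts of the API.

```python
from abc import ABC, abstractmethod
from typing import Any, Optional


class SketchDataset(ABC):
    """Hypothetical stand-in for Dataset; not the real implementation."""

    def __init__(self, root: str, lazy: bool = True) -> None:
        self._root = root
        self._lazy = lazy
        self._file_paths: list[str] = self._scan_files()
        # Eager mode loads everything up front; lazy mode defers to access time.
        self._data: Optional[list[Any]] = (
            None if lazy else [self._load_file(p) for p in self._file_paths]
        )

    @abstractmethod
    def _scan_files(self) -> list[str]:
        """Return the paths of all data files under the root."""

    @abstractmethod
    def _load_file(self, path: str) -> Any:
        """Load a single data object from a path."""

    @abstractmethod
    def __getitem__(self, index: int) -> Any:
        """Return the sample at the given index."""

    def __len__(self) -> int:
        return len(self._file_paths)

    def _get_data(self, index: int) -> Any:
        # Assumed helper: use the pre-loaded object if present, else load now.
        if self._data is not None:
            return self._data[index]
        return self._load_file(self._file_paths[index])


class InMemoryDataset(SketchDataset):
    """Toy subclass: the 'files' are fake names and loading uppercases them."""

    def _scan_files(self) -> list[str]:
        return ["a.txt", "b.txt"]

    def _load_file(self, path: str) -> Any:
        return path.upper()

    def __getitem__(self, index: int) -> Any:
        return self._get_data(index)
```

In this sketch, InMemoryDataset("root", lazy=True) leaves _data as None and loads each item on access, while lazy=False fills _data during construction.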

split(train_ratio: float) → tuple[Dataset, Dataset]

Split the dataset into training and test subsets.

The dataset is shuffled randomly before splitting so that both subsets have approximately the same distribution of examples.

Parameters:

train_ratio – Fraction of data points to include in the training set. Must be in the open interval (0, 1).

Returns:

A (train_dataset, test_dataset) tuple, each of the same concrete class as self.

Raises:

TypeError – If train_ratio is not a float.
ValueError – If train_ratio is not strictly between 0 and 1.
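The validation and shuffle-then-cut behaviour described above can be sketched on plain indices. The function name split_indices, the seed parameter, and the choice to truncate the cut point with int() are assumptions made for illustration; the actual split() may round differently and returns dataset objects rather than index lists.

```python
import random
from typing import Optional


def split_indices(
    n: int, train_ratio: float, seed: Optional[int] = None
) -> tuple[list[int], list[int]]:
    """Shuffle range(n) and cut it into (train, test) index lists."""
    if not isinstance(train_ratio, float):
        raise TypeError("train_ratio must be a float")
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be strictly between 0 and 1")

    indices = list(range(n))
    # A local Random instance keeps the shuffle reproducible for a given seed
    # without touching the global random state.
    random.Random(seed).shuffle(indices)

    cut = int(n * train_ratio)  # assumption: truncate toward zero
    return indices[:cut], indices[cut:]
```

Every index ends up in exactly one of the two lists, so the subsets are disjoint and together cover the whole dataset.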

property root: str: Root folder path.

property lazy: bool: Whether the dataset uses lazy loading.

class src.dataset.LabeledDataset(root: str, lazy: bool = True)

Bases: Dataset, ABC

Abstract base class for datasets that carry per-sample labels.

Adds a _labels list that is populated by _load_labels() and kept parallel to _file_paths.

Concrete subclasses must still implement _scan_files() and _load_file(). They should call _load_labels() after super().__init__(), which calls _scan_files().

property labels: list[Any]

Per-sample labels, parallel to the file-path list.
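The initialization order matters because labels are derived from, or at least aligned with, the scanned paths. The sketch below shows one way that order could look. BaseSketch and LabeledSketch are hypothetical stand-ins for Dataset and LabeledDataset, treating _load_labels() as abstract is an assumption, and PrefixLabeled with its file-name labeling scheme is invented for the example.

```python
from abc import ABC, abstractmethod
from typing import Any


class BaseSketch(ABC):
    """Hypothetical stand-in for Dataset: __init__ calls _scan_files()."""

    def __init__(self, root: str, lazy: bool = True) -> None:
        self._root = root
        self._lazy = lazy
        self._file_paths: list[str] = self._scan_files()

    @abstractmethod
    def _scan_files(self) -> list[str]:
        """Return the paths of all data files under the root."""


class LabeledSketch(BaseSketch, ABC):
    """Hypothetical stand-in for LabeledDataset."""

    def __init__(self, root: str, lazy: bool = True) -> None:
        super().__init__(root, lazy)
        self._labels: list[Any] = []

    @abstractmethod
    def _load_labels(self) -> None:
        """Populate _labels so that it stays parallel to _file_paths."""

    @property
    def labels(self) -> list[Any]:
        return self._labels


class PrefixLabeled(LabeledSketch):
    """Toy subclass: the label is the file-name prefix before '_'."""

    def __init__(self, root: str, lazy: bool = True) -> None:
        super().__init__(root, lazy)  # populates _file_paths
        self._load_labels()           # safe: the paths exist by now

    def _scan_files(self) -> list[str]:
        return ["cat_001.png", "dog_002.png", "cat_003.png"]

    def _load_labels(self) -> None:
        self._labels = [p.split("_")[0] for p in self._file_paths]
```

Calling _load_labels() before super().__init__() in this sketch would fail, because _file_paths would not exist yet.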

class src.dataset.UnlabeledDataset(root: str, lazy: bool = True)

Bases: Dataset, ABC

Abstract base class for datasets without labels.

Provides a concrete __getitem__() that returns only the data.
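A minimal sketch of what returning only the data could mean in practice. UnlabeledSketch is a hypothetical stand-in that does not derive from the real Dataset, and its placeholder _load_file() is an assumption. A real subclass would decode an image or audio file instead.

```python
from typing import Any


class UnlabeledSketch:
    """Hypothetical stand-in; the real class derives from Dataset."""

    def __init__(self, file_paths: list[str]) -> None:
        self._file_paths = file_paths

    def _load_file(self, path: str) -> Any:
        # Placeholder loader: returns a string describing the source path.
        return f"<data from {path}>"

    def __getitem__(self, index: int) -> Any:
        # Return the loaded object on its own, with no label attached.
        return self._load_file(self._file_paths[index])
```

Indexing such a dataset yields the loaded object directly, so callers do not need to unpack a tuple.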