Datasets (cpmpy.tools.datasets)
This module provides an abstract, PyTorch-style dataset interface for Constraint Optimisation (CO) benchmarks. With a single line of code, classical benchmarks such as XCSP3, PSPLib and JSPLib can be downloaded and iterated over.
Available datasets:
XCSP3Dataset: XCSP3 competition benchmark instances for constraint satisfaction and optimization.
Note
Whilst the dataset class provides a PyTorch-compatible access pattern, it has no actual dependency on PyTorch and can be used without PyTorch installed.
Class hierarchy:
Dataset (ABC)
└── FileDataset (ABC)
    ├── XCSP3Dataset
    └── (your dataset here)
The class hierarchy takes inspiration from conventions within the ML community and is structured to support more exotic dataset types in the future. Currently, only file-based datasets are supported, i.e. datasets where the instances are stored as files on disk.
The base classes standardize:
download and local storage of benchmark instances (file-based datasets)
instance access via __len__/__getitem__ (PyTorch compatibility)
optional parse/transform/target_transform arguments
dataset metadata (with sidecar collection)
To implement a new dataset, one needs to subclass one of the abstract dataset classes, and provide implementation for the following methods:
categories: return a dictionary of category labels, describing to which subset the dataset has been restricted (year, track, …)
download: download the dataset (a helper function _download_file() is provided)
Some optional methods to override are:
collect_instance_metadata: collect metadata about individual instances (e.g. number of variables, constraints, …), potentially domain-specific
open: how to open the instance file (e.g. for compressed files using .xz, .lzma, .gz, …)
Datasets must also implement the following dataset metadata attributes:
name: the name of the dataset
description: a short description of the dataset
homepage: a URL to the homepage of the dataset
citation: optionally, a list of citations for the dataset
All parts for which an implementation must be provided are marked with the @abstractmethod decorator and raise a NotImplementedError if not overridden.
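The requirements above can be sketched with the standard library alone. The snippet below is a simplified stand-in and does not use the cpmpy classes: `SketchFileDataset` and `ToyDataset` are hypothetical names, the homepage URL is a placeholder, and only the attribute and method names (`name`, `description`, `homepage`, `citation`, `categories`, `download`, `dataset_metadata`) are taken from this page.

```python
from abc import ABC, abstractmethod
from typing import Any, ClassVar, Dict, List


class SketchFileDataset(ABC):
    """Hypothetical stand-in for an abstract file-based dataset class."""

    # Dataset-level metadata attributes that every subclass fills in.
    name: ClassVar[str]
    description: ClassVar[str]
    homepage: ClassVar[str]
    citation: ClassVar[List[str]] = []

    @abstractmethod
    def categories(self) -> Dict[str, Any]:
        """Labels describing the selected subset (year, track, ...)."""

    @abstractmethod
    def download(self) -> None:
        """Fetch the instance files."""

    @classmethod
    def dataset_metadata(cls) -> Dict[str, Any]:
        """Collect the dataset-level attributes into one dictionary."""
        return {
            "name": cls.name,
            "description": cls.description,
            "homepage": cls.homepage,
            "citation": list(cls.citation),
        }


class ToyDataset(SketchFileDataset):
    """Hypothetical dataset restricted to a single year."""

    name = "toy"
    description = "A toy dataset used only for illustration."
    homepage = "https://example.org/toy"  # placeholder URL
    citation: ClassVar[List[str]] = []

    def __init__(self, year: int) -> None:
        self.year = year

    def categories(self) -> Dict[str, Any]:
        return {"year": self.year}

    def download(self) -> None:
        pass  # nothing to fetch in this sketch


toy = ToyDataset(year=2024)
print(toy.categories())
print(ToyDataset.dataset_metadata())
```

Keeping the metadata in class attributes means it can be queried through the classmethod without instantiating (and thus without downloading) the dataset.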
Dataset files are preferably downloaded as-is, without any preprocessing or decompression. Upon initial download, instance-level metadata gets automatically collected and stored in a JSON sidecar file. All subsequent accesses to the dataset will use the sidecar file to avoid re-collecting the metadata.
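The collect-once, reuse-afterwards behaviour described above can be illustrated with a minimal sketch. It is not the library's implementation: the function names are hypothetical, the collected fields are invented for the example, and placing one sidecar next to each instance file is an assumption. Only the `.meta.json` extension comes from this page.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict

METADATA_EXTENSION = ".meta.json"  # value documented for FileDataset


def collect_instance_metadata(file: Path) -> Dict[str, Any]:
    """Toy collector: record the number of lines in an instance file."""
    text = file.read_text()
    return {"num_lines": len(text.splitlines())}


def load_metadata(file: Path) -> Dict[str, Any]:
    """Return cached metadata, collecting and storing it on first access."""
    # Assumption: one sidecar per instance, stored next to the instance file.
    sidecar = file.with_name(file.name + METADATA_EXTENSION)
    if sidecar.exists():
        return json.loads(sidecar.read_text())
    meta = collect_instance_metadata(file)
    sidecar.write_text(json.dumps(meta))
    return meta


with tempfile.TemporaryDirectory() as tmp:
    instance = Path(tmp) / "instance.txt"
    instance.write_text("line one\nline two\n")
    first = load_metadata(instance)   # collects and writes the sidecar
    sidecar_exists = (Path(tmp) / "instance.txt.meta.json").exists()
    second = load_metadata(instance)  # served from the sidecar
```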
Iterating over the dataset is done in the same way as for a PyTorch dataset. Each iteration yields a 2-tuple (x, y) of:
x: instance reference (a file path is the only supported instance reference type at the moment)
y: instance metadata (solution, features, origin, etc.)
Example:
dataset = MyDataset(download=True)
for instance, info in dataset:
    print(instance, info)
The dataset also supports PyTorch-style transforms and target transforms.
dataset = MyDataset(download=True, transform=my_model_loader)
for model, info in dataset:
    ...
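To make the access pattern concrete, here is a minimal, self-contained sketch of how `__len__`, `__getitem__` and the two transforms typically interact in a PyTorch-style dataset. `ListDataset` is a hypothetical in-memory class written for this example; it is not part of cpmpy, and the real classes may apply transforms differently.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple


class ListDataset:
    """Hypothetical in-memory dataset following the (x, y) access pattern."""

    def __init__(
        self,
        items: List[Tuple[str, Dict[str, Any]]],
        transform: Optional[Callable] = None,
        target_transform: Optional[Callable] = None,
    ) -> None:
        self.items = items
        self.transform = transform
        self.target_transform = target_transform

    def __len__(self) -> int:
        return len(self.items)

    def __getitem__(self, index: int) -> Tuple[Any, Any]:
        x, y = self.items[index]  # raises IndexError past the end
        if self.transform is not None:
            x = self.transform(x)  # e.g. load a model from the file path
        if self.target_transform is not None:
            y = self.target_transform(y)  # e.g. select one metadata field
        return x, y


items = [("a.txt", {"origin": "demo"}), ("b.txt", {"origin": "demo"})]
dataset = ListDataset(
    items,
    transform=str.upper,                       # stand-in for a model loader
    target_transform=lambda y: y["origin"],
)
pairs = list(dataset)  # iteration works through __getitem__ and IndexError
print(pairs)
```

Because the class defines only `__len__` and `__getitem__`, it can be iterated with a plain `for` loop and, if PyTorch is installed, should also be usable wherever a map-style dataset is expected.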
List of classes
Dataset: Abstract base class for CO datasets.
FileDataset: Abstract base class for PyTorch-style datasets of file-based CO benchmarking sets.
List of functions
from_files: Create a FileDataset from a list of files.
- class cpmpy.tools.datasets.core.Dataset(transform: Callable | None = None, target_transform: Callable | None = None)[source]
Abstract base class for CO datasets.
The Dataset class is an abstract base class for all datasets. It provides a standardized interface for the PyTorch-compatible access pattern for CO benchmark datasets. It is not meant to be instantiated directly, but rather subclassed. Have a look at FileDataset for a concrete implementation.
Each instance in a dataset is characterised by an (x, y) pair of:
x: instance reference (e.g., file path, database key, generated seed, …)
y: instance metadata (solution, features, origin, etc.)
Instances are indexed by a unique identifier, for example the positional index within the dataset, and can be accessed by that identifier.
Implementing this class requires implementing the following methods:
__len__: return the total number of instances
__getitem__: return the instance and metadata at the given index / identifier
And providing the following class attributes:
name: the name of the dataset
description: a short description of the dataset
homepage: a URL to the homepage of the dataset
citation: optionally, a list of citations for the dataset
Optional methods to override:
instance_metadata: return the metadata for a given instance
- citation: ClassVar[List[str]] = []
- classmethod dataset_metadata() Dict[str, Any][source]
Return dataset-level metadata as a dictionary.
- Returns:
The dataset-level metadata.
- Return type:
dict
- description: ClassVar[str]
- homepage: ClassVar[str]
- abstractmethod instance_metadata(instance: Any) Dict[str, Any][source]
Return the metadata for a given instance.
- Parameters:
instance – the instance identifier for which to return the metadata
- Returns:
The metadata for the instance.
- Return type:
dict
- name: ClassVar[str]
- class cpmpy.tools.datasets.core.FileDataset(dataset_dir: str | PathLike[str] = '.', transform: Callable | None = None, target_transform: Callable | None = None, download: bool = False, parse: bool = False, extension: str = '.txt', **kwargs: Any)[source]
Abstract base class for PyTorch-style datasets of file-based CO benchmarking sets.
The FileDataset class provides a standardized interface for downloading and accessing file-backed benchmark instances. This class should not be used on its own. Either have a look at one of the concrete subclasses, providing access to well-known datasets from the community, or use this class as the base for your own dataset.
Two dataset styles are supported:
Model-defined instances: files directly encode variables/constraints/objective (for example XCSP3, OPB, DIMACS, FlatZinc). In this case, users typically pass a model-builder as transform, converting the raw file instance into a model.
Data-only instances: files encode problem data for a fixed family, but no model. In this case, subclasses should override parse() and users can enable parse=True to obtain parsed intermediate data structures (for example table/dict structures for RCPSP-style scheduling data), then build a model separately or via a transform.
- METADATA_EXTENSION: ClassVar[str] = '.meta.json'
- abstractmethod categories() Dict[str, Any][source]
Labels that place the instances in categories matching those of the dataset, e.g.:
year
track
- citation: ClassVar[List[str]] = []
- collect_instance_metadata(file: Path) Dict[str, Any][source]
Provide domain-specific instance metadata. Called once after download for each instance.
- Parameters:
file – path to the instance file
- Returns:
dict with instance-specific metadata fields
- classmethod dataset_metadata() Dict[str, Any]
Return dataset-level metadata as a dictionary.
- Returns:
The dataset-level metadata.
- Return type:
dict
- description: ClassVar[str]
- homepage: ClassVar[str]
- instance_metadata(instance: PathLike) Dict[str, Any][source]
Return the metadata for a given instance file.
- Parameters:
instance (os.PathLike) – Path to the instance file.
- Returns:
The metadata for the instance.
- Return type:
dict
- name: ClassVar[str]
- classmethod open(instance: PathLike) TextIOBase[source]
How an instance file from the dataset should be opened. Especially useful when files come compressed and cannot be read with the Python standard library's built-in open(), e.g. '.xz', '.lzma'.
- Parameters:
instance (os.PathLike) – File path to the instance file.
- Returns:
The opened file handle.
- Return type:
io.TextIOBase
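As an illustration of what an override of open() might do, the sketch below picks a standard-library opener based on the file suffix and always returns a text-mode handle. The function name `open_instance` and the suffix-based dispatch are assumptions made for this example; the library's own default behaviour is not shown here.

```python
import gzip
import lzma
import tempfile
from pathlib import Path
from typing import TextIO


def open_instance(instance: Path) -> TextIO:
    """Open an instance file in text mode, decompressing on the fly."""
    suffix = Path(instance).suffix.lower()
    if suffix in (".xz", ".lzma"):
        # lzma.open auto-detects the .xz and legacy .lzma container formats.
        return lzma.open(instance, mode="rt", encoding="utf-8")
    if suffix == ".gz":
        return gzip.open(instance, mode="rt", encoding="utf-8")
    return open(instance, mode="r", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "instance.xml.xz"
    # Write a small compressed file so the example is self-contained.
    with lzma.open(path, mode="wt", encoding="utf-8") as handle:
        handle.write("<instance/>")
    with open_instance(path) as handle:
        content = handle.read()

print(content)
```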
- parse(instance: PathLike) Any[source]
Parse an instance file into intermediate data structures.
Override this for datasets whose files contain problem data but not an explicit model. Typical outputs are structures like tables, arrays, and dictionaries that can then be passed to a separate model-construction function.
Default behavior is read(instance), i.e. return raw text content.
- Parameters:
instance (os.PathLike) – File path to the instance file.
- Returns:
The parsed intermediate data structure(s).
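The following sketch shows the kind of intermediate structure a parse() override could return for a data-only dataset. The text format (a header line with two counts followed by one row of durations per job) and the function name `parse_toy_scheduling` are hypothetical and chosen only for illustration; they do not correspond to any real benchmark format. The function takes the raw text directly so that the example runs without files.

```python
from typing import Any, Dict, List


def parse_toy_scheduling(text: str) -> Dict[str, Any]:
    """Parse a hypothetical format: 'n_jobs n_machines', then one row per job."""
    rows: List[List[str]] = [
        line.split() for line in text.splitlines() if line.strip()
    ]
    n_jobs, n_machines = (int(token) for token in rows[0])
    durations = [[int(token) for token in row] for row in rows[1 : 1 + n_jobs]]
    return {"n_jobs": n_jobs, "n_machines": n_machines, "durations": durations}


raw = "2 3\n4 1 2\n3 5 1\n"
data = parse_toy_scheduling(raw)

# A separate model-construction step would consume `data`; here we only
# compute a simple aggregate to show the structure is usable.
total_work = sum(sum(row) for row in data["durations"])
print(data, total_work)
```

Returning plain dictionaries and lists keeps the parsing step independent of any modelling library, so the same parsed data can feed different model builders.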
- cpmpy.tools.datasets.core.from_files(dataset_dir: PathLike, extension: str = '.txt') FileDataset[source]
Create a FileDataset from a list of files.
Example:
from cpmpy.tools.datasets.core import from_files

dataset = from_files("path/to/dataset_files", extension=".txt")
for x, y in dataset:
    print(x, y)