Datasets (`cpmpy.tools.datasets`)

This module provides an abstract, PyTorch-style dataset interface for Constraint Optimisation (CO) benchmarks. With a single line of code, classical benchmarks such as XCSP3, PSPLib and JSPLib can be downloaded and iterated over.

Available datasets:

XCSP3Dataset: XCSP3 competition benchmark instances for constraint satisfaction and optimization.

Note

Although the dataset classes follow a PyTorch-compatible access pattern, they do not depend on PyTorch and can be used without installing it.

Class hierarchy:

Dataset (ABC)
└── FileDataset (ABC)
    ├── XCSP3Dataset
    └── (your dataset here)

The class hierarchy follows conventions from the ML community and is designed to support more exotic dataset types in the future. Currently, only file-based datasets are supported, i.e. datasets whose instances are stored as files on disk.

The base classes standardize:

download and local storage of benchmark instances (file-based datasets)
instance access via __len__ / __getitem__ (PyTorch compatibility)
optional parse/transform/target_transform arguments
dataset metadata (with sidecar collection)

To implement a new dataset, subclass one of the abstract dataset classes and provide an implementation for the following methods:

categories: return a dictionary of category labels describing the subset to which the dataset has been restricted (year, track, …)
download: download the dataset (helper function _download_file() is provided)

Some optional methods to override are:

collect_instance_metadata: collect metadata about individual instances (e.g. number of variables, constraints, …), potentially domain specific
open: how to open the instance file (e.g. for compressed files using .xz, .lzma, .gz, …)

Datasets must also implement the following dataset metadata attributes:

name: the name of the dataset
description: a short description of the dataset
homepage: a URL to the homepage of the dataset
citation: optionally, a list of citations for the dataset

All parts for which an implementation must be provided are marked with an @abstractmethod decorator, raising a NotImplementedError if not overridden.

Dataset files are preferably downloaded as-is, without any preprocessing or decompression. Upon initial download, instance-level metadata gets automatically collected and stored in a JSON sidecar file. All subsequent accesses to the dataset will use the sidecar file to avoid re-collecting the metadata.
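The collect-once, reuse-afterwards behaviour described above can be sketched with the standard library alone. This is not the cpmpy implementation: the per-instance sidecar naming scheme and the helper names `sidecar_path`, `collect` and `load_metadata` are assumptions made for illustration. Only the `.meta.json` suffix is taken from `FileDataset.METADATA_EXTENSION`.

```python
import json
import tempfile
from pathlib import Path

METADATA_EXTENSION = ".meta.json"  # same suffix as FileDataset.METADATA_EXTENSION


def sidecar_path(instance: Path) -> Path:
    """Hypothetical naming scheme: '<instance file name>.meta.json' next to the file."""
    return instance.with_name(instance.name + METADATA_EXTENSION)


def collect(instance: Path) -> dict:
    """Stand-in for collect_instance_metadata(): cheap, file-level facts only."""
    text = instance.read_text()
    return {"num_lines": len(text.splitlines())}


def load_metadata(instance: Path) -> dict:
    """Return cached metadata if a sidecar exists, otherwise collect and store it."""
    sidecar = sidecar_path(instance)
    if sidecar.exists():
        return json.loads(sidecar.read_text())
    metadata = collect(instance)
    sidecar.write_text(json.dumps(metadata))
    return metadata


with tempfile.TemporaryDirectory() as tmp:
    instance = Path(tmp) / "toy.txt"
    instance.write_text("3 jobs\n2 machines\n")
    first = load_metadata(instance)   # collects and writes the sidecar
    sidecar_created = sidecar_path(instance).exists()
    second = load_metadata(instance)  # served from the sidecar
```

One consequence of this design is that a sidecar can go stale if the instance file changes after the first collection. Whether and how the real classes guard against that is not stated here.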

Iterating over the dataset works the same way as for a PyTorch dataset. Each iteration yields a 2-tuple (x, y) of:

x: instance reference (a file path is the only supported instance reference type at the moment)
y: instance metadata (solution, features, origin, etc.)

Example:

dataset = MyDataset(download=True)
for instance, info in dataset:
    print(instance, info)

The dataset also supports PyTorch-style transforms and target transforms.

dataset = MyDataset(download=True, transform=my_model_loader)
for model, info in dataset:
    ...
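To make the access pattern concrete, the following is a minimal in-memory stand-in that mimics `__len__`, `__getitem__`, `transform` and `target_transform`. `ToyDataset` and its contents are hypothetical and not part of cpmpy. The sketch only shows how (x, y) pairs and the two transforms could interact.

```python
from typing import Any, Callable, Optional, Tuple


class ToyDataset:
    """Hypothetical in-memory stand-in; not part of cpmpy."""

    def __init__(self, transform: Optional[Callable] = None,
                 target_transform: Optional[Callable] = None):
        self.transform = transform
        self.target_transform = target_transform
        # (instance reference, metadata) pairs; a file-based dataset would hold paths
        self._items = [("a.txt", {"jobs": 3}), ("b.txt", {"jobs": 5})]

    def __len__(self) -> int:
        return len(self._items)

    def __getitem__(self, index: int) -> Tuple[Any, Any]:
        # An IndexError past the end is what terminates a plain for-loop
        x, y = self._items[index]
        if self.transform is not None:
            x = self.transform(x)          # applied to the instance reference
        if self.target_transform is not None:
            y = self.target_transform(y)   # applied to the metadata
        return x, y


plain = list(ToyDataset())
transformed = list(ToyDataset(transform=str.upper,
                              target_transform=lambda info: info["jobs"]))
```

Because Python falls back to `__getitem__` with increasing integer indices when no `__iter__` is defined, implementing these two methods is enough for `for x, y in dataset` to work.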

List of classes

`Dataset`	Abstract base class for CO datasets.
`FileDataset`	Abstract base class for PyTorch-style datasets of file-based CO benchmarking sets.

List of functions

from_files

Create a FileDataset from a list of files.

class cpmpy.tools.datasets.core.Dataset(transform: Callable | None = None, target_transform: Callable | None = None)

Abstract base class for CO datasets.

The Dataset class is an abstract base class for all datasets. It provides a standardized, PyTorch-compatible access pattern for CO benchmark datasets. It is not meant to be instantiated directly, but rather subclassed. See FileDataset for the file-based implementation.

Each instance in a dataset is characterised by a (x, y) pair of:

x: instance reference (e.g., file path, database key, generated seed, …)

y: instance metadata (solution, features, origin, etc.)

Instances are indexed by a unique identifier, for example their positional index within the dataset, and can be accessed by that identifier.

Implementing this class requires implementing the following methods:

__len__: return the total number of instances

__getitem__: return the instance and metadata at the given index / identifier

And providing the following class attributes:

name: the name of the dataset

description: a short description of the dataset

homepage: a URL to the homepage of the dataset

citation: optionally, a list of citations for the dataset

Optional methods to override:

instance_metadata: return the metadata for a given instance

citation: ClassVar[List[str]] = []

classmethod dataset_metadata() → Dict[str, Any]

Return dataset-level metadata as a dictionary.

Returns: The dataset-level metadata.
Return type: dict

description: ClassVar[str]

homepage: ClassVar[str]

abstractmethod instance_metadata(instance: Any) → Dict[str, Any]

Return the metadata for a given instance.

Parameters: instance – the instance identifier for which to return the metadata
Returns: The metadata for the instance.
Return type: dict

name: ClassVar[str]

class cpmpy.tools.datasets.core.FileDataset(dataset_dir: str | PathLike[str] = '.', transform: Callable | None = None, target_transform: Callable | None = None, download: bool = False, parse: bool = False, extension: str = '.txt', **kwargs: Any)

Abstract base class for PyTorch-style datasets of file-based CO benchmarking sets.

The FileDataset class provides a standardized interface for downloading and accessing file-backed benchmark instances. This class should not be used on its own. Either have a look at one of the concrete subclasses, providing access to well-known datasets from the community, or use this class as the base for your own dataset.

Two dataset styles are supported:

Model-defined instances: files directly encode variables/constraints/objective (for example XCSP3, OPB, DIMACS, FlatZinc). In this case, users typically pass a model-builder as transform, converting the raw file instance into a model.
Data-only instances: files encode problem data for a fixed family, but no model. In this case, subclasses should override parse() and users can enable parse=True to obtain parsed intermediate data structures (for example table/dict structures for RCPSP-style scheduling data), then build a model separately or via a transform.

METADATA_EXTENSION: ClassVar[str] = '.meta.json'

abstractmethod categories() → Dict[str, Any]

Labels that sort instances into the categories defined by the dataset itself, e.g.:

year

track

citation: ClassVar[List[str]] = []

collect_instance_metadata(file: Path) → Dict[str, Any]

Provide domain-specific instance metadata. Called once after download for each instance.

Parameters: file – path to the instance file
Returns: dict with instance-specific metadata fields

classmethod dataset_metadata() → Dict[str, Any]

Return dataset-level metadata as a dictionary.

Returns: The dataset-level metadata.
Return type: dict

description: ClassVar[str]

abstractmethod download(*args: Any, **kwargs: Any)

Download the dataset.

homepage: ClassVar[str]

instance_metadata(instance: PathLike) → Dict[str, Any]

Return the metadata for a given instance file.

Parameters: instance (os.PathLike) – Path to the instance file.
Returns: The metadata for the instance.
Return type: dict

name: ClassVar[str]

classmethod open(instance: PathLike) → TextIOBase

Defines how an instance file from the dataset should be opened. Especially useful when files are compressed (e.g. ‘.xz’, ‘.lzma’) and cannot be read directly with Python’s built-in ‘open’.

Parameters: instance (os.PathLike) – File path to the instance file.
Returns: The opened file handle.
Return type: io.TextIOBase
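As a rough illustration of such an override, the sketch below picks a decompressing opener from the file suffix and reads the whole file through it. The names `_OPENERS`, `open_instance` and `read_instance` are hypothetical, and the suffix table is an assumption. The actual dataset classes may handle compression differently.

```python
import gzip
import io
import lzma
import tempfile
from pathlib import Path

# Hypothetical suffix-to-opener table; the real dataset classes may differ.
_OPENERS = {".xz": lzma.open, ".lzma": lzma.open, ".gz": gzip.open}


def open_instance(instance: Path) -> io.TextIOBase:
    """Open an instance in text mode, decompressing based on the file suffix."""
    opener = _OPENERS.get(Path(instance).suffix, open)  # fall back to built-in open
    return opener(instance, mode="rt", encoding="utf-8")


def read_instance(instance: Path) -> str:
    """Return the full (decompressed) text content of an instance file."""
    with open_instance(instance) as handle:
        return handle.read()


with tempfile.TemporaryDirectory() as tmp:
    compressed = Path(tmp) / "toy.xml.xz"
    with lzma.open(compressed, mode="wt", encoding="utf-8") as handle:
        handle.write("<instance/>")
    plain = Path(tmp) / "toy.txt"
    plain.write_text("3 jobs", encoding="utf-8")

    compressed_text = read_instance(compressed)  # goes through lzma.open
    plain_text = read_instance(plain)            # goes through built-in open
```

Keeping decompression inside the opener means the files on disk can stay exactly as downloaded, while callers always receive a text handle.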

parse(instance: PathLike) → Any

Parse an instance file into intermediate data structures.

Override this for datasets whose files contain problem data but not an explicit model. Typical outputs are structures like tables, arrays, and dictionaries that can then be passed to a separate model-construction function.

Default behavior is read(instance), i.e. return raw text content.

Parameters: instance (os.PathLike) – File path to the instance file.
Returns: The parsed intermediate data structure(s).
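The split between parsing and model construction might look like the sketch below. The 'key: v1 v2 ...' file format and the functions `parse_toy_scheduling` and `build_summary` are invented for illustration. A real override would parse the dataset's actual format and receive a file path rather than a string.

```python
from typing import Dict, List


def parse_toy_scheduling(text: str) -> Dict[str, List[int]]:
    """Parse a made-up 'key: v1 v2 ...' format into a dict of integer lists."""
    data: Dict[str, List[int]] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, values = line.partition(":")
        data[key.strip()] = [int(value) for value in values.split()]
    return data


def build_summary(data: Dict[str, List[int]]) -> int:
    """Stand-in for a separate model-construction step: total processing time."""
    return sum(data["durations"])


raw = "# toy instance\ndurations: 3 2 4\ncapacity: 5\n"
parsed = parse_toy_scheduling(raw)
total = build_summary(parsed)
```

Returning plain dictionaries and lists keeps the parsed data independent of any particular modelling choice, so several models can be built from the same parse result.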

read(instance: PathLike) → str

Read raw file contents from an instance file. Handles optional decompression automatically via dataset.open().

Parameters: instance (os.PathLike) – File path to the instance file.
Returns: The raw file contents.
Return type: str

cpmpy.tools.datasets.core.from_files(dataset_dir: PathLike, extension: str = '.txt') → FileDataset