<?xml version="1.0" encoding="UTF-8" ?><OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"><responseDate>2026-04-24T13:36:09Z</responseDate><request identifier="10.35097/8ppb5x50nyvw1wa7" metadataPrefix="datacite" verb="GetRecord">https://www.radar-service.eu/oai/OAIHandler</request><GetRecord><record><header><identifier>10.35097/8ppb5x50nyvw1wa7</identifier><datestamp>2025-02-20T09:02:35Z</datestamp><setSpec>radar4kit</setSpec></header><metadata><resource xmlns="http://datacite.org/schema/kernel-4"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://datacite.org/schema/kernel-4 https://schema.datacite.org/meta/kernel-4.4/metadata.xsd">
   <identifier identifierType="DOI">10.35097/8ppb5x50nyvw1wa7</identifier>
   <creators>
      <creator>
         <creatorName>Bach, Jakob</creatorName>
         <givenName>Jakob</givenName>
         <familyName>Bach</familyName>
         <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="http://orcid.org/">0000-0003-0301-2798</nameIdentifier>
         <affiliation/>
      </creator>
   </creators>
   <titles>
      <title>Experimental Data for the Paper "Using Constraints to Discover Sparse and Alternative Subgroup Descriptions" (Version 2)</title>
   </titles>
   <publisher>Karlsruhe Institute of Technology</publisher>
   <dates>
      <date dateType="Created">2025</date>
   </dates>
   <publicationYear>2025</publicationYear>
   <subjects>
      <subject>Computer Science</subject>
      <subject>subgroup discovery</subject>
      <subject>alternatives</subject>
      <subject>constraints</subject>
      <subject>satisfiability modulo theories</subject>
      <subject>explainability</subject>
      <subject>interpretability</subject>
      <subject>XAI</subject>
   </subjects>
   <resourceType resourceTypeGeneral="Dataset"/>
   <rightsList>
      <rights rightsURI="info:eu-repo/semantics/openAccess">Open Access</rights>
      <rights schemeURI="https://spdx.org/licenses/"
              rightsIdentifierScheme="SPDX"
              rightsIdentifier="CC-BY-4.0"
              rightsURI="https://creativecommons.org/licenses/by/4.0/legalcode">Creative Commons Attribution 4.0 International</rights>
   </rightsList>
   <contributors>
      <contributor contributorType="RightsHolder">
         <contributorName>Bach, Jakob</contributorName>
         <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="https://orcid.org/">0000-0003-0301-2798</nameIdentifier>
      </contributor>
   </contributors>
   <descriptions>
      <description descriptionType="Abstract">These are the experimental data for the second version (v2) of the paper&#xD;
&#xD;
&gt; Bach, Jakob. "Using Constraints to Discover Sparse and Alternative Subgroup Descriptions"&#xD;
&#xD;
This version of the paper was published on [arXiv](https://arxiv.org/) in 2025.&#xD;
You can find the paper [here](https://doi.org/10.48550/arXiv.2406.01411) and the code [here](https://github.com/Jakob-Bach/Constrained-Subgroup-Discovery).&#xD;
See the `README` for details.&#xD;
&#xD;
The datasets used in our study (which we also provide here) originate from [PMLB](https://epistasislab.github.io/pmlb/).&#xD;
The corresponding [GitHub repository](https://github.com/EpistasisLab/pmlb) is MIT-licensed ((c) 2016 Epistasis Lab at UPenn).&#xD;
Please see the file `LICENSE` in the folder `datasets/` for the license text.</description>
      <description descriptionType="TechnicalInfo"># Experimental Data for the Paper "Using Constraints to Discover Sparse and Alternative Subgroup Descriptions" (Version 2)&#xD;
&#xD;
These are the experimental data for the second version (v2) of the paper&#xD;
&#xD;
&gt; Bach, Jakob. "Using Constraints to Discover Sparse and Alternative Subgroup Descriptions"&#xD;
&#xD;
published on [arXiv](https://arxiv.org/) in 2025.&#xD;
If we create further versions of this paper in the future, these experimental data may cover them as well.&#xD;
&#xD;
Check our [GitHub repository](https://github.com/Jakob-Bach/Constrained-Subgroup-Discovery) for the code and instructions to reproduce the experiments.&#xD;
We obtained the experimental results on a server with an `AMD EPYC 7551` CPU (32 physical cores, base clock of 2.0 GHz) and 160 GB RAM.&#xD;
The operating system was `Ubuntu 20.04.6 LTS`.&#xD;
The Python version was `3.8` for the main experiments and `3.9` for the competitor-runtime experiments.&#xD;
With this configuration, running the main experimental pipeline (`main_experiments/run_experiments.py`) took about 34 hours.&#xD;
&#xD;
Note that the experimental data originate from multiple pipeline runs, as we have two experimental pipelines, and we reran one of them to include additional subgroup-discovery methods:&#xD;
&#xD;
- The commit hash for the last run of the competitor-runtime experimental pipeline (`competitor_runtime_experiments/run_competitor_runtime_experiments.py`) is [1a026326b3](https://github.com/Jakob-Bach/Constrained-Subgroup-Discovery/tree/1a026326b34246a9fcf4df658a99a898b5fd38e1) (tag: `competitor-runtime-2025-01-12`).&#xD;
- The commit hash for the last run of the main experimental pipeline (`main_experiments/run_experiments.py`) except the subgroup-discovery methods *BSD* and *SD-Map* is [0a57bcd529](https://github.com/Jakob-Bach/Constrained-Subgroup-Discovery/tree/0a57bcd52938dce6285e8113d777360c2b17f30f) (tag: `run-2024-05-13`).&#xD;
  These data are identical to the [experimental data for v1 of the paper](https://doi.org/10.35097/caKKJCtoKqgxyvqG).&#xD;
- The commit hash for the last run of the main experimental pipeline (`main_experiments/run_experiments.py`) for the (new) subgroup-discovery methods *BSD* and *SD-Map* is [50dd82e0fc](https://github.com/Jakob-Bach/Constrained-Subgroup-Discovery/tree/50dd82e0fcd175a4068ef4ce838dcacdc1045fc9) (tag: `run-2025-01-21-arXiv-v2`).&#xD;
&#xD;
The main experimental pipeline did not change between the last two mentioned commits (except for refactorings and the inclusion of two new subgroup-discovery methods),&#xD;
and the competitor-runtime experimental pipeline did not change either.&#xD;
Thus, using the tag `run-2025-01-21-arXiv-v2` to reproduce all experiments should yield the same results (except runtimes and timeout-affected results).&#xD;
&#xD;
The commit hash for the last run of the two evaluation pipelines (`main_experiments/run_evaluation_arxiv.py` and `competitor_runtime_experiments/run_competitor_runtime_evaluation_arxiv.py`) is [bc3aafc904](https://github.com/Jakob-Bach/Constrained-Subgroup-Discovery/tree/bc3aafc904cc27214d458f93782104484ffc372d) (tag: `evaluation-2025-02-16-arXiv-v2`).&#xD;
&#xD;
The experimental data are stored in five folders, i.e., `competitor-runtime-datasets/`, `competitor-runtime-results/`, `datasets/`, `plots/`, and `results/`.&#xD;
Further, the console output of `main_experiments/run_evaluation_arxiv.py` is stored in `Evaluation_console_output_main.txt`,&#xD;
and the console output of `competitor_runtime_experiments/run_competitor_runtime_evaluation_arxiv.py` is stored in `Evaluation_console_output_competitor_runtimes.txt`.&#xD;
We manually copied both evaluation outputs from the console to a file.&#xD;
In the following, we describe the structure and content of each data file.&#xD;
&#xD;
## `competitor-runtime-datasets/`&#xD;
&#xD;
These are the input data for the competitor-runtime experimental pipeline `competitor_runtime_experiments/run_competitor_runtime_experiments.py`.&#xD;
They were obtained with the script `competitor_runtime_experiments/prepare_competitor_runtime_datasets.py`.&#xD;
The folder structure of `competitor-runtime-datasets/` is similar to that of the dataset folder of the main experiments (`datasets/`), so please consult the corresponding section of this document for more details on the contained file types.&#xD;
The main difference is that only five of the 27 PMLB datasets are included, plus the `iris` dataset provided by `scikit-learn`.&#xD;
&#xD;
## `competitor-runtime-results/`&#xD;
&#xD;
These are the output data of the competitor-runtime experimental pipeline in the form of CSVs, produced by the script `competitor_runtime_experiments/run_competitor_runtime_experiments.py`.&#xD;
`_results.csv` contains all results merged into one file and acts as input for the script `competitor_runtime_experiments/run_competitor_runtime_evaluation_arxiv.py`.&#xD;
The remaining 325 files are subsets of the results.&#xD;
The competitor-runtime experimental pipeline parallelizes over 6 datasets, 5 cross-validation folds, and 17 subgroup-discovery methods.&#xD;
Thus, a full cross-product would yield `6 * 5 * 17 = 510` files containing subsets of the results, but some subgroup-discovery methods timed out on some datasets,&#xD;
so the corresponding result files are missing.&#xD;
&#xD;
Each row in a result file corresponds to one subgroup.&#xD;
One can identify individual subgroup-discovery runs with a combination of multiple columns, namely:&#xD;
&#xD;
- dataset `dataset_name`&#xD;
- cross-validation fold `split_idx`&#xD;
- subgroup-discovery method `sd_name`&#xD;
- feature-cardinality threshold `param.k`&#xD;
&#xD;
The remaining column, `fitting_time`, represents the evaluation metric.&#xD;
&#xD;
In detail, all result files for the competitor-runtime experiments contain the following columns (whose names are consistent with the main experiments):&#xD;
&#xD;
- `fitting_time` (non-negative float): The runtime (in seconds) of the subgroup-discovery method.&#xD;
- `dataset_name` (string, 6 different values): The name of the dataset used for subgroup discovery.&#xD;
- `split_idx` (int in `[0, 4]`): The index of the cross-validation fold of the dataset used for subgroup discovery.&#xD;
- `sd_name` (string, 17 different values): The name of the subgroup-discovery method, consisting of the package name and the algorithm name (e.g., `sd4py.Beam`).&#xD;
- `param.k` (int in `[1, 5]`): The feature-cardinality threshold for subgroup descriptions.&#xD;
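&#xD;
For instance, the mean `fitting_time` per subgroup-discovery method can be computed with `pandas` as follows (a minimal sketch with toy data, not the real results; the method name `pkg.Other` is a hypothetical placeholder, while `sd4py.Beam` is an actual value):&#xD;
&#xD;
```python&#xD;
import pandas as pd&#xD;
&#xD;
# Toy stand-in for "competitor-runtime-results/_results.csv" with the columns above;&#xD;
# "pkg.Other" is a hypothetical method name for illustration.&#xD;
results = pd.DataFrame({&#xD;
    'fitting_time': [0.5, 0.7, 1.2, 1.0],&#xD;
    'dataset_name': ['iris', 'iris', 'iris', 'iris'],&#xD;
    'split_idx': [0, 1, 0, 1],&#xD;
    'sd_name': ['sd4py.Beam', 'sd4py.Beam', 'pkg.Other', 'pkg.Other'],&#xD;
    'param.k': [1, 1, 1, 1],&#xD;
})&#xD;
&#xD;
# Mean runtime per subgroup-discovery method:&#xD;
mean_times = results.groupby('sd_name')['fitting_time'].mean()&#xD;
```&#xD;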
&#xD;
## `datasets/`&#xD;
&#xD;
These are the input data for the main experimental pipeline `main_experiments/run_experiments.py`, i.e., the prediction datasets.&#xD;
The folder contains one overview file, one license file, and two files for each of the 27 datasets.&#xD;
&#xD;
The original datasets were downloaded from [PMLB](https://epistasislab.github.io/pmlb/) with the script `main_experiments/prepare_datasets.py`.&#xD;
Note that we do not own the copyright for these datasets.&#xD;
However, the [GitHub repository of PMLB](https://github.com/EpistasisLab/pmlb), which stores the original datasets, is MIT-licensed ((c) 2016 Epistasis Lab at UPenn).&#xD;
Thus, we include the file `LICENSE` from that repository.&#xD;
&#xD;
After downloading from `PMLB`, we split each dataset into the feature part (`_X.csv`) and the target part (`_y.csv`), which we save separately.&#xD;
Both file types are CSVs that, apart from the column names, contain only numeric values (categorical features are ordinally encoded in `PMLB`).&#xD;
There are no missing values.&#xD;
Each row corresponds to a data object (= instance, sample), and each column either corresponds to a feature (in `_X`) or the target (in `_y`).&#xD;
The first line in each `_X` file contains the names of the features as strings; for `_y` files, there is only one column, always named `target`.&#xD;
For the prediction target, we ensured that the minority (i.e., less frequent) class is the positive class (i.e., has the class label `1`), so the labeling may differ from PMLB.&#xD;
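&#xD;
To work with one dataset, you can read its two files with `pandas` (a small sketch; the helper name `load_dataset` and the example dataset name are our own for illustration, not part of the pipeline code):&#xD;
&#xD;
```python&#xD;
import pandas as pd&#xD;
&#xD;
def load_dataset(dataset_name: str, directory: str = 'datasets') -&gt; tuple:&#xD;
    # Read the feature part and the target part of one prediction dataset.&#xD;
    X = pd.read_csv(f'{directory}/{dataset_name}_X.csv')&#xD;
    # "_y" files have a single column named "target"; squeeze it into a Series:&#xD;
    y = pd.read_csv(f'{directory}/{dataset_name}_y.csv').squeeze(axis='columns')&#xD;
    return X, y&#xD;
```&#xD;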
&#xD;
`_dataset_overview.csv` contains meta-data for the datasets, like the number of instances and features.&#xD;
&#xD;
## `plots/`&#xD;
&#xD;
These are the output files of the main evaluation pipeline `main_experiments/run_evaluation_arxiv.py`.&#xD;
We include these plots in our paper.&#xD;
&#xD;
## `results/`&#xD;
&#xD;
These are the output data of the main experimental pipeline in the form of CSVs, produced by the script `main_experiments/run_experiments.py`.&#xD;
`_results.csv` contains all results merged into one file and acts as input for the script `main_experiments/run_evaluation_arxiv.py`.&#xD;
The remaining files are subsets of the results, as the main experimental pipeline parallelizes over 27 datasets, 5 cross-validation folds, and 8 subgroup-discovery methods.&#xD;
Thus, there are `27 * 5 * 8 = 1080` files containing subsets of the results.&#xD;
&#xD;
Each row in a result file corresponds to one subgroup.&#xD;
One can identify individual subgroup-discovery runs with a combination of multiple columns, namely:&#xD;
&#xD;
- dataset `dataset_name`&#xD;
- cross-validation fold `split_idx`&#xD;
- subgroup-discovery method `sd_name`&#xD;
- feature-cardinality threshold `param.k` (missing value if no feature-cardinality constraint employed)&#xD;
- solver timeout `param.timeout` (missing value if not solver-based search)&#xD;
- number of alternatives `param.a` (missing value if only original subgroup searched)&#xD;
- dissimilarity threshold `param.tau_abs` (missing value if only original subgroup searched)&#xD;
&#xD;
For each value combination of these seven columns, there is either one subgroup (search for original subgroups)&#xD;
or six subgroups (search for alternative subgroup descriptions, in which case the column `alt.number` identifies individual subgroups within a search run).&#xD;
Further, note that the last four of these columns contain missing values, which should be treated as a category of their own.&#xD;
In particular, if you use `groupby()` from `pandas` to analyze the results and want to include any of these four columns in the grouping,&#xD;
you should either fill the missing values with an (arbitrary) placeholder value or pass the parameter `dropna=False`,&#xD;
because the grouping otherwise ignores rows with missing values in the group columns by default.&#xD;
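&#xD;
The following sketch illustrates this behavior with toy data (not the real results): default grouping silently drops rows with missing group keys, while `dropna=False` keeps them:&#xD;
&#xD;
```python&#xD;
import pandas as pd&#xD;
&#xD;
# Illustrative toy data: one row per subgroup, with missing values&#xD;
# in "param.k" for runs without a feature-cardinality constraint.&#xD;
results = pd.DataFrame({&#xD;
    'sd_name': ['SMT', 'SMT', 'Beam'],&#xD;
    'param.k': [3, float('nan'), float('nan')],&#xD;
    'test_wracc': [0.1, 0.12, 0.08],&#xD;
})&#xD;
&#xD;
# Default grouping drops the two rows with a missing "param.k":&#xD;
print(len(results.groupby(['sd_name', 'param.k']).size()))  # 1 group&#xD;
&#xD;
# With dropna=False, the missing value forms its own category:&#xD;
print(len(results.groupby(['sd_name', 'param.k'], dropna=False).size()))  # 3 groups&#xD;
```&#xD;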
&#xD;
The remaining columns represent results and evaluation metrics.&#xD;
&#xD;
In detail, all result files contain the following columns:&#xD;
&#xD;
- `objective_value` (float `&gt;= -0.25` + missing values): Objective value of the subgroup-discovery method on the training set.&#xD;
  Usually quantifies WRAcc (in `[-0.25, 0.25]`) when searching original subgroups and normalized Hamming similarity (in `[0, 1]`) when searching alternative subgroup descriptions.&#xD;
  First exception: The subgroup-discovery method *MORS* has missing values since *MORS* does not explicitly compute an objective when searching for subgroups.&#xD;
  Second exception: The subgroup-discovery methods *BSD* and *SD-Map* use WRAcc times the number of data objects (a dataset-dependent constant) as objective,&#xD;
  so the objective value is higher than for the remaining subgroup-discovery methods.&#xD;
- `optimization_status` (string, 2 different values + missing values): For *SMT*, `sat` if optimal solution found and `unknown` if timeout.&#xD;
  Missing value for all other subgroup-discovery methods (which do not use solver timeouts).&#xD;
- `optimization_time` (non-negative float): The optimization runtime (in seconds) inside the subgroup-discovery method, i.e., without pre- and post-processing steps.&#xD;
- `fitting_time` (non-negative float): The complete runtime (in seconds) of the subgroup-discovery method (as reported in the paper), i.e., including pre- and post-processing steps.&#xD;
  Very similar to `optimization_time` except for *SMT* as the subgroup-discovery method, which may spend a considerable amount of time formulating the optimization problem.&#xD;
- `train_wracc` (float in `[-0.25, 0.25]`): The weighted relative accuracy (WRAcc) of the subgroup description on the training set.&#xD;
- `test_wracc` (float in `[-0.25, 0.25]`): The weighted relative accuracy (WRAcc) of the subgroup description on the test set.&#xD;
- `train_nwracc` (float in `[-1, 1]`): The normalized weighted relative accuracy (WRAcc divided by its dataset-dependent maximum) of the subgroup description on the training set.&#xD;
- `test_nwracc` (float in `[-1, 1]`): The normalized weighted relative accuracy (WRAcc divided by its dataset-dependent maximum) of the subgroup description on the test set.&#xD;
- `box_lbs` (list of floats, e.g., `[-inf, 0, -inf, -2, 8]`): The lower bounds for each feature in the subgroup description.&#xD;
  Negative infinity if a feature's lower bound did not exclude any data objects from the subgroup.&#xD;
- `box_ubs` (list of floats, e.g., `[inf, 10, inf, 5, 9]`): The upper bounds for each feature in the subgroup description.&#xD;
  Positive infinity if a feature's upper bound did not exclude any data objects from the subgroup.&#xD;
- `selected_feature_idxs` (list of non-negative ints, e.g., `[0, 4, 5]`): The indices (starting from 0) of the features selected (= restricted) in the subgroup description.&#xD;
  Is an empty list, i.e., `[]`, if no feature was restricted (so the subgroup contains all data objects).&#xD;
- `dataset_name` (string, 27 different values): The name of the `PMLB` dataset used for subgroup discovery.&#xD;
- `split_idx` (int in `[0, 4]`): The index of the cross-validation fold of the dataset used for subgroup discovery.&#xD;
- `sd_name` (string, 8 different values): The name of the subgroup-discovery method (`Beam`, `BI`, `BSD`, `MORS`, `PRIM`, `Random`, `SD-Map`, or `SMT`).&#xD;
- `param.k` (int in `[1, 5]` + missing values): The feature-cardinality threshold for subgroup descriptions.&#xD;
  Missing value if unconstrained subgroup discovery.&#xD;
  Always `3` if alternative subgroup descriptions searched.&#xD;
- `param.timeout` (int in `[1, 2048]` + missing values): For *SMT*, solver timeout (in seconds) for optimization (not including time for formulating the optimization problem).&#xD;
  Missing value for all other subgroup-discovery methods.&#xD;
- `alt.hamming` (float in `[0, 1]` + missing values): Normalized Hamming similarity between the current subgroup (original or alternative) and the original subgroup if alternative subgroup descriptions searched.&#xD;
  Missing value if only original subgroup searched.&#xD;
- `alt.jaccard` (float in `[0, 1]` + missing values): Jaccard similarity between the current subgroup (original or alternative) and the original subgroup if alternative subgroup descriptions searched.&#xD;
  Missing value if only original subgroup searched.&#xD;
- `alt.number` (int in `[0, 5]` + missing values): The number of the current alternative if alternative subgroup descriptions searched.&#xD;
  Missing value if only original subgroup searched.&#xD;
  Thus, original subgroups have either `0` or a missing value in this column (i.e., for experimental settings where alternative subgroup descriptions are searched, there is no separate search for an original subgroup, only a joint sequential search for the original and its alternatives).&#xD;
- `param.a` (int with value `5` + missing values): The number of desired alternative subgroup descriptions, not counting the original (zeroth) subgroup description.&#xD;
  Missing value if only original subgroup searched.&#xD;
- `param.tau_abs` (int in `[1, 3]` + missing values): The dissimilarity threshold for alternatives, corresponding to the absolute number of features that have to be deselected from the original subgroup description and each prior alternative.&#xD;
  Missing value if only original subgroup searched.&#xD;
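&#xD;
For illustration of the run structure described above (toy data, not the real results), the following sketch selects the alternative subgroup descriptions, i.e., rows with an `alt.number` of at least `1`:&#xD;
&#xD;
```python&#xD;
import pandas as pd&#xD;
&#xD;
# Toy stand-in for "results/_results.csv": one run searching alternatives&#xD;
# (six subgroups, alt.number 0-5) and one plain run (single row, missing keys).&#xD;
results = pd.DataFrame({&#xD;
    'sd_name': ['SMT'] * 6 + ['Beam'],&#xD;
    'param.a': [5] * 6 + [float('nan')],&#xD;
    'param.tau_abs': [2] * 6 + [float('nan')],&#xD;
    'alt.number': [0, 1, 2, 3, 4, 5, float('nan')],&#xD;
})&#xD;
&#xD;
# Select only the alternative descriptions (excluding the original, number 0);&#xD;
# rows with a missing alt.number fail the comparison and are excluded, too:&#xD;
alternatives = results[results['alt.number'] &gt;= 1]&#xD;
```&#xD;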
&#xD;
You can easily read in any of the result files with `pandas`:&#xD;
&#xD;
```python&#xD;
import pandas as pd&#xD;
&#xD;
results = pd.read_csv('results/_results.csv')&#xD;
```&#xD;
&#xD;
All result files are comma-separated and contain plain numbers and unquoted strings, apart from the columns `box_lbs`, `box_ubs`, and `selected_feature_idxs`&#xD;
(which represent lists and whose values are quoted except for empty lists).&#xD;
The first line in each result file contains the column names.&#xD;
You can use the following code to make sure that the lists of feature indices are treated as lists (rather than plain strings):&#xD;
&#xD;
```python&#xD;
import ast&#xD;
&#xD;
results['selected_feature_idxs'] = results['selected_feature_idxs'].apply(ast.literal_eval)&#xD;
```&#xD;
&#xD;
Note that this conversion does not work for `box_lbs` and `box_ubs`, where the lists contain not only ordinary numbers but also `-inf` and `inf`;&#xD;
see [this *Stack Overflow* post](https://stackoverflow.com/questions/64773836/error-converting-string-list-to-list-when-it-contains-inf) for potential alternatives.</description>
   </descriptions>
   <relatedIdentifiers>
      <relatedIdentifier relatedIdentifierType="URL" relationType="IsIdenticalTo">https://publikationen.bibliothek.kit.edu/1000179246</relatedIdentifier>
   </relatedIdentifiers>
   <sizes>
      <size/>
   </sizes>
   <formats>
      <format>application/x-tar</format>
   </formats>
</resource></metadata></record></GetRecord></OAI-PMH>