<?xml version="1.0" encoding="UTF-8" ?><OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"><responseDate>2026-04-24T12:24:56Z</responseDate><request identifier="10.35097/caKKJCtoKqgxyvqG" metadataPrefix="datacite" verb="GetRecord">https://www.radar-service.eu/oai/OAIHandler</request><GetRecord><record><header><identifier>10.35097/caKKJCtoKqgxyvqG</identifier><datestamp>2024-07-10T07:56:47Z</datestamp><setSpec>radar4kit</setSpec></header><metadata><resource xmlns="http://datacite.org/schema/kernel-4"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://datacite.org/schema/kernel-4 https://schema.datacite.org/meta/kernel-4.4/metadata.xsd">
   <identifier identifierType="DOI">10.35097/caKKJCtoKqgxyvqG</identifier>
   <creators>
      <creator>
         <creatorName>Bach, Jakob</creatorName>
         <givenName>Jakob</givenName>
         <familyName>Bach</familyName>
         <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="http://orcid.org/">0000-0003-0301-2798</nameIdentifier>
         <affiliation/>
      </creator>
   </creators>
   <titles>
      <title>Experimental Data for the Paper "Using Constraints to Discover Sparse and Alternative Subgroup Descriptions"</title>
   </titles>
   <publisher>Karlsruhe Institute of Technology</publisher>
   <dates>
      <date dateType="Created">2024</date>
   </dates>
   <publicationYear>2024</publicationYear>
   <subjects>
      <subject>Computer Science</subject>
      <subject>subgroup discovery</subject>
      <subject>alternatives</subject>
      <subject>constraints</subject>
      <subject>satisfiability modulo theories</subject>
      <subject>explainability</subject>
      <subject>interpretability</subject>
      <subject>XAI</subject>
   </subjects>
   <resourceType resourceTypeGeneral="Dataset"/>
   <rightsList>
      <rights rightsURI="info:eu-repo/semantics/openAccess">Open Access</rights>
      <rights schemeURI="https://spdx.org/licenses/"
              rightsIdentifierScheme="SPDX"
              rightsIdentifier="CC-BY-4.0"
              rightsURI="https://creativecommons.org/licenses/by/4.0/legalcode">Creative Commons Attribution 4.0 International</rights>
   </rightsList>
   <contributors>
      <contributor contributorType="RightsHolder">
         <contributorName>Bach, Jakob</contributorName>
         <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="https://orcid.org/">0000-0003-0301-2798</nameIdentifier>
      </contributor>
   </contributors>
   <descriptions>
      <description descriptionType="Abstract">These are the experimental data for the paper&#xD;
&#xD;
&gt; Bach, Jakob. "Using Constraints to Discover Sparse and Alternative Subgroup Descriptions"&#xD;
&#xD;
published on [arXiv](https://arxiv.org/) in 2024.&#xD;
You can find the paper [here](https://doi.org/10.48550/arXiv.2406.01411) and the code [here](https://github.com/Jakob-Bach/Constrained-Subgroup-Discovery).&#xD;
See the `README` for details.&#xD;
&#xD;
The datasets used in our study (which we also provide here) originate from [PMLB](https://epistasislab.github.io/pmlb/).&#xD;
The corresponding [GitHub repository](https://github.com/EpistasisLab/pmlb) is MIT-licensed ((c) 2016 Epistasis Lab at UPenn).&#xD;
Please see the file `LICENSE` in the folder `datasets/` for the license text.</description>
      <description descriptionType="TechnicalInfo"># Experimental Data for the Paper "Using Constraints to Discover Sparse and Alternative Subgroup Descriptions"&#xD;
&#xD;
These are the experimental data for the paper&#xD;
&#xD;
&gt; Bach, Jakob. "Using Constraints to Discover Sparse and Alternative Subgroup Descriptions"&#xD;
&#xD;
published on [arXiv](https://arxiv.org/) in 2024.&#xD;
If we create further versions of this paper in the future, these experimental data may cover them as well.&#xD;
&#xD;
Check our [GitHub repository](https://github.com/Jakob-Bach/Constrained-Subgroup-Discovery) for the code and instructions to reproduce the experiments.&#xD;
We obtained the experimental results on a server with an `AMD EPYC 7551` CPU (32 physical cores, base clock of 2.0 GHz) and 160 GB RAM.&#xD;
The operating system was `Ubuntu 20.04.6 LTS`.&#xD;
The Python version was `3.8`.&#xD;
With this configuration, running the experimental pipeline (`run_experiments.py`) took about 34 hours.&#xD;
&#xD;
The commit hash for the last run of the experimental pipeline (`run_experiments.py`) is [0a57bcd529](https://github.com/Jakob-Bach/Constrained-Subgroup-Discovery/tree/0a57bcd52938dce6285e8113d777360c2b17f30f).&#xD;
The commit hash for the last run of the evaluation pipeline (`run_evaluation_arxiv.py`) is [48f2465b4c](https://github.com/Jakob-Bach/Constrained-Subgroup-Discovery/tree/48f2465b4cabf2d6657e4ff4b73c28c240d0b883).&#xD;
We also tagged both commits (`run-2024-05-13` and `evaluation-2024-05-15`).&#xD;
&#xD;
The experimental data are stored in two folders, `datasets/` and `results/`.&#xD;
Further, the console output of `run_evaluation_arxiv.py` is stored in `Evaluation_console_output.txt` (manually copied from the console to a file).&#xD;
In the following, we describe the structure and content of each data file.&#xD;
&#xD;
## `datasets/`&#xD;
&#xD;
These are the input data for the experimental pipeline `run_experiments.py`, i.e., the prediction datasets.&#xD;
The folder contains one overview file, one license file, and two files for each of the 27 datasets.&#xD;
&#xD;
The original datasets were downloaded from [PMLB](https://epistasislab.github.io/pmlb/) with the script `prepare_datasets.py`.&#xD;
Note that we do not own the copyright for these datasets.&#xD;
However, the [GitHub repository of PMLB](https://github.com/EpistasisLab/pmlb), which stores the original datasets, is MIT-licensed ((c) 2016 Epistasis Lab at UPenn).&#xD;
Thus, we include the file `LICENSE` from that repository.&#xD;
&#xD;
After downloading from `PMLB`, we split each dataset into the feature part (`_X.csv`) and the target part (`_y.csv`), which we save separately.&#xD;
Both file types are CSVs that only contain numeric values (categorical features are ordinally encoded in `PMLB`) except for the column names.&#xD;
There are no missing values.&#xD;
Each row corresponds to a data object (= instance, sample), and each column either corresponds to a feature (in `_X`) or the target (in `_y`).&#xD;
The first line in each `_X` file contains the names of the features as strings; for `_y` files, there is only one column, always named `target`.&#xD;
For the prediction target, we ensured that the minority (i.e., less frequent) class is the positive class (i.e., has the class label `1`), so the labeling may differ from PMLB.&#xD;
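
The file-pair layout can be illustrated with a tiny, made-up example (the feature names and values below are invented; the real file pairs sit in `datasets/`):

```python
import io

import pandas as pd

# Stand-in for one dataset's file pair (the real pairs are the *_X.csv / *_y.csv files):
x_csv = "feature_1,feature_2\n1.0,0.0\n2.5,1.0\n3.0,1.0\n"
y_csv = "target\n1\n0\n0\n"

X = pd.read_csv(io.StringIO(x_csv))  # one row per data object, one column per feature
y = pd.read_csv(io.StringIO(y_csv))['target']  # single column, always named "target"
```

Replacing the `io.StringIO(...)` arguments with actual file paths from `datasets/` loads a real dataset pair.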
&#xD;
`_dataset_overview.csv` contains meta-data for the datasets, like the number of instances and features.&#xD;
&#xD;
## `results/`&#xD;
&#xD;
These are the output data of the experimental pipeline in the form of CSVs, produced by the script `run_experiments.py`.&#xD;
`_results.csv` contains all results merged into one file and acts as input for the script `run_evaluation_arxiv.py`.&#xD;
The remaining files are subsets of the results, as the experimental pipeline parallelizes over 27 datasets, 5 cross-validation folds, and 6 subgroup-discovery methods.&#xD;
Thus, there are `27 * 5 * 6 = 810` files containing subsets of the results.&#xD;
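
If needed, the merge can be re-created from the subset files; the following sketch simulates a `results/` folder with invented file names (only the exclusion of `_results.csv` mirrors the actual layout):

```python
import pathlib
import tempfile

import pandas as pd

# Simulate a results/ folder with two subset files (the real subset file names differ):
folder = pathlib.Path(tempfile.mkdtemp())
pd.DataFrame({'dataset_name': ['a'], 'train_wracc': [0.1]}).to_csv(folder / 'subset1.csv', index=False)
pd.DataFrame({'dataset_name': ['b'], 'train_wracc': [0.2]}).to_csv(folder / 'subset2.csv', index=False)

# Concatenate all CSVs except the already-merged one:
subset_files = sorted(f for f in folder.glob('*.csv') if f.name != '_results.csv')
merged = pd.concat((pd.read_csv(f) for f in subset_files), ignore_index=True)
```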
&#xD;
Each row in a result file corresponds to one subgroup.&#xD;
One can identify individual subgroup-discovery runs with a combination of multiple columns, i.e.:&#xD;
&#xD;
- dataset `dataset_name`&#xD;
- cross-validation fold `split_idx`&#xD;
- subgroup-discovery method `sd_name`&#xD;
- feature-cardinality threshold `param.k` (missing value if no feature-cardinality constraint)&#xD;
- solver timeout `param.timeout` (missing value if not solver-based search)&#xD;
- number of alternatives `param.a` (missing value if only original subgroup searched)&#xD;
- dissimilarity threshold `param.tau_abs` (missing value if only original subgroup searched)&#xD;
&#xD;
For each value combination of these seven columns, there is either one subgroup (search for original subgroups)&#xD;
or six subgroups (search for alternative subgroup descriptions, in which case the column `alt.number` identifies individual subgroups within a search run).&#xD;
Further, note that the last four mentioned columns contain missing values, which should be treated as a category on their own.&#xD;
In particular, if you use `groupby()` from `pandas` for analyzing the results and you want to include any of the last four mentioned columns in the grouping,&#xD;
you should either fill in the missing values with an (arbitrary) placeholder value or use `dropna=False`,&#xD;
because the grouping (by default) ignores the rows with missing values in the group columns otherwise.&#xD;
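
The pitfall can be demonstrated with a toy frame (column names as in the result files, values made up):

```python
import pandas as pd

# One row from a run without alternatives (missing `param.a`) and two rows from a run with:
results = pd.DataFrame({'sd_name': ['MORS', 'SMT', 'SMT'],
                        'param.a': [float('nan'), 5, 5],
                        'test_wracc': [0.10, 0.05, 0.07]})

by_default = results.groupby(['sd_name', 'param.a'])['test_wracc'].mean()  # drops the MORS row
with_nan = results.groupby(['sd_name', 'param.a'], dropna=False)['test_wracc'].mean()  # keeps it
```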
&#xD;
The remaining columns represent results and evaluation metrics.&#xD;
&#xD;
In detail, all result files contain the following columns:&#xD;
&#xD;
- `objective_value` (float in `[-0.25, 1]` + missing values): Objective value of the subgroup-discovery method on the training set.&#xD;
  WRAcc when searching original subgroups and normalized Hamming similarity when searching alternative subgroup descriptions.&#xD;
  Missing value for *MORS* as the subgroup-discovery method, since *MORS* does not explicitly compute an objective when searching for subgroups.&#xD;
- `optimization_status` (string, 2 different values + missing values): For *SMT*, `sat` if optimal solution found and `unknown` if timeout.&#xD;
  Missing value for all other subgroup-discovery methods (which do not use solver timeouts).&#xD;
- `optimization_time` (non-negative float): The runtime of optimization in the subgroup-discovery method, i.e., without pre- and post-processing steps.&#xD;
- `fitting_time` (non-negative float): The complete runtime of the subgroup-discovery method (as reported in the paper), i.e., including pre- and post-processing steps.&#xD;
  Very similar to `optimization_time` except for *SMT* as the subgroup-discovery method, which may spend a considerable amount of time formulating the optimization problem.&#xD;
- `train_wracc` (float in `[-0.25, 0.25]`): The weighted relative accuracy (WRAcc) of the subgroup description on the training set.&#xD;
- `test_wracc` (float in `[-0.25, 0.25]`): The weighted relative accuracy (WRAcc) of the subgroup description on the test set.&#xD;
- `train_nwracc` (float in `[-1, 1]`): The normalized weighted relative accuracy (WRAcc divided by its dataset-dependent maximum) of the subgroup description on the training set.&#xD;
- `test_nwracc` (float in `[-1, 1]`): The normalized weighted relative accuracy (WRAcc divided by its dataset-dependent maximum) of the subgroup description on the test set.&#xD;
- `box_lbs` (list of floats, e.g., `[-inf, 0, -inf, -2, 8]`): The lower bounds for each feature in the subgroup description.&#xD;
  Negative infinity if a feature's lower bound did not exclude any data objects from the subgroup.&#xD;
- `box_ubs` (list of floats, e.g., `[inf, 10, inf, 5, 9]`): The upper bounds for each feature in the subgroup description.&#xD;
  Positive infinity if a feature's upper bound did not exclude any data objects from the subgroup.&#xD;
- `selected_feature_idxs` (list of non-negative ints, e.g., `[0, 4, 5]`): The indices (starting from 0) of the features selected (= restricted) in the subgroup description.&#xD;
  Is an empty list, i.e., `[]`, if no feature was restricted (thus, the subgroup contains all data objects).&#xD;
- `dataset_name` (string, 27 different values): The name of the `PMLB` dataset used for subgroup discovery.&#xD;
- `split_idx` (int in `[0, 4]`): The index of the cross-validation fold of the dataset used for subgroup discovery.&#xD;
- `sd_name` (string, 6 different values): The name of the subgroup-discovery method (`Beam`, `BI`, `MORS`, `PRIM`, `Random`, or `SMT`).&#xD;
- `param.k` (int in `[1, 5]` + missing values): The feature-cardinality threshold for subgroup descriptions.&#xD;
  Missing value if unconstrained subgroup discovery.&#xD;
  Always `3` if alternative subgroup descriptions searched.&#xD;
- `param.timeout` (int in `[1, 2048]` + missing values): For *SMT*, solver timeout (in seconds) for optimization (not including formulation of the optimization problem).&#xD;
  Missing value for all other subgroup-discovery methods.&#xD;
- `alt.hamming` (float in `[0, 1]` + missing values): Normalized Hamming similarity between the current subgroup (original or alternative) and the original subgroup if alternative subgroup descriptions searched.&#xD;
  Missing value if only original subgroup searched.&#xD;
- `alt.jaccard` (float in `[0, 1]` + missing values): Jaccard similarity between the current subgroup (original or alternative) and the original subgroup if alternative subgroup descriptions searched.&#xD;
  Missing value if only original subgroup searched.&#xD;
- `alt.number` (int in `[0, 5]` + missing values): The number of the current alternative if alternative subgroup descriptions searched.&#xD;
  Missing value if only original subgroup searched.&#xD;
  Thus, original subgroups have either `0` or a missing value in this column (i.e., for experimental settings where alternative subgroup descriptions are searched, there is no separate search for an original subgroup, only one joint sequential search for the original and its alternatives).&#xD;
- `param.a` (int with value `5` + missing values): The number of desired alternative subgroup descriptions, not counting the original (zeroth) subgroup description.&#xD;
  Missing value if only original subgroup searched.&#xD;
- `param.tau_abs` (int in `[1, 3]` + missing values): The dissimilarity threshold for alternatives, corresponding to the absolute number of features that have to be deselected from the original subgroup description and each prior alternative.&#xD;
  Missing value if only original subgroup searched.&#xD;
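
As a reference for the four WRAcc columns above, which follow the standard definition of WRAcc (subgroup generality times the difference between the positive-class rate inside the subgroup and the overall positive-class rate, normalized by the dataset-dependent maximum `p * (1 - p)` with `p` being the positive-class share), here is a minimal sketch; the authoritative implementation resides in the code repository linked above:

```python
def wracc(y_true, y_pred):
    """Weighted relative accuracy of a subgroup (both arguments are lists of 0/1,
    with `y_pred` indicating subgroup membership)."""
    n = len(y_true)
    true_pos = sum(1 for t, s in zip(y_true, y_pred) if t == 1 and s == 1)
    return true_pos / n - (sum(y_pred) / n) * (sum(y_true) / n)


def nwracc(y_true, y_pred):
    """WRAcc divided by its dataset-dependent maximum p * (1 - p)."""
    p = sum(y_true) / len(y_true)
    return wracc(y_true, y_pred) / (p * (1 - p))
```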
&#xD;
You can easily read in any of the result files with `pandas`:&#xD;
&#xD;
```python&#xD;
import pandas as pd&#xD;
&#xD;
results = pd.read_csv('results/_results.csv')&#xD;
```&#xD;
&#xD;
All result files are comma-separated and contain plain numbers and unquoted strings, apart from the columns `box_lbs`, `box_ubs`, and `selected_feature_idxs`&#xD;
(which represent lists and whose values are quoted, except for empty lists).&#xD;
The first line in each result file contains the column names.&#xD;
You can use the following code to make sure that the lists of feature indices are treated as such (rather than strings):&#xD;
&#xD;
```python&#xD;
import ast&#xD;
&#xD;
results['selected_feature_idxs'] = results['selected_feature_idxs'].apply(ast.literal_eval)&#xD;
```&#xD;
&#xD;
Note that this conversion does not work for `box_lbs` and `box_ubs`, where the lists contain not only ordinary numbers but also `-inf` and `inf`;&#xD;
see [this *Stack Overflow* post](https://stackoverflow.com/questions/64773836/error-converting-string-list-to-list-when-it-contains-inf) for potential alternatives.</description>
   </descriptions>
   <relatedIdentifiers>
      <relatedIdentifier relatedIdentifierType="URL" relationType="IsIdenticalTo">https://publikationen.bibliothek.kit.edu/1000171166</relatedIdentifier>
   </relatedIdentifiers>
   <sizes>
      <size/>
   </sizes>
   <formats>
      <format>application/x-tar</format>
   </formats>
</resource></metadata></record></GetRecord></OAI-PMH>