<?xml version="1.0" encoding="UTF-8" ?><OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"><responseDate>2026-04-24T15:02:24Z</responseDate><request identifier="10.35097/1975" metadataPrefix="datacite" verb="GetRecord">https://www.radar-service.eu/oai/OAIHandler</request><GetRecord><record><header><identifier>10.35097/1975</identifier><datestamp>2024-07-10T07:58:50Z</datestamp><setSpec>radar4kit</setSpec></header><metadata><resource xmlns="http://datacite.org/schema/kernel-4"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://datacite.org/schema/kernel-4 https://schema.datacite.org/meta/kernel-4.4/metadata.xsd">
   <identifier identifierType="DOI">10.35097/1975</identifier>
   <creators>
      <creator>
         <creatorName>Bach, Jakob</creatorName>
         <givenName>Jakob</givenName>
         <familyName>Bach</familyName>
         <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="http://orcid.org/">0000-0003-0301-2798</nameIdentifier>
         <affiliation>Institut für Programmstrukturen und Datenorganisation (IPD), Karlsruher Institut für Technologie (KIT)</affiliation>
      </creator>
   </creators>
   <titles>
      <title>Experimental Data for the Paper "Alternative feature selection with user control"</title>
   </titles>
   <publisher>Karlsruhe Institute of Technology</publisher>
   <dates>
      <date dateType="Created">2024</date>
   </dates>
   <publicationYear>2024</publicationYear>
   <subjects>
      <subject>Computer Science</subject>
      <subject>feature selection</subject>
      <subject>alternatives</subject>
      <subject>constraints</subject>
      <subject>mixed-integer programming</subject>
      <subject>explainability</subject>
      <subject>interpretability</subject>
      <subject>XAI</subject>
   </subjects>
   <resourceType resourceTypeGeneral="Dataset"/>
   <rightsList>
      <rights rightsURI="info:eu-repo/semantics/openAccess">Open Access</rights>
      <rights schemeURI="https://spdx.org/licenses/"
              rightsIdentifierScheme="SPDX"
              rightsIdentifier="CC-BY-4.0"
              rightsURI="https://creativecommons.org/licenses/by/4.0/legalcode">Creative Commons Attribution 4.0 International</rights>
   </rightsList>
   <contributors>
      <contributor contributorType="RightsHolder">
         <contributorName>Bach, Jakob</contributorName>
         <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="https://orcid.org/">0000-0003-0301-2798</nameIdentifier>
      </contributor>
   </contributors>
   <descriptions>
      <description descriptionType="Abstract">These are the experimental data for the paper&#xD;
&#xD;
&gt; Bach, Jakob, and Klemens Böhm. "Alternative feature selection with user control"&#xD;
&#xD;
published in the [International Journal of Data Science and Analytics](https://link.springer.com/journal/41060) in 2024.&#xD;
You can find the paper [here](https://doi.org/10.1007/s41060-024-00527-8) and the code [here](https://github.com/jakob-bach/alternative-feature-selection).&#xD;
See the `README` for details.&#xD;
&#xD;
The datasets used in our study (which we also provide here) originate from [PMLB](https://epistasislab.github.io/pmlb/).&#xD;
The corresponding [GitHub repository](https://github.com/EpistasisLab/pmlb) is MIT-licensed ((c) 2016 Epistasis Lab at UPenn).&#xD;
Please see the file `LICENSE` in the folder `datasets/` for the license text.</description>
      <description descriptionType="TechnicalInfo"># Experimental Data for the Paper "Alternative feature selection with user control"&#xD;
&#xD;
These are the experimental data for the paper&#xD;
&#xD;
&gt; Bach, Jakob, and Klemens Böhm. "Alternative feature selection with user control"&#xD;
&#xD;
published in the [International Journal of Data Science and Analytics](https://link.springer.com/journal/41060) in 2024.&#xD;
&#xD;
Check our [GitHub repository](https://github.com/Jakob-Bach/Alternative-Feature-Selection) for the code and instructions to reproduce the experiments.&#xD;
The data were obtained on a server with an `AMD EPYC 7551` CPU (32 physical cores, base clock of 2.0 GHz) and 128 GB RAM.&#xD;
The operating system was `Ubuntu 20.04.6 LTS`.&#xD;
The Python version was `3.8`.&#xD;
With this configuration, running the experimental pipeline (`run_experiments.py`) took about 255 h.&#xD;
&#xD;
The commit hash for the last run of the experimental pipeline (`run_experiments.py`) is [`2212360a32`](https://github.com/Jakob-Bach/Alternative-Feature-Selection/tree/2212360a320121862556bc8d54da45cb6dc8ee93).&#xD;
The commit hash for the last run of the evaluation pipeline (`run_evaluation_journal.py`) is [`758b71e70b`](https://github.com/Jakob-Bach/Alternative-Feature-Selection/tree/758b71e70b941a54c88bda17a5591698dafcc03c).&#xD;
We also tagged both commits (`run-2023-06-23` and `evaluation-2024-03-19`).&#xD;
&#xD;
The experimental data are stored in two folders, `datasets/` and `results/`.&#xD;
Further, the console output of `run_evaluation_journal.py` is stored in `Evaluation_console_output.txt` (manually copied from the console to a file).&#xD;
In the following, we describe the structure and content of each data file.&#xD;
&#xD;
## `datasets/`&#xD;
&#xD;
These are the input data for the experimental pipeline `run_experiments.py`, i.e., prediction datasets.&#xD;
The folder contains one overview file, one license file, and two files for each of the 30 datasets.&#xD;
&#xD;
The original datasets were downloaded from [PMLB](https://epistasislab.github.io/pmlb/) with the script `prepare_datasets.py`.&#xD;
Note that we do not own the copyright for these datasets.&#xD;
However, the [GitHub repository of PMLB](https://github.com/EpistasisLab/pmlb), which stores the original datasets, is MIT-licensed ((c) 2016 Epistasis Lab at UPenn).&#xD;
Thus, we include the file `LICENSE` from that repository.&#xD;
&#xD;
After downloading from `PMLB`, we split each dataset into the feature part (`_X.csv`) and the target part (`_y.csv`), which we save separately.&#xD;
Both files are CSVs that contain only numeric values (categorical features are ordinally encoded in `PMLB`), apart from the column names.&#xD;
There are no missing values.&#xD;
Each row corresponds to a data object (= instance, sample), and each column corresponds to a feature.&#xD;
The first line in each CSV contains the names of the features as strings; for `_y.csv` files, there is only one column, always named `target`.&#xD;
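&#xD;
A dataset can be loaded with `pandas` as usual; the following sketch uses tiny in-memory stand-ins for one dataset's feature and target files, since the actual file names depend on the dataset:&#xD;
&#xD;
```python&#xD;
import io&#xD;
&#xD;
import pandas as pd&#xD;
&#xD;
# Tiny in-memory stand-ins for one dataset's '_X.csv' and '_y.csv' files&#xD;
x_csv = io.StringIO('featureA,featureB\n1.0,0\n2.5,1\n')&#xD;
y_csv = io.StringIO('target\n0\n1\n')&#xD;
X = pd.read_csv(x_csv)  # one row per data object, one column per feature&#xD;
y = pd.read_csv(y_csv)['target']  # single column, always named 'target'&#xD;
print(X.shape, y.name)  # (2, 2) target&#xD;
```&#xD;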
&#xD;
`_dataset_overview.csv` contains meta-data for the datasets, like the number of instances and features.&#xD;
&#xD;
## `results/`&#xD;
&#xD;
These are the output data of the experimental pipeline in the form of CSVs, produced by the script `run_experiments.py`.&#xD;
`_results.csv` contains all results merged into one file and acts as input for the script `run_evaluation_journal.py`.&#xD;
The remaining files are subsets of the results, as the experimental pipeline parallelizes over 30 datasets, 5 cross-validation folds, and 5 feature-selection methods.&#xD;
Thus, there are `30 * 5 * 5 = 750` files containing subsets of the results.&#xD;
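&#xD;
In principle, the merged file can be reproduced by concatenating the per-task CSVs; a sketch with a hypothetical mini `results/` folder (the per-task file names here are made up):&#xD;
&#xD;
```python&#xD;
import pathlib&#xD;
import tempfile&#xD;
&#xD;
import pandas as pd&#xD;
&#xD;
# Hypothetical mini 'results/' folder with two per-task result files&#xD;
tmp = pathlib.Path(tempfile.mkdtemp())&#xD;
pd.DataFrame({'k': [5]}).to_csv(tmp / 'task1.csv', index=False)&#xD;
pd.DataFrame({'k': [10]}).to_csv(tmp / 'task2.csv', index=False)&#xD;
&#xD;
# Concatenate all per-task files, skipping the merged '_results.csv' if present&#xD;
parts = [pd.read_csv(p) for p in sorted(tmp.glob('*.csv'))&#xD;
         if p.name != '_results.csv']&#xD;
merged = pd.concat(parts, ignore_index=True)&#xD;
print(len(merged))  # 2&#xD;
```&#xD;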
&#xD;
Each row in a result file corresponds to one feature set.&#xD;
One can identify individual search runs for alternatives by combining the following columns:&#xD;
&#xD;
- dataset `dataset_name`&#xD;
- cross-validation fold `split_idx`&#xD;
- feature-selection method `fs_name`&#xD;
- search method `search_name`&#xD;
- objective aggregation `objective_agg`&#xD;
- feature-set size `k`&#xD;
- number of alternatives `num_alternatives`&#xD;
- dissimilarity threshold `tau_abs`&#xD;
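&#xD;
One way to group the rows into individual search runs with `pandas`, sketched here with a tiny synthetic frame standing in for `_results.csv` (the example values are illustrative only):&#xD;
&#xD;
```python&#xD;
import pandas as pd&#xD;
&#xD;
# Synthetic stand-in for 'results/_results.csv' (values are illustrative only)&#xD;
results = pd.DataFrame({&#xD;
    'dataset_name': ['soybean'] * 4,&#xD;
    'split_idx': [0] * 4,&#xD;
    'fs_name': ['MISelector'] * 4,&#xD;
    'search_name': ['search_sequentially'] * 2 + ['search_simultaneously'] * 2,&#xD;
    'objective_agg': ['sum'] * 4,&#xD;
    'k': [5] * 4,&#xD;
    'num_alternatives': [10, 10, 1, 1],&#xD;
    'tau_abs': [1] * 4,&#xD;
})&#xD;
group_cols = ['dataset_name', 'split_idx', 'fs_name', 'search_name',&#xD;
              'objective_agg', 'k', 'num_alternatives', 'tau_abs']&#xD;
# Each group corresponds to one search run for alternative feature sets&#xD;
print(results.groupby(group_cols).ngroups)  # 2&#xD;
```&#xD;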
&#xD;
The remaining columns mostly represent evaluation metrics.&#xD;
In detail, all result files contain the following columns:&#xD;
&#xD;
- `selected_idxs` (list of ints, e.g., `[0, 4, 5, 6, 8]`): The indices (starting from 0) of the selected features (i.e., columns in the dataset).&#xD;
  Might also be an empty list (`[]`) if no valid solution was found.&#xD;
  In that case, the two `_objective` columns and the four `_mcc` columns contain a missing value (empty string).&#xD;
- `train_objective` (float in `[-1,1]`): The training-set objective value of the feature set.&#xD;
  Three feature-selection methods (*FCBF*, *MI*, *Model Importance*) have the range `[0,1]`, while two methods (*mRMR*, *Greedy Wrapper*) have the range `[-1,1]`.&#xD;
- `test_objective` (float in `[-1,1]`): The test-set objective value of the feature set.&#xD;
- `optimization_time` (non-negative float): Time for alternative feature selection in seconds.&#xD;
  The interpretation of this value depends on the feature-selection method.&#xD;
  For white-box feature-selection methods, this value corresponds to one solver call.&#xD;
  For wrapper feature selection, we record the total runtime of the *Greedy Wrapper* algorithm, which calls the solver and trains prediction models multiple times.&#xD;
- `optimization_status` (int in `{0, 1, 2, 6}`): The status of the solver; for wrapper feature selection, this is only the status of the last solver call and only refers to a satisfiability problem rather than the (black-box) optimization problem.&#xD;
  - 0 = (proven as) optimal&#xD;
  - 1 = feasible (valid solution, but might be suboptimal)&#xD;
  - 2 = (proven as) infeasible&#xD;
  - 6 = not solved (no valid solution found, but one might exist)&#xD;
- `decision_tree_train_mcc` (float in `[-1,1]`): Training-set prediction performance (in terms of Matthews Correlation Coefficient) of a decision tree trained with the selected features.&#xD;
- `decision_tree_test_mcc` (float in `[-1,1]`): Test-set prediction performance (in terms of Matthews Correlation Coefficient) of a decision tree trained with the selected features.&#xD;
- `random_forest_train_mcc` (float in `[-1,1]`): Training-set prediction performance (in terms of Matthews Correlation Coefficient) of a random forest trained with the selected features.&#xD;
- `random_forest_test_mcc` (float in `[-1,1]`): Test-set prediction performance (in terms of Matthews Correlation Coefficient) of a random forest trained with the selected features.&#xD;
- `k` (int in `{5, 10}`): The number of features to be selected.&#xD;
- `tau_abs` (int in `{1, ..., 10}`): The dissimilarity threshold for alternatives, corresponding to the absolute number of features (`k * tau`) that have to differ between feature sets.&#xD;
- `num_alternatives` (int in `{1, 2, 3, 4, 5, 10}`): The number of desired alternative feature sets, not counting the original (zeroth) feature set.&#xD;
  A number from `{1, 2, 3, 4, 5}` for simultaneous search, but always `10` for sequential search.&#xD;
- `objective_agg` (string, 2 different values): The name of the quality-aggregation function for alternatives (`min` or `sum`).&#xD;
  Min-aggregation or sum-aggregation for simultaneous search, but always sum-aggregation for sequential search.&#xD;
  (In fact, sequential search only optimizes individual feature sets anyway, so the aggregation does not matter for the search.)&#xD;
- `search_name` (string, 2 different values): The name of the search method for alternatives (`search_sequentially` or `search_simultaneously`).&#xD;
- `fs_name` (string, 5 different values): The name of the feature-selection method (`FCBFSelector`, `MISelector`, `ModelImportanceSelector` (= *Model Gain* in the paper), `MRMRSelector`, or `GreedyWrapperSelector`).&#xD;
- `dataset_name` (string, 30 different values): The name of the `PMLB` dataset.&#xD;
- `n` (positive int): The number of features of the `PMLB` dataset.&#xD;
- `split_idx` (int in `[0,4]`): The index of the cross-validation fold.&#xD;
- `wrapper_iters` (int in `[1,1000]`): The number of iterations if wrapper feature selection was used; missing value (empty string) otherwise.&#xD;
  This column does not exist in result files not containing wrapper results.&#xD;
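&#xD;
The `optimization_status` codes can be decoded with a small mapping, and statuses `0` and `1` mark the rows that contain a valid feature set (a sketch; the label strings are ours):&#xD;
&#xD;
```python&#xD;
import pandas as pd&#xD;
&#xD;
# Readable labels for the solver-status codes (label strings chosen by us)&#xD;
STATUS_LABELS = {0: 'optimal', 1: 'feasible', 2: 'infeasible', 6: 'not solved'}&#xD;
statuses = pd.Series([0, 1, 2, 6])&#xD;
print(statuses.map(STATUS_LABELS).tolist())  # ['optimal', 'feasible', 'infeasible', 'not solved']&#xD;
has_valid_solution = statuses.isin([0, 1])  # only these rows contain a feature set&#xD;
```&#xD;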
&#xD;
You can easily read in any of the result files with `pandas`:&#xD;
&#xD;
```python&#xD;
import pandas as pd&#xD;
&#xD;
results = pd.read_csv('results/_results.csv')&#xD;
```&#xD;
&#xD;
All result files are comma-separated and contain plain numbers and unquoted strings, apart from the column `selected_idxs` (which is quoted and represents lists of integers).&#xD;
The first line in each result file contains the column names.&#xD;
You can use the following code to make sure that lists of feature indices are treated as such (rather than strings):&#xD;
&#xD;
```python&#xD;
import ast&#xD;
&#xD;
results['selected_idxs'] = results['selected_idxs'].apply(ast.literal_eval)&#xD;
```</description>
   </descriptions>
   <relatedIdentifiers>
      <relatedIdentifier relatedIdentifierType="URL" relationType="IsIdenticalTo">https://publikationen.bibliothek.kit.edu/1000169377</relatedIdentifier>
   </relatedIdentifiers>
   <sizes>
      <size/>
   </sizes>
   <formats>
      <format>application/x-tar</format>
   </formats>
</resource></metadata></record></GetRecord></OAI-PMH>