Alternative identifier:

Related identifier:

(Is Identical To) https://publikationen.bibliothek.kit.edu/1000168220 - URL

Creator:

Bach, Jakob https://orcid.org/0000-0003-0301-2798 [Institut für Programmstrukturen und Datenorganisation (IPD), Karlsruher Institut für Technologie (KIT)]

Contributors:

Title:

Experimental Data for the Paper "Finding Optimal Diverse Feature Sets with Alternative Feature Selection" (Version 2)

Additional titles:

Description:

(Abstract) These are the experimental data for the second version (v2) of the paper Bach, Jakob. "Finding Optimal Diverse Feature Sets with Alternative Feature Selection", published on [arXiv](https://arxiv.org/) in 2024. You can find the paper [here](https://doi.org/10.48550/arXiv.2307.11607) and the code [he...


# Experimental Data for the Paper "Finding Optimal Diverse Feature Sets with Alternative Feature Selection" (Version 2)

These are the experimental data for the second version (v2) of the paper

> Bach, Jakob. "Finding Optimal Diverse Feature Sets with Alternative Feature Selection"

published on [arXiv](https://arxiv.org/) in 2024.
You can find the paper [here](https://doi.org/10.48550/arXiv.2307.11607).
If we create further versions of this paper in the future, these experimental data may cover them as well.
Check our [GitHub repository](https://github.com/Jakob-Bach/Alternative-Feature-Selection) for the code and instructions to reproduce the experiments.
The data were obtained on a server with an `AMD EPYC 7551` CPU (32 physical cores, base clock of 2.0 GHz) and 128 GB RAM.
The operating system was `Ubuntu 20.04.6 LTS`.
The Python version was `3.8`.
With this configuration, running the experimental pipeline (`run_experiments.py`) took about 255 h.
Note that the data originate from two separate runs of the experimental pipeline since we re-ran a subset of the experimental tasks to include heuristic search methods for alternatives.
Only two out of five feature-selection methods were affected by these changes (as the heuristics do not work for the others), so we did not run the full pipeline again.
Results for the feature-selection methods `FCBFSelector`, `GreedyWrapperSelector`, and `MRMRSelector` are based on a run of the experimental pipeline (`run_experiments.py`) at the commit [`2212360a32`](https://github.com/Jakob-Bach/Alternative-Feature-Selection/tree/2212360a320121862556bc8d54da45cb6dc8ee93).
Results for the feature-selection methods `MISelector` and `ModelImportanceSelector` are based on a run of the experimental pipeline (`run_experiments.py`) at the commit [`ae73417c9c`](https://github.com/Jakob-Bach/Alternative-Feature-Selection/tree/ae73417c9cf64aedd77e02a50a3df8955e177251).
The commit hash for the last run of the evaluation pipeline (`run_evaluation_arxiv.py`) is [`898e9ebedf`](https://github.com/Jakob-Bach/Alternative-Feature-Selection/tree/898e9ebedfae642f4c6be80244b79fdce3920500).
We also tagged all relevant commits (`run-2023-06-23`, `run-2024-01-23`, and `evaluation-2024-02-01`).
The experimental data are stored in two folders, `datasets/` and `results/`.
Further, the console output of `run_evaluation_arxiv.py` is stored in `Evaluation_console_output.txt` (manually copied from the console to a file).
In the following, we describe the structure and content of each data file.

## `datasets/`

These are the input data for the experimental pipeline `run_experiments.py`, i.e., prediction datasets.
The folder contains one overview file, one license file, and two files for each of the 30 datasets.
The original datasets were downloaded from [PMLB](https://epistasislab.github.io/pmlb/) with the script `prepare_datasets.py`.
Note that we do not own the copyright for these datasets.
However, the [GitHub repository of PMLB](https://github.com/EpistasisLab/pmlb), which stores the original datasets, is MIT-licensed ((c) 2016 Epistasis Lab at UPenn).
Thus, we include the file `LICENSE` from that repository.
After downloading from `PMLB`, we split each dataset into the feature part (`_X.csv`) and the target part (`_y.csv`), which we save separately.
Both files are CSVs only containing numeric values (categorical features are ordinally encoded in `PMLB`) except for the column names.
There are no missing values.
Each row corresponds to a data object (= instance, sample), and each column corresponds to a feature.
The first line in each CSV contains the names of the features as strings; for `_y.csv` files, there is only one column, always named `target`.
`_dataset_overview.csv` contains meta-data for the datasets, like the number of instances and features.
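
As a minimal sketch of how one might load a single dataset with `pandas`: the code assumes that each pair of files is named `<dataset_name>_X.csv` and `<dataset_name>_y.csv`. This prefix convention and the helper `load_dataset` are assumptions for illustration, not part of the repository's code.

```python
from pathlib import Path

import pandas as pd


def load_dataset(directory, dataset_name):
    """Load the feature part and the target part of one dataset.

    Assumes the files are named '<dataset_name>_X.csv' and
    '<dataset_name>_y.csv' (hypothetical naming convention).
    """
    directory = Path(directory)
    X = pd.read_csv(directory / f"{dataset_name}_X.csv")
    # The target file has a single column, always named 'target'.
    y = pd.read_csv(directory / f"{dataset_name}_y.csv")["target"]
    if len(X) != len(y):
        raise ValueError("Feature part and target part differ in length.")
    return X, y
```

A call like `X, y = load_dataset('datasets/', name)` would then return a feature `DataFrame` and a target `Series` with one entry per data object.
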

## `results/`

These are the output data of the experimental pipeline in the form of CSVs, produced by the script `run_experiments.py`.
`_results.csv` contains all results merged into one file and acts as input for the script `run_evaluation_arxiv.py`.
The remaining files are subsets of the results, as the experimental pipeline parallelizes over 30 datasets, 5 cross-validation folds, and 5 feature-selection methods.
Thus, there are `30 * 5 * 5 = 750` files containing subsets of the results.
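
The relation between the partial files and the merged file can be sketched as follows. The code assumes that the partial files are all other `.csv` files in the same folder; the helper `merge_partial_results` is hypothetical, and the actual merging is done by the experimental pipeline.

```python
from pathlib import Path

import pandas as pd


def merge_partial_results(directory, merged_name="_results.csv"):
    """Concatenate all partial result CSVs in a folder into one DataFrame."""
    directory = Path(directory)
    # Skip the already merged file; sort for a reproducible row order.
    partial_files = sorted(
        path for path in directory.glob("*.csv") if path.name != merged_name
    )
    if not partial_files:
        raise FileNotFoundError(f"No partial result files in {directory}")
    # Columns missing in some files (e.g., 'wrapper_iters') are filled with NaN.
    frames = [pd.read_csv(path) for path in partial_files]
    return pd.concat(frames, ignore_index=True)
```

Under the assumption above, the full data package should yield 750 partial files in this listing.
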
Each row in a result file corresponds to one feature set.
One can identify individual search runs for alternatives with a combination of multiple columns, i.e.:

- dataset `dataset_name`
- cross-validation fold `split_idx`
- feature-selection method `fs_name`
- search method `search_name`
- objective aggregation `objective_agg`
- feature-set size `k`
- number of alternatives `num_alternatives`
- dissimilarity threshold `tau_abs`

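
As a sketch, these eight columns can serve as a grouping key in `pandas`, so that each group holds the feature sets of one search run. The helper name and the toy rows below are made up for illustration and are not taken from the result files.

```python
import pandas as pd

# Columns that jointly identify one search run for alternatives.
RUN_COLUMNS = [
    "dataset_name", "split_idx", "fs_name", "search_name",
    "objective_agg", "k", "num_alternatives", "tau_abs",
]


def count_feature_sets_per_run(results):
    """Return the number of rows (feature sets) for each search run."""
    return results.groupby(RUN_COLUMNS).size().rename("num_feature_sets")


# Toy data (hypothetical values): two runs that differ only in 'tau_abs'.
toy = pd.DataFrame({
    "dataset_name": ["toy"] * 5,
    "split_idx": [0] * 5,
    "fs_name": ["MISelector"] * 5,
    "search_name": ["search_sequentially"] * 5,
    "objective_agg": ["sum"] * 5,
    "k": [5] * 5,
    "num_alternatives": [10] * 5,
    "tau_abs": [1, 1, 1, 2, 2],
})
print(count_feature_sets_per_run(toy))
```

The same grouping key can be passed to `groupby` on the real result table to aggregate evaluation metrics per search run.
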
The remaining columns mostly represent evaluation metrics.
In detail, all result files contain the following columns:

- `selected_idxs` (list of ints, e.g., `[0, 4, 5, 6, 8]`): The indices (starting from 0) of the selected features (i.e., columns in the dataset).
  Might also be an empty list, i.e., `[]`, if no valid solution was found.
  In that case, the two `_objective` columns and the four `_mcc` columns contain a missing value (empty string).
- `train_objective` (float in `[-1,1]`): The training-set objective value of the feature set.
  Three feature-selection methods (*FCBF*, *MI*, *Model Importance*) have the range `[0,1]`, while two methods (*mRMR*, *Greedy Wrapper*) have the range `[-1,1]`.
- `test_objective` (float in `[-1,1]`): The test-set objective value of the feature set.
- `optimization_time` (non-negative float): Time for alternative feature selection in seconds.
  The interpretation of this value depends on the search method for alternatives and the feature-selection method.
  (1a) In solver-based search for alternatives in combination with white-box feature-selection methods, this value corresponds to one solver call.
  (1b) In solver-based search for alternatives in combination with wrapper feature selection, we record the total runtime of the *Greedy Wrapper* algorithm, which calls the solver and trains prediction models multiple times.
  (2) In heuristic search for alternatives (algorithms *Greedy Balancing* and *Greedy Replacement*), we record the total runtime of the heuristic search algorithms.
- `optimization_status` (int in `{0, 1, 2, 6}`): The status of the solver-based or heuristic search method for alternatives; for wrapper feature selection, this is only the status of the last solver call and only refers to a satisfiability problem rather than the (black-box) optimization problem.
  - 0 = (proven as) optimal; cannot occur for heuristic search methods
  - 1 = feasible (valid solution, but might be suboptimal)
  - 2 = (proven as) infeasible; cannot occur for heuristic search methods
  - 6 = not solved (no valid solution found, but one might exist)
- `decision_tree_train_mcc` (float in `[-1,1]`): Training-set prediction performance (in terms of Matthews Correlation Coefficient) of a decision tree trained with the selected features.
- `decision_tree_test_mcc` (float in `[-1,1]`): Test-set prediction performance (in terms of Matthews Correlation Coefficient) of a decision tree trained with the selected features.
- `random_forest_train_mcc` (float in `[-1,1]`): Training-set prediction performance (in terms of Matthews Correlation Coefficient) of a random forest trained with the selected features.
- `random_forest_test_mcc` (float in `[-1,1]`): Test-set prediction performance (in terms of Matthews Correlation Coefficient) of a random forest trained with the selected features.
- `k` (int in `{5, 10}`): The number of features to be selected.
- `tau_abs` (int in `{1, ..., 10}`): The dissimilarity threshold for alternatives, corresponding to the absolute number of features (`k * tau`) that have to differ between feature sets.
- `num_alternatives` (int in `{1, 2, 3, 4, 5, 10}`): The number of desired alternative feature sets, not counting the original (zeroth) feature set.
  A number from `{1, 2, 3, 4, 5}` for solver-based simultaneous search and *Greedy Balancing*, but always `10` for solver-based sequential search and *Greedy Replacement*.
- `objective_agg` (string, 2 different values): The name of the quality-aggregation function for alternatives (`min` or `sum`).
  Min-aggregation or sum-aggregation for solver-based simultaneous search but always sum-aggregation for the remaining search methods (where the aggregation does not matter for the search).
- `search_name` (string, 4 different values): The name of the search method for alternatives (`search_greedy_balancing`, `search_greedy_replacement`, `search_sequentially`, or `search_simultaneously`).
  *Greedy Balancing* and *Greedy Replacement* are (solver-free) heuristics and are only combined with two feature-selection methods (*MI* and *Model Importance*).
  The other two values denote solver-based search here (though in the paper, we also categorize optimization problems as sequential/simultaneous, no matter how they are solved) and are combined with all five feature-selection methods.
- `fs_name` (string, 5 different values): The name of the feature-selection method (`FCBFSelector`, `MISelector`, `ModelImportanceSelector` (= *Model Gain* in the paper), `MRMRSelector`, or `GreedyWrapperSelector`).
- `dataset_name` (string, 30 different values): The name of the `PMLB` dataset.
- `n` (positive int): The number of features of the `PMLB` dataset.
- `split_idx` (int in `[0,4]`): The index of the cross-validation fold.
- `wrapper_iters` (int in `[1,1000]`): The number of iterations in case wrapper feature selection was used, missing value (empty string) in the other cases.
  This column does not exist in result files not containing wrapper results.

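
A small sketch of how the status codes and the missing-value convention described above might be handled in `pandas`. The label strings and the helper `add_status_info` are illustrative choices, not identifiers from the code base, and the toy values are made up.

```python
import pandas as pd

# Status codes as documented for the column 'optimization_status'.
STATUS_LABELS = {0: "optimal", 1: "feasible", 2: "infeasible", 6: "not solved"}


def add_status_info(results):
    """Add a readable status label and a flag for valid feature sets."""
    results = results.copy()  # leave the caller's DataFrame untouched
    results["status_label"] = results["optimization_status"].map(STATUS_LABELS)
    # Rows without a valid solution have a missing objective value,
    # which pandas reads as NaN from the empty string in the CSV.
    results["has_solution"] = results["train_objective"].notna()
    return results


# Toy data (hypothetical values) covering all four status codes.
toy = pd.DataFrame({
    "optimization_status": [0, 1, 2, 6],
    "train_objective": [0.8, 0.7, None, None],
})
print(add_status_info(toy))
```

Keep in mind that for wrapper feature selection, the status only describes the last solver call, so the `has_solution` flag is the safer filter for rows with usable metrics.
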
You can easily read in any of the result files with pandas:

```python
import pandas as pd

results = pd.read_csv('results/_results.csv')
```

All result files are comma-separated and contain plain numbers and unquoted strings, apart from the column `selected_idxs` (which is quoted and represents lists of integers).
The first line in each result file contains the column names.
You can use the following code to make sure that lists of feature indices are treated as such (rather than strings):

```python
import ast

results['selected_idxs'] = results['selected_idxs'].apply(ast.literal_eval)
```
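
To illustrate how `tau_abs` relates to the parsed index lists, the following sketch checks whether two feature sets differ in at least `tau_abs` features. It assumes that both sets have the same size `k` and that "differ" means the number of features in one set that are absent from the other. This reading and the helper names are assumptions for illustration; the paper gives the exact dissimilarity definition.

```python
import ast


def num_differing_features(set_a, set_b):
    """Number of features in set_a that do not occur in set_b."""
    return len(set(set_a) - set(set_b))


def is_alternative(set_a, set_b, tau_abs):
    """Check whether two equal-sized feature sets differ in >= tau_abs features."""
    if len(set_a) != len(set_b):
        raise ValueError("Feature sets must have the same size.")
    return num_differing_features(set_a, set_b) >= tau_abs


# Hypothetical raw strings as they might appear in the column 'selected_idxs'.
original = ast.literal_eval("[0, 4, 5, 6, 8]")
alternative = ast.literal_eval("[0, 1, 2, 6, 9]")
print(num_differing_features(original, alternative))  # 3
print(is_alternative(original, alternative, tau_abs=3))  # True
```

Since `tau_abs` corresponds to `k * tau`, dividing it by `k` gives the relative threshold `tau`.
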

Keywords:

feature selection
alternatives
constraints
mixed-integer programming
explainability
interpretability
XAI

Related information:

Language:

Publisher:

Karlsruhe Institute of Technology

Year of creation:

2024

Subject area:

Computer Science

Object type:

Dataset

Data source:

Software used:

Data processing:

Publication year:

2024

Rights holder:

Bach, Jakob https://orcid.org/0000-0003-0301-2798

Funding:


Status:

Published

Submitted by:

kitopen

Created on:

2024-02-08

Archiving date:

2024-02-13

Archive size:

34.2 MB

Archive creator:

kitopen

Archive checksum:

e4d00661da9698b9d7db94a72f13aa63 (MD5)

Embargo period:

The metadata were corrected retroactively. The original metadata are available after downloading the data package.

DOI: 10.35097/1920

Publication date: 2024-02-13


License for the data package

This work is licensed under
CC BY 4.0

Cite the data package

Bach, Jakob (2024): Experimental Data for the Paper "Finding Optimal Diverse Feature Sets with Alternative Feature Selection" (Version 2). Karlsruhe Institute of Technology. DOI: 10.35097/1920
