Datasets
To help you test its capabilities, XAIoGraphs ships a series of datasets, available through the xaiographs.datasets module.
The following datasets are included:
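| Dataset | Rows | Num. Feats | Task |
|---|---|---|---|
| Titanic | 1309 | 8 | Binary |
| Compas (Model) | 4230 | 7 | Multi-Class (3) |
| Compas (Reality) | 4230 | 7 | Binary |
| Body Performance | 13393 | 11 | Multi-Class (3) |
| Education Performance | 145 | 29 | Multi-Class (5) |
| Smartphone Brand Preferences | 981 | 17 | Multi-Class (5) |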
These datasets are available in both raw and discretized form, ready for use by the
Explainability and Fairness classes.
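To illustrate what a "discretized" form means in general, the following is a minimal sketch that maps a continuous feature to labelled bins. It is not the XAIoGraphs implementation: the `discretize` helper, the cut points and the labels are hypothetical and chosen only for illustration, and the library may bin its features differently.

```python
from bisect import bisect_right


def discretize(values, edges, labels):
    """Map each numeric value to the label of the bin it falls into.

    `edges` holds the interior cut points in ascending order, so
    there must be exactly one more label than cut points. A value
    equal to a cut point goes to the bin on its right.
    """
    if len(labels) != len(edges) + 1:
        raise ValueError("need exactly one more label than cut points")
    return [labels[bisect_right(edges, v)] for v in values]


# Hypothetical ages and cut points, for illustration only.
ages = [4, 17, 18, 35, 64, 80]
age_bins = discretize(ages, edges=[18, 65], labels=["child", "adult", "senior"])
print(age_bins)
```

Each raw value is replaced by a category, which is the kind of input a class working on categorical features would expect.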
The details of each dataset are shown below:
Note
The original datasets have been preprocessed to remove outliers, impute null values, and so on.
Titanic
The supposedly “unsinkable” RMS Titanic sank on April 15, 1912, during her first voyage, after hitting an iceberg.
Unfortunately, there were not enough lifeboats to accommodate everyone, and 1502 out of 2224 passengers and staff
perished. The famous Titanic Dataset describes the chances of survival of individual Titanic passengers.
| Property | Value |
|---|---|
| Source | |
| Num Rows | 1309 |
| Num Features | 8 |
| Num Targets | 2 |
| function to obtain dataset | |
| function to obtain discretized dataset | |
Compas
COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) is a popular commercial algorithm used by judges and parole officers to score a criminal defendant’s likelihood of reoffending (recidivism). A 2-year follow-up study (i.e., of who actually committed crimes or violent crimes after 2 years) has shown that the algorithm is biased in favor of white defendants and against black inmates. The pattern of mistakes, as measured by precision/sensitivity, is notable.
| Property | Value |
|---|---|
| Source | |
| Num Rows | 4230 |
| Num Features | 7 |
| Num Targets | 3 (model) - 2 (reality) |
| function to obtain dataset | |
| function to obtain discretized dataset (Model) | |
| function to obtain discretized dataset (Reality) | |
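The precision/sensitivity comparison mentioned in the description can be sketched as follows. This is only an illustration of the two metrics computed per group: the `precision_and_sensitivity` helper, the group names and the confusion counts are all made up and are not taken from the COMPAS data or from XAIoGraphs.

```python
def precision_and_sensitivity(tp, fp, fn):
    """Return (precision, sensitivity) from confusion-matrix counts.

    precision   = TP / (TP + FP): share of predicted positives that are correct
    sensitivity = TP / (TP + FN): share of actual positives that are detected
    A metric with a zero denominator is returned as NaN.
    """
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, sensitivity


# Made-up counts for two hypothetical groups, for illustration only.
counts_by_group = {
    "group_a": {"tp": 40, "fp": 10, "fn": 10},
    "group_b": {"tp": 30, "fp": 30, "fn": 20},
}

metrics = {
    group: precision_and_sensitivity(**counts)
    for group, counts in counts_by_group.items()
}

for group, (precision, sensitivity) in metrics.items():
    print(f"{group}: precision={precision:.2f} sensitivity={sensitivity:.2f}")
```

A large gap between groups on either metric is the kind of pattern a fairness analysis would flag for closer inspection.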
Body Performance
This dataset demonstrates how performance levels change with age and some exercise-related variables.
| Property | Value |
|---|---|
| Source | https://www.kaggle.com/datasets/kukuroo3/body-performance-data |
| Num Rows | 13393 |
| Num Features | 11 |
| Num Targets | 3 |
| function to obtain dataset | |
| function to obtain discretized dataset | |
Education Performance
The data was collected in 2019 from students of the Faculty of Engineering and the Faculty of Educational Sciences. The goal is to predict students’ end-of-term performance using ML techniques.
| Property | Value |
|---|---|
| Source | https://www.kaggle.com/datasets/mariazhokhova/higher-education-students-performance-evaluation |
| Num Rows | 145 |
| Num Features | 29 |
| Num Targets | 5 |
| function to obtain dataset | |
| function to obtain discretized dataset | |
Smartphone Brand Preferences
The data combines three datasets covering the most noteworthy features of the smartphones preferred in the US in 2022, user data, and smartphone ratings. The information was obtained via a Mechanical Turk survey in which participants rated 10 randomly presented phones by likelihood of purchase and provided personal information. This example uses the most important features of smartphones from certain brands to predict the most likely smartphone-brand purchase.
| Property | Value |
|---|---|
| Source | |
| Num Rows | 259 |
| Num Features | 11 |
| Num Targets | 5 |
| function to obtain dataset | |
| function to obtain discretized dataset | |