Datasets¶
Methods:
Returns Titanic dataset with the following Features: |
|
Returns titanic dataset (and other metadata) to be tested in xaiographs. |
|
Returns the necessary DataFrames to test the WHY module of XAIoGraphs with the explainability calculated with the Titanic dataset. |
|
Returns COMPAS dataset with the following Features: |
|
Returns COMPAS dataset (and other metadata) to be tested in xaiographs. |
|
Returns the necessary DataFrames to test the WHY module of XAIoGraphs with the explainability calculated with the COMPAS dataset. |
|
Returns COMPAS dataset (and other metadata) to be tested in xaiographs. |
|
Returns the necessary DataFrames to test the WHY module of XAIoGraphs with the explainability calculated with the COMPAS dataset. |
|
Returns body performance dataset with the following Features: |
|
Returns body performance dataset (and other metadata) to be tested in xaiographs. |
|
Returns the necessary DataFrames to test the WHY module of XAIoGraphs with the explainability calculated with the Body Performance dataset. |
|
Returns body performance dataset with the following Features: |
|
Returns education performance dataset (and other metadata) to be tested in xaiographs. |
|
Returns the necessary DataFrames to test the WHY module of XAIoGraphs with the explainability calculated with the Body Performance dataset. |
|
Returns the smarthpone brand preferences dataset that contains the following Features: |
|
|
Returns smartphone brand preferences dataset (and other metadata) to be tested in xaiographs. |
Returns the necessary DataFrames to test the WHY module of XAIoGraphs with the explainability calculated with the Smartphone Brand Preferences dataset. |
- xaiographs.datasets.load_titanic() DataFrame¶
Returns Titanic dataset with the following Features:
id: unique passenger identifier
gender: passenger gender
title: passenger title
age: passenger age
family_size: number of family members the passenger was traveling with
is_alone: flag that indicates if the passenger was traveling alone or with a family
embarked: city of embarkation {S: Southampton, C: Cherbourg, Q: Queenstown}
class: class in which the passenger was traveling {1: first class, 2: second class, 3: third class}
ticket_price: price that the passenger pays for the trip
survived: flag that indicates if it survived or not {1: Survived, 0: No Survived}
- Returns:
load_titanic – Titanic dataset
- Return type:
pd.DataFrame
- Example:
>>> from xaiographs.datasets import load_titanic >>> df_dataset = load_titanic() >>> df_dataset.head(5) id gender title age family_size is_alone embarked class ticket_price survived 0 0 female Mrs 29.0000 0 1 S 1 211.3375 1 1 1 male Mr 0.9167 3 0 S 1 151.5500 1 2 2 female Mrs 2.0000 3 0 S 1 151.5500 0 3 3 male Mr 30.0000 3 0 S 1 151.5500 0 4 4 female Mrs 25.0000 3 0 S 1 151.5500 0
- xaiographs.datasets.load_titanic_discretized() Tuple[DataFrame, List[str], List[str], str, str]¶
Returns titanic dataset (and other metadata) to be tested in xaiographs. The dataset contains a series of discretized features, two columns (SURVIVED and NO_SURVIVED) with the probability [0,1] of classification given by an ML model and two columns ‘y_true’ and ‘y_predict’ with GroundTruth and prediction given by ML model. Dataset contains the following columns:
id: unique passenger identifier
gender: passenger gender - {male, female}
title: passenger title - {Mrs, Mr, rare}
age: passenger age discretized - {<12_years, 12_18_years, 18_30_years, 30_60_years, >60_years}
family_size: number of family members the passenger was traveling with - {1, 2, 3-5, >5}
is_alone: flag that indicates if the passenger was traveling alone or with a family - {0, 1}
embarked: city of embarkation - {S: Southampton, C: Cherbourg, Q: Queenstown}
class: class in which the passenger was traveling - {1: first class, 2: second class, 3: third class}
ticket_price: discretized price that the passenger pays for the trip - {high, mid, low}
NO_SURVIVED: probability [0,1] that the passenger will not survive. Calculated by ML model
SURVIVED: probability [0,1] that the passenger will survive. Calculated by ML model
y_true: real target - {SURVIVED, NO_SURVIVED}
y_predict: machine learning model prediction - {SURVIVED, NO_SURVIVED}
- Returns:
load_titanic_discretized –
pd.DataFrame, with data
List[str], with features name columns
List[str], with target names probabilities
str, with GroundTruth
str, with prediction ML model
- Return type:
Tuple[pd.DataFrame, List[str], List[str], str, str]
Example
>>> from xaiographs.datasets import load_titanic_discretized >>> df_dataset, features_cols, target_cols, y_true, y_predict = load_titanic_discretized() >>> df_dataset.head(5) id gender title age family_size is_alone embarked class ticket_price SURVIVED NO_SURVIVED y_true y_predict 0 0 female Mrs 18_30_years 1 1 S 1 High 1 0 SURVIVED SURVIVED 1 1 male Mr <12_years 3-5 0 S 1 High 1 0 SURVIVED SURVIVED 2 2 female Mrs <12_years 3-5 0 S 1 High 0 1 NO_SURVIVED NO_SURVIVED 3 3 male Mr 18_30_years 3-5 0 S 1 High 0 1 NO_SURVIVED NO_SURVIVED 4 4 female Mrs 18_30_years 3-5 0 S 1 High 0 1 NO_SURVIVED NO_SURVIVED >>> features_cols ['gender', 'title', 'age', 'family_size', 'is_alone', 'embarked', 'class', 'ticket_price'] >>> target_cols ['SURVIVED', 'NO_SURVIVED'] >>> y_true 'y_true' >>> y_predict 'y_predict'
- xaiographs.datasets.load_titanic_why(language: str = 'en') Tuple[DataFrame, DataFrame]¶
Returns the necessary DataFrames to test the WHY module of XAIoGraphs with the explainability calculated with the Titanic dataset.
- Parameters:
language (str) – Language identifier {es: Spanish, en: English}. Default uses English language
- Returns:
load_titanic_why –
pd.DataFrame with the natural language explanation of feature-value we want to use
pd.DataFrame with the natural language explanation of feature-value we want to use per target
- Return type:
Tuple[pd.DataFrame, pd.DataFrame]
Example
>>> from xaiographs.datasets import load_titanic_why >>> df_values_semantics, df_target_values_semantics = load_titanic_why() >>> df_values_semantics.head(5) feature_value reason 0 gender_male to be a man 1 gender_female to be a woman 2 is_alone_1 travel alone 3 family_size_2 to be from a family of few members 4 family_size_3-5 be a large family >>> df_target_values_semantics.head(5) target feature_value reason 0 NO_SURVIVED gender_male many men have died 1 NO_SURVIVED gender_female to be a woman 2 NO_SURVIVED is_alone_1 they traveled alone 3 NO_SURVIVED family_size_2 they were from a family of few members 4 NO_SURVIVED family_size_3-5 they were from a large family
- xaiographs.datasets.load_compas() DataFrame¶
Returns COMPAS dataset with the following Features:
id
FirstName
LastName
Gender
Age_range
Ethnicity
days_b_screening_arrest
c_jail_in
c_jail_out
Days_in_jail
c_charge_degree
c_charge_desc
is_recid
is_violent_recid
score_risk_recidivism
score_text_risk_recidivism
score_risk_violence
score_text_risk_violence
Low_Recid
Medium_Recid
High_Recid
No_Recid
Recid
predict_two_year_recid
real_two_year_recid
- Returns:
compas – compas dataset
- Return type:
pd.DataFrame
- Example:
>>> from xaiographs.datasets import load_compas >>> df_dataset = load_compas() >>> df_dataset.head(3) id FirstName LastName Gender Age_range Ethnicity days_b_screening_arrest c_jail_in c_jail_out Days_in_jail c_charge_degree c_charge_desc is_recid is_violent_recid score_risk_recidivism score_text_risk_recidivism score_risk_violence score_text_risk_violence Low_Recid Medium_Recid High_Recid No_Recid score_text_risk_violence Low_Recid Medium_Recid High_Recid No_Recid Recid predict_two_year_recid real_two_year_recid 0 1 miguel hernandez Male Greater than 45 Other -1.0 13/8/13 6:03 14/8/13 5:41 1 F Aggravated Assault w/Firearm 0 0 1 Low 1 Low 1 0 0 1 Low 1 0 0 1 0 0 0 1 3 kevon dixon Male 25 - 45 African-American -1.0 26/1/13 3:45 5/2/13 5:36 10 F Felony Battery w/Prior Convict 1 1 3 Low 1 Low 1 0 0 0 Low 1 0 0 0 1 0 1 2 5 marcu brown Male Less than 25 African-American NaN NaN NaN 0 F Possession of Cannabis 0 0 8 High 6 Medium 0 0 1 1 Medium 0 0 1 1 0 1 0
- xaiographs.datasets.load_compas_discretized() Tuple[DataFrame, List[str], List[str], str, str]¶
Returns COMPAS dataset (and other metadata) to be tested in xaiographs. The dataset contains a series of discretized features, three columns (Low_Recid, Medium_Recid, High_Recid) with the probability [0,1] of classification given by COMPAS model that predict the probability of recidivism, and two columns ‘y_true’ & ‘y_predict’ with GroundTruth and prediction given by ML model (0: did not recidivist two years after arrest, 1: two years after the arrest, recidivist). Dataset contains the following columns:
id: unique person identifier
Gender: {Male, Female}
Age_range: {Less than 25, 25 - 45, Greater than 45}
Ethnicity: {African-American, Asian, Caucasian, Hispanic, Native American, Other}
MaritalStatus: {Married, Separated, Single, Other}
c_charge_degree: {F, M}
is_recid: {YES, NO}
is_violent_recid: {YES, NO}
Low_Recid: probability assigned by the model to the label “low probability of recidivism”.
Medium_Recid: probability assigned by the model to the label “medium probability of recidivism”.
High_Recid: probability assigned by the model to the label “High probability of recidivism”.
y_predict: Model prediction (1: recidivism, 0: no recidivism)
y_true: Recidivism two years after arrest (1: recidivism, 0: no recidivism)
- Returns:
load_compas_discretized –
pd.DataFrame, with data
List[str], with features name columns
List[str], with target names probabilities
str, with GroundTruth
str, with prediction ML model
- Return type:
Tuple[pd.DataFrame, List[str], List[str], str, str]
Example
>>> from xaiographs.datasets import load_compas_discretized >>> df_dataset, features_cols, target_cols, y_true, y_predict = load_compas_discretized() >>> df_dataset.head(3) id Gender Age_range Ethnicity MaritalStatus c_charge_degree is_recid is_violent_recid High_Recid Medium_Recid Low_Recid y_true y_predict 0 1 Male Greater than 45 Other Single F NO NO 0 0 1 No_Recid No_Recid 1 3 Male 25 - 45 African-American Single F YES YES 0 0 1 Recid No_Recid 2 5 Male 25 - 45 Other Separated M NO NO 0 0 1 No_Recid No_Recid >>> features_cols ['Gender', 'Age_range', 'Ethnicity', 'MaritalStatus', 'c_charge_degree', 'is_recid', 'is_violent_recid'] >>> target_cols ['High_Recid', 'Medium_Recid', 'Low_Recid'] >>> y_true 'y_true' >>> y_predict 'y_predict'
- xaiographs.datasets.load_compas_why(language: str = 'en') Tuple[DataFrame, DataFrame]¶
Returns the necessary DataFrames to test the WHY module of XAIoGraphs with the explainability calculated with the COMPAS dataset.
- Parameters:
language (str) – Language identifier {es: Spanish, en: English}. Default uses English language
- Returns:
load_compas_why –
pd.DataFrame with the natural language explanation of feature-value we want to use
pd.DataFrame with the natural language explanation of feature-value we want to use per target
- Return type:
Tuple[pd.DataFrame, pd.DataFrame]
Example
>>> from xaiographs.datasets import load_compas_why >>> df_values_semantics, df_target_values_semantics = load_compas_why() >>> df_values_semantics.head(5) feature_value reason 0 Age_range_25 - 45 middle-aged 1 Age_range_Greater than 45 be older than 45 years 2 Age_range_Less than 25 be young 3 Ethnicity_African-American being black 4 Ethnicity_Asian being of Asian race >>> df_target_values_semantics.head(5) target feature_value reason 0 High_Recid Age_range_25 - 45 some in the age range between 25 and 45 years old were classified as "High Risk of recidivism" 1 High_Recid Age_range_Greater than 45 few of those over 45 years of age were classified as "High Risk of recidivism" 2 High_Recid Age_range_Less than 25 many under 25 years of age were classified as "High Risk of recidivism" 3 High_Recid Ethnicity_African-American many classified as "High Risk of Recidivism" were Black 4 High_Recid Ethnicity_Asian very few classified as "High Risk of Recidivism" were of Asian race
- xaiographs.datasets.load_compas_reality_discretized() Tuple[DataFrame, List[str], List[str], str, str]¶
Returns COMPAS dataset (and other metadata) to be tested in xaiographs. The dataset contains a series of discretized features, two columns (No_Recid, Recid) with two flags indicating whether or not they reoffended two years after arrest. Dataset contains the following columns:
id: unique person identifier
Gender: {Male, Female}
Age_range: {Less than 25, 25 - 45, Greater than 45}
Ethnicity: {African-American, Asian, Caucasian, Hispanic, Native American, Other}
MaritalStatus: {Married, Separated, Single, Other}
c_charge_degree: {F, M}
is_recid: {YES, NO}
is_violent_recid: {YES, NO}
No_Recid: not recidivist two years after arrest
Recid: recidivist two years after the arrest
- Returns:
load_compas_discretized –
pd.DataFrame, with data
List[str], with features name columns
List[str], with target names probabilities
- Return type:
Tuple[pd.DataFrame, List[str], List[str], str, str]
Example
>>> from xaiographs.datasets import load_compas_reality_discretized >>> df_dataset, features_cols, target_cols = load_compas_reality_discretized() >>> df_dataset.head(3) id Gender Age_range Ethnicity MaritalStatus c_charge_degree is_recid is_violent_recid Recid No_Recid 0 1 Male Greater than 45 Other Single F NO NO 0 1 1 3 Male 25 - 45 African-American Single F YES YES 1 0 2 5 Male 25 - 45 Other Separated M NO NO 0 1
>>> features_cols ['Gender', 'Age_range', 'Ethnicity', 'MaritalStatus', 'c_charge_degree', 'is_recid', 'is_violent_recid'] >>> target_cols ['Recid', 'No_Recid']
- xaiographs.datasets.load_compas_reality_why(language: str = 'en') Tuple[DataFrame, DataFrame]¶
Returns the necessary DataFrames to test the WHY module of XAIoGraphs with the explainability calculated with the COMPAS dataset.
- Parameters:
language (str) – Language identifier {es: Spanish, en: English}. Default uses English language
- Returns:
load_compas_why –
pd.DataFrame with the natural language explanation of feature-value we want to use
pd.DataFrame with the natural language explanation of feature-value we want to use per target
- Return type:
Tuple[pd.DataFrame, pd.DataFrame]
Example
>>> from xaiographs.datasets import load_compas_reality_why >>> df_values_semantics, df_target_values_semantics = load_compas_reality_why() >>> df_values_semantics.head(3) feature_value reason 0 Age_range_25 - 45 middle-aged 1 Age_range_Greater than 45 be older than 45 years 2 Age_range_Less than 25 be young >>> df_target_values_semantics.head(3) target feature_value reason 0 Recid Age_range_25 - 45 some in the age range between 25 and 45 years were repeat offenders 1 Recid Age_range_Greater than 45 few of those over 45 were repeat offenders 2 Recid Age_range_Less than 25 many of those under 25 years of age were repeat offenders
- xaiographs.datasets.load_body_performance() DataFrame¶
Returns body performance dataset with the following Features:
id: unique person identifier
age: person age
gender: person gender
height_cm: Measurement of the waist expressed in centimeters
weight_kg: Weight of the person expressed in kilograms
body_fat_%: Percentage of body fat
diastolic: diastolic blood pressure (min)
systolic: systolic blood pressure (min)
gripForce: Measure the grip force of the hands
sit_and_bend_forward_cm: Distance expressed in centimeters from the length of the entire back, from the heels to the crown of the head
sit-ups_counts: Num of repetitions of raising the torso to a sitting position and returning to the original position without using the arms or lifting the feet
broad_jump_cm: Longest jump forward jump with a running start and a single leap, expressed in centimiters
class: Grade of performance
- Returns:
load_body_performance – Body Performance dataset
- Return type:
pd.DataFrame
- Example:
>>> from xaiographs.datasets import load_body_performance >>> df_dataset = load_body_performance() >>> df_dataset.head(5) id age gender height_cm weight_kg body_fat_% diastolic systolic gripForce sit_and_bend_forward_cm sit-ups_counts broad_jump_cm class 0 0 27.0 M 172.3 75.24 21.3 80.0 130.0 54.9 18.4 60.0 217.0 mid_performance 1 1 25.0 M 165.0 55.80 15.7 77.0 126.0 36.4 16.3 53.0 229.0 high_performance 2 2 31.0 M 179.6 78.00 20.1 92.0 152.0 44.8 12.0 49.0 181.0 mid_performance 3 3 32.0 M 174.5 71.10 18.4 76.0 147.0 41.4 15.2 53.0 219.0 mid_performance 4 4 28.0 M 173.8 67.70 17.1 70.0 127.0 43.5 27.1 45.0 217.0 mid_performance
- xaiographs.datasets.load_body_performance_discretized() Tuple[DataFrame, List[str], List[str], str, str]¶
Returns body performance dataset (and other metadata) to be tested in xaiographs. The dataset contains a series of discretized features, four columns (high_performance,mid-top_performance,mid-low_performance,low_performance) with the probability [0,1] of classification given by an ML model and two columns ‘y_true’ and ‘y_predict’ with GroundTruth and prediction given by ML model. Dataset contains the following columns:
id: unique person identifier
age: person age
gender: person gender
height_cm: Measurement of the waist expressed in centimeters
weight_kg: Weight of the person expressed in kilograms
body_fat_%: Percentage of body fat
diastolic: diastolic blood pressure (min)
systolic: systolic blood pressure (min)
gripForce: Measure the grip force of the hands
sit_and_bend_forward_cm: Distance expressed in centimeters from the length of the entire back, from the heels to the crown of the head
sit-ups_counts: Num of repetitions of raising the torso to a sitting position and returning to the original position without using the arms or lifting the feet
broad_jump_cm: Longest jump forward jump with a running start and a single leap, expressed in centimiters
y_true: real target - {high_performance,mid-top_performance,mid-low_performance,low_performance}
y_predict: machine learning model prediction - {high_performance,mid-top_performance,mid-low_performance,low_performance}
high_performance: probability [0,1] that the person has a high level of body performance. Calculated by ML model
mid-top_performance: probability [0,1] that the person has a mid-top level of body performance. Calculated by ML model
mid-low_performance: probability [0,1] that the person has a mid-low level of body performance. Calculated by ML model
low_performance: probability [0,1] that the person has a low level of body performance. Calculated by ML model
- Returns:
load_body_performance_discretized –
pd.DataFrame, with data
List[str], with features name columns
List[str], with target names probabilities
str, with GroundTruth
str, with prediction ML model
- Return type:
Tuple[pd.DataFrame, List[str], List[str], str, str]
Example
>>> from xaiographs.datasets import load_body_performance_discretized >>> df_dataset, features_cols, target_cols, y_true, y_predict = load_body_performance_discretized() >>> df_dataset.head(5) id age gender height_cm weight_kg body_fat_% diastolic systolic gripForce sit_and_bend_forward_cm sit-ups_counts broad_jump_cm y_true y_predict high_performance mid_performance low_performance 0 0 26-35 M 160-mid-176 55-mid-79 15-mid-30 68-mid-89 115-mid-144 over_47 6-mid-23 over_54 150-mid-229 mid_performance mid_performance 0 1 0 1 1 <25 M 160-mid-176 55-mid-79 under_15 68-mid-89 115-mid-144 26-mid-47 6-mid-23 25-mid-54 150-mid-229 high_performance high_performance 1 0 0 2 2 26-35 M over_176 55-mid-79 15-mid-30 over_89 over_144 26-mid-47 6-mid-23 25-mid-54 150-mid-229 mid_performance mid_performance 0 1 0 3 3 26-35 M 160-mid-176 55-mid-79 15-mid-30 68-mid-89 over_144 26-mid-47 6-mid-23 25-mid-54 150-mid-229 mid_performance mid_performance 0 1 0 4 4 26-35 M 160-mid-176 55-mid-79 15-mid-30 68-mid-89 115-mid-144 26-mid-47 over_23 25-mid-54 150-mid-229 mid_performance mid_performance 0 1 0 >>> features_cols ['age', 'gender', 'height_cm', 'weight_kg', 'body_fat_%', 'diastolic', 'systolic', 'gripForce', 'sit_and_bend_forward_cm', 'sit-ups_counts', 'broad_jump_cm'] >>> target_cols ['high_performance', 'mid_performance', 'low_performance'] >>> y_true 'y_true' >>> y_predict 'y_predict'
- xaiographs.datasets.load_body_performance_why(language: str = 'en') Tuple[DataFrame, DataFrame]¶
Returns the necessary DataFrames to test the WHY module of XAIoGraphs with the explainability calculated with the Body Performance dataset.
- Parameters:
language (str) – Language identifier {es: Spanish, en: English}. Default uses English language
- Returns:
load_body_performance_why –
pd.DataFrame with the natural language explanation of feature-value we want to use
pd.DataFrame with the natural language explanation of feature-value we want to use per target
- Return type:
Tuple[pd.DataFrame, pd.DataFrame]
Example
>>> from xaiographs.datasets import load_body_performance_why >>> df_values_semantics, df_target_values_semantics = load_body_performance_why() >>> df_values_semantics.head(5) feature_value reason 0 age_26-35 being a child 1 age_36-45 being adolescent 2 age_46-55 being young 3 age_<25 to be adult 4 age_>55 being an older person >>> df_target_values_semantics.head(5) target feature_value reason 0 high_performance age_26-35 a child with a physical condition above average 1 high_performance age_36-45 a teenager with a higher than average physical... 2 high_performance age_46-55 a young man with a physical condition above av... 3 high_performance age_<25 an adult with a physical condition above average 4 high_performance age_>55 an older person with a higher than average phy...
- xaiographs.datasets.load_education_performance() DataFrame¶
Returns body performance dataset with the following Features:
id: unique studen identifier
age: Student Age
sex: Sex
graduated_h_school_type: Graduated high-school type
scholarship_type: Scholarship type
additional_work: Additional work
partner: Do you have a partner
total_salary: Total salary if available
transport: Transportation to the university
accomodation: Accommodation type in Cyprus
mother_ed: Mother’s education
farther_ed: Father’s education
siblings: Number of sisters/brothers
parental_status: Parental status
mother_occup: Mother’s occupation
father_occup: Father’s occupation
weekly_study_hours: Weekly study hours
reading_non_scientific: Reading frequency
reading_scientific: Reading frequency
**attendance_seminars_dep:**Attendance to the seminars/conferences related to the department
impact_of_projects: Impact of your projects/activities on your success
attendances_classes: Attendance to classes
preparation_midterm_company: Preparation to midterm exams 1
preparation_midterm_time: Preparation to midterm exams 2
taking_notes: Taking notes in classes
listenning: Listening in classes
discussion_improves_interest: Discussion improves my interest and success in the course
flip_classrom: Flip-classroom
grade: Grade of performance
- Returns:
load_education_performance – Education Performance dataset
- Return type:
pd.DataFrame
- Example:
>>> from xaiographs.datasets import load_education_performance >>> df_dataset = load_education_performance() >>> df_dataset.head(3) id age sex graduated_h_school_type scholarship_type additional_work activity partner total_salary transport accomodation mother_ed farther_ed siblings parental_status mother_occup father_occup weekly_study_hours reading_non_scientific reading_scientific attendance_seminars_dep impact_of_projects attendances_classes preparation_midterm_company preparation_midterm_time taking_notes listenning discussion_improves_interest flip_classrom course_id grade 0 0 2 1 2 3 2 2 1 3 4 2 1 2 3 1 2 3 2 2 2 1 1 2 1 1 2 2 2 2 1 Fail 1 1 1 1 1 4 1 1 2 4 2 3 4 4 1 1 3 2 3 3 3 1 3 1 3 2 3 1 3 3 1 Fail 2 2 1 1 1 4 2 2 2 1 1 1 3 4 4 2 2 2 3 2 2 1 1 1 1 1 2 2 2 3 1 Fail
- xaiographs.datasets.load_education_performance_discretized() Tuple[DataFrame, List[str], List[str], str, str]¶
Returns education performance dataset (and other metadata) to be tested in xaiographs. The dataset contains a series of discretized features, five columns (A, B, C, D, Fail) with the probability [0,1] of classification given by an ML model and two columns ‘y_true’ & ‘y_predict’ with GroundTruth and prediction given by ML model. Dataset contains the following columns:
id: unique studen identifier
age: Student Age
sex: Sex
graduated_h_school_type: Graduated high-school type
scholarship_type: Scholarship type
additional_work: Additional work
partner: Do you have a partner
total_salary: Total salary if available
transport: Transportation to the university
accomodation: Accommodation type in Cyprus
mother_ed: Mother’s education
farther_ed: Father’s education
siblings: Number of sisters/brothers
parental_status: Parental status
mother_occup: Mother’s occupation
father_occup: Father’s occupation
weekly_study_hours: Weekly study hours
reading_non_scientific: Reading frequency
reading_scientific: Reading frequency
**attendance_seminars_dep:**Attendance to the seminars/conferences related to the department
impact_of_projects: Impact of your projects/activities on your success
attendances_classes: Attendance to classes
preparation_midterm_company: Preparation to midterm exams 1
preparation_midterm_time: Preparation to midterm exams 2
taking_notes: Taking notes in classes
listenning: Listening in classes
discussion_improves_interest: Discussion improves my interest and success in the course
flip_classrom: Flip-classroom
y_true: real target - {A, B, C, D, Fail}
y_predict: machine learning model prediction - {A, B, C, D, Fail}
A: probability [0,1] that the person has a better educational performance. Calculated by ML model
B: probability [0,1] that the person has a second educational performance. Calculated by ML model
C: probability [0,1] that the person has a third educational performance. Calculated by ML model
D: probability [0,1] that the person has a fourth educational performance. Calculated by ML model
Fail: probability [0,1] that the person has a lower educational performance. Calculated by ML model
- Returns:
load_education_performance_discretized –
pd.DataFrame, with data
List[str], with features name columns
List[str], with target names probabilities
str, with GroundTruth
str, with prediction ML model
- Return type:
Tuple[pd.DataFrame, List[str], List[str], str, str]
Example
>>> from xaiographs.datasets import load_education_performance_discretized >>> df_dataset, features_cols, target_cols, y_true, y_predict = load_education_performance_discretized() >>> df_dataset.head(3) id age sex graduated_h_school_type scholarship_type additional_work activity partner total_salary transport accomodation mother_ed farther_ed parental_status mother_occup father_occup weekly_study_hours reading_non_scientific reading_scientific attendance_seminars_dep impact_of_projects attendances_classes preparation_midterm_company preparation_midterm_time taking_notes listenning discussion_improves_interest flip_classrom y_true y_predict A B C D Fail 0 0 22-25 female state 50% No No Yes USD 271-340 Other dormitory primary school secondary school married housewife private sector employee <5 hours Sometimes Sometimes Yes positive sometimes alone closest date to the exam sometimes sometimes sometimes useful Fail Fail 0 0 0 0 1 1 1 18-21 female private 75% Yes Yes No USD 341-410 Private car/taxi with family university university married government officer government officer 6-10 hours Often Often Yes neutral always not applicable regularly during the semester always never always not applicable Fail Fail 0 0 0 0 1 2 2 18-21 female private 75% No No No USD 135-200 Bus rental high school university divorced housewife government officer 6-10 hours Sometimes Sometimes Yes positive always alone closest date to the exam sometimes sometimes sometimes not applicable Fail Fail 0 0 0 0 1 >>> features_cols ['age', 'sex', 'graduated_h_school_type', 'scholarship_type', 'additional_work', 'activity', 'partner', 'total_salary', 'transport', 'accomodation', 'mother_ed', 'farther_ed', 'parental_status', 'mother_occup', 'father_occup', 'weekly_study_hours', 'reading_non_scientific', 'reading_scientific', 'attendance_seminars_dep', 'impact_of_projects', 'attendances_classes', 'preparation_midterm_company', 'preparation_midterm_time', 'taking_notes', 'listenning', 'discussion_improves_interest', 'flip_classrom'] >>> target_cols ['A', 'B', 'C', 'D', 'Fail'] >>> y_true 'y_true' >>> y_predict 'y_predict'
- xaiographs.datasets.load_education_performance_why(language: str = 'en') Tuple[DataFrame, DataFrame]¶
Returns the necessary DataFrames to test the WHY module of XAIoGraphs with the explainability calculated with the Body Performance dataset.
- Parameters:
language (str) – Language identifier {es: Spanish, en: English}. Default uses English language
- Returns:
load_education_performance_why –
pd.DataFrame with the natural language explanation of feature-value we want to use
pd.DataFrame with the natural language explanation of feature-value we want to use per target
- Return type:
Tuple[pd.DataFrame, pd.DataFrame]
Example
>>> from xaiographs.datasets import load_education_performance_why >>> df_values_semantics, df_target_values_semantics = load_education_performance_why() >>> df_values_semantics.head(5) feature_value reason 0 accomodation_other having been in another type of accommodation 1 accomodation_dormitory having been housed in a bedroom 2 accomodation_rental having been in a rented accommodation 3 accomodation_with family having been in a family accommodation 4 age_18-21 being under 21 years old >>> df_target_values_semantics.head(5) target feature_value reason 0 A accomodation_Other live in other facilities 1 A accomodation_dormitory living in a bedroom 2 A accomodation_rental living in rental 3 A accomodation_with family he lives with his family 4 A age_18-21 it is below the average age
- xaiographs.datasets.load_phone_brand_preferences() DataFrame¶
Returns the smarthpone brand preferences dataset that contains the following Features:
user_id: id of user
brand: Apple, Asus, Google, Motorola, Samsung, …
internal memory: in GB
RAM: in GB
performance: from AnTuTu rating
main camera: in MP
selfie camera: in MP
battery size: in mAh
screen size: in inches
weight: in grams
price: in dollars, collected from Amazon and Best-Buy (in Aug 22).
age: age of user
gender: {female, male}
occupation: occupation of user
- Returns:
load_phone_brand_preferences – Smartphone brand preferences dataset
- Return type:
pd.DataFrame
- Example:
>>> from xaiographs.datasets import load_phone_brand_preferences >>> df_dataset = load_phone_brand_preferences() >>> df_dataset.head(5) brand internal_memory performance main_camera selfie_camera battery_size screen_size weight price age gender occupation 0 Samsung 128 8.81 50 10 3700 6.1 167 528 38 Female Data analyst 1 Apple 256 7.94 12 12 3065 6.1 204 999 38 Female Data analyst 2 Google 128 6.76 50 8 4614 6.4 207 499 31 Female sales 3 Samsung 128 7.22 50 10 4500 6.6 195 899 31 Female sales 4 Google 128 6.88 12 8 4410 6.1 178 449 27 Female Team leader
- xaiographs.datasets.load_phone_brand_preferences_discretized() Tuple[DataFrame, List[str], List[str], str, str]¶
Returns smartphone brand preferences dataset (and other metadata) to be tested in xaiographs. The dataset contains a series of discretized features, four columns (Apple, Samsung, Xiaomi, Other) with the probability [0,1] of classification given by an ML model and two columns ‘y_true’ and ‘y_predict’ with GroundTruth and prediction given by ML model. Dataset contains the following columns:
id: unique user identifier
internal memory: {<=64_GB, 128_GB, >=256_GB}
performance: {Low, Mid, Top, Ultra top}
main camera: {0_15_MP, 15_30_MP, >30_MP}
selfie camera: {0_15_MP, 15_30_MP, >30_MP}
battery size: {<4000_mAh, 4000_4700_mAh, >4000_mAh}
screen size: {<6.4_inches, 6.4_6.6_inches, >6.6_inches}
weight: {<190_g, 190_205_g, >205_g}
price: {<200_dollars, 200_450_dollars, 450_700_dollars, >700_dollars}
age: {<25_years, 25_35_years, 25_45_years, >45_years}
gender: {female, male}
occupation: {Administrative, Business, Technology, Other}
Apple: probability [0,1] that the user will choose this brand. Calculated by ML model
Samsung: probability [0,1] that the user will choose this brand. Calculated by ML model
Xiaomi: probability [0,1] that the user will choose this brand. Calculated by ML model
Google: probability [0,1] that the user will choose this brand. Calculated by ML model
Motorola: probability [0,1] that the user will choose this brand. Calculated by ML model
y_true: real target - {Apple, Samsung, Xiaomi, Other}
y_predict: machine learning model prediction - {Apple, Samsung, Xiaomi, Other}
- Returns:
load_phone_brand_preferences_discretized –
pd.DataFrame, with data
List[str], with features name columns
List[str], with target names probabilities
str, with GroundTruth
str, with prediction ML model
- Return type:
Tuple[pd.DataFrame, List[str], List[str], str, str]
- Example:
>>> from xaiographs.datasets import load_phone_brand_preferences_discretized >>> df_dataset, features_cols, target_cols, y_true, y_predict = load_phone_brand_preferences_discretized() >>> df_dataset.head(5) id internal_memory performance main_camera selfie_camera battery_size screen_size weight price age gender occupation y_true y_predict Apple Google Motorola Samsung Xiaomi 0 0 128_GB Ultra top 15_50_MP <10_MP <4000_mAh <6.4_inches <190_g 450_700_dollars 35_45_years Female Technology Samsung Apple 1 0 0 0 0 1 1 >=256_GB Top <15_MP 10_30_MP <4000_mAh <6.4_inches 190_205_g >700_dollars 35_45_years Female Technology Apple Apple 1 0 0 0 0 2 2 128_GB Mid 15_50_MP <10_MP 4000_4700_mAh <6.4_inches >205_g 450_700_dollars 25_35_years Female Business Google Google 0 1 0 0 0 3 3 128_GB Mid 15_50_MP <10_MP 4000_4700_mAh 6.4_6.6_inches 190_205_g >700_dollars 25_35_years Female Business Samsung Samsung 0 0 0 1 0 4 4 128_GB Mid <15_MP <10_MP 4000_4700_mAh <6.4_inches <190_g 200_450_dollars 25_35_years Female Administration Google Google 0 1 0 0 0 >>> features_cols ['internal memory', 'performance', 'main camera', 'selfie camera', 'battery size', 'screen size', 'weight', 'price', 'age', 'gender', 'occupation'] >>> target_cols ['Apple', 'Samsung', 'Xiaomi', 'Motorola', 'Google'] >>> y_true 'y_true' >>> y_predict 'y_predict'
- xaiographs.datasets.load_phone_brand_preferences_why(language: str = 'en') Tuple[DataFrame, DataFrame]¶
Returns the necessary DataFrames to test the WHY module of XAIoGraphs with the explainability calculated with the Smartphone Brand Preferences dataset.
- Parameters:
language (str) – Language identifier {es: Spanish, en: English}. Default uses English language
- Returns:
load_phone_brand_preferences_why –
pd.DataFrame with the natural language explanation of feature-value we want to use
pd.DataFrame with the natural language explanation of feature-value we want to use per target
- Return type:
Tuple[pd.DataFrame, pd.DataFrame]
- Example:
>>> from xaiographs.datasets import load_phone_brand_preferences_why >>> df_values_semantics, df_target_values_semantics = load_phone_brand_preferences_why() >>> df_values_semantics.head(5) feature_value reason 0 battery size_4000_4700_mAh having a battery size between 4000 and 4700 mAh 1 battery size_<4000_mAh having a battery size smaller than 4000 mAh 2 battery size_>4700_mAh having a battery size larger than 4700 mAh 3 internal memory_128_GB having 128 GB of internal memory 4 internal memory_<=64_GB having 64 GB or less of internal memory >>> df_target_values_semantics.head(5) target feature_value reason 0 Apple battery size_4000_4700_mAh some Apple phones have a battery size between 4000 and 4700 mAh 1 Apple battery size_<4000_mAh many Apple models have batteries smaller than 4000 mAh 2 Apple battery size_>4700_mAh few Apple phones feature batteries larger than 4700 mAh 3 Apple internal memory_128_GB many Apple phones offer 128 GB of internal memory 4 Apple internal memory_<=64_GB entry-level Apple models have 64 GB or less internal memory