Explainer
- class xaiographs.Explainer(importance_engine: str, destination_path: str = './xaioweb_files', number_of_features: int = 8, verbose: int = 0)
The Explainer class provides an abstraction layer that encapsulates the whole explanation process: statistics calculation, importance calculation (using the engine chosen by the user) and export of the information used for visualization tasks.
Read more in the Explainability User Guide
- Parameters
importance_engine (str) – The name of the method used to compute feature importance.
Important
LIDE is the available option for version 0.0.2
destination_path (str, default='./xaioweb_files') – The path where output XAIoWeb files will be stored.
number_of_features (int) – The number of top relevant features to be selected for importance calculation.
verbose (int, default=0) – Verbosity level.
Hint
Any value greater than 0 means verbosity is on.
Methods:
fit(df, feature_cols, target_cols[, ...]) – It coordinates all the steps of the explanation process.
Attributes:
global_explainability – Property containing each feature ranked by its global importance.
global_frequency_feature_value – Property that returns the number of occurrences for each feature-value pair.
global_target_explainability – Property that returns all the features to be explained, ranked by their global importance by target value.
global_target_feature_value_explainability – Property that, for each target value, returns all the feature-value pairs ranked by their global importance.
importance_values – Property that returns the computed importance values.
local_feature_value_explainability – Property that, for each sample, returns as many rows as feature-value pairs, together with their calculated importance.
local_reliability – Property that, for each sample, returns its top1 target and the reliability value associated to that target.
sample_ids_to_display – Property that retrieves the sample ids which will be used to build the interactive visualization.
top_features – Property that returns all the features ranked by the FeatureSelector.
top_features_by_target – Property that returns all the features ranked by the FeatureSelector.
- fit(df: DataFrame, feature_cols: List[str], target_cols: List[str], num_samples_local_expl: int = 100, num_samples_global_expl: int = 50000, batch_size_expl: int = 5000, train_stratify: bool = True)
It coordinates all the steps of the explanation process, which consists of the following parts:
Feature selection: determines the top K most relevant features. K is defined by the number_of_features parameter in the constructor of the Explainer class.
Importance calculation: computes the importance of the features kept by the previous step for each of the possible target values.
Stats calculation: computes the counts and ratios needed to feed the files used for visualization purposes.
Exporter: generates all the files related to the explanation process which will be used for visualization purposes.
- Parameters
df (pandas.DataFrame) – Structure containing the whole dataset.
feature_cols (List[str]) – List containing the names of those columns representing features within the dataset DataFrame.
target_cols (List[str]) – List containing the names of those columns representing the target values within the dataset DataFrame.
num_samples_local_expl (int, default=100) – Number of samples to be taken into account for local explainability.
num_samples_global_expl (int, default=50000) – Number of samples to be taken into account for global explainability.
Hint
Set this parameter if the dataset to be explained is too big. Global explanation will only take into account a number of samples equal to this parameter.
batch_size_expl (int, default=5000) – Batch size to be used when computing the importance.
train_stratify (bool, default=True) – When train_size is different from 0.0, this parameter can be set to True so that the train/test split will keep the target ratio in both of the resulting dataset partitions.
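As a minimal usage sketch of the constructor and fit() signatures documented above: the wrapper function run_explainer and its output_dir argument are hypothetical names introduced for illustration, and the argument values simply restate the documented defaults. The import is done lazily inside the function so the sketch can be loaded even where xaiographs is not installed.

```python
def run_explainer(df, feature_cols, target_cols, output_dir="./xaioweb_files"):
    """Build an Explainer and run the whole explanation process on df.

    df is expected to be a pandas.DataFrame containing both the feature
    columns and the target columns, as described in the parameter list.
    """
    # Lazy import: only needed when the function is actually called.
    from xaiographs import Explainer

    explainer = Explainer(
        importance_engine="LIDE",       # the option documented for version 0.0.2
        destination_path=output_dir,    # where the XAIoWeb files are written
        number_of_features=8,           # top K features kept by feature selection
        verbose=1,                      # any value greater than 0 turns verbosity on
    )
    explainer.fit(
        df=df,
        feature_cols=feature_cols,
        target_cols=target_cols,
        num_samples_local_expl=100,
        num_samples_global_expl=50000,
        batch_size_expl=5000,
    )
    return explainer
```

Once fit() has run, the properties described below (for example global_explainability) can be read from the returned object.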
- property global_explainability
Property containing each feature ranked by its global importance. This property is computed in two steps:
The mean importance of each feature is computed for each target.
The mean of those per-target values is then computed across all the targets.
Caution
If the method fit() from the Explainer class has not been executed, it will return a warning message.
- Returns
global_explainability – Structure containing each feature ranked by its global importance. It contains the following columns:
Column      Description
feature     feature name.
importance  feature importance considering all possible target values and all the samples.
rank        position of the feature when sorted by its importance. The lower the rank, the higher the importance.
- Return type
pandas.DataFrame
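The two-step mean described for this property can be sketched with NumPy. This is an illustration under stated assumptions, not the library's implementation: it assumes the importance array has the shape (n_samples, n_features, n_target_values) given for importance_values, it averages signed values exactly as the text says, and the function name global_importance is hypothetical.

```python
import numpy as np

def global_importance(importance_values, feature_names):
    """Return (feature, importance, rank) tuples sorted by rank."""
    values = np.asarray(importance_values, dtype=float)
    # Step 1: mean over samples, giving one value per (feature, target).
    per_target = values.mean(axis=0)
    # Step 2: mean over targets, giving one value per feature.
    overall = per_target.mean(axis=1)
    # Rank 1 goes to the largest importance; ties keep input order (assumption).
    order = np.argsort(-overall, kind="stable")
    return [
        (feature_names[i], float(overall[i]), rank)
        for rank, i in enumerate(order, start=1)
    ]

# Two samples, two features, two target values.
example = np.array([
    [[0.2, 0.4], [0.1, 0.1]],
    [[0.4, 0.6], [0.3, 0.1]],
])
for row in global_importance(example, ["a", "b"]):
    print(row)
```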
- property global_frequency_feature_value
Property that returns the number of occurrences for each feature-value pair. This is computed by adding up each feature-value pair occurrence.
Caution
If the method fit() from the Explainer class has not been executed, it will return a warning message.
- Returns
global_frequency_feature_value – Structure containing the number of times each feature-value occurs. It contains the following columns:
Column         Description
feature_value  feature name together with each of its possible values.
frequency      total number of occurrences for each feature name-value pair.
- Return type
pandas.DataFrame
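Counting feature-value occurrences can be sketched with the standard library. The function name feature_value_frequency is hypothetical, and joining the feature name and its value with an underscore is an assumption made for illustration; the source does not state how the feature_value label is built.

```python
from collections import Counter

def feature_value_frequency(rows):
    """Count how often each feature-value pair occurs.

    rows is an iterable of dicts mapping feature name to value,
    one dict per sample.
    """
    counts = Counter()
    for row in rows:
        for feature, value in row.items():
            # Label format "<feature>_<value>" is an assumption.
            counts[f"{feature}_{value}"] += 1
    # Most frequent pairs first.
    return counts.most_common()

samples = [
    {"color": "red", "size": "S"},
    {"color": "red", "size": "M"},
    {"color": "blue", "size": "M"},
]
for label, frequency in feature_value_frequency(samples):
    print(label, frequency)
```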
- property global_target_explainability
Property that returns all the features to be explained, ranked by their global importance by target value. This is achieved by computing the mean of each feature importance for each target value.
Caution
If the method fit() from the Explainer class has not been executed, it will return a warning message.
- Returns
global_target_explainability – Structure containing all the features to be explained, ranked by their global importance by target value. It contains the following columns:
Column      Description
target      each of the possible target values.
feature     feature name.
importance  feature importance with respect to each possible target value.
rank        position of the feature when sorted by its importance. The lower the rank, the higher the importance.
- Return type
pandas.DataFrame
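The per-target ranking can be sketched as follows. It assumes the same (n_samples, n_features, n_target_values) array layout as importance_values; the function name target_importance and the tuple layout of the result are hypothetical, chosen to mirror the columns listed above.

```python
import numpy as np

def target_importance(importance_values, feature_names, target_names):
    """Return (target, feature, importance, rank) tuples, grouped by target."""
    # Mean over samples: one importance per (feature, target).
    per_target = np.asarray(importance_values, dtype=float).mean(axis=0)
    rows = []
    for t, target in enumerate(target_names):
        # Within each target, rank 1 is the most important feature.
        order = np.argsort(-per_target[:, t], kind="stable")
        for rank, f in enumerate(order, start=1):
            rows.append((target, feature_names[f], float(per_target[f, t]), rank))
    return rows

example = np.array([
    [[0.2, 0.0], [0.1, 0.3]],
    [[0.4, 0.2], [0.3, 0.5]],
])
for row in target_importance(example, ["a", "b"], ["yes", "no"]):
    print(row)
```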
- property global_target_feature_value_explainability
Property that, for each target value, returns all the feature-value pairs ranked by their global importance. This is achieved by computing the mean of the importance values of each feature-value pair linked to the target value being processed.
Caution
If the method fit() from the Explainer class has not been executed, it will return a warning message.
- Returns
global_target_feature_value_explainability – Structure containing, for each target value, all the feature-value pairs ranked by their global importance. It contains the following columns:
Column         Description
target         each of the possible target values.
feature_value  feature name together with each of its possible values.
importance     feature importance with respect to each possible target value.
rank           position of the feature for each target value when sorted by its importance. The lower the rank, the higher the importance.
- Return type
pandas.DataFrame
- property importance_values
Property that returns the computed importance values.
Caution
If the method local_explain() from an ImportanceCalculator child class has not been executed, it will return a warning message.
- Returns
importance_values – Structure containing, for each sample, feature and target value, the computed importance values.
- Return type
numpy.array, shape (n_samples, n_features, n_target_values)
- property local_feature_value_explainability
Property that, for each sample, returns as many rows as feature-value pairs, together with their calculated importance.
Caution
If the method fit() from the Explainer class has not been executed, it will return a warning message.
- Returns
local_feature_value_explainability – Structure containing, for each sample, all its feature-value pairs together with their importance. It contains the following columns:
Column         Description
id             identifier for each sample.
feature_value  feature name together with each of its possible values.
importance     feature importance for each feature_value pair and the top1 target.
rank           position of the feature_value pair for each sample when sorted by its importance. The lower the rank, the higher the importance.
- Return type
pandas.DataFrame
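For a single sample, the ranking of feature-value pairs for its top1 target can be sketched in plain Python. Several points here are assumptions: the function name local_ranking is hypothetical, the index of the top1 target is taken as an input because the source does not say how it is chosen, and pairs are sorted by signed importance in descending order.

```python
def local_ranking(sample_importance, feature_values, top1_index):
    """Rank the feature-value pairs of one sample for its top1 target.

    sample_importance: one row per feature-value pair, one column per target.
    feature_values:    labels for the rows, in the same order.
    top1_index:        column index of the sample's top1 target.
    """
    scored = [
        (label, row[top1_index])
        for label, row in zip(feature_values, sample_importance)
    ]
    # Highest importance first; sorted() is stable, so ties keep input order.
    scored = sorted(scored, key=lambda item: item[1], reverse=True)
    return [
        (label, importance, rank)
        for rank, (label, importance) in enumerate(scored, start=1)
    ]

importance = [[0.1, 0.5], [0.3, 0.2], [-0.2, 0.05]]
labels = ["age_30-40", "gender_F", "class_1"]
for row in local_ranking(importance, labels, top1_index=0):
    print(row)
```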
- property local_reliability
Property that, for each sample, returns its top1 target and the reliability value associated to that target.
Caution
If the method fit() from the Explainer class has not been executed, it will return a warning message.
- Returns
local_dataset_reliability – Structure containing, for each sample, its top1 target and the reliability value associated to that target. It contains the following columns:
Column       Description
id           identifier for each sample.
target       each of the possible target values.
reliability  associated reliability value.
- Return type
pandas.DataFrame
- property sample_ids_to_display
Property that retrieves the sample ids which will be used to build the interactive visualization.
Caution
If the method fit() from the Explainer class has not been executed, it will return a warning message.
- Returns
local_feature_value_explainability – Structure containing the ids for the chosen samples.
- Return type
pandas.Series
- property top_features
Property that returns all the features ranked by the FeatureSelector. Ranking is calculated as follows:
For each target value and for all the features, two histograms are calculated per feature. The first one considers the input pandas DataFrame filtered by the target value and the second one considers the opposite (DataFrame filtered by the absence of the target value).
The modified Jensen-Shannon distance (see below for details) is calculated between the two resulting distributions.
Once the distances have been computed for all the features for a given target value, they're ranked, so that the larger the distance, the higher the rank.
Finally, for each feature, its ranks for all of the targets are taken into account, so that the feature with the highest aggregated rank (that is, the smallest rank sum) will be placed first in the top K features. Note that, when talking about ranks, rank 1 is higher than rank 2.
Caution
If the method fit() from the Explainer class has not been executed, it will return a warning message.
Modified Jensen-Shannon distance calculation:
The formula can be found here.
However, the formula used is a modified version which returns a four-element NumPy array:
First element replaces the square root by the square root of the median.
Second element replaces the square root by the square root of the mean.
Third element replaces the square root by the square root of the max.
Fourth element replaces the square root by the square root of the sum.
A NumPy array as explained above is returned per feature, and all of them are stacked into a matrix of shape (number_of_features, 4).
Each element of the matrix is normalized by dividing it by the sum of the elements of its corresponding column.
For each feature (each matrix row), its normalized statistics are added, as a result the matrix becomes a vector containing one element per feature.
Finally each element is normalized by dividing it by the sum of all the elements. These are the distances taken into account to compute the rank so that the higher the distance the more discriminative the feature is considered, thus, the more interesting from the predictive point of view. The feature with the highest distance will be ranked first while the feature with the smallest distance will be ranked last.
A vector like the one described in the step above will be obtained for each target value, this means that a ranking will be obtained for each target value.
In order to obtain a final ranking, the partial ranks per target value are added up for each feature, so that the higher the rank sum of a feature, the less relevant it will be considered.
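The distance statistics and their normalization for one target value can be sketched as below. This is an interpretation, not the library's code. The source only says that the square root is replaced by the square root of the median, mean, max and sum; here that is assumed to mean applying each statistic to the per-bin terms of the Jensen-Shannon divergence (natural logarithm) before halving and taking the square root. The function names modified_js_stats and normalized_distances are hypothetical.

```python
import numpy as np

def _rel_entr(a, m):
    """Per-bin a * log(a / m), taking 0 where a == 0."""
    out = np.zeros_like(a, dtype=float)
    mask = a > 0
    out[mask] = a[mask] * np.log(a[mask] / m[mask])
    return out

def modified_js_stats(hist_p, hist_q):
    """Four statistics (median, mean, max, sum) for one feature (assumed form)."""
    p = np.asarray(hist_p, dtype=float)
    q = np.asarray(hist_q, dtype=float)
    p, q = p / p.sum(), q / q.sum()          # histograms -> distributions
    m = 0.5 * (p + q)
    # Per-bin terms are non-negative in theory; clip rounding noise.
    terms = np.maximum(_rel_entr(p, m) + _rel_entr(q, m), 0.0)
    stats = np.array([np.median(terms), terms.mean(), terms.max(), terms.sum()])
    return np.sqrt(stats / 2.0)

def normalized_distances(stats_matrix):
    """Turn a (number_of_features, 4) matrix into one distance per feature."""
    stats_matrix = np.asarray(stats_matrix, dtype=float)
    # Divide each element by the sum of its column (assumes non-zero sums).
    by_column = stats_matrix / stats_matrix.sum(axis=0, keepdims=True)
    # Add the four normalized statistics of each feature.
    per_feature = by_column.sum(axis=1)
    # Normalize again so that the distances add up to 1.
    return per_feature / per_feature.sum()

# Feature 0 separates the target perfectly, feature 1 barely does.
matrix = np.vstack([
    modified_js_stats([10, 0], [0, 10]),
    modified_js_stats([6, 4], [4, 6]),
])
print(normalized_distances(matrix))
```

Under this interpretation the fourth statistic is the standard Jensen-Shannon distance, so the other three can be read as variations on it.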
- Returns
top_features – Structure containing all the features ranked by the FeatureSelector.
- Return type
pandas.DataFrame
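The final aggregation of per-target ranks described for this property can be sketched with NumPy. The function name aggregate_ranks is hypothetical, the input is assumed to be one row of normalized distances per target value, and breaking ties by input order is an assumption; the source does not say how ties are handled.

```python
import numpy as np

def aggregate_ranks(distances_by_target, feature_names):
    """Return (feature, rank_sum) tuples, most relevant feature first.

    distances_by_target has shape (n_target_values, n_features).
    """
    d = np.asarray(distances_by_target, dtype=float)
    # Per target: rank 1 goes to the feature with the largest distance.
    order = np.argsort(-d, axis=1, kind="stable")
    ranks = np.empty_like(order)
    row_index = np.arange(d.shape[0])[:, None]
    ranks[row_index, order] = np.arange(1, d.shape[1] + 1)
    # Add the partial ranks; a smaller sum means a more relevant feature.
    rank_sum = ranks.sum(axis=0)
    final_order = np.argsort(rank_sum, kind="stable")
    return [(feature_names[i], int(rank_sum[i])) for i in final_order]

distances = [
    [0.5, 0.3, 0.2],  # first target value
    [0.4, 0.1, 0.5],  # second target value
]
print(aggregate_ranks(distances, ["a", "b", "c"]))
```

Keeping the top K entries of the returned list would correspond to the number_of_features setting of the constructor.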
- property top_features_by_target
Property that returns all the features ranked by the FeatureSelector.
See also
For further information about how the selection process works, please refer to top_features from the Explainer class.
Caution
If the method fit() from the Explainer class has not been executed, it will return a warning message.
- Returns
top_features – Structure providing, for each feature, its rank per target calculated by the FeatureSelector. Furthermore, the distance for each feature and target value is provided along with its rank.
- Return type
pandas.DataFrame