Explainer

class xaiographs.Explainer(importance_engine: str, destination_path: str = './xaioweb_files', number_of_features: int = 8, verbose: int = 0)

The Explainer class provides an abstraction layer that encapsulates the whole explanation process: statistics calculation, importance calculation (using the engine chosen by the user) and information export for visualization tasks.

Read more in the Explainability User Guide

Parameters

importance_engine (str) – The name of the method used to compute feature importance.

Important

LIDE is the only option available in version 0.0.2.
destination_path (str, default='./xaioweb_files') – The path where output XAIoWeb files will be stored.
number_of_features (int, default=8) – The number of top relevant features to be selected for importance calculation.
verbose (int, default=0) – Verbosity level.

Hint

Any value greater than 0 means verbosity is on.

Methods:

fit(df, feature_cols, target_cols[, ...])

Coordinates all the steps of the explanation process.

Attributes:

`global_explainability`	Property containing each feature ranked by its global importance.
`global_frequency_feature_value`	Property that returns the number of occurrences for each feature-value pair.
`global_target_explainability`	Property that returns all the features to be explained, ranked by their global importance by target value.
`global_target_feature_value_explainability`	Property that, for each target value, returns all the feature-value pairs ranked by their global importance.
`importance_values`	Property that returns the computed importance values.
`local_feature_value_explainability`	Property that, for each sample, returns as many rows as feature-value pairs, together with their calculated importance.
`local_reliability`	Property that, for each sample, returns its top1 target and the reliability value associated with that target.
`sample_ids_to_display`	Property that retrieves the sample ids which will be used to build the interactive visualization.
`top_features`	Property that returns all the features ranked by the `FeatureSelector`.
`top_features_by_target`	Property that returns all the features ranked by the `FeatureSelector`.

fit(df: DataFrame, feature_cols: List[str], target_cols: List[str], num_samples_local_expl: int = 100, num_samples_global_expl: int = 50000, batch_size_expl: int = 5000, train_stratify: bool = True)

Coordinates all the steps of the explanation process, which consists of the following parts:

Feature selection: determines the top K most relevant features. K is set by the number_of_features parameter of the Explainer constructor.
Importance calculation: computes the importance of the features retained by the previous step for each possible target value.
Stats calculation: computes the counts and ratios that feed the files used for visualization.
Exporter: generates all the files related to the explanation process that will be used for visualization.

Parameters

df (pandas.DataFrame) – Structure containing the whole dataset.
feature_cols (List[str]) – List containing the names of those columns representing features within the dataset DataFrame.
target_cols (List[str]) – List containing the names of those columns representing the target values within the dataset DataFrame.
num_samples_local_expl (int, default=100) – Number of samples to be taken into account for local explainability.
num_samples_global_expl (int, default=50000) – Number of samples to be taken into account for global explainability.

Hint

Set this parameter if the dataset to be explained is too large. The global explanation will then take into account at most this number of samples.
batch_size_expl (int, default=5000) – Batch size to be used when computing the importance.
train_stratify (bool, default=True) – When train_size is different from 0.0, this parameter can be set to True so that the train/test split will keep the target ratio in both of the resulting dataset partitions.
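As a rough illustration of how a sample cap and a batch size interact, the following sketch picks at most num_samples row positions and walks through them in batches. The helper name sample_and_batch, the random sampling strategy and the seed are assumptions made for this example; the library's actual sampling logic is not described here.

```python
import numpy as np


def sample_and_batch(n_rows, num_samples, batch_size, seed=0):
    """Pick at most num_samples row positions, then yield them in batches.

    Hypothetical helper: it only illustrates the roles of
    num_samples_global_expl and batch_size_expl.
    """
    rng = np.random.default_rng(seed)
    n_keep = min(n_rows, num_samples)
    # Sample row positions without replacement and keep them in dataset order.
    chosen = np.sort(rng.choice(n_rows, size=n_keep, replace=False))
    for start in range(0, n_keep, batch_size):
        yield chosen[start:start + batch_size]


# A dataset of 120000 rows capped at 50000 samples, processed 5000 at a time.
batches = list(sample_and_batch(n_rows=120_000, num_samples=50_000, batch_size=5_000))
print(len(batches), len(batches[0]))
```

With these values the capped sample is split into ten equally sized batches; a dataset smaller than the cap would simply be used in full.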

property global_explainability

Property containing each feature ranked by its global importance. This property is computed in two steps:

The mean importance of each feature is computed for each target.
The mean of those per-target importances is then computed across all the targets.
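The two steps can be sketched with NumPy on a small importance array of shape (n_samples, n_features, n_target_values). The feature names and numbers are made up, and the sketch averages raw values; whether the library applies any transformation (for instance absolute values) before averaging is not stated here.

```python
import numpy as np

# Hypothetical data: 2 samples, 3 features, 2 target values.
feature_names = ["age", "income", "tenure"]
importance = np.array([
    [[0.2, 0.4], [0.6, 0.2], [0.0, 0.2]],  # sample 0
    [[0.4, 0.2], [0.8, 0.4], [0.2, 0.0]],  # sample 1
])

# Step 1: mean importance of each feature for each target (average over samples).
per_target = importance.mean(axis=0)          # shape (n_features, n_target_values)

# Step 2: mean of the per-target importances across all the targets.
global_importance = per_target.mean(axis=1)   # shape (n_features,)

# Rank 1 goes to the most important feature.
order = np.argsort(-global_importance)
rank = np.empty_like(order)
rank[order] = np.arange(1, len(order) + 1)

for name, value, position in zip(feature_names, global_importance, rank):
    print(name, round(float(value), 3), int(position))
```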

Caution

If the method fit() from the Explainer class has not been executed, it will return a warning message.

Returns

global_explainability – Structure containing each feature ranked by its global importance. It contains the following columns:

Column	Description
feature	feature name.
importance	feature importance considering all possible target values and all the samples.
rank	position of the feature when sorted by its importance. The lower the rank the higher the importance.

Return type

pandas.DataFrame

property global_frequency_feature_value

Property that returns the number of occurrences for each feature-value pair. This is computed by adding up each feature-value pair occurrence.
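A minimal pandas sketch of this kind of count is shown below. The toy DataFrame and the underscore used to join a feature name with its value are assumptions for illustration; the exact naming of feature_value entries in the library output may differ.

```python
import pandas as pd

# Hypothetical dataset with two categorical features.
df = pd.DataFrame({
    "color": ["red", "blue", "red"],
    "size": ["S", "S", "M"],
})

# Long format: one row per (sample, feature, value).
long_df = df.melt(var_name="feature", value_name="value")
long_df["feature_value"] = long_df["feature"] + "_" + long_df["value"].astype(str)

# Add up the occurrences of each feature-value pair.
frequency = long_df.groupby("feature_value").size().reset_index(name="frequency")
print(frequency)
```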

Caution

If the method fit() from the Explainer class has not been executed, it will return a warning message.

Returns

global_frequency_feature_value – Structure containing the number of times each feature-value occurs. It contains the following columns:

Column	Description
feature_value	feature name together with each of its possible values.
frequency	total number of occurrences of each feature name-value pair.

Return type

pandas.DataFrame

property global_target_explainability

Property that returns all the features to be explained, ranked by their global importance by target value. This is achieved by computing the mean of each feature importance for each target value.

Caution

If the method fit() from the Explainer class has not been executed, it will return a warning message.

Returns

global_target_explainability – Structure containing all the features to be explained, ranked by their global importance by target value. It contains the following columns:

Column	Description
target	each of the possible target values.
feature	feature name.
importance	feature importance with respect to each possible target value.
rank	position of the feature when sorted by its importance. The lower the rank the higher the importance.

Return type

pandas.DataFrame

property global_target_feature_value_explainability

Property that, for each target value, returns all the feature-value pairs ranked by their global importance. This is achieved by computing the mean importance of each feature-value pair linked to the target value being processed.
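The per-target averaging and ranking can be sketched with a pandas groupby. The long-format input table and its values are hypothetical, as is the tie-breaking rule (method="first"); only the column names target, feature_value, importance and rank come from the description of the returned structure.

```python
import pandas as pd

# Hypothetical per-sample importances, one row per (sample, feature-value pair).
local = pd.DataFrame({
    "target": ["yes", "yes", "yes", "no", "no", "no"],
    "feature_value": ["age_young", "age_young", "income_high",
                      "age_young", "income_high", "income_high"],
    "importance": [0.4, 0.2, 0.1, 0.1, 0.5, 0.3],
})

# Mean importance of each feature-value pair within each target value.
result = local.groupby(["target", "feature_value"], as_index=False)["importance"].mean()

# Rank inside each target: the largest mean importance gets rank 1.
result["rank"] = (
    result.groupby("target")["importance"]
    .rank(ascending=False, method="first")
    .astype(int)
)
result = result.sort_values(["target", "rank"]).reset_index(drop=True)
print(result)
```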

Caution

If the method fit() from the Explainer class has not been executed, it will return a warning message.

Returns

global_target_feature_value_explainability – Structure containing, for each target value, all the feature-value pairs ranked by their global importance. It contains the following columns:

Column	Description
target	each of the possible target values.
feature_value	feature name together with each of its possible values.
importance	importance of the feature-value pair with respect to each possible target value.
rank	position of the feature-value pair for each target value when sorted by its importance. The lower the rank the higher the importance.

Return type

pandas.DataFrame

property importance_values

Property that returns the computed importance values.

Caution

If the method local_explain() from an ImportanceCalculator child class has not been executed, it will return a warning message.

Returns: importance_values – Structure containing, for each sample, feature and target value, the computed importance values.
Return type: numpy.ndarray, shape (n_samples, n_features, n_target_values)

property local_feature_value_explainability

Property that, for each sample, returns as many rows as feature-value pairs, together with their calculated importance.

Caution

If the method fit() from the Explainer class has not been executed, it will return a warning message.

Returns

local_feature_value_explainability – Structure containing for each sample all its feature-value pairs together with their importance. It contains the following columns:

Column	Description
id	identifier for each sample.
feature_value	feature name together with each of its possible values.
importance	feature importance for each feature_value pair and the top1 target.
rank	position of the feature_value pair for each sample when sorted by its importance. The lower the rank the higher the importance.

Return type

pandas.DataFrame

property local_reliability

Property that, for each sample, returns its top1 target and the reliability value associated with that target.

Caution

If the method fit() from the Explainer class has not been executed, it will return a warning message.

Returns

local_dataset_reliability – Structure containing, for each sample, its top1 target and the reliability value associated with that target. It contains the following columns:

Column	Description
id	identifier for each sample.
target	top1 target of the sample.
reliability	associated reliability value.

Return type

pandas.DataFrame

property sample_ids_to_display

Property that retrieves the sample ids which will be used to build the interactive visualization.

Caution

If the method fit() from the Explainer class has not been executed, it will return a warning message.

Returns: sample_ids_to_display – Structure containing the ids for the chosen samples.
Return type: pandas.Series

property top_features

Property that returns all the features ranked by the FeatureSelector. Ranking is calculated as follows:

For each target value and each feature, two histograms are calculated: the first over the input pandas DataFrame filtered by the target value, and the second over its complement (the DataFrame filtered by the absence of that target value).
The modified Jensen-Shannon distance (see below for details) is calculated between the two resulting distributions.
Once the distances have been computed for all the features for a given target value, the features are ranked so that the larger the distance, the higher the rank.
Finally, for each feature, its ranks across all the targets are aggregated, and the feature with the best aggregated rank is placed first in the top K features (note that, when talking about ranks, 1 is better than 2, so the best aggregated rank is the lowest rank sum).
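The ranking and aggregation steps can be sketched as follows, starting from a matrix of already computed distances. The feature names, the distance values and K = 2 are invented for the example, and aggregating by a plain sum of ranks follows the description given in this section rather than the library's source code.

```python
import numpy as np

feature_names = ["age", "income", "tenure", "region"]

# Hypothetical distances: one row per target value, one column per feature.
distances = np.array([
    [0.10, 0.50, 0.30, 0.05],  # target value A
    [0.40, 0.35, 0.05, 0.20],  # target value B
])

# Per-target ranks: the largest distance gets rank 1.
ranks = (-distances).argsort(axis=1).argsort(axis=1) + 1

# Aggregate the ranks over the targets: the lowest rank sum is the best feature.
rank_sum = ranks.sum(axis=0)
order = np.argsort(rank_sum, kind="stable")

k = 2
top_k = [feature_names[i] for i in order[:k]]
print(ranks.tolist(), rank_sum.tolist(), top_k)
```

Here "income" is never the worst feature for any target, so it ends up with the lowest rank sum and is selected first.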

Caution

If the method fit() from the Explainer class has not been executed, it will return a warning message.

Modified Jensen-Shannon distance calculation:

The formula can be found here.
However, the formula actually used is a modified version that returns a four-element numpy array, in which the quantity under the square root differs per element:
- First element: the square root of the median.
- Second element: the square root of the mean.
- Third element: the square root of the max.
- Fourth element: the square root of the sum.
One such numpy array is returned per feature, and all of them are stacked into a matrix of shape (number_of_features, 4).
Each element of the matrix is normalized by dividing it by the sum of the elements of its column.
For each feature (each matrix row), its normalized statistics are added up, so the matrix becomes a vector containing one element per feature.
Finally, each element is normalized by dividing it by the sum of all the elements. These are the distances used to compute the rank: the larger the distance, the more discriminative the feature is considered and, thus, the more interesting from the predictive point of view. The feature with the largest distance is ranked first, while the feature with the smallest distance is ranked last.
A vector like the one described in the previous step is obtained for each target value, which means that a ranking is obtained for each target value.
To obtain the final ranking, the partial ranks per target value are added up for each feature, so that the higher the rank sum of a feature, the less relevant it is considered.
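The distance computation for a single target value might look like the sketch below. It rests on an assumption that is not confirmed by this page: the median, mean, max and sum are taken over the per-bin terms of the Jensen-Shannon divergence, so that the fourth element coincides with the standard Jensen-Shannon distance (natural logarithm). The function names modified_js and feature_distances and the histograms are hypothetical.

```python
import numpy as np


def rel_entr(a, b):
    """Element-wise a * log(a / b), with the convention 0 * log(0) = 0."""
    out = np.zeros_like(a)
    mask = a > 0
    out[mask] = a[mask] * np.log(a[mask] / b[mask])
    return out


def modified_js(p, q):
    """Return sqrt of the [median, mean, max, sum] of the per-bin JS terms."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    m = (p + q) / 2.0
    # Per-bin contributions to the divergence; clip tiny negative rounding errors.
    terms = np.maximum(0.5 * (rel_entr(p, m) + rel_entr(q, m)), 0.0)
    return np.sqrt([np.median(terms), terms.mean(), terms.max(), terms.sum()])


def feature_distances(hist_pairs):
    """Combine the four statistics of every feature into one normalized distance."""
    stats = np.vstack([modified_js(p, q) for p, q in hist_pairs])  # (n_features, 4)
    stats = stats / stats.sum(axis=0)    # normalize each column (sums must be non-zero)
    combined = stats.sum(axis=1)         # add up the four statistics per feature
    return combined / combined.sum()     # normalize so that the vector sums to 1


# Hypothetical histograms (target present, target absent) for three features.
hist_pairs = [
    ([8, 1, 1], [1, 1, 8]),  # very different distributions
    ([4, 3, 3], [3, 3, 4]),  # almost identical distributions
    ([5, 3, 2], [2, 3, 5]),  # moderately different distributions
]
dist = feature_distances(hist_pairs)
print(np.argsort(-dist))  # feature indices from most to least discriminative
```

Repeating this for every target value yields one distance vector, and therefore one ranking, per target. Note that the column normalization would divide by zero if a statistic were zero for every feature, a case this sketch does not handle.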

Returns: top_features – Structure containing all the features ranked by the FeatureSelector.
Return type: pandas.DataFrame

property top_features_by_target

Property that returns all the features ranked by the FeatureSelector.