S is a subset of the coalition that does not include feature i,
|S| is the size of this subset and
|F| is the size of the full coalition.
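For reference, these symbols come from the classic Shapley value formula, where f denotes the model's prediction on a given coalition and φᵢ is the contribution of feature i:

$$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F| - |S| - 1)!}{|F|!} \left[ f(S \cup \{i\}) - f(S) \right]$$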
The general idea is to obtain different combinations of how features can appear together in subsets of the coalition and find out what effect these combinations have on the predictions made by the model. Those combinations are often called permutations of the data sample.
But what happens to the features that we remove? We cannot expect our model to just handle missing values at inference time.
What we do instead is define a background dataset. How we define this background dataset can vary and depends on the problem. One approach is to use the whole training dataset as the background dataset if it is small. Another is to use an algorithm such as k-means to summarize the dataset in a meaningful way, so that the background dataset stays representative of the data the model was trained on.
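As a minimal sketch of the k-means approach (X_train is a placeholder training matrix, not from the original text):

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder training data: 1000 samples, 4 features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 4))

# Summarize the training data into 10 representative points:
# the cluster centers become our background dataset.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X_train)
background = kmeans.cluster_centers_
print(background.shape)  # (10, 4)
```

(The shap library also ships a shap.kmeans helper for exactly this purpose.)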
Once we have our background dataset, every time a feature is set to missing, we replace it with the values (or the single value) it takes in the background dataset.
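Concretely, "setting a feature to missing" can be simulated by substituting background values and averaging the model's output over them. A minimal sketch, assuming a predict function and the background array from above (both placeholders):

```python
import numpy as np

def masked_prediction(predict, x, subset, background):
    """Average prediction over the background for features outside `subset`.

    predict   : callable mapping an (m, n_features) array to (m,) outputs
    x         : the sample being explained, shape (n_features,)
    subset    : indices of the features that are "present"
    background: (n_background, n_features) array of replacement values
    """
    # Start from copies of the background rows...
    samples = background.copy()
    # ...and overwrite the "present" features with the sample's own values.
    samples[:, subset] = x[subset]
    # Features outside `subset` keep their background values, which is how
    # we emulate "missing" inputs for a model that cannot accept NaNs.
    return predict(samples).mean()
```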
It is probably obvious by now that to compute the Shapley value of a feature, one has to sum over all possible subsets of the remaining features, evaluating the model with and without that feature for each of them. That is 2^(n-1) different pairs, where n is the number of features in our dataset. Now imagine having to run the model in inference mode for each one of those pairs of subsets, a number that grows exponentially as we saw.
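To make that cost concrete, here is a brute-force version of the computation, reusing the masked_prediction helper from above. It is only a sketch and only feasible for a handful of features, since it loops over all 2^(n-1) subsets:

```python
from itertools import combinations
from math import factorial

def exact_shapley(predict, x, i, background):
    """Exact Shapley value of feature i by enumerating every subset."""
    n = x.shape[0]
    others = [j for j in range(n) if j != i]
    phi = 0.0
    for size in range(n):  # subset sizes 0 .. n-1
        # The combinatorial weight |S|!(n-|S|-1)!/n! from the formula.
        weight = factorial(size) * factorial(n - size - 1) / factorial(n)
        for subset in combinations(others, size):
            with_i = masked_prediction(predict, x, list(subset) + [i], background)
            without_i = masked_prediction(predict, x, list(subset), background)
            phi += weight * (with_i - without_i)
    return phi
```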
Is there a better way of doing this? The solution lies in approximation. Instead of taking all subsets of the coalition into account, we sample them. But two new problems arise. The first is that to get a good estimate we need a large number of samples. The second is that the approximation method we would use, most likely Monte Carlo, is naive in this case because it samples combinations uniformly at random, in a non-principled way. That is probably not the best use of our computational power. We still need a better way.
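A naive Monte Carlo version of the same computation, to make the trade-off concrete; this sketch uses random permutations (one common sampling scheme) and reuses the helpers above:

```python
import numpy as np

def monte_carlo_shapley(predict, x, i, background, n_iter=200, seed=0):
    """Approximate the Shapley value of feature i with random permutations."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    total = 0.0
    for _ in range(n_iter):
        perm = rng.permutation(n)
        pos = int(np.where(perm == i)[0][0])
        before = list(perm[:pos])  # features "arriving" before i
        with_i = masked_prediction(predict, x, before + [i], background)
        without_i = masked_prediction(predict, x, before, background)
        total += with_i - without_i
    # Each permutation samples one marginal contribution uniformly.
    return total / n_iter
```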
This is where SHAP (SHapley Additive exPlanations) comes into play. The algorithm introduced in this paper, alongside having other desirable properties, suggests improved sampling methods. For Kernel SHAP in particular, which is a model-agnostic algorithm, the sampling method the authors suggest does not have a uniform prior like plain Monte Carlo. Rather, it prefers the subsets that are most informative about individual features. Think of it like this: a subset in which many features are missing, say one where only a single feature is left, tells a lot about how that feature affects the prediction, and so does a subset where only a single feature is missing. Intuitively it makes a lot of sense: the pair (🔼, ⏹, -, *️⃣) - (🔼, ⏹, ⏺, *️⃣) will lead to more information gain than the pair (-, -, -, *️⃣) - (-, -, ⏺, *️⃣) when studying the effect of feature ⏺, for example. Based on this concept, Kernel SHAP assigns larger weights both to subsets with small coalitions and to subsets with large coalitions. This can be thought of as a similarity function over subsets that determines which ones get larger weights. The word kernel generally refers to such a weighting function, hence the name of the algorithm.
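The weighting function from the SHAP paper (the "Shapley kernel") makes this explicit. For a coalition z' with |z'| features present out of M in total, the weight is:

$$\pi_{x}(z') = \frac{M - 1}{\binom{M}{|z'|}\,|z'|\,(M - |z'|)}$$

which is largest when |z'| is close to 0 or M, so very small and very large coalitions dominate the sampling.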
If the kernel tells you how to weigh every single subset of the coalition, how does Kernel SHAP actually work? It is a beautiful combination of another algorithm called LIME and the Shapley values we introduced before. The LIME algorithm generates perturbed samples around the sample you want to explain, giving more weight to the ones that are closer to it. Then, it fits an interpretable model locally on these perturbed samples.
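Stripped to its core, LIME for tabular data can be sketched like this (a simplified illustration of the idea, not the lime library's actual implementation; all names are placeholders):

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_sketch(predict, x, n_samples=500, width=0.75, seed=0):
    """Fit an interpretable linear surrogate around a single sample x."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    # Perturb the sample with Gaussian noise to explore its neighborhood.
    Z = x + rng.normal(scale=0.5, size=(n_samples, n))
    # Weight neighbors by proximity: closer samples count more.
    dist = np.linalg.norm(Z - x, axis=1)
    weights = np.exp(-(dist ** 2) / width ** 2)
    # Fit a weighted linear model; its coefficients are the local explanation.
    surrogate = Ridge(alpha=1.0).fit(Z, predict(Z), sample_weight=weights)
    return surrogate.coef_
```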
In Kernel SHAP, we pick a sample, sample subsets of the coalition using the method explained above, and run the model in inference mode on those subsets. Just like LIME, we then fit a local linear model on them that aims to approximate our complex model. If the sample weights of this weighted regression are picked in the right way, using the kernel above, the coefficients of the linear model are exactly the Shapley values as computed by the Kernel SHAP algorithm.
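In practice you rarely implement this yourself: the shap library's KernelExplainer wraps the whole procedure. A minimal usage sketch with a placeholder model and data (exact return types can vary slightly between shap versions):

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

# Placeholder model and data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X[:, 0] + 2 * X[:, 2] + rng.normal(scale=0.1, size=500)
model = RandomForestRegressor(random_state=0).fit(X, y)

# Summarize the training data into a background dataset (see above).
background = shap.kmeans(X, 10)

# Kernel SHAP is model-agnostic: it only needs a prediction function.
explainer = shap.KernelExplainer(model.predict, background)
shap_values = explainer.shap_values(X[:5])  # one value per feature, per sample
print(np.shape(shap_values))  # (5, 4)
```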