# Which Scikit-Learn Models Have Built-In Uncertainty Quantification?

Uncertainty quantification is valuable when modelling noisy data. We need to make decisions, and having an estimate of the worst probable outcome can make a huge difference to our planning. Much of machine learning has ignored probabilistic inference in favour of estimating the conditional expectation, but there is a growing number of exceptions. While there are dedicated libraries for predicting probability distributions, I wanted to see what is available in Scikit-Learn as "low-hanging fruit".

There are three methods a Scikit-Learn estimator can have that seem relevant: `sample`, `sample_y`, and `predict_proba`. The `sample` method draws values from an internal probability distribution. Here is an example using `sklearn.mixture.BayesianGaussianMixture`.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [12, 4], [10, 7]])
bgm = BayesianGaussianMixture(n_components=2, random_state=42).fit(X)

# Draw 10,000 points from the fitted mixture distribution
sampled_X, sampled_labels = bgm.sample(10_000)
```

The example above returns both an array of sampled data points and labels indicating which statistical subpopulation of the mixture distribution each sample came from.
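As a quick sanity check (continuing the snippet above), the shapes show what `sample` returned:

```python
print(sampled_X.shape)            # (10000, 2): the sampled data points
print(sampled_labels.shape)       # (10000,): component index per sample
print(np.unique(sampled_labels))  # likely [0 1] with n_components=2
```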

The `sample_y` method is currently only implemented by `GaussianProcessRegressor`. In principle it works the same way as `sample`: it draws samples from the model's probability distribution. I suspect the name is different because Scikit-Learn considers the class a regressor, whereas the classes with `sample` are not.
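As a minimal sketch of how `sample_y` can be used (the one-dimensional toy data here is my own, purely for illustration):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

# Hypothetical 1-D toy data
X = np.linspace(0, 5, 20).reshape(-1, 1)
y = np.sin(X).ravel()

gpr = GaussianProcessRegressor(random_state=42).fit(X, y)

# Draw 1,000 samples from the posterior at two new inputs;
# the result has shape (n_query_points, n_samples)
X_new = np.array([[1.5], [3.5]])
samples = gpr.sample_y(X_new, n_samples=1_000, random_state=0)
print(samples.mean(axis=1), samples.std(axis=1))
```

The per-point standard deviations give a direct read on the model's predictive uncertainty at each query input.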

The `predict_proba` method is held by many estimators. Where it is implemented, it returns an estimated probability for each class, in contrast to the `predict` method, which returns the class label with the greatest estimated probability. This lets us sample from a multinomial distribution to obtain samples with those probabilities. Here is a short example:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.RandomState(2018)
X = rng.randint(5, size=(6, 100))
Y = np.array([1, 2, 3, 4, 4, 5])

clf = BernoulliNB()
clf.fit(X, Y)

# Estimated class probabilities for the first row,
# then 10,000 multinomial draws using those probabilities
prob_first_entry = clf.predict_proba(X)[0]
first_row_sample = rng.multinomial(1, prob_first_entry, size=10_000)
```

Since the probabilities are themselves estimates, they carry uncertainty of their own. Yup, your estimated probabilities should have probability distributions (sometimes called metadistributions). This means that sampling from the point-estimate probabilities may actually underestimate the quantifiable uncertainty. Still, it is better than no uncertainty quantification at all.
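To make that concrete, here is a minimal sketch (my own illustration, on hypothetical random data) that refits the classifier on bootstrap resamples; the spread of the refitted `predict_proba` outputs gives a rough picture of that metadistribution:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.utils import resample

rng = np.random.RandomState(2018)
X = rng.randint(2, size=(200, 20))  # hypothetical binary features
y = rng.randint(2, size=200)        # hypothetical binary labels

# Record P(class 1) for one query point across bootstrap refits
probs = []
for seed in range(200):
    X_b, y_b = resample(X, y, random_state=seed)
    probs.append(BernoulliNB().fit(X_b, y_b).predict_proba(X[:1])[0, 1])

# A nonzero spread shows the estimated probability is itself uncertain
print(np.mean(probs), np.std(probs))
```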

I developed and ran a Python script that makes it easy to check which estimators have these methods (thanks to `sklearn.utils.all_estimators`).

| Model Class | `sample` | `sample_y` | `predict_proba` |
|---|---|---|---|
| AdaBoostClassifier | False | False | True |
| BaggingClassifier | False | False | True |
| BayesianGaussianMixture | True | False | True |
| BernoulliNB | False | False | True |
| CalibratedClassifierCV | False | False | True |
| CategoricalNB | False | False | True |
| ClassifierChain | False | False | True |
| ComplementNB | False | False | True |
| DecisionTreeClassifier | False | False | True |
| DummyClassifier | False | False | True |
| ExtraTreeClassifier | False | False | True |
| ExtraTreesClassifier | False | False | True |
| GaussianMixture | True | False | True |
| GaussianNB | False | False | True |
| GaussianProcessClassifier | False | False | True |
| GaussianProcessRegressor | False | True | False |
| GradientBoostingClassifier | False | False | True |
| GridSearchCV | False | False | True |
| HistGradientBoostingClassifier | False | False | True |
| KNeighborsClassifier | False | False | True |
| KernelDensity | True | False | False |
| LabelPropagation | False | False | True |
| LabelSpreading | False | False | True |
| LinearDiscriminantAnalysis | False | False | True |
| LogisticRegression | False | False | True |
| LogisticRegressionCV | False | False | True |
| MLPClassifier | False | False | True |
| MultiOutputClassifier | False | False | True |
| MultinomialNB | False | False | True |
| NuSVC | False | False | True |
| OneVsRestClassifier | False | False | True |
| Pipeline | False | False | True |
| QuadraticDiscriminantAnalysis | False | False | True |
| RFE | False | False | True |
| RFECV | False | False | True |
| RadiusNeighborsClassifier | False | False | True |
| RandomForestClassifier | False | False | True |
| RandomizedSearchCV | False | False | True |
| SGDClassifier | False | False | True |
| SVC | False | False | True |
| SelfTrainingClassifier | False | False | True |
| StackingClassifier | False | False | True |
| VotingClassifier | False | False | True |

A concern I have about this table is that not all of the listed classes have their own working implementations of the methods of interest. For example, Pipeline, MultiOutputClassifier, and similar meta-estimator classes have no built-in logic for computing probabilities or samples; instead, they delegate to an underlying estimator if it exists and has the relevant method. So the table is a little misleading, but a good starting point.
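For instance, in recent versions of Scikit-Learn a `Pipeline` only exposes `predict_proba` when its final step implements it (a small sketch of my own):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVR

# Final step implements predict_proba, so the pipeline exposes it
clf_pipe = make_pipeline(StandardScaler(), LogisticRegression())
print(hasattr(clf_pipe, "predict_proba"))  # True

# Final step is a regressor with no predict_proba, so the pipeline hides it
reg_pipe = make_pipeline(StandardScaler(), SVR())
print(hasattr(reg_pipe, "predict_proba"))  # False
```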

The above table was generated with this Python script:

```python
import inspect

import pandas as pd
from sklearn.utils import all_estimators


def get_models_with_methods():
    models_with_methods = []

    # Get all estimators from sklearn
    estimators = all_estimators()

    for name, EstimatorClass in estimators:
        # Check if EstimatorClass is a class
        if inspect.isclass(EstimatorClass):
            has_sample = 'sample' in dir(EstimatorClass)
            has_sample_y = 'sample_y' in dir(EstimatorClass)
            has_predict_proba = 'predict_proba' in dir(EstimatorClass)
            if has_sample or has_sample_y or has_predict_proba:
                models_with_methods.append((name, has_sample, has_sample_y, has_predict_proba))

    return models_with_methods


if __name__ == "__main__":
    models_with_methods = get_models_with_methods()
    df = pd.DataFrame(models_with_methods, columns=['Model Class', '`sample`', '`sample_y`', '`predict_proba`'])
    # Note: DataFrame.to_markdown requires the tabulate package
    print(df.to_markdown(index=False))
```

This post is licensed under CC BY 4.0 by the author.
