arviz_stats.compare

arviz_stats.compare(compare_dict, method='stacking', var_name=None, reference=None, round_to='auto')

Compare models based on their expected log pointwise predictive density (ELPD).

The ELPD is estimated by Pareto smoothed importance sampling leave-one-out cross-validation, the same method used by arviz_stats.loo. The method is described in [2] and [3]. By default, the weights are estimated using "stacking" as described in [4].

If more than 11 models are compared, a diagnostic check for selection-induced bias is performed [1]. If bias is detected, avoid LOO-based model selection and instead use model averaging or projection predictive inference.

See the EABM chapters on Model Comparison, Model Comparison (Case Study), and Model Comparison for Large Data for more details.

Parameters:
compare_dict: dict of {str: DataTree or ELPDData}

A dictionary mapping model names to xr.DataTree or ELPDData objects.

method: str, optional

Method used to estimate the weights for each model. Available options are:

  • ‘stacking’ : stacking of predictive distributions.

  • ‘BB-pseudo-BMA’ : pseudo-Bayesian Model averaging using Akaike-type weighting. The weights are stabilized using the Bayesian bootstrap.

  • ‘pseudo-BMA’: pseudo-Bayesian Model averaging using Akaike-type weighting, without Bootstrap stabilization (not recommended).

For more information read https://arxiv.org/abs/1704.02030
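As a rough illustration of the Akaike-type weighting that pseudo-BMA builds on (a sketch with hypothetical ELPD values, not the library's internals; stacking and the Bayesian bootstrap refine this considerably):

```python
import numpy as np

# Hypothetical ELPD estimates for three models (illustrative only).
elpd = np.array([-31.0, -33.5, -40.2])

# Akaike-type weights are proportional to exp(elpd_k). Subtracting the
# maximum before exponentiating keeps the computation numerically stable.
rel = np.exp(elpd - elpd.max())
weights = rel / rel.sum()

print(weights.round(3))  # the best model takes nearly all of the weight
```

Note how quickly the weights concentrate on the top model: a difference of a few ELPD units already pushes the weight of the other models toward zero, which is one reason the bootstrap-stabilized and stacking variants are preferred.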

var_name: str, optional

The name of the observed variable to use as the basis for comparison when the InferenceData contains more than one observed variable.

reference: str, optional

Name of the reference model used for computing elpd_diff. If None (default), the best-performing model (highest ELPD) is used as the reference. When specified, all elpd_diff values are computed relative to this model, which will have elpd_diff = 0. This is useful for comparing against a baseline model, null model, or a specific model of interest rather than the top-ranked model.
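A minimal sketch of the elpd_diff semantics described above, using hypothetical ELPD values and an illustrative sign convention (model minus reference); the library computes these differences from pointwise values, not from the totals as done here:

```python
# Hypothetical ELPD estimates for three models (illustrative only).
elpds = {"full": -31.0, "partial": -33.5, "null": -40.2}

def elpd_diff(elpds, reference=None):
    # Default reference: the model with the highest ELPD (the "best" model).
    ref = reference if reference is not None else max(elpds, key=elpds.get)
    # Each model's difference relative to the reference; the reference is 0.
    return {name: round(e - elpds[ref], 1) for name, e in elpds.items()}

print(elpd_diff(elpds))                    # best model ("full") has diff 0
print(elpd_diff(elpds, reference="null"))  # "null" has diff 0 instead
```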

round_to: int or {"auto", "none"}, optional

Rounding specification, defaulting to "auto". If an integer, the number of decimal places to round to. Use the string "none" (or "None") to return unrounded numbers. If None, the value of rcParams["stats.round_to"] is used. If "auto", custom rounding rules are applied to the columns of the returned DataFrame:

  • elpd and elpd_diff are rounded based on se and dse respectively, using the same rule as summary stat/se pairs.

  • se and dse are rounded based on rcParams["stats.round_to"].

  • p is rounded to 1 decimal place.

  • weight uses precision based on the largest weight, showing approximately 2 significant digits for that maximum value.
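One plausible reading of the stat/se rule, as a hedged sketch (round_stat_by_se is a hypothetical helper, not the library's exact algorithm): keep enough decimal places for the standard error to retain roughly two significant digits, then round the estimate to the same place:

```python
import math

def round_stat_by_se(stat, se):
    # Illustrative assumption: choose the number of decimals so that the
    # standard error keeps about two significant digits, then round both
    # the estimate and its standard error to that many decimals.
    if se <= 0:
        return stat, se
    decimals = max(0, 1 - int(math.floor(math.log10(se))))
    return round(stat, decimals), round(se, decimals)

print(round_stat_by_se(-31.0414, 1.43))   # (-31.0, 1.4)
print(round_stat_by_se(-31.0414, 0.055))  # (-31.041, 0.055)
```

The effect is that a precise standard error exposes more decimals of the estimate, while a large one truncates them, so the displayed precision always matches the uncertainty.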

Returns:
DataFrame

A DataFrame, ordered from best to worst model (as measured by the ELPD). The index contains the names (keys) under which the models were passed to this function. The columns are:

  • rank: The rank-order of the models. 0 is the best.

  • elpd: ELPD estimated using PSIS-LOO-CV (elpd_loo). Higher ELPD indicates higher out-of-sample predictive fit (“better” model).

  • p: Estimated effective number of parameters.

  • elpd_diff: The difference in ELPD between two models. If more than two models are compared, the difference is computed relative to the top-ranked model (or to the model given by reference), which always has an elpd_diff of 0.

  • weight: Relative weight for each model. This can be loosely interpreted as the probability of each model (among the compared models) given the data. By default the uncertainty in the weights estimation is considered using Bayesian bootstrap.

  • se: Standard error of the ELPD estimate. If method="BB-pseudo-BMA", these values are estimated using the Bayesian bootstrap.

  • dse: Standard error of the difference in ELPD between each model and the top-ranked model. It is always 0 for the top-ranked model.

  • subsampling_dse: (Only when subsampling is used) The subsampling component of the standard error of the ELPD difference. This quantifies the uncertainty due to using a subsample rather than all observations.

  • warning: If True, the computation of the ELPD may not be reliable; this can be an indication that LOO is starting to fail. See https://arxiv.org/abs/1507.04544 for details.
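A hand-built stand-in for a compare() result (all values hypothetical, mirroring the columns above) showing how the returned DataFrame is typically inspected:

```python
import pandas as pd

# Stand-in for a compare() result; values are made up for illustration.
cmp_df = pd.DataFrame(
    {
        "rank": [0, 1],
        "elpd": [-30.8, -31.0],
        "p": [0.9, 0.9],
        "elpd_diff": [0.0, -0.2],
        "weight": [1.0, 0.0],
        "se": [1.4, 1.3],
        "dse": [0.0, 0.055],
        "warning": [False, False],
    },
    index=["non centered", "centered"],
)

best = cmp_df["elpd"].idxmax()                      # name of the top-ranked model
flagged = cmp_df.index[cmp_df["warning"]].tolist()  # models with reliability warnings

print(best)     # non centered
print(flagged)  # []
```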

See also

loo

Compute the ELPD using the Pareto smoothed importance sampling Leave-one-out cross-validation method.

arviz_plots.plot_compare

Summary plot for model comparison.

References

[1]

McLatchie, Y., Vehtari, A. Efficient estimation and correction of selection-induced bias with order statistics. Statistics and Computing, 34, 132 (2024). https://doi.org/10.1007/s11222-024-10442-4 arXiv preprint https://arxiv.org/abs/2309.03742

[2]

Vehtari et al. Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing. 27(5) (2017) https://doi.org/10.1007/s11222-016-9696-4 arXiv preprint https://arxiv.org/abs/1507.04544.

[3]

Vehtari et al. Pareto Smoothed Importance Sampling. Journal of Machine Learning Research, 25(72) (2024) https://jmlr.org/papers/v25/19-556.html arXiv preprint https://arxiv.org/abs/1507.02646

[4]

Yao et al. Using stacking to average Bayesian predictive distributions. Bayesian Analysis, 13, 3 (2018). https://doi.org/10.1214/17-BA1091 arXiv preprint https://arxiv.org/abs/1704.02030.

Examples

Compare the centered and non-centered models of the eight schools problem:

In [1]: from arviz_stats import compare
   ...: from arviz_base import load_arviz_data
   ...: data1 = load_arviz_data("non_centered_eight")
   ...: data2 = load_arviz_data("centered_eight")
   ...: compare_dict = {"non centered": data1, "centered": data2}
   ...: compare(compare_dict)
Out[1]: 
              rank  elpd    p  elpd_diff  weight   se    dse  warning
non centered     0 -31.0  0.9        0.0     1.0  1.4  0.000    False
centered         1 -31.0  0.9        0.0     0.0  1.3  0.055    False

Compare models using subsampled LOO:

In [2]: from arviz_stats import loo_subsample
   ...: from arviz_base import load_arviz_data
   ...: data1 = load_arviz_data("non_centered_eight")
   ...: data2 = load_arviz_data("centered_eight")
   ...: loo_sub1 = loo_subsample(data1, observations=6, pointwise=True, seed=42)
   ...: loo_sub2 = loo_subsample(data2, observations=6, pointwise=True, seed=42)
   ...: compare({"non_centered": loo_sub1, "centered": loo_sub2})
Out[2]: 
              rank  elpd    p  elpd_diff  ...   se    dse  warning  subsampling_dse
non_centered     0 -31.0  1.0        0.0  ...  1.5  0.000    False          0.00000
centered         1 -31.0  1.1        0.0  ...  1.4  0.052    False          0.01881

[2 rows x 9 columns]

When using subsampled LOO, the subsampling_dse column quantifies the additional uncertainty from using subsamples instead of all observations. The elpd_diff values are computed using a difference-of-estimators approach on overlapping observations, which can differ from simple subtraction of ELPD values. Using the same seed across models ensures overlapping observations for more accurate paired comparisons with smaller standard errors.
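The smaller standard errors from overlapping observations follow from var(a - b) = var(a) + var(b) - 2 cov(a, b): positively correlated pointwise values cancel in the difference. A synthetic sketch (made-up pointwise ELPD values, not library internals):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
# Shared per-observation signal: the same observations are hard or easy
# for both models, which makes their pointwise ELPDs strongly correlated.
common = rng.normal(size=n)
elpd_a = -3.0 + common + 0.3 * rng.normal(size=n)
elpd_b = -3.2 + common + 0.3 * rng.normal(size=n)

# Paired SE: standard error of the pointwise differences.
paired_se = np.std(elpd_a - elpd_b, ddof=1) / np.sqrt(n)
# Unpaired SE: what simple subtraction of independent estimates would give.
unpaired_se = np.sqrt(np.var(elpd_a, ddof=1) / n + np.var(elpd_b, ddof=1) / n)

print(paired_se < unpaired_se)  # True: the paired comparison has a smaller SE
```

This is the same reason the documentation recommends using the same seed across models: it maximizes the overlap of subsampled observations, and hence the covariance term that shrinks dse.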