hckrnws
What frustrates me about Bayesian NNs is that talking about "priors" doesn't make nearly as much sense as it does in a regression context. A prior over parameter weights has no interpretation in the way that a prior over a regression coefficient, or even a spline smoothness, does. What you really want -- and what natural intelligence probably has -- are priors over aspects of the world.
Francois Chollet's paper on measuring intelligence was really informative for me on this front; the "priors" you should have about the world are not half-cauchys over certain hyperparameters or whatever, but priors about agent-ness, object-ness, goal-oriented-ness, and so on. How to encode that in a network...well, that's the real trick, right?
I agree that priors over aspects of the world would be more useful, but I don't think that they're important in making natural intelligence powerful. In my experience, the important thing is to make your prior really broad, but containing all kinds of different hypotheses with different kinds of rich structure.
I claim that knowing a priori about things like agents and objects just doesn't save you all that much data, as long as you have the imagination to consider all structures at least that complex.
Bayesian Neural Networks just seem like a failed approach, unfortunately. For one, Bayesian inference and UQ fundamentally depends on the choice of the prior, but this is rarely discussed in the Bayesian NN literature and practice, and is further compounded by how fundamentally hard to interpret and choose these priors are (what is the intuition behind a NN's parameters?). Add to that the fact that the Bayesian inference is very much approximate, and you should see the trouble.
If you want UQ, 'frequentist nonparametric' approaches like Conformal Prediction and Calibration/Multi-Calibration methods seem to work quite well (especilly when combined with the standard ML machinery of taking a log-likelihood as your loss), and do not suffer from any of the issues above while also giving you formal guarantees of correctness. They are a strict improvement over Bayesian NNs, IMO.
The Conformal Prediction advocates (especially a certain prominent Twitter account) tend to rehash old frequentist-vs-bayesian arguments with more heated rhetoric than strictly necessary. That fight has been going on for almost a century now. Bayesian counterargument (in caricature form) would be that MLE frequentists just choose an arbitrary (flat) prior, and penalty hyperparameters (common in NN) are a de facto prior. The formal guarantees only have bite in the asymptotic setting or require convoluted statements about probabilities over repeated experiments; and asymptotically, the choice of prior doesn't matter anyway.
(I'm a moderate that uses both approaches, seeing them as part of a general hierarchical modeling method, which means I get mocked by either side for lack of purity).
Bayesians are losing ground at the moment because their computational methods haven't been advanced as fast by the GPU revolution for reasons having to do with difficulty in parallelization, but there's serious practical work (especially using JAX) to catch up, and the whole normalizing flow literature might just get us past the limitations of MCMC for hard problems.
But having said that, Conformal Prediction works as advertised for UQ as a wrapper on any point estimating model. If you've got the data for it - and in the ML setting you do - and you don't care about things like missing data imputation, error in inputs, non-iid spatio-temporal and hierarchical structures, mixtures of models, evidence decay, unbalanced data where small-data islands coexist big data - all the complicated situations where Bayesian methods just automatically work and other methods require elaborate workarounds, yup, use Conformal Prediction.
Calibration is also a pretty magical way to improve just about any estimator. It's cheap to do and it works (although hard to guarantee anything with that in the general case...)
And don't forget quantile regression penalties! Awkward to apply in the NN setting, but an easy and effective way to do UQ in XGBoost world.
"Bayesian counterargument (in caricature form) would be that MLE frequentists just choose an arbitrary (flat) prior, and penalty hyperparameters (common in NN) are a de facto prior."
This has been my view for a while now. Is this not correct?
In general, I think the idea of a big "frequentist vs Bayesian" debate is silly. I think it is very useful to take frequentist ideas and see what they look like from a Bayesian point of view, and vice versa (when applicable). I think this is pretty much the general stance among most people in the field - it's generally expected that one will understand that regularization methods equate to certain priors, for instance, and in general be able to relate these two perspectives as much as possible.
I would argue against the idea that "MLE is just Bayes with a flat prior". The power of Bayes usually comes mainly from keeping around all the hypothesis that are compatible with the data, not from the prior. This is especially true in domains where something black-box (essentially prior-less) like a neural net has any chance of working.
Yeah, I know the account you are talking about, it really is a bit over the top. It's a shame, I've met a bunch of people who mentioned that they were actually turned away from Conformal Prediction due to them.
> But having said that, Conformal Prediction works as advertised for UQ as a wrapper on any point estimating model. If you've got the data for it - and in the ML setting you do - and you don't care about things like missing data imputation, error in inputs, non-iid spatio-temporal and hierarchical structures, mixtures of models, evidence decay, unbalanced data where small-data islands coexist big data - all the complicated situations where Bayesian methods just automatically work and other methods require elaborate workarounds, yup, use Conformal Prediction.
Many of these things can actually work really well with Conformal Prediction, but the algorithms require extensions (much like if you are doing Bayesian inference, you also need to update your model accordingly!). They generally end up being some form of reweighting to compensate for the distribution shifts (excluding the Online Conformal Prediction literature, which is another beast entirely). Also, worth noting that if you have iid data then Conformal Prediction is remarkably data-efficient; as little as 20 samples are enough for it to start working for 95% predictive intervals, and with 50 samples (and with almost surely unique conformity scores) it's going to match 95% coverage fairly tightly.
Are we talking about NN Taleb? I am curious about the twitter persona.
Someone by the name of V. Minakhin. They have an irrational hatred of Bayesian statistics. He blocked me on twitter for pointing out his claim about significant companies do not use Bayesian methods is contradicted by the fact that I work for one of those companies and use Bayesian methods.
> For one, Bayesian inference and UQ fundamentally depends on the choice of the prior, but this is rarely discussed in the Bayesian NN literature and practice, and is further compounded by how fundamentally hard to interpret and choose these priors are (what is the intuition behind a NN's parameters?).
I agree that, computationally, it is hard to justify the use of Bayesian methods on large-scale neural networks when stochastic gradient descent (and friends) is so damn efficient and effective.
On the other hand, the fact that there's a dependence on (subjective) priors is hardly a fair critique: non-Bayesian training of neural networks also depends on the use of (subjective) loss functions with (subjective) regularization terms (in fact, it can be shown that, mathematically, the use of priors is precisely equivalent to adding regularization to a loss function). Non-Bayesian training of neural networks is not "a failed approach" just because someone can arbitrarily choose L1 regularization (i.e., a Laplacian prior) over L2 regularization (i.e., a Gaussian prior).
Furthermore, we do have some intuition over NN parameters (particularly when inputs and outputs are properly scaled): a value of 10^15 should be less likely than a value of 0. Note that, in Bayesian practice, people often use weakly-informative priors (see, e.g., http://www.stat.columbia.edu/~gelman/presentations/weakprior...) to encode such intuitive statements while ensuring that (for all practical purposes) the data will effectively overwhelm the prior (again, this is equivalent to adding a minimal amount of regularization to a loss function, to make a problem well-posed when e.g. you have more parameters than data points).
Non-Bayesian NN training does indeed use regularizers that are chosen subjectively —- but they are then tested in validation, and the best-performing regularizer is chosen. Thus the choice is empirical, not subjective.
A Bayesian could try the same thing: try out several priors, and pick the one that performs best in validation. But if you pick your prior based on the data, then the classic theory about “principled quantification of uncertainty” doesn’t apply any more. So you’re left using a computationally unwieldy procedure that doesn’t offer theoretical guarantees.
You can, in fact, do that. It's called (aptly enough) the empirical Bayes method. [1]
Empirical Bayes is exactly what I was getting at. It's a pragmatic modelling choice, but it loses the theoretical guarantees about uncertainty quantification that pure Bayesianism gives us.
(Though if you have a reference for why empirical Bayes does give theoretical guarantees, I'll be happy to change my mind!)
I agree that Bayesian neural networks haven't been worth it in practice for many applications, but I think the main problem is that it's usually better to spend your compute training a single set of weights for a larger model, rather than doing approximate inference over weights in a smaller model. The exception is probably scientific applications where you mostly know the model, but then you don't really need a neural net anymore.
Choosing a prior is hard, but I'd say it's analogously hard to choosing an architecture - if all else fails, you can do a brute force search, and you even have the marginal likelihood to guide you. I don't think it's the main reason why people don't use BNNs much.
I disagree with one conceptual point; if you are truly Bayesian you don’t “choose” a prior, by definition you “already have” a prior that you are updating with data to get to a posterior.
Sure, instead of saying "choose" a prior, you could say "elicit". But I think in this context, focusing on a practitioner's prior knowledge is missing the point. For the sorts of problems we use NNs for, we don't usually think that the guy designing the net has important knowledge that would help making good predictions. Choosing a prior is just an engineering challenge, where one has to avoid accidentally precluding plausible hypotheses.
100% correct, but there are ways to push Bayesian inference back a step to justify this sort of thing.
It of course makes the problem even more complex and likely requires further approximations to computing the posterior (or even the MAP solution).
This stretches the notion that you are still doing Bayesian reasoning but can still lead to useful insights.
Probably should just call it something else then; though, I gather that the simplicity of Bayes theorom belies the complexity of what it hides.
At some level, you have to choose something. You can't know every level in your hierarchy.
Priors on parameters are not an issue. On models of scale, priors are just some computationally convenient shrinkage, and what works is found empirically and canonized into the practice; projecting prior knowledge of the problem at hand by parameter priors does not really happen except in some vague sense ("I think most predictors are irrelevant, so make it sparse by Cauchy/horseshoe/whatever").
The important thing in bayesian (statistical, ML) modelling in general is the ability to gain in flexibility and do model structures that otherwise would be hard or impossible: latent states, hierarchies, etc.
In bayesian NNs the main advantages would be around uncertainty quantification (UQ) and in finding good optima and partly to avoid overfitting. These do apply in some cases of simple NNs.
Mostly however, especially with larger conventional models (not speaking of normalizing flows and such here), using explicit bayes is not feasible. Instead, people use approximate point estimates with tricks:
(1) UQ has been taken care of by post-calibration. (2) Stochastic gradient actually searches for large posterior masses like a variational approximation would do, so it is kind of bayes. (3) And those priors: using dropout is commonplace, it has a bayesian interpretation, and L2 regularization aka gaussian priors are frequent too.
So bayes is there in practice, just not in a neat, pure form but as a collection of practical hacks.
Conformal learning is relatively new to me. Tell me if I'm getting any of this wrong: Conformal learning is a frequentist approach that uses a calibration set to determine how unusual a prediction is.
It seems like the main time they aren't a strict improvement over bayesian methods is when it is difficult to define your calibration set? I know this scenario isn't so commonplace, but I'm working in a scenario where I quickly looked at conformal learning and wasn't sure if it is applicable.
That's a particular form of Conformal Prediction, called Split Conformal Prediction. Incidentally, it's also one of the best ones (i.e., most extensible, strongest guarantees, easiest to implement, remarkably sample-efficient).
Making a calibration set is pretty easy, it's just a data split (just like the train/test split). The hardest part (which is still fairly easy) is creating a 'conformity score', which is a function that receives the input and a candidate output and scores how well this candidate output 'conforms' to the input. This is where an underlying ML model can come in handy: it can, itself, estimate this! Split Conformal Prediction then does a fairly simple quantile calculation on these scores (or some variant thereof) to then form the set prediction.
In a sense, you could use Bayesian NNs to produce a conformity score. But that doesn't seem to be much better than just using e.g. the model's logits for your conformity score. Theory-wise, Conformal Prediction methods have a number of favorable guarantees that Bayesian models (and especially Bayesian NNs) generally don't, and in practice we've seen that conditional on the model giving calibrated outputs (which is guaranteed for Conformal Prediction, but not for Bayesian NNs), Conformal Prediction predicted sets seem to be tighter than the Bayesian NN ones.
I’m not an expert in BNNs but the prior does not need to be justified in terms of each parameter. Bayesian analysis frequently uses hyperparameters to set the overall tightness or looseness of the parameters (a la Minnesota priors in the econometric literature for example). This would be a similar regularisation intuition as, eg, L1 and L2 regularisation in traditional NN training. This is of course just one example.
What is 'UQ', I assume some measure of uncertainty over your model outputs?
Usually means uncertainty quantification
Unbiased quantifier
Author here! What a surprise. This was an abandoned project from 2019, that we never linked or advertised anywhere as far as I know. Anyways, happy to answer questions.
What did you use to produce the article? I really really like the formatting.
I think we used a distill.pub template. Also Jerry wrote some custom BNN fitting code in javascript. I'll ask my co-authors to open-source it.
Update: the code is here:
Somewhat related — I’d love to hear your thoughts on dex-Lang and Haskell for array programming?
I still am excited by Dex (https://github.com/google-research/dex-lang/) and still write code in it! I have a bunch of demos and fixes written, and am just waiting for Dougal to finish his latest re-write before I can merge them.
just a little typo, but it's Kullback-Leibler.
Thanks for pointing that out!
why (if) was this not picked for further research? i know that oatml did quite amount of work on this front as well and it seems the direction is still being worked on. want to get ur 2 cent on this approach.
BNNs certainly have their uses, but I think people in general found that it's a better use of compute to fit a larger model on more data than to try to squeeze more juice from a given small dataset + model. Usually there is more data available, it's just somewhat tangentially related. LLMs are the ultimate example of how training on tons of tangentially-related data can ultimately be worthwhile for almost any task.
I like Bayesian inference for few-parameter models where I have solid grounds for choosing my priors. For neural networks, I like to ask people "what's your prior for ReLU versus LeakyReLU versus sigmoid?" and I've never gotten a convincing answer.
I choose LeakyReLU vs ReLU depending on if it's an odd day of the week, LeakyReLU being the slightly favored odd-days because it's aesthetically nicer that gradients propagate through negative inputs, though I can't discern a difference. I choose sigmoid if I want to waste compute to remind myself that it converges slowly due to vanishing gradients at extreme activation levels. So its empiricism retroactively justified by some mathematical common sense that let's me feel good about the choices. Kind of like aerodynamics.
I agree choosing priors is hard, but choosing ReLU versus LeakyReLU versus sigmoid seems like a problem with using neural nets in general, not Bayesian neural nets in particular. Am I misunderstanding?
Kolmogorov Arnold nets might have an answer for you!
Ah, Kolmogorov Arnold Networks. Perhaps the only model I have ever tried that managed to fairly often get AUCs below 0.5 in my tabular ML benchmarks. It even managed to get a frankly disturbing 0.33, where pretty much any other method (including linear regression, IIRC) would get >=0.99!
Why do you think they perform so poorly?
Theory-wise, I'm not convinced that the models have good approximation properties (the Kolmogorov-Arnold / Kolmogorov Superposition Theorem they base themselves on has quite a bit of nuance), and the optimization problem might be a bit tricky. I'm also can't see how to incorporate inductive biases other than the standard R^n / tabular regression one, and the existing attempts on this that I'm aware of are just band-aids (along the lines of feature engineering).
In practice, I've personally ran some benchmarks on a collection of datasets I had laying around. The results were generally abysmal, with the method only matching simple baselines in some few datasets.
Finally, the original paper is very weird, and reads more as a marketing piece. The theory, which is touted throughout the paper, is very weak, the actual algorithm is not sufficiently well explained there and the experiments are lacking. In particular, I find it telling that they do not include and even go out of their way to ignore important baselines such as boosted trees, which are the state-of-the-art solution to the problem that they intended to solve (and even work very well in occasions where they claim that both KANs and MLPs perform badly, e.g. in high dimensions).
Could you say a bit more about how so?
KANs have learnable activations based on splines parameterized on few variables. You can specify a prior over those variables, effectively establishing a prior over your activation function.
I'm sure there is a way of interpreting a relu as a sparsity prior on the layer.
https://publications.aston.ac.uk/id/eprint/373/1/NCRG_94_004...
mixture density networks are quite interesting if you want probabilistic estimates of neural. here, your model learns to output and array of gaussian distribution coefficient distributions, and mixture weights.
these weights are specific to individual observations, and trained to maximise likelihood.
This approach characterizes a different type of uncertainty than BNNs do, and the approaches can be combined. The BNN tracks uncertainty about parameters in the NN, and mixture density nets track the noise distribution _conditional on knowing the parameters_.
BNNs were an attractive choice in scenarios where the data is expensive to collect, like actual physical experiments. But boosting and other tree-based regression methods give you similar performance with a more straightforward framework for limited tabular data.
Comment was deleted :(
I like Bayes, but I thought the "surprising" result is that double descent is supposed to prevent nns from overfitting?
Good point. We wrote this pre-double descent, and a massively overparameterized model would make a nice addition to the tutorial as a baseline. However, if you want a rich predictive distribution, it might still make sense to use a Bayesian NN.
Comment was deleted :(
[flagged]
Crafted by Rajat
Source Code