It’s been a year since my last post, which was about deep (multi-layer) Bayesian classifiers capable of learning non-linear decision boundaries.
Since then, I’ve put that work on hold and have instead been working on deep learning using neural networks.
The reason for this was simple: our last paper revealed a limitation of deep directed graphical models that deep neural networks did not share, which allowed the latter to be much deeper (or to remember far more information) than the former.
The limitation turned out to lie in the very equation that allowed us to introduce non-linearity into deep Bayesian networks (see our last paper on deep (multi-layer) Bayesian classifiers for an explanation of the mathematics):
The equation contains a product of feature probabilities P(f|h,c) [the part inside the big brackets in the above equation].
This product yields extreme (uncalibrated) probabilities, and we had observed that those extreme probabilities were essential to the formation of non-linear decision boundaries in the deep Bayesian classifiers we’d explored in the paper. The extremeness gave the cluster nearest to a data point a greater say in the classification than all the other clusters.
We had found that when using this equation, there was no need to explicitly add non-linearities between the layers, because the above product itself gave rise to non-linear decision boundaries.
However, because the product of P(f|h,c) is so extreme, the probability P(h|F) (the probability of a hidden node given the features) becomes, in effect, a one-hot vector.
Thus a dense input vector (f) is transformed into a one-hot vector (h) in just one layer.
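To make the collapse concrete, here is a minimal sketch in Python. It is not the model from our paper: it assumes a naive-Bayes-style posterior in which P(h|F) is proportional to P(h) times the product of the P(f|h) terms, for a single fixed class c, and it assumes (purely for illustration) that every feature has the same likelihood under a given hidden node. The names posterior_over_hidden, FEATURE_PROBS and PRIOR are hypothetical.

```python
import math

def posterior_over_hidden(feature_probs, prior, num_features):
    """Posterior P(h|F), assuming P(h|F) is proportional to P(h) * P(f|h)**num_features.

    Works in log space so that long products do not underflow.
    """
    scores = [math.log(p_h) + num_features * math.log(p_f)
              for p_h, p_f in zip(prior, feature_probs)]
    top = max(scores)
    unnormalised = [math.exp(s - top) for s in scores]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]

def entropy_bits(dist):
    """Shannon entropy of a distribution, in bits."""
    return -sum(p * math.log2(p) for p in dist if p > 0.0)

# Hypothetical per-feature likelihoods P(f|h) for four hidden nodes.
# The nodes differ only slightly on any single feature.
FEATURE_PROBS = [0.52, 0.50, 0.49, 0.48]
PRIOR = [0.25, 0.25, 0.25, 0.25]

for n in (1, 10, 100, 1000):
    post = posterior_over_hidden(FEATURE_PROBS, PRIOR, n)
    print(n, [round(p, 3) for p in post], round(entropy_bits(post), 3))
```

With one feature the posterior stays close to uniform, but as the number of multiplied terms grows, nearly all of the probability mass moves to a single hidden node and the entropy of P(h|F) falls towards zero.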
Once we have a one-hot vector, we don’t gain much from the addition of more layers of neurons (which is also why you shouldn’t use the softmax activation function in intermediate layers of deep neural networks).
This is because one-hot encodings encode very little information.
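One way to see why further layers add little: when the input to a layer is one-hot, a linear map simply selects one column of its weight matrix, so the layer can produce at most as many distinct outputs as it has input units. The sketch below uses a made-up 2x3 weight matrix W and the hypothetical helpers linear_layer and one_hot; it illustrates the general point and is not code from our experiments.

```python
def linear_layer(weights, x):
    """Compute W x for a weight matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def one_hot(index, size):
    """Vector of length `size` with 1.0 at `index` and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(size)]

# Hypothetical weights for a layer with 3 inputs and 2 outputs.
W = [[0.5, -1.0, 2.0],
     [1.5, 0.25, -0.75]]

# Feed every possible one-hot input through the layer.
outputs = [linear_layer(W, one_hot(k, 3)) for k in range(3)]

# The columns of W, for comparison.
columns = [[row[k] for row in W] for k in range(3)]

print(outputs == columns)
```

Each output is just a column of W, so the next layer behaves like a lookup table with three entries rather than a function of a rich input.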
There’s an explanation of this weakness of one-hot encodings in the following lecture by Hinton comparing RNNs and HMMs.
Hinton points out there that an RNN, with its dense (distributed) representation, can encode exponentially more information than an HMM, which, like a finite state automaton, is in exactly one of its states at any time (a one-hot representation).
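The gap can be put in numbers with a small sketch. It assumes idealised binary units, which is a simplification (the hidden units of a real RNN are real-valued): a one-hot vector over N units can be in only N configurations, which is log2(N) bits, while N binary units used as a dense code can be in 2^N configurations, which is N bits. The function names are hypothetical.

```python
import math

def one_hot_bits(num_units):
    """Bits carried by a one-hot vector: only num_units configurations exist."""
    return math.log2(num_units)

def dense_binary_bits(num_units):
    """Bits carried by num_units independent binary units: 2**num_units configurations."""
    return float(num_units)

for n in (8, 64, 1024):
    print(n, one_hot_bits(n), dense_binary_bits(n))
```

Put the other way round, a one-hot code would need 2^N units to carry what N dense binary units can carry.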
I call this tendency of deep Bayesian models to reduce dense representations of information to one-hot representations the vanishing information problem.
Since the one-hot representation is a result of overconfidence (a kind of poor calibration), it can be said that the vanishing information problem exists in any system that suffers from overconfidence.
Since the deep Bayesian systems we studied suffer from the overconfidence problem, they don’t scale up to many layers.
(We are not sure whether the overconfidence problem is an artifact of the training method that we used, namely expectation maximization (EM), or of the formalism of directed graphical models itself.)
What our equations told us, though, was that the vanishing information problem was inescapable for deep Bayesian classification models trained using EM.
As a result, they would never be able to grow as deep as deep neural networks.
And that is the main reason why we switched to using deep neural networks in both our research and our consulting work at Aiaioo Labs.