Intuitively, if you want to be energy efficient, you should die. (I credit Tony Bell with this analysis.) But now, it seems like there is buzz around the idea that energy efficiency leads to prediction of input.
In my opinion, this is almost true.
I can't find a way in which Tony Bell's argument doesn't hold up, unless you add so many constraints to your system that the death solution is impossible. For example, in this very interesting recent preprint, the death solution appears to be ruled out by construction. If you add a scalar v in front of the input in their RNN, the death solution becomes possible; I hypothesize that experiments would confirm that training for energy efficiency sets v to 0 and kills the activations of the network entirely. But instead, their system is constrained so that energy efficiency demands that "p" be the negative of the input, and so predictive coding results.
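To make the argument concrete, here is a minimal toy sketch. It is not the preprint's model: the linear recurrence, the fixed recurrent weight `w`, the mean-squared-activation energy proxy, and the learning rate are all assumptions chosen for illustration. It shows gradient descent on energy driving the input gain v to 0 when that is allowed, and driving "p" to the negative of the input when the input cannot be gated out.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)  # toy input sequence (assumption)
w = 0.5                   # fixed recurrent weight (hypothetical value)


def activations(v, x, w):
    """Linear toy RNN: a_t = v * x_t + w * a_{t-1}, starting from a = 0."""
    a = np.zeros_like(x)
    prev = 0.0
    for t, xt in enumerate(x):
        prev = v * xt + w * prev
        a[t] = prev
    return a


def energy(v):
    """Energy proxy: mean squared activation."""
    return float(np.mean(activations(v, x, w) ** 2))


# Case 1: a trainable gain v in front of the input.
# a_t is linear in v, so energy(v) = v**2 * energy(1) and dE/dv = 2 * v * energy(1).
unit_energy = energy(1.0)
v, lr = 1.0, 0.1
for _ in range(500):
    v -= lr * 2.0 * v * unit_energy  # gradient step on the energy

# Case 2: the input cannot be gated; the unit's activity is x_t + p_t.
# Gradient of the summed energy sum_t (x_t + p_t)**2 with respect to p_t is 2 * (x_t + p_t).
p = np.zeros_like(x)
for _ in range(500):
    p -= lr * 2.0 * (x + p)

print("gain v after training:", v)
print("max |p + x| after training:", float(np.max(np.abs(p + x))))
```

In the first case the network goes silent (the death solution); in the second, the only way to go silent is for p to cancel the input, which looks like prediction.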
A more general take on energy efficiency and prediction is the thermodynamics of prediction. Continuous-time versions are in this paper. I find this bound to be quite clever in that it ties prediction inefficiency to energy inefficiency, rather than tying prediction wholesale to energy efficiency. Prediction inefficiency can actually be zero when there is no prediction (e.g., this paper).
It is not yet clear to me if this bound is tight, though, for optimized systems. Based on the overly simple examples in the aforementioned paper, I'd say no, but we'll have to see.
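A small numerical sketch may help pin down what "prediction inefficiency" means here. It assumes the usual reading: the information a memory s_t holds about the current input x_t, minus the information it holds about the next input x_{t+1}. The two-state environment, the noisy-copy memory, and the function names are hypothetical choices for illustration, not taken from the papers.

```python
import numpy as np


def mutual_information(joint):
    """Mutual information (bits) of a 2-D joint probability table."""
    joint = np.asarray(joint, dtype=float)
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (pa * pb)[mask])))


def memory_and_predictive_info(flip, noise):
    """Return (I[s_t; x_t], I[s_t; x_{t+1}]) for a toy environment and memory."""
    # Environment: symmetric two-state Markov chain that flips with prob `flip`.
    T = np.array([[1 - flip, flip], [flip, 1 - flip]])      # T[x_t, x_next]
    px = np.array([0.5, 0.5])                               # stationary distribution
    # Memory: s_t is a copy of x_t corrupted with probability `noise`.
    M = np.array([[1 - noise, noise], [noise, 1 - noise]])  # M[x_t, s_t]
    joint_s_x = (px[:, None] * M).T   # P(s_t, x_t)
    joint_s_xnext = joint_s_x @ T     # P(s_t, x_{t+1})
    return mutual_information(joint_s_x), mutual_information(joint_s_xnext)


i_mem, i_pred = memory_and_predictive_info(flip=0.2, noise=0.1)
print("memory:", i_mem, "predictive:", i_pred, "inefficiency:", i_mem - i_pred)

# A memory that ignores the input (noise = 0.5) predicts nothing,
# yet its prediction inefficiency is zero.
dead_mem, dead_pred = memory_and_predictive_info(flip=0.2, noise=0.5)
print("dead system inefficiency:", dead_mem - dead_pred)
```

The second call is the point made above: a system that remembers nothing has zero prediction inefficiency without doing any prediction.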
I wrote some stuff. I held off on doing this for a while, but now seems like a good time to be a nitpicker. Here are some thoughts.
At the end of the day, it seems likely that early brain regions are doing joint source-channel coding, and I wish I had a good way to think about that.
Just to update: I have two pieces of work on this with Simon DeDeo. We take the view that rate-distortion theory is a mathematical description of the efficient coding hypothesis, and because we're not sure what else to do, we randomly draw distortions and probabilities. We find two things. In this paper, we find that there are two regimes-- one in which the required resources grow with the number of environmental states, and one in which they don't. In this other paper, we find an experimental mathematical result (that looks like but isn't the Central Limit Theorem) which says that the rate-distortion curve doesn't change much from environment to environment. Hence, no need to change the number of sensory neurons, and no need for sensory neurogenesis.
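For readers who want to play with the idea, here is a minimal sketch of computing points on a rate-distortion curve for a randomly drawn environment, using the standard Blahut-Arimoto iteration. The particular ensemble (Dirichlet-distributed probabilities, uniform distortions), the number of states, and the values of the trade-off parameter `beta` are assumptions for illustration and not necessarily what the papers use.

```python
import numpy as np


def blahut_arimoto(p_x, d, beta, n_iter=500):
    """Return (rate in bits, expected distortion) at trade-off parameter beta."""
    n_y = d.shape[1]
    q_y = np.full(n_y, 1.0 / n_y)  # start from a uniform codebook marginal
    for _ in range(n_iter):
        # Best channel for the current marginal, then re-estimate the marginal.
        w = q_y[None, :] * np.exp(-beta * d)
        q_y_given_x = w / w.sum(axis=1, keepdims=True)
        q_y = p_x @ q_y_given_x
    joint = p_x[:, None] * q_y_given_x
    mask = joint > 0
    ratio = np.ones_like(joint)
    ratio[mask] = q_y_given_x[mask] / np.broadcast_to(q_y, joint.shape)[mask]
    rate = np.sum(joint * np.log2(ratio))  # mutual information I(X; Y)
    distortion = np.sum(joint * d)         # expected distortion
    return float(rate), float(distortion)


rng = np.random.default_rng(0)
n = 8                            # number of environmental states (assumption)
p_x = rng.dirichlet(np.ones(n))  # randomly drawn probabilities
d = rng.random((n, n))           # randomly drawn distortions d[x, y]

betas = [0.0, 1.0, 5.0, 50.0]
curve = [blahut_arimoto(p_x, d, b) for b in betas]
for b, (r, dist) in zip(betas, curve):
    print(f"beta={b:5.1f}  rate={r:.3f} bits  distortion={dist:.3f}")
```

Repeating this for many random draws of `p_x` and `d`, and for different `n`, is one way to see how much the curve varies from environment to environment.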
I don't usually like to write about stuff like this, but I feel like this might actually do some good. Jim Crutchfield (well-known for his work on chaos theory) and I have papers about new methods for continuous-time, discrete-event process inference and prediction (here) and about how one can view the predictive capabilities of dynamical systems as a function of their attractor type (here). The reviews-- one from an information theory journal and another from machine learning experts-- unfortunately illustrated a lack of common knowledge on interdisciplinary problems. So I thought I'd put a few key points here, for those studying recurrent neural networks in any way, shape, or form.
First, if you have a dynamical system, you can classify its behavior qualitatively by attractor type. There are three types of attractors: fixed points, limit cycles, and beautiful strange attractors. It turns out that the "qualitative" attractor type is a guide to many computational properties of the dynamical system (again, soon to appear on arXiv).
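Second, the processes generated by hidden Markov models-- including unifilar ones, in which the current state and next symbol determine the next state-- are in general neither memoryless nor Markovian. The hidden state sequence is Markov, but the observed symbol sequence need not be.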
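A standard illustration is the Even Process, a two-state unifilar hidden Markov model whose output is not Markov of any finite order. The sketch below simulates it and estimates the probability of the next symbol given two different pasts; the function names, sequence length, and seed are arbitrary choices for illustration.

```python
import numpy as np


def simulate_even_process(n, seed=0):
    """Even Process: state A emits 0 (stay in A) or 1 (go to B) with prob 1/2 each;
    state B always emits 1 and returns to A. Unifilar: state + symbol fix the next state."""
    rng = np.random.default_rng(seed)
    state = "A"
    out = []
    for _ in range(n):
        if state == "A":
            if rng.random() < 0.5:
                out.append(0)  # emit 0, stay in A
            else:
                out.append(1)  # emit 1, move to B
                state = "B"
        else:
            out.append(1)      # emit 1, return to A
            state = "A"
    return out


def prob_next_zero(seq, context):
    """Empirical P(next symbol = 0 | the preceding symbols equal `context`)."""
    k = len(context)
    context = list(context)
    hits = zeros = 0
    for t in range(k, len(seq)):
        if seq[t - k:t] == context:
            hits += 1
            zeros += int(seq[t] == 0)
    return zeros / hits


seq = simulate_even_process(50_000)
p_even = prob_next_zero(seq, [0, 1, 1])     # an even run of 1s since the last 0
p_odd = prob_next_zero(seq, [0, 1, 1, 1])   # an odd run of 1s since the last 0
print("P(0 | ...011)  =", p_even)
print("P(0 | ...0111) =", p_odd)
```

Both pasts end in "11", so an order-2 Markov model would have to assign them the same next-symbol probability, yet one is about 1/2 and the other is exactly 0. The same construction with longer runs of 1s defeats any fixed finite order, because what matters is the parity of the run since the last 0.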
More to come.