Many people have written about this, and I do not know this literature as well as I'd like, but I did a random calculation that I thought was semi-useful based on this paper. I haven't touched the channel coding aspect of this paper, but I did try to understand the mutual information between input and neural response in the small noise limit. In this case, I'm assuming that the neurons are nearly noiseless, and that their information is coded by a firing rate. The approximation was heavily inspired by this paper.
The details are not pretty, but the story ends up being pretty simple. Essentially, we'll assume that the firing rate of the neuron is some nonlinear function of some gain multiplied by the input signal, plus some Gaussian noise that ends up being rather irrelevant. We'll also assume that the probability density function of the input is highly peaked around some value. This is essentially ICA, but we learn a few things about the optimal nonlinearity:
- the neurons want to be maximally silent or maximally firing when the stimulus is most probable;
- if the neurons are maximally firing when the stimulus is maximally probable, silence is nearby, and soon after an increase in firing rate;
- and the gain should be large.
Please let me know if someone has already done this calculation! It seems like an obvious move, but I am unfortunately not familiar enough with the literature to know. Here are some calculations that I will never publish.
Intuitively, if you want to be energy efficient, you should die. (I credit Tony Bell with this analysis.) But now, it seems like there is buzz around the idea that energy efficiency leads to prediction of input.
In my opinion, this is almost true. I can't find a way in which Tony Bell's argument doesn't hold up, unless you add so many constraints to your system that the death solution is impossible. For example, in this very interesting recent preprint, it seems to be the case that the death solution is impossible. If you add a scalar v in front of the input in their RNN, the death solution is now possible; I hypothesize some experiments might confirm that training for energy efficiency would set v to 0 and kill the activations of the network entirely. But instead, their system is constrained so that energy efficiency demands that "p" must be the negative of the input, and so predictive coding results.
A more general take on energy efficiency and prediction is the thermodynamics of prediction. Continuous-time versions are in this paper. I find this bound to be quite clever in that it equates prediction inefficiency with energy efficiency, rather than prediction wholesale. Prediction inefficiency can actually be zero when there is no prediction (e.g., this paper). It is not yet clear to me if this bound is tight, though, for optimized systems. Based on the overly simple examples in the aforementioned paper, I'd say no, but we'll have to see. I wrote some stuff.
I held off on doing this for a while, but now seems like a good time to be a nitpicker. Here's some thoughts.
At the end of the day, it seems likely that early brain regions are doing joint source-channel coding, and I wish I had a good way to think about that.
Just to update: I have two pieces of work on this with Simon Dedeo. We take the view that rate-distortion theory is a mathematical description of the efficient coding hypothesis, and because we're not sure what else to do, we randomly draw distortions and probabilities. We find two things. In this paper, we find that there are two regimes-- one in which resources grow with the number of environmental states, and one in which they don't. In this other paper, we find an experimental mathematical result (that looks like but isn't the Central Limit Theorem) which says that the rate-distortion curve doesn't change much from environment to environment. Hence, no need to change the number of sensory neurons, and no need for sensory neurogenesis.
I don't usually like to write about stuff like this, but I feel like this might actually do some good. Jim Crutchfield (well-known for his work on chaos theory) and I have papers about new methods for continuous-time, discrete-event process inference and prediction (here) and about how one can view the predictive capabilities of dynamical systems as a function of their attractor type (here). The reviews-- one from an information theory journal and another from machine learning experts-- unfortunately illustrated a lack of common knowledge on interdisciplinary problems. So I thought I'd put a few key points here, for those studying recurrent neural networks in any way, shape, or form.
First, if you have a dynamical system, you can classify its behavior qualitatively by attractor type. There are three types of attractors: fixed points, limit cycles, and beautiful strange attractors. It turns out that the "qualitative" attractor type is a guide to many computational properties of the dynamical system (again, soon to appear on arXiv). Second, hidden Markov models-- including unifilar ones, in which the current state and next symbol determine the next state-- are not memoryless or Markovian. More to come. I've been wanting to write this post for a while, but never had the courage. But here goes.
Every so often, I encounter a paper that proposes a new objective function for agent behavior. Sometimes, something like predictive information is proposed; sometimes, something more like entropy rate is proposed. In both cases, I have a bit of an issue. When you try to, say, maximize predictability while minimizing memory, you end up either flipping coins (when you penalize memory too much) or running in very large circles (when you don't penalize memory enough). There doesn't usually seem to be an interesting intermediate behavior. When you maximize entropy rate, you typically end up flipping coins. The key to making these objective functions interesting, I think, is to add enough constraints that they start doing interesting things. And since this now impinges upon an old project that I may pick up again in the near future, that's all I'll say!
I've been reading a lot of papers lately in physics, in neuroscience, in biology in general, in which new mechanisms for memory are discovered. Certain types of inputs are shown, the system's state appears to remember something about past inputs, and victory is declared.
I think that the general trend here is wonderful. Systems do have memory, and many have memory for a reason; some of them don't, but their memory can actually be used for something. It's definitely about time that we started cataloguing these phenomena. However, at the risk of being a Debbie Downer, I want to point out that pretty much any input-dependent dynamical system has memory. What that means is that if your system's state evolves according to some set of rules that involve the input, then you're pretty much guaranteed to have memory. Thus, memory in and of itself is not that special. Both forgetting and remembering are typical.
The real question is, what does the system remember? What is special is when the system remembers, and only remembers, just the "right" things. Now, "right" is an unfortunate word because it's user-defined, but what is "right" depends on what the system is used for. In any particular application, there's some number of necessary things that the system needs to remember in order to do its job, whatever that may be. And it is usually tricky to design an input-dependent dynamical system that stores these things and only these things. (Exception: periodic input is exceptionally easy to remember. It is entirely predictable.)
I've therefore started to look at some of these papers a little more critically, like the grump that I am. It doesn't make me less excited about where this field will go, but it does highlight that more studies on what is "typical" are sorely needed.
As such, I asked a philosopher at the recent Mathematical Biosciences Institute summit at Ohio State University to tell me how philosophers of science would define a theory. He gave me six characteristics of a theory:
- Falsifiability: most scientists would claim that this is the cornerstone of what makes a theory a theory. However, this is a misleading characteristic. If someone thought that they had discovered a violation of energy conservation, they'd probably identify a new form that energy could take so that energy conservation was preserved. [Thanks to Feynman for that classic example.]
- Accuracy: in my notes, I have written down "error bars", and I unfortunately can't remember for the life of me why I wrote that down.
- Generality: this rules out things like "this particular animal happens to have a mole on her right cheek." But how general is general enough?
- Predictive: not just explanatory!
- Explanatory: this is way easier than the predictive requirement.
- Fruitfulness: the theory should yield new questions and areas of research.
Update: check out this paper about a theory of theories that I'm a co-author on! Or, let's start simpler: why should I care about entropy rate?
A lot of machine learning research nowadays is focused on finding minimal sufficient statistics of prediction (a.k.a. "causal states"), or just sufficient statistics, of some time series, whether it be a time series of Wikipedia edits or of Amazon purchases. Most of my research assumes that we know these causal states, and then tries to use that knowledge to calculate a range of quantities (including entropy rate and predictive information curves) more accurately than if you were to do it directly from the time series/data.
This leads to the question... why? Why care about these quantities? Entropy rate enjoys a privileged status, due to Shannon's first theorem, so let's focus on predictive information curves for just a second. For the initiated, the predictive information bottleneck is an application of the information bottleneck method to time series prediction, in which we compress the past as efficiently as possible to understand the future to some desired extent. For the uninitiated, predictive information curves tell us the tradeoff between resources required to predict the future and predictive power.
In one of the first papers on the subject, Still et al identified causal states as one limiting case of the predictive information bottleneck. With that theorem in mind, one might reasonably ask the following question: why study causal states? Just study the predictive information bottleneck, and causal states pop out as a special case. Surprisingly, or maybe not so surprisingly, it turns out that calculating predictive information curves and lossy predictive features is much easier when you have the lossless predictive features, a.k.a. the causal states. For instance, check out some of the examples in this paper. So, we end up in sort of a Catch-22 situation: to get accurate lossy predictive features, we need accurate lossless predictive features. The jaded among us might finally wearily ask the following: now what?
We set out to find causal states (lossless predictive features). Some smart people promised us that we could calculate these using the predictive information bottleneck, but now someone else has told us that those calculations are likely to be crappy unless we already have access to causal states. At this point, I "pivoted", provoked by the following question: how can we tell if a sensor is excellent at extracting lossy predictive features? One way to find out is to send input with known causal states to the sensor, and then calculate how well the sensor performs relative to the corresponding predictive information curve, as was done in this inspiring paper. If we know the input's causal states, then we can calculate its predictive information curve rather accurately, and therefore can be confident in our assessment of the sensor's predictive capabilities. At this point, Professor Crutchfield pointed something else out: a coarse-grained dynamical model might be desired if the original model is too complicated to be understood. Imagine generating a very principled low-dimensional dynamical description of complicated genetic or neural circuits. It's not yet clear that the predictive information bottleneck provides the best way of doing so, but it's at least a start. These two applications are summed up by the following paragraph: "At second glance, these results may also seem rather useless. Why would one want lossy predictive features when lossless predictive features are available? Accurate estimation of lossy predictive features could and have been used to further test whether or not biological organisms are near-optimal predictors of their environment. Perhaps more importantly, lossless models can sometimes be rather large and hard to interpret, and a lossy model might be desired even when a lossless model is known." Check out this paper for an example of what I mean. 
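For reference, the tradeoff behind the predictive information curves discussed above is commonly written as a bottleneck objective. This is a sketch in my own notation, not copied from any of the linked papers: $\overleftarrow{X}$ is the past, $\overrightarrow{X}$ is the future, $R$ is a hypothetical name for the compressed representation, and $\beta \geq 0$ sets how much predictive power is worth relative to memory.

```latex
\min_{p(r \mid \overleftarrow{x})} \;
  I[\overleftarrow{X}; R] \;-\; \beta \, I[R; \overrightarrow{X}]
```

Sweeping $\beta$ traces out a curve of retained predictive information $I[R; \overrightarrow{X}]$ against coding cost $I[\overleftarrow{X}; R]$. The regime in which predictive power is weighted most heavily is where one would expect the lossless features to appear.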
Calculating the entropy rate (the conditional entropy of the present symbol given all past symbols) or excess entropy (the mutual information between all past symbols and all future symbols) is not as easy as it may seem. Why? Because there are infinities-- an infinite number of past symbols and/or an infinite number of future symbols.
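To make the two infinities explicit, here are the definitions written out for a stationary process. The notation is my own assumption: $X_t$ is the symbol at time $t$, $h_\mu$ the entropy rate, and $\mathbf{E}$ the excess entropy.

```latex
h_\mu = \lim_{L \to \infty} H[X_0 \mid X_{-L}, \ldots, X_{-1}],
\qquad
\mathbf{E} = \lim_{L \to \infty} I[X_{-L}, \ldots, X_{-1} \,;\, X_0, \ldots, X_{L-1}]
```

Any finite $L$ only gives an approximation, and the number of length-$L$ words that must be tracked grows exponentially in $L$.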
You can certainly make a lot of progress by tackling this problem head on, looking at longer and longer pasts and/or longer and longer futures. I'm pretty lazy, so I usually look for shortcuts. Here's my favorite shortcut: identifying the minimal sufficient statistics of prediction and/or retrodiction, also known as forward- and reverse-time "causal states". Then, you can rewrite most of your favorite quantities that have the "right" kind of infinities in terms of these minimal sufficient statistics. If you're lucky, manipulation of these joint probability distributions of these forward- and reverse-time causal states is tractable. My favorite paper illustrating this point is "Exact complexity", but for the more adventurous, I self-aggrandizingly recommend four of my own papers: "Predictive rate-distortion of infinite-order Markov processes", "Signatures of Infinity", "Statistical Signatures of Structural Organization", and the hopefully-soon-to-be-published "Structure and Randomness of Continuous-Time Discrete-Event Processes". And finally, here's a copy of my talk at APS (that I missed due to sickness) that covers the corollary in "Predictive rate-distortion of infinite-order Markov processes". Finding these causal states can be difficult, but this seems to be the best algorithm out there.
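Here is a small sketch of the shortcut versus the head-on approach. The example process is my own choice, not one taken from the papers above: it is the two-state Even Process, a unifilar hidden Markov model whose states serve as its causal states. All function names are hypothetical. The head-on estimate uses longer and longer words; the shortcut uses only the stationary distribution over states.

```python
import itertools
import numpy as np

# Assumed example: labeled transition matrices of the Even Process.
# T[x][i, j] = P(emit symbol x and move to state j | current state i).
T = {
    0: np.array([[0.5, 0.0], [0.0, 0.0]]),
    1: np.array([[0.0, 0.5], [1.0, 0.0]]),
}

def stationary_distribution(T):
    """Left eigenvector with eigenvalue 1 of the state-to-state matrix."""
    M = sum(T.values())
    vals, vecs = np.linalg.eig(M.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()

def entropy_rate_from_states(T):
    """Shortcut: h = -sum_s pi(s) sum_x P(x|s) log2 P(x|s).

    This formula relies on the model being unifilar.
    """
    pi = stationary_distribution(T)
    h = 0.0
    for s, p_s in enumerate(pi):
        for Tx in T.values():
            p = Tx[s].sum()  # P(x | s)
            if p > 0:
                h -= p_s * p * np.log2(p)
    return h

def block_entropy(T, L):
    """Head-on: entropy in bits of all length-L words, by enumeration."""
    pi = stationary_distribution(T)
    H = 0.0
    for word in itertools.product(T.keys(), repeat=L):
        v = pi.copy()
        for x in word:
            v = v @ T[x]
        p = v.sum()  # probability of this word
        if p > 0:
            H -= p * np.log2(p)
    return H

if __name__ == "__main__":
    print("from states:", entropy_rate_from_states(T))
    for L in range(1, 11):
        # H(L) - H(L-1) is the length-L estimate of the entropy rate.
        print(L, block_entropy(T, L) - block_entropy(T, L - 1))
```

For this process the state-based answer is 2/3 of a bit per symbol, while the word-based estimates approach it from above and never quite land on it at finite length. The enumeration also costs exponentially more with each extra symbol, which is why I look for shortcuts.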