Picture this scenario: your back room during the summer is taken up by six undergraduates, each working on an independent project, all of them coding. At any given time, roughly two of the six need to talk to you because they've gotten stuck and can't work past it on their own. What should the students who need help do?
I know ChatGPT and other large language models might have negative effects on society, including, for example, the authoritative-sounding spread of disinformation due to hallucinations that nobody can seem to get rid of. But ChatGPT really helps in this particular mentoring situation: the students who need help but can't get it for another hour don't have to just sit there and wait for me to be free-- they can use ChatGPT to help them write code. That said, there are ways in which this works and ways in which it doesn't, so I wanted to share my experience with how ChatGPT succeeds as a mentorship tool and how it fails when not used properly. Honestly, I should probably pretty this idea up and publish it in an education journal, but I'm too exhausted, so here we go.

One of the students who regularly used ChatGPT was coding in a language that I didn't remember well. She already knew other languages but needed this particular one to code an application for a really cool potential psychotherapy intervention that we're trying. To learn the language and how to use it to build applications, she took a free course while she coded up her application, turning to ChatGPT whenever she couldn't figure out the right piece of code. As far as I can tell, this worked really, really well. Because she took the coding course at the same time that she used ChatGPT, she could check that ChatGPT wasn't spitting out nonsense and could prompt it to change its output when it was almost but not quite right. And because she was coding the project while taking the course, she stayed focused on exactly what she needed to learn that summer. (Keep in mind that our summer research assistantship is only 2 months long!) In the end-- though I have to ask this student for her impression of the summer-- I would say that she needed my input more for general user-design questions and less for the questions that I couldn't answer without taking the course with her and Googling a lot. ChatGPT was value-added.

Another student came in with less coding experience and less math under his belt and was doing a reinforcement learning project. He used ChatGPT a lot, from start to finish, even on conceptual questions. Because he started out not knowing how to code, ChatGPT became a crutch rather than a tool, and when ChatGPT is a crutch rather than a tool, research projects don't work out so well. First of all, it turns out, unsurprisingly, that even if a student is smart, eager, and enthusiastic beyond belief, it is a bad idea to give them a project where you assert to yourself that you can teach them enough probability theory, multivariable calculus, and coding to understand Markov Decision Processes and policy gradient methods in 2 months. Just, no. By the end of the summer, this poor student was getting scared to come to work because he did not feel like he would be able to solve the problem that I gave him that day. Second of all, even though I made mistakes with this mentorship, I learned a valuable lesson about what happens when you don't really understand the theory behind a research project and ask ChatGPT for help-- you hit a wall, and quickly.
This student often wrote slightly wrong prompts (which ChatGPT sort of auto-corrected), got answers that didn't quite work, and then didn't know what prompts to try next. So by the end of the summer, I was drilling him on the theory so that he could do prompt engineering a bit better. That was successful, although scary for him, he said. But basically, if you or your students are using large language models to do research and don't really understand the theory behind the project, and don't really know the coding language if there is coding involved, the large language model is not going to be able to do the project for you.

And then finally, my own efforts to use ChatGPT were funny. I just thought it would be interesting to see if ChatGPT could come up with anything novel. I know people are working on this actively, but at the time I tried it, either ChatGPT or Bard (I can't remember which) could not write anything that was not boilerplate. The exact prompt I used was a question on a research project finding the information acquired by the environment during evolution. (My student and I had just roughly written a paper on that, which would not have been in the corpus.) What it came up with was some not-very-interesting text about how there was a calculation of the information acquired by evolution and how that could lead to better models, which doesn't even really make sense-- you use a model to calculate the information acquired, unless you happen to have some really nice experimental data. And if you calculate the information acquired, regardless, you will not get a better model.

Later, I thought it might be interesting to see if Bard could teach. I was trying to figure out how to teach thermodynamics concepts in a week, which is a ridiculous ask, but that's introductory physics for life sciences for you, and what it spit out was subpar and dry. No active learning, and no real sense of how long it would take to teach thermodynamics concepts so that they would actually be understood. (Three laws of thermodynamics in one hour-long class is not a good idea.) Anyway, that was a while ago, but I'm interested in seeing whether the newer large language models like Gemini can show signs of creativity that could be helpful enough for me to use them as tools in research or in teaching.
This is a project I no longer want to do, for a very weird reason. I have voices due to my paranoid schizophrenia, and although my psychoses have inspired a large number of projects, I do not feel good about doing this particular project because the voices helped me understand that regret and expectation were different. If you have heard of paranoid schizophrenia, you might understand that schizophrenics often feel like their brain has split apart, and that what would usually be considered their own thoughts feel to them like someone else's thoughts. So I felt like someone else was trying to help me read about decision theory, and I basically decided that decision-theory ideas were therefore the intellectual property of the voices and not mine.
But before that happened, I had an idea, and I think it's an interesting one. In decision theory, there's the idea of minimax and there's also the idea of expectation in terms of utility. The two do different things. In minimax formulations, you take the best action you possibly can, in terms of utility, in the worst possible environment. In expectation formulations, you take the action policy that is best in terms of the expected value of the utility. But when people talk about how organisms have evolved and someone invokes optimization, the knee-jerk response often is: but aren't organisms often just "good enough"? I feel like it's almost obvious that the minimax formulation of decision theory leads to organisms that are just good enough in every environment they encounter. To support this, one would need to do some simulations: take a simple environment and show that the minimax solution actually leads to "good enough," or satisficing, behavior. Now, it is definitely going to be easy to come up with counterexamples in which the minimax solution is very far from "good enough" for some environments. But I feel like we haven't evolved in those pathological conditions, so realistic simulations would place minimax solutions closer to "good enough." The mechanism by which organisms evolve or adapt to achieve minimax optimality might actually explain the "good enough" behavior, too-- adapting to a constantly fluctuating environment can stand in for minimax, which would then imply good-enough behavior. Perhaps someone has already done this, but if not, here's an idea that I'm throwing out into the ether.
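To make the comparison concrete, here is a minimal sketch of the kind of simulation I have in mind. Everything in it is invented for illustration: a random utility table over a handful of actions and environments, and an arbitrary distribution over which environment actually shows up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all numbers invented): U[a, e] is the utility of action a in
# environment e, and p[e] is how often environment e actually occurs.
n_actions, n_envs = 5, 8
U = rng.uniform(0.0, 1.0, size=(n_actions, n_envs))
p = rng.dirichlet(np.ones(n_envs))

# Minimax: the action with the best worst case.
# Expectation: the action that is best on average over environments.
a_minimax = int(np.argmax(U.min(axis=1)))
a_expect = int(np.argmax(U @ p))

for name, a in [("minimax", a_minimax), ("expectation", a_expect)]:
    print(f"{name:11s} action {a}: worst-case utility {U[a].min():.3f}, "
          f"average utility {U[a] @ p:.3f}")
```

The pattern I'd expect, and the thing the real simulations would have to check, is that the minimax action is rarely the best in any single environment but is never terrible in any of them, which is about as close to "good enough everywhere" as a one-line decision rule gets.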
One of the articles written by a superbly talented high school student (or that's what she was when we worked together) was accepted by PLoS One. Here it is on bioRxiv. They took forever to proofread our article. We took forever to respond because Alejandra was busy. And now, we've been completely ghosted: they haven't responded to any of my emails asking them to please print the article that they accepted. This is horrible. This is damaging to Alejandra's career and to my tenure case. Update: they finally responded. The article is published!

I've now talked to many people who think that causal states (minimal sufficient statistics of prediction) and epsilon-Machines (the hidden Markov models built from them) are niche. I would like to argue that causal states are useful. To understand their utility, we have to understand a few unfortunate facts about the world.

1. Take a random time series, any random time series, and ask if the process it comes from is Markovian or order-R Markov. Generically, no. Even if you look at a pendulum swinging, if you only take a picture every time step dt, you're looking at an infinite-order Markov process. Failure to understand that you need the entire past of the input to estimate the angular velocity of that pendulum results in a failure to infer the equations of motion correctly. Another way of seeing this: most hidden Markov models that generate processes will generate an infinite-order Markov process. So no matter how you slice it, you probably need to deal with infinite-order Markov processes unless you are very clever with how you set up your experiments (e.g., overdamped Langevin equations as the stimulus).

2. Causal states are no more than the minimal sufficient statistics of prediction. They are a tool. If they are unwieldy because the process has an uncountable infinity of them-- which is the case generically-- we may still have to deal with them. In truth, you can approximate processes with an infinite number of causal states by using order-R Markov approximations (better than their Markov approximation counterparts) or by coarse-graining the mixed state simplex.

3. This is less an unfortunate fact and more a rejoinder to a misconception that people seem to have: epsilon-Machines really are hidden Markov models. To see this, check out the example below. If you see a 0, you know what state you're in, and for those pasts that contain a 0, the hidden states are unhidden, allowing you to make predictions as well as possible. But if you have seen only 1's, then you have no clue what state you're in-- which makes the process generated by this epsilon-Machine an infinite-order Markov process. These epsilon-Machines are the closest you can get to unhidden Markov models, and it's just a shame that they usually have an infinite number of states.

4. This is a conjecture that I wish somebody would prove: that the processes generated by countably infinite epsilon-Machines are dense in the space of processes.

So if we want to predict or calculate prediction-related quantities for some process, it is highly likely that the process is infinite-order Markov and has an uncountable infinity of causal states. What do we do?
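The figure with the example machine from point 3 isn't reproduced here, but a standard two-state machine with exactly this property is the "even process," so here's a small sketch using it (the particular transition probabilities are just illustrative). Seeing a 0 synchronizes you to the hidden state; seeing a run of 1's never does.

```python
import numpy as np

# Labeled transition matrices for a two-state unifilar HMM (the "even process"):
# T[x][i, j] = P(emit symbol x and move to state j | currently in state i).
# States: index 0 = A, index 1 = B.
T = {
    0: np.array([[0.5, 0.0],    # A --0--> A
                 [0.0, 0.0]]),
    1: np.array([[0.0, 0.5],    # A --1--> B
                 [1.0, 0.0]]),  # B --1--> A
}

def update_belief(belief, symbol):
    """Observer's distribution over hidden states after seeing one more symbol."""
    b = belief @ T[symbol]
    return b / b.sum()

belief = np.array([0.5, 0.5])           # start maximally uncertain
for symbol in [1, 1, 1, 1]:
    belief = update_belief(belief, symbol)
    print("saw 1 ->", belief.round(3))  # stays mixed: the state is still hidden

belief = update_belief(belief, 0)
print("saw 0 ->", belief.round(3))      # collapses onto state A: synchronized
```

That mix of "unhidden after a 0, hidden after all 1's" is what makes the generated process infinite-order Markov even though the machine itself has only two states.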
There are basically two roads to go down, as mentioned previously. One involves pretending that the process is order-R Markov with large enough R, meaning that only the last R symbols matter for predicting what happens next. The other involves coarse-graining the mixed state simplex, which works perfectly for the infinite-order Markov processes generated by hidden Markov models with a finite number of causal states. Which one works better depends on the process. That being said, here are some benefits to thinking in terms of causal states:

1. Model inference gets slightly easier, because for these machines there is, given the starting state, only one path through the hidden states for a given series of observations. This was used to great effect here and here, and to some effect here.

2. It is easy to compute certain prediction-related quantities (entropy rate, prediction information curves) from the epsilon-Machine even though it's nearly impossible otherwise, as is schematized in the diagram below. See this paper and this paper for examples of how to simplify calculations by thinking in terms of causal states.
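As a taste of point 2, here's a sketch of the entropy-rate calculation for the same toy machine as above. For a unifilar presentation, the entropy rate is just a state-weighted average of the per-state symbol entropies; estimating the same number directly from long blocks of simulated data converges far more slowly. (The numbers are the illustrative ones from the sketch above, not anything from a real dataset.)

```python
import numpy as np

# Per-state symbol emission probabilities P_emit[s, x] and the state-to-state
# transition matrix M (symbols marginalized out) for the toy unifilar machine.
P_emit = np.array([[0.5, 0.5],   # state A emits 0 or 1 with equal probability
                   [0.0, 1.0]])  # state B always emits 1
M = np.array([[0.5, 0.5],
              [1.0, 0.0]])

# Stationary distribution over hidden states: left eigenvector of M at eigenvalue 1.
evals, evecs = np.linalg.eig(M.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi /= pi.sum()

# Entropy rate of a unifilar machine: h = -sum_s pi(s) sum_x P(x|s) log2 P(x|s).
with np.errstate(divide="ignore", invalid="ignore"):
    plogp = np.where(P_emit > 0.0, P_emit * np.log2(P_emit), 0.0)
h = float(-np.sum(pi * plogp.sum(axis=1)))
print(f"entropy rate = {h:.3f} bits per symbol")   # 2/3 bit for this machine
```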
There was an early paper on the MWC molecule that showed that you could get different logical gates by changing binding energies. At the same time, we all know that the Izhikevich neuron can yield many different types of neural behavior, all with different computational properties. At root, gene regulatory networks and neural networks must perform computations on inputs-- really, arbitrary computations. How do they succeed? I think it has something to do with the presence of different apparent activation functions in the biophysical network. Take a gene regulatory network, with its first layer being production of mRNA and its second layer being production of protein. The first layer is full of different apparent activation functions-- the same underlying thermodynamic model, but different apparent logical functions depending on the binding energies, leading to essentially any computation you want. Or take a neural network, with its many layers. Every layer might have different apparent activation functions arising from the same underlying dynamical system with a few parameters changed. The weird part is that this seems easy to analyze if you view it this way: there is some underlying hidden-state dynamics that describes everything, and the only things you're changing are the parameters of that hidden-state dynamics, which give you different apparent activation functions.
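Here's a toy version of the gene-regulatory case, loosely in the spirit of that MWC/binding-energy result: a two-activator thermodynamic occupancy model in which the promoter is taken to be active whenever at least one activator is bound. All of the parameter values (dissociation constants K1 and K2, cooperativity w, the "low" and "high" concentrations) are invented for illustration; the point is only that the same occupancy model acts OR-like or AND-like depending on the binding parameters.

```python
# Toy thermodynamic-occupancy model of a promoter with two activator binding sites.
def p_active(x1, x2, K1, K2, w):
    a1, a2 = x1 / K1, x2 / K2                  # occupancy weights for each site
    z = 1.0 + a1 + a2 + w * a1 * a2            # partition function over binding states
    return (a1 + a2 + w * a1 * a2) / z         # probability that at least one site is bound

low, high = 0.001, 1.0                          # "absent" and "present" activator levels
settings = {
    "OR-like  (strong individual binding, no cooperativity)":  dict(K1=0.1, K2=0.1, w=1.0),
    "AND-like (weak individual binding, strong cooperativity)": dict(K1=10.0, K2=10.0, w=1000.0),
}
for label, pars in settings.items():
    table = {(x1 == high, x2 == high): round(p_active(x1, x2, **pars), 2)
             for x1 in (low, high) for x2 in (low, high)}
    print(label, table)
```

Same model, same functional form; only the binding parameters change, and the apparent logical function changes with them.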
Back when I was at MIT and starting to think about the predictive capabilities of reservoirs, I wanted to pull out a dynamical systems textbook and answer all my questions about prediction. The first dynamical systems textbook I pulled out was Strogatz. I realized that I could hit the problem of prediction with that textbook in the limit of weak input, when the basin-attractor portrait failed to change, and that resulted in this paper. The second book I looked up was one on "Random Dynamical Systems." I got one chapter in when I realized something: this book wasn't answering any of my questions about prediction, and a conceptual shift was needed. These dynamical systems were not "random." They were filters of input, the input was the signal, and for all the problems I was working on it was incorrect to treat the input as noise. I barely care what the state of the system is; I only care about how the state of the system relates to the past of the input, something that may be harder to keep track of.

The fix, I think, is to look at the joint state space of the predictive features of the input and the state of the system, and to find dynamics on that joint state space. I did this in this paper. You have to know something about the input, and the math gets a bit complicated, but I'm hoping that slight fixes to the Strogatz textbook can carry over to the more general case! Altogether, I think a new textbook on dynamical systems with input is in order, one that includes more recent work on reservoir computing. These input-driven dynamical systems actually do a computation, and many fields-- from biophysics to theoretical neuroscience-- care about quantifying exactly how well that computation is done. Treating the input as noise is the opposite of solving the problems in these fields. I think this is a classic example of how bio-inspired math could spark an entirely new textbook.

Many people have written about this, and I don't know this literature as well as I'd like, but I did a random calculation that I thought was semi-useful, based on this paper. I haven't touched the channel coding aspect of that paper, but I did try to understand the mutual information between input and neural response in the small-noise limit. In this case, I'm assuming that the neurons are nearly noiseless and that their information is coded by a firing rate. The approximation was heavily inspired by this paper.
The details are not pretty, but the story ends up being pretty simple. Essentially, we'll assume that the firing rate of the neuron is some nonlinear function of a gain multiplied by the input signal, plus some Gaussian noise that ends up being rather irrelevant. We'll also assume that the probability density function of the input is highly peaked around some value. This is essentially ICA, but we learn a few things about the optimal nonlinearity.
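To spell out what that small-noise calculation looks like in the simplest case I can manage: write the rate as r = f(g*s) + noise, with Gaussian noise of variance sigma^2. In the small-noise limit, I(s; r) is approximately h(s) + E[log2 |g f'(g s)|] - (1/2) log2(2 pi e sigma^2), so the noise only contributes an additive constant, and the interesting term is the average log-slope of the nonlinearity where the input actually lives. Here's a numerical sketch of that approximation; the input density, the gain, the noise level, and both candidate nonlinearities are all made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sigma, g = 1e-3, 1.0                              # small output noise, unit gain
s = rng.normal(0.0, 0.5, size=200_000)            # sharply peaked input density
h_s = 0.5 * np.log2(2 * np.pi * np.e * 0.5**2)    # differential entropy of the input, in bits

def small_noise_mi(fprime):
    """Small-noise approximation to I(s; r) for r = f(g*s) + N(0, sigma^2)."""
    slope_term = np.mean(np.log2(np.abs(g * fprime(g * s))))
    return h_s + slope_term - 0.5 * np.log2(2 * np.pi * np.e * sigma**2)

# Candidate 1: nonlinearity matched to the input, f = input CDF, so f' = input pdf.
matched = lambda u: stats.norm.pdf(u, loc=0.0, scale=0.5)
# Candidate 2: a generic saturating nonlinearity with the wrong placement and slope.
generic = lambda u: stats.logistic.pdf(u, loc=1.0, scale=2.0)

print("matched nonlinearity:", round(small_noise_mi(matched), 2), "bits")
print("generic sigmoid     :", round(small_noise_mi(generic), 2), "bits")
```

Among saturating nonlinearities with the same output range, putting the steep part of f where the input density is concentrated (f' proportional to p(s), i.e., histogram equalization) maximizes that slope term, which is the familiar efficient-coding answer this kind of calculation keeps recovering.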
Please let me know if someone has already done this calculation! It seems like an obvious move, but I am unfortunately not familiar enough with the literature to know.

Here are some calculations that I will never publish. Intuitively, if you want to be energy efficient, you should die. (I credit Tony Bell with this analysis.) But now, it seems like there is buzz around the idea that energy efficiency leads to prediction of input.
In my opinion, this is almost true. I can't find a way in which Tony Bell's argument doesn't hold up, unless you add so many constraints to your system that the death solution is impossible. For example, in this very interesting recent preprint, it seems to be the case that the death solution is impossible. If you add a scalar v in front of the input in their RNN, the death solution becomes possible again; I hypothesize that some experiments might confirm that training for energy efficiency would set v to 0 and kill the activations of the network entirely. But instead, their system is constrained so that energy efficiency demands that "p" be the negative of the input, and so predictive coding results.

A more general take on energy efficiency and prediction is the thermodynamics of prediction; continuous-time versions are in this paper. I find this bound quite clever in that it ties prediction inefficiency to energy inefficiency, rather than to prediction wholesale. Prediction inefficiency can actually be zero when there is no prediction at all (e.g., this paper). It is not yet clear to me whether this bound is tight for optimized systems. Based on the overly simple examples in the aforementioned paper, I'd say no, but we'll have to see.
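For reference, the bound I have in mind here, if I'm writing it down correctly from memory, says that the average work a system s dissipates while being driven by a signal x satisfies W_diss >= k_B T [ I(s_t; x_t) - I(s_t; x_{t+1}) ]: dissipation is lower-bounded by the information the system holds about the present input that is not predictive of the next one. A system that stores nothing about its input makes both terms zero, so the bound says nothing about it; that is why it constrains prediction inefficiency rather than prediction itself, and why the death solution slips through so easily.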
I wrote some stuff.

I held off on doing this for a while, but now seems like a good time to be a nitpicker. Here are some thoughts. At the end of the day, it seems likely that early brain regions are doing joint source-channel coding, and I wish I had a good way to think about that. Just to update: I have two pieces of work on this with Simon DeDeo. We take the view that rate-distortion theory is a mathematical description of the efficient coding hypothesis, and because we're not sure what else to do, we randomly draw distortions and probabilities. We find two things. In this paper, we find that there are two regimes-- one in which resources grow with the number of environmental states, and one in which they don't. In this other paper, we find an experimental mathematical result (one that looks like, but isn't, the Central Limit Theorem) which says that the rate-distortion curve doesn't change much from environment to environment. Hence, there is no need to change the number of sensory neurons, and no need for sensory neurogenesis.

I don't usually like to write about stuff like this, but I feel like this might actually do some good. Jim Crutchfield (well-known for his work on chaos theory) and I have papers about new methods for continuous-time, discrete-event process inference and prediction (here) and about how one can view the predictive capabilities of dynamical systems as a function of their attractor type (here). The reviews-- one from an information theory journal and another from machine learning experts-- unfortunately illustrated a lack of common knowledge on interdisciplinary problems. So I thought I'd put a few key points here, for those studying recurrent neural networks in any way, shape, or form.
First, if you have a dynamical system, you can classify its behavior qualitatively by attractor type. There are three main types of attractors: fixed points, limit cycles, and beautiful strange attractors. It turns out that this "qualitative" attractor type is a guide to many computational properties of the dynamical system (again, soon to appear on arXiv). Second, hidden Markov models-- including unifilar ones, in which the current state and the next symbol determine the next state-- generally generate processes that are neither memoryless nor Markovian. More to come.