Humans rationally balance abstract world models

This work adds to a growing body of research showing that the brain arbitrates between approximate decision strategies.
The current study extends these ideas from simple habits to the use of more sophisticated approximate predictive models, and demonstrates that individuals dynamically adapt these models in response to the predictability of their environment.


How do people model the world’s dynamics to guide mental simulation and evaluate choices?
One prominent approach, the Successor Representation (SR), takes advantage of temporal abstraction of future states: by aggregating trajectory predictions over multiple timesteps, the brain can avoid the costs of iterative, multi-step mental simulation. Human behavior broadly shows signatures of such temporal abstraction, but finer-grained characterization of individuals’ strategies and their dynamic adjustment remains an open question. We developed a task to measure SR usage during dynamic, trial-by-trial learning.
Using this approach, we find that participants exhibit a mix of SR and model-based learning strategies that varies across individuals. Further, by dynamically manipulating the task contingencies within-subject to favor or disfavor temporal abstraction, we observe evidence of resource-rational reliance on the SR, which decreases when future states are less predictable.

A hallmark of human and animal planning is its flexibility: the ability to plan and replan effectively in the face of novel or changing circumstances and outcomes. This requires not only predicting the immediate outcomes of actions based on experience, but also inferring longer-term payoffs, which often depend on a series of subsequent states and actions. A long-considered solution to this problem is for the brain to simulate outcomes using a cognitive map or internal model. Reinforcement learning (RL) theories formalize this idea in terms of “model-based” (MB) algorithms, which employ a learned model of the short-term consequences of actions to iteratively simulate the long-run consequences of candidate actions. Other learning approaches aim to simplify this laborious decision-time computation at the expense of reduced flexibility. Notably, model-free (MF) RL algorithms directly learn and cache the long-run aggregate reward expected for performing an action (the key decision variable), without representing or computing over the step-by-step contingencies. This avoids costly model-based search, but at the expense of sub-optimal behavior in some circumstances: for instance, when distal contingencies or goals change in a way that invalidates the cached aggregates. It has been suggested that the brain employs both methods and trades them off judiciously, balancing their costs and benefits according to situational demands. This perspective offers a formal, resource-rational account of both healthy and pathological habits, and helps characterize changes in their deployment in other settings, such as over development and aging.
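To make this distinction concrete, the sketch below is a minimal illustration under assumed toy numbers (the `ModelFree` and `ModelBased` classes, the transition matrix, and the learning rate are all hypothetical, not the study's code): an MF learner caches long-run action values and can only adjust them through further direct experience, whereas an MB learner stores one-step contingencies and recomputes values at decision time, so a change in a distal reward estimate is reflected immediately.

```python
import numpy as np

GAMMA = 1.0  # no discounting over a short horizon

class ModelFree:
    """Caches long-run action values; cheap at decision time, slow to re-adapt."""
    def __init__(self, n_actions, alpha=0.2):
        self.Q = np.zeros(n_actions)
        self.alpha = alpha

    def update(self, action, long_run_return):
        # TD-style update: nudge the cached aggregate toward the sampled return.
        self.Q[action] += self.alpha * (long_run_return - self.Q[action])

    def values(self):
        return self.Q  # read out cached aggregates directly


class ModelBased:
    """Stores one-step contingencies; simulates consequences at decision time."""
    def __init__(self, transitions, rewards):
        self.T = transitions  # T[a, s'] = P(successor state s' | action a)
        self.R = rewards      # R[s'] = current estimate of reward in state s'

    def values(self):
        # Evaluate each action by averaging over the learned model's successors.
        return GAMMA * self.T @ self.R


# If a distal reward estimate changes, MB values change immediately...
mb = ModelBased(np.array([[0.7, 0.3], [0.3, 0.7]]), np.array([1.0, 0.0]))
print(mb.values())            # [0.7, 0.3]
mb.R = np.array([0.0, 1.0])   # distal contingency reverses
print(mb.values())            # [0.3, 0.7]: re-planning reflects the change at once
# ...whereas cached MF values only shift after further direct experience.
```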

Planning model and task structure.

a Trials alternate between “traversal” trials (top) where subjects choose an island and then a boat, and “non-traversal” trials (bottom) in which a boat is selected at random and its reward delivered, without island choice or presentation. On each traversal trial, the participant selects either the left or right island. Upon selecting an island, the two boats available on that island appear, and the participant selects either the left or right boat. On each non-traversal trial, the participant is not given the option to select an island or boat, and instead, one of the four boats “visits them” at the starting location, with identical payout probabilities. The locations of all boats are fixed for the duration of the task. Bottom Right: Full schematic of trial structure. 
b An MB agent (left) evaluates choices via recursive maximization, while an SR agent (center) evaluates choices via on-policy averaging. Both SR and MB agents can integrate nonlocal information through an appropriate world model. An MF agent (right) directly learns the values of actions and thus cannot integrate nonlocal information. (A minimal sketch of these three evaluation rules follows the caption.)
c The experiment consisted of 22 reward blocks of between 16 and 24 trials, alternating between traversal and non-traversal trials. Reward probabilities were consistent within a given block, and systematically altered between blocks. 
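As a concrete reading of panel b, the sketch below spells out the three evaluation rules for a two-stage island-to-boat structure (the boat assignments, reward values, and choice policy are illustrative assumptions, not the task's parameters): MB evaluation maximizes recursively over each island's boats, SR evaluation averages boat rewards under the agent's own boat-choice policy, and MF evaluation simply reads out cached island values, so a reward observed for a boat on a non-traversal trial cannot reach the island value until that island is chosen again.

```python
import numpy as np

# Toy two-stage structure: islands give access to boats (numbers are assumptions).
boats_on_island = {0: [0, 1], 1: [2, 3]}
R_boat = np.array([0.8, 0.2, 0.5, 0.1])   # current boat reward estimates

def mb_island_values():
    # Model-based: recursive maximization -- each island is worth its best boat.
    return np.array([max(R_boat[b] for b in boats)
                     for boats in boats_on_island.values()])

def sr_island_values(policy):
    # SR: on-policy averaging -- each island is worth the expected boat reward
    # under the agent's own (cached) boat-choice policy.
    # policy[island][boat] = probability of choosing that boat at that island.
    return np.array([sum(policy[i][b] * R_boat[b] for b in boats_on_island[i])
                     for i in boats_on_island])

def mf_island_values(cached_Q):
    # Model-free: read out cached island values; boat rewards observed on
    # non-traversal trials do not propagate to islands without direct experience.
    return cached_Q

policy = {0: {0: 0.9, 1: 0.1}, 1: {2: 0.5, 3: 0.5}}
print(mb_island_values())         # [0.8, 0.5]
print(sr_island_values(policy))   # [0.74, 0.3]
```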

Participants use world models to incorporate reward information

The task was designed to distinguish between MB and SR strategies on a per-trial basis, because these strategies predict different characteristic patterns by which reward information from non-traversal trials should be incorporated into subsequent decisions. We can understand, and provide an initial test of, the models’ predictions by focusing specifically on the effect of each non-traversal trial’s reward on the island choice made on the subsequent traversal trial. Such simplified, one-trial-back analyses complement, and provide intuition about, the factors driving the relative fit of the more elaborate learning models we consider later. First, to verify that participants were indeed incorporating reward information from these trials, we examined whether receiving reward (vs. not) for some boat on a non-traversal trial affected participants’ chance of choosing the island associated with the sampled boat on the immediately following traversal trial. Both approaches to world modeling (MB and SR) predict a win-stay-lose-shift pattern, on average, at the island level: that is, to the extent they follow either strategy, participants should be more likely to choose the associated island on the immediately succeeding traversal trial if the boat was rewarded rather than unrewarded.
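One simple way to operationalize this one-trial-back test (a sketch under assumed column names such as `rewarded` and `chose_assoc_island_next`, not the study's analysis code) is a logistic regression of the next traversal trial's island choice on the preceding non-traversal trial's reward outcome; a win-stay-lose-shift pattern appears as a positive coefficient on the reward term.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def one_trial_back(df):
    """df: one row per non-traversal trial, with columns
    'rewarded' (0/1 payout of the sampled boat) and
    'chose_assoc_island_next' (0/1 choice of the associated island
    on the immediately following traversal trial)."""
    return smf.logit('chose_assoc_island_next ~ rewarded', data=df).fit(disp=False)

# Demo on simulated data with a modest built-in win-stay tendency.
rng = np.random.default_rng(0)
demo = pd.DataFrame({'rewarded': rng.integers(0, 2, 200)})
demo['chose_assoc_island_next'] = (
    rng.random(200) < 0.4 + 0.3 * demo['rewarded']).astype(int)
print(one_trial_back(demo).params)   # positive 'rewarded' coefficient = win-stay
```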


When do we spend time and mental resources iteratively simulating the future, and when do we rely on simpler procedures? Simplified evaluation strategies inherently trade off planning complexity against accuracy, and the usefulness of any given approach depends on the environment in which an agent operates. This trade-off has often been discussed in terms of model-based versus model-free methods: MF learning is advantageous when the long-run values of candidate actions change slowly enough for a cached estimate to accurately reflect distant reward outcomes. It has been hypothesized that the brain trades off such strategies judiciously according to circumstance, and (in the case of MB vs. MF arbitration) that this process explains phenomena of both healthy automaticity and pathological compulsion.
Here we extended this program to the question of how the world model itself is constructed. Simplified world models, such as the SR, have seen recent attention in decision neuroscience. These simplify planning not by forgoing model-based simulation altogether, but by collapsing the iterative simulation of dynamics into a temporal abstraction over future state encounters: in effect, adopting a simplified, multi-step world model. Furthermore, one candidate implementation of full MB planning in the brain involves an inherent, tuneable, cost-saving trade-off with SR-like abstraction, suggesting that judicious MB-SR trade-offs may be as, or even more, fundamental than the classic MB-MF picture. The SR and related methods thus offer a potential solution to how the brain can plan efficiently over temporally extended timescales, so long as the associated predictions are stable, but they also predict subtler biases.
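In its standard textbook form (the notation below is a generic sketch, not taken from this paper), the SR caches the expected discounted future occupancy of each state under the current policy, so long-run values follow from a single matrix product rather than iterated one-step simulation:

```latex
% Successor representation under policy \pi with one-step transition matrix T^{\pi}:
M^{\pi} \;=\; \sum_{t=0}^{\infty} (\gamma T^{\pi})^{t} \;=\; (I - \gamma T^{\pi})^{-1},
\qquad
V^{\pi}(s) \;=\; \sum_{s'} M^{\pi}(s, s')\, r(s').
% Because M^{\pi} is learned under a particular policy, these values remain
% accurate only so long as the predicted future state encounters are stable.
```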

However, recent theoretical work on linear RL suggests an alternative implementation of MB computation in the brain, in which a trade-off against SR-like simplification occurs intrinsically within a single computation. In linear RL, values are predicted under assumed multistep dynamics, similar to the SR, but with an additional softmax-like nonlinearity that can, in the limit, mimic full MB computation. This trade-off, in turn, carries a concrete cost (e.g., in terms of spikes or time): more MB-like values require larger, more precise decision variables. On this view, tuneable MB vs. SR trade-offs are inherent to world modeling, and potentially as fundamental as MB vs. MF. This is thus a natural framework for considering rational meta-control of the sort we find evidence for here.
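One generic way to write such an interpolation (a sketch of a log-sum-exp, soft-Bellman recursion with a temperature parameter and a default policy; this notation is an assumption rather than a quotation from the linear RL work) is:

```latex
% Soft (log-sum-exp) Bellman recursion with temperature \lambda and default policy \pi_d:
v(s) \;=\; r(s) \;+\; \lambda \,\log \sum_{s'} \pi_d(s' \mid s)\, \exp\!\big( v(s') / \lambda \big)
% \lambda \to 0:      approaches the maximum over successors (MB-like recursive
%                     maximization), at the price of larger, more precise decision variables.
% \lambda \to \infty: approaches the on-policy average under \pi_d (SR-like averaging).
```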
