Revisiting the notion of "State" in World Models - what does it really mean?

A recent Substack article on world models by Fei-Fei Li suggests a taxonomy of world models: renderers, simulators and planners. Embedded in the article is a brief passage on the notion of “state”, paraphrased below:

“The word “state” needs unpacking, because the meaning shifts from field to field. This is not the chemist’s state, the difference between solid, liquid, and gas. This is the physicist’s and roboticist’s state: a complete description of what is happening in the world at a given moment, including every object, every position, every velocity, every property. State is the underlying reality of the world; complete in principle, but never directly visible to any agent inside it. Observations are an agent’s partial view of that reality. Actions are what the agent does in response.”

The above description attempts to simplify the notion of “state”, focusing on the local view of an agent or robot, the observability of state, and so on. The article also hints that a complete understanding of state is essential to building a “reliable” simulator of the physical world, but suggests that this is not possible because the world is not fully observable. In our view, the success of reliable world models for any task fundamentally depends on getting this notion of “state” right without sacrificing determinism. We briefly outline why.

First, to review the basics of the notion of “state”, readers can refer to the following essay, written nearly a decade ago, on the computational notion of state. We never define it explicitly, but we have a comfortable understanding of what it means in a given context. As the research on “world models” unfolds, there are a few fundamental issues to contend with:

Another comment in the aforementioned article is: “Where language models learn the statistical structure of text, world models learn the statistical structure of space and time.” Is the physical world really non-deterministic, or is it our “modeling” of it that is non-deterministic and hence statistical? Is dropping determinism as a requirement for autonomous systems a good thing? Without a human in the loop, would anyone be comfortable with an autonomous surgical robot performing kidney surgery?
We believe the notion of state is not restricted to what a robot needs. It should instead be seen as a “network of different state models”, each covering some aspect of the context. Consider a robot boiling water in a pan. Apart from the state of its manipulators, does the robot need to know the “state” of the water in the pan: is it cold, warm, or boiling and turning to vapor? The point is that the dynamics of a context may be provided by a “network” of dynamic state models that assess and monitor that context. The semantics of each state model may differ, as may the dynamics of state in each sub-context. How do we pick and choose the set of basic state models for a given context?
Transitions between “states” across these “state model” networks may involve complex underlying phenomena with explicit and implicit events. Explicit events may be triggered by agent-driven actions. Implicit events occur seemingly on their own, arising from unmodelled dependencies or from the lack of actual causal or correlational models.
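To make the pan-of-water example concrete, here is a minimal Python sketch of two coupled state models. Everything in it is an illustrative assumption rather than a proposed design: the names StateModel, make_network and step, the three water states, and the rule that the water heats whenever the burner is on. It only aims to show an explicit event (the agent turns the burner on) alongside an implicit one (the water changes state without any agent action).

```python
from dataclasses import dataclass, field


@dataclass
class StateModel:
    """A small finite-state model: a current state plus a transition table."""
    name: str
    state: str
    transitions: dict = field(default_factory=dict)  # (state, event) -> next state

    def fire(self, event):
        """Apply an event; return True only if the state actually changed."""
        next_state = self.transitions.get((self.state, event))
        if next_state is None:
            return False
        self.state = next_state
        return True


def make_network():
    """Build two hypothetical state models for the pan-of-water example."""
    burner = StateModel("burner", "off", {
        ("off", "turn_on"): "on",
        ("on", "turn_off"): "off",
    })
    water = StateModel("water", "cold", {
        ("cold", "heat_absorbed"): "warm",
        ("warm", "heat_absorbed"): "boiling",
        ("boiling", "heat_lost"): "warm",
        ("warm", "heat_lost"): "cold",
    })
    return {"burner": burner, "water": water}


def step(network, explicit_event=None):
    """Advance the network one tick and log which events changed a state."""
    log = []
    burner, water = network["burner"], network["water"]
    # Explicit event: an action the agent chose to take on the burner.
    if explicit_event is not None and burner.fire(explicit_event):
        log.append(("explicit", explicit_event))
    # Implicit event: nobody commands the water; it follows from the coupling
    # between the two models (assumed here to be a simple on/off heating rule).
    implicit_event = "heat_absorbed" if burner.state == "on" else "heat_lost"
    if water.fire(implicit_event):
        log.append(("implicit", implicit_event))
    return log


network = make_network()
for action in ["turn_on", None, None, "turn_off"]:
    events = step(network, action)
    print(action, events, network["water"].state)
```

If the coupling rule inside step were removed, an observer of the water model alone would see it change state for no visible reason, which is one way an unmodelled dependency shows up as an implicit event.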
Are the current “observables” in any domain enough to understand the context? Are all physical measurables enough to model the phenomena at hand? Do we even know what has not been captured? State models are built with a modeling objective in mind, namely to capture some dynamics of interest in the target system, so by definition they do not capture everything. Additionally, capturing every detail does not guarantee a better state transition model.
Partial observability is a modeling and data issue, not a limitation of the underlying phenomena, which are fully deterministic. Given this, what are “world models” really capturing? What is their fidelity? What promises does a world model make to an engineer using it? To understand this, it may be worthwhile to review this article on LLMs and “models”.
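A toy Python illustration of the first point, under stated assumptions: the system is a particle bouncing between walls at positions 0 and 3, its full state is (position, velocity), and the function names step, rollout and successors are invented for this sketch. The dynamics are fully deterministic, yet an observer who records only the position sees what looks like a non-deterministic process.

```python
from collections import defaultdict


def step(state):
    """Deterministic dynamics over the full state (position, velocity)."""
    pos, vel = state
    new_pos = pos + vel
    if new_pos < 0 or new_pos > 3:  # bounce off a wall: reverse direction
        vel = -vel
        new_pos = pos + vel
    return (new_pos, vel)


def rollout(initial_state, n_steps):
    """Run the dynamics forward and return the visited full states."""
    trajectory = [initial_state]
    for _ in range(n_steps):
        trajectory.append(step(trajectory[-1]))
    return trajectory


def successors(trajectory, observe):
    """Map each observed value to the set of observed values that followed it."""
    table = defaultdict(set)
    for current, following in zip(trajectory, trajectory[1:]):
        table[observe(current)].add(observe(following))
    return dict(table)


trajectory = rollout((0, 1), 20)
full = successors(trajectory, observe=lambda s: s)         # sees position and velocity
partial = successors(trajectory, observe=lambda s: s[0])   # sees position only

print("full state   :", full)
print("position only:", partial)
```

With the full state, every state has exactly one successor. With position alone, position 1 is followed sometimes by 0 and sometimes by 2, so a model fitted to those observations would have to be statistical even though nothing random is happening.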
Building a world model from “captured training data” seems limiting by definition. How do we know we have enough training data to call a world model “complete”, or even to assess how far it is from complete and what is being missed along the way? What are the failure modes of a world model? When and how can it fail? How do we resolve or patch a failure so that reasoning and acting can continue? How much of this “filling the holes” depends on the notion of state and transitions? How does one reason with a world model? Can a world model be run in reverse? What is the notion of equilibrium in a world model? How does a world model transition from a stable equilibrium to a dynamic regime and back to equilibrium? We also have to contend with the “frame problem” and its variants as defined in classical AI planning: how do we model it in modern world models, or do we not need to? This issue remains relevant even for transformer-based connectionist models.
Building world models from visual, textual and audio data alone makes the “modeling” problem that much more acute and risky. How do we find gaps, such as those introduced when going from video to spatial coordinates, or aliasing caused by discretizing and sampling analog observables? Do we take our off-the-shelf analytical models and run numerical simulations for every initial and boundary condition, or run physical experiments across all possible combinations of inputs? Any error in the world model is amplified many-fold in a simulator, rendering engine or planner built on it. Debugging failures in these dependent systems can be extremely cumbersome. Going from stochastic models to deterministic models is much harder, because we do not know a priori what causes the non-determinism in the physical world. An abundance of compute may not necessarily lead to high-fidelity models.
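The aliasing concern can be shown in a few lines of Python. This is a sketch with assumed numbers (a 10 Hz sampling rate and two cosine signals at 1 Hz and 9 Hz); the helper name sample is invented for illustration. Once sampled, the two signals produce the same sequence of values, so no amount of downstream modeling can tell them apart from the samples alone.

```python
import math


def sample(freq_hz, rate_hz, n_samples):
    """Sample cos(2*pi*f*t) at times t = k / rate_hz for k = 0 .. n_samples - 1."""
    return [math.cos(2 * math.pi * freq_hz * k / rate_hz) for k in range(n_samples)]


rate_hz = 10.0                      # assumed sensor sampling rate
slow = sample(1.0, rate_hz, 50)     # a 1 Hz signal
fast = sample(9.0, rate_hz, 50)     # a 9 Hz signal, above rate_hz / 2

# Largest difference between the two sampled sequences (floating-point noise only).
max_gap = max(abs(a - b) for a, b in zip(slow, fast))
print(f"largest sample difference: {max_gap:.2e}")
```

The two sequences agree because cos(2π·9·k/10) = cos(2π·k − 2π·k/10) = cos(2π·k/10) for every integer k. A world model trained on such samples would inherit the ambiguity without any signal that information had been lost.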
Finally, a network of state models can be built from multiple perspectives: a single agent modeling its own observations (local state) versus a group of agents (shared state). Alternatively, it can be built from the perspective of an omniscient agent that has a “global” view of the context rather than a local one. Consider a “central traffic coordination agent”: its state model is the union (network) of all local state models at each traffic intersection, whereas a local autonomous car’s state model covers only the traffic junction it is currently at. Building such networked state models is an open issue. They are also “physical world models”, but they understand the “world” from a different point of view and for a different objective.
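A minimal Python sketch of the traffic example, assuming each junction's local state can be summarized by a signal color and a queue length. The class names JunctionState and TrafficCoordinator and their fields are hypothetical, and a real coordinator would also have to deal with stale or conflicting reports, which this sketch ignores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JunctionState:
    """Local state at one intersection (fields are illustrative)."""
    signal: str
    queued_cars: int


class TrafficCoordinator:
    """Global view: a mapping from junction id to that junction's local state."""

    def __init__(self):
        self.junctions = {}

    def report(self, junction_id, state):
        """A junction publishes its latest local state to the coordinator."""
        self.junctions[junction_id] = state

    def local_view(self, junction_id):
        """What a car at one junction sees: only that junction's state."""
        return self.junctions[junction_id]

    def most_congested(self):
        """A question that only the global view can answer."""
        return max(self.junctions, key=lambda j: self.junctions[j].queued_cars)


coordinator = TrafficCoordinator()
coordinator.report("A", JunctionState(signal="red", queued_cars=7))
coordinator.report("B", JunctionState(signal="green", queued_cars=2))

print(coordinator.local_view("B"))     # the local perspective
print(coordinator.most_congested())    # the global perspective
```

The same underlying data supports two different state models here: the car needs only its own entry, while the coordinator's objective requires comparing entries across junctions.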

We believe that building world models from data alone is a complex endeavor and that much remains to be figured out, especially the semantics of the “state” that makes up a given context. Research teams are embarking on this effort with different notions of state, system dynamics and more. We believe getting our basic notions right is an important first step; otherwise, a good deal of resources may be spent rediscovering old unsolved problems!