A recent Substack article on world models by Fei-Fei Li suggests a taxonomy of world models: renderers, simulators and planners. Embedded in the article is a brief blurb on the notion of “state”, paraphrased below:
“The word “state” needs unpacking, because the meaning shifts from field to field. This is not the chemist’s state, the difference between solid, liquid, and gas. This is the physicist’s and roboticist’s state: a complete description of what is happening in the world at a given moment, including every object, every position, every velocity, every property. State is the underlying reality of the world; complete in principle, but never directly visible to any agent inside it. Observations are an agent’s partial view of that reality. Actions are what the agent does in response.”
The description above simplifies the notion of “state”, focusing on the local view of an agent or robot and on the observability of state. The article also hints that a complete understanding of state is essential to building a “reliable” simulator of the physical world, but suggests that this is not possible because the world is not fully observable. In our view, the success of reliable world models for any task fundamentally depends on getting this notion of “state” correct without sacrificing determinism. We briefly outline why.
First, to get the basics of the notion of “state”, readers can review the following essay, written nearly a decade ago, on the computational notion of state: we never define it explicitly, but have a comfortable understanding of what it means in a given context. As the research on “world models” unfolds, there are a few fundamental issues to contend with:
- Another comment in the aforementioned article is: “Where language models learn the statistical structure of text, world models learn the statistical structure of space and time”. Is the physical world really non-deterministic, or is it our “modeling” of it that is non-deterministic and hence statistical? Is giving up determinism as a requirement for autonomous systems a good thing? Without a human in the loop, would anyone be comfortable with a surgical robot performing kidney surgery?
- We believe the notion of state is not restricted to what a robot needs. It should be seen as a “network of different state models”, each covering some aspect of understanding context. Consider a robot boiling water in a pan: apart from the state of the robot manipulators, does it need to know the “state” of the water in the pan? Is it cold? Is it warm? Is it boiling or turning to vapor? The point is that the dynamics of context may be provided by a “network” of state models that assess and monitor context. The semantics of each state model may be different. How do we pick and choose this set of basic state models?
- Transitions between “states” across these networks may have complex underlying phenomena, with explicit and implicit events. Explicit events may be triggered by agent-driven actions. Implicit events occur “auto-magically”, driven by unmodelled dependencies or arising from the lack of actual causal or correlational models.
- Are the current “observables” in any domain enough to understand the context? Are all physical measurables enough to model the phenomena at hand? Do we even know what has not been captured? State models are built with a modeling objective in mind, to capture some dynamics of interest in the target system, so by definition they do not capture everything. Additionally, capturing every detail does not by itself improve the quality of the state transition model.
- Partial observability is a modeling and data issue, not a limitation of the underlying phenomenon, which is fully deterministic. Given this, what are “world models” really capturing? What is their fidelity? What promises does a world model make to an engineer using it? To understand this, it may be worthwhile to review this article on LLMs and “models”.
- Building a world model based on “captured training data” seems limiting by definition. How do we know we have enough training data to call a world model “complete”, to assess how far it is from being complete, or to know what is being missed on the path to completeness? What are the failure modes of a world model? When and how can it fail? How do you resolve or patch a failure so that reasoning and acting can continue? How much of this “filling the holes” depends on the notion of state and transitions? How does one reason with a world model? Can a world model be run in reverse? What is the notion of equilibrium in a world model? How does a world model transition from a stable equilibrium to a dynamic system and back to equilibrium? We also have to contend with the “frame problem” and its variants as defined in classical AI planning: how do we model this in modern world models, or do we not need to?
- Building world models from visual, textual and audio data alone makes the “modeling” problem that much more acute and risky overall. How do we find gaps, such as those introduced in going from video to spatial coordinates, aliasing issues and more? Do we use our off-the-shelf analytical models and run numerical simulations for every initial and boundary condition, or run physical experiments across all possible combinations of inputs? Any errors in the world model amplify the errors in a simulator, renderer or planner many-fold. Debugging failures in these dependent systems can be extremely cumbersome. Going from stochastic models to deterministic models is much harder, because we do not know a priori what causes the apparent non-determinism in the physical world. Having an abundance of compute may not necessarily lead to high-fidelity models.
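To make the “network of state models” idea and the explicit/implicit event distinction above concrete, here is a minimal Python sketch of the water-boiling example. Everything in it is an illustrative assumption rather than a proposed design: the class names (`BurnerModel`, `WaterModel`, `ContextNetwork`), the temperature thresholds and the toy heating rate are hypothetical. The burner changes state only through explicit agent actions, while the water changes state implicitly as a consequence of dynamics that no action names directly.

```python
from dataclasses import dataclass, field
from enum import Enum


class Burner(Enum):
    OFF = "off"
    ON = "on"


class Water(Enum):
    COLD = "cold"
    WARM = "warm"
    BOILING = "boiling"


@dataclass
class BurnerModel:
    """State model that changes only through explicit, agent-driven events."""
    state: Burner = Burner.OFF

    def apply(self, action: str) -> None:
        if action == "turn_on":
            self.state = Burner.ON
        elif action == "turn_off":
            self.state = Burner.OFF


@dataclass
class WaterModel:
    """State model whose transitions are implicit: they follow from dynamics."""
    temp_c: float = 20.0

    @property
    def state(self) -> Water:
        # Hypothetical thresholds chosen only for illustration.
        if self.temp_c >= 100.0:
            return Water.BOILING
        if self.temp_c >= 40.0:
            return Water.WARM
        return Water.COLD

    def step(self, burner: Burner) -> None:
        # Toy deterministic dynamics (assumed): +20 C per tick when heated,
        # -5 C per tick otherwise, clamped to the range [20, 100].
        if burner is Burner.ON:
            self.temp_c = min(100.0, self.temp_c + 20.0)
        else:
            self.temp_c = max(20.0, self.temp_c - 5.0)


@dataclass
class ContextNetwork:
    """A two-node network of state models with a log of observed events."""
    burner: BurnerModel = field(default_factory=BurnerModel)
    water: WaterModel = field(default_factory=WaterModel)
    log: list = field(default_factory=list)

    def tick(self, action=None):
        before = self.water.state
        if action is not None:
            self.burner.apply(action)
            self.log.append(("explicit", action))
        self.water.step(self.burner.state)
        after = self.water.state
        if after is not before:
            # No agent action caused this directly: it is an implicit event.
            self.log.append(("implicit", f"water:{before.value}->{after.value}"))
        return (self.burner.state, after)


def run(actions):
    """Replay a list of actions (None means 'do nothing') from a fresh network."""
    net = ContextNetwork()
    trace = [net.tick(a) for a in actions]
    return trace, net.log
```

Calling `run(["turn_on", None, None, None, None])` replays one explicit action followed by four idle ticks. Because the sketch is fully deterministic, replaying the same actions yields the same trace and the same event log. If the water temperature were hidden from an observer, the same implicit transitions would still occur but would look unexplained, which is the sense in which partial observability is treated above as a modeling issue.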
We believe that building world models from data alone is a complex endeavor, and much remains to be figured out. Existing research teams are embarking on this effort with different notions of state, system dynamics and more. We believe getting our basic notions correct is an important first step.
