LLMs – both the vanilla ones and the latest variants with different kinds of reasoning capabilities – are being positioned as a viable “model” of the reality around us. The long-term vision is that such a model would enable us to perform a wide variety of “tasks” across a wide variety of domains – software, robotics, teaching and more. Given the existing claims, this brief essay explores how good a “model” an LLM-based system would be. Over the past five centuries, our scientific and engineering legacy has developed a fairly good notion of “what a good model should be”. We revisit these criteria and then evaluate LLMs against them.
What are the different types of “Models”?
Models may be classified in two major ways – a) By the domains or “areas” of subject matter they aim to capture and reflect and b) By the “types” of modeling “formalisms” or model frameworks – i.e. the structure/behavior/utility of the model representation itself.
Domain models
- Physics/Chemistry models – A good model in physics simplifies complex reality to provide clear insights, accurately explaining known data while making testable, falsifiable predictions. It prioritizes parsimony (simplicity) and logical consistency, balancing accuracy with tractability. Key criteria include predictive power, explanatory depth, and robustness across varied contexts. A model in chemistry is a simplified, visual, or mathematical representation of atomic or molecular systems used to explain and predict chemical behavior.
- Biological models – A model in biology is a simplified representation—physical, conceptual, or computational—of a complex biological system, organism, or process used to understand, simulate, and predict its behavior. These tools allow researchers to study intricate biological phenomena (e.g., genetics, disease) in manageable systems.
- Engineering models – Engineering models are essential, abstract representations—physical, mathematical, or process-based—used to design, analyze, test, and predict the behavior of systems before construction. They enable engineers to optimize designs, verify functionality, and reduce costs in fields like mechanical, civil, and aerospace engineering.
- Social science models – Social science models are visual, conceptual, or formal representations used to understand complex human, environmental, or historical systems.
Models in terms of their “formalisms”
- Mathematical models – A mathematical model is a representation of a real-world system, process, or phenomenon using mathematical concepts like equations, variables, and formulas. These models enable scientists and engineers to analyze, explain, and predict the behavior of complex systems—ranging from physical, economic, to biological processes—often serving to optimize decision-making and test scenarios.
- Computational/Simulation models – A computational model is a mathematical simulation used to study, predict, and understand complex systems by adjusting variables, often enabling virtual experiments. These models integrate math, physics, and computer science to simulate scenarios—such as infectious disease tracking, drug design, or engineering analysis—that are often impossible or difficult to perform in a lab.
- Physical models – A physical model is a tangible, constructed replica (possibly scaled down or up) of a system, object, or concept used for visualization, experimentation, and validation in engineering, science, and design. These models can be scale representations (smaller/larger) or life-size, allowing researchers to observe behavior, test structural integrity, and analyze complex phenomena safely, such as during seismic tests or hydraulic simulations.
- Statistical models – A statistical model is a mathematical representation, often an equation, that describes the relationship between random variables and represents the data-generating process. It identifies patterns, tests hypotheses, and predicts future outcomes based on data. These models are characterized by their structure, typically dividing variables into dependent (outcome) and independent (predictor) components.
- Symbolic models – A symbolic model is an AI approach that represents knowledge using explicit, human-readable symbols (variables, rules, logic) to model concepts and relationships. It uses symbolic reasoning to infer conclusions, often employed in expert systems, formal verification (model checking), and cognitive simulation for high precision and transparency.
- ML models – Data-driven ML models such as classifiers, neural nets, etc. are also models, capturing some “statistical” regularity in the data provided for a given domain. At some level of abstraction, they are large-scale “function fitters” in the mathematical sense.
- Textual, image, video, audio models – Real-world scenarios described in terms of text, images, video or audio are also, at some level, a “model”. For example, consider a vehicle crash on a highway. The crash report and the videos and images from onboard cameras and street cameras “constitute” a model. We use these models to reconstruct the “crash” and more.
- A computer program or piece of code is also a model – a model of a process, or of the dynamics and structure of a system.
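To make the “function fitter” view of ML models above concrete, here is a minimal sketch (the target function, sample size and polynomial degree are illustrative choices, not from the text): a polynomial fitted to noisy samples of an underlying function behaves like a small-scale version of what data-driven models do.

```python
import numpy as np

# Underlying "reality": a function the model does not get to see directly.
rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 50)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)  # noisy observations

# A data-driven "model": a degree-5 polynomial fitted to the samples.
model = np.poly1d(np.polyfit(x, y, deg=5))

# The fit captures the statistical regularity inside the data's range...
in_domain_error = float(np.max(np.abs(model(x) - np.sin(x))))

# ...but, like any pure function fitter, extrapolates poorly outside it.
out_of_domain_error = float(abs(model(3 * np.pi) - np.sin(3 * np.pi)))
```

The in-domain error stays small while the extrapolation error grows quickly, a toy illustration of the “domain of applicability” concern that recurs in the criteria below.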
Given the wide variety of models listed above, it is important to also understand the following:
- Models are developed to understand, explain, predict, and analyze a “particular” part of the real world, either in its entirety or in some specific aspect. They are used to discover possible cause-effect relationships, suggest experiments (where possible), suggest solutions to fix some issue, etc.
- Models usually focus on only a particular aspect of reality. In engineering, multiple models are usually used to understand the artifact under study. For example, apart from VLSI models for understanding a chip’s logical behavior, one also does mechanical and thermal analysis to understand power dissipation and more.
- Models lead to physical artifacts and also the need to understand different kinds of physical artifacts drives modeling. This loop suggests experiments, data collection efforts and ensuing analyses.
- Given the overall notion of “reductionism”, it is rare to find models that address multiple objectives in a single instance. Aggregate behaviors of systems are much harder to study and model than individual aspects.
- Models describing natural systems or artificial man-made systems usually address the following: a) the static structure of a system – its components, their relations to each other, its physical structure, and the properties of that static structure; b) the dynamic aspects of a system in time and space as it goes about its functionality – including “processes” encapsulated in a system or executed by a system or its components; c) the behavior of the system under different input, output and ambient conditions; and d) finally, how the system’s internals and externals deliver its functionality, how to maintain the system, how it interacts with other systems in its environment, etc.
- Finally, models facilitate different kinds of “manipulation” – what-ifs and counterfactuals – to understand potential real-world behavior. This capability is what makes a model useful, as it allows for envisioning scenarios/predictions that have not yet occurred in the real world or that we have no data about.
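The what-if/counterfactual point can be illustrated with a minimal sketch (a toy discrete-time SIR epidemic model; all parameter values here are illustrative assumptions): the same model is rerun under a hypothetical intervention that was never observed in reality.

```python
def sir_peak_infected(beta, gamma=0.1, days=200, s0=0.99, i0=0.01):
    """Discrete-time SIR epidemic model; returns the peak infected fraction."""
    s, i, r = s0, i0, 0.0
    peak = i
    for _ in range(days):
        new_inf = beta * s * i   # new infections this step
        new_rec = gamma * i      # recoveries this step
        s, i, r = s - new_inf, i + new_inf - new_rec, r + new_rec
        peak = max(peak, i)
    return peak

# Factual scenario: transmission rate beta = 0.3.
baseline_peak = sir_peak_infected(beta=0.3)

# Counterfactual "what-if": an intervention halves transmission.
intervention_peak = sir_peak_infected(beta=0.15)
```

The model lets us compare the baseline against a scenario that never occurred and for which we have no data – exactly the manipulability that makes a model useful.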
So even an LLM is a model – a large language model (LLM) is a computational model trained on vast amounts of data, designed for natural language processing, image, video and audio processing tasks, especially language generation. This notion has been “generalized” and “anthropomorphized” to suggest that LLMs can be used to model all kinds of “phenomena” and more. The implicit rationale is that “all models” are encoded/represented as text/video/images/audio, and if such models are “encoded” in an LLM, the LLM can model most of reality. The key thing to realize is that LLMs have minimal machinery for “model manipulatability” – which limits their “reasoning capabilities” to a large degree.
Criteria for Evaluating Models
As outlined above, “domain” models and “model formalisms” each have their own set of criteria. These criteria provide some guidance on how good a model is – whether to understand and explain a phenomenon, predict a phenomenon, or even design and build a widget that exhibits that phenomenon. An understanding of both these sets of criteria is a first step toward evaluating how good a model an LLM is, as it aims to combine both perspectives. The criteria below are not exhaustive across all domains of knowledge but reflect at least the major ones that any model should support.
Criteria for “Domain Models”
Key criteria for a good domain model include:
- Explanatory and Predictive Power: A model must explain existing observations and make accurate, quantifiable, and verifiable predictions about future experiments.
- Simplicity (Parsimony): Based on Occam’s Razor, the best model is often the one that explains the data with the fewest parameters and assumptions.
- Falsifiability: The model must be capable of being proven wrong through experimentation; if it cannot be tested, it is not a scientific model.
- Tractability and Utility: It must be usable and solvable (either analytically or numerically) to provide practical, intuitive, or computational insight.
- Consistency: It should be consistent with well-established laws of physics/chemistry/biology or the engineering phenomena and compatible with other established theories.
- Domain of Applicability: A good model clearly defines its limits and when it breaks down (e.g., Newtonian mechanics at high speeds). Models should be well-grounded with the “correspondence” to reality well defined.
- Reproducibility: The phenomena being described by the model should be reproducible under different contexts. Context conditions outline the “regimes” of model validity.
- Robustness: The model should be able to describe a set of “closely” related phenomena. Such a model allows for equivalent explanations as the phenomenon is perturbed under different conditions.
- Elegance: While subjective, a “beautiful” or elegant model often unifies diverse phenomena under a single, simple framework. This is a classical notion in modern science.
Criteria for Modeling Frameworks
- Logical Consistency & Transparency: Models must have clear, explicit, and definite assumptions. The steps from premises to conclusions must be transparent, allowing other researchers to replicate the findings.
- Usability/Re-usability: Models must be usable/reusable in different contexts as a subsystem providing the same fidelity.
- Representativeness: It must adequately capture the cause-and-effect relationships of the system (mechanistic models) or effectively map inputs to outputs (empirical models).
- Composability: Complex systems may be modelled by composing simpler subsystem models, which facilitates different kinds of reductionist reasoning and the simulation of aggregate behavior and functionality.
- Completeness and Soundness of the Inferential Machinery/Computability: The reasoning machinery supported by the model (during its manipulation or otherwise) should lead to valid answers (soundness) and cover all aspects of what is “implicitly” suggested by the model (completeness).
- Identifiability: Parameters should be unique and estimable from the available data. When many parameters are unknown, the modeler must perform sensitivity analysis (e.g., checking for “sloppy” vs. “stiff” parameters).
- Interpretability: The results of the model should be easily understood in the context of the phenomena it represents. Mechanistic evaluation should lead to identifying “repeat” behaviors and a sense of correlation with the real-world phenomena.
- Stability: Models must be stable – different kinds of manipulations should lead to the same conclusions in different contexts.
- Tunability: Models must support incremental “granularity” and behave consistently at different resolutions of the problem context.
- Efficiency and Cost: The cost of constructing the model (time, resources) must be weighed against its predictive accuracy.
- Traceability and Interoperability: Elements must be clearly connected to requirements, allowing for validation, and often must be in open standard formats to be used across different tools.
- Consistency and Unambiguity: Models must avoid contradictions and redundancies, using standardized terminology and structures to ensure consistency.
- Generalization: The model performs well on new, unseen data, not just the training data.
- Performance Metrics: Models should come with well-defined performance metrics – such as precision/recall for search, convergence criteria for “gradient descent”, resource-optimization metrics, etc.
- Model-development Efficiency: Fast training and prediction times for “data-driven” models.
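As a concrete instance of the performance-metrics criterion, here is a minimal sketch computing precision and recall for a retrieval task (the document IDs are made up for illustration):

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved items that are relevant.
    Recall: fraction of relevant items that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_pos = len(retrieved & relevant)
    precision = true_pos / len(retrieved) if retrieved else 0.0
    recall = true_pos / len(relevant) if relevant else 0.0
    return precision, recall

# A search system returns 4 documents; 3 of the 5 relevant ones are among them.
p, r = precision_recall(retrieved=["d1", "d2", "d3", "d7"],
                        relevant=["d1", "d2", "d3", "d4", "d5"])
# p = 3/4 = 0.75, r = 3/5 = 0.6
```

Such metrics give a model a quantitative yardstick; the trade-off between the two (returning more items raises recall but typically lowers precision) is itself a modeling decision.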
Evaluating LLMs as “Models” of Reality
The above sets of criteria provide a canvas on which to assess LLMs both as models in a domain and as a modeling framework. Our current assessment is outlined in the table below. Column 1 lists the criteria and Column 2 the LLM status with respect to each criterion. The status is currently qualitative: high means the LLM exceeds the requirement, low means it does not meet the criterion adequately, medium means it meets the criterion partially, and mixed means it meets or misses the criterion under different contexts. Column 3 provides some additional comments on our rationale.
| Criteria | LLM Status | Notes |
|---|---|---|
| Explanatory & Predictive Power | Mixed | In textual mode, the past may be explained reasonably well, but future predictions have low fidelity. For non-textual data, it is still not clear how LLMs will perform. |
| Simplicity (aka Parsimony) | Low | A SOTA LLM is anything but simple, with billions of parameters and complex NN architectures. Occam’s Razor is violated from many perspectives. |
| Falsifiability | High | Easily falsifiable in many contexts. |
| Tractability & Utility | Mixed | Usable in “text generation” mode, coding, etc., but in general domains it is still early days. |
| Consistency | Low | LLMs can hallucinate at any time |
| Domain of Applicability | High | Many domains with “textual/symbolic” sequential encodable knowledge can be modelled. |
| Reproducibility | Low | Varies with the dataset the model is trained on – probability estimates will vary across two LLMs trained with the same data and the same settings. |
| Robustness | Low | Very sensitive to training data sets, inference mechanisms |
| Elegance | Low | From a traditional perspective, it is quite clunky. Too many moving pieces. |
| Logical Consistency & Transparency | Low | Notions of consistency/replicability do not apply in the context of LLMs. Research areas such as alignment and interpretability attempt to address some of these issues. |
| Usability/Re-usability | Mixed | Usability is high at a certain level of abstraction and in some domains. If anything changes, the whole LLM needs to be retrained, so re-usability is not possible. |
| Representativeness | Mixed | This is a major R&D area in the context of reasoning. LLMs aim to capture the world’s knowledge, but evaluation against this criterion is a big issue. |
| Composability | Low | Following up on re-usability – LLMs as a whole are non-composable. Though one talks of components such as encoders/decoders/heads, conceptually they are all tightly linked; no single subsystem can be used as-is in another system. Weights are “fully” exported in open-weight LLMs. |
| Computability | High | The notions of reasoning, inference and state-based evaluation are major research areas. Continual learning is also an R&D area. |
| Identifiability | Low | Final set of parameters depend on the Neural Net stack and components. There is no notion of minimal number of parameters for a certain level of overall performance or efficiency. |
| Interpretability | Low | In domains like coding it is well understood. In many other domains, humans in the loop provide the interpretations. |
| Stability | Low | LLMs are highly sensitive to prompts both at inference time and during training. |
| Tunability | Low | Really fine-grained tuning is not possible. Supervised fine tuning is relatively more focused than a pre-trained LLM. From an output point of view, one has very little control. |
| Cost & Model Efficiency | Low | Development and runtime costs are very high compared to the quality of output. |
| Traceability/Interoperability | Low | Interoperability is through weight exports. Traceability is still a research problem. |
| Consistency/Unambiguity | Low | LLMs are prone to inconsistencies and hallucinations. |
| Generalizability | Mixed | Generalization beyond the training data is low on the fat tails of knowledge in a domain. |
| Performance Metrics | Low | Metrics are handcrafted and vary across benchmarks. |
| Model-development Efficiency | Low | The cost of developing pre-trained and fine-tuned models runs into millions of dollars. |
A review of the above table helps us understand a few LLM trends and outline possible future scenarios for their evolution and adoption. The table also suggests why reactions are so varied across domains when experts interact with LLMs – expectations of LLMs vary across the different criteria listed above.
Firstly, the table above suggests a framework for “doing” AI evals: one can build test cases addressing each of these criteria in different domains. As past knowledge gets assimilated into an LLM, the table also suggests the “operating boundary” for an LLM, where a human has a critical role to play. The table further suggests various avenues for LLM research in the small and the large – for example, how does one make LLMs composable (without retraining the whole machinery)?
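To sketch the eval-framework idea, here is a minimal harness probing one criterion (consistency) by asking a model the same question repeatedly; `query_llm` and `stub_llm` are hypothetical stand-ins, not any real model API.

```python
def consistency_eval(query_llm, question, n_trials=3):
    """Probe the 'Consistency' criterion: does the model give the same
    answer to the same question across repeated trials?"""
    answers = {query_llm(question) for _ in range(n_trials)}
    return len(answers) == 1  # True when all trials agree

# Hypothetical deterministic stub standing in for a real model endpoint.
def stub_llm(question):
    return "Paris" if "capital of France" in question else "unknown"

passed = consistency_eval(stub_llm, "What is the capital of France?")
```

Analogous test cases could target stability (perturbed prompts), reproducibility (same settings, repeated runs), and the other criteria in the table.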
As LLMs get deployed in various domains, we expect the SOTA models to face the issues outlined above. Domain knowledge needs to be modelled and assimilated before adoption is reliable in a domain. This also raises the issue of identifying what is defensible intellectual property in modern organizations: what is the core competency and what is auxiliary? Overall, much remains to be done before AI-based systems go mainstream at scale – they need to assimilate the knowledge harnessed painstakingly over the past few centuries into an all-knowing framework and learn how to fruitfully utilize it in a given scenario.
