{{htmlmetatags>
metatag-og:image=(::uai-imitation.webp)
metatag-og:description=Many AI frameworks define agency as utility maximization: specify a scalar objective, then learn actions that increase it. A different foundation treats agency as imitation.
}}

{{ ::uai-imitation.webp?nolink&800 |}}
  
====== Universal Artificial Intelligence as Imitation ======
  
Download: Paper - {{ ::universal_ai_imitation_general_audience.pdf |General Audience Version}}
  
> Many AI frameworks define agency as utility maximization: specify a scalar objective, then learn actions that increase it. A different foundation treats //agency as imitation//. An interacting agent acquires //behavioral schemas//, compact first-person rules for how to act and what outcomes to expect, by learning compressive explanations of interaction regularities.
> A single idealized agent is built from three ingredients: (i) a universal, simplicity-biased pattern generator that ranges over all computable ways an interaction could unfold; (ii) a first-person distinction between actions and observations, where only observations count as evidence; and (iii) an event-triggered turn-taking interface that determines, at each step, whether the next output is produced by the world or by the agent. For any world generated by a fixed computable rule, the agent learns to imitate by translating the world’s third-person demonstrations into first-person behavior. In everyday terms, large departures from the counterfactual target, what the world would have produced in the agent’s place next, can occur only finitely many times in an appropriate averaged sense. The same framework can, in principle, support plural and heterogeneous schemas, including utility maximization, without making utility the primitive definition of purpose.
  
Pedro A. Ortega\\
March 2026
  
//Keywords: universal imitation, counterfactual action, third-party action, evidence-transfer, interface.//
  
===== Introduction: learning next steps from consequences =====
  
A common task is learning to cook with a chef. The learner attempts a step, and the kitchen replies with structured evidence: aroma shifts, sound changes, browning, viscosity, texture, and taste, as well as brief corrections or confirmations from the chef. Over repeated trials, competence takes the form of an internal rule: a compact explanation of why some interventions succeed and others fail, and a way to generalize beyond the demonstrated cases. In this setting, success is not naturally described as optimizing a standalone score in isolation. It is better described as learning what tends to //happen after// a chosen step, and using that relationship to produce good outcomes in dishes the learner has not seen before.
  
Cooking instruction also fits an event-triggered turn-taking pattern. At some moments the chef demonstrates the next move, while at other moments the chef expects the learner to proceed, and the learner cannot reliably predict in advance which moments will demand action. When the chef acts, the learner observes third-person demonstrations that have the same “action shape” as what the learner will later need to produce. When the learner acts, the ensuing sensory consequences and the chef’s response provide evidence about whether the chosen next step fit the situation. This setting illustrates how third-person regularities can become first-person competence through interaction.

The same structure appears outside kitchens. In tool use, a learner issues a command or spreadsheet formula on an unfamiliar dataset, and the tool replies with computed values, diagnostics, or errors. Driving norms provide another illustration: a driver entering a new country infers a merging convention from other drivers’ reactions to maneuvers, then applies that convention across many junctions. These examples share a single organizing distinction that will control the discussion: an agent makes choices, the world produces consequences, and learning improves expectations about consequences of choices.
  
<WRAP info>
  * **do(a)**: an action chosen by the agent, an intervention.
  * **see(o)**: an outcome produced by the world in response, evidence.
  * **identify(a)**: an action-shaped fragment embedded in what the world produces, a third-party action present in the data.
  * **imagine(a)**: the action-shaped fragment the world would have produced if the world, rather than the agent, had written the next action-shaped chunk, a counterfactual target.
  
Learning is a change in what an agent expects to **see(o)** after its **do(a)**. The agent’s own **do(a)** is recorded as a choice, not as evidence about which internal explanation is correct.
</WRAP>
===== Reward maximization is powerful, but not mandatory =====
  
Reward maximization became a default foundation for AI because it unifies learning, planning, and evaluation under a single scalar objective. It aligns with historical influences from optimal control and economics, and it fits engineered settings where objectives are stable and measurable. In that tradition, “purpose” is introduced by design: a reward signal is treated as the primitive definition of what the agent is for, and the rest of the system is organized to increase it over time. This approach remains indispensable in many domains. However, the claim here is that reward maximization is a modeling commitment rather than the only coherent starting point for agency.
  
Many forms of real competence are acquired second-hand through structure in interactions: demonstrations, language, norms, and feedback that does not naturally reduce to a single number. Cooking instruction teaches a schema through demonstrations and multi-channel consequences rather than scalar grades after each move. Tool use teaches protocols through outputs and diagnostics that matter in different ways depending on context. Driving norms teach conventions through other drivers’ responses. In these cases, what is learned is usefully described as a //schema//: a reusable rule that produces an appropriate next step in context. Reward can still appear, but it is one observation type among many inside **see(o)**.
  
===== A first-person discipline: actions are choices, not evidence =====
Interactive learning exposes a failure mode when an agent updates on its own actions as if they were evidence. The problem appears whenever two internal explanations agree on the world’s responses for every intervention that might be tried, but differ in the probabilities they assign to which action the agent will take next. If the agent treats **do(a)** as evidence, then whichever explanation assigned higher probability to the realized action receives a spurious boost, even though no new information about the world has been obtained. Beliefs can then drift toward explanations that predict the agent’s sampled tendencies, rather than toward explanations supported by changes in **see(o)**. This matters for any system that represents probabilistic explanations over complete interaction transcripts, because such systems naturally represent “stories” about both what the world will do and what the agent will do.
  
The cooking example makes this point operational. A choice such as changing heat, adding salt, or delaying an ingredient does not, by itself, reveal anything about which internal explanation of the dish is correct; it sets the conditions under which evidence will arrive. Evidence arrives in the subsequent **see(o)**: whether browning accelerates, whether an emulsion holds, how taste balance shifts, and how the chef corrects or confirms the move. Tool use exhibits the same structure: the choice of a command is an intervention, while the tool’s output is evidence about the tool’s behavior under that intervention. A first-person discipline therefore treats **do(a)** as a choice that sets the conditions for evidence, and it updates beliefs only using **see(o)** conditioned on **do(a)**. This separation is the minimal correction that prevents an agent from drifting toward self-confirming explanations when no new world information has arrived.
  
<WRAP info>
  
$$
-\log P(\textbf{see(o)} \mid \textbf{do(a)}, \textbf{context}).
$$
  
The first-person update uses the surprise of **see(o)** given **do(a)** and the context. The model’s own probability assigned to **do(a)** is excluded from evidence because **do(a)** is a choice rather than a fact produced by the world.
</WRAP>
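As a minimal sketch of this discipline (a toy example, not from the paper): two hypothetical explanations H1 and H2 assign identical probabilities to **see(o)** given any **do(a)**, but disagree about which action the agent itself will take. The first-person update leaves their posterior untouched, while a naive update that also scores **do(a)** drifts toward the self-confirming explanation.

```python
# Toy illustration (not from the paper): H1 and H2 agree on P(see(o) | do(a))
# for every action, but disagree about which action the agent will take.

# Shared likelihood: both hypotheses predict o == a with probability 0.8.
def p_obs(hyp, o, a):
    return 0.8 if o == a else 0.2

# Probability each hypothesis assigns to the agent's own action (differs).
p_act = {"H1": {0: 0.9, 1: 0.1}, "H2": {0: 0.1, 1: 0.9}}

prior = {"H1": 0.5, "H2": 0.5}

def update_first_person(belief, a, o):
    """Update on the surprise of see(o) given do(a); do(a) is not evidence."""
    w = {h: b * p_obs(h, o, a) for h, b in belief.items()}
    z = sum(w.values())
    return {h: v / z for h, v in w.items()}

def update_naive(belief, a, o):
    """WRONG: also scores the agent's own do(a) as if the world produced it."""
    w = {h: b * p_act[h][a] * p_obs(h, o, a) for h, b in belief.items()}
    z = sum(w.values())
    return {h: v / z for h, v in w.items()}

fp = update_first_person(prior, 0, 0)  # agent chose a=0, then saw o=0
nv = update_naive(prior, 0, 0)
print(fp)  # beliefs unchanged: no new world information arrived
print(nv)  # H1 spuriously boosted merely for predicting the sampled action
```

The naive posterior moves from 0.5/0.5 to 0.9/0.1 even though both hypotheses explained the observation equally well, which is exactly the drift the first-person discipline rules out.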
  
===== A universal pattern generator plus a turn-taking interface =====
  
With the choice/evidence separation in place, the remaining ingredients are a broad model of interaction and an explicit account of turn-taking. The construction begins with a universal pattern generator: a single idealized predictor that assigns weight to //every// interaction rule that can be generated by a computer program, with a built-in preference for shorter programs. Informally, it is a weighted collection of possible explanations for how an interaction could unfold, where compact explanations receive more initial weight than sprawling ones. In cooking instruction, this corresponds to preferring a concise explanation that predicts how outcomes depend on technique and timing, rather than a memorized list of recipes.
  
This “universal” ingredient supports a broad competence claim. The agent is not tailored to one environment; it ranges over //any// world whose behavior is generated by some fixed computable rule. Different worlds may be easier or harder to learn from the available evidence, but the same architecture applies uniformly. The price of such breadth is that the construction is an idealized limit picture rather than a practical algorithm, because ranging over all programs is not tractable in general. The benefit is conceptual clarity about what “broad capability in principle” means: the agent always retains some weight on the true world rule, as long as it is computable.
Prediction alone does not produce agency, because interaction requires commitments about what is written next. The construction therefore includes an explicit turn-taking interface that partitions the transcript into alternating chunks and determines which side writes the next chunk. When the world writes next, the content is recorded as **see(o)** and treated as evidence. When the agent writes next, the content is recorded as **do(a)** and treated as a choice. This interface perspective matches modern practice: a tool call is a structured intervention whose completion is followed by a structured report, and the boundary between the two defines what counts as a single unit of learning. The agent’s internal update depends on this boundary, because it must treat completed action chunks as interventions and completed world chunks as evidence.
  
Once turn-taking is explicit, the same next-output prediction machinery that forecasts what should come next can also generate an action when the agent must write next. Operationally, the agent samples the next action-shaped chunk from its current predictive distribution, emits it as **do(a)**, observes the subsequent **see(o)**, and updates its beliefs using only what was seen. In this sense, predicting and acting share a single engine, and the engine is updated only by world-produced consequences.
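The loop above can be sketched as a toy simulation, with a simple count-based predictor standing in for the universal mixture (all names and the chunk alphabet are illustrative, not from the paper):

```python
import random

# One shared engine for predicting and acting. The schedule decides who
# writes the next chunk; the model updates only on world-produced chunks.

class NextChunkModel:
    def __init__(self, alphabet):
        self.counts = {c: 1 for c in alphabet}  # Laplace-style prior

    def distribution(self):
        total = sum(self.counts.values())
        return {c: n / total for c, n in self.counts.items()}

    def sample(self):
        dist = self.distribution()
        return random.choices(list(dist), weights=list(dist.values()))[0]

    def observe(self, chunk):
        self.counts[chunk] += 1  # evidence update: see(o) only

def run(model, schedule, world_chunk):
    transcript = []
    for turn in schedule:
        if turn == "world":
            chunk = world_chunk()   # see(o): treated as evidence
            model.observe(chunk)
        else:
            chunk = model.sample()  # do(a): a choice, no belief update
        transcript.append((turn, chunk))
    return transcript

random.seed(0)
model = NextChunkModel(["stir", "season", "rest"])
# The world (the chef) overwhelmingly demonstrates "stir".
transcript = run(model, ["world"] * 50 + ["agent"] * 5, lambda: "stir")
print(model.distribution()["stir"])  # close to 1: demonstrations shaped acting
```

The point of the sketch is structural: `sample` and `observe` read and write the same counts, so the distribution the agent predicts from is the distribution it acts from, and only world turns ever change it.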
  
===== Third-party actions, counterfactual targets, and why turn-taking can teach =====
  
Turn-taking creates two ways an action-shaped fragment can appear in the agent’s experience. Sometimes the world produces material whose content has the same “action shape” that the agent will later be expected to produce. In cooking instruction, the chef’s demonstrated steps and corrections are directly reusable: how to cut, when to deglaze, when to stop reducing, and how to adjust seasoning. In the tags, these are cases where an action-shaped fragment is **identified** inside world-produced material: **identify(a)** is a third-party action present in the data stream.
  
At other times, the agent reaches a point where the next action-shaped chunk must be produced by the agent. The framework defines a comparison target that does not require a teacher to score each attempt: **imagine(a)** is the action-shaped fragment the world //would// have produced next at that point if the world, rather than the agent, had produced it. In cooking instruction, this is the step the chef would have taken next in the learner’s place given the current state of the dish. This target is counterfactual by design. It is not part of the on-path data at the moment the agent acts, but it is well-defined by the world’s generative rule. Performance, in this foundation, is measured by how closely **do(a)** matches the distribution of **imagine(a)** on the turns when the agent must act.
  
Learning from **identify(a)** transfers to **do(a)** when the learner cannot anticipate exactly when it will be required to act. Without a reliable “my turn” signal, the learner uses the same schema for what comes next. When the world supplies the next step, it enters as **identify(a)** evidence; when the learner must supply it, it becomes **do(a)**. In cooking instruction, the learner is sometimes shown the next move and sometimes expected to do it, without knowing which will happen next, so demonstrated moves constrain later performance.
  
===== The universal guarantee: a finite budget of large deviations =====
  
The main guarantee is a limit claim about acting under interventions. Fix a world whose behavior is generated by some fixed computable rule. Consider the universal agent described earlier: it maintains a simplicity-weighted mixture over computable interaction explanations, updates that mixture using only **see(o)** under its chosen **do(a)**, and produces actions by sampling from its current predictive distribution when it must act. Under a chronological turn-taking interface, the agent’s distribution over **do(a)** on its action turns tracks the world’s distribution over **imagine(a)**. The takeaway is that large departures from the world’s targets can occur only finitely many times in expectation, so sustained large divergence cannot persist indefinitely.

This conclusion supports a natural generalization reading in cooking instruction. At first glance, imitation can look like copying the chef’s moves recipe by recipe. The finite-mistake guarantee points to a different outcome: large departures are an early phase and cannot persist. What remains is a compact schema that selects the next move from the situation (heat, timing, texture, taste), so the learner can succeed on dishes that were never demonstrated step-by-step by recombining the learned rules in new contexts.
  
The proof idea can be conveyed as a //budget of surprise//. Because the agent’s beliefs include every computable world rule with some positive initial plausibility, it never falsifies the hypothesis about the true world. As a result, on the stream of world-produced evidence **see(o)**, the mixture cannot remain systematically worse than the true world by more than a world-dependent constant amount of cumulative surprise. When action-shaped fragments occur inside **see(o)**, they contribute to that same accounting. Then, evidence accumulated from demonstrations can be related to the agent’s actions. Since the evidence budget is finite, the expected cumulative discrepancy on the agent’s action turns is bounded as well, yielding the “only finitely many large deviations” conclusion.
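The budget can be checked numerically in a toy Bernoulli world (hypothetical hypothesis set, not from the paper). For any Bayesian mixture, cumulative log-loss exceeds the true rule’s cumulative log-loss by at most the negative log of the true rule’s prior weight, no matter how long the stream runs:

```python
import math
import random

# Toy surprise budget: a three-hypothesis Bayesian mixture predicting a
# Bernoulli stream. Its excess cumulative log-loss over the true rule is
# bounded by -log(prior weight of the true rule).

hypotheses = {"h_fair": 0.5, "h_biased": 0.9, "h_rare": 0.1}  # each: P(x = 1)
prior = {"h_fair": 0.5, "h_biased": 0.25, "h_rare": 0.25}
true_p = hypotheses["h_biased"]  # the world's actual rule

random.seed(1)
posterior = dict(prior)
mixture_loss = truth_loss = 0.0
for _ in range(2000):
    x = 1 if random.random() < true_p else 0
    lik = {h: (p if x == 1 else 1 - p) for h, p in hypotheses.items()}
    p_mix = sum(w * lik[h] for h, w in posterior.items())  # mixture predictive
    mixture_loss += -math.log(p_mix)
    truth_loss += -math.log(lik["h_biased"])
    posterior = {h: w * lik[h] for h, w in posterior.items()}  # Bayes update
    z = sum(posterior.values())
    posterior = {h: w / z for h, w in posterior.items()}

budget = -math.log(prior["h_biased"])  # the finite surprise budget
regret = mixture_loss - truth_loss
print(regret <= budget + 1e-9)  # True: the excess never exceeds the budget
```

The inequality holds deterministically, not just on average: the sequential mixture predictive factorizes into the mixture probability of the whole sequence, which is at least the prior weight of the true rule times the true sequence probability.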
  
The guarantee is not an optimality claim about reward, because no scalar objective is taken as the primitive definition of purpose. Instead, it isolates a different notion of success: learning to act in the way the world itself would have acted next.
  
===== Schemas, pluralism, development, and limits =====
  
Because the object being learned is a rule for what comes next in interaction transcripts, the framework can in principle absorb a wide range of behavioral schemas. Tool protocols, debugging practices, dialogue conventions, and constrained procedures can all be represented as schemas that map contexts to actions, with feedback arriving in **see(o)**. For any computable target schema that maps prompts to action-shaped outputs, there exists a teaching protocol in which the world repeatedly provides prompts and produces consequences, and the universal agent learns to act with only finitely many large mistakes.
  
Reward maximization fits this picture as one schema among many. If rewards appear in **see(o)**, a schema can treat them as salient observations and select actions that tend to improve future observed rewards under an inferred world model. Other schemas prioritize other structures, such as passing tests, following a protocol, satisfying constraints, or conforming to norms. In the same framework, the universal agent can learn multiple schemas along with a strategy to identify which one applies given the context. In one context the response is a moral or procedural norm, apologize, clarify, refuse, or ask before acting; in another it is a cautious, risk-sensitive habit, run the diagnostic, take the reversible step, stay within a safety constraint.
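Schema pluralism and identification can also be sketched concretely. In the toy below (contexts, schema names, and responses are all illustrative stand-ins), three schemas coexist in one interaction stream, and the agent keeps per-context tallies of which candidate matched the world's continuation, then acts by following the best-supported candidate:

```python
from collections import defaultdict

# Hypothetical candidate schemas: each maps a request to a response style.
def norm_schema(req):     return "ask-first"        # procedural norm
def cautious_schema(req): return "reversible-step"  # risk-sensitive habit
def reward_schema(req):   return "maximize"         # reward-driven choice

candidates = {"norm": norm_schema, "cautious": cautious_schema,
              "reward": reward_schema}

# The world pairs each context with the schema that is operative there.
operative = {"support-chat": "norm", "ops-console": "cautious",
             "game": "reward"}

# Per-context tallies of which candidate matched the world's continuation.
scores = defaultdict(lambda: {name: 0 for name in candidates})

def act(context, req):
    # Identify the operative schema: best-supported candidate in context.
    name = max(candidates, key=lambda n: scores[context][n])
    return candidates[name](req)

mistakes = 0
for step in range(60):
    context = ["support-chat", "ops-console", "game"][step % 3]
    action = act(context, step)
    target = candidates[operative[context]](step)  # world's continuation
    if action != target:
        mistakes += 1
    # Credit every candidate consistent with the observed continuation.
    for name, schema in candidates.items():
        if schema(step) == target:
            scores[context][name] += 1

print("mistakes while identifying schemas:", mistakes)
```

After at most one mistake per unfamiliar context, the tallies identify the operative schema and the agent supplies the appropriate response thereafter; competence here is schema identification, not maximization of a single objective.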
  
A developmental interpretation follows from the same interface logic. Early interaction must install basic schemas first, because later competence depends on conditioning new choices on rules learned earlier rather than starting from scratch each time. In cooking, early lessons emphasize fundamentals such as heat control, timing cues, tasting, and simple corrections, with demonstrations and immediate **see(o)** that make the consequences legible. Later lessons increase difficulty and reduce support, so the learner must combine earlier schemas under subtler cues and higher time pressure. In tool use, early tasks similarly provide worked examples and predictable outputs, while later tasks demand correct operation on new layouts and unfamiliar failure modes. This staged structure matches Turing's child-machine idea: education builds foundations before expecting broad, adult-level competence.
  
The framework also clarifies limits and failure modes. Transfer from **identify(a)** to later **do(a)** can fail under selection bias, for example when easy patterns are abundant in the data stream but the agent's action turns occur primarily in hard cases. Transfer can fail when the interface introduces systematic cues that split interaction into distinct regimes, so that behavior learned in one regime no longer constrains the other. These issues are central when interpreting modern continuation-trained systems used as agents, including LLM-based tool users, because real deployments often couple model outputs back into future inputs. Such systems can compress interaction traces effectively, but interactive deployment can still create self-referential loops unless **do(a)** is treated as a choice and updates are driven only by **see(o)**.
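The discipline that **do(a)** is a choice while only **see(o)** carries evidence can be written out directly. In this hedged sketch (the two world models are invented for illustration), the action conditions the likelihood of the observation but contributes no likelihood factor of its own:

```python
# Two hypothetical world models: each gives a likelihood for the
# observation o conditioned on the action a the agent just took.
def world_A(o, a): return 0.9 if o == a else 0.1  # world echoes the action
def world_B(o, a): return 0.5                     # world ignores the action

posterior = {"A": 0.5, "B": 0.5}

def update(o, a):
    # First-person update rule: do(a) sets the conditions for evidence
    # but is not itself evidence. Only see(o) multiplies the posterior.
    for name, lik in (("A", world_A), ("B", world_B)):
        posterior[name] *= lik(o, a)
    z = sum(posterior.values())
    for name in posterior:
        posterior[name] /= z

# The agent acts; the true world (an echoing world) responds.
for step in range(10):
    a = step % 2
    o = a  # the world's see(o) echoes the agent's do(a)
    update(o, a)

print("posterior:", posterior)
```

If the update also multiplied in a probability for the agent's own action, the posterior would be shaped by the agent's choices rather than by the world's responses, which is precisely the self-referential loop the interface discipline is meant to block.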
  
A final limitation concerns computation. The universal pattern generator is an idealized limit object, and universality here is a statement about an existence construction that ranges over all computable interaction rules. Modern systems such as LLMs approximate next-token prediction with finite models and finite training, and their success depends on the structure present in data and interaction protocols rather than on literal universality. The perspective offered here is therefore a foundation and a set of interface conditions under which broad, transferable schema learning is possible in principle, together with clear ways those conditions can fail.
  
===== Conclusion =====
  
Reward maximization has earned its central position in the history of AI, but it is not the only coherent foundation for agency. The alternative begins with a first-person discipline: **do(a)** is a choice that sets conditions for evidence, and learning is driven by the world's **see(o)** in response to that choice. Combined with a universal, simplicity-biased pattern generator and an interface that determines who gets to act, this yields an account of behavior as next-step prediction under interventions. The data stream can contain action-shaped fragments that can be extracted as **identify(a)**, while action turns can be evaluated against **imagine(a)**, what the world would have produced next.
  
The payoff is a limit guarantee for universal artificial intelligence. For any world generated by a fixed computable rule, the same idealized agent learns to track the world's targets on its action turns, with only finitely many large departures in expectation, provided turn-taking depends only on past interaction. A universal inference-based agent can, in principle, acquire and execute many coexisting schemas, including utility maximization when rewards are present as observations, without treating utility as the primitive definition of purpose.
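One schematic way to render this limit guarantee, assuming a Solomonoff-style mixture \(\xi\) with prior weight \(w(\mu) > 0\) on the true world \(\mu\) (the notation below is an illustrative sketch in the style of standard universal-induction bounds, not the paper's exact statement):

```latex
% Schematic bound: the cumulative expected divergence between the world's
% continuation rule and the agent's mixture, summed over the action turns
% \mathcal{A}, is finite, so large departures can occur only finitely often.
\sum_{t \in \mathcal{A}} \mathbb{E}_{\mu}\!\left[
  D_{\mathrm{KL}}\!\big(\, \mu(\cdot \mid h_{<t}) \;\big\|\; \xi(\cdot \mid h_{<t}) \,\big)
\right] \;\le\; \ln \frac{1}{w(\mu)} \;<\; \infty
```

Here \(h_{<t}\) is the past transcript, and the restriction of the sum to action turns is what gives the bound its imitation reading: on the turns where the agent must act, its mixture tracks the world's own action-shaped continuation.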
  
<WRAP info>
**Suggested reading.**
  
Background threads include: compression-based viewpoints on learning, including universal induction and related work; causal perspectives on interventions versus evidence, including do-operator style reasoning; and posterior-sampling approaches to action selection under uncertainty, often termed Thompson sampling. Technical treatments develop the universal interaction model, the first-person update rule that excludes **do(a)** from evidence, and the evidence-transfer argument under an event-triggered turn-taking interface that depends only on past interaction.
</WRAP>