drafting — feb 2026

the human tax

11 projects. 7 languages. the tax is real — but it's not what we expected.

we're still working on this

the short version

**tldr:** we hypothesized 30-50% of production code exists solely because humans need to read it. then we measured 11 projects across 7 languages — sinatra, redis, flask, ruff, ripgrep, htmx, spring boot, express, clay, gas town, loomtown. strict human-accidental complexity: median 9.5%, not 30-50%. but total non-essential overhead averages 44.5% — right in the predicted range. the insight: most of the 'tax' is dual-purpose, serving both humans and machines. naming conventions are the strongest predictor (terse languages: 4.9% HAC; verbose: 20.7%). we propose AGL and five architectural concepts — PCPG, MAIR, CLTF, LCB, CSC. three are running in loomtown with a Z3 theorem prover.

this is a working draft. there are probably errors. we're fixing them. bear with us.

highlight text to yell at us about it.

Abstract

Every layer of the modern software stack -- from syntax and type systems to microservice architectures and on-call rotations -- encodes assumptions about a human programmer with a seven-item working memory and a social need for ownership boundaries. As AI agents become the dominant producers and consumers of code, these design choices constitute a "human tax" we hypothesize at 30--50% of production codebases -- with an exploratory pilot measurement on a TypeScript codebase finding 26.4% strict human-accidental complexity plus an additional 21.8% dual-purpose complexity whose attribution depends on an empirically uncalibrated parameter -- compounding across layers. We extend Brooks's essential/accidental complexity framework with a third category, human-accidental complexity, and analyze its manifestation across six strata. Our analysis yields five findings: (1) structured representations yield 32--82% improvements over text-based code generation (scale-dependent); (2) agent-native languages require dependent types and proof-carrying semantics but not syntax or naming conventions; (3) the operational stack must be redesigned around machine-speed event-driven loops; (4) self-improving systems require formal verification as a safety invariant; and (5) multi-representation fluidity is the key architectural requirement. We propose the Agent Graph Language (AGL) and distill our findings into five architectural concepts -- Proof-Carrying Patch Graphs (PCPG), Multi-Level Agent IR (MAIR), Capability-Leased Tool Fabric (CLTF), Latent Coordination Bus (LCB), and Counterexample Supply Chain (CSC) -- that define an integrated agent-native software stack and a research agenda for post-human-centric computing.


PART I: DIAGNOSIS -- What Is Wrong and Why


Section 1: Introduction

The modern software stack is a monument to human cognitive limitations. From the syntax of programming languages to the topology of Kubernetes clusters, every layer encodes assumptions about a specific kind of programmer: a biological organism with a seven-item working memory [1], an eight-hour attention span, a visual cortex that parses indentation, and a social nature that demands ownership boundaries, code review rituals, and semantic versioning contracts. These assumptions have been so deeply internalized by the field that they appear natural -- even essential. They are not.

This paper argues that as AI agents become the dominant producers and consumers of code, the human-centric design of the software stack constitutes a major source of accidental complexity in computing. We call this the human tax: the fraction of every software system that exists not to express computational intent but to make that intent legible to human minds. We estimate this tax at 30--50% of production codebases at the code level alone, compounding across architectural, infrastructural, and process layers. This estimate is a hypothesis derived from converging proxy evidence (Section 2.6), not a directly measured finding; validating it through controlled experimentation is central to the research agenda we propose.

1.1 The Shift in Authorship

For sixty years, every abstraction in the software stack was designed for a biological mind: structured programming because humans need a "textual index" for execution state [2], type systems because humans "routinely introduce errors" [3], the Single Responsibility Principle because humans struggle with interleaved concerns [4]. That question -- how do we make software tractable for the human brain? -- is becoming the wrong question.

As of early 2026, orchestration systems coordinate swarms of 1,000+ AI agents working simultaneously on large codebases. These agents process tokens in parallel across attention layers, operate with context windows of hundreds of thousands of tokens, and coordinate through structured protocols at machine speed. When the consumer of every software abstraction is no longer a human mind, the design space reopens. SWE-agent [5] demonstrated this concretely: agent-computer interface design affects task performance more than agent design itself, inverting the traditional assumption. The bottleneck is the design of the tools, and our tools are designed for humans.

1.2 The Human Tax Thesis

In 1987, Fred Brooks published "No Silver Bullet," distinguishing between essential complexity -- inherent in the problem being solved -- and accidental complexity -- introduced by the tools and methods used to solve it [6]. Moseley and Marks [7] later refined this distinction in "Out of the Tar Pit," arguing that accidental complexity overwhelmingly dominates real systems and proposing relational-algebraic specification as its antidote -- a position that anticipated the specification-driven paradigm we extend to agent-native contexts. Brooks argued that most complexity in software is essential, and that no single technology could deliver an order-of-magnitude improvement in productivity. His analysis was correct given a critical assumption: that the programmer is human.

We extend Brooks's framework with two new categories beyond his original essential/accidental distinction:

  1. Human-accidental complexity: Complexity introduced not by imperfect tools (Brooks's accidental complexity) but by the need to make software comprehensible to humans. It includes naming conventions, indentation, comments, architectural boundaries drawn to match team structures, human-readable serialization formats, code review rituals, and the entire operational infrastructure built for human-speed decision-making.

  2. Dual-purpose complexity: Complexity that currently serves both human comprehension and agent performance, but whose human-comprehension function may become redundant as agents develop alternative reasoning mechanisms. Type annotations are the paradigmatic example: they aid human programmers, improve current LLM code generation (Section 4.2.1), and provide partial specifications that any consumer benefits from. Dual-purpose complexity resists simple classification as "tax" or "essential" -- its status depends on the trajectory of agent training and the availability of alternative information carriers. We introduce this category to avoid the false binary of "keep it all" vs. "remove it all" that undermines nuanced architectural analysis.

Thesis. We hypothesize that between 30 and 50 percent of a typical production codebase consists of human-accidental complexity: code that exists to make the system legible to human minds rather than to express computational intent. This percentage compounds across layers of the stack, such that the total human tax on a deployed software system -- spanning code, architecture, infrastructure, and process -- may represent a major source of accidental complexity in modern computing. This estimate has not been validated by a controlled experiment measuring the fraction of a production codebase eliminable when the sole consumer is an agent; such an experiment is the most important item on the research agenda this paper defines (Section 10.2). The figure should be understood as an order-of-magnitude hypothesis motivating the architectural analysis that follows, not as an established empirical finding.

The evidence supporting this hypothesis is drawn from multiple empirical sources, each measuring a different proxy. Standard JavaScript minifiers routinely reduce production code size by 30--60% by stripping whitespace, comments, and verbose variable names while perfectly preserving machine execution [8], providing a real-world proxy for the fraction of source text that exists for human reading rather than computational semantics. Studies of developer time allocation show that programmers spend 58--70% of their time reading code rather than writing it [9], a proportion that reflects the cognitive overhead of human comprehension. Identifier tokens alone constitute approximately 70% of source text by character count [10], and experiments demonstrate that model performance degrades significantly when human-readable naming is removed [11] -- confirming that naming conventions serve human cognition. However, as we discuss in Section 4.2, naming conventions also serve current AI agents trained on human code; the degree to which they are removable for agents designed for agent-native representations remains an open question. Empirical studies of gradual typing migration report that type annotations account for roughly 5--20% of source code, with functional codebases toward the lower end and object-oriented codebases toward the upper end [12], annotations that become redundant when formal verification subsumes type checking. Buse and Weimer [13] developed a code readability metric whose features -- identifier length, line length, blank line density, comment density -- all measure visual and cognitive properties of human reading, not program semantics. We note that these sources measure different latent variables -- human effort allocation, textual surface area, and removable-for-agents complexity -- and combining them into a single "tax" number requires assumptions about their independence and additivity that we have not formally validated.
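The minifier proxy can be made concrete with a toy sketch (assuming nothing about any real minifier's pipeline): strip the layout features that exist only for human reading from a small TypeScript snippet and measure the reduction. Real minifiers also shorten identifiers, which this sketch deliberately omits.

```python
# Toy illustration of the minifier proxy: remove comments, blank lines,
# and indentation from a TypeScript snippet and measure how much of the
# source text served human reading rather than execution semantics.

def strip_human_layout(source: str) -> str:
    kept = []
    for line in source.splitlines():
        code = line.split("//", 1)[0].rstrip()  # naive comment strip
        if code.strip():                        # drop blank/comment-only lines
            kept.append(code.strip())           # drop indentation
    return "\n".join(kept)

snippet = """\
// Compute the running total of a list of numbers.
function runningTotal(xs: number[]): number {
  let total = 0; // accumulator
  for (const x of xs) {
    total += x;
  }
  return total;
}
"""

minified = strip_human_layout(snippet)
reading_share = 1 - len(minified) / len(snippet)
print(f"{reading_share:.0%} of the snippet existed for human reading")
```

Even without identifier shortening, the reduction on this (invented) snippet lands inside the 30--60% range the minifier studies report.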

He et al. [14] provide complementary evidence from the opposite direction: their study finds that AI-generated code accelerates development velocity but degrades code quality metrics, suggesting a maintenance burden on the human reviewers who must comprehend and maintain it -- consistent with the dual penalty our framework predicts when human-centric and agent-centric requirements collide in the same codebase.

The human tax is not an indictment of human-centric design. For sixty years, that design was rational -- even brilliant. The argument is that what was rational for human programmers becomes accidental complexity when the programmer is a machine. Recognizing this shift is the prerequisite for building something better.

1.3 Scope and Contributions

This paper performs a systematic analysis of human-cognitive constraints across six strata of the software stack and proposes architectural concepts for a post-human-centric stack. It is a vision paper with concrete proposals grounded in 2024--2026 evidence, not a survey. The novelty lies not in individual components -- dependent types, proof-carrying code, capability-based security, and multi-level IRs are all established -- but in their integration into a coherent stack motivated by the systematic identification of human-accidental complexity, following the tradition of systems papers that synthesize existing mechanisms into architecturally novel wholes.

Contributions: (1) Identification and quantification of human-accidental complexity across six strata, extending Brooks's essential/accidental distinction (Sections 1--2). (2) Analysis of the representation design space, demonstrating 32--82% improvements (scale-dependent) with structured representations [15] (Section 3). (3) First-principles derivation of agent-native language requirements and the AGL design sketch (Section 4). (4) Comprehensive redesign of the operations stack around the Agent-Computer Interface (Section 5). (5) Five novel architectural concepts -- PCPG, MAIR, CLTF, LCB, CSC -- composing into an integrated agent-native stack with three interlocking feedback loops (Section 7).

While a prior paper [16] establishes the macro-level economic and organizational case for agent-native software engineering, this paper addresses the complementary micro-level question: what does the software stack itself need to look like when the primary consumer of code representations is a machine, not a human? The two papers are non-overlapping by construction: the prior paper treats organizations, economics, and team-scale effects (the "macro" layer); this paper treats representations, languages, and infrastructure (the "micro" layer). No section, figure, or formal contribution appears in both.

The paper is organized in four parts: Part I (Sections 1--2) diagnoses the human tax; Part II (Sections 3--5) prescribes agent-native replacements; Part III (Section 6) explores self-improving systems; Part IV (Section 7) synthesizes five novel concepts into an integrated stack.


Section 2: Human Constraints Baked Into the Software Stack

Every layer of the modern software stack -- syntax, types, architecture, infrastructure, operations, social processes -- encodes specific human cognitive limitations. These are not incidental features of the stack; they are load-bearing design decisions, each traceable to a documented property of human cognition. In this section, we trace these constraints systematically, layer by layer, and quantify their compound overhead.

2.1 Syntactic Sugar Is for Human Brains

Every programming language above machine code exists because humans cannot think in von Neumann instructions. Backus [17] diagnosed the bottleneck explicitly in his 1977 Turing Award lecture; each language generation added abstraction for the human programmer, not the machine. An LLM-based agent does not share this bottleneck -- it processes tokens in parallel across attention layers with a context window measured in hundreds of thousands of tokens.

The conventions consuming substantial engineering effort in every project are human-cognitive in origin. Indentation exists because the human visual cortex uses spatial layout to parse hierarchy. Naming conventions exist because human readers use names as mnemonic indices; an agent can correlate an opaque identifier with its definition site across its full context. Comments exist for future human readers; an agent with access to the full repository, commit history, and issue tracker has richer context. Buse and Weimer [13] confirmed this empirically: their readability metric's features -- identifier length, line length, blank line density, comment density -- all measure visual and cognitive properties of human reading, not program semantics.

Structured programming [2] replaced GOTO with if/else because humans need a textual index corresponding to execution state -- an agent simulating execution paths across a full control flow graph is indifferent. Miller's [1] 7 +/- 2 limit shaped function size heuristics [18, 19], the Single Responsibility Principle [4], cyclomatic complexity [20], and Halstead metrics [21] -- all are explicit models of human cognitive load, not computational complexity. An agent's "right" decomposition might be radically different: larger monolithic units minimizing interface overhead, or finer granularity mapping to verification units rather than comprehension units.

2.2 Type Systems as Human Error Prevention

Static type systems respond to a single observation: human programmers routinely introduce errors detectable before execution. Milner's ML [3] aimed to catch errors at compile time; Cardelli's [22] taxonomy frames the purpose as "preventing execution errors." The field from Church's simply typed lambda calculus through Martin-Löf's dependent types [23] to Curry-Howard is an increasingly sophisticated effort to constrain what humans can express, preventing broader classes of human errors.

For an agent that generates code from specifications and verifies it via SMT solvers or proof assistants, the static type system becomes redundant -- not because types are unimportant, but because verification subsumes type checking. This is supported by type-constrained code generation research [24], which demonstrates that structural-level type guidance makes compile-time checking a subset of the broader verification guarantee. Pierce's [25] definition of types as "a tractable syntactic method for proving the absence of certain program behaviors" reveals the key question: tractable for whom?

Type annotations serve as partial specifications, and specifications remain valuable regardless of who writes code. The argument is narrower: the enforcement mechanism of static type checking -- compile-time errors, IDE squiggles -- is designed for a human workflow. An agent generating correct-by-construction code needs specifications, not guardrails.
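The distinction between guardrails and specifications can be shown with a toy, stdlib-only sketch (no SMT solver here; an exhaustive check over a small domain stands in for an SMT query, and all names are invented for the example): a type signature admits buggy implementations that a specification rules out.

```python
# A specification rules out implementations that a type checker admits.
# Exhaustive checking over a small domain stands in for an SMT query of
# the form: forall x. spec_abs(x, f(x)).

def spec_abs(x: int, result: int) -> bool:
    """Specification: result is the absolute value of x."""
    return result >= 0 and result in (x, -x)

def abs_buggy(x: int) -> int:
    return x                    # type-correct (int -> int), spec-violating

def abs_correct(x: int) -> int:
    return -x if x < 0 else x   # type-correct AND spec-satisfying

domain = range(-100, 101)
print(all(spec_abs(x, abs_correct(x)) for x in domain))  # True
print(all(spec_abs(x, abs_buggy(x)) for x in domain))    # False
```

Both implementations satisfy the type checker; only the specification distinguishes them -- which is the sense in which verification subsumes type checking.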

2.3 Software Architecture as Team Topology

Conway's Law [26] -- organizations produce systems copying their communication structures -- reveals that software structure is primarily determined by human social structure, not computational requirements. Microservices, the dominant style of the 2010s-2020s, are Conway's Law made manifest: each service boundary is a team ownership boundary first and a technical boundary second. The microservice pattern introduces substantial operational overhead versus monoliths -- network serialization, distributed transactions, service discovery, the entire service mesh -- with empirical studies reporting 2--4x throughput penalties on equivalent hardware [27, 28], overhead incurred primarily to enable human team independence. An agent swarm has no teams, no organizational chart, no communication overhead proportional to team size. REST APIs use verbose, schema-less JSON because humans read responses in terminals; the shift toward gRPC and Protocol Buffers tracks the shift toward machine-to-machine communication.
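The JSON-versus-binary point can be made concrete with a hedged sketch (the record and its schema are invented for the example; a real system would use Protocol Buffers or similar): when both endpoints share a schema, the field names and ASCII digits that JSON carries for human readers disappear.

```python
# Comparing a human-readable JSON encoding of a record with a packed
# binary encoding under a schema known to both endpoints.
import json
import struct

record = {"id": 7, "balance_cents": 250000, "active": True}

as_json = json.dumps(record).encode()             # self-describing, human-readable
as_binary = struct.pack("<Iq?", 7, 250000, True)  # u32 id, i64 balance, bool active

print(len(as_json), len(as_binary))  # JSON spends bytes on names and digits
```

The binary form is 13 bytes regardless of field values; the JSON form grows with field names and value lengths -- the cost of keeping responses readable in a terminal.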

2.4 Infrastructure and Process as Human-Speed Operations

The operational stack reveals even deeper assumptions. Kubernetes [29] specifies resources in human-readable YAML, with a CLI designed for human interactive use and dashboards for visual comprehension. The entire observability stack -- Prometheus, Grafana, PagerDuty -- implements a workflow designed for human circadian rhythms, human alert fatigue, and human organizational hierarchies [30]. Runbooks are natural-language programs for stressed human interpreters. IaC tools (Terraform, Pulumi) translate infrastructure into human-reviewable, version-controllable form. Deployment strategies (canary releases, feature flags, blue-green) assume human-speed decision-making. An agent fleet operates at machine speed with no sleep cycle, no fatigue, and no organizational chart to escalate through.

2.5 Social Constructs in Code

Code review [31, 32] is fundamentally social trust verification -- a second human's judgment, expressed through pull request artifacts (titles, diffs, comment threads, approvals) that are social coordination mechanisms, not computational artifacts. Semantic versioning [33] is a "social contract" communicating update risk to human developers; an agent can analyze API surface diffs mechanically. Git's design -- commit messages, branches, blame annotations -- is a human communication medium; the entire Git ecosystem (GitHub, GitLab) is a social platform. An agent swarm with shared state needs none of these social coordination tools.
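The mechanical analysis an agent would substitute for a version-number contract can be sketched in a few lines (the API surfaces and signatures below are invented for the example): diff two surfaces directly instead of trusting a human-maintained label.

```python
# Sketch: classify update risk by diffing API surfaces (name -> signature)
# rather than reading a semantic version number.

def classify_release(old: dict, new: dict) -> str:
    removed_or_changed = any(
        name not in new or new[name] != sig for name, sig in old.items()
    )
    added = any(name not in old for name in new)
    if removed_or_changed:
        return "major"   # breaking: a consumer of the old surface may fail
    if added:
        return "minor"   # additive: old consumers unaffected
    return "patch"       # surface unchanged

v1 = {"get": "(url: str) -> Response",
      "post": "(url: str, body: bytes) -> Response"}
v2 = {**v1, "delete": "(url: str) -> Response"}

print(classify_release(v1, v2))  # minor
print(classify_release(v2, v1))  # major: removing an endpoint is breaking
```

A real implementation would diff typed signatures and behavioral contracts, not strings, but the point stands: the classification is computable from the artifact itself.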

2.6 Quantifying the Compound Human Tax

Having traced human constraints through six strata of the stack, we can now estimate the compound overhead. Table 1 presents our breakdown of the human tax at the code level.

Table 1: The Human Tax Breakdown

| Category | Estimated % of Codebase | Justification |
| --- | --- | --- |
| Comments and documentation strings | 10--20% | Arafat and Riehle (2009): measured comment density across open source codebases |
| Type annotations (where inference or verification suffices) | 5--15% | Tobin-Hochstadt et al. (2017): gradual typing migration adds 5--20% LOC overhead; range depends on paradigm (functional vs. OO) |
| Naming overhead (descriptive vs. minimal identifiers) | 3--8% | Identifiers constitute ~70% of source text by character count [10]; descriptive conventions add 3--8% overhead vs. compact identifiers |
| Architectural boilerplate (interfaces, adapters, DTOs) | 5--15% | Clean Architecture layers existing for human separation of concerns; interface code in typical enterprise Java projects |
| Serialization/deserialization (human-readable formats) | 2--5% | JSON/XML marshaling code eliminable with binary or direct memory formats |
| Error messages (human-readable strings) | 2--5% | Structured error codes suffice for machine consumers |
| Configuration files (YAML, TOML, .env) | 1--3% | Human-readable config vs. programmatic configuration |
| Test descriptions and naming | 1--3% | `describe("when user is not authenticated")` serves human readers |
| **Total estimated human tax at code level** | 29--74% (range); central estimate 30--50% | Categories overlap (see note below) |

A note on Table 1 methodology. The component ranges in Table 1 sum to 29--74% at the extremes. We report the central estimate of 30--50% because the categories are not disjoint: comments may contain type documentation, naming overhead overlaps with test descriptions, and architectural boilerplate includes serialization code. The denominator is total source text including comments, configuration, and test code (but excluding vendored dependencies and generated artifacts). The compound estimate also rests on independence assumptions we have not formally validated, so the 30--50% range should be treated as a hypothesis requiring empirical validation through controlled experiments (see Section 10.2), not as a precise measurement. This estimate is consistent with independent observations: standard JavaScript minifiers reduce production code by 30--60% while preserving execution semantics [8], and developers spend 58--70% of their time reading code rather than writing it [9], reflecting the cognitive overhead that drives much of this human-accidental complexity.

Exploratory pilot measurement. To demonstrate the feasibility of automated HAC classification, we conducted an automated AST-based measurement on a 260,407-character TypeScript codebase (15 files from a multi-agent orchestrator, spanning coordinator, worker, verification, compiler, and infrastructure modules). Using a four-category character-level classifier (essential, human-accidental, dual-purpose, infrastructure), the primary finding is 26.4% human-accidental complexity under a strict definition (comments, indentation, naming overhead beyond 3 characters, human-readable error strings, blank lines). This strict figure -- the only result derived purely from classification without tunable parameters -- falls below the hypothesized 30--50% range. Essential complexity accounted for 47.9% of characters, with infrastructure boilerplate (imports/exports) at 3.8%. The standard deviation across files was +/- 8.2 percentage points for the human-accidental category. Sensitivity analysis on the naming-overhead threshold (1, 3, 5 characters) shows the strict figure varies by only 1.6 percentage points, demonstrating robustness to this parameter choice.
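For intuition about what such a classifier does, here is a greatly simplified, line-granularity sketch. The pilot's classifier is AST-based and character-level; the rules below are illustrative approximations invented for this example, not the ones used in the measurement.

```python
# Line-granularity approximation of the four-category classifier.
# Category names follow the text; classification rules are illustrative.

def classify_line(line: str) -> str:
    s = line.strip()
    if not s or s.startswith(("//", "/*", "*")):
        return "human_accidental"      # blank lines and comments
    if s.startswith(("import ", "export {")):
        return "infrastructure"        # module plumbing
    if s.startswith(("interface ", "type ")):
        return "dual_purpose"          # standalone type declarations
    return "essential"

def tally(source: str) -> dict:
    counts: dict = {}
    for line in source.splitlines():
        cat = classify_line(line)
        counts[cat] = counts.get(cat, 0) + len(line) + 1  # chars incl. newline
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}

sample = """\
// Fetch a user record by id.
import { db } from "./db";

interface User { id: number; name: string }

export function getUser(id: number): User {
  return db.lookup(id);
}
"""
shares = tally(sample)
```

Per-file shares like these, aggregated across a codebase, yield the category percentages reported above -- the real pipeline simply does the bookkeeping at character granularity within the AST.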

An additional 21.8% of the codebase was classified as dual-purpose complexity (type annotations, interface definitions, JSDoc, Zod schemas -- features serving both human comprehension and current agent/tooling performance). To be precise about what "dual-purpose" means: we do not claim the semantic content of types is human tax -- specifications remain essential regardless of consumer, and our proposed stack (Section 4) demands richer types, not fewer. What we classify as dual-purpose is the textual representation of that semantic content in human-readable TypeScript syntax (e.g., the ASCII characters of interface User { name: string; age: number }) which could be replaced by a more compact agent-native encoding (e.g., a typed graph node or de Bruijn-indexed term) without loss of specification content. Attributing a fraction alpha of this dual-purpose complexity to human needs yields a parametric spectrum: at alpha = 0 the estimate equals the strict 26.4%; at alpha = 0.5 it rises to 37.3%; at alpha = 1.0 it reaches 48.2%. Critically, alpha is a parameter, not a measurement -- the blended figure (37.3%) does not independently support the 30--50% hypothesis because it is computed by construction as 26.4 + alpha * 21.8, and for any target in the range [26.4%, 48.2%] there exists an alpha that produces it. The correct reading is that the strict measurement establishes a floor of 26.4% and the dual-purpose share (21.8%) defines the range of uncertainty contingent on how much dual-purpose complexity agents can eventually bypass. Resolving this uncertainty requires the ablation study described in the research agenda (Section 10.2): progressively stripping dual-purpose elements from codebases and measuring agent performance to empirically calibrate alpha, rather than assuming it.
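The parametric attribution above can be written out directly; the numbers are the pilot's, and the point is that the blended figure is linear in alpha by construction:

```python
# The blended estimate is linear in alpha, so any target in [26.4, 48.2]
# is reachable by choosing alpha -- which is why the blended figure cannot
# independently confirm the 30--50% hypothesis.

STRICT_HAC = 26.4      # measured floor (%, strict definition)
DUAL_PURPOSE = 21.8    # measured dual-purpose share (%)

def blended_estimate(alpha: float) -> float:
    return STRICT_HAC + alpha * DUAL_PURPOSE

for alpha in (0.0, 0.5, 1.0):
    print(alpha, round(blended_estimate(alpha), 1))   # 26.4, 37.3, 48.2

# Inverting exposes the circularity: the alpha that "supports" any
# target t is simply (t - STRICT_HAC) / DUAL_PURPOSE.
alpha_for_30 = (30.0 - STRICT_HAC) / DUAL_PURPOSE
```

Only an empirical calibration of alpha (the ablation study in Section 10.2) breaks this circularity.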

This exploratory pilot is a single-codebase, automated-only analysis (see Threats to Validity, Section 9.6b) and does not constitute validation of the 30--50% hypothesis. The strict figure (26.4%) establishes that human-accidental complexity is a non-trivial fraction of production code, but whether it reaches the hypothesized 30--50% range depends on two open empirical questions: (1) what fraction of dual-purpose complexity is genuinely eliminable for agent-only consumption (the alpha calibration problem), and (2) whether the result generalizes across languages and codebases (the cross-language validation problem). Both are priority items in the research agenda (Section 10.2). Full measurement methodology, sensitivity analysis, and per-file results are available in the supplementary material.

The compounding effect. The human tax does not merely accumulate; it compounds across the stack:

  1. Language level: An estimated 30--50% of code is human-accidental complexity (hypothesis; see above).
  2. Architecture level: Microservices add substantial operational overhead versus equivalent monolithic implementations [27, 28], primarily to enable human team independence.
  3. Infrastructure level: Kubernetes, CI/CD, monitoring, and IaC constitute an entire operational layer that exists to mediate between human operators and computing resources.
  4. Process level: Code review, semantic versioning, documentation, and release management impose real engineering costs as social coordination overhead.

Each layer's human-centric decisions constrain and complicate the layers above and below it. Conway's Law [26] ensures that organizational structure propagates through the entire system: team boundaries become service boundaries, which become network boundaries, which require serialization layers, which require monitoring, which requires dashboards, which require on-call rotations. The causal chain from "humans have limited communication bandwidth" to "we need PagerDuty" is unbroken.

Brooks's framework, extended. We can now state our refinement of Brooks's taxonomy precisely:

  1. Essential complexity: Inherent in the computational problem regardless of who solves it. Algorithmic difficulty, irreducible state management, fundamental concurrency challenges. This is invariant across human and agent programmers.

  2. Tool-accidental complexity (Brooks's original category): Introduced by imperfect tools and removable by better tools. Slow compilers, inadequate debuggers, poor IDE support. This category persists but shrinks as tools improve.

  3. Human-accidental complexity (our new category): Introduced by the need to make software comprehensible to humans. Naming conventions, type annotations as guardrails, architectural boundaries driven by team structure, human-readable formats, social coordination processes. This category is invisible when all programmers are human. It becomes visible -- and removable -- when the programmer is a machine.

The third category has been invisible for sixty years because its reference class contained exactly one element: the human programmer. The human constraints are baked so deeply into our tools, practices, and assumptions that they appear natural, even essential. They are not. Recognizing them as accidental complexity specific to a particular class of programmer is the first step toward an agent-native software stack -- a stack that preserves the essential complexity of computation while shedding the accumulated weight of human cognitive accommodation.


PART II: PRESCRIPTION -- What to Build Instead


Section 3: Beyond Text -- Code Representations for Agent-Scale Development

Text-based source code is a lossy compression of program semantics optimized for human visual parsing. When a programmer writes if x > 0: return x, the whitespace, the keyword if, the colon, and the newline all serve a single consumer: a human eye scanning left-to-right across a monospaced terminal. An AI agent does not scan; it tokenizes. It does not need whitespace to infer scope or keywords to recognize control flow. The entire surface syntax of every programming language is, from the agent's perspective, an elaborate encoding of structure into a format optimized for the wrong reader.

This section argues that agents should operate on a spectrum of structured representations -- ASTs, IRs, graphs -- choosing the optimal level for each task, and that the systems that succeed at agent scale will be those that make translation between representations cheap, reliable, and automatic.

3.1 The Representation Spectrum

Code exists at multiple levels of abstraction, each trading human legibility for machine precision:

  1. Natural language specifications -- prose describing desired behavior.
  2. Human-readable source code -- Python, TypeScript, Rust -- structured text with formatting conventions optimized for human comprehension.
  3. Token streams -- the output of a lexer: a flat sequence of typed tokens that strips whitespace and comments but preserves sequential structure.
  4. Concrete Syntax Trees (CSTs) -- a full parse tree preserving every syntactic detail, including whitespace and comments. Lossless and round-trippable.
  5. Abstract Syntax Trees (ASTs) -- a tree structure discarding syntactic sugar, preserving only semantic structure.
  6. Intermediate Representations (IRs) -- LLVM IR, WebAssembly, MLIR dialects -- platform-independent representations with well-defined semantics, optimized for compiler transformations.
  7. Machine code -- the final executable form, fully resolved to a specific instruction set.

Each step down this hierarchy removes information that matters to humans (formatting, naming conventions, comments) and adds information that matters to machines (explicit type information, resolved references, optimization metadata). Each step also narrows the space of valid programs, reducing ambiguity at the cost of flexibility.
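Part of this spectrum can be walked with Python's own toolchain; the level numbers in the comments refer to the list above, and the one-line program is invented for the example.

```python
# One program as source text (level 2), a token stream (level 3),
# an AST (level 5), and compiled bytecode (toward level 7).
import ast
import io
import tokenize

src = "def clamp(x):\n    return x if x > 0 else 0\n"

tokens = [t.string for t in tokenize.generate_tokens(io.StringIO(src).readline)
          if t.string.strip()]          # whitespace-only tokens drop out

tree = ast.parse(src)                   # structure made explicit
node_types = {type(n).__name__ for n in ast.walk(tree)}

code = compile(tree, "<agent>", "exec") # executable form; layout is gone

print(tokens[:4])               # ['def', 'clamp', '(', 'x']
print("IfExp" in node_types)    # True: the conditional survives as structure
```

Each stage discards something a human reads (the colon, the indentation) and makes explicit something a machine needs (node types, resolved structure).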

The question for agent-scale development is not which single level to standardize on, but how to build systems that move fluidly between levels. An agent performing high-level architectural refactoring might operate at the specification or AST level; an agent optimizing a hot loop might work directly with IR. The most effective agent systems will be polyglot across the representation spectrum.

The Goldilocks Zone. For current LLM-based agents, empirical evidence suggests a "Goldilocks zone" in the AST-to-IR range. Operating purely on raw text invites semantic errors that syntactic validity masks. Operating purely on machine code sacrifices the semantic structure that makes reasoning about program behavior tractable. Representations in the AST-to-IR range are structured enough to eliminate entire classes of errors, abstract enough to capture semantic intent, and well-tooled enough to integrate with existing compiler and verification infrastructure.

3.2 ASTs as the Natural Agent Language

Of all representations on the spectrum, Abstract Syntax Trees occupy a uniquely productive position for AI agents. ASTs strip away syntactic sugar -- parentheses used for grouping, semicolons, significant whitespace -- while preserving semantic structure. A Python for loop and its equivalent while loop may look different as text but share significant structural overlap at the AST level.

The case for AST-based generation rests on three pillars. First, structural validity by construction: when an agent generates code as a sequence of AST node construction operations, the output is guaranteed to be syntactically valid, eliminating an entire category of failure modes that currently waste agent compute on retry loops. Second, explicit semantic structure: operator precedence, variable scoping, and control flow are explicit in the tree rather than implicit in formatting conventions. Third, normalization of equivalent programs: many superficially different text representations correspond to the same AST, reducing the space of representations an agent must reason about.
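The first and third pillars are easy to see with Python's stdlib `ast` module — a minimal sketch, not an agent pipeline:

```python
import ast

# Two textually different renderings of the same statement: extra
# whitespace, a comment, and grouping parentheses on one side.
a = "total = (price * qty)  # compute cost"
b = "total=price*qty"

# Their ASTs are identical: formatting, comments, and redundant
# parentheses are not part of the tree, so superficially different
# texts normalize to a single structure.
assert ast.dump(ast.parse(a)) == ast.dump(ast.parse(b))

# Generating code as AST construction operations guarantees syntactic
# validity: an agent emitting nodes cannot produce an unparsable program.
expr = ast.Expression(body=ast.BinOp(
    left=ast.Constant(2), op=ast.Mult(), right=ast.Constant(21)))
code = compile(ast.fix_missing_locations(expr), "<agent>", "eval")
print(eval(code))  # → 42
```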

The most compelling recent evidence comes from GrammarCoder 15. Liang et al. developed billion-parameter-scale language models that generate code using grammar rule sequences derived from AST traversals rather than raw token sequences. On HumanEval, GrammarCoder-1.3B-Base achieved a Pass@1 score of 63.4%, compared to 34.8% for the underlying DeepSeek-Coder-1.3B-Base -- an 82% improvement over the baseline using the same training data, model architecture, and compute budget. The only difference was the representation.

Crucially, this improvement was not due to elimination of syntax errors. Both models produced syntactically valid code nearly 100% of the time at the billion-parameter scale. Instead, grammar-based representations amplified subtle semantic differences invisible at the token level. When comparing error code to corrected code, the average edit distance at the token level was 14.33, but at the grammar level it increased to 27.43 -- a 91% increase in edit-distance sensitivity. Minor changes like adding parentheses (changing operator precedence) result in dramatic restructuring of the grammar rule sequence, making the model more sensitive to semantically significant changes.
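The parentheses example is easy to reproduce: a two-character edit at the token level restructures the tree, which is precisely the amplification grammar-based representations exploit.

```python
import ast

# Token level: the only edit is adding two parentheses.
before = ast.parse("a + b * c", mode="eval").body
after = ast.parse("(a + b) * c", mode="eval").body

# Tree level: the root operator flips from Add to Mult, and the
# nested BinOp moves from the right child to the left.
print(type(before.op).__name__, type(after.op).__name__)  # → Add Mult
```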

The Tree-sitter ecosystem provides practical infrastructure for AST-based manipulation, offering incremental parsing for over 100 languages with a uniform API. The cAST framework (2025) demonstrated that AST-based code chunking produces chunks better aligned with semantic boundaries than line-based splitting, directly improving retrieval quality for code generation agents.

However, scaling evidence tempers the headline figure. At 7B parameters (Qwen2.5-7B), GrammarCoder's Pass@1 improves from 57.9% to 76.8%, but the relative gain shrinks from 82% to 32%, suggesting that grammar-based representations yield diminishing returns as the underlying model approaches frontier scale. Whether the advantage persists, vanishes, or re-emerges at yet larger scales remains an open empirical question. Notably, GrammarCoder's evaluation is limited to function-level synthesis benchmarks (HumanEval, MBPP); its applicability to repository-level tasks such as SWE-bench, which require cross-file reasoning and contextual understanding beyond grammar-rule sequences, remains undemonstrated.

ASTs are not without limitations. They are verbose -- several times larger than the equivalent text -- and since attention cost in standard transformers grows quadratically with sequence length, that verbosity is expensive. They lose pragmatic information (comments, naming conventions) that carries intent. And they struggle with incomplete code fragments common in real-world agent interactions. These limitations motivate representations further down the spectrum.

3.3 Intermediate Representations as the Sweet Spot

If ASTs represent the structure of a single programming language, Intermediate Representations capture the structure of computation itself. IRs like LLVM IR, WebAssembly, and MLIR sit between source code and machine code, designed to be platform-independent, well-defined, optimizable, and verifiable.

LLVM IR is a typed, SSA-form (Static Single Assignment) representation with the most mature optimization infrastructure in the industry. Its SSA form makes dataflow analysis trivial -- every value has exactly one definition. For agents, LLVM IR has the attractive property that generation and optimization are separable: an agent could generate unoptimized LLVM IR and leverage decades of optimization passes without being constrained by any particular source language's abstractions.

WebAssembly (Wasm) occupies a particularly interesting position. Originally designed as a browser compilation target, it has evolved into a general-purpose sandboxed execution environment with strong security guarantees. For agent systems, Wasm offers three key properties: sandboxing (memory-safe execution by default, critical for agent-generated code), portability (identical execution across platforms), and composability (the Component Model enables fine-grained module composition with type-safe interfaces). Microsoft's Wassette project (2025) bridges WebAssembly Components with the Model Context Protocol, allowing agents to autonomously download and execute secure Wasm tools -- a concrete architecture for agent-scale tool use.

MLIR (Multi-Level Intermediate Representation) 34 may be the most forward-looking IR for agent-scale development. Its dialect system provides a framework for defining domain-specific IRs at different abstraction levels with principled lowering between them. An agent reasoning about a machine learning pipeline might operate at the "linalg" dialect level; an agent optimizing memory access patterns at a lower dialect; an agent targeting a specific accelerator at the hardware dialect. The COMPASS project (2025) demonstrated using an LLM-based agent for MLIR compilation pass pipeline generation, a concrete example of agents operating directly on IR-level representations. MLIR's extensibility means that new agent-specific dialects could be defined -- for specification-level intent, test assertions, or deployment constraints -- with the lowering infrastructure handling verified translation between levels.

The HDL/RTL analogy. The chip design industry provides a compelling precedent. Hardware engineers write specifications in HDLs, which are compiled through behavioral descriptions, Register Transfer Level (RTL), gate-level netlists, and physical layouts. RTL serves as the critical intermediate representation: abstract enough to reason about, concrete enough to verify, precise enough to synthesize. Software development is converging toward this model as agents take over implementation: the specification becomes the source of truth, the IR captures essential computational structure, and the executable is a derived artifact. Hardware equivalence checking is decidable for finite-state designs but worst-case coNP-complete; practical tractability derives from SAT/BDD engineering. Software equivalence is undecidable in general (Rice's Theorem), making the verification gap structurally wider for code than for circuits.

3.4 Beyond Trees: Graphs, Specifications, and Multi-Modal Artifacts

Many aspects of program behavior are graph-shaped, not tree-shaped. PROGRAML 35 demonstrates that graph representations enable ML models to approximate dataflow analysis more effectively than AST-based approaches. CodexGraph 36 integrates LLM agents with property graphs capturing cross-cutting relationships -- enabling multi-hop reasoning ("find all callers of deprecated authenticate() not covered by integration tests") that is natural as graph traversal but difficult as text search.
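A multi-hop query of that shape is a few lines once the relationships are a graph. This is a toy sketch with plain dicts — the function and test names are hypothetical, not from CodexGraph:

```python
# Toy property graph: nodes are functions, edges are "calls" and
# "tested_by". All names below are illustrative.
calls = {
    "login": ["authenticate"],
    "sso_handler": ["authenticate"],
    "signup": ["create_user"],
}
tested_by = {"login": ["test_login"]}  # sso_handler has no test edge
deprecated = {"authenticate"}

# Multi-hop query: callers of a deprecated function with no test
# coverage -- natural as graph traversal, awkward as text search.
flagged = sorted(
    caller for caller, callees in calls.items()
    if deprecated & set(callees) and not tested_by.get(caller)
)
print(flagged)  # → ['sso_handler']
```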

The "compiled specification" model treats specifications as the source of truth and code as a derived artifact. Spec2Code (2025) and SpecGen 37 demonstrate bidirectional bridges between specifications and code. This implies that version control should track specifications, not code -- eliminating merge conflicts at the code level by resolving them at the specification level. The chip design industry has operated this way for decades.

For agent-scale trust, proof-carrying code 38 is essential: when Agent A generates code for Agent B, mechanically checkable proofs provide assurance in polynomial time. VerifyThisBench 39 evaluates combined generation of code, specifications, and proofs -- treating "program" as a composite object.
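The asymmetry proof-carrying code exploits -- producing a correct artifact is hard, checking a certificate is cheap -- can be sketched in a few lines. The certificate format here (a permutation array) is illustrative, not taken from 38:

```python
# Agent A ships a sorted list plus a certificate: perm[i] gives the
# index in `original` of result[i]. Agent B verifies in linear time
# instead of re-deriving the result or trusting Agent A.
def check(original, result, perm):
    return (sorted(perm) == list(range(len(original)))          # perm is a permutation
            and all(result[i] == original[p] for i, p in enumerate(perm))
            and all(result[i] <= result[i + 1] for i in range(len(result) - 1)))

orig = [3, 1, 2]
print(check(orig, [1, 2, 3], [1, 2, 0]))  # → True
print(check(orig, [1, 2, 9], [1, 2, 0]))  # → False: not a permutation of orig
```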

3.5 Toward Representation-Fluid Agent Systems

The evidence points toward agent-scale systems that are representation-fluid: operating on multiple representations simultaneously, choosing the most appropriate for each task.

A plausible architecture: (1) Specifications serve as the human-facing representation, version-controlled and reviewable. (2) ASTs/grammar representations serve as the generation target, guaranteeing structural validity. (3) Enriched graphs (combining AST structure with dataflow, dependency, and knowledge-graph edges) serve as the reasoning representation. (4) IRs (LLVM IR, WebAssembly, MLIR dialects) serve as the optimization and execution target. (5) Multi-modal artifacts tie everything together, bundling implementation, specification, tests, proofs, and provenance.

Table 2: Representation Tradeoffs

| Representation | Generation Quality | Verification Strength | Composition | Human Inspectability |
| --- | --- | --- | --- | --- |
| Raw text | Baseline | Weak (requires parsing) | Poor (textual merge) | High |
| AST | +32--82% (GrammarCoder, scale-dependent) | Moderate (structural) | Moderate (tree merge) | Low (requires rendering) |
| IR (LLVM/MLIR) | High (type-directed) | Strong (SSA, type system) | High (module system) | Very low |
| Graph (CFG/DFG) | High (dependency-aware) | Strong (path analysis) | High (subgraph composition) | Very low |
| Multi-modal artifact | Highest (spec-guided) | Strongest (proof-carrying) | Highest (contract-based) | Medium (via projections) |

The key insight is that no single representation is optimal for all tasks. A "text is all you need" counter-thesis holds for local function synthesis, where frontier LLMs achieve high Pass@1 on raw text; but it breaks down for repository-level tasks requiring cross-file dependency tracking, semantic merge, and compositional verification -- precisely the regime where agent-scale systems operate. The systems that succeed at agent scale will embrace this multiplicity -- building infrastructure that makes representation translation cheap, reliable, and automatic. The compiler community has spent decades building exactly this kind of infrastructure; the challenge for the agent-scale era is making it accessible to AI agents as first-class users.


Section 4: Agent-Native Languages

If the primary author of code shifts from humans to agents, the entire design space of programming languages reopens. Every feature of every mainstream language -- from Python's significant whitespace to Rust's lifetime annotations to Java's class hierarchies -- was designed for a single consumer: the human programmer. An agent has no need for readable variable names, familiar keywords, or indentation. It has acute need for unambiguous semantics, machine-verifiable correctness, and composability through behavioral contracts rather than social conventions.

This section derives from first principles what an agent-native programming language requires, what it can discard, surveys candidate paradigms, investigates whether agents would converge on their own notations, and presents concrete design proposals.

4.1 First Principles: What Agents Need

Unambiguous semantics. Human languages are littered with undefined behavior -- C has over 200 categories. Even "safe" languages like Python have surprising edge cases around mutable default arguments, late binding in closures, and generator-exception interactions. An agent has no intuition about dark corners; its model of language semantics is only as good as the specification. An agent-native language must have a total semantics: every syntactically valid program has exactly one meaning. Standard ML achieved this decades ago through careful formalization 40; the reason this approach did not dominate human programming is that humans found the resulting language less ergonomic. Agents face no such friction.

Machine-verifiable correctness. Agents generate vast quantities of code. The quality-assurance bottleneck shifts from "can I write it?" to "can I verify it?" An agent-native language should embed proof obligations directly into its type system: when an agent writes a sorting function, the language requires a machine-checkable proof that the output is a permutation of the input and is ordered. Writing code and proving it correct become the same act.

Composability through behavioral specifications. When a thousand agents work simultaneously, their outputs must compose. Agent A writes module X; Agent B writes module Y; the system must guarantee that X and Y work correctly together. In human engineering, this is handled by social processes -- code review, integration testing, Slack conversations. Agents need a mechanical guarantee: strong module interfaces with behavioral contracts, not just type signatures. Any implementation satisfying a module's specification is interchangeable -- parametricity taken to its logical extreme.

Determinism at the denotational level. Two agents given the same specification should produce programs with the same observable behavior, even if the code differs textually. Program equivalence must be defined at the denotational level, not the syntactic level -- two programs are "the same" if they denote the same mathematical object.

Minimal redundancy. Human languages include enormous redundancy: boilerplate, repeated type annotations, import statements, structural scaffolding. For agents, redundancy is pure waste that creates opportunities for inconsistency. The DRY principle, an aspiration in human engineering, becomes a hard language constraint at the representation level: global type inference, implicit imports, structural typing, no boilerplate. We distinguish this from topology-level deduplication---whether to share implementations across module boundaries in a codebase---which the companion paper 16 argues should be relaxed at agent scale to maximize parallelism ("Spec-DRY, Code-WET"). The two levels are complementary: an agent-native language eliminates syntactic waste within each artifact (DRY-repr) while the codebase permits local copies across modules to avoid coupling that serializes parallel work (WET-topo).

Self-describing code. The code itself carries all information needed to understand it: specifications embedded in the code, change history as part of the data model, design rationale encoded in the proof structure, performance characteristics as part of the type system. A "program" is not a text file but a rich structured object including its specification, proof of correctness, performance model, and provenance.

4.2 What Agents Do Not Need

Table 3: What Agents Need vs. Don't Need

| Agents Need | Agents Do Not Need (in principle) |
| --- | --- |
| Unambiguous, total semantics | Readable variable names* |
| Machine-verifiable correctness (proof obligations) | Indentation and formatting |
| Composability via behavioral specifications | Comments and documentation (subsumed by specs) |
| Determinism at the denotational level | Familiar syntax or keywords |
| Minimal redundancy (DRY-repr as hard constraint) | Backward compatibility with human mental models |
| Self-describing code (specs, proofs, provenance embedded) | Interactive debugging (REPLs, breakpoints, print statements) |
| Cost types (time, space, energy budgets) | Social coordination artifacts (versioning, changelogs) |

*With important caveats; see Section 4.2.1 below.

An agent designed for agent-native representations could identify variables by de Bruijn indices or graph positions, operate on ASTs directly (with surface syntax as a human-facing rendering concern), and rely on rich types (e.g., Age : {n : Nat | n < 200}) to subsume comments.

4.2.1 The Training-Artifact Problem: When "Don't Need" Meets "Currently Depend On"

The claims in Table 3 describe what agents in principle do not need -- that is, what an agent designed from scratch for agent-native representations could dispense with. However, the current generation of AI agents, trained overwhelmingly on human-written code, empirically depends on human-readable features in ways that complicate the "don't need" claim. We address the most important cases honestly.

Names. Several studies demonstrate that current LLMs degrade substantially when human-readable identifiers are removed. "When Names Disappear" 41 introduced CLASSEVAL-OBF, a benchmark stripping semantic naming cues, and found LLM performance degradation of up to 62.5%. Wang et al. 42 showed that identifier-aware training improved code generation quality (+4.7 CodeBLEU), with ablations confirming drops when identifier objectives were removed. Miceli Barone et al. 43 found that identifier-swap perturbations caused catastrophic failures in code LLMs, with some model families exhibiting inverse scaling. This paper itself cites 11 as evidence that "naming conventions serve human cognition" -- but the evidence equally shows that they serve current LLM cognition.

Types. Mundler et al. 44 found that approximately 94% of compile errors were type-check errors, and type-constrained decoding cut compilation errors by over 50% while improving functional correctness by +3.5--5.5%. Poesia et al. 45 showed that semantic constraints including typing improved validity substantially. This contradicts a naive reading of Section 2.2's argument that type systems are "redundant overhead"; for current agents, type constraints improve generation quality.

Structured reasoning. Chen et al. 46 demonstrated that iterative debugging with natural-language explanations improved code generation accuracy by 2--12% across benchmarks. Li et al. 47 showed that forcing LLMs to reason using human-readable control structures improved Pass@1 by up to 13.79%. These findings suggest that what we classify as "human cognitive scaffolding" currently functions as computational scaffolding for LLMs as well.

Our position. We do not dispute these findings. The dependency of current agents on human-readable features is real and empirically well-established. Our argument is that this dependency is, we hypothesize, substantially a training artifact, not a fundamental requirement. Current LLMs perform well with human-readable names because they were trained on billions of tokens of human-written code where names carry semantic signal. An agent trained primarily on agent-native representations -- where semantic information is carried by types, proof structures, and graph positions rather than strings -- would, we hypothesize, not share this dependency. Partial evidence is consistent with this hypothesis, though it is non-decisive and drawn from narrow domains: DeepMind's AlphaDev 48 discovered novel sorting algorithms operating directly on assembly instructions without human-readable abstractions, and research on bytecode-level LLMs demonstrates that models can learn to reason over non-human-readable representations in constrained settings. However, these results operate on single algorithms with limited instruction sets and do not establish that the pattern generalizes to full software engineering tasks.

Falsification condition. The training-artifact hypothesis would be falsified if models trained exclusively on agent-native representations (ASTs, intermediate representations, proof terms, de Bruijn-indexed code) still require human-readable names, formatting, or comments for comparable performance on code generation benchmarks -- even after sufficient training on the structured representations to reach convergence. Specifically, if a model fine-tuned on progressively abstracted code representations (removing names, then formatting, then comments, then syntactic sugar) shows monotonic performance degradation that does not recover with continued training, the hypothesis would be refuted: those features would be architecturally necessary rather than contingent on training data. We note that "sufficient training" and "convergence" introduce practical ambiguity into this criterion; Section 10.2 proposes concrete experimental designs to operationalize these terms.

The transition from human-trained to agent-native agents will require a bootstrapping period (Section 9.2) during which human-readable features remain practically necessary even as they may be architecturally unnecessary. The research question is whether agents can be weaned from human-readable scaffolding through progressive training on more structured representations -- a question we believe the available evidence is consistent with but that cannot be answered definitively without the experiments outlined in Section 10.2.

4.3 Candidate Paradigms and Convergence

Several existing paradigms offer foundations for agent-native languages. Dependently-typed languages (Lean 4 49, Idris 2, Agda, F*) allow types to express arbitrarily rich specifications via the Curry-Howard correspondence; AlphaProof 48 and AlphaVerus 50 (CMU, ICML 2025) demonstrate that AI can discover complex proofs in these frameworks, with AlphaVerus achieving 38% on HumanEval-Verified and automated Rust verification through reinforcement learning over the Verus proof assistant. Lean Copilot 51 further demonstrates the feasibility of AI-assisted proof generation, using LLMs to suggest proof steps within the Lean 4 proof assistant. Concatenative languages (Forth, Factor) make composition syntactically lightweight. Dataflow languages (Lustre, TensorFlow graph mode) make parallelism structural. Constraint programming (Z3, MiniZinc) specifies what solutions must satisfy, not how to compute them -- natural for specification-driven agents. Category-theoretic foundations 52, 53 provide the mathematics of composition.

If agents trained their own successors, they would likely converge on something close to formal mathematics -- structured, provable, compositional -- as AlphaProof's results suggest: machine-discovered proofs that look nothing like human proofs yet verify completely. They would not converge on assembly, which lacks composability at scale. Research on emergent communication (IJCAI 2024) shows that agent populations can develop their own protocols, but these protocols lack stability and verifiability -- suggesting that agent-native languages should be designed to have the Section 4.1 properties rather than left to emerge.

4.4 AGL: The Agent Graph Language

Drawing the threads together, we sketch a concrete language designed from first principles for agent programmers.

Core abstractions. AGL defines six node categories: computation nodes (functions, lambdas, pipelines), data nodes (typed values, constants), control nodes (conditionals, loops, concurrency), contract nodes (preconditions, postconditions, invariants), proof nodes (formal proofs or machine-checkable evidence), and interface nodes (external APIs, I/O boundaries). Each node has a stable identifier, type signature, and metadata. The graph explicitly encodes dependencies and dataflow.

Type system. AGL combines six type categories: (1) base types (integers, strings, booleans); (2) dependent types (values in types, enabling Vector Nat 5); (3) refinement types (predicates checked by SMT or proof obligations); (4) effect types (explicit I/O and state effects); (5) proof-carrying types (functions optionally carrying proof artifacts); (6) cost types (token, latency, dollar, energy budgets -- resource awareness as a first-class concern).
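To make category (3) concrete, here is a runtime-checked sketch of a refinement type in plain Python. In AGL the predicate would be discharged statically by an SMT solver; `Refined` and `Age` are illustrative names, not part of any real implementation:

```python
# A refinement type is a base type plus a predicate. Here the
# predicate is checked at construction time; AGL would check it
# statically via SMT or a proof obligation.
class Refined:
    def __init__(self, base, pred):
        self.base, self.pred = base, pred
    def __call__(self, value):
        if not (isinstance(value, self.base) and self.pred(value)):
            raise TypeError(f"refinement violated: {value!r}")
        return value

Age = Refined(int, lambda n: 0 <= n < 200)  # Age : {n : Nat | n < 200}
print(Age(37))  # → 37
# Age(250) raises TypeError: the predicate subsumes a comment.
```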

Execution model. AGL operates at four levels: L0 Intent Spec (goals, constraints, allowed capabilities), L1 Plan Graph (decomposed subgoals and dependencies), L2 Action IR (typed tool and environment operations), L3 Runtime Ops (API calls, shell actions, hardware instructions). Each level is connected to the next by a verified lowering, ensuring the final executable faithfully implements the original specification.

Verification pipeline. Type checking, contract checking, proof checking, and runtime checks form a graduated pipeline. If a node lacks proof, it is explicitly marked as unverified. Agents reason about verification status, prioritizing proof generation for critical nodes.
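The bookkeeping this requires is simple: every node carries a verification status, and an agent queries for critical nodes that still lack proofs. A toy sketch with illustrative field names:

```python
# Each graph node records whether it is critical and whether a proof
# has been discharged. Unproven nodes are explicitly visible, and the
# worklist below is what an agent would prioritize.
nodes = [
    {"id": "sort",  "critical": True,  "proof": True},
    {"id": "parse", "critical": True,  "proof": False},
    {"id": "log",   "critical": False, "proof": False},
]
todo = [n["id"] for n in nodes if n["critical"] and not n["proof"]]
print(todo)  # → ['parse']
```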

Homoiconicity. AGL is intentionally homoiconic: programs are data in the same language. Agents generating, verifying, and transforming programs use the same representation. But AGL goes beyond Lisp's code-as-data tradition: it provides code-as-typed-data-with-proof-obligations. An agent manipulating a program fragment operates on a typed graph node carrying its specification and proof, not a raw S-expression.

4.5 Five Practical Design Proposals

We distill the design space into five concrete proposals, each embodying different points on the agent-native spectrum.

Proposal 1: Proof-Carrying Code. A dependently typed functional language where every function signature includes a behavioral specification and every function body must discharge a proof obligation. The proof is not separate from the code -- via the Curry-Howard correspondence, the proof is the code. When Agent A writes sort and Agent B calls it, Agent B relies on the specification (permutation and sortedness) without inspecting the implementation. Feasibility: high. Lean 4 and F* are already close; AlphaProof-class systems are closing the proof-automation gap.
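In Lean 4 with Mathlib, the signature of such a specification-carrying sort might look like the following sketch. The return type is a subtype bundling the list with evidence; the body is the proof obligation an agent must discharge (here left as `sorry`):

```lean
-- Hedged sketch (Lean 4 + Mathlib's List.Perm and List.Sorted).
-- Any caller may rely on the specification alone, never the body.
def sortWithSpec (xs : List Nat) :
    { ys : List Nat // ys.Perm xs ∧ ys.Sorted (· ≤ ·) } :=
  sorry  -- proof obligation: via Curry-Howard, the body is the proof
```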

Proposal 2: Composition-Only. A language where the only primitive is function composition. No variables, no bindings, no control flow, no data structures -- only functions and their composition. Data structures are Church-encoded; control flow is continuation-passing. This is combinatory logic (Schoenfinkel, Curry) taken to its extreme. Feasibility: moderate, due to potential exponential blowup in practice.
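The core of the proposal fits in three lines. The S and K combinators are Turing-complete on their own: no variables, no bindings, only application -- a sketch, with data structures Church-encoded in the same style:

```python
# Composition-only computation in miniature: the S and K combinators.
S = lambda f: lambda g: lambda x: f(x)(g(x))
K = lambda x: lambda y: x
I = S(K)(K)   # identity, derived: S K K x = K x (K x) = x
print(I(42))  # → 42
```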

Proposal 3: Spec-Is-Code. No distinction between specification and implementation. Every program is simultaneously a specification (what) and an implementation (how); "compilation" is refinement. This realizes the refinement calculus 54, 55: a high-level specification refines to an efficient implementation through verified steps. Feasibility: moderate. Refinement calculi are well-established theoretically; automated verification of refinement steps is the practical challenge.

Proposal 4: Proof-Search Compilation. The agent writes a specification as a logical formula; the "compiler" is a proof-search engine that finds a program satisfying it. Under the Curry-Howard reading, the program is a constructive proof and compilation is proof search. The agent's expertise (understanding requirements) applies to specifications; the compiler's expertise (proof search, optimization) applies to implementations. Feasibility: low-to-moderate for general-purpose programming; compositional synthesis could scale it.

Proposal 5: Structural Programs. Programs stored as structured objects (ASTs, DAGs, hypergraphs), not text. Diff, merge, and refactoring are structural operations. When a thousand agents modify a codebase simultaneously, textual merge is a disaster; structural merge resolves this. Research on AST-based collaborative editing confirms that structural merge significantly reduces false conflicts 56. Feasibility: high. The technology exists; the barrier is ecosystem inertia for human tools, which an agent-native toolchain would not face.
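Why structural merge avoids false conflicts is easy to see in miniature. Two agents edit different functions in one file: as text their hunks can collide, but as a tree keyed by node the edits touch disjoint subtrees. A toy sketch (function bodies as strings for brevity):

```python
# Base tree and two concurrent edits, each touching a different node.
base    = {"f": "return 1",  "g": "return 2"}
agent_a = {"f": "return 10", "g": "return 2"}   # edits f only
agent_b = {"f": "return 1",  "g": "return 20"}  # edits g only

# Structural three-way merge: per node, take whichever side changed.
merged = {k: agent_a[k] if agent_a[k] != base[k] else agent_b[k]
          for k in base}
print(merged)  # → {'f': 'return 10', 'g': 'return 20'}
```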

4.6 The Write-to-Hardware Question

The most radical possibility is that agents might bypass high-level languages entirely. The argument is strongest in performance-critical domains: high-frequency trading (nanosecond latency), embedded systems (severe memory constraints), GPU kernels (fine-grained thread scheduling), and FPGA synthesis (small gap between specification and hardware).

An agent writing LLVM IR directly can leverage decades of optimization infrastructure without being constrained by any source language. WebAssembly offers a more structured target with portability and sandboxing. Raw machine code, however, lacks the compositional properties needed for scale -- no modules, no behavioral specifications, no type system rich enough to enforce invariants.

The resolution is a stratified architecture: (1) dependently typed specifications at the top, where agents write behavioral contracts; (2) a typed, linear-logic-based IR in the middle, with enough structure for compositional reasoning and enough detail for optimization; (3) machine code, LLVM IR, or hardware descriptions at the bottom, where agents write directly for performance-critical components -- provided they supply proofs that their low-level code satisfies the high-level specification. This mirrors the refinement calculus: each layer refines the one above it, and the refinement is verified.

CompCert 57 demonstrated that entire compilers can be formally verified, guaranteeing that compiled code preserves source semantics. In an agent-native ecosystem, verified compilers would ensure faithful translation from high-level specifications to executable code, closing the gap between the stratified architecture's top and bottom layers.


Section 5: Infrastructure and Operations Without Human Operators

Every layer of the contemporary operations stack encodes human assumptions: Grafana renders telemetry into visual charts for pattern-recognizing eyes; PagerDuty sends SMS alerts calibrated to circadian cycles; runbooks are natural-language programs for stressed human interpreters. When the operator is no longer human, every layer becomes architecturally wrong.

5.1 The Human-Centric Operations Stack

The observability stack (Grafana, Datadog, New Relic) is fundamentally a rendering pipeline: millions of data points per hour compressed into visual representations that give humans a feeling for system health, not a precise understanding. Alerting systems (PagerDuty, OpsGenie) are human notification systems -- 60--70% of engineering time is wasted on alert noise because human attention is scarce and fatigable. Runbooks are natural-language programs for interpreters under stress. Incident response (war rooms, Slack channels, verbal consensus) optimizes for limited human communication bandwidth. On-call rotations ensure human availability across time zones. An agent fleet is always on, never fatigued, and shares state at memory speed.

5.2 The Agent-Native Operations Replacement

The replacement is not wrapping existing tools with AI. It requires rethinking the stack from first principles.

Table 4: Human-Centric vs. Agent-Native Operations

| Human-Centric Component | Agent-Native Replacement | Key Difference |
| --- | --- | --- |
| Visual dashboards (Grafana) | Streaming telemetry APIs | Continuous, multi-dimensional, full-fidelity |
| Threshold-based alerts (PagerDuty) | Event-driven agent activation | No human notification overhead |
| Natural-language runbooks | Executable remediation policies | Complete state space coverage |
| War room incident response | Autonomous diagnosis loops | Machine-speed, no communication overhead |
| Periodic post-mortems | Automated causal analysis | Continuous, feeds directly into policy |
| On-call rotations | Always-on agent fleets | Elastic capacity, no burnout |
| YAML/HCL configuration | Typed, schema-validated specs | Precise, machine-verifiable, no ambiguity |
| Human-readable APIs (REST/JSON) | Agent-optimized protocols (gRPC, binary) | Streaming, semantically rich, composable |
| Password-based authentication | Cryptographic agent identity | Continuous verification, capability-based |
| Manual security audits | Continuous behavioral monitoring | Real-time, adaptive, scalable |
| Human deployment decisions | Autonomous canary analysis | Faster, more dimensions, statistically rigorous |
| Periodic chaos experiments | Continuous fault injection | Adaptive, targeted, machine-speed |

In the agent-native model, telemetry is exposed as structured, queryable streams consumed directly -- anomaly detection runs in-stream, not on dashboards. Alerts dissolve into events triggering agent activation with full diagnostic context. Post-mortems dissolve into automated causal analysis that runs continuously.
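In-stream anomaly detection can be sketched with a running mean and variance (Welford's algorithm): each telemetry point either passes or becomes an activation event carrying its own diagnostic context, with no dashboard render step. The threshold and event fields below are illustrative:

```python
# Consume telemetry as a stream; emit activation events, not charts.
def stream_monitor(points, z=3.0, warmup=5):
    n, mean, m2, events = 0, 0.0, 0.0, []
    for x in points:
        if n >= warmup and m2 > 0:
            std = (m2 / (n - 1)) ** 0.5
            if abs(x - mean) > z * std:
                # The event carries full diagnostic context inline.
                events.append({"value": x, "mean": mean, "std": std})
        n += 1                       # Welford's online update
        d = x - mean
        mean += d / n
        m2 += d * (x - mean)
    return events

print(stream_monitor([10, 11, 10, 12, 11, 10, 11, 95]))
```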

5.3 The Agent-Computer Interface (ACI)

SWE-agent 5 demonstrated that interface design affects agent performance more than agent design itself. OSWorld 58 confirmed this: humans completed 72% of tasks vs. agents' 12.24%, but agents performed significantly better on CLI than GUI tasks. Agent-native APIs must be schema-first, semantically rich (responses include provenance and confidence), composable (operations chain without HTTP round-trips), streaming by default, and support probabilistic return values. AIOS 59 proposes OS-level agent services: priority-aware scheduling, context and memory management, storage management, access control, structured I/O channels, and introspection metadata.
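A hedged sketch of what "semantically rich" could mean in practice: a response envelope (field names ours, not from MCP or any standard) in which the payload never travels without its provenance chain and a confidence estimate:

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class AgentResponse:
    """Schema-first envelope: downstream agents can weigh the payload
    by its confidence and audit how it was produced."""
    value: Any
    confidence: float       # in [0, 1]; 1.0 = deterministic lookup
    provenance: tuple       # ordered chain of sources/tools consulted
    schema_version: str = "1"

    def __post_init__(self):
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
```

A probabilistic return value then becomes ordinary: a lookup from a cache might carry confidence 1.0, while an LLM-synthesized answer carries the model's calibrated estimate.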

5.4 Self-Healing Infrastructure

Agent-native self-healing goes beyond Kubernetes' container restarts. Chaos engineering operates continuously at machine speed: fault injection agents generate targeted experiments based on recent changes, with results feeding remediation policy engines. Infrastructure-as-code's drift problem is solved by continuous verification: agents compare declared and observed state in real-time, enforcing specifications as continuously-checked type systems rather than deployment scripts.

5.5 Deployment at Agent Speed

Agent-driven deployment operates on a fundamentally different timescale: the generate-validate-deploy cycle completes in seconds. Agent canary analysis applies statistical tests in real-time (Bayesian changepoint detection, Mann-Whitney U, Kolmogorov-Smirnov) across hundreds of metrics simultaneously, adapting dynamically -- accelerating rollout when traffic shows no issues, increasing statistical power when results are borderline. Deployment windows and freezes dissolve: an agent fleet is equally capable at 3 AM Sunday as at 10 AM Tuesday.
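As an illustration of the statistical machinery (the threshold is hypothetical, and a real system would run sequential tests across hundreds of metrics), a two-sample Kolmogorov-Smirnov distance over latency samples can drive a promote/halt decision:

```python
import bisect

def ks_statistic(baseline, canary):
    """Two-sample KS distance: the largest gap between the empirical
    CDFs of the baseline and canary metric samples."""
    b, c = sorted(baseline), sorted(canary)
    d = 0.0
    for x in set(b) | set(c):
        fb = bisect.bisect_right(b, x) / len(b)
        fc = bisect.bisect_right(c, x) / len(c)
        d = max(d, abs(fb - fc))
    return d

def canary_verdict(baseline, canary, threshold=0.3):
    """Toy decision rule: promote when the distributions are close,
    halt when they diverge beyond the (hypothetical) threshold."""
    return "promote" if ks_statistic(baseline, canary) < threshold else "halt"
```

The adaptive behavior described above amounts to adjusting the sample sizes and threshold dynamically: collect more data when the statistic is borderline, promote early when it is clearly below threshold.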

5.6 Security in an Agent-Operated World

The Cloud Security Alliance's Agentic Trust Framework (ATF) 60, published February 2026, provides the most comprehensive analysis of the shifting threat landscape. Traditional security assumes "human users with predictable behavior, systems that follow deterministic rules, and binary access decisions where trust is established once." AI agents break all three assumptions: non-deterministic behavior adapting to context, probabilistic decisions varying by situation, and access needs changing dynamically based on current task.

New attack vectors include prompt injection (the agent-native equivalent of SQL injection, but fundamentally different because the injection point is a natural-language boundary, which is inherently fuzzy), model poisoning (subtly compromised behavior that passes normal validation -- a supply chain attack at the model level), agent impersonation (one agent masquerading as another), and tool manipulation (modifying tool interfaces to cause unauthorized actions while appearing normal).

The ATF maturity model introduces progressive trust inspired by organizational hierarchies: agents start as "Interns" (observe-only) and earn autonomy through demonstrated trustworthiness, progressing through "Junior" (recommend with human approval), "Senior" (act with notification), and "Principal" (autonomous within domain). Each level has explicit promotion criteria -- minimum time, performance thresholds, security validation, governance sign-off -- and automatic demotion if behavior deviates from baseline. This is fundamentally different from traditional binary access control: trust is earned and continuously verified.

Cryptographic audit trails require every agent action to be logged in tamper-evident form: cryptographically signed entries forming hash chains (preventing modification without detection), structured reasoning traces (for LLM-based agents, capturing not just actions but the reasoning that led to them), and cross-agent correlation (reconstructing multi-agent interactions). HashiCorp's 2026 analysis adds that agent identities are ephemeral, contextual, and hierarchical -- architecturally unsuited to traditional identity management designed for long-lived human accounts.
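A minimal sketch of such a hash-chained audit log. HMAC stands in for the per-agent signature here; a real deployment would use asymmetric signatures and richer structured reasoning traces:

```python
import hashlib
import hmac
import json
import time

class AuditChain:
    """Append-only log where each entry's body embeds the previous
    entry's hash, so any retroactive edit breaks every later link."""

    def __init__(self, signing_key: bytes):
        self._key = signing_key
        self.entries = []
        self._prev = b"\x00" * 32  # genesis link

    def append(self, agent_id: str, action: str, reasoning: str) -> dict:
        body = json.dumps({
            "agent": agent_id, "action": action, "reasoning": reasoning,
            "ts": time.time(), "prev": self._prev.hex(),
        }, sort_keys=True).encode()
        digest = hashlib.sha256(body).digest()
        entry = {
            "body": body,
            "hash": digest.hex(),
            # HMAC as a stand-in for the agent's cryptographic signature.
            "sig": hmac.new(self._key, digest, hashlib.sha256).hexdigest(),
        }
        self.entries.append(entry)
        self._prev = digest
        return entry

    def verify(self) -> bool:
        prev = b"\x00" * 32
        for e in self.entries:
            if json.loads(e["body"])["prev"] != prev.hex():
                return False
            digest = hashlib.sha256(e["body"]).digest()
            if digest.hex() != e["hash"]:
                return False
            expected = hmac.new(self._key, digest, hashlib.sha256).hexdigest()
            if not hmac.compare_digest(e["sig"], expected):
                return False
            prev = digest
        return True
```

Cross-agent correlation would then be a matter of joining entries across per-agent chains by timestamp and referenced identifiers.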

5.7 Protocol Layer: MCP + A2A

Two complementary protocols are emerging as the standard infrastructure for agent-operated systems. The Model Context Protocol (MCP) standardizes how agents connect to tools and data sources, with explicit security principles around user consent, data privacy, and tool safety. MCP becomes the tool plane of the agent-native stack. The Agent2Agent protocol (A2A), introduced by Google (2025), addresses inter-agent coordination across vendors and frameworks, supporting secure authentication, long-running task management, and capability discovery through agent cards. A2A becomes the coordination plane.

Together, MCP and A2A provide the protocol layer that, combined with AIOS kernel enforcement, defines how agents interact with their environment and with each other. MCP enforces data governance and tool safety; A2A enforces authentication and task lifecycle constraints; the AIOS kernel enforces resource limits and access control across both.

The transition will not be uniform. The most likely trajectory follows a three-horizon model articulated by Harness (February 2026): human-in-the-loop (the human drives, the agent assists) for the near term; human-on-the-loop (the agent drives, the human reviews) for years 2--4; and human-out-of-the-loop (the agent drives, the human sets policy) for well-understood operational domains within five years. Throughout, the human role evolves from operator to policy designer to strategic overseer -- not removed from the system but repositioned as a circuit breaker, intervening only when the agent fleet encounters situations outside its policy set or when stakes warrant human judgment.


PART III: FRONTIER -- Where This Leads


Section 6: Self-Improving Systems -- When the Loop Closes

The preceding sections treated the agent as a fixed entity operating within an evolving environment. This section removes that boundary, examining what happens when agents modify their own capabilities, train successor models, and dissolve the distinction between builder and built. Every mechanism described here has at least a prototype implementation in the 2024--2026 literature.

6.1 The Self-Improvement Loop

Agent self-improvement follows a generate-evaluate-improve feedback loop. AlphaCode 61 demonstrated that large-scale sampling with execution-based filtering achieves competitive-level programming. AlphaProof 48 extended this into formal mathematics, coupling language models with the Lean proof assistant to achieve olympiad-level theorem proving through reinforcement learning -- the verification signal from Lean acts as an incorruptible oracle. STaR 62 showed that failures can bootstrap training signal through rationalization, iteratively expanding the set of problems the model can solve.

The bootstrapping problem is real: self-improvement requires a foundation of capability to generate useful signal. All current self-improving agents stand on human knowledge absorbed during pre-training. The theoretical notion of recursive self-improvement 63 predicts explosive capability growth, but empirical evidence in code generation strongly favors gradualism -- smooth scaling-law improvements rather than discontinuous leaps. The relevant question is whether the loop eventually becomes self-sustaining, making human involvement optional rather than essential.

6.2 Agents Training Their Own Models

Self-training mechanisms span a spectrum of ambition, summarized in Table 5.

Table 5: Self-Improvement Mechanisms Taxonomy

| Mechanism | What Evolves | How | Example |
| --- | --- | --- | --- |
| Project fine-tuning | Model weights | Gradient-based on accepted edits | Copilot workspace adaptation |
| Synthetic data generation | Training corpus | Agent-generated (problem, solution) pairs | Seed-Coder 64 pipeline |
| Curriculum learning | Training distribution | Uncertainty-guided selection | STaR 62 iterative refinement |
| Distillation | Model architecture | Large-to-small knowledge transfer | Model zoo specialization |
| Tool creation | Tool library | Write, test, commit to skill library | VOYAGER 65 skill accumulation |
| Orchestration rewrite | Dispatch logic | Self-modification with rollback | Darwin Gödel Machine 66 |

Project-specific fine-tuning -- where accepted edits and passing tests become training data -- is the most immediately practical form. Synthetic data generation carries the risk of distributional collapse; mitigations include maintaining human-code reservoirs, optimizing for output diversity, and execution-based validation 67. Distillation creates specialized "model zoos" where a large teacher model produces smaller, task-specific students 68. Each mechanism has been demonstrated individually; their integration into a coherent system remains open.

6.3 From Tool Use to Tool Creation

The transition from using predefined tools to creating new ones is a qualitative leap. VOYAGER 65 maintains a compositional, persistent, retrievable, and verifiable skill library that grows through gameplay. Toolformer 69 demonstrated self-supervised tool learning bootstrapped from utility rather than instruction. In multi-agent systems, MCP creates a "tool substrate" where meta-agents generate new MCP servers and plug them into their own workflows. The Darwin Gödel Machine (DGM) 66 takes this further: agents modify their own orchestration logic, with only performance-improving variants retained -- direct analogues of compiler optimizations applied to the agent's own runtime.

6.4 The Compiler That Writes Itself

Classical compiler bootstrapping (GCC compiling itself, Lean 4 implemented in Lean) provides the template. Thompson's "Reflections on Trusting Trust" 70 identified the security implication: a trojanized compiler perpetuates backdoors through all subsequent versions, invisible in source code. This is a precise analogue of the alignment problem in self-improving AI -- a single compromised improvement cycle propagates through all downstream cycles.

Agent-designed languages extend this trajectory. Google's MLGO uses RL for compiler optimization decisions; COMPASS demonstrates LLM-guided MLIR pass generation; daVinci-Dev 71 demonstrates agent-native mid-training for software engineering, providing early evidence that the bootstrapping pathway from human-centric to agent-native infrastructure is practically achievable. The critical difference from classical bootstrapping is that agent self-modification is continuous rather than discrete, creating a "Ship of Theseus" situation where the system evolves incrementally while processing tasks.

6.5 What Software Becomes

When agents continuously generate, deploy, evaluate, and regenerate code, software becomes more organism than artifact. The specification-driven paradigm treats specifications as DNA and code as phenotype: a derived, regenerable expression that may be discarded and regenerated when conditions change. Technical debt dissolves because code is not expected to persist. Version control shifts from tracking code changes to tracking specification evolution. The distinction between "software" and "system" dissolves: a system is a specification plus the machinery for continuously realizing it.

Human judgment retains value where formal specification is insufficient: aesthetic choices, ethical judgments, strategic decisions about what to build. The choice of what behavior to specify remains a fundamentally human responsibility.

6.6 Safety and Alignment

Self-improving systems present a fundamental alignment challenge: ensuring that improved versions maintain behavioral constraints. Capability changes can interact with alignment in unpredictable ways -- the "instrumental convergence" concern 72. Current approaches include Constitutional AI (behavioral constraints as a trained constitution), formal verification of value preservation (currently limited to simple systems), and behavioral monitoring that halts improvement cycles when deviations are detected.

Formal verification is the strongest available safety tool, but it faces limits: dependent type checking can be undecidable, proof generation is expensive, specifications themselves may be flawed, and environmental models may be incomplete 73. AgentGuard 74 complements static verification with runtime enforcement. Corrigibility -- the property of not resisting modification or shutdown -- can be engineered into the goal structure rather than imposed as an external constraint 75, but each improvement cycle must preserve it.

The convergence of these mechanisms defines the human role in self-improving systems: not operator but governor -- setting constraints, auditing behavior, and retaining authority to intervene. The human moves from writing code to writing specifications of acceptable behavior, from debugging programs to debugging alignment properties.


PART IV: CONTRIBUTION -- What We Specifically Propose


Section 7: Five Novel Concepts for the Agent-Native Stack

The preceding analysis -- spanning human constraints baked into the software stack, structured representations for agent-scale code, agent-native languages, infrastructure without human operators, and self-improving systems -- converges on a clear architectural need: a coherent, integrated stack designed from first principles for agent-native software development. This section distills our analysis into five concrete architectural concepts that together define both a research agenda and a practical blueprint for agent-native software systems.

Each concept is individually grounded in existing research but novel in its specific formulation. Proof-Carrying Patch Graphs (PCPG) synthesize structural diff, formal verification, and cryptographic provenance into a single change-management primitive. Multi-Level Agent IR (MAIR) generalizes the MLIR insight into a four-level stack purpose-built for agent workflows. Capability-Leased Tool Fabric (CLTF) applies affine and quantitative type theory to the practical problem of securing agent fleet tool access. Latent Coordination Bus (LCB) unifies three communication modalities under a single provenance-tracked protocol. Counterexample Supply Chain (CSC) closes the loop between production failures and system improvement. Together, these five concepts compose into an integrated agent-native stack.

7.1 Proof-Carrying Patch Graphs (PCPG)

Definition. A Proof-Carrying Patch Graph is a directed acyclic graph in which every node represents a code change and every node carries five artifacts: (1) the change itself, represented as a structural diff over typed ASTs rather than textual lines; (2) the contract delta, specifying how the change modifies behavioral specifications (preconditions, postconditions, invariants); (3) test evidence, recording which tests were executed and their results; (4) formal proof obligations, stating what must be proven for the change to be considered correct; and (5) verification receipts, providing cryptographically signed attestations that specific verification procedures were performed by identified verification agents. Edges in the graph encode causal and compositional dependencies between changes: a patch P_2 that depends on the modifications introduced by P_1 is connected by a directed edge from P_1 to P_2.

Motivation. In environments where 1,000 or more agents commit changes concurrently, the traditional change-management primitives -- textual diffs, human code review, and linear version histories -- are insufficient. Textual diffs conflate formatting changes with semantic changes and produce false conflicts when independent agents modify nearby lines. Human code review, which operates at a rate of roughly 200--400 lines of code per hour, cannot scale to the throughput of agent fleets. Linear version histories (as in Git) force serialization of inherently parallel work. PCPG addresses all three deficiencies: structural diffs eliminate formatting noise, machine-verifiable proof obligations replace human review for correctness, and the graph structure supports parallel, non-linear change histories.

Connection to existing research. PCPG draws on several established lines of work. GumTree provides syntax-aware tree differencing that operates on ASTs rather than text lines, detecting moves and renames that textual diff misses. GrammarCoder 15 demonstrates that grammar-based structural representations yield up to 82% improvements in code generation quality (at 1.3B parameters; gains diminish at larger scales), validating the premise that structural representations carry more semantic information than text. VerifyThisBench 39 evaluates the combined generation of code, specifications, and machine-checkable proofs, establishing that unified code-plus-proof artifacts are a viable target for AI systems. CompCert 57 proves that formal verification of complex software transformations is achievable, establishing the precedent that compiler-like transformation chains can carry proofs of semantic preservation. The Necula proof-carrying code framework 38 establishes the theoretical foundation: code that carries a machine-checkable proof of compliance with a safety policy, where the proof is cheaper to verify than to generate.

Design sketch. The PCPG data model consists of:

  • Patch nodes. Each node contains a structural diff (expressed as a typed AST edit script following the GumTree formalism), metadata (author agent ID, timestamp, parent node references), and the five carried artifacts. The structural diff is parameterized by the representation level: it may operate on source-level ASTs, on MAIR L1 plan graphs, or on L2 action IRs, depending on the abstraction level of the change.

  • Verification receipt schema. A receipt is a tuple (verifier_id, verification_type, property_verified, result, timestamp, signature), where verification_type ranges over type-checking, contract-checking, proof-checking, test-execution, and static-analysis. The signature is a cryptographic attestation by the verifying agent, enabling third-party audit of the verification chain.

  • Composition rules. Two PCPG nodes P_a and P_b can be composed (merged) if and only if: (a) their structural diffs are non-conflicting at the AST level (they modify disjoint subtrees or compatible overlapping subtrees), (b) their contract deltas are consistent (no postcondition of P_a contradicts a precondition introduced by P_b, and vice versa), and (c) both carry verification receipts for a common verification policy. Composition produces a new node whose structural diff is the union of the constituent diffs, whose contract delta is the conjunction of the constituent deltas, and whose proof obligations are the union of the constituent obligations minus any obligations discharged by the constituent proofs.
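The composition rules above can be sketched in code. This is an illustrative simplification (subtrees modeled as path strings; contract-delta consistency and obligation discharge are not modeled), not the full PCPG algebra:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PatchNode:
    """A PCPG node carrying the five artifact slots, simplified to sets."""
    diff_paths: frozenset      # AST subtree paths this patch modifies
    contract_delta: frozenset  # (kind, predicate) pairs, e.g. ("post", "len(out) > 0")
    test_evidence: frozenset   # tests executed and passed
    obligations: frozenset     # proof obligations still open
    receipts: frozenset        # verification policies attested by receipts

def compose(a: PatchNode, b: PatchNode) -> PatchNode:
    """Merge two patches: reject structural conflicts, require a common
    verification policy, and union the evidence."""
    if a.diff_paths & b.diff_paths:
        raise ValueError("structural conflict: patches touch the same subtree")
    if not (a.receipts & b.receipts):
        raise ValueError("no common verification policy")
    return PatchNode(
        diff_paths=a.diff_paths | b.diff_paths,
        contract_delta=a.contract_delta | b.contract_delta,
        test_evidence=a.test_evidence | b.test_evidence,
        obligations=a.obligations | b.obligations,  # discharge not modeled here
        receipts=a.receipts & b.receipts,
    )
```

The "compatible overlapping subtrees" case and the subtraction of discharged obligations are where the real algebra lives; this sketch only captures the disjoint-subtree fast path.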

The PCPG model transforms version control from a social coordination mechanism (Git, designed for human developers communicating through commit messages) into a formal verification pipeline where trust is established through machine-checkable evidence rather than human review. PCPG operationalizes the Evidence-Carrying Patch (ECP) paradigm proposed for agent-scale workflows 16 by instantiating the abstract evidence bundle as five typed artifact slots and adding graph topology with a composition algebra, providing a concrete data structure where ECP provides an organizational principle.

Table: ECP-to-PCPG Mapping

| ECP Concept (Macro) | PCPG Realization (Micro) |
| --- | --- |
| Evidence bundle | 5-slot artifact tuple (diff, contract delta, tests, proof obligations, verification receipts) |
| Patch identity | Typed AST edit script (GumTree formalism) with cryptographic hash |
| Trust signal | Verification receipt schema: (verifier_id, type, property, result, timestamp, signature) |
| Composition rule | Graph-algebraic merge: non-conflicting AST subtrees + consistent contract deltas + common verification policy |
| Provenance chain | DAG edges encoding causal and compositional dependencies between PCPG nodes |

Current vs. target maturity of proof-carrying semantics. PCPG's full design assumes that agents can reliably generate formal proofs alongside code. Current empirical evidence does not yet support this assumption at scale: the companion paper 16 reports a 3.6% pass@1 proof rate on the VERINA benchmark and identifies a falsification criterion if proof generation success rates remain below 10% over a defined time horizon. We frame PCPG accordingly: it is target-state infrastructure whose value scales with proof generation capability, not an architecture that requires full proof maturity to deliver returns. The five-slot design degrades gracefully -- when formal proofs are unavailable, the test evidence slot, contract delta slot, and verification receipts (which can attest to static analysis, type-checking, or test execution rather than formal proof) still carry substantial trust signal. As proof generation matures (e.g., DeepSeek-Prover-V2 achieving 49/50 on miniF2F-test), the proof obligations and verification receipts slots become increasingly load-bearing, and the PCPG graph converges toward the full proof-carrying ideal. The architecture is designed to reward incremental improvements in verification capability rather than requiring a capability threshold to be useful.

7.2 Multi-Level Agent IR (MAIR)

Definition. MAIR is a standard intermediate representation stack spanning four levels with semantics-preserving lowering and bidirectional traceability: L0 (Intent Specification: declarative goals, acceptance predicates, resource budgets), L1 (Plan Graph: DAG of decomposed subgoals with typed data/control/capability edges), L2 (Action IR: typed tool operations with pre/postconditions and cost estimates), and L3 (Runtime Operations: concrete API calls, shell commands, RPCs). Lowering from L_n to L_{n+1} produces a proof obligation asserting that the lower-level artifact satisfies upper-level constraints. Every L3 operation carries a provenance chain back through L2, L1, and L0.

Motivation. No single representation suffices for agent-scale software. Without a standard multi-level IR, agent systems perform ad hoc translation between representations, losing semantic information and making verification intractable. MAIR generalizes the MLIR insight 34 -- composable dialects at multiple abstraction levels -- from compiler infrastructure to the full agent workflow stack. LLVM IR provides the precedent for typed universal IRs; WebAssembly demonstrates sandboxed portable execution; COMPASS shows LLMs can generate MLIR pass pipelines; the AoT framework demonstrates that explicit IR layers between natural language and hardware improve functional correctness by 1.8--5.2x.
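A sketch of the four-level lowering discipline, with artifact contents as opaque strings (real MAIR would carry typed structures). Each lowering step emits a proof obligation and extends the provenance chain back to the L0 intent:

```python
from dataclasses import dataclass
from enum import IntEnum

class Level(IntEnum):
    L0_INTENT = 0    # declarative goals, acceptance predicates, budgets
    L1_PLAN = 1      # DAG of decomposed subgoals
    L2_ACTION = 2    # typed tool operations with pre/postconditions
    L3_RUNTIME = 3   # concrete API calls, shell commands, RPCs

@dataclass(frozen=True)
class Artifact:
    level: Level
    content: str
    provenance: tuple  # chain of parent artifacts back to the L0 intent

@dataclass(frozen=True)
class Obligation:
    """Emitted by lowering: the lower artifact must satisfy the
    constraints of the upper one."""
    upper: Artifact
    lower: Artifact

def lower(a: Artifact, content: str) -> tuple:
    """Semantics-preserving lowering from L_n to L_{n+1}."""
    if a.level == Level.L3_RUNTIME:
        raise ValueError("L3 is the bottom of the stack")
    child = Artifact(Level(a.level + 1), content, a.provenance + (a,))
    return child, Obligation(upper=a, lower=child)
```

Bidirectional traceability falls out of the provenance tuple: any L3 operation can be walked back through its L2, L1, and L0 ancestors.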

7.3 Capability-Leased Tool Fabric (CLTF)

Definition. In CLTF, tool access is modeled as affine, expiring capabilities with typed, scoped, revocable leases. An agent does not "have access to" a tool; it holds a lease that specifies: what operations are permitted (the capability type), for how long (the time-to-live), with what resource budget (token, compute, cost limits), with what blast radius (the maximum scope of side effects), and with what delegation rights (whether the lease can be sub-leased to child agents and under what restrictions). Leases follow quantitative affine semantics drawing on Girard's linear logic 76 and Atkey's quantitative type theory (LICS 2018): they cannot be duplicated, ensuring that resource usage is tracked precisely, and they may expire without being exercised (affine, not strictly linear). Budgeted leases (e.g., "at most 1,000 API calls") carry a usage count as part of the type -- a quantitative annotation tracking remaining invocations. The delegation operation (below) splits a lease's budget rather than transferring the lease wholesale: the delegator retains the lease with a reduced budget, and the delegate receives a new sub-lease with the transferred portion. Both parties hold valid leases simultaneously, but the sum of their budgets cannot exceed the original -- delegation preserves the global budget invariant through quantitative resource splitting, not through affine move semantics.

Motivation. Agent fleets accessing thousands of tools require security and resource management that goes far beyond static role-based access control (RBAC). Consider a fleet of 1,000 agents, each with access to file system operations, network APIs, database queries, and model invocations. Under static RBAC, each agent either has or does not have access to each tool. This binary model cannot express temporal constraints (access for 10 minutes only), budget constraints (at most 1,000 API calls), blast-radius constraints (may only modify files in directory X), or delegation constraints (may sub-lease read access but not write access). CLTF addresses all of these through a unified capability model inspired by affine and quantitative type systems, where resources are used at most a bounded number of times and tracked precisely through the type system.

Connection to existing research. CLTF integrates ideas from several domains. The MCP specification defines security principles for tool access -- explicit user consent, data privacy, tool safety -- but does not provide a formal capability model. The Cloud Security Alliance's Agent Trust Framework (ATF), published in February 2026, proposes a maturity model for agent trust (Intern -> Junior -> Senior -> Principal) that maps naturally to CLTF lease restrictions: an Intern-level agent receives narrowly scoped leases with short TTLs and no delegation rights; a Principal-level agent receives broad leases with long TTLs and full delegation. HashiCorp's zero-trust architecture for agents emphasizes that every agent action should be authenticated and authorized. The AIOS kernel provides scheduling and access control for agent workloads. Affine type systems, originating in Girard's linear logic 76 and realized in languages like Rust's ownership model (which is affine, not strictly linear -- values may be dropped without use), provide the theoretical foundation for resources that are used at most once. Quantitative type theory (as in Idris 2) generalizes this to bounded usage counts, which maps directly to CLTF's budgeted leases.

Design sketch. The CLTF type system defines:

  • Lease type: Lease<Tool, Ops, TTL, Budget, BlastRadius, DelegationRights>, where Tool identifies the tool or tool class, Ops is a set of permitted operations (a subset of the tool's full API), TTL is a duration, Budget is a multi-dimensional resource limit (tokens, API calls, compute seconds, cost), BlastRadius is a scope constraint (file paths, network endpoints, database tables), and DelegationRights specifies whether and how the lease can be sub-leased.

  • Capability algebra. Four operations on leases:

    • compose(L1, L2) -> L3: combines two leases for the same tool into a single lease with the union of their operations and the intersection of their constraints (most restrictive TTL, smallest budget, narrowest blast radius).
    • restrict(L, constraint) -> L': produces a new lease with tighter constraints than the original. Restriction is always permitted; it cannot expand capabilities.
    • delegate(L, agent_id, k') -> (L_rem, L'): splits the lease budget, producing two leases: L_rem (retained by the delegator with budget reduced by k') and L' (granted to the delegate with budget k'). Delegation is only permitted if the lease's DelegationRights allow it, and the sub-lease can only be more restrictive than the parent across all dimensions.
    • revoke(L): immediately invalidates a lease. Revocation propagates to all delegated sub-leases.

  • AIOS kernel integration. The AIOS kernel maintains a lease registry and enforces lease constraints at the system call level. Every tool invocation is intercepted by the kernel, which checks the invoking agent's lease for the requested operation, verifies that the TTL has not expired and the budget has not been exhausted, and logs the invocation for audit. Violations result in immediate invocation rejection and, optionally, lease revocation and agent suspension.

  • Lifecycle. A lease progresses through states: Granted -> Active -> [Delegated] -> Revoked | Expired. The lease lifecycle is logged to the cryptographic audit trail (shared with LCB, below), providing a complete record of who had what access, when, and what they did with it.
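The compose and restrict operations of the capability algebra can be sketched directly. This is a simplified model (scope as a path prefix, no TTL clock, field names ours), intended only to show the "union of operations, intersection of constraints" discipline:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Lease:
    tool: str
    ops: frozenset     # permitted operations
    ttl_s: int         # remaining time-to-live, in seconds
    budget: int        # remaining invocations
    scope: str         # blast radius, modeled as a path prefix
    delegable: bool

def _narrower(s1: str, s2: str) -> str:
    # The narrower scope is the longer prefix; disjoint scopes cannot compose.
    if s1.startswith(s2):
        return s1
    if s2.startswith(s1):
        return s2
    raise ValueError("disjoint blast radii")

def compose(a: Lease, b: Lease) -> Lease:
    """Union of operations, intersection of constraints."""
    if a.tool != b.tool:
        raise ValueError("compose is defined per tool")
    return Lease(a.tool, a.ops | b.ops, min(a.ttl_s, b.ttl_s),
                 min(a.budget, b.budget), _narrower(a.scope, b.scope),
                 a.delegable and b.delegable)

def restrict(l: Lease, *, ops=None, ttl_s=None, budget=None) -> Lease:
    """Restriction only tightens; it can never expand a capability."""
    return replace(
        l,
        ops=l.ops if ops is None else frozenset(ops) & l.ops,
        ttl_s=l.ttl_s if ttl_s is None else min(ttl_s, l.ttl_s),
        budget=l.budget if budget is None else min(budget, l.budget),
    )
```

Note that restrict intersects every dimension with the original lease, so even a caller asking for a larger budget gets back at most what it already held.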

7.3.1 Formal Type System for CLTF

We formalize CLTF using a type-theoretic framework grounded in quantitative type theory 77 with affine resource management. The formalization makes the safety invariants precise and machine-checkable.

Lease type. A lease is an 8-tuple:

Lease = <Owner, LeaseId, Tool, Operations, Budget, Expiry, Scope, Delegable>

where Owner is the agent holding the lease, LeaseId is a unique identifier, Tool is a tool identifier, Operations is a set of permitted tool operations (a subset of the tool's full API surface), Budget : Nat is a non-negative use count, Expiry : Timestamp is an absolute expiration time, Scope constrains the blast radius (e.g., file path prefixes, network CIDRs, database table names), and Delegable : Bool determines whether the holder may sub-lease. Delegation rights are modeled as a separate typed component rather than encoded within the Operations set; this separation prevents conflation of tool semantics (what operations the lease permits on the tool) with policy management (whether the lease may be split and shared).

Typing rules. We define four judgments over a lease context Delta (a multiset of leases) and an agent context Gamma. All rules use the full 8-tuple consistently. We write d for the Delegable flag and require fresh(id) as a side condition wherever a new lease identifier is minted.

Creation:

Delta, Gamma |- grant(a, id, t, ops, k, exp, s, d) : Lease<a, id, t, ops, k, exp, s, d>

where fresh(id)

An authority (the AIOS kernel or a delegating agent) creates a lease granting agent a access to tool t for operations ops, with budget k, expiry exp, scope s, and delegation flag d. The fresh lease identifier id is generated by the kernel. The lease is added to a's lease context Delta.

Use (budget decrement):

Delta, l : Lease<a, id, t, ops, k, exp, s, d> |- use(l, op) : Result<t, op>

where op in ops, k >= 1, now < exp

producing Delta, l : Lease<a, id, t, ops, k-1, exp, s, d>

Each invocation consumes one unit of budget. When k reaches 0, the lease is exhausted and further use is rejected. This is quantitative typing in the sense of Atkey (2018) and Idris 2's multiplicity annotations: the type carries a usage count that the type system tracks. The use rule consumes the input lease and produces a new lease with decremented budget -- the lease variable is rebound, not mutated.

Delegation (quantitative budget split):

Delta, l : Lease<a, id, t, ops, k, exp, s, d> |- delegate(l, a', ops', k', exp', s', d') : Lease<a, id, t, ops, k-k', exp, s, d> x Lease<a', id', t, ops', k', exp', s', d'>

where d = true (delegation-authority premise) and fresh(id') (fresh identifier for sub-lease) and ops' is a subset of ops (operation restriction) and 0 < k' <= k (non-trivial split; full transfer permitted) and exp' <= exp (expiry restriction) and s' is a subset of s (scope restriction)

Delegation splits the lease's budget into two disjoint shares: the delegator retains a lease with budget k - k', and the delegate receives a new lease (with fresh identifier id') with budget k'. Both parties hold valid leases simultaneously. The constraint 0 < k' <= k ensures the split is non-trivial (at least one unit delegated) while permitting full transfer (k' = k), in which case the delegator suspends at budget 0 until it reclaims residual budget via the Join rule (below). The sub-lease can only be more restrictive than the parent lease across all capability dimensions (operations, budget, expiry, scope) -- delegation cannot amplify capabilities.

Join (Reclaim):

Delta, l_p : Lease<a, id_p, t, ops, k_p, exp, s, d>, l_c : Lease<a', id_c, t, ops_c, k_c, exp_c, s_c, d_c> |- join(l_p, l_c) : Lease<a, id_p, t, ops, k_p + k_c, exp, s, d>

where is_child(id_c, id_p) and State(l_c) in {Done, Error}

producing Delta with l_c removed

When a child agent's lease enters a terminal state (Done or Error), the parent may reclaim the child's residual budget via join. The parent's budget increases by the child's remaining budget k_c, and the child lease is consumed -- preventing double-reclaim. This closes the "zombie lease" gap: without Join, delegated budget that is never fully consumed would be permanently lost, creating a liquidity trap in long-running orchestration hierarchies.
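A minimal sketch of the Join rule, assuming a simplified lease record and a dotted-ID lineage convention (both our own illustrative choices):

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Lease:
    owner: str
    lease_id: str
    budget: int
    state: str  # simplified lifecycle: "Active", "Done", or "Error"

def is_child(child_id: str, parent_id: str) -> bool:
    # Hypothetical lineage convention: child IDs extend the parent's ID.
    return child_id.startswith(parent_id + ".")

def join(parent: Lease, child: Lease) -> Lease:
    """Reclaim a terminated child's residual budget. The child lease is
    consumed by the join, so double-reclaim is impossible."""
    if not is_child(child.lease_id, parent.lease_id):
        raise PermissionError("not a sub-lease of this lease")
    if child.state not in {"Done", "Error"}:
        raise PermissionError("child lease is not in a terminal state")
    return replace(parent, budget=parent.budget + child.budget)
```

The terminal-state guard is the runtime predicate `State(l_c) in {Done, Error}` from the rule; enforcing it in the kernel, rather than statically, is discussed under the phase-distinction note below.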

Revocation:

Delta, l : Lease<a, id, t, ops, k, exp, s, d> |- revoke(l, id_child) : Lease<a, id, t, ops, k, exp, s, d> x RevocationReceipt(id_child)

where is_child(id_child, id)

producing Delta with id_child and all transitive sub-leases of id_child removed

Revocation targets a specific sub-lease by identifier. The revoking agent retains their own lease unchanged. Revocation propagates transitively: all sub-leases derived from the revoked lease are also invalidated. The RevocationReceipt is logged to the audit trail. In a full formalization, revocation would operate on a global lease store Sigma (mapping lease IDs to lease records and a revocation set), with use and delegate requiring id not in Revoked(Sigma) as a precondition. We elide store semantics here for brevity, noting that the AIOS kernel's lease registry serves this role at runtime.
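Transitive revocation over a lease store can be sketched as a prefix walk, again assuming the illustrative dotted-ID lineage convention and a plain dict standing in for the kernel's lease registry:

```python
def revoke(registry: dict, id_child: str, id_parent: str) -> list:
    """Remove id_child and all transitive sub-leases from the lease store.
    Returns a RevocationReceipt: the sorted list of revoked lease IDs,
    suitable for the audit trail."""
    if not id_child.startswith(id_parent + "."):
        raise PermissionError("can only revoke descendants of your own lease")
    revoked = [lid for lid in registry
               if lid == id_child or lid.startswith(id_child + ".")]
    for lid in revoked:
        del registry[lid]
    return sorted(revoked)
```

With a store `{"root": 10, "root.1": 4, "root.1.1": 2, "root.2": 3}`, revoking `"root.1"` removes both `"root.1"` and `"root.1.1"` while leaving the sibling `"root.2"` intact, matching the transitive-propagation semantics above.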

A note on phase distinction. Readers familiar with type theory will observe that several side conditions in the typing rules above -- now < exp in the Use rule, State(l_c) in {Done, Error} in the Join rule, fresh(id) in Creation and Delegation -- are runtime predicates rather than purely static (compile-time) judgments. This is a deliberate design choice, not an oversight. CLTF's typing rules operate in a runtime-dependent regime analogous to Liquid Haskell's refinement types, where type-level predicates are discharged by an SMT solver at verification time but may depend on runtime values (e.g., the current timestamp, the state of a child agent's lease). The alternative -- a phase-separated typing discipline where all side conditions are resolved statically -- would be more standard from a PL perspective but would lose expressiveness for lease lifecycle management: lease expiry and agent completion are inherently dynamic properties that cannot be known at lease-creation time. The tradeoff is that CLTF's safety guarantees (Invariants 1 and 2 below) require runtime enforcement by the AIOS kernel's lease registry, not purely static type checking. This is consistent with the capability-system tradition (Capsicum, Macaroons, Cloudflare Workers) where capability constraints are enforced at the system-call boundary rather than at compile time. A fully phase-separated alternative formalization -- where expiry and state conditions are modeled as abstract predicates discharged by a runtime monitor -- is possible and would yield a cleaner metatheory at the cost of a more complex operational semantics; we leave this exploration to future work.

Quantitative affine property. A lease l cannot be duplicated: there is no rule of the form l : Lease<...> |- (l, l) : Lease<...> x Lease<...>. However, unlike pure affine types (which permit at most one use), a budgeted lease permits up to k uses -- each use rebinds the lease variable with a decremented budget. Delegation splits a single lease into two leases with disjoint budget shares. A lease may also expire without being exercised (dropped from context when now >= exp), making the system affine rather than strictly linear. The combination of affine non-duplication with quantitative budget tracking follows Atkey's framework, where resource annotations on types govern how many times a binding may be consumed.

Non-amplification proof sketch. We sketch a proof of non-amplification across all capability dimensions, not just budget. Define a capability preorder <=_cap on leases: l' <=_cap l iff l'.ops is a subset of l.ops, l'.budget <= l.budget, l'.expiry <= l.expiry, and l'.scope is a subset of l.scope. Define a budget measure mu(Delta) = sum of l.budget for all leases l in Delta.

Monotonicity (all dimensions). The delegation rule's premises directly enforce l_child <=_cap l_parent for operations, expiry, and scope (subset/no-greater-than at each delegation step). By induction on any delegation chain l_0 -> l_1 -> ... -> l_n, transitivity of subset/leq gives l_n <=_cap l_0. No delegation chain can widen operations, extend expiry, or broaden scope beyond the root lease.

Conservation (budget). We show that no rule increases mu(Delta) beyond what the kernel mints:

  • Creation: mu increases by k (the granted budget). Creation is privileged -- only the AIOS kernel may mint new budget, establishing the conservation boundary.
  • Use: mu(Delta') = mu(Delta) - 1. Budget strictly decreases.
  • Delegation: The input lease has budget k. The output consists of two leases with budgets k - k' and k'. mu(Delta') = mu(Delta) - k + (k - k') + k' = mu(Delta). Budget is conserved exactly.
  • Join: The parent lease has budget k_p and the consumed child lease has residual budget k_c. The output is a single lease with budget k_p + k_c, and the child lease is removed from Delta. mu(Delta') = mu(Delta) - k_p - k_c + (k_p + k_c) = mu(Delta). Budget is conserved exactly.
  • Revocation: mu(Delta') <= mu(Delta). Budget may decrease (revoked sub-leases are removed) but cannot increase.

By induction on any sequence of rule applications from an initial context Delta_0, mu(Delta_n) <= mu(Delta_0) + (budget minted by kernel). Combining monotonicity and conservation: no delegation chain can produce aggregate capability exceeding what the kernel originally granted, across any dimension of the capability tuple. QED (sketch).
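The conservation cases can be replayed concretely on a toy context: a plain dict from lease ID to remaining budget (the IDs and numbers below are illustrative). Each step mirrors one rule, and the measure mu never exceeds what the kernel minted.

```python
def mu(ctx: dict) -> int:
    """Budget measure: total remaining budget across all live leases."""
    return sum(ctx.values())

ctx = {"root": 10}        # Creation: kernel mints a root lease with budget 10
minted = mu(ctx)

ctx["root"] -= 4          # Delegation: split k' = 4 off the root...
ctx["root.1"] = 4         # ...into a fresh child lease. mu is conserved.
assert mu(ctx) == minted

ctx["root.1"] -= 1        # Use: one invocation. mu strictly decreases.
assert mu(ctx) == minted - 1

ctx["root"] += ctx.pop("root.1")  # Join: reclaim residual budget. mu conserved.
assert mu(ctx) == minted - 1
assert mu(ctx) <= minted  # never exceeds what the kernel minted
```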

Safety invariants. We state two central safety properties:

Invariant 1 (Capability Gate). For all agents a, tools t, and operations op: agent a can invoke (t, op) only if there exists a valid lease l in a's context such that l.owner = a, l.tool = t, op is in l.ops, l.budget > 0, and l.expiry > now.

Invariant 2 (Non-Amplification). For any delegation chain l_0 -> l_1 -> ... -> l_n, the operations, budget, expiry, and scope of l_n are each a subset of (or no greater than) those of l_0. No delegation chain can produce a lease with capabilities exceeding its root.

Invariant 1 is enforced at runtime by the AIOS kernel's lease registry, which intercepts every tool invocation and checks it against the invoking agent's lease context. Invariant 2 follows from the non-amplification proof sketch above: the delegation rule's premises (ops' subset of ops, k' <= k, exp' <= exp, s' subset of s) ensure monotonic restriction at each step, and induction on chain length completes the argument.
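Invariant 1 is a runtime predicate that the kernel evaluates on every tool invocation; a minimal sketch (the lease-record fields here are our own illustrative shape, not a specified registry schema):

```python
def can_invoke(context: list, agent: str, tool: str, op: str, now: float) -> bool:
    """Invariant 1 (Capability Gate): permit (tool, op) only if some valid
    lease in the agent's context covers it."""
    return any(
        l["owner"] == agent and l["tool"] == tool and op in l["ops"]
        and l["budget"] > 0 and l["expiry"] > now
        for l in context
    )
```

Default-deny falls out of the quantifier: an empty context, an exhausted budget, or an expired lease all make the existential false, so the invocation is rejected.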

Scope enforcement. Scope enforcement -- ensuring that the effects of a tool invocation fall within l.scope -- is a runtime concern rather than a type-level judgment. This is standard for capability systems: Capsicum (FreeBSD) enforces capability scopes at the kernel system-call boundary; Google Macaroons enforce contextual caveats at the service layer. In CLTF, the AIOS kernel's system-call interposition layer checks file paths, network endpoints, and database tables against the lease's scope constraint before permitting the operation. A full effect-type formalization (e.g., a Performs(s) effect annotation on the use rule) is possible but would substantially increase the complexity of the type system for a property that is naturally enforced at the system-call boundary. We leave effect-level scope formalization to future work. We note, however, that scope constraints can be formalized algebraically using lattice-based information flow models 78: defining Scope as a lattice with a computable meet operator allows the kernel to compute S_child = S_parent meet S_req, structurally guaranteeing S_child is a subset of S_parent. This maps directly to CIDR blocks, URI prefix trees, and IAM policy schemas -- all decidable in O(n) or O(log n) time.
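The lattice-meet idea can be sketched for path-prefix scopes (one of the encodings mentioned above; the concrete representation and function names here are our own illustration):

```python
def covers(scope: str, path: str) -> bool:
    """A prefix scope covers a path that equals it or lies beneath it."""
    return path == scope or path.startswith(scope.rstrip("/") + "/")

def meet(parent: frozenset, requested: frozenset) -> frozenset:
    """Lattice meet for prefix-scope sets: keep each requested prefix only
    when some parent prefix covers it, so the child narrows the parent."""
    return frozenset(r for r in requested if any(covers(p, r) for p in parent))

parent = frozenset({"/data", "/logs/app"})
requested = frozenset({"/data/users", "/logs/app", "/etc"})
child = meet(parent, requested)
# Structurally guaranteed: every child scope is covered by some parent scope.
assert all(any(covers(p, c) for p in parent) for c in child)
```

Because `meet` only ever emits requested scopes that a parent scope already covers, S_child <= S_parent holds by construction rather than by a separate check.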

Concurrency and budget splitting. Budget splitting introduces shared mutable state: if two concurrent threads both hold a reference to the same lease variable l, concurrent delegate calls could both claim k' from the same budget k. In a language with substructural types (e.g., Rust's ownership model), this race is prevented statically -- the affine variable l cannot be passed to two concurrent contexts without first being explicitly split. In languages without substructural enforcement, a serialized lease-operations model is required: the AIOS kernel's lease registry must execute each delegation as a single atomic transaction -- checking the delegation-authority guard, decrementing the parent's budget, minting a fresh child lease ID, and recording the parent-child lineage -- rather than merely performing an atomic compare-and-swap on the budget field alone. This serialization ensures that concurrent splits are linearized and the full budget invariant (including lineage tracking for revocation) is maintained, analogous to how database systems serialize concurrent transactions on shared rows. Critically, this serialization is per-lease (keyed on the parent lease ID), not global: operations on disjoint lease subtrees proceed in parallel without contention, preserving O(1) overhead for non-overlapping delegations.
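The per-lease serialization model can be sketched with one lock per parent lease; the registry API below is illustrative, not part of any specified AIOS kernel:

```python
import threading
import uuid
from collections import defaultdict

class LeaseRegistry:
    """Sketch: delegations against the same parent are linearized under one
    lock; operations on disjoint subtrees proceed in parallel."""

    def __init__(self):
        self._budgets = {}                          # lease_id -> remaining budget
        self._children = defaultdict(list)          # lease_id -> child lease IDs
        self._locks = defaultdict(threading.Lock)   # one lock per parent lease

    def mint(self, lease_id: str, budget: int):
        self._budgets[lease_id] = budget            # privileged: kernel only

    def delegate(self, parent_id: str, k: int) -> str:
        with self._locks[parent_id]:                # atomic per-parent transaction
            if self._budgets.get(parent_id, 0) < k:
                raise PermissionError("insufficient budget")
            self._budgets[parent_id] -= k           # decrement parent...
            child_id = f"{parent_id}.{uuid.uuid4().hex[:8]}"
            self._budgets[child_id] = k             # ...mint the child...
            self._children[parent_id].append(child_id)  # ...record lineage
            return child_id
```

The whole transaction (guard, decrement, mint, lineage record) sits inside one `with` block, which is the difference from a bare compare-and-swap on the budget field: the lineage needed for revocation is updated in the same critical section.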

Connection to quantitative type theory. Budgeted leases (at most k uses) correspond directly to quantitative types as formalized by Atkey 79 and implemented in Idris 2's multiplicity system. In quantitative type theory, a variable with multiplicity k may be used at most k times; in CLTF, a lease with budget k may be exercised at most k times. The delegation rule's budget split (k -> (k-k', k')) corresponds to the split rule for quantitative contexts, which divides a resource annotation across two sub-derivations. This connection means that CLTF's budget tracking can be embedded in any language supporting quantitative types, with the type checker providing static guarantees about budget compliance.

7.4 Latent Coordination Bus (LCB)

Note: LCB as used in this paper refers to Latent Coordination Bus, not to be confused with LiveCodeBench 80, an execution-based code generation benchmark that shares the same acronym.

Definition. LCB is an inter-agent communication channel supporting three modalities in a single protocol: (1) natural language for human-agent interaction and audit, (2) structured state deltas (typed, schema-validated diffs against shared state) for inter-agent coordination, and (3) compressed latent payloads (embeddings, KV-cache fragments, learned tokens) for high-bandwidth agent-agent communication. All messages carry provenance metadata (sender, timestamp, causal chain, MAIR level reference) and are logged to a cryptographic audit trail.

Motivation and research grounding. Existing protocols each address a single modality: MCP uses structured JSON-RPC, A2A uses task objects, human interaction uses NL. LCB unifies these with explicit modality negotiation: agents settle on the highest-efficiency modality they both support, backed by a guaranteed degradation chain that falls back from latent to structured to natural language, ensuring that every coordination exchange has a human-inspectable fallback even when the primary channel operates at latent bandwidth. The design synthesizes emergent communication research in multi-agent RL, state-delta trajectory communication (EMNLP 2025), Q-KVComm for KV-cache exchange, and the KQML/FIPA-ACL tradition of performative-based agent messages. Every message carries causal predecessors for communication graph reconstruction and an integrity hash for tamper detection.
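The negotiation and provenance mechanics can be sketched briefly; the `Message` shape, field names, and hash construction below are illustrative assumptions, not the LCB wire format:

```python
import hashlib
import json
from dataclasses import dataclass

# Degradation chain, highest-efficiency first: latent -> structured -> nl.
CHAIN = ("latent", "structured", "nl")

def negotiate(sender: set, receiver: set) -> str:
    """Pick the richest modality both parties support; 'nl' is the
    guaranteed human-inspectable fallback."""
    for modality in CHAIN:
        if modality in sender and modality in receiver:
            return modality
    return "nl"

@dataclass
class Message:
    sender: str
    modality: str
    payload: str
    parents: tuple = ()  # causal predecessors, for graph reconstruction

    def integrity_hash(self) -> str:
        body = json.dumps([self.sender, self.modality, self.payload,
                           list(self.parents)])
        return hashlib.sha256(body.encode()).hexdigest()
```

Any tampering with the payload or causal chain changes the hash, which is what lets the audit trail detect modification after the fact.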

7.5 Counterexample Supply Chain (CSC)

Definition. The Counterexample Supply Chain is a pipeline that automatically transforms production incidents into four downstream artifacts:

  1. Reproducible execution traces: A deterministic replay of the sequence of actions, observations, and decisions that led to the incident, captured at MAIR L2/L3 granularity with full provenance.

  2. Formal counterexamples to system specifications: A formalization of the incident as a witness to a specification violation -- a concrete input-state-output triple that demonstrates that the system's behavior deviated from its L0 intent specification.

  3. Benchmark tasks for agent evaluation: A structured task derived from the incident, suitable for inclusion in evaluation suites (analogous to SWE-bench entries), with defined inputs, expected outputs, and evaluation criteria.

  4. Curriculum entries for model retraining: A training example derived from the incident, capturing the failure trajectory and the correct resolution (if known), formatted for inclusion in fine-tuning or reinforcement learning pipelines.

Every production failure thus enriches the system's verification, evaluation, and training infrastructure. The pipeline is automatic: incidents flow through the CSC without human intervention, though humans may review and annotate the outputs.

Motivation. Production failures are the most valuable data for self-improving systems, but they are currently wasted. In human-centric operations, failures produce post-mortem documents that may or may not be read, may or may not lead to process changes, and almost never feed directly back into the system's verification or training infrastructure. The gap between "a production incident occurred" and "the system is less likely to produce similar incidents in the future" is mediated entirely by human processes -- writing post-mortems, filing tickets, updating runbooks, adding tests -- that are slow, inconsistent, and often incomplete. CSC closes this gap by automating the transformation of failures into machine-consumable improvement signals.

Connection to existing research. CSC's feedback loop is structurally analogous to Counterexample-Guided Abstraction Refinement (CEGAR) 81, which iteratively refines abstract models using counterexamples from failed verification attempts. The key distinction is that CEGAR is a localized verification technique operating within a single model checker, whereas CSC is a distributed architectural primitive spanning production monitoring, benchmark generation, and training curricula across an entire agent fleet. CSC integrates several established research directions. AIOpsLab provides the infrastructure for fault injection, telemetry collection, and agent evaluation in autonomous cloud operations, establishing the precedent for automated incident-to-benchmark pipelines. Chaos engineering, as documented in the comprehensive multivocal literature review by Basiri et al., provides systematic approaches to controlled failure injection and resilience validation. STaR 62 demonstrates that failures can be bootstrapped into training signal through rationalization -- generating correct reasoning conditioned on the known correct answer. LeDex constructs datasets for code explanation and refinement using execution verification to filter refinement trajectories. ReVeal provides a reinforcement learning framework where models generate code, verify it via tools, and iteratively improve, demonstrating that verification-driven feedback loops improve agent performance. The DGM 66 uses empirical evaluation as the improvement signal, demonstrating that benchmark performance can drive self-modification. Seed-Coder 64 shows that model-curated data pipelines are feasible and effective, establishing that the data curation step in CSC can itself be automated.

Design sketch. The CSC pipeline consists of four stages:

  • Stage 1: Incident-to-trace. When a production incident is detected (through monitoring, chaos testing, or specification violation), the system captures the full execution trace at MAIR L2/L3 granularity. The trace includes: the sequence of agent actions, the observations received at each step, the decisions made (with the decision context, including model inputs and outputs), and the point at which the failure manifested. The trace is deterministic and replayable -- given the same initial state and the same sequence of observations, replaying the trace reproduces the failure.

  • Stage 2: Trace-to-counterexample. The execution trace is analyzed to identify the specification violation. The system extracts the relevant L0 intent specification, identifies the acceptance predicate that was violated, and constructs a formal counterexample: a tuple (initial_state, action_sequence, observed_outcome, expected_outcome, violated_predicate). This counterexample is machine-checkable: the violated predicate can be evaluated against the observed outcome to confirm the violation.

  • Stage 3: Counterexample-to-benchmark. The counterexample is packaged as a benchmark task. The task specifies: the initial state (environment snapshot), the goal (the L0 intent that was being pursued), the evaluation criteria (the violated predicate, now serving as the acceptance test), and metadata (difficulty estimate, category, related incidents). The benchmark task is added to the system's evaluation suite, ensuring that future agent versions are tested against this failure mode.

  • Stage 4: Benchmark-to-curriculum. The benchmark task is transformed into a training example. If the correct resolution is known (from human annotation or from a successful automated repair), the training example pairs the failure trajectory with the resolution trajectory. If the correct resolution is not known, the training example captures the failure trajectory alone, serving as a negative example for reinforcement learning (the agent should learn to avoid this trajectory). The curriculum entry includes metadata about the failure category, enabling curriculum-learning agents to prioritize training on underrepresented failure modes.
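The Stage-2 artifact is concrete enough to sketch. The tuple structure comes from the stage description above; the incident scenario and field types are illustrative assumptions:

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple

@dataclass(frozen=True)
class Counterexample:
    """Stage-2 artifact: a machine-checkable witness that observed behavior
    deviated from the L0 intent's acceptance predicate."""
    initial_state: Any
    action_sequence: Tuple[str, ...]
    observed_outcome: Any
    expected_outcome: Any
    violated_predicate: Callable[[Any], bool]

    def confirms_violation(self) -> bool:
        # The acceptance predicate must pass the expected outcome and
        # fail the observed one -- otherwise this is not a counterexample.
        return (self.violated_predicate(self.expected_outcome)
                and not self.violated_predicate(self.observed_outcome))

# Illustrative incident: a retry loop exceeded its budget of 3 attempts.
cex = Counterexample(
    initial_state={"retries": 0},
    action_sequence=("retry", "retry", "retry", "retry"),
    observed_outcome={"retries": 4},
    expected_outcome={"retries": 3},
    violated_predicate=lambda outcome: outcome["retries"] <= 3,
)
assert cex.confirms_violation()
```

Because the violated predicate travels with the counterexample, Stage 3 can reuse it verbatim as the benchmark's acceptance test, which is what makes the incident-to-benchmark step mechanical.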

7.6 The Integrated Stack

The five concepts compose into a coherent agent-native stack where each concept occupies a distinct architectural role:

  • MAIR provides the representation. All artifacts in the stack -- specifications, plans, actions, executable operations -- are expressed at the appropriate MAIR level with verified lowering between levels and bidirectional traceability. MAIR is the substrate on which all other concepts operate.

  • PCPG provides the change management. Every modification to a MAIR artifact -- whether a change to an L0 specification, a refinement of an L1 plan, or a bug fix in L3 code -- is represented as a PCPG node carrying structural diffs, contract deltas, test evidence, proof obligations, and verification receipts. The PCPG graph provides the version history of the entire system at every MAIR level.

  • CLTF provides the security model. Every agent's access to tools, data, and other agents is mediated by CLTF leases. The leases are enforced by the AIOS kernel and logged to the audit trail. CLTF ensures that self-improving systems cannot acquire unbounded capabilities: every capability expansion requires an explicit lease grant, which can be audited, constrained, and revoked.

  • LCB provides the coordination. All inter-agent communication -- whether between agents coordinating on a shared MAIR L1 plan, between an agent and a tool via MCP, or between agents exchanging latent representations for cooperative code generation -- flows through the LCB. The bus provides modality negotiation, provenance tracking, and degradation to human-inspectable NL, ensuring that coordination is both efficient and auditable.

  • CSC provides the learning loop. When the system fails -- when a PCPG-verified change nonetheless produces a production incident, when an agent exceeds its CLTF lease constraints, when an LCB communication leads to a miscoordination -- the CSC pipeline captures the failure, formalizes it as a counterexample, packages it as a benchmark, and feeds it into the training curriculum. This ensures that failures are not merely repaired but absorbed into the system's capability, making each failure a step toward robustness rather than a cost to be minimized.

The integration creates three interlocking feedback loops:

  1. The verification loop: MAIR lowering generates proof obligations -> PCPG nodes carry proof evidence -> verification failures feed into CSC -> CSC counterexamples strengthen future proof obligations.

  2. The security loop: CLTF leases constrain agent capabilities -> lease violations are detected by the AIOS kernel -> violations feed into CSC -> CSC curriculum teaches agents to operate within constraints.

  3. The coordination loop: LCB messages coordinate agent work -> miscoordinations produce failures -> failures feed into CSC -> CSC benchmarks evaluate coordination quality -> evaluation signals improve LCB protocol negotiation.

Each loop is self-reinforcing: the system becomes more verified, more secure, and better coordinated with each iteration. Critically, each loop includes a formal or empirical check that prevents degradation: PCPG verification receipts prevent unverified changes from propagating, CLTF leases prevent capability creep, and CSC benchmarks prevent regression. The result is a stack that improves over time while maintaining -- and strengthening -- its safety invariants.

This integrated stack does not exist today. No system implements all five concepts, and implementing any one of them at scale presents substantial engineering challenges. But every component is grounded in demonstrated research, and the interfaces between components are defined precisely enough that they could be built independently and integrated incrementally. The research agenda is clear: implement MAIR on existing compiler infrastructure (LLVM, MLIR, WebAssembly), deploy PCPG in multi-agent development environments, evaluate CLTF in real agent fleet security scenarios, prototype LCB in cooperative agent systems, and build CSC pipelines on top of existing incident-management infrastructure. The integrated stack is the destination; the individual concepts are the milestones.


Section 8: Related Work

We position our work against six areas, explicitly stating what this paper does that the existing literature does not.

8.1 Program Synthesis and Verified Code Generation

Program synthesis spans deductive synthesis 82, inductive synthesis 83, and neural synthesis (AlphaCode 61, AlphaProof 48, SpecGen 37, AlphaVerus, SynVer 84, VerifyThisBench 39). Our contribution: synthesis research generates individual programs; we address the architecture of systems that continuously synthesize, verify, compose, and evolve programs at the scale of thousands of concurrent agents. PCPG extends proof-carrying code from individual artifacts to graph-structured change management; MAIR provides the multi-level representation infrastructure within which synthesis operates.

8.2 Agent Frameworks and Benchmarks

SWE-agent 5 introduced the ACI concept; SWE-bench 85, OSWorld, and AIOpsLab provide evaluation benchmarks; AutoGen 86 enables multi-agent conversation; Yang et al. 87 survey the emerging landscape of LLM-based multi-agent systems from both technical and business perspectives. CodeMonkeys 88 demonstrates that scaling test-time compute across many parallel agents yields substantial improvements on SWE-bench, providing empirical evidence for the thousand-agent concurrency regime this paper's stack targets. Our contribution: agent research focuses on individual agents or agent orchestration patterns; we focus on the stack beneath agents. We generalize SWE-agent's ACI insight across the full operational stack and formalize it through CLTF and LCB. CSC proposes generating evaluation tasks automatically from production failures.

8.3 Formal Methods and Proof Assistants

Lean 4 49, CompCert 57, CakeML, AlphaProof, and the refinement calculus 54, 55 provide foundations. Our contribution: formal-methods research treats verification as an external tool or a specialist discipline. We integrate it as a built-in primitive: proof obligations in AGL's type system, verification receipts as first-class PCPG components, making verification pervasive rather than exceptional.

8.4 Compiler Infrastructure

LLVM IR 89, MLIR 34, WebAssembly, COMPASS, and GrammarCoder 15 provide representation and transformation technologies. Our contribution: compiler research treats IRs as compilation artifacts. We treat them as agent-native representations -- the primary medium in which agents reason about programs. MAIR generalizes MLIR's multi-level dialects from hardware abstraction to agent cognitive abstraction.

8.5 Multi-Agent Communication

FIPA-ACL 90, emergent communication in multi-agent RL, state-delta trajectory communication, Q-KVComm, MCP, and A2A 91 each address a single communication modality. Our contribution: LCB unifies three modalities with explicit modality negotiation, provenance tracking, and degradation rules ensuring human inspectability, connecting communication to the security (CLTF) and learning (CSC) layers.

8.6 GenAI-Native System Design

Vandeputte 92 proposes foundational design principles and architectural patterns for GenAI-native software systems at the Onward! 2025 symposium, organizing the design space around five pillars that span from cognitive processing components to hybrid system architectures blending traditional and GenAI-native elements. That work is complementary to ours: Vandeputte addresses the macro-architectural question of how to structure robust, adaptive systems that incorporate LLM capabilities -- separating cognitive from traditional processing, screening and sandboxing model outputs, and defining compositional GenAI-native cells -- while this paper asks what the software stack itself should look like when the primary producer and consumer of code is a machine. The two analyses operate at different levels: Vandeputte addresses system-of-systems architecture surrounding GenAI capabilities; we address the representation, language, and infrastructure layers beneath agents. The safety principles Vandeputte identifies (e.g., treating model outputs as untrusted assets requiring screening) are consistent with CLTF's capability model, and the multi-layered architectural decomposition aligns with MAIR's multi-level representation philosophy.

8.7 AI Safety and Alignment

Constitutional AI, corrigibility research 75, AgentGuard 74, the CSA Agentic Trust Framework 60, and Thompson's trust problem 70 address agent safety. Our contribution: we embed safety as a stack-level concern distributed across infrastructure components (CLTF for capability control, CSC for learning from failures, PCPG for change-level accountability) -- making safety architectural rather than behavioral.


Section 9: Discussion and Limitations

This paper's analysis is forward-looking, and several claims rest on extrapolations from early-stage evidence. We address the most important limitations honestly.

9.1 The Transition Problem

The existing ecosystem represents sixty years of investment. We propose a three-horizon transition: human-in-the-loop (current state: agents within human-centric systems), human-on-the-loop (increasing autonomy, humans review aggregated outcomes), and human-out-of-the-loop (purpose-built agent-native systems for well-understood domains). The transition zone may persist for decades, during which agent-native and human-centric systems must interoperate -- MAIR must translate to/from human-readable code, PCPG must integrate with Git, LCB must degrade to natural language. The interoperability overhead may substantially reduce theoretical benefits during transition.

9.2 The Bootstrap Paradox

Agent-native tools must be built using the human-centric tools they replace. This chicken-and-egg problem is resolvable through: (1) incremental bootstrapping (MAIR as an MLIR dialect, PCPG as a Git extension), (2) early self-hosting (agents improving their own tools once minimally functional), and (3) specification-first development (formal properties that any implementation -- human, agent, or hybrid -- must satisfy). The bootstrap paradox also applies to training data: models trained on human code produce human-idiomatic output even in agent-native representations.

9.3 Verification Scalability

Our proposals rely heavily on formal verification, which faces well-known challenges: dependent type checking can be undecidable; proof generation can require exponential time; specifications themselves may be flawed (a verified implementation of a wrong specification is correct but useless); and environmental models may be incomplete 73. Practical compromises -- probabilistic verification, incremental verification, proof reuse -- trade verification strength for scalability. The right balance is an open question.

9.4 LCB Verification Challenge

LCB's latent-modality messages (embeddings, KV-cache fragments) resist conventional type-checking and formal verification: their semantics are defined by model weights rather than symbolic specifications. Verifying that a latent payload faithfully encodes a structured state delta is likely comparable in difficulty to neural network verification, which Katz et al. 93 showed to be NP-complete even for simple properties of ReLU networks. Practical LCB deployments must therefore rely on structured-modality checksums, round-trip fidelity tests (encode-decode-compare), and statistical monitoring rather than formal proof -- a fundamental limitation that the degradation guarantee (latent -> structured -> NL) is designed to mitigate but cannot eliminate.

9.5 The Alignment Tax and Empirical Gaps

Safety invariants impose computational cost. Every CLTF lease check, PCPG proof obligation, and CSC pipeline stage adds latency. Whether this "alignment tax" is 5% or 500% of unconstrained iteration speed is an empirical question. The stack is designed so verification degree can be tuned, allowing explicit speed-safety tradeoffs.

The evidence base has significant gaps. Most agent capability evidence comes from benchmarks, not production systems. The 30--50% human tax estimate aggregates studies measuring different properties in different contexts; no study has directly measured what fraction of a production codebase could be eliminated for agent-only consumption. None of the five novel concepts has been implemented as an integrated system. No agent-native stack has been deployed at production scale. We frame our contributions as a research agenda rather than a proven architecture.

The METR boundary condition. The strongest empirical challenge to the premise underlying this paper comes from the METR randomized controlled trial 94, which found that experienced developers were 19% slower with current AI coding tools on real-world tasks. A reviewer may reasonably ask: why redesign the software stack for agents that currently reduce individual developer productivity? We acknowledge this challenge directly. The METR finding measures micro-level productivity of individual developers using AI assistants within the current human-centric stack -- precisely the environment this paper argues is a poor fit for agent capabilities. The human tax thesis predicts that agents operating within human-accidental infrastructure will underperform: the stack is designed to slow them down. The companion paper 16 provides a detailed engagement with METR, including a quantitative break-even analysis showing that parallelism across N agents changes the calculus even under per-task penalties, and that the critical variable is the Agent-Parallel Fraction -- which is itself constrained by the human-centric coupling patterns this paper proposes to eliminate. METR does not invalidate the infrastructure investment thesis; it strengthens the case that incremental AI adoption within the existing stack has a low ceiling, and that the stack itself must change.

9.6a Counterarguments from the Literature

Intellectual honesty requires engaging with the strongest published evidence against the thesis of this paper. Several lines of research from 2023--2025 suggest that human-readable features are not merely "tax" but are functionally valuable for current AI agents, and that incremental adaptation of the existing stack may be more productive than wholesale redesign. We address the most important counterarguments here.

Human-readable names help current agents. As discussed in Section 4.2.1, the evidence is clear that current LLMs depend on human-readable identifiers. "When Names Disappear" 41 found up to 62.5% performance degradation when semantic naming cues were stripped. Wang et al. 95 showed identifier-aware training improved code generation by +4.7 CodeBLEU. Miceli Barone et al. 43 found that identifier-swap perturbations caused catastrophic failures with inverse scaling in some model families. These findings are genuine and important. Our response is that we hypothesize the dependency is substantially a training artifact rather than a fundamental requirement -- a hypothesis we state with an explicit falsification condition in Section 4.2.1 but that remains unproven.

Type constraints improve agent code generation. Mündler et al. (PLDI 2025) demonstrated that type-constrained decoding cut compilation errors by over 50% and improved functional correctness by 3.5--5.5%. Poesia et al. 96 showed similar improvements from semantic constraints including typing. This does not contradict our proposal for richer type systems (AGL's dependent types); it contradicts the narrower claim that type systems are "redundant overhead" (Section 2.2). We have revised Section 2.2's framing: the argument is that the enforcement mechanism of traditional static type checking (compile-time errors for human workflows) is what becomes redundant, not the information content of types, which we propose to strengthen rather than remove.

Structured reasoning and self-debugging improve agents. Chen et al. 97 showed 2--12% improvements from iterative debugging with natural-language explanations. Li et al. 98 demonstrated up to 13.79% improvement from forcing LLMs to reason using human-readable control structures. Zhong et al. 99 showed that mimicking human debugging processes outperformed single-pass generation. Zhu et al. 100 demonstrated that verbalized uncertainty acts as an error-correction mechanism. These findings are consistent with our observation (Section 6.1) that current self-improvement mechanisms build on human-derived knowledge; the question is whether future agents trained on richer, more structured representations will still require human-style reasoning scaffolds.

Current agents succeed within human infrastructure. Devin, SWE-agent, and Cursor achieve competitive results by wrapping standard human tools with thin adaptation layers (the ACI pattern), not by replacing the stack. SWE-agent explicitly showed that a lightweight Python wrapper over standard shell commands boosted performance by 10.7%. This is valid evidence that wholesale redesign is not required for current-generation agents. Our counterargument is not that current agents cannot function within human infrastructure, but that the ceiling of agent performance within that infrastructure is lower than the ceiling with purpose-built infrastructure -- a hypothesis that cannot be tested until agent-native alternatives exist.

Microservices serve technical purposes beyond Conway's Law. As Dragoni et al. (2017) document, microservices provide fault isolation, independent scaling, and blast-radius containment -- technical properties that persist regardless of whether the operator is human or machine. A swarm of 1,000 agents working on a monolith introduces single-point-of-failure risks that service boundaries mitigate. We acknowledge this criticism of Section 2.3: our claim that microservices are "primarily" driven by team structure overstates the case. The correct claim is that some microservice overhead (network serialization, distributed transactions, service mesh infrastructure) exists to enable human team independence, while other aspects (fault isolation, independent scaling) serve genuine technical purposes that an agent-native architecture must also address, potentially through different mechanisms (e.g., fine-grained internal modularity with verification-based boundaries rather than network boundaries).

The "ecosystem tax" of abandoning the current stack. Trillions of dollars of investment in CI/CD pipelines, security scanners, compliance monitors, and developer tooling depend on text-based code and standard file systems. The cost of abandoning this ecosystem may exceed the benefits of removing the human tax. We address this through the three-horizon transition model (Section 9.1), which explicitly acknowledges that the transition period may persist for decades and that hybrid systems may capture only a fraction of the theoretical benefits.

These counterarguments do not invalidate the thesis of the paper, but they do constrain its scope and urgency. The human tax is real, but it is not purely wasteful: some of it serves dual purposes (aiding both humans and current AI agents), some of it provides genuine technical value (fault isolation in microservices), and removing it carries real transition costs. The strongest version of our argument is not "the human tax should be eliminated immediately" but "recognizing human-accidental complexity as a distinct category is the prerequisite for informed architectural decisions about what to keep, what to transform, and what to discard."

9.6b Threats to Validity

Construct validity. Operationalizing "human-accidental complexity" through syntactic patterns (comments, indentation, naming overhead) is a proxy for the underlying concept. Some patterns classified as human-accidental may carry semantic information accessible to agents through alternative channels; some patterns classified as essential may serve primarily human comprehension in ways the classifier cannot detect. The four-category taxonomy (essential, human-accidental, dual-purpose, infrastructure) is one reasonable operationalization; others are possible, and the results are sensitive to category definitions -- particularly the boundary between human-accidental and dual-purpose.

Internal validity. The pilot measurement (Section 2.6) is a single-codebase, automated-only analysis on 15 files (260,407 characters). No human annotators validated the automated classifications. Additionally, the measurement operates at character granularity rather than token granularity; since LLMs consume tokens (BPE subwords), a token-level reanalysis might shift category boundaries -- particularly for identifiers, where a single long name may tokenize into multiple subwords each carrying varying degrees of semantic content. A token-level measurement is a priority for future work. The naming-overhead heuristic (characters beyond a configurable threshold in an identifier are human-accidental) is a simplification; however, sensitivity analysis across thresholds of 1, 3, and 5 characters shows the strict figure varies by only 1.6 percentage points (25.7% to 27.3%), indicating the overall result is robust to this parameter. The indentation classification assumes AST-based agent consumption; agents consuming text representations would benefit from indentation.
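The naming-overhead heuristic and its threshold sensitivity can be sketched as follows. This is a hypothetical re-implementation, not the pilot's actual classifier: the identifier regex, the sample snippet, and the function name are illustrative (and the crude regex also matches keywords like `const`).

```python
# Sketch of the naming-overhead heuristic: characters of an identifier
# beyond `threshold` are counted as human-accidental. Illustrative only.
import re

IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")  # crude: also matches keywords

def naming_overhead(source: str, threshold: int) -> int:
    """Characters attributed to naming overhead at a given threshold."""
    return sum(max(len(m.group()) - threshold, 0)
               for m in IDENT.finditer(source))

snippet = "const userAccountBalance = computeBalance(userAccountId);"
for t in (1, 3, 5):
    print(t, naming_overhead(snippet, t))
```

The sensitivity claim above is the observation that sweeping `threshold` over 1, 3, and 5 moves the aggregate strict figure by only 1.6 percentage points on the pilot codebase.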

External validity. The measurement covers TypeScript only -- a gradually typed language with significant type annotation overhead. Results would differ for Python (less type annotation, more indentation), Rust (more type and lifetime annotation), or Java (more boilerplate, less comment density). The measured codebase is a well-documented, well-structured open-source project; codebases with less documentation or different coding standards would yield different results.

Conclusion validity. The dual-purpose classification is particularly sensitive to assumptions about agent capability trajectories. If current LLMs' dependence on human-readable features is permanent rather than a training artifact, the "broad" figure (48.2%) overstates the removable human tax. If agents can be trained to function without any human-readable features, the "strict" figure (26.4%) understates it. The gap between these figures (21.8 percentage points) represents genuine uncertainty about the nature of the human tax.
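The dependence of the headline figure on dual-purpose attribution can be made explicit with a one-parameter blend. This is a minimal sketch: `alpha` stands in for the empirically uncalibrated attribution parameter (0 credits all dual-purpose complexity to machines, 1 credits it all to the human tax), and the constants are the pilot figures from Section 2.6.

```python
# How the human-tax estimate varies with the dual-purpose attribution
# parameter. Constants are the pilot measurements from Section 2.6.
STRICT_HAC = 26.4    # % strictly human-accidental
DUAL_PURPOSE = 21.8  # % dual-purpose (serves humans and current agents)

def blended_tax(alpha: float) -> float:
    """Human-tax estimate under attribution parameter alpha in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return STRICT_HAC + alpha * DUAL_PURPOSE

# The strict and broad figures are the endpoints of this interval:
assert blended_tax(0.0) == 26.4
assert round(blended_tax(1.0), 1) == 48.2
```

The 21.8-point gap between the endpoints is exactly the uncertainty interval described above: every value of `alpha` yields a defensible estimate until the parameter is calibrated empirically.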

9.7 Open Direction: Agent-Native Observability

Current observability stacks (OpenTelemetry, Prometheus, Grafana) instrument request paths through deterministic services. Agent workflows are stochastic, branching, and self-modifying -- requiring a fundamentally different observability model that tracks reasoning traces, capability-lease exercise chains, and emergent coordination patterns across agent fleets, none of which map cleanly to existing span/metric/log abstractions.
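As a rough illustration of what such a model might capture, here is a minimal event record covering the three signals named above. Field names and types are assumptions for the sketch, not a proposed standard.

```python
# Illustrative sketch of an agent-fleet observability event: causal
# (not request-scoped) parentage, capability-lease delegation chains,
# and digests of latent payloads. Field names are assumptions.
from dataclasses import dataclass, field
from typing import List, Optional
import time

@dataclass
class AgentTraceEvent:
    agent_id: str
    kind: str                        # "reasoning" | "lease_exercise" | "coordination"
    parent_event: Optional[str]      # causal parent, unlike an HTTP-span parent
    lease_chain: List[str] = field(default_factory=list)  # delegation chain
    payload_digest: str = ""         # hash of the latent payload, not the payload
    timestamp: float = field(default_factory=time.time)

evt = AgentTraceEvent(agent_id="agent-17", kind="lease_exercise",
                      parent_event=None, lease_chain=["root", "fs-read"])
```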

9.8 What We Are Not Claiming

We are not claiming that humans should be removed from software development -- the human role shifts from writing code to writing specifications and governing agent fleets, but does not disappear. We are not claiming that agent-native systems are imminent at full scale -- a fully realized stack is years to decades away, and the integrated stack of Section 7.6 is a research target, not an engineering plan. We are not claiming that PCPG, MAIR, CLTF, LCB, and CSC are the only possible architecture -- alternative designs satisfying the same requirements should be explored.

We are claiming that recognizing human-centric accidental complexity is the prerequisite for principled agent-native design. Whether the specific numbers (30--50%), concepts (PCPG, MAIR), or language features (dependent types, proof-carrying semantics) prove optimal is less important than the foundational observation: the software stack was designed for humans, agents are not humans, and the gap is a tractable engineering problem the systems community should address deliberately.


Section 10: Conclusion

10.1 Summary of Contributions

This paper has: (1) identified and characterized the human tax across six strata, introducing human-accidental complexity as a refinement of Brooks's framework (Sections 1--2); (2) demonstrated 32--82% improvements (scale-dependent) with structured representations and proposed multi-representation fluidity (Section 3); (3) derived agent-native language requirements and presented the AGL design sketch (Section 4); (4) proposed a comprehensive agent-native operations stack (Section 5); (5) analyzed self-improving systems with formal verification as safety invariant (Section 6); and (6) distilled the analysis into five architectural concepts -- PCPG, MAIR, CLTF, LCB, CSC -- composing into an integrated stack with three interlocking feedback loops (Section 7).

10.2 The Research Agenda

The integrated stack does not exist today. Building it requires: (1) MAIR on MLIR -- define L0/L1 dialects with verified lowering to existing compiler infrastructure; (2) PCPG deployment -- integrate GumTree, proof-carrying code, and cryptographic signing into a unified change-management system, measuring false conflict rates and verification throughput; (3) CLTF evaluation -- implement kernel-level lease enforcement, testing against realistic threat models including delegation chain amplification and concurrent revocation; (4) LCB prototyping -- measure bandwidth improvement of latent over structured communication and fidelity loss of NL degradation; (5) CSC pipelines -- evaluate whether auto-generated benchmarks and curriculum improve agent performance; (6) Ablation study -- measure agent Pass@1 on codebases with dual-purpose elements (type annotations, naming conventions, interface boundaries) progressively stripped, isolating which human-accidental artifacts agents genuinely depend on versus those that are pure tax; (7) Cross-language validation -- replicate the pilot measurement across Python, Java/C#, Rust, and C codebases using established benchmarks (SWE-bench, HumanEval, MultiPL-E) to test whether the human-tax distribution is language-invariant or language-specific; and (8) Human tax measurement -- the most important task: build equivalent systems optimized for human vs. agent consumption and compare complexity, addressing category overlap and dual-purpose artifacts.

Falsification criteria. To ground this agenda in Popperian terms, we state the conditions under which our central hypotheses would be falsified. The 30--50% human tax hypothesis would be falsified if a controlled measurement across N >= 10 representative codebases (spanning languages, domains, and team sizes) shows human-accidental complexity below 15% under both strict and blended definitions -- i.e., if the human tax is an order of magnitude smaller than estimated, the motivation for a purpose-built agent-native stack weakens to incremental optimization rather than architectural redesign. The training-artifact hypothesis (Section 4.2.1) would be falsified if models fine-tuned on progressively abstracted code representations (removing names, then formatting, then comments, then syntactic sugar) show monotonic performance degradation that does not recover even after training to convergence on the abstracted representations -- operationally, if performance on a held-out benchmark (e.g., SWE-bench) remains below 80% of the baseline after 10x the compute budget used for the original model's code training, the hypothesis would be refuted, as those features would be architecturally necessary rather than contingent on training data. The CLTF safety invariant (Section 7.3.1) would be falsified if a formally verified implementation nonetheless permits capability amplification through delegation chains -- which would indicate a flaw in the formalization rather than the enforcement mechanism.
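The quantitative falsification conditions can be stated as executable predicates. This is an illustrative sketch: function names, argument shapes, and defaults are assumptions drawn from the thresholds above, applied over hypothetical future measurements.

```python
# The two quantitative falsification conditions from Section 10.2,
# encoded as predicates over hypothetical measurement results.

def human_tax_falsified(hac_percent_by_codebase: list,
                        min_codebases: int = 10,
                        threshold: float = 15.0) -> bool:
    """Falsified if >= min_codebases representative codebases all measure
    human-accidental complexity below the threshold (run the predicate
    once per definition, strict and blended)."""
    return (len(hac_percent_by_codebase) >= min_codebases
            and all(p < threshold for p in hac_percent_by_codebase))

def training_artifact_falsified(abstracted_score: float,
                                baseline_score: float,
                                compute_multiplier: float) -> bool:
    """Falsified if held-out performance on abstracted representations
    stays below 80% of baseline despite 10x the original code-training
    compute budget."""
    return (compute_multiplier >= 10.0
            and abstracted_score < 0.8 * baseline_score)
```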

10.3 Closing

The software stack is a monument to human cognitive limitations -- but it is also a monument to human ingenuity in working within those limitations. Dijkstra's structured programming, Milner's type theory, Conway's organizational analysis, Google's site reliability engineering -- each represents a brilliant response to the question of how to make software development tractable for the human mind. The argument of this paper is not that these contributions were wrong, but that the question they answered is changing.

When the primary producer and consumer of code is no longer a biological organism with a seven-item working memory but an artificial one with a million-token context window, the design space of the entire stack reopens. Conventions that were necessary become optional. Overhead that was invisible becomes visible. Constraints that were fundamental become accidental.

The opportunity is not to discard six decades of human ingenuity but to redirect it: from accommodating human constraints to specifying human intent. The programmer's skill shifts from writing correct implementations to writing precise specifications. The architect's skill shifts from decomposing systems along team boundaries to decomposing them along verification boundaries. The operator's skill shifts from interpreting dashboards to authoring policy. In every case, the human moves up the abstraction stack -- from how to what to why.

The question is not whether this transition will occur. The evidence reviewed in this paper -- from GrammarCoder's 32--82% improvements (scale-dependent) with structured representations, to AlphaProof's olympiad-level theorem proving, to the emerging MCP and A2A protocol ecosystem -- demonstrates that the components are assembling. The question is whether we will build agent-native stacks deliberately, through the kind of systematic analysis this paper provides, or stumble into them through incremental patches on human-centric systems, accumulating a new generation of accidental complexity along the way.

The software stack began as a monument to human limitations. It can become a monument to human intent -- if we choose to build it that way.


References



  1. G. A. Miller, "The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information," Psychological Review, vol. 63, no. 2, pp. 81--97, 1956.

  2. E. W. Dijkstra, "Go To Statement Considered Harmful," Communications of the ACM, vol. 11, no. 3, pp. 147--148, 1968.

  3. R. Milner, "A Theory of Type Polymorphism in Programming," Journal of Computer and System Sciences, vol. 17, no. 3, pp. 348--375, 1978.

  4. R. C. Martin, Agile Software Development: Principles, Patterns, and Practices. Prentice Hall, 2003.

  5. J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press, "SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering," NeurIPS, 2024. arXiv:2405.15793.

  6. F. P. Brooks, "No Silver Bullet: Essence and Accidents of Software Engineering," Computer, vol. 20, no. 4, pp. 10--19, 1987.

  7. B. Moseley and P. Marks, "Out of the Tar Pit," Software Practice Advancement (SPA), 2006.

  8. Terser contributors, "Terser -- JavaScript mangler and compressor toolkit for ES6+," 2024. Available: https://terser.org. We measured reductions on popular libraries (jQuery 3.7.1, Axios 1.13.5, D3 7.9.0, Lodash 4.17.23, Moment 2.30.1) using Terser with default settings; individual results ranged from 30--73% for whitespace/comment removal and 51--86% with identifier mangling, with most libraries in the 30--51% and 51--68% ranges respectively (Lodash is an outlier at 73%/86% due to its heavily commented source).

  9. "Code understandability in reviews," arXiv:2410.21990, 2024.

  10. "Impermanent identifiers," Journal of Systems and Software, 2024.

  11. "When Names Disappear," arXiv:2510.03178, 2025.

  12. S. Tobin-Hochstadt, M. Felleisen, R. B. Findler, M. Flatt, B. Greenman, A. M. Kent, V. St-Amour, T. S. Strickland, and A. Takikawa, "Migratory Typing: Ten Years Later," LIPIcs, vol. 71, Article 17, SNAPL, 2017. DOI: 10.4230/LIPIcs.SNAPL.2017.17.

  13. R. P. L. Buse and W. Weimer, "Learning a Metric for Code Readability," IEEE Transactions on Software Engineering, vol. 36, no. 4, pp. 546--558, 2010.

  14. H. He, C. Miller, S. Agarwal, C. Kästner, and B. Vasilescu, "Speed at the Cost of Quality: How Cursor AI Increases Short-Term Velocity and Long-Term Complexity in Open-Source Projects," Proc. 23rd International Conference on Mining Software Repositories (MSR '26), arXiv:2511.04427, 2025.

  15. Q. Liang, Z. Chen, Z. Yang, H. Lin, D. Chen, B. Guan, Y. Wang, and W. Zhang, "Grammar-Based Code Representation: Is It a Worthy Pursuit for LLMs?" arXiv:2503.05507, March 2025.

  16. aleatoric research, "Thinking at Massive Scale: When AI Agents Replace Software Teams," arXiv:2602.XXXXX, 2026.

  17. J. Backus, "Can Programming Be Liberated from the von Neumann Style? A Functional Style and Its Algebra of Programs," Communications of the ACM, vol. 21, no. 8, pp. 613--641, 1978.

  18. R. C. Martin, Clean Code: A Handbook of Agile Software Craftsmanship. Prentice Hall, 2008.

  19. S. McConnell, Code Complete: A Practical Handbook of Software Construction, 2nd ed. Microsoft Press, 2004.

  20. T. J. McCabe, "A Complexity Measure," IEEE Transactions on Software Engineering, vol. SE-2, no. 4, pp. 308--320, 1976.

  21. M. H. Halstead, Elements of Software Science. Elsevier, 1977.

  22. L. Cardelli, "Type Systems," ACM Computing Surveys, vol. 28, no. 1, pp. 263--264, 1996.

  23. P. Martin-Löf, Intuitionistic Type Theory. Bibliopolis, 1984.

  24. "Type-Constrained Code Generation with Language Models," PLDI, 2025.

  25. B. C. Pierce, Types and Programming Languages. MIT Press, 2002.

  26. M. E. Conway, "How Do Committees Invent?" Datamation, vol. 14, no. 4, pp. 28--31, 1968.

  27. T. Ueda, T. Nakaike, and M. Ohara, "Workload Characterization for Microservices," Proc. IEEE International Symposium on Workload Characterization (IISWC), 2016. DOI: 10.1109/IISWC.2016.7581269.

  28. G. Blinowski, A. Ojdowska, and A. Przybyszewski, "Monolithic vs. Microservice Architecture: A Performance and Scalability Evaluation," IEEE Access, vol. 10, pp. 20357--20374, 2022. DOI: 10.1109/ACCESS.2022.3152803.

  29. B. Burns, B. Grant, D. Oppenheimer, E. Brewer, and J. Wilkes, "Borg, Omega, and Kubernetes," ACM Queue, vol. 14, no. 1, pp. 70--93, 2016.

  30. B. Beyer, C. Jones, J. Petoff, and N. R. Murphy, Site Reliability Engineering: How Google Runs Production Systems. O'Reilly Media, 2016.

  31. M. E. Fagan, "Design and Code Inspections to Reduce Errors in Program Development," IBM Systems Journal, vol. 15, no. 3, pp. 182--211, 1976.

  32. A. Bacchelli and C. Bird, "Expectations, Outcomes, and Challenges of Modern Code Review," Proc. 35th International Conference on Software Engineering (ICSE), pp. 712--721, 2013.

  33. T. Preston-Werner, "Semantic Versioning 2.0.0," https://semver.org, 2013.

  34. C. Lattner et al., "MLIR: Scaling Compiler Infrastructure for Domain Specific Computation," CGO, pp. 2--14, 2021.

  35. C. Cummins, Z. V. Fisches, T. Ben-Nun, T. Hoefler, M. O'Boyle, and H. Leather, "PROGRAML: A Graph-based Program Representation for Data Flow Analysis and Compiler Optimizations," ICML, 2021.

  36. X. Liu et al., "CodexGraph: Bridging Large Language Models and Code Repositories via Dependency Graph-based Code Navigation," NAACL, 2025.

  37. H. Ma et al., "SpecGen: Automated Generation of Formal Program Specifications via Large Language Models," ICSE, 2025.

  38. G. C. Necula, "Proof-Carrying Code," POPL, pp. 106--119, 1997.

  39. Z. Zhang et al., "VerifyThisBench: Generating Code, Specifications, and Proofs All at Once," arXiv:2505.19271, 2025.

  40. R. Milner, M. Tofte, R. Harper, and D. MacQueen, The Definition of Standard ML (Revised), MIT Press, 1997.

  41. "When Names Disappear," arXiv:2510.03178, 2025.

  42. Y. Wang et al., "CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation," EMNLP, 2021.

  43. A. V. Miceli Barone et al., "The Larger They Are, the Harder They Fail: Language Models Do Not Recognize Identifier Swaps in Python," Findings of ACL, 2023.

  44. "Type-Constrained Code Generation with Language Models," PLDI, 2025.

  45. G. Poesia et al., "Synchromesh: Reliable Code Generation from Pre-trained Language Models," ICLR, 2022.

  46. X. Chen et al., "Teaching Large Language Models to Self-Debug," ICLR, 2024.

  47. J. Li et al., "Structured Chain-of-Thought Prompting for Code Generation," ACM TOSEM, 2023.

  48. DeepMind, "Olympiad-Level Formal Mathematical Reasoning with Reinforcement Learning," Nature, 2025.

  49. L. de Moura and S. Ullrich, "The Lean 4 Theorem Prover and Programming Language," CADE, pp. 625--635, 2021.

  50. P. Aggarwal, B. Parno, and S. Welleck, "AlphaVerus: Bootstrapping Formally Verified Code Generation through Self-Improving Translation and Treefinement," Carnegie Mellon University, arXiv:2412.06176, December 2024.

  51. P. Song, K. Yang, and A. Anandkumar, "Lean Copilot: Large Language Models as Copilots for Theorem Proving in Lean," arXiv:2404.12534, 2024.

  52. E. Moggi, "Notions of Computation and Monads," Information and Computation, vol. 93, no. 1, pp. 55--92, 1991.

  53. P. Wadler, "Comprehending Monads," Mathematical Structures in Computer Science, vol. 2, no. 4, pp. 461--493, 1992.

  54. R. J. R. Back and J. von Wright, Refinement Calculus: A Systematic Introduction, Springer, 1998.

  55. C. Morgan, Programming from Specifications, Prentice Hall, 1998.

  56. GumTreeDiff, "GumTree: A Syntax-Aware Diff Tool," https://github.com/GumTreeDiff/gumtree, 2025.

  57. X. Leroy, "A Formally Verified Compiler Back-end," Journal of Automated Reasoning, vol. 43, no. 4, pp. 363--446, 2009.

  58. Xie et al., 2024

  59. K. Mei et al., "AIOS: LLM Agent Operating System," arXiv:2403.16971, 2024.

  60. Cloud Security Alliance, "The Agentic Trust Framework: Zero Trust Governance for AI Agents," February 2026.

  61. Y. Li et al., "Competition-Level Code Generation with AlphaCode," Science, 2022; AlphaCode 2 Technical Report, 2023.

  62. E. Zelikman et al., "STaR: Bootstrapping Reasoning With Reasoning," NeurIPS, 2022.

  63. I. J. Good, "Speculations Concerning the First Ultraintelligent Machine," Advances in Computers, vol. 6, pp. 31--88, 1965.

  64. Seed-Coder Team, "Seed-Coder: Let the Code Model Curate Data for Itself," arXiv:2506.03524, 2025.

  65. G. Wang et al., "VOYAGER: An Open-Ended Embodied Agent with Large Language Models," arXiv:2305.16291, 2023.

  66. "Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents," arXiv:2505.22954, 2025.

  67. "Towards Active Synthetic Data Generation for Finetuning Language Models," arXiv:2512.00884, 2025.

  68. "Knowledge Distillation from Foundation Models," arXiv:2503.12067, 2025.

  69. T. Schick et al., "Toolformer: Language Models Can Teach Themselves to Use Tools," NeurIPS, 2023.

  70. K. Thompson, "Reflections on Trusting Trust," Communications of the ACM, vol. 27, no. 8, pp. 761--763, 1984.

  71. J. Zeng, D. Fu, T. Mi, Y. Zhuang, Y. Huang, X. Li, L. Ye, M. Xie, Q. Hua, Z. Huang, M. Jiang, H. Wang, J. Lin, Y. Xiao, J. Sun, Y. Wu, and P. Liu, "daVinci-Dev: Agent-native Mid-training for Software Engineering," arXiv:2601.18418, January 2026.

  72. N. Bostrom, Superintelligence: Paths, Dangers, Strategies. Oxford University Press, 2014.

  73. "Limitations on Formal Verification for AI Safety," AI Alignment Forum, 2025.

  74. "AgentGuard: Runtime Verification of AI Agents," arXiv:2509.23864, 2025.

  75. R. Potham and M. Harms, "Corrigibility as a Singular Target: A Vision for Inherently Reliable Foundation Models," arXiv:2506.03056, 2025.

  76. J.-Y. Girard, "Linear Logic," Theoretical Computer Science, vol. 50, no. 1, pp. 1--102, 1987.

  77. R. Atkey, "Syntax and Semantics of Quantitative Type Theory," LICS, 2018.

  78. D. E. Denning, "A Lattice Model of Secure Information Flow," Communications of the ACM, vol. 19, no. 5, pp. 236--243, 1976.

  79. "Syntax and Semantics of Quantitative Type Theory," LICS 2018

  80. N. Jain et al., "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code," arXiv:2403.07974, 2024.

  81. E. Clarke, O. Grumberg, S. Jha, Y. Lu, and H. Veith, "Counterexample-Guided Abstraction Refinement," Proc. 12th International Conference on Computer Aided Verification (CAV), LNCS 1855, pp. 154--169, 2000.

  82. Z. Manna and R. Waldinger, "A Deductive Approach to Program Synthesis," ACM Transactions on Programming Languages and Systems, vol. 2, no. 1, pp. 90--121, 1980.

  83. S. Gulwani, "Automating String Processing in Spreadsheets Using Input-Output Examples," POPL, pp. 317--330, 2011.

  84. P. Mukherjee and B. Delaware, "Towards Automated Verification of LLM-Synthesized C Programs," CoqPL 2025 / arXiv, 2025.

  85. C. E. Jimenez et al., "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" ICLR, 2024.

  86. Q. Wu et al., "AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation Framework," arXiv:2308.08155, 2023.

  87. Y. Yang, Q. Peng, J. Wang, Y. Wen, and W. Zhang, "LLM-based Multi-Agent Systems: Techniques and Business Perspectives," arXiv:2411.14033, 2024.

  88. R. Ehrlich et al., "CodeMonkeys: Scaling Test-Time Compute for Software Engineering," arXiv:2501.14723, 2025.

  89. C. Lattner and V. Adve, "LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation," CGO, pp. 75--86, 2004.

  90. FIPA, "FIPA Agent Communication Language Specifications," IEEE Foundation for Intelligent Physical Agents, 2002.

  91. Google, "Agent2Agent Protocol Specification," 2025.

  92. F. Vandeputte, "Foundational Design Principles and Patterns for Building Robust and Adaptive GenAI-Native Systems," Proc. ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software (Onward!), pp. 44--62, 2025. DOI: 10.1145/3759429.3762620.

  93. G. Katz, C. Barrett, D. L. Dill, K. Julian, and M. J. Kochenderfer, "Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks," Proc. 29th International Conference on Computer Aided Verification (CAV), LNCS 10426, pp. 97--117, 2017. arXiv:1702.01135.

  94. J. Becker et al., "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity," METR, arXiv:2507.09089, 2025.

  95. Y. Wang et al., "CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation," EMNLP, 2021.

  96. G. Poesia et al., "Synchromesh: Reliable Code Generation from Pre-trained Language Models," ICLR, 2022.

  97. X. Chen et al., "Teaching Large Language Models to Self-Debug," ICLR, 2024.

  98. J. Li et al., "Structured Chain-of-Thought Prompting for Code Generation," ACM TOSEM, 2023.

  99. L. Zhong et al., "LDB: A Large Language Model Debugger via Verifying Runtime Execution Step by Step," Findings of ACL, 2024.

  100. Zhu et al., "Uncertainty-Guided Chain-of-Thought," 2025.