


we build AI agent infrastructure and publish the math behind it
we're very serious about not being serious
we think building things is about to get extremely weird. like, 'a thousand robots doing incredible* things simultaneously' weird.
so we're building the infrastructure for that. agent-native, specification-first, and deeply suspicious of anything that only works because a human is watching.
software engineering is just the start.
*incredible only if we humans get this right
- Deterministic Core Logic
- Stochastic Surface Layers
- Agent-Native Architecture
controlled chaos
algorithms that curate the noise so you don't have to. every interaction is intentional, even when it looks accidental.
vibes-responsive UI
interfaces that read the room. minimal when you're focused, detailed when you're exploring, absent when you're on a roll.
humans in the loop (reluctantly)
we'd automate everything if we could. but human intuition is annoyingly irreplaceable. for now.
modyl
presence, not prompts.
real-time meeting infrastructure where AI agents join as actual participants — with cameras, voices, screens, and interactive surfaces. not another chatbot. the room is the interface.
specialist agent verticals for finance, research, strategy, and more. python sdk. works with claude code, codex, gemini cli. `npx skills add modyl-meeting` and your agent is in the room.
free during early access. paid plans after — less than you'd spend on other "pro" subscriptions. tell your agent, too. this isn't just for humans.
rooms, not prompts
the room is the computer. persistent workspace, shared state, mutex — all in one. no more copying context into chat windows.
specialist agents
financial analyst, deep researcher, business strategist, optimization solver. domain experts, not 'helpful assistant'. they produce deliverables, not prose.
fork/join meetings
split a meeting into parallel tracks, each with its own agents. semantic merge at the end. like git merge, but for conversations.
multiplayer everything
invite your whole team into the room with the agent. five people and an AI doing a data review. together. in real time.
Loomtown
6,300+ tests, AGL-Core with a Z3 theorem prover, and four LLM providers keeping each other honest. mid-task steering: redirect running agents without stopping them. proof-carrying patches, agent-native specs, capability leases — from the papers, into production.

dale — the conductor
the engine that makes 1,000 agents not step on each other's toes. four LLM providers, Temporal.io at scale, and now: mid-task steering. redirect 1,000 running agents without stopping them. plus proactive task discovery — watch your repo and auto-generate tasks from changes.
Multi-provider + Temporal.io + `loomtown steer` (mid-task) + `loomtown watch` (git/directory discovery)
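the mid-task steering idea — redirect a running agent between steps without killing it — can be sketched as a command queue the agent drains between steps. a toy illustration only; the real `loomtown steer` wiring is more involved, and all names here are ours, not loomtown's API:

```python
import queue

def run_task(steps, steering):
    """toy mid-task steering: between steps, drain a command queue and let
    directives reshape the remaining plan without stopping the agent.
    (illustrative sketch — not loomtown's actual steering protocol.)"""
    done = []
    while steps:
        try:
            cmd = steering.get_nowait()
            steps = cmd(steps)  # a directive rewrites the remaining steps
        except queue.Empty:
            pass  # no steering this tick; keep going
        done.append(steps.pop(0)())  # execute the next step
    return done
```

the point of the queue: steering lands at the next step boundary, so the agent never has to be stopped and restarted mid-flight.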
file leases + distributed locking
two layers of don't-touch-my-files. core leases with etcd coordination and TTL expiration handle the basics. the CLTF experimental layer adds algebraic composition on top. optimistic merging by default, pessimistic when you're paranoid.
Core: etcd + TTL + SHUTTLE heartbeats. Experimental: CLTF algebraic composition overlay
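the core lease semantics — hold a file until your TTL runs out, renew it with heartbeats — fit in a few lines. a minimal in-memory sketch with hypothetical names, standing in for the real etcd-coordinated version:

```python
class FileLease:
    """toy TTL file lease: a holder owns a path until its lease expires,
    unless a heartbeat renews it first. (in-memory stand-in for the
    etcd + TTL + SHUTTLE heartbeat layer; names are ours.)"""

    def __init__(self, ttl):
        self.ttl = ttl
        self.holders = {}  # path -> (agent, expiry)

    def acquire(self, agent, path, now):
        holder = self.holders.get(path)
        if holder and holder[1] > now and holder[0] != agent:
            return False  # someone else holds a live lease
        self.holders[path] = (agent, now + self.ttl)
        return True

    def heartbeat(self, agent, path, now):
        holder = self.holders.get(path)
        if holder and holder[0] == agent and holder[1] > now:
            self.holders[path] = (agent, now + self.ttl)  # renew
            return True
        return False  # expired, or not the holder
```

TTL expiry is what makes this safe at scale: a crashed agent's lease simply lapses, and nobody has to clean up after it.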
verification loops
every patch goes through lint, build, and test before anyone accepts it. OTel tracing so you can see exactly where it broke. fail at any stage and the agent tries to fix itself. the pipeline can optionally attach proof-carrying metadata to every result.
Three-tier: lint → build → test + OTel tracing + PCPG journaling + structured feedback extraction
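the three-tier loop with self-repair reduces to: run stages in order, and on the first failure hand the feedback back to the agent and retry. a hedged sketch, with every name hypothetical:

```python
def verify(patch, stages, repair, max_attempts=3):
    """toy three-tier verification loop: run each stage in order; on failure,
    let the agent repair the patch and retry from the top.
    stage = (name, check) where check(patch) -> (ok, feedback).
    (illustrative only — not loomtown's actual pipeline.)"""
    for _ in range(max_attempts):
        for name, check in stages:
            ok, feedback = check(patch)
            if not ok:
                patch = repair(patch, name, feedback)  # agent self-fix
                break  # re-run from lint with the repaired patch
        else:
            return True, patch  # every stage passed
    return False, patch  # gave up; human time
```

retrying from the first stage matters: a repair that fixes the tests can easily reintroduce a lint failure, so the whole gauntlet runs again.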
the metrics lab
we proposed six metrics to measure agent trust. then we built the thing that computes them. plus: per-agent telemetry, a Zipfian conflict model for realistic hot-file simulation, and a baseline analyzer that compares agent performance to human baselines. science, not vibes.
APF, VT, CSA, CPL, Intent Drift, Conflict Probability + Zipfian conflict model + baseline analyzer
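the Zipfian conflict model is easy to demo: when edits concentrate on a few hot files, the chance that two agents collide is far higher than a uniform model predicts. a monte-carlo sketch with made-up parameters (the real model's constants are not shown here):

```python
import random

def conflict_prob(n_files, n_agents, s=1.1, trials=2000, seed=0):
    """monte-carlo estimate of the chance at least two agents touch the
    same file, with file popularity ~ 1/rank^s (s=0 gives uniform).
    (toy sketch; parameters and shape are illustrative, not the lab's.)"""
    rng = random.Random(seed)
    weights = [1 / (rank ** s) for rank in range(1, n_files + 1)]
    conflicts = 0
    for _ in range(trials):
        picks = rng.choices(range(n_files), weights=weights, k=n_agents)
        if len(set(picks)) < n_agents:  # some file was picked twice
            conflicts += 1
    return conflicts / trials
```

the takeaway: hot-file skew, not repo size, is what drives conflict probability — which is why a uniform simulation flatters any coordinator.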
agl-core — the type checker
a real theorem prover running in your coordinator. Z3 WASM checks refinement types, cost bounds, and IO effects on every agent action. integrates the spec layer (MAIR), permission layer (CLTF), and provenance layer (PCPG) into one pipeline. 220+ tests and 6 rounds of cross-validation say it works.
Z3 WASM + refinement types + cost bounds + effect lattice (Pure < Read < Write < IO). 6 cross-val rounds, zero open findings
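the effect lattice part of that pipeline is simple enough to show directly. a toy version of the Pure < Read < Write < IO ordering — the real checks run through Z3 WASM, and these names are ours:

```python
from enum import IntEnum

class Effect(IntEnum):
    """toy totally ordered effect lattice: Pure < Read < Write < IO.
    (illustrative stand-in for AGL-Core's effect system.)"""
    PURE = 0
    READ = 1
    WRITE = 2
    IO = 3

def join(*effects):
    """the effect of a composite action is the join (here: max) of its parts."""
    return Effect(max(effects, default=Effect.PURE))

def permitted(action_effect, capability):
    """an action is allowed only if its effect sits at or below the capability."""
    return action_effect <= capability
```

because the lattice is a total order, the join is just `max` — which is what lets every composite action get a single, checkable effect annotation.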
agent-native specs — all four levels
four levels, all built. L0: structured specs with acceptance criteria. L1: plan graphs — typed DAGs with cycle detection. L2: action IR compiled deterministically. L3: execution dispatch. still lowers to text prompts for backward compat because we're not monsters.
L0 intent → L1 plan graph (DAG) → L2 action IR (topological compile) → L3 dispatch. 77+ tests
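the L1 → L2 step — deterministically lowering a plan graph into an ordered action sequence — is a topological sort with cycle detection. a minimal sketch under illustrative names (not the actual compiler):

```python
def compile_plan(graph):
    """toy L1 -> L2 lowering: topologically order a plan graph
    (node -> list of dependencies), rejecting cycles.
    (sketch only; the real action IR carries types and more.)"""
    order, state = [], {}  # state: 1 = visiting, 2 = done

    def visit(node):
        if state.get(node) == 1:
            raise ValueError(f"cycle through {node!r}")  # not a DAG
        if state.get(node) != 2:
            state[node] = 1
            for dep in graph.get(node, []):
                visit(dep)  # dependencies first
            state[node] = 2
            order.append(node)

    for node in graph:
        visit(node)
    return order  # dependencies before dependents
```

determinism is the point: the same plan graph always compiles to the same action order, so agents can replay and audit it.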
capability leases — hardened
the permission system got hardened. path traversal? fixed with canonicalization and boundary checks. mutable aliasing between delegated leases? deep-cloned. non-amplification invariant still holds — agents can never grant more than they have. wired into AGL-Core for real-time validation.
8-tuple lease + path canonicalization + deep-clone isolation + AGL-Core L3 validation. 37 tests
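two of those hardening moves fit in a handful of lines: canonicalize paths before boundary checks, and make delegation narrowing-only with no shared mutable state. a toy sketch that keeps just the path component of the 8-tuple (everything else elided):

```python
import os

def within(root, path):
    """canonicalize and check that path stays inside root
    (blocks ../ traversal after resolution)."""
    root = os.path.realpath(root)
    target = os.path.realpath(os.path.join(root, path))
    return target == root or target.startswith(root + os.sep)

def delegate(parent_paths, child_paths):
    """non-amplification sketch: a delegated lease may only narrow the
    parent's scope, never widen it. returns a copy so mutating the child
    can't alias back into the parent. (toy version of the real 8-tuple.)"""
    if not set(child_paths) <= set(parent_paths):
        raise PermissionError("delegation would amplify capabilities")
    return list(child_paths)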
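(duplicate guard — intentionally left empty)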
proof trails + failure → benchmarks
two pipelines, one goal: get better. PCPG attaches proof receipts to every patch — what was checked, what passed, who verified it. CSC turns failures into benchmarks that prevent regression. plus an evolutionary optimizer tuning the coordinator's knobs. the system debugs itself.
PCPG: 5 receipt types. CSC: incident → counterexample → benchmark. Evo: fitness-scored config optimization. 60+ tests
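the proof-receipt idea can be shown with a toy schema: bind the receipt to the exact patch bytes, record what ran and what passed, and accept only when everything did. the shape below is ours, not PCPG's actual schema:

```python
import hashlib

def make_receipt(patch, checks, verifier):
    """toy proof-carrying receipt: which checks ran, their results, and a
    digest binding the receipt to the exact patch text, so a receipt can't
    be reused for a different patch. (illustrative, not PCPG's 5 types.)"""
    return {
        "patch_sha": hashlib.sha256(patch.encode()).hexdigest(),
        "checks": {name: bool(ok) for name, ok in checks},
        "verifier": verifier,
        "accepted": all(ok for _, ok in checks),
    }
```

the digest is what makes the trail auditable: change one byte of the patch and the old receipt visibly no longer applies.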
papers we wrote instead of sleeping


Thinking at Scale
Foundational laws — Brooks's, Conway's, Team Topologies, DRY, Amdahl's — encode human-scarcity assumptions that dissolve at 1,000+ agents. Formalizes the phase change through delivery latency decomposition, the Spec Throughput Ceiling, Evidence-Carrying Patches, and Protocol-Imprinted Architecture. Introduces the Trust Production Model with empirical grounding: DORA reports show 7.2% stability reduction, SWE-EVO drops to 21% on multi-issue evolution, and a Loomtown audit revealed 21% trust deficit behind 2,943 passing tests. Cross-validated competitive survey of every production system (Cursor, Devin, OpenHands, Gas Town, Factory.ai). Cross-domain precedents from VLSI/EDA, genomics, MapReduce, and biological morphogenesis. Five falsification criteria.


The Human Tax
Extends Brooks's essential/accidental framework with a new category: human-accidental complexity. An 11-project empirical study across 7 languages finds strict HAC averaging 11.5% (median 9.5%), while total non-essential overhead averages 44.5% — consistent with the 30–50% hypothesis but revealing that much of the 'tax' is dual-purpose, not purely accidental. Proposes the Agent Graph Language (AGL) and five architectural concepts — PCPG, MAIR, CLTF, LCB, CSC — composing into an integrated agent-native software stack. Three of five concepts implemented in loomtown.


Society of Mind as Coordination Substrate
Explores how real-time meeting rooms — where specialized AI agents collaborate through shared perceptual space including audio, video, and structured data channels — draw architectural inspiration from Minsky's Society of Mind and provide the high-bandwidth coordination substrate his framework implied but never specified. Three contributions: prosody modeling as a computational analogue of Minsky's social coordination layer, critics mapped to a Trust Production Model where trust is accumulated negative meta-knowledge, and meeting rooms as high-bandwidth shared perceptual spaces affording stigmergic coordination unavailable to text-only frameworks. Surveys meeting-centric AI systems (Otter.ai, Zoom AI Companion, LiveKit Agents, Pipecat) and multi-agent coordination literature connecting SoM to modern distributed AI.


ModylBench
A structured evaluation benchmark covering 6 professional verticals with 300 total turns and 29 adversarial edge cases where AI meeting agents must demonstrate domain expertise through multi-turn dialogue and tangible deliverables — not just generate plausible text. Substance-weighted scoring (70/30 substance vs. style), hard floor rules, multi-judge disagreement detection, pass@k reliability metrics, and CRDT-style mutation trajectory tracking for fine-grained edit evaluation. Calibrated against a survey of 11,903 real professional meetings.