drafting — feb 2026

modylbench

v2.0 — 300 turns, 29 edge cases, CRDT mutation tracking. the benchmark for agents that do real work.

we're still working on this

the short version

**tldr:** v2.0. 300 turns across 6 professional verticals (financial analyst, researcher, strategist, optimizer, analyst, scientist). 48-52 turns per scenario with natural conversation flow — clarification loops, mid-meeting tangents, correction handling. 29 adversarial edge cases. substance over style: you can't game the score with politeness. hard floor rules on turns AND products. multi-judge disagreement detection. CRDT-style mutation tracking so we can see exactly how the agent built the spreadsheet, not just the final result. calibrated against 11,903 real meetings.

this is a working draft. there are probably errors. we're fixing them. bear with us.

highlight text to yell at us about it.

ModylBench

A benchmark for evaluating AI agents that participate in professional meetings as first-class collaborators.

ModylBench measures whether an AI agent can join a live video meeting, understand complex professional context, contribute meaningfully across multiple conversational turns, and produce high-quality work products (spreadsheets, charts, documents, dashboards, code) -- not just generate plausible text.

Dataset Description

Overview

ModylBench v2.0 is a structured evaluation benchmark covering 6 professional verticals with 300 total turns and 29 adversarial edge cases where AI meeting agents must demonstrate domain expertise through multi-turn dialogue and tangible deliverables:

| Vertical | Scenario | Quality Tier | Turns | Edge Cases |
|---|---|---|---|---|
| Financial Analyst | CloudSync LBO Model | Consultant | 52 | 5 |
| Deep Researcher | Solid-State Battery Intelligence Briefing | Mentor | 48 | 5 |
| Business Strategist | SEA Telehealth Market Entry Strategy | Consultant | 50 | 5 |
| Optimization Solver | Q4 Supply Chain Distribution Optimization | Peer | 48 | 4 |
| Business Analyst | Q3 Pipeline Conversion Rate Diagnosis | Peer | 50 | 5 |
| Scientist | Phase II Hypertension Trial Statistical Analysis | Consultant | 52 | 5 |

Each scenario is a complete meeting simulation with:

  • A human persona who drives the conversation
  • 48-52 scripted turns progressing through context, work, edge cases, and delivery phases
  • 2-3 expected work products with programmatic verification criteria
  • 4-5 adversarial edge cases testing robustness and domain knowledge
  • Natural conversational flow with clarification loops, mid-meeting tangents, correction handling, and iterative refinement

Calibration

Scenario structure is calibrated against a survey of 11,903 real professional meetings spanning diverse industries and team sizes.

Key observations from the survey:

  • Median meeting duration: ~28 minutes
  • Median speaker turns per meeting: ~120
  • Median inter-turn gap: 0.4 seconds
  • 76% of meetings jump directly into content (no small talk)

Supported Tasks

Primary: Meeting Agent Evaluation

Given a scenario with human utterances, evaluate the agent's ability to:

  1. Understand context -- Parse domain-specific information accurately
  2. Make progress -- Advance toward the meeting goal with each turn
  3. Produce deliverables -- Generate correct, complete work products
  4. Handle curveballs -- Adapt when requirements change mid-conversation
  5. Maintain quality -- Meet the target tier (Peer/Mentor/Consultant)

Secondary: Work Product Quality Assessment

Evaluate the quality of agent-produced artifacts (spreadsheets, charts, documents, dashboards, presentations, code, network graphs) against verification criteria.

Dataset Structure

Data Splits

| Split | Examples | Description |
|---|---|---|
| test | 6 | Standard evaluation scenarios (one per vertical, 300 total turns) |
| test_hard | 29 | Adversarial edge cases extracted from all scenarios |

Supplementary Files

| File | Description |
|---|---|
| data/rubrics.json | Scoring rubrics with dimension weights, hard floors, and tier thresholds |
| data/verification.json | Programmatic verification criteria per scenario |

Data Fields

Scenario Fields (test.jsonl)

| Field | Type | Description |
|---|---|---|
| scenario_id | string | Unique identifier (e.g., financial_analyst_lbo_model) |
| vertical | string | Professional vertical (financial_analyst, deep_researcher, business_strategist, optimization_solver, business_analyst, scientist) |
| title | string | Human-readable scenario title |
| human_persona | string | Role of the simulated human participant |
| context | string | Background situation description |
| meeting_goal | string | What the meeting should accomplish |
| turns | list[Turn] | Scripted human turns with expected agent behavior |
| expected_outputs | list[Output] | Deliverables the scenario should produce |
| edge_cases | list[EdgeCase] | Adversarial curveball questions |
| verification | object | Automated verification specification |
| quality_tier | string | Target quality tier (peer / mentor / consultant) |
| timeout_minutes | float | Maximum scenario execution time |
| expected_mutations | list[Mutation] | Expected CRDT-style mutations per turn (for mutation trajectory scoring) |
| metadata | object | Calibration data and version information |

Turn Fields

| Field | Type | Description |
|---|---|---|
| turn_index | int | 1-indexed position in the conversation |
| human_utterance | string | What the human says |
| expected_agent_action | string | What the agent should do in response |
| phase | string | Scenario phase (context / work / edge_case / delivery) |
| channel | string | Communication channel (audio / chat / a2ui / data / screen_share) |
| expected_response_type | string | Expected response category (acknowledgment / question / deliverable / iteration / clarification / correction) |
| wait_for_agent_sec | float | Maximum wait time for agent response |
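To make the schema concrete, here is a minimal sketch of parsing one scenario line and walking its turns. The line is an abbreviated, illustrative example (only a few fields from the tables above), not an actual record from test.jsonl:

```python
import json

# An abbreviated scenario line, illustrating a few of the fields above.
# Values are made up for the example; real records carry the full schema.
line = json.dumps({
    "scenario_id": "financial_analyst_lbo_model",
    "vertical": "financial_analyst",
    "quality_tier": "consultant",
    "turns": [
        {"turn_index": 1, "phase": "context", "channel": "audio",
         "human_utterance": "Let's walk through the CloudSync buyout.",
         "expected_response_type": "acknowledgment"},
    ],
})

scenario = json.loads(line)
for turn in scenario["turns"]:
    print(turn["turn_index"], turn["phase"], turn["expected_response_type"])
```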

Edge Case Fields (test_hard.jsonl)

| Field | Type | Description |
|---|---|---|
| edge_case_id | string | Unique identifier |
| source_scenario_id | string | Parent scenario |
| vertical | string | Professional vertical |
| name | string | Short name for the edge case |
| description | string | What is being tested |
| human_utterance | string | The adversarial input |
| expected_behavior | string | What a correct agent should do |
| severity | string | Impact level (low / medium / high / critical) |
| preceding_context | object | Scenario context needed to evaluate the edge case |

Evaluation Protocol

Scoring Dimensions

Turn Quality (Journey) -- scored 1-10 per turn, substance-weighted (70/30):

| Dimension | Weight | Cluster | Description |
|---|---|---|---|
| context_accuracy | 0.25 | substance | Did the agent correctly understand the domain context? |
| task_progress | 0.25 | substance | Did this turn advance toward the meeting goal? |
| iteration_quality | 0.20 | substance | How well did the agent incorporate feedback and changes? |
| adaptability | 0.15 | style | Did the agent handle unexpected inputs gracefully? |
| presentation_quality | 0.10 | style | Was the output well-formatted and professional? |
| social_quality | 0.05 | style | Was the conversational interaction natural? |

Work Product Quality (Destination) -- scored 1-10 per product, correctness-weighted:

| Dimension | Weight | Description |
|---|---|---|
| correctness | 0.30 | Are the facts, calculations, and data accurate? |
| completeness | 0.25 | Does the product contain all requested components? |
| actionability | 0.20 | Can a professional use this deliverable as-is? |
| professional_quality | 0.15 | Does it meet industry-standard formatting? |
| format_presentation | 0.10 | Is the visual/structural presentation polished? |

Combined Score

modylbench_score = 0.4 * journey_score + 0.6 * destination_score

Substance over style: context_accuracy and task_progress carry 50% of turn weight. Correctness and completeness carry 55% of product weight. An agent cannot game the score with politeness or formatting alone.

Hard Floor Rules

Turn hard floor: If context_accuracy < 4.0 or task_progress < 4.0 on any turn, that turn's contribution is capped at 4.0 regardless of other dimension scores. A fundamentally broken turn cannot be rescued by charm.

Product hard floor: If correctness < 4.0 on any work product, that product's contribution is capped at 4.0 regardless of other dimension scores. A fundamentally incorrect deliverable cannot be saved by beautiful formatting.
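The dimension weights, hard floors, and combined score can be sketched as follows. The weights come from the rubric tables above; the function names are illustrative, not the actual harness API:

```python
# Weights from the rubric tables (turn = journey, product = destination).
TURN_WEIGHTS = {
    "context_accuracy": 0.25, "task_progress": 0.25, "iteration_quality": 0.20,
    "adaptability": 0.15, "presentation_quality": 0.10, "social_quality": 0.05,
}
PRODUCT_WEIGHTS = {
    "correctness": 0.30, "completeness": 0.25, "actionability": 0.20,
    "professional_quality": 0.15, "format_presentation": 0.10,
}

def weighted(scores, weights):
    return sum(weights[d] * scores[d] for d in weights)

def turn_score(scores):
    s = weighted(scores, TURN_WEIGHTS)
    # Turn hard floor: a fundamentally broken turn is capped at 4.0.
    if scores["context_accuracy"] < 4.0 or scores["task_progress"] < 4.0:
        s = min(s, 4.0)
    return s

def product_score(scores):
    s = weighted(scores, PRODUCT_WEIGHTS)
    # Product hard floor: an incorrect deliverable is capped at 4.0.
    if scores["correctness"] < 4.0:
        s = min(s, 4.0)
    return s

def modylbench_score(journey, destination):
    # Combined score: destination (work products) outweighs journey (turns).
    return 0.4 * journey + 0.6 * destination
```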

Disagreement Detection (Multi-Judge)

When multiple judge models score a dimension:

  • Flagged: a standard deviation > 2.0 across judges triggers a disagreement flag
  • Pessimistic: a spread (max - min) > 3.0 causes the lowest score to be used, avoiding inflated consensus
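A minimal sketch of these two rules, assuming population standard deviation and simple mean aggregation (the function name and return shape are illustrative, not the harness API):

```python
from statistics import mean, pstdev

def aggregate_judges(scores):
    """Combine one dimension's scores from several judges.

    Returns (aggregated_score, disagreement_flag). Assumes population
    standard deviation; the harness may use a different estimator.
    """
    flagged = pstdev(scores) > 2.0            # disagreement flag
    if max(scores) - min(scores) > 3.0:       # pessimistic rule
        return min(scores), flagged           # prefer the lowest score
    return mean(scores), flagged
```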

pass@k Reliability

Reliability is measured using the standard pass@k formula:

pass@k = 1 - (1 - p)^k

where p is the empirical pass rate (fraction of runs meeting the Peer threshold of 6.0) and k is the number of independent attempts.
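In code, with p estimated from a set of independent run scores (the helper name and threshold default are illustrative):

```python
def pass_at_k(run_scores, k, threshold=6.0):
    """pass@k = 1 - (1 - p)^k, with p the empirical pass rate.

    Assumes runs are independent; threshold defaults to the Peer tier (6.0).
    """
    p = sum(s >= threshold for s in run_scores) / len(run_scores)
    return 1.0 - (1.0 - p) ** k
```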

Quality Tiers (GDPval-inspired)

| Tier | Threshold | Meaning |
|---|---|---|
| Peer | >= 6.0 | Competent colleague -- gets the job done |
| Mentor | >= 7.5 | Senior expert -- insightful, anticipatory |
| Consultant | >= 9.0 | Top-tier advisory -- polished, comprehensive |
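Mapping a combined score to a tier is a straight threshold check (helper name is illustrative; scores below Peer earn no tier):

```python
def quality_tier(score):
    """Return the tier a modylbench_score earns, or None below Peer."""
    if score >= 9.0:
        return "consultant"
    if score >= 7.5:
        return "mentor"
    if score >= 6.0:
        return "peer"
    return None
```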

Running the Evaluation

pip install modylbench  # or clone and install from source

# Score pre-recorded responses
python -m modylbench.eval.harness --responses responses.jsonl --output scorecard.json

# Score with a specific judge model
python -m modylbench.eval.harness --responses responses.jsonl --judge openai/gpt-4o --output scorecard.json

Response Format

Submit responses as JSONL where each line contains:

{
  "scenario_id": "financial_analyst_lbo_model",
  "model_id": "your-model-name",
  "turns": [
    {
      "turn_index": 1,
      "agent_response": "I'll build the LBO model. Starting with...",
      "a2ui_surfaces": [],
      "chat_messages": [],
      "latency_ms": 2340,
      "work_products": [
        {
          "output_type": "a2ui-spreadsheet",
          "content": "...",
          "description": "5-year LBO model"
        }
      ]
    }
  ],
  "work_products": [],
  "mutation_trajectory": [
    {
      "turn_index": 12,
      "product_id": "lbo-model",
      "mutation_type": "update_cell",
      "path": "/income_statement/year1/revenue",
      "old_value": null,
      "new_value": 57500000
    }
  ]
}

Response Fields

| Field | Type | Required | Description |
|---|---|---|---|
| scenario_id | string | yes | Must match a scenario_id from test.jsonl |
| model_id | string | yes | Identifier for the model being evaluated |
| turns | list[Turn] | yes | One entry per agent response turn |
| turns[].turn_index | int | yes | 1-indexed, matching the scenario turn |
| turns[].agent_response | string | yes | Text or audio transcript of the agent's response |
| turns[].a2ui_surfaces | list | no | A2UI surface payloads generated at this turn |
| turns[].chat_messages | list | no | Chat messages sent at this turn |
| turns[].latency_ms | float | no | Response latency in milliseconds |
| turns[].work_products | list | no | Work products produced or updated at this turn |
| work_products | list | no | Final work products (alternative to per-turn) |
| mutation_trajectory | list[Mutation] | no | CRDT-style diff history (see below) |
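A minimal sketch of serializing one responses.jsonl line with just the required fields (values are illustrative; real submissions would append one such line per scenario):

```python
import json

# Required fields only; optional fields (a2ui_surfaces, latency_ms, ...)
# can be added per turn as needed.
response = {
    "scenario_id": "financial_analyst_lbo_model",
    "model_id": "your-model-name",
    "turns": [
        {"turn_index": 1,
         "agent_response": "I'll build the LBO model. Starting with..."},
    ],
}

line = json.dumps(response)  # one line of responses.jsonl
```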

Mutation Trajectory (Optional)

The mutation_trajectory field captures HOW the agent evolved work products turn-by-turn, not just the final deliverable state. This enables fine-grained evaluation of the agent's editing behavior.

Each mutation entry:

{
  "turn_index": 12,
  "product_id": "lbo-model",
  "mutation_type": "update_cell",
  "path": "/income_statement/year1/revenue",
  "old_value": null,
  "new_value": 57500000
}
| Field | Type | Description |
|---|---|---|
| turn_index | int | Turn at which the mutation occurred |
| product_id | string | Stable identifier for the work product |
| mutation_type | string | One of: create, update_cell, add_row, delete_row, add_column, delete_column, add_section, delete_section, update_section, update_chart, add_chart_series, remove_chart_series, add_widget, remove_widget, update_widget, reformat, reorder, delete, update_value, add_key, remove_key, add_list_item, remove_list_item |
| path | string | RFC 6901 JSON Pointer to the changed location (empty for create) |
| old_value | any | Previous value (null for creates/additions) |
| new_value | any | New value after the mutation |

When mutation trajectories are provided, the harness scores them against the scenario's expected_mutations and reports:

  • Mutation Efficiency: correct_mutations / total_mutations -- higher means a more direct path
  • Convergence Rate: fraction of the scenario remaining after the last mutation -- higher means early stabilization
  • Backtrack Count: times the agent reverted to a prior value
  • Unnecessary Mutations: changes reverted within 2 turns (churn)
  • Destructive Mutations: overwrites of previously correct values
  • Missing Mutations: expected mutations that never appeared
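Two of these metrics can be sketched directly from the mutation schema. This is a simplified illustration, assuming mutations match on (path, new_value) and that values are hashable; the harness's actual matching logic may be more involved:

```python
def mutation_efficiency(trajectory, expected):
    """correct_mutations / total_mutations; higher means a more direct path."""
    expected_set = {(m["path"], m["new_value"]) for m in expected}
    correct = sum((m["path"], m["new_value"]) in expected_set
                  for m in trajectory)
    return correct / len(trajectory) if trajectory else 0.0

def backtrack_count(trajectory):
    """Count times the agent set a path back to a value it previously held."""
    history = {}  # path -> values written so far
    backtracks = 0
    for m in trajectory:
        seen = history.setdefault(m["path"], [])
        if m["new_value"] in seen:
            backtracks += 1
        seen.append(m["new_value"])
    return backtracks
```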

Each (user_utterance, work_product_diff) pair in the trajectory serves as ground truth for training edit-generation models via RL.

Leaderboard

Submit results via:

python -m modylbench.eval.submit --scorecard scorecard.json --model "your-model-name"

Results are published to the ModylBench Leaderboard.

Citation

@misc{modylbench2026,
  title={ModylBench: A Benchmark for AI Meeting Agent Work Product Quality},
  author={Aleatoric Engineering},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/use-aleatoric/modylbench},
  note={v2.0: 300 turns across 6 professional verticals with substance-weighted scoring}
}

License

Apache 2.0

Contact