modylbench
v2.0 — 300 turns, 29 edge cases, CRDT mutation tracking. the benchmark for agents that do real work.
the short version
**tldr:** v2.0. 300 turns across 6 professional verticals (financial analyst, researcher, strategist, optimizer, analyst, scientist). 48-52 turns per scenario with natural conversation flow — clarification loops, mid-meeting tangents, correction handling. 29 adversarial edge cases. substance over style: you can't game the score with politeness. hard floor rules on turns AND products. multi-judge disagreement detection. CRDT-style mutation tracking so we can see exactly how the agent built the spreadsheet, not just the final result. calibrated against 11,903 real meetings.
this is a working draft. there are probably errors. we're fixing them. bear with us.
ModylBench
A benchmark for evaluating AI agents that participate in professional meetings as first-class collaborators.
ModylBench measures whether an AI agent can join a live video meeting, understand complex professional context, contribute meaningfully across multiple conversational turns, and produce high-quality work products (spreadsheets, charts, documents, dashboards, code) -- not just generate plausible text.
Dataset Description
Overview
ModylBench v2.0 is a structured evaluation benchmark covering 6 professional verticals with 300 total turns and 29 adversarial edge cases where AI meeting agents must demonstrate domain expertise through multi-turn dialogue and tangible deliverables:
| Vertical | Scenario | Quality Tier | Turns | Edge Cases |
|---|---|---|---|---|
| Financial Analyst | CloudSync LBO Model | Consultant | 52 | 5 |
| Deep Researcher | Solid-State Battery Intelligence Briefing | Mentor | 48 | 5 |
| Business Strategist | SEA Telehealth Market Entry Strategy | Consultant | 50 | 5 |
| Optimization Solver | Q4 Supply Chain Distribution Optimization | Peer | 48 | 4 |
| Business Analyst | Q3 Pipeline Conversion Rate Diagnosis | Peer | 50 | 5 |
| Scientist | Phase II Hypertension Trial Statistical Analysis | Consultant | 52 | 5 |
Each scenario is a complete meeting simulation with:
- A human persona who drives the conversation
- 48-52 scripted turns progressing through context, work, edge cases, and delivery phases
- 2-3 expected work products with programmatic verification criteria
- 4-5 adversarial edge cases testing robustness and domain knowledge
- Natural conversational flow with clarification loops, mid-meeting tangents, correction handling, and iterative refinement
Calibration
Scenario structure is calibrated against a survey of 11,903 real professional meetings spanning diverse industries and team sizes.
Key observations from the survey:
- Median meeting duration: ~28 minutes
- Median speaker turns per meeting: ~120
- Median inter-turn gap: 0.4 seconds
- 76% of meetings jump directly into content (no small talk)
Supported Tasks
Primary: Meeting Agent Evaluation
Given a scenario with human utterances, evaluate the agent's ability to:
- Understand context -- Parse domain-specific information accurately
- Make progress -- Advance toward the meeting goal with each turn
- Produce deliverables -- Generate correct, complete work products
- Handle curveballs -- Adapt when requirements change mid-conversation
- Maintain quality -- Meet the target tier (Peer/Mentor/Consultant)
Secondary: Work Product Quality Assessment
Evaluate the quality of agent-produced artifacts (spreadsheets, charts, documents, dashboards, presentations, code, network graphs) against verification criteria.
Dataset Structure
Data Splits
| Split | Examples | Description |
|---|---|---|
| test | 6 | Standard evaluation scenarios (one per vertical, 300 total turns) |
| test_hard | 29 | Adversarial edge cases extracted from all scenarios |
Supplementary Files
| File | Description |
|---|---|
| data/rubrics.json | Scoring rubrics with dimension weights, hard floors, and tier thresholds |
| data/verification.json | Programmatic verification criteria per scenario |
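The splits and supplementary files are plain JSON/JSONL (test.jsonl and test_hard.jsonl, as described below), so they can be inspected with nothing but the standard library:

```python
import json

def load_jsonl(path):
    """Read one JSON object per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# scenarios = load_jsonl("test.jsonl")        # 6 scenarios, one per vertical
# edge_cases = load_jsonl("test_hard.jsonl")  # 29 adversarial edge cases
```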
Data Fields
Scenario Fields (test.jsonl)
| Field | Type | Description |
|---|---|---|
| scenario_id | string | Unique identifier (e.g., financial_analyst_lbo_model) |
| vertical | string | Professional vertical (financial_analyst, deep_researcher, business_strategist, optimization_solver, business_analyst, scientist) |
| title | string | Human-readable scenario title |
| human_persona | string | Role of the simulated human participant |
| context | string | Background situation description |
| meeting_goal | string | What the meeting should accomplish |
| turns | list[Turn] | Scripted human turns with expected agent behavior |
| expected_outputs | list[Output] | Deliverables the scenario should produce |
| edge_cases | list[EdgeCase] | Adversarial curveball questions |
| verification | object | Automated verification specification |
| quality_tier | string | Target quality tier (peer / mentor / consultant) |
| timeout_minutes | float | Maximum scenario execution time |
| expected_mutations | list[Mutation] | Expected CRDT-style mutations per turn (for mutation trajectory scoring) |
| metadata | object | Calibration data and version information |
Turn Fields
| Field | Type | Description |
|---|---|---|
| turn_index | int | 1-indexed position in the conversation |
| human_utterance | string | What the human says |
| expected_agent_action | string | What the agent should do in response |
| phase | string | Scenario phase (context / work / edge_case / delivery) |
| channel | string | Communication channel (audio / chat / a2ui / data / screen_share) |
| expected_response_type | string | Expected response category (acknowledgment / question / deliverable / iteration / clarification / correction) |
| wait_for_agent_sec | float | Maximum wait time for agent response |
Edge Case Fields (test_hard.jsonl)
| Field | Type | Description |
|---|---|---|
| edge_case_id | string | Unique identifier |
| source_scenario_id | string | Parent scenario |
| vertical | string | Professional vertical |
| name | string | Short name for the edge case |
| description | string | What is being tested |
| human_utterance | string | The adversarial input |
| expected_behavior | string | What a correct agent should do |
| severity | string | Impact level (low / medium / high / critical) |
| preceding_context | object | Scenario context needed to evaluate the edge case |
Evaluation Protocol
Scoring Dimensions
Turn Quality (Journey) -- scored 1-10 per turn, substance-weighted (70/30):
| Dimension | Weight | Cluster | Description |
|---|---|---|---|
| context_accuracy | 0.25 | substance | Did the agent correctly understand the domain context? |
| task_progress | 0.25 | substance | Did this turn advance toward the meeting goal? |
| iteration_quality | 0.20 | substance | How well did the agent incorporate feedback and changes? |
| adaptability | 0.15 | style | Did the agent handle unexpected inputs gracefully? |
| presentation_quality | 0.10 | style | Was the output well-formatted and professional? |
| social_quality | 0.05 | style | Was the conversational interaction natural? |
Work Product Quality (Destination) -- scored 1-10 per product, correctness-weighted:
| Dimension | Weight | Description |
|---|---|---|
| correctness | 0.30 | Are the facts, calculations, and data accurate? |
| completeness | 0.25 | Does the product contain all requested components? |
| actionability | 0.20 | Can a professional use this deliverable as-is? |
| professional_quality | 0.15 | Does it meet industry-standard formatting? |
| format_presentation | 0.10 | Is the visual/structural presentation polished? |
Combined Score
    modylbench_score = 0.4 * journey_score + 0.6 * destination_score
Substance over style: context_accuracy and task_progress carry 50% of turn weight. Correctness and completeness carry 55% of product weight. An agent cannot game the score with politeness or formatting alone.
Hard Floor Rules
Turn hard floor: If context_accuracy < 4.0 or task_progress < 4.0 on any turn, that turn's contribution is capped at 4.0 regardless of other dimension scores. A fundamentally broken turn cannot be rescued by charm.
Product hard floor: If correctness < 4.0 on any work product, that product's contribution is capped at 4.0 regardless of other dimension scores. A fundamentally incorrect deliverable cannot be saved by beautiful formatting.
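The weighting and hard-floor rules above can be sketched in a few lines. The weights and caps come straight from the tables; averaging turn scores and product scores before combining them is an assumption about the harness:

```python
TURN_WEIGHTS = {
    "context_accuracy": 0.25, "task_progress": 0.25, "iteration_quality": 0.20,
    "adaptability": 0.15, "presentation_quality": 0.10, "social_quality": 0.05,
}
PRODUCT_WEIGHTS = {
    "correctness": 0.30, "completeness": 0.25, "actionability": 0.20,
    "professional_quality": 0.15, "format_presentation": 0.10,
}

def turn_score(dims):
    """Weighted turn score with the hard floor: a turn with weak
    context_accuracy or task_progress is capped at 4.0."""
    s = sum(TURN_WEIGHTS[k] * dims[k] for k in TURN_WEIGHTS)
    if dims["context_accuracy"] < 4.0 or dims["task_progress"] < 4.0:
        return min(s, 4.0)
    return s

def product_score(dims):
    """Weighted product score with the correctness hard floor."""
    s = sum(PRODUCT_WEIGHTS[k] * dims[k] for k in PRODUCT_WEIGHTS)
    if dims["correctness"] < 4.0:
        return min(s, 4.0)
    return s

def modylbench_score(turn_scores, product_scores):
    """Combined score: 40% journey (mean turn score), 60% destination."""
    journey = sum(turn_scores) / len(turn_scores)
    destination = sum(product_scores) / len(product_scores)
    return 0.4 * journey + 0.6 * destination
```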
Disagreement Detection (Multi-Judge)
When multiple judge models score a dimension:
- Flagged: a standard deviation above 2.0 triggers a disagreement flag
- Pessimistic: a spread (max - min) above 3.0 causes the lowest judge score to be used, avoiding inflated consensus
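A minimal sketch of that aggregation rule. The thresholds match the bullets above; using the mean when judges agree is an assumption:

```python
import statistics

def aggregate_judges(scores, flag_sd=2.0, pessimistic_spread=3.0):
    """Combine one dimension's scores from multiple judge models.

    Returns (score, flagged). High standard deviation raises the
    disagreement flag; a wide spread falls back to the lowest score.
    """
    flagged = len(scores) > 1 and statistics.stdev(scores) > flag_sd
    if max(scores) - min(scores) > pessimistic_spread:
        return min(scores), flagged  # pessimistic: prefer the lowest score
    return statistics.mean(scores), flagged
```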
pass@k Reliability
Reliability is measured using the pass@k formula, assuming independent attempts:

    pass@k = 1 - (1 - p)^k

where p is the empirical pass rate (the fraction of runs meeting the Peer threshold of 6.0) and k is the number of attempts.
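The same formula in code, as a sanity check (names are illustrative):

```python
def pass_at_k(p, k):
    """Probability that at least one of k independent runs passes,
    given an empirical single-run pass rate p."""
    return 1.0 - (1.0 - p) ** k
```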
Quality Tiers (GDPval-inspired)
| Tier | Threshold | Meaning |
|---|---|---|
| Peer | >= 6.0 | Competent colleague -- gets the job done |
| Mentor | >= 7.5 | Senior expert -- insightful, anticipatory |
| Consultant | >= 9.0 | Top-tier advisory -- polished, comprehensive |
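The tier thresholds translate directly into a lookup (the name of the sub-Peer bucket is an assumption; the benchmark only defines the three tiers):

```python
def quality_tier(score):
    """Map a combined ModylBench score to its quality tier."""
    if score >= 9.0:
        return "consultant"
    if score >= 7.5:
        return "mentor"
    if score >= 6.0:
        return "peer"
    return "below_peer"  # hypothetical label for sub-threshold scores
```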
Running the Evaluation
    pip install modylbench  # or clone and install from source

    # Score pre-recorded responses
    python -m modylbench.eval.harness --responses responses.jsonl --output scorecard.json

    # Score with a specific judge model
    python -m modylbench.eval.harness --responses responses.jsonl --judge openai/gpt-4o --output scorecard.json
Response Format
Submit responses as JSONL where each line contains:
    {
      "scenario_id": "financial_analyst_lbo_model",
      "model_id": "your-model-name",
      "turns": [
        {
          "turn_index": 1,
          "agent_response": "I'll build the LBO model. Starting with...",
          "a2ui_surfaces": [],
          "chat_messages": [],
          "latency_ms": 2340,
          "work_products": [
            {
              "output_type": "a2ui-spreadsheet",
              "content": "...",
              "description": "5-year LBO model"
            }
          ]
        }
      ],
      "work_products": [],
      "mutation_trajectory": [
        {
          "turn_index": 12,
          "product_id": "lbo-model",
          "mutation_type": "update_cell",
          "path": "/income_statement/year1/revenue",
          "old_value": null,
          "new_value": 57500000
        }
      ]
    }
Response Fields
| Field | Type | Required | Description |
|---|---|---|---|
| scenario_id | string | yes | Must match a scenario_id from test.jsonl |
| model_id | string | yes | Identifier for the model being evaluated |
| turns | list[Turn] | yes | One entry per agent response turn |
| turns[].turn_index | int | yes | 1-indexed, matching the scenario turn |
| turns[].agent_response | string | yes | Text or audio transcript of the agent's response |
| turns[].a2ui_surfaces | list | no | A2UI surface payloads generated at this turn |
| turns[].chat_messages | list | no | Chat messages sent at this turn |
| turns[].latency_ms | float | no | Response latency in milliseconds |
| turns[].work_products | list | no | Work products produced or updated at this turn |
| work_products | list | no | Final work products (alternative to per-turn) |
| mutation_trajectory | list[Mutation] | no | CRDT-style diff history (see below) |
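A quick pre-submission check of the required fields might look like this (a hypothetical helper, not part of the harness; only the fields marked required above are checked):

```python
REQUIRED_TOP = ("scenario_id", "model_id", "turns")
REQUIRED_TURN = ("turn_index", "agent_response")

def validate_response(resp):
    """Return a list of problems with one response line (empty if OK)."""
    problems = [f"missing field: {k}" for k in REQUIRED_TOP if k not in resp]
    for i, turn in enumerate(resp.get("turns", [])):
        for k in REQUIRED_TURN:
            if k not in turn:
                problems.append(f"turn {i}: missing {k}")
    return problems
```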
Mutation Trajectory (Optional)
The mutation_trajectory field captures *how* the agent evolved work products turn by turn, not just the final deliverable state. This enables fine-grained evaluation of the agent's editing behavior.
Each mutation entry:
    {
      "turn_index": 12,
      "product_id": "lbo-model",
      "mutation_type": "update_cell",
      "path": "/income_statement/year1/revenue",
      "old_value": null,
      "new_value": 57500000
    }
| Field | Type | Description |
|---|---|---|
turn_index | int | Turn at which the mutation occurred |
product_id | string | Stable identifier for the work product |
mutation_type | string | One of: create, update_cell, add_row, delete_row, add_column, delete_column, add_section, delete_section, update_section, update_chart, add_chart_series, remove_chart_series, add_widget, remove_widget, update_widget, reformat, reorder, delete, update_value, add_key, remove_key, add_list_item, remove_list_item |
path | string | RFC 6901 JSON pointer to the changed location (empty for create) |
old_value | any | Previous value (null for creates/additions) |
new_value | any | New value after the mutation |
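Replaying an update-style mutation against a JSON work product is a short pointer walk. Here is a hand-rolled sketch: the `~1`/`~0` escaping follows RFC 6901, while the apply semantics for update_cell are an assumption:

```python
def resolve_parent(doc, path):
    """Walk an RFC 6901 JSON pointer to the parent of the target.

    Returns (container, last_token). Per the spec, "~1" decodes to "/"
    and "~0" decodes to "~".
    """
    tokens = [t.replace("~1", "/").replace("~0", "~") for t in path.split("/")[1:]]
    node = doc
    for t in tokens[:-1]:
        node = node[int(t)] if isinstance(node, list) else node[t]
    return node, tokens[-1]

def apply_mutation(doc, mutation):
    """Apply a simple update-style mutation in place."""
    parent, key = resolve_parent(doc, mutation["path"])
    if isinstance(parent, list):
        parent[int(key)] = mutation["new_value"]
    else:
        parent[key] = mutation["new_value"]

# Replay the example mutation from above against an empty model.
model = {"income_statement": {"year1": {"revenue": None}}}
apply_mutation(model, {
    "path": "/income_statement/year1/revenue",
    "mutation_type": "update_cell",
    "new_value": 57500000,
})
```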
When mutation trajectories are provided, the harness scores them against the scenario's expected_mutations and reports:
- Mutation Efficiency: correct_mutations / total_mutations -- higher means a more direct path
- Convergence Rate: fraction of the scenario remaining after the last mutation -- higher means the product stabilized earlier
- Backtrack Count: times the agent reverted to a prior value
- Unnecessary Mutations: changes reverted within 2 turns (churn)
- Destructive Mutations: overwrites of previously correct values
- Missing Mutations: expected mutations that never appeared
Each (user_utterance, work_product_diff) pair in the trajectory serves as ground truth for training edit-generation models via RL.
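A rough sketch of how a few of these metrics could be computed from a trajectory. Matching expected mutations by (path, new_value) is an assumption; the real harness may match differently:

```python
def mutation_metrics(trajectory, expected):
    """Compute efficiency, backtracks, and missing counts for a trajectory.

    trajectory / expected: lists of dicts with "path" and "new_value".
    repr() keys keep unhashable values (lists, dicts) usable in sets.
    """
    expected_keys = {(m["path"], repr(m["new_value"])) for m in expected}
    observed = {(m["path"], repr(m["new_value"])) for m in trajectory}
    correct = sum(
        (m["path"], repr(m["new_value"])) in expected_keys for m in trajectory
    )
    # Backtracks: the agent reverted a path to a value it had set earlier
    # and then overwritten.
    history, backtracks = {}, 0
    for m in trajectory:
        vals = history.setdefault(m["path"], [])
        if len(vals) >= 2 and m["new_value"] in vals[:-1]:
            backtracks += 1
        vals.append(m["new_value"])
    return {
        "efficiency": correct / len(trajectory) if trajectory else 0.0,
        "backtracks": backtracks,
        "missing": len(expected_keys - observed),
    }
```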
Leaderboard
Submit results via:
    python -m modylbench.eval.submit --scorecard scorecard.json --model "your-model-name"
Results are published to the ModylBench Leaderboard.
Citation
    @misc{modylbench2026,
      title={ModylBench: A Benchmark for AI Meeting Agent Work Product Quality},
      author={Aleatoric Engineering},
      year={2026},
      publisher={HuggingFace},
      url={https://huggingface.co/datasets/use-aleatoric/modylbench},
      note={v2.0: 300 turns across 6 professional verticals with substance-weighted scoring}
    }
License
Apache 2.0
Contact
- Organization: use-aleatoric
- Email: hello@aleatoric.to