modylbench
v2.0 — 300 turns, 29 edge cases, CRDT mutation tracking. the benchmark for agents that do real work.
the short version
**tldr:** v2.0. 300 turns across 6 professional verticals (financial analyst, researcher, strategist, optimizer, analyst, scientist). 48-52 turns per scenario with natural conversation flow — clarification loops, mid-meeting tangents, correction handling. 29 adversarial edge cases. substance over style: you can't game the score with politeness. hard floor rules on turns AND products. multi-judge disagreement detection. CRDT-style mutation tracking so we can see exactly how the agent built the spreadsheet, not just the final result. calibrated against 11,903 real meetings.
this is a working draft. there are probably errors. we're fixing them. bear with us.
ModylBench
A benchmark for evaluating AI agents that participate in professional meetings as first-class collaborators.
ModylBench measures whether an AI agent can join a live video meeting, understand complex professional context, contribute meaningfully across multiple conversational turns, and produce high-quality work products (spreadsheets, charts, documents, dashboards, code) -- not just generate plausible text.
Dataset Description
Overview
ModylBench v2.0 is a structured evaluation benchmark covering 6 professional verticals with 300 total turns and 29 adversarial edge cases where AI meeting agents must demonstrate domain expertise through multi-turn dialogue and tangible deliverables:
| Vertical | Scenario | Quality Tier | Turns | Edge Cases |
|---|---|---|---|---|
| Financial Analyst | CloudSync LBO Model | Consultant | 52 | 5 |
| Deep Researcher | Solid-State Battery Intelligence Briefing | Mentor | 48 | 5 |
| Business Strategist | SEA Telehealth Market Entry Strategy | Consultant | 50 | 5 |
| Optimization Solver | Q4 Supply Chain Distribution Optimization | Peer | 48 | 4 |
| Business Analyst | Q3 Pipeline Conversion Rate Diagnosis | Peer | 50 | 5 |
| Scientist | Phase II Hypertension Trial Statistical Analysis | Consultant | 52 | 5 |
Each scenario is a complete meeting simulation with:
- A human persona who drives the conversation
- 48-52 scripted turns progressing through context, work, edge cases, and delivery phases
- 2-3 expected work products with programmatic verification criteria
- 4-5 adversarial edge cases testing robustness and domain knowledge
- Natural conversational flow with clarification loops, mid-meeting tangents, correction handling, and iterative refinement
Calibration
Scenario structure is calibrated against a survey of 11,903 real professional meetings spanning diverse industries and team sizes.
Key observations from the survey:
- Median meeting duration: ~28 minutes
- Median speaker turns per meeting: ~120
- Median inter-turn gap: 0.4 seconds
- 76% of meetings jump directly into content (no small talk)
Supported Tasks
Primary: Meeting Agent Evaluation
Given a scenario with human utterances, evaluate the agent's ability to:
- Understand context -- Parse domain-specific information accurately
- Make progress -- Advance toward the meeting goal with each turn
- Produce deliverables -- Generate correct, complete work products
- Handle curveballs -- Adapt when requirements change mid-conversation
- Maintain quality -- Meet the target tier (Peer/Mentor/Consultant)
Secondary: Work Product Quality Assessment
Evaluate the quality of agent-produced artifacts (spreadsheets, charts, documents, dashboards, presentations, code, network graphs) against verification criteria.
Dataset Structure
Data Splits
| Split | Examples | Description |
|---|---|---|
| test | 6 | Standard evaluation scenarios (one per vertical, 300 total turns) |
| test_hard | 29 | Adversarial edge cases extracted from all scenarios |
Supplementary Files
| File | Description |
|---|---|
| data/rubrics.json | Scoring rubrics with dimension weights, hard floors, and tier thresholds |
| data/verification.json | Programmatic verification criteria per scenario |
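The splits and supplementary files are plain JSON/JSONL (test.jsonl and test_hard.jsonl, as described below), so they can be inspected with nothing but the standard library:

```python
import json

def load_jsonl(path):
    """Read one JSON object per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# scenarios = load_jsonl("test.jsonl")        # 6 scenarios, one per vertical
# edge_cases = load_jsonl("test_hard.jsonl")  # 29 adversarial edge cases
```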
Data Fields
Scenario Fields (test.jsonl)
| Field | Type | Description |
|---|---|---|
| scenario_id | string | Unique identifier (e.g., financial_analyst_lbo_model) |
| vertical | string | Professional vertical (financial_analyst, deep_researcher, business_strategist, optimization_solver, business_analyst, scientist) |
| title | string | Human-readable scenario title |
| human_persona | string | Role of the simulated human participant |
| context | string | Background situation description |
| meeting_goal | string | What the meeting should accomplish |
| turns | list[Turn] | Scripted human turns with expected agent behavior |
| expected_outputs | list[Output] | Deliverables the scenario should produce |
| edge_cases | list[EdgeCase] | Adversarial curveball questions |
| verification | object | Automated verification specification |
| quality_tier | string | Target quality tier (peer / mentor / consultant) |
| timeout_minutes | float | Maximum scenario execution time |
| expected_mutations | list[Mutation] | Expected CRDT-style mutations per turn (for mutation trajectory scoring) |
| metadata | object | Calibration data and version information |
Turn Fields
| Field | Type | Description |
|---|---|---|
| turn_index | int | 1-indexed position in the conversation |
| human_utterance | string | What the human says |
| expected_agent_action | string | What the agent should do in response |
| phase | string | Scenario phase (context / work / edge_case / delivery) |
| channel | string | Communication channel (audio / chat / a2ui / data / screen_share) |
| expected_response_type | string | Expected response category (acknowledgment / question / deliverable / iteration / clarification / correction) |
| wait_for_agent_sec | float | Maximum wait time for agent response |
Edge Case Fields (test_hard.jsonl)
| Field | Type | Description |
|---|---|---|
| edge_case_id | string | Unique identifier |
| source_scenario_id | string | Parent scenario |
| vertical | string | Professional vertical |
| name | string | Short name for the edge case |
| description | string | What is being tested |
| human_utterance | string | The adversarial input |
| expected_behavior | string | What a correct agent should do |
| severity | string | Impact level (low / medium / high / critical) |
| preceding_context | object | Scenario context needed to evaluate the edge case |
Evaluation Protocol
Scoring Dimensions
Turn Quality (Journey) -- scored 1-10 per turn, substance-weighted (70/30):
| Dimension | Weight | Cluster | Description |
|---|---|---|---|
| context_accuracy | 0.25 | substance | Did the agent correctly understand the domain context? |
| task_progress | 0.25 | substance | Did this turn advance toward the meeting goal? |
| iteration_quality | 0.20 | substance | How well did the agent incorporate feedback and changes? |
| adaptability | 0.15 | style | Did the agent handle unexpected inputs gracefully? |
| presentation_quality | 0.10 | style | Was the output well-formatted and professional? |
| social_quality | 0.05 | style | Was the conversational interaction natural? |
Work Product Quality (Destination) -- scored 1-10 per product, correctness-weighted:
| Dimension | Weight | Description |
|---|---|---|
| correctness | 0.30 | Are the facts, calculations, and data accurate? |
| completeness | 0.25 | Does the product contain all requested components? |
| actionability | 0.20 | Can a professional use this deliverable as-is? |
| professional_quality | 0.15 | Does it meet industry-standard formatting? |
| format_presentation | 0.10 | Is the visual/structural presentation polished? |
Combined Score
    modylbench_score = 0.4 * journey_score + 0.6 * destination_score
Substance over style: context_accuracy and task_progress carry 50% of turn weight. Correctness and completeness carry 55% of product weight. An agent cannot game the score with politeness or formatting alone.
Hard Floor Rules
Turn hard floor: If context_accuracy < 4.0 or task_progress < 4.0 on any turn, that turn's contribution is capped at 4.0 regardless of other dimension scores. A fundamentally broken turn cannot be rescued by charm.
Product hard floor: If correctness < 4.0 on any work product, that product's contribution is capped at 4.0 regardless of other dimension scores. A fundamentally incorrect deliverable cannot be saved by beautiful formatting.
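The weighting and hard-floor rules above can be sketched in a few lines. The weights and caps come straight from the tables; averaging turn scores and product scores before combining them is an assumption about the harness:

```python
TURN_WEIGHTS = {
    "context_accuracy": 0.25, "task_progress": 0.25, "iteration_quality": 0.20,
    "adaptability": 0.15, "presentation_quality": 0.10, "social_quality": 0.05,
}
PRODUCT_WEIGHTS = {
    "correctness": 0.30, "completeness": 0.25, "actionability": 0.20,
    "professional_quality": 0.15, "format_presentation": 0.10,
}

def turn_score(dims):
    """Weighted turn score with the hard floor: a turn with weak
    context_accuracy or task_progress is capped at 4.0."""
    s = sum(TURN_WEIGHTS[k] * dims[k] for k in TURN_WEIGHTS)
    if dims["context_accuracy"] < 4.0 or dims["task_progress"] < 4.0:
        return min(s, 4.0)
    return s

def product_score(dims):
    """Weighted product score with the correctness hard floor."""
    s = sum(PRODUCT_WEIGHTS[k] * dims[k] for k in PRODUCT_WEIGHTS)
    if dims["correctness"] < 4.0:
        return min(s, 4.0)
    return s

def modylbench_score(turn_scores, product_scores):
    """Combined score: 40% journey (mean turn score), 60% destination."""
    journey = sum(turn_scores) / len(turn_scores)
    destination = sum(product_scores) / len(product_scores)
    return 0.4 * journey + 0.6 * destination
```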
Disagreement Detection (Multi-Judge)
When multiple judge models score a dimension:
- Flagged: a standard deviation above 2.0 triggers a disagreement flag
- Pessimistic: a spread (max - min) above 3.0 causes the lowest judge score to be used, avoiding inflated consensus
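A minimal sketch of that aggregation rule. The thresholds match the bullets above; using the mean when judges agree is an assumption:

```python
import statistics

def aggregate_judges(scores, flag_sd=2.0, pessimistic_spread=3.0):
    """Combine one dimension's scores from multiple judge models.

    Returns (score, flagged). High standard deviation raises the
    disagreement flag; a wide spread falls back to the lowest score.
    """
    flagged = len(scores) > 1 and statistics.stdev(scores) > flag_sd
    if max(scores) - min(scores) > pessimistic_spread:
        return min(scores), flagged  # pessimistic: prefer the lowest score
    return statistics.mean(scores), flagged
```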
pass@k Reliability
Reliability is measured using the pass@k formula, assuming independent attempts:

    pass@k = 1 - (1 - p)^k

where p is the empirical pass rate (the fraction of runs meeting the Peer threshold of 6.0) and k is the number of attempts.
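The same formula in code, as a sanity check (names are illustrative):

```python
def pass_at_k(p, k):
    """Probability that at least one of k independent runs passes,
    given an empirical single-run pass rate p."""
    return 1.0 - (1.0 - p) ** k
```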
Quality Tiers (GDPval-inspired)
| Tier | Threshold | Meaning |
|---|---|---|
| Peer | >= 6.0 | Competent colleague -- gets the job done |
| Mentor | >= 7.5 | Senior expert -- insightful, anticipatory |
| Consultant | >= 9.0 | Top-tier advisory -- polished, comprehensive |
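The tier thresholds translate directly into a lookup (the name of the sub-Peer bucket is an assumption; the benchmark only defines the three tiers):

```python
def quality_tier(score):
    """Map a combined ModylBench score to its quality tier."""
    if score >= 9.0:
        return "consultant"
    if score >= 7.5:
        return "mentor"
    if score >= 6.0:
        return "peer"
    return "below_peer"  # hypothetical label for sub-threshold scores
```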
Running the Evaluation
    pip install modylbench  # or clone and install from source

    # Score pre-recorded responses
    python -m modylbench.eval.harness --responses responses.jsonl --output scorecard.json

    # Score with a specific judge model
    python -m modylbench.eval.harness --responses responses.jsonl --judge openai/gpt-4o --output scorecard.json
Response Format
Submit responses as JSONL where each line contains:
    {
      "scenario_id": "financial_analyst_lbo_model",
      "model_id": "your-model-name",
      "turns": [
        {
          "turn_index": 1,
          "agent_response": "I'll build the LBO model. Starting with...",
          "a2ui_surfaces": [],
          "chat_messages": [],
          "latency_ms": 2340,
          "work_products": [
            {
              "output_type": "a2ui-spreadsheet",
              "content": "...",
              "description": "5-year LBO model"
            }
          ]
        }
      ],
      "work_products": [],
      "mutation_trajectory": [
        {
          "turn_index": 12,
          "product_id": "lbo-model",
          "mutation_type": "update_cell",
          "path": "/income_statement/year1/revenue",
          "old_value": null,
          "new_value": 57500000
        }
      ]
    }
Response Fields
| Field | Type | Required | Description |
|---|---|---|---|
| scenario_id | string | yes | Must match a scenario_id from test.jsonl |
| model_id | string | yes | Identifier for the model being evaluated |
| turns | list[Turn] | yes | One entry per agent response turn |
| turns[].turn_index | int | yes | 1-indexed, matching the scenario turn |
| turns[].agent_response | string | yes | Text or audio transcript of the agent's response |
| turns[].a2ui_surfaces | list | no | A2UI surface payloads generated at this turn |
| turns[].chat_messages | list | no | Chat messages sent at this turn |
| turns[].latency_ms | float | no | Response latency in milliseconds |
| turns[].work_products | list | no | Work products produced or updated at this turn |
| work_products | list | no | Final work products (alternative to per-turn) |
| mutation_trajectory | list[Mutation] | no | CRDT-style diff history (see below) |
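A quick pre-submission check of the required fields might look like this (a hypothetical helper, not part of the harness; only the fields marked required above are checked):

```python
REQUIRED_TOP = ("scenario_id", "model_id", "turns")
REQUIRED_TURN = ("turn_index", "agent_response")

def validate_response(resp):
    """Return a list of problems with one response line (empty if OK)."""
    problems = [f"missing field: {k}" for k in REQUIRED_TOP if k not in resp]
    for i, turn in enumerate(resp.get("turns", [])):
        for k in REQUIRED_TURN:
            if k not in turn:
                problems.append(f"turn {i}: missing {k}")
    return problems
```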
Mutation Trajectory (Optional)
The mutation_trajectory field captures *how* the agent evolved work products turn by turn, not just the final deliverable state. This enables fine-grained evaluation of the agent's editing behavior.
Each mutation entry:
    {
      "turn_index": 12,
      "product_id": "lbo-model",
      "mutation_type": "update_cell",
      "path": "/income_statement/year1/revenue",
      "old_value": null,
      "new_value": 57500000
    }
| Field | Type | Description |
|---|---|---|
turn_index | int | Turn at which the mutation occurred |
product_id | string | Stable identifier for the work product |
mutation_type | string | One of: create, update_cell, add_row, delete_row, add_column, delete_column, add_section, delete_section, update_section, update_chart, add_chart_series, remove_chart_series, add_widget, remove_widget, update_widget, reformat, reorder, delete, update_value, add_key, remove_key, add_list_item, remove_list_item |
path | string | RFC 6901 JSON pointer to the changed location (empty for create) |
old_value | any | Previous value (null for creates/additions) |
new_value | any | New value after the mutation |
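Replaying an update-style mutation against a JSON work product is a short pointer walk. Here is a hand-rolled sketch: the `~1`/`~0` escaping follows RFC 6901, while the apply semantics for update_cell are an assumption:

```python
def resolve_parent(doc, path):
    """Walk an RFC 6901 JSON pointer to the parent of the target.

    Returns (container, last_token). Per the spec, "~1" decodes to "/"
    and "~0" decodes to "~".
    """
    tokens = [t.replace("~1", "/").replace("~0", "~") for t in path.split("/")[1:]]
    node = doc
    for t in tokens[:-1]:
        node = node[int(t)] if isinstance(node, list) else node[t]
    return node, tokens[-1]

def apply_mutation(doc, mutation):
    """Apply a simple update-style mutation in place."""
    parent, key = resolve_parent(doc, mutation["path"])
    if isinstance(parent, list):
        parent[int(key)] = mutation["new_value"]
    else:
        parent[key] = mutation["new_value"]

# Replay the example mutation from above against an empty model.
model = {"income_statement": {"year1": {"revenue": None}}}
apply_mutation(model, {
    "path": "/income_statement/year1/revenue",
    "mutation_type": "update_cell",
    "new_value": 57500000,
})
```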
When mutation trajectories are provided, the harness scores them against the scenario's expected_mutations and reports:
- Mutation Efficiency: correct_mutations / total_mutations -- higher means a more direct path
- Convergence Rate: fraction of the scenario remaining after the last mutation -- higher means the product stabilized earlier
- Backtrack Count: times the agent reverted to a prior value
- Unnecessary Mutations: changes reverted within 2 turns (churn)
- Destructive Mutations: overwrites of previously correct values
- Missing Mutations: expected mutations that never appeared
Each (user_utterance, work_product_diff) pair in the trajectory serves as ground truth for training edit-generation models via RL.
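A rough sketch of how a few of these metrics could be computed from a trajectory. Matching expected mutations by (path, new_value) is an assumption; the real harness may match differently:

```python
def mutation_metrics(trajectory, expected):
    """Compute efficiency, backtracks, and missing counts for a trajectory.

    trajectory / expected: lists of dicts with "path" and "new_value".
    repr() keys keep unhashable values (lists, dicts) usable in sets.
    """
    expected_keys = {(m["path"], repr(m["new_value"])) for m in expected}
    observed = {(m["path"], repr(m["new_value"])) for m in trajectory}
    correct = sum(
        (m["path"], repr(m["new_value"])) in expected_keys for m in trajectory
    )
    # Backtracks: the agent reverted a path to a value it had set earlier
    # and then overwritten.
    history, backtracks = {}, 0
    for m in trajectory:
        vals = history.setdefault(m["path"], [])
        if len(vals) >= 2 and m["new_value"] in vals[:-1]:
            backtracks += 1
        vals.append(m["new_value"])
    return {
        "efficiency": correct / len(trajectory) if trajectory else 0.0,
        "backtracks": backtracks,
        "missing": len(expected_keys - observed),
    }
```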
Leaderboard
Submit results via:
    python -m modylbench.eval.submit --scorecard scorecard.json --model "your-model-name"
Results are published to the ModylBench Leaderboard.
Citation
    @misc{modylbench2026,
      title={ModylBench: A Benchmark for AI Meeting Agent Work Product Quality},
      author={Aleatoric Engineering},
      year={2026},
      publisher={HuggingFace},
      url={https://huggingface.co/datasets/use-aleatoric/modylbench},
      note={v2.0: 300 turns across 6 professional verticals with substance-weighted scoring}
    }
License
Apache 2.0
Contact
- Organization: use-aleatoric
- Email: hello@aleatoric.to