A Harbor-native benchmark of 105 tasks where terminal agents must understand audio, video, and image content and turn that evidence into verifier-checkable file artifacts.
Terminals provide a powerful interface for AI agents by exposing diverse tools for automating complex workflows, yet existing terminal-agent benchmarks largely focus on tasks grounded in text, code, and structured files. Many real-world workflows, however, require practitioners to work directly with audio and video files. Working with such multimedia files requires terminal agents not only to understand multimedia content, but also to convert auditory and visual evidence across related files into appropriate actions. To evaluate terminal agents on multimedia-file tasks, we introduce MultiMedia-TerminalBench (MMTB), a benchmark of 105 tasks across 5 meta-categories in which terminal agents operate directly on audio and video files. Alongside MMTB, we propose Terminus-MM, a multimedia harness that extends Terminus-KIRA with audio and video perception for terminal agents. Together, MMTB and Terminus-MM support a controlled study of multimedia terminal agents, revealing how different forms of multimedia access shape task outcomes and determine which evidence agents rely on to construct executable terminal workflows.
Each MMTB task hands the agent a working directory of image, audio, or video files and requires a checkable artifact — a selected file, a timestamped segment, a structured JSON/CSV record, an edit list, or a processed media clip. The expected artifact varies by task; the agent is scored only on the final artifact at the required path, not on its rationale or command trace.
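As an illustration of artifact-only scoring, the following is a minimal sketch of what a verifier for a timestamped-segment task might look like. The JSON keys (`start`, `end`), the tolerance, and the function name are hypothetical; MMTB's actual verifiers are task-specific and are not reproduced here.

```python
import json
from pathlib import Path


def verify_segment(artifact_path, expected_start, expected_end, tolerance_s=0.5):
    """Return True if the artifact at artifact_path holds a matching segment.

    Hypothetical schema: {"start": <seconds>, "end": <seconds>}.
    """
    path = Path(artifact_path)
    if not path.is_file():
        return False  # no artifact at the required path: the task fails
    try:
        record = json.loads(path.read_text(encoding="utf-8"))
        start = float(record["start"])
        end = float(record["end"])
    except (ValueError, KeyError, TypeError):
        return False  # malformed artifacts fail rather than raise
    # Only the artifact's content is compared; no command trace is inspected.
    return (abs(start - expected_start) <= tolerance_s
            and abs(end - expected_end) <= tolerance_s)
```

The check reads nothing but the file at the required path, which mirrors the rule that rationale and command history play no part in the score.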
Construction pipeline. MMTB begins with 163 candidate scenarios, each anchored to a specific public URL documenting a paid practitioner workflow (predominantly Upwork and Fiverr gig listings, with casting calls, practitioner forums, and industry standards documents making up the rest), so the suite captures the multimedia work people are actually paid to do, not synthetic instruction templates. Each scenario is scoped into a concrete task design, scaffolded as a Harbor task, and populated with license-compatible external media or controlled synthetic and derivative assets. The resulting candidate tasks are then filtered through automated checks for task structure, Docker build, media fetching, oracle solvability, and dummy/no-op failure, as well as checks that screen out tasks easily solved by baselines, followed by manual review for trivial shortcuts, unrealistic setups, and redundancy. After filtering, we obtain 105 tasks, each with asset provenance recorded in media.toml, including source descriptions, license information, and content hashes.
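Recorded content hashes make it possible to confirm that fetched media matches what a task was built against. The sketch below shows one way such a check could work. The `media.toml` schema used here (an `[[asset]]` array with `path` and `sha256` keys) and both function names are assumptions for illustration; the actual file layout may differ.

```python
import hashlib
from pathlib import Path

try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # backport with the same API


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large videos are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_hash_mismatches(manifest_path, media_root):
    """Return the asset paths that are missing or do not match their hash.

    Hypothetical schema: [[asset]] tables with `path` and `sha256` keys.
    """
    with open(manifest_path, "rb") as handle:
        manifest = tomllib.load(handle)
    mismatches = []
    for asset in manifest.get("asset", []):
        target = Path(media_root) / asset["path"]
        if not target.is_file() or sha256_of(target) != asset["sha256"]:
            mismatches.append(asset["path"])
    return mismatches
```

An empty return value means every listed asset is present and byte-identical to the recorded version.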
Tasks are scoped so that a typical ffprobe → ffmpeg → frame / ASR → output shell pipeline cannot reach the right answer without genuine cross-modal perception. Per-task duration ranges from a few seconds to 49 min (median 1 min 20 s).
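For readers unfamiliar with the pipeline named above, the following sketch builds the commands of one generic, perception-free version of it: probe the duration, sample frames, and extract mono 16 kHz audio for a speech recognizer. The function name, output file names, and sampling rate are illustrative assumptions, and the ASR step itself is left out because the source does not name a tool for it.

```python
def baseline_pipeline_commands(media_path, work_dir, fps=1):
    """Build (but do not run) the commands of a generic media pipeline."""
    # Print the container duration in seconds, with no extra formatting.
    probe = ["ffprobe", "-v", "error", "-show_entries", "format=duration",
             "-of", "default=noprint_wrappers=1:nokey=1", media_path]
    # Sample still frames at a fixed rate for later inspection or OCR.
    frames = ["ffmpeg", "-i", media_path, "-vf", f"fps={fps}",
              f"{work_dir}/frame_%04d.png"]
    # Drop the video stream and write mono 16 kHz audio for an ASR tool.
    audio = ["ffmpeg", "-i", media_path, "-vn", "-ac", "1", "-ar", "16000",
             f"{work_dir}/audio.wav"]
    return [probe, frames, audio]
```

Each returned list can be passed to `subprocess.run(command, check=True)` on a machine with FFmpeg installed. The pipeline yields durations, frames, and a waveform, but deciding what they mean is exactly the step that MMTB tasks are scoped to require.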
Standalone terminal agents struggle on MMTB: Codex CLI with GPT-5.2 solves only 16.2% of tasks. Adding native multimodal perception tools to the agent harness lifts performance substantially, but not every modality contributes equally: with Gemini 3.1 Pro, image access alone does not improve on the text-only harness (10.5% vs. 12.4% binary success), whereas audio access alone reaches 25.7%. Terminus-MM, the harness that exposes view_image, listen_audio, and watch_video, reaches the highest binary success rate (37.1%) and partial credit (46.9%) on Gemini 3.1 Pro.
| Harness | Backbone | Native tools | Binary ↑ | Partial ↑ |
|---|---|---|---|---|
| Terminus-2 | Gemini-3.1-Pro | text only | 0.124 | 0.162 |
| Terminus-KIRA | Gemini-3.1-Pro | + image | 0.105 | 0.159 |
| Terminus-A | Gemini-3.1-Pro | + audio | 0.257 | 0.349 |
| Terminus-IA | Gemini-3.1-Pro | + image + audio | 0.333 | 0.406 |
| Terminus-IV | Gemini-3.1-Pro | + image + video | 0.333 | 0.432 |
| Terminus-MM | Gemini-3.1-Pro | + image + audio + video | 0.371 | 0.469 |
| Claude Code | Sonnet-4.6 | + image (own) | 0.162 | 0.186 |
| Codex CLI | GPT-5.2 | + image (own) | 0.162 | 0.202 |
Mean over 105 tasks. Binary = verifier-passing artifact match. Partial = task-specific partial-credit score. See the paper for cost, latency, modality-masking, and per-category breakdowns.
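The two reported columns are plain means over tasks. A minimal sketch of that aggregation follows, assuming a hypothetical per-task record with a `passed` flag and a `partial` score in [0, 1]; the record format and function name are not taken from the benchmark.

```python
def aggregate(results):
    """Return (binary_rate, mean_partial) over a list of per-task results.

    Each result is a hypothetical dict:
    {"passed": bool, "partial": float in [0, 1]}.
    """
    if not results:
        raise ValueError("no task results to aggregate")
    n = len(results)
    binary = sum(1 for r in results if r["passed"]) / n
    partial = sum(r["partial"] for r in results) / n
    return binary, partial


# Toy data, not benchmark results.
toy = [
    {"passed": True, "partial": 1.0},
    {"passed": True, "partial": 1.0},
    {"passed": True, "partial": 1.0},
    {"passed": False, "partial": 0.5},
]
print(aggregate(toy))
```

Under the further assumption that each task is run once, a binary rate of 0.371 over 105 tasks corresponds to 39 passing tasks, since 39/105 ≈ 0.371.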
@misc{heo2026mmtbevaluatingterminalagents,
title={MMTB: Evaluating Terminal Agents on Multimedia-File Tasks},
author={Chiyeong Heo and Jaechang Kim and Junhyuk Kwon and Hoyoung Kim and Dongmin Park and Jonghyun Lee and Jungseul Ok},
year={2026},
eprint={2605.10966},
archivePrefix={arXiv},
primaryClass={cs.MM},
url={https://arxiv.org/abs/2605.10966},
}