A Harbor-native benchmark of 105 tasks where terminal agents must understand audio, video, and image content and turn that evidence into verifier-checkable file artifacts.
Terminals provide a powerful interface for AI agents by exposing diverse tools for automating complex workflows, yet existing terminal-agent benchmarks largely focus on tasks grounded in text, code, and structured files. Many real-world workflows, however, require practitioners to work directly with audio and video files. Working with such files calls for terminal agents not only to understand multimedia content, but also to convert auditory and visual evidence scattered across related files into appropriate actions. To evaluate terminal agents on multimedia-file tasks, we introduce MultiMedia-TerminalBench (MMTB), a benchmark of 105 tasks across 5 meta-categories in which terminal agents operate directly on audio and video files. Alongside MMTB, we propose Terminus-MM, a multimedia harness that extends Terminus-KIRA with audio and video perception. Together, MMTB and Terminus-MM support a controlled study of multimedia terminal agents, revealing how different forms of multimedia access shape task outcomes and determine which evidence agents rely on when constructing executable terminal workflows.
Each MMTB task hands the agent a working directory of image, audio, or video files and requires a checkable artifact: a selected file, a timestamped segment, a structured JSON/CSV record, an edit list, or a processed media clip. Tasks are scoped so that a typical `ffprobe` → `ffmpeg` → frame / ASR → output shell pipeline cannot reach the right answer without genuine cross-modal perception. Per-task duration ranges from a few seconds to 49 minutes (median 1 min 20 s).
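For intuition, here is a minimal sketch of the kind of text-proxy pipeline these tasks are scoped against; the file name, timestamp, and the `whisper` CLI (openai-whisper) are illustrative assumptions, not part of the benchmark:

```bash
# Probe metadata, grab one static frame, run ASR: each step yields a text or
# image proxy, and MMTB tasks are scoped so that none of these proxies alone
# pins down the answer without genuine cross-modal perception.
ffprobe -v error -show_entries format=duration -of csv=p=0 clip.mp4  # duration in seconds
ffmpeg -ss 00:00:05 -i clip.mp4 -frames:v 1 frame.png                # one frame at t = 5 s
whisper clip.mp4 --model base --output_format txt                    # transcript via openai-whisper
```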
Every task is anchored to a public URL documenting a paid practitioner workflow — predominantly Upwork and Fiverr gig listings, with casting calls, practitioner forums, and industry standards documents making up the rest. The suite captures the multimedia work people are actually paid to do, not synthetic instruction templates.
Standalone terminal agents struggle on MMTB: Codex CLI with GPT-5.2 solves only 16.2% of tasks. Adding native multimodal perception tools to the agent harness lifts performance substantially, but not every modality contributes equally. Terminus-MM, the harness that exposes `view_image`, `listen_audio`, and `watch_video`, reaches the highest binary success rate (37.1%) and partial-credit score (46.9%) with Gemini-3.1-Pro.
| Harness | Backbone | Native tools | Binary ↑ | Partial ↑ |
|---|---|---|---|---|
| Terminus-2 | Gemini-3.1-Pro | text only | 0.124 | 0.162 |
| Terminus-KIRA | Gemini-3.1-Pro | + image | 0.105 | 0.159 |
| Terminus-A | Gemini-3.1-Pro | + audio | 0.257 | 0.349 |
| Terminus-IA | Gemini-3.1-Pro | + image + audio | 0.333 | 0.406 |
| Terminus-IV | Gemini-3.1-Pro | + image + video | 0.333 | 0.432 |
| Terminus-MM | Gemini-3.1-Pro | + image + audio + video | 0.371 | 0.469 |
| Claude Code | Sonnet-4.6 | + image (own) | 0.162 | 0.186 |
| Codex CLI | GPT-5.2 | + image (own) | 0.162 | 0.202 |
Mean over 105 tasks. Binary = verifier-passing artifact match. Partial = task-specific partial-credit score. See the paper for cost, latency, modality-masking, and per-category breakdowns.
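To make the two metrics concrete, here is a hedged sketch of how a binary and a partial check can differ, assuming a hypothetical task whose artifact is a CSV of rows to recover (file names and gold format are illustrative, not taken from the released harness):

```bash
# Binary: pass only if the artifact matches the gold reference exactly.
if diff -q out/answer.csv gold/answer.csv >/dev/null; then echo "binary: 1"; else echo "binary: 0"; fi

# Partial: task-specific credit, here the fraction of gold rows the agent recovered.
hits=$(grep -cFxf out/answer.csv gold/answer.csv)   # gold rows that also appear in the answer
total=$(wc -l < gold/answer.csv)
awk -v h="$hits" -v t="$total" 'BEGIN { printf "partial: %.3f\n", (t > 0 ? h / t : 0) }'
```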
`view_image` alone (Terminus-KIRA, 0.105) does not improve over the text-only baseline (Terminus-2, 0.124) on Gemini-3.1-Pro: static frames rarely capture the cue an MMTB task turns on. `listen_audio` alone more than doubles binary success (0.124 → 0.257): over half of MMTB's tasks hinge on audio cues that no spectrogram-or-frame proxy reliably preserves.

```bibtex
@misc{mmtb2026,
title = {MultiMedia-TerminalBench: A Benchmark for Terminal Agents with Multimedia Perception},
author = {Heo, Chiyeong and Kim, Jaechang and Kwon, Junhyuk and Kim, Hoyoung and Park, Dongmin and Lee, Jonghyun and Ok, Jungseul},
year = {2026},
eprint = {XXXX.XXXXX},
archivePrefix = {arXiv},
primaryClass = {cs.CL},
note = {arXiv preprint, URL TBA}
}
```