MultiMedia-TerminalBench:
Evaluating Terminal Agents on Multimedia-File Tasks

A Harbor-native benchmark of 105 tasks where terminal agents must understand audio, video, and image content and turn that evidence into verifier-checkable file artifacts.

Chiyeong Heo1, Jaechang Kim1, Junhyuk Kwon1, Hoyoung Kim3, Dongmin Park4, Jonghyun Lee4, Jungseul Ok1,2,*
1 GSAI, POSTECH  ·  2 CSE, POSTECH  ·  3 National AI Research Lab  ·  4 Krafton AI
* Corresponding author
{chiyeongheo, jungseul.ok}@postech.ac.kr
An example MMTB task and two terminal-agent approaches. The task merges three videos and one audio file into one edited artifact. Agents with native multimodal access read the raw files directly; text-only agents must reach the same evidence through command-line tools (OCR, ASR, motion-energy), adding processing steps that introduce inefficiency and errors.

Abstract

Terminals provide a powerful interface for AI agents by exposing diverse tools for automating complex workflows, yet existing terminal-agent benchmarks largely focus on tasks grounded in text, code, and structured files. However, many real-world workflows require practitioners to work directly with audio and video files. Working with such multimedia files calls for terminal agents not only to understand multimedia content, but also to convert auditory and visual evidence across related files into appropriate actions. To evaluate terminal agents on multimedia-file tasks, we introduce MultiMedia-TerminalBench (MMTB), a benchmark of 105 tasks across 5 meta-categories where terminal agents directly operate with audio and video files. Alongside MMTB, we propose Terminus-MM, a multimedia harness that extends Terminus-KIRA with audio and video perception for terminal agents. Together, MMTB and Terminus-MM support a controlled study of multimedia terminal agents, revealing how different forms of multimedia access shape task outcomes and determine which evidence agents rely on to construct executable terminal workflows.

The benchmark at a glance

105 tasks  ·  5 meta-categories  ·  536 media files  ·  6 h 54 m total audio-visual content

Each MMTB task hands the agent a working directory of image, audio, or video files and requires a checkable artifact — a selected file, a timestamped segment, a structured JSON/CSV record, an edit list, or a processed media clip. The expected artifact varies by task; the agent is scored only on the final artifact at the required path, not on its rationale or command trace.
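
To illustrate artifact-only scoring, here is a minimal sketch of a verifier for one of the artifact types above, a timestamped segment. The file name segment.json, its start/end fields, the 0.5 s tolerance, and the IoU-based partial score are all assumptions made for this example; MMTB's actual verifiers are task-specific and are not reproduced here.

```python
import json
from pathlib import Path


def interval_iou(pred, gold):
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def score_segment(artifact_path, gold, tol=0.5):
    """Score a hypothetical JSON artifact of the form {"start": s, "end": e}.

    Returns (binary, partial). Only the file at artifact_path is read;
    the agent's commands and reasoning never enter the score.
    """
    path = Path(artifact_path)
    if not path.is_file():
        return 0, 0.0  # missing artifact: fail outright
    try:
        data = json.loads(path.read_text())
        pred = (float(data["start"]), float(data["end"]))
    except (ValueError, KeyError, TypeError):
        return 0, 0.0  # malformed artifact: fail outright
    if pred[1] <= pred[0]:
        return 0, 0.0  # empty or reversed segment
    # Binary pass: both boundaries within the (assumed) tolerance.
    binary = int(abs(pred[0] - gold[0]) <= tol and abs(pred[1] - gold[1]) <= tol)
    return binary, interval_iou(pred, gold)
```

Under these assumptions, score_segment("segment.json", gold=(10.0, 20.0)) returns a pass/fail flag plus an overlap score, which mirrors the binary and partial-credit columns reported in the results.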

Construction pipeline. MMTB begins with 163 candidate scenarios, each anchored to a specific public URL documenting a paid practitioner workflow. Most are Upwork and Fiverr gig listings; casting calls, practitioner forums, and industry standards documents make up the rest. The suite therefore captures the multimedia work people are actually paid to do, not synthetic instruction templates. Each scenario is scoped into a concrete task design, scaffolded as a Harbor task, and populated with license-compatible external media or controlled synthetic and derivative assets. The resulting candidate tasks pass through three filters: automated checks for task structure, Docker build, media fetching, oracle solvability, and dummy/no-op failure; baseline checks that flag tasks existing baselines solve too easily; and manual review for trivial shortcuts, unrealistic setups, and redundancy. After filtering, 105 tasks remain, each with asset provenance recorded in media.toml, including source descriptions, license information, and content hashes.

Tasks are scoped so that a typical ffprobe → ffmpeg → frame / ASR → output shell pipeline cannot reach the right answer without genuine cross-modal perception. Per-task duration ranges from a few seconds to 49 min (median 1 m 20 s).
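
For reference, the generic pipeline named above can be sketched as a set of command builders. The ffprobe and ffmpeg flags shown are standard ones; the function names, the 1 fps sampling rate, and the 16 kHz mono audio format are illustrative assumptions, and the ASR step is left out because it depends on the tool chosen.

```python
import subprocess
from pathlib import Path


def probe_duration_cmd(media):
    """ffprobe call that prints only the container duration in seconds."""
    return ["ffprobe", "-v", "error", "-show_entries", "format=duration",
            "-of", "default=noprint_wrappers=1:nokey=1", str(media)]


def sample_frames_cmd(video, out_dir, fps=1):
    """ffmpeg call that writes `fps` frames per second as numbered PNGs."""
    return ["ffmpeg", "-y", "-i", str(video), "-vf", f"fps={fps}",
            str(Path(out_dir) / "frame_%04d.png")]


def extract_audio_cmd(media, wav_path, sample_rate=16000):
    """ffmpeg call that drops video and writes mono audio for an ASR tool."""
    return ["ffmpeg", "-y", "-i", str(media), "-vn", "-ac", "1",
            "-ar", str(sample_rate), str(wav_path)]


def run(cmd):
    """Run one pipeline step and return its stdout; raises on failure."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
```

A text-only agent would chain these steps and then reason over OCR or transcript text. The benchmark is scoped so that this chain by itself does not yield the right answer.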

Media production: subtitling, broadcast / film, podcast assembly, game-capture review
Performance & coaching: acting takes, language-learner pronunciation, music performance feedback
Enterprise & compliance: meeting / screen-share evidence, business document structuring, audit
Personal & education: user’s own media, lecture chaptering, study notes
Operations & research: ML data annotation, smart-device automation, public-safety audits

Headline results

Standalone terminal agents struggle on MMTB: Codex CLI with GPT-5.2 solves only 16.2% of tasks. Adding native multimodal perception tools to the agent harness lifts performance substantially, but not every modality contributes equally: with the same Gemini-3.1-Pro backbone, image access alone (Terminus-KIRA, 10.5%) does not improve on the text-only Terminus-2 (12.4%), whereas audio access alone (Terminus-A) reaches 25.7%. Terminus-MM, the harness that exposes view_image, listen_audio, and watch_video, reaches the highest binary success rate (37.1%) and partial credit (46.9%) with Gemini-3.1-Pro.

Harness         Backbone         Native tools              Binary ↑   Partial ↑
Terminus-2      Gemini-3.1-Pro   text only                 0.124      0.162
Terminus-KIRA   Gemini-3.1-Pro   + image                   0.105      0.159
Terminus-A      Gemini-3.1-Pro   + audio                   0.257      0.349
Terminus-IA     Gemini-3.1-Pro   + image + audio           0.333      0.406
Terminus-IV     Gemini-3.1-Pro   + image + video           0.333      0.432
Terminus-MM     Gemini-3.1-Pro   + image + audio + video   0.371      0.469
Claude Code     Sonnet-4.6       + image (own)             0.162      0.186
Codex CLI       GPT-5.2          + image (own)             0.162      0.202

Mean over 105 tasks. Binary = verifier-passing artifact match. Partial = task-specific partial-credit score. See the paper for cost, latency, modality-masking, and per-category breakdowns.
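
The two metrics can be read as simple per-task averages. The sketch below assumes each task contributes one pass/fail outcome and one partial score in [0, 1]; the record fields are hypothetical, and if the reported rates are averaged over repeated trials, the implied whole-task counts would not apply.

```python
def aggregate(results):
    """Mean binary success and mean partial credit over per-task records.

    Each record is a hypothetical dict: {"passed": bool, "partial": float}.
    """
    n = len(results)
    binary = sum(1 for r in results if r["passed"]) / n
    partial = sum(r["partial"] for r in results) / n
    return round(binary, 3), round(partial, 3)


def implied_passes(binary_rate, n_tasks=105):
    """Whole number of passing tasks consistent with a reported rate."""
    return round(binary_rate * n_tasks)


# Reported binary rates from the table, mapped to implied task counts.
for name, rate in [("Terminus-2", 0.124), ("Terminus-MM", 0.371),
                   ("Codex CLI", 0.162)]:
    print(name, implied_passes(rate), "of 105")
```

Under the single-run assumption, 0.371 corresponds to 39 of 105 tasks and 0.162 to 17 of 105, so the gap between Terminus-MM and the standalone agents is roughly 22 tasks.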

What we find

BibTeX

@misc{heo2026mmtbevaluatingterminalagents,
      title={MMTB: Evaluating Terminal Agents on Multimedia-File Tasks},
      author={Chiyeong Heo and Jaechang Kim and Junhyuk Kwon and Hoyoung Kim and Dongmin Park and Jonghyun Lee and Jungseul Ok},
      year={2026},
      eprint={2605.10966},
      archivePrefix={arXiv},
      primaryClass={cs.MM},
      url={https://arxiv.org/abs/2605.10966},
}