A Harbor-native benchmark of 105 tasks where terminal agents must understand audio, video, and image content and turn that evidence into verifier-checkable file artifacts.
Terminals provide a powerful interface for AI agents by exposing diverse tools for automating complex workflows, yet existing terminal-agent benchmarks largely focus on tasks grounded in text, code, and structured files. Many real-world workflows, however, require practitioners to work directly with audio and video files. Working with such files calls for terminal agents not only to understand multimedia content, but also to convert auditory and visual evidence scattered across related files into appropriate actions. To evaluate terminal agents on multimedia-file tasks, we introduce MultiMedia-TerminalBench (MMTB), a benchmark of 105 tasks across 5 meta-categories in which terminal agents operate directly on audio and video files. Alongside MMTB, we propose Terminus-MM, a multimedia harness that extends Terminus-KIRA with audio and video perception. Together, MMTB and Terminus-MM support a controlled study of multimedia terminal agents, revealing how different forms of multimedia access shape task outcomes and determine which evidence agents rely on when constructing executable terminal workflows.
Each MMTB task hands the agent a working directory of image, audio, or video files and requires a checkable artifact: a selected file, a timestamped segment, a structured JSON/CSV record, an edit list, or a processed media clip. Tasks are scoped so that a typical `ffprobe` → `ffmpeg` → frame / ASR → output shell pipeline cannot reach the right answer without genuine cross-modal perception. Per-task duration ranges from a few seconds to 49 minutes (median 1 min 20 s).
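For intuition, here is a minimal sketch of the kind of text-proxy pipeline these tasks are scoped against; the file name, timestamp, and the `whisper` CLI (openai-whisper) are illustrative assumptions, not part of the benchmark:

```bash
# Probe metadata, grab one static frame, run ASR: each step yields a text or
# image proxy, and MMTB tasks are scoped so that none of these proxies alone
# pins down the answer without genuine cross-modal perception.
ffprobe -v error -show_entries format=duration -of csv=p=0 clip.mp4  # duration in seconds
ffmpeg -ss 00:00:05 -i clip.mp4 -frames:v 1 frame.png                # one frame at t = 5 s
whisper clip.mp4 --model base --output_format txt                    # transcript via openai-whisper
```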
Every task is anchored to a public URL documenting a paid practitioner workflow — predominantly Upwork and Fiverr gig listings, with casting calls, practitioner forums, and industry standards documents making up the rest. The suite captures the multimedia work people are actually paid to do, not synthetic instruction templates.
Standalone terminal agents struggle on MMTB: Codex CLI with GPT-5.2 solves only 16.2% of tasks. Adding native multimodal perception tools to the agent harness lifts performance substantially, but not every modality contributes equally. Terminus-MM, the harness that exposes `view_image`, `listen_audio`, and `watch_video`, reaches the highest binary success rate (37.1%) and partial-credit score (46.9%) with Gemini-3.1-Pro.
| Harness | Backbone | Native tools | Binary ↑ | Partial ↑ |
|---|---|---|---|---|
| Terminus-2 | Gemini-3.1-Pro | text only | 0.124 | 0.162 |
| Terminus-KIRA | Gemini-3.1-Pro | + image | 0.105 | 0.159 |
| Terminus-A | Gemini-3.1-Pro | + audio | 0.257 | 0.349 |
| Terminus-IA | Gemini-3.1-Pro | + image + audio | 0.333 | 0.406 |
| Terminus-IV | Gemini-3.1-Pro | + image + video | 0.333 | 0.432 |
| Terminus-MM | Gemini-3.1-Pro | + image + audio + video | 0.371 | 0.469 |
| Claude Code | Sonnet-4.6 | + image (own) | 0.162 | 0.186 |
| Codex CLI | GPT-5.2 | + image (own) | 0.162 | 0.202 |
Mean over 105 tasks. Binary = verifier-passing artifact match. Partial = task-specific partial-credit score. See the paper for cost, latency, modality-masking, and per-category breakdowns.
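To make the two metrics concrete, here is a hedged sketch of how a binary and a partial check can differ, assuming a hypothetical task whose artifact is a CSV of rows to recover (file names and gold format are illustrative, not taken from the released harness):

```bash
# Binary: pass only if the artifact matches the gold reference exactly.
if diff -q out/answer.csv gold/answer.csv >/dev/null; then echo "binary: 1"; else echo "binary: 0"; fi

# Partial: task-specific credit, here the fraction of gold rows the agent recovered.
hits=$(grep -cFxf out/answer.csv gold/answer.csv)   # gold rows that also appear in the answer
total=$(wc -l < gold/answer.csv)
awk -v h="$hits" -v t="$total" 'BEGIN { printf "partial: %.3f\n", (t > 0 ? h / t : 0) }'
```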
`view_image` alone (Terminus-KIRA, 0.105) does not improve over the text-only baseline (Terminus-2, 0.124) on Gemini-3.1-Pro: static frames rarely capture the cue an MMTB task turns on. `listen_audio` alone more than doubles binary success (0.124 → 0.257): over half of MMTB's tasks hinge on audio cues that no spectrogram-or-frame proxy reliably preserves.

```bibtex
@misc{mmtb2026,
title = {MultiMedia-TerminalBench: A Benchmark for Terminal Agents with Multimedia Perception},
author = {Heo, Chiyeong and Kim, Jaechang and Kwon, Junhyuk and Kim, Hoyoung and Park, Dongmin and Lee, Jonghyun and Ok, Jungseul},
year = {2026},
eprint = {XXXX.XXXXX},
archivePrefix = {arXiv},
primaryClass = {cs.CL},
note = {arXiv preprint, URL TBA}
}
```