MultiMedia-TerminalBench:
Evaluating Terminal Agents on Multimedia-File Tasks

A Harbor-native benchmark of 105 tasks where terminal agents must understand audio, video, and image content and turn that evidence into verifier-checkable file artifacts.

Chiyeong Heo1, Jaechang Kim1, Junhyuk Kwon1, Hoyoung Kim3, Dongmin Park4, Jonghyun Lee4, Jungseul Ok1,2,*
1 GSAI, POSTECH  ·  2 CSE, POSTECH  ·  3 National AI Research Lab  ·  4 Krafton AI
* Corresponding author
{chiyeongheo, jungseul.ok}@postech.ac.kr
An example MMTB task and two terminal-agent approaches. The task merges three videos and one audio file into one edited artifact. Agents with native multimodal access read the raw files directly; text-only agents must reach the same evidence through command-line tools (OCR, ASR, motion-energy), adding processing steps that introduce inefficiency and errors.

Abstract

Terminals provide a powerful interface for AI agents by exposing diverse tools for automating complex workflows, yet existing terminal-agent benchmarks largely focus on tasks grounded in text, code, and structured files. However, many real-world workflows require practitioners to work directly with audio and video files. Working with such multimedia files calls for terminal agents not only to understand multimedia content, but also to convert auditory and visual evidence across related files into appropriate actions. To evaluate terminal agents on multimedia-file tasks, we introduce MultiMedia-TerminalBench (MMTB), a benchmark of 105 tasks across 5 meta-categories where terminal agents operate directly on audio and video files. Alongside MMTB, we propose Terminus-MM, a multimedia harness that extends Terminus-KIRA with audio and video perception for terminal agents. Together, MMTB and Terminus-MM support a controlled study of multimedia terminal agents, revealing how different forms of multimedia access shape task outcomes and determine which evidence agents rely on to construct executable terminal workflows.

The benchmark at a glance

105 tasks · 5 meta-categories · 536 media files · 6 h 54 m of total audio-visual content

Each MMTB task hands the agent a working directory of image, audio, or video files and requires a checkable artifact — a selected file, a timestamped segment, a structured JSON/CSV record, an edit list, or a processed media clip. Tasks are scoped so that a typical ffprobe → ffmpeg → frame / ASR → output shell pipeline cannot reach the right answer without genuine cross-modal perception. Per-task duration ranges from a few seconds to 49 min (median 1 m 20 s).
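To make that concrete, below is a minimal sketch of the text-only pipeline such an agent would typically run, driven from Python for illustration. File names are hypothetical, and it assumes ffmpeg/ffprobe and the openai-whisper CLI are installed; the point is that none of these steps alone supplies the cross-modal evidence the tasks demand.

import json
import pathlib
import subprocess

def probe(path):
    # Container / stream metadata via ffprobe, parsed from its JSON output.
    out = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json",
         "-show_format", "-show_streams", path],
        capture_output=True, text=True, check=True).stdout
    return json.loads(out)

def sample_frames(path, fps=0.5, out_dir="frames"):
    # Dump frames at a fixed rate for downstream OCR.
    pathlib.Path(out_dir).mkdir(exist_ok=True)
    subprocess.run(["ffmpeg", "-y", "-i", path, "-vf", f"fps={fps}",
                    f"{out_dir}/%04d.png"], check=True)

def transcribe(path, wav="audio.wav"):
    # Strip audio to 16 kHz mono, then run Whisper ASR to an .srt transcript.
    subprocess.run(["ffmpeg", "-y", "-i", path, "-vn",
                    "-ar", "16000", "-ac", "1", wav], check=True)
    subprocess.run(["whisper", wav, "--model", "small",
                    "--output_format", "srt"], check=True)

meta = probe("clip_01.mp4")           # hypothetical input file
print(meta["format"]["duration"])     # total duration in seconds
sample_frames("clip_01.mp4")
transcribe("clip_01.mp4")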

Every task is anchored to a public URL documenting a paid practitioner workflow — predominantly Upwork and Fiverr gig listings, with casting calls, practitioner forums, and industry standards documents making up the rest. The suite captures the multimedia work people are actually paid to do, not synthetic instruction templates.

Media production: subtitling, broadcast / film, podcast assembly, game-capture review
Performance & coaching: acting takes, language-learner pronunciation, music performance feedback
Enterprise & compliance: meeting / screen-share evidence, business document structuring, audit
Personal & education: user’s own media, lecture chaptering, study notes
Operations & research: ML data annotation, smart-device automation, public-safety audits

Headline results

Standalone terminal agents struggle on MMTB: Codex CLI with GPT-5.2 solves only 16.2% of tasks. Adding native multimodal perception tools to the agent harness lifts performance substantially, but not every modality contributes equally. Terminus-MM, the harness that exposes view_image, listen_audio, and watch_video, reaches the highest binary success rate (37.1%) and partial-credit score (46.9%) with Gemini 3.1 Pro.
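For intuition, here is one way such perception tools could be declared in a harness tool registry. This is a sketch with hypothetical signatures and fields, not the paper's Terminus-MM API; the full comparison across harnesses follows in the table below.

from dataclasses import dataclass

@dataclass
class PerceptionTool:
    # Illustrative record; real harnesses typically ship JSON-schema specs.
    name: str
    description: str
    args: dict

TOOLS = [
    PerceptionTool("view_image",
                   "Return a visual description of an image file.",
                   {"path": "string"}),
    PerceptionTool("listen_audio",
                   "Transcribe and describe a span of an audio file.",
                   {"path": "string", "start_s": "number", "end_s": "number"}),
    PerceptionTool("watch_video",
                   "Describe the frames and audio of a video span.",
                   {"path": "string", "start_s": "number", "end_s": "number"}),
]

for t in TOOLS:
    print(t.name, sorted(t.args))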

Harness        Backbone        Native tools              Binary ↑   Partial ↑
Terminus-2     Gemini-3.1-Pro  text only                 0.124      0.162
Terminus-KIRA  Gemini-3.1-Pro  + image                   0.105      0.159
Terminus-A     Gemini-3.1-Pro  + audio                   0.257      0.349
Terminus-IA    Gemini-3.1-Pro  + image + audio           0.333      0.406
Terminus-IV    Gemini-3.1-Pro  + image + video           0.333      0.432
Terminus-MM    Gemini-3.1-Pro  + image + audio + video   0.371      0.469
Claude Code    Sonnet-4.6      + image (own)             0.162      0.186
Codex CLI      GPT-5.2         + image (own)             0.162      0.202

Mean over 105 tasks. Binary = verifier-passing artifact match. Partial = task-specific partial-credit score. See the paper for cost, latency, modality-masking, and per-category breakdowns.
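In other words, both headline numbers are plain means over the 105 per-task verifier outputs; a small sketch, with hypothetical task names and fields:

from statistics import mean

# Hypothetical verifier outputs: "passed" is the binary artifact match,
# "partial" is the task-specific partial-credit score in [0, 1].
results = [
    {"task": "promo_cut_merge",  "passed": True,  "partial": 1.0},
    {"task": "podcast_chapters", "passed": False, "partial": 0.4},
]

binary  = mean(1.0 if r["passed"] else 0.0 for r in results)
partial = mean(r["partial"] for r in results)
print(f"Binary {binary:.3f} · Partial {partial:.3f}")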

What we find

BibTeX

@misc{mmtb2026,
  title         = {MultiMedia-TerminalBench: A Benchmark for Terminal Agents with Multimedia Perception},
  author        = {Heo, Chiyeong and Kim, Jaechang and Kwon, Junhyuk and Kim, Hoyoung and Park, Dongmin and Lee, Jonghyun and Ok, Jungseul},
  year          = {2026},
  eprint        = {XXXX.XXXXX},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  note          = {arXiv preprint, URL TBA}
}