105 / 105 tasks
2-speaker-diarized-transcript-from-podcast-audio
Produce a diarized transcript labeling each utterance with its speaker for a 2-person podcast clip
accessibility-sync-audit
Accessibility tester audits a 47 s benefits-portal screen-reader walkthrough and produces a 6-row desync log of screen-reader vs visible-focus mismatches.
adr-edit-detection
Detect ADR replacement time intervals in a narration scene via acoustic continuity
animation-narration-audit
Designer narrates 8 transient UI animations (300-700ms each) on a Helio Studio prototype. 6 narrations correctly describe the visible animation; 2 misdescribe (wrong color, wrong direction). Agent emits 8-row CSV: (claim_idx, described_animation, observed_animation, match).
articulation-deviation-detection
Read a piano score with explicit articulation markings (staccato dots, slurs, accents, tenutos), listen to a recording where some notes are played with the wrong articulation, and produce a feedback.json listing each mismatched note with its expected and played articulation category.
audience-ringtone-detection
Find the recital recording containing an audience cellphone ringtone and sort recordings
audio-visual-dub-detection
Find audio-dub slips in a lecture recording where short audio spans have been replaced by audio from elsewhere in the same talk; requires joint audio-visual reasoning to detect rhythm mismatches between lip motion and heard syllables.
av-desync-detection
Detect which video clips have noticeable audio-video desynchronization
av-desync-offset-repair
Repair a desynced clip so audio and video are aligned
av-identity-leak-detect
Detect cross-channel identity leaks (badge + spoken name/title) in a pre-release marketing clip
av-privacy-exposure
Detect cross-modal PII exposures in an Acme CRM screen recording where reveal-on-click toggles transient visibility, and produce both a pii_flags.csv and an edited.mp4 with audio muted + visual mask over the customer-detail panel during exposure intervals.
b-roll-pool-assignment
Assign each of 10 narration segments to its matching clip from a 30-clip B-roll pool
batch-media-qc-audit
Audit a batch of 10 video delivery bundles against a manifest spec; report which seeded compliance defects each bundle carries.
birthday-money-shot
Cut the singing and candle-blow segments from a birthday-party video
blind-audition-match
Pick the audition candidate whose line-by-line readings most match the director's script directions
blood-test-pdfs-to-csv
Flatten five scanned multi-locale pathology PDFs into a normalised analyte CSV (52 rows, mixed SI/conventional units)
boss-cooldown-cheat-audit
Audit boss-fight ability casts against posted cooldown rules. For each cast, determine if it fired before the ability's cooldown bar refilled (illegal) or after (legal). Joint-AV required: each cast plays a distinctive spell SFX, a visible animation, AND triggers a UI cooldown bar drain — agent must hold a unified A+V state across the clip to track per-ability cooldowns and flag premature casts.
broadcast-package-edit
Broadcast certification-style package edit — agent assembles bars+tone / black / main+music+mosaic+logo into a 25 s 360x240 MP4 with edit log
bug-repro-claim-audit
Support engineer triages a 60s synthetic Acme HR portal bug-repro screen recording. User narrates 6 claims; some match the visible screen sequence, some don't. Agent emits a 6-row CSV: (claim_idx, claimed_event, actual_event, confirmed).
call-center-disclosure-audit
Audit a recorded support call for compliance: spoken disclosures + CRM UI actions
caption-nonspeech-enrichment
Enrich a speech-only SRT with cues for the recording's non-speech audio events
caption-speech-mismatch
Find captions in a 4:23 lecture recording that disagree with the spoken audio. Joint-AV required: must compare visual caption text against audio speech to identify semantic mismatches at known intervals.
chapter-repair
Refine a 3-entry coarse chapter file for a shell-tools lecture into a 7-9 entry fine chapter file aligned to topic + visual transitions
code-review-comment-attribution
Code-review compliance reviewer audits a 73 s GitHub-PR-style screen-share and produces a 4-row attribution log: intended vs committed line, mismatch flag, sentiment.
comping-chord-substitution
Read a piano lead sheet (bass-clef melody with chord symbols above each bar), listen to a comping recording with four wrong chords substituted into the harmonic accompaniment, and produce a feedback.json listing each wrong-chord bar with its expected and played chord names.
constant-hum-attenuation
Attenuate 60/120/180 Hz mains-style hum from a voice recording without damaging speech intelligibility
constant-offset-srt
Correct a constant-offset timing shift on an SRT file by re-anchoring it to the spoken audio
cooking-instruction-alignment
Label the first frame where each narrated cooking event (grasp, shake-end, release) becomes visually true
coop-voice-callout-audit
Audit teammate voice callouts on a coop FPS team-comms channel against the visible game state. For each match, flag callouts that don't match what's visible: false_call, wrong_direction, wrong_state, wrong_attribution. Joint-AV required: each callout names a speaker, an event, a direction; the agent must hold a unified A+V picture across the clip — which teammate is talking (4 distinct voices), the live HUD state (per-teammate HP/ammo/status), the kill-feed history, the minimap layout — to decide whether the call is correct.
creator-voiceover-lipsync-mismatch
Sponsored creator vs voiceover lip-sync mismatch flagging — joint-AV detection of voiceover-during-lip-motion intervals
crm-compliance-audit
Sales-ops compliance reviewer audits a 60 s screen-share of a sales rep on a discovery call inside a Salesforce-style CRM. Produce a 5-row promised-vs-logged audit CSV.
cross-channel-privacy-leak
Flag corporate-comms clips where a moving callout on a diagram element co-occurs with the voiceover naming that same element, AND deliver a redacted MP4 for each leaked clip. Joint-AV required: callout-on-element and audio-naming-element each appear separately throughout but only their precise temporal intersection constitutes a leak.
cursor-deictic-thumbnails
Photo curator reviews 24 thumbnails on a Pixelmine asset library, narrating with deictic-only references ('this one', 'that one', 'the one underneath'). The cursor hovers over the referenced thumbnail at each utterance moment. Agent emits an 8-row CSV: (utterance_idx, thumbnail_id_referenced).
dead-air-removal
Identify mid-sentence dead-air regions to cut from a narration recording while preserving sentence-boundary pauses
debate-attribution
Attribute each of 8 utterances in a 4-speaker panel-debate video to one of the 4 on-screen positions (A/B/C/D). Voices are paired across positions so voice alone is insufficient; lip-sync on the active tile is required for full disambiguation.
deictic-ui-reference
Recovering which on-screen UI element a reviewer's deictic remark referred to
delivery-clip-defect-triage
Multi-defect triage on a delivery clip set — classify 8 short clips into a closed defect set spanning audio, visual, and joint-AV failure modes
design-review-approval-audit
PM audits an 83 s Figma-style design-review screen-share with three voices and produces a 4-row committed-vs-claimed approval audit per frame.
design-review-version-approval
Identify the trustee-recruitment plan agreement moment in a meeting recording (which plan slide + when + verbatim phrase)
dialogue-exchange-match
Pick the 4-turn dialogue take whose per-turn speaker AND emotion arc match the director's brief, with stable emotion within each turn
dub-speaker-mismatch
Detect intervals where heard voice does not match on-screen speaker in a dubbed multi-character scene
emotional-arc-match
Pick the monologue take whose per-sentence emotional arc matches the director's brief
external-mic-sync-repair
Sync an external-mic WAV recording to camera footage with non-trivial drift; produce synced MP4 + drift report
fugal-subject-entry-labeling
Read a four-voice string-quartet fugue score, listen to the recording, and produce a feedback.json listing each bar where the fugue's subject is stated — labeling which voice (violin_1, violin_2, viola, or cello) carries the subject at that entry. Requires recognizing the subject's melodic shape as it migrates through all four voices during the exposition.
game-alert-mismatch
Find clips where a voiced game alert ('combat engaged', 'fuel low') does not match the visible game event; output a bugs.csv with flagged clip_ids and timestamps
game-outcome-qa
Find the 2 of 6 gameplay outcome clips whose banner outcome does not match the played outcome jingle
interview-music-ducking-audit
Family-video music-vs-kid-voice mix audit — flag windows where background music is too hot relative to a child's voice
interview-srt-refine
Refine an interview's auto-generated SRT to broadcast-grade quality
invoice-estimate-pdfs-to-xlsx
Extract fields from 5 invoice + 5 estimate PDFs into a target Excel template, including PO-based estimate↔invoice association
lecture-demo-clip-extract
Locate the on-screen timestamp window for each of 7 labeled slides in a 25-min CC-BY-NC Python conference talk AND quote a verbatim phrase the presenter says while each slide is visible.
lecturer-visual-term-ref
Resolve a lecturer's deictic references to specific terms in on-screen equations
lexical-stress-classification
Per-word lexical-stress correctness classification across 3 L2 English read-aloud recordings; suprasegmental rather than segmental
line-failure-annotation
Flag which line indices of a single monologue take diverged from the director's per-line emotion brief
lipsync-drift-correction
Lip-sync drift correction on single-talker clip — measure audio→video offset via joint mouth motion + audio onset analysis, then re-mux with the inverse offset
long-form-clip-miner
Mine the strongest short-form clip candidates from a ~49-minute developer lecture, requiring hooks that combine the spoken concept with the distinctive on-screen terminal tokens.
mock-call-automation
Write a doorbell-cam automation script that detects when audible footsteps coincide with a visible person entering the camera frame within 2 seconds. Tests the agent's ability to plan a joint audio-visual detection pipeline and write a working, generalising script — not solvable by perceiving sample clips and hardcoding answers, because scoring runs on previously-unseen test clips.
multi-mic-bleed-attribution
Identify cross-mic bleed events on a 4-lavalier panel recording, naming source speaker via diagram
multi-utterance-pronunciation-errors
Detect and characterise per-phone pronunciation errors across 3 L2 English read-aloud recordings; ARPABET phone substitutions / deletions
multicam-active-speaker-cut
Multicam active-speaker cut — given 3 ISO angles + a boom mix, identify per second which camera frames the active speaker and emit a cut list + cut video
musical-mood-shot-pick
Pick which of 5 candidate silent video shots best matches a reference music cue's pacing and mood.
narration-drift-qc
Find the interval where documentary narration stops matching on-screen footage
narration-mars-rover
Find on-screen captions in a 3:09 narrated NASA Mars-rover panorama video that disagree with the spoken narration. Joint-AV required: must compare visual caption text against audio narration to identify semantic mismatches.
narration-music-ducking
Mix narration over background music with proper ducking (audio production task)
narration-visual-align
Find on-screen captions in a 4:14 narrated NASA Juno video that disagree with the spoken narration. Joint-AV required: must compare visual caption text against audio narration to identify semantic mismatches.
near-duplicate-frame-dedup
Cluster 20 lecture-slide screenshots into one canonical frame per distinct slide state
ornament-classification-detection
Read a baroque-style keyboard score with explicit ornament symbols (trill tr, mordent squiggle, turn ~), listen to a recording, and produce a feedback.json listing each note where the ornament played differs from the ornament notated on the score.
page-photo-to-text
Transcribe a marked-up scan of a printed memo, applying the handwritten edits
partial-srt-resync
Resynchronise a drifted SRT after a mid-video cut; produce a corrected WebVTT
phone-level-pronunciation-errors
Detect and characterise per-phone pronunciation errors in an L2 English learner's read-aloud recording (ARPABET phone substitutions / deletions, native-rater gold)
phoneme-confusion-patterns
Identify recurring phoneme-confusion patterns across 4 utterances from one L2 English speaker; aggregate-level pronunciation coaching workflow
piano-practice-feedback
Read a printed piano sheet-music image, listen to a practice recording, and flag wrong-pitch, missed-note, and timing-error mistakes as a structured feedback JSON.
podcast-episode-assembly
Retrofit a mid-roll sponsor spot at an editorially specified topic break in a published 47-minute podcast episode, apply crossfades, and master to broadcast-spec LUFS and true peak.
polyphonic-piano-feedback
Read a polyphonic piano grand-staff sheet-music image (treble + bass), listen to a practice recording with seeded mistakes in both hands, and flag per-hand wrong-pitch / missed-note / timing-error mistakes as structured feedback JSON.
polyrhythm-accuracy-detection
Read a piano score notated with explicit 3:2 polyrhythm (triplet brackets in the right hand over steady quarters in the left), listen to a recording, and produce a feedback.json listing each bar where the right-hand polyrhythm was played sloppily (flattened to a duple rhythm instead of the notated 3-against-2 triplet figure).
pronunciation-error-flagging
Flag nonstandard mispronunciations in a learner's read-aloud recording against a script + closed-set error labels
proof-step-note
Write a focused study note for step 3 of a 4-step Pythagorean proof. Step→lemma binding is audio-only; lemma slides are visually labelled.
prosody-multi-dim-selection
Pick the voice-over take matching the director's 3-D brief (actor gender × emotion × intensity) across 18 same-text takes
prosody-take-selection
Pick the voice-over take whose prosody matches the director's brief among 6 same-text takes
question-statement-intonation
Classify each of 6 short audio clips as a question or statement based on intonation alone (no transcript provided)
quote-clip-retrieval
Find a moment in a 150-second academic lecture excerpt where BOTH a verbal cue is spoken AND a specific visual condition holds, then export that segment as a 3-30s mp4. Joint-AV required — verifier checks both audio (whisper substring) and visual (SSIM vs reference frame) coincidence.
receipt-photo-to-json
Extract vendor, date, total, and currency from a folder of real printed receipts
robotics-demo-command-audit
Audit 12 tabletop robot demos for command-vs-action mismatches across object identity, destination, and ignored corrections
safe-single-cue-keep
Decide which 20-second segments of a 120-second 6-segment Acme CRM tutorial must be removed (cross-modal PII exposure) vs kept (single-cue / mismatch / benign). Same CRM reveal-on-click + 2.2s auto-redact mechanic as T015 but at segment-level EDL granularity.
screenshare-deictic-grounding
Ground multiple spoken deictic decisions to specific cards on a Kanban screen-share recording
semantic-chaptering
Chapter a 23-minute academic lecture into its major sections by identifying the start timestamp of each chapter. Tests precise boundary detection in long-form content using both visual slide cues and verbal section transitions.
semantic-image-retrieval
Rank the top 3 matches for each natural-language query against a controlled 50-image gallery
signal-based-qc-report
Signal-based playback-defect QC on a delivered video
slack-action-extraction
Compliance auditor classifies 6 messages in a Slack channel by both auditor-voice and visible message-state. 6-row CSV output.
speaker-action-attribution
3-person Zoom call with screen-share. Sarah hosts; Mike, Priya, and Sarah herself issue 6 verbal instructions/observations about a CRM dashboard. Sarah executes screen actions in response; some match the speaker's instruction, some don't. Agent emits 6-row CSV: (action_idx, speaker_name, instructed_action, executed_action, match).
speaker-roster-identification
Identify which rostered speakers (from 20 voice exemplars) are present in a mixed audio call
speedrun-input-tamper-detect
Game-engine audio-trigger QA: review 5 paired playtest captures (visual frame log + audio event log on the same engine t=0 clock) of a Godot platformer build, find frames where the in-game SFX (jump / attack / coin pickup / hit) is desynced from the visible player action — either audio_only (orphan SFX, audio system fired without trigger) or visual_only (orphan animation, audio system failed to fire). Joint-AV required at sub-second precision: visual + audio are split into two engine-trace files specifically because the QA workflow exposes the engine's separate audio-event and frame-event logs.
spoken-decision-cell-ref
Log spoken decisions in a quarterly ops review against the spreadsheet cell each decision refers to
spoken-vs-displayed-claim
Find on-screen captions in a 5:00 TEDx lecture that disagree with the spoken audio. Joint-AV required: must compare visual caption text against audio speech to identify semantic mismatches.
sports-broadcast-events
Log the official events (fouls, baskets) from short basketball broadcast clips. Joint-AV required: each official event needs concurring audio (whistle / ball-net) AND visual (ref-signal graphic / scoreboard change) cues; single-channel cues alone (crowd whistles, replay-graphic score blips) are decoys.
stereo-channel-flip-repair
Identify which clips in a 4-clip stereo video batch have their L/R audio channels wired backwards relative to visible source motion, repair only those, and write a batch QC report
stream-alert-ack-audit
Twitch-style live-stream session: streamer plays a game with continuous voice commentary while 8 transient alert overlays slide in (follows, donations, subs, raid, gift). Some are verbally acknowledged (named or unnamed); some are silently ignored. Agent emits 8-row CSV: (alert_idx, alert_type, sender_name, acknowledged).
string-quartet-mistake-attribution
Read a four-staff string quartet score, listen to a rehearsal recording with mistakes seeded in different voices, and produce per-part wrong-pitch / missed / timing-error feedback with the mistake attributed to the correct player (violin_1, violin_2, viola, cello).
take-tone-reaction-pick
Pick the audition take whose voice tone AND facial affect both convey 'angry but restrained' (joint-AV affect congruence)
tempo-drift-detection
Read a piano score with rehearsal letters A-H marking 8 sections, listen to a recording mixed with a steady metronome click track, and produce a feedback.json listing every section where the pianist's tempo drifts away from the click (rushing or dragging by more than 3 BPM). Ignore intentional tempo changes written in the score.
traffic-cam-incident-audit
Audit dispatch radio calls over traffic-cam captures of an intersection. For each clip, flag dispatches that don't match the visible event (false calls, wrong vehicle attribution, wrong action, or late). Joint-AV required: each dispatch line names a vehicle, action, and direction; the agent must hold a unified A+V picture across the clip — light state, vehicle identities, and the dispatch claim — to decide whether the call is correct.
travel-clip-retrieval
Find and select the specific travel video clip that matches a described scene from a folder of mixed clips
tutorial-edit-recreation
Reproduce a Kdenlive screencast tutorial's edits on the same source media — trim, lower-third title, ducking, 9:16 export.
vfr-drift-repair
Measure the progressive audio-video drift in a 60s recording where post-production introduced a smooth A/V offset that worsens over time. Requires joint audio-visual reasoning to compare lip-motion timing against heard syllable timing at multiple points along the timeline.
violin-intonation-detection
Read a solo violin score, listen to a recording where some notes are played with intonation errors (sharp or flat by more than 10 cents), and produce a feedback.json listing each out-of-tune note with its signed error in cents.
warehouse-sku-pack-audit
Cooking-instructor multi-voice attribution audit — joint-AV: which of two simultaneously-dubbed voices (chef vs director) the visible cook followed (REDESIGNED v2 2026-04-27)
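
A few entries above state numeric tolerances: 10 cents in violin-intonation-detection, 3 BPM in tempo-drift-detection, and a 2-second coincidence window in mock-call-automation. The sketch below shows one way those thresholds could be expressed in code. All function names and parameters are hypothetical and chosen for illustration; they are not taken from any task's grader, and the tempo helper compares against the click only, without handling tempo changes written in the score.

```python
import math


def cents_error(played_hz: float, expected_hz: float) -> float:
    """Signed pitch error in cents: positive is sharp, negative is flat."""
    return 1200.0 * math.log2(played_hz / expected_hz)


def is_out_of_tune(played_hz: float, expected_hz: float,
                   tolerance_cents: float = 10.0) -> bool:
    """True when the note is sharp or flat by more than the tolerance."""
    return abs(cents_error(played_hz, expected_hz)) > tolerance_cents


def drifting_sections(section_bpm: dict, click_bpm: float,
                      tolerance_bpm: float = 3.0) -> dict:
    """Map each section beyond the tolerance to 'rushing' or 'dragging'.

    Assumes section_bpm already holds one measured tempo per section.
    """
    flagged = {}
    for section, bpm in section_bpm.items():
        drift = bpm - click_bpm
        if abs(drift) > tolerance_bpm:
            flagged[section] = "rushing" if drift > 0 else "dragging"
    return flagged


def coincident_pairs(audio_times, visual_times, window_s: float = 2.0):
    """Pair each audio event with its nearest visual event within window_s.

    Times are in seconds; the window is treated as inclusive (an assumption).
    """
    pairs = []
    for a in audio_times:
        nearest = min(visual_times, key=lambda v: abs(v - a), default=None)
        if nearest is not None and abs(nearest - a) <= window_s:
            pairs.append((a, nearest))
    return pairs
```

In a real pipeline the inputs (detected pitches, per-section tempi, event onsets) would come from an upstream analysis step, which is the hard part of each task and is not sketched here.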