Under Review

Sound Sparks Motion

Audio and Text Tuning for Video Editing

1 University of Cyprus, 2 Simon Fraser University, 3 Tel Aviv University, 4 CYENS Center of Excellence
Given a source video and an edit prompt, our method tunes the audio latent as a conditioning parameter to the LTX video model, such that the desired motion edit is realized in the edited video.

Abstract

Motion-centric video editing remains difficult for large generative video models, which often respond well to appearance changes but struggle to produce specific, localized actions or state transitions in an existing clip. We introduce Sound Sparks Motion, a training-free framework that enables motion editing in an audio-visual video generation model by tuning its internal multimodal conditioning signals at test time.

Rather than modifying model weights, our method tunes only two lightweight variables: an audio latent derived from the source video and a residual perturbation in the text-conditioning. We find that this combination can encourage motion edits that the underlying model often struggles to realize under prompt-only control.

Since there is no direct way to evaluate temporal alignment between text and motion, we guide the tuning process using a vision–language model that provides feedback indicating whether the intended motion appears in the generated video. This simple supervision yields an effective semantic objective for motion editing, while regularization and perceptual-temporal constraints help preserve content and visual quality. Beyond per-video tuning, we show that the learned latent controls are transferable across videos, suggesting that they capture reusable motion-edit directions rather than overfitting to a single example.
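
As a rough illustration only, the objective described above could be written as a sum of a semantic term, a regularizer, and a perceptual-temporal term. The notation here is assumed rather than taken from the paper: G_θ is the frozen generator, x the source video, p the edit prompt, v the original text conditioning, α_0 the audio latent extracted from the source video, and the λ weights and the negative log-likelihood form of the VLM term are hypothetical choices.

```latex
\min_{\alpha,\;\Delta v}\;
  \mathcal{L}_{\mathrm{VLM}}\big(\hat{x},\,p\big)
  \;+\; \lambda_{\mathrm{reg}}\Big(\lVert \alpha-\alpha_0\rVert_2^2 + \lVert \Delta v\rVert_2^2\Big)
  \;+\; \lambda_{\mathrm{pt}}\,\mathcal{L}_{\mathrm{pt}}\big(\hat{x},\,x\big),
\qquad
\hat{x} = G_\theta\big(x;\;\alpha,\;v+\Delta v\big)

% One plausible form of the semantic term (an assumption):
\mathcal{L}_{\mathrm{VLM}}(\hat{x},\,p) \;=\; -\log P_{\mathrm{VLM}}\big(\text{``yes''}\mid \hat{x},\,p\big)
```

Under this reading, the first term rewards videos in which the vision-language model reports the intended motion, the second keeps both tuned variables close to their starting values, and the third penalizes drift in appearance and temporal consistency relative to the source.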

Method

Our method turns a frozen retake-style video editor into a motion-aware editor through test-time latent tuning — no fine-tuning required.

01

Freeze the Model

The weights of the LTX-2 video generation model remain entirely frozen. Gradients pass through the model only to reach the conditioning variables; the model parameters themselves are never updated.

02

Optimize Conditioning

We tune two lightweight variables — an audio latent and a text residual — via gradient descent. Qwen2.5-VL provides a differentiable motion-alignment signal as supervision.

03

Apply via Retake

The optimized conditioning embeddings are injected into the Retake pipeline to regenerate only the target time window, preserving the rest of the video.
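
The third step, regenerating only a target time window while leaving the rest of the clip untouched, can be sketched as a simple splice. This is a minimal illustration under stated assumptions, not the Retake pipeline's actual interface: `retake_window`, the `regenerate` callback, and the string frames are all hypothetical stand-ins.

```python
from typing import Callable, List, Sequence

Frame = str  # hypothetical stand-in for a decoded video frame


def retake_window(
    frames: Sequence[Frame],
    start: int,
    end: int,
    regenerate: Callable[[Sequence[Frame]], Sequence[Frame]],
) -> List[Frame]:
    """Return a copy of `frames` in which only [start, end) is regenerated."""
    if not 0 <= start <= end <= len(frames):
        raise ValueError("window must lie inside the clip")
    new_segment = list(regenerate(frames[start:end]))
    if len(new_segment) != end - start:
        raise ValueError("regenerated segment must keep the window length")
    # Frames outside the window are copied through unchanged.
    return list(frames[:start]) + new_segment + list(frames[end:])


# Toy usage: "regenerate" frames 3..5 of an 8-frame clip.
source = [f"src_{i}" for i in range(8)]
edited = retake_window(
    source, 3, 6, lambda seg: [f"edit_{3 + i}" for i in range(len(seg))]
)
```

In the real system the callback would presumably be the frozen generator driven by the optimized conditioning; the sketch only shows the windowing logic.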

Overview of the Sound Sparks Motion pipeline. The audio latent α and text residual Δv are optimized jointly while all model weights remain frozen.
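
To make the joint optimization concrete, here is a toy sketch in plain Python. Everything in it is an assumption made for illustration: the frozen "model" is a fixed linear map (`W_AUDIO`, `W_TEXT`), the VLM feedback is replaced by a sigmoid "yes" probability on the model's scalar output, and the regularizer is a simple squared distance. Only the audio latent `alpha` and the text residual `delta_v` are updated.

```python
import math
from typing import List, Sequence, Tuple

# Frozen toy "model" weights (hypothetical stand-ins); never updated below.
W_AUDIO = (0.5, -0.3, 0.8)
W_TEXT = (0.2, 0.7, -0.4)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def score(alpha: Sequence[float], text_emb: Sequence[float],
          delta_v: Sequence[float]) -> float:
    """Frozen forward pass producing a scalar 'motion present' logit."""
    audio_term = sum(w * a for w, a in zip(W_AUDIO, alpha))
    text_term = sum(w * (v + d) for w, v, d in zip(W_TEXT, text_emb, delta_v))
    return audio_term + text_term


def loss(alpha, alpha0, text_emb, delta_v, lam: float) -> float:
    """-log p(yes) plus a penalty keeping both variables near their start."""
    reg = sum((a - a0) ** 2 for a, a0 in zip(alpha, alpha0))
    reg += sum(d ** 2 for d in delta_v)
    return -math.log(sigmoid(score(alpha, text_emb, delta_v))) + lam * reg


def tune(alpha0: Sequence[float], text_emb: Sequence[float],
         steps: int = 200, lr: float = 0.1,
         lam: float = 0.05) -> Tuple[List[float], List[float]]:
    alpha = list(alpha0)                # start from the source audio latent
    delta_v = [0.0] * len(text_emb)     # text residual starts at zero
    for _ in range(steps):
        # d(-log sigmoid(s)) / ds = sigmoid(s) - 1
        g = sigmoid(score(alpha, text_emb, delta_v)) - 1.0
        grad_alpha = [g * w + 2 * lam * (a - a0)
                      for w, a, a0 in zip(W_AUDIO, alpha, alpha0)]
        grad_dv = [g * w + 2 * lam * d for w, d in zip(W_TEXT, delta_v)]
        alpha = [a - lr * ga for a, ga in zip(alpha, grad_alpha)]
        delta_v = [d - lr * gd for d, gd in zip(delta_v, grad_dv)]
    return alpha, delta_v


alpha0 = [0.0, 0.0, 0.0]
text_emb = [0.1, -0.2, 0.3]
lam = 0.05
alpha, delta_v = tune(alpha0, text_emb, lam=lam)
loss_before = loss(alpha0, alpha0, text_emb, [0.0] * 3, lam)
loss_after = loss(alpha, alpha0, text_emb, delta_v, lam)
```

In this toy setting the loss after tuning is lower than before and the frozen weights are untouched. The actual method differentiates through a video generator and a vision-language model rather than a linear map, so the sketch conveys the structure of the update, not its scale or difficulty.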

Results

Each example compares a source video with its edited result. Transfer results show learned controls applied to unseen videos.

BibTeX

Citation coming soon

The full citation will be available here once the paper is accepted and published.