
Highlighted at CVPR 2025: Google DeepMind’s “Motion Prompting” Paper Unlocks Granular Video Control

Key points:

  • Researchers at Google DeepMind, the University of Michigan, and Brown University have developed “Motion Prompting,” a new way to control video generation using specific motion trajectories.
  • The technique uses “motion prompts,” a flexible representation of motion that can be either sparse or dense, to guide a pre-trained video diffusion model.
  • A key innovation is “motion prompt expansion,” which translates high-level user requests (such as mouse drags) into the detailed motion instructions the model requires.
  • A single, unified model can perform a wide variety of tasks, including precise object and camera control, motion transfer from one video to another, and interactive image editing, without retraining for each specific function.

As generative AI continues to evolve, precise control over video creation remains a key barrier to its widespread adoption in markets such as advertising, filmmaking, and interactive entertainment. While text prompts are the primary method of control, they fall short at specifying the nuanced, dynamic movements that make video compelling. A new paper from Google DeepMind, the University of Michigan, and Brown University, highlighted at CVPR 2025, introduces a groundbreaking solution called “Motion Prompting” that offers an unprecedented level of control by letting users direct video generation with motion trajectories.

This approach moves beyond the limitations of text, which struggles to describe complex movements accurately. For example, a prompt like “the bear quickly turns its head” is open to countless interpretations. How fast is “quickly”? What is the exact path of the head’s movement? Motion Prompting addresses this by allowing creators to define the motion itself, opening the door to more expressive and intentional video content.

Note that generation is not real-time (roughly 10 minutes of processing per result).

Introducing motion prompts

At the heart of this research is the concept of a “motion prompt.” The researchers determined that spatio-temporally sparse or dense motion trajectories (which track the movement of points over time) are an ideal way to represent any kind of motion. This flexible format can capture anything from the subtle movement of hair to complex camera motions.

To achieve this, the team trained a ControlNet adapter on top of a powerful, pre-trained video diffusion model called Lumiere. The ControlNet was trained on a large internal dataset of 2.2 million videos, each with detailed motion tracks extracted by an algorithm called BootsTAP. This diverse training allows the model to understand and generate a wide range of motions without task-specific engineering.
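To make the trajectory representation more concrete, here is a minimal, hypothetical Python sketch of how sparse or dense point tracks might be stored and rasterized into a per-frame conditioning signal for a ControlNet-style adapter. The array shapes, the displacement encoding, and the rasterize_tracks helper are illustrative assumptions, not the paper’s actual implementation.

```python
# Hypothetical sketch: representing a motion prompt as point trajectories and
# rasterizing them into per-frame conditioning maps for a ControlNet-style adapter.
import numpy as np

def rasterize_tracks(tracks, visibility, height, width):
    """Rasterize point tracks into a per-frame conditioning volume.

    tracks:     (N, T, 2) float array of (x, y) pixel positions per point, per frame.
    visibility: (N, T) bool array, False where a point is occluded or unspecified
                (this is what lets the same format be sparse or dense).
    Returns:    (T, height, width, 2) float array; the two channels hold each
                visible point's displacement from its first-frame position.
    """
    n_points, n_frames, _ = tracks.shape
    cond = np.zeros((n_frames, height, width, 2), dtype=np.float32)
    for t in range(n_frames):
        for n in range(n_points):
            if not visibility[n, t]:
                continue
            x, y = tracks[n, t]
            xi, yi = int(round(x)), int(round(y))
            if 0 <= xi < width and 0 <= yi < height:
                # Store displacement relative to the point's initial position.
                cond[t, yi, xi] = tracks[n, t] - tracks[n, 0]
    return cond

# A "sparse" prompt: a single point moving 40 px to the right over 8 frames.
tracks = np.stack([np.stack([np.linspace(60, 100, 8), np.full(8, 32)], axis=-1)])
visibility = np.ones((1, 8), dtype=bool)
cond = rasterize_tracks(tracks, visibility, height=64, width=64)
print(cond.shape)  # (8, 64, 64, 2) -- passed to the adapter alongside the noisy latents
```

The same format covers both extremes: a handful of tracked points for a light-touch prompt, or a dense grid of tracks for full control over the scene.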

From simple clicks to complex scenes: motion prompt expansion

Since it is unrealistic to expect users to specify every point of motion in a complex scene, the researchers developed a process called “motion prompt expansion.” This system translates simple, high-level user input into the detailed, semi-dense motion prompts the model requires.

This allows for a variety of intuitive applications:

“Interacting” with an image: A user can click on an object in a still image and drag the mouse to move it. For example, dragging a parrot’s head makes it turn, or “playing” with a person’s hair, and the model generates a realistic video. Interestingly, this process reveals emergent behavior: the model produces physically plausible motion, such as sand realistically scattering when “pushed” by the cursor.
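As a rough illustration of how a single drag gesture could be expanded into a semi-dense motion prompt, the sketch below spreads the drag’s displacement over a grid of nearby points with a distance falloff. The grid spacing, the Gaussian falloff, and the expand_drag helper are assumptions for illustration only, not the paper’s actual expansion procedure.

```python
# Hypothetical "motion prompt expansion" for a mouse drag: the single drag vector
# is spread over nearby points with a distance falloff, producing a semi-dense set
# of trajectories the video model can follow.
import numpy as np

def expand_drag(click_xy, drag_xy, n_frames, grid_step=8, radius=48.0,
                height=256, width=256):
    """Turn one drag gesture into trajectories for a grid of nearby points."""
    ys, xs = np.mgrid[0:height:grid_step, 0:width:grid_step]
    points = np.stack([xs.ravel(), ys.ravel()], axis=-1).astype(np.float32)  # (N, 2)

    drag = np.asarray(drag_xy, dtype=np.float32) - np.asarray(click_xy, dtype=np.float32)
    dist = np.linalg.norm(points - np.asarray(click_xy, dtype=np.float32), axis=-1)
    falloff = np.exp(-(dist / radius) ** 2)[:, None]            # (N, 1), 1.0 at the click

    # Interpolate each point linearly from rest to its distance-scaled displacement.
    alphas = np.linspace(0.0, 1.0, n_frames)[None, :, None]     # (1, T, 1)
    tracks = points[:, None, :] + alphas * (falloff[:, None, :] * drag)  # (N, T, 2)
    return tracks

tracks = expand_drag(click_xy=(128, 96), drag_xy=(168, 96), n_frames=16)
print(tracks.shape)  # (N, 16, 2) semi-dense motion prompt
```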

Object and camera control: By interpreting mouse movements as instructions for manipulating a geometric primitive (such as an invisible sphere), users gain fine-grained control, for example precisely rotating a cat’s head. Likewise, the system can generate sophisticated camera motion, such as orbiting a scene, by estimating the depth of the first frame and projecting the desired camera path onto it. The model can even combine these prompts to control an object and the camera at the same time.
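The depth-based camera control described above can be illustrated with a simple pinhole-camera sketch: unproject first-frame pixels using an estimated depth map, rotate the camera along the desired path, and reproject to obtain per-frame point positions. The intrinsics, the constant placeholder depth map, and the camera_orbit_tracks helper are illustrative assumptions, not the paper’s pipeline.

```python
# Hypothetical sketch of turning a desired camera orbit into a motion prompt.
import numpy as np

def camera_orbit_tracks(depth, fx, fy, cx, cy, yaw_per_frame_deg, n_frames, step=16):
    """depth: (H, W) depth map of the first frame; returns (N, T, 2) pixel tracks."""
    h, w = depth.shape
    ys, xs = np.mgrid[0:h:step, 0:w:step]
    z = depth[ys, xs].ravel()
    # Unproject the sampled pixels to 3D camera coordinates (pinhole model).
    X = (xs.ravel() - cx) / fx * z
    Y = (ys.ravel() - cy) / fy * z
    pts3d = np.stack([X, Y, z], axis=-1)                        # (N, 3)

    tracks = np.zeros((pts3d.shape[0], n_frames, 2), dtype=np.float32)
    for t in range(n_frames):
        theta = np.deg2rad(yaw_per_frame_deg * t)
        R = np.array([[np.cos(theta), 0, np.sin(theta)],
                      [0, 1, 0],
                      [-np.sin(theta), 0, np.cos(theta)]])      # yaw rotation
        p = pts3d @ R.T                                          # move the camera around the scene
        tracks[:, t, 0] = fx * p[:, 0] / p[:, 2] + cx            # reproject x
        tracks[:, t, 1] = fy * p[:, 1] / p[:, 2] + cy            # reproject y
    return tracks

depth = np.full((256, 256), 2.0)                                 # placeholder depth map
tracks = camera_orbit_tracks(depth, fx=200, fy=200, cx=128, cy=128,
                             yaw_per_frame_deg=1.5, n_frames=16)
print(tracks.shape)  # (N, 16, 2) -- these tracks become the motion prompt
```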

Motion transfer: This technique applies the motion of a source video to a completely different subject in a static image. For example, the researchers demonstrate transferring a person’s head movements onto a macaque, effectively “puppeteering” the animal.
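Conceptually, motion transfer simply reuses tracks estimated from the source video as the motion prompt for a new first frame. The sketch below mocks that flow; track_points and the commented generate_video call are stand-ins for illustration, not real APIs.

```python
# Hypothetical sketch of motion transfer: tracks extracted from a source video are
# reused, unchanged, as the motion prompt for a different first frame.
import numpy as np

def track_points(source_video):
    """Placeholder for an off-the-shelf point tracker (returns (N, T, 2) tracks)."""
    t, h, w, _ = source_video.shape
    n = 64
    rng = np.random.default_rng(0)
    start = rng.uniform([0, 0], [w, h], size=(n, 2))
    drift = np.linspace(0, 1, t)[None, :, None] * rng.normal(0, 5, size=(n, 1, 2))
    return start[:, None, :] + drift

source_video = np.zeros((16, 256, 256, 3), dtype=np.uint8)   # stand-in source clip
new_first_frame = np.zeros((256, 256, 3), dtype=np.uint8)    # stand-in target image

motion_prompt = track_points(source_video)   # the motion comes from the source video...
# ...and is paired with the new first frame when conditioning the video model, e.g.:
# generate_video(first_frame=new_first_frame, motion_prompt=motion_prompt)
print(motion_prompt.shape)
```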

Putting it to the test

The team conducted extensive quantitative evaluations and human studies to validate the approach, comparing it against recent baselines such as Image Conductor and DragAnything. Across nearly all metrics, including image quality (PSNR, SSIM) and motion fidelity (EPE), their model outperformed the baselines.
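For reference, end-point error (EPE) is simply the mean Euclidean distance between predicted and ground-truth point positions. The minimal sketch below assumes tracks of shape (N, T, 2) and an optional visibility mask.

```python
# Minimal sketch of the end-point error (EPE) metric mentioned above.
import numpy as np

def end_point_error(pred_tracks, gt_tracks, visibility=None):
    """pred_tracks, gt_tracks: (N, T, 2); visibility: optional (N, T) bool mask."""
    err = np.linalg.norm(pred_tracks - gt_tracks, axis=-1)   # (N, T) per-point error
    if visibility is not None:
        err = err[visibility]                                 # ignore occluded points
    return float(err.mean())

gt = np.zeros((8, 16, 2))
pred = gt + np.random.default_rng(0).normal(0, 1.0, gt.shape)
print(end_point_error(pred, gt))
```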

Human studies further confirmed these results. When asked to choose between videos generated with motion prompts and those from competing methods, participants consistently preferred the new model’s results, citing better adherence to the motion commands, more realistic motion, and higher overall visual quality.

Limitations and future directions

The researchers are transparent about the system’s current limitations. The model can occasionally produce unnatural results, such as stretching an object when parts of it are mistakenly “locked” to the background. However, they argue that these failure cases are valuable probes of the underlying video model, exposing weaknesses in its understanding of the physical world.

This research represents an important step toward truly interactive and controllable generative video models. By focusing on the fundamental element of motion, the team has unlocked a versatile and powerful tool that could one day become standard for professionals and creatives looking to harness the full potential of AI in video production.


Check out the Paper and Project page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, join our 100K+ ML SubReddit, and subscribe to our newsletter.


Jean-Marc is a successful AI business executive. He leads and accelerates growth for AI-powered solutions and founded a computer vision company in 2006. He is a recognized speaker at AI conferences and holds an MBA from Stanford University.
