
ViSMaP: Unsupervised Summarization of Hour-Long Videos Using Meta-Prompting and Short-Form Datasets

Video captioning models are typically trained on datasets of short videos, usually under three minutes, paired with corresponding captions. While this allows them to describe basic actions such as walking or talking, these models struggle with the complexity of long-form videos, such as vlogs and sports events, which can last for more than an hour. When applied to such videos, they often produce fragmented descriptions that focus on isolated actions rather than capturing the broader storyline. Efforts like MA-LMM and LaViLa have extended video captioning to 10-minute clips using LLMs, but hour-long videos remain a challenge due to the lack of suitable datasets. Although Ego4D introduced a large dataset of hour-long videos, its first-person perspective limits its broader applicability. Video ReCap addressed this gap by training on long videos with annotations at multiple temporal levels, but this approach is expensive and prone to annotation inconsistencies. In contrast, annotated short-form video datasets are widely available and easier to use.

Advances in vision-language models have significantly enhanced the integration of visual and linguistic tasks, building on early works such as CLIP and ALIGN. Subsequent models, such as LLaVA and MiniGPT-4, extended these capabilities to images, while others adapted them to video comprehension by focusing on temporal modeling and building stronger datasets. Despite these developments, the scarcity of large, annotated long-form video datasets remains a significant obstacle to progress. Traditional short-form video tasks, such as video question answering, captioning, and grounding, primarily require spatial or temporal understanding, whereas summarizing an hour-long video demands identifying key frames amid large amounts of redundancy. While some models, such as LongVA and LLaVA-Video, can perform VQA on long videos, they struggle with summarization tasks due to data limitations.

Researchers at Queen Mary University of London and Spotify introduced ViSMaP, an unsupervised method for summarizing hour-long videos without expensive annotations. Traditional models perform well on short, pre-segmented videos but falter when significant events are spread across longer footage. ViSMaP bridges this gap by using LLMs and a meta-prompting strategy to iteratively generate and refine pseudo-summaries from clip descriptions created by short-form video models. The process involves three LLMs working in sequence: one to generate a summary, one to evaluate it, and one to optimize the prompt. ViSMaP achieves performance comparable to fully supervised models across multiple datasets while maintaining domain adaptability and eliminating the need for extensive manual labeling.
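The generate-evaluate-optimize loop can be pictured as a simple iterative process. The sketch below is a minimal illustration of the idea, assuming a hypothetical `call_llm` helper and illustrative prompts rather than the authors' actual prompts or models.

```python
# Minimal sketch of an iterative meta-prompting loop (generator / evaluator /
# optimizer). `call_llm` is a hypothetical helper that sends a prompt to an
# LLM and returns its text response; the prompts are illustrative only.

def call_llm(instruction: str, content: str) -> str:
    raise NotImplementedError("Wire this to your LLM API of choice.")

def meta_prompt_summary(clip_captions: list[str], n_rounds: int = 3) -> str:
    """Iteratively generate and refine a pseudo-summary from clip captions."""
    generator_prompt = ("Summarize the following clip descriptions into one "
                        "coherent summary of the whole video:")
    summary = ""
    captions_text = "\n".join(clip_captions)
    for _ in range(n_rounds):
        # 1) Generator LLM: produce a candidate summary from the clip captions.
        summary = call_llm(generator_prompt, captions_text)

        # 2) Evaluator LLM: critique the candidate against the captions.
        critique = call_llm(
            "Assess this summary for coverage and coherence and list its weaknesses:",
            f"Captions:\n{captions_text}\n\nSummary:\n{summary}",
        )

        # 3) Optimizer LLM: rewrite the generator's prompt to address the critique.
        generator_prompt = call_llm(
            "Improve the following summarization prompt so that the listed "
            "weaknesses are addressed. Return only the new prompt:",
            f"Current prompt:\n{generator_prompt}\n\nWeaknesses:\n{critique}",
        )
    return summary
```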

The study tackles cross-domain video summarization by training on a labeled short-form video dataset and adapting to unlabeled, hour-long videos from a different domain. Initially, a model is trained to summarize 3-minute videos using TimeSformer features, a visual-language alignment module, and a text decoder, optimized with cross-entropy and contrastive losses. To handle longer videos, they are split into 3-minute clips, and pseudo-captions are generated for each clip. The iterative meta-prompting procedure with multiple LLMs (generator, evaluator, optimizer) then refines these into pseudo-summaries. Finally, the model is fine-tuned on these pseudo-summaries using a symmetric cross-entropy loss to manage noisy labels and improve adaptability.
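A symmetric cross-entropy objective of the kind referenced here combines standard cross-entropy with a reverse term that is more tolerant of noisy targets. The snippet below is a generic PyTorch sketch following the usual SCE formulation, not the paper's exact implementation; the weights `alpha` and `beta` and the clamp value are illustrative.

```python
import torch
import torch.nn.functional as F

def symmetric_cross_entropy(logits, targets, alpha=0.1, beta=1.0, clamp_min=1e-4):
    """Generic symmetric cross-entropy for learning with noisy labels.

    logits:  (batch, num_classes) raw model outputs
    targets: (batch,) integer class indices (e.g. token ids from pseudo-summaries)
    """
    # Standard cross-entropy: -sum q * log p
    ce = F.cross_entropy(logits, targets)

    # Reverse cross-entropy: -sum p * log q, with log(0) made finite by
    # clamping the one-hot targets to a small positive value.
    pred = F.softmax(logits, dim=1)
    one_hot = F.one_hot(targets, num_classes=logits.size(1)).float()
    one_hot = torch.clamp(one_hot, min=clamp_min, max=1.0)
    rce = (-pred * torch.log(one_hot)).sum(dim=1).mean()

    return alpha * ce + beta * rce
```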

The study evaluated ViSMaP in three settings: summarization of hour-long videos on Ego4D-HCap, cross-domain generalization on the MSRVTT, MSVD, and YouCook2 datasets, and adaptation to short videos on EgoSchema. ViSMaP, trained on videos up to an hour long, is compared with supervised and zero-shot methods such as Video ReCap and LaViLa+GPT-3.5, demonstrating competitive or superior performance without supervision. Accuracy is assessed using CIDEr, ROUGE-L, and METEOR scores, along with a QA-based evaluation. Ablation studies highlight the benefits of meta-prompting and of component modules such as contrastive learning and the SCE loss. Implementation details include the use of TimeSformer, DistilBERT, and GPT-2, with training conducted on an NVIDIA A100 GPU.
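For reference, CIDEr, ROUGE-L, and METEOR scores of the kind reported here are commonly computed with the pycocoevalcap package. The snippet below is a small usage sketch with made-up example data, not the paper's evaluation code.

```python
# Sketch of caption-metric evaluation using pycocoevalcap
# (pip install pycocoevalcap). The data here is illustrative only.
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.meteor.meteor import Meteor  # requires a Java runtime

# Reference and generated summaries, keyed by video id;
# each value is a list of caption strings.
references = {"video_001": ["a person cooks a meal and serves it to guests"]}
candidates = {"video_001": ["someone prepares food and serves dinner"]}

for name, scorer in [("CIDEr", Cider()), ("ROUGE-L", Rouge()), ("METEOR", Meteor())]:
    score, _ = scorer.compute_score(references, candidates)
    print(f"{name}: {score:.3f}")
```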

In summary, ViSMaP is an unsupervised method for summarizing long videos that uses annotated short-form video datasets together with a meta-prompting strategy. It first creates high-quality pseudo-summaries through meta-prompting and then trains a summarization model on them, reducing the need for extensive annotation. Experimental results show that ViSMaP performs on par with fully supervised methods and adapts effectively across various video datasets. However, its reliance on pseudo-labels from the source-domain model may affect performance under significant domain shifts. In addition, ViSMaP currently relies on visual information alone. Future work could integrate multimodal data, introduce hierarchical summarization, and develop more generalizable meta-prompting techniques.


Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
