Milestone Systems Launches Traffic-Focused Vision Language Model
Milestone Systems has released an advanced vision language model (VLM) specializing in traffic understanding, powered by NVIDIA Cosmos Reason, a framework designed to enable advanced reasoning across real-world visual data.
Milestone announced the model underpins two new offerings: a Video Summarization tool for XProtect Video Management Software and a Vision Language Model as a Service for third-party integrations.
The Video Summarization tool is a generative AI-powered plug-in for the XProtect Smart Client designed to automate reporting and streamline operator workflows. According to the company, reviewing large volumes of recorded video remains time-consuming and largely manual, and the new tool is intended to reduce that burden by converting video segments into structured text summaries. Early reports cited by Milestone indicate the technology could reduce operator false-alarm fatigue by up to 30%.
Using the tool, operators submit a short video clip along with a prompt describing what they want to know. The model analyzes the footage and produces a text summary within seconds. Summaries can be searched by video content rather than by timestamps or manual tags, bookmarked and filtered to speed review, and integrated with existing XProtect event and rule logic so that specific alarms or alerts trigger automated summaries. The tool can also filter out irrelevant motion or noise to focus attention on valid events.
Video Summarization is free to download and installs directly within the XProtect Smart Client in a few minutes. Users pay only when prompting the vision language model. The service supports customized, region-specific sovereign models starting with the United States and European Union, with additional regions planned.
VLMaaS expands developer access
In parallel, Milestone introduced Hafnia Vision Language Model as a Service, or VLMaaS, which provides developers, integrators and partners with API access to production-ready video intelligence built on NVIDIA technology. The service is intended to let developers add generative AI-based video intelligence to applications without setting up, fine-tuning or managing their own AI systems, regardless of the analytics already in place.
Milestone states that VLMaaS can significantly accelerate AI and analytics development, requiring up to 70 times less effort than fine-tuning a vision language model independently. The service is delivered through an API-first approach using HTTPS and supports traffic-optimized models fine-tuned for U.S. and EU markets, with additional regions to follow. It is designed for both standalone applications and integrations with Milestone products, and uses responsibly sourced training data with auditable lineage that is GDPR- and EU AI Act-compliant. Pricing is pay per use based on API calls, with no large upfront investment or custom training costs.
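To illustrate what an API-first, HTTPS-based integration of this kind could look like, the following is a minimal sketch only. The article does not publish an endpoint, request schema, or authentication scheme, so the URL, the JSON field names (`clip_url`, `prompt`, `region`), the bearer-token header, and the function name `build_summary_request` are all hypothetical placeholders rather than Milestone's actual interface. The code builds the request without sending it.

```python
import json
import urllib.request

# Hypothetical placeholder: the article does not give a real endpoint.
API_URL = "https://vlmaas.example.com/v1/summaries"

# The article mentions U.S. and EU models; these codes are assumed labels.
SUPPORTED_REGIONS = {"us", "eu"}


def build_summary_request(clip_url: str, prompt: str, region: str,
                          api_key: str) -> urllib.request.Request:
    """Build (but do not send) an HTTPS POST asking for a text summary of a clip."""
    if region not in SUPPORTED_REGIONS:
        raise ValueError(f"unsupported region: {region!r}")
    # Assumed payload shape: a clip reference, a natural-language prompt,
    # and the regional model to use.
    payload = {"clip_url": clip_url, "prompt": prompt, "region": region}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
        method="POST",
    )


request = build_summary_request(
    clip_url="https://example.com/clips/intersection-12.mp4",
    prompt="Summarize any lane-blocking incidents in this clip.",
    region="eu",
    api_key="YOUR_API_KEY",
)
print(request.get_method(), request.full_url)
```

Under a pay-per-use model, each such call would be a billable unit, so a real integration would likely send requests only for clips tied to specific alarms or events rather than for all recorded footage.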
Milestone leadership said the two offerings are intended to address video overload and manual review bottlenecks by delivering immediate insights within XProtect for operators and production-ready video intelligence for developers without bespoke training or heavy infrastructure.
The company noted that XProtect customers including the city of Genoa in Italy and Dubuque in the United States are preparing to use the new capabilities as part of traffic management efforts.
Both offerings are powered by Milestone’s Hafnia VLM, which has been fine-tuned on 75,000 hours of responsibly sourced real-world video data from Europe or the United States. Data preparation uses NVIDIA Cosmos Curator, with deployment supported on cloud infrastructure or regional data centers.
Andrew Burnett, acting CTO at Milestone Systems, said the new capabilities are designed to help organizations extract more value from traffic video while reducing the operational burden of manual review.
“Because this model is specialized for real-world traffic video and fine-tuned on responsibly sourced data, customers can trust the results, deploy with confidence, and enhance all existing solutions in place,” he stated. “It’s the fastest, most advanced and impactful path to turning video into actionable outcomes.”

