Back before every major streamer began tagging scene tone and emotion for ad targeting, there was a much simpler question: how do you rate a title? That was the question Yash Chaturvedi set out to answer when he joined Amazon’s Trust and Safety Compliance department. Think timestamps, not trailers: how do you detect smoking, violence, or profanity in a movie that is about to be released worldwide?
Solving content safety with machine learning set the stage for something much bigger. The same technology that flags a violent scene can also tell you its mood, its pacing, even what type of truck is flying through it. Eventually, this evolved into the Video Understanding Platform (VUP), which combines computer vision, audio, and text to make every frame machine-readable and every viewer experience more intelligent.
We spoke with Yash just after Disney launched its own scene-based engine, Magic Words. In the interview, we talked about how this class of technology has evolved, what’s next for AI and long-form video, and what storytelling might look like without technical limitations.
Disney recently unveiled Magic Words—a tool that identifies scene context, mood, and tone to improve ad targeting. As the creator of the Video Understanding Platform, built on the same principles, do you think such technologies have already become the new standard in the industry?
Absolutely. A good ad, a recommendation, or a safety check can be placed exactly where it belongs once you can read the mood, objects, and pacing of every scene. Viewers immediately sense the difference. It’s no wonder that over the past two years the major streamers, broadcasters, and a wave of startups have all moved in that direction. Today, scene intelligence is no longer a curiosity; it’s the expected baseline.
Can you walk us through the origin story? What initial business challenge sparked the idea behind the Video Understanding Platform?
Content-safety pressure ignited the spark. At the time, we were shipping thousands of hours of long-form video to dozens of regions, each with its own standards and policies. The workload was immense. Human-rating every instance of violence or nicotine use wasn’t feasible, so we trained computer vision models to flag them automatically. Once we had that scene graph, we realised the same data could power discovery and advertising. What started as a narrow compliance tool suddenly became a strategic engine.
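To make that concrete, here is a minimal, purely illustrative sketch of frame-level compliance flagging of the kind described above. The classifier (`safety_model`), the per-region thresholds, and the label set are invented for the example; the actual models and policies are proprietary.

```python
# Illustrative sketch only: sample frames, score them with a generic image
# classifier, and emit timestamped flags according to a per-region policy.
from dataclasses import dataclass

@dataclass
class Flag:
    timestamp_s: float   # where in the title the event occurs
    label: str           # e.g. "violence", "nicotine"
    confidence: float

# Hypothetical per-region thresholds: stricter markets flag at lower confidence.
REGION_POLICY = {
    "US": {"violence": 0.80, "nicotine": 0.90},
    "IN": {"violence": 0.60, "nicotine": 0.70},
}

def flag_title(frames, safety_model, region: str, fps: float = 1.0):
    """Scan sampled frames (1 fps here) and emit timestamped policy flags."""
    policy = REGION_POLICY[region]
    flags = []
    for i, frame in enumerate(frames):
        scores = safety_model.predict(frame)   # assumed to return {"violence": 0.93, ...}
        for label, threshold in policy.items():
            if scores.get(label, 0.0) >= threshold:
                flags.append(Flag(timestamp_s=i / fps, label=label,
                                  confidence=scores[label]))
    return flags
```

The timestamped flags are what later became the scene-level metadata: the same records that satisfy a regional compliance check can feed discovery and advertising downstream.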
What kinds of business problems does automated video understanding solve? Targeting, recommendations, content safety? Which of these presented the greatest challenge in the context of Prime Video?
Three challenges rise to the top: scene-matched advertising that respects narrative flow, personalised discovery that tracks a viewer’s moment-to-moment mood, and frame-accurate compliance that works at global scale. Each publisher faces unique hurdles; Prime Video, for example, is shaped by its scale and its wide range of entitlement types. Trust and Safety also stands out: you have to be 100% certain; otherwise you risk breaking viewers’ trust. The content itself also benefits from the platform’s support for multiple languages and cultural considerations.
Disney says Magic Words allows advertisers to match ads to the emotional tone of a scene. What kind of edge does this give advertisers, and how is it different from traditional contextual targeting?
Traditional systems might tag an entire episode based on genre or keywords, while platforms like ours or Magic Words zero in on the exact emotional beat of a scene. Drop a pickup-truck ad right after a triumphant chase—recall climbs. You’re not just placing an ad; you’re syncing with how the viewer feels in that moment. That’s a huge shift. Relevance goes up, intrusion goes down, and media value follows. This kind of precision isn’t experimental anymore—it’s the new baseline.
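As a toy illustration of that difference, the snippet below scores a tiny invented ad catalogue against a scene-level descriptor (mood, objects, pacing) instead of an episode-level genre tag. Every name and signal here is hypothetical, not the actual VUP or Magic Words interface.

```python
# Scene-matched placement vs. traditional episode-level tagging (toy example).

EPISODE_TAGS = {"genre": "action"}           # what traditional targeting sees

scene = {                                    # what scene-level intelligence sees
    "mood": "triumphant",
    "objects": ["pickup_truck", "desert_road"],
    "pacing": "fast",
}

ad_catalogue = [
    {"id": "ad_truck", "mood": "triumphant", "objects": ["pickup_truck"]},
    {"id": "ad_insurance", "mood": "calm", "objects": ["family_car"]},
]

def score(ad, scene):
    """Higher score = closer match between the ad and the scene's emotional beat."""
    mood_match = 1.0 if ad["mood"] == scene["mood"] else 0.0
    object_overlap = len(set(ad["objects"]) & set(scene["objects"]))
    return mood_match + 0.5 * object_overlap

best = max(ad_catalogue, key=lambda ad: score(ad, scene))
print(best["id"])   # -> "ad_truck": placed right after the triumphant chase
```

An episode-level system would serve the same "action" inventory throughout; the scene-level descriptor is what lets the pickup-truck ad land on the exact beat where it resonates.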
Like Magic Words, your Video Understanding Platform sits at the intersection of computer vision and machine learning. What core technologies did your platform rely on, and what constraints did you face working with long-form video content?
Without revealing any proprietary details, I can say this: like Magic Words, our platform fuses visual, audio, and text signals into a multi-modal scene graph. Thanks to recent advances in GenAI, we can now run inference at scale without losing meaningful detail. At its core, we’re tackling a creative challenge with scientific tools, and that’s always tricky. Long-form video adds even more friction: the sheer number of frames, the need for extensive training data, and a slower narrative pace that often masks context shifts. To manage that, we batch heavy computation offline and cache scene vectors so we can stay real-time where it counts.
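A hedged sketch of that offline/online split follows, assuming placeholder embedding models and an in-memory cache rather than the platform’s real components: the heavy multi-modal fusion runs as a batch job, and the serving path only does a cheap similarity lookup over cached scene vectors.

```python
# Sketch of "batch heavy computation offline, cache scene vectors, stay
# real-time at serving time". Models and cache are stand-ins, not VUP internals.
import numpy as np

SCENE_CACHE: dict[str, np.ndarray] = {}      # scene_id -> fused scene vector

def fuse(visual: np.ndarray, audio: np.ndarray, text: np.ndarray) -> np.ndarray:
    """Naive fusion by concatenation + L2 normalisation (real systems are richer)."""
    v = np.concatenate([visual, audio, text])
    return v / np.linalg.norm(v)

def batch_index(scenes, vision_model, audio_model, text_model):
    """Offline job: embed every scene once and cache the fused vector."""
    for s in scenes:
        SCENE_CACHE[s["id"]] = fuse(
            vision_model.embed(s["frames"]),     # assumed .embed() -> np.ndarray
            audio_model.embed(s["audio"]),
            text_model.embed(s["subtitles"]),
        )

def lookup(query_vec: np.ndarray, top_k: int = 5):
    """Real-time path: rank cached scenes by cosine similarity to a query."""
    q = query_vec / np.linalg.norm(query_vec)
    scored = [(sid, float(vec @ q)) for sid, vec in SCENE_CACHE.items()]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]
```

The design point is simply that the expensive, frame-heavy work never sits on the request path; only the cached vectors do.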
In your opinion, what other content types or user experiences are ripe for reinvention using platforms like VUP or Magic Words?
Live sports, for instance, will use real-time scene graphs to trigger sponsor graphics or micro-bets the moment a play unfolds. In education, platforms could auto-generate quizzes from key moments in a lesson. Accessibility services, too, stand to benefit—imagine dynamic audio descriptions that capture not just the action but the emotional tone as well. Really, any format that gains from understanding exactly what’s on screen in real time is up for reinvention.
If there were no technical, budgetary, or organizational constraints—and you could launch any AI product for video—what would it be?
Fun question for a product manager. I’d go with a first-person story engine. Imagine slipping on lightweight AR glasses and experiencing a film entirely through the protagonist’s eyes—dialogue reacts to your prompts, the camera angle becomes your own, and the score and lighting sync to your heartbeat. The building blocks—volumetric capture, neural rendering, real-time scene graphs—already exist. It’s really just a matter of stitching them together. The result could be a new kind of immersive storytelling, one that’s as exciting for creators as it is for viewers.