Phani Harish Wajjala is a Principal Machine Learning Engineer who leads content understanding for a large-scale avatar marketplace, where his team’s models classify, protect, and surface millions of 3D assets within a $2 billion annual economy.
Before that, he spent seven years at Amazon working on computer vision systems for product data, including OCR-based nutrition extraction and scene construction.
Alltech Magazine sat down with Wajjala to discuss how he approaches ML decision-making at production scale, from choosing when a rule-based solution beats a model to measuring ROI on systems that sit several layers removed from the final transaction.
You lead machine learning initiatives for a content understanding platform that processes millions of 3D assets. What does “content understanding” mean in practice, and why should business leaders care about it?
At its core, content understanding means generating useful signals from complex data sources. Generally, a “content understanding” pod becomes necessary when you have vast amounts of unstructured data that you don’t know how to parse efficiently.
Let’s take Netflix, for example. They have a massive collection of movies. If you treat a movie simply as a video and audio file, a business leader only has access to high-level stats—like the total number of movies—and maybe basic metadata, such as the country of origin or cast lists. But that data alone wouldn’t make it possible for Netflix to show personalized, widgetized recommendations like “Romantic Movies” or “Because you watched X.” A content understanding platform is crucial to extract deeper properties from those raw files—things like specific sub-genres, plot twists, musical composition, and humor styles.
ML teams often struggle to move from proof-of-concept to production systems that work reliably at scale. What organizational or technical decisions have made that transition successful in your experience?
There are three frequent roadblocks we see when bringing a prototype to production. First, offline metrics look positive, but the A/B test is negative. Second, the prototype is hard to integrate into the production computing environment due to library gaps or other technical debt. And third, the prototype simply doesn’t scale to production traffic.
We handle these challenges by carefully following MLOps principles.
We ensure the offline distribution matches the production distribution as closely as possible by constantly sampling production traffic for our offline metrics. We use containerized services to minimize disruptions caused by the ever-changing libraries required for ML. Finally, we monitor the model not just for ML accuracy but against strict performance and availability criteria, so we aren’t forced to roll back a production version because of an availability alert spike right after deployment.
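As a rough illustration of that first practice, constantly sampling production traffic can be as simple as maintaining a reservoir sample over the request stream. The sketch below is a minimal, hypothetical version; the names are illustrative and not drawn from Wajjala’s actual pipeline:

```python
import random

def reservoir_sample(request_stream, k, seed=0):
    """Keep a uniform random sample of k production requests from an
    unbounded stream, so the offline eval set mirrors live traffic."""
    rng = random.Random(seed)
    reservoir = []
    for i, request in enumerate(request_stream):
        if i < k:
            reservoir.append(request)
        else:
            j = rng.randint(0, i)  # inclusive; each item kept with prob k/(i+1)
            if j < k:
                reservoir[j] = request
    return reservoir
```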
Building ML systems requires collecting and labeling massive amounts of data, often from human evaluators. How do you design data collection pipelines that produce high-quality training data without becoming a bottleneck?
Due to the cost and limited scale of human annotation, active learning is a crucial step. We use the model’s own probabilities to identify where it is least confident, and we oversample that specific data for human labeling.
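The uncertainty-sampling step Wajjala describes can be sketched in a few lines. This minimal version assumes a NumPy array of per-class model probabilities; the function name and shapes are illustrative:

```python
import numpy as np

def least_confident(probs, k):
    """Uncertainty sampling: return indices of the k examples whose
    top-class probability is lowest; these go to human annotators.
    probs: (n_examples, n_classes) array of model output probabilities."""
    confidence = probs.max(axis=1)     # model's best-guess probability per example
    return np.argsort(confidence)[:k]  # least confident first
```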
For the annotation queues, we also route a sample of items to multiple annotators so we can measure inter-annotator agreement. Low agreement usually signals issues with the labeling guidelines, input clarity, or subjectivity in the task itself.
For example, a task like “what is the class for this object” usually has high agreement. But we once needed to collect ratings for “rate this avatar’s thematic coherence” on a scale of 1 to 5. The agreement was very low. When we gave the annotators feedback, they just started marking everything as “3” to play it safe, which resulted in unhelpful data. We rephrased the task to simply compare the thematic coherence of two avatars and select the better one. That shift resulted in a much better data distribution with high inter-annotator agreement.
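Agreement itself is usually quantified with a chance-corrected statistic such as Cohen’s kappa, which works for categorical labels and pairwise-preference choices alike. A minimal sketch for two annotators, not drawn from Wajjala’s actual tooling:

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for
    chance. ~1.0 means strong agreement, ~0.0 means chance level."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)
```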
When you’re building AI-powered automation for a marketplace, how do you decide which problems are worth solving with ML versus simpler rule-based approaches?
This is the most critical discussion during scoping. My rule of thumb is: Don’t use a probabilistic solution for a deterministic problem. ML brings maintenance overhead and non-determinism that we should avoid if possible.
For instance, we had a request to use NLP to detect malicious scripts in user-uploaded assets—code designed to ‘explode’ an asset’s size to crash the game. An ML model would have been complex and prone to false positives. Instead, we turned it into a sandbox problem. We loaded the asset into an empty scene, checked the geometric bounds, and if the size spiked, we blocked it. This rule-based approach gave us 100% accuracy with zero false positives, something an ML model could never guarantee.
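The check amounts to a simple deterministic guard. Below is a minimal sketch assuming a hypothetical sandboxed loader and size limit; the real engine API and threshold would differ:

```python
MAX_EXTENT = 50.0  # assumed per-axis size limit, in engine units (hypothetical)

def passes_bounds_check(asset, load_into_empty_scene):
    """Load the asset into an isolated empty scene and reject it if its
    axis-aligned bounding box exceeds the allowed size on any axis."""
    scene = load_into_empty_scene(asset)  # sandboxed load, no other geometry
    (min_x, min_y, min_z), (max_x, max_y, max_z) = scene.bounding_box()
    extents = (max_x - min_x, max_y - min_y, max_z - min_z)
    return all(extent <= MAX_EXTENT for extent in extents)
```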
You’ve worked across both e-commerce and virtual marketplace environments. How do the ML challenges differ when you’re classifying physical products versus digital 3D assets?
One of the biggest differences is the richness and accuracy of the metadata. On an e-commerce platform, there is a strong feedback loop from customers back to the website—returns and reviews ensure the descriptions are very accurate. This creates rich, detailed product listings.
With virtual 3D assets, that loop is different. Since the product you see on the store page is exactly what you get in-game, there aren’t “returns” in the same way. Consequently, the product page often lacks that strong loop of user-curated metadata. This means a lot more burden is placed on the platform to “understand” the product and tag it correctly, rather than relying on the seller or buyer to describe it for us.
Many companies invest heavily in ML infrastructure but struggle to measure ROI. How do you think about quantifying the business value of content understanding and classification systems?
Measuring ROI for backend ML is tricky because we are layers removed from the final transaction. I bridge this gap by establishing ‘Proxy Metrics’ that correlate to revenue.
For example, we might not be able to claim a model caused a purchase, but we can prove that it improved ‘Search Relevance’ or ‘Click-Through Rate’ (CTR) in an A/B test. If we launch a new classifier and CTR goes up by 5%, we can calculate the incremental value of that lift. My job is to define that ‘North Star’ proxy metric before we write code, so we aren’t guessing at the value after launch.
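The arithmetic behind that kind of lift calculation is straightforward. This back-of-envelope sketch uses entirely hypothetical numbers to show the shape of the computation:

```python
# All numbers illustrative, not from the interview.
baseline_ctr    = 0.040        # control arm click-through rate
treatment_ctr   = 0.042        # 5% relative lift from the new classifier
sessions        = 10_000_000   # monthly impressions on the test surface
value_per_click = 0.50         # assumed average downstream value of a click

incremental_clicks = sessions * (treatment_ctr - baseline_ctr)
incremental_value  = incremental_clicks * value_per_click
print(f"Incremental value: ${incremental_value:,.0f}/month")  # -> $10,000/month
```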
Your work includes building agents that automate parts of the 3D asset creation workflow. Where do you see the boundary between what AI should automate and what should remain in human hands?
I view AI as a tool for acceleration, not intention. AI is incredible at generation—it can create 50 sword variations in seconds. But it lacks the ‘soul’ of narrative design. For example, creating a ‘hilarious old man’ outfit requires a human touch to curate items that clash in a specific, funny way. AI doesn’t understand the joke yet.
There is also a strict technical boundary. In 3D, assets are functional. A layered jacket cannot ‘clip’ or bloat unnaturally when worn over a thick sweater. This requires complex ‘caging’ adjustments. Right now, AI struggles with this spatial layering. Until AI has enough training data on how clothes physically deform over other clothes, this remains a skilled human task.
Cross-functional alignment is a common pain point for ML teams. Engineering, product, and business stakeholders often have different priorities. How do you navigate that?
The key is to move the debate from ‘opinions’ to ‘tenets.’ For example, if we agree upfront that ‘User Safety > Model Accuracy,’ then delaying a launch for a safety check isn’t an argument—it’s a policy we already agreed on.
When we do hit a stalemate, I use the ‘One-Way Door’ framework. If a decision is easily reversible (a Two-Way Door), I usually side with Product to move fast. If it’s irreversible (a One-Way Door, like breaking API compatibility), I side with Engineering to be cautious. Once a decision is made, we practice ‘disagree and commit’—championing the plan even if it wasn’t our first choice.
What’s one lesson you’ve learned about scaling ML operations that you wish someone had told you earlier in your career?
The most important advice is “fail fast.” This is particularly true for ML. I once spent eight months implementing a solution, only to realize the problem could have been solved better simply by improving the quality of the data; the complex solution didn’t actually help.
ML is a rapidly evolving field. Groundbreaking solutions from even five years ago are often outdated today. Because the number of ideas for improvement is vast, and there is rarely a clear “perfect” solution, we have to iterate quickly. We try to abide by the principle of failing fast by building systems that automate our workflows and provide enough tests to tell us immediately if a solution is viable. This investment, while initially cumbersome, pays large dividends over time because it prevents us from falling into the trap of spending months on a proof of concept that won’t work.

