I built a leakage-clean verifier for robot manipulation, is this useful? Am I solving a non-problem? [D]
Spent the last few weeks on a benchmark/harness that tries to answer one question honestly: did a robot arm actually do the demonstrated task, or did the success metric just get fooled?
The setup: compile a human demo into an object-centric graph (what changed in the world: relations, contacts, event order), run a solver, then independently extract a graph from the rollout only and check if they match. The whole point is a hard information boundary so the "answer key" can never leak into the side that grades the rollout. A no-op baseline fails with named failure classes; a dumb scripted arm passes. That contrast is the thing I care about.
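A minimal sketch of that loop, to make the boundary concrete. This is not the actual harness: `TaskGraph`, `extract_graph`, `verify`, the failure-class names, and the toy two-block task are all made up for illustration, and perception is faked by handing the extractor per-frame relation sets directly. The thing to notice is that `extract_graph` takes only its own frames, so the rollout side never sees the demo graph.

```python
from dataclasses import dataclass

# A relation is (predicate, subject, object), e.g. ("INSIDE", "block_a", "bin").
Relation = tuple[str, str, str]


@dataclass(frozen=True)
class TaskGraph:
    final_relations: frozenset  # relations that hold in the last frame
    events: tuple               # relation onsets, in the order they appeared


def extract_graph(frames: list) -> TaskGraph:
    """Build a graph from per-frame relation sets. Sees only its own frames."""
    events = []
    for previous, current in zip(frames, frames[1:]):
        # A relation that holds now but not in the previous frame is an onset event.
        events.extend(sorted(current - previous))  # sorted for determinism
    final = frozenset(frames[-1]) if frames else frozenset()
    return TaskGraph(final, tuple(events))


def is_subsequence(needle, haystack) -> bool:
    """True if needle's items appear in haystack in the same relative order."""
    it = iter(haystack)
    return all(item in it for item in needle)


def verify(demo: TaskGraph, rollout: TaskGraph) -> list:
    """Return named failure classes; an empty list means the rollout passes."""
    failures = []
    if demo.final_relations - rollout.final_relations:
        failures.append("MISSING_FINAL_RELATION")
    if not set(demo.events) <= set(rollout.events):
        failures.append("MISSING_EVENT")
    elif not is_subsequence(demo.events, rollout.events):
        failures.append("EVENT_ORDER_MISMATCH")
    return failures


# Toy task: put block_a in the bin, then block_b.
START = {("ON", "block_a", "table"), ("ON", "block_b", "table")}
A_IN = {("INSIDE", "block_a", "bin"), ("ON", "block_b", "table")}
B_IN = {("ON", "block_a", "table"), ("INSIDE", "block_b", "bin")}
DONE = {("INSIDE", "block_a", "bin"), ("INSIDE", "block_b", "bin")}

demo = extract_graph([START, A_IN, DONE])          # compiled from the human demo
scripted_frames = [START, A_IN, DONE]              # dumb scripted arm
noop_frames = [START, START, START]                # arm never moves
reversed_frames = [START, B_IN, DONE]              # right end state, wrong order

print(verify(demo, extract_graph(scripted_frames)))  # []
print(verify(demo, extract_graph(noop_frames)))      # ['MISSING_FINAL_RELATION', 'MISSING_EVENT']
print(verify(demo, extract_graph(reversed_frames)))  # ['EVENT_ORDER_MISMATCH']
```

The reversed-order rollout is the interesting case: a final-state predicate alone would pass it, while the event-order check catches it.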
Most manipulation success metrics are hand-coded predicates written by the same person training the policy. The policy author controls both the behavior and the definition of "success." That's a conflict of interest we'd never accept in ML benchmarking, yet it's standard in manipulation eval.
But I keep going back and forth on whether this matters, and I'd like other people's read:
The case that it's real: VLA/foundation-model training is starved for reliable dense reward at scale. Human raters don't scale, and brittle predicates lie. An automatic, embodiment-agnostic grader that can say "this rollout did (or didn't) reproduce the demonstrated transformation, and if it didn't, here's why it failed" seems like an obviously-missing piece of the training loop.
The case that it's a non-problem: maybe everyone's already fine with task-specific success checks, because in practice you only care about the tasks you're shipping and a general verifier is solving for a generality nobody needs. And the representation that makes verification tractable (discrete relational state: INSIDE, TOUCHING, event order) is also what caps it. It handles pick/place/insert/open-drawer but has no obvious purchase on force-profile or deformable tasks, which is exactly where the frontier is.
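A toy illustration of that cap, under assumptions of my own choosing: the `touching_events` helper, the 0.5 N contact threshold, and both force traces are hypothetical. Once a contact-force trace is collapsed to a boolean TOUCHING relation, a gentle grasp and a crushing one produce the same event sequence, so a relational verifier cannot tell them apart.

```python
def touching_events(forces, threshold=0.5):
    """Collapse a contact-force trace (newtons) into TOUCHING onset/offset events."""
    events = []
    in_contact = False
    for force in forces:
        now = force > threshold
        if now != in_contact:
            events.append("TOUCHING_ON" if now else "TOUCHING_OFF")
            in_contact = now
    return events


gentle = [0.0, 1.0, 2.0, 2.0, 1.0, 0.0]       # light grasp
crushing = [0.0, 5.0, 40.0, 40.0, 5.0, 0.0]   # would destroy a fragile object

# Both traces reduce to the same discrete event sequence.
print(touching_events(gentle))
print(touching_events(gentle) == touching_events(crushing))
```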
There's also the uncomfortable bit: the hard 80% is perception (video → graph under occlusion and contact noise), and that's where the leakage discipline gets harder, not easier, because your extractor is now a learned, error-prone thing.
Two questions I don't have a settled answer on:
- Is reward/eval honesty a first-order bottleneck for the current generation of manipulation learning, or second-order polish?
- Is object-centric relational state a dead representation for where manipulation is actually going, or a reasonable floor you build up from?