I built a leakage-clean verifier for robot manipulation, is this useful? Am I solving a non-problem? [D]

Spent the last few weeks on a benchmark/harness that tries to answer one question honestly: did a robot arm actually do the demonstrated task, or did the success metric just get fooled?

The setup: compile a human demo into an object-centric graph (what changed in the world: relations, contacts, event order), run a solver, then independently extract a graph from the rollout only and check if they match. The whole point is a hard information boundary so the "answer key" can never leak into the side that grades the rollout. A no-op baseline fails with named failure classes; a dumb scripted arm passes. That contrast is the thing I care about.

Most manipulation success metrics are hand-coded predicates written by the same person training the policy. The policy author controls both the behavior and the definition of "success." That's a conflict of interest we'd never accept in ML benchmarking, yet it's standard in manipulation eval.

But I keep going back and forth on whether this matters, and I'd like other people's read:

The case that it's real: VLA/foundation-model training is starved for reliable dense reward at scale. Human raters don't scale, brittle predicates lie. An automatic, embodiment-agnostic grader that can say "this rollout reproduced the demonstrated transformation, here's why it failed" seems like an obviously-missing piece of the training loop.

The case that it's a non-problem: maybe everyone's already fine with task-specific success checks because in practice you only care about the tasks you're shipping, and a general verifier is solving for a generality nobody needs. And the representation that makes verification tractable (discrete relational state — INSIDE/TOUCHING/event-order) is also what caps it: it handles pick/place/insert/open-drawer but has no obvious purchase on force-profile or deformable tasks, which is exactly where the frontier is.

There's also the uncomfortable bit: the hard 80% is perception (video → graph under occlusion and contact noise), and that's where the leakage discipline gets harder, not easier, because your extractor is now a learned, error-prone thing.

Two questions I don't have a settled answer on:

Is reward/eval honesty a first-order bottleneck for the current generation of manipulation learning, or second-order polish?
Is object-centric relational state a dead representation for where manipulation is actually going, or a reasonable floor you build up from?

submitted by /u/Alexpplay
[link] [comments]

I built a leakage-clean verifier for robot manipulation, is this useful? Am I solving a non-problem? [D]

Want to read more?

Tagged with