Recent advances in reconstructing hands from images deliver accurate single-frame estimates, yet these estimates often lack physical consistency and give no indication of how confidently the motion satisfies physics. In this paper, we propose a novel physics-aware conditional diffusion framework that refines noisy pose sequences into physically plausible hand motion while estimating the physics-related variance of the motion estimates. Building on a MeshCNN–Transformer backbone, we formulate Euler–Lagrange dynamics for articulated hands. Unlike prior work that enforces zero residuals, we treat the resulting dynamic residuals as virtual observables, which integrates physics more effectively. Through a last-layer Laplace approximation, our method produces per-joint, per-time variances that measure physics consistency and yields interpretable variance maps indicating where physical consistency weakens. Experiments on two well-known hand datasets show consistent gains over strong image-based initializations and competitive video-based methods.
We cast Euler–Lagrange residuals as virtual observables and integrate their likelihood into the diffusion objective, enabling physics to guide refinement without being enforced as a hard constraint.
We estimate variances of dynamic residuals through a last-layer Laplace approximation, providing a principled indicator of recovery uncertainty and highlighting where physical consistency may weaken.
On two widely adopted hand datasets (DexYCB and HO3D), we improve reconstruction accuracy and physical plausibility over strong baselines and prior video-based methods.
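As a rough illustration of the first contribution, the sketch below treats a dynamics residual as a virtual observation of zero under a Gaussian likelihood and adds its negative log-likelihood to a denoising loss. Everything in it is an assumption made for illustration, not the paper's implementation: the single 1-DoF joint with constant inertia, the function names, the fixed `sigma`, and the scalar `weight` are all hypothetical, and a full Euler–Lagrange model of an articulated hand would include coupling terms that are omitted here.

```python
import math


def finite_diff_accel(q, dt):
    """Central second difference of a 1-D joint-angle trajectory."""
    return [(q[t + 1] - 2.0 * q[t] + q[t - 1]) / dt**2
            for t in range(1, len(q) - 1)]


def dynamics_residual(q, tau, inertia, dt):
    """Toy residual r_t = inertia * q_ddot_t - tau_t for one 1-DoF joint.

    Assumption: constant inertia, no Coriolis or gravity terms.
    Residuals are returned for the interior frames 1 .. len(q) - 2.
    """
    acc = finite_diff_accel(q, dt)
    return [inertia * a - tau[i + 1] for i, a in enumerate(acc)]


def virtual_observable_nll(residuals, sigma):
    """Mean Gaussian NLL of 'observing' 0 when the residual is r.

    Instead of forcing r == 0, each residual is scored by N(0 | r, sigma^2).
    """
    const = math.log(sigma) + 0.5 * math.log(2.0 * math.pi)
    return sum(0.5 * (r / sigma) ** 2 + const for r in residuals) / len(residuals)


def total_loss(denoise_loss, residuals, sigma, weight):
    """Denoising loss plus a weighted physics likelihood term (hypothetical weighting)."""
    return denoise_loss + weight * virtual_observable_nll(residuals, sigma)
```

In this toy setting, a small `sigma` penalizes residuals heavily and approaches a hard zero-residual constraint, while a larger `sigma` lets the denoising term dominate.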
| Method | PA-MPJPE↓ | MPJPE↓ | ACCEL↓ |
|---|---|---|---|
| Image-based | |||
| MaskHand | 5.00 | 11.70 | — |
| WiLoR | 4.88 | 12.75 | 6.70 |
| Video-based | |||
| S²HAND(V) | 7.27 | 19.67 | — |
| VIBE | 6.43 | 16.95 | — |
| TCMR | 6.28 | 16.03 | — |
| Deformer | 5.22 | 13.64 | 6.77 |
| BioPR | — | 12.81 | — |
| WiLoR + Ours | 4.63 | 10.56 | 3.34 |
| Method | PA-MPJPE↓ | ACCEL↓ |
|---|---|---|
| Image-based | ||
| AMVUR | 10.30 | — |
| WiLoR | 7.50 | 4.98 |
| Video-based | ||
| VIBE | 9.90 | — |
| TCMR | 11.40 | — |
| TempCLR | 10.60 | — |
| Deformer | 9.40 | 6.37 |
| WiLoR + Ours | 7.43 | 2.71 |
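
For reference, the error metrics in the tables above can be sketched as follows. This is a minimal, dependency-free illustration and not the evaluation code behind the reported numbers. It assumes ACCEL denotes the acceleration error commonly used in video-based pose estimation (the distance between predicted and ground-truth second finite differences), it omits the Procrustes alignment that PA-MPJPE applies before measuring error, and the function names are hypothetical.

```python
import math


def _dist(p, q):
    """Euclidean distance between two 3-D points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mpjpe(pred, gt):
    """Mean per-joint position error; inputs are indexed [frame][joint][xyz]."""
    dists = [_dist(p, g)
             for pred_frame, gt_frame in zip(pred, gt)
             for p, g in zip(pred_frame, gt_frame)]
    return sum(dists) / len(dists)


def accel(seq):
    """Per-joint second finite difference x_t - 2 x_{t+1} + x_{t+2}."""
    return [
        [tuple(a - 2.0 * b + c for a, b, c in zip(j0, j1, j2))
         for j0, j1, j2 in zip(f0, f1, f2)]
        for f0, f1, f2 in zip(seq, seq[1:], seq[2:])
    ]


def accel_error(pred, gt):
    """Mean distance between predicted and ground-truth accelerations.

    Assumed definition of ACCEL, in position units per frame squared.
    """
    return mpjpe(accel(pred), accel(gt))
```

Under this definition a constant offset between prediction and ground truth raises MPJPE but leaves the acceleration error at zero, whereas frame-to-frame jitter raises the acceleration error even when the average position error is small.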
Our last-layer Laplace approximation produces per-joint, per-time variance estimates. These variances reliably flag frames where the motion is physically implausible, which typically results from poor image-based estimates.
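
To make the idea concrete, here is a minimal sketch of a last-layer Laplace approximation for a linear output head with a Gaussian likelihood and an isotropic Gaussian prior, followed by a simple thresholding step. It is an assumption-laden toy rather than the paper's procedure: the feature vectors stand in for penultimate-layer activations, `noise_std`, `prior_precision`, and `threshold` are hypothetical parameters, and how the variances are mapped to individual joints and time steps is not specified here.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def laplace_precision(features, noise_std, prior_precision):
    """Posterior precision of the last-layer weights.

    H = Phi^T Phi / noise_std^2 + prior_precision * I
    """
    d = len(features[0])
    h = [[prior_precision if i == j else 0.0 for j in range(d)] for i in range(d)]
    for phi in features:
        for i in range(d):
            for j in range(d):
                h[i][j] += phi[i] * phi[j] / noise_std**2
    return h


def predictive_variance(phi, precision):
    """phi^T H^{-1} phi: variance of the linear head's output at features phi."""
    return sum(p * s for p, s in zip(phi, solve(precision, phi)))


def flag_frames(variances, threshold):
    """Indices of frames whose variance exceeds a user-chosen threshold."""
    return [t for t, v in enumerate(variances) if v > threshold]
```

Feature directions that were well covered when fitting the precision receive low variance, while unfamiliar directions receive high variance, which is what makes thresholding a plausible way to surface suspect frames.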