CVPR 2026

PAD-Hand: Physics-Aware Diffusion
for Hand Motion Recovery

1Michigan State University   2Independent Researcher

Abstract

Significant advances in reconstructing hands from images have delivered accurate single-frame estimates, yet these estimates often lack physical consistency and provide no notion of how confidently the motion satisfies physics. In this paper, we propose a novel physics-aware conditional diffusion framework that refines noisy pose sequences into physically plausible hand motion while estimating the variance of the physics residuals of the recovered motion. Building on a MeshCNN–Transformer backbone, we formulate Euler–Lagrange dynamics for articulated hands. Unlike prior work that enforces zero residuals, we treat the resulting dynamic residuals as virtual observables to more effectively integrate physics. Through a last-layer Laplace approximation, our method produces per-joint, per-time variances that measure physics consistency and offers interpretable variance maps indicating where physical consistency weakens. Experiments on two well-known hand datasets show consistent gains over strong image-based initializations and competitive video-based methods.

Contributions

Probabilistic Physics Integration

We cast Euler-Lagrange residuals as virtual observables and integrate their likelihood into the diffusion objective, enabling physics to guide refinement without being enforced as a hard constraint.
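As a minimal sketch of this idea (not the paper's implementation), the Euler–Lagrange residual can be computed by finite differences over a pose trajectory and scored under a zero-mean Gaussian "virtual observation"; `mass_matrix` and `bias_force` are hypothetical callables standing in for the articulated-hand dynamics model:

```python
import numpy as np

def el_residuals(q, dt, mass_matrix, bias_force, tau):
    """Finite-difference Euler-Lagrange residuals r_t = M(q_t) q''_t + h(q_t, q'_t) - tau_t.

    q: (T, D) generalized coordinates; tau: (T, D) applied torques.
    mass_matrix and bias_force are placeholder dynamics callables.
    """
    qd = np.gradient(q, dt, axis=0)    # velocities
    qdd = np.gradient(qd, dt, axis=0)  # accelerations
    return np.stack([mass_matrix(q[t]) @ qdd[t] + bias_force(q[t], qd[t]) - tau[t]
                     for t in range(len(q))])

def virtual_observable_nll(r, sigma2=1.0):
    """Treat r as a noisy observation of zero: Gaussian negative log-likelihood.
    Acts as a soft physics penalty rather than a hard r = 0 constraint."""
    return 0.5 * np.mean(r ** 2 / sigma2 + np.log(2.0 * np.pi * sigma2))
```

For a trajectory at rest with zero torque the residuals vanish and the penalty reduces to the Gaussian normalization constant, illustrating how the likelihood rewards (rather than enforces) physical consistency.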

Interpretable Physics Consistency

We estimate variances of dynamic residuals through a last-layer Laplace approximation, providing a principled indicator of recovery uncertainty and highlighting where physical consistency may weaken.
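A minimal sketch of a last-layer Laplace approximation, assuming a linear-Gaussian readout over penultimate-layer features (for which the Laplace posterior is exact); all names and shapes here are illustrative assumptions, not the paper's code:

```python
import numpy as np

def last_layer_laplace(Phi, y, prior_prec=1.0, obs_noise=0.1):
    """Gaussian posterior over the weights of a last linear layer.

    Phi: (N, D) penultimate-layer features; y: (N,) residual targets.
    Returns MAP weights and the posterior covariance.
    """
    D = Phi.shape[1]
    H = Phi.T @ Phi / obs_noise + prior_prec * np.eye(D)  # posterior precision
    Sigma = np.linalg.inv(H)                              # posterior covariance
    w_map = Sigma @ (Phi.T @ y) / obs_noise               # MAP weights
    return w_map, Sigma

def predictive_variance(phi, Sigma, obs_noise=0.1):
    """Per-sample variance: epistemic term phi^T Sigma phi plus observation noise."""
    return float(phi @ Sigma @ phi) + obs_noise
```

Evaluating `predictive_variance` on the features of each joint at each time step yields a per-joint, per-time variance map of the kind described above.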

Comprehensive Evaluation

On two widely adopted hand datasets (DexYCB and HO3D), we improve reconstruction accuracy and physical plausibility over strong baselines and prior video-based methods.

Overall Pipeline

PAD-Hand pipeline overview
Overview of PAD-Hand. A sequence of images I1:T is passed through an image-based pose estimator to obtain per-frame pose estimates θ1:T and an average shape estimate βavg. The pose estimates are then refined via a diffusion process to obtain temporally coherent motion. In parallel, we propagate the variance at each diffusion step, starting from a Dirac delta distribution at diffusion step N, to obtain per-frame dynamic variance estimates. At each diffusion step, the backbone predicts the clean motion x̂1:T, which during training is supervised with a data-driven loss Ldata and a physics-driven loss LEL.
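The variance propagation described above can be sketched as a linear-Gaussian recursion through the reverse diffusion chain, starting from zero variance (a Dirac delta) at step N; the DDPM-style scaling used here is an assumption for illustration, not necessarily the paper's exact update:

```python
import numpy as np

def propagate_variance(alphas, step_var):
    """Propagate a scalar variance backwards through the reverse diffusion chain.

    alphas: per-step scaling factors (0 < alpha <= 1); step_var: variance
    injected at each step. Starts from a Dirac delta (zero variance) at step N.
    """
    var = 0.0  # Dirac delta at diffusion step N
    history = []
    for a, sv in zip(reversed(alphas), reversed(step_var)):
        var = var / a + sv  # scale by the reverse-step gain, then add step noise
        history.append(var)
    return var, history
```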

Results

Qualitative Results

Qualitative results on DexYCB
Refined motion estimates by PAD-Hand with dynamic variance. Top: Image-based estimates (left) are refined by our model (PAD-Hand) (right) to enforce temporal and physics consistency. Bottom: Joint-level (left) and mesh-level (right) variance maps concentrate on frames/regions where the image-based motion estimate is unreliable (highlighted in red), aligning high variance with poor motion estimates. The color bar shows normalized variance (low to high).

Quantitative Comparison

DexYCB

Method          PA-MPJPE↓  MPJPE↓  ACCEL↓
Image-based
MaskHand        5.00       11.70   –
WiLoR           4.88       12.75   6.70
Video-based
S²HAND(V)       7.27       19.67   –
VIBE            6.43       16.95   –
TCMR            6.28       16.03   –
Deformer        5.22       13.64   6.77
BioPR           –          12.81   –
WiLoR + Ours    4.63       10.56   3.34
HO3D

Method          PA-MPJPE↓  ACCEL↓
Image-based
AMVUR           10.30      –
WiLoR           7.50       4.98
Video-based
VIBE            9.90       –
TCMR            11.40      –
TempCLR         10.60      –
Deformer        9.40       6.37
WiLoR + Ours    7.43       2.71

Variance Analysis

Distribution of dynamic variances
Distribution of dynamic variances for PAD-Hand. Bar color encodes the mean Euler–Lagrange residual within each variance bin (blue is low, red is high). Higher variance bins coincide with larger residuals, indicating that the model's uncertainty aligns with physics violations.

Physics uncertainty as a quality signal

Our last-layer Laplace approximation produces per-joint, per-time variance estimates. These variances reliably flag frames where the motion is physically implausible, which typically arises from poor image-based estimates.
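As a small usage sketch (assumed shapes, not the released code), a per-joint, per-time variance map can be reduced to a frame-level score and thresholded to flag unreliable frames:

```python
import numpy as np

def flag_unreliable_frames(var_map, quantile=0.9):
    """var_map: (T, J) per-time, per-joint variances.
    Returns frame scores and a boolean mask over the top-quantile frames."""
    frame_score = var_map.max(axis=1)  # worst joint per frame
    mask = frame_score > np.quantile(frame_score, quantile)
    return frame_score, mask
```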

BibTeX

Coming Soon!