PAD-Hand | CVPR 2026

Abstract

Significant advances in reconstructing hands from images have delivered accurate single-frame estimates, yet these estimates often lack physical consistency and provide no notion of how confidently the motion satisfies physics. In this paper, we propose a novel physics-aware conditional diffusion framework that refines noisy pose sequences into physically plausible hand motion while estimating the physics variance in the motion estimates. Building on a MeshCNN–Transformer backbone, we formulate Euler–Lagrange dynamics for articulated hands. Unlike prior works that enforce zero residuals, we treat the resulting dynamic residuals as virtual observables to integrate physics more effectively. Through a last-layer Laplace approximation, our method produces per-joint, per-time variances that measure physics consistency and offers interpretable variance maps indicating where physical consistency weakens. Experiments on two well-known hand datasets show consistent gains over strong image-based initializations and competitive video-based methods.

Contributions

Probabilistic Physics Integration

We cast Euler–Lagrange residuals as virtual observables and integrate their likelihood into the diffusion objective, enabling physics to guide refinement without being enforced as a hard constraint.
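
One way to read this contribution is sketched below. The notation is assumed for illustration and is not taken from the paper: $q$ denotes generalized coordinates of the articulated hand, $M$, $C$, and $g$ are the usual mass, Coriolis, and potential terms, $\tau$ is the generalized force, $\sigma_r$ is a residual noise scale, and $\lambda$ is a hypothetical loss weight. Only the loss names $L_{data}$ and $L_{EL}$ come from the pipeline description.

```latex
\begin{align}
  % Dynamic residual of the Euler-Lagrange equations (assumed manipulator form)
  r(q, \dot{q}, \ddot{q}) &= M(q)\,\ddot{q} + C(q, \dot{q})\,\dot{q} + g(q) - \tau \\
  % Virtual observable: the value \hat{r} = 0 is "observed" under Gaussian noise
  p(\hat{r} = 0 \mid q) &= \mathcal{N}\!\left(0;\; r(q, \dot{q}, \ddot{q}),\; \sigma_r^2 I\right) \\
  % Negative log-likelihood, up to an additive constant
  L_{EL} &= \frac{1}{2\sigma_r^2}\,\lVert r(q, \dot{q}, \ddot{q}) \rVert^2 \\
  % Combined objective with an assumed weight \lambda
  L &= L_{data} + \lambda\, L_{EL}
\end{align}
```

Under this reading, letting $\sigma_r \to 0$ recovers the hard constraint $r = 0$, while a finite $\sigma_r$ allows the data term to dominate wherever the two disagree. Whether the paper fixes or learns the noise scale is not stated here.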

Interpretable Physics Consistency

We estimate variances of dynamic residuals through a last-layer Laplace approximation, providing a principled indicator of recovery uncertainty and highlighting where physical consistency may weaken.
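
To make the idea concrete, here is a minimal sketch of a last-layer Laplace approximation for a linear output head with a Gaussian likelihood and a Gaussian prior. It is an illustration under stated assumptions, not the paper's implementation: the function names, the feature shapes, and the `noise_var` and `prior_precision` values are all hypothetical. Each feature row could stand for one (frame, joint) pair.

```python
import numpy as np

def last_layer_laplace(features, noise_var=1.0, prior_precision=1.0):
    """Posterior covariance of last-layer weights.

    Assumes a linear head y = features @ w with Gaussian noise and an
    isotropic Gaussian prior on w (hypothetical setup for illustration).
    """
    dim = features.shape[1]
    # Hessian of the negative log posterior with respect to the weights
    hessian = features.T @ features / noise_var + prior_precision * np.eye(dim)
    return np.linalg.inv(hessian)

def predictive_variance(query_features, weight_cov):
    """Variance of the head output for each query row: phi^T Sigma phi."""
    return np.einsum("nd,de,ne->n", query_features, weight_cov, query_features)

rng = np.random.default_rng(0)
train_feats = rng.normal(size=(200, 8))      # toy penultimate-layer features
cov = last_layer_laplace(train_feats, noise_var=0.1, prior_precision=1.0)

near = train_feats[:5]                       # queries resembling training data
far = 10.0 * train_feats[:5]                 # queries far from training data
var_near = predictive_variance(near, cov)
var_far = predictive_variance(far, cov)
```

In this toy setup the variance grows quadratically with the feature scale, so queries far from the training features receive larger variances. That is the property that makes such estimates usable as a consistency indicator.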

Comprehensive Evaluation

On two widely adopted hand datasets (DexYCB and HO3D), we improve reconstruction accuracy and physical plausibility over strong baselines and prior video-based methods.

Overall Pipeline

**Overview of PAD-Hand.** A sequence of images $I_{1:T}$ is passed through an image-based pose estimator to obtain per-frame pose estimates $\theta_{1:T}$ and an average shape estimate $\beta_{avg}$. The pose estimates are then refined by a diffusion process to obtain temporally coherent motion. In parallel, the variance is propagated through the diffusion steps, starting from a Dirac delta distribution at diffusion step $N$, to obtain per-frame dynamic variance estimates. At each diffusion step, the backbone predicts the clean motion $\hat{x}_{1:T}$, which during training is supervised with a data-driven loss $L_{data}$ and a physics-driven loss $L_{EL}$.

Results

Qualitative Results

Quantitative Comparison

DexYCB

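| Method | PA-MPJPE↓ | MPJPE↓ | ACCEL↓ |
|---|---|---|---|
| *Image-based* | | | |
| MaskHand | 5.00 | 11.70 | — |
| WiLoR | 4.88 | 12.75 | 6.70 |
| *Video-based* | | | |
| S²HAND(V) | 7.27 | 19.67 | — |
| VIBE | 6.43 | 16.95 | — |
| TCMR | 6.28 | 16.03 | — |
| Deformer | 5.22 | 13.64 | 6.77 |
| BioPR | — | 12.81 | — |
| WiLoR + Ours | 4.63 | 10.56 | 3.34 |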

HO3D

| Method | PA-MPJPE↓ | ACCEL↓ |
|---|---|---|
| *Image-based* | | |
| AMVUR | 10.30 | — |
| WiLoR | 7.50 | 4.98 |
| *Video-based* | | |
| VIBE | 9.90 | — |
| TCMR | 11.40 | — |
| TempCLR | 10.60 | — |
| Deformer | 9.40 | 6.37 |
| WiLoR + Ours | 7.43 | 2.71 |
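
For readers unfamiliar with the column headers, the sketch below shows how MPJPE and ACCEL are conventionally computed, assuming the tables follow the usual definitions. The function names and toy data are hypothetical. PA-MPJPE additionally applies a Procrustes alignment of the prediction to the ground truth before measuring error, which is omitted here for brevity.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error; inputs have shape (T, J, 3)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def accel_error(pred, gt):
    """Mean norm of the difference between finite-difference accelerations."""
    # Second-order central differences over time (needs at least 3 frames)
    acc_pred = pred[2:] - 2.0 * pred[1:-1] + pred[:-2]
    acc_gt = gt[2:] - 2.0 * gt[1:-1] + gt[:-2]
    return float(np.linalg.norm(acc_pred - acc_gt, axis=-1).mean())

# Toy ground truth: 21 joints moving at constant velocity along x for 6 frames
gt = np.zeros((6, 21, 3))
gt[:, :, 0] = np.arange(6)[:, None]

# A constant offset hurts position accuracy but not smoothness
shifted = gt + np.array([0.0, 0.0, 3.0])

# Frame-to-frame jitter along z hurts smoothness much more than position
jittery = gt.copy()
jittery[1::2, :, 2] += 1.0

shifted_scores = (mpjpe(shifted, gt), accel_error(shifted, gt))
jittery_scores = (mpjpe(jittery, gt), accel_error(jittery, gt))
```

The two toy predictions illustrate why the tables report both kinds of metric: a per-frame estimator can score well on position error while still producing jittery motion that only the acceleration error exposes.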

Variance Analysis

**Distribution of dynamic variances for PAD-Hand.** Bar color encodes the mean Euler–Lagrange residual within each variance bin (blue is low, red is high). Higher variance bins coincide with larger residuals, indicating that the model's uncertainty aligns with physics violations.
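
The binning behind such a plot can be sketched as follows. This is an illustrative reconstruction under assumptions, not the authors' plotting code: the function name, the equal-width binning, the number of bins, and the toy values are all hypothetical.

```python
from statistics import mean

def mean_residual_per_bin(variances, residuals, num_bins=4):
    """Group samples into equal-width variance bins.

    Returns one (count, mean residual) pair per bin; the mean is None
    for an empty bin.
    """
    lo, hi = min(variances), max(variances)
    width = (hi - lo) / num_bins
    if width == 0:
        width = 1.0  # all variances equal: everything falls in the first bin
    bins = [[] for _ in range(num_bins)]
    for var, res in zip(variances, residuals):
        # Clamp so that the maximum variance lands in the last bin
        idx = min(int((var - lo) / width), num_bins - 1)
        bins[idx].append(res)
    return [(len(b), mean(b) if b else None) for b in bins]

# Toy data in which residuals grow with variance
variances = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 4.0]
residuals = [0.1, 0.3, 0.4, 0.6, 1.0, 1.2, 2.0, 3.0]
summary = mean_residual_per_bin(variances, residuals)
```

The counts give the bar heights and the per-bin mean residuals give the bar colors. When the means increase from the lowest to the highest variance bin, as in this toy data, the variance estimate tracks the size of the physics violation.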

Physics uncertainty as a quality signal

Our last-layer Laplace approximation produces per-joint, per-time variance estimates. These variances flag frames where the recovered motion is physically implausible, which typically stems from poor image-based estimates.

BibTeX

Coming Soon!

PAD-Hand: Physics-Aware Diffusion for Hand Motion Recovery