PA3FF: Part-Aware 3D Feature Field

Abstract

Articulated object manipulation is essential for real-world robotic tasks, yet generalizing across diverse objects remains challenging. The key lies in understanding functional parts (e.g., handles, knobs) that indicate where and how to manipulate across diverse categories and shapes.

Previous approaches using 2D foundation features face critical limitations when lifted to 3D: long runtimes, multi-view inconsistencies, and low spatial resolution with insufficient geometric information.

We propose Part-Aware 3D Feature Field (PA3FF), a novel dense 3D representation with part awareness for generalizable manipulation. PA3FF is trained via contrastive learning on 3D part proposals from large-scale datasets. Given point clouds as input, it predicts continuous 3D feature fields in a feedforward manner, where feature proximity reflects functional part relationships.

Building on PA3FF, we introduce Part-Aware Diffusion Policy (PADP) for enhanced sample efficiency and generalization. PADP significantly outperforms existing 2D and 3D representations (CLIP, DINOv2, Grounded-SAM), achieving state-of-the-art performance on both simulated and real-world tasks.

Overview

Key Contributions:

• We introduce PA3FF, a 3D-native representation that encodes dense, semantic, and functional part-aware features directly from point clouds
• We develop PADP, a diffusion policy that leverages PA3FF for generalizable manipulation with strong sample efficiency
• PA3FF can further enable diverse downstream methods, including correspondence learning and segmentation, making it a versatile foundation for robotic manipulation
• We validate our approach on 16 PartInstruct and 8 real-world tasks, where it significantly outperforms prior 2D and 3D representations (CLIP, DINOv2, and Grounded-SAM), offering a 15% and 16.5% increase

Methodology

Method Pipeline — **Three-Stage Training Framework:** (1) **Geometric Pre-training:** Leverage 3D geometric priors from large-scale datasets. (2) **Part-Aware Contrastive Learning:** Learn dense 3D feature fields to enhance part-level consistency. (3) **Policy Learning:** Integrate refined features into a diffusion policy for generalizable manipulation.

3D-Native

Directly processes point clouds, avoiding the inconsistencies of 2D multi-view lifting.

Dense Features

Predicts continuous per-point features that capture fine-grained geometric details.

Efficient

Single feedforward pass for fast inference, suitable for real-time robotic control.

Why PA3FF?

Feature Comparison — **Visual Comparison:** PA3FF generates smooth, semantically consistent 3D feature fields. In contrast, 2D-lifting methods (like DINOv2) suffer from noise and inconsistency across views, while other 3D baselines (Sonata) lack discriminative detail for functional parts.

Limitations of 2D Lifting

Inconsistency: Features fluctuate across different viewpoints.
Resolution: Small/thin parts (handles) are often lost in 2D renderings.
Latency: Multi-view fusion is computationally expensive.

PA3FF Advantages

Consistency: 3D-native prediction ensures viewpoint invariance.
Precision: Dense per-point features capture fine geometric details.
Speed: Feedforward network enables real-time interaction.

Table 2: Real-world manipulation success rates across 8 diverse tasks.

State-of-the-Art Performance

PADP achieves a 58.8% success rate on unseen objects, outperforming the best baseline (GenDP) by +23.8%, effectively bridging the sim-to-real gap.

Performance Metrics

58.8%

Success (Unseen)

35.0%

Baseline (GenDP)

+23.8%

Absolute Improvement

Figure 2: Success rates on PartInstruct benchmark across 5 generalization levels.

Key Insight

PADP achieves 28.8% average success rate, outperforming GenDP (19.4%) by a significant margin (+9.4%). It demonstrates superior robustness on Novel Object Categories (OC), validating the generalization capability of our part-aware features.

Generalization Protocol

OS	Object States (Pose/Rotation)
OI	Object Instances (Same Category)
TP	Task Parts (New Parts)
TC	Task Categories (New Tasks)
OC	Object Categories (Unseen Class)

Component Analysis

62%

Full PADP Method

46%

w/o Feature Refinement

39%

Sonata + DP3

Key Insight

Feature refinement via contrastive learning provides the largest performance gain (+16%), confirming that our part-aware representation learning is the critical component for success.