SuperDec

3D Scene Decomposition with Superquadric Primitives

Elisabetta Fedele, Boyang Sun, Leonidas Guibas, Marc Pollefeys, Francis Engelmann

ETH Zurich      Stanford University      Microsoft

Decomposition of Replica scenes

Abstract

We present SuperDec, an approach for compact 3D scene representations based on geometric primitives, namely superquadrics. While most recent works leverage geometric primitives to obtain photorealistic 3D scene representations, we instead use them to obtain a compact yet expressive representation. We propose to solve the problem locally on individual objects and to leverage the capabilities of instance segmentation methods to scale our solution to full 3D scenes. To this end, we design a new architecture which efficiently decomposes point clouds of arbitrary objects into a compact set of superquadrics. We train our architecture on ShapeNet and demonstrate its generalization capabilities on object instances extracted from the ScanNet++ dataset as well as on full Replica scenes. Finally, we show how a compact representation based on superquadrics can be useful for a diverse range of downstream applications, including robotic tasks and controllable visual content generation and editing.

Method Overview

Overview of SuperDec. Given a point cloud of an object with N points, a Transformer-based neural network predicts parameters for P superquadrics, as well as a soft segmentation matrix that assigns points to superquadrics. The predicted parameters include the 11 superquadric parameters and an objectness score. These predictions provide an effective initialization for the subsequent Levenberg–Marquardt (LM) optimization, which refines the superquadrics.
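For concreteness, a common 11-parameter superquadric parameterization (the exact split is an assumption here; a standard choice is 3 scale, 2 shape, 3 rotation, and 3 translation parameters) can be sketched through its inside-outside function, which also underlies fitting residuals of the kind an LM refinement would minimize:

    import numpy as np

    def superquadric_inside_outside(points, scale, eps, R, t):
        # F < 1: point inside the superquadric; F = 1: on the surface; F > 1: outside.
        # points: (N, 3) world-frame points; scale: (3,) semi-axes (a1, a2, a3);
        # eps: (2,) shape exponents (eps1, eps2): both 1 -> ellipsoid, near 0 -> box-like;
        # R: (3, 3) rotation matrix; t: (3,) translation.
        p = (points - t) @ R                 # express points in the canonical frame
        x, y, z = (p / scale).T              # normalize by the semi-axis lengths
        e1, e2 = eps
        f_xy = (np.abs(x) ** (2.0 / e2) + np.abs(y) ** (2.0 / e2)) ** (e2 / e1)
        return f_xy + np.abs(z) ** (2.0 / e1)

A refinement step in the spirit of the one described above could, for instance, minimize residuals derived from this function with scipy.optimize.least_squares(..., method='lm').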

Results

Object-Level Results

We evaluate the ability of SuperDec to decompose individual objects on ShapeNet [1], a standard object dataset.

To evaluate both accuracy and generalization, we conduct two experiments: in-category and out-of-category. In the in-category experiment, all learning-based methods are trained on the full ShapeNet training set (13 classes) and evaluated on the corresponding test set. In the out-of-category experiment, models are trained on half of the categories (airplane, bench, chair, lamp, rifle, table) and evaluated on the remaining ones (car, sofa, loudspeaker, cabinet, display, telephone, watercraft).

We compare against three baselines: EMS [2], SQ [3], and CSA [4].

Quantitative results on ShapeNet. We evaluate reconstruction accuracy in terms of L2 Chamfer distance (scaled by 100) and the compactness of the representation in terms of the number of primitives.
Qualitative results on ShapeNet. We show results on test samples for in-category classes (first four columns) and out-of-category classes (last two columns). The latter were not seen during training and illustrate how well models generalize to novel classes.
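As a reference for the reconstruction metric, a minimal NumPy sketch of a symmetric L2 Chamfer distance between a point cloud sampled from the predicted superquadrics and a ground-truth cloud (whether the two directions are summed or averaged, and whether distances are squared, are details of the evaluation protocol assumed here):

    import numpy as np

    def chamfer_l2(pred, gt):
        # pred: (N, 3) points sampled from the superquadric surfaces
        # gt:   (M, 3) points sampled from the ground-truth shape
        d2 = np.sum((pred[:, None, :] - gt[None, :, :]) ** 2, axis=-1)  # (N, M) pairwise
        return d2.min(axis=1).mean() + d2.min(axis=0).mean()            # both directions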

Scene-Level Results

Our model, trained only on 13 ShapeNet categories, can be applied to full 3D scenes without any additional fine-tuning. Specifically, given a 3D scene point cloud, we obtain object instance masks using Mask3D, center and rescale them, and feed them directly into our model. We visualize the results for several scenes from the Replica dataset.
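A sketch of this per-instance pipeline (the callables run_mask3d and run_superdec are hypothetical placeholders standing in for the actual models):

    import numpy as np

    def decompose_scene(scene_points, run_mask3d, run_superdec):
        # scene_points: (N, 3) full scene point cloud
        # run_mask3d:   returns a list of boolean instance masks over the scene
        # run_superdec: maps a normalized (n, 3) object cloud to superquadrics
        results = []
        for mask in run_mask3d(scene_points):
            obj = scene_points[mask]
            center = obj.mean(axis=0)            # center the instance
            scale = np.abs(obj - center).max()   # rescale to the input range
            sq = run_superdec((obj - center) / scale)
            results.append((sq, center, scale))  # undo normalization downstream
        return results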


Room 0
Room 1
Room 2
Office 4
Qualitative results on Replica scenes. The top row shows renderings of the original point clouds; the bottom row shows the superquadric representation obtained with SuperDec.

Applications

We envision SuperDec enabling a wide range of applications, especially in robotics and controllable content generation.

Robotics

We explore how our representation can be used in robotics by evaluating it in real-world experiments on path planning and object grasping. Given a scan of a real-world 3D scene captured with an iPad, we use SuperDec to compute its superquadric representation, and we compute grasping poses for some of the objects present in the scene.

Grasps for a milk bottle and some flowers.
Grasps for a side table and a plant.
Then, given the robot's starting position, we use the superquadric representation to plan a path towards the milk bottle, allowing the robot to move to the desired object and grasp it using the previously computed pose.
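As an illustration of how the representation supports planning, a hedged sketch in which collision checking reduces to point-in-superquadric tests on sampled waypoints (a simplification: a real planner would also account for the robot's own geometry):

    import numpy as np
    # reuses superquadric_inside_outside from the method-overview sketch above

    def is_collision_free(waypoints, superquadrics, margin=1.05):
        # waypoints:     (K, 3) points sampled along a candidate path
        # superquadrics: list of (scale, eps, R, t) parameter tuples
        # margin:        inflation factor acting as a safety distance (assumed)
        for scale, eps, R, t in superquadrics:
            f = superquadric_inside_outside(waypoints, scale * margin, eps, R, t)
            if np.any(f < 1.0):  # F < 1 means a waypoint lies inside an obstacle
                return False
        return True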



Controllable generation and editing

We also explore how our representation can be directly leveraged to introduce joint spatial and semantic control into the generations of text-to-image diffusion models. To do so, we generate images by conditioning a ControlNet on depth maps rendered from the superquadrics extracted from Replica scenes. We find that the superquadric representation can be used to achieve both spatial and semantic control over the generations.
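A minimal sketch of this conditioning step using the Hugging Face diffusers library (the checkpoints below are common public ones and the depth-map path is illustrative; they are not necessarily what we used):

    import torch
    from PIL import Image
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

    # Depth-conditioned ControlNet on top of Stable Diffusion.
    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16)
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        controlnet=controlnet, torch_dtype=torch.float16).to("cuda")

    # Depth map rendered from the superquadric scene.
    depth = Image.open("superquadric_depth.png").convert("RGB")
    image = pipe("A corner of a room with a plant", image=depth,
                 num_inference_steps=30).images[0]
    image.save("generated.png")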

Original · Editing · Addition · Deletion
Spatial control. The top row shows superquadrics generated by SuperDec; the bottom row shows images generated with the prompt "A corner of a room with a plant".

Superquadrics · Depth · "Pink living room" · "Modern living room"
Semantic control. All images share the same scene geometry: the first two panels show the underlying structure (superquadrics and depth), while the last two are generated with distinct textual prompts. Our representation allows changing the style of the room while keeping its semantic and geometric structure fixed.

Additional experiments

Unsupervised part segmentation

Our method not only learns to predict the parameters of the superquadric representation, but also the segmentation matrix which decomposes the input point cloud into parts that can be fitted by the predicted superquadrics. Below, we visualize the predicted segmentations for the same examples from ShapeNet. The segmentation masks appear very sharp, which suggests that our method, especially if trained at a larger scale, can be leveraged for applications such as geometry-based part segmentation or as pretraining for supervised part segmentation.

Part segmentation results on ShapeNet. Our method learns to segment objects into parts that can be fitted by superquadrics.
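Extracting hard part labels from the predicted soft segmentation matrix amounts to an argmax over primitives; a small sketch using the notation of the method overview (N points, P superquadrics; the objectness threshold is an assumed detail):

    import numpy as np

    def hard_part_labels(soft_assignment, objectness, threshold=0.5):
        # soft_assignment: (N, P) point-to-primitive assignment weights
        # objectness:      (P,) per-primitive existence scores
        active = objectness > threshold  # keep only confident primitives
        return np.argmax(soft_assignment[:, active], axis=1)  # (N,) part indices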


What does our network learn?

Since our network segments objects into parts in a completely unsupervised manner, in this experiment we analyze the features learned by our Transformer decoder across different object classes. To do this, similarly to BERT's [CLS] token, we append a learnable embedding to the sequence of embedded superquadrics. While this additional embedding is never decoded during training, it learns to extract meaningful features through the self- and cross-attention layers. After training the model with this additional embedding, we decode it at test time and save the resulting vectors across different ShapeNet object categories. Below, we show a t-SNE visualization of these vectors. We observe that categories with consistent object shapes, such as chairs, airplanes, and cars, form distinct clusters, whereas classes with greater shape diversity, such as watercraft, spread across a larger region of the plot. This result suggests that our model is able to cluster objects based on their geometric structure, without the need for any annotations.

t-SNE Visualization of Primitive Embeddings across different ShapeNet classes.
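A minimal scikit-learn sketch of how such a plot can be produced, assuming the decoded vectors and their category labels have already been saved to NumPy arrays (file names are illustrative):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    embeddings = np.load("cls_embeddings.npy")  # (num_objects, d) decoded vectors
    labels = np.load("labels.npy")              # (num_objects,) category names

    coords = TSNE(n_components=2, perplexity=30).fit_transform(embeddings)
    for cls in np.unique(labels):
        m = labels == cls
        plt.scatter(coords[m, 0], coords[m, 1], s=5, label=cls)
    plt.legend()
    plt.show()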

References

  1. Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. ShapeNet: An Information-Rich 3D Model Repository. arXiv preprint, 2015.
  2. Weixiao Liu, Yuwei Wu, Sipu Ruan, and Gregory S. Chirikjian. Robust and Accurate Superquadric Recovery: A Probabilistic Approach. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  3. Despoina Paschalidou, Ali Osman Ulusoy, and Andreas Geiger. Superquadrics Revisited: Learning 3D Shape Parsing Beyond Cuboids. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  4. Kaizhi Yang and Xuejin Chen. Unsupervised Learning for Cuboid Shape Abstraction via Joint Segmentation from Point Clouds. ACM Transactions on Graphics (TOG), 2021.

BibTeX

@article{fedele2025superdec,
  title   = {{SuperDec: 3D Scene Decomposition with Superquadric Primitives}},
  author  = {Fedele, Elisabetta and Sun, Boyang and Guibas, Leonidas and Pollefeys, Marc and Engelmann, Francis},
  journal = {arXiv preprint},
  year    = {2025}
}