Abstract
Addressing Lidar Panoptic Segmentation (LPS) is crucial for the safe deployment of autonomous vehicles. LPS aims to recognize and segment lidar points w.r.t. a pre-defined vocabulary of semantic classes, including thing classes of countable objects (e.g., pedestrians and vehicles) and stuff classes of amorphous regions (e.g., vegetation and road). Importantly, LPS requires segmenting individual thing instances (e.g., every single vehicle). Current LPS methods make the unrealistic assumption that the semantic class vocabulary is fixed; in the real open world, class ontologies usually evolve over time as robots encounter instances of novel classes that are considered unknowns w.r.t. the pre-defined class vocabulary. To address this unrealistic assumption, we study LPS in the Open World (LiPSOW): we train models on a dataset with a pre-defined semantic class vocabulary and study their generalization to a larger dataset where novel instances of thing and stuff classes can appear. This experimental setting leads to interesting conclusions. While prior art trains class-specific instance segmentation methods and obtains state-of-the-art results on known classes, methods based on class-agnostic bottom-up grouping perform favorably on classes outside of the initial class vocabulary (i.e., unknown classes). Unfortunately, these methods do not perform on par with fully data-driven methods on known classes. Our work suggests a middle ground: we perform class-agnostic point clustering and over-segment the input cloud in a hierarchical fashion, followed by binary point segment classification, akin to Region Proposal Networks (Ren et al., 2015). We obtain the final point cloud segmentation by computing a cut in the weighted hierarchical tree of point segments, independently of semantic classification. Remarkably, this unified approach leads to strong performance on both known and unknown classes.






References
Alexe, B., Deselaers, T., & Ferrari, V. (2012). Measuring the objectness of image windows. IEEE TPAMI, 34(11), 2189–2202.
Alonso, I., Riazuelo, L., Montesano, L., & Murillo, A. C. (2020). 3d-mininet: Learning a 2d representation from point clouds for fast and efficient 3d lidar semantic segmentation. IEEE Robotics and Automation Letters, 5(4), 5432–5439.
Aygün, M., Osep, A., Weber, M., Maximov, M., Stachniss, C., Behley, J., & Leal-Taixé, L. (2021). 4d panoptic lidar segmentation. In CVPR.
Behley, J., Garbade, M., Milioto, A., Quenzel, J., Behnke, S., Stachniss, C., & Gall, J. (2019). SemanticKITTI: A dataset for semantic scene understanding of LiDAR sequences. In ICCV.
Behley, J., Milioto, A., & Stachniss, C. (2021). A benchmark for LiDAR-based panoptic segmentation based on KITTI. In International Conference on Robotics and Automation.
Behley, J., Steinhage, V., & Cremers, A.B. (2013). Laser-based segment classification using a mixture of bag-of-words. In International Conference on Intelligent Robots and Systems.
Bendale, A., & Boult, T.E. (2016) Towards open set deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1563–1572.
Boykov, Y., & Funka-Lea, G. (2006). Graph cuts and efficient N-D image segmentation. IJCV, 70(2), 109–131.
Cen, J., Yun, P., Zhang, S., Cai, J., Luan, D., Tang, M., Liu, M., & Yu Wang, M. (2022). Open-world semantic segmentation for lidar point clouds. In ECCV.
Chen, X., Kundu, K., Zhu, Y., Berneshawi, A.G., Ma, H., Fidler, S., & Urtasun, R. (2015). 3D object proposals for accurate object class detection. In Advances in Neural Information Processing Systems.
Choy, C., Gwak, J., & Savarese, S. (2019). 4D spatio-temporal convnets: Minkowski convolutional neural networks. In CVPR.
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., & Schiele, B. (2016). The cityscapes dataset for semantic urban scene understanding. In CVPR.
Dewan, A., Caselitz, T., Tipaldi, G.D., & Burgard, W. (2015). Motion-based detection and tracking in 3d lidar scans. In International Conference on Robotics and Automation.
Dhamija, A.R., Günther, M., & Boult, T.E. (2018). Reducing network agnostophobia. In NeurIPS.
Dhamija, A., Günther, M., Ventura, J., & Boult, T. (2020). The overlooked elephant of object detection: Open set. In WACV.
Douillard, B., Underwood, J., Kuntz, N., Vlaskine, V., Quadros, A., Morton, P., & Frenkel, A. (2011). On the segmentation of 3d lidar point clouds. In 2011 IEEE International Conference on Robotics and Automation, pp. 2798–2805. IEEE.
Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD.
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2010). The pascal visual object classes (VOC) challenge. IJCV, 88(2), 303–338.
Fomenko, V., Elezi, I., Ramanan, D., Leal-Taixé, L., & Osep, A. (2022). Learning to discover and detect objects. In Advances in Neural Information Processing Systems.
Fong, W.K., Mohan, R., Hurtado, J.V., Zhou, L., Caesar, H., Beijbom, O., & Valada, A. (2021). Panoptic nuscenes: A large-scale benchmark for lidar panoptic segmentation and tracking. arXiv preprint arXiv:2109.03805
Gasperini, S., Mahani, M.-A.N., Marcos-Ramiro, A., Navab, N., & Tombari, F. (2021). Panoster: End-to-end panoptic segmentation of lidar point clouds. IEEE Robotics and Automation Letters.
Held, D., Guillory, D., Rebsamen, B., Thrun, S., & Savarese, S. (2016). A probabilistic framework for real-time 3d segmentation using spatial, temporal, and semantic cues. In Robotics: Science and Systems.
Hendrycks, D., Mazeika, M., & Dietterich, T. (2019). Deep anomaly detection with outlier exposure. In ICLR.
Hong, F., Zhou, H., Zhu, X., Li, H., & Liu, Z. (2021). Lidar-based panoptic segmentation via dynamic shifting network. In CVPR.
Hong, F., Kong, L., Zhou, H., Zhu, X., Li, H., & Liu, Z. (2024). Unified 3d and 4d panoptic segmentation via dynamic shifting networks. IEEE Transactions on Pattern Analysis and Machine Intelligence. https://doi.org/10.1109/TPAMI.2023.3349304
Hu, P., Held, D., & Ramanan, D. (2020). Learning to optimally segment point clouds. IEEE Robotics and Automation Letters, 5(2), 875–882.
Hwang, J., Oh, S.W., Lee, J.-Y., & Han, B. (2021). Exemplar-based open-set panoptic segmentation network. In CVPR.
Jiang, P., & Saripalli, S. (2021). Lidarnet: A boundary-aware domain adaptation model for point cloud semantic segmentation. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 2457–2464. IEEE.
Joseph, K.J., Khan, S., Khan, F.S., & Balasubramanian, V.N. (2021). Towards open world object detection. In CVPR.
Kirillov, A., He, K., Girshick, R., Rother, C., & Dollár, P. (2019). Panoptic segmentation. In CVPR.
Klasing, K., Wollherr, D., & Buss, M. (2008). A clustering method for efficient segmentation of 3d laser data. In 2008 IEEE International Conference on Robotics and Automation, pp. 4043–4048. IEEE.
Kong, S., & Fowlkes, C.C. (2018). Recurrent pixel embedding for instance grouping. In CVPR.
Kong, S., & Ramanan, D. (2021). Opengan: Open-set recognition via open data generation. In ICCV.
Kong, L., Quader, N., & Liong, V.E. (2023). Conda: Unsupervised domain adaptation for lidar segmentation via regularized domain concatenation. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 9338–9345. IEEE.
Kreuzberg, L., Zulfikar, I.E., Mahadevan, S., Engelmann, F., & Leibe, B. (2022). 4d-stop: Panoptic segmentation of 4d lidar using spatio-temporal object proposal generation and aggregation. In ECCV AVVision Workshop.
Langer, F., Milioto, A., Haag, A., Behley, J., & Stachniss, C. (2020). Domain transfer for semantic segmentation of lidar data using deep neural networks. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 8263–8270. IEEE
Li, J., He, X., Wen, Y., Gao, Y., Cheng, X., & Zhang, D. (2022). Panoptic-phnet: Towards real-time and high-precision lidar panoptic segmentation via clustering pseudo heatmap. In CVPR.
Li, E., Razani, R., Xu, Y., & Liu, B. (2023). Cpseg: Cluster-free panoptic segmentation of 3d lidar point clouds. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 8239–8245. IEEE.
Li, X., Zhang, G., Pan, H., & Wang, Z. (2022). Cpgnet: Cascade point-grid fusion network for real-time lidar semantic segmentation. In 2022 International Conference on Robotics and Automation (ICRA), pp. 11117–11123. IEEE.
Li, X., Zhang, G., Wang, B., Hu, Y., & Yin, B. (2023). Center focusing network for real-time lidar panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13425–13434.
Liao, Y., Xie, J., & Geiger, A. (2021). KITTI-360: A novel dataset and benchmarks for urban scene understanding in 2d and 3d. arXiv preprint arXiv:2109.13410
Lin, Z., Pathak, D., Wang, Y.-X., Ramanan, D., & Kong, S. (2022). Continual learning with evolving class ontologies. Advances in Neural Information Processing Systems, 35, 7671–7684.
Liu, Y., Zulfikar, I.E., Luiten, J., Dave, A., Ramanan, D., Leibe, B., Osep, A., & Leal-Taixé, L. (2022). Opening up open world tracking. In CVPR.
Loiseau, R., Aubry, M., & Landrieu, L. (2022). Online segmentation of lidar sequences: Dataset and algorithm. In European Conference on Computer Vision, pp. 301–317. Springer.
Marcuzzi, R., Nunes, L., Wiesmann, L., Behley, J., & Stachniss, C. (2023). Mask-based panoptic lidar segmentation for autonomous driving. IEEE Robotics and Automation Letters, 8(2), 1141–1148.
Martin, D., Fowlkes, C., Tal, D., & Malik, J. (2001). A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In ICCV.
McInnes, L., Healy, J., & Astels, S. (2017). HDBSCAN: Hierarchical density based clustering. Journal of Open Source Software, 2(11), 205.
Moosmann, F., & Stiller, C. (2013). Joint self-localization and tracking of generic objects in 3d range data. In International Conference on Robotics and Automation.
Moosmann, F., Pink, O., & Stiller, C. (2009). Segmentation of 3d lidar data in non-flat urban environments using a local convexity criterion. In IEEE Intelligent Vehicles Symposium.
Mosig, C. (2022). ROS package to publish the KITTI-360 dataset. https://github.com/dcmlr/kitti360_ros_player
Najibi, M., Ji, J., Zhou, Y., Qi, C.R., Yan, X., Ettinger, S., & Anguelov, D. (2022). Motion inspired unsupervised perception and prediction in autonomous driving. In ECCV.
Najibi, M., Ji, J., Zhou, Y., Qi, C.R., Yan, X., Ettinger, S., & Anguelov, D. (2023). Unsupervised 3d perception with 2d vision-language distillation for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 8602–8612
Neuhold, G., Ollmann, T., Rota Bulo, S., & Kontschieder, P. (2017). The mapillary vistas dataset for semantic understanding of street scenes. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4990–4999.
Nunes, L., Chen, X., Marcuzzi, R., Osep, A., Leal-Taixé, L., Stachniss, C., & Behley, J. (2022). Unsupervised class-agnostic instance segmentation of 3d lidar data for autonomous vehicles. IEEE Robotics and Automation Letters, 7(4), 8713–8720.
Osep, A., Mehner, W., Voigtlaender, P., & Leibe, B. (2018). Track, then decide: Category-agnostic vision-based multi-object tracking. In International Conference on Robotics and Automation.
Osep, A., Voigtlaender, P., Luiten, J., Breuers, S., & Leibe, B. (2018). Towards large-scale video object mining. In ECCV Workshop on Interactive and Adaptive Learning in an Open World.
Osep, A., Voigtlaender, P., Luiten, J., Breuers, S., & Leibe, B. (2019). Large-scale object mining for object discovery from unlabeled video. In International Conference on Robotics and Automation.
Osep, A., Voigtlaender, P., Weber, M., Luiten, J., & Leibe, B. (2020). 4d generic video object proposals. In International Conference on Robotics and Automation.
Oza, P., & Patel, V.M. (2019). C2AE: Class conditioned auto-encoder for open-set recognition. In CVPR.
Qi, C.R., Su, H., Mo, K., & Guibas, L.J. (2017). Pointnet: Deep learning on point sets for 3d classification and segmentation. In CVPR.
Qi, C.R., Su, H., Nießner, M., Dai, A., Yan, M., & Guibas, L.J. (2016). Volumetric and multi-view cnns for object classification on 3d data. In CVPR.
Qi, C.R., Yi, L., Su, H., & Guibas, L.J. (2017). Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In NeurIPS.
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., & Clark, J. (2021). Learning transferable visual models from natural language supervision. In ICML.
Razani, R., Cheng, R., Li, E., Taghavi, E., Ren, Y., & Bingbing, L. (2021). Gp-s3net: Graph-based panoptic sparse semantic segmentation network. In CVPR.
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS.
Ren, P., Xiao, Y., Chang, X., Huang, P.-Y., Li, Z., Gupta, B. B., Chen, X., & Wang, X. (2021). A survey of deep active learning. ACM Computing Surveys (CSUR), 54(9), 1–40.
Rist, C.B., Enzweiler, M., & Gavrila, D.M. (2019). Cross-sensor deep domain adaptation for lidar detection and segmentation. In 2019 IEEE Intelligent Vehicles Symposium (IV), pp. 1535–1542. IEEE
Scheirer, W. J., Rezende Rocha, A., Sapkota, A., & Boult, T. E. (2012). Toward open set recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(7), 1757–1772.
Shaban, A., Lee, J., Jung, S., Meng, X., & Boots, B. (2023). Lidar-uda: Self-ensembling through time for unsupervised lidar domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 19784–19794.
Shi, S., Wang, X., & Li, H. (2019). PointRCNN: 3D object proposal generation and detection from point cloud. In CVPR.
Sirohi, K., Mohan, R., Büscher, D., Burgard, W., & Valada, A. (2021). Efficientlps: Efficient lidar panoptic segmentation. IEEE Transactions on Robotics, 38, 1894.
Tang, H., Liu, Z., Zhao, S., Lin, Y., Lin, J., Wang, H., & Han, S. (2020). Searching efficient 3d architectures with sparse point-voxel convolution. In ECCV
Teichman, A., Levinson, J., & Thrun, S. (2011). Towards 3D object recognition via classification of arbitrary object tracks. In International Conference on Robotics and Automation.
Teichman, A., & Thrun, S. (2012). Tracking-based semi-supervised learning. The International Journal of Robotics Research, 31(7), 804–818.
Thomas, H., Qi, C.R., Deschaud, J.-E., Marcotegui, B., Goulette, F., & Guibas, L.J. (2019). Kpconv: Flexible and deformable convolution for point clouds. In CVPR.
Thorpe, C., Herbert, M., Kanade, T., & Shafer, S. (1991). Toward autonomous driving: The CMU Navlab. Part I: Perception. IEEE Expert, 6(4), 31–42.
Wang, D.Z., Posner, I., & Newman, P. (2012). What could move? Finding cars, pedestrians and bicyclists in 3D laser data. In International Conference on Robotics and Automation.
Weng, Z., Ogut, M.G., Limonchik, S., & Yeung, S. (2021). Unsupervised discovery of the long-tail in instance segmentation using hierarchical self-supervision. In CVPR.
Wong, K., Wang, S., Ren, M., Liang, M., & Urtasun, R. (2020). Identifying unknown instances for autonomous driving. In Conference on Robot Learning, pp. 384–393. PMLR.
Wu, B., Zhou, X., Zhao, S., Yue, X., & Keutzer, K. (2019). Squeezesegv2: Improved model structure and unsupervised domain adaptation for road-object segmentation from a lidar point cloud. In 2019 International Conference on Robotics and Automation (ICRA), pp. 4376–4382. IEEE.
Xian, G., Ji, C., Zhou, L., Chen, G., Zhang, J., Li, B., Xue, X., & Pu, J. (2022). Location-guided lidar-based panoptic segmentation for autonomous driving. IEEE Transactions on Intelligent Vehicles, 8(2), 1473–1483.
Xu, J., Zhang, R., Dou, J., Zhu, Y., Sun, J., & Pu, S. (2021). Rpvnet: A deep and efficient range-point-voxel fusion network for lidar point cloud segmentation. In CVPR.
Yan, Y., Mao, Y., & Li, B. (2018). Second: Sparsely embedded convolutional detection. Sensors, 18(10), 3337.
Ye, M., Xu, S., Cao, T., & Chen, Q. (2021). Drinet: A dual-representation iterative learning network for point cloud segmentation. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 7447–7456.
Yi, L., Gong, B., & Funkhouser, T. (2021). Complete & label: A domain adaptation approach to semantic segmentation of lidar point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15363–15373
Yoshihashi, R., Shao, W., Kawakami, R., You, S., Iida, M., & Naemura, T. (2019). Classification-reconstruction learning for open-set recognition. In CVPR.
Zhan, X., Wang, Q., Huang, K.-h., Xiong, H., Dou, D., & Chan, A.B. (2022). A comparative survey of deep active learning. arXiv preprint arXiv:2203.13450
Zhang, L., Yang, A.J., Xiong, Y., Casas, S., Yang, B., Ren, M., & Urtasun, R. (2023). Towards unsupervised object detection from lidar point clouds. In CVPR.
Zhang, Z., Zhang, Z., Yu, Q., Yi, R., Xie, Y., & Ma, L. (2023). Lidar-camera panoptic segmentation via geometry-consistent and semantic-aware alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3662–3671.
Zhao, Y., Zhang, X., & Huang, X. (2021). A technical survey and evaluation of traditional point cloud clustering methods for lidar panoptic segmentation. In ICCV Workshops.
Zhao, Y., Zhang, X., & Huang, X. (2022). A divide-and-merge point cloud clustering algorithm for lidar panoptic segmentation. In International Conference on Robotics and Automation.
Zhou, Z., Zhang, Y., & Foroosh, H. (2021). Panoptic-polarnet: Proposal-free lidar point cloud panoptic segmentation. In CVPR.
Zhu, X., Zhou, H., Wang, T., Hong, F., Ma, Y., Li, W., Li, H., & Lin, D. (2021). Cylindrical and asymmetrical 3d convolution networks for lidar segmentation. In CVPR.
Zitnick, C.L., & Dollár, P. (2014). Edge Boxes: Locating object proposals from edges. In ECCV.
Funding
This work was supported by the CMU Argo AI Center for Autonomous Vehicle Research.
Author information
Authors and Affiliations
Contributions
Experiments were performed by Anirudh Chakravarthy and Meghana Ganesina, who were advised by Aljosa Osep, Deva Ramanan, Shu Kong, and Laura Leal-Taixé. Peiyun Hu provided support with instance segmentation implementation. All authors approved the manuscript submission.
Corresponding author
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Ethical Approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Code Availability
Code has been released along with the submission.
Additional information
Communicated by Zhun Zhong.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Below is the link to the electronic supplementary material.
Supplementary file 1 (mp4 630204 KB)
Appendices
LiDAR Panoptic Segmentation in Open-World
1.1 Vocabulary Splits
We detail our vocabulary splits for SemanticKITTI (Behley et al., 2019, 2021) and KITTI360 (Liao et al., 2021) in Table 5. The categorization of stuff and thing follows the KITTI360 ontology. Vocabulary 1 is constructed by sorting SemanticKITTI superclasses by the number of instances in a superclass and holding out tail classes as other. Vocabulary 2 is constructed by holding out only the rarest object instances that do not correspond to any semantic class labeled in SemanticKITTI (e.g., other-vehicle, other-object).
1.2 Vocabulary Consistency
Inconsistent labeling policies across datasets cause label shifts. For this reason, we base our evaluation set-up on SemanticKITTI and KITTI360. The two datasets adopt largely consistent labeling policies and the same sensor; therefore, the shift in data distribution can be thought of as a result of new classes emerging across datasets. To ensure a consistent class vocabulary, we take the following measures (a label-remapping sketch follows the list):
-
We merge rider and bicyclist (SemanticKITTI) with human and rider (KITTI360) into a single human class to ensure consistency.
-
Classes pole and traffic sign are commonly treated as thing classes. However, in SemanticKITTI, they are treated as stuff classes because we do not have instance-level annotations for them. As instance labels for these classes are available in KITTI360, we treat them as other thing classes in KITTI360. Therefore, individual instances of these classes must be segmented. This is consistent with the overall goal of LiPSOW: methods must segment all instances, including those not labeled in SemanticKITTI.
-
We treat building as a stuff class in SemanticKITTI and KITTI360 (i.e., we do not treat the building class as an object, a thing class).
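The measures above amount to a simple label remapping applied to per-point annotations before training and evaluation. A minimal, illustrative sketch follows; the class names mirror the list above, while the exact dataset label IDs and any omitted classes are assumptions of this sketch, not the actual configuration files.

```python
# Illustrative remapping to the unified vocabulary; names follow the measures
# above, exact dataset label IDs are assumptions of this sketch.
SEMANTIC_KITTI_REMAP = {"rider": "human", "bicyclist": "human"}
KITTI360_REMAP = {
    "human": "human",
    "rider": "human",
    # pole and traffic sign keep their names, but in KITTI360 they are treated
    # as (other) thing classes, i.e., their instances must be segmented.
    "pole": "pole",
    "traffic sign": "traffic sign",
    "building": "building",   # stuff class in both datasets
}

def remap(class_name: str, table: dict) -> str:
    """Map a raw class name to the unified vocabulary (identity if unlisted)."""
    return table.get(class_name, class_name)
```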
1.3 Model Training
We train all LiPSOW methods using instance labels for known-things and semantic labels for known-things and known-stuff. The instance labels for the other classes are held out, i.e., not available during training. LiPSOW methods should classify these points as other instead of performing a fine-grained semantic classification. Additionally, LiPSOW methods need to segment novel instances from other, e.g., bicycles in SemanticKITTI (Vocabulary 1) and vending machines in KITTI360 (Vocabulary 1 & Vocabulary 2).
Fig. 7 Semantic and instance labels for KITTI360 evaluation, retrieved from the dense accumulated labeled point clouds. Grey denotes points for which we could not retrieve a labeled point within a 10 cm radius to transfer labels (Color figure online)
1.4 KITTI360 Ground-Truth
To evaluate methods on the KITTI360 dataset, we require per-scan semantic and instance labels for the Velodyne Lidar scans. However, KITTI360 only provides multiple accumulated point clouds (accumulated over approximately 200 m), recorded by the SICK Lidar sensor. We use these dense accumulated point clouds to retrieve per-scan labels for individual Velodyne point clouds. Concretely, we use publicly available scripts (Mosig, 2022) to align individual Velodyne scans with the accumulated point cloud based on known vehicle odometry. Once aligned, we perform a nearest-neighbor search for each Velodyne point in the corresponding accumulated point cloud. If no match is found within a 10 cm radius, we mark the point as unlabeled (ignored during evaluation). We visualize the retrieved labels for Velodyne point clouds in Fig. 7.
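A minimal sketch of this label transfer, assuming the Velodyne scan is already aligned to the accumulated cloud and using SciPy's KD-tree for the nearest-neighbor lookup (variable and constant names are illustrative):

```python
import numpy as np
from scipy.spatial import cKDTree

UNLABELED = -1  # points without a label are ignored during evaluation

def transfer_labels(velodyne_xyz, accumulated_xyz, accumulated_labels, radius=0.10):
    """Assign each (already aligned) Velodyne point the label of its nearest
    neighbor in the accumulated cloud; points without a neighbor within
    `radius` meters are marked as unlabeled."""
    tree = cKDTree(accumulated_xyz)
    dist, idx = tree.query(velodyne_xyz, k=1)
    labels = accumulated_labels[idx].copy()
    labels[dist > radius] = UNLABELED
    return labels
```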
Implementation Details
1.1 4DPLS\(^\dagger \)
Our method and several baselines are based on 4D-PLS (Aygün et al., 2021), which employs an encoder-decoder, point-based KPConv (Thomas et al., 2019) backbone for point classification and instance segmentation. The instance segmentation branch consists of three network heads. The objectness head predicts, for each point, how likely it is to represent a (modal) instance center. The embedding and variance heads are used to associate points with their respective instance centers. During inference, we select the point \(p_i\) with the highest objectness, evaluate all points under a Gaussian (parameterized by the predicted mean and variances for \(p_i\)), and assign points to this cluster if the point-to-center association probability is higher than a threshold, i.e., \(> 0.5\). This process is repeated until the maximum objectness drops below a certain threshold (0.1 in 4D-PLS). To ensure high-quality segments, 4D-PLS additionally requires the highest objectness to be \(> 0.7\).
Since the objectness head is trained independently of the semantic head in a class-agnostic fashion, we hypothesize that it learns a general notion of objectness from geometric cues and can therefore segment instances of novel classes, albeit with lower confidence. We adapt the inference procedure in 4D-PLS to allow additional instances to be segmented: we reduce the minimum objectness threshold from 0.7 to 0.3 for other, while maintaining the same threshold for known things. Our experimental evaluation confirms that this baseline segments a larger number of other instances than 4D-PLS.
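A sketch of this adapted greedy inference, assuming the network provides per-point objectness scores, per-point embeddings and variances (the Gaussian parameters used by 4D-PLS), and a flag marking points predicted as other; tensor shapes and the association probability are simplified relative to the actual 4D-PLS implementation:

```python
import torch

def greedy_instances(objectness, embeddings, variances, is_other,
                     stop_thresh=0.1, assoc_thresh=0.5,
                     min_obj_known=0.7, min_obj_other=0.3):
    """Greedy Gaussian clustering: repeatedly seed an instance at the most
    object-like unassigned point; `other` points use a lower seed threshold."""
    N = objectness.shape[0]
    instance_ids = torch.zeros(N, dtype=torch.long)      # 0 = unassigned
    unassigned = torch.ones(N, dtype=torch.bool)
    next_id = 1
    while unassigned.any():
        scores = torch.where(unassigned, objectness, torch.full_like(objectness, -1.0))
        seed = int(torch.argmax(scores))
        if objectness[seed] < stop_thresh:
            break
        min_obj = min_obj_other if bool(is_other[seed]) else min_obj_known
        if objectness[seed] < min_obj:
            unassigned[seed] = False  # too weak to seed an instance; skip this point
            continue
        # Probability of each point under the Gaussian centered at the seed.
        diff = embeddings - embeddings[seed]
        prob = torch.exp(-0.5 * (diff ** 2 / variances[seed]).sum(dim=1))
        members = unassigned & (prob > assoc_thresh)
        instance_ids[members] = next_id
        unassigned = unassigned & ~members
        next_id += 1
    return instance_ids
```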
1.2 OSeg (Cen et al., 2022)
OSeg (Cen et al., 2022) introduces a novel strategy for open-world semantic segmentation of LiDAR Point Clouds. The proposed framework consists of two stages: (i) Open-Set semantic segmentation (OSeg) and (ii) Incremental Learning. For a fair comparison, we benchmark OSeg against our baselines.
OSeg introduces redundancy classifiers on top of a closed-set model to output scores for the unknown class. In addition, OSeg uses unknown object synthesis to generate pseudo-unknown objects based on real novel objects. The OSeg formulation considers other vehicle as a novel category for SemanticKITTI. To benchmark under our proposed LiPSOW formulation, we modify OSeg to allow for more classes in other (based on our vocabulary splits) and train from scratch. We use the default hyperparameters and three redundancy classifiers.
OWL: Instance Segmentation
1.1 Segmentation Tree Generation
Given an input point cloud, we first make a network pass and classify points into thing, stuff, and other classes. Then, we construct a hierarchical segmentation tree T by applying HDBSCAN (McInnes et al., 2017) to the points. Concretely, at each level of the hierarchy, we reduce the distance threshold \(\epsilon \) (the HDBSCAN connectivity hyperparameter) to obtain finer point segments. Consequently, child nodes in T contain strictly smaller, finer-grained segments than their parents. We follow Hu et al. (2020) and use distance thresholds \(\epsilon \in [1.2488, 0.8136, 0.6952, 0.594, 0.4353, 0.3221]\).
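A sketch of this tree construction, assuming the hdbscan package with cluster_selection_epsilon as the connectivity threshold; the node bookkeeping is simplified and not the exact implementation:

```python
import numpy as np
import hdbscan

EPSILONS = [1.2488, 0.8136, 0.6952, 0.594, 0.4353, 0.3221]

class Node:
    """A segment (subset of point indices) and its finer-grained children."""
    def __init__(self, indices):
        self.indices = indices
        self.children = []

def build_tree(points, indices=None, level=0, epsilons=EPSILONS):
    """Recursively over-segment the points with decreasing distance thresholds."""
    if indices is None:
        indices = np.arange(len(points))
    node = Node(indices)
    if level >= len(epsilons) or len(indices) < 2:
        return node
    clusterer = hdbscan.HDBSCAN(min_cluster_size=2,
                                cluster_selection_epsilon=epsilons[level])
    labels = clusterer.fit_predict(points[indices])
    for lab in np.unique(labels):
        if lab == -1:  # noise points remain with the parent segment
            continue
        child_indices = indices[labels == lab]
        node.children.append(build_tree(points, child_indices, level + 1, epsilons))
    return node
```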
1.2 Learning Objectness-Scoring Function
Given a hierarchical tree T over the input point cloud, we need to find a node partitioning such that each point is assigned to a unique instance. Naturally, some nodes in the tree contain high-quality segments, while others consist of a soup of segments or over-segmented instances. To associate a quality metric with each node, we follow Hu et al. (2020) and learn a function to score the segment in each node.
1.2.1 Network Architecture
Each node in the tree consists of a segment (i.e., a group of points), where the number of points may vary. We concatenate all the points in a segment to get an \(N \times 3\) dimensional tensor, where N is the number of points in the segment. There are several ways to learn a function \(f(p) \rightarrow [0,1]\) that estimates how likely a subset of points is to represent an object. One approach is to estimate a per-point objectness score. Following Aygün et al. (2021), this can be learned by regressing a truncated distance \(O \in \mathbb {R}^{N \times 1}\) to the nearest labeled instance center on top of decoder features \(F \in \mathbb {R}^{N \times D}\). The objectness value can then be averaged over the segment \(p \subset P\).
Alternatively, we can train a holistic classifier as a second-stage network. The network comprises three major components: (a) an input projection layer, (b) a segment embedding layer, and (c) an objectness head. In the input projection layer, we project the input point cloud to a higher dimension of \(N \times 256\) by passing it through two fully-connected layers. Then, we compute a per-segment embedding of dimension 512 using the embedding layer, which consists of set abstraction layers inspired by PointNet++ (Qi et al., 2017), followed by a reduction over the points.
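A sketch of this second-stage scorer in PyTorch; the set-abstraction step is approximated here by a shared point-wise MLP with a max-pooling reduction, and the layer sizes follow the description above (all other details are assumptions):

```python
import torch
import torch.nn as nn

class SegmentObjectness(nn.Module):
    """Scores a point segment (N x 3) with a single objectness value in [0, 1]."""

    def __init__(self):
        super().__init__()
        # (a) input projection: lift xyz coordinates to 256-d per-point features.
        self.project = nn.Sequential(
            nn.Linear(3, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        # (b) segment embedding: point-wise MLP + max-pool as a stand-in for the
        # PointNet++-style set abstraction described above.
        self.embed = nn.Sequential(nn.Linear(256, 512), nn.ReLU())
        # (c) objectness head: three fully-connected layers with a 256-d hidden size.
        self.head = nn.Sequential(
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Sigmoid(),
        )

    def forward(self, points):              # points: (N, 3) for one segment
        feats = self.project(points)         # (N, 256)
        feats = self.embed(feats)            # (N, 512)
        segment = feats.max(dim=0).values    # (512,) permutation-invariant reduction
        return self.head(segment)            # (1,) objectness score
```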
1.2.2 Network Training
The objectness head predicts per-segment objectness using three fully-connected layers with a hidden size of 256. To obtain training supervision, we pre-build hierarchical segmentation trees \(T_i\) for each point cloud i in the training set and minimize the training loss based on the signal we obtain from matching segments between the segmentation trees and the set of labeled instances \(GT_i\). As described in Sect. 4 of the main paper, for each node in the segmentation tree, the regressor predicts an objectness value that is supervised by the intersection-over-union of the segment with the maximal matching ground-truth instance.
Alternatively, we can also formulate the network as a classifier, where the post-softmax outputs from the network can be viewed as the quality of each segment (i.e., how good or bad a segment is), and this is trained using a binary cross-entropy loss. We observe that the regression formulation empirically results in a better tree cut compared to the classifier formulation. We attribute this to the over-confident and peaky distributions resulting from the classifier. The regressor formulation benefits from a smoother distribution over objectness scores, resulting in a better tree cut. We evaluate the aforementioned variants in Table 1 in the main paper.
1.3 Inference
With a score assigned to each segment, we now need to find a global segmentation, i.e., the optimal instance segmentation from an exponentially large space of possible segmentations. The global segmentation score is defined as the worst objectness among the individual segments in a tree cut. The optimal partition is the one that maximizes this global segmentation score. We outline the inference procedure in Algorithm 1. Each node in the tree T constitutes a segment proposal for an object instance. For each proposal S, we score its objectness using the learned objectness function f. The segment S is deemed an optimal node at which to cut the tree if its objectness is greater than that of any of its child nodes. By design, this tree-cut algorithm ensures that each point is assigned to a unique instance. For details about the algorithm and optimality guarantees, we refer the reader to Hu et al. (2020).
Algorithm 1 Node Partitioning Given a Hierarchical Segmentation Tree
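A sketch of this worst-case tree cut, reusing the Node structure from the tree-construction sketch above; `score` stands for the learned objectness function f applied to a node's segment, and the recursion mirrors the criterion described above rather than the exact implementation:

```python
def optimal_cut(node, score):
    """Return (segments, worst) for the cut of `node`'s subtree that maximizes
    the worst per-segment objectness."""
    node_score = score(node)
    if not node.children:
        return [node], node_score
    child_segments, child_scores = [], []
    for child in node.children:
        segs, worst = optimal_cut(child, score)
        child_segments.extend(segs)
        child_scores.append(worst)
    worst_children = min(child_scores)
    # Keep the parent segment only if it beats the worst segment obtained by
    # cutting further down the tree; otherwise keep the children's cut.
    if node_score >= worst_children:
        return [node], node_score
    return child_segments, worst_children
```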
Implementation Details
1.1 Training Encoder–Decoder Network
To train the point-classification network on each vocabulary, we follow the training procedure from Aygün et al. (2021). We train the network for 1000 epochs with a batch size of 8. We use the SGD optimizer with a learning rate of \(1e-3\) and a linear decay schedule.
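A minimal sketch of the corresponding optimizer and schedule in PyTorch; the model placeholder and the momentum value are assumptions, not the exact 4D-PLS training script:

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 20)   # placeholder for the KPConv encoder-decoder network
EPOCHS = 1000

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)  # momentum assumed
# Linear decay of the learning rate from its initial value towards zero over training.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: 1.0 - epoch / EPOCHS)

for epoch in range(EPOCHS):
    # ... iterate over batches of size 8, compute the loss, call optimizer.step() ...
    scheduler.step()
```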
1.2 Second-Stage Training
In contrast to the encoder-decoder network, which takes an entire point cloud as input, the second-stage network requires positive and negative training instances (i.e., examples of objects and non-objects) from the segmentation tree to evaluate the loss functions (regression or cross-entropy losses). To generate these instances, we first generate semantic predictions. Next, we use the points classified as thing or other to generate the segmentation tree using the thresholds described in Sect. C.1. Each node in the segmentation tree is a training sample for the second-stage network. We use predictions from the encoder-decoder instead of ground-truth semantic labels since, during inference, the second-stage network must be robust to misclassification errors within each node of the segmentation tree.
1.3 Training Objectness Regression Function
To train the regressor, we need to generate the corresponding ground-truth for each segment in the generated training set. For a given segment, the target score is computed as the maximum intersection-over-union of the segment with all the ground-truth instances in the dataset. Finally, we train the network using a mean-squared error loss with a learning rate of \(2e-3\) and a batch size of 512 for 200 epochs.
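A sketch of how these regression targets can be computed, treating segments and ground-truth instances as collections of point indices (helper names are illustrative):

```python
def segment_iou(segment_idx, gt_idx):
    """IoU between a predicted segment and a ground-truth instance, both given
    as collections of point indices."""
    seg, gt = set(segment_idx), set(gt_idx)
    union = len(seg | gt)
    return len(seg & gt) / union if union else 0.0

def regression_target(segment_idx, gt_instances):
    """Target objectness: the maximum IoU of the segment over all ground-truth instances."""
    if not gt_instances:
        return 0.0
    return max(segment_iou(segment_idx, gt) for gt in gt_instances)
```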
1.4 Training Objectness Classification Function
Alternatively, learning an objectness function can be posed as a classification problem rather than a regression problem. In this case, we supervise the network via a cross-entropy loss. The target labels for training this classifier are obtained by binarizing the regression targets using pre-defined intersection-over-union (IoU) thresholds: a segment with an IoU greater than 0.7 is defined as a positive sample, and a segment with an IoU less than 0.3 is treated as a negative sample. Since the ground-truth classification targets are generated from the hierarchical tree, which consists of predicted known-things or other, the generated training data is strongly biased towards positive samples. To elaborate, since the point classification network performs well, the segments in the tree most likely consist of known-things or other, barring misclassification errors. Therefore, while training this network, we observe a disproportionate imbalance towards the positive class. To mitigate this, we perform weighted resampling: we resample instances (segments) of the positive and negative classes with probability proportional to the inverse frequency of each class.
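A sketch of the binarization and inverse-frequency resampling, assuming PyTorch's WeightedRandomSampler; treating segments with intermediate IoU as ambiguous and dropping them is an assumption of this sketch:

```python
import torch
from torch.utils.data import WeightedRandomSampler

def binarize_targets(iou_targets, pos_thresh=0.7, neg_thresh=0.3):
    """Binarize IoU targets; segments with intermediate IoU are marked ambiguous."""
    iou = torch.as_tensor(iou_targets)
    keep = (iou > pos_thresh) | (iou < neg_thresh)   # drop ambiguous segments
    labels = (iou > pos_thresh).long()
    return keep, labels

def make_sampler(labels):
    """Sample each class with probability proportional to its inverse frequency."""
    counts = torch.bincount(labels, minlength=2).clamp(min=1).float()
    weights = (1.0 / counts)[labels]
    return WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
```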
Complexity Analysis
OWL requires a two-stage training process. The first-stage method, 4D-PLS (Aygün et al., 2021), does not run in real time. In addition, the second stage requires the construction of a segmentation tree. Given N points, the tree-cut algorithm (Hu et al., 2020) has time and space complexity linear in N. In practice, we observe that N is quite large, often on the order of 100,000 points per scan. Therefore, a limitation of OWL is that it cannot be run in real time.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Chakravarthy, A.S., Ganesina, M.R., Hu, P. et al. Lidar Panoptic Segmentation in an Open World. Int J Comput Vis 133, 1153–1174 (2025). https://doi.org/10.1007/s11263-024-02166-9

