
CLC number: TP242.62
On-line Access: 2026-04-24
Received: 2026-02-06
Revision Accepted: 2026-04-24
Crosschecked: 2026-03-22
Jiaxuan DU, Hao WU, Qing MA, Guohui TIAN, Zhixian ZHAO, Shuwen LENG. Three-dimensional affordance segmentation for object point cloud driven by language instructions[J]. Journal of Zhejiang University Science C, 2026, 27(4): 1-10.
@article{du2026affordance,
title="Three-dimensional affordance segmentation for object point cloud driven by language instructions",
author="Jiaxuan DU and Hao WU and Qing MA and Guohui TIAN and Zhixian ZHAO and Shuwen LENG",
journal="Journal of Zhejiang University Science C",
volume="27",
number="4",
pages="1-10",
year="2026",
publisher="Zhejiang University Press & Springer",
doi="10.1631/ENG.ITEE.2026.0044"
}
%0 Journal Article
%T Three-dimensional affordance segmentation for object point cloud driven by language instructions
%A Jiaxuan DU
%A Hao WU
%A Qing MA
%A Guohui TIAN
%A Zhixian ZHAO
%A Shuwen LENG
%J Frontiers of Information Technology & Electronic Engineering
%V 27
%N 4
%P 1-10
%@ 1869-1951
%D 2026
%I Zhejiang University Press & Springer
%R 10.1631/ENG.ITEE.2026.0044
TY - JOUR
T1 - Three-dimensional affordance segmentation for object point cloud driven by language instructions
A1 - Jiaxuan DU
A1 - Hao WU
A1 - Qing MA
A1 - Guohui TIAN
A1 - Zhixian ZHAO
A1 - Shuwen LENG
JO - Frontiers of Information Technology & Electronic Engineering
VL - 27
IS - 4
SP - 1
EP - 10
SN - 1869-1951
Y1 - 2026
PB - Zhejiang University Press & Springer
DO - 10.1631/ENG.ITEE.2026.0044
ER -
Abstract: The location where a robot grasps an object is closely related to the task type: for the same object, different user requirements may necessitate different grasping strategies. Visual affordance serves as a reliable source of prior knowledge for manipulation. Existing methods learn affordance from images or videos, but planar affordance lacks the spatial information required for 6-degree-of-freedom (6-DoF) manipulation. Furthermore, current approaches are limited to affordances associated with predefined categories and cannot infer affordances directly from user instructions. To address these limitations, we propose a novel task: instruction-driven three-dimensional (3D) object affordance segmentation. To support this research, we introduce the instruction–affordance dataset (IAD), a challenging dataset consisting of 7190 object instances across 20 common object categories, paired with 624 manipulation instructions that specify the corresponding affordances. To evaluate generalization to novel commands, the dataset includes both seen and unseen settings. Building on this, we design an instruction-driven 3D affordance segmentation (IDAS) network, which extracts point cloud features and integrates instruction features layer by layer. Given a user instruction, our method segments the suggested manipulation regions on the object's point cloud, thereby guiding the selection of optimal grasp poses. Experimental results show that our method outperforms related approaches under both seen and unseen settings, demonstrating its ability to generalize to diverse user commands and unknown affordances.
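The page provides no code, so the following PyTorch sketch is only a minimal illustration of the kind of layer-by-layer fusion of instruction features with point cloud features that the abstract describes, not the actual IDAS architecture. Every concrete choice here is an assumption made for the example: the module names (FiLMLayer, ToyAffordanceSegmenter), the FiLM-style scale-and-shift conditioning, the point-wise MLP backbone standing in for a real point cloud encoder, and the 512-dimensional instruction embedding.

import torch
import torch.nn as nn


class FiLMLayer(nn.Module):
    # FiLM-style conditioning (assumed for illustration): scale and shift
    # point features with parameters predicted from the instruction embedding.
    def __init__(self, point_dim, text_dim):
        super().__init__()
        self.gamma = nn.Linear(text_dim, point_dim)
        self.beta = nn.Linear(text_dim, point_dim)

    def forward(self, point_feat, text_feat):
        # point_feat: (B, N, C), text_feat: (B, D)
        gamma = self.gamma(text_feat).unsqueeze(1)  # (B, 1, C)
        beta = self.beta(text_feat).unsqueeze(1)    # (B, 1, C)
        return point_feat * gamma + beta


class ToyAffordanceSegmenter(nn.Module):
    # Hypothetical stand-in for an instruction-conditioned segmentation network:
    # shared point-wise MLPs replace a real point cloud backbone, and the
    # instruction embedding is injected after every feature layer.
    def __init__(self, text_dim=512, dims=(64, 128, 256)):
        super().__init__()
        in_dims = (3,) + dims[:-1]
        self.point_layers = nn.ModuleList(
            [nn.Sequential(nn.Linear(i, o), nn.ReLU()) for i, o in zip(in_dims, dims)]
        )
        self.fusions = nn.ModuleList([FiLMLayer(o, text_dim) for o in dims])
        self.head = nn.Linear(dims[-1], 1)  # per-point affordance logit

    def forward(self, points, text_feat):
        # points: (B, N, 3) object point cloud
        # text_feat: (B, text_dim) sentence embedding of the instruction,
        # e.g. from a frozen pretrained text encoder (assumed here)
        x = points
        for mlp, fuse in zip(self.point_layers, self.fusions):
            x = fuse(mlp(x), text_feat)     # fuse language at each stage
        return self.head(x).squeeze(-1)     # (B, N) logits; sigmoid gives a soft mask


if __name__ == "__main__":
    model = ToyAffordanceSegmenter()
    pts = torch.rand(2, 2048, 3)            # two objects, 2048 points each
    txt = torch.rand(2, 512)                # placeholder instruction embeddings
    print(model(pts, txt).shape)            # torch.Size([2, 2048])

In practice, the instruction embedding would come from a pretrained language or vision–language encoder, and the point-wise MLPs would be replaced by a hierarchical point cloud backbone; both choices are outside what the abstract specifies.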