ENGINEERING Information Technology & Electronic Engineering  2026 Vol.27 No.4 P.1-10

http://doi.org/10.1631/ENG.ITEE.2026.0044


Three-dimensional affordance segmentation for object point cloud driven by language instructions


Author(s):  Jiaxuan DU, Hao WU, Qing MA, Guohui TIAN, Zhixian ZHAO, Shuwen LENG

Affiliation(s):  1. School of Control Science and Engineering, Shandong University, Jinan 250061, China; 2. Shandong Branch, China Huaneng Group Co., Ltd., Jinan 250014, China

Corresponding email(s):   wh911@sdu.edu.cn

Key Words:  Visual affordance, Point cloud segmentation, Open vocabulary, Multimodal fusion, Service robot


Jiaxuan DU, Hao WU, Qing MA, Guohui TIAN, Zhixian ZHAO, Shuwen LENG. Three-dimensional affordance segmentation for object point cloud driven by language instructions[J]. Journal of Zhejiang University Science C, 2026, 27(4): 1-10.

@article{title="Three-dimensional affordance segmentation for object point cloud driven by language instructions",
author="Jiaxuan DU, Hao WU, Qing MA, Guohui TIAN, Zhixian ZHAO, Shuwen LENG",
journal="Journal of Zhejiang University Science C",
volume="27",
number="4",
pages="1-10",
year="2026",
publisher="Zhejiang University Press & Springer",
doi="10.1631/ENG.ITEE.2026.0044"
}

%0 Journal Article
%T Three-dimensional affordance segmentation for object point cloud driven by language instructions
%A Jiaxuan DU
%A Hao WU
%A Qing MA
%A Guohui TIAN
%A Zhixian ZHAO
%A Shuwen LENG
%J Frontiers of Information Technology & Electronic Engineering
%V 27
%N 4
%P 1-10
%@ 1869-1951
%D 2026
%I Zhejiang University Press & Springer
%DOI 10.1631/ENG.ITEE.2026.0044



Abstract: 
The location where a robot grasps an object is closely related to the task type. For the same object, different user requirements may necessitate different grasping strategies. Visual affordance serves as a reliable source of prior knowledge for manipulation. Existing methods learn affordance from images or videos, but planar affordance lacks the spatial information required for 6-degree-of-freedom (6-DoF) manipulation. Furthermore, current approaches are limited to affordances associated with predefined categories and cannot directly infer affordances from user instructions. To address these limitations, we propose a novel task: instruction-driven three-dimensional (3D) object affordance segmentation. To support this research, we introduce an instruction–affordance dataset (IAD), a challenging dataset consisting of 7190 object instances across 20 common object categories, paired with 624 manipulation instructions that specify the corresponding affordances. To evaluate generalization to novel commands, the dataset includes both seen and unseen settings. Building on this dataset, we design an instruction-driven 3D affordance segmentation (IDAS) network, which extracts point cloud features and integrates instruction features layer by layer. Given a user instruction, our method segments suggested manipulation regions on the object’s point cloud, thereby guiding the selection of optimal grasp poses. Experimental results show that our method outperforms other related approaches under both seen and unseen settings, demonstrating that it generalizes to diverse user commands and unknown affordances.
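
The layer-by-layer integration of instruction features described above can be illustrated with a minimal sketch. The abstract does not specify the backbone or the fusion operator, so everything here is an assumption: each stage applies a per-point linear map with ReLU and then a FiLM-style scale and shift predicted from the instruction embedding, the weights are random and untrained, and random vectors stand in for the output of a language encoder. All function names (`fuse_layer`, `segment`, `build_model`) are hypothetical and are not taken from the IDAS implementation.

```python
import math
import random


def linear(x, w, b):
    """Apply y = W x + b to a single feature vector (lists of floats)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def init_linear(rng, n_in, n_out):
    """Random weights for illustration only; a real model would be trained."""
    scale = 1.0 / math.sqrt(n_in)
    w = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out


def fuse_layer(point_feats, instr_feat, params):
    """One fusion stage: per-point linear + ReLU, then an instruction-conditioned
    scale (gamma) and shift (beta). The FiLM-style operator is an assumption."""
    (w_p, b_p), (w_g, b_g), (w_s, b_s) = params
    gamma = linear(instr_feat, w_g, b_g)
    beta = linear(instr_feat, w_s, b_s)
    fused = []
    for f in point_feats:
        h = [max(0.0, v) for v in linear(f, w_p, b_p)]
        fused.append([(1.0 + g) * v + s for g, v, s in zip(gamma, h, beta)])
    return fused


def segment(points, instr_feat, layers, head):
    """Return one affordance score in [0, 1] per point for the given instruction."""
    feats = points
    for params in layers:  # the instruction is injected at every layer
        feats = fuse_layer(feats, instr_feat, params)
    w_h, b_h = head
    return [1.0 / (1.0 + math.exp(-linear(f, w_h, b_h)[0])) for f in feats]


def build_model(rng, instr_dim, widths=(16, 16)):
    """Stack fusion stages; the input is xyz coordinates (3 values per point)."""
    layers, n_in = [], 3
    for n_out in widths:
        layers.append((init_linear(rng, n_in, n_out),
                       init_linear(rng, instr_dim, n_out),
                       init_linear(rng, instr_dim, n_out)))
        n_in = n_out
    return layers, init_linear(rng, n_in, 1)


rng = random.Random(0)
layers, head = build_model(rng, instr_dim=8)
points = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(5)]
instr_a = [rng.uniform(-1, 1) for _ in range(8)]  # stand-in for one encoded instruction
instr_b = [rng.uniform(-1, 1) for _ in range(8)]  # stand-in for a different instruction
scores_a = segment(points, instr_a, layers, head)
scores_b = segment(points, instr_b, layers, head)
```

Because the instruction embedding modulates every stage, the same point cloud yields different per-point scores for different instructions. Thresholding those scores would give a candidate manipulation region, which is the role the segmented regions play when guiding grasp pose selection.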

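Three-dimensional affordance segmentation of object point clouds driven by language instructions

Jiaxuan DU1, Hao WU1, Qing MA1, Guohui TIAN1, Zhixian ZHAO1, Shuwen LENG2
1School of Control Science and Engineering, Shandong University, Jinan 250061, China
2Shandong Branch, China Huaneng Group Co., Ltd., Jinan 250014, China
Abstract: The position at which an object is grasped is closely tied to the task type. For the same item, different user needs may call for different ways of grasping. Visual affordance provides reliable prior knowledge for manipulation. Existing methods usually learn affordance from images or videos, but plane-based affordance lacks the spatial information needed for six-degree-of-freedom manipulation. Moreover, current methods are confined to affordances tied to predefined categories and cannot infer affordances directly from user instructions. To solve these problems, a new task is proposed: language-instruction-driven 3D object affordance segmentation. To support this research, an instruction-affordance dataset is constructed. The dataset is challenging: it contains 7190 object instances across 20 common object categories, together with 624 manipulation instructions that specify the corresponding affordances. To evaluate generalization to new instructions, the dataset includes both "seen" and "unseen" settings. On this basis, an instruction-driven 3D affordance segmentation network is designed, which extracts features from the point cloud and fuses instruction features layer by layer. Given a user instruction, the model directly segments the suggested manipulation regions on the object point cloud, thereby guiding the selection of the optimal grasp pose. Experimental results show that the method outperforms other related methods under both the "seen" and "unseen" settings and generalizes to diverse user instructions and unknown affordances.

Key words: Visual affordance; Point cloud segmentation; Open vocabulary; Multimodal fusion; Service robot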


CLC number: TP242.62

On-line Access: 2026-04-24

Received: 2026-02-06

Revision Accepted: 2026-04-24

Crosschecked: 2026-03-22

ORCID:

Jiaxuan DU: 0009-0001-0930-9958
Hao WU: 0000-0001-6993-8863
Qing MA: 0000-0002-3902-3635
Guohui TIAN: 0000-0001-8332-3064
