Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding (English)

Shi, Fengyuan / Gao, Ruopeng / Huang, Weilin / Wang, Limin

In: IEEE Transactions on Pattern Analysis and Machine Intelligence ; 46 , 2 ; 1181-1198 ; 2024

ISSN:

0162-8828, 2160-9292, 1939-3539

Article (Journal) / Electronic Resource

Multimodal transformers exhibit high capacity and flexibility in aligning image and text for visual grounding. However, the existing encoder-only grounding framework (e.g., TransVG) suffers from heavy computation due to the self-attention operation, which has quadratic time complexity. To address this issue, we present a new multimodal transformer architecture, coined Dynamic Multimodal Detection Transformer (Dynamic MDETR), which decouples the whole grounding process into encoding and decoding phases. The key observation is that images contain high spatial redundancy. Thus, we devise a new dynamic multimodal transformer decoder that exploits this sparsity prior to speed up the visual grounding process. Specifically, our dynamic decoder is composed of a 2D adaptive sampling module and a text-guided decoding module. The sampling module selects informative patches by predicting offsets with respect to a reference point, while the decoding module extracts the grounded object information by performing cross-attention between image features and text features. These two modules are stacked alternately to gradually bridge the modality gap and iteratively refine the reference point of the grounded object, eventually realizing the objective of visual grounding. Extensive experiments on five benchmarks demonstrate that our proposed Dynamic MDETR achieves competitive trade-offs between computation and accuracy. Notably, using only 9% of the feature points in the decoder, we can reduce the GFLOPs of the multimodal transformer by ∼44% while still achieving higher accuracy than the encoder-only counterpart. With the same number of encoder layers as TransVG, our Dynamic MDETR (ResNet-50) outperforms TransVG (ResNet-101) while adding only marginal computational cost relative to TransVG. In addition, to verify its generalization ability and scale up our Dynamic MDETR, we build the first one-stage CLIP-empowered visual grounding framework and achieve state-of-the-art performance on these benchmarks.
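The alternation described in the abstract (sample a few patches around a reference point, then cross-attend between image and text features) can be pictured with a minimal NumPy sketch. This is not the authors' implementation: the function names, the random projection `w_offset`, the use of text tokens as attention queries, the attention-weighted update of the reference point, and the 20x20 feature grid are all assumptions made for illustration. The grid size is chosen only so that 36 sampled points out of 400 equals the 9% ratio quoted above.

```python
import numpy as np


def bilinear_sample(feat, points):
    """Bilinearly sample feat (H, W, C) at in-range (x, y) coordinates; returns (P, C)."""
    h, w, _ = feat.shape
    x, y = points[:, 0], points[:, 1]
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)  # stay inside the map at the right/bottom edge
    y1 = np.minimum(y0 + 1, h - 1)
    wx = (x - x0)[:, None]
    wy = (y - y0)[:, None]
    top = feat[y0, x0] * (1 - wx) + feat[y0, x1] * wx
    bottom = feat[y1, x0] * (1 - wx) + feat[y1, x1] * wx
    return top * (1 - wy) + bottom * wy


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def sample_then_decode(feat, text, embed, ref_point, w_offset):
    """One illustrative stage: sparse sampling followed by text-image cross-attention."""
    h, w, c = feat.shape
    # Sampling (assumed form): offsets relative to the reference point, one (dx, dy) per point.
    offsets = (embed @ w_offset).reshape(-1, 2)                       # (P, 2)
    points = np.clip(ref_point + offsets, 0.0, [w - 1.0, h - 1.0])    # (P, 2)
    sampled = bilinear_sample(feat, points)                           # (P, C)
    # Decoding (assumed form): text tokens act as queries over the sampled image features.
    attn = softmax(text @ sampled.T / np.sqrt(c))                     # (T, P)
    new_embed = (attn @ sampled).mean(axis=0)                         # (C,)
    # Hypothetical refinement: move the reference point to the attention-weighted sample mean.
    new_ref = attn.mean(axis=0) @ points                              # (2,)
    return new_embed, new_ref, attn


rng = np.random.default_rng(0)
H = W = 20   # assumed feature-map size (400 feature points)
C, T, P = 16, 5, 36   # channels, text tokens, sampled points (36 / 400 = 9%)
feat = rng.normal(size=(H, W, C))
text = rng.normal(size=(T, C))
w_offset = rng.normal(size=(C, 2 * P))   # stand-in for a learned offset predictor

embed = text.mean(axis=0)
ref = np.array([(W - 1) / 2.0, (H - 1) / 2.0])   # start at the image centre
for _ in range(3):   # stack sampling and decoding alternately
    embed, ref, attn = sample_then_decode(feat, text, embed, ref, w_offset)

print("fraction of feature points used:", P / (H * W))
print("refined reference point (x, y):", ref)
```

In this sketch the attention matrix has shape (T, P) rather than covering all H·W positions, which is where a saving from sparse sampling would come from; the actual cost reduction reported in the abstract depends on the authors' architecture, not on this toy example.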

Title:

Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding
Contributors:

Shi, Fengyuan ( author ) / Gao, Ruopeng ( author ) / Huang, Weilin ( author ) / Wang, Limin ( author )
Published in:

IEEE Transactions on Pattern Analysis and Machine Intelligence ; 46, 2 ; 1181-1198
Publisher:

IEEE

Publication date:

2024-02-01
Size:

11806254 byte
ISSN:

0162-8828, 2160-9292, 1939-3539
DOI:

https://doi.org/10.1109/TPAMI.2023.3328185
Type of media:

Article (Journal)
Type of material:

Electronic Resource
Language:

English
Source:

IEEE

Table of contents – Volume 46, Issue 2

The tables of contents are generated automatically and are based on the data records of the individual contributions available in the index of the TIB portal. The display of the Tables of Contents may therefore be incomplete.

667: Diffusion Mechanism in Residual Neural Network: Theory and Applications
Wang, Tangjun / Dou, Zehao / Bao, Chenglong / Shi, Zuoqiang et al. | 2024
digital version
681: OPAL: Occlusion Pattern Aware Loss for Unsupervised Light Field Disparity Estimation
Li, Peng / Zhao, Jiayin / Wu, Jingyao / Deng, Chao / Han, Yuqi / Wang, Haoqian / Yu, Tao et al. | 2024
digital version
695: An Asynchronous Linear Filter Architecture for Hybrid Event-Frame Cameras
Wang, Ziwei / Ng, Yonhon / Scheerlinck, Cedric / Mahony, Robert et al. | 2024
digital version
712: Generalizable Heterogeneous Federated Cross-Correlation and Instance Similarity Learning
Huang, Wenke / Ye, Mang / Shi, Zekun / Du, Bo et al. | 2024
digital version
729: Blockchain Data Mining With Graph Learning: A Survey
Qi, Yuxin / Wu, Jun / Xu, Hansong / Guizani, Mohsen et al. | 2024
digital version
749: Understanding and Accelerating Neural Architecture Search With Training-Free and Theory-Grounded Metrics
Chen, Wuyang / Gong, Xinyu / Wu, Junru / Wei, Yunchao / Shi, Humphrey / Yan, Zhicheng / Yang, Yi / Wang, Zhangyang et al. | 2024
digital version
764: Image Captioning With Controllable and Adaptive Length Levels
Ding, Ning / Deng, Chaorui / Tan, Mingkui / Du, Qing / Ge, Zhiwei / Wu, Qi et al. | 2024
digital version
780: Tessellating the Latent Space for Non-Adversarial Generative Auto-Encoders
Gai, Kuo / Zhang, Shihua et al. | 2024
digital version
793: Progressive Learning of 3D Reconstruction Network From 2D GAN Data
Dundar, Aysegul / Gao, Jun / Tao, Andrew / Catanzaro, Bryan et al. | 2024
digital version
805: COLD Fusion: Calibrated and Ordinal Latent Distribution Fusion for Uncertainty-Aware Multimodal Emotion Recognition
Tellamekala, Mani Kumar / Amiriparian, Shahin / Schuller, Bjorn W. / Andre, Elisabeth / Giesbrecht, Timo / Valstar, Michel et al. | 2024
digital version
823: Video Frame Interpolation With Many-to-Many Splatting and Spatial Selective Refinement
Hu, Ping / Niklaus, Simon / Zhang, Lu / Sclaroff, Stan / Saenko, Kate et al. | 2024
digital version
837: Non-Fluent Synthetic Target-Language Data Improve Neural Machine Translation
Sanchez-Cartagena, Victor M. / Espla-Gomis, Miquel / Perez-Ortiz, Juan Antonio / Sanchez-Martinez, Felipe et al. | 2024
digital version
851: Explanatory Object Part Aggregation for Zero-Shot Learning
Chen, Xin / Deng, Xiaoling / Lan, Yubin / Long, Yongbing / Weng, Jian / Liu, Zhiquan / Tian, Qi et al. | 2024
digital version
869: Cost Function Unrolling in Unsupervised Optical Flow
Lifshitz, Gal / Raviv, Dan et al. | 2024
digital version
881: Deep Image Matting With Sparse User Interactions
Wei, Tianyi / Chen, Dongdong / Zhou, Wenbo / Liao, Jing / Zhao, Hanqing / Zhang, Weiming / Hua, Gang / Yu, Nenghai et al. | 2024
digital version
896: MetaFormer Baselines for Vision
Yu, Weihao / Si, Chenyang / Zhou, Pan / Luo, Mi / Zhou, Yichen / Feng, Jiashi / Yan, Shuicheng / Wang, Xinchao et al. | 2024
digital version
913: CGOF++: Controllable 3D Face Synthesis With Conditional Generative Occupancy Fields
Sun, Keqiang / Wu, Shangzhe / Zhang, Ning / Huang, Zhaoyang / Wang, Quan / Li, Hongsheng et al. | 2024
digital version
927: Reliable Event Generation With Invertible Conditional Normalizing Flow
Gu, Daxin / Li, Jia / Zhu, Lin / Zhang, Yu / Ren, Jimmy S. et al. | 2024
digital version
944: WOOD: Wasserstein-Based Out-of-Distribution Detection
Wang, Yinan / Sun, Wenbo / Jin, Jionghua / Kong, Zhenyu / Yue, Xiaowei et al. | 2024
digital version
957: Mitigating Confounding Bias in Practical Recommender Systems With Partially Inaccessible Exposure Status
Cao, Tianwei / Xu, Qianqian / Yang, Zhiyong / Huang, Qingming et al. | 2024
digital version
975: 3-D Point Cloud Attribute Compression With p-Laplacian Embedding Graph Dictionary Learning
Li, Xin / Dai, Wenrui / Li, Shaohui / Li, Chenglin / Zou, Junni / Xiong, Hongkai et al. | 2024
digital version
994: Room-Object Entity Prompting and Reasoning for Embodied Referring Expression
Gao, Chen / Liu, Si / Chen, Jinyu / Wang, Luting / Wu, Qi / Li, Bo / Tian, Qi et al. | 2024
digital version
1011: Temporal Action Segmentation: An Analysis of Modern Techniques
Ding, Guodong / Sener, Fadime / Yao, Angela et al. | 2024
digital version
1031: Variance Reduced Domain Randomization for Reinforcement Learning With Policy Gradient
Jiang, Yuankun / Li, Chenglin / Dai, Wenrui / Zou, Junni / Xiong, Hongkai et al. | 2024
digital version
1049: Learning Hierarchical Modular Networks for Video Captioning
Li, Guorong / Ye, Hanhua / Qi, Yuankai / Wang, Shuhui / Qing, Laiyun / Huang, Qingming / Yang, Ming-Hsuan et al. | 2024
digital version
1065: A Theoretical Analysis of DeepWalk and Node2vec for Exact Recovery of Community Structures in Stochastic Blockmodels
Zhang, Yichi / Tang, Minh et al. | 2024
digital version
1079: SPLiT: Single Portrait Lighting Estimation via a Tetrad of Face Intrinsics
Fei, Fan / Cheng, Yean / Zhu, Yongjie / Zheng, Qian / Li, Si / Pan, Gang / Shi, Boxin et al. | 2024
digital version
1093: Image Restoration via Frequency Selection
Cui, Yuning / Ren, Wenqi / Cao, Xiaochun / Knoll, Alois et al. | 2024
digital version
1109: A Theoretical Analysis of Density Peaks Clustering and the Component-Wise Peak-Finding Algorithm
Tobin, Joshua / Zhang, Mimi et al. | 2024
digital version
1121: Learning Interpretable Rules for Scalable Data Representation and Classification
Wang, Zhuo / Zhang, Wei / Liu, Ning / Wang, Jianyong et al. | 2024
digital version
1134: Optimal Composite Likelihood Estimation and Prediction for Distributed Gaussian Process Modeling
Li, Yongxiang / Zhou, Qiang / Jiang, Wei / Tsui, Kwok-Leung et al. | 2024
digital version
1148: Differentiable Image Data Augmentation and Its Applications: A Survey
Shi, Jian / Ghazzai, Hakim / Massoud, Yehia et al. | 2024
digital version
1165: Back to Reality: Learning Data-Efficient 3D Object Detector With Shape Guidance
Xu, Xiuwei / Wang, Ziwei / Zhou, Jie / Lu, Jiwen et al. | 2024
digital version
1181: Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding
Shi, Fengyuan / Gao, Ruopeng / Huang, Weilin / Wang, Limin et al. | 2024
digital version
1199: False Correlation Reduction for Offline Reinforcement Learning
Deng, Zhihong / Fu, Zuyue / Wang, Lingxiao / Yang, Zhuoran / Bai, Chenjia / Zhou, Tianyi / Wang, Zhaoran / Jiang, Jing et al. | 2024
digital version
1212: ViTPose++: Vision Transformer for Generic Body Pose Estimation
Xu, Yufei / Zhang, Jing / Zhang, Qiming / Tao, Dacheng et al. | 2024
digital version
1231: Importance Weighted Structure Learning for Scene Graph Generation
Liu, Daqi / Bober, Miroslaw / Kittler, Josef et al. | 2024
digital version
1243: Multi-Stage Asynchronous Federated Learning With Adaptive Differential Privacy
Li, Yanan / Yang, Shusen / Ren, Xuebin / Shi, Liang / Zhao, Cong et al. | 2024
digital version
1257: LayerNet: High-Resolution Semantic 3D Reconstruction of Clothed People
Corona, Enric / Alenya, Guillem / Pons-Moll, Gerard / Moreno-Noguer, Francesc et al. | 2024
digital version
1273: PFENet++: Boosting Few-Shot Semantic Segmentation With the Noise-Filtered Context-Aware Prior Mask
Luo, Xiaoliu / Tian, Zhuotao / Zhang, Taiping / Yu, Bei / Tang, Yuan Yan / Jia, Jiaya et al. | 2024
digital version
1290: Tobias: A Random CNN Sees Objects
Cao, Yun-Hao / Wu, Jianxin et al. | 2024
digital version
1305: Inequality-Constrained 3D Morphable Face Model Fitting
Sariyanidi, Evangelos / Zampella, Casey J. / Schultz, Robert T. / Tunc, Birkan et al. | 2024
digital version
