VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance (English)

Crowson, Katherine / Biderman, Stella / Kornis, Daniel / Stander, Dashiell / Hallahan, Eric / Castricato, Louis / Raff, Edward

In: Computer Vision – ECCV 2022 : 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXVII ; Chapter: 6 ; 88-105 ; 2022

ISBN:

978-3-031-19836-6, 978-3-031-19835-9

ISSN:

1611-3349, 0302-9743

Article/Chapter (Book) / Electronic Resource

Generating and editing images from open domain text prompts is a challenging task that heretofore has required expensive and specially trained models. We demonstrate a novel methodology for both tasks which is capable of producing images of high visual quality from text prompts of significant semantic complexity without any training by using a multimodal encoder to guide image generations. We demonstrate on a variety of tasks how using CLIP [37] to guide VQGAN [11] produces higher visual quality outputs than prior, less flexible approaches like minDALL-E [19], GLIDE [33] and Open-Edit [24], despite not being trained for the tasks presented. Our code is available in a public repository.
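As a rough illustration of what "using a multimodal encoder to guide image generations" without training can mean, the sketch below keeps a generator and an image encoder frozen and adjusts only the latent input so that the encoded output moves toward a text embedding. It is a toy under stated assumptions, not the paper's pipeline: `decoder`, `encoder`, and `text_emb` are hypothetical random stand-ins for VQGAN, the CLIP image encoder, and an encoded prompt, and the function names are invented for this example.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_grad(u, t):
    """Gradient of cosine(u, t) with respect to u."""
    nu, nt = np.linalg.norm(u), np.linalg.norm(t)
    return t / (nu * nt) - (u @ t) * u / (nu**3 * nt)

def guide_latent(decoder, encoder, text_emb, z0, steps=200, lr=0.5):
    """Ascend cosine(encoder @ decoder @ z, text_emb) over z only.

    The decoder and encoder weights are never updated; a step is kept
    only if it improves the similarity, otherwise the step size is halved.
    """
    z = z0.copy()
    M = encoder @ decoder                  # frozen linear stand-in models
    for _ in range(steps):
        u = M @ z                          # "image embedding" of the decoded latent
        candidate = z + lr * (M.T @ cosine_grad(u, text_emb))
        if cosine(M @ candidate, text_emb) > cosine(u, text_emb):
            z = candidate
        else:
            lr *= 0.5
    return z

rng = np.random.default_rng(0)
decoder = rng.normal(size=(64, 16))   # hypothetical stand-in for a frozen generator
encoder = rng.normal(size=(8, 64))    # hypothetical stand-in for a frozen image encoder
text_emb = rng.normal(size=8)         # hypothetical stand-in for an encoded text prompt
z0 = rng.normal(size=16)              # initial latent

z = guide_latent(decoder, encoder, text_emb, z0)
before = cosine(encoder @ decoder @ z0, text_emb)
after = cosine(encoder @ decoder @ z, text_emb)
print(f"similarity before: {before:.3f}, after: {after:.3f}")
```

The only quantity that changes is the latent `z`; this is the sense in which such guidance needs no training of either model. The real networks are nonlinear, so an actual implementation would rely on automatic differentiation rather than the closed-form gradient used here.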

Title:

VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance
Additional title:

Lect.Notes Computer
Contributors:

Avidan, Shai ( editor ) / Brostow, Gabriel ( editor ) / Cissé, Moustapha ( editor ) / Farinella, Giovanni Maria ( editor ) / Hassner, Tal ( editor ) / Crowson, Katherine ( author ) / Biderman, Stella ( author ) / Kornis, Daniel ( author ) / Stander, Dashiell ( author ) / Hallahan, Eric ( author ) / Castricato, Louis ( author ) / Raff, Edward ( author )
Conference:

European Conference on Computer Vision ; 2022 ; Tel Aviv, Israel
Published in:

Computer Vision – ECCV 2022 : 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXVII ; Chapter: 6 ; 88-105

Lecture Notes in Computer Science ; 13697 ; 88-105
Publisher:

Springer Nature Switzerland

Place of publication:

Cham
Publication date:

2022-10-22
Size:

18 pages
ISBN:

978-3-031-19836-6, 978-3-031-19835-9
ISSN:

1611-3349, 0302-9743
DOI:

https://doi.org/10.1007/978-3-031-19836-6_6
Type of media:

Article/Chapter (Book)
Type of material:

Electronic Resource
Language:

English
Keywords:

Generative adversarial networks, Grounded language, Image manipulation

Computer Science, Image Processing and Computer Vision
Source:

Springer Verlag

Table of contents eBook

The tables of contents are generated automatically and are based on the data records of the individual contributions available in the index of the TIB portal. The display of the Tables of Contents may therefore be incomplete.

1: Most and Least Retrievable Images in Visual-Language Query Systems
Zhu, Liuwan / Ning, Rui / Li, Jiang / Xin, Chunsheng / Wu, Hongyi et al. | 2022
2: Sports Video Analysis on Large-Scale Data
Wu, Dekun / Zhao, He / Bao, Xingce / Wildes, Richard P. et al. | 2022
3: Grounding Visual Representations with Texts for Domain Generalization
Min, Seonwoo / Park, Nokyung / Kim, Siwon / Park, Seunghyun / Kim, Jinkyu et al. | 2022
4: Bridging the Visual Semantic Gap in VLN via Semantically Richer Instructions
Ossandón, Joaquín / Earle, Benjamín / Soto, Álvaro et al. | 2022
5: StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation
Maharana, Adyasha / Hannan, Darryl / Bansal, Mohit et al. | 2022
6: VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance
Crowson, Katherine / Biderman, Stella / Kornis, Daniel / Stander, Dashiell / Hallahan, Eric / Castricato, Louis / Raff, Edward | 2022
7: Semantic-Aware Implicit Neural Audio-Driven Video Portrait Generation
Liu, Xian / Xu, Yinghao / Wu, Qianyi / Zhou, Hang / Wu, Wayne / Zhou, Bolei et al. | 2022
8: End-to-End Active Speaker Detection
Alcázar, Juan León / Cordes, Moritz / Zhao, Chen / Ghanem, Bernard et al. | 2022
9: Emotion Recognition for Multiple Context Awareness
Yang, Dingkang / Huang, Shuai / Wang, Shunli / Liu, Yang / Zhai, Peng / Su, Liuzhen / Li, Mingcheng / Zhang, Lihua et al. | 2022
10: Adaptive Fine-Grained Sketch-Based Image Retrieval
Bhunia, Ayan Kumar / Sain, Aneeshan / Shah, Parth Hiren / Gupta, Animesh / Chowdhury, Pinaki Nath / Xiang, Tao / Song, Yi-Zhe et al. | 2022
11: Quantized GAN for Complex Music Generation from Dance Videos
Zhu, Ye / Olszewski, Kyle / Wu, Yu / Achlioptas, Panos / Chai, Menglei / Yan, Yan / Tulyakov, Sergey et al. | 2022
12: Uncertainty-Aware Multi-modal Learning via Cross-Modal Random Network Prediction
Wang, Hu / Zhang, Jianpeng / Chen, Yuanhong / Ma, Congbo / Avery, Jodie / Hull, Louise / Carneiro, Gustavo et al. | 2022
13: Localizing Visual Sounds the Easy Way
Mo, Shentong / Morgado, Pedro et al. | 2022
14: Learning Visual Styles from Audio-Visual Associations
Li, Tingle / Liu, Yichen / Owens, Andrew / Zhao, Hang et al. | 2022
15: Remote Respiration Monitoring of Moving Person Using Radio Signals
Choi, Jae-Ho / Kang, Ki-Bong / Kim, Kyung-Tae et al. | 2022
16: Camera Pose Estimation and Localization with Active Audio Sensing
Yang, Karren / Firman, Michael / Brachmann, Eric / Godard, Clément et al. | 2022
17: PACS: A Dataset for Physical Audiovisual CommonSense Reasoning
Yu, Samuel / Wu, Peter / Liang, Paul Pu / Salakhutdinov, Ruslan / Morency, Louis-Philippe et al. | 2022
18: VoViT: Low Latency Graph-Based Audio-Visual Voice Separation Transformer
Montesinos, Juan F. / Kadandale, Venkatesh S. / Haro, Gloria et al. | 2022
19: Telepresence Video Quality Assessment
Ying, Zhenqiang / Ghadiyaram, Deepti / Bovik, Alan et al. | 2022
20: MultiMAE: Multi-modal Multi-task Masked Autoencoders
Bachmann, Roman / Mizrahi, David / Atanov, Andrei / Zamir, Amir et al. | 2022
21: AudioScopeV2: Audio-Visual Attention Architectures for Calibrated Open-Domain On-Screen Sound Separation
Tzinis, Efthymios / Wisdom, Scott / Remez, Tal / Hershey, John R. et al. | 2022
22: Audio–Visual Segmentation
Zhou, Jinxing / Wang, Jianyuan / Zhang, Jiayi / Sun, Weixuan / Zhang, Jing / Birchfield, Stan / Guo, Dan / Kong, Lingpeng / Wang, Meng / Zhong, Yiran et al. | 2022
23: Unsupervised Night Image Enhancement: When Layer Decomposition Meets Light-Effects Suppression
Jin, Yeying / Yang, Wenhan / Tan, Robby T. et al. | 2022
24: Relationformer: A Unified Framework for Image-to-Graph Generation
Shit, Suprosanna / Koner, Rajat / Wittmann, Bastian / Paetzold, Johannes / Ezhov, Ivan / Li, Hongwei / Pan, Jiazhen / Sharifzadeh, Sahand / Kaissis, Georgios / Tresp, Volker et al. | 2022
25: GAMa: Cross-View Video Geo-Localization
Vyas, Shruti / Chen, Chen / Shah, Mubarak et al. | 2022
26: Revisiting a kNN-Based Image Classification System with High-Capacity Storage
Nakata, Kengo / Ng, Youyang / Miyashita, Daisuke / Maki, Asuka / Lin, Yu-Chieh / Deguchi, Jun et al. | 2022
27: Geometric Representation Learning for Document Image Rectification
Feng, Hao / Zhou, Wengang / Deng, Jiajun / Wang, Yuechen / Li, Houqiang et al. | 2022
28: S$^{2}$-VER: Semi-supervised Visual Emotion Recognition
Jia, Guoli / Yang, Jufeng et al. | 2022
29: Image Coding for Machines with Omnipotent Feature Learning
Feng, Ruoyu / Jin, Xin / Guo, Zongyu / Feng, Runsen / Gao, Yixin / He, Tianyu / Zhang, Zhizheng / Sun, Simeng / Chen, Zhibo et al. | 2022
30: Feature Representation Learning for Unsupervised Cross-Domain Image Retrieval
Hu, Conghui / Lee, Gim Hee et al. | 2022
31: Fashionformer: A Simple, Effective and Unified Baseline for Human Fashion Segmentation and Recognition
Xu, Shilin / Li, Xiangtai / Wang, Jingbo / Cheng, Guangliang / Tong, Yunhai / Tao, Dacheng et al. | 2022
32: Semantic-Guided Multi-mask Image Harmonization
Ren, Xuqian / Liu, Yifan et al. | 2022
33: Learning an Isometric Surface Parameterization for Texture Unwrapping
Das, Sagnik / Ma, Ke / Shu, Zhixin / Samaras, Dimitris et al. | 2022
34: Towards Regression-Free Neural Networks for Diverse Compute Platforms
Duggal, Rahul / Zhou, Hao / Yang, Shuo / Fang, Jun / Xiong, Yuanjun / Xia, Wei et al. | 2022
35: Relationship Spatialization for Depth Estimation
Xu, Xiaoyu / Qiu, Jiayan / Wang, Xinchao / Wang, Zhou et al. | 2022
36: Image2Point: 3D Point-Cloud Understanding with 2D Image Pretrained Models
Xu, Chenfeng / Yang, Shijia / Galanti, Tomer / Wu, Bichen / Yue, Xiangyu / Zhai, Bohan / Zhan, Wei / Vajda, Peter / Keutzer, Kurt / Tomizuka, Masayoshi et al. | 2022
37: FAR: Fourier Aerial Video Recognition
Kothandaraman, Divya / Guan, Tianrui / Wang, Xijun / Hu, Shuowen / Lin, Ming / Manocha, Dinesh et al. | 2022
38: Translating a Visual LEGO Manual to a Machine-Executable Plan
Wang, Ruocheng / Zhang, Yunzhi / Mao, Jiayuan / Cheng, Chin-Yi / Wu, Jiajun et al. | 2022
39: Fabric Material Recovery from Video Using Multi-scale Geometric Auto-Encoder
Liang, Junbang / Lin, Ming et al. | 2022
40: MegBA: A GPU-Based Distributed Library for Large-Scale Bundle Adjustment
Ren, Jie / Liang, Wenteng / Yan, Ran / Mai, Luo / Liu, Shiwen / Liu, Xiao et al. | 2022
41: The One Where They Reconstructed 3D Humans and Environments in TV Shows
Pavlakos, Georgios / Weber, Ethan / Tancik, Matthew / Kanazawa, Angjoo et al. | 2022
