LeMon: Automating Portrait Generation for Zero-Shot Story Visualization with Multi-Character Interactions

Research article (free access)
Authors: Ziyi Kou, Shichao Pei, Xiangliang Zhang
KDD '24: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
Pages 1418-1427
Published: 24 August 2024
Abstract
Zero-Shot Story Visualization (ZSV) seeks to depict textual narratives through a sequence of images without relying on pre-existing text-image pairs for training. In this paper, we address the challenge of automated multi-character ZSV, aiming to create distinctive yet compatible character portraits for high-quality story visualization without the need for manual human intervention. Our study is motivated by the limitations of current ZSV approaches, which require the inefficient manual collection of external images as initial character portraits and suffer from low-quality story visualization, especially with multi-character interactions, when the portraits are not well initialized. To overcome these issues, we develop LeMon, an LLM-enhanced Multi-Character Zero-Shot Visualization framework that automates character portrait initialization and supports iterative portrait refinement by exploring the semantic content of the story. In particular, we design an LLM-based portrait generation strategy that matches the story characters with external movie characters and leverages the matched resources as in-context learning (ICL) samples for LLMs to accurately initialize the character portraits. We then propose a graph-based Text2Image diffusion model that constructs a character interaction graph from the story to iteratively refine the character portraits, maximizing the distinctness of different characters while minimizing their incompatibility in the multi-character story visualization. Our evaluation results show that LeMon outperforms existing ZSV approaches in generating high-quality visualizations for stories of various types with multiple interacting characters. Our code is available at https://github.com/arxrean/LLM-LeMon.
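The refinement objective described in the abstract, making characters distinct from one another while keeping them mutually compatible in shared scenes, can be illustrated with a simple scoring function over portrait embeddings. This is a hedged sketch, not the authors' implementation: the use of unit-normalized embedding vectors, the shared story-style embedding, and the weights `alpha` and `beta` are all assumptions made for illustration.

```python
import numpy as np

def refinement_score(portrait_embs, style_emb, alpha=1.0, beta=1.0):
    """Score a candidate set of character portrait embeddings.

    Higher is better: portraits of different characters should be
    distinct from one another (large pairwise cosine distance), yet
    each should remain compatible with a shared story-style embedding
    (large cosine similarity to it).
    """
    E = np.asarray(portrait_embs, dtype=float)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize rows
    s = np.asarray(style_emb, dtype=float)
    s = s / np.linalg.norm(s)

    # Distinctness: mean pairwise cosine distance between characters.
    sims = E @ E.T
    n = len(E)
    off_diag = sims[~np.eye(n, dtype=bool)]
    distinctness = 1.0 - off_diag.mean()

    # Compatibility: mean cosine similarity to the shared style.
    compatibility = (E @ s).mean()

    return alpha * distinctness + beta * compatibility

# Hypothetical usage: pick the best of several candidate portrait sets
# for three characters (random 512-d vectors stand in for image features).
rng = np.random.default_rng(0)
candidates = [rng.normal(size=(3, 512)) for _ in range(5)]
style = rng.normal(size=512)
best = max(candidates, key=lambda c: refinement_score(c, style))
```

An iterative refinement loop would regenerate portraits for the lowest-scoring characters and rescore, stopping when the score plateaus; the actual LeMon objective operates through the character interaction graph and the diffusion model rather than a scalar score like this.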
Supplemental Material
MP4 file (27.46 MB): a 2-minute promo video in which we briefly discuss the motivation and implementation of LeMon. The video contains no audio but includes built-in descriptions of LeMon's major components.
Index Terms
LeMon: Automating Portrait Generation for Zero-Shot Story Visualization with Multi-Character Interactions
- Computing methodologies
  - Artificial intelligence
    - Computer vision
      - Computer vision representations
        - Image representations
  - Machine learning
    - Learning settings
    - Machine learning algorithms
Recommendations
- Character-Preserving Coherent Story Visualization
Computer Vision – ECCV 2020
Abstract
Story visualization aims at generating a sequence of images to narrate each sentence in a multi-sentence story. Different from video generation that focuses on maintaining the continuity of generated images (frames), story visualization emphasizes ...
Read More
- Hippocampus-heuristic character recognition network for zero-shot learning in Chinese character recognition
Highlights
- A novel hippocampus-heuristic character recognition network (HCRN) is proposed for zero/few-shot learning.
Abstract
The recognition of Chinese characters has always been a challenging task due to their huge variety and complex structures. The current radical-based methods fail to recognize Chinese characters without learning all of their radicals in ...
Read More
- Zero-shot Generation ofTraining Data withDenoising Diffusion Probabilistic Model forHandwritten Chinese Character Recognition
Document Analysis and Recognition - ICDAR 2023
Abstract
There are more than 80,000 character categories in Chinese while most of them are rarely used. To build a high performance handwritten Chinese character recognition (HCCR) system supporting the full character set with a traditional approach, many ...
Read More
Comments
Information & Contributors
Information
Published In
KDD '24: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
August 2024, 6901 pages
ISBN: 9798400704901
DOI: 10.1145/3637528
General Chairs:
- Ricardo Baeza-Yates (Northeastern University, USA)
- Francesco Bonchi (CENTAI / Eurecat, Italy)
Copyright © 2024 ACM.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Sponsors
- SIGMOD: ACM Special Interest Group on Management of Data
- SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
- LLMs
- story visualization
- text-to-image generation
Qualifiers
- Research-article
Conference
KDD '24
KDD '24: The 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
August 25 - 29, 2024
Barcelona, Spain
Acceptance Rates
Overall Acceptance Rate 1,133 of 8,635 submissions, 13%