LeMon: Automating Portrait Generation for Zero-Shot Story Visualization with Multi-Character Interactions | Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2024)

Authors: Ziyi Kou, Shichao Pei, Xiangliang Zhang

KDD '24: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Pages 1418 - 1427

Published: 24 August 2024

Abstract

Zero-Shot Story Visualization (ZSV) seeks to depict textual narratives through a sequence of images without relying on pre-existing text-image pairs for training. In this paper, we address the challenge of automated multi-character ZSV, aiming to create distinctive yet compatible character portraits for high-quality story visualization without the need for manual human intervention. Our study is motivated by the limitations of current ZSV approaches, which require the inefficient manual collection of external images as initial character portraits and suffer from low-quality story visualization, especially with multi-character interactions, when the portraits are not well initialized. To overcome these issues, we develop LeMon, an LLM-enhanced Multi-Character Zero-Shot Visualization framework that automates character portrait initialization and supports iterative portrait refinement by exploring the semantic content of the story. In particular, we design an LLM-based portrait generation strategy that matches the story characters with external movie characters, and leverage the matched resources as in-context learning (ICL) samples for LLMs to accurately initialize the character portraits. We then propose a graph-based Text2Image diffusion model that constructs a character interaction graph from the story to iteratively refine the character portraits, maximizing the distinctness of different characters while minimizing their incompatibility in the multi-character story visualization. Our evaluation results show that LeMon outperforms existing ZSV approaches in generating high-quality visualizations for various types of stories with multiple interacting characters. Our code is available at https://github.com/arxrean/LLM-LeMon.
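The abstract names two technical components: LLM-driven portrait initialization via in-context learning, and a character interaction graph that guides iterative portrait refinement toward distinct yet compatible characters. The Python sketch below is not the authors' implementation; the function names (build_interaction_graph, refinement_score), the sentence-level co-occurrence weighting, the cosine-similarity scoring, and the alpha/beta weights are all illustrative assumptions showing how such an interaction graph could be built and how a distinctness-versus-compatibility trade-off could be scored.

```python
# Hypothetical sketch (not the authors' code): build a character interaction
# graph from sentence-level co-occurrence, then score a set of portrait
# embeddings so that interacting characters look distinct from each other yet
# remain compatible with a shared scene. Embeddings are random stand-ins.
import itertools
import numpy as np


def build_interaction_graph(sentences, characters):
    """Weight edge (a, b) by the number of sentences mentioning both a and b."""
    edges = {}
    for sent in sentences:
        present = [c for c in characters if c.lower() in sent.lower()]
        for a, b in itertools.combinations(sorted(present), 2):
            edges[(a, b)] = edges.get((a, b), 0) + 1
    return edges


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))


def refinement_score(edges, portrait_emb, scene_emb, alpha=1.0, beta=1.0):
    """Higher is better: interacting characters get dissimilar portrait
    embeddings (distinctness) while each stays similar to the shared scene
    embedding (compatibility)."""
    distinct, compatible = 0.0, 0.0
    for (a, b), weight in edges.items():
        distinct += weight * (1.0 - cosine(portrait_emb[a], portrait_emb[b]))
        compatible += weight * (cosine(portrait_emb[a], scene_emb) +
                                cosine(portrait_emb[b], scene_emb)) / 2.0
    return alpha * distinct + beta * compatible


if __name__ == "__main__":
    story = ["Mia met Leo at the harbor.",
             "Leo waved while Mia laughed.",
             "Mia walked home alone."]
    characters = ["Mia", "Leo"]
    rng = np.random.default_rng(0)
    portraits = {c: rng.normal(size=64) for c in characters}  # stand-in portrait features
    scene = rng.normal(size=64)                               # stand-in scene feature
    graph = build_interaction_graph(story, characters)
    print(graph)                                   # {('Leo', 'Mia'): 2}
    print(round(refinement_score(graph, portraits, scene), 3))
```

In LeMon itself, the portraits are generated and iteratively refined by the graph-based Text2Image diffusion model rather than sampled at random; the snippet only makes the distinctness-versus-compatibility objective described in the abstract concrete.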

Supplemental Material

MP4 File (27.46 MB): 2-minute promo video

In this video, we briefly discuss the motivation and implementation of our work, LeMon. The video contains no audio but includes built-in descriptions of the major components of LeMon.

Index Terms

  • Computing methodologies
    • Artificial intelligence
      • Computer vision
        • Computer vision representations
          • Image representations
      • Machine learning
        • Learning settings
          • Machine learning algorithms

Published In

KDD '24: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 2024

6901 pages

ISBN: 9798400704901

DOI: 10.1145/3637528

General Chairs:

  • Ricardo Baeza-Yates (Northeastern University, USA)
  • Francesco Bonchi (CENTAI / Eurecat, Italy)

Copyright © 2024 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

  • SIGMOD: ACM Special Interest Group on Management of Data
  • SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 24 August 2024

Author Tags

  1. LLMs
  2. story visualization
  3. text-to-image generation

Qualifiers

  • Research article

Conference

KDD '24

Sponsors:

  • SIGMOD
  • SIGKDD

Acceptance Rates

Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
