CERTIFICATE
This is to certify that the report entitled ‘COMIC BOOK GENERATOR USING STABLE
DIFFUSION’ submitted by GAYATHRI H (LBT21CS037), JAMIE MATHEW
(LBT21CS050), JYOTHIKA JOHNSON (LBT21CS055), KRISHNA VASUDEVAN
(LBT21CS062) to the APJ Abdul Kalam Technological University in partial fulfilment of the
requirements for the award of the Degree of Bachelor of Technology in Computer Science
and Engineering is a bonafide record of the project work carried out by them under our
guidance and supervision. This report in any form has not been submitted to any other
University or Institute for any purpose.
We, the undersigned, hereby declare that the project titled ‘COMIC BOOK GENERATOR
USING STABLE DIFFUSION’ submitted in partial fulfilment of the requirements for the
Bachelor of Technology degree at APJ Abdul Kalam Technological University, Kerala,
represents our original work conducted under the supervision of Prof. Gisha G. S., Assistant
Professor, Department of Computer Science and Engineering, LBS Institute of Technology
for Women, Poojappura. We affirm that this submission reflects our own ideas and that any
contributions from external sources have been accurately cited and referenced. We attest to
our adherence to the principles of academic honesty and integrity, ensuring that all data,
ideas, facts, and sources have been presented truthfully and ethically. We acknowledge that
any breach of academic integrity or misrepresentation of data may result in disciplinary
action by the institute and/or the University Furthermore, we confirm that this report has not
been previously used to obtain any degree, diploma,or similar title from other academic
institutions.
We would like to express our sincere gratitude to all the people who have guided and assisted
us through the course of this work. First and foremost, we wish to express our deep and
sincere gratitude to our Principal, Dr. Smithamol M B for providing us with all the facilities
and infrastructure for the completion of this work. We record our sincere gratitude and thanks
to Dr. Anitha Kumari S, Head of the Department, Department of Computer Science and
Engineering, for providing us with all the facilities for the completion of this work. We would
like to express our sincere gratitude to our project coordinator Dr. Sreejith S, Assistant
Professor, Department of Computer Science and Engineering, for their valuable assistance
provided during the course of the project. We would also like to express our sincere gratitude
to our project guide Prof. Gisha G. S., Assistant Professor, Department of Computer Science
and Engineering, for providing us constant guidance, support and immense encouragement
for the successful completion of this project. We would also like to thank all the faculty
members in the department for the valuable support and encouragement they rendered. Last
but not least, we thank all our friends and well-wishers for their kind cooperation,
immense support and useful suggestions and also for providing their truthful and illuminating
views on a number of issues related to the project.
GAYATHRI H
JAMIE MATHEW
JYOTHIKA JOHNSON
KRISHNA VASUDEVAN
TABLE OF CONTENTS
LIST OF FIGURES
CHAPTER 1. INTRODUCTION
1.1 Background
1.3 Objectives
4.1 Methodology
CHAPTER 5. CONCLUSION
REFERENCES
ABSTRACT
In today's digital era of rapid technological innovation, our project aims to revolutionise
storytelling by transforming textual narratives into fully illustrated comic books using
natural language processing and Stable Diffusion techniques. The system
identifies key elements such as characters, ensuring that the generated visuals align
seamlessly with the narrative context. This ensures that each comic panel reflects the original
text, enhancing the storytelling experience. Our system harnesses NLP models to thoroughly
analyse and interpret textual stories, capturing their essence and structure for further creative
transformation. Utilising Stable Diffusion, a powerful generative method, these narratives are
brought to life through visually stunning illustrations, crafting immersive comic book
experiences. This project aims to redefine storytelling in the digital realm, offering a dynamic
platform that merges language and art, fostering creativity and engagement among users from
diverse backgrounds. The integration of NLP and Stable Diffusion promises to revolutionise
the way we experience and interact with narrative content, empowering users to transform
their stories into vivid comic books effortlessly. To achieve this, we are implementing a
pipeline that begins with text processing using NLP models, such as Hugging Face
transformers, to analyse and summarise the input text effectively. Key details like dialogue
and scene descriptions are passed to the Stable Diffusion model, which generates images
based on these inputs. This approach democratises comic creation, enabling anyone to
produce visually compelling content without prior artistic training. Applications for
this project span various fields, including education, entertainment, and digital marketing,
where personalised, engaging visual narratives can captivate audiences. By combining these
advanced techniques with the timeless appeal of comics, we aim to bridge the gap between
imagination and reality, creating a transformative tool for storytelling in the digital age.
LIST OF FIGURES
LIST OF ABBREVIATIONS
AI Artificial Intelligence
UI User Interface
TTS Text-to-Speech
CHAPTER 1
INTRODUCTION
1.1 BACKGROUND
The idea of developing a comic book generator using Stable Diffusion is driven by the
aspiration to revolutionise storytelling and make visual creation more inclusive. Traditional
comic book creation requires significant artistic talent, time, and resources, which can be
barriers for many aspiring creators. This project aims to break down these barriers by
leveraging Stable Diffusion, a cutting-edge text-to-image generative model that transforms
textual descriptions into visually striking comic panels. By interpreting prompts describing
characters, scenes, and dialogues, the generator can produce high-quality illustrations,
enabling creators to focus on storytelling while the AI handles the artistic execution. This
innovation empowers individuals without advanced artistic skills to bring their visions to life,
fostering inclusivity in the creative process.
The choice of Stable Diffusion reflects its ability to generate high-resolution, stylistically
coherent visuals with intricate details. Fine-tuning the model on datasets of comic art styles
ensures adaptability to various genres, such as manga or Western comics, while maintaining
narrative consistency. This project aligns with the growing demand for diverse narratives and
tools that encourage experimentation in visual storytelling. It also has the potential to
revolutionise the comic industry by offering rapid prototyping and personalised storytelling
for small creators, educators, or individuals with disabilities. By blending human creativity
with AI efficiency, this project redefines how stories are told, inspiring new possibilities and
innovation in the world of comics. Furthermore, it opens doors for collaborative projects
between artists and writers, allowing them to bring their creative visions to life more
efficiently. The technology could also pave the way for educational applications, where
illustrated narratives can be generated to enhance learning experiences for students. In
essence, this project serves as a bridge between art and technology, unlocking endless
creative possibilities for people across the globe.
Beyond its immediate applications, this project has the potential to influence broader domains
by showcasing how AI can be a transformative tool in the creative arts. It could inspire
similar innovations in related fields such as animation, video game design, and digital
marketing, where visual storytelling plays a crucial role. Additionally, the project highlights
the role of technology in preserving cultural narratives, enabling communities to visualise
and share traditional stories in ways that resonate with modern audiences. By making the
process of visual content creation more accessible, this initiative encourages diverse voices to
contribute to the global storytelling landscape, fostering creativity, inclusivity, and cultural
exchange. As AI continues to evolve, projects like this exemplify how technology can
complement human imagination, enabling creators to push the boundaries of what is possible
in visual media.
1.2 PROBLEM STATEMENT
This project addresses the significant barriers to creating visually appealing and cohesive
comic books or illustrated stories. Traditional comic creation
demands a combination of artistic skills, substantial time, and financial resources, which can
exclude many talented storytellers who lack access to these assets. Additionally, manually
creating illustrations for every panel is time-consuming and limits the ability to experiment
with multiple styles or designs. For small creators, educators, or individuals with limited
resources, this makes the process of visual storytelling inaccessible. The absence of an
automated system that can seamlessly transform written narratives into high-quality visuals
further exacerbates this challenge. This limitation hinders creativity, as creators often have to
compromise on their artistic vision due to resource constraints. It also creates barriers to entry
for aspiring artists and writers who lack the means to produce professional-grade content. The
process of commissioning illustrations can be costly and time-consuming, further
discouraging new talent from pursuing visual storytelling. The gap between concept and
execution often leaves many innovative ideas unrealized, curbing diversity and representation
in the creative arts. Addressing these challenges is essential to democratizing access to tools
that enable storytelling, fostering inclusivity, and unlocking the full potential of creators
worldwide.
1.3 OBJECTIVES
● Output Generation: Users should have an option to download the picture book as a
PDF.
CHAPTER 2
LITERATURE REVIEW
In preparation for this project dedicated to transforming textual narratives into visually
compelling comic books using Stable Diffusion, we conducted a thorough review of pertinent
literature. Our selection included a diverse range of sources, such as conference papers,
research articles, and online publications, which were carefully curated and extensively
analysed. From this comprehensive review, five specific research papers have been
highlighted as key references for our discussion. These studies, along with their associated
methodologies and innovations, hold significant relevance for guiding the development of our
comic book generator system.
A. Warrier et al. [1] present a novel approach for criminal identification, overcoming the
limitations of traditional hand-drawn sketches based on witness descriptions. Traditional
methods are prone to human error and artistic interpretation, leading to inconsistencies and
inaccuracies. The proposed solution leverages stable diffusion models to generate realistic
facial images from textual descriptions and further refine them using text-based editing
techniques. The process involves three key components: face generation, manipulation, and
matching. The stable diffusion model, which avoids the issues of mode collapse seen in
Generative Adversarial Networks (GANs), provides more precise control over the image
creation process. It consists of a variational autoencoder (VAE), a U-Net model, and a text
encoder to convert text into image embeddings. The generated images are high-quality,
photorealistic, and customizable based on user inputs.
For face manipulation, text prompts are used to modify specific features of a generated
image, such as hair colour or facial expressions, while maintaining ethical standards. The
images are then matched against a criminal suspect database using facial recognition
algorithms, which compare the generated face with known suspects, providing a ranked
match score. The paper highlights the potential of diffusion-based models in criminal
investigations, offering higher-quality images and more fine-grained control than previous
methods like GANs. However, challenges remain, such as ensuring that textual descriptions
are accurate and fine-tuning the model to meet specific legal and ethical standards. Future
work suggests improving model performance and integrating more advanced language
models like GPT-3 to refine textual inputs. Additionally, expanding the model’s capability to
process multiple descriptive inputs could allow for a more comprehensive profile, helping
investigators create more accurate suspect images. Further research could also focus on
real-time application, enabling immediate adjustments based on evolving witness statements
to assist ongoing investigations.
Asmitha M et al. [2] investigate automated text summarization using Hugging Face NLP
models. The core idea of the study is to build a summarization system that is easy to implement while
maintaining high performance. Hugging Face’s transformer models, especially the
encoder-decoder architectures such as PEGASUS, are ideally suited for this task due to their
powerful attention mechanisms and their ability to generate human-like summaries. The
study uses publicly available datasets, such as CNN/Daily Mail, for training and evaluation.
To assess the performance of the summarization models, the study employs ROUGE metrics,
which are commonly used to evaluate the quality of summaries by comparing them to
reference summaries. The metrics include ROUGE-1, ROUGE-2, and ROUGE-L, which
measure recall, precision, and F1-score for unigrams, bigrams, and longest common
subsequences, respectively. The results show that Hugging Face’s transformer models
significantly outperform traditional summarization techniques, such as extractive
summarization, which simply selects key sentences from the original text. One of the
standout findings of the paper is the superior performance of PEGASUS, a model specifically
pre-trained for summarization tasks. PEGASUS achieves high scores in ROUGE evaluations,
showcasing its ability to generate abstractive summaries that are not only accurate but also
fluent and contextually relevant.
The paper concludes by emphasising the efficiency and effectiveness of Hugging Face’s NLP
models in automating text summarization. These models reduce the need for manual
summarization and make it easier for developers to integrate summarization capabilities into
applications with minimal effort. The authors also suggest potential areas for future research,
such as enhancing the models by using domain-specific datasets or exploring real-time
summarization applications.
Y. Liu and L. Wang [4] propose MYCloth, an intelligent and interactive online T-shirt
customization system based on the user's preferences, presenting an innovative approach
that utilises advanced AI techniques to enhance user experience.
Traditional T-shirt customization often faces communication challenges and limited visual
feedback, requiring users to repeatedly adjust designs in a cumbersome process. MYCloth
addresses these issues by leveraging AI models like ChatGPT for refining text inputs and
Stable Diffusion for generating design patterns. MYCloth introduces an intelligent system
that allows users to input text descriptions for design themes. ChatGPT refines these
descriptions, and the Stable Diffusion model generates high-quality prints based on the
refined input. Users can then visually adjust and preview their T-shirt designs using a novel
virtual try-on model. This try-on model uses deep learning techniques to realistically display
the designed T-shirt on virtual avatars, offering real-time feedback and an immersive
customization experience.
The system consists of four main components: pattern selection, paint generation, cloth
adjustment, and virtual try-on. Users can select patterns, create designs from text, adjust the
placement and colour of prints, and see how the designs would look when worn. MYCloth
improves upon existing systems by allowing more interactive control over design elements
and offering an immersive virtual try-on experience, enhancing user satisfaction and reducing
customization time. The paper also evaluates the system through user studies and quantitative
performance metrics, showcasing MYCloth's superior performance in both virtual try-on
accuracy and user experience compared to existing methods. These findings highlight its
potential to transform the online T-shirt customization process, making it more accessible and
engaging for users. Furthermore, the adaptability of MYCloth’s framework could inspire
similar advancements in other areas of online apparel customization, supporting a broader
trend toward personalised digital fashion solutions [4].
Diana Earshia V. et al. [5] explore the use of Generative Adversarial Networks (GANs) to
automatically generate amusing and visually distinct animated characters. The focus is on
utilising the powerful capabilities of GANs for creative content generation, specifically in
character design, where humour and originality are key elements. The study aims to
streamline the character creation process in animation, which traditionally demands
substantial artistic effort and manual design.
The adversarial training between the generator and discriminator networks pushes the
generator to improve iteratively, producing characters that can pass as
authentic, hand-crafted artwork.
The paper also addresses common challenges in training GANs, such as mode collapse,
where the generator repeatedly produces similar outputs, thereby limiting diversity.
Techniques like feature matching and batch normalisation were implemented to stabilise the
training process and improve the variety of generated characters. Additionally, Conditional
GANs (cGANs) were explored to provide more control over character attributes, enabling the
generation of characters with specific features like unique facial expressions or body shapes
[5].
Y. Gyungho et al. [6] address the challenges of generating webtoons from multilingual text
inputs using deep learning models. Text-to-image technology, which enables computers to
generate images based on input text, has advanced significantly with the use of GANs.
However, most of these advancements have focused on generating images from
English-language text. The authors propose a solution that leverages multilingual
text-to-image models to automate parts of the webtoon creation process, reducing the need
for extensive human intervention.
Previous research in the field of text-to-image generation has explored GAN-based models
like DCGAN, StackGAN, and AttnGAN, which extract textual features and generate images
based on those features. These models, while successful in generating photorealistic images,
often struggle with capturing contextual nuances from complex texts. In response, more
recent approaches like multimodal learning and diffusion models have been introduced,
providing improved image quality and stability during the generation process. However, most
of these models still focus predominantly on English inputs. The authors of this study
highlight the need for multilingual text-to-image technology, as translating native language
text into English introduces inefficiencies and can reduce the quality of the generated images.
The study presents a novel approach by utilising BERT to extract feature vectors from texts
written in multiple languages, including English and Korean. The researchers trained a
GAN-based model, specifically a DCGAN, on a dataset of webtoons created using both
English and Korean text descriptions, transforming the images into cartoon-like styles using
CartoonGAN. The model demonstrated the ability to generate images that were contextually
similar to the input text, with evaluation metrics like the Inception score and FID score
validating the quality of the generated webtoon images. In conclusion, the study successfully
introduces a multilingual text-to-image model capable of generating webtoon images based
on multilingual inputs. This model contributes to the automation of webtoon creation,
reducing the reliance on manual drawing and potentially lowering production costs. Although
the generated images show promising results, the authors note that there is room for
improvement, particularly in enhancing image detail and addressing issues related to training
stability in GAN-based models. Future research may explore the use of more advanced
diffusion models to further improve the quality and diversity of the generated webtoon
images. Additionally, expanding the model's language capabilities to include other languages
could broaden its application, enabling more creators to access automated webtoon
generation. Further work could also involve optimising the model for faster processing
speeds, making it more feasible for real-time or interactive applications in digital storytelling
[6].
K. Yu et al. [7] present a novel approach to assist webtoon creators by integrating deep
learning models. The research leverages contrastive learning via CLIP and diffusion models
to generate webtoons from text descriptions. The authors first constructed a multimodal
webtoon dataset by converting publicly available datasets (e.g., MSCOCO) into cartoon-style
images using CartoonGAN. They used CLIP, which combines a multilingual BERT for text
and Vision Transformer for images, to associate text with corresponding webtoon images.
CLIP was trained to extract features and align multimodal data, while the diffusion model
was used to generate new webtoon images based on the most text-similar input.
The study also highlights that while previous attempts using GAN-based methods produced
low-quality images, the diffusion model in this work allowed for more realistic and diverse
image generation. Despite these advancements, limitations remain in generating webtoons
from multi-sentence inputs while ensuring artistic consistency. Further research is suggested
to address these challenges. Through experiments using both single- and continuous-text
inputs, the researchers achieved promising results, with an inception score of 7.14 when using
continuous-text inputs. This development shows potential to streamline webtoon creation,
empowering artists with AI-driven tools for more efficient production [7].
Anoushka Popuri and John Miller [8] explore the transformative role of GANs in image-related tasks
through adversarial training between a generator and a discriminator network. The generator's
goal is to produce realistic images, while the discriminator aims to differentiate between real
and fake images, thus pushing the generator to improve its outputs. This dynamic has enabled
breakthroughs in generating high-quality, photorealistic images, making GANs highly
influential in creative applications like digital content creation, gaming, and even automated
storytelling.
The techniques pioneered by GANs have paved the way for utilising deep learning models to
generate diverse and contextually relevant visual content. The ability of GANs to perform
style transfer and image-to-image translation is particularly relevant for transforming
sketches or conceptual art into fully coloured and detailed comic panels. Additionally,
techniques like cGANs enable controlled image generation based on specific inputs, making
it possible to design characters or scenes that align with a particular storyline or
style. However, despite their success, GANs face challenges like mode collapse (where the
generator produces limited variations) and instability in training, which can affect consistency
in generated outputs. To address these limitations, enhancements like Wasserstein GANs
(WGANs) and Progressive GANs have been developed to stabilise training and improve
image quality. These insights are valuable for integrating diffusion models, like Stable
Diffusion, in generating complex and diverse visuals required for comic books. By
combining the strengths of both GANs and diffusion models, projects can achieve more
detailed and coherent art generation, allowing for automated yet artistically rich content
creation [8].
J. Zakraoui et al. [9] introduce a novel framework that converts natural language stories into
visual representations. The goal is to bridge the gap between text and imagery, providing a
way to automatically generate visual narratives from written stories. It focuses on the
challenge of translating textual information into meaningful, coherent visual content, which
can be used for a variety of applications such as digital storytelling, interactive media, and
educational tools.
The proposed pipeline consists of several key stages. The first stage involves extracting the
structure and content of the narrative from the input text. This is done by parsing the story
and identifying the core elements, such as characters, events, settings, and actions. The
system uses NLP techniques to detect entities (e.g., people, objects) and relationships
between them, as well as the temporal and spatial relationships that define the flow of the
story. Once the entities and actions are identified, the next step is to create a visual
representation. This involves generating scenes based on the information extracted from the
text. The generated images are then arranged to reflect the narrative flow, with careful
attention to how events unfold and how characters interact with their environment. To ensure
that the visualisations are coherent and accurate, the system incorporates a feedback loop
where the generated images are assessed for consistency with the original story. The paper
emphasises the importance of maintaining narrative integrity while generating diverse and
creative visuals that capture the essence of the story.
The paper also highlights the role of user input in guiding the visualisation process. While
the system is designed to automate most of the process, it allows users to adjust and fine-tune
the visuals to better match their interpretation of the story. The proposed pipeline provides a
powerful tool for transforming text into engaging, informative visuals, with potential
applications in storytelling, media production, and educational tools [9].
T. Kwon et al. [10] introduce DiffusionCLIP, a method that combines diffusion models and CLIP
for text-guided image manipulation. DiffusionCLIP enhances the ability of diffusion models
to generate and modify images based on textual descriptions. The model integrates the
strengths of CLIP, which is trained to connect visual and textual representations, with the
generative power of diffusion models, which iteratively refine noisy images into coherent
outputs. The core idea of DiffusionCLIP is to guide the image generation process with textual
prompts, enabling the model to modify or create images in a way that aligns with the user's
input. The authors propose a robust framework that allows for better control over image
attributes, such as style, content, and composition, using simple textual descriptions.
DiffusionCLIP significantly advances the field of text-guided image editing by combining the
strengths of diffusion models and CLIP embeddings to achieve unprecedented levels of
precision, coherence, and flexibility. It effectively addresses the limitations of previous
models by ensuring a tight alignment between textual descriptions and visual outputs,
enabling users to make highly specific and contextually relevant edits. Through iterative
refinement, the method produces images that are not only aesthetically pleasing but also
semantically faithful to the input prompts, handling a diverse range of tasks such as image
synthesis, style transfer, and content manipulation with remarkable accuracy. This innovative
framework demonstrates its adaptability across various scenarios, from subtle attribute
modifications like adjusting lighting or color tones to dramatic transformations that
completely reimagine the original image, all while maintaining visual integrity. By enhancing
user control over intricate details and offering seamless flexibility, DiffusionCLIP opens new
doors for creative exploration in fields such as digital content creation, graphic design, visual
storytelling, and fine arts. Its user-friendly design empowers both professionals and amateurs
to bring their imaginative ideas to life effortlessly, democratizing access to sophisticated
AI-driven tools. Furthermore, its capability to generate high-quality, contextually accurate
images has the potential to revolutionize industries like marketing, entertainment, education,
and product design, making it a pivotal innovation in the realm of AI-powered visual
generation. As a versatile and intuitive solution, DiffusionCLIP sets a new benchmark for
text-to-image interaction, paving the way for future advancements in creative technologies
[10].
CHAPTER 3
GAP ANALYSIS
In the realm of automated image generation, particularly for applications such as comic book
creation, existing systems face several notable limitations. These challenges hinder the
effectiveness and versatility of current technologies, making them less suited for the complex
demands of comic storytelling. A major challenge in existing image generation systems is
mode collapse, where the model produces a limited variety of similar or repetitive images.
This is particularly problematic for comics, which require diverse character designs,
environments, and settings to maintain visual interest and narrative flow. Comics rely on
distinct panels to depict varied scenes, emotions, and actions, and mode collapse results in
monotonous visuals that fail to convey the uniqueness of each scene.
Another significant gap in current image generation systems is the lack of fine control over
generated content. In comic book production, creators need the ability to guide the generation
process based on specific requirements, such as character features, scene layout, and
emotions. However, most systems do not allow for precise manipulation, which often results
in visuals that fail to capture the intended essence of the story, with characters appearing
inconsistent, emotions misrepresented, or layouts diverging from the envisioned sequence.
This lack of control limits
their practical use for comic creators, highlighting the need for tools that enable detailed
customization in image generation.
These limitations present significant barriers for application in comic book creation. Mode
collapse results in repetitive visuals, which fail to capture the diversity and dynamic nature
required for comic storytelling. Similarly, the inability to precisely control character design,
scene layout, and emotions limits the customization needed for aligning images with the
intended storyline. In contrast, our project utilises the stability and flexibility of diffusion
models, an approach that can generate diverse, high-quality images with greater control over
specific aspects of the comic creation process. This enables the production of unique,
contextually accurate visuals that align with detailed story inputs, offering a promising
solution to the gaps identified in existing systems. By bridging these limitations, our
approach opens new possibilities for creating visually compelling and narratively coherent
comics.
CHAPTER 4
PROPOSED SOLUTION
4.1 METHODOLOGY
The methodology for generating a comic book integrates several advanced AI-driven
components into a streamlined pipeline. It begins with Input Handling, where the user's story
is accepted and pre-processed. The input is passed to a Text Processing Module using the
LED (Longformer Encoder-Decoder) model, which generates a concise and coherent
summary tailored for comic book storytelling. This is followed by Coreference Resolution
using Fastcoref, ensuring that all pronouns and references are unambiguously linked to their
antecedents, thereby enhancing narrative clarity.
Next, the refined text is fed into the Image Generation Module, which utilises the Stable
Diffusion model to produce high-quality illustrations based on the story's content. In parallel,
an Audio Generation Module leverages the Google Text-to-Speech (gTTS) library to produce
narrations for each section of the comic. Finally, all elements—text, images, and audio—are
integrated into a PDF Output, created using the FPDF Python library, to generate a cohesive
and visually engaging comic book. This end-to-end methodology ensures both automation
and high-quality results, making it ideal for transforming textual stories into immersive
multimedia picture books.
4.1.1 INPUT AND USER INTERFACE DEVELOPMENT
The input and UI module is designed to provide an intuitive and user-friendly platform for
interacting with the comic book generator. Developed using HTML, CSS, and JavaScript, the
UI allows users to input their own story. Once the input is processed, the resulting comic
book is displayed on the user’s screen in an interactive, page-by-page format. The interface
dynamically presents the text alongside the generated visual elements, such as images and
illustrations that correspond to the narrative. Users can navigate through the pages with
intuitive buttons, allowing seamless transitions between comic pages. The UI also includes an
embedded audio player that provides narration for each page. This adds an immersive,
multi-sensory layer to the comic book, enhancing engagement. The narration is synchronised
with the text and images on each page, ensuring a cohesive and enhanced user experience.
Furthermore, the interface abstracts the complexity of the underlying processes involved in
comic book generation, offering a straightforward and accessible platform. This allows users
to create visually appealing and narratively consistent comic books with minimal effort. As a
result, it serves as a versatile tool for a broad audience, from casual users interested in
creating their own stories to more experienced creators seeking an efficient way to visualise
and share their narratives.
Text summarization is a crucial step in adapting lengthy narratives into concise text suitable
for comic book panels. For this project, the LED model is implemented, chosen for its
capability to process long-form text efficiently. Unlike traditional models limited by shorter
input lengths, LED can handle extended content, ensuring that the core elements of a story
are preserved. This ensures that even complex narratives can be distilled into meaningful,
engaging summaries while maintaining their original context and emotional depth. By
generating succinct text, the model facilitates the creation of dialogue and captions that fit
naturally within the spatial constraints of comic panels.
The process involves tokenizing the input text and feeding it into the summarization model,
which reduces the content while aligning it with the storytelling format of comics. The result
is text that retains essential plot points, character interactions, and key themes, enabling
seamless integration with the visual elements. This step not only optimises storytelling for the
comic medium but also enhances reader engagement by delivering concise and impactful
narratives. By leveraging the LED model, the summarization process achieves a balance
between brevity and clarity, making it ideal for transforming long-form text into comic-ready
dialogue and captions.
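The summarization step can be prototyped directly with Hugging Face's LED implementation. The following is a minimal sketch, assuming the publicly available allenai/led-base-16384 checkpoint and illustrative generation settings; the report does not specify the exact checkpoint or parameters used.

```python
# Minimal sketch of the LED summarization step (checkpoint and generation
# parameters are assumptions for illustration).
from transformers import LEDForConditionalGeneration, LEDTokenizer

MODEL_NAME = "allenai/led-base-16384"  # assumed checkpoint
tokenizer = LEDTokenizer.from_pretrained(MODEL_NAME)
model = LEDForConditionalGeneration.from_pretrained(MODEL_NAME)

def summarize_story(story: str, max_summary_tokens: int = 256) -> str:
    """Condense a long story into concise, comic-ready text."""
    inputs = tokenizer(story, return_tensors="pt", truncation=True, max_length=16384)
    summary_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_summary_tokens,
        num_beams=4,
        no_repeat_ngram_size=3,
    )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```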
Coreference resolution is crucial for maintaining coherence and consistency in the text by
identifying and resolving references such as pronouns, determiners, and other referring
expressions back to their original entities (e.g., character names or objects). This ensures that
the narrative flow is uninterrupted and that each entity is consistently recognized throughout
the text, which is particularly important for generating accurate and contextually appropriate
images.
Once coreference resolution is applied, the text becomes more structured and easier to
process. This clarity is vital for the image generation module, as it helps the system
understand which characters or objects the text refers to when creating images. For instance,
if the text mentions "Alice" and later uses the pronoun "she," the system must be able to
recognize that "she" refers to "Alice" to generate consistent images of that character. The
coreference resolution step prevents confusion and ensures that images are generated based
on the correct entities.
After resolving coreferences, the text is split into individual sentences, which are then passed
as input prompts to the image generation module. This step is essential for maintaining
narrative consistency, as it allows the generated images to accurately reflect the progression
of the story. Additionally, coreference resolution ensures that characters and objects retain
their identity across different scenes and pages, preserving visual continuity in the comic
book. This is especially important when dealing with multiple characters, objects, or complex
storylines, where accurate representation of references is necessary for a coherent and
engaging comic book experience.
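As a concrete illustration, the coreference-resolution and sentence-splitting stage might be wired up as below using fastcoref's spaCy integration; the spaCy model name and configuration are assumptions rather than the project's exact setup.

```python
# Sketch of coreference resolution with fastcoref, followed by sentence
# splitting; the spaCy pipeline name is an assumed default.
import spacy
from fastcoref import spacy_component  # noqa: F401  (registers the "fastcoref" pipe)

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("fastcoref")

def resolve_and_split(summary: str) -> list[str]:
    """Replace pronouns with their antecedents, then split into sentences."""
    doc = nlp(summary, component_cfg={"fastcoref": {"resolve_text": True}})
    resolved = doc._.resolved_text  # e.g. "she" is rewritten as "Alice"
    return [sent.text.strip() for sent in nlp(resolved).sents if sent.text.strip()]

# Each returned sentence becomes one prompt for the image-generation module.
```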
To bring the summarised text to life, each condensed narrative is paired with an AI-generated
image, creating a visual representation of the scene. The Stable Diffusion API is integrated to
perform text-to-image synthesis, transforming the resolved text from the summarization
process into high-quality images. By using the final text as a prompt, the model generates
detailed illustrations that capture the essence of the scene, from character appearance to
background settings. The flexibility of the Stable Diffusion model allows for the
customization of various parameters, including style, composition, and resolution, which can
be predefined or adjusted via the user interface. This ensures that the generated images align
with the intended artistic direction of the comic, whether it's a manga style, a Western comic
look, or something entirely unique.
The generated images are then stored in sequence, ensuring that they correspond to the
summarised text in the correct order to form a coherent narrative. This process enables the
seamless creation of a comic book, where both the text and visuals work together to tell a
compelling story. By pairing the summarised text with tailored AI-generated imagery, the
project offers an efficient and accessible way to produce high-quality comic books,
transforming text-based narratives into immersive visual experiences. This integration of AI
in both the textual and visual creation stages allows for rapid prototyping and customization,
making comic book creation more accessible and dynamic.
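For context, the image-generation call could look like the sketch below. The report integrates a Stable Diffusion API; this variant runs the model locally through the Hugging Face diffusers library, and the checkpoint, style suffix, and sampling parameters are illustrative assumptions.

```python
# Sketch of text-to-image synthesis with Stable Diffusion via diffusers
# (model id, style suffix, and sampling settings are assumptions).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

STYLE_SUFFIX = "comic book panel, bold ink outlines, vibrant colors"

def generate_panel(sentence: str, index: int) -> str:
    """Render one resolved sentence as a comic panel and return the image path."""
    prompt = f"{sentence}, {STYLE_SUFFIX}"
    image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
    path = f"panel_{index:03d}.png"
    image.save(path)
    return path
```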
To further enhance the storytelling experience, audio narration is added to each comic panel,
providing an immersive auditory element to complement the visuals and text. The Google
Text-to-Speech (TTS) API is used to generate high-quality audio from the summarised text.
Each piece of text is converted into a narration using a chosen voice and language, allowing
for customization based on the desired tone, gender, and accent. This ensures that the
narration aligns with the mood and context of the story, whether it's a dramatic, humorous, or
emotional moment.
The generated audio files are saved in a standard format, such as MP3, making them easily
accessible for integration with the comic panels. Once the audio files are created, they are
linked to the corresponding panels, ensuring that the narration is synchronised with the text
and visuals. This audio feature not only enhances accessibility but also deepens the reader's
engagement by adding a dynamic layer to the storytelling process. With audio narration, the
comic book becomes a multi-sensory experience, allowing readers to enjoy the story through
both sight and sound.
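The narration step maps almost directly onto the gTTS API; a minimal sketch, with the language and speed chosen as illustrative defaults:

```python
# Sketch of per-panel narration with Google Text-to-Speech (gTTS).
from gtts import gTTS

def narrate(sentence: str, index: int, lang: str = "en") -> str:
    """Convert one panel's text into an MP3 narration and return the file path."""
    path = f"narration_{index:03d}.mp3"
    gTTS(text=sentence, lang=lang, slow=False).save(path)
    return path
```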
The final output of the project is a comprehensive PDF that combines text, images, and audio
narration into a cohesive comic book. The FPDF library is used to create and format the PDF,
providing a flexible and efficient way to structure the content. Text and images are arranged
sequentially within the document, ensuring that the layout mirrors the typical flow of a comic
book, with each panel clearly separated and aligned to maintain readability. The images,
generated via Stable Diffusion, are placed alongside the corresponding summarised text,
creating a visual narrative that guides the reader through the story.
To enhance the experience further, audio files are embedded as clickable links within the
PDF. These links provide direct access to the audio narration for each panel, allowing readers
to listen to the narrated content as they view the comic. The integration of multimedia
elements, including text, images, and audio, transforms the comic into a dynamic, interactive
format. Once the PDF is compiled, it is made available for download, allowing users to easily
access and share the complete picture book with integrated storytelling features. This process
not only makes the comic more immersive but also ensures that the content is easily
distributable in a widely-used format.
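The assembly step can be sketched with the FPDF library as follows; the page layout, font, and link text are assumptions chosen for illustration rather than the project's exact formatting.

```python
# Sketch of combining text, images, and audio links into a single PDF
# with FPDF (layout values are illustrative).
from fpdf import FPDF

def build_comic_pdf(panels: list[dict], out_path: str = "comic.pdf") -> None:
    """panels: [{"text": ..., "image": ..., "audio": ...}, ...] in story order."""
    pdf = FPDF()
    pdf.set_auto_page_break(auto=True, margin=15)
    for panel in panels:
        pdf.add_page()
        pdf.image(panel["image"], x=10, w=190)                   # generated illustration
        pdf.set_font("Helvetica", size=12)
        pdf.multi_cell(0, 8, panel["text"])                      # summarised caption
        pdf.set_text_color(0, 0, 255)
        pdf.cell(0, 10, "Play narration", link=panel["audio"])   # clickable audio link
        pdf.set_text_color(0, 0, 0)
    pdf.output(out_path)
```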
The architecture for creating a comic book using Stable Diffusion is structured into three
main components: User Interface, Text Processing Module, and Integration Module. The
process begins with a story input provided by the user through the User Interface. This story
is then passed into the Text Processing Module, which performs multiple steps to refine the
input.
Within the Text Processing Module, the story is summarised using a Summary Generator
(LED), ensuring the output is concise and suitable for a comic book format. Coreference
Resolution (Fastcoref) ensures that pronouns and references in the story are clearly resolved,
enhancing coherence. After this, the story is divided into sentences using a Sentence Splitter,
preparing the text for both image and audio generation.
The Integration Module is responsible for turning the processed sentences into multimedia
elements. Sentences are further refined using a Sentence Generation component and then fed
into the Image Generation module, powered by the Stable Diffusion API, to produce visually
compelling illustrations. Simultaneously, an Audio Generation component (Google TTS)
generates narrations for each sentence. Finally, these images and audio clips are combined to
produce the comic book, which is presented as the final output. This modular approach
ensures the system is flexible, scalable, and capable of producing high-quality picture books.
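Putting the modules together, the end-to-end flow described above can be expressed as a short driver function; this sketch simply chains the hypothetical helpers introduced in the earlier sketches (summarize_story, resolve_and_split, generate_panel, narrate, build_comic_pdf).

```python
# End-to-end sketch chaining the helper functions from the earlier sketches.
def generate_comic(story: str, out_path: str = "comic.pdf") -> str:
    summary = summarize_story(story)              # Text Processing Module (LED)
    sentences = resolve_and_split(summary)        # Fastcoref + sentence splitter
    panels = []
    for i, sentence in enumerate(sentences):      # Integration Module
        panels.append({
            "text": sentence,
            "image": generate_panel(sentence, i), # Stable Diffusion
            "audio": narrate(sentence, i),        # Google TTS
        })
    build_comic_pdf(panels, out_path)             # FPDF output
    return out_path
```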
The LED (Longformer Encoder-Decoder) model is designed to efficiently process long text
sequences, particularly for tasks like text summarization, where documents can often exceed
the typical input length of traditional Transformer models. The architecture of LED builds on
the Longformer model, which adapts the Transformer architecture to handle long-range
dependencies in a computationally efficient way.
The encoder in LED employs a unique attention mechanism known as sliding window
attention. In traditional Transformer models, attention is computed between every pair of
tokens in the input sequence, leading to quadratic time complexity. This becomes
computationally expensive for long texts. Longformer’s sliding window attention mitigates
this issue by restricting attention to a local window of surrounding tokens, reducing
computational cost. Additionally, Longformer uses global attention for tokens that are
deemed important, such as those representing the start or end of sentences. This global
attention ensures that essential context is preserved while processing long documents. The
decoder of the LED model mirrors the encoder’s sliding window attention approach but is
designed to generate the output sequence, typically a summary of the input text. The decoder
attends to the encoded representation of the input text, and the combination of local and
global attention mechanisms ensures that the model generates coherent and concise
summaries.
In LED, the input text undergoes tokenization, where the text is split into smaller units,
typically using the Byte Pair Encoding (BPE) method. These tokens are then converted into
dense vector representations through pre-trained embeddings, which capture the semantic
meaning of the text. These embeddings are passed through the encoder for processing. The
core innovation of LED lies in its attention mechanism. Rather than attending to all tokens in
the sequence, which is computationally expensive, each token attends only to a local window
of surrounding tokens, improving efficiency. Additionally, certain tokens are assigned global
attention, ensuring that key information from across the document is not overlooked.
The LED model also incorporates positional encoding to account for the order of tokens in
the sequence. Since the attention mechanism does not inherently consider token order,
positional encodings are added to the token embeddings to maintain the relative positions of
tokens within the input text. At the end of the process, the LED model generates the output
sequence, which is typically a concise summary of the input text. The model extracts the
most important content and discards irrelevant information, ensuring that the summary
remains coherent and aligned with the original meaning of the document.
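To make the attention scheme concrete, the snippet below shows how local and global attention are typically expressed when calling LED through the transformers library: every token gets sliding-window attention by default, and a binary mask marks the tokens that additionally receive global attention (here only the first token, a common convention for summarization; the checkpoint name is an assumption).

```python
# Illustration of LED's two attention inputs (checkpoint assumed).
import torch
from transformers import LEDTokenizer

tokenizer = LEDTokenizer.from_pretrained("allenai/led-base-16384")
inputs = tokenizer("A long story ...", return_tensors="pt")

# 0 = local sliding-window attention only, 1 = additional global attention.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1  # the first token attends to (and is attended by) every token

# Passed to model.generate(..., global_attention_mask=global_attention_mask)
# together with input_ids and attention_mask.
```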
Fig 4.3 LED Model
To establish a meaningful connection between textual inputs and the visual outputs, CLIP
plays a pivotal role. CLIP is a pre-trained neural network capable of understanding both text
and images in a shared semantic space. This capability allows Stable Diffusion to interpret
textual descriptions and align them accurately with the generated images. Acting as a guide
during the image creation process, CLIP evaluates the similarity between the text and image
representations, offering a score that reflects their alignment. This feedback loop ensures that
the generated image corresponds closely to the semantic meaning of the input prompt, even
for intricate or abstract descriptions. By leveraging CLIP, Stable Diffusion achieves a high
degree of consistency between what is described in text and what is visually produced,
making it a powerful tool for creative and practical applications.
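A small sketch of the kind of text-image alignment score CLIP provides, using the openly available CLIP checkpoint on the Hugging Face hub; the report does not state which CLIP variant is used, so the model choice here is purely illustrative.

```python
# Sketch of scoring text-image alignment with CLIP (checkpoint assumed).
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_alignment_score(prompt: str, image_path: str) -> float:
    """Higher values indicate the image matches the prompt more closely."""
    inputs = processor(
        text=[prompt], images=Image.open(image_path), return_tensors="pt", padding=True
    )
    outputs = clip(**inputs)
    return outputs.logits_per_image.item()  # scaled cosine similarity
```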
The denoising diffusion process drives the actual generation of images in Stable Diffusion.
This process begins with an image that is initialized as pure random noise. Through a series
of iterative steps, the model learns to refine this noise, progressively denoising it to bring the
image closer to the desired output. During training, the model is exposed to a reverse process
where noise is systematically added to an image, enabling it to understand how to reverse this
degradation. At inference, the model applies this learned reverse process, gradually
transforming random noise into a clear and coherent image. This iterative refinement ensures
that the generated images are highly detailed, stylistically consistent, and capable of
accurately capturing the nuances of complex textual prompts. By combining these powerful
components — VAE for efficient encoding, CLIP for text-to-image alignment, and denoising
diffusion for image refinement — Stable Diffusion can produce stunning visuals that are both
creative and contextually accurate. This seamless integration of components allows the model
to excel in generating images that are not only creative but also deeply aligned with the intent
of the input descriptions.
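The paragraph above corresponds closely to the manual denoising loop one can write from the individual diffusers building blocks. The sketch below (checkpoint, step count, and prompt are assumptions, and classifier-free guidance is omitted for brevity) shows the three components working together: the CLIP text encoder conditions the U-Net, the scheduler iteratively denoises a random latent, and the VAE decodes the final latent into an image.

```python
# Sketch of the denoising diffusion loop using the individual Stable
# Diffusion components (checkpoint assumed; guidance omitted for brevity).
import torch
from diffusers import AutoencoderKL, PNDMScheduler, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer

REPO = "runwayml/stable-diffusion-v1-5"  # assumed checkpoint
tokenizer = CLIPTokenizer.from_pretrained(REPO, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(REPO, subfolder="text_encoder")
unet = UNet2DConditionModel.from_pretrained(REPO, subfolder="unet")
vae = AutoencoderKL.from_pretrained(REPO, subfolder="vae")
scheduler = PNDMScheduler.from_pretrained(REPO, subfolder="scheduler")

prompt = ["a knight walking through a misty forest, comic style"]
tokens = tokenizer(prompt, padding="max_length",
                   max_length=tokenizer.model_max_length, return_tensors="pt")
with torch.no_grad():
    text_embeddings = text_encoder(tokens.input_ids)[0]  # CLIP text conditioning

# Start from pure random noise in the VAE's latent space.
latents = torch.randn((1, unet.config.in_channels, 64, 64))
scheduler.set_timesteps(30)
latents = latents * scheduler.init_noise_sigma

# Iteratively predict and remove noise, conditioned on the text embeddings.
for t in scheduler.timesteps:
    latent_input = scheduler.scale_model_input(latents, t)
    with torch.no_grad():
        noise_pred = unet(latent_input, t, encoder_hidden_states=text_embeddings).sample
    latents = scheduler.step(noise_pred, t, latents).prev_sample

# Decode the denoised latent back into pixel space with the VAE.
with torch.no_grad():
    image = vae.decode(latents / vae.config.scaling_factor).sample
```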
CHAPTER 5
CONCLUSION
The comic book generator utilising Stable Diffusion presents a significant leap in leveraging
AI for creative content generation. This project successfully integrates advanced technologies
such as natural language processing, image synthesis, and multimedia creation to streamline
the traditionally time-consuming process of comic book production. By automating tasks like
text analysis, image generation, and audio integration, the system delivers a comprehensive
solution for producing high-quality, visually engaging comic books.
The system offers substantial advantages to creators by reducing the time and costs
associated with manual comic production. By automating repetitive processes, it allows
artists and writers to focus on enhancing creativity and storytelling rather than technical
execution. For individuals with compelling storylines but limited artistic skills, this tool
provides an accessible way to bring their ideas to life, democratising the comic creation
process. Additionally, this project holds significant potential for educational and
entertainment purposes. In education, it can be used to create engaging visual content to
simplify complex subjects, making learning more interactive and enjoyable. For comic
enthusiasts and hobbyists, the system offers an affordable and efficient way to explore their
passion for storytelling and art.
By accommodating diverse genres and artistic styles, the approach demonstrates the
flexibility of Stable Diffusion in maintaining consistency and adapting to creative needs. This
project underscores the potential of AI to not only enhance productivity but also empower
individuals with creative ideas, making it a valuable contribution to both the creative and
educational domains.
FUTURE SCOPE
● Artistic Style Diversification
Enable users to select or customise artistic styles such as manga, vintage comics, or
modern art. Incorporate style transfer techniques to adapt or create unique visuals
tailored to user preferences.
● Enhanced User Experience
Develop an intuitive interface using frameworks like Flutter, offering real-time
previews, drag-and-drop tools, and step-by-step guides to simplify the creation
process for users.
● Collaborative Features
Introduce multi-user collaboration with features like version control, real-time edits,
and shared workspaces. Integration with project management tools can enhance
teamwork and client interactions.
● Broader Applications
Expand use cases to education (visual learning materials), corporate training
(interactive manuals), and therapy (storytelling-based support). Utilize comics for
impactful communication in awareness campaigns.
As these advancements come to fruition, the comic book generator could evolve into a
comprehensive solution for creative professionals, educators, and businesses alike. This could
result in a tool capable of delivering tailored, immersive, and impactful visual storytelling
that transcends its initial scope, offering enhanced user engagement, collaborative features,
and real-time customization options. By incorporating sophisticated features such as
AI-driven customization, adaptive storytelling templates, and intuitive user interfaces, the
tool could cater to a wide range of creative needs. The continued development of these
features will ensure that the comic book generator remains at the forefront of digital content
creation, providing innovative solutions to an expanding user base, and establishing itself as a
versatile platform for fostering creativity and communication across diverse industries. This
could establish itself as an essential tool for enhancing communication, and enabling
innovative storytelling across diverse industries, from entertainment and education to
marketing and beyond.
REFERENCES
[1] A. Warrier, A. Mathew, A. Patra, K. S. Hiremath and J. Jijo, “Generation and Editing of
Faces using Stable Diffusion with Criminal Suspect Matching”, 2024 IEEE International
Conference on Advanced Systems and Emergent Technologies (IC_ASET), Hammamet,
Tunisia, 2024, pp. 1-6, doi: 10.1109/IC_ASET61847.2024
[2] Asmitha M, Aashritha Danda, Hemanth Bysani, Rimjhim Padam Singh and Sneha Kanchan,
“Automation of Text Summarization Using Hugging Face NLP”, 2024 5th International
Conference for Emerging Technology (INCET), Belgaum, India, 2024, pp. 1-7, doi:
10.1109/INCET61516.2024.10593316.
[3] Y. Zhang, T. Zhang and H. Xie, “TexControl: Sketch-Based Two-Stage Fashion Image
Generation Using Diffusion Model” , 2024 Nicograph International (NicoInt), Hachioji,
Japan, 2024, pp. 64-68, doi: 10.1109/NICOInt62634.2024.00021.
[4] Y. Liu and L. Wang, “MYCloth: Towards Intelligent and Interactive Online T-Shirt
Customization based on User’s Preference”, 2024 IEEE Conference on Artificial Intelligence
(CAI), Singapore, 2024, pp. 955-962, doi: 10.1109/CAI59869.2024.00175.
[5] Diana Earshia V, Veeri Venkata Hemanth Kumar, Raghunath Dinesh Kumar and Vangavaragu
Moni Sahithi, “Generation of Hilarious Animated Characters using GAN”, 2023 7th
International Conference on Trends in Electronics and Informatics (ICOEI), Tirunelveli,
India, 2023, pp. 63-66, doi: 10.1109/ICOEI56765.2023.10125904.
[6] Y. Gyungho, H. Kim, J. Kim, and C. Chun, “A Study on Generating Webtoons Using
Multilingual Text‐to‐Image Models” , Appl. Sci., vol. 13, no. 12, p. 7278, Jun. 2023, doi:
10.3390/app13127278.
[7] K. Yu, H. Kim, J. Kim, C. Chun, and P. Kim, “A Study on Webtoon Generation Using
CLIP and Diffusion Models”, Electronics, vol. 12, no. 18, p. 3983, Sep. 2023, doi:
10.3390/electronics12183983.
[8] Anoushka Popuri and John Miller, “Generative Adversarial Networks in Image Generation
and Recognition”, 2023 International Conference on Computational Science and
Computational Intelligence (CSCI), Las Vegas, NV, USA, 2023, pp. 1294-1297, doi:
10.1109/CSCI62032.2023.00212.
[9] J. Zakraoui, M. Saleh, S. Al-Maadeed, and J. M. Alja’am, “A Pipeline for Story
Visualization from Natural Language”, Appl. Sci. 2023, vol. 13, no. 8, p. 5107, Apr. 2023,
doi: 10.3390/app13085107.
[10] T. Kwon, G. Kim, and J. C. Ye, “DiffusionCLIP: Text-Guided Diffusion Models for
Robust Image Manipulation”, in 2022 IEEE/CVF Conf. Computer Vision and Pattern
Recognition (CVPR), Jun. 2022, doi: 10.1109/CVPR52688.2022.00246.
[11] B. Jadhav, M. Jain, A. Jajoo, D. Kadam, H. Kadam and T. Kakkad, “Imagination Made
Real: Stable Diffusion for High-Fidelity Text-to-Image Tasks”, 2024 2nd International
Conference on Sustainable Computing and Smart Systems (ICSCSS), Coimbatore, India,
2024, pp. 773-779, doi: 10.1109/ICSCSS60660.2024.10625113
[12] K. Mallikharjuna Rao and T. Patel, “Enhancing Control in Stable Diffusion Through
Example-based Fine-Tuning and Prompt Engineering”, 2024 5th International Conference on
Image Processing and Capsule Networks (ICIPCN), Dhulikhel, Nepal, 2024, pp. 887-894,
doi: 10.1109/ICIPCN63822.2024.00153.
[13] N. Zade, G. Mate, K. Kishor, N. Rane and M. Jete, "NLP Based Automated Text
Summarization and Translation: A Comprehensive Analysis," 2024 2nd International
Conference on Sustainable Computing and Smart Systems (ICSCSS), Coimbatore, India,
2024, pp. 528-531, doi: 10.1109/ICSCSS60660.2024.10624907.
[14] N. N, A. Narayan, A. M. Sridharan and A. Pradhan, "Automated Text Summarizer Using
Google Pegasus," 2023 International Conference on Smart Systems for applications in
Electrical Sciences (ICSSES),Tumakuru, India, 2023, pp. 1-4, doi:
10.1109/ICSSES58299.2023.10199721.
[15] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, "Hierarchical
Text-Conditional Image Generation with CLIP Latents," arXiv, April 2022, doi:
10.48550/arXiv.2204.06125.