CERTIFICATE
This is to certify that the report entitled ‘COMIC BOOK GENERATOR USING STABLE
DIFFUSION’ submitted by GAYATHRI H (LBT21CS037), JAMIE MATHEW
(LBT21CS050), JYOTHIKA JOHNSON (LBT21CS055), KRISHNA VASUDEVAN
(LBT21CS062) to the APJ Abdul Kalam Technological University in partial fulfilment of the
requirements for the award of the Degree of Bachelor of Technology in Computer Science
and Engineering is a bonafide record of the project work carried out by them under our
guidance and supervision. This report in any form has not been submitted to any other
University or Institute for any purpose.
We, the undersigned, hereby declare that the project titled ‘COMIC BOOK GENERATOR
USING STABLE DIFFUSION’ submitted in partial fulfilment of the requirements for the
Bachelor of Technology degree at APJ Abdul Kalam Technological University, Kerala,
represents our original work conducted under the supervision of Prof. Gisha G. S., Assistant
Professor, Department of Computer Science and Engineering, LBS Institute of Technology
for Women, Poojappura. We affirm that this submission reflects our own ideas and that any
contributions from external sources have been accurately cited and referenced. We attest to
our adherence to the principles of academic honesty and integrity, ensuring that all data,
ideas, facts, and sources have been presented truthfully and ethically. We acknowledge that
any breach of academic integrity or misrepresentation of data may result in disciplinary
action by the institute and/or the University Furthermore, we confirm that this report has not
been previously used to obtain any degree, diploma,or similar title from other academic
institutions.
We would like to express our sincere gratitude to all the people who have guided and assisted
us through the course of this work. First and foremost, we wish to express our deep and
sincere gratitude to our Principal, Dr. Smithamol M B for providing us with all the facilities
and infrastructure for the completion of this work. We record our sincere gratitude and thanks
to Dr. Anitha Kumari S, Head of the Department, Department of Computer Science and
Engineering, for providing us with all the facilities for the completion of this work. We would
like to express our sincere gratitude to our project coordinator Dr. Sreejith S, Assistant
Professor, Department of Computer Science and Engineering, for their valuable assistance
provided during the course of the project. We would also like to express our sincere gratitude
to our project guide Prof. Gisha G. S., Assistant Professor, Department of Computer Science
and Engineering, for providing us constant guidance, support and immense encouragement
for the successful completion of this project. We would also like to thank all the faculty
members in the department for the valuable support and encouragement they rendered. Last
but not least, we thank all our friends and well-wishers for their kind cooperation,
immense support and useful suggestions and also for providing their truthful and illuminating
views on a number of issues related to the project.
GAYATHRI H
JAMIE MATHEW
JYOTHIKA JOHNSON
KRISHNA VASUDEVAN
TABLE OF CONTENTS
LIST OF FIGURES
CHAPTER 1. INTRODUCTION
1.1 Background
1.3 Objectives
4.1 Methodology
CHAPTER 5. CONCLUSION
REFERENCES
ABSTRACT
In today's digital era of rapid technological innovation, our project aims to revolutionise
storytelling by transforming textual narratives into fully illustrated comic books using
natural language processing and Stable Diffusion techniques. The system
identifies key elements such as characters, ensuring that the generated visuals align
seamlessly with the narrative context. This ensures that each comic panel reflects the original
text, enhancing the storytelling experience. Our system harnesses NLP models to thoroughly
analyse and interpret textual stories, capturing their essence and structure for further creative
transformation. Utilising Stable Diffusion, a powerful generative method, these narratives are
brought to life through visually stunning illustrations, crafting immersive comic book
experiences. This project aims to redefine storytelling in the digital realm, offering a dynamic
platform that merges language and art, fostering creativity and engagement among users from
diverse backgrounds. The integration of NLP and Stable Diffusion promises to revolutionise
the way we experience and interact with narrative content, empowering users to transform
their stories into vivid comic books effortlessly. To achieve this, we are implementing a
pipeline that begins with text processing using NLP models, such as Hugging Face
transformers, to analyse and summarise the input text effectively. Key details like dialogue
and scene descriptions are passed to the Stable Diffusion model, which generates images
based on these inputs. This approach democratises comic creation, enabling anyone to
produce visually compelling content without prior artistic training. Applications for
this project span various fields, including education, entertainment, and digital marketing,
where personalised, engaging visual narratives can captivate audiences. By combining these
advanced techniques with the timeless appeal of comics, we aim to bridge the gap between
imagination and reality, creating a transformative tool for storytelling in the digital age.
LIST OF FIGURES
LIST OF ABBREVIATIONS
AI Artificial Intelligence
UI User Interface
TTS Text-to-Speech
CHAPTER 1
INTRODUCTION
1.1 BACKGROUND
The idea of developing a comic book generator using Stable Diffusion is driven by the
aspiration to revolutionise storytelling and make visual creation more inclusive. Traditional
comic book creation requires significant artistic talent, time, and resources, which can be
barriers for many aspiring creators. This project aims to break down these barriers by
leveraging Stable Diffusion, a cutting-edge text-to-image generative model that transforms
textual descriptions into visually striking comic panels. By interpreting prompts describing
characters, scenes, and dialogues, the generator can produce high-quality illustrations,
enabling creators to focus on storytelling while the AI handles the artistic execution. This
innovation empowers individuals without advanced artistic skills to bring their visions to life,
fostering inclusivity in the creative process.
The choice of Stable Diffusion reflects its ability to generate high-resolution, stylistically
coherent visuals with intricate details. Fine-tuning the model on datasets of comic art styles
ensures adaptability to various genres, such as manga or Western comics, while maintaining
narrative consistency. This project aligns with the growing demand for diverse narratives and
tools that encourage experimentation in visual storytelling. It also has the potential to
revolutionise the comic industry by offering rapid prototyping and personalised storytelling
for small creators, educators, or individuals with disabilities. By blending human creativity
with AI efficiency, this project redefines how stories are told, inspiring new possibilities and
innovation in the world of comics. Furthermore, it opens doors for collaborative projects
between artists and writers, allowing them to bring their creative visions to life more
efficiently. The technology could also pave the way for educational applications, where
illustrated narratives can be generated to enhance learning experiences for students. In
essence, this project serves as a bridge between art and technology, unlocking endless
creative possibilities for people across the globe.
Beyond its immediate applications, this project has the potential to influence broader domains
by showcasing how AI can be a transformative tool in the creative arts. It could inspire
similar innovations in related fields such as animation, video game design, and digital
marketing, where visual storytelling plays a crucial role. Additionally, the project highlights
the role of technology in preserving cultural narratives, enabling communities to visualise
and share traditional stories in ways that resonate with modern audiences. By making the
process of visual content creation more accessible, this initiative encourages diverse voices to
contribute to the global storytelling landscape, fostering creativity, inclusivity, and cultural
exchange. As AI continues to evolve, projects like this exemplify how technology can
complement human imagination, enabling creators to push the boundaries of what is possible
in visual media.
1.2 PROBLEM STATEMENT
This project addresses the significant barriers to creating visually appealing and cohesive
comic books or illustrated stories. Traditional comic creation
demands a combination of artistic skills, substantial time, and financial resources, which can
exclude many talented storytellers who lack access to these assets. Additionally, manually
creating illustrations for every panel is time-consuming and limits the ability to experiment
with multiple styles or designs. For small creators, educators, or individuals with limited
resources, this makes the process of visual storytelling inaccessible. The absence of an
automated system that can seamlessly transform written narratives into high-quality visuals
further exacerbates this challenge. This limitation hinders creativity, as creators often have to
compromise on their artistic vision due to resource constraints. It also creates barriers to entry
for aspiring artists and writers who lack the means to produce professional-grade content. The
process of commissioning illustrations can be costly and time-consuming, further
discouraging new talent from pursuing visual storytelling. The gap between concept and
execution often leaves many innovative ideas unrealized, curbing diversity and representation
in the creative arts. Addressing these challenges is essential to democratizing access to tools
that enable storytelling, fostering inclusivity, and unlocking the full potential of creators
worldwide.
1.3 OBJECTIVES
● Output Generation: Users should have an option to download the picture book as a
PDF.
CHAPTER 2
LITERATURE REVIEW
In preparation for this project dedicated to transforming textual narratives into visually
compelling comic books using Stable Diffusion, we conducted a thorough review of pertinent
literature. Our selection included a diverse range of sources, such as conference papers,
research articles, and online publications, which were carefully curated and extensively
analysed. From this comprehensive review, five specific research papers have been
highlighted as key references for our discussion. These studies, along with their associated
methodologies and innovations, hold significant relevance for guiding the development of our
comic book generator system.
A. Warrier et al. [1] present a novel approach for criminal identification, overcoming the
limitations of traditional hand-drawn sketches based on witness descriptions. Traditional
methods are prone to human error and artistic interpretation, leading to inconsistencies and
inaccuracies. The proposed solution leverages stable diffusion models to generate realistic
facial images from textual descriptions and further refine them using text-based editing
techniques. The process involves three key components: face generation, manipulation, and
matching. The stable diffusion model, which avoids the issues of mode collapse seen in
Generative Adversarial Networks (GANs), provides more precise control over the image
creation process. It consists of a variational autoencoder (VAE), a U-Net model, and a text
encoder to convert text into image embeddings. The generated images are high-quality,
photorealistic, and customizable based on user inputs.
For face manipulation, text prompts are used to modify specific features of a generated
image, such as hair colour or facial expressions, while maintaining ethical standards. The
images are then matched against a criminal suspect database using facial recognition
algorithms, which compare the generated face with known suspects, providing a ranked
match score. The paper highlights the potential of diffusion-based models in criminal
investigations, offering higher-quality images and more fine-grained control than previous
methods like GANs. However, challenges remain, such as ensuring that textual descriptions
are accurate and fine-tuning the model to meet specific legal and ethical standards. Future
work suggests improving model performance and integrating more advanced language
models like GPT-3 to refine textual inputs. Additionally, expanding the model’s capability to
process multiple descriptive inputs could allow for a more comprehensive profile, helping
investigators create more accurate suspect images. Further research could also focus on
real-time application, enabling immediate adjustments based on evolving witness statements
to assist ongoing investigations.
Asmitha M et al. [2] investigate automated text summarization using Hugging Face NLP
models. The core idea of the study is to build a summarization system that is easy to implement while
maintaining high performance. Hugging Face’s transformer models, especially the
encoder-decoder architectures such as PEGASUS, are ideally suited for this task due to their
powerful attention mechanisms and their ability to generate human-like summaries. The
study uses publicly available datasets, such as CNN/Daily Mail, for training and evaluation.
To assess the performance of the summarization models, the study employs ROUGE metrics,
which are commonly used to evaluate the quality of summaries by comparing them to
reference summaries. The metrics include ROUGE-1, ROUGE-2, and ROUGE-L, which
measure recall, precision, and F1-score for unigrams, bigrams, and longest common
subsequences, respectively. The results show that Hugging Face’s transformer models
significantly outperform traditional summarization techniques, such as extractive
summarization, which simply selects key sentences from the original text. One of the
standout findings of the paper is the superior performance of PEGASUS, a model specifically
pre-trained for summarization tasks. PEGASUS achieves high scores in ROUGE evaluations,
showcasing its ability to generate abstractive summaries that are not only accurate but also
fluent and contextually relevant.
The paper concludes by emphasising the efficiency and effectiveness of Hugging Face’s NLP
models in automating text summarization. These models reduce the need for manual
summarization and make it easier for developers to integrate summarization capabilities into
applications with minimal effort. The authors also suggest potential areas for future research,
such as enhancing the models by using domain-specific datasets or exploring real-time
summarization applications.
Y. Liu and L. Wang [4] propose MYCloth, an intelligent and interactive online T-shirt
customization system based on the user's preferences, presenting an innovative approach
that utilises advanced AI techniques to enhance user experience.
Traditional T-shirt customization often faces communication challenges and limited visual
feedback, requiring users to repeatedly adjust designs in a cumbersome process. MYCloth
addresses these issues by leveraging AI models like ChatGPT for refining text inputs and
Stable Diffusion for generating design patterns. MYCloth introduces an intelligent system
that allows users to input text descriptions for design themes. ChatGPT refines these
descriptions, and the Stable Diffusion model generates high-quality prints based on the
refined input. Users can then visually adjust and preview their T-shirt designs using a novel
virtual try-on model. This try-on model uses deep learning techniques to realistically display
the designed T-shirt on virtual avatars, offering real-time feedback and an immersive
customization experience.
The system consists of four main components: pattern selection, paint generation, cloth
adjustment, and virtual try-on. Users can select patterns, create designs from text, adjust the
placement and colour of prints, and see how the designs would look when worn. MYCloth
improves upon existing systems by allowing more interactive control over design elements
and offering an immersive virtual try-on experience, enhancing user satisfaction and reducing
customization time. The paper also evaluates the system through user studies and quantitative
performance metrics, showcasing MYCloth's superior performance in both virtual try-on
accuracy and user experience compared to existing methods. These findings highlight its
potential to transform the online T-shirt customization process, making it more accessible and
engaging for users. Furthermore, the adaptability of MYCloth’s framework could inspire
similar advancements in other areas of online apparel customization, supporting a broader
trend toward personalised digital fashion solutions [4].
Diana Earshia V. et al. [5] explore the use of Generative Adversarial Networks (GANs) to
automatically generate amusing and visually distinct animated characters. The focus is on
utilising the powerful capabilities of GANs for creative content generation, specifically in
character design, where humour and originality are key elements. The study aims to
streamline the character creation process in animation, which traditionally demands
substantial artistic effort and manual design.
The adversarial training between the generator and discriminator networks pushes the
generator to improve iteratively, producing characters that can pass as
authentic, hand-crafted artwork.
The paper also addresses common challenges in training GANs, such as mode collapse,
where the generator repeatedly produces similar outputs, thereby limiting diversity.
Techniques like feature matching and batch normalisation were implemented to stabilise the
training process and improve the variety of generated characters. Additionally, Conditional
GANs (cGANs) were explored to provide more control over character attributes, enabling the
generation of characters with specific features like unique facial expressions or body shapes
[5].
Y. Gyungho et al. [6] address the challenges of generating webtoons from multilingual text
inputs using deep learning models. Text-to-image technology, which enables computers to
generate images based on input text, has advanced significantly with the use of GANs.
However, most of these advancements have focused on generating images from
English-language text. The authors propose a solution that leverages multilingual
text-to-image models to automate parts of the webtoon creation process, reducing the need
for extensive human intervention.
Previous research in the field of text-to-image generation has explored GAN-based models
like DCGAN, StackGAN, and AttnGAN, which extract textual features and generate images
based on those features. These models, while successful in generating photorealistic images,
often struggle with capturing contextual nuances from complex texts. In response, more
recent approaches like multimodal learning and diffusion models have been introduced,
providing improved image quality and stability during the generation process. However, most
of these models still focus predominantly on English inputs. The authors of this study
highlight the need for multilingual text-to-image technology, as translating native language
text into English introduces inefficiencies and can reduce the quality of the generated images.
The study presents a novel approach by utilising BERT to extract feature vectors from texts
written in multiple languages, including English and Korean. The researchers trained a
GAN-based model, specifically a DCGAN, on a dataset of webtoons created using both
English and Korean text descriptions, transforming the images into cartoon-like styles using
CartoonGAN. The model demonstrated the ability to generate images that were contextually
similar to the input text, with evaluation metrics like the Inception score and FID score
validating the quality of the generated webtoon images. In conclusion, the study successfully
introduces a multilingual text-to-image model capable of generating webtoon images based
on multilingual inputs. This model contributes to the automation of webtoon creation,
reducing the reliance on manual drawing and potentially lowering production costs. Although
the generated images show promising results, the authors note that there is room for
improvement, particularly in enhancing image detail and addressing issues related to training
stability in GAN-based models. Future research may explore the use of more advanced
diffusion models to further improve the quality and diversity of the generated webtoon
images. Additionally, expanding the model's language capabilities to include other languages
could broaden its application, enabling more creators to access automated webtoon
generation. Further work could also involve optimising the model for faster processing
speeds, making it more feasible for real-time or interactive applications in digital storytelling
[6].
K. Yu et al. [7] present a novel approach to assist webtoon creators by integrating deep
learning models. The research leverages contrastive learning via CLIP and diffusion models
to generate webtoons from text descriptions. The authors first constructed a multimodal
webtoon dataset by converting publicly available datasets (e.g., MSCOCO) into cartoon-style
images using CartoonGAN. They used CLIP, which combines a multilingual BERT for text
and Vision Transformer for images, to associate text with corresponding webtoon images.
CLIP was trained to extract features and align multimodal data, while the diffusion model
was used to generate new webtoon images based on the most text-similar input.
The study also highlights that while previous attempts using GAN-based methods produced
low-quality images, the diffusion model in this work allowed for more realistic and diverse
image generation. Despite these advancements, limitations remain in generating webtoons
from multi-sentence inputs while ensuring artistic consistency. Further research is suggested
to address these challenges. Through experiments using both single- and continuous-text
inputs, the researchers achieved promising results, with an inception score of 7.14 when using
continuous-text inputs. This development shows potential to streamline webtoon creation,
empowering artists with AI-driven tools for more efficient production [7].
Anoushka Popuri and John Miller [8] explore the transformative role of GANs in image-related tasks
through adversarial training between a generator and a discriminator network. The generator's
goal is to produce realistic images, while the discriminator aims to differentiate between real
and fake images, thus pushing the generator to improve its outputs. This dynamic has enabled
breakthroughs in generating high-quality, photorealistic images, making GANs highly
influential in creative applications like digital content creation, gaming, and even automated
storytelling.
The techniques pioneered by GANs have paved the way for utilising deep learning models to
generate diverse and contextually relevant visual content. The ability of GANs to perform
style transfer and image-to-image translation is particularly relevant for transforming
sketches or conceptual art into fully coloured and detailed comic panels. Additionally,
techniques like cGANs enable controlled image generation based on specific inputs, making
it possible to design characters or scenes that align with a particular storyline or
style. However, despite their success, GANs face challenges like mode collapse (where the
generator produces limited variations) and instability in training, which can affect consistency
in generated outputs. To address these limitations, enhancements like Wasserstein GANs
(WGANs) and Progressive GANs have been developed to stabilise training and improve
image quality. These insights are valuable for integrating diffusion models, like Stable
Diffusion, in generating complex and diverse visuals required for comic books. By
combining the strengths of both GANs and diffusion models, projects can achieve more
detailed and coherent art generation, allowing for automated yet artistically rich content
creation [8].
J. Zakraoui et al. [9] introduce a novel framework that converts natural language stories into
visual representations. The goal is to bridge the gap between text and imagery, providing a
way to automatically generate visual narratives from written stories. It focuses on the
challenge of translating textual information into meaningful, coherent visual content, which
can be used for a variety of applications such as digital storytelling, interactive media, and
educational tools.
The proposed pipeline consists of several key stages. The first stage involves extracting the
structure and content of the narrative from the input text. This is done by parsing the story
and identifying the core elements, such as characters, events, settings, and actions. The
system uses NLP techniques to detect entities (e.g., people, objects) and relationships
between them, as well as the temporal and spatial relationships that define the flow of the
story. Once the entities and actions are identified, the next step is to create a visual
representation. This involves generating scenes based on the information extracted from the
text. The generated images are then arranged to reflect the narrative flow, with careful
attention to how events unfold and how characters interact with their environment. To ensure
that the visualisations are coherent and accurate, the system incorporates a feedback loop
where the generated images are assessed for consistency with the original story. The paper
emphasises the importance of maintaining narrative integrity while generating diverse and
creative visuals that capture the essence of the story.
The paper also highlights the role of user input in guiding the visualisation process. While
the system is designed to automate most of the process, it allows users to adjust and fine-tune
the visuals to better match their interpretation of the story. The proposed pipeline provides a
powerful tool for transforming text into engaging, informative visuals, with potential
applications in storytelling, media production, and educational tools [9].
T. Kwon et al. [10] introduce DiffusionCLIP, a method that combines diffusion models and CLIP
for text-guided image manipulation. DiffusionCLIP enhances the ability of diffusion models
to generate and modify images based on textual descriptions. The model integrates the
strengths of CLIP, which is trained to connect visual and textual representations, with the
generative power of diffusion models, which iteratively refine noisy images into coherent
outputs. The core idea of DiffusionCLIP is to guide the image generation process with textual
prompts, enabling the model to modify or create images in a way that aligns with the user's
input. The authors propose a robust framework that allows for better control over image
attributes, such as style, content, and composition, using simple textual descriptions.
DiffusionCLIP significantly advances the field of text-guided image editing by combining the
strengths of diffusion models and CLIP embeddings to achieve unprecedented levels of
precision, coherence, and flexibility. It effectively addresses the limitations of previous
models by ensuring a tight alignment between textual descriptions and visual outputs,
enabling users to make highly specific and contextually relevant edits. Through iterative
refinement, the method produces images that are not only aesthetically pleasing but also
semantically faithful to the input prompts, handling a diverse range of tasks such as image
synthesis, style transfer, and content manipulation with remarkable accuracy. This innovative
framework demonstrates its adaptability across various scenarios, from subtle attribute
modifications like adjusting lighting or color tones to dramatic transformations that
completely reimagine the original image, all while maintaining visual integrity. By enhancing
user control over intricate details and offering seamless flexibility, DiffusionCLIP opens new
doors for creative exploration in fields such as digital content creation, graphic design, visual
storytelling, and fine arts. Its user-friendly design empowers both professionals and amateurs
to bring their imaginative ideas to life effortlessly, democratizing access to sophisticated
AI-driven tools. Furthermore, its capability to generate high-quality, contextually accurate
images has the potential to revolutionize industries like marketing, entertainment, education,
and product design, making it a pivotal innovation in the realm of AI-powered visual
generation. As a versatile and intuitive solution, DiffusionCLIP sets a new benchmark for
text-to-image interaction, paving the way for future advancements in creative technologies
[10].
CHAPTER 3
GAP ANALYSIS
In the realm of automated image generation, particularly for applications such as comic book
creation, existing systems face several notable limitations. These challenges hinder the
effectiveness and versatility of current technologies, making them less suited for the complex
demands of comic storytelling. A major challenge in existing image generation systems is
mode collapse, where the model produces a limited variety of similar or repetitive images.
This is particularly problematic for comics, which require diverse character designs,
environments, and settings to maintain visual interest and narrative flow. Comics rely on
distinct panels to depict varied scenes, emotions, and actions, and mode collapse results in
monotonous visuals that fail to convey the uniqueness of each scene.
Another significant gap in current image generation systems is the lack of fine control over
generated content. In comic book production, creators need the ability to guide the generation
process based on specific requirements, such as character features, scene layout, and
emotions. However, most systems do not allow for precise manipulation, which often results
in visuals that fail to capture the intended essence of the story, with characters appearing
inconsistent, emotions misrepresented, or layouts diverging from the envisioned sequence.
This lack of control limits
their practical use for comic creators, highlighting the need for tools that enable detailed
customization in image generation.
These limitations present significant barriers for application in comic book creation. Mode
collapse results in repetitive visuals, which fail to capture the diversity and dynamic nature
required for comic storytelling. Similarly, the inability to precisely control character design,
scene layout, and emotions limits the customization needed for aligning images with the
intended storyline. In contrast, our project utilises the stability and flexibility of diffusion
models, an approach that can generate diverse, high-quality images with greater control over
specific aspects of the comic creation process. This enables the production of unique,
contextually accurate visuals that align with detailed story inputs, offering a promising
solution to the gaps identified in existing systems. By bridging these limitations, our
approach opens new possibilities for creating visually compelling and narratively coherent
comics.
CHAPTER 4
PROPOSED SOLUTION
4.1 METHODOLOGY
The methodology for generating a comic book integrates several advanced AI-driven
components into a streamlined pipeline. It begins with Input Handling, where the user's story
is accepted and pre-processed. The input is passed to a Text Processing Module using the
LED (Longformer Encoder-Decoder) model, which generates a concise and coherent
summary tailored for comic book storytelling. This is followed by Coreference Resolution
using Fastcoref, ensuring that all pronouns and references are unambiguously linked to their
antecedents, thereby enhancing narrative clarity.
Next, the refined text is fed into the Image Generation Module, which utilises the Stable
Diffusion model to produce high-quality illustrations based on the story's content. In parallel,
an Audio Generation Module leverages the Google Text-to-Speech (gTTS) library to produce
narrations for each section of the comic. Finally, all elements—text, images, and audio—are
integrated into a PDF Output, created using the FPDF Python library, to generate a cohesive
and visually engaging comic book. This end-to-end methodology ensures both automation
and high-quality results, making it ideal for transforming textual stories into immersive
multimedia picture books.
4.1.1 INPUT AND USER INTERFACE DEVELOPMENT
The input and UI module is designed to provide an intuitive and user-friendly platform for
interacting with the comic book generator. Developed using HTML, CSS, and JavaScript, the
UI allows users to input their own story. Once the input is processed, the resulting comic
book is displayed on the user’s screen in an interactive, page-by-page format. The interface
dynamically presents the text alongside the generated visual elements, such as images and
illustrations that correspond to the narrative. Users can navigate through the pages with
intuitive buttons, allowing seamless transitions between comic pages. The UI also includes an
embedded audio player that provides narration for each page. This adds an immersive,
multi-sensory layer to the comic book, enhancing engagement. The narration is synchronised
with the text and images on each page, ensuring a cohesive and enhanced user experience.
Furthermore, the interface abstracts the complexity of the underlying processes involved in
comic book generation, offering a straightforward and accessible platform. This allows users
to create visually appealing and narratively consistent comic books with minimal effort. As a
result, it serves as a versatile tool for a broad audience, from casual users interested in
creating their own stories to more experienced creators seeking an efficient way to visualise
and share their narratives.
Text summarization is a crucial step in adapting lengthy narratives into concise text suitable
for comic book panels. For this project, the LED model is implemented, chosen for its
capability to process long-form text efficiently. Unlike traditional models limited by shorter
input lengths, LED can handle extended content, ensuring that the core elements of a story
are preserved. This ensures that even complex narratives can be distilled into meaningful,
engaging summaries while maintaining their original context and emotional depth. By
generating succinct text, the model facilitates the creation of dialogue and captions that fit
naturally within the spatial constraints of comic panels.
The process involves tokenizing the input text and feeding it into the summarization model,
which reduces the content while aligning it with the storytelling format of comics. The result
is text that retains essential plot points, character interactions, and key themes, enabling
seamless integration with the visual elements. This step not only optimises storytelling for the
comic medium but also enhances reader engagement by delivering concise and impactful
narratives. By leveraging the LED model, the summarization process achieves a balance
between brevity and clarity, making it ideal for transforming long-form text into comic-ready
dialogue and captions.
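The summarization step can be prototyped directly with Hugging Face's LED implementation. The following is a minimal sketch, assuming the publicly available allenai/led-base-16384 checkpoint and illustrative generation settings; the report does not specify the exact checkpoint or parameters used.

```python
# Minimal sketch of the LED summarization step (checkpoint and generation
# parameters are assumptions for illustration).
from transformers import LEDForConditionalGeneration, LEDTokenizer

MODEL_NAME = "allenai/led-base-16384"  # assumed checkpoint
tokenizer = LEDTokenizer.from_pretrained(MODEL_NAME)
model = LEDForConditionalGeneration.from_pretrained(MODEL_NAME)

def summarize_story(story: str, max_summary_tokens: int = 256) -> str:
    """Condense a long story into concise, comic-ready text."""
    inputs = tokenizer(story, return_tensors="pt", truncation=True, max_length=16384)
    summary_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_summary_tokens,
        num_beams=4,
        no_repeat_ngram_size=3,
    )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```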
Coreference resolution is crucial for maintaining coherence and consistency in the text by
identifying and resolving references such as pronouns, determiners, and other referring
expressions back to their original entities (e.g., character names or objects). This ensures that
the narrative flow is uninterrupted and that each entity is consistently recognized throughout
the text, which is particularly important for generating accurate and contextually appropriate
images.
Once coreference resolution is applied, the text becomes more structured and easier to
process. This clarity is vital for the image generation module, as it helps the system
understand which characters or objects the text refers to when creating images. For instance,
if the text mentions "Alice" and later uses the pronoun "she," the system must be able to
recognize that "she" refers to "Alice" to generate consistent images of that character. The
coreference resolution step prevents confusion and ensures that images are generated based
on the correct entities.
After resolving coreferences, the text is split into individual sentences, which are then passed
as input prompts to the image generation module. This step is essential for maintaining
narrative consistency, as it allows the generated images to accurately reflect the progression
of the story. Additionally, coreference resolution ensures that characters and objects retain
their identity across different scenes and pages, preserving visual continuity in the comic
book. This is especially important when dealing with multiple characters, objects, or complex
storylines, where accurate representation of references is necessary for a coherent and
engaging comic book experience.
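As a concrete illustration, the coreference-resolution and sentence-splitting stage might be wired up as below using fastcoref's spaCy integration; the spaCy model name and configuration are assumptions rather than the project's exact setup.

```python
# Sketch of coreference resolution with fastcoref, followed by sentence
# splitting; the spaCy pipeline name is an assumed default.
import spacy
from fastcoref import spacy_component  # noqa: F401  (registers the "fastcoref" pipe)

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("fastcoref")

def resolve_and_split(summary: str) -> list[str]:
    """Replace pronouns with their antecedents, then split into sentences."""
    doc = nlp(summary, component_cfg={"fastcoref": {"resolve_text": True}})
    resolved = doc._.resolved_text  # e.g. "she" is rewritten as "Alice"
    return [sent.text.strip() for sent in nlp(resolved).sents if sent.text.strip()]

# Each returned sentence becomes one prompt for the image-generation module.
```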
To bring the summarised text to life, each condensed narrative is paired with an AI-generated
image, creating a visual representation of the scene. The Stable Diffusion API is integrated to
perform text-to-image synthesis, transforming the resolved text from the summarization
process into high-quality images. By using the final text as a prompt, the model generates
detailed illustrations that capture the essence of the scene, from character appearance to
background settings. The flexibility of the Stable Diffusion model allows for the
customization of various parameters, including style, composition, and resolution, which can
be predefined or adjusted via the user interface. This ensures that the generated images align
with the intended artistic direction of the comic, whether it's a manga style, a Western comic
look, or something entirely unique.
The generated images are then stored in sequence, ensuring that they correspond to the
summarised text in the correct order to form a coherent narrative. This process enables the
seamless creation of a comic book, where both the text and visuals work together to tell a
compelling story. By pairing the summarised text with tailored AI-generated imagery, the
project offers an efficient and accessible way to produce high-quality comic books,
transforming text-based narratives into immersive visual experiences. This integration of AI
in both the textual and visual creation stages allows for rapid prototyping and customization,
making comic book creation more accessible and dynamic.
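For context, the image-generation call could look like the sketch below. The report integrates a Stable Diffusion API; this variant runs the model locally through the Hugging Face diffusers library, and the checkpoint, style suffix, and sampling parameters are illustrative assumptions.

```python
# Sketch of text-to-image synthesis with Stable Diffusion via diffusers
# (model id, style suffix, and sampling settings are assumptions).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

STYLE_SUFFIX = "comic book panel, bold ink outlines, vibrant colors"

def generate_panel(sentence: str, index: int) -> str:
    """Render one resolved sentence as a comic panel and return the image path."""
    prompt = f"{sentence}, {STYLE_SUFFIX}"
    image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
    path = f"panel_{index:03d}.png"
    image.save(path)
    return path
```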
To further enhance the storytelling experience, audio narration is added to each comic panel,
providing an immersive auditory element to complement the visuals and text. The Google
Text-to-Speech (TTS) API is used to generate high-quality audio from the summarised text.
Each piece of text is converted into a narration using a chosen voice and language, allowing
for customization based on the desired tone, gender, and accent. This ensures that the
narration aligns with the mood and context of the story, whether it's a dramatic, humorous, or
emotional moment.
The generated audio files are saved in a standard format, such as MP3, making them easily
accessible for integration with the comic panels. Once the audio files are created, they are
linked to the corresponding panels, ensuring that the narration is synchronised with the text
and visuals. This audio feature not only enhances accessibility but also deepens the reader's
engagement by adding a dynamic layer to the storytelling process. With audio narration, the
comic book becomes a multi-sensory experience, allowing readers to enjoy the story through
both sight and sound.
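The narration step maps almost directly onto the gTTS API; a minimal sketch, with the language and speed chosen as illustrative defaults:

```python
# Sketch of per-panel narration with Google Text-to-Speech (gTTS).
from gtts import gTTS

def narrate(sentence: str, index: int, lang: str = "en") -> str:
    """Convert one panel's text into an MP3 narration and return the file path."""
    path = f"narration_{index:03d}.mp3"
    gTTS(text=sentence, lang=lang, slow=False).save(path)
    return path
```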
The final output of the project is a comprehensive PDF that combines text, images, and audio
narration into a cohesive comic book. The FPDF library is used to create and format the PDF,
providing a flexible and efficient way to structure the content. Text and images are arranged
sequentially within the document, ensuring that the layout mirrors the typical flow of a comic
book, with each panel clearly separated and aligned to maintain readability. The images,
generated via Stable Diffusion, are placed alongside the corresponding summarised text,
creating a visual narrative that guides the reader through the story.
To enhance the experience further, audio files are embedded as clickable links within the
PDF. These links provide direct access to the audio narration for each panel, allowing readers
to listen to the narrated content as they view the comic. The integration of multimedia
elements, including text, images, and audio, transforms the comic into a dynamic, interactive
format. Once the PDF is compiled, it is made available for download, allowing users to easily
access and share the complete picture book with integrated storytelling features. This process
not only makes the comic more immersive but also ensures that the content is easily
distributable in a widely-used format.
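The assembly step can be sketched with the FPDF library as follows; the page layout, font, and link text are assumptions chosen for illustration rather than the project's exact formatting.

```python
# Sketch of combining text, images, and audio links into a single PDF
# with FPDF (layout values are illustrative).
from fpdf import FPDF

def build_comic_pdf(panels: list[dict], out_path: str = "comic.pdf") -> None:
    """panels: [{"text": ..., "image": ..., "audio": ...}, ...] in story order."""
    pdf = FPDF()
    pdf.set_auto_page_break(auto=True, margin=15)
    for panel in panels:
        pdf.add_page()
        pdf.image(panel["image"], x=10, w=190)                   # generated illustration
        pdf.set_font("Helvetica", size=12)
        pdf.multi_cell(0, 8, panel["text"])                      # summarised caption
        pdf.set_text_color(0, 0, 255)
        pdf.cell(0, 10, "Play narration", link=panel["audio"])   # clickable audio link
        pdf.set_text_color(0, 0, 0)
    pdf.output(out_path)
```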
The architecture for creating a comic book using Stable Diffusion is structured into three
main components: User Interface, Text Processing Module, and Integration Module. The
process begins with a story input provided by the user through the User Interface. This story
is then passed into the Text Processing Module, which performs multiple steps to refine the
input.
Within the Text Processing Module, the story is summarised using a Summary Generator
(LED), ensuring the output is concise and suitable for a comic book format. Coreference
Resolution (Fastcoref) ensures that pronouns and references in the story are clearly resolved,
enhancing coherence. After this, the story is divided into sentences using a Sentence Splitter,
preparing the text for both image and audio generation.
The Integration Module is responsible for turning the processed sentences into multimedia
elements. Sentences are further refined using a Sentence Generation component and then fed
into the Image Generation module, powered by the Stable Diffusion API, to produce visually
compelling illustrations. Simultaneously, an Audio Generation component (Google TTS)
generates narrations for each sentence. Finally, these images and audio clips are combined to
produce the comic book, which is presented as the final output. This modular approach
ensures the system is flexible, scalable, and capable of producing high-quality picture books.
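Putting the modules together, the end-to-end flow described above can be expressed as a short driver function; this sketch simply chains the hypothetical helpers introduced in the earlier sketches (summarize_story, resolve_and_split, generate_panel, narrate, build_comic_pdf).

```python
# End-to-end sketch chaining the helper functions from the earlier sketches.
def generate_comic(story: str, out_path: str = "comic.pdf") -> str:
    summary = summarize_story(story)              # Text Processing Module (LED)
    sentences = resolve_and_split(summary)        # Fastcoref + sentence splitter
    panels = []
    for i, sentence in enumerate(sentences):      # Integration Module
        panels.append({
            "text": sentence,
            "image": generate_panel(sentence, i), # Stable Diffusion
            "audio": narrate(sentence, i),        # Google TTS
        })
    build_comic_pdf(panels, out_path)             # FPDF output
    return out_path
```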
The LED (Longformer Encoder-Decoder) model is designed to efficiently process long text
sequences, particularly for tasks like text summarization, where documents can often exceed
the typical input length of traditional Transformer models. The architecture of LED builds on
the Longformer model, which adapts the Transformer architecture to handle long-range
dependencies in a computationally efficient way.
The encoder in LED employs a unique attention mechanism known as sliding window
attention. In traditional Transformer models, attention is computed between every pair of
tokens in the input sequence, leading to quadratic time complexity. This becomes
computationally expensive for long texts. Longformer’s sliding window attention mitigates
this issue by restricting attention to a local window of surrounding tokens, reducing
computational cost. Additionally, Longformer uses global attention for tokens that are
deemed important, such as those representing the start or end of sentences. This global
attention ensures that essential context is preserved while processing long documents. The
decoder of the LED model mirrors the encoder’s sliding window attention approach but is
designed to generate the output sequence, typically a summary of the input text. The decoder
attends to the encoded representation of the input text, and the combination of local and
global attention mechanisms ensures that the model generates coherent and concise
summaries.
In LED, the input text undergoes tokenization, where the text is split into smaller units,
typically using the Byte Pair Encoding (BPE) method. These tokens are then converted into
dense vector representations through pre-trained embeddings, which capture the semantic
meaning of the text. These embeddings are passed through the encoder for processing. The
core innovation of LED lies in its attention mechanism. Rather than attending to all tokens in
the sequence, which is computationally expensive, each token attends only to a local window
of surrounding tokens, improving efficiency. Additionally, certain tokens are assigned global
attention, ensuring that key information from across the document is not overlooked.
The LED model also incorporates positional encoding to account for the order of tokens in
the sequence. Since the attention mechanism does not inherently consider token order,
positional encodings are added to the token embeddings to maintain the relative positions of
tokens within the input text. At the end of the process, the LED model generates the output
sequence, which is typically a concise summary of the input text. The model extracts the
most important content and discards irrelevant information, ensuring that the summary
remains coherent and aligned with the original meaning of the document.
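To make the attention scheme concrete, the snippet below shows how local and global attention are typically expressed when calling LED through the transformers library: every token gets sliding-window attention by default, and a binary mask marks the tokens that additionally receive global attention (here only the first token, a common convention for summarization; the checkpoint name is an assumption).

```python
# Illustration of LED's two attention inputs (checkpoint assumed).
import torch
from transformers import LEDTokenizer

tokenizer = LEDTokenizer.from_pretrained("allenai/led-base-16384")
inputs = tokenizer("A long story ...", return_tensors="pt")

# 0 = local sliding-window attention only, 1 = additional global attention.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1  # the first token attends to (and is attended by) every token

# Passed to model.generate(..., global_attention_mask=global_attention_mask)
# together with input_ids and attention_mask.
```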
Fig 4.3 LED Model
To establish a meaningful connection between textual inputs and the visual outputs, CLIP
plays a pivotal role. CLIP is a pre-trained neural network capable of understanding both text
and images in a shared semantic space. This capability allows Stable Diffusion to interpret
textual descriptions and align them accurately with the generated images. Acting as a guide
during the image creation process, CLIP evaluates the similarity between the text and image
representations, offering a score that reflects their alignment. This feedback loop ensures that
the generated image corresponds closely to the semantic meaning of the input prompt, even
for intricate or abstract descriptions. By leveraging CLIP, Stable Diffusion achieves a high
degree of consistency between what is described in text and what is visually produced,
making it a powerful tool for creative and practical applications.
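A small sketch of the kind of text-image alignment score CLIP provides, using the openly available CLIP checkpoint on the Hugging Face hub; the report does not state which CLIP variant is used, so the model choice here is purely illustrative.

```python
# Sketch of scoring text-image alignment with CLIP (checkpoint assumed).
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_alignment_score(prompt: str, image_path: str) -> float:
    """Higher values indicate the image matches the prompt more closely."""
    inputs = processor(
        text=[prompt], images=Image.open(image_path), return_tensors="pt", padding=True
    )
    outputs = clip(**inputs)
    return outputs.logits_per_image.item()  # scaled cosine similarity
```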
The denoising diffusion process drives the actual generation of images in Stable Diffusion.
This process begins with an image that is initialized as pure random noise. Through a series
of iterative steps, the model learns to refine this noise, progressively denoising it to bring the
image closer to the desired output. During training, the model is exposed to a reverse process
where noise is systematically added to an image, enabling it to understand how to reverse this
degradation. At inference, the model applies this learned reverse process, gradually
transforming random noise into a clear and coherent image. This iterative refinement ensures
that the generated images are highly detailed, stylistically consistent, and capable of
accurately capturing the nuances of complex textual prompts. By combining these powerful
components — VAE for efficient encoding, CLIP for text-to-image alignment, and denoising
diffusion for image refinement — Stable Diffusion can produce stunning visuals that are both
creative and contextually accurate. This seamless integration of components allows the model
to excel in generating images that are not only creative but also deeply aligned with the intent
of the input descriptions.
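The paragraph above corresponds closely to the manual denoising loop one can write from the individual diffusers building blocks. The sketch below (checkpoint, step count, and prompt are assumptions, and classifier-free guidance is omitted for brevity) shows the three components working together: the CLIP text encoder conditions the U-Net, the scheduler iteratively denoises a random latent, and the VAE decodes the final latent into an image.

```python
# Sketch of the denoising diffusion loop using the individual Stable
# Diffusion components (checkpoint assumed; guidance omitted for brevity).
import torch
from diffusers import AutoencoderKL, PNDMScheduler, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer

REPO = "runwayml/stable-diffusion-v1-5"  # assumed checkpoint
tokenizer = CLIPTokenizer.from_pretrained(REPO, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(REPO, subfolder="text_encoder")
unet = UNet2DConditionModel.from_pretrained(REPO, subfolder="unet")
vae = AutoencoderKL.from_pretrained(REPO, subfolder="vae")
scheduler = PNDMScheduler.from_pretrained(REPO, subfolder="scheduler")

prompt = ["a knight walking through a misty forest, comic style"]
tokens = tokenizer(prompt, padding="max_length",
                   max_length=tokenizer.model_max_length, return_tensors="pt")
with torch.no_grad():
    text_embeddings = text_encoder(tokens.input_ids)[0]  # CLIP text conditioning

# Start from pure random noise in the VAE's latent space.
latents = torch.randn((1, unet.config.in_channels, 64, 64))
scheduler.set_timesteps(30)
latents = latents * scheduler.init_noise_sigma

# Iteratively predict and remove noise, conditioned on the text embeddings.
for t in scheduler.timesteps:
    latent_input = scheduler.scale_model_input(latents, t)
    with torch.no_grad():
        noise_pred = unet(latent_input, t, encoder_hidden_states=text_embeddings).sample
    latents = scheduler.step(noise_pred, t, latents).prev_sample

# Decode the denoised latent back into pixel space with the VAE.
with torch.no_grad():
    image = vae.decode(latents / vae.config.scaling_factor).sample
```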
CHAPTER 5
CONCLUSION
The comic book generator utilising Stable Diffusion presents a significant leap in leveraging
AI for creative content generation. This project successfully integrates advanced technologies
such as natural language processing, image synthesis, and multimedia creation to streamline
the traditionally time-consuming process of comic book production. By automating tasks like
text analysis, image generation, and audio integration, the system delivers a comprehensive
solution for producing high-quality, visually engaging comic books.
The system offers substantial advantages to creators by reducing the time and costs
associated with manual comic production. By automating repetitive processes, it allows
artists and writers to focus on enhancing creativity and storytelling rather than technical
execution. For individuals with compelling storylines but limited artistic skills, this tool
provides an accessible way to bring their ideas to life, democratising the comic creation
process. Additionally, this project holds significant potential for educational and
entertainment purposes. In education, it can be used to create engaging visual content to
simplify complex subjects, making learning more interactive and enjoyable. For comic
enthusiasts and hobbyists, the system offers an affordable and efficient way to explore their
passion for storytelling and art.
By accommodating diverse genres and artistic styles, the approach demonstrates the
flexibility of Stable Diffusion in maintaining consistency and adapting to creative needs. This
project underscores the potential of AI to not only enhance productivity but also empower
individuals with creative ideas, making it a valuable contribution to both the creative and
educational domains.
FUTURE SCOPE
● Artistic Style Diversification
Enable users to select or customise artistic styles such as manga, vintage comics, or
modern art. Incorporate style transfer techniques to adapt or create unique visuals
tailored to user preferences.
● Enhanced User Experience
Develop an intuitive interface using frameworks like Flutter, offering real-time
previews, drag-and-drop tools, and step-by-step guides to simplify the creation
process for users.
● Collaborative Features
Introduce multi-user collaboration with features like version control, real-time edits,
and shared workspaces. Integration with project management tools can enhance
teamwork and client interactions.
● Broader Applications
Expand use cases to education (visual learning materials), corporate training
(interactive manuals), and therapy (storytelling-based support). Utilize comics for
impactful communication in awareness campaigns.
As these advancements come to fruition, the comic book generator could evolve into a
comprehensive solution for creative professionals, educators, and businesses alike. This could
result in a tool capable of delivering tailored, immersive, and impactful visual storytelling
that transcends its initial scope, offering enhanced user engagement, collaborative features,
and real-time customization options. By incorporating sophisticated features such as
AI-driven customization, adaptive storytelling templates, and intuitive user interfaces, the
tool could cater to a wide range of creative needs. The continued development of these
features will ensure that the comic book generator remains at the forefront of digital content
creation, providing innovative solutions to an expanding user base, and establishing itself as a
versatile platform for fostering creativity and communication across diverse industries. This
could establish itself as an essential tool for enhancing communication, and enabling
innovative storytelling across diverse industries, from entertainment and education to
marketing and beyond.
REFERENCES
[1] A. Warrier, A. Mathew, A. Patra, K. S. Hiremath and J. Jijo, “Generation and Editing of
Faces using Stable Diffusion with Criminal Suspect Matching”, 2024 IEEE International
Conference on Advanced Systems and Emergent Technologies (IC_ASET), Hammamet,
Tunisia, 2024, pp. 1-6, doi: 10.1109/IC_ASET61847.2024
[2] Asmitha M, Aashritha Danda, Hemanth Bysani, Rimjhim Padam Singh and Sneha Kanchan,
“Automation of Text Summarization Using Hugging Face NLP”, 2024 5th International
Conference for Emerging Technology (INCET), Belgaum, India, 2024, pp. 1-7, doi:
10.1109/INCET61516.2024.10593316.
[3] Y. Zhang, T. Zhang and H. Xie, “TexControl: Sketch-Based Two-Stage Fashion Image
Generation Using Diffusion Model” , 2024 Nicograph International (NicoInt), Hachioji,
Japan, 2024, pp. 64-68, doi: 10.1109/NICOInt62634.2024.00021.
[4] Y. Liu and L. Wang, “MYCloth: Towards Intelligent and Interactive Online T-Shirt
Customization based on User’s Preference”, 2024 IEEE Conference on Artificial Intelligence
(CAI), Singapore, 2024, pp. 955-962, doi: 10.1109/CAI59869.2024.00175.
[5] Diana Earshia V, Veeri Venkata Hemanth Kumar, Raghunath Dinesh Kumar and Vangavaragu
Moni Sahithi, “Generation of Hilarious Animated Characters using GAN”, 2023 7th
International Conference on Trends in Electronics and Informatics (ICOEI), Tirunelveli,
India, 2023, pp. 63-66, doi: 10.1109/ICOEI56765.2023.10125904.
[6] Y. Gyungho, H. Kim, J. Kim, and C. Chun, “A Study on Generating Webtoons Using
Multilingual Text‐to‐Image Models” , Appl. Sci., vol. 13, no. 12, p. 7278, Jun. 2023, doi:
10.3390/app13127278.
[7] K. Yu, H. Kim, J. Kim, C. Chun, and P. Kim, “A Study on Webtoon Generation Using
CLIP and Diffusion Models”, Electronics, vol. 12, no. 18, p. 3983, Sep. 2023, doi:
10.3390/electronics12183983.
[8] Anoushka Popuri and John Miller, “Generative Adversarial Networks in Image Generation
and Recognition”, 2023 International Conference on Computational Science and
Computational Intelligence (CSCI), Las Vegas, NV, USA, 2023, pp. 1294-1297, doi:
10.1109/CSCI62032.2023.00212.
[9] J. Zakraoui, M. Saleh, S. Al-Maadeed, and J. M. Alja’am, “A Pipeline for Story
Visualization from Natural Language”, Appl. Sci. 2023, vol. 13, no. 8, p. 5107, Apr. 2023,
doi: 10.3390/app13085107.
[10] T. Kwon, G. Kim, and J. C. Ye, “DiffusionCLIP: Text-Guided Diffusion Models for
Robust Image Manipulation”, in 2022 IEEE/CVF Conf. Computer Vision and Pattern
Recognition (CVPR), Jun. 2022, doi: 10.1109/CVPR52688.2022.00246.
[11] B. Jadhav, M. Jain, A. Jajoo, D. Kadam, H. Kadam and T. Kakkad, “Imagination Made
Real: Stable Diffusion for High-Fidelity Text-to-Image Tasks”, 2024 2nd International
Conference on Sustainable Computing and Smart Systems (ICSCSS), Coimbatore, India,
2024, pp. 773-779, doi: 10.1109/ICSCSS60660.2024.10625113
[12] K. Mallikharjuna Rao and T. Patel, “Enhancing Control in Stable Diffusion Through
Example-based Fine-Tuning and Prompt Engineering”, 2024 5th International Conference on
Image Processing and Capsule Networks (ICIPCN), Dhulikhel, Nepal, 2024, pp. 887-894,
doi: 10.1109/ICIPCN63822.2024.00153.
[13] N. Zade, G. Mate, K. Kishor, N. Rane and M. Jete, "NLP Based Automated Text
Summarization and Translation: A Comprehensive Analysis," 2024 2nd International
Conference on Sustainable Computing and Smart Systems (ICSCSS), Coimbatore, India,
2024, pp. 528-531, doi: 10.1109/ICSCSS60660.2024.10624907.
[14] N. N, A. Narayan, A. M. Sridharan and A. Pradhan, "Automated Text Summarizer Using
Google Pegasus," 2023 International Conference on Smart Systems for applications in
Electrical Sciences (ICSSES),Tumakuru, India, 2023, pp. 1-4, doi:
10.1109/ICSSES58299.2023.10199721.
[15] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, "Hierarchical
Text-Conditional Image Generation with CLIP Latents," arXiv, April 2022, doi:
10.48550/arXiv.2204.06125.