Hyperparameter Tuning for Poetry Generation to Enhance Creativity
Bachelor of Technology
in
Computer Science and Engineering
by
Manish Goyal (20BCE2849)
May, 2024
DECLARATION
I hereby declare that the thesis entitled “Hyperparameter Tuning for Poetry
Generation to Enhance Creativity” submitted by me, for the award of the degree of Bachelor
of Technology in Computer Science and Engineering to VIT, is a record of bonafide work carried out by me under the
supervision of Dr. Dinesh R.
I further declare that the work reported in this thesis has not been submitted and will not
be submitted, either in part or in full, for the award of any other degree or diploma in this
institute or any other institute or university.
Place : Vellore
Date : 09/05/2024
CERTIFICATE
This is to certify that the thesis entitled “Hyperparameter Tuning for Poetry
Generation to Enhance Creativity” submitted by Manish Goyal (20BCE2849), School of
Computer Science and Engineering, VIT, for the award of the degree of Bachelor of
Technology in Computer Science and Engineering, is a record of bonafide work carried out by him under my
supervision during the period 01.12.2023 to 30.04.2024, as per the VIT code of academic and
research ethics.
The contents of this report have not been submitted and will not be submitted either in
part or in full, for the award of any other degree or diploma in this institute or any other institute
or university. The thesis fulfills the requirements and regulations of the University and in my
opinion meets the necessary standards for submission.
Place : Vellore
Date : 09/05/2024
ACKNOWLEDGEMENTS
I extend my heartfelt gratitude and appreciation to all those whose
contributions made my Capstone Project course a truly enriching and rewarding journey.
Foremost, I express my gratitude to VIT Vellore and SCOPE for giving me the opportunity to
partake in this course and for their steadfast support throughout my project. Their guidance,
wealth of knowledge, and expertise have been pivotal in deepening my understanding of
essential concepts. Special appreciation goes to my mentor, Dr. Dinesh R, whose unwavering
guidance, generous sharing of time and resources, and expert insights have empowered me to
acquire practical skills and insights in the field. His mentorship has played a pivotal role in
shaping my comprehension of industry dynamics and preparing me for future career endeavors.
Furthermore, I am indebted to the support and encouragement of my reviewers. Their keen
interest and fervor for the project have served as a constant source of inspiration, driving me
forward and instilling in me a deeper passion for the subject matter. I am truly grateful for the
invaluable learning opportunities they provided. In closing, I extend my sincerest thanks to all
those who have supported and believed in me throughout this journey. Your contributions have
been instrumental in my growth and development, and I am profoundly grateful for your
unwavering support.
Manish Goyal
Executive Summary
Our project focuses on developing advanced models for automated poetry generation,
leveraging cutting-edge natural language processing techniques. By harnessing pretrained
models such as T5, Mistral 7B, and GPT2, we have created a versatile framework capable of
crafting poetry in various forms including ballads, epics, and odes. Through iterative training
and fine-tuning processes, we continuously enhance the models' capabilities, striving for
improved poetry generation quality. A key aspect of our work involves comparative analysis
among different models, enabling us to identify their strengths and weaknesses. This informs
our decision-making process regarding model selection and further refinement strategies.
Additionally, we prioritize documentation of system modifications and enhancements,
ensuring that our framework evolves in alignment with advancements in NLP technologies. In
summary, our project represents a significant contribution to AI-driven creative expression,
advancing the field of automated poetry generation while exploring broader applications of
natural language processing in creative domains.
Table of Contents
ACKNOWLEDGEMENTS i
Executive Summary ii
List of Figures v
List of Tables vi
Abbreviations vii
1. Introduction 1
1.1 Objectives 1
1.2 Motivation 1
1.3 Background 2
3. Technical Specification 9
3.1 Requirements 9
3.1.1 Functional Requirements 9
3.1.2 Non-functional Requirements 10
4. Design Approach and Details 14
4.2 Design 17
4.2.1 Data Flow Diagram 17
4.2.2 Use Case Diagram 20
4.2.3 Class Diagram 20
4.2.4 Sequence Diagram 21
5.3 Testing 25
7. Summary 30
8. References 31
List of Figures
Figure i. Basic System Architecture
Figure ii. Transformer Architecture
Figure iii. Adam vs AdamW
Figure iv. Level 0 DFD
Figure v. Level 1 DFD 18
Figure vi. Data Processing Process Level 2 DFD 18
Figure vii. Prompt Engineering Level 2 DFD 19
Figure viii. Model Selection and Fine-Tuning Process Level 2 DFD 19
Figure ix. Use Case Diagram 20
Figure x. Class Diagram 20
Figure xi. Sequence Diagram 21
Figure xii. Gantt Chart 22
Figure xiv. Poem Analysis Example
Figure xiii. Evaluating Train and Loss
Figure xv. An Example of Sentiment Analysis 26
Figure xvi. Mistral 7B GPU Utilisation
Figure xvii. Mistral Training Stats 26
Figure xviii. Mistral GPU Usage
Figure xix. Mistral Generated Poem 27
Figure xx. GPT2 Generated Poem 28
Figure xxi. FastAI's Suggested LR 28
Figure xxii. The T5 Model's Start 29
Figure xxiii. Continuous Model Generation GPT2 Modular 29
Figure xxiv. Different Generated Forms 30
List of Tables
NA
Abbreviations
1. Introduction
1.1 Objectives
The domain of our work lies at the intersection of Natural Language Processing (NLP)
and creative writing, specifically focusing on the refinement of Large Language Models
(LLMs) to generate poetry with stylistic coherence and creativity. Leveraging advanced
techniques in hyperparameter tuning and fine-tuning pre-trained models like FLAN T5, Mistral
7B and GPT2, the project aims to augment the capability of LLMs in crafting poetry across
specific styles and themes. Ultimately, we seek to demonstrate the potential of LLMs as tools
for fostering artistic expression and creativity within poetry generation. By overcoming
limitations associated with generic text production, this project underscores the potential of
LLMs to make a meaningful contribution to the field of creative writing.
1.2 Motivation
Poetry is art that thrives when we express ourselves creatively and emotionally. While
there have been a lot of NLP advancements recently, generating creative text formats like
poetry remains a challenge. This project is driven by the potential of LLMs to bridge this gap
and empower artistic exploration. The problem with many current systems designed
specifically for generating poems is that they often produce generic lines instead of original
ones, due partly or wholly to limitations within their programming paradigms, which leaves
considerable room for improvement in artistic AI design. Powerful
models exist, but they're often closed source and computationally expensive. This project seeks
to address this by refining open-source LLMs using readily available resources like free tier
GPUs offered by Google Colab/Kaggle notebooks. Our motivation lies not only in pushing
LLM capabilities but also in fostering a future where these models become valuable tools for
artists and writers, not as a replacement but as partners for enhancing creativity. We believe this
project can demonstrate the power of LLMs as collaborators, unlocking new avenues for
creative exploration within poetry generation. By making LLM exploration more accessible
and cost-effective, this research can empower a wider range of researchers and students to
contribute to this exciting field.
1.3 Background
Among the diverse applications of LLMs, poetry generation stands out as a particularly
intriguing and challenging endeavor. Traditional approaches to poetry generation have often
relied on rule-based systems or statistical methods, which may struggle to capture the
intricacies of poetic language and artistic expression. In contrast, LLMs based on transformer
architectures have demonstrated remarkable capabilities in producing coherent and
contextually relevant text across different domains. LLMs offer a promising avenue for
automated poetry generation due to their ability to learn complex patterns and structures from
vast amounts of textual data. Hyperparameter tuning, a process of optimizing the configuration
parameters of a machine learning model, has emerged as a key technique for enhancing the
performance and capabilities of LLMs. Through this endeavor, the project aims to demonstrate
the potential of LLMs as powerful tools for fostering artistic expression and creativity within
the realm of poetry.
Recent research in Natural Language Processing has witnessed a surge in the application
of Large Language Models for various tasks, including question answering and machine
translation. However, a gap exists in leveraging LLMs for creative writing tasks, particularly
those requiring stylistic coherence and artistic expression, such as poetry generation. Some of
the findings are listed below along with their respective research papers:
1. Hyperband, A Novel Bandit-Based Approach to Hyperparameter Optimization:
Experiments on model selection tasks demonstrate the effectiveness of Hyperband in
optimizing hyperparameters for complex machine learning models.
2. Bits of Grass, Does GPT already know how to write like Whitman: The ability of
GPT-3.5, GPT-3.5-turbo (ChatGPT), and GPT-4 models to generate poems in the style
of specific authors using zero-shot and many-shot prompts was examined in this study.
The performance of models not fine-tuned for generating poetry in the style of specific
authors was assessed via automated evaluation. Findings indicated that without fine-
tuning, even when provided with maximum examples, these models did not generate
poetry in the desired style. Recommendations were made for future work to analyse
GPT's ability to write poetry in the style of other poets, particularly those using
structured and rhymed writing, and to investigate how few-shot prompt engineering
can enhance the models' ability to generate poetry in requested styles.
3. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models: This
study explores how the ability of large language models to perform complex reasoning
can be significantly enhanced through the generation of a chain of thought—a series
of intermediate reasoning steps. The emergence of such reasoning abilities in
sufficiently large language models is demonstrated through a method called chain of
thought prompting, where a few demonstrations of chain of thought are provided as
exemplars in the prompt. Experiments on three large language models reveal that chain
of thought prompting improves performance on various arithmetic, commonsense, and
symbolic reasoning tasks, with striking empirical gains observed. For instance,
prompting a 540B-parameter language model with just eight chain of thought
exemplars achieves state-of-the-art accuracy on the GSM8K benchmark of math word
problems, surpassing even fine-tuned GPT-3 with a verifier.
4. Training language models to follow instructions with human feedback: This paper
explores the alignment of LLMs with user intent through fine-tuning with human
feedback. Through a method termed InstructGPT, the models are trained using
supervised learning on labeller demonstrations of desired model behaviour, followed
by reinforcement learning from human feedback. Human evaluations demonstrate that
outputs from the 1.3B parameter InstructGPT model are preferred to those from the
175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models
exhibit improvements in truthfulness and reductions in toxic output generation, with
minimal performance regressions on public NLP datasets. While InstructGPT still
makes simple mistakes, findings suggest that fine-tuning with human feedback holds
promise for aligning language models with human intent.
5. Scaling Instruction-Finetuned Language Models: This paper explores instruction
finetuning of language models with a focus on scaling the number of tasks, model size,
and finetuning on chain-of-thought data. Results demonstrate dramatic improvements
in performance across various model classes, prompting setups, and evaluation
benchmarks. For instance, Flan-PaLM 540B instruction finetuned on 1.8K tasks
achieves state-of-the-art performance on several benchmarks, including 75.2% on five-
shot MMLU. Flan-T5 models are also released, surpassing baseline T5 models by a
large margin. The study underscores the generality of instruction finetuning in
enhancing the performance and usability of pretrained language models.
6. The Flan Collection: Designing Data and Methods for Effective Instruction
Tuning: This study examines the design choices of instruction tuning methods,
focusing on the development of Flan 2022. Through ablation studies, it highlights the
importance of task balancing and enrichment techniques in improving model
performance. Training with mixed prompt settings, including zero-shot, few-shot, and
chain-of-thought, yields stronger performance across all settings. Flan-T5 requires less
finetuning and converges faster than T5 on single downstream tasks, making it a more
computationally-efficient starting checkpoint. The Flan 2022 collection, including
datasets, templates, and methods, is publicly available for further research.
7. Direct Preference Optimization: Your Language Model is Secretly a Reward
Model: The paper introduces a novel method called DPO for fine-tuning large-scale
unsupervised LMs to align with human preferences. Unlike existing reinforcement
learning methods, DPO optimizes the language model's behaviour using a simple
classification loss, eliminating the need for complex RL procedures. The study
demonstrates that DPO achieves comparable or better performance than existing
RLHF algorithms, particularly in sentiment modulation, summarization, and dialogue
tasks. The authors also discuss potential limitations and future research directions, such
as generalization of DPO policies and scaling to larger models. Overall, DPO offers a
computationally lightweight and effective approach to steering language model
behaviour according to human preferences.
8. Unsupervised Cross-Task Generalization via Retrieval Augmentation: The paper
introduces a retrieval-augmentation method called ReCross, aimed at enhancing the
cross-task generalization ability of language models. By leveraging unlabelled
examples as queries, ReCross retrieves a small subset of upstream data, which is then
used to update the multi-task model for improved generalization. The method
combines efficient dense retrieval and pair-wise reranking techniques, resulting in
significant performance improvements over non-retrieval methods and other baseline
methods. The research demonstrates the effectiveness of ReCross in improving cross-
task generalization in unsupervised settings, particularly in sentiment modulation,
summarization, and dialogue tasks. They also discuss potential future directions for
enhancing the re-learning stage, extending distant supervision mining, and analysing
the correlation between upstream data and target tasks.
9. Crosslingual Generalization through Multitask Finetuning: The paper explores the
effectiveness of MTF in enhancing the zero-shot task generalization ability of large
multilingual language models. MTF was applied to pretrained multilingual models
such as BLOOM and mT5, resulting in finetuned variants called BLOOMZ and mT0.
It was found that finetuning these large multilingual models on English tasks with
English prompts enables task generalization to non-English languages present in the
pretraining corpus. Moreover, finetuning on multilingual tasks with English prompts
further improves performance on both English and non-English tasks, leading to
various state-of-the-art zero-shot results. Additionally, the study investigates the
effectiveness of finetuning on multilingual tasks with prompts that have been machine-
translated from English to match the language of each dataset, demonstrating better
performance on human-written prompts in respective languages. Surprisingly, the
models exhibit zero-shot generalization capabilities to tasks in languages they have
never intentionally seen, suggesting the acquisition of higher-level task- and language-
agnostic capabilities. The paper introduces xP3, a composite of supervised datasets in
46 languages with English and machine-translated prompts.
10. Data-Efficient Finetuning Using Cross-Task Nearest Neighbours: The paper
introduces DEFT, a method for improving the efficiency of finetuning large language
models by leveraging cross-task nearest neighbours retrieved from a multitask data
pool. The study demonstrates that DEFT significantly outperforms traditional
finetuning methods in terms of data efficiency, yielding superior performance on
various held-out tasks. DEFT also provides better initialization for few shot finetuning
on target-task data, showcasing its effectiveness in scenarios with limited labelled data.
11. ChatPLUG: Open-Domain Generative Dialogue System with Internet-
Augmented Instruction Tuning for Digital Human: The paper introduces
ChatPLUG, a Chinese open-domain dialogue system for digital human applications,
trained on a diverse range of dialogue tasks using internet-augmented instruction
tuning. Unlike other models focusing solely on large-scale pre-training, ChatPLUG
aims to achieve practicality and versatility by incorporating various skills through
instruction tuning. It combines large-scale pre-training with instruction tuning to align
the dialogue agent with user intent and task-specific skills, outperforming existing
Chinese dialogue systems. ChatPLUG demonstrates strong multi-task generalization
and possesses fundamental skills such as open-world knowledge, distinct personality,
and multi-turn memory. The system's deployment in real-world applications like Smart
Speaker and Instant Message applications showcases its practical utility.
12. Unveiling the Pitfalls of Knowledge Editing for Large Language Models: The
paper explores the risks associated with editing knowledge in LLMs, emphasizing
concerns of knowledge conflict and distortion that could lead to unintended
consequences. It introduces benchmark datasets and innovative evaluation metrics to
assess these risks. Knowledge conflict arises when groups of edited facts clash
logically, magnifying inconsistencies in LLMs, while knowledge distortion occurs
when editing parameters warp the innate knowledge structure of LLMs. Experimental
results demonstrate that knowledge editing may inadvertently introduce unintended
consequences, warranting attention in future research. The CONFLICTEDIT dataset
and metrics like Conflict Magnitude are introduced to quantify knowledge conflicts,
while MLE serves as a simple solution to evaluate the impact of knowledge distortion
on post-edited models.
13. InstructZero: Efficient Instruction Optimization for Black-Box Large Language Models:
INSTRUCTZERO is a method that optimizes a soft prompt to generate instructions for
black-box language models, outperforming other auto-instruction methods across
various tasks. It leverages an open-source LLM to convert the soft prompt into a
human-readable instruction, which is then evaluated on the target task. Bayesian
optimization guides the optimization process by proposing new prompts to improve
zero-shot performance. Despite its limitations in using only one open-source LLM and
simpler tasks, INSTRUCTZERO provides a promising approach for optimizing
instructions without direct access to the black-box model. It reduces the complexity of
instruction optimization by operating in a low-dimensional space and demonstrates
competitive performance in generating task-relevant instructions.
14. LongForm: Optimizing Instruction Tuning for Long Text Generation with
Corpus Extraction: The LongForm dataset enables instruction tuning for language
models, enhancing their understanding of user intent and improving generalization
across various tasks. It leverages human-written documents augmented with
instructions, providing a cost-effective and clean dataset suitable for long text
generation. Finetuning models on LongForm outperforms larger models without
instruction tuning and prior instruction-tuned models, achieving significant
improvements in tasks like story/recipe generation and long-form question answering.
Additionally, LongForm models demonstrate effectiveness in following and answering
multilingual instructions. While the approach excels in long text generation, it has
limitations in structured prediction tasks and may encounter hallucination issues
similar to other large language models. Nonetheless, the benefits of LongForm models
for research outweigh the associated risks, especially in facilitating exploration and
improvement of instruction-following language models.
15. Visual Instruction Tuning: The paper introduces LLaVA, a large multimodal model
combining a vision encoder and a LLM to enhance visual and language understanding.
By leveraging language-only GPT-4 to generate multimodal language-image
instruction-following data, LLaVA achieves impressive chat abilities and outperforms
GPT-4 on a synthetic multimodal instruction-following dataset by 85.1%. When fine-
tuned on Science QA, LLaVA and GPT-4 achieve a new state-of-the-art accuracy. This
innovative approach addresses the less-explored realm of instruction tuning in the
multimodal field, demonstrating the potential of multimodal models in understanding
and following instructions across different modalities. The availability of GPT-4
generated visual instruction tuning data, along with the model and code, fosters further
research in visual instruction following and multimodal understanding.
Based on the research carried out, some existing gaps were identified. Some of them are
explored below:
2. Prompt Engineering: Research suggests that LLMs without fine-tuning struggle with
specific author styles [2].
3. Evaluation Metrics: Current evaluation methods often combine human evaluation
with automated metrics [2, 8]. These approaches, though valuable, do not incorporate
additional metrics specific to poetry, such as rhyme scheme adherence, meter detection,
or sentiment analysis tailored for poetic language.
4. Knowledge Integration: Studies on knowledge editing for LLMs highlight potential
risks [12]. However, incorporating domain-specific knowledge (e.g., poetry thesaurus,
rhyme dictionaries) could potentially enhance the LLM's ability to generate creative
and stylistically coherent poems.
5. Multimodal Poetry Generation: Research on multimodal instruction tuning focuses
primarily on language and vision [15]. There is potential for incorporating visual
elements (e.g., paintings, photographs) as prompts or inspiration for the LLM to
generate poems. This could be a novel approach to explore the interplay between visual
and poetic creativity.
6. Generalizability and Explainability: While instruction tuning offers promising
results [6, 14], limited research explores the generalizability of these techniques across
different poetry styles or forms. We can investigate how well models trained on specific
poetic styles perform on unseen styles.
7. Human-in-the-Loop Poetry Generation: While some studies employ human
feedback for fine-tuning [4], we can build interactive poem generation systems. This could
involve allowing users to iteratively refine prompts or provide feedback on generated
poems to guide the LLM towards a desired style or theme.
These are just some potential research gaps that we can explore in the project. By
addressing these gaps, we can contribute to the advancement of creative LLM applications and
enhance the quality and variety of AI-generated poetry.
The challenge lies in optimizing LLMs to produce poems that not only adhere to specific
styles, themes, and authorial voices but also resonate with aesthetic sensibilities. There is a
pressing need to explore novel strategies, such as hyperparameter tuning and architectural
exploration, to enhance the creative output and stylistic coherence of LLM-generated poetry.
By addressing these limitations, this project aims to demonstrate the potential of LLMs as
powerful tools for fostering artistic expression and creativity within the domain of poetry
generation. This endeavor underscores the potential of fine-tuning LLMs for specialized tasks
like poem generation, emphasizing the significance of deliberate data selection and evaluation
methods in achieving desired stylistic outcomes.
3. Technical Specification
3.1 Requirements
The functional requirements describe the specific features and capabilities that the
software system must possess to fulfil its intended purpose. These requirements serve as the
basis for defining the system's behavior and functionality. Some of them are listed below:
1. Training Pipeline: The system should have a training pipeline capable of fine-tuning
pre-trained LLM models on a curated poetry corpus. It should support the integration of
different pre-trained LLM models such as GPT-2, T5, or BART. The pipeline should
facilitate hyperparameter tuning and optimization for poetry generation, including
temperature, sampling strategies, and prompt engineering.
2. Data Management: The system should manage training data effectively, including the
ingestion of poetry corpora from various sources such as Gutenberg, Kaggle, and web-
scraped collections. It should provide preprocessing capabilities to clean and format the
training data for model training.
3. Model Integration: System should integrate with a pre-trained LLM suitable for text
generation (e.g., GPT-2, T5). The system should be capable of fine-tuning the LLM for the
specific task of poetry generation.
4. Prompt Engineering: System should provide an interface for creating and managing
different types of prompts for poem generation. There should be functionality to specify
style, theme, poem length, and potentially rhyme scheme within the prompt.
5. Poem Generation: System should generate poems based on user-provided prompts.
Users should be able to control the length and potentially the rhyme scheme of the
generated poems (a minimal prompt-and-generation sketch is given after this list).
6. Model Evaluation: The system should include mechanisms for evaluating the quality of
generated poems using both automated metrics and human assessment. It should support
the calculation of readability scores, grammatical correctness, and stylistic coherence of
generated poems.
7. Model Deployment: The system should enable the deployment of trained LLM models
for real-time poetry generation. It should support integration with external applications
or platforms for accessing generated poems.
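To ground requirements 1, 4, and 5, the sketch below shows one way a style/theme/length prompt could be assembled and passed to a Hugging Face causal language model with sampling controls. The checkpoint name, prompt template, and parameter values are illustrative assumptions rather than fixed choices of the system.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; a fine-tuned poetry checkpoint would be used in practice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical prompt template covering style, theme, and length (requirement 4).
prompt = "Write a ballad about the monsoon in roughly 8 lines:\n"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=120,      # rough control over poem length (requirement 5)
    do_sample=True,          # sampling strategy (requirement 1)
    temperature=0.9,         # higher values favour more adventurous word choices
    top_p=0.95,
    no_repeat_ngram_size=3,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))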
The non-functional requirements specify the qualities or characteristics that the system
must exhibit. These requirements address the system's performance, usability, and other
attributes that contribute to its overall quality and effectiveness. Some of them are listed below:
3.2 Feasibility Study
• Technical Skills: The project requires expertise in natural language processing, LLM
fine-tuning techniques, and potentially software development for the user interface.
• Available Resources: Several pre-trained LLMs are publicly available, and open-
source libraries can be used for LLM fine-tuning and text generation. Cloud platforms
offer powerful computing resources for training and running the model.
• Technical Challenges:
o Fine-tuning LLMs specifically for poetry generation requires expertise and
experimentation to achieve optimal results.
o Prompt engineering plays a crucial role in guiding the LLM towards desired styles.
Considerable research and development efforts might be needed to refine prompt
creation techniques.
o Evaluating the creativity and stylistic coherence of generated poems remains
challenging. Developing robust evaluation metrics is an ongoing research area.
• Overall: The project is technically feasible with the available resources and
advancements in NLP. Addressing challenges related to fine-tuning, prompt
engineering, and evaluation will require research and development efforts.
• Costs:
o Computational Resources: Training and fine-tuning LLMs can be
computationally expensive, requiring access to powerful GPUs or cloud-based
platforms. This might incur significant costs depending on the chosen LLM model
and training duration.
o Data Collection: Curating a high-quality poetry corpus might require acquiring
data from paid sources or dedicating resources to scraping and cleaning publicly
available data.
o Development: Developing the system, including the user interface and integration
with the LLM, might involve developing new tools and interfaces.
• Benefits:
o Educational Tools: The project could lead to educational applications, potentially
creating interactive platforms for learning about poetry styles and fostering
creativity.
o Content Creation: The system could be used for generating creative text formats
like poems for marketing materials, advertisements, or personalized greetings.
o Accessibility: The project could offer new avenues for artistic expression,
potentially making poetry creation more accessible to non-writers.
• Overall: Economic feasibility depends on the project's scale and monetization strategy.
Open-source tools and exploring free data sources could reduce financial burden.
• Positive Impacts:
o Promote Creativity: The project can encourage creative expression and provide a
tool for exploring different poetic styles.
o Accessibility: The system can make poetry creation more accessible to individuals
who might not consider themselves writers.
o Educational Value: The project has the potential to be used for educational
purposes, enhancing learning about poetry and creative writing.
• Negative Impacts:
o Plagiarism Concerns: Generated poems might be misused for plagiarism if proper
attribution is not established.
o Job Displacement: There's a minimal risk of the project displacing professional
poets or writers, but it could potentially affect freelance content creation jobs that
rely on writing generic poems.
o Ethical Considerations: AI-generated content raises ethical concerns about
potential biases or offensive outputs. The project should prioritize responsible AI
development practices to mitigate these risks.
• Overall: The social impact is mostly positive, promoting creativity and accessibility.
Addressing ethical concerns through responsible development practices is crucial.
3.3 System Specification
• Computing Power:
o High-performance GPUs are essential for efficient training and fine-tuning of large
language models.
o Cloud-based platforms like Google Colab, Kaggle notebooks, Amazon SageMaker,
or Microsoft Azure Machine Learning offer access to powerful GPUs on a pay-
as-you-go basis.
• Storage:
o Sufficient storage capacity is required to house the pre-trained LLM model, the
curated poetry corpus, and any intermediate training files.
• System Memory:
o Large amounts of RAM are crucial for processing large text datasets and running
the LLM during generation. The specific amount required will depend on the
chosen LLM model.
• Networking:
o Reliable internet connectivity is essential for accessing online resources,
downloading datasets, and collaborating with team members.
o High-bandwidth internet connections will facilitate efficient data transfer and
communication.
• Operating System:
o A Linux-based operating system is preferred if the model is run locally.
• Frameworks:
o Huggingface Transformers will be used for accessing the pre-trained models.
o PyTorch and TensorFlow will be used to implement LLM fine-tuning and poem
generation functionalities.
• Programming Language and Software:
o Python will be used along with Jupyter notebooks to execute, visualize, and
monitor the whole process.
o CUDA and cuDNN will be used for accessing GPU capabilities through Python (a quick
availability check is sketched below).
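As a quick sanity check that the CUDA and cuDNN stack is reachable from Python, a minimal sketch (assuming PyTorch is installed) is shown below.
import torch

# Report whether a CUDA-capable GPU is visible to PyTorch and which device it is.
if torch.cuda.is_available():
    print("CUDA available:", torch.cuda.get_device_name(0))
    print("cuDNN enabled:", torch.backends.cudnn.enabled)
else:
    print("No GPU detected; training will fall back to CPU.")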
3.3.3 Standards and Policies
1. Data Privacy and Security: The project must comply with relevant data privacy
regulations, such as GDPR or CCPA, when collecting, storing, and processing user data
or personal information. Measures should be implemented to secure sensitive data and
protect against unauthorized access or data breaches.
2. Ethical Guidelines: Adherence to ethical guidelines for AI research and development
is essential, including principles of fairness, transparency, and accountability. The
project should strive to mitigate biases in training data and model outputs, and avoid
harmful content generation.
3. Intellectual Property Rights: Respect for intellectual property rights, including
copyright and licensing agreements, is paramount when using third-party datasets, pre-
trained models, or poetry corpora. Proper attribution and permission should be obtained
for any copyrighted materials used in the project, and licensing terms should be
followed accordingly.
4. Model Deployment and Accessibility: Considerations should be given to the
accessibility and usability of the developed model, ensuring that it can be easily
deployed and integrated into applications or platforms. Documentation and tutorials
should be provided to facilitate usage by developers and end-users, and efforts should
be made to address potential biases or limitations in the model's performance.
4. Design Approach and Details
1. Data Collection and Preprocessing: Data is collected from sources such as the Poetry
Foundation, the Kaggle poetry dataset, and web-scraped collections. Data preprocessing
techniques are applied to clean, tokenize, and format the raw text data for training.
5. Deployment and Accessibility: The trained and fine-tuned LLM models, along with
the prompt engineering strategies, are deployed and made accessible for further
experimentation and integration. Documentation and tutorials are provided to facilitate
the usage of the developed models by developers and end-users. Efforts are made to
address potential biases or limitations in the model's performance, ensuring fairness and
inclusivity in its usage.
processing and creative AI. Updates and enhancements are made to the system
architecture, data collection, preprocessing techniques, model training, and evaluation
methodologies to further enhance the creative capabilities of the LLMs for poetry
generation.
AdamW provides improved convergence. AdamW's adaptive learning rates and weight
decay can help the transformer converge to a good solution faster compared to simpler
optimizers like SGD. It creates better generalization. By preventing overfitting through weight
decay, AdamW can lead to models that perform well on unseen data. Figure iii shows the
training loss curves for both Adam and AdamW optimizers. The loss decreases over epochs for
both optimizers, but AdamW may converge slightly faster and with less fluctuation due to the
inclusion of weight decay regularization.
The Adam/AdamW update can be summarised as follows:
• Corrected first-moment estimate ($\hat{m}_t$):
o This adjusts for bias in the initial estimate of the mean: $\hat{m}_t = m_t / (1 - \beta_1^t)$.
• Corrected second-moment (variance) estimate ($\hat{v}_t$):
o Similar to $\hat{m}_t$, this corrects for bias in the initial variance estimate: $\hat{v}_t = v_t / (1 - \beta_2^t)$.
• Parameter update:
o The learning rate ($\eta$) is scaled by the inverse square root of the corrected variance, with a small $\varepsilon$ added for numerical stability, and the parameter is updated using the corrected mean: $\theta_{t+1} = \theta_t - \eta \, \hat{m}_t / (\sqrt{\hat{v}_t} + \varepsilon)$.
• Decoupled weight decay (AdamW):
o AdamW applies the weight-decay term directly to the parameters, $\theta_{t+1} = \theta_t - \eta \left( \hat{m}_t / (\sqrt{\hat{v}_t} + \varepsilon) + \lambda \theta_t \right)$, rather than folding it into the gradient. This avoids the potential issue in the standard Adam formulation where the weight-decay contribution is rescaled by the adaptive moment estimates.
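To make the decoupled weight decay concrete, the following is a minimal PyTorch sketch; the stand-in model and the learning-rate and decay values are illustrative assumptions rather than the settings used in our experiments.
import torch
import torch.nn as nn

model = nn.Linear(768, 768)  # stand-in for a transformer sub-layer's parameters

# AdamW applies weight decay directly to the parameters (decoupled),
# instead of folding an L2 penalty into the gradient as classic Adam does.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-5,             # learning rate (eta)
    betas=(0.9, 0.999),  # beta_1 and beta_2 for the moment estimates
    eps=1e-8,            # epsilon for numerical stability
    weight_decay=0.01,   # decoupled weight-decay coefficient (lambda)
)

loss = model(torch.randn(4, 768)).pow(2).mean()  # dummy objective for illustration
loss.backward()
optimizer.step()
optimizer.zero_grad()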
4.2 Design
4.2.1 Data Flow Diagram
Figure iv. Level 0 DFD
Figure v. Level 1 DFD
Figure vi. Data Processing Process Level 2 DFD
Figure vii. Prompt Engineering Level 2 DFD
4.2.2 Use Case Diagram
4.2.4 Sequence Diagram
In undertaking this project, it's essential to consider various constraints, alternatives, and
tradeoffs. Firstly, constraints include the availability of computational resources, which may
limit the scale and complexity of hyperparameter tuning and architectural exploration.
Additionally, the quality and availability of poetry datasets pose challenges, impacting the
model's ability to generate varied and stylistically coherent poetry. Time constraints further
limit the depth of experimentation achievable within the project's timeline, while the
interpretability of complex LLM architectures adds another layer of constraint, hindering
understanding and interpretation of hyperparameter effects.
Lastly, integrating human feedback loops during model training could offer valuable guidance,
albeit with added complexity and resource requirements.
5.2 Module Description
The module showcased here exemplifies the process of fine-tuning the Mistral 7B large
language model using LoRA (Low-Rank Adapters) for the purpose of poetry generation. It
embarks on this journey with a meticulous setup of the environment, initiating library
installations via pip to ensure access to essential tools for quantization, model architecture, and
training acceleration. Moreover, the integration of Hugging Face and Weights & Biases adds a
layer of functionality for model management and performance monitoring throughout the
training process. Configuration and data loading follow suit, with careful consideration given
to paths for model and dataset access, thereby laying the groundwork for subsequent model
manipulation and training. Loading and quantizing the base Mistral 7B model mark a pivotal
moment in the process, with the integration of BitsAndBytesConfig enabling efficient
quantization for model optimization. The subsequent addition of Low-Rank Adapters (LoRA)
further enhances model flexibility and efficiency, setting the stage for hyperparameter setup
and training. This phase sees the meticulous tuning of various training parameters, from
optimizer selection and learning rate scheduling to gradient accumulation and sequence length
grouping. The utilization of SFTTrainer for supervised fine-tuning encapsulates the training
process, orchestrating the fine-tuning of the Mistral 7B model with LoRA adapters on the poem
dataset. Upon completion of training, the fine-tuned model is saved, and evaluation ensues,
exemplified by text generation using a provided prompt. This comprehensive approach not
only demonstrates the technical intricacies of model fine-tuning but also showcases the model's
creative potential in the domain of poetry generation. Through each step of the process, from
environment setup to evaluation, the module underscores the fusion of innovation and artistry
inherent in modern language model development.
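A minimal sketch of the quantized loading and LoRA attachment described above is given below; the checkpoint name, 4-bit settings, and LoRA ranks are illustrative assumptions rather than the exact values used in our runs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

base_model = "mistralai/Mistral-7B-Instruct-v0.1"  # assumed checkpoint name

# Quantize the base model to 4-bit so it fits on a free-tier GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Attach Low-Rank Adapters: only these small matrices are updated during training.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
The resulting model and peft_config objects are what the SFTTrainer configuration shown in Appendix A expects.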
This module exemplifies the process of fine-tuning the GPT-2 language model for poetry
generation using TensorFlow and the Hugging Face Transformers library. It commences with
data preprocessing, where a dataset of poetry is read from a CSV file and cleaned by removing
any NaN values. The poems are then concatenated into a single string, ensuring a continuous
flow of text for training. Subsequently, the dataset is saved into a text file in the specified data
location. The GPT-2 tokenizer is initialized, followed by configuration setup for the GPT-2
model, defining parameters such as vocabulary size and special tokens. The model is
instantiated with the configured settings, setting the stage for training. The text data is
tokenized, and examples are generated by segmenting the tokenized text into blocks of a fixed
size. These examples are then split into inputs and labels, forming the basis of the training
dataset. TensorFlow's dataset API is utilized to create an efficient input pipeline, shuffling and
batching the data for training. The model is compiled with an Adam optimizer and sparse
categorical cross-entropy loss function. Training commences on the dataset, iterating through
epochs to refine the model's parameters. Following training, the fine-tuned model is ready for
text generation. This is demonstrated by providing prompts to the model and generating text
sequences using beam search, showcasing the model's ability to produce coherent and
contextually relevant poetry. The module encapsulates the entire pipeline of fine-tuning and
utilizing the GPT-2 model for poetry generation, underscoring the synergy between deep
learning frameworks and natural language processing libraries in creative AI applications.
This module illustrates the process of fine-tuning the GPT-2 language model for poetry
generation using the Fastai and Hugging Face Transformers libraries. Initially, the dataset,
organized into folders based on forms and topics, is read using Fastai's get_text_files function,
facilitating easy access to the poems. The number of poems in the dataset is printed for
reference. Next, the poems are grouped based on their forms, and forms with fewer than 25
sample poems are filtered out. This preprocessing step ensures that the model is trained on
forms with a sufficient number of examples to learn meaningful patterns. A function is defined
to create a dictionary containing forms with at least 25 poems as keys and their corresponding
poems as values. Training is then performed iteratively for selected forms. For each form, the
poems are loaded, split into training and validation sets, and tokenized using the GPT-2
tokenizer. The model is trained using the Fastai Learner class, which handles the training loop,
optimization, and evaluation. Training progress is monitored, and the model is saved after
training completion. The module also includes a function for generating poetry based on user
prompts. The trained model is utilized to generate poetry given a starting prompt. Beam search
is employed to improve the quality of generated sequences, ensuring coherence and relevance.
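The grouping-and-filtering step can be sketched as follows; the directory layout, file extension, and helper name are assumptions, while the 25-poem threshold follows the description above.
from collections import defaultdict
from pathlib import Path

MIN_POEMS = 25  # forms with fewer sample poems are filtered out

def poems_by_form(root):
    """Group poem files by form (top-level folder) and keep only forms
    with at least MIN_POEMS examples."""
    groups = defaultdict(list)
    for path in Path(root).rglob("*.txt"):
        form = path.relative_to(root).parts[0]  # assumed layout: <form>/<topic>/<poem>.txt
        groups[form].append(path.read_text(encoding="utf-8"))
    return {form: poems for form, poems in groups.items() if len(poems) >= MIN_POEMS}

forms = poems_by_form("poems")
print({form: len(poems) for form, poems in forms.items()})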
5.3 Testing
The functionality of the modules undergoes rigorous testing by generating poems and
evaluating their quality. This evaluation process involves analyzing the poems using natural
language processing (NLP) techniques and comparing the results using various graphical
representations. Fig. xiv analyzes the generated poem by evaluating its theme, language,
literary devices, sound, and rhythm using an LLM. During the training phase, the train/loss is
continuously tracked to measure the performance of the fine-tuning. Mistral performed better
than Flan-T5 and GPT-2 in this respect, with loss consistently below 3%.
Figure xiv. Poem Analysis Example
Figure xiii. Evaluating Train and Loss
The testing phase also utilizes the spaCy library to analyse sentiment in the generated
poem text. This helps to gauge the mood of the poem being generated by the model and
provides feedback on whether the sentiment is what we actually wanted from the model.
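A minimal sketch of this sentiment check is shown below; it uses TextBlob as the polarity backend (a spaCy pipeline can expose the same score through the spacytextblob extension), and the example line is purely illustrative.
from textblob import TextBlob

def poem_mood(poem):
    # Polarity ranges from -1 (negative) to +1 (positive).
    polarity = TextBlob(poem).sentiment.polarity
    if polarity > 0.1:
        return "positive"
    if polarity < -0.1:
        return "negative"
    return "neutral"

print(poem_mood("I used to love a girl, but autumn carried her away"))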
The utilization of various models empowers us to effectively craft poems across a rich
spectrum of forms, encompassing diverse styles like ballads, epics, odes, and more. By
harnessing a range of models, including those pretrained on T5, Mistral 7B, and GPT2
architectures, we not only gain the ability to deploy existing models but also embark on the
journey of refining them through iterative training sessions. This iterative process enables us
to fine-tune the models to better capture the nuances of poetry generation, thereby enhancing
their creative output.
We finetuned the Mistral 7B Instruct model using bitsandbytes, PEFT, LoRA, etc., which made
the training efficient. Figures xvi, xvii, and xviii show the GPU utilization and training
statistics. T5 has fewer parameters than Mistral, so it could be trained without the above
techniques, but its performance was hampered. We trained for 10 epochs. Mistral performed
better than Flan-T5 because it has more parameters and was trained more efficiently. We also
experimented with the GPT-2 model: by loading the trained weights of the model and
performing prompt engineering, the generated results were good. We utilized FastAI's GPT-2
trained weights and trained them on our own dataset to generate poems of different genres.
Overall, the quality of the poems generated by our LLMs improved by utilizing open-source
LLMs, datasets, and the free-tier GPUs provided by Kaggle and Google Colab, and we were
able to achieve our objectives without over-reliance on expensive GPUs and hardware
resources.
Figure xviii. Mistral GPU Usage
Figure xx. GPT2 Generated Poem
Figure xxii. The T5 Model's Start
Figure xxiv. Different Generated Forms
7. Summary
Our project centers on the development and refinement of models for
automated poetry generation. By adopting various natural language processing (NLP)
techniques, we have built a framework capable of crafting poems in diverse forms such as
ballads, epics, odes, and more.
The basis of our approach is the refinement and further training of various pretrained models
based on T5, Mistral 7B, and GPT2 architectures. These models serve as the foundation upon
which we conduct iterative training sessions, fine-tuning their parameters to enhance their
ability to generate high-quality poetry.
A significant aspect of our project involves comparative analysis among different models.
Through meticulous evaluation, we discern the strengths and weaknesses of each model,
informing our decision-making process regarding model selection and further refinement
strategies. By systematically tracking changes and identifying areas for improvement, we
ensure that our poetry generation framework evolves in tandem with advancements in NLP
technologies.
8. References
1. Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., & Talwalkar, A. (2016, August).
Hyperband: A novel bandit-based approach to hyperparameter optimization. In
Proceedings of the 33rd International Conference on Machine Learning (ICML) (pp. 3560-
3568).
2. Sawicki, P., Grześ, M., Goes, F., et al. (2023, June). Bits of Grass: Does GPT already know
how to write like Whitman? In Proceedings of the International Conference on
Computational Creativity (ICCC) (pp. 1-7).
3. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., ... Le, Q. V. (2022,
January 25). Chain-of-thought prompting elicits reasoning in large language models.
4. Ouyang, L., Wu, J., Jiang, X., et al. (2022, March 7). Training language models to follow
instructions with human feedback.
5. Chung, H. W., Hou, L., Longpre, S., et al. (2022, October 26). Scaling instruction-
finetuned language models.
6. Longpre, S., Hou, L., Vu, T., et al. (2023, January 18). The Flan collection: Designing data
and methods for effective instruction tuning.
7. Rafailov, R., Sharma, A., Mitchell, E., et al. (2023). Direct preference optimization: Your
language model is secretly a reward model.
8. Lin, B. Y., Tan, K., Miller, C., et al. (2022, April 20). Unsupervised cross-task
generalization via retrieval augmentation.
9. Muennighoff, N., Wang, T., Sutawika, L., et al. (2022, November 1). Crosslingual
generalization through multitask finetuning.
10. Ivison, H., Smith, N. A., Hajishirzi, H., & Dasigi, P. (2022, December 1). Data-efficient
finetuning using cross-task nearest neighbors.
11. Tian, J., Chen, H., Xu, G., et al. (2023, April 19). ChatPLUG: Open-domain generative
dialogue system with internet-augmented instruction tuning for digital human.
12. Li, Z., Zhang, N., Yao, Y., et al. (2023). Unveiling the pitfalls of knowledge editing for
large language models.
13. Chen, L., Chen, J., Goldstein, T., et al. (2023). InstructZero: Efficient instruction
optimization for black-box large language models.
14. Koksal, A., Schick, T., Korhonen, A., & Schutze, H. (2023). LongForm: Optimizing
instruction tuning for long text generation with corpus extraction.
15. Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). Visual instruction tuning.
Appendix A – Sample Code
import pandas as pd
import tensorflow as tf
from transformers import GPT2Config, TFGPT2LMHeadModel, GPT2Tokenizer
from transformers import WEIGHTS_NAME, CONFIG_NAME
import os
data = pd.read_csv('../input/poetry-foundation-poems/PoetryFoundationData.csv')
data = data.dropna()
data = data['Poem'].str.lower()
string = ''
for x in data:
    string += x + "</s>"
data_location = "data"
if not os.path.exists(data_location):
    os.makedirs(data_location)
# Initialize the GPT-2 tokenizer and a GPT-2 model with a matching configuration.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
config = GPT2Config(vocab_size=tokenizer.vocab_size,
                    bos_token_id=tokenizer.bos_token_id,
                    eos_token_id=tokenizer.eos_token_id)
model = TFGPT2LMHeadModel(config)
# Tokenize the whole corpus and cut it into fixed-size blocks.
string_tokenized = tokenizer.encode(string)
block_size = 100
BATCH_SIZE = 12
BUFFER_SIZE = 1000
examples = []
for i in range(0, len(string_tokenized) - block_size + 1, block_size):
    examples.append(string_tokenized[i:i + block_size])
inputs, labels = [], []
for ex in examples:
    inputs.append(ex[:-1])
    labels.append(ex[1:])
dataset = tf.data.Dataset.from_tensor_slices((inputs, labels))
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
print("Done creating dataset")
optimizer = tf.keras.optimizers.Adam(
    learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss)
model.fit(dataset, epochs=30)
text = "I used to love a girl"
input_ids = tokenizer.encode(text, return_tensors='tf')
beam_output = model.generate(
    input_ids,
    max_length=50,
    num_beams=5,
    temperature=0.7,
    no_repeat_ngram_size=2,
    num_return_sequences=5
)
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))
# Fine-tuning Mistral 7B with TRL's SFTTrainer (separate script from the GPT-2 code above).
# model, tokenizer, dataset, and peft_config are assumed to be prepared as described in Section 5.2.
from transformers import TrainingArguments
from trl import SFTTrainer

training_arguments = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=50,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=250,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
    report_to="wandb"
)
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length=None,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False,
)
trainer.train()  # run supervised fine-tuning