Generative AI for Software Deployment
With Early Release ebooks, you get books in their earliest form—the author’s raw and unedited
content as they write—so you can take advantage of these technologies long before the official
release of these titles.
This will be the 2nd chapter of the final book. Please note that the GitHub repo will be made active
later on.
If you have comments about how we might improve the content and/or examples in this book, or if
you notice missing material within this chapter, please reach out to the editor
at sgrey@oreilly.com.
Artificial intelligence can significantly amplify productivity and creativity in code generation and
autocompletion. This chapter explores how AI-driven tools are redefining the coding experience,
transforming a time-intensive manual process into an interactive, efficient, and error-reducing
endeavor.
The advent of AI in code generation is not merely about accelerating developers’ typing speed; it’s
about understanding the context of their work, suggesting relevant code snippets, and even
generating complex code blocks with minimal inputs. These tools, powered by sophisticated
machine-learning algorithms, have the ability to learn from vast repositories of code available in
public and private databases to continuously improve their suggestions and accuracy.
I will examine how a software engineer can go from doing 100% of the work in a given software-development task to becoming a reviewer of the contributions provided by AI tools. This entails ensuring you give these tools proper input about what you require and thoroughly revising their outputs to make sure the deliverable fulfills the requirements.
These AI tools are powerful and impressive, and it’s easy to fall into the trap of using their output
without proper precautions–for instance, opening a pull request or pushing code to production
without validating how and why the code works. That careless approach carries two important
risks:
Outdated code
Most AI tools are trained on dated training data, which means they may suggest outdated
frameworks or functionalities.
Wrong answers
LLMs, the technology underlying all these tools, sometimes generate what are commonly
described as “hallucinations.” That means their output may include false statements, bugs, or code
functions or API endpoints that don’t exist.
Software engineers and developers must use AI tools to help them work better and faster,
but not to replace their own judgment, much as we do with the autocomplete functionality that
has become popular in most integrated development environments (IDEs). It helps a lot to simply
hit the tab key instead of typing every character, of course–but autocomplete suggestions range
from perfectly relevant to useless. It’s up to your judgment whether to use or discard them.
The AI tools I cover in this chapter require the same constant assessment. Many times, the code
these tools generate will work and fit the task requirements flawlessly. In other cases, it will be
only partially complete or will contain bugs, performance issues, or some other flaw that must be
revised. It’s your job to use, discard, or revise it.
The AI tools reviewed for this chapter fall into two main categories, whose usage in software
development differs slightly:
Browser-based tools
With these tools, such as ChatGPT, you can log in and interact with the model right there in your
browser. There’s no activity happening on your local computer, just an interaction with a website
over the internet. These tools are easy to use and adapt well to more use cases, but their biggest
con is the limited context window. You must manually type or copy/paste context into the prompt
for each interaction, which is limiting when you’re dealing with large codebases or pieces of
documentation.
IDE-based tools
These tools, such as GitHub Copilot, work as plugins installed in the IDE you use to write code on
your local computer. Once installed, they become embedded in your software development
experience, in the actual environment where you write code. Their biggest pro is the large context
window: these tools can ingest a whole codebase as context for each interaction.
Use Cases
Millions of software engineers are adopting AI tools to support their daily tasks. Perhaps the five
most prominent use cases where these tools influence development are:
Generating code
Instead of typing in every single word and function in a codebase, you provide the AI tool with specific requirements that the code should fulfill. It outputs ready-to-use code in any of the most popular programming languages (such as Java, Python, PHP, or JavaScript). This can speed up prototyping as well as the development process. The tools described in this chapter can generate code for a wide range of applications, including web development, data analysis, automation scripts, and mobile applications. In general, this use case is one where AI helps bridge the gap between conceptualization and implementation, making technology development more accessible and efficient.
Debugging
This use case is especially valuable because debugging can often be a time-consuming and
frustrating part of software development. These AI tools analyze error messages and problematic
code snippets and suggest specific changes or improvements. This not only saves time but also
serves as an educational tool, enhancing your debugging skills over time. Furthermore, some tools
(like ChatGPT) can explain why certain errors occur and sometimes even the architectural tradeoffs
implied in avoiding them. This deeper understanding of common pitfalls in software development
is a key reason why so many developers use this tool as their coding assistant.
Accelerating learning
AI tools can serve as instructors if you’re trying to get up to speed in a technology stack you aren’t
proficient in, learn a new programming language or framework, or understand specific
implementation details, like adding indexes to a table in a MySQL database or pulling last month’s
transactions from the Stripe API. They can provide tutorials, examples, and concise summaries of
documentation for a wide range of technologies. This educational interaction with AI tools can
speed your learning progress regardless of the specific technology or the scope of what you’re
learning.
Optimizing code
Many software engineers use AI tools to review code and make it more efficient, readable, and
maintainable. This includes recommendations for refactoring code, using more efficient
algorithms, or applying best practices for performance or security. Code optimization is an ongoing challenge and is easy to neglect. Over time, though, that suboptimal code piles up into technical debt that eventually has to be refactored across the codebase, on a large and thus very costly scale. Using AI tools to review code at the task level can make a significant impact on the quality of the overall codebase.
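As a hypothetical illustration (not taken from any specific tool's output), this is the kind of refactoring suggestion such a review might produce:

// Hypothetical example of a suggested optimization: replacing a repeated
// Array.includes scan (O(n*m)) with a Set lookup (O(n+m)).
const existingIds = [101, 102, 103];
const incomingIds = [102, 104, 103, 105];

// Before: includes() rescans existingIds for every incoming id.
const duplicatesSlow = incomingIds.filter((id) => existingIds.includes(id));

// After: a Set gives constant-time membership checks.
const existingSet = new Set(existingIds);
const duplicatesFast = incomingIds.filter((id) => existingSet.has(id));

console.log(duplicatesSlow, duplicatesFast); // [ 102, 103 ] [ 102, 103 ]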
Automating documentation
Documentation is essential for maintaining and understanding software projects, yet developers
often overlook or underprioritize it. Some AI tools can generate documentation, including in-line
comments and details about functions, classes, and modules. This saves time and also ensures that
documentation is consistently updated alongside the codebase. By providing clear, comprehensive
documentation, AI tools help improve code readability and make it easier for teams to collaborate. This use case is particularly beneficial in large teams or on open-source
projects, where clear documentation is crucial for enabling other developers to contribute
effectively. Automating documentation also enhances projects’ maintainability and facilitates
better knowledge transfer within development teams.
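As a hypothetical sketch, here is the kind of in-line documentation (in this case, a JSDoc comment for an invented calculateTotal function) such a tool might generate:

/**
 * Calculates the total price of an order, including tax.
 *
 * @param {number} subtotal - Order subtotal before tax.
 * @param {number} taxRate - Tax rate as a decimal (for example, 0.2 for 20%).
 * @returns {number} Total price, rounded to two decimal places.
 */
function calculateTotal(subtotal, taxRate) {
  return Math.round(subtotal * (1 + taxRate) * 100) / 100;
}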
Evaluation Process
I evaluated more than 50 AI tools in order to shortlist the ones I highlight in this chapter. Every tool
covered here meets the following criteria:
My process in this chapter was as follows: I submitted a brief code challenge to each of the
selected code tools, ran the same challenge several times on each tool, and compared their
output. I then gave each tool a rating on a scale from 1 to 10, with 1 being the worst (a solution that errors out and doesn't run at all) and 10 being a flawless solution. A 5 would be a solution that runs but solves only part of the problem. I look closely at the top product in each category,
detailing its pros and cons, then provide some more concise information on the runner-up.
It’s also important to note that all tests described in this chapter were run in March 2024. Given
the fast pace of evolution of each of these tools and underlying models, it’s likely that you could
get a different result at a later time for the same prompt.
Browser-based Tools
This chapter will look first at browser-based AI tools, then at IDE-based tools.
ChatGPT
ChatGPT is an artificial intelligence developed by OpenAI and powered by its GPT-3.5 architecture.
Imagine it as a multitool for software engineers, offering a broad range of functions from
conversational engagement to intricate problem-solving, way beyond the specific scope I discuss in
this particular chapter (generating software code).
As described on OpenAI's website, ChatGPT is like a highly intelligent virtual assistant that
understands the nuances of human language and can generate text that feels as if it were written
by a human. It’s an example of modern natural language processing (NLP) technology. It has been
meticulously trained on a wide array of internet text, giving it a broad knowledge base that
developers and non-developers alike can tap into.
ChatGPT has gained massive adoption, having reached 100 million users just two months after its launch on November 30, 2022, making it the fastest-growing product ever. Of course, this large user
base includes many software engineers. But before we dive in, it’s important to mention data
security, which caused 14 prominent tech companies and even 15 countries’ governments to ban
ChatGPT in its early days. Their concerns were that it wasn’t compliant with the EU’s General Data
Protection Regulation (GDPR). Most of these bans have since been revoked, and at the time of
writing (early 2024) several public authorities are officially starting to use ChatGPT, including the
government of Pennsylvania in the US and the UK Judicial Office.
Pros
Let’s look at the pros and cons of ChatGPT, starting with the positive. Note that these also largely
apply to most similar tools.
Versatility
ChatGPT can generate code in any popular programming language and understand any technical
topic included in its training dataset, such as databases, cloud infrastructure, API documentation,
and so on.
Browsing
Whenever ChatGPT receives prompts that include recent events, products, or framework updates
that happened after its 2022 training-data cutoff date, it automatically browses for the answer
online and includes close-to-real-time knowledge as part of its reply. At the time of writing in early
2024, this feature is only available for premium users (see the Cons section below).
Structured thinking
One of the best aspects of ChatGPT is that it structures the code it generates in a very logical and
holistic manner, often including the packages to be installed and the environment variables to set
up. It uses numbered bullets, as in a how-to tutorial, which makes it easy to transfer those answers
to a codebase.
Cons
Now let’s look at some of the drawbacks of ChatGPT and tools like it:
Security risks
For high-security applications, code provided by ChatGPT may not always adhere to best security
practices. It’s also unclear whether its suggestions include copyrighted materials from other
companies or publicly available sources. Always apply a high level of critical thinking when
reviewing its outputs and considering whether to add them to your codebase. Many companies
are publishing their own rules for employees’ ChatGPT use, which range from outright bans to
training materials to no rules at all. If you’re using ChatGPT for professional purposes, abide by
your company’s policy.
Outdated training data
ChatGPT works on a model that is pretrained on a certain knowledge base that has an end date (at the time of writing, this date is January 2022 for GPT-3.5, the model available in the free plan). This
is a moving window, and the date gets pushed forward with each new update, but there will
always be limitations when you ask ChatGPT to generate code that depends on recent events, such
as product launches, framework updates, or security patches. For premium users, ChatGPT
includes browsing, which significantly reduces this drawback.
Reduced scope
ChatGPT has a limited context window, which means that the amount of text you can include in
the prompt is often equivalent to one file’s worth of software code. Many bugs and issues span
across a whole repository, and sometimes the root cause lies even beyond the codebase itself.
ChatGPT is not very helpful for tackling challenges of such a broad scope, so it’s not well suited for
repository-wide solutions. Its replies are text-based, not file-based, and its context window is
limited both for input and for output.
Hallucinations
Despite its accuracy, ChatGPT is not infallible. It sometimes generates “hallucinations” such as
incorrect code, inaccurate explanations of programming concepts, or false claims about software
functionality. Again, approach ChatGPT’s output with a critical mindset. Verify the accuracy of the
code and any other information it provides against trusted sources, like official documentation,
reputable forums, or peer review. Think of ChatGPT as a brainstorming tool rather than an absolute
authority.
Practical example
I asked each of the AI tools covered in this chapter to generate code to solve the exact same challenge, a classic 2-D array problem I've used in many technical interviews over my career as a CTO. Here is the prompt as I provided it to ChatGPT's GPT-3.5 model (in March 2024):
Context:
- We have to find the starting point and ending point of all rectangles filled with 0.
- It is given that rectangles are separated and do not touch each other however they can touch the
boundary of the array.
Desired output:
- Each of those array elements contains an array with 4 elements that compose the rectangle (top
left X, top left Y, bottom right X, bottom right Y).
Example arrays:
input1 = [ [1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 0, 0, 0, 1], [1, 1, 1, 0, 0, 0, 1], [1, 1, 1, 1, 1, 1,
1], [1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1] ]
input2 = [ [0, 1, 1, 1, 1, 1, 0], [1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 0, 0, 0, 1], [1, 1, 1, 0, 0, 0, 1], [1, 1, 1, 1, 1, 1,
1], [1, 0, 0, 1, 1, 1, 1], [1, 0, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0, 0] ]
In less than 10 seconds, ChatGPT generated the code shown in Example 1-2.
Example 1-2. ChatGPT's solution for the code challenge.
function findRectangles(input) {
  const numRows = input.length;
  const numCols = input[0].length;
  const visited = Array.from({ length: numRows }, () => new Array(numCols).fill(false));
  const rectangles = [];

  // Check whether a cell is inside the grid and belongs to a rectangle (value 0)
  function isValid(row, col) {
    return row >= 0 && row < numRows && col >= 0 && col < numCols && input[row][col] === 0;
  }

  // Depth-first search that expands the current rectangle's bottom right corner
  function dfs(row, col, rectangle) {
    if (!isValid(row, col) || visited[row][col]) return;
    visited[row][col] = true;
    rectangle[2] = Math.max(rectangle[2], col); // bottom right X
    rectangle[3] = Math.max(rectangle[3], row); // bottom right Y
    dfs(row + 1, col, rectangle);
    dfs(row - 1, col, rectangle);
    dfs(row, col + 1, rectangle);
    dfs(row, col - 1, rectangle);
  }

  for (let i = 0; i < numRows; i++) {
    for (let j = 0; j < numCols; j++) {
      if (isValid(i, j) && !visited[i][j]) {
        const rectangle = [j, i, j, i]; // Initialize rectangle [top left X, top left Y, bottom right X, bottom right Y]
        dfs(i, j, rectangle); // Perform depth-first search to find all cells in the rectangle
        rectangles.push(rectangle);
      }
    }
  }
  return rectangles;
}

// Example arrays
const input1 = [
  [1, 1, 1, 1, 1, 1, 1],
  [1, 1, 1, 1, 1, 1, 1],
  [1, 1, 1, 0, 0, 0, 1],
  [1, 1, 1, 0, 0, 0, 1],
  [1, 1, 1, 1, 1, 1, 1],
  [1, 1, 1, 1, 1, 1, 1],
  [1, 1, 1, 1, 1, 1, 1],
  [1, 1, 1, 1, 1, 1, 1]
];
const input2 = [
  [0, 1, 1, 1, 1, 1, 0],
  [1, 1, 1, 1, 1, 1, 1],
  [1, 1, 1, 0, 0, 0, 1],
  [1, 1, 1, 0, 0, 0, 1],
  [1, 1, 1, 1, 1, 1, 1],
  [1, 0, 0, 1, 1, 1, 1],
  [1, 0, 0, 1, 1, 0, 0],
  [1, 0, 0, 1, 1, 0, 0]
];

console.log(findRectangles(input1));
console.log(findRectangles(input2));
I copied this code snippet into a local file on my computer, named it chatgpt.js, and ran it in the terminal. The result was as shown in Example 1-3.
Example 1-3. Console output after running ChatGPT's solution for the code challenge.
[ [ 3, 2, 5, 3 ] ]
[
  [ 0, 0, 0, 0 ],
  [ 6, 0, 6, 0 ],
  [ 3, 2, 5, 3 ],
  [ 1, 5, 2, 7 ],
  [ 5, 6, 6, 7 ]
]
This output is absolutely correct! Furthermore, the generated code is well structured, in granular, well-defined functions with relevant comments.
To fully explore the boundaries of each tool, I tried this exact same prompt about 10 times per
tool. ChatGPT generated a new and original reply every time–even for the exact same prompt. Not
only did the actual code generated change, it didn’t always work correctly. Here’s how its answers
break down:
• In most instances, it generated the correct result (as shown above), though with some variations in variable names and code organization.
• Twice, it generated partial results (for example, finding only the top left but not the bottom
right of each rectangle).
• Once, it “hallucinated” by outputting the literal strings “X” and “Y” in each position of the array instead of making the necessary calculations.
For all these reasons, I rate ChatGPT a 9 out of 10 for this code challenge. While it generated the correct result most of the time, it can't be a 10 out of 10, because its solutions varied widely and, in some cases, didn't correctly solve the problem.
Google Gemini
Gemini is Google’s direct competitor to ChatGPT, its latest and most advanced AI model,
succeeding previous models like LaMDA and PaLM 2. Its pros and cons for code generation are very
much in line with those of ChatGPT. So let’s compare their performance in the same practical
example and see how Google Gemini solved the 2D array challenge.
First, it took significantly longer to reply. Gemini doesn't have ChatGPT's streaming user experience, where you can see the reply building up as if someone were typing it. It felt like it was just processing for almost a full minute until it finally produced the result, which you can see in full in the google_gemini.js file in the book's GitHub repository. This result, shown in Example 1-4, is partially correct.
Example 1-4. Console output after running Google Gemini’s solution for the code challenge.
[ [ 3, 2, 5, 3 ] ]
[ [ 3, 2, 5, 3 ], [ 1, 5, 2, 7 ], [ 5, 6, 6, 7 ] ]
Gemini returned the correct solution for the first input array (with only one rectangle, a simpler
problem scope), but it only found 3 out of 5 rectangles in the second input array. I repeated the
experiment a few times, just like with ChatGPT, but Gemini produced fewer variations than
ChatGPT in the code it generated. Every solution it gave returned this exact same output.
The reason for Google Gemini’s partial failure appears to be that it misunderstood the
requirements, which read in part, “A rectangle might contain only one element.” Gemini’s solution
included a validation to exclude single elements in the output array, as shown in Example 1-5. The
two missing rectangles in Gemini’s output were the two with only one element.
Example 1-5. Part of the code that caused Google Gemini’s partially failed solution.
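Gemini's full output is in the book's GitHub repository; the snippet below is only a minimal sketch of the kind of size check described above, with invented variable names, not Gemini's verbatim code:

// Hypothetical sketch: a size check like this drops single-cell rectangles
// such as [0, 0, 0, 0] and [6, 0, 6, 0], violating the brief.
const rectangles = [[0, 0, 0, 0], [6, 0, 6, 0], [3, 2, 5, 3], [1, 5, 2, 7], [5, 6, 6, 7]];
const filtered = rectangles.filter(([x1, y1, x2, y2]) => x2 > x1 && y2 > y1);
console.log(filtered); // [ [ 3, 2, 5, 3 ], [ 1, 5, 2, 7 ], [ 5, 6, 6, 7 ] ]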
I rate Google Gemini's solution an 8 out of 10. While it didn't generate code that correctly solves the full challenge, it did produce the correct algorithm; it just added an unnecessary validation that violated one of the guidelines in the brief. This was more of a scope misunderstanding than a genuinely wrong code solution. It generated consistently similar code snippets, and it never produced any hallucinations.
The other tools I tried were unable to solve the challenge or sometimes even to generate code that
would run.
IDE-based Tools
Next, let’s review the top IDE-based tools, beginning with the top contender: GitHub Copilot.
GitHub Copilot
GitHub Copilot actually launched before any of the other tools reviewed in this chapter: in June 2022, roughly five months before ChatGPT's launch. Soon after launch, GitHub claimed that 20,000
organizations were using Copilot. That number has grown to 37,000 at the time of writing (early
2024). In October 2023, Microsoft CEO Satya Nadella claimed that more than a million people were
paying to use Copilot. This perhaps paints the most accurate picture of the tool’s usage: users pay
at least US$10 per month for access. (Copilot’s free tier, as of early 2024, is reserved for “verified
students, teachers, and maintainers of popular open source projects.” Others can sign up for a 30-
day free trial.)
GitHub Copilot uses OpenAI's GPT models, including GPT-4 (the model available in ChatGPT's premium plan), but its training data is focused on public GitHub code repositories, documentation, and code comments. This has generated some controversy, with critics alleging that its
output often copies code snippets verbatim from the training data. Since some of those
repositories are copyrighted, it’s no surprise that copyright-infringement lawsuits have already
been filed against GitHub for this exact reason.
When you install GitHub Copilot, you’ll be asked some questions about what type of code you
want it to include in the code it generates. You can allow any code from the training data or place
restrictions around copyright and publicly available code. I’d expect many developments on this
legal front for Copilot (and most other tools, too), especially as regulators delineate what’s
acceptable for AI generation and what constitutes an unacceptable copyright violation.
Pros
Now that you have some context, let’s look at the advantages of GitHub Copilot:
Maximum convenience
Like other IDE-based tools, Copilot lives inside your IDE and generates code directly in the file in
which you’re already coding. This allows for a higher level of integration into the software
development flow.
As opposed to browser-based tools, which require you to write or copy context into the browser
window, in Copilot and other IDE-based tools, the context is already there. It uses the whole
codebase as context whenever a user asks it to generate any code. This makes it especially suitable
to generate code with dependencies on functions or variables that are declared in different files in
the same repository.
Cons
Copyright concerns
As mentioned above, GitHub Copilot has been seen generating code copied verbatim from repositories in its training data, which in some cases could be copyrighted. Using such code could cause legal trouble for you.
Lack of depth
Most browser-based tools have many general-purpose applications that go way beyond just
generating code. With those, you can have a broader discussion about research, brainstorm
implementation options, and assess tradeoffs. In GitHub Copilot and other IDE-based tools,
however, this is not as feasible. Copilot has recently rolled out a chat function that aims to provide
a comparable user experience to the browser-based tools.
Practical example
I used GitHub Copilot to solve the exact same code challenge I gave to all the other tools. But its
user experience was quite different from the browser-based tools covered earlier. Let me walk you
through that experience.
I installed the GitHub Copilot extension in my IDE (Visual Studio Code), so that's where the action happens. In any empty file, Copilot prompts me to press a keyboard shortcut that opens its widget, as
shown in Figure 1-1.
When I press ⌘I as instructed, the widget opens (Figure 1-2), and I paste in the exact same prompt I used with ChatGPT.
Figure 1-2. GitHub Copilot’s prompt widget shows up when I press the shortcut.
As I hit the Enter key, GitHub Copilot starts generating code right there in the code file inside the
IDE. The user experience is very much in line with ChatGPT’s, in that it starts writing the code
immediately when I submit the prompt and renders the code as if someone’s typing it very fast,
line by line. Both tools take about 10 seconds to generate the full solution.
Now, there’s one big difference: GitHub Copilot’s solution is incorrect. Example 1-6 shows the
console output when I run the code it generated.
Example 1-6. Console output for the solution generated by GitHub Copilot.
[
  [ 3, 2, 5, 3 ],
  [ 4, 2, 5, 3 ],
  [ 5, 2, 5, 3 ],
  [ 3, 3, 5, 3 ],
  [ 4, 3, 5, 3 ],
  [ 5, 3, 5, 3 ]
]
[
  [ 0, 0, 0, 0 ], [ 6, 0, 6, 0 ],
  [ 3, 2, 5, 3 ], [ 4, 2, 5, 3 ],
  [ 5, 2, 5, 3 ], [ 3, 3, 5, 3 ],
  [ 4, 3, 5, 3 ], [ 5, 3, 5, 3 ],
  [ 1, 5, 2, 7 ], [ 2, 5, 2, 7 ],
  [ 1, 6, 2, 7 ], [ 2, 6, 2, 7 ],
  [ 5, 6, 6, 7 ], [ 6, 6, 6, 7 ],
  [ 1, 7, 2, 7 ], [ 2, 7, 2, 7 ],
  [ 5, 7, 6, 7 ], [ 6, 7, 6, 7 ]
]
While Copilot's solution correctly found the top left and bottom right corners of all rectangles, it also added every individual zero cell to the “rectangles” array, as if each element were a rectangle in itself. The logic that flags an element as a top left corner is missing a check of whether that element already lies inside an existing rectangle before treating it as the start of a new one.
So GitHub Copilot's algorithm is mostly correct but misses a key part of the logic. Example 1-7 shows the code it generated, which you can also find in the file github_copilot.js in the book's GitHub repository.
Example 1-7. GitHub Copilot's solution for the code challenge.
function findRectangles(matrix) {
  const rectangles = [];
  for (let i = 0; i < matrix.length; i++) {
    for (let j = 0; j < matrix[i].length; j++) {
      if (matrix[i][j] === 0) {
        let topLeftX = j;
        let topLeftY = i;
        let bottomRightX = j;
        let bottomRightY = i;
        // Find the bottom right corner of the rectangle
        while (bottomRightX + 1 < matrix[i].length && matrix[i][bottomRightX + 1] === 0) {
          bottomRightX++;
        }
        while (bottomRightY + 1 < matrix.length && matrix[bottomRightY + 1][j] === 0) {
          bottomRightY++;
        }
        rectangles.push([topLeftX, topLeftY, bottomRightX, bottomRightY]);
      }
    }
  }
  return rectangles;
}

const input1 = [
  [1, 1, 1, 1, 1, 1, 1],
  [1, 1, 1, 1, 1, 1, 1],
  [1, 1, 1, 0, 0, 0, 1],
  [1, 1, 1, 0, 0, 0, 1],
  [1, 1, 1, 1, 1, 1, 1],
  [1, 1, 1, 1, 1, 1, 1],
  [1, 1, 1, 1, 1, 1, 1],
  [1, 1, 1, 1, 1, 1, 1]
];
const input2 = [
  [0, 1, 1, 1, 1, 1, 0],
  [1, 1, 1, 1, 1, 1, 1],
  [1, 1, 1, 0, 0, 0, 1],
  [1, 1, 1, 0, 0, 0, 1],
  [1, 1, 1, 1, 1, 1, 1],
  [1, 0, 0, 1, 1, 1, 1],
  [1, 0, 0, 1, 1, 0, 0],
  [1, 0, 0, 1, 1, 0, 0]
];

const rectangles1 = findRectangles(input1);
const rectangles2 = findRectangles(input2);
console.log(rectangles1);
console.log(rectangles2);
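For reference, here is a minimal sketch of the missing check described above; it is a hypothetical illustration, not part of Copilot's output:

// Hypothetical sketch of the missing containment check: skip a zero cell if it
// already falls inside a rectangle that was recorded earlier in the scan.
function isInsideExistingRectangle(rectangles, row, col) {
  return rectangles.some(
    ([x1, y1, x2, y2]) => col >= x1 && col <= x2 && row >= y1 && row <= y2
  );
}

// Usage inside the scan, before treating a cell as a new top left corner:
// if (matrix[i][j] === 0 && !isInsideExistingRectangle(rectangles, i, j)) { ... }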
For these reasons, I rate GitHub Copilot a 6 out of 10 on this task. Its solution only partially solved the problem, but the solutions it generated were quite consistent across multiple attempts and never included any hallucinations.
Amazon CodeWhisperer
In this race for AI code-generation market share, CodeWhisperer is Amazon's contender. It was released after most of the other tools covered in this chapter but has gained some momentum, especially among the very large user base of Amazon Web Services (AWS). Let's see how it handles our code challenge.
Example 1-8. Console output for the code generated by Amazon CodeWhisperer.
[ [ 3, 2, 5, 3 ] ]
[
  [ 0, 0, 0, 0 ],
  [ 6, 0, 6, 0 ],
  [ 3, 2, 5, 3 ],
  [ 1, 5, 2, 7 ],
  [ 5, 6, 6, 7 ]
]
As shown in Example 1-8, Amazon's solution returned the correct result–on my sixth try. However, I must tell you that CodeWhisperer has the clunkiest user experience of all the tools in this chapter. First, it took me some time to figure out the exact comment syntax I needed to use to generate the code. Second, CodeWhisperer generated more hallucinations than any other tool. Before it produced this correct result, I tried five times, during which it generated simple code comments without any actual code; code that would throw console errors because of undeclared variables; and solutions that only partially solved the challenge. Amazon's tool showed the widest variation among attempts to solve the same problem.
For these reasons, I rate Amazon CodeWhisperer a 7 out of 10. Its correct output feels like a stroke of luck, given the range of useless hallucinations it generated for the exact same prompt right before producing the correct solution.
Tool Comparison
If I were to select a single tool to solve algorithmic functions like this 2D array challenge, ChatGPT
would be my go-to choice: it returned the correct result in most instances. Table 1-1 provides an
overview of the tools reviewed here.
Table 1-1. Overview of the code-generation tools reviewed in this chapter.
Tool                   Type      Rating
ChatGPT                Browser   9/10
Google Gemini          Browser   8/10
GitHub Copilot         IDE       6/10
Amazon CodeWhisperer   IDE       7/10
However, for more general-purpose software development, which usually involves a much broader
scope and more nuanced requests, I’d probably use GitHub Copilot, for the convenience of having
it in my IDE at a code comment’s distance.
Conclusion
I’ve used the 2D array code challenge from this example dozens of times in interviews over the
years. Usually, I start an hour-long live coding interview by giving the candidate the challenge brief
pretty much exactly as I’ve given it here. The candidates then code the solution, thinking out loud
as they work, occasionally searching Google for help.
In that hour-long interview, only a very few candidates have ever managed to solve the full scope
of the challenge (multiple rectangles). Most write partial solutions that find only one rectangle, or
only the top left corners, or some other variation.
It’s incredible that a free tool like ChatGPT (GPT-3.5, in this case) can produce the same outcomes
as those top performers in only 10 seconds. However, it’s also important to stress that it didn’t
always produce the correct answer. Even with this objective, straightforward prompt, it produced
partial solutions and even hallucinated once.
While every tool reviewed in this chapter was considered best-in-class for code generation at the
time of writing (early 2024), none performed better than ChatGPT. The only other tool that generated a correct solution was Amazon CodeWhisperer, and that was a one-off among the wild hallucinations it generated in all my other attempts. The remaining tools generated either partial
solutions or solutions that didn’t run.
None of the tools reviewed here produced a correct result for this challenge on all attempts, and
most failed to produce a correct solution at all. Even for those that did produce a correct result, I
have no way to know if the code they generated is copyrighted. Again, you must exercise caution.
Most of the prompts a software engineer would use on a daily basis are way more complex or
subjective than this challenge, which would increase the likelihood of these tools generating wrong
or misleading results. Again, critical thinking is key when using these tools.
For most software engineers, the most important part of reviewing ChatGPT's reply to their prompt is confirming that it actually solves their problem. While that is of course important, I recommend a few additional rules of thumb.
First, always review AI-generated code before pushing it to production or opening a pull request.
Make the code yours, regardless of how much of it was generated by your tool. Second, test your code. Run it against a test suite that covers a wide range of cases, from the happy path to edge
cases and error states. Getting all tests green is a solid confirmation that the code fulfills your
requirements. And finally, while I’ve said it before, be sure to revisit your company’s guidelines for
any AI tools you use for professional purposes.
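As a minimal sketch of what such a test could look like for this chapter's code challenge, assuming it is appended to a file that already defines the findRectangles function and the input1 and input2 arrays from the earlier examples:

// Minimal test sketch using Node's built-in assert module. It assumes
// findRectangles, input1, and input2 are defined as in Example 1-2.
const assert = require('assert');

assert.deepStrictEqual(findRectangles(input1), [[3, 2, 5, 3]]);
assert.deepStrictEqual(findRectangles(input2), [
  [0, 0, 0, 0],
  [6, 0, 6, 0],
  [3, 2, 5, 3],
  [1, 5, 2, 7],
  [5, 6, 6, 7],
]);

console.log('All tests passed');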
Chapter 2. Bug Detection and Code Review
With Early Release ebooks, you get books in their earliest form—the author’s raw and unedited
content as they write—so you can take advantage of these technologies long before the official
release of these titles.
This will be the 3rd chapter of the final book. Please note that the GitHub repo will be made active
later on.
If you have comments about how we might improve the content and/or examples in this book, or if
you notice missing material within this chapter, please reach out to the editor
at sgrey@oreilly.com.
Imagine paying the highest salaries in a company to software engineers to develop a product that
will be responsible for the company’s revenue, only to lose that revenue due to costly bugs in
production. This is any business owner’s worst nightmare, and sadly it happens every day. Software
has automated whole industries, replacing lengthy manual processes and creating new ways to do
previously impossible things. However, automation can’t be effective when bugs detract from the
underlying products’ key functionalities.
To mitigate this fundamental concern, several job titles have been created over the years to guarantee proper quality assurance (QA), such as QA Engineer, QA Analyst, and Test Engineer.
Processes, too, have been developed to detect bugs before they get deployed to production. Those
processes boil down to two main categories:
Code reviews
This process is done during development, and consists of team members reviewing each other’s
code before it is deemed ready to go live. Some teams mandate a minimum number of team
members who must review and approve a pull request (PR) before it can be merged.
Testing/QA
This process is done after development as the last “gatekeeper” before code gets pushed to
production. It consists of manual or automated tests done in an environment that closely matches
production. These tests aim to mimic users’ behavior to catch any bug that could have escaped a
code review.
When either process finds bugs, performance issues, security vulnerabilities, or other malfunctions, the code can be regressed: that is, it goes back to the software engineer who developed it, along with a comment describing the specific deficiencies that must be corrected.
These processes are critical to any software development team, yet they are often very lengthy and
nondeterministic, introducing bottlenecks while not fully delivering on the vision of avoiding bugs
ever showing up in production. As AI tools have come into existence, the industry has seen a big focus on automating code reviews and making the process of detecting bugs much faster and more deterministic. Thousands of software engineering teams are already using AI-based
automated code-review tools.
Types of AI code-review tools
The AI tools reviewed for this chapter fall into three main categories, whose usage in software
development differs slightly. Some of the tools reviewed offer more than one type of usage.
IDE-based tools
Integrated Development Environment (IDE)-based tools integrate directly into the software
development environment that engineers use to write code, such as Visual Studio Code, IntelliJ
IDEA, or Eclipse. These tools provide real-time feedback as engineers write code: highlighting errors,
suggesting improvements, and providing documentation links directly in the IDE. Of the three
types of tools described here, this is the only one that provides feedback when the code is saved
locally. This immediate feedback loop helps developers identify and fix issues on the spot,
improving code quality and reducing the need for extensive reviews later.
Git-based tools
Git-based tools integrate with version-control systems, such as GitHub, GitLab, or Bitbucket, and
operate within the Git workflow. Unlike IDE-based tools, Git-based tools can’t be triggered by local
saves of a file, only by actions in the Git workflow. You can set them up to review code
automatically whenever you push changes to a repository or create or merge a PR. These tools
check the code against predefined rules and guidelines and can enforce coding standards across all
branches of the codebase. They typically provide feedback in the form of comments in PRs or
reports in a continuous integration pipeline, helping ensure code quality before merging changes
into the main branch.
Browser-based tools
These tools are accessible through web browsers and typically integrate with online version-
control platforms like GitHub, GitLab, and Bitbucket. Like Git-based tools, they can only be triggered
by changes in the Git workflow, not local changes. You can use these browser-based tools to get
automatic reviews of your pull requests or code merges online. When you submit a PR, the tool
reviews the code for errors, style violations, and security vulnerabilities, then provides feedback on
that PR via the web interface in the browser. I find this the least convenient of the three types presented here, since it requires you to use another platform besides the IDE and version-control tools you are already familiar with. For the scope of this book and chapter, I'll cover only this last type of tool: AI-powered code analysis delivered through the browser.
Use cases
The millions of software engineers who are already using AI tools for automated code reviews and bug detection find that they bring obvious benefits across a range of daily use cases. These include:
Accelerating learning
Automated code-review tools provide software engineers, especially junior ones, with a 24/7 pair
programmer that points out bugs, provides suggestions, and above all gives context and reasoning
for its suggestions. This is a great tool to use to hone your skills. Feedback loops are much more
frequent with an automated tool than with normal code reviews by team members, which
increases exposure to learning opportunities about the specific language, framework, or algorithm
in question. This can be especially beneficial for junior developers and for engineers switching to
new tech stacks or working with a framework for the first time, since inexperience makes mistakes more common. In code reviews, those mistakes can be flagged and the PR regressed with a message that helps the developer understand the mistake and avoid it next time.
Reducing regression loops
Automating code review reduces the number of PR regressions. It also tremendously reduces the
amount of time between the code being written and the review identifying issues to be fixed.
Automatic code reviews at every change can point out vulnerabilities and improvements so that
developers can fix them immediately. This eliminates the cycle of pushing faulty code only for other team members to find and regress it: multiple regression loops that cost individual developers time and delay shipping features to production.
Detecting security and performance issues
Many times, security vulnerabilities and performance issues are missed in code reviews because
they don’t often impact functionality, which is objectively the biggest focus of any code review.
Even when they are detected, they often aren't treated as a cause for regression. Instead, they often
go into a “nice to have” note, effectively adding the vulnerability or issue to the pile of tech debt.
That pile usually accumulates for a long time, until it becomes unsustainable and requires
extensive refactoring of the code base.
Most of the code-review tools mentioned in this chapter focus on security vulnerabilities, and
often point out occurrences of OWASP top-10 vulnerabilities in code, along with suggestions for
resolving them. Team code reviews rarely reach this level of depth; such vulnerabilities are often
only detected much later (if ever), during professional security audits or penetration-testing
reports. Using these tools allows teams to detect security vulnerabilities much earlier.
A common criticism of automated code-review tools is that they discourage (human) team
members from performing code reviews in a timely manner. To be fair, code reviews were a
dreaded activity in many teams long before AI tools came into existence. Software engineers
frequently forget to review their peers’ pull requests or leave a positive review message of “lgtm”
(short for “looks good to me”) just to unblock some feature deployment.
AI tools add tremendous immediacy to the code-review process. This reassures software engineers
that their code has a high quality standard, but it also leaves them feeling less urgency to review
their peers’ code, believing the AI tool has already done that job for them.
This is a very fair criticism, in my opinion. AI code reviews don’t replace human code reviews,
especially those performed by senior engineers who know both a feature’s technology and the
business and use cases for it. This is the angle that is manifestly missing in AI code reviews. The AI
tool misses the context behind the code being reviewed and the intent behind certain code
segments. This can lead it to make irrelevant suggestions or fail to identify context-specific issues
that might be obvious to a human reviewer. This is a key reason why you should never skip human
code reviews, even if you’re also using automated code reviews.
It’s also worth noting that the language used to market these automated code review tools is quite
different from that used for the code-generation tools reviewed in the previous chapter. Few of the
tools in this chapter mention AI much (or at all) in their marketing copy, despite the fact that the products do use AI algorithms (Codacy, for example).
There are two reasons for this. First, several of these tools existed in the market for years before the recent surge in AI's popularity. Second, many position themselves as a backstop to issues found in AI-
generated code. Sonar, for example, promises to minimize risk, ensure code quality, and derive
more value from code created by both AI and humans. As the website copy states: “To maximize
the advantages of generative AI in coding, developer teams need robust DevOps processes,
reporting, and metrics that focus on code quality, security, and reliability.”
Tool evaluation
I evaluated more than 20 automated code-review tools in order to shortlist the ones I highlight in
this chapter. Every tool covered here meets the following criteria:
In order to select and compare AI tools for this chapter, I created a simple JavaScript program and introduced four issues into the code. You can review the full code in the book's GitHub repository, inside the folder named “Chapter 3”. Example 3-1 provides the most relevant snippet, with each of the four issues commented for clarity. I ran the exact same code through each of the tools reviewed in this chapter, and the rest of the chapter comments on the results each tool provided.
Example 3-1. The code snippet with the four issues I introduced (simplified; the setup of requestData, connection, and emitter is in the full file in the book's GitHub repository).
// 1. SQL injection vulnerability: user input concatenated directly into the query
const query = "SELECT * FROM users WHERE username = '" + requestData.username + "'";
connection.query(query, (err, rows) => {
  if (err) {
    console.error('Query error:', err);
  } else {
    console.log('Query result:', rows);
  }
});

// 2. Cross-site scripting (XSS): unescaped user input rendered in the HTML response
const responseHtml = `
  <html>
    <body>
      <h1>User Profile</h1>
      <p>${requestData.userInput}</p>
    </body>
  </html>
`;

// 3. Memory leak: event listeners registered in a loop and never removed
for (let i = 0; i < 10000; i++) {
  emitter.on('data', () => console.log('Handling event', i));
}

// 4. Inefficient loop
let sum = 0;
Array.from({ length: 100000 }, (_, i) => i).forEach((i) => {
  sum += i;
});
Before we dive in, here is a brief explanation of each of the errors I introduced and why it would be
important to catch them in a code review:
SQL injection vulnerability
This vulnerability arises from incorporating user input directly into an SQL query without any form
of validation or sanitization. In the provided code, the variable requestData.username is directly
concatenated into the SQL query string. This approach allows attackers to craft user inputs that
manipulate the SQL query to perform unauthorized actions, such as accessing, modifying, or
deleting data. For instance, an attacker could provide a username input like ' OR '1'='1, which could
potentially alter the query logic to return all users in the system, thereby breaching data privacy.
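As a minimal sketch of the standard mitigation (assuming the connection and requestData objects from the example, and the ? placeholder syntax supported by common Node.js MySQL clients), the input can be passed as a bound parameter:

// Sketch of a parameterized query: the user input is passed as a bound value,
// so the driver escapes it instead of it being spliced into the SQL string.
const query = 'SELECT * FROM users WHERE username = ?';
connection.query(query, [requestData.username], (err, rows) => {
  if (err) {
    console.error('Query error:', err);
  } else {
    console.log('Query result:', rows);
  }
});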
Cross-site scripting
Cross-site scripting (XSS) occurs when an application includes untrusted data, typically from user
inputs, within the content of its web pages without proper validation or escaping. In the
script, requestData.userInput is directly included in an HTML response structure sent back to the
client. If this user input includes malicious JavaScript code, the browser could execute that
unauthorized script, leading to session hijacking, personal data theft, or malicious redirection.
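A minimal sketch of one mitigation, using a hypothetical escapeHtml helper (templating engines that escape by default, or a dedicated sanitization library, are the more robust options):

// Escape user-controlled values so the browser renders them as text, not markup.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

const responseHtml = `<h1>User Profile</h1><p>${escapeHtml(requestData.userInput)}</p>`;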
Memory leak
Memory leaks in web applications can occur when memory that is no longer needed is not
released back to the system. In the example, a large number of event listeners are created within a
loop but are never removed. Each listener retains a closure scope that may consume more
memory. Over time, especially in long-running applications like servers, these listeners accumulate,
occupying an increasing amount of memory. This can potentially exhaust available resources and
lead to performance degradation or crashes.
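A minimal sketch of the usual remedy, with a hypothetical emitter: register a named handler and remove it when it is no longer needed, rather than piling up anonymous closures in a loop:

const { EventEmitter } = require('events');
const emitter = new EventEmitter();

// A single named handler that can later be removed, instead of thousands of
// anonymous closures registered in a loop and never released.
function onData(payload) {
  console.log('Handling event:', payload);
}

emitter.on('data', onData);
emitter.emit('data', { id: 1 });

// When the handler is no longer needed, remove it so it can be garbage collected.
emitter.removeListener('data', onData);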
Inefficient loop
The loop in the example code inefficiently performs a large number of iterations to compute the
sum of all integers from 0 to 99,999. Each iteration involves performing arithmetic operations and
updating a local variable. Although these actions are relatively simple, they are unnecessarily
repeated many times. This not only consumes CPU cycles, it could also block the event loop in a
Node.js environment, leading to delays in processing other incoming requests or operations.
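In this particular case, the fix can be purely arithmetic: the sum of the integers from 0 to n-1 has a closed form, so the 100,000 iterations can be replaced with a constant-time calculation, as in this sketch:

// Closed-form replacement for the loop: the sum of 0..n-1 equals n * (n - 1) / 2.
const n = 100000;
const sum = (n * (n - 1)) / 2;
console.log(sum); // 4999950000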
Codacy
Codacy is a startup based in Portugal that launched an automated code review tool in 2012. The
product has evolved significantly over the years and is now a market-leading solution that
leverages AI to “help developers identify and fix issues within their code, improving code quality and reducing technical debt, with support for more than 40 programming languages and seamless integrations with GitHub, Bitbucket, and GitLab,” as per the copy on their website.
Codacy’s AI analyzes code for potential errors, style violations, security vulnerabilities, and
performance issues, and it provides software engineers with suggestions for improvement. The
tool is designed to learn from past reviews, adapting to the specific standards and practices of each
development team.
By automating the code review process, Codacy helps developers focus more on building features
rather than fixing issues, ultimately speeding up the development cycle and enhancing code
maintainability.
Practical example
I created an account with Codacy using my GitHub account and ran it on the code shown earlier in this chapter (which you can review in full in the book's GitHub repository).
Codacy correctly identified issue number 1, the SQL Injection vulnerability, and labeled its severity
as “Critical,” the highest level in its ranking, as seen in Figure 2-1.
Codacy provides an expandable section with an explanation of what the error is, why it’s
dangerous, and how to solve it (Figure 2-2).
Codacy also correctly identified issue 2, the XSS vulnerability, and labeled it as “Medium” severity (Figures 2-3 and 2-4).
Codacy’s analysis didn’t identify issues number 3 and 4, which are more related to performance
than to security.
All the above feedback was provided on Codacy's website immediately after I connected my GitHub account and selected the repository I wanted analyzed. However, after I opened a PR on that same repository, Codacy performed a second level of analysis directly in the repository.
Most of the errors it identified reiterate those it found in the previous analysis, which I expected,
since the code is the same. However, on GitHub, Codacy also offers a “commit suggestion” to fix
each issue along with a brief explanation. This makes it very convenient for software engineers to
simply accept the suggestion and merge the PR with one click (Figure 2-5).
Figure 2-5. Codacy provides a suggested fix for the issue it found.
For all these reasons, I rate Codacy’s tool an 8/10. It found two of two security issues, but didn’t
find either of the two performance issues. For the issues it did find, it offered very comprehensive
explanations and proposed fixes that could be accepted with one click on the actual repository.
DeepCode
DeepCode began as an independent startup based in Zurich, Switzerland, as a spinoff from ETH Zurich University.1 It was acquired by the cybersecurity behemoth Snyk in October 2020. Since then, the product has been marketed first as “DeepCode by Snyk” and more recently as “DeepCode AI,” and has been integrated into Snyk's broader suite of products and services.
DeepCode can be used in an IDE or directly in a Git repository. It points out security vulnerabilities
on the spot, as alerts in the IDE tooltip or as comments to the pull request in the repository. As
stated on the company’s website, “it combines symbolic and generative AI, multiple machine
learning methods, and the expertise of top security researchers to offer accurate vulnerability
detection and tech debt management. DeepCode AI is purpose-built for security, supporting 11
languages and over 25 million data flow cases to find and fix vulnerabilities efficiently. This AI
technology enhances developer productivity by offering one-click security fixes and comprehensive
app coverage while ensuring the trustworthiness of the AI through training data from millions of
open-source projects. DeepCode AI stands out for its hybrid approach using multiple models and
security-specific training sets to secure applications effectively.”
Practical example
Just like I did for Codacy, I created an account with Snyk/DeepCode using my GitHub account and
ran it on the code in Example 3-1 within the book’s repository.
DeepCode correctly identified issue number 1, the SQL injection vulnerability, and labeled it with
“H” (High), the highest level in its ranking system. It even provides a score (Figure 2-6), though I
could not find specific information about what this score means. This issue’s score of 830 is the
highest score my code got.
Snyk/DeepCode provides two expandable sections for each error. One provides a deeper
explanation of the issue, resembling a stack trace rendered in the browser UI (Figure 2-7).
Figure 2-7. DeepCode explains the SQL injection vulnerability.
The second expandable section suggests a fix for the issue (Figure 2-8) and advises against building SQL statements by concatenating strings taken directly from user-entered parameters, a best practice in defensive programming.
Figure 2-8. DeepCode suggests a fix for the SQL injection vulnerability.
These suggestions are provided “as is” from an open source repository in the training dataset. This
is very nice in terms of transparency, as a software engineer should always want to know where
the code comes from. However, it adds some extra cognitive load, since it is reference material to help the developer write a fix rather than a concrete change that can be adopted with the click of a button.
Despite this deep level of detail for issue number 1, DeepCode didn’t find issues 2, 3, or 4. It did
find some lower-severity issues in some libraries I used (inside node_modules), which were
irrelevant to this book’s exercise.
I rate DeepCode a 6/10. It found one of the two security issues and didn't find either of the performance issues. For the issue it did find, it provided a very comprehensive explanation; however, the help it offers for each issue is lacking in comparison to that offered by Codacy and CodeRabbit. Snyk provides information about the issue, but doesn't offer proposed solutions that
are easy to adopt with one click.
CodeRabbit
CodeRabbit leverages AI capabilities to enhance the quality, performance, and efficiency of code
reviews. It delivers its code recommendations through comments in the repository.
Practical example
Like I did for the other tools, I created an account with CodeRabbit, allowed it access to my GitHub
account, and selected the repository I wanted to give it access to. Unlike Codacy and DeepCode,
CodeRabbit won’t statically analyze code that’s already in a repository. Instead, I needed to open a
pull request; CodeRabbit then posted comments to that PR with its code-review items and
suggestions. CodeRabbit promotes this as a much more interactive tool that aims to mimic a team
member commenting on a PR seconds after it's opened on GitHub. However, my experience on
CodeRabbit’s website was way inferior to my experiences with the competitors analyzed here.
CodeRabbit correctly identified issue number 1, the SQL Injection vulnerability (Figure 2-9). It
doesn’t provide any sort of indication of severity level: all issues it reports look alike in that regard.
It did a good job pointing out the faulty code snippet, and offered a brief explanation about why it
contains a vulnerability. I believe most software engineers will enjoy this simple UI, since it’s
exactly the type of interaction they get from human colleagues who review their PR.
Figure 2-9. CodeRabbit identifies the SQL injection vulnerability.
Along with the explanation, CodeRabbit offers an expandable section called “Committable
suggestion” that contains a suggested fix for the issue (Figure 2-10). While CodeRabbit displays a
noticeable warning to review the offered solution thoroughly, committing it is only one convenient
click away.
Figure 2-10. CodeRabbit suggests a fix to the SQL injection vulnerability.
CodeRabbit also found issue number 2, the XSS vulnerability. It offered a very simple explanation
as to why this is an issue, but didn’t suggest a solution, as you can see in Figure 2-11.
Figure 2-11. CodeRabbit identifies the XSS vulnerability.
Just like the other tools analyzed, CodeRabbit didn’t find performance issues number 3 and 4.
Thus, I rate CodeRabbit a 7/10. It found both security issues but neither of the performance issues.
It also proposed a solution for one of the issues it found, but not the other one. However, its
explanation of the issues was very superficial compared to the other two tools. Finally, it lacks a website interface that would let users check issues in more depth and provide some historical perspective on the changes and improvements made to the codebase, which the other tools have.
Tool comparison
All three of these AI code-review tools take different approaches to blocking my pull request from
being merged, as shown in Figure 2-12:
• Codacy blocks the PR merge until I fix the issues it identified (which, to be fair, I can do using its suggested fixes).
• CodeRabbit only posts comments; it doesn’t run actual checks, and thus would never block
a PR merge regardless of any issues it finds.
Figure 2-12. Codacy and Snyk show up in the checks section for the PR merge.
If I were to select a single tool, Codacy would be my go-to tool. As Table 3-1 indicates, it had the
highest score.
Conclusion
Code reviews have been one of the biggest frustrations in my software development teams over
the years. People are naturally more inclined to pick up new tasks assigned to them than to stop
their own thread of work to review a colleague’s PR. This default behavior has delayed features
being moved to QA and ultimately going live. It has also created situations where we fast-track
some urgent features even with a less-than-ideal level of code review, resulting in bugs showing up
in production.
In general, the biggest casualty of these common code-review frustrations is team morale, with
team members feeling like they’re constantly switching context and losing focus.
I began using several forms of automated code review in my teams, like linters, static code analysis,
and test coverage dashboards, long before the recent generative AI hype. Any team with robust
engineering standards has probably done likewise.
However, after 15 years in the industry, I can say that the recent wave of evolution adds more depth to these tools–especially the seamless way they integrate with your software development
workflow, and the option to accept suggested fixes with one click. Having a very capable code
reviewer who’s available 24/7 to provide thoughtful feedback on issues in your code is a massive
help to anyone. It’s something I could only have dreamed of when I started out as a software
engineer myself.
However, I believe that software engineers should leverage these tools as learning opportunities
before anything else. They can and do make mistakes, as the tools note themselves in very visible
warnings, and I can only underline that. Always have a human being review and test the suggested
fixes. As with code-generation tools, I recommend a high level of diligence when reviewing any
code or fixes suggested by these tools. Make it yours before you open a PR or merge to master.
1 This book’s author was part of the DeepCode team prior to the company’s acquisition by Snyk,
but has no contractual relationship, equity, or any other vested interest in DeepCode at the time of
writing.
Chapter 3. Automated Testing and Quality Assurance
With Early Release ebooks, you get books in their earliest form—the author’s raw and unedited
content as they write—so you can take advantage of these technologies long before the official
release of these titles.
This will be the 4th chapter of the final book. Please note that the GitHub repo will be made active
later on.
If you have comments about how we might improve the content and/or examples in this book, or if
you notice missing material within this chapter, please reach out to the editor
at sgrey@oreilly.com.
Testing and quality assurance (QA) are usually the last gates that new software code must pass
through before it gets deployed in production. Their ultimate goal is to find costly bugs or other serious issues that may have made it through code review (as covered in the previous chapter) and keep them out of production.
The QA process happens after code has been developed, reviewed, and accepted to merge into the
codebase. There is occasional confusion between Testing and QA as concepts, perhaps because the
stakeholders traditionally involved are called either testing engineers or QA engineers at different
companies. Whatever the title, though, they are usually in charge of the process covered in this
chapter.
Usually, the QA process consists of conducting manual and/or automated tests in an environment
that closely matches production and mimics user behavior, to catch any bugs that escaped the
code review process.
When such bugs are found during testing/QA, the feature is regressed back to development status.
The original software engineer in charge of implementation must fix the issues before pushing the
feature to review and QA again. These regression loops aim to guarantee that the code that
ultimately gets deployed to production is indeed bug-free.
These processes are critical to any software development team. We can break them into two big
categories: automated and manual.
Automated tests
Automated testing employs specialized software tools to execute pre-scripted tests on the
application. This method is highly efficient for repetitive and regression tests, because it reduces
the time needed to validate new code changes. Automation ensures consistency and precision,
minimizes the risk of human error, and enables extensive test coverage. Automated tests can run
around the clock, providing rapid feedback and allowing for continuous integration and continuous
delivery (CI/CD) pipelines. Although initially setting up automated tests requires effort, as does
maintaining them, the long-term benefits include faster release cycles, improved accuracy, and the
ability to quickly detect and address defects.
Manual tests
In manual testing, human testers meticulously execute test cases without the assistance of
automated tools. They simulate end-user behavior to identify defects, ensuring that the software
behaves as expected in real-world scenarios. This approach allows for nuanced understanding and
adaptability, often catching issues that automated scripts might miss, such as user-interface
glitches and usability concerns. While manual testing can be time-consuming and labor-intensive,
it remains essential for exploratory testing, where creativity and intuition are crucial in uncovering
unexpected bugs and ensuring a seamless user experience.
QA is a meticulous, careful process by its nature, which often makes it a bottleneck that delays
features going live. As such, there’s a market for AI tools that propose to accelerate different parts
of this process. This chapter will focus on two of those tools in particular.
AI is changing every aspect of automated testing. For example, until very recently, automating
testing involved writing complex scripts. Now, however, many automated testing tools provide
ways to create tests without writing a single line of code. With simple, plain English, you can create
automated tests that check every component and functionality in your software application. Visual
testing has also been simplified with AI-powered tools that automatically detect visual bugs,
ensuring that your user interface looks and works as intended. These improvements make the
testing process more effective and efficient, allowing testers to focus on improving the overall
quality of the software.
In addition to the automated/manual divide, we can also classify AI tools for software testing and
quality assurance as functional and non-functional, based on the specific areas they target within
the testing lifecycle.
As the name implies, functional testing tools verify that a software application performs all of its
intended functions accurately. These tools focus on what the system does. Their goal is to test
whether the application’s internal components deliver the expected output. Functional testing
tools handle unit tests, integration tests, visual tests, regression tests, and smoke tests, for
example.
Non-functional AI testing tools assess aspects of software that go beyond its functional behavior,
such as its performance, compatibility, usability, security, and reliability. These tools focus on
evaluating the software’s performance rather than its behavior. They measure speed, response
time, and resource utilization, to name a few.
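To contrast with a functional check like the pytest sketch earlier, here is a minimal sketch of a non-functional test in Python: it measures the response time of an HTTP endpoint and fails if it exceeds a latency budget. The URL and the one-second budget are assumptions chosen for illustration, and the requests library must be installed separately.

import time

import requests  # third-party HTTP client; install with "pip install requests"

# Hypothetical endpoint and latency budget, used only for illustration.
URL = "https://example.com/health"
MAX_RESPONSE_SECONDS = 1.0


def test_endpoint_responds_within_budget():
    start = time.monotonic()
    response = requests.get(URL, timeout=5)
    elapsed = time.monotonic() - start

    assert response.status_code == 200
    assert elapsed <= MAX_RESPONSE_SECONDS, f"Response took {elapsed:.2f}s"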
Tools in both categories aim to identify potential performance issues and security vulnerabilities.
They use deep learning models trained on customer usage data, internal company documents, or
even industry regulatory norms or standards. These algorithms can learn to identify patterns that
may indicate performance bottlenecks or security risks. That underlying “intelligence” makes these
AI tools important peers of humans in the QA stage of the software development lifecycle. The
biggest gain to be reaped from using these tools is that they can apply their intrinsic testing
acumen on large codebases in near-real time.
The common frustration is that QA takes a long time: complex products and extensive codebases
usually have hundreds of different user journeys, which take a long time to test manually.
Automated tools do NOT reduce the value of having a human in the loop, but they
can certainly automate a good part of the repetitive work, freeing human QA professionals to focus
on the critical flows, ones that were changed in the last pull request, or whatever makes up the
20% of work that has 80% of the impact (as per the Pareto principle, so often used in software
development).
Many of the prominent tools I evaluate in this chapter combine functional and non-functional
testing abilities, as they aim to integrate into various development environments. These tools can
be used in different ways, depending on each team’s context and preferences.
For instance, testing is one of the most significant aspects of the CI/CD process. Thanks to CI/CD
integrated testing tools, we now conduct tests continuously during development rather than
waiting until after development. This continuous integration approach provides real-time feedback
about your software’s performance and internal functioning.
CI/CD-integrated AI testing tools automatically test changes made to your code after every build.
Continuous testing ensures that issues are identified and addressed early in the development
cycle, reducing the risk of defects in production. This approach promotes a culture of quality and
allows for faster, more reliable software releases.
In contrast, browser- and cloud-based tools run tests in web browsers or the cloud, providing
flexibility and accessibility. They allow testing on different devices and environments, without
the complex setup that IDE- and CI/CD-integrated tools require.
Use cases
Software developers and engineering teams across various industries are integrating AI testing and
QA tools into their processes. Here are some of the prominent use cases that we’ve seen:
Building test automations used to be very slow and time-consuming. It takes a lot of time and
mental bandwidth to design and write test scripts, run regression tests, and do everything in
between. This is what many AI-driven testing tools aim to help with, by generating comprehensive
test scripts from plain English prompts within seconds. This natural language processing (NLP)
method of scripting makes it easy to automate complex workflows. This, in turn, makes testing
accessible to both technical and non-technical stakeholders. AI-generated test scripts are usually
based on user behavior and existing patterns in previous test data, which makes the tests more
relevant and closer to what a human QA tester would create.
Improving accuracy means fewer code bugs slip through the QA stage to production. AI algorithms’
superpower is that, unlike manual testers, they can capture patterns and anomalies at scale.
Being trained on extensive codebases and past testing data helps them better spot the nuances
that can indicate an issue that could require the feature to be regressed.
AI testing tools with self-healing capabilities automatically detect and fix issues in test scripts when
changes in the application’s UI or code cause tests to fail. This ensures that all tests remain
functional and up-to-date without manual intervention. Historically, such updates have been one of
the biggest challenges for QA teams, since a change in the UI means that many previously written
tests have to be updated. These AI tools can significantly reduce the maintenance burden on QA teams and make
the QA process faster and more reliable.
By automating repetitive tasks using AI testing tools, we can speed the release cycle of software
applications tenfold. Developers can concentrate more of their time on innovating new features
and enhancing the product instead of spending the entire day trying to catch bugs or write test
scripts. Companies can also respond faster to market demands and user feedback.
It is important to remember that while these AI tools can do a great job catching issues and bugs
that would eventually break production, the human instinct is still crucial during testing. This is not
just about the limitations of the tools reviewed here, nor their underlying AI algorithms. It’s
beyond that. Software development teams don’t write 100% of their requirements and edge cases
in an absolutely perfect way.
I can speak from my own experience leading software teams for more than a decade: there
are always changes and caveats based on last-minute user feedback, an ad hoc request from sales,
or even a phone call from the CEO with a specific exception. While teams try hard to properly
document all requirements and capture edge cases and test plans in the software development
task descriptions, it's never a perfect outcome. There are gaps. And since these written materials
are what AI tools are trained on, and since the tools take project requirements as the ultimate
instructions to test against, they'll eventually miss some nuances of those requirements or ad hoc exceptions.
Even beyond that, frankly, there’s often specific context awareness that only humans can have.
We need humans to adapt to industry-wide events or sensitive user concerns. Software
development is a complex matter, and the more extensive a product and code base are, the more
likely it is that a pure AI-driven QA process will show its limitations and gaps.
AI algorithms are only as good as the data used in training them. They can absolutely help a lot, as
this chapter shows–especially with the repetitive grunt work, like testing an extensive list of user
journeys and application flows. But human monitoring, review, and intervention are still needed
for the critical parts of the process.
Evaluation Process
Most companies in the QA automation space cater to enterprise clients. This makes sense, given
that enterprise companies tend to have larger teams, more extensive products, and much higher
quality-control standards. While this is totally fine and expected, it affected my selection process
for tools to showcase here, since I gave preference to tools that can be accessed via a simple self-
service sign-up process and that offer a free trial. This is a deliberate choice to make it easier for
readers to act on what they read here, though it certainly leaves out some tools that required me
to speak with their sales teams to negotiate a price package. I decided those tools were out of
scope for this book.
Even with that limitation, as I researched this chapter, I reviewed more than 20 automated testing
tools (many of which fell into that enterprise sales category). I shortlisted the two tools highlighted
below.
To evaluate and compare AI-powered testing tools for this chapter, I applied each tool to write and
run test cases for a simple, straightforward test site: a simple web application for booking
appointments with a medical doctor. Since developing a comprehensive, end-to-end automated
testing framework is a substantial undertaking, I focused on evaluating the specific AI features
these testing tools offer, to demonstrate their potential for integrating AI into software testing.
The examples in this book are not intended to represent a complete testing framework, but to
demonstrate how to use AI-integrated features in automated testing tools. The primary objective
of this chapter is to showcase AI’s possibilities and simplicity in the software testing domain, not to
provide a production-ready solution.
I evaluated how the AI features in these tools enhance various aspects of the testing process, such
as generating test cases, creating test data, executing tests, and analyzing results.
Test site:
https://katalon-demo-cura.herokuapp.com/
App description
Web app with a login page for booking appointments with a medical doctor.
Test description
Automate a series of actions on a healthcare service website. This test ensures that a patient can
navigate and successfully book an appointment to meet with the doctor from the application. We
want to see if everything works as it should on the app.
Steps
1. Users can successfully log in when the correct details are entered
2. Users can successfully book appointments if all the required fields are updated
Test case 1
• Navigate to https://katalon-demo-cura.herokuapp.com/.
• Click on the Make Appointment button.
• Check that the user can successfully log in when the correct details are entered.
Test case 2
• Book an appointment.
• Check that the user can successfully book an appointment within 10 seconds of submitting the
booking form with all the right details.
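To give a sense of the manual scripting work these AI tools aim to replace, here is a hedged sketch of test cases 1 and 2 written by hand in Python with Selenium WebDriver. The element IDs and the demo credentials are assumptions about the public CURA demo site, so verify them against the page source before running; the point is the amount of hand-written detail involved, not the exact locators.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
try:
    # Test case 1: open the site, start an appointment, and log in.
    driver.get("https://katalon-demo-cura.herokuapp.com/")
    driver.find_element(By.ID, "btn-make-appointment").click()
    driver.find_element(By.ID, "txt-username").send_keys("John Doe")  # assumed demo credentials
    driver.find_element(By.ID, "txt-password").send_keys("ThisIsNotAPassword")
    driver.find_element(By.ID, "btn-login").click()

    # Test case 2: submit the booking form and expect a confirmation within 10 seconds.
    driver.find_element(By.ID, "btn-book-appointment").click()
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, "//h2[text()='Appointment Confirmation']"))
    )
finally:
    driver.quit()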
Now, let’s examine the top-performing AI testing tools I found and see how they followed these
instructions and evaluated the website using their artificial intelligence features.
Katalon Studio
Katalon Studio, launched by Katalon Inc. in 2015, is an automated software QA tool that supports
testing for mobile applications, web apps, desktop apps, and APIs. The company’s website
highlights that it has “embedded AI across our entire platform to test faster, see clearer, and
streamline test automation with fewer bottlenecks.”
Katalon Studio was the first tool in Katalon’s ecosystem. Since then, two additional tools have been
added. Katalon Recorder is a browser automation extension for creating and running Firefox, Edge,
and Chrome tests. Katalon TestOps is a test-orchestration platform that centralizes test planning
and management activities, streamlining DevOps processes and enhancing cross-team
collaboration.
Katalon's AI capabilities include:
• A Virtual Data Analyst feature that analyzes all your TestOps data and generates reports
• Self-healing capabilities
Katalon’s self-healing AI, as noted above, automatically helps you fix tests that break during runs.
You don’t have to manually maintain existing test scripts when you ship a new feature or change a
component. Regression test plans are also handled automatically: the AI engine instantly reruns
your existing functional and non-functional tests to ensure that your software’s previously
developed and tested components still perform correctly even after you’ve added new changes.
Traditionally, to create test cases in Katalon, you either record tests and play them back or write
test scripts in Groovy.
Practical example
In this example, I used StudioAssist AI, Katalon’s generative AI, which helps programmers write test
cases from plain-text prompts. I used it to write test cases for the website above. For the sake of
this test, I acted as a stakeholder who doesn’t know the Groovy syntax. I used the StudioAssist AI
feature in Katalon to generate Groovy scripts, which set up my tests. I wrote the test I wanted in
the prompt, and it created a test script for me in Groovy, which I then ran to evaluate the software.
StudioAssist also helps explain what each line of code it generates does.
I created a new test project, set up a test folder, and navigated to the script tab to begin writing my
tests. Here is the prompt I gave StudioAssist AI:
Prompt:
1. Open the browser and navigate to https://katalon-demo-cura.herokuapp.com/
2. Click on the Make Appointment button
3. Fill username in the 'Page_Login/txt_Username' object with the value in the 'Username' variable
4. Fill the password in the 'Page_Login/txt_Password' object with the value in the 'Password'
variable
Katalon StudioAssist generated a test case written in correct Groovy syntax (shown in full in Example
4-1), which executed successfully when run (see Figures 4-1 and 4-2).
Figure 3-1. Katalon makes generating tests intuitive by following the StudioAssist option in the UI.
/* I want to write a Katalon Studio test case to perform the following steps.
* */
WebUI.openBrowser('https://katalon-demo-cura.herokuapp.com/')
WebUI.closeBrowser()
As you can see, the test generated fulfills the instructions I provided, and the code is written in the
correct syntax.
Pros
• StudioAssist is easy for non-technical users to use and debug, since it transforms natural
language prompts into correct Groovy testing syntax.
• Built-in keywords and templates speed up the test-creation process and reduce the need
for extensive coding.
• Its self-healing capabilities automatically update test scripts when there are changes to the
application’s UI.
• StudioAssist integrates with popular CI/CD tools and testing frameworks like Jenkins, Git,
and JIRA.
Cons
• Katalon requires you to download and install its Studio application (as the screenshots above
show), which adds some additional setup work.
• Katalon can sometimes be slow, particularly when dealing with large test suites or complex
test scenarios.
• There is a bit of a learning curve with the Katalon StudioAssist UI. Some options are buried
inside the cascade options from the top bar, and you’ll need to learn keyboard shortcuts.
I rate Katalon a 9 out of 10. It helps a lot with writing tests from plain English text prompts and
executing them against the application I want to test, within the same UI. The only reason I won’t
rate it 10/10 is the learning curve pointed out in the cons above. It could certainly be more
intuitive, although this is quite a typical UX for complex enterprise products, which Katalon already
is.
TestRigor
TestRigor is an AI-driven automated tool designed to streamline software testing. Unlike traditional
testing tools, TestRigor allows developers to create and execute tests without writing code. Its NLP
capabilities allow you to describe your application functionality in plain English. The AI then
generates, executes, and reports on test cases, significantly reducing the time and technical
expertise required for comprehensive software testing.
Practical example
In my evaluation of TestRigor, one feature that really stood out was its completely codeless test-
creation process. I did not have to write a single line of test code. Instead, I provided my test site
URL and a brief description of my application and how it should behave. I also provided my test
goals and specified the number of test cases to generate. The AI handled everything, from
generating tests to executing them to generating a detailed test report.
The goal of the testing, as you may recall, was to check whether a user can log in in less than 10
seconds and successfully book an appointment.
Pros:
• TestRigor uses Behavior-Driven Test Case Creation, which allows for the creation of tests
based on how users interact with the application. This bypasses the technicalities of testing
syntax, which can remove friction for nontechnical users or smaller teams.
• TestRigor’s testing product is very accessible, which makes it stand out from the crowd. It’s
fully cloud-based, which eliminates the need to install additional software (unlike Katalon).
This makes it easy to access and use from anywhere.
• It integrates with popular CI/CD pipelines like Jenkins and CircleCI and supports bug-
tracking tools like JIRA, which make it seamless to integrate with the tools that teams are
already using.
• The self-healing functionality, just like Katalon’s, reduces the maintenance burden on the
testing team whenever existing application workflows are changed.
Cons:
• Bypassing actual test writing is great for smaller teams and occasional users, but I doubt it
would be practical for larger teams that already have a large testing infrastructure in place.
For those software teams (which are the majority), the value of automated testing tools is to
generate the tests in correct syntax.
• A cascade con of this bypass is that TestRigor doesn’t offer the same flexibility and control
as traditional testing languages and frameworks. It would not work well for complex test
scenarios or extensive application workflows.
Because of these limitations, I rate TestRigor a 7 out of 10. That said, it has a great UX that “just works,”
and it’s a perfect fit for smaller teams that don’t have a complex testing infrastructure in place
already, or teams whose testing needs are occasional and who just want to check that the product
is working as per the requirements.
Tool comparison
Katalon and TestRigor have strengths that cater to different testing needs, though both leverage AI
and machine learning to enhance their functionalities. Table 3-1 provides a comparison.
Katalon
Katalon offers a robust suite of features designed to handle complex test scenarios. It is particularly
useful for large-scale testing projects where comprehensive test coverage is critical and where a
testing infrastructure, team, and processes are already in place, as is the case in most software development teams.
While the learning curve is steeper than with TestRigor, Katalon’s depth of features and flexibility in
handling diverse testing requirements make it a powerful tool for a broader number of software
development teams, especially larger ones or those working on complex products.
TestRigor
TestRigor impressed me with its simplicity and ease of use. The learning curve is notably short, and
I found it remarkable how fast I went from signup to actual test results. This tool excels in
environments where product features change frequently, requiring rapid and continuous testing.
I’d say TestRigor is best suited for startup teams or occasional one-off users who don’t have an
existing testing infrastructure in place and whose product requirements may change too often to
let them even set up such a robust testing environment. On the other hand, TestRigor poses
limitations for those teams where Katalon excels: that is, larger teams and more complex product
workflows.
Conclusion
Of the tools analyzed in this chapter, Katalon emerged as a good pick for larger teams and
enterprise products, while TestRigor proved to be a winner for startups and side products. That
covers the software-development market nicely, and showcases how teams with different types of
products and levels of maturity can benefit from using AI testing tools.
If you’ve ever worked in software testing or QA, or if you’ve simply written unit tests for any code
you wrote, you’ll know how laborious it is to write tests and keep them updated as an application
evolves and gets extended.
I’ve often been a part of conversations about budget planning and roadmap discussions where
robust testing was postponed, or outdated tests were simply framed as technical debt that should
be phased out. It's very common for both technical and non-technical stakeholders to have biases
against proper testing practices, and one of the key reasons is how significant an investment
proper testing has historically required.
That brings us to the bulk of the value that AI testing tools can bring to the table. In software
development, we’re constantly looking for occurrences of the Pareto principle: “What’s the 20% of
effort that will return 80% of this roadmap item’s value?” As a CTO, I’ve been in the center of these
discussions many times. In QA, the 20% of effort that creates 80% of value is about defining the
application workflow properly; talking with users and clients about the issues and edge cases;
going the extra mile to map out nuances for the software developers who will implement the
requirements; and, ultimately, doing user acceptance testing as a final gatekeeper before going
live.
The other 80% of effort, which creates 20% of value, is the actual grunt work of writing and
executing each test to verify that the code fulfills the requirements. AI tools excel at this part. Being
able to provide instructions in natural language and get back tests written in proper syntax, ready
to execute, is a huge time-saver. Self-healing capabilities that update tests whenever application
code changes are a great backstop against tests becoming deprecated or simply commented out
when pressing priorities emerge to get a release into production.
These are the day-to-day decisions that so often relegate proper software testing to a second-order
priority. AI tools can help alleviate those concerns and contribute to ensuring that software running
in production is properly tested and bug-free. This can’t be done by AI tools alone, since these
tools won’t replace humans. Quite the opposite: the human tasks in QA are critical, as they define
the scope of testing and serve as key guidelines for the AI tools to do the grunt work at a high
quality standard.
Once again, “AI + human” is a combination that leads an often frustrating, decades-old process to
produce higher-quality output.
Chapter 4. User Interface and User Experience Design
With Early Release ebooks, you get books in their earliest form—the author’s raw and unedited
content as they write—so you can take advantage of these technologies long before the official
release of these titles.
This will be the 5th chapter of the final book. Please note that the GitHub repo will be made active
later on.
If you have comments about how we might improve the content and/or examples in this book, or if
you notice missing material within this chapter, please reach out to the editor
at sgrey@oreilly.com.
The world of software applications depends on beautiful user interfaces (UI) and intuitive user
experiences (UX). UI design is about how a product looks and how users interact with it visually; it
focuses on layout, colors, buttons, typography, and everything else that shapes the visual experience
of a software product. UX design, on the other hand, is about how a product feels to the user and
whether it is intuitive and efficient to use; UX is about usability, navigation flow, and making sure
the overall experience is smooth and user-friendly. Both work hand in hand to create great software products.
Producing those interfaces and experiences has been the job of UI/UX designers for the last few
decades. These designers usually pair with software engineers to implement the functionality and the
backend workflows that bring those designs to life as an integrated experience for users around
the world. The transformation of such UI/UX designs into functional front-end code is a labor-
intensive process, though. Traditionally, designers and engineers work closely to ensure that the
visual aesthetics of a design are accurately translated into a responsive and interactive user
interface. This process often involves multiple iterations, meticulous attention to detail, and a deep
understanding of both design principles and coding standards.
Since the emergence of text-to-image generative models in 2022 and 2023, UI/UX design has
witnessed an enormous transformation. As I write this in 2024, recent AI tools now generate UI
designs from ideas described in natural language, and generate functional frontend code from UI
designs. These tools are revolutionizing the design-to-code workflow by automating significant
portions of the process. This has immense potential to reduce the time and effort required to bring
designs to life and make design professionals much more efficient. AI tools can significantly reduce
the gap between concept and implementation.
For example, the first iteration of wireframing and sketching designs for a new application can take
more than a week to complete–sometimes more than a month, depending on the complexity of
the project. The new AI design generators can cut wireframing time to less than a day and make
each feedback loop faster as well. These tools’ ability to transform designs into functional frontend
code means one person or team can go from describing an idea in natural language to generating
functional frontend code within minutes (or hours, counting iterations), as opposed to weeks or
months.
This chapter will examine some of the leading tools in both the domains of UI design and UX
design, aiming to cover the wide range of activities in between.
Throughout this chapter I’ll focus on two types of AI tools: those that transform ideas into designs
and those that transform designs into code.
This sounds like a UI/UX designer’s job, but think about it: if creating designs becomes easier,
faster, and more accessible to people without professional design skills, the average frontend
developer will probably be able to create designs and code, as opposed to the current scope of
work, which entails only writing code. That translates to massive empowerment for software
engineers, especially those working on the frontend, who’ll be able to create simple designs
independently.
This segment complements the tools covered in Chapter 2, which generate code from a natural-
language prompt. The tools here generate code from image-based mockups or Figma design files.
(If you aren’t familiar with it, Figma is for UI/UX designers what VS Code is for software
developers.)
Most available UI/UX design generator tools are fully browser/cloud-based. They are simple to use
without complex software installation and are easily accessible with a browser and a URL. Some
tools, like qoqo.ai, come as plugins for design platforms like Figma.
Most of the products this chapter covers have been launched very recently (as of mid-2024) and
are in the early stages of product development. They’ll undergo continuous development, rolling
out new features and improved functionality in the months and years ahead. As I write this, such
design tools are a year or two behind software development tools like GitHub Copilot in terms of
maturity.
I expect that in the UI/UX space we’ll also see more browser-based tools, more add-ons or plugins
to popular design platforms, and more integration into collaboration tools used between designers
and frontend development teams (the analog to Git might still be Figma for many teams, but
for others it might be plugins to popular collaboration tools like JIRA or Notion; we'll see).
Like any tools, the tools in this chapter present advantages and disadvantages. Some of the
advantages include:
Consistency
By automating the conversion process, AI tools help maintain consistent implementation of design
elements across a project.
Accessibility
Individuals without technical or design expertise can pick up these tools and create functional
interfaces, a shift that helps to democratize UI/UX design.
Faster prototyping
AI-generated code allows for faster prototyping. Designers and engineers can quickly test and
iterate on their ideas to bring new products to market much faster than ever before.
Some of the disadvantages include:
Code quality
Code produced by AI tools does not always meet the standards of human-generated code. It can
be less efficient and harder to read and maintain, so–as I’ve emphasized throughout this book–
developers must look through and revise generated code thoroughly before releasing it into
production.1
Device compatibility
AI-generated code may not be fully optimized for all browsers, operating systems and screen sizes.
This can lead to issues with responsiveness and cross-device compatibility that require additional
effort to troubleshoot and fix.
Limited customization
While AI tools can handle standard design patterns well, they may struggle with more complex,
custom design elements. This greatly limits creativity and the ability to tailor the design to specific
needs. For this reason, human creativity is still needed.
Lack of originality
AI tools tend to produce designs that resemble existing ones. These tools are trained on current
designs and prevailing trends, which can lead to a lack of originality and distinctiveness, with some
designs appearing generic or repetitive, failing to stand out. Although AI tools can offer a useful
starting point or inspiration, they're far from matching the creative capabilities of a talented
human designer.
Striking the right balance between leveraging AI for automation and relying on human talent for
quality and creativity is key to successfully integrating these tools into the design workflow.
AI tools in the UI/UX space fall into several categories, including:
• Accessibility checkers
• Personalization engines
• Design-to-code converters
In this chapter, I’m focusing only on the last two items, which lie at the intersection of UI and
frontend code and are thus of most interest to software engineers. Other tools that target specific
parts of the UX and product research workflows are usually more specific to designers and other
functions that collaborate more loosely with software developers, like product managers, product
analysts, and UX researchers.
Thus, I evaluate the tools in this chapter using the following use cases:
Rapid prototyping
AI tools can quickly generate wireframes and prototypes based on initial inputs and design
descriptions. This is a major chunk of the design process, where designers spend hours on
back-and-forth iterations. With these tools, creating prototypes is much faster, and designers can
iterate on multiple design options quickly. This flexibility can also make sophisticated flows such as A/B
testing much more accessible.
Design consistency
AI tools can be trained on design principles like components, colors, and styles to ensure
consistency across different parts of a product by generating designs that adhere to established
design systems, branding, and guidelines. This results in a cohesive user experience with uniform
design elements and styles throughout the application. While this “factory” approach doesn't
foster creativity, it does bring significant value for use cases where all you need is to create new
screens that look like the existing ones but fulfill new functionality. (In all honesty, for better or
worse, most software development consists of this “factory” approach, as proved by the growing
popularity of low-code and no-code tools.)
Accessibility assessments
AI tools can automatically check and improve the accessibility of UI designs to people with
disabilities, ensuring compliance with standards like WCAG and inclusive design practices–much as
the tools covered in Chapter 3 do for software code. Such features can identify accessibility issues
and suggest fixes, reducing the manual effort required.
Personalization
AI UI/UX design tools analyze user data about behavior, needs, and preferences to tailor their
output to specific user groups or individuals, improving engagement, retention, and satisfaction. By
leveraging data-driven insights, these tools ensure user-centered design. This can massively extend
existing use cases. For example, if the goal is creating new landing pages for specific customer
demographics, with these tools you could create full product variations according to specific
parameters. Without them, each product variation would require heavy software development
cycles.
Content generation
AI tools assist in generating content for UI elements, such as placeholder text, images, and icons.
This quickly populates designs with realistic content, making prototypes more lifelike for testing
and ensuring consistency with the design’s style and tone. This frees designers to focus on the
more creative aspects of the process.
Evaluation process
My process in this chapter was as follows: I submitted a brief design prompt to each of the
selected UI design tools, ran the same challenge on each tool, and compared their output. I also
evaluated their ability to convert the designs they generate into frontend HTML/CSS, Next.js, and
React code. I reviewed more than 20 tools available in the market, which, as expected, follow a
long-tail distribution: a few work as advertised, while the rest either have issues that prevented me
from running the test or simply deliver poor output. As in other chapters, many tools in the AI
space come from very early-stage startups whose products are still in an early stage of development.
As such, I selected the four best tools that provided actual value for the use cases tested. I then rated
each tool on a scale from 1 to 10, with 1 being the worst (a solution that errors out and doesn't run
at all), 10 being flawless, and 5 being a design that didn't exactly fit the requirements. I look
closely at the top product in each category, detailing its pros and cons, then provide some more
concise information on the runner-up.
It’s also important to note that all tests described in this chapter were run in June 2024. Given the
fast pace of evolution of each of these tools and underlying models, it’s likely that you could get a
different result at a later time for the same prompt.
UI Tools
I asked each of the AI tools covered in this chapter to generate screen designs for a mobile food-
delivery application. Here is the detailed prompt I provided:
Create a user-friendly food delivery app that allows customers to browse local restaurants, view
menus, place orders, and track delivery status in real-time. The app should include screens- Login,
Restaurant and Menu Browsing, Order Placement and Tracking.
Style: App should be modern and minimalist, focusing on ease of use and visual appeal. The design
should incorporate a clean layout. Add in mouth-watering food images.
Uizard
Uizard has a marketing slogan that I believe sets the standard for most other players in this space:
“Forget no-code, here comes no-design.” The analogy between code automation and design
automation makes a lot of sense. Just as no-code has promised the same functionality in a fraction
of the time and budget, no-design could have an equal impact on the design process.
The Uizard product started as a machine-learning research project called pix2code in 2017 in
Copenhagen, Denmark. As per Uizard’s marketing collateral: “It allows developers to visualize
product ideas quickly and easily with AI. This helps to fast-track the workflow of designers and
software developers. Its mission is to democratize design and empower non-designers everywhere
to build digital, interactive products. ”
With the free plan, you can generate and design up to 5 screens per project, and that’s what I used
for this test. Specifically, I used Uizard’s most advanced AI design engine, Autodesigner
2.0, released in June 2024. The company describes Autodesigner 2.0 as a “text-to-UI generator
product” that “generates mockups and wireframes from text.” It “takes the intuitive conversational
flow of ChatGPT and merges it with Uizard’s generative UI capabilities and a drag-and-drop editor
to provide you with an intuitive UI design generator.”
Uizard provides tips to guide users through the entire design process. A nonprofessional could pick
up this tool and create reasonable designs within minutes; a professional designer will have even
more leverage.
Practical example
At the prompt screen (Figure 4-1), I asked Autodesigner to generate screens for a food-delivery app
using the detailed prompt shown earlier in this section.
Figure 4-1. The Autodesigner prompt screen.
You can see all of the screens it generated, along with the interactions,
here: https://app.uizard.io/p/97cf063f/preview
Within the editor, Autodesigner offers several AI options:
Modify selection
Allows you to select a particular section of your design for modification. It is ideal for tweaking
existing designs to perfection. Use this when you want to refine or alter specific elements on your
canvas, such as changing colors, adjusting layouts, or modifying content. As a developer, the best
use case here is making those little adjustments I'd otherwise need to ask a designer for; now I can
do them myself.
Generate an image
Use this to add custom images to your design. Uizard uses an image-generation model that can
generate photos, illustrations, and other visuals from a prompt. This is great for generating mockups
that don't feel constrained by a stock photo library.
Generate a screen
Perfect for creating full interfaces from scratch. Choose this when starting a new project or when
you need to expand your application with additional screens. Ideal for both high-fidelity prototypes
and initial concept explorations.
Generate a theme
Select this to create a cohesive style across your project. Define your colors, typography, and UI
elements to ensure consistency. It’s a simple way to set a consistent design system across your
project right from the start or revamp it with a new aesthetic.
All of the above are very typical workflows for any designer, so it's not surprising that they were
included in this product. The implementation is very intuitive and easy to use, as the chat interface
in Figure 4-3 shows.
Figure 4-3. Asking Autodesigner to generate an image and add it to a section of the welcome
screen.
I rate Uizard an 8 out of 10. The text-to-UI design feature works pretty well. Its designs come with
an interactive demo, dynamic transitions, and an external link to share. I didn’t rate it higher due to
design inconsistencies, some of them pretty hard to ignore. The most obvious one is the
inconsistency in the app name: it is “FoodieDash” in most of the design screens, but “EasyEats” in
the tracking screen. Similar inconsistencies can be found in input fields and color palette.
Autodesigner 2.0 is very promising; I expect a lot of iteration in the months to come. Uizard also
recently launched a design-to-code feature, which wasn't available in the plan I used at the time of
writing. A Figma plugin is in the beta stage of launch as I write this. Both of these features could be
game-changers when fully available to the public, both for end-to-end workflow coverage and
product completeness, and could facilitate these tools' adoption into existing team workflows.
Galileo AI
Galileo AI is a prompt-to-UI generative AI tool for designers and builders. Its creators write that the
product emerged from their core belief that “AI will augment the creativity and productivity of
designers, rather than replace them.” They explain that its goal is to build capabilities beyond basic
automation and delve into the creative process to actually make the user a better designer. They
aim to decentralize design and make it more accessible to a broader range of people.
Galileo AI stands out as the most versatile of the design tools I evaluated. Not only does it
transform natural-language ideas into UI designs within minutes, it also allows designers to upload
images for inspiration or redesign. With the user providing concise prompts, it transforms
uploaded images into designs that closely and accurately match the designer’s vision. It offers a
comprehensive suite of features that designers need, including prompt-to-UI, image-to-UI, and UI-
to-code capabilities. These features are uniquely combined in one platform, unlike other tools.
Think of Galileo as ChatGPT for interface design. The prompt can be “a fun, friendly app design for
dog walkers” or “a dashboard UI for an e-commerce site” and you’ll be presented with multiple
options for a visual layout in under a minute, with appropriate UI elements, images, and product
copy. You can pull these full UI screens into Figma to edit them. The design-to-code feature was
available for free users when I evaluated this tool, which also makes it stand out.
I provided Galileo AI with the exact same food-delivery-app prompt I gave Uizard AI, as seen
in Figure 4-4, and the results were very interesting. The generated designs (see Figure 4-
5 and Figure 4-6) came out clean and comparable to what a human designer might produce.
The screens generated by Galileo come out with the corresponding HTML code, as shown
in Figure 4-6, which can be downloaded.
Figure 4-6. HTML Code snippet for one of the screens generated.
The code is about 205 lines and this is only a snippet. The complete code can be found in the
GitHub Repository for Chapter 4.
I rate Galileo AI a 10/10 because it truly stands out. You can easily transfer designs to Figma for
further customization and refinement. Even if the HTML code generated is clunky at times, the AI
produces better designs and code than the other tools I evaluated. Remarkably, all these features
are available in the free version.
As a possible future improvement, when Galileo generates code for the designs, it should use the
actual libraries and frameworks that most frontend developers use, such as React components,
Tailwind CSS, and Bootstrap. Adding compatibility with those would make the generated code easier
to use and less prone to needing a big refactor when merged into a production repository.
UX Tools
In this short section, I cover tools for the UX part of the UI/UX design workflow: user research,
product workflows, and usability. Here is the test prompt I used for the tools in this section:
Create a user persona for an 18-25 year old university student trying to order food online from a
restaurant using our application.
Description: 18-34, female or male, university undergraduate from Los Angeles, California, single,
and no children. Looking for a user-friendly platform to order food very fast, track orders, and make
payments.
QoQo.ai
QoQo.ai (pronounced “cocoa”) is a Figma plugin that applies AI to UX design, helping designers in the
early stages of a project. With QoQo.ai, designers can generate UX personas, journey maps, sitemaps,
and UX copy. UI/UX designers, product managers, and product researchers, who usually do
these tasks, will save time with this tool; it can also make these tasks more accessible to software
engineers, helping them better engage with users.
You provide a prompt that includes demographics (Figure 4-7), and within seconds the AI
generates a persona (Figure 4-8).
Figure 4-7. Prompting fields in QoQo.ai, with demographics, scenarios and attributes.
Figure 4-8. A persona generated by QoQo.ai
I rate QoQo.ai an 8 out of 10. It gets the job done, and there is really no learning curve involved. It
does feel like a GPT wrapper, in the sense that it generates text and places it into predefined
brackets, but that can still provide tremendous value when you're starting to design the UI/UX of a
software product.
Research Studio
Research Studio is an AI-powered application designed to automate the customer and user
research analysis process. This tool is particularly useful for researchers and designers who need to
quickly transform user interviews into actionable insights, analyze satisfaction effortlessly, and
generate branded reports swiftly. Traditionally, UX researchers spend hours combing through vast
amounts of data to extract insights. Research Studio streamlines this process, making it faster and
more efficient.
Research Studio's platform lets you choose among three LLMs: Anthropic's Claude, Mistral, or
OpenAI's GPT-4o. You upload your user interviews, and it extracts insights from them and compiles
the findings into a report. Several report types are available as output, such as user flows, feature
maps, and question analysis.
To evaluate this tool, I used Claude separately to generate some raw input. I asked it to generate
50 survey responses from users of the dummy food-delivery app described above, using the following
prompt:
Generate 50 survey responses from users of a food delivery app, reflecting their thoughts after
using the app for some time. The responses should highlight both positive and negative views,
including any problems they faced and suggestions for features that could improve the app’s
efficiency and user experience.
These responses include both positive and negative feedback, as well as suggestions for
improvements. Here are a few examples from its output:
1. "The app is fast and easy to use. Love the variety of restaurants!"
2. "Delivery times are often longer than estimated. Please improve accuracy."
I passed the responses into Research Studio to extract insights from them and asked it to write a
comprehensive report based on its built-in report types. Research Studio transformed the raw user
reviews into a map of insights, as shown in Figure 4-9.
Figure 4-9. An insight map generated by Research Studio from survey responses.
Tool Comparison
Galileo AI emerged from my evaluation as the best text-to-UI tool among the 20+ tools I
evaluated, including the four featured in this chapter. Its ability to take an image and remake it
into a beautiful design also makes it stand out. As I noted, this space is still in its very early days, so
I can only imagine how far these tools will evolve in the coming years.
The UX tools I evaluated are less complex in terms of underlying technology, but that’s not a
critique: they worked well enough for their use cases.
Conclusion
I began writing this chapter with the wrong expectations. I expected the tools in this space to be
less useful in their current state, given their newness. I was wrong. I’m now very excited about this
space and the evolution in software development processes we’ll see in the coming months and
years.
Generating frontend code from designs, the exact use case covered by Galileo and soon by other
players in the field, is certainly the angle that best speaks to software engineers, especially those
working on the frontend. It is a game-changer that will save software engineers around the world
millions of hours.
The other use cases seem to cater less to software engineers: they are a designer’s work. But think
about it: our tools are becoming more powerful, and if it takes less time to do the same work,
software engineers can manage a larger portion of design workflows. Indeed, our employers and
team leads will likely ask us to do so more and more. It’s not just about being full stack rather than
frontend or backend–it’s also about touching tasks that would otherwise be handled by a
dedicated designer or design team.
Never before have software engineers been so empowered to collect feedback from the users of
their products, structure that feedback into actionable insights, generate improved designs,
transform them into code, and push a new iteration to production. Product iteration cycles will
shorten significantly thanks to these tools. Professionals who master them will be closer to fitting
the proverbial “10x developer” persona than those who dismiss them.
1 This is yet another occurrence of the tradeoff between saving time writing code and spending
more time debugging, fixing, and improving code. It’s not much different, conceptually, from the
tradeoffs of the HTML code generated by Dreamweaver in the early 2000s, or the more recent
advent of low-code tools that generate templated code, which is often hard to customize and
adapt to one’s specific needs.
Chapter 5. Data Analysis and Business Intelligence
With Early Release ebooks, you get books in their earliest form—the author’s raw and unedited
content as they write—so you can take advantage of these technologies long before the official
release of these titles.
This will be the 6th chapter of the final book. Please note that the GitHub repo will be made active
later on.
If you have comments about how we might improve the content and/or examples in this book, or if
you notice missing material within this chapter, please reach out to the editor
at sgrey@oreilly.com.
In today’s software-driven world, businesses generate vast amounts of data from their
applications, users, and operations. This data holds valuable insights that can be used to make
decisions, predict trends, and improve system performance. The ability to analyze and act on this
data has become a critical skill for software engineers.
In this chapter, I tap into data analytics and business intelligence, and I’ll test how state-of-the-art
AI tools can help businesses understand their data and improve their results. Whether it’s
predicting future user behavior or optimizing resource usage, data analytics opens up new
possibilities. Here are three key angles this chapter will cover:
Analyzing data
The first promise these tools make is that users can query large datasets by asking questions in
natural language. This is attractive, given how costly it is for companies to build visualization
dashboards on top of their databases.
Finding patterns
Often companies want to correlate data points and find patterns in order to understand user
behavior or spot some malfunction in their processes, in ways that go well beyond simply querying
the data to count and sum fields. Again, the costs of such projects prevent many companies from
even stepping into this territory, so the promise of AI tools removing this barrier is a big one.
Forecasting
The ultimate promise of data science and business intelligence is to pick up data about what
happened in the past and use it to draw conclusions about what will happen in the near future.
This forecasting can be a game-changer for businesses that do it well, and the companies that
develop and use it as a key part of their decision-making processes treat these projects and
algorithms as sensitive intellectual property.
These angles show how software engineers and data analysts can turn raw data into actionable
insights to help teams make smarter decisions. They also show how expensive and time-consuming
these projects have been, historically. High costs and complexity have limited sophisticated
projects in these fields to companies that have the funds and the technically capable teams to
drive them. Most small to medium businesses, startups, and non-tech businesses face high barriers
to developing such tools, which hinders their ability to compete in this global market. The promise
of AI tools in this sector is to democratize access to such tools among all market participants,
regardless of their size, their sector, or the technical abilities of their teams.
Before we look at the tools themselves, let’s quickly review a few basics about data analysis.
At the heart of data analytics is the data itself. In software engineering, this data comes from many
sources, such as:
User Activity
Information about how users interact with software, including page views, click paths, and session
lengths.
System Logs
Detailed records of system and application performance, which help engineers monitor health and
performance.
Tracking tools
Automatically collected real-time data on application performance, such as response times and
error rates.
Customer Feedback
Insights from user reviews, support tickets, and surveys, providing a qualitative perspective on user
satisfaction.
Market Research
Competitive analysis, the news, market reports and all relevant information that’s published every
day.
There’s a long tail of other possible sources of valuable data that businesses can use to shape their
decisions about the future. These data sources form the foundation of data analysis. They also
shape the quality of the data and determine how much data cleaning is required. For example, if a
dataset contains lots of empty fields or inconsistent field types, it requires significant specialized
work to clean it for analysis, as well as advanced techniques like data normalization and clustering.
I'll be exploring these techniques with the tools analyzed in this chapter.
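To make that concrete, here is a minimal sketch of a typical cleaning pass in Python with pandas. The file name is an assumption, and the column names match the online-retail dataset used in the evaluation later in this chapter; the specific rules are illustrative, not a prescription.

import pandas as pd

# Hypothetical file name; the columns follow the retail dataset used later in this chapter.
df = pd.read_csv("online_retail.csv", parse_dates=["InvoiceDate"])

# Typical cleaning steps before any analysis:
df = df.dropna(subset=["CustomerID"])   # drop transactions with no customer attached
df = df[df["Quantity"] > 0]             # remove cancellations and returns
df = df[df["UnitPrice"] > 0]            # remove free or mispriced items
df = df.drop_duplicates()               # remove exact duplicate rows

# A derived column that most downstream analyses need.
df["Revenue"] = df["Quantity"] * df["UnitPrice"]
print(df.describe())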
With valuable data in hand, data analysts can tackle a variety of challenges. We’ll tap into some of
these key use cases for data analysis and business intelligence in the tool evaluation part of this
chapter.
Performance Insights
By analyzing system metrics and logs, engineers can identify performance bottlenecks and
inefficiencies early on. This helps in optimizing resource usage, improving response times, and
ensuring the application remains scalable as demand grows. For example, tracking CPU and
memory usage over time can reveal patterns that signal when a system needs scaling or
optimization.
User Behavior Prediction
Data analytics can uncover patterns in user behavior, helping teams anticipate future user needs
and preferences. By analyzing user activity data, such as click paths and session lengths, engineers
can predict which features users will likely use more and tailor their product development efforts
accordingly. This allows teams to focus on enhancements that will have the most impact on user
satisfaction and engagement.
Capacity Planning
Analyzing historical usage data can help teams predict future resource needs and scale
infrastructure appropriately to meet demand. By understanding traffic patterns, engineers can
forecast peak usage periods and prepare systems to handle higher loads without compromising
performance.
Anomaly Detection
Automated systems can analyze operational data to detect unusual patterns that could signal
potential security breaches, system failures, or fraudulent activity. This proactive approach allows
engineers to address issues before they escalate, minimizing downtime and protecting user data.
Business Intelligence
Beyond performance and system optimization, data analytics can offer broader insights into
business performance. This includes tracking product adoption, analyzing market trends, and
evaluating key business metrics. These insights help guide strategic decisions, such as which new
features to prioritize or how to allocate resources more effectively.
Each of these use cases highlights how data analytics allows teams to make informed decisions,
optimize processes, and improve both software performance and the overall business strategy. The
sections that follow look at how to approach these use cases with the right tools, models, and
techniques. We’ll also explore how AI and machine learning can further enhance data analytics,
helping engineers automate processes and uncover insights faster.
AI tools have been emerging in data analysis, as in many other industries and verticals. Many
enterprise tools require complex sales and onboarding processes just to get started; I've left those
out of the scope of this book, with the goal of steering you toward the most accessible options.
I’ve also found some tools that offer infrastructure-level support for data analysis. While many of
them are valuable, this chapter’s use case is about a business owner who wants to extract
business-worthy insights from a dataset, and such tools are overkill for that purpose.
I ended up with tools that offer self-service onboarding and that have a free tier that allows
readers to test the software. Almost all of these tools contain a chatbot UX that lets you upload a
data file and ask analytical questions about the data. This seems to be the winning UX pattern for
data-analysis use cases.
Evaluation process
I evaluated more than 20 AI tools in the data analysis and business intelligence space in order to
shortlist the ones I highlighted in this chapter. Every tool covered here meets the following criteria:
For this test I’m using a public dataset of one year of online retail transactions from the Machine
Learning Repository at the University of California at Irvine. It contains over 500,000 transactions,
with eight data columns for each transaction:
• InvoiceNo
• StockCode
• Description
• Quantity
• InvoiceDate
• UnitPrice
• CustomerID
• Country
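If you want to reproduce the tests locally, loading the dataset with pandas is straightforward; the file name below is simply whatever you call the file downloaded from the UCI repository.

```python
# Minimal sketch of loading the dataset with pandas. The file name is whatever
# you call the file downloaded from the UCI Machine Learning Repository.
import pandas as pd

df = pd.read_excel("Online Retail.xlsx")
print(df.shape)             # over 500,000 rows, 8 columns
print(df.columns.tolist())  # InvoiceNo, StockCode, Description, Quantity,
                            # InvoiceDate, UnitPrice, CustomerID, Country
```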
Tools comparison
For this comparison test, I’ll act as if I’m the owner of an online retailer and attempt to use the AI
tools to draw business-worthy insights from this data. I’ll ask questions like:
• What are my top-selling products?
• Which customer segments generate the highest lifetime value?
• How much revenue and stock should I forecast and provision for next year?
In this test, I’m importing the raw dataset into each tool, working through the UX to get insights,
and taking note of the results I get, how easy it is to get them, and how the tools compare against
each other.
Julius.ai
Julius AI automates complex data analysis processes and provides interpretations, visualizations,
and predictive analytics. It uses a combination of OpenAI’s GPT-4 and Anthropic’s Claude as its
underlying data-processing models.
Julius’s instant-messaging UI (Figure 5-2) resembles those of ChatGPT and other popular AI tools. I
used it to upload my dataset, then asked my first question, in natural language:
What are the top selling products, both by number of units and by revenue generated?
Julius took just over a minute to run several Python scripts, then returned the tables shown in
Figures 5-3 and 5-4, which list the 10 top-selling products by number of units sold and by revenue
generated.
Figure 5-3. Part of Julius’s reply to the first prompt, showing top-selling products by quantity.
Figure 5-4. Part of Julius’s reply to the first prompt, showing top-selling products by revenue.
Along with the hard data, Julius offered some contextual explanation about high-value products
(Figure 5-5), highlighting some interesting aspects of the data.
Figure 5-5. Some of Julius’s observations about the data in response to the first prompt.
Next, I asked Julius to identify which customer segments were generating the highest value for this
online retail store:
From the same dataset, do customer segmentation and calculate lifetime value of each client
segment. I want to identify the highest value customer segments.
Julius decided to follow the Recency, Frequency, Monetary (RFM) method and ran some Python
scripts again for almost 2 minutes, then presented the customer clusters and their lifetime value
(Figure 5-6).
Figure 5-6. Julius’s reply to the second prompt, with its analysis of high-value customers.
Julius figured that clients who make an average of 5,914 purchases and generate £64k per year are
likely to be resellers, and gave me pointers on how to use that information for my marketing
efforts. This is a unique and very good insight; however, key information is still missing, such as
how large this cluster is and how representative these figures are. Is it an outlier case of an
extremely high-value client, or does it include hundreds or even thousands of clients that I can
market towards to grow the business?
Also, Julius clearly hallucinated on its lifetime value (LTV) calculation: by no means would any client
be worth £13 billion. It’s not clear what went wrong, but my hunch is that Julius used the wrong
field in this calculation.
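If you suspect this kind of error, a quick aggregation over the raw data is enough to sanity-check the figure. Here is a minimal sketch, assuming the cleaned DataFrame df from the earlier cleaning example:

```python
# Minimal sanity check of per-customer revenue, assuming the cleaned DataFrame
# "df" from the earlier cleaning sketch (Quantity, UnitPrice, CustomerID columns).
df["Revenue"] = df["Quantity"] * df["UnitPrice"]

customer_value = (
    df.groupby("CustomerID")["Revenue"]
    .sum()
    .sort_values(ascending=False)
)

# Even the single largest customer is worth hundreds of thousands of pounds
# over the year, not billions, so a £13 billion LTV is clearly a miscalculation.
print(customer_value.head(10))
```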
For my third prompt, I asked Julius to produce a sales forecast:
Assume I do a marketing investment of £500k to grow this customer segment that you’ve
identified as high-LTV. Forecast my total sales volume for the next 12 months, both in units sold
and in revenue. I want to provision inventory based on your forecast. Generate a table for the 20
top selling products in the forecast, with product name and quantity to provision for next year.
In this forecast (Figure 5-7), Julius first made the fair assumption that these marketing efforts
would generate a 20% sales increase. However, note that its total revenue forecast is off: the
current year’s revenue is around £9.7 million, so this £565k forecast would be a massive decrease.
Figure 5-7. Part of Julius’s reply to the third prompt, with sales forecast.
Second, the table it created with the number of items to provision for next year, shown in Figure 5-
8, is also off (probably a cascade from the wrong revenue estimation above). The quantities listed
are way below the current year’s sales volume for those items. So, while Julius’s forecast indicates
a 20% sales increase, its stock provisioning figures suggest a decrease of 60% or more.
Figure 5-8. Part of Julius’s reply to the third prompt, with its stock-provisioning forecast.
Julius did well in the objective data analysis in the first prompt and showed promise in customer
segmentation, where it offered an insightful analysis despite the error in its calculation. However,
its forecasting was far off the mark. While this could perhaps be mitigated with some prompt
engineering, I believe my question was specific enough that I could reasonably expect a better
answer.
Akkio
Akkio provides AI-driven data analysis and predictive modeling aimed at digital-marketing and ad-
targeting clients. Akkio uses its own model, called AD LLM, which it claims to have trained on
advertising-specific data to understand data structures, business requirements, and other context
specific to ad targeting.
Akkio’s polished UI starts with a file upload, prompting users to select from a number of file
formats. Once I uploaded my file, Akkio took some 2 minutes to fully ingest it and make the
product functionality available to me. While the file renders as a spreadsheet table in the UI’s
“Prepare” tab, the product offers several different features (Figure 5-9): Prepare, Explore, Predict,
Deployments, and Reports.
The Explore tab displays an instant-messaging UI similar to the one in Julius. As such, I began with
the same prompt, asking about the top-selling products:
What are the top selling products, both by number of units and by revenue generated?
Akkio took just a few seconds to reply to my prompts. However, the output was quite raw: often
just a data table, with no accompanying text to provide context. For instance, in response to my
first question, it simply returned charts and tables (Figures 5-10 and 5-11) with the top-selling
products by quantity and by revenue generated, respectively.
Figure 5-10. Part of Akkio’s reply to my first prompt, with top-selling products by quantity.
I used the “AI interpretation” widget above each chart, but it didn’t help much. The explanation
was very technical: it simply described, in natural language, the query performed against the data,
with no functional context about what analysis was being made or what the rendered visualization
shows.
Figure 5-11. Part of Akkio’s reply to my first prompt, with top-selling products by revenue.
Akkio’s response to the first question was correct, so I proceeded to the second question:
From the same dataset, do customer segmentation and calculate lifetime value of each client
segment. I want to identify the highest value customer segments.
Akkio’s reply is good (see Figure 5-12). It tells me that the highest-value customer segment is
composed of 2,539 clients, who generated an average of about £3,000 each. However, the
cluster is too large. I’d expect such a cluster to be perhaps 10 to 20% of the total number of
customers, but this one encompasses more than half of the 4,372 unique customers in the whole
dataset.
Figure 5-12. Part of Akkio’s reply to my second prompt, with customer segmentation and value
calculations.
Working with such a broad cluster of customers produces generic recommendations that aren’t
very actionable, as seen in Figure 5-13. These recommendations would apply to any business; none
of them are specific to high-value customers in a way that I can use for marketing purposes.
Figure 5-13. Part of Akkio’s reply to my second prompt, with an explanation about the clients with
the highest lifetime value.
I then moved on to the third prompt:
Assume I do a marketing investment of £500k to grow this customer segment that you’ve
identified as high-LTV. Forecast my total sales volume for the next 12 months, both in units sold
and in revenue. I want to provision inventory based on your forecast. Generate a table for the 20
top selling products in the forecast, with product name and quantity to provision for next year.
Akkio’s reply is so devoid of context that it’s hard to understand the rationale behind the
numbers. The revenue figure of £9.8 million (Figure 5-14) is a slight increase over the current year’s
£9.7 million revenue, which seems too low given the marketing investment I mentioned in my
prompt–but, again, no context is provided for that forecast.
Figure 5-14. Part of Akkio’s reply to my third prompt, with sales forecast.
Also, there seems to be some hallucination in the stock provision forecast (Figure 5-15). These
numbers are way below the current year’s sales for those products, by an order of magnitude. For
example, Akkio forecasts 1,918 sales for the first item in the list, even though the current year’s
sales for that item total 53,847. This, too, suggests some confusion in the calculations, but without
any visibility into the scripts or functional explanations of context, it’s hard to understand the
process that led to those predictions.
Figure 5-15. Part of Akkio’s reply to my third prompt, with product provisioning forecast.
Here, too, using the “AI interpretation” tab (Figure 5-16) doesn’t help much. It provides a technical
description of the query used to feed the chart, but doesn’t explain the broader thought process
that led to that query.
Figure 5-16. Part of Akkio’s reply to my third prompt, with the AI interpretation of the results.
Akkio did well in the objective data analysis with the first prompt, and did some decent customer
segmentation work with the second prompt. However, its forecasting was off. I think the most
underwhelming aspect of using Akkio is how little context it provides for its replies. It comes across
as a tool best used for simple dataset queries and charts that don’t require much analysis or
contextualisation.
ChatGPT
Since the top products that emerged in this category offer a chatbot-type experience, I decided to
add ChatGPT for comparison, as it’s most people’s go-to AI tool. I’ll quickly review its replies for
each of the same three questions.
As seen in Figure 5-17, in reply to my first question, ChatGPT correctly counted the product sales
and summed the revenue.
Figure 5-17. ChatGPT’s reply to my first prompt.
The second question is more nuanced. ChatGPT found a customer segment of 498 clients generating an
LTV of £403,000. That value is higher than the revenue generated by any client in the current year.
However, ChatGPT could be assuming that clients will be purchasing for many years into the future,
which would inflate the LTV calculations.
When I asked ChatGPT for more details about that highest LTV client, its reply (Figure 5-18) was
underwhelming and generic enough to apply to any business, rather than being specific to this
dataset and the underlying business.
Figure 5-18. Part of ChatGPT’s reply to my second prompt, with its explanation for the customer
segmentation.
ChatGPT’s reply to my third question, asking it to forecast revenue and stock provisioning based on
a significant marketing investment, started quite well. ChatGPT estimated that the marketing
investment would produce a sales increase of 20 to 40% in this customer segment.
However, its product stock provisioning forecasts are off (Figure 5-19): again, the sales numbers it
predicts are lower than the current year’s. To me, this suggests that ChatGPT calculated the
number of items to be sold only to this high-value customer segment and “forgot” to include the
items sold to all other clients.
Figure 5-19. Part of ChatGPT’s reply to my third prompt, with product stock provisioning forecast.
ChatGPT did well in the objective data analysis with the first prompt, and offered a good amount of
context and reasoning in response to the other questions, despite some obvious issues with the
calculations. I believe some of these might be mitigated with prompt engineering. It’s also worth
noting that ChatGPT isn’t really a native data-analysis tool: unlike the other tools analyzed here, it
renders clunky tables and has no ability to render charts.
Tool comparison
My first challenge with this comparison is that all three of the tools I analyzed were subject to a
black-box effect. I input a large volume of data, and within seconds these tools output good-
looking tables, charts, and writeups with conclusions and insights that appear to make sense. It
would be easy to assume that these tools are right, given their impressive speed and output.
However, I double-checked the results by running a script on my local machine against the
reference dataset (Figure 5-20; this script is available in the book’s Github repository).
Figure 5-20. Console log of my local tests to double-check the tools’ calculations and reference
figures for items sold and revenue generated.
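The verification script itself is in the repository; the sketch below shows an equivalent check in pandas, in the same spirit but not the exact repo code.

```python
# Minimal sketch of a local double-check of the tools' answers. This mirrors the
# spirit of my verification script, not its exact code (which is in the repo).
import pandas as pd

df = pd.read_excel("Online Retail.xlsx")
df = df[~df["InvoiceNo"].astype(str).str.startswith("C")]  # drop cancellations
df = df[df["Quantity"] > 0]
df["Revenue"] = df["Quantity"] * df["UnitPrice"]

top_by_units = df.groupby("Description")["Quantity"].sum().nlargest(10)
top_by_revenue = df.groupby("Description")["Revenue"].sum().nlargest(10)

print("Top 10 products by units sold:\n", top_by_units)
print("\nTop 10 products by revenue:\n", top_by_revenue)
```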
Comparing the tools’ results against my local tests, I first observed that all tools missed the product
with the most units sold (“Small Popcorn Holder”). I dug a bit into this quirk, but couldn’t figure out
why. I can speculate that, since this item has a very low unit price, perhaps a rounding-to-zero type
error could have caused it to be missed.
Besides that, all tools performed quite similarly, both in the value they provided and in their
pitfalls. From a UX perspective, Akkio is the one that stands apart. It sets a higher expectation
by offering what seems like a very robust process with multiple steps and tools. However, it ends
up standing out negatively, because the level of contextualisation it provides for each interaction is
way below what the other tools offer.
Julius and ChatGPT are simpler chatbot experiences that take longer to reply, but offer insights into
what’s happening and how my data is being processed. Both of those tools include text in their
replies alongside the tables and charts, to provide context and reasoning for their operations and to
show users how to read the data and interpret the results.
If I were to choose one of these tools, I’d select Julius. While its UX is very similar to ChatGPT’s, and
even the underlying model is in part the same (GPT-4, as I write this in mid-2024), its data-analysis
capabilities, such as rendering charts in the chat conversation, are not available in ChatGPT.
I rated all three tools between 5 and 7 (Table 5-1), given these shortcomings. I expect these tools
to evolve a lot in the coming years, but my opinion is that they are not yet reliable enough that you
can simply drop a large volume of data, ask questions, and trust the results. If you use them, I
recommend running scripts locally to double-check the numbers. (It’s OK if your scripts are
generated by AI tools, since you can review and modify the code and have full visibility and control
over the data analysis, as you saw in Chapters 2 and 3.)
Conclusion
After more than 15 years working with software-development and data-science teams, I can
confidently say that AI tools have the potential to become game-changers in how we handle data
analysis and business intelligence. Their ability to clean and analyze massive datasets in seconds,
rather than days, will transform what’s possible for businesses of all sizes.
Furthermore, from my experience working with a wide range of business stakeholders, from early-
stage startup founders to business teams at Fortune 500 companies, I can easily imagine these AI
tools empowering non-technical stakeholders to extract insights from their data. The effects of that
empowerment could be immense. In some cases it might mean skipping costly data-engineering
projects; in other cases it just makes those projects faster and less expensive.
With that, here’s my word of caution: The tools are not there yet. While the results can be very
impressive on the surface, they come with significant flaws, calculation errors, and generic
explanations. A distracted user might be easily fooled by the instant reward of good-looking charts
and insights, but overlooking such shortcomings can result in serious negative consequences.
Business stakeholders could make decisions that reduce the value of their business; data analysts
who delegate their work to these tools might end up performing poorly in their jobs.
These tools are already powerful and useful. But they have limitations, and the “black box” effect
can make it really hard to identify those limitations. Always be specific in your prompts, and always
double-check the results by doing manual analysis or running local scripts. I always tell my teams
to treat AI-generated insights like advice from a colleague: while it’s valuable input, always validate
it and do your own critical thinking before making any big decisions.
Chapter 6. Documentation and Technical Writing
With Early Release ebooks, you get books in their earliest form—the author’s raw and unedited
content as they write—so you can take advantage of these technologies long before the official
release of these titles.
This will be the 7th chapter of the final book. Please note that the GitHub repo will be made active
later on.
If you have comments about how we might improve the content and/or examples in this book, or if
you notice missing material within this chapter, please reach out to the editor
at sgrey@oreilly.com.
Documentation is vital for clarity, consistency, and knowledge transfer in software development. It
ensures that team members understand the code when onboarding and reduces the learning
curve during day-to-day work, leaving less room for lost context and consequent errors and
refactorings.
Outside the company, documenting how to use a software product can help sales and marketing
efforts, prevent difficulties during customer onboarding, and foster user engagement with the
product. Writing features and workflows down for external stakeholders is also a great starting
point for collecting their feedback on how to improve the product.
Despite its importance, documentation often doesn’t get written at all. Software engineers don’t
usually enjoy writing for humans, so they often skip it if they can. But they are almost always under
deadline pressure, and when they have to make compromises, documentation is often one of the
things left behind. Even when it does get written, heavy workloads and time pressure often lead to
rushed or incomplete content. Writing high-quality documentation takes time.
Additional challenges include finding the right level of detail and keeping documentation up to
date as systems evolve.
AI tools were helping generate written content for many years before the recent LLM-driven AI
wave. Writing tools such as Grammarly, which helps find the correct words and fix mistakes, are
especially helpful for those writing in a foreign language. In software development, tools such as
Swagger and Javadoc have long automated the generation of API documentation in tandem with
code updates.
The tools I review in this chapter were launched more recently, mostly since the generative AI
wave started in 2022, and all aim to extend the simplicity of generating documentation from code
beyond simple modules (like APIs) and helpers (like Grammarly). Some aim to be competent
enough to replace the need for human action in writing documentation.
Types of Documentation
There are four key types of documentation commonly found in software development:
Technical documentation
When business stakeholders define a new product or feature to be developed in order to fulfill
some business objective, they write feature specifications to let software engineers know what
functionalities to implement. The engineers’ role is to extend those specifications with technical
system designs, architectural decisions, and workflows that document not just what was
implemented, but also how it was implemented. This type of documentation is vital for
maintaining and evolving software projects over time, especially when the original engineers are
no longer around.
User guides
These documents help nontechnical users understand how to use the software. They include
everything from installation instructions to troubleshooting tips. They’re useful during the sales
process as a support material for sales and marketing colleagues, and as customers use the
product. The challenge here lies in writing documentation for users who don’t have a technical
background.
API/SDK documentation
A critical resource for developers, documentation of APIs and software development kits (SDKs)
provides clear, structured details about the functions, methods, and services available within a
software system. These interfaces serve as a bridge between different software components, and
documenting them ensures that developers can integrate and use the system efficiently.
Evaluation Process
In this chapter, I evaluated more than 20 AI tools in the documentation and technical writing space
in order to shortlist the four highlighted below. Every tool covered here meets the
following criteria:
For this test, I created a very simple authentication flow, with both frontend and backend. The full
code, which is available in this book’s Github repository, contains flows for signup, login, and
logout. I’ve used the AI tools in this chapter to document my code. My main point of comparison is
whether the documentation produced can be useful for any of the four documentation use cases
explained above.
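The actual code lives in the book's repository; to give a sense of its scope, here is an illustrative sketch of what a minimal backend for signup, login, and logout might look like. This is a hypothetical Flask example, not the repository code, whose stack may differ.

```python
# Illustrative sketch of a minimal signup/login/logout backend. This is NOT the
# code from the book's repository; it only conveys the scope of the application
# the documentation tools were asked to cover.
from flask import Flask, jsonify, request, session
from werkzeug.security import check_password_hash, generate_password_hash

app = Flask(__name__)
app.secret_key = "change-me"   # placeholder secret for the sketch
users = {}                     # in-memory store: email -> password hash

@app.post("/signup")
def signup():
    data = request.get_json()
    users[data["email"]] = generate_password_hash(data["password"])
    return jsonify({"status": "created"}), 201

@app.post("/login")
def login():
    data = request.get_json()
    stored = users.get(data["email"])
    if stored and check_password_hash(stored, data["password"]):
        session["user"] = data["email"]
        return jsonify({"status": "logged_in"})
    return jsonify({"status": "invalid_credentials"}), 401

@app.post("/logout")
def logout():
    session.pop("user", None)
    return jsonify({"status": "logged_out"})
```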
Again, for this test I preferred tools that can be used with a simple signup and free trial, so I stayed
away from enterprise tools.
Tools Comparison
The full documentation generated for each test can be found in the book’s Github repository.
Swimm
I think this flow is comparable to the automated code reviews in Chapter 3. I can see how
embedding these tools into a repo can provide seamless integration into existing software
development processes.
While Swimm can be blended into the repo and create or update documentation upon each PR, for
the sake of this comparison test, I haven’t used that exact flow. I’ve simply used Swimm’s browser-
based UI, which allows me to connect the repo, select specific files to be documented, and prompt
for what to include in the documentation, as shown in Figure 6-1.
Figure 6-1. Swimm’s widget to create a piece of documentation.
In this flow, I’ve asked Swimm to document the backend part of my authentication flow with a
simple prompt:
The desired output is a document that can be used for internal visibility on ongoing initiatives and
for onboarding future team members. You can see a sample of the result in Figure 6-2.
Figure 6-2. Sample of Swimm’s output for the backend documentation.
This output is quite good. I like the structure of the document as well as its content. However, my
authentication flow is probably too simple to showcase Swimm’s full potential. So I tested a second
case for a more complex document:
Describe the frontend code and create a test plan for each flow.
The result was again very good. It generated a full document (whose table of contents can be
found in Figure 6-3), including a high-level introduction and then a deep dive into specific code
components that impact the flow and thus should be documented.
Figure 6-3. Table of contents of the document generated by Swimm for the frontend code
The last section of the document, as I asked, identifies the main flows of my code and provides test
plans for each. The actual test plans are quite simplistic, but that’s probably a byproduct of the
simplicity of the underlying flow, as copied below:
2. Click the logout button and ensure the login form is shown again.
Swimm did well in this test. It was easy to get started with this tool, and it generated relevant
documentation for my requests in correct Markdown format, which is the standard in technical
documentation. However, I found it quite limiting that Swimm can only document one file of code
at a time. This produces very fragmented pieces of documentation that are closer to a README file
than to higher-level codebase and flow documentation.
I see working with a larger scope of source material as a natural evolution for Swimm, which could
leverage its superior integration flow to create documentation for the whole code base, or at least
groups of files. It could work horizontally, documenting the structure of frontend code by using all
frontend files as the object of a document, or vertically, documenting a feature flow by using all
files related to that feature.
As such, I’m rating Swimm a 6/10. While the UX is good, the output is still far from the quality of
documentation that I would accept in my teams.
ChatGPT
ChatGPT is most software engineers’ go-to LLM tool for creating documentation, so I’m including it
in this chapter, specifically the GPT-4o model, the most advanced available at the time of writing
(late 2024).
I started by prompting ChatGPT to generate the documentation for my code. I included in the
prompt all six code files, a screenshot of the repository structure (so it understands the
relationships between the code files), and instructions for what the documentation should include,
as seen in Figure 6-4.
You can ask ChatGPT to output the documentation directly into a Markdown file. I committed the
final documentation generated by ChatGPT (as well as by the other tools in this chapter) into the
book’s Github repository.
As expected, ChatGPT performs very well in this limited-scope test. It will work with up to 20 files
at a time, and the file size limit varies by file type. While that’s totally okay for small projects like
my authentication application, it is insufficient for most production-level applications. On top of
those limits, ChatGPT also offers an inconvenient UI, compared to tools that connect to the
repository. The need to upload files manually and give ChatGPT contextual information about their
structure and relationships makes it more challenging to use, especially in large projects.
As such, I’m rating ChatGPT a 7/10 for this use case. The quality of the documentation is very good,
with the caveat of the limits and inconvenient UI. It would take a software engineer some creativity
to document an application in clusters (by functionality, part of the stack, or module)
within that limit of 20 files per piece of documentation.
Cursor
Cursor is a relatively new player in the AI coding tool space. It was launched in 2023 and has
captured massive market share in the specific use case of IDEs with AI code-assistance capabilities,
which has been led by Github Copilot. As of August 2024, Cursor had 40,000 customers.
Cursor’s product is an AI-native IDE that started as a fork from the popular Visual Studio Code. It
allows software engineers to select which LLM model should power the tool; I’ve used Anthropic’s
Claude 3.5 Sonnet. As an actual IDE, Cursor has visibility into all code files in my repository,
regardless of their number or size. You enter prompts through a chat feature, as seen in Figure 6-7.
The document Cursor generated was good, with sections for the expected main components, as
seen in the table of contents in Figure 6-8.
Figure 6-8. Table of contents of the documentation generated by Cursor
Despite the very comprehensive outline and the relevancy of its content, Cursor has a significant
pitfall when it comes to generating Markdown documents. For some reason (perhaps a bug), the
generated content is only partially formatted as a Markdown file. It outputs some sections as raw
text, such as the snippet in Figure 6-9. This makes it much harder to read.
Despite these formatting issues, the documentation generated is extensive, covers the right topics,
and has a correct level of technical depth. It’s definitely in line with what I would consider
acceptable documentation in my teams. As such, I rate Cursor 8/10.
Scribe
Scribe is quite a different tool from the others reviewed in this chapter. This tool is best suited for
creating user guides, standard operating procedures (SOPs), or bug reports in a visual way. While
my use of Swimm, ChatGPT, and Cursor focused very much on creating written documentation
about the technical implementation of a certain product or functionality, I used Scribe to produce a
guide about the product’s functionality.
While Scribe was created in 2019 as a basic screen capture tool, the functionality I used for this
test, called Scribe AI, was only launched in 2023. It leverages the original screen-recording
functionality, but instead of simply storing the video of the recording, it also creates a whole
annotated workflow based on the screen transitions in the recording.
That’s why it caters to UI-related use cases, like bug reports and product guides.
To start the test, I installed Scribe’s Chrome extension, and used it to record a simple session of
myself registering a new account and logging into that account. My goal was that Scribe would
generate a user guide that I could share with external nontechnical stakeholders, like users of the
product.
The experience of recording my first session was quite seamless; I got the recording I needed easily
on my first try. The output is called a Scribe, Scribe’s name for the video recording and the annotated
workflow that’s generated, and it’s available at this public link. I’d say this output is good, since it
identifies the screen transitions in my workflow and captures a screenshot of each screen, highlighting
the action the user took on the screen to cause the transition. The result is in line with user
shadowing tools like Hotjar or FullStory, which are commonly used for user research and bug
tracking.
Scribe offers a feature that converts the raw HTML document in the public link above into an AI-
generated document. I used this flow here: it allows the user to write a prompt describing the
documentation to be generated from the captured screen recording. My instructions were
simple, as shown in Figure 6-10.
Figure 6-10. Instructions to Scribe to generate document from raw tracking of website actions
The resulting document is publicly available here. I found this output underwhelming. It’s generic,
and it feels like it could have been written about any application, not specifically about mine. It
generated a document and embedded Scribes (specific flows) into it, rather than generating a
document based on the flow I recorded, which was my intention. This makes me infer that the tool
might be a better fit for larger pieces of documentation that merge several different Scribes
into one larger document (for example, a product guide). The content of the document it
generated is not very relevant to this use case. As such, I’m rating Scribe a 5/10.
Tool comparison
Conclusion
As a CTO for over a decade, I’ve found that documentation is one of those things that’s always
lacking, but never to the point where it’s worth pausing ongoing work to fix it. In fact, bad
documentation is a form of technical debt, but one that doesn’t break systems or degrade
performance. It does degrade the team’s performance, however, which is a less visible and perhaps
more damaging form of debt in a software development team.
I’ve always found it hard to push software engineers in my teams to write documentation in the
first place, and even harder to keep that documentation organized, accessible, and updated. I think
that AI tools like the ones I reviewed in this chapter can play a fundamental role in making that
process easier. With a simple prompt, they can generate documentation within seconds. It would
take a human at least an hour or two to generate a similar document. And that time commitment
compounds with complexity: the larger a system is, the more challenging and time-consuming it is
to document it properly and keep that documentation up to date. In a team of a few dozen
people, that effort could easily come to thousands of collective hours a year.
While AI tools can create documentation instantly, they can also create bad documentation (just
like humans can). I recommend that teams take the same approach to documentation as to setting
coding guidelines: create a template for prompts or even for documents, with pre-defined sections
and subsections. This serves as a backstop to avoid unnecessarily long documents, and facilitates
readability by making content easier to find.
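As an illustration of that idea, here is one possible prompt template with pre-defined sections, kept as a Python constant so it can be versioned alongside the code. The section names and limits are my own assumptions; adapt them to your team's needs.

```python
# Illustrative documentation prompt template with pre-defined sections. The
# section names and limits are assumptions; adapt them to your team's standards.
DOC_PROMPT_TEMPLATE = """
Document the attached code. Use exactly these sections:
1. Overview - what the module does and why it exists (max 5 sentences)
2. Architecture - main components and how they interact
3. Key flows - step-by-step description of each user-facing flow
4. Setup - how to run the code locally
5. Open questions - anything unclear that a human should review
Keep the document under 800 words and format it as Markdown.
"""
```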
With all that said, documentation created by AI tools must always be thoroughly reviewed and
edited by team members. While it takes seconds to produce 90% of the deliverable, the final
revisions and quality control must be performed by human beings, since the output does not
always fulfill the objective. Consider the case of Scribe, where the generated document was generic;
a human reviewer would have caught that flaw and improved the documentation manually.
Chapter 7. Chatbots and Virtual Assistants
With Early Release ebooks, you get books in their earliest form—the author’s raw and unedited
content as they write—so you can take advantage of these technologies long before the official
release of these titles.
This will be the 8th chapter of the final book. Please note that the GitHub repo will be made active
later on.
If you have comments about how we might improve the content and/or examples in this book, or if
you notice missing material within this chapter, please reach out to the editor
at sgrey@oreilly.com.
Chatbots have been a staple of digital customer service and automation for over a decade. Initially
built as simple rule-based programs, traditional chatbots followed predefined logic trees to handle
repetitive tasks, such as answering frequently asked questions, providing automated responses in
customer support, or collecting user information in structured workflows. We may say these
chatbots weren’t very powerful, compared to today’s LLM-based bots, but they became a very
popular way to automate high-volume, low-complexity interactions. However, their technical
implementations struggled with unstructured conversations, ambiguity, or user inputs that
deviated from expected patterns. The experience usually felt rigid and unnatural–people clearly
understood they were talking to a machine, and those user interactions would frequently end with
frustrating statements like, “I don’t understand. Please try again.”
Now that people are used to chatting with LLM-based tools like ChatGPT, the rule-based approach
to chatbots seems like a distant memory. Today’s “AI-powered” chatbots no longer rely solely on
predefined scripts and rules. Instead, they leverage sophisticated natural language processing
(NLP), contextual understanding, and generative AI to deliver dynamic, human-like interactions.
These advancements have drastically expanded the role of chatbots to include tasks like:
This transformation means that software engineers now have a broad spectrum of tools and
frameworks available to help us build intelligent chatbots, ranging from no-code workflow builders
to fully customizable AI-powered assistants that leverage agentic reasoning and training data.
Whether you want to build a simple automated helpdesk bot or a sophisticated AI agent capable of
complex decision-making, the technology is now more accessible than ever.
No-code AI chatbots
These tools promise that users without any programming knowledge will be able to deploy AI-
powered assistants. They usually have easy-to-use interfaces where users can upload datasets,
define workflows, and select the underlying AI models. While these tools are promoted for a
nontechnical audience, I’m including them in the book because, in my experience, certain technical
knowledge is often required to integrate these tools with other parts of a company’s workflow,
and that integration work often ends up in software engineers’ scope.
These no-code chatbots are a natural evolution of the rule-based chatbots mentioned above and
cater to the same use cases: customer support, e-commerce, and FAQ automation.
Low-code chatbot builders
These tools aim to bridge the gap between simplicity and flexibility to adapt to the needs of
different clients. They are usually easy to set up and allow users to build conversation flows visually
by connecting boxes with predefined functionalities, while integrating AI-powered features for
dynamic interactions. Developers can automate workflows and use more complex logic in the
backend for the chatbot.
These chatbot builders cater to use cases such as customer service, lead generation, and internal
workflow automation.
Code-based AI frameworks
These are usually SDKs, APIs, or open source repositories that allow software engineers to fully
control the chatbot’s behavior, which they’ll then use as boilerplate to implement customised
chatbot experiences. They tend to have higher standards for training data. Use cases include
agentic reasoning, integration with other tools and workflows, and even data security (since these
chatbots can be self-hosted on a company’s cloud infrastructure).
Evaluation Process
To compare chatbot-building tools, I built a chatbot that interacts with this Online Retail
Dataset, created by Lucas Soares, a fellow O’Reilly author in the field of Machine Learning. It
consists of transactional data from an e-commerce shop, which I’ve used to test chatbots’
capabilities to retrieve factual data and answer complex user queries.
Each chatbot was tested using a set of predefined queries, evaluating its:
I tested multiple tools, but have included one from each of the categories outlined in the previous
section.
Tools Comparison
Chatbase
Chatbase was launched in 2022 as a no-code AI chatbot builder that allows users to create and
train chatbots by uploading documents or datasets. It gained popularity due to its simple interface
and integration with OpenAI’s GPT models, which made it an accessible solution for businesses
looking to automate their support and customer interactions without requiring technical expertise.
While the product is still relatively young, Chatbase has seen adoption among small businesses,
solopreneurs, and startups, particularly for handling customer support and FAQ automation. The
tool offers a free-tier option and a paid plan for scaling your chatbot’s capabilities.
For this test, I created a free account on the Chatbase website and looked for a way to create my
chatbot. I quickly found the section to add the training data: my spreadsheet with the ecommerce
product catalog. In a matter of two or three minutes, I was interacting with my chatbot.
Figure 7-1. Chatbase’s Sources UI, where users can add training data.
Once the chatbot was ready, I started asking questions, as if I were a real client:
I see there are some Android phones in your catalog. Compare the features and prices, and tell me
the best budget Android phones?
Figure 7-2. First question to my Chatbase chatbot (truncated to the visible area)
It did well in the first question, fetching correct results and displaying them in a well-formatted
response.
In the second question, I purposefully asked about a product that doesn’t exist:
I’m looking for a laptop below $10. What’s the best option?
This is a common source of hallucinations, and once again, my Chatbase bot did well, stating the
fact that no product matches my query.
Figure 7-3. Second question to my Chatbase chatbot
Figure 7-4. Third question to my Chatbase chatbot
For the third question, I asked something more complex, which not only inquired about specific
products in the catalog, but also a comparison between their attributes and ultimately a
recommendation:
You have mobile phones and digital cameras. I’m interested in photography, should I buy the Sony
Xperia XA2 Ultra or the Pentax K-1 camera? Why?
Again, my Chatbase bot did well, stating the attributes of each option and making
recommendations about when to buy one or the other option.
Chatbase also makes it very easy to publish a chatbot using its Connect section. With a couple of
clicks, I get some code to embed my chatbot in a website via a chat bubble or iframe. I was
impressed by how easy it was to upload training data, test the bot, and publish it.
Chatbase did very well in the test, and it’s probably one of the easiest ways to integrate a chatbot
in a website, if you have specific training data and make some changes to the prompt to get the
desired functionality, tone, and so on. As such, I rate it a 9/10. The reason I did not rate it 10/10 is
that the end product’s behavior closely resembles the default behavior of the underlying
LLM (gpt-4o-mini, in this test). In other words, if a user were to upload the same training data to
ChatGPT and ask these questions, the answers would have been very similar. Still, the ease of
doing all this and publishing the bot with a few clicks is impressive.
Botpress
To test Botpress, I followed the same playbook as for Chatbase. I created a free account on the
Botpress website and went through its widget to build an agent (that’s what they call a chatbot). It
took me some time to figure out how to add the training data, since I first needed to create a table
(their name for training data in CSV format) on a separate page, then add that as a knowledge base
to my bot.
Figure 7-5. Botpress’s Agent Builder UI, where users can add training data and configure their
workflow.
Botpress offers more options beyond a simple chatbot feature. Users can add different agents and
tools, and compose a workflow with them that serves as the backend reasoning under the bot’s
hood. It’s more powerful than Chatbase, but getting a simple bot up and running took me 30 to 40
minutes of trial and error.
Figure 7-6. Botpress’s workflow builder. I selected a very simple flow for my test.
With the chatbot finally ready, I started interacting with it and asking the exact same questions I
asked the Chatbase bot.
Figure 7-7. First question to my Botpress chatbot (truncated to the visible area).
For the first question, the bot correctly fetched a couple of Android phones, but not budget
phones, as I’d asked: the prices started at $500. It also provided only two options, which was quite
limiting. I’d be a bit frustrated with this reply if I were a real client.
Figure 7-8. Second question to my Botpress chatbot.
Again, in the second question I purposefully asked about a product that doesn’t exist, and my
Botpress bot did well, correctly stating that no product matched my query.
Figure 7-9. Third question to my Botpress chatbot
Third, I once again asked the complex question about comparing the phone with the camera. My
Botpress bot did well, stating the attributes of each option and when it recommended buying one
or the other. The format wasn’t great, however, since each product’s attributes were packed into a
single bullet point.
Botpress also features a one-click Publish feature that creates a standalone URL for your chatbot,
and simple guidelines for the most popular use cases (implementing chat bubbles, embedding into
a website, and so on).
Botpress did well in the test. The lack of polish in some answers might be due to it using an older
model by default on the date of this test (gpt-4o-2024-08-06). It has much more comprehensive
workflow features than Chatbase does. However, the UI is quite challenging for a new user to
navigate. I rate it an 8/10.
Langchain
Code-based framework LangChain emerged in 2022 in response to the increasing demand for
developer-friendly frameworks to build AI-driven applications. It’s designed to integrate with
popular LLMs like GPT-4, Claude, and open-source alternatives, and to allow software engineers to
create custom AI-powered assistants with advanced reasoning capabilities, memory, and dynamic
API integrations. The framework has gained rapid adoption among AI engineers and startups
looking to build agent-based systems that go beyond basic chatbot interactions, and has become
the go-to framework for LLM-powered applications and AI agents.
For this test, I coded my own chatbot using the Langchain framework. It’s a very well-documented
framework, so it was rather easy to get started. It took me about 30 to 40 minutes to get from an
empty repository to the chatbot shown in the figures below (with help from Cursor for boilerplate,
the verbose chunks of the code, and some formatting adjustments at the end). The resulting code is
available in the book’s Github repository. I ran it on my local machine, and used the model gpt-4o-
mini for this test.
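The full implementation is in the book's repository; the sketch below is a simplified, illustrative version of the same idea, not the exact repo code. The file name and prompt wording are my own assumptions.

```python
# Simplified, illustrative version of a Langchain catalog chatbot; the exact
# code, file name, and prompt wording in the book's repository differ.
import pandas as pd
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

catalog = pd.read_csv("online_retail_catalog.csv")  # hypothetical file name

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are a shopping assistant. Answer only from the product catalog below. "
     "If you don't know the answer, just say you don't know.\n\n{catalog}"),
    ("human", "{question}"),
])
chain = prompt | llm  # LangChain Expression Language: prompt feeds the model

while True:
    question = input("You: ")
    if question.lower() in {"quit", "exit"}:
        break
    reply = chain.invoke({
        "catalog": catalog.to_csv(index=False),
        "question": question,
    })
    print("Bot:", reply.content)
```

For a catalog this small, stuffing the whole CSV into the prompt context is enough; a larger dataset would call for a retrieval step or a LangGraph agent.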
Figure 7-10. The beginning and end of my first question and the response from the chatbot I coded
with Langchain (the verbose middle section is not displayed)
For the first question, the bot correctly fetched three Android phones, suggested one that’s actually
a good budget phone, and explained its reasoning, which was good as well.
Figure 7-11. Second question to my Langchain chatbot.
For the second question, about the product that doesn’t exist, the bot indeed realised that no such
product existed. The reply “I don’t know” results from a literal interpretation of the section of the
prompt that says, “If you don’t know the answer, just say you don’t know.” This is quirky, but
actually correct.
Figure 7-12. …
Figure 7-13. Beginning and end of my third question and my Langchain chatbot’s response
(verbose middle section not displayed).
On the complex question comparing the phone with the camera, my Langchain chatbot did well,
stating the attributes of each option and the tradeoffs of choosing one or the other. Its answer is
very much in line with those of the bots I created with the other tools, which illustrates how much
of the output can be attributed to the underlying models.
Langchain doesn’t offer a one-click publish option. I ran this code on my local machine, but to
make the bot publicly accessible, I would need to deploy it on a cloud server. Also, Langchain offers
no off-the-shelf guidelines for integrating my chatbot as a chat bubble or embedding it in other
people’s websites; that would need to be developed as desired.
I’ve implemented several real-world chatbot solutions with Langchain, and I like it a lot. It allows
you to build simple chatbots or more complex ones, as well as multiagent workflows using
LangGraph. It’s a powerful solution, but it certainly requires software development skills. It’s also
very well documented, with a great developer community. With assistance from other AI tools, like
Cursor and Github Copilot, these implementations take a fraction of the time they’d have taken
just a few years ago. As such, I rate Langchain a 10/10.
Tool Comparison
All of the tools in this test performed very well, which is why they’re included here. However,
Langchain will continue being my go-to tool for building any kind of chatbot.
Conclusion
As a CTO, I’ve built many chatbots over the years, mostly before the recent generative AI wave,
with functionality limited to predefined logic trees and very niche use cases.
The tests for this chapter highlight that it’s now possible to build a highly capable chatbot with
minimal effort. These chatbots are capable not only in terms of being trained on specific data, but
also in their tone, range of inference, and ability to follow the specifics of the desired use case. It
should be no surprise that thousands of businesses are integrating these capabilities into their
products. It’s also curious to observe the massive shift in UI patterns, away from the traditional list-
>search->detail approach and towards chatbot UIs that allow the user to “talk with the data” in a
less structured way.
It’s worth noting that this shift towards chatbot UI creates room for hallucinations, partial
responses, and other well-known quirks of LLM technology. Indeed, my main takeaway from
writing this chapter is that the outputs from all the tools in this test were highly dependent on the
underlying LLMs they used–for better and for worse. It’s great that, with minimal action, a user can
set up an impressive chatbot that fetches information, structures it correctly, and draws complex
conclusions. But the chatbot’s output is still very much subject to the limitations of the underlying
LLM, such as hallucinations.
About the Author
Sergio Pereira has been a Software Engineer and CTO for 13 years. Sergio has built products for
several fast-growing startups, such as Bulk MRO, StudentFinance, and FutureFuel. For the last 7 years
Sergio has built innovative software products for the Fintech industry, meeting strict
compliance requirements.
As part of his work, Sergio is responsible for crafting the software development processes that
allow his teams to deliver high-quality software in a timely manner for his clients. As such, Sergio
was an early adopter of ChatGPT, and has been a thought leader on the topic, creating internal
documentation for his teams and sharing most of it publicly for the benefit of the community.
Sergio is a public speaker on the topics of technology, startups and remote work.