Hey Robot! Build Your Own AI Companion
Chat and command your own embedded-AI companion bot using local LLMs
Imagine a fully autonomous robotic companion, like Baymax from Disney’s Big Hero 6 —
a friendly, huggable mechanical being that can walk, hold lifelike interactive
conversations, and, when necessary, fight crime. Thanks to the advent of large language
models (LLMs), we’re closer to this science
fiction dream becoming a reality — at least for
lifelike conversations.
LLMs are based on the neural network architecture known as a transformer. Like all
neural networks, transformers have a series of tunable weights for each node to help
perform the mathematical calculations required to achieve their desired task. A weight in
this case is just a number — think of it like a dial in a robot’s brain that can be turned to
increase or decrease the importance of some piece of information. In addition to
weights, transformers have other types of tunable dials, known as parameters, that help
convert words and phrases into numbers as well as determine how much focus should
be given to a particular piece of information.
Instead of humans manually tuning these dials, imagine if the robot could tune them
itself. That is the magic of machine learning: training algorithms adjust the values of the
parameters (dials) automatically based on some goal set by humans. These training
algorithms are just steps that a computer can easily follow to calculate the parameters.
Humans set a goal and provide training data with correct answers to the training
algorithms. The AI looks at the training data and guesses an answer. The training
algorithm determines how far off the AI’s result is from the correct answer and updates
the parameters in the AI to make it better next time. Rinse and repeat until the AI
performs at some acceptable level.
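To make that rinse-and-repeat loop concrete, here is a toy Python example (not from any real LLM, just one tunable dial) that learns to double a number:

# Toy training loop: learn a single dial w so that w * x approximates 2 * x
training_data = [(1, 2), (2, 4), (3, 6), (4, 8)]  # inputs with correct answers

w = 0.0             # the dial (parameter), starting at a bad guess
learning_rate = 0.01

for step in range(1000):
    for x, correct in training_data:
        guess = w * x                   # the AI guesses an answer
        error = guess - correct         # how far off is it?
        w -= learning_rate * error * x  # nudge the dial to do better next time

print(f"Learned w = {w:.3f}")  # ends up very close to 2.0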
To give you an idea of complexity, a machine learning model that can read only the
handwritten digits 0 through 9 with about 99% accuracy requires around 500,000
parameters. Comprehending and generating text are vastly more complicated. LLMs are
trained on large quantities of human-supplied text, such as books, articles, and websites.
The main goal of LLMs is to predict the next word in a sequence given a long string of
previous words. As a result, the AI must understand the context and meaning of the text.
To achieve this, LLMs are made up of massive numbers of parameters. ChatGPT-4, released in March 2023, is reportedly built from eight separate models, each containing around 220 billion parameters — about 1.7 trillion total.
However, running an LLM locally on a personal computer might be enticing for a few
reasons:
Maybe you require access to your AI in areas with limited internet access, such as
remote islands, tropical rainforests, underwater, underground caves, and most
technology conferences!
By running the LLM locally, you can also reduce network latency — the time it takes for packets to travel to and from servers. That said, for complex tasks like running an LLM, the extra computing power of cloud servers often makes up for that latency.
Additionally, you can assume greater privacy and security for your data, which
includes the prompts, responses, and model itself, as it does not need to leave the
privacy of your computer or local network. If you’re an AI researcher developing the
next great LLM, you can better protect your intellectual property by not exposing it to
the outside.
Personal computers and home network servers are often smaller than their corporate
counterparts used to run commercial LLMs. While this might limit the size and
complexity of your LLM, it often means reduced costs for such operations.
Finally, most commercial LLMs contain a variety of guardrails and limits to prevent
misuse. If you need an LLM to operate outside of commercial limits — say, to inject
your own biases to help with a particular task, such as creative writing — then a local
LLM might be your only option.
Thanks to these benefits, local LLMs can be found in a variety of settings, including
healthcare and financial systems to protect user data, industrial systems in remote
locations, and some autonomous vehicles to interact with the driver without the need for
an internet connection. While these commercial applications are compelling, we should
focus on the real reason for running a local LLM: building an adorable companion bot
that we can talk to.
Introducing Digit
Jorvon Moss’s robotic designs have improved and evolved since his debut with Dexter
(Make: Volume 73), but his vision remains constant: create a fully functioning companion
robot that can walk and talk. In fact, he often cites Baymax as his goal for functionality. In
recent years, Moss has drawn upon insects and arachnids for design inspiration. “I
personally love bugs,” he says. “I think they have the coolest design in nature.”
Digit’s body consists of a white segmented exoskeleton, similar to a pill bug’s, that
protects the sensitive electronics. The head holds an LED array that can express
emotions through a single, animated “eye” along with a set of light-up antennae and
controllable mandibles. It sits on top of a long neck that can be swept to either side
thanks to a servomotor. Digit’s legs cannot move on their own but can be positioned
manually.
Like other companion bots, Digit can be perched on Moss’s shoulder to catch a ride. A
series of magnets on Digit’s body and feet help keep it in place.
Courtesy of Jorvon Moss
But Digit is unique from Moss’s other companion bots thanks to its advanced brain — an
LLM running locally on an Nvidia Jetson Orin Nano embedded computer. Digit is capable
of understanding human speech (English for now), generating a text response, and
speaking that response aloud — without the need for an internet connection. To help
maintain Digit’s relatively small size and weight, the embedded Jetson Orin Nano was
mounted on a wooden slab along with an LCD for startup and debugging. Moss totes
both the Orin Nano and the appropriate battery in a backpack. You could design your own
companion bot differently to house the Orin Nano inside.
Courtesy of DigiKey
The client, called hopper-chat, controls everything. It continuously listens for human
speech from a microphone and converts everything it hears using the Alpha Cephei
Vosk speech-to-text (STT) library. Any phrases it hears are compared to a list of wake
words/phrases, similar to how you might say “Alexa” or “Hey, Siri” to get your smart
speaker to start listening. For Digit, the wake phrase is, unsurprisingly, “Hey, Digit.” Upon
hearing that phrase, any new utterances are converted to text using the same Vosk
system.
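The full implementation lives in the hopper-chat repository; as a rough sketch of the pattern (using the Vosk and sounddevice Python packages, not Moss's exact code), wake-phrase listening looks something like this:

import json
import queue

import sounddevice as sd
from vosk import Model, KaldiRecognizer

WAKE_PHRASE = "hey digit"
SAMPLE_RATE = 16000
audio_queue = queue.Queue()

def audio_callback(indata, frames, time, status):
    # Push raw microphone samples into a queue for the recognizer
    audio_queue.put(bytes(indata))

model = Model(lang="en-us")  # small English model; Vosk downloads one if needed
recognizer = KaldiRecognizer(model, SAMPLE_RATE)

with sd.RawInputStream(samplerate=SAMPLE_RATE, blocksize=8000,
                       dtype="int16", channels=1, callback=audio_callback):
    while True:
        data = audio_queue.get()
        if recognizer.AcceptWaveform(data):
            text = json.loads(recognizer.Result()).get("text", "")
            if WAKE_PHRASE in text:
                print("Wake phrase heard; start capturing the next utterance")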
The newly generated text is then sent to the LLM service. This service is a Docker
container running Ollama, an open-source tool for running LLMs. In this case, the LLM is
Meta’s Llama3:8b model with 8 billion parameters. While not as complex as OpenAI’s
ChatGPT-4, it still has impressive conversational skills. The service sends the response
back to the hopper-chat client, which immediately forwards it to the text-to-speech (TTS)
service.
TTS for hopper-chat is a service running Rhasspy Piper that encapsulates the en_US-lessac-low model, a neural network trained to produce sounds when given text. In this
case, the model is specifically trained to produce English words and phrases in an
American dialect. The “low” suffix indicates that the model is low quality — smaller size,
more robotic sounds, but faster execution. The hopper-chat program plays any sounds it
receives from the TTS service through a connected speaker.
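If you want to hear Piper on its own, you can pipe text into its command-line tool. This one-liner is a sketch; the exact .onnx model filename is an assumption based on the model name above:

$ echo 'Hello from Digit' | piper --model en_US-lessac-low.onnx --output_file hello.wav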
On Digit, the microphone is connected to a USB port on the Orin Nano and simply draped
over a backpack strap. The speaker is connected via Bluetooth. Moss uses an Arduino to
monitor activity in the Bluetooth speaker and move the mandibles during activity to give
Digit the appearance of speaking.
Moss added several fun features to give Digit a distinct personality. First, Digit tells a
random joke, often a bad pun, every minute if the wake phrase is not heard. Second,
Moss experimented with various default prompts to entice the LLM to respond in
particular ways. This includes making random robot noises when generating a response
and adopting different personalities, from helpful to sarcastic and pessimistic.
Courtesy of Jorvon Moss
At the moment, Digit can identify specific phrases using its STT library, but the recorded
phrase must exactly match the expected phrase. For example, you couldn’t say “What’s
the weather like?” when the expected phrase is “Tell me the local weather forecast.” A
well-trained LLM, however, could infer that intention. Moss and I plan to experiment with
Ollama and Llama3:8b to add such intention and command recognition.
The code for hopper-chat is open source and can be found on GitHub. Follow along with
us as we make Digit even more capable.
Dr. Tenma and Toby/Astro (Astro Boy manga, anime, and films, 1952–2014)
J.F. Sebastian and his toys Kaiser and Bear (Blade Runner, 1982)
Wallace and his Techno-Trousers (Wallace & Gromit: The Wrong Trousers, 1993)
Anakin Skywalker and C-3PO (Star Wars: The Phantom Menace, 1999)
Sheldon J. Plankton and Karen (SpongeBob SquarePants, 1999–2024)
Dr. Heinz Doofenshmirtz and Norm (Phineas and Ferb, 2008–2024)
The Scientist and 9 (9, 2009)
Charlie Kenton and Atom (Real Steel, 2011)
Tadashi Hamada and Baymax (Big Hero 6, 2014)
Simone Giertz and her Shitty Robots (YouTube, 2016–2018)
Kuiil and IG-11 (The Mandalorian, 2019)
Finch and Jeff (Finch, 2021)
Brian and Charles (Brian and Charles, 2022)
$ ollama serve
You might see a message that says “Error: listen tcp 127.0.0.1:11434: bind: address
already in use.” Ignore this, as it just indicates that Ollama is already running as a service
in the background.
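The command for loading a model isn't reproduced above. Assuming the TinyLlama model used in the rest of this guide, you can start an interactive session with:

$ ollama run tinyllama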
You should be presented with a prompt. Try asking the AI a question or have it tell you a
joke.
Courtesy of Shawn Hymel
$ nano tinyllama-client.py
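The full listing isn't reproduced here, but a minimal client along these lines (a sketch assuming the ollama Python package and a running Ollama server; the original script may differ) gives the idea:

# Minimal sketch of a TinyLlama chat client using the ollama Python package
import ollama

MODEL = "tinyllama"

while True:
    prompt = input("You: ")
    if prompt.strip().lower() in ("quit", "exit"):
        break
    # Send the prompt to the local Ollama server and print the reply
    response = ollama.chat(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    print("AI:", response["message"]["content"])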
Close the file by pressing Ctrl+X, press Y when asked to save the document, and press
Enter.
$ python tinyllama-client.py
TinyLlama can take some time to generate a response, especially on a small computer
like the Raspberry Pi — 30 seconds or more — but here you are, chatting locally with an
AI!
But in just the past few months, some LLMs have been granted a powerful new ability —
to call arbitrary functions — which opens a huge world of possible AI actions. ChatGPT
and Ollama both call this ability tools. To enable such tools, you must define the
functions in a Python dictionary and fully describe their use and available parameters.
The LLM tries to figure out what you’re asking and maps that request to one of the
available tools/functions. We then parse the response before calling the actual function.
Let’s demonstrate this concept with a simple function that turns an LED on and off.
Connect an LED with a current-limiting resistor to pin GPIO 17 on your Raspberry Pi 5.
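The complete script is downloaded in a moment; as a sketch of just the hardware side (assuming the gpiozero library that ships with Raspberry Pi OS, not necessarily the script's exact implementation), led_write() can be as simple as:

from gpiozero import LED

# LED wired to GPIO 17 through a current-limiting resistor
led = LED(17)

def led_write(led, value):
    # Turn the light on (value = 1) or off (value = 0)
    if value:
        led.on()
    else:
        led.off()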
Make sure you’re in the venv-ollama virtual environment we configured earlier and install
some dependencies:
$ source venv-ollama/bin/activate
$ sudo apt update
$ sudo apt upgrade
$ sudo apt install -y libportaudio2
$ python -m pip install ollama==0.3.3 vosk==0.3.45 sounddevice==0.5.0
You’ll need to download a new LLM model and the Vosk speech-to-text (STT) model:
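The exact download commands aren't reproduced here. Based on the model name used later in the script, the LLM pull would be:

$ ollama pull allenporter/xlam:1b

The Vosk English model can be fetched separately from alphacephei.com/vosk/models, or the vosk package can download one automatically on first use.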
As this example uses speech-to-text to convey information to the LLM, you will need a
USB microphone, such as Adafruit 3367. With the microphone connected, run the
following command to discover the USB microphone device number:
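The original command isn't reproduced above; one way to list audio devices and their index numbers, using the sounddevice package installed earlier, is:

$ python -m sounddevice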
Note the device number of the USB microphone. In this case, my microphone is device
number 0, as given by USB PnP Sound Device. Copy this code to a file named ollama-light-assistant.py on your Raspberry Pi.
You can also download this file directly with the command:
$ wget https://gist.githubusercontent.com/ShawnHymel/16f1228c92ad0eb9d
Open the code and change the AUDIO_INPUT_INDEX value to your USB microphone
device number. For example, mine would be:
AUDIO_INPUT_INDEX = 0
$ python ollama-light-assistant.py
You should see the Vosk STT system boot up and then the script will say “Listening…” At
that point, try asking the LLM to “turn the light on.” Because the Pi is not optimized for
LLMs, the response could take 30–60 seconds. With some luck, you should see that the
led_write function was called, and the LED has turned on!
The xLAM model is an open-source LLM developed by the Salesforce AI Research team. It is trained and optimized to understand requests and map them to function calls rather than to provide text-based answers to questions. The allenporter version has been modified to work with
Ollama tools. The 1-billion-parameter model can run on the Raspberry Pi, but as you
probably noticed, it is quite slow and misinterprets requests easily.
For an LLM that better understands requests, I recommend the Llama3.1:8b model. In
the command console, download the model with:
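The original line isn't reproduced above, but with Ollama the pull command takes the form:

$ ollama pull llama3.1:8b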
Note that the Llama 3.1:8b model is almost 5 GB. If you’re running out of space on your
flash storage, you can remove previous models. For example:
$ ollama rm tinyllama
In the code, change:
MODEL = "allenporter/xlam:1b"
to:
MODEL = "llama3.1:8b"
Run the script again. You’ll notice that the model is less picky about the exact phrasing of
the request, but it takes much longer to respond — up to 3 minutes on a Raspberry Pi 5
(8GB RAM).
When you are done, you can exit the virtual environment with the following command:
$ deactivate
The trick is to get the LLM to understand that calling this function is a possibility! Since
the LLM does not have direct access to your code, the ollama library acts as an
intermediary. By defining a set of tools, the LLM can return one of those tools as a
response instead of (or in addition to) its usual text-based answer. This response comes
in the form of a JSON or Python dictionary that our code can parse and call the related
function.
You must define the tools in a list of dictionary objects. As these small LLMs struggle
with the concept of an “LED,” we’ll call this a “light.” In our code, we provide the following
description of the led_write() function to Ollama:
TOOLS = [
    {
        'type': 'function',
        'function': {
            'name': "led_write",
            'description': "Turn the light off or on",
            'parameters': {
                'type': 'object',
                'properties': {
                    'value': {
                        'type': 'number',
                        'description': "The value to write to the light "
                                       "to turn it off and on. 0 for off, 1 for on.",
                    },
                },
                'required': ['value'],
            },
        }
    }
]
In the send() function, we send our query to the Ollama server running the LLM. This
query is captured by the Vosk STT functions and converted to text before being added to
the message history buffer msg_history.
response = client.chat(
    model=model,
    messages=msg_history.get(),
    tools=TOOLS,
    stream=False
)
When we receive the response from the LLM, we check to see if it contains an entry with
the key tool_calls. If so, it means the LLM decided to use one of the defined tools! We
then need to figure out which tool the LLM intended to use by cycling through all of the
returned tool names. If the name led_write is given for one of the tools, which we defined
in the original TOOLS dictionary, we call the led_write() function. We provide the function
call with the pre-defined led object and argument value that the LLM decided to give.
if response['message'].get('tool_calls') is None:
    print("Tools not used.")
    return
else:
    print("Tools used. Calling:")
    for tool in response['message']['tool_calls']:
        print(tool)
        if tool['function']['name'] == "led_write":
            led_write(led, tool['function']['arguments']['value'])
The properties defined in the TOOLS dictionary give the LLM context about the function,
such as its use case and the necessary arguments it needs to provide. Think of it like
giving an AI agent a form to fill out. The AI will first determine which form to use based
on the request (e.g. “control a light”) and then figure out how to fill in the various fields.
For example, the value parameter says that the field must be a number and it should be a
0 for “off” and 1 for “on.” The LLM uses these context clues to figure out how to craft an
appropriate response.
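For example, asking the assistant to "turn the light on" might produce a response whose tool call is shaped like this (illustrative values only, not captured output):

# Illustrative shape of a tool-calling response from the Ollama chat API
response = {
    'message': {
        'role': 'assistant',
        'content': '',
        'tool_calls': [
            {
                'function': {
                    'name': 'led_write',
                    'arguments': {'value': 1},  # 1 means turn the light on
                }
            }
        ],
    }
}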
CONCLUSION
Shawn Hymel
is an embedded engineer, maker, technical content creator, and instructor. He loves
finding fun uses of technology at the intersection of code and electronics, as well as
swing dancing in his free time (pandemic permitting).