DevOps with Laravel 2. Docker Swarm
Docker Swarm
State
Basic concepts
Workers, managers, and leaders
Creating a cluster
Application-level changes
Deploying a stack
Service placements
Scaling services
API and nginx
Worker
Visualizing the cluster
Protecting the databases
Protecting user-facing service
Ingress routing mesh
Health checks
Restarting services
Updating services
Rolling back services
Deployment
Deploying from a pipeline
Update service
Provisioning nodes
Monitoring and error tracking
Uptime
Uptime robot
DigitalOcean
Health check monitors
Health checks in a cluster
Server resource alerts
Error tracking
Log management and dashboards
JSON logs
Grafana & fluentbit
Conclusions
Docker Swarm
The project files are located in the 5-swarm folder.
I know Docker Swarm is not the sexiest thing in the world, but here's my offer:
If you are already experienced with docker-compose, you can go from a single machine setup to a
highly available 100-server cluster in ~24 hours by learning just ~10 new commands.
Also:
Switching from Swarm to Kubernetes is not that hard. The principles are almost the same.
Now that you are excited, let's talk about the downsides of scaled applications.
State
State is the number one enemy of distributed applications. By distributed, I mean, running on multiple
servers. Imagine that you have an API running on two different servers. Users can upload and download
files. You are using Laravel's default local storage.
User A uploads 1.png and his request is served by Server 1, so the file is stored on Server 1's local disk. Then User B wants to download the image but his request gets served by Server 2. And there's no
/storage/app/public/1.png on Server 2 because User A uploaded it onto Server 1.
So state can live in the filesystem (like the uploaded files above) or in memory. For example:
Databases such as MySQL. MySQL does not just use state, it is the state itself. So you cannot just run a
MySQL container on a random node or in a replicated way. Being replicated means that, for example, 4
containers are running at the same time on multiple hosts. This is what we want to do with stateless
services but not with a database.
Redis also means state. The only difference is that it uses memory (but it also persists data on the SSD).
When I deployed my first distributed application it took me 4-6 hours of debugging to realize these facts.
All of these problems can be solved relatively easily, so don't worry, we're going to look into different
solutions.
Basic concepts
Docker Swarm is a container orchestrator tool that works on multiple servers. These are the basic terms:
Cluster is a set of nodes. Usually, one project runs on one cluster with multiple nodes.
Stack is a docker-compose.yml file. A project can have multiple stacks such as the application itself
and a monitoring-related stack. In the Docker chapter, I referred to the docker-compose.yml file as
"stack" because of this.
Service is the same thing as before. It's a service inside a stack. But in Swarm, a service can run in a
replicated way. For example, we can scale the API to run in 4 or 6 replicas.
Task is a tricky one. Each running container will have a task. And a task is the smallest unit of work that
can be scheduled and deployed in a Swarm cluster. A task represents a single instance of a service
running on a node in the cluster. Each task represents a specific container with its associated
configuration and resources. It's basically a scheduling entity. Don't worry, I didn't understand it either.
Just start playing with Swarm and you'll have a better idea after a few days.
Worker is a node that doesn't care about the world but gets instructions and runs containers. It has no
other responsibilities.
Manager is a node that instructs the workers to do something. When you configure a service that
needs to run in 6 replicas you need to issue the command on a manager node and it will distribute
your 6 replicas across the available workers. The manager also checks the cluster and restarts services
if needed.
So it's a master-worker architecture. The only difference is that a manager is also a worker. So it not only
gives the worker commands but also runs containers itself.
You can see that this architecture has a big disadvantage: the manager is a single point of failure. If it goes
down, the cluster becomes unreliable. There's no node that runs checks on the other nodes or monitors the
number of replicas and so on.
To solve that problem, we can have many manager nodes. But one of them is a leader. At any given time,
there's only one leader and that node manages the cluster. If the leader goes down, the other managers
elect a new leader.
And now, you are the only single point of failure. But that's okay!
Here's the fault tolerance and the recommended number of managers by Docker:

Managers    Fault tolerance
1           0
2           0
3           1
4           1
5           2
6           2
7           3
If you have 4 manager nodes, the cluster can tolerate only one of them going down before it stops making
"progress." The general formula is (N - 1) / 2 rounded down, where N is the number of managers. Because of
that formula, it doesn't make sense to have an even number of managers. As you can see, it doesn't make a
difference whether you have 3 or 4 managers: the fault tolerance is 1 in both cases.
I said that the cluster stops making "progress." But what is progress in this context? The manager nodes
need to agree on a specific state or command and perform the required actions based on the agreement.
These decisions and operations involve maintaining a consistent and replicated state across all nodes in the
cluster. For example, having 6 replicas of the api service but only two can run on the same node at the
same time. We're talking about these kinds of decisions, operations, and progress. Other than that
managers also have some special tasks such as:
Leader election: The managers in the cluster elect a leader responsible for coordinating operations and
making decisions on behalf of the cluster. It's an important process.
Log replication: The leader receives commands from clients and appends them to its log. It then
replicates the log entries to other nodes in the cluster to maintain consistency. On a manager node,
you can see every log entry from all the nodes produced by a specific service. For example, here are
the logs of the worker service running in 4 replicas on 4 nodes:
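The screenshot isn't reproduced here, but the command behind it is docker service logs. For example, assuming the stack is called posts:

docker service logs posts_worker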
Creating a cluster
I recommend you rent a few $6 droplets to create a playground cluster with me. Choose the Docker image
and name them node1, node2, etc.
To run a cluster you need to open the following ports on all nodes:

2377 (cluster management communication)
7946 (communication among nodes)
4789 (overlay network traffic)

They are used by Docker for communication. On DigitalOcean droplets you can run this command:

ufw allow 2377 && ufw allow 7946 && ufw allow 4789
You also need to log in to your Docker registry on every node. For me, it's Docker Hub so the command is
simply:
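A sketch, assuming Docker Hub as the registry:

docker login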
You're basically done! Now you have a 1-node cluster. This node is the leader. If you run this command it
gives a token. This token can be used to join the cluster as a worker node.
1.2.3.4 is the leader's IP address and 2377 is one of the ports you opened before.
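The commands themselves aren't shown in this extract; presumably they look like this:

# on node1: initialize Swarm mode (this node becomes the leader)
docker swarm init

# on the leader: print the join command (including the token) for workers
docker swarm join-token worker

# on every other node: join the cluster as a worker
docker swarm join --token <your-token> 1.2.3.4:2377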
If you run
docker node ls
I ran the command on node1 which is the leader. It lists all of the nodes that are part of the cluster. All of
the other nodes are workers since the Manager Status column is empty. It can be Leader , Reachable , or
empty. Reachable means manager.
If I run the same command on a worker node such as node2 it returns an error:
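The error looks something like this:

Error response from daemon: This node is not a swarm manager. Worker nodes can't be used to view or modify cluster state.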
If you want to add another manager node you can request a manager token on the current manager node:
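It's the manager variant of the join-token command:

docker swarm join-token manager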
If you run docker swarm join with this token, the new node is going to be a manager.
Or if you have an existing worker node and want to promote it to a manager, you can run this command
on a manager node:
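The command is docker node promote (node2 here is just an example):

docker node promote node2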
Application-level changes
To take an app from one machine to a cluster you most likely need to make some changes. As I said in the
introduction, the state is our biggest enemy right now. You need to make sure that you don't use these:
Local storage
File-based session
File-based caches
Later, we're gonna talk about MySQL and Redis too. The sample application contains an endpoint with file
upload. To make it work in a cluster our best shot is to use S3, DigitalOcean Spaces, CloudFlare R2, or some
other object storage.
I'm going to use S3 in this book but if it was my own project I would not use it because the whole AWS
seems brutally overpriced. And quite ugly, to be honest.
FILESYSTEM_DISK=s3
AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=
AWS_DEFAULT_REGION=us-east-1
AWS_BUCKET=devops-with-laravel-storage
AWS_USE_PATH_STYLE_ENDPOINT=false
AWS_URL=https://devops-with-laravel-storage.s3.us-east-1.amazonaws.com/
You can get all this information from the AWS console.
if ($file) {
    $filename = Str::slug($post->title, '_')
        . '-' . $post->id . '.' . $file->extension();

    $file->storeAs('post_cover_photos', $filename);

    $post->cover_photo_path =
        'post_cover_photos' . DIRECTORY_SEPARATOR . $filename;

    $post->save();
}
Since S3 is the default driver there's no need to explicitly specify it. But of course, you can do this as well:
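Presumably by passing the disk explicitly as the third argument of storeAs:

$file->storeAs('post_cover_photos', $filename, 's3');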
Then in the PostResource class, I create a temporary URL to show the file:
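The resource code isn't reproduced here; a minimal sketch using Laravel's temporaryUrl (the attribute name is an assumption):

// In PostResource::toArray() — generate a short-lived S3 URL for the cover photo
'cover_photo_url' => Storage::temporaryUrl(
    $this->cover_photo_path,
    now()->addMinutes(60)
),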
If you use file-based session or cache you need to change these variables as well:
CACHE_DRIVER=redis
SESSION_DRIVER=database
Deploying a stack
The next step is to make the existing compose files "Swarm-compatible." Everything is going to be the same;
there's only one, but big, difference: Docker Swarm doesn't read the .env file in the current working
directory the way docker-compose does. Docker Swarm doesn't do this for us, so we need to make a few changes.
api:
image: martinjoo/posts-api:${IMAGE_TAG}
command: sh -c "/usr/src/wait-for-it.sh mysql:3306 -t 60 && /usr/src/wait-for-it.sh redis:6379 -t 60 && php-fpm"
environment:
- APP_NAME=posts
- APP_ENV=production
- APP_KEY=${APP_KEY}
- APP_DEBUG=false
- APP_URL=http://localhost
- LOG_CHANNEL=stack
- LOG_LEVEL=error
- DB_CONNECTION=mysql
- DB_HOST=mysql
- DB_PORT=3306
- DB_DATABASE=posts
- DB_USERNAME=${DB_USERNAME}
- DB_PASSWORD=${DB_PASSWORD}
- QUEUE_CONNECTION=redis
- REDIS_HOST=redis
- REDIS_PORT=6379
- MAIL_MAILER=log
- AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
- AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
- AWS_BUCKET=devops-with-laravel-storage
- AWS_DEFAULT_REGION=us-east-1
- FILESYSTEM_DISK=s3
depends_on:
- update
- mysql
- redis
Every Laravel service needs these variables so I put these under the scheduler , worker , and update as
well.
As you can see, I'm using the template format ${DB_USERNAME} . This will be substituted with the current
environment variables in the running process. I'm talking about something like export
DB_USERNAME=username && some-command . If you run this command the DB_USERNAME will be available for
the some-command process.
Before deploying a stack (the compose file), we need to make sure that we export the right environment
variables. So we still can have an .env file but it needs to be exported:
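One common way to do that (a sketch; the project's helper script may differ):

export $(grep -v '^#' .env | xargs)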
This command will export every variable from the .env file. And after that the stack can be deployed and
Docker Swarm will substitute the placeholders with the environment variables.
So technically we will still have a .env file but it's only going to be used on the manager node before
running the deploy command. After that, it can be deleted because the variables are in the current session
and Swarm handles the rest.
So if you check out the project files you can prepare a quick, manual deployment by running these
commands:
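Presumably something like this (the user and paths are assumptions based on the pipeline shown later):

scp ./docker-compose.prod.yml root@1.2.3.4:/usr/src/docker-compose.prod.yml
scp ./.env.prod.template root@1.2.3.4:/usr/src/.env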
Where 1.2.3.4 is the IP of a manager node. Then change the values inside the .env file and run this on
the manager:
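Presumably:

cd /usr/src
export $(grep -v '^#' .env | xargs)
docker stack deploy -c docker-compose.prod.yml posts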
docker stack deploy is the most important command. This is how you can deploy a stack. It takes one
argument which is posts in my case. It's the stack's name. Basically, it's the name of your project. With the
-c flag you can define which docker-compose file you want to deploy. You always need to specify this, even if
your file is called docker-compose.yml .
docker stack ls
You should see the stacks that have been deployed. In this case, posts is the only stack.
docker service ls
Ignore REPLICAS and replicated job for now. If everything went well you should see 1/1 everywhere.
If you now open the IP address of your manager node, you can see the frontend.
Notice how simple this whole process was. We:

Created 3 servers.
Added some worker nodes with this one: docker swarm join --token <your-token> 1.2.3.4:2377
And then deployed the stack using the docker stack deploy command.

And now you have a 3-node cluster! If you still think Swarm is lame and k8s is the king... that's fine. However, this setup still has two major problems:
The database is being placed on a random node every time you deploy the stack. So basically each
node will have about 33% of your records.
Services are not replicated. I mean, it's not really a "major problem", but that's the main goal right now.
Service placements
One of the possible ways to solve the database problem is to label a node as "the database node." And then
Swarm will always place the database services (MySQL and Redis) to that node. Of course, the other solution
is to use a managed database cluster in a cloud provider. We're going to try it out later.
The first thing to do is adding a label to one of the nodes. First, list the nodes on the manager:
docker node ls
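The update command isn't reproduced here, but based on the recap at the end of this section it is:

docker node update --label-add db=true node3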
label-add adds a label to a node. These labels are visible to Swarm and they can be used in the
docker-compose.yml config to place services to a particular node.
db=true is the label itself: db is the key and true is the value. true has no special meaning; it can
be db=yes if you like. And these are not binary values, so you can use a label such as
type=database
So now, node3 is the database server. It will run the MySQL and Redis services so they can persist their
state.
mysql:
image: martinjoo/posts-mysql:${IMAGE_TAG}
deploy:
placement:
constraints:
- "node.labels.db!)true"
ports:
- "3306:3306"
volumes:
- type: volume
source: mysqldata
target: /var/lib/mysql
environment:
- MYSQL_ROOT_PASSWORD=${DB_PASSWORD}
deploy is a special part of the compose config. We're going to use it a lot because it contains configuration
that can be used only by Swarm (it's not entirely true, because docker-compose can also use a few of
them).
In the placement key we can control how Swarm should place the containers of this service. Swarm offers a
ton of options, just to name a few:
Spread replicas evenly across specific labels. For example, you can use labels to label data center
regions and then you can distribute your replicas across these data centers.
Under the constraints key we can write our specific constraint using the == or the != operators. The
Docker documentation has a list of all the available constraints.
node.labels.db refers to the specific label we want, and the value must be true . With this small change
Swarm guarantees that MySQL will always run on node3 .
redis:
image: redis:7.0.11-alpine
deploy:
placement:
constraints:
- "node.labels.db!)true"
ports:
- "6379:6379"
volumes:
- type: volume
source: redisdata
target: /data
Notice that I use the same volumes as earlier. It doesn't need to change at all.
The important thing is the task created by the posts_mysql service runs on node3 now.
Each service can spin up multiple containers because we can scale them (we will later).
So a task is a container with some additional meta information (such as on which node it's running).
We labeled node3 as the database node by running docker node update --label-add db=true
node3
Added the deploy , placement , and constraints keys to the docker-compose config to tell Swarm
that MySQL and Redis should run on nodes with the label db=true
I suggest you try this on your own and after you deploy the stack run docker service ls . This way you
can see how it's updating services. Meanwhile, just open the IP address of your manager node and you can
see there's no downtime in the process.
Scaling services
API and nginx
And here comes the fun part. Let's scale out the api to 6 replicas:
api:
image: martinjoo/posts-api:${IMAGE_TAG}
command: sh -c "/usr/src/wait-for-it.sh mysql:3306 -t 60 && /usr/src/wait-for-it.sh redis:6379 -t 60 && php-fpm"
deploy:
replicas: 6
...
If you now list the services by running docker service ls you should see 6/6 replicas:
You can list each of these tasks by running docker service ps posts_api :
You can see that each node has some replicas of the api service. In this picture, you can see two kinds of
desired states:
Running
Shutdown
It's because I was already running a replicated stack, and then I redeployed it. When you run docker stack
deploy Swarm will shut down the currently running tasks and start new ones.
There's a frequently asked question when it comes to scaling an API: should I scale the nginx service that
acts like a reverse proxy or the API itself or both?
To answer that, let's think about what nginx does in our application:
location ~\.php {
try_files $uri =404;
include /etc/nginx/fastcgi_params;
fastcgi_pass api:9000;
fastcgi_index index.php;
fastcgi_param PATH_INFO $fastcgi_path_info;
fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
}
It acts like a reverse proxy. It literally just receives requests and then forwards them right to the API. It's not
a performance-heavy task.
If you run nginx in 1 replica on a server with 2 CPUs, it spins up 2 worker processes with 1024
connections each. So it can handle a maximum of 2048 concurrent requests. Of course, it's not a
guarantee, it's a maximum number. But it's quite a big number. Now, I guarantee you that the bottleneck
will be PHP and MySQL, not nginx.
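For reference, that math comes from an nginx config along these lines (a sketch, not the book's exact file):

worker_processes auto;  # resolves to 2 worker processes on a 2-CPU server

events {
    worker_connections 1024;  # maximum concurrent connections per worker
}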
In my opinion, you can scale your nginx but it won't bring you that many benefits. Meanwhile scaling the API
is much more useful.
Usually, the easiest way to decide on the number of replicas is to (kind of) match the number of CPUs in
your server/cluster. My cluster has 4 nodes with 2 CPUs each. That's 8 CPUs. So in theory I can scale out the
API to 8 replicas and Swarm will probably place 2 on each node. Which is great. But what about other
services? Let's say node1 runs two API containers. But it also runs a worker. And nginx. And also the
scheduler.
It's easy to see the problem. We have 8 CPUs so we just scale this way:
Service    Replicas
API        8
nginx      8
worker     8
It's important to note that in Swarm I don't use supervisor. So a worker container just runs a queue:work
process (details later).
Now there are 24 tasks running on 4 nodes and 8 CPUs, meaning that each node runs 6 tasks with 2 CPUs.
And I'm not even counting MySQL, Redis, or the frontend. So let's lower the numbers:
Service    Replicas
API        4
nginx      4
worker     4
Then there are 12 tasks running on 4 nodes with 8 CPUs. It means 3 tasks on each node with 2 CPUs. That
sounds much better to me. You don't have to match the number of CPUs exactly because sometimes worker
processes are idle for a long time. Other times, worker processes are busy but the API has a low traffic (for
example, a few users running performance-heavy jobs).
But as I said, nginx can handle quite a lot of traffic so we can change the number of replicas such as this:
Service    Replicas
API        6
nginx      2
worker     4
nginx does its job on the servers in 2 replicas. It can probably handle ~4000 concurrent connections.
There's an API container running on all the nodes plus we have 2 extra ones, just in case. This is the
container that will die the most and have the most trouble probably.
There's a worker on every node. Of course, it depends highly on the nature of your application. For
example, if you mainly use the queue to send notifications via an SMTP server you probably don't need
4 replicas.
Of course, it's not an exact science and it depends on your application and the nature of your traffic. So,
unfortunately, I cannot give you exact numbers and formulas.
Worker
As I said earlier, I don't use supervisor in a cluster. The reason is that now worker processes can run on
multiple servers so we don't really need 4 processes on one machine. It can be 4 containers across 4 nodes.
Docker Swarm can play the part of a supervisor because it can:

Run the worker in multiple replicas across the nodes
Restart the containers if they stop or become unhealthy
Limit the number of replicas a node can run from the same container
As you might remember, the worker service defines a command that executes the worker.sh script:
worker:
  image: martinjoo/posts-worker:${IMAGE_TAG}
  command: sh -c "/usr/src/wait-for-it.sh mysql:3306 -t 60 && /usr/src/wait-for-it.sh redis:6379 -t 60 && /usr/src/worker.sh"

To run it in multiple replicas, we only need to add the deploy key:

worker:
  image: martinjoo/posts-worker:${IMAGE_TAG}
  command: sh -c "/usr/src/wait-for-it.sh mysql:3306 -t 60 && /usr/src/wait-for-it.sh redis:6379 -t 60 && /usr/src/worker.sh"
  deploy:
    replicas: 4
Visualizing the cluster
It's a good idea to run these kinds of monitoring tools in a different stack. If you check out the project files,
you can see a docker-compose.monitoring.yml :
version: "3.8"
services:
visualizer:
image: dockersamples/visualizer:stable
deploy:
placement:
constraints:
- "node.role!)manager"
ports:
- "8080:8080"
volumes:
- /var/run/docker.sock:/var/run/docker.sock
It needs to run on a manager node since it runs Swarm-related Docker commands to get information
about the nodes and so on.
It needs to have access to the docker.sock file on your machine. The way we can solve this is a
volume. We mount the Docker socket from the host machine into the container. So when visualizer
runs a command in its container such as docker node ls it actually receives information from the
host machine.
I copied this file to the manager node and ran the following command:
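Presumably something like this (the stack name monitoring is an assumption):

docker stack deploy -c docker-compose.monitoring.yml monitoring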
If you run docker stack ls again, you should now see two stacks: one for the project itself and one for the new monitoring stack I just deployed.
And if you now open one of your managers' IP addresses on port 8080 you'll see something like that:
You can see every node and the services running on them.
Visualizer works in real-time. So if you run a docker stack deploy you can see how your containers are
being deployed in real-time.
Protecting the databases

Right now, the database node can run all of these services:
MySQL
Redis
nginx
API
worker
It might be a little bit too much for a server with only 2 CPUs and 2G of RAM. If your application gets a big
spike in traffic this node will suffer because:
If you run lots of queries (and/or heavy ones) MySQL will consume a lot of your CPUs.
The API container can also consume CPU based on your application and of course some RAM. In the
optimization chapter, I showed you that an average PHP8 process in my case consumed 25MB to 43MB
of RAM. But I also showed you a legacy PHP7 project that sometimes consumed 130MB of RAM. And
that's just one process. Now imagine 50 users being served by this particular node.
What if the worker process starts to generate big PDFs or resizes images uploaded by users?
If you add all of these things together it's almost a guaranteed failure for your database node, which makes
the application and all of your other nodes useless. No database means no application. It's clearly a single
point of failure right now. On top of that, we overload it with containers. And it doesn't necessarily need to
go down: if this node is slowing down, probably your whole application is slowing down because the vast
majority of your requests use the database.
Fortunately, it's easy to solve this problem. All we need to do is to place non-database containers on the
other nodes:
api:
image: martinjoo/posts-api:${IMAGE_TAG}
command: sh -c "/usr/src/wait-for-it.sh mysql:3306 -t 60 && /usr/src/wait-for-it.sh redis:6379 -t 60 && php-fpm"
deploy:
replicas: 6
placement:
constraints:
- "node.labels.db!+true"
Since now we effectively have a 3-node cluster, I decreased the number of replicas:
Service    Replicas
API        4
nginx      2
worker     3
I still want an extra API container. It means one of the nodes runs two APIs which is perfectly fine (node2
turned out to be the lucky one).
Protecting user-facing service

If your application runs resource-heavy jobs such as:
Transcoding videos
Processing audio
Creating large ZIP files or just working with many files in general
then it can be a problem since workers will overload the nodes and the API is going to be slow.
But there are other kinds of jobs that can be resource-intensive as well, such as importing or exporting
large datasets. For example, I worked on a project where simple CSV imports caused real trouble.
These companies already used some kind of ERP system where they had all the user information about
their workforce.
So they obviously didn't want to create and manage thousands (or tens of thousands) of users in
another application as well.
Because of that we had to sync users with CSV imports via API integrations.
Some of the tenants were crazy (I don't want to name them, but you probably like their burgers) and
they uploaded their CSV every hour.
So every hour a job started on the server, scanning tens of thousands of rows and syncing each user.
Unfortunately, processing a user meant something like 7 or 8 queries, so just a 10,000-row CSV import can
mean 80,000 database queries. Every hour (in this case). We had the worker processes on the same server
as nginx and the API, and you can imagine things were slow.
To avoid situations like that we can add dedicated worker nodes with a new label of worker=true . And we
can place the worker container on these nodes:
worker:
image: martinjoo/posts-worker:${IMAGE_TAG}
command: sh -c "/usr/src/wait-for-it.sh mysql:3306 -t 60 && /usr/src/wait-for-it.sh redis:6379 -t 60 && /usr/src/worker.sh"
deploy:
replicas: 2
placement:
constraints:
- "node.labels.worker!)true"
api:
image: martinjoo/posts-api:${IMAGE_TAG}
command: sh -c "/usr/src/wait-for-it.sh mysql:3306 -t 60 && /usr/src/wait-for-it.sh redis:6379 -t 60 && php-fpm"
deploy:
replicas: 4
placement:
constraints:
- "node.labels.db!+true"
- "node.labels.worker!+true"
With this constraint, other containers won't be placed on the worker server.
And of course, you can add even more worker servers, you can play around with the server size, the number
of replicas, etc.
Ingress routing mesh
Right now the API runs on every node, but the frontend does not. It runs in two replicas on node1 and node4.
But I was still able to access every node. Something is clearly going on in the background.
This is called the ingress routing mesh. This is how Docker defines it:
Docker Engine swarm mode makes it easy to publish ports for services to make them available to
resources outside the swarm. All nodes participate in an ingress routing mesh. The routing mesh
enables each node in the swarm to accept connections on published ports for any service running in
the swarm, even if there’s no task running on the node. The routing mesh routes all incoming requests
to published ports on available nodes to an active container.
This is why we need to open ports 7946 and 4789 on the nodes. Routing mesh uses these ports for
container discovery and communication.
When you hit a port that is publicly accessible the request goes through an internal load balancer
provided by Swarm, and it routes the request to a container that can handle it. This is why you can
access port 80 on every node: Swarm routes the request to a container on a node that listens on that port.
So basically Docker Swarm comes with a fully functional load balancer, without any configuration. But you
can still put your own load balancer in front of the cluster: point it at the published port on every node
and the routing mesh takes care of the rest.
If you want to learn more check out the official Docker page about routing mesh where you can see the
exact configuration of the load balancer.
Health checks
One of the great things about Docker Swarm is its health checks and self-healing features. We can define
some pretty great health checks for the containers and Swarm will start and restart them according to their
states.
mysql:
image: martinjoo/posts-mysql:${IMAGE_TAG}
healthcheck:
test: [ "CMD", "mysqladmin", "ping" ]
interval: 30s
timeout: 5s
retries: 3
start_period: 30s
It runs the command every 30s as defined in interval . When the container is starting, Swarm will wait
30s before running the first check. You can imagine it like this: sleep 30 && mysqladmin ping instead of
mysqladmin ping && sleep 30
If the ping command takes more than 5s defined in timeout the check fails.
We give 30s defined in start_period to the container to start. Failed health checks in the first 30s
won't increase the retries count.
redis:
image: redis:7.0.11-alpine
healthcheck:
test: [ "CMD", "redis-cli", "ping" ]
interval: 30s
timeout: 5s
retries: 3
start_period: 30s
The API runs a php-fpm service on port 9000 so the best health check is to connect to that port. For this, I'm
using nc :
api:
image: martinjoo/posts-api:${IMAGE_TAG}
healthcheck:
test: [ "CMD", "nc", "-zv", "localhost", "9000" ]
interval: 30s
timeout: 5s
retries: 3
start_period: 40s
Since the API depends on MySQL, Redis, and updates I gave it a bit more start period. We also need to install
nc in the Dockerfile:
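The exact Dockerfile line isn't shown in this extract; assuming an Alpine-based PHP image, it's something like:

RUN apk add --no-cache netcat-openbsd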
After that, in the nginx container, we can actually send an HTTP request to the API as a health check:
nginx:
image: martinjoo/posts-nginx:${IMAGE_TAG}
healthcheck:
test: [ "CMD", "curl", "-f", "http:!$localhost/api/health-check" ]
interval: 30s
timeout: 5s
retries: 3
start_period: 1m
nginx depends on even more containers so I raised the start_period to 1 minute. Of course, it won't take
that much. The health check sends an HTTP request to localhost using curl. I added this pretty simple
health-check endpoint:
Route::get('/health-check', function () {
    return response('', 200);
});
The worker container runs the queue:work command, so we can use the queue:monitor artisan command as its health check:
worker:
image: martinjoo/posts-worker:${IMAGE_TAG}
healthcheck:
test: [ "CMD", "php", "/usr/src/artisan", "queue:monitor", "default" ]
interval: 30s
timeout: 5s
retries: 3
start_period: 40s
The frontend container can also just curl itself on port 80:
frontend:
image: martinjoo/posts-frontend:${IMAGE_TAG}
healthcheck:
test: [ "CMD", "curl", "-f", "http:!$localhost" ]
interval: 30s
timeout: 5s
retries: 3
start_period: 1m30s
It's the last container in the dependency chain so it has the largest start_period .
Of course, you can and should customize these numbers to your own need. You can also play with the
timeout numbers (I'm using a default 5s for everything). But it can vary based on some criteria, for
example:
If you have a typical application usually it's okay if your worker is a bit slower for a period of time. So 5s
can be perfect.
But probably it's a problem if the API is slow. So maybe you can change it to 2 or 3 seconds.
Now that every service has health checks and they also use wait-for-it you can be sure that the startup
state of your application is stable and flawless. But of course, health checks also play a big role in restarting
containers when they run into some problem.
Restarting services
Swarm also comes with great restart abilities. We can define how we want to restart containers that are
stopped or in an unhealthy state:
mysql:
image: martinjoo/posts-mysql:${IMAGE_TAG}
deploy:
restart_policy:
condition: any
delay: 5s
max_attempts: 3
window: 60s
condition can be none , on-failure , or any . In most cases, I like to use any since I can't imagine
a scenario where a container stops working and there's no need to restart it. The update container is
an exception: it's a one-off job.
delay means that Swarm will wait for 5s before restarting the container.
max_attempts specifies how many times Swarm attempts to restart the container.
window specifies how long to wait before deciding if a restart has succeeded or not. This value should
match the start_period in the healthcheck section and/or the time given to wait-for-it
Every container in the sample stack has a similar restart policy to this one except update . It doesn't need to
be restarted at all.
scheduler:
image: martinjoo/posts-scheduler:${IMAGE_TAG}
command: sh -c "/usr/src/wait-for-it.sh mysql:3306 -t 60 && /usr/src/wait-for-it.sh redis:6379 -t 60 && /usr/src/scheduler.sh"
deploy:
restart_policy:
condition: any
delay: 60s
window: 30s
It runs the scheduler.sh script; when it stops, Swarm waits for 60 seconds then restarts it. If you remember, with docker-
compose we had this script:
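The script isn't reproduced here; based on the description it was an infinite loop along these lines:

#!/bin/bash
# docker-compose version: run the Laravel scheduler every 60 seconds, forever
while true; do
    php /usr/src/artisan schedule:run
    sleep 60
done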
But now delay 60s replaces sleep 60 so scheduler.sh looks like this:
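A sketch of the Swarm version:

#!/bin/bash
# Swarm version: run once and exit; restart_policy's 60s delay replaces the sleep
php /usr/src/artisan schedule:run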
A tiny difference but I think it's better that the orchestrator does the scheduling as well.
Updating services
Swarm can also intelligently update the services when you re-deploy your stack. Meaning, it doesn't cause
downtime.
api:
image: martinjoo/posts-api:${IMAGE_TAG}
command: sh -c "/usr/src/wait-for-it.sh mysql:3306 -t 60 && /usr/src/wait-for-it.sh redis:6379 -t 60 && php-fpm"
deploy:
replicas: 4
update_config:
parallelism: 2
delay: 10s
failure_action: pause
monitor: 20s
max_failure_ratio: 0.25
order: stop-first
Swarm will update two containers at a time defined by parallelism . It means that it stops 2
containers from the currently running ones and then starts two new ones with the new image.
After stopping and starting two containers it waits for 10s because of the delay
If the new containers fail it will pause the update. This means that all the containers that were running
the previous version of the service will continue to run and serve traffic, while no new containers will
be created or updated until the issue that caused the failure is resolved manually. So you had 4
working replicas, and you rolled out a deployment that fails. There are still 2 old containers serving
traffic while you're working on the hotfix. pause is the default value so you can omit it. The other two
options are continue and rollback . In most cases, pause is your best option. continue is a bit too
risky, however, rollback can be useful. We're going to talk about it in a minute.
Swarm will monitor the status of the new containers for 20s as defined by monitor . If they fail in this
interval it's considered a failure and failure_action will be triggered. Otherwise, the update is
considered successful. Updates happen in order, so when api is updated, mysql and redis are
already up and running according to their health checks.
If 25% of the new containers fail the whole update process is considered a failure and is halted. This is
defined by max_failure_ratio : 0.25 translates to 25%, which means exactly 1 container since api
runs in 4 replicas. So if one of the API containers fails, we stop the update of the api containers.
However, in a real project, I don't want to tolerate failures at all, so I set max_failure_ratio to 0,
which is the default value so you can omit it.
Swarm will first stop 2 running containers and then start two new ones. This is controlled by order:
stop-first . The other value is start-first . stop-first is almost always the safest choice since
start-first can cause problems. For example, you can write a constraint that prevents the api
service from running more than 2 replicas on the same node; start-first might violate that constraint. It
can also overload your node with too many tasks.
mysql:
image: martinjoo/posts-mysql:${IMAGE_TAG}
deploy:
update_config:
parallelism: 1
failure_action: rollback
monitor: 30s
max_failure_ratio: 0
order: stop-first
This is the MySQL service. It runs only one replica. It's a single point of failure. If the update fails you're in
deep shit. The whole application is down. This is why I'm using this config.
failure_action: rollback
If something goes wrong with the update Swarm will roll back to the previous version. This is crucial. With
failure_action: pause your whole application would be down until you figure out the problem. And in
this case, a rollback is pretty much safe in my opinion. The MySQL image is basically just the official image
with a config file in it. So an update will go wrong if you updated the config, didn't test it, and deployed it.
Which doesn't happen that often. And if it does happen Swarm will roll back to the previous image. Which
means either an older MySQL version or a different config. Hopefully, there's no such version of config that
would cause serious problems in your application. So your application most likely will work with the older
image.
It's a great feature that can save the day. Rollbacks can be configured as well, and the good news is that it
has exactly the same properties as update_config :
api:
image: martinjoo/posts-api:${IMAGE_TAG}
deploy:
replicas: 4
update_config:
parallelism: 2
delay: 10s
failure_action: pause
monitor: 15s
max_failure_ratio: 0
order: stop-first
rollback_config:
parallelism: 2
delay: 10s
failure_action: pause
monitor: 15s
max_failure_ratio: 0
order: stop-first
As you can see the rollback_config is exactly the same as the update_config . For me, it makes sense to
use the same values when rolling back.
Deployment
Deploying from a pipeline
Deploying from the pipeline is almost entirely the same as it was with docker-compose. In the GitHub
workflow, we need to do the exact same steps:
deploy-prod:
  needs: [ build-frontend, build-nginx ]
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v3
    - name: Copy SSH key
      run: |
        echo "${{ secrets.SSH_KEY }}" > ./id_rsa
        chmod 600 id_rsa
    - name: Deploy app
      run: |
        scp -i ./id_rsa ./deployment/bin/deploy.sh ${{ secrets.SSH_CONNECTION_PROD }}:/home/martin/deploy.sh
        scp -i ./id_rsa ./docker-compose.prod.yml ${{ secrets.SSH_CONNECTION_PROD }}:/usr/src/docker-compose.prod.yml
        scp -i ./id_rsa ./.env.prod.template ${{ secrets.SSH_CONNECTION_PROD }}:/usr/src/.env
        ssh -tt -i ./id_rsa ${{ secrets.SSH_CONNECTION_PROD }} "chmod +x /home/martin/deploy.sh"
        ssh -tt -i ./id_rsa ${{ secrets.SSH_CONNECTION_PROD }} "
          sed -i "/IMAGE_TAG/c\IMAGE_TAG=${{ github.sha }}" /usr/src/.env
You need to set one of the manager nodes' IP addresses in a GitHub secret to SSH into it. Just like we did earlier.
After these steps the .env file is ready and the deploy script is on the server. It's super simple:
#!/bin/bash
set -e
cd /usr/src

# Presumably the script continues like this: export the variables, then deploy
export $(grep -v '^#' .env | xargs)
docker stack deploy -c docker-compose.prod.yml posts
Update service
The update service (that runs migrations and caches configs) doesn't need to be changed at all. The only
difference is that it's not a long-running service but a one-off short-lived task.
update:
image: martinjoo/posts-api:${IMAGE_TAG}
command: sh -c "/usr/src/wait-for-it.sh mysql:3306 -t 60 && /usr/src/wait-for-it.sh redis:6379 -t 60 && /usr/src/update.sh"
deploy:
mode: replicated-job
This will run the container in one replica and when the update.sh script exits Swarm will not restart the
service. A replicated-job cannot have update or rollback-related configs since they wouldn't make sense.
Provisioning nodes
Provisioning a new node is exactly the same as before. In the Docker chapter, I explained the whole script in
detail.
ufw allow 2377 && ufw allow 7946 && ufw allow 4789
We log in to Docker Hub, then open the necessary ports. As the final step, the new server needs to join the
cluster. For this, the script accepts a token as its parameter. If it's a manager token the new node is going to
be a manager; if it's a worker token then a worker, of course. $SWARM_MANAGER_IP_PORT is also an
argument. It's the manager node's IP address and port number such as 1.2.3.4:2377
Monitoring and error tracking
Monitoring means a lot of things. In this chapter, I'm going to show some techniques you can use. We start
with simpler solutions and then move forward to more complicated ones.
Uptime
This is the most basic and crucial monitoring tool you should have. Monitoring uptime means that there's a
service that calls your API every minute and notifies you if it's not responding. We did the same thing on the
container level with Swarm health checks. But in this case, I'm talking about an external tool on the "network
level."
There are lots of great tools, such as:

Uptime robot
Better uptime

We're going to take a quick look at Uptime robot and DigitalOcean's uptime service.
Uptime robot
You need to configure a few basic things:
Your domain
The free account only allows you to use 5-minute intervals; the ideal would be 1 minute. This config will run
every 5 minutes and send an alert if the site isn't responding after 10 seconds or if it returns a not-healthy
status code (4xx, 5xx).
DigitalOcean
Every server provider offers you monitoring tools. On DigitalOcean they are completely free.
However, there are other providers (such as Azure) where they cost money.
They also have uptime monitoring for free. In the left sidebar choose Manage -> Monitoring -> Uptime :
They even have region settings. This means they try to hit your site from 4 different regions.
After you've created your monitor, you can create different types of alerts:
An SSL Cert Expire notifies if your SSL certificate is going to expire in N days.
Depending on your server provider use these monitors and alerts, or choose something like Uptime robot
or Oh dear.
Health check monitors

There are two great packages to monitor the health of a Laravel application:
spatie/laravel-health
pragmarx/health
Both of them do the same thing: they can monitor the health of the application, meaning disk space usage,
CPU load, the number of database connections, Redis memory usage, etc.
You can define the desired threshold and the package will notify you if necessary. I'm going to use the
Spatie package but the other one is also pretty good. Spatie is more code-driven meanwhile the Pragmarx
package is more configuration-driven.
<?php

namespace App\Providers;

use App\Checks\QuerySpeedCheck;
use Illuminate\Support\ServiceProvider;
use Spatie\CpuLoadHealthCheck\CpuLoadCheck;
use Spatie\Health\Checks\Checks\DatabaseConnectionCountCheck;
use Spatie\Health\Checks\Checks\RedisCheck;
use Spatie\Health\Checks\Checks\RedisMemoryUsageCheck;
use Spatie\Health\Checks\Checks\UsedDiskSpaceCheck;
use Spatie\Health\Facades\Health;

class AppServiceProvider extends ServiceProvider
{
    public function boot(): void
    {
        // Register the health checks described below
        Health::checks([
            UsedDiskSpaceCheck::new(),
            CpuLoadCheck::new(),
            DatabaseConnectionCountCheck::new(),
            RedisCheck::new(),
            RedisMemoryUsageCheck::new(),
            QuerySpeedCheck::new(),
        ]);
    }
}
UsedDiskSpaceCheck warns you if more than 70% of the disk is used and it sends an error message if
more than 90% is used.
CpuLoadCheck measures the CPU load (the numbers you can see when you open htop ). It sends you
a failure message if the 5-minute average load is more than 2 or if the 15-minute average is more than
1.75. I'm using 2-core machines in this project, so a load of 2 means both cores run at 100%. If you
have a 4-core CPU, 4 means 100% load.
DatabaseConnectionCountCheck sends you a warning if there are more than 50 connections and a
failure message if there are more than 100 MySQL connections.
RedisCheck tries to connect to Redis and notifies you if the connection cannot be established.
RedisMemoryUsageCheck sends you a message if Redis is using more than 500MB of memory.
These are the basic checks you can use in almost every project.
To be able to use the CpuLoadCheck and the DatabaseConnectionCountCheck you have to install these
packages as well:
spatie/cpu-load-health-check
doctrine/dbal
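With composer, that's (using the package names above):

composer require spatie/laravel-health spatie/cpu-load-health-check doctrine/dbal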
The package can send you e-mail and Slack notifications as well. Just set them up in the health.php config
file:
'notifications' => [
    Spatie\Health\Notifications\CheckFailedNotification::class => ['mail'],
],

'mail' => [
    'to' => env('HEALTH_CHECK_EMAIL', 'healthcheck@posts.today'),
    'from' => [
        'address' => env('MAIL_FROM_ADDRESS', 'healthcheck@posts.today'),
        'name' => env('MAIL_FROM_NAME', 'Health Check'),
    ],
],

'slack' => [
    'webhook_url' => env('HEALTH_SLACK_WEBHOOK_URL', ''),
    'channel' => null,
    'username' => null,
    'icon' => null,
],
Finally, you need to run the command provided by the package every minute:
$schedule->command('health:check')->everyMinute();
You can also write your own checks. For example, I created a QuerySpeedCheck that simply measures the
speed of an important query:
<?php

namespace App\Checks;

use App\Models\Post;
use Illuminate\Support\Benchmark;
use Spatie\Health\Checks\Check;
use Spatie\Health\Checks\Result;

class QuerySpeedCheck extends Check
{
    public function run(): Result
    {
        $result = Result::make();

        // Measure how long an important, frequently used query takes (in ms)
        $executionTimeMs = Benchmark::measure(function () {
            Post::with('author')->orderBy('publish_at')->get();
        });

        // The threshold is arbitrary; measure your own queries first
        if ($executionTimeMs > 500) {
            return $result->failed("Query took {$executionTimeMs}ms");
        }

        return $result->ok();
    }
}
In the sample application I don't have too many database queries, but selecting the posts with authors is
pretty important and happens a lot. So if this query cannot be executed in a certain amount of time it
means the application might be slow. The numbers used in this example are completely arbitrary. Please
measure your own queries carefully before setting a threshold.
Health checks in a cluster
However, in a cluster, we have to do some extra work. Just imagine the following: the scheduler container
runs in 1 replica on a random node (except the database node, of course). This means that at any given
time we're only monitoring one of the nodes, and the check will never run on the database server. What we
want instead is to run the health checks on every node, every minute.
The first step is to add a new stage to the API image. It's pretty similar to the worker or scheduler stages
and runs a health-check.sh script that looks something like this:

#!/bin/bash
# Run the checks once, then exit; Swarm's restart policy re-runs it every ~60s
php /usr/src/artisan health:check
health-check:
image: martinjoo/posts-health-check:${IMAGE_TAG}
command: sh -c "/usr/src/wait-for-it.sh mysql:3306 -t 60 && /usr/src/wait-for-it.sh redis:6379 -t 60 && /usr/src/health-check.sh"
deploy:
mode: global
restart_policy:
condition: any
delay: 60s
window: 30s
depends_on:
- mysql
- redis
I'm using the martinjoo/posts-health-check image which is built in the pipeline (just like the worker or
scheduler images). The interesting part is the deploy mode:
deploy:
mode: global
This guarantees that the container runs on every node in exactly 1 replica. This is called a global service.
There are four different kinds of modes:
replicated is the default. This means we want to run X number of replicas of service on random
nodes (unless you use a placement constraint, of course).
global means that each node should run exactly one replica of this service.
replicated-job : we already used this with the update service. A job is a one-off task: the container
starts, runs a script, and exits. It won't get restarted. replicated-job means the job can run in X
number of replicas; in our case, the update service always runs in 1 replica.
global-job : the same one-off concept, but the job runs exactly once on every node.
So the health-check container runs on every node. Which is great because it can measure the current
server's health.
And finally, we use the same restart-policy as we used with the scheduler service:
restart_policy:
condition: any
delay: 60s
window: 30s
The script runs, it exits, and then Swarm restarts the container after a 60s delay, so it runs roughly every minute.
These health checks are very important in my opinion. And as you can see, it's pretty easy to set them up.
Server resource alerts

On DigitalOcean, you can also create resource alerts for metrics such as:
CPU utilization
Disk space
Memory
This will alert me if the memory usage is above 70% on any of the Droplets tagged as posts .
To be able to run these alerts you need to install the DigitalOcean agent on your servers. Unfortunately, it
needs to be running on the actual host, not inside the container so I added the installation script to the
provision_node script:
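The script isn't reproduced here; DigitalOcean's documented install command for the agent looks like this:

curl -sSL https://repos.insights.digitalocean.com/install.sh | sudo bash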
Error tracking
The next topic I'd like to cover is error tracking. By default, you have only one tool to debug and understand
the errors in your application: laravel.log . Laravel exceptions are quite nice, but let's be honest, we don't
want to spend hours in 1000 lines long log files.
In a cluster environment, the situation gets even worse. In the previous chapter, I created a 4-node Swarm
cluster. The logs are spread across 3 or 4 nodes, and the only command Swarm provides is docker service logs.
To solve these issues you can introduce an error-tracking application such as:
Rollbar
Sentry
Flare
All of them are pretty similar. Usually, they require the same installation and configuration steps. The
dashboards they provide are also very similar. So check them out and choose your favorite one. I've been
using Rollbar for years so I'm going to use it in the following examples as well.
Install Rollbar:
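The command isn't reproduced here; Rollbar's official Laravel package is installed with composer:

composer require rollbar/rollbar-laravel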
Then create a project and a token on the UI and set it in your .env file:
ROLLBAR_TOKEN=your-token
ROLLBAR_TOKEN=${{ROLLBAR_TOKEN}}
api:
image: martinjoo/posts-api:${IMAGE_TAG}
...
environment:
- ROLLBAR_TOKEN=${ROLLBAR_TOKEN}
Add a new GitHub secret called ROLLBAR_TOKEN , and finally change the pipeline so it copies the secret to
the env file:
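Following the IMAGE_TAG example from the deploy job, it's presumably something like:

sed -i "/ROLLBAR_TOKEN/c\ROLLBAR_TOKEN=${{ secrets.ROLLBAR_TOKEN }}" /usr/src/.env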
Of course, these steps depend on your deployment strategy. I'm assuming the same pipeline and Docker
Swarm stack explained in the previous chapter.
'rollbar' => [
    'driver' => 'monolog',
    'handler' => \Rollbar\Laravel\MonologHandler::class,
    'access_token' => env('ROLLBAR_TOKEN'),
    'level' => 'debug',
    'person_fn' => 'Auth::user',
    'capture_email' => true,
    'capture_username' => true,
],
person_fn defines a function ( Auth::user() in this case) that will be used by Rollbar to get the
currently logged-in user.
capture_email means that the email fields from the User model are going to be appended in the
log entry.
After we have a new channel, let's add it to the stack channel (which is the one I'm always using):
'stack' => [
    'driver' => 'stack',
    'channels' => ['single', 'daily', 'stdout', 'stderr', 'rollbar'],
    'ignore_exceptions' => false,
],
You can test the channel by logging something:

Log::debug('Test debug');

Or just by throwing an exception. You should see every item on the dashboard:
Total is an important column. It shows you how many times an error has occurred so far. In the sample
application, I use a log mailer, and the first row on the screenshot is a log "e-mail" I'm getting from the
health check package.
Log management and dashboards
In this chapter, we're going to add log collecting to the Swarm cluster and also add some dashboards that
help us debug the application.
For example, this dashboard shows the distribution of the different request methods, the top URIs, and the
status code distribution:
This one lists only the failed requests and health check status codes:
We're going to use the following stack:

fluentbit: collecting the logs from the containers
Loki: storing them in a database optimized for storing, querying, and indexing unstructured log entries
Grafana: visualizing the logs on dashboards
There are lots of other stacks but the main architecture/concept is usually the same. For example, the ELK
stack contains these components: Elasticsearch (the database), Logstash (the log collector), and Kibana (the
dashboards). Another famous one is the EFK stack. It's almost the same as ELK but the letter "F" stands for
fluentd, so it uses another log collector. And yes, there's a log collector called fluentd and another one called
fluentbit. They are pretty similar but fluentbit is a bit simpler. For our purpose, it's perfect.
JSON logs
To be able to create these dashboards, first, we need a structured log format. The problem with the default
format is that it produces multiline exceptions such as this:
A log collector (such as fluentbit) will collect this exception line-by-line and will become X different log
entries in Grafana.
The other problematic thing is nginx access logs. If you want to make a dashboard such as the status code
distribution above you need to somehow do calculations on the status codes. It's pretty hard if you have log
entries such as this:
With this format, your only hope is some magic with grep. Which is not good.
What I want is JSON log entries. I'd like to see logs such as these:
The first one is a debug message from artisan from the API container
The second one is an error message behind an API endpoint (directly in the api.php route file as you
can see)
The third one is the 500 access log entry caused by the API error
Now imagine how much easier it is to filter for nginx log entries where the status property is either 4xx or
5xx. Or how easy it is to filter for level_name = ERROR in the API logs.
http {
    log_format json '{"time_local":"$time_local",'
        '"remote_addr":"$remote_addr",'
        '"request_method":"$request_method",'
        '"request_uri":"$request_uri",'
        '"status":"$status",'
        '"body_bytes_sent":"$body_bytes_sent",'
        '"http_referer":"$http_referer",'
        '"http_user_agent":"$http_user_agent"}';

    # Presumably the access_log directive then uses the new format
    access_log /dev/stdout json;
We literally just need to create a JSON template using variables provided by nginx. The log format's name is
json . Then we specify that access logs should use the new json log format. That's it! Now nginx will log in
JSON.
After that go to your logging.php config and add a new json channel:
'json' => [
    'driver' => 'monolog',
    'level' => env('LOG_LEVEL', 'debug'),
    'handler' => StreamHandler::class,
    'formatter' => Monolog\Formatter\JsonFormatter::class,
    'with' => [
        'stream' => 'php://stdout',
    ],
    'processors' => [PsrLogMessageProcessor::class],
],
It logs to stdout using the JsonFormatter formatter. That's it! Now we have "cloud native" JSON logs.
I check the current environment at the beginning of logging.php so on localhost I can still enjoy dirty "not
cloud native" format:
'stack' => [
    'driver' => 'stack',
    'channels' => $channels,
    'ignore_exceptions' => false,
],
fluentbit:
image: martinjoo/posts-fluentbit:${IMAGE_TAG}
deploy:
mode: global
ports:
- "24224:24224"
environment:
- LOKI_URL=http://loki:3100/loki/api/v1/push
Since fluentbit collects logs on every node it needs to run on every node so the deployment mode is set to
global . It also needs a LOKI_URL environment variable so it can push the logs to Loki.
As you can see, I also built a custom image for fluentbit. It's pretty simple:
FROM fluent/fluent-bit:2.1.7

# Presumably the config file is copied to the image's default config location
COPY fluent-bit.conf /fluent-bit/etc/fluent-bit.conf
Just as the nginx image it only contains a config file. You don't have to follow this practice, it's highly
subjective. You can just use the official image and then mount the configuration file. However, I find the
custom image solution more "self-contained," and more "stateless."
[SERVICE]
Flush 5
Log_Level error
Daemon off
[INPUT]
Name forward
Listen 0.0.0.0
Port 24224
[OUTPUT]
name loki
match *
host loki
labels job=fluentbit, $sub['stream']
Flush 5 specifies the interval (in seconds) for flushing buffered records to the output plugin.
Log_Level error means fluentbit itself only logs errors. It has nothing to do with your application
logs. It means fluentbit should only report errors about itself.
Daemon off means fluentbit runs in the foreground. In a Docker image, you must have at least one
long-running process to keep alive and running. We did the same with nginx.
The [INPUT] section configures what fluentbit should do with its inputs. In a minute, we'll see that
containers push logs to fluentbit. Containers will forward logs to fluentbit using a TCP connection.
We use the forward protocol with 0.0.0.0:24224 . This protocol will forward anything that comes in on
port 24224. Containers send their logs to fluentbit:24224 and then fluentbit simply forwards them to the
OUTPUT .
name refers to the built-in Loki plugin. It will push logs to loki. This is why the image needs the
LOKI_URL environment variable.
host loki defines the target host where Loki is listening. It's going to be another container.
labels helps us identify log entries later. $sub['stream'] is a special variable provided by fluentbit.
It contains the source of the log message which is stdout in most cases.
If you have trouble understanding these input and output protocols, here's an easier example:
[INPUT]
Name head
Tag head.cpu
File /proc/cpuinfo
Lines 8
Split_line true
[OUTPUT]
Name stdout
Match *
The input plugin is head , which works like the head terminal command: it reads the beginning of a file. So this
config reads the first 8 lines of /proc/cpuinfo and writes them to stdout.
fluentbit is very flexible and it has more of these configuration options, such as:
Parser
Filter
Buffer
loki:
image: martinjoo/posts-loki:${IMAGE_TAG}
deploy:
placement:
constraints:
- "node.labels.db!)true"
volumes:
- loki-data:/loki
volumes:
loki-data:
Loki is a database so it needs a volume to store data. Because it has a state, it needs to be placed on the
same node every time. The image follows the same principle, it's a Loki image with a config file. In most
cases, you can just use a default config file from the examples.
api:
  image: martinjoo/posts-api:${IMAGE_TAG}
  logging:
    driver: fluentd
That's it! fluentd and fluentbit are supported by Docker out of the box. They both work the same way from Docker's perspective, so driver: fluentd is not a typo.
One more thing before moving on. It's a good idea to label the log messages with a container name. Otherwise, you end up with tens of thousands of log messages, and if you want to see only the API's log entries, you won't be able to filter them. Fortunately, we can solve this issue with a simple label:
api:
  image: martinjoo/posts-api:${IMAGE_TAG}
  labels:
    service_name: "api"
  logging:
    driver: fluentd
    options:
      labels: "service_name"
Swarm adds the service_name: api label to the running container, and when Docker sends the logs to fluentbit, it sends the label together with the message. Later, in Grafana, we can use these labels in filters.
grafana:
  image: grafana/grafana:9.5.6
  deploy:
    placement:
      constraints:
        - "node.labels.db==true"
  ports:
    - "3030:3000"
After the stack is deployed, Grafana can be accessed on port 3030. On the homepage, there's a link to choose a data source. Grafana needs to be configured to use Loki:
The URL should be http://loki:3100 since they are in the same stack and Loki is listening on port 3100.
After that, choose the 'Explore' menu. This is the view where you can build queries and save them as dashboards:
The builder view is what you can see in the image. In builder mode, you can visually build queries. In the code view, you can see and write the actual queries. The builder view is a great way to get used to this new query language, LogQL. It has lots of features and introducing everything is outside the scope of this book. Check out the documentation and learn more about it if you're interested. Also, I'm not a monitoring expert at all, but I can show you some basic queries and dashboards.
First, let's build a simple table that shows us how HTTP methods are distributed:
There's an input called Label filter. Always start your queries by choosing job = fluentbit.
If you click on the + Operation button you can see there are different types of operations, for example:
Formats sounds like an easy start. A formatter reads a log entry and parses it in the given format so it can
be used as an object later in the query. For example, now we have log entries such as this:
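A raw entry is one big JSON string, something like this (an illustrative example, not an actual entry from the app):

{"message":"Some debug message","level":100,"level_name":"DEBUG","channel":"production"}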
But LogQL cannot use this in queries because it's just a string. First, we need to parse it:
{
    "level": 100,
    "level_name": "DEBUG"
}
And now we can access the properties such as .level or .level_name and they can be used in LogQL
expressions.
{job="fluentbit"} | json
These formatters and filters can be piped using the | symbol. The json formatter doesn't take any arguments.
And if you now click on Run query it should already give you some results:
74 / 82
Martin Joo - DevOps with Laravel
Of course, it returns every log entry now. To get information about the HTTP methods, we only need nginx logs. If you remember, we added a label in the docker-compose config. In Grafana, we can filter based on these labels. Just choose Label filter expression:
The query:
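At this point, the query looks something like this (a sketch; the nginx label value assumes a service_name label set on the nginx service, just like the api one above):

{job="fluentbit"} | json | service_name = `nginx`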
A label filter is like a where expression, but we need to use the | operator again. The query now returns only logs from nginx containers. But if you take a look at a log entry, it's not quite right:
log is the most important property. It contains the actual nginx log with status, request_method, and so on. But it's still just a string. To query these properties, we need to extract log to the top level and then format it as JSON. In PHP, we would do something like this:
[
    'status' => 200,
    // ...
]
The LogQL equivalent of accessing $logEntry['log'] is this line:

line_format `{{.log}}`

It extracts the log property to the top level. After line_format, we pipe the result into | json again so the extracted string gets parsed into top-level properties.
Now we can query against the actual log entry's properties so let's filter out OPTIONS requests:
| request_method != `OPTIONS`
The last step is to get a count by request_method . A count function looks like this in LogQL:
count_over_time(query, [period])
The first one is the query we have so far. So in LogQL count_over_time wraps the whole query as if it
was something like this in MySQL:
count(
    select id from users
)
Instead of:
select count(id)
from users
The second one is the time period. We'll use this query in a dashboard where we're going to choose a time range. When you check out a dashboard, you rarely ask something like "So let's see what happened in the last 3 years?" but instead "What happened in the last 6 hours?" So it's going to be a variable from a dropdown. There are some special variables we can use in queries. One of them is $__range which stores the value of the time period dropdown in the top right corner.
And finally, we want the sum by the request_method property, which can be expressed like this:
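Putting it all together, the whole query looks something like this (a sketch based on the pieces above; the service_name value depends on your compose file):

sum by (request_method) (
  count_over_time(
    {job="fluentbit"}
      | json
      | service_name = `nginx`
      | line_format `{{.log}}`
      | json
      | request_method != `OPTIONS`
    [$__range]
  )
)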
Now create a new dashboard and paste the query. Under the actual query, there's an Options section. Set Type to Instant:
And finally, on the Transform tab, select Organize fields and set the following options:
With Organize fields you can rename, re-order, or hide columns. As you can see, now we have a nice table showing the request distribution. You can save it to your dashboard as a new panel in the top right corner.
We have the first panel and dashboard. Just copy that panel and change the visualization type to Pie chart, and now you have a pie chart showing the same data.
Top URIs
To get the most popular URIs in the application, we need almost the same query as before. It needs the same formats and filters. It also needs a count_over_time and a sum by, but this time based on the request_uri property instead of request_method.
To get only the top X results, there's an expression called topk, and it works like this:
topk(n, query)
I have good news: you can use the exact same query, just with different properties and a different visualization type:
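For example, something like this (a sketch following the same pipeline as before; the first argument of topk and the label value are placeholders you'd adjust):

topk(10,
  sum by (request_uri) (
    count_over_time(
      {job="fluentbit"}
        | json
        | service_name = `nginx`
        | line_format `{{.log}}`
        | json
      [$__range]
    )
  )
)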
Now set the visualization type to bar chart and you're done.
Failed requests
One more "where" expression filters out the 2xx and 3xx status codes. The visualization type is logs.
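The extra filter can be as simple as this (a sketch, assuming the status property extracted from the nginx log line):

| status >= 400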
The health check panel is also an easy one. The only difference compared to the other queries is an additional label filter expression, because this time we only want the /api/health-check requests:
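Something like this (a sketch following the same pipeline):

| request_uri = `/api/health-check`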
As you can see, the number of possibilities is endless, so try out Grafana, run these queries, play around with them a little bit, and try to create your own useful charts and dashboards.
Conclusions
As I said in the introduction, learning Docker Swarm is super easy. It's very similar to docker-compose with
only a few new commands. The question is: when should you use it?
I think if you want to scale your application and move into a cluster, Swarm is an excellent choice as the first step, especially if you don't have Kubernetes knowledge in your team. If you have a dockerized application, you can move into a cluster in a day or so. And then you'll have experience in the cluster world. Which is crucial! You'll probably face problems you never knew existed before. If your project grows even further and Swarm starts to feel limiting, just move to Kubernetes. I think it's a valid path. Certainly better than jumping into k8s without any experience. Of course, I'm talking about a production application with reasonable traffic and user base.
Thank you very much for reading this book! If you liked it, don't forget to Tweet about it. If you have questions, you can reach out to me here.