Modernizing Hotjar’s architecture for a faster flow of value

In this article, we examine how we modernized Hotjar's architecture (note that Hotjar was acquired by Contentsquare in 2021) to deliver value faster to our customers. We start by explaining why we needed to modernize it and which principles guided us throughout the journey. Our aim is to provide insights into how these changes helped us improve and adapt in a rapidly evolving market.

Last updated: 29 Jul 2024

Reading time: 12 min

What makes a good software architecture? 

Software architecture needs to be fit for purpose. It must accomplish its goals while optimizing for what’s relevant in the given business context. This might sound a bit abstract, so let's clarify with an example from house construction, originally described by Gregor Hohpe in The Software Architect Elevator book.

Most houses in Czechia look like the one on the left. They have steep roofs and brick walls. This design is optimal because the country experiences snow throughout the winter, and the steep roof helps shed the snow, while the brick walls insulate the heat.

The house on the right is very different. It has glass walls and a flat roof extending beyond those walls. This architecture is optimized for a very different climate. It’s ideal for areas with no snow and higher temperatures. The roof provides shade during the summer when the sun is high and helps keep the house warm in the winter when the sun is low.

This is exactly what a good software architecture should be: optimized for your company's relevant business needs.

The initial purpose of the Hotjar architecture: small team, simplicity, and short time to market

Hotjar started as a SaaS tool with three main components: a JavaScript snippet that customers installed on their websites, a frontend web application that customers used to understand visitor behavior, and a backend component that powered the other parts.

As with almost every other start-up, the initial goal was to quickly validate the product-market fit. This required delivering value to customers with a small team and a very short time to market. Initially, the team consisted of only full-stack engineers. A monolithic architecture with monolithic deployments of all the main components from a single code repository made sense to start with.

As we wanted our engineers to handle deployments (to get direct feedback from production) and work in small cohesive batches (to reduce the risk of each deployment and enable quick rollbacks if needed), our development process of choice was feature-branch delivery. Engineers made their changes in short-lived branches and queued when they wanted to deploy their changes to production (deploy to production from the branch, validate, and then merge to the main branch).

This meant we had a single deployment queue on our monolithic repository.

When architecture becomes a problem: identifying bottlenecks during 5x organizational growth

With a growing number of team members, the monolithic architecture with a single deployment queue started affecting our ability to rapidly deliver value to customers and increased the time to market of our changes. With around 25 engineers in the organization, deployments had already become the bottleneck of our delivery process, as per the theory of constraints.

Visualization example of the Theory of Constraints, Image credit: Flow System

We observed this issue in our metrics, measuring both the number of deployments per day (hitting a hard limit of four on good days, but only being able to deploy one or two on bad days) and the time engineers spent waiting in the deployment queue for their turn (a couple of hours). We also confirmed this with qualitative feedback from engineers in development surveys. The lead time for changes to individual tickets was typically around five days or more. 
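To make the two quantitative signals above concrete, here is a minimal sketch of how deployments per day and queue wait time could be derived from deployment records. The `Deployment` record shape, the function names, and the sample data are all hypothetical; the article does not describe the tooling that was actually used.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass(frozen=True)
class Deployment:
    """One production deployment (hypothetical record shape)."""
    queued_at: datetime   # when the engineer joined the deployment queue
    started_at: datetime  # when the pipeline started for this change


def deployments_per_day(deployments):
    """Count deployments per calendar day, keyed by the pipeline start date."""
    return Counter(d.started_at.date() for d in deployments)


def mean_queue_wait(deployments):
    """Average time a change waited in the queue before its pipeline started."""
    waits = [(d.started_at - d.queued_at).total_seconds() for d in deployments]
    return timedelta(seconds=mean(waits))


# Made-up sample data, for illustration only.
sample = [
    Deployment(datetime(2024, 1, 8, 9, 0), datetime(2024, 1, 8, 10, 0)),
    Deployment(datetime(2024, 1, 8, 9, 30), datetime(2024, 1, 8, 12, 30)),
    Deployment(datetime(2024, 1, 9, 9, 0), datetime(2024, 1, 9, 11, 0)),
]

per_day = deployments_per_day(sample)
average_wait = mean_queue_wait(sample)
```

Tracking both numbers together matters: a flat deployment count combined with a growing wait time suggests that demand on the queue is outpacing its capacity.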

This situation led to another negative impact: engineers started increasing the batch size to avoid multiple deployments, which increased the risk of each deployment. Larger batch sizes meant more changes were bundled together, making it harder to isolate and fix issues when they arose. It also led to longer and more complex PR reviews, increasing the likelihood of missing critical issues. This ultimately resulted in more challenging and riskier deployments.

Note: we were using deployment frequency as a proxy metric to measure how often we could deliver value. We know it’s not precise and can be challenged, but we’ve used it in combination with other metrics that helped us paint the full picture. The key underlying assumption for this proxy metric was that the more often we were able to ship code as an organization, the more often we could deliver value. This ties into the broader concept of the ‘flow of value,’ which is crucial for understanding how to improve delivery processes.

In software delivery, the ability to visualize, measure, and manage the flow of value is essential to achieving faster and more consistent delivery. It requires understanding capacity, identifying bottlenecks, and proactively prioritizing work that helps improve the flow.

Further reading: What is flow and why does it matter?

We also used other metrics, mainly DORA and SPACE metrics, to understand our challenges at an organizational scale.

Knowing that we were already constrained and the company planned to grow (with the engineering team expected to be five times its current size in two years), there was a strong need for change. 

Removing the bottleneck: improving deployment performance

According to the theory of constraints, the only improvement that can affect flow is an improvement in the bottleneck.

Our bottleneck was the deployment step. With increased engineering capacity, we had only added more work to the bottleneck, resulting in decreased throughput of the overall system.
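The core idea can be illustrated with a deliberately simplified model in which the throughput of a serial flow is capped by its slowest stage. The stage names and capacities below are invented for illustration and are not Hotjar's actual numbers. The model also ignores second-order effects such as larger batches and longer waits, which is why it shows throughput staying flat rather than decreasing.

```python
def system_throughput(stage_capacity):
    """Throughput of a serial flow is capped by its slowest stage."""
    return min(stage_capacity.values())


def bottleneck(stage_capacity):
    """Name of the stage with the lowest capacity."""
    return min(stage_capacity, key=stage_capacity.get)


# Hypothetical capacities, in changes per day.
stages = {"develop": 12, "review": 10, "deploy": 4}
baseline = system_throughput(stages)

# Doubling development capacity leaves the overall throughput unchanged...
after_hiring = system_throughput({**stages, "develop": 24})

# ...whereas improving the bottleneck raises it.
after_fix = system_throughput({**stages, "deploy": 10})
```

In this toy model, work that cannot pass through the `deploy` stage simply piles up in front of it, which matches the growing queue described above.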

Our first focus was to improve the bottleneck's actual performance. The deployment pipeline took a few hours and contained manual steps that were practically unbounded: a step stayed open until an engineer confirmed it, so there was no upper limit on how long it could take.

The slowest part of the pipeline was the automated tests. The full test run took 1.5 hours, with unpredictable results due to flakiness. We optimized the tests, parallelized their execution, introduced strict rules for addressing flakiness, and shortened this step to 15 minutes. We also addressed some of the manual steps of the deployment pipeline: pre-deployment testing in the staging environment (which we made optional and later replaced with PR-scoped review environments) and post-deployment monitoring in production (which we automated). Each of these steps had taken at least 20 minutes, with no upper bound. Together, these changes increased our deployment capacity from four per day to 10. After that, we hit the point of diminishing returns and had to look for a different strategy.
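A rough back-of-the-envelope model shows why pipeline duration caps the capacity of a single serialized queue. The eight-hour deployment window, the step names, and the 20-minute duration of the automated monitoring step are assumptions. Only the 1.5-hour and 15-minute test runs and the 20-minute minimum for manual steps come from the text. The model ignores build time, retries, and waiting on people, so it is not meant to reproduce the figures of four and 10 deployments per day.

```python
def max_deployments_per_day(step_minutes, window_minutes=8 * 60):
    """Upper bound on deployments through one serialized queue.

    Only whole pipeline runs that fit into the daily window are counted.
    """
    pipeline_minutes = sum(step_minutes.values())
    return window_minutes // pipeline_minutes


# Step durations in minutes; manual steps use their 20-minute minimum.
before = {
    "automated_tests": 90,
    "staging_check": 20,
    "production_monitoring": 20,
}

# Staging check made optional; monitoring automated (duration assumed).
after = {
    "automated_tests": 15,
    "production_monitoring": 20,
}

capacity_before = max_deployments_per_day(before)
capacity_after = max_deployments_per_day(after)
```

Because the queue is serialized, every minute removed from the pipeline is a minute returned to every engineer waiting behind it, which is why shortening the test run had such a large effect.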

Increasing flow: expanding queues and unblocking frontend engineers

Initially, the team consisted of full-stack engineers, making it practical for all three subsystems to be modified in one batch. However, as the team grew, the specialization of engineers became more pronounced. It became increasingly rare for the frontend to be modified in the same PR as the backend. Additionally, it became more difficult to hire full-stack engineers, and with further specialization of roles, the current architecture became outdated. 

Revisiting the architecture to meet the new needs of the growing organization was inevitable. We knew we needed more independent queues of work to increase the system's throughput.

By analyzing the number of changes in individual components, it was clear that we could achieve some quick wins with a pragmatic approach. We started by splitting the three main components into their own repositories, creating three independent queues for deployment: Backend, Insights Frontend, and Client Script. The approach we chose is known as tactical forking: copy-pasting all code into a new repository and removing the parts that are no longer needed.

The new state of architecture after that step is shown in the following diagram.

We immediately observed a positive impact on our metrics. The number of deployments per day increased to 20, and the time engineers spent waiting in the queue decreased for most repositories. Our front-end engineers were able to move much faster than before. However, we knew this was only a quick win, and we couldn't stop here with the planned growth of the organization.

We began working on a strategy for both frontend and backend components to increase the flow of value by unlocking more independent queues of work, thereby boosting the overall productivity of our growing engineering organization. Additionally, we aimed to optimize ownership, ensuring that teams had clear responsibility over their respective areas.

What’s next? Micro-* right?

The front end and back end faced different challenges, so we took a different approach for each. For the front end, we decided to adopt a mono-repo pattern, refactor the code using domain-driven decomposition, and pave the way for independent deployment queues with micro-frontends in the future. It wasn’t a critical short-term goal, as the metrics showed enough breathing room for front-end deployments. For the rest of the story, we focus on the backend architecture, as that’s where we had the most bottlenecks.

On the back end, we aimed for loosely coupled services that could be independently deployed from their own repositories. We also wanted to address other challenges, such as the lack of ownership, complex onboarding for new engineers in the monolithic repository, inability to get fast feedback on the local development environment (the monolith being too big and tests too slow), and increasing the resiliency and fault tolerance of our backend architecture overall.

We explicitly avoided the term ‘microservices’ due to the industry hype and baggage it carries—and because we didn’t need them yet. The goal of a loosely coupled architecture is to unlock multiple independent queues of work. Multiple modular monoliths or ‘chunkier’ services can achieve this in the same way as microservices, perhaps with less investment and initial risks. Modular monoliths reduce complexity by maintaining fewer services, which simplifies deployment, monitoring, and debugging. Additionally, they allow teams to iterate quickly without the overhead of managing numerous independent services. One of the most challenging parts of distributed architecture is finding the right service boundaries and avoiding creating distributed monoliths. 

If we went too granular straight away, things would become too complex very quickly due to increased operational overhead (our central SRE team would quickly become the next bottleneck), the risk of premature optimization, fragmented development efforts, more difficult onboarding for new team members, and the need to solve the same problems across multiple services simultaneously. So we aimed to start small and learn from our mistakes along the way. We began by delivering two new features outside of the backend monolith in separate repositories.

As part of this journey, we also had to solve the usual challenges that come with distributed architectures, such as routing, authentication and authorization, inter-service communication, and more.

What were the metrics telling us along the way?

We were growing the engineering organization quite fast: we’d more than doubled in less than a year. We observed that while more front-end engineers brought more deployments over time (FE insights on the graph below), the situation was very different on the backend. The number of deployments made within the backend monolith (Insights BE on the graph) was practically flat despite doubling the number of BE engineers. Interestingly, one of the new backend services, built for the new billing system, was showing promising trends.

Don’t avoid the difficult problems

The metrics showed that we were onto something. Also, qualitative feedback from working on separate services showed that the overall feedback loop was much shorter. The main challenge was that there weren’t enough paved roads, so only the bravest were able to go on that journey.

However, we needed more queues of work on the backend. Our developer experience team focused on enabling teams to create these services more easily and invested in education, e.g., running workshops to lower the barrier to entry for this type of work.

Within a year and a half of starting, we unlocked seven additional queues of work on the backend, unlocking 40 backend deployments per day. Around 35% of our engineers worked in these separate queues.

After some time, we noticed that almost all new features were being built outside of the backend monolith with ease (we were actually tracking how quickly a team could set up a new service from scratch). However, we hadn’t really tackled the problem for any of the existing features, which meant that the teams working on them were stuck with a single queue of work. That was still around 65% of our backend engineers. When we dug into the reasons, it became clear that for some of the existing features we would have to address more critical and difficult issues, such as heavy dependencies on data in the monolithic database. For most of the new features, engineers had found creative ways to avoid addressing these. We knew that if we really wanted to move ahead, we needed to address this problem.

We analyzed our backend monolith, applying a couple of heuristics to find the right features to extract next (based on criteria such as complexity, change frequency, planned initiatives, coupling to the rest of the code, the number of engineers working on them, and clarity of ownership). We planned to extract two existing features from the monolith: the remainder of the processing pipeline and our voice of customer tools, such as surveys. We created a tiger team and focused our effort on extracting these two sequentially, while addressing data dependencies in a way that other areas could later reuse.
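One way to turn criteria like these into a ranking is a simple weighted score. The sketch below is purely illustrative: the weights, the 0-to-1 scales, the feature names, and the choice to count complexity and coupling against a candidate are all assumptions, not the heuristics we actually applied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A feature considered for extraction; all values are on a 0..1 scale."""
    name: str
    change_frequency: float
    engineers: float            # share of engineers working on the feature
    planned_initiatives: float
    ownership_clarity: float
    complexity: float
    coupling: float             # coupling to the rest of the monolith


# Hypothetical weights: positive values favor extraction, negative values
# count against it.
WEIGHTS = {
    "change_frequency": 2.0,
    "engineers": 1.5,
    "planned_initiatives": 1.0,
    "ownership_clarity": 1.0,
    "complexity": -1.0,
    "coupling": -1.5,
}


def extraction_score(candidate):
    """Weighted sum of the candidate's criteria."""
    return sum(w * getattr(candidate, field) for field, w in WEIGHTS.items())


def rank(candidates):
    """Order candidates from most to least attractive to extract."""
    return sorted(candidates, key=extraction_score, reverse=True)


# Made-up candidates, for illustration only.
candidates = [
    Candidate("feature_a", 0.9, 0.8, 0.7, 0.9, 0.6, 0.5),
    Candidate("feature_b", 0.3, 0.2, 0.1, 0.4, 0.8, 0.9),
    Candidate("feature_c", 0.6, 0.5, 0.9, 0.8, 0.3, 0.2),
]

ranked = rank(candidates)
```

A score like this is best treated as a conversation starter rather than a decision: the inputs are estimates, and a team's context can easily outweigh a small difference between two candidates.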

Throughout these two initiatives, we saw a slightly increased amount of instability (our change failure rate went slightly up), but making mistakes and learning from them is an important part of the journey. Both initiatives went quite well and enabled other teams to follow as well.

Your desired architecture is a moving target

Within two and a half years of starting, we created 28 independent queues of work on the backend, unlocking up to 100 deployments per day. Deployment is no longer the bottleneck for our organization. 

While we've made significant progress, our journey doesn't end here. We remain committed to staying alert for new bottlenecks that may arise and refining our service boundaries for a fast flow of value. Continuous improvement is at the heart of what we do, and we understand that our desired architecture will always be a moving target.

As we move forward, we need to embrace constant learning and adaptation, knowing that the path ahead will present new challenges and opportunities. Our goal is to keep delivering value efficiently and effectively, ensuring that we meet the evolving needs of our customers and the market.
