11. November 2021
Soft factors of microservices
Since I started building microservice architectures the first time back in 2018 I felt some Saudade1 when touching one of my monolithic Django Applications; The simplicity, speed of development and straight-forward tooling was captivating. While you can transform a monolithic application into a microservice mesh, it amounts to gradually rebuilding the monolith. How can these models be so different?
Let's start to define what I think defines a monolith. Many products start as a bunch of code run in a single runtime system, executing the business logic. Thus your application is deployable to a virtual server or a single container. Maybe the code's organization is based on a framework giving you a fast and guided start by providing basic functionality like authentication, a data access layer like an ORM and probably some request routing and handling utilities. From there, developers add feature by feature and tweak the often manifold configuration of the underlying framework.
The monolith might be complemented with indeed some units in their own runtime system like a cache, a message queue temporarily storing background tasks that are eventually picked up by a process within the runtime running the business logic. If your system supports multiple customers without spinning up a new instance for each one (multi-tenancy), you might decide to scale your monolith by adding more servers and distribute the requests with a load balancer in front.
This monolithic model might carry you far, in fact, even major products out there are still based on such an architecture. It's only that you notice after years into developing and maintaining your product, that you make a subset of these worrying observations:
- It becomes complex to implement simple features when they have touch-points with data from different domains within your system.
- You encounter difficulties to make the database perform well at all times, especially when joining large tables or using transactions.
- You develop fears to revisit your data schema design because migrations get harder to write without breaking business logic.
- Your maintenance cost steadily increase because tracking down bugs requires engineers knowing all the inter-winded parts of the system.
- You don't feel comfortable deploying a new version of the monolith because starting it up, warming caches tends to fail more often.
While still mostly up and functional, that promise of rapid innovation that you gave to your clients becomes shallow.
By the rigorous decision to introduce some heavy restrictions on how to compose your product when starting off adding building blocks to your system, we can achieve a product of probably longer life-time, but for sure more consistent cost for changes. Moving towards microservice architectures, the thought process in our coder-cortex also has to shift, and it does not feel extremely sophisticated at first. It becomes more similar to the approach I used back at school to build the small programs helping me to get my repetitive homework done.
Facing requirements for a new feature to add some functionality to your system, the central question shifts from
Where do I add this feature in my code?
How do I divide the feature into independent parts?
This main question you will continuously reconsider during refining features at the requirements level and during development. While we strive for independent parts of logic and data units, I admit that for every useful product, independent is idealistic. It's rather as independent as possible while not compromising the user's experience too much.
However, if you can't outline the functionality of a new service or after your change to an existing service from the top of your head, it's probably not the right division of the problem. And more so, the boundary (a.k.a. interface) of each component in that division should be able to be explained to a workmate standing at the coffee machine, otherwise the interaction is probably too sophisticated. The simplicity of those service interfaces can be estimated by their size; smaller is simpler. That's why the interface per service, including returned errors, should be kept as small as possible.
Code is never bug-free (the exception are programs proofed to be correct by model checking, maybe running on satellites that can't be updated once shot into space). So resilience can't be based on the absence of erroneous code and more drastically, even a noble 100% test coverage will not guarantee that (just lookup the difference of branch coverage VS line coverage if you are doubtful).
I like to see a software system's evolution as a step function:
Usually, the function increases the software's value for users by every deployment. At the same time, every deployment comes with a more or less random risk of decreasing the value for certain or all users, depending on the feature set they use and the impact a feature degradation has. Within this model, "is a system resilient to newly added code with bugs" is a question of how well the system operates despite the bugs and how quickly the system can be set back to a state where the system was good enough for its users. Yes, I'm talking about rollbacks.
We often want services realising the business logic to be stateless. The stateless requirement does not only mean that there is no need for a persistence layer within that service, it's a much harder requirement: the inter-service logic is not compromised when your service is going down at any time! With that, we are approaching the main benefit I denote to the architectural model of microservices: The increased resilience by the separation of
the units whose behaviour is affected by a deployed code change,
the units or connections where a potential bottleneck leads to slow-downs or failures.
While stateless services are easy to scale and to reason about, most useful applications deal with some kind of state, somewhere. "The state lives in some type of database that is scaled by my favourite cloud provider", you might think. Sadly, even when your one big database system scales to a certain point performance-wise, it becomes your "complexity-bottleneck" because it is just too easy to couple every data point with each other. More so, it is seen as proper design to organize the database schema in Third Normal Form. If you are databases indeed becomes a performance bottleneck, you can protect the clients to some degree by caching responses and queuing non-urgent updates. But after all, this just postpones the problem to the next round of implementing new features.
Databases are also critical when seeking the holy grail of zero downtime deployments. Hopefully you are using a database that supports transactional schema migrations (in case it has a schema). But even with that, we might introduce an unforeseen problem in the business logic with a schema change. Then you are either fast in fixing the logic or you will rollback the schema change, potentially loosing new data points created in the meanwhile.
With the certainty that bugs (and changes shifting the bottlenecks) while be introduced, we should be concerned about mitigating the negative effects. Some of these changes are too severe to keep in the system until they are fixed, with which we reach the conclusion that every change to the system must be revertible. This includes changes to data schemas and infrastructural changes. This is why, in an idealistic setup, every change to a system should be put into code and under version control.
While teams are now aware of backwards schema migrations for their schema-full databases, infrastructure is usually still setup and modified with manual intervention, even with infrastructure-as-code tools like terraform. Because it's hard to make every change reversible when it comes to stateful products like databases. It's still worth considering being rigorous about reversibility to achieve fully hands-free continuous delivery.
Many blogs claim microservices are popular because they achieve horizontal scalability; simply adding more copies2 of a service should make it handle more load, as opposed to vertical scaling of monoliths where you will hit certain limits that are impossible or very costly to surpass.
I believe that does not capture the main advantage of microservices. By using an external database you are already separating the logical layer from your data on an infrastructural level. So the logic in a monolithic architecture can be scaled horizontally. But your database is still accessed from increasing number of code points, transactions might interfere more often when adding features. The crux to achieve straight-forward horizontal scalability for your data layer as well is to split your data along with the corresponding business logic.
Don't be fooled: In a microservice architecture bottlenecks will still occur (and it might just end up to be that external API you are not controlling). Simply setting the services to auto-scaling will just move the bottleneck somewhere else in the service mesh. So whatever architecture, you will need careful monitoring to understand which parts are potential bottlenecks and what's their impact on the user's experience. Once identified, you can react and apply a granular change thanks to the microservice architecture:
- don't change anything because the bottleneck does not hurt the user's experience
- eliminate a dependency's effect on your performance:
- remove dependency ("do it yourself")
- for reading from dependency: use caching and smartly detect/hide stale data from user
- for writing to dependencies: execute update asynchronously or regularly synchronise data from dependency and replicate logic in microservices (often cumbersome!)
- make the client not wait on a response of a slow API call by making the entire call asynchronous, returning an ID to later request the status
I locate the underestimated "soft-benefit" of microservices in the improved scalability of development. Smaller teams with less coordination can take ownership of a set of services and negotiate their concise interfaces with its clients (both other developers and users). On the flip side, it can get more complicated to ensure the inter-service logic implements exactly the behaviour that best for the user experience. You can learn more about how we can keep control about getting this logic right and how we can write expressive but fast test in development in the article about Introducing a kit for transitional services.
Image Source: https://unsplash.com/photos/qjX0QBtDXto