Over the past few years we’ve seen microservices architecture rise in popularity. There are plenty of resources on how to implement it correctly but, quite often, people talk about microservices as if they were a silver bullet. There are many arguments against microservices, and the most relevant one is that this kind of architecture carries a lot of accidental complexity that stems from how you manage the relationships between your services and teams. You can find plenty of literature about why microservices may not be a good choice depending on your circumstances.
At letgo we migrated from a monolith to microservices to meet our scalability requirements, once we had verified the beneficial impact this would have across all teams. When correctly applied, microservices gives us several advantages, namely:
- Scalability of the application: In our experience, the main pain point in scaling an application is its infrastructure. Microservices encourages modularization of both code and infrastructure (databases, etc.). In a well-implemented microservices architecture, each service owns its infrastructure: the Users database can only be accessed (read and write) by the Users service.
- Organizational scalability: As we’ve observed, microservices helps to fix organizational issues and gives us a framework for managing a large codebase that several teams are changing. Splitting the codebase prevents conflicts when making changes. In our experience, working in big teams doesn’t scale efficiently, so once we decided to split our engineers into small teams it made sense to split our codebase into small components too. Organizing a company in small teams also fosters ownership.
Not all microservices architectures are event-oriented. Many people advocate synchronous communication between services in this kind of architecture (gRPC, REST, etc.). At letgo we try not to follow this pattern; instead, our services communicate asynchronously with domain events. Our reasons for doing so are:
- Improved scalability and resilience: Dividing a large system into smaller ones helps to contain the impact of failures. For example, a DDoS attack or a traffic spike in one of our services shouldn’t affect other services. If your services communicate synchronously, the chances of one service effectively DDoSing another increase; in that case we can say our services are too coupled. For us, the key to increasing a microservice architecture’s scalability and resilience is how rigid the boundaries of each service are and how you enforce communication between them.
A bulkhead is an upright wall within the hull of a ship that creates watertight compartments that can contain water in the case of a hull breach or other leak.
- Decoupling: A change in a service shouldn’t affect another one. We think practices like synchronizing deploys of multiple services are bad smells because they add a lot of complexity to our operations. There are some ways to mitigate this, like API versioning, but in our experience using domain events as a service’s public contract helps to model its domain in a way that doesn’t impact other services. A user entity in our Users service shouldn’t be the same as a user entity in the Chat service.
Based on this, we encourage asynchronous communication between services at letgo and only allow synchronous communication in very special cases, like feature MVPs. We do this because we want every service to build its own entities based on the domain events that other services publish in our Message Bus.
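As an illustrative sketch of the producer side, with a toy in-memory bus standing in for a real message bus (all names here are assumptions, not letgo’s actual code):

```python
import json
import uuid
from datetime import datetime, timezone


class InMemoryBus:
    """Toy stand-in for a real message bus (Kafka, RabbitMQ, etc.)."""

    def __init__(self):
        self.published = []

    def publish(self, topic, payload):
        self.published.append((topic, payload))


def publish_user_registered(bus, user_id, email):
    # A domain event describes something that already happened (past tense).
    event = {
        "id": str(uuid.uuid4()),
        "type": "user_registered",
        "occurred_on": datetime.now(timezone.utc).isoformat(),
        "attributes": {"user_id": user_id, "email": email},
    }
    bus.publish("users", json.dumps(event))
    return event


bus = InMemoryBus()
event = publish_user_registered(bus, "42", "jane@example.com")
print(event["type"])  # user_registered
```

The Users service only states facts about its own domain; it neither knows nor cares which services consume them.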
In our opinion, the success or failure of a microservices architecture depends on how you handle the inherent complexity it carries, which in turn relies on how all your services communicate with each other. Splitting the code without splitting the infrastructure or switching communication to asynchronous tends to produce a distributed monolith.
letgo’s Event-Oriented Architecture
I want to share an example of how we use domain events and async communication at letgo. Our User entity exists in a lot of services, but its creation and editing originate in the Users service. We store a lot of data in the Users service database, like name, email, avatar, country, etc. In our Chat service we also have the concept of a user, but we don’t need the same data that the Users service stores. In our conversation list view, we only show the username, avatar and ID (to link to the profile). We say that in Chat we have a projection of the user entity that contains only partial data. In fact, in Chat we don’t speak about users; we call them “talkers”. This projection belongs to the Chat service and is built from events that Chat consumes from the Users service.
We do the same with listings. In the Products service we store n pictures of every listing but in the conversation list view we only show the main one, so our Products projection in Chat only needs one picture instead of n.
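A minimal sketch of how such a projection could be built from consumed events (the event shapes and names are assumptions for illustration):

```python
def project_talker(projections, event):
    """Apply a Users-service domain event to Chat's "talker" projection.

    The projection keeps only the partial data Chat needs:
    username and avatar, keyed by user id.
    """
    attrs = event["attributes"]
    if event["type"] == "user_registered":
        projections[attrs["user_id"]] = {
            "username": attrs["username"],
            "avatar": attrs["avatar"],
        }
    elif event["type"] == "user_avatar_updated":
        projections[attrs["user_id"]]["avatar"] = attrs["avatar"]
    return projections


talkers = {}
project_talker(talkers, {
    "type": "user_registered",
    "attributes": {"user_id": "42", "username": "jane", "avatar": "a.png"},
})
project_talker(talkers, {
    "type": "user_avatar_updated",
    "attributes": {"user_id": "42", "avatar": "b.png"},
})
print(talkers["42"]["avatar"])  # b.png
```

Chat owns this projection outright; the Users service never queries it and Chat never queries the Users database.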
Conversation list view in our chat, showing which backend service produces each piece of information.
If you take a look at the conversation list view again, you’ll see that almost none of the data we show is created by the Chat service, yet all of it is owned by Chat, because the User and Product projections are Chat’s property. There’s a tradeoff between availability and consistency in projections which we can’t cover in this post, but we can say that it’s obviously easier to scale a lot of small databases than one huge one.
Simplified view of letgo’s backend architecture
Some intuitive solutions often turn out to be mistakes. Here’s a list of the most important antipatterns we’ve seen in our architecture related to domain events.
1. Fat events
We should keep our domain events as small as possible without losing domain meaning, and be especially careful when refactoring legacy codebases with big entities to an event-driven architecture. These kinds of entities can lead us to fat events, and since our domain events become our public contract, we need to keep them as simple as possible. In this case, we think it’s better to approach these refactors from the outside in: first we design our events using techniques like event storming, and then we refactor the service’s code to adapt it to those events.
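To illustrate the difference (payloads here are hypothetical, not letgo’s contract): a fat event leaks the whole entity into the public contract, while a lean event carries only the fact and the data it needs.

```python
# Fat event: drags the whole user entity into the public contract,
# coupling every consumer to every field of the entity.
fat = {
    "type": "user_updated",
    "attributes": {
        "user_id": "42",
        "name": "Jane",
        "email": "jane@example.com",
        "avatar": "b.png",
        "country": "ES",
    },
}

# Lean event: states what happened and nothing more.
lean = {
    "type": "user_avatar_updated",
    "attributes": {"user_id": "42", "avatar": "b.png"},
}

print(sorted(lean["attributes"]))  # ['avatar', 'user_id']
```

A consumer of the lean event only breaks if the avatar concept itself changes, not whenever any user field does.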
We also need to be careful with the “user and product problem”: a lot of systems tend to have users and products, these entities tend to attract all of the logic, and as a result all domain events end up coupled to them.
2. Events as intentions
A domain event, by definition, is something that has already happened. If you’re publishing something in the message bus to request that something else happens in another service, it’s probably an async command instead of a domain event. As a rule of thumb, we name our domain events using the past tense: user_registered, product_published, etc. The less a service knows about the others, the better. Using events as commands couples services and increases the chances that a change in a service will affect others.
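A small illustration of the distinction (names are hypothetical):

```python
# Domain event: a fact, named in the past tense.
# Consumers decide for themselves what to do with it.
event = {"type": "user_registered", "attributes": {"user_id": "42"}}

# Async command disguised as an event: an instruction aimed at a
# specific service, which couples the publisher to that consumer.
command = {"type": "send_welcome_email", "attributes": {"user_id": "42"}}
```

If the notifications team later decides welcome emails should become push notifications, the event-based version requires no change in the Users service; the command-based one does.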
3. No agnostic serialization or compression
The serialization and compression of our domain events should be agnostic of any programming language. You shouldn’t even need to know which languages the consumers of your events are written in. That’s why we can’t use the native PHP or Java serializer, for example. Take your time, as a team, to discuss and choose your serializer, because changing it in the future is complex and hard. At letgo we’re using JSON, but there are a lot of good serialization formats with good performance.
4. No structure standardization
When we started to migrate the letgo backend to an event-oriented architecture, we agreed upon a common structure for our domain events.
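As an illustrative sketch (the field names below are assumptions, not letgo’s exact schema), such an envelope typically separates the event’s identity and payload from metadata about its production:

```json
{
  "data": {
    "id": "c0f2c8a0-5c4e-4d2a-9f3e-1a2b3c4d5e6f",
    "type": "user_registered",
    "occurred_on": "2019-03-01T12:00:00Z",
    "attributes": {
      "user_id": "42",
      "email": "jane@example.com"
    }
  },
  "meta": {
    "producer": "users-service"
  }
}
```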
Having a common and shared structure for our domain events enables us to integrate our services quicker and implement some libraries with abstractions.
5. No schema validation
We’ve experienced some problems with serialization at letgo related to programming languages that don’t have a strongly typed system.
A strong testing culture that covers how our events are serialized, plus knowing how the serialization library works, helps a lot to mitigate this. At letgo we’re migrating to Avro and Confluent Schema Registry, which will give us a single point of definition for our domain event structure and will allow us to avoid these kinds of errors, as well as outdated documentation.
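For illustration, Avro schemas are themselves JSON; a minimal schema for a hypothetical user_registered event (the namespace and field names are assumptions) could look like:

```json
{
  "type": "record",
  "name": "user_registered",
  "namespace": "com.letgo.users",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "occurred_on", "type": "string"},
    {"name": "user_id", "type": "string"},
    {"name": "email", "type": "string"}
  ]
}
```

Registering this schema in the Schema Registry makes producers fail fast on invalid payloads instead of shipping malformed events to every consumer.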
6. Anemic domain events
As we said before, and as their name suggests, domain events should have a meaning at the domain level. Just like we try to avoid inconsistent states in our entities, we need to avoid the same in our domain events. Let’s illustrate this with an example: a product in our system is geolocated with a latitude and longitude that are stored in two different fields of a products table in the Products service. All products can be “moved”, so we have domain events to represent this update. We used to have two events to represent it: product_latitude_updated and product_longitude_updated, which doesn’t make much sense unless you’re a rook in a game of chess. In this example, it makes more sense to have an event like product_location_updated or product_moved.
The rook is a piece in the game of chess. Formerly the piece was called the tower.
The rook only moves horizontally or vertically, through any number of unoccupied squares.
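Returning to the example, the difference can be sketched with illustrative payloads (field names are assumptions):

```python
# Anemic: one domain fact ("the product moved") split into two events,
# each representing a state no one in the domain ever talks about.
anemic = [
    {"type": "product_latitude_updated",
     "attributes": {"product_id": "7", "latitude": 41.38}},
    {"type": "product_longitude_updated",
     "attributes": {"product_id": "7", "longitude": 2.17}},
]

# Meaningful: one event that captures the whole fact at the domain level.
moved = {
    "type": "product_moved",
    "attributes": {"product_id": "7", "latitude": 41.38, "longitude": 2.17},
}
```

A consumer of the anemic pair must correlate two events to know where the product actually is; a consumer of product_moved never sees a half-updated location.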
7. No tooling for debug
At letgo we produce thousands of domain events per second. All these events become an extremely useful resource to know what’s happening in our system, to log user activity or even to reconstruct the state of the system in a specific point of time doing event sourcing. We need to take advantage of this resource and to do that we need tooling to inspect and debug our events. Queries like “give me all events produced by user John Doe in the last 3 hours” also become very useful to detect fraudulent behaviour. We’ve developed some tools to accomplish that on top of ElasticSearch, Kibana and S3.
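A query like the one above can be expressed in Elasticsearch’s query DSL; the field names here are assumptions about how events might be indexed:

```json
{
  "query": {
    "bool": {
      "filter": [
        {"term": {"attributes.user_id": "42"}},
        {"range": {"occurred_on": {"gte": "now-3h"}}}
      ]
    }
  }
}
```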
8. No event monitoring
We can use our domain events to check the health of a system. When we deploy something (which happens several times per day depending on the service) we need tools to quickly check that everything is working the way it should. For example, if we deploy a new version of our Products service in production and see a 20% decrease in product_published events, we can be almost certain that we broke something. We’re currently using InfluxDB, Grafana and Prometheus to accomplish that, using derivative functions. If you remember your math classes, the derivative of a function f(x) is the slope of its tangent line at every point x. If you take the publishing rate of a specific domain event as a function and apply the derivative, you’ll see the peaks of that function and can set alerts based on them. With this type of alert you avoid rules like “alert me if we publish fewer than 200 events per second for 5 minutes” and focus on significant variations in publishing rate instead.
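A minimal sketch of the idea over a per-minute series of event counts (illustrative only, not our actual InfluxDB/Grafana setup):

```python
def rate_deltas(rates):
    """Discrete derivative of an event-rate time series."""
    return [b - a for a, b in zip(rates, rates[1:])]


def spike_alerts(rates, threshold):
    """Indexes where the publishing rate changed faster than `threshold`."""
    return [i + 1 for i, d in enumerate(rate_deltas(rates)) if abs(d) > threshold]


# product_published events per minute: the drop right after a deploy
# stands out as a large negative derivative, regardless of the
# absolute traffic level.
rates = [200, 205, 198, 202, 160, 158, 161]
print(spike_alerts(rates, threshold=20))  # [4]
```

Because the alert fires on the slope rather than an absolute floor, it keeps working during low-traffic hours and high-traffic campaigns alike.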
Something weird happened here… but maybe it’s only a marketing campaign :D
9. Assuming everything is gonna be alright
We try to build resilient systems and to reduce their recovery cost. In addition to infrastructure problems or human failure, one of the most common things that can happen in an event-driven architecture is a loss of events. We need a plan to recover the correct state of the system by reprocessing any events that were lost. Our strategy for accomplishing this is based on two points:
- Save all events: We need to be capable of doing things like “reprocess all events that happened yesterday” so we need to have some kind of event store where we keep all events. At letgo this is a responsibility of the Data team more than the Backend one.
- Consumer idempotency: Our consumers should be able to handle the same event more than once without corrupting their internal state or throwing a lot of errors. This can happen because we’re recovering from an error and reprocessing old events, or because our message bus delivers an event more than once. Idempotency is, in our opinion, the cheapest solution to this problem. Imagine our service listens to the user_registered event from the Users service to build its projection of users, stored in a MySQL table with user_id as the primary key. If we don’t check for existence before inserting when handling user_registered events, we can end up with a lot of duplicate-key errors. Even if we do check for the user’s existence before inserting, we can still get errors due to the replication delay between master and slaves in MySQL (about 30 ms on average). Since these projections can be represented as key-value records, we’re trying to move them to DynamoDB. Even when you aim for idempotency, there are use cases, like incrementing or decrementing counters, where building an idempotent consumer is very hard. Depending on how critical the use case is at the domain level, you should decide how tolerant you can be to failures and inconsistencies, and whether the cost of an event deduplication system pays off.
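A toy sketch of an idempotent consumer using an in-memory upsert (in MySQL the equivalent would be INSERT ... ON DUPLICATE KEY UPDATE); all names are illustrative:

```python
class TalkerProjection:
    """Idempotent consumer: handling the same event twice is harmless."""

    def __init__(self):
        self.rows = {}  # keyed by user_id, like a primary key

    def on_user_registered(self, event):
        user_id = event["attributes"]["user_id"]
        # Upsert instead of insert: a redelivered event overwrites the
        # same row rather than raising a duplicate-key error.
        self.rows[user_id] = {"username": event["attributes"]["username"]}


projection = TalkerProjection()
event = {
    "type": "user_registered",
    "attributes": {"user_id": "42", "username": "jane"},
}
projection.on_user_registered(event)
projection.on_user_registered(event)  # duplicate delivery: no error
print(len(projection.rows))  # 1
```

The upsert makes redelivery a no-op; counters and other non-idempotent updates are exactly the cases where this trick stops being enough.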
10. Lack of domain event documentation
Our domain events become our public interface to the rest of the systems in our backend. Just like we document our REST APIs, we need to document our domain events. Any member of the organization should be able to see up-to-date documentation for every domain event published by each service. If we’re using schemas for domain event validation, they can serve as documentation too.
11. Resistance to consuming your own events
You’re allowed, and encouraged, to consume your own domain events to build projections inside your system that are, for example, optimized for reading. We’ve seen some resistance to this in some teams, because they internalize the idea that a service’s events exist only for others to consume.