Chapter 6 Event Sourcing

The Reactive Manifesto states that a system needs to be both Resilient, and Elastic in order to achieve Responsiveness. It does this by leveraging a Message Driven framework.A common solution for building these systems is to leverage events. Events and activities have a direct correspondence in Domain Driven Design (DDD). These events can be used to construct views of the model using a pattern called Command Query Responsibility Segregation (CQRS). This pattern allows us to decouple our Command model from our Query model. By decoupling the models we create isolation in our system. This in turn allows us to build our system in a way that is more elastic, and more resilient.

Broadcasting the events in order to build the corresponding read models requires that we have a technique for persisting the events. If we were to lose the events for any reason, then it would jeopardize the consistency of the queries. One way that we can ensure that the models remain consistent is using a technique called Event Sourcing (ES). With Event Sourcing, we eliminate the persistence of state, and instead focus only on persisting the events. This provides us with a powerful persistence model that is capable of adapting over time.

CQRS and ES allow us to build flexible, decoupled models. These models can be scaled and replicated, allowing us to achieve the goals of the Reactive Manifesto. The techniques aren’t without costs, but when properly implemented, they result in systems that are able to evolve with the changing environment.

CQRS/ES stands for Command Query Responsibility Segregation and Event Sourcing. These are two different tools that can be used independently, but are often combined together. When combined together they give us a set of techniques that allow us to build applications that are more resilient and more elastic. In essence they’re giving us a way to build applications that are going to be more Reactive. Now we do have to be careful though, when we decide to use these tools, because they do come at a cost, and so we have to understand what that cost is. In all cases we should be looking at what our business needs are, and deciding whether that cost that we have to pay is worth it in order to achieve the business needs.

Some of the cases where you would want to consider adopting CQRS are, for example, if you needed applications that were auditable. So you needed some way not just to know what your current state is in your application, but also how you got there. Banking is a good example of this. Banking is a case where auditability is a legal requirement generally speaking. So you need to know not just how much money is in the bank, but how you got to that amount. Accounting is another example. You need to be able to know when you’re doing your accounting ledgers, how you got to the final balance. It’s not sufficient just to say what that balance is. So if auditability is important in your business, then it’s worth looking at the CQRS. and ES.

Now we should also consider whether this is a portion of the business that provides competitive advantage. So if this is a piece of your business that you actually believe will give you an edge over your competitors, then CQRS/ES may be valuable for you. It may give you some of the tools and some of the techniques that allow you to do things that your competitors can’t. And so if competitive advantage is important in this part of the business then you should consider CQRS/ES.

If for example you need high scalability, or you need your application to be extremely resilient, again that might be a case where you want to look at CQRS/ES but if your application is serving only a couple of hundred requests a minute and maybe it doesn’t really matter that much if it goes down for a period of time, again maybe this is overkill maybe you don’t want CQRS/ES in most cases.

6.1 State-Based Persistence

Most applications are based on what we would call a state based persistence technique. With this type of technique every time an update is applied to the database you overwrite the previous state and update it with a new state. So whatever the state was before that is gone, and now you just have the current version of the data. This has consequences, especially when adapting to an evolving domain.

The requirements in your domain are not always fixed and therefore changes to the requirements can show up for many reasons. Examples are new business opportunities, better domain understanding or external legal requirements, which change the way how the application stores data. The essential question which need to be addressed in evolving domains is how to capture changes that require us to look into past history?

An example is a restaurant software, which captures reservations and their locations. The according reservations table / model might look like the following:

ReservationId CustomerId LocationId DateTime SpecialInstructions
12345 Martin Lagomville_east 2020-06-10T17:00 Near Windows
54321 Grace Lagomville_west 2020-07-22T18:00 Booth

Management wants to know how often a reservation is changed from one location to another (to help them to plan new locations, staffing) and the plan is to adapt the system to start recording this information in the future. How can we ge this information from the past?

With the state-based approach of the above table this is only possible with big pains: Granted if we could have duplicate reservation IDs then it is possible to capture that information by allowing multiple entries for the same reservation id. However, we wouldn’t know which was the current version of the reservation, so we would still have to adapt it in some way in order to keep track of that information, for example a version index.

Also, what if management expects to have the information available not only from now into the future but also for the past? In such a state-based system this would simply be impossible.

Another problem with a state-based approach is that correcting past mistakes is difficult. Despite best efforts, errors creep into code and the result in incorrect state in the database. Fixing the errors fixes new state but not the state that has already been persisted.

Ultimately, the problem with a state-based approach is:

  • I captures where you are, not how you got there.
  • You cannot retroactively apply new domain insight.
  • You cannot fix bad state caused by issues in the code.

6.2 Event Sourcing

State-based persistence is a good general purpose technique, but it has limitations. Well what happens if we want to work around those limitations, or avoid them? In that case we can use a technique like event sourcing.

In order to solve the problems with state based persistence, a thing that we will commonly do, is that additionally to the state, we will persist an audit log of some kind, see Figure 6.1.

An Audit Log.

Figure 6.1: An Audit Log.

  • In addition to persisting state, you persist an audit log.
  • The log captures the history and can be used to fix issues with the state.
  • It can be used to retroactively develop new domain insights.

However, what if the audit log gets out of sync with the state (due to a bug / transaction issues e.g. state goes to DB, log to filesystem)? Which is the source of truth? The log or the state? The obvious answer is to say the log because the log has the full history of everything that happened and because of this it can always be used to rebuild the state. As a consequence, this raises the question that why we need the state in the first place if it doesn’t have enough information to be the source of truth? This brings us to event sourcing

Event sourcing eliminates the persistence of state, and instead captures only the intent. See Figure 6.2 for an example of an order service of a restaurant. It exposes endpoints like CreateOrder ,AddItem(Steak) and an AddItem(Salad). In case the order service (aggregat in DDD speak) receives such a command it updates the state of the order in memory, but at the same time it persists events that correspond to those commands: OrderCreated, ItemAdded(Steak), ItemAdded(Salad).

Event Sourcing

Figure 6.2: Event Sourcing

  • State-based persistence fail to capture intent.
  • Event Sourcing eliminates the persistence of state. It captures only the intent.
  • Intent is captured in the form of events which are stored in the log.
  • This log is the single source of truth in the system.
  • Event sourcing captures the journey, rather than the destination.

The question is now, how to load instantiate an object, such as the order in the above example, in case it got removed from memory (which will happen). There is no final state available anymore! The solution is now to replay all of the events to arrive at the current final state. In the above example the first event is to create the order, then add a steak and then add a salad, which will bring us to the final order state. See Figure 6.3.

Recovery in Event Sourcing

Figure 6.3: Recovery in Event Sourcing

  • When an event sourced object is loaded, rather than loading the final state, the events are replayed.
  • Each event is replayed causing the same update to the state that occurred from the original command.
  • It is imporant when replaying events to avoid replaying side effects (for example sending a notification email). Such side effects have already occured and do not need to be executed again.

An issue with recovery and replaying events in event sourcing is that it can become quite expensive in case there are a large number of events (in hundreds or thousands). A solution to this issues are Snapshots.

  • Eventually the time to replay all events may be too long.
  • In addition to persisting events, we can periodically persist a snapshot.
  • A snapshot captures the current state of the object.
  • When replaying, we can start from the most recent snapshot and replay only the events after that snapshot.
  • Snapshots are an optimisation. Be wary of premature optimisation!

Note that a snapshot is different from the state that is persisted in a state-based approach. Although it captures the current state of the object, alongside of the events, the difference is that it is treated as transient. The snapshot could be deleted at any time and the application would continue to operate exactly as it did before (albeit probably slower). So if a situation occurs where it is realised that the snapshot is corrupt or incorrect, it can simply be deleted and replayed. However often we do not have to go all the way back but replay from the most recent snapshot, giving us the state at that point, and then we replay only events that occurred after that snapshot. Therefore, it is possible to leverage snapshots in order to speed up load time.

It’s important to note though that snapshots are an optimization. Snapshots should be introduced only when the time to load the events becomes a problem. Selecting 1, tens or hundred records with a simple select statement out of the database is often not significant slower. Also, a lot of times the state is just in memory when you need it, and so you don’t actually have to reconstruct these objects that often. So if they take a little longer to load it’s not really a big deal.

What are the advantages of event sourcing?

  • Creates a built-in audit log showing where the current state comes from. This is particular useful if you need an audit due to legal requirements.
  • Allows you to rewind or undo the changes back to a particular time. This is particularly useful in cases of bugs or corruptions.
  • Enables you to correct errors in your state by fixing computation and replaying the events. Again this is useful in cases of bugs.
  • Append-only is usually (much) more efficient in databases.

What are the disadvanages of event sourcing? TODO

6.3 Evolving a Domain Model

Whether we use state-based persistence or event sourcing, there’s going to come a time when we’re building our system, where we will have to evolve our model. With state based persistence techniques, this not too difficult: we may have to add a new column for example, or a new table. Evolving a model however, gets more complicated though when we do event sourcing.

Essentially, event sourced solutions are only as good as the log. Therefore, if we cause a problem in the log, it can’t be used anymore to rebuild the state. So it’s very very important to maintain the integrity of the log:

  • The events in the log are facts. They represent the history of the system.
  • History should never be rewritten (except in 1984), therefore events must be immutable (value objects!).
  • Your Event Log is append only. You never update or delete.

All these requirements to consistencay are a problem, as over time events may need to change, see Figure 6.4. We might need to add new fields to the event, we may need to remove fields, we might need to rename things. This causes issues as well, for example to maintain support for all of the old versions. In this case it is necessary to continue to be able to parse old versions into into the software as old versions can stay around for a long period of time, theoretically forever, especially if we cannot force a client update.

Versioning Events

Figure 6.4: Versioning Events

  • Because events are immutable, we can’t change them.
  • This requires us to create a versioning scheme for events.
  • Best Practice: use a format for the log that is flexible (e.g. XML, JSON, Protobuf,… rather than Java Serialisation).

6.4 Command Sourcing

In addition to event sourcing there’s another technique that we call command sourcing. Now command sourcing is similar to event sourcing, the difference is that instead of persisting the events, we persist the commands, see Figure 6.5.

Command Sourcing

Figure 6.5: Command Sourcing

  • Command sourcing is simliar to Event Sourcing.
  • Commands are presisted upon receipt, before being executed.
  • Domain objects can be constructed or updated by executing the commands.
  • Allows commands to be processed asynchronously, which can be very useful in the case where there are commands that take a long time to process.

There are a number of challenges with command sourcing

  • The commands may be executed multiple times (dailures, retries,…). Therefore, they must be indempotent.
  • If they are not validated first, then commands could get stuck in the queue as they continually fail.
  • In the event of a bad command we are decoupled from the sender, so we cannot inform them of the problem.
  • If is often safer to perform validation first, then emit an event to update the state, e.g. we might have a command AddItem which gets validated and persisted as an AddItemValidated event, which is then turned into an ItemAdded event.

6.5 Read Models vs Write Models

When start doing event sourcing, usually if doing something like domain driven design, we would event source aggregates or entities. Problems arise when queries need to be performed that can’t be answered by a single aggregate root, see Figure 6.6.

Complexity of Queries

Figure 6.6: Complexity of Queries

In this example the challenge is to get all reservations for a customer (given its name or id). If reservation is the aggregate root, then reservation alone cannot answer that query. This can only be answered by going through and rebuild potentially every reservation in the system from the log and only return the reservations with the given customer id.

  • Queries become problematic because an entity or aggregate root must be rebuilt from events.
  • Querying across multiple aggregates requires you to rebuild all of them and query them individually - you can’t simply do a database query.

This issue is due to a deeper, underlying problem, which is that there are conflicting models in the system. There seem to be a dichotomy between commands and queries, see Figure 6.7.

Conflicting Models

Figure 6.7: Conflicting Models

The issue is that we need to deal with the fact that in one case for CreateReservation the best aggregate root is the reservation, whereas in another case like GetReservationsForCustomer a better aggregate root might be customer. The fact of the matter is that the model that is used to persist is often not compatible with the model that is used for queries. The question is now how to resolve this conflicts, which brings us to CQRS.

6.6 Command Query Responsibility Segregation (CQRS)

CQRS aims to reconcile the dichotomy between commands and queries sides of a model. It does this by acknowledging that the requirements for read and write are very different and supporting both in a single model may not be a good idea.

So at its core, CQRS tries to separate your reads and your writes, see Figure 6.8.

Simple CQRS

Figure 6.8: Simple CQRS

  • The write model is used for processing commands (writes). The write model is normally represented by the aggregate root or entity.
  • One or more read models are produced to handle queries (reads). It makes sense to have multiple read models, for example GetReservationsByLocation would be one read model, GetReservationsByCustomer another.
  • Both read and write models are optimised for their precise purpose. The ideal read model simply goes to the database extracts the data and then dumps it out exactly, or as close to exactly, as it came out of the database.

Note that CQRS does not implicitly assume event sourcing. It is perfectly valid to use CQRS in a state-based approach with separate tables for the read model and separate tables for the write model, each performing different queries. However, However CQRS and event sourcing are often combined, see Figure 6.9.

CQRS + Event Sourcing

Figure 6.9: CQRS + Event Sourcing

  • Commands still go through the write model, but that write model just persists events.
  • A separate process then consumes those events.
  • A denormalised model, called a Projection, is created that is used by the read model.

The essential idea here is for write purposes storing the events is probably our ideal circumstance that’s the ideal state. But for a lot of read purposes, reading those events is not ideal, so that separate projection is created and that projection is optimized for the read model. That way the read model can just go directly to that denormalized store and read the data directly out exactly as it needs it. This makes it very fast, because there’s not a lot of processing and there is no need for complex joins and complex queries - data is read exactly as it was stored in the database.

CQRS/ES brings a lot of flexibility:

  • CQRS allows you to build a very flexible model.
  • The write side can be highly optimised for write purposes.
  • The read sinde can be highly optimised for read purposes.
  • The read and write side can use different databases of necessary (Polyglot Persistence).

It allows for substantially easier evolving of the model:

  • Creating new Projections is easy in a CQRS/ES based system.
  • Because the full history is captured in the events, new projections are retroactive.
  • The read model and write model are deoupled and can be evolved independently from each other.

6.7 Fine Grained Microservices

One of the powerful things about CQRS and event sourcing is they allow us to build very very fine-grained micro-services. Ultimately, it allows us to essentially use each of the individual models as their own microservices, see Figure 6.10.

Models as Microservices

Figure 6.10: Models as Microservices

  • CQRS allows you to further break apart your bounded context into microservices.
  • The read and write models can each live in their own separate microservices if necessary.
  • You can take it further and have each projection live in its own microservice.
  • Caution: You can have very fine grained microservices with this approach, but it may not be worth it. Make sure you understand your needs before turning everything into a microservice.

6.8 Consistency, Availability and Scalability with CQRS

CQRS gives a lot of flexibility when trying to do things like scaling an application, breaking up an application, and as a result it has a direct impact on the ability to provide things like consistency, availability, and scalability.

6.8.1 Consistency

Simple CQRS (without ES) has the same consistency guarantees as any non-CQRS based system (which depends on the underlying database). CQRS+ES systems however, can be implemented with different consistency concerns for the read and write models. In this case, write model consistency is quite straightforward, whereas read model consistency is more complicated. Lets have a look at each of them.

6.8.1.1 Write Model Consistency

  • Strong consistency is often important in the write model.
  • Decisions and computations are often made based on the current state of the system.
  • Making decisions with stale data could result in an incorrect state.
  • The write model often leverages transactions, locks or mor reactive approaches like sharding.

6.8.1.2 Read Model Consistency

It is important to understand that strong consistency only matters when you are writing data where you may need to ensure that the write is based on the current state. However, pure reads are never strongly consistent:

  • The data you read might be changed immediately after you read it.
  • Pure reads are always working with stale data.
  • You can’t lock the data for the duration of the read to ensure consistency. The read may be unbounded, such as when displaying it on a screen.

Therefore: read models do NOT need strong consistency. Eventual consistency is fine for read models because the reality is that read models are eventually consistent by nature.

6.8.2 Scalability

  • Because read and write models are separated, they can be scaled independently.
  • Most applications are read heavy.
  • The eventually consistent nature of the read side allows high scalability.
    • Eventual consistency is built into the model, so caching is easy. When things are eventually consistent, we know that they don’t need to be immediately up-to-date, which means we can leverage caches that may be slightly out of date.
    • If necessary, different microservices can host different projections, allowing further scalability.
    • Multiple copies of the same projection can be created.
  • The write side of CQRS+ES often requires stronger consistency. It can be scaled using techniques like sharding.

6.8.3 Availability

  • The write model is often strongly consistent and therefore sacrifices have to be made in availability (see CAP).
  • The read model is eventually consistent so high availability is possible.
  • In failure scenarios, the system may have to disable writing of data, while still allowing data to be read.

6.9 Cost of CQRS

CQRS is a powerful tool, but like with any powerful tool, it’s not free to use. There’s a cost associated with it, and so it is important to take those costs into account when building systems using CQRS.

CQRS+ES is often critisised for being more complex than traditional architectures, wowever, it can be simpler. Considering the simplified design in Figure 6.11, which is to be found in the red side and contains all operations provided for reservations. On the other hand in some ways it’s actually more complex, because that one single object is more complex. The green CQRS side on the other hand has more objects which introduces complexity, but each of those objects is in turn simpler. So it’s a bit of a trade-off you get some complexity but you gain some simplicity at the same time.

Simplified Design.

Figure 6.11: Simplified Design.

  • Without CQRS, models become bloated, complex, and rigid. The simple red side design makes it difficult to account for model changes as pointed out above. All queries need to be changed when some change happens. Now on the CQRS side, on the other hand, it’s more complex in the sense that now you have completely independent databases, and you’ve got to maintain all of them and manage all of them, but at the same time they are completely independent, so if I go and I make a change to the write model database storage, as long as I’m not changing the events, the read models don’t have to change at all.
  • CQRS allows smaller model that are easier to modify and understand.
  • Eventual consistency in CQRS can be isolated to where it is necessary. All systems have eventual consistency in them, however CQRS is just more explicit about it.

Concrete costs of CQRS+ES:

  • Usually results in an increased numbre of objects/classes (commands, events, read models etc.).
  • May result in introducing multiple data stores, potentially of different type.
  • UI must be designed to accept the Eventually Consistent architecture.
  • Maintaining support for old event versions can become challenging.
  • May require additional storage due to long event history and read model data duplication.
  • Data duplication can result in synchronisation issues. This can be solved by rebuilding projections.

Sources

This chapter is based on material from the following online resources:

https://cognitiveclass.ai/courses/reactive-architecture-cqrs#utm_source=Lightbend