Event sourcing and git

I attended DDD Day in Bologna last Saturday. Greg Young was really amazing and effective in introducing Command and Query Responsibility Segregation and Event Sourcing to the audience.

As far as I understood, the key concept in Event Sourcing (I’m using Greg Young’s words) is

that we have an Event Store holding the events to rebuild an object behind the domain as opposed to something storing the current state

Every time you delete or update something, Greg said, you lose information. It makes sense to me.

Since “conceptually the Event Store is an infinitely appending file”, rebuilding objects from events can be problematic, and Greg explained an approach using snapshots.
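To make the idea concrete, here is a minimal sketch of an append-only event log in Python, with invented event names (Deposited, Withdrawn) — not Greg Young’s actual implementation. Current state is never stored or updated in place; it is always derived by replaying events:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Deposited:
    amount: int

@dataclass(frozen=True)
class Withdrawn:
    amount: int

def apply(balance, event):
    # Events record what happened; state is never mutated destructively.
    if isinstance(event, Deposited):
        return balance + event.amount
    return balance - event.amount

def rebuild(events):
    """Derive the current state by replaying the append-only log."""
    balance = 0
    for event in events:
        balance = apply(balance, event)
    return balance

log = [Deposited(100), Withdrawn(30), Deposited(5)]
log.append(Withdrawn(25))  # new facts are only ever appended, never deleted
print(rebuild(log))  # 50
```

Deleting or updating an entry in `log` would lose history — appending a compensating event instead keeps every piece of information, which is the point Greg was making.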

I’m far from an expert on this topic, but it reminded me of revision control systems. I might be completely wrong, but I don’t see many differences between persisting objects based on events (and sometimes snapshots) with Event Sourcing and committing changes to a revision control system.

Further on, that reminded me of a very nice article by Αριστοτέλης Παγκαλτζής that I read about the key difference between git and all the other revision control systems (bold is mine):

Among the systems I did look into, there are really just two contenders: Git and Mercurial. All the other systems track metadata; Git and hg just track content and infer the metadata.

By tracking metadata I mean that these systems keep a record of what steps were taken. “This file had its name changed.” “Those modifications came from that file in that branch.” “This file was copied from that file.” Tracking content alone means doing none of that. When you commit, the VCS just records what the tree looks like. It doesn’t care about how the tree got that way. When you ask it about two revisions, it looks at the tree beforehand and the tree afterwards, and figures out what happened inbetween. A file is not a unit that defines any sort of boundary in this view. The VCS always looks at entire trees; files have no individual identity separate from their trees at all.

As a consequence, whether you used VCS tools to manipulate your working copy or regular command line utilities or applied a patch or whatever is irrelevant. The resulting history is always the same.

Another consequence, at least with Git, is that it can track the movement of things smaller than a file, e.g. a single function being moved from one file to another.
And that sub-file level tracking in Git is an example of how, if the VCS is improved and its tracking becomes more intelligent, your entire repository instantly benefits from this. A metadata tracking system can’t do that because the old part of your repository didn’t have the necessary metadata recorded. A file-based VCS can’t do that because it doesn’t have an innate understanding that there are interrelationships between files.

So that’s why the only contenders are Git and Mercurial.
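The “track content, infer metadata” idea from the quote can be sketched in a few lines: given two tree snapshots (path → contents), a rename can be inferred purely by comparing content hashes, with no recorded “this file was renamed” step. This is a toy illustration, not how git is actually implemented:

```python
# Toy sketch: infer a rename from two tree snapshots by content hash alone.
import hashlib

def tree_hashes(tree):
    """Map each path to a hash of its contents."""
    return {path: hashlib.sha1(data.encode()).hexdigest()
            for path, data in tree.items()}

def infer_renames(before, after):
    """Pair files that vanished with files that appeared
    but carry identical content."""
    b, a = tree_hashes(before), tree_hashes(after)
    gone = {h: path for path, h in b.items() if path not in a}
    renames = []
    for path, h in a.items():
        if path not in b and h in gone:
            renames.append((gone[h], path))
    return renames

before = {"util.py": "def f(): pass\n", "main.py": "print('hi')\n"}
after  = {"helpers.py": "def f(): pass\n", "main.py": "print('hi')\n"}
print(infer_renames(before, after))  # [('util.py', 'helpers.py')]
```

No metadata about the rename was ever stored — the snapshots alone are enough to answer the question after the fact, which is exactly the property the article attributes to Git and Mercurial.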

Now I wonder: can storing metadata (with an RCS) be compared to persisting events with Event Sourcing? And if so, given git’s magic ability to “figure out what happened” precisely because it stores snapshots rather than diffs, could always storing snapshots be a better way to do Event Sourcing?

Just wondering.

@gianmarcog pointed me to a very interesting post by Linus Torvalds about git tracking “_nothing_ but information” rather than the events that happened to the source code, which I find amazingly interesting, especially when I try to read it with event sourcing in mind. Thanks, @gianmarcog!


5 thoughts on “Event sourcing and git”

  1. “I don’t see too many differences between persisting objects based on events (and sometimes snapshots) with Event Sourcing and committing changes to a revision control system.”

    I agree! In fact, the penny just dropped for me. I googled it and found your post. Seeing how powerful events are in Git / Hg makes me more confident about expending the effort to use the approach in my own projects.

  2. “since git’s magic capabilities to “figuring out what happened” just because it stores snapshots and not diffs, could be always storing snapshots a better way to do Event Sourcing?”

    I went and read that email you linked to from Linus; it’s quite interesting! But I have to say that I don’t think mere snapshotting is the same thing Linus is talking about.

    A Domain Object in a CQRS, DDD, Event Sourced program is constructed from past events. The snapshot is really just a point-in-time picture of the Domain Object. Having the snapshot means it is quicker to build the state of an object without processing every Domain Event since the beginning of time. You load the latest snapshot, apply all subsequent events, and then you have an accurate picture of the Domain Object’s state.
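As a rough sketch of that rebuild pattern (invented event shapes, not any real framework): a snapshot is just the derived state plus the version it was taken at, and only events after that version need replaying.

```python
# Sketch: rebuilding state from a snapshot plus subsequent events.
events = [("deposit", 100), ("withdraw", 30), ("deposit", 50), ("withdraw", 20)]

def apply(state, event):
    kind, amount = event
    return state + amount if kind == "deposit" else state - amount

# A snapshot: the derived state plus the version it was taken at
# (here, taken after the first two events: 100 - 30 = 70).
snapshot_state, snapshot_version = 70, 2

# Load the latest snapshot, then apply only the subsequent events.
state = snapshot_state
for event in events[snapshot_version:]:
    state = apply(state, event)
print(state)  # 100
```

Replaying all four events from zero yields the same result; the snapshot only shortens the replay, it carries no information of its own.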

    If you were to snapshot Domain Objects only whenever a change occurs, you would incur *significant* overhead, because every single event in your program would cause a snapshot. So instead of a huge list of Events you would have an equally huge list of snapshots in terms of numbers, but a far, far bigger list in terms of data consumed, since a snapshot captures not only the changed data but all the data of the entire Domain Object at that point in time.

    And in the end, you will not have gained anything. You won’t have any more information than if you had only stored Events. All you will have done is consume unnecessary storage with duplicated data.

    What Linus is talking about is the understanding of the Domain in question. David saw a reporting need (searching for information is a read, and therefore a report of data based around some constraints) and assumed that the easiest way to get his report would be to calculate this information during the write and add it as extra metadata. This is the wrong approach. Since Event Sourcing has already recorded every single interaction with the Domain Model, we don’t need to bolt new data on during the write that is really just a calculation over the existing data. To get this report, all we need to do is run back through time in the Event Store and evaluate the data then. As Linus pointed out, this is slightly more expensive since you don’t have the answer handily stored as metadata for you – but as he said, the question is irrelevant until you ask it. Instead of trying to answer every question before it is asked, just store all the data and ask questions as they arise.
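A small sketch of that “ask questions as they arise” point, with invented event shapes: the report below is derived later by replaying stored events, rather than being precomputed and bolted on as metadata at write time.

```python
# Sketch: a report computed after the fact by replaying stored events.
events = [
    {"type": "item_added", "sku": "A", "qty": 2},
    {"type": "item_added", "sku": "B", "qty": 1},
    {"type": "item_removed", "sku": "A", "qty": 1},
]

def report_quantity(events, sku):
    """Answer a question nobody asked at write time, by replaying history."""
    total = 0
    for e in events:
        if e["sku"] != sku:
            continue
        total += e["qty"] if e["type"] == "item_added" else -e["qty"]
    return total

print(report_quantity(events, "A"))  # 1
```

Nothing about per-SKU totals was stored up front; the events alone are enough to answer the question whenever it is finally asked.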

    Hope this helps!

  3. The main difference comes down to the persistence of original intent. Storing one or many named events more clearly describes what changed, and possibly why, whereas with snapshots you may have a number of seemingly unrelated changes with no explanation of why. In this respect, RDBMS auditing features wouldn’t be much different from just storing snapshots for an aggregate; it’s just a question of organizing the data as relationally normalized or aggregate-centric units.

  4. I realise that this is an old post, but I’ll throw my two cents in anyway.

    This thought experiment is VERY interesting. Take a look at the Historical Modelling website, especially the PDF linked to from the front page (http://historicalmodelling.com). He doesn’t specifically mention it, but I believe that “successor facts referencing predecessor facts” works in terms of snapshots, much like how Git works (the author’s references to content-addressable storage – CAS – are also very Git-like).

    Bob, Git does store snapshots. However, it also uses “pack files” – which are essentially deltas – to reduce storage space. This is the inverse of event sourcing, where events (deltas) are the primary data storage and snapshots are used to improve read performance (reading of events, not reads in the CQRS sense).

    That is, immutable state-based storage has fast read times (one read to determine the current state) at the expense of storage space (duplication), while immutable delta-based storage lowers storage costs at the expense of slower read times (one read per event). Developers throw around the “premature optimisation is the root of all evil” slogan frequently. When choosing between state-based or event-based immutable storage, you’ll have to consider the characteristics of each. Perhaps – in some domains at least – optimising storage space via event sourcing is the “premature” case, especially given the low cost of storage.
