Event sourcing and git

I attended DDD Day in Bologna last Saturday. Greg Young was really amazing and effective in introducing Command and Query Responsibility Segregation and Event Sourcing to the audience.

As far as I understood, the key concept in Event Sourcing (I’m using Greg Young’s words) is

that we have an Event Store holding the events to rebuild an object behind the domain as opposed to something storing the current state

Every time you delete or update something, Greg says, you lose information. It makes sense to me.

Since “conceptually the Event Store is an infinitely appending file“, rebuilding objects from events can become expensive, and Greg explained an approach using snapshots.
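To make the idea concrete, here is a minimal sketch of what I understood: an append-only event store, and an object whose state is rebuilt by replaying every event. All the names (`BankAccount`, `record`, the event tuples) are my own illustration, not Greg Young's code.

```python
# A minimal event-sourcing sketch: events are facts that happened,
# appended to a store; current state is derived by replaying them.

class BankAccount:
    def __init__(self):
        self.balance = 0

    def apply(self, event):
        # Each event describes what happened, never the resulting state.
        kind, amount = event
        if kind == "deposited":
            self.balance += amount
        elif kind == "withdrew":
            self.balance -= amount

event_store = []  # "conceptually an infinitely appending file"

def record(event):
    event_store.append(event)  # append-only: no update, no delete

record(("deposited", 100))
record(("withdrew", 30))
record(("deposited", 5))

# Rebuild the object by replaying every event from the beginning.
account = BankAccount()
for event in event_store:
    account.apply(event)

print(account.balance)  # 75
```

Nothing is ever deleted or updated in `event_store`, so no information is lost; the cost is that rebuilding means replaying the whole history, which is where snapshots come in.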

I’m anything but an expert in this topic, but that reminded me of revision control systems. I might be completely wrong, but I don’t see many differences between persisting objects based on events (and sometimes snapshots) with Event Sourcing and committing changes to a revision control system.

Further on, that reminded me of a very nice article I read by Αριστοτέλης Παγκαλτζής about the key difference between git and all the other revision control systems (emphasis mine):

Among the systems I did look into, there are really just two contenders: Git and Mercurial. All the other systems track metadata; Git and hg just track content and infer the metadata.

By tracking metadata I mean that these systems keep a record of what steps were taken. “This file had its name changed.” “Those modifications came from that file in that branch.” “This file was copied from that file.” Tracking content alone means doing none of that. When you commit, the VCS just records what the tree looks like. It doesn’t care about how the tree got that way. When you ask it about two revisions, it looks at the tree beforehand and the tree afterwards, and figures out what happened in between. A file is not a unit that defines any sort of boundary in this view. The VCS always looks at entire trees; files have no individual identity separate from their trees at all.

As a consequence, whether you used VCS tools to manipulate your working copy or regular command line utilities or applied a patch or whatever is irrelevant. The resulting history is always the same.

Another consequence, at least with Git, is that it can track the movement of things smaller than a file, e.g. a single function being moved from one file to another.
And that sub-file level tracking in Git is an example of how, if the VCS is improved and its tracking becomes more intelligent, your entire repository instantly benefits from this. A metadata tracking system can’t do that because the old part of your repository didn’t have the necessary metadata recorded. A file-based VCS can’t do that because it doesn’t have an innate understanding that there are interrelationships between files.

So that’s why the only contenders are Git and Mercurial.
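The content-tracking idea in the quote can be sketched in a few lines: given two snapshots of a tree (path → file content), infer a rename by matching content hashes, roughly the way git's rename detection works conceptually. This is my own heavily simplified illustration (exact content matches only, no similarity scoring), not git's actual algorithm.

```python
# Infer "what happened" between two tree snapshots purely from content,
# without any recorded metadata about the steps taken.
import hashlib

def digest(content):
    return hashlib.sha1(content.encode()).hexdigest()

def infer_changes(before, after):
    before_by_hash = {digest(c): p for p, c in before.items()}
    after_hashes = {digest(c) for c in after.values()}
    changes = []
    for path, content in after.items():
        h = digest(content)
        if path in before and digest(before[path]) == h:
            continue  # unchanged
        if path not in before and h in before_by_hash and before_by_hash[h] not in after:
            # Same content, new path, old path gone: infer a rename.
            changes.append(("renamed", before_by_hash[h], path))
        elif path not in before:
            changes.append(("added", path))
        else:
            changes.append(("modified", path))
    for path in before:
        if path not in after and digest(before[path]) not in after_hashes:
            changes.append(("deleted", path))
    return changes

before = {"utils.py": "def helper(): pass", "main.py": "print('hi')"}
after = {"helpers.py": "def helper(): pass", "main.py": "print('hi')"}
print(infer_changes(before, after))  # [('renamed', 'utils.py', 'helpers.py')]
```

Note that nothing told the function a rename happened; it was inferred after the fact from the two trees, which is exactly the "snapshots, not diffs" point.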

Now I wonder: can storing metadata (with an RCS) be compared to persisting events with Event Sourcing? If so, given git’s magic ability to “figure out what happened” even though it stores snapshots and not diffs, could always storing snapshots be a better way to do Event Sourcing?

Just wondering.

[Update]
@gianmarcog pointed me to a very interesting post by Linus Torvalds about git tracking “_nothing_ but information” rather than the events that happened to the source code, which I find amazingly interesting, especially if I try to read it thinking about event sourcing. Thanks, @gianmarcog

3 thoughts on “Event sourcing and git”

  1. “I don’t see too many differences between persisting objects based on events (and sometimes snapshots) with Event Sourcing and committing changes to a revision control system.”

    I agree! In fact, that “penny just dropped for me”. I googled it, and found your post. Seeing how powerful events are in Git / Hg makes me more confident about expending the effort to use the approach in my own projects.

  2. Bob says:

    “given git’s magic ability to ‘figure out what happened’ even though it stores snapshots and not diffs, could always storing snapshots be a better way to do Event Sourcing?”

    I went and read that email you linked to from Linus, it’s quite interesting! But I have to say that I don’t think merely snapshotting is the same thing that Linus is talking about.

    A Domain Object in a CQRS, DDD, Event Sourced program is constructed from past events. The snapshot is really just a point-in-time picture of the Domain Object. Having the snapshot means it is quicker to build the state of an object without processing every Domain Event since the beginning of time. You load the latest snapshot and then apply all subsequent events, and then you have an accurate picture of the Domain Object’s state.

    If you were to attempt to snapshot Domain Objects on every change, you would incur *significant* overhead, because every single event in your program would cause a snapshot to occur. So, instead of a huge list of Events you would have an equally huge list of snapshots in terms of count, but a far, far bigger list in terms of data consumed, since a snapshot captures not only the changed data but all the data of the entire Domain Object at that point in time.

    And in the end, you will not have gained anything. You haven’t got any more information than you would have if you had only stored Events. All you will have done is consume unnecessary storage with duplicated data.

    What Linus is talking about is the understanding of the Domain in question. David saw a reporting need (searching for information is a read, and therefore a report of data based around some constraints) and assumed that the easiest way to get his report would be to calculate this information during the write and add it as extra metadata. This is the wrong approach. Since Event Sourcing has already recorded every single interaction with the Domain Model, we don’t need to bolt new data on during the write that is really just a calculation on the existing data. To get this report, all we need to do is run back through time in the Event Store and evaluate the data then. As Linus pointed out, this is slightly more expensive since you don’t have the answer handily metadata’d for you – but as he said, the question is irrelevant until you ask it. Instead of trying to answer every question before it is asked, just store all the data and ask questions as they arise.

    Hope this helps!

  3. The main difference comes down to the persistence of original intent. Storing one or many named events more clearly describes what changed and possibly why it changed, whereas with snapshots you may have a number of seemingly unrelated changes with no explanation of why. In this respect, RDBMS auditing features wouldn’t be much different from the idea of just storing snapshots for an aggregate; it’s just a question of organising data as relationally normalized or aggregate-centric units.
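The snapshot optimisation described in comment 2 can be sketched in a few lines. This is my own toy illustration under hypothetical names: a snapshot is just the state at a known position in the event log, and rebuilding starts there instead of at the beginning of time.

```python
# Snapshot optimisation: rebuild state from the latest snapshot plus
# only the events recorded after it, instead of replaying everything.

events = [("add", 1)] * 1000  # a long history of domain events

def apply(state, event):
    op, n = event
    return state + n if op == "add" else state - n

# A snapshot is (position in the event log, state at that position).
snapshot_version, snapshot_state = 900, 900  # state after the first 900 events

# Full rebuild: replay every event since the beginning of time.
full = 0
for e in events:
    full = apply(full, e)

# Snapshot rebuild: start from the snapshot, apply only subsequent events.
fast = snapshot_state
for e in events[snapshot_version:]:
    fast = apply(fast, e)

print(full, fast)  # both 1000: same state, 100 events replayed instead of 1000
```

The two rebuilds reach the same state; the snapshot only saves replay work, it adds no information — which is the comment's point about snapshot-per-event being pure duplication.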
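The “ask questions as they arise” idea from comment 2 can also be sketched: a report nobody anticipated at write time, answered later by folding over the event stream. The event shapes and names here are hypothetical.

```python
# A question asked long after the events were written: no metadata was
# precomputed for it, so we just replay history and evaluate it now.

events = [
    ("registered", "alice"),
    ("purchased", "alice", 30),
    ("registered", "bob"),
    ("purchased", "bob", 5),
    ("purchased", "alice", 20),
]

def spend_per_user(stream):
    totals = {}
    for event in stream:
        if event[0] == "purchased":
            _, user, amount = event
            totals[user] = totals.get(user, 0) + amount
    return totals

print(spend_per_user(events))  # {'alice': 50, 'bob': 5}
```

This is slightly more expensive than having the answer stored up front, but since the events record every interaction, any such question can be answered retroactively.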
