Git Objects - How They Work Internally?

Posted on: by Rajeev Edmonds

Git is one of the essential tools a developer uses daily. If you use it too (which I'm sure you do), have you ever wondered how it works behind the scenes? How do the stored snapshots get restored almost instantly?

Let's try to understand how Git stores information internally and how it retrieves it whenever required. It'll give you a better understanding of Git's way of working with data.

Colorful Github mascot replica on a table

Git uses different types of objects for storing repository information. There are primarily 4 types of objects, viz., blob, tree, commit, and annotated tag.

But before we move on to these objects, we must know that Git references almost every piece of stored information through a hash. It uses a SHA-1 hash (checksum), which is written as a 40-character hexadecimal string.

In other words, all these objects are referenced or accessed through these hash strings. They're unique and immutable. You're already familiar with these hashes. Take a look!

$ git log --format=fuller
    commit 8cf7b994d9815ccf2787e8660c6679ca0a012430 (HEAD -> master, origin/master, origin/HEAD)
    Author:     Rajeev Edmonds <rajeevedmonds@users.noreply.github.com>
    AuthorDate: Mon Aug 2 21:00:17 2021 +0530
    Commit:     Rajeev Edmonds <rajeevedmonds@users.noreply.github.com>
    CommitDate: Mon Aug 2 21:00:17 2021 +0530
    
        style: add space to skill elements
    
    commit 03b6e1b6ff25d4bda9083b2bb165793c87a5d4e5
    Author:     Rajeev Edmonds <rajeevedmonds@users.noreply.github.com>
    AuthorDate: Mon Aug 2 20:47:23 2021 +0530
    Commit:     Rajeev Edmonds <1507156+rajeevedmonds@users.noreply.github.com>
    CommitDate: Mon Aug 2 20:47:23 2021 +0530
    
        style: update skills text

Here, the string 8cf7b994d9815ccf2787e8660c6679ca0a012430 is the 40-character commit hash generated by Git. It references the snapshot recorded by this commit.

In simpler words, Git has created a commit object, and this hash is pointing to that object. If you want to retrieve information associated with this commit, use this hash.

Git stores these objects in the .git/objects directory.

Types of Git objects

The regular stage-and-commit workflow isn't the only way to create these objects. You can hand any file directly to Git, and it'll create a data object for it. Here's how it works!

$ git hash-object -w sample.txt
03b6e1b6ff25d4bda9083b2bb165793c87a5d4e5

The above command takes the file sample.txt and creates a data object for it within the Git database. It returns a hash (key) that points to the data object.
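The hash isn't random; it's derived from the file's content. Here's a minimal Python sketch of that calculation, assuming a repository that uses SHA-1 object IDs and a file that isn't changed by any line-ending or filter conversion. The function name git_blob_hash is mine, not something Git provides.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Return the SHA-1 object ID Git would assign to a blob with this content."""
    # Git hashes a short header ("blob <size in bytes>" plus a NUL byte)
    # followed by the raw file content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# An empty file always maps to the same well-known object ID.
print(git_blob_hash(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Notice that the file name plays no part in the calculation. Two files with identical content get the same hash, no matter what they're called.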

After running this command, if you inspect the .git/objects directory, you'll find a directory 03, and within it, a file named b6e1b6ff25d4bda9083b2bb165793c87a5d4e5. This file is the blob (binary large object). It's a binary file containing the content of sample.txt, prefixed with a short header and compressed with zlib. Note that it's compressed, not encrypted.

$ git cat-file -t 03b6e1b6ff25d4bda9083b2bb165793c87a5d4e5
blob

You can verify that it is indeed a blob file through the command shown above.

An illustration of a Git blob object

You may note that the first 2 characters of the hash are used to name the directory, and the remaining 38 are used to name the blob file.

In a nutshell, Git uses blobs to store snapshots (content at a given point in time) of a file. A blob holds only the content, not the file name, so if a file is left unchanged across a chain of commits, the trees of all those commits point to the same blob.
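To make the on-disk layout concrete, here's a rough Python sketch of how a loose object could be written and read back. It assumes SHA-1 object IDs and the loose (unpacked) storage format. The helper names encode_loose_object and decode_loose_object are hypothetical, and nothing is actually written to the .git directory.

```python
import hashlib
import zlib


def encode_loose_object(obj_type: str, payload: bytes):
    """Return (relative_path, compressed_bytes) for a loose object."""
    # Stored form: "<type> <size>" + NUL + payload, then zlib-compressed.
    size = str(len(payload)).encode("ascii")
    store = obj_type.encode("ascii") + b" " + size + b"\0" + payload
    object_id = hashlib.sha1(store).hexdigest()
    # The first 2 hex characters name the directory, the other 38 the file.
    path = f"{object_id[:2]}/{object_id[2:]}"
    return path, zlib.compress(store)


def decode_loose_object(compressed: bytes):
    """Return (obj_type, payload) from the compressed bytes of a loose object."""
    store = zlib.decompress(compressed)
    header, _, payload = store.partition(b"\0")
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(payload):
        raise ValueError("size in header does not match payload length")
    return obj_type, payload


path, data = encode_loose_object("blob", b"")
print(path)                       # e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(decode_loose_object(data))  # ('blob', b'')
```

The object type recovered by decode_loose_object is the same piece of information that git cat-file -t reports.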

Anatomy of a Commit Object

Let's take a closer look at the commit object. Visualizing it graphically can help you better understand how things are stored in the Git database.

Before looking at the graphic, let's see what the Git command line emits when we print the contents of a commit object. Here's what we get!

$ git cat-file -p 03b6e1b6ff25d4bda9083b2bb165793c87a5d4e5
tree 5275fd785145e22bf8be2663c5166f23e1ea6aee
parent 8cf7b994d9815ccf2787e8660c6679ca0a012430
author Rajeev Edmonds <rajeevedmonds@users.noreply.github.com> 1628057403 +0530
committer Rajeev Edmonds <rajeevedmonds@users.noreply.github.com> 1628057403 +0530
        
style: change border of about page avatar image

A commit object has the following attributes, viz.,

  • A top-level directory tree hash for the snapshot.
  • A parent commit (if any) hash.
  • Author information with a UNIX timestamp.
  • Committer information with a UNIX timestamp.
  • And, a commit message.
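The attributes listed above can be pulled out of the command-line output with a few lines of Python. This is only an illustrative sketch: parse_commit is a name I made up, and it ignores optional headers such as signatures.

```python
def parse_commit(text: str) -> dict:
    """Split `git cat-file -p <commit>` output into its headers and message."""
    # Headers and message are separated by the first blank line.
    head, _, message = text.partition("\n\n")
    commit = {"tree": None, "parents": [], "author": None,
              "committer": None, "message": message.strip()}
    for line in head.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            # A commit may have zero, one, or several parent lines.
            commit["parents"].append(value)
        elif key in ("tree", "author", "committer"):
            commit[key] = value
    return commit


sample = (
    "tree 5275fd785145e22bf8be2663c5166f23e1ea6aee\n"
    "parent 8cf7b994d9815ccf2787e8660c6679ca0a012430\n"
    "author Rajeev Edmonds <rajeevedmonds@users.noreply.github.com> 1628057403 +0530\n"
    "committer Rajeev Edmonds <rajeevedmonds@users.noreply.github.com> 1628057403 +0530\n"
    "\n"
    "style: change border of about page avatar image\n"
)

commit = parse_commit(sample)
print(commit["tree"])     # 5275fd785145e22bf8be2663c5166f23e1ea6aee
print(commit["parents"])  # ['8cf7b994d9815ccf2787e8660c6679ca0a012430']
print(commit["message"])  # style: change border of about page avatar image
```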

Now, let's understand it graphically.

A Git commit object illustration

A commit always points to a tree, even if your project consists of a single file. In such a case, the top-level tree simply contains one entry: the blob associated with that file. Depending upon the directory structure at the time of the snapshot, a tree may contain sub-trees and blobs.

In case it's your very first commit for a project, the commit object has no parent line at all, because there's nothing for it to point to. A merge commit, on the other hand, has more than one parent line.

Anatomy of a Tree Object

Let's move on and pay attention to the tree object. The commit object graphic shown above includes a hash pointing to the directory tree for that snapshot.

First, let's take a look at the same on the command line. For that, you can use the following command. It displays the tree object pointed to by the latest commit on the master branch. It's the actual output from one of my projects.

$ git cat-file -p master^{tree}
100644 blob ac67ef2fee465686a0d6785ebe7d9ca749bb2a54    .eleventy.js
100644 blob d2bc75ca4d87786726be14921f7d0a5668445da7    .gitignore
100644 blob d852ec19c83842a046d284cc788d65994356d90e    README.md
100644 blob 901ae6b698ad239b82a02295afd9c5f9a47a42b9    netlify.toml
100644 blob 10d24740a72f42c4d9b84f1db7c2d81ef3459866    package-lock.json
100644 blob f44fa19c0825292000e97e3c8d2261bac417acf5    package.json
040000 tree 01e532758656f7ad1d9529b1b321a98be297f65a    src

You can see 6 blobs and one tree here. The former are 6 files and the latter is the src directory. Each of the 6 blobs is the snapshot of the respective file shown on the right. The first column is the entry's mode: 100644 marks a regular file, and 040000 marks a directory.

As mentioned earlier, a tree may point to other sub-trees and blobs depending on the directory structure below the root-level tree.
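Internally, a tree is just a sorted list of (mode, name, hash) entries, and it gets its own hash the same way a blob does. Here's a hedged Python sketch of that serialization, assuming SHA-1 object IDs. The function tree_hash is my own illustration, and the result for the two sample entries is not claimed to match any tree in the project above.

```python
import hashlib


def tree_hash(entries) -> str:
    """Hash a tree given an iterable of (mode, name, object_id_hex) tuples."""
    def sort_key(entry):
        mode, name, _ = entry
        # Git orders entries by name, comparing directories as if
        # their names ended with a slash.
        return (name + "/" if mode == "40000" else name).encode("utf-8")

    body = b""
    for mode, name, object_id in sorted(entries, key=sort_key):
        # Each entry: "<mode> <name>" + NUL + the 20 raw bytes of the hash.
        body += (mode.encode("ascii") + b" " + name.encode("utf-8")
                 + b"\0" + bytes.fromhex(object_id))
    header = b"tree " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# A tree with no entries always has the same well-known ID.
print(tree_hash([]))  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904

# Two entries taken from the listing above. In the stored form, the
# directory mode is written as 40000, without the leading zero.
entries = [
    ("40000", "src", "01e532758656f7ad1d9529b1b321a98be297f65a"),
    ("100644", "README.md", "d852ec19c83842a046d284cc788d65994356d90e"),
]
print(tree_hash(entries))
```

Because a tree's hash depends on the hashes of everything inside it, changing a single file deep in the project changes the hash of every tree above it, right up to the root.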

And now, let's visualize it graphically.

An illustration of condensed tree object in Git

An illustration of Git tree object

The sub-tree representing the src directory may itself contain more sub-trees and blobs. The larger and more complex the directory structure, the longer and denser the snapshot tree.

Let's see what our sub-tree consists of. Remember, it's the actual project I'm working on. You can view it easily through its tree hash (the one listed against src above). Here we go!

$ git cat-file -p 01e532758656f7ad1d9529b1b321a98be297f65a
100644 blob cb02fb6565eae9b03f13efb2ef9103f9dd64c0f8    .eleventyignore
100644 blob 22be4d8b5a6fc8122dc1aa36da88bc183f9ce981    404.html
040000 tree 45261c1bf4cf7706f8e9246ae5d1e5036210f0fd    _data
040000 tree c5841a54043b62e1d46a370691834bd6030698e6    _includes
040000 tree 868386fa348f1302a594a0e266195bb0239b0ec5    _partials
100644 blob 9a63b13e6a84c471db5f6c22f1fba8cc554758e8    index.html
040000 tree c4bcefe913f3b4758e0c7fe1e39853492596ebcf    js
100644 blob dc83c83c125e6a9c523ec079d722dbe9220832cf    robots.njk
100644 blob 3c987d9aaa0ce77bdcfff529c2e5c6761dfe3148    sitemap.njk
100644 blob 5bf7d929ae4a07b70cbe0a96a76b45cf210795ed    tags.html

You can easily see that our src directory contains more sub-directories (sub-trees) and files (blobs). A tree may contain just a collection of blobs without a single sub-tree. This happens when you don't have any sub-directory within the parent directory.

So, that's how Git tree objects work under the hood. If you've already read about the tree data structure, grasping all this is going to be a cakewalk for you.

Anatomy of an Annotated Tag Object

The last one is our tag object. If you frequently work with repos that periodically push software releases for internal or public use, you may already be familiar with Git tags.

A tag object is quite similar to a commit object. It consists of a pointer to the object it's associated with (usually a commit), the type of that object, the tag name, the tagger details, and a message.

Let's see what the command line has to tell about our hypothetical tag object.

$ git cat-file -p 375a229b8aafd6a09052f038d2db00b690bdd3f6
object cc727ca3c1f017d1c5d26e908b29c1136efe2e3e
type commit
tag v1.0.1
tagger Rajeev Edmonds <rajeevedmonds@users.noreply.github.com> 1628749051 +0530
    
First patch for the first major release

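Here, you can see that this annotated tag is pointing to the commit cc727ca. The type field describes the object being tagged (a commit in this case); the tag's own object type is tag. Tags are like branch references. The difference is that a tag doesn't move when you add new commits, so it always points to the same commit. Bookmarks, you see!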
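The difference between the tag's own type and the type of its target is easier to see when you assemble the stored bytes by hand. The following Python sketch does that under the assumption of SHA-1 object IDs. The function build_tag_object is hypothetical, and since the exact bytes of my real tag aren't reproduced here, the resulting hash need not match the one shown above.

```python
import hashlib


def build_tag_object(target_id, target_type, name, tagger, message) -> bytes:
    """Assemble the uncompressed bytes of an annotated tag object."""
    payload = (
        f"object {target_id}\n"
        f"type {target_type}\n"   # type of the object being tagged
        f"tag {name}\n"
        f"tagger {tagger}\n"
        f"\n"
        f"{message}\n"
    ).encode("utf-8")
    # The object header says "tag": that's the tag's own object type.
    return b"tag " + str(len(payload)).encode("ascii") + b"\0" + payload


raw = build_tag_object(
    "cc727ca3c1f017d1c5d26e908b29c1136efe2e3e",
    "commit",
    "v1.0.1",
    "Rajeev Edmonds <rajeevedmonds@users.noreply.github.com> 1628749051 +0530",
    "First patch for the first major release",
)

print(raw.split(b" ", 1)[0])         # b'tag'
print(hashlib.sha1(raw).hexdigest())
```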

Here's a graphical representation of the same.

An illustration of Git tag object

Tags are frequently used for the semantic versioning of real-world projects. Technically, you can make a tag point to almost any type of Git object.

Multiple tags can point to the same Git object, though it's rare to see this in small to medium projects. Generally, one of these tags is a human-readable label for the development team's internal use, and the other one is the semantic version for public use.
