Git Internals — Objects, Trees & the .git Directory
Why Internals Matter
You can use Git effectively without understanding its internals. But understanding the internals makes you dramatically more effective when:
- A merge conflict seems inexplicable
- A rebase produces unexpected results
- You need to recover from an unusual situation
- You want to understand what a complex command is actually doing
More fundamentally, Git is elegant. Once you see the object model, all the commands stop being a collection of memorized incantations and become obvious extensions of a simple, coherent design. The internals are worth understanding for their own sake.
Content-Addressable Storage
Git's entire object system is built on a simple principle: objects are identified by their content.
Every object Git stores gets a SHA-1 hash computed from its content (plus a short type-and-size header, described under "The Four Object Types"). The hash is 40 hexadecimal characters. Git uses the first 2 characters as a directory name and the remaining 38 as the filename inside .git/objects/.
This system is called content-addressable storage. Its properties:
- Deterministic: The same content always produces the same hash. Two identical files are stored once.
- Tamper-evident: Changing even a single byte changes the hash completely. You cannot quietly modify stored objects.
- Efficient: Duplicate content (across commits, across files) is stored exactly once.
- Self-verifying: Git can detect corruption by recomputing hashes and comparing.
Let's verify this ourselves:
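A minimal Python sketch of the idea, assuming the blob header format described later in this lesson (`blob <size>\0`). The function names `git_blob_hash` and `loose_object_path` are illustrative, not part of Git:

```python
import hashlib

def git_blob_hash(content: bytes) -> str:
    """Return the SHA-1 object ID Git would assign to a blob with this content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

def loose_object_path(object_id: str) -> str:
    """Map a 40-character object ID to its loose-object path under .git/objects/."""
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"

first = git_blob_hash(b"hello world\n")
second = git_blob_hash(b"hello world\n")    # same bytes, same ID
changed = git_blob_hash(b"hello world!\n")  # one extra byte, unrelated ID

print(first)                      # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
print(loose_object_path(first))   # .git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad
print(first == second, first == changed)  # True False
```

The first printed ID should match what `git hash-object` reports for a file containing the single line `hello world`.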
The Four Object Types
Git stores exactly four types of objects in its database:
- Blob — file content
- Tree — directory listing (references to blobs and other trees)
- Commit — a snapshot with metadata and a pointer to a tree
- Tag — an annotated tag object pointing to another object
Every object has the same basic structure:
- A header: <type> <size>\0
- The content
The SHA-1 hash is computed from this entire string (header + content).
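The header-plus-content layout can be sketched in a few lines of Python. This is an illustration rather than Git's own code: `encode_object` and `decode_object` are hypothetical helper names, and zlib is used because loose objects are stored zlib-compressed on disk.

```python
import hashlib
import zlib

def encode_object(obj_type: str, body: bytes) -> tuple[str, bytes]:
    """Build '<type> <size>\\0<body>', hash it, and compress it for storage."""
    store = obj_type.encode("ascii") + b" " + str(len(body)).encode("ascii") + b"\0" + body
    return hashlib.sha1(store).hexdigest(), zlib.compress(store)

def decode_object(compressed: bytes) -> tuple[str, bytes]:
    """Reverse encode_object: decompress, split off the header, check the size."""
    store = zlib.decompress(compressed)
    header, _, body = store.partition(b"\0")   # splits at the first NUL only
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match body length")
    return obj_type, body

object_id, on_disk = encode_object("blob", b"test content\n")
print(object_id)               # d670460b4b4aece5915caf5c68d12f560a9fe3e4
print(decode_object(on_disk))  # ('blob', b'test content\n')
```

Note that the hash is taken over the uncompressed bytes, so the compression level never affects an object's ID.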
Blobs: Storing File Content
A blob stores the raw content of a file. Nothing else — no filename, no permissions, no metadata. Just the bytes.
Because blobs store only content (no filename), two files with identical content in different locations or with different names share the same blob. This is how Git achieves space efficiency.
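To make the sharing concrete, here is a small sketch that uses a Python dictionary as a stand-in for .git/objects. The file names and contents are made up for illustration:

```python
import hashlib

def blob_id(content: bytes) -> str:
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

# Hypothetical working tree: path -> file bytes.
files = {
    "LICENSE": b"MIT License\n",
    "docs/LICENSE.txt": b"MIT License\n",   # same bytes, different name
    "README.md": b"# demo\n",
}

object_store: dict[str, bytes] = {}   # blob ID -> content (stand-in for .git/objects)
path_to_blob: dict[str, str] = {}     # names are recorded outside the blob

for path, content in files.items():
    oid = blob_id(content)
    object_store[oid] = content       # storing the same ID again changes nothing
    path_to_blob[path] = oid

print(len(files), "files ->", len(object_store), "blobs")  # 3 files -> 2 blobs
```

The two license files map to one blob ID; only the path-to-ID mapping (the job of trees, covered next) distinguishes them.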
Trees: Storing Directory Structure
A tree is the Git equivalent of a directory. It is a list of entries, where each entry contains:
- The mode (file permissions: 100644 for regular file, 100755 for executable, 040000 for directory, 120000 for symlink)
- The object type (blob or tree)
- The SHA-1 hash of the referenced object
- The filename
A tree can reference other trees (subdirectories):
Because tree objects reference other objects by their SHA-1 hash, the tree hash changes if any file in the directory (or any subdirectory) changes. This creates a cascading integrity guarantee: the commit hash depends on the root tree hash, which depends on all the file and subdirectory hashes, all the way down.
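The cascade can be demonstrated with a short sketch of tree serialization. It assumes the raw on-disk entry layout `<mode> <name>\0<20-byte binary hash>`, where the type is implied by the mode and directories are written as `40000` (tools such as git ls-tree display it padded as 040000). The file names and the `build` helper are hypothetical.

```python
import hashlib

def object_id(obj_type: str, body: bytes) -> str:
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()

def tree_body(entries: list[tuple[str, str, str]]) -> bytes:
    """Serialize (mode, name, hex_id) entries in raw tree-object form."""
    def sort_key(entry):
        mode, name, _ = entry
        # Entries are ordered by name, with directories compared as if they ended in "/".
        return (name + "/" if mode == "40000" else name).encode("utf-8")
    out = b""
    for mode, name, hex_id in sorted(entries, key=sort_key):
        out += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\0" + bytes.fromhex(hex_id)
    return out

def build(readme: bytes, script: bytes) -> str:
    """Hash a hypothetical two-level tree: README.md plus src/run.sh."""
    src = object_id("tree", tree_body([("100755", "run.sh", object_id("blob", script))]))
    root = tree_body([
        ("100644", "README.md", object_id("blob", readme)),
        ("40000", "src", src),
    ])
    return object_id("tree", root)

before = build(b"# demo\n", b"echo hi\n")
after = build(b"# demo\n", b"echo bye\n")   # only the nested file changed
print(before != after)                      # True: the root tree ID changes too
print(object_id("tree", tree_body([])))     # 4b825dc642cb6eb9a060e54bf8d69288fbee4904 (empty tree)
```

Editing src/run.sh changes its blob ID, which changes the src tree, which changes the root tree, even though README.md is untouched.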
Commit Objects: The Full Picture
A commit object is what you create when you run git commit. It contains:
- A pointer to the root tree (the snapshot)
- Pointers to parent commit(s) (zero for root commit, one for normal commit, two or more for merge commit)
- Author (name, email, timestamp)
- Committer (name, email, timestamp — can differ from author, e.g., when applying a patch)
- The commit message
The parent field is what creates the linked list of commits that forms the history. A root commit has no parent line, a normal commit has one, and a merge commit has two or more.
Because the commit hash includes the tree hash and the parent hash(es), changing anything in history (any file content, any commit message, any parent pointer) produces a completely different hash for that commit and all its descendants. This is why "rewriting history" with rebase produces new commit hashes.
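The ripple effect can be reproduced with a sketch that serializes commit objects by hand. The author identity, timestamp, and messages are made-up values, fixed so the output is reproducible; every commit reuses the empty tree to keep the example short.

```python
import hashlib

EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"

def commit_id(tree: str, parents: list[str], message: str) -> str:
    # Made-up identity and timestamp, identical for author and committer.
    who = "Ada Example <ada@example.com> 1700000000 +0000"
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {who}", f"committer {who}", "", message]
    body = ("\n".join(lines) + "\n").encode("utf-8")
    header = f"commit {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()

def chain(first_message: str) -> list[str]:
    """Build three commits in a row; only the first message varies."""
    root = commit_id(EMPTY_TREE, [], first_message)
    second = commit_id(EMPTY_TREE, [root], "Add feature")
    third = commit_id(EMPTY_TREE, [second], "Fix bug")
    return [root, second, third]

original = chain("Initial commit")
rewritten = chain("Initial commit (reworded)")
for old, new in zip(original, rewritten):
    print(old[:7], "->", new[:7])   # all three IDs differ
```

Only the first commit's message changed, yet the second and third commits also get new IDs because each one embeds its parent's hash.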
The Directed Acyclic Graph (DAG)
The commit history forms a directed acyclic graph (DAG): directed (parent pointers go one way: child to parent), acyclic (you can never follow parent pointers and return to your starting commit).
Commit D has two parents (B and C) — it is a merge commit. You can follow the arrows from any commit back to the root, but you cannot cycle.
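As an illustration, the graph can be modeled as a dictionary of parent pointers. The sketch assumes that B and C both have A as their parent, which the text above does not state explicitly:

```python
# Parent pointers for a hypothetical four-commit history.
parents = {
    "A": [],          # root commit
    "B": ["A"],
    "C": ["A"],
    "D": ["B", "C"],  # merge commit
}

def ancestors(commit: str) -> set[str]:
    """Collect every commit reachable by following parent pointers."""
    seen: set[str] = set()
    stack = list(parents[commit])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen

print(sorted(ancestors("D")))                        # ['A', 'B', 'C']
print(all(c not in ancestors(c) for c in parents))   # True: no commit is its own ancestor
```

A cycle is impossible in a real repository because a commit's hash depends on its parents' hashes, so a parent must exist before its child can be created.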
Tag Objects
Annotated tag objects (from git tag -a) are the fourth object type:
A tag object points to another object (usually a commit, but tags can technically point to any object). The object field contains the SHA-1 of the tagged object.
Lightweight tags do not create tag objects — they are just reference files (like branches) that point directly to a commit.
How Branches Are Files
Branches are implemented as plain text files in .git/refs/heads/. Each file contains exactly one thing: the SHA-1 hash of the commit that branch points to.
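This is why branching is free in Git. Creating a branch is literally creating a 41-byte file (40 hexadecimal characters plus a newline). Deleting a branch is deleting that file. Switching branches is updating HEAD to point to a different file, then updating the index and working tree to match that commit.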
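The file-level mechanics can be imitated in a scratch directory. This sketch builds a fake .git layout by hand rather than calling Git, uses a placeholder commit hash, and handles only loose ref files (real repositories may also keep refs in .git/packed-refs, which is ignored here). `resolve_head` is a hypothetical helper.

```python
import tempfile
from pathlib import Path

def resolve_head(git_dir: Path) -> str:
    """Follow HEAD to a commit hash, reading loose ref files only."""
    head = (git_dir / "HEAD").read_text().strip()
    if head.startswith("ref: "):
        return (git_dir / head[len("ref: "):]).read_text().strip()
    return head  # detached HEAD stores a hash directly

with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp) / ".git"
    heads = git_dir / "refs" / "heads"
    heads.mkdir(parents=True)

    commit = "0123456789abcdef0123456789abcdef01234567"  # placeholder, not a real commit
    (heads / "main").write_bytes((commit + "\n").encode("ascii"))
    (git_dir / "HEAD").write_bytes(b"ref: refs/heads/main\n")

    # "Creating a branch": write one more small file.
    (heads / "feature").write_bytes((commit + "\n").encode("ascii"))
    branch_file_size = (heads / "feature").stat().st_size

    # "Switching branches": repoint HEAD (Git would also update the index and working tree).
    (git_dir / "HEAD").write_bytes(b"ref: refs/heads/feature\n")
    resolved = resolve_head(git_dir)

print(branch_file_size)    # 41
print(resolved == commit)  # True
```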
Remote-Tracking Branches as Files
Remote-tracking branches follow the same pattern, stored in .git/refs/remotes/:
The Index: Staging Area Internals
The staging area (index) is stored in .git/index. It is a binary file that lists every tracked file with its:
- SHA-1 hash (blob hash of the staged content)
- File mode (permissions)
- File name
- Timestamps and stat information (for performance: comparing mtime to detect changes without reading content)
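In the output of git ls-files --stage, each line shows the mode, the blob hash, a stage number, and the path. The third column (0) is the "stage number." During a merge conflict: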
- Stage 0: normal, no conflict
- Stage 1: the common ancestor (base) version
- Stage 2: the HEAD (current branch) version
- Stage 3: the MERGE_HEAD (incoming branch) version
This is why you can do git checkout --ours taskr.sh and git checkout --theirs taskr.sh — Git reads the blob from stage 2 or stage 3 respectively.
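A short parser makes the stage numbers tangible. The sample text below is made up (the hashes are repeated digits, not real objects) but follows the `<mode> <hash> <stage><TAB><path>` shape that git ls-files --stage prints; `parse_stage_listing` is a hypothetical helper.

```python
from collections import defaultdict

STAGE_NAMES = {0: "merged", 1: "base", 2: "ours", 3: "theirs"}

# Made-up listing: README.md is clean, taskr.sh is in conflict.
SAMPLE = (
    f"100644 {'1' * 40} 0\tREADME.md\n"
    f"100755 {'2' * 40} 1\ttaskr.sh\n"
    f"100755 {'3' * 40} 2\ttaskr.sh\n"
    f"100755 {'4' * 40} 3\ttaskr.sh\n"
)

def parse_stage_listing(text: str) -> dict[str, dict[str, str]]:
    """Group index entries by path, keyed by the meaning of each stage number."""
    index = defaultdict(dict)
    for line in text.splitlines():
        meta, path = line.split("\t", 1)
        _mode, object_id, stage = meta.split(" ")
        index[path][STAGE_NAMES[int(stage)]] = object_id
    return dict(index)

index = parse_stage_listing(SAMPLE)
conflicted = [path for path, stages in index.items() if "merged" not in stages]
print(conflicted)                                                    # ['taskr.sh']
print(index["taskr.sh"]["ours"][:7], index["taskr.sh"]["theirs"][:7])  # 3333333 4444444
```

A path is in conflict exactly when it has no stage 0 entry; resolving the conflict and staging the result collapses stages 1 to 3 back into a single stage 0 entry.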
The .git Directory: Complete Layout
Packfiles and Garbage Collection
Initially, every object Git creates is stored as a separate compressed file in .git/objects/ (called a "loose object"). This works fine for small repositories but becomes inefficient as the repository grows.
Git periodically (or when you run git gc) packs loose objects into packfiles: single large files that store many objects together, often with delta compression. Delta compression stores similar objects as a base object plus a diff. This resembles how SVN stores changes, but it is purely a storage optimization; conceptually, every commit is still a full snapshot.
After git gc:
Packfiles are stored in .git/objects/pack/ with a .pack extension and an accompanying .idx index file.
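Garbage collection also removes objects that are unreachable from any reference (branch, tag, or reflog entry). This is how git reset --hard can theoretically result in data loss: the reset commits become unreachable from any branch and will eventually be deleted by git gc. In practice, the reflog keeps them accessible for a while. By default, reflog entries expire after 90 days, or after 30 days when the commit is no longer reachable from the current tip of the ref, which is the case for commits dropped by a reset.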
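The reachability rule can be sketched as a simple graph walk. The object names below are hypothetical labels rather than real hashes, and the sketch ignores expiry times and grace periods; it only shows which objects a collector would consider garbage.

```python
# Hypothetical object graph: each object lists the objects it references.
objects = {
    "commit2": ["tree2", "commit1"],
    "commit1": ["tree1"],
    "tree2": ["blob_a", "blob_b"],
    "tree1": ["blob_a"],
    "blob_a": [],
    "blob_b": [],
    "commit_x": ["tree_x"],          # left behind by a reset
    "tree_x": ["blob_a", "blob_x"],
    "blob_x": [],
}

def reachable(roots: list[str]) -> set[str]:
    """Return every object reachable from the given starting points."""
    seen: set[str] = set()
    stack = list(roots)
    while stack:
        oid = stack.pop()
        if oid not in seen:
            seen.add(oid)
            stack.extend(objects[oid])
    return seen

branch_tips = ["commit2"]
reflog_entries = ["commit_x"]

# While the reflog still mentions commit_x, nothing is garbage.
print(sorted(set(objects) - reachable(branch_tips + reflog_entries)))  # []
# Once that reflog entry expires, the reset commit and its unique objects can go.
print(sorted(set(objects) - reachable(branch_tips)))  # ['blob_x', 'commit_x', 'tree_x']
```

Note that blob_a survives in both cases because the live history still references it.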
Exploring the Object Database
Let's do a complete walkthrough, tracing from a branch reference all the way to file content:
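One way to follow the chain without a real repository is an in-memory model. The sketch below is an approximation under stated assumptions: a dictionary stands in for .git/objects, another for the ref files, and the file name, identity, and message are made up. In a real repository you would inspect the same chain with git cat-file and git ls-tree.

```python
import hashlib

store: dict[str, bytes] = {}   # object ID -> header + body (stand-in for .git/objects)

def put(obj_type: str, body: bytes) -> str:
    raw = f"{obj_type} {len(body)}".encode("ascii") + b"\0" + body
    oid = hashlib.sha1(raw).hexdigest()
    store[oid] = raw
    return oid

def get(oid: str) -> tuple[str, bytes]:
    header, _, body = store[oid].partition(b"\0")
    return header.split(b" ")[0].decode("ascii"), body

def tree_entries(body: bytes) -> dict[str, str]:
    """Parse raw tree entries (mode, space, name, NUL, 20-byte ID) into name -> hex ID."""
    entries, pos = {}, 0
    while pos < len(body):
        nul = body.index(b"\0", pos)
        _mode, name = body[pos:nul].split(b" ", 1)
        entries[name.decode("utf-8")] = body[nul + 1:nul + 21].hex()
        pos = nul + 21
    return entries

# Build a one-file, one-commit repository.
blob = put("blob", b"echo hello\n")
tree = put("tree", b"100755 taskr.sh\0" + bytes.fromhex(blob))
who = "Ada Example <ada@example.com> 1700000000 +0000"
commit = put("commit", f"tree {tree}\nauthor {who}\ncommitter {who}\n\nInitial commit\n".encode())
refs = {"HEAD": "ref: refs/heads/main", "refs/heads/main": commit}

# Trace: HEAD -> branch -> commit -> tree -> blob -> content.
branch = refs["HEAD"].removeprefix("ref: ")
commit_type, commit_body = get(refs[branch])
tree_id = commit_body.split(b"\n", 1)[0].decode().split(" ")[1]
blob_id = tree_entries(get(tree_id)[1])["taskr.sh"]
blob_type, content = get(blob_id)

print(branch, "->", commit_type, "->", tree_id[:7], "->", blob_id[:7])
print(content.decode(), end="")   # echo hello
```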
You have traced the entire chain from branch name → commit object → tree object → blob object → file content. This is what Git does internally every time you check out a branch.
git ls-tree — Exploring Trees
Practical Exercises
Exercise 1: Explore Your Object Database
Exercise 2: Trace the Commit Chain
Exercise 3: Understand Branches as Files
Exercise 4: The Index During a Conflict
Summary
- Git's object database is content-addressable: every object is identified by the SHA-1 hash of its content. Same content = same hash, stored once.
- Four object types: blob (file content), tree (directory listing), commit (snapshot with metadata and parent pointers), tag (annotated tag object).
- Commits form a directed acyclic graph: each commit points to its parent(s), creating an immutable chain of history.
- Branches are 41-byte files in .git/refs/heads/ containing a commit hash. This is why branching is nearly free.
- The index (.git/index) is the staging area: a binary file listing all tracked files with their current staged blob hashes.
- During conflicts, the index holds three versions of each conflicted file (stages 1, 2, 3: ancestor, ours, theirs).
- Packfiles compress many loose objects into a single file using delta compression, keeping repositories compact over time.
- Understanding the object model explains why rebase creates new commit hashes, why branch creation is free, and how git reflog recovers "deleted" commits.
The final lesson brings everything together: the workflow patterns that teams use to organize branches, integrate changes, and maintain a healthy repository at scale.