Git Internals Explained – How Git Actually Stores Data


Understanding Git from the inside out — clearly, practically and with real-world analogies.

Why You Should Care About Git Internals

Most software engineers use Git daily (git add, git commit, git push) without ever thinking about what Git actually does under the hood. That’s fine, until it isn’t.

Understanding Git internals helps

  • Debug strange Git issues with confidence
  • Recover lost commits and corrupted repositories
  • Design better CI/CD and branching strategies
  • Use Git more efficiently at scale (large repos, mono-repos)
  • Explain Git clearly to your team (a leadership skill)

At its core, Git is a content-addressable database with version control built on top. Once we understand this single idea, everything else clicks.

Git in One Sentence

Git stores snapshots of your project as immutable objects, addressed by the cryptographic hash of their content.

The .git Directory: Git’s Brain

When we run

git init

Git creates a hidden directory

.git/

This folder contains everything Git knows about our repository. Delete it, and our project becomes a normal folder again.

Key internal directories

.git/
├── objects/ # All Git data lives here
├── refs/ # Branches and tags
├── HEAD # Pointer to current branch
├── index # Staging area
└── config # Repo-specific config

We’ll focus mainly on objects, because that’s where Git truly stores data.

Git Is a Content-Addressable Object Store

Git does not store files. It stores objects.

Each object is

  • Immutable (never changes)
  • Identified by a SHA-1 hash by default (40 hex characters)
  • Stored based on its content, not its name

If two files have identical content, Git stores only one object.

This design makes Git

  • Extremely space-efficient
  • Naturally deduplicated
  • Cryptographically verifiable
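As a rough illustration of these properties, here is a minimal in-memory sketch in Python. The ObjectStore class and its put and get methods are hypothetical names invented for this example, not part of Git. It hashes the raw content with SHA-1 and skips the type header and zlib compression that real Git adds.

```python
import hashlib


class ObjectStore:
    """Toy content-addressable store: the key is the SHA-1 of the content."""

    def __init__(self):
        self._objects = {}

    def put(self, content: bytes) -> str:
        key = hashlib.sha1(content).hexdigest()
        # Identical content always maps to the same key, so it is stored once.
        self._objects[key] = content
        return key

    def get(self, key: str) -> bytes:
        content = self._objects[key]
        # Re-hash on read: any corruption shows up as a key mismatch.
        if hashlib.sha1(content).hexdigest() != key:
            raise ValueError(f"object {key} is corrupted")
        return content

    def __len__(self):
        return len(self._objects)


store = ObjectStore()
key_a = store.put(b"same bytes")
key_b = store.put(b"same bytes")      # a second "file" with identical content
print(key_a == key_b, len(store))     # True 1
```

Because the key is derived from the content, deduplication and integrity checking come for free rather than being separate features.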

The Four Core Git Objects

Git has only four object types:

  1. Blob – file content
  2. Tree – directory structure
  3. Commit – snapshot + metadata
  4. Tag – annotated tag object (optional)

Everything in Git is built from these.

Blob Objects – Storing File Content

A blob stores the contents of a file and nothing else. Blobs do not store the file name, the file path, or any file history.

Example

File

hello.txt

Content

Hello Git

Git stores

blob "Hello Git"

We can compute its object ID with

git hash-object hello.txt

This outputs a SHA-1 hash (the exact value depends on the file's bytes, including any trailing newline), for example

557db03de997c86a4a028e1ebd3a1ceb225be238

After git add, Git stores this blob at

.git/objects/55/7db03de997c86a4a028e1ebd3a1ceb225be238

Compressed with zlib. Immutable. Kept for as long as something still references it.
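To make the hashing step concrete, here is a small Python sketch of how a blob ID is derived in a SHA-1 repository: Git hashes a short header ("blob", the size in bytes, and a NUL byte) followed by the content. The function names git_blob_hash and loose_object_path are made up for this example, and the sketch assumes hello.txt ends with a newline. It only computes the ID and path; it does not write the zlib-compressed file.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with this content."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def loose_object_path(object_id: str) -> str:
    """First two hex characters name the directory, the rest name the file."""
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


object_id = git_blob_hash(b"Hello Git\n")  # assumes the file ends with a newline
print(object_id)
print(loose_object_path(object_id))
```

Note that the file name never enters the hash, which is why renaming a file does not create a new blob.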

Tree Objects: Representing Directories

A tree represents a directory.

It maps filenames to blobs or other trees, along with their permissions (file modes).

Think of a tree as Git’s version of a directory listing.

Example Tree

project/
├── README.md
└── src/
    └── app.py

Git creates

  • Blob for README.md
  • Blob for app.py
  • Tree for src/
  • Tree for project/

Trees reference blobs and other trees by hash, not by path.
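The example above can be sketched in Python. Each tree entry is serialized as a mode, a space, the name, a NUL byte, and the 20 raw bytes of the referenced object's SHA-1. The helper names (object_id, tree_body) and the file contents are assumptions made for illustration; the modes 100644 and 40000 are the ones Git uses for regular files and sub-trees.

```python
import hashlib

FILE_MODE = "100644"   # regular, non-executable file
DIR_MODE = "40000"     # sub-tree (directory)


def object_id(obj_type: str, body: bytes) -> str:
    """SHA-1 over the header '<type> <size>', a NUL byte, then the body."""
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries):
    """Serialize (mode, name, hex_id) entries the way a tree object lists them."""
    def sort_key(entry):
        mode, name, _ = entry
        # Git sorts directory names as if they ended with a slash.
        return (name + "/" if mode == DIR_MODE else name).encode()

    body = b""
    for mode, name, hex_id in sorted(entries, key=sort_key):
        body += f"{mode} {name}".encode() + b"\0" + bytes.fromhex(hex_id)
    return body


readme_id = object_id("blob", b"# Project\n")        # README.md (made-up content)
app_id = object_id("blob", b"print('hello')\n")      # src/app.py (made-up content)
src_id = object_id("tree", tree_body([(FILE_MODE, "app.py", app_id)]))
root_id = object_id("tree", tree_body([
    (FILE_MODE, "README.md", readme_id),
    (DIR_MODE, "src", src_id),
]))
print("root tree:", root_id)
```

Since the root tree's hash covers the hash of src/, which covers the hash of app.py, changing any file changes every tree above it. That chain is what makes the structure a Merkle tree.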

Commit Objects – Snapshots, Not Diffs

This is the most misunderstood part of Git.

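Conceptually, Git does not store diffs; each commit points to a full snapshot. (Packfiles may delta-compress objects on disk, but that is a storage detail, not the data model.)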

A commit contains

  • Reference to a root tree
  • Parent commit(s)
  • Author & committer
  • Timestamp
  • Commit message

Commit Structure (Conceptual)

commit
├── tree: <root-tree-hash>
├── parent: <parent-commit-hash>
├── author: Rahul <rahul@email>
├── committer: Rahul <rahul@email>
└── message: "Initial commit"

Each commit is a node in a DAG (Directed Acyclic Graph).
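The conceptual structure maps closely onto the actual text of a commit object. The following Python sketch builds that text and hashes it the same way as any other object. The helper names commit_body and object_id, the timestamp, and the messages are illustrative assumptions. The tree ID used here is the well-known ID of an empty tree.

```python
import hashlib


def commit_body(tree_id, parent_ids, author, committer, message):
    """Build the text of a commit object (before the header is added)."""
    lines = [f"tree {tree_id}"]
    lines += [f"parent {p}" for p in parent_ids]   # none for a root commit, 2+ for a merge
    lines += [f"author {author}", f"committer {committer}", "", message]
    return ("\n".join(lines) + "\n").encode()


def object_id(obj_type, body):
    """SHA-1 over the header '<type> <size>', a NUL byte, then the body."""
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # ID of a tree with no entries
who = "Rahul <rahul@email> 1700000000 +0000"             # name <email> unix-time timezone

first = object_id("commit", commit_body(EMPTY_TREE, [], who, who, "Initial commit"))
second = object_id("commit", commit_body(EMPTY_TREE, [first], who, who, "Second commit"))
print(first)
print(second)   # depends on `first`, so rewriting a parent changes every descendant
```

Because each commit's hash includes its parents' hashes, the edges of the DAG can only point backwards in history, which is what keeps it acyclic.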

Why Git Is Fast & Still Space Efficient

Although commits store full snapshots, Git is still fast and space-efficient. Why?

Because unchanged files reuse the same blob objects. Only new or modified files create new blobs.

This is called structural sharing — a powerful idea used in functional programming and distributed systems.
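A simplified Python model shows the effect. It treats a snapshot as a flat path-to-blob mapping (real Git nests tree objects, and unchanged sub-trees are reused in the same way). The names objects, store_blob, and snapshot are hypothetical and exist only in this sketch.

```python
import hashlib

objects = {}  # object ID -> content; stands in for .git/objects


def store_blob(content: bytes) -> str:
    data = b"blob " + str(len(content)).encode() + b"\0" + content
    oid = hashlib.sha1(data).hexdigest()
    objects[oid] = content          # writing the same ID twice changes nothing
    return oid


def snapshot(files: dict) -> dict:
    """Record every file in the snapshot; return a path -> blob ID mapping."""
    return {path: store_blob(content) for path, content in files.items()}


v1 = snapshot({"README.md": b"docs\n", "src/app.py": b"print(1)\n"})
count_after_v1 = len(objects)

# Second snapshot: only src/app.py changed.
v2 = snapshot({"README.md": b"docs\n", "src/app.py": b"print(2)\n"})
count_after_v2 = len(objects)

print(count_after_v1, count_after_v2)        # 2 3
print(v1["README.md"] == v2["README.md"])    # True
```

Two full snapshots of a two-file project cost three blobs, not four, and the saving grows with the number of unchanged files.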

The Staging Area (Index) – Git’s Secret Weapon

The index (or staging area) sits between the working directory and the repository:

Working Directory → Index → Repository

When we run

git add file.txt

Git

  • Creates a blob
  • Stores it in .git/objects
  • Adds a reference in .git/index

This allows

  • Partial commits
  • Clean commit history
  • Fine-grained control (enterprise teams love this)
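The following Python sketch models why the index enables partial commits. It is a deliberately simplified model: the dictionaries working_dir, objects, and index and the functions add and commit_snapshot are invented for this example, and the real .git/index is a binary file that also caches file metadata.

```python
import hashlib

working_dir = {"a.txt": b"change A\n", "b.txt": b"change B\n"}
objects = {}   # stands in for .git/objects
index = {}     # stands in for .git/index: path -> blob ID


def blob_id(content: bytes) -> str:
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def add(path: str) -> None:
    """Model of `git add <path>`: write the blob, then point the index at it."""
    content = working_dir[path]
    oid = blob_id(content)
    objects[oid] = content
    index[path] = oid


def commit_snapshot() -> dict:
    """A commit records what is in the index, not what is in the working directory."""
    return dict(index)


add("a.txt")                 # stage only one of the two modified files
snap = commit_snapshot()
print(sorted(snap))          # ['a.txt']
```

The change to b.txt stays in the working directory until it is staged, so it can go into a separate, cleaner commit later.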

Branches Are Just Pointers

A Git branch is not a copy of code.

It is simply

refs/heads/main → <commit-hash>

That’s it.

Creating a branch

git branch feature-x

Means

  • New pointer
  • Same commit
  • Zero data copied

This is why branching in Git is cheap and fast.
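To see how little a branch is, the Python sketch below writes a loose ref by hand in a temporary directory. The create_branch function and the commit ID made of forty "1" characters are assumptions for illustration. In a real repository we would use git branch, and Git may also keep refs in a packed-refs file instead of one file per branch.

```python
import tempfile
from pathlib import Path


def create_branch(git_dir: Path, name: str, commit_id: str) -> Path:
    """Write a loose ref: one small file holding a commit ID and a newline."""
    ref = git_dir / "refs" / "heads" / name
    ref.parent.mkdir(parents=True, exist_ok=True)
    ref.write_bytes((commit_id + "\n").encode())
    return ref


# Stand-in directory and a made-up commit ID, so no real repository is touched.
git_dir = Path(tempfile.mkdtemp()) / ".git"
commit_id = "1" * 40

main_ref = create_branch(git_dir, "main", commit_id)
feature_ref = create_branch(git_dir, "feature-x", commit_id)

print(feature_ref.read_bytes() == main_ref.read_bytes())   # True
print(len(feature_ref.read_bytes()))                       # 41
```

The new branch costs 41 bytes regardless of how large the project is.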

HEAD – Where You Are Right Now

HEAD is a special pointer.

Usually

HEAD → refs/heads/main

In a detached HEAD state, it points directly at a commit

HEAD → <commit-hash>
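Both cases can be told apart by reading the HEAD file: when attached, it contains a line starting with "ref: "; when detached, it contains a commit ID. The Python sketch below resolves HEAD against a fake repository layout in a temporary directory. The resolve_head function and the commit ID are hypothetical, and the sketch ignores packed refs and branches that have no commits yet.

```python
import tempfile
from pathlib import Path


def resolve_head(git_dir: Path):
    """Return (branch_ref_or_None, commit_id) for a repository's HEAD."""
    head = (git_dir / "HEAD").read_text().strip()
    if head.startswith("ref: "):
        ref_name = head[len("ref: "):]                 # e.g. refs/heads/main
        commit_id = (git_dir / ref_name).read_text().strip()
        return ref_name, commit_id
    return None, head                                  # detached: HEAD holds the ID itself


# Fake repository layout in a temp directory; the commit ID is made up.
git_dir = Path(tempfile.mkdtemp())
(git_dir / "refs" / "heads").mkdir(parents=True)
(git_dir / "refs" / "heads" / "main").write_text("a" * 40 + "\n")

(git_dir / "HEAD").write_text("ref: refs/heads/main\n")
attached = resolve_head(git_dir)      # HEAD follows the branch

(git_dir / "HEAD").write_text("a" * 40 + "\n")
detached = resolve_head(git_dir)      # same commit, but no branch will move with it

print(attached)
print(detached)
```

In the detached case, new commits advance HEAD only, so nothing else remembers them once we switch away. That is the usual source of "lost" commits.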

Understanding HEAD helps you

  • Recover lost commits
  • Understand rebases
  • Avoid accidental history loss
  • Recover from mistakes safely

Tags – Human-Friendly Names

Tags usually point to commits.

Two types

  • Lightweight tag (simple pointer)
  • Annotated tag (full object)

Annotated tags are preferred for

  • Releases
  • Production deployments
  • Audit trails

Git Internals and Distributed Systems

From an architect’s lens, Git is a distributed system:

  • Immutable data structures
  • Content-addressable storage
  • Merkle DAG
  • Eventual consistency across clones

Every clone is

  • A full replica
  • Capable of independent operation
  • Cryptographically verifiable

This is why Git scales so well across

  • Large enterprises
  • Open-source ecosystems
  • Global teams

Real-World Example – Recovering a Lost Commit

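Because commits are immutable, a commit that no branch points to anymore is not destroyed immediately. Running

git reflog

shows where HEAD pointed in the past, so we can recover almost anything until the old entries expire and garbage collection prunes the objects. That is a lifesaver in production incidents.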
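Git keeps the reflog for HEAD as plain text in .git/logs/HEAD, one line per movement: the old ID, the new ID, who made the change and when, then a tab and a message. The Python sketch below parses a few such lines and lists the commits HEAD has visited. The function names and the sample entries (including the commit IDs) are made up for illustration.

```python
def parse_reflog_line(line: str) -> dict:
    """Split one reflog line into old ID, new ID, identity, and message."""
    meta, _, message = line.rstrip("\n").partition("\t")
    old_id, new_id, who = meta.split(" ", 2)
    return {"old": old_id, "new": new_id, "who": who, "message": message}


def previously_seen_commits(lines):
    """Commit IDs HEAD has pointed to, newest first, without duplicates."""
    seen = []
    for entry in map(parse_reflog_line, reversed(lines)):
        if entry["new"] not in seen:
            seen.append(entry["new"])
    return seen


WHO = "Rahul <rahul@email> 1700000000 +0000"
A, B = "a" * 40, "b" * 40          # made-up commit IDs
sample = [
    f"{'0' * 40} {A} {WHO}\tcommit (initial): Initial commit\n",
    f"{A} {B} {WHO}\tcommit: Add feature\n",
    f"{B} {A} {WHO}\treset: moving to HEAD~1\n",   # B is now "lost"
]

history = previously_seen_commits(sample)
current, lost = history[0], history[1]
print("HEAD is at", current[:7], "and can be moved back to", lost[:7])
```

Once the ID of the lost commit is known, pointing a branch at it again (for example with git branch followed by a new branch name and the ID) makes it reachable once more.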

Remember:

  • Blob → File content
  • Tree → Folder structure
  • Commit → Snapshot + metadata
  • Branch → Pointer
  • HEAD → Current pointer

Once this clicks, Git stops being magical — and becomes predictable.

Architect’s Perspective

Git’s design is

  • Simple
  • Immutable
  • Distributed by default

These same principles appear in

  • Blockchains
  • Event sourcing
  • Modern data platforms

Mastering Git internals doesn’t just make you a better developer — it sharpens system design intuition.


Frequently Asked Questions (FAQ)

Is Git a database?

Yes. Internally, Git works like a content-addressable database. Every file, directory, and commit is stored as an object identified by a cryptographic hash.

How does Git store files internally?

Git stores file contents as blob objects, directories as tree objects, and history as commit objects. Each object is stored using a hash of its content.

Does Git store diffs or full copies of files?

Git stores snapshots, not diffs. Each commit represents a complete snapshot of the project, but Git optimizes storage using references and compression behind the scenes.

Why does Git use hashes (SHA)?

Hashes ensure data integrity. If any content changes, its hash changes. This makes Git history tamper-evident and reliable across distributed systems.

Where does Git store all this data?

Git stores everything inside the .git directory, including objects, references, the index (staging area), and metadata.

Is understanding Git internals useful for everyday developers?

Yes. Understanding Git internals helps developers
1. Debug broken repositories
2. Use Git more confidently
3. Understand branching, merging, and rebasing clearly
4. Work effectively in large teams and CI/CD systems
