Core Concepts of Git

The content of this article is based on my own understanding and https://git-scm.com/book/en/v2.

This article is not a tutorial on using Git, but rather leans towards the theoretical aspect, aiming for a deeper understanding of Git, so that we can use it better and make the tool a powerful assistant for us.

Version Control Systems#

Git is currently the best distributed version control system in the world. A version control system is a system that records a series of changes to files over time, allowing you to revert to a previous version if desired. Version control systems can be divided into three main categories: local version control systems, centralized version control systems, and distributed version control systems.

Local Version Control Systems store various versions of files on the local disk in a certain data format (some VCS save file change patches, which calculate and save the differences when the file content changes). This method somewhat solves the problem of manual copying and pasting but does not address the issue of collaboration among multiple people.

Local Version Control {.to-figure}

Centralized Version Control Systems are not fundamentally different from local version control; they simply add a central server where the version database is stored. Administrators can control the permissions of developers, and developers can pull data from the central server. Although centralized version control solves team collaboration issues, its drawback is also evident: all data is stored on the central server, so if the server crashes or its disk is damaged, the history can be lost for everyone.

Centralized Version Control {.to-figure}

Distributed Version Control Systems differ from the previous two in that every client holds a full copy of the repository. In systems such as Git, Mercurial, Bazaar, and Darcs, cloning a repository copies everything, including the complete history and all commit records, so even if one machine (or the server) crashes, any clone can be used to restore it. Git in addition stores its data as snapshots of files rather than as a series of differences: conceptually, each version records the whole file without caring about the specific changes.

Distributed Version Control {.to-figure}

Git Basics#

Git is a distributed version control system that saves complete snapshots of files rather than differential changes or file patches.

Saving the complete content of each change {.to-figure}

Each commit in Git is a complete snapshot of the project files, so you can restore any previous commit exactly. Here's a question: if my project is 10 MB, does the space occupied by Git grow linearly with the number of commits? If I commit 10 times, does it occupy 100 MB? Clearly not. Git is smarter than that: if a file has not changed, the new commit only saves a pointer to the previous version of the file. In other words, Git stores each distinct version of a file once, and any number of commits can point to it.

Additionally, note that Git is best suited to text files; in fact, it was designed for them, such as source code in various languages, because it can compress and diff text very well (as everyone has experienced, Git's diff can pinpoint a single added or deleted letter). Binary files like videos and images can also be managed by Git, but the results are poor (low compression ratio, no meaningful diff). Experiments show that a 500 KB text file takes only about 50 KB once stored by Git, and after a slight change to the content, the two commits hold two objects of about 50 KB each, so complete snapshots really are saved. (This describes freshly written "loose" objects; when Git later packs objects, for example during git gc, it may store similar versions as deltas to save space, without changing the snapshot model.) For binary files such as videos and images, compression saves very little, and the space occupied by Git grows almost linearly with the number of commits.
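The difference between text and binary data can be illustrated with zlib, the compression library Git uses for its objects. The following is only a rough sketch: the sample inputs are made up for illustration (very repetitive text, and random bytes standing in for already-compressed media), so the exact ratios will not match a real project.

```python
import os
import zlib


def compressed_ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller means better compression)."""
    return len(zlib.compress(data)) / len(data)


# Illustrative inputs, not real project files.
text_like = b"def handler(request):\n    return render(request, 'index.html')\n" * 8000
binary_like = os.urandom(len(text_like))  # stands in for video/image data

print("text-like ratio:   %.3f" % compressed_ratio(text_like))
print("binary-like ratio: %.3f" % compressed_ratio(binary_like))
```

The text-like input shrinks to a small fraction of its size, while the random input stays roughly the same size, which is why every changed version of a large binary file costs about its full size again.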

Unchanged files only save a pointer to the previous version {.to-figure}

A Git project has three working areas: the working directory, the staging area, and the local repository. The working directory is where you are currently editing files. The staging area is where file snapshots are saved when you run git add, and it is also the list of files to be included in the next commit. (Note: a commit is built from the content of the staging area, not from the files in the working directory; this is why a file you have modified but not staged with git add is not saved to the repository.) The local repository is the version library that records the complete state and content of your project at each commit, so committed data is very hard to lose.

Correspondingly, files also have three states: committed, modified, and staged. Committed means that the file has been safely stored in the local version library; modified means that a file has been changed but not yet committed; staged means that the modified file is included in the list of files to be saved in the next commit, i.e., the staging area. Therefore, the basic workflow of using Git is:

Add, delete, or modify files in the working area.
Run git add to save the file snapshot to the staging area.
Commit the update to permanently save the file version in the version library.
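The three areas and three file states can be modelled in a few lines of Python. This is a toy sketch, not how Git is implemented: the class TinyRepo and its methods are hypothetical names, and object IDs are plain SHA-1 hashes of the content. It shows the key point that a commit is built from the staging area, not from the working directory.

```python
import hashlib
from typing import Dict, List


class TinyRepo:
    """Toy model of the working directory, staging area and repository."""

    def __init__(self) -> None:
        self.working: Dict[str, bytes] = {}      # files as they are on disk
        self.index: Dict[str, str] = {}          # staging area: path -> object ID
        self.objects: Dict[str, bytes] = {}      # repository: object ID -> content
        self.commits: List[Dict[str, str]] = []  # each commit is a path -> ID snapshot

    @staticmethod
    def _oid(data: bytes) -> str:
        return hashlib.sha1(data).hexdigest()

    def add(self, path: str) -> None:
        """Like `git add`: snapshot the file into the staging area."""
        data = self.working[path]
        oid = self._oid(data)
        self.objects[oid] = data
        self.index[path] = oid

    def commit(self) -> Dict[str, str]:
        """Like `git commit`: record whatever is staged, ignoring the working directory."""
        snapshot = dict(self.index)
        self.commits.append(snapshot)
        return snapshot

    def status(self, path: str) -> str:
        if path not in self.index:
            return "untracked"
        if self._oid(self.working[path]) != self.index[path]:
            return "modified"
        last = self.commits[-1] if self.commits else {}
        if last.get(path) != self.index[path]:
            return "staged"
        return "committed"


repo = TinyRepo()
repo.working["a.txt"] = b"v1"
repo.add("a.txt")
repo.commit()
repo.working["a.txt"] = b"v2"   # edited, but not staged
print(repo.status("a.txt"))     # the file is now in the "modified" state
```

Committing again at this point would still record the v1 content, because v2 was never staged.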

Git Objects#

Now that we understand the basic workflow of Git, how does Git accomplish this? How does Git distinguish whether a file has changed? Here’s a brief introduction to the basic principles of Git.

SHA-1 Checksum#

Git is a content-addressable file system. This means that at its core, Git is simply a key-value store, where the value is the content of a file and the key is the 40-character hexadecimal SHA-1 checksum of that content plus a small header, for example: 5453545dccd33565a585ffe5f53fda3e067b84d8. Git uses this checksum not for encryption but for data integrity; it ensures that many years later, when you check out a certain commit, it will be exactly the same as it was years ago. Even the slightest modification to a file results in a completely different SHA-1 checksum, a phenomenon known as the "avalanche effect."
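As a small illustration of "content plus header", the sketch below computes the ID Git gives to a blob: the SHA-1 of the header "blob <size>" followed by a NUL byte and then the raw content. The function name git_blob_id is my own; only the hashing scheme comes from Git.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with this content."""
    # Git hashes a short header ("blob <size>\0") followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    print(git_blob_id(b"hello\n"))
    print(git_blob_id(b"hello!\n"))  # one extra character, a completely different ID
```

If Git is installed, the result for a given file should match what git hash-object prints for the same file.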

Thus, the SHA-1 checksum is the pointer to the file mentioned earlier, which is somewhat different from the pointer in C language: C language uses the address of data in memory as a pointer, while Git uses the SHA-1 checksum of the file as a pointer, with the purpose of uniquely distinguishing different objects. However, when the content pointed to by a C language pointer changes, the pointer does not change, but when the content of the file pointed to by a Git pointer changes, the pointer also changes. Therefore, every version of a file in Git has a unique pointer pointing to it.

Blob Objects, Tree Objects, Commit Objects#

The blob object only saves the content of a file, while the tree object is more like a directory in an operating system; it can reference blob objects and other tree objects. A single tree object contains one or more entries, each holding a SHA-1 pointer to a blob object or a child tree object, along with the object's permission mode (mode), type, and filename.

When you modify a file and commit, the changed file generates a new blob object that records the complete content of the file (all content, not just the changes), identified by a new SHA-1 checksum. The entry for that file in this commit is set to the new checksum, while for unchanged files Git simply reuses the pointer of the previous version, i.e., the same SHA-1 checksum, and does not generate a new blob object. This also explains why the total size of a 10 MB project after 10 commits is far less than 100 MB.
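To make the space argument concrete, here is a minimal sketch of a content-addressed store in which every snapshot maps paths to blob IDs. The names object_store, store_blob and snapshot are hypothetical, and the project is ten tiny made-up files; the point is only that identical content is stored once no matter how many snapshots refer to it.

```python
import hashlib
from typing import Dict, List

object_store: Dict[str, bytes] = {}  # object ID -> file content, stored once


def store_blob(content: bytes) -> str:
    header = b"blob %d\0" % len(content)
    oid = hashlib.sha1(header + content).hexdigest()
    object_store.setdefault(oid, content)  # identical content is never stored twice
    return oid


def snapshot(files: Dict[str, bytes]) -> Dict[str, str]:
    """Record a commit-like snapshot: every path maps to a blob ID."""
    return {path: store_blob(data) for path, data in files.items()}


# Ten files, ten snapshots, but only file0.txt changes between snapshots.
files = {"file%d.txt" % i: b"line %d\n" % i * 100 for i in range(10)}
history: List[Dict[str, str]] = [snapshot(files)]
for n in range(1, 10):
    files["file0.txt"] = b"edit %d\n" % n
    history.append(snapshot(files))

print(len(history), "snapshots,", len(object_store), "stored blobs")
```

Ten snapshots of ten files would need 100 blobs if everything were copied each time; here only the 10 original blobs plus 9 new versions of file0.txt are stored.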

At this point the tree objects describe the different snapshots of the project, but a problem remains: you would have to remember the SHA-1 checksum of each top-level tree to retrieve a snapshot, and there is no record of who saved the snapshots, when, or why. The commit object was created to address these issues. Its format is simple: it names the top-level tree object of the project snapshot at that point in time, the ID of the previous (parent) commit object if there is one, and the author/committer information (taken from the Git settings user.name and user.email) with a timestamp, followed by a blank line and the commit message. You can simply run git log to see this information:

$ git log
commit 2cb0bb475c34a48957d18f67d0623e3304a26489
Author: lufficc <luffy.lcc@gmail.com>
Date:   Sun Oct 2 17:29:30 2016 +0800

    fix some font size

commit f0c8b4b31735b5e5e96e456f9b0c8d5fc7a3e68a
Author: lufficc <luffy.lcc@gmail.com>
Date:   Sat Oct 1 02:55:48 2016 +0800

    fix post show css

***********omitted***********

As an example, suppose Test.txt was created before the first commit and its initial SHA-1 checksum starts with 3c4e9c. After it is modified, the second commit generates a new blob object whose checksum starts with 1f7a7a. In the third commit, Test.txt has not changed, so the commit simply records the SHA-1 checksum of the most recent version without generating a new blob object. Files newly added during development get a new blob object when they are committed. Note that every commit object, except for the first one, has a pointer to the previous commit object.

In summary, the blob object saves the content of the file; the tree object is like a folder, saving blob objects and other tree objects; the commit object saves the tree object, commit information, author, email, and the ID of the previous commit object (the first commit does not have one). Git achieves version control and other functions like branching by organizing and managing the states and complex relationships of these objects.
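The relationship between the three object types can be sketched by serializing them the way Git does and hashing the result. This is a simplified illustration: the helper names are mine, the author identity and timestamp are placeholders, and the tree entries are sorted by plain name (real Git applies a slightly different rule to directory names).

```python
import hashlib
from typing import List, Optional, Tuple


def object_id(kind: str, body: bytes) -> str:
    """SHA-1 of '<kind> <size>\\0' + body, the scheme shared by all object types."""
    header = kind.encode() + b" " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries: List[Tuple[str, str, str]]) -> bytes:
    """entries are (mode, name, object ID) tuples, e.g. ("100644", "Test.txt", oid)."""
    body = b""
    for mode, name, oid in sorted(entries, key=lambda e: e[1]):  # simplified ordering
        body += mode.encode() + b" " + name.encode() + b"\0" + bytes.fromhex(oid)
    return body


def commit_body(tree: str, parent: Optional[str], who: str, when: str, message: str) -> bytes:
    lines = ["tree " + tree]
    if parent is not None:            # the first commit has no parent line
        lines.append("parent " + parent)
    lines.append("author %s %s" % (who, when))
    lines.append("committer %s %s" % (who, when))
    return ("\n".join(lines) + "\n\n" + message + "\n").encode()


# Placeholder identity and timestamp, for illustration only.
WHO, WHEN = "A U Thor <author@example.com>", "1475400570 +0800"

blob = object_id("blob", b"hello\n")
tree = object_id("tree", tree_body([("100644", "Test.txt", blob)]))
first = object_id("commit", commit_body(tree, None, WHO, WHEN, "first commit"))
second = object_id("commit", commit_body(tree, first, WHO, WHEN, "second commit"))
print(blob, tree, first, second, sep="\n")
```

The commit points to a tree, the tree points to blobs, and the second commit records the first as its parent, which is exactly the chain described in the summary above.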

Git References#

Now looking at references will be much simpler. If we want to see the complete history of a commit record, we must remember the commit ID, but the commit ID is a 40-character SHA-1 checksum, which is hard to remember. Therefore, references are aliases for SHA-1 checksums, stored in the .git/refs folder.

The most common reference is probably master, as this has long been the default branch name created by Git (it can be changed, but often is not). It always points to the last commit of your project's main branch. If you run cat .git/refs/heads/master in the project root directory, it will output a SHA-1 checksum, for example:

$ cat .git/refs/heads/master
4f3e6a6f8c62bde818b4b3d12c8cf3af45d6dc00

Thus, master is just an alias for a 40-character SHA-1 checksum.

Another question is, how does Git know the last commit ID of your current branch? In the .git folder, there is a HEAD file, like this:

$ cat .git/HEAD
ref: refs/heads/master

The HEAD file does not actually contain the SHA-1 value; it is a reference pointing to the current branch, and its content changes as you switch branches, formatted like this: ref: refs/heads/<branch-name>. When you execute the git commit command, it creates a commit object and sets the parent of this commit object to the SHA-1 value of the reference pointed to by HEAD.
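The lookup from HEAD to a commit ID can be sketched as two file reads. The example below builds a throwaway directory laid out like a tiny .git folder instead of touching a real repository; resolve_head and make_fake_git_dir are hypothetical helper names, and the sketch ignores details such as packed refs.

```python
import os
import tempfile


def resolve_head(git_dir: str) -> str:
    """Follow HEAD to the commit ID it ultimately names."""
    with open(os.path.join(git_dir, "HEAD")) as f:
        head = f.read().strip()
    if head.startswith("ref: "):
        # HEAD names a branch; the branch file holds the commit ID.
        ref_path = os.path.join(git_dir, *head[len("ref: "):].split("/"))
        with open(ref_path) as f:
            return f.read().strip()
    return head  # otherwise HEAD holds a commit ID directly ("detached HEAD")


def make_fake_git_dir(branch: str, commit_id: str) -> str:
    """Create a temporary directory with just HEAD and one branch file."""
    git_dir = tempfile.mkdtemp()
    os.makedirs(os.path.join(git_dir, "refs", "heads"))
    with open(os.path.join(git_dir, "HEAD"), "w") as f:
        f.write("ref: refs/heads/%s\n" % branch)
    with open(os.path.join(git_dir, "refs", "heads", branch), "w") as f:
        f.write(commit_id + "\n")
    return git_dir


fake = make_fake_git_dir("master", "4f3e6a6f8c62bde818b4b3d12c8cf3af45d6dc00")
print(resolve_head(fake))
```

When git commit runs, the ID found this way becomes the parent of the new commit, and the branch file is then updated to the new commit's ID.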

Now let's talk about Git tags. A tag object is very much like a commit object: it contains a tagger, a date, a message, and a pointer. The main difference is that a tag object generally points to a commit rather than a tree. A tag is also like a branch reference, except that a branch reference moves as the project progresses while a tag does not; it always points to the same commit, merely giving it a friendlier name.

Git Branches#

Branches#

Branches are a killer feature of Git, and Git encourages frequent branching and merging in the workflow, even multiple times a day. This is possible because Git branches are very lightweight: in some other version control systems, creating a branch means making a complete copy of the project, whereas Git creates a branch almost instantaneously, regardless of the complexity of your project.

As mentioned earlier, blob objects are the most basic objects in which Git saves file content; above them, commits form a graph in which each commit points to its parent, and branches are simply forks in that graph. In simple terms, a branch is a named reference that contains the 40-character checksum of a commit object, so creating a branch is as simple as writing 41 bytes (40 characters plus a newline) to a file, which is naturally fast and independent of the project's complexity.

The default branch in Git is master, stored in the .git/refs/heads/master file. Suppose you run git branch dev on the master branch to create a branch named dev; what Git actually does is:

Create a new text file named dev (without an extension) in the .git/refs/heads folder.
Write the 40-character SHA-1 checksum of the commit that the current branch (currently master) points to, plus a newline, into the dev file.
Done.
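The steps above can be sketched in Python against a throwaway directory laid out like a tiny .git folder. The helpers read_current_commit and create_branch are hypothetical names, and the sketch assumes HEAD names a branch; it is meant to show how little work is involved, not to replace git branch.

```python
import os
import tempfile


def read_current_commit(git_dir: str) -> str:
    with open(os.path.join(git_dir, "HEAD")) as f:
        head = f.read().strip()              # e.g. "ref: refs/heads/master"
    ref = head[len("ref: "):]                # assumes HEAD names a branch
    with open(os.path.join(git_dir, *ref.split("/"))) as f:
        return f.read().strip()


def create_branch(git_dir: str, name: str) -> str:
    """Create refs/heads/<name> holding the current commit ID; return its path."""
    commit_id = read_current_commit(git_dir)
    path = os.path.join(git_dir, "refs", "heads", name)
    with open(path, "w", newline="\n") as f:  # write "\n" even on Windows
        f.write(commit_id + "\n")
    return path


# Build a throwaway directory laid out like a tiny .git folder.
git_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(git_dir, "refs", "heads"))
with open(os.path.join(git_dir, "HEAD"), "w") as f:
    f.write("ref: refs/heads/master\n")
with open(os.path.join(git_dir, "refs", "heads", "master"), "w") as f:
    f.write("4f3e6a6f8c62bde818b4b3d12c8cf3af45d6dc00\n")

dev_path = create_branch(git_dir, "dev")
print(os.path.getsize(dev_path), "bytes written")
```

The new file is 41 bytes long regardless of how large the project is, which is why branch creation is effectively instantaneous.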


Creating a branch is that simple. So how about switching branches? Even simpler:

Modify the HEAD file in the .git folder to ref: refs/heads/<branch-name>.
Restore the files in the working directory to be exactly the same as the commit pointed to by the branch.
Done.

Remember, the HEAD file points to the current branch, and the branch in turn points to its last commit; that commit ID is also what gets written when you create a new branch from the current branch.
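Switching branches can be modelled in memory with a dictionary of refs and a HEAD string. ToyRepo and its fields are hypothetical, commit IDs are shortened to labels like c1, and the sketch skips everything a real checkout also does (updating the staging area, refusing to overwrite local changes).

```python
from typing import Dict


class ToyRepo:
    def __init__(self) -> None:
        self.snapshots: Dict[str, Dict[str, str]] = {}  # commit ID -> {path: content}
        self.refs: Dict[str, str] = {}                  # "refs/heads/<name>" -> commit ID
        self.head = "ref: refs/heads/master"
        self.working: Dict[str, str] = {}

    def switch(self, branch: str) -> None:
        ref = "refs/heads/" + branch
        commit_id = self.refs[ref]                      # KeyError if the branch is unknown
        self.head = "ref: " + ref                       # step 1: rewrite HEAD
        self.working = dict(self.snapshots[commit_id])  # step 2: restore the files

    def current_commit(self) -> str:
        return self.refs[self.head[len("ref: "):]]


repo = ToyRepo()
repo.snapshots = {"c1": {"README": "v1"}, "c2": {"README": "v2", "dev.txt": "wip"}}
repo.refs = {"refs/heads/master": "c1", "refs/heads/dev": "c2"}
repo.switch("dev")
print(repo.head, repo.current_commit(), sorted(repo.working))
```

Note that HEAD itself never stores the commit ID here; it only names the branch, and the branch is what resolves to a commit.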

Branch Merging#

Now let's talk about merging. The first case is the fast-forward. If the tip of one branch can be reached by following the history of the other, the two lie on a single line of history with no divergence to resolve, so when Git merges them it simply moves the branch pointer forward. This is called a fast-forward. For example, suppose dev is a few commits ahead of master on the same line.

Note the direction of the arrows when drawing such a history: since each commit has a pointer to the previous commit, arrows pointing left (from child to parent) are more reasonable.

When merging the dev branch into the master branch, since they are on the same line, there are no divergences to resolve, so the master pointer is simply moved to the commit that dev points to, which makes the merge very fast.
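A fast-forward can be sketched as an ancestry check followed by a pointer update. The commit names v1 to v5 and the branch positions below are assumed for illustration, and is_ancestor and fast_forward are hypothetical helpers operating on a plain dictionary of parent links.

```python
from typing import Dict, List


def is_ancestor(parents: Dict[str, List[str]], ancestor: str, commit: str) -> bool:
    """True if `ancestor` can be reached from `commit` by following parent links."""
    stack, seen = [commit], set()
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return False


def fast_forward(parents: Dict[str, List[str]], branches: Dict[str, str],
                 target: str, source: str) -> bool:
    """Move `target` to the tip of `source` if no real merge is needed."""
    if not is_ancestor(parents, branches[target], branches[source]):
        return False                      # histories diverged; a real merge is required
    branches[target] = branches[source]   # the whole "merge" is a pointer update
    return True


# Assumed history: v1 <- v2 <- v3 <- v4 <- v5, master at v3 and dev at v5.
parents = {"v1": [], "v2": ["v1"], "v3": ["v2"], "v4": ["v3"], "v5": ["v4"]}
branches = {"master": "v3", "dev": "v5"}
moved = fast_forward(parents, branches, "master", "dev")
print(moved, branches["master"])
```

No new commit object is created in this case; only the master reference changes.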

When branches diverge, Git has more work to do, and conflicts may arise that you will be asked to resolve. For example, consider a history in which master and dev split after v3: dev holds v4 and v5, while master ends at v7.

Since the master branch and the dev branch are not on the same line, i.e., v7 is not a direct ancestor of v5, Git has to perform some additional processing. In this case, Git performs a three-way merge using the tips of the two branches (v7 and v5) and their common ancestor (v3). The merge produces a new merge commit, v8.

Note: The merge commit has two parents (v7 and v5).
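The idea of a three-way merge can be sketched at whole-file granularity: for each path, whichever side changed it relative to the common ancestor wins, and a path changed differently on both sides is a conflict. Real Git applies the same idea line by line inside files. The history shape (including v6 on master), the file contents, and the helper names are all assumptions for illustration.

```python
from typing import Dict, List, Set, Tuple


def ancestors(parents: Dict[str, List[str]], commit: str) -> Set[str]:
    """All commits reachable from `commit`, including itself."""
    seen, stack = set(), [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def merge_base(parents: Dict[str, List[str]], a: str, b: str) -> str:
    """Simplified: the shared ancestor with the longest history of its own."""
    common = ancestors(parents, a) & ancestors(parents, b)
    return max(common, key=lambda c: len(ancestors(parents, c)))


def three_way_merge(base: Dict[str, str], ours: Dict[str, str],
                    theirs: Dict[str, str]) -> Tuple[Dict[str, str], List[str]]:
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:
            result = o            # both sides agree
        elif o == b:
            result = t            # only theirs changed it
        elif t == b:
            result = o            # only ours changed it
        else:
            conflicts.append(path)
            continue
        if result is not None:    # None means the file was deleted
            merged[path] = result
    return merged, conflicts


parents = {"v1": [], "v2": ["v1"], "v3": ["v2"], "v4": ["v3"],
           "v5": ["v4"], "v6": ["v3"], "v7": ["v6"]}
v3 = {"a.txt": "1", "b.txt": "1"}
v7 = {"a.txt": "1", "b.txt": "3"}                    # master changed b.txt
v5 = {"a.txt": "2", "b.txt": "1", "c.txt": "new"}    # dev changed a.txt, added c.txt

print(merge_base(parents, "v7", "v5"))
print(three_way_merge(v3, v7, v5))
```

The merged snapshot would become the tree of the merge commit v8, whose two parents are v7 and v5.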

Branch Rebase#

There are two ways to integrate changes from one branch into another: merge and rebase. First, merge and rebase ultimately yield the same result, but rebase can produce a cleaner commit history. Still using the previous example, if we perform a simple merge, a commit object v8 will be generated. Now let's try using rebase to merge branches by switching to dev:

$ git checkout dev
$ git rebase master
First, rewinding head to replay your work on top of it...
Applying: added staged command


These commands mean: rewind to the most recent common ancestor of the two branches (v3); turn the subsequent commits of the current branch (dev, the branch being rebased, with v4 and v5) into a series of patches; then take the last commit of the base branch (master, v7) as the new starting point and apply the prepared patches one by one. This generates two new commit objects (v4' and v5'), which are ordinary commits rather than merge commits, and rewrites the history of dev so that it is directly downstream of master.

Now, you can return to the master branch for a fast-forward merge because the master branch and the dev branch are on the same line:

$ git checkout master
$ git merge dev

The snapshot corresponding to v5' is actually identical to the snapshot content of the ordinary three-way merge, i.e., the v8 in the previous example. Although the final integrated result is no different, rebase can produce a cleaner commit history. If you inspect the history of a rebased branch, it will look clearer, as if all modifications were made sequentially on a single line, even though they originally occurred in parallel.
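Rebase can be sketched as "compute each commit's patch, then replay the patches on a new base". The model below works on whole files with made-up contents, and the helpers diff, apply_patch and rebase are hypothetical; real Git works with line-level patches and may stop for conflicts.

```python
from typing import Dict, List, Optional

Snapshot = Dict[str, str]
Patch = Dict[str, Optional[str]]  # path -> new content, or None for a deletion


def diff(old: Snapshot, new: Snapshot) -> Patch:
    return {p: new.get(p) for p in set(old) | set(new) if old.get(p) != new.get(p)}


def apply_patch(snapshot: Snapshot, patch: Patch) -> Snapshot:
    result = dict(snapshot)
    for path, content in patch.items():
        if content is None:
            result.pop(path, None)
        else:
            result[path] = content
    return result


def rebase(snapshots: Dict[str, Snapshot], parents: Dict[str, List[str]],
           commits: List[str], new_base: str) -> str:
    """Replay `commits` (oldest first) on top of `new_base`; return the new tip."""
    tip = new_base
    for commit in commits:
        patch = diff(snapshots[parents[commit][0]], snapshots[commit])
        new_id = commit + "'"                   # e.g. v4 becomes v4'
        snapshots[new_id] = apply_patch(snapshots[tip], patch)
        parents[new_id] = [tip]                 # one parent: an ordinary commit
        tip = new_id
    return tip


parents = {"v3": [], "v4": ["v3"], "v5": ["v4"], "v6": ["v3"], "v7": ["v6"]}
snapshots = {
    "v3": {"a.txt": "1", "b.txt": "1"},
    "v4": {"a.txt": "2", "b.txt": "1"},
    "v5": {"a.txt": "2", "b.txt": "1", "c.txt": "new"},
    "v6": {"a.txt": "1", "b.txt": "2"},
    "v7": {"a.txt": "1", "b.txt": "3"},
}
dev_tip = rebase(snapshots, parents, ["v4", "v5"], "v7")
print(dev_tip, snapshots[dev_tip])
```

The final snapshot combines the changes from both branches, just as a merge would, but the rewritten history of dev is a single line ending in v5'.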

Summary#

Git saves the complete content of files, not differential changes.
Git stores files as key-value pairs.
Each file, and different versions of the same file, have a unique 40-character SHA-1 checksum corresponding to them.
The SHA-1 checksum serves as a pointer to the file, which Git relies on to distinguish files.
Each file generates a blob object in Git's version library to save it.
For unchanged files, Git only retains the pointer to the previous version.
Git essentially achieves version control by maintaining a complex file tree.
The basic workflow of using Git involves the flow of files between three working areas.
Branches should be used extensively for team collaboration.
A branch is merely a reference to a commit object.