Dulwich Tutorial

Introduction

Git repository format

For a better understanding of Dulwich, we'll start by explaining most of Git's secrets.

Open the ".git" folder of any Git-managed repository. You'll find folders like "branches", "hooks"... We're only interested in "objects" here. Open it.

You'll mostly see folders named with 2 hex digits. Git identifies content by its SHA-1 digest. The 2 hex digits of a folder name plus the 38 hex digits of a file name inside it form the 40-character (or 20-byte) id of the Git objects you'll manage in Dulwich.
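
As a minimal sketch of this layout, using only Python's standard library rather than Dulwich, the following computes the id of an object and the path Git would store it under. The digest covers a short header (described in the next sections) followed by the content. The helper names git_object_id and loose_object_path are hypothetical, chosen for illustration.

```python
import hashlib


def git_object_id(obj_type, content):
    """Return the 40-character hex id of a Git object (hypothetical helper)."""
    # Git hashes a header ("<type> <length>" plus a NUL byte), then the content.
    header = obj_type + b" " + str(len(content)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def loose_object_path(hex_id):
    """Split an id into the 2-digit folder and the 38-digit file name."""
    return "objects/" + hex_id[:2] + "/" + hex_id[2:]


empty_blob_id = git_object_id(b"blob", b"")
print(empty_blob_id)
print(loose_object_path(empty_blob_id))
```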

We'll first study the three main objects:

  • The Commit;
  • The Tree;
  • The Blob.

The Commit

You're used to generating commits with Git. You have set up your name and e-mail, and you know how to see the history using git log.

A commit file looks like this:

commit <content length><NUL>tree <tree sha>
parent <parent sha>
[parent <parent sha> if several parents from merges]
author <author name> <author e-mail> <timestamp> <timezone>
committer <committer name> <committer e-mail> <timestamp> <timezone>

<commit message>

But where are the changes you committed? The commit contains a reference to a tree.
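
The commit layout shown above can be sketched with the standard library alone. This is an illustration of the format, not the way Dulwich builds commits; serialize_commit is a hypothetical helper and the identity, timestamp and tree id are placeholder values.

```python
import hashlib


def serialize_commit(tree, parents, author, committer, message):
    """Build the bytes of a commit object, header included (illustrative helper)."""
    lines = [b"tree " + tree]
    lines += [b"parent " + p for p in parents]  # zero, one or several parents
    lines.append(b"author " + author)
    lines.append(b"committer " + committer)
    body = b"\n".join(lines) + b"\n\n" + message
    header = b"commit " + str(len(body)).encode("ascii") + b"\x00"
    return header + body


# Placeholder values: the id of the empty tree and a made-up identity.
ident = b"Your Name <your.email@example.com> 1234567890 -0200"
raw = serialize_commit(
    tree=b"4b825dc642cb6eb9a060e54bf8d69288fbee4904",
    parents=[],
    author=ident,
    committer=ident,
    message=b"Initial commit\n",
)
commit_id = hashlib.sha1(raw).hexdigest()
```

The id of the commit is the SHA-1 digest of these bytes, so changing the message, the tree or a parent yields a different commit id.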

The Tree

A tree is a collection of file information, the state of your working copy at a given point in time.

A tree file looks like this:

tree <content length><NUL><file mode> <filename><NUL><blob sha>...

And repeats for every file in the tree.

Note that the SHA-1 digest is stored in binary form here (20 raw bytes), not as 40 hex characters.

The file mode is like the octal argument you could give to the chmod command, except that it is in extended form to tell regular files from directories and other types.
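
A rough sketch of the tree layout, again with the standard library only. serialize_tree is a hypothetical helper; it sorts entries by name, which matches Git's order for plain files but ignores the special rule Git applies to directory names.

```python
import binascii
import hashlib


def serialize_tree(entries):
    """Build the bytes of a tree object from (mode, name, hex_sha) tuples."""
    body = b""
    # Simplification: plain sort by name (see the note above about directories).
    for mode, name, hex_sha in sorted(entries, key=lambda e: e[1]):
        # The mode is written in octal without a leading zero, e.g. "100644".
        body += ("%o" % mode).encode("ascii") + b" " + name + b"\x00"
        # The id is stored as 20 raw bytes, not as 40 hex characters.
        body += binascii.unhexlify(hex_sha)
    return b"tree " + str(len(body)).encode("ascii") + b"\x00" + body


empty_tree_id = hashlib.sha1(serialize_tree([])).hexdigest()
one_file = serialize_tree(
    [(0o100644, b"spam", b"e69de29bb2d1d6434b8b29ae775ad8c2e48c5391")]
)
```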

We now know how our files are referenced but we haven't found their actual content yet. That's where the reference to a blob comes in.

The Blob

A blob is simply the content of a file you are versioning.

A blob file looks like this:

blob <content length><NUL><content>

If you change a single line, another blob will be generated by Git at commit time. This is how Git can quickly check out any version in time.

Conversely, several identical files with different filenames generate only one blob. That's mostly why renames are so cheap and efficient in Git.
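
To make the deduplication concrete, here is a small sketch that models the "objects" folder as a dictionary keyed by blob id. The file names and contents are made up, and the two dictionaries only stand in for Git's real storage.

```python
import hashlib


def blob_id(content):
    """Id of a blob: SHA-1 of the "blob <length>" header, a NUL, and the content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


store = {}   # blob id -> content, standing in for the "objects" folder
names = {}   # file name -> blob id, standing in for a tree

for name, content in [
    ("a.txt", b"same bytes\n"),
    ("copy-of-a.txt", b"same bytes\n"),
    ("b.txt", b"same bytes, edited\n"),
]:
    oid = blob_id(content)
    store[oid] = content
    names[name] = oid

print(len(names), "names,", len(store), "blobs")
```

Three names end up pointing at two blobs: the identical files share one entry, while the edited file gets its own.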

Dulwich Objects

Dulwich implements these three objects with an API to easily access the information you need, while abstracting some more secrets Git is using to accelerate operations and reduce space.

More About Git formats

These three objects make up about 90% of a Git repository. The rest is branch information and optimizations.

For instance there is an index of the current state of the working copy. There are also pack files to group several small objects in a single indexed file.

For a more detailed explanation of object formats and SHA-1 digests, see: http://www-cs-students.stanford.edu/~blynn/gitmagic/ch08.html

Just note that recent versions of Git compress object files using zlib.
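
As an illustration of that last point, the sketch below builds the compressed form of a small blob by hand and reads it back. parse_loose_object is a hypothetical helper, not part of Dulwich, and it only handles loose objects, not pack files.

```python
import zlib


def parse_loose_object(compressed):
    """Return (type, content) from the compressed bytes of a loose object."""
    raw = zlib.decompress(compressed)
    header, _, content = raw.partition(b"\x00")
    obj_type, _, length = header.partition(b" ")
    if int(length) != len(content):
        raise ValueError("length in header does not match content")
    return obj_type, content


# Build the on-disk form of a small blob by hand, then read it back.
on_disk = zlib.compress(b"blob 16\x00My file content\n")
obj_type, content = parse_loose_object(on_disk)
```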

The Repository

After this introduction, let's start directly with code:

>>> from dulwich.repo import Repo

The access to every object is through the Repo object. You can open an existing repository or you can create a new one. There are two types of Git repositories:

Regular Repositories -- They are the ones you create using git init and use daily. They contain a ".git" folder.

Bare Repositories -- There is no ".git" folder. The top-level folder itself contains the "branches", "hooks"... folders. These are used for published repositories (mirrors).
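
The difference between the two layouts can be sketched as a simple lookup for the folder that holds "objects". control_dir is a hypothetical helper and a deliberate simplification: it only checks for folders and is not how Dulwich or Git detect repositories.

```python
import os
import tempfile


def control_dir(path):
    """Return the folder holding "objects", or None if path is not a repository."""
    dotgit = os.path.join(path, ".git")
    if os.path.isdir(os.path.join(dotgit, "objects")):
        return dotgit          # regular repository
    if os.path.isdir(os.path.join(path, "objects")):
        return path            # bare repository
    return None


# Fake both layouts in a temporary folder to exercise the helper.
base = tempfile.mkdtemp()
regular = os.path.join(base, "regular")
bare = os.path.join(base, "bare.git")
os.makedirs(os.path.join(regular, ".git", "objects"))
os.makedirs(os.path.join(bare, "objects"))
```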

Let's create a folder and turn it into a repository, like git init would:

>>> from os import mkdir
>>> mkdir("myrepo")
>>> repo = Repo.init("myrepo")
>>> repo
<Repo at '/tmp/myrepo/'>

You can already look at the structure of the "myrepo/.git" folder, though it is mostly empty for now.

Initial commit

When you use Git, you generally add or modify content. As our repository is empty for now, we'll start by adding a new file:

>>> from dulwich.objects import Blob
>>> blob = Blob.from_string("My file content\n")
>>> blob.id
'c55063a4d5d37aa1af2b2dad3a70aa34dae54dc6'

Of course you could create a blob from an existing file using from_file instead.

As said in the introduction, file content is separated from file name. Let's give this content a name:

>>> from dulwich.objects import Tree
>>> tree = Tree()
>>> tree.add(0o100644, "spam", blob.id)

Note that 0o100644 is the octal form for a regular file with common permissions. You can hardcode such modes or you can use the stat module.
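
For instance, the common modes can be composed from the constants of the standard stat module. The variable names below are only illustrative.

```python
import stat

regular_file = stat.S_IFREG | 0o644   # plain file, rw-r--r--
executable = stat.S_IFREG | 0o755     # plain file, rwxr-xr-x
symlink = stat.S_IFLNK                # symbolic link, no permission bits
directory = stat.S_IFDIR              # a sub-tree

for mode in (regular_file, executable, symlink, directory):
    print("%o" % mode)
```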

The tree state of our repository still needs to be placed in time. That's the job of the commit:

>>> from dulwich.objects import Commit, parse_timezone
>>> from time import time
>>> commit = Commit()
>>> commit.tree = tree.id
>>> author = "Your Name <your.email@example.com>"
>>> commit.author = commit.committer = author
>>> commit.commit_time = commit.author_time = int(time())
>>> tz = parse_timezone('-0200')
>>> commit.commit_timezone = commit.author_timezone = tz
>>> commit.encoding = "UTF-8"
>>> commit.message = "Initial commit"

Note that the initial commit has no parents.
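
The timezone string used above, such as '-0200', is an offset from UTC written as hours and minutes. The sketch below shows the arithmetic behind it; timezone_to_seconds is a hypothetical helper, not Dulwich's parse_timezone, whose exact return value may differ between versions.

```python
def timezone_to_seconds(text):
    """Convert a "+HHMM" or "-HHMM" string to an offset in seconds."""
    sign = -1 if text.startswith("-") else 1
    digits = text.lstrip("+-")
    hours, minutes = int(digits[:2]), int(digits[2:])
    return sign * (hours * 3600 + minutes * 60)


print(timezone_to_seconds("-0200"))
print(timezone_to_seconds("+0530"))
```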

At this point, the repository is still empty because all operations happen in memory. Let's "commit" it.

>>> object_store = repo.object_store
>>> object_store.add_object(blob)

Now the ".git/objects" folder contains a first SHA-1 file. Let's continue saving the changes:

>>> object_store.add_object(tree)
>>> object_store.add_object(commit)

Now the physical repository contains three objects but still has no branch. Let's create the master branch like Git would:

>>> repo.refs['refs/heads/master'] = commit.id

The master branch now has a commit to start from, but Git itself would not know which branch is the current one. That's another reference:

>>> repo.refs['HEAD'] = 'ref: refs/heads/master'
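
On disk, these two references are small text files: the branch file holds a commit id, and HEAD holds a "ref: " line pointing at the branch. The sketch below follows that indirection in a scratch folder. resolve_ref is a hypothetical helper, the commit id is made up, and packed references are ignored.

```python
import os
import tempfile


def resolve_ref(git_dir, name):
    """Follow "ref: ..." indirections until a commit id is found."""
    with open(os.path.join(git_dir, *name.split("/"))) as f:
        value = f.read().strip()
    if value.startswith("ref: "):
        return resolve_ref(git_dir, value[len("ref: "):])
    return value


# Recreate the two references in a scratch folder, with a made-up commit id.
git_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(git_dir, "refs", "heads"))
with open(os.path.join(git_dir, "refs", "heads", "master"), "w") as f:
    f.write("1234567890abcdef1234567890abcdef12345678\n")
with open(os.path.join(git_dir, "HEAD"), "w") as f:
    f.write("ref: refs/heads/master\n")

head_commit = resolve_ref(git_dir, "HEAD")
```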

Now our repository is officially tracking a branch named "master" referring to a single commit.

Playing again with Git

At this point you can come back to the shell, go into the "myrepo" folder and type git status to let Git confirm that this is a regular repository on branch "master".

Git will tell you that the file "spam" is deleted, which is normal because Git is comparing the repository state with the current working copy. And we have absolutely no working copy using Dulwich because we don't need it at all!

You can check out the last state using git checkout -f. The force flag will prevent Git from complaining that there are uncommitted changes in the working copy.

The file spam appears and, unsurprisingly, contains the same bytes as the blob:

$ cat spam
My file content

Attention!

Remember to recreate the repo object when you modify the repository outside of Dulwich!

Changing a File and Committing It

Now that we have a first commit, the next one will show a difference.

As seen in the introduction, it's about making a path in a tree point to a new blob. The old blob will remain to compute the diff. The tree is altered and the new commit's task is to point to this new version.

In the following examples, we assume we still have the repo and tree objects from the previous chapter.

Let's first build the blob:

>>> spam = Blob.from_string("My new file content\n")
>>> spam.id
'16ee2682887a962f854ebd25a61db16ef4efe49f'

An alternative is to alter the previously constructed blob object:

>>> blob.data = "My new file content\n"
>>> blob.id
'16ee2682887a962f854ebd25a61db16ef4efe49f'

In any case, update the blob id known as "spam". You also have the opportunity of changing its mode:

>>> tree["spam"] = (0o100644, spam.id)

Now let's record the change:

>>> c2 = Commit()
>>> c2.tree = tree.id
>>> c2.parents = [commit.id]
>>> c2.author = c2.committer = author
>>> c2.commit_time = c2.author_time = int(time())
>>> c2.commit_timezone = c2.author_timezone = tz
>>> c2.encoding = "UTF-8"
>>> c2.message = 'Changing "spam"'

In this new commit we record the changed tree id and, most importantly, the previous commit as the parent. Parents are actually a list because a commit may happen to have several parents after merging branches.

It remains to record this whole new family:
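
Because every commit lists its parents, the history is a graph that can be walked from any commit back to the root. The sketch below does so over a plain dictionary with made-up ids; ancestors is a hypothetical helper, and the order it yields is not the order git log would print.

```python
def ancestors(commits, start):
    """Yield commit ids reachable from start, visiting each commit once."""
    seen = set()
    todo = [start]
    while todo:
        cid = todo.pop()
        if cid in seen:
            continue
        seen.add(cid)
        yield cid
        todo.extend(commits[cid])  # queue every parent, not just the first


# Made-up ids: "m" merges two branches that both start from "root".
commits = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "m": ["a", "b"],
}
history = list(ancestors(commits, "m"))
```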

>>> object_store.add_object(spam)
>>> object_store.add_object(tree)
>>> object_store.add_object(c2)

You can already ask git to introspect this commit using git show and the value of c2.id as an argument. You'll see the difference with the previous blob recorded as "spam".

You won't see it using git log because the head is still the previous commit. It's easy to remedy:

>>> repo.refs['refs/heads/master'] = c2.id

Now all git tools will work as expected. But don't forget that Dulwich is still open!

Adding a file

If you have followed along, the next lesson will be straightforward.

We need a new blob:

>>> ham = Blob.from_string("Another\nmultiline\nfile\n")
>>> ham.id
'a3b5eda0b83eb8fb6e5dce91ecafda9e97269c70'

But the same tree:

>>> tree["ham"] = (0o100644, ham.id)

And a new commit:

>>> c3 = Commit()
>>> c3.tree = tree.id
>>> c3.parents = [c2.id]
>>> c3.author = c3.committer = author
>>> c3.commit_time = c3.author_time = int(time())
>>> c3.commit_timezone = c3.author_timezone = tz
>>> c3.encoding = "UTF-8"
>>> c3.message = 'Adding "ham"'

Save it all:

>>> object_store.add_object(ham)
>>> object_store.add_object(tree)
>>> object_store.add_object(c3)

Update the head:

>>> repo.refs['refs/heads/master'] = c3.id

A call to git show will confirm the addition of "ham".

Remember you can also call git checkout -f to make it appear.

Well... Adding "ham" was not such a good idea... We'll remove it.

Removing a file

Removing a file just means removing its entry in the tree. The blob won't be deleted because Git tries to preserve the history of your repository.

It's all pythonic:

>>> del tree["ham"]

>>> c4 = Commit()
>>> c4.tree = tree.id
>>> c4.parents = [c3.id]
>>> c4.author = c4.committer = author
>>> c4.commit_time = c4.author_time = int(time())
>>> c4.commit_timezone = c4.author_timezone = tz
>>> c4.encoding = "UTF-8"
>>> c4.message = 'Removing "ham"'

Here we only have the new tree and the commit to save:

>>> object_store.add_object(tree)
>>> object_store.add_object(c4)

And of course update the head:

>>> repo.refs['refs/heads/master'] = c4.id

If you don't trust me, ask git show. ;-)

Renaming a file

Remember you learned that the file name and content are distinct. So renaming a file is just about associating a blob id to a new name. We won't store more content, and the operation will be painless.

Let's transfer the blob id from the old name to the new one:

>>> tree["eggs"] = tree["spam"]
>>> del tree["spam"]

As usual, we need a commit to store the new tree id:

>>> c5 = Commit()
>>> c5.tree = tree.id
>>> c5.parents = [c4.id]
>>> c5.author = c5.committer = author
>>> c5.commit_time = c5.author_time = int(time())
>>> c5.commit_timezone = c5.author_timezone = tz
>>> c5.encoding = "UTF-8"
>>> c5.message = 'Rename "spam" to "eggs"'

As with a deletion, we only have a tree and a commit to save:

>>> object_store.add_object(tree)
>>> object_store.add_object(c5)

It remains to move the head to the newest commit:

>>> repo.refs['refs/heads/master'] = c5.id

As a last exercise, see how git show illustrates it.
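
Git stores no rename record; a tool can only infer a rename by comparing two trees. A minimal sketch of that idea, limited to exact matches on the blob id, is given below. exact_renames is a hypothetical helper over plain dictionaries, not Git's or Dulwich's actual algorithm.

```python
def exact_renames(old_tree, new_tree):
    """Pair names that vanished with names that appeared under the same blob id."""
    removed = {n: oid for n, oid in old_tree.items() if n not in new_tree}
    added = {n: oid for n, oid in new_tree.items() if n not in old_tree}
    renames = []
    for new_name, oid in sorted(added.items()):
        for old_name, old_oid in sorted(removed.items()):
            if old_oid == oid:
                renames.append((old_name, new_name))
                del removed[old_name]  # each old name is matched at most once
                break
    return renames


# name -> blob id, using the id of the "spam" blob from this tutorial
before = {"spam": "16ee2682887a962f854ebd25a61db16ef4efe49f"}
after = {"eggs": "16ee2682887a962f854ebd25a61db16ef4efe49f"}
print(exact_renames(before, after))
```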

Conclusion

You'll find the test.py program with some tips I use to ease generating objects.

You can also make Tag objects, but this is left as an exercise to the reader.

Dulwich abstracts much of the Git plumbing, so there is more to explore.

Dulwich is also able to clone and push repositories.

That's all folks!