Backups with bup
I’m thinking about backups once more, and thought I would take a look at bup. Bup’s claim to fame (and the reason I first heard about it) is that it’s a git-based backup system, or rather that it uses the git repository format, with its own tools to let it deal with large files efficiently. The more I looked at it, the more I realized that being git-based isn’t actually the main feature. Bup has a rolling checksum algorithm similar to rsync’s and splits up files so that only changes are backed up, even for large files. This has a nice side effect: you get deduplication of data for free. That means space-efficient backups of VM images, and of files duplicated across multiple computers (where the OS files are almost identical). I have two laptops with the same data (git repositories, photos, other work) on both of them, and multiple VM images used for testing, so the ability to have block-level deduplication in backups sounded ideal.
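If you want to get a feel for the dedup before trusting it with real data, you can watch the repository size as you save the same content twice. This is just a sketch with made-up paths and a throwaway repository, not something from my actual setup:
export BUP_DIR=/tmp/bup-test                      # throwaway repository, made-up path
bup init
dd if=/dev/urandom of=/tmp/big.img bs=1M count=100
bup index /tmp/big.img
bup save -n test /tmp/big.img
du -sh "$BUP_DIR"                                 # roughly the size of big.img
cp /tmp/big.img /tmp/big-copy.img
bup index /tmp/big.img /tmp/big-copy.img
bup save -n test /tmp/big.img /tmp/big-copy.img
du -sh "$BUP_DIR"                                 # barely grows: the copy deduplicates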
Bup can also generate par2 files for error recovery, and has commands to verify/recover corrupted backups. This is a particularly useful feature given that bup goes to great lengths to ensure that each piece of data is only stored once, so a single corrupted block could affect many backups.
My old backups were taken with rsnapshot, and as it happens, bup has a conversion tool for this, so the first step was to move them over to bup. The command to do this is bup import-rsnapshot, but it didn’t quite work for me and gave an error when running bup save. Thankfully there is a dry-run option which prints out the commands that bup would run, and because rsnapshot backups are direct copies of the files, what bup does is essentially back up the backup. So I ended up running:
export BUP_DIR=/bup
/usr/bin/bup index -ux -f bupindex.rigel.tmp manual.0/rigel/
/usr/bin/bup save --strip --date=1314714851 -f bupindex.rigel.tmp \
-n rigel manual.0/rigel/
The two bup commands were output directly by import-rsnapshot, and I repeated them for each backup I had.
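For the record, the dry run that prints those commands is just the import command with -n; something along these lines (the rsnapshot root path here is made up):
export BUP_DIR=/bup
bup import-rsnapshot -n /path/to/rsnapshot-root/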
Next was to take the initial backup from my laptop. This was actually a different laptop from the one the rsnapshot backups came from, but I’d copied over a lot of the data and wanted to see how well the dedup feature worked. As with the rsnapshot import, taking a backup is actually two steps: bup index followed by bup save. The index command generates a list of files to back up, while the save command actually does it. The documentation gives a couple of reasons for splitting this into two steps, mainly that it allows you to use a different method (such as inotify) to generate and update the index, and that it lets you generate the list of files only once if you are backing up to multiple locations. This separation of duties appeals to the tinkerer in me, but it would still have been nice to have a shortcut ‘just back it up’ command, similar to how git pull is a combination of git fetch and git merge.
The commands to take a backup are:
export BUP_DIR=/bup
bup index -ux --exclude=/bup /
bup save -n procyon /
First, I set the backup directory to /bup. What I’m doing here is backing up locally (and copying to an external hard drive later), but you can also pass the -r option to back up directly to a remote server via ssh.
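For the remote case, a rough sketch would look like this (the hostname and remote path are made up, and I’m assuming bup is installed on the remote side):
bup init -r backuphost:/bup        # one-time: create the remote repository
bup index -ux --exclude=/bup /
bup save -r backuphost:/bup -n procyon /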
I also pass the -x option to bup index to limit it to one filesystem, and exclude the backup directory itself from the backup.
Next, the bup save command actually performs the backup. I passed in the hostname of my laptop (procyon) as the name of the backup set. Multiple backups can share the same name, and they show up as multiple git commits, so a hostname is a good choice for the name of the backup set.
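Since that ‘just back it up’ shortcut doesn’t exist, a couple of lines of shell gets you most of the way there. bup_backup is my own made-up name, not a bup command:
# tiny wrapper combining index and save, using the hostname as the backup set name
bup_backup() {
    bup index -ux --exclude="$BUP_DIR" "$1" &&
    bup save -n "$(hostname)" "$1"
}
bup_backup /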
As I mentioned above, bup can make use of par2 to generate parity files. This is a separate step, and is done using the bup fsck command:
bup fsck -g -j4
The -g option generates the par2 files, and the -j4 option means run up to 4 par2 jobs at the same time. Generating parity files is CPU intensive, so I set it to twice the number of CPUs in my system; I have hyperthreading turned on, and it saturated all 4 ‘virtual’ CPUs. Once this was done, I ended up with several .par2 files in the /bup/objects/pack directory (this is a git repository, and all data is stored in the objects/ dir).
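The same bup fsck command also handles checking the repository and, in theory, repairing it from the par2 data. I haven’t had to do this yet, so treat these as a sketch rather than something battle-tested:
bup fsck          # verify all packs (uses the .par2 files if present)
bup fsck -r -j4   # attempt to repair damaged packs from the par2 data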
And the results? Bup used 30GB for the two original backups imported from rsnapshot (rsnapshot itself used 26GB and 37GB for the first and second backups, and those figures already take identical files into account). Then, when I backed up my second laptop (with approximately 40GB in use at the time), the size of the bup backup grew by only 4GB. That backup included a 5GB Ubuntu VM image that didn’t exist in the previous snapshots, so bup must have been able to deduplicate the data shared between the image and the live OS.
All of this sounds amazing, but of course there are a few downsides, all of which are spelled out pretty plainly in the bup README:
- no metadata - these are backups of my personal laptop, and I’ll either be restoring single files or reinstalling and copying files over as I need them (see the restore sketch after this list), so losing permissions/file ownership etc. isn’t a big deal for me. However, this feature is supposed to be coming soon.
- no way to prune old backups - this is another feature that is coming soon, but given that I’m a pack rat who rarely deletes old data, and given the dedup, I’m not too concerned for the moment.
- bup is relatively new and immature. This shows in the possible bugs I encountered above, the lack of what some might consider essential features, and the somewhat low-level command usage (separate index, save and fsck commands). All of this is easy enough to work around, however, and is likely to improve in future.
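As for restoring single files, the basic invocation is something along these lines; a sketch using the backup name from above and a made-up file path:
bup ls /procyon/latest/                                     # browse the latest backup
bup restore -C /tmp/restored /procyon/latest/path/to/file   # restore one file into /tmp/restored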
That said, if you can live with the above limitations, and want incredible space savings for your backups (especially across multiple computers), then I would suggest giving bup a try.