btrfs trouble, trouble, and troubleshooting

Last weekend I lost more than 8 hours recovering from a serious bug that hit the btrfs filesystem in recent days. The bug was fairly specific: it only affected Linux users running a particular kernel version (3.17.1, if I recall correctly) and a particular version of the btrfs-progs package, on systems with read-only snapshots. And yeah, I’ve been using the btrfs snapshot feature extensively.

There be dragons (er, I mean, corruption). Yeah, I know, this image is from three years before this post.

First symptom: after a pacman -Syu, my entire filesystem became read-only. I couldn’t save anything, so I rebooted. The result? It booted straight into the initramfs rescue prompt. I’ve always had nightmares about this prompt, and I wouldn’t have access to any of my files until I got past it.

A few Google searches told me that this was a severe, critical bug. Fortunately (no, actually, unfortunately…), other early adopters and enthusiasts had hit it too. The Arch Linux bug tracker and forums had some mentions of it, and they served as my initial answer to “what am I supposed to do now, besides screaming?”. Yeah, you don’t want to lose all of your data at once. I had backups, but they were a few weeks old.

During this process I simply fell in love with the grml Linux distribution. It is amazing for recovery work. It is based on Debian testing, and it is one of the distros that inspired the creators of Arch. Really nice, and now I understand why! (The other distro is CRUX, but I still haven’t tried it myself.)

From grml, I used HexChat (an IRC client) to join the #btrfs channel, and in less than five minutes I got help and guidance from a guy who had run into the same problem that very day. He told me what I was supposed to do, and it really worked! Now I have my system back.

Why did I lose 8+ hours, then? Because it didn’t go that smoothly. Mainly, I was downloading several distros and dd’ing them onto several USB flash drives from a grml virtual machine running on my parents’ Windows 8 desktop. I did this several times, until I realized (no, the #btrfs guy realized) that I had to be running a recent Linux kernel to do the recovery. Which Linux distribution has a recent kernel? Arch, of course. So the actual recovery was done from an Arch live medium. Grml taught me several things I should expect from a recovery distribution designed for system administrators, but unfortunately its kernel was relatively old at the time (well, it is based on Debian, so I can’t blame it).
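In hindsight, I could have saved a few of those USB-flashing rounds by checking the live medium’s kernel version before doing anything else. Here is a minimal sketch of that check in Python. The function names are mine, and MINIMUM_KERNEL is a hypothetical placeholder: I don’t remember the exact version the recovery needed, so put in whatever the #btrfs folks tell you.

```python
import platform
import re

# Hypothetical threshold, for illustration only. This is NOT a documented
# requirement; replace it with the version you were actually told to use.
MINIMUM_KERNEL = (3, 17, 0)


def parse_kernel_release(release):
    """Extract (major, minor, patch) from a string like '3.17.1-1-ARCH'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor, patch = match.groups()
    # A release such as '3.17-rc1' has no patch component; treat it as 0.
    return (int(major), int(minor), int(patch or 0))


def kernel_is_at_least(release, minimum=MINIMUM_KERNEL):
    """Return True if the release string is at or above the minimum version."""
    return parse_kernel_release(release) >= minimum


def running_kernel_is_recent_enough(minimum=MINIMUM_KERNEL):
    """Check the kernel of the system this script is running on."""
    # platform.release() returns the same string as `uname -r` on Linux.
    return kernel_is_at_least(platform.release(), minimum)
```

On a live medium, calling running_kernel_is_recent_enough() before starting tells you right away whether you should go download another ISO instead.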

End of the story? Despite losing several hours, I discovered a nice Linux distribution, was convinced (again) that IRC is still awesome, got to post on the Arch forums confirming the solution I used to recover my system, and had fun in the process (NO!!!! Actually, yes. I love this sysadmin thing.).

Moral of the story? BACK UP YOUR SYSTEMS!!!! And sleep well.

I’m now using ext4 on my root partition and btrfs on my home partition. This means I can hibernate my system now; I missed that for a while.

Edit: links to the mentioned bug.
