Lessons learned from saving my Artix Linux install
A few days ago, my otherwise perfectly smooth experience of running Artix Linux ran into a small moment of panic. After running a routine update with Pacman, I noticed that some errors were thrown at the end of the run, but didn't really think there would be too much of a problem. I was proven wrong a few hours later, when I restarted my computer and found out that it wouldn't boot anymore.
I can't say that I hadn't prepared for this moment - rolling around with an Arch-based distribution has its risks of something breaking at some given time, and I had been making some backups thanks to my Project 128 adventures. However, it's moments like these, where you feel in your skin all that stability and comfort from GNU/Linux melting away, that give you a silent panic. Fuck, it's not booting. And rebooting or trying anything else doesn't seem to help at all. What now?
Mini-disaster recovery starts
After sleeping on the issue to regain my sanity, the following day I went with what I know best: booting a live medium. The next question was: which one? I thought about my go-to Swiss-army knife distribution, Puppy Linux, but some of its functions "feel" a little weird in comparison, especially if I were to do some critical work to recover my system. So, as an alternative, I resorted to an older live medium distro that I knew, but hadn't used in a while: BunsenLabs Linux.
This let me decrypt the drive and check the integrity of my data: everything was still there and not corrupted. Big first whew! I was still left with the task of actually fixing the install, however, so the mission was only beginning. Searching for the error messages from the log files returned mixed results.
Something pointed towards a kernel issue, but I didn't remember updating the kernel recently - when I do see the kernel has been updated, this usually prompts me to reboot the machine. But the "lack of space" error messages I had been seeing did point to an issue with the tiny /boot partition on my machine. And almost everywhere I looked, the common point was that I would never have to do a full reinstall unless the system had fscked up real bad.
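A quick way to confirm a space problem on /boot is to compare its free space against a threshold. This is a minimal sketch rather than a record of what was actually run: the function name boot_space_ok and the 50 MiB default are assumptions picked for illustration, and a sensible threshold depends on the size of your kernel and initramfs images.

```shell
#!/bin/sh
# boot_space_ok [MOUNTPOINT] [MIN_FREE_KB]
# Succeeds (exit 0) if MOUNTPOINT has at least MIN_FREE_KB kilobytes available.
# Both the function name and the default threshold are illustrative assumptions.
boot_space_ok() {
    mountpoint="${1:-/boot}"
    min_free_kb="${2:-51200}"   # 50 MiB, an arbitrary default

    # df -P gives the portable one-line-per-filesystem layout and -k reports
    # 1024-byte blocks; column 4 of the second line is the available space.
    free_kb="$(df -Pk "$mountpoint" | awk 'NR == 2 { print $4 }')"

    [ "$free_kb" -ge "$min_free_kb" ]
}
```

It could be called before an update, for example: boot_space_ok /boot 51200 || echo "Low space on /boot, clean up before updating".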
Eventually, I found this post on a blog describing the troubleshooting of common Arch Linux problems, and the symptoms it covered seemed to match mine. In addition to troubleshooting the kernel issues, the post contains another hint: the init environment.
I'm not 100% technically sure, but it looks like after some kernel updates the early-boot init environment (the initramfs) sometimes doesn't get rebuilt correctly, and this causes the boot to fail even though the kernel itself loads cleanly. And thankfully, the fix isn't hard to implement.
Folk documentation saves the day
If you ever find yourself stuck after an Arch Linux kernel upgrade, but notice that your /boot/ directory isn't empty and the kernels have been upgraded, here are the steps to fix it:
- Boot into a live Arch Linux environment (the install ISO image)
- Mount your hard drive containing your Arch install (use cryptsetup open DEVICE someidentifier to decrypt it first if you're using full disk encryption like me)
- chroot into your mounted disk
- Once chrooted, run mkinitcpio -p linux.
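The steps above can be sketched as a small script, meant to be run as root from the Arch live ISO. It is a sketch under assumptions, not the exact commands from the original incident: it assumes a plain LUKS partition with the root filesystem directly on the decrypted mapping (no LVM), and the device, mapping name and mount point are placeholders you pass in. The run helper and the DRY_RUN variable are made up for this example so the commands can be reviewed before anything touches the disk.

```shell
#!/bin/sh
# Sketch of the recovery steps. With DRY_RUN=1 (the default) the commands are
# only printed; set DRY_RUN=0 to actually execute them as root.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# rebuild_initramfs DEVICE MAPPER_NAME MOUNTPOINT
rebuild_initramfs() {
    device="$1"        # encrypted partition (placeholder, e.g. /dev/sdX2)
    mapper_name="$2"   # any identifier for the decrypted mapping
    mountpoint="$3"    # where the root filesystem gets mounted (e.g. /mnt)

    # Decrypt the drive, then mount the decrypted root filesystem.
    run cryptsetup open "$device" "$mapper_name"
    run mount "/dev/mapper/$mapper_name" "$mountpoint"
    # If /boot is a separate partition, it also has to be mounted under
    # $mountpoint/boot at this point (device not known here, so left out).

    # arch-chroot accepts a command to run inside the chroot.
    run arch-chroot "$mountpoint" mkinitcpio -p linux
}
```

Calling rebuild_initramfs /dev/sdX2 cryptroot /mnt in dry-run mode prints the three commands in order, which makes it easy to double-check the device name before rerunning with DRY_RUN=0.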
I tried to run these from my live BunsenLabs environment, but couldn't get it to work due to mkinitcpio complaining about /proc not being mounted. I guess Arch Linux's arch-chroot works around this by filling in the gaps of the standard chroot environment, so the live medium can really impersonate the full install from the Arch perspective.
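On a non-Arch live medium, the gap that arch-chroot fills is mostly the kernel's virtual filesystems (/proc, /sys and /dev), which a plain chroot does not provide. The sketch below only prints the commands that would likely have been needed, following the usual manual chroot recipe. It is untested on BunsenLabs, the function name manual_chroot_commands is made up for this example, and it assumes the broken install is already decrypted and mounted at the path you pass in.

```shell
#!/bin/sh
# manual_chroot_commands ROOT
# Prints (does not run) the commands for a manual chroot into ROOT with the
# kernel's virtual filesystems available, ending with the initramfs rebuild.
manual_chroot_commands() {
    root="$1"   # mount point of the broken install (placeholder, e.g. /mnt)
    cat <<EOF
mount -t proc proc $root/proc
mount -t sysfs sys $root/sys
mount --rbind /dev $root/dev
chroot $root mkinitcpio -p linux
EOF
}
```

Running manual_chroot_commands /mnt lists the commands for review; they would then be run as root one by one, followed by a reboot, which also takes care of the leftover mounts.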
At the end of the procedure, the initramfs will be rebuilt for the Linux kernel, and if that succeeds, you can reboot into your original install. Crisis averted, no need for a clean reinstall. You can get back to work as usual. Thanks, random stranger who authored this documentation!
Good lessons from a bad incident
Downtime is never a good thing, and when you realize you don't know when you last made a backup, it can be terrifying. Thankfully, this story had a good ending, and I actually learned a couple of good things from this bad incident that left me better prepared for when it happens again. And since we're talking about Disaster Recovery and I've been reading a little more about it lately: the RTO (recovery time objective) was about 1 day, with an RPO (recovery point objective) of about the same. Hardly impressive by enterprise standards, but for a one-man mission, it was OK.
The first and foremost good thing was that this gave me the opportunity to test-drive BunsenLabs once again. I had used Crunchbang Linux (the BunsenLabs predecessor) aeons ago, back when I was still discovering Linux, and at the time it was a perfect match for my netbook. Getting to try this revamped edition on another low-spec computer was very satisfying. BunsenLabs is a great live-medium OS: it offers great support for system maintenance and is more familiar to me than Puppy (since it's a direct Debian derivative). I will be carrying it on my emergency USB from now on!
The second good thing was that this incident forced me to do another backup of all my data. Not that I had lost anything, but given the possibility of a full reinstall, I took a full backup again. And this has really forced me to think about rsync'ing more frequently, perhaps every week or every three days, and to keep an eye on my third, off-site copy.
The third good thing was that this incident gave me insight into how to work on a seemingly doomed Arch install, even when the disk is encrypted, thanks to the tools available in the Arch live ISO. Who knew that you could have so much power and flexibility even in such a minimalistic environment? I also noticed it runs Zsh rather than Bash, and it was a pretty nice shell too. I might try it more thoroughly and switch later.
So there you have it! A seemingly disastrous situation turned into a good learning opportunity with no side effects to my data. Have you ever run into an unbootable system before? How did you recover your install afterwards? Let me know on Mastodon!
This post is #21 of my #100DaysToOffload project. Follow my progress through Mastodon!
Last updated on 07/23/21