I just fixed a problem with my NFS Server inexplicably stopping all the time that had me confused as hell. It suddenly started happening after doing a big apt-update.
The symptoms: after booting, the NFS server would run fine, but after a while it would suddenly stop and all my NFS clients would freeze. It looked like systemd just decided to stop it for no reason. Running systemctl status showed this:
$ sudo systemctl status nfs-kernel-server.service
* nfs-server.service - NFS server and services
     Loaded: loaded (/usr/lib/systemd/system/nfs-server.service; enabled; preset: enabled)
    Drop-In: /run/systemd/generator/nfs-server.service.d
             --order-with-mounts.conf
     Active: inactive (dead) since Sun 2024-06-16 01:28:18 AEST; 11min ago
   Duration: 13min 5.405s
 Invocation: ae0f3876014b477fad03afa383db37e3
    Process: 975 ExecStartPre=/usr/sbin/exportfs -r (code=exited, status=0/SUCCESS)
    Process: 977 ExecStart=/usr/sbin/rpc.nfsd (code=exited, status=0/SUCCESS)
    Process: 1832 ExecStop=/usr/sbin/rpc.nfsd 0 (code=exited, status=0/SUCCESS)
    Process: 1837 ExecStopPost=/usr/sbin/exportfs -au (code=exited, status=0/SUCCESS)
    Process: 1838 ExecStopPost=/usr/sbin/exportfs -f (code=exited, status=0/SUCCESS)
   Main PID: 977 (code=exited, status=0/SUCCESS)

Jun 16 01:15:12 watiya systemd[1]: Starting nfs-server.service - NFS server and services...
Jun 16 01:15:12 watiya systemd[1]: Finished nfs-server.service - NFS server and services.
Jun 16 01:28:18 watiya systemd[1]: Stopping nfs-server.service - NFS server and services...
Jun 16 01:28:18 watiya systemd[1]: nfs-server.service: Deactivated successfully.
Jun 16 01:28:18 watiya systemd[1]: Stopped nfs-server.service - NFS server and services.
I could restart the NFS server with sudo systemctl restart nfs-kernel-server.service and all my clients would unfreeze, but 10 to 15 minutes later it would stop again.
I chased so many rabbits down holes trying to fix this. I have had intermittent problems similar to this in the past, but I'm pretty sure they were related to a failing disk at the time. This time I looked everywhere for signs of errors or failures and couldn't find anything. I even turned on NFS debug logging and trawled the journalctl logs in detail, and couldn't see a single error or anything that was triggering the stop. It doesn't help that I'm not a systemd expert, which led me down blind alleys suspecting there was some kind of enable difference, or that start vs restart was the cause.
Then a combination of things triggered an AH-HA moment that solved it.
- The logs did show "unmounting /mnt/media" at around the time the NFS server was stopping. At first I thought this was unrelated, or perhaps another symptom. It doesn't help that journalctl logs are not exactly in time order...
- I noticed the little bit about order-with-mounts.conf and started trying to figure out what exactly that was and where it was.
- I stumbled on https://access.redhat.com/solutions/4091731 which pointed out that the NFS server requires the filesystems it exports to be mounted.
Then I remembered my /etc/fstab had this:
/dev/bcache0 /mnt/media btrfs x-systemd.automount,x-systemd.idle-timeout=600s,subvol=@media 0 0
So I was auto-mounting /mnt/media that I was exporting in my /etc/exports with this:
/mnt/media 192.168.7.0/24(rw,sync,no_subtree_check,no_root_squash,insecure)
It turns out that order-with-mounts.conf is auto-generated from your exports and makes your NFS server depend on the exported filesystems being mounted. This ensures that they are mounted before the NFS server starts, and that the NFS server is stopped when they are unmounted.
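To see which mounts the generator tied your NFS server to, you can read the drop-in directly (systemctl cat nfs-server.service prints it too). Below is a rough sketch, assuming the generated file expresses the dependency as RequiresMountsFor= lines; check the real file on your system. The function name required_mounts is made up for illustration, and the default path is simply the one from the Drop-In: lines in the status output.

```shell
#!/bin/sh
# Hypothetical helper: print the mount paths a drop-in makes a unit depend on.
# Assumes the dependency is written as RequiresMountsFor= lines (an assumption
# about the generated file; verify against your own system).
required_mounts() {
    dropin="${1:-/run/systemd/generator/nfs-server.service.d/order-with-mounts.conf}"
    # A RequiresMountsFor= line may list several space-separated paths,
    # so split them onto one line each and drop empty lines.
    sed -n 's/^RequiresMountsFor=//p' "$dropin" | tr -s ' ' '\n' | sed '/^$/d'
}
```

Running required_mounts with no argument on the server should list the exported paths, /mnt/media in my case, if the assumption about the file format holds. Paths containing escaped spaces are not handled by this sketch.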
So when the NFS server was started, or restarted after it had stopped, it would touch /mnt/media and trigger the automount, and the NFS server would happily start. But, and maybe a recent change in the kernel NFS server is what made this start happening, it doesn't hold any file open in /mnt/media unless an NFS client does. This means that 10 minutes later (the 600s idle-timeout), the automounter decides the idle-timeout has passed and unmounts it. Systemd then enforces the order-with-mounts.conf requirements and cleanly shuts down the NFS server.
So the simple fix is to not automount any partitions you export with NFS, or at the very least ensure that they don't have an idle-timeout set.
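If you want to check whether your own setup has the same trap, something like the following could cross-reference the two files. It is only a sketch: automounted_exports is a made-up name, it only matches exports whose path is exactly an fstab mount point (an exported subdirectory of an automounted filesystem would be missed), and it does not look at automount units defined outside /etc/fstab.

```shell
#!/bin/sh
# Hypothetical helper: list exported paths that /etc/fstab marks as automounted.
# Arguments default to the real files but can be overridden for testing.
automounted_exports() {
    exports_file="${1:-/etc/exports}"
    fstab_file="${2:-/etc/fstab}"
    awk '
        # First file (exports): remember each exported path, skipping
        # comment and blank lines.
        NR == FNR {
            if ($0 !~ /^[ \t]*(#|$)/) exported[$1] = 1
            next
        }
        # Second file (fstab): field 2 is the mount point, field 4 the options.
        $0 !~ /^[ \t]*(#|$)/ && ($2 in exported) &&
        $4 ~ /(^|,)x-systemd\.automount(,|$)/ {
            timeout = ($4 ~ /(^|,)x-systemd\.idle-timeout=/) \
                ? "has idle-timeout" : "no idle-timeout"
            print $2 ": automounted, " timeout
        }
    ' "$exports_file" "$fstab_file"
}
```

With the fstab and exports lines shown earlier, this would flag /mnt/media as automounted with an idle-timeout. For that entry the fix amounts to dropping x-systemd.automount and x-systemd.idle-timeout=600s from the options and keeping subvol=@media.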