I just fixed a problem with my NFS server inexplicably stopping all the time, which had me confused as hell. It suddenly started happening after a big apt update.

The symptoms were that after booting, the NFS server would run fine, but after a while it would suddenly stop and all my NFS clients would freeze. It looked like systemd had just decided to stop it for no reason. Running systemctl status showed this:

$ sudo systemctl status nfs-kernel-server.service
* nfs-server.service - NFS server and services
     Loaded: loaded (/usr/lib/systemd/system/nfs-server.service; enabled; preset: enabled)
    Drop-In: /run/systemd/generator/nfs-server.service.d
             --order-with-mounts.conf
     Active: inactive (dead) since Sun 2024-06-16 01:28:18 AEST; 11min ago
   Duration: 13min 5.405s
 Invocation: ae0f3876014b477fad03afa383db37e3
    Process: 975 ExecStartPre=/usr/sbin/exportfs -r (code=exited, status=0/SUCCESS)
    Process: 977 ExecStart=/usr/sbin/rpc.nfsd (code=exited, status=0/SUCCESS)
    Process: 1832 ExecStop=/usr/sbin/rpc.nfsd 0 (code=exited, status=0/SUCCESS)
    Process: 1837 ExecStopPost=/usr/sbin/exportfs -au (code=exited, status=0/SUCCESS)
    Process: 1838 ExecStopPost=/usr/sbin/exportfs -f (code=exited, status=0/SUCCESS)
   Main PID: 977 (code=exited, status=0/SUCCESS)

Jun 16 01:15:12 watiya systemd[1]: Starting nfs-server.service - NFS server and services...
Jun 16 01:15:12 watiya systemd[1]: Finished nfs-server.service - NFS server and services.
Jun 16 01:28:18 watiya systemd[1]: Stopping nfs-server.service - NFS server and services...
Jun 16 01:28:18 watiya systemd[1]: nfs-server.service: Deactivated successfully.
Jun 16 01:28:18 watiya systemd[1]: Stopped nfs-server.service - NFS server and services.

I could restart the NFS server with sudo systemctl restart nfs-kernel-server.service, and all my clients would unfreeze, but 10 to 15 minutes later it would stop again.

I chased so many rabbits down so many holes trying to fix this. I have had intermittent problems similar to this in the past, but I'm pretty sure they were caused by a failing disk at the time. This time I looked everywhere for signs of errors or failures and couldn't find anything. I even turned on NFS debug logging and trawled the journalctl logs in detail, and couldn't see a single error or anything that was triggering it. It doesn't help that I'm not a systemd expert, which led me down blind alleys suspecting some difference in how the service was enabled, or that start vs. restart was somehow the cause.

Eventually a combination of things triggered an AH-HA moment that solved it.

I remembered that my /etc/fstab had this:

/dev/bcache0 /mnt/media btrfs x-systemd.automount,x-systemd.idle-timeout=600s,subvol=@media 0 0

So I was automounting /mnt/media, which I was also exporting in /etc/exports with this:

/mnt/media 192.168.7.0/24(rw,sync,no_subtree_check,no_root_squash,insecure)

It turns out that order-with-mounts.conf (the Drop-In shown in the status output above) is auto-generated from your exports and makes the NFS server depend on those exported filesystems being mounted. This ensures they are mounted before the NFS server starts, and also that the NFS server is stopped when they are unmounted.
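To see the real drop-in on your own machine, systemctl cat nfs-server.service prints the unit followed by its drop-ins. As far as I can tell, the generator that writes it (nfs-server-generator from nfs-utils) emits one RequiresMountsFor= line per exported path. Per systemd.unit(5), that adds Requires= and After= on the mount units needed to reach the path, and it is the Requires= part that stops the service when one of those mount units is stopped. The function below is only a rough sketch of that mapping, not the real generator: the name sketch_order_with_mounts is made up, and it ignores line continuations, quoted paths and the generator's other special cases.

```shell
#!/bin/sh
# Rough sketch (NOT the real nfs-server-generator) of how a drop-in like
# order-with-mounts.conf can be derived from an exports file:
# one RequiresMountsFor= line per unique exported path.
# Simplifications: ignores backslash line continuations and quoted paths.
sketch_order_with_mounts() {
    exports=${1:-/etc/exports}
    # RequiresMountsFor= belongs in the [Unit] section of a unit file.
    printf '[Unit]\n'
    # Field 1 of each non-blank, non-comment line is the exported path;
    # the seen[] array skips paths that are exported more than once.
    awk 'NF > 0 && $1 !~ /^#/ && !seen[$1]++ { print "RequiresMountsFor=" $1 }' "$exports"
}
```

With the single export line above, sketch_order_with_mounts /etc/exports should print [Unit] followed by RequiresMountsFor=/mnt/media.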

So whenever the NFS server was started, or restarted after it had stopped, it would touch /mnt/media and trigger the automount, and the NFS server would happily start. But the kernel NFS server doesn't hold any files open in /mnt/media unless an NFS client does (maybe this is a recent change in the kernel NFS server, which would explain why this suddenly started happening). This means that once the mount has been idle for 10 minutes (idle-timeout=600s), the automounter decides the idle timeout has passed and unmounts it. Systemd then enforces the order-with-mounts.conf requirements and cleanly shuts down the NFS server.

So the simple fix is not to automount any partitions you export over NFS, or at the very least to make sure they don't have an idle timeout set.
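If you have a lot of exports, a quick check along these lines can flag the problem combination. It is a hypothetical helper (check_nfs_automount is a made-up name) that assumes plain whitespace-separated /etc/fstab and /etc/exports files. It ignores backslash line continuations, quoted paths, and mounts defined directly as systemd .mount/.automount units, so treat a clean result as a hint rather than proof.

```shell
#!/bin/sh
# Hypothetical helper: report NFS exports that live on an fstab entry using
# systemd automount options (x-systemd.automount / x-systemd.idle-timeout).
# Simplifications: no backslash continuations, no quoted/escaped paths,
# and only /etc/fstab is checked (not hand-written .mount units).
check_nfs_automount() {
    fstab=${1:-/etc/fstab}
    exports=${2:-/etc/exports}
    awk '
        # First file (exports): remember every exported path (field 1).
        FILENAME == ARGV[1] {
            if (NF > 0 && $1 !~ /^#/) exported[$1] = 1
            next
        }
        # Second file (fstab): skip blank lines and comments.
        NF == 0 || $1 ~ /^#/ { next }
        # Field 2 is the mount point, field 4 the mount options.
        $4 ~ /x-systemd\.(automount|idle-timeout)/ {
            # Match the mount point itself or any export below it.
            for (p in exported)
                if (p == $2 || index(p, $2 "/") == 1)
                    print "WARNING: export " p " is on automounted " $2 " (" $4 ")"
        }
    ' "$exports" "$fstab"
}
```

With the fstab and exports lines above, check_nfs_automount /etc/fstab /etc/exports should report /mnt/media. Dropping x-systemd.automount and x-systemd.idle-timeout=600s from that fstab entry (keeping subvol=@media) turns it back into an ordinary mount and silences the warning.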



