vnop_lookup returning ENOENT aborts rm(1)

When recursively removing a directory with a large number of entries that resides on my custom filesystem, rm(1) aborts with ENOENT.

% rm -Rv /Volumes/myfs/linux-kernel/linux-6.10.6
[...]
/Volumes/myfs/linux-kernel/linux-6.10.6/include/drm/bridge/aux-bridge.h
/Volumes/myfs/linux-kernel/linux-6.10.6/include/drm/bridge/dw_hdmi.h
rm: fts_read: No such file or directory

I'm observing the following sequence of calls being made.

2024-09-17 17:58:25 vnops.c:281 myfs_vnop_lookup: rm-936 -> dw_hdmi.h ;initial lookup call
2024-09-17 17:58:25 vnops.c:315 myfs_vnop_lookup: -> cache_lookup(dw_hdmi.h)
2024-09-17 17:58:25 vnops.c:317 myfs_vnop_lookup: <- cache_lookup(dw_hdmi.h) -> 0 ;cache miss

2024-09-17 17:58:25 rpc.c:431 myfsLookup: rm-936 -> dw_hdmi.h ;do remote lookup
2024-09-17 17:58:25 rpc.c:500 myfsLookup: -> myfs_lookup_rpc(dw_hdmi.h)
2024-09-17 17:58:25 rpc.c:502 myfsLookup: <- myfs_lookup_rpc(dw_hdmi.h) -> 0 ;file found and added to vfs cache

2024-09-17 17:58:25 vnops.c:281 myfs_vnop_lookup: rm-936 -> dw_hdmi.h ;subsequent lookup call
2024-09-17 17:58:25 vnops.c:315 myfs_vnop_lookup: -> cache_lookup(dw_hdmi.h)
2024-09-17 17:58:25 vnops.c:317 myfs_vnop_lookup: <- cache_lookup(dw_hdmi.h) -> -1 ;cache hit

2024-09-17 17:58:25 vnops.c:1478 myfs_vnop_remove: -> myfs_unlink(dw_hdmi.h) ;unlink sequence
2024-09-17 17:58:25 rpc.c:1992 myfs_unlink: -> myfs_unlink_rpc(dw_hdmi.h)
2024-09-17 17:58:25 rpc.c:1994 myfs_unlink: <- myfs_unlink_rpc(dw_hdmi.h) -> 0 ;remote unlink succeeded
2024-09-17 17:58:25 vnops.c:1480 myfs_vnop_remove: <- myfs_unlink(dw_hdmi.h) -> 0
2024-09-17 17:58:25 vnops.c:1487 myfs_vnop_remove: -> cache_purge(dw_hdmi.h)
2024-09-17 17:58:25 vnops.c:1489 myfs_vnop_remove: <- cache_purge(dw_hdmi.h)
2024-09-17 17:58:25 vnops.c:1499 myfs_vnop_remove: -> vnode_recycle(dw_hdmi.h)
2024-09-17 17:58:25 vnops.c:1501 myfs_vnop_remove: <- vnode_recycle(dw_hdmi.h)

2024-09-17 17:58:27 vnops.c:281 myfs_vnop_lookup: fseventsd-101 -> dw_hdmi.h ;another lookup; why?
2024-09-17 17:58:27 vnops.c:315 myfs_vnop_lookup: -> cache_lookup(dw_hdmi.h)
2024-09-17 17:58:27 vnops.c:317 myfs_vnop_lookup: <- cache_lookup(dw_hdmi.h) -> 0
2024-09-17 17:58:27 rpc.c:431 myfsLookup: fseventsd-101 -> dw_hdmi.h
2024-09-17 17:58:27 rpc.c:500 myfsLookup: -> myfs_lookup_rpc(dw_hdmi.h)
2024-09-17 17:58:27 rpc.c:502 myfsLookup: <- myfs_lookup_rpc(dw_hdmi.h) -> ENOENT(2)
2024-09-17 17:58:27 vnops.c:371 myfs_vnop_lookup: SET(NNEGNCENTRIES): dw_hdmi.h
2024-09-17 17:58:27 vnops.c:373 myfs_vnop_lookup: ENOENT(2) <- shouldAddToNegativeNameCache(dw_hdmi.h)

I checked the value of vnode's v_iocount when vnop_remove and vnop_reclaim are being called.

Each vnop_remove and followed by vnop_reclaim with v_iocount set to 1 in both calls, as expected.

What I don't understand is why after removing the file is there another lookup call being made, which returns ENOENT to rm(1), which causes it to abort.

Any pointers on what could be amiss there would be much appreciated.

Answered by DTS Engineer in 804199022

When recursively removing a directory with a large number of entries that resides on my custom filesystem, rm(1) aborts with ENOENT.

My psychic debugging sense™ says that your custom file systems backing store is another "standard" file system and you implemented VNOP_REMOVE by simply deleting the target file/directory.

The problem with that approach is that it simply isn't how any Unix file system operates. Unix allows the deletion of open objects, which means it's entirely valid to have (or lookup) a deleted object.

I checked the value of vnode's v_iocount when vnop_remove and vnop_reclaim are being called.

What about "v_usecount" (and "v_kusecount")? Those are what track actual references to the vnode.

What I don't understand is why after removing the file is there another lookup call being made, which returns ENOENT to rm(1), which causes it to abort.

Any pointers on what could be amiss there would be much appreciated.

Two major recommendations:

  1. Have you directly tested the basic behavior of the underlying system calls like "unlink", "rmdir", etc? Both in the simple case (for example, create a file, delete the file) and in more complicated cases (for example, create file, open file, remove file, close file)? Many developers get something running, typically through something like FUSE, and then immediately "jump" higher level functionality like command line tools or the Finder without ever verifying that the "basic" systems calls actually work.

  2. If your basic testing works fine, then you can find the source to "rm" in the "file_cmds" project, so you can build and test it yourself.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Accepted Answer

When recursively removing a directory with a large number of entries that resides on my custom filesystem, rm(1) aborts with ENOENT.

My psychic debugging sense™ says that your custom file systems backing store is another "standard" file system and you implemented VNOP_REMOVE by simply deleting the target file/directory.

The problem with that approach is that it simply isn't how any Unix file system operates. Unix allows the deletion of open objects, which means it's entirely valid to have (or lookup) a deleted object.

I checked the value of vnode's v_iocount when vnop_remove and vnop_reclaim are being called.

What about "v_usecount" (and "v_kusecount")? Those are what track actual references to the vnode.

What I don't understand is why after removing the file is there another lookup call being made, which returns ENOENT to rm(1), which causes it to abort.

Any pointers on what could be amiss there would be much appreciated.

Two major recommendations:

  1. Have you directly tested the basic behavior of the underlying system calls like "unlink", "rmdir", etc? Both in the simple case (for example, create a file, delete the file) and in more complicated cases (for example, create file, open file, remove file, close file)? Many developers get something running, typically through something like FUSE, and then immediately "jump" higher level functionality like command line tools or the Finder without ever verifying that the "basic" systems calls actually work.

  2. If your basic testing works fine, then you can find the source to "rm" in the "file_cmds" project, so you can build and test it yourself.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Thanks a lot for the tip Kevin.

Basic testing does work fine.

terminal1: % echo line1 >file.txt
terminal1: % ed file.txt ;open file.txt in ed
terminal2: % rm file.txt
terminal1: % ed file.txt ;close file.txt in ed

Finder

1. Create file.rtf using TextEdit.
2. Close file.rtf using TextEdit.
3. Open file.rtf using TextEdit.
4. Remove file.rtf via Finder.
5. Close file.rtf using TextEdit.

Onto a debugging session of rm(1) as you suggested.

Turns out, it was my lookup algorithm that was to blame.

Once adjusted as described here rm(1) was no longer complaining and running to completion.

Thanks for your help.

vnop_lookup returning ENOENT aborts rm(1)
 
 
Q