2. Find, Filter and Batch Ops

The previous section dealt with individual files; here we are mostly concerned with batch operations on large numbers of files.

2.1. ls and find

Running the ls command returns a vector of file system entries—the contents of a directory. If no directory name is given, the current directory is used. Try it in the exifuse folder:

exifuse/::julia> ls()
9-element Vector{AbstractEntry}:
 FileEntry("/home/martin/work/workspace_umlet/exifuse/LICENSE", -rw-r--r--, 1062 bytes)
 FileEntry("/home/martin/work/workspace_umlet/exifuse/README.md", -rw-r--r--, 34 bytes)
 ...
 FileEntry("/home/martin/work/workspace_umlet/exifuse/exifuse", -rwxr-xr-x, 3161 bytes)
 FileEntry("/home/martin/work/workspace_umlet/exifuse/exifuse_logo.jpg", -rw-r--r--, 450671 bytes)
 DirEntry("/home/martin/work/workspace_umlet/exifuse/pics", drwxr-xr-x)

[ 7 files ( none of which symlinked ) -- 444.581 Kb -- 455,251 bytes ( #paths:7  #dev:1  #inodes&dev:7 ) ]
{ :__empty__ 3/_ :__unregistered__ 1/"md" 1/"bat" :txt 1/"txt" :jpeg 1/"jpg" }
[ 2 dirs ( no syml ) ( #paths:2 ) ] :: [ no dev,sock,fifo.. ] :: [ no unknown/broken ]

Similarly, find recursively traverses all subfolders, and returns all entries found.

exifuse/::julia> find()
206-element Vector{AbstractEntry}:
 DirEntry("/home/martin/work/workspace_umlet/exifuse", drwxr-xr-x)
 DirEntry("/home/martin/work/workspace_umlet/exifuse/.git", drwxr-xr-x)
 FileEntry("/home/martin/work/workspace_umlet/exifuse/.git/COMMIT_EDITMSG", -rw-r--r--, 9 bytes)
 FileEntry("/home/martin/work/workspace_umlet/exifuse/.git/FETCH_HEAD", -rw-r--r--, 84 bytes)
 FileEntry("/home/martin/work/workspace_umlet/exifuse/.git/HEAD", -rw-r--r--, 21 bytes)
 ...

[ 123 files ( 1 symlinked ) -- 2.403 Mb -- 2,519,980 bytes ( #paths:121  #dev:1  #inodes&dev:121 ) ]
{ :__empty__ 78/_ :jpeg 19/"jpg" :__unregistered__ 12/"sample" 1/"idx" 1/"pack" 1/"md" 1/"bat" :gif 3/"gif" :tiff 3/"tiff" :png 3/"png" :txt 1/"txt" }
[ 83 dirs ( 1 symlinked ) ( #paths:82 ) ] :: [ no dev,sock,fifo.. ] :: [ no unknown/broken ]

The returned vector is a standard Julia vector. Its display is slightly tweaked to show a few summary statistics of the entries, mainly counts and sizes. (Information about the number of distinct paths, device IDs, and inodes is also given; more on these later.)
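To illustrate what "a standard Julia vector" buys you, here is a minimal sketch that uses only Base. It does not call the package's find; instead, the hypothetical helper walk_paths collects plain path strings with walkdir, so the element type differs from the entry types shown above. The point is only that the usual collection functions (length, count, filter, sum) apply to the result.

```julia
# Hypothetical stand-in for find(): collects paths as plain strings (Base only).
# This is NOT the package's implementation, just an illustration.
function walk_paths(root::AbstractString)
    paths = String[root]
    for (dir, dirs, files) in walkdir(root)
        append!(paths, joinpath.(dir, dirs))    # subdirectories of dir
        append!(paths, joinpath.(dir, files))   # files directly in dir
    end
    return paths
end

# Build a small throwaway tree so the sketch runs anywhere.
root = mktempdir()
mkpath(joinpath(root, "pics", "bar"))
write(joinpath(root, "README.md"), "hello")          # 5 bytes
write(joinpath(root, "pics", "a.jpg"), "1234")       # 4 bytes
write(joinpath(root, "pics", "bar", "b.png"), "12")  # 2 bytes

P = walk_paths(root)

# Because P is an ordinary Vector, the usual collection functions apply.
n_entries   = length(P)
n_dirs      = count(isdir, P)
total_bytes = sum(filesize, filter(isfile, P))
```

The same style of post-processing should carry over to the vectors returned by ls and find, with the package's own accessors taking the place of isdir and filesize.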

2.2. getfiles

Often, we are only interested in actual file entries, e.g., when working with our photo library. The function getfiles operates on vectors of entries (i.e., on vectors containing both files and folders, as well as symlinks to them), and just extracts the regular files, as well as the target files of symlinks.

exifuse/::julia> E = find()
...
exifuse/::julia> F = getfiles(E)
...

# alternatively:
exifuse/::julia> F = find() |> getfiles
...

Notice how only files remain in the resulting vector, and no more dirs.

(getfiles is a shortcut and just implements X |> filter(isfile) |> map(follow).)
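The filter-then-follow idea can be sketched with Base functions on plain path strings. The name getfiles_sketch is hypothetical, and realpath is used here as a stand-in for the package's follow; the package itself works on entry objects, so treat this as an analogy rather than its implementation.

```julia
# Hypothetical stand-in for getfiles, working on plain path strings.
# isfile follows symlinks, so symlinks to files pass the filter;
# realpath then resolves each remaining path to its target.
getfiles_sketch(paths) = map(realpath, filter(isfile, paths))

# Throwaway directory with one subfolder, one file, and one symlink to that file.
root = mktempdir()
mkdir(joinpath(root, "sub"))
target = joinpath(root, "a.txt")
write(target, "abc")
link = joinpath(root, "link_to_a")
symlink(target, link)   # may require extra privileges on Windows

paths = [joinpath(root, "sub"), target, link]
F = getfiles_sketch(paths)

# The directory is dropped; the file and the symlink both resolve to the same path.
same_target = F[1] == F[2]
```

Note that the regular file and the symlink to it end up as two identical paths in F.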

With symlinks in a directory potentially pointing to files or directories inside the very same hierarchy, you can end up with duplicate path names among the file entries, which can be tedious to deal with. For this reason, getfiles is often used in conjunction with checkpaths, described next.

2.3. checkpaths and findfiles

The function checkpaths operates on a vector of entries, typically coming from find. It ensures that the entries are sound in the sense that regular and symlinked entries do not lead to duplicate paths. Because it still has the full information about dirs and symlinks, it can more easily point out the problematic entries; often, a single symlink to a known directory causes many subsequent duplicate file entries. So we can use:

> find("pics/pics_tree") |> checkpaths |> getfiles
[ Info: Checking sanity of symlinks-to-dirs and dirs (most likely cause for duplicate files in a tree):
[ Info:   Check if a symlink points to an already known regular dir.. none found -- OK
[ Info:   Check if two symlinks point to the same dir.. none found -- OK
[ Info:   Check if all dirs (known and symlinked) have distinct paths.. 5 'DirEntry's checked -- OK
[ Info: Checking sanity of symlinks-to-files and files:
[ Info:   Check if a symlink points to an already known regular file.. none found -- OK
[ Info:   Check if two symlinks point to the same file.. none found -- OK
[ Info:   Check if all files (known and symlinked) have distinct paths.. 7 'FileEntry's checked -- OK
7-element Vector{AbstractEntry}:
 FileEntry("/home/martin/work/workspace_umlet/exifuse/pics/pics_tree/bar/s.tiff", -rw-r--r--, 130534 bytes)
 FileEntry("/home/martin/work/workspace_umlet/exifuse/pics/pics_tree/bar/u.png", -rw-r--r--, 83761 bytes)
...

The info messages tell you about the checks performed.
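As a rough illustration of the first check listed above (a symlink pointing to an already known regular dir), here is a Base-only sketch on plain path strings. The helper symlinks_to_known_dirs is hypothetical and is not how checkpaths is implemented; it only shows the kind of comparison involved.

```julia
# Hypothetical helper: returns (symlink => target) pairs for every symlink
# whose resolved target is one of the regular (non-symlink) dirs in `paths`.
function symlinks_to_known_dirs(paths)
    regular_dirs = Set(realpath(p) for p in paths if isdir(p) && !islink(p))
    hits = Pair{String,String}[]
    for p in paths
        # isdir follows symlinks, so this selects symlinks that point to dirs.
        if islink(p) && isdir(p)
            t = realpath(p)
            t in regular_dirs && push!(hits, p => t)
        end
    end
    return hits
end

# Throwaway tree: a regular dir "bar" and a symlink pointing to it.
root = mktempdir()
bar = joinpath(root, "bar")
mkdir(bar)
link = joinpath(root, "symlink_to_bar")
symlink(bar, link)   # may require extra privileges on Windows

hits = symlinks_to_known_dirs([root, bar, link])
```

Here hits contains a single pair, mapping the symlink path to the resolved path of bar. Every file below bar would otherwise be reachable under two different paths.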

Use another of the sample directories to see what happens in case some checks fail:

> find("pics/pics_withsyml_unix/") |> checkpaths |> getfiles
[ Info: Checking sanity of symlinks-to-dirs and dirs (most likely cause for duplicate files in a tree):
[ Info:   Check if a symlink points to an already known regular dir..
ERROR: Symlink to known regular dir detected (<syml-path> -> <target-path>):
"/home/martin/work/workspace_umlet/exifuse/pics/pics_withsyml_unix/symlink_to_bar" -> "/home/martin/work/workspace_umlet/exifuse/pics/pics_withsyml_unix/bar"
=> delete symlink, or add the <symlink-path> to the 'skip_paths' option.

Because the command sequence above is a lot to type, the shortcut findfiles implements the same logic directly:

> findfiles("pics/pics_tree")
...

finddupl

eachentry

filter

map

apply

broadcast