1. File System Entries

We think of a file system as a bunch of directories, containing files and some more directories. It is fairly easy to navigate it with some file explorer and act on a few individual files manually—we can delete, rename or copy them.

Yet if we want to operate on large numbers of files, often conditional on some properties of those files, things get more hairy. We typically script this in some loop, maybe using the Linux find command, in combination with awk or grep. We need to be careful about the different command line options these tools use, or how they handle white space in file names. But the underlying problems run deeper:

  • What should happen if an error occurs in the middle of a large batch script that copies thousands of files? Should the script just run on? If yes, will we later spot in some endless logfile that an error occurred? If no, what should happen to the files that were already copied?

  • Typical Linux scripting, often involving command pipes, is sequential. But often we would benefit from some global, joint checks of certain pre-conditions before we start: Will the total file size fit on the new volume? Are there any duplicates we should not store to the cloud? Are all file types known, or does our directory contain some freak files?

  • A particular issue arises with linked files. So-called symlinks or shortcuts can refer to other files. How to deal with them? Simply skipping them could make us lose data; accounting for them requires us to check if the symlink points to a file we already handle lest we create duplicates.

Exifuse uses a thin file system abstraction (via our Julia package FileJockey.jl) that helps build small file system primitives that can be combined to tackle typical operations like..

Note

How Finding Duplicates is Done

It is quite simple to find duplicate files in Exifuse; just type:

$ cd exifuse/pics
$ exifuse
...
pics/::julia> D = finddupl("pics_withdupl")
...TODO output after re-formatted

You will get a lot of output dumped on the screen—essentially showing which checks are performed, as quite a few things can go wrong. In the end though, in D, you hold the full duplicate information (a special object of type Dupl with a list of <original => [duplicates]>). You could then simply rm it to remove the duplicates while retaining the originals.

But this is not the main point here: the topic of finding duplicates should just illustrate how the underlying logic works. In the background, finddupl is implemented in a single line of code, by combining primitives:

# we omit a few details here, marked by the '.'
finddupl( . ) = find( . ) |> checkpaths |> filter(isfile) |> map(follow) |> getdupl

The steps are:

  1. The familiar find just traverses a directory and returns all file system entries.

  2. checkpaths then makes sure that all file paths are unique, including symlinked files. This makes sure that we can’t have a file and a symlink to it—if we follow the symlink, we’d end up with two identical filenames, which could be classified (very wrongly!) as duplicates.

  3. The filter step removes all non-files like directory entries. We are only left with file entries, and symlink2file entries.

  4. The map(follow) step then resolves all symlinks to their targets—we are left with a vector of pure, regular files.

  5. Finally, getdupl will perform the costly check of file content identity and return the duplicates. It can actually only be called with a vector of pure files—it is not even defined for a general vector of file entries.
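To make the five steps concrete, here is a minimal sketch of the same chain on plain path strings, using only Julia's standard library. All names ending in _sketch are hypothetical stand-ins and not the Exifuse or FileJockey.jl implementation. Hashing every file with SHA-256 is an assumption about how content identity could be checked, and the "original" of each group is simply the first file found.

```julia
using SHA  # standard library; provides sha256

# Step 1: collect every entry below `dir` (files, directories, symlinks).
function find_sketch(dir)
    paths = String[]
    for (root, dirs, files) in walkdir(dir)
        append!(paths, joinpath.(root, dirs))
        append!(paths, joinpath.(root, files))
    end
    return paths
end

# Step 2: fail if two file entries resolve to the same canonical path,
# e.g. a file and a symlink pointing to it.
function checkpaths_sketch(paths)
    targets = realpath.(filter(isfile, paths))
    allunique(targets) || error("two entries resolve to the same file")
    return paths
end

# Step 5: group files by content hash; the first file of each group is
# (arbitrarily) treated as the original, the rest as its duplicates.
function getdupl_sketch(files)
    groups = Dict{Vector{UInt8},Vector{String}}()
    for f in files
        push!(get!(groups, open(sha256, f), String[]), f)
    end
    dupl = Dict{String,Vector{String}}()
    for g in values(groups)
        length(g) > 1 && (dupl[g[1]] = g[2:end])
    end
    return dupl
end

function finddupl_sketch(dir)
    entries = checkpaths_sketch(find_sketch(dir))
    files = filter(isfile, entries)   # step 3: keeps files and symlinks to files
    targets = realpath.(files)        # step 4: resolve symlinks to their targets
    return getdupl_sketch(targets)
end
```

Hashing every file is the simplest approach but not the cheapest; comparing file sizes first would avoid reading files that cannot have a duplicate.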

Underneath it all are some special types that handle different kinds of file system entries. Embedded in the type-safe Julia language, they help tackle the problems mentioned above:

  • Julia and its type system of course provide extensive error handling capabilities. It is easy to define, e.g., certain functions only for certain “allowed” types; we can detect and handle exceptions; etc.

  • Exifuse retains information about files in memory to facilitate joint checks on them, and to speed up file status handling for large numbers of files on potentially slow network drives.

  • Exifuse explicitly distinguishes between regular files and directories, and symlinks to them. It also identifies the files with their canonical path, i.e., an absolute path containing no intermediate symlinks—which eliminates a whole class of potential problems.

1.1. File System Entry Types

Let’s have a look at the basic building blocks—types for file system entries. Go to the example folder in the exifuse folder, look at its content, and start Exifuse:

$ cd exifuse/pics/pics_withsyml_unix
$ ls
bar  e.jpg  symlink_to_bar  symlink_to_e.jpg
$ exifuse
...

We have just seen 4 file system entries. We’ll now represent them as objects in Julia; they are generated via the entry function:

pics_withsyml_unix/::julia> f = entry("e.jpg")
FileEntry("/foo/exifuse/pics/pics_withsyml_unix/e.jpg", -rw-r--r--, 24867 bytes)

Think of this FileEntry as a typed wrapper around a pure, regular file, with the following characteristics:

  • The entry is sure to exist—entry() will fail if the given path is invalid.

  • The path used inside this type is a full, absolute path, and it is “canonical” in the sense that it will contain no intermediate symlinks. If two such paths are different, they refer to different file system entries. The main guarantee this provides is the following: if two FileEntry objects have different paths but the same file content, then it is safe to remove one of them. (This is true even for two hardlinks; but don’t worry about them for now.)

  • As indicated by the access permissions and the file size in the display, some basic “stats” of the file are stored along with its path, effectively caching them. So running filesize(f) will not require any new disk access (most notable, of course, when dealing with many files on slow network drives). Another immediate consequence is that most of Julia’s built-in file system functions (like isfile()) automatically work on this type, because those functions internally just call stat() on their arguments (which our entries implement by returning the cached copy).
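The caching idea from the last point can be sketched in a few lines. The type CachedFile below is hypothetical and much simpler than the actual FileEntry; it only illustrates how overloading Base.stat lets Base's generic predicates work on a custom type. This relies on those predicates falling back to stat() on their argument, as described above.

```julia
# Hypothetical type for illustration; not the FileEntry type from FileJockey.jl.
struct CachedFile
    path::String
    st::Base.Filesystem.StatStruct   # snapshot taken at construction time
end

# Resolve the path to its canonical form and read the file status once.
# realpath throws if the path does not exist, so the entry is sure to exist.
function CachedFile(path::AbstractString)
    p = realpath(path)
    return CachedFile(p, stat(p))
end

# Base's generic predicates (isfile, isdir, filesize, ..) call stat() on their
# argument, so returning the snapshot avoids any further disk access.
Base.stat(f::CachedFile) = f.st
```

With this in place, filesize(f) and isfile(f) answer from the snapshot. The flip side is that the snapshot can go stale if the file changes on disk after the entry was created.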

The same holds for the other file entry types—DirEntry, and variants of Symlink (to either a FileEntry or DirEntry):

pics_withsyml_unix/::julia> d = entry("bar")
DirEntry("/foo/exifuse/pics/pics_withsyml_unix/bar", drwxr-xr-x)

pics_withsyml_unix/::julia> sf = entry("symlink_to_e.jpg")
Symlink{FileEntry}("/foo/exifuse/pics/pics_withsyml_unix/symlink_to_e.jpg" -> "/foo/exifuse/pics/pics_withsyml_unix/e.jpg")

pics_withsyml_unix/::julia> sd = entry("symlink_to_bar/")
Symlink{DirEntry}("/foo/exifuse/pics/pics_withsyml_unix/symlink_to_bar" -> "/foo/exifuse/pics/pics_withsyml_unix/bar")

A note about the path guarantee of the symlink itself (i.e., not its target): the path is again canonical—up to the last part, the symlink’s name itself. So there are no intermediate directory symlinks, very much like above.
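One way to obtain such a path is to resolve only the parent directory and then re-attach the symlink's own name. The function linkpath_sketch is a hypothetical illustration under that assumption, for Unix-style paths; it is not how FileJockey.jl necessarily does it.

```julia
# Hypothetical helper: canonical path of a symlink itself (not of its target).
function linkpath_sketch(p::AbstractString)
    # Drop a trailing separator as in "symlink_to_bar/" (Unix-style '/' assumed).
    q = String(rstrip(abspath(p), '/'))
    # realpath resolves all symlinks in the parent directory; the last
    # component is kept as-is so the link is not followed.
    return joinpath(realpath(dirname(q)), basename(q))
end
```

Calling realpath on the full path instead would follow the symlink and return the path of its target.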

We won’t often deal with the entry function, as we usually just use ls or find to give us a whole vector of entries:

pics_withsyml_unix/::julia> ls()
4-element Vector{AbstractEntry}:
 DirEntry("/foo/exifuse/pics/pics_withsyml_unix/bar", drwxr-xr-x)
 FileEntry("/foo/exifuse/pics/pics_withsyml_unix/e.jpg", -rw-r--r--, 24867 bytes)
 Symlink{DirEntry}("/foo/exifuse/pics/pics_withsyml_unix/symlink_to_bar" -> "/foo/exifuse/pics/pics_withsyml_unix/bar")
 Symlink{FileEntry}("/foo/exifuse/pics/pics_withsyml_unix/symlink_to_e.jpg" -> "/foo/exifuse/pics/pics_withsyml_unix/e.jpg")
 ...

1.2. Julia’s Basic File Read Ops

As mentioned above, most of Julia’s native file operations seamlessly work on our file entry types; here are a few examples:

> f = entry("e.jpg")
FileEntry("/foo/exifuse/pics/pics_withsyml_unix/e.jpg", -rw-r--r--, 24867 bytes)

> sf = entry("symlink_to_e.jpg")
Symlink{FileEntry}("/foo/exifuse/pics/pics_withsyml_unix/symlink_to_e.jpg" -> "/foo/exifuse/pics/pics_withsyml_unix/e.jpg")

> sd = entry("symlink_to_bar/")
Symlink{DirEntry}("/foo/exifuse/pics/pics_withsyml_unix/symlink_to_bar" -> "/foo/exifuse/pics/pics_withsyml_unix/bar")

> isfile(f)
true

> filesize(f)
24867

> basename(f)
"e.jpg"

> dirname(f)
"/foo/exifuse/pics/pics_withsyml_unix"

> islink(f)
false

> islink(sf)
true

> isfile(sf)   #  a symlink to a file is considered a file
true

> isfile(sd)   # ..unlike one to a dir
false

> stat(f)
StatStruct for "/foo/exifuse/pics/pics_withsyml_unix/e.jpg"
   size: 24867 bytes
 device: 2080
  inode: 108351
   mode: 0o100644 (-rw-r--r--)
  nlink: 1
    uid: 1000 (martin)
    gid: 1000 (martin)
   rdev: 0
  blksz: 4096
 blocks: 56
  mtime: 2023-04-29T17:52:02+0200 (58 minutes ago)
  ctime: 2023-04-29T17:52:02+0200 (58 minutes ago)

1.3. Additional File Helpers

Exifuse provides a few additional basic convenience functions, aimed at helping us type less in our chains of REPL commands hinted at in the tutorials. Behold:

> filesizehuman(f)
"24.284 Kb"

> path(f)
"/foo/exifuse/pics/pics_withsyml_unix/e.jpg"

> name(f)
"e.jpg"     # same as basename(), but less type-y

> ext(f)
"jpg"

> f |> ext
"jpg"       # same via pipe syntax

> ext(f) == "jpg"
true

> hasext(f, "jpg")
true        # seems silly at first, but handy for later '.. |> filter(hasext("jpg")) |> ..'

> f |> sizegt(20000)
true

> f |> sizelt(20000)
false
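The one-argument forms such as hasext("jpg") or sizegt(20000) return a predicate that can be handed to filter. A minimal sketch of that pattern on plain path strings might look as follows; the _sketch names are hypothetical and the actual Exifuse helpers operate on entry objects.

```julia
# Hypothetical re-implementations on plain path strings.
function ext_sketch(p)
    e = splitext(p)[2]              # ".jpg", or "" if there is no extension
    return isempty(e) ? "" : e[2:end]
end

hasext_sketch(p, x) = ext_sketch(p) == x
hasext_sketch(x) = p -> hasext_sketch(p, x)   # one argument: return a predicate

sizegt_sketch(n) = p -> filesize(p) > n
sizelt_sketch(n) = p -> filesize(p) < n

# The returned predicates plug directly into filter:
jpgs = filter(hasext_sketch("jpg"), ["a.jpg", "b.png", "c"])
```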

1.4. File Operations

As above, Julia provides a host of file modification functions (mv, cp, ..). They are not all wrapped (yet) to operate directly on FileEntry objects, but you can simply invoke them with their path() value. The reason for this is that a function like cp cannot trivially operate on a whole vector of entries, as there would have to be a way of specifying a target path for each of them. Below we’ll discover how hardlink is vectorized this way.

1.4.1. Delete Files / rm

The rm command works on a FileEntry, much like the original Julia function operating on a string path.

> rm(f)

It also works on whole vectors:

> F = findfiles |> filter(sizezero)
> rm(F)

You could achieve the same via

> findfiles |> filter(sizezero) |> rm

or

> findfiles |> filter(sizezero) |> map(rm)

or

> findfiles |> filter(sizezero) |> apply(rm)

The latter, apply, works like map but does not return an unneeded vector of nothing entries in the end.
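The difference between the two can be sketched with curried helpers built on Base's map and foreach. The names map_sketch and apply_sketch are hypothetical; this is only one plausible way such pipe-friendly helpers could be defined.

```julia
# Hypothetical curried helpers for use in pipes.
map_sketch(f) = xs -> map(f, xs)           # returns a vector of results
apply_sketch(f) = xs -> foreach(f, xs)     # foreach returns nothing

seen = Int[]
r1 = [1, 2, 3] |> map_sketch(x -> x^2)                 # a new vector
r2 = [1, 2, 3] |> apply_sketch(x -> push!(seen, x))    # side effects only
```

Here r1 holds the squared values, while r2 is nothing and the work is visible only through the side effect on seen.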

So in Julia, many functions do not really require an explicit vectorized version, as we can always map or apply a function, or broadcast it. In this case, rm is vectorized for your typing convenience. But there might be other functions where vectorization provides more benefits. Later on we’ll encounter the exify function. Because exify calls out to an external tool with startup costs, it is beneficial for it to work on whole batches of files at once.

rm also operates in a specific way on dupl dictionaries—the result of the getdupl and finddupl functions. They contain a dictionary of <original => [dupes]>. In this case, rm only removes the dictionary values, i.e., duplicates, and not the keys/originals.

> finddupl("myphotos") |> rm
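The "values only" behaviour can be sketched on an ordinary dictionary of path strings. The function rmdupl_sketch is hypothetical and assumes the duplicate information has the shape original => [duplicates]; the real Dupl type may carry more structure.

```julia
# Hypothetical sketch: delete the duplicates (values), keep the originals (keys).
function rmdupl_sketch(d::AbstractDict)
    for dupes in values(d), p in dupes
        rm(p)
    end
    return collect(keys(d))   # the originals, which are left untouched
end
```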

1.4.2. Script-Delete Files / rmscript

To be on the safe side, rmscript creates a script that would delete files if you ran it; you can then inspect and modify it.

If operating on a Dupl structure of duplicates, the script also contains comments with the not-to-be-deleted originals.

[TODO implementation work in progress]
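Until the real implementation is documented, the idea can be sketched as follows. The function rmscript_sketch, its signature, and the script layout are all assumptions made for illustration; it takes a plain dictionary of original => [duplicates] and writes a POSIX shell script to an IO stream.

```julia
# Hypothetical sketch; not the rmscript function of Exifuse.
function rmscript_sketch(io::IO, d::AbstractDict)
    println(io, "#!/bin/sh")
    for (orig, dupes) in d
        println(io, "# keep:  ", orig)          # original, shown as a comment only
        for p in dupes
            # Base.shell_escape quotes paths containing spaces or special characters
            println(io, "rm -- ", Base.shell_escape(p))
        end
    end
end

# Example: print the script to the terminal for inspection
rmscript_sketch(stdout, Dict("/p/a.jpg" => ["/p/copy of a.jpg"]))
```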

1.4.3. Check for Same Content / aredupl

The function aredupl checks if two distinct files have the same file content.

> aredupl(f, g)

This function will fail with an exception if

  • the paths of the two files are the same (that could be dangerous if you nonchalantly remove one of those presumed “duplicates”); or

  • the files are hardlinks referring to the same underlying raw file (this would be less dangerous, but could lead to unintuitive disk savings of zero when we delete one of the entries).
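A minimal sketch of this contract on plain path strings could look as follows. The function aredupl_sketch is hypothetical; detecting hardlinks by comparing device and inode numbers is an assumption that holds on Unix-like systems, and the byte-for-byte comparison reads both files fully into memory, which a real implementation would likely avoid.

```julia
# Hypothetical sketch; not the aredupl function of Exifuse.
function aredupl_sketch(a, b)
    ra, rb = realpath(a), realpath(b)
    ra == rb && error("both arguments refer to the same path")
    sa, sb = stat(ra), stat(rb)
    if sa.device == sb.device && sa.inode == sb.inode
        error("both arguments are hardlinks to the same underlying file")
    end
    filesize(sa) == filesize(sb) || return false   # cheap check first
    return read(ra) == read(rb)                    # costly content comparison
end
```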

1.4.4. Check for Sameness / aresame

aresame checks if two files are in fact hardlinks to the same underlying raw file. Julia also provides a similar function, but here we want a stricter contract.
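What exactly the stricter contract entails is not spelled out here, so the following is only one possible reading: unlike Base's samefile, the hypothetical aresame_sketch below insists that both arguments are existing files with distinct canonical paths, and then compares device and inode numbers (an approach that assumes a Unix-like system).

```julia
# Hypothetical sketch; the checks are assumptions, not the Exifuse contract.
function aresame_sketch(a, b)
    isfile(a) && isfile(b) || error("both arguments must be existing files")
    ra, rb = realpath(a), realpath(b)
    ra == rb && error("both arguments refer to the same path")
    sa, sb = stat(ra), stat(rb)
    return sa.device == sb.device && sa.inode == sb.inode
end
```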