Prefaulting Memory Mapped Arrays in Julia

One of the key facts about memory maps is that the mapping is performed asynchronously, with the data being loaded on demand as it is used. In some cases, though, we may know that the entire mapped region will be required — and only at the end of some other long-running process — so that there would be an advantage to expending some parallel effort to prefault the entire mapping into memory. In this article I motivate such a use case, and then I explore two separate ways of achieving the prefaulting behavior.


Introduction

In my research, we make use of several very large sparse matrices, and one of my goals in CMB.jl is to take advantage of constructs like shared memory (especially of read-only data products) wherever possible. Because the matrices are sourced from disk, we have the opportunity to offload the management of the shared memory to the operating system; in particular, we can be agnostic to whether there are any other processes using the same data file or not since the operating system will automatically reuse the same memory for all processes that request the same mapped files.

An important consideration to keep in mind, though, is that memory mapping (by default) is an asynchronous (lazy) process. What typically happens is that the file region is mapped into the process' virtual memory, and the mmap call returns almost immediately without having actually loaded any data from disk. The result is more an “I owe you” where the mapped memory’s addresses are considered valid, but the operating system will wait until first use (either read or write) to actually load a given section of the file into active memory.

For my use case, I know two things:

  1. The matrix is very large and takes a significant amount of time to load from disk (especially when retrieved from network attached storage, as is typical in a research computing cluster environment).
  2. The matrix is required only at the end of a CPU-intensive calculation.

First consider the case where the lazy loading behavior is unchanged for a computation parallelized over multiple threads/cores. If we assume all threads are well balanced in runtime, then all threads will complete the CPU-intensive calculation at nearly the same time and move on to using the matrix. This then causes the operating system to actually load the data from disk into memory, but since loading from disk is much slower than the computational use, all threads will spend most of their time stalled while waiting for the memory region to ready. The undesirable result of using the default lazy loading behavior is that we have stalled all computational threads.

Now consider a case where we spare a single thread to spend its time prefaulting the matrix into active memory by doing some sort of interaction with the matrix in parallel with the CPU-intensive calculation. We’ve lost one computational thread,1 but by loading the matrix in parallel with the computations, that small sacrifice saves all other threads from stalling once they reach the point of using the matrix.

Default Lazy Behavior

Let us first begin by demonstrating the default lazy loading behavior, which requires a sufficiently large data file to make the actual read time noticeably distinct from any small overhead in constructing the memory map. I have enough RAM in my machine to comfortably load a 15GiB data file into memory without issues, so we’ll just populate a flat file with randomly drawn bytes:

julia> using Random

julia> data_path = expanduser("~/raw_data.bin")

julia> open(data_path, "w") do fid
           data = rand(UInt8, 15 * 1024^3) # 15 GiB of data
           write(fid, data)
       end

The other important factor in timing the load times of this large sample file is that the system will cache the file in RAM (as long as there is enough free space to maintain the cache even when not in use, which is true for my machine), so unless we take particular actions to clear the cache, we will be profiling the file cache and not the actual read performance. Thankfully, Linux makes it easy to ask the kernel to evict the file caches on demand. By writing the character '1' to the special file /proc/sys/vm/drop_caches, the kernel will dump all cached files which are not in current use. The following function writes the character '1' to the control file with root permissions (by invoking through sudo to avoid having to run Julia itself with root permissions):

julia> function drop_caches()
           global ans = nothing # Clear implicitly saved return value in the REPL.
           GC.gc()              # Garbage collect to free memory/run finalizers.
           run(pipeline(`sudo tee /proc/sys/vm/drop_caches`,
                        stdin  = IOBuffer("1"),
                        stdout = devnull))
       end

Now we’re ready to time the default behavior of a memory map and compare that to a raw read of the data from disk. (I’ll motivate using UnixMmap.jl instead of the standard library’s Mmap in the next section, but for this example, Mmap.mmap would work the same.)

julia> using UnixMmap: UnixMmap, mmap

julia> drop_caches(); @time read(data_path);
 11.073034 seconds (16 allocations: 15.000 GiB, 0.00% gc time)

julia> drop_caches(); @time mmap(data_path);
  0.000051 seconds (19 allocations: 1.484 KiB)

The obvious conclusion is that memory mapping the file takes a negligible amount of time compared to actually reading the data from disk. We can concretely show that the mmapped array has lazy loading behavior by also timing a subsequent use of the loaded data products, where we see that summing over the array takes longer2 in the mmapped case because there is also time spent retrieving the data from disk.

julia> let data = read(data_path)
           drop_caches()
           @time sum(data);
       end
  2.751237 seconds (76.50 k allocations: 3.940 MiB)

julia> let data = mmap(data_path)
           drop_caches()
           @time sum(data);
       end
 10.364109 seconds (1 allocation: 16 bytes)

To avoid the second case wherein the sum operation is dramatically slowed while the data is actually loaded into memory by the operating system, we seek a way to prefault all of the mmapped data into active memory.

Prefaulting Memory Maps

Linux and MAP_POPULATE

Linux makes this task easy. Linux’s mmap supports the MAP_POPULATE flag which instructs the kernel to synchronously load the entire mapped region into active memory (thereby disabling the lazy-loading quick return of the traditional mmap that we observed above). Julia’s Mmap standard library unfortunately doesn’t support the entire suite of mmap flags, so I instead wrote UnixMmap.jl which exposes all of the additional flags that mmap accepts across various Unix-like operating systems.

Using the prefaulting form of mmap, we can now see that the load happens immediately and is comparable to the time required to read the data.

julia> using UnixMmap: MAP_SHARED, MAP_POPULATE

julia> drop_caches(); @time mmap(data_path, flags = MAP_SHARED | MAP_POPULATE);
 10.528155 seconds (19 allocations: 1.484 KiB)

The difference, though, is that a second mapping of the file benefits greatly from the existing mapping of the file, even for a separate Julia instance as was used for the following case:

julia> @time mmap(data_path, flags = MAP_SHARED | MAP_POPULATE);
  0.474336 seconds (19 allocations: 1.484 KiB)

This is precisely the behavior I motivated in the introduction for why loading large data products using memory mapping is desirable — without any effort on our part, we have gained the benefit of shared memory (both in terms of secondary load time and the total memory use).

Thankfully, most research computing environments I am familiar with are built on Linux clusters, so using the MAP_POPULATE flag is the obvious choice. (The feature is also relatively old — since Linux 2.6.46, released 2002 Nov 04 — so at this point it should be ubiquitously available.)

Non-Linux Systems

For completeness, though, it is interesting to consider how to prefault the mmapped arrays on non-Linux systems ourselves.

I stated earlier that one of the conditions that causes the system to actually load the disk contents into memory is to read from the array, so our first attempt might be to just loop over all elements in the array, read them, and return:

julia> function prefault0(A::Array)
           @inbounds for ii in eachindex(A)
               A[ii]
           end
           nothing
       end
prefault0 (generic function with 1 method)

julia> data = mmap(data_path);

julia> @time prefault0(data)
  0.000002 seconds

(where I’ve previously already run prefault0 to ensure it was already compiled).

Well that clearly didn’t work.

The issue is that this function is just too simple, and the compiler decides that nothing useful is done with the loop and completely eliminates it. That is easily seen in the machine code generated for this function:

julia> code_native(prefault0, (Vector{UInt8},), debuginfo=:none)
        .text
        movabsq $jl_system_image_data, %rax
        movq    %rsi, -8(%rsp)
        retq

There are no conditionals or jumps to previous lines that would form the loop; instead it appears only a single constant value is returned (which is presumably the machine-code representation of nothing).

To keep the compiler from eliding our loop, we have to convince it that some useful work is actually being performed. Therefore, let us do one of the simplest operations: sum over the some entries to make use of the loaded values. Note that reading even a single byte from a given “page” of the mmaped array is sufficient to trigger the system to load the entire page into memory, so we don’t actually need to sum over the entire array — just loading and using one value per page (being careful to translate the skip from bytes to number of elements) is enough work.

julia> function prefault1(A::Array)
           v = zero(eltype(A))
           step = UnixMmap.PAGESIZE ÷ sizeof(eltype(A))
           @inbounds for ii in 1:step:length(A)
               v += A[ii]
           end
           return v
       end
prefault1 (generic function with 1 method)

julia> @time prefault1(data);
 10.628456 seconds

julia> @time sum(data);
  2.690651 seconds (1 allocation: 16 bytes)

This time the compiler keeps our loop in place, and we successfully prefault the array into memory so that the subsequent time to sum over the array is comparable to the time we got with the direct read in Section II.

Conclusions

In this posting I motived the use case for loading very large datasets using memory mapped files rather than directly reading them to regular arrays. We saw that mmapped data can be shared across processes with no effort required on our part, but the downside to the default mmap behavior is that the load is delayed until it is used. When we know ahead of time that the data will be used multiple times at the end of another long-running calculation, it is advantageous to load the data into active memory in parallel. I then showed that, on Linux, this is simple to accomplish using the MAP_POPULATE flag which asks the kernel to load the entire mapped region before completing the mmap call. On non-Linux systems where a similar flag doesn’t exist, we instead have to write our own short (but not too simple) function to make use of the data to coerce the system into pulling all of the data into memory.

Finally, a topic we haven’t touched on here is that of advisory flags on memory mapped regions, which may be useful depending on system configuration and load. Julia v1.6 will include the Mmap.madvise! function in the Mmap standard library, and a parallel implementation is also provided by UnixMmap.jl.


  1. The loss of this thread is also amortized further if multiple processes (staggered a bit in launch time) with independent thread pools run on the same computer and make use of the same matrix. The first process will lose a thread to spend time loading the matrix from disk, but as long as mapped data is not purged from the operating system’s page caches, subsequent processes' loading thread will complete almost immediately. Those loader threads are then effectively immediately available to also run the CPU-intensive calculations. This is a major motivation for directly memory mapping the matrix rather than reading into (anonymous) shared memory, since the operation system can give us the amortization with no work on our part. ↩︎

  2. You may notice that the mmapped read + sum operation takes only about 10.4 sec, which is notably less than the 11.1 sec + 2.8 sec ≈ 13.9 sec in total to read the data and then to sum over the array. I have not investigated this in detail, but the discrepancy is repeatable on my machine, and my guess is that there may be two interacting factors at work. First, the summing calculations on already-loaded data and loading the next block of data can be performed in parallel on separate CPU cores, so the total wall time being reported by the @time macro is less than the CPU time consumed across all associated work. Second, there may be some cache-level performance gains from having the recently-loaded data in CPU caches so that the summing is not fetching all the way from main memory, so the summing itself can be performed a bit faster. ↩︎