\chapter{Discussion}
\label{chap:discussion}

In the course of designing and implementing preload as a file prefetching
system that works at a higher level than previous ones, we faced several
issues and problems.  While we did not solve every one of them, we gained
an intimate understanding of how other prefetching systems work.  In the following
sections we discuss the limitations of our approach and possible improvements to it, and
offer recommendations for systems seeking to improve application
start-up time through prefetching.


\section{Limitations}

Preload's major limitation in reducing I/O stall during application start-up is
that it only tracks mapped files.  While mapping files is known to be a
superior way to access read-only data for various reasons\footnote{Using a
shared copy for all processes, and avoiding a copy to user-space.}, not all
applications make use of it.  In fact, most of the hundreds of mapped files
for the applications we measured are shared libraries and font-related files,
handled by the linker and by libraries further down the application stack.  If
applications made better use of \syscall{mmap}, preload could
be more successful in reducing their start-up time.  Another source of I/O
stall that preload does not help with is reading directories
(\syscall{getdents}).  Finally, the \syscall{stat} system call can also
cause an I/O stall.

While all blocking I/O operations take
about the same time to complete, namely the disk access time (10\,ms in our
experiments), the results of \syscall{stat} and \syscall{getdents} calls take
significantly less memory to cache.  So, on a computer with various
memory-hungry applications eating all the lunch off the page-cache's plate, it
seems most beneficial to go after caching the results of these two system
calls\footnote{Or more practically, caching the I/O blocks that these system calls read.} instead of
caching files.  This is in fact one of the advantages of the SuSE Preload
approach to boot speed-up that we covered in \autoref{subsec:suse-preload}.
In defense of prefetching files, applications should really avoid performing
more than a few \syscall{stat} and \syscall{getdents} calls.  And unlike
file reads, these calls are typically very easy to eliminate from an application.
%Most of the time applications do something very silly when they really do
%not have to.
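
The idea of caching the blocks behind \syscall{stat} and \syscall{getdents}
can be illustrated with a short Python sketch.  It is not part of preload or
of SuSE Preload; the function name \texttt{warm\_metadata} and its return
value are hypothetical and chosen only for this illustration.  The sketch
walks a directory tree, lists every directory, and stats every entry, so that
the kernel has read the corresponding directory and inode blocks.

```python
import os


def warm_metadata(root):
    """Walk `root`, listing every directory and stat-ing every entry.

    Returns (directories_listed, entries_statted).  The useful side effect
    is that the kernel reads, and may keep cached, the directory and inode
    blocks that later getdents/stat calls on the same paths would need.
    Hypothetical helper for illustration only.
    """
    dirs_listed = 0
    entries_statted = 0
    pending = [root]
    while pending:
        path = pending.pop()
        try:
            with os.scandir(path) as entries:  # directory read (getdents on Linux)
                dirs_listed += 1
                for entry in entries:
                    try:
                        entry.stat(follow_symlinks=False)  # lstat of the entry
                        entries_statted += 1
                        if entry.is_dir(follow_symlinks=False):
                            pending.append(entry.path)
                    except OSError:
                        continue  # entry vanished or is unreadable: skip it
        except OSError:
            continue  # directory missing or unreadable: skip it
    return dirs_listed, entries_statted
```

Calling \texttt{warm\_metadata} on a configuration or data directory before
an application starts would, under these assumptions, move its directory
lookups from disk to the page cache at a small memory cost.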

To target all I/O stalls, one needs to instrument
all I/O accesses made by applications.  This is not feasible
in user-space\footnote{It is possible though, by preloading a shared library
to intercept system calls, rewriting the binary to generate hints, or using
the kernel's debugging API, as
\texttt{strace} does.  However, all of these have measurable effects on the application.},
and is therefore out of scope for preload.  When that data is available, it
seems logical to prefetch everything, \emph{and} to reorder blocks on the disk
so that the files required for starting popular applications sit in
the same area of the disk, reducing access time when reading them.  As we
covered in \autoref{subsec:winxp}, this is roughly what Windows XP does.
Windows XP, however, prefetches the files upon application launch.  We discuss
that approach in \autoref{sec:aggressive}.


\section{Aggressive Prefetching}
\label{sec:aggressive}

Papathanasiou and Scott \cite{papathan05aggressive} argue that with the
drastic growth of processor power and main memory sizes in the past decade,
the time may have come to employ aggressive prefetching.  However, that is
only possible if prefetching is integrated with caching, and probably only
relevant in other levels of prefetching that can achieve a high prediction
accuracy.

For preload, prediction accuracy is hardly a meaningful performance measure, considering
what preload does in a steady state (no application starting or
shutting down): it prefetches the same set of maps again and again, every
cycle.  This is a property of the memoryless model.  Although a prefetch request
for a file that is already in memory is a cheap no-op,
preload must still repeat it every cycle to make sure that the predicted maps
are in memory when the user starts the next application.  There are two
basic reasons for this: (i) preload does not have a separate cache, nor does
it have any control over the cache replacement algorithm, and (ii) the time at which
the next application starts has a very high variance.  Neither of these
issues exists in most other prefetching frameworks and implementations.  For
example, in most file-based prefetching systems, the prefetching engine is
implemented in the kernel and has direct control over the cache.  Even more
important, patterns in file accesses are dictated by a limited set of
commonly used applications that always access the same set of files in the same order
with nearly the same delay in between.
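
The steady-state cycle described above can be sketched in a few lines of
Python.  This is an illustration under stated assumptions, not preload's
implementation: it assumes that \texttt{posix\_fadvise} with
\texttt{POSIX\_FADV\_WILLNEED} is the prefetch primitive, and the names
\texttt{prefetch\_once}, \texttt{keep\_warm}, \texttt{predict},
\texttt{cycles}, and \texttt{period\_s} are hypothetical.  The callable
\texttt{predict} stands in for the prediction model and returns the paths to
keep in memory.

```python
import os
import time


def prefetch_once(paths):
    """Hint the kernel to load each file into the page cache.

    Returns the number of files successfully hinted.  For a file that is
    already cached the hint costs very little, which is why repeating it
    every cycle is affordable.
    """
    hinted = 0
    for path in paths:
        try:
            fd = os.open(path, os.O_RDONLY)
        except OSError:
            continue  # file vanished or is unreadable: skip it
        try:
            # offset 0 with length 0 covers the file up to its end
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
            hinted += 1
        except OSError:
            pass  # the hint is advisory; ignore failures
        finally:
            os.close(fd)
    return hinted


def keep_warm(predict, cycles, period_s):
    """Re-issue the predicted hints every cycle; return the total hint count.

    The loop has no view of what the cache evicted in the meantime, so it
    can only repeat the same requests (assumption: no cache feedback).
    """
    total = 0
    for _ in range(cycles):
        total += prefetch_once(predict())
        time.sleep(period_s)
    return total
```

The sketch makes the cost of missing cache integration visible: every cycle
reopens and re-hints every predicted file, whether or not anything was
evicted since the previous cycle.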

For the reasons stated above, preload's operation is best thought of as
\emph{keeping the cache warm} for popular applications, based on the set of
currently running applications.  A major problem with keeping the cache warm
without proper integration with the cache is that a lot of extra work needs to
be done.  For example, playing a DVD movie on a computer trashes the entire
page cache, because the DVD content is 4.7\,GB worth of data read into
main memory over a two-hour period and accessed only once, but the kernel has
no idea that caching DVD contents is hopeless.  Contrasting this
DVD-playing scenario with what preload does brings us back to the
question of whether we really need prefetching to improve performance, and how
much a more sophisticated, history-based cache replacement algorithm
could improve performance.  Vellanki and Chervenak \cite{vellanki99costbenefit}
raised the same question and demonstrated that well over half of all accesses
in a file-system are cacheable based on history, significantly more than LRU
and prefetching achieve in most cases.

Another question raised by the way preload works is whether we need to prefetch prior
to application launch to be successful in improving start-up time.  The design
of preload implies that the answer is yes, but that is not necessarily the
case.  In particular, since we failed to remove all the stall time from
application start-up, it may not be unrealistic to start prefetching all files
the application needs upon its launch.  Again, that needs to be handled in the
kernel\footnote{Or using a preloaded library or tracing.}, but whoever has the
ability to do that also has information about all I/O
requests (not only maps), and can therefore get rid of all the prediction logic
and page cache pollution and simply start prefetching upon application launch.  If
correctly implemented, that can improve start-up time by as much as
preload could achieve (about 50\% for larger applications).  Windows XP does
this, as described in \autoref{subsec:winxp}, and claims to greatly improve application start-up
time.  However, we failed to find any academic evidence of Windows XP's
prefetching performance.  We found instead a technical how-to on the web
\cite{winxp:prefetching-is-bad} that suggests removing the prefetching
database files in Windows XP\footnote{Located in the \texttt{Prefetch}
directory inside the Windows folder.} as a way to \emph{speed up Windows XP
boot up and shutdown}.  Windows XP uses the same mechanism to prefetch files
during the boot process, so the how-to may be sacrificing
application start-up time to get a faster boot.  This in fact lines up with
preload's results of slowing the boot process down, and our measurements of
Fedora's Readahead system, covered in \autoref{subsec:fedora-readahead},
revealed the same behavior: the Readahead service
slowed the boot process down and sped up log-in time.


\section{Improvements}
\label{sec:improvements}

There are various improvements that can be applied to preload, as well as to
other prefetching systems that are widely in use today and were covered in
\autoref{chap:introduction}.

As we noted in \autoref{sec:aggressive}, prefetching during the boot process
can very well hurt the boot time.  This is in part because
the boot process is already mostly I/O intensive.  The I/O bus is
not fully utilized during the entire boot process, but weaving prefetching
requests into the gaps in the normal I/O load is a hard problem.  The way we
implemented prefetching, the I/O load caused by the
prefetcher \emph{is} going to delay I/O requested by other processes no matter
how it is spread over the boot process.  This is a direct result of the
scheduling guarantees the kernel makes about not blocking any process for too
long.
The rest of the poor behavior can be attributed to poor
kernel I/O scheduling performance; in fact, Seelam et al.\ suggest that the
Anticipatory Scheduler (AS), the default I/O scheduler in Linux 2.6,
starves processes \cite{seelam05liuxschedulers}.
The Anticipatory Scheduler works by delaying moving the disk head for a few
milliseconds, hoping that the process that caused the head to be moved to its
current position will be rescheduled and request I/O blocks around the same
position on the disk.  This policy has a negative impact on a prefetcher
reading hundreds of files scattered all across the hard disk.

Starting with version 2.6.13, the
Linux kernel supports I/O scheduling priorities, including an \emph{idle}
class that is ideal for boot-time prefetching.  Unfortunately, I/O
scheduling priorities are only implemented for the Completely Fair Queuing (CFQ)
I/O scheduler.

An improvement would be to postpone prefetching until the boot process is done
and the log-in screen is shown.  This can be performed using the GNOME Display
Manager as described in \autoref{subsec:gnome-display-manager}.

Newer Linux kernels implement the \texttt{MADV\_REMOVE} advice for the
\syscall{madvise} system call, but only for tmpfs/shmfs\footnote{Two in-memory
file-systems.} file-systems, so it is not useful for cache eviction hinting
by applications.

\section{Summary of Recommendations}

We recommend that systems seeking to use prefetching to improve boot time
limit prefetching to the blocks required by \syscall{stat} and \syscall{getdents}
system calls, issue those requests sparingly, and call \syscall{sched\_yield} regularly.  For further boot
time speed-up, parallelizing boot tasks should be explored.
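
As a rough illustration of this recommendation, the following Python sketch
stats a list of paths in small batches and yields the processor between
batches.  The function name \texttt{stat\_mildly} and the \texttt{batch}
parameter are hypothetical, and the batch size of 8 is an arbitrary choice
for the example, not a measured value.

```python
import os


def stat_mildly(paths, batch=8):
    """lstat each path, yielding the CPU after every `batch` calls.

    Returns (statted, yields).  Hypothetical helper: the batch size is an
    assumption for illustration, not a tuned value.
    """
    if batch < 1:
        raise ValueError("batch must be at least 1")
    statted = 0
    yields = 0
    for i, path in enumerate(paths, start=1):
        try:
            os.lstat(path)  # pulls the inode block into the cache
            statted += 1
        except OSError:
            pass  # missing paths are simply skipped
        if i % batch == 0:
            os.sched_yield()  # let other runnable processes go first
            yields += 1
    return statted, yields
```

A boot-time prefetcher built along these lines would take its path list from
a previous boot's trace; how that list is obtained is outside the scope of
this sketch.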

To improve the log-in time, it is best to start prefetching when the log-in
display manager becomes idle.  This is a good time to prefetch: the system is
idle, and it can be predicted with high probability what to prefetch.  This
feature is implemented in GNOME Display Manager for example.  When possible,
the idle I/O scheduling class should be used for prefetching in this stage.

Application start-up time can be improved by modifying applications to reduce
the number of \syscall{stat} and \syscall{getdents}\footnote{Usually caused by
the \syscall{readdir} POSIX function.} system calls.  Moreover, using
\syscall{mmap} instead of \syscall{read} improves performance on its own, and
opens up more prefetching opportunities of the kind preload exploits.  Finally,
applications can use the \syscall{madvise} system call to let
the kernel know that they will need a section of a mapped file, and let the
kernel decide whether to prefetch it.
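
The combination of \syscall{mmap} and \syscall{madvise} might look as follows
in Python (version 3.8 or later on Linux is assumed, since that is where
\texttt{mmap.madvise} and \texttt{MADV\_WILLNEED} are available).  The
function name \texttt{map\_and\_hint} is hypothetical, and the sketch hints
the whole mapping rather than a section of it to keep the example short.

```python
import mmap
import os


def map_and_hint(path):
    """Map `path` read-only and advise the kernel it will be needed soon.

    Returns the mmap object; the caller is responsible for closing it.
    Raises ValueError for an empty file, which cannot be mapped.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        # length 0 maps the whole file
        mapping = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
    finally:
        os.close(fd)  # the mapping remains valid without the descriptor
    # Advisory only: the kernel decides whether and when to read ahead.
    mapping.madvise(mmap.MADV_WILLNEED)
    return mapping
```

An application would issue the hint as early as possible, for example right
after start-up, and touch the mapped data later, giving the kernel time to
bring the pages in.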

File-based prefetching integrated with the cache subsystem may be used to
further improve application start-up performance, and reorganizing the file layout
on the hard disk can be used once all other routes have been exhausted.