\chapter{Experimental Evaluation}
\label{chap:experimental}

In this chapter we present our experimental evaluation results.  Evaluating
preload's performance through experiments is inherently hard compared to
block/file-based or web prefetching because of the much higher variance of
user actions it depends on.  For that reason, we break our experiments into
two parts: to measure how much prefetching improves application start-up time,
and how precise preload's predictions are.
%For the latter we use a trace-based simulation, replaying real-world traces
%through preload and measure prediction hit ratio.


\section{Experimental Setup}

Many prefetching experiments use trace-based simulation to measure
performance.  This has the benefit of comparing before/after numbers of the
exactly same run, reducing noise caused by external factors, and being
comparable to results from other systems when performed on standard trace
sets.  This approach however has its own limitations.  Namely, that only hit
rate is measured this way, not actual performance improvements on wall-clock
operation time \cite{orourke01parallelism}.  Moreover, most of the trace-based
simulations assume an infinite I/O bandwidth.

There exist commonly-used traces for measuring file-system performance
(e.g.~Auspex, Sprite, Andrew \cite{orourke01parallelism}), however, most of
them are more than a decade old and so not quite useful today (in fact many of
them are not available on the Internet anymore), and more importantly, they do
not have enough information traced to be useful for measuring preload's
performance.  They do not have any requesting-process information tagged with
file accesses.  Moreover, trace-based simulation of preload is harder than
similar systems because simulating the behavior of the main memory that
preload uses for caching is not easy, given all other processes that are using
it at the same time (all memory allocations come from the same pool that is
the main memory).
%We use traces recorded specifically for our experiments.


\subsection{Methodology}

We perform two sets of experiments: start-up time measurement
and hit ratio measurement.

%trace-based simulations.

For wall-clock time experiments, we run certain applications multiple times
(five trials in average)
with and without preload running, and measure their start-up time, ie.~the time
from launching the application until the application window is fully exposed.
The start-up time is measured by modifying the source code of the
program to print out the time of day once as the first operation in
\texttt{main()}, and another time in an idle callback called from the main
loop of the application.
We then use the average time of the multiple runs as the value reported.
For cold-cache experiments, we drop all pages cached from the page-cache.  
For warm-cache, we start the application, close it, and start again.  For
testing under preload, we run preload, clear the cache and wait until preload
prefetches the application in question, and run the application.

For hit ratio computation we run a modified version of preload over the period
of two weeks, with the user(s) using the system normally.  Whenever an
application is started, the modified preload checks which of the maps that the
application requires were prefetched during the previous cycle and counts
those as hits; the rest are counted as misses.  These numbers are used to
compute a total hit ratio for the particular scenario.  This is very
conservative as a map may already be in memory but not prefetched by preload.
In that case it will be counted as a miss.

We test two scenarios: a single-user scenario with a single user using the
system for her day-to-day computer uses (email, web, document processing,
instant messaging, and games), and a two-user scenario with two users using
the system, one of them using the GNOME desktop, and the other the KDE desktop
environment.  They both use the Firefox browser but use mostly different
applications for the rest of their needs.

Moreover, our modified version of preload also implements a na\"ive prediction
algorithm that assigns to each application not-running a starting probability
relative to the total number of times it has been started.  We compute hit
ratio for the na\"ive algorithm too.


%For trace-based simulation, we record all data that preload normally scans
%from the system, processes running and their maps, over the period of two
%weeks, with the user(s) using the system normally.  We record two scenarios: a
%single-user scenario with a single user using the system for her day-to-day
%computer uses (email, web, document processing, instant messaging, and games),
%and a two-user scenario with two users using the system, one of them using the
%GNOME desktop, and the other the KDE desktop environment.  They both use the
%Firefox browser but use mostly different applications for the rest of their
%needs.
%
%After having the traces we replay them through preload to make predictions of
%maps to be prefetched in each cycle.  Whenever an application is started (in
%the trace), we check which of the maps it requires are prefetched during the
%previous cycle and count those as hits, and the rest as misses.  We use these
%values to compute a total hit ratio for preload.


\subsection{Operating Environment}

We have performed all of the experiments on a system with an Intel
Pentium M 1.7\,GHz processor with 2\,MB of CPU cache, 512\,MB of main memory,
and a 4500RPM 60GB hard-disk drive.  The operating system used is a stock
Fedora Core 5 distribution, with Linux kernel version 2.6.17-1.2139\_FC5, and
the default I/O scheduler (AS).


\subsection{Limitations of Experiment}

Our experiments fail to measure effect of preload on the I/O subsystem.  In
particular, we do not measure how preload slows down I/O-intensive running
processes.  Moreover, we do not measure preload's hit ratio compared to that
of the bare kernel page cache.  Instead, we compare it with our na\"ive
prefetching algorithm and show significant improvements.  This comparison shows a lower
bound on the improvements of preload over not prefetching at all.  The
experiment also assumes that prefetched files are not evicted from the cache
in the next cycle (that is 20 seconds by default).  Given the total memory
that preload uses for prefetching, and given the LRU cache replacement
algorithm, this assumption is fairly realistic.

As we discuss in \autoref{chap:discussion}, measuring the true effects of
preload on the overall behavior of the system is a very hard task.  We also
discuss in the same chapter other ways to minimize the negative effects of
preload that are not implemented in the version we test.


\section{Measurements}

The following two subsections present the results of our measurements.

\subsection{Startup-Time}

\autoref{tab:startup-time} shows start-up time for several applications from a
cold cache, warm cache, and with preload prefetching them.  Cold cache times
are achieved by running the application after clearing the page cache by the
command ``\texttt{echo 1 > /proc/sys/vm/drop\_caches}'' which is supported in
Linux 2.6.16 and newer kernels, designed specifically for simulating a cold
cache.  Warm cache are achieved by running the application, exiting, and
running again.  Measurements under preload are performed by running preload,
dropping caches, waiting for preload to pass the next prefetching cycle, and
run the application.

\begin{table}[t]
\centering
\begin{tabulary}{\textwidth}{|L||C|C||C|C||C|C|}
\hline
Application		& cold & warm & preload & gain & \# Maps & size	\\
\hline
OpenOffice.org Writer	& 15s	     & 2s	  & 7s	    & 53\%	  & 323	    & 90\,MB	\\
Firefox Web Browser	& 11s	     & 2s	  & 5s	    & 55\% 	  & 288	    & 38\,MB	\\
Evolution Mailer	& 9s 	     & 1s	  & 4s	    & 55\% 	  & 308	    & 85\,MB	\\
Gedit Text Editor	& 6s	     & 0.1s	  & 4s	    & 33\% 	  & 216	    & 52\,MB	\\
Gnome Terminal		& 4s	     & 0.4s	  & 3s	    & 25\% 	  & 184	    & 27\,MB	\\
\hline
\end{tabulary}
\caption{Application start-up time with cold and warm caches, and with preload}
\label{tab:startup-time}
\end{table}


\autoref{tab:boot-login-time} shows system boot time and login time with and
without preload running.  Boot time is from the moment the computer is
turned on until the login page is shown.  Login time is from the moment login
information is entered up to when the entire desktop is rendered completely.
As can be seen prefetching during the boot process slows it down, but improves
login-time.  Since a system once booted is used to login at least once, and
possibly many more times, reducing the login time while keeping the total
boot+login time constant is a net gain.  We discuss the effect of prefetching
on boot-time in \autoref{sec:improvements}.

\begin{table}[t]
\centering
\begin{tabulary}{\textwidth}{|L||C|C|C||C|C|}
\hline
		& without preload	&  with preload		\\
\hline
Boot time	&	95s		&	103s		\\
Login time	&	30s		&	23s		\\
\hline
Total time	&	125s		&	126s		\\
\hline
\end{tabulary}
\caption{Boot and login times with and without preload}
\label{tab:boot-login-time}
\end{table}


\subsection{Hit Rate}

We modified preload such that whenever an application is started, it checks
which of the maps the application requires are prefetched during the previous
cycle and counts those as hits, and the rest as misses.
It also implements a na\"ive prediction algorithm that assigns to each
application not-running a starting probability relative to the total number of
times it has been started.  This modified prediction algorithm is not used for
prefetching and is only used to  compute hit ratio for the na\"ive algorithm.

We tested two scenarios: a single-user scenario with a single user using the
system for her day-to-day computer uses (email, web, document processing,
instant messaging, and games), and a two-user scenario with two users using
the system, one of them using the GNOME desktop, and the other the KDE desktop
environment.  They both use the Firefox browser but use mostly different
applications for the rest of their needs.

The computed hit ratios are presented in \autoref{tab:hit-rate}.

\begin{table}[t]
\centering
\begin{tabulary}{\textwidth}{|L||C|C||C|}
\hline
Scenario	&	na\"ive	&	preload	&   improvement	\\
\hline
Single-user	&	93\%	&	93\%	&	0\%	\\
Two-user	&	63\%	&	91\%	&	44\%	\\
\hline
\end{tabulary}
\caption{Hit rate for preload and the na\"ive algorithm for two scenarios}
\label{tab:hit-rate}
\end{table}


\section{Performance Analysis}

While an exact analysis of preload's performance is not possible, we have
evaluated two measures of performance: wall-clock application start-up time
and hit rate improvement for predictions preload makes over the na\"ive
algorithm.  The following subsections shortly analyze the numbers we obtained.
The rest of the discussion is presented in \autoref{chap:discussion}.


\subsection{Startup-Time}
\label{subsec:startup-time-analysis}

Preload improved application start-up time by 50\% for larger applications,
compared to a cold-cache start.  However, for the very same applications,
starting from a warm cache is at least twice as fast as preload can achieve.
The differences comes from the fact that preload only tracks and prefetches
those files that applications use by mapping into their process address space,
and access to them does not involved any system calls.
As a result, files that the application reads using the \syscall{read} system
call are not prefetched.  The more the application uses \syscall{mmap} instead
of \syscall{read}, the better preload performs on it.

To verify this, we look into Gnome Text Editor and Gnome Terminal applications
in more detail.  We traced all system calls they make during the start-up for
cold cache, warm cache, and under preload.  The trace also contains the time
spent in each system call invocation.  This was achieved using the
``\texttt{strace -T}'' command.  \autoref{tab:syscall-times} shows
the total time spent in system calls for these three cases.

\begin{table}[t]
\centering
\begin{tabulary}{\textwidth}{|L||C|C||C|C|}
\hline
Application		& cold	& warm	& preload & preload minus warm	\\
\hline
Gnome Text Editor	& 3.3s	& 0.82s	& 3.2s	& 2.4s		\\
Gnome Terminal		& 1.7s	& 0.07s	& 1.4s	& 1.3s		\\
\hline
\end{tabulary}
\caption
[Time spent in system calls with cold and warm caches, and under preload]
{Time spent in system calls with cold and warm caches, and under
preload\newline\footnotesize
The last column shows the difference between the third and fourth columns;
that is, the extra time spent in system calls under preload that is not with
a warm cache.
\hfill}
\label{tab:syscall-times}
\end{table}

For both Gnome Text Editor and Gnome Terminal, time spent in system calls is
almost the same for cold cache and under preload
(\autoref{tab:syscall-times}).  That is expected as preload does not prefetch
anything that directly affects time spent in system calls\footnote{The
exception is when a system call reads the same part of a file that has also
been mapped and prefetched by preload.}.  The warm cache case however spends
significantly less time in system calls.  The difference between system call
time with a warm cache and under preload (\autoref{tab:syscall-times} last
column) accounts for more than half of the start-up time difference under a
warm cache and under preload (\autoref{tab:startup-time}).

We can confirm that the time difference is indeed accumulated disk access time
by checking out which system calls are taking long, and how long.  For this
purpose we filter all system calls taking longer than one millisecond when
Gnome Terminal is starting under a warm cache, and under preload:  Under a
warm cache there are only five system calls (out of about 4900) lasting more
than a millisecond, for a total of 10 milliseconds.  Under preload, however, there
are 110, half of them taking more than 10 milliseconds each.  These are very
obviously stalled system calls hit by the disk access time.  Of the system calls
taking more than 10 milliseconds, 80\% are \syscall{read}, and the rest are
\syscall{getdents}.  The \syscall{read} ones are obviously those reading from
a file, while \syscall{getdents} ones are for reading directories.


\subsection{Hit Rate}

Our hit rate measurement is flawed because we do not measure hit rate with no
prefetching.  However, we show that by letting preload use 30\% of the page
cache it can guarantee a greater than 90\% hit rate for all the files that
applications map (which from \autoref{subsec:startup-time-analysis}, we know are responsible for about
half of the I/O stall on larger applications). While preload produces similar
results as the na\"ive algorithm for a single user, we have shown significant
improvement for the case of two users using different sets of applications.
The reason that the single-user scenario yields the exact same results as the
na\"ive algorithm is that all the maps of all the interesting applications
actually fit into preload's share of the cache and so they are all prefetched.
However, in the two-user scenario not all the applications that the users use
fit in that range, and so preload's application-tracking nature outperforms the
na\"ive approach of prefetching regardless of what applications are currently
running.
