\chapter{Fundamentals}
\label{chap:fundamentals}

This chapter shapes the core of the thesis, by formalizing the problem and the
solution we propose.  This includes defining the terminology and notation, the
main idea, design decisions that were made, and finally the model.

This work lies in the intersection of operating systems and machine learning
areas of computer science.  For this reason, we provide a detailed definition
of terminology and mathematical concepts we use throughout the thesis.

\section{Terminology}

There are two fundamental objects that we deal with in our model:
\emph{applications} and \emph{maps}.  The following four definitions make it
clear what we exactly mean by these two terms.

We provide specific details and examples drawn the Linux kernel and GNU C
library.  Other Unix variants provide similar features.

\subsection{Processes}

We use the term \emph{process} as it is always used in the systems literature:
a program in execution.  It is completely characterized by a single current
execution point and address space.  The list of current processes in the
system as well as information about each process can be found by scanning the
\proc\ pseudo-filesystem.

Each process has exactly one file on the file-system which is the program that
this process is executing.  The file \texttt{/proc/<pid>/exe} is a symbolic
link to this file.  When the program file is unlinked from the file-system, the
string ``\texttt{ (deleted)}'' will be appended to this symlink, so that can
be detected easily.

\subsection{Applications}

An \emph{application} is a program that provides the user with tools to
accomplish a task.  On systems that we focus on in this thesis, an application
is almost always a program with a graphical user interface.  An application
may be started by the user choosing a launcher from an application menu, or by
other means.  Some examples of applications are the Firefox web browser,
the OpenOffice.org Writer word processor, or the Totem movie player.

An application is said to be running at a certain point in time, if and only
if there is at least one process which is executing this application program.
Applications typically have larger binaries in size compared to other kinds of
programs available on a modern Unix system, and they have larger working sets
when running, and longer running times too.  These are consequences of
the inherent complexity of applications as programs interacting with users in
a graphical user interface and performing complex tasks, compared to the ``do
one thing and do it well'' Unix mantra.  While there are hundreds, even
thousands of programs installed on a typical Unix system, there are hardly
more than a hundred applications installed on such systems.  Even then, a
user rarely uses more than ten or twenty different applications in her
day-to-day interaction with the computer.

Since our goal is to achieve faster application start-up time, there are
mechanisms in preload to roughly distinguish application programs from other
programs.  Preload ignores any processes that are very short-lived, or their
address space is smaller than a certain size.  This has the extra benefit of
keeping the model in a manageable size.
The details of this filtering is described in \autoref{exe-filtering}.

Since applications are the focus of this thesis, we use this term regularly,
but most of the time what we really mean is a ``program that has passed
preload's tests for being an application program''.

\subsection{Shared Objects}

By \emph{shared object}, we simply mean a file that a program uses when
running, and uses by mapping the file into its address space using the
\syscall{mmap} system call.  When several processes use a shared
object, only one copy of its contents is kept in the main memory, and mapped
into the address space of each process using it.

The most common type of shared objects are shared libraries, but the program
binary itself, fonts, locale definitions, static system-wide caches (like the
icon cache) and other static data files are some other types of shared objects
used by programs.

By their nature, most of the shared objects used by a process are mapped when the
process is starting.  By prefetching these shared objects, as we will see, we
can improve the time it takes for the application to start up.

A shared object is uniquely identified by a file-name on the file-system.

\subsection{Maps}

A map is a contiguous part of a shared object that a process maps into its
address space.  This is identified by an offset and a length; in practice, both
of them are multiples of the page-size of the system, 4kb on 32-bit and 8kb on
64-bit processors.

A process may use multiple maps of the same shared object.
The list of the maps of a process can be accessed through the file
\texttt{/proc/<pid>/maps}.  This contains a list of address ranges, access
permissions, offsets, and file-names of all maps of the process.
When the shared object file of a map is unlinked from the file-system, the
string ``\texttt{ (deleted)}'' will appear after the file-name of the map in
the maps file, so this can be detected easily.

\section{Mathematical Background}

In this section we review the mathematical concepts and notation from
probability theory that are used in the following sections.  A reader familiar
with the definition of the subsection titles in this section may skip to the
next section.
%The notation and definitions in this section are largely borrowed from
%\href{http://en.wikipedia.org/}{Wikipedia, the free encyclopedia}.

\iffalse % commented out
\subsection{Correlation Coefficient}

In probability theory and statistics, \emph{correlation coefficient}, is a
numeric measure of the strength of linear relationship between two random
variables.  In general statistical usage, correlation refers to the departure
of two variables from independence. In this broad sense there are several
coefficients, measuring the degree of correlation, adapted to the nature of
data.
\cite{wiki:correlation}

A number of different coefficients are used for different situations. The best
known is the Pearson product-moment correlation coefficient, which is obtained
by dividing the covariance of the two variables by the product of their
standard deviations.

The correlation $\rho_{X, Y}$ between two random variables $X$ and $Y$ with
expected values $\mu_X$ and $\mu_Y$ and standard deviations $\sigma_X$ and
$\sigma_Y$ is defined as:
\begin{equation}
\rho_{X,Y}
= \frac{\mbox{cov}{X,Y}}{\sigma_X\sigma_Y}
= \frac{E((X-\mu_X)(Y-\mu_Y))}{\sigma_X\sigma_Y}
= \frac{E(XY)-E(X)E(Y)}{\sqrt{E(X^2)-E^2(X)}\sqrt{E(Y^2)-E^2(Y)}}
\end{equation}

It is a corollary of the Cauchy-Schwarz inequality that the correlation cannot
exceed $1$ in absolute value.

For two binary random variables $X$ and $Y$ that take values in $\{0,1\}$,
the correlation coefficient can be written as:
\begin{equation}
\rho_{X,Y}
= \frac{P(X,Y)-P(X)P(Y)}{\sqrt{P(X)-P^2(X)}\sqrt{P(Y)-P^2(Y)}}
= \frac{P(X,Y)-P(X)P(Y)}{\sqrt{P(X)(1-P(X))}\sqrt{P(Y)(1-P(Y))}}
\end{equation}
where $P(X)=\Pr(X=1)$.

If $X$ and $Y$ are the random variables of two applications $A_X$ and $A_X$
running, then we can compute the correlation coefficient by knowing the total
time, the total time $A_X$ has been running, total time $A_Y$ has been
running, and the total time both of them have been running at the same time.
From now on, that is what we mean when we say \emph{the correlation
coefficient of two applications $A_X$ and $A_Y$}.

\fi

\subsection{Exponential Distributions}

The \emph{exponential distributions} are a class of continuous probability
distributions. They are often used to model the time between events that happen
at a constant average rate.
\cite{wiki:exponential-distribution}

The probability density function (pdf) of an exponential distribution has the
form
\begin{equation}
    f(x;\lambda) = \left\{\begin{matrix} \lambda e^{-\lambda x} &,\; x \ge 0,
    \\ 0 &,\; x < 0. \end{matrix}\right.
\end{equation}
where $\lambda > 0$ is a parameter of the distribution, often called the rate
parameter. The distribution is supported on the interval $[0,\infty)$.
%If a random variable $X$ has this distribution, we write $X ~
%\mbox{Exponential}(\lambda)$.
%
%The cumulative distribution function (cdf) is given by
%\begin{equation}
%    F(x;\lambda) = \left\{\begin{matrix} 1-e^{-\lambda x}&,\; x \ge 0, \\ 0
%    &,\; x < 0. \end{matrix}\right.
%\end{equation}

The exponential distribution is used to model Poisson processes, which are
situations in which an object initially in state A can change to state B with
constant probability per unit time $\lambda$. The time at which the state
actually changes is described by an exponential random variable with parameter
$\lambda$.  Therefore, the integral from $0$ to $T$ over $f$ is the
probability that the object has transitioned into state B by the time $T$.

In real world scenarios, the assumption of a constant rate (or probability per
unit time) is rarely satisfied. For example, the rate of incoming phone calls
differs according to the time of day. But if we focus on a time interval
during which the rate is roughly constant, such as from 2 to 4 PM during work
days, the exponential distribution can be used as a good approximate model for
the time until the next phone call arrives.  Likewise, the rate of an
application being started differs according to the time of day.  But when the
computer system is on and being used, the exponential distribution can be used
as a good approximate model for the time until the next start-up of the
application.

\subsection{Continuous-time Markov Chains}
\label{subsec:markov}

A \emph{continuous-time Markov chain} is a stochastic process $\{ X(t) : t
\geq 0 \}$ that enjoys the Markov property and takes values from amongst the
elements of a discrete set called the state space. The Markov property states
that at any times $s > t > 0$, the conditional probability distribution of the
process at time $s$ given the whole history of the process up to and including
time $t$, depends only on the state of the process at time $t$. In effect, the
state of the process at time $s$ is conditionally independent of the history
of the process before time $t$, given the state of the process at time $t$.
\cite{wiki:continuous-time-markov-chain}

Intuitively, one can define a Markov chain as follows. Let $X(t)$ be the
random variable describing the state of the process at time $t$. Now prescribe
that in some small increment of time from $t$ to $t+h$, the probability that
the process makes a transition to some state $j$, given that it started in
some state $i \neq j$ at time $t$, is given by
\begin{equation}
    \Pr(X(t+h) = j | X(t) = i) = q_{ij}h + o(h),
\end{equation}
where $o(h)$ represents a quantity that goes to zero as $h$ goes to zero.
Hence, over a sufficiently small interval of time, the probability of a
particular transition is roughly proportional to the duration of that
interval.

Continuous-time Markov chains are most easily defined by specifying the
transition rates $q_{ij}$, and these are typically given as the $ij$-th
elements of the transition rate matrix, $Q$.  The continuous-time Markov
chains that we use have $Q$-matrices that are:
\begin{itemize}
\item conservative---the $i$-th diagonal element $q_{ii}$ of $Q$ is given by
\begin{equation}
        q_{ii} = -q_{i} = -\sum_{j\neq i} q_{ij},
\end{equation}
Note that this is only a convention, to make rows of $Q$ sum to zero, hence
the name conservative.  The diagonal is never used as a rate parameter, and
we never use this property of the matrix, but include it to have a
well-defined diagonal.
\item stable---for any given state $i$, all elements $q_{ij}$ (and $q_{ii}$)
are finite.
\end{itemize}

When the $Q$-matrix is stable, the probability that no
transition happens in some time $r$ is
\begin{equation}
    \Pr(X(t+r) = i | X(s) = i, \forall r\in[t, t+r)) = e^{-q_{i}r}.
\end{equation}

Therefore, the probability distribution of the waiting time until the first
transition is an exponential distribution with rate parameter $q_i$ ($=
−q_{ii}$), and continuous-time Markov chains are thus memoryless processes.

Given that a process that started in state $i$ has experienced a transition
out of state $i$, the conditional probability that the transition is into
state $j$ is
\begin{equation}
    \frac{q_{ij}}{\sum_{k\neq i} q_{ik}}= \frac{q_{ij}}{q_i}.
\end{equation}

\subsection{Exponentially-fading Mean}

The \emph{exponentially-fading mean} of a function $f(t) : t\mapsto
\cal{R}$ at time $T$ is defined as:
\begin{equation}
\mu(f,T;\lambda) =
\frac{ \int_{-\infty}^T f(t) e^{\lambda t} dt }{\int_{-\infty}^T e^{\lambda t} dt} =
\frac{ \int_{-\infty}^T f(t) e^{\lambda t} dt }{\frac1\lambda e^{\lambda T}} =
\lambda\int_{-\infty}^T f(t) e^{-\lambda (T-t)} dt
\end{equation}
where $\lambda$ is called the decay factor, and identifies how fast the older
values are faded.

From this definition, it follows that for any $T_0 < T$
\begin{equation}
\mu(f,T;\lambda) =
e^{-\lambda(T-T_0)}
\mu(f,T_0;\lambda)
+
\lambda\int_{T_0}^T f(t) e^{-\lambda (T-t)} dt
\end{equation}
Moreover, if $f(t)$ is constant between $T_0$ and $T$,
\begin{equation}
\mu(f,T;\lambda) =
e^{-\lambda(T-T_0)}
\mu(f,T_0;\lambda)
+
(1-e^{-\lambda(T-T_0)})f(T)
\end{equation}

Now if we sample the continuous signal $f(T)$ at regular intervals of
length $\tau$, the resulting discrete signal $F_n$ will have
\begin{equation}
\mu(F,n;\lambda,\tau) =
e^{-\lambda\tau}
\mu(F,n-1;\lambda)
+
(1-e^{-\lambda\tau})F_n
\end{equation}
where $e^{-\lambda\tau}$ may be called the mixing-factor.  This last equation
is useful even if $F_n$ is any arbitrary sequence of numbers.  It allows to
maintain an on-line average of the sequence without keeping the number of
elements seen so far, and for the average to be fading.

An extension to the above case is when we want to find the fading mean of a
sequence of measurements at arbitrary points in time, as opposed to a sampling
with a fixed frequency of samples. If $F_n$ is the
sequence of measurements, and $T_n$ is the sequence of times the measurements
were performed,
and assuming that $T_n$ is ascending, the exponentially-fading mean of the
measurements at time $T_n$ is:
\begin{equation}
\label{eqn:decay}
\mu(F,n;\lambda,T) =
e^{-\lambda(T_n-T_{n-1})}
\mu(F,n-1;\lambda,T)
+
(1-e^{-\lambda(T_n-T_{n-1})})F_n
\end{equation}

\section{Main Idea}

There are two fairly isolated components in preload: the data gathering and
model training component, and the predictor.  These two are connected together
using a shared probabilistic model.  The former component trains the model
online based on data gathered by on-going monitoring of user actions, while
the latter uses the model to make predictions and perform prefetching.

The data gathering component will gather information about running
applications periodically, once each \emph{cycle} where a cycle is a tunable
parameter that defaults to twenty seconds.  The list of running applications
is produced by filtering the list of the processes running on the system, and
for each application, the list of its file-backed memory maps is fetched, and
used to update the model parameters.

The predictor component also takes action once every cycle, and uses the
trained model and the list of currently running applications.  For every
application that is not running, the predictor derives a probability that this application
is going to be started during the next cycle.  The predictor then uses these
per-application probabilities to assign probabilities to their maps, and
sorts the maps based on their probabilities, and proceeds with
prefetching the top ones into main memory.  Memory statistics and system load
are used to decide how much prefetching is performed in each cycle, to
minimize the effect of preload on the system load.

The problem can be seen as a stochastic Markov chain whose states are members
of the power-set of all applications that the user may run, and given the
current state, we are interested to know the transition probabilities to every
other state during the next cycle.  If we could build and train this Markov
chain then we would be done, but this is not feasible, given the total number of states
and the little training data we have.  To overcome this shortcoming, we make
independence assumptions and model every pair of applications separately in a
four-state continuous-time Markov chain whose states correspond to the four
combinations of each of the applications being running or not.  Each of these
Markov chains models the correlation between the state of the two applications
involved.

Every state in each of these Markov chains has a waiting time parameter that
is the average time before leaving this state. When a transition happens,
every outgoing edge from the current state has a probability assigned to it
that this edge is taken.  All these parameters are trained as an exponentially 
fading mean of their values over time, such that when a user's habits change,
the model adapts to it in a constant time with high probability.

% Example?


\section{Design Decisions}

In this section we discuss the main decisions behind the design above, as well
as some remaining details that should be resolved in order to make a complete
implementation of preload.

\subsection{Bayesian Probabilistic Model}

We use a Bayesian probabilistic model, which means, we assign a probability
(real number between zero and one including) to our (the system's) belief of
an event.  This holds equivalently true for events in the past and current
time as well as for events in the future.

As an example, for a map $M$, we \emph{may} assign a probability number
$\mbox{incore}_M$ to $M$ that is our belief of the event that map $M$ is in
core at this time.  This probability may take any value between zero and one as
long as we do not query (using the \syscall{mincore} system call for example)
the map for being in core, although we may be able to do that.

\subsection{Memoryless Model}

Another important decision made in the design process that needs justification
is the use of a memoryless model.  Memoryless in this context means that given
the trained model, at any point in time, all decisions are made only based on
the current list of applications running, no matter when they were started or
what the previous states of the system have been.  This is not necessarily
what happens in reality. For example a user may play his favorite video game
in his break times that are exactly thirty minutes long.  So as time approaches
the half hour, the probability that the game is closed increases, but a
memoryless model cannot capture that.  Not all user actions are like that
though:  a user may leave the web browser open for very long periods of time,
making a memoryless model as good as one can get.

We have chosen a memoryless Markov-based model mainly because of its
simplicity and ease of implementation.  However, this is not much of a
limitation as we train the system on-line.  We believe this is a good
compromise that keeps the model simple like a memoryless system, while still
keeping it powerful by using the entire history of the system as training
data.

\subsection{Mixture of First-Order Markov Predictors}

Markov predictors are an implementation of prediction by partial matching
(PPM). PPM estimates probabilities based on prior observation.  A PPM of order
$n$ encodes sequences of length $n$ and predicts the next element of a
sequence given the $n-1$ immediately preceding elements. The predictor chooses
the most frequently occurring transition from the static beginning with the
$n-1$ observed elements and predicts the final element of the sequence
\cite{bartels99potentials}.  For this scheme to work well, the data stream
should expose patterns of order $n$.  The larger that $n$ is, the more
conservative the predictor will be, and will fail to make predictions when no
known sequences match the initial $n-1$ elements.  With lower values of $n$,
the predictor will be more aggressive, and more likely to mispredict.  The
complexity of a PPM of order $n$ with a domain of elements of size $d$ is
$\Theta (d^n)$, as one has to keep the frequency of all $d^n$ sequences of
elements in the model (many of them may be zero), to perform predictions.
This obviously falls short if the number of the elements in the training data
stream is \emph{not} significantly larger than $d^n$.  For this reason, in predictors
used for prefetching, a PPM model of order greater than two or three is unlikely to be
effective.

When talking about Markov chains and Markov predictors out of the scope of
PPMs, the order definition is different.  A first-order Markov is one that
chooses next state based on the current state and a second-order Markov is one
that chooses next state based on two previous states, while a zero-order
Markov chooses next state independent of current or any previous states.  This
means a first-order Markov precitor is equivalent to a second-order PPM.  This
should be kept in mind to avoid confusion.

Previous work on fault-based Markov prefetching for virtual memory pages shows
that the Markov predictor of order one outperforms the Markov predictor of
order two \cite{bartels99potentials}.  Another system uses a Hidden Markov
Model (HMM) for file-system input/output access pattern classification
\cite{madhyastha97inputoutput}.  While an HMM has the potential to model a
higher-order Markov model (by encoding sequences of elements in its hidden
state), in this particular application,  they construct a composite HMM from
the HMMs trained for each unique access pattern.  Each of their basic HMMs
then have a total number of states equal to the size of the domain of
elements, and hence can be best thought of as a first-order model.

For web prefetching, a PPM predictor of order two is more common
\cite{nanopoulos01datamining,jiang98adaptive}.  That is due to the fact that
the number of the elements in the working set of a web prefetcher is typically
much smaller than the number of elements in page/block-based or file-based
prefetchers that have to deal with tens of thousands or even millions of
elements.

In the domain of our problem, elements are the applications, and the data
stream is the sequence of application start-ups.  Like in the case of web
prefetching, we have the luxury of having a fairly limited domain set (of size
a few tens of items).  However, our training data stream is fairly
low-frequency too, in the rate of zero or a few application start-ups per
minute, if not per hour.  As a result, it is not obvious whether a first-order
predictor performs better or a second-order one, but the approach we take in
fact resolves this problem by using a mixture of very small Markov chains (two
elements and four states each), that can be easily trained and used as
first-order Markov predictors.

Mixture models are a common tool in machine learning to compose complex models
out of simpler building blocks.  The beauty of this approach is that training
each of the basic models used is easy and feasible, the composition rule is
well-understood, and the final model is powerful.

For the above reasons, we have chosen the mixture of first-order Markov
predictors as our learning model.


\subsection{Using the Correlation Coefficient}

Another question we had to answer was whether the probabilistic correlation
coefficient of the random variables of two applications being run over the
time should be used to weight the effect of them on each other.  We answer no
to this question, for the reason that follows.

Consider the following scenario: There are two programs, one of them is a
\cron\ job running periodically every thirty minutes, taking a few seconds on
each run, the other program is the user's internet browser, running at
their will for long periods of time, with no recognizable pattern.  It is easy
to see that the correlation coefficient of the random variable of these two
programs running is very near to zero, because the odds of both of them
running is almost the same as the odds that the \cron\ job is running
independent of the browser.  Now lets see if these two programs can give
us any information about each other: The browser tells us that no matter if it
is running or not, the \cron\ job is likely to be started very soon (in
fifteen minutes on average).  This is useful information we did not have
otherwise.  In other words, if the correlation coefficient is insignificant,
the predictions become independent of the application predicting, and this can
be thought as a zero-order prediction.
%, and the \cron\ job also tells us the average time
%before the browser may be started.  
%So, while one program's running state
%does not affect the conclusion we make about the other program, it does
%give us some clue about when the other program may be run.
%In other words, when the correlation coefficient is about zero, it is like examining
%each program's mean time to run independently.  This is not as informative
%as what a pair of highly correlated programs can tell about each other,
%but it indeed is useful information.

The example sketched above may not be a real-world case for preload (for
example, because we ignore very short-running processes), but the argument
holds for the general case.

Another reason that forced us to make this decision was that we were unable to
fit it in our probabilistic inference with firm theoretical reasoning.


\subsection{User-space versus Kernel-space}

A common problem with the prefetching literature is that most of the systems
are implemented (by design) in the lower levels of the operating system,
mostly in the kernel space.  This design has several benefits, like tight
integration with file-system caching.  This, on the other hand, increases the
complexity of the implementation drastically, and makes deploying the system
on a normal system much harder.  Mostly because the kernel needs to be
patched, and since prefetching has not shown enough benefits to justify the
complexity of the implementation, no sophisticated prefetching system (like
those developed in academic circles) has been widely deployed by major
operating system distributors.  For this reason, we have decided to implement
preload completely as a user-space program running in the background (also
known as a daemon).

Preload gathers information about running processes and their shared objects
by scanning the \proc\ pseudo-file system and performs prefetching using a few
system calls, mostly \syscall{readahead}, \syscall{posix\_fadvise},
\syscall{posix\_madvise}, \syscall{mmap}, \syscall{madvise},
\syscall{fadvise}, and \syscall{mincore}.

The main problem with this approach is that preload cannot track file accesses
by any means other than \syscall{mmap}ing the file.  This includes
\syscall{open} and \syscall{stat} accesses.


\subsection{The User Running the Application}

Another issue we had to deal with was whether the pair of
$(\mbox{user},\mbox{application})$ should be used in the inference phase
instead of individual applications.  That generally makes sense, since
different users have different habits and sets of commonly-used applications,
but we decided to not do this, basically because this complicates the
implementation and also adds another orthogonal axis on the object space of
the model.  This is not much of a problem, since preload is targeted for
desktop systems that usually have very few number of users (one or two most of
the time).

If the per-user behavior is desired, different preload daemons can be run for
different users.  This is a compromise though, since these daemons will race for
using resources (main memory mostly), and no cross-user inference is
performed.  Cross-user inference is very powerful at login time, since the
login manager is normally run as the super-user, while after login, the
desktop is run as the user who just logged in.




\section{The Model}
\label{sec:model}

In this section we define the objects used in the model representation of the
problem.  These objects will then be used in the algorithms in the next
chapter.  For each object, the member properties are divided into two parts,
the persistent properties and the runtime properties.  The persistent
properties are updated by the data gathering and training component, and will
be saved and restored across runs of preload.  The runtime properties are used
by the predictor to keep track of its state at the current time, and so are
not kept across runs.  For each object, some of the persistent properties form
a key, in that the object is uniquely identified by values for those set of
properties.

% TODO: fixup key indicators
In the following subsections, the key properties are prefixed by an arrow.


\subsection{Map Object}

A \emph{Map object} corresponds to a single map that may be used by one or
more applications.  A Map is identified by the path of its file, a start
offset, and a length.
%Its only runtime property is the probability that the
%map is in the main memory.
The \emph{size} of a Map is its length.

\begin{struct}{Map}
	\persistent
\key	char	*path;	// full name of the file being mapped
\key	size_t	offset;	// start offset in bytes
\key	size_t	length;	// length in bytes

%	\runtime
%	double	p_incore;	// probability that map is in main memory
\end{struct}


%\subsection{ExeMap Object}


\subsection{Exe Object}

An \emph{Exe object} corresponds to an application.  An Exe is identified by
the path of its executable binary, and as its persistent data it contains the
set of maps it uses and the set of Markov chains it builds with every other
application.

The runtime property of the Exe is its running state which is a boolean
variable represented as an integer with value one if the application is
running, and zero otherwise.  The \texttt{running} member is initialized upon
construction of the object, based on information from \proc.

The \emph{size} of an Exe is the sum of the size of its Map objects.

\begin{struct}{Exe}
	\persistent
\key	char	*path;	// full name of the executable binary
	Set of Map	maps;	// the maps this application uses
	Set of Markov	markovs;	// the Markov chains with other applications

	\runtime
	int	running;	// one if running, zero otherwise
\end{struct}


\subsection{Markov Object}

A \emph{Markov object} corresponds to the four-state continuous-time Markov
chain constructed for two applications $A$ and $B$.  The states are numbered 0
to 3 and respectively mean: none of $A$ or $B$ is running, only $A$ is
running, only $B$ is running, and both are running.  A Markov object is
identified by its links to the Exes $A$ and $B$, and has as its persistent
data the (exponentially-fading mean of) transition time for each state,
timestamp of when the last transition from that state happened, and probability
that each outgoing transition edge is taken when a transition happens.

The runtime property of a Markov is its current state and the timestamp of
when it entered the current state.  Upon construction, the current state is
computed based on the \texttt{running} member of the two Exe objects
referenced, and transition time is set to the current timestamp.

\begin{struct}{Markov}
	\persistent
\key	Exe	a, b;	// the two applications involved\newline// in this Markov chain
	double	tt[4]	// mean transition time from each state
	double	tp[4][4]	// probability that transition from\newline%
                                        // one state goes to another
	int	timestamps[4]	// timestamp of last time leaving each state

	\runtime
	int	state;	// current state, 0, 1, 2, or, 3
	int	time;	// timestamp of the last transition

\end{struct}
The trasition times \texttt{tt} are the inverse of transitions rates defined
in \autoref{subsec:markov}.

The \texttt{state} of a Markov object can be computed as follows:
\begin{equation}
M.\state = M.\a.\running + 2 \times M.\b.\running
\end{equation}
and is always updated to maintain this as an invariant.


\subsection{MemStat Object}

The \emph{MemStat object} holds various statistics about total and available
memory in the system as well as disk activities.  All values are in kilobytes.

\begin{struct}{MemStat}
	int	total	// total memory
	int	free	// free memory
	int	cached	// page-cache memory
	int	pagein	// total data paged in (read from disk)
	int	pageout	// total data paged out (written to disk)
\end{struct}


\subsection{State Object}

The \emph{State object} holds all the information about the model except for
configuration parameters.  It contains the set of all applications and maps
known, and also a runtime list of running applications and memory statistics
which are populated from \proc\ when a State object is constructed.

There is a singleton instance of this object at runtime that is trained by the
data gathering component, and used by the predictor.  It has methods to read
its persistent state from a file and to dump them into a file.  This will
load/save all referenced Markov, Exe, and Map objects recursively.


\begin{struct}{State}
	\persistent
	HashTable of Exe	exes	// all the applications known
	HashTable of Map	maps	// all the maps known

	\runtime
	Set of Exe	running_exes	// all applications running
	MemStat	memstat	// memory statistics
\end{struct}


\subsection{Parameters}

There are two important parameters that are used with the model.  These are
left as configuration variables and are set by the user with other less
important configuration parameters that are described in detail in
\autoref{sec:conf-params}.  The two are:
\begin{itemize}
\item[$\tau$:] the length of each cycle in seconds,
\item[$\lambda$:] the decay factor used for exponentially-fading means.
\end{itemize}
We will use these parameters in \autoref{chap:algorithms}.  $\tau$ is used in
the inference algorithms when we predict what happens during the next cycle.
$\lambda$ is used in the training algorithm as the decay factor of
the exponentially-fading means we maintain.

\section{Summary}

In this chapter we provided background material used in our work, and
described the main idea of this thesis.  After discussing design decisions,
the model was presented.  In the next chapter we will present algorithms for
training the model; making predictions based on it; and prefetching those
predictions.
