\chapter{Fundamentals} \label{chap:fundamentals} This chapter shapes the core of the thesis, by formalizing the problem and the solution we propose. This includes defining the terminology and notation, the main idea, design decisions that were made, and finally the model. This work lies in the intersection of operating systems and machine learning areas of computer science. For this reason, we provide a detailed definition of terminology and mathematical concepts we use throughout the thesis. \section{Terminology} There are two fundamental objects that we deal with in our model: \emph{applications} and \emph{maps}. The following four definitions make it clear what we exactly mean by these two terms. We provide specific details and examples drawn the Linux kernel and GNU C library. Other Unix variants provide similar features. \subsection{Processes} We use the term \emph{process} as it is always used in the systems literature: a program in execution. It is completely characterized by a single current execution point and address space. The list of current processes in the system as well as information about each process can be found by scanning the \proc\ pseudo-filesystem. Each process has exactly one file on the file-system which is the program that this process is executing. The file \texttt{/proc//exe} is a symbolic link to this file. When the program file is unlinked from the file-system, the string ``\texttt{ (deleted)}'' will be appended to this symlink, so that can be detected easily. \subsection{Applications} An \emph{application} is a program that provides the user with tools to accomplish a task. On systems that we focus on in this thesis, an application is almost always a program with a graphical user interface. An application may be started by the user choosing a launcher from an application menu, or by other means. Some examples of applications are the Firefox web browser, the OpenOffice.org Writer word processor, or the Totem movie player. An application is said to be running at a certain point in time, if and only if there is at least one process which is executing this application program. Applications typically have larger binaries in size compared to other kinds of programs available on a modern Unix system, and they have larger working sets when running, and longer running times too. These are consequences of the inherent complexity of applications as programs interacting with users in a graphical user interface and performing complex tasks, compared to the ``do one thing and do it well'' Unix mantra. While there are hundreds, even thousands of programs installed on a typical Unix system, there are hardly more than a hundred applications installed on such systems. Even then, a user rarely uses more than ten or twenty different applications in her day-to-day interaction with the computer. Since our goal is to achieve faster application start-up time, there are mechanisms in preload to roughly distinguish application programs from other programs. Preload ignores any processes that are very short-lived, or their address space is smaller than a certain size. This has the extra benefit of keeping the model in a manageable size. The details of this filtering is described in \autoref{exe-filtering}. Since applications are the focus of this thesis, we use this term regularly, but most of the time what we really mean is a ``program that has passed preload's tests for being an application program''. \subsection{Shared Objects} By \emph{shared object}, we simply mean a file that a program uses when running, and uses by mapping the file into its address space using the \syscall{mmap} system call. When several processes use a shared object, only one copy of its contents is kept in the main memory, and mapped into the address space of each process using it. The most common type of shared objects are shared libraries, but the program binary itself, fonts, locale definitions, static system-wide caches (like the icon cache) and other static data files are some other types of shared objects used by programs. By their nature, most of the shared objects used by a process are mapped when the process is starting. By prefetching these shared objects, as we will see, we can improve the time it takes for the application to start up. A shared object is uniquely identified by a file-name on the file-system. \subsection{Maps} A map is a contiguous part of a shared object that a process maps into its address space. This is identified by an offset and a length; in practice, both of them are multiples of the page-size of the system, 4kb on 32-bit and 8kb on 64-bit processors. A process may use multiple maps of the same shared object. The list of the maps of a process can be accessed through the file \texttt{/proc//maps}. This contains a list of address ranges, access permissions, offsets, and file-names of all maps of the process. When the shared object file of a map is unlinked from the file-system, the string ``\texttt{ (deleted)}'' will appear after the file-name of the map in the maps file, so this can be detected easily. \section{Mathematical Background} In this section we review the mathematical concepts and notation from probability theory that are used in the following sections. A reader familiar with the definition of the subsection titles in this section may skip to the next section. %The notation and definitions in this section are largely borrowed from %\href{http://en.wikipedia.org/}{Wikipedia, the free encyclopedia}. \iffalse % commented out \subsection{Correlation Coefficient} In probability theory and statistics, \emph{correlation coefficient}, is a numeric measure of the strength of linear relationship between two random variables. In general statistical usage, correlation refers to the departure of two variables from independence. In this broad sense there are several coefficients, measuring the degree of correlation, adapted to the nature of data. \cite{wiki:correlation} A number of different coefficients are used for different situations. The best known is the Pearson product-moment correlation coefficient, which is obtained by dividing the covariance of the two variables by the product of their standard deviations. The correlation $\rho_{X, Y}$ between two random variables $X$ and $Y$ with expected values $\mu_X$ and $\mu_Y$ and standard deviations $\sigma_X$ and $\sigma_Y$ is defined as: \begin{equation} \rho_{X,Y} = \frac{\mbox{cov}{X,Y}}{\sigma_X\sigma_Y} = \frac{E((X-\mu_X)(Y-\mu_Y))}{\sigma_X\sigma_Y} = \frac{E(XY)-E(X)E(Y)}{\sqrt{E(X^2)-E^2(X)}\sqrt{E(Y^2)-E^2(Y)}} \end{equation} It is a corollary of the Cauchy-Schwarz inequality that the correlation cannot exceed $1$ in absolute value. For two binary random variables $X$ and $Y$ that take values in $\{0,1\}$, the correlation coefficient can be written as: \begin{equation} \rho_{X,Y} = \frac{P(X,Y)-P(X)P(Y)}{\sqrt{P(X)-P^2(X)}\sqrt{P(Y)-P^2(Y)}} = \frac{P(X,Y)-P(X)P(Y)}{\sqrt{P(X)(1-P(X))}\sqrt{P(Y)(1-P(Y))}} \end{equation} where $P(X)=\Pr(X=1)$. If $X$ and $Y$ are the random variables of two applications $A_X$ and $A_X$ running, then we can compute the correlation coefficient by knowing the total time, the total time $A_X$ has been running, total time $A_Y$ has been running, and the total time both of them have been running at the same time. From now on, that is what we mean when we say \emph{the correlation coefficient of two applications $A_X$ and $A_Y$}. \fi \subsection{Exponential Distributions} The \emph{exponential distributions} are a class of continuous probability distributions. They are often used to model the time between events that happen at a constant average rate. \cite{wiki:exponential-distribution} The probability density function (pdf) of an exponential distribution has the form \begin{equation} f(x;\lambda) = \left\{\begin{matrix} \lambda e^{-\lambda x} &,\; x \ge 0, \\ 0 &,\; x < 0. \end{matrix}\right. \end{equation} where $\lambda > 0$ is a parameter of the distribution, often called the rate parameter. The distribution is supported on the interval $[0,\infty)$. %If a random variable $X$ has this distribution, we write $X ~ %\mbox{Exponential}(\lambda)$. % %The cumulative distribution function (cdf) is given by %\begin{equation} % F(x;\lambda) = \left\{\begin{matrix} 1-e^{-\lambda x}&,\; x \ge 0, \\ 0 % &,\; x < 0. \end{matrix}\right. %\end{equation} The exponential distribution is used to model Poisson processes, which are situations in which an object initially in state A can change to state B with constant probability per unit time $\lambda$. The time at which the state actually changes is described by an exponential random variable with parameter $\lambda$. Therefore, the integral from $0$ to $T$ over $f$ is the probability that the object has transitioned into state B by the time $T$. In real world scenarios, the assumption of a constant rate (or probability per unit time) is rarely satisfied. For example, the rate of incoming phone calls differs according to the time of day. But if we focus on a time interval during which the rate is roughly constant, such as from 2 to 4 PM during work days, the exponential distribution can be used as a good approximate model for the time until the next phone call arrives. Likewise, the rate of an application being started differs according to the time of day. But when the computer system is on and being used, the exponential distribution can be used as a good approximate model for the time until the next start-up of the application. \subsection{Continuous-time Markov Chains} \label{subsec:markov} A \emph{continuous-time Markov chain} is a stochastic process $\{ X(t) : t \geq 0 \}$ that enjoys the Markov property and takes values from amongst the elements of a discrete set called the state space. The Markov property states that at any times $s > t > 0$, the conditional probability distribution of the process at time $s$ given the whole history of the process up to and including time $t$, depends only on the state of the process at time $t$. In effect, the state of the process at time $s$ is conditionally independent of the history of the process before time $t$, given the state of the process at time $t$. \cite{wiki:continuous-time-markov-chain} Intuitively, one can define a Markov chain as follows. Let $X(t)$ be the random variable describing the state of the process at time $t$. Now prescribe that in some small increment of time from $t$ to $t+h$, the probability that the process makes a transition to some state $j$, given that it started in some state $i \neq j$ at time $t$, is given by \begin{equation} \Pr(X(t+h) = j | X(t) = i) = q_{ij}h + o(h), \end{equation} where $o(h)$ represents a quantity that goes to zero as $h$ goes to zero. Hence, over a sufficiently small interval of time, the probability of a particular transition is roughly proportional to the duration of that interval. Continuous-time Markov chains are most easily defined by specifying the transition rates $q_{ij}$, and these are typically given as the $ij$-th elements of the transition rate matrix, $Q$. The continuous-time Markov chains that we use have $Q$-matrices that are: \begin{itemize} \item conservative---the $i$-th diagonal element $q_{ii}$ of $Q$ is given by \begin{equation} q_{ii} = -q_{i} = -\sum_{j\neq i} q_{ij}, \end{equation} Note that this is only a convention, to make rows of $Q$ sum to zero, hence the name conservative. The diagonal is never used as a rate parameter, and we never use this property of the matrix, but include it to have a well-defined diagonal. \item stable---for any given state $i$, all elements $q_{ij}$ (and $q_{ii}$) are finite. \end{itemize} When the $Q$-matrix is stable, the probability that no transition happens in some time $r$ is \begin{equation} \Pr(X(t+r) = i | X(s) = i, \forall r\in[t, t+r)) = e^{-q_{i}r}. \end{equation} Therefore, the probability distribution of the waiting time until the first transition is an exponential distribution with rate parameter $q_i$ ($= −q_{ii}$), and continuous-time Markov chains are thus memoryless processes. Given that a process that started in state $i$ has experienced a transition out of state $i$, the conditional probability that the transition is into state $j$ is \begin{equation} \frac{q_{ij}}{\sum_{k\neq i} q_{ik}}= \frac{q_{ij}}{q_i}. \end{equation} \subsection{Exponentially-fading Mean} The \emph{exponentially-fading mean} of a function $f(t) : t\mapsto \cal{R}$ at time $T$ is defined as: \begin{equation} \mu(f,T;\lambda) = \frac{ \int_{-\infty}^T f(t) e^{\lambda t} dt }{\int_{-\infty}^T e^{\lambda t} dt} = \frac{ \int_{-\infty}^T f(t) e^{\lambda t} dt }{\frac1\lambda e^{\lambda T}} = \lambda\int_{-\infty}^T f(t) e^{-\lambda (T-t)} dt \end{equation} where $\lambda$ is called the decay factor, and identifies how fast the older values are faded. From this definition, it follows that for any $T_0 < T$ \begin{equation} \mu(f,T;\lambda) = e^{-\lambda(T-T_0)} \mu(f,T_0;\lambda) + \lambda\int_{T_0}^T f(t) e^{-\lambda (T-t)} dt \end{equation} Moreover, if $f(t)$ is constant between $T_0$ and $T$, \begin{equation} \mu(f,T;\lambda) = e^{-\lambda(T-T_0)} \mu(f,T_0;\lambda) + (1-e^{-\lambda(T-T_0)})f(T) \end{equation} Now if we sample the continuous signal $f(T)$ at regular intervals of length $\tau$, the resulting discrete signal $F_n$ will have \begin{equation} \mu(F,n;\lambda,\tau) = e^{-\lambda\tau} \mu(F,n-1;\lambda) + (1-e^{-\lambda\tau})F_n \end{equation} where $e^{-\lambda\tau}$ may be called the mixing-factor. This last equation is useful even if $F_n$ is any arbitrary sequence of numbers. It allows to maintain an on-line average of the sequence without keeping the number of elements seen so far, and for the average to be fading. An extension to the above case is when we want to find the fading mean of a sequence of measurements at arbitrary points in time, as opposed to a sampling with a fixed frequency of samples. If $F_n$ is the sequence of measurements, and $T_n$ is the sequence of times the measurements were performed, and assuming that $T_n$ is ascending, the exponentially-fading mean of the measurements at time $T_n$ is: \begin{equation} \label{eqn:decay} \mu(F,n;\lambda,T) = e^{-\lambda(T_n-T_{n-1})} \mu(F,n-1;\lambda,T) + (1-e^{-\lambda(T_n-T_{n-1})})F_n \end{equation} \section{Main Idea} There are two fairly isolated components in preload: the data gathering and model training component, and the predictor. These two are connected together using a shared probabilistic model. The former component trains the model online based on data gathered by on-going monitoring of user actions, while the latter uses the model to make predictions and perform prefetching. The data gathering component will gather information about running applications periodically, once each \emph{cycle} where a cycle is a tunable parameter that defaults to twenty seconds. The list of running applications is produced by filtering the list of the processes running on the system, and for each application, the list of its file-backed memory maps is fetched, and used to update the model parameters. The predictor component also takes action once every cycle, and uses the trained model and the list of currently running applications. For every application that is not running, the predictor derives a probability that this application is going to be started during the next cycle. The predictor then uses these per-application probabilities to assign probabilities to their maps, and sorts the maps based on their probabilities, and proceeds with prefetching the top ones into main memory. Memory statistics and system load are used to decide how much prefetching is performed in each cycle, to minimize the effect of preload on the system load. The problem can be seen as a stochastic Markov chain whose states are members of the power-set of all applications that the user may run, and given the current state, we are interested to know the transition probabilities to every other state during the next cycle. If we could build and train this Markov chain then we would be done, but this is not feasible, given the total number of states and the little training data we have. To overcome this shortcoming, we make independence assumptions and model every pair of applications separately in a four-state continuous-time Markov chain whose states correspond to the four combinations of each of the applications being running or not. Each of these Markov chains models the correlation between the state of the two applications involved. Every state in each of these Markov chains has a waiting time parameter that is the average time before leaving this state. When a transition happens, every outgoing edge from the current state has a probability assigned to it that this edge is taken. All these parameters are trained as an exponentially fading mean of their values over time, such that when a user's habits change, the model adapts to it in a constant time with high probability. % Example? \section{Design Decisions} In this section we discuss the main decisions behind the design above, as well as some remaining details that should be resolved in order to make a complete implementation of preload. \subsection{Bayesian Probabilistic Model} We use a Bayesian probabilistic model, which means, we assign a probability (real number between zero and one including) to our (the system's) belief of an event. This holds equivalently true for events in the past and current time as well as for events in the future. As an example, for a map $M$, we \emph{may} assign a probability number $\mbox{incore}_M$ to $M$ that is our belief of the event that map $M$ is in core at this time. This probability may take any value between zero and one as long as we do not query (using the \syscall{mincore} system call for example) the map for being in core, although we may be able to do that. \subsection{Memoryless Model} Another important decision made in the design process that needs justification is the use of a memoryless model. Memoryless in this context means that given the trained model, at any point in time, all decisions are made only based on the current list of applications running, no matter when they were started or what the previous states of the system have been. This is not necessarily what happens in reality. For example a user may play his favorite video game in his break times that are exactly thirty minutes long. So as time approaches the half hour, the probability that the game is closed increases, but a memoryless model cannot capture that. Not all user actions are like that though: a user may leave the web browser open for very long periods of time, making a memoryless model as good as one can get. We have chosen a memoryless Markov-based model mainly because of its simplicity and ease of implementation. However, this is not much of a limitation as we train the system on-line. We believe this is a good compromise that keeps the model simple like a memoryless system, while still keeping it powerful by using the entire history of the system as training data. \subsection{Mixture of First-Order Markov Predictors} Markov predictors are an implementation of prediction by partial matching (PPM). PPM estimates probabilities based on prior observation. A PPM of order $n$ encodes sequences of length $n$ and predicts the next element of a sequence given the $n-1$ immediately preceding elements. The predictor chooses the most frequently occurring transition from the static beginning with the $n-1$ observed elements and predicts the final element of the sequence \cite{bartels99potentials}. For this scheme to work well, the data stream should expose patterns of order $n$. The larger that $n$ is, the more conservative the predictor will be, and will fail to make predictions when no known sequences match the initial $n-1$ elements. With lower values of $n$, the predictor will be more aggressive, and more likely to mispredict. The complexity of a PPM of order $n$ with a domain of elements of size $d$ is $\Theta (d^n)$, as one has to keep the frequency of all $d^n$ sequences of elements in the model (many of them may be zero), to perform predictions. This obviously falls short if the number of the elements in the training data stream is \emph{not} significantly larger than $d^n$. For this reason, in predictors used for prefetching, a PPM model of order greater than two or three is unlikely to be effective. When talking about Markov chains and Markov predictors out of the scope of PPMs, the order definition is different. A first-order Markov is one that chooses next state based on the current state and a second-order Markov is one that chooses next state based on two previous states, while a zero-order Markov chooses next state independent of current or any previous states. This means a first-order Markov precitor is equivalent to a second-order PPM. This should be kept in mind to avoid confusion. Previous work on fault-based Markov prefetching for virtual memory pages shows that the Markov predictor of order one outperforms the Markov predictor of order two \cite{bartels99potentials}. Another system uses a Hidden Markov Model (HMM) for file-system input/output access pattern classification \cite{madhyastha97inputoutput}. While an HMM has the potential to model a higher-order Markov model (by encoding sequences of elements in its hidden state), in this particular application, they construct a composite HMM from the HMMs trained for each unique access pattern. Each of their basic HMMs then have a total number of states equal to the size of the domain of elements, and hence can be best thought of as a first-order model. For web prefetching, a PPM predictor of order two is more common \cite{nanopoulos01datamining,jiang98adaptive}. That is due to the fact that the number of the elements in the working set of a web prefetcher is typically much smaller than the number of elements in page/block-based or file-based prefetchers that have to deal with tens of thousands or even millions of elements. In the domain of our problem, elements are the applications, and the data stream is the sequence of application start-ups. Like in the case of web prefetching, we have the luxury of having a fairly limited domain set (of size a few tens of items). However, our training data stream is fairly low-frequency too, in the rate of zero or a few application start-ups per minute, if not per hour. As a result, it is not obvious whether a first-order predictor performs better or a second-order one, but the approach we take in fact resolves this problem by using a mixture of very small Markov chains (two elements and four states each), that can be easily trained and used as first-order Markov predictors. Mixture models are a common tool in machine learning to compose complex models out of simpler building blocks. The beauty of this approach is that training each of the basic models used is easy and feasible, the composition rule is well-understood, and the final model is powerful. For the above reasons, we have chosen the mixture of first-order Markov predictors as our learning model. \subsection{Using the Correlation Coefficient} Another question we had to answer was whether the probabilistic correlation coefficient of the random variables of two applications being run over the time should be used to weight the effect of them on each other. We answer no to this question, for the reason that follows. Consider the following scenario: There are two programs, one of them is a \cron\ job running periodically every thirty minutes, taking a few seconds on each run, the other program is the user's internet browser, running at their will for long periods of time, with no recognizable pattern. It is easy to see that the correlation coefficient of the random variable of these two programs running is very near to zero, because the odds of both of them running is almost the same as the odds that the \cron\ job is running independent of the browser. Now lets see if these two programs can give us any information about each other: The browser tells us that no matter if it is running or not, the \cron\ job is likely to be started very soon (in fifteen minutes on average). This is useful information we did not have otherwise. In other words, if the correlation coefficient is insignificant, the predictions become independent of the application predicting, and this can be thought as a zero-order prediction. %, and the \cron\ job also tells us the average time %before the browser may be started. %So, while one program's running state %does not affect the conclusion we make about the other program, it does %give us some clue about when the other program may be run. %In other words, when the correlation coefficient is about zero, it is like examining %each program's mean time to run independently. This is not as informative %as what a pair of highly correlated programs can tell about each other, %but it indeed is useful information. The example sketched above may not be a real-world case for preload (for example, because we ignore very short-running processes), but the argument holds for the general case. Another reason that forced us to make this decision was that we were unable to fit it in our probabilistic inference with firm theoretical reasoning. \subsection{User-space versus Kernel-space} A common problem with the prefetching literature is that most of the systems are implemented (by design) in the lower levels of the operating system, mostly in the kernel space. This design has several benefits, like tight integration with file-system caching. This, on the other hand, increases the complexity of the implementation drastically, and makes deploying the system on a normal system much harder. Mostly because the kernel needs to be patched, and since prefetching has not shown enough benefits to justify the complexity of the implementation, no sophisticated prefetching system (like those developed in academic circles) has been widely deployed by major operating system distributors. For this reason, we have decided to implement preload completely as a user-space program running in the background (also known as a daemon). Preload gathers information about running processes and their shared objects by scanning the \proc\ pseudo-file system and performs prefetching using a few system calls, mostly \syscall{readahead}, \syscall{posix\_fadvise}, \syscall{posix\_madvise}, \syscall{mmap}, \syscall{madvise}, \syscall{fadvise}, and \syscall{mincore}. The main problem with this approach is that preload cannot track file accesses by any means other than \syscall{mmap}ing the file. This includes \syscall{open} and \syscall{stat} accesses. \subsection{The User Running the Application} Another issue we had to deal with was whether the pair of $(\mbox{user},\mbox{application})$ should be used in the inference phase instead of individual applications. That generally makes sense, since different users have different habits and sets of commonly-used applications, but we decided to not do this, basically because this complicates the implementation and also adds another orthogonal axis on the object space of the model. This is not much of a problem, since preload is targeted for desktop systems that usually have very few number of users (one or two most of the time). If the per-user behavior is desired, different preload daemons can be run for different users. This is a compromise though, since these daemons will race for using resources (main memory mostly), and no cross-user inference is performed. Cross-user inference is very powerful at login time, since the login manager is normally run as the super-user, while after login, the desktop is run as the user who just logged in. \section{The Model} \label{sec:model} In this section we define the objects used in the model representation of the problem. These objects will then be used in the algorithms in the next chapter. For each object, the member properties are divided into two parts, the persistent properties and the runtime properties. The persistent properties are updated by the data gathering and training component, and will be saved and restored across runs of preload. The runtime properties are used by the predictor to keep track of its state at the current time, and so are not kept across runs. For each object, some of the persistent properties form a key, in that the object is uniquely identified by values for those set of properties. % TODO: fixup key indicators In the following subsections, the key properties are prefixed by an arrow. \subsection{Map Object} A \emph{Map object} corresponds to a single map that may be used by one or more applications. A Map is identified by the path of its file, a start offset, and a length. %Its only runtime property is the probability that the %map is in the main memory. The \emph{size} of a Map is its length. \begin{struct}{Map} \persistent \key char *path; // full name of the file being mapped \key size_t offset; // start offset in bytes \key size_t length; // length in bytes % \runtime % double p_incore; // probability that map is in main memory \end{struct} %\subsection{ExeMap Object} \subsection{Exe Object} An \emph{Exe object} corresponds to an application. An Exe is identified by the path of its executable binary, and as its persistent data it contains the set of maps it uses and the set of Markov chains it builds with every other application. The runtime property of the Exe is its running state which is a boolean variable represented as an integer with value one if the application is running, and zero otherwise. The \texttt{running} member is initialized upon construction of the object, based on information from \proc. The \emph{size} of an Exe is the sum of the size of its Map objects. \begin{struct}{Exe} \persistent \key char *path; // full name of the executable binary Set of Map maps; // the maps this application uses Set of Markov markovs; // the Markov chains with other applications \runtime int running; // one if running, zero otherwise \end{struct} \subsection{Markov Object} A \emph{Markov object} corresponds to the four-state continuous-time Markov chain constructed for two applications $A$ and $B$. The states are numbered 0 to 3 and respectively mean: none of $A$ or $B$ is running, only $A$ is running, only $B$ is running, and both are running. A Markov object is identified by its links to the Exes $A$ and $B$, and has as its persistent data the (exponentially-fading mean of) transition time for each state, timestamp of when the last transition from that state happened, and probability that each outgoing transition edge is taken when a transition happens. The runtime property of a Markov is its current state and the timestamp of when it entered the current state. Upon construction, the current state is computed based on the \texttt{running} member of the two Exe objects referenced, and transition time is set to the current timestamp. \begin{struct}{Markov} \persistent \key Exe a, b; // the two applications involved\newline// in this Markov chain double tt[4] // mean transition time from each state double tp[4][4] // probability that transition from\newline% // one state goes to another int timestamps[4] // timestamp of last time leaving each state \runtime int state; // current state, 0, 1, 2, or, 3 int time; // timestamp of the last transition \end{struct} The trasition times \texttt{tt} are the inverse of transitions rates defined in \autoref{subsec:markov}. The \texttt{state} of a Markov object can be computed as follows: \begin{equation} M.\state = M.\a.\running + 2 \times M.\b.\running \end{equation} and is always updated to maintain this as an invariant. \subsection{MemStat Object} The \emph{MemStat object} holds various statistics about total and available memory in the system as well as disk activities. All values are in kilobytes. \begin{struct}{MemStat} int total // total memory int free // free memory int cached // page-cache memory int pagein // total data paged in (read from disk) int pageout // total data paged out (written to disk) \end{struct} \subsection{State Object} The \emph{State object} holds all the information about the model except for configuration parameters. It contains the set of all applications and maps known, and also a runtime list of running applications and memory statistics which are populated from \proc\ when a State object is constructed. There is a singleton instance of this object at runtime that is trained by the data gathering component, and used by the predictor. It has methods to read its persistent state from a file and to dump them into a file. This will load/save all referenced Markov, Exe, and Map objects recursively. \begin{struct}{State} \persistent HashTable of Exe exes // all the applications known HashTable of Map maps // all the maps known \runtime Set of Exe running_exes // all applications running MemStat memstat // memory statistics \end{struct} \subsection{Parameters} There are two important parameters that are used with the model. These are left as configuration variables and are set by the user with other less important configuration parameters that are described in detail in \autoref{sec:conf-params}. The two are: \begin{itemize} \item[$\tau$:] the length of each cycle in seconds, \item[$\lambda$:] the decay factor used for exponentially-fading means. \end{itemize} We will use these parameters in \autoref{chap:algorithms}. $\tau$ is used in the inference algorithms when we predict what happens during the next cycle. $\lambda$ is used in the training algorithm as the decay factor of the exponentially-fading means we maintain. \section{Summary} In this chapter we provided background material used in our work, and described the main idea of this thesis. After discussing design decisions, the model was presented. In the next chapter we will present algorithms for training the model; making predictions based on it; and prefetching those predictions.