%
% Copyright (C) 2007 Alan D. Brunelle <Alan.Brunelle@hp.com>
%
%  This program is free software; you can redistribute it and/or modify
%  it under the terms of the GNU General Public License as published by
%  the Free Software Foundation; either version 2 of the License, or
%  (at your option) any later version.
%
%  This program is distributed in the hope that it will be useful,
%  but WITHOUT ANY WARRANTY; without even the implied warranty of
%  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
%  GNU General Public License for more details.
%
%  You should have received a copy of the GNU General Public License
%  along with this program; if not, write to the Free Software
%  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
%
%  vi :set textwidth=75
%
\documentclass{article}
\usepackage{multirow,graphicx,placeins}

\begin{document}
%---------------------
\title{\texttt{btrecord} and \texttt{btreplay} User Guide}
\author{Alan D. Brunelle (Alan.Brunelle@hp.com)}
\date{\today}
\maketitle
\begin{abstract}
\input{abstract.tex}
\end{abstract}
\thispagestyle{empty}\newpage
%---------------------
\tableofcontents\thispagestyle{empty}\newpage
%---------------------
\section{Introduction}
\input{abstract.tex}

\bigskip
This document presents a command line overview of
\texttt{btrecord} and \texttt{btreplay}, and shows some common
examples of their use in everyday work here at OSLO's Scalability and
Performance Group.

\subsection*{Build Note}

To build these tools, place the source directory next to a valid
\texttt{blktrace}\footnote{\texttt{git://git.kernel.dk/blktrace.git}}
directory, because the \texttt{Makefile} refers to \texttt{../blktrace}.


%---------------------
\newpage\section{\texttt{btrecord} and \texttt{btreplay} Operating Model}

The \texttt{blktrace} utility provides the ability to collect detailed
traces from the kernel for each IO processed by the block IO layer. The
traces provide a complete timeline for each IO processed, including
detailed information concerning when an IO was first received by the block
IO layer -- indicating the device, CPU number, time stamp, IO direction,
sector number and IO size (number of sectors). Using this information,
one is able to \emph{replay} the IO again on the same machine or on an
entirely different setup.

\subsection{Basic Workflow}
The basic operating workflow to replay IOs would be something like:

\begin{enumerate}
  \item Run \texttt{blktrace} to collect traces. Here you specify the
  device or devices that you wish to trace and later replay IOs upon. Note:
  the only traces you are interested in are \emph{QUEUE} requests --
  thus, to save system resources (including storage for traces), one could
  specify the \texttt{-a queue} command line option to \texttt{blktrace}.

  \item While \texttt{blktrace} is running, you run the workload that you
  are interested in.

  \item When the workload has completed, you stop the \texttt{blktrace}
  utility (thus saving all traces over the complete workload).

  \item You extract the pertinent IO information from the traces saved by
  \texttt{blktrace} using the \texttt{btrecord} utility. This will parse
  each trace file created by \texttt{blktrace}, and craft IO descriptions
  to be used in the next phase of the workload processing.

  \item Once \texttt{btrecord} has successfully created a series of data
  files to be processed, you can run the \texttt{btreplay} utility which
  attempts to generate the same IOs seen during the sample workload phase.
\end{enumerate}
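
The steps above can be sketched as a dry run. The Python fragment below
only builds and prints the three command lines; it does not execute
anything. The device \texttt{/dev/sdb} and the helper name
\texttt{workflow\_commands} are placeholders chosen for illustration, and
using \texttt{-F} for both utilities assumes that the trace and record
files live in the current directory.

\begin{verbatim}
import shlex


def workflow_commands(device="/dev/sdb"):
    """Return the command lines for one record/replay cycle (dry run)."""
    return [
        # Steps 1-3: trace only QUEUE events while the workload runs.
        ["blktrace", "-d", device, "-a", "queue"],
        # Step 4: convert every trace file found in the current directory.
        ["btrecord", "-F"],
        # Step 5: replay every record file found in the current directory.
        ["btreplay", "-F"],
    ]


for argv in workflow_commands():
    print(shlex.join(argv))
\end{verbatim}

Since \texttt{btreplay} does not process write requests by default, such a
replay would skip the writes unless \texttt{-W} is added (see
section~\ref{sec:p-o-W}).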

\subsection{IO Stream Replay Characteristics}
  The major characteristics of the IO stream that are kept intact include:

  \begin{description}
    \item[Device] The IOs are replayed on the same device as was seen
    during the sample workload.

    \item[IO direction] The same IO direction (read/write) is maintained.

    \item[IO offset] The same device offset is maintained.

    \item[IO size] The same number of sectors is transferred.

    \item[Time differential] The time stamps stored during the
    \texttt{blktrace} run are used to determine the amount of time between
    IOs during the sample workload. \texttt{btreplay} \emph{attempts} to
    maintain the same time differential between IOs, but no guarantees as
    to complete accuracy are provided by the utility.

    \item[Device IO Stream Ordering] All IOs on a device are submitted in
    the precise order they were seen during the sample workload run.
  \end{description}

  As noted above, the time between IOs may not be accurately maintained
  during replays. In addition, the actual ordering of IOs \emph{between}
  devices is not necessarily maintained. (Each device with an IO stream
  maintains its own concept of time, and thus there may be slippage of the
  time kept between managing threads.)

  \begin{quotation}
    We have prototyped a different approach, wherein a single managing
    thread handles all IOs across all devices. This approach, while
    guaranteeing correct ordering of IOs across all devices, resulted in
    much worse timing on a per IO basis.
  \end{quotation}

\subsection{\texttt{btrecord/btreplay} Method of Operation}

As noted above, \texttt{btrecord} extracts \texttt{QUEUE} operations from
\texttt{blktrace} output. These \texttt{QUEUE} operations indicate the
entrance of IOs into the block IO layer. In order to replay these IOs with
some accuracy with regard to ordering and timeliness, we decided to take
multiple sequential (in time) IOs and put them in a single \emph{bunch} of
IOs that will be processed as a single \emph{asynchronous IO} call to the
kernel\footnote{Attempts to do them individually resulted in too large a
turnaround time penalty (user-space to kernel and back). Note that in a
number of workloads, the IOs are coming in from the page cache handling
code, and thus are submitted to the block IO layer with \emph{very small}
time intervals between issues.}. To manage the size of the \emph{bunches},
the \texttt{btrecord} utility provides you with two controlling knobs:

\begin{description}
  \item[\texttt{--max-bunch-time}] This is the amount of time to encompass
  in one bunch -- only IOs within the time specified are eligible
  for \emph{bunching.} The default time is 10 milliseconds (10,000,000
  nanoseconds). Refer to section~\ref{sec:c-o-m} on page~\pageref{sec:c-o-m}
  for more information.

  \item[\texttt{--max-pkts}] A \emph{bunch} can hold anywhere from
  1 to 512 packets; by default a bunch contains no
  more than 8 individual IOs. With this option, one can increase or
  decrease the maximum \emph{bunch} size.  Refer to section~\ref{sec:c-o-M}
  on page~\pageref{sec:c-o-M} for more information.
\end{description}
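
As a rough illustration of how the two knobs interact, the following
Python sketch groups a sorted list of time stamps into bunches. It is not
the \texttt{btrecord} implementation: the function name
\texttt{make\_bunches} is hypothetical, and the sketch assumes that the
time window is measured from the first IO in a bunch.

\begin{verbatim}
def make_bunches(stamps_ns, max_bunch_time_ns=10_000_000, max_pkts=8):
    """Group sorted time stamps (in nanoseconds) into bunches."""
    bunches = []
    for stamp in stamps_ns:
        current = bunches[-1] if bunches else None
        # Start a new bunch when there is none yet, when the current
        # one is full, or when this IO falls outside the time window
        # (assumption: window starts at the first IO of the bunch).
        if (current is None
                or len(current) >= max_pkts
                or stamp - current[0] > max_bunch_time_ns):
            bunches.append([stamp])
        else:
            current.append(stamp)
    return bunches


stamps = [0, 2_000_000, 4_000_000, 25_000_000]
print([len(b) for b in make_bunches(stamps)])              # [3, 1]
print([len(b) for b in make_bunches(stamps, max_pkts=2)])  # [2, 1, 1]
\end{verbatim}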

Each input data file (one per device per CPU) results in a new record
data file (again, one per device per CPU) which contains information
about \emph{bunches} of IOs to be replayed. \texttt{btreplay} operates on
these record data files by spawning a new pair of threads per file. One
thread manages the submitting of AIOs per bunch in the record data file,
while the other thread manages reclaiming AIOs completed\footnote{We
have found that having the same thread do both results in a further
reduction in replay timing accuracy.}.

Each submitting thread simply reads the input file of \emph{bunches}
recorded by \texttt{btrecord}, and attempts to faithfully reproduce the
ordering and timing of IOs seen during the sample workload. The reclaiming
thread simply waits for AIO completions, freeing up resources for the
submitting thread to utilize to submit new AIOs.

The number of CPUs being used on the replay system can be different from
the number on the recorded system. To help with mappings here, the
\texttt{--cpus} option allows one to state how many CPUs on the replay
system to utilize. If the number of CPUs on the replay system is smaller
than on the recording system, we wrap CPU IDs. This \emph{may} result in an
overload of CPU processing capabilities on the replay system. (Refer to
section~\ref{sec:p-o-c} on page~\pageref{sec:p-o-c} for more details about the
\texttt{--cpus} option.)
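
One plausible reading of ``wrapping'' is a simple modulo mapping. The
Python sketch below assumes exactly that; the helper name
\texttt{wrap\_cpu} is hypothetical and the code is not taken from the
\texttt{btreplay} source.

\begin{verbatim}
def wrap_cpu(recorded_cpu, replay_cpus):
    """Map a recorded CPU ID onto the CPUs used for replay (assumed modulo)."""
    return recorded_cpu % replay_cpus


# Four recorded CPUs replayed with --cpus=2: two record files per CPU.
print([wrap_cpu(cpu, 2) for cpu in range(4)])  # [0, 1, 0, 1]
\end{verbatim}

Under this assumption, record files from several recorded CPUs end up
sharing one replay CPU, which is where the possible overload comes from.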

\newpage\subsection{Known Deficiencies and Proposed Possible Fixes}

The known deficiencies of the current set of utilities are
outlined here; in some cases, ideas for additions and/or improvements are
included as well.

\begin{enumerate}
  \item Lack of IO ordering across devices.

  \begin{quote}
    \emph{We could institute the notion of global time across threads,
    and thus ensure IO ordering across devices, with some reduction in
    timing accuracy.}
  \end{quote}

  \item Lack of IO timing accuracy -- additional time between IO bunches.

  \begin{quote}
    \emph{This is the primary problem with any IO replay mechanism -- how
    to guarantee per-IO timing accuracy with respect to other replayed IOs?
    One idea to reduce errors in this area would be to push the IO replay
    into the kernel, where you \emph{may} receive more responsive timings.}
  \end{quote}

  \item Bunching of IOs results in reduced time between IOs within a bunch.

  \begin{quote}
    \emph{The user has \emph{some} control over this (via the
    \texttt{--max-pkts} option). One \emph{could} simply specify
    \texttt{--max-pkts=1} and then each IO would be treated individually. Of
    course, this would probably then run into the problem of excessive
    inter-IO times.}
  \end{quote}

  \item 1-to-1 mapping of devices -- for now the devices on the replay
  machine must be the same as on the recording machine.

  \begin{quote}
    \emph{It should be relatively trivial to add in the notion of
    mapping -- simply include a file that is read which maps devices
    on one machine to devices (with offsets and sizes) on the replay
    machine\footnote{The notion of an offset and device size to replay on
    could be used to allow a single device to masquerade as more
    than one device, and could also be utilized in case the replay device is
    smaller than the recorded device.}.}

    \medskip\emph{One could also add in the notion of CPU mappings --
    device $D_{rec}$ managed by CPU $C_{rec}$ on the recorded system
    shall be replayed on device $D_{rep}$ and CPU $C_{rep}$ on the
    replay machine.}

    \bigskip
    \begin{quote}
      With version 0.9.1 we now support the \texttt{-M} option to do this
      -- see section~\ref{sec:p-o-M} on page~\pageref{sec:p-o-M} for more
      information on device mapping.
    \end{quote}
  \end{quote}

\end{enumerate}

%---------------------
\newpage\section{\label{sec:command-line}Command Line Options}
\subsection{\texttt{btrecord} Command Line Options}
\begin{figure}[h!]
\begin{verbatim}
Usage: btrecord -- version 0.9.3

	[ -d <dir>  : --input-directory=<dir> ] Default: .
	[ -D <dir>  : --output-directory=<dir>] Default: .
	[ -F        : --find-traces           ] Default: Off
	[ -h        : --help                  ] Default: Off
	[ -m <nsec> : --max-bunch-time=<nsec> ] Default: 10 msec
	[ -M <pkts> : --max-pkts=<pkts>       ] Default: 8
	[ -o <base> : --output-base=<base>    ] Default: replay
	[ -v        : --verbose               ] Default: Off
	[ -V        : --version               ] Default: Off
	<dev>...                                Default: None
\end{verbatim}
\caption{\label{fig:btrecord--help}\texttt{btrecord --help} Output}
\end{figure}
\FloatBarrier

\subsubsection{\label{sec:c-o-d}\texttt{-d} or
\texttt{--input-directory}\\Set Input Directory}

The \texttt{-d} option requires a single parameter providing the name of
the directory where input files are to be found. The default directory is
the current directory (\texttt{.}).

\subsubsection{\label{sec:c-o-D}\texttt{-D} or
\texttt{--output-directory}\\Set Output Directory}

The \texttt{-D} option requires a single parameter providing the name of
the directory where output files are to be placed. The default directory is
the current directory (\texttt{.}).

\subsubsection{\texttt{-F} or \texttt{--find-traces}\\Find Trace Files
Automatically}

The \texttt{-F} option instructs \texttt{btrecord} to find all the
trace files in the directory specified (either via the \texttt{-d}
option, or in the default directory \texttt{.}).

\subsubsection{\texttt{-h} or \texttt{--help}\\Display Help Message}
\subsubsection{\texttt{-V} or \texttt{--version}\\Display
\texttt{btrecord} Version}

The \texttt{-h} option displays the command line options and
defaults, as presented in figure~\ref{fig:btrecord--help} on
page~\pageref{fig:btrecord--help}.

The \texttt{-V} option displays the \texttt{btrecord} version, as shown here:

\begin{verbatim}
$ btrecord --version
btrecord -- version 0.9.0
\end{verbatim}

In both cases, \texttt{btrecord} exits immediately after processing the
option.

\subsubsection{\label{sec:c-o-m}\texttt{-m} or
\texttt{--max-bunch-time}\\Set Maximum Time Per Bunch}

The \texttt{-m} option requires a single parameter which specifies an
amount of time (in nanoseconds) to include in any one bunch of IOs that
are to be processed. The smaller the value, the smaller the number of
IOs processed at one time -- perhaps yielding a more realistic replay.
However, after a certain point the amount of overhead per bunch may result
in additional real replay time, thus yielding less accurate replay times.

The default value is 10,000,000 nanoseconds (10 milliseconds).

\subsubsection{\label{sec:c-o-M}\texttt{-M} or
\texttt{--max-pkts}\\Set Maximum Packets Per Bunch}

The \texttt{-M} option requires a single parameter which specifies the
maximum number of IOs to store in a single bunch. As with the \texttt{-m}
option (section~\ref{sec:c-o-m}), smaller values \emph{may} or \emph{may not}
yield more accurate replay times.

The default value is 8, and the maximum supported value is 512.

\subsubsection{\label{sec:c-o-o}\texttt{-o} or
\texttt{--output-base}\\Set Base Name for Output Files}

Each output file name has 3 fields:

\begin{enumerate}
  \item Device identifier (taken directly from the device name of the
  \texttt{blktrace} output file).

  \item \texttt{btrecord} base name -- by default ``replay''.

  \item CPU number (again, taken directly from the
  \texttt{blktrace} output file name).
\end{enumerate}

This option requires a single parameter that overrides the default base
name (replay) with the specified value.
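
The naming scheme can be illustrated with a short Python sketch. It assumes
that the three fields are joined by periods, as in the file name
\texttt{sdab.replay.3} used elsewhere in this guide, and that trace files
follow the \texttt{blktrace} default naming of
\texttt{<device>.blktrace.<cpu>}; the helper name
\texttt{record\_file\_name} is hypothetical.

\begin{verbatim}
def record_file_name(trace_file, base="replay"):
    """Derive a record file name from a blktrace per-CPU file name."""
    # "sdab.blktrace.3" -> device "sdab", CPU "3"
    device, _kind, cpu = trace_file.rsplit(".", 2)
    return f"{device}.{base}.{cpu}"


print(record_file_name("sdab.blktrace.3"))              # sdab.replay.3
print(record_file_name("sdab.blktrace.3", base="run1")) # sdab.run1.3
\end{verbatim}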

\subsubsection{\label{sec:c-o-v}\texttt{-v} or
\texttt{--verbose}\\Select Verbose Output}

This option will output some simple statistics at the end of a successful
run. Figure~\ref{fig:verb-out} (page~\pageref{fig:verb-out}) shows
an example of some output, while figure~\ref{fig:verb-defs}
(page~\pageref{fig:verb-defs}) shows what the fields mean.

\begin{figure}[h!]
\begin{verbatim}
sdab:0: 580661 pkts (tot), 126030 pkts (replay), 89809 bunches, 1.4 pkts/bunch
sdab:1: 2559775 pkts (tot), 430172 pkts (replay), 293029 bunches, 1.5 pkts/bunch
sdab:2: 653559 pkts (tot), 136522 pkts (replay), 102288 bunches, 1.3 pkts/bunch
sdab:3: 474773 pkts (tot), 117849 pkts (replay), 69572 bunches, 1.7 pkts/bunch
\end{verbatim}
\caption{\label{fig:verb-out}Verbose Output Example}
\end{figure}
\FloatBarrier

\begin{figure}[h!]
\begin{description}
  \item[Field 1] The first field contains the device name and CPU
  identifier. Thus: \texttt{sdab:0:} means the device \texttt{sdab} and
  traces on CPU 0.

  \item[Field 2] The second field contains the total number of packets
  processed for each device file.

  \item[Field 3] The next field shows the number of packets eligible for
  replay.

  \item[Field 4] The fourth field contains the total number of IO bunches.

  \item[Field 5] The last field shows the average number of IOs per bunch
  recorded.
\end{description}
\caption{\label{fig:verb-defs}Verbose Field Definitions}
\end{figure}
\FloatBarrier
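
The numbers in the verbose output example are consistent with the last
field being the replayable packet count divided by the number of bunches,
rounded to one decimal place. The Python sketch below recomputes it from
the example values; the helper name \texttt{pkts\_per\_bunch} is
hypothetical, and the rounding is an assumption based on the output shown.

\begin{verbatim}
# (total pkts, replayable pkts, bunches) from the verbose output example
rows = {
    "sdab:0": (580661, 126030, 89809),
    "sdab:1": (2559775, 430172, 293029),
    "sdab:2": (653559, 136522, 102288),
    "sdab:3": (474773, 117849, 69572),
}


def pkts_per_bunch(replay_pkts, bunches):
    """Average number of replayable IOs per bunch."""
    return replay_pkts / bunches


for name, (_total, replay_pkts, bunches) in rows.items():
    print(f"{name}: {pkts_per_bunch(replay_pkts, bunches):.1f} pkts/bunch")
\end{verbatim}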

%---------------------
\newpage\subsection{\texttt{btreplay} Command Line Options}
\begin{figure}[h!]
\begin{verbatim}
Usage: btreplay -- version 0.9.3

	[ -c <cpus> : --cpus=<cpus>           ] Default: 1
	[ -d <dir>  : --input-directory=<dir> ] Default: .
	[ -F        : --find-records          ] Default: Off
	[ -h        : --help                  ] Default: Off
	[ -i <base> : --input-base=<base>     ] Default: replay
	[ -I <iters>: --iterations=<iters>    ] Default: 1
	[ -M <file> : --map-devs=<file>       ] Default: None
	[ -N        : --no-stalls             ] Default: Off
	[ -x <int>  : --acc-factor=<int>      ] Default: 1
	[ -v        : --verbose               ] Default: Off
	[ -V        : --version               ] Default: Off
	[ -W        : --write-enable          ] Default: Off
	<dev...>                                Default: None
\end{verbatim}
\caption{\label{fig:btreplay--help}\texttt{btreplay --help} Output}
\end{figure}
\FloatBarrier

\subsubsection{\label{sec:p-o-c}\texttt{-c} or
\texttt{--cpus}\\Set Number of CPUs to Use}

The \texttt{-c} option requires a single parameter which specifies the
number of CPUs on the replay system to utilize. The default value is 1.
If this number is smaller than the number of CPUs on the recording system,
CPU IDs are wrapped.

\subsubsection{\label{sec:p-o-d}\texttt{-d} or
\texttt{--input-directory}\\Set Input Directory}

The \texttt{-d} option requires a single parameter providing the name of
the directory where input files are to be found. The default directory is
the current directory (\texttt{.}).

\subsubsection{\texttt{-F} or \texttt{--find-records}\\Find Record Files
Automatically}

The \texttt{-F} option instructs \texttt{btreplay} to find all the
record files in the directory specified (either via the \texttt{-d}
option, or in the default directory \texttt{.}).

\subsubsection{\texttt{-h} or \texttt{--help}\\Display Help Message}
\subsubsection{\texttt{-V} or \texttt{--version}\\Display
\texttt{btreplay} Version}

The \texttt{-h} option displays the command line options and
defaults, as presented in figure~\ref{fig:btreplay--help} on
page~\pageref{fig:btreplay--help}.

The \texttt{-V} option displays the \texttt{btreplay} version, as shown here:

\begin{verbatim}
$ btreplay --version
btreplay -- version 0.9.0
\end{verbatim}

In both cases, \texttt{btreplay} exits immediately after processing the
option.

\subsubsection{\label{sec:p-o-i}\texttt{-i} or
\texttt{--input-base}\\Set Base Name for Input Files}

Each input file name has 3 fields:

\begin{enumerate}
  \item Device identifier (taken directly from the device name of the
  \texttt{blktrace} output file).

  \item \texttt{btrecord} base name -- by default ``replay''.

  \item CPU number (again, taken directly from the
  \texttt{blktrace} output file name).
\end{enumerate}

This option requires a single parameter that overrides the default base
name (replay) with the specified value.

\subsubsection{\label{sec:p-o-I}\texttt{-I} or
\texttt{--iterations}\\Set Number of Iterations to Run}

This option requires a single parameter which specifies the number of times
to run through the input files. The default value is 1.

\subsubsection{\label{sec:p-o-M}\texttt{-M} or \texttt{--map-devs}\\
Specify Device Mappings}

This option requires a single parameter which specifies the name of a
file containing device mappings. The file format is very simple, with
just two pieces of data per line:

\begin{enumerate}
  \item The device name on the recorded system (with the \texttt{'/dev/'}
  removed). Example: \texttt{/dev/sda} would just be \texttt{sda}.

  \item The device name on the replay system to use (again, without the
  \texttt{'/dev/'} path prepended).
\end{enumerate}

An example file that maps devices \texttt{/dev/sda} and
\texttt{/dev/sdb} on the recorded system to \texttt{/dev/sdg} and
\texttt{/dev/sdh} on the replay system would be:

\begin{verbatim}
sda sdg
sdb sdh
\end{verbatim}

The only entries allowed in the file are these two-element lines
-- we do not (yet?) support the notion of blank lines, or comment lines, or
the like.

The utility \emph{does} allow for multiple \texttt{-M} options to be
supplied on the command line.
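
Because the format is this strict, it can be worth checking a mapping file
before starting a replay. The Python sketch below is one way to do so; it
is not the parser used by \texttt{btreplay}, the function name
\texttt{parse\_map} is hypothetical, and it assumes that the two names on
a line are separated by whitespace.

\begin{verbatim}
def parse_map(text):
    """Parse 'recorded replay' lines into a dict, rejecting anything else."""
    mapping = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        # Blank lines and comments are not supported by the format.
        if len(fields) != 2:
            raise ValueError(
                f"line {lineno}: expected 2 fields, got {len(fields)}")
        recorded, replay = fields
        mapping[recorded] = replay
    return mapping


print(parse_map("sda sdg\nsdb sdh\n"))  # {'sda': 'sdg', 'sdb': 'sdh'}
\end{verbatim}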

\subsubsection{\label{sec:o-N}\texttt{-N} or \texttt{--no-stalls}\\Disable
Pre-bunch Stalls}

When specified on the command line, all pre-bunch stall indicators will be
ignored. IOs will be replayed without inter-bunch delays.

\subsubsection{\label{sec:o-x}\texttt{-x} or \texttt{--acc-factor}\\Acceleration
Factor}

  While the \texttt{--no-stalls} option allows the traces to be replayed
  with no waiting time, this option specifies an acceleration factor
  to be applied to the stalls. If a value of two is used, each stall time
  is halved, which reduces the time spent stalling by that factor.
  Note that if this number is too high, the results will
  be equivalent to having no stalls at all.
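
The effect on a single stall can be sketched in Python as follows. The
helper name \texttt{accelerated\_stall} is hypothetical, and the use of
integer division on nanosecond values is an assumption rather than a
statement about the \texttt{btreplay} source.

\begin{verbatim}
def accelerated_stall(stall_ns, acc_factor=1):
    """Scale a recorded stall (in nanoseconds) by the acceleration factor."""
    return stall_ns // acc_factor


print(accelerated_stall(10_000_000, 2))     # 5000000
print(accelerated_stall(10_000_000, 1000))  # 10000
\end{verbatim}

With a large factor the remaining stall becomes so short that the replay
behaves much as it would with \texttt{--no-stalls}.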

\subsubsection{\label{sec:p-o-v}\texttt{-v} or
\texttt{--verbose}\\Select Verbose Output}

When specified on the command line, this option instructs \texttt{btreplay}
to store information concerning each \emph{stall} and IO operation
performed by \texttt{btreplay}. The name of each file so created will be
the input file name used with an extension of \texttt{.rep} appended onto
it. Thus, an input file of the name \texttt{sdab.replay.3} would generate a
verbose output file with the name \texttt{sdab.replay.3.rep} in the
directory specified for input files.

In addition, \texttt{btreplay} will also output to \texttt{stderr} the
names of the input files being processed.

\subsubsection{\label{sec:p-o-W}\texttt{-W} or
\texttt{--write-enable}\\Enable Writing During Replay}

As a precautionary measure, by default \texttt{btreplay} will \emph{not}
process \emph{write} requests. In order to enable \texttt{btreplay} to
actually \emph{write} to devices one must explicitly specify the
\texttt{-W} option.

\end{document}
