## @file
#
# Technical notes for the virtio-net driver.
#
# Copyright (C) 2013, Red Hat, Inc.
#
# This program and the accompanying materials are licensed and made available
# under the terms and conditions of the BSD License which accompanies this
# distribution. The full text of the license may be found at
# http://opensource.org/licenses/bsd-license.php
#
# THE PROGRAM IS DISTRIBUTED UNDER THE BSD LICENSE ON AN "AS IS" BASIS, WITHOUT
# WARRANTIES OR REPRESENTATIONS OF ANY KIND, EITHER EXPRESS OR IMPLIED.
#
##

Disclaimer
----------

All statements concerning standards and specifications are informative and not
normative. They are made in good faith. Corrections are most welcome on the
edk2-devel mailing list.

The following documents have been perused while writing the driver and this
document:
- Unified Extensible Firmware Interface Specification, Version 2.3.1, Errata C,
  June 27, 2012;
- Driver Writer's Guide for UEFI 2.3.1, 03/08/2012, Version 1.01;
- Virtio PCI Card Specification, v0.9.5 DRAFT, 2012 May 7.


Summary
-------

The VirtioNetDxe UEFI_DRIVER implements the Simple Network Protocol for
virtio-net devices. Higher level protocols are automatically installed on top
of it by the DXE Core / the ConnectController() boot service. For virtio-net
devices this enables, for example, DHCP configuration, TCP transfers with edk2
StdLib applications, and PXE booting in OVMF.


UEFI driver structure
---------------------

A driver instance, belonging to a given virtio-net device, can be in one of
four states at any time. The states stack up as shown below. Each state
transition is labeled with the primary function that implements it, with that
function's important callees indented beneath it.

                               |  ^
                               |  |
   [DriverBinding.c]           |  | [DriverBinding.c]
   VirtioNetDriverBindingStart |  | VirtioNetDriverBindingStop
     VirtioNetSnpPopulate      |  |   VirtioNetSnpEvacuate
       VirtioNetGetFeatures    |  |
                               v  |
                   +-------------------------+
                   | EfiSimpleNetworkStopped |
                   +-------------------------+
                               |  ^
                [SnpStart.c]   |  | [SnpStop.c]
                VirtioNetStart |  | VirtioNetStop
                               |  |
                               v  |
                   +-------------------------+
                   | EfiSimpleNetworkStarted |
                   +-------------------------+
                               |  ^
  [SnpInitialize.c]            |  | [SnpShutdown.c]
  VirtioNetInitialize          |  | VirtioNetShutdown
    VirtioNetInitRing {Rx, Tx} |  |   VirtioNetShutdownRx [SnpSharedHelpers.c]
      VirtioRingInit           |  |   VirtioNetShutdownTx [SnpSharedHelpers.c]
    VirtioNetInitTx            |  |   VirtioRingUninit {Tx, Rx}
    VirtioNetInitRx            |  |
                               v  |
                  +-----------------------------+
                  | EfiSimpleNetworkInitialized |
                  +-----------------------------+

The state at the top means "nonexistent" and is hence unnamed on the diagram --
a driver instance actually doesn't exist at that point. The transition
functions out of and into that state implement the Driver Binding Protocol.

The lower three states characterize an existent driver instance and are all
states defined by the Simple Network Protocol. The transition functions between
them are member functions of the Simple Network Protocol.

Each transition function validates its expected source state and its
parameters. For example, VirtioNetDriverBindingStop will refuse to disconnect
from the controller unless the driver instance is in EfiSimpleNetworkStopped.
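The source-state validation described above can be pictured as a small lookup
table. The following is a minimal sketch in plain C, not the driver's code:
the names INSTANCE_STATE, TRANSITION, mTransitions and ApplyTransition are
hypothetical, and parameter validation is left out.

```c
#include <stdbool.h>

/* Hypothetical model of the four states; the names are illustrative only. */
typedef enum {
  StateNonexistent,  /* unnamed top state: no driver instance exists */
  StateStopped,      /* EfiSimpleNetworkStopped                      */
  StateStarted,      /* EfiSimpleNetworkStarted                      */
  StateInitialized   /* EfiSimpleNetworkInitialized                  */
} INSTANCE_STATE;

/* One entry per arrow on the state diagram. */
typedef enum {
  OpBindingStart,    /* VirtioNetDriverBindingStart */
  OpBindingStop,     /* VirtioNetDriverBindingStop  */
  OpStart,           /* VirtioNetStart              */
  OpStop,            /* VirtioNetStop               */
  OpInitialize,      /* VirtioNetInitialize         */
  OpShutdown,        /* VirtioNetShutdown           */
  OpMax
} TRANSITION;

/* Expected source state and resulting target state of each transition. */
static const struct {
  INSTANCE_STATE From;
  INSTANCE_STATE To;
} mTransitions[OpMax] = {
  [OpBindingStart] = { StateNonexistent, StateStopped     },
  [OpBindingStop]  = { StateStopped,     StateNonexistent },
  [OpStart]        = { StateStopped,     StateStarted     },
  [OpStop]         = { StateStarted,     StateStopped     },
  [OpInitialize]   = { StateStarted,     StateInitialized },
  [OpShutdown]     = { StateInitialized, StateStarted     }
};

/*
 * Perform Op on *State. Returns false, leaving *State unchanged, when the
 * instance is not in the source state that Op expects.
 */
bool
ApplyTransition (INSTANCE_STATE *State, TRANSITION Op)
{
  if ((unsigned)Op >= OpMax || *State != mTransitions[Op].From) {
    return false;
  }
  *State = mTransitions[Op].To;
  return true;
}
```

In this model, calling ApplyTransition with OpBindingStop while the state is
StateInitialized fails and leaves the state untouched, which mirrors the
VirtioNetDriverBindingStop example in the text.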


Driver instance states (Simple Network Protocol)
------------------------------------------------

In the EfiSimpleNetworkStopped state, the virtio-net device has been reset. No
resources are allocated for networking / traffic purposes. The MAC address and
other device attributes have been retrieved from the device (this is necessary
for completing the VirtioNetDriverBindingStart transition).

The EfiSimpleNetworkStarted state is identical to the EfiSimpleNetworkStopped
state for virtio-net, in both the functional and the resource-usage sense. This
state is mandated / provided by the Simple Network Protocol for flexibility
that the virtio-net driver doesn't exploit.

In particular, the EfiSimpleNetworkStarted state is the target of the Shutdown
SNP member function, and must therefore correspond to a hardware configuration
where "[it] is safe for another driver to initialize". (Clearly another UEFI
driver could not do that due to the exclusivity of the driver binding that
VirtioNetDriverBindingStart() installs, but a later OS driver might qualify.)

The EfiSimpleNetworkInitialized state is the live state of the virtio NIC / the
driver instance. Virtio and other resources required for network traffic have
been allocated, and the following SNP member functions are available (in
addition to VirtioNetShutdown which leaves the state):

- VirtioNetReceive [SnpReceive.c]: poll the virtio NIC for an Rx packet that
  may have arrived asynchronously;

- VirtioNetTransmit [SnpTransmit.c]: queue a Tx packet for asynchronous
  transmission (meant to be used together with VirtioNetGetStatus);

- VirtioNetGetStatus [SnpGetStatus.c]: query link status and status of pending
  Tx packets;

- VirtioNetMcastIpToMac [SnpMcastIpToMac.c]: transform a multicast IPv4/IPv6
  address into a multicast MAC address;

- VirtioNetReceiveFilters [SnpReceiveFilters.c]: emulate unicast / multicast /
  broadcast filter configuration (not their actual effect -- a more liberal
  filter setting than requested is allowed by the UEFI specification).
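As an illustration of the multicast address transformation listed above, here
is a minimal sketch of the standard mappings (RFC 1112 for IPv4, RFC 2464 for
IPv6). That the driver is structured this way is an assumption; the function
name and signature below are hypothetical, and no check is made that the input
really is a multicast address.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: derive a multicast MAC address from a multicast IP
 * address. Ip points to 4 bytes (IPv4) or 16 bytes (IPv6) in network order.
 */
void
McastIpToMac (bool IPv6, const uint8_t *Ip, uint8_t Mac[6])
{
  if (IPv6) {
    /* RFC 2464: 33:33 followed by the low-order 32 bits of the address. */
    Mac[0] = 0x33;
    Mac[1] = 0x33;
    Mac[2] = Ip[12];
    Mac[3] = Ip[13];
    Mac[4] = Ip[14];
    Mac[5] = Ip[15];
  } else {
    /* RFC 1112: 01:00:5E followed by the low-order 23 bits of the address. */
    Mac[0] = 0x01;
    Mac[1] = 0x00;
    Mac[2] = 0x5E;
    Mac[3] = Ip[1] & 0x7F;
    Mac[4] = Ip[2];
    Mac[5] = Ip[3];
  }
}
```

Under this mapping the IPv4 group 239.255.255.250 becomes 01:00:5E:7F:FF:FA,
and the IPv6 all-nodes address ff02::1 becomes 33:33:00:00:00:01.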

The following SNP member functions are not supported [SnpUnsupported.c]:

- VirtioNetReset: reinitialize the virtio NIC without shutting it down (a loop
  from/to EfiSimpleNetworkInitialized);

- VirtioNetStationAddress: assign a new MAC address to the virtio NIC;

- VirtioNetStatistics: collect statistics;

- VirtioNetNvData: access non-volatile data on the virtio NIC.

Missing support for these functions is allowed by the UEFI specification and
doesn't seem to trip up higher level protocols.


Events and task priority levels
-------------------------------

The UEFI specification defines a sophisticated mechanism for asynchronous
events / callbacks (see "6.1 Event, Timer, and Task Priority Services" for
details). Such callbacks work like software interrupts, and some notion of
locking / masking is important to implement critical sections (atomic or
exclusive access to data or a device). This notion is defined as Task Priority
Levels.

The virtio-net driver for OVMF must concern itself with events for two reasons:

- The Simple Network Protocol provides its clients with a (non-optional) WAIT
  type event called WaitForPacket: it allows them to check or wait for Rx
  packets by polling or blocking on this event. (This functionality overlaps
  with the Receive member function.) The event is available to clients starting
  with EfiSimpleNetworkStopped (inclusive).

  The virtio-net driver is informed about such client polling or blocking by
  receiving an asynchronous callback (a software interrupt). In the callback
  function the driver must interrogate the driver instance state, and if it is
  EfiSimpleNetworkInitialized, access the Rx queue and see if any packets are
  available for consumption. If so, it must signal the WaitForPacket WAIT type
  event, waking the client.

  For simplicity and safety, all parts of the virtio-net driver that access any
  bit of the driver instance (data or device) run at the TPL_CALLBACK level.
  This is the highest level allowed for an SNP implementation, and all code
  protected in this manner satisfies even stricter non-blocking requirements
  than what's documented for TPL_CALLBACK.

  The task priority level for the WaitForPacket callback is set by the driver
  as well; the choice is TPL_CALLBACK again. This in effect serializes the
  WaitForPacket callback (VirtioNetIsPacketAvailable [Events.c]) with "normal"
  parts of the driver.

- According to the Driver Writer's Guide, a network driver should install a
  callback function for the global EXIT_BOOT_SERVICES event (a special NOTIFY
  type event). When the ExitBootServices() boot service has cleaned up internal
  firmware state and is about to pass control to the OS, any network driver has
  to stop any in-flight DMA transfers, lest it corrupt OS memory. For this
  reason EXIT_BOOT_SERVICES is emitted and the network driver must abort
  in-flight DMA transfers.

  This callback (VirtioNetExitBoot) is synchronized with the rest of the driver
  code just the same as explained for WaitForPacket. In
  EfiSimpleNetworkInitialized state it resets the virtio NIC, halting all data
  transfer. After the callback returns, no further driver code is expected to
  be scheduled.
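The serialization effect of running both the driver code and the WaitForPacket
callback at TPL_CALLBACK can be demonstrated with a toy model. This is only a
sketch under heavy assumptions: a single global variable stands in for the
firmware's current task priority level, and NET_DEV, RaiseTpl, RestoreTpl and
DispatchWaitForPacket are hypothetical stand-ins, not the boot services (the
real RestoreTPL(), for one, takes no device argument). Only the two TPL values
are taken from the UEFI specification.

```c
#include <stdbool.h>
#include <stdint.h>

#define TPL_APPLICATION  4   /* values as defined by the UEFI specification */
#define TPL_CALLBACK     8

typedef enum { Stopped, Started, Initialized } SNP_STATE;

typedef struct {
  SNP_STATE State;
  uint16_t  RxLastUsed;  /* Used Index last processed by the guest        */
  uint16_t  RxUsedIdx;   /* Used Index as advanced by the simulated host  */
  bool      Signaled;    /* stands in for signaling WaitForPacket         */
  bool      Pending;     /* callback was requested while TPL was raised   */
} NET_DEV;

static unsigned mCurrentTpl = TPL_APPLICATION;

/* Body of the callback: signal only if initialized and a packet is waiting. */
static void
IsPacketAvailable (NET_DEV *Dev)
{
  if (Dev->State == Initialized && Dev->RxLastUsed != Dev->RxUsedIdx) {
    Dev->Signaled = true;
  }
}

/* Firmware side: run the callback now, or defer it while TPL is too high. */
void
DispatchWaitForPacket (NET_DEV *Dev)
{
  if (mCurrentTpl >= TPL_CALLBACK) {
    Dev->Pending = true;
  } else {
    unsigned Old = mCurrentTpl;

    mCurrentTpl = TPL_CALLBACK;
    IsPacketAvailable (Dev);
    mCurrentTpl = Old;
  }
}

/* Enter a critical section; returns the previous level. */
unsigned
RaiseTpl (unsigned NewTpl)
{
  unsigned Old = mCurrentTpl;

  mCurrentTpl = NewTpl;
  return Old;
}

/* Leave the critical section and run a callback that was deferred meanwhile. */
void
RestoreTpl (NET_DEV *Dev, unsigned OldTpl)
{
  mCurrentTpl = OldTpl;
  if (Dev->Pending && mCurrentTpl < TPL_CALLBACK) {
    Dev->Pending = false;
    DispatchWaitForPacket (Dev);
  }
}
```

In this model a callback requested between RaiseTpl (TPL_CALLBACK) and
RestoreTpl never observes the driver instance in a half-updated state; it runs
only once the critical section has been left.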


Virtio internals -- Rx
----------------------

Requests (Rx and Tx alike) are always submitted by the guest and processed by
the host. For Tx, processing means transmission. For Rx, processing means
filling in the request with an incoming packet. Submitted requests exist on the
"Available Ring", and answered (processed) requests show up on the "Used Ring".

Packet data includes the media (Ethernet) header: destination MAC, source MAC,
and Ethertype (14 bytes total).

The following structures implement packet reception. Most of them are defined
in the Virtio specification; the only driver-specific trait here is the static
pre-configuration of the two-part descriptor chains, in VirtioNetInitRx. The
diagram is simplified.

                     Available Index       Available Index
                     last processed          incremented
                       by the host           by the guest
                           v       ------->        v
Available  +-------+-------+-------+-------+-------+
Ring       |DescIdx|DescIdx|DescIdx|DescIdx|DescIdx|
           +-------+-------+-------+-------+-------+
                              =D6     =D2

       D2         D3          D4         D5          D6         D7
Descr. +----------+----------++----------+----------++----------+----------+
Table  |Adr:Len:Nx|Adr:Len:Nx||Adr:Len:Nx|Adr:Len:Nx||Adr:Len:Nx|Adr:Len:Nx|
       +----------+----------++----------+----------++----------+----------+
        =A2    =D3 =A3         =A4    =D5 =A5         =A6    =D7 =A7


            A2        A3     A4       A5     A6       A7
Receive     +---------------+---------------+---------------+
Destination |vnet hdr:packet|vnet hdr:packet|vnet hdr:packet|
Area        +---------------+---------------+---------------+

                Used Index                               Used Index incremented
        last processed by the guest                            by the host
                    v                    ------->                   v
Used    +-----------+-----------+-----------+-----------+-----------+
Ring    |DescIdx:Len|DescIdx:Len|DescIdx:Len|DescIdx:Len|DescIdx:Len|
        +-----------+-----------+-----------+-----------+-----------+
                                     =D4

In VirtioNetInitRx, the guest allocates the fixed size Receive Destination
Area, which accommodates all packets delivered asynchronously by the host. A
slice of this area is dedicated to each packet; each slice is further
subdivided into virtio-net request header and network packet data. The
(guest-physical) addresses of these sub-slices are denoted with A2, A3, A4 and
so on. Importantly, an even-subscript "A" always belongs to a virtio-net
request header, while an odd-subscript "A" always belongs to a packet
sub-slice.

Furthermore, the guest lays out a static pattern in the Descriptor Table. For
each packet that can be in-flight or already arrived from the host,
VirtioNetInitRx sets up a separate, two-part descriptor chain. For packet N,
the Nth descriptor chain is set up as follows:

- the first (=head) descriptor, with even index, points to the fixed-size
  sub-slice receiving the virtio-net request header,

- the second descriptor (with odd index) points to the fixed (1514 byte) size
  sub-slice receiving the packet data,

- a link from the first (head) descriptor in the chain is established to the
  second (tail) descriptor in the chain.

Finally, the guest populates the Available Ring with the indices of the head
descriptors. All descriptor indices on both the Available Ring and the Used
Ring are even.
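The static layout described above can be sketched as follows. The descriptor
structure and the two flag values follow the Virtio specification; everything
else is an assumption made for illustration: the 10-byte header size, the
number of chains (RX_CHAINS), and the names VRING_DESC and InitRxChains. Chain
numbering starts at 0 here.

```c
#include <stdint.h>

#define VRING_DESC_F_NEXT   1   /* descriptor continues via the Next field */
#define VRING_DESC_F_WRITE  2   /* buffer is written by the host           */

#define VNET_HDR_SIZE   10u     /* assumed virtio-net request header size  */
#define PACKET_SIZE   1514u     /* packet data sub-slice size              */
#define RX_CHAINS        3u     /* hypothetical number of Rx chains        */

typedef struct {
  uint64_t Addr;                /* guest-physical address of the buffer    */
  uint32_t Len;
  uint16_t Flags;
  uint16_t Next;
} VRING_DESC;

/*
 * Set up RX_CHAINS two-part chains in Desc (2 * RX_CHAINS entries) and place
 * the head descriptor indices on AvailRing (RX_CHAINS entries). RxAreaBase
 * is the address of the Receive Destination Area.
 */
void
InitRxChains (VRING_DESC *Desc, uint16_t *AvailRing, uint64_t RxAreaBase)
{
  for (uint16_t N = 0; N < RX_CHAINS; N++) {
    uint64_t    Slice = RxAreaBase + (uint64_t)N * (VNET_HDR_SIZE + PACKET_SIZE);
    VRING_DESC *Head  = &Desc[2 * N];
    VRING_DESC *Tail  = &Desc[2 * N + 1];

    /* Head (even index): virtio-net request header, linked to the tail. */
    Head->Addr  = Slice;
    Head->Len   = VNET_HDR_SIZE;
    Head->Flags = VRING_DESC_F_NEXT | VRING_DESC_F_WRITE;
    Head->Next  = (uint16_t)(2 * N + 1);

    /* Tail (odd index): packet data, end of the chain. */
    Tail->Addr  = Slice + VNET_HDR_SIZE;
    Tail->Len   = PACKET_SIZE;
    Tail->Flags = VRING_DESC_F_WRITE;
    Tail->Next  = 0;

    /* Only head indices, all even, ever appear on the rings. */
    AvailRing[N] = (uint16_t)(2 * N);
  }
}
```

Both descriptors of a chain carry the WRITE flag because, for Rx, the host
fills in the header as well as the packet data.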

Packet reception occurs as follows:

- The host consumes a descriptor index off the Available Ring. This index is
  even (=2*N), and identifies the head descriptor of the chain belonging to
  packet N.

- The host reads the descriptors D(2*N) and -- following the Next link there
  -- D(2*N+1), and stores the virtio-net request header at A(2*N), and the
  packet data at A(2*N+1).

- The host places the index of the head descriptor, 2*N, onto the Used Ring,
  and sets the Len field in the same Used Ring Element to the total number of
  bytes transferred for the entire descriptor chain. Because this total
  includes the virtio-net request header, the guest can derive the length of
  the Rx packet from it.

- VirtioNetReceive polls the Used Ring. If a new Used Ring Element shows up, it
  copies the data out to the caller, and recycles the index of the head
  descriptor (ie. 2*N) to the Available Ring.

- Because the host can, in theory, process (answer) Rx requests in any order,
  the order of head descriptor indices on each of the Available Ring and the
  Used Ring is virtually random. (The exception is right after the initial
  population in VirtioNetInitRx, when the Available Ring is full with indices
  in increasing order, and the Used Ring is empty.)

- If the Available Ring is empty, the host is forced to drop packets. If the
  Used Ring is empty, VirtioNetReceive returns EFI_NOT_READY (no packet
  available).
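The guest side of the steps above can be sketched as a polling function. This
is a simplified model, not the driver's code: RX_QUEUE, RxPoll, the queue size
of 4 and the 10-byte header size are assumptions, the return codes are
hypothetical stand-ins for EFI status codes, and the memory barriers that a
real implementation needs around the index accesses are omitted.

```c
#include <stdint.h>
#include <string.h>

#define QUEUE_SIZE       4u     /* hypothetical; descriptors and ring slots */
#define VNET_HDR_SIZE   10u     /* assumed virtio-net request header size   */
#define PACKET_SIZE   1514u
#define SLICE_SIZE    (VNET_HDR_SIZE + PACKET_SIZE)

typedef struct { uint32_t Id; uint32_t Len; } USED_ELEM;

typedef struct {
  uint16_t  AvailRing[QUEUE_SIZE];
  uint16_t  AvailIdx;           /* free-running, advanced by the guest      */
  USED_ELEM UsedRing[QUEUE_SIZE];
  uint16_t  UsedIdx;            /* free-running, advanced by the host       */
  uint16_t  LastUsed;           /* Used Index last processed by the guest   */
  uint8_t   RxArea[(QUEUE_SIZE / 2u) * SLICE_SIZE];
} RX_QUEUE;

/*
 * Returns the packet length on success, -1 if no packet is available (the
 * driver returns EFI_NOT_READY), -2 if Buf is too small, -3 if the Used Ring
 * Element is malformed.
 */
int
RxPoll (RX_QUEUE *Q, uint8_t *Buf, uint32_t BufSize)
{
  if (Q->LastUsed == Q->UsedIdx) {
    return -1;                                  /* Used Ring is empty */
  }

  const USED_ELEM *Elem = &Q->UsedRing[Q->LastUsed % QUEUE_SIZE];
  uint32_t         Head = Elem->Id;             /* even: 2*N */

  if (Head >= QUEUE_SIZE || (Head & 1u) != 0 ||
      Elem->Len < VNET_HDR_SIZE || Elem->Len > SLICE_SIZE) {
    return -3;
  }

  /* Len covers the whole chain, so subtract the request header. */
  uint32_t PacketLen = Elem->Len - VNET_HDR_SIZE;

  if (PacketLen > BufSize) {
    return -2;                                  /* packet stays queued */
  }

  memcpy (Buf, Q->RxArea + (Head / 2u) * SLICE_SIZE + VNET_HDR_SIZE, PacketLen);
  Q->LastUsed++;

  /* Recycle the head descriptor index to the Available Ring. */
  Q->AvailRing[Q->AvailIdx % QUEUE_SIZE] = (uint16_t)Head;
  Q->AvailIdx++;
  return (int)PacketLen;
}
```

The model locates the packet data from the fixed slice layout instead of
reading the tail descriptor; with the static descriptor pattern both yield the
same address.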


Virtio internals -- Tx
----------------------

The transmission structure erected by VirtioNetInitTx is similar; it differs
in the following:

- There is no Receive Destination Area.

- Each head descriptor, D(2*N), points to a read-only virtio-net request header
  that is shared by all of the head descriptors. This virtio-net request header
  is never modified by the host.

- Each tail descriptor is re-pointed to the caller-supplied packet buffer
  whenever VirtioNetTransmit places the corresponding head descriptor on the
  Available Ring. The caller is responsible for holding on to the unmodified
  buffer until it is reported transmitted by VirtioNetGetStatus.

Steps of packet transmission:

- Client code calls VirtioNetTransmit. VirtioNetTransmit tracks free descriptor
  chains by keeping the indices of their head descriptors in a stack that is
  private to the driver instance. All elements of the stack are even.

- If the stack is empty (that is, each descriptor chain, in isolation, is
  either pending transmission, or has been processed by the host but not
  yet recycled by a VirtioNetGetStatus call), then VirtioNetTransmit returns
  EFI_NOT_READY.

- Otherwise the index of a free chain's head descriptor is popped from the
  stack. The linked tail descriptor is re-pointed as discussed above. The head
  descriptor's index is pushed on the Available Ring.

- The host moves the head descriptor index from the Available Ring to the Used
  Ring when it transmits the packet.

- Client code calls VirtioNetGetStatus. In case the Used Ring is empty, the
  function reports no Tx completion. Otherwise, a head descriptor's index is
  consumed from the Used Ring and recycled to the private stack. The client
  code's original packet buffer address is fetched from the tail descriptor
  (where it has been stored at VirtioNetTransmit time) and returned to the
  caller.

- The Len field of the Used Ring Element is not checked. The host is assumed to
  have transmitted the entire packet -- VirtioNetTransmit has limited it to at
  most 1514 bytes. The Virtio specification suggests this packet size is
  always accepted (and a lower MTU could be encountered on any later hop as
  well). Additionally, there's no good way to report a short transmit via
  VirtioNetGetStatus; EFI_DEVICE_ERROR seems too serious according to the
  specification, and higher level protocols could interpret it as a fatal
  condition.

- The host can theoretically reorder head descriptor indices when moving them
  from the Available Ring to the Used Ring (out of order transmission). Because
  of this (and the choice of a stack over a list for free descriptor chain
  tracking) the order of head descriptor indices on either Ring is
  unpredictable.
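The transmission steps can be modeled with a small free stack plus two rings.
The sketch below is illustrative only: TX_QUEUE, TxInit, TxTransmit,
TxGetStatus and the chain count are hypothetical, the rings hold just
TX_CHAINS entries, the TailAddr array stands in for the Addr fields of the tail
descriptors, and the return value -1 stands in for EFI_NOT_READY.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TX_CHAINS 2u   /* hypothetical number of Tx descriptor chains */

typedef struct {
  uint16_t    Free[TX_CHAINS];      /* private stack of free head indices */
  uint16_t    FreeCount;
  const void *TailAddr[TX_CHAINS];  /* models the tail descriptors' Addr  */
  uint16_t    AvailRing[TX_CHAINS];
  uint16_t    AvailIdx;             /* advanced by the guest              */
  uint16_t    UsedRing[TX_CHAINS];  /* head indices; Len is not checked   */
  uint16_t    UsedIdx;              /* advanced by the host               */
  uint16_t    LastUsed;             /* guest's private cursor             */
} TX_QUEUE;

/* All chains start out free; every stack element is an even head index. */
void
TxInit (TX_QUEUE *Q)
{
  memset (Q, 0, sizeof *Q);
  for (uint16_t N = 0; N < TX_CHAINS; N++) {
    Q->Free[Q->FreeCount++] = (uint16_t)(2 * N);
  }
}

/* Queue Buf for transmission. Returns 0, or -1 if no chain is free. */
int
TxTransmit (TX_QUEUE *Q, const void *Buf)
{
  if (Q->FreeCount == 0) {
    return -1;
  }

  uint16_t Head = Q->Free[--Q->FreeCount];      /* pop a free chain */

  Q->TailAddr[Head / 2] = Buf;                  /* re-point the tail */
  Q->AvailRing[Q->AvailIdx % TX_CHAINS] = Head;
  Q->AvailIdx++;
  return 0;
}

/* Returns the buffer of one completed packet, or NULL if none completed. */
const void *
TxGetStatus (TX_QUEUE *Q)
{
  if (Q->LastUsed == Q->UsedIdx) {
    return NULL;                                /* Used Ring is empty */
  }

  uint16_t Head = Q->UsedRing[Q->LastUsed % TX_CHAINS];

  Q->LastUsed++;
  Q->Free[Q->FreeCount++] = Head;               /* recycle the chain */
  return Q->TailAddr[Head / 2];
}
```

A chain is returned to the stack only in TxGetStatus, so a client that never
polls for completions eventually sees TxTransmit fail, which matches the
EFI_NOT_READY case in the text.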