<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
          "http://www.w3.org/TR/html4/strict.dtd">
<!-- Material used from: HTML 4.01 specs: http://www.w3.org/TR/html401/ -->
<html>
<head>
  <META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
  <title>&lt;atomic&gt; design</title>
  <link type="text/css" rel="stylesheet" href="menu.css">
  <link type="text/css" rel="stylesheet" href="content.css">
</head>

<body>
<div id="menu">
  <div>
    <a href="http://llvm.org/">LLVM Home</a>
  </div>

  <div class="submenu">
    <label>libc++ Info</label>
    <a href="/index.html">About</a>
  </div>

  <div class="submenu">
    <label>Quick Links</label>
    <a href="http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev">cfe-dev</a>
    <a href="http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits">cfe-commits</a>
    <a href="http://llvm.org/bugs/">Bug Reports</a>
    <a href="http://llvm.org/svn/llvm-project/libcxx/trunk/">Browse SVN</a>
    <a href="http://llvm.org/viewvc/llvm-project/libcxx/trunk/">Browse ViewVC</a>
  </div>
</div>

<div id="content">
  <!--*********************************************************************-->
  <h1>&lt;atomic&gt; design</h1>
  <!--*********************************************************************-->

<p>
The <tt>&lt;atomic&gt;</tt> header is one of the headers most closely coupled to
the compiler.  Ideally, invoking any function from <tt>&lt;atomic&gt;</tt>
should result in highly optimized assembly being inserted directly into your
application: assembly that is not otherwise representable by higher level C or
C++ expressions.  The design of the libc++ <tt>&lt;atomic&gt;</tt> header
started with this goal in mind.  A secondary, but still very important, goal is
that the compiler should have to do minimal work to facilitate the
implementation of <tt>&lt;atomic&gt;</tt>.  Without this second goal,
practically speaking, the libc++ <tt>&lt;atomic&gt;</tt> header would be doomed
to be a barely supported, second-class citizen on almost every platform.
</p>

<p>Goals:</p>

<blockquote><ul>
<li>Optimal code generation for atomic operations</li>
<li>Minimal effort for the compiler to achieve goal 1 on any given platform</li>
<li>Conformance to the C++0X draft standard</li>
</ul></blockquote>

<p>
The purpose of this document is to inform compiler writers what they need to do
to enable a high performance libc++ <tt>&lt;atomic&gt;</tt> with minimal effort.
</p>

<h2>The minimal work that must be done for a conforming <tt>&lt;atomic&gt;</tt></h2>

<p>
The only "atomic" operations that must actually be lock free in
<tt>&lt;atomic&gt;</tt> are represented by the following compiler intrinsics:
</p>

<blockquote><pre>
__atomic_flag__
__atomic_exchange_seq_cst(__atomic_flag__ volatile* obj, __atomic_flag__ desr)
{
    unique_lock&lt;mutex&gt; _(some_mutex);
    __atomic_flag__ result = *obj;
    *obj = desr;
    return result;
}

void
__atomic_store_seq_cst(__atomic_flag__ volatile* obj, __atomic_flag__ desr)
{
    unique_lock&lt;mutex&gt; _(some_mutex);
    *obj = desr;
}
</pre></blockquote>

<p>
Where:
</p>

<blockquote><ul>
<li>
If <tt>__has_feature(__atomic_flag)</tt> evaluates to 1 in the preprocessor then
the compiler must define <tt>__atomic_flag__</tt> (e.g. as a typedef to
<tt>int</tt>).
</li>
<li>
If <tt>__has_feature(__atomic_flag)</tt> evaluates to 0 in the preprocessor then
the library defines <tt>__atomic_flag__</tt> as a typedef to <tt>bool</tt>.
</li>
<li>
<p>
To communicate that the above intrinsics are available, the compiler must
arrange for <tt>__has_feature</tt> to return 1 when fed the intrinsic name
appended with an '_' and the mangled type name of <tt>__atomic_flag__</tt>.
</p>
<p>
For example, if <tt>__atomic_flag__</tt> is <tt>unsigned int</tt>:
</p>
<blockquote><pre>
__has_feature(__atomic_flag) == 1
__has_feature(__atomic_exchange_seq_cst_j) == 1
__has_feature(__atomic_store_seq_cst_j) == 1

typedef unsigned int __atomic_flag__;

unsigned int __atomic_exchange_seq_cst(unsigned int volatile*, unsigned int)
{
   // ...
}

void __atomic_store_seq_cst(unsigned int volatile*, unsigned int)
{
   // ...
}
</pre></blockquote>
</li>
</ul></blockquote>
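
<p>
For instance, the library's side of this handshake might look something like the
following sketch (illustrative only; the actual libc++ sources may differ):
</p>

<blockquote><pre>
// Sketch: library-side selection of __atomic_flag__.
#if __has_feature(__atomic_flag)
    // The compiler has defined __atomic_flag__ (e.g. as a typedef to int)
    // and supplies lock-free __atomic_exchange_seq_cst / __atomic_store_seq_cst.
#else
    typedef bool __atomic_flag__;  // library fallback definition
#endif
</pre></blockquote>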

<p>
That's it!  Compiler writers do the above and you've got a fully conforming
(though with sub-par performance) <tt>&lt;atomic&gt;</tt> header!
</p>

<h2>Recommended work for a higher performance <tt>&lt;atomic&gt;</tt></h2>

<p>
It would be good if the above intrinsics worked with all integral types plus
<tt>void*</tt>.  Because this may not be possible to do in a lock-free manner for
all integral types on all platforms, a compiler must communicate each type that
an intrinsic works with.  For example, if <tt>__atomic_exchange_seq_cst</tt> works
for all types except for <tt>long long</tt> and <tt>unsigned long long</tt>,
then:
</p>

<blockquote><pre>
__has_feature(__atomic_exchange_seq_cst_b) == 1  // bool
__has_feature(__atomic_exchange_seq_cst_c) == 1  // char
__has_feature(__atomic_exchange_seq_cst_a) == 1  // signed char
__has_feature(__atomic_exchange_seq_cst_h) == 1  // unsigned char
__has_feature(__atomic_exchange_seq_cst_Ds) == 1 // char16_t
__has_feature(__atomic_exchange_seq_cst_Di) == 1 // char32_t
__has_feature(__atomic_exchange_seq_cst_w) == 1  // wchar_t
__has_feature(__atomic_exchange_seq_cst_s) == 1  // short
__has_feature(__atomic_exchange_seq_cst_t) == 1  // unsigned short
__has_feature(__atomic_exchange_seq_cst_i) == 1  // int
__has_feature(__atomic_exchange_seq_cst_j) == 1  // unsigned int
__has_feature(__atomic_exchange_seq_cst_l) == 1  // long
__has_feature(__atomic_exchange_seq_cst_m) == 1  // unsigned long
__has_feature(__atomic_exchange_seq_cst_Pv) == 1 // void*
</pre></blockquote>

<p>
Note that only the <tt>__has_feature</tt> flag is decorated with the argument
type.  The name of the compiler intrinsic is not decorated, but instead works
like a C++ overloaded function.
</p>
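
<p>
For example, the library might select between the intrinsic and its own
lock-based fallback with something like the following sketch (the helper name
<tt>__libcpp_atomic_exchange</tt> and the <tt>#if</tt> structure are
illustrative assumptions, not the actual libc++ source):
</p>

<blockquote><pre>
// Sketch: the feature flag carries the mangled suffix ("_i" for int);
// the intrinsic name itself is undecorated.
int __libcpp_atomic_exchange(int volatile* obj, int desr)
{
#if __has_feature(__atomic_exchange_seq_cst_i)
    return __atomic_exchange_seq_cst(obj, desr);  // compiler intrinsic
#else
    unique_lock&lt;mutex&gt; _(some_mutex);             // library's lock-based fallback
    int r = *obj;
    *obj = desr;
    return r;
#endif
}
</pre></blockquote>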

<p>
Additionally, there are other intrinsics besides
<tt>__atomic_exchange_seq_cst</tt> and <tt>__atomic_store_seq_cst</tt>.  They
are optional, but if the compiler can generate faster code than the library
provides, clients will benefit from the compiler writer's expertise and
knowledge of the targeted platform.
</p>

<p>
Below is the complete list of <i>sequentially consistent</i> intrinsics, and
their library implementations.  Template syntax is used to indicate the desired
overloading for integral and <tt>void*</tt> types.  The template does not
represent a requirement that the intrinsic operate on <em>any</em> type!
</p>

<blockquote><pre>
T is one of:  bool, char, signed char, unsigned char, short, unsigned short,
              int, unsigned int, long, unsigned long,
              long long, unsigned long long, char16_t, char32_t, wchar_t, void*

template &lt;class T&gt;
T
__atomic_load_seq_cst(T const volatile* obj)
{
    unique_lock&lt;mutex&gt; _(some_mutex);
    return *obj;
}

template &lt;class T&gt;
void
__atomic_store_seq_cst(T volatile* obj, T desr)
{
    unique_lock&lt;mutex&gt; _(some_mutex);
    *obj = desr;
}

template &lt;class T&gt;
T
__atomic_exchange_seq_cst(T volatile* obj, T desr)
{
    unique_lock&lt;mutex&gt; _(some_mutex);
    T r = *obj;
    *obj = desr;
    return r;
}

template &lt;class T&gt;
bool
__atomic_compare_exchange_strong_seq_cst_seq_cst(T volatile* obj, T* exp, T desr)
{
    unique_lock&lt;mutex&gt; _(some_mutex);
    if (std::memcmp(const_cast&lt;T*&gt;(obj), exp, sizeof(T)) == 0)
    {
        std::memcpy(const_cast&lt;T*&gt;(obj), &amp;desr, sizeof(T));
        return true;
    }
    std::memcpy(exp, const_cast&lt;T*&gt;(obj), sizeof(T));
    return false;
}

template &lt;class T&gt;
bool
__atomic_compare_exchange_weak_seq_cst_seq_cst(T volatile* obj, T* exp, T desr)
{
    unique_lock&lt;mutex&gt; _(some_mutex);
    if (std::memcmp(const_cast&lt;T*&gt;(obj), exp, sizeof(T)) == 0)
    {
        std::memcpy(const_cast&lt;T*&gt;(obj), &amp;desr, sizeof(T));
        return true;
    }
    std::memcpy(exp, const_cast&lt;T*&gt;(obj), sizeof(T));
    return false;
}

T is one of:  char, signed char, unsigned char, short, unsigned short,
              int, unsigned int, long, unsigned long,
              long long, unsigned long long, char16_t, char32_t, wchar_t

template &lt;class T&gt;
T
__atomic_fetch_add_seq_cst(T volatile* obj, T operand)
{
    unique_lock&lt;mutex&gt; _(some_mutex);
    T r = *obj;
    *obj += operand;
    return r;
}

template &lt;class T&gt;
T
__atomic_fetch_sub_seq_cst(T volatile* obj, T operand)
{
    unique_lock&lt;mutex&gt; _(some_mutex);
    T r = *obj;
    *obj -= operand;
    return r;
}

template &lt;class T&gt;
T
__atomic_fetch_and_seq_cst(T volatile* obj, T operand)
{
    unique_lock&lt;mutex&gt; _(some_mutex);
    T r = *obj;
    *obj &amp;= operand;
    return r;
}

template &lt;class T&gt;
T
__atomic_fetch_or_seq_cst(T volatile* obj, T operand)
{
    unique_lock&lt;mutex&gt; _(some_mutex);
    T r = *obj;
    *obj |= operand;
    return r;
}

template &lt;class T&gt;
T
__atomic_fetch_xor_seq_cst(T volatile* obj, T operand)
{
    unique_lock&lt;mutex&gt; _(some_mutex);
    T r = *obj;
    *obj ^= operand;
    return r;
}

void*
__atomic_fetch_add_seq_cst(void* volatile* obj, ptrdiff_t operand)
{
    unique_lock&lt;mutex&gt; _(some_mutex);
    void* r = *obj;
    (char*&amp;)(*obj) += operand;
    return r;
}

void*
__atomic_fetch_sub_seq_cst(void* volatile* obj, ptrdiff_t operand)
{
    unique_lock&lt;mutex&gt; _(some_mutex);
    void* r = *obj;
    (char*&amp;)(*obj) -= operand;
    return r;
}

void __atomic_thread_fence_seq_cst()
{
    unique_lock&lt;mutex&gt; _(some_mutex);
}

void __atomic_signal_fence_seq_cst()
{
    unique_lock&lt;mutex&gt; _(some_mutex);
}
</pre></blockquote>

<p>
One should consult the (currently draft)
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3126.pdf">C++ standard</a>
for the details of the definitions of these operations.  For example,
<tt>__atomic_compare_exchange_weak_seq_cst_seq_cst</tt> is allowed to fail
spuriously while <tt>__atomic_compare_exchange_strong_seq_cst_seq_cst</tt> is
not.
</p>
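
<p>
The practical consequence of spurious failure is that the weak form is meant to
be used in a retry loop.  A minimal sketch (assuming an <tt>int volatile</tt>
object <tt>obj</tt> and a hypothetical update function <tt>next_value</tt>):
</p>

<blockquote><pre>
// Sketch: a spurious failure merely repeats the loop.
int expected = __atomic_load_seq_cst(&amp;obj);
int desired  = next_value(expected);
while (!__atomic_compare_exchange_weak_seq_cst_seq_cst(&amp;obj, &amp;expected, desired))
    desired = next_value(expected);  // on failure, expected was reloaded from obj
</pre></blockquote>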

<p>
If on your platform the lock-free definition of
<tt>__atomic_compare_exchange_weak_seq_cst_seq_cst</tt> would be the same as
<tt>__atomic_compare_exchange_strong_seq_cst_seq_cst</tt>, you may omit the
<tt>__atomic_compare_exchange_weak_seq_cst_seq_cst</tt> intrinsic without a
performance cost.  The library will prefer your implementation of
<tt>__atomic_compare_exchange_strong_seq_cst_seq_cst</tt> over its own
definition for implementing
<tt>__atomic_compare_exchange_weak_seq_cst_seq_cst</tt>.  That is, the library
will arrange for <tt>__atomic_compare_exchange_weak_seq_cst_seq_cst</tt> to call
<tt>__atomic_compare_exchange_strong_seq_cst_seq_cst</tt> if you supply an
intrinsic for the strong version but not the weak.
</p>
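
<p>
Conceptually, the library's arrangement looks something like this sketch (the
guard shown is for <tt>int</tt>; the actual libc++ mechanics may differ):
</p>

<blockquote><pre>
#if !__has_feature(__atomic_compare_exchange_weak_seq_cst_seq_cst_i) &amp;&amp; \
     __has_feature(__atomic_compare_exchange_strong_seq_cst_seq_cst_i)
bool
__atomic_compare_exchange_weak_seq_cst_seq_cst(int volatile* obj, int* exp, int desr)
{
    // A strong CAS (which never fails spuriously) is a valid weak CAS.
    return __atomic_compare_exchange_strong_seq_cst_seq_cst(obj, exp, desr);
}
#endif
</pre></blockquote>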

<h2>Taking advantage of weaker memory synchronization</h2>

<p>
So far all of the intrinsics presented require a <em>sequentially
consistent</em> memory ordering.  That is, no loads or stores can move across
the operation (just as if the library had locked that internal mutex).  But
<tt>&lt;atomic&gt;</tt> supports weaker memory ordering operations.  In all,
there are six memory orderings (listed here from strongest to weakest):
</p>

<blockquote><pre>
memory_order_seq_cst
memory_order_acq_rel
memory_order_release
memory_order_acquire
memory_order_consume
memory_order_relaxed
</pre></blockquote>

<p>
(See the
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3126.pdf">C++ standard</a>
for the detailed definitions of each of these orderings.)
</p>

<p>
On some platforms, the compiler vendor can offer some or even all of the above
intrinsics at one or more weaker levels of memory synchronization.  This might,
for example, result in not issuing an <tt>mfence</tt> instruction on x86.
</p>
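
<p>
As a concrete illustration (an assumption about typical x86 code generation,
not a requirement of this design): the full fence is only needed at the
<tt>seq_cst</tt> level, so a <tt>release</tt> store can drop it.
</p>

<blockquote><pre>
// Plausible x86 code generation (illustrative only):
// __atomic_store_seq_cst(&amp;x, 1):   mov [x], 1
//                                  mfence      ; order w.r.t. later loads
// __atomic_store_release(&amp;x, 1):   mov [x], 1  ; no fence needed on x86
</pre></blockquote>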

<p>
If the compiler does not offer a given operation at a given memory ordering
level, the library will automatically attempt to call the next strongest memory
ordering operation.  This continues up to <tt>seq_cst</tt>, and if that doesn't
exist, the library takes over and does the job with a <tt>mutex</tt>.  This
is a compile-time search &amp; selection operation.  At run time, the
application will only see the few inlined assembly instructions for the selected
intrinsic.
</p>
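
<p>
A sketch of that compile-time search, for a <tt>release</tt> store of
<tt>int</tt> (the helper name and preprocessor logic are illustrative, not the
actual libc++ source):
</p>

<blockquote><pre>
void __libcpp_atomic_store_release(int volatile* obj, int desr)
{
#if __has_feature(__atomic_store_release_i)
    __atomic_store_release(obj, desr);   // exact match
#elif __has_feature(__atomic_store_seq_cst_i)
    __atomic_store_seq_cst(obj, desr);   // next strongest ordering
#else
    unique_lock&lt;mutex&gt; _(some_mutex);    // final fallback: the mutex
    *obj = desr;
#endif
}
</pre></blockquote>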

<p>
Each intrinsic is appended with the 7-letter name of the memory ordering it
addresses.  For example, a <tt>load</tt> with <tt>relaxed</tt> ordering is
defined by:
</p>

<blockquote><pre>
T __atomic_load_relaxed(const volatile T* obj);
</pre></blockquote>

<p>
And announced with:
</p>

<blockquote><pre>
__has_feature(__atomic_load_relaxed_b) == 1  // bool
__has_feature(__atomic_load_relaxed_c) == 1  // char
__has_feature(__atomic_load_relaxed_a) == 1  // signed char
...
</pre></blockquote>

<p>
The <tt>__atomic_compare_exchange_strong(weak)</tt> intrinsics are parameterized
on two memory orderings.  The first ordering applies when the operation returns
<tt>true</tt> and the second ordering applies when the operation returns
<tt>false</tt>.
</p>
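
<p>
For example, a strong compare-exchange that is <tt>acquire</tt> on success and
<tt>relaxed</tt> on failure is spelled:
</p>

<blockquote><pre>
template &lt;class T&gt;
bool
__atomic_compare_exchange_strong_acquire_relaxed(T volatile* obj, T* exp, T desr);
</pre></blockquote>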

<p>
Not every memory ordering is appropriate for every operation.  <tt>exchange</tt>
and the <tt>fetch_<i>op</i></tt> operations support all six.  But <tt>load</tt>
only supports <tt>relaxed</tt>, <tt>consume</tt>, <tt>acquire</tt>, and
<tt>seq_cst</tt>; <tt>store</tt> only supports <tt>relaxed</tt>,
<tt>release</tt>, and <tt>seq_cst</tt>.  The <tt>compare_exchange</tt>
operations support the following 16 combinations out of the possible 36:
</p>

<blockquote><pre>
relaxed_relaxed
consume_relaxed
consume_consume
acquire_relaxed
acquire_consume
acquire_acquire
release_relaxed
release_consume
release_acquire
acq_rel_relaxed
acq_rel_consume
acq_rel_acquire
seq_cst_relaxed
seq_cst_consume
seq_cst_acquire
seq_cst_seq_cst
</pre></blockquote>

<p>
Again, the compiler supplies intrinsics only for the strongest orderings where
it can make a difference.  The library takes care of calling the weakest
supplied intrinsic that is as strong or stronger than the customer asked for.
</p>

</div>
</body>
</html>
459