# pickletools.py revision 9594942716a8f9c557b85d31751753d89cd7cebf
'''"Executable documentation" for the pickle module.

Extensive comments about the pickle protocols and pickle-machine opcodes
can be found here.  Some functions meant for external use:

genops(pickle)
   Generate all the opcodes in a pickle, as (opcode, arg, position) triples.

dis(pickle, out=None, memo=None, indentlevel=4)
   Print a symbolic disassembly of a pickle.
'''

import codecs
import pickle
import re
import sys

__all__ = ['dis', 'genops', 'optimize']

bytes_types = pickle.bytes_types

# Other ideas:
#
# - A pickle verifier:  read a pickle and check it exhaustively for
#   well-formedness.  dis() does a lot of this already.
#
# - A protocol identifier:  examine a pickle and return its protocol number
#   (== the highest .proto attr value among all the opcodes in the pickle).
#   dis() already prints this info at the end.
#
# - A pickle optimizer:  for example, tuple-building code is sometimes more
#   elaborate than necessary, catering for the possibility that the tuple
#   is recursive.  Or lots of times a PUT is generated that's never accessed
#   by a later GET.


"""
"A pickle" is a program for a virtual pickle machine (PM, but more accurately
called an unpickling machine).  It's a sequence of opcodes, interpreted by the
PM, building an arbitrarily complex Python object.

For the most part, the PM is very simple:  there are no looping, testing, or
conditional instructions, no arithmetic and no function calls.  Opcodes are
executed once each, from first to last, until a STOP opcode is reached.

The PM has two data areas, "the stack" and "the memo".

Many opcodes push Python objects onto the stack; e.g., INT pushes a Python
integer object on the stack, whose value is gotten from a decimal string
literal immediately following the INT opcode in the pickle bytestream.  Other
opcodes take Python objects off the stack.  The result of unpickling is
whatever object is left on the stack when the final STOP opcode is executed.

The memo is simply an array of objects, or it can be implemented as a dict
mapping little integers to objects.  The memo serves as the PM's "long term
memory", and the little integers indexing the memo are akin to variable
names.  Some opcodes store the object on top of the stack into the memo at a
given index (without popping it), and others push a memo object at a given
index onto the stack again.
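
As a rough illustration of the two data areas, here is a toy machine.  It is
not the real PM:  the opcode names are borrowed, but the tuple-based program
format and the run_toy_pm helper are made up for this sketch.  It builds the
list [a, a], with the memo preserving the sharing:

```python
def run_toy_pm(program):
    # Toy interpreter: one stack, one memo, opcodes run once each in order.
    stack, memo = [], {}
    for op, *args in program:
        if op == "EMPTY_LIST":
            stack.append([])
        elif op == "PUT":
            # Remember the object on top of the stack; it stays on the stack.
            memo[args[0]] = stack[-1]
        elif op == "GET":
            # Push the remembered object again (same object, not a copy).
            stack.append(memo[args[0]])
        elif op == "APPEND":
            item = stack.pop()
            stack[-1].append(item)
        elif op == "STOP":
            # Whatever is on top of the stack is the result.
            return stack.pop()
        else:
            raise ValueError("unknown toy opcode: %r" % (op,))
    raise ValueError("program ended without STOP")

program = [
    ("EMPTY_LIST",),  # the outer list
    ("EMPTY_LIST",),  # the inner list a
    ("PUT", 0),       # memo[0] = a
    ("APPEND",),      # outer is now [a]
    ("GET", 0),       # push a again
    ("APPEND",),      # outer is now [a, a]
    ("STOP",),
]
result = run_toy_pm(program)
```

After the run, result[0] and result[1] are the same list object, which is
the kind of sharing the memo exists to reproduce.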

At heart, that's all the PM has.  Subtleties arise for these reasons:

+ Object identity.  Objects can be arbitrarily complex, and subobjects
  may be shared (for example, the list [a, a] refers to the same object a
  twice).  It can be vital that unpickling recreate an isomorphic object
  graph, faithfully reproducing sharing.

+ Recursive objects.  For example, after "L = []; L.append(L)", L is a
  list, and L[0] is the same list.  This is related to the object identity
  point, and some sequences of pickle opcodes are subtle in order to
  get the right result in all cases.

+ Things pickle doesn't know everything about.  Examples of things pickle
  does know everything about are Python's builtin scalar and container
  types, like ints and tuples.  They generally have opcodes dedicated to
  them.  For things like module references and instances of user-defined
  classes, pickle's knowledge is limited.  Historically, many enhancements
  have been made to the pickle protocol in order to do a better (faster,
  and/or more compact) job on those.

+ Backward compatibility and micro-optimization.  As explained below,
  pickle opcodes never go away, not even when better ways to do a thing
  get invented.  The repertoire of the PM just keeps growing over time.
  For example, protocol 0 had two opcodes for building Python integers (INT
  and LONG), protocol 1 added three more for more-efficient pickling of short
  integers, and protocol 2 added two more for more-efficient pickling of
  long integers (before protocol 2, the only ways to pickle a Python long
  took time quadratic in the number of digits, for both pickling and
  unpickling).  "Opcode bloat" isn't so much a subtlety as a source of
  wearying complication.
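
The first two points can be observed from the outside with a round trip
through pickle.dumps() and pickle.loads().  This is only a small sketch, and
the variable names are chosen for illustration:

```python
import pickle

# Object identity: both slots of the list refer to the same inner list.
a = []
shared = [a, a]
shared_copy = pickle.loads(pickle.dumps(shared))

# Recursive objects: the list contains itself.
L = []
L.append(L)
L_copy = pickle.loads(pickle.dumps(L))
```

The copies are new objects, but their shape matches the originals:
shared_copy[0] is shared_copy[1], and L_copy[0] is L_copy.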


Pickle protocols:

For compatibility, the meaning of a pickle opcode never changes.  Instead new
pickle opcodes get added, and each version's unpickler can handle all the
pickle opcodes in all protocol versions to date.  So old pickles continue to
be readable forever.  The pickler can generally be told to restrict itself to
the subset of opcodes available under previous protocol versions too, so that
users can create pickles under the current version readable by older
versions.  However, a pickle does not contain its version number embedded
within it.  If an older unpickler tries to read a pickle using a later
protocol, the result is most likely an exception due to seeing an unknown (in
the older unpickler) opcode.
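
A sketch of restricting the pickler to an older protocol, using this
module's genops() to inspect the result.  The helper names opcode_names and
highest_proto are invented for this example:

```python
import pickle
import pickletools

def opcode_names(data):
    # genops() yields (opcode, arg, position) triples.
    return [op.name for op, arg, pos in pickletools.genops(data)]

def highest_proto(data):
    # Each opcode records the protocol version that introduced it.
    return max(op.proto for op, arg, pos in pickletools.genops(data))

obj = [1, 2]
old_style = pickle.dumps(obj, protocol=0)
new_style = pickle.dumps(obj, protocol=2)
```

Here highest_proto(old_style) is 0, so a protocol 0 unpickler knows every
opcode in it, while new_style uses opcodes introduced as late as protocol 2.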

The original pickle used what's now called "protocol 0", and what was called
"text mode" before Python 2.3.  The entire pickle bytestream is made up of
printable 7-bit ASCII characters, plus the newline character, in protocol 0.
That's why it was called text mode.  Protocol 0 is small and elegant, but
sometimes painfully inefficient.

The second major set of additions is now called "protocol 1", and was called
"binary mode" before Python 2.3.  This added many opcodes with arguments
consisting of arbitrary bytes, including NUL bytes and unprintable "high bit"
bytes.  Binary mode pickles can be substantially smaller than equivalent
text mode pickles, and sometimes faster too; e.g., BININT represents a 4-byte
int as 4 bytes following the opcode, which is cheaper to unpickle than the
(perhaps) 11-character decimal string attached to INT.  Protocol 1 also added
a number of opcodes that operate on many stack elements at once (like APPENDS
and SETITEMS), and "shortcut" opcodes (like EMPTY_DICT and EMPTY_TUPLE).
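
The difference between the two modes is easy to see for a single int.  This
is a sketch only; the value is arbitrary and the sizes depend on what is
pickled:

```python
import pickle

value = 12345678
text_mode = pickle.dumps(value, protocol=0)
binary_mode = pickle.dumps(value, protocol=1)

# Protocol 0 output is printable ASCII plus newlines.
text_is_printable = all(32 <= b < 127 or b == 10 for b in text_mode)

# Protocol 1 output may contain NUL bytes and high-bit bytes.
binary_has_raw_bytes = any(b == 0 or b >= 128 for b in binary_mode)
```

For this value the binary pickle is shorter than the text pickle, and both
load back to the same int.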

The third major set of additions came in Python 2.3, and is called "protocol
2".  This added:

- A better way to pickle instances of new-style classes (NEWOBJ).

- A way for a pickle to identify its protocol (PROTO).

- Time- and space-efficient pickling of long ints (LONG{1,4}).

- Shortcuts for small tuples (TUPLE{1,2,3}).

- Dedicated opcodes for bools (NEWTRUE, NEWFALSE).

- The "extension registry", a vector of popular objects that can be pushed
  efficiently by index (EXT{1,2,4}).  This is akin to the memo and GET, but
  the registry contents are predefined (there's nothing akin to the memo's
  PUT).
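
Several of these opcodes show up when a small tuple holding a bool and a
large int is pickled with protocol 2.  This is a sketch; other opcodes, such
as memo PUTs, may also appear depending on the pickler:

```python
import pickle
import pickletools

# A 2-tuple holding a bool and an int too large for a 4-byte BININT.
data = pickle.dumps((True, 2**70), protocol=2)

# Collect the symbolic opcode names in the order they appear.
names = [op.name for op, arg, pos in pickletools.genops(data)]
```

The names list starts with PROTO and includes NEWTRUE, LONG1 and TUPLE2.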

Another independent change with Python 2.3 is the abandonment of any
pretense that it might be safe to load pickles received from untrusted
parties -- no sufficient security analysis has been done to guarantee
this and there isn't a use case that warrants the expense of such an
analysis.

To this end, all tests for __safe_for_unpickling__ or for
copyreg.safe_constructors are removed from the unpickling code.
References to these variables in the descriptions below are to be seen
as describing unpickling in Python 2.2 and before.
"""

# Meta-rule:  Descriptions are stored in instances of descriptor objects,
# with plain constructors.  No meta-language is defined from which
# descriptors could be constructed.  If you want, e.g., XML, write a little
# program to generate XML from the objects.

##############################################################################
# Some pickle opcodes have an argument, following the opcode in the
# bytestream.  An argument is of a specific type, described by an instance
# of ArgumentDescriptor.  These are not to be confused with arguments taken
# off the stack -- ArgumentDescriptor applies only to arguments embedded in
# the opcode stream, immediately following an opcode.

# Represents the number of bytes consumed by an argument delimited by the
# next newline character.
UP_TO_NEWLINE = -1

# Represents the number of bytes consumed by a two-argument opcode where
# the first argument gives the number of bytes in the second argument.
TAKEN_FROM_ARGUMENT1  = -2   # num bytes is 1-byte unsigned int
TAKEN_FROM_ARGUMENT4  = -3   # num bytes is 4-byte signed little-endian int
TAKEN_FROM_ARGUMENT4U = -4   # num bytes is 4-byte unsigned little-endian int

class ArgumentDescriptor(object):
    __slots__ = (
        # name of descriptor record, also a module global name; a string
        'name',

        # length of argument, in bytes; an int; UP_TO_NEWLINE and
        # TAKEN_FROM_ARGUMENT{1,4,4U} are negative values for variable-length
        # cases
        'n',

        # a function taking a file-like object, reading this kind of argument
        # from the object at the current position, advancing the current
        # position by n bytes, and returning the value of the argument
        'reader',

        # human-readable docs for this arg descriptor; a string
        'doc',
    )

    def __init__(self, name, n, reader, doc):
        assert isinstance(name, str)
        self.name = name

        assert isinstance(n, int) and (n >= 0 or
                                       n in (UP_TO_NEWLINE,
                                             TAKEN_FROM_ARGUMENT1,
                                             TAKEN_FROM_ARGUMENT4,
                                             TAKEN_FROM_ARGUMENT4U))
        self.n = n

        self.reader = reader

        assert isinstance(doc, str)
        self.doc = doc

from struct import unpack as _unpack

def read_uint1(f):
    r"""
    >>> import io
    >>> read_uint1(io.BytesIO(b'\xff'))
    255
    """

    data = f.read(1)
    if data:
        return data[0]
    raise ValueError("not enough data in stream to read uint1")

uint1 = ArgumentDescriptor(
            name='uint1',
            n=1,
            reader=read_uint1,
            doc="One-byte unsigned integer.")


def read_uint2(f):
    r"""
    >>> import io
    >>> read_uint2(io.BytesIO(b'\xff\x00'))
    255
    >>> read_uint2(io.BytesIO(b'\xff\xff'))
    65535
    """

    data = f.read(2)
    if len(data) == 2:
        return _unpack("<H", data)[0]
    raise ValueError("not enough data in stream to read uint2")

uint2 = ArgumentDescriptor(
            name='uint2',
            n=2,
            reader=read_uint2,
            doc="Two-byte unsigned integer, little-endian.")


def read_int4(f):
    r"""
    >>> import io
    >>> read_int4(io.BytesIO(b'\xff\x00\x00\x00'))
    255
    >>> read_int4(io.BytesIO(b'\x00\x00\x00\x80')) == -(2**31)
    True
    """

    data = f.read(4)
    if len(data) == 4:
        return _unpack("<i", data)[0]
    raise ValueError("not enough data in stream to read int4")

int4 = ArgumentDescriptor(
           name='int4',
           n=4,
           reader=read_int4,
           doc="Four-byte signed integer, little-endian, 2's complement.")


def read_uint4(f):
    r"""
    >>> import io
    >>> read_uint4(io.BytesIO(b'\xff\x00\x00\x00'))
    255
    >>> read_uint4(io.BytesIO(b'\x00\x00\x00\x80')) == 2**31
    True
    """

    data = f.read(4)
    if len(data) == 4:
        return _unpack("<I", data)[0]
    raise ValueError("not enough data in stream to read uint4")

uint4 = ArgumentDescriptor(
            name='uint4',
            n=4,
            reader=read_uint4,
            doc="Four-byte unsigned integer, little-endian.")


def read_stringnl(f, decode=True, stripquotes=True):
    r"""
    >>> import io
    >>> read_stringnl(io.BytesIO(b"'abcd'\nefg\n"))
    'abcd'

    >>> read_stringnl(io.BytesIO(b"\n"))
    Traceback (most recent call last):
    ...
    ValueError: no string quotes around b''

    >>> read_stringnl(io.BytesIO(b"\n"), stripquotes=False)
    ''

    >>> read_stringnl(io.BytesIO(b"''\n"))
    ''

    >>> read_stringnl(io.BytesIO(b'"abcd"'))
    Traceback (most recent call last):
    ...
    ValueError: no newline found when trying to read stringnl

    Embedded escapes are undone in the result.
    >>> read_stringnl(io.BytesIO(br"'a\n\\b\x00c\td'" + b"\n'e'"))
    'a\n\\b\x00c\td'
    """

    data = f.readline()
    if not data.endswith(b'\n'):
        raise ValueError("no newline found when trying to read stringnl")
    data = data[:-1]    # lose the newline

    if stripquotes:
        for q in (b'"', b"'"):
            if data.startswith(q):
                if not data.endswith(q):
                    raise ValueError("string quote %r not found at both "
                                     "ends of %r" % (q, data))
                data = data[1:-1]
                break
        else:
            raise ValueError("no string quotes around %r" % data)

    if decode:
        data = codecs.escape_decode(data)[0].decode("ascii")
    return data

stringnl = ArgumentDescriptor(
               name='stringnl',
               n=UP_TO_NEWLINE,
               reader=read_stringnl,
               doc="""A newline-terminated string.

                   This is a repr-style string, with embedded escapes, and
                   bracketing quotes.
                   """)

def read_stringnl_noescape(f):
    return read_stringnl(f, stripquotes=False)

stringnl_noescape = ArgumentDescriptor(
                        name='stringnl_noescape',
                        n=UP_TO_NEWLINE,
                        reader=read_stringnl_noescape,
                        doc="""A newline-terminated string.

                        This is a str-style string, without embedded escapes,
                        or bracketing quotes.  It should consist solely of
                        printable ASCII characters.
                        """)

def read_stringnl_noescape_pair(f):
    r"""
    >>> import io
    >>> read_stringnl_noescape_pair(io.BytesIO(b"Queue\nEmpty\njunk"))
    'Queue Empty'
    """

    return "%s %s" % (read_stringnl_noescape(f), read_stringnl_noescape(f))

stringnl_noescape_pair = ArgumentDescriptor(
                             name='stringnl_noescape_pair',
                             n=UP_TO_NEWLINE,
                             reader=read_stringnl_noescape_pair,
                             doc="""A pair of newline-terminated strings.

                             These are str-style strings, without embedded
                             escapes, or bracketing quotes.  They should
                             consist solely of printable ASCII characters.
                             The pair is returned as a single string, with
                             a single blank separating the two strings.
                             """)

def read_string4(f):
    r"""
    >>> import io
    >>> read_string4(io.BytesIO(b"\x00\x00\x00\x00abc"))
    ''
    >>> read_string4(io.BytesIO(b"\x03\x00\x00\x00abcdef"))
    'abc'
    >>> read_string4(io.BytesIO(b"\x00\x00\x00\x03abcdef"))
    Traceback (most recent call last):
    ...
    ValueError: expected 50331648 bytes in a string4, but only 6 remain
    """

    n = read_int4(f)
    if n < 0:
        raise ValueError("string4 byte count < 0: %d" % n)
    data = f.read(n)
    if len(data) == n:
        return data.decode("latin-1")
    raise ValueError("expected %d bytes in a string4, but only %d remain" %
                     (n, len(data)))

string4 = ArgumentDescriptor(
              name="string4",
              n=TAKEN_FROM_ARGUMENT4,
              reader=read_string4,
              doc="""A counted string.

              The first argument is a 4-byte little-endian signed int giving
              the number of bytes in the string, and the second argument is
              that many bytes.
              """)


def read_string1(f):
    r"""
    >>> import io
    >>> read_string1(io.BytesIO(b"\x00"))
    ''
    >>> read_string1(io.BytesIO(b"\x03abcdef"))
    'abc'
    """

    n = read_uint1(f)
    assert n >= 0
    data = f.read(n)
    if len(data) == n:
        return data.decode("latin-1")
    raise ValueError("expected %d bytes in a string1, but only %d remain" %
                     (n, len(data)))

string1 = ArgumentDescriptor(
              name="string1",
              n=TAKEN_FROM_ARGUMENT1,
              reader=read_string1,
              doc="""A counted string.

              The first argument is a 1-byte unsigned int giving the number
              of bytes in the string, and the second argument is that many
              bytes.
              """)


def read_bytes1(f):
    r"""
    >>> import io
    >>> read_bytes1(io.BytesIO(b"\x00"))
    b''
    >>> read_bytes1(io.BytesIO(b"\x03abcdef"))
    b'abc'
    """

    n = read_uint1(f)
    assert n >= 0
    data = f.read(n)
    if len(data) == n:
        return data
    raise ValueError("expected %d bytes in a bytes1, but only %d remain" %
                     (n, len(data)))

bytes1 = ArgumentDescriptor(
              name="bytes1",
              n=TAKEN_FROM_ARGUMENT1,
              reader=read_bytes1,
              doc="""A counted bytes string.

              The first argument is a 1-byte unsigned int giving the number
              of bytes, and the second argument is that many bytes.
              """)


def read_bytes4(f):
    r"""
    >>> import io
    >>> read_bytes4(io.BytesIO(b"\x00\x00\x00\x00abc"))
    b''
    >>> read_bytes4(io.BytesIO(b"\x03\x00\x00\x00abcdef"))
    b'abc'
    >>> read_bytes4(io.BytesIO(b"\x00\x00\x00\x03abcdef"))
    Traceback (most recent call last):
    ...
    ValueError: expected 50331648 bytes in a bytes4, but only 6 remain
    """

    n = read_uint4(f)
    if n > sys.maxsize:
        raise ValueError("bytes4 byte count > sys.maxsize: %d" % n)
    data = f.read(n)
    if len(data) == n:
        return data
    raise ValueError("expected %d bytes in a bytes4, but only %d remain" %
                     (n, len(data)))

bytes4 = ArgumentDescriptor(
              name="bytes4",
              n=TAKEN_FROM_ARGUMENT4U,
              reader=read_bytes4,
              doc="""A counted bytes string.

              The first argument is a 4-byte little-endian unsigned int giving
              the number of bytes, and the second argument is that many bytes.
              """)


def read_unicodestringnl(f):
    r"""
    >>> import io
    >>> read_unicodestringnl(io.BytesIO(b"abc\\uabcd\njunk")) == 'abc\uabcd'
    True
    """

    data = f.readline()
    if not data.endswith(b'\n'):
        raise ValueError("no newline found when trying to read "
                         "unicodestringnl")
    data = data[:-1]    # lose the newline
    return str(data, 'raw-unicode-escape')

unicodestringnl = ArgumentDescriptor(
                      name='unicodestringnl',
                      n=UP_TO_NEWLINE,
                      reader=read_unicodestringnl,
                      doc="""A newline-terminated Unicode string.

                      This is raw-unicode-escape encoded, so consists of
                      printable ASCII characters, and may contain embedded
                      escape sequences.
                      """)

def read_unicodestring4(f):
    r"""
    >>> import io
    >>> s = 'abcd\uabcd'
    >>> enc = s.encode('utf-8')
    >>> enc
    b'abcd\xea\xaf\x8d'
    >>> n = bytes([len(enc), 0, 0, 0])  # little-endian 4-byte length
    >>> t = read_unicodestring4(io.BytesIO(n + enc + b'junk'))
    >>> s == t
    True

    >>> read_unicodestring4(io.BytesIO(n + enc[:-1]))
    Traceback (most recent call last):
    ...
    ValueError: expected 7 bytes in a unicodestring4, but only 6 remain
    """

    n = read_uint4(f)
    if n > sys.maxsize:
        raise ValueError("unicodestring4 byte count > sys.maxsize: %d" % n)
    data = f.read(n)
    if len(data) == n:
        return str(data, 'utf-8', 'surrogatepass')
    raise ValueError("expected %d bytes in a unicodestring4, but only %d "
                     "remain" % (n, len(data)))

unicodestring4 = ArgumentDescriptor(
                    name="unicodestring4",
                    n=TAKEN_FROM_ARGUMENT4U,
                    reader=read_unicodestring4,
                    doc="""A counted Unicode string.

                    The first argument is a 4-byte little-endian unsigned int
                    giving the number of bytes in the string, and the second
                    argument -- the UTF-8 encoding of the Unicode string --
                    contains that many bytes.
                    """)


def read_decimalnl_short(f):
    r"""
    >>> import io
    >>> read_decimalnl_short(io.BytesIO(b"1234\n56"))
    1234

    >>> read_decimalnl_short(io.BytesIO(b"1234L\n56"))
    Traceback (most recent call last):
    ...
    ValueError: invalid literal for int() with base 10: b'1234L'
    """

    s = read_stringnl(f, decode=False, stripquotes=False)

    # There's a hack for True and False here.
    if s == b"00":
        return False
    elif s == b"01":
        return True

    return int(s)

def read_decimalnl_long(f):
    r"""
    >>> import io

    >>> read_decimalnl_long(io.BytesIO(b"1234L\n56"))
    1234

    >>> read_decimalnl_long(io.BytesIO(b"123456789012345678901234L\n6"))
    123456789012345678901234
    """

    s = read_stringnl(f, decode=False, stripquotes=False)
    if s[-1:] == b'L':
        s = s[:-1]
    return int(s)


decimalnl_short = ArgumentDescriptor(
                      name='decimalnl_short',
                      n=UP_TO_NEWLINE,
                      reader=read_decimalnl_short,
                      doc="""A newline-terminated decimal integer literal.

                          This never has a trailing 'L', and the integer fit
                          in a short Python int on the box where the pickle
                          was written -- but there's no guarantee it will fit
                          in a short Python int on the box where the pickle
                          is read.
                          """)

decimalnl_long = ArgumentDescriptor(
                     name='decimalnl_long',
                     n=UP_TO_NEWLINE,
                     reader=read_decimalnl_long,
                     doc="""A newline-terminated decimal integer literal.

                         This has a trailing 'L', and can represent integers
                         of any size.
                         """)


def read_floatnl(f):
    r"""
    >>> import io
    >>> read_floatnl(io.BytesIO(b"-1.25\n6"))
    -1.25
    """
    s = read_stringnl(f, decode=False, stripquotes=False)
    return float(s)

floatnl = ArgumentDescriptor(
              name='floatnl',
              n=UP_TO_NEWLINE,
              reader=read_floatnl,
              doc="""A newline-terminated decimal floating literal.

              In general this requires 17 significant digits for roundtrip
              identity, and pickling then unpickling infinities, NaNs, and
              minus zero doesn't work across boxes, or on some boxes even
              on itself (e.g., Windows can't read the strings it produces
              for infinities or NaNs).
              """)

def read_float8(f):
    r"""
    >>> import io, struct
    >>> raw = struct.pack(">d", -1.25)
    >>> raw
    b'\xbf\xf4\x00\x00\x00\x00\x00\x00'
    >>> read_float8(io.BytesIO(raw + b"\n"))
    -1.25
    """

    data = f.read(8)
    if len(data) == 8:
        return _unpack(">d", data)[0]
    raise ValueError("not enough data in stream to read float8")


float8 = ArgumentDescriptor(
             name='float8',
             n=8,
             reader=read_float8,
             doc="""An 8-byte binary representation of a float, big-endian.

             The format is unique to Python, and shared with the struct
             module (format string '>d') "in theory" (the struct and pickle
             implementations don't share the code -- they should).  It's
             strongly related to the IEEE-754 double format, and, in normal
             cases, is in fact identical to the big-endian 754 double format.
             On other boxes the dynamic range is limited to that of a 754
             double, and "add a half and chop" rounding is used to reduce
             the precision to 53 bits.  However, even on a 754 box,
             infinities, NaNs, and minus zero may not be handled correctly
             (may not survive roundtrip pickling intact).
             """)

# Protocol 2 formats

from pickle import decode_long

def read_long1(f):
    r"""
    >>> import io
    >>> read_long1(io.BytesIO(b"\x00"))
    0
    >>> read_long1(io.BytesIO(b"\x02\xff\x00"))
    255
    >>> read_long1(io.BytesIO(b"\x02\xff\x7f"))
    32767
    >>> read_long1(io.BytesIO(b"\x02\x00\xff"))
    -256
    >>> read_long1(io.BytesIO(b"\x02\x00\x80"))
    -32768
    """

    n = read_uint1(f)
    data = f.read(n)
    if len(data) != n:
        raise ValueError("not enough data in stream to read long1")
    return decode_long(data)

long1 = ArgumentDescriptor(
    name="long1",
    n=TAKEN_FROM_ARGUMENT1,
    reader=read_long1,
    doc="""A binary long, little-endian, using 1-byte size.

    This first reads one byte as an unsigned size, then reads that
    many bytes and interprets them as a little-endian 2's-complement long.
    If the size is 0, that's taken as a shortcut for the int 0.
    """)

def read_long4(f):
    r"""
    >>> import io
    >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\xff\x00"))
    255
    >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\xff\x7f"))
    32767
    >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\x00\xff"))
    -256
    >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\x00\x80"))
    -32768
    >>> read_long4(io.BytesIO(b"\x00\x00\x00\x00"))
    0
    """

    n = read_int4(f)
    if n < 0:
        raise ValueError("long4 byte count < 0: %d" % n)
    data = f.read(n)
    if len(data) != n:
        raise ValueError("not enough data in stream to read long4")
    return decode_long(data)

long4 = ArgumentDescriptor(
    name="long4",
    n=TAKEN_FROM_ARGUMENT4,
    reader=read_long4,
    doc="""A binary representation of a long, little-endian.

    This first reads four bytes as a signed size (but requires the
    size to be >= 0), then reads that many bytes and interprets them
    as a little-endian 2's-complement long.  If the size is 0, that's taken
    as a shortcut for the int 0, although LONG1 should really be used
    then instead (and in any case where # of bytes < 256).
    """)


##############################################################################
# Object descriptors.  The stack used by the pickle machine holds objects,
# and in the stack_before and stack_after attributes of OpcodeInfo
# descriptors we need names to describe the various types of objects that can
# appear on the stack.

class StackObject(object):
    __slots__ = (
        # name of descriptor record, for info only
        'name',

        # type of object, or tuple of type objects (meaning the object can
        # be of any type in the tuple)
        'obtype',

        # human-readable docs for this kind of stack object; a string
        'doc',
    )

    def __init__(self, name, obtype, doc):
        assert isinstance(name, str)
        self.name = name

        assert isinstance(obtype, type) or isinstance(obtype, tuple)
        if isinstance(obtype, tuple):
            for contained in obtype:
                assert isinstance(contained, type)
        self.obtype = obtype

        assert isinstance(doc, str)
        self.doc = doc

    def __repr__(self):
        return self.name


pyint = StackObject(
            name='int',
            obtype=int,
            doc="A short (as opposed to long) Python integer object.")

pylong = StackObject(
             name='long',
             obtype=int,
             doc="A long (as opposed to short) Python integer object.")

pyinteger_or_bool = StackObject(
                        name='int_or_bool',
                        obtype=(int, bool),
                        doc="A Python integer object (short or long), or "
                            "a Python bool.")

pybool = StackObject(
             name='bool',
             obtype=(bool,),
             doc="A Python bool object.")

pyfloat = StackObject(
              name='float',
              obtype=float,
              doc="A Python float object.")

pystring = StackObject(
               name='string',
               obtype=bytes,
               doc="A Python (8-bit) string object.")

pybytes = StackObject(
               name='bytes',
               obtype=bytes,
               doc="A Python bytes object.")

pyunicode = StackObject(
                name='str',
                obtype=str,
                doc="A Python (Unicode) string object.")

pynone = StackObject(
             name="None",
             obtype=type(None),
             doc="The Python None object.")

pytuple = StackObject(
              name="tuple",
              obtype=tuple,
              doc="A Python tuple object.")

pylist = StackObject(
             name="list",
             obtype=list,
             doc="A Python list object.")

pydict = StackObject(
             name="dict",
             obtype=dict,
             doc="A Python dict object.")

anyobject = StackObject(
                name='any',
                obtype=object,
                doc="Any kind of object whatsoever.")

markobject = StackObject(
                 name="mark",
                 obtype=StackObject,
                 doc="""'The mark' is a unique object.

                 Opcodes that operate on a variable number of objects
                 generally don't embed the count of objects in the opcode,
                 or pull it off the stack.  Instead the MARK opcode is used
                 to push a special marker object on the stack, and then
                 some other opcodes grab all the objects from the top of
                 the stack down to (but not including) the topmost marker
                 object.
                 """)

stackslice = StackObject(
                 name="stackslice",
                 obtype=StackObject,
                 doc="""An object representing a contiguous slice of the stack.

                 This is used in conjunction with markobject, to represent all
                 of the stack following the topmost markobject.  For example,
                 the POP_MARK opcode changes the stack from

                     [..., markobject, stackslice]
                 to
893                     [...]
894
895                 No matter how many object are on the stack after the topmost
896                 markobject, POP_MARK gets rid of all of them (including the
897                 topmost markobject too).
898                 """)
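
# Illustrative sketch, not part of the original module.  It shows one way
# the markobject/stackslice convention described above could be carried out
# on a plain Python list used as the stack.  The names _demo_mark and
# _demo_pop_to_mark are hypothetical; the real unpicklers keep track of
# marks in their own way.

_demo_mark = object()   # stand-in for the unique mark object

def _demo_pop_to_mark(stack):
    """Remove the topmost mark and everything above it; return the slice."""
    # Scan from the top of the stack down to find the topmost mark.
    for k in range(len(stack) - 1, -1, -1):
        if stack[k] is _demo_mark:
            items = stack[k + 1:]   # the "stackslice", mark not included
            del stack[k:]           # drop the mark and the slice
            return items
    raise ValueError("no mark on the stack")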

##############################################################################
# Descriptors for pickle opcodes.

class OpcodeInfo(object):

    __slots__ = (
        # symbolic name of opcode; a string
        'name',

        # the code used in a bytestream to represent the opcode; a
        # one-character string
        'code',

        # If the opcode has an argument embedded in the byte string, an
        # instance of ArgumentDescriptor specifying its type.  Note that
        # arg.reader(s) can be used to read and decode the argument from
        # the bytestream s, and arg.doc documents the format of the raw
        # argument bytes.  If the opcode doesn't have an argument embedded
        # in the bytestream, arg should be None.
        'arg',

        # what the stack looks like before this opcode runs; a list
        'stack_before',

        # what the stack looks like after this opcode runs; a list
        'stack_after',

        # the protocol number in which this opcode was introduced; an int
        'proto',

        # human-readable docs for this opcode; a string
        'doc',
    )

    def __init__(self, name, code, arg,
                 stack_before, stack_after, proto, doc):
        assert isinstance(name, str)
        self.name = name

        assert isinstance(code, str)
        assert len(code) == 1
        self.code = code

        assert arg is None or isinstance(arg, ArgumentDescriptor)
        self.arg = arg

        assert isinstance(stack_before, list)
        for x in stack_before:
            assert isinstance(x, StackObject)
        self.stack_before = stack_before

        assert isinstance(stack_after, list)
        for x in stack_after:
            assert isinstance(x, StackObject)
        self.stack_after = stack_after

        assert isinstance(proto, int) and 0 <= proto <= pickle.HIGHEST_PROTOCOL
        self.proto = proto

        assert isinstance(doc, str)
        self.doc = doc
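
# Illustrative sketch, not part of the original module.  It shows how the
# OpcodeInfo fields above might be read back once the opcode table exists.
# The helper name _demo_opcode_summary is hypothetical.  It assumes the
# standard library's pickletools module is importable and exposes the
# `opcodes` list of OpcodeInfo instances; the import is done lazily so that
# nothing runs when this definition is merely loaded.

def _demo_opcode_summary(name):
    """Return (proto, stack_before names, stack_after names) for an opcode."""
    import pickletools
    for op in pickletools.opcodes:
        if op.name == name:
            before = [obj.name for obj in op.stack_before]
            after = [obj.name for obj in op.stack_after]
            return op.proto, before, after
    raise KeyError(name)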

I = OpcodeInfo
opcodes = [

    # Ways to spell integers.

    I(name='INT',
      code='I',
      arg=decimalnl_short,
      stack_before=[],
      stack_after=[pyinteger_or_bool],
      proto=0,
      doc="""Push an integer or bool.

      The argument is a newline-terminated decimal literal string.

      The intent may have been that this always fit in a short Python int,
      but INT can be generated in pickles written on a 64-bit box that
      require a Python long on a 32-bit box.  The difference between this
      and LONG then is that INT skips a trailing 'L', and produces a short
      int whenever possible.

      Another difference arises because, when bool was introduced as a
      distinct type in 2.3, builtin names True and False were also added to
      2.2.2, mapping to ints 1 and 0.  For compatibility in both directions,
      True gets pickled as INT + "01\\n", and False as INT + "00\\n".
      Leading zeroes are never produced for a genuine integer.  The 2.3
      (and later) unpicklers special-case these and return bool instead;
      earlier unpicklers ignore the leading "0" and return the int.
      """),

    I(name='BININT',
      code='J',
      arg=int4,
      stack_before=[],
      stack_after=[pyint],
      proto=1,
      doc="""Push a four-byte signed integer.

      This handles the full range of Python (short) integers on a 32-bit
      box, directly as binary bytes (1 for the opcode and 4 for the integer).
      If the integer is non-negative and fits in 1 or 2 bytes, pickling via
      BININT1 or BININT2 saves space.
      """),

    I(name='BININT1',
      code='K',
      arg=uint1,
      stack_before=[],
      stack_after=[pyint],
      proto=1,
      doc="""Push a one-byte unsigned integer.

      This is a space optimization for pickling very small non-negative ints,
      in range(256).
      """),

    I(name='BININT2',
      code='M',
      arg=uint2,
      stack_before=[],
      stack_after=[pyint],
      proto=1,
      doc="""Push a two-byte unsigned integer.

      This is a space optimization for pickling small positive ints, in
      range(256, 2**16).  Integers in range(256) can also be pickled via
      BININT2, but BININT1 instead saves a byte.
      """),

    I(name='LONG',
      code='L',
      arg=decimalnl_long,
      stack_before=[],
      stack_after=[pylong],
      proto=0,
      doc="""Push a long integer.

      The same as INT, except that the literal ends with 'L', and always
      unpickles to a Python long.  There doesn't seem to be a real purpose
      to the trailing 'L'.

      Note that LONG takes time quadratic in the number of digits when
      unpickling (this is simply due to the nature of decimal->binary
      conversion).  Proto 2 added linear-time (in C; still quadratic-time
      in Python) LONG1 and LONG4 opcodes.
      """),

    I(name="LONG1",
      code='\x8a',
      arg=long1,
      stack_before=[],
      stack_after=[pylong],
      proto=2,
      doc="""Long integer using one-byte length.

      A more efficient encoding of a Python long; the long1 encoding
      says it all."""),

    I(name="LONG4",
      code='\x8b',
      arg=long4,
      stack_before=[],
      stack_after=[pylong],
      proto=2,
      doc="""Long integer using four-byte length.

      A more efficient encoding of a Python long; the long4 encoding
      says it all."""),

    # Ways to spell strings (8-bit, not Unicode).

    I(name='STRING',
      code='S',
      arg=stringnl,
      stack_before=[],
      stack_after=[pystring],
      proto=0,
      doc="""Push a Python string object.

      The argument is a repr-style string, with bracketing quote characters,
      and perhaps embedded escapes.  The argument extends until the next
      newline character.  (Actually, the bytes are decoded into a str
      instance using the encoding given to the Unpickler constructor, or
      the default, 'ASCII'.)
      """),

    I(name='BINSTRING',
      code='T',
      arg=string4,
      stack_before=[],
      stack_after=[pystring],
      proto=1,
      doc="""Push a Python string object.

      There are two arguments:  the first is a 4-byte little-endian signed int
      giving the number of bytes in the string, and the second is that many
      bytes, which are taken literally as the string content.  (Actually,
      they are decoded into a str instance using the encoding given to the
      Unpickler constructor, or the default, 'ASCII'.)
      """),

    I(name='SHORT_BINSTRING',
      code='U',
      arg=string1,
      stack_before=[],
      stack_after=[pystring],
      proto=1,
      doc="""Push a Python string object.

      There are two arguments:  the first is a 1-byte unsigned int giving
      the number of bytes in the string, and the second is that many bytes,
      which are taken literally as the string content.  (Actually, they
      are decoded into a str instance using the encoding given to the
      Unpickler constructor, or the default, 'ASCII'.)
      """),

    # Bytes (protocol 3 only; older protocols don't support bytes at all)

    I(name='BINBYTES',
      code='B',
      arg=bytes4,
      stack_before=[],
      stack_after=[pybytes],
      proto=3,
      doc="""Push a Python bytes object.

      There are two arguments:  the first is a 4-byte little-endian unsigned int
      giving the number of bytes, and the second is that many bytes, which are
      taken literally as the bytes content.
      """),

    I(name='SHORT_BINBYTES',
      code='C',
      arg=bytes1,
      stack_before=[],
      stack_after=[pybytes],
      proto=3,
      doc="""Push a Python bytes object.

      There are two arguments:  the first is a 1-byte unsigned int giving
      the number of bytes, and the second is that many bytes, which are taken
      literally as the bytes content.
      """),

    # Ways to spell None.

    I(name='NONE',
      code='N',
      arg=None,
      stack_before=[],
      stack_after=[pynone],
      proto=0,
      doc="Push None on the stack."),

    # Ways to spell bools, starting with proto 2.  See INT for how this was
    # done before proto 2.

    I(name='NEWTRUE',
      code='\x88',
      arg=None,
      stack_before=[],
      stack_after=[pybool],
      proto=2,
      doc="""True.

      Push True onto the stack."""),

    I(name='NEWFALSE',
      code='\x89',
      arg=None,
      stack_before=[],
      stack_after=[pybool],
      proto=2,
      doc="""False.

      Push False onto the stack."""),

    # Ways to spell Unicode strings.

    I(name='UNICODE',
      code='V',
      arg=unicodestringnl,
      stack_before=[],
      stack_after=[pyunicode],
      proto=0,  # this may be pure-text, but it's a later addition
      doc="""Push a Python Unicode string object.

      The argument is a raw-unicode-escape encoding of a Unicode string,
      and so may contain embedded escape sequences.  The argument extends
      until the next newline character.
      """),

    I(name='BINUNICODE',
      code='X',
      arg=unicodestring4,
      stack_before=[],
      stack_after=[pyunicode],
      proto=1,
      doc="""Push a Python Unicode string object.

      There are two arguments:  the first is a 4-byte little-endian unsigned int
      giving the number of bytes in the string.  The second is that many
      bytes, and is the UTF-8 encoding of the Unicode string.
      """),

    # Ways to spell floats.

    I(name='FLOAT',
      code='F',
      arg=floatnl,
      stack_before=[],
      stack_after=[pyfloat],
      proto=0,
      doc="""Newline-terminated decimal float literal.

      The argument is repr(a_float), and in general requires 17 significant
      digits for roundtrip conversion to be an identity (this is so for
      IEEE-754 double precision values, which is what Python float maps to
      on most boxes).

      In general, FLOAT cannot be used to transport infinities, NaNs, or
      minus zero across boxes (or even on a single box, if the platform C
      library can't read the strings it produces for such things -- Windows
      is like that), but may do less damage than BINFLOAT on boxes with
      greater precision or dynamic range than IEEE-754 double.
      """),

    I(name='BINFLOAT',
      code='G',
      arg=float8,
      stack_before=[],
      stack_after=[pyfloat],
      proto=1,
      doc="""Float stored in binary form, with 8 bytes of data.

      This generally requires less than half the space of FLOAT encoding.
      In general, BINFLOAT cannot be used to transport infinities, NaNs, or
      minus zero, raises an exception if the exponent exceeds the range of
      an IEEE-754 double, and retains no more than 53 bits of precision (if
      there are more than that, "add a half and chop" rounding is used to
      cut it back to 53 significant bits).
      """),

    # Ways to build lists.

    I(name='EMPTY_LIST',
      code=']',
      arg=None,
      stack_before=[],
      stack_after=[pylist],
      proto=1,
      doc="Push an empty list."),

    I(name='APPEND',
      code='a',
      arg=None,
      stack_before=[pylist, anyobject],
      stack_after=[pylist],
      proto=0,
      doc="""Append an object to a list.

      Stack before:  ... pylist anyobject
      Stack after:   ... pylist+[anyobject]

      although pylist is really extended in-place.
      """),

    I(name='APPENDS',
      code='e',
      arg=None,
      stack_before=[pylist, markobject, stackslice],
      stack_after=[pylist],
      proto=1,
      doc="""Extend a list by a slice of stack objects.

      Stack before:  ... pylist markobject stackslice
      Stack after:   ... pylist+stackslice

      although pylist is really extended in-place.
      """),

    I(name='LIST',
      code='l',
      arg=None,
      stack_before=[markobject, stackslice],
      stack_after=[pylist],
      proto=0,
      doc="""Build a list out of the topmost stack slice, after markobject.

      All the stack entries following the topmost markobject are placed into
      a single Python list, which single list object replaces all of the
      stack from the topmost markobject onward.  For example,

      Stack before: ... markobject 1 2 3 'abc'
      Stack after:  ... [1, 2, 3, 'abc']
      """),

    # Ways to build tuples.

    I(name='EMPTY_TUPLE',
      code=')',
      arg=None,
      stack_before=[],
      stack_after=[pytuple],
      proto=1,
      doc="Push an empty tuple."),

    I(name='TUPLE',
      code='t',
      arg=None,
      stack_before=[markobject, stackslice],
      stack_after=[pytuple],
      proto=0,
      doc="""Build a tuple out of the topmost stack slice, after markobject.

      All the stack entries following the topmost markobject are placed into
      a single Python tuple, which single tuple object replaces all of the
      stack from the topmost markobject onward.  For example,

      Stack before: ... markobject 1 2 3 'abc'
      Stack after:  ... (1, 2, 3, 'abc')
      """),

    I(name='TUPLE1',
      code='\x85',
      arg=None,
      stack_before=[anyobject],
      stack_after=[pytuple],
      proto=2,
      doc="""Build a one-tuple out of the topmost item on the stack.

      This code pops one value off the stack and pushes a tuple of
      length 1 whose one item is that value back onto it.  In other
      words:

          stack[-1] = tuple(stack[-1:])
      """),

    I(name='TUPLE2',
      code='\x86',
      arg=None,
      stack_before=[anyobject, anyobject],
      stack_after=[pytuple],
      proto=2,
      doc="""Build a two-tuple out of the top two items on the stack.

      This code pops two values off the stack and pushes a tuple of
      length 2 whose items are those values back onto it.  In other
      words:

          stack[-2:] = [tuple(stack[-2:])]
      """),

    I(name='TUPLE3',
      code='\x87',
      arg=None,
      stack_before=[anyobject, anyobject, anyobject],
      stack_after=[pytuple],
      proto=2,
      doc="""Build a three-tuple out of the top three items on the stack.

      This code pops three values off the stack and pushes a tuple of
      length 3 whose items are those values back onto it.  In other
      words:

          stack[-3:] = [tuple(stack[-3:])]
      """),

    # Ways to build dicts.

    I(name='EMPTY_DICT',
      code='}',
      arg=None,
      stack_before=[],
      stack_after=[pydict],
      proto=1,
      doc="Push an empty dict."),

    I(name='DICT',
      code='d',
      arg=None,
      stack_before=[markobject, stackslice],
      stack_after=[pydict],
      proto=0,
      doc="""Build a dict out of the topmost stack slice, after markobject.

      All the stack entries following the topmost markobject are placed into
      a single Python dict, which single dict object replaces all of the
      stack from the topmost markobject onward.  The stack slice alternates
      key, value, key, value, ....  For example,

      Stack before: ... markobject 1 2 3 'abc'
      Stack after:  ... {1: 2, 3: 'abc'}
      """),

    I(name='SETITEM',
      code='s',
      arg=None,
      stack_before=[pydict, anyobject, anyobject],
      stack_after=[pydict],
      proto=0,
      doc="""Add a key+value pair to an existing dict.

      Stack before:  ... pydict key value
      Stack after:   ... pydict

      where pydict has been modified via pydict[key] = value.
      """),

    I(name='SETITEMS',
      code='u',
      arg=None,
      stack_before=[pydict, markobject, stackslice],
      stack_after=[pydict],
      proto=1,
      doc="""Add an arbitrary number of key+value pairs to an existing dict.

      The slice of the stack following the topmost markobject is taken as
      an alternating sequence of keys and values, added to the dict
      immediately under the topmost markobject.  Everything at and after the
      topmost markobject is popped, leaving the mutated dict at the top
      of the stack.

      Stack before:  ... pydict markobject key_1 value_1 ... key_n value_n
      Stack after:   ... pydict

      where pydict has been modified via pydict[key_i] = value_i for i in
      1, 2, ..., n, and in that order.
      """),

    # Stack manipulation.

    I(name='POP',
      code='0',
      arg=None,
      stack_before=[anyobject],
      stack_after=[],
      proto=0,
      doc="Discard the top stack item, shrinking the stack by one item."),

    I(name='DUP',
      code='2',
      arg=None,
      stack_before=[anyobject],
      stack_after=[anyobject, anyobject],
      proto=0,
      doc="Push the top stack item onto the stack again, duplicating it."),

    I(name='MARK',
      code='(',
      arg=None,
      stack_before=[],
      stack_after=[markobject],
      proto=0,
      doc="""Push markobject onto the stack.

      markobject is a unique object, used by other opcodes to identify a
      region of the stack containing a variable number of objects for them
      to work on.  See markobject.doc for more detail.
      """),

    I(name='POP_MARK',
      code='1',
      arg=None,
      stack_before=[markobject, stackslice],
      stack_after=[],
      proto=1,
      doc="""Pop all the stack objects at and above the topmost markobject.

      When an opcode using a variable number of stack objects is done,
      POP_MARK is used to remove those objects, and to remove the markobject
      that delimited their starting position on the stack.
      """),

    # Memo manipulation.  There are really only two operations (get and put),
    # each in all-text, "short binary", and "long binary" flavors.

    I(name='GET',
      code='g',
      arg=decimalnl_short,
      stack_before=[],
      stack_after=[anyobject],
      proto=0,
      doc="""Read an object from the memo and push it on the stack.

      The index of the memo object to push is given by the newline-terminated
      decimal string following.  BINGET and LONG_BINGET are space-optimized
      versions.
      """),

    I(name='BINGET',
      code='h',
      arg=uint1,
      stack_before=[],
      stack_after=[anyobject],
      proto=1,
      doc="""Read an object from the memo and push it on the stack.

      The index of the memo object to push is given by the 1-byte unsigned
      integer following.
      """),

    I(name='LONG_BINGET',
      code='j',
      arg=uint4,
      stack_before=[],
      stack_after=[anyobject],
      proto=1,
      doc="""Read an object from the memo and push it on the stack.

      The index of the memo object to push is given by the 4-byte unsigned
      little-endian integer following.
      """),

    I(name='PUT',
      code='p',
      arg=decimalnl_short,
      stack_before=[],
      stack_after=[],
      proto=0,
      doc="""Store the stack top into the memo.  The stack is not popped.

      The index of the memo location to write into is given by the
      newline-terminated decimal string following.  BINPUT and LONG_BINPUT
      are space-optimized versions.
      """),

    I(name='BINPUT',
      code='q',
      arg=uint1,
      stack_before=[],
      stack_after=[],
      proto=1,
      doc="""Store the stack top into the memo.  The stack is not popped.

      The index of the memo location to write into is given by the 1-byte
      unsigned integer following.
      """),

    I(name='LONG_BINPUT',
      code='r',
      arg=uint4,
      stack_before=[],
      stack_after=[],
      proto=1,
      doc="""Store the stack top into the memo.  The stack is not popped.

      The index of the memo location to write into is given by the 4-byte
      unsigned little-endian integer following.
      """),

    # Access the extension registry (predefined objects).  Akin to the GET
    # family.

    I(name='EXT1',
      code='\x82',
      arg=uint1,
      stack_before=[],
      stack_after=[anyobject],
      proto=2,
      doc="""Extension code.

      This code and the similar EXT2 and EXT4 allow using a registry
      of popular objects that are pickled by name, typically classes.
      It is envisioned that through a global negotiation and
      registration process, third parties can set up a mapping between
      ints and object names.

      In order to guarantee pickle interchangeability, the extension
      code registry ought to be global, although a range of codes may
      be reserved for private use.

      EXT1 has a 1-byte integer argument.  This is used to index into the
      extension registry, and the object at that index is pushed on the stack.
      """),

    I(name='EXT2',
      code='\x83',
      arg=uint2,
      stack_before=[],
      stack_after=[anyobject],
      proto=2,
      doc="""Extension code.

      See EXT1.  EXT2 has a two-byte integer argument.
      """),

    I(name='EXT4',
      code='\x84',
      arg=int4,
      stack_before=[],
      stack_after=[anyobject],
      proto=2,
      doc="""Extension code.

      See EXT1.  EXT4 has a four-byte integer argument.
      """),

    # Push a class object, or module function, on the stack, via its module
    # and name.

    I(name='GLOBAL',
      code='c',
      arg=stringnl_noescape_pair,
      stack_before=[],
      stack_after=[anyobject],
      proto=0,
      doc="""Push a global object (module.attr) on the stack.

      Two newline-terminated strings follow the GLOBAL opcode.  The first is
      taken as a module name, and the second as a class name.  The class
      object module.class is pushed on the stack.  More accurately, the
      object returned by self.find_class(module, class) is pushed on the
      stack, so unpickling subclasses can override this form of lookup.
      """),

    # Ways to build objects of classes pickle doesn't know about directly
    # (user-defined classes).  I despair of documenting this accurately
    # and comprehensibly -- you really have to read the pickle code to
    # find all the special cases.

    I(name='REDUCE',
      code='R',
      arg=None,
      stack_before=[anyobject, anyobject],
      stack_after=[anyobject],
      proto=0,
      doc="""Push an object built from a callable and an argument tuple.

      The opcode is named as a reminder of the __reduce__() method.

      Stack before: ... callable pytuple
      Stack after:  ... callable(*pytuple)

      The callable and the argument tuple are the first two items returned
      by a __reduce__ method.  Applying the callable to the argtuple is
      supposed to reproduce the original object, or at least get it started.
      If the __reduce__ method returns a 3-tuple, the last component is an
      argument to be passed to the object's __setstate__, and then the REDUCE
      opcode is followed by code to create setstate's argument, and then a
      BUILD opcode to apply __setstate__ to that argument.

      If not isinstance(callable, type), REDUCE complains unless the
      callable has been registered with the copyreg module's
      safe_constructors dict, or the callable has a magic
      '__safe_for_unpickling__' attribute with a true value.  I'm not sure
      why it does this, but I've sure seen this complaint often enough when
      I didn't want to <wink>.
      """),

    I(name='BUILD',
      code='b',
      arg=None,
      stack_before=[anyobject, anyobject],
      stack_after=[anyobject],
      proto=0,
      doc="""Finish building an object, via __setstate__ or dict update.

      Stack before: ... anyobject argument
      Stack after:  ... anyobject

      where anyobject may have been mutated, as follows:

      If the object has a __setstate__ method,

          anyobject.__setstate__(argument)

      is called.

      Else the argument must be a dict, the object must have a __dict__, and
      the object is updated via

          anyobject.__dict__.update(argument)
      """),

    I(name='INST',
      code='i',
      arg=stringnl_noescape_pair,
      stack_before=[markobject, stackslice],
      stack_after=[anyobject],
      proto=0,
      doc="""Build a class instance.

      This is the protocol 0 version of protocol 1's OBJ opcode.
      INST is followed by two newline-terminated strings, giving a
      module and class name, just as for the GLOBAL opcode (and see
      GLOBAL for more details about that).  self.find_class(module, name)
      is used to get a class object.

      In addition, all the objects on the stack following the topmost
      markobject are gathered into a tuple and popped (along with the
      topmost markobject), just as for the TUPLE opcode.

      Now it gets complicated.  If all of these are true:

        + The argtuple is empty (markobject was at the top of the stack
          at the start).

        + The class object does not have a __getinitargs__ attribute.

      then we want to create an old-style class instance without invoking
      its __init__() method (pickle has waffled on this over the years; not
      calling __init__() is current wisdom).  In this case, an instance of
      an old-style dummy class is created, and then we try to rebind its
      __class__ attribute to the desired class object.  If this succeeds,
      the new instance object is pushed on the stack, and we're done.

      Else (the argtuple is not empty, it's not an old-style class object,
      or the class object does have a __getinitargs__ attribute), the code
      first insists that the class object have a __safe_for_unpickling__
      attribute.  Unlike the __safe_for_unpickling__ check in REDUCE, it
      doesn't matter whether this attribute has a true or false value, it
      only matters whether it exists (XXX this is a bug).  If
      __safe_for_unpickling__ doesn't exist, UnpicklingError is raised.

      Else (the class object does have a __safe_for_unpickling__ attr),
      the class object obtained from INST's arguments is applied to the
      argtuple obtained from the stack, and the resulting instance object
      is pushed on the stack.

      NOTE:  checks for __safe_for_unpickling__ went away in Python 2.3.
      NOTE:  the distinction between old-style and new-style classes does
             not make sense in Python 3.
      """),

    I(name='OBJ',
      code='o',
      arg=None,
      stack_before=[markobject, anyobject, stackslice],
      stack_after=[anyobject],
      proto=1,
      doc="""Build a class instance.

      This is the protocol 1 version of protocol 0's INST opcode, and is
      very much like it.  The major difference is that the class object
      is taken off the stack, allowing it to be retrieved from the memo
      repeatedly if several instances of the same class are created.  This
      can be much more efficient (in both time and space) than repeatedly
      embedding the module and class names in INST opcodes.

      Unlike INST, OBJ takes no arguments from the opcode stream.  Instead
      the class object is taken off the stack, immediately above the
      topmost markobject:

      Stack before: ... markobject classobject stackslice
      Stack after:  ... new_instance_object

      As for INST, the remainder of the stack above the markobject is
      gathered into an argument tuple, and then the logic seems identical,
      except that no __safe_for_unpickling__ check is done (XXX this is
      a bug).  See INST for the gory details.

      NOTE:  In Python 2.3, INST and OBJ are identical except for how they
      get the class object.  That was always the intent; the implementations
      had diverged for accidental reasons.
      """),

    I(name='NEWOBJ',
      code='\x81',
      arg=None,
      stack_before=[anyobject, anyobject],
      stack_after=[anyobject],
      proto=2,
      doc="""Build an object instance.

      The stack before should be thought of as containing a class
      object followed by an argument tuple (the tuple being the stack
      top).  Call these cls and args.  They are popped off the stack,
      and the value returned by cls.__new__(cls, *args) is pushed back
      onto the stack.
      """),

    # Machine control.

    I(name='PROTO',
      code='\x80',
      arg=uint1,
      stack_before=[],
      stack_after=[],
      proto=2,
      doc="""Protocol version indicator.

      For protocol 2 and above, a pickle must start with this opcode.
      The argument is the protocol version, an int in range(2, 256).
      """),

    I(name='STOP',
      code='.',
      arg=None,
      stack_before=[anyobject],
      stack_after=[],
      proto=0,
      doc="""Stop the unpickling machine.

      Every pickle ends with this opcode.  The object at the top of the stack
      is popped, and that's the result of unpickling.  The stack should be
      empty then.
      """),

    # Ways to deal with persistent IDs.

    I(name='PERSID',
      code='P',
      arg=stringnl_noescape,
      stack_before=[],
      stack_after=[anyobject],
      proto=0,
      doc="""Push an object identified by a persistent ID.

      The pickle module doesn't define what a persistent ID means.  PERSID's
      argument is a newline-terminated str-style (no embedded escapes, no
      bracketing quote characters) string, which *is* "the persistent ID".
      The unpickler passes this string to self.persistent_load().  Whatever
      object that returns is pushed on the stack.  There is no implementation
      of persistent_load() in Python's unpickler:  it must be supplied by an
      unpickler subclass.
      """),

    I(name='BINPERSID',
      code='Q',
      arg=None,
      stack_before=[anyobject],
      stack_after=[anyobject],
      proto=1,
      doc="""Push an object identified by a persistent ID.

      Like PERSID, except the persistent ID is popped off the stack (instead
      of being a string embedded in the opcode bytestream).  The persistent
      ID is passed to self.persistent_load(), and whatever object that
      returns is pushed on the stack.  See PERSID for more detail.
      """),
]
del I

# Verify uniqueness of .name and .code members.
name2i = {}
code2i = {}

for i, d in enumerate(opcodes):
    if d.name in name2i:
        raise ValueError("repeated name %r at indices %d and %d" %
                         (d.name, name2i[d.name], i))
    if d.code in code2i:
        raise ValueError("repeated code %r at indices %d and %d" %
                         (d.code, code2i[d.code], i))

    name2i[d.name] = i
    code2i[d.code] = i

del name2i, code2i, i, d

##############################################################################
# Build a code2op dict, mapping opcode characters to OpcodeInfo records.
# Also ensure we've got the same stuff as pickle.py, although the
# introspection here is dicey.

code2op = {}
for d in opcodes:
    code2op[d.code] = d
del d

def assure_pickle_consistency(verbose=False):

    copy = code2op.copy()
    for name in pickle.__all__:
        if not re.match("[A-Z][A-Z0-9_]+$", name):
            if verbose:
                print("skipping %r: it doesn't look like an opcode name" % name)
            continue
        picklecode = getattr(pickle, name)
        if not isinstance(picklecode, bytes) or len(picklecode) != 1:
            if verbose:
                print("skipping %r: value %r doesn't look like a pickle "
                      "code" % (name, picklecode))
            continue
        picklecode = picklecode.decode("latin-1")
        if picklecode in copy:
            if verbose:
                print("checking name %r w/ code %r for consistency" % (
                      name, picklecode))
            d = copy[picklecode]
            if d.name != name:
                raise ValueError("for pickle code %r, pickle.py uses name %r "
                                 "but we're using name %r" % (picklecode,
                                                              name,
                                                              d.name))
            # Forget this one.  Any left over in copy at the end are a problem
            # of a different kind.
            del copy[picklecode]
        else:
            raise ValueError("pickle.py appears to have a pickle opcode with "
                             "name %r and code %r, but we don't" %
                             (name, picklecode))
    if copy:
        msg = ["we appear to have pickle opcodes that pickle.py doesn't have:"]
        for code, d in copy.items():
            msg.append("    name %r with code %r" % (d.name, code))
        raise ValueError("\n".join(msg))

assure_pickle_consistency()
del assure_pickle_consistency

##############################################################################
# A pickle opcode generator.

def genops(pickle):
    """Generate all the opcodes in a pickle.

    'pickle' is a file-like object, or bytes object, containing the pickle.

    Each opcode in the pickle is generated, from the current pickle position,
    stopping after a STOP opcode is delivered.  A triple is generated for
    each opcode:

        opcode, arg, pos

    opcode is an OpcodeInfo record, describing the current opcode.

    If the opcode has an argument embedded in the pickle, arg is its decoded
    value, as a Python object.  If the opcode doesn't have an argument, arg
    is None.

    If the pickle has a tell() method, pos was the value of pickle.tell()
    before reading the current opcode.  If the pickle is a bytes object,
    it's wrapped in a BytesIO object, and the latter's tell() result is
    used.  Else (the pickle doesn't have a tell(), and it's not obvious how
    to query its current position) pos is None.
    """

    if isinstance(pickle, bytes_types):
        import io
        pickle = io.BytesIO(pickle)

    if hasattr(pickle, "tell"):
        getpos = pickle.tell
    else:
        getpos = lambda: None

    while True:
        pos = getpos()
        code = pickle.read(1)
        opcode = code2op.get(code.decode("latin-1"))
        if opcode is None:
            if code == b"":
                raise ValueError("pickle exhausted before seeing STOP")
            else:
                raise ValueError("at position %s, opcode %r unknown" % (
                                 "<unknown>" if pos is None else pos,
                                 code))
        if opcode.arg is None:
            arg = None
        else:
            arg = opcode.arg.reader(pickle)
        yield opcode, arg, pos
        if code == b'.':
            assert opcode.name == 'STOP'
            break
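
"""
The (opcode, arg, pos) triples described in the genops() docstring can be
consumed as in the sketch below.  It is only an illustration:
opcode_histogram is a hypothetical helper, and the sketch imports
pickletools so that it runs on its own.

```python
import collections
import pickle
import pickletools

def opcode_histogram(data):
    'Count how often each opcode name occurs in the pickle bytes data.'
    counts = collections.Counter()
    for opcode, arg, pos in pickletools.genops(data):
        counts[opcode.name] += 1
    return counts

hist = opcode_histogram(pickle.dumps([1, 2, 3], 2))
print(hist['BININT1'])  # one BININT1 opcode per small int in the list

# pos is the byte offset at which each opcode starts.
first_opcode, first_arg, first_pos = next(pickletools.genops(pickle.dumps(None, 2)))
```
"""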

##############################################################################
# A pickle optimizer.

def optimize(p):
    'Optimize a pickle (a bytes object) by removing unused PUT opcodes'
    gets = set()            # set of args used by a GET opcode
    puts = []               # (arg, startpos, stoppos) for the PUT opcodes
    prevpos = None          # set to pos if previous opcode was a PUT
    for opcode, arg, pos in genops(p):
        if prevpos is not None:
            puts.append((prevarg, prevpos, pos))
            prevpos = None
        if 'PUT' in opcode.name:
            prevarg, prevpos = arg, pos
        elif 'GET' in opcode.name:
            gets.add(arg)

    # Copy the pickle bytes except for PUTs without a corresponding GET
    s = []
    i = 0
    for arg, start, stop in puts:
        j = stop if (arg in gets) else start
        s.append(p[i:j])
        i = stop
    s.append(p[i:])
    return b''.join(s)
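
"""
The effect of optimize() can be checked with a round trip, as in the sketch
below.  It is an illustration only; the names shared, value, data, smaller
and restored are made up for the example.

```python
import pickle
import pickletools

shared = []
value = [shared, shared, (3, 4)]

data = pickle.dumps(value, 2)
smaller = pickletools.optimize(data)

# The optimized pickle is shorter but still loads to an equal object.
restored = pickle.loads(smaller)
print(len(smaller) < len(data), restored == value)

# The PUT for `shared` is kept because a later GET refers to it, so the
# first two elements are still one and the same list after loading.
print(restored[0] is restored[1])
```
"""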

##############################################################################
# A symbolic pickle disassembler.

def dis(pickle, out=None, memo=None, indentlevel=4, annotate=0):
    """Produce a symbolic disassembly of a pickle.

    'pickle' is a file-like object, or bytes object, containing a (at least
    one) pickle.  The pickle is disassembled from the current position,
    through the first STOP opcode encountered.

    Optional arg 'out' is a file-like object to which the disassembly is
    printed.  It defaults to sys.stdout.

    Optional arg 'memo' is a Python dict, used as the pickle's memo.  It
    may be mutated by dis(), if the pickle contains PUT, BINPUT, or
    LONG_BINPUT opcodes.
    Passing the same memo object to another dis() call then allows disassembly
    to proceed across multiple pickles that were all created by the same
    pickler with the same memo.  Ordinarily you don't need to worry about this.

    Optional arg 'indentlevel' is the number of blanks by which to indent
    a new MARK level.  It defaults to 4.

    Optional arg 'annotate', if nonzero, instructs dis() to add a short
    description of the opcode on each line of disassembled output.
    The value given to 'annotate' must be an integer and is used as a
    hint for the column where annotation should start.  The default
    value is 0, meaning no annotations.

    In addition to printing the disassembly, some sanity checks are made:

    + All embedded opcode arguments "make sense".

    + Explicit and implicit pop operations have enough items on the stack.

    + When an opcode implicitly refers to a markobject, a markobject is
      actually on the stack.

    + A memo entry isn't referenced before it's defined.

    + The markobject isn't stored in the memo.

    + A memo entry isn't redefined.
    """

    # Most of the hair here is for sanity checks, but most of it is needed
    # anyway to detect when a protocol 0 POP takes a MARK off the stack
    # (which in turn is needed to indent MARK blocks correctly).

    stack = []          # crude emulation of unpickler stack
    if memo is None:
        memo = {}       # crude emulation of unpickler memo
    maxproto = -1       # max protocol number seen
    markstack = []      # bytecode positions of MARK opcodes
    indentchunk = ' ' * indentlevel
    errormsg = None
    annocol = annotate  # column hint for annotations
    for opcode, arg, pos in genops(pickle):
        if pos is not None:
            print("%5d:" % pos, end=' ', file=out)

        line = "%-4s %s%s" % (repr(opcode.code)[1:-1],
                              indentchunk * len(markstack),
                              opcode.name)

        maxproto = max(maxproto, opcode.proto)
        before = opcode.stack_before    # don't mutate
        after = opcode.stack_after      # don't mutate
        numtopop = len(before)

        # See whether a MARK should be popped.
        markmsg = None
        if markobject in before or (opcode.name == "POP" and
                                    stack and
                                    stack[-1] is markobject):
            assert markobject not in after
            if __debug__:
                if markobject in before:
                    assert before[-1] is stackslice
            if markstack:
                markpos = markstack.pop()
                if markpos is None:
                    markmsg = "(MARK at unknown opcode offset)"
                else:
                    markmsg = "(MARK at %d)" % markpos
                # Pop everything at and after the topmost markobject.
                while stack[-1] is not markobject:
                    stack.pop()
                stack.pop()
                # Stop later code from popping too much.
                try:
                    numtopop = before.index(markobject)
                except ValueError:
                    assert opcode.name == "POP"
                    numtopop = 0
            else:
                errormsg = markmsg = "no MARK exists on stack"

        # Check for correct memo usage.
        if opcode.name in ("PUT", "BINPUT", "LONG_BINPUT"):
            assert arg is not None
            if arg in memo:
                errormsg = "memo key %r already defined" % arg
            elif not stack:
                errormsg = "stack is empty -- can't store into memo"
            elif stack[-1] is markobject:
                errormsg = "can't store markobject in the memo"
            else:
                memo[arg] = stack[-1]

        elif opcode.name in ("GET", "BINGET", "LONG_BINGET"):
            if arg in memo:
                assert len(after) == 1
                after = [memo[arg]]     # for better stack emulation
            else:
                errormsg = "memo key %r has never been stored into" % arg

        if arg is not None or markmsg:
            # make a mild effort to align arguments
            line += ' ' * (10 - len(opcode.name))
            if arg is not None:
                line += ' ' + repr(arg)
            if markmsg:
                line += ' ' + markmsg
        if annotate:
            line += ' ' * (annocol - len(line))
            # make a mild effort to align annotations
            annocol = len(line)
            if annocol > 50:
                annocol = annotate
            line += ' ' + opcode.doc.split('\n', 1)[0]
        print(line, file=out)

        if errormsg:
            # Note that we delayed complaining until the offending opcode
            # was printed.
            raise ValueError(errormsg)

        # Emulate the stack effects.
        if len(stack) < numtopop:
            raise ValueError("tries to pop %d items from stack with "
                             "only %d items" % (numtopop, len(stack)))
        if numtopop:
            del stack[-numtopop:]
        if markobject in after:
            assert markobject not in before
            markstack.append(pos)

        stack.extend(after)

    print("highest protocol among opcodes =", maxproto, file=out)
    if stack:
        raise ValueError("stack not empty after STOP: %r" % stack)
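
"""
Because dis() writes to 'out' and reports failed sanity checks by raising
ValueError, it can double as a rough well-formedness probe.  The sketch
below is an illustration only; disassemble_to_text and first_problem are
hypothetical helpers, and the sketch imports pickletools so that it runs
on its own.

```python
import io
import pickle
import pickletools

def disassemble_to_text(data, annotate=0):
    'Return the dis() listing for the pickle bytes data as a str.'
    out = io.StringIO()
    pickletools.dis(data, out=out, annotate=annotate)
    return out.getvalue()

def first_problem(data):
    'Return the message of the ValueError dis() raises for data, or None.'
    try:
        disassemble_to_text(data)
    except ValueError as exc:
        return str(exc)
    return None

listing = disassemble_to_text(pickle.dumps((1, 2), 2))

# A BINGET of memo key 0 followed by STOP: the memo entry was never stored.
problem = first_problem(b'h' + bytes([0]) + b'.')
```
"""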

# For use in the doctest, simply as an example of a class to pickle.
class _Example:
    def __init__(self, value):
        self.value = value

_dis_test = r"""
>>> import pickle
>>> x = [1, 2, (3, 4), {b'abc': "def"}]
>>> pkl0 = pickle.dumps(x, 0)
>>> dis(pkl0)
    0: (    MARK
    1: l        LIST       (MARK at 0)
    2: p    PUT        0
    5: L    LONG       1
    9: a    APPEND
   10: L    LONG       2
   14: a    APPEND
   15: (    MARK
   16: L        LONG       3
   20: L        LONG       4
   24: t        TUPLE      (MARK at 15)
   25: p    PUT        1
   28: a    APPEND
   29: (    MARK
   30: d        DICT       (MARK at 29)
   31: p    PUT        2
   34: c    GLOBAL     '_codecs encode'
   50: p    PUT        3
   53: (    MARK
   54: V        UNICODE    'abc'
   59: p        PUT        4
   62: V        UNICODE    'latin1'
   70: p        PUT        5
   73: t        TUPLE      (MARK at 53)
   74: p    PUT        6
   77: R    REDUCE
   78: p    PUT        7
   81: V    UNICODE    'def'
   86: p    PUT        8
   89: s    SETITEM
   90: a    APPEND
   91: .    STOP
highest protocol among opcodes = 0

Try again with a "binary" pickle.

>>> pkl1 = pickle.dumps(x, 1)
>>> dis(pkl1)
    0: ]    EMPTY_LIST
    1: q    BINPUT     0
    3: (    MARK
    4: K        BININT1    1
    6: K        BININT1    2
    8: (        MARK
    9: K            BININT1    3
   11: K            BININT1    4
   13: t            TUPLE      (MARK at 8)
   14: q        BINPUT     1
   16: }        EMPTY_DICT
   17: q        BINPUT     2
   19: c        GLOBAL     '_codecs encode'
   35: q        BINPUT     3
   37: (        MARK
   38: X            BINUNICODE 'abc'
   46: q            BINPUT     4
   48: X            BINUNICODE 'latin1'
   59: q            BINPUT     5
   61: t            TUPLE      (MARK at 37)
   62: q        BINPUT     6
   64: R        REDUCE
   65: q        BINPUT     7
   67: X        BINUNICODE 'def'
   75: q        BINPUT     8
   77: s        SETITEM
   78: e        APPENDS    (MARK at 3)
   79: .    STOP
highest protocol among opcodes = 1

Exercise the INST/OBJ/BUILD family.

>>> import pickletools
>>> dis(pickle.dumps(pickletools.dis, 0))
    0: c    GLOBAL     'pickletools dis'
   17: p    PUT        0
   20: .    STOP
highest protocol among opcodes = 0

>>> from pickletools import _Example
>>> x = [_Example(42)] * 2
>>> dis(pickle.dumps(x, 0))
    0: (    MARK
    1: l        LIST       (MARK at 0)
    2: p    PUT        0
    5: c    GLOBAL     'copy_reg _reconstructor'
   30: p    PUT        1
   33: (    MARK
   34: c        GLOBAL     'pickletools _Example'
   56: p        PUT        2
   59: c        GLOBAL     '__builtin__ object'
   79: p        PUT        3
   82: N        NONE
   83: t        TUPLE      (MARK at 33)
   84: p    PUT        4
   87: R    REDUCE
   88: p    PUT        5
   91: (    MARK
   92: d        DICT       (MARK at 91)
   93: p    PUT        6
   96: V    UNICODE    'value'
  103: p    PUT        7
  106: L    LONG       42
  111: s    SETITEM
  112: b    BUILD
  113: a    APPEND
  114: g    GET        5
  117: a    APPEND
  118: .    STOP
highest protocol among opcodes = 0

>>> dis(pickle.dumps(x, 1))
    0: ]    EMPTY_LIST
    1: q    BINPUT     0
    3: (    MARK
    4: c        GLOBAL     'copy_reg _reconstructor'
   29: q        BINPUT     1
   31: (        MARK
   32: c            GLOBAL     'pickletools _Example'
   54: q            BINPUT     2
   56: c            GLOBAL     '__builtin__ object'
   76: q            BINPUT     3
   78: N            NONE
   79: t            TUPLE      (MARK at 31)
   80: q        BINPUT     4
   82: R        REDUCE
   83: q        BINPUT     5
   85: }        EMPTY_DICT
   86: q        BINPUT     6
   88: X        BINUNICODE 'value'
   98: q        BINPUT     7
  100: K        BININT1    42
  102: s        SETITEM
  103: b        BUILD
  104: h        BINGET     5
  106: e        APPENDS    (MARK at 3)
  107: .    STOP
highest protocol among opcodes = 1

Try "the canonical" recursive-object test.

>>> L = []
>>> T = L,
>>> L.append(T)
>>> L[0] is T
True
>>> T[0] is L
True
>>> L[0][0] is L
True
>>> T[0][0] is T
True
>>> dis(pickle.dumps(L, 0))
    0: (    MARK
    1: l        LIST       (MARK at 0)
    2: p    PUT        0
    5: (    MARK
    6: g        GET        0
    9: t        TUPLE      (MARK at 5)
   10: p    PUT        1
   13: a    APPEND
   14: .    STOP
highest protocol among opcodes = 0

>>> dis(pickle.dumps(L, 1))
    0: ]    EMPTY_LIST
    1: q    BINPUT     0
    3: (    MARK
    4: h        BINGET     0
    6: t        TUPLE      (MARK at 3)
    7: q    BINPUT     1
    9: a    APPEND
   10: .    STOP
highest protocol among opcodes = 1

Note that, in the protocol 0 pickle of the recursive tuple, the disassembler
has to emulate the stack in order to realize that the POP opcode at 16 gets
rid of the MARK at 0.

>>> dis(pickle.dumps(T, 0))
    0: (    MARK
    1: (        MARK
    2: l            LIST       (MARK at 1)
    3: p        PUT        0
    6: (        MARK
    7: g            GET        0
   10: t            TUPLE      (MARK at 6)
   11: p        PUT        1
   14: a        APPEND
   15: 0        POP
   16: 0        POP        (MARK at 0)
   17: g    GET        1
   20: .    STOP
highest protocol among opcodes = 0

>>> dis(pickle.dumps(T, 1))
    0: (    MARK
    1: ]        EMPTY_LIST
    2: q        BINPUT     0
    4: (        MARK
    5: h            BINGET     0
    7: t            TUPLE      (MARK at 4)
    8: q        BINPUT     1
   10: a        APPEND
   11: 1        POP_MARK   (MARK at 0)
   12: h    BINGET     1
   14: .    STOP
highest protocol among opcodes = 1

Try protocol 2.

>>> dis(pickle.dumps(L, 2))
    0: \x80 PROTO      2
    2: ]    EMPTY_LIST
    3: q    BINPUT     0
    5: h    BINGET     0
    7: \x85 TUPLE1
    8: q    BINPUT     1
   10: a    APPEND
   11: .    STOP
highest protocol among opcodes = 2

>>> dis(pickle.dumps(T, 2))
    0: \x80 PROTO      2
    2: ]    EMPTY_LIST
    3: q    BINPUT     0
    5: h    BINGET     0
    7: \x85 TUPLE1
    8: q    BINPUT     1
   10: a    APPEND
   11: 0    POP
   12: h    BINGET     1
   14: .    STOP
highest protocol among opcodes = 2

Try protocol 3 with annotations:

>>> dis(pickle.dumps(T, 3), annotate=1)
    0: \x80 PROTO      3 Protocol version indicator.
    2: ]    EMPTY_LIST   Push an empty list.
    3: q    BINPUT     0 Store the stack top into the memo.  The stack is not popped.
    5: h    BINGET     0 Read an object from the memo and push it on the stack.
    7: \x85 TUPLE1       Build a one-tuple out of the topmost item on the stack.
    8: q    BINPUT     1 Store the stack top into the memo.  The stack is not popped.
   10: a    APPEND       Append an object to a list.
   11: 0    POP          Discard the top stack item, shrinking the stack by one item.
   12: h    BINGET     1 Read an object from the memo and push it on the stack.
   14: .    STOP         Stop the unpickling machine.
highest protocol among opcodes = 2

"""

_memo_test = r"""
>>> import pickle
>>> import io
>>> f = io.BytesIO()
>>> p = pickle.Pickler(f, 2)
>>> x = [1, 2, 3]
>>> p.dump(x)
>>> p.dump(x)
>>> f.seek(0)
0
>>> memo = {}
>>> dis(f, memo=memo)
    0: \x80 PROTO      2
    2: ]    EMPTY_LIST
    3: q    BINPUT     0
    5: (    MARK
    6: K        BININT1    1
    8: K        BININT1    2
   10: K        BININT1    3
   12: e        APPENDS    (MARK at 5)
   13: .    STOP
highest protocol among opcodes = 2
>>> dis(f, memo=memo)
   14: \x80 PROTO      2
   16: h    BINGET     0
   18: .    STOP
highest protocol among opcodes = 2
"""

__test__ = {'disassembler_test': _dis_test,
            'disassembler_memo_test': _memo_test,
           }

def _test():
    import doctest
    return doctest.testmod()

if __name__ == "__main__":
    import sys, argparse
    parser = argparse.ArgumentParser(
        description='disassemble one or more pickle files')
    parser.add_argument(
        'pickle_file', type=argparse.FileType('br'),
        nargs='*', help='the pickle file')
    parser.add_argument(
        '-o', '--output', default=sys.stdout, type=argparse.FileType('w'),
        help='the file where the output should be written')
    parser.add_argument(
        '-m', '--memo', action='store_true',
        help='preserve memo between disassemblies')
    parser.add_argument(
        '-l', '--indentlevel', default=4, type=int,
        help='the number of blanks by which to indent a new MARK level')
    parser.add_argument(
        '-a', '--annotate', action='store_true',
        help='annotate each line with a short opcode description')
    parser.add_argument(
        '-p', '--preamble', default="==> {name} <==",
        help='if more than one pickle file is specified, print this before'
        ' each disassembly')
    parser.add_argument(
        '-t', '--test', action='store_true',
        help='run self-test suite')
    parser.add_argument(
        '-v', action='store_true',
        help='run verbosely; only affects self-test run')
    args = parser.parse_args()
    if args.test:
        _test()
    else:
        annotate = 30 if args.annotate else 0
        if not args.pickle_file:
            parser.print_help()
        elif len(args.pickle_file) == 1:
            dis(args.pickle_file[0], args.output, None,
                args.indentlevel, annotate)
        else:
            memo = {} if args.memo else None
            for f in args.pickle_file:
                preamble = args.preamble.format(name=f.name)
                args.output.write(preamble + '\n')
                dis(f, args.output, memo, args.indentlevel, annotate)