pickletools.py revision 28d271ef6b0eb37e27b5b3234ba146922b11d89f
1'''"Executable documentation" for the pickle module.
2
3Extensive comments about the pickle protocols and pickle-machine opcodes
4can be found here.  Some functions meant for external use:
5
6genops(pickle)
7   Generate all the opcodes in a pickle, as (opcode, arg, position) triples.
8
9dis(pickle, out=None, memo=None, indentlevel=4)
10   Print a symbolic disassembly of a pickle.
11'''
12
13import codecs
14import io
15import pickle
16import re
17import sys
18
19__all__ = ['dis', 'genops', 'optimize']
20
21bytes_types = pickle.bytes_types
22
23# Other ideas:
24#
25# - A pickle verifier:  read a pickle and check it exhaustively for
26#   well-formedness.  dis() does a lot of this already.
27#
28# - A protocol identifier:  examine a pickle and return its protocol number
29#   (== the highest .proto attr value among all the opcodes in the pickle).
30#   dis() already prints this info at the end.
31#
32# - A pickle optimizer:  for example, tuple-building code is sometimes more
33#   elaborate than necessary, catering for the possibility that the tuple
34#   is recursive.  Or lots of times a PUT is generated that's never accessed
35#   by a later GET.
36
37
38# "A pickle" is a program for a virtual pickle machine (PM, but more accurately
39# called an unpickling machine).  It's a sequence of opcodes, interpreted by the
40# PM, building an arbitrarily complex Python object.
41#
42# For the most part, the PM is very simple:  there are no looping, testing, or
43# conditional instructions, no arithmetic and no function calls.  Opcodes are
44# executed once each, from first to last, until a STOP opcode is reached.
45#
46# The PM has two data areas, "the stack" and "the memo".
47#
48# Many opcodes push Python objects onto the stack; e.g., INT pushes a Python
49# integer object on the stack, whose value is gotten from a decimal string
50# literal immediately following the INT opcode in the pickle bytestream.  Other
51# opcodes take Python objects off the stack.  The result of unpickling is
52# whatever object is left on the stack when the final STOP opcode is executed.
53#
54# The memo is simply an array of objects, or it can be implemented as a dict
55# mapping little integers to objects.  The memo serves as the PM's "long term
56# memory", and the little integers indexing the memo are akin to variable
57# names.  Some opcodes pop a stack object into the memo at a given index,
58# and others push a memo object at a given index onto the stack again.
59#
60# At heart, that's all the PM has.  Subtleties arise for these reasons:
61#
62# + Object identity.  Objects can be arbitrarily complex, and subobjects
63#   may be shared (for example, the list [a, a] refers to the same object a
64#   twice).  It can be vital that unpickling recreate an isomorphic object
65#   graph, faithfully reproducing sharing.
66#
67# + Recursive objects.  For example, after "L = []; L.append(L)", L is a
68#   list, and L[0] is the same list.  This is related to the object identity
69#   point, and some sequences of pickle opcodes are subtle in order to
70#   get the right result in all cases.
71#
72# + Things pickle doesn't know everything about.  Examples of things pickle
73#   does know everything about are Python's builtin scalar and container
74#   types, like ints and tuples.  They generally have opcodes dedicated to
75#   them.  For things like module references and instances of user-defined
76#   classes, pickle's knowledge is limited.  Historically, many enhancements
77#   have been made to the pickle protocol in order to do a better (faster,
78#   and/or more compact) job on those.
79#
80# + Backward compatibility and micro-optimization.  As explained below,
81#   pickle opcodes never go away, not even when better ways to do a thing
82#   get invented.  The repertoire of the PM just keeps growing over time.
83#   For example, protocol 0 had two opcodes for building Python integers (INT
84#   and LONG), protocol 1 added three more for more-efficient pickling of short
85#   integers, and protocol 2 added two more for more-efficient pickling of
86#   long integers (before protocol 2, the only ways to pickle a Python long
87#   took time quadratic in the number of digits, for both pickling and
88#   unpickling).  "Opcode bloat" isn't so much a subtlety as a source of
89#   wearying complication.
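
# Illustrative sketch (an addition, not part of the original module): the
# object-identity and recursive-object points above can be checked with a
# plain pickle round trip.  The helper name below is hypothetical.
def _demo_identity_roundtrip():
    import pickle

    shared = []
    pair = [shared, shared]        # the same list object appears twice
    pair2 = pickle.loads(pickle.dumps(pair))

    rec = []
    rec.append(rec)                # rec[0] is rec itself
    rec2 = pickle.loads(pickle.dumps(rec))

    # Both results are expected to be True if sharing and recursion survive.
    return pair2[0] is pair2[1], rec2[0] is rec2
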
90#
91#
92# Pickle protocols:
93#
94# For compatibility, the meaning of a pickle opcode never changes.  Instead new
95# pickle opcodes get added, and each version's unpickler can handle all the
96# pickle opcodes in all protocol versions to date.  So old pickles continue to
97# be readable forever.  The pickler can generally be told to restrict itself to
98# the subset of opcodes available under previous protocol versions too, so that
99# users can create pickles under the current version readable by older
100# versions.  However, a pickle does not contain its version number embedded
101# within it.  If an older unpickler tries to read a pickle using a later
102# protocol, the result is most likely an exception due to seeing an unknown (in
103# the older unpickler) opcode.
104#
105# The original pickle used what's now called "protocol 0", and what was called
106# "text mode" before Python 2.3.  The entire pickle bytestream is made up of
107# printable 7-bit ASCII characters, plus the newline character, in protocol 0.
108# That's why it was called text mode.  Protocol 0 is small and elegant, but
109# sometimes painfully inefficient.
110#
111# The second major set of additions is now called "protocol 1", and was called
112# "binary mode" before Python 2.3.  This added many opcodes with arguments
113# consisting of arbitrary bytes, including NUL bytes and unprintable "high bit"
114# bytes.  Binary mode pickles can be substantially smaller than equivalent
115# text mode pickles, and sometimes faster too; e.g., BININT represents a 4-byte
116# int as 4 bytes following the opcode, which is cheaper to unpickle than the
117# (perhaps) 11-character decimal string attached to INT.  Protocol 1 also added
118# a number of opcodes that operate on many stack elements at once (like APPENDS
119# and SETITEMS), and "shortcut" opcodes (like EMPTY_DICT and EMPTY_TUPLE).
120#
121# The third major set of additions came in Python 2.3, and is called "protocol
122# 2".  This added:
123#
124# - A better way to pickle instances of new-style classes (NEWOBJ).
125#
126# - A way for a pickle to identify its protocol (PROTO).
127#
128# - Time- and space- efficient pickling of long ints (LONG{1,4}).
129#
# - Shortcuts for small tuples (TUPLE{1,2,3}).
131#
132# - Dedicated opcodes for bools (NEWTRUE, NEWFALSE).
133#
134# - The "extension registry", a vector of popular objects that can be pushed
135#   efficiently by index (EXT{1,2,4}).  This is akin to the memo and GET, but
136#   the registry contents are predefined (there's nothing akin to the memo's
137#   PUT).
138#
139# Another independent change with Python 2.3 is the abandonment of any
140# pretense that it might be safe to load pickles received from untrusted
141# parties -- no sufficient security analysis has been done to guarantee
142# this and there isn't a use case that warrants the expense of such an
143# analysis.
144#
145# To this end, all tests for __safe_for_unpickling__ or for
146# copyreg.safe_constructors are removed from the unpickling code.
147# References to these variables in the descriptions below are to be seen
148# as describing unpickling in Python 2.2 and before.
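
# Illustrative sketch (an addition, not part of the original module): a quick
# look at the protocol differences described above, assuming a Python 3
# pickle module.  The helper name and the sample object are hypothetical.
def _demo_protocol_bytes(obj=(1, 'two', 3.0)):
    import pickle

    p0 = pickle.dumps(obj, protocol=0)
    p2 = pickle.dumps(obj, protocol=2)

    # Protocol 0 ("text mode") should use only 7-bit ASCII bytes.
    text_mode_is_ascii = all(byte < 0x80 for byte in p0)
    # Protocol 2 pickles start with the PROTO opcode (0x80) and the version.
    starts_with_proto = p2[:2] == b'\x80\x02'
    # Either protocol is expected to rebuild an equal object.
    same_result = pickle.loads(p0) == pickle.loads(p2) == obj
    return text_mode_is_ascii, starts_with_proto, same_result
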
149
150
151# Meta-rule:  Descriptions are stored in instances of descriptor objects,
152# with plain constructors.  No meta-language is defined from which
153# descriptors could be constructed.  If you want, e.g., XML, write a little
154# program to generate XML from the objects.
155
156##############################################################################
157# Some pickle opcodes have an argument, following the opcode in the
158# bytestream.  An argument is of a specific type, described by an instance
159# of ArgumentDescriptor.  These are not to be confused with arguments taken
160# off the stack -- ArgumentDescriptor applies only to arguments embedded in
161# the opcode stream, immediately following an opcode.
162
163# Represents the number of bytes consumed by an argument delimited by the
164# next newline character.
165UP_TO_NEWLINE = -1
166
167# Represents the number of bytes consumed by a two-argument opcode where
168# the first argument gives the number of bytes in the second argument.
169TAKEN_FROM_ARGUMENT1  = -2   # num bytes is 1-byte unsigned int
170TAKEN_FROM_ARGUMENT4  = -3   # num bytes is 4-byte signed little-endian int
171TAKEN_FROM_ARGUMENT4U = -4   # num bytes is 4-byte unsigned little-endian int
172TAKEN_FROM_ARGUMENT8U = -5   # num bytes is 8-byte unsigned little-endian int
173
174class ArgumentDescriptor(object):
175    __slots__ = (
176        # name of descriptor record, also a module global name; a string
177        'name',
178
        # length of argument, in bytes; an int; UP_TO_NEWLINE and
        # TAKEN_FROM_ARGUMENT{1,4,4U,8U} are negative values for
        # variable-length cases
182        'n',
183
184        # a function taking a file-like object, reading this kind of argument
185        # from the object at the current position, advancing the current
186        # position by n bytes, and returning the value of the argument
187        'reader',
188
189        # human-readable docs for this arg descriptor; a string
190        'doc',
191    )
192
193    def __init__(self, name, n, reader, doc):
194        assert isinstance(name, str)
195        self.name = name
196
197        assert isinstance(n, int) and (n >= 0 or
198                                       n in (UP_TO_NEWLINE,
199                                             TAKEN_FROM_ARGUMENT1,
200                                             TAKEN_FROM_ARGUMENT4,
201                                             TAKEN_FROM_ARGUMENT4U,
202                                             TAKEN_FROM_ARGUMENT8U))
203        self.n = n
204
205        self.reader = reader
206
207        assert isinstance(doc, str)
208        self.doc = doc
209
210from struct import unpack as _unpack
211
212def read_uint1(f):
213    r"""
214    >>> import io
215    >>> read_uint1(io.BytesIO(b'\xff'))
216    255
217    """
218
219    data = f.read(1)
220    if data:
221        return data[0]
222    raise ValueError("not enough data in stream to read uint1")
223
224uint1 = ArgumentDescriptor(
225            name='uint1',
226            n=1,
227            reader=read_uint1,
228            doc="One-byte unsigned integer.")
229
230
231def read_uint2(f):
232    r"""
233    >>> import io
234    >>> read_uint2(io.BytesIO(b'\xff\x00'))
235    255
236    >>> read_uint2(io.BytesIO(b'\xff\xff'))
237    65535
238    """
239
240    data = f.read(2)
241    if len(data) == 2:
242        return _unpack("<H", data)[0]
243    raise ValueError("not enough data in stream to read uint2")
244
245uint2 = ArgumentDescriptor(
246            name='uint2',
247            n=2,
248            reader=read_uint2,
249            doc="Two-byte unsigned integer, little-endian.")
250
251
252def read_int4(f):
253    r"""
254    >>> import io
255    >>> read_int4(io.BytesIO(b'\xff\x00\x00\x00'))
256    255
257    >>> read_int4(io.BytesIO(b'\x00\x00\x00\x80')) == -(2**31)
258    True
259    """
260
261    data = f.read(4)
262    if len(data) == 4:
263        return _unpack("<i", data)[0]
264    raise ValueError("not enough data in stream to read int4")
265
266int4 = ArgumentDescriptor(
267           name='int4',
268           n=4,
269           reader=read_int4,
270           doc="Four-byte signed integer, little-endian, 2's complement.")
271
272
273def read_uint4(f):
274    r"""
275    >>> import io
276    >>> read_uint4(io.BytesIO(b'\xff\x00\x00\x00'))
277    255
278    >>> read_uint4(io.BytesIO(b'\x00\x00\x00\x80')) == 2**31
279    True
280    """
281
282    data = f.read(4)
283    if len(data) == 4:
284        return _unpack("<I", data)[0]
285    raise ValueError("not enough data in stream to read uint4")
286
287uint4 = ArgumentDescriptor(
288            name='uint4',
289            n=4,
290            reader=read_uint4,
291            doc="Four-byte unsigned integer, little-endian.")
292
293
294def read_uint8(f):
295    r"""
296    >>> import io
297    >>> read_uint8(io.BytesIO(b'\xff\x00\x00\x00\x00\x00\x00\x00'))
298    255
299    >>> read_uint8(io.BytesIO(b'\xff' * 8)) == 2**64-1
300    True
301    """
302
303    data = f.read(8)
304    if len(data) == 8:
305        return _unpack("<Q", data)[0]
306    raise ValueError("not enough data in stream to read uint8")
307
308uint8 = ArgumentDescriptor(
309            name='uint8',
310            n=8,
311            reader=read_uint8,
312            doc="Eight-byte unsigned integer, little-endian.")
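
# Illustrative sketch (an addition, not part of the original module): the
# multi-byte fixed-width readers above correspond to these struct format
# codes.  The helper name and the sample bytes are hypothetical.
def _demo_fixed_width_formats():
    import io
    import struct

    f = io.BytesIO(b'\xff\x00' + b'\x00\x00\x00\x80' * 2 + b'\xff' * 8)
    (u2,) = struct.unpack('<H', f.read(2))   # like read_uint2
    (i4,) = struct.unpack('<i', f.read(4))   # like read_int4 (signed)
    (u4,) = struct.unpack('<I', f.read(4))   # like read_uint4 (unsigned)
    (u8,) = struct.unpack('<Q', f.read(8))   # like read_uint8
    return u2, i4, u4, u8
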
313
314
315def read_stringnl(f, decode=True, stripquotes=True):
316    r"""
317    >>> import io
318    >>> read_stringnl(io.BytesIO(b"'abcd'\nefg\n"))
319    'abcd'
320
321    >>> read_stringnl(io.BytesIO(b"\n"))
322    Traceback (most recent call last):
323    ...
324    ValueError: no string quotes around b''
325
326    >>> read_stringnl(io.BytesIO(b"\n"), stripquotes=False)
327    ''
328
329    >>> read_stringnl(io.BytesIO(b"''\n"))
330    ''
331
332    >>> read_stringnl(io.BytesIO(b'"abcd"'))
333    Traceback (most recent call last):
334    ...
335    ValueError: no newline found when trying to read stringnl
336
337    Embedded escapes are undone in the result.
338    >>> read_stringnl(io.BytesIO(br"'a\n\\b\x00c\td'" + b"\n'e'"))
339    'a\n\\b\x00c\td'
340    """
341
342    data = f.readline()
343    if not data.endswith(b'\n'):
344        raise ValueError("no newline found when trying to read stringnl")
345    data = data[:-1]    # lose the newline
346
347    if stripquotes:
348        for q in (b'"', b"'"):
349            if data.startswith(q):
350                if not data.endswith(q):
                    raise ValueError("string quote %r not found at both "
352                                     "ends of %r" % (q, data))
353                data = data[1:-1]
354                break
355        else:
356            raise ValueError("no string quotes around %r" % data)
357
358    if decode:
359        data = codecs.escape_decode(data)[0].decode("ascii")
360    return data
361
362stringnl = ArgumentDescriptor(
363               name='stringnl',
364               n=UP_TO_NEWLINE,
365               reader=read_stringnl,
366               doc="""A newline-terminated string.
367
368                   This is a repr-style string, with embedded escapes, and
369                   bracketing quotes.
370                   """)
371
372def read_stringnl_noescape(f):
373    return read_stringnl(f, stripquotes=False)
374
375stringnl_noescape = ArgumentDescriptor(
376                        name='stringnl_noescape',
377                        n=UP_TO_NEWLINE,
378                        reader=read_stringnl_noescape,
379                        doc="""A newline-terminated string.
380
381                        This is a str-style string, without embedded escapes,
382                        or bracketing quotes.  It should consist solely of
383                        printable ASCII characters.
384                        """)
385
386def read_stringnl_noescape_pair(f):
387    r"""
388    >>> import io
389    >>> read_stringnl_noescape_pair(io.BytesIO(b"Queue\nEmpty\njunk"))
390    'Queue Empty'
391    """
392
393    return "%s %s" % (read_stringnl_noescape(f), read_stringnl_noescape(f))
394
395stringnl_noescape_pair = ArgumentDescriptor(
396                             name='stringnl_noescape_pair',
397                             n=UP_TO_NEWLINE,
398                             reader=read_stringnl_noescape_pair,
399                             doc="""A pair of newline-terminated strings.
400
401                             These are str-style strings, without embedded
402                             escapes, or bracketing quotes.  They should
403                             consist solely of printable ASCII characters.
404                             The pair is returned as a single string, with
405                             a single blank separating the two strings.
406                             """)
407
408
409def read_string1(f):
410    r"""
411    >>> import io
412    >>> read_string1(io.BytesIO(b"\x00"))
413    ''
414    >>> read_string1(io.BytesIO(b"\x03abcdef"))
415    'abc'
416    """
417
418    n = read_uint1(f)
419    assert n >= 0
420    data = f.read(n)
421    if len(data) == n:
422        return data.decode("latin-1")
423    raise ValueError("expected %d bytes in a string1, but only %d remain" %
424                     (n, len(data)))
425
426string1 = ArgumentDescriptor(
427              name="string1",
428              n=TAKEN_FROM_ARGUMENT1,
429              reader=read_string1,
430              doc="""A counted string.
431
432              The first argument is a 1-byte unsigned int giving the number
433              of bytes in the string, and the second argument is that many
434              bytes.
435              """)
436
437
438def read_string4(f):
439    r"""
440    >>> import io
441    >>> read_string4(io.BytesIO(b"\x00\x00\x00\x00abc"))
442    ''
443    >>> read_string4(io.BytesIO(b"\x03\x00\x00\x00abcdef"))
444    'abc'
445    >>> read_string4(io.BytesIO(b"\x00\x00\x00\x03abcdef"))
446    Traceback (most recent call last):
447    ...
448    ValueError: expected 50331648 bytes in a string4, but only 6 remain
449    """
450
451    n = read_int4(f)
452    if n < 0:
453        raise ValueError("string4 byte count < 0: %d" % n)
454    data = f.read(n)
455    if len(data) == n:
456        return data.decode("latin-1")
457    raise ValueError("expected %d bytes in a string4, but only %d remain" %
458                     (n, len(data)))
459
460string4 = ArgumentDescriptor(
461              name="string4",
462              n=TAKEN_FROM_ARGUMENT4,
463              reader=read_string4,
464              doc="""A counted string.
465
466              The first argument is a 4-byte little-endian signed int giving
467              the number of bytes in the string, and the second argument is
468              that many bytes.
469              """)
470
471
501def read_bytes1(f):
502    r"""
503    >>> import io
504    >>> read_bytes1(io.BytesIO(b"\x00"))
505    b''
506    >>> read_bytes1(io.BytesIO(b"\x03abcdef"))
507    b'abc'
508    """
509
510    n = read_uint1(f)
511    assert n >= 0
512    data = f.read(n)
513    if len(data) == n:
514        return data
515    raise ValueError("expected %d bytes in a bytes1, but only %d remain" %
516                     (n, len(data)))
517
518bytes1 = ArgumentDescriptor(
519              name="bytes1",
520              n=TAKEN_FROM_ARGUMENT1,
521              reader=read_bytes1,
522              doc="""A counted bytes string.
523
524              The first argument is a 1-byte unsigned int giving the number
525              of bytes, and the second argument is that many bytes.
526              """)
527
528
529def read_bytes4(f):
530    r"""
531    >>> import io
532    >>> read_bytes4(io.BytesIO(b"\x00\x00\x00\x00abc"))
533    b''
534    >>> read_bytes4(io.BytesIO(b"\x03\x00\x00\x00abcdef"))
535    b'abc'
536    >>> read_bytes4(io.BytesIO(b"\x00\x00\x00\x03abcdef"))
537    Traceback (most recent call last):
538    ...
539    ValueError: expected 50331648 bytes in a bytes4, but only 6 remain
540    """
541
542    n = read_uint4(f)
543    assert n >= 0
544    if n > sys.maxsize:
545        raise ValueError("bytes4 byte count > sys.maxsize: %d" % n)
546    data = f.read(n)
547    if len(data) == n:
548        return data
549    raise ValueError("expected %d bytes in a bytes4, but only %d remain" %
550                     (n, len(data)))
551
552bytes4 = ArgumentDescriptor(
553              name="bytes4",
554              n=TAKEN_FROM_ARGUMENT4U,
555              reader=read_bytes4,
556              doc="""A counted bytes string.
557
558              The first argument is a 4-byte little-endian unsigned int giving
559              the number of bytes, and the second argument is that many bytes.
560              """)
561
562
563def read_bytes8(f):
564    r"""
565    >>> import io, struct, sys
566    >>> read_bytes8(io.BytesIO(b"\x00\x00\x00\x00\x00\x00\x00\x00abc"))
567    b''
568    >>> read_bytes8(io.BytesIO(b"\x03\x00\x00\x00\x00\x00\x00\x00abcdef"))
569    b'abc'
570    >>> bigsize8 = struct.pack("<Q", sys.maxsize//3)
571    >>> read_bytes8(io.BytesIO(bigsize8 + b"abcdef"))  #doctest: +ELLIPSIS
572    Traceback (most recent call last):
573    ...
574    ValueError: expected ... bytes in a bytes8, but only 6 remain
575    """
576
577    n = read_uint8(f)
578    assert n >= 0
579    if n > sys.maxsize:
580        raise ValueError("bytes8 byte count > sys.maxsize: %d" % n)
581    data = f.read(n)
582    if len(data) == n:
583        return data
584    raise ValueError("expected %d bytes in a bytes8, but only %d remain" %
585                     (n, len(data)))
586
587bytes8 = ArgumentDescriptor(
588              name="bytes8",
589              n=TAKEN_FROM_ARGUMENT8U,
590              reader=read_bytes8,
591              doc="""A counted bytes string.
592
              The first argument is an 8-byte little-endian unsigned int giving
594              the number of bytes, and the second argument is that many bytes.
595              """)
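
# Illustrative sketch (an addition, not part of the original module): how a
# counted ("length-prefixed") argument in the style of bytes4 is laid out
# and consumed.  The helper name and the sample payload are hypothetical.
def _demo_counted_bytes(payload=b'abc'):
    import io
    import struct

    # 4-byte little-endian unsigned count, then the payload, then other data.
    f = io.BytesIO(struct.pack('<I', len(payload)) + payload + b'junk')
    (n,) = struct.unpack('<I', f.read(4))
    data = f.read(n)
    if len(data) != n:
        raise ValueError("expected %d bytes, but only %d remain" %
                         (n, len(data)))
    return data, f.read()      # the payload, and whatever follows it
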
596
597def read_unicodestringnl(f):
598    r"""
599    >>> import io
600    >>> read_unicodestringnl(io.BytesIO(b"abc\\uabcd\njunk")) == 'abc\uabcd'
601    True
602    """
603
604    data = f.readline()
605    if not data.endswith(b'\n'):
606        raise ValueError("no newline found when trying to read "
607                         "unicodestringnl")
608    data = data[:-1]    # lose the newline
609    return str(data, 'raw-unicode-escape')
610
611unicodestringnl = ArgumentDescriptor(
612                      name='unicodestringnl',
613                      n=UP_TO_NEWLINE,
614                      reader=read_unicodestringnl,
615                      doc="""A newline-terminated Unicode string.
616
617                      This is raw-unicode-escape encoded, so consists of
618                      printable ASCII characters, and may contain embedded
619                      escape sequences.
620                      """)
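
# Illustrative sketch (an addition, not part of the original module): what
# the raw-unicode-escape codec does to a character outside Latin-1.  The
# helper name and the sample text are hypothetical.
def _demo_raw_unicode_escape(text='abc\uabcd'):
    encoded = text.encode('raw-unicode-escape')   # U+ABCD becomes \uabcd
    decoded = str(encoded, 'raw-unicode-escape')
    return encoded, decoded == text
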
621
622
623def read_unicodestring1(f):
624    r"""
625    >>> import io
626    >>> s = 'abcd\uabcd'
627    >>> enc = s.encode('utf-8')
628    >>> enc
629    b'abcd\xea\xaf\x8d'
630    >>> n = bytes([len(enc)])  # little-endian 1-byte length
631    >>> t = read_unicodestring1(io.BytesIO(n + enc + b'junk'))
632    >>> s == t
633    True
634
635    >>> read_unicodestring1(io.BytesIO(n + enc[:-1]))
636    Traceback (most recent call last):
637    ...
638    ValueError: expected 7 bytes in a unicodestring1, but only 6 remain
639    """
640
641    n = read_uint1(f)
642    assert n >= 0
643    data = f.read(n)
644    if len(data) == n:
645        return str(data, 'utf-8', 'surrogatepass')
646    raise ValueError("expected %d bytes in a unicodestring1, but only %d "
647                     "remain" % (n, len(data)))
648
649unicodestring1 = ArgumentDescriptor(
650                    name="unicodestring1",
651                    n=TAKEN_FROM_ARGUMENT1,
652                    reader=read_unicodestring1,
653                    doc="""A counted Unicode string.
654
                    The first argument is a 1-byte unsigned int giving the
                    number of bytes in the string, and the second argument
                    -- the UTF-8 encoding of the Unicode string -- contains
                    that many bytes.
659                    """)
660
661
662def read_unicodestring4(f):
663    r"""
664    >>> import io
665    >>> s = 'abcd\uabcd'
666    >>> enc = s.encode('utf-8')
667    >>> enc
668    b'abcd\xea\xaf\x8d'
669    >>> n = bytes([len(enc), 0, 0, 0])  # little-endian 4-byte length
670    >>> t = read_unicodestring4(io.BytesIO(n + enc + b'junk'))
671    >>> s == t
672    True
673
674    >>> read_unicodestring4(io.BytesIO(n + enc[:-1]))
675    Traceback (most recent call last):
676    ...
677    ValueError: expected 7 bytes in a unicodestring4, but only 6 remain
678    """
679
680    n = read_uint4(f)
681    assert n >= 0
682    if n > sys.maxsize:
683        raise ValueError("unicodestring4 byte count > sys.maxsize: %d" % n)
684    data = f.read(n)
685    if len(data) == n:
686        return str(data, 'utf-8', 'surrogatepass')
687    raise ValueError("expected %d bytes in a unicodestring4, but only %d "
688                     "remain" % (n, len(data)))
689
690unicodestring4 = ArgumentDescriptor(
691                    name="unicodestring4",
692                    n=TAKEN_FROM_ARGUMENT4U,
693                    reader=read_unicodestring4,
694                    doc="""A counted Unicode string.
695
                    The first argument is a 4-byte little-endian unsigned
                    int giving the number of bytes in the string, and the
                    second argument -- the UTF-8 encoding of the Unicode
                    string -- contains that many bytes.
700                    """)
701
702
703def read_unicodestring8(f):
704    r"""
705    >>> import io
706    >>> s = 'abcd\uabcd'
707    >>> enc = s.encode('utf-8')
708    >>> enc
709    b'abcd\xea\xaf\x8d'
710    >>> n = bytes([len(enc)]) + bytes(7)  # little-endian 8-byte length
711    >>> t = read_unicodestring8(io.BytesIO(n + enc + b'junk'))
712    >>> s == t
713    True
714
715    >>> read_unicodestring8(io.BytesIO(n + enc[:-1]))
716    Traceback (most recent call last):
717    ...
718    ValueError: expected 7 bytes in a unicodestring8, but only 6 remain
719    """
720
721    n = read_uint8(f)
722    assert n >= 0
723    if n > sys.maxsize:
724        raise ValueError("unicodestring8 byte count > sys.maxsize: %d" % n)
725    data = f.read(n)
726    if len(data) == n:
727        return str(data, 'utf-8', 'surrogatepass')
728    raise ValueError("expected %d bytes in a unicodestring8, but only %d "
729                     "remain" % (n, len(data)))
730
731unicodestring8 = ArgumentDescriptor(
732                    name="unicodestring8",
733                    n=TAKEN_FROM_ARGUMENT8U,
734                    reader=read_unicodestring8,
735                    doc="""A counted Unicode string.
736
                    The first argument is an 8-byte little-endian unsigned
                    int giving the number of bytes in the string, and the
                    second argument -- the UTF-8 encoding of the Unicode
                    string -- contains that many bytes.
741                    """)
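
# Illustrative sketch (an addition, not part of the original module): why
# the counted Unicode readers above pass 'surrogatepass'.  A lone surrogate
# is rejected by strict UTF-8 but survives with that error handler.  The
# helper name is hypothetical.
def _demo_surrogatepass():
    lone = '\ud800'            # a lone surrogate code point
    try:
        lone.encode('utf-8')
        strict_ok = True
    except UnicodeEncodeError:
        strict_ok = False
    raw = lone.encode('utf-8', 'surrogatepass')
    return strict_ok, str(raw, 'utf-8', 'surrogatepass') == lone
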
742
743
744def read_decimalnl_short(f):
745    r"""
746    >>> import io
747    >>> read_decimalnl_short(io.BytesIO(b"1234\n56"))
748    1234
749
750    >>> read_decimalnl_short(io.BytesIO(b"1234L\n56"))
751    Traceback (most recent call last):
752    ...
753    ValueError: invalid literal for int() with base 10: b'1234L'
754    """
755
756    s = read_stringnl(f, decode=False, stripquotes=False)
757
758    # There's a hack for True and False here.
759    if s == b"00":
760        return False
761    elif s == b"01":
762        return True
763
764    return int(s)
765
766def read_decimalnl_long(f):
767    r"""
768    >>> import io
769
770    >>> read_decimalnl_long(io.BytesIO(b"1234L\n56"))
771    1234
772
773    >>> read_decimalnl_long(io.BytesIO(b"123456789012345678901234L\n6"))
774    123456789012345678901234
775    """
776
777    s = read_stringnl(f, decode=False, stripquotes=False)
778    if s[-1:] == b'L':
779        s = s[:-1]
780    return int(s)
781
782
783decimalnl_short = ArgumentDescriptor(
784                      name='decimalnl_short',
785                      n=UP_TO_NEWLINE,
786                      reader=read_decimalnl_short,
787                      doc="""A newline-terminated decimal integer literal.
788
789                          This never has a trailing 'L', and the integer fit
790                          in a short Python int on the box where the pickle
791                          was written -- but there's no guarantee it will fit
792                          in a short Python int on the box where the pickle
793                          is read.
794                          """)
795
796decimalnl_long = ArgumentDescriptor(
797                     name='decimalnl_long',
798                     n=UP_TO_NEWLINE,
799                     reader=read_decimalnl_long,
800                     doc="""A newline-terminated decimal integer literal.
801
802                         This has a trailing 'L', and can represent integers
803                         of any size.
804                         """)
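
# Illustrative sketch (an addition, not part of the original module): the
# "00"/"01" hack in read_decimalnl_short mirrors how the unpickler treats
# the INT opcode, assuming a Python 3 pickle module.  The helper name is
# hypothetical.
def _demo_bool_hack():
    import pickle

    # Each pickle is: INT opcode b'I', a decimalnl_short argument, STOP b'.'.
    return (pickle.loads(b'I01\n.'),
            pickle.loads(b'I00\n.'),
            pickle.loads(b'I1\n.'))
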
805
806
807def read_floatnl(f):
808    r"""
809    >>> import io
810    >>> read_floatnl(io.BytesIO(b"-1.25\n6"))
811    -1.25
812    """
813    s = read_stringnl(f, decode=False, stripquotes=False)
814    return float(s)
815
816floatnl = ArgumentDescriptor(
817              name='floatnl',
818              n=UP_TO_NEWLINE,
819              reader=read_floatnl,
820              doc="""A newline-terminated decimal floating literal.
821
822              In general this requires 17 significant digits for roundtrip
823              identity, and pickling then unpickling infinities, NaNs, and
824              minus zero doesn't work across boxes, or on some boxes even
825              on itself (e.g., Windows can't read the strings it produces
826              for infinities or NaNs).
827              """)
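
# Illustrative sketch (an addition, not part of the original module): the
# 17-significant-digit remark above, checked on one sample value.  The
# helper name and the default value are hypothetical.
def _demo_float_repr_digits(x=0.1 + 0.2):
    survives_17 = float('%.17g' % x) == x    # 17 digits round-trip a double
    survives_15 = float('%.15g' % x) == x    # 15 digits can lose information
    return survives_17, survives_15
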
828
829def read_float8(f):
830    r"""
831    >>> import io, struct
832    >>> raw = struct.pack(">d", -1.25)
833    >>> raw
834    b'\xbf\xf4\x00\x00\x00\x00\x00\x00'
835    >>> read_float8(io.BytesIO(raw + b"\n"))
836    -1.25
837    """
838
839    data = f.read(8)
840    if len(data) == 8:
841        return _unpack(">d", data)[0]
842    raise ValueError("not enough data in stream to read float8")
843
844
845float8 = ArgumentDescriptor(
846             name='float8',
847             n=8,
848             reader=read_float8,
849             doc="""An 8-byte binary representation of a float, big-endian.
850
851             The format is unique to Python, and shared with the struct
852             module (format string '>d') "in theory" (the struct and pickle
853             implementations don't share the code -- they should).  It's
854             strongly related to the IEEE-754 double format, and, in normal
855             cases, is in fact identical to the big-endian 754 double format.
856             On other boxes the dynamic range is limited to that of a 754
857             double, and "add a half and chop" rounding is used to reduce
858             the precision to 53 bits.  However, even on a 754 box,
859             infinities, NaNs, and minus zero may not be handled correctly
860             (may not survive roundtrip pickling intact).
861             """)
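
# Illustrative sketch (an addition, not part of the original module): on a
# typical IEEE-754 box the float8 bytes match struct's '>d' format, and a
# protocol 1 pickle of a float is expected to be the BINFLOAT opcode b'G',
# those 8 bytes, then STOP b'.'.  The helper name is hypothetical.
def _demo_float8(x=-1.25):
    import pickle
    import struct

    raw = struct.pack('>d', x)
    return pickle.dumps(x, protocol=1) == b'G' + raw + b'.'
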
862
863# Protocol 2 formats
864
865from pickle import decode_long
866
867def read_long1(f):
868    r"""
869    >>> import io
870    >>> read_long1(io.BytesIO(b"\x00"))
871    0
872    >>> read_long1(io.BytesIO(b"\x02\xff\x00"))
873    255
874    >>> read_long1(io.BytesIO(b"\x02\xff\x7f"))
875    32767
876    >>> read_long1(io.BytesIO(b"\x02\x00\xff"))
877    -256
878    >>> read_long1(io.BytesIO(b"\x02\x00\x80"))
879    -32768
880    """
881
882    n = read_uint1(f)
883    data = f.read(n)
884    if len(data) != n:
885        raise ValueError("not enough data in stream to read long1")
886    return decode_long(data)
887
888long1 = ArgumentDescriptor(
889    name="long1",
890    n=TAKEN_FROM_ARGUMENT1,
891    reader=read_long1,
892    doc="""A binary long, little-endian, using 1-byte size.
893
894    This first reads one byte as an unsigned size, then reads that
895    many bytes and interprets them as a little-endian 2's-complement long.
    If the size is 0, that's taken as a shortcut for the int 0.
897    """)
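
# Illustrative sketch, not part of the original module:  building the bytes
# that read_long1 expects, using pickle.encode_long for the little-endian
# 2's-complement body.  The helper name _sketch_long1_bytes is hypothetical.
from pickle import encode_long

def _sketch_long1_bytes(value):
    body = encode_long(value)          # b'' for 0, per the shortcut above
    if len(body) > 255:
        raise ValueError("too many bytes for a 1-byte size; use long4")
    # One unsigned size byte, then that many payload bytes.
    return bytes([len(body)]) + body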
898
899def read_long4(f):
900    r"""
901    >>> import io
902    >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\xff\x00"))
903    255
904    >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\xff\x7f"))
905    32767
906    >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\x00\xff"))
907    -256
908    >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\x00\x80"))
909    -32768
    >>> read_long4(io.BytesIO(b"\x00\x00\x00\x00"))
911    0
912    """
913
914    n = read_int4(f)
915    if n < 0:
916        raise ValueError("long4 byte count < 0: %d" % n)
917    data = f.read(n)
918    if len(data) != n:
919        raise ValueError("not enough data in stream to read long4")
920    return decode_long(data)
921
922long4 = ArgumentDescriptor(
923    name="long4",
924    n=TAKEN_FROM_ARGUMENT4,
925    reader=read_long4,
926    doc="""A binary representation of a long, little-endian.
927
928    This first reads four bytes as a signed size (but requires the
929    size to be >= 0), then reads that many bytes and interprets them
930    as a little-endian 2's-complement long.  If the size is 0, that's taken
931    as a shortcut for the int 0, although LONG1 should really be used
932    then instead (and in any case where # of bytes < 256).
933    """)
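
# Illustrative sketch, not part of the original module:  decoding a long4
# argument by hand, without decode_long.  The helper name
# _sketch_decode_long4 is hypothetical; it takes the raw argument bytes
# rather than a file-like object, to keep the example short.
import struct

def _sketch_decode_long4(raw):
    (n,) = struct.unpack('<i', raw[:4])    # 4-byte signed size
    if n < 0:
        raise ValueError("long4 byte count < 0: %d" % n)
    body = raw[4:4 + n]
    if len(body) != n:
        raise ValueError("not enough data to read long4")
    # An empty body decodes to 0, matching the size-0 shortcut.
    return int.from_bytes(body, 'little', signed=True)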
934
935
936##############################################################################
937# Object descriptors.  The stack used by the pickle machine holds objects,
938# and in the stack_before and stack_after attributes of OpcodeInfo
939# descriptors we need names to describe the various types of objects that can
940# appear on the stack.
941
942class StackObject(object):
943    __slots__ = (
944        # name of descriptor record, for info only
945        'name',
946
947        # type of object, or tuple of type objects (meaning the object can
948        # be of any type in the tuple)
949        'obtype',
950
951        # human-readable docs for this kind of stack object; a string
952        'doc',
953    )
954
955    def __init__(self, name, obtype, doc):
956        assert isinstance(name, str)
957        self.name = name
958
959        assert isinstance(obtype, type) or isinstance(obtype, tuple)
960        if isinstance(obtype, tuple):
961            for contained in obtype:
962                assert isinstance(contained, type)
963        self.obtype = obtype
964
965        assert isinstance(doc, str)
966        self.doc = doc
967
968    def __repr__(self):
969        return self.name
970
971
972pyint = StackObject(
973            name='int',
974            obtype=int,
975            doc="A short (as opposed to long) Python integer object.")
976
977pylong = StackObject(
978             name='long',
979             obtype=int,
980             doc="A long (as opposed to short) Python integer object.")
981
982pyinteger_or_bool = StackObject(
983                        name='int_or_bool',
984                        obtype=(int, bool),
985                        doc="A Python integer object (short or long), or "
986                            "a Python bool.")
987
988pybool = StackObject(
989             name='bool',
990             obtype=(bool,),
991             doc="A Python bool object.")
992
993pyfloat = StackObject(
994              name='float',
995              obtype=float,
996              doc="A Python float object.")
997
998pystring = StackObject(
999               name='string',
1000               obtype=bytes,
1001               doc="A Python (8-bit) string object.")
1002
1003pybytes = StackObject(
1004               name='bytes',
1005               obtype=bytes,
1006               doc="A Python bytes object.")
1007
1008pyunicode = StackObject(
1009                name='str',
1010                obtype=str,
1011                doc="A Python (Unicode) string object.")
1012
1013pynone = StackObject(
1014             name="None",
1015             obtype=type(None),
1016             doc="The Python None object.")
1017
1018pytuple = StackObject(
1019              name="tuple",
1020              obtype=tuple,
1021              doc="A Python tuple object.")
1022
1023pylist = StackObject(
1024             name="list",
1025             obtype=list,
1026             doc="A Python list object.")
1027
1028pydict = StackObject(
1029             name="dict",
1030             obtype=dict,
1031             doc="A Python dict object.")
1032
1033pyset = StackObject(
1034            name="set",
1035            obtype=set,
1036            doc="A Python set object.")
1037
1038pyfrozenset = StackObject(
1039                  name="frozenset",
                  obtype=frozenset,
1041                  doc="A Python frozenset object.")
1042
1043anyobject = StackObject(
1044                name='any',
1045                obtype=object,
1046                doc="Any kind of object whatsoever.")
1047
1048markobject = StackObject(
1049                 name="mark",
1050                 obtype=StackObject,
1051                 doc="""'The mark' is a unique object.
1052
1053                 Opcodes that operate on a variable number of objects
1054                 generally don't embed the count of objects in the opcode,
1055                 or pull it off the stack.  Instead the MARK opcode is used
1056                 to push a special marker object on the stack, and then
1057                 some other opcodes grab all the objects from the top of
1058                 the stack down to (but not including) the topmost marker
1059                 object.
1060                 """)
1061
1062stackslice = StackObject(
1063                 name="stackslice",
1064                 obtype=StackObject,
1065                 doc="""An object representing a contiguous slice of the stack.
1066
1067                 This is used in conjunction with markobject, to represent all
1068                 of the stack following the topmost markobject.  For example,
1069                 the POP_MARK opcode changes the stack from
1070
1071                     [..., markobject, stackslice]
1072                 to
1073                     [...]
1074
                 No matter how many objects are on the stack after the
                 topmost markobject, POP_MARK gets rid of all of them (and
                 of the topmost markobject too).
1078                 """)
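
# Illustrative sketch, not part of the original module:  two hand-written
# protocol 0 pickles showing how a mark delimits a stack slice.  The helper
# name _sketch_mark_and_slice is hypothetical.
import pickle

def _sketch_mark_and_slice():
    # MARK, INT 1, INT 2, INT 3, LIST, STOP:  LIST collects everything
    # after the topmost mark into one list and drops the mark.
    built = pickle.loads(b"(I1\nI2\nI3\nl.")
    # INT 7, MARK, INT 1, INT 2, POP_MARK, STOP:  POP_MARK discards the
    # slice and the mark, leaving the 7 on top of the stack.
    kept = pickle.loads(b"I7\n(I1\nI2\n1.")
    return built, kept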
1079
1080##############################################################################
1081# Descriptors for pickle opcodes.
1082
1083class OpcodeInfo(object):
1084
1085    __slots__ = (
1086        # symbolic name of opcode; a string
1087        'name',
1088
1089        # the code used in a bytestream to represent the opcode; a
1090        # one-character string
1091        'code',
1092
1093        # If the opcode has an argument embedded in the byte string, an
1094        # instance of ArgumentDescriptor specifying its type.  Note that
1095        # arg.reader(s) can be used to read and decode the argument from
1096        # the bytestream s, and arg.doc documents the format of the raw
1097        # argument bytes.  If the opcode doesn't have an argument embedded
1098        # in the bytestream, arg should be None.
1099        'arg',
1100
1101        # what the stack looks like before this opcode runs; a list
1102        'stack_before',
1103
1104        # what the stack looks like after this opcode runs; a list
1105        'stack_after',
1106
1107        # the protocol number in which this opcode was introduced; an int
1108        'proto',
1109
1110        # human-readable docs for this opcode; a string
1111        'doc',
1112    )
1113
1114    def __init__(self, name, code, arg,
1115                 stack_before, stack_after, proto, doc):
1116        assert isinstance(name, str)
1117        self.name = name
1118
1119        assert isinstance(code, str)
1120        assert len(code) == 1
1121        self.code = code
1122
1123        assert arg is None or isinstance(arg, ArgumentDescriptor)
1124        self.arg = arg
1125
1126        assert isinstance(stack_before, list)
1127        for x in stack_before:
1128            assert isinstance(x, StackObject)
1129        self.stack_before = stack_before
1130
1131        assert isinstance(stack_after, list)
1132        for x in stack_after:
1133            assert isinstance(x, StackObject)
1134        self.stack_after = stack_after
1135
1136        assert isinstance(proto, int) and 0 <= proto <= pickle.HIGHEST_PROTOCOL
1137        self.proto = proto
1138
1139        assert isinstance(doc, str)
1140        self.doc = doc
1141
1142I = OpcodeInfo
1143opcodes = [
1144
1145    # Ways to spell integers.
1146
1147    I(name='INT',
1148      code='I',
1149      arg=decimalnl_short,
1150      stack_before=[],
1151      stack_after=[pyinteger_or_bool],
1152      proto=0,
1153      doc="""Push an integer or bool.
1154
1155      The argument is a newline-terminated decimal literal string.
1156
1157      The intent may have been that this always fit in a short Python int,
1158      but INT can be generated in pickles written on a 64-bit box that
1159      require a Python long on a 32-bit box.  The difference between this
1160      and LONG then is that INT skips a trailing 'L', and produces a short
1161      int whenever possible.
1162
      Another difference stems from the history of bool:  when bool was
      introduced as a distinct type in 2.3, the builtin names True and
      False were also added to 2.2.2, mapping to ints 1 and 0.  For
      compatibility in both directions, True gets pickled as the INT
      opcode followed by "01\\n" (that is, "I01\\n"), and False as
      "I00\\n".  Leading zeroes are never produced for a genuine integer.
      The 2.3 (and later) unpicklers special-case these and return bool
      instead; earlier unpicklers ignore the leading "0" and return the int.
1170      """),
1171
1172    I(name='BININT',
1173      code='J',
1174      arg=int4,
1175      stack_before=[],
1176      stack_after=[pyint],
1177      proto=1,
1178      doc="""Push a four-byte signed integer.
1179
1180      This handles the full range of Python (short) integers on a 32-bit
1181      box, directly as binary bytes (1 for the opcode and 4 for the integer).
1182      If the integer is non-negative and fits in 1 or 2 bytes, pickling via
1183      BININT1 or BININT2 saves space.
1184      """),
1185
1186    I(name='BININT1',
1187      code='K',
1188      arg=uint1,
1189      stack_before=[],
1190      stack_after=[pyint],
1191      proto=1,
1192      doc="""Push a one-byte unsigned integer.
1193
1194      This is a space optimization for pickling very small non-negative ints,
1195      in range(256).
1196      """),
1197
1198    I(name='BININT2',
1199      code='M',
1200      arg=uint2,
1201      stack_before=[],
1202      stack_after=[pyint],
1203      proto=1,
1204      doc="""Push a two-byte unsigned integer.
1205
1206      This is a space optimization for pickling small positive ints, in
1207      range(256, 2**16).  Integers in range(256) can also be pickled via
1208      BININT2, but BININT1 instead saves a byte.
1209      """),
1210
1211    I(name='LONG',
1212      code='L',
1213      arg=decimalnl_long,
1214      stack_before=[],
1215      stack_after=[pylong],
1216      proto=0,
1217      doc="""Push a long integer.
1218
      The same as INT, except that the literal ends with 'L', and always
      unpickles to a Python long.  There doesn't seem to be a real purpose
      to the trailing 'L'.
1222
1223      Note that LONG takes time quadratic in the number of digits when
1224      unpickling (this is simply due to the nature of decimal->binary
1225      conversion).  Proto 2 added linear-time (in C; still quadratic-time
1226      in Python) LONG1 and LONG4 opcodes.
1227      """),
1228
1229    I(name="LONG1",
1230      code='\x8a',
1231      arg=long1,
1232      stack_before=[],
1233      stack_after=[pylong],
1234      proto=2,
1235      doc="""Long integer using one-byte length.
1236
1237      A more efficient encoding of a Python long; the long1 encoding
1238      says it all."""),
1239
1240    I(name="LONG4",
1241      code='\x8b',
1242      arg=long4,
1243      stack_before=[],
1244      stack_after=[pylong],
1245      proto=2,
      doc="""Long integer using four-byte length.
1247
1248      A more efficient encoding of a Python long; the long4 encoding
1249      says it all."""),
1250
1251    # Ways to spell strings (8-bit, not Unicode).
1252
1253    I(name='STRING',
1254      code='S',
1255      arg=stringnl,
1256      stack_before=[],
1257      stack_after=[pystring],
1258      proto=0,
1259      doc="""Push a Python string object.
1260
1261      The argument is a repr-style string, with bracketing quote characters,
1262      and perhaps embedded escapes.  The argument extends until the next
1263      newline character.  (Actually, they are decoded into a str instance
      using the encoding given to the Unpickler constructor, or the default,
1265      'ASCII'.)
1266      """),
1267
1268    I(name='BINSTRING',
1269      code='T',
1270      arg=string4,
1271      stack_before=[],
1272      stack_after=[pystring],
1273      proto=1,
1274      doc="""Push a Python string object.
1275
1276      There are two arguments:  the first is a 4-byte little-endian signed int
1277      giving the number of bytes in the string, and the second is that many
1278      bytes, which are taken literally as the string content.  (Actually,
1279      they are decoded into a str instance using the encoding given to the
      Unpickler constructor, or the default, 'ASCII'.)
1281      """),
1282
1283    I(name='SHORT_BINSTRING',
1284      code='U',
1285      arg=string1,
1286      stack_before=[],
1287      stack_after=[pystring],
1288      proto=1,
1289      doc="""Push a Python string object.
1290
1291      There are two arguments:  the first is a 1-byte unsigned int giving
1292      the number of bytes in the string, and the second is that many bytes,
1293      which are taken literally as the string content.  (Actually, they
1294      are decoded into a str instance using the encoding given to the
      Unpickler constructor, or the default, 'ASCII'.)
1296      """),
1297
    # Bytes (protocol 3 and higher; older protocols don't support bytes)
1299
1300    I(name='BINBYTES',
1301      code='B',
1302      arg=bytes4,
1303      stack_before=[],
1304      stack_after=[pybytes],
1305      proto=3,
1306      doc="""Push a Python bytes object.
1307
1308      There are two arguments:  the first is a 4-byte little-endian unsigned int
1309      giving the number of bytes, and the second is that many bytes, which are
1310      taken literally as the bytes content.
1311      """),
1312
1313    I(name='SHORT_BINBYTES',
1314      code='C',
1315      arg=bytes1,
1316      stack_before=[],
1317      stack_after=[pybytes],
1318      proto=3,
1319      doc="""Push a Python bytes object.
1320
1321      There are two arguments:  the first is a 1-byte unsigned int giving
1322      the number of bytes, and the second is that many bytes, which are taken
      literally as the bytes content.
1324      """),
1325
1326    I(name='BINBYTES8',
1327      code='\x8e',
1328      arg=bytes8,
1329      stack_before=[],
1330      stack_after=[pybytes],
1331      proto=4,
1332      doc="""Push a Python bytes object.
1333
      There are two arguments:  the first is an 8-byte little-endian unsigned
      int giving the number of bytes, and the second is that many bytes,
      which are taken literally as the bytes content.
1337      """),
1338
1339    # Ways to spell None.
1340
1341    I(name='NONE',
1342      code='N',
1343      arg=None,
1344      stack_before=[],
1345      stack_after=[pynone],
1346      proto=0,
1347      doc="Push None on the stack."),
1348
1349    # Ways to spell bools, starting with proto 2.  See INT for how this was
1350    # done before proto 2.
1351
1352    I(name='NEWTRUE',
1353      code='\x88',
1354      arg=None,
1355      stack_before=[],
1356      stack_after=[pybool],
1357      proto=2,
1358      doc="""True.
1359
1360      Push True onto the stack."""),
1361
1362    I(name='NEWFALSE',
1363      code='\x89',
1364      arg=None,
1365      stack_before=[],
1366      stack_after=[pybool],
1367      proto=2,
      doc="""False.
1369
1370      Push False onto the stack."""),
1371
1372    # Ways to spell Unicode strings.
1373
1374    I(name='UNICODE',
1375      code='V',
1376      arg=unicodestringnl,
1377      stack_before=[],
1378      stack_after=[pyunicode],
1379      proto=0,  # this may be pure-text, but it's a later addition
1380      doc="""Push a Python Unicode string object.
1381
1382      The argument is a raw-unicode-escape encoding of a Unicode string,
1383      and so may contain embedded escape sequences.  The argument extends
1384      until the next newline character.
1385      """),
1386
1387    I(name='SHORT_BINUNICODE',
1388      code='\x8c',
1389      arg=unicodestring1,
1390      stack_before=[],
1391      stack_after=[pyunicode],
1392      proto=4,
1393      doc="""Push a Python Unicode string object.
1394
      There are two arguments:  the first is a 1-byte unsigned int giving
      the number of bytes in the string.  The second is that many bytes,
      and is the UTF-8 encoding of the Unicode string.
1398      """),
1399
1400    I(name='BINUNICODE',
1401      code='X',
1402      arg=unicodestring4,
1403      stack_before=[],
1404      stack_after=[pyunicode],
1405      proto=1,
1406      doc="""Push a Python Unicode string object.
1407
1408      There are two arguments:  the first is a 4-byte little-endian unsigned int
1409      giving the number of bytes in the string.  The second is that many
1410      bytes, and is the UTF-8 encoding of the Unicode string.
1411      """),
1412
1413    I(name='BINUNICODE8',
1414      code='\x8d',
1415      arg=unicodestring8,
1416      stack_before=[],
1417      stack_after=[pyunicode],
1418      proto=4,
1419      doc="""Push a Python Unicode string object.
1420
      There are two arguments:  the first is an 8-byte little-endian unsigned
      int giving the number of bytes in the string.  The second is that many
      bytes, and is the UTF-8 encoding of the Unicode string.
1424      """),
1425
1426    # Ways to spell floats.
1427
1428    I(name='FLOAT',
1429      code='F',
1430      arg=floatnl,
1431      stack_before=[],
1432      stack_after=[pyfloat],
1433      proto=0,
1434      doc="""Newline-terminated decimal float literal.
1435
1436      The argument is repr(a_float), and in general requires 17 significant
1437      digits for roundtrip conversion to be an identity (this is so for
1438      IEEE-754 double precision values, which is what Python float maps to
1439      on most boxes).
1440
1441      In general, FLOAT cannot be used to transport infinities, NaNs, or
1442      minus zero across boxes (or even on a single box, if the platform C
1443      library can't read the strings it produces for such things -- Windows
1444      is like that), but may do less damage than BINFLOAT on boxes with
1445      greater precision or dynamic range than IEEE-754 double.
1446      """),
1447
1448    I(name='BINFLOAT',
1449      code='G',
1450      arg=float8,
1451      stack_before=[],
1452      stack_after=[pyfloat],
1453      proto=1,
1454      doc="""Float stored in binary form, with 8 bytes of data.
1455
1456      This generally requires less than half the space of FLOAT encoding.
1457      In general, BINFLOAT cannot be used to transport infinities, NaNs, or
1458      minus zero, raises an exception if the exponent exceeds the range of
1459      an IEEE-754 double, and retains no more than 53 bits of precision (if
1460      there are more than that, "add a half and chop" rounding is used to
1461      cut it back to 53 significant bits).
1462      """),
1463
1464    # Ways to build lists.
1465
1466    I(name='EMPTY_LIST',
1467      code=']',
1468      arg=None,
1469      stack_before=[],
1470      stack_after=[pylist],
1471      proto=1,
1472      doc="Push an empty list."),
1473
1474    I(name='APPEND',
1475      code='a',
1476      arg=None,
1477      stack_before=[pylist, anyobject],
1478      stack_after=[pylist],
1479      proto=0,
1480      doc="""Append an object to a list.
1481
1482      Stack before:  ... pylist anyobject
1483      Stack after:   ... pylist+[anyobject]
1484
1485      although pylist is really extended in-place.
1486      """),
1487
1488    I(name='APPENDS',
1489      code='e',
1490      arg=None,
1491      stack_before=[pylist, markobject, stackslice],
1492      stack_after=[pylist],
1493      proto=1,
1494      doc="""Extend a list by a slice of stack objects.
1495
1496      Stack before:  ... pylist markobject stackslice
1497      Stack after:   ... pylist+stackslice
1498
1499      although pylist is really extended in-place.
1500      """),
1501
1502    I(name='LIST',
1503      code='l',
1504      arg=None,
1505      stack_before=[markobject, stackslice],
1506      stack_after=[pylist],
1507      proto=0,
1508      doc="""Build a list out of the topmost stack slice, after markobject.
1509
1510      All the stack entries following the topmost markobject are placed into
1511      a single Python list, which single list object replaces all of the
1512      stack from the topmost markobject onward.  For example,
1513
1514      Stack before: ... markobject 1 2 3 'abc'
1515      Stack after:  ... [1, 2, 3, 'abc']
1516      """),
1517
1518    # Ways to build tuples.
1519
1520    I(name='EMPTY_TUPLE',
1521      code=')',
1522      arg=None,
1523      stack_before=[],
1524      stack_after=[pytuple],
1525      proto=1,
1526      doc="Push an empty tuple."),
1527
1528    I(name='TUPLE',
1529      code='t',
1530      arg=None,
1531      stack_before=[markobject, stackslice],
1532      stack_after=[pytuple],
1533      proto=0,
1534      doc="""Build a tuple out of the topmost stack slice, after markobject.
1535
1536      All the stack entries following the topmost markobject are placed into
1537      a single Python tuple, which single tuple object replaces all of the
1538      stack from the topmost markobject onward.  For example,
1539
1540      Stack before: ... markobject 1 2 3 'abc'
1541      Stack after:  ... (1, 2, 3, 'abc')
1542      """),
1543
1544    I(name='TUPLE1',
1545      code='\x85',
1546      arg=None,
1547      stack_before=[anyobject],
1548      stack_after=[pytuple],
1549      proto=2,
1550      doc="""Build a one-tuple out of the topmost item on the stack.
1551
1552      This code pops one value off the stack and pushes a tuple of
1553      length 1 whose one item is that value back onto it.  In other
1554      words:
1555
1556          stack[-1] = tuple(stack[-1:])
1557      """),
1558
1559    I(name='TUPLE2',
1560      code='\x86',
1561      arg=None,
1562      stack_before=[anyobject, anyobject],
1563      stack_after=[pytuple],
1564      proto=2,
1565      doc="""Build a two-tuple out of the top two items on the stack.
1566
1567      This code pops two values off the stack and pushes a tuple of
1568      length 2 whose items are those values back onto it.  In other
1569      words:
1570
1571          stack[-2:] = [tuple(stack[-2:])]
1572      """),
1573
1574    I(name='TUPLE3',
1575      code='\x87',
1576      arg=None,
1577      stack_before=[anyobject, anyobject, anyobject],
1578      stack_after=[pytuple],
1579      proto=2,
1580      doc="""Build a three-tuple out of the top three items on the stack.
1581
1582      This code pops three values off the stack and pushes a tuple of
1583      length 3 whose items are those values back onto it.  In other
1584      words:
1585
1586          stack[-3:] = [tuple(stack[-3:])]
1587      """),
1588
1589    # Ways to build dicts.
1590
1591    I(name='EMPTY_DICT',
1592      code='}',
1593      arg=None,
1594      stack_before=[],
1595      stack_after=[pydict],
1596      proto=1,
1597      doc="Push an empty dict."),
1598
1599    I(name='DICT',
1600      code='d',
1601      arg=None,
1602      stack_before=[markobject, stackslice],
1603      stack_after=[pydict],
1604      proto=0,
1605      doc="""Build a dict out of the topmost stack slice, after markobject.
1606
1607      All the stack entries following the topmost markobject are placed into
1608      a single Python dict, which single dict object replaces all of the
1609      stack from the topmost markobject onward.  The stack slice alternates
1610      key, value, key, value, ....  For example,
1611
1612      Stack before: ... markobject 1 2 3 'abc'
1613      Stack after:  ... {1: 2, 3: 'abc'}
1614      """),
1615
1616    I(name='SETITEM',
1617      code='s',
1618      arg=None,
1619      stack_before=[pydict, anyobject, anyobject],
1620      stack_after=[pydict],
1621      proto=0,
1622      doc="""Add a key+value pair to an existing dict.
1623
1624      Stack before:  ... pydict key value
1625      Stack after:   ... pydict
1626
1627      where pydict has been modified via pydict[key] = value.
1628      """),
1629
1630    I(name='SETITEMS',
1631      code='u',
1632      arg=None,
1633      stack_before=[pydict, markobject, stackslice],
1634      stack_after=[pydict],
1635      proto=1,
1636      doc="""Add an arbitrary number of key+value pairs to an existing dict.
1637
1638      The slice of the stack following the topmost markobject is taken as
1639      an alternating sequence of keys and values, added to the dict
1640      immediately under the topmost markobject.  Everything at and after the
1641      topmost markobject is popped, leaving the mutated dict at the top
1642      of the stack.
1643
1644      Stack before:  ... pydict markobject key_1 value_1 ... key_n value_n
1645      Stack after:   ... pydict
1646
1647      where pydict has been modified via pydict[key_i] = value_i for i in
1648      1, 2, ..., n, and in that order.
1649      """),
1650
1651    # Ways to build sets
1652
1653    I(name='EMPTY_SET',
1654      code='\x8f',
1655      arg=None,
1656      stack_before=[],
1657      stack_after=[pyset],
1658      proto=4,
1659      doc="Push an empty set."),
1660
1661    I(name='ADDITEMS',
1662      code='\x90',
1663      arg=None,
1664      stack_before=[pyset, markobject, stackslice],
1665      stack_after=[pyset],
1666      proto=4,
1667      doc="""Add an arbitrary number of items to an existing set.
1668
1669      The slice of the stack following the topmost markobject is taken as
1670      a sequence of items, added to the set immediately under the topmost
1671      markobject.  Everything at and after the topmost markobject is popped,
1672      leaving the mutated set at the top of the stack.
1673
1674      Stack before:  ... pyset markobject item_1 ... item_n
1675      Stack after:   ... pyset
1676
      where pyset has been modified via pyset.add(item_i) for i in
      1, 2, ..., n, and in that order.
1679      """),
1680
1681    # Way to build frozensets
1682
1683    I(name='FROZENSET',
1684      code='\x91',
1685      arg=None,
1686      stack_before=[markobject, stackslice],
1687      stack_after=[pyfrozenset],
1688      proto=4,
1689      doc="""Build a frozenset out of the topmost slice, after markobject.
1690
1691      All the stack entries following the topmost markobject are placed into
1692      a single Python frozenset, which single frozenset object replaces all
1693      of the stack from the topmost markobject onward.  For example,
1694
1695      Stack before: ... markobject 1 2 3
1696      Stack after:  ... frozenset({1, 2, 3})
1697      """),
1698
1699    # Stack manipulation.
1700
1701    I(name='POP',
1702      code='0',
1703      arg=None,
1704      stack_before=[anyobject],
1705      stack_after=[],
1706      proto=0,
1707      doc="Discard the top stack item, shrinking the stack by one item."),
1708
1709    I(name='DUP',
1710      code='2',
1711      arg=None,
1712      stack_before=[anyobject],
1713      stack_after=[anyobject, anyobject],
1714      proto=0,
1715      doc="Push the top stack item onto the stack again, duplicating it."),
1716
1717    I(name='MARK',
1718      code='(',
1719      arg=None,
1720      stack_before=[],
1721      stack_after=[markobject],
1722      proto=0,
1723      doc="""Push markobject onto the stack.
1724
1725      markobject is a unique object, used by other opcodes to identify a
1726      region of the stack containing a variable number of objects for them
1727      to work on.  See markobject.doc for more detail.
1728      """),
1729
1730    I(name='POP_MARK',
1731      code='1',
1732      arg=None,
1733      stack_before=[markobject, stackslice],
1734      stack_after=[],
1735      proto=1,
1736      doc="""Pop all the stack objects at and above the topmost markobject.
1737
1738      When an opcode using a variable number of stack objects is done,
1739      POP_MARK is used to remove those objects, and to remove the markobject
1740      that delimited their starting position on the stack.
1741      """),
1742
1743    # Memo manipulation.  There are really only two operations (get and put),
1744    # each in all-text, "short binary", and "long binary" flavors.
1745
1746    I(name='GET',
1747      code='g',
1748      arg=decimalnl_short,
1749      stack_before=[],
1750      stack_after=[anyobject],
1751      proto=0,
1752      doc="""Read an object from the memo and push it on the stack.
1753
1754      The index of the memo object to push is given by the newline-terminated
1755      decimal string following.  BINGET and LONG_BINGET are space-optimized
1756      versions.
1757      """),
1758
1759    I(name='BINGET',
1760      code='h',
1761      arg=uint1,
1762      stack_before=[],
1763      stack_after=[anyobject],
1764      proto=1,
1765      doc="""Read an object from the memo and push it on the stack.
1766
1767      The index of the memo object to push is given by the 1-byte unsigned
1768      integer following.
1769      """),
1770
1771    I(name='LONG_BINGET',
1772      code='j',
1773      arg=uint4,
1774      stack_before=[],
1775      stack_after=[anyobject],
1776      proto=1,
1777      doc="""Read an object from the memo and push it on the stack.
1778
1779      The index of the memo object to push is given by the 4-byte unsigned
1780      little-endian integer following.
1781      """),
1782
1783    I(name='PUT',
1784      code='p',
1785      arg=decimalnl_short,
1786      stack_before=[],
1787      stack_after=[],
1788      proto=0,
1789      doc="""Store the stack top into the memo.  The stack is not popped.
1790
1791      The index of the memo location to write into is given by the newline-
1792      terminated decimal string following.  BINPUT and LONG_BINPUT are
1793      space-optimized versions.
1794      """),
1795
1796    I(name='BINPUT',
1797      code='q',
1798      arg=uint1,
1799      stack_before=[],
1800      stack_after=[],
1801      proto=1,
1802      doc="""Store the stack top into the memo.  The stack is not popped.
1803
1804      The index of the memo location to write into is given by the 1-byte
1805      unsigned integer following.
1806      """),
1807
1808    I(name='LONG_BINPUT',
1809      code='r',
1810      arg=uint4,
1811      stack_before=[],
1812      stack_after=[],
1813      proto=1,
1814      doc="""Store the stack top into the memo.  The stack is not popped.
1815
1816      The index of the memo location to write into is given by the 4-byte
1817      unsigned little-endian integer following.
1818      """),
1819
1820    I(name='MEMOIZE',
1821      code='\x94',
1822      arg=None,
1823      stack_before=[anyobject],
1824      stack_after=[anyobject],
1825      proto=4,
1826      doc="""Store the stack top into the memo.  The stack is not popped.
1827
1828      The index of the memo location to write is the number of
1829      elements currently present in the memo.
1830      """),
1831
1832    # Access the extension registry (predefined objects).  Akin to the GET
1833    # family.
1834
1835    I(name='EXT1',
1836      code='\x82',
1837      arg=uint1,
1838      stack_before=[],
1839      stack_after=[anyobject],
1840      proto=2,
1841      doc="""Extension code.
1842
1843      This code and the similar EXT2 and EXT4 allow using a registry
1844      of popular objects that are pickled by name, typically classes.
1845      It is envisioned that through a global negotiation and
1846      registration process, third parties can set up a mapping between
1847      ints and object names.
1848
1849      In order to guarantee pickle interchangeability, the extension
1850      code registry ought to be global, although a range of codes may
1851      be reserved for private use.
1852
1853      EXT1 has a 1-byte integer argument.  This is used to index into the
1854      extension registry, and the object at that index is pushed on the stack.
1855      """),
1856
1857    I(name='EXT2',
1858      code='\x83',
1859      arg=uint2,
1860      stack_before=[],
1861      stack_after=[anyobject],
1862      proto=2,
1863      doc="""Extension code.
1864
1865      See EXT1.  EXT2 has a two-byte integer argument.
1866      """),
1867
1868    I(name='EXT4',
1869      code='\x84',
1870      arg=int4,
1871      stack_before=[],
1872      stack_after=[anyobject],
1873      proto=2,
1874      doc="""Extension code.
1875
1876      See EXT1.  EXT4 has a four-byte integer argument.
1877      """),
1878
1879    # Push a class object, or module function, on the stack, via its module
1880    # and name.
1881
1882    I(name='GLOBAL',
1883      code='c',
1884      arg=stringnl_noescape_pair,
1885      stack_before=[],
1886      stack_after=[anyobject],
1887      proto=0,
1888      doc="""Push a global object (module.attr) on the stack.
1889
1890      Two newline-terminated strings follow the GLOBAL opcode.  The first is
1891      taken as a module name, and the second as a class name.  The class
1892      object module.class is pushed on the stack.  More accurately, the
1893      object returned by self.find_class(module, class) is pushed on the
1894      stack, so unpickling subclasses can override this form of lookup.
1895      """),
1896
1897    I(name='STACK_GLOBAL',
1898      code='\x93',
1899      arg=None,
1900      stack_before=[pyunicode, pyunicode],
1901      stack_after=[anyobject],
1902      proto=4,
1903      doc="""Push a global object (module.attr) on the stack.
1904      """),
1905
1906    # Ways to build objects of classes pickle doesn't know about directly
1907    # (user-defined classes).  I despair of documenting this accurately
1908    # and comprehensibly -- you really have to read the pickle code to
1909    # find all the special cases.
1910
1911    I(name='REDUCE',
1912      code='R',
1913      arg=None,
1914      stack_before=[anyobject, anyobject],
1915      stack_after=[anyobject],
1916      proto=0,
1917      doc="""Push an object built from a callable and an argument tuple.
1918
1919      The opcode is named after the __reduce__() method.
1920
1921      Stack before: ... callable pytuple
1922      Stack after:  ... callable(*pytuple)
1923
1924      The callable and the argument tuple are the first two items returned
1925      by a __reduce__ method.  Applying the callable to the argtuple is
1926      supposed to reproduce the original object, or at least get it started.
1927      If the __reduce__ method returns a 3-tuple, the last component is an
1928      argument to be passed to the object's __setstate__, and then the REDUCE
1929      opcode is followed by code to create setstate's argument, and then a
1930      BUILD opcode to apply __setstate__ to that argument.
1931
1932      If not isinstance(callable, type), REDUCE complains unless the
1933      callable has been registered with the copyreg module's
1934      safe_constructors dict, or the callable has a magic
1935      '__safe_for_unpickling__' attribute with a true value.  I'm not sure
1936      why it does this, but I've sure seen this complaint often enough when
1937      I didn't want to <wink>.
1938      """),
1939
1940    I(name='BUILD',
1941      code='b',
1942      arg=None,
1943      stack_before=[anyobject, anyobject],
1944      stack_after=[anyobject],
1945      proto=0,
1946      doc="""Finish building an object, via __setstate__ or dict update.
1947
1948      Stack before: ... anyobject argument
1949      Stack after:  ... anyobject
1950
1951      where anyobject may have been mutated, as follows:
1952
1953      If the object has a __setstate__ method,
1954
1955          anyobject.__setstate__(argument)
1956
1957      is called.
1958
1959      Else the argument must be a dict, the object must have a __dict__, and
1960      the object is updated via
1961
1962          anyobject.__dict__.update(argument)
1963      """),
1964
1965    I(name='INST',
1966      code='i',
1967      arg=stringnl_noescape_pair,
1968      stack_before=[markobject, stackslice],
1969      stack_after=[anyobject],
1970      proto=0,
1971      doc="""Build a class instance.
1972
1973      This is the protocol 0 version of protocol 1's OBJ opcode.
1974      INST is followed by two newline-terminated strings, giving a
1975      module and class name, just as for the GLOBAL opcode (and see
1976      GLOBAL for more details about that).  self.find_class(module, name)
1977      is used to get a class object.
1978
1979      In addition, all the objects on the stack following the topmost
1980      markobject are gathered into a tuple and popped (along with the
1981      topmost markobject), just as for the TUPLE opcode.
1982
1983      Now it gets complicated.  If all of these are true:
1984
1985        + The argtuple is empty (markobject was at the top of the stack
1986          at the start).
1987
1988        + The class object does not have a __getinitargs__ attribute.
1989
1990      then we want to create an old-style class instance without invoking
1991      its __init__() method (pickle has waffled on this over the years; not
1992      calling __init__() is current wisdom).  In this case, an instance of
1993      an old-style dummy class is created, and then we try to rebind its
1994      __class__ attribute to the desired class object.  If this succeeds,
1995      the new instance object is pushed on the stack, and we're done.
1996
1997      Else (the argtuple is not empty, it's not an old-style class object,
1998      or the class object does have a __getinitargs__ attribute), the code
1999      first insists that the class object have a __safe_for_unpickling__
2000      attribute.  Unlike the __safe_for_unpickling__ check in REDUCE,
2001      it doesn't matter whether this attribute has a true or false value, it
2002      only matters whether it exists (XXX this is a bug).  If
2003      __safe_for_unpickling__ doesn't exist, UnpicklingError is raised.
2004
2005      Else (the class object does have a __safe_for_unpickling__ attr),
2006      the class object obtained from INST's arguments is applied to the
2007      argtuple obtained from the stack, and the resulting instance object
2008      is pushed on the stack.
2009
2010      NOTE:  checks for __safe_for_unpickling__ went away in Python 2.3.
2011      NOTE:  the distinction between old-style and new-style classes does
2012             not make sense in Python 3.
2013      """),
2014
2015    I(name='OBJ',
2016      code='o',
2017      arg=None,
2018      stack_before=[markobject, anyobject, stackslice],
2019      stack_after=[anyobject],
2020      proto=1,
2021      doc="""Build a class instance.
2022
2023      This is the protocol 1 version of protocol 0's INST opcode, and is
2024      very much like it.  The major difference is that the class object
2025      is taken off the stack, allowing it to be retrieved from the memo
2026      repeatedly if several instances of the same class are created.  This
2027      can be much more efficient (in both time and space) than repeatedly
2028      embedding the module and class names in INST opcodes.
2029
2030      Unlike INST, OBJ takes no arguments from the opcode stream.  Instead
2031      the class object is taken off the stack, immediately above the
2032      topmost markobject:
2033
2034      Stack before: ... markobject classobject stackslice
2035      Stack after:  ... new_instance_object
2036
2037      As for INST, the remainder of the stack above the markobject is
2038      gathered into an argument tuple, and then the logic seems identical,
2039      except that no __safe_for_unpickling__ check is done (XXX this is
2040      a bug).  See INST for the gory details.
2041
2042      NOTE:  In Python 2.3, INST and OBJ are identical except for how they
2043      get the class object.  That was always the intent; the implementations
2044      had diverged for accidental reasons.
2045      """),
2046
2047    I(name='NEWOBJ',
2048      code='\x81',
2049      arg=None,
2050      stack_before=[anyobject, anyobject],
2051      stack_after=[anyobject],
2052      proto=2,
2053      doc="""Build an object instance.
2054
2055      The stack before should be thought of as containing a class
2056      object followed by an argument tuple (the tuple being the stack
2057      top).  Call these cls and args.  They are popped off the stack,
2058      and the value returned by cls.__new__(cls, *args) is pushed back
2059      onto the stack.
2060      """),
2061
2062    I(name='NEWOBJ_EX',
2063      code='\x92',
2064      arg=None,
2065      stack_before=[anyobject, anyobject, anyobject],
2066      stack_after=[anyobject],
2067      proto=4,
2068      doc="""Build an object instance.
2069
2070      The stack before should be thought of as containing a class
2071      object followed by an argument tuple and by a keyword argument dict
2072      (the dict being the stack top).  Call these cls, args, and kwargs.
2073      They are popped off the stack, and the value returned by
2074      cls.__new__(cls, *args, **kwargs) is pushed back onto the stack.
2075      """),
2076
2077    # Machine control.
2078
2079    I(name='PROTO',
2080      code='\x80',
2081      arg=uint1,
2082      stack_before=[],
2083      stack_after=[],
2084      proto=2,
2085      doc="""Protocol version indicator.
2086
2087      For protocol 2 and above, a pickle must start with this opcode.
2088      The argument is the protocol version, an int in range(2, 256).
2089      """),
2090
2091    I(name='STOP',
2092      code='.',
2093      arg=None,
2094      stack_before=[anyobject],
2095      stack_after=[],
2096      proto=0,
2097      doc="""Stop the unpickling machine.
2098
2099      Every pickle ends with this opcode.  The object at the top of the stack
2100      is popped, and that's the result of unpickling.  The stack should be
2101      empty then.
2102      """),
2103
2104    # Framing support.
2105
2106    I(name='FRAME',
2107      code='\x95',
2108      arg=uint8,
2109      stack_before=[],
2110      stack_after=[],
2111      proto=4,
2112      doc="""Indicate the beginning of a new frame.
2113
2114      The unpickler may use this opcode to safely prefetch data from its
2115      underlying stream.
2116      """),
2117
2118    # Ways to deal with persistent IDs.
2119
2120    I(name='PERSID',
2121      code='P',
2122      arg=stringnl_noescape,
2123      stack_before=[],
2124      stack_after=[anyobject],
2125      proto=0,
2126      doc="""Push an object identified by a persistent ID.
2127
2128      The pickle module doesn't define what a persistent ID means.  PERSID's
2129      argument is a newline-terminated str-style (no embedded escapes, no
2130      bracketing quote characters) string, which *is* "the persistent ID".
2131      The unpickler passes this string to self.persistent_load().  Whatever
2132      object that returns is pushed on the stack.  There is no implementation
2133      of persistent_load() in Python's unpickler:  it must be supplied by an
2134      unpickler subclass.
2135      """),
2136
2137    I(name='BINPERSID',
2138      code='Q',
2139      arg=None,
2140      stack_before=[anyobject],
2141      stack_after=[anyobject],
2142      proto=1,
2143      doc="""Push an object identified by a persistent ID.
2144
2145      Like PERSID, except the persistent ID is popped off the stack (instead
2146      of being a string embedded in the opcode bytestream).  The persistent
2147      ID is passed to self.persistent_load(), and whatever object that
2148      returns is pushed on the stack.  See PERSID for more detail.
2149      """),
2150]
2151del I
2152
2153# Verify uniqueness of .name and .code members.
2154name2i = {}
2155code2i = {}
2156
2157for i, d in enumerate(opcodes):
2158    if d.name in name2i:
2159        raise ValueError("repeated name %r at indices %d and %d" %
2160                         (d.name, name2i[d.name], i))
2161    if d.code in code2i:
2162        raise ValueError("repeated code %r at indices %d and %d" %
2163                         (d.code, code2i[d.code], i))
2164
2165    name2i[d.name] = i
2166    code2i[d.code] = i
2167
2168del name2i, code2i, i, d
2169
2170##############################################################################
2171# Build a code2op dict, mapping opcode characters to OpcodeInfo records.
2172# Also ensure we've got the same stuff as pickle.py, although the
2173# introspection here is dicey.
2174
2175code2op = {}
2176for d in opcodes:
2177    code2op[d.code] = d
2178del d
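
# Usage sketch (hypothetical helper, not part of the original module):
# code2op is keyed by the one-character str form of an opcode, so a raw
# opcode byte has to be decoded as latin-1 before the lookup.  The import
# is repeated inside the function only so the sketch also runs standalone.
def _example_lookup_opcode(code_byte):
    """Return the OpcodeInfo record for a single opcode byte."""
    import pickletools
    return pickletools.code2op[code_byte.decode("latin-1")]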
2179
2180def assure_pickle_consistency(verbose=False):
2181
2182    copy = code2op.copy()
2183    for name in pickle.__all__:
2184        if not re.match("[A-Z][A-Z0-9_]+$", name):
2185            if verbose:
2186                print("skipping %r: it doesn't look like an opcode name" % name)
2187            continue
2188        picklecode = getattr(pickle, name)
2189        if not isinstance(picklecode, bytes) or len(picklecode) != 1:
2190            if verbose:
2191                print(("skipping %r: value %r doesn't look like a pickle "
2192                       "code" % (name, picklecode)))
2193            continue
2194        picklecode = picklecode.decode("latin-1")
2195        if picklecode in copy:
2196            if verbose:
2197                print("checking name %r w/ code %r for consistency" % (
2198                      name, picklecode))
2199            d = copy[picklecode]
2200            if d.name != name:
2201                raise ValueError("for pickle code %r, pickle.py uses name %r "
2202                                 "but we're using name %r" % (picklecode,
2203                                                              name,
2204                                                              d.name))
2205            # Forget this one.  Any left over in copy at the end are a problem
2206            # of a different kind.
2207            del copy[picklecode]
2208        else:
2209            raise ValueError("pickle.py appears to have a pickle opcode with "
2210                             "name %r and code %r, but we don't" %
2211                             (name, picklecode))
2212    if copy:
2213        msg = ["we appear to have pickle opcodes that pickle.py doesn't have:"]
2214        for code, d in copy.items():
2215            msg.append("    name %r with code %r" % (d.name, code))
2216        raise ValueError("\n".join(msg))
2217
2218assure_pickle_consistency()
2219del assure_pickle_consistency
2220
2221##############################################################################
2222# A pickle opcode generator.
2223
2224def _genops(data, yield_end_pos=False):
2225    if isinstance(data, bytes_types):
2226        data = io.BytesIO(data)
2227
2228    if hasattr(data, "tell"):
2229        getpos = data.tell
2230    else:
2231        getpos = lambda: None
2232
2233    while True:
2234        pos = getpos()
2235        code = data.read(1)
2236        opcode = code2op.get(code.decode("latin-1"))
2237        if opcode is None:
2238            if code == b"":
2239                raise ValueError("pickle exhausted before seeing STOP")
2240            else:
2241                raise ValueError("at position %s, opcode %r unknown" % (
2242                                 "<unknown>" if pos is None else pos,
2243                                 code))
2244        if opcode.arg is None:
2245            arg = None
2246        else:
2247            arg = opcode.arg.reader(data)
2248        if yield_end_pos:
2249            yield opcode, arg, pos, getpos()
2250        else:
2251            yield opcode, arg, pos
2252        if code == b'.':
2253            assert opcode.name == 'STOP'
2254            break
2255
2256def genops(pickle):
2257    """Generate all the opcodes in a pickle.
2258
2259    'pickle' is a file-like object, or bytes object, containing the pickle.
2260
2261    Each opcode in the pickle is generated, from the current pickle position,
2262    stopping after a STOP opcode is delivered.  A triple is generated for
2263    each opcode:
2264
2265        opcode, arg, pos
2266
2267    opcode is an OpcodeInfo record, describing the current opcode.
2268
2269    If the opcode has an argument embedded in the pickle, arg is its decoded
2270    value, as a Python object.  If the opcode doesn't have an argument, arg
2271    is None.
2272
2273    If the pickle has a tell() method, pos was the value of pickle.tell()
2274    before reading the current opcode.  If the pickle is a bytes object,
2275    it's wrapped in a BytesIO object, and the latter's tell() result is
2276    used.  Else (the pickle doesn't have a tell(), and it's not obvious how
2277    to query its current position) pos is None.
2278    """
2279    return _genops(pickle)
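
# Usage sketch (hypothetical helper, not part of the original module):
# genops() yields (opcode, arg, pos) triples, so a caller can list the
# opcode names in a pickle without unpickling anything.  The import is
# repeated inside the function only so the sketch also runs standalone.
def _example_opcode_names(data):
    """Return the opcode names found in the pickle 'data', in order."""
    import pickletools
    return [opcode.name for opcode, arg, pos in pickletools.genops(data)]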
2280
2281##############################################################################
2282# A pickle optimizer.
2283
2284def optimize(p):
2285    'Optimize a pickle (a bytes object) by removing unused PUT opcodes'
2286    not_a_put = object()
2287    gets = { not_a_put }    # set of args used by a GET opcode
2288    opcodes = []            # (startpos, stoppos, putid)
2289    proto = 0
2290    for opcode, arg, pos, end_pos in _genops(p, yield_end_pos=True):
2291        if 'PUT' in opcode.name:
2292            opcodes.append((pos, end_pos, arg))
2293        elif 'FRAME' in opcode.name:
2294            pass
2295        else:
2296            if 'GET' in opcode.name:
2297                gets.add(arg)
2298            elif opcode.name == 'PROTO':
2299                assert pos == 0, pos
2300                proto = arg
2301            opcodes.append((pos, end_pos, not_a_put))
2303
2304    # Copy the opcodes except for PUTS without a corresponding GET
2305    out = io.BytesIO()
2306    opcodes = iter(opcodes)
2307    if proto >= 2:
2308        # Write the PROTO header before any framing
2309        start, stop, _ = next(opcodes)
2310        out.write(p[start:stop])
2311    buf = pickle._Framer(out.write)
2312    if proto >= 4:
2313        buf.start_framing()
2314    for start, stop, putid in opcodes:
2315        if putid in gets:
2316            #buf.commit_frame()
2317            buf.write(p[start:stop])
2318    if proto >= 4:
2319        buf.end_framing()
2320    return out.getvalue()
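
# Usage sketch (hypothetical helper, not part of the original module):
# optimize() is meant to drop only PUT opcodes whose memo slot is never
# read back, so the optimized pickle should load to an equal object.
# Protocol 2 is an arbitrary choice for illustration.
def _example_check_optimize(obj, protocol=2):
    """Return (original_size, optimized_size) after a round-trip check."""
    import pickle
    import pickletools
    original = pickle.dumps(obj, protocol)
    optimized = pickletools.optimize(original)
    # The optimized bytes must still describe the same object.
    assert pickle.loads(optimized) == obj
    return len(original), len(optimized)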
2321
2322##############################################################################
2323# A symbolic pickle disassembler.
2324
2325def dis(pickle, out=None, memo=None, indentlevel=4, annotate=0):
2326    """Produce a symbolic disassembly of a pickle.
2327
2328    'pickle' is a file-like object, or bytes object, containing a (at least one)
2329    pickle.  The pickle is disassembled from the current position, through
2330    the first STOP opcode encountered.
2331
2332    Optional arg 'out' is a file-like object to which the disassembly is
2333    printed.  It defaults to sys.stdout.
2334
2335    Optional arg 'memo' is a Python dict, used as the pickle's memo.  It
2336    may be mutated by dis() if the pickle stores into the memo (PUT, MEMOIZE).
2337    Passing the same memo object to another dis() call then allows disassembly
2338    to proceed across multiple pickles that were all created by the same
2339    pickler with the same memo.  Ordinarily you don't need to worry about this.
2340
2341    Optional arg 'indentlevel' is the number of blanks by which to indent
2342    a new MARK level.  It defaults to 4.
2343
2344    Optional arg 'annotate', if nonzero, instructs dis() to add a short
2345    description of the opcode to each line of disassembled output.
2346    The value given to 'annotate' must be an integer and is used as a
2347    hint for the column where annotation should start.  The default
2348    value is 0, meaning no annotations.
2349
2350    In addition to printing the disassembly, some sanity checks are made:
2351
2352    + All embedded opcode arguments "make sense".
2353
2354    + Explicit and implicit pop operations have enough items on the stack.
2355
2356    + When an opcode implicitly refers to a markobject, a markobject is
2357      actually on the stack.
2358
2359    + A memo entry isn't referenced before it's defined.
2360
2361    + The markobject isn't stored in the memo.
2362
2363    + A memo entry isn't redefined.
2364    """
2365
2366    # Most of the hair here is for sanity checks, but most of it is needed
2367    # anyway to detect when a protocol 0 POP takes a MARK off the stack
2368    # (which in turn is needed to indent MARK blocks correctly).
2369
2370    stack = []          # crude emulation of unpickler stack
2371    if memo is None:
2372        memo = {}       # crude emulation of unpickler memo
2373    maxproto = -1       # max protocol number seen
2374    markstack = []      # bytecode positions of MARK opcodes
2375    indentchunk = ' ' * indentlevel
2376    errormsg = None
2377    annocol = annotate  # column hint for annotations
2378    for opcode, arg, pos in genops(pickle):
2379        if pos is not None:
2380            print("%5d:" % pos, end=' ', file=out)
2381
2382        line = "%-4s %s%s" % (repr(opcode.code)[1:-1],
2383                              indentchunk * len(markstack),
2384                              opcode.name)
2385
2386        maxproto = max(maxproto, opcode.proto)
2387        before = opcode.stack_before    # don't mutate
2388        after = opcode.stack_after      # don't mutate
2389        numtopop = len(before)
2390
2391        # See whether a MARK should be popped.
2392        markmsg = None
2393        if markobject in before or (opcode.name == "POP" and
2394                                    stack and
2395                                    stack[-1] is markobject):
2396            assert markobject not in after
2397            if __debug__:
2398                if markobject in before:
2399                    assert before[-1] is stackslice
2400            if markstack:
2401                markpos = markstack.pop()
2402                if markpos is None:
2403                    markmsg = "(MARK at unknown opcode offset)"
2404                else:
2405                    markmsg = "(MARK at %d)" % markpos
2406                # Pop everything at and after the topmost markobject.
2407                while stack[-1] is not markobject:
2408                    stack.pop()
2409                stack.pop()
2410                # Stop later code from popping too much.
2411                try:
2412                    numtopop = before.index(markobject)
2413                except ValueError:
2414                    assert opcode.name == "POP"
2415                    numtopop = 0
2416            else:
2417                errormsg = markmsg = "no MARK exists on stack"
2418
2419        # Check for correct memo usage.
2420        if opcode.name in ("PUT", "BINPUT", "LONG_BINPUT", "MEMOIZE"):
2421            if opcode.name == "MEMOIZE":
2422                memo_idx = len(memo)
2423            else:
2424                assert arg is not None
2425                memo_idx = arg
2426            if memo_idx in memo:
2427                errormsg = "memo key %r already defined" % memo_idx
2428            elif not stack:
2429                errormsg = "stack is empty -- can't store into memo"
2430            elif stack[-1] is markobject:
2431                errormsg = "can't store markobject in the memo"
2432            else:
2433                memo[memo_idx] = stack[-1]
2434        elif opcode.name in ("GET", "BINGET", "LONG_BINGET"):
2435            if arg in memo:
2436                assert len(after) == 1
2437                after = [memo[arg]]     # for better stack emulation
2438            else:
2439                errormsg = "memo key %r has never been stored into" % arg
2440
2441        if arg is not None or markmsg:
2442            # make a mild effort to align arguments
2443            line += ' ' * (10 - len(opcode.name))
2444            if arg is not None:
2445                line += ' ' + repr(arg)
2446            if markmsg:
2447                line += ' ' + markmsg
2448        if annotate:
2449            line += ' ' * (annocol - len(line))
2450            # make a mild effort to align annotations
2451            annocol = len(line)
2452            if annocol > 50:
2453                annocol = annotate
2454            line += ' ' + opcode.doc.split('\n', 1)[0]
2455        print(line, file=out)
2456
2457        if errormsg:
2458            # Note that we delayed complaining until the offending opcode
2459            # was printed.
2460            raise ValueError(errormsg)
2461
2462        # Emulate the stack effects.
2463        if len(stack) < numtopop:
2464            raise ValueError("tries to pop %d items from stack with "
2465                             "only %d items" % (numtopop, len(stack)))
2466        if numtopop:
2467            del stack[-numtopop:]
2468        if markobject in after:
2469            assert markobject not in before
2470            markstack.append(pos)
2471
2472        stack.extend(after)
2473
2474    print("highest protocol among opcodes =", maxproto, file=out)
2475    if stack:
2476        raise ValueError("stack not empty after STOP: %r" % stack)
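
# Usage sketch (hypothetical helper, not part of the original module):
# dis() prints to 'out', so passing an io.StringIO captures the
# disassembly as text instead of writing it to sys.stdout.  The imports
# are repeated inside the function only so the sketch also runs standalone.
def _example_dis_to_string(data, annotate=0):
    """Return the symbolic disassembly of the pickle 'data' as a str."""
    import io
    import pickletools
    buf = io.StringIO()
    pickletools.dis(data, out=buf, annotate=annotate)
    return buf.getvalue()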
2477
2478# For use in the doctest, simply as an example of a class to pickle.
2479class _Example:
2480    def __init__(self, value):
2481        self.value = value
2482
2483_dis_test = r"""
2484>>> import pickle
2485>>> x = [1, 2, (3, 4), {b'abc': "def"}]
2486>>> pkl0 = pickle.dumps(x, 0)
2487>>> dis(pkl0)
2488    0: (    MARK
2489    1: l        LIST       (MARK at 0)
2490    2: p    PUT        0
2491    5: L    LONG       1
2492    9: a    APPEND
2493   10: L    LONG       2
2494   14: a    APPEND
2495   15: (    MARK
2496   16: L        LONG       3
2497   20: L        LONG       4
2498   24: t        TUPLE      (MARK at 15)
2499   25: p    PUT        1
2500   28: a    APPEND
2501   29: (    MARK
2502   30: d        DICT       (MARK at 29)
2503   31: p    PUT        2
2504   34: c    GLOBAL     '_codecs encode'
2505   50: p    PUT        3
2506   53: (    MARK
2507   54: V        UNICODE    'abc'
2508   59: p        PUT        4
2509   62: V        UNICODE    'latin1'
2510   70: p        PUT        5
2511   73: t        TUPLE      (MARK at 53)
2512   74: p    PUT        6
2513   77: R    REDUCE
2514   78: p    PUT        7
2515   81: V    UNICODE    'def'
2516   86: p    PUT        8
2517   89: s    SETITEM
2518   90: a    APPEND
2519   91: .    STOP
2520highest protocol among opcodes = 0
2521
2522Try again with a "binary" pickle.
2523
2524>>> pkl1 = pickle.dumps(x, 1)
2525>>> dis(pkl1)
2526    0: ]    EMPTY_LIST
2527    1: q    BINPUT     0
2528    3: (    MARK
2529    4: K        BININT1    1
2530    6: K        BININT1    2
2531    8: (        MARK
2532    9: K            BININT1    3
2533   11: K            BININT1    4
2534   13: t            TUPLE      (MARK at 8)
2535   14: q        BINPUT     1
2536   16: }        EMPTY_DICT
2537   17: q        BINPUT     2
2538   19: c        GLOBAL     '_codecs encode'
2539   35: q        BINPUT     3
2540   37: (        MARK
2541   38: X            BINUNICODE 'abc'
2542   46: q            BINPUT     4
2543   48: X            BINUNICODE 'latin1'
2544   59: q            BINPUT     5
2545   61: t            TUPLE      (MARK at 37)
2546   62: q        BINPUT     6
2547   64: R        REDUCE
2548   65: q        BINPUT     7
2549   67: X        BINUNICODE 'def'
2550   75: q        BINPUT     8
2551   77: s        SETITEM
2552   78: e        APPENDS    (MARK at 3)
2553   79: .    STOP
2554highest protocol among opcodes = 1
2555
2556Exercise the INST/OBJ/BUILD family.
2557
2558>>> import pickletools
2559>>> dis(pickle.dumps(pickletools.dis, 0))
2560    0: c    GLOBAL     'pickletools dis'
2561   17: p    PUT        0
2562   20: .    STOP
2563highest protocol among opcodes = 0
2564
2565>>> from pickletools import _Example
2566>>> x = [_Example(42)] * 2
2567>>> dis(pickle.dumps(x, 0))
2568    0: (    MARK
2569    1: l        LIST       (MARK at 0)
2570    2: p    PUT        0
2571    5: c    GLOBAL     'copy_reg _reconstructor'
2572   30: p    PUT        1
2573   33: (    MARK
   34: c        GLOBAL     'pickletools _Example'
   56: p        PUT        2
   59: c        GLOBAL     '__builtin__ object'
   79: p        PUT        3
   82: N        NONE
   83: t        TUPLE      (MARK at 33)
   84: p    PUT        4
   87: R    REDUCE
   88: p    PUT        5
   91: (    MARK
   92: d        DICT       (MARK at 91)
   93: p    PUT        6
   96: V    UNICODE    'value'
  103: p    PUT        7
  106: L    LONG       42
  111: s    SETITEM
  112: b    BUILD
  113: a    APPEND
  114: g    GET        5
  117: a    APPEND
  118: .    STOP
highest protocol among opcodes = 0

>>> dis(pickle.dumps(x, 1))
    0: ]    EMPTY_LIST
    1: q    BINPUT     0
    3: (    MARK
    4: c        GLOBAL     'copy_reg _reconstructor'
   29: q        BINPUT     1
   31: (        MARK
   32: c            GLOBAL     'pickletools _Example'
   54: q            BINPUT     2
   56: c            GLOBAL     '__builtin__ object'
   76: q            BINPUT     3
   78: N            NONE
   79: t            TUPLE      (MARK at 31)
   80: q        BINPUT     4
   82: R        REDUCE
   83: q        BINPUT     5
   85: }        EMPTY_DICT
   86: q        BINPUT     6
   88: X        BINUNICODE 'value'
   98: q        BINPUT     7
  100: K        BININT1    42
  102: s        SETITEM
  103: b        BUILD
  104: h        BINGET     5
  106: e        APPENDS    (MARK at 3)
  107: .    STOP
highest protocol among opcodes = 1

Try "the canonical" recursive-object test.

>>> L = []
>>> T = L,
>>> L.append(T)
>>> L[0] is T
True
>>> T[0] is L
True
>>> L[0][0] is L
True
>>> T[0][0] is T
True
>>> dis(pickle.dumps(L, 0))
    0: (    MARK
    1: l        LIST       (MARK at 0)
    2: p    PUT        0
    5: (    MARK
    6: g        GET        0
    9: t        TUPLE      (MARK at 5)
   10: p    PUT        1
   13: a    APPEND
   14: .    STOP
highest protocol among opcodes = 0

>>> dis(pickle.dumps(L, 1))
    0: ]    EMPTY_LIST
    1: q    BINPUT     0
    3: (    MARK
    4: h        BINGET     0
    6: t        TUPLE      (MARK at 3)
    7: q    BINPUT     1
    9: a    APPEND
   10: .    STOP
highest protocol among opcodes = 1

Note that, in the protocol 0 pickle of the recursive tuple, the disassembler
has to emulate the stack in order to realize that the POP opcode at 16 gets
rid of the MARK at 0.

>>> dis(pickle.dumps(T, 0))
    0: (    MARK
    1: (        MARK
    2: l            LIST       (MARK at 1)
    3: p        PUT        0
    6: (        MARK
    7: g            GET        0
   10: t            TUPLE      (MARK at 6)
   11: p        PUT        1
   14: a        APPEND
   15: 0        POP
   16: 0        POP        (MARK at 0)
   17: g    GET        1
   20: .    STOP
highest protocol among opcodes = 0

>>> dis(pickle.dumps(T, 1))
    0: (    MARK
    1: ]        EMPTY_LIST
    2: q        BINPUT     0
    4: (        MARK
    5: h            BINGET     0
    7: t            TUPLE      (MARK at 4)
    8: q        BINPUT     1
   10: a        APPEND
   11: 1        POP_MARK   (MARK at 0)
   12: h    BINGET     1
   14: .    STOP
highest protocol among opcodes = 1

Try protocol 2.

>>> dis(pickle.dumps(L, 2))
    0: \x80 PROTO      2
    2: ]    EMPTY_LIST
    3: q    BINPUT     0
    5: h    BINGET     0
    7: \x85 TUPLE1
    8: q    BINPUT     1
   10: a    APPEND
   11: .    STOP
highest protocol among opcodes = 2

>>> dis(pickle.dumps(T, 2))
    0: \x80 PROTO      2
    2: ]    EMPTY_LIST
    3: q    BINPUT     0
    5: h    BINGET     0
    7: \x85 TUPLE1
    8: q    BINPUT     1
   10: a    APPEND
   11: 0    POP
   12: h    BINGET     1
   14: .    STOP
highest protocol among opcodes = 2

Try protocol 3 with annotations:

>>> dis(pickle.dumps(T, 3), annotate=1)
    0: \x80 PROTO      3 Protocol version indicator.
    2: ]    EMPTY_LIST   Push an empty list.
    3: q    BINPUT     0 Store the stack top into the memo.  The stack is not popped.
    5: h    BINGET     0 Read an object from the memo and push it on the stack.
    7: \x85 TUPLE1       Build a one-tuple out of the topmost item on the stack.
    8: q    BINPUT     1 Store the stack top into the memo.  The stack is not popped.
   10: a    APPEND       Append an object to a list.
   11: 0    POP          Discard the top stack item, shrinking the stack by one item.
   12: h    BINGET     1 Read an object from the memo and push it on the stack.
   14: .    STOP         Stop the unpickling machine.
highest protocol among opcodes = 2

"""
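
# Illustrative sketch, not part of the original test suite: the recursive-tuple
# disassemblies in _dis_test end with POP (or POP_MARK) followed by a GET.  The
# pickler has found that the tuple it was building was already memoized while
# pickling its own contents, so it discards the partial work and fetches the
# memoized tuple instead.  The hypothetical helper below rebuilds that tuple,
# lists its opcode names with genops(), and unpickles it so the cycle can be
# checked.  Imports are local so the sketch also runs when pasted outside this
# module.
def _example_recursive_tuple_ops(proto=2):
    import pickle
    from pickletools import genops

    L = []
    T = (L,)
    L.append(T)                      # now T[0] is L and L[0] is T
    data = pickle.dumps(T, proto)
    # genops() yields (opcode, arg, position) triples; keep only the names.
    names = [opcode.name for opcode, arg, pos in genops(data)]
    clone = pickle.loads(data)       # clone[0][0] should be clone itself
    return names, clone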

_memo_test = r"""
>>> import pickle
>>> import io
>>> f = io.BytesIO()
>>> p = pickle.Pickler(f, 2)
>>> x = [1, 2, 3]
>>> p.dump(x)
>>> p.dump(x)
>>> f.seek(0)
0
>>> memo = {}
>>> dis(f, memo=memo)
    0: \x80 PROTO      2
    2: ]    EMPTY_LIST
    3: q    BINPUT     0
    5: (    MARK
    6: K        BININT1    1
    8: K        BININT1    2
   10: K        BININT1    3
   12: e        APPENDS    (MARK at 5)
   13: .    STOP
highest protocol among opcodes = 2
>>> dis(f, memo=memo)
   14: \x80 PROTO      2
   16: h    BINGET     0
   18: .    STOP
highest protocol among opcodes = 2
"""
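
# Illustrative sketch, not part of the original test suite: why _memo_test
# passes the same dict to both dis() calls.  The second pickle written by one
# Pickler is only PROTO, BINGET 0, STOP, and BINGET 0 can be resolved only if
# the disassembler still remembers the BINPUT 0 seen in the first pickle.
# _example_shared_memo is a hypothetical helper name; imports are local so the
# sketch also runs when pasted outside this module.
def _example_shared_memo():
    import io
    import pickle
    from pickletools import dis

    f = io.BytesIO()
    p = pickle.Pickler(f, 2)
    x = [1, 2, 3]
    p.dump(x)
    p.dump(x)                # second dump refers back to the memoized list
    f.seek(0)

    out = io.StringIO()
    memo = {}
    dis(f, out, memo)        # first pickle records memo key 0
    dis(f, out, memo)        # second pickle resolves BINGET 0 through it
    return out.getvalue()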

__test__ = {'disassembler_test': _dis_test,
            'disassembler_memo_test': _memo_test,
           }

def _test():
    import doctest
    return doctest.testmod()
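
# Illustrative sketch, not part of the original module: the multi-file branch
# of the command-line code below, restated as a function.  The name
# _example_dis_many, its parameters, and the "<stream>" fallback for file
# objects without a .name attribute are assumptions made for this example.
def _example_dis_many(files, out, share_memo=False, indentlevel=4,
                      annotate=0, preamble="==> {name} <=="):
    from pickletools import dis

    # One dict shared by every dis() call mirrors the -m/--memo option;
    # None lets each disassembly start with an empty memo.
    memo = {} if share_memo else None
    for f in files:
        name = getattr(f, "name", "<stream>")
        out.write(preamble.format(name=name) + '\n')
        dis(f, out, memo, indentlevel, annotate)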

if __name__ == "__main__":
    import sys, argparse
    parser = argparse.ArgumentParser(
        description='disassemble one or more pickle files')
    parser.add_argument(
        'pickle_file', type=argparse.FileType('br'),
        nargs='*', help='the pickle file')
    parser.add_argument(
        '-o', '--output', default=sys.stdout, type=argparse.FileType('w'),
        help='the file where the output should be written')
    parser.add_argument(
        '-m', '--memo', action='store_true',
        help='preserve memo between disassemblies')
    parser.add_argument(
        '-l', '--indentlevel', default=4, type=int,
        help='the number of blanks by which to indent a new MARK level')
    parser.add_argument(
        '-a', '--annotate',  action='store_true',
        help='annotate each line with a short opcode description')
    parser.add_argument(
        '-p', '--preamble', default="==> {name} <==",
        help='if more than one pickle file is specified, print this before'
        ' each disassembly')
    parser.add_argument(
        '-t', '--test', action='store_true',
        help='run self-test suite')
    parser.add_argument(
        '-v', action='store_true',
        help='run verbosely; only affects self-test run')
    args = parser.parse_args()
    if args.test:
        _test()
    else:
        annotate = 30 if args.annotate else 0
        if not args.pickle_file:
            parser.print_help()
        elif len(args.pickle_file) == 1:
            dis(args.pickle_file[0], args.output, None,
                args.indentlevel, annotate)
        else:
            memo = {} if args.memo else None
            for f in args.pickle_file:
                preamble = args.preamble.format(name=f.name)
                args.output.write(preamble + '\n')
                dis(f, args.output, memo, args.indentlevel, annotate)
2820