pickletools.py revision 1996e23054f2ac79cf89c9ef04714f336b0a17ce
""""Executable documentation" for the pickle module.

Extensive comments about the pickle protocols and pickle-machine opcodes
can be found here.  Some functions meant for external use:

genops(pickle)
   Generate all the opcodes in a pickle, as (opcode, arg, position) triples.

dis(pickle, out=None, indentlevel=4)
   Print a symbolic disassembly of a pickle.
"""

# Other ideas:
#
# - A pickle verifier:  read a pickle and check it exhaustively for
#   well-formedness.
#
# - A protocol identifier:  examine a pickle and return its protocol number
#   (== the highest .proto attr value among all the opcodes in the pickle).
#
# - A pickle optimizer:  for example, tuple-building code is sometimes more
#   elaborate than necessary, catering for the possibility that the tuple
#   is recursive.  Or lots of times a PUT is generated that's never accessed
#   by a later GET.


"""
"A pickle" is a program for a virtual pickle machine (PM, but more accurately
called an unpickling machine).  It's a sequence of opcodes, interpreted by the
PM, building an arbitrarily complex Python object.

For the most part, the PM is very simple:  there are no looping, testing, or
conditional instructions, no arithmetic and no function calls.  Opcodes are
executed once each, from first to last, until a STOP opcode is reached.

The PM has two data areas, "the stack" and "the memo".

Many opcodes push Python objects onto the stack; e.g., INT pushes a Python
integer object on the stack, whose value is gotten from a decimal string
literal immediately following the INT opcode in the pickle bytestream.  Other
opcodes take Python objects off the stack.  The result of unpickling is
whatever object is left on the stack when the final STOP opcode is executed.
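
As a rough illustration of the stack description above, the sketch below
lists the opcodes of a pickled integer.  It assumes a modern Python 3
standard library rather than this revision of the module, and the helper
name opcode_names is made up for the example.

```python
import pickle
import pickletools

def opcode_names(data):
    # Hypothetical helper: symbolic opcode names of a pickle, in order.
    return [op.name for op, arg, pos in pickletools.genops(data)]

# Protocol 0 spells a small integer with a text-mode opcode plus STOP.
data = pickle.dumps(7, protocol=0)
names = opcode_names(data)
print(names)

# Whatever is left on the stack at STOP is the result of unpickling.
value = pickle.loads(data)
print(value)
```

The last opcode is always STOP; the ones before it leave exactly one object
on the stack.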

The memo is simply an array of objects, or it can be implemented as a dict
mapping little integers to objects.  The memo serves as the PM's "long term
memory", and the little integers indexing the memo are akin to variable
names.  Some opcodes store the object on top of the stack in the memo at a
given index (without popping it), and others push a memo object at a given
index onto the stack again.
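
The memo traffic can be observed directly.  The sketch below (again
assuming a modern Python 3 standard library) pickles a list that refers to
the same sublist twice and collects the memo-related opcodes; the variable
names are illustrative only.

```python
import pickle
import pickletools

shared = ['x']
data = pickle.dumps([shared, shared], protocol=1)

# PUT-style opcodes record the top of the stack in the memo; GET-style
# opcodes push a previously recorded memo entry back onto the stack.
memo_ops = [(op.name, arg) for op, arg, pos in pickletools.genops(data)
            if 'PUT' in op.name or 'GET' in op.name]
print(memo_ops)

result = pickle.loads(data)
print(result[0] is result[1])
```

The second reference to the sublist is spelled as a GET of the memo slot
filled when the sublist was first pickled, which is how sharing survives
the round trip.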

At heart, that's all the PM has.  Subtleties arise for these reasons:

+ Object identity.  Objects can be arbitrarily complex, and subobjects
  may be shared (for example, the list [a, a] refers to the same object a
  twice).  It can be vital that unpickling recreate an isomorphic object
  graph, faithfully reproducing sharing.

+ Recursive objects.  For example, after "L = []; L.append(L)", L is a
  list, and L[0] is the same list.  This is related to the object identity
  point, and some sequences of pickle opcodes are subtle in order to
  get the right result in all cases.

+ Things pickle doesn't know everything about.  Examples of things pickle
  does know everything about are Python's builtin scalar and container
  types, like ints and tuples.  They generally have opcodes dedicated to
  them.  For things like module references and instances of user-defined
  classes, pickle's knowledge is limited.  Historically, many enhancements
  have been made to the pickle protocol in order to do a better (faster,
  and/or more compact) job on those.

+ Backward compatibility and micro-optimization.  As explained below,
  pickle opcodes never go away, not even when better ways to do a thing
  get invented.  The repertoire of the PM just keeps growing over time.
  So, e.g., there are now five distinct opcodes for building a Python integer,
  four of them devoted to "short" integers.  Even so, the only way to pickle
  a Python long int takes time quadratic in the number of digits, for both
  pickling and unpickling.  This isn't so much a subtlety as a source of
  wearying complication.
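
The recursive-object case from the list above can be checked with a short
round trip.  This is only a sketch, run against a modern Python 3 standard
library; the name "clone" is chosen for the example.

```python
import pickle

L = []
L.append(L)          # L[0] is now L itself

clone = pickle.loads(pickle.dumps(L, protocol=1))

# The cycle is reproduced in the new object graph ...
print(clone[0] is clone)
# ... but the unpickled list is a distinct object from the original.
print(clone is L)
```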


Pickle protocols:

For compatibility, the meaning of a pickle opcode never changes.  Instead new
pickle opcodes get added, and each version's unpickler can handle all the
pickle opcodes in all protocol versions to date.  So old pickles continue to
be readable forever.  The pickler can generally be told to restrict itself to
the subset of opcodes available under previous protocol versions too, so that
users can create pickles under the current version readable by older
versions.  However, a pickle does not contain its version number embedded
within it.  If an older unpickler tries to read a pickle using a later
protocol, the result is most likely an exception due to seeing an unknown (in
the older unpickler) opcode.
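
Since each opcode records the protocol that introduced it, a lower bound on
the protocol needed to read a pickle can be estimated from its opcodes.
The sketch below assumes a modern Python 3 standard library; highest_proto
is a hypothetical helper, not a function of this module.

```python
import pickle
import pickletools

def highest_proto(data):
    # Hypothetical helper: largest .proto value among the opcodes used.
    return max(op.proto for op, arg, pos in pickletools.genops(data))

sample = [1, 'a', None]
protos = {}
for requested in (0, 1, 2):
    data = pickle.dumps(sample, protocol=requested)
    protos[requested] = highest_proto(data)
print(protos)
```

An unpickler that predates the reported protocol would be expected to fail
on the first opcode it does not know.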

The original pickle used what's now called "protocol 0", and what was called
"text mode" before Python 2.3.  The entire pickle bytestream is made up of
printable 7-bit ASCII characters, plus the newline character, in protocol 0.
That's why it was called text mode.

The second major set of additions is now called "protocol 1", and was called
"binary mode" before Python 2.3.  This added many opcodes with arguments
consisting of arbitrary bytes, including NUL bytes and unprintable "high bit"
bytes.  Binary mode pickles can be substantially smaller than equivalent
text mode pickles, and sometimes faster too; e.g., BININT represents a 4-byte
int as 4 bytes following the opcode, which is cheaper to unpickle than the
(perhaps) 11-character decimal string attached to INT.
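
The size difference is easy to measure.  The sketch below, assuming a
modern Python 3 standard library, pickles the most negative 4-byte int
(whose decimal spelling is 11 characters long) under both protocols.

```python
import pickle

value = -2**31       # decimal spelling has 11 characters, sign included

text_pickle = pickle.dumps(value, protocol=0)
binary_pickle = pickle.dumps(value, protocol=1)

# The binary form carries the value as 4 raw bytes after the opcode.
print(len(text_pickle), len(binary_pickle))
```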

The third major set of additions came in Python 2.3, and is called "protocol
2".  XXX Write a short blurb when Guido figures out what they are <wink>. XXX
"""

# Meta-rule:  Descriptions are stored in instances of descriptor objects,
# with plain constructors.  No meta-language is defined from which
# descriptors could be constructed.  If you want, e.g., XML, write a little
# program to generate XML from the objects.

##############################################################################
# Some pickle opcodes have an argument, following the opcode in the
# bytestream.  An argument is of a specific type, described by an instance
# of ArgumentDescriptor.  These are not to be confused with arguments taken
# off the stack -- ArgumentDescriptor applies only to arguments embedded in
# the opcode stream, immediately following an opcode.

# Represents the number of bytes consumed by an argument delimited by the
# next newline character.
UP_TO_NEWLINE = -1

# Represents the number of bytes consumed by a two-argument opcode where
# the first argument gives the number of bytes in the second argument.
TAKEN_FROM_ARGUMENT = -2

class ArgumentDescriptor(object):
    __slots__ = (
        # name of descriptor record, also a module global name; a string
        'name',

        # length of argument, in bytes; an int; UP_TO_NEWLINE and
        # TAKEN_FROM_ARGUMENT are negative values for variable-length cases
        'n',

        # a function taking a file-like object, reading this kind of argument
        # from the object at the current position, advancing the current
        # position by n bytes, and returning the value of the argument
        'reader',

        # human-readable docs for this arg descriptor; a string
        'doc',
    )

    def __init__(self, name, n, reader, doc):
        assert isinstance(name, str)
        self.name = name

        # Compare the sentinels by value; "is" on ints only works by
        # accident of CPython's small-int caching.
        assert isinstance(n, int) and (n >= 0 or
                                       n in (UP_TO_NEWLINE,
                                             TAKEN_FROM_ARGUMENT))
        self.n = n

        self.reader = reader

        assert isinstance(doc, str)
        self.doc = doc

from struct import unpack as _unpack

def read_uint1(f):
    """
    >>> import StringIO
    >>> read_uint1(StringIO.StringIO('\\xff'))
    255
    """

    data = f.read(1)
    if data:
        return ord(data)
    raise ValueError("not enough data in stream to read uint1")

uint1 = ArgumentDescriptor(
            name='uint1',
            n=1,
            reader=read_uint1,
            doc="One-byte unsigned integer.")


def read_uint2(f):
    """
    >>> import StringIO
    >>> read_uint2(StringIO.StringIO('\\xff\\x00'))
    255
    >>> read_uint2(StringIO.StringIO('\\xff\\xff'))
    65535
    """

    data = f.read(2)
    if len(data) == 2:
        return _unpack("<H", data)[0]
    raise ValueError("not enough data in stream to read uint2")

uint2 = ArgumentDescriptor(
            name='uint2',
            n=2,
            reader=read_uint2,
            doc="Two-byte unsigned integer, little-endian.")


def read_int4(f):
    """
    >>> import StringIO
    >>> read_int4(StringIO.StringIO('\\xff\\x00\\x00\\x00'))
    255
    >>> read_int4(StringIO.StringIO('\\x00\\x00\\x00\\x80')) == -(2**31)
    True
    """

    data = f.read(4)
    if len(data) == 4:
        return _unpack("<i", data)[0]
    raise ValueError("not enough data in stream to read int4")

int4 = ArgumentDescriptor(
           name='int4',
           n=4,
           reader=read_int4,
           doc="Four-byte signed integer, little-endian, 2's complement.")


def read_stringnl(f, decode=True, stripquotes=True):
    """
    >>> import StringIO
    >>> read_stringnl(StringIO.StringIO("'abcd'\\nefg\\n"))
    'abcd'

    >>> read_stringnl(StringIO.StringIO("\\n"))
    Traceback (most recent call last):
    ...
    ValueError: no string quotes around ''

    >>> read_stringnl(StringIO.StringIO("\\n"), stripquotes=False)
    ''

    >>> read_stringnl(StringIO.StringIO("''\\n"))
    ''

    >>> read_stringnl(StringIO.StringIO('"abcd"'))
    Traceback (most recent call last):
    ...
    ValueError: no newline found when trying to read stringnl

    Embedded escapes are undone in the result.
    >>> read_stringnl(StringIO.StringIO("'a\\\\nb\\x00c\\td'\\n'e'"))
    'a\\nb\\x00c\\td'
    """

    data = f.readline()
    if not data.endswith('\n'):
        raise ValueError("no newline found when trying to read stringnl")
    data = data[:-1]    # lose the newline

    if stripquotes:
        for q in "'\"":
            if data.startswith(q):
                if not data.endswith(q):
                    raise ValueError("string quote %r not found at both "
                                     "ends of %r" % (q, data))
                data = data[1:-1]
                break
        else:
            raise ValueError("no string quotes around %r" % data)

    # I'm not sure when 'string_escape' was added to the std codecs; it's
    # crazy not to use it if it's there.
    if decode:
        data = data.decode('string_escape')
    return data

stringnl = ArgumentDescriptor(
               name='stringnl',
               n=UP_TO_NEWLINE,
               reader=read_stringnl,
               doc="""A newline-terminated string.

                   This is a repr-style string, with embedded escapes, and
                   bracketing quotes.
                   """)

def read_stringnl_noescape(f):
    return read_stringnl(f, decode=False, stripquotes=False)

stringnl_noescape = ArgumentDescriptor(
                        name='stringnl_noescape',
                        n=UP_TO_NEWLINE,
                        reader=read_stringnl_noescape,
                        doc="""A newline-terminated string.

                        This is a str-style string, without embedded escapes,
                        or bracketing quotes.  It should consist solely of
                        printable ASCII characters.
                        """)

def read_stringnl_noescape_pair(f):
    """
    >>> import StringIO
    >>> read_stringnl_noescape_pair(StringIO.StringIO("Queue\\nEmpty\\njunk"))
    'Queue Empty'
    """

    return "%s %s" % (read_stringnl_noescape(f), read_stringnl_noescape(f))

stringnl_noescape_pair = ArgumentDescriptor(
                             name='stringnl_noescape_pair',
                             n=UP_TO_NEWLINE,
                             reader=read_stringnl_noescape_pair,
                             doc="""A pair of newline-terminated strings.

                             These are str-style strings, without embedded
                             escapes, or bracketing quotes.  They should
                             consist solely of printable ASCII characters.
                             The pair is returned as a single string, with
                             a single blank separating the two strings.
                             """)

def read_string4(f):
    """
    >>> import StringIO
    >>> read_string4(StringIO.StringIO("\\x00\\x00\\x00\\x00abc"))
    ''
    >>> read_string4(StringIO.StringIO("\\x03\\x00\\x00\\x00abcdef"))
    'abc'
    >>> read_string4(StringIO.StringIO("\\x00\\x00\\x00\\x03abcdef"))
    Traceback (most recent call last):
    ...
    ValueError: expected 50331648 bytes in a string4, but only 6 remain
    """

    n = read_int4(f)
    if n < 0:
        raise ValueError("string4 byte count < 0: %d" % n)
    data = f.read(n)
    if len(data) == n:
        return data
    raise ValueError("expected %d bytes in a string4, but only %d remain" %
                     (n, len(data)))

string4 = ArgumentDescriptor(
              name="string4",
              n=TAKEN_FROM_ARGUMENT,
              reader=read_string4,
              doc="""A counted string.

              The first argument is a 4-byte little-endian signed int giving
              the number of bytes in the string, and the second argument is
              that many bytes.
              """)


def read_string1(f):
    """
    >>> import StringIO
    >>> read_string1(StringIO.StringIO("\\x00"))
    ''
    >>> read_string1(StringIO.StringIO("\\x03abcdef"))
    'abc'
    """

    n = read_uint1(f)
    assert n >= 0
    data = f.read(n)
    if len(data) == n:
        return data
    raise ValueError("expected %d bytes in a string1, but only %d remain" %
                     (n, len(data)))

string1 = ArgumentDescriptor(
              name="string1",
              n=TAKEN_FROM_ARGUMENT,
              reader=read_string1,
              doc="""A counted string.

              The first argument is a 1-byte unsigned int giving the number
              of bytes in the string, and the second argument is that many
              bytes.
              """)


def read_unicodestringnl(f):
    """
    >>> import StringIO
    >>> read_unicodestringnl(StringIO.StringIO("abc\\uabcd\\njunk"))
    u'abc\\uabcd'
    """

    data = f.readline()
    if not data.endswith('\n'):
        raise ValueError("no newline found when trying to read "
                         "unicodestringnl")
    data = data[:-1]    # lose the newline
    return unicode(data, 'raw-unicode-escape')

unicodestringnl = ArgumentDescriptor(
                      name='unicodestringnl',
                      n=UP_TO_NEWLINE,
                      reader=read_unicodestringnl,
                      doc="""A newline-terminated Unicode string.

                      This is raw-unicode-escape encoded, so consists of
                      printable ASCII characters, and may contain embedded
                      escape sequences.
                      """)

def read_unicodestring4(f):
    """
    >>> import StringIO
    >>> s = u'abcd\\uabcd'
    >>> enc = s.encode('utf-8')
    >>> enc
    'abcd\\xea\\xaf\\x8d'
    >>> n = chr(len(enc)) + chr(0) * 3  # little-endian 4-byte length
    >>> t = read_unicodestring4(StringIO.StringIO(n + enc + 'junk'))
    >>> s == t
    True

    >>> read_unicodestring4(StringIO.StringIO(n + enc[:-1]))
    Traceback (most recent call last):
    ...
    ValueError: expected 7 bytes in a unicodestring4, but only 6 remain
    """

    n = read_int4(f)
    if n < 0:
        raise ValueError("unicodestring4 byte count < 0: %d" % n)
    data = f.read(n)
    if len(data) == n:
        return unicode(data, 'utf-8')
    raise ValueError("expected %d bytes in a unicodestring4, but only %d "
                     "remain" % (n, len(data)))

unicodestring4 = ArgumentDescriptor(
                    name="unicodestring4",
                    n=TAKEN_FROM_ARGUMENT,
                    reader=read_unicodestring4,
                    doc="""A counted Unicode string.

                    The first argument is a 4-byte little-endian signed int
                    giving the number of bytes in the string, and the second
                    argument -- the UTF-8 encoding of the Unicode string --
                    contains that many bytes.
                    """)


def read_decimalnl_short(f):
    """
    >>> import StringIO
    >>> read_decimalnl_short(StringIO.StringIO("1234\\n56"))
    1234

    >>> read_decimalnl_short(StringIO.StringIO("1234L\\n56"))
    Traceback (most recent call last):
    ...
    ValueError: trailing 'L' not allowed in '1234L'
    """

    s = read_stringnl(f, decode=False, stripquotes=False)
    if s.endswith("L"):
        raise ValueError("trailing 'L' not allowed in %r" % s)

    # It's not necessarily true that the result fits in a Python short int:
    # the pickle may have been written on a 64-bit box.  There's also a hack
    # for True and False here.
    if s == "00":
        return False
    elif s == "01":
        return True

    try:
        return int(s)
    except OverflowError:
        return long(s)

def read_decimalnl_long(f):
    """
    >>> import StringIO

    >>> read_decimalnl_long(StringIO.StringIO("1234\\n56"))
    Traceback (most recent call last):
    ...
    ValueError: trailing 'L' required in '1234'

    Someday the trailing 'L' will probably go away from this output.

    >>> read_decimalnl_long(StringIO.StringIO("1234L\\n56"))
    1234L

    >>> read_decimalnl_long(StringIO.StringIO("123456789012345678901234L\\n6"))
    123456789012345678901234L
    """

    s = read_stringnl(f, decode=False, stripquotes=False)
    if not s.endswith("L"):
        raise ValueError("trailing 'L' required in %r" % s)
    return long(s)


decimalnl_short = ArgumentDescriptor(
                      name='decimalnl_short',
                      n=UP_TO_NEWLINE,
                      reader=read_decimalnl_short,
                      doc="""A newline-terminated decimal integer literal.

                          This never has a trailing 'L', and the integer fit
                          in a short Python int on the box where the pickle
                          was written -- but there's no guarantee it will fit
                          in a short Python int on the box where the pickle
                          is read.
                          """)

decimalnl_long = ArgumentDescriptor(
                     name='decimalnl_long',
                     n=UP_TO_NEWLINE,
                     reader=read_decimalnl_long,
                     doc="""A newline-terminated decimal integer literal.

                         This has a trailing 'L', and can represent integers
                         of any size.
                         """)


def read_floatnl(f):
    """
    >>> import StringIO
    >>> read_floatnl(StringIO.StringIO("-1.25\\n6"))
    -1.25
    """
    s = read_stringnl(f, decode=False, stripquotes=False)
    return float(s)

floatnl = ArgumentDescriptor(
              name='floatnl',
              n=UP_TO_NEWLINE,
              reader=read_floatnl,
              doc="""A newline-terminated decimal floating literal.

              In general this requires 17 significant digits for roundtrip
              identity, and pickling then unpickling infinities, NaNs, and
              minus zero doesn't work across boxes, or on some boxes even
              on itself (e.g., Windows can't read the strings it produces
              for infinities or NaNs).
              """)

def read_float8(f):
    """
    >>> import StringIO, struct
    >>> raw = struct.pack(">d", -1.25)
    >>> raw
    '\\xbf\\xf4\\x00\\x00\\x00\\x00\\x00\\x00'
    >>> read_float8(StringIO.StringIO(raw + "\\n"))
    -1.25
    """

    data = f.read(8)
    if len(data) == 8:
        return _unpack(">d", data)[0]
    raise ValueError("not enough data in stream to read float8")


float8 = ArgumentDescriptor(
             name='float8',
             n=8,
             reader=read_float8,
             doc="""An 8-byte binary representation of a float, big-endian.

             The format is unique to Python, and shared with the struct
             module (format string '>d') "in theory" (the struct and cPickle
             implementations don't share the code -- they should).  It's
             strongly related to the IEEE-754 double format, and, in normal
             cases, is in fact identical to the big-endian 754 double format.
             On other boxes the dynamic range is limited to that of a 754
             double, and "add a half and chop" rounding is used to reduce
             the precision to 53 bits.  However, even on a 754 box,
             infinities, NaNs, and minus zero may not be handled correctly
             (may not survive roundtrip pickling intact).
             """)

##############################################################################
# Object descriptors.  The stack used by the pickle machine holds objects,
# and in the stack_before and stack_after attributes of OpcodeInfo
# descriptors we need names to describe the various types of objects that can
# appear on the stack.

class StackObject(object):
    __slots__ = (
        # name of descriptor record, for info only
        'name',

        # type of object, or tuple of type objects (meaning the object can
        # be of any type in the tuple)
        'obtype',

        # human-readable docs for this kind of stack object; a string
        'doc',
    )

    def __init__(self, name, obtype, doc):
        assert isinstance(name, str)
        self.name = name

        assert isinstance(obtype, type) or isinstance(obtype, tuple)
        if isinstance(obtype, tuple):
            for contained in obtype:
                assert isinstance(contained, type)
        self.obtype = obtype

        assert isinstance(doc, str)
        self.doc = doc


pyint = StackObject(
            name='int',
            obtype=int,
            doc="A short (as opposed to long) Python integer object.")

pylong = StackObject(
             name='long',
             obtype=long,
             doc="A long (as opposed to short) Python integer object.")

pyinteger_or_bool = StackObject(
                        name='int_or_bool',
                        obtype=(int, long, bool),
                        doc="A Python integer object (short or long), or "
                            "a Python bool.")

pyfloat = StackObject(
              name='float',
              obtype=float,
              doc="A Python float object.")

pystring = StackObject(
               name='str',
               obtype=str,
               doc="A Python string object.")

pyunicode = StackObject(
                name='unicode',
                obtype=unicode,
                doc="A Python Unicode string object.")

pynone = StackObject(
             name="None",
             obtype=type(None),
             doc="The Python None object.")

pytuple = StackObject(
              name="tuple",
              obtype=tuple,
              doc="A Python tuple object.")

pylist = StackObject(
             name="list",
             obtype=list,
             doc="A Python list object.")

pydict = StackObject(
             name="dict",
             obtype=dict,
             doc="A Python dict object.")

anyobject = StackObject(
                name='any',
                obtype=object,
                doc="Any kind of object whatsoever.")

markobject = StackObject(
                 name="mark",
                 obtype=StackObject,
                 doc="""'The mark' is a unique object.

                 Opcodes that operate on a variable number of objects
                 generally don't embed the count of objects in the opcode,
                 or pull it off the stack.  Instead the MARK opcode is used
                 to push a special marker object on the stack, and then
                 some other opcodes grab all the objects from the top of
                 the stack down to (but not including) the topmost marker
                 object.
                 """)

stackslice = StackObject(
                 name="stackslice",
                 obtype=StackObject,
                 doc="""An object representing a contiguous slice of the stack.

                 This is used in conjunction with markobject, to represent all
                 of the stack following the topmost markobject.  For example,
                 the POP_MARK opcode changes the stack from

                     [..., markobject, stackslice]
                 to
                     [...]

                 No matter how many objects are on the stack after the topmost
                 markobject, POP_MARK gets rid of all of them (including the
                 topmost markobject too).
                 """)

##############################################################################
# Descriptors for pickle opcodes.

class OpcodeInfo(object):

    __slots__ = (
        # symbolic name of opcode; a string
        'name',

        # the code used in a bytestream to represent the opcode; a
        # one-character string
        'code',

        # If the opcode has an argument embedded in the byte string, an
        # instance of ArgumentDescriptor specifying its type.  Note that
        # arg.reader(s) can be used to read and decode the argument from
        # the bytestream s, and arg.doc documents the format of the raw
        # argument bytes.  If the opcode doesn't have an argument embedded
        # in the bytestream, arg should be None.
        'arg',

        # what the stack looks like before this opcode runs; a list
        'stack_before',

        # what the stack looks like after this opcode runs; a list
        'stack_after',

        # the protocol number in which this opcode was introduced; an int
        'proto',

        # human-readable docs for this opcode; a string
        'doc',
    )

    def __init__(self, name, code, arg,
                 stack_before, stack_after, proto, doc):
        assert isinstance(name, str)
        self.name = name

        assert isinstance(code, str)
        assert len(code) == 1
        self.code = code

        assert arg is None or isinstance(arg, ArgumentDescriptor)
        self.arg = arg

        assert isinstance(stack_before, list)
        for x in stack_before:
            assert isinstance(x, StackObject)
        self.stack_before = stack_before

        assert isinstance(stack_after, list)
        for x in stack_after:
            assert isinstance(x, StackObject)
        self.stack_after = stack_after

        assert isinstance(proto, int) and 0 <= proto <= 2
        self.proto = proto

        assert isinstance(doc, str)
        self.doc = doc

I = OpcodeInfo
opcodes = [

    # Ways to spell integers.

    I(name='INT',
      code='I',
      arg=decimalnl_short,
      stack_before=[],
      stack_after=[pyinteger_or_bool],
      proto=0,
      doc="""Push an integer or bool.

      The argument is a newline-terminated decimal literal string.

      The intent may have been that this always fit in a short Python int,
      but INT can be generated in pickles written on a 64-bit box that
      require a Python long on a 32-bit box.  The difference between this
      and LONG then is that INT skips a trailing 'L', and produces a short
      int whenever possible.

      Another difference arises because, when bool was introduced as a
      distinct type in 2.3, builtin names True and False were also added to
      2.2.2, mapping to ints 1 and 0.  For compatibility in both directions,
      True gets pickled as the bytes "I01\\n" (INT with argument "01"), and
      False as "I00\\n".  Leading zeroes are never produced for a genuine
      integer.  The 2.3 (and later) unpicklers special-case these and return
      bool instead; earlier unpicklers ignore the leading "0" and return the
      int.
      """),

    I(name='LONG',
      code='L',
      arg=decimalnl_long,
      stack_before=[],
      stack_after=[pylong],
      proto=0,
      doc="""Push a long integer.

      The same as INT, except that the literal ends with 'L', and always
      unpickles to a Python long.  There doesn't seem to be a real purpose
      to the trailing 'L'.
      """),

    I(name='BININT',
      code='J',
      arg=int4,
      stack_before=[],
      stack_after=[pyint],
      proto=1,
      doc="""Push a four-byte signed integer.

      This handles the full range of Python (short) integers on a 32-bit
      box, directly as binary bytes (1 for the opcode and 4 for the integer).
      If the integer is non-negative and fits in 1 or 2 bytes, pickling via
      BININT1 or BININT2 saves space.
      """),

    I(name='BININT1',
      code='K',
      arg=uint1,
      stack_before=[],
      stack_after=[pyint],
      proto=1,
      doc="""Push a one-byte unsigned integer.

      This is a space optimization for pickling very small non-negative ints,
      in range(256).
      """),

    I(name='BININT2',
      code='M',
      arg=uint2,
      stack_before=[],
      stack_after=[pyint],
      proto=1,
      doc="""Push a two-byte unsigned integer.

      This is a space optimization for pickling small positive ints, in
      range(256, 2**16).  Integers in range(256) can also be pickled via
      BININT2, but BININT1 instead saves a byte.
      """),

    # Ways to spell strings (8-bit, not Unicode).

    I(name='STRING',
      code='S',
      arg=stringnl,
      stack_before=[],
      stack_after=[pystring],
      proto=0,
      doc="""Push a Python string object.

      The argument is a repr-style string, with bracketing quote characters,
      and perhaps embedded escapes.  The argument extends until the next
      newline character.
      """),

    I(name='BINSTRING',
      code='T',
      arg=string4,
      stack_before=[],
      stack_after=[pystring],
      proto=1,
      doc="""Push a Python string object.

      There are two arguments:  the first is a 4-byte little-endian signed int
      giving the number of bytes in the string, and the second is that many
      bytes, which are taken literally as the string content.
      """),

    I(name='SHORT_BINSTRING',
      code='U',
      arg=string1,
      stack_before=[],
      stack_after=[pystring],
      proto=1,
      doc="""Push a Python string object.

      There are two arguments:  the first is a 1-byte unsigned int giving
      the number of bytes in the string, and the second is that many bytes,
      which are taken literally as the string content.
      """),

    # Ways to spell None.

    I(name='NONE',
      code='N',
      arg=None,
      stack_before=[],
      stack_after=[pynone],
      proto=0,
      doc="Push None on the stack."),

    # Ways to spell Unicode strings.

    I(name='UNICODE',
      code='V',
      arg=unicodestringnl,
      stack_before=[],
      stack_after=[pyunicode],
      proto=0,  # this may be pure-text, but it's a later addition
      doc="""Push a Python Unicode string object.

      The argument is a raw-unicode-escape encoding of a Unicode string,
      and so may contain embedded escape sequences.  The argument extends
      until the next newline character.
      """),

    I(name='BINUNICODE',
      code='X',
      arg=unicodestring4,
      stack_before=[],
      stack_after=[pyunicode],
      proto=1,
      doc="""Push a Python Unicode string object.

      There are two arguments:  the first is a 4-byte little-endian signed int
921      giving the number of bytes in the string.  The second is that many
922      bytes, and is the UTF-8 encoding of the Unicode string.
923      """),
924
925    # Ways to spell floats.
926
927    I(name='FLOAT',
928      code='F',
929      arg=floatnl,
930      stack_before=[],
931      stack_after=[pyfloat],
932      proto=0,
933      doc="""Newline-terminated decimal float literal.
934
935      The argument is repr(a_float), and in general requires 17 significant
936      digits for roundtrip conversion to be an identity (this is so for
937      IEEE-754 double precision values, which is what Python float maps to
938      on most boxes).
939
940      In general, FLOAT cannot be used to transport infinities, NaNs, or
941      minus zero across boxes (or even on a single box, if the platform C
942      library can't read the strings it produces for such things -- Windows
943      is like that), but may do less damage than BINFLOAT on boxes with
944      greater precision or dynamic range than IEEE-754 double.
945      """),
946
947    I(name='BINFLOAT',
948      code='G',
949      arg=float8,
950      stack_before=[],
951      stack_after=[pyfloat],
952      proto=1,
953      doc="""Float stored in binary form, with 8 bytes of data.
954
955      This generally requires less than half the space of FLOAT encoding.
956      In general, BINFLOAT cannot be used to transport infinities, NaNs, or
957      minus zero, raises an exception if the exponent exceeds the range of
958      an IEEE-754 double, and retains no more than 53 bits of precision (if
959      there are more than that, "add a half and chop" rounding is used to
960      cut it back to 53 significant bits).
961      """),
962
963    # Ways to build lists.
964
965    I(name='EMPTY_LIST',
966      code=']',
967      arg=None,
968      stack_before=[],
969      stack_after=[pylist],
970      proto=1,
971      doc="Push an empty list."),
972
973    I(name='APPEND',
974      code='a',
975      arg=None,
976      stack_before=[pylist, anyobject],
977      stack_after=[pylist],
978      proto=0,
979      doc="""Append an object to a list.
980
981      Stack before:  ... pylist anyobject
982      Stack after:   ... pylist+[anyobject]
983      """),
984
985    I(name='APPENDS',
986      code='e',
987      arg=None,
988      stack_before=[pylist, markobject, stackslice],
989      stack_after=[pylist],
990      proto=1,
991      doc="""Extend a list by a slice of stack objects.
992
993      Stack before:  ... pylist markobject stackslice
994      Stack after:   ... pylist+stackslice
995      """),
996
997    I(name='LIST',
998      code='l',
999      arg=None,
1000      stack_before=[markobject, stackslice],
1001      stack_after=[pylist],
1002      proto=0,
1003      doc="""Build a list out of the topmost stack slice, after markobject.
1004
1005      All the stack entries following the topmost markobject are placed into
1006      a single Python list, which single list object replaces all of the
1007      stack from the topmost markobject onward.  For example,
1008
1009      Stack before: ... markobject 1 2 3 'abc'
1010      Stack after:  ... [1, 2, 3, 'abc']
1011      """),
1012
1013    # Ways to build tuples.
1014
1015    I(name='EMPTY_TUPLE',
1016      code=')',
1017      arg=None,
1018      stack_before=[],
1019      stack_after=[pytuple],
1020      proto=1,
1021      doc="Push an empty tuple."),
1022
1023    I(name='TUPLE',
1024      code='t',
1025      arg=None,
1026      stack_before=[markobject, stackslice],
1027      stack_after=[pytuple],
1028      proto=0,
1029      doc="""Build a tuple out of the topmost stack slice, after markobject.
1030
1031      All the stack entries following the topmost markobject are placed into
1032      a single Python tuple, which single tuple object replaces all of the
1033      stack from the topmost markobject onward.  For example,
1034
1035      Stack before: ... markobject 1 2 3 'abc'
1036      Stack after:  ... (1, 2, 3, 'abc')
1037      """),
1038
1039    # Ways to build dicts.
1040
1041    I(name='EMPTY_DICT',
1042      code='}',
1043      arg=None,
1044      stack_before=[],
1045      stack_after=[pydict],
1046      proto=1,
1047      doc="Push an empty dict."),
1048
1049    I(name='DICT',
1050      code='d',
1051      arg=None,
1052      stack_before=[markobject, stackslice],
1053      stack_after=[pydict],
1054      proto=0,
1055      doc="""Build a dict out of the topmost stack slice, after markobject.
1056
1057      All the stack entries following the topmost markobject are placed into
1058      a single Python dict, which single dict object replaces all of the
1059      stack from the topmost markobject onward.  The stack slice alternates
1060      key, value, key, value, ....  For example,
1061
1062      Stack before: ... markobject 1 2 3 'abc'
1063      Stack after:  ... {1: 2, 3: 'abc'}
1064      """),
1065
1066    I(name='SETITEM',
1067      code='s',
1068      arg=None,
1069      stack_before=[pydict, anyobject, anyobject],
1070      stack_after=[pydict],
1071      proto=0,
1072      doc="""Add a key+value pair to an existing dict.
1073
1074      Stack before:  ... pydict key value
1075      Stack after:   ... pydict
1076
1077      where pydict has been modified via pydict[key] = value.
1078      """),
1079
1080    I(name='SETITEMS',
1081      code='u',
1082      arg=None,
1083      stack_before=[pydict, markobject, stackslice],
1084      stack_after=[pydict],
1085      proto=1,
1086      doc="""Add an arbitrary number of key+value pairs to an existing dict.
1087
1088      The slice of the stack following the topmost markobject is taken as
1089      an alternating sequence of keys and values, added to the dict
1090      immediately under the topmost markobject.  Everything at and after the
1091      topmost markobject is popped, leaving the mutated dict at the top
1092      of the stack.
1093
1094      Stack before:  ... pydict markobject key_1 value_1 ... key_n value_n
1095      Stack after:   ... pydict
1096
1097      where pydict has been modified via pydict[key_i] = value_i for i in
1098      1, 2, ..., n, and in that order.
1099      """),
1100
1101    # Stack manipulation.
1102
1103    I(name='POP',
1104      code='0',
1105      arg=None,
1106      stack_before=[anyobject],
1107      stack_after=[],
1108      proto=0,
1109      doc="Discard the top stack item, shrinking the stack by one item."),
1110
1111    I(name='DUP',
1112      code='2',
1113      arg=None,
1114      stack_before=[anyobject],
1115      stack_after=[anyobject, anyobject],
1116      proto=0,
1117      doc="Push the top stack item onto the stack again, duplicating it."),
1118
1119    I(name='MARK',
1120      code='(',
1121      arg=None,
1122      stack_before=[],
1123      stack_after=[markobject],
1124      proto=0,
1125      doc="""Push markobject onto the stack.
1126
1127      markobject is a unique object, used by other opcodes to identify a
1128      region of the stack containing a variable number of objects for them
1129      to work on.  See markobject.doc for more detail.
1130      """),
1131
1132    I(name='POP_MARK',
1133      code='1',
1134      arg=None,
1135      stack_before=[markobject, stackslice],
1136      stack_after=[],
1137      proto=0,
1138      doc="""Pop all the stack objects at and above the topmost markobject.
1139
1140      When an opcode using a variable number of stack objects is done,
1141      POP_MARK is used to remove those objects, and to remove the markobject
1142      that delimited their starting position on the stack.
1143      """),
1144
1145    # Memo manipulation.  There are really only two operations (get and put),
1146    # each in all-text, "short binary", and "long binary" flavors.
1147
1148    I(name='GET',
1149      code='g',
1150      arg=decimalnl_short,
1151      stack_before=[],
1152      stack_after=[anyobject],
1153      proto=0,
1154      doc="""Read an object from the memo and push it on the stack.
1155
1156      The index of the memo object to push is given by the newline-terminated
1157      decimal string following.  BINGET and LONG_BINGET are space-optimized
1158      versions.
1159      """),
1160
1161    I(name='BINGET',
1162      code='h',
1163      arg=uint1,
1164      stack_before=[],
1165      stack_after=[anyobject],
1166      proto=1,
1167      doc="""Read an object from the memo and push it on the stack.
1168
1169      The index of the memo object to push is given by the 1-byte unsigned
1170      integer following.
1171      """),
1172
1173    I(name='LONG_BINGET',
1174      code='j',
1175      arg=int4,
1176      stack_before=[],
1177      stack_after=[anyobject],
1178      proto=1,
1179      doc="""Read an object from the memo and push it on the stack.
1180
1181      The index of the memo object to push is given by the 4-byte signed
1182      little-endian integer following.
1183      """),
1184
1185    I(name='PUT',
1186      code='p',
1187      arg=decimalnl_short,
1188      stack_before=[],
1189      stack_after=[],
1190      proto=0,
1191      doc="""Store the stack top into the memo.  The stack is not popped.
1192
1193      The index of the memo location to write into is given by the newline-
1194      terminated decimal string following.  BINPUT and LONG_BINPUT are
1195      space-optimized versions.
1196      """),
1197
1198    I(name='BINPUT',
1199      code='q',
1200      arg=uint1,
1201      stack_before=[],
1202      stack_after=[],
1203      proto=1,
1204      doc="""Store the stack top into the memo.  The stack is not popped.
1205
1206      The index of the memo location to write into is given by the 1-byte
1207      unsigned integer following.
1208      """),
1209
1210    I(name='LONG_BINPUT',
1211      code='r',
1212      arg=int4,
1213      stack_before=[],
1214      stack_after=[],
1215      proto=1,
1216      doc="""Store the stack top into the memo.  The stack is not popped.
1217
1218      The index of the memo location to write into is given by the 4-byte
1219      signed little-endian integer following.
1220      """),
1221
1222    # Push a class object, or module function, on the stack, via its module
1223    # and name.
1224
1225    I(name='GLOBAL',
1226      code='c',
1227      arg=stringnl_noescape_pair,
1228      stack_before=[],
1229      stack_after=[anyobject],
1230      proto=0,
1231      doc="""Push a global object (module.attr) on the stack.
1232
1233      Two newline-terminated strings follow the GLOBAL opcode.  The first is
1234      taken as a module name, and the second as a class name.  The class
1235      object module.class is pushed on the stack.  More accurately, the
1236      object returned by self.find_class(module, class) is pushed on the
1237      stack, so unpickling subclasses can override this form of lookup.
1238      """),
1239
1240    # Ways to build objects of classes pickle doesn't know about directly
1241    # (user-defined classes).  I despair of documenting this accurately
1242    # and comprehensibly -- you really have to read the pickle code to
1243    # find all the special cases.
1244
1245    I(name='REDUCE',
1246      code='R',
1247      arg=None,
1248      stack_before=[anyobject, anyobject],
1249      stack_after=[anyobject],
1250      proto=0,
1251      doc="""Push an object built from a callable and an argument tuple.
1252
1253      The opcode is named to remind one of the __reduce__() method.
1254
1255      Stack before: ... callable pytuple
1256      Stack after:  ... callable(*pytuple)
1257
1258      The callable and the argument tuple are the first two items returned
1259      by a __reduce__ method.  Applying the callable to the argtuple is
1260      supposed to reproduce the original object, or at least get it started.
1261      If the __reduce__ method returns a 3-tuple, the last component is an
1262      argument to be passed to the object's __setstate__, and then the REDUCE
1263      opcode is followed by code to create setstate's argument, and then a
1264      BUILD opcode to apply __setstate__ to that argument.
1265
1266      There are lots of special cases here.  The argtuple can be None, in
1267      which case callable.__basicnew__() is called instead to produce the
1268      object to be pushed on the stack.  This appears to be a trick unique
1269      to ExtensionClasses, and is deprecated regardless.
1270
1271      If type(callable) is not ClassType, REDUCE complains unless the
1272      callable has been registered with the copy_reg module's
1273      safe_constructors dict, or the callable has a magic
1274      '__safe_for_unpickling__' attribute with a true value.  I'm not sure
1275      why it does this, but I've sure seen this complaint often enough when
1276      I didn't want to <wink>.
1277      """),
1278
1279    I(name='BUILD',
1280      code='b',
1281      arg=None,
1282      stack_before=[anyobject, anyobject],
1283      stack_after=[anyobject],
1284      proto=0,
1285      doc="""Finish building an object, via __setstate__ or dict update.
1286
1287      Stack before: ... anyobject argument
1288      Stack after:  ... anyobject
1289
1290      where anyobject may have been mutated, as follows:
1291
1292      If the object has a __setstate__ method,
1293
1294          anyobject.__setstate__(argument)
1295
1296      is called.
1297
1298      Else the argument must be a dict, the object must have a __dict__, and
1299      the object is updated via
1300
1301          anyobject.__dict__.update(argument)
1302
1303      This may raise RuntimeError in restricted execution mode (which
1304      disallows access to __dict__ directly); in that case, the object
1305      is updated instead via
1306
1307          for k, v in argument.items():
1308              anyobject[k] = v
1309      """),
1310
1311    I(name='INST',
1312      code='i',
1313      arg=stringnl_noescape_pair,
1314      stack_before=[markobject, stackslice],
1315      stack_after=[anyobject],
1316      proto=0,
1317      doc="""Build a class instance.
1318
1319      This is the protocol 0 version of protocol 1's OBJ opcode.
1320      INST is followed by two newline-terminated strings, giving a
1321      module and class name, just as for the GLOBAL opcode (and see
1322      GLOBAL for more details about that).  self.find_class(module, name)
1323      is used to get a class object.
1324
1325      In addition, all the objects on the stack following the topmost
1326      markobject are gathered into a tuple and popped (along with the
1327      topmost markobject), just as for the TUPLE opcode.
1328
1329      Now it gets complicated.  If all of these are true:
1330
1331        + The argtuple is empty (markobject was at the top of the stack
1332          at the start).
1333
1334        + It's an old-style class object (the type of the class object is
1335          ClassType).
1336
1337        + The class object does not have a __getinitargs__ attribute.
1338
1339      then we want to create an old-style class instance without invoking
1340      its __init__() method (pickle has waffled on this over the years; not
1341      calling __init__() is current wisdom).  In this case, an instance of
1342      an old-style dummy class is created, and then we try to rebind its
1343      __class__ attribute to the desired class object.  If this succeeds,
1344      the new instance object is pushed on the stack, and we're done.  In
1345      restricted execution mode it can fail (assignment to __class__ is
1346      disallowed), and I'm not really sure what happens then -- it looks
1347      like the code ends up calling the class object's __init__ anyway,
1348      via falling into the next case.
1349
1350      Else (the argtuple is not empty, it's not an old-style class object,
1351      or the class object does have a __getinitargs__ attribute), the code
1352      first insists that the class object have a __safe_for_unpickling__
1353      attribute.  Unlike the __safe_for_unpickling__ check in REDUCE,
1354      it doesn't matter whether this attribute has a true or false value, it
1355      only matters whether it exists (XXX this smells like a bug).  If
1356      __safe_for_unpickling__ doesn't exist, UnpicklingError is raised.
1357
1358      Else (the class object does have a __safe_for_unpickling__ attr),
1359      the class object obtained from INST's arguments is applied to the
1360      argtuple obtained from the stack, and the resulting instance object
1361      is pushed on the stack.
1362      """),
1363
1364    I(name='OBJ',
1365      code='o',
1366      arg=None,
1367      stack_before=[markobject, anyobject, stackslice],
1368      stack_after=[anyobject],
1369      proto=1,
1370      doc="""Build a class instance.
1371
1372      This is the protocol 1 version of protocol 0's INST opcode, and is
1373      very much like it.  The major difference is that the class object
1374      is taken off the stack, allowing it to be retrieved from the memo
1375      repeatedly if several instances of the same class are created.  This
1376      can be much more efficient (in both time and space) than repeatedly
1377      embedding the module and class names in INST opcodes.
1378
1379      Unlike INST, OBJ takes no arguments from the opcode stream.  Instead
1380      the class object is taken off the stack, immediately above the
1381      topmost markobject:
1382
1383      Stack before: ... markobject classobject stackslice
1384      Stack after:  ... new_instance_object
1385
1386      As for INST, the remainder of the stack above the markobject is
1387      gathered into an argument tuple, and then the logic seems identical,
1388      except that no __safe_for_unpickling__ check is done (XXX this smells
1389      like a bug).  See INST for the gory details.
1390      """),
1391
1392    # Machine control.
1393
1394    I(name='STOP',
1395      code='.',
1396      arg=None,
1397      stack_before=[anyobject],
1398      stack_after=[],
1399      proto=0,
1400      doc="""Stop the unpickling machine.
1401
1402      Every pickle ends with this opcode.  The object at the top of the stack
1403      is popped, and that's the result of unpickling.  The stack should be
1404      empty then.
1405      """),
1406
1407    # Ways to deal with persistent IDs.
1408
1409    I(name='PERSID',
1410      code='P',
1411      arg=stringnl_noescape,
1412      stack_before=[],
1413      stack_after=[anyobject],
1414      proto=0,
1415      doc="""Push an object identified by a persistent ID.
1416
1417      The pickle module doesn't define what a persistent ID means.  PERSID's
1418      argument is a newline-terminated str-style (no embedded escapes, no
1419      bracketing quote characters) string, which *is* "the persistent ID".
1420      The unpickler passes this string to self.persistent_load().  Whatever
1421      object that returns is pushed on the stack.  There is no implementation
1422      of persistent_load() in Python's unpickler:  it must be supplied by an
1423      unpickler subclass.
1424      """),
1425
1426    I(name='BINPERSID',
1427      code='Q',
1428      arg=None,
1429      stack_before=[anyobject],
1430      stack_after=[anyobject],
1431      proto=1,
1432      doc="""Push an object identified by a persistent ID.
1433
1434      Like PERSID, except the persistent ID is popped off the stack (instead
1435      of being a string embedded in the opcode bytestream).  The persistent
1436      ID is passed to self.persistent_load(), and whatever object that
1437      returns is pushed on the stack.  See PERSID for more detail.
1438      """),
1439]
1440del I
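
# Illustrative sketch only (not part of the original module): one way a
# pickler might choose among the BININT1, BININT2 and BININT opcodes
# documented above.  The helper name _sketch_encode_int is hypothetical,
# and the b'' literals assume Python 2.6 or later.
import struct

def _sketch_encode_int(n):
    """Return opcode byte + argument bytes for the shortest binary int form."""
    if 0 <= n < 256:
        # BININT1: opcode 'K' followed by a 1-byte unsigned int.
        return b'K' + struct.pack('<B', n)
    if 256 <= n < 2 ** 16:
        # BININT2: opcode 'M' followed by a 2-byte little-endian unsigned int.
        return b'M' + struct.pack('<H', n)
    if -2 ** 31 <= n < 2 ** 31:
        # BININT: opcode 'J' followed by a 4-byte little-endian signed int.
        return b'J' + struct.pack('<i', n)
    raise ValueError("%r does not fit in a four-byte signed int" % (n,))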
1441
1442# Verify uniqueness of .name and .code members.
1443name2i = {}
1444code2i = {}
1445
1446for i, d in enumerate(opcodes):
1447    if d.name in name2i:
1448        raise ValueError("repeated name %r at indices %d and %d" %
1449                         (d.name, name2i[d.name], i))
1450    if d.code in code2i:
1451        raise ValueError("repeated code %r at indices %d and %d" %
1452                         (d.code, code2i[d.code], i))
1453
1454    name2i[d.name] = i
1455    code2i[d.code] = i
1456
1457del name2i, code2i, i, d
1458
1459##############################################################################
1460# Build a code2op dict, mapping opcode characters to OpcodeInfo records.
1461# Also ensure we've got the same stuff as pickle.py, although the
1462# introspection here is dicey.
1463
1464code2op = {}
1465for d in opcodes:
1466    code2op[d.code] = d
1467del d
1468
1469def assure_pickle_consistency(verbose=False):
1470    import pickle, re
1471
1472    copy = code2op.copy()
1473    for name in pickle.__all__:
1474        if not re.match("[A-Z][A-Z0-9_]+$", name):
1475            if verbose:
1476                print "skipping %r: it doesn't look like an opcode name" % name
1477            continue
1478        picklecode = getattr(pickle, name)
1479        if not isinstance(picklecode, str) or len(picklecode) != 1:
1480            if verbose:
1481                print ("skipping %r: value %r doesn't look like a pickle "
1482                       "code" % (name, picklecode))
1483            continue
1484        if picklecode in copy:
1485            if verbose:
1486                print "checking name %r w/ code %r for consistency" % (
1487                      name, picklecode)
1488            d = copy[picklecode]
1489            if d.name != name:
1490                raise ValueError("for pickle code %r, pickle.py uses name %r "
1491                                 "but we're using name %r" % (picklecode,
1492                                                              name,
1493                                                              d.name))
1494            # Forget this one.  Any left over in copy at the end are a problem
1495            # of a different kind.
1496            del copy[picklecode]
1497        else:
1498            raise ValueError("pickle.py appears to have a pickle opcode with "
1499                             "name %r and code %r, but we don't" %
1500                             (name, picklecode))
1501    if copy:
1502        msg = ["we appear to have pickle opcodes that pickle.py doesn't have:"]
1503        for code, d in copy.items():
1504            msg.append("    name %r with code %r" % (d.name, code))
1505        raise ValueError("\n".join(msg))
1506
1507assure_pickle_consistency()
1508
1509##############################################################################
1510# A pickle opcode generator.
1511
1512def genops(pickle):
1513    """Generate all the opcodes in a pickle.
1514
1515    'pickle' is a file-like object, or string, containing the pickle.
1516
1517    Each opcode in the pickle is generated, from the current pickle position,
1518    stopping after a STOP opcode is delivered.  A triple is generated for
1519    each opcode:
1520
1521        opcode, arg, pos
1522
1523    opcode is an OpcodeInfo record, describing the current opcode.
1524
1525    If the opcode has an argument embedded in the pickle, arg is its decoded
1526    value, as a Python object.  If the opcode doesn't have an argument, arg
1527    is None.
1528
1529    If the pickle has a tell() method, pos was the value of pickle.tell()
1530    before reading the current opcode.  If the pickle is a string object,
1531    it's wrapped in a StringIO object, and the latter's tell() result is
1532    used.  Else (the pickle doesn't have a tell(), and it's not obvious how
1533    to query its current position) pos is None.
1534    """
1535
1536    import cStringIO as StringIO
1537
1538    if isinstance(pickle, str):
1539        pickle = StringIO.StringIO(pickle)
1540
1541    if hasattr(pickle, "tell"):
1542        getpos = pickle.tell
1543    else:
1544        getpos = lambda: None
1545
1546    while True:
1547        pos = getpos()
1548        code = pickle.read(1)
1549        opcode = code2op.get(code)
1550        if opcode is None:
1551            if code == "":
1552                raise ValueError("pickle exhausted before seeing STOP")
1553            else:
1554                raise ValueError("at position %s, opcode %r unknown" % (
1555                                 pos is None and "<unknown>" or pos,
1556                                 code))
1557        if opcode.arg is None:
1558            arg = None
1559        else:
1560            arg = opcode.arg.reader(pickle)
1561        yield opcode, arg, pos
1562        if code == '.':
1563            assert opcode.name == 'STOP'
1564            break
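
# Illustrative sketch only (not part of the original module): a toy version
# of the genops() loop above, restricted to a handful of opcodes whose
# arguments are fixed-width.  The names _SKETCH_OPS and _sketch_genops are
# hypothetical; the real generator relies on each opcode's arg.reader.
# The b'' literals assume Python 2.6 or later.
import struct

# opcode byte -> (name, struct format of the embedded argument, or None)
_SKETCH_OPS = {
    b'K': ('BININT1', '<B'),
    b'M': ('BININT2', '<H'),
    b'J': ('BININT', '<i'),
    b']': ('EMPTY_LIST', None),
    b'a': ('APPEND', None),
    b'.': ('STOP', None),
}

def _sketch_genops(data):
    """Yield (name, arg, pos) triples from a bytes pickle, through STOP."""
    pos = 0
    while True:
        code = data[pos:pos + 1]
        if code not in _SKETCH_OPS:
            if not code:
                raise ValueError("pickle exhausted before seeing STOP")
            raise ValueError("at position %d, opcode %r unknown" % (pos, code))
        name, fmt = _SKETCH_OPS[code]
        start = pos          # position of the opcode byte itself
        pos += 1
        arg = None
        if fmt is not None:
            size = struct.calcsize(fmt)
            (arg,) = struct.unpack(fmt, data[pos:pos + size])
            pos += size
        yield name, arg, start
        if name == 'STOP':
            break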
1565
1566##############################################################################
1567# A symbolic pickle disassembler.
1568
1569def dis(pickle, out=None, indentlevel=4):
1570    """Produce a symbolic disassembly of a pickle.
1571
1572    'pickle' is a file-like object, or string, containing a (at least one)
1573    pickle.  The pickle is disassembled from the current position, through
1574    the first STOP opcode encountered.
1575
1576    Optional arg 'out' is a file-like object to which the disassembly is
1577    printed.  It defaults to sys.stdout.
1578
1579    Optional arg indentlevel is the number of blanks by which to indent
1580    a new MARK level.  It defaults to 4.
1581    """
1582
1583    markstack = []
1584    indentchunk = ' ' * indentlevel
1585    for opcode, arg, pos in genops(pickle):
1586        if pos is not None:
1587            print >> out, "%5d:" % pos,
1588
1589        line = "%s %s%s" % (opcode.code,
1590                            indentchunk * len(markstack),
1591                            opcode.name)
1592
1593        markmsg = None
1594        if markstack and markobject in opcode.stack_before:
1595            assert markobject not in opcode.stack_after
1596            markpos = markstack.pop()
1597            if markpos is not None:
1598                markmsg = "(MARK at %d)" % markpos
1599
1600        if arg is not None or markmsg:
1601            # make a mild effort to align arguments
1602            line += ' ' * (10 - len(opcode.name))
1603            if arg is not None:
1604                line += ' ' + repr(arg)
1605            if markmsg:
1606                line += ' ' + markmsg
1607        print >> out, line
1608
1609        if markobject in opcode.stack_after:
1610            assert markobject not in opcode.stack_before
1611            markstack.append(pos)
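
# Illustrative sketch only (not part of the original module): the MARK
# bookkeeping used by dis() above, reduced to opcode names and positions.
# The names _SKETCH_USES_MARK and _sketch_indent are hypothetical.  Like
# dis(), this does not emulate the stack, so a plain POP that happens to
# discard a markobject leaves the indentation unchanged.
_SKETCH_USES_MARK = ('LIST', 'TUPLE', 'DICT', 'APPENDS', 'SETITEMS',
                     'POP_MARK', 'INST', 'OBJ')

def _sketch_indent(ops, indentlevel=4):
    """Return indented lines for an iterable of (opcode name, pos) pairs."""
    markstack = []
    lines = []
    for name, pos in ops:
        # Indent by the number of MARKs still open before this opcode runs.
        line = ' ' * (indentlevel * len(markstack)) + name
        if markstack and name in _SKETCH_USES_MARK:
            # The opcode consumes everything back to the topmost MARK.
            line += ' (MARK at %d)' % markstack.pop()
        lines.append(line)
        if name == 'MARK':
            markstack.append(pos)
    return lines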
1612
1613
1614_dis_test = """
1615>>> import pickle
1616>>> x = [1, 2, (3, 4), {'abc': u"def"}]
1617>>> pik = pickle.dumps(x)
1618>>> dis(pik)
1619    0: ( MARK
1620    1: l     LIST       (MARK at 0)
1621    2: p PUT        0
1622    5: I INT        1
1623    8: a APPEND
1624    9: I INT        2
1625   12: a APPEND
1626   13: ( MARK
1627   14: I     INT        3
1628   17: I     INT        4
1629   20: t     TUPLE      (MARK at 13)
1630   21: p PUT        1
1631   24: a APPEND
1632   25: ( MARK
1633   26: d     DICT       (MARK at 25)
1634   27: p PUT        2
1635   30: S STRING     'abc'
1636   37: p PUT        3
1637   40: V UNICODE    u'def'
1638   45: p PUT        4
1639   48: s SETITEM
1640   49: a APPEND
1641   50: . STOP
1642
1643Try again with a "binary" pickle.
1644
1645>>> pik = pickle.dumps(x, 1)
1646>>> dis(pik)
1647    0: ] EMPTY_LIST
1648    1: q BINPUT     0
1649    3: ( MARK
1650    4: K     BININT1    1
1651    6: K     BININT1    2
1652    8: (     MARK
1653    9: K         BININT1    3
1654   11: K         BININT1    4
1655   13: t         TUPLE      (MARK at 8)
1656   14: q     BINPUT     1
1657   16: }     EMPTY_DICT
1658   17: q     BINPUT     2
1659   19: U     SHORT_BINSTRING 'abc'
1660   24: q     BINPUT     3
1661   26: X     BINUNICODE u'def'
1662   34: q     BINPUT     4
1663   36: s     SETITEM
1664   37: e     APPENDS    (MARK at 3)
1665   38: . STOP
1666
1667Exercise the INST/OBJ/BUILD family.
1668
1669>>> import random
1670>>> dis(pickle.dumps(random.random))
1671    0: c GLOBAL     'random random'
1672   15: p PUT        0
1673   18: . STOP
1674
1675>>> x = [pickle.PicklingError()] * 2
1676>>> dis(pickle.dumps(x))
1677    0: ( MARK
1678    1: l     LIST       (MARK at 0)
1679    2: p PUT        0
1680    5: ( MARK
1681    6: i     INST       'pickle PicklingError' (MARK at 5)
1682   28: p PUT        1
1683   31: ( MARK
1684   32: d     DICT       (MARK at 31)
1685   33: p PUT        2
1686   36: S STRING     'args'
1687   44: p PUT        3
1688   47: ( MARK
1689   48: t     TUPLE      (MARK at 47)
1690   49: p PUT        4
1691   52: s SETITEM
1692   53: b BUILD
1693   54: a APPEND
1694   55: g GET        1
1695   58: a APPEND
1696   59: . STOP
1697
1698>>> dis(pickle.dumps(x, 1))
1699    0: ] EMPTY_LIST
1700    1: q BINPUT     0
1701    3: ( MARK
1702    4: (     MARK
1703    5: c         GLOBAL     'pickle PicklingError'
1704   27: q         BINPUT     1
1705   29: o         OBJ        (MARK at 4)
1706   30: q     BINPUT     2
1707   32: }     EMPTY_DICT
1708   33: q     BINPUT     3
1709   35: U     SHORT_BINSTRING 'args'
1710   41: q     BINPUT     4
1711   43: )     EMPTY_TUPLE
1712   44: s     SETITEM
1713   45: b     BUILD
1714   46: h     BINGET     2
1715   48: e     APPENDS    (MARK at 3)
1716   49: . STOP
1717
1718Try "the canonical" recursive-object test.
1719
1720>>> L = []
1721>>> T = L,
1722>>> L.append(T)
1723>>> L[0] is T
1724True
1725>>> T[0] is L
1726True
1727>>> L[0][0] is L
1728True
1729>>> T[0][0] is T
1730True
1731>>> dis(pickle.dumps(L))
1732    0: ( MARK
1733    1: l     LIST       (MARK at 0)
1734    2: p PUT        0
1735    5: ( MARK
1736    6: g     GET        0
1737    9: t     TUPLE      (MARK at 5)
1738   10: p PUT        1
1739   13: a APPEND
1740   14: . STOP
1741>>> dis(pickle.dumps(L, 1))
1742    0: ] EMPTY_LIST
1743    1: q BINPUT     0
1744    3: ( MARK
1745    4: h     BINGET     0
1746    6: t     TUPLE      (MARK at 3)
1747    7: q BINPUT     1
1748    9: a APPEND
1749   10: . STOP
1750
1751The protocol 0 pickle of the tuple causes the disassembly to get confused,
1752as it doesn't realize that the POP opcode at 16 gets rid of the MARK at 0
1753(so the output remains indented until the end).  The protocol 1 pickle
1754doesn't trigger this glitch, because the disassembler realizes that
1755POP_MARK gets rid of the MARK.  Doing a better job on the protocol 0
1756pickle would require the disassembler to emulate the stack.
1757
1758>>> dis(pickle.dumps(T))
1759    0: ( MARK
1760    1: (     MARK
1761    2: l         LIST       (MARK at 1)
1762    3: p     PUT        0
1763    6: (     MARK
1764    7: g         GET        0
1765   10: t         TUPLE      (MARK at 6)
1766   11: p     PUT        1
1767   14: a     APPEND
1768   15: 0     POP
1769   16: 0     POP
1770   17: g     GET        1
1771   20: .     STOP
1772>>> dis(pickle.dumps(T, 1))
1773    0: ( MARK
1774    1: ]     EMPTY_LIST
1775    2: q     BINPUT     0
1776    4: (     MARK
1777    5: h         BINGET     0
1778    7: t         TUPLE      (MARK at 4)
1779    8: q     BINPUT     1
1780   10: a     APPEND
1781   11: 1     POP_MARK   (MARK at 0)
1782   12: h BINGET     1
1783   14: . STOP
1784"""
1785
1786__test__ = {'disassembler_test': _dis_test,
1787           }
1788
1789def _test():
1790    import doctest
1791    return doctest.testmod()
1792
1793if __name__ == "__main__":
1794    _test()
1795