pickletools.py revision 1996e23054f2ac79cf89c9ef04714f336b0a17ce
""""Executable documentation" for the pickle module.

Extensive comments about the pickle protocols and pickle-machine opcodes
can be found here. Some functions meant for external use:

genops(pickle)
   Generate all the opcodes in a pickle, as (opcode, arg, position) triples.

dis(pickle, out=None, indentlevel=4)
   Print a symbolic disassembly of a pickle.
"""

# Other ideas:
#
# - A pickle verifier: read a pickle and check it exhaustively for
#   well-formedness.
#
# - A protocol identifier: examine a pickle and return its protocol number
#   (== the highest .proto attr value among all the opcodes in the pickle).
#
# - A pickle optimizer: for example, tuple-building code is sometimes more
#   elaborate than necessary, catering for the possibility that the tuple
#   is recursive. Or lots of times a PUT is generated that's never accessed
#   by a later GET.


"""
"A pickle" is a program for a virtual pickle machine (PM, but more accurately
called an unpickling machine). It's a sequence of opcodes, interpreted by the
PM, building an arbitrarily complex Python object.

For the most part, the PM is very simple: there are no looping, testing, or
conditional instructions, no arithmetic and no function calls. Opcodes are
executed once each, from first to last, until a STOP opcode is reached.

The PM has two data areas, "the stack" and "the memo".

Many opcodes push Python objects onto the stack; e.g., INT pushes a Python
integer object on the stack, whose value is gotten from a decimal string
literal immediately following the INT opcode in the pickle bytestream. Other
opcodes take Python objects off the stack. The result of unpickling is
whatever object is left on the stack when the final STOP opcode is executed.

The memo is simply an array of objects, or it can be implemented as a dict
mapping little integers to objects.
The memo serves as the PM's "long term memory", and the little integers
indexing the memo are akin to variable names. Some opcodes store the object
on top of the stack into the memo at a given index, and others push a memo
object at a given index onto the stack again.

At heart, that's all the PM has. Subtleties arise for these reasons:

+ Object identity. Objects can be arbitrarily complex, and subobjects
  may be shared (for example, the list [a, a] refers to the same object a
  twice). It can be vital that unpickling recreate an isomorphic object
  graph, faithfully reproducing sharing.

+ Recursive objects. For example, after "L = []; L.append(L)", L is a
  list, and L[0] is the same list. This is related to the object identity
  point, and some sequences of pickle opcodes are subtle in order to
  get the right result in all cases.

+ Things pickle doesn't know everything about. Examples of things pickle
  does know everything about are Python's builtin scalar and container
  types, like ints and tuples. They generally have opcodes dedicated to
  them. For things like module references and instances of user-defined
  classes, pickle's knowledge is limited. Historically, many enhancements
  have been made to the pickle protocol in order to do a better (faster,
  and/or more compact) job on those.

+ Backward compatibility and micro-optimization. As explained below,
  pickle opcodes never go away, not even when better ways to do a thing
  get invented. The repertoire of the PM just keeps growing over time.
  So, e.g., there are now five distinct opcodes for building a Python integer,
  four of them devoted to "short" integers. Even so, the only way to pickle
  a Python long int takes time quadratic in the number of digits, for both
  pickling and unpickling. This isn't so much a subtlety as a source of
  wearying complication.


Pickle protocols:

For compatibility, the meaning of a pickle opcode never changes.
Instead, new pickle opcodes get added, and each version's unpickler can
handle all the pickle opcodes in all protocol versions to date. So old
pickles continue to be readable forever. The pickler can generally be told
to restrict itself to the subset of opcodes available under previous
protocol versions too, so that users can create pickles under the current
version readable by older versions. However, a pickle does not contain its
version number embedded within it. If an older unpickler tries to read a
pickle using a later protocol, the result is most likely an exception due to
seeing an unknown (in the older unpickler) opcode.

The original pickle used what's now called "protocol 0", and what was called
"text mode" before Python 2.3. The entire pickle bytestream is made up of
printable 7-bit ASCII characters, plus the newline character, in protocol 0.
That's why it was called text mode.

The second major set of additions is now called "protocol 1", and was called
"binary mode" before Python 2.3. This added many opcodes with arguments
consisting of arbitrary bytes, including NUL bytes and unprintable "high bit"
bytes. Binary mode pickles can be substantially smaller than equivalent
text mode pickles, and sometimes faster too; e.g., BININT represents a 4-byte
int as 4 bytes following the opcode, which is cheaper to unpickle than the
(perhaps) 11-character decimal string attached to INT.

The third major set of additions came in Python 2.3, and is called "protocol
2". XXX Write a short blurb when Guido figures out what they are <wink>. XXX
"""

# Meta-rule: Descriptions are stored in instances of descriptor objects,
# with plain constructors. No meta-language is defined from which
# descriptors could be constructed. If you want, e.g., XML, write a little
# program to generate XML from the objects.
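
# ----------------------------------------------------------------------------
# Illustrative sketch only; not part of the original module. The docstring
# above describes the PM as a stack plus a memo. Assuming a simplified,
# hypothetical instruction format of (opcode_name, argument) pairs rather
# than real pickle bytes, the toy interpreter below suggests how a handful
# of opcodes can cooperate to rebuild shared and recursive objects. The
# names _toy_pm_run and _TOY_MARK are made up for this example, and only a
# few opcodes are modeled.

_TOY_MARK = object()    # stands in for the unique mark object

def _toy_pm_run(program):
    """Run a list of (opcode_name, argument) pairs and return the result."""
    stack = []
    memo = {}
    for name, arg in program:
        if name == 'INT':
            stack.append(arg)               # push a literal integer
        elif name == 'EMPTY_LIST':
            stack.append([])
        elif name == 'MARK':
            stack.append(_TOY_MARK)
        elif name == 'APPEND':
            value = stack.pop()
            stack[-1].append(value)         # mutate the list under the top
        elif name == 'LIST':
            # Collect everything above the topmost mark into one list.
            items = []
            while stack[-1] is not _TOY_MARK:
                items.append(stack.pop())
            stack.pop()                     # discard the mark itself
            items.reverse()
            stack.append(items)
        elif name == 'PUT':
            # Record the top of the stack in the memo without removing it.
            memo[arg] = stack[-1]
        elif name == 'GET':
            stack.append(memo[arg])
        elif name == 'STOP':
            return stack.pop()
        else:
            raise ValueError("unknown toy opcode %r" % (name,))
    raise ValueError("program ended without STOP")

# For example, the self-referential list from "L = []; L.append(L)" can be
# rebuilt with
#     _toy_pm_run([('EMPTY_LIST', None), ('PUT', 0), ('GET', 0),
#                  ('APPEND', None), ('STOP', None)])
# and the result L satisfies "L[0] is L", because GET pushes the very same
# list object that PUT recorded in the memo.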
114 115############################################################################## 116# Some pickle opcodes have an argument, following the opcode in the 117# bytestream. An argument is of a specific type, described by an instance 118# of ArgumentDescriptor. These are not to be confused with arguments taken 119# off the stack -- ArgumentDescriptor applies only to arguments embedded in 120# the opcode stream, immediately following an opcode. 121 122# Represents the number of bytes consumed by an argument delimited by the 123# next newline character. 124UP_TO_NEWLINE = -1 125 126# Represents the number of bytes consumed by a two-argument opcode where 127# the first argument gives the number of bytes in the second argument. 128TAKEN_FROM_ARGUMENT = -2 129 130class ArgumentDescriptor(object): 131 __slots__ = ( 132 # name of descriptor record, also a module global name; a string 133 'name', 134 135 # length of argument, in bytes; an int; UP_TO_NEWLINE and 136 # TAKEN_FROM_ARGUMENT are negative values for variable-length cases 137 'n', 138 139 # a function taking a file-like object, reading this kind of argument 140 # from the object at the current position, advancing the current 141 # position by n bytes, and returning the value of the argument 142 'reader', 143 144 # human-readable docs for this arg descriptor; a string 145 'doc', 146 ) 147 148 def __init__(self, name, n, reader, doc): 149 assert isinstance(name, str) 150 self.name = name 151 152 assert isinstance(n, int) and (n >= 0 or 153 n is UP_TO_NEWLINE or 154 n is TAKEN_FROM_ARGUMENT) 155 self.n = n 156 157 self.reader = reader 158 159 assert isinstance(doc, str) 160 self.doc = doc 161 162from struct import unpack as _unpack 163 164def read_uint1(f): 165 """ 166 >>> import StringIO 167 >>> read_uint1(StringIO.StringIO('\\xff')) 168 255 169 """ 170 171 data = f.read(1) 172 if data: 173 return ord(data) 174 raise ValueError("not enough data in stream to read uint1") 175 176uint1 = ArgumentDescriptor( 177 
    name='uint1',
    n=1,
    reader=read_uint1,
    doc="One-byte unsigned integer.")


def read_uint2(f):
    """
    >>> import StringIO
    >>> read_uint2(StringIO.StringIO('\\xff\\x00'))
    255
    >>> read_uint2(StringIO.StringIO('\\xff\\xff'))
    65535
    """

    data = f.read(2)
    if len(data) == 2:
        return _unpack("<H", data)[0]
    raise ValueError("not enough data in stream to read uint2")

uint2 = ArgumentDescriptor(
    name='uint2',
    n=2,
    reader=read_uint2,
    doc="Two-byte unsigned integer, little-endian.")


def read_int4(f):
    """
    >>> import StringIO
    >>> read_int4(StringIO.StringIO('\\xff\\x00\\x00\\x00'))
    255
    >>> read_int4(StringIO.StringIO('\\x00\\x00\\x00\\x80')) == -(2**31)
    True
    """

    data = f.read(4)
    if len(data) == 4:
        return _unpack("<i", data)[0]
    raise ValueError("not enough data in stream to read int4")

int4 = ArgumentDescriptor(
    name='int4',
    n=4,
    reader=read_int4,
    doc="Four-byte signed integer, little-endian, 2's complement.")


def read_stringnl(f, decode=True, stripquotes=True):
    """
    >>> import StringIO
    >>> read_stringnl(StringIO.StringIO("'abcd'\\nefg\\n"))
    'abcd'

    >>> read_stringnl(StringIO.StringIO("\\n"))
    Traceback (most recent call last):
    ...
    ValueError: no string quotes around ''

    >>> read_stringnl(StringIO.StringIO("\\n"), stripquotes=False)
    ''

    >>> read_stringnl(StringIO.StringIO("''\\n"))
    ''

    >>> read_stringnl(StringIO.StringIO('"abcd"'))
    Traceback (most recent call last):
    ...
    ValueError: no newline found when trying to read stringnl

    Embedded escapes are undone in the result.
    >>> read_stringnl(StringIO.StringIO("'a\\\\nb\\x00c\\td'\\n'e'"))
    'a\\nb\\x00c\\td'
    """

    data = f.readline()
    if not data.endswith('\n'):
        raise ValueError("no newline found when trying to read stringnl")
    data = data[:-1]    # lose the newline

    if stripquotes:
        for q in "'\"":
            if data.startswith(q):
                if not data.endswith(q):
                    raise ValueError("string quote %r not found at both "
                                     "ends of %r" % (q, data))
                data = data[1:-1]
                break
        else:
            # The loop never hit "break": neither quote character matched.
            raise ValueError("no string quotes around %r" % data)

    # I'm not sure when 'string_escape' was added to the std codecs; it's
    # crazy not to use it if it's there.
    if decode:
        data = data.decode('string_escape')
    return data

stringnl = ArgumentDescriptor(
    name='stringnl',
    n=UP_TO_NEWLINE,
    reader=read_stringnl,
    doc="""A newline-terminated string.

    This is a repr-style string, with embedded escapes, and
    bracketing quotes.
    """)

def read_stringnl_noescape(f):
    return read_stringnl(f, decode=False, stripquotes=False)

stringnl_noescape = ArgumentDescriptor(
    name='stringnl_noescape',
    n=UP_TO_NEWLINE,
    reader=read_stringnl_noescape,
    doc="""A newline-terminated string.

    This is a str-style string, without embedded escapes,
    or bracketing quotes. It should consist solely of
    printable ASCII characters.
    """)

def read_stringnl_noescape_pair(f):
    """
    >>> import StringIO
    >>> read_stringnl_noescape_pair(StringIO.StringIO("Queue\\nEmpty\\njunk"))
    'Queue Empty'
    """

    return "%s %s" % (read_stringnl_noescape(f), read_stringnl_noescape(f))

stringnl_noescape_pair = ArgumentDescriptor(
    name='stringnl_noescape_pair',
    n=UP_TO_NEWLINE,
    reader=read_stringnl_noescape_pair,
    doc="""A pair of newline-terminated strings.

    These are str-style strings, without embedded
    escapes, or bracketing quotes.
    They should consist solely of printable ASCII
    characters. The pair is returned as a single string,
    with a single blank separating the two strings.
    """)

def read_string4(f):
    """
    >>> import StringIO
    >>> read_string4(StringIO.StringIO("\\x00\\x00\\x00\\x00abc"))
    ''
    >>> read_string4(StringIO.StringIO("\\x03\\x00\\x00\\x00abcdef"))
    'abc'
    >>> read_string4(StringIO.StringIO("\\x00\\x00\\x00\\x03abcdef"))
    Traceback (most recent call last):
    ...
    ValueError: expected 50331648 bytes in a string4, but only 6 remain
    """

    n = read_int4(f)
    if n < 0:
        raise ValueError("string4 byte count < 0: %d" % n)
    data = f.read(n)
    if len(data) == n:
        return data
    raise ValueError("expected %d bytes in a string4, but only %d remain" %
                     (n, len(data)))

string4 = ArgumentDescriptor(
    name="string4",
    n=TAKEN_FROM_ARGUMENT,
    reader=read_string4,
    doc="""A counted string.

    The first argument is a 4-byte little-endian signed int giving
    the number of bytes in the string, and the second argument is
    that many bytes.
    """)


def read_string1(f):
    """
    >>> import StringIO
    >>> read_string1(StringIO.StringIO("\\x00"))
    ''
    >>> read_string1(StringIO.StringIO("\\x03abcdef"))
    'abc'
    """

    n = read_uint1(f)
    assert n >= 0
    data = f.read(n)
    if len(data) == n:
        return data
    raise ValueError("expected %d bytes in a string1, but only %d remain" %
                     (n, len(data)))

string1 = ArgumentDescriptor(
    name="string1",
    n=TAKEN_FROM_ARGUMENT,
    reader=read_string1,
    doc="""A counted string.

    The first argument is a 1-byte unsigned int giving the number
    of bytes in the string, and the second argument is that many
    bytes.
    """)


def read_unicodestringnl(f):
    """
    >>> import StringIO
    >>> read_unicodestringnl(StringIO.StringIO("abc\\uabcd\\njunk"))
    u'abc\\uabcd'
    """

    data = f.readline()
    if not data.endswith('\n'):
        raise ValueError("no newline found when trying to read "
                         "unicodestringnl")
    data = data[:-1]    # lose the newline
    return unicode(data, 'raw-unicode-escape')

unicodestringnl = ArgumentDescriptor(
    name='unicodestringnl',
    n=UP_TO_NEWLINE,
    reader=read_unicodestringnl,
    doc="""A newline-terminated Unicode string.

    This is raw-unicode-escape encoded, so consists of
    printable ASCII characters, and may contain embedded
    escape sequences.
    """)

def read_unicodestring4(f):
    """
    >>> import StringIO
    >>> s = u'abcd\\uabcd'
    >>> enc = s.encode('utf-8')
    >>> enc
    'abcd\\xea\\xaf\\x8d'
    >>> n = chr(len(enc)) + chr(0) * 3  # little-endian 4-byte length
    >>> t = read_unicodestring4(StringIO.StringIO(n + enc + 'junk'))
    >>> s == t
    True

    >>> read_unicodestring4(StringIO.StringIO(n + enc[:-1]))
    Traceback (most recent call last):
    ...
    ValueError: expected 7 bytes in a unicodestring4, but only 6 remain
    """

    n = read_int4(f)
    if n < 0:
        raise ValueError("unicodestring4 byte count < 0: %d" % n)
    data = f.read(n)
    if len(data) == n:
        return unicode(data, 'utf-8')
    raise ValueError("expected %d bytes in a unicodestring4, but only %d "
                     "remain" % (n, len(data)))

unicodestring4 = ArgumentDescriptor(
    name="unicodestring4",
    n=TAKEN_FROM_ARGUMENT,
    reader=read_unicodestring4,
    doc="""A counted Unicode string.

    The first argument is a 4-byte little-endian signed int
    giving the number of bytes in the string, and the second
    argument -- the UTF-8 encoding of the Unicode string --
    contains that many bytes.
    """)


def read_decimalnl_short(f):
    """
    >>> import StringIO
    >>> read_decimalnl_short(StringIO.StringIO("1234\\n56"))
    1234

    >>> read_decimalnl_short(StringIO.StringIO("1234L\\n56"))
    Traceback (most recent call last):
    ...
    ValueError: trailing 'L' not allowed in '1234L'
    """

    s = read_stringnl(f, decode=False, stripquotes=False)
    if s.endswith("L"):
        raise ValueError("trailing 'L' not allowed in %r" % s)

    # It's not necessarily true that the result fits in a Python short int:
    # the pickle may have been written on a 64-bit box. There's also a hack
    # for True and False here.
    if s == "00":
        return False
    elif s == "01":
        return True

    try:
        return int(s)
    except OverflowError:
        return long(s)

def read_decimalnl_long(f):
    """
    >>> import StringIO

    >>> read_decimalnl_long(StringIO.StringIO("1234\\n56"))
    Traceback (most recent call last):
    ...
    ValueError: trailing 'L' required in '1234'

    Someday the trailing 'L' will probably go away from this output.

    >>> read_decimalnl_long(StringIO.StringIO("1234L\\n56"))
    1234L

    >>> read_decimalnl_long(StringIO.StringIO("123456789012345678901234L\\n6"))
    123456789012345678901234L
    """

    s = read_stringnl(f, decode=False, stripquotes=False)
    if not s.endswith("L"):
        raise ValueError("trailing 'L' required in %r" % s)
    return long(s)


decimalnl_short = ArgumentDescriptor(
    name='decimalnl_short',
    n=UP_TO_NEWLINE,
    reader=read_decimalnl_short,
    doc="""A newline-terminated decimal integer literal.

    This never has a trailing 'L', and the integer fit
    in a short Python int on the box where the pickle
    was written -- but there's no guarantee it will fit
    in a short Python int on the box where the pickle
    is read.
    """)

decimalnl_long = ArgumentDescriptor(
    name='decimalnl_long',
    n=UP_TO_NEWLINE,
    reader=read_decimalnl_long,
    doc="""A newline-terminated decimal integer literal.

    This has a trailing 'L', and can represent integers
    of any size.
    """)


def read_floatnl(f):
    """
    >>> import StringIO
    >>> read_floatnl(StringIO.StringIO("-1.25\\n6"))
    -1.25
    """
    s = read_stringnl(f, decode=False, stripquotes=False)
    return float(s)

floatnl = ArgumentDescriptor(
    name='floatnl',
    n=UP_TO_NEWLINE,
    reader=read_floatnl,
    doc="""A newline-terminated decimal floating literal.

    In general this requires 17 significant digits for roundtrip
    identity, and pickling then unpickling infinities, NaNs, and
    minus zero doesn't work across boxes, or on some boxes even
    on itself (e.g., Windows can't read the strings it produces
    for infinities or NaNs).
    """)

def read_float8(f):
    """
    >>> import StringIO, struct
    >>> raw = struct.pack(">d", -1.25)
    >>> raw
    '\\xbf\\xf4\\x00\\x00\\x00\\x00\\x00\\x00'
    >>> read_float8(StringIO.StringIO(raw + "\\n"))
    -1.25
    """

    data = f.read(8)
    if len(data) == 8:
        return _unpack(">d", data)[0]
    raise ValueError("not enough data in stream to read float8")


float8 = ArgumentDescriptor(
    name='float8',
    n=8,
    reader=read_float8,
    doc="""An 8-byte binary representation of a float, big-endian.

    The format is unique to Python, and shared with the struct
    module (format string '>d') "in theory" (the struct and cPickle
    implementations don't share the code -- they should). It's
    strongly related to the IEEE-754 double format, and, in normal
    cases, is in fact identical to the big-endian 754 double format.
    On other boxes the dynamic range is limited to that of a 754
    double, and "add a half and chop" rounding is used to reduce
    the precision to 53 bits.
    However, even on a 754 box, infinities, NaNs, and minus zero
    may not be handled correctly (may not survive roundtrip
    pickling intact).
    """)

##############################################################################
# Object descriptors. The stack used by the pickle machine holds objects,
# and in the stack_before and stack_after attributes of OpcodeInfo
# descriptors we need names to describe the various types of objects that can
# appear on the stack.

class StackObject(object):
    __slots__ = (
        # name of descriptor record, for info only
        'name',

        # type of object, or tuple of type objects (meaning the object can
        # be of any type in the tuple)
        'obtype',

        # human-readable docs for this kind of stack object; a string
        'doc',
    )

    def __init__(self, name, obtype, doc):
        assert isinstance(name, str)
        self.name = name

        assert isinstance(obtype, type) or isinstance(obtype, tuple)
        if isinstance(obtype, tuple):
            for contained in obtype:
                assert isinstance(contained, type)
        self.obtype = obtype

        assert isinstance(doc, str)
        self.doc = doc


pyint = StackObject(
    name='int',
    obtype=int,
    doc="A short (as opposed to long) Python integer object.")

pylong = StackObject(
    name='long',
    obtype=long,
    doc="A long (as opposed to short) Python integer object.")

pyinteger_or_bool = StackObject(
    name='int_or_bool',
    obtype=(int, long, bool),
    doc="A Python integer object (short or long), or "
        "a Python bool.")

pyfloat = StackObject(
    name='float',
    obtype=float,
    doc="A Python float object.")

pystring = StackObject(
    name='str',
    obtype=str,
    doc="A Python string object.")

pyunicode = StackObject(
    name='unicode',
    obtype=unicode,
    doc="A Python Unicode string object.")

pynone = StackObject(
    name="None",
    obtype=type(None),
    doc="The Python None object.")

pytuple = StackObject(
    name="tuple",
    obtype=tuple,
    doc="A Python tuple object.")

pylist = StackObject(
    name="list",
    obtype=list,
    doc="A Python list object.")

pydict = StackObject(
    name="dict",
    obtype=dict,
    doc="A Python dict object.")

anyobject = StackObject(
    name='any',
    obtype=object,
    doc="Any kind of object whatsoever.")

markobject = StackObject(
    name="mark",
    obtype=StackObject,
    doc="""'The mark' is a unique object.

    Opcodes that operate on a variable number of objects
    generally don't embed the count of objects in the opcode,
    or pull it off the stack. Instead the MARK opcode is used
    to push a special marker object on the stack, and then
    some other opcodes grab all the objects from the top of
    the stack down to (but not including) the topmost marker
    object.
    """)

stackslice = StackObject(
    name="stackslice",
    obtype=StackObject,
    doc="""An object representing a contiguous slice of the stack.

    This is used in conjunction with markobject, to represent all
    of the stack following the topmost markobject. For example,
    the POP_MARK opcode changes the stack from

        [..., markobject, stackslice]
    to
        [...]

    No matter how many objects are on the stack after the topmost
    markobject, POP_MARK gets rid of all of them (including the
    topmost markobject too).
    """)

##############################################################################
# Descriptors for pickle opcodes.

class OpcodeInfo(object):

    __slots__ = (
        # symbolic name of opcode; a string
        'name',

        # the code used in a bytestream to represent the opcode; a
        # one-character string
        'code',

        # If the opcode has an argument embedded in the byte string, an
        # instance of ArgumentDescriptor specifying its type.
        # Note that arg.reader(s) can be used to read and decode the
        # argument from the bytestream s, and arg.doc documents the format
        # of the raw argument bytes. If the opcode doesn't have an argument
        # embedded in the bytestream, arg should be None.
        'arg',

        # what the stack looks like before this opcode runs; a list
        'stack_before',

        # what the stack looks like after this opcode runs; a list
        'stack_after',

        # the protocol number in which this opcode was introduced; an int
        'proto',

        # human-readable docs for this opcode; a string
        'doc',
    )

    def __init__(self, name, code, arg,
                 stack_before, stack_after, proto, doc):
        assert isinstance(name, str)
        self.name = name

        assert isinstance(code, str)
        assert len(code) == 1
        self.code = code

        assert arg is None or isinstance(arg, ArgumentDescriptor)
        self.arg = arg

        assert isinstance(stack_before, list)
        for x in stack_before:
            assert isinstance(x, StackObject)
        self.stack_before = stack_before

        assert isinstance(stack_after, list)
        for x in stack_after:
            assert isinstance(x, StackObject)
        self.stack_after = stack_after

        assert isinstance(proto, int) and 0 <= proto <= 2
        self.proto = proto

        assert isinstance(doc, str)
        self.doc = doc

I = OpcodeInfo
opcodes = [

    # Ways to spell integers.

    I(name='INT',
      code='I',
      arg=decimalnl_short,
      stack_before=[],
      stack_after=[pyinteger_or_bool],
      proto=0,
      doc="""Push an integer or bool.

      The argument is a newline-terminated decimal literal string.

      The intent may have been that this always fit in a short Python int,
      but INT can be generated in pickles written on a 64-bit box that
      require a Python long on a 32-bit box. The difference between this
      and LONG then is that INT skips a trailing 'L', and produces a short
      int whenever possible.

      Another difference arises because, when bool was introduced as a
      distinct type in 2.3, builtin names True and False were also added to
      2.2.2, mapping to ints 1 and 0. For compatibility in both directions,
      True gets pickled as INT + "I01\\n", and False as INT + "I00\\n".
      Leading zeroes are never produced for a genuine integer. The 2.3
      (and later) unpicklers special-case these and return bool instead;
      earlier unpicklers ignore the leading "0" and return the int.
      """),

    I(name='LONG',
      code='L',
      arg=decimalnl_long,
      stack_before=[],
      stack_after=[pylong],
      proto=0,
      doc="""Push a long integer.

      The same as INT, except that the literal ends with 'L', and always
      unpickles to a Python long. There doesn't seem to be a real purpose
      to the trailing 'L'.
      """),

    I(name='BININT',
      code='J',
      arg=int4,
      stack_before=[],
      stack_after=[pyint],
      proto=1,
      doc="""Push a four-byte signed integer.

      This handles the full range of Python (short) integers on a 32-bit
      box, directly as binary bytes (1 for the opcode and 4 for the integer).
      If the integer is non-negative and fits in 1 or 2 bytes, pickling via
      BININT1 or BININT2 saves space.
      """),

    I(name='BININT1',
      code='K',
      arg=uint1,
      stack_before=[],
      stack_after=[pyint],
      proto=1,
      doc="""Push a one-byte unsigned integer.

      This is a space optimization for pickling very small non-negative ints,
      in range(256).
      """),

    I(name='BININT2',
      code='M',
      arg=uint2,
      stack_before=[],
      stack_after=[pyint],
      proto=1,
      doc="""Push a two-byte unsigned integer.

      This is a space optimization for pickling small positive ints, in
      range(256, 2**16). Integers in range(256) can also be pickled via
      BININT2, but BININT1 instead saves a byte.
      """),

    # Ways to spell strings (8-bit, not Unicode).

    I(name='STRING',
      code='S',
      arg=stringnl,
      stack_before=[],
      stack_after=[pystring],
      proto=0,
      doc="""Push a Python string object.

      The argument is a repr-style string, with bracketing quote characters,
      and perhaps embedded escapes. The argument extends until the next
      newline character.
      """),

    I(name='BINSTRING',
      code='T',
      arg=string4,
      stack_before=[],
      stack_after=[pystring],
      proto=1,
      doc="""Push a Python string object.

      There are two arguments: the first is a 4-byte little-endian signed int
      giving the number of bytes in the string, and the second is that many
      bytes, which are taken literally as the string content.
      """),

    I(name='SHORT_BINSTRING',
      code='U',
      arg=string1,
      stack_before=[],
      stack_after=[pystring],
      proto=1,
      doc="""Push a Python string object.

      There are two arguments: the first is a 1-byte unsigned int giving
      the number of bytes in the string, and the second is that many bytes,
      which are taken literally as the string content.
      """),

    # Ways to spell None.

    I(name='NONE',
      code='N',
      arg=None,
      stack_before=[],
      stack_after=[pynone],
      proto=0,
      doc="Push None on the stack."),

    # Ways to spell Unicode strings.

    I(name='UNICODE',
      code='V',
      arg=unicodestringnl,
      stack_before=[],
      stack_after=[pyunicode],
      proto=0,  # this may be pure-text, but it's a later addition
      doc="""Push a Python Unicode string object.

      The argument is a raw-unicode-escape encoding of a Unicode string,
      and so may contain embedded escape sequences. The argument extends
      until the next newline character.
      """),

    I(name='BINUNICODE',
      code='X',
      arg=unicodestring4,
      stack_before=[],
      stack_after=[pyunicode],
      proto=1,
      doc="""Push a Python Unicode string object.

      There are two arguments: the first is a 4-byte little-endian signed int
      giving the number of bytes in the string. The second is that many
      bytes, and is the UTF-8 encoding of the Unicode string.
      """),

    # Ways to spell floats.

    I(name='FLOAT',
      code='F',
      arg=floatnl,
      stack_before=[],
      stack_after=[pyfloat],
      proto=0,
      doc="""Newline-terminated decimal float literal.

      The argument is repr(a_float), and in general requires 17 significant
      digits for roundtrip conversion to be an identity (this is so for
      IEEE-754 double precision values, which is what Python float maps to
      on most boxes).

      In general, FLOAT cannot be used to transport infinities, NaNs, or
      minus zero across boxes (or even on a single box, if the platform C
      library can't read the strings it produces for such things -- Windows
      is like that), but may do less damage than BINFLOAT on boxes with
      greater precision or dynamic range than IEEE-754 double.
      """),

    I(name='BINFLOAT',
      code='G',
      arg=float8,
      stack_before=[],
      stack_after=[pyfloat],
      proto=1,
      doc="""Float stored in binary form, with 8 bytes of data.

      This generally requires less than half the space of FLOAT encoding.
      In general, BINFLOAT cannot be used to transport infinities, NaNs, or
      minus zero, raises an exception if the exponent exceeds the range of
      an IEEE-754 double, and retains no more than 53 bits of precision (if
      there are more than that, "add a half and chop" rounding is used to
      cut it back to 53 significant bits).
      """),

    # Ways to build lists.

    I(name='EMPTY_LIST',
      code=']',
      arg=None,
      stack_before=[],
      stack_after=[pylist],
      proto=1,
      doc="Push an empty list."),

    I(name='APPEND',
      code='a',
      arg=None,
      stack_before=[pylist, anyobject],
      stack_after=[pylist],
      proto=0,
      doc="""Append an object to a list.

      Stack before: ... pylist anyobject
      Stack after:  ... pylist+[anyobject]
      """),

    I(name='APPENDS',
      code='e',
      arg=None,
      stack_before=[pylist, markobject, stackslice],
      stack_after=[pylist],
      proto=1,
      doc="""Extend a list by a slice of stack objects.

      Stack before: ... pylist markobject stackslice
      Stack after:  ... pylist+stackslice
      """),

    I(name='LIST',
      code='l',
      arg=None,
      stack_before=[markobject, stackslice],
      stack_after=[pylist],
      proto=0,
      doc="""Build a list out of the topmost stack slice, after markobject.

      All the stack entries following the topmost markobject are placed into
      a single Python list, which single list object replaces all of the
      stack from the topmost markobject onward. For example,

      Stack before: ... markobject 1 2 3 'abc'
      Stack after:  ... [1, 2, 3, 'abc']
      """),

    # Ways to build tuples.

    I(name='EMPTY_TUPLE',
      code=')',
      arg=None,
      stack_before=[],
      stack_after=[pytuple],
      proto=1,
      doc="Push an empty tuple."),

    I(name='TUPLE',
      code='t',
      arg=None,
      stack_before=[markobject, stackslice],
      stack_after=[pytuple],
      proto=0,
      doc="""Build a tuple out of the topmost stack slice, after markobject.

      All the stack entries following the topmost markobject are placed into
      a single Python tuple, which single tuple object replaces all of the
      stack from the topmost markobject onward. For example,

      Stack before: ... markobject 1 2 3 'abc'
      Stack after:  ... (1, 2, 3, 'abc')
      """),

    # Ways to build dicts.

    I(name='EMPTY_DICT',
      code='}',
      arg=None,
      stack_before=[],
      stack_after=[pydict],
      proto=1,
      doc="Push an empty dict."),

    I(name='DICT',
      code='d',
      arg=None,
      stack_before=[markobject, stackslice],
      stack_after=[pydict],
      proto=0,
      doc="""Build a dict out of the topmost stack slice, after markobject.

      All the stack entries following the topmost markobject are placed into
      a single Python dict, which single dict object replaces all of the
      stack from the topmost markobject onward. The stack slice alternates
      key, value, key, value, .... For example,

      Stack before: ... markobject 1 2 3 'abc'
      Stack after:  ... {1: 2, 3: 'abc'}
      """),

    I(name='SETITEM',
      code='s',
      arg=None,
      stack_before=[pydict, anyobject, anyobject],
      stack_after=[pydict],
      proto=0,
      doc="""Add a key+value pair to an existing dict.

      Stack before:  ... pydict key value
      Stack after:   ... pydict

      where pydict has been modified via pydict[key] = value.
      """),

    I(name='SETITEMS',
      code='u',
      arg=None,
      stack_before=[pydict, markobject, stackslice],
      stack_after=[pydict],
      proto=1,
      doc="""Add an arbitrary number of key+value pairs to an existing dict.

      The slice of the stack following the topmost markobject is taken as
      an alternating sequence of keys and values, added to the dict
      immediately under the topmost markobject. Everything at and after the
      topmost markobject is popped, leaving the mutated dict at the top
      of the stack.

      Stack before:  ... pydict markobject key_1 value_1 ... key_n value_n
      Stack after:   ... pydict

      where pydict has been modified via pydict[key_i] = value_i for i in
      1, 2, ..., n, and in that order.
      """),

    # Stack manipulation.

    I(name='POP',
      code='0',
      arg=None,
      stack_before=[anyobject],
      stack_after=[],
      proto=0,
      doc="Discard the top stack item, shrinking the stack by one item."),

    I(name='DUP',
      code='2',
      arg=None,
      stack_before=[anyobject],
      stack_after=[anyobject, anyobject],
      proto=0,
      doc="Push the top stack item onto the stack again, duplicating it."),

    I(name='MARK',
      code='(',
      arg=None,
      stack_before=[],
      stack_after=[markobject],
      proto=0,
      doc="""Push markobject onto the stack.

      markobject is a unique object, used by other opcodes to identify a
      region of the stack containing a variable number of objects for them
      to work on. See markobject.doc for more detail.
      """),

    I(name='POP_MARK',
      code='1',
      arg=None,
      stack_before=[markobject, stackslice],
      stack_after=[],
      proto=0,
      doc="""Pop all the stack objects at and above the topmost markobject.

      When an opcode using a variable number of stack objects is done,
      POP_MARK is used to remove those objects, and to remove the markobject
      that delimited their starting position on the stack.
      """),

    # Memo manipulation. There are really only two operations (get and put),
    # each in all-text, "short binary", and "long binary" flavors.

    I(name='GET',
      code='g',
      arg=decimalnl_short,
      stack_before=[],
      stack_after=[anyobject],
      proto=0,
      doc="""Read an object from the memo and push it on the stack.

      The index of the memo object to push is given by the newline-terminated
      decimal string following. BINGET and LONG_BINGET are space-optimized
      versions.
      """),

    I(name='BINGET',
      code='h',
      arg=uint1,
      stack_before=[],
      stack_after=[anyobject],
      proto=1,
      doc="""Read an object from the memo and push it on the stack.

      The index of the memo object to push is given by the 1-byte unsigned
      integer following.
      """),

    I(name='LONG_BINGET',
      code='j',
      arg=int4,
      stack_before=[],
      stack_after=[anyobject],
      proto=1,
      doc="""Read an object from the memo and push it on the stack.

      The index of the memo object to push is given by the 4-byte signed
      little-endian integer following.
      """),

    I(name='PUT',
      code='p',
      arg=decimalnl_short,
      stack_before=[],
      stack_after=[],
      proto=0,
      doc="""Store the stack top into the memo. The stack is not popped.

      The index of the memo location to write into is given by the newline-
      terminated decimal string following. BINPUT and LONG_BINPUT are
      space-optimized versions.
      """),

    I(name='BINPUT',
      code='q',
      arg=uint1,
      stack_before=[],
      stack_after=[],
      proto=1,
      doc="""Store the stack top into the memo. The stack is not popped.

      The index of the memo location to write into is given by the 1-byte
      unsigned integer following.
      """),

    I(name='LONG_BINPUT',
      code='r',
      arg=int4,
      stack_before=[],
      stack_after=[],
      proto=1,
      doc="""Store the stack top into the memo. The stack is not popped.

      The index of the memo location to write into is given by the 4-byte
      signed little-endian integer following.
      """),

    # Push a class object, or module function, on the stack, via its module
    # and name.

    I(name='GLOBAL',
      code='c',
      arg=stringnl_noescape_pair,
      stack_before=[],
      stack_after=[anyobject],
      proto=0,
      doc="""Push a global object (module.attr) on the stack.

      Two newline-terminated strings follow the GLOBAL opcode. The first is
      taken as a module name, and the second as a class name. The class
      object module.class is pushed on the stack.
      More accurately, the object returned by self.find_class(module, class)
      is pushed on the stack, so unpickling subclasses can override this
      form of lookup.
      """),

    # Ways to build objects of classes pickle doesn't know about directly
    # (user-defined classes). I despair of documenting this accurately
    # and comprehensibly -- you really have to read the pickle code to
    # find all the special cases.

    I(name='REDUCE',
      code='R',
      arg=None,
      stack_before=[anyobject, anyobject],
      stack_after=[anyobject],
      proto=0,
      doc="""Push an object built from a callable and an argument tuple.

      The opcode is named to remind of the __reduce__() method.

      Stack before: ... callable pytuple
      Stack after:  ... callable(*pytuple)

      The callable and the argument tuple are the first two items returned
      by a __reduce__ method. Applying the callable to the argtuple is
      supposed to reproduce the original object, or at least get it started.
      If the __reduce__ method returns a 3-tuple, the last component is an
      argument to be passed to the object's __setstate__, and then the REDUCE
      opcode is followed by code to create setstate's argument, and then a
      BUILD opcode to apply __setstate__ to that argument.

      There are lots of special cases here. The argtuple can be None, in
      which case callable.__basicnew__() is called instead to produce the
      object to be pushed on the stack. This appears to be a trick unique
      to ExtensionClasses, and is deprecated regardless.

      If type(callable) is not ClassType, REDUCE complains unless the
      callable has been registered with the copy_reg module's
      safe_constructors dict, or the callable has a magic
      '__safe_for_unpickling__' attribute with a true value. I'm not sure
      why it does this, but I've sure seen this complaint often enough when
      I didn't want to <wink>.
      """),

    I(name='BUILD',
      code='b',
      arg=None,
      stack_before=[anyobject, anyobject],
      stack_after=[anyobject],
      proto=0,
      doc="""Finish building an object, via __setstate__ or dict update.

      Stack before: ... anyobject argument
      Stack after:  ... anyobject

      where anyobject may have been mutated, as follows:

      If the object has a __setstate__ method,

          anyobject.__setstate__(argument)

      is called.

      Else the argument must be a dict, the object must have a __dict__, and
      the object is updated via

          anyobject.__dict__.update(argument)

      This may raise RuntimeError in restricted execution mode (which
      disallows access to __dict__ directly); in that case, the object
      is updated instead via

          for k, v in argument.items():
              anyobject[k] = v
      """),

    I(name='INST',
      code='i',
      arg=stringnl_noescape_pair,
      stack_before=[markobject, stackslice],
      stack_after=[anyobject],
      proto=0,
      doc="""Build a class instance.

      This is the protocol 0 version of protocol 1's OBJ opcode.
      INST is followed by two newline-terminated strings, giving a
      module and class name, just as for the GLOBAL opcode (and see
      GLOBAL for more details about that). self.find_class(module, name)
      is used to get a class object.

      In addition, all the objects on the stack following the topmost
      markobject are gathered into a tuple and popped (along with the
      topmost markobject), just as for the TUPLE opcode.

      Now it gets complicated. If all of these are true:

        + The argtuple is empty (markobject was at the top of the stack
          at the start).

        + It's an old-style class object (the type of the class object is
          ClassType).

        + The class object does not have a __getinitargs__ attribute.

      then we want to create an old-style class instance without invoking
      its __init__() method (pickle has waffled on this over the years; not
      calling __init__() is current wisdom). In this case, an instance of
      an old-style dummy class is created, and then we try to rebind its
      __class__ attribute to the desired class object. If this succeeds,
      the new instance object is pushed on the stack, and we're done. In
      restricted execution mode it can fail (assignment to __class__ is
      disallowed), and I'm not really sure what happens then -- it looks
      like the code ends up calling the class object's __init__ anyway,
      via falling into the next case.

      Else (the argtuple is not empty, it's not an old-style class object,
      or the class object does have a __getinitargs__ attribute), the code
      first insists that the class object have a __safe_for_unpickling__
      attribute. Unlike as for the __safe_for_unpickling__ check in REDUCE,
      it doesn't matter whether this attribute has a true or false value, it
      only matters whether it exists (XXX this smells like a bug). If
      __safe_for_unpickling__ doesn't exist, UnpicklingError is raised.

      Else (the class object does have a __safe_for_unpickling__ attr),
      the class object obtained from INST's arguments is applied to the
      argtuple obtained from the stack, and the resulting instance object
      is pushed on the stack.
      """),

    I(name='OBJ',
      code='o',
      arg=None,
      stack_before=[markobject, anyobject, stackslice],
      stack_after=[anyobject],
      proto=1,
      doc="""Build a class instance.

      This is the protocol 1 version of protocol 0's INST opcode, and is
      very much like it. The major difference is that the class object
      is taken off the stack, allowing it to be retrieved from the memo
      repeatedly if several instances of the same class are created.
      This can be much more efficient (in both time and space) than
      repeatedly embedding the module and class names in INST opcodes.

      Unlike INST, OBJ takes no arguments from the opcode stream. Instead
      the class object is taken off the stack, immediately above the
      topmost markobject:

      Stack before: ... markobject classobject stackslice
      Stack after:  ... new_instance_object

      As for INST, the remainder of the stack above the markobject is
      gathered into an argument tuple, and then the logic seems identical,
      except that no __safe_for_unpickling__ check is done (XXX this smells
      like a bug). See INST for the gory details.
      """),

    # Machine control.

    I(name='STOP',
      code='.',
      arg=None,
      stack_before=[anyobject],
      stack_after=[],
      proto=0,
      doc="""Stop the unpickling machine.

      Every pickle ends with this opcode. The object at the top of the stack
      is popped, and that's the result of unpickling. The stack should be
      empty then.
      """),

    # Ways to deal with persistent IDs.

    I(name='PERSID',
      code='P',
      arg=stringnl_noescape,
      stack_before=[],
      stack_after=[anyobject],
      proto=0,
      doc="""Push an object identified by a persistent ID.

      The pickle module doesn't define what a persistent ID means. PERSID's
      argument is a newline-terminated str-style (no embedded escapes, no
      bracketing quote characters) string, which *is* "the persistent ID".
      The unpickler passes this string to self.persistent_load(). Whatever
      object that returns is pushed on the stack. There is no implementation
      of persistent_load() in Python's unpickler: it must be supplied by an
      unpickler subclass.
      """),

    I(name='BINPERSID',
      code='Q',
      arg=None,
      stack_before=[anyobject],
      stack_after=[anyobject],
      proto=1,
      doc="""Push an object identified by a persistent ID.

      Like PERSID, except the persistent ID is popped off the stack (instead
      of being a string embedded in the opcode bytestream). The persistent
      ID is passed to self.persistent_load(), and whatever object that
      returns is pushed on the stack. See PERSID for more detail.
      """),
]
del I

# Verify uniqueness of .name and .code members.
name2i = {}
code2i = {}

for i, d in enumerate(opcodes):
    if d.name in name2i:
        raise ValueError("repeated name %r at indices %d and %d" %
                         (d.name, name2i[d.name], i))
    if d.code in code2i:
        raise ValueError("repeated code %r at indices %d and %d" %
                         (d.code, code2i[d.code], i))

    name2i[d.name] = i
    code2i[d.code] = i

del name2i, code2i, i, d

##############################################################################
# Build a code2op dict, mapping opcode characters to OpcodeInfo records.
# Also ensure we've got the same stuff as pickle.py, although the
# introspection here is dicey.

code2op = {}
for d in opcodes:
    code2op[d.code] = d
del d

def assure_pickle_consistency(verbose=False):
    import pickle, re

    copy = code2op.copy()
    for name in pickle.__all__:
        if not re.match("[A-Z][A-Z0-9_]+$", name):
            if verbose:
                print "skipping %r: it doesn't look like an opcode name" % name
            continue
        picklecode = getattr(pickle, name)
        if not isinstance(picklecode, str) or len(picklecode) != 1:
            if verbose:
                print ("skipping %r: value %r doesn't look like a pickle "
                       "code" % (name, picklecode))
            continue
        if picklecode in copy:
            if verbose:
                print "checking name %r w/ code %r for consistency" % (
                      name, picklecode)
            d = copy[picklecode]
            if d.name != name:
                raise ValueError("for pickle code %r, pickle.py uses name %r "
                                 "but we're using name %r" % (picklecode,
                                                              name,
                                                              d.name))
            # Forget this one.
            # Any left over in copy at the end are a problem
            # of a different kind.
            del copy[picklecode]
        else:
            raise ValueError("pickle.py appears to have a pickle opcode with "
                             "name %r and code %r, but we don't" %
                             (name, picklecode))
    if copy:
        msg = ["we appear to have pickle opcodes that pickle.py doesn't have:"]
        for code, d in copy.items():
            msg.append(" name %r with code %r" % (d.name, code))
        raise ValueError("\n".join(msg))

assure_pickle_consistency()

##############################################################################
# A pickle opcode generator.

def genops(pickle):
    """Generate all the opcodes in a pickle.

    'pickle' is a file-like object, or string, containing the pickle.

    Each opcode in the pickle is generated, from the current pickle position,
    stopping after a STOP opcode is delivered. A triple is generated for
    each opcode:

        opcode, arg, pos

    opcode is an OpcodeInfo record, describing the current opcode.

    If the opcode has an argument embedded in the pickle, arg is its decoded
    value, as a Python object. If the opcode doesn't have an argument, arg
    is None.

    If the pickle has a tell() method, pos was the value of pickle.tell()
    before reading the current opcode. If the pickle is a string object,
    it's wrapped in a StringIO object, and the latter's tell() result is
    used. Else (the pickle doesn't have a tell(), and it's not obvious how
    to query its current position) pos is None.
    """

    import cStringIO as StringIO

    if isinstance(pickle, str):
        pickle = StringIO.StringIO(pickle)

    if hasattr(pickle, "tell"):
        getpos = pickle.tell
    else:
        getpos = lambda: None

    while True:
        pos = getpos()
        code = pickle.read(1)
        opcode = code2op.get(code)
        if opcode is None:
            if code == "":
                raise ValueError("pickle exhausted before seeing STOP")
            else:
                raise ValueError("at position %s, opcode %r unknown" % (
                                 pos is None and "<unknown>" or pos,
                                 code))
        if opcode.arg is None:
            arg = None
        else:
            arg = opcode.arg.reader(pickle)
        yield opcode, arg, pos
        if code == '.':
            assert opcode.name == 'STOP'
            break

##############################################################################
# A symbolic pickle disassembler.

def dis(pickle, out=None, indentlevel=4):
    """Produce a symbolic disassembly of a pickle.

    'pickle' is a file-like object, or string, containing a (at least one)
    pickle. The pickle is disassembled from the current position, through
    the first STOP opcode encountered.

    Optional arg 'out' is a file-like object to which the disassembly is
    printed. It defaults to sys.stdout.

    Optional arg indentlevel is the number of blanks by which to indent
    a new MARK level. It defaults to 4.
    """

    markstack = []
    indentchunk = ' ' * indentlevel
    for opcode, arg, pos in genops(pickle):
        if pos is not None:
            print >> out, "%5d:" % pos,

        line = "%s %s%s" % (opcode.code,
                            indentchunk * len(markstack),
                            opcode.name)

        markmsg = None
        if markstack and markobject in opcode.stack_before:
            assert markobject not in opcode.stack_after
            markpos = markstack.pop()
            if markpos is not None:
                markmsg = "(MARK at %d)" % markpos

        if arg is not None or markmsg:
            # make a mild effort to align arguments
            line += ' ' * (10 - len(opcode.name))
            if arg is not None:
                line += ' ' + repr(arg)
            if markmsg:
                line += ' ' + markmsg
        print >> out, line

        if markobject in opcode.stack_after:
            assert markobject not in opcode.stack_before
            markstack.append(pos)


_dis_test = """
>>> import pickle
>>> x = [1, 2, (3, 4), {'abc': u"def"}]
>>> pik = pickle.dumps(x)
>>> dis(pik)
    0: ( MARK
    1: l     LIST       (MARK at 0)
    2: p PUT        0
    5: I INT        1
    8: a APPEND
    9: I INT        2
   12: a APPEND
   13: ( MARK
   14: I     INT        3
   17: I     INT        4
   20: t     TUPLE      (MARK at 13)
   21: p PUT        1
   24: a APPEND
   25: ( MARK
   26: d     DICT       (MARK at 25)
   27: p PUT        2
   30: S STRING     'abc'
   37: p PUT        3
   40: V UNICODE    u'def'
   45: p PUT        4
   48: s SETITEM
   49: a APPEND
   50: . STOP

Try again with a "binary" pickle.

>>> pik = pickle.dumps(x, 1)
>>> dis(pik)
    0: ] EMPTY_LIST
    1: q BINPUT     0
    3: ( MARK
    4: K     BININT1    1
    6: K     BININT1    2
    8: (     MARK
    9: K         BININT1    3
   11: K         BININT1    4
   13: t         TUPLE      (MARK at 8)
   14: q     BINPUT     1
   16: }     EMPTY_DICT
   17: q     BINPUT     2
   19: U     SHORT_BINSTRING 'abc'
   24: q     BINPUT     3
   26: X     BINUNICODE u'def'
   34: q     BINPUT     4
   36: s     SETITEM
   37: e     APPENDS    (MARK at 3)
   38: . STOP

Exercise the INST/OBJ/BUILD family.

>>> import random
>>> dis(pickle.dumps(random.random))
    0: c GLOBAL     'random random'
   15: p PUT        0
   18: . STOP

>>> x = [pickle.PicklingError()] * 2
>>> dis(pickle.dumps(x))
    0: ( MARK
    1: l     LIST       (MARK at 0)
    2: p PUT        0
    5: ( MARK
    6: i     INST       'pickle PicklingError' (MARK at 5)
   28: p PUT        1
   31: ( MARK
   32: d     DICT       (MARK at 31)
   33: p PUT        2
   36: S STRING     'args'
   44: p PUT        3
   47: ( MARK
   48: t     TUPLE      (MARK at 47)
   49: p PUT        4
   52: s SETITEM
   53: b BUILD
   54: a APPEND
   55: g GET        1
   58: a APPEND
   59: . STOP

>>> dis(pickle.dumps(x, 1))
    0: ] EMPTY_LIST
    1: q BINPUT     0
    3: ( MARK
    4: (     MARK
    5: c         GLOBAL     'pickle PicklingError'
   27: q         BINPUT     1
   29: o         OBJ        (MARK at 4)
   30: q     BINPUT     2
   32: }     EMPTY_DICT
   33: q     BINPUT     3
   35: U     SHORT_BINSTRING 'args'
   41: q     BINPUT     4
   43: )     EMPTY_TUPLE
   44: s     SETITEM
   45: b     BUILD
   46: h     BINGET     2
   48: e     APPENDS    (MARK at 3)
   49: . STOP

Try "the canonical" recursive-object test.

>>> L = []
>>> T = L,
>>> L.append(T)
>>> L[0] is T
True
>>> T[0] is L
True
>>> L[0][0] is L
True
>>> T[0][0] is T
True
>>> dis(pickle.dumps(L))
    0: ( MARK
    1: l     LIST       (MARK at 0)
    2: p PUT        0
    5: ( MARK
    6: g     GET        0
    9: t     TUPLE      (MARK at 5)
   10: p PUT        1
   13: a APPEND
   14: . STOP
>>> dis(pickle.dumps(L, 1))
    0: ] EMPTY_LIST
    1: q BINPUT     0
    3: ( MARK
    4: h     BINGET     0
    6: t     TUPLE      (MARK at 3)
    7: q BINPUT     1
    9: a APPEND
   10: . STOP

The protocol 0 pickle of the tuple causes the disassembly to get confused,
as it doesn't realize that the POP opcode at 16 gets rid of the MARK at 0
(so the output remains indented until the end).
The protocol 1 pickle doesn't trigger this glitch, because the disassembler
realizes that POP_MARK gets rid of the MARK. Doing a better job on the
protocol 0 pickle would require the disassembler to emulate the stack.

>>> dis(pickle.dumps(T))
    0: ( MARK
    1: (     MARK
    2: l         LIST       (MARK at 1)
    3: p     PUT        0
    6: (     MARK
    7: g         GET        0
   10: t         TUPLE      (MARK at 6)
   11: p     PUT        1
   14: a     APPEND
   15: 0     POP
   16: 0     POP
   17: g     GET        1
   20: .     STOP
>>> dis(pickle.dumps(T, 1))
    0: ( MARK
    1: ]     EMPTY_LIST
    2: q     BINPUT     0
    4: (     MARK
    5: h         BINGET     0
    7: t         TUPLE      (MARK at 4)
    8: q     BINPUT     1
   10: a     APPEND
   11: 1     POP_MARK   (MARK at 0)
   12: h BINGET     1
   14: . STOP
"""

__test__ = {'disassembler_test': _dis_test,
           }

def _test():
    import doctest
    return doctest.testmod()

if __name__ == "__main__":
    _test()
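
# Usage sketch (not part of the module). The genops() docstring above says
# that each opcode is delivered as an (opcode, arg, pos) triple, where opcode
# is an OpcodeInfo record with .name and .proto attributes. Assuming this
# file is importable as the standard `pickletools` module, a caller might
# consume those triples roughly as follows. The helper names opcode_names
# and highest_proto are hypothetical and exist only for this illustration.

```python
import pickle
import pickletools  # assumption: this module, installed under its usual name


def opcode_names(data):
    """Return the opcode names of the first pickle in `data`, in order."""
    return [opcode.name for opcode, arg, pos in pickletools.genops(data)]


def highest_proto(data):
    """Return the largest .proto value among the opcodes of a pickle."""
    return max(opcode.proto for opcode, arg, pos in pickletools.genops(data))


if __name__ == "__main__":
    text_pickle = pickle.dumps([1, 2, (3, 4)], 0)
    names = opcode_names(text_pickle)
    # genops() stops after delivering STOP, so STOP is always the last name.
    print(names[-1])
    print(highest_proto(text_pickle))
```

# Taking the maximum of .proto over the generated opcodes is one simple way
# to guess which protocol a pickle needs; it is only a sketch and does no
# well-formedness checking.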