pickletools.py revision 9594942716a8f9c557b85d31751753d89cd7cebf
1'''"Executable documentation" for the pickle module. 2 3Extensive comments about the pickle protocols and pickle-machine opcodes 4can be found here. Some functions meant for external use: 5 6genops(pickle) 7 Generate all the opcodes in a pickle, as (opcode, arg, position) triples. 8 9dis(pickle, out=None, memo=None, indentlevel=4) 10 Print a symbolic disassembly of a pickle. 11''' 12 13import codecs 14import pickle 15import re 16import sys 17 18__all__ = ['dis', 'genops', 'optimize'] 19 20bytes_types = pickle.bytes_types 21 22# Other ideas: 23# 24# - A pickle verifier: read a pickle and check it exhaustively for 25# well-formedness. dis() does a lot of this already. 26# 27# - A protocol identifier: examine a pickle and return its protocol number 28# (== the highest .proto attr value among all the opcodes in the pickle). 29# dis() already prints this info at the end. 30# 31# - A pickle optimizer: for example, tuple-building code is sometimes more 32# elaborate than necessary, catering for the possibility that the tuple 33# is recursive. Or lots of times a PUT is generated that's never accessed 34# by a later GET. 35 36 37""" 38"A pickle" is a program for a virtual pickle machine (PM, but more accurately 39called an unpickling machine). It's a sequence of opcodes, interpreted by the 40PM, building an arbitrarily complex Python object. 41 42For the most part, the PM is very simple: there are no looping, testing, or 43conditional instructions, no arithmetic and no function calls. Opcodes are 44executed once each, from first to last, until a STOP opcode is reached. 45 46The PM has two data areas, "the stack" and "the memo". 47 48Many opcodes push Python objects onto the stack; e.g., INT pushes a Python 49integer object on the stack, whose value is gotten from a decimal string 50literal immediately following the INT opcode in the pickle bytestream. Other 51opcodes take Python objects off the stack. The result of unpickling is 52whatever object is left on the stack when the final STOP opcode is executed. 53 54The memo is simply an array of objects, or it can be implemented as a dict 55mapping little integers to objects. The memo serves as the PM's "long term 56memory", and the little integers indexing the memo are akin to variable 57names. Some opcodes pop a stack object into the memo at a given index, 58and others push a memo object at a given index onto the stack again. 59 60At heart, that's all the PM has. Subtleties arise for these reasons: 61 62+ Object identity. Objects can be arbitrarily complex, and subobjects 63 may be shared (for example, the list [a, a] refers to the same object a 64 twice). It can be vital that unpickling recreate an isomorphic object 65 graph, faithfully reproducing sharing. 66 67+ Recursive objects. For example, after "L = []; L.append(L)", L is a 68 list, and L[0] is the same list. This is related to the object identity 69 point, and some sequences of pickle opcodes are subtle in order to 70 get the right result in all cases. 71 72+ Things pickle doesn't know everything about. Examples of things pickle 73 does know everything about are Python's builtin scalar and container 74 types, like ints and tuples. They generally have opcodes dedicated to 75 them. For things like module references and instances of user-defined 76 classes, pickle's knowledge is limited. Historically, many enhancements 77 have been made to the pickle protocol in order to do a better (faster, 78 and/or more compact) job on those. 79 80+ Backward compatibility and micro-optimization. As explained below, 81 pickle opcodes never go away, not even when better ways to do a thing 82 get invented. The repertoire of the PM just keeps growing over time. 83 For example, protocol 0 had two opcodes for building Python integers (INT 84 and LONG), protocol 1 added three more for more-efficient pickling of short 85 integers, and protocol 2 added two more for more-efficient pickling of 86 long integers (before protocol 2, the only ways to pickle a Python long 87 took time quadratic in the number of digits, for both pickling and 88 unpickling). "Opcode bloat" isn't so much a subtlety as a source of 89 wearying complication. 90 91 92Pickle protocols: 93 94For compatibility, the meaning of a pickle opcode never changes. Instead new 95pickle opcodes get added, and each version's unpickler can handle all the 96pickle opcodes in all protocol versions to date. So old pickles continue to 97be readable forever. The pickler can generally be told to restrict itself to 98the subset of opcodes available under previous protocol versions too, so that 99users can create pickles under the current version readable by older 100versions. However, a pickle does not contain its version number embedded 101within it. If an older unpickler tries to read a pickle using a later 102protocol, the result is most likely an exception due to seeing an unknown (in 103the older unpickler) opcode. 104 105The original pickle used what's now called "protocol 0", and what was called 106"text mode" before Python 2.3. The entire pickle bytestream is made up of 107printable 7-bit ASCII characters, plus the newline character, in protocol 0. 108That's why it was called text mode. Protocol 0 is small and elegant, but 109sometimes painfully inefficient. 110 111The second major set of additions is now called "protocol 1", and was called 112"binary mode" before Python 2.3. This added many opcodes with arguments 113consisting of arbitrary bytes, including NUL bytes and unprintable "high bit" 114bytes. Binary mode pickles can be substantially smaller than equivalent 115text mode pickles, and sometimes faster too; e.g., BININT represents a 4-byte 116int as 4 bytes following the opcode, which is cheaper to unpickle than the 117(perhaps) 11-character decimal string attached to INT. Protocol 1 also added 118a number of opcodes that operate on many stack elements at once (like APPENDS 119and SETITEMS), and "shortcut" opcodes (like EMPTY_DICT and EMPTY_TUPLE). 120 121The third major set of additions came in Python 2.3, and is called "protocol 1222". This added: 123 124- A better way to pickle instances of new-style classes (NEWOBJ). 125 126- A way for a pickle to identify its protocol (PROTO). 127 128- Time- and space- efficient pickling of long ints (LONG{1,4}). 129 130- Shortcuts for small tuples (TUPLE{1,2,3}}. 131 132- Dedicated opcodes for bools (NEWTRUE, NEWFALSE). 133 134- The "extension registry", a vector of popular objects that can be pushed 135 efficiently by index (EXT{1,2,4}). This is akin to the memo and GET, but 136 the registry contents are predefined (there's nothing akin to the memo's 137 PUT). 138 139Another independent change with Python 2.3 is the abandonment of any 140pretense that it might be safe to load pickles received from untrusted 141parties -- no sufficient security analysis has been done to guarantee 142this and there isn't a use case that warrants the expense of such an 143analysis. 144 145To this end, all tests for __safe_for_unpickling__ or for 146copyreg.safe_constructors are removed from the unpickling code. 147References to these variables in the descriptions below are to be seen 148as describing unpickling in Python 2.2 and before. 149""" 150 151# Meta-rule: Descriptions are stored in instances of descriptor objects, 152# with plain constructors. No meta-language is defined from which 153# descriptors could be constructed. If you want, e.g., XML, write a little 154# program to generate XML from the objects. 155 156############################################################################## 157# Some pickle opcodes have an argument, following the opcode in the 158# bytestream. An argument is of a specific type, described by an instance 159# of ArgumentDescriptor. These are not to be confused with arguments taken 160# off the stack -- ArgumentDescriptor applies only to arguments embedded in 161# the opcode stream, immediately following an opcode. 162 163# Represents the number of bytes consumed by an argument delimited by the 164# next newline character. 165UP_TO_NEWLINE = -1 166 167# Represents the number of bytes consumed by a two-argument opcode where 168# the first argument gives the number of bytes in the second argument. 169TAKEN_FROM_ARGUMENT1 = -2 # num bytes is 1-byte unsigned int 170TAKEN_FROM_ARGUMENT4 = -3 # num bytes is 4-byte signed little-endian int 171TAKEN_FROM_ARGUMENT4U = -4 # num bytes is 4-byte unsigned little-endian int 172 173class ArgumentDescriptor(object): 174 __slots__ = ( 175 # name of descriptor record, also a module global name; a string 176 'name', 177 178 # length of argument, in bytes; an int; UP_TO_NEWLINE and 179 # TAKEN_FROM_ARGUMENT{1,4} are negative values for variable-length 180 # cases 181 'n', 182 183 # a function taking a file-like object, reading this kind of argument 184 # from the object at the current position, advancing the current 185 # position by n bytes, and returning the value of the argument 186 'reader', 187 188 # human-readable docs for this arg descriptor; a string 189 'doc', 190 ) 191 192 def __init__(self, name, n, reader, doc): 193 assert isinstance(name, str) 194 self.name = name 195 196 assert isinstance(n, int) and (n >= 0 or 197 n in (UP_TO_NEWLINE, 198 TAKEN_FROM_ARGUMENT1, 199 TAKEN_FROM_ARGUMENT4, 200 TAKEN_FROM_ARGUMENT4U)) 201 self.n = n 202 203 self.reader = reader 204 205 assert isinstance(doc, str) 206 self.doc = doc 207 208from struct import unpack as _unpack 209 210def read_uint1(f): 211 r""" 212 >>> import io 213 >>> read_uint1(io.BytesIO(b'\xff')) 214 255 215 """ 216 217 data = f.read(1) 218 if data: 219 return data[0] 220 raise ValueError("not enough data in stream to read uint1") 221 222uint1 = ArgumentDescriptor( 223 name='uint1', 224 n=1, 225 reader=read_uint1, 226 doc="One-byte unsigned integer.") 227 228 229def read_uint2(f): 230 r""" 231 >>> import io 232 >>> read_uint2(io.BytesIO(b'\xff\x00')) 233 255 234 >>> read_uint2(io.BytesIO(b'\xff\xff')) 235 65535 236 """ 237 238 data = f.read(2) 239 if len(data) == 2: 240 return _unpack("<H", data)[0] 241 raise ValueError("not enough data in stream to read uint2") 242 243uint2 = ArgumentDescriptor( 244 name='uint2', 245 n=2, 246 reader=read_uint2, 247 doc="Two-byte unsigned integer, little-endian.") 248 249 250def read_int4(f): 251 r""" 252 >>> import io 253 >>> read_int4(io.BytesIO(b'\xff\x00\x00\x00')) 254 255 255 >>> read_int4(io.BytesIO(b'\x00\x00\x00\x80')) == -(2**31) 256 True 257 """ 258 259 data = f.read(4) 260 if len(data) == 4: 261 return _unpack("<i", data)[0] 262 raise ValueError("not enough data in stream to read int4") 263 264int4 = ArgumentDescriptor( 265 name='int4', 266 n=4, 267 reader=read_int4, 268 doc="Four-byte signed integer, little-endian, 2's complement.") 269 270 271def read_uint4(f): 272 r""" 273 >>> import io 274 >>> read_uint4(io.BytesIO(b'\xff\x00\x00\x00')) 275 255 276 >>> read_uint4(io.BytesIO(b'\x00\x00\x00\x80')) == 2**31 277 True 278 """ 279 280 data = f.read(4) 281 if len(data) == 4: 282 return _unpack("<I", data)[0] 283 raise ValueError("not enough data in stream to read uint4") 284 285uint4 = ArgumentDescriptor( 286 name='uint4', 287 n=4, 288 reader=read_uint4, 289 doc="Four-byte unsigned integer, little-endian.") 290 291 292def read_stringnl(f, decode=True, stripquotes=True): 293 r""" 294 >>> import io 295 >>> read_stringnl(io.BytesIO(b"'abcd'\nefg\n")) 296 'abcd' 297 298 >>> read_stringnl(io.BytesIO(b"\n")) 299 Traceback (most recent call last): 300 ... 301 ValueError: no string quotes around b'' 302 303 >>> read_stringnl(io.BytesIO(b"\n"), stripquotes=False) 304 '' 305 306 >>> read_stringnl(io.BytesIO(b"''\n")) 307 '' 308 309 >>> read_stringnl(io.BytesIO(b'"abcd"')) 310 Traceback (most recent call last): 311 ... 312 ValueError: no newline found when trying to read stringnl 313 314 Embedded escapes are undone in the result. 315 >>> read_stringnl(io.BytesIO(br"'a\n\\b\x00c\td'" + b"\n'e'")) 316 'a\n\\b\x00c\td' 317 """ 318 319 data = f.readline() 320 if not data.endswith(b'\n'): 321 raise ValueError("no newline found when trying to read stringnl") 322 data = data[:-1] # lose the newline 323 324 if stripquotes: 325 for q in (b'"', b"'"): 326 if data.startswith(q): 327 if not data.endswith(q): 328 raise ValueError("strinq quote %r not found at both " 329 "ends of %r" % (q, data)) 330 data = data[1:-1] 331 break 332 else: 333 raise ValueError("no string quotes around %r" % data) 334 335 if decode: 336 data = codecs.escape_decode(data)[0].decode("ascii") 337 return data 338 339stringnl = ArgumentDescriptor( 340 name='stringnl', 341 n=UP_TO_NEWLINE, 342 reader=read_stringnl, 343 doc="""A newline-terminated string. 344 345 This is a repr-style string, with embedded escapes, and 346 bracketing quotes. 347 """) 348 349def read_stringnl_noescape(f): 350 return read_stringnl(f, stripquotes=False) 351 352stringnl_noescape = ArgumentDescriptor( 353 name='stringnl_noescape', 354 n=UP_TO_NEWLINE, 355 reader=read_stringnl_noescape, 356 doc="""A newline-terminated string. 357 358 This is a str-style string, without embedded escapes, 359 or bracketing quotes. It should consist solely of 360 printable ASCII characters. 361 """) 362 363def read_stringnl_noescape_pair(f): 364 r""" 365 >>> import io 366 >>> read_stringnl_noescape_pair(io.BytesIO(b"Queue\nEmpty\njunk")) 367 'Queue Empty' 368 """ 369 370 return "%s %s" % (read_stringnl_noescape(f), read_stringnl_noescape(f)) 371 372stringnl_noescape_pair = ArgumentDescriptor( 373 name='stringnl_noescape_pair', 374 n=UP_TO_NEWLINE, 375 reader=read_stringnl_noescape_pair, 376 doc="""A pair of newline-terminated strings. 377 378 These are str-style strings, without embedded 379 escapes, or bracketing quotes. They should 380 consist solely of printable ASCII characters. 381 The pair is returned as a single string, with 382 a single blank separating the two strings. 383 """) 384 385def read_string4(f): 386 r""" 387 >>> import io 388 >>> read_string4(io.BytesIO(b"\x00\x00\x00\x00abc")) 389 '' 390 >>> read_string4(io.BytesIO(b"\x03\x00\x00\x00abcdef")) 391 'abc' 392 >>> read_string4(io.BytesIO(b"\x00\x00\x00\x03abcdef")) 393 Traceback (most recent call last): 394 ... 395 ValueError: expected 50331648 bytes in a string4, but only 6 remain 396 """ 397 398 n = read_int4(f) 399 if n < 0: 400 raise ValueError("string4 byte count < 0: %d" % n) 401 data = f.read(n) 402 if len(data) == n: 403 return data.decode("latin-1") 404 raise ValueError("expected %d bytes in a string4, but only %d remain" % 405 (n, len(data))) 406 407string4 = ArgumentDescriptor( 408 name="string4", 409 n=TAKEN_FROM_ARGUMENT4, 410 reader=read_string4, 411 doc="""A counted string. 412 413 The first argument is a 4-byte little-endian signed int giving 414 the number of bytes in the string, and the second argument is 415 that many bytes. 416 """) 417 418 419def read_string1(f): 420 r""" 421 >>> import io 422 >>> read_string1(io.BytesIO(b"\x00")) 423 '' 424 >>> read_string1(io.BytesIO(b"\x03abcdef")) 425 'abc' 426 """ 427 428 n = read_uint1(f) 429 assert n >= 0 430 data = f.read(n) 431 if len(data) == n: 432 return data.decode("latin-1") 433 raise ValueError("expected %d bytes in a string1, but only %d remain" % 434 (n, len(data))) 435 436string1 = ArgumentDescriptor( 437 name="string1", 438 n=TAKEN_FROM_ARGUMENT1, 439 reader=read_string1, 440 doc="""A counted string. 441 442 The first argument is a 1-byte unsigned int giving the number 443 of bytes in the string, and the second argument is that many 444 bytes. 445 """) 446 447 448def read_bytes1(f): 449 r""" 450 >>> import io 451 >>> read_bytes1(io.BytesIO(b"\x00")) 452 b'' 453 >>> read_bytes1(io.BytesIO(b"\x03abcdef")) 454 b'abc' 455 """ 456 457 n = read_uint1(f) 458 assert n >= 0 459 data = f.read(n) 460 if len(data) == n: 461 return data 462 raise ValueError("expected %d bytes in a bytes1, but only %d remain" % 463 (n, len(data))) 464 465bytes1 = ArgumentDescriptor( 466 name="bytes1", 467 n=TAKEN_FROM_ARGUMENT1, 468 reader=read_bytes1, 469 doc="""A counted bytes string. 470 471 The first argument is a 1-byte unsigned int giving the number 472 of bytes, and the second argument is that many bytes. 473 """) 474 475 476def read_bytes4(f): 477 r""" 478 >>> import io 479 >>> read_bytes4(io.BytesIO(b"\x00\x00\x00\x00abc")) 480 b'' 481 >>> read_bytes4(io.BytesIO(b"\x03\x00\x00\x00abcdef")) 482 b'abc' 483 >>> read_bytes4(io.BytesIO(b"\x00\x00\x00\x03abcdef")) 484 Traceback (most recent call last): 485 ... 486 ValueError: expected 50331648 bytes in a bytes4, but only 6 remain 487 """ 488 489 n = read_uint4(f) 490 if n > sys.maxsize: 491 raise ValueError("bytes4 byte count > sys.maxsize: %d" % n) 492 data = f.read(n) 493 if len(data) == n: 494 return data 495 raise ValueError("expected %d bytes in a bytes4, but only %d remain" % 496 (n, len(data))) 497 498bytes4 = ArgumentDescriptor( 499 name="bytes4", 500 n=TAKEN_FROM_ARGUMENT4U, 501 reader=read_bytes4, 502 doc="""A counted bytes string. 503 504 The first argument is a 4-byte little-endian unsigned int giving 505 the number of bytes, and the second argument is that many bytes. 506 """) 507 508 509def read_unicodestringnl(f): 510 r""" 511 >>> import io 512 >>> read_unicodestringnl(io.BytesIO(b"abc\\uabcd\njunk")) == 'abc\uabcd' 513 True 514 """ 515 516 data = f.readline() 517 if not data.endswith(b'\n'): 518 raise ValueError("no newline found when trying to read " 519 "unicodestringnl") 520 data = data[:-1] # lose the newline 521 return str(data, 'raw-unicode-escape') 522 523unicodestringnl = ArgumentDescriptor( 524 name='unicodestringnl', 525 n=UP_TO_NEWLINE, 526 reader=read_unicodestringnl, 527 doc="""A newline-terminated Unicode string. 528 529 This is raw-unicode-escape encoded, so consists of 530 printable ASCII characters, and may contain embedded 531 escape sequences. 532 """) 533 534def read_unicodestring4(f): 535 r""" 536 >>> import io 537 >>> s = 'abcd\uabcd' 538 >>> enc = s.encode('utf-8') 539 >>> enc 540 b'abcd\xea\xaf\x8d' 541 >>> n = bytes([len(enc), 0, 0, 0]) # little-endian 4-byte length 542 >>> t = read_unicodestring4(io.BytesIO(n + enc + b'junk')) 543 >>> s == t 544 True 545 546 >>> read_unicodestring4(io.BytesIO(n + enc[:-1])) 547 Traceback (most recent call last): 548 ... 549 ValueError: expected 7 bytes in a unicodestring4, but only 6 remain 550 """ 551 552 n = read_uint4(f) 553 if n > sys.maxsize: 554 raise ValueError("unicodestring4 byte count > sys.maxsize: %d" % n) 555 data = f.read(n) 556 if len(data) == n: 557 return str(data, 'utf-8', 'surrogatepass') 558 raise ValueError("expected %d bytes in a unicodestring4, but only %d " 559 "remain" % (n, len(data))) 560 561unicodestring4 = ArgumentDescriptor( 562 name="unicodestring4", 563 n=TAKEN_FROM_ARGUMENT4U, 564 reader=read_unicodestring4, 565 doc="""A counted Unicode string. 566 567 The first argument is a 4-byte little-endian signed int 568 giving the number of bytes in the string, and the second 569 argument-- the UTF-8 encoding of the Unicode string -- 570 contains that many bytes. 571 """) 572 573 574def read_decimalnl_short(f): 575 r""" 576 >>> import io 577 >>> read_decimalnl_short(io.BytesIO(b"1234\n56")) 578 1234 579 580 >>> read_decimalnl_short(io.BytesIO(b"1234L\n56")) 581 Traceback (most recent call last): 582 ... 583 ValueError: invalid literal for int() with base 10: b'1234L' 584 """ 585 586 s = read_stringnl(f, decode=False, stripquotes=False) 587 588 # There's a hack for True and False here. 589 if s == b"00": 590 return False 591 elif s == b"01": 592 return True 593 594 return int(s) 595 596def read_decimalnl_long(f): 597 r""" 598 >>> import io 599 600 >>> read_decimalnl_long(io.BytesIO(b"1234L\n56")) 601 1234 602 603 >>> read_decimalnl_long(io.BytesIO(b"123456789012345678901234L\n6")) 604 123456789012345678901234 605 """ 606 607 s = read_stringnl(f, decode=False, stripquotes=False) 608 if s[-1:] == b'L': 609 s = s[:-1] 610 return int(s) 611 612 613decimalnl_short = ArgumentDescriptor( 614 name='decimalnl_short', 615 n=UP_TO_NEWLINE, 616 reader=read_decimalnl_short, 617 doc="""A newline-terminated decimal integer literal. 618 619 This never has a trailing 'L', and the integer fit 620 in a short Python int on the box where the pickle 621 was written -- but there's no guarantee it will fit 622 in a short Python int on the box where the pickle 623 is read. 624 """) 625 626decimalnl_long = ArgumentDescriptor( 627 name='decimalnl_long', 628 n=UP_TO_NEWLINE, 629 reader=read_decimalnl_long, 630 doc="""A newline-terminated decimal integer literal. 631 632 This has a trailing 'L', and can represent integers 633 of any size. 634 """) 635 636 637def read_floatnl(f): 638 r""" 639 >>> import io 640 >>> read_floatnl(io.BytesIO(b"-1.25\n6")) 641 -1.25 642 """ 643 s = read_stringnl(f, decode=False, stripquotes=False) 644 return float(s) 645 646floatnl = ArgumentDescriptor( 647 name='floatnl', 648 n=UP_TO_NEWLINE, 649 reader=read_floatnl, 650 doc="""A newline-terminated decimal floating literal. 651 652 In general this requires 17 significant digits for roundtrip 653 identity, and pickling then unpickling infinities, NaNs, and 654 minus zero doesn't work across boxes, or on some boxes even 655 on itself (e.g., Windows can't read the strings it produces 656 for infinities or NaNs). 657 """) 658 659def read_float8(f): 660 r""" 661 >>> import io, struct 662 >>> raw = struct.pack(">d", -1.25) 663 >>> raw 664 b'\xbf\xf4\x00\x00\x00\x00\x00\x00' 665 >>> read_float8(io.BytesIO(raw + b"\n")) 666 -1.25 667 """ 668 669 data = f.read(8) 670 if len(data) == 8: 671 return _unpack(">d", data)[0] 672 raise ValueError("not enough data in stream to read float8") 673 674 675float8 = ArgumentDescriptor( 676 name='float8', 677 n=8, 678 reader=read_float8, 679 doc="""An 8-byte binary representation of a float, big-endian. 680 681 The format is unique to Python, and shared with the struct 682 module (format string '>d') "in theory" (the struct and pickle 683 implementations don't share the code -- they should). It's 684 strongly related to the IEEE-754 double format, and, in normal 685 cases, is in fact identical to the big-endian 754 double format. 686 On other boxes the dynamic range is limited to that of a 754 687 double, and "add a half and chop" rounding is used to reduce 688 the precision to 53 bits. However, even on a 754 box, 689 infinities, NaNs, and minus zero may not be handled correctly 690 (may not survive roundtrip pickling intact). 691 """) 692 693# Protocol 2 formats 694 695from pickle import decode_long 696 697def read_long1(f): 698 r""" 699 >>> import io 700 >>> read_long1(io.BytesIO(b"\x00")) 701 0 702 >>> read_long1(io.BytesIO(b"\x02\xff\x00")) 703 255 704 >>> read_long1(io.BytesIO(b"\x02\xff\x7f")) 705 32767 706 >>> read_long1(io.BytesIO(b"\x02\x00\xff")) 707 -256 708 >>> read_long1(io.BytesIO(b"\x02\x00\x80")) 709 -32768 710 """ 711 712 n = read_uint1(f) 713 data = f.read(n) 714 if len(data) != n: 715 raise ValueError("not enough data in stream to read long1") 716 return decode_long(data) 717 718long1 = ArgumentDescriptor( 719 name="long1", 720 n=TAKEN_FROM_ARGUMENT1, 721 reader=read_long1, 722 doc="""A binary long, little-endian, using 1-byte size. 723 724 This first reads one byte as an unsigned size, then reads that 725 many bytes and interprets them as a little-endian 2's-complement long. 726 If the size is 0, that's taken as a shortcut for the long 0L. 727 """) 728 729def read_long4(f): 730 r""" 731 >>> import io 732 >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\xff\x00")) 733 255 734 >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\xff\x7f")) 735 32767 736 >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\x00\xff")) 737 -256 738 >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\x00\x80")) 739 -32768 740 >>> read_long1(io.BytesIO(b"\x00\x00\x00\x00")) 741 0 742 """ 743 744 n = read_int4(f) 745 if n < 0: 746 raise ValueError("long4 byte count < 0: %d" % n) 747 data = f.read(n) 748 if len(data) != n: 749 raise ValueError("not enough data in stream to read long4") 750 return decode_long(data) 751 752long4 = ArgumentDescriptor( 753 name="long4", 754 n=TAKEN_FROM_ARGUMENT4, 755 reader=read_long4, 756 doc="""A binary representation of a long, little-endian. 757 758 This first reads four bytes as a signed size (but requires the 759 size to be >= 0), then reads that many bytes and interprets them 760 as a little-endian 2's-complement long. If the size is 0, that's taken 761 as a shortcut for the int 0, although LONG1 should really be used 762 then instead (and in any case where # of bytes < 256). 763 """) 764 765 766############################################################################## 767# Object descriptors. The stack used by the pickle machine holds objects, 768# and in the stack_before and stack_after attributes of OpcodeInfo 769# descriptors we need names to describe the various types of objects that can 770# appear on the stack. 771 772class StackObject(object): 773 __slots__ = ( 774 # name of descriptor record, for info only 775 'name', 776 777 # type of object, or tuple of type objects (meaning the object can 778 # be of any type in the tuple) 779 'obtype', 780 781 # human-readable docs for this kind of stack object; a string 782 'doc', 783 ) 784 785 def __init__(self, name, obtype, doc): 786 assert isinstance(name, str) 787 self.name = name 788 789 assert isinstance(obtype, type) or isinstance(obtype, tuple) 790 if isinstance(obtype, tuple): 791 for contained in obtype: 792 assert isinstance(contained, type) 793 self.obtype = obtype 794 795 assert isinstance(doc, str) 796 self.doc = doc 797 798 def __repr__(self): 799 return self.name 800 801 802pyint = StackObject( 803 name='int', 804 obtype=int, 805 doc="A short (as opposed to long) Python integer object.") 806 807pylong = StackObject( 808 name='long', 809 obtype=int, 810 doc="A long (as opposed to short) Python integer object.") 811 812pyinteger_or_bool = StackObject( 813 name='int_or_bool', 814 obtype=(int, bool), 815 doc="A Python integer object (short or long), or " 816 "a Python bool.") 817 818pybool = StackObject( 819 name='bool', 820 obtype=(bool,), 821 doc="A Python bool object.") 822 823pyfloat = StackObject( 824 name='float', 825 obtype=float, 826 doc="A Python float object.") 827 828pystring = StackObject( 829 name='string', 830 obtype=bytes, 831 doc="A Python (8-bit) string object.") 832 833pybytes = StackObject( 834 name='bytes', 835 obtype=bytes, 836 doc="A Python bytes object.") 837 838pyunicode = StackObject( 839 name='str', 840 obtype=str, 841 doc="A Python (Unicode) string object.") 842 843pynone = StackObject( 844 name="None", 845 obtype=type(None), 846 doc="The Python None object.") 847 848pytuple = StackObject( 849 name="tuple", 850 obtype=tuple, 851 doc="A Python tuple object.") 852 853pylist = StackObject( 854 name="list", 855 obtype=list, 856 doc="A Python list object.") 857 858pydict = StackObject( 859 name="dict", 860 obtype=dict, 861 doc="A Python dict object.") 862 863anyobject = StackObject( 864 name='any', 865 obtype=object, 866 doc="Any kind of object whatsoever.") 867 868markobject = StackObject( 869 name="mark", 870 obtype=StackObject, 871 doc="""'The mark' is a unique object. 872 873 Opcodes that operate on a variable number of objects 874 generally don't embed the count of objects in the opcode, 875 or pull it off the stack. Instead the MARK opcode is used 876 to push a special marker object on the stack, and then 877 some other opcodes grab all the objects from the top of 878 the stack down to (but not including) the topmost marker 879 object. 880 """) 881 882stackslice = StackObject( 883 name="stackslice", 884 obtype=StackObject, 885 doc="""An object representing a contiguous slice of the stack. 886 887 This is used in conjunction with markobject, to represent all 888 of the stack following the topmost markobject. For example, 889 the POP_MARK opcode changes the stack from 890 891 [..., markobject, stackslice] 892 to 893 [...] 894 895 No matter how many object are on the stack after the topmost 896 markobject, POP_MARK gets rid of all of them (including the 897 topmost markobject too). 898 """) 899 900############################################################################## 901# Descriptors for pickle opcodes. 902 903class OpcodeInfo(object): 904 905 __slots__ = ( 906 # symbolic name of opcode; a string 907 'name', 908 909 # the code used in a bytestream to represent the opcode; a 910 # one-character string 911 'code', 912 913 # If the opcode has an argument embedded in the byte string, an 914 # instance of ArgumentDescriptor specifying its type. Note that 915 # arg.reader(s) can be used to read and decode the argument from 916 # the bytestream s, and arg.doc documents the format of the raw 917 # argument bytes. If the opcode doesn't have an argument embedded 918 # in the bytestream, arg should be None. 919 'arg', 920 921 # what the stack looks like before this opcode runs; a list 922 'stack_before', 923 924 # what the stack looks like after this opcode runs; a list 925 'stack_after', 926 927 # the protocol number in which this opcode was introduced; an int 928 'proto', 929 930 # human-readable docs for this opcode; a string 931 'doc', 932 ) 933 934 def __init__(self, name, code, arg, 935 stack_before, stack_after, proto, doc): 936 assert isinstance(name, str) 937 self.name = name 938 939 assert isinstance(code, str) 940 assert len(code) == 1 941 self.code = code 942 943 assert arg is None or isinstance(arg, ArgumentDescriptor) 944 self.arg = arg 945 946 assert isinstance(stack_before, list) 947 for x in stack_before: 948 assert isinstance(x, StackObject) 949 self.stack_before = stack_before 950 951 assert isinstance(stack_after, list) 952 for x in stack_after: 953 assert isinstance(x, StackObject) 954 self.stack_after = stack_after 955 956 assert isinstance(proto, int) and 0 <= proto <= pickle.HIGHEST_PROTOCOL 957 self.proto = proto 958 959 assert isinstance(doc, str) 960 self.doc = doc 961 962I = OpcodeInfo 963opcodes = [ 964 965 # Ways to spell integers. 966 967 I(name='INT', 968 code='I', 969 arg=decimalnl_short, 970 stack_before=[], 971 stack_after=[pyinteger_or_bool], 972 proto=0, 973 doc="""Push an integer or bool. 974 975 The argument is a newline-terminated decimal literal string. 976 977 The intent may have been that this always fit in a short Python int, 978 but INT can be generated in pickles written on a 64-bit box that 979 require a Python long on a 32-bit box. The difference between this 980 and LONG then is that INT skips a trailing 'L', and produces a short 981 int whenever possible. 982 983 Another difference is due to that, when bool was introduced as a 984 distinct type in 2.3, builtin names True and False were also added to 985 2.2.2, mapping to ints 1 and 0. For compatibility in both directions, 986 True gets pickled as INT + "I01\\n", and False as INT + "I00\\n". 987 Leading zeroes are never produced for a genuine integer. The 2.3 988 (and later) unpicklers special-case these and return bool instead; 989 earlier unpicklers ignore the leading "0" and return the int. 990 """), 991 992 I(name='BININT', 993 code='J', 994 arg=int4, 995 stack_before=[], 996 stack_after=[pyint], 997 proto=1, 998 doc="""Push a four-byte signed integer. 999 1000 This handles the full range of Python (short) integers on a 32-bit 1001 box, directly as binary bytes (1 for the opcode and 4 for the integer). 1002 If the integer is non-negative and fits in 1 or 2 bytes, pickling via 1003 BININT1 or BININT2 saves space. 1004 """), 1005 1006 I(name='BININT1', 1007 code='K', 1008 arg=uint1, 1009 stack_before=[], 1010 stack_after=[pyint], 1011 proto=1, 1012 doc="""Push a one-byte unsigned integer. 1013 1014 This is a space optimization for pickling very small non-negative ints, 1015 in range(256). 1016 """), 1017 1018 I(name='BININT2', 1019 code='M', 1020 arg=uint2, 1021 stack_before=[], 1022 stack_after=[pyint], 1023 proto=1, 1024 doc="""Push a two-byte unsigned integer. 1025 1026 This is a space optimization for pickling small positive ints, in 1027 range(256, 2**16). Integers in range(256) can also be pickled via 1028 BININT2, but BININT1 instead saves a byte. 1029 """), 1030 1031 I(name='LONG', 1032 code='L', 1033 arg=decimalnl_long, 1034 stack_before=[], 1035 stack_after=[pylong], 1036 proto=0, 1037 doc="""Push a long integer. 1038 1039 The same as INT, except that the literal ends with 'L', and always 1040 unpickles to a Python long. There doesn't seem a real purpose to the 1041 trailing 'L'. 1042 1043 Note that LONG takes time quadratic in the number of digits when 1044 unpickling (this is simply due to the nature of decimal->binary 1045 conversion). Proto 2 added linear-time (in C; still quadratic-time 1046 in Python) LONG1 and LONG4 opcodes. 1047 """), 1048 1049 I(name="LONG1", 1050 code='\x8a', 1051 arg=long1, 1052 stack_before=[], 1053 stack_after=[pylong], 1054 proto=2, 1055 doc="""Long integer using one-byte length. 1056 1057 A more efficient encoding of a Python long; the long1 encoding 1058 says it all."""), 1059 1060 I(name="LONG4", 1061 code='\x8b', 1062 arg=long4, 1063 stack_before=[], 1064 stack_after=[pylong], 1065 proto=2, 1066 doc="""Long integer using found-byte length. 1067 1068 A more efficient encoding of a Python long; the long4 encoding 1069 says it all."""), 1070 1071 # Ways to spell strings (8-bit, not Unicode). 1072 1073 I(name='STRING', 1074 code='S', 1075 arg=stringnl, 1076 stack_before=[], 1077 stack_after=[pystring], 1078 proto=0, 1079 doc="""Push a Python string object. 1080 1081 The argument is a repr-style string, with bracketing quote characters, 1082 and perhaps embedded escapes. The argument extends until the next 1083 newline character. (Actually, they are decoded into a str instance 1084 using the encoding given to the Unpickler constructor. or the default, 1085 'ASCII'.) 1086 """), 1087 1088 I(name='BINSTRING', 1089 code='T', 1090 arg=string4, 1091 stack_before=[], 1092 stack_after=[pystring], 1093 proto=1, 1094 doc="""Push a Python string object. 1095 1096 There are two arguments: the first is a 4-byte little-endian signed int 1097 giving the number of bytes in the string, and the second is that many 1098 bytes, which are taken literally as the string content. (Actually, 1099 they are decoded into a str instance using the encoding given to the 1100 Unpickler constructor. or the default, 'ASCII'.) 1101 """), 1102 1103 I(name='SHORT_BINSTRING', 1104 code='U', 1105 arg=string1, 1106 stack_before=[], 1107 stack_after=[pystring], 1108 proto=1, 1109 doc="""Push a Python string object. 1110 1111 There are two arguments: the first is a 1-byte unsigned int giving 1112 the number of bytes in the string, and the second is that many bytes, 1113 which are taken literally as the string content. (Actually, they 1114 are decoded into a str instance using the encoding given to the 1115 Unpickler constructor. or the default, 'ASCII'.) 1116 """), 1117 1118 # Bytes (protocol 3 only; older protocols don't support bytes at all) 1119 1120 I(name='BINBYTES', 1121 code='B', 1122 arg=bytes4, 1123 stack_before=[], 1124 stack_after=[pybytes], 1125 proto=3, 1126 doc="""Push a Python bytes object. 1127 1128 There are two arguments: the first is a 4-byte little-endian unsigned int 1129 giving the number of bytes, and the second is that many bytes, which are 1130 taken literally as the bytes content. 1131 """), 1132 1133 I(name='SHORT_BINBYTES', 1134 code='C', 1135 arg=bytes1, 1136 stack_before=[], 1137 stack_after=[pybytes], 1138 proto=3, 1139 doc="""Push a Python bytes object. 1140 1141 There are two arguments: the first is a 1-byte unsigned int giving 1142 the number of bytes, and the second is that many bytes, which are taken 1143 literally as the string content. 1144 """), 1145 1146 # Ways to spell None. 1147 1148 I(name='NONE', 1149 code='N', 1150 arg=None, 1151 stack_before=[], 1152 stack_after=[pynone], 1153 proto=0, 1154 doc="Push None on the stack."), 1155 1156 # Ways to spell bools, starting with proto 2. See INT for how this was 1157 # done before proto 2. 1158 1159 I(name='NEWTRUE', 1160 code='\x88', 1161 arg=None, 1162 stack_before=[], 1163 stack_after=[pybool], 1164 proto=2, 1165 doc="""True. 1166 1167 Push True onto the stack."""), 1168 1169 I(name='NEWFALSE', 1170 code='\x89', 1171 arg=None, 1172 stack_before=[], 1173 stack_after=[pybool], 1174 proto=2, 1175 doc="""True. 1176 1177 Push False onto the stack."""), 1178 1179 # Ways to spell Unicode strings. 1180 1181 I(name='UNICODE', 1182 code='V', 1183 arg=unicodestringnl, 1184 stack_before=[], 1185 stack_after=[pyunicode], 1186 proto=0, # this may be pure-text, but it's a later addition 1187 doc="""Push a Python Unicode string object. 1188 1189 The argument is a raw-unicode-escape encoding of a Unicode string, 1190 and so may contain embedded escape sequences. The argument extends 1191 until the next newline character. 1192 """), 1193 1194 I(name='BINUNICODE', 1195 code='X', 1196 arg=unicodestring4, 1197 stack_before=[], 1198 stack_after=[pyunicode], 1199 proto=1, 1200 doc="""Push a Python Unicode string object. 1201 1202 There are two arguments: the first is a 4-byte little-endian unsigned int 1203 giving the number of bytes in the string. The second is that many 1204 bytes, and is the UTF-8 encoding of the Unicode string. 1205 """), 1206 1207 # Ways to spell floats. 1208 1209 I(name='FLOAT', 1210 code='F', 1211 arg=floatnl, 1212 stack_before=[], 1213 stack_after=[pyfloat], 1214 proto=0, 1215 doc="""Newline-terminated decimal float literal. 1216 1217 The argument is repr(a_float), and in general requires 17 significant 1218 digits for roundtrip conversion to be an identity (this is so for 1219 IEEE-754 double precision values, which is what Python float maps to 1220 on most boxes). 1221 1222 In general, FLOAT cannot be used to transport infinities, NaNs, or 1223 minus zero across boxes (or even on a single box, if the platform C 1224 library can't read the strings it produces for such things -- Windows 1225 is like that), but may do less damage than BINFLOAT on boxes with 1226 greater precision or dynamic range than IEEE-754 double. 1227 """), 1228 1229 I(name='BINFLOAT', 1230 code='G', 1231 arg=float8, 1232 stack_before=[], 1233 stack_after=[pyfloat], 1234 proto=1, 1235 doc="""Float stored in binary form, with 8 bytes of data. 1236 1237 This generally requires less than half the space of FLOAT encoding. 1238 In general, BINFLOAT cannot be used to transport infinities, NaNs, or 1239 minus zero, raises an exception if the exponent exceeds the range of 1240 an IEEE-754 double, and retains no more than 53 bits of precision (if 1241 there are more than that, "add a half and chop" rounding is used to 1242 cut it back to 53 significant bits). 1243 """), 1244 1245 # Ways to build lists. 1246 1247 I(name='EMPTY_LIST', 1248 code=']', 1249 arg=None, 1250 stack_before=[], 1251 stack_after=[pylist], 1252 proto=1, 1253 doc="Push an empty list."), 1254 1255 I(name='APPEND', 1256 code='a', 1257 arg=None, 1258 stack_before=[pylist, anyobject], 1259 stack_after=[pylist], 1260 proto=0, 1261 doc="""Append an object to a list. 1262 1263 Stack before: ... pylist anyobject 1264 Stack after: ... pylist+[anyobject] 1265 1266 although pylist is really extended in-place. 1267 """), 1268 1269 I(name='APPENDS', 1270 code='e', 1271 arg=None, 1272 stack_before=[pylist, markobject, stackslice], 1273 stack_after=[pylist], 1274 proto=1, 1275 doc="""Extend a list by a slice of stack objects. 1276 1277 Stack before: ... pylist markobject stackslice 1278 Stack after: ... pylist+stackslice 1279 1280 although pylist is really extended in-place. 1281 """), 1282 1283 I(name='LIST', 1284 code='l', 1285 arg=None, 1286 stack_before=[markobject, stackslice], 1287 stack_after=[pylist], 1288 proto=0, 1289 doc="""Build a list out of the topmost stack slice, after markobject. 1290 1291 All the stack entries following the topmost markobject are placed into 1292 a single Python list, which single list object replaces all of the 1293 stack from the topmost markobject onward. For example, 1294 1295 Stack before: ... markobject 1 2 3 'abc' 1296 Stack after: ... [1, 2, 3, 'abc'] 1297 """), 1298 1299 # Ways to build tuples. 1300 1301 I(name='EMPTY_TUPLE', 1302 code=')', 1303 arg=None, 1304 stack_before=[], 1305 stack_after=[pytuple], 1306 proto=1, 1307 doc="Push an empty tuple."), 1308 1309 I(name='TUPLE', 1310 code='t', 1311 arg=None, 1312 stack_before=[markobject, stackslice], 1313 stack_after=[pytuple], 1314 proto=0, 1315 doc="""Build a tuple out of the topmost stack slice, after markobject. 1316 1317 All the stack entries following the topmost markobject are placed into 1318 a single Python tuple, which single tuple object replaces all of the 1319 stack from the topmost markobject onward. For example, 1320 1321 Stack before: ... markobject 1 2 3 'abc' 1322 Stack after: ... (1, 2, 3, 'abc') 1323 """), 1324 1325 I(name='TUPLE1', 1326 code='\x85', 1327 arg=None, 1328 stack_before=[anyobject], 1329 stack_after=[pytuple], 1330 proto=2, 1331 doc="""Build a one-tuple out of the topmost item on the stack. 1332 1333 This code pops one value off the stack and pushes a tuple of 1334 length 1 whose one item is that value back onto it. In other 1335 words: 1336 1337 stack[-1] = tuple(stack[-1:]) 1338 """), 1339 1340 I(name='TUPLE2', 1341 code='\x86', 1342 arg=None, 1343 stack_before=[anyobject, anyobject], 1344 stack_after=[pytuple], 1345 proto=2, 1346 doc="""Build a two-tuple out of the top two items on the stack. 1347 1348 This code pops two values off the stack and pushes a tuple of 1349 length 2 whose items are those values back onto it. In other 1350 words: 1351 1352 stack[-2:] = [tuple(stack[-2:])] 1353 """), 1354 1355 I(name='TUPLE3', 1356 code='\x87', 1357 arg=None, 1358 stack_before=[anyobject, anyobject, anyobject], 1359 stack_after=[pytuple], 1360 proto=2, 1361 doc="""Build a three-tuple out of the top three items on the stack. 1362 1363 This code pops three values off the stack and pushes a tuple of 1364 length 3 whose items are those values back onto it. In other 1365 words: 1366 1367 stack[-3:] = [tuple(stack[-3:])] 1368 """), 1369 1370 # Ways to build dicts. 1371 1372 I(name='EMPTY_DICT', 1373 code='}', 1374 arg=None, 1375 stack_before=[], 1376 stack_after=[pydict], 1377 proto=1, 1378 doc="Push an empty dict."), 1379 1380 I(name='DICT', 1381 code='d', 1382 arg=None, 1383 stack_before=[markobject, stackslice], 1384 stack_after=[pydict], 1385 proto=0, 1386 doc="""Build a dict out of the topmost stack slice, after markobject. 1387 1388 All the stack entries following the topmost markobject are placed into 1389 a single Python dict, which single dict object replaces all of the 1390 stack from the topmost markobject onward. The stack slice alternates 1391 key, value, key, value, .... For example, 1392 1393 Stack before: ... markobject 1 2 3 'abc' 1394 Stack after: ... {1: 2, 3: 'abc'} 1395 """), 1396 1397 I(name='SETITEM', 1398 code='s', 1399 arg=None, 1400 stack_before=[pydict, anyobject, anyobject], 1401 stack_after=[pydict], 1402 proto=0, 1403 doc="""Add a key+value pair to an existing dict. 1404 1405 Stack before: ... pydict key value 1406 Stack after: ... pydict 1407 1408 where pydict has been modified via pydict[key] = value. 1409 """), 1410 1411 I(name='SETITEMS', 1412 code='u', 1413 arg=None, 1414 stack_before=[pydict, markobject, stackslice], 1415 stack_after=[pydict], 1416 proto=1, 1417 doc="""Add an arbitrary number of key+value pairs to an existing dict. 1418 1419 The slice of the stack following the topmost markobject is taken as 1420 an alternating sequence of keys and values, added to the dict 1421 immediately under the topmost markobject. Everything at and after the 1422 topmost markobject is popped, leaving the mutated dict at the top 1423 of the stack. 1424 1425 Stack before: ... pydict markobject key_1 value_1 ... key_n value_n 1426 Stack after: ... pydict 1427 1428 where pydict has been modified via pydict[key_i] = value_i for i in 1429 1, 2, ..., n, and in that order. 1430 """), 1431 1432 # Stack manipulation. 1433 1434 I(name='POP', 1435 code='0', 1436 arg=None, 1437 stack_before=[anyobject], 1438 stack_after=[], 1439 proto=0, 1440 doc="Discard the top stack item, shrinking the stack by one item."), 1441 1442 I(name='DUP', 1443 code='2', 1444 arg=None, 1445 stack_before=[anyobject], 1446 stack_after=[anyobject, anyobject], 1447 proto=0, 1448 doc="Push the top stack item onto the stack again, duplicating it."), 1449 1450 I(name='MARK', 1451 code='(', 1452 arg=None, 1453 stack_before=[], 1454 stack_after=[markobject], 1455 proto=0, 1456 doc="""Push markobject onto the stack. 1457 1458 markobject is a unique object, used by other opcodes to identify a 1459 region of the stack containing a variable number of objects for them 1460 to work on. See markobject.doc for more detail. 1461 """), 1462 1463 I(name='POP_MARK', 1464 code='1', 1465 arg=None, 1466 stack_before=[markobject, stackslice], 1467 stack_after=[], 1468 proto=1, 1469 doc="""Pop all the stack objects at and above the topmost markobject. 1470 1471 When an opcode using a variable number of stack objects is done, 1472 POP_MARK is used to remove those objects, and to remove the markobject 1473 that delimited their starting position on the stack. 1474 """), 1475 1476 # Memo manipulation. There are really only two operations (get and put), 1477 # each in all-text, "short binary", and "long binary" flavors. 1478 1479 I(name='GET', 1480 code='g', 1481 arg=decimalnl_short, 1482 stack_before=[], 1483 stack_after=[anyobject], 1484 proto=0, 1485 doc="""Read an object from the memo and push it on the stack. 1486 1487 The index of the memo object to push is given by the newline-terminated 1488 decimal string following. BINGET and LONG_BINGET are space-optimized 1489 versions. 1490 """), 1491 1492 I(name='BINGET', 1493 code='h', 1494 arg=uint1, 1495 stack_before=[], 1496 stack_after=[anyobject], 1497 proto=1, 1498 doc="""Read an object from the memo and push it on the stack. 1499 1500 The index of the memo object to push is given by the 1-byte unsigned 1501 integer following. 1502 """), 1503 1504 I(name='LONG_BINGET', 1505 code='j', 1506 arg=uint4, 1507 stack_before=[], 1508 stack_after=[anyobject], 1509 proto=1, 1510 doc="""Read an object from the memo and push it on the stack. 1511 1512 The index of the memo object to push is given by the 4-byte unsigned 1513 little-endian integer following. 1514 """), 1515 1516 I(name='PUT', 1517 code='p', 1518 arg=decimalnl_short, 1519 stack_before=[], 1520 stack_after=[], 1521 proto=0, 1522 doc="""Store the stack top into the memo. The stack is not popped. 1523 1524 The index of the memo location to write into is given by the newline- 1525 terminated decimal string following. BINPUT and LONG_BINPUT are 1526 space-optimized versions. 1527 """), 1528 1529 I(name='BINPUT', 1530 code='q', 1531 arg=uint1, 1532 stack_before=[], 1533 stack_after=[], 1534 proto=1, 1535 doc="""Store the stack top into the memo. The stack is not popped. 1536 1537 The index of the memo location to write into is given by the 1-byte 1538 unsigned integer following. 1539 """), 1540 1541 I(name='LONG_BINPUT', 1542 code='r', 1543 arg=uint4, 1544 stack_before=[], 1545 stack_after=[], 1546 proto=1, 1547 doc="""Store the stack top into the memo. The stack is not popped. 1548 1549 The index of the memo location to write into is given by the 4-byte 1550 unsigned little-endian integer following. 1551 """), 1552 1553 # Access the extension registry (predefined objects). Akin to the GET 1554 # family. 1555 1556 I(name='EXT1', 1557 code='\x82', 1558 arg=uint1, 1559 stack_before=[], 1560 stack_after=[anyobject], 1561 proto=2, 1562 doc="""Extension code. 1563 1564 This code and the similar EXT2 and EXT4 allow using a registry 1565 of popular objects that are pickled by name, typically classes. 1566 It is envisioned that through a global negotiation and 1567 registration process, third parties can set up a mapping between 1568 ints and object names. 1569 1570 In order to guarantee pickle interchangeability, the extension 1571 code registry ought to be global, although a range of codes may 1572 be reserved for private use. 1573 1574 EXT1 has a 1-byte integer argument. This is used to index into the 1575 extension registry, and the object at that index is pushed on the stack. 1576 """), 1577 1578 I(name='EXT2', 1579 code='\x83', 1580 arg=uint2, 1581 stack_before=[], 1582 stack_after=[anyobject], 1583 proto=2, 1584 doc="""Extension code. 1585 1586 See EXT1. EXT2 has a two-byte integer argument. 1587 """), 1588 1589 I(name='EXT4', 1590 code='\x84', 1591 arg=int4, 1592 stack_before=[], 1593 stack_after=[anyobject], 1594 proto=2, 1595 doc="""Extension code. 1596 1597 See EXT1. EXT4 has a four-byte integer argument. 1598 """), 1599 1600 # Push a class object, or module function, on the stack, via its module 1601 # and name. 1602 1603 I(name='GLOBAL', 1604 code='c', 1605 arg=stringnl_noescape_pair, 1606 stack_before=[], 1607 stack_after=[anyobject], 1608 proto=0, 1609 doc="""Push a global object (module.attr) on the stack. 1610 1611 Two newline-terminated strings follow the GLOBAL opcode. The first is 1612 taken as a module name, and the second as a class name. The class 1613 object module.class is pushed on the stack. More accurately, the 1614 object returned by self.find_class(module, class) is pushed on the 1615 stack, so unpickling subclasses can override this form of lookup. 1616 """), 1617 1618 # Ways to build objects of classes pickle doesn't know about directly 1619 # (user-defined classes). I despair of documenting this accurately 1620 # and comprehensibly -- you really have to read the pickle code to 1621 # find all the special cases. 1622 1623 I(name='REDUCE', 1624 code='R', 1625 arg=None, 1626 stack_before=[anyobject, anyobject], 1627 stack_after=[anyobject], 1628 proto=0, 1629 doc="""Push an object built from a callable and an argument tuple. 1630 1631 The opcode is named to remind of the __reduce__() method. 1632 1633 Stack before: ... callable pytuple 1634 Stack after: ... callable(*pytuple) 1635 1636 The callable and the argument tuple are the first two items returned 1637 by a __reduce__ method. Applying the callable to the argtuple is 1638 supposed to reproduce the original object, or at least get it started. 1639 If the __reduce__ method returns a 3-tuple, the last component is an 1640 argument to be passed to the object's __setstate__, and then the REDUCE 1641 opcode is followed by code to create setstate's argument, and then a 1642 BUILD opcode to apply __setstate__ to that argument. 1643 1644 If not isinstance(callable, type), REDUCE complains unless the 1645 callable has been registered with the copyreg module's 1646 safe_constructors dict, or the callable has a magic 1647 '__safe_for_unpickling__' attribute with a true value. I'm not sure 1648 why it does this, but I've sure seen this complaint often enough when 1649 I didn't want to <wink>. 1650 """), 1651 1652 I(name='BUILD', 1653 code='b', 1654 arg=None, 1655 stack_before=[anyobject, anyobject], 1656 stack_after=[anyobject], 1657 proto=0, 1658 doc="""Finish building an object, via __setstate__ or dict update. 1659 1660 Stack before: ... anyobject argument 1661 Stack after: ... anyobject 1662 1663 where anyobject may have been mutated, as follows: 1664 1665 If the object has a __setstate__ method, 1666 1667 anyobject.__setstate__(argument) 1668 1669 is called. 1670 1671 Else the argument must be a dict, the object must have a __dict__, and 1672 the object is updated via 1673 1674 anyobject.__dict__.update(argument) 1675 """), 1676 1677 I(name='INST', 1678 code='i', 1679 arg=stringnl_noescape_pair, 1680 stack_before=[markobject, stackslice], 1681 stack_after=[anyobject], 1682 proto=0, 1683 doc="""Build a class instance. 1684 1685 This is the protocol 0 version of protocol 1's OBJ opcode. 1686 INST is followed by two newline-terminated strings, giving a 1687 module and class name, just as for the GLOBAL opcode (and see 1688 GLOBAL for more details about that). self.find_class(module, name) 1689 is used to get a class object. 1690 1691 In addition, all the objects on the stack following the topmost 1692 markobject are gathered into a tuple and popped (along with the 1693 topmost markobject), just as for the TUPLE opcode. 1694 1695 Now it gets complicated. If all of these are true: 1696 1697 + The argtuple is empty (markobject was at the top of the stack 1698 at the start). 1699 1700 + The class object does not have a __getinitargs__ attribute. 1701 1702 then we want to create an old-style class instance without invoking 1703 its __init__() method (pickle has waffled on this over the years; not 1704 calling __init__() is current wisdom). In this case, an instance of 1705 an old-style dummy class is created, and then we try to rebind its 1706 __class__ attribute to the desired class object. If this succeeds, 1707 the new instance object is pushed on the stack, and we're done. 1708 1709 Else (the argtuple is not empty, it's not an old-style class object, 1710 or the class object does have a __getinitargs__ attribute), the code 1711 first insists that the class object have a __safe_for_unpickling__ 1712 attribute. Unlike as for the __safe_for_unpickling__ check in REDUCE, 1713 it doesn't matter whether this attribute has a true or false value, it 1714 only matters whether it exists (XXX this is a bug). If 1715 __safe_for_unpickling__ doesn't exist, UnpicklingError is raised. 1716 1717 Else (the class object does have a __safe_for_unpickling__ attr), 1718 the class object obtained from INST's arguments is applied to the 1719 argtuple obtained from the stack, and the resulting instance object 1720 is pushed on the stack. 1721 1722 NOTE: checks for __safe_for_unpickling__ went away in Python 2.3. 1723 NOTE: the distinction between old-style and new-style classes does 1724 not make sense in Python 3. 1725 """), 1726 1727 I(name='OBJ', 1728 code='o', 1729 arg=None, 1730 stack_before=[markobject, anyobject, stackslice], 1731 stack_after=[anyobject], 1732 proto=1, 1733 doc="""Build a class instance. 1734 1735 This is the protocol 1 version of protocol 0's INST opcode, and is 1736 very much like it. The major difference is that the class object 1737 is taken off the stack, allowing it to be retrieved from the memo 1738 repeatedly if several instances of the same class are created. This 1739 can be much more efficient (in both time and space) than repeatedly 1740 embedding the module and class names in INST opcodes. 1741 1742 Unlike INST, OBJ takes no arguments from the opcode stream. Instead 1743 the class object is taken off the stack, immediately above the 1744 topmost markobject: 1745 1746 Stack before: ... markobject classobject stackslice 1747 Stack after: ... new_instance_object 1748 1749 As for INST, the remainder of the stack above the markobject is 1750 gathered into an argument tuple, and then the logic seems identical, 1751 except that no __safe_for_unpickling__ check is done (XXX this is 1752 a bug). See INST for the gory details. 1753 1754 NOTE: In Python 2.3, INST and OBJ are identical except for how they 1755 get the class object. That was always the intent; the implementations 1756 had diverged for accidental reasons. 1757 """), 1758 1759 I(name='NEWOBJ', 1760 code='\x81', 1761 arg=None, 1762 stack_before=[anyobject, anyobject], 1763 stack_after=[anyobject], 1764 proto=2, 1765 doc="""Build an object instance. 1766 1767 The stack before should be thought of as containing a class 1768 object followed by an argument tuple (the tuple being the stack 1769 top). Call these cls and args. They are popped off the stack, 1770 and the value returned by cls.__new__(cls, *args) is pushed back 1771 onto the stack. 1772 """), 1773 1774 # Machine control. 1775 1776 I(name='PROTO', 1777 code='\x80', 1778 arg=uint1, 1779 stack_before=[], 1780 stack_after=[], 1781 proto=2, 1782 doc="""Protocol version indicator. 1783 1784 For protocol 2 and above, a pickle must start with this opcode. 1785 The argument is the protocol version, an int in range(2, 256). 1786 """), 1787 1788 I(name='STOP', 1789 code='.', 1790 arg=None, 1791 stack_before=[anyobject], 1792 stack_after=[], 1793 proto=0, 1794 doc="""Stop the unpickling machine. 1795 1796 Every pickle ends with this opcode. The object at the top of the stack 1797 is popped, and that's the result of unpickling. The stack should be 1798 empty then. 1799 """), 1800 1801 # Ways to deal with persistent IDs. 1802 1803 I(name='PERSID', 1804 code='P', 1805 arg=stringnl_noescape, 1806 stack_before=[], 1807 stack_after=[anyobject], 1808 proto=0, 1809 doc="""Push an object identified by a persistent ID. 1810 1811 The pickle module doesn't define what a persistent ID means. PERSID's 1812 argument is a newline-terminated str-style (no embedded escapes, no 1813 bracketing quote characters) string, which *is* "the persistent ID". 1814 The unpickler passes this string to self.persistent_load(). Whatever 1815 object that returns is pushed on the stack. There is no implementation 1816 of persistent_load() in Python's unpickler: it must be supplied by an 1817 unpickler subclass. 1818 """), 1819 1820 I(name='BINPERSID', 1821 code='Q', 1822 arg=None, 1823 stack_before=[anyobject], 1824 stack_after=[anyobject], 1825 proto=1, 1826 doc="""Push an object identified by a persistent ID. 1827 1828 Like PERSID, except the persistent ID is popped off the stack (instead 1829 of being a string embedded in the opcode bytestream). The persistent 1830 ID is passed to self.persistent_load(), and whatever object that 1831 returns is pushed on the stack. See PERSID for more detail. 1832 """), 1833] 1834del I 1835 1836# Verify uniqueness of .name and .code members. 1837name2i = {} 1838code2i = {} 1839 1840for i, d in enumerate(opcodes): 1841 if d.name in name2i: 1842 raise ValueError("repeated name %r at indices %d and %d" % 1843 (d.name, name2i[d.name], i)) 1844 if d.code in code2i: 1845 raise ValueError("repeated code %r at indices %d and %d" % 1846 (d.code, code2i[d.code], i)) 1847 1848 name2i[d.name] = i 1849 code2i[d.code] = i 1850 1851del name2i, code2i, i, d 1852 1853############################################################################## 1854# Build a code2op dict, mapping opcode characters to OpcodeInfo records. 1855# Also ensure we've got the same stuff as pickle.py, although the 1856# introspection here is dicey. 1857 1858code2op = {} 1859for d in opcodes: 1860 code2op[d.code] = d 1861del d 1862 1863def assure_pickle_consistency(verbose=False): 1864 1865 copy = code2op.copy() 1866 for name in pickle.__all__: 1867 if not re.match("[A-Z][A-Z0-9_]+$", name): 1868 if verbose: 1869 print("skipping %r: it doesn't look like an opcode name" % name) 1870 continue 1871 picklecode = getattr(pickle, name) 1872 if not isinstance(picklecode, bytes) or len(picklecode) != 1: 1873 if verbose: 1874 print(("skipping %r: value %r doesn't look like a pickle " 1875 "code" % (name, picklecode))) 1876 continue 1877 picklecode = picklecode.decode("latin-1") 1878 if picklecode in copy: 1879 if verbose: 1880 print("checking name %r w/ code %r for consistency" % ( 1881 name, picklecode)) 1882 d = copy[picklecode] 1883 if d.name != name: 1884 raise ValueError("for pickle code %r, pickle.py uses name %r " 1885 "but we're using name %r" % (picklecode, 1886 name, 1887 d.name)) 1888 # Forget this one. Any left over in copy at the end are a problem 1889 # of a different kind. 1890 del copy[picklecode] 1891 else: 1892 raise ValueError("pickle.py appears to have a pickle opcode with " 1893 "name %r and code %r, but we don't" % 1894 (name, picklecode)) 1895 if copy: 1896 msg = ["we appear to have pickle opcodes that pickle.py doesn't have:"] 1897 for code, d in copy.items(): 1898 msg.append(" name %r with code %r" % (d.name, code)) 1899 raise ValueError("\n".join(msg)) 1900 1901assure_pickle_consistency() 1902del assure_pickle_consistency 1903 1904############################################################################## 1905# A pickle opcode generator. 1906 1907def genops(pickle): 1908 """Generate all the opcodes in a pickle. 1909 1910 'pickle' is a file-like object, or string, containing the pickle. 1911 1912 Each opcode in the pickle is generated, from the current pickle position, 1913 stopping after a STOP opcode is delivered. A triple is generated for 1914 each opcode: 1915 1916 opcode, arg, pos 1917 1918 opcode is an OpcodeInfo record, describing the current opcode. 1919 1920 If the opcode has an argument embedded in the pickle, arg is its decoded 1921 value, as a Python object. If the opcode doesn't have an argument, arg 1922 is None. 1923 1924 If the pickle has a tell() method, pos was the value of pickle.tell() 1925 before reading the current opcode. If the pickle is a bytes object, 1926 it's wrapped in a BytesIO object, and the latter's tell() result is 1927 used. Else (the pickle doesn't have a tell(), and it's not obvious how 1928 to query its current position) pos is None. 1929 """ 1930 1931 if isinstance(pickle, bytes_types): 1932 import io 1933 pickle = io.BytesIO(pickle) 1934 1935 if hasattr(pickle, "tell"): 1936 getpos = pickle.tell 1937 else: 1938 getpos = lambda: None 1939 1940 while True: 1941 pos = getpos() 1942 code = pickle.read(1) 1943 opcode = code2op.get(code.decode("latin-1")) 1944 if opcode is None: 1945 if code == b"": 1946 raise ValueError("pickle exhausted before seeing STOP") 1947 else: 1948 raise ValueError("at position %s, opcode %r unknown" % ( 1949 pos is None and "<unknown>" or pos, 1950 code)) 1951 if opcode.arg is None: 1952 arg = None 1953 else: 1954 arg = opcode.arg.reader(pickle) 1955 yield opcode, arg, pos 1956 if code == b'.': 1957 assert opcode.name == 'STOP' 1958 break 1959 1960############################################################################## 1961# A pickle optimizer. 1962 1963def optimize(p): 1964 'Optimize a pickle string by removing unused PUT opcodes' 1965 gets = set() # set of args used by a GET opcode 1966 puts = [] # (arg, startpos, stoppos) for the PUT opcodes 1967 prevpos = None # set to pos if previous opcode was a PUT 1968 for opcode, arg, pos in genops(p): 1969 if prevpos is not None: 1970 puts.append((prevarg, prevpos, pos)) 1971 prevpos = None 1972 if 'PUT' in opcode.name: 1973 prevarg, prevpos = arg, pos 1974 elif 'GET' in opcode.name: 1975 gets.add(arg) 1976 1977 # Copy the pickle string except for PUTS without a corresponding GET 1978 s = [] 1979 i = 0 1980 for arg, start, stop in puts: 1981 j = stop if (arg in gets) else start 1982 s.append(p[i:j]) 1983 i = stop 1984 s.append(p[i:]) 1985 return b''.join(s) 1986 1987############################################################################## 1988# A symbolic pickle disassembler. 1989 1990def dis(pickle, out=None, memo=None, indentlevel=4, annotate=0): 1991 """Produce a symbolic disassembly of a pickle. 1992 1993 'pickle' is a file-like object, or string, containing a (at least one) 1994 pickle. The pickle is disassembled from the current position, through 1995 the first STOP opcode encountered. 1996 1997 Optional arg 'out' is a file-like object to which the disassembly is 1998 printed. It defaults to sys.stdout. 1999 2000 Optional arg 'memo' is a Python dict, used as the pickle's memo. It 2001 may be mutated by dis(), if the pickle contains PUT or BINPUT opcodes. 2002 Passing the same memo object to another dis() call then allows disassembly 2003 to proceed across multiple pickles that were all created by the same 2004 pickler with the same memo. Ordinarily you don't need to worry about this. 2005 2006 Optional arg 'indentlevel' is the number of blanks by which to indent 2007 a new MARK level. It defaults to 4. 2008 2009 Optional arg 'annotate' if nonzero instructs dis() to add short 2010 description of the opcode on each line of disassembled output. 2011 The value given to 'annotate' must be an integer and is used as a 2012 hint for the column where annotation should start. The default 2013 value is 0, meaning no annotations. 2014 2015 In addition to printing the disassembly, some sanity checks are made: 2016 2017 + All embedded opcode arguments "make sense". 2018 2019 + Explicit and implicit pop operations have enough items on the stack. 2020 2021 + When an opcode implicitly refers to a markobject, a markobject is 2022 actually on the stack. 2023 2024 + A memo entry isn't referenced before it's defined. 2025 2026 + The markobject isn't stored in the memo. 2027 2028 + A memo entry isn't redefined. 2029 """ 2030 2031 # Most of the hair here is for sanity checks, but most of it is needed 2032 # anyway to detect when a protocol 0 POP takes a MARK off the stack 2033 # (which in turn is needed to indent MARK blocks correctly). 2034 2035 stack = [] # crude emulation of unpickler stack 2036 if memo is None: 2037 memo = {} # crude emulation of unpickler memo 2038 maxproto = -1 # max protocol number seen 2039 markstack = [] # bytecode positions of MARK opcodes 2040 indentchunk = ' ' * indentlevel 2041 errormsg = None 2042 annocol = annotate # column hint for annotations 2043 for opcode, arg, pos in genops(pickle): 2044 if pos is not None: 2045 print("%5d:" % pos, end=' ', file=out) 2046 2047 line = "%-4s %s%s" % (repr(opcode.code)[1:-1], 2048 indentchunk * len(markstack), 2049 opcode.name) 2050 2051 maxproto = max(maxproto, opcode.proto) 2052 before = opcode.stack_before # don't mutate 2053 after = opcode.stack_after # don't mutate 2054 numtopop = len(before) 2055 2056 # See whether a MARK should be popped. 2057 markmsg = None 2058 if markobject in before or (opcode.name == "POP" and 2059 stack and 2060 stack[-1] is markobject): 2061 assert markobject not in after 2062 if __debug__: 2063 if markobject in before: 2064 assert before[-1] is stackslice 2065 if markstack: 2066 markpos = markstack.pop() 2067 if markpos is None: 2068 markmsg = "(MARK at unknown opcode offset)" 2069 else: 2070 markmsg = "(MARK at %d)" % markpos 2071 # Pop everything at and after the topmost markobject. 2072 while stack[-1] is not markobject: 2073 stack.pop() 2074 stack.pop() 2075 # Stop later code from popping too much. 2076 try: 2077 numtopop = before.index(markobject) 2078 except ValueError: 2079 assert opcode.name == "POP" 2080 numtopop = 0 2081 else: 2082 errormsg = markmsg = "no MARK exists on stack" 2083 2084 # Check for correct memo usage. 2085 if opcode.name in ("PUT", "BINPUT", "LONG_BINPUT"): 2086 assert arg is not None 2087 if arg in memo: 2088 errormsg = "memo key %r already defined" % arg 2089 elif not stack: 2090 errormsg = "stack is empty -- can't store into memo" 2091 elif stack[-1] is markobject: 2092 errormsg = "can't store markobject in the memo" 2093 else: 2094 memo[arg] = stack[-1] 2095 2096 elif opcode.name in ("GET", "BINGET", "LONG_BINGET"): 2097 if arg in memo: 2098 assert len(after) == 1 2099 after = [memo[arg]] # for better stack emulation 2100 else: 2101 errormsg = "memo key %r has never been stored into" % arg 2102 2103 if arg is not None or markmsg: 2104 # make a mild effort to align arguments 2105 line += ' ' * (10 - len(opcode.name)) 2106 if arg is not None: 2107 line += ' ' + repr(arg) 2108 if markmsg: 2109 line += ' ' + markmsg 2110 if annotate: 2111 line += ' ' * (annocol - len(line)) 2112 # make a mild effort to align annotations 2113 annocol = len(line) 2114 if annocol > 50: 2115 annocol = annotate 2116 line += ' ' + opcode.doc.split('\n', 1)[0] 2117 print(line, file=out) 2118 2119 if errormsg: 2120 # Note that we delayed complaining until the offending opcode 2121 # was printed. 2122 raise ValueError(errormsg) 2123 2124 # Emulate the stack effects. 2125 if len(stack) < numtopop: 2126 raise ValueError("tries to pop %d items from stack with " 2127 "only %d items" % (numtopop, len(stack))) 2128 if numtopop: 2129 del stack[-numtopop:] 2130 if markobject in after: 2131 assert markobject not in before 2132 markstack.append(pos) 2133 2134 stack.extend(after) 2135 2136 print("highest protocol among opcodes =", maxproto, file=out) 2137 if stack: 2138 raise ValueError("stack not empty after STOP: %r" % stack) 2139 2140# For use in the doctest, simply as an example of a class to pickle. 2141class _Example: 2142 def __init__(self, value): 2143 self.value = value 2144 2145_dis_test = r""" 2146>>> import pickle 2147>>> x = [1, 2, (3, 4), {b'abc': "def"}] 2148>>> pkl0 = pickle.dumps(x, 0) 2149>>> dis(pkl0) 2150 0: ( MARK 2151 1: l LIST (MARK at 0) 2152 2: p PUT 0 2153 5: L LONG 1 2154 9: a APPEND 2155 10: L LONG 2 2156 14: a APPEND 2157 15: ( MARK 2158 16: L LONG 3 2159 20: L LONG 4 2160 24: t TUPLE (MARK at 15) 2161 25: p PUT 1 2162 28: a APPEND 2163 29: ( MARK 2164 30: d DICT (MARK at 29) 2165 31: p PUT 2 2166 34: c GLOBAL '_codecs encode' 2167 50: p PUT 3 2168 53: ( MARK 2169 54: V UNICODE 'abc' 2170 59: p PUT 4 2171 62: V UNICODE 'latin1' 2172 70: p PUT 5 2173 73: t TUPLE (MARK at 53) 2174 74: p PUT 6 2175 77: R REDUCE 2176 78: p PUT 7 2177 81: V UNICODE 'def' 2178 86: p PUT 8 2179 89: s SETITEM 2180 90: a APPEND 2181 91: . STOP 2182highest protocol among opcodes = 0 2183 2184Try again with a "binary" pickle. 2185 2186>>> pkl1 = pickle.dumps(x, 1) 2187>>> dis(pkl1) 2188 0: ] EMPTY_LIST 2189 1: q BINPUT 0 2190 3: ( MARK 2191 4: K BININT1 1 2192 6: K BININT1 2 2193 8: ( MARK 2194 9: K BININT1 3 2195 11: K BININT1 4 2196 13: t TUPLE (MARK at 8) 2197 14: q BINPUT 1 2198 16: } EMPTY_DICT 2199 17: q BINPUT 2 2200 19: c GLOBAL '_codecs encode' 2201 35: q BINPUT 3 2202 37: ( MARK 2203 38: X BINUNICODE 'abc' 2204 46: q BINPUT 4 2205 48: X BINUNICODE 'latin1' 2206 59: q BINPUT 5 2207 61: t TUPLE (MARK at 37) 2208 62: q BINPUT 6 2209 64: R REDUCE 2210 65: q BINPUT 7 2211 67: X BINUNICODE 'def' 2212 75: q BINPUT 8 2213 77: s SETITEM 2214 78: e APPENDS (MARK at 3) 2215 79: . STOP 2216highest protocol among opcodes = 1 2217 2218Exercise the INST/OBJ/BUILD family. 2219 2220>>> import pickletools 2221>>> dis(pickle.dumps(pickletools.dis, 0)) 2222 0: c GLOBAL 'pickletools dis' 2223 17: p PUT 0 2224 20: . STOP 2225highest protocol among opcodes = 0 2226 2227>>> from pickletools import _Example 2228>>> x = [_Example(42)] * 2 2229>>> dis(pickle.dumps(x, 0)) 2230 0: ( MARK 2231 1: l LIST (MARK at 0) 2232 2: p PUT 0 2233 5: c GLOBAL 'copy_reg _reconstructor' 2234 30: p PUT 1 2235 33: ( MARK 2236 34: c GLOBAL 'pickletools _Example' 2237 56: p PUT 2 2238 59: c GLOBAL '__builtin__ object' 2239 79: p PUT 3 2240 82: N NONE 2241 83: t TUPLE (MARK at 33) 2242 84: p PUT 4 2243 87: R REDUCE 2244 88: p PUT 5 2245 91: ( MARK 2246 92: d DICT (MARK at 91) 2247 93: p PUT 6 2248 96: V UNICODE 'value' 2249 103: p PUT 7 2250 106: L LONG 42 2251 111: s SETITEM 2252 112: b BUILD 2253 113: a APPEND 2254 114: g GET 5 2255 117: a APPEND 2256 118: . STOP 2257highest protocol among opcodes = 0 2258 2259>>> dis(pickle.dumps(x, 1)) 2260 0: ] EMPTY_LIST 2261 1: q BINPUT 0 2262 3: ( MARK 2263 4: c GLOBAL 'copy_reg _reconstructor' 2264 29: q BINPUT 1 2265 31: ( MARK 2266 32: c GLOBAL 'pickletools _Example' 2267 54: q BINPUT 2 2268 56: c GLOBAL '__builtin__ object' 2269 76: q BINPUT 3 2270 78: N NONE 2271 79: t TUPLE (MARK at 31) 2272 80: q BINPUT 4 2273 82: R REDUCE 2274 83: q BINPUT 5 2275 85: } EMPTY_DICT 2276 86: q BINPUT 6 2277 88: X BINUNICODE 'value' 2278 98: q BINPUT 7 2279 100: K BININT1 42 2280 102: s SETITEM 2281 103: b BUILD 2282 104: h BINGET 5 2283 106: e APPENDS (MARK at 3) 2284 107: . STOP 2285highest protocol among opcodes = 1 2286 2287Try "the canonical" recursive-object test. 2288 2289>>> L = [] 2290>>> T = L, 2291>>> L.append(T) 2292>>> L[0] is T 2293True 2294>>> T[0] is L 2295True 2296>>> L[0][0] is L 2297True 2298>>> T[0][0] is T 2299True 2300>>> dis(pickle.dumps(L, 0)) 2301 0: ( MARK 2302 1: l LIST (MARK at 0) 2303 2: p PUT 0 2304 5: ( MARK 2305 6: g GET 0 2306 9: t TUPLE (MARK at 5) 2307 10: p PUT 1 2308 13: a APPEND 2309 14: . STOP 2310highest protocol among opcodes = 0 2311 2312>>> dis(pickle.dumps(L, 1)) 2313 0: ] EMPTY_LIST 2314 1: q BINPUT 0 2315 3: ( MARK 2316 4: h BINGET 0 2317 6: t TUPLE (MARK at 3) 2318 7: q BINPUT 1 2319 9: a APPEND 2320 10: . STOP 2321highest protocol among opcodes = 1 2322 2323Note that, in the protocol 0 pickle of the recursive tuple, the disassembler 2324has to emulate the stack in order to realize that the POP opcode at 16 gets 2325rid of the MARK at 0. 2326 2327>>> dis(pickle.dumps(T, 0)) 2328 0: ( MARK 2329 1: ( MARK 2330 2: l LIST (MARK at 1) 2331 3: p PUT 0 2332 6: ( MARK 2333 7: g GET 0 2334 10: t TUPLE (MARK at 6) 2335 11: p PUT 1 2336 14: a APPEND 2337 15: 0 POP 2338 16: 0 POP (MARK at 0) 2339 17: g GET 1 2340 20: . STOP 2341highest protocol among opcodes = 0 2342 2343>>> dis(pickle.dumps(T, 1)) 2344 0: ( MARK 2345 1: ] EMPTY_LIST 2346 2: q BINPUT 0 2347 4: ( MARK 2348 5: h BINGET 0 2349 7: t TUPLE (MARK at 4) 2350 8: q BINPUT 1 2351 10: a APPEND 2352 11: 1 POP_MARK (MARK at 0) 2353 12: h BINGET 1 2354 14: . STOP 2355highest protocol among opcodes = 1 2356 2357Try protocol 2. 2358 2359>>> dis(pickle.dumps(L, 2)) 2360 0: \x80 PROTO 2 2361 2: ] EMPTY_LIST 2362 3: q BINPUT 0 2363 5: h BINGET 0 2364 7: \x85 TUPLE1 2365 8: q BINPUT 1 2366 10: a APPEND 2367 11: . STOP 2368highest protocol among opcodes = 2 2369 2370>>> dis(pickle.dumps(T, 2)) 2371 0: \x80 PROTO 2 2372 2: ] EMPTY_LIST 2373 3: q BINPUT 0 2374 5: h BINGET 0 2375 7: \x85 TUPLE1 2376 8: q BINPUT 1 2377 10: a APPEND 2378 11: 0 POP 2379 12: h BINGET 1 2380 14: . STOP 2381highest protocol among opcodes = 2 2382 2383Try protocol 3 with annotations: 2384 2385>>> dis(pickle.dumps(T, 3), annotate=1) 2386 0: \x80 PROTO 3 Protocol version indicator. 2387 2: ] EMPTY_LIST Push an empty list. 2388 3: q BINPUT 0 Store the stack top into the memo. The stack is not popped. 2389 5: h BINGET 0 Read an object from the memo and push it on the stack. 2390 7: \x85 TUPLE1 Build a one-tuple out of the topmost item on the stack. 2391 8: q BINPUT 1 Store the stack top into the memo. The stack is not popped. 2392 10: a APPEND Append an object to a list. 2393 11: 0 POP Discard the top stack item, shrinking the stack by one item. 2394 12: h BINGET 1 Read an object from the memo and push it on the stack. 2395 14: . STOP Stop the unpickling machine. 2396highest protocol among opcodes = 2 2397 2398""" 2399 2400_memo_test = r""" 2401>>> import pickle 2402>>> import io 2403>>> f = io.BytesIO() 2404>>> p = pickle.Pickler(f, 2) 2405>>> x = [1, 2, 3] 2406>>> p.dump(x) 2407>>> p.dump(x) 2408>>> f.seek(0) 24090 2410>>> memo = {} 2411>>> dis(f, memo=memo) 2412 0: \x80 PROTO 2 2413 2: ] EMPTY_LIST 2414 3: q BINPUT 0 2415 5: ( MARK 2416 6: K BININT1 1 2417 8: K BININT1 2 2418 10: K BININT1 3 2419 12: e APPENDS (MARK at 5) 2420 13: . STOP 2421highest protocol among opcodes = 2 2422>>> dis(f, memo=memo) 2423 14: \x80 PROTO 2 2424 16: h BINGET 0 2425 18: . STOP 2426highest protocol among opcodes = 2 2427""" 2428 2429__test__ = {'disassembler_test': _dis_test, 2430 'disassembler_memo_test': _memo_test, 2431 } 2432 2433def _test(): 2434 import doctest 2435 return doctest.testmod() 2436 2437if __name__ == "__main__": 2438 import sys, argparse 2439 parser = argparse.ArgumentParser( 2440 description='disassemble one or more pickle files') 2441 parser.add_argument( 2442 'pickle_file', type=argparse.FileType('br'), 2443 nargs='*', help='the pickle file') 2444 parser.add_argument( 2445 '-o', '--output', default=sys.stdout, type=argparse.FileType('w'), 2446 help='the file where the output should be written') 2447 parser.add_argument( 2448 '-m', '--memo', action='store_true', 2449 help='preserve memo between disassemblies') 2450 parser.add_argument( 2451 '-l', '--indentlevel', default=4, type=int, 2452 help='the number of blanks by which to indent a new MARK level') 2453 parser.add_argument( 2454 '-a', '--annotate', action='store_true', 2455 help='annotate each line with a short opcode description') 2456 parser.add_argument( 2457 '-p', '--preamble', default="==> {name} <==", 2458 help='if more than one pickle file is specified, print this before' 2459 ' each disassembly') 2460 parser.add_argument( 2461 '-t', '--test', action='store_true', 2462 help='run self-test suite') 2463 parser.add_argument( 2464 '-v', action='store_true', 2465 help='run verbosely; only affects self-test run') 2466 args = parser.parse_args() 2467 if args.test: 2468 _test() 2469 else: 2470 annotate = 30 if args.annotate else 0 2471 if not args.pickle_file: 2472 parser.print_help() 2473 elif len(args.pickle_file) == 1: 2474 dis(args.pickle_file[0], args.output, None, 2475 args.indentlevel, annotate) 2476 else: 2477 memo = {} if args.memo else None 2478 for f in args.pickle_file: 2479 preamble = args.preamble.format(name=f.name) 2480 args.output.write(preamble + '\n') 2481 dis(f, args.output, memo, args.indentlevel, annotate) 2482