pickletools.py revision 28d271ef6b0eb37e27b5b3234ba146922b11d89f
'''"Executable documentation" for the pickle module.

Extensive comments about the pickle protocols and pickle-machine opcodes
can be found here. Some functions meant for external use:

genops(pickle)
   Generate all the opcodes in a pickle, as (opcode, arg, position) triples.

dis(pickle, out=None, memo=None, indentlevel=4)
   Print a symbolic disassembly of a pickle.
'''

import codecs
import io
import pickle
import re
import sys

__all__ = ['dis', 'genops', 'optimize']

bytes_types = pickle.bytes_types

# Other ideas:
#
# - A pickle verifier: read a pickle and check it exhaustively for
#   well-formedness. dis() does a lot of this already.
#
# - A protocol identifier: examine a pickle and return its protocol number
#   (== the highest .proto attr value among all the opcodes in the pickle).
#   dis() already prints this info at the end.
#
# - A pickle optimizer: for example, tuple-building code is sometimes more
#   elaborate than necessary, catering for the possibility that the tuple
#   is recursive. Or lots of times a PUT is generated that's never accessed
#   by a later GET.


# "A pickle" is a program for a virtual pickle machine (PM, but more accurately
# called an unpickling machine). It's a sequence of opcodes, interpreted by the
# PM, building an arbitrarily complex Python object.
#
# For the most part, the PM is very simple: there are no looping, testing, or
# conditional instructions, no arithmetic and no function calls. Opcodes are
# executed once each, from first to last, until a STOP opcode is reached.
#
# The PM has two data areas, "the stack" and "the memo".
#
# Many opcodes push Python objects onto the stack; e.g., INT pushes a Python
# integer object on the stack, whose value is gotten from a decimal string
# literal immediately following the INT opcode in the pickle bytestream. Other
# opcodes take Python objects off the stack.
# The result of unpickling is whatever object is left on the stack when the
# final STOP opcode is executed.
#
# The memo is simply an array of objects, or it can be implemented as a dict
# mapping little integers to objects. The memo serves as the PM's "long term
# memory", and the little integers indexing the memo are akin to variable
# names. Some opcodes store the object on top of the stack into the memo at
# a given index (without popping it), and others push a memo object at a
# given index onto the stack again.
#
# At heart, that's all the PM has. Subtleties arise for these reasons:
#
# + Object identity. Objects can be arbitrarily complex, and subobjects
#   may be shared (for example, the list [a, a] refers to the same object a
#   twice). It can be vital that unpickling recreate an isomorphic object
#   graph, faithfully reproducing sharing.
#
# + Recursive objects. For example, after "L = []; L.append(L)", L is a
#   list, and L[0] is the same list. This is related to the object identity
#   point, and some sequences of pickle opcodes are subtle in order to
#   get the right result in all cases.
#
# + Things pickle doesn't know everything about. Examples of things pickle
#   does know everything about are Python's builtin scalar and container
#   types, like ints and tuples. They generally have opcodes dedicated to
#   them. For things like module references and instances of user-defined
#   classes, pickle's knowledge is limited. Historically, many enhancements
#   have been made to the pickle protocol in order to do a better (faster,
#   and/or more compact) job on those.
#
# + Backward compatibility and micro-optimization. As explained below,
#   pickle opcodes never go away, not even when better ways to do a thing
#   get invented. The repertoire of the PM just keeps growing over time.
#   For example, protocol 0 had two opcodes for building Python integers (INT
#   and LONG), protocol 1 added three more for more-efficient pickling of short
#   integers, and protocol 2 added two more for more-efficient pickling of
#   long integers (before protocol 2, the only ways to pickle a Python long
#   took time quadratic in the number of digits, for both pickling and
#   unpickling). "Opcode bloat" isn't so much a subtlety as a source of
#   wearying complication.
#
#
# Pickle protocols:
#
# For compatibility, the meaning of a pickle opcode never changes. Instead new
# pickle opcodes get added, and each version's unpickler can handle all the
# pickle opcodes in all protocol versions to date. So old pickles continue to
# be readable forever. The pickler can generally be told to restrict itself to
# the subset of opcodes available under previous protocol versions too, so that
# users can create pickles under the current version readable by older
# versions. However, pickles written with protocols 0 and 1 do not contain
# their version number. If an older unpickler tries to read a pickle using a
# later protocol, the result is most likely an exception due to seeing an
# unknown (in the older unpickler) opcode.
#
# The original pickle used what's now called "protocol 0", and what was called
# "text mode" before Python 2.3. The entire pickle bytestream is made up of
# printable 7-bit ASCII characters, plus the newline character, in protocol 0.
# That's why it was called text mode. Protocol 0 is small and elegant, but
# sometimes painfully inefficient.
#
# The second major set of additions is now called "protocol 1", and was called
# "binary mode" before Python 2.3. This added many opcodes with arguments
# consisting of arbitrary bytes, including NUL bytes and unprintable "high bit"
# bytes.
# Binary mode pickles can be substantially smaller than equivalent
# text mode pickles, and sometimes faster too; e.g., BININT represents a 4-byte
# int as 4 bytes following the opcode, which is cheaper to unpickle than the
# (perhaps) 11-character decimal string attached to INT. Protocol 1 also added
# a number of opcodes that operate on many stack elements at once (like APPENDS
# and SETITEMS), and "shortcut" opcodes (like EMPTY_DICT and EMPTY_TUPLE).
#
# The third major set of additions came in Python 2.3, and is called "protocol
# 2". This added:
#
# - A better way to pickle instances of new-style classes (NEWOBJ).
#
# - A way for a pickle to identify its protocol (PROTO).
#
# - Time- and space-efficient pickling of long ints (LONG{1,4}).
#
# - Shortcuts for small tuples (TUPLE{1,2,3}).
#
# - Dedicated opcodes for bools (NEWTRUE, NEWFALSE).
#
# - The "extension registry", a vector of popular objects that can be pushed
#   efficiently by index (EXT{1,2,4}). This is akin to the memo and GET, but
#   the registry contents are predefined (there's nothing akin to the memo's
#   PUT).
#
# Another independent change with Python 2.3 is the abandonment of any
# pretense that it might be safe to load pickles received from untrusted
# parties -- no sufficient security analysis has been done to guarantee
# this and there isn't a use case that warrants the expense of such an
# analysis.
#
# To this end, all tests for __safe_for_unpickling__ or for
# copyreg.safe_constructors are removed from the unpickling code.
# References to these variables in the descriptions below are to be seen
# as describing unpickling in Python 2.2 and before.


# Meta-rule: Descriptions are stored in instances of descriptor objects,
# with plain constructors. No meta-language is defined from which
# descriptors could be constructed.
# If you want, e.g., XML, write a little
# program to generate XML from the objects.

##############################################################################
# Some pickle opcodes have an argument, following the opcode in the
# bytestream. An argument is of a specific type, described by an instance
# of ArgumentDescriptor. These are not to be confused with arguments taken
# off the stack -- ArgumentDescriptor applies only to arguments embedded in
# the opcode stream, immediately following an opcode.

# Represents the number of bytes consumed by an argument delimited by the
# next newline character.
UP_TO_NEWLINE = -1

# Represents the number of bytes consumed by a two-argument opcode where
# the first argument gives the number of bytes in the second argument.
TAKEN_FROM_ARGUMENT1 = -2   # num bytes is 1-byte unsigned int
TAKEN_FROM_ARGUMENT4 = -3   # num bytes is 4-byte signed little-endian int
TAKEN_FROM_ARGUMENT4U = -4  # num bytes is 4-byte unsigned little-endian int
TAKEN_FROM_ARGUMENT8U = -5  # num bytes is 8-byte unsigned little-endian int

class ArgumentDescriptor(object):
    __slots__ = (
        # name of descriptor record, also a module global name; a string
        'name',

        # length of argument, in bytes; an int; UP_TO_NEWLINE and
        # TAKEN_FROM_ARGUMENT{1,4,4U,8U} are negative values for
        # variable-length cases
        'n',

        # a function taking a file-like object, reading this kind of argument
        # from the object at the current position, advancing the current
        # position by n bytes, and returning the value of the argument
        'reader',

        # human-readable docs for this arg descriptor; a string
        'doc',
    )

    def __init__(self, name, n, reader, doc):
        assert isinstance(name, str)
        self.name = name

        assert isinstance(n, int) and (n >= 0 or
                                       n in (UP_TO_NEWLINE,
                                             TAKEN_FROM_ARGUMENT1,
                                             TAKEN_FROM_ARGUMENT4,
                                             TAKEN_FROM_ARGUMENT4U,
                                             TAKEN_FROM_ARGUMENT8U))
        self.n = n

        self.reader = reader

        assert isinstance(doc, str)
        self.doc = doc

from struct import unpack as _unpack

def read_uint1(f):
    r"""
    >>> import io
    >>> read_uint1(io.BytesIO(b'\xff'))
    255
    """

    data = f.read(1)
    if data:
        return data[0]
    raise ValueError("not enough data in stream to read uint1")

uint1 = ArgumentDescriptor(
    name='uint1',
    n=1,
    reader=read_uint1,
    doc="One-byte unsigned integer.")


def read_uint2(f):
    r"""
    >>> import io
    >>> read_uint2(io.BytesIO(b'\xff\x00'))
    255
    >>> read_uint2(io.BytesIO(b'\xff\xff'))
    65535
    """

    data = f.read(2)
    if len(data) == 2:
        return _unpack("<H", data)[0]
    raise ValueError("not enough data in stream to read uint2")

uint2 = ArgumentDescriptor(
    name='uint2',
    n=2,
    reader=read_uint2,
    doc="Two-byte unsigned integer, little-endian.")


def read_int4(f):
    r"""
    >>> import io
    >>> read_int4(io.BytesIO(b'\xff\x00\x00\x00'))
    255
    >>> read_int4(io.BytesIO(b'\x00\x00\x00\x80')) == -(2**31)
    True
    """

    data = f.read(4)
    if len(data) == 4:
        return _unpack("<i", data)[0]
    raise ValueError("not enough data in stream to read int4")

int4 = ArgumentDescriptor(
    name='int4',
    n=4,
    reader=read_int4,
    doc="Four-byte signed integer, little-endian, 2's complement.")


def read_uint4(f):
    r"""
    >>> import io
    >>> read_uint4(io.BytesIO(b'\xff\x00\x00\x00'))
    255
    >>> read_uint4(io.BytesIO(b'\x00\x00\x00\x80')) == 2**31
    True
    """

    data = f.read(4)
    if len(data) == 4:
        return _unpack("<I", data)[0]
    raise ValueError("not enough data in stream to read uint4")

uint4 = ArgumentDescriptor(
    name='uint4',
    n=4,
    reader=read_uint4,
    doc="Four-byte unsigned integer, little-endian.")


def read_uint8(f):
    r"""
    >>> import io
    >>> read_uint8(io.BytesIO(b'\xff\x00\x00\x00\x00\x00\x00\x00'))
    255
    >>> read_uint8(io.BytesIO(b'\xff' * 8)) == 2**64-1
    True
    """

    data = f.read(8)
    if len(data) == 8:
        return _unpack("<Q", data)[0]
    raise ValueError("not enough data in stream to read uint8")

uint8 = ArgumentDescriptor(
    name='uint8',
    n=8,
    reader=read_uint8,
    doc="Eight-byte unsigned integer, little-endian.")


def read_stringnl(f, decode=True, stripquotes=True):
    r"""
    >>> import io
    >>> read_stringnl(io.BytesIO(b"'abcd'\nefg\n"))
    'abcd'

    >>> read_stringnl(io.BytesIO(b"\n"))
    Traceback (most recent call last):
    ...
    ValueError: no string quotes around b''

    >>> read_stringnl(io.BytesIO(b"\n"), stripquotes=False)
    ''

    >>> read_stringnl(io.BytesIO(b"''\n"))
    ''

    >>> read_stringnl(io.BytesIO(b'"abcd"'))
    Traceback (most recent call last):
    ...
    ValueError: no newline found when trying to read stringnl

    Embedded escapes are undone in the result.
    >>> read_stringnl(io.BytesIO(br"'a\n\\b\x00c\td'" + b"\n'e'"))
    'a\n\\b\x00c\td'
    """

    data = f.readline()
    if not data.endswith(b'\n'):
        raise ValueError("no newline found when trying to read stringnl")
    data = data[:-1]    # lose the newline

    if stripquotes:
        for q in (b'"', b"'"):
            if data.startswith(q):
                if not data.endswith(q):
                    raise ValueError("string quote %r not found at both "
                                     "ends of %r" % (q, data))
                data = data[1:-1]
                break
        else:
            raise ValueError("no string quotes around %r" % data)

    if decode:
        data = codecs.escape_decode(data)[0].decode("ascii")
    return data

stringnl = ArgumentDescriptor(
    name='stringnl',
    n=UP_TO_NEWLINE,
    reader=read_stringnl,
    doc="""A newline-terminated string.

    This is a repr-style string, with embedded escapes, and
    bracketing quotes.
    """)

def read_stringnl_noescape(f):
    return read_stringnl(f, stripquotes=False)

stringnl_noescape = ArgumentDescriptor(
    name='stringnl_noescape',
    n=UP_TO_NEWLINE,
    reader=read_stringnl_noescape,
    doc="""A newline-terminated string.

    This is a str-style string, without embedded escapes,
    or bracketing quotes. It should consist solely of
    printable ASCII characters.
    """)

def read_stringnl_noescape_pair(f):
    r"""
    >>> import io
    >>> read_stringnl_noescape_pair(io.BytesIO(b"Queue\nEmpty\njunk"))
    'Queue Empty'
    """

    return "%s %s" % (read_stringnl_noescape(f), read_stringnl_noescape(f))

stringnl_noescape_pair = ArgumentDescriptor(
    name='stringnl_noescape_pair',
    n=UP_TO_NEWLINE,
    reader=read_stringnl_noescape_pair,
    doc="""A pair of newline-terminated strings.

    These are str-style strings, without embedded
    escapes, or bracketing quotes. They should
    consist solely of printable ASCII characters.
    The pair is returned as a single string, with
    a single blank separating the two strings.
    """)


def read_string1(f):
    r"""
    >>> import io
    >>> read_string1(io.BytesIO(b"\x00"))
    ''
    >>> read_string1(io.BytesIO(b"\x03abcdef"))
    'abc'
    """

    n = read_uint1(f)
    assert n >= 0
    data = f.read(n)
    if len(data) == n:
        return data.decode("latin-1")
    raise ValueError("expected %d bytes in a string1, but only %d remain" %
                     (n, len(data)))

string1 = ArgumentDescriptor(
    name="string1",
    n=TAKEN_FROM_ARGUMENT1,
    reader=read_string1,
    doc="""A counted string.

    The first argument is a 1-byte unsigned int giving the number
    of bytes in the string, and the second argument is that many
    bytes.
    """)


def read_string4(f):
    r"""
    >>> import io
    >>> read_string4(io.BytesIO(b"\x00\x00\x00\x00abc"))
    ''
    >>> read_string4(io.BytesIO(b"\x03\x00\x00\x00abcdef"))
    'abc'
    >>> read_string4(io.BytesIO(b"\x00\x00\x00\x03abcdef"))
    Traceback (most recent call last):
    ...
    ValueError: expected 50331648 bytes in a string4, but only 6 remain
    """

    n = read_int4(f)
    if n < 0:
        raise ValueError("string4 byte count < 0: %d" % n)
    data = f.read(n)
    if len(data) == n:
        return data.decode("latin-1")
    raise ValueError("expected %d bytes in a string4, but only %d remain" %
                     (n, len(data)))

string4 = ArgumentDescriptor(
    name="string4",
    n=TAKEN_FROM_ARGUMENT4,
    reader=read_string4,
    doc="""A counted string.

    The first argument is a 4-byte little-endian signed int giving
    the number of bytes in the string, and the second argument is
    that many bytes.
    """)


def read_bytes1(f):
    r"""
    >>> import io
    >>> read_bytes1(io.BytesIO(b"\x00"))
    b''
    >>> read_bytes1(io.BytesIO(b"\x03abcdef"))
    b'abc'
    """

    n = read_uint1(f)
    assert n >= 0
    data = f.read(n)
    if len(data) == n:
        return data
    raise ValueError("expected %d bytes in a bytes1, but only %d remain" %
                     (n, len(data)))

bytes1 = ArgumentDescriptor(
    name="bytes1",
    n=TAKEN_FROM_ARGUMENT1,
    reader=read_bytes1,
    doc="""A counted bytes string.

    The first argument is a 1-byte unsigned int giving the number
    of bytes in the string, and the second argument is that many
    bytes.
    """)


def read_bytes4(f):
    r"""
    >>> import io
    >>> read_bytes4(io.BytesIO(b"\x00\x00\x00\x00abc"))
    b''
    >>> read_bytes4(io.BytesIO(b"\x03\x00\x00\x00abcdef"))
    b'abc'
    >>> read_bytes4(io.BytesIO(b"\x00\x00\x00\x03abcdef"))
    Traceback (most recent call last):
    ...
    ValueError: expected 50331648 bytes in a bytes4, but only 6 remain
    """

    n = read_uint4(f)
    assert n >= 0
    if n > sys.maxsize:
        raise ValueError("bytes4 byte count > sys.maxsize: %d" % n)
    data = f.read(n)
    if len(data) == n:
        return data
    raise ValueError("expected %d bytes in a bytes4, but only %d remain" %
                     (n, len(data)))

bytes4 = ArgumentDescriptor(
    name="bytes4",
    n=TAKEN_FROM_ARGUMENT4U,
    reader=read_bytes4,
    doc="""A counted bytes string.

    The first argument is a 4-byte little-endian unsigned int giving
    the number of bytes, and the second argument is that many bytes.
    """)


def read_bytes8(f):
    r"""
    >>> import io, struct, sys
    >>> read_bytes8(io.BytesIO(b"\x00\x00\x00\x00\x00\x00\x00\x00abc"))
    b''
    >>> read_bytes8(io.BytesIO(b"\x03\x00\x00\x00\x00\x00\x00\x00abcdef"))
    b'abc'
    >>> bigsize8 = struct.pack("<Q", sys.maxsize//3)
    >>> read_bytes8(io.BytesIO(bigsize8 + b"abcdef"))  #doctest: +ELLIPSIS
    Traceback (most recent call last):
    ...
    ValueError: expected ... bytes in a bytes8, but only 6 remain
    """

    n = read_uint8(f)
    assert n >= 0
    if n > sys.maxsize:
        raise ValueError("bytes8 byte count > sys.maxsize: %d" % n)
    data = f.read(n)
    if len(data) == n:
        return data
    raise ValueError("expected %d bytes in a bytes8, but only %d remain" %
                     (n, len(data)))

bytes8 = ArgumentDescriptor(
    name="bytes8",
    n=TAKEN_FROM_ARGUMENT8U,
    reader=read_bytes8,
    doc="""A counted bytes string.

    The first argument is an 8-byte little-endian unsigned int giving
    the number of bytes, and the second argument is that many bytes.
    """)

def read_unicodestringnl(f):
    r"""
    >>> import io
    >>> read_unicodestringnl(io.BytesIO(b"abc\\uabcd\njunk")) == 'abc\uabcd'
    True
    """

    data = f.readline()
    if not data.endswith(b'\n'):
        raise ValueError("no newline found when trying to read "
                         "unicodestringnl")
    data = data[:-1]    # lose the newline
    return str(data, 'raw-unicode-escape')

unicodestringnl = ArgumentDescriptor(
    name='unicodestringnl',
    n=UP_TO_NEWLINE,
    reader=read_unicodestringnl,
    doc="""A newline-terminated Unicode string.

    This is raw-unicode-escape encoded, so consists of
    printable ASCII characters, and may contain embedded
    escape sequences.
    """)


def read_unicodestring1(f):
    r"""
    >>> import io
    >>> s = 'abcd\uabcd'
    >>> enc = s.encode('utf-8')
    >>> enc
    b'abcd\xea\xaf\x8d'
    >>> n = bytes([len(enc)])  # little-endian 1-byte length
    >>> t = read_unicodestring1(io.BytesIO(n + enc + b'junk'))
    >>> s == t
    True

    >>> read_unicodestring1(io.BytesIO(n + enc[:-1]))
    Traceback (most recent call last):
    ...
    ValueError: expected 7 bytes in a unicodestring1, but only 6 remain
    """

    n = read_uint1(f)
    assert n >= 0
    data = f.read(n)
    if len(data) == n:
        return str(data, 'utf-8', 'surrogatepass')
    raise ValueError("expected %d bytes in a unicodestring1, but only %d "
                     "remain" % (n, len(data)))

unicodestring1 = ArgumentDescriptor(
    name="unicodestring1",
    n=TAKEN_FROM_ARGUMENT1,
    reader=read_unicodestring1,
    doc="""A counted Unicode string.

    The first argument is a 1-byte unsigned int
    giving the number of bytes in the string, and the second
    argument -- the UTF-8 encoding of the Unicode string --
    contains that many bytes.
    """)


def read_unicodestring4(f):
    r"""
    >>> import io
    >>> s = 'abcd\uabcd'
    >>> enc = s.encode('utf-8')
    >>> enc
    b'abcd\xea\xaf\x8d'
    >>> n = bytes([len(enc), 0, 0, 0])  # little-endian 4-byte length
    >>> t = read_unicodestring4(io.BytesIO(n + enc + b'junk'))
    >>> s == t
    True

    >>> read_unicodestring4(io.BytesIO(n + enc[:-1]))
    Traceback (most recent call last):
    ...
    ValueError: expected 7 bytes in a unicodestring4, but only 6 remain
    """

    n = read_uint4(f)
    assert n >= 0
    if n > sys.maxsize:
        raise ValueError("unicodestring4 byte count > sys.maxsize: %d" % n)
    data = f.read(n)
    if len(data) == n:
        return str(data, 'utf-8', 'surrogatepass')
    raise ValueError("expected %d bytes in a unicodestring4, but only %d "
                     "remain" % (n, len(data)))

unicodestring4 = ArgumentDescriptor(
    name="unicodestring4",
    n=TAKEN_FROM_ARGUMENT4U,
    reader=read_unicodestring4,
    doc="""A counted Unicode string.

    The first argument is a 4-byte little-endian unsigned int
    giving the number of bytes in the string, and the second
    argument -- the UTF-8 encoding of the Unicode string --
    contains that many bytes.
    """)


def read_unicodestring8(f):
    r"""
    >>> import io
    >>> s = 'abcd\uabcd'
    >>> enc = s.encode('utf-8')
    >>> enc
    b'abcd\xea\xaf\x8d'
    >>> n = bytes([len(enc)]) + bytes(7)  # little-endian 8-byte length
    >>> t = read_unicodestring8(io.BytesIO(n + enc + b'junk'))
    >>> s == t
    True

    >>> read_unicodestring8(io.BytesIO(n + enc[:-1]))
    Traceback (most recent call last):
    ...
    ValueError: expected 7 bytes in a unicodestring8, but only 6 remain
    """

    n = read_uint8(f)
    assert n >= 0
    if n > sys.maxsize:
        raise ValueError("unicodestring8 byte count > sys.maxsize: %d" % n)
    data = f.read(n)
    if len(data) == n:
        return str(data, 'utf-8', 'surrogatepass')
    raise ValueError("expected %d bytes in a unicodestring8, but only %d "
                     "remain" % (n, len(data)))

unicodestring8 = ArgumentDescriptor(
    name="unicodestring8",
    n=TAKEN_FROM_ARGUMENT8U,
    reader=read_unicodestring8,
    doc="""A counted Unicode string.

    The first argument is an 8-byte little-endian unsigned int
    giving the number of bytes in the string, and the second
    argument -- the UTF-8 encoding of the Unicode string --
    contains that many bytes.
    """)


def read_decimalnl_short(f):
    r"""
    >>> import io
    >>> read_decimalnl_short(io.BytesIO(b"1234\n56"))
    1234

    >>> read_decimalnl_short(io.BytesIO(b"1234L\n56"))
    Traceback (most recent call last):
    ...
    ValueError: invalid literal for int() with base 10: b'1234L'
    """

    s = read_stringnl(f, decode=False, stripquotes=False)

    # There's a hack for True and False here.
    if s == b"00":
        return False
    elif s == b"01":
        return True

    return int(s)

def read_decimalnl_long(f):
    r"""
    >>> import io

    >>> read_decimalnl_long(io.BytesIO(b"1234L\n56"))
    1234

    >>> read_decimalnl_long(io.BytesIO(b"123456789012345678901234L\n6"))
    123456789012345678901234
    """

    s = read_stringnl(f, decode=False, stripquotes=False)
    if s[-1:] == b'L':
        s = s[:-1]
    return int(s)


decimalnl_short = ArgumentDescriptor(
    name='decimalnl_short',
    n=UP_TO_NEWLINE,
    reader=read_decimalnl_short,
    doc="""A newline-terminated decimal integer literal.

    This never has a trailing 'L', and the integer fit
    in a short Python int on the box where the pickle
    was written -- but there's no guarantee it will fit
    in a short Python int on the box where the pickle
    is read.
    """)

decimalnl_long = ArgumentDescriptor(
    name='decimalnl_long',
    n=UP_TO_NEWLINE,
    reader=read_decimalnl_long,
    doc="""A newline-terminated decimal integer literal.

    This has a trailing 'L', and can represent integers
    of any size.
    """)


def read_floatnl(f):
    r"""
    >>> import io
    >>> read_floatnl(io.BytesIO(b"-1.25\n6"))
    -1.25
    """
    s = read_stringnl(f, decode=False, stripquotes=False)
    return float(s)

floatnl = ArgumentDescriptor(
    name='floatnl',
    n=UP_TO_NEWLINE,
    reader=read_floatnl,
    doc="""A newline-terminated decimal floating literal.

    In general this requires 17 significant digits for roundtrip
    identity, and pickling then unpickling infinities, NaNs, and
    minus zero doesn't work across boxes, or on some boxes even
    on itself (e.g., Windows can't read the strings it produces
    for infinities or NaNs).
    """)

def read_float8(f):
    r"""
    >>> import io, struct
    >>> raw = struct.pack(">d", -1.25)
    >>> raw
    b'\xbf\xf4\x00\x00\x00\x00\x00\x00'
    >>> read_float8(io.BytesIO(raw + b"\n"))
    -1.25
    """

    data = f.read(8)
    if len(data) == 8:
        return _unpack(">d", data)[0]
    raise ValueError("not enough data in stream to read float8")


float8 = ArgumentDescriptor(
    name='float8',
    n=8,
    reader=read_float8,
    doc="""An 8-byte binary representation of a float, big-endian.

    The format is unique to Python, and shared with the struct
    module (format string '>d') "in theory" (the struct and pickle
    implementations don't share the code -- they should). It's
    strongly related to the IEEE-754 double format, and, in normal
    cases, is in fact identical to the big-endian 754 double format.
    On other boxes the dynamic range is limited to that of a 754
    double, and "add a half and chop" rounding is used to reduce
    the precision to 53 bits. However, even on a 754 box,
    infinities, NaNs, and minus zero may not be handled correctly
    (may not survive roundtrip pickling intact).
    """)

# Protocol 2 formats

from pickle import decode_long

def read_long1(f):
    r"""
    >>> import io
    >>> read_long1(io.BytesIO(b"\x00"))
    0
    >>> read_long1(io.BytesIO(b"\x02\xff\x00"))
    255
    >>> read_long1(io.BytesIO(b"\x02\xff\x7f"))
    32767
    >>> read_long1(io.BytesIO(b"\x02\x00\xff"))
    -256
    >>> read_long1(io.BytesIO(b"\x02\x00\x80"))
    -32768
    """

    n = read_uint1(f)
    data = f.read(n)
    if len(data) != n:
        raise ValueError("not enough data in stream to read long1")
    return decode_long(data)

long1 = ArgumentDescriptor(
    name="long1",
    n=TAKEN_FROM_ARGUMENT1,
    reader=read_long1,
    doc="""A binary long, little-endian, using 1-byte size.

    This first reads one byte as an unsigned size, then reads that
    many bytes and interprets them as a little-endian 2's-complement long.
    If the size is 0, that's taken as a shortcut for the long 0L.
    """)

def read_long4(f):
    r"""
    >>> import io
    >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\xff\x00"))
    255
    >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\xff\x7f"))
    32767
    >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\x00\xff"))
    -256
    >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\x00\x80"))
    -32768
    >>> read_long4(io.BytesIO(b"\x00\x00\x00\x00"))
    0
    """

    n = read_int4(f)
    if n < 0:
        raise ValueError("long4 byte count < 0: %d" % n)
    data = f.read(n)
    if len(data) != n:
        raise ValueError("not enough data in stream to read long4")
    return decode_long(data)

long4 = ArgumentDescriptor(
    name="long4",
    n=TAKEN_FROM_ARGUMENT4,
    reader=read_long4,
    doc="""A binary representation of a long, little-endian.

    This first reads four bytes as a signed size (but requires the
    size to be >= 0), then reads that many bytes and interprets them
    as a little-endian 2's-complement long.
    If the size is 0, that's taken
    as a shortcut for the int 0, although LONG1 should really be used
    then instead (and in any case where # of bytes < 256).
    """)


##############################################################################
# Object descriptors. The stack used by the pickle machine holds objects,
# and in the stack_before and stack_after attributes of OpcodeInfo
# descriptors we need names to describe the various types of objects that can
# appear on the stack.

class StackObject(object):
    __slots__ = (
        # name of descriptor record, for info only
        'name',

        # type of object, or tuple of type objects (meaning the object can
        # be of any type in the tuple)
        'obtype',

        # human-readable docs for this kind of stack object; a string
        'doc',
    )

    def __init__(self, name, obtype, doc):
        assert isinstance(name, str)
        self.name = name

        assert isinstance(obtype, type) or isinstance(obtype, tuple)
        if isinstance(obtype, tuple):
            for contained in obtype:
                assert isinstance(contained, type)
        self.obtype = obtype

        assert isinstance(doc, str)
        self.doc = doc

    def __repr__(self):
        return self.name


pyint = StackObject(
    name='int',
    obtype=int,
    doc="A short (as opposed to long) Python integer object.")

pylong = StackObject(
    name='long',
    obtype=int,
    doc="A long (as opposed to short) Python integer object.")

pyinteger_or_bool = StackObject(
    name='int_or_bool',
    obtype=(int, bool),
    doc="A Python integer object (short or long), or "
        "a Python bool.")

pybool = StackObject(
    name='bool',
    obtype=(bool,),
    doc="A Python bool object.")

pyfloat = StackObject(
    name='float',
    obtype=float,
    doc="A Python float object.")

pystring = StackObject(
    name='string',
    obtype=bytes,
    doc="A Python (8-bit) string object.")

pybytes = StackObject(
    name='bytes',
    obtype=bytes,
    doc="A Python bytes object.")

pyunicode = StackObject(
    name='str',
    obtype=str,
    doc="A Python (Unicode) string object.")

pynone = StackObject(
    name="None",
    obtype=type(None),
    doc="The Python None object.")

pytuple = StackObject(
    name="tuple",
    obtype=tuple,
    doc="A Python tuple object.")

pylist = StackObject(
    name="list",
    obtype=list,
    doc="A Python list object.")

pydict = StackObject(
    name="dict",
    obtype=dict,
    doc="A Python dict object.")

pyset = StackObject(
    name="set",
    obtype=set,
    doc="A Python set object.")

pyfrozenset = StackObject(
    name="frozenset",
    obtype=frozenset,
    doc="A Python frozenset object.")

anyobject = StackObject(
    name='any',
    obtype=object,
    doc="Any kind of object whatsoever.")

markobject = StackObject(
    name="mark",
    obtype=StackObject,
    doc="""'The mark' is a unique object.

    Opcodes that operate on a variable number of objects
    generally don't embed the count of objects in the opcode,
    or pull it off the stack. Instead the MARK opcode is used
    to push a special marker object on the stack, and then
    some other opcodes grab all the objects from the top of
    the stack down to (but not including) the topmost marker
    object.
    """)

stackslice = StackObject(
    name="stackslice",
    obtype=StackObject,
    doc="""An object representing a contiguous slice of the stack.

    This is used in conjunction with markobject, to represent all
    of the stack following the topmost markobject. For example,
    the POP_MARK opcode changes the stack from

        [..., markobject, stackslice]
    to
        [...]

    No matter how many objects are on the stack after the topmost
    markobject, POP_MARK gets rid of all of them (including the
    topmost markobject too).
    """)

##############################################################################
# Descriptors for pickle opcodes.

class OpcodeInfo(object):

    __slots__ = (
        # symbolic name of opcode; a string
        'name',

        # the code used in a bytestream to represent the opcode; a
        # one-character string
        'code',

        # If the opcode has an argument embedded in the byte string, an
        # instance of ArgumentDescriptor specifying its type. Note that
        # arg.reader(s) can be used to read and decode the argument from
        # the bytestream s, and arg.doc documents the format of the raw
        # argument bytes. If the opcode doesn't have an argument embedded
        # in the bytestream, arg should be None.
        'arg',

        # what the stack looks like before this opcode runs; a list
        'stack_before',

        # what the stack looks like after this opcode runs; a list
        'stack_after',

        # the protocol number in which this opcode was introduced; an int
        'proto',

        # human-readable docs for this opcode; a string
        'doc',
    )

    def __init__(self, name, code, arg,
                 stack_before, stack_after, proto, doc):
        assert isinstance(name, str)
        self.name = name

        assert isinstance(code, str)
        assert len(code) == 1
        self.code = code

        assert arg is None or isinstance(arg, ArgumentDescriptor)
        self.arg = arg

        assert isinstance(stack_before, list)
        for x in stack_before:
            assert isinstance(x, StackObject)
        self.stack_before = stack_before

        assert isinstance(stack_after, list)
        for x in stack_after:
            assert isinstance(x, StackObject)
        self.stack_after = stack_after

        assert isinstance(proto, int) and 0 <= proto <= pickle.HIGHEST_PROTOCOL
        self.proto = proto

        assert isinstance(doc, str)
        self.doc = doc

I = OpcodeInfo
opcodes = [

    # Ways to spell integers.

    I(name='INT',
      code='I',
      arg=decimalnl_short,
      stack_before=[],
      stack_after=[pyinteger_or_bool],
      proto=0,
      doc="""Push an integer or bool.

      The argument is a newline-terminated decimal literal string.

      The intent may have been that this always fit in a short Python int,
      but INT can be generated in pickles written on a 64-bit box that
      require a Python long on a 32-bit box. The difference between this
      and LONG then is that INT skips a trailing 'L', and produces a short
      int whenever possible.

      Another difference arises because, when bool was introduced as a
      distinct type in 2.3, builtin names True and False were also added to
      2.2.2, mapping to ints 1 and 0. For compatibility in both directions,
      True gets pickled as INT + "I01\\n", and False as INT + "I00\\n".
      Leading zeroes are never produced for a genuine integer. The 2.3
      (and later) unpicklers special-case these and return bool instead;
      earlier unpicklers ignore the leading "0" and return the int.
      """),

    I(name='BININT',
      code='J',
      arg=int4,
      stack_before=[],
      stack_after=[pyint],
      proto=1,
      doc="""Push a four-byte signed integer.

      This handles the full range of Python (short) integers on a 32-bit
      box, directly as binary bytes (1 for the opcode and 4 for the integer).
      If the integer is non-negative and fits in 1 or 2 bytes, pickling via
      BININT1 or BININT2 saves space.
      """),

    I(name='BININT1',
      code='K',
      arg=uint1,
      stack_before=[],
      stack_after=[pyint],
      proto=1,
      doc="""Push a one-byte unsigned integer.

      This is a space optimization for pickling very small non-negative ints,
      in range(256).
      """),

    I(name='BININT2',
      code='M',
      arg=uint2,
      stack_before=[],
      stack_after=[pyint],
      proto=1,
      doc="""Push a two-byte unsigned integer.

      This is a space optimization for pickling small positive ints, in
      range(256, 2**16). Integers in range(256) can also be pickled via
      BININT2, but BININT1 instead saves a byte.
      """),

    I(name='LONG',
      code='L',
      arg=decimalnl_long,
      stack_before=[],
      stack_after=[pylong],
      proto=0,
      doc="""Push a long integer.

      The same as INT, except that the literal ends with 'L', and always
      unpickles to a Python long. There doesn't seem to be a real purpose
      to the trailing 'L'.

      Note that LONG takes time quadratic in the number of digits when
      unpickling (this is simply due to the nature of decimal->binary
      conversion). Proto 2 added linear-time (in C; still quadratic-time
      in Python) LONG1 and LONG4 opcodes.
      """),

    I(name="LONG1",
      code='\x8a',
      arg=long1,
      stack_before=[],
      stack_after=[pylong],
      proto=2,
      doc="""Long integer using one-byte length.

      A more efficient encoding of a Python long; the long1 encoding
      says it all."""),

    I(name="LONG4",
      code='\x8b',
      arg=long4,
      stack_before=[],
      stack_after=[pylong],
      proto=2,
      doc="""Long integer using four-byte length.

      A more efficient encoding of a Python long; the long4 encoding
      says it all."""),

    # Ways to spell strings (8-bit, not Unicode).

    I(name='STRING',
      code='S',
      arg=stringnl,
      stack_before=[],
      stack_after=[pystring],
      proto=0,
      doc="""Push a Python string object.

      The argument is a repr-style string, with bracketing quote characters,
      and perhaps embedded escapes. The argument extends until the next
      newline character.
      (Actually, they are decoded into a str instance
      using the encoding given to the Unpickler constructor, or the default,
      'ASCII'.)
      """),

    I(name='BINSTRING',
      code='T',
      arg=string4,
      stack_before=[],
      stack_after=[pystring],
      proto=1,
      doc="""Push a Python string object.

      There are two arguments: the first is a 4-byte little-endian signed int
      giving the number of bytes in the string, and the second is that many
      bytes, which are taken literally as the string content. (Actually,
      they are decoded into a str instance using the encoding given to the
      Unpickler constructor, or the default, 'ASCII'.)
      """),

    I(name='SHORT_BINSTRING',
      code='U',
      arg=string1,
      stack_before=[],
      stack_after=[pystring],
      proto=1,
      doc="""Push a Python string object.

      There are two arguments: the first is a 1-byte unsigned int giving
      the number of bytes in the string, and the second is that many bytes,
      which are taken literally as the string content. (Actually, they
      are decoded into a str instance using the encoding given to the
      Unpickler constructor, or the default, 'ASCII'.)
      """),

    # Bytes (protocol 3 and higher; older protocols don't support bytes at all)

    I(name='BINBYTES',
      code='B',
      arg=bytes4,
      stack_before=[],
      stack_after=[pybytes],
      proto=3,
      doc="""Push a Python bytes object.

      There are two arguments: the first is a 4-byte little-endian unsigned int
      giving the number of bytes, and the second is that many bytes, which are
      taken literally as the bytes content.
      """),

    I(name='SHORT_BINBYTES',
      code='C',
      arg=bytes1,
      stack_before=[],
      stack_after=[pybytes],
      proto=3,
      doc="""Push a Python bytes object.

      There are two arguments: the first is a 1-byte unsigned int giving
      the number of bytes, and the second is that many bytes, which are taken
      literally as the bytes content.
      """),

    I(name='BINBYTES8',
      code='\x8e',
      arg=bytes8,
      stack_before=[],
      stack_after=[pybytes],
      proto=4,
      doc="""Push a Python bytes object.

      There are two arguments: the first is an 8-byte little-endian unsigned
      int giving the number of bytes, and the second is that many bytes,
      which are taken literally as the bytes content.
      """),

    # Ways to spell None.

    I(name='NONE',
      code='N',
      arg=None,
      stack_before=[],
      stack_after=[pynone],
      proto=0,
      doc="Push None on the stack."),

    # Ways to spell bools, starting with proto 2. See INT for how this was
    # done before proto 2.

    I(name='NEWTRUE',
      code='\x88',
      arg=None,
      stack_before=[],
      stack_after=[pybool],
      proto=2,
      doc="""True.

      Push True onto the stack."""),

    I(name='NEWFALSE',
      code='\x89',
      arg=None,
      stack_before=[],
      stack_after=[pybool],
      proto=2,
      doc="""False.

      Push False onto the stack."""),

    # Ways to spell Unicode strings.

    I(name='UNICODE',
      code='V',
      arg=unicodestringnl,
      stack_before=[],
      stack_after=[pyunicode],
      proto=0,  # this may be pure-text, but it's a later addition
      doc="""Push a Python Unicode string object.

      The argument is a raw-unicode-escape encoding of a Unicode string,
      and so may contain embedded escape sequences. The argument extends
      until the next newline character.
      """),

    I(name='SHORT_BINUNICODE',
      code='\x8c',
      arg=unicodestring1,
      stack_before=[],
      stack_after=[pyunicode],
      proto=4,
      doc="""Push a Python Unicode string object.

      There are two arguments: the first is a 1-byte unsigned int
      giving the number of bytes in the string. The second is that many
      bytes, and is the UTF-8 encoding of the Unicode string.
      """),

    I(name='BINUNICODE',
      code='X',
      arg=unicodestring4,
      stack_before=[],
      stack_after=[pyunicode],
      proto=1,
      doc="""Push a Python Unicode string object.

      There are two arguments: the first is a 4-byte little-endian unsigned int
      giving the number of bytes in the string. The second is that many
      bytes, and is the UTF-8 encoding of the Unicode string.
      """),

    I(name='BINUNICODE8',
      code='\x8d',
      arg=unicodestring8,
      stack_before=[],
      stack_after=[pyunicode],
      proto=4,
      doc="""Push a Python Unicode string object.

      There are two arguments: the first is an 8-byte little-endian unsigned
      int giving the number of bytes in the string. The second is that many
      bytes, and is the UTF-8 encoding of the Unicode string.
      """),

    # Ways to spell floats.

    I(name='FLOAT',
      code='F',
      arg=floatnl,
      stack_before=[],
      stack_after=[pyfloat],
      proto=0,
      doc="""Newline-terminated decimal float literal.

      The argument is repr(a_float), and in general requires 17 significant
      digits for roundtrip conversion to be an identity (this is so for
      IEEE-754 double precision values, which is what Python float maps to
      on most boxes).

      In general, FLOAT cannot be used to transport infinities, NaNs, or
      minus zero across boxes (or even on a single box, if the platform C
      library can't read the strings it produces for such things -- Windows
      is like that), but may do less damage than BINFLOAT on boxes with
      greater precision or dynamic range than IEEE-754 double.
      """),

    I(name='BINFLOAT',
      code='G',
      arg=float8,
      stack_before=[],
      stack_after=[pyfloat],
      proto=1,
      doc="""Float stored in binary form, with 8 bytes of data.

      This generally requires less than half the space of FLOAT encoding.
      In general, BINFLOAT cannot be used to transport infinities, NaNs, or
      minus zero, raises an exception if the exponent exceeds the range of
      an IEEE-754 double, and retains no more than 53 bits of precision (if
      there are more than that, "add a half and chop" rounding is used to
      cut it back to 53 significant bits).
      """),

    # Ways to build lists.

    I(name='EMPTY_LIST',
      code=']',
      arg=None,
      stack_before=[],
      stack_after=[pylist],
      proto=1,
      doc="Push an empty list."),

    I(name='APPEND',
      code='a',
      arg=None,
      stack_before=[pylist, anyobject],
      stack_after=[pylist],
      proto=0,
      doc="""Append an object to a list.

      Stack before: ... pylist anyobject
      Stack after:  ... pylist+[anyobject]

      although pylist is really extended in-place.
      """),

    I(name='APPENDS',
      code='e',
      arg=None,
      stack_before=[pylist, markobject, stackslice],
      stack_after=[pylist],
      proto=1,
      doc="""Extend a list by a slice of stack objects.

      Stack before: ... pylist markobject stackslice
      Stack after:  ... pylist+stackslice

      although pylist is really extended in-place.
      """),

    I(name='LIST',
      code='l',
      arg=None,
      stack_before=[markobject, stackslice],
      stack_after=[pylist],
      proto=0,
      doc="""Build a list out of the topmost stack slice, after markobject.

      All the stack entries following the topmost markobject are placed into
      a single Python list, which single list object replaces all of the
      stack from the topmost markobject onward. For example,

      Stack before: ...
                        markobject 1 2 3 'abc'
      Stack after:  ... [1, 2, 3, 'abc']
      """),

    # Ways to build tuples.

    I(name='EMPTY_TUPLE',
      code=')',
      arg=None,
      stack_before=[],
      stack_after=[pytuple],
      proto=1,
      doc="Push an empty tuple."),

    I(name='TUPLE',
      code='t',
      arg=None,
      stack_before=[markobject, stackslice],
      stack_after=[pytuple],
      proto=0,
      doc="""Build a tuple out of the topmost stack slice, after markobject.

      All the stack entries following the topmost markobject are placed into
      a single Python tuple, which single tuple object replaces all of the
      stack from the topmost markobject onward. For example,

      Stack before: ... markobject 1 2 3 'abc'
      Stack after:  ... (1, 2, 3, 'abc')
      """),

    I(name='TUPLE1',
      code='\x85',
      arg=None,
      stack_before=[anyobject],
      stack_after=[pytuple],
      proto=2,
      doc="""Build a one-tuple out of the topmost item on the stack.

      This code pops one value off the stack and pushes a tuple of
      length 1 whose one item is that value back onto it. In other
      words:

          stack[-1] = tuple(stack[-1:])
      """),

    I(name='TUPLE2',
      code='\x86',
      arg=None,
      stack_before=[anyobject, anyobject],
      stack_after=[pytuple],
      proto=2,
      doc="""Build a two-tuple out of the top two items on the stack.

      This code pops two values off the stack and pushes a tuple of
      length 2 whose items are those values back onto it. In other
      words:

          stack[-2:] = [tuple(stack[-2:])]
      """),

    I(name='TUPLE3',
      code='\x87',
      arg=None,
      stack_before=[anyobject, anyobject, anyobject],
      stack_after=[pytuple],
      proto=2,
      doc="""Build a three-tuple out of the top three items on the stack.

      This code pops three values off the stack and pushes a tuple of
      length 3 whose items are those values back onto it.
      In other words:

          stack[-3:] = [tuple(stack[-3:])]
      """),

    # Ways to build dicts.

    I(name='EMPTY_DICT',
      code='}',
      arg=None,
      stack_before=[],
      stack_after=[pydict],
      proto=1,
      doc="Push an empty dict."),

    I(name='DICT',
      code='d',
      arg=None,
      stack_before=[markobject, stackslice],
      stack_after=[pydict],
      proto=0,
      doc="""Build a dict out of the topmost stack slice, after markobject.

      All the stack entries following the topmost markobject are placed into
      a single Python dict, which single dict object replaces all of the
      stack from the topmost markobject onward. The stack slice alternates
      key, value, key, value, .... For example,

      Stack before: ... markobject 1 2 3 'abc'
      Stack after:  ... {1: 2, 3: 'abc'}
      """),

    I(name='SETITEM',
      code='s',
      arg=None,
      stack_before=[pydict, anyobject, anyobject],
      stack_after=[pydict],
      proto=0,
      doc="""Add a key+value pair to an existing dict.

      Stack before: ... pydict key value
      Stack after:  ... pydict

      where pydict has been modified via pydict[key] = value.
      """),

    I(name='SETITEMS',
      code='u',
      arg=None,
      stack_before=[pydict, markobject, stackslice],
      stack_after=[pydict],
      proto=1,
      doc="""Add an arbitrary number of key+value pairs to an existing dict.

      The slice of the stack following the topmost markobject is taken as
      an alternating sequence of keys and values, added to the dict
      immediately under the topmost markobject. Everything at and after the
      topmost markobject is popped, leaving the mutated dict at the top
      of the stack.

      Stack before: ... pydict markobject key_1 value_1 ... key_n value_n
      Stack after:  ... pydict

      where pydict has been modified via pydict[key_i] = value_i for i in
      1, 2, ..., n, and in that order.
      """),

    # Ways to build sets

    I(name='EMPTY_SET',
      code='\x8f',
      arg=None,
      stack_before=[],
      stack_after=[pyset],
      proto=4,
      doc="Push an empty set."),

    I(name='ADDITEMS',
      code='\x90',
      arg=None,
      stack_before=[pyset, markobject, stackslice],
      stack_after=[pyset],
      proto=4,
      doc="""Add an arbitrary number of items to an existing set.

      The slice of the stack following the topmost markobject is taken as
      a sequence of items, added to the set immediately under the topmost
      markobject. Everything at and after the topmost markobject is popped,
      leaving the mutated set at the top of the stack.

      Stack before: ... pyset markobject item_1 ... item_n
      Stack after:  ... pyset

      where pyset has been modified via pyset.add(item_i) for i in
      1, 2, ..., n, and in that order.
      """),

    # Way to build frozensets

    I(name='FROZENSET',
      code='\x91',
      arg=None,
      stack_before=[markobject, stackslice],
      stack_after=[pyfrozenset],
      proto=4,
      doc="""Build a frozenset out of the topmost slice, after markobject.

      All the stack entries following the topmost markobject are placed into
      a single Python frozenset, which single frozenset object replaces all
      of the stack from the topmost markobject onward. For example,

      Stack before: ... markobject 1 2 3
      Stack after:  ... frozenset({1, 2, 3})
      """),

    # Stack manipulation.

    I(name='POP',
      code='0',
      arg=None,
      stack_before=[anyobject],
      stack_after=[],
      proto=0,
      doc="Discard the top stack item, shrinking the stack by one item."),

    I(name='DUP',
      code='2',
      arg=None,
      stack_before=[anyobject],
      stack_after=[anyobject, anyobject],
      proto=0,
      doc="Push the top stack item onto the stack again, duplicating it."),

    I(name='MARK',
      code='(',
      arg=None,
      stack_before=[],
      stack_after=[markobject],
      proto=0,
      doc="""Push markobject onto the stack.

      markobject is a unique object, used by other opcodes to identify a
      region of the stack containing a variable number of objects for them
      to work on. See markobject.doc for more detail.
      """),

    I(name='POP_MARK',
      code='1',
      arg=None,
      stack_before=[markobject, stackslice],
      stack_after=[],
      proto=1,
      doc="""Pop all the stack objects at and above the topmost markobject.

      When an opcode using a variable number of stack objects is done,
      POP_MARK is used to remove those objects, and to remove the markobject
      that delimited their starting position on the stack.
      """),

    # Memo manipulation. There are really only two operations (get and put),
    # each in all-text, "short binary", and "long binary" flavors.

    I(name='GET',
      code='g',
      arg=decimalnl_short,
      stack_before=[],
      stack_after=[anyobject],
      proto=0,
      doc="""Read an object from the memo and push it on the stack.

      The index of the memo object to push is given by the newline-terminated
      decimal string following. BINGET and LONG_BINGET are space-optimized
      versions.
      """),

    I(name='BINGET',
      code='h',
      arg=uint1,
      stack_before=[],
      stack_after=[anyobject],
      proto=1,
      doc="""Read an object from the memo and push it on the stack.

      The index of the memo object to push is given by the 1-byte unsigned
      integer following.
      """),

    I(name='LONG_BINGET',
      code='j',
      arg=uint4,
      stack_before=[],
      stack_after=[anyobject],
      proto=1,
      doc="""Read an object from the memo and push it on the stack.

      The index of the memo object to push is given by the 4-byte unsigned
      little-endian integer following.
      """),

    I(name='PUT',
      code='p',
      arg=decimalnl_short,
      stack_before=[],
      stack_after=[],
      proto=0,
      doc="""Store the stack top into the memo. The stack is not popped.

      The index of the memo location to write into is given by the newline-
      terminated decimal string following. BINPUT and LONG_BINPUT are
      space-optimized versions.
      """),

    I(name='BINPUT',
      code='q',
      arg=uint1,
      stack_before=[],
      stack_after=[],
      proto=1,
      doc="""Store the stack top into the memo. The stack is not popped.

      The index of the memo location to write into is given by the 1-byte
      unsigned integer following.
      """),

    I(name='LONG_BINPUT',
      code='r',
      arg=uint4,
      stack_before=[],
      stack_after=[],
      proto=1,
      doc="""Store the stack top into the memo. The stack is not popped.

      The index of the memo location to write into is given by the 4-byte
      unsigned little-endian integer following.
      """),

    I(name='MEMOIZE',
      code='\x94',
      arg=None,
      stack_before=[anyobject],
      stack_after=[anyobject],
      proto=4,
      doc="""Store the stack top into the memo. The stack is not popped.

      The index of the memo location to write is the number of
      elements currently present in the memo.
      """),

    # Access the extension registry (predefined objects). Akin to the GET
    # family.

    I(name='EXT1',
      code='\x82',
      arg=uint1,
      stack_before=[],
      stack_after=[anyobject],
      proto=2,
      doc="""Extension code.

      This code and the similar EXT2 and EXT4 allow using a registry
      of popular objects that are pickled by name, typically classes.
      It is envisioned that through a global negotiation and
      registration process, third parties can set up a mapping between
      ints and object names.

      In order to guarantee pickle interchangeability, the extension
      code registry ought to be global, although a range of codes may
      be reserved for private use.

      EXT1 has a 1-byte integer argument. This is used to index into the
      extension registry, and the object at that index is pushed on the stack.
      """),

    I(name='EXT2',
      code='\x83',
      arg=uint2,
      stack_before=[],
      stack_after=[anyobject],
      proto=2,
      doc="""Extension code.

      See EXT1. EXT2 has a two-byte integer argument.
      """),

    I(name='EXT4',
      code='\x84',
      arg=int4,
      stack_before=[],
      stack_after=[anyobject],
      proto=2,
      doc="""Extension code.

      See EXT1. EXT4 has a four-byte integer argument.
      """),

    # Push a class object, or module function, on the stack, via its module
    # and name.

    I(name='GLOBAL',
      code='c',
      arg=stringnl_noescape_pair,
      stack_before=[],
      stack_after=[anyobject],
      proto=0,
      doc="""Push a global object (module.attr) on the stack.

      Two newline-terminated strings follow the GLOBAL opcode. The first is
      taken as a module name, and the second as a class name. The class
      object module.class is pushed on the stack. More accurately, the
      object returned by self.find_class(module, class) is pushed on the
      stack, so unpickling subclasses can override this form of lookup.
      """),

    I(name='STACK_GLOBAL',
      code='\x93',
      arg=None,
      stack_before=[pyunicode, pyunicode],
      stack_after=[anyobject],
      proto=4,
      doc="""Push a global object (module.attr) on the stack.
      """),

    # Ways to build objects of classes pickle doesn't know about directly
    # (user-defined classes). I despair of documenting this accurately
    # and comprehensibly -- you really have to read the pickle code to
    # find all the special cases.

    I(name='REDUCE',
      code='R',
      arg=None,
      stack_before=[anyobject, anyobject],
      stack_after=[anyobject],
      proto=0,
      doc="""Push an object built from a callable and an argument tuple.

      The opcode is named to remind of the __reduce__() method.

      Stack before: ... callable pytuple
      Stack after:  ... callable(*pytuple)

      The callable and the argument tuple are the first two items returned
      by a __reduce__ method. Applying the callable to the argtuple is
      supposed to reproduce the original object, or at least get it started.
      If the __reduce__ method returns a 3-tuple, the last component is an
      argument to be passed to the object's __setstate__, and then the REDUCE
      opcode is followed by code to create setstate's argument, and then a
      BUILD opcode to apply __setstate__ to that argument.

      If not isinstance(callable, type), REDUCE complains unless the
      callable has been registered with the copyreg module's
      safe_constructors dict, or the callable has a magic
      '__safe_for_unpickling__' attribute with a true value. I'm not sure
      why it does this, but I've sure seen this complaint often enough when
      I didn't want to <wink>.
      """),

    I(name='BUILD',
      code='b',
      arg=None,
      stack_before=[anyobject, anyobject],
      stack_after=[anyobject],
      proto=0,
      doc="""Finish building an object, via __setstate__ or dict update.

      Stack before: ...
                        anyobject argument
      Stack after:  ... anyobject

      where anyobject may have been mutated, as follows:

      If the object has a __setstate__ method,

          anyobject.__setstate__(argument)

      is called.

      Else the argument must be a dict, the object must have a __dict__, and
      the object is updated via

          anyobject.__dict__.update(argument)
      """),

    I(name='INST',
      code='i',
      arg=stringnl_noescape_pair,
      stack_before=[markobject, stackslice],
      stack_after=[anyobject],
      proto=0,
      doc="""Build a class instance.

      This is the protocol 0 version of protocol 1's OBJ opcode.
      INST is followed by two newline-terminated strings, giving a
      module and class name, just as for the GLOBAL opcode (and see
      GLOBAL for more details about that). self.find_class(module, name)
      is used to get a class object.

      In addition, all the objects on the stack following the topmost
      markobject are gathered into a tuple and popped (along with the
      topmost markobject), just as for the TUPLE opcode.

      Now it gets complicated. If all of these are true:

        + The argtuple is empty (markobject was at the top of the stack
          at the start).

        + The class object does not have a __getinitargs__ attribute.

      then we want to create an old-style class instance without invoking
      its __init__() method (pickle has waffled on this over the years; not
      calling __init__() is current wisdom). In this case, an instance of
      an old-style dummy class is created, and then we try to rebind its
      __class__ attribute to the desired class object. If this succeeds,
      the new instance object is pushed on the stack, and we're done.

      Else (the argtuple is not empty, it's not an old-style class object,
      or the class object does have a __getinitargs__ attribute), the code
      first insists that the class object have a __safe_for_unpickling__
      attribute. Unlike the __safe_for_unpickling__ check in REDUCE,
      it doesn't matter whether this attribute has a true or false value, it
      only matters whether it exists (XXX this is a bug). If
      __safe_for_unpickling__ doesn't exist, UnpicklingError is raised.

      Else (the class object does have a __safe_for_unpickling__ attr),
      the class object obtained from INST's arguments is applied to the
      argtuple obtained from the stack, and the resulting instance object
      is pushed on the stack.

      NOTE: checks for __safe_for_unpickling__ went away in Python 2.3.
      NOTE: the distinction between old-style and new-style classes does
            not make sense in Python 3.
      """),

    I(name='OBJ',
      code='o',
      arg=None,
      stack_before=[markobject, anyobject, stackslice],
      stack_after=[anyobject],
      proto=1,
      doc="""Build a class instance.

      This is the protocol 1 version of protocol 0's INST opcode, and is
      very much like it. The major difference is that the class object
      is taken off the stack, allowing it to be retrieved from the memo
      repeatedly if several instances of the same class are created. This
      can be much more efficient (in both time and space) than repeatedly
      embedding the module and class names in INST opcodes.

      Unlike INST, OBJ takes no arguments from the opcode stream. Instead
      the class object is taken off the stack, immediately above the
      topmost markobject:

      Stack before: ... markobject classobject stackslice
      Stack after:  ...
                        new_instance_object

      As for INST, the remainder of the stack above the markobject is
      gathered into an argument tuple, and then the logic seems identical,
      except that no __safe_for_unpickling__ check is done (XXX this is
      a bug). See INST for the gory details.

      NOTE: In Python 2.3, INST and OBJ are identical except for how they
      get the class object. That was always the intent; the implementations
      had diverged for accidental reasons.
      """),

    I(name='NEWOBJ',
      code='\x81',
      arg=None,
      stack_before=[anyobject, anyobject],
      stack_after=[anyobject],
      proto=2,
      doc="""Build an object instance.

      The stack before should be thought of as containing a class
      object followed by an argument tuple (the tuple being the stack
      top). Call these cls and args. They are popped off the stack,
      and the value returned by cls.__new__(cls, *args) is pushed back
      onto the stack.
      """),

    I(name='NEWOBJ_EX',
      code='\x92',
      arg=None,
      stack_before=[anyobject, anyobject, anyobject],
      stack_after=[anyobject],
      proto=4,
      doc="""Build an object instance.

      The stack before should be thought of as containing a class
      object followed by an argument tuple and by a keyword argument dict
      (the dict being the stack top). Call these cls, args, and kwargs.
      They are popped off the stack, and the value returned by
      cls.__new__(cls, *args, **kwargs) is pushed back onto the stack.
      """),

    # Machine control.

    I(name='PROTO',
      code='\x80',
      arg=uint1,
      stack_before=[],
      stack_after=[],
      proto=2,
      doc="""Protocol version indicator.

      For protocol 2 and above, a pickle must start with this opcode.
      The argument is the protocol version, an int in range(2, 256).
      """),

    I(name='STOP',
      code='.',
      arg=None,
      stack_before=[anyobject],
      stack_after=[],
      proto=0,
      doc="""Stop the unpickling machine.

      Every pickle ends with this opcode. The object at the top of the stack
      is popped, and that's the result of unpickling. The stack should be
      empty then.
      """),

    # Framing support.

    I(name='FRAME',
      code='\x95',
      arg=uint8,
      stack_before=[],
      stack_after=[],
      proto=4,
      doc="""Indicate the beginning of a new frame.

      The unpickler may use this opcode to safely prefetch data from its
      underlying stream.
      """),

    # Ways to deal with persistent IDs.

    I(name='PERSID',
      code='P',
      arg=stringnl_noescape,
      stack_before=[],
      stack_after=[anyobject],
      proto=0,
      doc="""Push an object identified by a persistent ID.

      The pickle module doesn't define what a persistent ID means. PERSID's
      argument is a newline-terminated str-style (no embedded escapes, no
      bracketing quote characters) string, which *is* "the persistent ID".
      The unpickler passes this string to self.persistent_load(). Whatever
      object that returns is pushed on the stack. There is no implementation
      of persistent_load() in Python's unpickler: it must be supplied by an
      unpickler subclass.
      """),

    I(name='BINPERSID',
      code='Q',
      arg=None,
      stack_before=[anyobject],
      stack_after=[anyobject],
      proto=1,
      doc="""Push an object identified by a persistent ID.

      Like PERSID, except the persistent ID is popped off the stack (instead
      of being a string embedded in the opcode bytestream). The persistent
      ID is passed to self.persistent_load(), and whatever object that
      returns is pushed on the stack. See PERSID for more detail.
      """),
]
del I

# Verify uniqueness of .name and .code members.
name2i = {}
code2i = {}

for i, d in enumerate(opcodes):
    if d.name in name2i:
        raise ValueError("repeated name %r at indices %d and %d" %
                         (d.name, name2i[d.name], i))
    if d.code in code2i:
        raise ValueError("repeated code %r at indices %d and %d" %
                         (d.code, code2i[d.code], i))

    name2i[d.name] = i
    code2i[d.code] = i

del name2i, code2i, i, d

##############################################################################
# Build a code2op dict, mapping opcode characters to OpcodeInfo records.
# Also ensure we've got the same stuff as pickle.py, although the
# introspection here is dicey.

code2op = {}
for d in opcodes:
    code2op[d.code] = d
del d

def assure_pickle_consistency(verbose=False):

    copy = code2op.copy()
    for name in pickle.__all__:
        if not re.match("[A-Z][A-Z0-9_]+$", name):
            if verbose:
                print("skipping %r: it doesn't look like an opcode name" % name)
            continue
        picklecode = getattr(pickle, name)
        if not isinstance(picklecode, bytes) or len(picklecode) != 1:
            if verbose:
                print(("skipping %r: value %r doesn't look like a pickle "
                       "code" % (name, picklecode)))
            continue
        picklecode = picklecode.decode("latin-1")
        if picklecode in copy:
            if verbose:
                print("checking name %r w/ code %r for consistency" % (
                      name, picklecode))
            d = copy[picklecode]
            if d.name != name:
                raise ValueError("for pickle code %r, pickle.py uses name %r "
                                 "but we're using name %r" % (picklecode,
                                                              name,
                                                              d.name))
            # Forget this one. Any left over in copy at the end are a problem
            # of a different kind.
            del copy[picklecode]
        else:
            raise ValueError("pickle.py appears to have a pickle opcode with "
                             "name %r and code %r, but we don't" %
                             (name, picklecode))
    if copy:
        msg = ["we appear to have pickle opcodes that pickle.py doesn't have:"]
        for code, d in copy.items():
            msg.append("    name %r with code %r" % (d.name, code))
        raise ValueError("\n".join(msg))

assure_pickle_consistency()
del assure_pickle_consistency

##############################################################################
# A pickle opcode generator.

def _genops(data, yield_end_pos=False):
    if isinstance(data, bytes_types):
        data = io.BytesIO(data)

    if hasattr(data, "tell"):
        getpos = data.tell
    else:
        getpos = lambda: None

    while True:
        pos = getpos()
        code = data.read(1)
        opcode = code2op.get(code.decode("latin-1"))
        if opcode is None:
            if code == b"":
                raise ValueError("pickle exhausted before seeing STOP")
            else:
                raise ValueError("at position %s, opcode %r unknown" % (
                                 "<unknown>" if pos is None else pos,
                                 code))
        if opcode.arg is None:
            arg = None
        else:
            arg = opcode.arg.reader(data)
        if yield_end_pos:
            yield opcode, arg, pos, getpos()
        else:
            yield opcode, arg, pos
        if code == b'.':
            assert opcode.name == 'STOP'
            break

def genops(pickle):
    """Generate all the opcodes in a pickle.

    'pickle' is a file-like object, or bytes object, containing the pickle.

    Each opcode in the pickle is generated, from the current pickle position,
    stopping after a STOP opcode is delivered. A triple is generated for
    each opcode:

        opcode, arg, pos

    opcode is an OpcodeInfo record, describing the current opcode.

    If the opcode has an argument embedded in the pickle, arg is its decoded
    value, as a Python object. If the opcode doesn't have an argument, arg
    is None.

    If the pickle has a tell() method, pos was the value of pickle.tell()
    before reading the current opcode. If the pickle is a bytes object,
    it's wrapped in a BytesIO object, and the latter's tell() result is
    used. Else (the pickle doesn't have a tell(), and it's not obvious how
    to query its current position) pos is None.
    """
    return _genops(pickle)

##############################################################################
# A pickle optimizer.

def optimize(p):
    'Optimize a pickle string by removing unused PUT opcodes'
    not_a_put = object()
    gets = { not_a_put }    # set of args used by a GET opcode
    opcodes = []            # (startpos, stoppos, putid)
    proto = 0
    for opcode, arg, pos, end_pos in _genops(p, yield_end_pos=True):
        if 'PUT' in opcode.name:
            opcodes.append((pos, end_pos, arg))
        elif 'FRAME' in opcode.name:
            # Old frames are dropped; framing is redone when writing below.
            pass
        else:
            if 'GET' in opcode.name:
                gets.add(arg)
            elif opcode.name == 'PROTO':
                assert pos == 0, pos
                proto = arg
            opcodes.append((pos, end_pos, not_a_put))

    # Copy the opcodes except for PUTs without a corresponding GET
    out = io.BytesIO()
    opcodes = iter(opcodes)
    if proto >= 2:
        # Write the PROTO header before any framing
        start, stop, _ = next(opcodes)
        out.write(p[start:stop])
    buf = pickle._Framer(out.write)
    if proto >= 4:
        buf.start_framing()
    for start, stop, putid in opcodes:
        if putid in gets:
            buf.write(p[start:stop])
    if proto >= 4:
        buf.end_framing()
    return out.getvalue()

##############################################################################
# A symbolic pickle disassembler.

def dis(pickle, out=None, memo=None, indentlevel=4, annotate=0):
    """Produce a symbolic disassembly of a pickle.

    'pickle' is a file-like object, or bytes object, containing at least
    one pickle. The pickle is disassembled from the current position, through
    the first STOP opcode encountered.

    Optional arg 'out' is a file-like object to which the disassembly is
    printed. It defaults to sys.stdout.

    Optional arg 'memo' is a Python dict, used as the pickle's memo. It
    may be mutated by dis(), if the pickle contains PUT or BINPUT opcodes.
    Passing the same memo object to another dis() call then allows disassembly
    to proceed across multiple pickles that were all created by the same
    pickler with the same memo. Ordinarily you don't need to worry about this.

    Optional arg 'indentlevel' is the number of blanks by which to indent
    a new MARK level. It defaults to 4.

    Optional arg 'annotate', if nonzero, instructs dis() to add a short
    description of the opcode on each line of disassembled output.
    The value given to 'annotate' must be an integer and is used as a
    hint for the column where annotation should start. The default
    value is 0, meaning no annotations.

    In addition to printing the disassembly, some sanity checks are made:

    + All embedded opcode arguments "make sense".

    + Explicit and implicit pop operations have enough items on the stack.

    + When an opcode implicitly refers to a markobject, a markobject is
      actually on the stack.

    + A memo entry isn't referenced before it's defined.

    + The markobject isn't stored in the memo.

    + A memo entry isn't redefined.
    """

    # Most of the hair here is for sanity checks, but most of it is needed
    # anyway to detect when a protocol 0 POP takes a MARK off the stack
    # (which in turn is needed to indent MARK blocks correctly).

    stack = []          # crude emulation of unpickler stack
    if memo is None:
        memo = {}       # crude emulation of unpickler memo
    maxproto = -1       # max protocol number seen
    markstack = []      # bytecode positions of MARK opcodes
    indentchunk = ' ' * indentlevel
    errormsg = None
    annocol = annotate  # column hint for annotations
    for opcode, arg, pos in genops(pickle):
        if pos is not None:
            print("%5d:" % pos, end=' ', file=out)

        line = "%-4s %s%s" % (repr(opcode.code)[1:-1],
                              indentchunk * len(markstack),
                              opcode.name)

        maxproto = max(maxproto, opcode.proto)
        before = opcode.stack_before    # don't mutate
        after = opcode.stack_after      # don't mutate
        numtopop = len(before)

        # See whether a MARK should be popped.
        markmsg = None
        if markobject in before or (opcode.name == "POP" and
                                    stack and
                                    stack[-1] is markobject):
            assert markobject not in after
            if __debug__:
                if markobject in before:
                    assert before[-1] is stackslice
            if markstack:
                markpos = markstack.pop()
                if markpos is None:
                    markmsg = "(MARK at unknown opcode offset)"
                else:
                    markmsg = "(MARK at %d)" % markpos
                # Pop everything at and after the topmost markobject.
                while stack[-1] is not markobject:
                    stack.pop()
                stack.pop()
                # Stop later code from popping too much.
                try:
                    numtopop = before.index(markobject)
                except ValueError:
                    assert opcode.name == "POP"
                    numtopop = 0
            else:
                errormsg = markmsg = "no MARK exists on stack"

        # Check for correct memo usage.
        if opcode.name in ("PUT", "BINPUT", "LONG_BINPUT", "MEMOIZE"):
            if opcode.name == "MEMOIZE":
                # MEMOIZE has no argument; it stores at the next free index.
                memo_idx = len(memo)
            else:
                assert arg is not None
                memo_idx = arg
            if memo_idx in memo:
                # Report memo_idx rather than arg, which is None for MEMOIZE.
                errormsg = "memo key %r already defined" % memo_idx
            elif not stack:
                errormsg = "stack is empty -- can't store into memo"
            elif stack[-1] is markobject:
                errormsg = "can't store markobject in the memo"
            else:
                memo[memo_idx] = stack[-1]
        elif opcode.name in ("GET", "BINGET", "LONG_BINGET"):
            if arg in memo:
                assert len(after) == 1
                after = [memo[arg]]     # for better stack emulation
            else:
                errormsg = "memo key %r has never been stored into" % arg

        if arg is not None or markmsg:
            # make a mild effort to align arguments
            line += ' ' * (10 - len(opcode.name))
            if arg is not None:
                line += ' ' + repr(arg)
            if markmsg:
                line += ' ' + markmsg
        if annotate:
            line += ' ' * (annocol - len(line))
            # make a mild effort to align annotations
            annocol = len(line)
            if annocol > 50:
                annocol = annotate
            line += ' ' + opcode.doc.split('\n', 1)[0]
        print(line, file=out)

        if errormsg:
            # Note that we delayed complaining until the offending opcode
            # was printed.
            raise ValueError(errormsg)

        # Emulate the stack effects.
        if len(stack) < numtopop:
            raise ValueError("tries to pop %d items from stack with "
                             "only %d items" % (numtopop, len(stack)))
        if numtopop:
            del stack[-numtopop:]
        if markobject in after:
            assert markobject not in before
            markstack.append(pos)

        stack.extend(after)

    print("highest protocol among opcodes =", maxproto, file=out)
    if stack:
        raise ValueError("stack not empty after STOP: %r" % stack)

# For use in the doctest, simply as an example of a class to pickle.
class _Example:
    def __init__(self, value):
        self.value = value

_dis_test = r"""
>>> import pickle
>>> x = [1, 2, (3, 4), {b'abc': "def"}]
>>> pkl0 = pickle.dumps(x, 0)
>>> dis(pkl0)
    0: (    MARK
    1: l        LIST       (MARK at 0)
    2: p    PUT        0
    5: L    LONG       1
    9: a    APPEND
   10: L    LONG       2
   14: a    APPEND
   15: (    MARK
   16: L        LONG       3
   20: L        LONG       4
   24: t        TUPLE      (MARK at 15)
   25: p    PUT        1
   28: a    APPEND
   29: (    MARK
   30: d        DICT       (MARK at 29)
   31: p    PUT        2
   34: c    GLOBAL     '_codecs encode'
   50: p    PUT        3
   53: (    MARK
   54: V        UNICODE    'abc'
   59: p        PUT        4
   62: V        UNICODE    'latin1'
   70: p        PUT        5
   73: t        TUPLE      (MARK at 53)
   74: p    PUT        6
   77: R    REDUCE
   78: p    PUT        7
   81: V    UNICODE    'def'
   86: p    PUT        8
   89: s    SETITEM
   90: a    APPEND
   91: .    STOP
highest protocol among opcodes = 0

Try again with a "binary" pickle.

>>> pkl1 = pickle.dumps(x, 1)
>>> dis(pkl1)
    0: ]    EMPTY_LIST
    1: q    BINPUT     0
    3: (    MARK
    4: K        BININT1    1
    6: K        BININT1    2
    8: (        MARK
    9: K            BININT1    3
   11: K            BININT1    4
   13: t            TUPLE      (MARK at 8)
   14: q        BINPUT     1
   16: }        EMPTY_DICT
   17: q        BINPUT     2
   19: c        GLOBAL     '_codecs encode'
   35: q        BINPUT     3
   37: (        MARK
   38: X            BINUNICODE 'abc'
   46: q            BINPUT     4
   48: X            BINUNICODE 'latin1'
   59: q            BINPUT     5
   61: t            TUPLE      (MARK at 37)
   62: q        BINPUT     6
   64: R        REDUCE
   65: q        BINPUT     7
   67: X        BINUNICODE 'def'
   75: q        BINPUT     8
   77: s        SETITEM
   78: e        APPENDS    (MARK at 3)
   79: .    STOP
highest protocol among opcodes = 1

Exercise the INST/OBJ/BUILD family.

>>> import pickletools
>>> dis(pickle.dumps(pickletools.dis, 0))
    0: c    GLOBAL     'pickletools dis'
   17: p    PUT        0
   20: .    STOP
highest protocol among opcodes = 0

>>> from pickletools import _Example
>>> x = [_Example(42)] * 2
>>> dis(pickle.dumps(x, 0))
    0: (    MARK
    1: l        LIST       (MARK at 0)
    2: p    PUT        0
    5: c    GLOBAL     'copy_reg _reconstructor'
   30: p    PUT        1
   33: (    MARK
   34: c        GLOBAL     'pickletools _Example'
   56: p        PUT        2
   59: c        GLOBAL     '__builtin__ object'
   79: p        PUT        3
   82: N        NONE
   83: t        TUPLE      (MARK at 33)
   84: p    PUT        4
   87: R    REDUCE
   88: p    PUT        5
   91: (    MARK
   92: d        DICT       (MARK at 91)
   93: p    PUT        6
   96: V    UNICODE    'value'
  103: p    PUT        7
  106: L    LONG       42
  111: s    SETITEM
  112: b    BUILD
  113: a    APPEND
  114: g    GET        5
  117: a    APPEND
  118: .    STOP
highest protocol among opcodes = 0

>>> dis(pickle.dumps(x, 1))
    0: ]    EMPTY_LIST
    1: q    BINPUT     0
    3: (    MARK
    4: c        GLOBAL     'copy_reg _reconstructor'
   29: q        BINPUT     1
   31: (        MARK
   32: c            GLOBAL     'pickletools _Example'
   54: q            BINPUT     2
   56: c            GLOBAL     '__builtin__ object'
   76: q            BINPUT     3
   78: N            NONE
   79: t            TUPLE      (MARK at 31)
   80: q        BINPUT     4
   82: R        REDUCE
   83: q        BINPUT     5
   85: }        EMPTY_DICT
   86: q        BINPUT     6
   88: X        BINUNICODE 'value'
   98: q        BINPUT     7
  100: K        BININT1    42
  102: s        SETITEM
  103: b        BUILD
  104: h        BINGET     5
  106: e        APPENDS    (MARK at 3)
  107: .    STOP
highest protocol among opcodes = 1

Try "the canonical" recursive-object test.

>>> L = []
>>> T = L,
>>> L.append(T)
>>> L[0] is T
True
>>> T[0] is L
True
>>> L[0][0] is L
True
>>> T[0][0] is T
True
>>> dis(pickle.dumps(L, 0))
    0: (    MARK
    1: l        LIST       (MARK at 0)
    2: p    PUT        0
    5: (    MARK
    6: g        GET        0
    9: t        TUPLE      (MARK at 5)
   10: p    PUT        1
   13: a    APPEND
   14: .    STOP
highest protocol among opcodes = 0

>>> dis(pickle.dumps(L, 1))
    0: ]    EMPTY_LIST
    1: q    BINPUT     0
    3: (    MARK
    4: h        BINGET     0
    6: t        TUPLE      (MARK at 3)
    7: q    BINPUT     1
    9: a    APPEND
   10: .    STOP
highest protocol among opcodes = 1

Note that, in the protocol 0 pickle of the recursive tuple, the disassembler
has to emulate the stack in order to realize that the POP opcode at 16 gets
rid of the MARK at 0.

>>> dis(pickle.dumps(T, 0))
    0: (    MARK
    1: (        MARK
    2: l            LIST       (MARK at 1)
    3: p        PUT        0
    6: (        MARK
    7: g            GET        0
   10: t            TUPLE      (MARK at 6)
   11: p        PUT        1
   14: a        APPEND
   15: 0        POP
   16: 0        POP        (MARK at 0)
   17: g    GET        1
   20: .    STOP
highest protocol among opcodes = 0

>>> dis(pickle.dumps(T, 1))
    0: (    MARK
    1: ]        EMPTY_LIST
    2: q        BINPUT     0
    4: (        MARK
    5: h            BINGET     0
    7: t            TUPLE      (MARK at 4)
    8: q        BINPUT     1
   10: a        APPEND
   11: 1        POP_MARK   (MARK at 0)
   12: h    BINGET     1
   14: .    STOP
highest protocol among opcodes = 1

Try protocol 2.

>>> dis(pickle.dumps(L, 2))
    0: \x80 PROTO      2
    2: ]    EMPTY_LIST
    3: q    BINPUT     0
    5: h    BINGET     0
    7: \x85 TUPLE1
    8: q    BINPUT     1
   10: a    APPEND
   11: .    STOP
highest protocol among opcodes = 2

>>> dis(pickle.dumps(T, 2))
    0: \x80 PROTO      2
    2: ]    EMPTY_LIST
    3: q    BINPUT     0
    5: h    BINGET     0
    7: \x85 TUPLE1
    8: q    BINPUT     1
   10: a    APPEND
   11: 0    POP
   12: h    BINGET     1
   14: .    STOP
highest protocol among opcodes = 2

Try protocol 3 with annotations:

>>> dis(pickle.dumps(T, 3), annotate=1)
    0: \x80 PROTO      3 Protocol version indicator.
    2: ]    EMPTY_LIST   Push an empty list.
    3: q    BINPUT     0 Store the stack top into the memo.  The stack is not popped.
    5: h    BINGET     0 Read an object from the memo and push it on the stack.
    7: \x85 TUPLE1       Build a one-tuple out of the topmost item on the stack.
    8: q    BINPUT     1 Store the stack top into the memo.  The stack is not popped.
   10: a    APPEND       Append an object to a list.
   11: 0    POP          Discard the top stack item, shrinking the stack by one item.
   12: h    BINGET     1 Read an object from the memo and push it on the stack.
   14: .    STOP         Stop the unpickling machine.
highest protocol among opcodes = 2

"""

_memo_test = r"""
>>> import pickle
>>> import io
>>> f = io.BytesIO()
>>> p = pickle.Pickler(f, 2)
>>> x = [1, 2, 3]
>>> p.dump(x)
>>> p.dump(x)
>>> f.seek(0)
0
>>> memo = {}
>>> dis(f, memo=memo)
    0: \x80 PROTO      2
    2: ]    EMPTY_LIST
    3: q    BINPUT     0
    5: (    MARK
    6: K        BININT1    1
    8: K        BININT1    2
   10: K        BININT1    3
   12: e        APPENDS    (MARK at 5)
   13: .    STOP
highest protocol among opcodes = 2
>>> dis(f, memo=memo)
   14: \x80 PROTO      2
   16: h    BINGET     0
   18: .    STOP
highest protocol among opcodes = 2
"""

__test__ = {'disassembler_test': _dis_test,
            'disassembler_memo_test': _memo_test,
           }

def _test():
    import doctest
    return doctest.testmod()

if __name__ == "__main__":
    import sys, argparse
    parser = argparse.ArgumentParser(
        description='disassemble one or more pickle files')
    parser.add_argument(
        'pickle_file', type=argparse.FileType('br'),
        nargs='*', help='the pickle file')
    parser.add_argument(
        '-o', '--output', default=sys.stdout, type=argparse.FileType('w'),
        help='the file where the output should be written')
    parser.add_argument(
        '-m', '--memo', action='store_true',
        help='preserve memo between disassemblies')
    parser.add_argument(
        '-l', '--indentlevel', default=4, type=int,
        help='the number of blanks by which to indent a new MARK level')
    parser.add_argument(
        '-a', '--annotate', action='store_true',
        help='annotate each line with a short opcode description')
    parser.add_argument(
        '-p', '--preamble', default="==> {name} <==",
        help='if more than one pickle file is specified, print this before'
        ' each disassembly')
    parser.add_argument(
        '-t', '--test', action='store_true',
        help='run self-test suite')
    parser.add_argument(
        '-v', action='store_true',
        help='run verbosely; only affects self-test run')
    args = parser.parse_args()
    if args.test:
        _test()
    else:
        annotate = 30 if args.annotate else 0
        if not args.pickle_file:
            parser.print_help()
        elif len(args.pickle_file) == 1:
            dis(args.pickle_file[0], args.output, None,
                args.indentlevel, annotate)
        else:
            memo = {} if args.memo else None
            for f in args.pickle_file:
                preamble = args.preamble.format(name=f.name)
                args.output.write(preamble + '\n')
                dis(f, args.output, memo, args.indentlevel, annotate)