
Deep Dive into RPython GC Object Allocation Performance

In the realm of software development, the performance of garbage collection (GC) mechanisms plays a pivotal role in shaping the overall behavior of programs. RPython, a restricted variant of Python designed specifically for the PyPy project, boasts an impressive garbage collection component that excels in object allocation speed and memory management efficiency.

I. Setting Up the Testing Environment for RPython GC

A. Writing the Test Code

To explore the object allocation speed of RPython GC, we first crafted a basic test script:

class A(object):
    pass

def run(loops):
    # Initial test code
    for i in range(loops):
        a = A()
        a.i = i

However, initial tests revealed that RPython’s static optimizer might eliminate object allocation operations that went unused. To address this, we refined the code to ensure two instances remained alive during each loop iteration:

class A(object):
    pass

def run(loops):
    a = prev = None
    for i in range(loops):
        prev = a
        a = A()
        a.i = i
    print(prev, a)  # Prevent the optimizer from removing the allocation operation

Additionally, we provided a version that left the field uninitialized:

import time

def run(initialize_field, loops):
    t1 = time.time()
    if initialize_field:
        a = prev = None
        for i in range(loops):
            prev = a
            a = A()
            a.i = i
        print(prev, a)
    else:
        a = prev = None
        for i in range(loops):
            prev = a
            a = A()
        print(prev, a)
    t2 = time.time()
    print(t2 - t1, 's')
    object_size_in_words = 2  # GC header and one integer field
    mem = loops * 8 * object_size_in_words / 1024.0 / 1024.0 / 1024.0
    print(mem, 'GB')
    print(mem / (t2 - t1), 'GB/s')

B. RPython Scaffold Code and Compilation

After adding the necessary RPython scaffold code, we employed the following command to compile the code into a binary file:

pypy rpython/bin/rpython targetallocatealot.py

This command translates RPython code into C code and further compiles it into a binary file that includes both our test code and the RPython garbage collector.
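For reference, the scaffold usually amounts to a couple of hook functions. The sketch below follows the standard RPython "target" convention; it is hypothetical, since the exact contents of targetallocatealot.py are not reproduced in this article:

class A(object):
    pass

def run(loops):
    a = prev = None
    for i in range(loops):
        prev = a
        a = A()
        a.i = i
    print(prev, a)

def entry_point(argv):
    # the compiled binary takes the loop count as its first argument
    loops = int(argv[1])
    run(loops)
    return 0

def target(driver, args):
    # hook looked up by rpython/bin/rpython during translation
    return entry_point, None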

II. Test Results and Performance Analysis

A. RPython GC Test Results

On a platform featuring an AMD Ryzen 7 PRO 7840U processor (running Ubuntu Linux 24.04.2), the test results were as follows:

$ ./targetallocatealot-c 1000000000 0
without initialization
<a object at 0x7c71ad84cf60> <a object at 0x7c71ad84cf70>
0.433825 s
14.901161 GB
34.348322 GB/s
$ ./targetallocatealot-c 1000000000 1
with initialization
<a object at 0x71b41c82cf60> <a object at 0x71b41c82cf70>
0.501856 s
14.901161 GB
29.692100 GB/s

B. Comparison with Boehm GC

To provide a point of contrast, we also assessed the performance of Boehm GC:

$ pypy rpython/bin/rpython --gc=boehm --output=targetallocatealot-c-boehm targetallocatealot.py
...
$ ./targetallocatealot-c-boehm 1000000000 0
without initialization
<a object at 0xffff8bd058a6e3af> <a object at 0xffff8bd058a6e3bf>
9.722585 s
14.901161 GB
1.532634 GB/s
$ ./targetallocatealot-c-boehm 1000000000 1
with initialization
<a object at 0xffff88e1132983af> <a object at 0xffff88e1132983bf>
9.684149 s
14.901161 GB
1.538717 GB/s

It should be noted that Boehm GC employs a conservative stack scanning approach and cannot move objects, which introduces additional complexity into the allocation process.

C. Performance Dissection and GC Behavior Exploration

1. Performance Dissection

Leveraging the perf tool to profile the execution process yielded the following key metrics:

  • On average, each allocation operation required approximately 11 instructions and 2.1 cycles (including loop-related operations)
  • The instructions-per-cycle (IPC) ratio reached 5.23, indicating high CPU utilization
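For reference, figures like these can be collected with a plain perf invocation along the following lines (our own sketch; the exact command used for the original measurement is not preserved):

$ perf stat -e instructions,cycles ./targetallocatealot-c 1000000000 1

The IPC figure then falls out of the two counters directly: roughly 11 instructions per allocation divided by roughly 2.1 cycles gives the reported 5.23 instructions per cycle.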

2. GC Behavior Exploration

RPython GC determines the nursery size based on the L2 cache size. We can inspect relevant information using the PYPYLOG environment variable:

$ PYPYLOG=gc-set-nursery-size,gc-hardware:- ./targetallocatealot-c 1
[f3e6970465723] {gc-set-nursery-size
nursery size: 270336
[f3e69704758f3] gc-set-nursery-size}
[f3e697047b9a1] {gc-hardware
L2cache = 1048576
[f3e69705ced19] gc-hardware}
[f3e69705d11b5] {gc-hardware
memtotal = 32274210816.000000
[f3e69705f4948] gc-hardware}
[f3e6970615f78] {gc-set-nursery-size
nursery size: 4194304
[f3e697061ecc0] gc-set-nursery-size}
with initialization
NULL <a object at 0x7fa7b1434020>
0.000008 s
0.000000 GB
0.001894 GB/s

When allocating 149 GiB of data (ten billion 16-byte objects), the GC performed approximately 38,146 minor collections. That matches the arithmetic: 160 billion bytes divided by the 4 MiB nursery comes to roughly 38,147 nursery fills. We verified this through log analysis:

$ PYPYLOG=gc-minor:out ./targetallocatealot-c 10000000000 1
with initialization
<a object at 0x7991e3835980> <a object at 0x7991e3835990>
5.315511 s
149.011612 GB
28.033356 GB/s
$ head out
[f3ee482f4cd97] {gc-minor
[f3ee482f53874] {gc-minor-walkroots
[f3ee482f54117] gc-minor-walkroots}
minor collect, total memory used: 0
number of pinned objects: 0
total size of surviving objects: 0
time taken: 0.000029
[f3ee482f67b7e] gc-minor}
[f3ee4838097c5] {gc-minor
[f3ee48380c945] {gc-minor-walkroots
$ grep "{gc-minor-walkroots" out | wc -l
38147

Log analysis of minor collection timing revealed that the GC accounted for approximately 2% of the total time.
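The 2% figure can be reproduced by summing the "time taken" entries in the gc-minor log. A minimal sketch, assuming the log format shown in the excerpt above:

# Sum the per-collection times from the gc-minor log written to "out".
total = 0.0
with open("out") as f:
    for line in f:
        # each minor collection logs a line like "time taken: 0.000029"
        if "time taken:" in line:
            total += float(line.split("time taken:")[1])
print(total)  # total minor-collection time, ~2% of the 5.3 s run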

III. Exploration at the Machine Code Level

The fast path for allocation in RPython GC employs a straightforward bump pointer approach, with pseudocode as follows:

result = gc.nursery_free
# Advance the nursery_free pointer by totalsize
gc.nursery_free = result + totalsize
# Check if this allocation would exceed the nursery
if gc.nursery_free > gc.nursery_top:
    # If it does => collect the nursery and allocate
    result = collect_and_reserve(totalsize)
result.hdr = <GC flags and type id of A>
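To make the control flow concrete, here is a small pure-Python toy model of a bump-pointer nursery (illustrative only: the real allocator bumps a raw memory pointer, and a real collect_and_reserve also moves surviving objects out of the nursery):

# Toy bump-pointer nursery; names and sizes mirror the pseudocode above.
NURSERY_SIZE = 4 * 1024 * 1024  # 4 MiB, the size picked from the L2 cache

class Nursery(object):
    def __init__(self, size):
        self.nursery_free = 0       # bump pointer: next free offset
        self.nursery_top = size     # end of the nursery
        self.minor_collections = 0

    def allocate(self, totalsize):
        # Fast path: bump the pointer, then check against the top.
        result = self.nursery_free
        self.nursery_free = result + totalsize
        if self.nursery_free > self.nursery_top:
            # Slow path: "collect" the nursery and retry the allocation.
            result = self.collect_and_reserve(totalsize)
        return result

    def collect_and_reserve(self, totalsize):
        # Stand-in for a minor collection: reset the bump pointer.
        self.minor_collections += 1
        self.nursery_free = totalsize
        return 0

nursery = Nursery(NURSERY_SIZE)
for _ in range(10 ** 7):
    nursery.allocate(16)            # every object is two 8-byte words
print(nursery.minor_collections)    # ~38: same fills-per-byte ratio as measured above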

By analyzing the machine code of the compiled binary targetallocatealot-c, we roughly identified the machine code implementation of the core loop and attempted to annotate key operations:

...
cb68:   mov    %rbx,%rdi
cb6b:   mov    %rdx,%rbx
# Initialize the object header of the object allocated in the previous iteration
cb6e:   movq   $0x4c8,(%rbx)
# Loop termination check
cb75:   cmp    %rbp,%r12
cb78:   je     ccb8
# Load nursery_free
cb7e:   mov    0x33c13(%rip),%rdx
# Increment loop counter
cb85:   add    $0x1,%rbp
# Add 16 (size of object) to nursery_free
cb89:   lea    0x10(%rdx),%rax
# Compare nursery_top with new nursery_free
cb8d:   cmp    %rax,0x33c24(%rip)
# Store new nursery_free
cb94:   mov    %rax,0x33bfd(%rip)
# If new nursery_free exceeds nursery_top, fall through to the slow path; if not, start again at the top
cb9b:   jae    cb68
# Slow path from here on:
# Save live object from last iteration to the GC shadow stack
cb9d:   mov    %rbx,-0x8(%rcx)
cba1:   mov    %r13,%rdi
cba4:   mov    $0x10,%esi
# Perform minor collection
cba9:   call   20800 <pypy_g_IncrementalMiniMarkGC_collect_and_reserve>
...
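The listing above can be obtained from the binary with a standard disassembler, for example:

$ objdump -d targetallocatealot-c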

IV. Performance in a Conventional Python Environment

We also executed the same code as a regular Python 3 program on PyPy. Because PyPy is dynamically typed, instances of user-defined classes are relatively large (at least 7 words, or 56 bytes), so we used integer objects for the test instead. Integer objects are heap-allocated and consist of two words: one for the GC header and one for the machine-word-sized integer value.

The test code is as follows:

import sys, time


def run(loops):
    t1 = time.time()
    a = prev = None
    for i in range(loops):
        prev = a
        a = i
    print(prev, a)  # Ensure two objects remain alive
    t2 = time.time()
    object_size_in_words = 2  # one word for the GC header, one for the integer value
    mem = loops * 8 * object_size_in_words / 1024.0 / 1024.0 / 1024.0
    print(mem, 'GB')
    print(mem / (t2 - t1), 'GB/s')

def main(argv):
    loops = int(argv[1])
    run(loops)
    return 0

if __name__ == '__main__':
    sys.exit(main(sys.argv))

The results showed a marked contrast between running with and without the JIT:

$ pypy3 allocatealot.py 1000000000
999999998 999999999
14.901161193847656 GB
17.857494904899553 GB/s
$ pypy3 --jit off allocatealot.py 1000000000
999999998 999999999
14.901161193847656 GB
0.8275382375297171 GB/s

Analyzing the machine code generated by PyPy’s JIT is more involved: we need to enable JIT logging with PYPYLOG=jit:out, locate the trace IR for the loop under the jit-log-opt logging category, and then use a script from the PyPy repository to disassemble the generated machine code.
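Roughly, the workflow looks like this (a sketch only; we believe the disassembly helper lives at rpython/jit/backend/tool/viewcode.py in the PyPy repository, but the exact script and invocation may differ):

$ PYPYLOG=jit:out pypy3 allocatealot.py 1000000000
$ pypy rpython/jit/backend/tool/viewcode.py out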

V. Summary and Outlook

A. Summary

The RPython GC’s fast allocation path is meticulously designed, enabling high-speed object allocation. This design is not novel but is a common approach in garbage collector design. Beyond the algorithm itself, the relentless progress of modern CPU architecture has been a significant driver of performance improvements. Running the same code on an older AMD processor used by a colleague yielded markedly inferior results, further underscoring the substantial impact of hardware advancements on program performance.

B. Outlook

As software systems grow increasingly complex and data volumes continue to swell, memory management technologies face unprecedented challenges. Looking ahead, we can explore and enhance memory management strategies in the following directions:

  1. Adaptive Memory Management: Develop memory managers that can dynamically adjust strategies based on the runtime characteristics of applications to better accommodate diverse workloads.

  2. Hardware-Collaborative Design: Work closely with hardware manufacturers to design memory management technologies that align with novel processor architectures, such as heterogeneous computing platforms, fully leveraging hardware features to boost memory access efficiency.

  3. Intelligent Garbage Collection: Harness machine learning technologies to enable garbage collectors to more accurately predict object lifecycles, optimize collection timing and strategies, and minimize disruptions to program execution.

  4. Distributed System Memory Management: In distributed computing environments, investigate efficient distributed memory management solutions to reduce data transfer overhead and enhance overall system performance.

The evolution of memory management technologies is crucial for improving software performance, reducing hardware costs, and advancing the entire information technology industry. We look forward to the emergence of more efficient and intelligent memory management solutions in the near future, which will provide robust support for addressing memory bottlenecks in complex systems.
