Deep Dive into RPython GC Object Allocation Performance
In the realm of software development, the performance of garbage collection (GC) mechanisms plays a pivotal role in shaping the overall behavior of programs. RPython, a restricted variant of Python designed specifically for the PyPy project, boasts an impressive garbage collection component that excels in object allocation speed and memory management efficiency.
I. Setting Up the Testing Environment for RPython GC
A. Writing the Test Code
To explore the object allocation speed of RPython GC, we first crafted a basic test script:
class A(object):
    pass

def run(loops):
    # Initial test code
    for i in range(loops):
        a = A()
        a.i = i
However, initial tests revealed that RPython’s static optimizer might eliminate object allocation operations that went unused. To address this, we refined the code to ensure two instances remained alive during each loop iteration:
class A(object):
    pass

def run(loops):
    a = prev = None
    for i in range(loops):
        prev = a
        a = A()
        a.i = i
    print(prev, a) # Prevent the optimizer from removing the allocation operation
Additionally, we provided a version that left the field uninitialized:
import time

class A(object):
    pass

def run(initialize_field, loops):
    t1 = time.time()
    if initialize_field:
        a = prev = None
        for i in range(loops):
            prev = a
            a = A()
            a.i = i
        print(prev, a)
    else:
        a = prev = None
        for i in range(loops):
            prev = a
            a = A()
        print(prev, a)
    t2 = time.time()
    print(t2 - t1, 's')
    object_size_in_words = 2 # GC header and one integer field
    mem = loops * 8 * object_size_in_words / 1024.0 / 1024.0 / 1024.0
    print(mem, 'GB')
    print(mem / (t2 - t1), 'GB/s')
B. RPython Scaffold Code and Compilation
After adding the necessary RPython scaffold code, we employed the following command to compile the code into a binary file:
pypy rpython/bin/rpython targetallocatealot.py
This command translates RPython code into C code and further compiles it into a binary file that includes both our test code and the RPython garbage collector.
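For reference, RPython targets follow a standard convention: the module defines a target() function that returns the entry point to be compiled. A minimal sketch of the scaffold (the argument handling is our reconstruction from the command lines shown below, not the author's exact code):
def main(argv):
    # argv[1]: number of loop iterations; argv[2]: whether to initialize the field
    loops = int(argv[1])
    initialize_field = bool(int(argv[2]))
    run(initialize_field, loops)
    return 0

def target(driver, args):
    # entry point for the RPython translation toolchain
    return main, None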
II. Test Results and Performance Analysis
A. RPython GC Test Results
On a platform featuring an AMD Ryzen 7 PRO 7840U processor (running Ubuntu Linux 24.04.2), the test results were as follows:
$ ./targetallocatealot-c 1000000000 0
without initialization
<A object at 0x7c71ad84cf60>
<A object at 0x7c71ad84cf70>
0.433825 s
14.901161 GB
34.348322 GB/s
$ ./targetallocatealot-c 1000000000 1
with initialization
<A object at 0x71b41c82cf60>
<A object at 0x71b41c82cf70>
0.501856 s
14.901161 GB
29.692100 GB/s
B. Comparison with Boehm GC
To provide a point of contrast, we also assessed the performance of Boehm GC:
$ pypy rpython/bin/rpython --gc=boehm --output=targetallocatealot-c-boehm targetallocatealot.py
...
$ ./targetallocatealot-c-boehm 1000000000 0
without initialization
<A object at 0xffff8bd058a6e3af>
<A object at 0xffff8bd058a6e3bf>
9.722585 s
14.901161 GB
1.532634 GB/s
$ ./targetallocatealot-c-boehm 1000000000 1
with initialization
<A object at 0xffff88e1132983af>
<A object at 0xffff88e1132983bf>
9.684149 s
14.901161 GB
1.538717 GB/s
It should be noted that Boehm GC employs a conservative stack scanning approach and cannot move objects, which introduces additional complexity into the allocation process.
C. Performance Dissection and GC Behavior Exploration
1. Performance Dissection
Profiling the execution with the perf tool yielded the following key metrics (a sample invocation is sketched after this list):
- On average, each allocation required approximately 11 instructions and 2.1 cycles (including loop-related operations).
- The instructions-per-cycle (IPC) ratio reached 5.23, indicating high instruction-level parallelism.
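Such counters are typically collected with perf stat; the exact invocation below is our assumption, as the original flags were not recorded:
$ perf stat -e cycles,instructions ./targetallocatealot-c 1000000000 1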
2. GC Behavior Exploration
RPython GC determines the nursery size based on the L2 cache size. We can inspect relevant information using the PYPYLOG environment variable:
$ PYPYLOG=gc-set-nursery-size,gc-hardware:- ./targetallocatealot-c 1 1
[f3e6970465723] {gc-set-nursery-size
nursery size: 270336
[f3e69704758f3] gc-set-nursery-size}
[f3e697047b9a1] {gc-hardware
L2cache = 1048576
[f3e69705ced19] gc-hardware}
[f3e69705d11b5] {gc-hardware
memtotal = 32274210816.000000
[f3e69705f4948] gc-hardware}
[f3e6970615f78] {gc-set-nursery-size
nursery size: 4194304
[f3e697061ecc0] gc-set-nursery-size}
with initialization
NULL
<A object at 0x7fa7b1434020>
0.000008 s
0.000000 GB
0.001894 GB/s
When allocating 149 GiB of data (the 10-billion-iteration run below) with a 4 MiB nursery, we would expect roughly 149 GiB / 4 MiB ≈ 38,147 minor collections. We verified this through log analysis:
$ PYPYLOG=gc-minor:out ./targetallocatealot-c 10000000000 1
with initialization
<A object at 0x7991e3835980>
<A object at 0x7991e3835990>
5.315511 s
149.011612 GB
28.033356 GB/s
$ head out
[f3ee482f4cd97] {gc-minor
[f3ee482f53874] {gc-minor-walkroots
[f3ee482f54117] gc-minor-walkroots}
minor collect, total memory used: 0
number of pinned objects: 0
total size of surviving objects: 0
time taken: 0.000029
[f3ee482f67b7e] gc-minor}
[f3ee4838097c5] {gc-minor
[f3ee48380c945] {gc-minor-walkroots
$ grep "{gc-minor-walkroots" out | wc -l
38147
Log analysis of minor collection timing revealed that the GC accounted for approximately 2% of the total time.
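As a rough cross-check, the per-collection timings can be summed from the log with a short script (a sketch; it assumes the log file out produced above and the total runtime printed by the benchmark):
# Sum the "time taken:" entries from the PYPYLOG gc-minor output.
total_gc = 0.0
with open("out") as f:
    for line in f:
        line = line.strip()
        if line.startswith("time taken:"):
            total_gc += float(line.split(":", 1)[1])
print(total_gc, "s spent in minor collections")
# Divide by the 5.315511 s total runtime printed by the benchmark above
print(100.0 * total_gc / 5.315511, "% of total time")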
III. Exploration at the Machine Code Level
The fast path for allocation in RPython GC employs a straightforward bump pointer approach, with pseudocode as follows:
result = gc.nursery_free
# Advance the nursery_free pointer by totalsize
gc.nursery_free = result + totalsize
# Check if this allocation would exceed the nursery
if gc.nursery_free > gc.nursery_top:
    # If it does => collect the nursery and allocate
    result = collect_and_reserve(totalsize)
result.hdr = <GC flags and type id of A>
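To make the fast path concrete, here is a toy Python sketch of the same bump-pointer logic (illustrative only: the real GC manipulates raw memory addresses and object headers, and the Nursery class is our hypothetical name):
class Nursery(object):
    def __init__(self, size):
        self.nursery_free = 0        # next free "address"
        self.nursery_top = size      # end of the nursery
        self.minor_collections = 0

    def malloc(self, totalsize):
        # fast path: just bump the pointer
        result = self.nursery_free
        self.nursery_free = result + totalsize
        if self.nursery_free > self.nursery_top:
            # slow path: collect the nursery, then allocate at its start
            result = self.collect_and_reserve(totalsize)
        return result

    def collect_and_reserve(self, totalsize):
        self.minor_collections += 1
        self.nursery_free = totalsize
        return 0

nursery = Nursery(4 * 1024 * 1024)   # 4 MiB, as reported by gc-set-nursery-size
for _ in range(1000000):
    nursery.malloc(16)               # 16 bytes, the size of an instance of A
print(nursery.minor_collections)     # -> 3 (16 MB allocated / 4 MiB nursery)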
By analyzing the machine code of the compiled binary targetallocatealot-c, we roughly identified the core loop and annotated its key operations:
...
cb68:  mov    %rbx,%rdi
cb6b:  mov    %rdx,%rbx
# initialize the header of the object allocated in the previous iteration
cb6e:  movq   $0x4c8,(%rbx)
# loop termination check
cb75:  cmp    %rbp,%r12
cb78:  je     ccb8
# load nursery_free
cb7e:  mov    0x33c13(%rip),%rdx
# increment loop counter
cb85:  add    $0x1,%rbp
# add 16 (size of the object) to nursery_free
cb89:  lea    0x10(%rdx),%rax
# compare nursery_top with the new nursery_free
cb8d:  cmp    %rax,0x33c24(%rip)
# store the new nursery_free
cb94:  mov    %rax,0x33bfd(%rip)
# if the new nursery_free exceeds nursery_top, fall through to the slow path;
# otherwise jump back to the top of the loop
cb9b:  jae    cb68
# slow path from here on:
# save the live object from the last iteration to the GC shadow stack
cb9d:  mov    %rbx,-0x8(%rcx)
cba1:  mov    %r13,%rdi
cba4:  mov    $0x10,%esi
# perform a minor collection
cba9:  call   20800 <pypy_g_IncrementalMiniMarkGC_collect_and_reserve>
...
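The listing above can be produced with a standard disassembler; for example (our tool choice, not specified in the original):
$ objdump -d targetallocatealot-c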
IV. Performance in a Conventional Python Environment
We also executed the same logic as a regular Python 3 program on PyPy. Because PyPy is dynamically typed, instances of user-defined classes are relatively large (at least 7 words, i.e. 56 bytes). Instead, we can use integer objects for testing: they are heap-allocated and consist of two words, one for the GC header and one for the machine-word-sized integer value.
The test code is as follows:
import sys, time

def run(loops):
    t1 = time.time()
    a = prev = None
    for i in range(loops):
        prev = a
        a = i
    print(prev, a) # Ensure two objects remain alive
    t2 = time.time()
    object_size_in_words = 2 # GC header and one integer field
    mem = loops * 8 * object_size_in_words / 1024.0 / 1024.0 / 1024.0
    print(mem, 'GB')
    print(mem / (t2 - t1), 'GB/s')

def main(argv):
    loops = int(argv[1])
    run(loops)
    return 0

if __name__ == '__main__':
    sys.exit(main(sys.argv))
The results showed a marked contrast between running with and without the JIT:
$ pypy3 allocatealot.py 1000000000
999999998 999999999
14.901161193847656 GB
17.857494904899553 GB/s
$ pypy3 --jit off allocatealot.py 1000000000
999999998 999999999
14.901161193847656 GB
0.8275382375297171 GB/s
Analyzing the machine code generated by PyPy's JIT is more involved: we need to enable JIT logging with PYPYLOG=jit:out, locate the trace IR for the loop under the jit-log-opt logging category, and then use a script from the PyPy repository to disassemble the generated machine code.
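For example, the log can be produced like this (the output file name out is arbitrary):
$ PYPYLOG=jit:out pypy3 allocatealot.py 1000000000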
V. Summary and Outlook
A. Summary
The RPython GC’s fast allocation path is meticulously designed, enabling high-speed object allocation. This design is not novel but is a common approach in garbage collector design. Beyond the algorithm itself, the relentless progress of modern CPU architecture has been a significant driver of performance improvements. Running the same code on an older AMD processor used by a colleague yielded markedly inferior results, further underscoring the substantial impact of hardware advancements on program performance.
B. Outlook
As software systems grow increasingly complex and data volumes continue to swell, memory management technologies face unprecedented challenges. Looking ahead, we can explore and enhance memory management strategies in the following directions:
- Adaptive Memory Management: develop memory managers that can dynamically adjust their strategies based on the runtime characteristics of applications, to better accommodate diverse workloads.
- Hardware-Collaborative Design: work closely with hardware manufacturers to design memory management techniques that align with novel processor architectures, such as heterogeneous computing platforms, fully leveraging hardware features to boost memory access efficiency.
- Intelligent Garbage Collection: harness machine learning to enable garbage collectors to predict object lifetimes more accurately, optimize collection timing and strategy, and minimize disruptions to program execution.
- Distributed System Memory Management: in distributed computing environments, investigate efficient distributed memory management schemes that reduce data transfer overhead and enhance overall system performance.
The evolution of memory management technologies is crucial for improving software performance, reducing hardware costs, and advancing the entire information technology industry. We look forward to the emergence of more efficient and intelligent memory management solutions in the near future, which will provide robust support for addressing memory bottlenecks in complex systems.