Jim Cromie
[hidden email]
How much is your project worth? $3000K
Memory allocation enhancements in core (sv.c).
Perl is greedy about recycling sv-bodies: freed bodies are returned to an interpreter-global freelist, where they hang until process termination. Workloads which rotate through large populations of each body-type would be badly limited; eventually there would be no memory left. While this workload profile is arbitrary, so is the limitation.
More to the point, holding memory until process death, with no recourse, is sub-optimal. It's easy to imagine a case where it matters.
If this proposed work gets into 5.12, which is the measure of success, everyone who uses it will potentially see benefits over 5.10, depending upon their workload.
I have the following lines of development in separate git branches. I'm feeling good enough about their soundness and near-term includability to call them deliverables, to be followed by patches revised per review by the porters.
Having written the arena-set code in sv.c, I'd always had a notion to revisit this idea. Musing on it, I settled on a few first steps:
- adapt get_arenas() signature: arg2 sv_type -> (void*) reqid, and track allocations by that reqid
- propagate that to S_more_bodies(); a new macro shim does the body-root-pointer[svtype] deref
- add release_arenas()
Outer users of S_more_bodies() (disregarding the macro wrapper) keep their current interface; the arenas it provisions for each sv_type are transparently tracked, and can soon be reclaimed.
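The shape I have in mind is roughly the following. This is only a sketch; the names, types, and naive list-walking registry here are illustrative, not the actual sv.c code, which has its own allocators and per-interpreter storage.

#include <stdlib.h>

/* Illustrative only: a reqid-keyed arena registry.  Any (void*) that is
 * stable for the client's lifetime can serve as the reqid; the interp's
 * body-root pointers are just one such value. */
typedef struct arena_desc {
    void  *arena;                   /* the slab itself */
    size_t size;                    /* bytes in the slab */
    void  *reqid;                   /* who asked for it */
    struct arena_desc *next;
} arena_desc;

static arena_desc *arena_registry;  /* per-interpreter in real life */

void *
get_arena(size_t size, void *reqid)
{
    arena_desc *d = malloc(sizeof *d);
    d->arena = malloc(size);
    d->size  = size;
    d->reqid = reqid;
    d->next  = arena_registry;
    arena_registry = d;
    return d->arena;
}

/* Free every slab that was handed out for this reqid. */
void
release_arenas(void *reqid)
{
    arena_desc **p = &arena_registry;
    while (*p) {
        if ((*p)->reqid == reqid) {
            arena_desc *dead = *p;
            *p = dead->next;
            free(dead->arena);
            free(dead);
        }
        else
            p = &(*p)->next;
    }
}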
get_arena()/release_arenas() give a balanced API for clients to manage slabs of memory themselves. These slabs are low-level, and should be efficient enough for use by XS libraries.
All arenas are tracked by a unique reqid, including those behind the interpreter's freelists; the interp-global body-root pointers (the body-freelists), for which the reqid replaces the old sv_type index, are just one more example.
An XS lib to extend this into an object substrate seems practical (but is well out of scope here).
The immediate beneficiary is ptr-tables, a "cheap hash" used by internal functions like interpreter cloning. They're very transient entities, they carry no ref-counting overhead, and any refs they might hold are guaranteed to exist for their lifetime (XXX document this currently tacit requirement). The important CPAN user is Storable::freeze().
These tables are filled with PTR_TBL_ENT_t items, which are currently provisioned by S_more_bodies(sv_type = PTE_SVSLOT), and automatically inherit the new mechanics above.
Now, with a few more quick changes (structs sketched below):
- add a private pte-freelist pointer to ptr_table_t
- in ptr_table_*(), replace the global list with the private one
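For orientation, the 5.10-era structs look roughly like this; pte_freelist is the proposed addition (the other fields are as I recall them from perl.h, and the UV typedef is only to keep the sketch self-contained):

typedef unsigned long UV;           /* stand-in for perl's UV */

/* Roughly the existing layout (see perl.h). */
struct ptr_tbl_ent {
    struct ptr_tbl_ent *next;       /* hash-chain link */
    const void         *oldval;     /* key: old pointer */
    void               *newval;     /* value: new pointer */
};

struct ptr_tbl {
    struct ptr_tbl_ent **tbl_ary;       /* bucket array */
    UV                   tbl_max;       /* buckets - 1 */
    UV                   tbl_items;     /* entries stored */
    struct ptr_tbl_ent  *pte_freelist;  /* PROPOSED: private free ptes;
                                           the table itself can serve as
                                           the reqid for its pte slabs */
};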
When ptr_table_free is called, we know that:
- all slabs allocated for this ptr-table's entries have its reqid
- no other users of those ptes exist
- we can now choose how to recycle them; the choice is the client's: will the app reuse them?
If the client will reuse them immediately, there's no potential gain from the change. OTOH, if a server app nstore()s a large data structure in response to a rare query, it would be nice if the underlying memory were released back to the system/lib so that the next call to get_arena will succeed.
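Sketched against the structs and registry above (pool_arenas() here is hypothetical, just naming the "keep them for reuse" branch):

/* Hypothetical: stash this table's slabs on a shared pool for reuse. */
static void pool_arenas(void *reqid) { (void)reqid; /* ... */ }

void
ptr_table_free_sketch(struct ptr_tbl *tbl, int will_reuse)
{
    free(tbl->tbl_ary);            /* the bucket array */
    if (will_reuse)
        pool_arenas(tbl);          /* skip malloc/free churn on the next
                                      freeze of similar size */
    else
        release_arenas(tbl);       /* tbl is the reqid: every pte slab
                                      goes back to the system/lib now */
    free(tbl);
}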
The CPAN use case is important enough to warrant a closer look (rough arithmetic below):
- ~/.cpan/Metadata is 12M (is your frozen stuff bigger?)
- blead needs a 2**19 table to freeze it
- it's loaded once by the shell
- the shell is used for a while
- freezing isn't involved (so this is instructive only, not dispositive)
- the long CPAN shell shutdown corroborates the theory
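Back-of-the-envelope (my assumptions: a 64-bit build, a PTR_TBL_ENT_t of three pointers ~= 24 bytes, and a table that has doubled past roughly 2**18 items on its way to 2**19 buckets):

    2**19 buckets * 8 bytes/ptr   ~=  4 MB    for the bucket array
    >2**18 ptes   * 24 bytes/pte  ~=  6-12 MB of pte slabs

The exact figures depend on malloc overhead and how full the table gets, but several megabytes hang around until END-times either way.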
This comes close to commending the early-cleanup strategy for ptes.
On pure speed grounds (a WAG), private ptes may not be a win; the re-threading currently done by ptr_table_clear() WILL still be there for the next Storable serialization, so all subsequent iterations will be best-case wrt preallocated memory. Note that this also optimizes the current code for benchmark tests, thus creating a worst case against which to benchmark the new code.
But that's irrelevant. Often we don't want another fast serialization (.cpan/Metadata is an example), and having megabytes of memory tied up by freed ptr-table entries until END-times is a waste.
We could pool the freed slabs, eliminating the malloc/free overhead resulting from the patchset described above, which would make an iterative-freeze benchmark more fair.
This completes a general arena-management API started with get_arenas().
This is XS-level; since it's completely invisible at the Perl level, it's orthogonal to whatever scope-control constructs might be added to further leverage these mechanics. Those language considerations are OUT OF SCOPE for my project.
ptr-tables are a low-level pointer-map (hash) implementation, used for interpreter cloning in core, and notably by Storable to do fast %seen mapping during serialization. By extending the api as above, we allow Storable to presize its ptr-table according to whatever estimate a user gives.
For large tables, which are quite common [1] given that Storable is good at them [2], the reduction in work can be significant, and it all happens at the user/lib/system boundary:
- we don't make log2(N) memory requests to grow from the default 512 -> 2**18
- we do it once, and don't thrash the OS
- the user gets a simple way to tune serialization performance
I've hacked this into Storable [3], so a first estimate of the patch's effect is published.
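The presized constructor looks roughly like this; a sketch reusing the illustrative structs above, and the actual patch in [3] may differ in naming and detail:

/* Illustrative presized constructor: round the caller's estimate up to
 * a power of two and allocate the bucket array once, instead of growing
 * it by repeated doubling (and rehashing) from the default 512. */
struct ptr_tbl *
ptr_table_new_n_sketch(UV n)
{
    struct ptr_tbl *tbl = malloc(sizeof *tbl);
    UV buckets = 512;                       /* the current default */
    while (buckets < n)
        buckets <<= 1;                      /* size once, no rehashing */
    tbl->tbl_ary      = calloc(buckets, sizeof(struct ptr_tbl_ent *));
    tbl->tbl_max      = buckets - 1;
    tbl->tbl_items    = 0;
    tbl->pte_freelist = NULL;               /* proposed member, above */
    return tbl;
}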
[1] ~/.cpan/Metadata is 12M and needs a 2**19 table to freeze.
[2] http://cpanratings.perl.org/dist/Storable (5 stars on cpan.org, for speed)
[3] http://www.nntp.perl.org/group/perl.perl5.porters/2009/06/msg146907.html
At this point all tests pass, but more performance characterization is needed. Subsequent testing has challenged the results in [3], necessitating further investigation; valgrind confirms the reduction in instructions needed to complete the task, and cache miss-rates seem to improve, but the patch is still slower for large tables.
Approaches I'm considering:
- oprofile the code, using the same Storable-based stress test
- write new benchmarks; the current test lacks any resource competition
- damn the benchmarks, keep moving
Currently we keep free bodies in linked lists hung from PL_body_roots[], and reuse them in LIFO order; the last body freed is the first reused. This is good for cache freshness, but it's predictable, and current code may have tacit assumptions on it (i.e. bugs).
I have a patchset (WIP), sketched below, that:
- reworks PL_body_roots[] to store both head and tail
- has 2 functions, which push to the head (default) or the tail
- maps to the tail push with a -D macro
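Roughly what that looks like (a sketch of the WIP, not the final patch; the next-pointer-in-the-first-word trick is how the body freelists already chain, but everything else here is illustrative):

/* Each body-type gets head and tail pointers, so a freed body can be
 * pushed at either end.  Default (head) keeps today's LIFO reuse;
 * -DFIFO_FREEBODIES pushes at the tail, giving FIFO reuse and shaking
 * out any hidden LIFO assumptions. */
struct body_root {
    void *head;
    void *tail;
};

static void
push_head(struct body_root *r, void *body)   /* LIFO: reuse freshest */
{
    *(void **)body = r->head;
    r->head = body;
    if (!r->tail)
        r->tail = body;
}

static void
push_tail(struct body_root *r, void *body)   /* FIFO: reuse oldest */
{
    *(void **)body = NULL;
    if (r->tail)
        *(void **)r->tail = body;
    else
        r->head = body;
    r->tail = body;
}

#ifdef FIFO_FREEBODIES
#  define push_free_body(r, b) push_tail(r, b)
#else
#  define push_free_body(r, b) push_head(r, b)
#endif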
The patchset works in default mode, but crashes otherwise. Either I'll find my bug, or I've hit a LIFO dependency, or both.
There's a low probability of a LIFO bug, but the added complexity is low, the runtime and memory costs are minimal (the empty-list test becomes an equality check rather than !0, plus 16 new pointers per interpreter), and it can be #ifdef'd (-DFIFO_FREEBODIES) so there's usually zero cost; occasional testing with -D builds will suffice to catch any new bugs.
Repo trick:
- apply to 5.005 (i.e. as far back as possible) (can anyone automate this with git?)
- test for the LIFO bug there
- if no LIFO bug, apply 5.10 on top, test again
- if LIFO bugs, bisect them
Source approach:
- make the FIFO/LIFO choice per sv_type
- change each sv_type to FIFO iteratively, isolating problems
If both are done, they can also be combined.
All the details above need to be hashed out and/or explored; they provide a reasonable way to closely evaluate progress. I anticipate that the pumpkings will agree with my ordering above, but they may differ (or elaborate) on some details.
Other things I can do:
- set up a smoke-house, post regular results for visibility
- run my "ready" patches through the smoke-house, publish these too
- start gitting others' trees, testing them
- automate this, integrate with TS
- give TS patches back to ABE
- check the attic for old patches, resurrect the appropriate ones
I am newly available to work full-time on this :-(
Given how central this work is to perl-core, how it plays with the pumpking is rather critical to any schedule estimates. The basic work is done, so I'd expect most of my time & effort to go to things I cannot anticipate.
The aspect I fear most is getting bogged down debating the issues around how to do performance testing in the perl dist, what kind of Test:: library support is needed, how to collect results across many users, etc.
I'm also flexible wrt changing scope; I'd consider it ideal if my patches were quickly reviewed and accepted, and I had to rapidly switch to the enhancements, benchmarking, and Test::Performance considerations that may emerge.
I've been hacking in perl core for a while, as is evident here:
[jimc@groucho perl-git]$ git log | grep Cromie | wc
     91     374    3504
Those are >95% Author credits, not Integrations into maint-* lines. I suspect I've lost a few attributions too.
Among the non-trivial patches in there, I've had actual experience extending the code in the areas outlined above. And I've already done a lot of basic validation; I've got functioning patchsets for both ptr-table-new-n and private-arenas.
I'm pretty confident I can do this. The primary risk is my not entirely understanding what the pumpkings will expect for their money (the money is the new ingredient here; they seem to be fine with my free work ;-)
While this all sounds very good, I read many technical details but not much that I can understand about the goals.
I don't know much of the core guts, so let me ask what exactly the goals are:
1) to reduce memory consumption?
2) to speed up memory handling in perl core?
3) to return freed memory to the OS?
4) reduce memory fragmentation?
Or some combinations of the above?