Grant Proposal: Improving the Robustness of Unicode Support in Rakudo on MoarVM
Tue, 04-Apr-2017 by
Coke
The Grants Committee has received the following grant proposal for the March round.
Before the Committee members vote, we would like to solicit feedback from the Perl community on the proposal.
Review the proposal below and please comment here by April 12th, 2017.
The Committee members will start the voting process following that
and the conclusion will be announced approximately in one week.
# Improving the Robustness of Unicode Support in Rakudo on MoarVM
- Name:
Samantha McVey (samcv)
- Amount Requested:
7,500 USD
## Synopsis
Implement the Unicode Collation Algorithm, and improve the speed and spec conformance of the
text normalizer. Improve test coverage for the Unicode specs, and document our
compliance or lack of compliance with the Unicode spec.
## Benefits to the Perl Community
As Perl 6 starts to take off, it is increasingly important to provide robust
Unicode support. Perl 6 already provides some of the best Unicode support on
many levels compared to other programming languages. The goal of this project
is to make Perl 6's Unicode support production ready.
## Deliverables and Inch Stones
- General
  - Any deficits in our Unicode coverage will be documented in the course of work on this
    project. This is very important due to the vastness of the Unicode standard.
    Tests will be written for each deficit, unless it cannot be tested or input is needed
    from the rest of the Perl 6 team. In either case, the deficit will be
    documented in my reports for current and future developers of Perl 6 to reference.
  - Tests will be written to cover all of the relevant Unicode 9.0 test files, and
    the current tests that check the breaking of graphemes will be made more robust.
- Unicode Names
  - Hangul Syllables and other Unicode names shall be programmatically determined
    when generating the Unicode database.
- Unicode Collation Algorithm
  - Fully implement the Unicode Collation Algorithm, at least for
    language-nonspecific sorting.
  - Assess the needs for supporting language- and country-specific collation, and write
    a report on these needs.
- Text Normalization
  - Improve the performance of the text normalizer, and allow the normalizer
    to save state across multiple characters so that it properly supports grapheme breaking
    for all of Unicode 9.0 and beyond.
- Unicode Database Generation
  - The script used to generate the Unicode database shall be made deterministic,
    producing the same output file on every run.
    At present, about half of the file changes between runs even if no changes are made to the
    script.
  - The original script assumed that property values for
    Unicode characters were unique, which causes issues when two values conflict.
    The rewritten script shall resolve this problem.
  - Rewrite the Perl 5 script used to generate the Unicode database in Perl 6.
    This is related to the previous items: since a rewrite is needed anyway, it should be
    done in Perl 6 to make the script more maintainable.
  - Implement all relevant remaining Unicode properties from Unicode 9.0. This
    includes the properties needed to support the deliverables listed above.
  - Try to reduce the memory footprint of the Unicode database. Currently,
    the unicode.o binary file that is created is 4.1MB; I hope to cut that in half.
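
As a rough illustration of the two database generation issues above, here is a small Python sketch (Python is used only for illustration; the actual scripts are Perl 5 and Perl 6). It assumes the uniqueness problem concerns value names shared between properties, and that sorting keys before emitting is enough to make output reproducible; the proposal states neither, so both are assumptions. The function names and the output format are hypothetical. The aliases shown (`L` for `gc`, `bc` and `hst`) are real Unicode property value aliases.

```python
# Toy input: property short name -> {value alias: long value name}.
# The alias "L" really is shared by several Unicode properties.
PROPERTY_VALUES = {
    "gc": {"L": "Letter", "N": "Number"},
    "bc": {"L": "Left_To_Right"},
    "hst": {"L": "Leading_Jamo"},
}


def flat_lookup(props):
    """Fragile approach: one table keyed by the value alias alone."""
    table = {}
    for prop, values in props.items():
        for alias, long_name in values.items():
            # A later property silently overwrites an earlier one.
            table[alias] = (prop, long_name)
    return table


def scoped_lookup(props):
    """Key on (property, alias) so identical aliases cannot collide."""
    return {
        (prop, alias): long_name
        for prop, values in props.items()
        for alias, long_name in values.items()
    }


def emit(table):
    """Render rows in sorted key order, independent of insertion order."""
    rows = sorted(table.items())
    return "\n".join(f"{prop};{alias};{name}" for (prop, alias), name in rows)


scoped = scoped_lookup(PROPERTY_VALUES)
print(len(flat_lookup(PROPERTY_VALUES)), "entries in the flat table")
print(len(scoped), "entries in the scoped table")
print(emit(scoped))
```

The flat table keeps only one entry per alias, while the scoped table keeps all four, and `emit` produces the same text no matter what order the input was built in.
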
## Project Details
I have already started working on the rewrite of the Unicode database generation,
which is in [this public repository](https://github.com/samcv/UCD).
Some background: the Unicode database generation script we
currently use was added in 2012, and much of it has not changed since.
Although the current script is somewhat adequate, as I worked on
Unicode support over these past months it became clear that it was not easily
maintainable and had issues that would have required an extensive
rewrite of much of the script. A full rewrite was the prudent choice.
I have already done a lot of preliminary work, and I bring the knowledge I have
gained by working with the current script and the MoarVM internals.
### Plan
#### Month 1: Get database generation finalized
- Implement testing for resolving codepoints to bitfield rows, to ensure
all codepoints resolve to the correct rows
- Integrate testing of database values into the current Unicode database rewrite
(this will not be in roast, but will be done to ensure correctness during
development)
- Work with timotimo on a C implementation of indexed decoding of base40-compressed
Unicode Names (some work already done in the [UCD repo](https://github.com/samcv/UCD))
- Implement programmatic generation of codepoint names for Hangul Syllables
- Implement remaining properties and data in the rewrite and achieve parity with
our current script
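
To illustrate what programmatic generation of Hangul Syllable names involves, here is a minimal Python sketch of the arithmetic the Unicode Standard defines for these names. It is not the code from the UCD repository; the function name is hypothetical and Python is used only for illustration. The constants and jamo short names are those defined by the standard.

```python
S_BASE = 0xAC00                      # first precomposed Hangul syllable
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT          # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT          # 11172 syllables in total

# Jamo short names for leading consonants, vowels and trailing consonants.
JAMO_L = ["G", "GG", "N", "D", "DD", "R", "M", "B", "BB", "S",
          "SS", "", "J", "JJ", "C", "K", "T", "P", "H"]
JAMO_V = ["A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA", "WAE",
          "OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I"]
JAMO_T = ["", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG", "LM", "LB",
          "LS", "LT", "LP", "LH", "M", "B", "BS", "S", "SS", "NG", "J", "C",
          "K", "T", "P", "H"]


def hangul_syllable_name(codepoint):
    """Derive the name of a precomposed Hangul syllable from its codepoint."""
    s_index = codepoint - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError(f"U+{codepoint:04X} is not a Hangul syllable")
    l_index, rest = divmod(s_index, N_COUNT)
    v_index, t_index = divmod(rest, T_COUNT)
    return "HANGUL SYLLABLE " + JAMO_L[l_index] + JAMO_V[v_index] + JAMO_T[t_index]


print(hangul_syllable_name(0xAC00))  # HANGUL SYLLABLE GA
print(hangul_syllable_name(0xD55C))  # HANGUL SYLLABLE HAN
```

Because all 11,172 names follow from three short tables, none of them needs to be stored in the generated database.
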
#### Month 2: Finish the Unicode Collation Algorithm and integrate with MoarVM, add test coverage to roast
- Assess needs for language/country specific sorting for Unicode Collation Algorithm (UCA)
- Implement weights of codepoint sequences (only single codepoint is implemented currently)
- Implement decomposition of codepoints if no Collation weights are found
- Integrate the rewrite with MoarVM codebase, as well as rewrite all sections
of MoarVM
- Make needed changes in MoarVM so Property Values are not assumed unique.
- Integrate with MoarVM codebase
- Merge into MoarVM repository
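
The collation steps listed above (weights for codepoint sequences, and decomposition when no weights are found) can be sketched roughly as follows. This is a simplified Python illustration, not the MoarVM implementation. The weight values in `TOY_WEIGHTS` are invented for the example, and the `ch` entry is a hypothetical multi-codepoint entry. Real weights come from the Unicode collation data, and the full algorithm also derives implicit weights for unlisted characters, which this sketch does not.

```python
import unicodedata

# Invented (primary, secondary, tertiary) weights, for illustration only.
TOY_WEIGHTS = {
    "a": [(0x10, 0x20, 0x02)],
    "A": [(0x10, 0x20, 0x08)],       # differs from "a" only at the third level
    "b": [(0x11, 0x20, 0x02)],
    "c": [(0x12, 0x20, 0x02)],
    "h": [(0x13, 0x20, 0x02)],
    "ch": [(0x14, 0x20, 0x02)],      # hypothetical sequence sorting after "h"
    "\u0301": [(0x00, 0x24, 0x02)],  # combining acute: secondary weight only
}
MAX_SEQUENCE = max(len(key) for key in TOY_WEIGHTS)


def collation_elements(text):
    """Map text to collation elements, preferring the longest known sequence."""
    elements = []
    i = 0
    while i < len(text):
        for length in range(min(MAX_SEQUENCE, len(text) - i), 0, -1):
            chunk = text[i:i + length]
            if chunk in TOY_WEIGHTS:
                elements.extend(TOY_WEIGHTS[chunk])
                i += length
                break
        else:
            # No weights found: fall back to the canonical decomposition.
            decomposed = unicodedata.normalize("NFD", text[i])
            if decomposed == text[i]:
                raise KeyError(f"no toy weights for U+{ord(text[i]):04X}")
            elements.extend(collation_elements(decomposed))
            i += 1
    return elements


def sort_key(text):
    """Build a key: all primary weights, then secondary, then tertiary."""
    elements = collation_elements(text)
    key = []
    for level in range(3):
        if level:
            key.append(0)  # separator between levels
        key.extend(ce[level] for ce in elements if ce[level] != 0)
    return tuple(key)


words = ["ch", "h", "c", "b", "\u00e1", "A", "a"]
print(sorted(words, key=sort_key))
```

With these toy weights the result is `a`, `A`, `á`, `b`, `c`, `h`, `ch`, whereas a plain codepoint sort would put `A` first and `á` last.
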
## Project Schedule
I can begin work as soon as the grant is approved.
## Completeness Criteria
Tests will be committed to roast, and the rewritten database and Unicode Collation
Algorithm implementation will be merged into MoarVM.
## Bio
Although I am a fairly recent addition to the Perl 6 core developers, I have been
very busy in a few short months. I have two Perl 6 modules, IRC::TextColor
and URL::Find, and I am the lead developer of the Perl 6 syntax highlighter
for Atom/Github as well as for docs.perl6.org. I converted the site from
using the old Pygments highlighter to the new highlighter.
My contributions to Perl 6 have been focused on Unicode support in Perl 6,
making changes throughout Rakudo, NQP and MoarVM to achieve this. All of the
work I have already done on improving Unicode support in Perl 6 shows I am
capable of completing this project and am the best person for this grant.
In addition, I have already started work on rewriting the Unicode Database
generation and shrinking the size of the data needed to be loaded on startup.
### Incomplete list of Unicode work I have done in the last few months
#### Tests:
- Fixed several errata in roast related to our Unicode support, which had often
been present for a long time.
- Added a test based on GraphemeBreakTest.txt from Unicode and many others
to Unicode 9.0
- Updated other tests for Unicode 9.0 and reworked others for compliance.
#### MoarVM:
- Implemented a simple version of the Unicode Collation Algorithm.
- Added `coll` operator, `.collate` method and the `$*COLLATION` variable
to Rakudo. For more information see the
[documentation](https://docs.perl6.org/language/experimental#Collation) I have added on it.
- Added support for named codepoint sequences, which include the Named Sequences,
Emoji Sequences and Emoji ZWJ Sequences, for example `"\c[woman gesturing OK]"`.
- Implemented
[Unicode Name Aliases](https://github.com/MoarVM/MoarVM/commit/81618648) for
getting codepoints by name, and added [documentation](https://docs.perl6.org/language/unicode#Name_Aliases) for them.
- Implemented the 'Extend' Grapheme\_Cluster\_Break property which was new in Unicode 9.0.
We previously had no support for this property. This caused errors in grapheme
breakup and incorrect character count and segmentation.
- Implemented many other Grapheme\_Cluster\_Break fixes and added support for
most Emoji sequences. This fixes most Emoji made up of multiple codepoints,
fixing character counts and text segmentation for these extended grapheme clusters.
- Improved the speed of radix by 50% for non-ASCII decimal digits
(radix converts strings to their numeric representation/values)
- Improved the speed of text normalization, making slurping a Unicode heavy text
file 14% faster
- Improved the speed of `m:i/ /` regex matching by between 1.8x and 3.3x (depending on whether no match is found or a match is found at the beginning).
- Added a multitude of properties to our Unicode database.
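
To show why grapheme segmentation of emoji needs state carried across characters, here is a deliberately narrow Python sketch covering only the pairing of regional indicator symbols (the codepoints that make up flag emoji). It treats every other character as its own cluster, which is a simplification and not how MoarVM segments text; the function names are hypothetical.

```python
RI_FIRST, RI_LAST = 0x1F1E6, 0x1F1FF  # REGIONAL INDICATOR SYMBOL LETTER A..Z


def is_regional_indicator(ch):
    return RI_FIRST <= ord(ch) <= RI_LAST


def split_flags(text):
    """Pair up regional indicators; all other characters stand alone."""
    clusters = []
    ri_run = 0  # state: regional indicators seen so far in the current run
    for ch in text:
        if is_regional_indicator(ch):
            if ri_run % 2 == 1:
                clusters[-1] += ch   # completes the pair started just before
            else:
                clusters.append(ch)  # starts a new pair
            ri_run += 1
        else:
            clusters.append(ch)
            ri_run = 0
    return clusters


us_fr = "\U0001F1FA\U0001F1F8\U0001F1EB\U0001F1F7"  # letters U, S, F, R
print(len(us_fr), "codepoints,", len(split_flags(us_fr)), "clusters")
```

Whether a break falls between two regional indicators depends on how many came before them, so looking at only the two adjacent codepoints is not enough.
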
#### Rakudo:
- Added support for a large number of Unicode properties, handling Bool/Str/Int
return types for `uniprop`
- Implemented `uniprops` method in Rakudo
Comments (10)
Yes, yes and yes!
We've already seen some amazing unicode-related work done by samcv, and we definitely need more of this stuff.
+1 from me. samcv++ already done a ton of fixes, additions, and massive speed improvements with strings and Unicode stuff.
We want to keep 'em coming.
+1, thanks for working on better Unicode Support.
If anyone can get this done and done well, it's Samantha. She's already been hard at work making progress on the Unicode front. The changes proposed in this grant only make Unicode support in Perl 6 that much better.
+1
+1 This would be great. The internal support of the abstract and high level properties and UTF-8 encoding is pretty dang good, but it would be great to have all the other Unicode goodies and low level bits and bobs.
+1, it's work that needs to be done and Samantha has proven she can do it.
I agree wholeheartedly, samcv has already contributed real good stuff and what i've seen of her plans for the future looks good, too. +1
Extensive Unicode support is one of the unique selling points of Perl 6, currently challenged only by Swift. If we want to keep this competitive advantage, we need to stay on the ball and get what we have to be reliable and performant.
Samantha has already done great work and shown a deep understanding of the issues involved and the tenaciousness to see it through.
A definite +1
+1, unicode is one of the biggest reasons I use perl 6
Definite +1 from me. Strong Unicode support is, as others have mentioned, one of Perl 6's selling points, so this is a strategic thing to be funding. Samantha is absolutely the right person to fund to work on this; she has done excellent work so far on improving the state of Perl 6's Unicode support, both in terms of features and performance.