PerlGPT Large Language Model, Phase 1
Wed, 20-Sep-2023 by
Saif Ahmed
A new grant application from [John Napiorkowski](https://metacpan.org/author/JJNAPIORK) and [Robert Grimes](https://metacpan.org/author/RMZG), this time targeting the development of a large language model trained
specifically to generate Perl code. These veteran coders suggest that using natural language to generate Perl
code could allow one to rapidly create new APIs and applications, or at least produce a skeleton to
flesh out into a more elaborate tool.
____
### Applicants:
John Napiorkowski & Robert Grimes
### Amount Requested
The budget for this project is $8,800 USD, to be made in one lump-sum payment upon completion.
This amount is calculated by multiplying the estimated minimum total hours of labor (220 hours) by the minimum acceptable hourly rate for an open-source Perl project ($40 / hour):
(220 hours) * ($40 / hour) = $8,800 USD
Please note this amount does NOT include the high cost of renting or purchasing the necessary GPU hardware equipment, which will graciously be donated by the ChatGPU.ai group as a service to the Perl community.
### Synopsis
This grant proposal is for phase 1 of the development of PerlGPT, a large language model (LLM) comparable to ChatGPT 4.0, additionally trained exclusively on Perl-related content. PerlGPT will be based on Meta's Code Llama (from Llama v2 or newer) language models, with all new components implemented in Perl where possible and released as free-and-open-source software (FOSS), unlike ChatGPT and other proprietary LLM systems.
[Code Llama: Open Foundation Models for Code](https://arxiv.org/pdf/2308.12950.pdf)
Code Llama's documentation currently lists the following primarily-supported languages: "Python, C++, Java, PHP, TypeScript, C#, and Bash". One of the outcomes of the PerlGPT project will be to effectively add Perl to this list. (PerlGPT may ultimately be released under an alternative name, such as "Perllama".)
Phase 1 consists of training a 13B-parameter language model using Perl-related stimulus/response pairs curated from Perl Monks, Stack Overflow, GitLab, GitHub, and other public Perl-specific data sources. Phase 1 will deliver an LLM capable of generating pure-Perl source code in collaboration with a Perl programmer, similar to Microsoft Bing and GitHub Copilot.
As with the [Perl TensorFlow grants](https://news.perlfoundation.org/post/tensor-flow-api-2), PerlGPT "is a long-term multi-phase grant series, aimed at radically changing the perception of Perl in the modern software development and AI industries."
(This grant proposal has been simplified and re-submitted, per the request of TPF's grant committee.)
### Benefits to the Perl Community
Phase 1 implements PerlGPT v1.0 and benefits the Perl community by enabling the creation of new pure-Perl libraries and applications on CPAN.
PerlGPT v1.0 is trained on pure-Perl source code examples and high-quality POD documentation from CPAN, GitLab, GitHub, and Bitbucket. PerlGPT is further trained on plain-English technical discussions of that code and its feature sets, gathered from Perl Monks and Stack Overflow.
For example, a programmer may want to create a new Perl API for some 3rd-party web platform such as the Amazon cloud. The programmer can write a plain-English description of their desired API features and functionality for accessing the Amazon cloud. They can also specify design decisions such as whether or not to utilize an MVC framework like Catalyst or Mojolicious, and they can even start stubbing out some Perl classes and subroutines with comments included where source code should be added.
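To make that workflow concrete, the stubbed-out starting point a programmer might hand to PerlGPT could look something like the sketch below. The package and method names are purely hypothetical illustrations on our part, not part of the proposal:

```perl
# Hypothetical stub a programmer might write before asking PerlGPT
# to fill in the implementation; all names here are illustrative only.
package My::Cloud::API;

use strict;
use warnings;

# Construct an API client; PerlGPT would be asked to add
# credential validation and connection setup here.
sub new {
    my ( $class, %args ) = @_;
    return bless {%args}, $class;
}

# List storage buckets; PerlGPT would be asked to implement
# the actual web-service call and response parsing.
sub list_buckets {
    my ($self) = @_;
    die 'not yet implemented';
}

1;
```

The programmer would then iterate with the model, asking it to replace each placeholder with working code plus matching tests and POD.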
In this example, PerlGPT v1.0 will work with the programmer to iteratively implement their desired Amazon cloud API in pure Perl, including a full-coverage test suite and POD documentation, etc. Once the API is working well enough for public release, the PerlGPT v1.0 LLM can even help the programmer execute the correct Dist::Zilla commands to build and upload the software to CPAN. Finally, many new independent Perl projects can be created with access to the Amazon cloud, thanks to the Perl API created and uploaded to CPAN with the help of PerlGPT v1.0!
These same benefits apply to any other non-Amazon API which somebody may want to create in Perl, or to any pure-Perl library or application that a programmer can think up. The sky is the limit! PerlGPT v1.0 dramatically increases the effectiveness and efficiency of creating new pure-Perl software.
### Deliverables
* An implementation of the PerlGPT v1.0 13B large language model based on the Code Llama language model, configured and built using Dist::Zilla. (100 hours labor)
* A comprehensive Perl test suite with automatically-provable coverage for 100% of the PerlGPT v1.0 LLM, using Test2 from CPAN. (45 hours labor)
* A carefully-written and explanatory collection of documentation with coverage for 100% of the PerlGPT v1.0 LLM, using normal POD fully compatible with CPAN. (20 hours labor)
* A small collection of user-friendly example Perl applications, using PerlGPT v1.0 LLM components to effectively showcase this project. (30 hours labor)
* A public GitLab repository with all source code and components of the PerlGPT v1.0 LLM, including unstable or experimental components. (5 hours labor)
* A public CPAN distribution with all stable source code and components of the PerlGPT v1.0 LLM. (5 hours labor)
* A public Docker repository with images containing all stable source code and components of the PerlGPT v1.0 LLM, along with all dependencies, ready to run out-of-the-box. (15 hours labor)
* The PerlGPT v1.0 LLM design does NOT yet support anything other than pure Perl source code. These features will be addressed in future grant proposals.
* This grant proposal specifically does NOT include PerlGPT phase 2 or beyond, such as XS or C or Perl internals, which is far beyond the scope of a single grant and will be addressed in future proposals.
### Project details
We will generate the PerlGPT language model by training a Llama foundational language model. This training will be done using a combination of both manually-curated and automatically-selected stimulus/response pairs, collected from public websites and data sources. We will not utilize any proprietary data or stimulus/response training sets taken from other proprietary language models, such as OpenAI's ChatGPT, etc.
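As a rough sketch of what one such curated stimulus/response record might look like, each pair could be stored as one JSON object per line (the JSON Lines format commonly used for training sets). The field names and schema here are our illustrative assumptions, not details from the proposal:

```perl
use strict;
use warnings;
use JSON::PP;    # core Perl module, no extra dependencies

# Hypothetical shape of a single curated training example.
my $pair = {
    stimulus => 'Write a Perl subroutine that sums a list of numbers.',
    response => 'sub sum_list { my $t = 0; $t += $_ for @_; return $t }',
    source   => 'perlmonks',     # public site the pair was curated from
    license  => 'CC BY-SA 4.0',  # tracked so the training set stays FOSS-safe
};

# Emit one compact, key-sorted JSON object per record.
my $jsonl_line = JSON::PP->new->canonical->encode($pair);
print "$jsonl_line\n";
```

Tracking a `source` and `license` field per record would also make it straightforward to demonstrate that no proprietary data entered the training set.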
Most of the technical details of how to train the PerlGPT language model can be found in the following papers:
[Training Language Models to Follow Instructions with Human Feedback, 3-4-2022](https://arxiv.org/pdf/2203.02155.pdf)
[Teaching Large Language Models to Self-Debug, 4-11-2023](https://arxiv.org/pdf/2304.05128.pdf)
[LLaMA: Open and Efficient Foundation Language Models, 2-27-2023](https://arxiv.org/pdf/2302.13971.pdf)
### Project Schedule
Total development time is estimated at 3 to 6 months, with the normal disclaimer about the difficulty of predicting software project durations.
During the first work cycle, lasting approximately 1 to 2 months, we will curate and implement the initial PerlGPT v1.0 training data set.
During the second work cycle, we will run the LLM training procedure and implement the Perl test suite. (The training may take a particularly long time on our currently-available GPU hardware, due to limitations in available grant funding for renting very expensive high-end GPU hardware.)
During the third work cycle, we will write the Perl documentation and implement the Perl example applications.
If a fourth work cycle is required, we will continue until the public releases on CPAN and DockerHub are complete.
### Completeness Criteria
This grant is deemed complete when all the above-listed deliverables are reviewed and accepted by the official TPF-assigned grant manager.
### Bio
We are both professional CPAN authors, with a current total of 98 CPAN distributions between the two of us.
[JJNAPIORK](https://metacpan.org/author/JJNAPIORK)
[RMZG](https://metacpan.org/author/RMZG)
John is the maintainer of Catalyst:
[Catalyst Runtime](https://metacpan.org/release/JJNAPIORK/Catalyst-Runtime-5.90130)
The Perl Foundation sponsored the Perl TensorFlow API phase 1, and we recently worked with our colleagues Will Braswell and Zaki Mughal to successfully utilize and showcase Perl TensorFlow in the new ChatGPU Navi AI system:
[Introduction to TensorFlow in Perl](https://www.youtube.com/watch?v=jCawycAEVYU)
We both live in the Austin, Texas area.
Comments (17)
I'd like to see this project move forward. I've used ChatGPT for programming questions and documentation and found it useful. I would love to see the results from a GPT project with a strong focus on Perl. If the project could augment and speed up areas such as coding and documentation and help identify modules that already do something I might be currently developing for my business or personal use that would be very helpful.
I would like to see funding for this project. I find that the more detail one provides in a traditional search for a solution, using carefully crafted keywords to describe a complex problem, the more useless matches result. I believe that being able to use a plain-English description to guide development would be extremely useful.
I would like to see this project developed. I have been using ChatGPT and GitHub Copilot currently, and would be very interested not only in the finished project itself, but in the reusable code/modules it produces along the way.
I would like to see this project developed. I'm in the early stages of learning Perl, but having an AI specifically for Perl code could be what the world of Perl needs to attract more people back to the language.
The PerlGPT grant proposal represents a pivotal step forward in the evolution of Perl programming.
Make no mistake, this effort is essential for the continued relevance of Perl in the ever-evolving world of software engineering.
It's not merely about creating another language model; it's about equipping Perl developers with a powerful, AI-driven partner that can enhance and streamline the software development process.
By providing PerlGPT, we are investing in the future of Perl, making it more accessible, efficient, and appealing to developers, especially those future developers who will come to rely on AI assistance.
I would like to see this as well.
PerlGPT phase 1 will allow Perl to stay competitive for new programmers.
I use Perl every day, and it would greatly help me in reducing my workload.
This will help make Perl more approachable to new programmers and help the old ones as well. Without this, who knows what will happen to Perl?
I am very much in favor of developing a Perl-specific AI, as existing AIs (such as ChatGPT) have proven far less than optimal as Perl-programming aids. Not only do they "hallucinate" too much (giving code that doesn't work, or references to CPAN modules that don't exist or aren't appropriate), but they also suffer from a lack of privacy. I think most of those pitfalls could be bypassed by the development of this proposed "PerlGPT" project.
PerlGPT promises to enhance efficiency and accessibility for Perl developers. While existing AIs, including ChatGPT, have been inadequate due to issues like privacy concerns, PerlGPT is tailored to avoid these problems. It promises a refined, private, and more effective AI tool, ensuring Perl's ongoing relevance and adaptability in the dynamic field of software engineering. It's built for individuals, so the answers to queries aren't robotic.
The use of LLMs to assist with the rapid development of source code is evidently the future for programming, given the list of supported languages referred to herein. Perl, missing from that list, currently therefore lags behind in this regard. The grant authors have identified a necessity for Perl. Implemented, this will greatly assist developers of Perl and I envisage would be likely to increase the number of users of Perl.
I would like to see PerlGPT going forward. GPT is available for almost every other language and as a Perl developer I don't see why we should be left out.
PerlGPT is going to be open source which should take care of privacy problems and security. If this goes ahead we should see more developers of Perl.
This is a great idea! This is the kind of development Perl needs. The developers listed are the correct people to make this happen.
I want to use this technology. I know other perl developers who would also.
I hope we can make this happen! Perl seems well positioned to regain market share as part of the LLM/AI revolution, as our favorite language is so rooted in linguistics and natural language. This really feels like a must-do in order for Perl to continue as a mainstream option.
I sure hope we see this great idea all the way through!
I consult ChatGPT to generate Perl code, and most of the time the result is wrong. I hope that your project will focus on our beloved Perl language and produce correct results. Good luck!
I think it would be very beneficial for the Perl language and its community to have this large language model at its disposal to help programmers. I hope to see this happening in the foreseeable future. Good luck!
> Code Llama's documentation currently lists the following primarily-supported languages: "Python, C++, Java, PHP, TypeScript, C#, and Bash". One of the outcome of the PerlGPT project will be to effectively add Perl to this list.
I don't understand what "effectively" means in this context. Is this project to modify Code Llama to support Perl, or is this to fork Code Llama?