- Name: Kieren Diment
- Project Title: The Perl Survey - From "Pilot" to Production
- Synopsis: In 2007 Kirrily Robert organised and administered the Perl survey (http://perlsurvey.org) to provide a snapshot of the Perl community. In particular she made significant effort to recruit as many people as possible, resulting in a sample size of around 4500 responses. While the 2007 survey is an excellent start for the design of a survey instrument, it can be improved in a number of ways.
The Perl Survey - From "Pilot" to Production
In 2007 Kirrily Robert organised and administered the Perl survey (http://perlsurvey.org) to provide a snapshot of the Perl community. In particular she made significant effort to recruit as many people as possible, resulting in a sample size of around 4500 responses.
While the 2007 survey is an excellent start for the design of a survey instrument, it can be improved in a number of ways. These are:
- Removal of as many open-ended questions as possible by recoding into closed categories.
- Improvement on existing analyses. A couple of interesting visualisations aside, existing analyses consist of descriptive statistics. A more sophisticated statistical analysis would be useful in order to establish links between different variables - for example looking at programmer seniority or community seniority versus platform and programming knowledge.
- The rich demographic data would be very useful if complemented by an attitude survey. This way more links can be made between individuals' demographic profiles and what they think of issues relating to Perl and the community surrounding it. I propose to rerun the Perl Survey in February 2009 with this included. This will track changes to the community, and provide useful measurements of community attitudes.
Benefits to the Perl Community:
Over the past few years, with the rise of other dynamic languages, Perl has often been described as having an "image problem": misconceptions about the readability, maintainability and general "hackishness" of Perl code are commonplace. A more complete implementation and analysis of the Perl survey should help dispel some of this image problem, and provide a greater insight into the structure of the community. Although the Perl community is internally cohesive, there seems to be a problem with external communication.
Based on the aphorism 'know thyself' and the provision of high quality data analysis, this project should be seen partly as an attempt to move the discussion on within the community, and partly as a resource for people and companies that use or want to use Perl.
- Converting the original Perl survey into a low friction replicable instrument, which can be re-administered periodically to track the state of the community (codename: "The Perl Barometer").
- Scripted inferential statistical analysis for the existing Perl survey data, and for analysis of future runs of the survey.
- A written report extending the "official" report (available at http://xrl.us/bjp5z), as well as regular use.perl.org blog posts outlining progress (at http://use.perl.org/~singingfish)
- A better understanding of the community's attitudes towards the Perl language.
- A framework with which to assess the attitudes to Perl of people outside the community.
- I will deliver a paper on this work at the Open Source Developers Conference (OSDC.au) in Sydney, Australia, in December 2008 (audio and/or video and slides uploaded to the web).
- Leading a discussion group on Perl and the community at YAPC::EU in August 2008 and OSDC.au in December 2008 (the latter is a multilanguage conference with a strong Perl presence).
- Public svn or git repository for all work performed on this grant.
Stage 1. Cleaning up of the perl survey data file.
A number of questions are represented in the Data::PerlSurvey2007 datafile as arrays. So for example the 'Programming Languages Known' hash key contains an array listing all the programming languages checked by that individual. From a statistical point of view this leads to a data file that is difficult to analyse. The correct practice is to create a dummy variable for every option that could exist. Again in terms of the 'Programming Languages Known' question, this means that for every individual, each language should be stored as a hash key rather than as an array element, and the value of the key should be 1 if the language is known to the respondent, and 0 if not. There is also significant extra work in folding the 780 responses in "Other programming languages known" back into the main "Programming languages known" before dummy variable coding. Unfortunately these responses will need to be processed manually, so this stage is labour intensive.
While there are no other obvious problems with the data file, experience suggests that smaller issues will emerge. These should be much less significant than the problem with the "other programming languages" question.
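As an illustration only, the dummy variable coding described in this stage could be sketched as follows. The example is in Python for brevity; the record layout, the field names (`languages`, `other_languages`) and the recode table are hypothetical and are not taken from the Data::PerlSurvey2007 datafile.

```python
# Hypothetical respondent records; the real Data::PerlSurvey2007 layout may differ.
respondents = [
    {"id": 1, "languages": ["Perl", "C"], "other_languages": "ocaml"},
    {"id": 2, "languages": ["Perl", "Python"], "other_languages": ""},
    {"id": 3, "languages": ["Perl"], "other_languages": "O'Caml, haskell"},
]

# Hand-built mapping from free-text spellings to canonical language names.
# Curating this table is the manual, labour-intensive part of the stage.
OTHER_RECODE = {"ocaml": "OCaml", "o'caml": "OCaml", "haskell": "Haskell"}


def known_languages(record):
    """Merge the checkbox answers with the recoded free-text answers."""
    known = set(record["languages"])
    for raw in record["other_languages"].split(","):
        key = raw.strip().lower()
        # Spellings missing from the table are skipped here; in practice
        # they would be listed for manual review.
        if key in OTHER_RECODE:
            known.add(OTHER_RECODE[key])
    return known


def dummy_code(records):
    """Return (sorted column names, one dict of 0/1 indicators per respondent)."""
    merged = [known_languages(r) for r in records]
    columns = sorted(set().union(*merged))
    rows = [{lang: int(lang in known) for lang in columns} for known in merged]
    return columns, rows


columns, rows = dummy_code(respondents)
for record, row in zip(respondents, rows):
    print(record["id"], row)
```

Each respondent ends up with one key per language and a value of 1 or 0, which is the shape most statistical packages expect for multiple-response questions.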
Stage 2. Detailed statistical analysis
Once the data file is cleaned, we can then code each variable into the appropriate statistical data type (i.e. continuous, ordinal, nominal or boolean) in preparation for a more detailed analysis. The open source statistical software R (http://r-project.org) will be used for this analysis, and the scripts to generate the analysis will be documented and stored in a public version control repository.
The first step in a serious statistical analysis of the Perl survey data is to assess the best data reduction procedure to use. Two likely candidates are cluster analysis and multidimensional scaling. This work can be time-consuming due to the need to select and evaluate the performance of a variety of distance functions and clustering algorithms. This is worthwhile with the current data set, as there is a large sample of high quality data, which gives us considerable statistical power. Examining some of the demographic variables (including measurements of Perl community involvement), and the relationships between these and any patterns found, is likely to prove interesting. For example, we can see how language preferences differ between "old-timers" and newer programmers, and by level of CPAN contribution and other community involvement.
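To make the idea of a distance function concrete, here is a minimal sketch of one candidate for 0/1 dummy variables, the Jaccard distance, together with the pairwise distance matrix that cluster analysis or multidimensional scaling would take as input. The choice of Jaccard is an assumption made for illustration; the proposal does not commit to any particular distance function, and the actual analysis is planned in R. The example profiles are invented.

```python
def jaccard_distance(a, b):
    """1 - |intersection| / |union| over the options two respondents ticked."""
    ones_a = {key for key, value in a.items() if value}
    ones_b = {key for key, value in b.items() if value}
    union = ones_a | ones_b
    if not union:
        return 0.0  # convention: two empty profiles are treated as identical
    return 1.0 - len(ones_a & ones_b) / len(union)


def distance_matrix(rows, distance):
    """Pairwise distances between all respondents, as a list of lists."""
    n = len(rows)
    return [[distance(rows[i], rows[j]) for j in range(n)] for i in range(n)]


# Invented dummy-coded profiles for three respondents.
profiles = [
    {"Perl": 1, "C": 1, "Python": 0},
    {"Perl": 1, "C": 1, "Python": 1},
    {"Perl": 0, "C": 0, "Python": 1},
]

matrix = distance_matrix(profiles, jaccard_distance)
for line in matrix:
    print(["%.2f" % d for d in line])
```

Swapping in a different function for `jaccard_distance` is how alternative distance measures would be compared on the same data.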
This detailed analysis will lead to a much better understanding of the structure of the survey, and this in turn will lead to refinements of the questionnaire based on the data. This leads to stage three: refinement and extension of the existing survey.
Stage 3. Refinement of questionnaire
Once we have a clear picture of the structure of the Perl community, we can then refine the existing questions. For example, from my point of view, there are missing questions about the proportion of work time spent on programming and related tasks. Discussion with the community will reveal others, and the data reduction process from stage two will also reveal other useful lines of questioning for future surveys. The job then is to find the shortest questionnaire for maximum benefit.
Stage 4. Development of Attitude Survey and/or Q methodology.
Attitude surveys are frequently used in social science to understand individuals and communities better. With careful design, these can be useful instruments with which to better profile the community. We propose to develop the questionnaire by analysis of the existing corpus of text in mailing lists and on the web with an intelligent search strategy, and to read and code (tag with themes) relevant parts of this corpus. The second activity is to run discussions in person. I hope to obtain funding from elsewhere to enable me to travel to YAPC::EU in August to run a discussion session on the community, based on prior analysis of publicly available sources. Failing this, I will be able to organise a session on IRC (including an attempt to recruit people who aren't IRC regulars, possibly with the help of CGI::IRC or similar). I will also run a second group at the Open Source Developers Conference in Sydney, Australia in December, where I will also present a paper on the Perl community. This has the advantage of being a multi-language conference where the local Perl community is well represented. Looking at external overlapping groups in conjunction with the in-group will provide useful extra information for framing an attitude survey. Experience suggests that this process should result in a well targeted 5 minute attitude survey.
I also mentioned using Q methodology. This is another clustering technique where participants rate a reasonably large number of statements on a continuum (actually a quasi-normal distribution; see the CPAN module Statistics::QMethod::QuasiNormalDist for more details). Following this, a correlation matrix between individuals is calculated and subjected to principal components analysis and orthogonal rotation. This procedure yields quantifiable clusters of points of view and reveals the structure of subjective opinions. The decision whether to use this method rests on the results of the text analysis and group discussions, and depends on the nature of the issues brought up in discussion.
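The first computational step of the Q methodology described above, correlating individuals rather than variables, can be sketched as follows. This is a simplified illustration in Python: the participants, the nine statements and the forced 1-2-3-2-1 shape are all invented, the code does not use or reproduce the API of Statistics::QMethod::QuasiNormalDist, and the principal components analysis and rotation steps are left out.

```python
from collections import Counter
from math import sqrt

# Invented forced distribution: how many statements may sit at each score.
FORCED_SHAPE = {-2: 1, -1: 2, 0: 3, 1: 2, 2: 1}

# Invented Q sorts: one score per statement, in the same statement order.
q_sorts = {
    "p1": [-2, -1, -1, 0, 0, 0, 1, 1, 2],
    "p2": [-1, -2, 0, -1, 0, 1, 0, 2, 1],
    "p3": [2, 1, 1, 0, 0, 0, -1, -1, -2],
}


def follows_forced_shape(sort):
    """Check that a sort uses each score exactly as often as the shape allows."""
    return dict(Counter(sort)) == FORCED_SHAPE


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Person-by-person correlation matrix, keyed by participant.
corr = {
    a: {b: pearson(q_sorts[a], q_sorts[b]) for b in q_sorts}
    for a in q_sorts
}

for person, line in corr.items():
    print(person, {other: round(r, 2) for other, r in line.items()})
```

In this toy data p1 and p2 share a similar point of view while p3 holds the opposite one; the subsequent factor analysis would group participants along such lines.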
Stage 5. The Perl Survey 2009
We plan to run the second Perl Survey in April 2009 on hosting donated by Shadowcat Systems. Data analysis after this should be straightforward and quick based on the pilot work done on the first survey. There are a number of options for delivery of the survey. I may ask Strategic Data in Melbourne, Australia for a donation of their services, use the existing code that Kirrily Robert wrote, or use the survey management software which I am currently writing as part of another project. My own code is meant to be a generic solution to survey delivery, with ease of administration and encouragement of good psychometric practice, which I hope to be able to release under an open source licence.
I anticipate that the survey should be run at 18-month to two-year intervals following this.
Stages 1 and 2 Immediate to July 2008
Cleanup and statistical analysis of 2007 Perl survey results. Plan to release report on analysis for July in order to provide basis for stage 3.
Stage 3, August 2008 to December 2008
Discussion groups and interviews, particularly at YAPC::EU and OSDC.au. Qualitative analysis of this data for reporting, and incorporation into attitude survey.
Stage 4, September 2008 to January 2009
Analysis of interview/discussion data, and development of attitude survey and/or Q sort.
Stage 5. Run the second Perl survey through February 2009, with reporting and analysis ready by April 2009.
I have been using Perl for research data management, visualisation, analysis and collection since 2002. In 2006 I became involved in the Catalyst project, and am currently the documentation manager. I was editor-in-chief for the 2006 and 2007 Catalyst Advent calendars and I was closely involved in the development of the Catalyst tutorial.
I have ten years' experience of questionnaire design and analysis in health care and management research settings. I have tutored statistics at undergraduate and postgraduate level to students in science, psychology, medicine and marketing. A selection of my publications (with full text) is available at: http://xrl.us/bjwd7 (this does not include current publications in review, pre-2003 articles, and some conference papers).
I am currently trying to develop a substantial research project on open source communities, project viability and aspects of commercialisation. This is an extension of the work that I was involved with on Australian cross-sector R&D organisations from 2003 to 2005. The Perl survey is related to, but separate from my larger research programme.
I have been given maintainership of this project by its originator. The IRC transcript is below:
! Skud [n=[email protected]] has joined #perlnet
00:52 < Skud> hihi
00:56 < kd_the_realone> hey Skud
00:57 < kd_the_realone> Skud: I'm applying for a tpf grant to do some significant work on the perl survey
00:57 < kd_the_realone> really I just need either your blessing to be doing this, or for you to hand me maintainership of the project
01:04 < Skud> hi kd
01:04 < Skud> kd: yes, please do it
01:04 < Skud> i will happily hand you maintainership
01:04 < Skud> was going to look for someone to take over anyway