2008Q2 Grant Proposal - The Perl Survey
Thu, 01-May-2008 by
* **Name:** Kieren Diment
* **Project Title:** The Perl Survey - From "Pilot" to Production
* **Synopsis:** In 2007 Kirrily Robert organised and administered the Perl survey (http://perlsurvey.org) to provide a snapshot of the Perl community. In particular she made significant effort to recruit as many people as possible, resulting in a sample size of around 4500 responses. While this is an excellent start on the design of a survey instrument, it can be improved in a number of ways.
The Perl Survey - From "Pilot" to Production
In 2007 Kirrily Robert organised and administered the Perl survey (http://perlsurvey.org) to provide a snapshot of the Perl community. In particular she made significant effort to recruit as many people as possible, resulting in a sample size of around 4500 responses.
While this is an excellent start on the design of a survey instrument, it can be improved in a number of ways. These are:
# Removal of as many open-ended questions as possible by recoding into closed categories.
# Improvement on existing analyses. A couple of interesting visualisations aside, existing analyses consist of descriptive statistics. A more sophisticated statistical analysis would be useful in order to establish links between different variables - for example looking at programmer seniority or community seniority versus platform and programming knowledge.
# The rich demographic data would be very useful if complemented by an attitude survey. This way more links can be made between individuals' demographic profiles and what they think of issues relating to Perl and the community surrounding it. I propose to rerun the Perl Survey in February 2009 with this included. This will track changes to the community, and provide useful measurements of community attitudes.
**Benefits to the Perl Community:**
Over the past few years, with the rise of other dynamic languages, Perl has often been described as having an "image problem" - misconceptions about the readability, maintainability and general "hackishness" of Perl code are commonplace. A more complete implementation and analysis of the Perl survey should help dispel some of this image problem, and provide a greater insight into the structure of the community. Although the Perl community is internally cohesive, there seems to be a problem with external communication.
Based on the aphorism 'know thyself' and the provision of high quality data analysis, this project should be seen partly as an attempt to move the discussion on within the community, and partly as a resource for people and companies that use or want to use Perl.
* Converting the original Perl survey into a low-friction, replicable instrument, which can be re-administered periodically to track the state of the community (codename: "The Perl Barometer").
* Scripted inferential statistical analysis for the existing Perl survey data, and for analysis of future runs of the survey.
* A written report extending the "official" report (available at http://xrl.us/bjp5z), as well as regular use.perl.org blog posts outlining progress (at http://use.perl.org/~singingfish)
* A better understanding of the community's attitudes towards the Perl language.
* A framework for assessing the attitudes to Perl of people outside the community.
* I will deliver a paper on this work at the Open Source Developers Conference (OSDC.au) in Sydney, Australia in December 2008 (audio and/or video and slides uploaded to the web).
* Leading a discussion group on Perl and the community at YAPC::EU in August 2008 and OSDC.au in December 2008 (the latter is a multilanguage conference with a strong Perl presence).
* Public svn or git repository for all work performed on this grant.
Stage 1. Cleaning up the Perl survey data file.
A number of questions are represented in the Data::PerlSurvey2007 datafile as arrays. For example, the 'Programming Languages Known' hash key contains an array listing all the programming languages checked by that individual. From a statistical point of view this leads to a data file that is difficult to analyse. The correct practice is to create a dummy variable for every option that could exist. In terms of the 'Programming Languages Known' question, this means that for every individual, each language should be stored as a hash key rather than as an array element, and the value of the key should be 1 if the language is known to the respondent, and 0 if not. There is also significant extra work in folding the 780 responses in "Other programming languages known" back into the main "Programming languages known" question before dummy variable coding. Unfortunately these responses will need to be processed manually, so this stage is labour intensive.
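The dummy variable coding described above can be sketched as follows. This is a minimal illustration in Python rather than the Perl or R the project itself would use; only the 'Programming Languages Known' key name comes from the proposal, while the sample records and the `dummy_code` helper are hypothetical.

```python
# Hypothetical respondent records; only the question name comes from the
# proposal text, the ids and answers are made up for illustration.
respondents = [
    {"id": 1, "Programming Languages Known": ["Perl", "C", "Python"]},
    {"id": 2, "Programming Languages Known": ["Perl"]},
    {"id": 3, "Programming Languages Known": ["Perl", "Java", "C"]},
]


def dummy_code(records, question):
    """Replace a list-valued answer with a 0/1 mapping over all options."""
    # Every option that appears anywhere in the data set becomes a variable.
    options = sorted({opt for rec in records for opt in rec[question]})
    coded = []
    for rec in records:
        known = set(rec[question])
        row = dict(rec)  # shallow copy so the input records are not modified
        row[question] = {opt: int(opt in known) for opt in options}
        coded.append(row)
    return options, coded


options, coded = dummy_code(respondents, "Programming Languages Known")
for row in coded:
    print(row["id"], row["Programming Languages Known"])
```

Free-text "other" responses would still need to be mapped by hand onto the main option list before a step like this is run.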
While there are no other obvious problems with the data file other than this, experience suggests that smaller issues will occur. These should be much less significant than the problem with the "other programming languages" question.
Stage 2. Detailed statistical analysis
Once the data file is cleaned, we can then code each variable into the appropriate statistical data type (i.e. continuous, ordinal, nominal or boolean) in preparation for a more detailed analysis. The open source statistical software R (http://r-project.org) will be used for this analysis, and the scripts to generate the analysis will be documented and stored in a public version control repository.
The first step in a serious statistical analysis of the Perl survey data is to assess the best data reduction procedure to use. Two likely candidates are cluster analysis and multidimensional scaling. This work can be time-consuming due to the need to select and evaluate the performance of a variety of distance functions and clustering algorithms. This is worthwhile with the current data set, as there is a large sample of high quality data, which means that we have considerable statistical power. An examination of some of the demographic variables (including measurements of Perl community involvement), and of the relationships between these and any patterns found, is likely to prove interesting. For example, we can see how language preferences differ between "old-timers" and newer programmers, and by level of CPAN contribution and other community involvement.
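To make the choice of distance function and clustering algorithm concrete, here is a minimal sketch using one possible pairing: Jaccard distance on dummy-coded answers with single-linkage agglomerative clustering. It is written in Python for illustration only (the proposal names R for the real analysis), the respondent profiles are invented, and nothing here implies that this pairing is the one the analysis would settle on.

```python
from itertools import combinations

# Hypothetical dummy-coded respondents (1 = language known).
profiles = {
    "r1": {"Perl": 1, "C": 1, "Java": 0, "Python": 0},
    "r2": {"Perl": 1, "C": 1, "Java": 0, "Python": 1},
    "r3": {"Perl": 0, "C": 0, "Java": 1, "Python": 1},
    "r4": {"Perl": 0, "C": 0, "Java": 1, "Python": 0},
}


def jaccard_distance(a, b):
    """1 - |intersection| / |union| over the options coded as 1."""
    set_a = {key for key, value in a.items() if value}
    set_b = {key for key, value in b.items() if value}
    union = set_a | set_b
    if not union:
        return 0.0  # two empty profiles are treated as identical
    return 1.0 - len(set_a & set_b) / len(union)


def single_linkage(items, distance, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""

    def gap(pair):
        # Single linkage: cluster distance is the closest pair of members.
        left, right = pair
        return min(distance(items[x], items[y]) for x in left for y in right)

    clusters = [[name] for name in sorted(items)]
    while len(clusters) > n_clusters:
        left, right = min(combinations(clusters, 2), key=gap)
        clusters.remove(left)
        clusters.remove(right)
        clusters.append(sorted(left + right))
    return sorted(clusters)


clusters = single_linkage(profiles, jaccard_distance, 2)
print(clusters)
```

Swapping `jaccard_distance` for another function, or `gap` for complete or average linkage, is the kind of comparison that makes this stage time-consuming.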
This detailed analysis will lead to a much better understanding of the structure of the survey, and this in turn will lead to refinements of the questionnaire based on the data. This leads to stage three: refinement and extension of the existing survey.
Stage 3. Refinement of questionnaire
Once we have a clear picture of the structure of the Perl community, we can then refine the existing questions. For example, from my point of view, there are missing questions about the proportion of work time spent on programming and related tasks. Discussion with the community will reveal others, and the data reduction process from stage two will also reveal other useful lines of questioning for future surveys. The job then is to find the shortest questionnaire for maximum benefit.
Stage 4. Development of Attitude Survey and/or Q methodology.
Attitude surveys are frequently used in social science to understand individuals and communities better. With careful design, they can be useful instruments for profiling the community. We propose to develop the questionnaire through two activities. The first is to analyse the existing corpus of text in mailing lists and on the web using an intelligent search strategy, and to read and code (tag with themes) the relevant parts of this corpus. The second is to run discussions in person. I hope to obtain funding from elsewhere to enable me to travel to YAPC::EU in August to run a discussion session on the community, based on prior analysis of publicly available sources. Failing this, I will be able to organise a session on IRC (including an attempt to recruit people who aren't IRC regulars, possibly with the help of CGI::IRC or similar). I will also run a second group at the Open Source Developers Conference in Sydney, Australia in December, where I will also present a paper on the Perl community. This has the advantage of being a multi-language conference where the local Perl community is well represented. Looking at external overlapping groups in conjunction with the in-group will provide useful extra information for framing an attitude survey. Experience suggests that this process should result in a well-targeted five-minute attitude survey.
I also mentioned using Q methodology. This is another clustering technique where participants rate a reasonably large number of statements on a continuum (actually a quasi-normal distribution - see the CPAN module Statistics::QMethod::QuasiNormalDist for more details). Following this, a correlation matrix between individuals is calculated and subjected to principal components analysis and orthogonal rotation. This procedure results in quantifiable clusters of points of view, and the structure of subjective opinions. The decision whether to use this method really rests on the results of text analysis and group discussions, and depends on the nature of the issues brought up in discussion.
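The first computational step of a Q study, the correlation matrix between individuals, can be sketched as below. This is a hedged illustration in Python with invented participants and an assumed nine-statement forced grid (1, 2, 3, 2, 1 slots for scores -2 to +2). It stops before the principal components analysis and rotation, and it does not use or reproduce the CPAN module mentioned above.

```python
from math import sqrt

# Hypothetical Q sorts: each participant scores the same nine statements,
# filling a forced quasi-normal grid (one -2, two -1, three 0, two +1, one +2).
q_sorts = {
    "alice": [2, 1, 1, 0, 0, 0, -1, -1, -2],
    "bob": [1, 2, 0, 1, 0, -1, 0, -2, -1],
    "carol": [-2, -1, -1, 0, 0, 0, 1, 1, 2],
}


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(sorts):
    """Correlate people with each other (not statements with each other)."""
    names = sorted(sorts)
    return {a: {b: pearson(sorts[a], sorts[b]) for b in names} for a in names}


matrix = correlation_matrix(q_sorts)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

Note that the correlations are between participants rather than between statements, which is what distinguishes Q methodology from a conventional factor analysis of questionnaire items.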
Stage 5. The Perl Survey 2009
We would plan to run the second Perl Survey in April 2009 on hosting donated by Shadowcat Systems. Data analysis after this should be straightforward and quick based on the pilot work done on the first survey. There are a number of options for delivery of the survey. I may ask Strategic Data in Melbourne, Australia for a donation of their services, use the existing code that Kirrily Robert wrote, or use the survey management software which I am currently writing as part of another project. My own code is meant to be a generic solution to survey delivery, with ease of administration and encouragement of good psychometric practice, which I hope to be able to release under an open source licence.
I'd anticipate that the survey should be run at 18-month to two-year intervals following this.
Stages 1 and 2 Immediate to July 2008
Cleanup and statistical analysis of 2007 Perl survey results. Plan to release report on analysis for July in order to provide basis for stage 3.
Stage 3, August 2008 to December 2008
Discussion groups and interviews, particularly at YAPC::EU and OSDC.au. Qualitative analysis of this data for reporting, and incorporation into attitude survey.
Stage 4, September 2008 - January 2009
Analysis of interview/discussion data, and development of attitude survey and/or Q sort.
Stage 5. Run the second Perl survey through February 2009, with reporting and analysis ready by April 2009.
I have been using Perl for research data management, visualisation, analysis and collection since 2002. In 2006 I became involved in the Catalyst project, and am currently the documentation manager. I was editor-in-chief for the 2006 and 2007 Catalyst Advent calendars and I was closely involved in the development of the Catalyst tutorial.
I have ten years' experience of questionnaire design and analysis in health care and management research settings. I have tutored statistics at undergraduate and postgraduate level to students in science, psychology, medicine and marketing. A selection of my publications (with full text) is available at http://xrl.us/bjwd7 (this does not include current publications in review, pre-2003 articles, and some conference papers).
I am currently trying to develop a substantial research project on open source communities, project viability and aspects of commercialisation. This is an extension of the work that I was involved with on Australian cross-sector R&D organisations from 2003 to 2005. The Perl survey is related to, but separate from my larger research programme.
I have been given maintainership of this project by its originator. The IRC transcript is below:
00:52 -!- Skud [n=Skud@nat02.sfo1.metaweb.com] has joined #perlnet
00:52 < Skud> hihi
00:56 < kd_the_realone> hey Skud
00:57 < kd_the_realone> Skud: I'm applying for a tpf grant to do some significant work on the perl survey
00:57 < kd_the_realone> really I just need either your blessing to be doing this, or for you to hand me maintainership of the project
01:04 < Skud> hi kd
01:04 < Skud> kd: yes, please do it
01:04 < Skud> i will happily hand you maintainership
01:04 < Skud> was going to look for someone to take over anyway
A grant for something somewhat non-codish. How interesting! I think this sort of thing has a lot of potential value to the community.
A grant for something somewhat non-codish. How interesting! I think this sort of thing has a lot of potential value to the community.
I like the depth of statistics involved and just making it repeatable and comparable will be useful.
As Dave repeated, it is quite interesting to have a proposal for something non-codish. But I have my doubts about the real usefulness of it for $3000.
Having participated in survey projects myself, and having a wife who is a statistician, I know how time consuming and challenging it is to get something like this right. Ambiguous or leading questions can spoil an entire data set. I think the applicant sounds experienced and has written an excellent proposal. $3000 is a very reasonable price for the amount of work involved.
I did not participate in the previous survey because of the requirement to submit my email address. I would like to hear how Kieren plans to address or mitigate the problem of ballot stuffing and/or how the survey data will be kept anonymous.
(Note: sorry about the silly username, I am actually the applicant).
Chris is correct about the amount I requested. I used the four survey projects I was involved in last year to estimate what would be reasonable to ask for (at well below market rate). I also discussed it with a colleague.
I have tried to get around to looking at the analysis of the data of the original survey on my own time, but it is a fiddly and time consuming job that I haven't been able to find the tuits for in my spare time.
Retaining anonymity is easy in terms of data storage. I'm not sure how many people did not complete the survey due to the email address requirement - this is partly what the discussion work would be about - to estimate the size of this population, and maybe estimate how they might differ from the survey sample. I could think about using a CAPTCHA, or something related for people who did not want to provide an email address.
I did think about splitting this project into three parts of $1000 each, comprising
2. discussions/scoping subsequent surveys and
3. running the second survey
However, I decided that the timing of the grants process, the timing of the conference season and my need to apply for a separate travel grant meant that this would be much less worthwhile than applying for the whole thing at once.
Also, the goal at the end of this process is a survey instrument that's easy to administer, and doesn't require a whole lot of time to do a thorough analysis. I think the value of providing a replicable survey is far greater than a one-off proposition.
I think something like this has enormous value to both the Perl community specifically and the Open Source community at large. Perl has a pretty large and mature community compared to some of the other "scripting" languages out there, and a professionally done study like this could give some real insight into what makes it work. I think it would be well worth the money.
Kirrily did a great job in organising the first Perl Survey. Through it I got to understand a lot more about the whole Perl community than I had experienced in just my little part of the world. I think it's superb that Kieren is willing to reorganise the survey (with Kirrily's blessing) so that we can evaluate the information from it with a real understanding of its statistical validity.
That Kieren is also able to bring a solid understanding of the social sciences into play is an even greater bonus. With his experience I am sure that we'll be asking the right questions, to get meaningful results.
A solid, repeatable survey is both excellent PR for Perl and something we can share to improve the Open Source world in general. Like the Perl Best Practices book, it's also something in which we'll be leading the open source languages. It'll also help answer that question of "So who's using Perl these days?"
Although $3000 is a lot of money, I think what we'll be getting from it, will be useful for many years into the future; for both Perl 5 and Perl 6.
Kieren's research comes at an essential time for Perl, as there is a growing chasm between the real state of Perl, both in development and in industry, and the general perception of Perl. This application will help to raise the understanding of Perl and its current evolutionary shape, not just for our own community but for the business community as well.
The survey and its associated reports should highlight areas where we can improve, it will give us an understanding of our strengths and weaknesses and in particular how Perl is seen by the wider communities.
My belief in the importance of his research is backed by the commitment of my company to Kieren's work. Shadowcat Systems will be sponsoring some of his research as well as providing free hosting for the associated questionnaire.