Thursday, January 15, 2009

R challenge to proprietary stat software

The NY Times published a glowing story Tuesday (with a follow-up blog post Wednesday) on the success of R, an open source project that has grown to fill most statistical software needs.

R was launched in 1996 as a knockoff of the Bell Labs statistical programming language, S. As with Apache or Perl, much of the value comes from add-on packages, and it has grown a remarkable library of donated packages stored at CRAN, which is modeled on Perl’s CPAN.

I first came across R when running the MacStats website in the late 1990s, and recommended it to fellow academics (interested in stat software and notoriously cheap) back in August 2000.

From an economic or organizational standpoint, R is just a new act of the original open source story: user-innovators solving their own problems. Or, as Eric Raymond observed a decade ago, good software (especially open source software) comes from “scratching a developer's personal itch.” That scratching gave us the GNU Project, with programming language compilers, a text editor, and gradually bits and pieces of an operating system.

Once upon a time, statisticians had to write their own Fortran programs to run their analyses. Even today, most have better math and computer skills than the average college graduate.

So R, as with compilers, had a large pool of potential users who could write their own code (unlike, say, children’s edutainment software, whose users cannot). Also, university professors have autonomy over the use of their time (organizational slack) but often not a lot of discretionary cash. So spending a few days writing a library, rather than buying a $100 or $500 off-the-shelf package, made a certain economic sense.

When I was first evaluating R in 2000, the problem was the lack of a GUI. Statistics teachers often could program, but undergraduate psych (or business) students could not, nor would they be keen on navigating a command-line program.

To offer a GUI, R had a Windows version and started on Mac OS X with an X11 (Unix workstation GUI) implementation. Now it has a native UI version for OS X, in addition to Windows, Linux (4 flavors) and Solaris. It has scientific, social science, probability and domain-specific statistical packages contributed by users. Where once social scientists fought to find any implementation of partial least squares (since Herman Wold, the implementor of the original PLS package, died in 1992), there are now at least 4 PLS packages available (free) for R.

When I was recommending R back in 2000, it was rough but obviously ambitious in its goals. It’s gratifying to see how it’s evolved to success (and fame), even if it doesn’t teach us anything new about strategies for growing autonomous open source communities.
