Free statistical software

From Citizendium
Revision as of 22:43, 17 April 2009 by imported>Gene Shackman (adding section on how to get data)

Free statistical software is a practical alternative to commercial packages. In general, free statistical software gives the same results as commercial programs, and most of the packages are fairly easy to learn because they use menu systems, although a few are command-driven. These packages come from a variety of sources, including governments, non-governmental organizations (NGOs) like UNESCO, and universities, and some are developed by individuals. Many free software packages have been used in academic research published in peer-reviewed journals or in publications from major organizations.

Some packages are developed for specific purposes (e.g., time series analysis, factor analysis, calculators for probability distributions, etc.), while others are general packages, with a variety of statistical procedures. This article is a review of the general statistical packages.

Sources of free statistical software

Some of the free software packages come from governmental organizations or NGOs, such as Epi Info, from the CDC[1], and IDAMS, from UNESCO[2]. Others come from smaller or independent organizations or universities, such as Instat[3] or Irristat[4]. Another package, the R project[5], is being developed by a group of volunteers. A large proportion of free statistical software packages, however, come from individuals. These include EasyReg[6], MicrOsiris[7], OpenStat[8], and Zelig[10], as well as PSPP[9], a free clone of SPSS from the GNU project.

In some cases, the statistical software packages were developed for the purposes of making key technologies available to those who could not otherwise afford them, to empower development[11],[12]. In other cases, the packages were developed as teaching aids[8],[3]. Other packages were developed for specific purposes but can be more generally used. Examples are Irristat[4], developed for agricultural analysis, and Epi Info[1], developed for public health. A couple of packages don't appear to give any statements about why they were developed, other than just general use for statistical analysis[9],[5],[7].

Reviews of free statistical software

There are a few reviews of free statistical software. Two appeared in journals (but were not peer reviewed): one by Zhu and Kuljaca[13] and another by Grant, which mainly included a brief review of R[14]. Zhu and Kuljaca outlined some useful characteristics of software, such as ease of use, a range of statistical procedures, and the ability to develop new procedures. They reviewed several programs and identified which ones, at that time, had the most functionality. At that time, several of the programs may not have had all of the desired capabilities for advanced statistics. Grant reviewed some of the programming features of R and briefly mentioned the availability of other programs. One other paper reviewed statistical packages, mainly commercial ones, but included R[15]. One article reviewed EasyReg and included a discussion of its accuracy[16].

Only one review has compared the output of various packages[17]. In this review, all of the packages read either CSV files (Comma Separated Values: text files in which values are separated by commas) or Excel format. All of the packages gave exactly the same results for correlation and regression, and the free packages also gave the same regression results as Excel. One of the main differences among the packages was how they handled missing data. With the example data sets used in the review, and for the package versions available in November 2006 when the review was conducted, two packages, MicrOsiris and Epi Info, could read files with blanks for missing values. Two other programs, Stat4U and WinIDAMS, needed a placeholder for missing values, such as -9 or -9.99. The other packages could only handle data sets with no missing values.
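As a rough illustration of the quantities such a comparison checks, the sketch below computes a Pearson correlation and a simple least-squares regression in plain Python. The data values and the function names (pearson_r, simple_regression) are hypothetical, chosen only for this example; they are not taken from the review. Any package that implements these textbook formulas should agree on the results up to rounding.

```python
from math import sqrt


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def simple_regression(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept


# Hypothetical data, for illustration only (no missing values).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
slope, intercept = simple_regression(x, y)
print(round(r, 4), round(slope, 4), round(intercept, 4))
```

Running the same small data set through two packages and comparing these three numbers is one simple way to check that they agree.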

Two websites that list software also have very brief reviews of each package: StatCon[18] and Pezzullo's site[19]. These sites mainly offer a brief list of the features available in each package. Similarly, one bachelor's thesis compares the statistical procedures available in free statistical packages[20]. In that comparison, R had all of the procedures, OpenStat had 16, MacAnova had 15, and MicrOsiris had 12. The others had from 8 to 11 of the procedures.

There is also a journal specifically for statistical software[21], although the main focus is on commercial software, R and some coding snippets.

These free software packages have been used in a number of scholarly publications, which suggests that at least some journals, NGOs, and other organizations regard the packages as valid. For example, OpenStat was used in a research letter to JAMA[22] and in several published studies[23],[24],[25]. Irristat was used in an agricultural report[26], EasyReg is listed or used in several papers[27],[28],[29], various versions of Epi Info were used in others[30],[31],[32], and WinIDAMS was used in two more[33],[34].

While MicrOsiris doesn't appear to have been used in academic research, the author of the program was one of the original authors of OSIRIS[35], the program from which WinIDAMS was developed[36]. The author of MicrOsiris has also contributed or co-contributed several components to WinIDAMS[36].

Using free statistical software

Before using any statistical package, it is generally a good idea to have a solid background in statistics. The packages can then be used to best advantage: for example, to choose the most appropriate test and to make sure all the necessary assumptions are met, so that appropriate conclusions can be drawn.

Once the statistical issues are understood, the next step is to decide which package to use. Most of these packages are menu driven and can be learned in a couple of hours at most. The exceptions are R, which is generally code driven and takes much longer to learn, and to some extent CDC's Epi Info, which also takes some time to learn.

Several of the packages also have tutorials, which provide a basic introduction to the programs. For example, CDC has tutorials about Epi Info[37],[38]. The CDC page also lists a video slide show tutorial from the University of Nebraska[39], and another site has online training classes[40]. R has a large number of tutorials and manuals, in English and other languages[41],[42],[43], and an FAQ site[44]. A few of the packages have email discussion lists, including R[45], OpenStat[46], and PSPP[47].

Most of the packages have online manuals, guides or help pages, which are useful when there are questions about specific procedures or statistical tests. Manuals or guides are available for R[48], EasyReg[49], OpenStat[8], PSPP[50], ViSta[51], WinIDAMS[52],[53], MicrOsiris[54] and Zelig[55]. The CDC Epi Info site itself does not have a manual, but a faculty member from Emory's School of Public Health has an introductory manual[56].

Menu-driven packages

Many of the packages have some kind of opening menu that is used to get or enter the data, manipulate the data, and select the statistical analysis. One example, from MicrOsiris, is this:

[Figure: MicrOsiris starting menu. Image © Neal Van Eck]

In general, users import data from a text file, a spreadsheet, or some other format. The menus then allow the user to manipulate the data, for example by performing new calculations, selecting certain cases, or creating new variables. Finally, users can click on a statistical procedure and get output.

Command-driven packages

A few programs, like WinIDAMS and R, need commands for many of their procedures.

Getting data

Most packages are able to import data from Excel or CSV files (text with commas separating values). An example of a CSV file might look like this:

Name,Age,Sex,Born in the US,Degree
Joe,31,M,Yes,BA
Sam,,M,No,MS
Sally,28,F,,Ph.D

Generally, spreadsheet programs like Excel can save their files as CSV files.

Some packages, like PSPP, can automatically deal with missing data. If some cases don't have values for some of the variables, and the data are structured correctly, PSPP will automatically leave blanks where the values are missing. For example, suppose one data set looks like this:

Name   Age  Sex  Born in US  Degree
Joe    31   M    Yes         BA
Sam         M    No          MS
Sally  28   F                Ph.D.


In this data set, Sam is missing age, and Sally is missing whether she was born in the US. If the original data set properly indicates that those values are missing, then PSPP will recognize them as missing when it reads the data in. For PSPP reading a CSV data set, the indicator is nothing, or only a space, between the commas. Thus, this line

Sam,,M,No,MS

shows that age is missing, because there are just two commas with no data between them where age should be.
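The same convention can be sketched in a general-purpose language. The example below uses Python's standard csv module to read the sample CSV shown earlier and treats an empty field (or one containing only spaces) as missing. This is only an illustration of the idea, not a description of how PSPP itself is implemented; the function name read_rows is hypothetical.

```python
import csv
import io

# The sample data from this section, as CSV text.
CSV_TEXT = """Name,Age,Sex,Born in the US,Degree
Joe,31,M,Yes,BA
Sam,,M,No,MS
Sally,28,F,,Ph.D
"""


def read_rows(text):
    """Parse CSV text into dicts, mapping empty or blank fields to None."""
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for record in reader:
        rows.append({
            key: (None if value is None or value.strip() == "" else value)
            for key, value in record.items()
        })
    return rows


rows = read_rows(CSV_TEXT)

# List every (name, variable) pair where the value is missing.
missing = [(row["Name"], key)
           for row in rows
           for key, value in row.items()
           if value is None]
print(missing)  # [('Sam', 'Age'), ('Sally', 'Born in the US')]
```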

On the other hand, some packages need a placeholder, such as '-9', where data are missing, and the person reading in the data then needs to tell the program that -9 means missing data.
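A minimal sketch of why the program must be told about the placeholder: if -9 is left in as an ordinary number, it distorts the statistics. The code below is a hypothetical Python illustration (the names MISSING_CODES, decode_missing and mean_of_present are made up for this example), using ages like those in the sample data with -9 standing in for Sam's missing age.

```python
# Placeholder codes that the analyst declares to mean "missing".
MISSING_CODES = {"-9", "-9.99"}


def decode_missing(values, missing_codes=MISSING_CODES):
    """Convert raw text values to floats, mapping placeholder codes to None."""
    return [None if v.strip() in missing_codes else float(v) for v in values]


def mean_of_present(values):
    """Mean of the non-missing values, or None if every value is missing."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


# Hypothetical age column in which -9 marks a missing age.
raw_ages = ["31", "-9", "28"]

ages = decode_missing(raw_ages)
correct_mean = mean_of_present(ages)

# What happens if the placeholder is mistakenly treated as a real age.
naive_mean = sum(float(v) for v in raw_ages) / len(raw_ages)

print(ages)          # [31.0, None, 28.0]
print(correct_mean)  # 29.5
print(round(naive_mean, 2))
```

Here the mean age is 29.5 once -9 is declared missing, but about 16.67 if the placeholder is averaged in as if it were a real age.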

References

  1. Epi Info, CDC, 2008 http://www.cdc.gov/epiinfo/index.htm.
  2. IDAMS Statistical Software, http://portal.unesco.org/ci/en/ev.php-URL_ID=2070&URL_DO=DO_TOPIC&URL_SECTION=201.html
  3. Instat - an interactive statistical package, Statistical Services Centre - University of Reading, 2009. http://www.ssc.rdg.ac.uk/software/instat/instat.html
  4. Irristat, International Rice Research Institute, Biometrics and Bioinformatics Unit, http://www.irri.org/science/software/irristat.asp
  5. The R Project, http://cran.r-project.org/
  6. Easy Reg International, Herman Bierens, Penn State University, 2008 http://econ.la.psu.edu/~hbierens/EASYREG.HTM
  7. MicrOsiris, Neal Van Eck, Van Eck Computer Consulting http://www.microsiris.com/
  8. OpenStat, Bill Miller, 2009 http://www.statpages.org/miller/openstat/
  9. PSPP, 2008 http://www.gnu.org/software/pspp/
  10. Zelig, Kosuke Imai, Gary King and Olivia Lau, 2009 http://gking.harvard.edu/zelig/
  11. UNESCO. 03-11-2004. In Focus: Communication and Information Sector's In Focus service. UNESCO and Software. http://portal.unesco.org/ci/en/ev.php-URL_ID=17447&URL_DO=DO_TOPIC&URL_SECTION=201.html
  12. VSN International. 2008. GenStat Discovery. http://www.vsni.co.uk/software/genstat-discovery/
  13. "A Short Preview of Free Statistical Software Packages for Teaching Statistics to Industrial Technology Majors" Journal of Industrial Technology (Volume 21-2, April 2005), Ms. Xiaoping Zhu and Dr. Ognjen Kuljaca. http://www.nait.org/jit/current.html
  14. Felix Grant, "Free Statistics Software, Yours, Free to keep....", Scientific Computing World, Sept/Oct 2004, http://www.scientific-computing.com/scwsepoct04free_statistics.html
  15. Edward J. Wegman and Jeffrey L. Solka. 2005. Statistical Software for Today and Tomorrow. http://www.galaxy.gmu.edu/ (listed as "A Guide to Statistical Software").
  16. Hwan-sik Choi and Nicholas M. Kiefer, Software evaluation: EasyReg International. International Journal of Forecasting. Volume 21, Issue 3, July-September 2005, Pages 609-616. http://dx.doi.org/10.1016/j.ijforecast.2005.02.003
  17. Shackman, Gene. 2006. "Comparing free statistical software for data sets with no missing values" and "Comparing free statistical software, Handling missing data". Both available here "Free Software" http://gsociology.icaap.org/methods/soft.html
  18. List of free statistical software, Open Source & Public Domain Packages with Source Code. StatCon 2006. http://statistiksoftware.com/free_software.html
  19. Pezzullo, Free Statistical Software, 2009. http://statpages.org/javasta2.html
  20. Paivi Lankinen and Anna Tanhuala. 2008. Internationalization of Software Companies: Case - Le Sphinx, France. School of Business Administration, Jyvaskyla University of Applied Sciences. Appendix 1, Comparing free statistical software. https://oa.doria.fi/handle/10024/38544
  21. Journal of Statistical Software, http://www.jstatsoft.org/
  22. Future Salary and US Residency Fill Rate Revisited, Mark Ebell. Research letter in JAMA, September 10, 2008—Vol 300, No. 10, p1131-1132. http://jama.ama-assn.org/cgi/reprint/300/10/1131
  23. Differential gene expression patterns in cyclooxygenase-1 and cyclooxygenase-2 deficient mouse brain. Christopher D Toscano, Vinaykumar V Prabhu, Robert Langenbach, Kevin G Becker, and Francesca Bosetti. Genome Biol. 2007; 8(1): R14. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1839133
  24. M Bielaszewska, B Sinha, T Kuczius and H Karch. Cytolethal Distending Toxin from Shiga Toxin-Producing Escherichia coli O157 Causes Irreversible G2/M Arrest, Inhibition of Proliferation, and Death of Human Endothelial Cells. Infection and Immunity, January 2005, p. 552-562, Vol. 73, No. 1. http://iai.asm.org/cgi/content/full/73/1/552
  25. C.D. Toscano, P.J. Kingsley, L.J. Marnett, and F. Bosetti. NMDA-induced Seizure Intensity is Enhanced in COX-2 Deficient Mice. Neurotoxicology. 2008 November; 29(6): 1114–1120. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2587528
  26. FAO Plant Production and Protection Paper No. 174, Rome, 2003, Genotype x environment interactions. Challenges and opportunities for plant breeding and cultivar recommendations, http://www.fao.org/DOCREP/005/Y4391E/y4391e00.htm
  27. A Gambardella and Bronwyn H. Hall, "Proprietary versus public domain licensing of software and research products" (2006). Research Policy. 35 (6), pp. 875-892. Postprint available free at: http://repositories.cdlib.org/postprints/1865.
  28. Liu, Wen-Chi and Tsangyao Chang, (2008) "Rational Bubbles in the Korea Stock Market? Further Evidence based on Nonlinear and Nonparametric Cointegration Tests." Economics Bulletin, Vol. 3, No. 34 pp. 1-12. http://economicsbulletin.vanderbilt.edu/2008/volume3/EB-08C30021A.pdf
  29. Harumi Ito and Darin Lee, Journal of Economics and Business, Volume 57, Issue 1, January-February 2005, Pages 75-95. Assessing the impact of the September 11 terrorist attacks on U.S. airline demand. http://dx.doi.org/10.1016/j.jeconbus.2004.06.003. Also available here http://www.brown.edu/Departments/Economics/Papers/Papers/2003/2003-16_paper.pdf
  30. Rahav G, Gabbay R, Ornoy A, Shechtman S, Arnon J, Diav-Citrini O. Primary versus nonprimary cytomegalovirus infection during pregnancy, Israel. Emerg Infect Dis [serial on the Internet]. 2007 Nov [date cited]. Available from http://www.cdc.gov/EID/content/13/11/1791.htm
  31. Chan P-C, Huang L-M, Wu Y-C, Yang H-L, Chang I-S, Lu C-Y, et al. Tuberculosis in children and adolescents, Taiwan, 1996–2003. Emerg Infect Dis [serial on the Internet]. 2007 Sep. Available from http://www.cdc.gov/EID/content/13/9/1361.htm
  32. ME Gyasi, WMK Amoaku, and MA Adjuik. Epidemiology of Hospitalized Ocular Injuries in the Upper East Region of Ghana. Ghana Med J. 2007 December; 41(4): 171–175. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2350113
  33. N. S. Sapre, N. Pancholi, and S. Gupta, Computational Modeling of Substitution Effect on HIV–1 Non–Nucleoside Reverse Transcriptase Inhibitors with Kier–Hall Electrotopological State (E–state) Indices, Internet Electron. J. Mol. Des. 2008, 7, 55–67, http://www.biochempress.com/cv07_i03.html
  34. Chawla, Anju. Exploring project selection behavior of academic scientists in India. Research Evaluation, Volume 16, Number 1, March 2007 , pp. 35-45(11). http://www.ingentaconnect.com/content/beech/rev/2007/00000016/00000001/art00004
  35. Data Sharing for Demographic Research Knowledge Base, question on OSIRIS, University of Michigan, http://dsdr-kb.psc.isr.umich.edu/answer.html?i=1076
  36. IDAMS, Internationally Developed Data Analysis and Management Software Package. WinIDAMS Reference Manual (release 1.3) UNESCO, 2008. Preface. http://portal.unesco.org/ci/en/ev.php-URL_ID=25081&URL_DO=DO_TOPIC&URL_SECTION=-465.html
  37. Epi Info™ Community Health Assessment Tutorial. The Epi Info™ Community Health Assessment Tutorial was produced by the collaborative efforts of the Centers for Disease Control and Prevention (CDC), the Assessment Initiative (AI), and the New York State Department of Health (NYSDOH). http://www.cdc.gov/epiinfo/communityhealth.htm
  38. Cholera Outbreak in Rwenshama: Using Epi Info for Windows in an Outbreak Investigation. Coordinating Office for Global Health - DGPHCD, http://www.cdc.gov/cogh/dgphcd/training/softwaretraining.htm
  39. Introduction to EPI2000. GPVEC Great Plains Veterinary Educational Center. University of Nebraska - Lincoln. http://gpvec.unl.edu/videos/epi-stats.asp
  40. The North Carolina Center for Public Health Preparedness Training Website http://nccphp.sph.unc.edu/training/index.html
  41. Contributed Documentation. http://cran.r-project.org/other-docs.html.
  42. William Revelle, Using R for psychological research: A simple guide to an elegant package, 2008, http://personality-project.org/r/
  43. Dong-Yun Kim, MAT 356 R Tutorial, Spring 2004. http://www.math.ilstu.edu/dhkim/Rstuff/Rtutor.html
  44. R FAQ. Frequently Asked Questions on R. Version 2.8.2009-03-18. ISBN 3-900051-08-9 http://lib.stat.cmu.edu/R/CRAN/doc/FAQ/R-FAQ.html
  45. R-help -- Main R Mailing List: Primary help. https://stat.ethz.ch/mailman/listinfo/r-help
  46. OpenStatHelp http://tech.groups.yahoo.com/group/OpenStat/
  47. Pspp-users -- PSPP user discussion, http://lists.gnu.org/mailman/listinfo/pspp-users
  48. R Development Core Team. An Introduction to R. Version 2.8.1 (2008-12-22). ISBN 3-900051-12-7. http://cran.r-project.org/doc/manuals/R-intro.html
  49. Herman J. Bierens. EasyReg International: Guided tours. No Date Given. http://econ.la.psu.edu/~hbierens/ERITOURS.HTM
  50. Documentation, No Date Given. PSPP. http://www.gnu.org/software/pspp/documentation.html
  51. Forrest W. Young, 1996. ViSta User's Guide. http://forrest.psych.unc.edu/research/
  52. P.S. Nagpaul. 1999. Guide to Advanced Data Analysis using IDAMS Software. http://www.unesco.org/webworld/idams/advguide/TOC.htm
  53. Unesco. 2008. WinIDAMS 1.3 Reference Manual - Table of Contents. http://www.unesco.org/webworld/portal/idams/html/english/TOC.htm
  54. Van Eck, Richard, Microsiris, Statistical and Data Management Software System. Version 9.1, 2006. Van Eck Computer Consulting. http://www.microsiris.com/MicrOsiris.htm
  55. Kosuke Imai, Gary King, Olivia Lau, No Date Given. Zelig: Everyone's Statistical Software. http://gking.harvard.edu/zelig/docs/index.html
  56. Kevin M. Sullivan. Mar 3 2008. Introduction to Epi Info (Version 3.4.1) Analyze Data Module. http://www.sph.emory.edu/~cdckms/