Tabulations and musings from your editor's biased perspective
What follows is a long exercise in patient digging. I had done a short review like this many years ago and our editor-in-chief thought that it might be a good idea to bring it up to date. I now have the master template that will allow updates on a more timely basis as well as extension into other areas (a similar review for mathematical, as opposed to strictly statistical, software is in the back of my mind). The motivation for this is to give our readers a basis for comparison of some of the most commonly used commercial packages, and as an aid in selection for future reference. It is also an opportunity to reacquaint myself with many old friends.
[Table: Statistical software comparison table]
Please note that this list is far from all-inclusive and is, in many ways, minimalist. It is directed to the non-statistician, be they scientist, engineer or student, and concentrates on those areas that are most often needed and used by beginning and intermediate analysts. It was not meant to include general mathematical software (e.g. Mathematica, Maple) nor specialized areas such as experimental design (DOE) and graphics, although more and more statistical packages are including DOE and all have some graphical capabilities.
The programs included tend to be the more common in terms of exposure, use and review. Developers and readers are encouraged to contact me with their favorite general statistical packages, as I'm always interested in testing new implementations. Freeware also is not covered but may be in the future as the specialized packages develop menu-driven versions with many of the niceties of commercial software. For comparison to a full-featured professional statistical package, I have included SAS, which is command-line driven. It should be noted, however, that the base capabilities of most of today's menu-driven software can be greatly extended by use of built-in programming languages.
The associated table lists the more commonly used functions, and the appearance of a bullet under any given program indicates that the operation is performed by menu-driven elements and not code. This first effort just lists the major areas without getting into the diagnostics and quick-graphics that may further distinguish the various packages. Also, note that the test breakdown is usually obvious but, in some cases, arbitrary. All notations reflect the capabilities of the latest version of this software at the time of writing. Two of the categories were sufficiently vague to confuse the developers, and therefore probably the readers of this review. The subcategory 'automatic test for normality' meant just that. Ask for a statistical test, and get a normality check done in the background before being 'allowed' to proceed. I stretched the definition in several cases, as what the software in question actually does is close. Also 'Missing Value Analysis' means imputing missing values, and not just noting them and removing observations with missing values. (Note to developers: Endless apologies for the confusion!)
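To make the 'automatic test for normality' category concrete, here is a minimal Python sketch of such a gate. It is an illustration only and is not taken from any of the packages reviewed: the function names are hypothetical, and the Jarque-Bera statistic is an assumed stand-in for whatever normality test a given package actually runs in the background. Its p-value is asymptotic, so it is rough for small samples.

```python
import math
import statistics


def jarque_bera(sample):
    """Return (JB statistic, asymptotic p-value) for a sample with nonzero spread."""
    n = len(sample)
    mean = statistics.fmean(sample)
    # Central moments in population form (dividing by n).
    m2 = sum((x - mean) ** 2 for x in sample) / n
    m3 = sum((x - mean) ** 3 for x in sample) / n
    m4 = sum((x - mean) ** 4 for x in sample) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # Under normality JB is asymptotically chi-square with 2 degrees of
    # freedom, whose upper-tail probability is exp(-x / 2).
    return jb, math.exp(-jb / 2.0)


def gated_mean_test(sample, hypothesized_mean, alpha=0.05):
    """Run a one-sample t statistic only if the normality check passes.

    Hypothetical helper: the user asks for the test, the check runs in the
    background, and the test is withheld when normality is rejected.
    """
    _, p_normal = jarque_bera(sample)
    if p_normal < alpha:
        return {"allowed": False, "normality_p": p_normal, "t": None}
    standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
    t = (statistics.fmean(sample) - hypothesized_mean) / standard_error
    return {"allowed": True, "normality_p": p_normal, "t": t}
```

Calling `gated_mean_test(data, 5.0)` returns the t statistic when the sample passes the check, and `"allowed": False` with no statistic when it does not. A real package would typically offer a nonparametric alternative at that point rather than simply stopping.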
As a last note, I have further divided the programs by their relative utility to beginners, experienced users, and those demanding the most complete test repertoires. The last category is for those special niche statistical items that are directed to, and highly useful in, particular areas.
Again, this division is somewhat arbitrary as some of the intermediate software achieves greater capability with add-on packages. Keep in mind that most of these programs have been reviewed in the professional literature and found wanting in some areas. These areas, however, are of much greater concern to the statistician than to the day-to-day analyst and, for general use, many satisfied users have validated these packages on a daily basis.
Now for a few words on the individual players....
The Beginning packages are not nearly as all-inclusive as the others, but are friendlier to the novice.
• InStat takes a step-by-step approach to guide the novice through a limited repertoire of analyses and has the advantage of accepting either raw or averaged data. It thoughtfully explains the choices at each step, produces instant descriptive statistics and offers the beginner advice as to what the statistical significance of the output may be. It produces a few quick graphs with error bars to clarify the results and works well with columnar data.
• SigmaStat offers a broader menu and its wonderful Wizard to guide and advise the beginner. In addition, it offers features such as an automatic background check for normality, equality of variance, and power. Many of the options are immediately available at the touch of a single button and, as with InStat, the advice offered is most helpful. I have always treasured its ability to work with columns, rather than stacked data.
• Prism mentions on its Web site that "If you find most statistics programs cumbersome, and your needs are simple…." This is a nice general summary, as the software views columns as samples or cases rather than variables. As this is how bench scientists often arrange their data, it makes for a fast and easy analysis with minimal data manipulation. It's also convenient that analyses and graphs may be automatically updated when changing data in the spreadsheet cells. There are also many "teaching" features that are quite a help to the beginner.
The Intermediate packages are far more comprehensive and offer extensive diagnostics and graphics. While retaining many of the ease-of-use features of the Beginning packages, they are geared to the more demanding user who may find need for a greater variety of analytic tests and descriptors. They also are constructed to work with a much broader range of input/output formats.
• Genstat is a gem from across the big pond that offers most of the usual tests but includes several really unusual (from the standpoint of a general statistical package) and substantial modules. These include the capability to do meta-analysis and genomics (microarray analysis). There is a nifty unit conversion box that will work on the fly, as well as a variety of buttons to quickly manipulate data.
• JMP is richly graphic and contains a variety of tests. Many of the graphics are linked to the analyses and present the analyst with a more compact readout. The arrangement of tests is a little different and is done by platforms (e.g. Fit Y by X rather than t-test, ANOVA, regression), but this may be a plus for the novice overwhelmed by standard terminology. It also has extensive data manipulation and editing features that are powerful time-savers. The neural networks module is a bit light, but the more often-used DOE module is quite extensive.
• MINITAB has a great layout and many very user-friendly features, and has found great utility in many academic departments. Its analysis six-packs, i.e., outputs of multiple graphics in a "view-it-all-at-once" format, really compact quite a bit of information, and make it all easily assimilated. In addition, it has one of the most useful random number generators imaginable. The layout is very intuitive, as is the running of most tests. It also contains a number of "data wrestling" features that make life a bit easier.
• SPSS was originally designed for social science number crunching but has found wide application in general statistics for a variety of scientific disciplines and businesses. It is fondly remembered (from a very early version) as software that was intuitive with a very gentle learning curve. The operations all seemed to do what one thought that they should, usually the first time. Over the years, the developers added many tests and diagnostics that greatly extended the range and capabilities of the base program and then, unfortunately, split them off into add-on modules that had to be purchased separately.
• SYSTAT was, and still is, deeply involved with general linear models and sports one of the most complete ANOVA menus available. The latest version is rather compact and the pull-down menus, at first glance, appear sparse. With just a little time and effort, the depth of analytic offerings becomes readily apparent and the power of such techniques as resampling and Monte Carlo simulations becomes obvious. All this comes with a variety of specialized diagnostics and output that is quickly grasped by the experienced analyst.
These programs are quite a bit more than the part-time statistician of modest needs would use, but are included as they meet the most demanding requirements and are capable of handling almost any layer of complexity through programming extensions. Any time spent with these assumes a level of dedication to the art that goes far beyond simple data analysis, although the same could be said for many of the above software programs should the user want to delve into the deepest levels. The list of capabilities in the table just touches some of what the Intermediate programs do, and does not begin to address the functioning of these advanced programs.
• S-Plus has been around for a while and has gone from entirely command-line driven to a friendlier menu-driven interface. Its developers have followed the common path from statistics to business applications to large databases, without forgetting their origins. The breadth and depth of statistical/graphical analysis is still there, but has been extended in the areas of I/O, data management and systems integration capabilities.
• SAS has been a standard since its introduction in the late 1970s. It was the first "professional's" program that I had used way back when, and has been a good friend ever since. Although originally designed for, and still used primarily from, a command line, there have been a number of upgrades that give SAS a fully usable, menu-driven GUI, the current version of which is Enterprise Guide. The manuals for this one are voluminous, and every conceivable variation may be addressed through coding.
• STATISTICA is, and has been, a menu-driven version of the more advanced types. Their motto seems to be "we do it all, no programming necessary." During my first review of this package years ago, it took an entire evening just to make it through the marketing material! The program is complete with a number of included modules, all easily accessible from a Task-Switcher. From there, the user can drill through a near exhaustive menu of tests, graphics and options that address everything from simple t-tests to neural networks and experimental design.
There are very few bullets in the table under this header with good reason. These are the niche programs that address very specialized, yet extremely useful, functions that are performed when statistically analyzing data. They are immensely functional and are so utilitarian as to begin to be included, in one form or another, in many of the current versions of general statistical packages.
• nQuery Advisor is one of the most useful from both an experimental and clinical standpoint. Power and sample size calculations form the basis of many analyses and are considered necessary in most cases. The software is made for clinical/biological studies but can be applied in a wide variety of disciplines. The spreadsheets are set up for many types of calculations and there are copious helps to guide the novice through some of the more complex terminology. Quick graphics of power versus sample size are quite useful in assessing testing needs.
• StatMate is also a power-calculating program and one made for the less experienced user. For any of the more common and simple tests, it quickly presents tables of least important differences that allow the user to determine the power. The process is simple, intuitive and, for the beginner, quite instructive.
• ResamplingStats was a really nice idea that began as an EXCEL add-on and quickly became a useful addition to the procedural repertoire as this function was usually performed only by programming. The software was a big improvement from the novice standpoint as the manual used simple-to-follow examples and the setup was (and still is) extraordinarily easy. Utility has been enhanced with the addition of graphics that will readily display the distributions that result.
• Solas occupies a unique niche in that it addresses a problem that we often find, but that is not always easy to resolve, i.e., missing data. In the past, we may have merely deleted those samples or substituted the mean of the sample group. However, this always seemed to underestimate the variability inherent in the original set. The multiple imputation techniques that have been developed are included in this package as well as a number of older routines, all wrapped in an easy-to-use format.
• StatXact performs a wide variety of standard statistical tests using the type of techniques originally developed by Fisher for inference when the data sets are small, heavily tied, or skewed. Also called nonparametric or permutational inference, these routines are highly useful when the status of normality is in doubt and the usual asymptotic tests would be misleading. It gets easier to use with each version and the current one offers a very reasonable breadth.
• Unscrambler was a surprise, as it takes two of the more useful areas of statistics, i.e., DOE and multivariate techniques, and wraps them in a compact and highly useful package. Concentrating on prediction and optimization, this software addresses pressing needs in many industries. Originally designed for problems in the chemical industry, it finds wide applications in modern healthcare areas such as diagnostics and drug development. The graphic options do much to encourage further use, as they are well thought out and highly descriptive. This package also features specialized areas of supervised classification, 3-way PLS regression and multivariate curve resolution.
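The Solas entry's remark that substituting the group mean underestimates variability can be seen in a few lines of Python. This is a minimal sketch with made-up readings and a hypothetical function name. It does not show multiple imputation or anything Solas itself does, only the older mean-substitution habit that such techniques are meant to replace.

```python
import statistics


def compare_missing_data_strategies(values):
    """Contrast deleting missing entries with filling them by the group mean.

    `values` is a list of numbers in which None marks a missing observation.
    """
    complete = [v for v in values if v is not None]
    group_mean = statistics.fmean(complete)
    # Mean substitution: every gap receives the same central value.
    mean_filled = [group_mean if v is None else v for v in values]
    return {
        "sd_deleted": statistics.stdev(complete),
        "sd_mean_filled": statistics.stdev(mean_filled),
    }


# Hypothetical readings with two missing observations.
readings = [4.1, 5.3, None, 6.8, 3.9, None, 5.6, 7.2]
result = compare_missing_data_strategies(readings)
print(result)
```

The filled-in values add nothing to the sum of squared deviations but do increase the sample size, so the standard deviation of the mean-filled set is always smaller than that of the observed values alone. That artificial shrinkage is the motivation for imputation methods that add back an appropriate amount of noise.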
Well, there you have it, a too-brief summary of some of the more common and useful software on the market today. In future attempts, more emphasis will be placed on factors such as graphics and diagnostics that can further assist the interested reader in distinguishing between packages.
John Wass is a statistician with GPRD Pharmacogenetics, Abbott Laboratories. He can be reached at firstname.lastname@example.org.