
Tuesday, December 20, 2011

What should computational papers report?

After spending part of the weekend helping a co-author figure out how we produced a table for a soon-to-be-published paper, I started thinking about what the minimum reporting requirements for a computational paper are, or should be, to guarantee reproducibility of the results.

Clearly, the minimum standard should be the name and version number of the software package(s) used directly in the analysis (e.g., in my case, DNAcopy 1.22.1). Version numbers are critical, as the algorithms in these packages occasionally change in non-obvious ways, leading to startlingly different results. Unfortunately, in my experience, a large number of papers fail to report version numbers (though this is slowly starting to change), and whenever I've been called upon to review a paper, I've asked the authors to add version numbers for all the software they used in their revisions prior to publication.
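
For R-based analyses like this one, capturing that information takes only a line or two at the end of a script; a minimal sketch (DNAcopy is just the package from my own analysis; substitute whatever your analysis actually loads):

    # Version of a specific package used directly in the analysis.
    packageVersion("DNAcopy")

    # sessionInfo() reports the R version, the platform, and the versions
    # of all attached and loaded packages in one shot.
    sessionInfo()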

In his 2009 paper on the repeatability of microarray analyses [1], John Ioannidis pointed out that a major factor limiting reproducibility of results is the lack of sufficient detail about how the data were processed. I strongly suspect a key reason such details aren't provided is that many, perhaps most, researchers don't keep particularly detailed notes about what software they used. Certainly, I've been guilty of this, which is why I had to spend my weekend tracking down this issue.
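
One low-effort way to keep such notes, at least for R-based work, is to have every analysis script write its own session details to a log file next to its output; a minimal sketch (the file name here is arbitrary):

    # Dump the full session details (R version, platform, package versions)
    # to a plain-text file that travels with the analysis results.
    writeLines(capture.output(sessionInfo()), "analysis-sessionInfo.txt")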

As far as details go, though, it isn't clear to me how far up the toolchain we should go. Details about "parent" software systems (e.g., "Bioconductor 2.6/R 2.12.0") might be particularly useful for reproducing results, especially if there have been bug fixes or tweaks in versions released since the original analyses were performed. In my analyses for this paper, I've noticed that the same version of DNAcopy gives slightly different numerical results when used with different releases of R, though not by enough to affect the conclusions we drew.
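
If only the "parent" versions are wanted (say, for a methods section), R exposes them directly; a minimal sketch:

    # The running R version and build platform, as reportable strings.
    R.version.string     # full version string, including the release date
    R.version$platform   # the platform R was built for
    getRversion()        # the version as a comparable object

The Bioconductor release number itself isn't exposed quite so directly, but sessionInfo() at least records the versions of the individual Bioconductor packages.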

Moreover, in cases where the researcher has built their own software (as I do with R releases), I wonder whether details like compiler settings need to be provided or whether those are just minutiae that only serve to clutter up the literature. Choices of architecture flags (e.g., "-mfpmath=sse -msse4.1" versus "-mfpmath=387" for gcc) or of aggressive optimizations (e.g., "-ffast-math" [2]) certainly affect numerical results, though I would argue that conclusions drawn from results that are especially sensitive to compiler flags are dubious at best.
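
For an R built from source, the compiler and flags are at least recoverable after the fact from the installation itself; a rough sketch, assuming a Unix-like build with the standard etc/Makeconf layout:

    # On a source-built R under a Unix-like OS, etc/Makeconf records the
    # compiler and flags used for the build; pull out the relevant lines.
    makeconf <- readLines(file.path(R.home("etc"), "Makeconf"))
    grep("^(CC|CFLAGS|CXXFLAGS|FFLAGS) *=", makeconf, value = TRUE)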

Similar questions can be asked about whether the versions of various linked libraries should be reported. Even "standardized" libraries can behave differently depending on how they are implemented. For example, R and GraphLab behave badly (i.e., segfault in certain circumstances) when linked against GotoBLAS but run fine when linked against ATLAS or Intel MKL.
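
As an aside, more recent R releases (well after the 2.x series I'm using here) make at least the BLAS/LAPACK part of this visible from within R itself; a hedged sketch:

    # In newer R releases, sessionInfo() also prints the paths of the BLAS
    # and LAPACK libraries R is linked against, and La_version() reports
    # the LAPACK version in use. Neither is available in the 2.x series.
    sessionInfo()
    La_version()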

A final consideration for reporting has to do with randomized algorithms, such as MCMC simulations. Of course, best practices say that the algorithm should be run several times (at least) to ensure that the results are reasonably stable and consistent. However, should the interests of "full disclosure" and reproducibility require us to report the random seeds used to generate our results? Again, I would be highly skeptical of any results that are that sensitive to the initial seed, and in any case, such values might be impossible to obtain if the seeding of the random number generator happens deep in the bowels of code we don't have access to.
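
Recording the seed is at least cheap to do when you control the top-level script; a minimal sketch (the seed values here are arbitrary):

    # Fix and record the RNG state up front; the particular seed is arbitrary.
    set.seed(20111220)
    RNGkind()   # which generator is in use (e.g., "Mersenne-Twister")

    # Re-running the stochastic analysis with a handful of seeds is a cheap
    # check that the conclusions aren't sensitive to the initial seed.
    for (seed in c(1, 42, 20111220)) {
        set.seed(seed)
        # ... run the MCMC / randomized analysis and compare summaries ...
    }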

My conclusions after thinking about this? Report the versions of any packages you use and the version of the parent software; compiler flags, linked libraries, and random seeds are probably extraneous details that can be left out. Of course, there's also the question of how thoroughly code you've developed needs to be tested for numerical stability and other computational "artifacts", but that's an issue I'll discuss another time.

[1] J. Ioannidis, et al. "Repeatability of published microarray gene expression analyses." Nature Genetics 41(2): 144–155, 2009.
[2] Though I would argue that you should never use -ffast-math for numerical software.

2 comments:

Andy O. said...

Interesting thoughts, and I think you're on to something; however, I've noticed that at least in physics this really isn't done at all. I wonder if the reason most people don't do more reporting relates to the lack of an easy way to do it. Obviously, for something like the primary code we use (Vienna Ab initio Simulation Package - VASP), you could say "VASP 5.2", but I know it depends on many library packages. I would assume the best way may be to use an appendix to list all of this information separately. I'm also curious how many people just use a computational tool with no information about how it was compiled; for example, although I compiled VASP myself, many people who use the TACC resources, after providing their license information, just use the TACC build and type "module load vasp". I guess they could simply say they are using the versions currently available there. Also, now that I think about it, most people don't report anything more than the type of atomic potentials and the included electrons, despite the fact that there are old and new sets available (i.e., some have been regenerated). Anyway, I do agree with you that this is something that should be addressed; it's easiest for widely available packages and more difficult for home-brewed codes. I also think this hints toward the use of more open scientific codes (there is one code in my field that is fully open source, and at least with VASP I can read through the source after paying the fees, but there are several that are completely closed).

Cheng H. Lee said...

Copying my Facebook reply:

Opening scientific code is definitely a big thing; I find it weird that NIH requires us to make data publicly available but not code. In the biological sciences, we seem to have much love for "Supplemental Materials", so providing details about software isn't a big issue.

But you're right about using large resources like TACC; it's hard to get details about how the software modules were built (though I suspect most of them just use the default options from the original source). They also add a complication because the lifetime of most of those machines is typically three years or so, and once they're decommissioned, it's hard to recover details about their software configurations.