Developing open source scientific practice∗

K. Jarrod Millman
Division of Biostatistics
University of California, Berkeley

Fernando Pérez
Henry H. Wheeler Jr. Brain Imaging Center
University of California, Berkeley

August 31, 2017

Dedicated to the memory of John D. Hunter III, 1968–2012.

Contents

1 Introduction
2 Computational research
  2.1 Computational research life cycle
  2.2 Open source ecosystem
  2.3 Communities of practice
3 Routine practice
  3.1 Version control
  3.2 Execution automation
  3.3 Testing
  3.4 Readability
  3.5 Infrastructure
4 Collaboration
  4.1 Distributed version control
  4.2 Code review
  4.3 Infrastructure redux
5 Communication
  5.1 Literate programming
  5.2 Literate computing
  5.3 IPython notebook
6 Conclusion

∗ In Implementing Reproducible Research, eds. Victoria Stodden, Friedrich Leisch, and Roger D. Peng, pages 149–183. Chapman and Hall/CRC Press, 2014.

1 Introduction

Computational tools are at the core of modern research. In addition to experiment and theory, the notions of simulation and data-intensive discovery are often referred to as “third and fourth pillars” of science [12]. It is more accurate to simply accept that computing is now inextricably woven into the DNA of science, as today, even theory and experiment are computational. Experimental work requires computing (whether in data collection, preprocessing, or analysis), and theoretical work requires symbolic manipulation and numerical exploration to develop and refine models. Scanning the pages of any recent scientific journal, one is hard-pressed to find an article that does not depend on computing for its findings.

Yet, for all its importance, computing receives perfunctory attention in the training of new scientists and in the conduct of everyday research. It is treated as an inconsequential task that students and researchers learn “on the go” with little consideration for ensuring computational results are trustworthy, comprehensible, and ultimately a secure foundation for reproducible outcomes. Software and data are stored with poor organization, little documentation, and few tests. A haphazard patchwork of software tools is used with limited attention paid to capturing the complex workflows that emerge. The evolution of code is not tracked over time, making it difficult to understand what iteration of the code was used to obtain any specific result.
Finally, many of the software packages used by scientists in research are proprietary and closed-source, preventing complete understanding and control of the final scientific results.

We argue that these considerations must play a more central role in how scientists are trained and conduct their research. Our approach grows out of our experience as part of both the research and the open source scientific Python communities. We begin (§ 2) by outlining our vision for scientific software development in everyday research. In the remaining sections, we provide specific recommendations for computational work. First, we describe the routine practices (§ 3) that should be part of the daily conduct of computational work. We next discuss tools and practices developed by open source communities to enable and streamline collaboration (§ 4). Finally, we present an approach to developing and communicating computational work that we call literate computing, in contrast to the traditional approach of literate programming (§ 5).

2 Computational research

Consider a researcher using Matlab to prototype a new analysis method, developing high-performance code in C, post-processing by twiddling controls in a graphical user interface, importing data back into Matlab to generate plots, polishing the resulting plots by hand in Adobe Illustrator, and finally pasting the plots into a publication manuscript or PowerPoint presentation. What if months later they realize there is a problem with the results? Will they be able to remember what buttons they clicked to reproduce the workflow and generate updated plots, manuscript, and presentation? Can they validate that their programs and overall workflow are free of errors? Will other researchers or students be able to reproduce these steps to learn how a new method works or to understand how the presented results were obtained?

The pressure to publish encourages us to charge forward chasing the goal of an accepted manuscript, but the term “reproducibility” implies repetition and thus a requirement to also move back—to retrace one’s steps, question or change assumptions, and move forward again. Unfortunately, the all-too-common way scientists conduct computational work makes this necessary part of the research process difficult at best, often impossible.

The open source software development community has cultivated tools and practices that, if embraced and adapted by the scientific community, will greatly enhance our ability to achieve reproducible outcomes. Open source software development uses public forums for most discussion and systems for sharing code and data. There is a strong culture of public disclosure, tracking, and fixing of bugs, and development often includes exhaustive validation tests that are executed automatically whenever changes are made to the software and whose output is publicly available on the Internet. This detects problems early, mitigates their recurrence, and ensures that the state and quality of the software is known under a wide variety of situations (operating systems, inputs, parameter ranges, etc.). The same systems used for sharing code also track the authorship of contributions. All of this ensures an open collaboration that recognizes the work of individual developers and allows for a meritocracy to emerge.
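The validation tests described above are typically small, self-contained functions that assert known properties of the code, which a test runner such as pytest can discover and execute automatically whenever the software changes. The following is a minimal sketch of such a test file in Python; the normalize function and its expected behavior are hypothetical examples introduced here for illustration, not code from this chapter:

    import numpy as np
    import pytest

    def normalize(x):
        """Scale a nonzero array so its entries sum to one (hypothetical example)."""
        total = x.sum()
        if total == 0:
            raise ValueError("cannot normalize an array that sums to zero")
        return x / total

    def test_normalize_sums_to_one():
        # A known input should yield an output whose entries sum to 1.
        result = normalize(np.array([1.0, 2.0, 3.0]))
        assert np.isclose(result.sum(), 1.0)

    def test_normalize_rejects_zero_input():
        # Degenerate input should fail loudly rather than silently produce NaNs.
        with pytest.raises(ValueError):
            normalize(np.zeros(3))

Running pytest on such a file executes every test function and reports any failures; hooked into a continuous integration service, a suite of such tests runs on every proposed change, which is how problems are detected early and their recurrence mitigated.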
As we learn from the open source process how to improve our scientific practice, we recognize that the ideal of scientific reproducibility is by necessity a reality of shades. We see a gradation from a pure mathematical result, whose proof should be accessible to any person skilled in the necessary specialty, to one-of-a-kind experiments such as the Large Hadron Collider or the Hubble Space Telescope, which cannot be reproduced in any