Be honest with yourself: how many times have you wanted to restart an ongoing project from scratch, throwing away the current folder? Or how many times have you had to rename files and adjust the folder structure to make your project simple and clear? Not to mention all those thousands of versions of your scripts that are dangling around in your mailbox.

Tired of this? Then get on board and read my comments on how to make your project reproducible, portable, and self-contained.

We start by working up some intuition about these three key aspects rather than trying to grasp explicit technical definitions.

In a data science context, reproducibility means that the whole analysis can be recreated (or repeated) from scratch: executing the scripts on the raw data must yield exactly the same results. It means, for instance, that if the analysis involves generating random numbers, then one has to set a seed (an initial state of the random generator) to obtain the same random split each time. Ideally, everyone should also have access to the data and software needed to replicate your analysis (this is not always the case, since data can be private), but this is already the domain of open science.

Portability means that, regardless of the operating system or the computer, the project should work given minimal prerequisites. For instance, if the project uses a package that works only on Windows, then it is not portable. A project is also not portable if it relies on the settings of a particular computer, such as absolute paths instead of paths relative to your project folder (e.g., when reading the data or saving plots to files). Normally, you should be able to run the code on your collaborator’s machine without changing a single line in the scripts.

We call a project self-contained when you have everything you need at hand (i.e., in the folder of your project) and your project does not affect anything it did not create. It is a bad idea to use a function that is defined in another one of your projects. Not only will anyone who does not have that second project suffer, but so will you when you move your current project to another machine. Furthermore, if you need, for instance, to save processed data, then save it separately and do not overwrite the raw data. There is another term with a similar meaning, isolated, which relates to the dependencies of the project. This topic is covered extensively in the section on the packrat dependency management system.

This post is an attempt to summarize the use of “sexy” tools and techniques that significantly improve the above-mentioned aspects of a project. Of course, one can immediately feel that these aspects are interrelated. As a consequence, the techniques and practices we consider further improve several aspects at a time rather than a particular one. For instance, using a consistent folder structure will make your project reproducible and portable, while properly managed dependencies will ensure that the project is self-contained and portable. That is why the further content is organized by tools rather than by stand-alone aspects. But do not get fooled, it is not yet another git / RStudio tutorial. There are dozens of tutorials, and I do not try to compete with them. Instead, I want to give an overview of useful things based entirely on my experience.

Now, you might ask yourself: why is it such a big deal? Well, first off, it gives more credibility to the research, because it can be verified and validated by a third party (your peers). Furthermore, keeping the flow of analysis reproducible, portable and self-contained makes it easier to proceed with and to extend the project. At first glance, it might look like you spend more time organizing your project than doing actual analysis. However, in the long run you will save much more time than you can anticipate.

It’s like agreeing that we will all drive on the left or the right. A hallmark of civilization is following conventions that constrain your behavior a little, in the name of public safety.

If you are reading this post, I bet you have heard of (if not used) a version control system. It allows you to manage changes to files, especially the history of the source code.
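To make the seed, relative-path, and raw-data points from this post a bit more tangible, here is a minimal sketch. It is written in Python purely for illustration; the same ideas carry over to R, where `set.seed()` plays the role of the fixed seed. The folder layout (`data/raw`, `data/processed`), the file names, and the helper names `split_rows` and `run_analysis` are all hypothetical, not something this post prescribes.

```python
import random
import tempfile
from pathlib import Path


def split_rows(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of rows with a fixed seed and split it into train/test."""
    rng = random.Random(seed)  # local generator: same seed gives the same shuffle
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def run_analysis(project_root):
    """Read raw data and write processed data using only paths inside the project."""
    root = Path(project_root)
    raw_file = root / "data" / "raw" / "measurements.txt"   # hypothetical layout
    out_file = root / "data" / "processed" / "train.txt"    # never the raw file

    rows = raw_file.read_text().splitlines()
    train, test = split_rows(rows)

    out_file.parent.mkdir(parents=True, exist_ok=True)
    out_file.write_text("\n".join(train) + "\n")
    return train, test


# Demo in a throwaway folder that stands in for the project root.
with tempfile.TemporaryDirectory() as tmp:
    raw_dir = Path(tmp) / "data" / "raw"
    raw_dir.mkdir(parents=True)
    raw_text = "\n".join(f"row{i}" for i in range(8)) + "\n"
    (raw_dir / "measurements.txt").write_text(raw_text)

    first = run_analysis(tmp)
    second = run_analysis(tmp)
    same_result = first == second  # rerunning from raw data gives the same split
    raw_untouched = (raw_dir / "measurements.txt").read_text() == raw_text
    print(same_result, raw_untouched)  # prints: True True
```

Every path is built from the project root rather than hard-coded as an absolute path, so the folder can be moved to another machine unchanged. In a real script the root would typically be derived from the script's own location instead of being passed in by hand.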