Setting up a collaborative environment for your data science team is challenging even when working side by side in the same office. The task can be even more onerous when everyone is working remotely. You might already have to deal with a cramped workspace and a crying child at your home office door – you shouldn’t also have to worry about crashing applications and constant version conflicts.
“It works on my machine. Why doesn’t it work on yours?”
If you implement the best practices in this article, you’ll never have to hear this again.
At Appsilon, we’ve spent years developing efficient systems for remote collaboration. In this article, I will cover best practices for organizing a distributed data science team and kicking off a new data science or R Shiny project. I’ll explain how we use Scrum to distribute work in a way that is transparent for both the team and our clients. I’ll also show how we use GitHub to collaborate with version control and ensure quality. Finally, I’ll cover a Docker-based workflow to facilitate smooth development. Here’s a guide for what we’ll cover:
We use a modified version of the Scrum methodology for project management in the majority of our projects. Before the project begins, the project leader on Appsilon’s side collects the requirements from the client and splits them into high-level tasks. This is how the initial project backlog is created. We also provide rough estimates of how much time we need to complete the work.
For example, we might plan the work for 8 weeks, which we split into 8 one-week sprints. Each sprint starts with a planning session where we (the project team) sit down together with the client and plan what will be done in the week ahead. We take the tasks from the project backlog, split them into smaller tasks, and distribute these among the project team. Last but not least, we set a sprint goal, which is the most important thing we want to achieve by the end of the week. We finish the week with a sprint review where we present the increment worked out during the week.
Internally, we meet daily for a very short status meeting (which we call a ‘daily’) to give each other updates and make sure everyone is clear on which tasks they need to complete. It’s also a good opportunity to catch up on small things that are happening in the team, as we don’t have the continuous communication that an office environment provides.
There are several tools that help us manage the backlog and sprints. For instance, we use project boards in Asana or GitHub. The project board reflects the current state of our work. Our clients have access to the board related to their project, so they can check in on the team’s current priorities whenever they want.
We organize our project board into the following columns:
Our Scrum process is closely tied to version control and code review. If you’d like to learn more about version control and related topics, watch Marcin Dubel’s presentation on How to Write Production Ready R Code.
We typically use GitHub to help us manage version control and perform code reviews. I recommend making GitHub part of your workflow regardless of your team setup.
Best practices that we follow:
Before submitting a PR we make sure that:
At Appsilon, our team has always been distributed between two separate offices, with collaborators spread out all over the world. So, even before the pandemic, we had project members scattered across different locations. On top of that, we have served a large number of global clients based in many different time zones.
For some projects, we work on the client’s infrastructure and nothing can leave their environment. For others, we have more “freedom” and can work locally on our own machines. Whichever way we work, it is essential that we don’t waste time setting up a development environment. We sometimes swap out team members based on the specialization required for a particular project stage (frontend, infrastructure, etc.), so it’s important that new project members can begin development easily at any stage of the project.
To account for different operating systems, system dependencies, R versions, and R package versions, we do our development in an instance of RStudio that runs inside an isolated environment (a Docker container). When we start a new project, we always build a dedicated Docker image for it to ensure consistency across team members’ workstations.
Using Docker and `renv` together, we ensure reproducibility: the underlying system, its dependencies, and the required R packages are fixed for a particular application. To learn more about why this is important, read Pawel Przytula’s blog post on reproducible research. We use a `renv.lock` lockfile to install R packages when the Docker image is built. A tutorial on how to set up Docker and `renv` is readily available from RStudio. We store the most recent version of the lockfile in the project repository. All changes related to the Docker image must be pushed to the registry. Our development workflow can be set up from a git repository as a project template.
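To illustrate why keeping a single lockfile in the repository matters, here is a minimal sketch of a drift check between two lockfiles. It is not part of our tooling: it is written in Python purely for illustration, the function name `lockfile_drift` and the sample package versions are hypothetical, and it assumes the usual `renv.lock` JSON layout with a top-level `R` entry and a `Packages` map. Within R, `renv::status()` is the more natural way to spot this kind of mismatch.

```python
import json


def lockfile_drift(reference: dict, local: dict) -> dict:
    """Return {name: (reference_version, local_version)} for every mismatch.

    Both arguments are parsed renv.lock documents. A version of None means
    the entry is missing from that lockfile.
    """
    drift = {}

    # Compare the R version recorded in each lockfile.
    ref_r = reference.get("R", {}).get("Version")
    loc_r = local.get("R", {}).get("Version")
    if ref_r != loc_r:
        drift["R"] = (ref_r, loc_r)

    # Compare package versions across the union of both package sets.
    ref_pkgs = reference.get("Packages", {})
    loc_pkgs = local.get("Packages", {})
    for name in sorted(set(ref_pkgs) | set(loc_pkgs)):
        ref_v = ref_pkgs.get(name, {}).get("Version")
        loc_v = loc_pkgs.get(name, {}).get("Version")
        if ref_v != loc_v:
            drift[name] = (ref_v, loc_v)

    return drift


# Hypothetical lockfile contents, trimmed to the fields this sketch reads.
repo_lock = json.loads("""
{
  "R": {"Version": "4.0.2"},
  "Packages": {
    "shiny":   {"Package": "shiny",   "Version": "1.5.0"},
    "dplyr":   {"Package": "dplyr",   "Version": "1.0.2"},
    "ggplot2": {"Package": "ggplot2", "Version": "3.3.2"}
  }
}
""")

local_lock = json.loads("""
{
  "R": {"Version": "4.0.2"},
  "Packages": {
    "shiny": {"Package": "shiny", "Version": "1.4.0"},
    "dplyr": {"Package": "dplyr", "Version": "1.0.2"}
  }
}
""")

print(lockfile_drift(repo_lock, local_lock))
# {'ggplot2': ('3.3.2', None), 'shiny': ('1.5.0', '1.4.0')}
```

An empty result means both environments pin the same R and package versions. Any entry in the result is a candidate cause of an “it works on my machine” report.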
There is no secret recipe for making your data science team work efficiently in a remote setup. In fact, we’ve found that Scrum with a well-organized project board, code reviews, and a carefully maintained development environment are essential for project success no matter how your team is structured. We hope these best practices will help keep your data science team organized and productive even after it becomes safe to return to the office.