
Scaling ML science teams vertically, not horizontally.

Updated: Dec 1, 2021

TL;DR: Independence is a critical property of any software development team that needs to scale.


Context:

We are used to resolving scalability problems the easiest way: by adding more resources.


In software engineering, we call this horizontal scalability: basically, adding more machines and balancing the work between them.


This approach does not solve every problem, though: nine women cannot deliver a baby in one month 🙂


Indeed, when your team is self-contained, you can parallelize work inside it to some extent, and hire more professionals to collaborate and come up with solutions, sometimes faster.


In this article, we will consider how a science team's dependency on other engineering teams influences velocity.


Problem:

After finishing the experimentation phase of solving a business problem with ML, scientists need to collaborate with engineering teams to get the ML model deployed at production scale.


Classical solution:

Scientists hand off code and binaries to engineers, explaining what data the model needs as input, what output it produces, and what its computational resource and software dependency requirements are.
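As a rough sketch, the contract a scientist hands over could be written down as a simple spec. All names and fields here are hypothetical, for illustration only; the article does not prescribe a format:

```python
from dataclasses import dataclass, field

@dataclass
class ModelHandoffSpec:
    """Hypothetical handoff contract from a scientist to engineers."""
    model_name: str
    input_schema: dict      # feature name -> expected type of each model input
    output_schema: dict     # fields the model produces as output
    cpu_cores: int = 2      # computation resource requirements
    memory_gb: int = 4
    dependencies: list = field(default_factory=list)  # pinned software deps

# Example of a filled-in spec for an imaginary churn model
spec = ModelHandoffSpec(
    model_name="churn-classifier",
    input_schema={"customer_age": "int", "monthly_spend": "float"},
    output_schema={"churn_probability": "float"},
    dependencies=["scikit-learn==1.0.1", "pandas==1.3.4"],
)
```

Even written down like this, each field is a point where engineers must come back to the scientist with questions, which is exactly the synchronization cost described below.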


Engineers, upon receiving the model codebase, dig into the code details and come up with a plan to deploy the model as a service.


Pros:

  • Scientists' capacity is freed up, since their time is spent only on experimentation;

  • Engineers may have a better understanding of production systems and how to monitor them.

Cons:

  • Engineers have less understanding of the ML model code and its lifecycle;

  • Engineers may not be aware of how changes in code or data can influence the model's results;

  • Scientists, not engineers, have a better understanding of how to monitor the ML properties of a model;

  • If the quality of the ML model's results degrades in production, only scientists would be able to analyze what actions are required to improve it.

Because of this split between ML and engineering backgrounds, delivering a model to production requires a lot of synchronization and communication between scientists and engineers.


The scalability of this approach is 1 scientist to 1 or many software engineers.


Suggested solution:

Provide simple-to-use tools and processes, as a platform, for scientists to deliver ML models to production and monitor them without depending on software engineers to do that for them. In some sense, this makes scientists independent of engineers, but dependent on the platform.
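A minimal sketch of what such a platform could look like from the scientist's side. The `deploy` and `monitor` functions and their parameters are hypothetical, invented here to illustrate the idea of a self-service interface; a real platform would build containers, register models, and wire up alerting behind these calls:

```python
# Hypothetical self-service platform client: a scientist deploys and
# monitors a model without writing any serving infrastructure.

def deploy(model_path: str, name: str, replicas: int = 2) -> str:
    """Package the model artifact and roll it out behind an endpoint.
    Here it only simulates that contract and returns the serving URL."""
    return f"https://models.internal/{name}"

def monitor(name: str, metric: str, threshold: float) -> dict:
    """Register an alert on a model-quality metric that the
    scientist (not an engineer) can act on."""
    return {"model": name, "metric": metric, "threshold": threshold, "active": True}

# A scientist's entire path to production, in two calls
endpoint = deploy("s3://artifacts/churn-model.pkl", name="churn-classifier")
alert = monitor("churn-classifier", metric="auc", threshold=0.80)
```

The design choice is that engineers own everything behind these two functions, while scientists own everything passed into them.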


Pros:

  • Scientists are responsible for the quality of their ML models' results, and can own that at scale;

  • Engineers may have a better understanding of production model-serving systems and how to create automated monitoring that scientists can act on;

  • Engineers are responsible for the platform, but not for ML models themselves;

  • Scientists can make changes to the data flow or ML model code without too many interactions with engineers;

Cons:

  • The independence of scientists from engineers depends on the quality of the platform.


The scalability of this approach is many scientists to the number of engineers required to support the platform.


In practice, a platform maintained by 2 software engineers was found to support 60 people in a data science department.





©2021 by Tradunsky
