Just a few days ago, the Netflix developer team announced that it has open-sourced Metaflow, its previously proprietary Python-based library for data science projects. In the announcement, the team noted that Metaflow has been used inside the company for more than two years across a variety of data science projects, with real-life use cases ranging from natural language processing to operations research.
Metaflow was built and nurtured by Netflix to enhance data scientists' productivity across an extensive range of projects, from classical statistics to deep learning. By providing a unified API over the infrastructure stack, it supports every stage of a data science project, from prototype to production. That is why it is fair to say that Metaflow, as a framework for data science projects, helps data scientists more than it helps machines.
Key Characteristics Of Metaflow
- Built To Help Data Scientists
Great data science projects focus on delivering the value of the data to users rather than on engineering, and Metaflow is designed to let data scientists do exactly that.
- Create Model With Top Tools
Metaflow can be used in conjunction with top data science libraries such as PyTorch, TensorFlow, or scikit-learn. Models are written in plain, idiomatic Python, so there is no significant learning curve.
- Develop With Metaflow
Metaflow also makes it easy to design a workflow and run it in a scalable manner. It maintains versions and keeps track of all experiments and data with ease.
- AWS Cloud Power
To scale further, Metaflow ships with built-in integrations with AWS cloud services for storage, compute, and machine learning; most notably, using them requires no code changes.
Metaflow Integration With Netflix's Data Science Infrastructure Stack
Models play only a small role in a data science project. Most production-ready projects depend heavily on a robust infrastructure stack. With Metaflow, Netflix's team of data scientists and developers touches every layer of that infrastructure stack.
Access to data is ensured through the data warehouse, which can be as simple as a folder of files, a full database, or a multi-petabyte data lake. The modeling code executes against that data in a compute environment, with a job scheduler orchestrating the entire process.
After this, the developer team structures the code as an object hierarchy, such as Python modules or packages, to aid execution. The machine learning model is registered together with its code version and input data.
When the ML model is deployed to production, the pertinent questions developers face are how to keep the code performing reliably and how to track its performance.
To address these concerns, Metaflow offers a comprehensive approach to managing the stack. The framework is prescriptive about the lower levels of the stack but has very little to say about the actual data science at the top. Best of all, developers can use it with a whole array of machine learning and data science libraries, ranging from TensorFlow and PyTorch to scikit-learn.
Metaflow keeps the learning curve low by letting you write models and business logic in idiomatic Python, while leveraging existing infrastructure whenever required.
As a framework for building and deploying data science workflows, Metaflow stands out for its array of built-in features. Let's have a quick look at some of its capabilities.
- Manages computing resources.
- Carries out containerized runs.
- Takes care of external dependencies.
- Versions, replays, and resumes workflows.
- Provides a client API to evaluate past runs.
- Moves back and forth between a local computer and remote cloud servers as execution demands.
Metaflow automatically captures snapshots of the code, data, and dependencies in an S3-backed datastore, though it supports the local filesystem as well. This makes it possible to restart workflows, recreate past results, and evaluate a workflow right from a notebook, enhancing productivity for data science professionals and developers.
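Conceptually, such a datastore behaves like a content-addressed snapshot store: each artifact is hashed and written once, so any past result can be recovered exactly from its key. The following is a minimal stdlib-only sketch of that idea, not Metaflow's actual implementation:

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


class SnapshotStore:
    """Toy content-addressed store: save artifacts, load them back by key."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def save(self, obj):
        blob = pickle.dumps(obj)
        key = hashlib.sha256(blob).hexdigest()
        path = self.root / key
        if not path.exists():  # identical artifacts are stored only once
            path.write_bytes(blob)
        return key

    def load(self, key):
        return pickle.loads((self.root / key).read_bytes())


store = SnapshotStore(tempfile.mkdtemp())
key = store.save({"lr": 0.01, "epochs": 10})
print(store.load(key))  # the exact artifact is recoverable from its key
```

Because the key is derived from the content, re-saving an unchanged artifact is a no-op, which is what makes reproducing past runs cheap.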
Project Scenarios Metaflow Helps With
Here are some of the real-life project scenarios in which Metaflow can help:
- Collaboration: Metaflow lets you help another data scientist debug an error; you can simply pull the state of their failed run onto your own machine.
- Restarting a Run: Whenever a run fails or is stopped, you can fix the error in the code and resume the workflow from exactly where it left off.
- Hybrid Runs: You can run different steps of a workflow on different platforms, for example, one step on your desktop and another (say, model training) on the cloud.
- Inspecting Run Metadata: Metaflow records metadata for every run, including hyperparameters, so after evaluating your training runs you can choose the top-performing configuration.
Get Started With Metaflow
The project home page is at metaflow.org, and the code is available at github.com/Netflix/metaflow. The framework offers comprehensive documentation at docs.metaflow.org, including tutorials to give you a good start. The online presentations and tutorials can help you find your feet in data science projects with Metaflow.
After serving its role inside Netflix, Metaflow is now available as an open-source project. One can use it to the fullest extent for optimum flexibility and productivity in data science projects across different use cases. And since it is open source, your Metaflow-based project can now also contribute back to the project itself.