Like code, data should be version controlled. However, data cannot be version controlled with Git directly, as Git is not designed to handle large and/or binary files. Tracking large files directly with Git bloats your repository’s Git history, which has to be downloaded every time your project is cloned.
To properly version control data in ML projects, mllint recommends using Data Version Control (DVC).
DVC is an open-source version control system for Machine Learning projects. DVC is built to help version your data and make ML models shareable and reproducible.
It is designed to handle large files, datasets, ML models, and metrics as well as code.
DVC can also help you manage ML experiments by guaranteeing that all files and metrics are consistent and in the right place, so that you can reproduce an experiment or use it as a baseline for a new iteration.
Install DVC (e.g. using `poetry add --dev dvc`) and run `dvc init` in your terminal to get started with DVC.
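As a rough illustration of what the first steps after installation might look like, the sketch below initializes DVC and puts a single dataset under its control. It assumes you run it from the root of an existing Git repository with DVC already installed; the path `data/train.csv`, its contents, and the commit message are hypothetical placeholders, not something mllint or DVC prescribes.

```shell
# Run from the root of an existing Git repository with DVC installed.
# "data/train.csv" is a hypothetical dataset path used only for illustration.

# Create a small placeholder dataset so the commands below have something to track.
mkdir -p data
printf 'id,label\n1,cat\n2,dog\n' > data/train.csv

# Set up DVC in the repository (creates the .dvc/ directory).
dvc init

# Hand the dataset over to DVC: this writes data/train.csv.dvc, a small
# pointer file, and lists the dataset itself in data/.gitignore.
dvc add data/train.csv

# Commit the pointer file and ignore rule (plus the DVC config staged by
# `dvc init`) to Git; the dataset itself stays out of the Git history.
git add data/train.csv.dvc data/.gitignore
git commit -m "Track training data with DVC"
```

After this, Git only versions the lightweight `.dvc` pointer file, while DVC manages the actual data, which is what keeps large files from bloating the repository's history.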
To learn more about DVC and how to use it, feel free to check out DVC’s official documentation and tutorials.
Or, if you prefer learning by watching videos, DVC has a YouTube channel with several short, informative videos:
- YouTube Channel: DVCorg
- YouTube Video: Version Control for Data Science Explained in 5 Minutes
- YouTube Playlist: DVC Basics