Rule — Version Control — DVC: Project uses Data Version Control

Like code, data should be version controlled. However, data cannot be version controlled with Git directly, as Git is not designed to deal with large and/or binary files. Tracking large files directly with Git adds bloat to your repository’s Git history, which needs to be downloaded every time your project is cloned. For properly version controlling data in ML projects, mllint recommends using Data Version Control (DVC)....

2 min · Bart van Oort (bvobart)

Rule — Version Control — DVC: Should be tracking at least one data file

Using DVC entails tracking changes to your data and models with DVC. If you’re seeing this in a report, your project is using DVC, but it is currently not tracking any files with it. Learn more about getting started with data versioning with DVC, or about the dvc add command. Then, add your datasets and models to DVC by running the command dvc add <files>. Tip: Under the hood, mllint uses the command dvc list ....
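As a rough illustration of what "tracking at least one file" looks like on disk: running dvc add <file> writes a small <file>.dvc pointer file next to the data. The sketch below simply looks for such pointer files. It is not how mllint performs its check, the function name find_dvc_pointer_files is hypothetical, and it will not see outputs that are tracked only through a DVC pipeline.

```python
from pathlib import Path


def find_dvc_pointer_files(project_root: str) -> list[Path]:
    """Return all `*.dvc` pointer files below project_root, sorted by path.

    Assumption: files added with `dvc add <file>` leave a `<file>.dvc`
    pointer file in the working tree. Pipeline outputs are not covered.
    """
    root = Path(project_root)
    return sorted(
        p
        for p in root.rglob("*.dvc")
        # Keep regular files only, and skip anything inside DVC's own
        # internal `.dvc/` directory, which holds config and cache.
        if p.is_file() and ".dvc" not in p.relative_to(root).parts[:-1]
    )
```

An empty result from a helper like this suggests that no file has been added with dvc add yet.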

1 min · Bart van Oort (bvobart)

Rule — Version Control — DVC: Should have at least one remote data storage configured

To share your DVC-tracked data with your colleagues and allow them to interact with it, DVC should have at least one remote storage configured. If you’re seeing this in a report, your project currently has none. Learn more about DVC remotes, pick your desired remote storage solution, check the documentation for adding remotes, and then add it to DVC as your default remote using dvc remote add -d <name> <url>.
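To make the effect of dvc remote add concrete, here is a minimal sketch that lists the remotes named in the text of a DVC config file. It assumes that remotes appear as section headers of the form ['remote "<name>"'] in .dvc/config. Both that layout and the helper name configured_remotes are assumptions of this sketch, and this is not how mllint itself performs the check.

```python
import re

# Assumed header layout: ['remote "myremote"']. The single quotes around
# the header are treated as optional in this sketch.
_REMOTE_SECTION = re.compile(
    r"""^\s*\[\s*'?remote\s+"([^"]+)"'?\s*\]\s*$""",
    re.MULTILINE,
)


def configured_remotes(config_text: str) -> list[str]:
    """Return the remote names found in the given DVC config text."""
    return _REMOTE_SECTION.findall(config_text)
```

Reading .dvc/config from the project root and passing its contents to this function would return an empty list for a project that has no remote configured.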

1 min · Bart van Oort (bvobart)

Rule — Version Control — Project should not have any large files in its Git history

Git is great for version controlling small, textual files, but not for binary or large files. Tracking large files directly with Git adds bloat to your repository’s Git history, which needs to be downloaded every time your project is cloned. Large files should instead be version controlled as data, e.g. using Git LFS or DVC. See the version-control/data/ rules of mllint for more info about version controlling data. To fix this rule, it is not enough to just remove these large files from your local filesystem, as the files will still exist inside your Git history....
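Because deleted files remain in the history, it helps to inspect the history itself rather than the working tree. The sketch below is one possible way to do that: it feeds the output of git rev-list --objects --all into git cat-file --batch-check and keeps the blobs above a size threshold. The 10 MiB default and both function names are illustrative assumptions, not mllint's actual implementation.

```python
import subprocess


def parse_blob_sizes(batch_check_output: str, threshold_bytes: int) -> list[tuple[int, str]]:
    """Return (size, path) for blobs at or above the threshold, largest first.

    Expects lines of the form "<type> <sha> <size> <path>".
    """
    large = []
    for line in batch_check_output.splitlines():
        parts = line.split(" ", 3)  # maxsplit keeps paths with spaces intact
        if len(parts) < 4:
            continue
        objtype, _sha, size, path = parts
        if objtype == "blob" and int(size) >= threshold_bytes:
            large.append((int(size), path))
    return sorted(large, reverse=True)


def large_blobs_in_history(repo_dir: str, threshold_bytes: int = 10 * 1024 * 1024):
    """List large blobs reachable from any ref in the repository at repo_dir."""
    # Every object reachable from any ref, as "<sha> <path>" lines.
    objects = subprocess.run(
        ["git", "rev-list", "--objects", "--all"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    ).stdout
    # Ask Git for the type and size of each of those objects.
    sizes = subprocess.run(
        ["git", "cat-file",
         "--batch-check=%(objecttype) %(objectname) %(objectsize) %(rest)"],
        cwd=repo_dir, input=objects, check=True, capture_output=True, text=True,
    ).stdout
    return parse_blob_sizes(sizes, threshold_bytes)
```

Calling large_blobs_in_history(".") at the root of a repository requires Git to be installed. A file that was committed once and later deleted still shows up in the result, which is why removing it from the local filesystem does not fix the rule.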

1 min · Bart van Oort (bvobart)

Rule — Version Control — Project uses Git

The code of any software project should be tracked in version control software. Git is the most widely used free and open-source version control tool, designed to handle anything from small projects to extremely large ones such as the Linux kernel. To start using Git, run git init in a terminal at the root of your project. See also Git’s documentation for tutorials on how to work with Git.
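A quick way to see whether a directory is already under Git is to look for a .git entry in it or in one of its parent directories, which is what git init creates. The following is a minimal sketch of that idea. The function name find_git_root is hypothetical, and this is not necessarily how mllint detects Git usage.

```python
from pathlib import Path
from typing import Optional


def find_git_root(start: str) -> Optional[Path]:
    """Return the closest directory at or above `start` that contains `.git`.

    Returns None when no such directory is found.
    """
    current = Path(start).resolve()
    for candidate in (current, *current.parents):
        # `.git` is usually a directory, but it can be a plain file
        # (for example in worktrees and submodules), so only test existence.
        if (candidate / ".git").exists():
            return candidate
    return None
```

If find_git_root(".") returns None, running git init at the project root is the first step.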

1 min · Bart van Oort (bvobart)