Rule — Version Control — DVC: Folder '.dvc' should be committed to Git

DVC uses the .dvc folder to record information about all your DVC-tracked files and where they are hosted. This folder must be committed to your Git repository for DVC to work correctly. See the DVC documentation to learn more about the .dvc directory. If you’re seeing this in a report, then your project’s Git repository is not tracking the ‘.dvc’ folder. To fix this, you may use the following commands:...

1 min · Bart van Oort (bvobart)

Rule — Version Control — DVC: Is installed

To use DVC, it must be installed correctly. If you’re seeing this as part of an mllint report, then mllint was unable to find ‘dvc’ on your PATH. This indicates either that DVC is not installed or that it is not on your PATH. See DVC’s installation instructions to learn more about installing DVC, or simply add it to your project as a Pip package, e....

1 min · Bart van Oort (bvobart)
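
For the "DVC: Is installed" rule above, the following is a minimal sketch of what "finding ‘dvc’ on your PATH" can look like. It is not mllint's actual implementation; the function name `find_dvc` is hypothetical, and the lookup relies only on Python's standard `shutil.which`.

```python
import shutil
from typing import Optional


def find_dvc(executable: str = "dvc") -> Optional[str]:
    """Return the path of the executable if it can be found on PATH, else None.

    Hypothetical helper for illustration; mllint may perform this check differently.
    """
    return shutil.which(executable)


if __name__ == "__main__":
    path = find_dvc()
    if path is None:
        print("dvc was not found on PATH")
    else:
        print(f"dvc found at {path}")
```

If this reports that dvc was not found even though DVC is installed, the directory containing the executable is probably missing from PATH, for example because a virtual environment is not activated.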

Rule — Version Control — DVC: Project uses Data Version Control

Like code, data should be version controlled. However, data cannot be version controlled with Git directly, as Git is not designed to deal with large and/or binary files. Tracking large files directly with Git adds bloat to your repository’s Git history, which needs to be downloaded every time your project is cloned. For properly version controlling data in ML projects, mllint recommends using Data Version Control (DVC)....

2 min · Bart van Oort (bvobart)

Rule — Version Control — DVC: Should be tracking at least one data file

Using DVC entails tracking changes to your data and models with DVC. If you’re seeing this in a report, your project is using DVC, but it is currently not tracking any files with it. Learn more about getting started with data versioning with DVC, or the dvc add command. Then, add your datasets and models to DVC by running the command dvc add <files>. Tip: Under the hood, mllint uses the command dvc list ....

1 min · Bart van Oort (bvobart)
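
For the "Should be tracking at least one data file" rule above: running dvc add on a file creates a small pointer file next to it with the .dvc extension, and that pointer file is what gets committed to Git. The sketch below looks for such pointer files in a project directory. It is an illustrative alternative, not the dvc list based check that mllint uses, and the function name `find_dvc_pointer_files` is hypothetical.

```python
from pathlib import Path
from typing import List


def find_dvc_pointer_files(project_root: Path) -> List[Path]:
    """Return all '*.dvc' pointer files under project_root, sorted by path.

    Assumption: each DVC-tracked file or folder has a '<name>.dvc' pointer file.
    The '.dvc' directory itself and anything inside it are skipped.
    """
    pointers = []
    for path in project_root.rglob("*.dvc"):
        inside_dvc_dir = ".dvc" in path.relative_to(project_root).parts[:-1]
        if path.is_file() and not inside_dvc_dir:
            pointers.append(path)
    return sorted(pointers)


if __name__ == "__main__":
    found = find_dvc_pointer_files(Path("."))
    if not found:
        print("No DVC pointer files found; nothing seems to be tracked with dvc add yet.")
    for pointer in found:
        print(pointer)
```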

Rule — Version Control — DVC: Should have at least one remote data storage configured

To share your DVC-tracked data with your colleagues and allow them to interact with it, DVC should have at least one remote storage configured. If you’re seeing this in a report, your project currently has none. Learn more about DVC remotes, then pick your desired remote storage solution, check the documentation for adding remotes, and add it as your default remote to DVC using dvc remote add -d <name> <url>.

1 min · Bart van Oort (bvobart)
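
For the "at least one remote data storage configured" rule above: dvc remote add writes the remote into the project's .dvc/config file. The sketch below lists the remote names declared in such a config. It assumes that remotes appear as section headers of the form `['remote "<name>"']`; the function name `list_remotes`, the remote name `storage`, and the bucket URL are hypothetical and only serve the illustration.

```python
import re
from typing import List

# Assumption: remotes appear in .dvc/config as section headers like
#   ['remote "<name>"']
REMOTE_HEADER = re.compile(r"""^\s*\[\s*'remote\s+"([^"]+)"'\s*\]\s*$""")


def list_remotes(config_text: str) -> List[str]:
    """Return the remote names declared in the given DVC config text."""
    names = []
    for line in config_text.splitlines():
        match = REMOTE_HEADER.match(line)
        if match:
            names.append(match.group(1))
    return names


# Hypothetical config, roughly what `dvc remote add -d storage <url>` would produce.
example_config = '''[core]
    remote = storage
['remote "storage"']
    url = s3://example-bucket/data
'''

print(list_remotes(example_config))  # prints ['storage']
```

An empty result for a project's .dvc/config would correspond to the situation this rule reports.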

Rule — Version Control — Project should not have any large files in its Git history

Git is great for version controlling small, textual files, but not for binary or large files. Tracking large files directly with Git adds bloat to your repository’s Git history, which needs to be downloaded every time your project is cloned. Large files should instead be version controlled as data, e.g. using Git LFS or DVC. See the version-control/data/ rules of mllint for more info about version controlling data. To satisfy this rule, it is not enough to just remove these large files from your local filesystem, as the files will still exist inside your Git history....

1 min · Bart van Oort (bvobart)
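
For the "large files in its Git history" rule above, the sketch below shows one way to spot large blobs in a repository's history. It only parses text and assumes the input was produced by piping `git rev-list --objects --all` into `git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)'`. The function name `large_blobs`, the 10 MB threshold, and the sample lines are hypothetical; mllint's own detection and threshold may differ.

```python
from typing import List, Tuple


def large_blobs(batch_check_output: str, threshold_bytes: int) -> List[Tuple[str, int]]:
    """Return (path, size) pairs for blobs larger than threshold_bytes, largest first.

    Each input line is expected to look like: '<type> <sha> <size> <path>'.
    """
    found = []
    for line in batch_check_output.splitlines():
        parts = line.split(" ", 3)
        # Skip malformed lines and anything that is not a file blob (commits, trees).
        if len(parts) != 4 or parts[0] != "blob":
            continue
        _type, _sha, size, path = parts
        if int(size) > threshold_bytes:
            found.append((path, int(size)))
    return sorted(found, key=lambda item: item[1], reverse=True)


# Hypothetical sample of the expected input format.
sample = (
    "commit aaa 250 \n"
    "blob bbb 52428800 data/train.csv\n"
    "blob ccc 1200 README.md\n"
    "tree ddd 80 data\n"
)

print(large_blobs(sample, 10 * 1024 * 1024))  # prints [('data/train.csv', 52428800)]
```

Because the input covers every object reachable from any ref, a file that was deleted from the working tree but is still present in an older commit shows up here as well.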

Rule — Version Control — Project uses Git

The code of any software project should be tracked in version control software. Git is the most widely used free and open-source version control tool, designed to handle anything from small projects to extremely large projects such as the Linux kernel. To start using Git, run git init in a terminal at the root of your project. See also Git’s documentation for tutorials on how to work with Git.

1 min · Bart van Oort (bvobart)
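
For the "Project uses Git" rule above: after running git init, the project root contains a .git entry. The sketch below walks up from a starting directory to find the closest directory that has one. It is a simplified illustration rather than mllint's actual check, and the function name `find_git_root` is hypothetical.

```python
from pathlib import Path
from typing import Optional


def find_git_root(start: Path) -> Optional[Path]:
    """Return the closest directory at or above `start` that contains '.git', else None.

    '.git' is usually a directory, but it can be a file (e.g. in worktrees or
    submodules), so only its existence is checked.
    """
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / ".git").exists():
            return candidate
    return None


if __name__ == "__main__":
    root = find_git_root(Path.cwd())
    if root is None:
        print("Not inside a Git repository; consider running 'git init'.")
    else:
        print(f"Git repository root: {root}")
```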