Best Practices for Data Science Projects
- How was the data collected or sampled? Engineers can, intentionally or unintentionally, introduce bias at collection time. Depending on how it is framed, the same data can be presented to support either a favorable or an unfavorable conclusion, so understand the collection process before trusting the results.
- How is the data split into train/validation/test sets? An inappropriate split can produce significant differences between offline metrics and production results. Two very common mistakes are applying an 80/20 split regardless of dataset size, and failing to stratify skewed datasets.
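A minimal sketch of stratified splitting for a skewed dataset (the function and its names are illustrative, not from any particular library; in practice a library helper such as scikit-learn's `train_test_split` with `stratify=` does the same job):

```python
import random
from collections import defaultdict

def stratified_split(records, label_fn, test_frac=0.2, seed=42):
    """Split records into train/test while preserving each class's proportion.

    records: list of samples; label_fn extracts the class label from a sample.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for r in records:
        by_label[label_fn(r)].append(r)
    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_test = max(1, round(len(group) * test_frac))  # keep at least one per class
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Skewed data: 90 negatives, 10 positives.
data = [("neg", i) for i in range(90)] + [("pos", i) for i in range(10)]
train, test = stratified_split(data, label_fn=lambda r: r[0])
# Both splits keep roughly the original 90/10 class ratio.
```

A naive random split of the same data could easily leave zero or one positive example in the test set, making the evaluation of the minority class meaningless.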
- Does the test data represent the data the model will actually be used on? What percentage of the data changes over a given period of time? A basic check is to verify that data received after the project is complete still reflects the business need.
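One way to make "still represents the business need" concrete is a distribution-drift check on each feature. The sketch below (an assumed, minimal implementation; the 0.1/0.25 thresholds are a common rule of thumb, not a universal standard) computes the Population Stability Index between the data the model was built on and fresh data:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between two numeric samples.

    Rough interpretation (varies by team): < 0.1 stable, > 0.25 drifted.
    """
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def bin_fractions(values):
        counts = [0] * bins
        for v in values:
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
        n = len(values)
        return [max(c / n, 1e-6) for c in counts]  # floor avoids log(0)

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

train_feature = [x / 10 for x in range(100)]  # values in [0, 9.9]
fresh_feature = [x / 10 + 5 for x in range(100)]  # shifted by 5 -> drift
drift = psi(train_feature, fresh_feature)
```

Running such a check periodically after launch gives an early warning that the test set no longer represents production traffic.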
- Is it maintainable? The software should meet the requirements of the project. Engineers frequently finish the last step of the project and rush to the project manager to demonstrate the model's metrics (usually framed around accuracy), while maintainability is left as an afterthought.
- Is it scalable? If the dataset is small, splitting with an 80/20 ratio is fine. But with a large dataset of, say, 10 million records, the same ratio yields an 8-million-record training set and a 2-million-record test set. Do you really need a test set that big? Past a certain size, a fixed absolute test-set size is usually enough.
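A simple way to express this is to cap the test set at an absolute size instead of a pure ratio. This is a minimal sketch under assumed names; the default cap of 100,000 is illustrative, not a standard, and should be chosen for the statistical power your metrics need:

```python
import random

def split_with_capped_test(records, test_cap=100_000, test_frac=0.2, seed=42):
    """Shuffle and split off a test set of size min(test_frac * n, test_cap).

    For small data this behaves like an ordinary 80/20 split; for very large
    data the test set stops growing once it hits the cap.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = min(test_cap, int(len(shuffled) * test_frac))
    return shuffled[n_test:], shuffled[:n_test]

# With 1,000 records the cap is inactive: 800 train / 200 test.
train, test = split_with_capped_test(range(1000))
```

With 10 million records and a cap of 100,000, the test set stays small enough to evaluate quickly while the other 9.9 million records go to training.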
- Is it documented? The engineers who wrote the code understand every piece of the project because they created it, but good documentation lets new hires come up to speed quickly. It should not take new people long to understand the existing work.