It is exciting to see all the new papers attempting to link genetics and geography – a solid tradition in genetics that goes back to the days of Dr. Luigi Luca Cavalli-Sforza, one of the founders of the field. However, before you pat yourself on the back and claim that you built a great PREDICTOR, it’s worth considering the following points.
- Just because you match samples with geography doesn’t mean you built a predictor. You may have built a classifier instead. What’s the difference? A classifier matches samples with a training set and uses the geographical coordinates of the training set to yield the geographical coordinates of the test samples. This is NOT a prediction because there is only a finite set of possible outputs. True prediction occurs when neither your samples nor their populations are in your training set and the predictor infers geographical coordinates that it was NOT trained on. This is a lot harder.
- How much of your data were used to train the classifier/predictor? Typically, researchers should limit their training set to 10-20% of the dataset; however, the literature is full of papers, many of them high profile, whose classifiers were trained on 50% and even 90% of the data. This is OVERFITTING, or, to be more specific, cheating, because your classifier is now very specific to your dataset and probably would not work on other datasets.
- How do you build a good predictor? We already learned that you need to prove that your predictor works by excluding the ancestral population of your sample from your training dataset, i.e., to perform a drop-one-population analysis. Think of it as social distancing. It is also advisable to validate on a completely different dataset; this way you can argue that your predictor is more robust. Using datasets genotyped on different arrays demonstrates that your predictor is also robust to the genotyping platform. Finally, using macro- (e.g., worldwide populations) and micro-datasets (e.g., island or village populations) would demonstrate that your predictor is agnostic to the scale of the test dataset.
- Did you build something useful? The only reason to develop a new method is that it outperforms existing ones. Don’t compare your method to PCA or SPA; we all know that they perform poorly. Compare your method with a method that has been convincingly demonstrated to work well.
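The classifier/predictor distinction above can be sketched in a few lines. The code below is a toy illustration with made-up data, not any published method: `nearest_match` stands in for a classifier, whose output is limited to coordinates it has already seen, while `interpolating_predictor` is a crude distance-weighted k-nearest-neighbor interpolator that can emit coordinates absent from the training set.

```python
import numpy as np

# Hypothetical toy data: genotypes are 2-D vectors and each training
# sample has known geographic coordinates.
train_geno = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
train_coords = np.array([[10.0, 10.0], [20.0, 20.0], [30.0, 30.0]])

def nearest_match(geno, coords, query):
    """A 'classifier': returns the coordinates of the single closest
    training sample, so the output is confined to the finite set of
    training coordinates."""
    d = np.linalg.norm(geno - query, axis=1)
    return coords[np.argmin(d)]

def interpolating_predictor(geno, coords, query, k=3):
    """A crude 'predictor': a distance-weighted average of the k nearest
    training samples, so it can emit coordinates it was never trained on."""
    d = np.linalg.norm(geno - query, axis=1)
    idx = np.argsort(d)[:k]
    w = 1.0 / (d[idx] + 1e-9)
    return (w[:, None] * coords[idx]).sum(axis=0) / w.sum()

query = np.array([0.9, 0.9])
print(nearest_match(train_geno, train_coords, query))        # a coordinate from the training set
print(interpolating_predictor(train_geno, train_coords, query))  # an in-between point
```

The difference is visible in the outputs: the classifier can only return one of the three training coordinates, while the interpolator lands between them.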
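To see why an oversized training fraction flatters a classifier, here is a small simulation (all data synthetic, parameter choices arbitrary): as the training fraction grows, every test sample sits closer to some training sample, so the nearest-match error shrinks for density reasons alone, not because the method got any smarter.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Synthetic data: true coordinates on a 100x100 map, plus a noisy
# "genetic" signal that correlates with geography.
coords = rng.uniform(0, 100, size=(n, 2))
geno = coords + rng.normal(0, 5, size=(n, 2))

def mean_test_error(train_frac):
    """Mean nearest-match error on the held-out samples for one random split."""
    m = int(n * train_frac)
    perm = rng.permutation(n)
    tr, te = perm[:m], perm[m:]
    errs = []
    for i in te:
        d = np.linalg.norm(geno[tr] - geno[i], axis=1)
        errs.append(np.linalg.norm(coords[tr][np.argmin(d)] - coords[i]))
    return float(np.mean(errs))

for frac in (0.1, 0.5, 0.9):
    print(f"train fraction {frac:.0%}: mean error {mean_test_error(frac):.1f}")
```

The error drops steadily as the training fraction rises, which is exactly why a classifier trained on 90% of the data looks deceptively accurate on the remaining 10%.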
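The drop-one-population validation described above can be sketched as follows, again with hypothetical toy populations and a generic k-NN interpolator standing in for the predictor: each population is held out in full, so the model must place its samples using only the other populations.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy data: three population clusters with known centers.
centers = {"A": (10, 10), "B": (50, 50), "C": (90, 10)}
pops, geno, coords = [], [], []
for name, (x, y) in centers.items():
    for _ in range(30):
        c = np.array([x, y]) + rng.normal(0, 3, 2)
        pops.append(name)
        coords.append(c)
        geno.append(c + rng.normal(0, 3, 2))  # genotype tracks geography noisily
pops, geno, coords = np.array(pops), np.array(geno), np.array(coords)

def knn_predict(tr_geno, tr_coords, query, k=5):
    """Distance-weighted k-NN interpolation of training coordinates."""
    d = np.linalg.norm(tr_geno - query, axis=1)
    idx = np.argsort(d)[:k]
    w = 1.0 / (d[idx] + 1e-9)
    return (w[:, None] * tr_coords[idx]).sum(axis=0) / w.sum()

def drop_one_population_errors(geno, coords, pops, predict):
    """Hold out every sample of one population at a time; the predictor
    must place them using only the remaining populations."""
    errors = {}
    for pop in np.unique(pops):
        test = pops == pop
        preds = np.array([predict(geno[~test], coords[~test], g)
                          for g in geno[test]])
        errors[pop] = float(np.mean(np.linalg.norm(preds - coords[test], axis=1)))
    return errors

print(drop_one_population_errors(geno, coords, pops, knn_predict))
```

Per-population errors from this procedure are an honest measure of generalization: a predictor that only looks good when its own population is in the training set fails exactly this test.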
Of course, if your tool doesn’t predict anything, you should ignore all the recommendations above. In this case, your so-called predictor should be trained on most of the data to maximize the chances that the tested samples would look like the training ones. Ideally, you should handpick the test samples to make sure that this is indeed the case. Limit yourself to the European POPRES dataset. Stay away from global datasets; they may crash your tool, as happened with SPA, which spat out geographic coordinates that do not exist. If you insist on using a global dataset, like the HGDP, you must be a true risk-taker! Just make sure that you maximized the portion of the training dataset and that all the test samples match the training ones. If anyone tells you that your model overfits the data, ignore them, ideally, or just point them to the many papers describing PCA- or SPA-like methods that did the same thing. Of course, you should compare your method to these methods. Finished? Congratulations, you produced another utterly useless paper with results that would probably fail to replicate.