Selection bias occurs when data is selected without proper randomization or is not gathered uniformly. It leads to a dataset that does not represent the situation being modeled.
Sampling bias is when the selection of data is not properly randomized during the data collection process.
Example: A model that is trained on white skin and tested on dark skin.
Example: A social anxiety survey among students who applied to take part in the study. Because students who volunteer may be more willing to participate than others, they might not be a proper random sample of students with high and low social anxiety.
Coverage bias is when the selected data is not representative of the real world.
Example: A diversity survey conducted only at the one international school in town instead of across different schools.
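One rough way to spot a coverage gap like this is to compare each group's share of the collected sample against its share in a reference population. The sketch below is only an illustration: the function name coverage_gaps, the 0.05 tolerance, the group labels, and the reference shares are all hypothetical and would need to come from your own data.

```python
from collections import Counter


def coverage_gaps(sample_groups, reference_shares, tolerance=0.05):
    """Return groups whose sample share differs from the reference share.

    sample_groups: one group label per collected record.
    reference_shares: assumed real-world share of each group (sums to 1).
    tolerance: hypothetical threshold for an acceptable difference.
    """
    counts = Counter(sample_groups)
    total = len(sample_groups)
    gaps = {}
    for group, expected in reference_shares.items():
        observed = counts.get(group, 0) / total if total else 0.0
        if abs(observed - expected) > tolerance:
            gaps[group] = (observed, expected)
    return gaps


# Made-up survey in which almost every respondent comes from one school type.
sample = ["international"] * 90 + ["public"] * 10
reference = {"international": 0.10, "public": 0.70, "private": 0.20}

gaps = coverage_gaps(sample, reference)
for group, (observed, expected) in gaps.items():
    print(f"{group}: sample share {observed:.2f}, reference share {expected:.2f}")
```

A group that is missing from the sample entirely still shows up in the result, because the loop runs over the reference groups rather than over the groups that happen to appear in the data.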
How to prevent selection bias
A common mistake is to select samples from one specific group, leave out other groups, and then generalize the results from that one group’s input. To prevent selection bias, you should ensure that:
You use proper random sampling and split your training, validation, and test sets randomly.
You are aware of the impact of the model on different groups and which groups are represented or underrepresented by the data.
Your data reflects the real world.
Your model performs equally well across different groups.
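Two of the checks above can be sketched in a few lines of plain Python: a random split into training, validation, and test sets, and a per-group accuracy comparison. This is a minimal sketch under stated assumptions, not a definitive implementation. The helper names random_split and accuracy_by_group, the 70/15/15 proportions, the fixed seed, and the toy labels are all hypothetical.

```python
import random
from collections import defaultdict


def random_split(records, train=0.7, validation=0.15, seed=0):
    """Shuffle records, then cut them into train, validation, and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


def accuracy_by_group(groups, y_true, y_pred):
    """Return the fraction of correct predictions for each group label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, truth, pred in zip(groups, y_true, y_pred):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}


# Toy data: 100 record ids split at random.
records = list(range(100))
train_set, val_set, test_set = random_split(records)

# Toy predictions for two hypothetical groups, "a" and "b".
groups = ["a", "a", "a", "a", "b", "b"]
y_true = [1, 0, 1, 1, 1, 0]
y_pred = [1, 0, 1, 1, 0, 1]
per_group = accuracy_by_group(groups, y_true, y_pred)
print(per_group)
```

A large difference between the per-group values suggests the model does not perform equally well across groups, which can be a sign that one group was underrepresented in the training data. A purely random split can still leave a small group nearly absent from one of the sets, so a stratified split may be worth considering in that case.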