Statistical techniques

What statistical techniques do QSAR models use?

As we have seen in the previous entries, QSAR models predict properties or activities of molecules, and are built by analyzing molecular descriptors with statistical and/or machine learning techniques.

QSAR models can be used qualitatively to classify a substance, for example as a skin irritant or non-irritant. They can also be used quantitatively to calculate specific numerical values, for example how many days it takes for a substance to degrade. As we saw in the entry on computational methods, models of the first type are known as classification models, while those of the second type are known as regression models.

Depending on the type of model we build, we need different statistical tools to evaluate its performance. The parameters that indicate how well a model predicts data are known as “goodness-of-fit” parameters.

In a classification model, the input consists of a set of molecules divided into two groups: those that have the property of interest (e.g., they are irritating to the skin) and those that do not. Once the model has been generated, its predictive capacity is assessed by applying it to different molecules whose properties are known.

Ideally, the model would correctly predict the irritation capacity of every molecule, but in practice there is normally a percentage of errors: some irritating molecules are predicted as non-irritating, and some non-irritants are predicted as irritants. These two kinds of error are known as false negatives (FN, positives that the model predicts as negative) and false positives (FP, negatives that the model predicts as positive), respectively. In a good model, most of the positives and negatives are predicted correctly; these are known as true positives (TP) and true negatives (TN), respectively.
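As a minimal sketch of how these four counts arise, the Python snippet below compares known labels with model predictions. The labels and predictions are invented for illustration (1 = irritant, 0 = non-irritant), and the function name count_outcomes is hypothetical.

```python
def count_outcomes(actual, predicted):
    """Count TP, TN, FP and FN for binary labels (1 = positive, 0 = negative)."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1  # irritant correctly predicted as irritant
        elif a == 0 and p == 0:
            tn += 1  # non-irritant correctly predicted as non-irritant
        elif a == 0 and p == 1:
            fp += 1  # non-irritant wrongly predicted as irritant
        else:
            fn += 1  # irritant wrongly predicted as non-irritant
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Invented example data, not taken from any real QSAR model.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

counts = count_outcomes(actual, predicted)
print(counts)  # {'TP': 3, 'TN': 4, 'FP': 2, 'FN': 1}
```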

To visualize this result, a matrix known as the confusion matrix (or result matrix) is usually used. An example is illustrated in the following table:

From these values we can calculate parameters such as accuracy (the percentage of correct predictions), sensitivity (the percentage of correctly predicted positive cases, also called recall or true positive rate) and specificity (the percentage of correctly predicted negative cases, also called selectivity or true negative rate), among others.
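The following Python sketch shows how these three parameters can be derived from the four counts of a confusion matrix. The counts are invented for illustration, the function name classification_metrics is hypothetical, and the code assumes that the test set contains at least one positive and one negative molecule (otherwise a division by zero would occur).

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, sensitivity and specificity as fractions between 0 and 1."""
    total = tp + tn + fp + fn
    return {
        # Share of all predictions that are correct.
        "accuracy": (tp + tn) / total,
        # Share of actual positives predicted as positive (recall, TPR).
        "sensitivity": tp / (tp + fn),
        # Share of actual negatives predicted as negative (TNR).
        "specificity": tn / (tn + fp),
    }


# Invented counts: 3 TP, 4 TN, 2 FP, 1 FN.
metrics = classification_metrics(tp=3, tn=4, fp=2, fn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With these invented counts, the accuracy is 0.70, the sensitivity 0.75 and the specificity about 0.67; multiplying by 100 gives the percentages mentioned above.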

Regression models are evaluated a little differently. Let’s imagine that we have a very small data set, like the one shown in the table below, with the actual (or experimental) number of days to degradation and the value predicted by our model:

From this results table we can calculate several parameters, such as the average prediction error, that is, the average difference between the real and predicted values. Another parameter widely used to assess the quality of regression models is R² (the coefficient of determination), which tells us how close the data are to the fitted regression line.
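A minimal Python sketch of both parameters follows. It makes two assumptions that go beyond the text: the average prediction error is computed as the mean absolute error (so that over- and under-predictions do not cancel out), and R² is computed with the common definition 1 - SS_res / SS_tot. The degradation values and the function names are invented for illustration.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination, computed as 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    # Residual sum of squares: spread of the data around the predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: spread of the data around their own mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Invented degradation times in days, not real experimental data.
actual_days = [10, 20, 30, 40]
predicted_days = [12, 18, 33, 39]

mae = mean_absolute_error(actual_days, predicted_days)
r2 = r_squared(actual_days, predicted_days)
print(f"MAE = {mae:.1f} days, R2 = {r2:.3f}")  # MAE = 2.0 days, R2 = 0.964
```

An R² close to 1 together with a small average error suggests that the predictions track the experimental values closely.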

With all these parameters, we can evaluate the quality of the predictions of both types of models and thus measure their performance. The better a model predicts, the more confidently we can use it.

