QSAR models

QSAR models: what are they and how are they created?

QSAR stands for Quantitative Structure-Activity Relationship. QSAR modelling is one of the most widely used techniques in the field of chemoinformatics, as we saw in the previous entry.

In this post we will help you understand how these models are obtained and the terms used to describe them, and clear up the most common doubts about them.

What is a QSAR model?

QSAR models are mathematical models that speed up an important task in fields such as chemistry and drug development: predicting the properties or biological activities of chemical compounds from their molecular structure alone. They rest on the premise that a compound's structure is related to its activity, an idea proposed by the Scottish chemist Crum Brown more than a hundred years ago and widely demonstrated since then.

These models allow for the computational estimation of the physicochemical, biological or toxicological properties of compounds whose activity is unknown, without the need for laboratory experimentation, based on data from other compounds whose values for the properties of interest are known.

How is a QSAR model obtained?

Developing a QSAR model first requires characterizing the molecules whose properties are known by means of numerical descriptors. That is, the structure of each compound is transformed into ‘descriptors’, parameters that assign numerical values to the compound based on the characteristics of its structure. These descriptors (which we will see in more detail in the next entry) can be as simple as the number of certain heteroatoms (atoms other than carbon or hydrogen) or functional groups in the molecule.
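As a rough illustration of the simplest descriptor mentioned above, the sketch below counts heteroatoms from a molecular formula. It assumes the structure is supplied as a plain formula string without parentheses, such as `C2H6O`; the function names `count_atoms` and `heteroatom_count` are hypothetical. Real descriptor calculations normally start from the full structure and rely on a cheminformatics toolkit.

```python
import re


def count_atoms(formula):
    """Count atoms per element in a simple formula such as 'C2H6O'.

    Assumption: no parentheses, charges or isotopes in the formula.
    """
    counts = {}
    # Each match is an element symbol followed by an optional count.
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def heteroatom_count(formula):
    """Descriptor: number of atoms that are neither carbon nor hydrogen."""
    atoms = count_atoms(formula)
    return sum(n for symbol, n in atoms.items() if symbol not in ("C", "H"))


# One descriptor value per compound.
formulas = {"ethanol": "C2H6O", "benzene": "C6H6", "caffeine": "C8H10N4O2"}
descriptors = {name: heteroatom_count(f) for name, f in formulas.items()}
print(descriptors)
```

Each compound ends up with a single number, which is exactly the kind of value that fills one cell of the descriptor table used later in the workflow.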

Next, statistical and machine learning tools are applied to generate the algorithms that relate these descriptors to the property under study. Machine learning is a branch of artificial intelligence in which computers “learn” patterns from data instead of being explicitly programmed with them. In this case, the algorithms learn the relationships between structure and biological property or activity.
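To make the idea of "relating descriptors to a property" concrete, here is a minimal sketch of the simplest statistical case: an ordinary least-squares line through one descriptor. The numbers are invented purely for illustration and are not measurements; practical QSAR models usually combine many descriptors and often use more flexible learning methods.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(descriptor_value, slope, intercept):
    """Estimate the property of a compound from its descriptor value."""
    return slope * descriptor_value + intercept


# Invented toy data: one descriptor value and one "activity" per compound.
descriptor = [0, 1, 2, 3]
activity = [1.0, 3.0, 5.0, 7.0]

slope, intercept = fit_line(descriptor, activity)
estimate = predict(4, slope, intercept)  # compound not seen during fitting
print(slope, intercept, estimate)
```

The fitted slope and intercept are what the model has "learned"; `predict` then applies them to a compound whose activity was never measured.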

A general workflow for building QSAR models can be as follows:

  1. The data set of molecules to be used for the model building is preprocessed. Duplicate structures, doubtful biological values, etc. are eliminated.
  2. A series of molecular descriptors is calculated for this data set, yielding a matrix (table) of data with as many rows as molecules and as many columns as descriptors.
  3. The data matrix is randomly divided into a training set and a test (validation) set.
  4. A predictive model is built on the training set, using one or more modelling techniques.
  5. The model is validated by measuring its effectiveness on the test set.
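The five steps above can be sketched end to end in a few lines. This is only a toy under stated assumptions: the descriptor is a carbon count read from a simple formula, the "activity" values are generated from a made-up linear rule (so the right answer is known in advance), and every function name is hypothetical.

```python
import random
import re


def carbon_count(formula):
    """Step 2 (toy descriptor): carbon atoms in a formula like 'C3H8'."""
    match = re.search(r"C(?![a-z])(\d*)", formula)
    if match is None:
        return 0
    return int(match.group(1)) if match.group(1) else 1


def preprocess(records):
    """Step 1: drop duplicate structures and records without a value."""
    seen, clean = set(), []
    for formula, activity in records:
        if activity is None or formula in seen:
            continue
        seen.add(formula)
        clean.append((formula, activity))
    return clean


def split(rows, test_fraction, seed):
    """Step 3: random division into training and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_least_squares(rows):
    """Step 4: ordinary least squares on (descriptor, activity) pairs."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(rows, slope, intercept):
    """Step 5: root-mean-square error of the predictions on a set."""
    squared = [(y - (slope * x + intercept)) ** 2 for x, y in rows]
    return (sum(squared) / len(squared)) ** 0.5


# Synthetic data: activity follows an invented rule, not an experiment.
formulas = ["CH4", "C2H6", "C3H8", "C4H10", "C5H12", "C6H14", "C7H16", "C8H18"]
raw = [(f, 0.5 * carbon_count(f) + 1.0) for f in formulas]
raw += [("C3H8", 2.5), ("C9H20", None)]  # a duplicate and a missing value

clean = preprocess(raw)
table = [(carbon_count(f), activity) for f, activity in clean]
train, test = split(table, test_fraction=0.25, seed=0)
slope, intercept = fit_least_squares(train)
test_error = rmse(test, slope, intercept)
print(len(train), len(test), slope, intercept, test_error)
```

The key design point is that the test compounds play no part in fitting, so the error measured on them indicates how the model might behave on genuinely new molecules.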

Once a QSAR model has been developed and validated, it can be used to predict the property or activity of new molecules whose chemical structure is known.

Are QSAR models accepted?

QSAR models are accepted by regulatory authorities such as the European Chemicals Agency (ECHA), provided that the following rules are met:

i) The models must focus on well-defined parameters (“endpoints”), and all the experimental values used must have been obtained under identical conditions.
ii) They must take the form of an unambiguous algorithm, so that the models can be reproduced by the rest of the scientific community.
iii) Their applicability domain must be clearly defined and justified; that is, a QSAR model can only be applied to chemical compounds that occupy the same chemical space as the compounds used to generate it.
iv) They must meet scientifically recognized metrics that demonstrate goodness of fit, robustness and predictivity.
v) Whenever possible, they should provide a possible interpretation of the toxicological mechanisms of action of the studied compounds.
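The applicability domain in rule iii) can be defined in several ways. The sketch below shows one simple option, a bounding box: a new compound is considered in-domain only if each of its descriptor values lies within the range seen in the training set. The descriptor vectors are invented placeholders and the function names are hypothetical; this is not presented as the method required by any regulator.

```python
def descriptor_ranges(training_descriptors):
    """Minimum and maximum of each descriptor over the training set."""
    columns = zip(*training_descriptors)  # one tuple per descriptor
    return [(min(column), max(column)) for column in columns]


def in_domain(descriptors, ranges):
    """True if every descriptor value lies inside the training range."""
    return all(low <= value <= high
               for value, (low, high) in zip(descriptors, ranges))


# Invented descriptor vectors, one tuple of two descriptors per compound.
training = [(2, 46.1), (4, 74.1), (6, 102.2)]
ranges = descriptor_ranges(training)

inside = in_domain((3, 60.1), ranges)    # within both ranges
outside = in_domain((12, 186.3), ranges)  # beyond both ranges
print(ranges, inside, outside)
```

A prediction for a compound flagged as out-of-domain would be an extrapolation beyond the chemical space covered by the training data, so it should be treated with caution.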

These five guidelines were established by the Organisation for Economic Co-operation and Development (OECD) at the “37th Joint Meeting of the Chemicals Committee and Working Party on Chemicals, Pesticides and Biotechnology” in November 2004. The European Union has also incorporated them without modification into specific legislation, such as Annex XI of the REACH Regulation and Annex IV of the Biocidal Products Regulation (BPR).

In fact, the REACH Regulation, which is administered by ECHA and governs the use, import and marketing of chemical substances in the European Union, not only accepts but also promotes the use of this type of computational model, with the aim of reducing the number of animals used in experimentation. This is a clear commitment to New Alternative Methods (NAMs), among which computational techniques, and QSAR models in particular, play a fundamental role.

