AlphaFold2 — Possibly the biggest scientific discovery of the decade

Just in case you hadn’t realized by now… And they may have spurred the biggest scientific discovery of the decade

Since their introduction in 2017 [1], Transformers have taken the field of AI by storm. Language models like BERT [2] and others [2] [3] suddenly became the strongest models in NLP by a wide margin.

Graph convolutional neural networks for molecular featurisation
Graph convolutions

Until recently, practitioners used molecular fingerprints (essentially one-hot encodings of different molecular substructures) as inputs to machine learning models. However, the field is starting to move towards learning the fingerprints themselves (automatic feature engineering) using deep learning. Here is a demonstration of how to implement a simple neural fingerprint.

1. Chemical Fingerprints

Chemical fingerprints [1] have long been the standard way to represent chemical structures as numbers that are suitable inputs to machine learning models. In short, a chemical fingerprint indicates the presence or absence of particular chemical features or substructures in a molecule.

Pictured top left: Schrodinger, Einstein and Pople. Pictured bottom middle: Hinton and Bengio

Accurately computing the quantum mechanical properties of a compound, such as its atomization energy, can take anywhere from hours to weeks using conventional state-of-the-art methods. This article explores the use of deep neural networks to compute these properties in a matter of seconds, with 99.8% accuracy.

Morgan fingerprinting

Molecular fingerprints are used in drug discovery for many reasons. Today we will focus on their use in the prediction of drug binding affinities.

Molecular fingerprints are a way to represent molecules as mathematical objects. With this representation, we can apply statistical analyses and machine learning techniques to a set of molecules and gain insights that we could not gain unaided. One of the most common molecular fingerprinting methods is Extended Connectivity FingerPrinting (ECFP), which we will look at today.

Extended Connectivity FingerPrinting (ECFP)

The basic idea is as follows. Each step is expanded on in turn.

  1. Assign each atom an identifier
  2. Update each atom’s identifier based on its neighbours
  3. Remove duplicates
  4. Fold the list of identifiers into a 2048-bit vector (a Morgan fingerprint)
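
The four steps above can be sketched in plain Python on a toy molecule. This is only an illustration under loud assumptions, not the RDKit implementation: the function names `stable_hash` and `toy_ecfp` are hypothetical, the initial identifier uses just the element and the number of heavy-atom neighbours (real ECFP uses more atom properties and bond orders), and duplicates are removed by identifier value only.

```python
import zlib


def stable_hash(value):
    """Deterministic 32-bit hash (crc32 of the repr), stable across runs."""
    return zlib.crc32(repr(value).encode("utf-8"))


def toy_ecfp(atoms, bonds, radius=2, n_bits=2048):
    """Toy ECFP-style fingerprint.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j) index pairs between heavy atoms
    """
    neighbours = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    # Step 1: assign each atom an initial identifier.
    # Assumption: only (element, number of neighbours) is hashed here.
    ids = [
        stable_hash((element, len(neighbours[i])))
        for i, element in enumerate(atoms)
    ]
    collected = set(ids)

    # Step 2: update each identifier from its neighbours, once per layer.
    for layer in range(1, radius + 1):
        ids = [
            stable_hash(
                (layer, ids[i], tuple(sorted(ids[j] for j in neighbours[i])))
            )
            for i in range(len(atoms))
        ]
        # Step 3: storing identifiers in a set drops duplicates.
        collected.update(ids)

    # Step 4: fold the identifiers into a fixed-length bit vector.
    bits = [0] * n_bits
    for identifier in collected:
        bits[identifier % n_bits] = 1
    return bits


# Heavy atoms of ethanol: C-C-O
ethanol = toy_ecfp(["C", "C", "O"], [(0, 1), (1, 2)])
print(sum(ethanol), "bits set out of", len(ethanol))
```

Because neighbour identifiers are sorted before hashing, the result does not depend on the order in which the atoms are listed. A `radius` of 2 corresponds to what is usually called ECFP4 (diameter 4).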

1. Assign each atom an identifier

We choose an…

The prediction of binding affinities of potential drug candidates is just one component of the drug discovery pipeline that is being disrupted by AI right now. Random Matrix Theory provides a classifying algorithm with a very high AUC, better than other existing methods.

ROC Curve for the RMT Algorithm; note the high AUC!

Two algorithms that have repeatedly shown consistent results (AUCs of around 0.7–0.8) are the following:

However, a relatively recent algorithm inspired by Random Matrix Theory was reported a couple of years back in a PNAS paper by the talented Alpha Lee. It obtains much better AUCs of ~0.9!
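
For context on the AUC figures quoted above: the AUC of a classifier equals the probability that a randomly chosen positive example (a binder) receives a higher score than a randomly chosen negative one (a non-binder). The sketch below computes it directly from that definition. The function name `roc_auc` and the example labels and scores are made up for illustration and are not taken from the paper.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for binders, 0 for non-binders
    scores: classifier scores, higher meaning "more likely to bind"
    Ties count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Made-up scores: one non-binder (0.7) outranks one binder (0.6).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why a move from 0.7–0.8 to ~0.9 is a substantial improvement.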

This Medium blog post aims to explain the algorithm and provides a high-level Python package that implements it.

Skip to section 3 if you already know how to generate a matrix of bit vectors of molecules.

1. Obtaining the data and cleaning it (skip this section if you know how to do this)

We need 2 datasets, namely:

  • a set of drugs that bind to a particular target (in our case ADRB1) (this makes up the training and validation set)

We will write a Hartree-Fock algorithm completely from scratch in Python and use it to find the (almost) exact energy of simple diatomic molecules like H₂


I will assume you have read the first three chapters of “Modern Quantum Chemistry” by Szabo and Ostlund, or a similar book, or have taken an introductory course in computational quantum chemistry. I will be referring to this book throughout the post. The book is very cheap (£10 on Amazon) and is a good investment.

I’ll also assume you have had a bit of practice coding in Python and know the basics, such as how for loops work.

I will go through the important maths again here and there. This is to function as a reminder, and it will…

Use Python to find the most space-consuming folders on your PC

  1. First, we will import the necessary packages. If you don’t have these packages, simply type into the command line: pip install numpy pandas.

  2. Next, define a variable as the directory/folder you want to investigate. In my case, the appdata\local folder was taking up over 10 GB on my hard drive!
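
As a rough idea of where these two steps lead, here is a minimal sketch that ranks the immediate subfolders of a directory by total size. It is an assumption-laden stand-in, not the article's code: it uses only the standard library `os` module rather than numpy and pandas, and the function names `folder_size_bytes` and `largest_subfolders` are hypothetical.

```python
import os


def folder_size_bytes(path):
    """Total size in bytes of all files under path, including subfolders."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            try:
                total += os.path.getsize(file_path)
            except OSError:
                # Skip files that vanish or cannot be read.
                pass
    return total


def largest_subfolders(directory, top=10):
    """Return up to `top` (name, size_in_bytes) pairs, largest first."""
    sizes = []
    for entry in os.scandir(directory):
        if entry.is_dir(follow_symlinks=False):
            sizes.append((entry.name, folder_size_bytes(entry.path)))
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:top]


# Example usage (the directory is a placeholder, replace it with your own):
# directory = os.path.expanduser("~")
# for name, size in largest_subfolders(directory):
#     print(f"{size / 1e9:8.2f} GB  {name}")
```

Scanning a large folder this way can take a while, since every file has to be visited once.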

