Since their introduction in 2017 [1], Transformers have taken the field of AI by storm. Language models like BERT [2] and others [2] [3] quickly became the strongest models in NLP by a wide margin.
Chemical fingerprints [1] have long been the standard way to encode chemical structures as numbers, which makes them suitable inputs to machine learning models. In short, a chemical fingerprint indicates the presence or absence of chemical features or substructures, as shown below:
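To make the presence/absence idea concrete, here is a minimal toy sketch. It assumes molecules are given as SMILES strings and, purely for illustration, treats a "substructure" as a plain substring of that string. The feature list and the function name `toy_fingerprint` are hypothetical; a real cheminformatics toolkit such as RDKit matches substructures on the molecular graph rather than on text.

```python
# Toy presence/absence fingerprint.
# Assumption: features are matched as plain substrings of a SMILES string.
# This is NOT real substructure search; it only illustrates the bit-vector idea.

# Hypothetical feature list: carboxylic acid, nitrogen, benzene ring, chlorine, oxygen
FEATURES = ["C(=O)O", "N", "c1ccccc1", "Cl", "O"]


def toy_fingerprint(smiles, features=FEATURES):
    """Return one bit per feature: 1 if the pattern occurs in the string, else 0."""
    return [1 if pattern in smiles else 0 for pattern in features]


acetic_acid = "CC(=O)O"
chlorobenzene = "Clc1ccccc1"

print(toy_fingerprint(acetic_acid))    # [1, 0, 0, 0, 1]
print(toy_fingerprint(chlorobenzene))  # [0, 0, 1, 1, 0]
```

Each molecule becomes a fixed-length vector of 0s and 1s, so a set of molecules can be stacked into a matrix and passed to a machine learning model.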
Molecular fingerprints are a way to represent molecules as mathematical objects. Once molecules are in this form, we can apply statistical analyses and machine learning techniques to a set of molecules and gain insights that would be hard to reach by inspection alone. One of the most common molecular fingerprinting methods is Extended-Connectivity Fingerprints (ECFP), which we will look at today.
The basic idea goes as follows. Each point will be expanded on.
We choose an…
Two algorithms that have repeatedly shown consistent results (AUCs of around 0.7–0.8) are the following:
However, a relatively recent algorithm inspired by Random Matrix Theory was reported a couple of years back in a PNAS paper by the talented Alpha Lee. It obtains much better AUCs of ~0.9!
Skip to 3 if you already know how to generate a matrix of bit vectors for a set of molecules.
We need 2 datasets, namely:
I will assume you have read the first three chapters of “Modern Quantum Chemistry” by Szabo and Ostlund, or a similar book, or have taken an introductory course in computational quantum chemistry. I will refer to that book throughout the post. It is very cheap (£10 on Amazon) and is a good investment.
I’ll also assume you have had a bit of practice coding in Python and know the basics, such as how for loops work.
I will go through the important maths again here and there. This is to function as a reminder, and it will…
Use Python to find the most space-consuming folders on your PC
2. Next, define a variable holding the directory (folder) you want to investigate. In my case, the appdata\local folder was taking up over 10 GB on my hard drive!
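As a rough sketch of how this step could look, the snippet below stores the folder in a variable and ranks its immediate subfolders by total size. The helper names `folder_size` and `largest_subfolders` are hypothetical, and the AppData\Local path is an assumption based on the example above; point `directory` at whatever folder you want to inspect.

```python
import os


def folder_size(path):
    """Total size in bytes of all regular files under path (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            try:
                if not os.path.islink(file_path):
                    total += os.path.getsize(file_path)
            except OSError:
                # File vanished or is not readable; ignore it.
                pass
    return total


def largest_subfolders(directory, top=10):
    """Return up to `top` (name, size_in_bytes) pairs, largest first."""
    sizes = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                sizes.append((entry.name, folder_size(entry.path)))
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:top]


if __name__ == "__main__":
    # Assumption: a Windows-style AppData\Local folder under the home directory.
    directory = os.path.join(os.path.expanduser("~"), "AppData", "Local")
    if os.path.isdir(directory):
        for name, size in largest_subfolders(directory):
            print(f"{size / 1024**3:8.2f} GB  {name}")
    else:
        print(f"{directory} does not exist; set 'directory' to another folder.")
```

Walking a large folder can take a while, so it may be worth starting with a small directory to check that the output looks sensible.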