
Compound Segmentation via Clustering on Mol2Vec-Based Embeddings

MCML Authors

Daniyal Kazempour

Dr.

* Former Member

→ Group Thomas Seidl
Database Systems, Data Mining and AI

Anna Beer

Dr.

* Former Member

→ Group Thomas Seidl
Database Systems, Data Mining and AI

Thomas Seidl

Prof. Dr.

Director

Database Systems, Data Mining and AI

Abstract

At various stages of drug candidate discovery, it can be helpful to identify groups of molecules that share similar properties, i.e., overall structural similarity. Existing methods for computing (dis)similarities between chemical structures rely on a priori domain knowledge. Here we investigate the clustering of compounds based on embeddings generated by the recently published Mol2Vec technique, which enables an entirely unsupervised vector representation of compounds. One research question we address in this work is: do existing well-known clustering algorithms, such as k-means or hierarchical clustering methods, yield meaningful clusters on the Mol2Vec embeddings? Further, we investigate to what extent subspace clustering can be used to compress the data by reducing the dimensionality of the compounds' vector representation. Our first experiments on a set of COVID-19 drug candidates reveal that well-established methods yield meaningful clusters. Preliminary results from subspace clustering indicate that a compression of the vector representations seems viable.
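To make the k-means question in the abstract concrete, the following is a minimal sketch of k-means (Lloyd's algorithm) on fixed-length vectors. It is not the authors' implementation: the synthetic vectors from `make_toy_embeddings` merely stand in for Mol2Vec embeddings, and all function names, the dimensionality, and the farthest-point initialization are assumptions chosen for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def farthest_point_init(points, k):
    """Deterministic seeding: start at the first point, then repeatedly
    pick the point farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        nxt = max(points,
                  key=lambda p: min(squared_distance(p, c) for c in centroids))
        centroids.append(nxt)
    return centroids


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update
    until the labels stop changing or max_iter is reached."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = mean_vector(members)
    return labels, centroids


def make_toy_embeddings(n_per_group=20, dim=8, seed=0):
    """Hypothetical stand-in for compound embeddings: two well-separated
    Gaussian groups. Real embeddings would come from a Mol2Vec model."""
    rng = random.Random(seed)
    centers = [[0.0] * dim, [5.0] * dim]
    points, truth = [], []
    for group, center in enumerate(centers):
        for _ in range(n_per_group):
            points.append([c + rng.gauss(0, 0.5) for c in center])
            truth.append(group)
    return points, truth


points, truth = make_toy_embeddings()
labels, centroids = kmeans(points, k=2)
print("cluster sizes:", [labels.count(j) for j in range(2)])
```

On real embeddings the groups are unlikely to be this cleanly separated, so the number of clusters and the initialization would need to be chosen and validated against chemical knowledge rather than assumed as here.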

inproceedings KBO+21