
MoleHD: Accelerating Molecule Discovery with Hyperdimensional Computing (HDC)


Introduction

Drug discovery is a multifaceted process that leverages knowledge from biology, chemistry, and pharmacology to identify effective and safe medications. Traditionally, this process involves a laborious, expensive, and time-consuming screening phase, in which discovery candidates are manually selected from extensive chemical databases such as ChEMBL and OpenChem to build smaller, more focused in-house databases for further synthesis.


Challenges in Drug Discovery

Among the many challenges that plague the drug discovery journey, the following issues are common in the traditional discovery process:

  • High Costs and Long Timelines: The drug discovery process is extremely expensive and time-consuming, often taking over a decade and billions of dollars to bring a new drug to market.

  • High Failure Rates: The majority of drug candidates fail during clinical trials due to safety concerns, lack of efficacy, or unforeseen side effects, leading to significant financial losses.

  • Regulatory Hurdles: Stringent regulatory requirements and lengthy approval processes can delay the introduction of new drugs, complicating the journey from laboratory to market.

  • Limited Innovation: There is often a lack of novel therapeutic targets, and many new drugs are incremental improvements over other compounds, rather than groundbreaking innovations, limiting significant advancement of treatments for complex diseases.

  • Complexity of Diseases: Many diseases, especially chronic and multifactorial ones like cancer and Alzheimer's, present significant efficacy challenges due to their complex biology, making it difficult to develop effective treatments.


Current Tech Brings Limited Improvement

In recent years, machine learning (ML) and artificial intelligence (AI) algorithms such as random forests, support vector machines, k-nearest neighbors, and gradient boosting have been explored to enhance drug discovery efforts. While these models use molecular representations to predict properties, they often fall short due to their limited ability to capture the complex structural nuances of molecules. Consequently, deep learning models, particularly Graph Neural Networks (GNNs), have gained popularity due to their superior performance in learning detailed molecular features.

However, GNNs still require significant pre-processing and computational resources, limiting their efficiency and accessibility.


Introducing MoleHD: A Paradigm Shift in Molecule Discovery

Zscale Labs™ is pleased to introduce MoleHD, an innovative, ultra-low-cost model based on hyperdimensional computing (HDC) that significantly reduces computational demands, pre-processing effort, and overall development time, enabling faster rollout for quicker return on investment (ROI) and swift time-to-market (TTM).

Figure 1: Overview of MoleHD. MoleHD has 5 major steps: Tokenization, Encoding, Training, Retraining and Inference.


What is Hyperdimensional Computing (HDC)?


HDC is inspired by brain-like (neural) attributes such as high dimensionality and distributed holographic representation, which allow it to generate, manipulate, and compare symbols represented by high-dimensional vectors. Compared to deep neural networks (DNNs), HDC offers several advantages, including smaller model sizes, reduced computational costs, and the capability for one-shot or few-shot learning.

Hyperdimensional computing (HDC) differs from traditional AI in several key ways:


Representation of Data:

  • HDC: Uses high-dimensional vectors (hypervectors) to represent data. These hypervectors typically have thousands of dimensions, making them highly robust to noise and capable of capturing complex relationships (see the short demonstration after this comparison).

  • Traditional AI: Often uses lower-dimensional representations, such as feature vectors, matrices, or tensors, which may not capture the same level of versatility and complexity.
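To make this contrast concrete, here is a minimal Python sketch (our own illustration, not code from MoleHD) of a defining property of hypervectors: two independently drawn 10,000-dimensional vectors are almost always nearly orthogonal, which is what lets HDC treat them as distinct, noise-tolerant symbols.

# A minimal sketch: random bipolar hypervectors are quasi-orthogonal.
import numpy as np

rng = np.random.default_rng(0)
D = 10_000  # dimensionality of each hypervector

a = rng.choice([-1, 1], size=D)  # two independently drawn symbols
b = rng.choice([-1, 1], size=D)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(a, a))  # 1.0: identical symbols match perfectly
print(cosine(a, b))  # ~0.0: unrelated symbols are nearly orthogonal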

Computation Paradigm:

  • HDC: Relies on mathematical operations in high-dimensional space, such as binding, bundling, and permutation, to manipulate hypervectors. These operations are usually simple and parallelizable (sketched in the example after this comparison).

  • Traditional AI: Uses a variety of computational paradigms, including neural networks, decision trees, and support vector machines, which often involve more complex and sequential computations that can lose target relevance and introduce errors that yield wholly irrelevant candidates.
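The sketch below illustrates the three core HDC operations named above on bipolar (+1/-1) hypervectors. It is a generic illustration; the variable names are our own and not tied to any particular HDC library.

# Binding, bundling, and permutation on bipolar hypervectors.
import numpy as np

rng = np.random.default_rng(1)
D = 10_000
x = rng.choice([-1, 1], size=D)
y = rng.choice([-1, 1], size=D)

# Binding: elementwise multiplication associates two symbols; the result
# is dissimilar to both inputs (it represents the pairing itself).
bound = x * y

# Bundling: elementwise addition plus a sign threshold superimposes
# symbols; the result stays similar to every input. A third random
# vector breaks ties where x and y cancel out.
bundled = np.sign(x + y + rng.choice([-1, 1], size=D))

# Permutation: a cyclic shift encodes order, e.g. the i-th token of a
# sequence can be rotated i positions before bundling.
shifted = np.roll(x, 1)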

Learning and Inference:

  • HDC: Typically enables rapid learning and inference from fewer examples and often does not require extensive training data. The learning process can be more straightforward, yielding faster results.

  • Traditional AI: Often requires large amounts of training data and computational resources to achieve high accuracy, especially in deep learning models. The learning process can be computationally intensive and time-consuming, while also adding to compute-throughput costs on cloud-based development platforms.

Robustness and Noise Tolerance:

  • HDC: Naturally tolerant of noise and variations in data due to the distributed nature of high-dimensional representations: small changes in the data have minimal impact on the overall representation (demonstrated in the example after this comparison).

  • Traditional AI: Frequently sensitive to noise and may require additional techniques, such as regularization, data augmentation, and error correction, to handle variations and the hallucinations that still plague modern AI. These add-ons increase the complexity of discovery systems, raising the maintenance burden and the risk of false or erroneous candidates.
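As a rough illustration of this noise tolerance (our own example, with an assumed 10% corruption rate), flipping a tenth of a hypervector's entries still leaves it easily recognizable:

# Corrupting 10% of a 10,000-D hypervector barely moves it.
import numpy as np

rng = np.random.default_rng(2)
D = 10_000
clean = rng.choice([-1, 1], size=D)

noisy = clean.copy()
flip = rng.choice(D, size=D // 10, replace=False)  # flip 10% of entries
noisy[flip] *= -1

# For bipolar vectors of norm sqrt(D), cosine similarity is dot / D.
print(clean @ noisy / D)  # ~0.8, versus ~0.0 for an unrelated vector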

Scalability and Parallelism:

  • HDC: Highly scalable and inherently parallel, making it suitable for real-time and low-power applications. Its operations can be easily implemented on hardware like locally hosted GPUs and FPGAs, thus reducing latency and data throughput billing costs (if using cloud compute).

  • Traditional AI: Scalability and parallelism depend on the specific algorithm and implementation. While some models, such as neural networks, can be parallelized, others do not scale efficiently as parameter counts grow toward large language model (LLM) scale.


Overall, hyperdimensional computing (HDC) offers a fundamentally different approach to representing and processing information compared to traditional AI, with advantages in speed and simplicity, particularly for applications requiring real-time processing and noise tolerance.


How MoleHD Works

MoleHD begins by tokenizing SMILES strings into numerical tokens. These tokens are then encoded into high-dimensional vectors (“hypervectors”) that capture the structural features of molecules. The hypervectors are then used to train an HDC model for molecule classification tasks. By bypassing backpropagation and complex arithmetic operations, MoleHD becomes a highly efficient and extremely fast producer of viable drug discovery candidates.
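The sketch below walks through a simplified, character-level version of the tokenize-and-encode steps. The real MoleHD also supports SMILES Pair Encoding tokenization, and the helper names here (token_hv, encode_molecule) are our own, not the paper's API.

# Character-level tokenization and HDC encoding of a SMILES string.
import numpy as np

rng = np.random.default_rng(3)
D = 10_000
token_memory = {}  # each distinct token gets one fixed random hypervector

def token_hv(token):
    if token not in token_memory:
        token_memory[token] = rng.choice([-1, 1], size=D)
    return token_memory[token]

def encode_molecule(smiles):
    # Tokenize character by character, rotate each token's hypervector
    # by its position (so "CO" and "OC" differ), and bundle the result.
    acc = np.zeros(D)
    for i, tok in enumerate(smiles):
        acc += np.roll(token_hv(tok), i)
    return np.sign(acc)  # the molecule's hypervector

hv = encode_molecule("CCO")  # ethanol as a SMILES string
print(hv.shape)  # (10000,)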


Game-Changing Advantages

Key advantages that MoleHD offers to you:


  • Backpropagation-Free Training: MoleHD does not rely on backpropagation to train its parameters. Instead, it uses one-shot or few-shot learning to establish abstract patterns represented by specific symbols (see the training sketch after this list).

  • Efficient Computing: Unlike neural networks that require complex operations like convolutions, MoleHD performs simple arithmetic operations such as vector addition. This efficiency allows MoleHD to run easily on common CPUs and complete both training and testing within minutes, whereas GNNs require much longer GPU time.

  • Smaller Model Size: MoleHD needs to store only a small set of class hypervectors for comparison during inference, unlike state-of-the-art neural networks that store numerous parameters, many of which are not actually needed. Should more capacity be required, MoleHD can be scaled up swiftly for larger tasks without the latency penalties and increased inference times commonly seen in large neural models.
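Here is a hedged sketch of what such backpropagation-free training can look like: class hypervectors are built by bundling training encodings, then refined by a simple retraining pass, and inference is a nearest-class similarity lookup. The function names and retraining schedule are illustrative assumptions, not MoleHD's exact implementation.

# Backpropagation-free HDC training, retraining, and inference.
import numpy as np

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

def predict(class_hvs, hv):
    # Inference: pick the class hypervector most similar to the sample.
    return max(class_hvs, key=lambda c: cos(class_hvs[c], hv))

def train(encodings, labels, epochs=5):
    D = encodings[0].shape[0]
    class_hvs = {y: np.zeros(D) for y in set(labels)}
    # One-shot pass: bundle every sample into its class hypervector.
    for hv, y in zip(encodings, labels):
        class_hvs[y] += hv
    # Retraining: for each misclassified sample, add it to the correct
    # class and subtract it from the wrongly predicted one; no gradients.
    for _ in range(epochs):
        for hv, y in zip(encodings, labels):
            pred = predict(class_hvs, hv)
            if pred != y:
                class_hvs[y] += hv
                class_hvs[pred] -= hv
    return class_hvs

# Toy usage with random encodings standing in for molecule hypervectors.
rng = np.random.default_rng(4)
X = [rng.choice([-1.0, 1.0], size=10_000) for _ in range(20)]
y = [i % 2 for i in range(20)]
model = train(X, y)
print(predict(model, X[0]) == y[0])  # True on this training sample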


Significant Contributions and Results

MoleHD provides an arsenal of compelling features and benefits for your drug discovery journey:


  • Novel Learning Model: MoleHD presents a cost-effective, faster alternative to existing learning methods in drug discovery, yielding promising and viable discovery candidates.

  • Complete Pipeline for HDC-Based Drug Discovery: MoleHD tokenizes SMILES strings into substructure-representing tokens, encodes them into molecule hypervectors, and then uses these hypervectors for training and further evaluation.

  • Extensive Evaluation: MoleHD was tested on 29 classification tasks from three widely used molecule datasets under various split methods. Compared to eight baseline models (including state-of-the-art neural networks), MoleHD achieved the highest ROC-AUC scores on average across random and scaffold splits, with significantly reduced computing costs and processing latency.

  • Design Space Exploration: We developed and evaluated two tokenization schemes (MoleHD-PE and MoleHD-char) and two gram sizes (unigram and bigram) to explore their impact on performance (see the small example after this list).
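For intuition, here is a tiny illustration (our own, in the style of MoleHD-char tokens) of how the two gram sizes differ on a SMILES string:

# Unigram vs. bigram character tokenization of a SMILES string.
smiles = "CCO"  # ethanol

unigrams = list(smiles)                                      # ['C', 'C', 'O']
bigrams = [smiles[i:i + 2] for i in range(len(smiles) - 1)]  # ['CC', 'CO']

print(unigrams, bigrams)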


Figure 2: HDC processing: Encoding, Training, Retraining and Inference.



Table 1: MoleHD vs. baselines on 3 datasets by average ROC-AUC score. Bold: the highest score. “-”: data unavailable. Superscript “(k)”: MoleHD ranks k-th among all available models under the given dataset and split method.


Table 2: MoleHD-PE ROC-AUC score comparison on 3 datasets (averages). Superscripts and subscripts denote the upper and lower error bounds. For the SIDER dataset, the ROC-AUC score is the task average.


Table 3: MoleHD-char ROC-AUC score comparison on 3 datasets (averages). Superscripts and subscripts denote the upper and lower error bounds. For the SIDER dataset, the ROC-AUC score is the task average.


Conclusion

MoleHD represents a significant advancement in drug discovery, providing an efficient, rapid, low-cost, and highly effective model for generating novel drug discovery candidates. This innovative approach not only outperforms traditional methods but also reduces the computational burden, making it accessible for broader applications, and it enables faster rollout for quicker return on investment (ROI) and swift time-to-market (TTM). As we continue to refine and expand MoleHD's capabilities, it holds the potential to transform the landscape of viable, leading-edge drug discovery.


For those interested in exploring MoleHD further, I have developed a demonstration app hosted on Streamlit that showcases how MoleHD works and how it can streamline costs while rapidly surfacing viable new molecule candidates for your drug discovery journey.



© 2024 Zscale Labs™ - All rights reserved worldwide
