KnowMol: Advancing Molecular Large Language Models with Multi-Level Chemical Knowledge

Zaifei Yang, Hong Chang, RuiBing Hou, Shiguang Shan, Xilin Chen

Advances in Neural Information Processing Systems 38 (NeurIPS 2025) Datasets and Benchmarks Track

Molecular large language models have garnered widespread attention for their potential in molecular applications. However, current molecular large language models remain limited in their understanding of molecules because of inadequate textual descriptions and suboptimal molecular representation strategies during pretraining. To address these challenges, we introduce KnowMol-100K, a large-scale dataset of 100K fine-grained molecular annotations across multiple levels, which bridges the gap between molecules and textual descriptions. Additionally, we propose a chemically-informative molecular representation that addresses the limitations of existing molecular representation strategies. Building on these contributions, we develop KnowMol, a state-of-the-art multi-modal molecular large language model. Extensive experiments demonstrate that KnowMol achieves superior performance across molecular understanding and generation tasks.