Jian Jiang's team at the Institute of Chemistry, Chinese Academy of Sciences, recently published an article on the core bottlenecks of machine learning force fields (MLFF) for organic systems in long-term molecular dynamics simulations, namely molecular structure collapse and low accuracy in macroscopic property calculations, and proposed two physical embedding solutions. The approach works on two levels: for intramolecular interactions, the team developed a physics-guided adaptive bond length sampling method; for intermolecular interactions, a top-down model correction method based on physical equation embedding. The results show that these methods significantly improve simulation stability under small-sample conditions and improve the prediction accuracy of macroscopic properties such as density and viscosity at extremely low data and computational cost. The approach overcomes limitations of purely data-driven methods, enhances the reliability and generalization ability of MLFF, and provides a scalable route to physical embedding in MLFF. The article was published as an open access Research Article in CCS Chemistry, the flagship journal of the Chinese Chemical Society.
Background information:
Molecular dynamics simulations are crucial tools for studying the microscopic mechanisms and macroscopic properties of chemical, materials, and biological systems. While ab initio molecular dynamics simulations offer high precision, they are ill-suited for large-scale, long-duration simulations. Traditional empirical force fields, though computationally efficient, often lack sufficient accuracy in complex organic systems. Machine learning force fields, by approximating the true potential energy surface through data-driven approaches, balance the precision of quantum chemical methods with the computational efficiency of classical empirical force fields, thus becoming an important technological pathway bridging high-precision electronic structure calculations and large-scale classical molecular simulations.
However, the application of MLFF in organic systems still faces significant challenges. Organic molecules contain both intramolecular interactions dominated by covalent bonds and intermolecular interactions dominated by van der Waals forces. If the two types of interaction are not both characterized accurately, the model tends to be unstable in simulation or inaccurate in macroscopic property prediction, and the simulation can fail. Specifically, for intramolecular interactions, when the training data does not adequately cover the high-energy chemical bond region, the model may undergo non-physical structural collapse, such as chemical bond breaking and atomic collisions, during long-term simulations. For intermolecular interactions, even if the model performs well on microscopic indicators such as energy, atomic forces, and structural features, it may not accurately predict macroscopic properties such as density and viscosity.
Therefore, relying solely on data-driven MLFF models cannot simultaneously ensure both intramolecular structural stability and the accuracy of macroscopic intermolecular property predictions. Building upon previous research (CCS Chem, 2025, 7(3): 716-730), this paper expands on the concept of physical embedding. On one hand, the physical knowledge contained in empirical force field topology files is introduced into the data sampling process; on the other hand, equations with clear physical meaning are embedded into the MLFF model in a top-down manner to perform targeted corrections to intermolecular interactions, thereby systematically improving the model's prediction accuracy, stability, and interpretability.
Highlights of this article:
The first core contribution of this paper is the proposal of a physics-guided adaptive bond length sampling method. This method reads relevant information from empirical force field topology files to achieve precise differentiation between atom and chemical bond types under different chemical environments. Combined with the bond force constants given in the topology file, it further determines the sampling range and corresponding sampling probability of various chemical bonds in complex organic molecules. Compared to the traditional approach using a uniform stretching factor, this strategy can more accurately cover high-energy bond length regions that are prone to simulation collapse. While achieving adaptive sampling, it effectively avoids problems such as SCF non-convergence and abnormal forces caused by excessive chemical bond stretching.
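As a rough illustration of how a bond force constant can set a per-bond sampling range, the following minimal sketch assumes a harmonic bond term E = 0.5 * k * (r - r0)^2 and a fixed energy ceiling. The function names, the energy ceiling, the uniform-in-energy sampling rule, and the bond parameters are all hypothetical choices made for this example; the sampling ranges and probabilities actually used in the paper may be defined differently.

```python
import math
import random

def stretch_window(k_bond, r0, e_max):
    """Bond length range where the harmonic energy 0.5*k*(r-r0)^2 stays below e_max.

    Assumes the 0.5*k convention; some force fields omit the factor 0.5.
    """
    dr = math.sqrt(2.0 * e_max / k_bond)
    return r0 - dr, r0 + dr

def sample_bond_length(k_bond, r0, e_max, rng):
    """Draw one bond length whose harmonic energy is uniform in [0, e_max].

    Uniform-in-energy sampling is an assumption of this sketch.
    """
    energy = rng.uniform(0.0, e_max)
    dr = math.sqrt(2.0 * energy / k_bond)
    return r0 + rng.choice((-1.0, 1.0)) * dr

# Illustrative parameters only (kcal/mol/A^2, A); not taken from the paper
# or from any specific topology file.
bond_types = {"stiff": (700.0, 1.09), "soft": (300.0, 1.80)}
E_MAX = 12.0  # hypothetical energy ceiling in kcal/mol

rng = random.Random(0)
for name, (k, r0) in bond_types.items():
    lo, hi = stretch_window(k, r0, E_MAX)
    r = sample_bond_length(k, r0, E_MAX, rng)
    print(f"{name}: window [{lo:.3f}, {hi:.3f}] A, sample {r:.3f} A")
```

In this sketch a stiffer bond automatically receives a narrower window than a softer one, which is the qualitative behavior a single uniform stretching factor cannot provide.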
For three representative organic molecules (a fluorinated engineering fluid, alanine tripeptide, and acetaminophen), the authors significantly improved the model's simulation stability using only 50 single-molecule samples for training and validation. The original MACE model exhibited structural collapse probabilities of 59%, 22%, and 77% in the three systems, respectively. After adaptive bond length sampling was introduced, the model passed 100 independent 100 ps high-temperature molecular dynamics stability tests, demonstrating that this method can significantly improve the long-term simulation stability of MLFFs under small-sample conditions.
The second core contribution of this paper is the proposal of a top-down model correction strategy based on physical equation embedding. The authors point out that the training process of MLFF often prioritizes fitting intramolecular interactions, which contribute more to the potential energy, while intermolecular interactions are characterized less well. Therefore, even when an MLFF achieves low fitting errors and reproduces microscopic indicators such as the radial distribution function, it may still show significant deviations in macroscopic properties such as density and viscosity.
To address this issue, the authors introduce a DFT-CSO dispersion equation with clear physical meaning to correct the model. The method strengthens or weakens intermolecular interactions through a single adjustable damping parameter, which is tuned on the already-trained model with the experimental density as the optimization target. Because the strategy is guided directly by an experimental macroscopic property, it not only reduces the error caused by the model underfitting the reference quantum chemistry method but also corrects, to some extent, the systematic bias of that reference method relative to experiment. Furthermore, thanks to the clear physical meaning of the embedded equation, the method is simple to apply, has extremely low data requirements and a controllable computational cost, and is easy to integrate into existing MLFF frameworks.
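The one-parameter tuning described above can be pictured as a simple scan. The sketch below is only an assumption-laden illustration: `scan_damping` and `toy_density` are hypothetical names, the candidate grid and target density are made-up values, and `toy_density` is a linear stand-in for what would in practice be a constant-pressure molecular dynamics run with the corrected model.

```python
def scan_damping(simulate_density, candidates, rho_exp):
    """Return (best_parameter, abs_error) minimizing |rho_sim - rho_exp|."""
    best = None
    for a in candidates:
        err = abs(simulate_density(a) - rho_exp)
        if best is None or err < best[1]:
            best = (a, err)
    return best

def toy_density(a):
    # Stand-in for an MD density calculation (g/cm^3). The linear trend,
    # with stronger dispersion giving a denser liquid, is purely illustrative.
    return 0.95 + 0.10 * a

candidates = [0.0, 0.25, 0.5, 0.75, 1.0]  # hypothetical parameter grid
RHO_EXP = 1.00                            # illustrative target, not a measured value

best_a, best_err = scan_damping(toy_density, candidates, RHO_EXP)
print(f"best damping parameter: {best_a}, density error: {best_err:.4f} g/cm^3")
```

Because only one scalar is varied and each evaluation is an independent simulation, the candidates can be run in parallel, which is consistent with the low scanning cost the authors report.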
In typical battery electrolyte solvent systems, namely mixed solvents of ethylene carbonate (EC) and ethyl methyl carbonate (EMC) and pure EMC, this method demonstrates high data efficiency and application potential. The authors point out that scanning the adjustable parameter in the physical equation has low computational overhead, requiring only a few hours to determine the optimal value. After correction, in generalization tests at different temperatures, with different mixing ratios, and on new molecular systems, the density prediction errors of the MACE-EC/EMC and MACE-OFF23(S) models decreased by 78% and 88%, to 0.006 g/cm³ and 0.012 g/cm³, respectively; the viscosity prediction errors decreased by 38% and 77%, leaving final deviations from experiment of 18.4% and 12.9%, respectively. These results indicate that the method can significantly improve the accuracy of macroscopic property calculations at low data and computational cost, achieving results comparable to more complex correction schemes.
Regarding interpretability, the rigid volume scan results show that the position of the potential minimum point changes before and after the correction, indicating that the physical equation embedding can directionally enhance or weaken intermolecular interactions, which has clear physical significance. Meanwhile, the RMSE change in atomic forces is less than 0.8 meV/Å, significantly lower than the upper limit of the model's own fitting error. This suggests that, on the one hand, subtle changes at the atomic force level can lead to significant changes in macroscopic properties; on the other hand, model training relying solely on data-driven methods often fails to accurately characterize intermolecular interactions, thus further highlighting the necessity of the physical equation embedding method. Furthermore, the radial distribution function reflecting microstructural characteristics remains essentially unchanged before and after the correction, further confirming that this correction method has almost no impact on the original intramolecular interactions of the model.
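To make the force comparison concrete, the sketch below computes a root-mean-square difference between two sets of atomic forces, such as those predicted before and after a correction. It assumes a component-wise RMSE over all Cartesian components with inputs in eV/Å; the function name and the sample numbers are hypothetical, and the paper may define its force RMSE differently (for example, per atom).

```python
import math

def force_rmse_mev(forces_a, forces_b):
    """Component-wise RMSE between two force sets (eV/A in, meV/A out)."""
    sq_sum = 0.0
    n = 0
    for fa, fb in zip(forces_a, forces_b):
        for ca, cb in zip(fa, fb):
            sq_sum += (ca - cb) ** 2
            n += 1
    return 1000.0 * math.sqrt(sq_sum / n)

# Made-up forces for two atoms, differing by 0.0005 eV/A in every component.
before = [(0.1200, -0.0450, 0.0030), (-0.1200, 0.0450, -0.0030)]
after = [(0.1205, -0.0445, 0.0035), (-0.1195, 0.0455, -0.0025)]

print(f"force RMSE: {force_rmse_mev(before, after):.3f} meV/A")
```

A difference of this size is far below typical MLFF fitting errors, which illustrates why a correction can be nearly invisible in force metrics while still shifting a macroscopic property.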
Summary and Outlook:
Overall, this paper proposes two complementary physical embedding strategies that target the two key stages of "data sampling" and "model post-processing," overcoming bottlenecks of purely data-driven methods. Adaptive bond length sampling primarily addresses the molecular structure collapse caused by insufficient coverage of high-energy chemical bond regions within molecules, while the top-down model correction method based on physical equations mainly tackles the failure of macroscopic property predictions caused by insufficient characterization of intermolecular interactions and by systematic errors inherent in reference quantum chemistry methods. These two methods lay a solid foundation for constructing high-precision, robust, and transferable molecular simulation tools.
Unlike common approaches that improve performance by increasing data scale or model complexity, this paper emphasizes embedding physical knowledge and equations into the MLFF development process in a low-cost, highly interpretable, and highly transferable manner. This approach not only enhances the model's applicability in systems such as engineering fluids, peptides, drug molecules, and organic solvents, but also provides a new direction for the rapid calibration of general-purpose foundation models in the downstream fine-tuning stage. In the future, this framework can be further expanded, for example, by introducing more tunable physical parameters and by developing efficient, low-cost correction methods for dynamic properties such as viscosity.
The above work was published as a Research Article in CCS Chemistry, with Professor Jian Jiang from the Institute of Chemistry, Chinese Academy of Sciences, as the corresponding author and doctoral student Junbao Hu as the first author. This work was supported by the National Natural Science Foundation of China and the Strategic Priority Research Program of the Chinese Academy of Sciences.
---
About the journal: CCS Chemistry is the Chinese Chemical Society’s flagship publication, established to serve as the preeminent international chemistry journal published in China. It is an English language journal that covers all areas of chemistry and the chemical sciences, including groundbreaking concepts, mechanisms, methods, materials, reactions, and applications. All articles are diamond open access, with no fees for authors or readers. More information can be found at https://www.chinesechemsoc.org/journal/ccschem .
About the Chinese Chemical Society: The Chinese Chemical Society (CCS) is an academic organization formed by Chinese chemists of their own accord with the purpose of uniting Chinese chemists at home and abroad to promote the development of chemistry in China. The CCS was founded during a meeting of preeminent chemists in Nanjing on August 4, 1932. It currently has more than 120,000 individual members and 184 organizational members. There are 7 Divisions covering the major areas of chemistry: physical, inorganic, organic, polymer, analytical, applied and chemical education, as well as 31 Commissions, including catalysis, computational chemistry, photochemistry, electrochemistry, organic solid chemistry, environmental chemistry, and many other sub-fields of the chemical sciences. The CCS also has 10 committees, including the Woman’s Chemists Committee and Young Chemists Committee. More information can be found at https://www.chinesechemsoc.org/ .
CCS Chemistry
10.31635/ccschem.026.202506780
Computational simulation/modeling
Not applicable
Physical Embedding Machine Learning Force Fields for Organic Systems
6-Mar-2026
There is no conflict of interest to report.