Shandong University researchers develop multi‑scale feature fusion and weighted ensemble learning method for accurate promoter identification across cell lines

Promoters are the gatekeepers of gene expression — DNA sequences that recruit the machinery needed to start transcription. However, a promoter that is active in one cell type may be silent in another. This cell‑type‑specific behaviour, combined with the large sequence diversity of promoters, poses a challenge for computational identification. While existing models often work well on the cell lines they were trained on, they tend to fail when applied to new cellular contexts.

To tackle this problem, a research team from the Shenzhen Research Institute and the Schools of Mathematics and Software at Shandong University created MuSE‑Promoter, a deep learning framework that integrates multiple complementary ways of looking at DNA sequences.

Their findings are published in Molecular and Digital Medicine, with all code and data publicly available at https://github.com/HaoWuLab-Bioinformatics/MuSE-Promoter.

"Our model does not rely on a single type of feature," says the study's co‑corresponding author, Professor Hao Wu. "It simultaneously uses learned semantic embeddings from DNABERT and Word2Vec, and handcrafted biophysical descriptors such as tri‑nucleotide physicochemical properties and reverse‑complement k‑mer frequencies."

Notably, this multi‑modal fusion allows the model to capture both the hidden grammar of regulatory DNA and the structural cues that matter to transcription factors. The architecture further employs a multi‑scale convolutional neural network with squeeze‑excitation attention to detect motifs of variable lengths, followed by a transformer encoder that models long‑range dependencies across the whole promoter region. Finally, a learnable weighted ensemble combines the deep neural network's prediction with that of a random forest classifier, improving robustness and reducing overfitting when moving from one cell line to another.
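The final ensemble step can be illustrated with a small sketch. The release does not say how the ensemble weight is parameterised or trained, so everything here is an assumption for illustration: a single weight kept in (0, 1) by a sigmoid, fitted by gradient descent on log loss, with hypothetical function names (`combine`, `fit_alpha`) and made‑up probabilities.

```python
import math


def sigmoid(x):
    """Map a real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def combine(p_dnn, p_rf, alpha):
    """Blend two promoter probabilities with weight w = sigmoid(alpha).

    Assumed form: w * p_dnn + (1 - w) * p_rf, so the result is always
    a valid probability between the two inputs.
    """
    w = sigmoid(alpha)
    return w * p_dnn + (1.0 - w) * p_rf


def fit_alpha(p_dnn, p_rf, labels, lr=0.5, steps=200):
    """Fit alpha by gradient descent on the mean binary log loss."""
    alpha = 0.0  # start with equal weights (w = 0.5)
    eps = 1e-7   # keeps probabilities away from exactly 0 or 1
    for _ in range(steps):
        w = sigmoid(alpha)
        grad = 0.0
        for a, b, y in zip(p_dnn, p_rf, labels):
            p = min(max(w * a + (1.0 - w) * b, eps), 1.0 - eps)
            dloss_dp = (p - y) / (p * (1.0 - p))    # derivative of log loss
            dp_dalpha = (a - b) * w * (1.0 - w)     # chain rule through sigmoid
            grad += dloss_dp * dp_dalpha
        alpha -= lr * grad / len(labels)
    return alpha


# Toy validation set (illustrative numbers, not from the study).
dnn_probs = [0.9, 0.1, 0.8, 0.2]
rf_probs = [0.6, 0.5, 0.4, 0.6]
truth = [1, 0, 1, 0]

alpha = fit_alpha(dnn_probs, rf_probs, truth)
print("learned DNN weight:", round(sigmoid(alpha), 3))
print("ensemble score:", round(combine(0.7, 0.4, alpha), 3))
```

In this toy data the deep network's scores are closer to the labels than the random forest's, so the fitted weight shifts toward the network. On real data the weight would be learned on held‑out predictions from both models.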

"We evaluated MuSE‑Promoter on four human cell lines (GM12878, HeLa‑S3, HUVEC, K562) and on TATA‑box and non‑TATA promoters from Arabidopsis thaliana," says co‑corresponding author Professor Zhangyu Mei. "The results show that our method consistently outperforms state‑of‑the‑art tools like iPro‑WAEL and Z‑curve, especially in challenging cross‑cell‑line transfer and promoter–enhancer discrimination tasks."

In cross‑cell‑line tests, where the model was trained on one cell type and tested on a completely different one, MuSE‑Promoter maintained an average area under the ROC curve (AUC) of 0.991 and a Matthews correlation coefficient (MCC) above 0.92, substantially higher than competing methods. The team also showed that the learned representations form clear, separable clusters for promoters and non‑promoters, and that the model assigns high importance to biologically relevant features such as CGA, RCKmer (reverse‑complement k‑mer frequencies) and CC.
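The reverse‑complement k‑mer (RCKmer) descriptor named above can be sketched in a few lines. This follows the common definition, in which each k‑mer is merged with its reverse complement before counting. The value of k, the normalisation and the function names are assumptions for illustration, not details taken from the study.

```python
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Pick one representative for a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def rc_kmer_frequencies(seq, k=2):
    """Frequencies of canonical k-mers, so both strands count the same.

    Windows containing characters other than A, C, G, T (such as N)
    are skipped. Frequencies are normalised by the number of windows
    that were counted.
    """
    seq = seq.upper()
    keys = sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})
    counts = dict.fromkeys(keys, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[canonical(kmer)] += 1
            total += 1
    return {key: (c / total if total else 0.0) for key, c in counts.items()}


features = rc_kmer_frequencies("TATAAAGGCGCGCC", k=2)
print(len(features), "features for k=2")
```

Merging each k‑mer with its reverse complement reduces the 16 possible dinucleotides to 10 features and makes the descriptor independent of which DNA strand was sequenced.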

"We believe MuSE‑Promoter will become a powerful tool for large‑scale promoter annotation, helping researchers to decode cell‑type‑specific regulatory programs and to distinguish true promoters from other regulatory elements like enhancers," adds Wu. "Future work will extend the framework to integrate multi‑omics data and to predict enhancer–promoter interactions."

###

Contact the author: Hao Wu, School of Software, Shandong University, haowu@sdu.edu.cn

The publisher KeAi was established by Elsevier and China Science Publishing & Media Ltd to unfold quality research globally. In 2013, our focus shifted to open access publishing. We now proudly publish more than 200 world-class, open access, English language journals, spanning all scientific disciplines. Many of these are titles we publish in partnership with prestigious societies and academic institutions, such as the National Natural Science Foundation of China (NSFC).

DOI: 10.1016/j.mdmed.2026.100002

Method of Research: Computational simulation/modeling

Subject of Research: Cells

Article Title: MuSE‑Promoter: a multi‑scale feature fusion and weighted ensemble learning method for identifying promoters across multiple cell lines

COI Statement: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
