Leveraging State Space Models in Long Range Genomics

* Equal contribution. Corresponding author: a.ramesh@instadeep.com

Abstract

Long-range dependencies are critical for understanding genomic structure and function, yet most conventional methods struggle with them. Widely adopted transformer-based models, while excelling at short-context tasks, are limited by the attention module's quadratic computational complexity and inability to extrapolate to sequences longer than those seen in training. In this work, we explore State-Space Models (SSMs) as a promising alternative by benchmarking two SSM-inspired architectures, Caduceus and Hawk, on long-range genomics modeling tasks under conditions matched to a 50M-parameter transformer baseline. We find that SSMs match transformer performance and exhibit impressive zero-shot extrapolation across multiple tasks, handling contexts 10–100× longer than those seen during training, indicating more generalizable representations better suited for modeling the long and complex human genome. Moreover, we demonstrate that these models can efficiently process sequences of 1M tokens on a single GPU, allowing entire genomic regions to be modeled at once, even in labs with limited compute. Our findings establish SSMs as efficient and scalable architectures for long-context genomic analysis.

Introduction

Genomes are the fundamental blueprint of life. Advances in DNA sequencing have rapidly lowered costs, enabling the curation of high-quality genomic datasets and opening new avenues to understand complex biological processes. Yet, a critical challenge remains in modeling the long-range interactions inherent to genomic data, which can span billions of base pairs (e.g. ≈3 billion in the human genome).

A single human chromosome can span hundreds of millions of nucleotides, with regulatory elements often residing hundreds of kilobases or more from their target genes. Subtle variations, such as single-nucleotide polymorphisms (SNPs), can disrupt these regulatory landscapes by changing enhancer or promoter activity, sometimes resulting in substantial phenotypic effects. As these elements and variations are interspersed throughout massive stretches of DNA, any method that cannot maintain full sequence context while also distinguishing base-level changes risks missing critical genomic signals.

Our contributions

  • SSM-based models achieve performance on par with attention-based models on a wide range of DNA modeling tasks.
  • SSMs zero-shot extrapolate to much longer contexts (10–100×) without additional fine-tuning, suffering minimal performance loss, with trends suggesting possible extrapolation to even longer contexts.
  • We demonstrate scalability to 1 Mbp+ sequences at single-nucleotide resolution on just one GPU, laying the groundwork for future large-scale genomic modeling.

Experiments

Model Architectures

We consider three classes of models, each with 50M parameters for a fair comparison:

  • NTv2: Our baseline is a smaller variant of the Nucleotide Transformer (NTv2), the current state-of-the-art (SOTA) model on the Genomics Long-Range Benchmark (GLRB).
  • Caduceus: Caduceus is the first successful application of an SSM to genomic tasks. It extends Mamba layers with bi-directionality and reverse-complement (RC) equivariance, showing promising genomics modeling results.
  • Hawk: Hawk is a recurrent architecture built on Linear Recurrent Units (LRUs). It achieves competitive standard performance while excelling in zero-shot extrapolation to sequence lengths far beyond those seen during training. Inspired by Caduceus, we enhance Hawk with bi-directional processing (see the sketch below).
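To make the bi-directional wrapping concrete, the following is a minimal NumPy sketch of the idea rather than the actual Caduceus or Hawk implementation: a generic linear recurrence is run once over the sequence and once over its reverse, and the per-position states are combined. The function names, parameterization, and choice of combining by concatenation are illustrative assumptions.

import numpy as np

def recurrent_block(x, A, B):
    # Stand-in for a unidirectional recurrent layer (e.g., an LRU/Hawk block):
    # h_t = A @ h_{t-1} + B @ x_t, returning the hidden state at every position.
    h = np.zeros(A.shape[0])
    out = []
    for x_t in x:
        h = A @ h + B @ x_t
        out.append(h)
    return np.stack(out)

def bidirectional_block(x, A_fwd, B_fwd, A_bwd, B_bwd):
    # Bi-directional wrapper in the spirit of Caduceus: one pass over the
    # sequence, one over its reverse, with per-position states combined.
    fwd = recurrent_block(x, A_fwd, B_fwd)
    bwd = recurrent_block(x[::-1], A_bwd, B_bwd)[::-1]  # flip back to the original order
    return np.concatenate([fwd, bwd], axis=-1)          # concatenation chosen for illustration

# Toy usage with random parameters (shapes only, not trained weights).
rng = np.random.default_rng(0)
d_in, d_h, T = 4, 8, 16
x = rng.normal(size=(T, d_in))
A_f = A_b = 0.9 * np.eye(d_h)
B_f, B_b = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_in))
y = bidirectional_block(x, A_f, B_f, A_b, B_b)  # shape (T, 2 * d_h)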

Pretraining

For our experiments, we train our own versions of Caduceus and Hawk but use a pretrained version of NTv2. We follow the data sourcing and preprocessing procedures outlined in established genomic language model protocols. Specifically, we use the same multispecies genome dataset and settings used to pretrain NTv2, and pretrain our models on 300B nucleotides.

Fine-Tuning on Long-Range Genomics Tasks

Following pretraining, the models are fine-tuned on a suite of long-range genomics tasks from the Genomics Long-Range Benchmark (GLRB), which assess the ability to predict genomic features (e.g., regulatory elements, gene expression, chromatin marks) from long-context sequences.

Performance Results

Genomics Long-Range Benchmark results, using 12 kbp per sequence. Caduceus achieves the best result in three out of six GLRB tasks.

Task                    NTv2    Caduceus    Hawk
Bulk RNA (R²)           0.52    0.53        –
VEP eQTL (AUROC)        0.72    0.68        0.60
VEP ClinVar (AUROC)     0.75    0.75        0.55
Histone Marks (AUPRC)   0.34    0.52        –
Promoters (AUPRC)       0.75    0.77        –
Enhancers (AUROC)       0.78    0.75        –

Zero-shot Extrapolation

We evaluate the models' ability to generalize to significantly longer sequences without requiring further fine-tuning. Specifically, we assess zero-shot extrapolation by evaluating the fine-tuned models on downstream tasks with input lengths up to 10× greater than those seen during pretraining.
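As a hedged illustration of this protocol, the sketch below scores a frozen, fine-tuned model at progressively longer context windows centered on each sequence's region of interest; model_fn, metric_fn, and the window-cropping logic are hypothetical placeholders, not the benchmark's actual harness.

import numpy as np

def evaluate_at_length(model_fn, sequences, labels, context_len, metric_fn):
    # Score a frozen model (no weight updates) at a chosen input length by
    # cropping each sequence to a window of `context_len` around its center.
    preds = []
    for seq in sequences:
        center = len(seq) // 2
        half = context_len // 2
        window = seq[max(0, center - half):center + half]
        preds.append(model_fn(window))
    return metric_fn(labels, np.array(preds))

# Zero-shot sweep: 1x to 10x the 12 kbp fine-tuning context, same weights throughout.
eval_lengths = [12_000 * k for k in (1, 2, 4, 8, 10)]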

Comparison of the extrapolation behavior of state-space models and attention-based models on Bulk RNA, VEP eQTL, and VEP ClinVar (AUROC for the VEP tasks). A dotted vertical line indicates the fine-tuning sequence length (12 kbp) of all models. Attention-based models collapse when processing sequences longer than those encountered at training time, whereas state-space models generalize to sequences up to 10× longer.

Processing Ultralong Sequences

In this section, we demonstrate how the hidden-state transfer mechanism in SSMs can be used to process ultralong sequences of 1M+ tokens on a single GPU. As input sequences grow longer, loading and processing them all at once requires a large amount of memory. If an input sequence exceeds the maximum length a single GPU can handle, we divide the sequence into smaller chunks (for example, 100 kbp segments). The final hidden state from each chunk is passed as the initial state of the next chunk, preserving continuity and dependencies across the entire sequence.
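As a concrete illustration, here is a minimal NumPy sketch of chunked processing with hidden-state transfer; it uses a generic linear recurrence rather than any specific SSM layer, and the function names and shapes are assumptions for illustration. Because the recurrence only needs the previous hidden state, processing the sequence chunk by chunk and carrying that state forward reproduces the full-sequence scan exactly.

import numpy as np

def ssm_scan(x, A, B, h0):
    # Minimal linear SSM recurrence: h_t = A @ h_{t-1} + B @ x_t.
    # Returns the per-position hidden states and the final state.
    h = h0
    states = []
    for x_t in x:
        h = A @ h + B @ x_t
        states.append(h)
    return np.stack(states), h

def chunked_scan(x, A, B, h0, chunk_size):
    # Process an ultralong sequence chunk by chunk, passing the final hidden
    # state of each chunk as the initial state of the next (hidden-state transfer).
    h = h0
    outputs = []
    for start in range(0, len(x), chunk_size):
        states, h = ssm_scan(x[start:start + chunk_size], A, B, h)
        outputs.append(states)
    return np.concatenate(outputs), h

# Toy check: chunked processing matches a single full-length scan.
rng = np.random.default_rng(0)
d_in, d_h, T = 4, 8, 1_000
A = 0.9 * np.eye(d_h)                  # stable transition matrix (illustrative)
B = 0.1 * rng.normal(size=(d_h, d_in))
x = rng.normal(size=(T, d_in))
h0 = np.zeros(d_h)
full, _ = ssm_scan(x, A, B, h0)
chunked, _ = chunked_scan(x, A, B, h0, chunk_size=100)
assert np.allclose(full, chunked)

In practice, each chunk would be a forward pass of the pretrained model with its recurrent state initialized from the previous chunk, and the chunk size can be chosen to fit available GPU memory.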

A mechanism for hidden state propagation in SSMs for ultralong sequences: the sequence is split into multiple chunks and processed with a linear scan over chunks. The chunk size can be set to whatever fits on a single GPU, and the size of the hidden state always stays fixed.
Zero-shot extrapolation on VEP ClinVar (AUROC) with Hawk (50M parameters) up to 1 Mbp input length: performance remains stable despite the substantial increase in context size.
Zero-shot extrapolation on VEP eQTL (AUROC) with Hawk (50M parameters) up to 1 Mbp input length: performance likewise remains stable.

Conclusion

Our evaluation of State-Space Models (SSMs) for long-range genomic modeling demonstrates that these architectures learn high-quality representations that are both biologically meaningful and computationally scalable. Across multiple downstream tasks, SSMs not only match transformer performance but also excel in zero-shot extrapolation—extending from a 12 kbp training context to sequences up to 120 kbp and even 1 Mbp without additional fine-tuning. This behavior aligns with our goal of capturing the genome's hierarchical regulation, preserving both fine-grained nucleotide details and long-range regulatory interactions.

The results presented in this work highlight the potential of SSM-based architectures as scalable alternatives for comprehensive genomic analysis. The demonstrated ability to process ultralong sequences on a single GPU not only makes these models practical for large-scale studies but also opens the door for integrated analyses of entire genomic regions in one pass.

BibTeX

@article{popov2024leveraging,
  title={Leveraging State Space Models in Long Range Genomics},
  author={Popov, Matvei and Kallala, Aymen and Ramesh, Anirudha and Hennouni, Narimane and Khaitan, Shivesh and Gentry, Rick and Cohen, Alain-Sam},
  journal={ICLR 2025 Workshop},
  year={2025}
}