Overview

Germline mutation rates are important in genetics, genomics and evolutionary biology. It is long known that mutation rates vary substantially across the genome, but existing methods can only obtain very rough estimates of local mutation rates and are difficult to be applied to non-model species.

MuRaL (short for Mutation Rate Learner), is a generalizable deep learning framework for mutation rate estimation.

Framework Components

MuRaL Modules
Module	Functionality
MuRaL-snv	Estimates single-nucleotide variant (SNV) rates
MuRaL-indel	Estimates small insertion/deletion (INDEL) rates (1-50bp)

Key advantages over conventional methods:

Superior prediction accuracy across genomic scales
Genome-wide rate mapping from moderate samples (~100 individuals)
Transfer learning capability reduces data requirements
Broad species applicability with population polymorphism data

Output Demonstration

SNV Rate Prediction (A. thaliana)

Below is an example showing that MuRaL-snv predicted each A/T sites mutation rate

Below is an example showing that MuRaL-snv predicted rates (colored lines) are highly correlated with observed mutation rates (grey shades) at different scales on Chr3 of A. thaliana.

_images/regional_correlation_example.jpg

INDEL Rate Prediction (H. sapiens)

Below is an example showing that MuRaL-indel predicted each one-bp insertion mutation rate

Below is an example showing that MuRaL-indel predicted rates (colored lines) are highly correlated with observed mutation rates (grey shades) at different scales on Chr20 of H. sapiens.

_images/INDEL_regional_correlation_example.jpg

Model Architecture

MuRaL-snv Architecture

The MuRaL-snv network architecture has two main modules (shown below), one is for learning signals from local genomic regions (e.g. 10bp on each side of the focal nucleotide), the other for learning signals from expanded regions (e.g. 1Kb on each side of the focal nucleotide).

MuRaL-indel Architecture

The MuRaL-indel network architecture employs a U-Net based CNN structure (shown below) that processes long sequence contexts around potential INDEL breakpoints. The model takes one-hot encoded DNA sequences as input and outputs probabilities for different INDEL lengths (including no mutation). The U-Net consists of symmetrical downsampling and upsampling paths connected through skip connections, with residual blocks and convolutional layers at each scale to extract hierarchical features.

Install

Using Conda

MuRaL depends on several other packages. We recommend using Miniconda (version 3 or newer) to create a conda environment for installation.

Prerequisites

Install Miniconda (version 3+)
Clone/download MuRaL source code and navigate to ‘MuRaL-xxx/’ directory

System Requirements

GPU recommended for training (significantly faster)
CPU sufficient for most prediction tasks
Environment files provided:
- environment.yml (GPU version)
- environment_cpu.yml (CPU-only version)

Installation Process

Create conda environment (may take >30 minutes depending on your internet):

# For GPU systems
conda env create -n mural -f environment.yml
# If interrupted, update instead:
conda env update -n mural -f environment.yml --prune

# For CPU-only systems
conda env create -n mural -f environment_cpu.yml

Activate environment and install MuRaL:

conda activate mural
pip install .

Verify installation:

mural_snv train -v  # Should display version number

Using Singularity

Singularity is a popular container platform for scientific research. We also built Singularity images for specific versions, which can be found at this OSF repo. You can just download the Singularity image mural_vx.x.x.sif from the OSF repo and don’t need to install the dependencies of MuRaL. Once Singularity is installed in your system, you can try running the MuRaL commands with the mural_vx.x.x.sif file.

If your machine has GPUs and you want to use GPU resources for MuRaL tools, please remember to set the --nv flag for Singularity commands. See the following examples:

singularity exec --nv /path/to/mural_vx.x.x.sif mural_snv train ...
singularity exec --nv /path/to/mural_vx.x.x.sif mural_snv transfer ...

For prediction tasks, it is recommended to use only CPUs so that you can run many prediction tasks in parallel. See the example below:

singularity exec /path/to/mural_vx.x.x.sif  ...

For more about Singularity, please refer to the Singularity documentation.

Tools and examples

Command Line Tools

All MuRaL tools are available via command line. For detailed usage, run any command with the -h help option. See subsequent sections for specific examples.

Main Commands

MuRaL-snv

`mural_snv train`	Train new mutation rate models from scratch
`mural_snv get_best_model`	Find the best SNV rate model from trial results (requires ‘progress.csv’)
`mural_snv transfer`	Fine-tune models using transfer learning
`mural_snv predict`	Predict mutation rate predictions
`mural_snv evaluate`	Evaluate model performance metrics
`mural_snv calc_scaling_factor`	Calculate scaling factors
`mural_snv scale`	Convert predictions to per-generation rates

MuRaL-indel

`mural_indel train`	Train new INDEL rate models
`mural_indel get_best_model`	Find the best INDEL rate model from trial results (requires ‘progress.csv’)
`mural_indel transfer`	Apply transfer learning for INDEL models
`mural_indel predict`	Predict INDEL mutation rates
`mural_indel evaluate`	Evaluate INDEL model performance
`mural_indel calc_scaling_factor`	Calculate scaling factors
`mural_indel scale`	Scale INDEL rate predictions

Model training

Both mural_snv train and mural_indel train commands train MuRaL models using training and validation mutation data. Training results are saved in different locations based on whether Ray is used for hyperparameter search(–use_ray):

Without Ray: ./results/ folder
With Ray: ./ray_results/ folder

Input Requirements

Required Files:

Reference sequence (FASTA format)
Training data (BED format)
Validation data (BED format, optional)

Note: If the validation data file isn’t provided, a fraction of the sites sampled from the training data file are used as validation data.

Input training and validation data files are in BED format (more info about BED format here). Some example lines of an input BED file are shown below.

chr1    2333436 2333437 .   0   +
chr1    2333446 2333447 .   2   -
chr1    2333468 2333469 .   1   -
chr1    2333510 2333511 .   3   -
chr1    2333812 2333813 .   0   -

In the BED-formatted lines above, the 5th column is used to represent mutation status: usually, ‘0’ means the non-mutated status and other numbers for specific mutation types (e.g. ‘1’ for ‘A>C’, ‘2’ for ‘A>G’, ‘3’ for ‘A>T’). You can specify an arbitrary order for a group of mutation types with incremental numbers starting from 0, but make sure that the same order is consistently used in training, validation and testing datasets. Importantly, the training and validation BED file MUST BE SORTED by chromosome coordinates. You can sort BED files by bedtools sort or sort -k1,1 -k2,2n.

Recurrent mutation BED format

When analyzing recurrent mutations (sites mutated multiple times across individuals), use the --recurrent flag during training, prediction and evaluation. In this mode, the 4th column (name field) of the BED file encodes per-site information in a semicolon-separated format: chrom:start;ref>alt;AC;count, where AC is the allele count (integer; -1 if unavailable) and the last field is the observed mutation count:

chr1    238329  238330  chr1:238329;G>A;-1;3   1   +
chr1    238331  238332  chr1:238331;G>A;-1;1   1   -
chr1    238335  238336  .                       0   +

In the example above, the site at chr1:238329 has 3 independent mutation events (count=3), while chr1:238331 has 1 event. Non-mutated sites use . as the name field and are assigned a weight of 1.

During training with --recurrent, per-site counts are used as sample weights in the cross-entropy loss, so sites with more independent mutations contribute proportionally more to the gradient.

Output data

mural_snv train or mural_indel train saves the model information at each checkpoint, normally at the end of each training epoch of a trial. The checkpointed model files during training are saved under folders named like:

Non-Ray Execution (default):

./results/{experiment_name}/Train_{trial_id}/checkpoint_{epoch}/
                                                  ├── model               # Learned parameters
                                                  ├── model.config.pkl    # Model hyperparameters
                                                  └── model.fdiri_cal.pkl # Calibration model (if validation data provided)

Ray Execution:

./ray_results/{experiment_name}/Train_{trial_id}/checkpoint_{epoch}/
                                                        ├── model
                                                        ├── model.config.pkl
                                                        └── model.fdiri_cal.pkl

In the above folder, the ‘model’ file contains the learned model parameters. The ‘model.config.pkl’ file contains configured hyperparameters of the model. The ‘model.fdiri_cal.pkl’ file (if exists) contains the calibration model learned with validation data, which can be used for calibrating predicted mutation rates. These files can be used in downstream analyses such as model prediction and transfer learning. The ‘progress.csv’ files in ‘Train_xxx’ folders contain important information for each training epoch of trials (e.g., validation loss, used time, etc.). You can use the command mural_snv get_best_model or mural_indel get_best_model to find the best model per trial after training.

# For SNV: running trials without Ray (default)
mural_snv get_best_model ./results/your_experiment_name

# For SNV: running trials using Ray
mural_snv get_best_model ./ray_results/your_experiment_name

# For INDEL: running trials without Ray (default)
mural_indel get_best_model ./results/your_experiment_name

# For INDEL: running trials using Ray
mural_indel get_best_model ./ray_results/your_experiment_name

The output lists all trials sorted by validation loss (ascending). Use the top result for predictions. Example:

./results/your_experiment_name/Train_zoq97_00000/checkpoint_8   0.401637  # ← Best model (lowest loss)
./results/your_experiment_name/Train_zoq97_00002/checkpoint_8   0.407693
./results/your_experiment_name/Train_zoq97_00001/checkpoint_6   0.445094

Example 1: Basic Training

MuRaL-snv

The following commands will train a SNV model by running two trials (default:--n_trials 2), using data in ‘snv/data/training.sorted.bed’ for training. The training results will be saved under the folder ‘./results/example1/’ or ‘./ray_results/example1/’. Default values will be used for other unspecified arguments. Note that, by default, 10% of the sites sampled from ‘training.sorted.bed’ is used as validation data (i.e. --valid_ratio 0.1). You can run this example under the ‘examples/’ folder in the package.

# serially running two trials (default)
mural_snv train --ref_genome snv/data/seq.fa --train_data snv/data/training.sorted.bed \
            --experiment_name example1 > test1.out 2> test1.err

# parallelly running two trials using Ray, with 3 CPUs each trial (6 requested CPUs in total)
mural_snv train --ref_genome snv/data/seq.fa --train_data snv/data/training.sorted.bed \
            --use_ray --ray_ncpus 6 --cpu_per_trial 3 --experiment_name example1 \
            > test1.out 2> test1.err

Note

If your machine has sufficient resources to execute multiple trials in parallel, it is recommended to add the --use_ray option. Using Ray allows for better resource scheduling. If running multiple trials serially or running only a single trial (set --n_trials 1), it is recommended not to use --use_ray, which can improve the runtime speed by approximately 2-3 times for each trial.

MuaRaL-indel

The following commands train an INDEL model with two trials (default: --n_trials 2), using indel/data/training.sorted.bed as input. Parameters are identical to SNV training, but with two key differences:

Uses larger default context window (--distal_radius 4000) to improve INDEL prediction accuracy while maintaining reasonable computational requirements

# serially running two trials (default)
mural_indel train --ref_genome indel/data/seq.fa \
            --train_data indel/data/training.sorted.bed \
            --experiment_name example1 > test1.out 2> test1.err

# parallelly running two trials using Ray, with 3 CPUs each trial (6 requested CPUs in total)
mural_indel train --ref_genome indel/data/seq.fa \
            --train_data indel/data/training.sorted.bed \
            --use_ray --ray_ncpus 6 --cpu_per_trial 3 --experiment_name example1 \
            --experiment_name example1 > test1.out 2> test1.err

Note

–use_reverse should be added, if insertion model is trained.

Example 2: Training with large sequence

MuRaL-snv

The following command will use data in ‘snv/data/training.sorted.bed’ as training data and a separate ‘snv/data/validation.sorted.bed’ as validation data. The option --local_radius 7 means that length of the local sequence used for training is 7*2+1 = 15 bp. --distal_radius 200 means that length of the expanded sequence used for training is 200*2+1 = 401 bp. You can run this example under the ‘examples/’ folder in the package.

mural_snv train --ref_genome snv/data/seq.fa \
            --train_data snv/data/training.sorted.bed \
            --validation_data snv/data/validation.sorted.bed \
            --n_trials 2 --local_radius 7 \
            --distal_radius 200 --experiment_name example2 \
            > test2.out 2> test2.err

MuRaL-indel

The following command will use data in ‘indel/data/training.sorted.bed’ as training data and a separate ‘indel/data/validation.sorted.bed’ as validation data. --distal_radius 4000 means that length of the expanded sequence used for training is 4000*2 = 8000 bp. You can run this example under the ‘examples/’ folder in the package.

# For indel, 4k radius is recommend
mural_indel train --ref_genome indel/data/seq.fa \
            --train_data indel/data/training.sorted.bed \
            --validation_data indel/data/validation.sorted.bed \
            --n_trials 2 \
            --distal_radius 4000 --experiment_name example2 \
            > test2.out 2> test2.err

Example 3: Training acceleration

For large sequence contexts (distal_radius > 1000), data loading may become a bottleneck during training. To mitigate this, specify additional CPUs per trial using --cpu_per_trial. The following example demonstrates using 3 extra CPUs to accelerate data loading(applies equally to both mural_snv and mural_indel):

mural_snv train --ref_genome snv/data/seq.fa \
            --train_data snv/data/training.sorted.bed \
            --cpu_per_trial 4 \
            --experiment_name example3 > test3.out 2> test3.err

mural_indel train --ref_genome indel/data/seq.fa \
            --train_data indel/data/training.sorted.bed \
            --cpu_per_trial 4 \
            --experiment_name example3 > test3.out 2> test3.err

Example 4: Training with limited resources

If RAM memory or GPU memory limits the usage of mural_snv train or mural_indel train (which may happen with large expanded sequences used for training), please refer to following suggestions.

Since v1.1.2, we split the genomes into segments of a spefic size to facilitate data preprocessing (see illustration in the panel A of the figure below).

To reduce RAM memory, you can set smaller values for --segment_center and --sampled_segments. The default value of --segment_center is 300,000 bp, meaning maximum encoding unit of the sequence is 300000+2*distal_radius bp (see illustration in the panel A of the figure below). It is the key parameter for trade-off between RAM memory usage and data preprocessing speed. You can reduce this (e.g., 50,000 bp) at the cost of an acceptable loss in data preprocessing speed. The second option is to set smaller value for --sampled_segments (default 10). If changing this, you should also check the performance of trained model, because --sampled_segments may also influence model performance sometimes. The impact of the two parameters on RAM usage can be visualized in the following figure (panels B & C):

_images/preprocessAndRAM_memory_usage.jpg

To reduce GPU memory, it is recommended to reduce --batch_size (default 128). You can set smaller values (e.g., 64, 32, 16, etc.). In addition, you may also consider using smaller models (e.g., reducing --distal_radius, --CNN_out_channels, etc.), but that may affect model performance.

You can run the following commands under the ‘examples/’ folder in the package.

MuRaL-snv

# use less RAM memory
mural_snv train --ref_genome snv/data/seq.fa \
           --train_data snv/data/training.sorted.bed \
           --validation_data snv/data/validation.sorted.bed \
           --n_trials 2 --local_radius 7 \
           --distal_radius 6400 --segment_center 100000 \
           --sampled_segments 4 --experiment_name example4 \
           > test4.out 2> test4.err

# use less GPU memory
mural_snv train --ref_genome snv/data/seq.fa \
           --train_data snv/data/training.sorted.bed \
           --validation_data snv/data/validation.sorted.bed \
           --n_trials 2 --local_radius 7 \
           --batch_size 64 \
           --distal_radius 6400 --experiment_name example4 \
           > test4.out 2> test4.err

MuRaL-indel

# use less RAM memory
mural_indel train --ref_genome indel/data/seq.fa \
           --train_data indel/data/training.sorted.bed \
           --validation_data indel/data/validation.sorted.bed \
           --n_trials 2 --local_radius 7 \
           --distal_radius 6400 --segment_center 100000 \
           --sampled_segments 4 --experiment_name example4 \
           > test4.out 2> test4.err

# use less GPU memory
mural_indel train --ref_genome indel/data/seq.fa \
           --train_data indel/data/training.sorted.bed \
           --validation_data indel/data/validation.sorted.bed \
           --n_trials 2 --local_radius 7 \
           --batch_size 64
           --distal_radius 6400 --experiment_name example4 \
           > test4.out 2> test4.err

Note

The RAM memory usage can be estimated as: sampled_segments * segment_center * 4 * (2 * distal_radius + 1) * 4 / 2^30 + C (GB)

where:

C is a constant term ranging between 5 and 12 GB;
seq_length is (2 * distal_radius + 1) for SNV,
seq_length is (2 * distal_radius) for INDEL,

When experiencing insufficient RAM, this formula helps determine appropriate parameter values.

Example 4b: Training with recurrent mutations

When training on recurrent mutation data (where the same genomic site may be mutated independently in multiple individuals), add the --recurrent flag. This uses per-site mutation counts as sample weights in the loss function, ensuring that sites with more independent mutation events contribute proportionally to training.

mural_snv train --ref_genome snv/data/seq.fa \
           --train_data snv/data/recurrent_training.sorted.bed \
           --validation_data snv/data/recurrent_validation.sorted.bed \
           --recurrent --n_trials 2 --experiment_name example4b \
           > test4b.out 2> test4b.err

Note

The recurrent mutation BED files in the example directory are synthetic data generated from the standard training/validation files via generate_recurrent_data.py. They are for demonstration purposes only.

Model prediction

mural_snv predict and mural_indel predict predicts mutation rates for all sites in a BED file based on a trained model.

Input Requirements

Reference genome (FASTA format)
Input BED file (formatted identically to training BED files) - Column 5 can be set to ‘0’ if mutation status is unknown
Trained model files(generated by mural_snv train or mural_indel train): - model (learned parameters) - model.config.pkl (model configuration)
Optional calibration file: - model.fdiri\_cal.pkl (for calibrating predicted mutation rates)

Output data

The output of mural_snv predict is a tab-separated file containing the sequence coordinates (BED-formatted) and the predicted probabilities for all possible mutation types. Usually, the ‘prob0’ column contains probabilities for the non-mutated class and other ‘probX’ columns for mutated classes. Some example lines of a prediction output file are shown below.

SNV output:

chrom   start   end    strand mut_type  prob0   prob1   prob2   prob3
chr1    10006   10007   -       0       0.9797  0.003134 0.01444 0.002724
chr1    10007   10008   +       0       0.9849  0.005517 0.00707 0.002520
chr1    10008   10009   +       0       0.9817  0.004801 0.01006 0.003399
chr1    10012   10013   -       0       0.9711  0.004898 0.02029 0.003746

INDEL output:

chrom   start   end    strand mut_type  prob0   prob1   prob2    ...   prob7
chr1    10006   10007   -       0       0.9649  0.008623 0.00680 ... 0.012542
chr1    10007   10008   +       0       0.9849  0.005517 0.00707 ... 0.009099
chr1    10008   10009   +       0       0.9717  0.007114 0.00385 ... 0.008034
chr1    10012   10013   -       0       0.9706  0.010121 0.00476 ... 0.006360

Example 5: Mutation Rate Prediction

MuRaL-snv

Predicts mutation rates for sites in ‘snv/data/testing.bed.gz’ using model from ‘snv/models/checkpoint_6/’:

mural_snv predict --ref_genome snv/data/seq.fa --test_data snv/data/testing.bed.gz \
                   --model_path snv/models/checkpoint_6/model \
                   --model_config_path snv/models/checkpoint_6/model.config.pkl \
                   --calibrator_path snv/models/checkpoint_6/model.fdiri_cal.pkl \
                   --pred_file testing.ckpt6.fdiri.tsv.gz \
                   --cpu_only > test5.out 2> test5.err

For recurrent mutation data, add --recurrent so that correlation calculations use observed mutation counts instead of binary presence/absence:

mural_snv predict --ref_genome snv/data/seq.fa --test_data snv/data/recurrent_validation.sorted.bed \
                   --model_path snv/models/checkpoint_6/model \
                   --model_config_path snv/models/checkpoint_6/model.config.pkl \
                   --calibrator_path snv/models/checkpoint_6/model.fdiri_cal.pkl \
                   --pred_file testing.ckpt6.fdiri.tsv.gz \
                   --recurrent --cpu_only > test5.out 2> test5.err

MuRaL-indel

mural_indel predict --ref_genome indel/data/seq.fa --test_data indel/data/testing.bed.gz \
                   --model_path indel/models/checkpoint_9/model \
                   --model_config_path indel/models/checkpoint_9/model.config.pkl \
                   --calibrator_path indel/models/checkpoint_9/model.fdiri_cal.pkl \
                   --pred_file testing.ckpt9.fdiri.tsv.gz \
                   --cpu_only > test5.out 2> test5.err

Transfer learning

mural_snv transfer and mural_indel transfer trains MuRaL models like mural_snv train and mural_indel train but initializes model parameters with learned weights from a pre-trained model. Its training results are also saved under the ‘./results/’ or ‘./ray_results/’ folder.

Input Requirements

Required files:

Reference genome (FASTA format)
Training data (BED format)
Pre-trained model files:
- model (learned parameters)
- model.config.pkl (model configuration)

Optional:

Validation data (BED format)
Calibration model (model.fdiri_cal.pkl)

Output data

Output data has the same structure as that of mural train.

Example 6: transfer learning commands

MuRaL-snv

The following command will train a transfer learning SNV model using training data in ‘snv/data/training_TL.sorted.bed’, the validation data in ‘snv/data/validation.sorted.bed’, and the model files under ‘models/checkpoint_6/’. You can run this example under the ‘examples/’ folder in the package.

mural_snv transfer --ref_genome snv/data/seq.fa \
               --train_data snv/data/training_TL.sorted.bed \
               --validation_data snv/data/validation.sorted.bed \
               --model_path snv/models/checkpoint_6/model \
               --model_config_path snv/models/checkpoint_6/model.config.pkl \
               --train_all --init_fc_with_pretrained \
               --experiment_name example6 > test6.out 2> test6.err

MuRaL-indel

The following command will train a transfer learning INDEL model using training data in ‘indel/data/training_TL.sorted.bed’, the validation data in ‘indel/data/validation.sorted.bed’, and the model files under ‘indel/models/checkpoint_6/’. You can run this example under the ‘examples/’ folder in the package.

mural_indel transfer --ref_genome indel/data/seq.fa \
               --train_data indel/data/training_TL.sorted.bed \
               --validation_data indel/data/validation.sorted.bed \
               --model_path indel/models/checkpoint_9/model \
               --model_config_path indel/models/checkpoint_9/model.config.pkl \
               --train_all --init_fc_with_pretrained \
               --experiment_name example6 > test6.out 2> test6.err

Calculating k-mer and regional correlations for evaluation

For model evaluation, since it is impossible to evaluate the accuracy of predicted mutation rates at the single-nucleotide level, we employ two metrics, k-mer correlation and regional correlation, to evaluate model performance at the higher (summarized) levels. More details about the two metrics can be found in the MuRaL paper. The k-mer and regional correlations can be calculated with the predicted tsv files generated by mural_snv predict and mural_indel predict.

The command mural_snv evaluate and mural_indel evaluate is used for calculating k-mer and regional correlations.By default, they output results for both k-mer correlation and regional correlation.

K-mer correlation analysis

Input data

The inputs for k-mer correlation analysis include the reference FASTA file, a prediction tsv file and the length of k-mer. Note that for evaluation, we need to provide a specific set of observed mutations (e.g. all available rare variants), which are stored in the 5th column of the prediction tsv files. These observed mutations are used for calculating observed mutation rates. We can change the content in the 5th column to evaluate model performance in different observed mutation data.

Output data

The outputs include a file (’*-mer.mut_rates.tsv’) storing predicted and observed k-mer rates of all possible mutation subtypes, and a file (’*-mer.corr.txt’) storing the k-mer correlations (Pearson’s r and p-value) of three mutation types in a specific order (e.g., for A/T sites, prob1, prob2 and prob3 are for A>C, A>G and A>T, respectively).

# example of '*-mer.mut_rates.tsv'
type   avg_obs_rate1   avg_obs_rate2   avg_obs_rate3   avg_pred_prob1  avg_pred_prob2  avg_pred_prob3  number_of_mut1  number_of_mut2  number_of_mut3  number_of_all
TAG    0.006806776385512125    0.010141979926438501    0.012039461380213204    0.012744358544122413    0.01817057941563919     0.021860978496512425    3494    5206    6180    513312
TAA    0.007517292690907348    0.011278023120833133    0.01318808653952362     0.013600087566977897    0.019697007577734515    0.024266536859123104    7214    10823   12656   959654
AAA    0.0068964404639771226   0.010705555691654661    0.009617493130148654    0.012599749576515839    0.020442895433664586    0.01646869397956817     11542   17917   16096   1673617

# example of '*-mer.corr.txt'
3-mer  prob1   0.9569216831654604      6.585788162834682e-09 # r and p for prob1
3-mer  prob2   0.9326211281771537      1.4129640985193586e-07 # r and p for prob2
3-mer  prob3   0.947146892265788       2.6848989196451608e-08 # r and p for prob3

Example 7 : k-mer correlation commands

MuRaL-snv

The following commands use the prediction file ‘testing.ckpt4.fdiri.tsv.gz’ to calculate 3-mer, 5-mer and 7-mer correlations:

mural_snv evaluate --pred_file testing.ckpt4.fdiri.tsv.gz --ref_genome snv/data/seq.fa --kmer_length 3 --kmer_only --out_prefix test_kmer_corr
mural_snv evaluate --pred_file testing.ckpt4.fdiri.tsv.gz --ref_genome snv/data/seq.fa --kmer_length 5 --kmer_only --out_prefix test_kmer_corr
mural_snv evaluate --pred_file testing.ckpt4.fdiri.tsv.gz --ref_genome snv/data/seq.fa --kmer_length 7 --kmer_only --out_prefix test_kmer_corr

For recurrent mutation data, add --recurrent so that observed rates reflect the actual number of independent mutation events rather than binary presence/absence:

mural_snv evaluate --pred_file testing.ckpt4.fdiri.tsv.gz --ref_genome snv/data/seq.fa --kmer_length 3 --kmer_only --recurrent --out_prefix test_kmer_corr

MuRaL-indel

The following commands use the prediction file ‘testing.ckpt4.fdiri.tsv.gz’ to calculate 2-mer, 4-mer and 6-mer correlations:

mural_indel evaluate --pred_file testing.ckpt9.fdiri.tsv.gz --ref_genome indel/data/seq.fa --kmer_length 2 --kmer_only --out_prefix test_kmer_corr
mural_indel evaluate --pred_file testing.ckpt9.fdiri.tsv.gz --ref_genome indel/data/seq.fa --kmer_length 4 --kmer_only --out_prefix test_kmer_corr
mural_indel evaluate --pred_file testing.ckpt9.fdiri.tsv.gz --ref_genome indel/data/seq.fa --kmer_length 6 --kmer_only --out_prefix test_kmer_corr

Regional correlation analysis

Input data

The inputs for regional correlation analysis include a prediction tsv file and the window size. Like the k-mer correlation analysis, we need to provide a specific set of observed mutations in the 5th column of the prediction tsv files. These observed mutations are used for calculating observed regional mutation rates.

Output data

There are multiple output files. The files storing regional rates (’*.regional_rates.tsv’) have seven columns: chromosome name, the end position of the window, number of valid sites in the window, number of observed mutations in the window, average observed mutation rate, average predicted mutation rate in the window and the ‘used_or_deprecated’ label. The windows labeled ‘deprecated’ are not used in correlation analysis due to too few valid sites. The regional correlation (Pearson’s r and p-value) of the considered mutation type is given in the ‘*.corr.txt’.

# example of '*.regional_rates.tsv'
chrom  end     sites_count     mut_type_total  mut_type_avg    avg_pred        used_or_deprecated
chr3   100000  61492   576     0.009367072139465296    0.020374342255903233    used
chr3   200000  60680   531     0.008750823994726434    0.02025859070533955     used
chr3   300000  59005   499     0.00845691043131938     0.01882644280993153     used
...

# example of '*.corr.txt'
100Kb  prob3   0.4999  6.040983e-16

Example 8: regional correlation commands

MuRaL-snv

The following command will calculate the regional correlation for 100Kb windows.

mural_snv evaluate --pred_file testing.ckpt4.fdiri.tsv.gz --window_size 100000 --regional_only --out_prefix test_region_corr

With --recurrent, observed counts reflect the number of independent mutation events per window:

mural_snv evaluate --pred_file testing.ckpt4.fdiri.tsv.gz --window_size 100000 --regional_only --recurrent --out_prefix test_region_corr

MuRaL-indel

The following command will calculate the regional correlation for 100Kb windows.

mural_indel evaluate --pred_file testing.ckpt9.fdiri.tsv.gz --window_size 100000 --regional_only --out_prefix test_region_corr

Visualization of correlation results

You can run the commands like below to extract k-mer correlations and corresponding p-values for further visualization:

# For SNV
cat test.{3,5,7}-mer.corr.txt | awk 'BEGIN{print "k-mer\tmut_type\tcorrelation\tp-value"}{print;}' > kmer_correlations.tsv
# For INDEl
cat test.{2,4,6}-mer.corr.txt | awk 'BEGIN{print "k-mer\tmut_type\tcorrelation\tp-value"}{print;}' > kmer_correlations.tsv

The resulting ‘kmer_correlations.tsv’ file is tab-delimited, looking like:

# For SNV
k-mer  mut_type        correlation     p-value
3-mer  A>C                     0.8527          2.7049e-05
3-mer  A>G                     0.8453          3.7235e-05
...

or

# For INDEL
kmer  mut_type correlation     p-value
2-mer  A>C                     0.8527          2.7049e-05
2-mer  A>G                     0.8453          3.7235e-05
...

The following python code can be used for generating bar plots for k-mer correlations:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_table('kmer_correlations.tsv')
plt.figure(figsize=(6,4))
sns.catplot(x="mut_type", y="correlation", kind="bar", hue="k-mer", data=df, palette="Blues_r")
plt.title('Bar plots of k-mer correlations')
plt.savefig('kmer_correlations.jpg', bbox_inches='tight')

The plot looks like below:

Similarly, one can generate bar plots for regional correlations for evaluation.

In addition, based on the output of calc_region_corr above, we can visualize how predicted rates fit observed rates for windows across a chromosome or a specific region. First, we should standardize the observed rates and the predicted rates for all windows by using z-score transformation. Then we select some regions to generate the plots. Below we use the results for 100Kb windows and A>G mutation type, and the region selected is from 15Mb to 23.6Mb. The solid line indicates average predicted mutation rates and the shade for average observed mutation rates:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from scipy.stats import pearsonr

df = pd.read_table('test2.100Kb.prob2.regional_rates.tsv')
df = df[df['used_or_deprecated'] == 'used']

#z-score preprocessing
avg_obs = preprocessing.scale(df['avg_obs'])
avg_pred = preprocessing.scale(df['avg_pred'])
data = {'window_end':df['window_end'],'avg_obs':list(avg_obs),'avg_pred':list(avg_pred)}
df1 = pd.DataFrame(data)

#select the region
df2 = df1[143:229]
corr = pearsonr(df2['avg_obs'],df2['avg_pred'])
print("Correlation of the selected regions is %f, p-value is %f" %(corr[0],corr[1]))

#plot
fig, ax = plt.subplots(1, figsize=(10, 2))
ax.set_xlabel("Chr3(Mb)")
ax.fill_between(df2['window_end']/1000000,df2['avg_obs'], alpha=0.3, color = 'Grey')
ax.plot(df2['window_end']/1000000,df2['avg_pred'], label="avg_pred", linewidth = 1.5)
plt.ylabel('average mutation rate (Z-score)')

plt.savefig('regional_rates.jpg', bbox_inches = 'tight')

The plot looks like below:

Scaling MuRaL-predicted mutation rates to per base per generation rates

The raw MuRaL-predicted mutation rates are not mutation rates per bp per generation. To obtain a mutation rate per bp per generation for each nucleotide, one can scale the MuRaL-predicted rates using reported genome-wide de novo mutation rate and spectrum per generation.

Calculate scaling factors: Use the appropriate command for your mutation type:
- mural_snv calc_scaling_factor - for SNV site groups (A/T, C/G, CpG sites)
- mural_indel calc_scaling_factor - for INDEL site types (insertions, deletion starts, deletion ends)
Apply scaling: use the command mural_snv scale or mural_indel scale to scale mutation rates in prediction files via scaling factors

Note

Cannot directly compare raw rates from different MuRaL models
Genome-wide rate accuracy affects between-species comparison
Scaling does not affect k-mer/regional correlation calculations

Example 9 : Scaling workflow

MuRaL-snv

Step 1: Calculate scaling factor

Here is an example for scaling mutation rates for A/T sites. Suppose that we have the following proportions of different mutation types and proportions of different site groups in a genome. In addition, suppose we already know from previous research that the genome-wide mutation rate per generation of the species is 5e−9 per base per generation. If the per generation mutation rate is not available for the studied species, one may use the estimate from a closely related species.

#mutation_type  proportion
AT_mutations       0.355
nonCpG_mutations   0.423
CpG_mutations      0.222

#site_group    proportion
AT_sites           0.475
nonCpG_sites       0.391
CpG_sites          0.134

To calculate the scaling factor, we need to have the predicted mutation rates for a set of representative sites based on a trained model. It is recommended to use the validation sites at the training step whose size is relatively small and representative enough. For instance, the following command is for obtaining predicted mutation rates for validation sites of the A/T model.

mural_snv predict --ref_genome snv/data/seq.fa --test_data snv/data/AT_validation.sorted.bed \
--model_path models/checkpoint_6/model --model_config_path models/checkpoint_6/model.config.pkl  \
--calibrator_path  models/checkpoint_6/model.fdiri_cal.pkl \
--pred_file AT_validation.ckpt6.fdiri.tsv.gz \
--cpu_only > test.out 2> test.err

Next, the command mural_snv calc_snv_scaling_factor will be used to compute the scaling factor based on the predicted rates, the proportions of A/T mutation types and proportions of A/T sites in the genome, and the genome-wide per generation mutation rate.

mural_snv calc_scaling_factor --pred_files AT_validation.ckpt6.fdiri.tsv.gz --genomewide_mu 5e-9
--m_proportions 0.355 --g_proportions 0.475 > scaling_factor.out

Output file ‘scaling_factor.out’ may look like the following:

pred_file: AT_validation.ckpt6.fdiri.tsv.gz
genomewide_mu: 5e-09
n_sites: 84000
g_proportion: 0.475
m_proportion: 0.355
prob_sum: 4.000e+03
scaling factor: 7.848e-08

Step 2: Apply scaling

Finally, the obtained scaling factor 7.848e-08 is used to scale all the predicted rates of all A/T sites using mural_snv scale --scale_factors. You can run separately for each chromosome.

mural_snv scale --pred_file AT_chr1.tsv.gz --scale_factor 7.848e-08 --out_file AT_chr1.scaled.tsv.gz

Similarly, you can generate the scaled mutation rates for non-CpG and CpG sites like the above example. More details can be found in the MuRaL paper.

MuRaL-indel

Step 1: Calculate scaling factor

Here is an example for scaling mutation rates for insertion sites. Suppose that we have the following proportions of different INDEL mutation types and proportions of different site groups in a genome. In addition, suppose we already know from previous research that the genome-wide mutation rate per generation of the species is 1.5e−9 per base per generation. If the per generation mutation rate is not available for the studied species, one may use the estimate from a closely related species.

#INDEL proportion
insertion       0.352
deletion start  0.648
deletion end    0.648

Note: Each deletion has two breakpoints—designated as the deletion start and deletion end. These two labels refer to the same deleted interval, so, same proportions are used for both.

To calculate the scaling factor, we need to have the predicted mutation rates for a set of representative sites based on a trained model. It is recommended to use the validation sites at the training step whose size is relatively small and representative enough. For instance, the following command is for obtaining predicted mutation rates for validation sites of the INDEL model.

mural_indel predict --ref_genome indel/data/seq.fa --test_data indel/data/insertion_validation.sorted.bed \
--model_path models/checkpoint_6/model --model_config_path models/checkpoint_6/model.config.pkl  \
--calibrator_path  models/checkpoint_6/model.fdiri_cal.pkl \
--pred_file insertion_validation.ckpt6.fdiri.tsv.gz \
--cpu_only > test.out 2> test.err

Next, the command mural_indel scale will be used to compute the scaling factor based on the predicted rates, the proportions of insertion mutation types, and the genome-wide per generation mutation rate.

mural_indel calc_scaling_factor --pred_files insertion_validation.ckpt6.fdiri.tsv.gz --genomewide_mu 1.5e-9
--m_proportions 0.352 > scaling_factor.out

Output file ‘scaling_factor.out’ may look like the following:

pred_file: insertion_validation.ckpt6.fdiri.tsv.gz
genomewide_mu: 1.5e-09
n_sites: 221240912
m_proportion: 0.352
prob_sum: 4.000e+03
scaling factor: 6.271e-09

Step 2: Apply scaling

Finally, the obtained scaling factor 6.271e-09 is used to scale all the predicted rates of all insertion sites using mural_indel scale --scale_factors. You can run separately for each chromosome.

mural_indel scale --pred_file insertion_chr1.tsv.gz --scale_factor 6.271e-09 --out_file insertion_chr1.scaled.tsv.gz

Similarly, you can generate the scaled mutation rates for deletion start and deletion end sites like the above example. More details can be found in the MuRaL-indel paper.

Trained models and predicted mutation rate maps of multiple species

MuRaL-snv

Trained models (by MuRaL v1.0.0) for four species - Homo sapiens, Macaca mulatta, Arabidopsis thaliana and Drosophila melanogaster are provided in the ‘models/*/SNV’ folder of the package. One can use these model files for prediction or transfer learning.

Predicted single-nucleotide mutation rate maps (by MuRaL v1.0.0) for these genomes are available at ScienceDB.

MuRaL-indel

Trained models (by MuRaL v1.2.0) for four species - Homo sapiens, Macaca mulatta, Arabidopsis thaliana and Drosophila melanogaster are provided in the ‘models/*/INDEL’ folder of the package. One can use these model files for prediction or transfer learning.

Predicted base-resolution INDEL mutation rate maps (by MuRaL v1.2.0) for these genomes are available at ScienceDB.

Citation

Fang Y, Deng S, Li C. A generalizable deep learning framework for inferring fine-scale germline mutation rate maps. Nature Machine Intelligence (2022) doi:10.1038/s42256-022-00574-5

Deng S, Song H, Li C. A deep learning framework for building INDEL mutation rate maps. Briefings in Bioinformatics (2025) doi:10.1093/bib/bbag212 <https://doi.org/10.1093/bib/bbag212>

Contact

For reporting issues or requests related to the package, please use GitHub Issues or write to mural-project@outlook.com.