simON-reads (“Simulate Oxford Nanopore Reads”) is a simple yet powerful tool for generating FASTQ files containing MinION-like long reads. This Python script generates artificial DNA sequencing reads, with specified variations and errors, from given reference sequences. It uses the BioPython library to handle DNA sequences and provides options to introduce single nucleotide polymorphisms (SNPs) and sequencing errors.
simON-reads is thus a flexible and customizable tool for generating artificial DNA sequencing data, which makes it valuable for testing and validating bioinformatics software and pipelines. The introduced variations and errors mimic real-world scenarios, allowing thorough testing of downstream analysis pipelines.
The basic requirement for Windows installation is to have Python 3.10 or higher up and running on your machine. If you don’t meet this prerequisite, consider downloading it from the dedicated page.
Once this requirement is fulfilled, run:
```
python3 -m pip install biopython
python3 -m pip install matplotlib
cd Downloads\simON-reads-1.0.0
python3 .\simON-reads-1.0.0\scripts\simON_reads.py -h
```
The script was conceived for Linux-like operating systems but should also work on Windows. Nevertheless, feel free to report any issues you encounter!
The basic requirement for Linux installation is to have Mamba and Conda up and running on your machine. If you don’t meet this prerequisite, consider downloading them from their dedicated pages.
```bash
git init
git clone https://github.com/AstraBert/simON-reads
cd ./simON-reads
bash ./scripts/install.sh
conda activate ./scripts/environments/simON-reads
simON_reads.py -h
conda deactivate
```
The script accepts the following options (only `-i` is required; the others are optional):
```
simON_reads.py [-v, --version] -i, --infile INFILE [-snp, --single_nucleotide_polymorphism "SAMPLE:POS:REF>ALT:1/0,..."] [-n, --nreads READS_NUMBER] [-ese, --enable_sequencing_error] [-ehp, --enable_homopolymer_error]
```
- `-v` or `--version`: Print the version of the code.
- `-i` or `--infile`: Path to the input FASTA file containing the reference sequence(s).
- `-snp` or `--single_nucleotide_polymorphism`: Insert single nucleotide variants. The syntax of this option is `SAMPLE:POS:REF>ALT,SAMPLE:POS:REF>ALT:1/0,...,SAMPLE:POS:REF>ALT:1/0` (entries are separated by commas, without blank spaces), where `SAMPLE` is the header of the sequence (without “>”) in the original FASTA file, `POS` is an integer indicating the 0-based position of the polymorphic site, `REF` is the reference allele, `ALT` is the alternative allele to introduce, and `1/0` (report either 1 or 0, not both) is the haplotype phasing information. All the SNPs assigned to 1 end up on the same sequences, separate from the ones assigned to 0, which produces a diploid-like distribution of variants. (Default is “NO_SNP”.)
- `-n` or `--nreads`: Number of reads to generate for each reference sequence (default is 2000).
- `-ehp` or `--enable_homopolymer_error`: Sets a 30% chance of getting an extra nucleotide around homopolymeric regions.
- `-ese` or `--enable_sequencing_error`: Sets a 5% chance of getting a random single nucleotide variant or insertion, plus a 5% chance of skipping a base (single deletion).

Run `simON_reads.py -h` (or `--help`) to show the help message.
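
To make the `-snp` syntax more concrete, here is a minimal Python sketch of how such a string could be parsed and applied to a reference sequence. It is not the code used by simON-reads: the names `Snp`, `parse_snp_string` and `apply_snps` are hypothetical, and treating an entry without phasing information as belonging to both haplotypes is an assumption made only for this illustration.

```python
from typing import NamedTuple, Optional


class Snp(NamedTuple):
    sample: str           # FASTA header without ">"
    pos: int              # 0-based position of the polymorphic site
    ref: str              # reference allele
    alt: str              # alternative allele
    phase: Optional[int]  # 1 or 0, or None when the entry has no phasing field


def parse_snp_string(snp_string: str) -> list[Snp]:
    """Parse 'SAMPLE:POS:REF>ALT[:PHASE],...' into a list of Snp records."""
    if snp_string == "NO_SNP":  # documented default: no variants requested
        return []
    snps = []
    for entry in snp_string.split(","):
        fields = entry.split(":")
        if len(fields) not in (3, 4):
            raise ValueError(f"Malformed SNP entry: {entry!r}")
        ref, sep, alt = fields[2].partition(">")
        if not (sep and ref and alt):
            raise ValueError(f"Expected REF>ALT in entry: {entry!r}")
        phase = int(fields[3]) if len(fields) == 4 else None
        if phase not in (None, 0, 1):
            raise ValueError(f"Phase must be 1 or 0 in entry: {entry!r}")
        snps.append(Snp(fields[0], int(fields[1]), ref, alt, phase))
    return snps


def apply_snps(sample: str, sequence: str, snps: list[Snp], haplotype: int) -> str:
    """Return the sequence of one haplotype (1 or 0) for the given sample."""
    bases = list(sequence)
    for snp in snps:
        if snp.sample != sample:
            continue
        # Assumption: unphased SNPs (phase None) are applied to both haplotypes.
        if snp.phase is not None and snp.phase != haplotype:
            continue
        if bases[snp.pos] != snp.ref:
            raise ValueError(f"REF allele mismatch at {sample}:{snp.pos}")
        bases[snp.pos] = snp.alt
    return "".join(bases)
```

For example, `apply_snps("S7", "ACGC", parse_snp_string("S7:0:A>G:1,S7:3:C>T:0"), 1)` changes only position 0, while requesting haplotype `0` changes only position 3.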
You will find a test sample of reference sequences in the test folder. To try the script generating only SNPs, run:
```bash
cd ./test
simON_reads.py -i reference.fasta -snp "28S-rRNA:3:G>T,28S-rRNA:41:C>G,S7:0:A>G,S7:63:C>T" -n 1000 > test_SNP.fastq
```
To also simulate sequencing errors and homopolymer errors, run:
```bash
cd ./test
simON_reads.py -i reference.fasta -snp "28S-rRNA:3:G>T,28S-rRNA:41:C>G,S7:0:A>G,S7:63:C>T" -n 1000 --enable_homopolymer_error --enable_sequencing_error > test_SEQERR.fastq
```
Always remember to redirect the output stream to your desired file, unless you want it printed to the standard output of your terminal.
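
The two error flags used in the second command are documented with fixed rates (5% variant or insertion, 5% deletion, 30% extra base at homopolymers). The sketch below shows one way such rates could be applied to a read. It is only an illustration under stated assumptions, not the model implemented in simON-reads: the function names are hypothetical, errors are drawn independently per base, variants and insertions are split evenly, and a homopolymer is taken to be a run of at least three identical bases.

```python
import random

BASES = "ACGT"


def add_sequencing_errors(seq, rng, error_rate=0.05, deletion_rate=0.05):
    """Per base: skip it (deletion) or alter it (substitution or insertion)."""
    out = []
    for base in seq:
        roll = rng.random()
        if roll < deletion_rate:
            continue  # single-base deletion: the base is skipped
        if roll < deletion_rate + error_rate:
            if rng.random() < 0.5:
                # Substitution: replace the base with a different one.
                out.append(rng.choice([b for b in BASES if b != base]))
            else:
                # Insertion: keep the base and add a random one after it.
                out.append(base)
                out.append(rng.choice(BASES))
            continue
        out.append(base)
    return "".join(out)


def add_homopolymer_errors(seq, rng, min_run=3, extra_rate=0.30):
    """Append one extra base to a homopolymer run with probability extra_rate."""
    out = []
    i = 0
    while i < len(seq):
        j = i
        while j < len(seq) and seq[j] == seq[i]:
            j += 1  # extend the run of identical bases
        out.append(seq[i:j])
        if j - i >= min_run and rng.random() < extra_rate:
            out.append(seq[i])  # the extra nucleotide
        i = j
    return "".join(out)
```

Passing a seeded generator such as `random.Random(42)` as `rng` makes a simulated data set reproducible, which is convenient when the reads are used to test a pipeline.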
The script generates artificial DNA sequencing reads based on the provided parameters. It outputs the average quality distribution of the generated reads as a histogram and saves it as `avg_quality_distribution.png` in the current directory.
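
If you want to recompute a similar histogram from a FASTQ file you already have, the following sketch may help. It is a stand-alone illustration rather than the script's own plotting code: the function names are hypothetical, qualities are assumed to use the standard Phred+33 encoding, records are assumed to span exactly four lines, and the average is a plain arithmetic mean of the per-base scores.

```python
def mean_phred(quality_line, offset=33):
    """Arithmetic mean of the Phred scores encoded in one FASTQ quality line."""
    scores = [ord(ch) - offset for ch in quality_line]
    return sum(scores) / len(scores)


def mean_qualities(lines):
    """Yield the mean quality of each record from an iterable of FASTQ lines."""
    for line_number, line in enumerate(lines):
        quality = line.rstrip("\n")
        # In a four-line FASTQ record, the quality string is the fourth line.
        if line_number % 4 == 3 and quality:
            yield mean_phred(quality)


def plot_quality_histogram(means, out_png="avg_quality_distribution.png"):
    """Save a histogram of mean read qualities (requires matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    plt.hist(list(means), bins=30)
    plt.xlabel("Average Phred quality per read")
    plt.ylabel("Number of reads")
    plt.savefig(out_png)
    plt.close()
```

A typical call would be `plot_quality_histogram(mean_qualities(open("test_SNP.fastq")))`, where the file name is the one produced by the first example command.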
`seqs_to_file(genomes_dict, snp_string, nreads)`

This function takes three parameters:
- `genomes_dict`: A dictionary containing reference sequences, where keys are sequence headers and values are the corresponding sequences.
- `snp_string`: A string specifying single nucleotide polymorphisms (SNPs) in the format `SAMPLE:POS:REF>ALT,SAMPLE:POS:REF>ALT,...`.
- `nreads`: The number of artificial reads to generate for each reference sequence.

The function iterates over the reference sequences and generates artificial reads with variations:
- Reverse complementation (`revcomp`), with 5% of the reads being reverse complemented.
- Chimera formation (`chimers`), with 0.5% of the reads being chimeric (merged from two sequences).
- A quality string is generated for each read with the `quality_string` function, and the average quality is recorded.

`quality_string(seq)`

This function generates a random quality string for a given DNA sequence. It assigns ASCII Phred score codes (from 1 to 30) to each nucleotide in the sequence.
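
The behaviour described for these two functions can be approximated with the short sketch below. It is a simplified stand-in, not the actual `seqs_to_file` or `quality_string` code: every name in it is hypothetical, a chimera is assumed to be a plain concatenation of two reference sequences, read identifiers are invented for the example, and quality characters are assumed to be Phred+33 encoded.

```python
import random

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def random_quality_string(length, rng, min_q=1, max_q=30, offset=33):
    """One random Phred score between min_q and max_q per base, as ASCII."""
    return "".join(chr(rng.randint(min_q, max_q) + offset) for _ in range(length))


def simulate_reads(genomes, nreads, rng, revcomp_rate=0.05, chimera_rate=0.005):
    """Yield (read_id, sequence, quality) tuples for each reference sequence."""
    headers = list(genomes)
    for header, seq in genomes.items():
        for i in range(nreads):
            read = seq
            if rng.random() < chimera_rate:
                # Assumption: a chimera joins this reference with a random one.
                read = read + genomes[rng.choice(headers)]
            if rng.random() < revcomp_rate:
                read = reverse_complement(read)
            yield f"{header}_{i}", read, random_quality_string(len(read), rng)


def to_fastq(read_id, seq, qual):
    """Format one read as a four-line FASTQ record."""
    return f"@{read_id}\n{seq}\n+\n{qual}\n"
```

Writing `to_fastq(*read)` for every tuple yielded by `simulate_reads` to standard output would give a stream that can be redirected to a file, in the same way as the script's own output.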
The main execution part of the script utilizes the functions mentioned above:
- It uses `ArgumentParser` to parse command-line arguments, including the input FASTA file (`-i`), SNP string (`-snp`), and the number of reads to generate (`-n`).
- It loads the reference sequences with the `load_data` function.
- It calls the `seqs_to_file` function to generate artificial reads based on the reference sequences, SNPs, and the specified number of reads.
- It plots the average quality distribution with `matplotlib` and saves it as `avg_quality_distribution.png` in the current directory.
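
The first two steps above can be sketched as follows. The flags and defaults mirror the ones documented in this README, but the code itself is only an approximation: `build_parser` and `load_fasta` are hypothetical names, and `load_fasta` is a standard-library stand-in for the BioPython-based `load_data` function, which may treat headers differently.

```python
import argparse


def build_parser():
    """Command-line interface mirroring the options documented in this README."""
    parser = argparse.ArgumentParser(prog="simON_reads.py")
    parser.add_argument("-i", "--infile", required=True,
                        help="input FASTA file with the reference sequence(s)")
    parser.add_argument("-snp", "--single_nucleotide_polymorphism", default="NO_SNP",
                        help="comma-separated SAMPLE:POS:REF>ALT:1/0 entries")
    parser.add_argument("-n", "--nreads", type=int, default=2000,
                        help="reads to generate per reference sequence")
    parser.add_argument("-ese", "--enable_sequencing_error", action="store_true")
    parser.add_argument("-ehp", "--enable_homopolymer_error", action="store_true")
    return parser


def load_fasta(lines):
    """Map each header (without '>') to its sequence; stand-in for load_data."""
    parts = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]  # assumption: the whole header line is the key
            parts[header] = []
        elif header is not None:
            parts[header].append(line)
    return {name: "".join(chunks) for name, chunks in parts.items()}
```

With these helpers, `args = build_parser().parse_args()` followed by `load_fasta(open(args.infile))` yields a dictionary shaped like the `genomes_dict` parameter described earlier.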
Please note that simON-reads is still experimental and may contain errors or produce results that are not fully reliable, so always check them and open an issue whenever you think it is warranted. We’ll get back to you as soon as possible to fix, implement, or enhance whatever you suggest!
The code is released under the GNU General Public License v3.0. As the license provider reports: “Permissions of this strong copyleft license are conditioned on making available complete source code of licensed works and modifications, which include larger works using a licensed work, under the same license. Copyright and license notices must be preserved. Contributors provide an express grant of patent rights”.
If you are using simON-reads for your project, please consider citing the author of this code (Astra Bertelli) and this GitHub repository.