VirBiCla
A novel ML-based binary classifier to tell viral and non-viral DNA/RNA sequences apart.
Table of Contents
Model Description
VirBiCla is an ensemble voting classifier model created using scikit-learn in python.
Model Components
- Logistic Regression (LR): A linear model that predicts the probability of a categorical dependent variable.
- Random Forest (RF): An ensemble learning method that constructs a multitude of decision trees during training and outputs the mode of the classes.
- Gaussian Naive Bayes (GNB): A simple probabilistic classifier based on applying Bayes’ theorem with strong (naive) independence assumptions.
- Decision Tree (DT): A non-parametric supervised learning method used for classification and regression.
- K-Nearest Neighbors (KNN): A non-parametric method used for classification and regression, where the input consists of the k closest training examples in the feature space.
- Bagging Classifier (BC): An ensemble meta-estimator that fits base classifiers each on random subsets of the original dataset.
- Histogram-Based Gradient Boosting Classifier (HGB): A gradient boosting model that uses histograms to speed up training.
- Extra Trees Classifier (ETC): An ensemble learning method that builds multiple decision trees and merges their predictions.
- Stochastic Gradient Descent (SGD): An optimization algorithm that updates the parameters iteratively to minimize the loss function.
Ensemble Voting
- Voting Method: Soft voting, where the predicted probabilities for each class are averaged and the class with the highest probability is chosen.
- Classifiers: A combination of LR, RF, GNB, DT, KNN, BC, HGB, ETC, and SGD classifiers are combined into a voting ensemble.
Model Training
- Training Data: The model is trained using the datasets you can see here
- Fit Method: The classifier is fit to the training data using the
fit
method.
Usage
Once trained, this ensemble classifier can be used to make predictions on new data by calling the predict
method.
You can use the model by running the prediction script as follows:
# make sure to include -p if you wish to get out a nice visualization of the viral/non-viral classification stats
python3 predict.py -i input.fasta -o output.csv -p
Training
Data collection
Training data were collected using this script, which is based on the instructions provided by NCBI on how to retrieve BLAST databases.
Training datasets are:
- 16S_ribosomal_RNA: 16S ribosomal RNA (Bacteria and Archaea type strains) - 22239 seqs
- 28S_fungal_sequences: 28S ribosomal RNA sequences (LSU) from Fungi type and reference material - 10292 seqs
- SSU_eukaryote_rRNA: Small subunit ribosomal nucleic acid for Eukaryotes - 8784 seqs
- ref_viruses_rep_genomes: Refseq viruses representative genomes - 18688 seqs
Descriptions are taken from a dedicated BLAST+ tools repository, while the number of sequences has been retrieved from the metadata files avauilable on BLAST ftp website.
Data processing
Data were processed using this script and plots of the resulting training dataset were made with the aid of this script.
Here’s a step-by-step breakdown of the processing procedure:
- Nucleotide Proportions and GC Content Calculation:
- The input DNA sequence is analyzed to compute the frequency of each nucleotide (A, T, C, G) in the sequence.
- Additionally, the GC content of the sequence is calculated, representing the proportion of Guanine (G) and Cytosine (C) bases compared to the total number of bases in the sequence.
- Effective Number of Codons (ENC) Computation:
- The program calculates the Effective Number of Codons (ENC), which provides insights into codon usage bias within the DNA sequence.
- ENC is a measure of how efficiently codons are used to encode amino acids, with lower values indicating higher codon bias.
- Homopolymeric Region Detection:
- Homopolymeric regions, characterized by consecutive repeats of a single nucleotide (e.g., AAAA), are identified within the DNA sequence.
- The program computes the percentage of homopolymeric loci for each nucleotide (A, T, C, G) present in the sequence.
- Information Entropy Estimation:
- Information entropy of the DNA sequence is estimated, providing a measure of uncertainty or randomness in the sequence.
- Higher entropy values suggest greater sequence diversity, while lower values indicate more uniform or repetitive sequences.
- Gene Density Calculation:
- The program estimates gene density based on coding statistics of reading frames derived from the DNA sequence.
- It computes the ratio of coding regions (sequences encoding proteins) to the total length of the translated sequences, providing insights into the density of potential genes within the sequence.
- Integration of Results:
- Finally, the extracted features and metrics, including nucleotide proportions, GC content, ENC, homopolymeric regions, information entropy, and gene density, are integrated into a structured dictionary.
- The dictionary encapsulates all relevant information about the processed DNA sequence, facilitating further analysis and interpretation, and it is finally integrated with the viral/non-viral information (“V” or “NV”) associated with each file.
- Results storage:
- The resulting dictionaries for each DNA sequences are stored into a list, converted into a DataFrame and, eventually, merged into a CSV file
All the functions employed to process data are defined here.
All the plots are available here.
Training specs
The binary classificator is trained to predict the “Domain” field (in the CSV) starting from all the other parameters, classifying a sequence as either viral (“V”) or non-viral (“NV”).
Training takes about 40s on Google Colab free python engine (12GB RAM).
Testing
Data collection
Three test sets were prepared (test, test1 and test2), and they are composed by:
- 401 sequences (ranging from 1000 to 8000 bp) from recently submitted viral specimens, collected from NCBI virus, 500 out of 4092 Araneae SSU rRNA sequences collected from SILVA and 500 out of 1982 Tumebacillus SSU rRNA sequences also collected from SILVA.
- 2000 DNA sequences (ranging from 1000 to 10000 bp) from bacteriophages, collected from NCBI virus, 1000 out of 4092 Araneae SSU rRNA sequences collected from SILVA and 1000 out of 1982 Tumebacillus SSU rRNA sequences also collected from SILVA.
- 1072 DNA sequences (ranging from 1000 to 15000 bp) from human viruses, collected from NCBI virus, 600 sequences from LSU_prpkaryote_rRNA and 600 sequences from 28S_fungal_sequences.
Data processing
Processing followed the same steps descripted for training, with this script. Plots were also generated using a similar script
All the plots are available here
Tests
You can find the tests in the dedicated notebook.
-
test
dataset had the following output:
|
|
True postive: 986 |
False positive: 14 |
False negative: 14 |
True negative: 387 |
- Accuracy: 0.9800142755174875
-
test1
dataset had the following output:
|
|
True postive: 1980 |
False positive: 20 |
False negative: 32 |
True negative: 1968 |
-
test2
dataset had the following output:
|
|
True postive: 1181 |
False positive: 19 |
False negative: 1 |
True negative: 1071 |
- Accuracy: 0.9911971830985915
Overall, the models seems to perform well in telling viral and non-viral DNA sequences apart.
Use: advices and limitations
User case for VirBiCla is here.
It is based on data downloaded from NCBI SRA, whose accession numbers are retained in their names, thus all the information about sequencing and related research are easily accesible.
The three samples refer to Aedes aegypti’s virome (SRR25317595), Tuta absoluta’s microbiome (SRR20766474) and rRNA genes region from several fungi samples (ERR2576718).
Size-selected virome (>1000 bp) was first combined with microbiome without any size restraint, and the resulting ca. 17000 reads were predicted by VirBiCla, resulting in 50% accuracy. From the confusion matrix you can easily see that the model predicted correctly almost all the actual viral sequences, whereas it misclassified the bacterial ones.
When size-selected virome was combined with size-selected microbiome (>1000 bp and >1500 bp), progressively greater accuracy (53 and 60%) and better confusion matrices were observed (you can find them in test.stats files in each folder).
When size-selected virome was combined with unprocessed mycetome (which had most of the sequences way longer than 2000 bp), results were definitely better. VirBiCla yielded an almost perfect performance, with 98% accuracy and few misclassfied sequences.
As a general advice, we suggest that the user employes the model “as is” only on long-read sequencing metagenomics samples: as you can see, VirBiCla is really good at differentiating between non-viral and viral sequences when put into a context of long reads (>2000-3000 bp).
Customization
If you have different data and you wish to custom VirBiCla for your needs, you need to follow these steps:
- Modify preprocess_train.py, adding your own fasta files to the preprocessing pipeline. For example, let’s say that we have two files called “virome.fa” and “microbiome.fa”: we should then make these modifications:
```python
MAPPING_DOMAINS = {
“virome”: “V”,
“microbiome”: “NV”,
}
if name == “main”:
csvpath = “viral-vs-nonviral_train.csv”
hugelist = []
for fsa in list(MAPPING_DOMAINS.keys()):
print(f”Loading data from {fsa}…”)
fastafile = f”{fsa}.fa”
2. Modify [model.py](/VirBiCla/scripts/model.py) with your data. For example, let's say that we have saved our training data in a CSV file called "viral-vs-nonviral_train.csv": we should then make these modifications:
```python
df = pd.read_csv("viral-vs-nonviral_train.csv")
X_train = df_train.iloc[:, 1:]
y_train_rev = df_train["Domain"]
License and right of usage
The program is provided under MIT license (more at LICENSE).
When using VirBiCla, please consider citing the author (Astra Bertelli) and this repository.
References