
VirBiCla

A novel ML-based binary classifier to tell viral and non-viral DNA/RNA long reads apart in metagenomic samples.

Table of Contents

  • Model Description
  • Usage
  • Training
  • Testing
  • Use: advice and limitations
  • Customization
  • License and right of usage
  • References

Model Description

VirBiCla is an ensemble voting classifier model created using scikit-learn in Python.
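For readers unfamiliar with this kind of model, here is a minimal sketch of how an ensemble voting classifier is assembled and trained with scikit-learn. The component estimators, hyperparameters and toy features below are illustrative assumptions, not necessarily the ones VirBiCla actually uses.

```python
# Minimal sketch of a scikit-learn voting ensemble; the component models below are
# placeholders, not necessarily the ones VirBiCla combines.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

model = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=200)),
    ],
    voting="soft",  # average the components' predicted class probabilities
)

# Toy data with the same shape as VirBiCla's task: numeric sequence features, "V"/"NV" labels.
X = pd.DataFrame({"GC": [0.41, 0.55, 0.38, 0.60], "entropy": [1.9, 1.7, 2.0, 1.6]})
y = ["V", "NV", "V", "NV"]
model.fit(X, y)
print(model.predict(X))  # array of "V"/"NV" predictions
```

Soft voting averages the class probabilities of the individual estimators; hard voting would instead take a majority vote of their predicted labels.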

Model Components

Ensemble Voting

Model Training

Usage

Once trained, this ensemble classifier can be used to make predictions on new data by calling the predict method.

You can use the model by running the prediction script as follows:

```bash
# make sure to include -p if you wish to get a visualization of the viral/non-viral classification stats
python3 predict.py -i input.fasta -o output.csv -p
```
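If you want to inspect the predictions programmatically, a minimal sketch is shown below; it assumes the output CSV contains one "V"/"NV" call per input read, and the "Prediction" column name is only a guess, so check the header of your actual output.csv.

```python
# Sketch only: summarize VirBiCla's output CSV with pandas.
# "Prediction" is a hypothetical column name; inspect output.csv for the real header.
import pandas as pd

preds = pd.read_csv("output.csv")
print(preds.head())
print(preds["Prediction"].value_counts())  # counts of "V" vs "NV" calls
```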

Training

Data collection

Training data were collected using this script, which is based on the instructions provided by NCBI on how to retrieve BLAST databases.

Training datasets are:

Descriptions are taken from a dedicated BLAST+ tools repository, while the number of sequences was retrieved from the metadata files available on the BLAST FTP website.

Data processing

Data were processed using this script, and plots of the resulting training dataset were generated with this script.

Here’s a step-by-step breakdown of the processing procedure (a minimal code sketch follows the list):

  1. Nucleotide Proportions and GC Content Calculation:
    • The input DNA sequence is analyzed to compute the frequency of each nucleotide (A, T, C, G) in the sequence.
    • Additionally, the GC content of the sequence is calculated, representing the proportion of Guanine (G) and Cytosine (C) bases compared to the total number of bases in the sequence.
  2. Effective Number of Codons (ENC) Computation:
    • The program calculates the Effective Number of Codons (ENC), which provides insights into codon usage bias within the DNA sequence.
    • ENC is a measure of how efficiently codons are used to encode amino acids, with lower values indicating higher codon bias.
  3. Homopolymeric Region Detection:
    • Homopolymeric regions, characterized by consecutive repeats of a single nucleotide (e.g., AAAA), are identified within the DNA sequence.
    • The program computes the percentage of homopolymeric loci for each nucleotide (A, T, C, G) present in the sequence.
  4. Information Entropy Estimation:
    • Information entropy of the DNA sequence is estimated, providing a measure of uncertainty or randomness in the sequence.
    • Higher entropy values suggest greater sequence diversity, while lower values indicate more uniform or repetitive sequences.
  5. Gene Density Calculation:
    • The program estimates gene density based on coding statistics of reading frames derived from the DNA sequence.
    • It computes the ratio of coding regions (sequences encoding proteins) to the total length of the translated sequences, providing insights into the density of potential genes within the sequence.
  6. Integration of Results:
    • Finally, the extracted features and metrics, including nucleotide proportions, GC content, ENC, homopolymeric regions, information entropy, and gene density, are integrated into a structured dictionary.
    • The dictionary encapsulates all relevant information about the processed DNA sequence, facilitating further analysis and interpretation, and it is finally integrated with the viral/non-viral information (“V” or “NV”) associated with each file.
  7. Results storage:
    • The resulting dictionary for each DNA sequence is stored in a list; the dictionaries are then converted into a DataFrame and, eventually, written to a CSV file.
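As a rough illustration of steps 1, 3, 4 and 7 above, here is a minimal sketch of how such features could be computed and gathered into a CSV. It is not the code VirBiCla actually uses (see the linked scripts for that): the homopolymer length threshold, function names and feature names are assumptions, and ENC and gene density are omitted.

```python
# Sketch only (not VirBiCla's actual code): per-sequence nucleotide proportions,
# GC content, homopolymer percentages and Shannon entropy, collected into a CSV.
import math
from collections import Counter

import pandas as pd


def sequence_features(seq: str, label: str) -> dict:
    seq = seq.upper()
    length = len(seq)
    counts = Counter(seq)

    # Step 1: nucleotide proportions and GC content
    features = {base: counts.get(base, 0) / length for base in "ATCG"}
    features["GC"] = (counts.get("G", 0) + counts.get("C", 0)) / length

    # Step 3: percentage of homopolymeric loci per nucleotide
    # (here a run of >= 4 identical bases counts as homopolymeric -- an assumption)
    for base in "ATCG":
        run, homopolymeric = 0, 0
        for char in seq:
            run = run + 1 if char == base else 0
            if run == 4:
                homopolymeric += 4   # the whole run becomes homopolymeric at length 4
            elif run > 4:
                homopolymeric += 1   # every extra base extends the homopolymeric region
        features[f"homo_{base}"] = 100 * homopolymeric / length

    # Step 4: Shannon information entropy of the base composition
    features["entropy"] = -sum(
        (c / length) * math.log2(c / length) for c in counts.values()
    )

    # Step 6: attach the viral/non-viral label ("V" or "NV")
    features["Domain"] = label
    return features


# Step 7: one dictionary per sequence, gathered into a DataFrame and written to CSV
records = [
    sequence_features("ATGCGCGCAAAATTTG", "V"),
    sequence_features("ATATATATGGGGCCTA", "NV"),
]
pd.DataFrame(records).to_csv("features_sketch.csv", index=False)
```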

All the functions employed to process data are defined here.

All the plots are available here.

Training specs

The binary classifier is trained to predict the “Domain” field (in the CSV) from all the other parameters, classifying a sequence as either viral (“V”) or non-viral (“NV”).

Training takes about 40 s on the free Google Colab Python runtime (12 GB RAM).

Testing

Data collection

Three test sets were prepared (test, test1 and test2), and they are composed of:

Data processing

Processing followed the same steps described for training, with this script. Plots were also generated using a similar script.

All the plots are available here.

Tests

You can find the tests in the dedicated notebook.

Overall, the model seems to perform well in telling viral and non-viral DNA sequences apart.
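For reference, the kind of metrics reported in those tests (accuracy and confusion matrices, also discussed in the next section) can be computed with scikit-learn as in the sketch below; the labels used here are made up, not real VirBiCla predictions.

```python
# Sketch: accuracy and confusion matrix for "V"/"NV" calls, on made-up labels.
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = ["V", "V", "NV", "NV", "V", "NV"]   # ground-truth labels
y_pred = ["V", "NV", "NV", "NV", "V", "V"]   # hypothetical model predictions

print(accuracy_score(y_true, y_pred))                        # fraction of correct calls
print(confusion_matrix(y_true, y_pred, labels=["V", "NV"]))  # rows: true, columns: predicted
```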

Use: advice and limitations

A use case for VirBiCla is presented here.

It is based on data downloaded from NCBI SRA, whose accession numbers are retained in the file names, so all the information about the sequencing runs and the related research is easily accessible.

The three samples refer to the Aedes aegypti virome (SRR25317595), the Tuta absoluta microbiome (SRR20766474) and the rRNA gene region of several fungal samples (ERR2576718).

The size-selected virome (>1000 bp) was first combined with the microbiome without any size restriction, and the resulting ca. 17000 reads were classified by VirBiCla, yielding 50% accuracy. The confusion matrix shows that the model correctly predicted almost all the actual viral sequences, whereas it misclassified the bacterial ones.

When the size-selected virome was combined with the size-selected microbiome (>1000 bp and >1500 bp), progressively greater accuracy (53% and 60%) and better confusion matrices were observed (you can find them in the test.stats files in each folder).

When the size-selected virome was combined with the unprocessed mycetome (most of whose sequences were well over 2000 bp), results were definitely better: VirBiCla yielded an almost perfect performance, with 98% accuracy and only a few misclassified sequences.

As a general piece of advice, we suggest using the model “as is” only on long-read metagenomic sequencing samples: as shown above, VirBiCla is really good at differentiating between non-viral and viral sequences in a long-read context (>2000-3000 bp).
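If your reads are not already size-selected, a minimal sketch of a length filter to apply before prediction is shown below; it assumes Biopython is installed, and the 2000 bp threshold is just the lower end of the range suggested above.

```python
# Sketch: keep only reads of >= 2000 bp before running predict.py (threshold is arbitrary).
from Bio import SeqIO

long_reads = (rec for rec in SeqIO.parse("input.fasta", "fasta") if len(rec.seq) >= 2000)
kept = SeqIO.write(long_reads, "input_long.fasta", "fasta")
print(f"Kept {kept} reads of at least 2000 bp")
```

The filtered FASTA (input_long.fasta) can then be passed to predict.py with the -i flag as shown in the Usage section.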

Customization

If you have different data and you wish to customize VirBiCla for your needs, you need to follow these steps:

  1. Modify preprocess_train.py, adding your own fasta files to the preprocessing pipeline. For example, let’s say that we have two files called “virome.fa” and “microbiome.fa”: we should then make these modifications:
```python
MAPPING_DOMAINS = {
    "virome": "V",
    "microbiome": "NV",
}

if __name__ == "__main__":
    csvpath = "viral-vs-nonviral_train.csv"
    hugelist = []
    for fsa in list(MAPPING_DOMAINS.keys()):
        print(f"Loading data from {fsa}…")
        fastafile = f"{fsa}.fa"
```

  2. Modify [model.py](/VirBiCla/scripts/model.py) with your data. For example, let's say that we have saved our training data in a CSV file called "viral-vs-nonviral_train.csv": we should then make these modifications:
```python
import pandas as pd

df_train = pd.read_csv("viral-vs-nonviral_train.csv")
X_train = df_train.iloc[:, 1:]
y_train_rev = df_train["Domain"]
```

License and right of usage

The program is provided under the MIT license (more at LICENSE).

When using VirBiCla, please consider citing the author (Astra Bertelli) and this repository.

References