Bioinformatics pipelines for whole-genome sequencing (Julien)
Sequencing produces a huge amount of raw data. The sequencer does not output a genome sequence directly, but an encoded file that has to be cleaned and processed to obtain a sequence. A different piece of software is used for each step of this process. Chaining these programs together gives what is called a bioinformatics pipeline.
During my internship, we used whole-genome sequencing (WGS) to sequence bacteria and identify single-nucleotide polymorphisms (SNPs).
A team of bioinformaticians in my lab (a laboratory for food safety) designed a pipeline called ARTwork, which takes reads from an Illumina sequencer in FASTQ format and determines the genetic sequence of the bacteria.
During sequencing preparation, the DNA is sheared into fragments. The sequences of these fragments are called reads. A FASTQ file contains all the reads of the genome, one read per line, with a line of general information above each read and a line of quality scores below it.
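To make the layout of a FASTQ file concrete, here is a minimal Python sketch of a reader, assuming the common four-lines-per-read layout (a header line starting with `@`, the sequence, a separator line starting with `+`, and the quality string). The names `FastqRecord` and `parse_fastq` and the two example reads are made up for illustration; this is not how ARTwork itself reads its input.

```python
import io
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    header: str    # general information about the read (without the leading "@")
    sequence: str  # the bases of the read
    quality: str   # one encoded quality character per base


def parse_fastq(handle) -> Iterator[FastqRecord]:
    """Yield one record per read from a four-lines-per-read FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record near: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ for: " + header)
        yield FastqRecord(header[1:], sequence, quality)


# Two invented reads, only to show the structure of the file.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nII#I\n")
records = list(parse_fastq(example))
for record in records:
    print(record.header, record.sequence, record.quality)
```

Real files are usually large and often gzip-compressed, so in practice an established parser is a safer choice than a hand-written one.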
The quality score (or Phred score) indicates the probability that a given base was called incorrectly by the sequencer. A score of 30 (Q30) means that the probability of error is 1 in 1,000.
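The relationship between a Phred score Q and the error probability P is Q = -10 × log10(P), so P = 10^(-Q/10). The short Python sketch below applies that formula and decodes a quality string, assuming the Phred+33 ASCII encoding (each character's code minus 33 gives the score). The function names are made up for illustration, and the offset should be checked against your own files.

```python
from typing import List


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred score Q into the probability of a wrong base call."""
    return 10 ** (-q / 10)


def decode_quality_string(quality: str, offset: int = 33) -> List[int]:
    """Turn an encoded quality string into Phred scores (assumes Phred+33)."""
    return [ord(char) - offset for char in quality]


scores = decode_quality_string("?I#")  # "?" encodes 30, "I" encodes 40, "#" encodes 2
for q in scores:
    print(q, phred_to_error_probability(q))
```

With this formula, Q20 corresponds to 1 error in 100 and Q40 to 1 error in 10,000.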
The first step of the pipeline is to normalize the data and make sure that poor-quality reads are removed. BBnorm identifies regions where the depth of coverage is below 30x (meaning that fewer than 30 reads are available to determine a nucleotide) or above 100x. MultiQC then searches a given directory for analysis logs and compiles an HTML report.
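To illustrate what depth of coverage means, the toy Python sketch below counts how many reads cover each position and flags positions outside a chosen range. It assumes the position of each read on the genome is already known, which is a simplification for illustration only and not a description of how BBnorm works internally. The function names, read coordinates and the small thresholds in the example are invented; the defaults of 30 and 100 are the ones quoted above.

```python
from typing import List, Tuple


def per_base_depth(genome_length: int, reads: List[Tuple[int, int]]) -> List[int]:
    """Count the reads covering each position; reads are (start, length) pairs."""
    depth = [0] * genome_length
    for start, length in reads:
        for position in range(start, min(start + length, genome_length)):
            depth[position] += 1
    return depth


def flag_positions(depth: List[int], low: int = 30, high: int = 100) -> List[str]:
    """Label each position as 'low', 'high' or 'ok' relative to the thresholds."""
    flags = []
    for d in depth:
        if d < low:
            flags.append("low")
        elif d > high:
            flags.append("high")
        else:
            flags.append("ok")
    return flags


# Toy genome of 10 bases with five invented reads and small thresholds.
toy_reads = [(0, 5), (2, 5), (3, 5), (4, 4), (4, 3)]
toy_depth = per_base_depth(10, toy_reads)
toy_flags = flag_positions(toy_depth, low=2, high=4)
print(toy_depth)
print(toy_flags)
```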
In the second step, Trimmomatic removes the adapters used for sequencing, bases with a quality score below 30, and reads shorter than 36 nucleotides.
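The idea behind quality trimming can be sketched in a few lines of Python. The example below cuts low-quality bases from both ends of a read and discards the read if what remains is too short. It is a simplified illustration using the thresholds mentioned above (30 and 36), not Trimmomatic's actual algorithm, and it leaves out adapter removal. The function name `trim_read` and the example reads are made up.

```python
from typing import List, Optional, Tuple


def trim_read(
    sequence: str,
    qualities: List[int],
    min_quality: int = 30,
    min_length: int = 36,
) -> Optional[Tuple[str, List[int]]]:
    """Trim low-quality ends; return None if the read ends up too short."""
    end = len(sequence)
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1  # drop low-quality bases from the 3' end
    start = 0
    while start < end and qualities[start] < min_quality:
        start += 1  # drop low-quality bases from the 5' end
    if end - start < min_length:
        return None  # the read is discarded
    return sequence[start:end], qualities[start:end]


# Invented read: 40 good bases followed by 5 poor ones.
good_tail = trim_read("A" * 40 + "C" * 5, [35] * 40 + [10] * 5)
# Invented read: good quality but only 20 bases long, so it is dropped.
too_short = trim_read("ACGT" * 5, [35] * 20)
print(good_tail is not None, too_short)
```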
Now that only high-quality reads remain, assembly can be performed.
SPAdes merges overlapping reads into longer fragments called contigs. If there is a SNP among these reads, SPAdes chooses the consensus sequence. Next, a program called Mash determines which reference genome (a closed genome that has already been assembled) is closest to our contigs.
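To give a feeling for what merging overlapping reads means, here is a toy Python sketch that joins two reads when the end of one matches the start of the other. This is only an illustration of the principle: SPAdes itself is based on de Bruijn graphs and is far more sophisticated. The function names, the minimum overlap of 3 and the example reads are invented.

```python
from typing import Optional


def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    longest_possible = min(len(left), len(right))
    for size in range(longest_possible, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Merge two reads into one fragment, or return None if they do not overlap."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]


# The last four bases of the first read match the first four of the second.
contig = merge_reads("ACGTTGCA", "TGCAGGTC")
print(contig)
```

A real assembler also has to cope with sequencing errors, repeats and reads from both strands, which this sketch ignores.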
Using this reference genome, the contigs are joined together into scaffolds. However, some gaps remain. GapCloser fills them using the abundant pair relationships of short reads.
This method is called de novo assembly. It gives a high-resolution sequence, which makes it possible to compare this genome with another at the nucleotide level and see whether two samples come from closely related strains.
However, other methods exist. In my laboratory, once the quality report has been rendered as HTML in step 1, an in-house pipeline called iVarcall2 (an update of iVarcall, see reference 1 below) can be chosen instead of de novo assembly.
The main difference is that this pipeline takes a reference genome directly and maps the reads against it. If the reference genome is very close to the sequenced genome, the result is more accurate than with de novo assembly. However, for some species such as Bacillus, which have a small core genome (the genes present in all individuals of the species), it is difficult to find a suitable reference genome. In these cases, it is better to use de novo assembly.
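Once a sample has been placed against a reference, SNPs are the positions where the two differ. The toy Python sketch below shows that comparison, assuming the sample sequence is already aligned to the reference and has the same length. It does not reproduce what iVarcall2 does (real variant calling works from many mapped reads and their qualities); the function name and the two sequences are invented.

```python
from typing import List, Tuple


def find_snps(reference: str, sample: str) -> List[Tuple[int, str, str]]:
    """Return (1-based position, reference base, sample base) for each mismatch."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned and of equal length")
    snps = []
    for index, (ref_base, sample_base) in enumerate(zip(reference, sample)):
        if "N" in (ref_base, sample_base):
            continue  # skip undetermined bases
        if ref_base != sample_base:
            snps.append((index + 1, ref_base, sample_base))
    return snps


# Invented sequences that differ at a single position.
snps = find_snps("ACGTACGT", "ACGTTCGT")
print(snps)
```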
This is an overview of the pipeline used in the laboratory for food safety. I have not covered other techniques, the details of the pipeline, the quality checks, and so on.
If you want to know more, feel free to ask questions.
1 Felten A, Vila Nova M, Durimel K, Guillier L, Mistou M-Y, Radomski N. First gene-ontology enrichment analysis based on bacterial coregenome variants: insights into adaptations of Salmonella serovars to mammalian- and avian-hosts. BMC Microbiology. 2017;17:222. doi:10.1186/s12866-017-1132-1.