Gene sequence processing involves collecting, reformatting, modelling and analysing large, complex data sets (typically hundreds of millions of records) and presents technical challenges in the choice of approach and technology. This page highlights example projects and the technologies used, primarily in vaccine design and analysis for HIV, COVID-19 (SARS-CoV-2), Respiratory Syncytial Virus (RSV) and the Chikungunya virus.
Processing gene sequence files (e.g. FASTA) often requires bespoke programs (typically Python with Biopython) to reformat and align sequences, clean them up and output the required segments (e.g. proteins or matching antigen regions).
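As an illustration of this kind of bespoke reformatting, the following is a minimal sketch using only the Python standard library (in practice Biopython's SeqIO module would usually do the parsing). The function names, the 1-based inclusive coordinates and the gap character "-" are assumptions made for the example, not details of the projects described here.

```python
from typing import Iterable, Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, sequence) pairs from an open FASTA file."""
    record_id = None
    parts = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if record_id is not None:
                yield record_id, "".join(parts)
            header = line[1:].split()
            # The id is the first word of the header line (empty if missing).
            record_id = header[0] if header else ""
            parts = []
        elif record_id is not None:
            parts.append(line)
    if record_id is not None:
        yield record_id, "".join(parts)


def extract_segment(sequence: str, start: int, end: int) -> str:
    """Return positions start..end (1-based, inclusive), with gaps removed.

    Assumption: "-" marks an alignment gap and output should be upper case.
    """
    return sequence[start - 1:end].replace("-", "").upper()


def write_fasta(records: Iterable[Tuple[str, str]], handle: TextIO,
                width: int = 60) -> None:
    """Write records as FASTA, wrapping sequence lines at `width` characters."""
    for record_id, sequence in records:
        handle.write(f">{record_id}\n")
        for i in range(0, len(sequence), width):
            handle.write(sequence[i:i + width] + "\n")
```

A typical use would be to open an input file, pass each sequence through extract_segment with the coordinates of the region of interest, and write the results to a new FASTA file with write_fasta.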
Modelling with tools such as NetMHCpan can require parallel processing to make full use of the available resources and bring simulation times down from months to days.
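One common way to parallelise an external command-line tool is to split the input into chunks and run several copies of the tool at once. The sketch below shows that pattern with Python's standard library. It does not show a real NetMHCpan invocation: the command is a parameter, and STAND_IN is a hypothetical placeholder that simply prints each peptide with its length. Passing peptides on standard input is also an assumption, so a real tool may need each chunk written to a file instead.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def chunked(items: Sequence[str], size: int) -> List[Sequence[str]]:
    """Split a sequence into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_chunk(command: List[str], peptides: Sequence[str]) -> List[str]:
    """Run one external process on a chunk and return its output lines."""
    result = subprocess.run(
        command,
        input="\n".join(peptides) + "\n",  # assumption: tool reads stdin
        capture_output=True,
        text=True,
        check=True,  # raise if the external process fails
    )
    return result.stdout.splitlines()


def run_parallel(command: List[str], peptides: Sequence[str],
                 chunk_size: int = 1000, workers: int = 4) -> List[str]:
    """Run `command` on all chunks, up to `workers` processes at a time.

    Threads are sufficient here because the heavy work happens in the
    child processes; each thread only waits for its process to finish.
    """
    chunks = chunked(peptides, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input chunks.
        results = pool.map(lambda chunk: run_chunk(command, chunk), chunks)
        return [line for lines in results for line in lines]


# Hypothetical stand-in for the real modelling tool: prints "peptide<TAB>length".
STAND_IN_SCRIPT = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    peptide = line.strip()\n"
    "    if peptide:\n"
    "        print(peptide + '\\t' + str(len(peptide)))\n"
)
STAND_IN = [sys.executable, "-c", STAND_IN_SCRIPT]
```

Calling run_parallel(STAND_IN, peptides, chunk_size=2) exercises the whole pattern without the real tool installed. The chunk size and worker count would be tuned to the number of cores and the memory each process needs.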
Unix tools have proved fastest for post-processing results, as the low-level tools (e.g. grep, sed, awk) are well suited to handling huge volumes of data. Loading results into a SQL database allows for repeated analysis (Excel is limited to about a million rows per worksheet). Python / Pandas is good for loading data into a database and for analysis scripts that are too complex for SQL reports. Python / Biopython can also be useful in sequence analysis, such as matching mutations, and in more complex ranking analysis methods.
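To make the database step concrete, here is a minimal sketch that loads tab-separated results into SQLite and runs a repeatable SQL report. It uses the standard library csv and sqlite3 modules rather than Pandas, to keep the example self-contained. The table name, the three-column layout (sequence_id, peptide, score) and the choice of report are all hypothetical.

```python
import csv
import sqlite3
from typing import List, TextIO, Tuple


def load_results(conn: sqlite3.Connection, handle: TextIO) -> int:
    """Load tab-separated rows into a `results` table; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results "
        "(sequence_id TEXT, peptide TEXT, score REAL)"
    )
    reader = csv.reader(handle, delimiter="\t")
    # Assumption: every valid row has exactly three fields; others are skipped.
    rows = ((r[0], r[1], float(r[2])) for r in reader if len(r) == 3)
    with conn:  # commit on success, roll back on error
        conn.executemany("INSERT INTO results VALUES (?, ?, ?)", rows)
    return conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]


def max_score_per_sequence(conn: sqlite3.Connection) -> List[Tuple[str, float]]:
    """Example report: the highest score recorded for each sequence."""
    return conn.execute(
        "SELECT sequence_id, MAX(score) FROM results "
        "GROUP BY sequence_id ORDER BY sequence_id"
    ).fetchall()
```

Because the rows are streamed through a generator, the input file is never held in memory in full. Once the data is loaded, the same report can be rerun or changed without reprocessing the raw output.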