Data sources and preprocessing

GenSpectrum uses mostly open data from the International Nucleotide Sequence Database Collaboration (INSDC), which consists of GenBank, ENA, and DDBJ.

  • For influenza and RSV, we download the data directly from GenBank using the NCBI Datasets CLI. You can explore the data in our Loculus instance.
  • For SARS-CoV-2, we download the data from Nextstrain, which obtains the data from GenBank and processes it. (For CoV-Spectrum, we have an instance that uses data from GISAID.)
  • For the West Nile virus and Mpox, we use data from Pathoplexus, which also includes data from INSDC.
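
As a rough illustration of the GenBank download step for influenza and RSV, the following minimal sketch assembles an NCBI Datasets CLI call for a virus taxon. It only builds and prints the command rather than running it. The helper name `build_datasets_command`, the taxon string, and the output filename are assumptions made for this example; the exact taxa, filters, and options GenSpectrum uses are not stated here.

```python
import shlex


def build_datasets_command(taxon: str, out_zip: str) -> list[str]:
    """Build (but do not run) an NCBI Datasets CLI call for a virus taxon.

    Hypothetical helper for illustration; not GenSpectrum's actual pipeline.
    """
    return [
        "datasets", "download", "virus", "genome", "taxon", taxon,
        "--filename", out_zip,
    ]


# Assumed example values: the taxon name and the archive filename.
cmd = build_datasets_command("Influenza A virus", "influenza_a.zip")

# Print a shell-safe version of the command for inspection.
print(shlex.join(cmd))
```

To actually perform the download, the list could be passed to `subprocess.run(cmd, check=True)` on a machine where the `datasets` executable is installed.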

We use Nextclade as the main tool for preprocessing the sequences. The following sequences are used as a reference for alignment: