No epidemic has ever had so many people sequencing so many samples of a virus. There was bound to be a pile-up. “It might seem like it burst onto the public stage. But in the scientific community, this discussion about, ‘How do we talk about it? What nomenclature do we use?’ has been brewing for a while,” says Emma Hodcroft, a molecular epidemiologist at the University of Bern and co-developer of Nextstrain, one of the main efforts to organize viral genetic sequences. “A lot of it does depend on what you’re doing. Are you doing public health intervention or large scale evolution?”

GISAID started in 2008, after researchers around the world expressed some reluctance about putting sequence data from their surveillance of bird flu into public domain databases. Under-resourced scientists didn’t want to drop a new sequence but then get scooped on the analysis by some other researcher with a zillion-dollar lab. And as GISAID got more and more data, the people who ran it had to come up with a way to identify each sequence and put them all into context with one another. Now it’s the main data repository for SARS-CoV-2 genomes.

But the world of Covid nomenclature has two more great and noble houses. Nextstrain, based at the Fred Hutchinson Cancer Research Center and the University of Basel, is one. Its organization revolves around clades, big branches on the virus’s phylogenetic tree. (Nextstrain started out doing the same job for influenza.) Its names have a cheat code: Clades are organized by the year they’re discovered and a letter of the alphabet, and then according to specific mutations of interest. The de Oliveira team’s variant had a bunch of mutations, but the N501Y was important. (The mutation changes an asparagine, abbreviated with the letter N, to a tyrosine, abbreviated with a Y, at the 501st amino acid on the virus’ spike protein. That spot sits in the RBD, or receptor binding domain, which attaches to the human ACE2 receptor; ACE2 stands for angiotensin-converting enzyme 2.)
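For readers who think in code, the shorthand behind a label like N501Y can be unpacked mechanically: one letter for the original amino acid, a position number, one letter for the replacement. What follows is a minimal sketch in Python, not anything Nextstrain itself runs. The names `Substitution`, `parse_substitution`, and `describe` are made up for illustration, and the sketch handles only simple one-for-one substitutions, not deletions or insertions.

```python
import re
from typing import NamedTuple

# Standard one-letter codes for the 20 common amino acids.
AMINO_ACIDS = {
    "A": "alanine", "R": "arginine", "N": "asparagine", "D": "aspartic acid",
    "C": "cysteine", "Q": "glutamine", "E": "glutamic acid", "G": "glycine",
    "H": "histidine", "I": "isoleucine", "L": "leucine", "K": "lysine",
    "M": "methionine", "F": "phenylalanine", "P": "proline", "S": "serine",
    "T": "threonine", "W": "tryptophan", "Y": "tyrosine", "V": "valine",
}


class Substitution(NamedTuple):
    """Hypothetical container for one amino acid substitution."""
    original: str      # one-letter code before the mutation
    position: int      # 1-based amino acid position in the protein
    replacement: str   # one-letter code after the mutation


# Letter, digits, letter: the shape of a label such as N501Y.
_PATTERN = re.compile(r"([A-Z])(\d+)([A-Z])")


def parse_substitution(label: str) -> Substitution:
    """Split a label like 'N501Y' into its three parts."""
    match = _PATTERN.fullmatch(label.strip().upper())
    if match is None:
        raise ValueError(f"not a simple substitution label: {label!r}")
    original, position, replacement = match.groups()
    if original not in AMINO_ACIDS or replacement not in AMINO_ACIDS:
        raise ValueError(f"unknown amino acid code in {label!r}")
    return Substitution(original, int(position), replacement)


def describe(label: str) -> str:
    """Spell out a substitution label in plain English."""
    sub = parse_substitution(label)
    return (f"{AMINO_ACIDS[sub.original]} to {AMINO_ACIDS[sub.replacement]} "
            f"at position {sub.position}")


print(describe("N501Y"))  # asparagine to tyrosine at position 501
```

The label by itself does not say which protein it refers to; here, as in the text, that is the spike protein.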

Easy, right? (Ahem.) But then things got even more complicated. The one the UK researchers were seeing had the same mutation, among many others. To distinguish it from de Oliveira’s, each got a new designation, with “V1” appended to the one from the UK and “V2” to the other. Another similar variant that led back to Manaus, in Brazil, came to be “V3.”

“We’re not trying to name everything. In fact, we’re really explicitly trying not to have more than 10 or 20 names a year, and we’re interested in picking out the most important things,” Hodcroft says. “That’s, like, big changes in the tree. When we see groups that are different in their genetics and they spread, even if it takes a while, in a region or around the world, we give those a Nextstrain clade.”

That’s not what the other bigwig in the space does, though. It’s analytical software called Pangolin—“Phylogenetic Assignment of Named Global Outbreak LINeages.” So-called Pango lineages start with a letter, initially A or B, designating the first two diverging SARS-CoV-2 sequences that emerged from China in late 2019 and early 2020. Each generation gets a number, and its descendants get an additional number, preceded by a period—but only for three generations. Four or more, and the whole lineage gets assigned to a new letter. Imagine an Obed-begat-Jesse-and-Jesse-begat-David vibe, but with diagrams and genomic receipts. “Lineages are operating on a different resolution. You can have very big ones and small ones, but the idea is to capture the emerging edge of the pandemic,” says Áine O’Toole, an evolutionary biologist at the University of Edinburgh who created Pangolin and is now one of its main developers. “The idea is to have a cluster of sequences that is linked to some sort of epidemiological piece of information.”
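The naming rule described above (a letter, then up to three numbers separated by periods, with each number marking one more generation) is simple enough to sketch in Python. This is an illustration of the rule as the article states it, not code from Pangolin. The function names `parse_lineage` and `parent` are hypothetical, and the sketch deliberately skips the lookup table that real Pango tooling would need to map a fresh letter back to the long lineage it stands in for.

```python
import re
from typing import List, Optional, Tuple

# Per the rule above: at most three numeric levels after the letter.
MAX_NUMERIC_LEVELS = 3

# A letter prefix followed by zero or more ".<number>" parts, e.g. B.1.1.7.
_LINEAGE = re.compile(r"([A-Z]+)((?:\.\d+)*)")


def parse_lineage(name: str) -> Tuple[str, List[int]]:
    """Split a lineage name into its letter prefix and numeric levels."""
    match = _LINEAGE.fullmatch(name.strip())
    if match is None:
        raise ValueError(f"not a lineage-style name: {name!r}")
    prefix = match.group(1)
    numbers = [int(part) for part in match.group(2).split(".") if part]
    if len(numbers) > MAX_NUMERIC_LEVELS:
        # A fourth generation would be given a new letter instead.
        raise ValueError(f"{name!r} has more than {MAX_NUMERIC_LEVELS} numeric levels")
    return prefix, numbers


def parent(name: str) -> Optional[str]:
    """Return the lineage one generation up, or None for a bare letter.

    Assumption: a bare letter is treated as the top of its own tree.
    Tracing a newer letter back to its ancestor would need an alias
    table, which this sketch does not include.
    """
    prefix, numbers = parse_lineage(name)
    if not numbers:
        return None
    return ".".join([prefix] + [str(n) for n in numbers[:-1]])


# Walk up the family tree, begat by begat.
name: Optional[str] = "B.1.1.7"
while name is not None:
    print(name)
    name = parent(name)
```

B.1.1.7, used here as the starting point, is the Pango name for the variant first flagged in the UK; the loop prints it followed by B.1.1, B.1, and B.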