Furthermore, the other key feature thought to be instrumental in the ability of SARS-CoV-2 to infect humans, a polybasic cleavage site insertion in the S protein, has not yet been seen in another close bat relative of the SARS-CoV-2 virus. The Pango dynamic nomenclature is a popular system for classifying and naming genetically distinct lineages of SARS-CoV-2, including variants of concern, and is based on the analysis of complete or near-complete virus genomes. Breakpoint-free regions (BFRs) were concatenated if no phylogenetic incongruence signal could be identified between them. The means of the rate priors are based on the mean rates estimated for MERS-CoV and HCoV-OC43, respectively, while the standard deviations are set ten times higher than empirical values to allow greater prior uncertainty and avoid strong bias (Extended Data Fig.). Figure caption (panels c and d): decimal years are shown on the x axis for the 1.2 years of SARS sampling in panel c. Panel d shows mean evolutionary rate estimates plotted against sampling time range for the same three datasets (represented by the same colour as the data points in their respective root-to-tip (RtT) divergence plots), as well as for the comparable NRA3 using the two different priors for the rate in the Bayesian inference (red points). collected SARS-CoV data and assisted in analyses of SARS-CoV and SARS-CoV-2 data. SARS-CoV-2 itself is not a recombinant of any sarbecoviruses detected to date, and its receptor-binding motif, important for specificity to human ACE2 receptors, appears to be an ancestral trait shared with bat viruses and not one acquired recently via recombination. While pangolins could be acting as intermediate hosts for bat viruses to get into humans (they develop severe respiratory disease (ref. 38) and commonly come into contact with people through trafficking), there is no evidence that pangolin infection is a requirement for bat viruses to cross into humans.
Of importance for future spillover events is the appreciation that SARS-CoV-2 has emerged from the same horseshoe bat subgenus that harbours SARS-like coronaviruses. Unlike other viruses that have emerged in the past two decades, coronaviruses are highly recombinogenic (refs. 14, 15, 16). We compiled a set of 69 SARS-CoV genomes, including 58 sampled from humans and 11 sampled from civets and raccoon dogs. acknowledges support by the Research Foundation - Flanders (Fonds voor Wetenschappelijk Onderzoek - Vlaanderen (nos. In early January, the aetiological agent of the pneumonia cases was found to be a coronavirus (ref. 3), subsequently named SARS-CoV-2 by an International Committee on Taxonomy of Viruses (ICTV) Study Group (ref. 4) and also named hCoV-19 by Wu et al. (ref. 5). While other mammalian species, specifically pangolins for SARS-CoV-2, are plausible conduits for transmission to humans, there is no evidence that pangolins are facilitating adaptation to humans. Results and discussion. Genomic surveillance has been a hallmark of the COVID-19 pandemic that, in contrast to other pandemics, achieves tracking of the virus evolution and spread worldwide almost in real time (4). The coverage threshold and consensus sequence generation threshold were set to 20 and 90, respectively. From this perspective, it may be useful to perform surveillance for viruses more closely related to SARS-CoV-2 along the gradient from Yunnan to Hubei. Developed by the Centre for Genomic Pathogen Surveillance. performed codon usage analysis. Uncertainty measures are shown in Extended Data Fig. Pangolin-CoV is 91.02% and 90.55% identical to SARS-CoV-2 and BatCoV RaTG13, respectively, at the whole-genome level.
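Whole-genome identity figures such as the 91.02% and 90.55% quoted above are usually read as the share of matching positions in a pairwise alignment. The following is a minimal sketch of that calculation, assuming the two sequences are already aligned to equal length and that any column containing a gap is skipped. The function name `percent_identity` and the toy sequences are illustrative only and do not come from the source, and the original analysis may have treated gaps or ambiguous bases differently.

```python
def percent_identity(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue  # assumption: gapped columns are excluded from the count
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Toy aligned sequences (hypothetical, not real viral data):
# 9 ungapped columns, 8 of which match.
toy_a = "ATGCTAGC-A"
toy_b = "ATGTTAGCGA"
identity = percent_identity(toy_a, toy_b)
print(f"{identity:.2f}% identical")
```

Skipping gapped columns is one common convention; counting gaps as mismatches would give a lower figure, so reported identities are only comparable when the convention is the same.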
To avoid artefacts due to recombination, we focused on NRR1 and NRR2 and the recombination-masked alignment NRA3 to infer time-measured evolutionary histories. We thank T. Bedford for providing M.F.B. The web application was developed by the Centre for Genomic Pathogen Surveillance. For the HCoV-OC43, MERS-CoV and SARS datasets, we specified flexible skygrid coalescent tree priors. Phylogenetic incongruence (PI) signals were identified (with bootstrap support >80%) for seven of these eight breakpoints: positions 1,684, 3,046, 9,237, 11,885, 21,753, 22,773 and 24,628. Zhou et al. (ref. 2) concluded from the genetic proximity of SARS-CoV-2 to RaTG13 that a bat origin for the current COVID-19 outbreak is probable. This dataset comprises an updated version of that used in Hon et al. (ref. 15) and includes a cluster of genomes sampled in late 2003 and early 2004, but the evolutionary rate estimate without this cluster (0.00175 substitutions per site per year (0.00117, 0.00229)) is consistent with that for the complete dataset (0.00169 substitutions per site per year (0.00131, 0.00205)). In addition, sequences NC_014470 (Bulgaria 2008), CoVZXC21, CoVZC45 and DQ412042 (Hubei-Yichang) needed to be removed to maintain a clean non-recombinant signal in region A. performed S recombination analysis. Concurrent evidence also proposed pangolins as a potential intermediate species for SARS-CoV-2 emergence and suggested them as a potential reservoir species (refs. 11, 12, 13). To examine temporal signal in the sequenced data, we plotted root-to-tip divergence against sampling time using TempEst (ref. 39) v.1.5.3 based on a maximum likelihood tree. Given that these pangolin viruses are ancestral to the progenitor of the RaTG13/SARS-CoV-2 lineage, it is more likely that they are also acquiring viruses from bats.
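The root-to-tip check mentioned above comes down to regressing each tip's divergence from the root on its sampling date: a clearly positive slope suggests temporal signal and approximates the evolutionary rate, and the x-intercept approximates the date of the root. The sketch below uses ordinary least squares to illustrate the idea. It is not a reimplementation of TempEst, the function name is hypothetical, and the dates and divergences are synthetic values generated from an assumed rate of 0.002 substitutions per site per year, not data from the study.

```python
def root_to_tip_regression(dates, divergences):
    """Least-squares fit of root-to-tip divergence against sampling date.

    Returns (slope, intercept, r_squared, root_date). The slope estimates the
    rate in substitutions per site per year; root_date is the x-intercept, or
    None when the slope is not positive (no usable temporal signal).
    """
    n = len(dates)
    if n != len(divergences) or n < 2:
        raise ValueError("need at least two (date, divergence) pairs")
    mean_x = sum(dates) / n
    mean_y = sum(divergences) / n
    sxx = sum((x - mean_x) ** 2 for x in dates)
    syy = sum((y - mean_y) ** 2 for y in divergences)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dates, divergences))
    if sxx == 0:
        raise ValueError("all sampling dates are identical")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    root_date = -intercept / slope if slope > 0 else None
    return slope, intercept, r_squared, root_date


# Synthetic example: tips sampled over one year, diverging from a root
# placed at 2002.5 under an assumed clock rate (illustrative values only).
sample_dates = [2003.0, 2003.25, 2003.5, 2003.75, 2004.0]
sample_divergences = [0.002 * (t - 2002.5) for t in sample_dates]
slope, intercept, r_squared, root_date = root_to_tip_regression(
    sample_dates, sample_divergences
)
print(f"rate ~ {slope:.5f} subs/site/yr, root date ~ {root_date:.2f}")
```

With real sarbecovirus data the slope would be near zero or poorly supported, which is the situation in which the text falls back on informative rate priors.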
Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible for the COVID-19 pandemic. Nat Microbiol 5, 1408–1417 (2020). https://doi.org/10.1038/s41564-020-0771-4. To estimate non-synonymous over synonymous rate ratios for the concatenated coding genes, we used the empirical Bayes "renaissance counting" procedure (ref. 67). Region B showed no PI signals within the region, except one including sequence SC2018 (Sichuan), and thus this sequence was also removed from the set. SARS-CoV-2 and RaTG13 are also exceptions because they were sampled from Hubei and Yunnan, respectively. SARS-CoV-2 and a coronavirus found in pangolins share about 90% sequence identity. G066215N, G0D5117N and G0B9317N)) and by the European Union's Horizon 2020 project MOOD (no. The variable-loop region in SARS-CoV-2 shows closer identity to the 2019 pangolin coronavirus sequence than to the RaTG13 bat virus, supported by phylogenetic inference (Fig.). To begin characterizing any ancestral relationships for SARS-CoV-2, non-recombinant regions (NRRs) of the genome must be identified so that reliable phylogenetic reconstruction and dating can be performed. Using these breakpoints, the longest putative non-recombining segment (nt 11,885–21,753) is 9.9 kb long, and we call this region NRR2. We thank all authors who have kindly deposited and shared genome data on GISAID. Scientists defined the Pango lineage of this variant to be B.1.1.523, and it was originally recognized as a variant under monitoring on July 14, 2021.
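The NRR2 definition can be sanity-checked against the seven supported breakpoint positions reported for the PI analysis (1,684, 3,046, 9,237, 11,885, 21,753, 22,773 and 24,628): the longest interval between consecutive breakpoints runs from 11,885 to 21,753, which is 9,868 nt, or about 9.9 kb. The sketch below does only that arithmetic. The function name is hypothetical, and it deliberately ignores the segments between the genome ends and the outermost breakpoints, because the alignment length is not stated in this text.

```python
def longest_interval(breakpoints):
    """Return (start, end) of the longest gap between consecutive breakpoints."""
    ordered = sorted(breakpoints)
    if len(ordered) < 2:
        raise ValueError("need at least two breakpoints")
    # Pair each breakpoint with the next one and keep the widest pair.
    intervals = zip(ordered, ordered[1:])
    return max(intervals, key=lambda pair: pair[1] - pair[0])


# Breakpoint positions with bootstrap support >80%, as listed in the text.
supported_breakpoints = [1684, 3046, 9237, 11885, 21753, 22773, 24628]

start, end = longest_interval(supported_breakpoints)
length_kb = (end - start) / 1000
print(f"longest interval: nt {start:,} to {end:,} ({length_kb:.1f} kb)")
```

This only confirms that the stated 9.9 kb length is consistent with the listed breakpoints; it says nothing about whether the segment is truly free of recombination, which the authors assessed separately.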
Using both prior distributions, this results in six highly similar posterior rate estimates for NRR1, NRR2 and NRA3, centred around 0.00055 substitutions per site per year. Its origin and direct ancestral viruses have not been ... These are in general agreement with estimates using NRR2 and NRA3, which result in divergence times of 1982 (1948–2009) and 1948 (1879–1999), respectively, for SARS-CoV-2, and estimates of 1952 (1906–1989) and 1970 (1932–1996), respectively, for the divergence time of SARS-CoV from its closest known bat relative. performed recombination analysis for non-recombining regions 1 and 2, breakpoint analysis and phylogenetic inference on recombinant segments. ... 36) gives a putative recombination-free alignment that we call non-recombinant alignment 3 (NRA3) (see Methods). It compares the new genome against the large, diverse population of sequenced strains using a ... Even before the COVID-19 pandemic, pangolins had been making headlines. However, the coronavirus isolated from pangolin is 99% similar in a specific region of the S protein, which corresponds to the 74 amino acids involved in the ACE (angiotensin-converting enzyme ... Nevertheless, the viral population is largely spatially structured according to provinces in the south and southeast on one lineage, and provinces in the centre, east and northeast on another (Fig.). This long divergence period suggests there are unsampled virus lineages circulating in horseshoe bats that have zoonotic potential due to the ancestral position of the human-adapted contact residues in the SARS-CoV-2 RBD.
Two exceptions can be seen in the relatively close relationship of Hong Kong viruses to those from Zhejiang Province (with two of the latter, CoVZC45 and CoVZXC21, identified as recombinants) and a recombinant virus from Sichuan for which part of the genome (region B of SC2018 in Fig. ... We thank A. Chan and A. Irving for helpful comments on the manuscript. Identifying the origins of an emerging pathogen can be critical during the early stages of an outbreak, because it may allow for containment measures to be precisely targeted at a stage when the number of daily new infections is still low. Accurate estimation of ages for deeper nodes would require adequate accommodation of time-dependent rate variation. Combining regions A, B and C and removing the five named sequences gives us putative NRR1, as an alignment of 63 sequences. All custom code used in the manuscript is available at https://github.com/plemey/SARSCoV2origins. This genotype pattern led to the creation of a new Pango lineage named B.1.640.2, a phylogenetic sister group to the old B.1.640 lineage, which was renamed B.1.640.1. Humans' selfish, speciesist treatment of these animals could be the very reason why the novel coronavirus exists.
# File containing the ID of the samples, the Sequence of the haplotype, the Continent, the country, the Region, the Data, the Lineage of Pangolin and Nextstrain clade, and the haplotype number
# In this order
# Could be obtained from the database
For coronaviruses, however, recombination means that small genomic subregions can have independent origins, identifiable if sufficient sampling has been done in the animal reservoirs that support the endemic circulation, co-infection and recombination that appear to be common. Due to the absence of temporal signal in the sarbecovirus datasets, we used informative prior distributions on the evolutionary rate to estimate divergence dates. Despite the SARS-CoV-2 lineage's acquisition of residues in its Spike (S) protein's receptor-binding domain (RBD) permitting the use of human ACE2 (ref. ... 95% credible interval bars are shown for all internal node ages. wrote the first draft of the manuscript, and all authors contributed to manuscript editing. Divergence dates between SARS-CoV-2 and the bat sarbecovirus reservoir were estimated as 1948 (95% highest posterior density (HPD): 1879–1999), 1969 (95% HPD: 1930–2000) and 1982 (95% HPD: 1948–2009), indicating that the lineage giving rise to SARS-CoV-2 has been circulating unnoticed in bats for decades. Using the most conservative approach (NRR1), the divergence time estimate for SARS-CoV-2 and RaTG13 is 1969 (95% HPD: 1930–2000), while that between SARS-CoV and its most closely related bat sequence is 1962 (95% HPD: 1932–1988); see Fig. Pink, green and orange bars show BFRs, with region A (nt 13,291–19,628) showing two trimmed segments yielding region A′ (nt 13,291–14,932, 15,405–17,162, 18,009–19,628).
There are outstanding evolutionary questions on the recent emergence of human coronavirus SARS-CoV-2, including the role of reservoir species, the role of recombination and its time of divergence from animal viruses. We infer time-measured evolutionary histories using a Bayesian phylogenetic approach while incorporating rate priors based on mean MERS-CoV and HCoV-OC43 rates and with standard deviations that allow for more uncertainty than the empirical estimates for both viruses (see Methods).