Think of the first draft of the human genome as a book. Published just past the turn of the century, the human genome paved the way for transformative therapeutics. Gene editing and gene therapies now battle previously untreatable diseases. Comparing the A, T, C, and G genetic letters with those of our closest evolutionary cousins is unveiling the roots of our evolution and intelligence.
But what, or who, does “our” refer to?
Due to technological constraints, the current reference genome was assembled from chunks of sequenced DNA from a handful of people, mostly of European and African descent. Although invaluable for hunting down genetic diseases, the “book of humanity” hardly encapsulates the genetic diversity of people around the globe.
A new study published in Nature is taking the first step to broaden its scope. Roughly a decade in the making, the study captured the genomes of 47 people from Asia, Africa, the Americas, and Europe. The herculean effort sequenced a total of 94 genomes, one for each set of chromosomes for each person.
The end result is the first draft of the human “pangenome”—a collection of genetic data from each individual carefully compiled into a single reference. Rather than a book, the new data structure is now a library, capturing the rich genetic history of humans around the world.
“This is like going from black-and-white television to 1080p,” said Dr. Keolu Fox at the University of California, San Diego, who was not involved in the study.
The study is part of the Human Pangenome Reference Consortium (HPRC), an ambitious international project launched in 2019 to capture the diversity of our species in a comprehensive reference dictionary. Far from an academic pursuit, a diverse reference helps scientists home in on genetic links for diseases, regardless of ancestry.
“It’s an exceptional advance… It’s making the picture of human genetic variation more accurate and more complete,” said Dr. Mashaal Sohail at the National Autonomous University of Mexico, who was not involved in the study.
The Quest for Humanity’s Genetic Blueprint
The first draft of the human genome was a triumph. But with eight percent of details missing, it also contained bias.
In genetic studies, scientists often match up patients’ genomes to the reference genome to hunt down disease-causing DNA variants. But similar to checking typos using a dictionary, the process suffers if the dictionary is incomplete, or if it only contains one version of a word’s spelling (American “humor” versus British “humour,” for example).
Without a full, diverse DNA atlas, it’s difficult to decipher genes linked to rare diseases, especially when multiple genes are involved or when the answers are buried inside complex DNA structures unique to a certain population.
Then there’s the problem of diagnosis and therapeutics. Cancer predictors, for example, may not work as well for those of Asian and African heritage, because they were developed using a largely European genomic reference.
Well aware of these hiccups, scientists have been adding to the first draft for decades, with the most recent update, GRCh38, released in 2017. Although it contains DNA from 20 people, the database is dominated by one person, who contributes over 70 percent. Last year, another group released a map that captured virtually the entirety of a human genome, but just one.
Although a “major achievement, no single genome can represent the genetic diversity of our species,” the authors said.
A Genetic Subway Map
The new study is the first step to broadening the scope. The team aggregated DNA sequences from 47 individuals and their parents from all continents except Antarctica. Because each person has two sets of chromosomes, altogether they sequenced 94 genome assemblies.
Due to technological constraints, scientists have long updated the GRCh38 reference with a sort of biological copy-editing: fixing small errors, filling in gaps, or adding new variants. Most new data are short DNA sequences from people that differ from the reference. But their short length makes it difficult to correctly place the data into the reference genome.
Due to these problems, “we may have missed more than 70 percent of structural variants in traditional whole genome-sequencing studies,” wrote the team.
Thanks to an explosion of innovative genetic tools in the past decade, however, it’s now possible to capture longer DNA reads from an individual. Like tackling a 100-piece puzzle instead of one with 1,000 pieces, the longer reads make it far easier to assemble the pieces into a full genomic sequence with accuracy. Altogether, the new study added 119 million base pairs (the basic unit of DNA) to GRCh38’s existing 3.2 billion.
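To see why read length matters, consider a rough sketch in Python. It is not the consortium’s pipeline: the toy reference, the two reads, and the helper `find_all` are all made up for illustration, and real aligners tolerate mismatches rather than demanding exact matches. The point is only that a short read falling inside a repeat fits in more than one place, while a longer read that spans the repeat’s flanks fits in exactly one.

```python
def find_all(reference: str, read: str) -> list[int]:
    """Return every 0-based position where `read` matches `reference` exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Step forward by one so overlapping matches are also found.
        start = reference.find(read, start + 1)
    return positions


# Hypothetical toy reference containing the same 7-letter unit twice.
reference = "TTC" + "GATTACA" + "CC" + "GATTACA" + "AG"

short_read = "GATTACA"      # lies entirely inside the repeated unit
long_read = "CGATTACACC"    # covers the first copy plus its flanking letters

print(find_all(reference, short_read))  # two candidate placements
print(find_all(reference, long_read))   # a single placement
```

Here the short read matches at positions 3 and 12, so there is no telling which copy it came from, while the long read matches only at position 2.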
The next step was to wrangle the humongous dataset into a decipherable atlas.
Here, the team used a clever graph method, analogous to that of a subway map with multiple branches. Shared genetic sequences converge into a single line. At certain “stops” where the genetic sequences differ, they diverge into separate lines. Some may eventually re-converge into another joint line of shared sequences. Overall, the graph makes it relatively easy to tease apart areas of DNA shared across multiple people and capture those unique to each individual.
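The subway-map idea can be sketched in a few lines of Python. This is a toy model under loud assumptions, not the study’s actual graph-building software: the node IDs, DNA segments, and the three “people” are invented, and the sequences are assumed to be already split into shared and differing segments (the hard part that real tools handle by aligning whole assemblies). Each segment is a station, and each person’s genome is a line running through a series of stations.

```python
from collections import defaultdict

# Hypothetical toy data: node IDs mapped to short DNA segments ("stations").
nodes = {1: "ACGT", 2: "G", 3: "C", 4: "TTCA", 5: "GGA"}

# Each person's sequence is a route ("subway line") through the nodes.
paths = {
    "person_A": [1, 2, 4],
    "person_B": [1, 3, 4],
    "person_C": [1, 2, 4, 5],
}


def spell(name):
    """Rebuild one person's sequence by walking their path."""
    return "".join(nodes[n] for n in paths[name])


def carriers():
    """Map each node to the set of people whose path visits it."""
    seen = defaultdict(set)
    for name, route in paths.items():
        for n in route:
            seen[n].add(name)
    return seen


def shared_nodes():
    """Nodes that every person passes through."""
    return sorted(n for n, who in carriers().items() if len(who) == len(paths))


def private_nodes():
    """Nodes visited by exactly one person, mapped to that person."""
    return {n: next(iter(who)) for n, who in sorted(carriers().items()) if len(who) == 1}


def branch_points():
    """Nodes where the lines diverge (more than one outgoing edge)."""
    out = defaultdict(set)
    for route in paths.values():
        for a, b in zip(route, route[1:]):
            out[a].add(b)
    return sorted(n for n, nxt in out.items() if len(nxt) > 1)


print(spell("person_B"))
print(shared_nodes())
print(private_nodes())
print(branch_points())
```

In this toy map, nodes 1 and 4 are shared by everyone, node 1 is the stop where the lines split, and nodes 3 and 5 each belong to a single person. Storing sequences as paths over shared nodes means the common stretches are kept once rather than copied for every individual.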
The end result is the first draft of the human pangenome.
Discovery From Diversity
In a proof of concept, the pangenome proved its worth with two studies that focused on genetic regions previously difficult to explore. Called repetitive DNA regions, these chunks of genetic material are like frustratingly similar puzzle pieces, making it hard to precisely put them into the larger genomic assembly.
Yet they may also hold the key for germline cell engineering and the evolution of the human species. These regions critically underlie a process that helps develop healthy sperm and eggs, but they were previously difficult to study. Using the pangenome, one study found large differences in how these gene segments duplicate and shuffle in order between individuals.
“It is exciting to see accurate characterization of segmental duplications, because duplicated sequences can fuel the evolution of new, specialized roles for a gene,” said Drs. Brian McStay at the University of Galway, Ireland, and Hákon Jónsson at deCODE genetics in Reykjavik, Iceland, who were not involved in the study.
The pangenome may also shed light on genomic “dark matter” not represented in the GRCh38 reference. By capturing a far more diverse genetic landscape, we may be able to find rare but consequential mutations that lead to diseases.
These studies are just a taster of what’s to come. The pangenome is released to scientists as a resource to use in their own studies.
The map is just the first draft. But the team is already looking to expand the dataset, with a goal of reaching 350 people by next year. The consortium is also actively expanding its collaborations to other parts of the world traditionally underrepresented, such as parts of the Middle East and people belonging to marginalized groups.
For study author Dr. Eimear Kenny at the Icahn School of Medicine at Mount Sinai, transparency, privacy, and ethics are key as the project moves forward.
“We recognize that this work is at the forefront of genomic research and has specific features, including open access of data,” she said. “[These details] warrant a great deal of consideration, and that the applications can raise ethical, legal, and social issues.”
Image Credit: Darryl Leja/NHGRI