The satellite DNA explorer
An interview with Karen Miga
I’m excited to share my first interview for this newsletter! While the premise is that I typically write highlights of papers and preprints each week, I hope this week’s change of pace is interesting and enjoyable.
Last week, I wrote about the incredible preprint1 from the T2T consortium describing “construction, validation, and initial analysis of the first truly complete human reference genome” which was very exciting news for the field of genomics. In addition to this flagship preprint from the consortium, there was “Segmental duplications and their variation in a complete human genome” led by Mitchell Vollger, and “Epigenetic Patterns in a Complete Human Genome” led by Ariel Gershman.
I decided that all of this groundbreaking work from T2T deserved more attention and discussion, so I had a conversation with Karen Miga, who is one of the co-leads of the consortium along with Adam Phillippy and Evan Eichler.
Karen is an incredible genome biologist. She earned her Master’s degree at Case Western where she was supervised by Evan Eichler, and did her PhD at Duke under the supervision of Hunt Willard.
Since her training, Karen has become an expert in satellite DNA biology as well as an early pioneer in the application of new long-read sequencing technologies. In addition to leading the T2T consortium, she is project director of the Human Pangenome Reference Consortium (HPRC) production center at UCSC.
Just this week, she announced that she will also be opening her own lab at UCSC.
I was first exposed to Karen’s science at a seminar at UW Genome Sciences, and we met while I was a member of the Beliveau Lab. Recently, we sat down to talk about the T2T consortium, long-read sequencing, satellite DNAs, virtual research communities, and the future of genomics.
I was wondering if you could talk a little bit about why you decided to become a scientist, and what initial questions in science/grad school made you interested in the genome not being completed and thinking about the kind of project like the T2T?
I think the place that I decided I wanted to do genomics was when I was at Case Western as a Master’s student in Evan Eichler’s laboratory. This was when the Human Genome Project was first coming out, so I was very much in the mindset of knowing the genome wasn't complete. That was Evan’s soapbox, and I was a part of his team.
Also at that same time I was in an exciting graduate course taught by Hunt Willard, which focused on genome biology. I still remember being shocked that our genomes were so riddled with repeats. This entire world of genome biology -- outside of the strict focus on genes and gene regulation -- really opened to me during grad school in an exciting way. When I joined Hunt's lab for my doctorate training I started to dig into the literature of larger satellite repeats. There was so little known in terms of how these sequences were organized or how they contributed to cellular function. I guess it was during that time I realized that I wanted to study satellite DNAs for the rest of my career.
Even though I was encouraged to broaden my scope for my postdoc, I was stubborn and continued to build a research program around centromere and satellite genomics. I wanted to understand satellite DNA biology in a very general and holistic way and really open these regions up to the genomics community.
My postdoc at UCSC was perfectly placed for integrating long-read data. I was able to drive over the hill and work with PacBio almost as soon as they were starting to issue long reads. I also had early access to data from Oxford Nanopore by being at UCSC, home to some of the earliest nanopore technology development thanks to research from Dave Deamer and Mark Akeson. Overall, I guess I was pretty lucky. I was in the right place at the right time.
Could you give some background of how the T2T became established? Who were the actors involved and how did you decide the technology was ready for this type of project?
The idea to complete a human genome is not something that started with our consortium. There are many players who deserve credit for thinking deeply about that problem. Notably, Evan Eichler, who has been working on completing the genome for years and actually generated the CHM resources used by the T2T. Also, George Church has been a major player in trying to develop new technology to close the human genome.
Adam Phillippy and I were both coming onto the scene after seeing the promise of these new long-read technologies. Adam and his team are really stars in genome assembly methods. Work from his lab, using assembly methods like Canu, really demonstrated the use of these long reads in closing gaps and improving reference assemblies.
Adam and I started working together when we contributed to the Jain et al. 2018 Nature Biotech paper2, which first introduced the ultralong protocol. It was remarkable; we were starting to routinely produce reads greater than 100 kb that were useful in closing some of the remaining gaps in hg38. Also, at that same time, I was leading an effort at Santa Cruz aimed at closing a satellite array, or centromeric gap, on the Y chromosome. I think that those two papers coming out -- demonstrating the feasibility of generating ultra-long reads and the potential to close larger gaps -- served as the signal that the T2T consortium needed to happen.
Of course, we didn't call it T2T at the time. It was a small group trying to identify the best, karyotypically stable CHM cell line and develop protocols to ramp up ultra-long sequencing. Ultimately, for our T2T X paper, we released close to 100 MinION ultra-long runs. It was a real tour de force with the NIH sequencing core at the beginning.
At first, we were aiming to complete one chromosome. The big announcement of the T2T project, in my mind, was at the AGBT conference where Adam stood up and said: "Hey, this is the T2T consortium and we've completed the first human chromosome end-to-end."
In thinking about this effort could you contrast this with the previous technology for the original reference genome and what made it infeasible for this type of project?
In the past we were building genomes using Sanger reads. We built the T2T assembly using HiFi reads, which have base-level accuracy equivalent to Sanger reads. The two key differences are the length of the reads and that Sanger reads required a cloning step. You had to partially digest the sequence and propagate a circular vector in bacteria. Some sequences were fine with cloning, and others were not. As a result there were cloning biases, and certain satellite sequences were not cloned or sequenced as efficiently as the rest of the genome.
Even if you were able to sequence the satellites completely, they were so repetitive that they were very difficult to assemble correctly using short reads. Sanger reads are about 1 kb, which is not long enough to confidently span unique markers, or sites useful for predicting the true linear ordering of the repeats. Illumina reads, although again high in quality, were even shorter and really failed to span these markers. So ultimately, we found that long reads, capable of confidently reporting and spanning unique sites, were critical to resolving these super repeat-rich regions.
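A quick aside from me: the spanning problem Karen describes can be shown with a toy example. This is just an illustration (not any tool the consortium used) with a made-up repeat array: a read shorter than the repeat unit matches many places in a tandem array, while a read long enough to cover a unique variant matches exactly once.

```python
def mapping_positions(read, reference):
    """Return every start position where `read` matches `reference` exactly."""
    return [i for i in range(len(reference) - len(read) + 1)
            if reference[i:i + len(read)] == read]

# A hypothetical satellite array: repeat units, with one "unique marker"
# variant (containing TTT) embedded in the sixth copy.
unit = "ATGCATGCATGC"
array = unit * 5 + "ATGCATTTATGC" + unit * 4

short_read = unit[:8]     # shorter than the repeat period: ambiguous
long_read = array[55:85]  # spans the unique marker: unambiguous

print(len(mapping_positions(short_read, array)))  # many candidate placements
print(mapping_positions(long_read, array))        # → [55], a single placement
```

The short read is consistent with many offsets in the array, so it carries no information about repeat ordering; the long read is anchored by the one sequence that occurs only once.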
Looking back at the genome assembly efforts that were made without long reads is impressive. It was a herculean effort!
In ten years, people will look back at what the T2T consortium did, and also think “that's crazy!” It will be so much easier in the future. Once we can get to high-fidelity megabase reads, this challenge can be greatly simplified, even when we think about extending it to diploid genomes. I've built my career trying to put together maps of satellite DNA, and if technology keeps moving this way, I bet that in 10 years' time my daughter will be sequencing and assembling human satellites for a school science project.
So far we've talked about the importance of the sequencing technology. I'm a programmer, and was really happy about the fact that it was all available on GitHub. This was also a testament to the open nature of T2T. Could you talk about that aspect of the vision for the project?
A lot of the heavy lifting took place during a pandemic. It is a virtual community. Most of us have never met in person. To be honest, there is no grant supporting this either. I am proud of our commitment to open data sharing. As soon as data are sufficiently checked for quality, we are ready to share with the public. The second that we have an assembly that we think is okay, we upload and share openly without restrictions. Admittedly, when we were first releasing our T2T X chromosome, there was a region of the centromere that we were unsure about. We went ahead and made it versioned and live anyway.
It was a gift, because researchers from Evan Eichler's team, like Mitchell Vollger, had flagged a couple of the regions on X for improvement while mapping and studying coverage plots of HiFi data. We also had collaborators like Andrey Bzikadze from Pavel Pevzner's group benchmark against our released X to develop new methods for automated satellite assembly, using centroFlye, and Alla Mikheenko released new methods to evaluate the accuracy of assembled tandem repeats. We were thrilled!
All of a sudden it was not about protecting our resources; we saw it more as building a community and improving our work. I give huge credit to Adam Phillippy and his team for leading by positive example and emphasizing the development of tools, workflows, and datasets that are open, accessible, and reproducible. That is also echoed by our use of preprints. Even though we aren't making definitive claims right now, because we are still waiting for peer review, it was the best way for us to provide an early community release of a complete genome -- one that we thought was ready for public review. If we get feedback, we will welcome it and incorporate it.
That is amazing! And what about that T2T Slack channel? Is that the main place for asynchronous communication?
Yes! Somebody asked me recently if it was like herding cats or if it was a highly efficient ant colony. I didn't think of it that way before, but I do feel like the Slack channel formed into a truly efficient little ant colony. I'm the administrator for our T2T Slack channel, and if it went away I would just be heartbroken. All of our messages, data, and texts. It's almost a historical document to be honest.
On a technical level, I've been trying to process this paper. Could you talk a little bit about the graph assembly process and what the manual curation is like?
When HiCanu came out, we were all shocked at how satellite DNAs and segmental duplications were being assembled without special manual effort. Then, Sergey Nurk went back to the drawing board and developed the conservative graph strategy presented in our paper, which uses components of HiCanu and hifiasm. When Sergey first showed the results of this method, and the earliest graph, to Adam Phillippy, it was clear that we had made a giant leap forward. The chromosomes were well defined into their own connected components. Although only a subset of chromosomes came out as complete T2T predictions, the rest remained broken into just a small number of fragments. Ultimately, though, it was not anywhere close to as crazy as we expected the genome to be. For example, satellites on chromosomes 1, 5, and 19 were expected to be almost identical. I expected the graph to be a complete mess in these regions, and was really surprised to find that they were resolved into separate parts of the T2T graph.
Of course, the devil is in the details. There were parts that did need a lot of work, like chromosome 2 and chromosome 9, that had a ton of segmental duplications and large, recent duplications in the human satellite array that took some real work from our assembly team. Perhaps the crowning achievement goes to resolving the rDNAs and acrocentric short arms. We had a whole assembly team working on methods development and reaching T2T predictions for every chromosome. It still took some manual intervention, although it was a lot less manual than our earlier work completing the X. Of course, much of what was done can be organized into a more automated workflow now. I think that is something that Adam's team is actively working on.
Towards your daughter's future centromere science project!
I know! I've been saying centromere, but she'll probably have the whole genome if we're thinking 10 years out.
So with this graph assembly, what are the validation efforts to see how well it matches the underlying sequence data?
I'm so happy you asked, because that is an important balance of our entire work. I couldn't care less if someone showed me a continuous satellite array, because it could be wrong. Show me your proof that it is right. It is really easy to make mistakes in these highly repetitive regions. Demonstrating that it is accurate, both at the structural and base level, is hard but super important.
Arang Rhie was really the leader of our T2T polishing team, and she thought really deeply about how to evaluate the quality of our T2T genome. For the X, we were doing old-school approaches, like pulsed-field Southerns and ddPCR to estimate copy number, which is a start but is not high resolution enough. Standard assembly assessment protocols study read alignments relative to the assembly, but when you are in a sea of repeats, how do you begin to align with confidence?
There are two major tools: Winnowmap3 from Chirag Jain, and TandemTools from Alla Mikheenko in Pavel Pevzner's group. In both cases, the idea is to capitalize on repeat-aware mapping strategies. With repeat-aware alignments you can begin to systematically study the coverage of these regions and mark any inconsistency. When we started doing that, to be quite honest, it was amazing to me how well the satellite arrays were being assembled.
In terms of quality, our T2T team flags and reports the small number of regions that have unexpected read coverage. We also used Hi-C to demonstrate that the long range connections of the p and q arms of the acrocentrics were correctly assigned. We used Strand-seq to demonstrate that there weren't any unexpected inversions or rearrangements. We also did all the due diligence of comparison to all the genes (in hg38), being extremely careful about any reported frameshifts or things that could have a biological effect.
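Another aside from me: the coverage-based flagging Karen mentions can be sketched in a few lines. This is a hypothetical simplification of the idea, not the T2T pipeline itself: given per-base read coverage from repeat-aware alignments, windows whose coverage deviates strongly from the genome-wide median get flagged for manual review.

```python
from statistics import median

def flag_low_confidence_windows(coverage, window=100, fold=2.0):
    """Return (start, end) windows whose mean coverage differs from the
    genome-wide median by more than `fold` in either direction."""
    expected = median(coverage)
    flagged = []
    for start in range(0, len(coverage) - window + 1, window):
        mean_cov = sum(coverage[start:start + window]) / window
        if mean_cov > fold * expected or mean_cov < expected / fold:
            flagged.append((start, start + window))
    return flagged

# Simulated coverage: ~30x everywhere, with inflated coverage around
# position 300 (as a collapsed repeat would produce, pulling in extra
# reads) and a coverage drop around 700 (as a suspect junction would,
# where reads fail to span).
coverage = [30] * 1000
coverage[300:400] = [90] * 100
coverage[700:800] = [5] * 100

print(flag_low_confidence_windows(coverage))  # → [(300, 400), (700, 800)]
```

The real analyses are far more sophisticated, but the intuition is the same: a correct assembly should attract roughly uniform coverage, so both coverage spikes and dropouts are signals worth inspecting.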
As you mentioned, you originally became interested in this problem because of satellite DNA. With this new reference, what are some of the questions that you are excited about asking about satellite DNA biology?
Our centromere satellite working group would like to formally bridge these new sequences from the T2T assembly with the past literature. Ultimately, we want to set the stage for future researchers in this area. That includes all of the oligos, and the CRISPR guides, and simply a description of the sequences within this incredible resource. We need to pull the field of centromere genomics forward by decades.
In terms of the biology, we are excited that we are finally able to see the place where the kinetochore assembles. Past studies have always credited this relationship to alpha satellite, but now we get to ask whether there is anything about the underlying sequence or the organization of the tandem repeats in that region of the array. We’d like to dig in a little bit there. Ultimately, one genome is not enough. And it will be important to learn how humans vary in these regions and how that influences centromere biology.
Where do I think the world will go after this? I think there are a lot of cool questions, which hopefully the Miga Lab will go forward with as well. Because the function of satellite DNAs is not just defined by the centromere: there are a ton of other proteins that are binding to these regions, they are replicated late, they are spatially distinct, and when you have disrupted heterochromatin there are all kinds of things that can go wrong in a cell. Association with disease based on the genomic organization and content in these regions has not been studied before.
I think there is a whole world of genome biology on every front, from the genes that are in these regions, to the transcripts, to the way that it is regulated in the cell cycle, which will give us real insight into how the human genome works. We are finally living in a world where we can start to pull that into our scientific narrative. Because if right now you open a textbook, satellites might have a paragraph.
It's time for that book chapter! Looking forward, I'm definitely an optimist, which is why I named my newsletter The Century of Biology. What do you think the implications are of the work done by the T2T project moving forward?
I definitely think there will be cool questions about satellite DNA biology, and with the release of these T2T resources we hope researchers around the world are able to answer their own questions too. It is also fun to think of where the technology is going. We have obviously gone through a phase that benefited from only having to resolve a single haplotype, or a single haploid genome. Reaching a more complete view of a single human genome will take routinely finishing diploid genomes. And ultimately, we are aiming to release at least 350 individuals in partnership with the human pangenome work. This is an ambitious yet attainable goal for 5 years, not a century!
I think where we need to be is in a place near where we started this conversation, where in the future reaching T2T genomes is completely routine. Right now our work can be reflected on as a historical moment; no doubt in the future someone will refer to it as, "really crazy that they needed a team of 100 people to do something that a student can do now in their intro biology class in high school."
Once it becomes routine, what's the next question? I would hope that we would be thinking more about T2T single-cell biology. If you were to sample humans at different time points, through different tissue types, for cancer studies, and throughout life, I think that you're going to learn that in the end we're all just collections of ever-changing T2T genomes that can inform our health by being re-sampled and re-analyzed. In order to get there, it really is going to take a massive change in technology.
I think another place that this is going to move in this century is, even though it may sound overused, writing genomes in addition to reading genomes. This idea that we'll have more control over understanding how our genome works and as we start to learn more and more comprehensively what they look like, we'll have a better understanding of how to design and make our own genomes too when it comes to medical and clinical cures. There is a big opportunity in the future to think creatively in that space.
That concludes my conversation with the incredibly inspiring scientist Karen Miga. Clearly, there is a lot on the horizon to be excited about! The future of genomics is bright. The T2T consortium is certainly worth keeping track of, as it is likely that more preprints will be released in the coming weeks and months.
On another important note,
If you’ve enjoyed reading this interview and would be interested in getting a highlight of a new open-access paper in your inbox each Sunday, you should consider subscribing:
I really enjoyed putting together this interview, and may do this again in the future!
In other news, COB will be taking a brief pause as I am taking a small break from science and the Internet at large while I go on a road trip to re-energize. The newsletter will pick back up in July.
That’s all for this week, have a great Sunday! 🧬
It is important to note that this work has not yet been peer reviewed, and the members of the T2T consortium want to hold off on making definitive claims until that is the case.
Nanopore sequencing and assembly of a human genome with ultra-long reads, which was published in the same journal issue as Linear assembly of a human centromere on the Y chromosome.