Pan-genomics! Let’s get real-world here. Graphs are just the start.

I work with human DNA pathological variants and gene therapy. To me, the current pangenome effort and its relatives are mostly an academic exercise that won’t matter. Yes, we know that FASTA is limited, but we can combine FASTA files and look at the differences. I do like what Erik and friends have done, and pangenomics is interesting. But it doesn’t even start to address what really matters. There are other issues that are much more important. For instance:

- Why is it that new, cutting-edge features are ignored for a decade? The EEL materials from the group in Finland have been published for a long time. EEL is designed to look at promoter/enhancer variants. Since promoters and enhancers are the frontier of genome analysis, I have to ask why it has been ignored. This makes no sense at all. https://www.cs.helsinki.fi/u/kpalin/EEL/

- If I sequence one person’s DNA, how long does it take for this to converge on a fairly stable single sequence? I ask because we know that the sequencing technology itself is faulty. Sebastian Cocioba has been working for a while on a Deinococcus. Imagine how long it would take him to apply all of those methods to a human genome. He would be at it for longer than he has years left to live.

- We know that depth of sequencing is a big deal, and flatly, the current standard of 30X coverage is nonsense. You may not even get all of the genes in that sequence.

- Where do the tools look for and account for differences between gene copies on different chromosomes? And where do they count the number of instances of each variant in the exome?

- Sequencing of one person’s DNA is also lousy because of the way we do it. The vast majority of sequencing is done from leukocytes. But leukocytes are pretty specialized cells and they have large sections of hypermutation. They are also the fastest reproducing cell in the body and develop clonal populations pretty fast that differ from each other. The older we are, the more every tissue develops clonal populations that form a mosaic of genomes, and this happens faster in cells that divide faster. For tracking what human genomes really are, we should be exclusively using embryonic cells. Embryonic cells should be our baseline, NOT adult cells, and certainly not PBMCs! So why do we use them? It’s easier to get the material. And also because, well, mostly, nobody thinks about it. We just “did something” and now it’s the way it’s done.

- All of these human genomes are assembled from short reads using now fairly elderly base genomes as templates. Redoing the original human genome using very long read technology will eventually happen. It will show that quite a bit of it has been incorrect. In addition, I expect it to show that as cells go from embryo to adult the genome itself may change more than we imagine.

- We already know that these human genomes are missing pieces. The area near centromeres was never sequenced. Those regions may play a significant role, or they may not. The former director of TIGR used to give talks on why nobody should believe the human genome sequence we have.
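To put rough numbers on the 30X point in the list above, here is a minimal sketch in Python. It assumes per-base depth follows a Poisson distribution around the mean, which is an idealization: real coverage is less even than that, so the counts it prints should be read as optimistic. The 10X calling threshold and the genome size are my assumptions for illustration, not figures from any particular pipeline.

```python
import math


def prob_depth_below(mean_depth: float, min_depth: int) -> float:
    """P(depth < min_depth) when per-base depth is Poisson(mean_depth)."""
    return sum(
        math.exp(-mean_depth) * mean_depth**k / math.factorial(k)
        for k in range(min_depth)
    )


# Approximate haploid human genome length in bases (assumption).
GENOME_SIZE = 3.1e9
# Hypothetical minimum depth wanted at a site before trusting a call.
MIN_DEPTH = 10

for mean_depth in (30, 60):
    p = prob_depth_below(mean_depth, MIN_DEPTH)
    print(f"{mean_depth}X: expected bases under {MIN_DEPTH}X = {p * GENOME_SIZE:,.0f}")
```

Even under this friendly model some bases fall short at 30X; anything that makes coverage uneven only pushes that number up.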

What a real system would look like.

Numeric locus is how we find things now. We could change that to fragment locus perhaps, but that’s really not what you want. How are you going to access it and have a clue where you are in the map? My choice is functional locus centered around proteins. And we could preserve the numeric loci so that they redirect to the functional ones.
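A minimal sketch of what I mean by a protein-centered locus with numeric redirects, in Python. Every name here (`Exon`, `FunctionalLocus`, `LocusIndex`, `PROT_A`, `chrN`) is hypothetical and the coordinates are made up; it only shows the shape of the idea. The lookup is a plain linear scan; something like an interval tree would be needed at genome scale.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Exon:
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


@dataclass
class FunctionalLocus:
    """A locus keyed by the protein it encodes, not by a coordinate."""
    protein_id: str
    exons: list[Exon] = field(default_factory=list)


class LocusIndex:
    def __init__(self, loci: list[FunctionalLocus]) -> None:
        self._by_protein = {locus.protein_id: locus for locus in loci}

    def by_protein(self, protein_id: str) -> FunctionalLocus:
        """Primary access path: ask for a protein, get its exons."""
        return self._by_protein[protein_id]

    def redirect(self, chrom: str, pos: int) -> list[str]:
        """Legacy access path: map a numeric locus to the protein(s) it falls in."""
        return [
            locus.protein_id
            for locus in self._by_protein.values()
            if any(e.chrom == chrom and e.start <= pos <= e.end for e in locus.exons)
        ]


# Made-up example data, for illustration only.
index = LocusIndex([
    FunctionalLocus("PROT_A", [Exon("chrN", 100, 200), Exon("chrN", 400, 450)]),
    FunctionalLocus("PROT_B", [Exon("chrN", 900, 1000)]),
])
print(index.redirect("chrN", 420))  # numeric locus redirects to a protein
print(index.redirect("chrN", 300))  # intronic or unannotated position
```

The point of the design is that the numeric coordinate stays usable as a key, but what you get back is a functional unit rather than a bare position.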

What matters most in the human genome is functionality. If you don’t tie the genome to functional exons mapped to proteins you have nothing really. That’s already something I’ve identified as a problem with current genomes. But you can’t really do that without sequencing the exome together with the chromosomes for the cells you are analyzing. Do we do that? No. We don’t. But it’s crucial.

However, this is more complicated than just sequencing the exome with the genome. Because to get a true picture, we would need to sequence exome and genome together for virtually every tissue in the body. We may need to do it by time of day. And multiple times, or at least deep sequencing.

Large amounts of the genome are not understood. How will we work with those? There are promoters, which are not entirely mapped. And there are enhancers, which are all but unmapped, and that is what the EEL software is all about. And there are enhancers of enhancers. And then there is the 3-dimensional geometry of how chromosomes unpack, which sometimes modifies how genes work, but mostly doesn’t.

Promoters and enhancers are a major part of the genome as well. Mapping those is how engineering of genomes and gene therapy will really progress. Right now, most of gene therapy is like an alien spaceship firing a rice-a-roni factory into Madison Square Garden to feed the homeless. It is, how do you say? Lacking in a certain finesse.

This is why I think that using proteins as anchors to functionality is the way to go. And variants in exons should be classified by whether or not they change the protein. Those that make no difference need to be represented differently. That will give us anchors for adding information on how the promoters and enhancers and, later, networks of regulation operate.
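The classification itself is simple to sketch for single-base changes. The Python below uses the standard genetic code and a toy coding sequence that I made up for illustration; the function name and the category labels are my own choices, and a real tool would also have to handle indels, splice sites, and multiple transcripts.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snv(cds: str, pos: int, alt: str) -> str:
    """Classify a single-base substitution in a coding sequence.

    `cds` is an in-frame coding sequence, `pos` is a 0-based offset into it,
    and `alt` is the substituted base.
    """
    offset = pos % 3
    codon_start = pos - offset
    ref_codon = cds[codon_start:codon_start + 3]
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"  # protein unchanged
    if alt_aa == "*":
        return "nonsense"    # premature stop codon
    if ref_aa == "*":
        return "stop_lost"   # stop codon replaced by an amino acid
    return "missense"        # one amino acid swapped for another


toy_cds = "ATGGCTTGGTAA"  # made-up sequence: Met, Ala, Trp, stop
print(classify_snv(toy_cds, 5, "C"))  # GCT -> GCC
print(classify_snv(toy_cds, 4, "A"))  # GCT -> GAT
print(classify_snv(toy_cds, 8, "A"))  # TGG -> TGA
```

Only the first category leaves the protein alone, so that is the one that would get the different representation.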

Right now, all of it is a rickety rickshaw. We just don’t do sequencing of biopsies of tissues from all over in the same organism. And we don’t use embryos as the foundation for our genomes, which they should be. It astonishes me how that is not obvious. What is the exome and genome for muscle? What is the exome and genome for bone? What is it for thymus? What is it for bladder? What is it for intestine? For brain cortex? For thalamus? For spinal cord? For heat sensing nerves? For touch sensing nerves? On top of that, there is epigenetics. We need to develop a picture of epigenetics in the same way, and we are a long way from that.

So the pictures are nice, and I like to see evidence that a genome is getting close to its asymptote (as with the Pseudomonas example). And representing genomes with layers is fine. But until it is functionally oriented, it doesn’t do me or pretty much anybody except an evolutionary biologist any good.
