16 Comments
May 6 · Liked by Elliot Hershberg

Even before the Evo model, genome-scale language models (GenSLMs) showcased the importance of DNA-based representations in capturing the evolutionary trajectories of SARS-CoV-2. The important thing to note is that the model was first developed as a foundation model trained on some of the largest openly available collections of prokaryotic gene sequences. GenSLMs are available at: https://github.com/ramanathanlab/genslm. The paper also provides ideas on scaling such models.

May 6 · Liked by Elliot Hershberg

I love the article btw! It’s really engaging and great to read a summary of all the developments in this area!

author

Thanks for sharing this Arvind, it's really cool work!

The chunking of genomes into "sentences" and passing those into a diffusion model is interesting. Great visualizations too!

May 6 · Liked by Elliot Hershberg

Firstly - your relatable prose + high intelligence is inspiring. I’ve been obsessed with these topics since turning down an opp to fund a series A in a de-novo startup. They went on to huge rounds and I’ve been catching up ever since. You deliver huge value and I can’t wait for the next one.

May 5 · Liked by Elliot Hershberg

Wow, great article (and great explanation of LLM "Attention" concept). AI plus CRISPR technology is really going to revolutionize the medical field. Exciting times we live in!!!

author

Thanks Rick. Exciting times indeed!

May 6 · Liked by Elliot Hershberg

We published a long-context genomic language model called megaDNA (first version last Dec), and it can also be fine-tuned for protein and regulatory element prediction tasks: https://www.biorxiv.org/content/10.1101/2023.12.18.572218v3

author

This is great. Added to my queue to read in more detail. It would be cool to see a similar phylogenetic plot of where the generated bacteriophages (!!) map onto the full trees of similar phages. Thanks Bin.


Thank you, Elliot! The majority of the generated phages are Caudoviricetes (Fig. 2d) -- would be interesting to map them onto the full tree!


This is the best (the easiest to understand) explanation of LLMs I've read. AND it's not boring. Bravo! :)

author

Thanks Katia! Looking under the hood, there’s some really beautiful math and ideas at the heart of these models.

I strongly recommend the 3B1B series for more!

May 6 · Liked by Elliot Hershberg

I agree with Rick, word for word. And with your response as well!

author

Thanks Michael!


So I take it: depending on how high a "de novo" bar we set, we either already have de novo protein design or only a "smart vector search" of CRISPR amino acid clusters (which might be de novo enough for many markets). Sounds like creating a GPT to evaluate clinical and regulatory barriers is the real here-and-now ticket while true de novo design figures itself out!


Beautiful piece, as usual.

I do agree that 3-5 sequences getting close to 100% similarity is too low to be considered de novo design, but I am skeptical that "real" de novo design will be much higher.

An analogy I'm thinking about is houses. At what point would I consider an architect's work de novo? They'll have to use "walls" and "rooms", typically made of the same ~20 materials.

There are only so many ways to build the core building blocks, and I would not be surprised if 4 billion years of evolution have already discovered most of them.


This looks pretty much on-topic -

https://www.nature.com/articles/s41467-024-47120-y
