Firstly - your relatable prose + high intelligence is inspiring. I’ve been obsessed with these topics since turning down an opp to fund a series A in a de-novo startup. They went on to huge rounds and I’ve been catching up ever since. You deliver huge value and I can’t wait for the next one.
Even before the Evo model, genome-scale language models (GenSLMs) showcased the importance of DNA-based representations in capturing the evolutionary trajectories of SARS-CoV-2. The important thing to note is that the model was first developed as a foundation model trained on some of the largest openly available collections of prokaryotic gene sequences. GenSLMs are available at: https://github.com/ramanathanlab/genslm. The paper also provides ideas on the scaling of such models.
I love the article btw! It’s really engaging and great to read a summary of all the developments in this area!
Thanks for sharing this Arvind, it's really cool work!
The chunking of genomes into "sentences" and passing those into a diffusion model is interesting. Great visualizations too!
Wow, great article (and great explanation of LLM "Attention" concept). AI plus CRISPR technology is really going to revolutionize the medical field. Exciting times we live in!!!
Thanks Rick. Exciting times indeed!
We published a long-context genomic language model called megaDNA (first version last Dec) and it can also be fine-tuned for protein / regulatory elements prediction tasks: https://www.biorxiv.org/content/10.1101/2023.12.18.572218v3
This is great. Added to my queue to read in more detail. It would be cool to see a similar phylogenetic plot of where the generated bacteriophages (!!) map onto the full trees of similar phages. Thanks Bin.
Thank you, Elliot! The majority of the generated phages are Caudoviricetes (Fig. 2d) -- would be interesting to map them onto the full tree!
This is the best (the easiest to understand) explanation of LLMs I've read. AND it's not boring. Bravo! :)
Thanks Katia! Looking under the hood, there’s some really beautiful math and ideas at the heart of these models.
I strongly recommend the 3B1B series for more!
I agree with Rick, word for word. And with your response as well!
Thanks Michael!
So I take it: depending on how high a 'de novo' bar we set, we either already have de novo protein design or only a 'smart vector search' of CRISPR amino acid clusters (which might be de novo enough for many markets). Sounds like creating a GPT to evaluate clinical and regulatory barriers is the real here-and-now ticket while true de novo design figures itself out!
Beautiful piece, as usual.
I do agree that 3-5 sequences getting close to 100% similarity is too low to be considered de novo design, but I am skeptical that "real" de novo design will be much higher.
An analogy I'm thinking about is houses. At what point would I consider an architect's work de novo? They'll have to use "walls" and "rooms", typically made of the same ~20 materials.
There are only so many ways to build the core building blocks, and I would not be surprised if 4 billion years of evolution have already discovered most of those.
This looks pretty much on-topic:
https://www.nature.com/articles/s41467-024-47120-y