Deeply appreciate this article.
Glad you enjoyed.
It all sounds so exciting. But we have built a virtual WHALE that runs on a machine so old that people reading this would fall over laughing. Yet it works, with pure mathematics. And along the way we learned about the huge gaps between what has been observed and what we need to know to create virtual organisms that survive.
It's entirely possible that being severely underfunded forced us to be far more efficient at problem-solving, which calls into question both the funding model and the approaches described here. The dependence of those approaches on data is also their weakness, because the resulting knowledge is purely empirical (as "Metaclesus" points out in their remarks).
You may be on to something with the virtual WHALE...
Single-cell perturbation datasets have come a long way, but one limitation they still have is that they are typically generated in easy-to-grow cells like cancer cell lines or HEK293. The epigenetic context of these cells is often very different from that of more biologically relevant cell types. If a promoter is open in HEK293 but silenced by methylation in something like a neuron, a model trained on HEK293 cells won't correctly predict that gene's expression in a neuron.
Abhi recently had a good tweet about this: "virtual cell datasets being largely in-vitro cancer cell lines has a similar mouthfeel to what led to modern medicine being able to perfectly cure tumors in mice but not humans"
In my particular case, I want to understand how perturbations might affect meiosis. There are only a few human scRNAseq datasets that contain meiotic cells, so I needed to generate my own. ML x Bio is definitely powerful, but having relevant data is key!
Agreed. I think this may be why the models trained across many contexts, such as different species and tissues, have produced more interesting results.
Hani had an interesting post about this as well: "I personally think the many contexts that Tahoe offers is crucial here. At the moment, given the same number of cells, I take more contexts over more perturbations."
Brilliant! Thank you.
Thank you!
Wow! Great article, thanks.
Hey Elliot, this is such an inspiring article, thank you!
Unfortunately, I recently came across a WeChat public account blog that translated your article into Chinese. As of 12/14, they had published it as their own “original” content without acknowledging you as the author.
I tried to report this plagiarism to the platform administrators, but the report failed. I thought you might want to be aware of it as well.
Here is the link: https://mp.weixin.qq.com/s/aw88KP8YqBzzLShYkCEtEg
Wonderfully written article, and it is easy to understand even for people coming from a pure CS/AI background.