I realized after asking that my default prompt makes ChatGPT really verbose, so I changed the prompt to:
Identify types of human cells using the following marker genes. Identify one cell type for each row. Only provide the cell type name and no other commentary.
And it gave me:
For row 9, it's interesting that if I let it give commentary, it says:
CASC3, PGAP1, SLC6A16, CNTNAP4, NPHP1 - This set of genes does not point to a well-defined cell type but could suggest Neuronal Cells or specific types of Neural Precursors based on the presence of neural development and function genes.
Biologists (including myself) often need to identify types of cells based on their gene expression. For example, if I’m differentiating stem cells to make an ovarian organoid, and I perform single-cell RNA sequencing, I might want to check the data to see which ovarian cell types are present.
Today, a Nature Methods paper reported good results from giving GPT-4 a list of cell-specific genes and asking it to identify the cell type. This seems interesting, and also quite easy to check for myself to see if it actually works.
I don’t pay for access to GPT-4, but I gave ChatGPT a test using the prompt from the Nature Methods paper, with the following cell markers:
Identify types of human cells using the following marker genes. Identify one cell type for each row. Only provide the cell type name.
SOX17, POU5F1, NANOS3, PRDM1, NANOG, CD38
POU5F1, SOX2, KLF4, ITGA6, NANOG
SOX17, FOXA2, CXCR4, GATA4
FOXL2, AMHR2, CD82, NR5A1, FSHR, GATA4
ZP3, DPPA3, DDX4, NPM2, ZP2
FOXL2, FSHB, NR5A1, PITX1, GNRHR
STK31, ZBTB16, DDX4, SSEA4, NANOS2
NR2F2, CYP17A1, STAR, LHCGR, GLI1, HSD3B
CASC3, PGAP1, SLC6A16, CNTNAP4, NPHP1
SYCP1, TEX12, REC8, SPO11, SYCP3
NR5A1, SOX9, FSHR, GATA4
OTX2, SOX1, TUBB3, PAX6
Overall score: 7.5 / 12
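A fractional total like 7.5 / 12 presumably comes from giving half credit to answers that are only partly right. Here is a minimal sketch of that tally. The function name and the 1 / 0.5 / 0 grading scale are assumptions for illustration, and the grades in the example are made up rather than taken from the test above.

```python
def overall_score(grades):
    """Sum per-row grades: 1 = correct, 0.5 = partially correct, 0 = wrong."""
    allowed = {0, 0.5, 1}
    for g in grades:
        if g not in allowed:
            raise ValueError(f"unexpected grade: {g!r}")
    return sum(grades), len(grades)


# Hypothetical grades for three rows, not the results from this post.
points, total = overall_score([1, 0.5, 0])
print(f"Overall score: {points:g} / {total}")
```

Grading each row still takes a biologist's judgment; the code only adds up the points.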
In the first test, ChatGPT got the random genes completely wrong. Let’s prompt it to say that it’s uncertain when it doesn’t actually know the cell type.
Identify types of human cells using the following marker genes. Identify one cell type for each row. Only provide the cell type name. If you are uncertain, respond "unknown" instead of providing a cell type name.
SOX17, POU5F1, NANOS3, PRDM1, NANOG, CD38
POU5F1, SOX2, KLF4, ITGA6, NANOG
SOX17, FOXA2, CXCR4, GATA4
FOXL2, AMHR2, CD82, NR5A1, FSHR, GATA4
ZP3, DPPA3, DDX4, NPM2, ZP2
FOXL2, FSHB, NR5A1, PITX1, GNRHR
STK31, ZBTB16, DDX4, SSEA4, NANOS2
NR2F2, CYP17A1, STAR, LHCGR, GLI1, HSD3B
CASC3, PGAP1, SLC6A16, CNTNAP4, NPHP1
SYCP1, TEX12, REC8, SPO11, SYCP3
NR5A1, SOX9, FSHR, GATA4
OTX2, SOX1, TUBB3, PAX6
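For anyone who wants to rerun this with their own markers, the prompt above is easy to assemble in code. This is only a sketch: `build_prompt` and the `allow_unknown` flag are hypothetical names, and it just produces the text to paste into ChatGPT rather than calling any API.

```python
BASE_PROMPT = (
    "Identify types of human cells using the following marker genes. "
    "Identify one cell type for each row. Only provide the cell type name."
)
UNKNOWN_CLAUSE = (
    ' If you are uncertain, respond "unknown" instead of providing '
    "a cell type name."
)


def build_prompt(marker_rows, allow_unknown=True):
    """Return the instruction line followed by one comma-separated row per gene list."""
    instruction = BASE_PROMPT + (UNKNOWN_CLAUSE if allow_unknown else "")
    rows = [", ".join(genes) for genes in marker_rows]
    return "\n".join([instruction] + rows)


# Two of the rows from the test above, as an example.
marker_rows = [
    ["SOX17", "POU5F1", "NANOS3", "PRDM1", "NANOG", "CD38"],
    ["CASC3", "PGAP1", "SLC6A16", "CNTNAP4", "NPHP1"],
]
print(build_prompt(marker_rows))
```

Passing `allow_unknown=False` gives the original prompt from the first test, so the two conditions can be compared on the same gene lists.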
This time it performs better (I’d give it 9.5/12), but it’s still tricked by random genes, and it still can’t recognize primordial germ cells. To check the random-gene problem on its own, I next gave it five lists of completely random genes:
Identify types of human cells using the following marker genes. Identify one cell type for each row. Only provide the cell type name. If you are uncertain, respond "unknown" instead of providing a cell type name.
CASC3, PGAP1, SLC6A16, CNTNAP4, NPHP1
IL9, SLC30A4, SX18P8, CHRDL2, SMUG1P1
HCST, EXOSC8, ORC3, CIDECP2, DNM2
DTL, U3, DDX28, WDFY3, PPP1R2P4
LTK, STK32C, SMIM9, DPPA3P10, MTCO1P12
This time, ChatGPT just responded “unknown” for everything. Very good! Without the prompt to respond “unknown”, ChatGPT instead made wild guesses:
ChatGPT is remarkably good at identifying most cell types, but can be overconfident and assign a cell type to a list of random genes. There also seems to be some bias in this: ChatGPT said the random gene list was Sertoli cells in the context of the larger list of reproductive cell types, but when given five lists of completely random genes, it said “unknown” for all of them. Giving the option to respond “unknown” was very important, since otherwise the main outcome was “bovine fecal cells”.
I still don’t trust ChatGPT enough to use it for my research, but it will be interesting to see if this improves over time. Also, if any readers can try my prompts with GPT-4, please post the results in the comments!
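The random-gene test is a negative control, and it is simple to repeat with fresh lists. Below is a rough sketch, assuming the lists are drawn uniformly from some pool of gene symbols (how the lists in this post were picked isn't specified here). The function names, the placeholder gene pool, and the example answers are all hypothetical.

```python
import random


def random_gene_rows(gene_pool, n_rows=5, genes_per_row=5, seed=0):
    """Draw n_rows lists of distinct genes, uniformly at random, from gene_pool."""
    rng = random.Random(seed)
    return [rng.sample(gene_pool, genes_per_row) for _ in range(n_rows)]


def count_false_calls(answers):
    """Count answers that name a cell type instead of saying 'unknown'."""
    return sum(1 for a in answers if a.strip().lower() != "unknown")


# Placeholder symbols; a real pool would be a full list of human gene symbols.
gene_pool = [f"GENE{i}" for i in range(1, 101)]
controls = random_gene_rows(gene_pool)

# Hypothetical model answers, one per control row.
answers = ["unknown", "Unknown", "Sertoli cells", "unknown", "unknown"]
print(count_false_calls(answers), "of", len(answers), "controls got a cell type")
```

Mixing a few control rows in among real marker lists, rather than sending them on their own, would test the context effect described above.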