In a recent study published in Science, researchers comprehensively evaluated OpenAI's large language model (LLM) o1, comparing its performance against that of hundreds of physicians on complex clinical reasoning tasks. The data came from five experimental benchmark tests and a study conducted in a real emergency department, spanning both canonical "benchmark" medical cases and real-world emergency scenarios.
The results showed that the artificial intelligence (AI) model generally outperformed human physicians across multiple tasks, suggesting that advanced models may already exceed many established clinical reasoning benchmarks. The authors conclude that, in the near future, AI could move beyond information retrieval to offer sophisticated and reliable clinical second opinions.
