
AI might excel at certain tasks, like coding or generating a podcast. But it struggles to pass a high-level history test, a new paper has found.

A team of researchers has created a new benchmark to test three top large language models (LLMs) — OpenAI’s GPT-4, Meta’s Llama, and Google’s Gemini — on historical questions. The benchmark, Hist-LLM, tests the correctness of answers according to the Seshat Global History Databank, a vast database of historical knowledge named after the ancient Egyptian goddess of wisdom.
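The paper’s exact evaluation pipeline isn’t described here, but a benchmark of this kind can be thought of as scoring model answers against the databank’s ground truth. The sketch below is purely illustrative — the questions, answers, and `grade` helper are hypothetical, not the actual Hist-LLM code:

```python
# Hypothetical sketch of how a history benchmark might score an LLM.
# The questions, answers, and `grade` helper are illustrative only,
# not the actual Hist-LLM implementation.

def grade(model_answers, ground_truth):
    """Return the fraction of answers matching the databank's ground truth."""
    correct = sum(
        model_answers[q].strip().lower() == truth.strip().lower()
        for q, truth in ground_truth.items()
    )
    return correct / len(ground_truth)

# Toy ground truth, loosely modeled on the examples in the article.
ground_truth = {
    "Did ancient Egypt have scale armor in this period?": "no",
    "Did ancient Egypt have a professional standing army in this period?": "no",
    "Did ancient Persia maintain a standing army?": "yes",
    "Was writing present in ancient Egypt?": "yes",
}

# A model that over-generalizes from prominent empires answers "yes" everywhere.
model_answers = {q: "yes" for q in ground_truth}

print(f"accuracy: {grade(model_answers, ground_truth):.0%}")  # accuracy: 50%
```

On this toy set, a model that always extrapolates from well-documented empires gets exactly the obscure-knowledge questions wrong — the same failure mode the researchers describe below.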

The results, which were presented last month at the high-profile AI conference NeurIPS, were disappointing, according to researchers affiliated with the Complexity Science Hub (CSH), a research institute based in Austria. The best-performing LLM was GPT-4 Turbo, but it only achieved about 46% accuracy — not much higher than random guessing.

“The main takeaway from this study is that LLMs, while impressive, still lack the depth of understanding required for advanced history. They’re great for basic facts, but when it comes to more nuanced, PhD-level historical inquiry, they’re not yet up to the task,” said Maria del Rio-Chanona, one of the paper’s co-authors and an associate professor of computer science at University College London.

The researchers shared sample historical questions with TechCrunch that LLMs got wrong. For example, GPT-4 Turbo was asked whether scale armor was present during a specific time period in ancient Egypt. The LLM said yes, but the technology only appeared in Egypt 1,500 years later.

Why are LLMs bad at answering technical historical questions, when they can be so good at answering very complicated questions about things like coding? Del Rio-Chanona told TechCrunch that it’s likely because LLMs tend to generalize from historical data that is very prominent, finding it difficult to retrieve more obscure historical knowledge.

For example, the researchers asked GPT-4 if ancient Egypt had a professional standing army during a specific historical period. While the correct answer is no, the LLM answered incorrectly that it did. This is likely because there is lots of public information about other ancient empires, like Persia, having standing armies.


“If you get told A and B 100 times, and C one time, and then get asked a question about C, you might just remember A and B and try to extrapolate from that,” del Rio-Chanona said.

The researchers also identified other trends, including that OpenAI and Llama models performed worse for certain regions like sub-Saharan Africa, indicating potential bias in their training data.

The results show that LLMs still aren’t a substitute for humans when it comes to certain domains, said Peter Turchin, who led the study and is a faculty member at CSH.

But the researchers are still hopeful that LLMs can help historians in the future. They’re working on refining their benchmark by including more data from underrepresented regions and adding more complex questions.

“Overall, while our results highlight areas where LLMs need improvement, they also emphasize the potential for these models to aid in historical research,” the paper reads.