
AI benchmark; yellow "ball" on black background

Image Credits: Andrew Mayne


The list of informal, weird AI benchmarks keeps growing.

Over the past few days, some in the AI community on X have become obsessed with a test of how different AI models, particularly so-called reasoning models, handle prompts like this: "Write a Python script for a bouncing yellow ball within a shape. Make the shape slowly rotate, and make sure that the ball stays within the shape."

Some models manage better on this "ball in rotating shape" benchmark than others. According to one user on X, Chinese AI lab DeepSeek's freely available R1 swept the floor with OpenAI's o1 pro mode, which costs $200 per month as part of OpenAI's ChatGPT Pro plan.

👀 DeepSeek R1 (right) crushed o1-pro (left) 👀

Prompt: "write a python script for a bouncing yellow ball within a square, make sure to handle collision detection properly. make the square slowly rotate. implement it in python. make sure ball stays within the square" pic.twitter.com/3Sad9efpeZ

- Ivan Fioravanti ᯅ (@ivanfioravanti), January 22, 2025

Per another X poster, Anthropic's Claude 3.5 Sonnet and Google's Gemini 1.5 Pro models misjudged the physics, resulting in the ball escaping the shape. Other users reported that Google's Gemini 2.0 Flash Thinking Experimental, and even OpenAI's older GPT-4o, aced the evaluation in one go.

Tested 9 AI models on a physics simulation task: rotating triangle + bouncing ball. Results:

🥇 Deepseek-R1 🥈 Sonar Huge 🥉 GPT-4o

Worst? OpenAI o1: Completely misunderstood the task 😂

Video below ↓ First row = Reasoning models, rest = Base models. pic.twitter.com/EOYrHvNazr

- Aadhithya D (@Aadhithya_D2003), January 22, 2025

But what does it prove that an AI can or can't code a rotating, ball-containing shape?


Well, simulating a bouncing ball is a classic programming challenge. Accurate simulations incorporate collision detection algorithms, which try to identify when two objects (for example, a ball and the side of a shape) collide. Poorly written algorithms can affect the simulation's performance or lead to obvious physics mistakes.
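To make the idea of collision detection concrete, here is a minimal sketch of one common approach: find the point on a wall segment closest to the ball's center, and treat the ball as colliding when that point is nearer than the ball's radius. The function names (`closest_point_on_segment`, `collide_ball_with_wall`) are hypothetical, and the sketch assumes a stationary wall; it is not taken from any of the models' outputs discussed here.

```python
import math

def closest_point_on_segment(p, a, b):
    """Return the point on segment a-b that is closest to point p (2D tuples)."""
    abx, aby = b[0] - a[0], b[1] - a[1]
    length_sq = abx * abx + aby * aby
    if length_sq == 0.0:
        return a  # degenerate segment: both endpoints coincide
    # Project p onto the line through a and b, then clamp to the segment.
    t = ((p[0] - a[0]) * abx + (p[1] - a[1]) * aby) / length_sq
    t = max(0.0, min(1.0, t))
    return (a[0] + t * abx, a[1] + t * aby)

def collide_ball_with_wall(pos, vel, radius, a, b):
    """Check a ball against one wall; return (new_pos, new_vel, hit).

    Assumption: the wall is treated as stationary, so the wall's own
    motion (for example from rotation) is ignored in the bounce.
    """
    cx, cy = closest_point_on_segment(pos, a, b)
    dx, dy = pos[0] - cx, pos[1] - cy
    dist = math.hypot(dx, dy)
    if dist >= radius or dist == 0.0:
        # No overlap, or the center sits exactly on the wall (no usable normal).
        return pos, vel, False
    nx, ny = dx / dist, dy / dist  # unit normal pointing from wall to ball
    v_dot_n = vel[0] * nx + vel[1] * ny
    if v_dot_n < 0:
        # Ball is moving into the wall: mirror the velocity about the normal.
        vel = (vel[0] - 2 * v_dot_n * nx, vel[1] - 2 * v_dot_n * ny)
    # Push the ball back out so it no longer overlaps the wall.
    pos = (cx + nx * radius, cy + ny * radius)
    return pos, vel, True
```

Skipping the final push-out step is one plausible way a simulation ends up with a ball that sinks into, or slips through, a wall over successive frames.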

X user N8 Programs, a researcher in residence at AI startup Nous Research, says it took him roughly two hours to program a bouncing ball in a rotating heptagon from scratch. "One has to track multiple coordinate systems, how the collisions are done in each system, and design the code from the beginning to be robust," N8 Programs explained in a post.
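As an illustration of what tracking multiple coordinate systems can look like, the sketch below converts a point between the fixed "world" frame and the rotating shape's own frame, where the walls stand still. It uses a square rather than a heptagon to keep the containment check short. All names here (`world_to_local`, `local_to_world`, `ball_inside_square`) are hypothetical and do not come from N8 Programs' code.

```python
import math

def world_to_local(point, center, angle):
    """Express a world-space point in the frame of a shape rotated by `angle`."""
    dx, dy = point[0] - center[0], point[1] - center[1]
    c, s = math.cos(-angle), math.sin(-angle)  # undo the shape's rotation
    return (dx * c - dy * s, dx * s + dy * c)

def local_to_world(point, center, angle):
    """Map a point from the shape's frame back into world space."""
    c, s = math.cos(angle), math.sin(angle)
    return (center[0] + point[0] * c - point[1] * s,
            center[1] + point[0] * s + point[1] * c)

def ball_inside_square(ball_pos, radius, center, angle, half_side):
    """True if the whole ball fits inside a square rotated by `angle` radians."""
    lx, ly = world_to_local(ball_pos, center, angle)
    # In the shape's frame the square is axis-aligned, so the test is simple.
    limit = half_side - radius
    return abs(lx) <= limit and abs(ly) <= limit

# The same ball position is outside the unrotated square but inside it
# once the square has turned by 45 degrees.
print(ball_inside_square((1.9, 0.0), 0.5, (0.0, 0.0), 0.0, 2.0))
print(ball_inside_square((1.9, 0.0), 0.5, (0.0, 0.0), math.pi / 4, 2.0))
```

This sketch only converts positions. Velocities pick up an additional term from the frame's rotation when moved between frames, which the sketch does not handle and which is one reason the bookkeeping gets harder than it first appears.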

But while bouncing balls and rotating shapes are a reasonable test of programming skills, they're not a very empirical AI benchmark. Even slight variations in the prompt can, and do, yield different outcomes. That's why some users on X report having more luck with o1, while others say that R1 falls short.

If anything, viral tests like these point to the intractable problem of creating useful systems of measurement for AI models. It's often difficult to tell what differentiates one model from another, outside of esoteric benchmarks that aren't relevant to most people.

Many efforts are underway to build better tests, like the ARC-AGI benchmark and Humanity's Last Exam. We'll see how those fare, and in the meantime watch GIFs of balls bouncing in rotating shapes.