Image Credits: Andrew Mayne
The list of informal, weird AI benchmarks keeps growing.
Over the past few days, some in the AI community on X have become obsessed with a test of how different AI models, particularly so-called reasoning models, handle prompts like this: “Write a Python script for a bouncing yellow ball within a shape. Make the shape slowly rotate, and make sure that the ball stays within the shape.”
Some models manage better on this “ball in rotating shape” benchmark than others. According to one user on X, Chinese AI lab DeepSeek’s freely available R1 swept the floor with OpenAI’s o1 pro mode, which costs $200 per month as part of OpenAI’s ChatGPT Pro plan.
👀 DeepSeek R1 (right) crushed o1-pro (left) 👀
Prompt: “write a python script for a bouncing yellow ball within a square, make sure to handle collision detection properly. make the square slowly rotate. implement it in python. make sure ball stays within the square” pic.twitter.com/3Sad9efpeZ
— Ivan Fioravanti ᯅ (@ivanfioravanti) January 22, 2025
Per another X poster, Anthropic’s Claude 3.5 Sonnet and Google’s Gemini 1.5 Pro models misjudged the physics, resulting in the ball escaping the shape. Other users reported that Google’s Gemini 2.0 Flash Thinking Experimental, and even OpenAI’s older GPT-4o, aced the evaluation in one go.
Tested 9 AI models on a physics simulation task: rotating triangle + bouncing ball. Results:
🥇 Deepseek-R1 🥈 Sonar Huge 🥉 GPT-4o
Worst? OpenAI o1: Completely misunderstood the task 😂
Video below ↓ First row = Reasoning models, rest = Base models. pic.twitter.com/EOYrHvNazr
— Aadhithya D (@Aadhithya_D2003) January 22, 2025
But what does it prove that an AI can or can’t code a rotating, ball-containing shape?
Well, simulating a bouncing ball is a classic programming challenge. Accurate simulations incorporate collision detection algorithms, which try to identify when two objects (for example, a ball and the side of a shape) collide. Poorly written algorithms can affect the simulation’s performance or lead to obvious physics mistakes.
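To make the idea concrete, here is a minimal sketch of one common way to detect a ball hitting a single wall and bounce it off. It is not taken from any of the models' outputs or the posts quoted here; the function names (`closest_point_on_segment`, `ball_hits_wall`, `reflect`) are hypothetical, and the sketch assumes a stationary wall and a perfectly elastic bounce.

```python
import math

def closest_point_on_segment(p, a, b):
    """Return the point on segment a-b that is closest to point p."""
    ax, ay = a
    bx, by = b
    px, py = p
    abx, aby = bx - ax, by - ay
    length_sq = abx * abx + aby * aby
    if length_sq == 0.0:
        return a  # degenerate wall: both endpoints coincide
    # Project p onto the line through a and b, then clamp to the segment.
    t = ((px - ax) * abx + (py - ay) * aby) / length_sq
    t = max(0.0, min(1.0, t))
    return (ax + t * abx, ay + t * aby)

def ball_hits_wall(center, radius, a, b):
    """True if a circle (the ball) touches or overlaps wall segment a-b."""
    cx, cy = closest_point_on_segment(center, a, b)
    return math.hypot(center[0] - cx, center[1] - cy) <= radius

def reflect(velocity, normal):
    """Mirror a velocity about a unit wall normal: v' = v - 2 (v . n) n."""
    vx, vy = velocity
    nx, ny = normal
    dot = vx * nx + vy * ny
    return (vx - 2 * dot * nx, vy - 2 * dot * ny)
```

A simulation loop would typically run a check like `ball_hits_wall` against every side of the shape on each frame and call `reflect` when it returns true. A rotating shape complicates this, because the wall itself is moving at the moment of impact.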
X user N8 Programs, a researcher in residence at AI startup Nous Research, says it took him roughly two hours to program a bouncing ball in a rotating heptagon from scratch. “One has to track multiple coordinate systems, how the collisions are done in each system, and design the code from the beginning to be robust,” N8 Programs explained in a post.
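The “multiple coordinate systems” point can be illustrated with a short sketch. This is an assumption about one plausible approach, not N8 Programs’ actual code: keep the polygon’s vertices fixed in the shape’s own frame and convert the ball’s position between the world frame and that frame as the shape turns. All names below are hypothetical.

```python
import math

def rotate(point, angle):
    """Rotate a 2D point counterclockwise about the origin by angle (radians)."""
    x, y = point
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)

def world_to_shape(point, center, angle):
    """Express a world-frame point in the frame of a shape rotated by angle."""
    return rotate((point[0] - center[0], point[1] - center[1]), -angle)

def shape_to_world(point, center, angle):
    """Inverse of world_to_shape: map a shape-frame point back to the world."""
    x, y = rotate(point, angle)
    return (x + center[0], y + center[1])

def polygon_vertices(n_sides, radius):
    """Vertices of a regular polygon, defined once in the shape's own frame."""
    return [
        (radius * math.cos(2 * math.pi * k / n_sides),
         radius * math.sin(2 * math.pi * k / n_sides))
        for k in range(n_sides)
    ]
```

With helpers like these, the walls of a heptagon from `polygon_vertices(7, radius)` never change in the shape frame, so collision checks can run there. The catch is that velocities also have to be converted correctly between frames, which is one place where a simulation can go wrong.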
But while bouncing balls and rotating shapes are a reasonable test of programming skills, they’re not a very empirical AI benchmark. Even slight variations in the prompt can, and do, yield different outcomes. That’s why some users on X report having more luck with o1, while others say that R1 falls short.
If anything, viral tests like these point to the intractable problem of creating useful systems of measurement for AI models. It’s often difficult to tell what differentiates one model from another, outside of esoteric benchmarks that aren’t relevant to most people.
Many efforts are underway to build better tests, like the ARC-AGI benchmark and Humanity’s Last Exam. We’ll see how those fare, and in the meantime watch GIFs of balls bouncing in rotating shapes.