Tech moves at an extremely rapid pace, and new and improved large language models (LLMs) seem to arrive all the time. Claude 3.5 Sonnet has led the way for a long time when it comes to programming-related work. But is it still the top choice in this category, or has it been surpassed by one of the many new LLMs available today?
I wanted to see how much these LLMs had progressed, so I created a test to see which one would come out on top. This article is a follow-up to a post I made on my socials a while back, in which I tasked various AI models with building a simple Pokemon game. The results were surprising.
This was the prompt I used:
Create a simple 1 v 1 Pokemon battle game using JavaScript and use sprites from this website for the Pokemon https://pokemondb.net/sprites
In the first phase of testing, I used Claude 3.5 Sonnet, DeepSeek R1 and ChatGPT-4o. In the second phase, I used significantly more LLMs to get a better overview of the capabilities currently available to us. The LLMs tested were:
- DeepSeek R1
- Gemini 2.0 Flash Thinking Experimental
- Grok 2
- Mistral
- o3-mini (medium reasoning - Windsurf)
- Qwen2.5-Max
- Claude 3.5 Sonnet
Building a Pokemon game
For the second phase, I created a more advanced prompt to see how these LLMs cope with building more complex applications that require more logic and reasoning. I believe a game is always a great way to test these types of use cases.
The aim of these tests was to see what the AI could accomplish after just one prompt. Of course, I would expect them all to accomplish a lot more after further iterations of chain prompting from a user.
This was the prompt which I used:
Create a 1 v 1 Pokemon battle game using JavaScript and use sprites from this website for the Pokemon https://pokemondb.net/sprites Make sure that the player can switch between 2 different Pokemon during battle and that there is type and elemental damage based on the Pokemon being used. Each Pokemon should have at least four available attacks to use. The player's Pokemon should be at level 5, and the enemy Pokemon should be at level 7. Factor in how the difference in level should play a part in the battle, including a difference in health, etc...
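To make the prompt's requirements more concrete, here is a minimal sketch of what "type and elemental damage" and a level difference could look like in JavaScript. It is not taken from any of the generated games: the `TYPE_CHART` excerpt, the `maxHealth` scaling, the `calculateDamage` formula and all the numbers are my own simplified assumptions, not the official Pokemon formulas.

```javascript
// Hypothetical excerpt of a type chart: attacking type -> defending type -> multiplier.
// Any pairing that is not listed counts as neutral (1).
const TYPE_CHART = {
  fire: { grass: 2, water: 0.5, fire: 0.5 },
  water: { fire: 2, grass: 0.5, water: 0.5 },
  grass: { water: 2, fire: 0.5, grass: 0.5 },
  electric: { water: 2, grass: 0.5, electric: 0.5 },
};

function typeMultiplier(moveType, defenderType) {
  const row = TYPE_CHART[moveType] || {};
  return row[defenderType] ?? 1;
}

// Illustrative stat scaling (an assumption): health grows linearly with level.
function maxHealth(baseHealth, level) {
  return baseHealth + level * 5;
}

// Simplified damage formula (an assumption, not the official one):
// the attacker's level scales the move's power, then the type multiplier applies.
function calculateDamage(attacker, defender, move) {
  const levelFactor = attacker.level / 5;
  const raw = move.power * levelFactor * typeMultiplier(move.type, defender.type);
  return Math.max(1, Math.round(raw));
}

const charmander = { name: "Charmander", type: "fire", level: 5 };
const bulbasaur = { name: "Bulbasaur", type: "grass", level: 7 };
const ember = { name: "Ember", type: "fire", power: 10 };

console.log(calculateDamage(charmander, bulbasaur, ember)); // 20
console.log(maxHealth(20, 5), maxHealth(20, 7)); // 45 55
```

With these made-up numbers, the level 7 enemy starts with more health than the level 5 player Pokemon, and a super-effective move doubles the damage.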
You can find all of the Pokemon games here on my GitHub https://github.com/andrewbaisden/pokemon-battle-game.
The files named _battle.js are the original files generated by the LLMs, which were broken. The battle.js files in the same folders are the versions that Claude fixed.
These were the results of my testing. I will rate all of them out of 5 stars so you can see which ones excelled and which ones could have done better with more work.
DeepSeek R1
LLM Performance
It took a while for DeepSeek R1 to come up with a plan and start writing code. The response was slow because the task required a lot of thinking: DeepSeek R1 thought for about 300 seconds (5 minutes), which is the longest I have seen it take on any task so far. The chain-of-thought process is interesting to watch, though, and since I did not set a time limit, I did not mind the wait as long as it could fulfil the prompt.
Game UX & Logic
Unfortunately, the game only has basic functionality and is not fully working. It's possible to switch between Pokemon, the Pokemon have health bars, and there are four moves available, but the moves are all generic and don't have names like "Thunderbolt" or "Ember" as they do in the real games. On top of that, only one move can be used before all the buttons are greyed out, so the game cannot actually be played. There is also no image or GIF for the enemy Pokemon, just an empty box. The design is simple, and more prompts would be needed to get this game into a working state.
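I did not debug DeepSeek R1's code, so I can't say what actually caused the buttons to stay greyed out. One common cause of this symptom is that input gets locked when the player attacks and is never unlocked once the enemy's turn finishes. The sketch below shows that idea in isolation; `createBattle`, `playerAttack`, `enemyAttack` and the `locked` flag are hypothetical names, not code from the generated game.

```javascript
// Hypothetical battle state. "locked" is what a UI would read to grey out the move buttons.
function createBattle() {
  return { turn: "player", locked: false, log: [] };
}

// The player picks a move: lock input while the turn resolves.
function playerAttack(battle, moveName) {
  if (battle.locked || battle.turn !== "player") return false;
  battle.locked = true;
  battle.log.push(`Player used ${moveName}`);
  battle.turn = "enemy";
  return true;
}

// The enemy responds, then control returns to the player.
// Forgetting to reset "locked" here would leave every button disabled.
function enemyAttack(battle, moveName, battleOver = false) {
  if (battle.turn !== "enemy") return false;
  battle.log.push(`Enemy used ${moveName}`);
  battle.turn = "player";
  battle.locked = battleOver; // stay locked only if the battle has ended
  return true;
}

const battle = createBattle();
playerAttack(battle, "Ember");
enemyAttack(battle, "Tackle");
console.log(battle.locked); // false, so the player can choose another move
```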
Gemini 2.0 Flash Thinking Experimental
LLM Performance
Gemini 2.0 Flash took about 15 seconds to respond to the prompt, which was fairly quick.
Game UX & Logic
The fast response did not diminish Gemini's output: it created a fully functional game with a pretty decent design. It has animated Pokemon, health bars, four moves, the ability to switch Pokemon, and an output box that logs every move made during the battle. It's definitely one of the best games created in this test.
Grok 2
LLM Performance
Grok 2 does not have reasoning or a chain of thought. It took about 1 minute to complete the prompt request.
Game UX & Logic
Unfortunately, the codebase it provided was broken and did not work. I decided to use Claude 3.5 Sonnet via the Windsurf IDE to debug the codebase, and it got it working after one prompt. I did not do this for DeepSeek R1 because its game was already somewhat playable, whereas the version Grok 2 created had bugs that made it completely unplayable.
After fixing the codebase, I could see that Grok 2 had actually designed and built a pretty beautiful game. The game more or less achieved the basics that I outlined in the initial prompt, which was good. However, it loses points because the codebase was broken, and Claude had to fix it.
Mistral
LLM Performance
It took about 2 seconds to generate the codebase, which was by far the fastest of all the LLMs I tested.
Game UX & Logic
Mistral was able to create a fully functional game after just 2 seconds! The design was pretty simple, but the basic logic worked as expected.
o3-mini (medium reasoning - Windsurf)
LLM Performance
It took about 5 seconds to create an action plan for building the app, and then about another 10 seconds to write the codebase once I had created empty index.html, styles.css and battle.js files for it to add the code to.
Game UX & Logic
After the setup, it successfully created a working application on the first attempt! The game works as expected and fulfils the requirements I set in the prompt. My one comment is that all the move buttons have generic names like "Attack 1" and "Attack 2", even though the output screen shows the name of the move being used. It would be better if the buttons matched the attack names shown in the output.
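As a rough illustration of that suggestion, the button labels can be derived from the same move data that the battle log uses, so the two can never drift apart. This is a sketch with hypothetical names (`pikachuMoves`, `moveButtonLabels`) and made-up power values, not the code o3-mini generated.

```javascript
// Hypothetical move data; the same objects would feed both the buttons and the battle log.
const pikachuMoves = [
  { name: "Thunderbolt", type: "electric", power: 12 },
  { name: "Quick Attack", type: "normal", power: 8 },
  { name: "Iron Tail", type: "steel", power: 10 },
  { name: "Electro Ball", type: "electric", power: 11 },
];

// Build one label per move instead of hard-coding "Attack 1", "Attack 2", ...
function moveButtonLabels(moves) {
  return moves.map((move) => `${move.name} (${move.type})`);
}

const labels = moveButtonLabels(pikachuMoves);
console.log(labels[0]); // Thunderbolt (electric)
```

In the browser, each label would then be assigned to the matching button's `textContent`.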
Qwen2.5-Max
LLM Performance
It took about 1 minute to generate a codebase, which was not too bad.
Game UX & Logic
The JavaScript file had an error, although the HTML was able to work in the browser. The functionality did not work, though, so I used Claude 3.5 Sonnet via the Windsurf IDE to debug the codebase, and it got it working after one prompt.
The game works and does what I outlined in the initial prompt. However, the game logic needs much improvement. Firstly, when switching Pokemon, the attack moves remain the same and don't change, so they are not relevant for the new Pokemon. Secondly, the damage seems to be stuck at one, and when the Pokemon have a health of 100, that means the battle will be going on for a very long time...
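Both problems are easy to picture in code. The sketch below is my own guess at the kind of mistake involved, not a diagnosis of Qwen2.5-Max's output: the move list is looked up from the active Pokemon every time instead of being cached, and the damage example shows how rounding down too early can pin every hit at 1. All names, movesets and numbers here are hypothetical.

```javascript
// Hypothetical party; the movesets are illustrative only.
const party = [
  { name: "Pikachu", moves: ["Thunder Shock", "Quick Attack", "Tail Whip", "Growl"] },
  { name: "Squirtle", moves: ["Water Gun", "Tackle", "Withdraw", "Bubble"] },
];

function createPlayer(partyList) {
  return { party: partyList, activeIndex: 0 };
}

// Always read the moves from the currently active Pokemon.
function activeMoves(player) {
  return player.party[player.activeIndex].moves;
}

// Switching changes the active index and returns the new move list for the UI to render.
function switchPokemon(player, index) {
  if (index < 0 || index >= player.party.length) {
    throw new RangeError(`No Pokemon at index ${index}`);
  }
  player.activeIndex = index;
  return activeMoves(player);
}

// One way damage can get stuck at 1: flooring a fraction before scaling it up.
// At level 5, Math.floor(5 / 10) is 0, so the product is 0 and the clamp returns 1.
function damageFlooredTooEarly(level, power) {
  return Math.max(1, Math.floor(level / 10) * power);
}

// Flooring only at the end keeps the fractional part of the level factor.
function damageFlooredLast(level, power) {
  return Math.max(1, Math.floor((level * power) / 10));
}

const player = createPlayer(party);
console.log(switchPokemon(player, 1)[0]); // Water Gun
console.log(damageFlooredTooEarly(5, 40)); // 1
console.log(damageFlooredLast(5, 40)); // 20
```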
Claude 3.5 Sonnet
LLM Performance
It took about 1 minute to generate a codebase, which is more than acceptable.
Game UX & Logic
The game was functional. However, it created placeholder images for the Pokemon and required the user to download sprites to replace the placeholders manually. But at least it provided instructions on how to do it. This is probably because Claude cannot search the web like other LLMs can, so it was unable to read the documentation. It's worth noting that I used the Claude website for this test. If I had used an IDE like Windsurf, which can search the web, then it might have worked.
This was the only game which had animated health bars, which was cool. I'm not so sure about the game logic, though. Either the enemy Pokemon is just that strong, or the player's Pokemon are damaging themselves every time they attack, because their health bars go down way too quickly. 😂 Also, there are no electric Pokemon in this game, but there are electric attacks, which makes no sense. 😂
Conclusion
I think it's incredible to see how far AI has come and the direction it's heading in. Today, we saw the current capabilities of some of the leading LLMs available right now. The fact that it's possible to create a fairly sophisticated working codebase from one prompt is truly a great sight to see. Also, although the prompt I used was detailed, it left out some information, and the AI models were still able to figure out most of what I was referring to, which shows how useful they have become for this type of work.
This test was not super scientific but a quick, fun one to gain insight into how well these models can build something from scratch with little human intervention. Based on this short study, I would give each LLM the following ratings and rankings for this particular test.
| LLM | Rating (out of 5) |
|---|---|
| DeepSeek R1 | ⭐️ |
| Gemini 2.0 Flash Thinking Experimental | ⭐️⭐️⭐️⭐️⭐️ |
| Grok 2 | ⭐️⭐️⭐️ |
| Mistral | ⭐️⭐️⭐️⭐️ |
| o3-mini (medium reasoning - Windsurf) | ⭐️⭐️⭐️⭐️ |
| Qwen2.5-Max | ⭐️⭐️ |
| Claude 3.5 Sonnet | ⭐️⭐️⭐️ |
Unfortunately, DeepSeek R1 only scored 1 star in this particular test because the game was not fully functional. Surprisingly, it was Gemini 2.0 Flash which came out on top with 5 stars. Grok 2 only managed 3 stars because the codebase needed to be fixed by Claude before it worked.
Mistral and o3-mini (medium reasoning) produced fairly good all-round games. Qwen2.5-Max created a game which only worked after Claude debugged the codebase, and its logic needed improvement because the attacks only did 1 point of damage, so winning a game would be tiresome and boring... 😂
Lastly, Claude only scored 3 stars because the game logic was a bit weird and, unlike the other games, it could not display any images of Pokemon because it was unable to search the web. However, it gets an honourable mention because it fixed 2 broken codebases and got those games working after one prompt! If I had used Claude 3.5 Sonnet inside an IDE like Windsurf or Cursor, which can access the web, it would likely have produced even better results when building this game.