Anthropic’s free Claude 4 Sonnet aced my coding tests – but its paid Opus model somehow didn’t

The pace of AI improvement fascinates me. What sucked a year ago is at the top of the heap this year. I've watched that play out over the past few months, as both Google's Gemini and Microsoft's Copilot climbed from the bottom rung of the AI coding ladder to the winner's circle, passing all of my coding tests.
Also: How I test an AI chatbot's coding ability - and you can, too
Today, another language model is making the trek up the ladder. What makes this interesting is that the underdog is moving into the winner's circle, while the odds-on favorite only climbed a rung or two before getting stuck.
Like most LLM vendors, Anthropic offers its Claude chatbot in both free and paid versions. The free version currently available is Claude 4 Sonnet. The paid version, which comes with Anthropic's $20/month Pro plan, is Claude 4 Opus. Opus is also available on Anthropic's even more expensive Max plan. The model is the same; the usage limits are just looser.
Also: The best AI for coding in 2025 (and what not to use)
Here's the TL;DR along with a handy chart. The free Claude 4 Sonnet passed all four of my coding tests. The paid Claude 4 Opus, on the other hand, failed two of them. Both results are better than they were exactly a year ago, but it's still fairly baffling that the higher-end version of the model performed worse.
And now, the results…
Test 1: Writing a WordPress plugin
Let's not bury the lede. When I last tried Claude using its 3.5 Sonnet model, it built the user interface, but nothing ran. This time, both Claude 4 Sonnet and Claude 4 Opus built working plugins.
That's the good news. Things do get weird, but let's start with the easy stuff. Both Sonnet and Opus built workable user interfaces.
Interestingly, the more expensive Opus version built a UI that looked almost identical to what Claude 3.5 Sonnet gave me a year ago, while Sonnet 4 built a UI with a slightly different layout. Given the original prompt, both are quite acceptable.
Also: Gemini Pro 2.5 is a stunningly capable coding assistant - and a big threat to ChatGPT
Just to recap, this test asks the AI to build an actual usable plugin. I originally assigned it to ChatGPT back in the very early days of generative AI, and was quite surprised when that AI turned out something my wife could put on her website's backend to help populate a customer involvement device she uses in her e-commerce business. To this day, it still saves her time every month.
But, as you can see from this screenshot (and especially the multicolored column on the far left), Sonnet 4 and Opus 4 generated considerably different code.
In one way, the paid Opus version produces more robust code than the free Sonnet version. Opus uses internationalization-ready strings, a best practice that makes the plugin translation-friendly. That's nice to see, but it's not a ding against Sonnet, because I've never required internationalization features in my tests.
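If you haven't run into WordPress internationalization before, the idea is simply to route every user-facing string through the translation helpers instead of hard-coding it. Here's a minimal sketch of that pattern; the strings and the text domain are hypothetical examples, not taken from the generated plugin.

```php
<?php
// Minimal sketch of translation-ready strings in a WordPress plugin.
// The label text and the 'my-plugin' text domain are hypothetical.

// Returns a translatable string for later use.
$label = __( 'Number of entries', 'my-plugin' );

// Escapes and echoes a translatable string directly into the page.
esc_html_e( 'Settings saved.', 'my-plugin' );
```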
Last time around, though, Claude 3.5 Sonnet created its own JavaScript file and wrote it to the server. This practice is very, very baaad. More on that in a minute.
Also: I retested Microsoft Copilot's AI coding skills in 2025 and now it's got serious game
This time, Sonnet is not doing that. Instead, it uses inline JavaScript and reuses the jQuery library WordPress already loads. That approach is not necessarily a best practice in WordPress, because inline scripts can conflict with other plugins, but it's not directly harmful.
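For readers who don't live in WordPress, here's roughly what the inline approach looks like. This is a minimal sketch, not Sonnet's actual output; the hook and element IDs are hypothetical.

```php
<?php
// Minimal sketch of the inline-JavaScript approach: the script is printed
// straight from PHP and piggybacks on the jQuery that WordPress already loads.
// The admin_footer hook and the element IDs are hypothetical examples.
add_action( 'admin_footer', function () {
    ?>
    <script>
    jQuery( function ( $ ) {
        // Wire up a button the plugin's settings page rendered earlier.
        $( '#my-plugin-generate' ).on( 'click', function () {
            $( '#my-plugin-status' ).text( 'Working…' );
        } );
    } );
    </script>
    <?php
} );
```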
By contrast, Opus generates its own JavaScript file. When I first fed it my test prompt, Opus returned a single PHP file for download, which I installed on the server as usual.
Then, when the Opus-generated plugin ran, it wrote a JavaScript file into the plugin's home directory. This is both fairly impressive and wildly wrong-headed. It's cool that it tried to make the plugin setup process easier, but whether a plugin can write to its own folder depends on the server's filesystem permissions. There's a very high chance it could fail.
I allowed it in my testing environment, but I'd never allow a plugin to rewrite its own code in a production environment. That's a very serious security flaw. Think about it: the AI-generated code, once installed and running, wrote new code and enabled it on the server. That's malware behavior right there.
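To make the concern concrete, the pattern looks something like the sketch below. This is a hypothetical reconstruction, not Opus's actual code; the file name, script handle, and contents are made up.

```php
<?php
// Hypothetical reconstruction of the risky pattern: at runtime, the plugin
// writes a JavaScript file into its own directory, then loads it.
$js_path = plugin_dir_path( __FILE__ ) . 'generated.js';

if ( ! file_exists( $js_path ) ) {
    // Writing executable code to the plugin folder only works if the web
    // server has write permission there. And if it does, anything else
    // running as that user can rewrite the plugin, too.
    file_put_contents( $js_path, 'jQuery(function($){ /* UI wiring */ });' );
}

add_action( 'admin_enqueue_scripts', function () {
    wp_enqueue_script(
        'my-plugin-generated',
        plugins_url( 'generated.js', __FILE__ ),
        array( 'jquery' ),
        '1.0',
        true
    );
} );
```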
Also: How AI coding agents could destroy open source software
With that, I'm giving Claude 4 Sonnet a passing grade on this test, but failing the paid Opus edition because it did creepy and dangerous AI stuff.
Test 2: Rewriting a string function
This test evaluates how well the AI can clean up some fairly common code: a regular expression for validating dollars-and-cents input. Regular expressions are essentially patterns that describe what a string of characters is allowed to look like. In this case, the allowed characters are the digits 0-9 and a decimal point. We want the code to accept valid dollars-and-cents amounts and reject garbage input.
Here, too, the free version outperforms the paid version. Sonnet 4 runs the regular expression first, confirming the submitted input is formatted like money before converting the string data (suitable for display) into numeric data (suitable for math). That saves a few cycles on invalid inputs.
Also: Java at 30: How a language designed for a failed gadget became a global powerhouse
If a decimal point is present in the input string, Sonnet 4 requires at least one digit after the decimal point. Digit, no-digit, Opus 4 doesn't care. Sonnet's slightly stricter interpretation can prevent malformed inputs and errors.
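Here's a minimal sketch of the difference. The patterns and helper below are my own hypothetical illustration of the two behaviors, not the code either model actually produced.

```php
<?php
// Hypothetical illustration: validate the format first, then convert for math.
// The strict pattern requires at least one digit after a decimal point;
// the lax pattern lets a trailing decimal point (e.g. "12.") slip through.
$strict = '/^\d+(\.\d{1,2})?$/';
$lax    = '/^\d+\.?\d*$/';

function to_cents( string $input, string $pattern ): ?int {
    // Reject anything that isn't shaped like a dollars-and-cents amount
    // before turning the display string into a number.
    if ( ! preg_match( $pattern, $input ) ) {
        return null;
    }
    return (int) round( (float) $input * 100 );
}

var_dump( to_cents( '12.50', $strict ) ); // int(1250)
var_dump( to_cents( '12.',   $strict ) ); // NULL  (malformed input rejected)
var_dump( to_cents( '12.',   $lax    ) ); // int(1200)  (malformed input accepted)
```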
Sonnet 4's code is also more readable than Opus 4's code. Sonnet creates a clear block for the format test and clear error management on failure. Opus 4 slams it all into one long conditional expression inside a variable assignment. That's a lot harder to read and maintain.
The free Sonnet 4 version passes this test as well. The paid Opus 4 version fails, because it lets a malformed input through.
I have to say this is pretty baffling. Opus is supposed to be the flagship model in the Anthropic stable, but even though it's the bigger, presumably more thoroughly trained model, it failed more of my tests.
Test 3: Finding an annoying bug
This test essentially evaluates the AI's knowledge of a coding framework, in this case WordPress. Language knowledge is one thing. Languages are usually governed by very public and very carefully defined specifications. But frameworks are often built up incrementally, and involve a lot of "folklore" knowledge among users.
Also: How much energy does AI really use? The answer is surprising - and a little complicated
Once upon a time, I had a very annoying bug that I was having difficulty fixing. The fix the error message seemed to point to didn't solve the problem. That's because the actual bug was hidden in how the framework shared information, which was far from intuitively obvious.
Early on in my AI testing, not all AIs figured this out. ChatGPT solved it, as did Claude 3.5 Sonnet. Now, we can add Claude 4 Sonnet and Claude 4 Opus, both of which passed this test perfectly. They also both correctly identified a more obvious syntax error in the test code.
Test 4: Writing a script
This test plumbs the depths of the AI model's knowledge. It tests for understanding of Chrome's DOM (the document object model the browser uses to represent a page), AppleScript (a Mac scripting language), and Keyboard Maestro (a Mac automation tool built by one lone developer). Diehard Mac scripting aficionados (like me) know about Keyboard Maestro, but it's not exactly mainstream.
Claude 3.5 Sonnet failed this test. Claude 4 Sonnet passed. This time, Sonnet knew how to talk to Keyboard Maestro. AppleScript lacks a built-in toLower function (to make a string lowercase), so Sonnet wrote one to meet the needs of this test. All good.
Also: How ChatGPT could replace the internet as we know it
Claude 4 Opus did a slightly better job than Claude 4 Sonnet. Opus also generated working code, but instead of creating an entire new function to force a string to lowercase and then compare, it simply used AppleScript's built-in "ignoring case" functionality. It's not a big thing, but it is better code.
Both Claude 4 Sonnet and Claude 4 Opus passed this test, leaving Sonnet with a 4-out-of-4 score and Opus with a disappointing 2-out-of-4 score.
What are you using?
What about you? Have you tried Claude 4 Sonnet or Opus for coding tasks? Did you use the older model variants? Were you surprised that the free version outperformed the paid one in some areas? How do you evaluate trust when an AI model rewrites or deploys code on its own? Have you encountered similar behavior in other AI tools? Let us know in the comments below.
You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, on Bluesky at @DavidGewirtz.com, and on YouTube at YouTube.com/DavidGewirtzTV.