Date: 2024-06-21

Key Takeaways

Claude 3.5 Sonnet surpasses ChatGPT, Gemini, and Llama models in some benchmarks.

Available to all users online and as an app, Claude offers free usage with increased limits for paid subscriptions.

Claude wins across several benchmarks but still has weaknesses common to other AI models.

Move over, GPT-4o and Gemini 1.5: there's a new player in town. Anthropic has released its latest model, pretentiously called Claude 3.5 Sonnet, and the company says it can outperform the latest ChatGPT, Gemini, and Llama models in several benchmarks.

Claude 3.5 Sonnet is now available to all users online and in the Claude app, and you don't need a subscription to use it. Free users are limited in the number of messages they can send, however; the limit varies with demand and resets each day. You can sign up for a paid subscription to get five times the usage permitted in the free version.

How does Claude 3.5 Sonnet compare to its rivals?

The new model comes out ahead in many benchmarks

AI benchmarks should always be taken with a pinch of salt, as comparing AI chatbots is a notoriously difficult thing to do, not least because your chatbot might give a different response to the same question the next time you ask it. These benchmarks usually focus on specific types of tasks, too, which doesn't always give a good picture of how well a chatbot performs in real life. Regardless, the benchmarks published by Anthropic make for some interesting reading.

Anthropic tested Claude 3.5 Sonnet across eight different benchmarks and compared it to its own Claude 3 Opus model, as well as OpenAI's latest model, GPT-4o, Google's Gemini 1.5 Pro, and Meta's Llama-400b. Claude 3.5 Sonnet came out on top in seven of the eight categories, with GPT-4o triumphing in the other.

The new version of Claude beat the competition in graduate-level reasoning, code, multilingual math, reasoning over text, mixed evaluations, and grade school math. It took second place to GPT-4o in math problem-solving. When tested for undergraduate-level knowledge, Claude 3.5 Sonnet was the winner when using a 5-shot method, in which the model is given five worked examples before the actual question is asked. However, in 0-shot testing, where no prior examples are given, Claude 3.5 Sonnet was narrowly beaten by GPT-4o.
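
To make the difference between 0-shot and 5-shot testing concrete, here is a minimal Python sketch of how such prompts might be assembled. It is not Anthropic's evaluation code: the build_prompt helper, the "Q:"/"A:" layout, and the arithmetic examples are all hypothetical and chosen purely for illustration.

```python
def build_prompt(question, examples=()):
    """Return a prompt with zero or more worked examples placed before the question."""
    parts = []
    for example_question, example_answer in examples:
        # Each worked example shows the model a question together with its answer.
        parts.append(f"Q: {example_question}\nA: {example_answer}")
    # The real question comes last, with the answer left blank for the model.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


# Hypothetical worked examples, invented for this illustration.
EXAMPLES = [
    ("What is 2 + 2?", "4"),
    ("What is 5 + 3?", "8"),
    ("What is 9 + 1?", "10"),
    ("What is 6 + 7?", "13"),
    ("What is 4 + 4?", "8"),
]

zero_shot = build_prompt("What is 7 + 6?")            # no examples: 0-shot
five_shot = build_prompt("What is 7 + 6?", EXAMPLES)  # five examples: 5-shot

print(five_shot)
```

The 0-shot prompt contains only the question, while the 5-shot prompt shows the model five solved questions first, which is one reason the same model can score differently on the same benchmark.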

Claude 3.5 Sonnet also has improved vision capabilities, which make it better at interpreting visual data such as charts. It was tested against other models on visual reasoning tasks and came out on top in all but one instance, where it was again beaten by GPT-4o.

Is Claude 3.5 Sonnet now the best AI?

It's hard to say with any degree of accuracy

Does this mean that Claude 3.5 Sonnet is now the best AI out there? As already mentioned, benchmarks should be taken with a pinch of salt, and abilities in narrow fields don't mean that the AI chatbot will perform better for general use.

While Claude 3.5 Sonnet certainly boasts impressive performance in benchmark testing, it still has many of the same weaknesses as its rivals.

For example, I tried a question that has been stumping many AI chatbots, and asked Claude 3.5 Sonnet how many times the letter R appears in the word strawberry. Claude 3.5 Sonnet's response was that there are two (there are three, if you can't be bothered to count), and when asked which positions these were in, it responded that they were the third and eighth letters. It's true that there are Rs in those positions, but there's also one in the ninth position.
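
For anyone who would rather not count by hand, the correct answer is easy to check in a few lines of Python. This is only a sanity check of the arithmetic, not a reflection of how a chatbot processes text, and the variable names are my own.

```python
word = "strawberry"
letter = "r"

# Count how many times the letter occurs in the word.
count = word.count(letter)

# Record each position of the letter, counting from 1 as a person would.
positions = [i for i, ch in enumerate(word, start=1) if ch == letter]

print(count)      # 3
print(positions)  # [3, 8, 9]
```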

Anthropic also introduces Artifacts

A separate window makes your workflow less cluttered

Anthropic also introduced a new feature called Artifacts that is coming to its models. This is essentially just a separate window where the more complex output from your prompts is visible so that your main chat doesn't get cluttered up. Generated images or code appear in this window instead of within your main chat window, and it's even possible to run code in this window to see it in action. It's a useful feature, but it doesn't really seem worthy of requiring its own name.