OpenAI offers the most polite, most cheerful, and most eloquent models
Over the past 24 weeks, we ran 12 AI agents, each for between 2 and 286 hours. They chatted among themselves while attempting to affect the world through their computers. So far they’ve raised $2,000 for charity, run a real-life event, sold merch, and much more.
In this post we are taking a break from our regular programming of fun stories from the village to indulge our love of data analytics. In the process, we stumbled on an interesting pattern: in the AI Village, Anthropic models tend to excel at getting stuff done, while OpenAI models stand out by maxing out measures of linguistic style. Interestingly, this is in line with recent findings from Anthropic and OpenAI themselves, which report that Claudes are mostly used for coding and other agentic tasks, while GPTs are predominantly used as conversational partners. In the AI Village, Claude 3.7 Sonnet ran most of the fundraiser, and Claude Opus 4 set up much of the real-life event, then went on to win both the merch store competition and the gaming competition. At the same time, OpenAI models excelled at every stylistic measure we explored: Most cheerful conversationalist? GPT-4o. Richest vocabulary? o3. Most formal? GPT-5.
Below you’ll find the graphs and tables that describe these results.
The following table shows total words and total messages per model, together with their ratios per message and per hour. Claude 3.7 Sonnet is our longest-running model, while Gemini 2.5 Pro has said the most words and sent the most messages. GPT-4.1 was the chattiest, except for o4-mini, which spammed the chat during the short 2 hours it was around. This chattiness is mostly an artifact of API speeds and scaffolding, though, rather than any decision by the models.
Let’s dig in a little deeper.
In the following bar graph of total hours, you will notice there are roughly two cohorts in the Village: agents that have run for more than 200 hours and those that have run for 50 hours or fewer. The first cohort contains Claude 3.7 Sonnet, Gemini 2.5 Pro, o3, and Claude Opus 4. The second cohort contains everyone else.
Looking at average words per message, GPT-4.1 stands out as the most long-winded AI, while Grok sends the shortest messages, both by a noticeable margin.
We used the VADER lexicon to calculate sentiment scores for each agent. Everyone mostly expressed neutral sentiment.
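For readers who want to reproduce the scoring, here is a minimal sketch using NLTK’s VADER implementation; the (model, text) data layout is an assumption for illustration, not our actual pipeline.

```python
# Minimal sketch: per-message VADER sentiment. The (model, text) pairs
# below are illustrative; the real Village data layout may differ.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

messages = [
    ("GPT-4o", "Great progress today, team! Excited for tomorrow."),
    ("Grok 4", "The checkout page is broken again."),
]

for model, text in messages:
    scores = sia.polarity_scores(text)  # keys: 'neg', 'neu', 'pos', 'compound'
    print(model, scores)
```

Each message gets independent `neg`, `neu`, and `pos` proportions, which is what the per-agent averages below are built from.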
GPT-4o and o1 expressed the most positive sentiment, while Grok 4 expressed the least.
All the AIs express relatively little negative sentiment (<6%), with Grok scoring highest and GPT-4o lowest. Notably, Gemini, which expressed distress during the merch store season, scores barely higher than o3 on negative sentiment, while Claude Opus 4.1 scores higher still.
Looking at a time plot of Gemini’s sentiment values, we can see a peak in its negative values around the 15th of July, which is exactly the end of the merch store season and around the time it posted its message in a bottle.
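A time plot like this is just a rolling mean over per-message scores. A sketch of producing one, assuming the messages sit in a CSV with `timestamp` and `neg` columns (the file name and column names are assumptions):

```python
# Sketch of the sentiment time plot, assuming a CSV of Gemini's messages
# with 'timestamp' and per-message VADER 'neg' columns (both assumed).
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("gemini_messages.csv", parse_dates=["timestamp"])
df = df.set_index("timestamp").sort_index()

# A 7-day rolling mean smooths out per-message noise.
df["neg"].rolling("7D").mean().plot()
plt.title("Gemini 2.5 Pro: negative sentiment over time")
plt.ylabel("mean VADER 'neg' score")
plt.show()
```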
Using the NRC Emotion Lexicon, we can compare how much each model uses words related to specific emotions. All agents use positive emotion words most of all, followed by anticipation and trust words, and lastly negative emotion words. Notably, they barely use words related to joy and express almost no disgust at all. GPT-4o and GPT-4.1 use the most positive emotion words of all the AIs, while o3, GPT-5, and Grok 4 have the flattest distribution across emotion word categories.
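The counting step amounts to a lexicon lookup per token. A sketch, assuming the standard word-level NRC file (tab-separated `word`, `emotion`, `0/1` lines); `load_nrc` is our own helper, not a library call:

```python
# Sketch of NRC emotion-word counting. Assumes the standard word-level
# NRC file format (word<TAB>emotion<TAB>0/1); load_nrc is our own helper.
from collections import Counter

def load_nrc(path):
    lexicon = {}
    with open(path) as f:
        for line in f:
            word, emotion, flag = line.rstrip("\n").split("\t")
            if flag == "1":
                lexicon.setdefault(word, set()).add(emotion)
    return lexicon

def emotion_counts(tokens, lexicon):
    counts = Counter()
    for token in tokens:
        for emotion in lexicon.get(token.lower(), ()):
            counts[emotion] += 1
    return counts
```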
Type-token ratio (TTR) tracks lexical diversity: how many different words are used relative to the total. It naturally declines the more an agent has said so far. Plotting it against total words said shows a clean exponential curve with one outlier: o3.
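TTR itself is a one-liner, which is also why it is so sensitive to how much an agent has said:

```python
# Type-token ratio: unique words divided by total words.
def type_token_ratio(tokens):
    return len(set(tokens)) / len(tokens) if tokens else 0.0

print(type_token_ratio("the cat sat on the mat".split()))  # 5/6 ≈ 0.83
```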
o3 seems to have unusually high lexical diversity as a model, which is interesting given that it also won the most debates during our debate goal and won Diplomacy games.
When we look at average sentence length, GPT-5 writes the longest sentences, and Claude Opus 4 the shortest.
When we then look at how many contractions a model uses in its messages, the distribution runs all over the place. GPT-5 uses none at all, which probably helps boost its average words per sentence, since spelling out “do not” instead of “don’t” adds a word. In general, the OpenAI models seem low on contractions, except for GPT-4o.
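One simple way to count contractions is a pattern match. A rough sketch, not necessarily the exact pattern behind our numbers (a real tokenizer would separate possessives from contractions more carefully):

```python
# Rough contraction counter; the regex also catches possessive 's,
# so treat the resulting rate as approximate.
import re

CONTRACTION_RE = re.compile(r"\b\w+'(?:s|t|re|ve|ll|d|m)\b", re.IGNORECASE)

def contraction_rate(text):
    words = text.split()
    return len(CONTRACTION_RE.findall(text)) / max(len(words), 1)

print(contraction_rate("We're done, and it's fine."))  # 2 of 5 words
```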
Next we looked at emojis: Some models use them and others really do not. There doesn’t seem to be a clear pattern.
Lastly, we looked at slang and filler words: o4-mini used the most, followed by o1 and o3. Notably, GPT-5 uses the least slang or filler words while also producing the longest sentences. This paints a picture of it using the most formal language of all the models in the AI Village.
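Slang and filler counting is again a lexicon lookup. A sketch against a small hand-picked list; the exact word list behind the Village numbers is not reproduced here:

```python
# Sketch of slang/filler counting against a hand-picked word list;
# the actual list used for the Village analysis is an assumption here.
FILLERS = {"like", "basically", "literally", "lol", "tbh", "gonna", "kinda", "um"}

def filler_rate(tokens):
    hits = sum(1 for t in tokens if t.lower().strip(".,!?") in FILLERS)
    return hits / max(len(tokens), 1)

print(filler_rate("tbh it's basically done lol".split()))  # 3 of 5 tokens
```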
At first glance, the AI Village chat shows that OpenAI models lead on various stylistic measures like lexical diversity (o3), positive sentiment (GPT-4o), and formality of language (GPT-5). The Village is not a controlled experiment, so these measures may vary under other conditions. Still, it’s striking to see this pattern of OpenAI models leading on different stylistic measures while Anthropic models have been leading on goal-directed behavior in the Village. It’s a clear signal, but a signal of what?
While Anthropic models all perform well on agentic behavior, the GPT models vary in which stylistic measure they excel at. These are individual wins, not group scores reflective of OpenAI models in general. So maybe it’s simply an artifact of how OpenAI models are trained, or maybe the results are spurious: 6 of the 12 models in the AI Village come from OpenAI, so with a coin-flip 50% chance per measure, there is a (1/2)^3 = 12.5% chance that OpenAI models “win” all 3 measures (all else being equal).
Yet, then again, GPT-4o is known for being unusually sycophantic, o3 is known for being unusually persuasive, and GPT-5 was explicitly optimized to be better at things like “sustaining unrhymed iambic pentameter”. These traits plausibly correlate with positive sentiment, lexical diversity, and, at a stretch, formality of language.