Spanglish is the new captcha -- you can use it to prove that you are not a robot.
Everyone who implements this idea on their website, please make sure that only people who were already verified as not-robots can read the Spanglish texts. Otherwise you are preparing learning material for the next generation of chatbots who will be better at impersonating humans.
Or maybe this will lead to a Red Queen situation, where people regularly switch to new languages after the chatbots learn the old ones.
I was able to get a little Spanglish with my first prompt, with "I usually" in an otherwise Spanish sentence.
Ana was telling her friend Juan about her weekends. They have an inside joke of speaking Spanglish with each other - a mix of English an Spanish. "Yo love to tomar mi coffee todos los Sabado por la morning," Ana began. She continued to talk to Juan in Spanglish, telling him about visiting her brother and her nephews. Here's what she said:
Yo love to tomar mi coffee todos los Sabado por la morning. Es mi ritual. Despues, I usually visito a mi hermano y a mis sobrinos. Los ninos son tan divertidos - siempre estan haciendo algo loco. Ayer, estuvimos en el parque y los ninos se subieron a todos los juegos. Mi sobrino mayor, Juanito, es tan competitivo. Siempre quiere ganar en todo! Pero a mi me encanta verlos jugar y divertirse. Es una de las cosas que mas disfruto de mi fin de semana.
Then I tried a couple more times with variants and got all Spanish.
Content format: Commented chat screenshots, and then some thoughts on their implications.
Epistemic status: Exploratory
Introduction
Code-mixing is the ad-hoc mixing of two or more linguistic varieties (such as languages or dialects) in the same communicative instance. An example of code-mixing would be a sentence written with words both in Spanish and in English, or with novel words made by combining Spanish and English roots, suffixes or prefixes.
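To make the definition concrete, here is a minimal sketch of a word-level check for Spanish/English code-mixing. The two vocabularies are tiny hand-picked sets that I am assuming purely for illustration (they are not real lexicons), and the function names and the `min_each` threshold are hypothetical. It only handles whole-word mixing, not novel words built from mixed roots and affixes.

```python
import re

# Tiny illustrative vocabularies (assumptions for this sketch, not real lexicons).
ENGLISH = {"i", "love", "to", "coffee", "morning", "usually", "the", "my"}
SPANISH = {"yo", "tomar", "mi", "todos", "los", "por", "la", "es", "ritual"}


def language_counts(sentence):
    """Count how many tokens fall in each toy vocabulary."""
    tokens = re.findall(r"[a-záéíóúñü]+", sentence.lower())
    counts = {"en": 0, "es": 0, "unknown": 0}
    for tok in tokens:
        if tok in ENGLISH:
            counts["en"] += 1
        elif tok in SPANISH:
            counts["es"] += 1
        else:
            counts["unknown"] += 1
    return counts


def is_code_mixed(sentence, min_each=2):
    """Call a sentence code-mixed if both languages contribute enough words."""
    counts = language_counts(sentence)
    return counts["en"] >= min_each and counts["es"] >= min_each


print(is_code_mixed("Yo love to tomar mi coffee todos los Sabado por la morning"))
print(is_code_mixed("Es mi ritual"))
```

A real detector would need proper language identification per token; the point here is only that a code-mixed sentence draws a substantial share of its words from each language within a single utterance.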
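Here, I document a series of experiments designed to test ChatGPT's capabilities for understanding and generating code-mixed text. I tested it with three code-mixes: Spanglish (English + Spanish), Franglais (English + French), and Frenspanglish (English + Spanish + French).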
The first two are well-known phenomena in multilingual communities such as those in Quebec and the southwestern US, while the third is quite obscure and, as far as I know, does not occur naturally on a large scale. Franglais is usually regarded as the insertion of English features into French, while Spanglish is regarded as a more symmetrical phenomenon.
This is why I decided to prompt for Franglais understanding in French, and for Spanglish and Frenspanglish understanding in English. When prompting ChatGPT to translate into a code-mixed language, I prompt in the same language as the text to be translated.
I find that ChatGPT exhibits impressive abilities to understand text written in such code-mixes. However, despite repeated attempts at prompt engineering, I have not been able to make ChatGPT generate proper code-mixed text.
Understanding code-mixed text
Understanding Spanglish (English + Spanish)
Understanding Franglais (English + French)
Understanding Frenspanglish (English + Spanish + French)
This took a bit of prompt engineering.
Failing to generate code-mixed text
Note: All of these experiments were performed on a chat instance where ChatGPT had already received an explanation of the relevant code-mix.
Failing to generate Spanglish (English + Spanish)
Failing to generate Franglais (English + French)
Failing to generate Frenspanglish (English + Spanish + French)
Observations
Three tame hypotheses
A speculative model
Writing can be modeled as the iterative process of constructing a sequence of words one by one. When writing, there are two main sources contributing to your decision of which word to choose as the next one to write.
This is pretty much the distinction between type 1 (fast, intuitive) and type 2 (slow, deliberate) thinking.
ChatGPT's base model was trained to predict the next token of text in a sequence. This is analogous to the type 1 writing method in humans I just outlined. Anecdotally, it seems like people who have read more during their lives are better at it. Likewise, ChatGPT has been trained on an immense quantity of text, and is superhuman at next-token prediction. Code-mixed text, however, is a rare occurrence, and as such there is insufficient data for either humans or language models to be able to generate it using only type 1 processes. Humans can get around this problem by using type 2 reasoning[1]. Language models, however, are not capable of type 2 reasoning (or an analogous artificial process), and as such can't generate code-mixed text.
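The data-scarcity part of this argument can be illustrated with a toy count-based bigram predictor. This is only a sketch under loud assumptions: the four-sentence corpus is invented, the function names are hypothetical, and a bigram counter is far simpler than ChatGPT's actual model. It shows just one thing: if cross-language transitions never appear in the training text, a pure next-word predictor assigns them no probability.

```python
from collections import Counter, defaultdict

# Invented toy corpus: every sentence is monolingual (an assumption for illustration).
corpus = [
    "i love my coffee",
    "i love my brother",
    "yo tomo mi café",
    "yo visito a mi hermano",
]


def train_bigrams(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_prob(counts, prev, nxt):
    """Estimate P(nxt | prev) from raw counts, with no smoothing."""
    following = counts.get(prev, Counter())
    total = sum(following.values())
    return following[nxt] / total if total else 0.0


model = train_bigrams(corpus)
print(next_word_prob(model, "mi", "café"))    # same-language continuation
print(next_word_prob(model, "mi", "coffee"))  # cross-language continuation
```

With this corpus the Spanish continuation after "mi" gets half the probability mass and the English one gets none. Real language models generalize far better than raw counts, so this is a caricature of the hypothesis, not evidence for it.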
Appendix 1: additional basic examples of generative failure
Appendix 2: Trying really hard to get ChatGPT to produce Spanglish and (almost) entirely failing
I hazard the guess that Spanglish is the most represented code-mixed language in ChatGPT's training corpus, so I decided to focus my efforts here when trying to get ChatGPT to generate code-mixed output. All of these examples were zero-shot: each was the first message in a new chat instance. There were many more attempts; here I show only the most informative ones.
Notes
Thanks to Agustín Covarrubias for feedback on an earlier version of this post, and to the anonymous OpenAI employee[2] for an informal discussion on the matter.
[1] And writing code-mixed text does feel slow and deliberate to me, like an almost entirely type 2 activity.
[2] They asked me not to identify them personally in this post.