OpenAI add gpt-3.5-turbo to their API, charging $0.002 per 1k tokens. They cite "a series of system-wide optimizations" for 90% cost reduction.

Another example of the dizzying speed of language model progress.

New Comment
4 comments, sorted by Click to highlight new comments since: Today at 2:56 PM

# Notes:

## Whisper API:

* Whisper API is 4x cheaper than google's Speech to Text API.
* Max file size is 25 MB, rate limit is 50 requests per minute
* I think you would run into problems if you tried uploading 1.25 GB per minute though.
* Whisper pricing is minute-based! 
* That means it is not token or bandwidth based!
* How is accuracy impacted if I preprocess my audio to 2x or 5x speed?
* Trimming long silent pauses would also obviously reduce cost.
* Going to wait for load to increase before I attempt profiling endpoint latency.
* Did they give up on audio generation? Haven't heard anything since MuseNet/Jukebox.
* Barely documented, but `verbal_json` is the `response_format` you'll want.
* Provides stuff like duration, avg_logprob, compression_ratio, no_speech_prob, tokens, and transient.
* Huh, the Whisper repo uses GPT2TokenizerFast instead of tiktoken, wonder why.

## ChatGPT API:
* Chat API messages are “role” and “content” pairs.
* Three "roles":
* System: prompt, can add a `name` field with `example_user` or `example_assistant` (not nested)
* User: prompt, has more impact on output more than system prompt somehow. (Details?)
* Assistant: output of language model.
* Eventually "role" will be a more general header, to no one's surprise.
* Eventually "content" will be multimodal, again to no one's surprise.
* This feels like they released their actual first version out instead of taking time to refine/iterate.
* Subjectively: `response[‘choices’][0][‘message’][‘content’]`looks very ugly.
* What happened to the OpenAI I knew? Just reread and it sounds like a completely different company.
* They didn't put the Chat model in the playground. Deliberate omission or not part of launch list?
* Also omitted from being added to the [Prompt Comparison tool]( by Andrew Mayne (Science Communicator, 2.75 years tenure). 
* 12 params in chat vs 16 in completion. No best_of, echo, logprobs, or suffix.
* Won't miss any of them except logprobs. Hope they add them back!
* 4096 max tokens for gpt-3.5-turbo.
* Training data **up to Sep 2021**
* Will receive regular updates. Hopefully they don't do them silently like code-davinci-002.
* Input and output tokens treated equally for billing even though prefill is cheaper than decode.
* Consequence: high margins when conversation history is long and next message is short.
* Feel like there's a difference between this model and what you get at, need to do some more analysis of the model generated content to be sure.

## New Terms of Service

* The only interesting part in the new terms for me was this:

* "Processing of Personal Data. ... If you are governed by the GDPR or CCPA and will be using OpenAI for the processing of “personal data” as defined in the GDPR or “Personal Information,” please fill out this form to request to execute our Data Processing Addendum."
* Also, 3(c) is interesting since it says Non-API content will still be used for training, only API content is excluded by default. Retention period is 30d, no idea how easy it is for any random employee to pull up your content.
* New jobs posted in the last 24 hours:
* Order Management Specialist
* Software Engineer, Triton Compiler
* Security Engineer, Detection and Response
* Software Engineer, Full-Stack (for Codegen team and Programming Assistant team)
* Software Engineer, Billing and Monetization
* Feel like there's more "Legal Counsel" on than there used to be.


* Python lib commit by Atty Eleti
 * He joined relatively recently (5 months ago), background is in graphic design. 2017 grad.
* Node lib commit by David Schnurr:
* Same guy as usual, 2.75 years of tenure, has a background in data visualization. 2012 grad.
* by Logan Kilpatrick
* First developer relations person, 4 months of tenure.
* Walkthrough notebook by Ted Sanders
* Machine learning engineer, 1 year 4 months of tenure, background in consulting and data science, PhD Applied Physics 2016.
* This branch of Whisper by Jong Wook Kim:
* 3 years and 8 months of tenure.
* Transition Guide by Joshua J:
* Chat API FAQ by Johanna C:
* Data Usage for Consumer Services FAQ
* API reference for chat endpoint:
* Guide for chat endpoint:
* GPT-3.5 Models Page:
* New terms of use:
* Blog post:
* Authors not accounted for: Eli Georges, Joanne Jang, Rachel Lim, Luke Miller, Michelle Pokras.

Also, you can now use Whisper-v2 Large via API, and it's very fast!

further down on that page:

We are also now offering dedicated instances for users who want deeper control over the specific model version and system performance. By default, requests are run on compute infrastructure shared with other users, who pay per request. Our API runs on Azure, and with dedicated instances, developers will pay by time period for an allocation of compute infrastructure that’s reserved for serving their requests.

Developers get full control over the instance’s load (higher load improves throughput but makes each request slower), the option to enable features such as longer context limits, and the ability to pin the model snapshot.

Dedicated instances can make economic sense for developers running beyond ~450M tokens per day.

that suggests one shared “instance” is capable of processing > 450M tokens per day, i.e. $900 of API fees at this new rate. i don’t know what exactly their infrastructure looks like, but the marginal costs of the compute here have got to be still an order of magnitude lower than what they’re charging (which is sensible: they do have fixed costs they have to recoup, and they are seeking to profit).

Any idea what those optimizations are? I am drawing a blank.

New to LessWrong?