Structured Outputs
Structured output settings are exposed through OpenRouter for schema-driven or format-controlled responses.
This model is a variant of GPT-3.5 Turbo tuned for instructional prompts and omitting chat-related optimizations. Training data: up to Sep 2021.
This model accepts text input and returns text output.
OpenRouter currently lists a context window of 4.1K tokens and a maximum output of 2,000 tokens.
GPT-3.5 Instruct Deprecated discussions are most active in r/ChatGPT, r/OpenAI, and r/OxENV. The top Reddit threads are mostly benchmark and model-comparison threads.
The strongest match in this snapshot has 388 upvotes and 75 comments.
[This Twitter thread](https://twitter.com/GrantSlatton/status/1703913578036904431) ([Nitter alternative](https://nitter.net/GrantSlatton/status/1703913578036904431) for those who aren't logged into Twitter and want to see the full thread) claims that [OpenAI's new language model gpt-3.5-turbo-instruct](https://analyticsindiamag.com/openai-releases-gpt-3-5-turbo-instruct/) can "readily" beat Lichess Stockfish level 4 ([Lichess Stockfish level and its rating](https://lichess.org/@/MagoGG/blog/stockfish-level-and-its-rating/CvL5k0jL)) and has a chess rating of "around 1800 Elo." [This tweet](https://twitter.com/nabeelqu/status/1703961405999759638) shows the style of prompts that are being used to get these results with the new language model.
I used website parrotchess\[dot\]com (discovered [here](https://twitter.com/OwariDa/status/1704179448013070560)) (EDIT: parrotchess doesn't exist anymore, as of March 7, 2024) to play multiple games of chess purportedly pitting this new language model vs. various levels at website Lichess, which supposedly uses Fairy-Stockfish 14 according to the Lichess user interface. My current results for all completed games: The language model is 5-0 vs. Fairy-Stockfish 14 level 5 ([game 1](https://lichess.org/eGSWJtNq), [game 2](https://lichess.org/pN7K9bdS), [game 3](https://lichess.org/aK4jQvdo), [game 4](https://lichess.org/S9SGg8YI), [game 5](https://lichess.org/OqzdkDhE)), and 2-5 vs. Fairy-Stockfish 14 level 6 ([game 1](https://lichess.org/zP68C6H4), [game 2](https://lichess.org/4XKUIDh1), [game 3](https://lichess.org/1zTasRRp), [game 4](https://lichess.org/lH1EMqJQ), [game 5](https://lichess.org/mdFlTbMn), [game 6](https://lichess.org/HqmELNhw), [game 7](https://lichess.org/inWVs05Q)). Not included in the tally are games that I had to abort because the parrotchess user interface stalled (5 instances), because I accidentally copied a move incorrectly in the parrotchess user interface (numerous instances), or because the parrotchess user interface doesn't allow the promotion of a pawn to anything other than queen (1 instance). **Update: There could have been up to 5 additional losses - the number of times the parrotchess user interface stalled - that would have been recorded in this tally if** [this language model resignation bug](https://twitter.com/OwariDa/status/1705894692603269503) **hadn't been present. Also, the quality of play of some online chess bots can perhaps vary depending on the speed of the user's hardware.**
The following is a screenshot from parrotchess showing the end state of the first game vs. Fairy-Stockfish 14 level 5:
https://preview.redd.it/4ahi32xgjmpb1.jpg?width=432&format=pjpg&auto=webp&s=7fbb68371ca4257bed15ab2828fab58047f194a4
The game results in this paragraph are from using parrotchess after the aforementioned resignation bug was fixed. The language model is 0-1 vs. Fairy-Stockfish 14 level 7 ([game 1](https://lichess.org/Se3t7syX)), and 0-1 vs. Fairy-Stockfish 14 level 8 ([game 1](https://lichess.org/j3W2OwrP)).
There is [one known scenario](https://twitter.com/OwariDa/status/1706823943305167077) ([Nitter alternative](https://nitter.net/OwariDa/status/1706823943305167077)) in which the new language model purportedly generated an illegal move using language model sampling temperature of 0. Previous purported illegal moves that the parrotchess developer examined [turned out](https://twitter.com/OwariDa/status/1706765203130515642) ([Nitter alternative](https://nitter.net/OwariDa/status/1706765203130515642)) to be due to parrotchess bugs.
There are several other ways to play chess against the new language model if you have access to the OpenAI API. The first way is to use the OpenAI Playground as shown in [this video](https://www.youtube.com/watch?v=CReHXhmMprg). The second way is chess web app gptchess\[dot\]vercel\[dot\]app (discovered in [this Twitter thread](https://twitter.com/willdepue/status/1703974001717154191) / [Nitter thread](https://nitter.net/willdepue/status/1703974001717154191)). Third, another person modified that chess web app to additionally allow various levels of the Stockfish chess engine to autoplay, resulting in chess web app chessgpt-stockfish\[dot\]vercel\[dot\]app (discovered in [this tweet](https://twitter.com/paul_cal/status/1704466755110793455)).
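For anyone scripting this against the API rather than using one of the web apps, here is a rough sketch of how a prompt could be assembled. It assumes the prompts are PGN-style game transcripts that the completion model is asked to continue; that is an assumption on my part, so check the linked tweet for the exact prompts. The function name `build_pgn_prompt`, the header values, and the player names are all hypothetical, and no API call is made here.

```python
from typing import Sequence


def build_pgn_prompt(moves_san: Sequence[str],
                     white: str = "Player A",
                     black: str = "Player B") -> str:
    """Build a PGN-style transcript that stops where the next move should go.

    The header values are placeholders (assumptions), not the ones used in
    the linked tweets.
    """
    headers = [
        '[Event "Casual Game"]',
        f'[White "{white}"]',
        f'[Black "{black}"]',
    ]
    parts = []
    for i, move in enumerate(moves_san):
        if i % 2 == 0:
            # White is to move: emit the move number first, e.g. "1."
            parts.append(f"{i // 2 + 1}.")
        parts.append(move)
    if len(moves_san) % 2 == 0:
        # White moves next, so end the prompt on the upcoming move number.
        parts.append(f"{len(moves_san) // 2 + 1}.")
    return "\n".join(headers) + "\n\n" + " ".join(parts)


prompt = build_pgn_prompt(["e4", "e5", "Nf3"])
print(prompt)
```

The resulting text would be sent to the completion endpoint, and the first move-like token in the completion taken as the model's move. The returned move still needs to be checked for legality on your side.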
Results from other people:
a) Results from hundreds of games in blog post [Debunking the Chessboard: Confronting GPTs Against Chess Engines to Estimate Elo Ratings and Assess Legal Move Abilities](https://blog.mathieuacher.com/GPTsChessEloRatingLegalMoves/).
b) Results from 150 games: [GPT-3.5-instruct beats GPT-4 at chess and is a \~1800 ELO chess player. Results of 150 games of GPT-3.5 vs stockfish and 30 of GPT-3.5 vs GPT-4](https://www.reddit.com/r/MachineLearning/comments/16q81fh/d_gpt35instruct_beats_gpt4_at_chess_and_is_a_1800/). [Post #2](https://www.reddit.com/r/chess/comments/16q8a3b/new_openai_model_gpt35instruct_is_a_1800_elo/). The developer later noted that due to bugs the legal move rate [was](https://twitter.com/a_karvonen/status/1706057268305809632) actually above 99.9%. It should also be noted that these results [didn't use](https://www.reddit.com/r/chess/comments/16q8a3b/comment/k1wgg0j/) a language model sampling temperature of 0, which I believe could have induced illegal moves.
c) Chess bot [gpt35-turbo-instruct](https://lichess.org/@/gpt35-turbo-instruct/all) at website Lichess.
d) Chess bot [konaz](https://lichess.org/@/konaz/all) at website Lichess.
From blog post [Playing chess with large language models](https://nicholas.carlini.com/writing/2023/chess-llm.html):
>Computers have been better than humans at chess for at least the last 25 years. And for the past five years, deep learning models have been better than the best humans. But until this week, in order to be good at chess, a machine learning model had to be explicitly designed to play games: it had to be told explicitly that there was an 8x8 board, that there were different pieces, how each of them moved, and what the goal of the game was. Then it had to be trained with reinforcement learning against itself. And then it would win.
>
>This all changed on Monday, when OpenAI released GPT-3.5-turbo-instruct, an instruction-tuned language model that was designed to just write English text, but that people on the internet quickly discovered can play chess at, roughly, the level of skilled human players.
Post [Chess as a case study in hidden capabilities in ChatGPT](https://www.lesswrong.com/posts/F6vH6fr8ngo7csDdf/chess-as-a-case-study-in-hidden-capabilities-in-chatgpt) from last month covers a different prompting style used for the older chat-based GPT 3.5 Turbo language model. If I recall correctly from my tests with ChatGPT-3.5, using that prompt style with the older language model can defeat Stockfish level 2 at Lichess, but I haven't been successful in using it to beat Stockfish level 3. In my tests, both the quality of play and the frequency of illegal attempted moves seem to be better with the new prompt style with the new language model compared to the older prompt style with the older language model.
Related article: [Large Language Model: world models or surface statistics?](https://thegradient.pub/othello/)
P.S. Since some people claim that language model gpt-3.5-turbo-instruct is always playing moves memorized from the training dataset, I searched for data on the uniqueness of chess positions. From [this video](https://youtu.be/DpXy041BIlA?t=2225), we see that for a certain game dataset there were 763,331,945 chess positions encountered in an unknown number of games without removing duplicate chess positions, 597,725,848 different chess positions reached, and 582,337,984 different chess positions that were reached only once. Therefore, for that game dataset the probability that a chess position in a game was reached only once is 582,337,984 / 763,331,945 = 76.3%. For the larger dataset [cited](https://youtu.be/DpXy041BIlA?t=2187) in that video, there are approximately (506,000,000 - 200,000) games in the dataset (per [this paper](http://tom7.org/chess/survival.pdf)), and 21,553,382,902 different game positions encountered. Each game in the larger dataset added a mean of approximately 21,553,382,902 / (506,000,000 - 200,000) = 42.6 different chess positions to the dataset. For [this different dataset](https://lichess.org/blog/Vs0xMTAAAD4We4Ey/opening-explorer) of \~12 million games, \~390 million different chess positions were encountered. Each game in this different dataset added a mean of approximately (390 million / 12 million) = 32.5 different chess positions to the dataset. From the aforementioned numbers, we can conclude that a strategy of playing only moves memorized from a game dataset would fare poorly, because new chess games frequently reach positions that are not present in the game dataset.
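The arithmetic in the P.S. can be rechecked with a few lines of Python. This is only a sanity check of the quoted figures; the variable names are mine, and the inputs are copied from the post rather than re-derived from the datasets.

```python
# Figures quoted in the P.S. (from the linked video, paper, and Lichess blog post).
positions_total = 763_331_945      # positions encountered, duplicates included
positions_seen_once = 582_337_984  # distinct positions reached exactly once

# Share of encountered positions that occurred only once in the first dataset.
share_seen_once = positions_seen_once / positions_total
print(f"reached only once: {share_seen_once:.1%}")

# Larger dataset: mean number of distinct positions contributed per game.
larger_games = 506_000_000 - 200_000
larger_distinct_positions = 21_553_382_902
per_game_larger = larger_distinct_positions / larger_games
print(f"larger dataset: {per_game_larger:.1f} distinct positions per game")

# Lichess opening explorer dataset (approximate figures).
lichess_games = 12_000_000
lichess_distinct_positions = 390_000_000
per_game_lichess = lichess_distinct_positions / lichess_games
print(f"Lichess dataset: {per_game_lichess:.1f} distinct positions per game")
```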
TL;DR: Title.
Honestly, I'm surprised I haven't seen any kind of explanation for this unintended feature. (Some data are available at [https://github.com/adamkarvonen/chess\_gpt\_eval](https://github.com/adamkarvonen/chess_gpt_eval), or try for yourselves at [https://parrotchess.com/](https://parrotchess.com/).)
Because to me, the only reasonable explanation is it can somehow \*understand\* the rules of chess. This would also mean it's at least, in some form, intelligent.
Hi,
Here’s a quick example of how to reliably get JSON output using the newly released gpt-3.5-turbo-instruct model. This is not a full tutorial, just sample code with some context.
# Context
Since completion models allow for partial completions, it’s been possible to prompt ada/curie/davinci with something like:
"""Here's a JSON representing a person:
{"name": [insert_name_here_pls],
"age": [insert_age_here_pls]}
"""
And make them fill in the blanks, thus returning an easily parsable JSON-like string.
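To make the fill-in-the-blanks idea concrete, here is a minimal sketch that uses no particular client library. The prompt ends in the middle of a JSON object, and whatever the model returns is glued back onto that prefix and parsed. `complete` is a hypothetical stand-in for a real completion call, and `fake_complete` just returns canned text so the snippet runs offline; the prompt wording and function names are assumptions too.

```python
import json
from typing import Callable

JSON_PREFIX = '{"name": "'


def build_person_prompt(bio: str) -> str:
    """Return a prompt that stops in the middle of a JSON object."""
    return (
        "Here's a JSON representing the person described below.\n"
        f"Description: {bio}\n"
        + JSON_PREFIX
    )


def complete_person(bio: str, complete: Callable[[str], str]) -> dict:
    """Ask the model to finish the object, then parse prefix + completion."""
    completion = complete(build_person_prompt(bio))
    # Naive cut at the first closing brace; with a real API you would
    # typically pass "}" as a stop sequence instead. This breaks if a
    # value itself contains "}".
    json_text = JSON_PREFIX + completion.split("}")[0] + "}"
    return json.loads(json_text)


def fake_complete(prompt: str) -> str:
    # Canned response standing in for a real completion model.
    return 'Ada Lovelace", "age": 36}\nSome trailing text'


person = complete_person("Ada Lovelace, aged 36", fake_complete)
print(person)
```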
Chat models do not support such functionality, making it somewhat troublesome (or at least requiring additional tokens) to make them output a JSON reliably (but given the comparative price-per-token — still totally worth it).
**gpt-3.5-turbo-instruct** is a high-quality **completion** model, arguably making it davinci on the cheap.
**Note (Update 2):** depending on your use-case, you may be just fine with the output provided by the function calling feature ([https://openai.com/blog/function-calling-and-other-api-updates](https://openai.com/blog/function-calling-and-other-api-updates)), as it's always a perfect JSON (but may be lacking in content quality for more complex cases, IMO). So try it first, before proceeding with the route outlined here.
# Tools
Although, when it comes to LLMs, it may still be a little too early to fully commit to a particular set of tools, **Guidance** ([https://github.com/guidance-ai/guidance](https://github.com/guidance-ai/guidance)) appears to be a very mature library that simplifies interactions with LLMs. So I'll use it in this example.
# Sample Task
Let's say, we have a bunch of customer product surveys, and we need to summarize and categorize them.
# Code
Let's go straight to the copy-pastable code that gets the job done.
import os
from dotenv import load_dotenv
load_dotenv()
api_key = os.getenv('OPENAI_API_KEY')
# The lines above load the API key from the environment (.env file). Feel free to just go: api_key = "abcd..."
import guidance
import json
guidance.llm = guidance.llms.OpenAI("gpt-3.5-turbo-instruct", api_key=api_key)
# pre-defining survey categories
my_categories = ["performance", "price", "compatibility", "support", "activation"]
# defining our prompt
survey_anlz_prompt = guidance("""
Customer's survey analysis has to contain the following parameters:
- summary: a short 1-12 word summary of the survey comment;
- score: an integer from 1 to 10 reflecting the survey score;
- category: an aspect of the survey that is stressed the most.
INPUT:
"{{survey_text}}"
OUTPUT:
```json
{
"summary": "{{gen 'name' max_tokens=20 stop='"'}}",
"score": {{gen 'score' max_tokens=2 stop=','}},
"category": "{{select 'category' logprobs='logprobs' options=categories}}"
}```""")
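def process_survey_text(prompt, survey_text):
    # Run the Guidance program, filling in the template variables
    output = prompt(categories=my_categories, survey_text=survey_text, caching=False)
    # Keep only the JSON that follows the json fence marker, dropping the 3-character closing fence
    json_str = str(output).split("```json")[1][:-3]
    json_obj = json.loads(json_str)
    return json_obj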
my_survey_text_1 = """The product is good, but the price is just too high. I've no idea who's paying $1500/month. You should totally reconsider it."""
my_survey_text_2 = """WTF? I've paid so much money for it, and the app is super slow! I can't work! Get in touch with me ASAP!"""
print(process_survey_text(survey_anlz_prompt,my_survey_text_1))
print(process_survey_text(survey_anlz_prompt,my_survey_text_2))
The result looks like this:
{'summary': 'Good product, high price', 'score': 6, 'category': 'price'}
{'summary': 'Slow app, high price', 'score': 1, 'category': 'performance'}
# Notes
Everything that's being done when defining the prompt is pretty much described at [https://github.com/guidance-ai/guidance](https://github.com/guidance-ai/guidance) right in the readme, but just to clarify a couple of things:
\- note that the **stop tokens** (e.g. `stop=','`) are different for "name" and "score" (`"` and `,` respectively) because one is supposed to be a string and the other — an integer;
\- in the readme, you'll also see Guidance patterns like "strength": `{{gen 'strength' pattern='[0-9]+'...}}`; just be aware that they're not supported with OpenAI models, so you'll get an error;
\- just like with the chat model, you can significantly improve the quality by providing some examples of what you need inside the prompt.
**Update.** It's important to point out that this approach will cause a higher token usage, since under the hood, the model is being prompted separately for each key. As suggested by u/Baldric, it might make sense to use it as a backup route in case the result of a more direct approach doesn't pass validation (either when it's an invalid JSON or e.g. if a model hallucinates a value instead of selecting from a given list).
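Here is a rough sketch of that backup route, not a tested production setup. `direct` and `guided` are hypothetical callables: the first stands for a single cheap prompt that returns raw text, the second for the per-key Guidance approach from the code above. The validation rules mirror the prompt (non-empty summary, integer score from 1 to 10, category from the predefined list).

```python
import json
from typing import Callable, Optional, Sequence

CATEGORIES = ["performance", "price", "compatibility", "support", "activation"]


def validate_analysis(raw: str, categories: Sequence[str]) -> Optional[dict]:
    """Return the parsed object if it passes all checks, otherwise None."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None  # not valid JSON at all
    if not isinstance(obj, dict):
        return None
    summary, score, category = obj.get("summary"), obj.get("score"), obj.get("category")
    if not isinstance(summary, str) or not summary.strip():
        return None
    # bool is a subclass of int in Python, so rule it out explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 10:
        return None
    if category not in categories:
        return None  # e.g. the model hallucinated a category
    return obj


def analyze_with_fallback(survey_text: str,
                          direct: Callable[[str], str],
                          guided: Callable[[str], dict],
                          categories: Sequence[str] = CATEGORIES) -> dict:
    """Try the cheap direct route first; fall back to the guided route."""
    result = validate_analysis(direct(survey_text), categories)
    if result is not None:
        return result
    return guided(survey_text)


# Stub models so the sketch runs offline.
def bad_direct(text: str) -> str:
    return '{"summary": "Too slow", "score": 2, "category": "shipping"}'


def stub_guided(text: str) -> dict:
    return {"summary": "Too slow", "score": 2, "category": "performance"}


print(analyze_with_fallback("The app is super slow!", bad_direct, stub_guided))
```

In this run the direct output fails validation because "shipping" is not in the list, so the guided result is returned.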
[This Twitter thread](https://twitter.com/GrantSlatton/status/1703913578036904431) ([link at Nitter](https://nitter.net/GrantSlatton/status/1703913578036904431)) claims that [OpenAI's new language model gpt-3.5-turbo-instruct](https://analyticsindiamag.com/openai-releases-gpt-3-5-turbo-instruct/) can readily defeat Lichess Stockfish level 4. I used website parrotchess\[dot\]com (discovered [here](https://twitter.com/OwariDa/status/1704179448013070560)) to play multiple games of chess pitting this new language model vs. various levels of Stockfish at website Lichess. The language model is 2-0 vs. Lichess Stockfish level 5 ([game 1](https://lichess.org/eGSWJtNq), [game 2](https://lichess.org/pN7K9bdS)), and 0-2 vs. Lichess Stockfish level 6 ([game 1](https://lichess.org/zP68C6H4), [game 2](https://lichess.org/4XKUIDh1)). One game was aborted because the language model apparently made an illegal move. **Update**: The latest game record tally is in [this post](https://www.reddit.com/r/MachineLearning/comments/16oi6fb/n_openais_new_language_model_gpt35turboinstruct/).
The following is a screenshot from the chess web app showing the end state of the first game vs. Lichess Stockfish level 5:
https://preview.redd.it/ni8hukh7aapb1.jpg?width=432&format=pjpg&auto=webp&s=5976dd2248c248e220edfca9aab83f7ebbac68ee
[Tweet](https://twitter.com/nabeelqu/status/1703961405999759638) from another person who purportedly got the new language model to beat Lichess Stockfish level 5.
Related article for a different board game: [Large Language Model: world models or surface statistics?](https://thegradient.pub/othello/)