The hard context window is only one constraint in the generation pipeline, so a response can be truncated even when you are below the advertised token limit. In practice, truncation is often caused by:
- Reserved output budgets
- Hidden system tokens
- Framework caps
- Streaming interruptions
- Stop conditions
- Safety filters
- Reasoning token allocations
- Middleware limits
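As a rough illustration of how the factors above eat into the budget, here is a minimal Python sketch. The `TokenBudget` fields, the `usable_output_tokens` helper, and every number are hypothetical and do not come from any provider's API. The sketch also assumes that hidden reasoning tokens count against the output cap, which varies between providers.

```python
from dataclasses import dataclass


@dataclass
class TokenBudget:
    """Hypothetical inputs; real values depend on your provider and stack."""
    context_window: int        # advertised window, e.g. 128_000
    prompt_tokens: int         # tokens in the prompt you send
    hidden_system_tokens: int  # assumed overhead from system or tool prompts
    reasoning_tokens: int      # assumed allocation for hidden reasoning
    output_cap: int            # assumed cap set by provider, framework, or middleware


def usable_output_tokens(b: TokenBudget) -> int:
    """Estimate the visible output tokens left under the assumptions above."""
    # Space left in the window after the prompt and hidden overhead.
    remaining = b.context_window - b.prompt_tokens - b.hidden_system_tokens
    # A separate output cap can be far smaller than the remaining window.
    remaining = min(remaining, b.output_cap)
    # Assumption: reasoning tokens are spent out of the same output cap.
    remaining -= b.reasoning_tokens
    return max(remaining, 0)


budget = TokenBudget(
    context_window=128_000,
    prompt_tokens=100_000,
    hidden_system_tokens=2_000,
    reasoning_tokens=1_500,
    output_cap=4_096,
)
print(usable_output_tokens(budget))  # 2596
```

With these made-up numbers the prompt sits 28k tokens below the window, yet only 2,596 tokens of visible output remain, because the output cap and the reasoning allocation bind first.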
"128k context window" does not imply that you will always receive 128k useable tokens for prompt + output.