The Takeoff API lets you send requests and receive streamed tokens back. Especially for long text generations, token streaming is a very effective way of improving the usability of large language models. Even if inference is fast, generating hundreds or thousands of tokens takes noticeable wall-clock time, and waiting for the full generation to finish before showing results to users can feel slow, even when the time per token is low.
Turning a streamed response back into a non-streamed one is easy on the client side: stream the tokens into a buffer and return the buffer once the stream finishes.
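For example, a small client-side helper along these lines collects the stream into a single string. This is just a sketch: generate_non_streamed is an illustrative name, and it assumes the /generate_stream endpoint shown below.

import requests

def generate_non_streamed(input_text: str) -> str:
    # Call the streaming endpoint, but accumulate the tokens locally.
    response = requests.post(
        "http://localhost:8000/generate_stream",
        json={"text": input_text},
        stream=True,
    )
    response.encoding = "utf-8"
    buffer = []
    for chunk in response.iter_content(chunk_size=1, decode_unicode=True):
        if chunk:
            buffer.append(chunk)
    # Return the buffered tokens as one complete, non-streamed result.
    return "".join(buffer)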
Takeoff serves two API endpoints via FastAPI: /generate, which returns a normal JSON response, and /generate_stream, which returns a streaming response.
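To illustrate the difference between the two response types, here is a minimal sketch of how such a pair of endpoints could be defined in FastAPI. This is not the actual Takeoff implementation: generate_tokens is a hypothetical stand-in for the model's token generator, but the response shapes match the client examples below (a JSON body with a "message" key for /generate, raw streamed text for /generate_stream).

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class GenerateRequest(BaseModel):
    text: str

def generate_tokens(text: str):
    # Hypothetical stand-in for the model's token generator.
    for token in ["1. Visit ", "the British ", "Museum..."]:
        yield token

@app.post("/generate")
def generate(request: GenerateRequest):
    # Non-streaming: collect every token, then return one JSON response.
    return {"message": "".join(generate_tokens(request.text))}

@app.post("/generate_stream")
def generate_stream(request: GenerateRequest):
    # Streaming: send tokens back as they are produced.
    return StreamingResponse(generate_tokens(request.text), media_type="text/plain")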
Example
We are going to use the Python requests library to call the model's API:
Streaming Response
import requests

if __name__ == "__main__":
    input_text = "List 3 things to do in London."
    url = "http://localhost:8000/generate_stream"

    # stream=True tells requests to fetch the body incrementally,
    # rather than waiting for the whole response to arrive.
    response = requests.post(url, json={"text": input_text}, stream=True)
    response.encoding = "utf-8"

    # Print each character as soon as it arrives.
    for text in response.iter_content(chunk_size=1, decode_unicode=True):
        if text:
            print(text, end="", flush=True)
This will print the model's output token by token, as it is generated. The same can be done on the command line using curl:
curl -X POST http://localhost:8000/generate_stream -N -H "Content-Type: application/json" -d '{"text":"List 3 things to do in London"}'

The -N flag disables curl's output buffering, so tokens are displayed as they arrive.

Normal Response
import requests

if __name__ == "__main__":
    input_text = "List 3 things to do in London."
    url = "http://localhost:8000/generate"

    # Without stream=True, this blocks until the full response is ready.
    response = requests.post(url, json={"text": input_text})

    # The generated text is returned under the "message" key.
    if "message" in response.json():
        print(response.json()["message"])
This will print the entire response at once, after generation has finished. The same can be done on the command line using curl:

curl -X POST http://localhost:8000/generate -H "Content-Type: application/json" -d '{"text":"List 3 things to do in London"}'