Using the Takeoff API (Client-side)

These docs are outdated! Please check out https://docs.titanml.co for the latest information on the TitanML platform. If there's anything that's not covered there, please contact us on our Discord.

Streaming tokens

The Takeoff API lets you send requests to the server and receive tokens back as a stream. Especially for long text generations, token streaming is a very effective way of improving the usability of large language models.

Even if the time per token is low, waiting for a generation of hundreds or thousands of tokens to finish before showing any results to users can feel like a long time.

Converting a streamed response into a non-streamed one is easy on the client side: accumulate the tokens in a buffer and return the buffer once the stream is exhausted.
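For example, here is a minimal sketch of that buffering pattern, built on the /generate_stream endpoint described below (the generate_blocking function name is our own, and the URL and payload shape match the examples in the next section):

import requests

def generate_blocking(input_text: str) -> str:
    """Stream tokens from the server, but buffer them into one string."""
    response = requests.post(
        "http://localhost:8000/generate_stream",
        json={"text": input_text},
        stream=True,
    )
    response.encoding = "utf-8"

    buffer = []
    for chunk in response.iter_content(chunk_size=1, decode_unicode=True):
        if chunk:
            buffer.append(chunk)

    # Return the full text only once the stream is exhausted.
    return "".join(buffer)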

The server exposes two FastAPI endpoints: /generate, which returns a normal JSON response, and /generate_stream, which returns a streaming response.
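For reference, a pair of endpoints like this might look roughly as follows in FastAPI. This is only a sketch to show the shape of the two response types; the generate_tokens function is a hypothetical stand-in for the model call, and Takeoff's actual server implementation may differ:

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class GenerationRequest(BaseModel):
    text: str

def generate_tokens(prompt: str):
    # Hypothetical stand-in for the model's token generator.
    for token in ["Visit", " the", " British", " Museum", "."]:
        yield token

@app.post("/generate")
def generate(request: GenerationRequest):
    # Collect every token before responding with a single JSON body.
    message = "".join(generate_tokens(request.text))
    return {"message": message}

@app.post("/generate_stream")
def generate_stream(request: GenerationRequest):
    # Send tokens to the client as they are produced.
    return StreamingResponse(generate_tokens(request.text), media_type="text/plain")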

Example

We are going to use the Python requests library to call the model as an API:

Streaming Response

import requests

if __name__ == "__main__":
    input_text = "List 3 things to do in London."

    url = "http://localhost:8000/generate_stream"
    payload = {"text": input_text}

    # stream=True keeps the connection open so tokens arrive as they are generated.
    response = requests.post(url, json=payload, stream=True)
    response.encoding = "utf-8"

    # Print each decoded chunk as soon as it arrives.
    for text in response.iter_content(chunk_size=1, decode_unicode=True):
        if text:
            print(text, end="", flush=True)

This will print the model's output token by token as it is generated.

The same can be done on the command line using curl:

curl -X POST http://localhost:8000/generate_stream -N -H "Content-Type: application/json" -d '{"text":"List 3 things to do in London"}'

Normal Response

import requests

if __name__ == "__main__":
    input_text = "List 3 things to do in London."

    url = "http://localhost:8000/generate"
    payload = {"text": input_text}

    # Without stream=True, the call blocks until generation is complete.
    response = requests.post(url, json=payload)

    # The generated text is returned under the "message" key.
    data = response.json()
    if "message" in data:
        print(data["message"])

This will print the model's entire output once generation has finished.

The same can be done on the command line using curl:

curl -X POST http://localhost:8000/generate -H "Content-Type: application/json" -d '{"text":"List 3 things to do in London"}'
