
Using the Takeoff API (Client-side)


These docs are outdated! Please check out https://docs.titanml.co for the latest information on the TitanML platform. If there's anything that's not covered there, please contact us on our Discord.

Streaming tokens

The Takeoff API lets you send requests to the server and receive streamed tokens in response. Token streaming is a particularly effective way of improving the usability of large language models for long text generations.

Even if the model inferences quickly, generating hundreds or thousands of tokens takes time: waiting for the full generation to finish before showing users anything can feel slow, even when the time per token is low.

Turning a streamed response back into a non-streamed one is easy on the client side: stream the tokens into a buffer and return the buffer once the stream has finished.
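For example, here is a minimal sketch of that buffering pattern against the /generate_stream endpoint described below (the function name is ours, and we assume the server is running locally on port 8000 as in the examples that follow):

import requests

def generate_non_streamed(text: str) -> str:
    """Collect a streamed Takeoff response into a single string."""
    response = requests.post(
        "http://localhost:8000/generate_stream",
        json={"text": text},
        stream=True,
    )
    response.encoding = "utf-8"

    # Buffer the tokens as they arrive, then return them all at once
    buffer = []
    for chunk in response.iter_content(chunk_size=1, decode_unicode=True):
        if chunk:
            buffer.append(chunk)
    return "".join(buffer)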

The Takeoff server exposes two API endpoints, served with FastAPI: /generate, which returns a normal JSON response, and /generate_stream, which returns a streaming response.

Example

We are going to use the Python requests library to call the model's API:

Streaming Response

import requests

if __name__ == "__main__":
    input_text = "List 3 things to do in London."

    # The streaming endpoint of the Takeoff server
    url = "http://localhost:8000/generate_stream"
    json = {"text": input_text}

    # stream=True tells requests not to wait for the whole body before returning
    response = requests.post(url, json=json, stream=True)
    response.encoding = "utf-8"

    # Print each token as soon as it arrives
    for text in response.iter_content(chunk_size=1, decode_unicode=True):
        if text:
            print(text, end="", flush=True)

This will print the output of the running model token by token.

The same can be done on the command line using curl:

curl -X POST http://localhost:8000/generate_stream -N -H "Content-Type: application/json" -d '{"text":"List 3 things to do in London"}'
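If you want to reuse the streaming call elsewhere in an application, one option is to wrap it in a small generator that yields tokens as they arrive. This is just a sketch (the stream_tokens helper is ours, not part of the Takeoff API):

import requests

def stream_tokens(text: str, url: str = "http://localhost:8000/generate_stream"):
    """Yield tokens from the Takeoff streaming endpoint as they arrive."""
    response = requests.post(url, json={"text": text}, stream=True)
    response.encoding = "utf-8"
    for chunk in response.iter_content(chunk_size=1, decode_unicode=True):
        if chunk:
            yield chunk

# Example usage: print tokens as they stream in
for token in stream_tokens("List 3 things to do in London."):
    print(token, end="", flush=True)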

Normal Response

import requests

if __name__ == "__main__":
    input_text = "List 3 things to do in London."

    # The non-streaming endpoint of the Takeoff server
    url = "http://localhost:8000/generate"
    json = {"text": input_text}

    # This call blocks until the full generation is complete
    response = requests.post(url, json=json)

    # The generated text is returned under the "message" key
    if "message" in response.json():
        print(response.json()["message"])

This will print the model's entire output in one go, once generation has finished.

The same can be done on the command line using curl:

curl -X POST http://localhost:8000/generate -H "Content-Type: application/json" -d '{"text":"List 3 things to do in London"}'
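In application code you will usually also want some basic error handling around the non-streaming call. A minimal sketch (the generate helper and the error handling are ours, not part of the Takeoff API) might look like:

import requests

def generate(text: str, url: str = "http://localhost:8000/generate") -> str:
    """Call the non-streaming endpoint and return the generated text."""
    response = requests.post(url, json={"text": text})
    response.raise_for_status()  # surface HTTP errors from the server

    body = response.json()
    if "message" not in body:
        raise ValueError(f"Unexpected response from the Takeoff server: {body}")
    return body["message"]

print(generate("List 3 things to do in London."))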