
Inferencing the model



These docs are outdated! Please check out https://docs.titanml.co for the latest information on the TitanML platform. If there's anything that's not covered there, please contact us on our Discord.

Now you're ready to run inference on your optimised model (and see how fast you can get results!)

Using iris infer

The iris infer command allows you to run inference on your chosen model through the Triton Inference Server without submitting your own HTTP request directly.

Here is the structure of the iris infer command:

iris infer \
    --target localhost \      # note localhost is the default target, so it does not need to be specified
    --port 8000 \             # note 8000 is the default port, so it does not need to be specified
    --task ... \              # the task for which the model you are deploying was originally fine-tuned
    --use_cpu \               # include this only if the optimised model is in ONNX format
    --text "..." \            # your inference text
    --context "..."           # input context; required when you run inference on a question-answering model

The same command can also be written on a single line:

iris infer --target localhost --port 8000 --task ... --use_cpu --text "..." --context "..."

The port should be the one on which you launched the Docker container in the previous step, the task should be the same task you entered when you ran iris post, and the text should be the question you want answered, or the sequence or token you want classified. You may include up to two sequences in your inference text for classification, but if you are running a question-answering experiment you may only include one question (along with its context). By default the CPU is not used and we build a TensorRT engine for deployment; if your optimised model is in ONNX format and you want to run on CPU, add the --use_cpu flag and make sure your syntax matches the example above.

When you run the command, iris will submit a request to the Triton Inference Server and you will receive the result of your inference.
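
For example, a question-answering call might look like the following sketch. The task name and inputs here are illustrative placeholders rather than values from your own experiment; substitute the task you specified when you dispatched the original job.

iris infer \
    --task question_answering \                # assumed task name, shown for illustration only
    --text "Where is the Triton Inference Server developed?" \
    --context "The Triton Inference Server is an open-source inference serving project developed by NVIDIA."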

Inferencing with the tritonclient library

If you prefer to interact directly with the Triton Inference Server, you can do so with the tritonclient library. Installing iris will add tritonclient to your site-packages. There are alternative client libraries in other languages; see the links at the bottom of the page for more details.

Install using:

pip install tritonclient

or

conda install -c conda-forge tritonclient
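
Depending on your tritonclient version, the HTTP client may need extra dependencies (such as geventhttpclient). If importing tritonclient.http fails, installing the package with the HTTP extras usually resolves it:

pip install "tritonclient[http]"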

You can use the following Python script to run inference on the deployed model. You can likewise use tritonclient to submit HTTP requests for any other Triton Inference Server capability (see the health-check sketch after the script).

import tritonclient.http
import numpy as np

model_name = "transformer_tensorrt_inference"
model_version = "1"
url = "0.0.0.0:8000"  # Triton's HTTP endpoint; use the port you deployed to (8000 in the previous step)
batch_size = 1

text = "..."  # your inference text

# connect to the running Triton Inference Server
triton_client = tritonclient.http.InferenceServerClient(url=url, verbose=False)

# fetch the deployed model's metadata and configuration
model_metadata = triton_client.get_model_metadata(
    model_name=model_name, model_version=model_version
)
model_config = triton_client.get_model_config(
    model_name=model_name, model_version=model_version
)

# build the request: one BYTES input named "TEXT" and one requested output named "outputs"
query = tritonclient.http.InferInput(name="TEXT", shape=(batch_size,), datatype="BYTES")
model_score = tritonclient.http.InferRequestedOutput(name="outputs", binary_data=False)

query.set_data_from_numpy(np.asarray([text], dtype=object))
answer = triton_client.infer(
    model_name=model_name,
    model_version=model_version,
    inputs=[query],
    outputs=[model_score],
)

# the results come back as a numpy array
print(answer.as_numpy("outputs"))

This request returns your results in an identical format to iris infer.
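
The same client also exposes the server's health endpoints, which can be useful for checking that the deployment is up before submitting requests. A minimal sketch, assuming the same deployment and model name as above:

import tritonclient.http

# connect to the same Triton HTTP endpoint as above (port 8000 assumed)
client = tritonclient.http.InferenceServerClient(url="0.0.0.0:8000")

# health checks exposed by the Triton HTTP API
print(client.is_server_live())    # True once the server process is up
print(client.is_server_ready())   # True once the server can accept requests
print(client.is_model_ready(model_name="transformer_tensorrt_inference", model_version="1"))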

Further reading:

Triton Inference Server Documentation
Triton Inference Server Client Libraries