Inferencing the model

These docs are outdated! Please check out https://docs.titanml.co for the latest information on the TitanML platform. If there's anything that's not covered there, please contact us on our discord.

Now you're ready to run inference on your optimised model (and see how fast you can get results!).

Using iris infer

The iris infer command allows you to run inference on your chosen model through the Triton Inference Server without submitting your own HTTP request directly.

Here is the structure of the iris infer command:

iris infer \
    --target localhost \      # note localhost is the default target, so it does not need to be specified
    --port 8000 \             # note 8000 is the default port, so it does not need to be specified
    --task ... \              # the task for which the model you are deploying was originally fine-tuned
    --use_cpu \               # include this only if the optimised model is in ONNX format
    --text "..." \            # your inference text
    --context "..."           # input context; required when you run inference on a question-answering model

Or, equivalently, on a single line:

iris infer --target localhost --port 8000 --task ... --use_cpu --text "..." --context "..."

The port should be the one on which you launched the Docker container in the previous step, the task should be the same task you entered when you ran iris post, and the text should be the question you want answered or the sequence or token you want to classify. You may include up to two sequences in your inference text for classification, but if you are running a question-answering experiment, you may only include one question (along with its context). By default, the CPU is not used and a TensorRT engine is built for deployment; if your optimised model is in ONNX format and you want to run inference on the CPU, add the --use_cpu flag and make sure your syntax matches the example above.
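For example, a question-answering request against a server running on the default host and port might look like this (the task name and inference text here are illustrative; use the same task you passed to iris post):

iris infer \
    --task question_answering \
    --text "Who wrote the play Hamlet?" \
    --context "Hamlet is a tragedy written by William Shakespeare around 1600."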

When you run the command, iris will submit a request to the Triton Inference Server and you will receive the result of your inference.

Inferencing with the tritonclient library

If you prefer to interact directly with the Triton Inference Server, you can do so with the tritonclient library. Installing iris will add tritonclient to your site packages. There are alternative client libraries in other languages; see the links at the bottom of the page for more details.

Install using:

pip install tritonclient

or

conda install -c conda-forge tritonclient

You can use the following Python script to run inference on the deployed model. Similarly, you can use tritonclient to submit HTTP requests to access any other Triton Inference Server capabilities.

import numpy as np
import tritonclient.http

model_name = "transformer_tensorrt_inference"
model_version = "1"
url = "0.0.0.0:8000"  # use the port on which the Triton server container is deployed
batch_size = 1

text = "..."  # your inference text

# connect to the Triton Inference Server over HTTP
triton_client = tritonclient.http.InferenceServerClient(url=url, verbose=False)

# fetch the model's metadata and configuration to confirm it is loaded
model_metadata = triton_client.get_model_metadata(
    model_name=model_name, model_version=model_version
)
model_config = triton_client.get_model_config(
    model_name=model_name, model_version=model_version
)

# declare the input tensor (the text to run inference on) and the requested output
query = tritonclient.http.InferInput(name="TEXT", shape=(batch_size,), datatype="BYTES")
model_score = tritonclient.http.InferRequestedOutput(name="outputs", binary_data=False)

# populate the input with the inference text and submit the request
query.set_data_from_numpy(np.asarray([text], dtype=object))
answer = triton_client.infer(
    model_name=model_name,
    model_version=model_version,
    inputs=[query],
    outputs=[model_score],
)

# the inference result, as a numpy array
print(answer.as_numpy("outputs"))

This request returns your results in the same format as iris infer.
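If the request fails, it is worth checking that the server is reachable and the model has loaded before debugging further. Here is a minimal sketch using the same tritonclient API, assuming the URL and model name from the script above:

import tritonclient.http

url = "0.0.0.0:8000"  # same URL as in the inference script above
model_name = "transformer_tensorrt_inference"
model_version = "1"

triton_client = tritonclient.http.InferenceServerClient(url=url, verbose=False)

# check that the server is up and responding to HTTP requests
print("server live:", triton_client.is_server_live())
print("server ready:", triton_client.is_server_ready())

# check that the deployed model has loaded and is ready to serve requests
print("model ready:", triton_client.is_model_ready(model_name=model_name, model_version=model_version))

# list the inputs and outputs the model expects, as reported by the server
print(triton_client.get_model_metadata(model_name=model_name, model_version=model_version))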

Further reading:

Triton Inference Server Documentation

Triton Inference Server Client Libraries
