Inferencing the model
These docs are outdated! Please check out https://docs.titanml.co for the latest information on the TitanML platform. If there's anything that's not covered there, please contact us on our Discord.
Now you're ready to run inference on your optimised model (and see how fast you can get results!).
Using iris infer
The iris infer command allows you to run inference on your chosen model through the Triton Inference Server without submitting your own HTTP request directly.
Here is the structure of the iris infer command:
iris infer \
--target localhost \ # note localhost is the default target, so it does not need to be specified
--port 8000 \ # note 8000 is the default port, so it does not need to be specified
--task ... \ # the task for which the model you are deploying was originally fine-tuned
--use_cpu \ # include this flag only if the optimised model is in ONNX format
--text "..." \ # your inference text
--context "..." # input context; required when you run inference on a question-answering model
iris infer -t localhost -p 8000 -t ... --use_cpu -t "..." -c "..."
The port should be the one on which you launched the Docker container in the previous step, the task should be the same task you entered when you ran iris post, and the text should be the question you want answered or the sequence or tokens you want classified. You may include up to two sequences in your inference text for classification, but if you are running a question-answering experiment, you may only include one question (along with its context). By default the CPU is not used and we build a TensorRT engine for deployment; if your optimised model is in CPU (ONNX) format and you want to run on CPU, add the --use_cpu flag and make sure your syntax matches the example above.
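For example, a question-answering request might look like the command below. The task name and text are purely illustrative; substitute whatever task you originally passed to iris post and your own inputs.
iris infer \
    --task question_answering \
    --text "Who developed the theory of relativity?" \
    --context "The theory of relativity was developed by Albert Einstein in the early twentieth century."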
When you run the command, iris will submit a request to the Triton Inference Server and you will receive the result of your inference.
Inferencing with the tritonclient library
If you prefer to interact directly with the Triton Inference Server, you can do so with the tritonclient library. Installing iris will add tritonclient to your site packages. There are alternative client libraries in other languages; see the links at the bottom of the page for more details.
Install using:
pip install tritonclient
or
conda install -c conda-forge tritonclient
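Before sending any inference requests, it can be useful to confirm that the server is reachable and that your model is loaded. Below is a minimal sketch; it assumes the default model name transformer_tensorrt_inference used in the script further down, and that the HTTP client dependencies are installed (the tritonclient[http] extra usually pulls these in).
import tritonclient.http

client = tritonclient.http.InferenceServerClient(url="0.0.0.0:8000")
print(client.is_server_live())   # True once the Triton container is up
print(client.is_server_ready())  # True once the server can accept requests
print(client.is_model_ready("transformer_tensorrt_inference", "1"))  # True once the model is loaded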
You can use the following Python script to run inference on the deployed model. Similarly, you can use tritonclient to submit HTTP requests to access any other Triton Inference Server capabilities.
import tritonclient.http
import numpy as np

model_name = "transformer_tensorrt_inference"
model_version = "1"
url = "0.0.0.0:8000"  # if you are deployed to port 8000
batch_size = 1
text = "..."  # your inference text

# Connect to the Triton Inference Server
triton_client = tritonclient.http.InferenceServerClient(url=url, verbose=False)

# Fetch the model's metadata and configuration
model_metadata = triton_client.get_model_metadata(
    model_name=model_name, model_version=model_version
)
model_config = triton_client.get_model_config(
    model_name=model_name, model_version=model_version
)

# Build the request: one BYTES input named "TEXT" and one requested output named "outputs"
query = tritonclient.http.InferInput(name="TEXT", shape=(batch_size,), datatype="BYTES")
model_score = tritonclient.http.InferRequestedOutput(name="outputs", binary_data=False)
query.set_data_from_numpy(np.asarray([text], dtype=object))

# Submit the inference request
answer = triton_client.infer(
    model_name=model_name,
    model_version=model_version,
    inputs=[query],
    outputs=[model_score],
)

# Retrieve the result as a numpy array
print(answer.as_numpy("outputs"))
This request returns your results in an identical format to iris infer.
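The exact shape and datatype of the outputs array depend on the task your model was fine-tuned for, so treat the following as a sketch: if the server returns BYTES elements, they arrive as Python bytes objects and can be decoded before further processing.
result = answer.as_numpy("outputs")
# BYTES outputs arrive as Python bytes objects; decode them for readability
decoded = [x.decode("utf-8") if isinstance(x, bytes) else x for x in result.ravel()]
print(decoded)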
Further reading: