Inferencing the model
These docs are outdated! Please check out https://docs.titanml.co for the latest information on the TitanML platform. If there's anything that's not covered there, please contact us on our discord.
Now you're ready to run inference on your optimised model (and see how fast you can get results!)
Using iris infer
The iris infer command allows you to run inference on your chosen model through the Triton Inference Server without submitting your own HTTP request directly.
Here is the structure of the iris infer command:
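The flag names shown here are assumptions based on the options described below; confirm them with iris infer --help on your installation.

```bash
# Sketch of the iris infer command structure (flag names assumed)
iris infer --port <port> --task <task> --text "<text>" [--use_cpu]
```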
The port should be the one on which you launched the docker container in the previous step, the task should be the same task you entered when you ran iris post, and the text should be the question you want answered or the sequence or token you want classified. You may include up to two sequences to be classified in your inference text, but if you are running a question-answering experiment, you may only include one question (along with its context). By default, the CPU is not used and we build a TensorRT engine for deployment; if you have a model in CPU format and want to use the CPU, add the --use_cpu flag and make sure your syntax matches the example above.
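For instance, a hypothetical call against a container exposed on port 8000 might look like the following (the task name is illustrative; use the same task you passed to iris post):

```bash
# Hypothetical example: classify a single sequence via a container on port 8000
iris infer --port 8000 --task sequence_classification --text "This film was a delight to watch"
```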
When you run the command, iris will submit a request to the Triton Inference Server and you will receive the result of your inference.
Inferencing with the tritonclient library
If you prefer to interact directly with the Triton Inference Server, you can do so with the tritonclient library. Installing iris will add tritonclient to your site packages. There are alternative client libraries in other languages; see the links at the bottom of the page for more details.
Install using:
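The extras names below are assumptions; check the tritonclient listing on PyPI for the variants available to you.

```bash
# HTTP client only
pip install tritonclient[http]
```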
or
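```bash
# all client protocols and utilities
pip install tritonclient[all]
```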
You can use the following Python script to run inference on the deployed model. Similarly, you can use tritonclient to submit HTTP requests to access any other Triton Inference Server capabilities.
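A minimal sketch using the tritonclient HTTP client. The model name and the input/output tensor names ("my_model", "TEXT", "OUTPUT") are placeholders, as is the port; take the real values from your deployment (the server's model metadata will tell you what your model expects).

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to the Triton Inference Server launched in the previous step
# (adjust the port to match your docker run command)
client = httpclient.InferenceServerClient(url="localhost:8000")

# Pack the text to run inference on as a Triton BYTES tensor
text = np.array(["This film was a delight to watch"], dtype=object)
infer_input = httpclient.InferInput("TEXT", list(text.shape), "BYTES")  # tensor name is a placeholder
infer_input.set_data_from_numpy(text)

# Request the output tensor by name (also a placeholder; check your model's config)
output = httpclient.InferRequestedOutput("OUTPUT")

# Submit the inference request and read back the result
response = client.infer(model_name="my_model", inputs=[infer_input], outputs=[output])
print(response.as_numpy("OUTPUT"))
```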
This request returns your results in an identical format to iris infer.
Further reading: