# Inferencing the model

{% hint style="danger" %}
These docs are outdated! Please check out <https://docs.titanml.co> for the latest information on the TitanML platform.\
\
If there's anything that's not covered there, please contact us on our [Discord](https://discord.com/invite/83RmHTjZgf).
{% endhint %}

Now you're ready to run inference on your optimised model and see how fast you can get results!

### Using iris infer

The `iris infer` command allows you to run inference on your chosen model through the Triton Inference Server without submitting your own HTTP request directly.

Here is the structure of the `iris infer` command:

```sh
# --target   defaults to localhost, so it does not need to be specified
# --port     defaults to 8000, so it does not need to be specified
# --task     the task for which the deployed model was originally fine-tuned
# --use_cpu  include this only if the optimised model is in ONNX format
# --text     your inference text
# --context  input context; required when you run inference on a question-answering model
iris infer \
    --target localhost \
    --port 8000 \
    --task ... \
    --use_cpu \
    --text "..." \
    --context "..."
```

```sh
iris infer -t localhost -p8000 -t ... --use_cpu -t "..." -c "..."
```

The port should be the one you exposed when you launched the Docker container in the previous step, and the task should be the same task you entered when you ran `iris post`. The text should be the question you want answered, or the sequence or token you want classified. You may include up to two sequences in your inference text for classification, but a question-answering request may contain only one question (along with its context). By default, the CPU is not used and we build a TensorRT engine for deployment; if your optimised model is in ONNX format and you want to run it on CPU, add the `--use_cpu` flag and make sure your syntax matches the example above.

When you run the command, `iris` will submit a request to the Triton Inference Server and you will receive the result of your inference.
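If you call `iris infer` from a script rather than a terminal, the command can be assembled programmatically. The following is a minimal sketch, assuming the long-form flags shown above, that `iris` accepts them in any order, and that `iris` is on your `PATH`. The helper name `build_iris_infer_command`, the `"<your-task>"` placeholder, and the sample question and context are hypothetical and not part of `iris`.

```python
import shlex


def build_iris_infer_command(task, text, context=None, target="localhost",
                             port=8000, use_cpu=False):
    """Assemble the argument list for `iris infer` from the documented flags."""
    command = [
        "iris", "infer",
        "--target", target,
        "--port", str(port),
        "--task", task,
        "--text", text,
    ]
    if use_cpu:
        # Only for optimised models in ONNX format.
        command.append("--use_cpu")
    if context is not None:
        # Required when the deployed model is a question-answering model.
        command += ["--context", context]
    return command


# "<your-task>" is a placeholder: use the task you passed to `iris post`.
command = build_iris_infer_command(
    task="<your-task>",
    text="Who wrote the report?",
    context="The report was written by the finance team.",
)
print(shlex.join(command))
# To actually send the request: subprocess.run(command, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems when the text or context contains quotes or spaces.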

### Inferencing with the tritonclient library

If you prefer to interact directly with the Triton Inference Server, you can do so with the `tritonclient` library. Installing `iris` will add `tritonclient` to your site packages. There are alternative libraries in other languages; see the links at the bottom of the page for more details.

To install it yourself, include the `http` extra, which provides the HTTP client used on this page:

```bash
pip install "tritonclient[http]"
```

or

```bash
conda install -c conda-forge tritonclient
```

You can use the following Python script to run inference on the deployed model. Similarly, you can use `tritonclient` to submit HTTP requests to access any other Triton Inference Server capabilities.

```python
import numpy as np
import tritonclient.http

model_name = "transformer_tensorrt_inference"
model_version = "1"
url = "0.0.0.0:8000"  # change the port if you deployed to a different one
batch_size = 1

text = "..."  # your inference text

triton_client = tritonclient.http.InferenceServerClient(url=url, verbose=False)

# Optional: inspect the deployed model's metadata and configuration.
model_metadata = triton_client.get_model_metadata(
    model_name=model_name, model_version=model_version
)
model_config = triton_client.get_model_config(
    model_name=model_name, model_version=model_version
)

# Describe the input tensor ("TEXT") and the output to return ("outputs").
query = tritonclient.http.InferInput(name="TEXT", shape=(batch_size,), datatype="BYTES")
model_score = tritonclient.http.InferRequestedOutput(name="outputs", binary_data=False)

# Strings are sent as a NumPy object array.
query.set_data_from_numpy(np.asarray([text], dtype=object))
answer = triton_client.infer(
    model_name=model_name,
    model_version=model_version,
    inputs=[query],
    outputs=[model_score],
)

print(answer.as_numpy("outputs"))
```

This request returns your results in the same format as `iris infer`.

Further reading:

[Triton Inference Server Documentation](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html)

[Triton Inference Server Client Libraries](https://github.com/triton-inference-server/client)

