# Inferencing the model

{% hint style="danger" %}
These docs are outdated! Please check out <https://docs.titanml.co> for the latest information on the TitanML platform.\
\
If there's anything that's not covered there, please contact us on our [discord](https://discord.com/invite/83RmHTjZgf).
{% endhint %}

Now you're ready to run inference on your optimised model (and see how fast you can get results!).

### Using iris infer

The `iris infer` command allows you to run inference on your chosen model through the Triton Inference Server without submitting your own HTTP request directly.

Here is the structure of the `iris infer` command:

```sh
iris infer \
    --target localhost \   # localhost is the default target, so it does not need to be specified
    --port 8000 \          # 8000 is the default port, so it does not need to be specified
    --task ... \           # the task for which the model you are deploying was originally fine-tuned
    --use_cpu \            # include this if and only if the optimised model is in ONNX format
    --text "..." \         # your inference text
    --context "..."        # input context; required when you run inference on a question-answering model
```

The same command written on a single line:

```sh
iris infer --target localhost --port 8000 --task ... --use_cpu --text "..." --context "..."
```

The port should be the one on which you launched the Docker container in the previous step, the task should be the same task you entered when you ran `iris post`, and the text should be the question you want answered or the sequence or tokens you want classified. You may include up to two sequences in your inference text for classification, but if you are running a question-answering experiment you may only include one question, along with its context passed via `--context`. By default, the CPU is not used and a TensorRT engine is built for deployment; if your optimised model is in ONNX format and you want to run it on CPU, add the `--use_cpu` flag and make sure your syntax matches the example above.
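
For instance, a question-answering request might look like the following. The task placeholder is left as `...` because it must match whatever you passed to `iris post`, and the text and context strings here are purely illustrative:

```sh
iris infer \
    --port 8000 \
    --task ... \
    --text "What port is the model served on?" \
    --context "The optimised model is served by the Triton Inference Server on port 8000."
```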

When you run the command, `iris` will submit a request to the Triton Inference Server and you will receive the result of your inference.

### Inferencing with the tritonclient library

If you prefer to interact directly with the Triton Inference Server, you can do so with the `tritonclient` library. Installing `iris` will add `tritonclient` to your site-packages. There are alternative libraries in other languages; see the links at the bottom of the page for more details.

Install using:

```bash
pip install "tritonclient[http]"
```

or

```bash
conda install -c conda-forge tritonclient
```

You can use the following Python script to run inference on the deployed model. Similarly, you can use `tritonclient` to submit HTTP requests to access any other Triton Inference Server capabilities.

```python
import tritonclient.http
import numpy as np

model_name = "transformer_tensorrt_inference"
model_version = "1"
url = "0.0.0.0:8000"  # match the port on which you deployed the Triton container
batch_size = 1

text = "..."  # your inference text

triton_client = tritonclient.http.InferenceServerClient(url=url, verbose=False)

# Fetch the deployed model's metadata and configuration from the server
model_metadata = triton_client.get_model_metadata(
    model_name=model_name, model_version=model_version
)
model_config = triton_client.get_model_config(
    model_name=model_name, model_version=model_version
)

# Build the request: one BYTES input named "TEXT", requesting the "outputs" tensor back
query = tritonclient.http.InferInput(name="TEXT", shape=(batch_size,), datatype="BYTES")
model_score = tritonclient.http.InferRequestedOutput(name="outputs", binary_data=False)

query.set_data_from_numpy(np.asarray([text], dtype=object))
answer = triton_client.infer(
    model_name=model_name,
    model_version=model_version,
    inputs=[query],
    outputs=[model_score],
)

print(answer.as_numpy("outputs"))
```

This request returns your results in an identical format to `iris infer`.
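
The same client can also be used to reach other Triton Inference Server endpoints, for example to confirm the server and model are ready before sending requests. A minimal sketch, reusing the `triton_client`, `model_name`, and `model_version` variables from the script above:

```python
# Liveness and readiness checks exposed by the Triton HTTP API
print(triton_client.is_server_live())   # True once the server process is up
print(triton_client.is_server_ready())  # True once the server can accept inference requests
print(triton_client.is_model_ready(model_name=model_name, model_version=model_version))

# List the models currently available in the server's model repository
print(triton_client.get_model_repository_index())
```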

Further reading:

[Triton Inference Server Documentation](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html)

[Triton Inference Server Client Libraries](https://github.com/triton-inference-server/client)
