Pulling the model
These docs are outdated! Please check out https://docs.titanml.co for the latest information on the TitanML platform. If there's anything that's not covered there, please contact us on our Discord.
Now that you are familiar with the TitanHub model evaluation interface, you can decide which of the three Titan-optimised models best suits your use case and deploy your chosen model into production.
Click one of the data points to choose which model size to download; you can choose medium, small or extra-small, depending on your cost and performance requirements.
If you simply want to download a Titan-optimised model in ONNX format onto your machine, use the iris download command, with the syntax iris download <experiment ID>:<model size>. For our example experiment, you would use iris download 183:M to download the medium-sized model, iris download 183:S for the small model, and so on. You are then free to use the ONNX model as you wish.
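For example, to fetch the medium and small models from the example experiment (experiment ID 183, as above):

```bash
# Download the medium-sized ONNX model from experiment 183
iris download 183:M

# Download the small ONNX model from the same experiment
iris download 183:S
```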
With the iris pull command, you can download a Docker container that serves an optimised model. The pull command always takes the form iris pull <experiment ID>:<model size>, and is analogous to docker pull.
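For our example experiment, pulling the serving container for the medium model looks like this:

```bash
# Pull the serving container for the medium model of experiment 183
iris pull 183:M
```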
You can also copy either the pull or download command directly from the TitanHub web interface. Select a model size by clicking on a corresponding data point, then copy one of the generated commands from the 'Model Download' tab:
Make sure that docker has been installed on your device; if you are using Windows or macOS, make sure that the Docker Desktop app is open and running. Then run the iris pull command you copied from TitanHub. You should see progress bars in your terminal indicating the progress of the Docker container download.
When the download is complete, run docker images to find the name of the downloaded image. An image named in the format iris-triton-<experiment ID> should appear in your list of images; for our example experiment, this would be iris-triton-183.
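If you have many local images, you can also filter the docker images output by repository name (here assuming the iris-triton-183 image from our example):

```bash
# List every local image and look for the iris-triton-<experiment ID> entry
docker images

# Or filter directly by repository name
docker images iris-triton-183
```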
Now launch the docker container using the following command:
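A minimal invocation looks something like the sketch below; the image name assumes our example experiment (183), and the published ports assume Triton's default HTTP, gRPC and metrics ports, so adjust both to your setup:

```bash
# Launch the serving container interactively, exposing Triton's default
# HTTP (8000), gRPC (8001) and metrics (8002) ports, with all GPUs visible
docker run --gpus all -it \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  iris-triton-183
```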
Be sure to add any arguments to docker run that your system's GPU support and setup require. For instance, you might want to specify the platform architecture for the container, or restrict which GPUs are made available to Triton for inference by adding --gpus device=... rather than exposing all GPUs as in the example above. The container should now be visible in Docker Desktop and in your terminal when you run docker container ls.
When you launch the container, you will be prompted for a few parameters (batch size, sequence length, and whether to run on CPU) before the pre-built script inside the container serves the model. We use TensorRT models (unless you are running inference on a CPU, in which case we use ONNX) and deploy them to the Triton Inference Server. Once deployed, an endpoint is created at an available port on the device; the endpoint uses the Triton server to manage batching and inference.
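Once the server reports that the model is loaded, you can check that the endpoint is live using Triton's standard HTTP health route (assuming the HTTP port was published on 8000, as in the docker run sketch above):

```bash
# Returns HTTP 200 once the Triton server and the model are ready to serve
curl -v localhost:8000/v2/health/ready
```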