Getting Started

These docs are outdated! Please check out the latest documentation for the Titan Takeoff server. If there's anything that's not covered there, please contact us on our Discord.

To get started with Iris Takeoff, all you need is Docker and Python installed on your local system. If you wish to use the server with GPU support, you will need to install Docker with CUDA support.

The first step is to install Iris, the TitanML local Python package, using pip:

pip install titan-iris

Getting a model

Once Iris is installed, the next step is to select a model to run inference on. Iris Takeoff supports many of the most powerful generative text models, such as Falcon, MPT, and Llama. See the support page for a list of models that work.

These can be found on their respective HuggingFace pages. We support passing in a HuggingFace model name directly, or passing in a local path to one of these models saved with .save_pretrained().
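As a sketch of the local-path option, assuming the `transformers` package is installed (the download for a 7B model is several gigabytes, so treat this as illustrative rather than something to run blindly):

```python
# Sketch: save a Hugging Face model to disk so that its path can be
# passed to `iris takeoff` instead of a model name.
# Assumes the `transformers` package is installed.

def save_model_locally(model_name: str, target_dir: str) -> str:
    """Download `model_name` and write its weights and tokenizer to `target_dir`."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    AutoModelForCausalLM.from_pretrained(model_name).save_pretrained(target_dir)
    AutoTokenizer.from_pretrained(model_name).save_pretrained(target_dir)
    return target_dir

if __name__ == "__main__":
    # The saved directory can then be passed to the CLI, e.g.:
    #   iris takeoff --model ./falcon-7b-instruct --device cpu
    save_model_locally("tiiuae/falcon-7b-instruct", "./falcon-7b-instruct")
```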

Going forward in this demo we will use the Falcon 7B Instruct model. This is a good open-source model that is trained to follow instructions, and it is small enough to run inference comfortably even on CPUs.

Launching the Server

To start the server, run the takeoff command:

iris takeoff --model tiiuae/falcon-7b-instruct --device cpu --port 8000

If you want to start a server using llama2, see here.

This will trigger a prompt to log in to a TitanML account in order to download the docker container:

You will have to create a free TitanML account to download the Docker image.

Once this is done the model will be downloaded, prepared for inference, and a server will be started on your device. We can verify this by running:

docker ps

    - By default, this will use port 8000 on your machine. Please make sure it's free.
    - If you want to specify a different port, you can do:
        `iris takeoff --model tiiuae/falcon-7b-instruct --device cpu --port <your_port>`
    - For more information, please check the `Using the Takeoff API` page.
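Since the server claims port 8000 by default, a quick way to confirm a port is free before launching is a small socket check. This helper is a hypothetical convenience, not part of Iris:

```python
# Check whether a local port is free before launching the server
# (the Takeoff server defaults to port 8000).
import socket

def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is listening on `host:port`."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # connect_ex returns 0 if a connection succeeded, i.e. the port is busy.
        return s.connect_ex((host, port)) != 0

if __name__ == "__main__":
    print(f"port 8000 free: {port_is_free(8000)}")
```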

Testing the model

With this port exposed, you can now send requests to the model and have tokens streamed back to you.
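As a minimal sketch of querying the server over HTTP with only the standard library, assuming a `/generate` route that accepts a JSON body with a `text` field (check the `Using the Takeoff API` page for the authoritative routes and payload shape):

```python
# Sketch: query the Takeoff server over HTTP.
# The `/generate` route and `text` field are assumptions -- see the
# `Using the Takeoff API` page for the authoritative interface.
import json
import urllib.request

def build_payload(prompt: str) -> bytes:
    """JSON body with the prompt under a `text` key (assumed schema)."""
    return json.dumps({"text": prompt}).encode("utf-8")

def generate(prompt: str, port: int = 8000) -> str:
    """Send `prompt` to the server running on localhost and return the raw response."""
    req = urllib.request.Request(
        f"http://localhost:{port}/generate",
        data=build_payload(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

if __name__ == "__main__":
    # Requires the server from the previous step to be running.
    print(generate("List three uses of graphene."))
```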

This is the foundation that can power a large number of LLM apps. We provide a minimal chat interface that runs locally so you can test the performance and speed of the model. This chat interface is limited, but it is enough to test inference speed and model quality.

To start the chat interface run the following command:

iris takeoff --infer --port 8000
    - This will open up a chat interface with whichever model is hosted on port 8000.
    - If you chose a different port, pass that port in here.
    - Again, infer uses port 8000 by default, so you only need this flag if a different port is used.

The output should look something like this:

Here is a video of the working chat server:

If you are happy with the performance, then we can look to build an application on top of the server.

End to End Demo
