Getting Started
These docs are outdated! Please check out https://docs.titanml.co/docs/category/titan-takeoff for the latest information on the Titan Takeoff server. If there's anything that's not covered there, please contact us on our Discord.
To get started with Iris Takeoff, all you need is Docker and Python installed on your local system. If you wish to use the server with GPU support, you will also need to install Docker with CUDA support.
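On Ubuntu, for example, GPU support for Docker is typically provided by the NVIDIA Container Toolkit. The commands below assume the NVIDIA apt repository is already configured; see NVIDIA's container toolkit documentation for other distributions:

```bash
# Install the NVIDIA Container Toolkit and register it with Docker
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```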
The first step is to install Iris, the TitanML local Python package, using pip.
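Assuming the package is published on PyPI as titan-iris (check the docs linked above if the name has changed), the install looks like this:

```bash
pip install titan-iris
```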
Once Iris is installed, the next step is to select a model to run inference with. Iris Takeoff supports many of the most powerful generative text models, such as Falcon, MPT, and Llama. See the support page for the list of supported models.
These can be found on their respective HuggingFace pages. We support passing in a HuggingFace model name directly, or passing in a local path to a model saved with .save_pretrained().
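As an illustration, the local-path option works with any model saved via HuggingFace Transformers; the directory name below is just an example:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tiiuae/falcon-7b-instruct"
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Save both to a local folder, whose path can then be passed to Takeoff
# in place of the HuggingFace model name.
model.save_pretrained("./falcon-7b-instruct-local")
tokenizer.save_pretrained("./falcon-7b-instruct-local")
```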
Throughout this demo we will be using the Falcon 7B Instruct model. This is a good open-source model that is trained to follow instructions, and is small enough to run inference easily even on CPUs.
To start the server, run the takeoff command:
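The exact flags may have changed since this was written (the current docs are the authoritative reference), but the command looked roughly like this, with --model and --device as the key options:

```bash
iris takeoff --model tiiuae/falcon-7b-instruct --device cpu
```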
If you want to start a server using Llama 2, see here.
This will trigger a prompt to log in to a TitanML account so you can download the Docker container:
You will have to create a free TitanML account to download the Docker image.
Once this is done, the model will be downloaded, prepared for inference, and a server will be started on your device. We can see this by running:
Input:
Output:
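For example, assuming the standard Docker CLI, listing the running containers should show the Takeoff container together with the port it exposes:

```bash
# The Takeoff container should appear in this list with a port mapping
docker ps
```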
With this port exposed, you can now send commands to the model and have tokens streamed back to you.
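A sketch of what such a request could look like with curl is below; the port, endpoint path, and payload shape here are assumptions, so check the linked Takeoff docs for the actual API:

```bash
# Hypothetical request; the endpoint, port, and JSON fields are assumptions
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "List three uses of a local LLM server."}'
```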
This is the foundation that can power a large number of LLM apps. We provide a minimal chat interface that runs locally; it is limited, but it is enough to test inference speed and model quality.
To start the chat interface, run the following command:
The output should look something like this:
Here is a video of the working chat server:
If you are happy with the performance, you can move on to building an application on top of the server.