Supported models

These docs are outdated! Please check out https://docs.titanml.co/docs/category/titan-takeoff for the latest information on the Titan Takeoff server. If there's anything that's not covered there, please contact us on our Discord.

How it works

Iris Takeoff is designed to be simple to use, for fast prototyping and inference of large language models on CPU and NVIDIA GPU.

Currently supported models:

  • BART

  • BLOOM

  • CodeGen

  • Falcon

  • LLaMa

  • MPT

  • OpenAI GPT2

  • GPTBigCode

  • GPT-J

  • GPT-NeoX

  • OPT

  • Pegasus

  • T5

  • Llama 2 (see below)

For a full list see here.

Quantisation

To improve token generation latency, Takeoff uses the CTranslate2 library. CTranslate2 supports int8 computation, which speeds up inference and reduces memory requirements; Takeoff uses int8 by default.

However, int8 inference can sometimes cause a decrease in model quality. This can be mitigated by smoothing out activations before quantisation. This currently only works with BLOOM and OPT, and we are working on extending this support to a wider range of models.
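The idea behind activation smoothing can be sketched as follows. This is a SmoothQuant-style illustration of the general technique, not Takeoff's actual implementation; all names and numbers here are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear layer: y = x @ W. Outlier channels in x make int8
# quantisation of the activations lossy.
x = rng.normal(size=(4, 8))
x[:, 0] *= 50.0          # simulate an outlier activation channel
W = rng.normal(size=(8, 8))

# A per-channel smoothing scale migrates quantisation difficulty
# from the activations into the weights (alpha is a tuning knob).
alpha = 0.5
act_max = np.abs(x).max(axis=0)
w_max = np.abs(W).max(axis=1)
s = act_max**alpha / w_max**(1 - alpha)

# Fold the scale into both sides: the layer output is unchanged,
# but the smoothed activations have a much smaller dynamic range,
# so they quantise to int8 with less error.
x_smooth = x / s
W_smooth = W * s[:, None]

assert np.allclose(x @ W, x_smooth @ W_smooth)
print(np.abs(x).max() / np.abs(x_smooth).max())  # outlier range shrunk
```

Because the rescaling is mathematically a no-op on the layer output, it can be applied offline before quantising, which is why it recovers quality without any runtime cost.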

Memory requirements

Because the models are quantized to int8, you need roughly 1GB of memory per 1B parameters when running the model, so running a 7B-parameter model takes at least 7GB of memory. This really is a lower bound: overhead from model activations and the like means you need more than this in practice.

The conversion process loads the full model into memory in fp32, so performing the conversion to int8 often needs more memory than running the converted model. On an 8GB RAM laptop, we are able to convert and run 1.5B-parameter models. This will improve as we come up with better ways of converting the models.
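The rules of thumb above can be turned into a quick back-of-the-envelope estimate. The function below is purely illustrative (its name is made up, and the figures are floor estimates that ignore activation and conversion overhead):

```python
# Rough memory rule of thumb from the text: int8 needs about 1 byte
# per parameter at runtime, while conversion loads the model in fp32
# (about 4 bytes per parameter). Lower bounds only.

def estimate_memory_gb(n_params_billion: float) -> dict:
    BYTES_INT8 = 1   # quantised weights at runtime
    BYTES_FP32 = 4   # full-precision weights during conversion
    return {
        "run_int8_gb": n_params_billion * BYTES_INT8,
        "convert_fp32_gb": n_params_billion * BYTES_FP32,
    }

print(estimate_memory_gb(7))    # 7B model: ~7GB to run, ~28GB to convert
print(estimate_memory_gb(1.5))  # 1.5B model: ~1.5GB to run, ~6GB to convert
```

This is consistent with the 8GB laptop figure: a 1.5B-parameter model needs roughly 6GB in fp32 during conversion, which just fits, while a 7B model would not.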

Using Llama 2

Llama 2 is a new state-of-the-art language model released by Meta. It has shown superior performance across a wide range of benchmarks when compared to other open source models, and is comparable to ChatGPT performance on certain benchmarks.

At the time of writing, Llama 2 weights are only available via a request form on the Hugging Face website. To download and use the weights you need to have filled in the form and been granted access. Once you have access, you can use your Hugging Face token to authorise the download, so the server launch command now looks like this:

iris takeoff --model meta-llama/Llama-2-7b-hf -t <your/token> 

Then everything else should work as expected.
