
Local OpenAI Proxy Server

A fast, lightweight OpenAI-compatible server for calling 100+ LLM APIs.

info

This is deprecated. Support for the CLI tool will be removed in our next MAJOR release - https://github.com/BerriAI/litellm/discussions/648.

Usage

$ pip install litellm
$ litellm --model ollama/codellama

#INFO: Ollama running on http://0.0.0.0:8000

Test

In a new shell, run:

$ litellm --test

Replace openai base

import openai 

openai.api_base = "http://0.0.0.0:8000"

print(openai.ChatCompletion.create(model="test", messages=[{"role":"user", "content":"Hey!"}]))
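Because the proxy speaks the OpenAI chat-completions format, you can also call it over plain HTTP without the openai SDK. A minimal sketch, assuming the proxy is running on port 8000 and serves the same /chat/completions route the SDK example above hits:

import requests

# Minimal sketch: call the local proxy over plain HTTP.
# Assumes the proxy from `litellm --model ollama/codellama` is listening on port 8000
# and exposes the OpenAI-style /chat/completions route.
resp = requests.post(
    "http://0.0.0.0:8000/chat/completions",
    json={
        "model": "ollama/codellama",
        "messages": [{"role": "user", "content": "Hey!"}],
    },
)
print(resp.json())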

Other supported models:

# Assuming you're running vllm locally
$ litellm --model vllm/facebook/opt-125m


Docker

Here's how to use our Docker image to run the OpenAI Proxy Server in production.

git clone https://github.com/BerriAI/litellm.git

Add your API keys / LLM configs to template_secrets.toml.

[keys]
OPENAI_API_KEY="sk-..."
COHERE_API_KEY="Wa-..."

All Configs

Run the Docker image:

docker build -t litellm . && docker run -p 8000:8000 litellm

## INFO: OpenAI Proxy server running on http://0.0.0.0:8000

Tutorial: Use with Multiple LLMs + LibreChat/Chatbot-UI/AutoGen/ChatDev/Langroid, etc.

Replace openai base:

import openai 

openai.api_key = "any-string-here"
openai.api_base = "http://0.0.0.0:8080" # your proxy url

# call openai
response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=[{"role": "user", "content": "Hey"}])

print(response)

# call cohere
response = openai.ChatCompletion.create(model="command-nightly", messages=[{"role": "user", "content": "Hey"}])

print(response)

Local Proxy

Here's how to use the local proxy to test codellama/mistral/etc. models against different GitHub repos.

$ pip install litellm
$ ollama pull codellama # our local CodeLlama

$ litellm --model ollama/codellama --temperature 0.3 --max_tokens 2048

Tutorial: Use with Multiple LLMs + Aider/AutoGen/Langroid/etc.

$ litellm

#INFO: litellm proxy running on http://0.0.0.0:8000

Send a request to your proxy

import openai 

openai.api_key = "any-string-here"
openai.api_base = "http://0.0.0.0:8080" # your proxy url

# call gpt-3.5-turbo
response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=[{"role": "user", "content": "Hey"}])

print(response)

# call ollama/llama2
response = openai.ChatCompletion.create(model="ollama/llama2", messages=[{"role": "user", "content": "Hey"}])

print(response)
note

Using this server with a project? Contribute your tutorial here!

Advanced

Logs

$ litellm --logs

This will return the most recent log (the call that went to the LLM API + the received response).

All logs are saved to a file called api_logs.json in the current directory.
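If you want to inspect the log file programmatically rather than via litellm --logs, here's a minimal sketch; the exact structure of api_logs.json isn't documented here, so this just loads the file and prints the most recent entry:

import json

# Minimal sketch: read the proxy's local log file directly.
# Assumes api_logs.json is in the directory the proxy was started from.
with open("api_logs.json") as f:
    logs = json.load(f)

# The file may be a dict keyed by timestamp or a list of entries;
# handle both and print the latest one.
entries = list(logs.values()) if isinstance(logs, dict) else logs
print(json.dumps(entries[-1], indent=2))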

Configure Proxy

If you need to:

  • save API keys
  • set litellm params (e.g. drop unmapped params, set fallback models, etc.)
  • set model-specific params (max tokens, temperature, api base, prompt template)

You can set these just for that session (via the CLI), or persist them across restarts (via a config file).

Save API Keys

$ litellm --api_key OPENAI_API_KEY=sk-...

LiteLLM will save this to a locally stored config file and persist it across sessions.

LiteLLM Proxy supports all litellm-supported API keys. To add keys for a specific provider, check this list:

$ litellm --add_key HUGGINGFACE_API_KEY=my-api-key #[OPTIONAL]

E.g.: set the API base, max tokens, and temperature.

For that session:

litellm --model ollama/llama2 \
--api_base http://localhost:11434 \
--max_tokens 250 \
--temperature 0.5

# OpenAI-compatible server running on http://0.0.0.0:8000

Across restarts:
Create a file called litellm_config.toml and paste this in there:

[model."ollama/llama2"] # run via `litellm --model ollama/llama2`
max_tokens = 250 # set max tokens for the model
temperature = 0.5 # set temperature for the model
api_base = "http://localhost:11434" # set a custom api base for the model

Save it to the proxy with:

$ litellm --config -f ./litellm_config.toml 

LiteLLM will save a copy of this file in its package, so it can persist these settings across restarts.
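Once the config is saved, requests for that model pick up these defaults. A quick sketch of a client call through the proxy, reusing the openai-style client from the examples above (this assumes the proxy routes by model name, as in the multi-LLM example earlier):

import openai

openai.api_key = "any-string-here"
openai.api_base = "http://0.0.0.0:8000"  # proxy started with `litellm --model ollama/llama2`

# max_tokens, temperature, and api_base come from litellm_config.toml,
# so the request itself doesn't need to repeat them.
response = openai.ChatCompletion.create(
    model="ollama/llama2",
    messages=[{"role": "user", "content": "Hey"}],
)
print(response)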

Complete Config File

🔥 [Tutorial] Modify a model prompt on the proxy

Track Costs

By default, the litellm proxy writes cost logs to litellm/proxy/costs.json.

How can the proxy be better? Let us know here

{
  "Oct-12-2023": {
    "claude-2": {
      "cost": 0.02365918,
      "num_requests": 1
    }
  }
}

You can view costs on the CLI using:

litellm --cost
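To work with these numbers programmatically, you can also read costs.json directly. A minimal sketch, assuming the date -> model -> {cost, num_requests} layout shown above:

import json

# Minimal sketch: summarize per-day spend from the proxy's cost log.
# Assumes the date -> model -> {"cost", "num_requests"} layout shown above.
with open("litellm/proxy/costs.json") as f:
    costs = json.load(f)

for day, models in costs.items():
    total_cost = sum(entry["cost"] for entry in models.values())
    total_requests = sum(entry["num_requests"] for entry in models.values())
    print(f"{day}: ${total_cost:.6f} across {total_requests} request(s)")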

Performance

We load-tested 500,000 HTTP connections on the FastAPI server for 1 minute, using wrk.

Here are our results:

  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    156.38ms   25.52ms  361.91ms   84.73%
    Req/Sec     13.61      5.13     40.00     57.50%
  383625 requests in 1.00m, 391.10MB read
  Socket errors: connect 0, read 1632, write 1, timeout 0

Support / Talk with founders