Rate Limits and Quotas

This page describes API usage constraints, including request rate limits, token quotas, and size limits for files and text.

Rate Limits

Rate limits are implemented to ensure fair usage of the API and maintain service stability for all users. Exceeding these limits will result in a 429 Too Many Requests error response.

Request-based Rate Limits

| Endpoint                   | Limit | Time Window | Notes                       |
| -------------------------- | ----- | ----------- | --------------------------- |
| /chat_generate             | 100   | Per minute  | Non-streaming requests      |
| /chat_generate (streaming) | 30    | Per minute  | Streaming requests (beta)   |
| /audio                     | 250   | Per minute  | Non-streaming requests      |
| /audio (streaming)         | 100   | Per minute  | Streaming requests          |
| All endpoints combined     | 1,000 | Per hour    | Total requests per API key  |

Concurrent Request Limits

| Request Type           | Concurrent Limit | Notes                                      |
| ---------------------- | ---------------- | ------------------------------------------ |
| Non-streaming requests | 20               | Maximum simultaneous requests              |
| Streaming connections  | 5                | Maximum simultaneous streaming connections |
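
A small client-side gate can help you stay under these caps before the API has to reject anything. The sketch below is a minimal example and not part of any SDK; `makeRequest` stands in for whatever function performs your actual API call, and the limit of 20 mirrors the non-streaming cap above.

```javascript
// Minimal concurrency gate (illustrative sketch, not an official client).
// `makeRequest` is a placeholder for your actual API call.
class ConcurrencyLimiter {
  constructor(maxConcurrent) {
    this.maxConcurrent = maxConcurrent;
    this.active = 0;
    this.waiting = [];
  }

  async run(taskFn) {
    // Park the caller until a slot frees up
    while (this.active >= this.maxConcurrent) {
      await new Promise((resolve) => this.waiting.push(resolve));
    }
    this.active++;
    try {
      return await taskFn();
    } finally {
      this.active--;
      const next = this.waiting.shift();
      if (next) next(); // Wake one queued caller
    }
  }
}

// Usage: keep at most 20 non-streaming requests in flight
const limiter = new ConcurrencyLimiter(20);
limiter.run(() => makeRequest(params)).then((result) => console.log(result));
```

Streaming connections could be gated the same way with a limit of 5.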

Token Usage Quotas

In addition to request-based rate limits, the API also enforces token usage quotas. These quotas apply to the total number of tokens processed (both input and output).

| API Key Type    | Daily Token Quota | Monthly Token Quota |
| --------------- | ----------------- | ------------------- |
| Free Tier       | 50,000            | 1,000,000           |
| Standard Tier   | 1,000,000         | 10,000,000          |
| Enterprise Tier | Custom            | Custom              |

Token Calculation

Tokens are calculated as follows:
  • Input tokens: All text submitted in the prompt and conversation history
  • Output tokens: All text generated in the response
  • Total tokens: Input tokens + output tokens
The usage_metadata field in API responses provides information about token usage:
"usage_metadata": {
"prompt_token_count": 12,
"candidates_token_count": 8,
"total_token_count": 20
}
json

File Size Limits

| File Type                | Maximum Size | Notes                                   |
| ------------------------ | ------------ | --------------------------------------- |
| Images                   | 10 MB        | Supported formats: JPEG, PNG, GIF, WebP |
| Audio files              | 25 MB        | Supported formats: WAV, MP3, M4A, WebM  |
| Documents                | 20 MB        | Supported formats: PDF, TXT, DOCX       |
| All attachments combined | 50 MB        | Total per request                       |
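
Since oversized attachments cause the whole request to be rejected, it can be worth validating sizes client-side before you build the request. The following is an illustrative Node.js sketch, not an API feature; the helper name and the choice to check only images are assumptions, while the byte limits come from the table above.

```javascript
import { statSync } from "node:fs";

// Limits from the table above, expressed in bytes
const MAX_IMAGE_BYTES = 10 * 1024 * 1024;    // 10 MB per image
const MAX_COMBINED_BYTES = 50 * 1024 * 1024; // 50 MB for all attachments combined

// Hypothetical helper: fail fast locally instead of letting the API reject the request
function validateImageAttachments(filePaths) {
  let combined = 0;
  for (const filePath of filePaths) {
    const { size } = statSync(filePath);
    if (size > MAX_IMAGE_BYTES) {
      throw new Error(`${filePath} is ${size} bytes; images are limited to 10 MB`);
    }
    combined += size;
  }
  if (combined > MAX_COMBINED_BYTES) {
    throw new Error(`Attachments total ${combined} bytes; the combined limit is 50 MB`);
  }
}
```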

Text Length Limits

| Parameter       | Maximum Length   | Notes                         |
| --------------- | ---------------- | ----------------------------- |
| prompt          | 32,000 tokens    | Approximately 24,000 words    |
| text (TTS)      | 2,000 characters | For text-to-speech requests   |
| Response length | 4,096 tokens     | Maximum generated text length |
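
Because prompts beyond 32,000 tokens are rejected, a rough client-side estimate can catch oversized prompts before they are sent. The sketch below uses a crude characters-per-token heuristic (the Python example later on this page uses a similar approximation); authoritative token counts come from the usage_metadata the API returns.

```javascript
// Rough pre-check against the 32,000-token prompt limit (sketch only).
// Heuristic: roughly 4 characters per token for English text; actual counts vary,
// and non-Latin scripts such as Amharic may tokenize differently.
const MAX_PROMPT_TOKENS = 32000;

function estimatePromptTokens(text) {
  return Math.ceil(text.length / 4);
}

function assertPromptWithinLimit(prompt) {
  const estimated = estimatePromptTokens(prompt);
  if (estimated > MAX_PROMPT_TOKENS) {
    throw new Error(
      `Prompt is roughly ${estimated} tokens; the documented limit is ${MAX_PROMPT_TOKENS}`,
    );
  }
}
```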

Handling Rate Limits

When you exceed rate limits, the API returns a 429 Too Many Requests response with a Retry-After header indicating the number of seconds to wait before retrying.

Response Example

```text
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 5

{
  "status": "error",
  "error": {
    "code": "RATE_LIMIT_EXCEEDED",
    "message": "Rate limit exceeded. Please reduce request frequency."
  }
}
```

Best Practices for Handling Rate Limits

  1. Implement Backoff Logic: Use exponential backoff when retrying after rate limit errors.
```javascript
async function callWithBackoff(fn, maxRetries = 5) {
  let retries = 0;
  while (retries < maxRetries) {
    try {
      return await fn();
    } catch (error) {
      if (error.status === 429) {
        retries++;
        const retryAfter = parseInt(error.headers.get("Retry-After") || "1", 10);
        const backoffTime = retryAfter * 1000 * Math.pow(1.5, retries - 1);
        console.log(
          `Rate limited. Retrying after ${backoffTime}ms (retry ${retries}/${maxRetries})`,
        );
        await new Promise((resolve) => setTimeout(resolve, backoffTime));
      } else {
        throw error; // Re-throw non-rate-limit errors
      }
    }
  }
  throw new Error(`Failed after ${maxRetries} retries due to rate limiting`);
}
```
  2. Implement Request Queuing: Queue requests and process them at an appropriate rate.
```javascript
class RequestQueue {
  constructor(requestsPerMinute) {
    this.queue = [];
    this.processing = false;
    this.interval = 60000 / requestsPerMinute; // Time between requests
  }

  async add(requestFn) {
    return new Promise((resolve, reject) => {
      this.queue.push({ requestFn, resolve, reject });
      if (!this.processing) {
        this.process();
      }
    });
  }

  async process() {
    if (this.queue.length === 0) {
      this.processing = false;
      return;
    }
    this.processing = true;
    const { requestFn, resolve, reject } = this.queue.shift();
    try {
      const result = await requestFn();
      resolve(result);
    } catch (error) {
      reject(error);
    }
    // Wait before processing next request
    setTimeout(() => this.process(), this.interval);
  }
}

// Usage
const apiQueue = new RequestQueue(50); // 50 requests per minute

async function callAPI(params) {
  return apiQueue.add(() => actualAPICall(params));
}
```
  3. Monitor Token Usage: Track your token usage to avoid hitting quotas.
```javascript
let tokenUsage = {
  daily: 0,
  monthly: 0,
};

function updateTokenUsage(response) {
  if (response.usage_metadata) {
    tokenUsage.daily += response.usage_metadata.total_token_count;
    tokenUsage.monthly += response.usage_metadata.total_token_count;

    // Check if approaching limits
    const dailyLimit = 50000; // Example limit
    if (tokenUsage.daily > dailyLimit * 0.8) {
      console.warn(
        `Warning: Using ${tokenUsage.daily}/${dailyLimit} daily tokens ` +
          `(${Math.round((tokenUsage.daily / dailyLimit) * 100)}%)`,
      );
    }
  }
}
```
  4. Batch Requests When Possible: Combine multiple small requests into fewer larger ones (see the sketch after this list).
  5. Implement Graceful Degradation: When near limits, reduce features or use cached responses.
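
One way to batch, sketched below under the assumption that your prompts tolerate it, is to merge several short, related questions into a single prompt so that one request replaces several. It reuses the callAPI helper from the queuing example above; splitting the combined answer back apart is left to your application.

```javascript
// Sketch: combine several short questions into one request.
// Assumes the callAPI helper from the request-queuing example above.
async function askBatched(questions, targetLanguage) {
  const combinedPrompt =
    "Answer each of the following questions separately, numbering your answers:\n" +
    questions.map((q, i) => `${i + 1}. ${q}`).join("\n");

  return callAPI({
    prompt: combinedPrompt,
    target_language: targetLanguage,
  });
}
```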

Optimizing API Consumption

Reduce Token Usage

  1. Be Concise: Keep prompts and conversation history as concise as possible while maintaining clarity.
  2. Prune Conversation History: For long conversations, consider removing older or less relevant messages.
```javascript
function pruneConversationHistory(history, maxTokens = 2000) {
  // Start with the most recent messages
  let prunedHistory = [...history].reverse();
  let tokenCount = 0;
  let keepMessages = [];

  // Estimate token count (rough approximation)
  for (const message of prunedHistory) {
    const messageTokens = message.content.split(/\s+/).length * 1.3; // Rough estimation
    if (tokenCount + messageTokens <= maxTokens) {
      keepMessages.unshift(message); // Add to the beginning to maintain order
      tokenCount += messageTokens;
    } else {
      break;
    }
  }
  return keepMessages;
}
```
  3. Use Streaming for Long Responses: Streaming allows you to start processing responses immediately, even if you hit token limits mid-response.

Optimize Request Patterns

  1. Cache Common Responses: Store responses for common or repetitive queries.
```javascript
const responseCache = new Map();

async function getCachedResponse(prompt, language) {
  const cacheKey = `${prompt}:${language}`;

  // Check cache first
  if (responseCache.has(cacheKey)) {
    return responseCache.get(cacheKey);
  }

  // Make API call if not cached
  const response = await callAddisAI(prompt, language);

  // Cache the response (with TTL)
  responseCache.set(cacheKey, response);
  setTimeout(() => responseCache.delete(cacheKey), 3600000); // 1 hour TTL

  return response;
}
```
  2. Precompute When Possible: For predictable interaction patterns, precompute responses during low-usage periods.
  3. Implement Client-side Throttling: Respect rate limits by throttling requests before they hit the API.

Enterprise Quotas and Custom Limits

Enterprise users can request custom rate limits and quotas based on their specific needs. Contact our sales team for more information on enterprise plans and custom quotas. Custom options include:
  • Higher request rate limits
  • Increased token quotas
  • Reserved capacity for spikes in usage
  • Custom concurrent request limits
  • Priority processing during high-demand periods

Monitoring Your Usage

You can monitor your current usage and remaining quota through the following methods:
  1. Response Headers: All API responses include headers with current usage information (see the example after this list):
    • X-RateLimit-Limit: Your total request limit
    • X-RateLimit-Remaining: Remaining requests in the current window
    • X-RateLimit-Reset: Time when the current window resets (Unix timestamp)
  2. Usage Metadata: The usage_metadata field in each response provides token usage for that specific request.
  3. Developer Dashboard: Enterprise customers have access to a dashboard with detailed usage statistics and analytics.
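As a minimal sketch of reading those headers with fetch (the endpoint, request fields, and header names are taken from this documentation; the 10% warning threshold is an arbitrary choice):

```javascript
// Sketch: inspect rate-limit headers on each response and warn when running low.
async function chatGenerateWithUsageCheck(prompt, targetLanguage) {
  const response = await fetch("https://api.addisassistant.com/api/v1/chat_generate", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "X-API-Key": "YOUR_API_KEY",
    },
    body: JSON.stringify({ prompt, target_language: targetLanguage }),
  });

  const limit = Number(response.headers.get("X-RateLimit-Limit"));
  const remaining = Number(response.headers.get("X-RateLimit-Remaining"));
  const resetAt = new Date(Number(response.headers.get("X-RateLimit-Reset")) * 1000);

  if (remaining < limit * 0.1) {
    console.warn(
      `Only ${remaining}/${limit} requests left in this window; resets at ${resetAt.toISOString()}`,
    );
  }
  return response.json();
}
```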
For any questions about rate limits or to request a quota increase, please contact our support team.

Python Examples for Rate Limit Handling

Basic Rate Limit Handling

```python
import requests
import time

def call_with_rate_limit_handling(prompt, target_language):
    """
    Call the Addis AI API with basic rate limit handling.
    """
    api_key = "YOUR_API_KEY"
    url = "https://api.addisassistant.com/api/v1/chat_generate"

    headers = {
        "Content-Type": "application/json",
        "X-API-Key": api_key
    }

    data = {
        "prompt": prompt,
        "target_language": target_language
    }

    try:
        response = requests.post(url, headers=headers, json=data)

        # Check for rate limit
        if response.status_code == 429:
            # Get retry after time
            retry_after = int(response.headers.get("Retry-After", "60"))
            print(f"Rate limit exceeded. Waiting for {retry_after} seconds before retrying.")

            # Wait for the specified time
            time.sleep(retry_after)

            # Retry the request
            return call_with_rate_limit_handling(prompt, target_language)

        # Process successful response
        if response.status_code == 200:
            return response.json()

        # Handle other errors
        print(f"Error: {response.status_code}")
        try:
            print(response.json())
        except ValueError:
            print(response.text)
        return None
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {str(e)}")
        return None

# Example usage
response = call_with_rate_limit_handling(
    prompt="What is the capital of Ethiopia?",
    target_language="am"
)

if response:
    print(response["response_text"])
```

Rate Limiter with Token Bucket

```python
import requests
import time
import threading
from collections import deque

class RateLimiter:
    """
    Implements a token bucket algorithm for rate limiting API calls.
    """
    def __init__(self, rate_per_minute, max_burst=None):
        self.rate = rate_per_minute / 60.0  # Tokens per second
        self.max_tokens = max_burst or rate_per_minute
        self.tokens = self.max_tokens
        self.last_update = time.time()
        self.lock = threading.Lock()

        # Queue for tracking token usage
        self.token_usage_queue = deque()
        self.daily_token_limit = 50000  # Example limit
        self.daily_token_usage = 0

    def _update_tokens(self):
        """Update the token count based on elapsed time."""
        now = time.time()
        elapsed = now - self.last_update
        new_tokens = elapsed * self.rate
        with self.lock:
            self.tokens = min(self.max_tokens, self.tokens + new_tokens)
            self.last_update = now

    def try_acquire(self):
        """
        Try to acquire a token. Returns True if successful, False otherwise.
        """
        self._update_tokens()
        with self.lock:
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

    def acquire(self, block=True, timeout=None):
        """
        Acquire a token, waiting if necessary.

        Args:
            block (bool): Whether to block until a token is available.
            timeout (float): Maximum time to wait for a token.

        Returns:
            bool: True if a token was acquired, False otherwise.
        """
        if not block:
            return self.try_acquire()

        start_time = time.time()
        while timeout is None or time.time() - start_time < timeout:
            if self.try_acquire():
                return True
            # Sleep for a small interval
            time.sleep(0.01)
        return False

    def update_token_usage(self, token_count):
        """
        Update token usage tracking.

        Args:
            token_count (int): The number of tokens used in the current request.
        """
        with self.lock:
            now = time.time()

            # Add current usage
            self.token_usage_queue.append((now, token_count))
            self.daily_token_usage += token_count

            # Clean up old entries (older than 24 hours)
            day_ago = now - 86400  # 24 hours in seconds
            while self.token_usage_queue and self.token_usage_queue[0][0] < day_ago:
                _, old_count = self.token_usage_queue.popleft()
                self.daily_token_usage -= old_count

    def check_token_limits(self):
        """
        Check if we're approaching token limits.

        Returns:
            tuple: (is_approaching_limit, percentage_used)
        """
        with self.lock:
            percentage = (self.daily_token_usage / self.daily_token_limit) * 100
            is_approaching = percentage > 80
            return is_approaching, percentage

class AddisAIClient:
    """
    Client for the Addis AI API with rate limiting.
    """
    def __init__(self, api_key, rate_per_minute=60):
        self.api_key = api_key
        self.base_url = "https://api.addisassistant.com/api/v1"

        # Create rate limiter
        self.rate_limiter = RateLimiter(rate_per_minute)

        # Headers
        self.headers = {
            "Content-Type": "application/json",
            "X-API-Key": api_key
        }

    def chat_generate(self, prompt, target_language,
                      conversation_history=None, generation_config=None):
        """
        Call the chat_generate endpoint with rate limiting.
        """
        # Check token limits
        approaching_limit, percentage = self.rate_limiter.check_token_limits()
        if approaching_limit:
            print(f"Warning: Approaching daily token limit ({percentage:.1f}%)")

        # Acquire token from rate limiter
        if not self.rate_limiter.acquire(timeout=300):  # Wait up to 5 minutes
            raise Exception("Failed to acquire rate limit token after timeout")

        # Prepare request
        url = f"{self.base_url}/chat_generate"
        data = {
            "prompt": prompt,
            "target_language": target_language
        }
        if conversation_history:
            data["conversation_history"] = conversation_history
        if generation_config:
            data["generation_config"] = generation_config

        # Make request
        try:
            response = requests.post(url, headers=self.headers, json=data)

            # Handle rate limiting
            if response.status_code == 429:
                retry_after = int(response.headers.get("Retry-After", "60"))
                print(f"Rate limit exceeded. Waiting for {retry_after} seconds.")
                time.sleep(retry_after + 1)  # Add 1 second buffer

                # Retry after waiting
                return self.chat_generate(
                    prompt, target_language, conversation_history, generation_config
                )

            # Parse response
            if response.status_code == 200:
                result = response.json()

                # Update token usage tracking
                if "usage_metadata" in result:
                    token_count = result["usage_metadata"].get("total_token_count", 0)
                    self.rate_limiter.update_token_usage(token_count)

                return result
            else:
                print(f"Error: {response.status_code}")
                try:
                    print(response.json())
                except ValueError:
                    print(response.text)
                return None
        except requests.exceptions.RequestException as e:
            print(f"Request failed: {str(e)}")
            return None

# Example usage
client = AddisAIClient(api_key="YOUR_API_KEY", rate_per_minute=50)

# Process multiple requests with rate limiting
requests_to_process = [
    ("What is the capital of Ethiopia?", "am"),
    ("How many regions are in Ethiopia?", "am"),
    ("Tell me about Ethiopian cuisine", "am"),
    # Add more requests...
]

for prompt, language in requests_to_process:
    response = client.chat_generate(prompt, language)
    if response:
        print(f"Prompt: {prompt}")
        print(f"Response: {response['response_text']}")
        print("---")
    # No need to add artificial delay; the rate limiter handles this
```

Conversation History Token Management

```python
def estimate_token_count(text):
    """
    Estimate the number of tokens in a string.
    This is a simple approximation - actual tokenization varies by model.
    """
    # Rough approximation: 1 token ≈ 4 characters for English, less for non-Latin scripts
    return len(text) / 3  # Adjust for Amharic/Afan Oromo which may use more bytes per character

def prune_conversation_history(history, max_tokens=2000):
    """
    Prune conversation history to stay under token limits.

    Args:
        history (list): The conversation history
        max_tokens (int): Maximum token budget

    Returns:
        list: Pruned conversation history
    """
    if not history:
        return []

    # Calculate token counts for each message
    token_counts = []
    for message in history:
        if "content" in message:
            token_counts.append(estimate_token_count(message["content"]))
        elif "parts" in message:
            # Sum tokens in text parts
            parts_count = 0
            for part in message["parts"]:
                if "text" in part:
                    parts_count += estimate_token_count(part["text"])
            token_counts.append(parts_count)
        else:
            # Default if we can't determine
            token_counts.append(50)  # Assume a default size

    # Start with most recent messages and work backward
    total_tokens = 0
    keep_indices = []
    for i in range(len(history) - 1, -1, -1):
        if total_tokens + token_counts[i] <= max_tokens:
            keep_indices.append(i)
            total_tokens += token_counts[i]
        else:
            break

    # Sort indices to maintain original order
    keep_indices.sort()

    # If we couldn't keep any messages, at least keep the latest one
    # (possibly truncated)
    if not keep_indices and history:
        latest_msg = history[-1].copy()
        # Truncate content if needed
        if "content" in latest_msg:
            content = latest_msg["content"]
            # Estimate how much to keep
            keep_chars = int(max_tokens * 3)  # Convert tokens to approximate char count
            if len(content) > keep_chars:
                latest_msg["content"] = content[:keep_chars] + "..."
        return [latest_msg]

    return [history[i] for i in keep_indices]

# Example usage
conversation_history = [
    {"role": "user", "content": "What is Ethiopia?"},
    {"role": "assistant", "content": "Ethiopia is a country located in the Horn of Africa..."},
    {"role": "user", "content": "Tell me about its history."},
    {"role": "assistant", "content": "Ethiopia has a rich history dating back thousands of years..."},
    {"role": "user", "content": "What about its culture?"}
]

# Prune to stay under token limit
pruned_history = prune_conversation_history(conversation_history, max_tokens=500)
print(f"Kept {len(pruned_history)} of {len(conversation_history)} messages")

# Use pruned history in API call
client = AddisAIClient(api_key="YOUR_API_KEY")
response = client.chat_generate(
    prompt="Tell me more about Ethiopian cuisine",
    target_language="am",
    conversation_history=pruned_history
)
```