Rate Limits and Quotas

This page describes API usage constraints, including request rate limits, token quotas, and size limits for files and text.

Rate Limits

Rate limits are implemented to ensure fair usage of the API and maintain service stability for all users. Exceeding these limits will result in a 429 Too Many Requests error response.

Request-based Rate Limits

| Endpoint                   | Limit | Time Window | Notes                       |
| -------------------------- | ----- | ----------- | --------------------------- |
| /chat_generate             | 100   | Per minute  | Non-streaming requests      |
| /chat_generate (streaming) | 30    | Per minute  | Streaming requests (beta)   |
| /audio                     | 250   | Per minute  | Non-streaming requests      |
| /audio (streaming)         | 100   | Per minute  | Streaming requests          |
| All endpoints combined     | 1,000 | Per hour    | Total requests per API key  |

Concurrent Request Limits

| Request Type           | Concurrent Limit | Notes                                      |
| ---------------------- | ---------------- | ------------------------------------------ |
| Non-streaming requests | 20               | Maximum simultaneous requests              |
| Streaming connections  | 5                | Maximum simultaneous streaming connections |
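
A small client-side gate can help you stay under these caps before the API has to reject anything. The sketch below is a minimal example and not part of any SDK; `makeRequest` stands in for whatever function performs your actual API call, and the limit of 20 mirrors the non-streaming cap above.

```javascript
// Minimal concurrency gate (illustrative sketch, not an official client).
// `makeRequest` is a placeholder for your actual API call.
class ConcurrencyLimiter {
  constructor(maxConcurrent) {
    this.maxConcurrent = maxConcurrent;
    this.active = 0;
    this.waiting = [];
  }

  async run(taskFn) {
    // Park the caller until a slot frees up
    while (this.active >= this.maxConcurrent) {
      await new Promise((resolve) => this.waiting.push(resolve));
    }
    this.active++;
    try {
      return await taskFn();
    } finally {
      this.active--;
      const next = this.waiting.shift();
      if (next) next(); // Wake one queued caller
    }
  }
}

// Usage: keep at most 20 non-streaming requests in flight
const limiter = new ConcurrencyLimiter(20);
limiter.run(() => makeRequest(params)).then((result) => console.log(result));
```

Streaming connections could be gated the same way with a limit of 5.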

Token Usage Quotas

In addition to request-based rate limits, the API also enforces token usage quotas. These quotas apply to the total number of tokens processed (both input and output).

| API Key Type    | Daily Token Quota | Monthly Token Quota |
| --------------- | ----------------- | ------------------- |
| Free Tier       | 50,000            | 1,000,000           |
| Standard Tier   | 1,000,000         | 10,000,000          |
| Enterprise Tier | Custom            | Custom              |

Token Calculation

Tokens are calculated as follows:
  • Input tokens: All text submitted in the prompt and conversation history
  • Output tokens: All text generated in the response
  • Total tokens: Input tokens + output tokens
The usage_metadata field in API responses provides information about token usage:
"usage_metadata": {
"prompt_token_count": 12,
"candidates_token_count": 8,
"total_token_count": 20
}
json

File Size Limits

| File Type                | Maximum Size | Notes                                   |
| ------------------------ | ------------ | --------------------------------------- |
| Images                   | 10 MB        | Supported formats: JPEG, PNG, GIF, WebP |
| Audio files              | 25 MB        | Supported formats: WAV, MP3, M4A, WebM  |
| Documents                | 20 MB        | Supported formats: PDF, TXT, DOCX       |
| All attachments combined | 50 MB        | Total per request                       |
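
Since oversized attachments cause the whole request to be rejected, it can be worth validating sizes client-side before you build the request. The following is an illustrative Node.js sketch, not an API feature; the helper name and the choice to check only images are assumptions, while the byte limits come from the table above.

```javascript
import { statSync } from "node:fs";

// Limits from the table above, expressed in bytes
const MAX_IMAGE_BYTES = 10 * 1024 * 1024;    // 10 MB per image
const MAX_COMBINED_BYTES = 50 * 1024 * 1024; // 50 MB for all attachments combined

// Hypothetical helper: fail fast locally instead of letting the API reject the request
function validateImageAttachments(filePaths) {
  let combined = 0;
  for (const filePath of filePaths) {
    const { size } = statSync(filePath);
    if (size > MAX_IMAGE_BYTES) {
      throw new Error(`${filePath} is ${size} bytes; images are limited to 10 MB`);
    }
    combined += size;
  }
  if (combined > MAX_COMBINED_BYTES) {
    throw new Error(`Attachments total ${combined} bytes; the combined limit is 50 MB`);
  }
}
```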

Text Length Limits

| Parameter       | Maximum Length   | Notes                         |
| --------------- | ---------------- | ----------------------------- |
| prompt          | 32,000 tokens    | Approximately 24,000 words    |
| text (TTS)      | 2,000 characters | For text-to-speech requests   |
| Response length | 4,096 tokens     | Maximum generated text length |
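
Because prompts beyond 32,000 tokens are rejected, a rough client-side estimate can catch oversized prompts before they are sent. The sketch below uses a crude characters-per-token heuristic (the Python example later on this page uses a similar approximation); authoritative token counts come from the usage_metadata the API returns.

```javascript
// Rough pre-check against the 32,000-token prompt limit (sketch only).
// Heuristic: roughly 4 characters per token for English text; actual counts vary,
// and non-Latin scripts such as Amharic may tokenize differently.
const MAX_PROMPT_TOKENS = 32000;

function estimatePromptTokens(text) {
  return Math.ceil(text.length / 4);
}

function assertPromptWithinLimit(prompt) {
  const estimated = estimatePromptTokens(prompt);
  if (estimated > MAX_PROMPT_TOKENS) {
    throw new Error(
      `Prompt is roughly ${estimated} tokens; the documented limit is ${MAX_PROMPT_TOKENS}`,
    );
  }
}
```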

Handling Rate Limits

When you exceed rate limits, the API returns a 429 Too Many Requests response with a Retry-After header indicating the number of seconds to wait before retrying.

Response Example

```text
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 5

{
  "status": "error",
  "error": {
    "code": "RATE_LIMIT_EXCEEDED",
    "message": "Rate limit exceeded. Please reduce request frequency."
  }
}
```

Best Practices for Handling Rate Limits

  1. Implement Backoff Logic: Use exponential backoff when retrying after rate limit errors.
```javascript
async function callWithBackoff(fn, maxRetries = 5) {
  let retries = 0;
  while (retries < maxRetries) {
    try {
      return await fn();
    } catch (error) {
      if (error.status === 429) {
        retries++;
        const retryAfter = parseInt(error.headers.get("Retry-After") || "1", 10);
        const backoffTime = retryAfter * 1000 * Math.pow(1.5, retries - 1);
        console.log(
          `Rate limited. Retrying after ${backoffTime}ms (retry ${retries}/${maxRetries})`,
        );
        await new Promise((resolve) => setTimeout(resolve, backoffTime));
      } else {
        throw error; // Re-throw non-rate-limit errors
      }
    }
  }
  throw new Error(`Failed after ${maxRetries} retries due to rate limiting`);
}
```
  2. Implement Request Queuing: Queue requests and process them at an appropriate rate.
```javascript
class RequestQueue {
  constructor(requestsPerMinute) {
    this.queue = [];
    this.processing = false;
    this.interval = 60000 / requestsPerMinute; // Time between requests
  }

  async add(requestFn) {
    return new Promise((resolve, reject) => {
      this.queue.push({ requestFn, resolve, reject });
      if (!this.processing) {
        this.process();
      }
    });
  }

  async process() {
    if (this.queue.length === 0) {
      this.processing = false;
      return;
    }
    this.processing = true;
    const { requestFn, resolve, reject } = this.queue.shift();
    try {
      const result = await requestFn();
      resolve(result);
    } catch (error) {
      reject(error);
    }
    // Wait before processing next request
    setTimeout(() => this.process(), this.interval);
  }
}

// Usage
const apiQueue = new RequestQueue(50); // 50 requests per minute

async function callAPI(params) {
  return apiQueue.add(() => actualAPICall(params));
}
```
  3. Monitor Token Usage: Track your token usage to avoid hitting quotas.
```javascript
let tokenUsage = {
  daily: 0,
  monthly: 0,
};

function updateTokenUsage(response) {
  if (response.usage_metadata) {
    tokenUsage.daily += response.usage_metadata.total_token_count;
    tokenUsage.monthly += response.usage_metadata.total_token_count;

    // Check if approaching limits
    const dailyLimit = 50000; // Example limit
    if (tokenUsage.daily > dailyLimit * 0.8) {
      console.warn(
        `Warning: Using ${tokenUsage.daily}/${dailyLimit} daily tokens ` +
          `(${Math.round((tokenUsage.daily / dailyLimit) * 100)}%)`,
      );
    }
  }
}
```
  4. Batch Requests When Possible: Combine multiple small requests into fewer larger ones (see the sketch after this list).
  5. Implement Graceful Degradation: When near limits, reduce features or use cached responses.
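
One way to batch, sketched below under the assumption that your prompts tolerate it, is to merge several short, related questions into a single prompt so that one request replaces several. It reuses the callAPI helper from the queuing example above; splitting the combined answer back apart is left to your application.

```javascript
// Sketch: combine several short questions into one request.
// Assumes the callAPI helper from the request-queuing example above.
async function askBatched(questions, targetLanguage) {
  const combinedPrompt =
    "Answer each of the following questions separately, numbering your answers:\n" +
    questions.map((q, i) => `${i + 1}. ${q}`).join("\n");

  return callAPI({
    prompt: combinedPrompt,
    target_language: targetLanguage,
  });
}
```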

Optimizing API Consumption

Reduce Token Usage

  1. Be Concise: Keep prompts and conversation history as concise as possible while maintaining clarity.
  2. Prune Conversation History: For long conversations, consider removing older or less relevant messages.
```javascript
function pruneConversationHistory(history, maxTokens = 2000) {
  // Start with the most recent messages
  let prunedHistory = [...history].reverse();
  let tokenCount = 0;
  let keepMessages = [];

  // Estimate token count (rough approximation)
  for (const message of prunedHistory) {
    const messageTokens = message.content.split(/\s+/).length * 1.3; // Rough estimation
    if (tokenCount + messageTokens <= maxTokens) {
      keepMessages.unshift(message); // Add to the beginning to maintain order
      tokenCount += messageTokens;
    } else {
      break;
    }
  }
  return keepMessages;
}
```
  3. Use Streaming for Long Responses: Streaming allows you to start processing responses immediately, even if you hit token limits mid-response.

Optimize Request Patterns

  1. Cache Common Responses: Store responses for common or repetitive queries.
```javascript
const responseCache = new Map();

async function getCachedResponse(prompt, language) {
  const cacheKey = `${prompt}:${language}`;

  // Check cache first
  if (responseCache.has(cacheKey)) {
    return responseCache.get(cacheKey);
  }

  // Make API call if not cached
  const response = await callAddisAI(prompt, language);

  // Cache the response (with TTL)
  responseCache.set(cacheKey, response);
  setTimeout(() => responseCache.delete(cacheKey), 3600000); // 1 hour TTL

  return response;
}
```
  2. Precompute When Possible: For predictable interaction patterns, precompute responses during low-usage periods.
  3. Implement Client-side Throttling: Respect rate limits by throttling requests before they hit the API.

Enterprise Quotas and Custom Limits

Enterprise users can request custom rate limits and quotas based on their specific needs. Contact our sales team for more information on enterprise plans and custom quotas. Custom options include:
  • Higher request rate limits
  • Increased token quotas
  • Reserved capacity for spikes in usage
  • Custom concurrent request limits
  • Priority processing during high-demand periods

Monitoring Your Usage

You can monitor your current usage and remaining quota through the following methods:
  1. Response Headers: All API responses include headers with current usage information (see the example after this list):
    • X-RateLimit-Limit: Your total request limit
    • X-RateLimit-Remaining: Remaining requests in the current window
    • X-RateLimit-Reset: Time when the current window resets (Unix timestamp)
  2. Usage Metadata: The usage_metadata field in each response provides token usage for that specific request.
  3. Developer Dashboard: Enterprise customers have access to a dashboard with detailed usage statistics and analytics.
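As a minimal sketch of reading those headers with fetch (the endpoint, request fields, and header names are taken from this documentation; the 10% warning threshold is an arbitrary choice):

```javascript
// Sketch: inspect rate-limit headers on each response and warn when running low.
async function chatGenerateWithUsageCheck(prompt, targetLanguage) {
  const response = await fetch("https://api.addisassistant.com/api/v1/chat_generate", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "X-API-Key": "YOUR_API_KEY",
    },
    body: JSON.stringify({ prompt, target_language: targetLanguage }),
  });

  const limit = Number(response.headers.get("X-RateLimit-Limit"));
  const remaining = Number(response.headers.get("X-RateLimit-Remaining"));
  const resetAt = new Date(Number(response.headers.get("X-RateLimit-Reset")) * 1000);

  if (remaining < limit * 0.1) {
    console.warn(
      `Only ${remaining}/${limit} requests left in this window; resets at ${resetAt.toISOString()}`,
    );
  }
  return response.json();
}
```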
For any questions about rate limits or to request a quota increase, please contact our support team.

Python Examples for Rate Limit Handling

Basic Rate Limit Handling

```python
import requests
import time

def call_with_rate_limit_handling(prompt, target_language):
    """
    Call the Addis AI API with basic rate limit handling.
    """
    api_key = "YOUR_API_KEY"
    url = "https://api.addisassistant.com/api/v1/chat_generate"

    headers = {
        "Content-Type": "application/json",
        "X-API-Key": api_key
    }

    data = {
        "prompt": prompt,
        "target_language": target_language
    }

    try:
        response = requests.post(url, headers=headers, json=data)

        # Check for rate limit
        if response.status_code == 429:
            # Get retry after time
            retry_after = int(response.headers.get("Retry-After", "60"))
            print(f"Rate limit exceeded. Waiting for {retry_after} seconds before retrying.")

            # Wait for the specified time
            time.sleep(retry_after)

            # Retry the request
            return call_with_rate_limit_handling(prompt, target_language)

        # Process successful response
        if response.status_code == 200:
            return response.json()

        # Handle other errors
        print(f"Error: {response.status_code}")
        try:
            print(response.json())
        except ValueError:
            print(response.text)
        return None
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {str(e)}")
        return None

# Example usage
response = call_with_rate_limit_handling(
    prompt="What is the capital of Ethiopia?",
    target_language="am"
)

if response:
    print(response["response_text"])
```

Rate Limiter with Token Bucket

```python
import requests
import time
import threading
from collections import deque

class RateLimiter:
    """
    Implements a token bucket algorithm for rate limiting API calls.
    """
    def __init__(self, rate_per_minute, max_burst=None):
        self.rate = rate_per_minute / 60.0  # Tokens per second
        self.max_tokens = max_burst or rate_per_minute
        self.tokens = self.max_tokens
        self.last_update = time.time()
        self.lock = threading.Lock()

        # Queue for tracking token usage
        self.token_usage_queue = deque()
        self.daily_token_limit = 50000  # Example limit
        self.daily_token_usage = 0

    def _update_tokens(self):
        """Update the token count based on elapsed time."""
        now = time.time()
        elapsed = now - self.last_update
        new_tokens = elapsed * self.rate
        with self.lock:
            self.tokens = min(self.max_tokens, self.tokens + new_tokens)
            self.last_update = now

    def try_acquire(self):
        """
        Try to acquire a token. Returns True if successful, False otherwise.
        """
        self._update_tokens()
        with self.lock:
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

    def acquire(self, block=True, timeout=None):
        """
        Acquire a token, waiting if necessary.

        Args:
            block (bool): Whether to block until a token is available.
            timeout (float): Maximum time to wait for a token.

        Returns:
            bool: True if a token was acquired, False otherwise.
        """
        if not block:
            return self.try_acquire()

        start_time = time.time()
        while timeout is None or time.time() - start_time < timeout:
            if self.try_acquire():
                return True
            # Sleep for a small interval
            time.sleep(0.01)
        return False

    def update_token_usage(self, token_count):
        """
        Update token usage tracking.

        Args:
            token_count (int): The number of tokens used in the current request.
        """
        with self.lock:
            now = time.time()

            # Add current usage
            self.token_usage_queue.append((now, token_count))
            self.daily_token_usage += token_count

            # Clean up old entries (older than 24 hours)
            day_ago = now - 86400  # 24 hours in seconds
            while self.token_usage_queue and self.token_usage_queue[0][0] < day_ago:
                _, old_count = self.token_usage_queue.popleft()
                self.daily_token_usage -= old_count

    def check_token_limits(self):
        """
        Check if we're approaching token limits.

        Returns:
            tuple: (is_approaching_limit, percentage_used)
        """
        with self.lock:
            percentage = (self.daily_token_usage / self.daily_token_limit) * 100
            is_approaching = percentage > 80
            return is_approaching, percentage

class AddisAIClient:
    """
    Client for the Addis AI API with rate limiting.
    """
    def __init__(self, api_key, rate_per_minute=60):
        self.api_key = api_key
        self.base_url = "https://api.addisassistant.com/api/v1"

        # Create rate limiter
        self.rate_limiter = RateLimiter(rate_per_minute)

        # Headers
        self.headers = {
            "Content-Type": "application/json",
            "X-API-Key": api_key
        }

    def chat_generate(self, prompt, target_language,
                      conversation_history=None, generation_config=None):
        """
        Call the chat_generate endpoint with rate limiting.
        """
        # Check token limits
        approaching_limit, percentage = self.rate_limiter.check_token_limits()
        if approaching_limit:
            print(f"Warning: Approaching daily token limit ({percentage:.1f}%)")

        # Acquire token from rate limiter
        if not self.rate_limiter.acquire(timeout=300):  # Wait up to 5 minutes
            raise Exception("Failed to acquire rate limit token after timeout")

        # Prepare request
        url = f"{self.base_url}/chat_generate"
        data = {
            "prompt": prompt,
            "target_language": target_language
        }
        if conversation_history:
            data["conversation_history"] = conversation_history
        if generation_config:
            data["generation_config"] = generation_config

        # Make request
        try:
            response = requests.post(url, headers=self.headers, json=data)

            # Handle rate limiting
            if response.status_code == 429:
                retry_after = int(response.headers.get("Retry-After", "60"))
                print(f"Rate limit exceeded. Waiting for {retry_after} seconds.")
                time.sleep(retry_after + 1)  # Add 1 second buffer

                # Retry after waiting
                return self.chat_generate(
                    prompt, target_language, conversation_history, generation_config
                )

            # Parse response
            if response.status_code == 200:
                result = response.json()

                # Update token usage tracking
                if "usage_metadata" in result:
                    token_count = result["usage_metadata"].get("total_token_count", 0)
                    self.rate_limiter.update_token_usage(token_count)

                return result
            else:
                print(f"Error: {response.status_code}")
                try:
                    print(response.json())
                except ValueError:
                    print(response.text)
                return None
        except requests.exceptions.RequestException as e:
            print(f"Request failed: {str(e)}")
            return None

# Example usage
client = AddisAIClient(api_key="YOUR_API_KEY", rate_per_minute=50)

# Process multiple requests with rate limiting
requests_to_process = [
    ("What is the capital of Ethiopia?", "am"),
    ("How many regions are in Ethiopia?", "am"),
    ("Tell me about Ethiopian cuisine", "am"),
    # Add more requests...
]

for prompt, language in requests_to_process:
    response = client.chat_generate(prompt, language)
    if response:
        print(f"Prompt: {prompt}")
        print(f"Response: {response['response_text']}")
        print("---")
    # No need to add artificial delay; the rate limiter handles this
```

Conversation History Token Management

```python
def estimate_token_count(text):
    """
    Estimate the number of tokens in a string.
    This is a simple approximation - actual tokenization varies by model.
    """
    # Rough approximation: 1 token ≈ 4 characters for English, less for non-Latin scripts
    return len(text) / 3  # Adjust for Amharic/Afan Oromo which may use more bytes per character

def prune_conversation_history(history, max_tokens=2000):
    """
    Prune conversation history to stay under token limits.

    Args:
        history (list): The conversation history
        max_tokens (int): Maximum token budget

    Returns:
        list: Pruned conversation history
    """
    if not history:
        return []

    # Calculate token counts for each message
    token_counts = []
    for message in history:
        if "content" in message:
            token_counts.append(estimate_token_count(message["content"]))
        elif "parts" in message:
            # Sum tokens in text parts
            parts_count = 0
            for part in message["parts"]:
                if "text" in part:
                    parts_count += estimate_token_count(part["text"])
            token_counts.append(parts_count)
        else:
            # Default if we can't determine
            token_counts.append(50)  # Assume a default size

    # Start with most recent messages and work backward
    total_tokens = 0
    keep_indices = []
    for i in range(len(history) - 1, -1, -1):
        if total_tokens + token_counts[i] <= max_tokens:
            keep_indices.append(i)
            total_tokens += token_counts[i]
        else:
            break

    # Sort indices to maintain original order
    keep_indices.sort()

    # If we couldn't keep any messages, at least keep the latest one
    # (possibly truncated)
    if not keep_indices and history:
        latest_msg = history[-1].copy()
        # Truncate content if needed
        if "content" in latest_msg:
            content = latest_msg["content"]
            # Estimate how much to keep
            keep_chars = int(max_tokens * 3)  # Convert tokens to approximate char count
            if len(content) > keep_chars:
                latest_msg["content"] = content[:keep_chars] + "..."
        return [latest_msg]

    return [history[i] for i in keep_indices]

# Example usage
conversation_history = [
    {"role": "user", "content": "What is Ethiopia?"},
    {"role": "assistant", "content": "Ethiopia is a country located in the Horn of Africa..."},
    {"role": "user", "content": "Tell me about its history."},
    {"role": "assistant", "content": "Ethiopia has a rich history dating back thousands of years..."},
    {"role": "user", "content": "What about its culture?"}
]

# Prune to stay under token limit
pruned_history = prune_conversation_history(conversation_history, max_tokens=500)
print(f"Kept {len(pruned_history)} of {len(conversation_history)} messages")

# Use pruned history in API call
client = AddisAIClient(api_key="YOUR_API_KEY")
response = client.chat_generate(
    prompt="Tell me more about Ethiopian cuisine",
    target_language="am",
    conversation_history=pruned_history
)
```