Multi-modal Input

Addis AI supports multi-modal inputs, allowing you to combine text, audio, and file attachments (images, documents) in a single request.

Overview and Use Cases

Multi-modal capabilities enable:

- Richer conversational experiences with visual context
- Document analysis in Ethiopian languages
- Image captioning and description in local languages
- Voice commands with supporting documents
- Educational applications with visual and audio components
- Visual question answering in Amharic and Afan Oromo

Text + Audio Combination

You can combine text prompts with audio input in a single request.

Endpoint: POST /chat_generate with multipart/form-data

curl -X POST https://api.addisassistant.com/api/v1/chat_generate \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "chat_audio_input=@/path/to/your-audio.wav" \
  -F 'request_data={
    "prompt": "Additional text context here",
    "target_language": "am"
  };type=application/json'

This approach allows you to:

- Provide supplementary text context to audio input
- Send voice questions with text constraints or clarifications
- Combine typed and spoken input in the same request
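
The same text + audio request can be assembled in JavaScript. The following is a minimal sketch, assuming a runtime with global FormData and Blob (browsers, or Node.js 18+). The helper name buildAudioRequest and the filename "question.wav" are illustrative and not part of the Addis AI API; only the chat_audio_input and request_data field names come from the curl example above.

```javascript
// Sketch: build a text + audio multipart body for POST /chat_generate.
// buildAudioRequest is a hypothetical helper, not an Addis AI function.
function buildAudioRequest(audio, prompt, targetLanguage = "am") {
  const formData = new FormData();

  // The recording goes in the chat_audio_input field, as in the curl example.
  // "question.wav" is an arbitrary filename chosen for illustration.
  formData.append("chat_audio_input", audio, "question.wav");

  // The typed context travels alongside the audio as a JSON string.
  formData.append(
    "request_data",
    JSON.stringify({ prompt, target_language: targetLanguage }),
  );

  return formData;
}
```

Pass the returned object as the fetch body and leave the Content-Type header unset, so the runtime can add the multipart boundary itself.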

Attachment Support

You can include file attachments (images, documents) along with text or audio:

curl -X POST https://api.addisassistant.com/api/v1/chat_generate \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "image1=@/path/to/image.jpg" \
  -F "document1=@/path/to/document.pdf" \
  -F 'request_data={
    "prompt": "Describe these attachments",
    "target_language": "am",
    "attachment_field_names": ["image1", "document1"]
  };type=application/json'

Important: You must list every attachment field name in the attachment_field_names array of the request_data JSON.
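
One way to keep the uploaded fields and attachment_field_names in sync is to derive the array from the files that were actually appended. This is a sketch under that assumption, for a runtime with global FormData; buildAttachmentForm is a hypothetical helper name, and the files argument is assumed to be a plain object mapping field names to File or Blob values.

```javascript
// Sketch: derive attachment_field_names from the files actually appended,
// so the list and the multipart fields cannot drift apart.
// buildAttachmentForm is a hypothetical helper, not an Addis AI function.
function buildAttachmentForm(files, prompt, targetLanguage = "am") {
  const formData = new FormData();
  const fieldNames = [];

  for (const [fieldName, file] of Object.entries(files)) {
    if (!file) continue; // skip optional inputs that were left empty
    formData.append(fieldName, file);
    fieldNames.push(fieldName);
  }

  const requestData = { prompt, target_language: targetLanguage };
  // Only add the array when at least one attachment is present.
  if (fieldNames.length > 0) {
    requestData.attachment_field_names = fieldNames;
  }

  formData.append("request_data", JSON.stringify(requestData));
  return formData;
}
```

For example, buildAttachmentForm({ image1: imageFile, document1: pdfFile }, "Describe these attachments") would produce the same fields as the curl request above.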

Supported Attachment Types

The system supports various file types, including:

| File Type | Supported Formats                       |
| --------- | --------------------------------------- |
| Images    | JPEG, PNG, WebP, GIF (first frame), BMP |
| Documents | PDF, TXT, RTF, DOCX, PPTX, XLSX         |
| Audio     | WAV, MP3, M4A, FLAC, OGG, WebM          |
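
A client can screen files against this list before uploading. The sketch below is illustrative only: the mapping from format names to file extensions is an assumption (the API may well inspect MIME types or file contents instead), and attachmentCategory is a hypothetical helper name.

```javascript
// Assumed extension lists for the formats in the table above.
// The server-side check may differ; treat this as a client-side pre-filter.
const SUPPORTED_EXTENSIONS = {
  image: ["jpg", "jpeg", "png", "webp", "gif", "bmp"],
  document: ["pdf", "txt", "rtf", "docx", "pptx", "xlsx"],
  audio: ["wav", "mp3", "m4a", "flac", "ogg", "webm"],
};

// Returns "image", "document", or "audio" for a recognized extension,
// and null when the filename has no extension or an unlisted one.
function attachmentCategory(filename) {
  const dot = filename.lastIndexOf(".");
  if (dot < 0) return null;

  const ext = filename.slice(dot + 1).toLowerCase();
  for (const [category, extensions] of Object.entries(SUPPORTED_EXTENSIONS)) {
    if (extensions.includes(ext)) return category;
  }
  return null;
}
```

A form handler could call attachmentCategory(file.name) and reject the file early when the result is null, rather than waiting for the upload to fail.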

Image Analysis

When you include image attachments, Addis AI can:

- Describe the contents of images in Amharic or Afan Oromo
- Answer questions about visual content
- Extract text from images (OCR) and respond to it
- Use visual context to inform responses

Sample Image Queries

"ይህን ምስል አብራራልኝ"  (Describe this image)
"በዚህ ምስል ላይ ምን አለ?" (What is in this image?)
"በዚህ ምስል ላይ ስንት ሰዎች አሉ?" (How many people are in this image?)
"በዚህ ምስል ላይ ያለው ጽሑፍ ምንድን ነው?" (What text is in this image?)

Document Processing

When you include document attachments, Addis AI can:

- Summarize document content in the target language
- Answer questions about document content
- Extract and analyze information from PDF, Word, or text files
- Process bilingual documents and respond in the target language

Response Format with Attachments

Responses to multi-modal inputs follow the standard response format, plus an uploaded_attachments field that references the uploaded files:

{
  "response_text": "The image shows a traditional Ethiopian coffee ceremony...",
  "finish_reason": "stop",
  "usage_metadata": { ... },
  "modelVersion": "Addis-፩-አሌፍ",
  "uploaded_attachments": [
    { "fileUri": "gs://.../file1.png", "mimeType": "image/png" },
    { "fileUri": "gs://.../file2.pdf", "mimeType": "application/pdf" }
  ]
}

The uploaded_attachments array contains the file URIs and MIME types that can be referenced in future conversation turns.
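
To reuse those references later, a client can convert the array into fileData parts as soon as the response arrives. The following sketch assumes the response shape shown above; attachmentParts is a hypothetical helper name, and the choice to skip incomplete entries silently is an illustrative design decision, not documented API behavior.

```javascript
// Sketch: turn a chat_generate response into fileData parts
// that can be placed in a later conversation_history entry.
// attachmentParts is a hypothetical helper, not an Addis AI function.
function attachmentParts(result) {
  // Tolerate responses without attachments (for example, text-only turns).
  const attachments = Array.isArray(result?.uploaded_attachments)
    ? result.uploaded_attachments
    : [];

  return attachments
    // Drop entries missing either value, since both are needed later.
    .filter((item) => item && item.fileUri && item.mimeType)
    .map(({ fileUri, mimeType }) => ({ fileData: { fileUri, mimeType } }));
}
```

Returning an empty array for text-only responses lets callers spread the result into a parts array without a separate null check.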

Conversation History with Attachments

For multi-turn conversations involving files, you'll need to:

- First request: Upload files and store the returned URIs
- Subsequent requests: Include the file URIs in your conversation history

First Request (Uploading Files)

Upload files as part of a multipart request:

// Step 1: First request to upload files
const formData = new FormData();
formData.append("image1", imageFile);
formData.append(
  "request_data",
  JSON.stringify({
    prompt: "What's in this image?",
    target_language: "am",
    attachment_field_names: ["image1"],
  }),
);

const response = await fetch(
  "https://api.addisassistant.com/api/v1/chat_generate",
  {
    method: "POST",
    headers: { "X-API-Key": "YOUR_API_KEY" },
    body: formData,
  },
);

const result = await response.json();
const fileUri = result.uploaded_attachments[0].fileUri;
const mimeType = result.uploaded_attachments[0].mimeType;

Follow-up Requests (Referencing Files)

For subsequent requests, include the file information in the conversation history:

// Step 2: Follow-up request using conversation history with file references
const followUpRequest = {
  prompt: "What colors are dominant in the image?",
  target_language: "am",
  conversation_history: [
    {
      role: "user",
      parts: [
        {
          fileData: { fileUri: fileUri, mimeType: mimeType },
        },
        { text: "What's in this image?" },
      ],
    },
    {
      role: "assistant",
      parts: [{ text: "The image shows mountains with a lake." }],
    },
  ],
};

const followUpResponse = await fetch(
  "https://api.addisassistant.com/api/v1/chat_generate",
  {
    method: "POST",
    headers: {
      "X-API-Key": "YOUR_API_KEY",
      "Content-Type": "application/json",
    },
    body: JSON.stringify(followUpRequest),
  },
);
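
In a longer conversation the history grows with every turn, so it can help to build entries with small helpers instead of writing each object by hand. This is a sketch that mirrors the request shape used above; userTurn, assistantTurn, and buildFollowUp are hypothetical names, and placing file parts before the text part simply follows the example rather than a documented requirement.

```javascript
// Sketch: helpers for assembling conversation_history entries.
// All three function names are hypothetical, not Addis AI functions.

// A user turn: optional fileData parts first, then the text part.
function userTurn(text, fileParts = []) {
  return { role: "user", parts: [...fileParts, { text }] };
}

// An assistant turn holding the previous response text.
function assistantTurn(text) {
  return { role: "assistant", parts: [{ text }] };
}

// The JSON body for a follow-up request with the accumulated history.
function buildFollowUp(history, prompt, targetLanguage = "am") {
  return {
    prompt,
    target_language: targetLanguage,
    conversation_history: history,
  };
}
```

For instance, buildFollowUp([userTurn("What's in this image?", [{ fileData: { fileUri, mimeType } }]), assistantTurn(result.response_text)], "What colors are dominant in the image?") yields a body with the same structure as followUpRequest above.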

Combining Multiple Input Types

You can combine text, audio, and file attachments in a single request:

curl -X POST https://api.addisassistant.com/api/v1/chat_generate \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "chat_audio_input=@/path/to/question.wav" \
  -F "image1=@/path/to/image.jpg" \
  -F 'request_data={
    "prompt": "Additional context about the image",
    "target_language": "am",
    "attachment_field_names": ["image1"]
  };type=application/json'

Example: Multi-modal Form Submission

async function submitMultimodalForm() {
  const imageFile = document.getElementById("imageInput").files[0];
  const audioFile = document.getElementById("audioInput").files[0];
  const textPrompt = document.getElementById("textPrompt").value;

  const formData = new FormData();

  // Add files if present
  if (imageFile) formData.append("image1", imageFile);
  if (audioFile) formData.append("chat_audio_input", audioFile);

  // Create request data object
  const requestData = {
    prompt: textPrompt,
    target_language: "am",
  };

  // Add attachment field names if needed
  if (imageFile) {
    requestData.attachment_field_names = ["image1"];
  }

  // Append the request data as JSON
  formData.append("request_data", JSON.stringify(requestData));

  // Submit the request
  const response = await fetch(
    "https://api.addisassistant.com/api/v1/chat_generate",
    {
      method: "POST",
      headers: {
        "X-API-Key": "YOUR_API_KEY",
      },
      body: formData,
    },
  );

  return await response.json();
}

Best Practices

File Handling

- Keep file sizes reasonable (under 10 MB per file)
- Use appropriate file formats (JPEG/PNG for images, PDF for documents)
- Ensure images have good resolution but aren't excessively large
- When possible, compress files before uploading

Prompt Construction

- Be specific about what information you want from the attachments
- When submitting multiple files, clearly indicate which file you're asking about
- Consider breaking complex multi-file analysis into separate questions

Conversation Management

- Save file URIs and MIME types from the uploaded_attachments response field
- Include these file references in subsequent conversation turns using the parts array
- Store file URIs securely as they grant access to the files for future requests

Error Handling

- Handle upload failures gracefully (check for attachment_upload_failed error)
- Implement retries with exponential backoff for large file uploads
- Provide fallback options when files cannot be processed
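
The size guideline and the retry advice can be sketched together as follows. Everything here is illustrative: the helper names, the 500 ms base delay, the 8 second cap, and the reading of 10 MB as 10 × 1024 × 1024 bytes are assumptions rather than documented API values. The error payload format is not shown in this guide, so detecting attachment_upload_failed is left to the send callback, which is expected to throw when a request should be retried.

```javascript
// Assumed interpretation of the "under 10 MB per file" guideline.
const MAX_FILE_BYTES = 10 * 1024 * 1024;

// Returns the field names whose file exceeds the size guideline.
function oversizedFiles(files) {
  return Object.entries(files)
    .filter(([, file]) => file && file.size > MAX_FILE_BYTES)
    .map(([fieldName]) => fieldName);
}

// Exponential backoff: base, 2x base, 4x base, ... capped at capMs.
function backoffDelayMs(attempt, baseMs = 500, capMs = 8000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Calls send() until it resolves or the retry budget is exhausted.
// send must throw (or reject) to signal a retryable failure.
async function withRetry(send, { retries = 3, baseMs = 500, capMs = 8000 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await send();
    } catch (error) {
      lastError = error;
      if (attempt === retries) break; // no wait after the final attempt
      const delay = backoffDelayMs(attempt, baseMs, capMs);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

Because a FormData body can be sent more than once, the send callback can simply wrap the fetch call and throw when response.ok is false.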
