Multi-modal Input

Addis AI supports multi-modal inputs, allowing you to combine text, audio, and file attachments (images, documents) in a single request.

Overview and Use Cases

Multi-modal capabilities enable:
  • Richer conversational experiences with visual context
  • Document analysis in Ethiopian languages
  • Image captioning and description in local languages
  • Voice commands with supporting documents
  • Educational applications with visual and audio components
  • Visual question answering in Amharic and Afan Oromo

Text + Audio Combination

You can combine a text prompt with audio input in a single request.
Endpoint: POST /chat_generate with multipart/form-data
curl -X POST https://api.addisassistant.com/api/v1/chat_generate \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "chat_audio_input=@/path/to/your-audio.wav" \
  -F 'request_data={
    "prompt": "Additional text context here",
    "target_language": "am"
  };type=application/json'
This approach allows you to:
  • Provide supplementary text context to audio input
  • Send voice questions with text constraints or clarifications
  • Combine typed and spoken input in the same request

Attachment Support

You can include file attachments (images, documents) along with text or audio:
curl -X POST https://api.addisassistant.com/api/v1/chat_generate \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "image1=@/path/to/image.jpg" \
  -F "document1=@/path/to/document.pdf" \
  -F 'request_data={
    "prompt": "Describe these attachments",
    "target_language": "am",
    "attachment_field_names": ["image1", "document1"]
  };type=application/json'
Important: You must list all attachment field names in the attachment_field_names array in the JSON parameters.
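
Because every attachment field must also appear in attachment_field_names, one way to keep the two in sync is to derive the array from the same object that supplies the form fields. The following is a minimal sketch, not an official client: the helper name buildAttachmentForm is hypothetical, and it assumes an environment where FormData is available globally (a browser or Node 18+).

```javascript
// Hypothetical helper (not part of the Addis AI API): builds the multipart
// body so that every attached field is also listed in attachment_field_names.
// `attachments` maps a form field name to a File or Blob.
function buildAttachmentForm(prompt, targetLanguage, attachments) {
  const formData = new FormData();
  const fieldNames = Object.keys(attachments);

  // Attach each file under its field name.
  for (const name of fieldNames) {
    formData.append(name, attachments[name]);
  }

  // List the same field names in the JSON parameters.
  formData.append(
    "request_data",
    JSON.stringify({
      prompt,
      target_language: targetLanguage,
      attachment_field_names: fieldNames,
    }),
  );
  return formData;
}
```

The returned object can be passed as the body of a fetch call to /chat_generate, in the same way as the JavaScript examples later in this document.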

Supported Attachment Types

The system supports various file types, including:

| File Type | Supported Formats                       |
| --------- | --------------------------------------- |
| Images    | JPEG, PNG, WebP, GIF (first frame), BMP |
| Documents | PDF, TXT, RTF, DOCX, PPTX, XLSX         |
| Audio     | WAV, MP3, M4A, FLAC, OGG, WebM          |
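
A client can screen files against the formats listed above before uploading them. The sketch below is illustrative only: the function name classifyAttachment and the extension spellings (for example "jpg" as well as "jpeg") are assumptions, and the API may validate files by MIME type or content rather than by extension.

```javascript
// Extensions derived from the supported formats listed above. The exact
// spellings the API accepts are an assumption.
const SUPPORTED_EXTENSIONS = {
  image: ["jpg", "jpeg", "png", "webp", "gif", "bmp"],
  document: ["pdf", "txt", "rtf", "docx", "pptx", "xlsx"],
  audio: ["wav", "mp3", "m4a", "flac", "ogg", "webm"],
};

// Returns "image", "document", or "audio" for a supported file name,
// or null when the extension is missing or not in the list.
function classifyAttachment(fileName) {
  const dot = fileName.lastIndexOf(".");
  if (dot === -1) return null;
  const ext = fileName.slice(dot + 1).toLowerCase();
  for (const [kind, extensions] of Object.entries(SUPPORTED_EXTENSIONS)) {
    if (extensions.includes(ext)) return kind;
  }
  return null;
}
```

Rejecting a file on the client when classifyAttachment returns null avoids a round trip for uploads that would likely fail.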

Image Analysis

When you include image attachments, Addis AI can:
  • Describe the contents of images in Amharic or Afan Oromo
  • Answer questions about visual content
  • Extract text from images (OCR) and respond to it
  • Use visual context to inform responses

Sample Image Queries

"ይህን ምስል አብራራልኝ" (Describe this image)
"በዚህ ምስል ላይ ምን አለ?" (What is in this image?)
"በዚህ ምስል ላይ ስንት ሰዎች አሉ?" (How many people are in this image?)
"በዚህ ምስል ላይ ያለው ጽሑፍ ምንድን ነው?" (What text is in this image?)

Document Processing

When you include document attachments, Addis AI can:
  • Summarize document content in the target language
  • Answer questions about document content
  • Extract and analyze information from PDF, Word, or text files
  • Process bilingual documents and respond in the target language

Response Format with Attachments

Responses to multi-modal inputs follow the standard response format, plus an uploaded_attachments field that references the uploaded files:
{
  "response_text": "The image shows a traditional Ethiopian coffee ceremony...",
  "finish_reason": "stop",
  "usage_metadata": { ... },
  "modelVersion": "Addis-፩-አሌፍ",
  "uploaded_attachments": [
    { "fileUri": "gs://.../file1.png", "mimeType": "image/png" },
    { "fileUri": "gs://.../file2.pdf", "mimeType": "application/pdf" }
  ]
}
The uploaded_attachments array contains the file URIs and MIME types that can be referenced in future conversation turns.
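
To reuse these references, a client can convert each entry into the fileData part shape that the follow-up request example later in this document uses. This is a small sketch under that assumption; the helper name toFileParts is hypothetical, and it returns an empty array when the field is absent rather than throwing.

```javascript
// Hypothetical helper: converts the uploaded_attachments array of a response
// into fileData parts suitable for a conversation_history entry.
function toFileParts(result) {
  const attachments = Array.isArray(result?.uploaded_attachments)
    ? result.uploaded_attachments
    : [];
  return attachments
    // Skip entries that lack either value.
    .filter((entry) => entry && entry.fileUri && entry.mimeType)
    .map(({ fileUri, mimeType }) => ({ fileData: { fileUri, mimeType } }));
}
```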

Conversation History with Attachments

For multi-turn conversations involving files, you'll need to:
  1. First request: Upload files and store the returned URIs
  2. Subsequent requests: Include the file URIs in your conversation history

First Request (Uploading Files)

Upload files as part of a multipart request:
// Step 1: First request to upload files
const formData = new FormData();
formData.append("image1", imageFile);
formData.append(
  "request_data",
  JSON.stringify({
    prompt: "What's in this image?",
    target_language: "am",
    attachment_field_names: ["image1"],
  }),
);
const response = await fetch(
  "https://api.addisassistant.com/api/v1/chat_generate",
  {
    method: "POST",
    // Do not set Content-Type here; fetch adds the multipart boundary itself
    headers: { "X-API-Key": "YOUR_API_KEY" },
    body: formData,
  },
);
if (!response.ok) {
  throw new Error(`Upload request failed with status ${response.status}`);
}
const result = await response.json();
// Keep the URI and MIME type so later turns can reference the file
const { fileUri, mimeType } = result.uploaded_attachments[0];

Follow-up Requests (Referencing Files)

For subsequent requests, include the file information in the conversation history:
// Step 2: Follow-up request using conversation history with file references
const followUpRequest = {
  prompt: "What colors are dominant in the image?",
  target_language: "am",
  conversation_history: [
    {
      role: "user",
      parts: [
        {
          fileData: { fileUri: fileUri, mimeType: mimeType },
        },
        { text: "What's in this image?" },
      ],
    },
    {
      role: "assistant",
      parts: [{ text: "The image shows mountains with a lake." }],
    },
  ],
};
const followUpResponse = await fetch(
  "https://api.addisassistant.com/api/v1/chat_generate",
  {
    method: "POST",
    headers: {
      "X-API-Key": "YOUR_API_KEY",
      "Content-Type": "application/json",
    },
    body: JSON.stringify(followUpRequest),
  },
);
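
In a longer conversation the history grows by one user entry and one assistant entry per turn. The sketch below shows one way to accumulate it, assuming the role and parts shape from the example above; the helper name appendExchange is hypothetical, and the file URI is the elided placeholder from the response example, not a real value.

```javascript
// Hypothetical helper: returns a new history array with one user/assistant
// exchange appended, leaving the input array unchanged.
function appendExchange(history, userParts, assistantText) {
  return [
    ...history,
    { role: "user", parts: userParts },
    { role: "assistant", parts: [{ text: assistantText }] },
  ];
}

// Usage: record the first turn, then send the result as conversation_history.
const firstTurnHistory = appendExchange(
  [],
  [
    { fileData: { fileUri: "gs://.../file1.png", mimeType: "image/png" } },
    { text: "What's in this image?" },
  ],
  "The image shows mountains with a lake.",
);
```

Returning a new array instead of mutating the input keeps earlier snapshots of the history intact, which helps if a request fails and has to be retried with the previous state.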

Combining Multiple Input Types

You can combine text, audio, and file attachments in a single request:
curl -X POST https://api.addisassistant.com/api/v1/chat_generate \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "chat_audio_input=@/path/to/question.wav" \
  -F "image1=@/path/to/image.jpg" \
  -F 'request_data={
    "prompt": "Additional context about the image",
    "target_language": "am",
    "attachment_field_names": ["image1"]
  };type=application/json'

Example: Multi-modal Form Submission

async function submitMultimodalForm() {
  const imageFile = document.getElementById("imageInput").files[0];
  const audioFile = document.getElementById("audioInput").files[0];
  const textPrompt = document.getElementById("textPrompt").value;
  const formData = new FormData();
  // Add files if present
  if (imageFile) formData.append("image1", imageFile);
  if (audioFile) formData.append("chat_audio_input", audioFile);
  // Create request data object
  const requestData = {
    prompt: textPrompt,
    target_language: "am",
  };
  // List the image field only when an image was attached
  if (imageFile) {
    requestData.attachment_field_names = ["image1"];
  }
  // Append the request data as JSON
  formData.append("request_data", JSON.stringify(requestData));
  // Submit the request
  const response = await fetch(
    "https://api.addisassistant.com/api/v1/chat_generate",
    {
      method: "POST",
      headers: {
        "X-API-Key": "YOUR_API_KEY",
      },
      body: formData,
    },
  );
  // Surface HTTP errors instead of parsing an error body as a result
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return await response.json();
}

Best Practices

  1. File Handling
    • Keep file sizes reasonable (under 10 MB per file)
    • Use appropriate file formats (JPEG/PNG for images, PDF for documents)
    • Ensure images have good resolution but aren't excessively large
    • When possible, compress files before uploading
  2. Prompt Construction
    • Be specific about what information you want from the attachments
    • When submitting multiple files, clearly indicate which file you're asking about
    • Consider breaking complex multi-file analysis into separate questions
  3. Conversation Management
    • Save file URIs and MIME types from the uploaded_attachments response field
    • Include these file references in subsequent conversation turns using the parts array
    • Store file URIs securely, as they grant access to the files in future requests
  4. Error Handling
    • Handle upload failures gracefully (check for the attachment_upload_failed error)
    • Implement retries with exponential backoff for large file uploads
    • Provide fallback options when files cannot be processed
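
The retry advice under Error Handling can be sketched as a generic wrapper. This is an illustration under stated assumptions, not an official client: the name withBackoff, the default of 3 retries, and the 500 ms base delay are all hypothetical choices. The text above names the attachment_upload_failed error but does not say where it appears in a response, so the sketch retries only when the wrapped operation throws.

```javascript
// Retries an async operation with exponential backoff.
// The delay doubles after each failed attempt: base, 2x base, 4x base, ...
async function withBackoff(operation, { retries = 3, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await operation(attempt);
    } catch (error) {
      lastError = error;
      if (attempt === retries) break; // no attempts left
      const delayMs = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  // All attempts failed; surface the most recent error.
  throw lastError;
}
```

To use it with an upload, pass a function that performs the fetch call and throws when the response indicates a failure worth retrying. Which failures are safe to retry is a judgment call for the caller; errors caused by an unsupported file type, for instance, are unlikely to succeed on a second attempt.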