Multi-modal Input
Addis AI supports multi-modal inputs, allowing you to combine text, audio, and file attachments (images, documents) in a single request.
Overview and Use Cases
Multi-modal capabilities enable:
- Richer conversational experiences with visual context
- Document analysis in Ethiopian languages
- Image captioning and description in local languages
- Voice commands with supporting documents
- Educational applications with visual and audio components
- Visual question answering in Amharic and Afan Oromo
Text + Audio Combination
You can combine text prompts with audio input in a single request:
Endpoint: POST /chat_generate with multipart/form-data
curl -X POST https://api.addisassistant.com/api/v1/chat_generate \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "chat_audio_input=@/path/to/your-audio.wav" \
  -F 'request_data={
    "prompt": "Additional text context here",
    "target_language": "am"
  };type=application/json'
This approach allows you to:
- Provide supplementary text context to audio input
- Send voice questions with text constraints or clarifications
- Combine typed and spoken input in the same request
Attachment Support
You can include file attachments (images, documents) along with text or audio:
curl -X POST https://api.addisassistant.com/api/v1/chat_generate \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "image1=@/path/to/image.jpg" \
  -F "document1=@/path/to/document.pdf" \
  -F 'request_data={
    "prompt": "Describe these attachments",
    "target_language": "am",
    "attachment_field_names": ["image1", "document1"]
  };type=application/json'
Important: You must list all attachment field names in the attachment_field_names array in the JSON parameters.
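One way to keep attachment_field_names in sync with the files actually attached is to derive both from the same object. The sketch below assumes a hypothetical helper, buildRequestData, which is not part of the Addis AI API; only the prompt, target_language, and attachment_field_names fields come from the examples above.

```javascript
// Hypothetical helper (not part of the Addis AI API): derives
// attachment_field_names from the same object that holds the files,
// so the list cannot drift from the fields that were attached.
function buildRequestData(prompt, targetLanguage, attachments = {}) {
  const fieldNames = Object.keys(attachments);
  const requestData = { prompt, target_language: targetLanguage };
  // Only add the array when there is at least one attachment.
  if (fieldNames.length > 0) {
    requestData.attachment_field_names = fieldNames;
  }
  return requestData;
}

// Usage with placeholder strings standing in for File or Blob objects.
const exampleAttachments = { image1: "<image file>", document1: "<pdf file>" };
const exampleRequestData = buildRequestData(
  "Describe these attachments",
  "am",
  exampleAttachments,
);
console.log(JSON.stringify(exampleRequestData, null, 2));
```

The same attachments object can then be iterated to append each file to the FormData body under its own key, so the field names are written in only one place.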
Supported Attachment Types
The system supports various file types, including:
| File Type | Supported Formats |
| --------- | --------------------------------------- |
| Images | JPEG, PNG, WebP, GIF (first frame), BMP |
| Documents | PDF, TXT, RTF, DOCX, PPTX, XLSX |
| Audio | WAV, MP3, M4A, FLAC, OGG, WebM |
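A client-side pre-check against the table above can catch unsupported files before upload. This is a minimal sketch: the extension lists are an assumed mapping from the format names in the table (for example, JPEG to .jpg and .jpeg), and classifyAttachment is a hypothetical helper, not an API function. The server remains the authority on what it accepts.

```javascript
// Assumed extension mapping for the formats listed in the table.
const SUPPORTED_EXTENSIONS = {
  image: ["jpg", "jpeg", "png", "webp", "gif", "bmp"],
  document: ["pdf", "txt", "rtf", "docx", "pptx", "xlsx"],
  audio: ["wav", "mp3", "m4a", "flac", "ogg", "webm"],
};

// Returns "image", "document", or "audio", or null if the extension is not listed.
function classifyAttachment(fileName) {
  const dot = fileName.lastIndexOf(".");
  if (dot === -1) return null;
  const ext = fileName.slice(dot + 1).toLowerCase();
  for (const [kind, extensions] of Object.entries(SUPPORTED_EXTENSIONS)) {
    if (extensions.includes(ext)) return kind;
  }
  return null;
}

console.log(classifyAttachment("photo.JPG"));   // prints: image
console.log(classifyAttachment("report.pdf"));  // prints: document
console.log(classifyAttachment("archive.zip")); // prints: null
```

Checking the extension only is a convenience; it says nothing about the actual file contents.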
Image Analysis
When you include image attachments, Addis AI can:
- Describe the contents of images in Amharic or Afan Oromo
- Answer questions about visual content
- Extract text from images (OCR) and respond to it
- Use visual context to inform responses
Sample Image Queries
"ይህን ምስል አብራራልኝ" (Describe this image)
"በዚህ ምስል ላይ ምን አለ?" (What is in this image?)
"በዚህ ምስል ላይ ስንት ሰዎች አሉ?" (How many people are in this image?)
"በዚህ ምስል ላይ ያለው ጽሑፍ ምንድን ነው?" (What text is in this image?)
Document Processing
When you include document attachments, Addis AI can:
- Summarize document content in the target language
- Answer questions about document content
- Extract and analyze information from PDF, Word, or text files
- Process bilingual documents and respond in the target language
Response Format with Attachments
Responses to multi-modal inputs follow the standard response format, with the addition of the uploaded_attachments field, which contains references to the uploaded files:
{
  "response_text": "The image shows a traditional Ethiopian coffee ceremony...",
  "finish_reason": "stop",
  "usage_metadata": { ... },
  "modelVersion": "Addis-፩-አሌፍ",
  "uploaded_attachments": [
    { "fileUri": "gs://.../file1.png", "mimeType": "image/png" },
    { "fileUri": "gs://.../file2.pdf", "mimeType": "application/pdf" }
  ]
}
The uploaded_attachments array contains the file URIs and MIME types that can be referenced in future conversation turns.
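To reuse uploaded files later, the entries in uploaded_attachments can be converted into the fileData parts used in conversation history. The sketch below assumes the response shape shown above; toFileParts is a hypothetical helper name, and treating a missing uploaded_attachments field as an empty list is a defensive assumption rather than documented behavior.

```javascript
// Hypothetical helper: converts uploaded_attachments entries into
// { fileData: { fileUri, mimeType } } parts for later conversation turns.
function toFileParts(result) {
  // Assumption: a response without attachments may omit the field entirely.
  const uploaded = Array.isArray(result.uploaded_attachments)
    ? result.uploaded_attachments
    : [];
  return uploaded
    .filter((entry) => entry && entry.fileUri && entry.mimeType)
    .map(({ fileUri, mimeType }) => ({ fileData: { fileUri, mimeType } }));
}

// Example using the placeholder URIs from the sample response.
const sampleResult = {
  response_text: "The image shows a traditional Ethiopian coffee ceremony...",
  uploaded_attachments: [
    { fileUri: "gs://.../file1.png", mimeType: "image/png" },
    { fileUri: "gs://.../file2.pdf", mimeType: "application/pdf" },
  ],
};
console.log(toFileParts(sampleResult));
```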
Conversation History with Attachments
For multi-turn conversations involving files, you'll need to:
- First request: Upload files and store the returned URIs
- Subsequent requests: Include the file URIs in your conversation history
First Request (Uploading Files)
Upload files as part of a multipart request:
// Build a multipart body: the file plus the JSON parameters.
const formData = new FormData();
formData.append("image1", imageFile);
formData.append(
  "request_data",
  JSON.stringify({
    prompt: "What's in this image?",
    target_language: "am",
    attachment_field_names: ["image1"],
  }),
);

// Do not set Content-Type here; fetch adds the multipart boundary itself.
const response = await fetch(
  "https://api.addisassistant.com/api/v1/chat_generate",
  {
    method: "POST",
    headers: { "X-API-Key": "YOUR_API_KEY" },
    body: formData,
  },
);

// Keep the URI and MIME type so later turns can reference the file.
const result = await response.json();
const fileUri = result.uploaded_attachments[0].fileUri;
const mimeType = result.uploaded_attachments[0].mimeType;
Follow-up Requests (Referencing Files)
For subsequent requests, include the file information in the conversation history:
const followUpRequest = {
  prompt: "What colors are dominant in the image?",
  target_language: "am",
  conversation_history: [
    {
      role: "user",
      parts: [
        {
          fileData: { fileUri: fileUri, mimeType: mimeType },
        },
        { text: "What's in this image?" },
      ],
    },
    {
      role: "assistant",
      parts: [{ text: "The image shows mountains with a lake." }],
    },
  ],
};
const followUpResponse = await fetch(
  "https://api.addisassistant.com/api/v1/chat_generate",
  {
    method: "POST",
    headers: {
      "X-API-Key": "YOUR_API_KEY",
      "Content-Type": "application/json",
    },
    body: JSON.stringify(followUpRequest),
  },
);
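As a conversation grows, each completed exchange needs to be added to conversation_history before the next request. The sketch below shows one way to do that with a hypothetical appendTurn helper. The role and parts structure follows the example above; the helper itself, the placeholder replies, and the choice to return a new array instead of mutating the old one are assumptions for illustration.

```javascript
// Hypothetical helper: returns a new history with one user/assistant exchange
// appended. File parts go before the text part, as in the example above.
function appendTurn(history, userText, assistantText, fileParts = []) {
  return [
    ...history,
    { role: "user", parts: [...fileParts, { text: userText }] },
    { role: "assistant", parts: [{ text: assistantText }] },
  ];
}

// The first turn included an uploaded image; the second turn is text only.
const imagePart = {
  fileData: { fileUri: "gs://.../file1.png", mimeType: "image/png" },
};
let chatHistory = appendTurn(
  [],
  "What's in this image?",
  "The image shows mountains with a lake.",
  [imagePart],
);
chatHistory = appendTurn(
  chatHistory,
  "What colors are dominant in the image?",
  "Placeholder reply from the second turn.",
);

// The accumulated history is sent along with the next prompt.
const nextRequest = {
  prompt: "Placeholder third question",
  target_language: "am",
  conversation_history: chatHistory,
};
console.log(JSON.stringify(nextRequest, null, 2));
```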
Combining Multiple Input Types
You can combine text, audio, and file attachments in a single request:
curl -X POST https://api.addisassistant.com/api/v1/chat_generate \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "chat_audio_input=@/path/to/question.wav" \
  -F "image1=@/path/to/image.jpg" \
  -F 'request_data={
    "prompt": "Additional context about the image",
    "target_language": "am",
    "attachment_field_names": ["image1"]
  };type=application/json'
Example: Multi-modal Form Submission
async function submitMultimodalForm() {
  // Read the optional image and audio files and the text prompt from the page.
  const imageFile = document.getElementById("imageInput").files[0];
  const audioFile = document.getElementById("audioInput").files[0];
  const textPrompt = document.getElementById("textPrompt").value;

  // Attach only the files the user actually selected.
  const formData = new FormData();
  if (imageFile) formData.append("image1", imageFile);
  if (audioFile) formData.append("chat_audio_input", audioFile);

  const requestData = {
    prompt: textPrompt,
    target_language: "am",
  };
  // List the image field name only when an image is attached.
  if (imageFile) {
    requestData.attachment_field_names = ["image1"];
  }
  formData.append("request_data", JSON.stringify(requestData));

  const response = await fetch(
    "https://api.addisassistant.com/api/v1/chat_generate",
    {
      method: "POST",
      headers: {
        "X-API-Key": "YOUR_API_KEY",
      },
      body: formData,
    },
  );
  return await response.json();
}
Best Practices
- File Handling
  - Keep file sizes reasonable (under 10 MB per file)
  - Use appropriate file formats (JPEG/PNG for images, PDF for documents)
  - Ensure images have good resolution but aren't excessively large
  - When possible, compress files before uploading
- Prompt Construction
  - Be specific about what information you want from the attachments
  - When submitting multiple files, clearly indicate which file you're asking about
  - Consider breaking complex multi-file analysis into separate questions
- Conversation Management
  - Save file URIs and MIME types from the uploaded_attachments response field
  - Include these file references in subsequent conversation turns using the parts array
  - Store file URIs securely, as they grant access to the files for future requests
- Error Handling
  - Handle upload failures gracefully (check for attachment_upload_failed error)
  - Implement retries with exponential backoff for large file uploads
  - Provide fallback options when files cannot be processed
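The retry advice above can be sketched as a small wrapper around any upload function. This is a minimal illustration, not the API's prescribed approach: withRetry, backoffDelay, the default delay values, and the injectable sleep parameter are all assumptions, and which errors are worth retrying depends on how your client surfaces failures such as attachment_upload_failed.

```javascript
// Delay before retry number `attempt` (0-based): baseMs, 2 * baseMs, 4 * baseMs, ...
function backoffDelay(attempt, baseMs = 500) {
  return baseMs * 2 ** attempt;
}

// Hypothetical wrapper: calls `operation` until it succeeds or the retry
// budget is spent. `sleep` is injectable so the logic can be tested quickly.
async function withRetry(operation, options = {}) {
  const {
    maxRetries = 3,
    baseMs = 500,
    sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
  } = options;
  let lastError;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await operation(attempt);
    } catch (error) {
      lastError = error;
      // Wait only if another attempt will follow.
      if (attempt < maxRetries) await sleep(backoffDelay(attempt, baseMs));
    }
  }
  // All attempts failed: surface the most recent error to the caller.
  throw lastError;
}
```

Note that fetch rejects only on network errors, so the wrapped function needs to throw on error responses for them to be retried. A call such as withRetry(() => uploadOnce(formData)), where uploadOnce is your own function, would then make up to four attempts with waits of 500 ms, 1000 ms, and 2000 ms in between.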