Speech-to-Text (Transcription)

The Speech-to-Text capability converts spoken language in audio files into written text. It is integrated into the /chat_generate endpoint as a preprocessing step that runs whenever audio input is provided.

Overview and Use Cases

Our Speech-to-Text technology is powered by our specialized audio models that have been trained to accurately recognize and transcribe Ethiopian languages. The system processes audio input and provides both the raw transcription and a cleaned version in the API response. This capability enables:
  • Voice Command Systems: Create applications that respond to spoken commands in Ethiopian languages
  • Content Transcription: Convert interviews, meetings, or lectures into searchable text
  • Voice Messaging: Transcribe voice messages to text for easier consumption
  • Accessibility Tools: Make audio content accessible to hearing-impaired users
  • Data Collection: Gather and analyze spoken data for research or business insights
  • Voice-First Applications: Build applications with voice as the primary input method

Request Format

Speech-to-text functionality is accessed through the /chat_generate endpoint using multipart/form-data format.

Basic Request

```bash
curl -X POST https://api.addisassistant.com/api/v1/chat_generate \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "chat_audio_input=@/path/to/your-audio.wav" \
  -F 'request_data={
    "target_language": "am"
  };type=application/json'
```
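
The same request can also be issued server-side. Below is a minimal Node.js sketch that mirrors the curl call above; it assumes Node 18+, where fetch, FormData, and Blob are available globally.

```javascript
// Minimal Node.js (18+) sketch mirroring the curl request above.
import { readFile } from "node:fs/promises";

async function transcribeFile(path) {
  const audioBuffer = await readFile(path);

  const formData = new FormData();
  // Wrap the raw bytes in a Blob so FormData sends a file part
  formData.append(
    "chat_audio_input",
    new Blob([audioBuffer], { type: "audio/wav" }),
    "audio.wav",
  );
  formData.append("request_data", JSON.stringify({ target_language: "am" }));

  const response = await fetch(
    "https://api.addisassistant.com/api/v1/chat_generate",
    {
      method: "POST",
      headers: { "X-API-Key": "YOUR_API_KEY" },
      body: formData,
    },
  );
  if (!response.ok) {
    throw new Error(`API error: ${response.status}`);
  }
  return response.json();
}

// Usage: transcribeFile("/path/to/your-audio.wav").then(console.log);
```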

Required Components

| Component        | Type | Description                                            |
| ---------------- | ---- | ------------------------------------------------------ |
| chat_audio_input | File | The audio file to transcribe                           |
| request_data     | JSON | Configuration for processing, wrapped as a JSON string |

Request Data Parameters

| Parameter            | Type   | Required | Description                                                              |
| -------------------- | ------ | -------- | ------------------------------------------------------------------------ |
| target_language      | string | Yes      | Language code for the response: am (Amharic) or om (Afan Oromo)          |
| prompt               | string | No       | Additional text context that can be combined with the transcribed audio  |
| conversation_history | array  | No       | Previous conversation turns (see Conversation Management)                |
| generation_config    | object | No       | Configuration for the response generation                                |

:::important
When using multipart/form-data with the chat_generate endpoint, all JSON parameters must be wrapped inside a field named request_data.
:::
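
For reference, a request_data payload combining several optional parameters might be built as follows. The generation_config keys shown (such as temperature) are illustrative assumptions; consult the generation configuration reference for the exact schema.

```javascript
// Example request_data with optional fields. The generation_config
// contents below are hypothetical example values, not a documented schema.
const requestData = {
  target_language: "am",
  prompt: "Summarize the audio in one sentence.",
  generation_config: {
    temperature: 0.7, // assumption: illustrative only
  },
};

const formData = new FormData();
formData.append("chat_audio_input", audioBlob);
formData.append("request_data", JSON.stringify(requestData));
```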

Response Format

The response includes both the transcription results and the AI's response to the transcribed content:
```json
{
  "response_text": "እሺ፣ ውሎ ደህና መሸ።",
  "finish_reason": "stop",
  "usage_metadata": {
    "prompt_token_count": 15,
    "candidates_token_count": 12,
    "total_token_count": 27
  },
  "modelVersion": "Addis-፩-አሌፍ",
  "transcription_raw": "<analysis>gender: male, emotion: neutral</analysis> ውሎ ደህና መሸ ወይ?",
  "transcription_clean": "ውሎ ደህና መሸ ወይ?"
}
```

Transcription Fields

| Field               | Type   | Description                                                                           |
| ------------------- | ------ | ------------------------------------------------------------------------------------- |
| transcription_raw   | string | Complete transcription with analysis markup as returned by the underlying model       |
| transcription_clean | string | Cleaned transcription text with analysis tags removed, suitable for display to users  |

The transcription_raw field may contain markup tags with metadata, while transcription_clean offers a user-friendly version for display purposes.
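
If you need the metadata embedded in transcription_raw, it can be pulled out client-side. The sketch below assumes the `<analysis>key: value, key: value</analysis>` format shown in the example response above; treat the tag contents as informal and subject to change.

```javascript
// Extract metadata from transcription_raw, assuming the
// "<analysis>key: value, key: value</analysis>" format shown in the
// example response. Falls back to the clean text for display.
function parseTranscription(result) {
  const match = result.transcription_raw.match(/<analysis>(.*?)<\/analysis>/);
  const metadata = {};
  if (match) {
    for (const pair of match[1].split(",")) {
      const [key, value] = pair.split(":").map((s) => s.trim());
      if (key && value) metadata[key] = value;
    }
  }
  return { text: result.transcription_clean, metadata };
}

// e.g. { text: "ውሎ ደህና መሸ ወይ?", metadata: { gender: "male", emotion: "neutral" } }
```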

Supported Audio Formats

The following audio formats are supported:

| Format        | MIME Types                                         |
| ------------- | -------------------------------------------------- |
| WAV           | audio/wav, audio/x-wav, audio/wave, audio/x-pn-wav |
| MP3           | audio/mpeg, audio/mp3, audio/x-mp3                 |
| M4A/MP4       | audio/mp4, audio/x-m4a, audio/m4a                  |
| WebM/Ogg/FLAC | audio/webm, audio/ogg, audio/x-flac, audio/flac    |
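
A simple client-side check can reject unsupported files before upload; this sketch just mirrors the MIME types in the table above.

```javascript
// Reject files with unsupported MIME types before uploading.
// The set below mirrors the supported-formats table.
const SUPPORTED_AUDIO_TYPES = new Set([
  "audio/wav", "audio/x-wav", "audio/wave", "audio/x-pn-wav",
  "audio/mpeg", "audio/mp3", "audio/x-mp3",
  "audio/mp4", "audio/x-m4a", "audio/m4a",
  "audio/webm", "audio/ogg", "audio/x-flac", "audio/flac",
]);

function isSupportedAudio(file) {
  return SUPPORTED_AUDIO_TYPES.has(file.type);
}
```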

Audio Quality Recommendations

For best transcription results:
  • Sample Rate: 16kHz or higher
  • Bit Depth: 16-bit or higher
  • Channels: Mono preferred, but stereo is supported
  • Recording Environment: Minimal background noise
  • Speaker Distance: 10-30cm from microphone
  • Audio Duration: 1-60 seconds per request
  • File Size: Maximum 10MB (see the validation sketch after this list)
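
The duration and size limits above can be enforced in the browser before any upload. A minimal sketch, assuming a browser with the promise-based decodeAudioData from the Web Audio API:

```javascript
// Validate an audio file against the 10MB and 1-60 second limits above
// before uploading. Duration is read by decoding the file in the browser.
async function validateAudioFile(file) {
  const MAX_BYTES = 10 * 1024 * 1024; // 10MB
  if (file.size > MAX_BYTES) {
    throw new Error("Audio file exceeds the 10MB limit");
  }

  const ctx = new AudioContext();
  try {
    const buffer = await ctx.decodeAudioData(await file.arrayBuffer());
    if (buffer.duration < 1 || buffer.duration > 60) {
      throw new Error("Audio must be between 1 and 60 seconds long");
    }
  } finally {
    await ctx.close();
  }
}
```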

Code Examples

Browser-based Recording and Transcription (JavaScript)

```javascript
// Basic audio recording and transcription
async function recordAndTranscribe() {
  // 1. Set up UI elements
  const startButton = document.getElementById("startRecording");
  const stopButton = document.getElementById("stopRecording");
  const statusDiv = document.getElementById("status");
  const resultDiv = document.getElementById("result");

  // 2. Initialize variables
  let mediaRecorder;
  let audioChunks = [];
  let stream;

  // 3. Set up event handlers
  startButton.addEventListener("click", async () => {
    try {
      // Request microphone access
      stream = await navigator.mediaDevices.getUserMedia({ audio: true });

      // Create media recorder
      mediaRecorder = new MediaRecorder(stream);
      audioChunks = [];

      // Collect audio chunks
      mediaRecorder.addEventListener("dataavailable", (event) => {
        audioChunks.push(event.data);
      });

      // Handle recording stop; { once: true } prevents duplicate
      // handlers from piling up across repeated recordings
      mediaRecorder.addEventListener(
        "stop",
        async () => {
          try {
            // Create the audio blob using the recorder's actual MIME type
            // (MediaRecorder typically produces WebM or Ogg, not WAV)
            const audioBlob = new Blob(audioChunks, {
              type: mediaRecorder.mimeType || "audio/webm",
            });

            // Send to API
            const result = await sendAudioForTranscription(audioBlob);

            // Display results
            statusDiv.textContent = "Transcription complete";
            resultDiv.innerHTML = `
              <p><strong>Transcription:</strong> ${result.transcription_clean}</p>
              <p><strong>Response:</strong> ${result.response_text}</p>
            `;

            // Stop the media tracks to release the microphone
            stream.getTracks().forEach((track) => track.stop());

            // Reset buttons
            startButton.disabled = false;
            stopButton.disabled = true;
          } catch (error) {
            statusDiv.textContent = `Error: ${error.message}`;
          }
        },
        { once: true },
      );

      // Start recording
      mediaRecorder.start();
      statusDiv.textContent = "Recording...";

      // Enable stop button, disable start button
      startButton.disabled = true;
      stopButton.disabled = false;
    } catch (error) {
      statusDiv.textContent = `Error: ${error.message}`;
    }
  });

  stopButton.addEventListener("click", () => {
    if (mediaRecorder && mediaRecorder.state !== "inactive") {
      mediaRecorder.stop();
      statusDiv.textContent = "Processing...";
    }
  });

  // 4. Function to send audio to the API
  async function sendAudioForTranscription(audioBlob) {
    const formData = new FormData();
    formData.append("chat_audio_input", audioBlob);
    formData.append(
      "request_data",
      JSON.stringify({
        target_language: "am",
      }),
    );

    const response = await fetch(
      "https://api.addisassistant.com/api/v1/chat_generate",
      {
        method: "POST",
        headers: {
          "X-API-Key": "YOUR_API_KEY",
        },
        body: formData,
      },
    );

    if (!response.ok) {
      throw new Error(`API error: ${response.status}`);
    }
    return await response.json();
  }
}

// Initialize the recording functionality
document.addEventListener("DOMContentLoaded", recordAndTranscribe);
```

File Upload and Transcription (JavaScript)

```javascript
// Handle file upload and transcription
document
  .getElementById("audioFileForm")
  .addEventListener("submit", async (event) => {
    event.preventDefault();

    const fileInput = document.getElementById("audioFile");
    const resultDiv = document.getElementById("transcriptionResult");
    const statusDiv = document.getElementById("status");

    if (!fileInput.files.length) {
      statusDiv.textContent = "Please select an audio file";
      return;
    }

    const audioFile = fileInput.files[0];
    statusDiv.textContent = "Uploading and transcribing...";

    try {
      const formData = new FormData();
      formData.append("chat_audio_input", audioFile);

      // Include additional context if needed
      const additionalText = document.getElementById("additionalText").value;
      const requestData = {
        target_language: "am",
      };
      if (additionalText) {
        requestData.prompt = additionalText;
      }
      formData.append("request_data", JSON.stringify(requestData));

      const response = await fetch(
        "https://api.addisassistant.com/api/v1/chat_generate",
        {
          method: "POST",
          headers: {
            "X-API-Key": "YOUR_API_KEY",
          },
          body: formData,
        },
      );

      if (!response.ok) {
        throw new Error(
          `Server returned ${response.status}: ${response.statusText}`,
        );
      }

      const result = await response.json();

      // Escape the raw transcription so its <analysis> markup
      // renders as text instead of being parsed as HTML
      const rawEscaped = result.transcription_raw
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;");

      // Display results
      statusDiv.textContent = "Transcription complete";
      resultDiv.innerHTML = `
        <div class="transcription">
          <h3>Transcription</h3>
          <p>${result.transcription_clean}</p>
        </div>
        <div class="analysis">
          <h3>Raw Transcription</h3>
          <pre>${rawEscaped}</pre>
        </div>
        <div class="response">
          <h3>AI Response</h3>
          <p>${result.response_text}</p>
        </div>
      `;
    } catch (error) {
      statusDiv.textContent = `Error: ${error.message}`;
    }
  });
```

Best Practices

  1. Audio Quality Optimization
    • Use a good quality microphone in a quiet environment
    • Position the microphone close to the speaker
    • Process audio to reduce background noise when necessary
    • Avoid clipping or distortion in recordings
  2. Language Considerations
    • For optimal results, use native speakers of the target language
    • Clearly articulate words, especially for technical terms
    • When mixing languages, separate them into different requests if possible
  3. Handling the Transcription
    • Use transcription_clean for user-facing displays
    • Implement confidence thresholds for accepting transcriptions
  4. Error Handling
    • Implement error handling for failed transcription requests (a retry sketch follows this list)
    • Provide users with feedback about recording quality
    • Consider fallback options like text input when transcription isn't possible
  5. User Experience
    • Show visual feedback during recording and processing
    • Allow users to review and edit transcriptions if needed
    • Implement voice activity detection to automatically stop recording
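
A minimal sketch of the error-handling advice above, retrying transient failures with exponential backoff. Treating 429 and 5xx responses as retryable is a common convention and an assumption here, not documented behavior of this API.

```javascript
// Retry transient failures with exponential backoff. Retrying on 429
// and 5xx is an assumption, not documented behavior of this API.
async function transcribeWithRetry(formData, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const response = await fetch(
      "https://api.addisassistant.com/api/v1/chat_generate",
      {
        method: "POST",
        headers: { "X-API-Key": "YOUR_API_KEY" },
        body: formData,
      },
    );
    if (response.ok) return response.json();

    const retryable = response.status === 429 || response.status >= 500;
    if (!retryable || attempt === maxAttempts) {
      throw new Error(`API error: ${response.status}`);
    }
    // Wait 1s, 2s, 4s, ... before the next attempt
    await new Promise((r) => setTimeout(r, 1000 * 2 ** (attempt - 1)));
  }
}
```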

Limitations and Considerations

  • Background Noise: Performance may decrease in noisy environments
  • Multiple Speakers: Works best with single-speaker audio
  • Dialects and Accents: Optimized for standard dialects of Amharic and Afan Oromo
  • Technical Terms: May have difficulty with specialized terminology or rare words
  • Audio Length: Best performance on audio clips between 1 and 60 seconds