Speech-to-Text (Transcription)

The Speech-to-Text capability converts spoken language in audio files into written text. It is integrated into the /chat_generate endpoint as a preprocessing step that runs whenever audio input is provided.

Overview and Use Cases

Our Speech-to-Text technology is powered by our specialized audio models that have been trained to accurately recognize and transcribe Ethiopian languages. The system processes audio input and provides both the raw transcription and a cleaned version in the API response. This capability enables:
  • Voice Command Systems: Create applications that respond to spoken commands in Ethiopian languages
  • Content Transcription: Convert interviews, meetings, or lectures into searchable text
  • Voice Messaging: Transcribe voice messages to text for easier consumption
  • Accessibility Tools: Make audio content accessible to hearing-impaired users
  • Data Collection: Gather and analyze spoken data for research or business insights
  • Voice-First Applications: Build applications with voice as the primary input method

Request Format

Speech-to-text functionality is accessed through the /chat_generate endpoint using multipart/form-data format.

Basic Request

```bash
curl -X POST https://api.addisassistant.com/api/v1/chat_generate \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "chat_audio_input=@/path/to/your-audio.wav" \
  -F 'request_data={
    "target_language": "am"
  };type=application/json'
```
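
The same request can also be issued server-side. Below is a minimal Node.js sketch that mirrors the curl call above; it assumes Node 18+, where fetch, FormData, and Blob are available globally.

```javascript
// Minimal Node.js (18+) sketch mirroring the curl request above.
import { readFile } from "node:fs/promises";

async function transcribeFile(path) {
  const audioBuffer = await readFile(path);

  const formData = new FormData();
  // Wrap the raw bytes in a Blob so FormData sends a file part
  formData.append(
    "chat_audio_input",
    new Blob([audioBuffer], { type: "audio/wav" }),
    "audio.wav",
  );
  formData.append("request_data", JSON.stringify({ target_language: "am" }));

  const response = await fetch(
    "https://api.addisassistant.com/api/v1/chat_generate",
    {
      method: "POST",
      headers: { "X-API-Key": "YOUR_API_KEY" },
      body: formData,
    },
  );
  if (!response.ok) {
    throw new Error(`API error: ${response.status}`);
  }
  return response.json();
}

// Usage: transcribeFile("/path/to/your-audio.wav").then(console.log);
```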

Required Components

| Component        | Type | Description                                            |
| ---------------- | ---- | ------------------------------------------------------ |
| chat_audio_input | File | The audio file to transcribe                           |
| request_data     | JSON | Configuration for processing, wrapped as a JSON string |

Request Data Parameters

| Parameter            | Type   | Required | Description                                                              |
| -------------------- | ------ | -------- | ------------------------------------------------------------------------ |
| target_language      | string | Yes      | Language code for the response: am (Amharic) or om (Afan Oromo)          |
| prompt               | string | No       | Additional text context that can be combined with the transcribed audio  |
| conversation_history | array  | No       | Previous conversation turns (see Conversation Management)                |
| generation_config    | object | No       | Configuration for the response generation                                |

:::important
When using multipart/form-data with the chat_generate endpoint, all JSON parameters must be wrapped inside a field named request_data.
:::
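
For reference, a request_data payload combining several optional parameters might be built as follows. The generation_config keys shown (such as temperature) are illustrative assumptions; consult the generation configuration reference for the exact schema.

```javascript
// Example request_data with optional fields. The generation_config
// contents below are hypothetical example values, not a documented schema.
const requestData = {
  target_language: "am",
  prompt: "Summarize the audio in one sentence.",
  generation_config: {
    temperature: 0.7, // assumption: illustrative only
  },
};

const formData = new FormData();
formData.append("chat_audio_input", audioBlob);
formData.append("request_data", JSON.stringify(requestData));
```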

Response Format

The response includes both the transcription results and the AI's response to the transcribed content:
```json
{
  "response_text": "እሺ፣ ውሎ ደህና መሸ።",
  "finish_reason": "stop",
  "usage_metadata": {
    "prompt_token_count": 15,
    "candidates_token_count": 12,
    "total_token_count": 27
  },
  "modelVersion": "Addis-፩-አሌፍ",
  "transcription_raw": "<analysis>gender: male, emotion: neutral</analysis> ውሎ ደህና መሸ ወይ?",
  "transcription_clean": "ውሎ ደህና መሸ ወይ?"
}
```

Transcription Fields

| Field               | Type   | Description                                                                           |
| ------------------- | ------ | ------------------------------------------------------------------------------------- |
| transcription_raw   | string | Complete transcription with analysis markup as returned by the underlying model       |
| transcription_clean | string | Cleaned transcription text with analysis tags removed, suitable for display to users  |

The transcription_raw field may contain markup tags with metadata, while transcription_clean offers a user-friendly version for display purposes.
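
If you need the metadata embedded in transcription_raw, it can be pulled out client-side. The sketch below assumes the `<analysis>key: value, key: value</analysis>` format shown in the example response above; treat the tag contents as informal and subject to change.

```javascript
// Extract metadata from transcription_raw, assuming the
// "<analysis>key: value, key: value</analysis>" format shown in the
// example response. Falls back to the clean text for display.
function parseTranscription(result) {
  const match = result.transcription_raw.match(/<analysis>(.*?)<\/analysis>/);
  const metadata = {};
  if (match) {
    for (const pair of match[1].split(",")) {
      const [key, value] = pair.split(":").map((s) => s.trim());
      if (key && value) metadata[key] = value;
    }
  }
  return { text: result.transcription_clean, metadata };
}

// e.g. { text: "ውሎ ደህና መሸ ወይ?", metadata: { gender: "male", emotion: "neutral" } }
```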

Supported Audio Formats

The following audio formats are supported:

| Format        | MIME Types                                         |
| ------------- | -------------------------------------------------- |
| WAV           | audio/wav, audio/x-wav, audio/wave, audio/x-pn-wav |
| MP3           | audio/mpeg, audio/mp3, audio/x-mp3                 |
| M4A/MP4       | audio/mp4, audio/x-m4a, audio/m4a                  |
| WebM/Ogg/FLAC | audio/webm, audio/ogg, audio/x-flac, audio/flac    |
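
A simple client-side check can reject unsupported files before upload; this sketch just mirrors the MIME types in the table above.

```javascript
// Reject files with unsupported MIME types before uploading.
// The set below mirrors the supported-formats table.
const SUPPORTED_AUDIO_TYPES = new Set([
  "audio/wav", "audio/x-wav", "audio/wave", "audio/x-pn-wav",
  "audio/mpeg", "audio/mp3", "audio/x-mp3",
  "audio/mp4", "audio/x-m4a", "audio/m4a",
  "audio/webm", "audio/ogg", "audio/x-flac", "audio/flac",
]);

function isSupportedAudio(file) {
  return SUPPORTED_AUDIO_TYPES.has(file.type);
}
```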

Audio Quality Recommendations

For best transcription results:
  • Sample Rate: 16kHz or higher
  • Bit Depth: 16-bit or higher
  • Channels: Mono preferred, but stereo is supported
  • Recording Environment: Minimal background noise
  • Speaker Distance: 10-30cm from microphone
  • Audio Duration: 1-60 seconds per request
  • File Size: Maximum 10MB (see the validation sketch after this list)
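
The duration and size limits above can be enforced in the browser before any upload. A minimal sketch, assuming a browser with the promise-based decodeAudioData from the Web Audio API:

```javascript
// Validate an audio file against the 10MB and 1-60 second limits above
// before uploading. Duration is read by decoding the file in the browser.
async function validateAudioFile(file) {
  const MAX_BYTES = 10 * 1024 * 1024; // 10MB
  if (file.size > MAX_BYTES) {
    throw new Error("Audio file exceeds the 10MB limit");
  }

  const ctx = new AudioContext();
  try {
    const buffer = await ctx.decodeAudioData(await file.arrayBuffer());
    if (buffer.duration < 1 || buffer.duration > 60) {
      throw new Error("Audio must be between 1 and 60 seconds long");
    }
  } finally {
    await ctx.close();
  }
}
```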

Code Examples

Browser-based Recording and Transcription (JavaScript)

```javascript
// Basic audio recording and transcription
async function recordAndTranscribe() {
  // 1. Set up UI elements
  const startButton = document.getElementById("startRecording");
  const stopButton = document.getElementById("stopRecording");
  const statusDiv = document.getElementById("status");
  const resultDiv = document.getElementById("result");

  // 2. Initialize variables
  let mediaRecorder;
  let audioChunks = [];
  let stream;

  // 3. Set up event handlers
  startButton.addEventListener("click", async () => {
    try {
      // Request microphone access
      stream = await navigator.mediaDevices.getUserMedia({ audio: true });

      // Create media recorder
      mediaRecorder = new MediaRecorder(stream);
      audioChunks = [];

      // Collect audio chunks
      mediaRecorder.addEventListener("dataavailable", (event) => {
        audioChunks.push(event.data);
      });

      // Handle recording stop; { once: true } prevents duplicate
      // handlers from piling up across repeated recordings
      mediaRecorder.addEventListener(
        "stop",
        async () => {
          try {
            // Create the audio blob using the recorder's actual MIME type
            // (MediaRecorder typically produces WebM or Ogg, not WAV)
            const audioBlob = new Blob(audioChunks, {
              type: mediaRecorder.mimeType || "audio/webm",
            });

            // Send to API
            const result = await sendAudioForTranscription(audioBlob);

            // Display results
            statusDiv.textContent = "Transcription complete";
            resultDiv.innerHTML = `
              <p><strong>Transcription:</strong> ${result.transcription_clean}</p>
              <p><strong>Response:</strong> ${result.response_text}</p>
            `;

            // Stop the media tracks to release the microphone
            stream.getTracks().forEach((track) => track.stop());

            // Reset buttons
            startButton.disabled = false;
            stopButton.disabled = true;
          } catch (error) {
            statusDiv.textContent = `Error: ${error.message}`;
          }
        },
        { once: true },
      );

      // Start recording
      mediaRecorder.start();
      statusDiv.textContent = "Recording...";

      // Enable stop button, disable start button
      startButton.disabled = true;
      stopButton.disabled = false;
    } catch (error) {
      statusDiv.textContent = `Error: ${error.message}`;
    }
  });

  stopButton.addEventListener("click", () => {
    if (mediaRecorder && mediaRecorder.state !== "inactive") {
      mediaRecorder.stop();
      statusDiv.textContent = "Processing...";
    }
  });

  // 4. Function to send audio to the API
  async function sendAudioForTranscription(audioBlob) {
    const formData = new FormData();
    formData.append("chat_audio_input", audioBlob);
    formData.append(
      "request_data",
      JSON.stringify({
        target_language: "am",
      }),
    );

    const response = await fetch(
      "https://api.addisassistant.com/api/v1/chat_generate",
      {
        method: "POST",
        headers: {
          "X-API-Key": "YOUR_API_KEY",
        },
        body: formData,
      },
    );

    if (!response.ok) {
      throw new Error(`API error: ${response.status}`);
    }
    return await response.json();
  }
}

// Initialize the recording functionality
document.addEventListener("DOMContentLoaded", recordAndTranscribe);
```

File Upload and Transcription (JavaScript)

```javascript
// Handle file upload and transcription
document
  .getElementById("audioFileForm")
  .addEventListener("submit", async (event) => {
    event.preventDefault();

    const fileInput = document.getElementById("audioFile");
    const resultDiv = document.getElementById("transcriptionResult");
    const statusDiv = document.getElementById("status");

    if (!fileInput.files.length) {
      statusDiv.textContent = "Please select an audio file";
      return;
    }

    const audioFile = fileInput.files[0];
    statusDiv.textContent = "Uploading and transcribing...";

    try {
      const formData = new FormData();
      formData.append("chat_audio_input", audioFile);

      // Include additional context if needed
      const additionalText = document.getElementById("additionalText").value;
      const requestData = {
        target_language: "am",
      };
      if (additionalText) {
        requestData.prompt = additionalText;
      }
      formData.append("request_data", JSON.stringify(requestData));

      const response = await fetch(
        "https://api.addisassistant.com/api/v1/chat_generate",
        {
          method: "POST",
          headers: {
            "X-API-Key": "YOUR_API_KEY",
          },
          body: formData,
        },
      );

      if (!response.ok) {
        throw new Error(
          `Server returned ${response.status}: ${response.statusText}`,
        );
      }

      const result = await response.json();

      // Escape the raw transcription so its <analysis> markup
      // renders as text instead of being parsed as HTML
      const rawEscaped = result.transcription_raw
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;");

      // Display results
      statusDiv.textContent = "Transcription complete";
      resultDiv.innerHTML = `
        <div class="transcription">
          <h3>Transcription</h3>
          <p>${result.transcription_clean}</p>
        </div>
        <div class="analysis">
          <h3>Raw Transcription</h3>
          <pre>${rawEscaped}</pre>
        </div>
        <div class="response">
          <h3>AI Response</h3>
          <p>${result.response_text}</p>
        </div>
      `;
    } catch (error) {
      statusDiv.textContent = `Error: ${error.message}`;
    }
  });
```

Best Practices

  1. Audio Quality Optimization
    • Use a good quality microphone in a quiet environment
    • Position the microphone close to the speaker
    • Process audio to reduce background noise when necessary
    • Avoid clipping or distortion in recordings
  2. Language Considerations
    • For optimal results, use native speakers of the target language
    • Clearly articulate words, especially for technical terms
    • When mixing languages, separate them into different requests if possible
  3. Handling the Transcription
    • Use transcription_clean for user-facing displays
    • Implement confidence thresholds for accepting transcriptions
  4. Error Handling
    • Implement error handling for failed transcription requests (a retry sketch follows this list)
    • Provide users with feedback about recording quality
    • Consider fallback options like text input when transcription isn't possible
  5. User Experience
    • Show visual feedback during recording and processing
    • Allow users to review and edit transcriptions if needed
    • Implement voice activity detection to automatically stop recording
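
A minimal sketch of the error-handling advice above, retrying transient failures with exponential backoff. Treating 429 and 5xx responses as retryable is a common convention and an assumption here, not documented behavior of this API.

```javascript
// Retry transient failures with exponential backoff. Retrying on 429
// and 5xx is an assumption, not documented behavior of this API.
async function transcribeWithRetry(formData, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const response = await fetch(
      "https://api.addisassistant.com/api/v1/chat_generate",
      {
        method: "POST",
        headers: { "X-API-Key": "YOUR_API_KEY" },
        body: formData,
      },
    );
    if (response.ok) return response.json();

    const retryable = response.status === 429 || response.status >= 500;
    if (!retryable || attempt === maxAttempts) {
      throw new Error(`API error: ${response.status}`);
    }
    // Wait 1s, 2s, 4s, ... before the next attempt
    await new Promise((r) => setTimeout(r, 1000 * 2 ** (attempt - 1)));
  }
}
```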

Limitations and Considerations

  • Background Noise: Performance may decrease in noisy environments
  • Multiple Speakers: Works best with single-speaker audio
  • Dialects and Accents: Optimized for standard dialects of Amharic and Afan Oromo
  • Technical Terms: May have difficulty with specialized terminology or rare words
  • Audio Length: Best performance on audio clips between 1 and 60 seconds