In a world where people connect across borders daily, breaking language barriers is more important than ever. An AI-powered voice translation app allows users to speak in one language and instantly receive translations in another, making global communication seamless. This project combines natural language processing (NLP), speech recognition, and machine learning to create a mobile application that’s practical, innovative, and in high demand.

1. User taps a mic → speaks in Language A
2. App transcribes speech to text (ASR)
3. Text is translated to Language B
4. Translated text is synthesized as speech (TTS)
5. Audio plays back; optional on-screen captions
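That five-step loop can be sketched as a plain function pipeline. The lambda stages below are illustrative stand-ins, not the real ASR, translation, or TTS calls:

```python
from typing import Callable

def run_pipeline(
    audio_in: bytes,
    asr: Callable[[bytes], str],        # speech -> text (Language A)
    translate: Callable[[str], str],    # text A -> text B
    tts: Callable[[str], bytes],        # text B -> audio
) -> tuple:
    """One turn of the translator: returns (transcript, translation, audio)."""
    transcript = asr(audio_in)
    translation = translate(transcript)
    audio_out = tts(translation)
    return transcript, translation, audio_out

# stand-in stages, for illustration only
transcript, translation, audio = run_pipeline(
    b"<pcm bytes>",
    asr=lambda a: "hello",
    translate=lambda t: "hola",
    tts=lambda t: t.encode(),
)
```

Keeping the stages behind plain callables like this also makes it easy to swap a cloud API for a local model later without touching the rest of the app.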
Option A — Hybrid (Recommended for MVP)
Option B — Mostly On-Device (Advanced)
For a production v1, start with Option A, then migrate hot paths on-device later.
1. App records audio → sends chunks to /asr/stream (WebSocket) or /asr (REST)
2. Backend returns partial/final transcripts
3. App calls /translate with source text + target lang
4. App calls /tts with translated text → gets WAV/MP3
5. App plays audio; shows captions
1) FastAPI project layout
```
server/
  app.py
  asr.py
  translate.py
  tts.py
  models/        # store ONNX/PyTorch models or cache
```
2) app.py (wire endpoints)
```python
from fastapi import FastAPI, UploadFile, Form
from fastapi.responses import StreamingResponse, JSONResponse

from asr import transcribe_audio
from translate import translate_text
from tts import synthesize

app = FastAPI()

@app.post("/asr")
async def asr(file: UploadFile, source_lang: str = Form(default="auto")):
    audio_bytes = await file.read()
    text = transcribe_audio(audio_bytes, source_lang)
    return JSONResponse({"text": text})

@app.post("/translate")
async def translate(q: str = Form(...), target_lang: str = Form(...), source_lang: str = Form("auto")):
    out = translate_text(q, target_lang, source_lang)
    return JSONResponse({"text": out})

@app.post("/tts")
async def tts(q: str = Form(...), voice: str = Form("en-US")):
    audio_iter = synthesize(q, voice)  # yields PCM/WAV bytes
    return StreamingResponse(audio_iter, media_type="audio/wav")
```
3) asr.py (Whisper base; replace with your preferred runtime)
```python
import io

import torchaudio
import whisper

_model = whisper.load_model("base")  # for lower latency use "small" or "tiny"

def transcribe_audio(audio_bytes: bytes, source_lang: str = "auto") -> str:
    wav, sr = torchaudio.load(io.BytesIO(audio_bytes))
    if sr != 16000:
        wav = torchaudio.functional.resample(wav, sr, 16000)  # Whisper expects 16 kHz
    audio = wav.squeeze(0).numpy()
    result = _model.transcribe(audio, language=None if source_lang == "auto" else source_lang)
    return result.get("text", "").strip()
```
4) translate.py (Transformers NLLB example)
```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "facebook/nllb-200-distilled-600M"
_device = "cuda" if torch.cuda.is_available() else "cpu"
_tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
_model = AutoModelForSeq2SeqLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16 if _device == "cuda" else torch.float32,  # fp16 only on GPU
).eval().to(_device)

# map BCP-47 -> NLLB language tokens (simplified)
NLLB = {"en": "eng_Latn", "hi": "hin_Deva", "es": "spa_Latn", "fr": "fra_Latn", "de": "deu_Latn"}

def translate_text(txt: str, target_lang: str, source_lang: str = "auto") -> str:
    tgt = NLLB.get(target_lang, "eng_Latn")
    if source_lang != "auto":
        _tokenizer.src_lang = NLLB.get(source_lang, "eng_Latn")
    inputs = _tokenizer(txt, return_tensors="pt").to(_model.device)
    generated_tokens = _model.generate(
        **inputs,
        # lang_code_to_id was removed in newer transformers; convert the token directly
        forced_bos_token_id=_tokenizer.convert_tokens_to_ids(tgt),
        max_new_tokens=256,
    )
    return _tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
```
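NLLB handles short utterances well, but long monologues are best translated sentence by sentence. A rough chunker is sketched below; the regex split is an assumption for illustration, and a proper sentence segmenter should replace it in production:

```python
import re

def chunk_sentences(text: str, max_chars: int = 400) -> list:
    """Split text on sentence boundaries so each chunk stays under max_chars.

    Keeps each translate() call comfortably inside the model's context;
    a crude regex split, not a real segmenter.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Translate each chunk independently and join the results; this also lets you stream partial translations to the UI as they finish.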
5) tts.py (Piper CLI wrapper—simple and fast)
```python
import subprocess
from typing import Iterator

# assumes piper + voice model installed, e.g. en_US-amy-low.onnx
VOICE_PATHS = {"en-US": "voices/en_US-amy-low.onnx", "hi-IN": "voices/hi_IN-...onnx"}

def synthesize(text: str, voice: str) -> Iterator[bytes]:
    model = VOICE_PATHS.get(voice, VOICE_PATHS["en-US"])
    # --output_raw streams headerless 16-bit PCM (not a WAV file)
    proc = subprocess.Popen(
        ["piper", "--model", model, "--sentence_silence", "0.25", "--output_raw"],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE,
    )
    proc.stdin.write(text.encode("utf-8"))
    proc.stdin.close()
    while True:
        chunk = proc.stdout.read(4096)
        if not chunk:
            break
        yield chunk
    proc.wait()
```
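Note that Piper's `--output_raw` emits headerless PCM, while the `/tts` endpoint advertises `audio/wav`. If you buffer the clip before responding, you can prepend a minimal RIFF header yourself; the 22050 Hz default below is an assumption, since the actual rate depends on the voice model:

```python
import struct

def pcm_to_wav(pcm: bytes, sample_rate: int = 22050, channels: int = 1, bits: int = 16) -> bytes:
    """Prepend a minimal RIFF/WAVE header to raw little-endian PCM."""
    byte_rate = sample_rate * channels * bits // 8
    block_align = channels * bits // 8
    header = struct.pack(
        "<4sI4s4sIHHIIHH4sI",
        b"RIFF", 36 + len(pcm), b"WAVE",          # RIFF chunk
        b"fmt ", 16, 1, channels, sample_rate,    # fmt subchunk (PCM = 1)
        byte_rate, block_align, bits,
        b"data", len(pcm),                        # data subchunk
    )
    return header + pcm
```

If you keep the streaming path instead, serve the raw bytes with a matching media type and decode them client-side.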
Swap Piper/Coqui with a cloud TTS if you need premium voices quickly.
Dependencies
- record (mic capture)
- http (upload audio, call APIs)
- just_audio (playback)
- permission_handler (mic permission)

```yaml
dependencies:
  record: ^5
  http: ^1
  just_audio: ^0.9
  permission_handler: ^11
```
Record → ASR → Translate → TTS (happy-path demo)
```dart
import 'dart:convert';
import 'dart:io';

import 'package:record/record.dart';
import 'package:http/http.dart' as http;
import 'package:just_audio/just_audio.dart';

final record = AudioRecorder();
final player = AudioPlayer();
const server = "http://YOUR_SERVER:8000";

Future<void> captureAndTranslate(String targetLang) async {
  if (!await record.hasPermission()) return;
  // use a writable temp directory on a real device
  await record.start(
    const RecordConfig(encoder: AudioEncoder.wav, sampleRate: 16000),
    path: "/tmp_in.wav",
  );
  // UI shows "listening..." and a stop button
}

Future<void> stopAndProcess(String targetLang) async {
  final path = await record.stop();

  final asrReq = http.MultipartRequest('POST', Uri.parse("$server/asr"))
    ..fields['source_lang'] = 'auto'
    ..files.add(await http.MultipartFile.fromPath('file', path!));
  final asrRes = await http.Response.fromStream(await asrReq.send());
  final sourceText = jsonDecode(asrRes.body)['text'];

  final trRes = await http.post(Uri.parse("$server/translate"),
      body: {'q': sourceText, 'target_lang': targetLang, 'source_lang': 'auto'});
  final translated = jsonDecode(trRes.body)['text'];

  final ttsRes = await http.post(Uri.parse("$server/tts"),
      body: {'q': translated, 'voice': langToVoice(targetLang)});
  // write the returned audio to a file and play it back
  final out = File('/tmp_out.wav')..writeAsBytesSync(ttsRes.bodyBytes);
  await player.setFilePath(out.path);
  await player.play();
}
```

For streaming (low-latency captions), switch /asr to a WebSocket endpoint and send audio chunks every 200–400 ms with VAD (voice activity detection).
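A minimal energy-based VAD is enough to gate those chunk uploads in a demo. This RMS-threshold sketch is a stand-in for a real detector such as webrtcvad or Silero VAD, and the threshold is an assumption you would tune per device:

```python
import array
import math

def is_speech(pcm16: bytes, threshold: float = 500.0) -> bool:
    """Crude energy-based VAD over one chunk of 16-bit mono PCM.

    Returns True when RMS energy exceeds the threshold; replace with
    webrtcvad or Silero VAD for robustness to noise.
    """
    samples = array.array("h", pcm16)  # signed 16-bit samples
    if not samples:
        return False
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms > threshold

# illustrative chunks: digital silence vs. a loud alternating waveform
silence = array.array("h", [0] * 320).tobytes()
loud = array.array("h", [8000, -8000] * 160).tobytes()
```

Only chunks that pass the check (plus a little padding before and after) need to go over the WebSocket, which cuts both bandwidth and ASR cost.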
Latency & quality playbook
Privacy, safety & cost
Testing matrix (quick but strong)
Accessibility & UX details that delight
Versioning & rollout plan
1. v0.1: One-shot translation (record → result), English↔Hindi, server ASR/Translate/TTS
2. v0.2: Streaming ASR + partial captions, add Spanish/French
3. v0.3: Conversation mode (turn-taking), phrase history
4. v1.0: On-device ASR on newer phones, privacy mode, cloud TTS fallback
5. v1.x: Custom voice cloning (opt-in), domain glossaries, analytics dashboard
Production hardening checklist
To make this app a reality, you’ll touch Python, ML, NLP, mobile dev, and cloud. Uncodemy courses cover the skills you’ll need.
These programs emphasize hands-on projects—exactly what you need for a voice translator pipeline.
1. Stand up the FastAPI skeleton and test /asr with Whisper base
2. Add /translate via NLLB-200 distilled (or plug in cloud for speed)
3. Integrate Piper for fast, free TTS; ship a first Android APK
4. Add streaming ASR (WebSocket + VAD) for live captions
5. Measure end-to-end median latency; aim for < 1.5 s on short utterances
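Step 5's target is easy to track if every turn is timed. A minimal sketch follows; the latency numbers are illustrative placeholders, not measurements:

```python
import statistics
import time

def timed(fn, *args):
    """Run fn and return (result, elapsed_seconds) -- wrap each pipeline stage."""
    t0 = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - t0

# per-turn end-to-end timings gathered from a test run (illustrative values)
latencies = [1.2, 0.9, 1.6, 1.1, 0.8]
median = statistics.median(latencies)
p95 = statistics.quantiles(latencies, n=20, method="inclusive")[18]
print(f"median={median:.2f}s p95={p95:.2f}s")  # compare median against the 1.5 s goal
```

Tracking p95 alongside the median matters because occasional slow turns (cold model loads, network stalls) are what users actually notice.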