Architecture advice: Real-time pipeline for YouTube Audio -> Whisper -> LLM -> SSE (Sub-10s latency) [D]

Hey everyone, I’m building a backend that analyzes long YouTube videos using an LLM.

Currently, my flow is a slow waterfall: Download full audio -> Whisper -> LLM -> Return results. For a 30-minute video, the user waits forever.

I want to pipeline this for real-time SSE streaming: [Chunk Audio on the fly] -> [Whisper] -> [LLM] -> [Stream to UI]

My questions for the data/backend engineers:

Chunking & VAD: What's the best way to chunk YouTube audio streams (e.g., via ffmpeg) without cutting sentences in half and ruining the LLM's context?
Queueing: Is standard asyncio in FastAPI enough to handle these overlapping tasks, or do I strictly need Celery/Redis workers for this pipeline?

Any library recommendations or architectural patterns would be hugely appreciated