Markdown to Multi-Speaker Podcast: My Local TTS CLI
Turning Markdown scripts into podcast-style audio with a local-first TTS pipeline

Hey! I'm Aman Kumar, a passionate full-stack developer and open-source enthusiast from Uttar Pradesh, turning ideas into clean, creative, and secure code. 🚀
Introduction
I like experimenting with text-to-speech tools, especially for podcast-style or dialogue-based content.
But most tools I tried felt heavier than necessary — cloud dashboards, API keys, usage limits, or complex setups just to generate a small audio clip.
I wanted something simpler.
A tool where I could write a script like code, run a single command in the terminal, and get clean, stitched audio — fully on my own machine.
That idea eventually became Podvoice.
What is Podvoice?
Podvoice is a local-first command-line tool that converts simple Markdown scripts into multi-speaker, podcast-style audio using Coqui's openly available XTTS v2 model.
It runs entirely on your system:
No cloud APIs
No API keys
No paid services
If you can write a Markdown file, you can generate audio from it.
Why Markdown?
Markdown is already familiar to developers.
It’s readable, version-controllable, and easy to edit.
Instead of designing a GUI or complex configuration files, Podvoice uses a very simple script format:
[Host | calm]
Hello and welcome to the show.
[Guest | excited]
Today we’re talking about developer tools.
The speaker name stays consistent throughout the script
The optional emotion tag is parsed for future extensibility
Everything remains human-readable and diff-friendly
This makes scripting conversations feel more like writing code than operating a tool.
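To make the format concrete, here is a minimal sketch of how a script like the one above could be parsed into speaker segments. It is not Podvoice's actual parser: the `Segment` class, the `parse_script` function, and the regular expression are all names I am assuming for illustration, and the real tool may handle edge cases differently.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Matches a header line such as "[Host | calm]" or "[Guest]".
# The emotion part after "|" is optional.
HEADER = re.compile(r"^\[\s*([^\]|]+?)\s*(?:\|\s*([^\]]*?)\s*)?\]\s*$")


@dataclass
class Segment:
    speaker: str
    emotion: Optional[str]
    text: str


def parse_script(source: str) -> List[Segment]:
    """Split a script into one Segment per speaker header (illustrative only)."""
    segments: List[Segment] = []
    for raw_line in source.splitlines():
        line = raw_line.strip()
        match = HEADER.match(line)
        if match:
            # An empty emotion such as "[Host |]" is treated as missing.
            emotion = match.group(2) or None
            segments.append(Segment(match.group(1), emotion, ""))
        elif line and segments:
            # Text lines are appended to the most recent speaker.
            # Lines that appear before the first header are ignored.
            current = segments[-1]
            current.text = (current.text + " " + line).strip()
    return segments


if __name__ == "__main__":
    demo = "[Host | calm]\nHello and welcome to the show.\n[Guest | excited]\nToday we are talking about developer tools.\n"
    for segment in parse_script(demo):
        print(segment)
```

Because the input is plain text, a parser this small is easy to test and to extend, for example if the emotion tag gains meaning later.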
How Podvoice Works (High-Level)
Internally, Podvoice follows a straightforward pipeline:
Parse the Markdown file into speaker segments
Map each speaker name to a consistent voice from XTTS
Generate audio for each segment
Stitch all segments together into a single output file
The tool uses Coqui XTTS v2 purely for inference — there’s no model training or fine-tuning involved.
The goal was not to build a research project, but a practical developer tool.
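The voice-mapping and stitching steps of the pipeline above can be sketched with the Python standard library alone. This is an assumption-heavy illustration, not Podvoice's implementation: `assign_voices`, `stitch_wavs`, and the placeholder voice names are hypothetical, and the sketch assumes every segment has already been rendered to WAV bytes with identical sample rate, channel count, and sample width.

```python
import io
import wave
from typing import Dict, List, Sequence


def assign_voices(speakers: Sequence[str], voices: Sequence[str]) -> Dict[str, str]:
    """Give each distinct speaker a voice, in order of first appearance.

    The same speaker always gets the same voice within a script. Voices are
    reused (wrapping around) if there are more speakers than voices.
    """
    if not voices:
        raise ValueError("at least one voice is required")
    mapping: Dict[str, str] = {}
    for name in speakers:
        if name not in mapping:
            mapping[name] = voices[len(mapping) % len(voices)]
    return mapping


def stitch_wavs(clips: List[bytes]) -> bytes:
    """Concatenate WAV clips that share the same audio parameters."""
    if not clips:
        raise ValueError("no clips to stitch")
    out = io.BytesIO()
    with wave.open(out, "wb") as writer:
        for index, clip in enumerate(clips):
            with wave.open(io.BytesIO(clip), "rb") as reader:
                if index == 0:
                    # Copy channels, sample width, and frame rate from the
                    # first clip; the frame count is corrected on close.
                    writer.setparams(reader.getparams())
                writer.writeframes(reader.readframes(reader.getnframes()))
    return out.getvalue()
```

In a full pipeline, the per-segment clips would come from the TTS model, one per parsed segment, using the voice that `assign_voices` picked for that segment's speaker.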
A Note on Performance
One of the biggest challenges was performance on CPU-only machines.
XTTS is a powerful model, but on older hardware, generating long-form audio can take time.
Libraries like PyTorch may also attempt CPU optimizations that older processors do not support.
Instead of hiding this, Podvoice is intentionally transparent:
CPU-first by default
GPU optional if available
Performance trade-offs clearly documented
I preferred predictable behavior over pretending the problem doesn’t exist.
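A CPU-first default with an opt-in GPU path could look roughly like the sketch below. The `pick_device` helper and its `prefer_gpu` parameter are hypothetical names of mine, not Podvoice's actual options; the only library call used is PyTorch's `torch.cuda.is_available()`, and the import is kept optional so the function still works when PyTorch is missing.

```python
def pick_device(prefer_gpu: bool = False) -> str:
    """Return "cuda" only when it is requested and usable, otherwise "cpu"."""
    if not prefer_gpu:
        # CPU is the default, so behavior is the same on every machine.
        return "cpu"
    try:
        import torch  # treated as an optional dependency in this sketch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(pick_device())                 # default, CPU-first choice
    print(pick_device(prefer_gpu=True))  # depends on the machine
```

Making the GPU an explicit opt-in keeps the default run predictable: the same command behaves the same way on a laptop and on a workstation, and falling back to the CPU never fails.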
Demo
Below is a short demo generated locally using Podvoice from a Markdown script with two speakers.
Script: examples/demo.md
Output: single stitched audio file
Generated entirely on CPU
👉 GitHub repository: (add your repo link here)
👉 Demo audio: (add audio link here)
Who Is This For?
Podvoice is useful if you:
Prefer local tools over cloud services
Like scripting content instead of using dashboards
Want to prototype podcasts, narrations, or dialogue-style audio
Enjoy small, hackable CLI tools
It’s especially aimed at developers who value simplicity and control.
What Podvoice Is Not
Podvoice is not:
A real-time TTS engine
A replacement for hosted (SaaS) TTS services
A production-scale voice platform
It’s a focused developer tool — intentionally limited and easy to reason about.
Lessons Learned
Building Podvoice reinforced a few things for me:
Local-first tools still matter
Developer experience is as important as the model itself
Clear constraints build more trust than big promises
Shipping something usable beats over-engineering
Closing Thoughts
Podvoice started as a small experiment, but it taught me a lot about building practical tools around machine-learning models.
If you’re interested in local-first tooling, CLI design, or audio workflows, I’d love your feedback.
The project is open-source, and small improvements or ideas are always welcome.
