Markdown to Multi-Speaker Podcast: My Local TTS CLI

Turning Markdown scripts into podcast-style audio with a local-first TTS pipeline

Published
3 min read

Hey! I'm Aman Kumar, a passionate full-stack developer and open-source enthusiast from Uttar Pradesh, turning ideas into clean, creative, and secure code. 🚀

Introduction

I like experimenting with text-to-speech tools, especially for podcast-style or dialogue-based content.
But most tools I tried felt heavier than necessary: cloud dashboards, API keys, usage limits, or complex setups just to generate a small audio clip.

I wanted something simpler.

A tool where I could write a script like code, run a single command in the terminal, and get clean, stitched audio, generated entirely on my own machine.

That idea eventually became Podvoice.


What is Podvoice?

Podvoice is a local-first command-line tool that converts simple Markdown scripts into multi-speaker, podcast-style audio using the open-source XTTS v2 model.

It runs entirely on your system:

  • No cloud APIs

  • No API keys

  • No paid services

If you can write a Markdown file, you can generate audio from it.


Why Markdown?

Markdown is already familiar to developers.
It’s readable, version-controllable, and easy to edit.

Instead of designing a GUI or complex configuration files, Podvoice uses a very simple script format:

[Host | calm]
Hello and welcome to the show.

[Guest | excited]
Today we’re talking about developer tools.

  • The speaker name stays consistent throughout the script

  • The optional emotion tag is parsed for future extensibility

  • Everything remains human-readable and diff-friendly

This makes scripting conversations feel more like writing code than operating a tool.
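
To make the format concrete, here is a minimal sketch of how a parser for it might look. This is not Podvoice's actual code: the names `Segment`, `HEADER`, and `parse_script` are hypothetical, and I'm assuming a header line is exactly `[Speaker]` or `[Speaker | emotion]`, with every following non-empty line belonging to that speaker.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Matches a header line such as "[Host | calm]" or "[Guest]".
# Assumption: the emotion part after "|" is optional.
HEADER = re.compile(r"^\[\s*([^\]|]+?)\s*(?:\|\s*([^\]]*?)\s*)?\]$")


@dataclass
class Segment:
    speaker: str
    emotion: Optional[str]  # stored only; this sketch does not act on it
    text: str


def parse_script(source: str) -> List[Segment]:
    """Split a script into one Segment per speaker block."""
    segments: List[Segment] = []
    speaker: Optional[str] = None
    emotion: Optional[str] = None
    lines: List[str] = []

    def flush() -> None:
        # Emit the block collected so far, if it has any text.
        if speaker is not None and lines:
            segments.append(Segment(speaker, emotion, " ".join(lines)))

    for raw in source.splitlines():
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            flush()
            speaker = match.group(1)
            emotion = match.group(2) or None
            lines = []
        elif line and speaker is not None:
            lines.append(line)
    flush()
    return segments
```

Calling `parse_script` on the two-speaker example above would yield two segments, one for `Host` and one for `Guest`. Text that appears before the first header is ignored in this sketch; a real tool might prefer to report it as an error.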


How Podvoice Works (High-Level)

Internally, Podvoice follows a straightforward pipeline:

  1. Parse the Markdown file into speaker segments

  2. Map each speaker name to a consistent voice from XTTS

  3. Generate audio for each segment

  4. Stitch all segments together into a single output file

The tool uses Coqui XTTS v2 purely for inference; there’s no model training or fine-tuning involved.
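
Steps 2 to 4 can be sketched roughly as follows. This is an illustration under stated assumptions, not the real implementation: `assign_voices` and `render` are hypothetical names, the voice names are placeholders, and the `synthesize` callback stands in for the actual XTTS call so the example stays runnable without the model. The 24000 Hz default and the quarter-second pause between segments are also assumptions.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Stand-in for the model call: (text, voice) -> list of audio samples.
Synth = Callable[[str, str], List[float]]


def assign_voices(speakers: Sequence[str], voices: Sequence[str]) -> Dict[str, str]:
    """Give each speaker a voice in order of first appearance.

    The same speaker always gets the same voice within a script.
    Voices are reused only when there are more speakers than voices.
    """
    mapping: Dict[str, str] = {}
    for speaker in speakers:
        if speaker not in mapping:
            mapping[speaker] = voices[len(mapping) % len(voices)]
    return mapping


def render(
    segments: Sequence[Tuple[str, str]],  # (speaker, text) pairs
    voices: Sequence[str],
    synthesize: Synth,
    sample_rate: int = 24000,   # assumed output rate
    gap_seconds: float = 0.25,  # assumed pause between segments
) -> List[float]:
    """Synthesize every segment and stitch the results into one track."""
    voice_for = assign_voices([speaker for speaker, _ in segments], voices)
    gap = [0.0] * int(sample_rate * gap_seconds)
    track: List[float] = []
    for index, (speaker, text) in enumerate(segments):
        if index > 0:
            track.extend(gap)  # short silence between speakers
        track.extend(synthesize(text, voice_for[speaker]))
    return track
```

Keeping the synthesizer behind a plain callback also makes the stitching logic easy to test with a fake that returns a fixed number of samples per segment.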

The goal was not to build a research project, but a practical developer tool.


A Note on Performance

One of the biggest challenges was performance on CPU-only machines.

XTTS is a powerful model, but on older hardware, generating long-form audio can be slow.
Libraries like PyTorch also attempt optimizations that not every CPU supports.

Instead of hiding this, Podvoice is intentionally transparent:

  • CPU-first by default

  • GPU optional if available

  • Performance trade-offs clearly documented

I preferred predictable behavior over pretending the problem doesn’t exist.
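
A CPU-first default with an opt-in GPU can be expressed in a few lines. The sketch below is an assumption about how such a check might look, not Podvoice's own code; `pick_device` and its `prefer_gpu` parameter are hypothetical. It relies only on `torch.cuda.is_available()` and falls back to the CPU when PyTorch is missing or no CUDA device is present.

```python
def pick_device(prefer_gpu: bool = False) -> str:
    """Return "cpu" unless the caller asks for a GPU and one is usable."""
    if not prefer_gpu:
        return "cpu"  # CPU-first: the GPU is never picked implicitly
    try:
        import torch  # imported lazily so the CPU path needs no check
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```

Because the GPU is only used on request, the same command behaves the same way on every machine unless the user opts in.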


Demo

Below is a short demo generated locally using Podvoice from a Markdown script with two speakers.

  • Script: examples/demo.md

  • Output: single stitched audio file

  • Generated entirely on CPU

👉 GitHub repository: (add your repo link here)
👉 Demo audio: (add audio link here)


Who Is This For?

Podvoice is useful if you:

  • Prefer local tools over cloud services

  • Like scripting content instead of using dashboards

  • Want to prototype podcasts, narrations, or dialogue-style audio

  • Enjoy small, hackable CLI tools

It’s especially aimed at developers who value simplicity and control.


What Podvoice Is Not

Podvoice is not:

  • A real-time TTS engine

  • A SaaS replacement

  • A production-scale voice platform

It’s a focused developer tool: intentionally limited and easy to reason about.


Lessons Learned

Building Podvoice reinforced a few things for me:

  • Local-first tools still matter

  • Developer experience is as important as the model itself

  • Clear constraints build more trust than big promises

  • Shipping something usable beats over-engineering


Closing Thoughts

Podvoice started as a small experiment, but it taught me a lot about building practical tools around machine-learning models.

If you’re interested in local-first tooling, CLI design, or audio workflows, I’d love your feedback.

The project is open-source, and small improvements or ideas are always welcome.