From Markdown to Podcast: Building a Local-First TTS Tool with XTTS

Introduction

I like experimenting with text-to-speech tools, especially for podcast-style or dialogue-based content.
But most tools I tried felt heavier than necessary — cloud dashboards, API keys, usage limits, or complex setups just to generate a small audio clip.

I wanted something simpler.

A tool where I could write a script like code, run a single command in the terminal, and get clean, stitched audio — fully on my own machine.

That idea eventually became Podvoice.

What is Podvoice?

Podvoice is a local-first command-line tool that converts simple Markdown scripts into multi-speaker, podcast-style audio using the openly available XTTS v2 model.

It runs entirely on your system:

No cloud APIs
No API keys
No paid services

If you can write a Markdown file, you can generate audio from it.

Why Markdown?

Markdown is already familiar to developers.
It’s readable, version-controllable, and easy to edit.

Instead of designing a GUI or complex configuration files, Podvoice uses a very simple script format:

[Host | calm]
Hello and welcome to the show.

[Guest | excited]
Today we’re talking about developer tools.

The speaker name stays consistent throughout the script
The optional emotion tag is parsed for future extensibility
Everything remains human-readable and diff-friendly

This makes scripting conversations feel more like writing code than operating a tool.
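A minimal sketch of how a script in this format could be parsed into speaker segments. This is an illustration, not Podvoice's actual parser: the `Segment` class, the `parse_script` function, and the exact header pattern are assumptions based only on the example above.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Matches a header line such as "[Host | calm]" or "[Guest]".
# The emotion part after "|" is optional (assumed syntax).
HEADER_RE = re.compile(r"^\[\s*([^\]|]+?)\s*(?:\|\s*([^\]]*?)\s*)?\]$")


@dataclass
class Segment:
    speaker: str
    emotion: Optional[str]
    text: str


def parse_script(source: str) -> List[Segment]:
    """Split a script into speaker segments (hypothetical helper)."""
    segments: List[Segment] = []
    speaker: Optional[str] = None
    emotion: Optional[str] = None
    lines: List[str] = []

    def flush() -> None:
        # Emit the segment collected so far, if it has a speaker and text.
        text = " ".join(lines).strip()
        if speaker is not None and text:
            segments.append(Segment(speaker, emotion, text))
        lines.clear()

    for raw in source.splitlines():
        line = raw.strip()
        match = HEADER_RE.match(line)
        if match:
            flush()
            speaker = match.group(1)
            emotion = match.group(2) or None
        elif line:
            lines.append(line)
    flush()
    return segments
```

One limitation of this sketch: any spoken line that consists only of bracketed text, such as `[laughs]`, would be read as a new speaker header.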

How Podvoice Works (High-Level)

Internally, Podvoice follows a straightforward pipeline:

1. Parse the Markdown file into speaker segments
2. Map each speaker name to a consistent voice from XTTS
3. Generate audio for each segment
4. Stitch all segments together into a single output file
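One way the "consistent voice" step could work is to derive the voice from a stable hash of the speaker name, so the same name maps to the same voice on every run. The sketch below is an assumption about the approach, not a description of Podvoice's code; `pick_voice`, `assign_voices`, and the placeholder voice names are hypothetical.

```python
import hashlib
from typing import Dict, Iterable, Sequence


def pick_voice(speaker: str, voices: Sequence[str]) -> str:
    """Deterministically choose a voice for a speaker name (hypothetical helper)."""
    if not voices:
        raise ValueError("at least one voice is required")
    # hashlib is used instead of the built-in hash(), which is salted
    # per process for strings and so is not stable across runs.
    key = speaker.strip().lower().encode("utf-8")
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(voices)
    return voices[index]


def assign_voices(speakers: Iterable[str], voices: Sequence[str]) -> Dict[str, str]:
    """Build a speaker-to-voice table for a whole script."""
    return {name: pick_voice(name, voices) for name in speakers}


# Placeholder voice names, not real XTTS speaker identifiers.
VOICES = ["voice_a", "voice_b", "voice_c"]
table = assign_voices(["Host", "Guest"], VOICES)
```

With hashing, two different speakers can land on the same voice. Assigning voices in order of first appearance avoids that, at the cost of a voice changing if the script is reordered.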

The tool uses Coqui XTTS v2 purely for inference — there’s no model training or fine-tuning involved.

The goal was not to build a research project, but a practical developer tool.
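The final stitching step of the pipeline can be illustrated with Python's standard library alone. This sketch assumes each segment has already been written as a 16-bit PCM WAV file with the same sample rate and channel count; the `stitch_wavs` function and its `gap_seconds` parameter are hypothetical and not taken from Podvoice.

```python
import wave
from typing import List, Optional, Sequence, Tuple


def stitch_wavs(paths: Sequence[str], out_path: str, gap_seconds: float = 0.3) -> None:
    """Concatenate WAV segments into one file, with a short pause between them."""
    if not paths:
        raise ValueError("no segments to stitch")

    fmt: Optional[Tuple[int, int, int]] = None
    chunks: List[bytes] = []
    for path in paths:
        with wave.open(path, "rb") as seg:
            seg_fmt = (seg.getnchannels(), seg.getsampwidth(), seg.getframerate())
            if fmt is None:
                fmt = seg_fmt
            elif seg_fmt != fmt:
                raise ValueError(f"{path} has a different audio format: {seg_fmt}")
            chunks.append(seg.readframes(seg.getnframes()))

    channels, width, rate = fmt
    # Zero bytes are silence for signed 16-bit PCM (the assumed format).
    silence = b"\x00" * (int(gap_seconds * rate) * channels * width)

    with wave.open(out_path, "wb") as out:
        out.setnchannels(channels)
        out.setsampwidth(width)
        out.setframerate(rate)
        out.writeframes(silence.join(chunks))
```

Reading every segment into memory keeps the code short; for very long episodes, writing each segment to the output as it is read would use less memory.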

A Note on Performance

One of the biggest challenges was performance on CPU-only machines.

XTTS is a powerful model, but on older hardware, generating long-form audio can take time.
Libraries like PyTorch may also try to use CPU optimizations that not every processor supports.

Instead of hiding this, Podvoice is intentionally transparent:

CPU-first by default
GPU optional if available
Performance trade-offs clearly documented

I preferred predictable behavior over pretending the problem doesn’t exist.
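A CPU-first policy with an optional GPU can be expressed as a small device-selection helper. The sketch below is an assumption about how such a policy might look, not Podvoice's actual code; `resolve_device` and its `requested` argument are hypothetical. The only PyTorch call it uses is `torch.cuda.is_available()`, and it falls back to CPU if PyTorch is not installed.

```python
def cuda_available() -> bool:
    """Return True only if PyTorch is installed and reports a usable CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


def resolve_device(requested: str = "cpu") -> str:
    """Pick the inference device: CPU unless a GPU is requested and present."""
    choice = requested.strip().lower()
    if choice == "cpu":
        return "cpu"
    if choice in ("cuda", "gpu"):
        # Fall back to CPU rather than failing when no GPU is present.
        return "cuda" if cuda_available() else "cpu"
    raise ValueError(f"unknown device: {requested!r}")
```

Defaulting to `"cpu"` means the GPU is only used when someone opts in, which keeps behavior the same across machines unless the user chooses otherwise.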

Demo

Below is a short demo generated locally using Podvoice from a Markdown script with two speakers.

Script: examples/demo.md
Output: single stitched audio file
Generated entirely on CPU

👉 GitHub repository: (add your repo link here)
👉 Demo audio: (add audio link here)

Who Is This For?

Podvoice is useful if you:

Prefer local tools over cloud services
Like scripting content instead of using dashboards
Want to prototype podcasts, narrations, or dialogue-style audio
Enjoy small, hackable CLI tools

It’s especially aimed at developers who value simplicity and control.

What Podvoice Is Not

Podvoice is not:

A real-time TTS engine
A SaaS replacement
A production-scale voice platform

It’s a focused developer tool — intentionally limited and easy to reason about.

Lessons Learned

Building Podvoice reinforced a few things for me:

Local-first tools still matter
Developer experience is as important as the model itself
Clear constraints build more trust than big promises
Shipping something usable beats over-engineering

Closing Thoughts

Podvoice started as a small experiment, but it taught me a lot about building practical tools around machine-learning models.

If you’re interested in local-first tooling, CLI design, or audio workflows, I’d love your feedback.

The project is open source, and small improvements or ideas are always welcome.
