Multimodal

Beyond text: image, voice, and video all start from one selection

Select a sentence in the novel or a reply in chat: illustration, narration and video generate in place and embed in place. Imagen and gpt-image paint the scene, TTS voices read it aloud, Veo sets it in motion — media grows inside the story instead of interrupting it.

  • Inline illustrations
  • Chapter narration · selection dub
  • Per-modality default models

Tap “Image” on this iPad

A real interactive prototype: tap Image in the selection toolbar to watch Imagen render, Continue for streaming prose, Audio for the narration waveform. Every generation lands in the API request log with tokens and cost.

iPad · Live generation
Star Archive · Canon · Ch. 28 · Lighthouse Signal

Luowen looked up as the old starport lamps went out one by one, as if someone had folded the night sky into her palm.

She finally understood the warning: every route is not only a path, but an echo.

"Do not light the seventh lamp," she whispered. "It is not a signal. Someone is asking the past for help."

The ship that vanished tonight would dock again in some other branch.

Selection workbench

Select any sentence — continue, illustrate, narrate, or branch from right here.

Canon
Ch. 28 / 36 · 78%

Three new forms for one passage

Image · Illustrations live in the page

Rendered from selection context and anchored inline; long-press to save, regenerate, or remove.

Imagen 3 · gpt-image-1 · SeedDream

Audio · Long fiction as audiobook

Narrate selections instantly or cache whole chapters for offline listening, with playback speed, a sleep timer, and paragraph skip.

Qwen TTS · Grok Voice · MiMo Speech

Video · Scenes in motion

Hand a signature moment to a video model and play the clip right inside the reading page.

Google Veo · Kling · Hailuo

Each modality, its own model

  • Defaults per modality: Text, image, speech and video each carry a default model; reader and chat can override separately.
  • Models per scene: One model for romance, another for action; per-world generation settings can be switched anytime.
  • Capability filtering: The model picker only lists modalities the client actually implements, so there are no phantom capabilities.
  • Fully accounted: Every generation writes to the API request log: model, tokens, cache hits, cost.
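
As a rough illustration of how per-modality defaults, per-surface overrides, and capability filtering could fit together, here is a minimal Python sketch. It is not the app's actual code: `ModelSettings`, `resolve`, `pickable_modalities`, and the placeholder `example-text-model` are hypothetical names introduced for this example; only the image, speech, and video model names come from this page.

```python
from dataclasses import dataclass, field

# The four modalities named on this page, in display order.
MODALITIES = ("text", "image", "speech", "video")


@dataclass
class ModelSettings:
    """Hypothetical settings object: one default model per modality,
    plus optional overrides per surface ("reader" or "chat")."""
    defaults: dict[str, str]
    overrides: dict[str, dict[str, str]] = field(default_factory=dict)

    def resolve(self, modality: str, surface: str) -> str:
        # A surface-level override wins; otherwise fall back to the default.
        return self.overrides.get(surface, {}).get(modality, self.defaults[modality])


def pickable_modalities(model_modalities: set[str], client_modalities: set[str]) -> list[str]:
    """Keep only modalities that the model offers AND the client implements."""
    return [m for m in MODALITIES if m in model_modalities and m in client_modalities]


settings = ModelSettings(
    defaults={
        "text": "example-text-model",  # placeholder, not named on this page
        "image": "Imagen 3",
        "speech": "Qwen TTS",
        "video": "Google Veo",
    },
    overrides={"chat": {"image": "gpt-image-1"}},
)

print(settings.resolve("image", "reader"))  # falls back to the default
print(settings.resolve("image", "chat"))    # uses the chat override
print(pickable_modalities({"text", "image", "video"}, {"text", "image", "speech"}))
```

Here the reader keeps the default image model while chat uses its own, and a model that advertises video is not offered for video when the client does not implement it.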

FAQ

Common questions

Which models draw the illustrations?

Imagen 3, gpt-image-1, SeedDream and more. Prompts are built from selection context, templates are editable, and size/style can be adjusted before generating.

Which voices can narrate?

Your own TTS providers: Qwen TTS, Grok Voice, MiMo Speech and others, with selectable voices; cached chapters play offline.

How long does video take?

Model-dependent — typically tens of seconds to minutes. Jobs run in the background and the clip anchors back into the text when ready.

How is media billed?

You bring your own API key (BYOK), so you pay your provider's price with no markup; tokens, cost and cache hits are recorded per request in the API request log.

Reserve a spot

Let the next chapter grow its own art, voice, and motion.

We send one short note when the next tester wave opens. You can also email [email protected].

Early access

Reserve your email

One launch email. No newsletter, no third-party trackers, unsubscribe anytime.

Multimodal Story Generation — Image, Voice, Video · Foreverse · Xinmeng