Multimodal

Beyond text — image, voice, video
all start from one selection

Select a sentence in the novel or a reply in chat: illustration, narration and video generate in place and embed in place. Imagen and gpt-image paint the scene, TTS voices read it aloud, Veo sets it in motion — media grows inside the story instead of interrupting it.

Get it on Google Play · See BYOK API

Inline illustrations
Chapter narration · selection dub
Per-modality default models

Tap “Image” on this iPad

A real interactive prototype: tap Image in the selection toolbar to watch Imagen render, Continue for streaming prose, Audio for the narration waveform. Every generation lands in the API request log with tokens and cost.

iPad · Live generation


Star Archive · Canon · Ch. 28 · Lighthouse Signal

Luowen looked up as the old starport lamps went out one by one, as if someone had folded the night sky into her palm.

She finally understood the warning: every route is not only a path, but an echo.

"Do not light the seventh lamp," she whispered. "It is not a signal. Someone is asking the past for help."

The ship that vanished tonight would dock again in some other branch.

Selection workbench

Select any sentence — continue, illustrate, narrate, or branch from right here.

Canon

Ch. 28 / 36 · 78%

Three new forms for one passage

Image

Illustrations live in the page

Rendered from selection context and anchored inline; long-press to save, regenerate, or remove.

Imagen 3 · gpt-image-1 · SeedDream

Audio

Long fiction as audiobook

Narrate selections instantly or cache whole chapters offline — speed, sleep timer, paragraph skip.

Qwen TTS · Grok Voice · MiMo Speech

Video

Scenes in motion

Hand a signature moment to a video model and play the clip right inside the reading page.

Google Veo · Kling · Hailuo

Each modality, its own model

Defaults per modality: Text, image, speech and video each carry a default model; reader and chat can override them separately.
Models per scene: One model for romance, another for action; per-world generation settings can be switched at any time.
Capability filtering: The model picker only lists modalities the client actually implements, so no phantom capabilities appear.
Fully accounted: Every generation writes to the API request log with model, tokens, cache hits and cost.
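
The layering described above can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the placeholder text model, the example catalog, and the precedence order (per-world setting first, then the reader or chat override, then the global default) are hypothetical and not taken from the app.

```python
# Hypothetical global defaults, one model per modality.
# The text model is a placeholder; the other names appear on this page.
GLOBAL_DEFAULTS = {
    "text": "example-text-model",
    "image": "Imagen 3",
    "speech": "Qwen TTS",
    "video": "Google Veo",
}

# Hypothetical catalog of models the user has configured.
CATALOG = [
    {"name": "Imagen 3", "modality": "image"},
    {"name": "gpt-image-1", "modality": "image"},
    {"name": "Qwen TTS", "modality": "speech"},
    {"name": "Google Veo", "modality": "video"},
]


def resolve_model(modality, surface_overrides=None, world_overrides=None,
                  defaults=GLOBAL_DEFAULTS):
    """Return the model for a modality.

    Assumed precedence: per-world setting, then the reader/chat
    (surface) override, then the global default.
    """
    for layer in (world_overrides or {}, surface_overrides or {}):
        if modality in layer:
            return layer[modality]
    return defaults[modality]


def pickable_models(catalog, implemented_modalities):
    """Keep only models whose modality the client implements."""
    return [m for m in catalog if m["modality"] in implemented_modalities]
```

For example, `resolve_model("image", surface_overrides={"image": "gpt-image-1"})` returns the reader's override, while a client that implements only image and speech would never see Google Veo in the list returned by `pickable_models`.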

The official prompt templates behind image and video summons are published in full: browse the image & video style packs and the video wrappers (written to the Veo / Kling formulas) on the prompt library — free to copy, or import whole packs from the app's community hub.

FAQ

Common questions

Which models draw the illustrations?

Imagen 3, gpt-image-1, SeedDream and more. Prompts are built from selection context, templates are editable, and size/style can be adjusted before generating.
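
As a rough illustration of building a prompt from selection context, the sketch below fills an editable template with the selected sentence, some surrounding context, and the size and style options. The template wording, the function name and the default values are all hypothetical; the app's actual templates are the ones published on its prompt library.

```python
from string import Template

# Hypothetical template text, not one of the app's published templates.
ILLUSTRATION_TEMPLATE = Template(
    "$style illustration, $size. Scene: $selection Context: $context"
)


def build_image_prompt(selection, context, style="watercolor",
                       size="1024x1024", template=ILLUSTRATION_TEMPLATE):
    """Fill the template with the selected text and its context."""
    return template.substitute(
        selection=selection.strip(),
        context=context.strip(),
        style=style,
        size=size,
    )
```

Passing a different `Template` object stands in for editing the template, and `style` and `size` stand in for the adjustments made before generating.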

Which voices can narrate?

Your own TTS providers: Qwen TTS, Grok Voice, MiMo Speech and others, with selectable voices; cached chapters play offline.

How long does video take?

Model-dependent — typically tens of seconds to minutes. Jobs run in the background and the clip anchors back into the text when ready.

How is media billed?

BYOK means your provider's price with no markup; tokens, cost and cache hits are recorded per request in the API log.
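
A minimal sketch of what one log entry could look like for a token-priced request follows. The field names, the function name and the price table are hypothetical, the numbers in any example are made up, and real providers may bill images, speech or video per item or per second rather than per token.

```python
from dataclasses import dataclass


@dataclass
class RequestLogEntry:
    model: str
    input_tokens: int
    output_tokens: int
    cached_tokens: int  # input tokens served from the provider's cache
    cost_usd: float


def log_request(model, input_tokens, output_tokens, cached_tokens, price):
    """Build a log entry with no markup applied.

    `price` holds hypothetical USD rates per million tokens under the
    keys "input", "cached" and "output".
    """
    uncached = input_tokens - cached_tokens
    cost = (
        uncached * price["input"]
        + cached_tokens * price["cached"]
        + output_tokens * price["output"]
    ) / 1_000_000
    return RequestLogEntry(model, input_tokens, output_tokens,
                           cached_tokens, round(cost, 6))
```

The cost is simply the provider's rate applied to the token counts, which matches the no-markup claim above; the cached share of the input is priced separately because providers commonly discount cache hits.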

Now on Google Play

Let the next chapter grow its own art, voice, and motion.

The Android app is live on Google Play; for iOS Early Access we send one short note when the next wave opens. You can also email [email protected].

Get it on Google Play →
