Looking at monthly invoices from ElevenLabs or OpenAI for TTS can be a painful experience. For indie game developers, API call costs are a fixed expense that can stifle a project's growth. Kokoro 82M is an open-source model under the Apache 2.0 license that produces commercial-grade speech in a local environment. By hosting this lightweight model—composed of 82 million parameters—on your own PC, you no longer need to be at the mercy of external server policy changes.
The biggest concern when running a local model is the potential for game frame rates to drop. To keep the CPU load generated during speech synthesis from starving the game loop, you must manually control the execution threads. Since Kokoro 82M follows the StyleTTS 2 architecture, it operates most efficiently when running on the ONNX Runtime.
- Set intra_op_num_threads in ONNX Runtime's SessionOptions to less than half of your total cores. For an 8-core CPU, allocating 2 to 4 cores is sufficient.
- Keep enable_cpu_mem_arena on to prevent memory fragmentation. This eliminates the micro-stuttering that occurs when generating audio in the background.
- Stream the output through an asyncio queue, sending data to the audio device as soon as the first chunk is generated.

With these settings, you can drop the Time to First Audio (TTFA) to under 0.5 seconds.
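The thread settings above can be sketched roughly as follows. This is a minimal example, assuming onnxruntime is installed; the function names (pick_intra_threads, make_kokoro_session) are illustrative, not part of any library.

```python
import os

def pick_intra_threads(total_cores: int) -> int:
    # Use less than half the cores, capped at 4 as suggested for an 8-core CPU.
    return max(1, min(4, total_cores // 2))

def make_kokoro_session(model_path: str):
    import onnxruntime as ort  # assumed dependency: pip install onnxruntime
    opts = ort.SessionOptions()
    opts.intra_op_num_threads = pick_intra_threads(os.cpu_count() or 4)
    opts.enable_cpu_mem_arena = True  # pooled allocator avoids fragmentation-driven stutter
    return ort.InferenceSession(model_path, sess_options=opts)
```

With this helper, the synthesis thread is confined to a fraction of the CPU and leaves headroom for the game's main and render threads.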
No matter how great Kokoro 82M is, immersion is broken if it reads 'API' as 'Ah-pi' or fails to process '10%' correctly. Because this model was trained based on the International Phonetic Alphabet (IPA), a process for normalizing input text is essential.
Instead of simply feeding in raw text, create a regular-expression mapping dictionary. r'\bAPI\b' should be converted to '에이피아이' (A-P-I), and numbers should be expanded into Korean text like '한 개' ('one item') or '일 퍼센트' ('one percent') depending on the context. In particular, the unique liaison rules of the Korean language can be handled by using auxiliary libraries like korean-text-normalizer. This can save you about 5 hours a week that would otherwise be spent manually editing audio files late into the night.
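A minimal sketch of such a normalization pass is shown below. The mapping entries and the 0–99 number converter are illustrative only; a real pipeline would cover far more cases (and a dedicated library is the better choice for full liaison handling).

```python
import re

# Illustrative acronym-to-reading table; extend per project.
ACRONYM_READINGS = {
    r'\bAPI\b': '에이피아이',
    r'\bCPU\b': '씨피유',
}

SINO_DIGITS = ['', '일', '이', '삼', '사', '오', '육', '칠', '팔', '구']

def sino_korean(n: int) -> str:
    # Sino-Korean reading for 0-99; larger numbers are out of scope for this sketch.
    if n == 0:
        return '영'
    if n >= 100:
        return str(n)
    tens, units = divmod(n, 10)
    word = (SINO_DIGITS[tens] if tens > 1 else '') + ('십' if tens else '')
    return word + SINO_DIGITS[units]

def normalize(text: str) -> str:
    # Replace known acronyms with their Korean readings.
    for pattern, reading in ACRONYM_READINGS.items():
        text = re.sub(pattern, reading, text)
    # Expand "10%"-style tokens into "십 퍼센트".
    return re.sub(r'(\d+)%', lambda m: sino_korean(int(m.group(1))) + ' 퍼센트', text)
```

For example, normalize('10%') yields '십 퍼센트', which the IPA-trained model can pronounce cleanly.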
You don't need to overhaul all your existing OpenAI SDK code. By launching a lightweight FastAPI server on your localhost, you can replace a paid API just by modifying a single line for the endpoint address.
- Create a /v1/audio/speech path designed to receive JSON data in the OpenAI standard format.
- Load the INT8 quantized model, which is 92.4MB in size. The inference speed is more than three times faster than the standard model, while the audible difference in quality is negligible.
- Use pydub to return the generated audio immediately as .mp3 or .wav.

By doing this, you can maintain your codebase that previously relied on paid services while cleanly eliminating monthly subscription fees.
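The drop-in endpoint can be sketched as below, assuming FastAPI is installed. The synthesize callback, the default voice id, and the field defaults are placeholders; only the path and the JSON field names (model, input, voice, response_format) follow the OpenAI speech API shape.

```python
def media_type_for(fmt: str) -> str:
    # Map the requested response_format to an HTTP content type.
    return {'mp3': 'audio/mpeg', 'wav': 'audio/wav'}.get(fmt, 'audio/wav')

def create_app(synthesize):
    # synthesize(text, voice) -> encoded audio bytes; injected so the app stays testable.
    from fastapi import FastAPI          # assumed dependency
    from fastapi.responses import Response
    from pydantic import BaseModel

    class SpeechRequest(BaseModel):
        model: str = 'kokoro'
        input: str
        voice: str = 'default'           # placeholder voice id
        response_format: str = 'mp3'

    app = FastAPI()

    @app.post('/v1/audio/speech')
    def speech(req: SpeechRequest):
        audio = synthesize(req.input, req.voice)
        return Response(content=audio, media_type=media_type_for(req.response_format))

    return app
```

Run it with any ASGI server on localhost, then point the existing OpenAI SDK client's base URL at it; no other client code needs to change.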
Lightweight models have a limitation: when processing long texts of over 500 characters, the pronunciation at the end of a sentence can become garbled or mixed with mechanical noise. To solve this, you must split sentences intelligently.
Divide sentences based on periods and commas, then use AudioSegment.silent to insert 200–500ms of silence between sentences. Simulating a human breathing cadence in this way hides the model's awkwardness. The key is not just dividing the text, but the seamless playback logic that naturally stitches the audio fragments together. Automating this process allows for natural delivery of long lines of dialogue without any interruptions.
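The split-and-stitch logic might look like this minimal sketch, assuming pydub is installed; the 500-character threshold and 300ms pause follow the figures above, and the function names are illustrative.

```python
import re

def split_sentences(text: str, max_len: int = 500):
    # Primary split on sentence-ending punctuation; fall back to commas for overlong runs.
    parts = [p.strip() for p in re.split(r'(?<=[.!?])\s+', text) if p.strip()]
    chunks = []
    for part in parts:
        if len(part) <= max_len:
            chunks.append(part)
        else:
            chunks.extend(s.strip() for s in part.split(',') if s.strip())
    return chunks

def stitch(segments, pause_ms=300):
    # segments: pydub AudioSegment objects; insert a breathing pause between sentences.
    from pydub import AudioSegment  # assumed dependency
    silence = AudioSegment.silent(duration=pause_ms)
    combined = segments[0]
    for seg in segments[1:]:
        combined += silence + seg
    return combined
```

Each chunk stays well under the model's 500-character comfort zone, and the stitched output plays back as one continuous line.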