What Are Tokens?
Language models don't process raw text character by character. They operate on tokens - chunks of text that can be as short as a single character or as long as a common word. OpenAI's tokenizer, for example, splits text roughly as follows:
- Common English words: 1 token each (`the`, `is`, `chat`)
- Longer or rare words: 2–4 tokens (`compliance` → 2 tokens)
- Non-English text: typically more tokens per word than English
- Whitespace and punctuation: counted separately
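If you want to check these counts for your own text, OpenAI's open-source tiktoken library exposes the same encodings the models use. A quick sketch, assuming the `o200k_base` encoding used by the GPT-4o family (exact counts vary by encoding):

```python
import tiktoken

# o200k_base is the encoding used by the GPT-4o family of models.
enc = tiktoken.get_encoding("o200k_base")

for text in ["the", "chat", "compliance", "A typical support question."]:
    print(f"{text!r}: {len(enc.encode(text))} token(s)")
```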
A useful rough rule: 1,000 tokens ≈ 750 words of English text. A typical support conversation of 300 words uses approximately 400 tokens - but add a system prompt, conversation history, and retrieved knowledge base chunks, and you're often looking at 2,000–5,000 tokens per turn.
The Cost Equation
Let's work through a realistic example. A support chatbot with:
- System prompt: 500 tokens
- Retrieved knowledge base context: 1,500 tokens
- Conversation history (last 5 turns): 800 tokens
- User's question: 50 tokens
- AI response: 300 tokens
That's 2,850 input tokens and 300 output tokens per single turn. Using GPT-4o pricing:
Input: 2,850 tokens × $2.50/M = $0.007125
Output: 300 tokens × $10.00/M = $0.003000
Per-turn total: $0.010125
At 1,000 conversations/month with an average of 5 turns each:
1,000 conversations × 5 turns × $0.010125 = $50.63/month
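The arithmetic is simple enough to fold into a few lines of Python. A sketch using the GPT-4o prices from above (substitute your provider's current rates - these change often):

```python
# Illustrative per-turn cost model for the numbers above. Prices are the
# GPT-4o list prices used in this article; check current provider pricing.
INPUT_PRICE_PER_M = 2.50    # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 10.00  # USD per 1M output tokens

def turn_cost(input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request/response turn."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

per_turn = turn_cost(2_850, 300)   # 0.010125
monthly = per_turn * 1_000 * 5     # 1,000 conversations x 5 turns
print(f"${per_turn:.6f} per turn, ${monthly:.3f} per month")  # $50.625, ~$50.63
```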
That might seem manageable. But scale to 10,000 conversations - a modest number for a deployed commercial chatbot - and you're at $506/month, just in LLM API costs, before infrastructure, support, or margins. Now optimisation becomes meaningful.
Where Tokens Are "Wasted"
Most token bloat comes from four sources, all fixable:
1. Bloated System Prompts
System prompts run on every single request, so a 2,000-token system prompt is billed in full on every turn. Many operators write system prompts like cover letters - verbose, repetitive, padded with politeness. Every redundant sentence multiplies across your entire query volume.
2. Whole-Document KB Retrieval
If your retrieval logic fetches entire documents rather than targeted chunks, you're injecting thousands of irrelevant tokens into every prompt. Proper chunking retrieves only the 2–3 paragraphs actually relevant to the query.
3. Unbounded Conversation History
Including the entire conversation history in context is the fastest way to blow your token budget. A 20-turn conversation at 200 tokens/turn adds 4,000 tokens to every subsequent request - most of it irrelevant to the current question.
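A common fix is a token-budgeted window: keep only the most recent turns that fit under a fixed budget. A minimal sketch, where `count_tokens` is a stand-in for whatever tokenizer you use (e.g. tiktoken's `encode`) and messages are simple role/content dicts:

```python
from typing import Callable

def trim_history(messages: list[dict], budget: int,
                 count_tokens: Callable[[str], int]) -> list[dict]:
    """Keep the most recent messages that fit within `budget` tokens."""
    kept: list[dict] = []
    used = 0
    for msg in reversed(messages):  # walk newest-first
        cost = count_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))  # restore chronological order
```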
4. Chatty Model Responses
Without explicit instruction, models tend to pad responses with preamble ("Great question!"), summaries, caveats, and sign-offs. These are pleasant but expensive. Direct responses with no fluff can cut output token usage by 30–50%.
5 Optimisation Strategies
Strategy 1: Write Concise System Instructions
Audit your system prompt for redundancy. Here's a before/after example:
BEFORE (87 tokens):
You are a helpful, friendly, and professional customer support assistant for AcmeCorp. Your job is to help users with their questions about our products and services. Always be polite and professional. If you don't know the answer, tell the user you don't know and offer to escalate to a human agent. Never make up information that isn't in your knowledge base.

AFTER (41 tokens):
You are AcmeCorp support. Answer only from the provided context. If unsure, say so and offer human escalation. Be concise.
Both prompts produce equivalent behaviour. The second saves 46 tokens per request - at 10,000 req/month, that's 460,000 tokens/month saved.
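Don't guess at the savings - measure them. The same tiktoken approach from earlier compares the two prompts directly (exact counts may differ by a few tokens depending on the encoding):

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

before = ("You are a helpful, friendly, and professional customer support "
          "assistant for AcmeCorp. Your job is to help users with their "
          "questions about our products and services. Always be polite and "
          "professional. If you don't know the answer, tell the user you "
          "don't know and offer to escalate to a human agent. Never make up "
          "information that isn't in your knowledge base.")
after = ("You are AcmeCorp support. Answer only from the provided context. "
         "If unsure, say so and offer human escalation. Be concise.")

print(len(enc.encode(before)) - len(enc.encode(after)), "tokens saved per request")
```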
Strategy 2: Use Focused KB Chunks
Configure your knowledge base chunking to produce small, focused segments (200–400 tokens each) rather than large page-level blocks. Set retrieval to return 3–5 chunks maximum. Retrieving the right 400 tokens beats retrieving 3,000 tokens hoping the right information is in there.
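If your platform lets you control chunking yourself, the core idea is small: pack paragraphs into chunks under a token ceiling. A minimal sketch, assuming documents use blank-line paragraph breaks (production chunkers usually add overlap between chunks and respect heading boundaries):

```python
from typing import Callable

def chunk_document(text: str, count_tokens: Callable[[str], int],
                   max_tokens: int = 400) -> list[str]:
    """Pack paragraphs into chunks of at most `max_tokens` tokens."""
    chunks, current, used = [], [], 0
    for para in text.split("\n\n"):
        cost = count_tokens(para)
        # Start a new chunk when the next paragraph would overflow.
        if current and used + cost > max_tokens:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(para)
        used += cost
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```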
Strategy 3: Set Appropriate Max Response Lengths
Use the `max_tokens` parameter to cap response length. For a support chatbot, responses over 300 tokens are rarely useful - they just read like a wall of text. Set a reasonable ceiling and instruct the model to be concise in your system prompt.
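With the OpenAI Python SDK, this is a single parameter on the request. A sketch (the prompt and question are placeholders; note that some newer models expect `max_completion_tokens` instead):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    max_tokens=300,  # hard ceiling on output tokens
    messages=[
        {"role": "system", "content": "You are AcmeCorp support. Be concise."},
        {"role": "user", "content": "How do I reset my password?"},
    ],
)
print(response.choices[0].message.content)
```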
Strategy 4: Monitor Token Usage in Analytics
You can't optimise what you can't measure. Track average tokens per conversation, per session, and per user segment. Outlier conversations (unusually high token counts) often reveal edge cases where users are feeding the chatbot large inputs or triggering retrieval of many documents.
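Most APIs return token counts with every response, so you don't need to estimate. A sketch of a logging helper built around the usage fields the OpenAI SDK returns (wire the record into whatever analytics pipeline you use):

```python
def log_usage(response, conversation_id: str) -> dict:
    """Pull token counts off an OpenAI chat completion response.

    Field names (prompt_tokens, completion_tokens, total_tokens) are
    from the OpenAI Python SDK; other providers expose equivalents.
    """
    usage = response.usage
    record = {
        "conversation_id": conversation_id,
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
        "total_tokens": usage.total_tokens,
    }
    print(record)  # replace with your metrics/analytics sink
    return record
```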
Strategy 5: Choose the Right Model for the Task
Not every query needs the most capable model. Routing simple FAQ-style questions to a lighter model (GPT-4o mini, Haiku, Flash) at a fraction of the cost - while reserving the premium model for complex, multi-step queries - can cut your average cost per conversation by 60–80%.
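Routing can start as a crude heuristic and still capture most of the savings. A sketch with illustrative rules - the model names, keyword hints, and thresholds here are assumptions to tune against your own traffic:

```python
# Naive complexity routing: short, FAQ-like questions go to a small
# model; everything else goes to the premium model. Real routers often
# use a cheap classifier model instead of keyword heuristics.
CHEAP_MODEL = "gpt-4o-mini"
PREMIUM_MODEL = "gpt-4o"

COMPLEX_HINTS = ("why", "compare", "integrate", "troubleshoot", "error")

def pick_model(question: str, history_turns: int) -> str:
    q = question.lower()
    looks_complex = (len(q.split()) > 40
                     or history_turns > 6
                     or any(hint in q for hint in COMPLEX_HINTS))
    return PREMIUM_MODEL if looks_complex else CHEAP_MODEL

print(pick_model("How do I reset my password?", history_turns=1))  # gpt-4o-mini
```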
Real-World Cost Comparison
Let's apply these strategies to our example support bot with 1,000 conversations/month × 5 turns:
| Scenario | Avg tokens/turn | Monthly cost (GPT-4o) |
|---|---|---|
| Unoptimised | ~3,150 | ~$50.63 |
| Concise prompt + history cap | ~2,100 | ~$33.75 |
| All 5 strategies applied | ~1,400 | ~$22.50 |
| Strategies + model routing | ~1,400 (mixed models) | ~$9.00 |
The same workload at one-fifth the cost. At 10,000 conversations/month, that's the difference between a $506 bill and a $90 bill.
Token awareness is one of those skills that separates chatbot operators who build profitable products from those who discover at scale that the economics were broken all along.
Monitor Your Chatbot's Token Usage
ChatNexus includes per-conversation token analytics and model routing to help you optimise costs as you scale.
Get Started Free →