Finding Needles in the LLM Haystack: Testing LLMs for a Turkish Bank Voice Agent

Hugging Face hosts over 2.5 million large language models (LLMs) as of February 2026. However, identifying the subset relevant to a specific use case remains difficult, particularly in multilingual, industry-specific settings. At datAcxion, we encountered this challenge when developing an outbound voice-AI sales agent for a Turkish bank. That project presented many hurdles: voice activity detection (VAD), Turkish automatic speech recognition (ASR), user-intent classification and next-best-action decisioning, text-to-voice (TTV) streaming, and doing all of this under a sub-second latency requirement. This write-up focuses specifically on model selection.

Finding a proven LLM specifically trained for Turkish is challenging. While multilingual models exist, those augmented for Turkish banking often underperform. For instance, even models fine-tuned or RAG-augmented with Turkish banking documents, such as the Commencis LLM, did not meet our requirements.

We tested 36 Turkish-capable LLMs using product-specific prompt engineering in LM Studio on a Mac Studio M3 Ultra (512 GB unified memory). Our evaluation was based on three primary constraints:

  1. Latency: A bank-mandated SLA of at most 3 seconds per response (our demo achieved 1.1 s); see the timing sketch after this list.
  2. Deployability: The model must run on-premise to meet Turkey's data-protection requirements (KVKK, the Turkish counterpart of the GDPR).
  3. Accuracy: A goal of handling 90% of cases without referral to a live agent.
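To compare candidates consistently, we timed streamed responses against LM Studio's local OpenAI-compatible server. Below is a minimal sketch of such a timing harness, assuming LM Studio's default local endpoint; the model name and Turkish prompt are illustrative placeholders, not our production configuration.

```python
import time
from openai import OpenAI

# LM Studio serves an OpenAI-compatible API locally; the default
# endpoint, model name, and prompt below are illustrative placeholders.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

def time_model(model: str, prompt: str) -> dict:
    """Stream one completion and record latency metrics."""
    start = time.perf_counter()
    first_token_at = None
    chunks = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            chunks += 1  # streamed chunks roughly approximate tokens
    end = time.perf_counter()
    return {
        "model": model,
        "ttft_s": round((first_token_at or end) - start, 3),  # time to first token
        "total_s": round(end - start, 3),                     # total elapsed time
        "tok_per_s": round(chunks / (end - start), 1),        # rough throughput
    }

print(time_model("turkish-gemma-9b-t1",
                 "Merhaba, kredi kartı limitimi nasıl artırabilirim?"))
```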

To rank these models, we used the following scorecard:

LLM Rank = Deployability + Efficiency/Scale + 0.5 * Latency + 3 * Accuracy/Quality

Definitions:

  • Deployability: Decile ranking (1-10) based on model size in billions of parameters (inverse; higher is better).
  • Efficiency/Scale: Decile ranking (1-10) of tokens per second per billion parameters.
  • Latency: Decile ranking (1-10) of total elapsed time, including thinking time, time to first token, and output time (inverse; higher is better).
  • Accuracy/Quality: Decile ranking (1-10) based on output volume (the number of distinct rows produced), linguistic quality (e.g., absence of typos or non-Turkish words), and alignment with our internal taxonomy.
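To make the aggregation concrete, here is a minimal sketch of how the deciles and the weighted scorecard could be computed from raw measurements. The column names and sample values are hypothetical, and percentile-rank bucketing is just one way to assign deciles.

```python
import math
import pandas as pd

# Hypothetical raw measurements per model (illustrative values only).
df = pd.DataFrame({
    "model":     ["model-a", "model-b", "model-c"],
    "params_b":  [9, 30, 120],        # size in billions of parameters
    "tok_per_s": [45.0, 38.0, 12.0],  # generation throughput
    "elapsed_s": [1.2, 2.4, 4.8],     # total time incl. thinking + TTFT
    "quality":   [0.91, 0.84, 0.95],  # manual semantic/linguistic score
})

def decile(series: pd.Series, inverse: bool = False) -> pd.Series:
    """Map a metric to a 1-10 decile; inverse=True rewards smaller values."""
    pct = series.rank(pct=True, ascending=not inverse)
    return pct.mul(10).apply(math.ceil).clip(1, 10)

df["deploy_d"]  = decile(df["params_b"], inverse=True)      # smaller deploys easier
df["eff_d"]     = decile(df["tok_per_s"] / df["params_b"])  # tokens/s per B params
df["latency_d"] = decile(df["elapsed_s"], inverse=True)     # faster is better
df["quality_d"] = decile(df["quality"])

# LLM Rank = Deployability + Efficiency/Scale + 0.5 * Latency + 3 * Accuracy/Quality
df["scorecard"] = (df["deploy_d"] + df["eff_d"]
                   + 0.5 * df["latency_d"] + 3 * df["quality_d"])
print(df.sort_values("scorecard", ascending=False)[["model", "scorecard"]])
```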

In our weighting, deployability and efficiency each carry a weight of 1. The deployability weight can be adjusted to the target platform: lower for cloud operations, higher for more constrained environments. Latency carries a weight of only 0.5. Rapid processing is essential to assess user intent and emit output tokens within our SLA, but smaller, non-reasoning models are naturally faster and often produce lower-quality results; we did not want to inadvertently reward poor output simply because it is delivered quickly. Accuracy/quality carries a weight of 3, reflecting our most critical internal priority: inaccurate or incoherent responses degrade the user experience, increase referral rates to live sales agents, and can ultimately damage brand perception. The results of our evaluation are detailed in the table below:

| Rank | Model Name | Architecture | Size (B params) | Deployability Decile | Efficiency Decile | Latency Decile | Quality Decile | Scorecard (D+E+0.5*L+3*Q) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | turkish-gemma-9b-t1 | Gemma-2 | 9 | 8 | 7 | 2 | 10 | 44 |
| 2 | qwen3-30b-a3b-thinking-2507-claude-4.5-sonnet-high-reasoning-distill | Qwen-3-MoE | 30 | 6 | 6 | 8 | 8 | 40 |
| 3 | gemma3-turkish-augment-ft | Gemma-3 | 4 | 10 | 10 | 9 | 5 | 39 |
| 4 | gpt-oss-120b-mlx | GPT-OSS | 120 | 1 | 5 | 6 | 10 | 39 |
| 5 | gemma-3-27b-it | Gemma-3 | 27 | 6 | 5 | 3 | 8 | 38 |
| 6 | qwen3-14b-claude-sonnet-4.5-reasoning-distill | Qwen-3 | 14 | 7 | 7 | 5 | 6 | 33 |
| 7 | minimax/minimax-m2 | Minimax-2-MoE-A10B | 230 | 1 | 3 | 4 | 9 | 33 |
| 8 | olmo-3-32b-think | Olmo-3 | 32 | 5 | 5 | 0 | 7 | 32 |
| 9 | turkish-llama-8b-instruct-v0.1 | Llama | 8 | 9 | 8 | 9 | 3 | 32 |
| 10 | glm-4.7 | GLM-4-MoE | 106 | 2 | 2 | 1 | 9 | 31 |
| 11 | qwq-32b | Qwen-2 | 32 | 5 | 3 | 0 | 7 | 31 |
| 12 | tongyi-deepresearch-30b-a3b-mlx | Qwen-3-MoE-A3B | 30 | 6 | 6 | 8 | 5 | 30 |
| 13 | intellect-3 | GLM-4-MoE | 106 | 2 | 4 | 5 | 7 | 29 |
| 14 | seed-oss-36b | Seed-OSS | 36 | 4 | 4 | 1 | 7 | 28 |
| 15 | teknofest-2025-turkish-edu-v2-i1 | Qwen-3 | 8 | 9 | 10 | 7 | 1 | 26 |
| 16 | baidu-ernie-4.5-21b-a3b | Ernie4-5-MoE | 21 | 7 | 8 | 10 | 2 | 25 |
| 17 | commencis-llm | Llama | 7 | 10 | 9 | 10 | 0 | 24 |
| 18 | cogito-v2-preview-llama-70b-mlx | Llama | 70 | 4 | 2 | 5 | 5 | 23 |
| 19 | turkish-article-abstracts-dataset-bb-mistral-model-v1-multi | Llama | 12 | 7 | 7 | 7 | 2 | 23 |
| 20 | mistral-7b-instruct-v0.2-turkish | Llama | 8 | 9 | 9 | 8 | 0 | 23 |
| 21 | deepseek-r1-0528-qwen3-8b | Qwen-3 | 8 | 9 | 8 | 3 | 1 | 21 |
| 22 | deepseek-v3-0324 | Deepseek-3 | 671 | 0 | 0 | 4 | 6 | 21 |
| 23 | hermes-4-70b | Llama | 70 | 4 | 2 | 6 | 3 | 19 |
| 24 | apertus-70b-instruct-2509-qx64-mlx | Apertus | 70 | 4 | 3 | 7 | 2 | 16 |
| 25 | c4ai-command-r-plus | Command-R | 104 | 2 | 1 | 2 | 4 | 16 |
| 26 | qwen3-235b-a22b | Qwen | 235 | 0 | 1 | 2 | 4 | 15 |
| 27 | grok-2 | Grok | 89 | 3 | 0 | 3 | 3 | 13 |

Before this evaluation we were using Gemma-3-27B-IT. Following the evaluation, we switched to Qwen3-30B-A3B-Thinking-2507-Claude-4.5-Sonnet-High-Reasoning-Distill. For offline demos on a laptop or in hardware-constrained environments, we recommend Turkish-Gemma-9B-T1. In terms of model family for this particular application, Gemma ranked at the top, followed by GPT-OSS. Though widely used in Turkish-language applications, the Qwen and Llama model families showed only average performance.

Note

  • Nine of the 36 tested models were eliminated from the list because they either produced no output or produced inconsistent output across consecutive runs.
