Finding Needles in the LLM Haystack: Testing LLMs for a Turkish Bank Voice Agent

Hugging Face hosts over 2.5 million large language models (LLMs) as of February 2026. However, identifying the subset relevant to a specific use case remains difficult, particularly in multilingual, industry-specific settings. At datAcxion, we encountered this challenge when developing an outbound voice-AI sales agent for a Turkish bank. That project presented many hurdles: voice activity detection (VAD), Turkish automatic speech recognition (ASR), user-intent classification and next-best-action decisioning, text-to-voice (TTV) streaming, and doing all of this under a sub-second latency requirement. This write-up focuses specifically on model selection.

Finding a proven LLM specifically trained for Turkish is challenging. While multilingual models exist, those augmented for Turkish banking often underperform. For instance, even models fine-tuned or RAG-augmented with Turkish banking documents, such as the Commencis LLM, did not meet our requirements.

We tested 36 Turkish-capable LLMs using product-specific prompt engineering in LM Studio on a Mac Studio M3 Ultra (512 GB unified memory). Our evaluation was based on three primary constraints:

  1. Latency: A bank-mandated SLA of at most 3 seconds per response (our demo achieved 1.1 s); see the timing sketch after this list.
  2. Deployability: The model must run on-premise to meet Turkey's data-protection requirements (KVKK, the Turkish counterpart of the GDPR).
  3. Accuracy: A goal of handling 90% of cases without referral to a live agent.
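To compare candidates consistently, we timed streamed responses against LM Studio's local OpenAI-compatible server. Below is a minimal sketch of such a timing harness, assuming LM Studio's default local endpoint; the model name and Turkish prompt are illustrative placeholders, not our production configuration.

```python
import time
from openai import OpenAI

# LM Studio serves an OpenAI-compatible API locally; the default
# endpoint, model name, and prompt below are illustrative placeholders.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

def time_model(model: str, prompt: str) -> dict:
    """Stream one completion and record latency metrics."""
    start = time.perf_counter()
    first_token_at = None
    chunks = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            chunks += 1  # streamed chunks roughly approximate tokens
    end = time.perf_counter()
    return {
        "model": model,
        "ttft_s": round((first_token_at or end) - start, 3),  # time to first token
        "total_s": round(end - start, 3),                     # total elapsed time
        "tok_per_s": round(chunks / (end - start), 1),        # rough throughput
    }

print(time_model("turkish-gemma-9b-t1",
                 "Merhaba, kredi kartı limitimi nasıl artırabilirim?"))
```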

To rank these models, we used the following scorecard:

LLM Rank = Deployability + Efficiency/Scale + 0.5 * Latency + 3 * Accuracy/Quality

Definitions:

  • Deployability: Decile ranking (1-10) based on model size in billions of parameters (inverse; higher is better).
  • Efficiency/Scale: Decile ranking (1-10) of tokens per second per billion parameters.
  • Latency: Decile ranking (1-10) of total elapsed time, including thinking time, time to first token, and output time (inverse; higher is better).
  • Accuracy/Quality: Decile ranking (1-10) based on output volume (the number of distinct rows produced), linguistic quality (e.g., absence of typos or non-Turkish words), and alignment with our internal taxonomy.
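To make the aggregation concrete, here is a minimal sketch of how the deciles and the weighted scorecard could be computed from raw measurements. The column names and sample values are hypothetical, and percentile-rank bucketing is just one way to assign deciles.

```python
import math
import pandas as pd

# Hypothetical raw measurements per model (illustrative values only).
df = pd.DataFrame({
    "model":     ["model-a", "model-b", "model-c"],
    "params_b":  [9, 30, 120],        # size in billions of parameters
    "tok_per_s": [45.0, 38.0, 12.0],  # generation throughput
    "elapsed_s": [1.2, 2.4, 4.8],     # total time incl. thinking + TTFT
    "quality":   [0.91, 0.84, 0.95],  # manual semantic/linguistic score
})

def decile(series: pd.Series, inverse: bool = False) -> pd.Series:
    """Map a metric to a 1-10 decile; inverse=True rewards smaller values."""
    pct = series.rank(pct=True, ascending=not inverse)
    return pct.mul(10).apply(math.ceil).clip(1, 10)

df["deploy_d"]  = decile(df["params_b"], inverse=True)      # smaller deploys easier
df["eff_d"]     = decile(df["tok_per_s"] / df["params_b"])  # tokens/s per B params
df["latency_d"] = decile(df["elapsed_s"], inverse=True)     # faster is better
df["quality_d"] = decile(df["quality"])

# LLM Rank = Deployability + Efficiency/Scale + 0.5 * Latency + 3 * Accuracy/Quality
df["scorecard"] = (df["deploy_d"] + df["eff_d"]
                   + 0.5 * df["latency_d"] + 3 * df["quality_d"])
print(df.sort_values("scorecard", ascending=False)[["model", "scorecard"]])
```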

In our weighting, deployability and efficiency each carry a weight of 1. The deployability weight can be adjusted to the target platform: lower for cloud operations, higher for more constrained environments. Latency carries a weight of only 0.5. Rapid processing is essential to assess user intent and emit output tokens within our SLA, but smaller, non-reasoning models are naturally faster and often produce lower-quality results; we did not want to inadvertently reward poor output simply because it is delivered quickly. Accuracy/quality carries a weight of 3, reflecting our most critical internal priority: inaccurate or incoherent responses degrade the user experience, increase referral rates to live sales agents, and can ultimately damage brand perception. The results of our evaluation are detailed in the table below:

| Rank | Model Name | Architecture | Size (B params) | Deployability Decile | Efficiency Decile | Latency Decile | Quality Decile | Scorecard (D+E+0.5*L+3*Q) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | turkish-gemma-9b-t1 | Gemma-2 | 9 | 8 | 7 | 2 | 10 | 44 |
| 2 | qwen3-30b-a3b-thinking-2507-claude-4.5-sonnet-high-reasoning-distill | Qwen-3-MoE | 30 | 6 | 6 | 8 | 8 | 40 |
| 3 | gemma3-turkish-augment-ft | Gemma-3 | 4 | 10 | 10 | 9 | 5 | 39 |
| 4 | gpt-oss-120b-mlx | GPT-OSS | 120 | 1 | 5 | 6 | 10 | 39 |
| 5 | gemma-3-27b-it | Gemma-3 | 27 | 6 | 5 | 3 | 8 | 38 |
| 6 | qwen3-14b-claude-sonnet-4.5-reasoning-distill | Qwen-3 | 14 | 7 | 7 | 5 | 6 | 33 |
| 7 | minimax/minimax-m2 | Minimax-2-MoE-A10B | 230 | 1 | 3 | 4 | 9 | 33 |
| 8 | olmo-3-32b-think | Olmo-3 | 32 | 5 | 5 | 0 | 7 | 32 |
| 9 | turkish-llama-8b-instruct-v0.1 | Llama | 8 | 9 | 8 | 9 | 3 | 32 |
| 10 | glm-4.7 | GLM-4-MoE | 106 | 2 | 2 | 1 | 9 | 31 |
| 11 | qwq-32b | Qwen-2 | 32 | 5 | 3 | 0 | 7 | 31 |
| 12 | tongyi-deepresearch-30b-a3b-mlx | Qwen-3-MoE-A3B | 30 | 6 | 6 | 8 | 5 | 30 |
| 13 | intellect-3 | GLM-4-MoE | 106 | 2 | 4 | 5 | 7 | 29 |
| 14 | seed-oss-36b | Seed-OSS | 36 | 4 | 4 | 1 | 7 | 28 |
| 15 | teknofest-2025-turkish-edu-v2-i1 | Qwen-3 | 8 | 9 | 10 | 7 | 1 | 26 |
| 16 | baidu-ernie-4.5-21b-a3b | Ernie4-5-MoE | 21 | 7 | 8 | 10 | 2 | 25 |
| 17 | commencis-llm | Llama | 7 | 10 | 9 | 10 | 0 | 24 |
| 18 | cogito-v2-preview-llama-70b-mlx | Llama | 70 | 4 | 2 | 5 | 5 | 23 |
| 19 | turkish-article-abstracts-dataset-bb-mistral-model-v1-multi | Llama | 12 | 7 | 7 | 7 | 2 | 23 |
| 20 | mistral-7b-instruct-v0.2-turkish | Llama | 8 | 9 | 9 | 8 | 0 | 23 |
| 21 | deepseek-r1-0528-qwen3-8b | Qwen-3 | 8 | 9 | 8 | 3 | 1 | 21 |
| 22 | deepseek-v3-0324 | Deepseek-3 | 671 | 0 | 0 | 4 | 6 | 21 |
| 23 | hermes-4-70b | Llama | 70 | 4 | 2 | 6 | 3 | 19 |
| 24 | apertus-70b-instruct-2509-qx64-mlx | Apertus | 70 | 4 | 3 | 7 | 2 | 16 |
| 25 | c4ai-command-r-plus | Command-R | 104 | 2 | 1 | 2 | 4 | 16 |
| 26 | qwen3-235b-a22b | Qwen | 235 | 0 | 1 | 2 | 4 | 15 |
| 27 | grok-2 | Grok | 89 | 3 | 0 | 3 | 3 | 13 |

Before this evaluation we were using Gemma-3-27B-IT. Following the evaluation, we switched to Qwen3-30B-A3B-Thinking-2507-Claude-4.5-Sonnet-High-Reasoning-Distill. For offline demos on a laptop or in hardware-constrained environments, we recommend Turkish-Gemma-9B-T1. In terms of model family for this particular application, Gemma ranked at the top, followed by GPT-OSS. Though widely used in Turkish-language applications, the Qwen and Llama model families showed only average performance.

Note

  • Nine of the 36 tested models were eliminated from the list because they either produced no output or produced inconsistent output across consecutive runs.
