Analyzing Quantized Small Language Models for Efficient Edge Deployment 


Vol. 50, No. 9, pp. 1364-1380, Sep. 2025
DOI: 10.7840/kics.2025.50.9.1364


Abstract

Quantized small language models (SLMs) offer a promising approach for deploying advanced natural language processing (NLP) services on resource-constrained edge devices. However, an in-depth examination of how different quantization configurations influence accuracy and efficiency remains underexplored. This paper systematically evaluates 72 quantized variants of Llama 3.2 (1B and 3B parameters) and Qwen 2.5 (1.5B and 3B parameters) across 13 quantization configurations, ranging from q2_K to q6_K. We use the MMLU-Pro benchmark to measure accuracy (both including and excluding random guesses), inference time, resource utilization, and power consumption on an NVIDIA Jetson Orin Nano. Our findings reveal that low-bit quantized models often rely heavily on random guessing, with modest accuracy improvements observed when such guesses are excluded. Furthermore, Qwen 2.5 models generally yield superior accuracy and lower latency than Llama 3.2, albeit with higher sensitivity to quantization, whereas Llama 3.2 exhibits more consistent performance across quantization configurations. CPU utilization remains low (approximately 1-4%), GPU utilization peaks at around 90%, and power consumption ranges from 9.2 W to 11.5 W. Variability across domains (computer science, engineering, and math) underscores the importance of selecting the appropriate model family, parameter size, and quantization configuration for a given application. We conclude by outlining future directions for improving on-device NLP, including mixed-precision quantization, hardware-specific optimizations, and broader assessments covering multilingual or multimodal tasks.
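
The abstract does not spell out how accuracy "excluding random guesses" is computed. As a minimal, hypothetical sketch (not the paper's code), one plausible bookkeeping scheme flags each MMLU-Pro item for which the model produced no parseable answer choice and a random option was substituted; the two scores then differ only in whether those flagged items are counted. Field names such as "correct" and "guessed" are illustrative assumptions.

    # Hypothetical sketch: inclusive vs. guess-excluded accuracy over per-item
    # evaluation records. Illustrative only; not taken from the paper.
    from dataclasses import dataclass

    @dataclass
    class EvalRecord:
        correct: bool   # answer matched the gold option
        guessed: bool   # True if a random choice was substituted for an unparseable answer

    def accuracy(records: list[EvalRecord], exclude_guesses: bool = False) -> float:
        """Fraction of correct items, optionally dropping randomly guessed ones."""
        pool = [r for r in records if not (exclude_guesses and r.guessed)]
        return sum(r.correct for r in pool) / len(pool) if pool else 0.0

    records = [
        EvalRecord(correct=True,  guessed=False),
        EvalRecord(correct=True,  guessed=True),   # lucky random guess
        EvalRecord(correct=True,  guessed=True),   # lucky random guess
        EvalRecord(correct=False, guessed=False),
    ]
    print(accuracy(records))                        # 0.75 with guesses counted
    print(accuracy(records, exclude_guesses=True))  # 0.50 with guesses excluded

Since MMLU-Pro items offer ten answer options, a random guess is correct only about 10% of the time, so a model that guesses on many items can score quite differently under the two metrics; this is the distinction the abstract draws for low-bit quantized models.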

Cite this article

[IEEE Style]

S. Jang, S. Yang, C. Choi, "Analyzing Quantized Small Language Models for Efficient Edge Deployment," The Journal of Korean Institute of Communications and Information Sciences, vol. 50, no. 9, pp. 1364-1380, 2025. DOI: 10.7840/kics.2025.50.9.1364.

[ACM Style]

Sooyoung Jang, Seungho Yang, and Changbeom Choi. 2025. Analyzing Quantized Small Language Models for Efficient Edge Deployment. The Journal of Korean Institute of Communications and Information Sciences, 50, 9, (2025), 1364-1380. DOI: 10.7840/kics.2025.50.9.1364.

[KICS Style]

Sooyoung Jang, Seungho Yang, Changbeom Choi, "Analyzing Quantized Small Language Models for Efficient Edge Deployment," The Journal of Korean Institute of Communications and Information Sciences, vol. 50, no. 9, pp. 1364-1380, 9. 2025. (https://doi.org/10.7840/kics.2025.50.9.1364)