An Analysis on Inference Time, Accuracy, Communication, and GPU Memory Usage for Inference Batch of Large Language Models 


Vol. 49,  No. 10, pp. 1377-1385, Oct.  2024
10.7840/kics.2024.49.10.1377


  Abstract

Recently, large language models (LLMs) such as GPT, LLaMA, and PaLM have been actively applied in diverse fields, including medicine, education, finance, law, and marketing. These models have so many parameters that multiple GPUs are required to perform inference. For administrators of inference services in clusters or clouds, it is critical to use the given GPU and network resources as efficiently as possible so that numerous user requests can be served quickly. To this end, existing inference systems employ various parallelization and optimization strategies. This paper profiles and analyzes inference time, prediction accuracy, GPU communication volume, and GPU memory usage across different parallelization strategies, optimization techniques, and batch sizes. Notably, we develop a new profiler for precise measurement of GPU resources. Our profiling results reveal that increasing the batch size can cause inefficiencies due to increased GPU communication. In terms of GPU memory, larger batch sizes utilize memory more aggressively, but a specific threshold exists beyond which out-of-memory errors occur on the limited GPU memory. These observations are expected to serve as a baseline for designing efficient inference systems for large language models.
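The out-of-memory threshold described above can be illustrated with a minimal sketch. Assuming a simple linear memory model (fixed model weights plus per-sample activation/KV-cache memory), one can search for the largest batch size that still fits in GPU memory. The model parameters (`weight_gb`, `per_sample_gb`) and the doubling search below are hypothetical placeholders, not measurements from the paper:

```python
# Illustrative sketch (not from the paper): find the largest batch size that
# fits in GPU memory under an assumed linear memory model.

def estimated_memory_gb(batch_size, weight_gb=13.0, per_sample_gb=0.6):
    """Rough model: fixed weight memory plus per-sample activation memory.
    Both constants are hypothetical placeholders."""
    return weight_gb + batch_size * per_sample_gb

def max_batch_size(gpu_memory_gb, start=1):
    """Double the batch size while the estimate still fits, and return the
    last power-of-two size that fit (the OOM threshold in this model)."""
    b = start
    while estimated_memory_gb(b * 2) <= gpu_memory_gb:
        b *= 2
    return b

print(max_batch_size(40.0))  # e.g., a 40 GB GPU
```

In practice, memory growth is not exactly linear (attention caches, fragmentation, and parallelization strategy all matter), which is why the paper's profiler measures actual GPU memory usage rather than relying on such estimates.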

  Cite this article

[IEEE Style]

C. Shin, Y. Go, Y. Yoo, G. Yang, C. Yoo, "An Analysis on Inference Time, Accuracy, Communication, and GPU Memory Usage for Inference Batch of Large Language Models," The Journal of Korean Institute of Communications and Information Sciences, vol. 49, no. 10, pp. 1377-1385, 2024. DOI: 10.7840/kics.2024.49.10.1377.

[ACM Style]

Changyong Shin, Younghun Go, Yeonho Yoo, Gyeongsik Yang, and Chuck Yoo. 2024. An Analysis on Inference Time, Accuracy, Communication, and GPU Memory Usage for Inference Batch of Large Language Models. The Journal of Korean Institute of Communications and Information Sciences, 49, 10, (2024), 1377-1385. DOI: 10.7840/kics.2024.49.10.1377.

[KICS Style]

Changyong Shin, Younghun Go, Yeonho Yoo, Gyeongsik Yang, Chuck Yoo, "An Analysis on Inference Time, Accuracy, Communication, and GPU Memory Usage for Inference Batch of Large Language Models," The Journal of Korean Institute of Communications and Information Sciences, vol. 49, no. 10, pp. 1377-1385, 10. 2024. (https://doi.org/10.7840/kics.2024.49.10.1377)