An Analysis on Inference Time, Accuracy, Communication, and GPU Memory Usage for Inference Batch of Large Language Models
Vol. 49, No. 10, pp. 1377-1385, Oct. 2024
DOI: 10.7840/kics.2024.49.10.1377
Keywords: Large Language Model, GPU utilization, communication overhead, Model Parallelism, Tensor parallelism, Kernel fusion, Batch size
Cite this article
[IEEE Style]
C. Shin, Y. Go, Y. Yoo, G. Yang, and C. Yoo, "An Analysis on Inference Time, Accuracy, Communication, and GPU Memory Usage for Inference Batch of Large Language Models," The Journal of Korean Institute of Communications and Information Sciences, vol. 49, no. 10, pp. 1377-1385, Oct. 2024. DOI: 10.7840/kics.2024.49.10.1377.
[ACM Style]
Changyong Shin, Younghun Go, Yeonho Yoo, Gyeongsik Yang, and Chuck Yoo. 2024. An Analysis on Inference Time, Accuracy, Communication, and GPU Memory Usage for Inference Batch of Large Language Models. The Journal of Korean Institute of Communications and Information Sciences 49, 10 (2024), 1377-1385. DOI: 10.7840/kics.2024.49.10.1377.
[KICS Style]
Changyong Shin, Younghun Go, Yeonho Yoo, Gyeongsik Yang, Chuck Yoo, "An Analysis on Inference Time, Accuracy, Communication, and GPU Memory Usage for Inference Batch of Large Language Models," The Journal of Korean Institute of Communications and Information Sciences, vol. 49, no. 10, pp. 1377-1385, 10. 2024. (https://doi.org/10.7840/kics.2024.49.10.1377)