Improving Average Memory Response Time for Memory-Intensive Applications to Reduce Processing-Core Idle Time in GPUs

Hossein Bitalebi, Farshad Safaei

Abstract

The emergence of GPGPU computing, together with CUDA and programming models such as OpenCL, offers new opportunities to reduce the latency and power consumption of performance-oriented applications. A GPU can execute thousands of parallel threads to hide the costly latency of memory accesses. For some memory-intensive applications, however, there are likely to be intervals in which all threads of a core are stalled, waiting for data from the memory unit. In this work, our goal is to improve the memory access latency of packets generated by critical cores in GPUs. To reduce this wasted core time, we focus on the interconnection network between the cores and the last-level cache, and we prioritize packets belonging to cores with a larger number of stalled threads, both at network injection and during arbitration inside the network. In this way, the highest priority in arbitration and resource allocation is granted to the more critical packets, so their memory requests are serviced sooner, the average core stall time decreases, and ultimately GPU performance improves.
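The arbitration policy summarized above can be pictured with a minimal sketch. The class and field names below (`Packet`, `stalled_warps`, `age`) are illustrative assumptions, not identifiers from the paper; the idea is only that, among packets competing for a router output, the grant goes to the packet whose source core currently has the most stalled threads, with packet age as a tie-breaker to avoid starvation.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    src_core: int        # ID of the core that injected this memory request
    stalled_warps: int   # criticality metric: stalled threads/warps at the source core
    age: int             # cycles spent in the network; tie-breaker against starvation

def arbitrate(candidates):
    """Grant the output port to the packet from the core with the most
    stalled warps; among equally critical packets, prefer the oldest."""
    return max(candidates, key=lambda p: (p.stalled_warps, p.age))

# A core with 30 stalled warps beats one with 4, regardless of arrival order.
winner = arbitrate([
    Packet(src_core=0, stalled_warps=4,  age=12),
    Packet(src_core=3, stalled_warps=30, age=2),
    Packet(src_core=7, stalled_warps=30, age=9),
])
```

In a real router this comparison would replace (or bias) the baseline round-robin allocator at each arbitration stage, and the same criticality value would also order packets in the injection queue.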

Keywords

GPU, interconnection network, latency, priority levels, criticality, memory, idle state, cache
