Evaluating the Impact of Human Error on the Dependability of Data Storage Systems

Authors

Department of Computer Engineering, Sharif University of Technology, Tehran, Iran

Abstract

Despite the use of techniques such as automatic failure recovery, human operators, and consequently human error, remain unavoidable in data centers. Because data centers employ a very large number of disks, and given the high disk failure rate, human error in the disk subsystem is one of the main causes of data unavailability and data loss. In this paper, we investigate the effect of wrong disk replacement on the availability and reliability of data storage systems. To this end, we first examine the consequences of wrong disk replacement in a disk array, and then evaluate data unavailability and data loss using Monte Carlo simulations. In the proposed framework, (a) different disk array configurations are considered, and (b) a new metric for the unavailability of data storage systems is proposed that is independent of the size of the system under test and also captures the magnitude of the unavailability.

Keywords

Volume 18, Issue 1
Spring and Summer
Ordibehesht 1399 (April/May 2020)