A Comprehensive Model for Software Rejuvenation

Abstract

Recently, the phenomenon of software aging, in which the state of a software system degrades with time, has been reported. This phenomenon, which may eventually lead to system performance degradation and/or crash/hang failure, is the result of exhaustion of operating system resources, data corruption, and numerical error accumulation. To counteract software aging, a technique called software rejuvenation has been proposed, which essentially involves occasionally terminating an application or a system, cleaning its internal state and/or its environment, and restarting it. Since rejuvenation incurs an overhead, an important research issue is to determine optimal times to initiate this action. In this paper, we first describe how to include faults attributed to software aging in the framework of Gray's software fault classification (deterministic and transient), and study the treatment and recovery strategies for each of the fault classes. We then construct a semi-Markov reward model based on workload and resource usage data collected from the UNIX operating system. We identify different workload states using statistical cluster analysis, and estimate transition probabilities and sojourn time distributions from the data. Corresponding to each resource, a reward function is then defined for the model based on the rate of resource depletion in each state. The model is then solved to obtain estimated times to exhaustion for each resource. The results from the semi-Markov reward model are then fed into a higher-level availability model that accounts for failure followed by reactive recovery, as well as proactive recovery. This comprehensive model is then used to derive optimal rejuvenation schedules that maximize availability or minimize downtime cost.
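
As a rough illustration of the semi-Markov reward idea described in the abstract, the sketch below computes the long-run fraction of time spent in each workload state and uses it to estimate a time to resource exhaustion. It is a minimal sketch under stated assumptions, not the paper's procedure: the three states, the transition matrix `P`, the mean sojourn times, the per-state depletion rates, and the `capacity` value are all hypothetical, and dividing capacity by the mean depletion rate is only a first-order approximation that assumes a positive mean rate.

```python
def embedded_stationary(P, iters=5000):
    """Stationary vector v of the embedded Markov chain (v = vP).

    Uses a "lazy" power iteration, v <- (v + vP) / 2, which has the same
    stationary vector as P but also converges for periodic chains.
    Assumes P is row-stochastic and irreducible.
    """
    n = len(P)
    v = [1.0 / n] * n
    for _ in range(iters):
        vP = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
        v = [0.5 * (a + b) for a, b in zip(v, vP)]
    total = sum(v)
    return [x / total for x in v]


def time_fractions(P, sojourn):
    """Long-run fraction of time in each state of the semi-Markov process.

    Each embedded-chain probability is weighted by the mean sojourn time
    of its state, then the weights are normalized to sum to 1.
    """
    v = embedded_stationary(P)
    weights = [vi * hi for vi, hi in zip(v, sojourn)]
    total = sum(weights)
    return [w / total for w in weights]


def time_to_exhaustion(P, sojourn, depletion_rate, capacity):
    """First-order estimate: capacity divided by the mean depletion rate."""
    pi = time_fractions(P, sojourn)
    mean_rate = sum(p * r for p, r in zip(pi, depletion_rate))
    return capacity / mean_rate


# Hypothetical example: three workload states (all values are made up).
P = [
    [0.0, 0.5, 0.5],  # from state 0, move to state 1 or 2 with equal probability
    [1.0, 0.0, 0.0],  # state 1 always returns to state 0
    [1.0, 0.0, 0.0],  # state 2 always returns to state 0
]
sojourn = [2.0, 4.0, 4.0]   # mean time spent in each state per visit
rates = [1.0, 2.0, 3.0]     # resource units consumed per unit time in each state
capacity = 100.0            # resource units available before exhaustion

estimate = time_to_exhaustion(P, sojourn, rates, capacity)
print(f"Estimated time to exhaustion: {estimate:.2f}")
```

In the setting the abstract describes, the states would come from cluster analysis of measured workload data and the rates from the observed resource depletion in each state; the availability model built on top of these estimates is not shown here.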

Journal
IEEE Transactions on Dependable and Secure Computing
Published
2005-02-01
DOI
10.1109/tdsc.2005.15
Open Access
Closed

