Self-Healing Cloud Infrastructure: A Hybrid Reinforcement Learning Framework for Predicting Microservice Failures in Azure Kubernetes Service (AKS)
Main Article Content
Abstract
The challenges that remain to cloud reliability are the task of migrating potentially damaging front-line services to microservices architectures. Microservice failures are still hurting businesses, despite a disillusioning array of building blocks strung to build orchestration technique Azure Kubernetes Service (AKS). We lay out here an architecture that evolves by reflex and regenerates fault-stability on the cloud itself, so that self-healing could preempt future microservice failures on purely AKS environments. We discussed ways to bridge the yawning gap between reactive monitoring and proactive failure prevention: we developed a learning agent that could recognize proper interventions by interacting with the environment. This framework integrates Deep Q-Networks with solutions drawn from actor-critic methods to be capable of maintaining real-time decision-making abilities while considering the increasingly large, state-space model of containerized applications. We harmoniously fit this model to the most existing monitoring stack of AKS, enabling it to analyze pod health metrics, resource utilization patterns, and service dependencies, then make predictions before a failure affects a real user. Experimental validations are effectively performed, revealing an accuracy fundamentally close to 87% for predictive failures with 12.5% below false positives, although the autonomous mean remedy cut shot about 25 points against human control. This research adds an additional layer of theoretically swirling fishing in the application of reinforcement learning to the managing of distributed systems, on the one hand, and on the other, against ways of practice implementation in cloud production environments.