The Azure Gamble That's Not Paying Off: Impact on GitHub Availability
Microsoft acquired GitHub in 2018. By 2025, the CTO was boldly declaring that "availability is job #1," a claim that quickly faced scrutiny: that same year, the platform's overall availability at one point dropped below 90%. That stark reality underscored the urgency of their strategic move, a full infrastructure migration to Azure. The migration was touted as delivering "rapid growth" and "global resiliency," but that framing reads like typical marketing rhetoric, detached from the complex operational realities of such a massive undertaking.
The migration hasn't gone smoothly. Far from it. As of today, only 12.5% of GitHub traffic routes through Azure Central US, against a target of 50% by July 2026. This is a massive, high-stakes migration happening *while* the existing platform struggles, and the two compound each other. As of March 2026, GitHub's overall uptime across all services stands at a mere 90%. And that figure doesn't even account for periods of slow performance where the service is "technically usable" but practically degraded.
Trust within the community is eroding rapidly. Remember the policy change about charging for self-hosted runners? GitHub reversed it, but the damage was done: the proposal was read as a betrayal, another sign that priorities were shifting away from the developers who built the platform.
The Failure Modes Are Piling Up
February 2026 was a period of significant operational instability for GitHub, marked by six major incidents. Let's examine the more egregious ones, because they reveal a pattern of fundamental architectural and operational weaknesses.
The Telemetry Blind Spot (February 2, 2026)
This incident hit GitHub Actions hosted runners and Codespaces hard. The root cause was a loss of telemetry that cascaded: security policies were mistakenly applied to backend storage accounts, blocking access to critical VM metadata and effectively bricking VM creation.
This isn't some exotic zero-day. It's a configuration management screw-up amplified by a telemetry failure. Without visibility you cannot make changes safely, and your guardrails cannot tell a healthy system from a broken one, which lands directly on availability.
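One basic defense is to gate risky changes on telemetry freshness. Here's a minimal sketch of that idea, not GitHub's actual tooling; the threshold and the policy-apply hook are hypothetical:

```python
import time

TELEMETRY_MAX_AGE_S = 120  # hypothetical freshness threshold

def telemetry_is_fresh(last_sample_ts: float) -> bool:
    """True if the newest metric sample is recent enough to trust."""
    return (time.time() - last_sample_ts) < TELEMETRY_MAX_AGE_S

def apply_policy_change(change_name: str, last_sample_ts: float) -> None:
    """Refuse to touch production policy while flying blind."""
    if not telemetry_is_fresh(last_sample_ts):
        raise RuntimeError(
            f"Telemetry stale; refusing to apply {change_name!r} without visibility"
        )
    # ... proceed with the rollout here, canary first ...
```

A gate like this turns "we lost telemetry" from an amplifier into a stop condition.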
The Caching Catastrophe (February 9, 2026)
Next, the core database cluster overloaded. Two popular client-side applications went live, generating a more than 10x spike in read traffic. Concurrently, a new model reduced the user settings cache TTL from 12 hours to 2 hours: more frequent refreshes, more writes. The database buckled under the combined load.
The architectural flaw is clear: user settings, once a few bytes, had ballooned to kilobytes per user. The long TTL masked the bloat, and it only became critical when the TTL shortened and the new client apps hit the system, a latent design flaw exposed by increased load and a seemingly minor change. Crucially, GitHub lacked granular load-shedding switches, a fundamental requirement for any system at this scale.
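A granular load-shedding switch doesn't have to be elaborate. Here's a minimal sketch of the pattern, assuming a hypothetical per-switch shed ratio that would normally come from a feature-flag service: under pressure, non-critical cache refreshes are skipped and stale entries served instead.

```python
import random
import time
from dataclasses import dataclass

# Hypothetical shed ratios, normally driven by a feature-flag service.
SHED_FLAGS = {"user_settings_refresh": 0.8}  # skip 80% of cache refreshes

@dataclass
class CacheEntry:
    value: dict
    expires_at: float

    @property
    def expired(self) -> bool:
        return time.time() >= self.expires_at

def should_shed(switch: str) -> bool:
    """Probabilistically skip non-critical work behind a named switch."""
    return random.random() < SHED_FLAGS.get(switch, 0.0)

def get_user_settings(user_id: int, cache: dict, fetch_from_db) -> dict:
    entry = cache.get(user_id)
    if entry is not None and (not entry.expired or should_shed("user_settings_refresh")):
        # Under pressure, a stale answer beats a database stampede.
        return entry.value
    value = fetch_from_db(user_id)  # the expensive call being protected
    cache[user_id] = CacheEntry(value, time.time() + 2 * 3600)  # 2h TTL
    return value
```

Flipping `user_settings_refresh` to 1.0 during an incident trades freshness for survival, which is exactly the knob GitHub was missing.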
Beyond availability, there's the security angle. In mid-March 2026, Aqua Security reported a breach. Repositories were infected via GitHub Actions vulnerabilities. This involved widespread use of mutable references and an acknowledged, unfixed issue allowing malicious Action references into workflows. The proposed mitigation, Immutable Releases, is opt-in and per-repository. That's not a solution; it's a suggestion.
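Until immutability is the default, teams can at least audit their own exposure. Here's a minimal sketch that scans workflow files under the conventional .github/workflows directory and flags any `uses:` reference not pinned to a full commit SHA (tags and branches are mutable and can be repointed):

```python
import re
from pathlib import Path

# A pinned reference is a 40-character hex commit SHA; anything else
# (tags like @v4, branches like @main) is mutable.
USES_RE = re.compile(r"uses:\s*([\w./-]+)@([\w./-]+)")
SHA_RE = re.compile(r"^[0-9a-f]{40}$")

def find_mutable_refs(repo_root: str = ".") -> list[str]:
    findings = []
    for wf in Path(repo_root, ".github", "workflows").glob("*.y*ml"):
        for lineno, line in enumerate(wf.read_text().splitlines(), 1):
            m = USES_RE.search(line)
            if m and not SHA_RE.match(m.group(2)):
                findings.append(f"{wf}:{lineno}: {m.group(1)}@{m.group(2)}")
    return findings

if __name__ == "__main__":
    for finding in find_mutable_refs():
        print("mutable action ref:", finding)
```

Pinning to a SHA doesn't fix the ecosystem, but it removes the easiest attack path from your own repositories.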
Adding to the instability, a March 5, 2026 incident hit Actions job orchestration. It was caused by a latent configuration issue in a Redis cluster: an automated failover completed but left the cluster without a writable primary, requiring manual correction.
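That is precisely the failure a post-failover health check should catch automatically. A minimal sketch using the redis-py client, with a hypothetical host name, that verifies the node both reports itself as primary and actually accepts a write:

```python
import redis

def verify_writable_primary(host: str, port: int = 6379) -> None:
    """Fail loudly if the node isn't a primary that accepts writes."""
    r = redis.Redis(host=host, port=port, socket_timeout=2)
    role = r.info("replication").get("role")
    if role != "master":
        raise RuntimeError(f"{host}:{port} reports role={role}, not master")
    # A role check alone isn't enough; prove a write actually lands.
    try:
        r.set("healthcheck:write-probe", "ok", ex=10)
    except redis.exceptions.ReadOnlyError as exc:
        raise RuntimeError(f"{host}:{port} claims primary but is read-only") from exc

if __name__ == "__main__":
    verify_writable_primary("redis-primary.internal")  # hypothetical host
```

Running a probe like this as the last step of failover automation would have surfaced the dead primary in seconds instead of waiting for job orchestration to pile up.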
The Copilot Conundrum
Copilot's uptime sits at 96.47%, making it the lowest-performing tracked service. That's no coincidence. Whether the cause is load from AI-driven development or inherent instability, it reinforces the narrative: GitHub prioritizes features over stability. You can't claim "availability is job #1" while your flagship AI product actively drags down overall uptime.
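Uptime percentages obscure scale. Converting them to downtime over a 30-day month makes the numbers concrete:

```python
HOURS_PER_MONTH = 30 * 24  # 720 hours in a 30-day month

for name, uptime_pct in [("Overall (all services)", 90.0), ("Copilot", 96.47)]:
    downtime_h = (1 - uptime_pct / 100) * HOURS_PER_MONTH
    print(f"{name}: {downtime_h:.1f} hours of downtime per month")
# Overall (all services): 72.0 hours of downtime per month
# Copilot: 25.4 hours of downtime per month
```

Three full days a month of overall unavailability, and more than a day for Copilot alone. That's not a utility; that's a liability.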
What Now?
GitHub claims it's redesigning the user cache, expediting capacity planning, isolating dependencies, and protecting downstream components. These are basic engineering practices they should have implemented years ago. Monolith decomposition is a long-term strategy, but it's critical.
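"Protecting downstream components" typically means circuit breaking: stop hammering a dependency that is already failing. Here's a minimal sketch of the pattern, not GitHub's implementation:

```python
import time

class CircuitBreaker:
    """Stop calling a failing dependency; retry after a cooldown."""

    def __init__(self, max_failures: int = 5, reset_after_s: float = 30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: dependency still cooling down")
            self.opened_at = None  # half-open: let one probe request through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()  # trip the breaker
            raise
        self.failures = 0  # success resets the count
        return result
```

None of this is novel; it's the kind of protection that, by GitHub's own admission, wasn't in place when the cache change landed.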
In my view, GitHub should prioritize stability over new feature development, pausing the pursuit of AI trends to address core platform issues. While the Azure migration is a strategic necessity, it is not a panacea. It introduces its own set of failure modes and complexities.
For developers, the message is clear: diversify. Avoid over-reliance on a single platform like GitHub. Look at alternatives like GitLab, which offers robust self-hosting and a different architectural approach to CI/CD, or consider self-hosting critical components directly. Build redundancy into your CI/CD pipelines so they don't rely solely on GitHub Actions. The era of GitHub as an unquestionable, always-on utility is over. Without significant changes, the current trajectory promises continued outages, frustration, and further erosion of trust.
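The cheapest form of redundancy is a push mirror, so your history survives a GitHub outage even when Actions doesn't. A minimal sketch, with hypothetical remote URLs, that configures git so a single `git push` updates both hosts:

```python
import subprocess

PRIMARY = "git@github.com:org/repo.git"  # hypothetical primary remote
MIRROR = "git@gitlab.com:org/repo.git"   # hypothetical mirror remote

def add_push_mirror(remote: str = "origin") -> None:
    """Configure `remote` so one `git push` updates both hosts."""
    # Set the primary push URL explicitly, then add the mirror as a
    # second push URL; git pushes to every configured push URL in order.
    subprocess.run(["git", "remote", "set-url", "--push", remote, PRIMARY], check=True)
    subprocess.run(["git", "remote", "set-url", "--add", "--push", remote, MIRROR], check=True)

if __name__ == "__main__":
    add_push_mirror()
```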
The implications extend beyond immediate outages. The instability fosters a climate of uncertainty across the developer ecosystem, pushing teams to re-evaluate their dependencies, and it hands competitors like GitLab an opening they are well positioned to exploit. If the trajectory holds, the developer-tools landscape will keep drifting away from centralized, monolithic platforms toward more diversified, fault-tolerant workflows.