On behalf of the team here at Hive, we would like to apologize for yesterday's service interruptions, and we appreciate your patience as we worked to restore service. As posted in the incident status updates, the Hive web platform experienced service disruptions that impacted project loading from 8:01am to 8:26am Eastern. The incident remained open as a partial outage while we monitored the failover from 8:26am to 8:55am Eastern, and was then left in a monitoring state until incident close-out. A detailed timeline, including the mitigation steps taken, is listed below (all times are Eastern):
8:01am - Application monitoring alarms raised, notifying our team of issues following the completion of an application deployment.
8:15am - Initial investigation confirms the issues are widespread, impacting users who had been switched over to the latest web application refresh.
8:18am - Application deployment reversion started. Failover to stable environment initiated.
8:26am - Confirmation of all users switched over to failover environment and project loading service disruption resolved.
8:30am - Upon reviewing logs after the switch to the failover environment, the team confirmed that one specific scenario, project creation from templates with pre-configured table layout options, failed to fully complete. This specific issue remained until a separate service redeployment, which had been initiated at 8:18am. The issue was due to an application version mismatch and impacted just under 2% of the active user population.
In short, a web application deployment (which completed just before 8am) contained a cached version of a pre-production Hive build, leading to mismatched application versions and logic between services. Upon review of the deployment command logs, the team has confirmed that this cached version was previously deployed to a pre-production environment and not properly cleared out before the production deployment was built.
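To illustrate the failure mode described above, the following is a minimal sketch of how a mismatch between services could be detected by comparing the build identifier each service reports against the build that was meant to ship. The function name, the service names, and the build labels are hypothetical and are not taken from Hive's tooling.

```python
def find_version_mismatches(reported_builds: dict[str, str], expected_build: str) -> dict[str, str]:
    """Return the services whose reported build differs from the expected one.

    `reported_builds` maps a (hypothetical) service name to the build
    identifier that service reports; how that identifier is collected
    is outside the scope of this sketch.
    """
    return {
        service: build
        for service, build in reported_builds.items()
        if build != expected_build
    }


if __name__ == "__main__":
    # Hypothetical example: one service still carries a pre-production build.
    reported = {"web-app": "preprod-build", "project-service": "release-build"}
    print(find_version_mismatches(reported, expected_build="release-build"))
```

An empty result would mean every listed service reports the expected build; any entry in the result names a service running different logic from the rest.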
While our deployment scripts already ask for written confirmation before initiating a deployment, and that confirmation shows the version (branch/build) and target environment, warnings about potential untracked or cached changes are not shown. To ensure that this root cause of mismatched application versions does not recur, the team has taken steps to update the deployment commands and their contextual information so that a deployment automatically fails if untracked or cached changes are present.
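As a rough sketch of what such a guard might look like, the example below refuses to proceed when the checkout is not clean. It assumes the deployment is built from a Git checkout and that "untracked or cached changes" corresponds to Git's untracked files and staged (index) changes; Hive's actual scripts are not described in this post, so the function names here are hypothetical.

```python
import subprocess
import sys


def find_blocking_changes(porcelain_output: str) -> dict[str, list[str]]:
    """Classify `git status --porcelain` lines by the kind of pending change.

    Each line has the form "XY path": X is the index (staged) state and
    Y is the working-tree state. Untracked files are reported as "??".
    """
    found = {"untracked": [], "staged": [], "unstaged": []}
    for line in porcelain_output.splitlines():
        if len(line) < 4:
            continue  # skip blank or malformed lines
        index_state, worktree_state, path = line[0], line[1], line[3:]
        if index_state == "?" and worktree_state == "?":
            found["untracked"].append(path)
            continue
        if index_state != " ":
            found["staged"].append(path)
        if worktree_state != " ":
            found["unstaged"].append(path)
    return found


def assert_clean_checkout(repo_dir: str = ".") -> None:
    """Exit with an error if the checkout has any pending changes."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    found = find_blocking_changes(result.stdout)
    problems = {kind: paths for kind, paths in found.items() if paths}
    if problems:
        for kind, paths in problems.items():
            print(f"{kind}: {', '.join(paths)}", file=sys.stderr)
        sys.exit("Deployment aborted: checkout has untracked or pending changes.")
```

Calling `assert_clean_checkout()` as the first step of a deployment command would stop the build before any confirmation prompt is shown. Note that this only covers changes visible to Git; cached build artifacts excluded by ignore rules would need a separate check.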