Issues with project loading
Incident Report for Hive
Postmortem

Context and timeline

On behalf of the team here at Hive, we would like to apologize for the interruptions to service yesterday, and we appreciate your patience as we worked to restore service. As posted in the incident status updates, the Hive web platform experienced service disruptions that impacted project loading from 8:01am to 8:26am Eastern. The incident was left open as a partial outage while we monitored the failover from 8:26am to 8:55am Eastern, and it then remained in a monitoring state until incident close-out. A detailed timeline, including the mitigation steps taken, is listed below (all times are Eastern):

8:01am - Application monitoring alarms raised, notifying our team of issues following completion of an application deployment.

8:15am - Initial investigation confirms issues are widespread, impacting users who had been swapped over to the latest web application refresh.

8:18am - Application deployment reversion started. Failover to stable environment initiated.

8:26am - Confirmation of all users switched over to failover environment and project loading service disruption resolved.

8:30am - Upon review of logs after switching to the failover environment, the team confirmed that one specific scenario, project creation from templates with pre-configured table layout options, failed to fully complete. This specific issue remained until a separate service redeployment, which was initiated at 8:18am. The issue was due to an application version mismatch and impacted just under 2% of the active user population.

Root cause

In short, a web application deployment (which completed just before 8am) contained a cached version of a pre-production Hive build, leading to mismatched application versions and logic between services. Upon review of the deployment command logs, the team has confirmed that this cached version was previously deployed to a pre-production environment and not properly cleared out before the production deployment was built.
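
As an illustration of the kind of mismatch described above, the following is a minimal sketch of a consistency check across services. It is not Hive's actual tooling: the function name `find_version_mismatch`, the service names, and the version strings are all hypothetical, and it assumes each service can report the build it is running.

```python
from collections import defaultdict


def find_version_mismatch(reported: dict[str, str]) -> dict[str, list[str]]:
    """Group services by the build version they report.

    Returns an empty dict when every service reports the same version,
    otherwise a mapping of version -> services running that version.
    `reported` is a hypothetical mapping of service name -> version string.
    """
    by_version = defaultdict(list)
    for service, version in reported.items():
        by_version[version].append(service)

    # A single distinct version means all services agree.
    if len(by_version) <= 1:
        return {}
    return {version: sorted(services) for version, services in by_version.items()}
```

With this sketch, a call such as `find_version_mismatch({"web": "1.5.0-rc1", "api": "1.4.2"})` would return both versions with the services running each, which a deployment step could treat as a reason to stop.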

Remediation plan

Our deployment scripts already ask for written confirmation before initiating a deployment, and the confirmation shows which version (branch/build) and target environment are involved. However, warnings about potential untracked or cached changes are not shown. To prevent mismatched application versions from being deployed again, the team has taken steps to update deployment commands and contextual information so that a deployment will automatically fail in the event of untracked or cached changes.
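
A guard of this kind could look roughly like the sketch below. It assumes the deployment is built from a Git working tree and that "untracked or cached changes" corresponds to untracked files and staged (index) changes as reported by `git status --porcelain`. That reading is an assumption, and the function names are hypothetical rather than taken from Hive's deployment scripts.

```python
import subprocess
import sys


def find_blocking_changes(porcelain_output: str) -> list[str]:
    """Return one description per untracked or staged entry.

    Expects the output of `git status --porcelain` (v1 format), where each
    line is two status characters, a space, and then the path.
    """
    blocking = []
    for line in porcelain_output.splitlines():
        if len(line) < 4 or line.startswith("!!"):
            continue  # skip malformed lines and ignored files
        index_status, path = line[0], line[3:]
        if line.startswith("??"):
            blocking.append(f"untracked: {path}")
        elif index_status != " ":
            # A non-blank first column means the change is staged in the index.
            blocking.append(f"staged: {path}")
    return blocking


def assert_clean_for_deploy(repo_dir: str = ".") -> None:
    """Exit with a non-zero status if the working tree has blocking changes."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    blocking = find_blocking_changes(result.stdout)
    if blocking:
        print("Deployment aborted, working tree is not clean:", file=sys.stderr)
        for item in blocking:
            print(f"  {item}", file=sys.stderr)
        sys.exit(1)
```

Calling `assert_clean_for_deploy()` at the start of a deployment script would stop the run before anything is built. Note that this sketch deliberately ignores unstaged modifications to tracked files; a stricter guard could fail on any non-empty status output.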

Posted Oct 28, 2022 - 14:43 EDT

Resolved
All systems have remained stable since our earlier updates at 8:26am and 8:55am Eastern. We have gone ahead and unified application states across all environments to ensure no users experience version mismatches. We'll continue to actively monitor stability throughout the day, and do not anticipate any further issues.

A post-mortem has been underway since ~9:30am Eastern this morning and will be posted here once finalized.
Posted Oct 27, 2022 - 13:30 EDT
Update
We are continuing to investigate the issue, but systems have now remained stable.
Posted Oct 27, 2022 - 08:55 EDT
Update
We have failed over to a stable application version while we work to identify the root cause. The application should be available and working now. We will leave this incident open while we investigate the original cause, implement a fix, and monitor before resolving.
Posted Oct 27, 2022 - 08:26 EDT
Update
We are currently working to roll back changes that were deployed earlier and are related to the issue.
Posted Oct 27, 2022 - 08:18 EDT
Investigating
We are currently investigating this issue.
Posted Oct 27, 2022 - 08:18 EDT
This incident affected: Client Applications (Web Application, Desktop Application).