On May 7th, 2019, at approximately 1:17 PM, the GitHub service went offline unexpectedly. The web interface and SSH access to the appliance were unresponsive, and pings timed out. An investigation of the server from the console in VMware revealed an error message: “unable to fork ghe-welcome: cannot allocate memory”.
Recovery and Status Checks
Because this error indicated that the available memory on the appliance had been exhausted, we initiated a hard reset of the VM in VMware. The box came back online but did not have an IP address. The previous day, we had deleted some old DNS entries left over from development and testing. Apparently, one of those entries held the MAC address of the production instance, and it was that entry that had allowed DHCP leasing to succeed. We added the MAC address to the host entry for the production server and rebooted again. This time, the appliance came up with an IP address.
Whenever the GitHub appliance is rebooted unexpectedly, it takes some time (usually about 3-4 minutes) to locate the storage device for user data again. After we waited at the “Preflight Checks” screen for a bit, everything came up normally. By 1:31 PM, the service appeared to be running as usual, and all functionality we tested worked as expected.
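The reachability part of these status checks can be scripted. The following is a minimal sketch, not the procedure we actually followed: it only tests whether a TCP port accepts connections, and the function names, the hostname, and the choice of ports 22 (SSH) and 443 (HTTPS) are assumptions made for illustration.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def check_services(host: str, ports: dict, timeout: float = 3.0) -> dict:
    """Map each service name in `ports` to whether its TCP port accepts connections."""
    return {name: port_is_open(host, port, timeout) for name, port in ports.items()}
```

For example, `check_services("github.example.com", {"ssh": 22, "https": 443})` (a placeholder hostname) returns a dictionary such as `{"ssh": True, "https": True}` when both ports answer. A successful TCP connection only shows that something is listening, so functional tests of the service are still needed afterwards.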
Cause Analysis and Remediation
After ensuring that the service was stable, we looked at the GitHub monitoring records to see whether there were any abnormalities in resource usage over the past hour. We paid particular attention to memory usage, since memory exhaustion was the most likely cause of the issue. We noticed that right before the outage, the amount of free memory on the box was less than 400 MB. However, further analysis of usage patterns going back a year showed that this is a fairly common occurrence, because Elasticsearch consumes as much memory as it can while it builds indexes.
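The comparison against historical usage can be sketched as a simple threshold count. This is an illustration only, under the assumption that the monitoring data can be exported as (timestamp, free memory in MB) pairs; the function names and the sample values in the usage note are hypothetical and are not taken from our monitoring records.

```python
def low_memory_samples(samples, threshold_mb=400):
    """Return the (timestamp, free_mb) pairs whose free memory is below threshold_mb."""
    return [(ts, free_mb) for ts, free_mb in samples if free_mb < threshold_mb]


def low_memory_fraction(samples, threshold_mb=400):
    """Return the share of samples (0.0 to 1.0) that fall below threshold_mb."""
    if not samples:
        return 0.0  # No data: report zero instead of dividing by zero.
    return len(low_memory_samples(samples, threshold_mb)) / len(samples)
```

With made-up samples such as `[("12:00", 900), ("12:30", 350), ("13:00", 380), ("13:30", 1200)]`, `low_memory_fraction` returns 0.5. A high fraction over a long period would support the reading that dips below 400 MB are routine and not specific to the outage.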
With this in mind, our best guess at the root cause of this incident is unfortunate timing: a new user was created while an Elasticsearch indexing job was running. The last user created in the system was added at 12:47 PM, 30 minutes before the incident.
No remediation action was taken at this time, because we assume the incident was a one-time timing fluke. If the situation occurs again, we will open a ticket with GitHub support for further analysis.