This unavailability manifested itself primarily as brief phases of availability that ended again shortly afterwards. After an initial analysis by the development and operations teams, our engineers were able to stabilize the cloud and return it to normal operation. In addition, initial measures were implemented immediately, mitigating the short-term impact of the problem through infrastructural changes.
In the weeks since the incident, we have carried out in-depth analyses of the software itself, its configuration, and its interfaces to other systems. Our teams succeeded in reproducing the incidents under laboratory conditions and analyzing them there. Specifically, the problem was traced back to a syslog connector for our audit logs, which under specific circumstances could enter a blocking state and, through a cascade of events, subsequently affect the stability of the cloud.
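To illustrate this failure mode in general terms, the following Python sketch shows how an audit-log forwarder that writes synchronously to a remote syslog endpoint can block its callers when the endpoint stalls, and how decoupling the audit path with a bounded queue keeps such a stall from cascading. This is a generic illustration under assumed names and parameters, not the actual implementation of our connector or the fix described below.

```python
import queue
import socket
import threading

class BlockingSyslogConnector:
    """Hypothetical naive forwarder: writes audit records synchronously."""

    def __init__(self, host: str, port: int = 514):
        self.sock = socket.create_connection((host, port))

    def emit(self, message: str) -> None:
        # No timeout: once the peer stops reading and the TCP buffers
        # fill up, sendall() blocks, and every caller that emits an
        # audit record blocks with it.
        self.sock.sendall(message.encode() + b"\n")


class NonBlockingSyslogConnector:
    """One possible mitigation: a bounded queue and a background sender."""

    def __init__(self, host: str, port: int = 514, maxsize: int = 10_000):
        self.queue = queue.Queue(maxsize=maxsize)
        self.dropped = 0
        self.sock = socket.create_connection((host, port), timeout=5)
        threading.Thread(target=self._sender, daemon=True).start()

    def emit(self, message: str) -> None:
        try:
            # The request path only enqueues; it never waits on the network.
            self.queue.put_nowait(message)
        except queue.Full:
            # Back-pressure is absorbed here instead of blocking callers.
            self.dropped += 1

    def _sender(self) -> None:
        while True:
            message = self.queue.get()
            try:
                self.sock.sendall(message.encode() + b"\n")
            except OSError:
                # Connection problems no longer cascade to the callers.
                self.dropped += 1
```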
This comprehensive understanding of the problem and its root cause has led to further adjustments to the cloud infrastructure, supported by a software patch, ensuring that this specific problem cannot recur.