This unavailability manifested itself primarily as brief phases of availability that ended again shortly afterwards. After an initial analysis by the development and operations teams, our engineers were able to stabilize the cloud and return it to normal operation. In addition, initial measures were implemented immediately, mitigating the short-term impact of the problem through infrastructural changes.
In the weeks since the incident, we have carried out in-depth analyses of the software itself, its configuration, and its interfaces to other systems. Our teams succeeded in reproducing the incidents under laboratory conditions and analyzing them there. Specifically, the problem was traced back to a syslog connector for our audit logs, which under specific circumstances could enter a blocking state and, through a cascade of events, subsequently affect the stability of the cloud.
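To illustrate this failure mode in general terms, the following Python sketch shows how an audit-log forwarder that writes synchronously to a remote syslog endpoint can block its callers when the endpoint stalls, and how decoupling the audit path with a bounded queue keeps such a stall from cascading. This is a generic illustration under assumed names and parameters, not the actual implementation of our connector or the fix described below.

```python
import queue
import socket
import threading

class BlockingSyslogConnector:
    """Hypothetical naive forwarder: writes audit records synchronously."""

    def __init__(self, host: str, port: int = 514):
        self.sock = socket.create_connection((host, port))

    def emit(self, message: str) -> None:
        # No timeout: once the peer stops reading and the TCP buffers
        # fill up, sendall() blocks, and every caller that emits an
        # audit record blocks with it.
        self.sock.sendall(message.encode() + b"\n")


class NonBlockingSyslogConnector:
    """One possible mitigation: a bounded queue and a background sender."""

    def __init__(self, host: str, port: int = 514, maxsize: int = 10_000):
        self.queue = queue.Queue(maxsize=maxsize)
        self.dropped = 0
        self.sock = socket.create_connection((host, port), timeout=5)
        threading.Thread(target=self._sender, daemon=True).start()

    def emit(self, message: str) -> None:
        try:
            # The request path only enqueues; it never waits on the network.
            self.queue.put_nowait(message)
        except queue.Full:
            # Back-pressure is absorbed here instead of blocking callers.
            self.dropped += 1

    def _sender(self) -> None:
        while True:
            message = self.queue.get()
            try:
                self.sock.sendall(message.encode() + b"\n")
            except OSError:
                # Connection problems no longer cascade to the callers.
                self.dropped += 1
```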
This comprehensive understanding of the problem and its root cause has led to further adjustments to the cloud infrastructure, supported by a software patch, ensuring that this specific problem cannot recur.