We experienced an issue with DRACOON Cloud on 2025-03-31 from around 07:30 to 08:15. Our team has worked diligently to identify the root cause and implement a resolution. In this post-mortem, we want to share the details of what happened, why it happened, what we did to resolve it, and what we will do to prevent similar incidents in the future.
What happened?
DRACOON Cloud experienced degraded performance during the early usage hours of the day, which affected user access and normal operation.
Why did this happen?
Application containers reached their memory limits during high-traffic periods. The container orchestration system restarted them automatically, and the service was interrupted while it cycled through the unhealthy instances. The memory limits had been set too low and had not been updated to account for certain traffic spikes.
What did we do?
Our engineering team quickly identified the container restart pattern through application logs and monitoring dashboards. We immediately increased the memory limits for affected services and scaled up the number of container replicas to distribute the load.
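To illustrate why both measures help, here is a toy model in Python. It is a sketch under stated assumptions, not the actual production tooling: the function names and all numbers (base memory, memory per request, request counts, limits) are hypothetical, and it assumes load is spread evenly across replicas.

```python
def replica_memory_mib(base_mib, per_request_mib, concurrent_requests, replicas):
    """Estimate memory per replica when load is spread evenly (toy model)."""
    # Ceiling division: the busiest replica handles the rounded-up share.
    requests_per_replica = -(-concurrent_requests // replicas)
    return base_mib + per_request_mib * requests_per_replica


def exceeds_limit(usage_mib, limit_mib):
    """A container is restarted once its usage goes above its limit."""
    return usage_mib > limit_mib


# Hypothetical traffic spike: 800 concurrent requests on 4 replicas.
spike = replica_memory_mib(base_mib=200, per_request_mib=2,
                           concurrent_requests=800, replicas=4)

print(exceeds_limit(spike, limit_mib=512))   # True: old limit, restart
print(exceeds_limit(spike, limit_mib=768))   # False: raised limit

# Same spike spread over 8 replicas instead of 4.
scaled = replica_memory_mib(base_mib=200, per_request_mib=2,
                            concurrent_requests=800, replicas=8)
print(exceeds_limit(scaled, limit_mib=512))  # False: more replicas
```

In this model, raising the limit and adding replicas each keep a replica below its limit on their own. Applying both leaves additional headroom.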
What can we do to improve?
We will improve our monitoring, update memory limits based on actual usage patterns, and create automated scaling policies that add resources proactively, before containers reach their limits.
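The two planned measures, usage-based limits and proactive scaling, can be sketched in Python as follows. This is a minimal illustration, not a description of the policies that will actually be deployed: the function names, the 30 percent headroom, the 70 percent target utilization, and the sample values are all assumptions. The scaling function uses a simple proportional rule and only ever scales out.

```python
import math


def recommend_limit_mib(samples_mib, headroom_percent=30):
    """Suggest a memory limit: observed peak plus a safety margin."""
    if not samples_mib:
        raise ValueError("need at least one usage sample")
    peak = max(samples_mib)
    return math.ceil(peak * (100 + headroom_percent) / 100)


def desired_replicas(current_replicas, avg_usage_mib, limit_mib,
                     target_percent=70):
    """Scale out so average usage falls back to the target share of the limit.

    Proportional rule: replicas grow with the ratio of actual to target
    utilization. This sketch never scales in.
    """
    needed = math.ceil(
        current_replicas * avg_usage_mib * 100 / (limit_mib * target_percent)
    )
    return max(current_replicas, needed)


# Hypothetical usage samples (MiB) that include a traffic spike.
samples = [300, 520, 540, 310]
print(recommend_limit_mib(samples))                                   # 702

# Average usage at about 90% of a 512 MiB limit: add replicas early.
print(desired_replicas(4, avg_usage_mib=460, limit_mib=512))          # 6

# Average usage well below the target: keep the current replica count.
print(desired_replicas(4, avg_usage_mib=300, limit_mib=512))          # 4
```

Because the target utilization is below 100 percent, the policy adds replicas while containers still have memory to spare, rather than after they have been restarted.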
We apologize for any inconvenience this incident may have caused. We are committed to ensuring the stability and reliability of our services and will continue to take proactive measures to prevent similar incidents from happening in the future.
If you have any questions or concerns, please don't hesitate to reach out to our support team for assistance.