Yesterday we had a serious technical issue. After being unstable for an hour, the Echoz.ro platform crashed and remained unavailable for a few hours.
This happened during an unusual (for us) traffic spike caused by a marketing article published on a well-known Romanian blog. When the crash happened, the publisher was kind enough to unpublish the article, which allowed us to handle the issue without the extra burden of a constant stream of people complaining.
I wrote this article to explain why the problems appeared and to sincerely apologize. I know we missed a chance to make a good first impression on a lot of people who had never heard of us.
We are bootstrapped, funded from personal income, so we don’t have a lot of resources available. We know the platform has some bugs. We consider them unimportant because they don’t affect the main functionality of the platform, and we prefer to direct the few resources we have toward achieving healthy business growth. However, yesterday we were caught unprepared. We didn’t give enough attention to the most important thing of all, the thing that makes all others irrelevant: the infrastructure. As I explain in the technical details below, this lack of attention to the infrastructure came back to bite us in the ass.
I am sorry and I apologize for the problems we may have caused.
However, I was very happy to receive an email notification, right in the middle of the downtime, in which a customer company let us know they had just hired someone through our solution. I immediately let the affiliate know they had won a 500 euro commission on this hire. In the middle of the chaos, this put a big smile on my face. Given the circumstances, I think we did a good job, and we will continue to do so.
Yesterday the platform was hosted on AWS (Amazon Web Services), on a small server instance. Actually, it was hosted on two small server instances, in order to provide redundancy (if one of them fails, the user is redirected to the other one without even noticing) and load balancing (if a lot of users use the platform at the same time, some of them are sent to one server instance and some to the other).
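To make the redundancy and load-balancing idea concrete, here is a minimal sketch in Python. It is only a toy model and not how the AWS load balancer is implemented: the class `RoundRobinBalancer` and the instance names `instance-a` and `instance-b` are made up for illustration.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Toy load balancer: rotates over instances and skips unhealthy ones."""

    def __init__(self, instances):
        self.instances = list(instances)
        self.healthy = set(self.instances)
        self._ring = cycle(self.instances)

    def mark_down(self, instance):
        # Redundancy: a failed instance simply stops receiving traffic.
        self.healthy.discard(instance)

    def mark_up(self, instance):
        if instance in self.instances:
            self.healthy.add(instance)

    def route(self):
        # Load balancing: try each instance at most once per request,
        # in rotation, and return the first healthy one.
        for _ in range(len(self.instances)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy instances available")


# Hypothetical instance names, for illustration only.
balancer = RoundRobinBalancer(["instance-a", "instance-b"])
both_up = [balancer.route() for _ in range(4)]   # traffic alternates
balancer.mark_down("instance-a")
one_down = [balancer.route() for _ in range(3)]  # all traffic goes to instance-b
```

With both instances healthy the requests alternate between them. Once one is marked down, every request goes to the survivor, which is exactly the situation where that single small instance has to carry the whole load.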
This setup is also auto-scalable. What this means is that when a spike in traffic appears, new instances are created to serve the extra requests. We (wrongly) assumed this meant we would be able to cope with traffic spikes. However, it seems that the way AWS handles this is to create new server instances of the same size (power) as the original one. So at peak time our load balancer was balancing between four server instances that did not have enough computing power to handle the requests.
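A rough back-of-the-envelope sketch of why four small instances can still fall short is shown below. Every number in it is a hypothetical placeholder, since we did not measure the real throughput of our instances or the real peak, and it assumes capacity adds up linearly across instances, which real systems only approximate.

```python
def fleet_capacity(instance_count, per_instance_rps):
    """Total requests per second the fleet can serve, assuming linear scaling."""
    return instance_count * per_instance_rps


def instances_needed(peak_rps, per_instance_rps):
    """Smallest number of same-size instances that covers the peak (ceiling division)."""
    return -(-peak_rps // per_instance_rps)


# Hypothetical numbers, for illustration only (not measured values).
SMALL_RPS = 50    # assumed throughput of one small instance
LARGE_RPS = 200   # assumed throughput of one large instance
PEAK_RPS = 600    # assumed peak traffic during the spike

four_small = fleet_capacity(4, SMALL_RPS)             # capacity at peak with four small instances
shortfall = PEAK_RPS - four_small                     # traffic nobody can serve
small_needed = instances_needed(PEAK_RPS, SMALL_RPS)  # small instances required for the peak
large_needed = instances_needed(PEAK_RPS, LARGE_RPS)  # large instances required for the peak
```

Under these assumed numbers, four small instances cover only a third of the peak, and covering all of it would take twelve small instances or three large ones.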
In our opinion, this is useless in the case of heavy, unexpected traffic spikes. What we actually needed was more computing power. We never tested the platform under heavy traffic loads and, as you can see, we came to regret it.
As an immediate measure, we upgraded the server instances to large ones. We are also starting to run stress tests on a “similar to live” server in order to prevent future issues.
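For readers curious what such a stress test can look like, here is a minimal sketch using only the Python standard library. It is not the tooling we actually use: `run_load_test`, `timed_call` and `fake_request` are hypothetical names, and `fake_request` just sleeps for a moment instead of calling a real server.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(send_request):
    """Run one request and return (succeeded, elapsed_seconds)."""
    start = time.perf_counter()
    try:
        send_request()
        ok = True
    except Exception:
        ok = False
    return ok, time.perf_counter() - start


def run_load_test(send_request, total_requests, concurrency):
    """Fire total_requests calls with at most `concurrency` in flight at once."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(
            pool.map(lambda _: timed_call(send_request), range(total_requests))
        )
    latencies = sorted(elapsed for _, elapsed in results)
    failures = sum(1 for ok, _ in results if not ok)
    return {
        "requests": total_requests,
        "failures": failures,
        "median_latency_s": latencies[len(latencies) // 2],
        "max_latency_s": latencies[-1],
    }


def fake_request():
    # Stand-in for an HTTP call to a staging server (assumption: ~10 ms per request).
    time.sleep(0.01)


report = run_load_test(fake_request, total_requests=50, concurrency=10)
```

In a real test, `fake_request` would be replaced by a function that performs an actual HTTP request against the test server (never the live one), and the request count and concurrency would be raised step by step until failures or latency start to climb.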