We’ve been running this setup for over a year now with few problems.
Our scale has grown and we now have quite a lot of accounts using the Ghost instance (about 30 active users in the admin section, and over 100,000 page views per day).
We have our Ghost instance running on an EC2 instance with more than enough CPU (around 15% utilization). We were having issues with memory on the instance, but solved this by setting up a cron job to regularly delete logs.
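The cron job is something along these lines, with the path adjusted to wherever your Ghost install writes its logs (the default content/logs directory is assumed below):

```
# Runs daily at 03:00 and removes Ghost log files older than 7 days.
# Adjust the path to match your install; the default Ghost content/logs
# directory is assumed here.
0 3 * * * find /var/www/ghost/content/logs -type f -name "*.log*" -mtime +7 -delete
```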
However, around 2-3 times per day we need to manually restart the Ghost instance. Pages don’t load and the admin section’s loading spinner spins until we get a timeout. There is nothing immediately obvious in the error logs, the metrics all look fine, and everything works again after running ghost restart.
RAM shouldn’t be affected by deleting log files; that part doesn’t make sense.
That being said, if your one server is being overloaded, my first suggestion would be to set up another instance with a load balancer in front of it. AWS makes that easy to do.
I’m assuming you have a MySQL database configuration. How is that looking? Is the DB server fine as far as load/metrics/etc?
Do you mean disk usage or memory usage? Logs would typically affect disk rather than memory usage. If you meant disk, then it would be useful to know what your memory usage profile looks like (used/free/cached) and whether there is any pattern to that usage leading up to when your site becomes unresponsive.
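Something like the following would capture that picture every few minutes, so you have a record of what the box looked like just before a hang (just a sketch; put the output file anywhere with a bit of free space):

```
# Append a timestamped memory/disk snapshot every 5 minutes via cron,
# so there's a record of the instance state right before it hangs.
*/5 * * * * { date; free -m; df -h /; echo; } >> /var/log/ghost-health.log
```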
Hi Kevin, the Ghost instance just needed restarting again, so I had a look at the disk and memory usage.
We’re not currently monitoring this with the CloudWatch agent, but it might be something to look into.
When I got onto the instance, memory usage was at 10% and disk usage was up at around 83%. 83% is obviously not great, but I would have thought things would still be working at that level.
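If we do set up the CloudWatch agent, I think the minimal config for memory and disk would be something like this (the commands assume Amazon Linux and the standard agent paths, and the instance also needs an IAM role that allows publishing CloudWatch metrics, so worth double-checking against the AWS docs):

```
# Install the agent (Amazon Linux; other distros use their own package manager)
sudo yum install -y amazon-cloudwatch-agent

# Minimal config: just memory and root-disk utilisation
sudo tee /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json > /dev/null <<'EOF'
{
  "metrics": {
    "metrics_collected": {
      "mem":  { "measurement": ["mem_used_percent"] },
      "disk": { "measurement": ["used_percent"], "resources": ["/"] }
    }
  }
}
EOF

# Load the config and start the agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config -m ec2 -s \
  -c file:/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json
```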
So I think the issue is with the caching of the articles. We have CloudFront set up, which should cache the articles for 600 seconds; however, this is not working correctly because Ghost has Cache-Control set to max-age=0.
I’ve overridden the front-end max age by running: ghost config caching.frontend.maxAge 1200
We now seem to be getting a much improved cache hit rate.
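For anyone checking the same thing, the quickest way to confirm what CloudFront and Ghost are actually serving is to look at the response headers for an article (swap in one of your own post URLs):

```
# Inspect the caching headers CloudFront returns for an article.
# The URL is a placeholder - use one of your own posts.
curl -sI https://example.com/some-article/ | grep -iE 'cache-control|^age|x-cache'

# Ghost's default gives roughly:  cache-control: max-age=0 (so CloudFront won't cache)
# With caching.frontend.maxAge set you should see something like:
#   cache-control: public, max-age=1200
#   x-cache: Hit from cloudfront   (on repeat requests within the TTL)
```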
So after saying that, the problem is still occurring.
So far I’ve:
- Increased the disk size from 8 GB to 20 GB. It’s only using about 30% at the moment.
- Fixed the issue with caching. The pages are now being cached correctly by CloudFront.
- Increased the EC2 instance size, which doubled the memory and CPU and improved the network capabilities. The EC2 instance is now only using about 5% of its CPU.
I’ve had a look at the error logs and they are pretty clean, apart from this error:
TimeoutError: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?
Although this error could be to do with us rebooting the server, since the connection between the RDS and EC2 instances seems completely fine at almost all other times, and most ideas online hint at issues with network configuration.
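For reference, that error means Knex timed out waiting for a free MySQL connection from its pool (the Knex default pool is fairly small, min 2 / max 10). If it keeps showing up outside of reboots, one thing worth trying is raising the pool ceiling. The sketch below assumes Ghost passes database.pool.* straight through to Knex, which is worth verifying against config.production.json first:

```
# Assumes Ghost forwards database.pool.* to Knex (verify against
# config.production.json) - raises the pool ceiling from the Knex
# default of max 10 connections.
ghost config database.pool.min 2
ghost config database.pool.max 30
ghost restart

# Also worth confirming RDS allows enough connections for the bigger pool;
# endpoint and user below are placeholders.
mysql -h <rds-endpoint> -u <user> -p \
  -e "SHOW VARIABLES LIKE 'max_connections';"
```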
I have a similar issue with my single EC2 instance, which hosts a very small set of sites that receive fewer than 5 users a day combined. I haven’t been able to identify exactly what is causing it, and eventually the EC2 instance becomes completely unavailable, requiring a full forced restart.
Since I can’t always detect when this starts happening, and only see the impact afterwards when I can’t even SSH into the instance, I’ve built a very small Lambda that detects “unreachability” and simply has permission to force a reboot of that specific instance. This has greatly improved reliability, since the site just restarts once it can’t be reached any more. Next, though, I want to dig into the instance itself to see why it becomes unavailable in the first place, as it shouldn’t need to be restarted this often, like you’ve been experiencing as well.
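The core of it is just a reachability check plus an ec2:RebootInstances call. Expressed as a shell sketch of the same logic (the URL and instance ID are placeholders, and the role only needs permission to reboot that one instance):

```
# Same idea as the Lambda, as a shell sketch: if the site doesn't answer
# within 10 seconds, force a reboot of the instance.
# Both the URL and the instance ID are placeholders.
if ! curl -sf --max-time 10 https://example.com/ > /dev/null; then
  aws ec2 reboot-instances --instance-ids i-0123456789abcdef0
fi
```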