Ghost server needs regular restarts

Currently using Ghost version 3.8.1.

We’ve been running it for over a year now with few problems.

Our scale has grown, and we now have quite a lot of accounts using the Ghost instance (about 30 active users in the admin section, and 100,000+ page views per day).

We have our Ghost instance running on an EC2 instance with more than enough CPU (around 15% utilization). We were having issues with memory on the instance but solved this by setting up a cron job to regularly delete logs.

However, around 2-3 times per day, we need to manually restart the ghost instance. Pages don’t load and the admin section loading spinner spins until we get a timeout. There is nothing immediately obvious in the error logs, metrics all look fine, and everything works fine by running ghost restart.

Not really sure where to go from here, any ideas?

RAM shouldn’t be affected by deleting file logs, that part doesn’t make sense.

That being said, if your one server is being overloaded, my first suggestion would be to set up another instance with a load balancer in front of it. AWS has an easy way of doing so.

I’m assuming you have a MySQL database configuration. How is that looking? Is the DB server fine as far as load/metrics/etc?

This part is definitely incorrect - Ghost is not designed to be clustered

Sorry, I meant memory on the EC2 instance.

The EC2 instance is not being overloaded as far as I can see, all the metrics are fine.

And yeah, we’re using RDS with MySQL and the metrics for that are also fine.

That’s interesting. I’m surprised that’s not the case, so I stand corrected and retract my statement.

There’s another thread that’s worth linking here for reference; it seems to go into more detail.

I’d love to see Redis or an external caching layer that would allow Ghost to scale horizontally if need be.

I suppose adding a CDN would take load off the server, if you aren’t using one already.

Do you mean disk usage or memory usage? Logs would typically affect disk rather than memory usage. If you meant disk then it would be useful to know what your memory usage profile looks like - used/free/cached and if there is any pattern to that usage leading up to when your site becomes unresponsive.
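To capture that used/free/cached pattern over time, something like the sketch below could be run from cron every minute with its output appended to a file. It assumes a Linux instance where `/proc/meminfo` is available; the function names and the output keys are made up for illustration.

```python
import time


def parse_meminfo(text):
    """Parse the text of /proc/meminfo into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_profile(info):
    """Reduce the parsed values to a used/free/cached summary (all in kB)."""
    total = info["MemTotal"]
    free = info["MemFree"]
    # Buffers and page cache are reclaimable, so they are reported separately.
    cached = info.get("Cached", 0) + info.get("Buffers", 0)
    used = total - free - cached
    return {"total_kb": total, "used_kb": used, "free_kb": free, "cached_kb": cached}


def sample_once(path="/proc/meminfo"):
    """Read the file once and return a timestamped summary."""
    with open(path) as f:
        profile = memory_profile(parse_meminfo(f.read()))
    profile["time"] = time.strftime("%Y-%m-%dT%H:%M:%S")
    return profile
```

Calling `print(sample_once())` from a small script on a schedule gives a log you can line up against the times the site stopped responding.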

Hi Kevin, so the Ghost instance just needed restarting so I had a look at the disk and memory usage.

We’re not currently monitoring this using the CloudWatch agent, but it might be something to look into.

When I got onto the instance, memory usage was 10% and disk usage was up at around 83%. 83% is obviously not great, but I would have thought things would still be working at that level.

So I think the issue is with the caching of the articles. We have CloudFront set up, which should cache the articles for 600 seconds; however, this is not working correctly, since Ghost sets Cache-Control to max-age=0.

I’ve overwritten the front end max age by using: ghost config caching.frontend.maxAge 1200

Now we seem to be getting a much improved cache hit rate.

Having said that, the problem is still occurring.
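For anyone wanting to confirm what the origin and CloudFront are actually returning, a rough check along these lines may help. It is only a sketch: the helper names are invented, and it relies on the `X-Cache` response header that CloudFront normally adds (for example "Hit from cloudfront").

```python
import urllib.request


def parse_max_age(cache_control):
    """Return max-age in seconds from a Cache-Control value, or None if absent."""
    if not cache_control:
        return None
    for directive in cache_control.split(","):
        name, _, value = directive.strip().partition("=")
        if name.lower() == "max-age":
            try:
                return int(value.strip().strip('"'))
            except ValueError:
                return None
    return None


def fetch_cache_headers(url, timeout=10):
    """Send a HEAD request and return the headers relevant to CDN caching."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        cache_control = resp.headers.get("Cache-Control")
        return {
            "cache_control": cache_control,
            "max_age": parse_max_age(cache_control),
            "x_cache": resp.headers.get("X-Cache"),
        }
```

Requesting the same article URL twice with `fetch_cache_headers` should show a non-zero `max_age`, and usually a hit on the second request if caching is working.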

So far I’ve:

  • Increased the disk size from 8 GB to 20 GB. It’s only using about 30% at the moment.
  • Fixed the issue with caching. The pages are now caching correctly with CloudFront.
  • Increased the EC2 instance size, which doubles memory and CPU and improves network capabilities. The EC2 instance is now only using about 5% of CPU.

I’ve had a look at the error logs and they are pretty clean, apart from this error:

TimeoutError: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?

Although, this error could be to do with us rebooting the server, since the connection between RDS and EC2 instance seems completely fine almost all other times and most ideas online hint at issues with network configuration.
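For context on what that message means in general: a connection pool has a fixed number of slots, and the error is raised when a caller waits longer than the acquire timeout without one being returned. The toy model below only illustrates that mechanism. It is not Knex's or Ghost's code, and whether this is the actual cause here is not established.

```python
import queue


class TinyPool:
    """Toy fixed-size connection pool, for illustration only."""

    def __init__(self, size):
        self._free = queue.Queue()
        for i in range(size):
            self._free.put(f"conn-{i}")

    def acquire(self, timeout):
        """Wait up to `timeout` seconds for a free connection."""
        try:
            return self._free.get(timeout=timeout)
        except queue.Empty:
            raise TimeoutError(
                "Timeout acquiring a connection. The pool is probably full."
            ) from None

    def release(self, conn):
        """Return a connection so another caller can use it."""
        self._free.put(conn)


def demo():
    pool = TinyPool(size=2)
    # Two slow requests hold both connections and never give them back.
    held = [pool.acquire(timeout=0.1), pool.acquire(timeout=0.1)]
    try:
        pool.acquire(timeout=0.1)  # a third request has nothing to take
        exhausted = False
    except TimeoutError:
        exhausted = True
    # Once one connection is released, new requests succeed again.
    pool.release(held.pop())
    recovered = pool.acquire(timeout=0.1)
    return exhausted, recovered
```

In this model, anything that keeps connections checked out (slow queries, hung requests) starves every later request until the pool is reset, which is at least consistent with a restart clearing the symptom.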

Has anybody else got any other ideas?

Any ideas? Still getting problems 2x or 3x per day

I have a similar issue with my single EC2 instance used for a very small set of sites that receive less than 5 users a day combined. Haven’t been able to identify exactly what is causing the issue and eventually the EC2 becomes completely unavailable requiring a full forced restart.

Since I can’t always detect when this starts happening, and only see the impact afterwards when I can’t even SSH into the instance, I’ve built a very small Lambda that detects “unreachability” and has permission to force a reboot of that specific instance. Restarting as soon as the site can’t be reached has greatly improved its reliability. Next, though, I want to dig into the instance itself to see why it becomes unavailable in the first place, as it shouldn’t need to be restarted as often as you have experienced either.
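A minimal sketch of that kind of watchdog is below. It is not my exact function: the environment variable names `SITE_URL` and `INSTANCE_ID` are placeholders, the retry count is arbitrary, and it assumes the Lambda role is allowed to call `ec2:RebootInstances` on the instance.

```python
import os
import urllib.error
import urllib.request


def is_reachable(url, timeout=10):
    """Return True if the URL answers with a 2xx/3xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors, refused connections, DNS failures and timeouts.
        return False


def check_and_reboot(url, instance_id, ec2_client, probe=is_reachable, attempts=3):
    """Reboot the instance only if every probe attempt fails."""
    for _ in range(attempts):
        if probe(url):
            return "healthy"
    ec2_client.reboot_instances(InstanceIds=[instance_id])
    return "rebooted"


def lambda_handler(event, context):
    # boto3 is provided by the AWS Lambda Python runtime.
    import boto3

    return check_and_reboot(
        os.environ["SITE_URL"],      # placeholder variable name
        os.environ["INSTANCE_ID"],   # placeholder variable name
        boto3.client("ec2"),
    )
```

Triggering the handler on a schedule (every few minutes, say) keeps the check cheap, and requiring several failed probes avoids rebooting on a single slow response.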

Fixed issue with caching. Now the pages are caching

What did you change, or did the settings you mentioned earlier work to get the CDN caching efficiently enough?