Ghost sites failing to load following an AWS RDS maintenance/backup window


#1

Back with the issue mentioned here a while ago:

3 out of my 5 Ghost sites crashed again this morning (most probably after the Amazon RDS backup/maintenance window). Checking the logs, I found the error below:

[2018-07-13 07:10:22] INFO "HEAD /" 200 59999ms
[2018-07-13 07:10:22] ERROR

MESSAGE: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?

TimeoutError: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?
at /srv/www/plimbalandala.ro/versions/1.24.8/node_modules/knex/lib/client.js:340:13
at tryCatcher (/srv/www/plimbalandala.ro/versions/1.24.8/node_modules/bluebird/js/release/util.js:16:23)
at /srv/www/plimbalandala.ro/versions/1.24.8/node_modules/bluebird/js/release/catch_filter.js:17:41
at tryCatcher (/srv/www/plimbalandala.ro/versions/1.24.8/node_modules/bluebird/js/release/util.js:16:23)
at Promise._settlePromiseFromHandler (/srv/www/plimbalandala.ro/versions/1.24.8/node_modules/bluebird/js/release/promise.js:512:31)
at Promise._settlePromise (/srv/www/plimbalandala.ro/versions/1.24.8/node_modules/bluebird/js/release/promise.js:569:18)
at Promise._settlePromise0 (/srv/www/plimbalandala.ro/versions/1.24.8/node_modules/bluebird/js/release/promise.js:614:10)
at Promise._settlePromises (/srv/www/plimbalandala.ro/versions/1.24.8/node_modules/bluebird/js/release/promise.js:689:18)
at Async._drainQueue (/srv/www/plimbalandala.ro/versions/1.24.8/node_modules/bluebird/js/release/async.js:133:16)
at Async._drainQueues (/srv/www/plimbalandala.ro/versions/1.24.8/node_modules/bluebird/js/release/async.js:143:10)
at Immediate.Async.drainQueues (/srv/www/plimbalandala.ro/versions/1.24.8/node_modules/bluebird/js/release/async.js:17:14)
at runCallback (timers.js:672:20)
at tryOnImmediate (timers.js:645:5)
at processImmediate [as _immediateCallback] (timers.js:617:5)

The weird thing is that the crash only happens to 3 out of the 5 sites, and those are the first 3 in order of deployment. The error above appeared in the most recently deployed Ghost site as well (a blog migrated from WordPress), but that site stayed up.

Any ideas what it could be? I know it’s most probably a very weird and hard-to-reproduce case, but maybe the error above provides some hints I’m missing.

If you’re looking for some help, it’s important to provide as much context as possible so that people are able to assist you. Try to always mention:

  • What’s your URL? This is the easiest way for others to debug your issue
  • What version of Ghost are you using? 1.24.8
  • What configuration? AWS ec2 Ubuntu 16.04 LTS + Nginx + RDS MariaDB
  • What browser? Chrome
  • What errors or information do you see in the console? Log above
  • What steps could someone else take to reproduce the issue you’re having? Hard to reproduce. It only happens on some days, and only to some of the Ghost sites.

#2

From our experience, it seems like this is the error that we get from Knex whenever it is unable to connect to a database for any reason. It’s not a particularly useful error.

I can also echo your sentiment that sometimes it recovers, and sometimes it doesn’t.

So I can tell you that what you’re experiencing is not unusual for the scenario you’re in; what’s unusual is that the AWS RDS DB becomes unresponsive or unavailable so often.

The place to start with resolving this would be with your pool config: https://knexjs.org/#Installation-pooling
And timeout: https://knexjs.org/#Installation-acquireConnectionTimeout

The issues on the knex repo (https://github.com/tgriesser/knex) are full of interesting examples of the error you posted and more information on how to configure the DB client. You may also be able to get more targeted help by raising an issue there, but feel free to share the link and the outcome here. I’m not trying to palm you off; we just don’t have the depth of knowledge that the people there do :slight_smile:

Ghost currently ships knex 0.14.6, whereas the latest is 0.15.1, so you may also find that there are relevant fixes and we need to upgrade.

One thing I can tell you we’ve seen help improve stability, is setting the pool min to 0 instead of 1.
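For reference, Ghost passes the database block of config.production.json through to knex, so the pool settings nest alongside the connection details. A sketch with placeholder values (check the knex docs linked above for the full option set):

```json
"database": {
    "client": "mysql",
    "pool": {
        "min": 0,
        "max": 10
    },
    "connection": {
        "host": "your-rds-endpoint",
        "user": "your-db-user",
        "password": "your-db-pass",
        "database": "your-db"
    }
}
```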

Hopefully this might get you on the path of finding a fix - I’m particularly interested in ensuring that Ghost always recovers.


#3

Thanks @Hannah for the hints! :slight_smile: I will check them out and report back here.

One thing I could think of is that RDS being on a t2.micro instance type might somehow use up its CPU credits, and that’s why some database calls get refused (maybe the backup/maintenance processes are CPU-intensive).

I also get a spike in index.php time in NewRelic, but I’m unsure if it’s Ghost’s index.php or WordPress’s index.php. :slight_smile:

Digging into NewRelic a bit I found these DB requests generating issues:

Anyway, I will check the knex config and see if tweaking that to 0 as you recommended makes this go away.


#4

Hey Daniel,

Hope you’re well! Nice site I like the post on migrating to Ghost.

I was looking forward to seeing some of your Bulgarian Sea Shore and Ropotamo River adventures but I think the images aren’t quite working (not sure if you’re aware).

Just read your original post.

Can I ask, have you disabled the backups temporarily to identify if it is daily backups causing the issue?

Are you able to access the DB cloudwatch monitoring graphs?

Something to note with RDS is that IOPS are limited based on the DB’s disk size, so if there’s a period where an IOPS-related graph seems to hit a ceiling / flatline, it’s possible that it’s causing enough IO latency for queries to queue up in Ghost and eventually time out. You can increase the available IOPS by increasing the DB disk size.
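As a rough illustration of that ceiling (assuming gp2 General Purpose SSD storage, whose baseline at the time was about 3 IOPS per GB with a 100 IOPS floor — treat the exact numbers as an assumption to verify against the AWS docs):

```javascript
// Rough gp2 baseline IOPS estimate.
// Assumption: baseline of 3 IOPS per GB with a minimum of 100 IOPS,
// as documented for gp2 volumes at the time.
function gp2BaselineIops(diskGb) {
  return Math.max(100, 3 * diskGb);
}

console.log(gp2BaselineIops(20));  // → 100 (small disks get the 100 IOPS floor)
console.log(gp2BaselineIops(100)); // → 300 (the 100GB default gives ~300 IOPS)
```

So doubling the disk roughly doubles the sustained IOPS available, even if you don’t need the space.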

It seems that the Ghost service itself is still running but it’s slow enough for connections from nginx to timeout.

The index.php script will almost certainly be WordPress (written in PHP), as Ghost is a Node.js application. The reason WordPress (and PHP in general) is okay is that it creates and closes DB connections on the fly. Node.js / Knex maintains a connection "pool" and queues up requests, which are serviced as connections become available.

Hannah’s acquireConnectionTimeout suggestion should help any queued up queries to fail faster. I’d recommend a value of 2000 to 5000 (2 - 5 seconds). The default is 60000 (1 minute).
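The pool-plus-timeout behaviour can be sketched with a toy model (this is just an illustration of the queueing concept, not knex’s actual implementation):

```javascript
// Simplified model of a connection pool with an acquire timeout:
// requests for a connection wait in a queue, and each waiter races
// against a timer. Illustration only, not knex's real code.
class TinyPool {
  constructor(size, acquireTimeoutMs) {
    this.free = Array.from({ length: size }, (_, i) => ({ id: i }));
    this.waiters = [];
    this.acquireTimeoutMs = acquireTimeoutMs;
  }

  acquire() {
    if (this.free.length > 0) {
      return Promise.resolve(this.free.pop());
    }
    // Pool exhausted: queue up and fail after the timeout, which is
    // roughly what "Timeout acquiring a connection" means.
    return new Promise((resolve, reject) => {
      const timer = setTimeout(() => {
        this.waiters = this.waiters.filter((w) => w !== waiter);
        reject(new Error('Timeout acquiring a connection. The pool is probably full.'));
      }, this.acquireTimeoutMs);
      const waiter = { resolve, timer };
      this.waiters.push(waiter);
    });
  }

  release(conn) {
    const waiter = this.waiters.shift();
    if (waiter) {
      clearTimeout(waiter.timer);
      waiter.resolve(conn);
    } else {
      this.free.push(conn);
    }
  }
}

// Demo: one connection, held and never released; the second acquire
// times out after 100ms instead of hanging forever.
const pool = new TinyPool(1, 100);
pool.acquire().then(() => {
  pool.acquire().catch((err) => {
    console.log(err.message);
    // → Timeout acquiring a connection. The pool is probably full.
  });
});
```

With a short acquire timeout, a stalled DB makes queued requests fail fast instead of piling up for a minute each.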

Something I noticed with your site is that although you have it behind Cloudflare, it isn’t caching requests (is development mode on?). You might benefit from enabling caching of the site for an hour or so; this will reduce the number of requests that hit your servers and reduce DB load. Enabling “Always Online” will serve stale content if your backend server is offline. Important: remember to bypass caching for /ghost/* URLs as documented here.

Feel free to ask if any of the above is unfamiliar or reads like an alien language (I’m assuming you’re quite a technical person; it’s totally okay if you’re not).

I’m by no means a Ghost pro, but I think the above could help you. Let me know how you get on!

Tim.


#5

Check your Google Search Console, maybe you’ve got a group of cute spiders :wink:
Also, could you please provide a screenshot of the RDS connection pool?


#6

@timothy thanks for the detailed message :slight_smile:!

I did not have time to fix the content of the blog after importing it, but I know I need to go through it and fix the images at least. This would have been much easier to do if Ghost had a media storage management tool. :wink:

When I created the RDS instance I went with their recommendation and chose 100GB storage size, so there should be, in theory, enough space and IOPS (see the graph below, it doesn’t seem to spike out of the ordinary).

I did not disable the backups yet to check this out as the behavior seems to be random (sometimes it happens 2-3 days in a row, sometimes once a month).

Yeah, I should have figured out on my own that index.php was WP :slight_smile:. Not sure what was on my mind when I wrote that (probably the exclusive use of WordPress in the past and the habit of all sites having an index.php). And to add to this :slight_smile:, NewRelic is actually monitoring the PHP application LOL.

So I’m fiddling with installing the Node.js NewRelic agent now; how do I add the module to the Ghost app (I assume in config.production.json)?

Make this the first line of your app’s startup script:
require('newrelic');

Not sure why the website is not cached (development mode is not on in Cloudflare), but I checked analytics and only a few of the requests, i.e. about 10%, are cached (traffic on the website is low, which might be one reason for it; images not loading might be another).

I am bypassing caching for /ghost/ and /wp-admin/ :slight_smile:.

I am kind of a technical person in the sense that I do my own server stuff, but I’m not a pro at it (I do run my own basic AWS infrastructure) and I’m definitely not a developer, though I can do some HTML/CSS and other minor things around it.

Modifications to the connection timeout and pooling seem to be needed in the config.production.json database settings, but I’m unsure about it :slight_smile:. I tried to add the syntax, but some other details seem to be missing, as vim is highlighting some errors in it.

Later edit: I fixed those two posts with the Bulgarian trip, so you can check out the images now as well. :slight_smile:

@andy1247008998 see below the screenshot from RDS monitoring:

Not sure how to check for bots in Search Console :slight_smile: (I looked around but didn’t find any spikes there either).


#7

Hey @dsecareanu

Thanks for the graphs! Also, thanks for the photos :slight_smile: Ropotamo / Primorsko look like magnificent places to visit. I’ve never been to Bulgaria; it’s on my list of places to visit for sure now.

Looking at the "Freeable Memory" graph, the spike on 07/11 suggests that the RDS instance was rebooted. This happens occasionally during the maintenance window for OS updates. Have you explicitly set a maintenance day / time on your RDS instance?

It is probably easiest to add the acquireConnectionTimeout globally to the database configuration in config.production.json like:

...
    "database": {
        "client": "mysql",
        "acquireConnectionTimeout": 2000,
        "connection": {
            "host"     : "your-rds-endpoint",
            "user"     : "your-db-user",
            "password" : "your-db-pass",
            "database" : "your-db"
        }
    },
...

The above means that queries or transactions that would otherwise take 60 seconds to time out (INFO "HEAD /" 200 59999ms) will time out in 2 seconds, which might help the sites recover after RDS becomes available again.

Something else that would help reduce the amount of time RDS is unavailable is configuring it as a Multi-AZ deployment. Backups are taken from the standby, and DB traffic is redirected to the standby while the primary undergoes maintenance. This does come at a financial cost.


#8

Thanks for the updates @timothy! I’ve made the config changes and will check if the behavior shows up again.

Now the config looks like this (I didn’t manage to incorporate the pooling variables; they gave me a config error):

  "database": {
    "client": "mysql",
    "acquireConnectionTimeout": 2000,
    "connection": {
      "host": "db-host",
      "user": "db-user",
      "password": "db-pass",
      "database": "db-name"
    }
  }

But yes, the RDS maintenance window is set up on Sunday nights as far as I remember (and the backup window is every night around 5am).

If you want to travel to Central and Eastern Europe, there are a lot of beautiful places to go. I have a few pictures from Zakynthos, Greece, here: https://www.secareanu.com/back-from-holidays-zakynthos-greece/ (from almost 10 years ago, heh).

Other beautiful countries to visit: Romania (where I’m from), Bulgaria, Albania, Montenegro, Croatia, Serbia, Moldova, etc.


#9

So far so good, no more disconnected Ghost sites (well, it’s only been a week since the above-mentioned changes).

If this is an issue (albeit a very rare one), maybe it makes sense to include more detailed documentation somewhere about the options available for config.production.json, maybe even as comments in the config file itself.


#10

Spoke too soon :slight_smile: it happened again:


#11

That’s unfortunate… Does the 500 error persist until Ghost is restarted, or does everything return to normal after the MySQL backup has finished? Are there any interesting logs from one of these incidents? (Probably similar to the logs above, but with a 2000ms timeout instead of 60000ms.)

I’ve tried to replicate the issue locally, but stopping the DB causes immediate errors, and everything recovers after starting the DB again. It seems that in your case it’s IO stalls caused by snapshots, so you can connect to RDS; it just doesn’t respond. I’m still working on a way to replicate that kind of behaviour.


#12

The error persists until Ghost is restarted, and these days I’m getting errors from my WordPress sites as well. The error is the same with the reduced timeout, or so it seems (nothing shows up in the nginx error log):

[2018-07-31 08:57:58] ERROR "GET /" 500 6026ms

NAME: InternalServerError
MESSAGE: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?

level: normal

InternalServerError: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?
    at new GhostError (/srv/www/secareanu.com/versions/1.25.2/core/server/lib/common/errors.js:9:26)
    at prepareError (/srv/www/secareanu.com/versions/1.25.2/core/server/web/middleware/error-handler.js:42:19)
    at Layer.handle_error (/srv/www/secareanu.com/versions/1.25.2/node_modules/express/lib/router/layer.js:71:5)
    at trim_prefix (/srv/www/secareanu.com/versions/1.25.2/node_modules/express/lib/router/index.js:315:13)
    at /srv/www/secareanu.com/versions/1.25.2/node_modules/express/lib/router/index.js:284:7
    at Function.process_params (/srv/www/secareanu.com/versions/1.25.2/node_modules/express/lib/router/index.js:335:12)
    at next (/srv/www/secareanu.com/versions/1.25.2/node_modules/express/lib/router/index.js:275:10)
    at Layer.handle_error (/srv/www/secareanu.com/versions/1.25.2/node_modules/express/lib/router/layer.js:67:12)
    at trim_prefix (/srv/www/secareanu.com/versions/1.25.2/node_modules/express/lib/router/index.js:315:13)
    at /srv/www/secareanu.com/versions/1.25.2/node_modules/express/lib/router/index.js:284:7
    at Function.process_params (/srv/www/secareanu.com/versions/1.25.2/node_modules/express/lib/router/index.js:335:12)
    at Immediate.next (/srv/www/secareanu.com/versions/1.25.2/node_modules/express/lib/router/index.js:275:10)
    at Immediate.<anonymous> (/srv/www/secareanu.com/versions/1.25.2/node_modules/express/lib/router/index.js:635:15)
    at runCallback (timers.js:674:20)
    at tryOnImmediate (timers.js:645:5)
    at processImmediate [as _immediateCallback] (timers.js:617:5)

I don’t think there’s an easy way to replicate this, and the setup is not really a recommended one (it’s considered OK for development purposes but not for production; I use it to minimize RDS costs). What you could try would be to move several low-traffic websites onto a t2.micro or t2.nano instance connected to a t2.micro or t2.nano RDS instance with daily maintenance. I run 10+ WordPress sites and 5 Ghost sites, all of them with minimal traffic, so it shouldn’t be about server load. The server setup is nginx + NewRelic for monitoring.


#13

Found something else in the logs:

Debug Information:
    OS: Ubuntu, v16.04
    Node Version: v6.14.3
    Ghost-CLI Version: 1.8.1
    Environment: production
    Command: 'ghost restart'
Message: Ghost did not start.
Suggestion: journalctl -u ghost_www-secareanu-com -n 50
Stack: Error: Ghost did not start.
    at connectToGhostSocket.then.catch (/usr/lib/node_modules/ghost-cli/lib/utils/port-polling.js:94:32)
    at process._tickCallback (internal/process/next_tick.js:109:7)

Original Error Message:
Message: connect ECONNREFUSED 127.0.0.1:2370
Stack: Error: connect ECONNREFUSED 127.0.0.1:2370
    at Object.exports._errnoException (util.js:1020:11)
    at exports._exceptionWithHostPort (util.js:1043:20)
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1099:14)

#14

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.