Ghost goes down with high read/write operations (Prev. Newsletter Issue)

UPDATE

Ghost Version: 4.41.3
Theme: Reditory
No. of Members: ~280k

Deployment on: AWS EC2 t3.medium instance
Database: AWS RDS db.t3.small gp2 instance

Problem:

Upon more digging, we discovered that Ghost actually goes down on three occasions:

  1. A lot of API requests are made (see "Ghost goes down with many API requests")
  2. Scrolling down the members page in the Ghost admin portal (the initial load for this is also very slow)
  3. Publishing posts with newsletter sending (this does not always cause Ghost to go down; mostly it just does not work properly, see the previous post below)

What we realized is that Ghost goes down when it is performing a high volume of reads/writes to the database. Apache itself does not go down; it just returns a 504 error when we try to access the Ghost blog.

Additionally, we saw this error being thrown a lot in the error logs:

{
  "name":"Log",
  "hostname":"ip-xxx-xx-x-xxx",
  "pid":787,
  "level":50,
  "err":
     {
      "domain":"https://xxx.xx",
      "message":"Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?",
      "stack":"KnexTimeoutError: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?
        at Client_MySQL2.acquireConnection (/var/www/xxx/versions/4.41.3/node_modules/knex/lib/client.js:348:26)
        at runMicrotasks (<anonymous>)
        at runNextTicks (node:internal/process/task_queues:61:5)
        at listOnTimeout (node:internal/timers:528:9)
        at processTimers (node:internal/timers:502:7)"
     },
  "msg":"Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?",
  "time":"2022-12-30T13:55:57.491Z",
  "v":0
}
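
To see whether these errors line up with the three situations above, it can help to count them per minute. The following is only a minimal sketch: it assumes the Ghost log file holds one JSON object per line (the entry above is shown pretty-printed), and the function name and sample lines are made up for illustration.

```python
import json
from collections import Counter

# Substring taken from the "msg" field of the error shown above.
POOL_TIMEOUT_MARKER = "Timeout acquiring a connection"


def count_pool_timeouts(lines):
    """Count pool-timeout log entries per minute.

    Assumes each element of `lines` is one JSON object with "msg" and
    "time" fields, as in the entry above. Anything else is skipped.
    """
    per_minute = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a single-line JSON object
        if not isinstance(entry, dict):
            continue
        if POOL_TIMEOUT_MARKER in str(entry.get("msg", "")):
            # "2022-12-30T13:55:57.491Z" -> "2022-12-30T13:55"
            per_minute[str(entry.get("time", ""))[:16]] += 1
    return per_minute


# Hypothetical sample lines, shaped like the log entry above.
sample = [
    '{"level":50,"msg":"Knex: Timeout acquiring a connection. The pool is probably full.","time":"2022-12-30T13:55:57.491Z"}',
    '{"level":50,"msg":"Knex: Timeout acquiring a connection. The pool is probably full.","time":"2022-12-30T13:55:59.102Z"}',
    '{"level":30,"msg":"some unrelated entry","time":"2022-12-30T13:56:01.000Z"}',
    'not json at all',
]

for minute, count in sorted(count_pool_timeouts(sample).items()):
    print(minute, count)
```

Spikes that coincide with a newsletter send or a members-page load would support the idea that those operations are what exhausts the pool.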

We’ve already tried increasing the connection pool size from the default, as ChatGPT suggested, but it has not helped:

"pool": {
  "max": 50,
  "min": 10,
  "idleTimeoutMillis": 5000
},
"acquireConnectionTimeout": 10000

Our theory is that the maximum IOPS provided by the RDS instance is the bottleneck for the large volume of reads and writes our Ghost instance makes because of the huge number of members. Is this the issue?
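
For a rough sense of what gp2 storage provides, here is a small sketch of the baseline arithmetic as I understand it from the AWS documentation: 3 IOPS per GiB, with a floor of 100 and a cap of 16,000 per volume, plus the ability to burst to 3,000 IOPS while burst credits last on volumes whose baseline is below that. The storage sizes below are hypothetical, since the allocated size is not stated above, and RDS may stripe larger allocations across several volumes, so treat the numbers as an approximation only.

```python
GP2_IOPS_PER_GIB = 3
GP2_MIN_IOPS = 100
GP2_MAX_IOPS = 16_000
GP2_BURST_IOPS = 3_000


def gp2_baseline_iops(size_gib: int) -> int:
    """Approximate baseline IOPS of a single gp2 volume of the given size."""
    return max(GP2_MIN_IOPS, min(GP2_IOPS_PER_GIB * size_gib, GP2_MAX_IOPS))


def gp2_can_burst(size_gib: int) -> bool:
    """True if the baseline is below the burst ceiling, so bursting applies."""
    return gp2_baseline_iops(size_gib) < GP2_BURST_IOPS


# Hypothetical allocated storage sizes in GiB (not taken from the post).
for size in (20, 100, 500):
    print(size, "GiB ->", gp2_baseline_iops(size), "baseline IOPS,",
          "can burst" if gp2_can_burst(size) else "no burst")
```

If the allocated storage is small, the baseline is only a few hundred IOPS once burst credits run out. The RDS CloudWatch metrics for read/write IOPS and burst balance should show whether that limit is actually being reached.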

If so, what’s the recommended RDS deployment for ghost with this many members?

If this is not the issue, what is, and how can we fix it?


Previous Post

Ghost Version: 4.41.3
Theme: Reditory
Problem: Newsletters not sending properly

The main problem is that newsletters are not sending properly to members. This Ghost blog currently has 200k+ members, but recently only around 10k to 60k emails are being sent when a post is published.

This also shows up as sends/opens missing from the Ghost admin portal (image attached). I was only able to confirm that emails were in fact being sent to just some members by checking the logs in Mailgun directly.

We also cannot upgrade to the newest version because the theme we’re using is not compatible anymore.

Any help would be greatly appreciated. Thank you very much!

I discuss this in more detail in another thread.

In summary, try throttling the request volume at the reverse proxy, which will keep the rest of the system from overloading.
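
To illustrate what throttling means here, below is a minimal token-bucket sketch. It is not a reverse proxy configuration, and in practice you would use the proxy’s own rate-limiting features rather than application code; the class name and parameters are made up for illustration.

```python
import time


class TokenBucket:
    """Allow roughly `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum number of stored tokens
        self.tokens = float(capacity)
        self.clock = clock          # injectable so the bucket can be tested
        self.last = clock()

    def allow(self):
        """Return True if a request may pass, False if it should be rejected."""
        now = self.clock()
        # Refill according to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Usage: 5 requests/second on average, bursts of up to 10.
bucket = TokenBucket(rate=5, capacity=10)
accepted = sum(bucket.allow() for _ in range(25))
print("accepted", accepted, "of 25 back-to-back requests")
```

Requests that are rejected at this layer never reach Ghost or the database, which is the point of throttling in front of the application.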

If your system is falling over when it still has plenty of I/O, CPU, and memory capacity, then you’ve got a configuration problem.

If it’s simply falling over from too much traffic, there ought to be a way to handle that at the web server layer. Besides throttling connections, there are other things that might be done, like making sure you are serving static assets directly from the web server, which is more efficient at that than having Ghost serve them.

Or enable a CDN service. Again, this frees the web server from responding to most static asset requests, so it can direct more resources toward serving the dynamic content.

We have finally gotten to the root of the problem.

Basically, a lot of members were saved in Ghost with invalid email formats (e.g. #lance@gmail.com, ***jeff@amazon.com). We think Ghost gets stuck trying to send to these addresses and querying email open analytics from Mailgun.

We figured this out by using a separate environment (different deployment, RDS instance, etc.) with the same infrastructure and configuration. We already had one, but its database held different content (members, posts, etc.).

We duplicated the production data and changed the emails to mock mailenator emails. When we did this, everything worked, and we were able to publish newsletters with all emails sent within 5 minutes.

We then used the same emails from production and changed only the domain (from lance@gmail.com to lance@mailenator.com). When we did this, we encountered the same problems again. It was only when we looked into the RDS tables that we saw all the invalid emails.

It turns out that Ghost somehow doesn’t catch these invalid emails when we create members through the API.
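
For anyone in the same situation, here is a minimal sketch of a check that could be run on addresses before they are sent to the API, or against an export of the members table. The pattern is an assumption on my part and is deliberately stricter than the email RFCs (which technically allow characters like # and * in the local part): it only accepts a local part that starts with a letter or digit. The function name is made up for illustration.

```python
import re

# Conservative heuristic, NOT a full RFC 5322 validator:
# - local part starts with a letter or digit, then letters, digits, . _ + -
# - domain is dot-separated labels ending in an alphabetic TLD
STRICT_EMAIL = re.compile(
    r"[A-Za-z0-9][A-Za-z0-9._+-]*@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}"
)


def find_suspicious_emails(emails):
    """Return the addresses that do not pass the conservative check."""
    return [e for e in emails if not STRICT_EMAIL.fullmatch(e.strip())]


# The two invalid examples come from the post; the others are illustrative.
members = [
    "lance@gmail.com",
    "#lance@gmail.com",
    "***jeff@amazon.com",
    "jeff@amazon.com",
]

for address in find_suspicious_emails(members):
    print("suspicious:", address)
```

This will also flag some unusual but legitimate addresses, so the output is best reviewed by hand rather than deleted automatically.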
