Advice on importing 250+ MB of posts?


#1

I am trying to import over 250 MB of articles from an old CMS into Ghost 2.2.4. I successfully imported 50, 100 and 500 articles. Now I wanted to have a go at the full import of almost 30,000 articles, but the import fails for no clear reason. Chrome’s console just says “Error: Server was unreachable”.

Looking at GitHub, it seems others have already asked for a CLI import option, so that does not seem to be an option at the moment.

Is importing chunks of data (e.g. 1000 articles at a time) the only option? How would you approach such an import?

To be clear: this is not a supported configuration. We use HardenedBSD (a fork of FreeBSD) as the operating system and H2O as the web server. Everything seems to run smoothly and, as mentioned above, even the import works (with smaller files).


#2

Very large imports currently suffer from the fact that we don’t use polling, tracked here. And if your process does not have enough memory, it will probably die with such a large file. The maximum size I have tested was 25 MB.

So I would try to import the 250 MB file locally with a script, and you need to ensure that you give the Node process enough memory. As soon as you have imported the file successfully, you can dump your MySQL database and upload it to your server.

The alternative is splitting the big file into multiple smaller files (as you tried already), but I guess that takes very long.
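A minimal sketch of the splitting approach, in Python. It assumes the Ghost export layout `{"db": [{"meta": …, "data": {"posts": […], …}}]}` and only slices the `posts` array, so all other tables (tags, users, relations, etc.) are copied whole into every chunk; the `split_export` helper and the output file naming are hypothetical, not part of Ghost:

```python
import copy
import json


def split_export(path, chunk_size=1000):
    """Split a Ghost JSON export into smaller import files.

    Assumes the export layout {"db": [{"meta": ..., "data": {...}}]}
    and slices only the "posts" array; everything else is copied
    into each chunk unchanged. Yields the written file names.
    """
    with open(path) as f:
        export = json.load(f)

    posts = export["db"][0]["data"]["posts"]
    for i in range(0, len(posts), chunk_size):
        # Deep-copy the whole export so meta and the other tables survive,
        # then replace the posts array with the current slice.
        chunk = copy.deepcopy(export)
        chunk["db"][0]["data"]["posts"] = posts[i:i + chunk_size]

        out = f"{path}.part{i // chunk_size + 1}.json"
        with open(out, "w") as f:
            json.dump(chunk, f)
        yield out
```

Each yielded file can then be uploaded through the Labs import screen one at a time.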


#3

Thanks a lot. I went with the chunked approach. That went so-so.

FYI:
I tried to import 1,000 posts at a time, and sometimes only part of a batch was imported and the unspecified (network?) error showed up.

Trying again and again, I managed to import almost everything. But 9,000+ entries were duplicates, triplicates, etc. I could easily identify them in the database because their slugs end in -2, -3, etc. Thus, I managed to delete them with this SQL:

DELETE p, a FROM posts p JOIN posts_authors a ON p.id = a.post_id WHERE p.slug REGEXP '-[2-9]$';

Now, I have 27863 posts in the database and have to identify the ones that were not exported/imported.
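One way to find the posts that were not imported is to compare the slugs in the old export file against the slugs now in the database (e.g. the output of `SELECT slug FROM posts`). A small Python sketch, again assuming the `{"db": [{"data": {"posts": […]}}]}` export layout; `missing_slugs` is a hypothetical helper, not a Ghost API:

```python
import json


def missing_slugs(export_path, imported_slugs):
    """Return slugs present in the old export but absent from the live DB.

    imported_slugs would come from e.g. `SELECT slug FROM posts`.
    Assumes the Ghost export layout {"db": [{"data": {"posts": [...]}}]}.
    """
    with open(export_path) as f:
        exported = {p["slug"] for p in json.load(f)["db"][0]["data"]["posts"]}
    return sorted(exported - set(imported_slugs))
```

The returned slugs identify the posts to re-import in a final, smaller chunk.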


#4

The network error you see in the browser is a timeout on the browser’s part; it doesn’t mean that the import has failed. Although the browser drops the connection, the import is still being processed by the server and will continue in the background. This is probably why you ended up seeing duplicates when you re-tried the same import.

This is why the polling solution was mentioned - we want to be able to show the progress of the background import process rather than having to make the browser wait until the import has finished.


#5

Thank you @Kevin! Now I understand the connection to the polling solution.


#6

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.