I’m using self-hosted Ghost 5.118.1 with the native ‘sodo-search’ function. The native search only matches English characters and does not work for any Indic language such as Hindi or Punjabi. I have tried changing the site locale as well, with the same results. See attached screenshot.
I think this is true for all non-English languages and may be related to bug #22439.
Steps to reproduce
Create an article, excerpt, author name, and tag using non-Roman characters; search will return no results.
I updated to the latest version, 5.119.1, and the issue persists. I think the ‘isCJK’ (Chinese, Japanese, Korean) check still does not cover the Indic Unicode blocks. I’m testing this locally and will submit a GitHub pull request if it works.
The latest release went out a couple hours ago, but I’m still seeing the old version being served, so I think it’s cached somewhere. I’ve got a local build off the latest version, and it /does/ correctly search Indic scripts, but you may not be seeing it live yet.
The reason the latest version of the code lists CJK codepoints is that we need to chop that text into single characters. CJK doesn’t really have words in the sense that English does, so we need to match single characters rather than words separated by whitespace. How does Punjabi work? Does it make sense to break a word down to single characters? If not, please don’t put in a PR that causes that!
And thanks as always for educating me. I’m still learning, and there are a lot of languages out there!
I learned something new too! Indic languages such as Hindi or Punjabi don’t need to be broken up like CJK - they all have proper words separated by spaces. Basically each consonant can carry a vowel sign, which goes to the left, right, top or bottom of the consonant - depending on the sound of the word. So every consonant like the letter S (ਸ) can take vowels in 4 positions. This removes pronunciation ambiguity and provides highly accurate phonetics compared to Roman or RTL or other scripts.
ਸੇ (sE) - vowel on the top
ਸਿ (sI) - vowel on the left
ਸੀ (sEE) - vowel on the right
ਸੁ (sU) - vowel at the bottom
ਸੂ (sOO) - vowel at the bottom
ਸਾ (sAA) and so on…
I won’t submit a PR for this and will wait for the latest release to hit the streets.
Just while we’re in ‘learning stuff’ mode: What happened with search is that to accommodate CJK, we were scanning for all characters that were not CJK, and then breaking them into words by whitespace. This was the wrong approach, because it meant that every time we added a language with a new character set, we had to update the search package to make it work. We had to fix it for Hebrew, and then Bengali, and now we’re still messed up on Greek and Indic.
(Thing I learned: There are a lot of character sets!)
The new approach (which really should be rolling out today, or maybe tomorrow) instead identifies the CJK characters, not the non-CJK characters. So it should not need an update unless there’s another non-word language we need to support. It’s much easier to enumerate the characters that are in CJK than the characters that are NOT in CJK, as it turns out.
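To make the idea concrete, here is a minimal TypeScript sketch of "enumerate CJK, split everything else on whitespace". It is not the actual sodo-search code: the names `CJK_RANGES`, `isCJKCode` and `tokenize` are made up for illustration, and the range list is a small, incomplete sample of CJK blocks.

```typescript
// Hypothetical, incomplete sample of CJK code unit ranges (illustration only;
// the real package may use a different list).
const CJK_RANGES: Array<[number, number]> = [
    [0x3040, 0x309f], // Hiragana
    [0x30a0, 0x30ff], // Katakana
    [0x4e00, 0x9fff], // CJK Unified Ideographs
    [0xac00, 0xd7af], // Hangul Syllables
];

// True when the UTF-16 code unit falls inside one of the listed ranges.
function isCJKCode(code: number): boolean {
    return CJK_RANGES.some(([lo, hi]) => code >= lo && code <= hi);
}

// CJK characters become single-character tokens; everything else
// (Latin, Gurmukhi, Devanagari, Greek, ...) is split on whitespace only.
function tokenize(text: string): string[] {
    const tokens: string[] = [];
    let word = "";
    for (let i = 0; i < text.length; i++) {
        const ch = text.charAt(i);
        if (isCJKCode(text.charCodeAt(i))) {
            if (word) {
                tokens.push(word);
                word = "";
            }
            tokens.push(ch);
        } else if (/\s/.test(ch)) {
            if (word) {
                tokens.push(word);
                word = "";
            }
        } else {
            word += ch; // vowel signs stay attached to their consonant
        }
    }
    if (word) {
        tokens.push(word);
    }
    return tokens;
}
```

With this sketch, `tokenize("ਸੀ ਸਾ")` gives `["ਸੀ", "ਸਾ"]` (whole words, vowel signs intact), while `tokenize("日本語 search")` gives `["日", "本", "語", "search"]`. A new non-CJK script needs no change to the range list, which is the point of the new approach.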
Thank you for the added clarification. Is it possible to override the hardcoded 10,000 limit via config.production.json file for self-hosted Ghost install?
It’s possible to build your own copy with a different limit and to load it via config.production.json. HOWEVER, the reason that limit is there is because sodo-search does all the search database creation and use in the browser. So you’re limited by what the user’s browser can do, remembering that some users will be on phones, older computers, etc.
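As a rough sketch, the override in config.production.json might look like the fragment below. The `sodoSearch` key with `url` and `styles` entries is an assumption based on how Ghost's default config points at the search bundle, so check it against the defaults shipped with your Ghost version. Both URLs are placeholders for wherever you host your own build.

```json
{
  "sodoSearch": {
    "url": "https://example.com/assets/sodo-search/sodo-search.min.js",
    "styles": "https://example.com/assets/sodo-search/main.css"
  }
}
```

Merge this into the existing file rather than replacing it, then restart Ghost so the new config is picked up.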
If you need to search that many posts, or if you also need full-text search, there are several options for integrating it. Algolia is what I use on my site. There’s also a Meilisearch option, and a Typesense option.
Aha! I figured out that I could purge the jsDelivr cache myself, and now I’m getting the right result. Give it another try? (You may need to press F12, open your network tab and check ‘disable cache’ to get it to load fresh.)