Native Search broken for Indic languages (Punjabi and Hindi)

Gill · May 7, 2025, 1:54am

I’m using self-hosted Ghost (version 5.118.1) with the native ‘sodo-search’ function. The native search only matches English characters and does not work for any Indic language, such as Hindi or Punjabi. I have tried changing the site locale as well, with the same results. See attached screenshot.

I think this is true for all non-English languages and may be related to bug #22439.

Steps to reproduce

Create an article, excerpt, author name and tag using non-Roman characters. Searching for any of them returns no results.

[Screenshot: search1]

Gill · May 7, 2025, 1:56am

And below is the result if you search using a non-Roman script.

vikaspotluri123 · May 7, 2025, 2:03am

Can you try upgrading to the latest version (5.119.1)? @Cathy_Sarisky did some work on this that was released in 5.119.0

Cathy_Sarisky · May 7, 2025, 2:54am

The newest release should fix that! Please update when you can. :) (Both Ghost and this thread!)

Update: Apparently search hadn’t been properly shipped. It should /now/ be live and available.

Gill · May 7, 2025, 5:23pm

I updated to the latest version, 5.119.1, and the issue persists. I think the ‘isCJK’ (Chinese, Japanese, Korean) check still does not include the Indic language Unicode blocks. I’m testing this locally and will submit a GitHub pull request if it works.

// Add Indic scripts support (fragment: the start of the expression is not shown)
    (codePoint >= 0x0A00 && codePoint <= 0x0A7F) || // Punjabi (Gurmukhi) Unicode block
    (codePoint >= 0x0900 && codePoint <= 0x097F) || // Hindi (Devanagari) Unicode block
    (codePoint >= 0x0964 && codePoint <= 0x0965) // Devanagari Danda and Double Danda (already inside the Devanagari block above, so redundant)
);

Cathy_Sarisky · May 7, 2025, 5:27pm

The latest release went out a couple hours ago, but I’m still seeing the old version being served, so I think it’s cached somewhere. I’ve got a local build off the latest version, and it /does/ correctly search Indic scripts, but you may not be seeing it live yet.

The reason we list CJK codepoints in the latest version of the code is that we need to chop that text into single characters. CJK doesn’t really have words in the sense that English does, so we need to match single characters rather than words separated by whitespace. How does Punjabi work? Does it make sense to break a word down to single characters? If not, please don’t put in a PR that causes that!

And thanks as always for educating me. I’m still learning, and there are a lot of languages out there!

Gill · May 7, 2025, 5:59pm

I learned something new too! Indic scripts such as those used for Hindi or Punjabi don’t need to be broken up like CJK - they all have proper words. Basically each consonant carries a vowel sign, which can go to the left, to the right, above or below the consonant, depending on the sound of the word. So every consonant, like the letter S (ਸ), can have a vowel sign in 4 positions. This removes pronunciation ambiguity and provides highly accurate phonetics compared to Roman, RTL or other scripts.

ਸੇ (sE) - vowel on the top
ਸਿ (sI) - vowel on the left
ਸੀ (sEE) - vowel on the right
ਸੁ (sU) - vowel at the bottom
ਸੂ (sOO) - vowel at the bottom
ਸਾ (sAA) and so on…
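
To make the point about vowel signs concrete, here is a minimal JavaScript sketch. It is an illustration only, not code from sodo-search, and the helper names `splitByCodePoint` and `splitByGrapheme` are made up for this example. It shows that a Gurmukhi consonant plus its vowel sign is stored as two code points, so chopping text into single characters (as is done for CJK) would tear the vowel away from its consonant.

```javascript
// Illustration only: these helpers are hypothetical and not part of sodo-search.

// Split a string into individual Unicode code points (what per-character
// tokenization would effectively do).
function splitByCodePoint(text) {
  return Array.from(text);
}

// Split a string into user-perceived characters (grapheme clusters) using the
// standard Intl.Segmenter API, which keeps a vowel sign with its consonant.
function splitByGrapheme(text) {
  const segmenter = new Intl.Segmenter('pa', { granularity: 'grapheme' });
  return Array.from(segmenter.segment(text), (part) => part.segment);
}

// U+0A38 is the consonant SA and U+0A40 is the vowel sign II (the "sEE" example).
const syllable = '\u0A38\u0A40';

console.log(splitByCodePoint(syllable).length); // number of code points
console.log(splitByGrapheme(syllable).length);  // number of grapheme clusters
```

Under these assumptions the first count should be larger than the second, which is why whole words separated by whitespace are the sensible search unit for Punjabi and Hindi.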

I will not open a PR for this and will wait for the latest release to hit the streets.

Cathy_Sarisky · May 7, 2025, 6:07pm

Fascinating! Thanks for sharing!

Just while we’re in ‘learning stuff’ mode: What happened with search is that to accommodate CJK, we were scanning for all characters that were not CJK, and then breaking them into words by whitespace. This was the wrong approach, because it meant that every time we added a language with a new character set, we had to update the search package to make it work. We had to fix it for Hebrew, and then Bengali, and now we’re still messed up on Greek and Indic.
(Thing I learned: There are a lot of character sets!)

The new approach (which really should be rolling out today, or maybe tomorrow) instead identifies the CJK characters, not the non-CJK characters. So it should not need an update unless there’s another non-word language we need to support. It’s much easier to enumerate the characters that are in CJK than the characters that are NOT in CJK, as it turns out.
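
The approach described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual sodo-search source: the `CJK_RANGES` list is deliberately partial, and the names `isCJK` (borrowed from the thread) and `tokenize` are used here only for the example. CJK characters become single-character tokens, and everything else is split into words on whitespace, so a new non-CJK script needs no code change.

```javascript
// Hypothetical sketch of "identify CJK, treat everything else as words".
// The range list is partial and for illustration only.
const CJK_RANGES = [
  [0x3040, 0x309f], // Hiragana
  [0x30a0, 0x30ff], // Katakana
  [0x3400, 0x4dbf], // CJK Unified Ideographs Extension A
  [0x4e00, 0x9fff], // CJK Unified Ideographs
  [0xac00, 0xd7af], // Hangul Syllables
];

// True if the first code point of `char` falls in one of the listed ranges.
function isCJK(char) {
  const codePoint = char.codePointAt(0);
  return CJK_RANGES.some(([start, end]) => codePoint >= start && codePoint <= end);
}

// Emit each CJK character as its own token; group all other characters
// into words delimited by whitespace.
function tokenize(text) {
  const tokens = [];
  let word = '';
  const flush = () => {
    if (word) {
      tokens.push(word);
      word = '';
    }
  };
  for (const char of text) { // for...of iterates by code point
    if (isCJK(char)) {
      flush();
      tokens.push(char);
    } else if (/\s/.test(char)) {
      flush();
    } else {
      word += char;
    }
  }
  flush();
  return tokens;
}

console.log(tokenize('hello world'));
console.log(tokenize('\u0A38\u0A40 \u0A38\u0A3E')); // two Gurmukhi words stay whole
console.log(tokenize('\u4F60\u597D'));              // two CJK characters, one token each
```

With this shape, Gurmukhi, Devanagari, Greek or Hebrew text never appears in the range list at all. It simply falls through to the whitespace branch.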

Gill · May 7, 2025, 6:26pm

Thank you for the added clarification. Is it possible to override the hardcoded 10,000-post limit via the config.production.json file for a self-hosted Ghost install?

Cathy_Sarisky · May 7, 2025, 6:35pm

It’s possible to build your own copy with a different limit and to load it via config.production.json. HOWEVER, the reason that limit is there is because sodo-search does all the search database creation and use in the browser. So you’re limited by what the user’s browser can do, remembering that some users will be on phones, older computers, etc.

If you need to search that many posts, or if you also need full-text search, there are several options for integrating it. Algolia is what I use on my site. There’s also a Meilisearch option, and a Typesense option.

Cathy_Sarisky · May 7, 2025, 7:02pm

Aha! I figured out that I could purge the jsDelivr cache myself, and now I’m getting the right result. Give it another try? (You may need to press F12, open your network tab and check ‘disable cache’ to get it to load fresh.)

Gill · May 8, 2025, 4:52pm

Tested for three scripts - Hindi, Punjabi and Urdu (RTL) - and it is working as expected!!! Thank you so much for the quick turnaround.
