search_wordlist index full
b0blee
Board Member
Joined: 11 Jun 2012
Posts: 15
Location: Cloverdale, northern California

Posted: Tue Jun 18, 2019 3:42 am
Post subject: search_wordlist index full

In the database, the word_id field of the search_wordlist table is defined as a MEDIUMINT (3 bytes). The filter that allows words into the list is woefully inadequate: it lets through numbers, URLs, symbols, and even emoji. The word list is a virtual garbage can.

Yesterday people started getting errors like the one below (note the dubious word "andtalkedwith"). After some research (viewing the database with Webmin), I discovered that the table's highest word_id was 16,777,215 (0xFFFFFF). Duh! I put a return statement at the top of add_search_words() and that "fixed" it.

[Screenshot: error message mentioning the word "andtalkedwith"]
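
For anyone else who hits this, the stopgap is nothing fancier than an early return at the top of add_search_words() in ./includes/functions_search.php, roughly like this (a sketch only; check the exact function signature in your own copy):

Code:
//
// Temporary: stop indexing new words entirely so posting works again.
// Remove the early return once the word_id overflow is actually fixed.
//
function add_search_words($mode, $post_id, $post_text, $post_title = '')
{
        return; // skip search indexing for now

        // ... original body of the function continues unchanged below ...
}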

Has anyone else encountered this? Has anyone written SQL code to clean up that table? I'm not proficient in SQL and I'm hesitant to do anything on the live database.

_________________
-𝕓𝕆𝕓- at The Steel Guitar Forum
drathbun
Board Member
Joined: 24 Jul 2008
Posts: 683
Location: Texas

Posted: Tue Jun 18, 2019 10:36 pm
Post subject: Re: search_wordlist index full

That is super surprising! The SEARCH_WORDLIST table contains a unique list of every word that has ever been posted on your board. You're saying that your users have posted over 16MM unique words, which is fairly interesting; there are fewer than 200,000 words in the Oxford English Dictionary. :) Of course phpBB stores not just actual words but also fragments that look like words. I have a board with almost a million posts, and I have fewer than half a million unique words.

Check to see if your wordlist table has "words" that are too long or too short. Run this SQL (which should work for MySQL) to check:
Code:
select min(length(word_text)) as short_word,
max(length(word_text)) as long_word
from sgf_search_wordlist;

If it doesn't come back with this, something is wrong:
Code:
+------------+-----------+
| short_word | long_word |
+------------+-----------+
|          3 |        20 |
+------------+-----------+
1 row in set (0.80 sec)

The actual size designation in my database is mediumint(8). The maximum value is, as you have discovered, a bit over sixteen million. You could probably alter the table and change the type from mediumint to int, but that doesn't fix the problem of how many words you are collecting in your dictionary. :shock: If you do that, you would also have to update the SEARCH_WORDMATCH table (which should be sgf_search_wordmatch, based on the SQL you shared) as well; otherwise you're going to index the wrong words to a post.
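
If you do go that route, the statements would look something like this (a sketch only, using the sgf_ prefix from your query; take a backup first and double-check the column definitions against your own schema):

Code:
-- Widen word_id in both tables so they stay in sync.
-- mediumint unsigned tops out at 16,777,215; int unsigned goes to about 4.29 billion.
ALTER TABLE sgf_search_wordlist
    MODIFY word_id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT;

ALTER TABLE sgf_search_wordmatch
    MODIFY word_id INT(10) UNSIGNED NOT NULL;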

Keep in mind the number of posts does not impact the number of words in the SEARCH_WORDLIST table. It is truly supposed to be a unique list of words contained across all of your posts. Having 16MM values is...unexpected.

Having a look at the list of words might prove to be educational.
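
Something along these lines (again assuming the sgf_ prefix) will show you the longest "words" and how many entries are nothing but digits, which is usually where the junk hides:

Code:
-- Peek at the longest "words" -- URLs and run-together phrases tend to show up here.
SELECT word_text, LENGTH(word_text) AS len
  FROM sgf_search_wordlist
 ORDER BY len DESC
 LIMIT 50;

-- Count entries that are purely numeric.
SELECT COUNT(*) AS numeric_words
  FROM sgf_search_wordlist
 WHERE word_text REGEXP '^[0-9]+$';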

_________________
phpBBDoctor Blog
b0blee
Board Member
Joined: 11 Jun 2012
Posts: 15
Location: Cloverdale, northern California

Posted: Wed Jun 19, 2019 1:42 am
Post subject: Re: search_wordlist index full

I don't have 16 million rows in the table; the word_id field has reached 16 million. I don't know where this counter comes from, but it skips a lot of numbers. I've deleted some rows off the top since the problem started. That didn't help (duh!). Here's a current screenshot.

[Screenshot: Webmin view of the search_wordlist table]

As you can see, there are 455,750 "words", but the word_id values are not contiguous. Where are those numbers generated?

As of this minute there are 2,770,777 posts in the system.

_________________
-𝕓𝕆𝕓- at The Steel Guitar Forum
b0blee
Board Member
Joined: 11 Jun 2012
Posts: 15
Location: Cloverdale, northern California

Posted: Wed Jun 19, 2019 1:53 am
Post subject: Re: search_wordlist index full

I got:

short_word: 1
long_word: 50

That's bad, certainly. Something's screwed up. Maybe it's related to the fact that a few years ago I switched to UTF-8 encoding.

_________________
-𝕓𝕆𝕓- at The Steel Guitar Forum
drathbun
Board Member
Joined: 24 Jul 2008
Posts: 683
Location: Texas

Posted: Wed Jun 19, 2019 1:20 pm
Post subject: Re: search_wordlist index full

A "rebuild search index" MOD could potentially help fix your issue; more details at the end of this post.

I don't know how much you know about how the database works, so if this is too simplistic for you I apologize, but I think it helps explain what has likely happened. The search system is made up of two tables: a word list and a list of posts that contain those words. Every time a new post is entered, phpBB scans it to see if there are new words and, if so, inserts them into the word list table. Here's how that table is defined:
Code:
+-------------+-----------------------+------+-----+---------+----------------+
| Field       | Type                  | Null | Key | Default | Extra          |
+-------------+-----------------------+------+-----+---------+----------------+
| word_text   | varchar(50)           | NO   | PRI |         |                |
| word_id     | mediumint(8) unsigned | NO   | MUL | NULL    | auto_increment |
| word_common | tinyint(1) unsigned   | NO   |     | 0       |                |
+-------------+-----------------------+------+-----+---------+----------------+

The auto_increment is the important part. Every word has to be assigned a unique ID in order to be associated with the post (or posts) that it appears in. The system takes care of that. If you have 9 words in your table, the ID values should be 1-9. The next one inserted is 10.

Here's the tricky part. The process that assigns those IDs does not go back and reuse values. Suppose you delete word 7. You now have a gap (1-6, 8-10). The ID number 7 is lost forever. If you delete and add a bunch of words, the number of "lost" ID values increases. I am making an educated guess that somewhere along the line words have been deleted from your word list table, and therefore the word ID values continued to increase, up until the point where you couldn't generate a new one anymore due to the data type limitation. As I mentioned previously, changing the type from mediumint to int would extend the life of your board but would not fix the problem.
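
If you want to see that behavior for yourself without touching your live data, a throwaway table demonstrates it (sketch only):

Code:
-- MySQL auto_increment never hands out a deleted ID again.
CREATE TABLE demo_wordlist (
    word_text VARCHAR(50) NOT NULL,
    word_id   MEDIUMINT(8) UNSIGNED NOT NULL AUTO_INCREMENT,
    PRIMARY KEY (word_text),
    KEY (word_id)
);

INSERT INTO demo_wordlist (word_text) VALUES ('alpha'), ('bravo'), ('charlie');
DELETE FROM demo_wordlist WHERE word_text = 'bravo';    -- word_id 2 is gone
INSERT INTO demo_wordlist (word_text) VALUES ('delta'); -- gets word_id 4, not 2

SELECT word_id, word_text FROM demo_wordlist ORDER BY word_id;
DROP TABLE demo_wordlist;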

Next I went looking in the code to find out where words are deleted from the search_wordlist table, and I found that it's in the function called remove_common() in ./includes/functions_search.php. Apparently I looked at this a long time ago and decided it didn't work, because here's what's at the beginning of my code:

Code:
//
// Check if specified words are too common now
//
function remove_common($mode, $fraction, $word_id_list = array())
{
        global $db;

        // short-circuit this logic because it
        // doesn't work anyway.
        return;

:lol: Now I have no idea why I did that, as it was probably done years ago, and I don't have the time right now to go back and redo my analysis. But in my case, since I have turned that off, words are never deleted. Once you fix your problem, you might consider doing the same thing. (You have already done the "return;" trick in your search words process anyway, so you know how to do that.) If you turn that process off, words will no longer be deleted and you will not "lose" WORD_ID values.

But that doesn't fix your problem; it would only (hopefully) stop it from happening again. How do you fix your problem?

In order to get MySQL to reuse those skipped / formerly used WORD_ID values, you have to drop and recreate the table. If you do that, your search process will be completely broken, as the index will be empty. Future posts would be indexed and searchable, but existing posts would not be. That's where I was going at the beginning of this post - there are MODs that are designed to rebuild your search index.

You said you have over 2MM posts, so that's going to take a while, depending on your server hardware. Basically, a rebuild means the system has to read every single post and index (again) every single word. You need to start by dropping and recreating (not just emptying) the search_wordlist table, and you also have to delete all rows from the search_wordmatch table. Both of those tables will be refilled by the re-indexing process.
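
The reset itself would look roughly like this (a sketch built from the DESCRIBE output above; take a full backup and compare against the table definitions that shipped with your phpBB2 install before running anything):

Code:
-- 1) Drop and recreate the word list so the auto_increment counter starts over at 1.
DROP TABLE sgf_search_wordlist;

CREATE TABLE sgf_search_wordlist (
    word_text   VARCHAR(50) NOT NULL DEFAULT '',
    word_id     MEDIUMINT(8) UNSIGNED NOT NULL AUTO_INCREMENT,
    word_common TINYINT(1) UNSIGNED NOT NULL DEFAULT 0,
    PRIMARY KEY (word_text),
    KEY word_id (word_id)
);

-- 2) Empty the match table so old posts no longer point at word IDs that don't exist.
DELETE FROM sgf_search_wordmatch;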

I started to write more about the search process, and then remembered that I already have. :) If you want to read more than you ever wanted to know about how the phpBB search system works, I have an entire category of posts on my phpBBDoctor blog related to this topic:

http://www.phpbbdoctor.com/blog/category/phpbb/search/

I have closed comments on my blog, so if you read something there and have a question, feel free to post it in this topic.

_________________
phpBBDoctor Blog
Salvatos
Board Member
Joined: 19 Feb 2009
Posts: 439
Location: Québec

Posted: Wed Jun 19, 2019 3:10 pm
Post subject: Re: search_wordlist index full

I do still wonder, though: if there are 2,770,777 posts and only about 455,750 "words" have been collected to date, then even assuming that someone or something deleted whole batches of words on occasion, how could the auto_increment have gotten to 16 million? Something must be creating and removing words at an alarming rate, no?
b0blee
Board Member
Joined: 11 Jun 2012
Posts: 15
Location: Cloverdale, northern California

Posted: Wed Jun 19, 2019 4:19 pm
Post subject: Re: search_wordlist index full

Thanks for the response.

I was wondering if maybe the INSERT IGNORE statement increments the ID even when it doesn't actually insert the word. That would explain the high word_id count. Is yours also higher than the actual number of words in the table?
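
For comparison, a query along these lines (with your own table prefix in place of sgf_) shows both numbers side by side:

Code:
-- Compare the number of rows with the highest word_id ever handed out.
SELECT COUNT(*)     AS word_rows,
       MAX(word_id) AS highest_word_id
  FROM sgf_search_wordlist;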

_________________
-𝕓𝕆𝕓- at The Steel Guitar Forum
drathbun
Board Member
Joined: 24 Jul 2008
Posts: 683
Location: Texas

Posted: Wed Jun 19, 2019 7:52 pm
Post subject: Re: search_wordlist index full

On my largest board, mine are exactly in sync. The number of words matches the maximum word_id value. They are both 455K and change.

On this board, they are very slightly out of sync, but I have never rebuilt the search index here, whereas I have on my main board.

I do not think the "ignore" option would consume a word_id value, since there is no insert taking place. The INSERT IGNORE syntax is cheating. :) Rather than checking to see if each word already exists, it says to attempt to insert the word and, if the insert fails (perhaps because the word already exists), simply ignore the error. If that were what was causing the problem, there would be far more boards running into this same situation.
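
For anyone following along, the statement in question has roughly this shape (a sketch; the exact column list is built in add_search_words() in ./includes/functions_search.php):

Code:
-- Try to add every word from the new post in one statement; words that already
-- exist fail silently instead of raising a duplicate-key error.
INSERT IGNORE INTO sgf_search_wordlist (word_text)
VALUES ('guitar'), ('pedal'), ('steel');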

_________________
phpBBDoctor Blog
Jim_UK
Board Member
Joined: 19 Nov 2008
Posts: 585
Location: North West UK

Posted: Wed Jun 19, 2019 8:56 pm
Post subject: Re: search_wordlist index full

Quote:
If you look for a "rebuild search index" MOD that could potentially help fix your issue


Mod attached

Jim



Attachment: rebuild_search_2.4.0a.zip (59.61 KB, downloaded 81 times)

_________________
The truth is out there.
Unfortunately they will not let you anywhere near it!
drathbun
Board Member
Joined: 24 Jul 2008
Posts: 683
Location: Texas

Posted: Sat Jun 22, 2019 12:50 pm
Post subject: Re: search_wordlist index full

Bob, do you have any updates or further questions?

You mentioned you have words that are shorter and longer than the expected limits. That's covered in a blog post here:

http://www.phpbbdoctor.com/blog/2007/02/02/how-does-search-work-part-iv-dissecting-a-regular-expression/

I went back and checked the original regex on my boards, and I don't have any words outside those limits.

_________________
phpBBDoctor Blog
b0blee
Board Member
Joined: 11 Jun 2012
Posts: 15
Location: Cloverdale, northern California

Posted: Sat Jun 22, 2019 5:34 pm
Post subject: Re: search_wordlist index full

Things are running smoothly now. Thanks for the MOD (which I fully intend to run soon) and for the updated clean_words() regex. I've also added a function to filter out emoji, which I'm calling from clean_words():

Code:
function removeEmoji($text)
{
    // Strip Miscellaneous Symbols and Dingbats (U+2600 - U+27FF)
    $regexDingbats = '/[\x{2600}-\x{27FF}]/u';
    $cleanText = preg_replace($regexDingbats, '', $text);

    // Strip Emoticons, Transport and Map Symbols, and
    // Miscellaneous Symbols and Pictographs (U+1F300 - U+1F6FF)
    $regexEmoji = '/[\x{1F300}-\x{1F6FF}]/u';
    $cleanText = preg_replace($regexEmoji, '', $cleanText);

    return $cleanText;
}


I was getting "words" that were just emoji strings. No need to add them to the search words table.
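
For anyone who wants to do the same, the hook is a single line near the top of clean_words() in ./includes/functions_search.php, before the text gets split into words (sketch only, assuming the text parameter is still called $entry in your copy):

Code:
// Strip emoji from the post text before any word splitting happens.
$entry = removeEmoji($entry);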

_________________
-𝕓𝕆𝕓- at The Steel Guitar Forum