[UPHPU] keyword substring scoring
Jon Jensen
jon at jenseng.com
Thu Jul 13 11:19:31 MDT 2006
> I'd like to search text strings for instances of a keyword,
> give the string a score based on them, and group by the
> scores. The obvious problem is the while loop within a while
> loop. This splits up the scoring when an item has more than
> one keyword (I think...), but I can't think of any other way.
> In addition, I'll want to group by scores (descending).
There are several different approaches...
For example, if you are going to query this report frequently, you may want
to consider caching some information in another database table. Have a
table called items_keywords with columns keyword_id, item_id and count. It
would simply store however many times a keyword appears in a given item.
Then whenever you add/change an item or keyword, you could rebuild the
portion of the cache as it pertains to that keyword/item. Assuming you've
set up the proper indexes, the report query would be extremely fast... just
do a join between your 3 tables...
SELECT link, title, SUM(count * score) AS item_score
FROM keywords, items, items_keywords
WHERE keywords.keyword IN ('keyword1', 'keyword2', ...)
AND keywords.id = items_keywords.keyword_id
AND items.id = items_keywords.item_id
GROUP BY items.id
ORDER by item_score DESC
If the keywords and items constantly change and this query is only run
periodically, you may want to consider a different approach. As you say, a
loop within a loop is bad... In reality, you have three loops within a loop
within another loop, since substr_count loops through an array. Consider
restrucuring your code the following way... it's untested, but it ought to
work ;)
//get all keywords and put them in an associative array with
//the keyword as the key... this will make for speedy
//lookups, since hashes are always quicker than looping
//through an array
$result = mysql_query($keywdsql);
$keywords = array();
while($row = mysql_fetch_assoc($result))
$keywords[strtolower($row['keyword'])] = $row['pointvalue'];
//now let's go through our items and keep the ones that have
//scores...
$result = mysql_query($itemqry);
$items = array();
while($row = mysql_fetch_assoc($result)){
$score = 0;
//lump the text for this item
$text = strtolower($row['link'].' '.$row['title'].
' '.$row['desc']);
//break out the text into an array of words
$text = preg_split('/[^a-z0-9_-]/', $text, -1,
PREG_SPLIT_NO_EMPTY);
//check each word to see if it's a keyword
for($i = 0; $i < sizeof($text); $i++){
//if so, add its score to our total
if(isset($keywords[$text[$i]]))
$score += $keywords[$text[$i]];
}
if($score)
$items[] = array($score, $item);
}
//sort the array by score (reverse order)
rsort($items);
//then do whatever with the array (print it, etc.)
Another thing you may want to do is limit your initial item query to just
those that have the keywords. This is especially important if you have a lot
of items and only a couple keywords. A full-text index can come in quite
handy ("MATCH (title,desc,link) AGAINST ('keyword1 keyword2 ...')"). If your
database/table/column doesn't support full-text indexes, a simple where
clause is still better than nothing ("WHERE title LIKE '%keyword1%' OR desc
LIKE ..."). No sense retrieving rows you are just going to eliminate.
Hope that helps,
Jon
http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
More information about the UPHPU
mailing list