I added a tag cloud to this BLOG. It looks quite nice, and I was happy with the tags except when I saw that that they didn't accurately reflect my content.
When I first deployed this change, the biggest tag was for "drupal", but when you clicked on that tag, you saw only one story. And conversely, the "c++" tag is smaller than the "drupal" tag, but there are three "c++" stories. Why was the "drupal" tag so large? And why was the "c++" tag so small?
It took only a little bit of investigation to realize that the "drupal" tag was so large because my Drupal tag module Tagadelic counted my unpublished stories which I had tagged with "drupal". During the "development" of this site, I tagged all my "fake" content as "drupal", and when I finally "launched", I unpublished all that content. However, the content was still in the database, and Tagadelic was counting it.
So how was Tagadelic counting these unpublished tags? For this, I had to look inside the module's PHP code, tagadelic.module. I'm not a Drupal module expert, but I knew I was looking for SQL, and eventually I found it. In the function tagadelic_get_weighted_tags(), I saw this:
After looking at Drupal's term_node and term_data tables, and taking some guesses at the value of "vid" (mine was '2'), I ran the query against my Drupal's MySQL database (using phpMyAdmin). This produced the output:
9 4 drupal
7 2 perl
5 23 job
4 11 code
4 3 source
3 8 bio
The "drupal" tag was the count leader (9), hence the large tag. However, the count contained unpublished stories. How do tell the query to count "only published stories"? Examining Drupal's "node" table revealed the status flag, which (by inspection) is set to '1' when a story is published.
I now needed to add this condition to the SQL's WHERE clause. My first pass was this SQL:
FROM term_data d INNER JOIN term_node n
ON d.tid = n.tid
WHERE d.vid IN ('2')
AND n.nid in (select n.nid from node n where status=1)
GROUP BY d.tid, d.name, d.vid ORDER BY count DESC
This produced a better looking set of weighted tags:
After making a backup of tagadelic.module, and then carefully replacing the SQL in the db_query_range() method, I refreshed my BLOG's home page. The change didn't do anything. Stymied, I looked at the tagadelic_get_weighted_tags() method, and I saw these lines:
$cache_name = 'tagadelic_cache_'. $options;
$cache = cache_get($cache_name);
// make sure cache has data
if (isset($cache->data)) {
$tags = $cache->data;
}
Clearly the tagadelic results were being cached, and perusing the Drupal tables lead me to the "cache" table (apparently used by the cache_get() function).
To reset things, I just deleted the rows associated with tagadelic. On a "real" site, this would be done during a scheduled outage, or perhaps tested on a staging environment. This site? I deleted the rows almost as soon as I realized it was the right thing to do. (However, this didn't persist the change, and I ended up using the "Clear cached data" function in the "Admin > Site configuration" page.) Once I did that, a refresh of the BLOG's home page showed the corrected sizes!
Feeling all smug and confident, I visited Tagadelic's module page. I went to the "issues" page, planning to file a bug. Of course, after scanning the headlines, I saw that this issue was already reported and resolved (though not yet delivered in released code).
The query in the issues page (by chrislb) was more thought out than mine (it didn't use a sub-select statement, but rather a multiple join). I replaced my query with chrislb's, and the end result was the same: good looking and correct tags!
Did I learn any lessons from this? Yes: Peruse your Drupal's module's issues page before embarking on any technical investigations (though it was fun to be inside some PHP code).