Thinking about sorting search results in our new Lucene-based digitization environment, in light of
- Thom Hickey’s thoughts about popularity-based sorting for FRBR records on the one hand
- Beth Jefferson’s words at the stimulating OLITA Digital Odyssey day last week, about how library recommendation systems should direct readers to the long tail, on the other.
I’ve got the makings of a recommendation system based on captured user behaviour in the new site (users who searched for this actually clicked through to these items). We could incorporate this into the Lucene-based search system by periodically adding search terms to a field and reindexing. Like this:
- for each item, extract search terms that led a user to click through to that item
- add a field to the index containing these search terms
- add that field to queries, perhaps weighting it more heavily
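A minimal sketch of those three steps, in plain Python rather than Lucene itself, just to make the idea concrete. The click log, the item ids, the field name click_terms and the click_boost weight are all made up for illustration; a real version would write the field into the Lucene index and boost it at query time.

```python
from collections import defaultdict

# Hypothetical click log: (search query, id of the item clicked).
click_log = [
    ("railway maps", "item-1"),
    ("ontario railway", "item-1"),
    ("railway maps", "item-2"),
]

def build_click_terms(log):
    """Step 1: collect the search terms that led to each item."""
    terms_by_item = defaultdict(list)
    for query, item_id in log:
        terms_by_item[item_id].extend(query.lower().split())
    return dict(terms_by_item)

def score(query, doc, click_boost=2.0):
    """Step 3: count term matches, weighting the click_terms field more heavily."""
    q_terms = query.lower().split()
    text_hits = sum(doc["text"].lower().split().count(t) for t in q_terms)
    click_hits = sum(doc["click_terms"].count(t) for t in q_terms)
    return text_hits + click_boost * click_hits

# Step 2: attach the collected terms to each document as an extra field.
docs = {
    "item-1": {"text": "Survey of northern rail lines"},
    "item-2": {"text": "Railway maps of Ontario"},
}
click_terms = build_click_terms(click_log)
for item_id, doc in docs.items():
    doc["click_terms"] = click_terms.get(item_id, [])
```

Here item-1 never mentions "railway" in its own text, but a search for "railway" still finds it because earlier users who searched for that word clicked through to it.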
Over time, we should get a body of useful terms accumulating in that field, right? Or will we get too much flaky behaviour – the user who searched for one thing but clicked on an item because it serendipitously connected with some other research question? (How much search history do we have to capture before we can shake the fringe behaviour out?)
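One possible way to shake out the fringe behaviour, sketched under assumptions: only admit a term for an item once some minimum number of distinct users have searched for it and clicked through. The user ids in the log and the min_users threshold are hypothetical; the right threshold is exactly the open question above.

```python
from collections import defaultdict

def supported_terms(log, min_users=2):
    """Keep a term for an item only if at least min_users distinct users
    searched for it before clicking through to that item."""
    users = defaultdict(set)  # (item_id, term) -> set of user ids
    for user_id, query, item_id in log:
        for term in set(query.lower().split()):
            users[(item_id, term)].add(user_id)
    kept = defaultdict(set)
    for (item_id, term), who in users.items():
        if len(who) >= min_users:
            kept[item_id].add(term)
    return dict(kept)

# Hypothetical log: (user id, search query, id of the item clicked).
log = [
    ("u1", "railway maps", "item-1"),
    ("u2", "railway history", "item-1"),
    ("u3", "butter churns", "item-1"),  # a serendipitous click
]
```

With this log, only "railway" survives for item-1, since it is the only term that two different users shared.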
And will we tend to promote items that happen to get a few initial clicks, so that they move up the results lists and therefore get even more clicks? Records could become popular just for being popular. Should we provide an option that penalizes the recommendation field, to promote items that match your search but that no-one has ever clicked through to? I.e. promote that long tail?
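A rough sketch of what such an option might look like, again in plain Python with made-up numbers. The function names, the mode switch and the log damping are my assumptions, not anything Lucene provides out of the box. Damping the click count with a logarithm means each additional click counts for less than the one before, which softens the feedback loop without removing it.

```python
import math

def adjusted_score(text_score, clicks, mode="popular", weight=1.0):
    """Combine a text-match score with a click count.

    mode="popular":   add a log-damped bonus for past clicks.
    mode="long_tail": subtract the same amount, so never-clicked matches rise.
    """
    bonus = weight * math.log1p(clicks)
    return text_score + bonus if mode == "popular" else text_score - bonus

def rank(results, mode):
    """Sort (item, text_score, clicks) tuples by adjusted score, best first."""
    ordered = sorted(
        results,
        key=lambda r: adjusted_score(r[1], r[2], mode),
        reverse=True,
    )
    return [item for item, _, _ in ordered]

# Hypothetical results: item A is heavily clicked, item B has never been clicked.
results = [("A", 1.0, 50), ("B", 1.2, 0)]
```

In the default mode A comes out ahead of B; with mode="long_tail" the order flips and the never-clicked item comes first.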
This calls for some experimentation…
I was thinking this very thing when I first read about tag clouds. Can we capture the keywords people type and use them to develop a controlled vocabulary that would help later users connect to content of interest to them? For a while a human could tweak it....
Unless I am mistaken, Google used to do this for random hits. Or at least collect the stats for this. Right now, I can't seem to see this (the URLs all look like direct links, not via Google), although they may be collecting this via JavaScript? -g