Saturday, 14 August, 2004

What I Want From Searching Software

Given the existence of a spider that could identify and download news and blog entries almost as quickly as they're posted, the challenge becomes one of filtering the information and presenting it to the user.  Each user's needs are different, but I think most would agree that the software should locate and present articles (changed or added Web pages) of known interest to the user.  The user expresses his interest in three ways:

  1. By identifying individual sites;
  2. By entering key words and key phrases;
  3. By rating and categorizing retrieved articles on a scale ranging from not interesting to very interesting.
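Those three kinds of input amount to a user profile the program would have to keep around. A minimal sketch of what that might look like (the class name, the 0-to-5 rating scale, and the example values are all illustrative, not part of any real program):

```python
from dataclasses import dataclass, field

@dataclass
class InterestProfile:
    # 1. Individual sites the user has explicitly identified
    sites: set = field(default_factory=set)
    # 2. Key words and key phrases the user has entered
    keywords: set = field(default_factory=set)
    # 3. Ratings of retrieved articles, keyed by URL, on a scale
    #    from 0 (not interesting) to 5 (very interesting)
    ratings: dict = field(default_factory=dict)

profile = InterestProfile()
profile.sites.add("example.com/blog")
profile.keywords.add("bayesian filtering")
profile.ratings["example.com/blog/entry1"] = 4
```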

Using those criteria, the program evaluates articles using a statistical (Bayesian or similar), genetic, or other such algorithm to determine how "interesting" the article is to the user.  The user interface presents the article headlines in list form, sorted from most interesting to least interesting.  The user then reads and rates the articles, further refining the program's ability to assign the ratings automatically.  It's a process of successive refinement:  training the program to identify articles of interest.
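The statistical part of that loop could look much like a word-frequency spam filter turned to a different purpose. Here's a minimal naive-Bayes-style sketch, assuming a simple whitespace tokenizer and add-one smoothing (the function names and the boolean interesting/not-interesting split are my own simplifications; a real program would use the full rating scale):

```python
import math
from collections import Counter

def train(rated_articles):
    """rated_articles: list of (text, interesting) pairs, where
    interesting is a bool taken from the user's ratings.
    Returns per-class word counts."""
    counts = {True: Counter(), False: Counter()}
    for text, interesting in rated_articles:
        counts[interesting].update(text.lower().split())
    return counts

def interest_score(text, counts):
    """Log-odds that an article is interesting, summed over its
    words with add-one smoothing.  Positive means 'more likely
    interesting'; sort headlines by this, descending."""
    totals = {c: sum(counts[c].values()) + len(counts[c]) + 1
              for c in counts}
    score = 0.0
    for word in text.lower().split():
        p_yes = (counts[True][word] + 1) / totals[True]
        p_no = (counts[False][word] + 1) / totals[False]
        score += math.log(p_yes / p_no)
    return score
```

Each round of reading and rating feeds more pairs into `train`, which is the successive refinement described above: the score gets sharper as the rated corpus grows.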

One other thing I'd like the program to do is randomly increase the rating of a low-rated article so that it shows up near the top of the list.  Such article headlines would be identified as having been "pushed up," so that the user doesn't think the rating system is somehow flawed.  Such a feature would have two primary benefits:

  1. It would encourage the user to rate content that he normally wouldn't view, thereby giving the program a wider range of rated articles on which to base its analysis;
  2. It would encourage the user to at least view information that is unrelated to what he normally sees.  At minimum this would ensure that I see new and different things from time to time, and I just might discover a new and valuable information source this way.
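The push-up idea is simple enough to sketch directly. Assuming the scored-headline list from a rater like the one above, something like this would do (the 10% promotion rate and the "[pushed up]" flag are arbitrary choices for illustration):

```python
import random

def rank_with_pushups(scored_articles, pushup_rate=0.1, seed=None):
    """scored_articles: list of (headline, score) pairs.
    Sorts by score, then promotes a random fraction of the
    low-scored articles to the top of the list, flagging them
    so the user doesn't think the rating system is flawed."""
    rng = random.Random(seed)
    ranked = sorted(scored_articles, key=lambda a: a[1], reverse=True)
    promoted, rest = [], []
    for headline, score in ranked:
        if score < 0 and rng.random() < pushup_rate:
            promoted.append((headline + " [pushed up]", score))
        else:
            rest.append((headline, score))
    return promoted + rest
```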

All of this is possible with current technology.  Much of the software is already written.  There are plenty of web-crawling spiders out there that could easily be modified to search a list of XML feeds.  Statistical analysis of articles for relevance sounds quite similar to analyzing email to determine whether it's spam, something POPFile does quite well.
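Even the feed-polling half is nearly trivial. A bare-bones sketch of pulling headlines from a list of RSS feeds, assuming only the basic RSS 2.0 item/title layout (real-world feeds are messy enough that a tolerant parser would be needed in practice):

```python
import urllib.request
import xml.etree.ElementTree as ET

def parse_headlines(xml_text):
    """Extract item titles from RSS 2.0 XML."""
    root = ET.fromstring(xml_text)
    return [item.findtext("title") for item in root.iter("item")]

def fetch_headlines(feed_url):
    """Download one feed and return its headlines, ready to be
    scored and sorted for the user."""
    with urllib.request.urlopen(feed_url) as response:
        return parse_headlines(response.read())
```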

An immediate drawback of this scheme is that I'll be flooded with updated articles.  I'm already scanning a couple hundred new headlines every day in Sharpreader, and I don't have very many feeds identified.  I'd obviously have to start out slowly, identifying only a few news sources at first and training the software to determine what's of interest to me.  If my experience with POPFile is any indication, it would take a week or so of training to get the program to a 90 percent accuracy level.  Then I could start adding a few feeds each day.  I think I'd have the program reasonably well trained in a month or so, provided I gave it a good range of articles on which to base its analysis.  Fortunately, article classification isn't as critical as email classification:  I can tolerate a certain level of false positives (good articles that are identified as not interesting).

One other drawback that comes immediately to mind is spam, which will undoubtedly evolve to clog this space as well.  That's the subject of tomorrow's entry.