Monday, 23 August, 2004

CityDesk Critique

I've been converting my old Random Notes entries to CityDesk, starting from the first entries that I made back in October 2000.  So far I've converted all of the entries through September 2002, which puts me at an average of about one month per day since I started the project on July 26.  At that rate I should have everything converted by the middle of next month.

I'm reasonably happy with CityDesk, but not completely happy.  The good far outweighs the bad, especially when compared with the way I was maintaining my site before, but the program is missing some features and has some quirks that make it less than ideal.  Here they are, in no particular order:

  • The WYSIWYG editor lacks table support.  This isn't a huge drawback, now that I have a style sheet for the site layout, but every once in a while I want to create a table.  The only way to do that is to go into the HTML view and construct the table the hard way.  No big deal, except then I have to be very careful not to modify the table when I switch back to WYSIWYG mode.
  • The CityScript scripting language is wonky in the extreme.  There are powerful constructs for selecting articles and placing them on pages, including different parts of an article (title, date, teaser, sidebar, etc.), but the language uses some very weird notation that is reminiscent of Lisp.  The language has some limitations, too--like the inability to include variable references inside variables (which are basically included code blocks)--that make it very frustrating to use.  It's unfortunate that the designers decided they needed to create a whole new language for this.  I think they should have implemented a subset of a more traditional language (VBScript, perhaps) and spent their time writing functions that make the system more powerful.  The lack of any kind of string manipulation severely limits what can be done--the inability to use the Keywords field to create an index being a case in point.
  • There are many minor user interface issues that make it impossible to use the program without touching the mouse.  In addition, some things that should be on context menus (right-click menus) or available as Alt-key shortcuts aren't there.  In particular, the command to insert a picture should have a shortcut key, and inserting a link should be on the context menu.  There's no keyboard command (at least, I can't find one) to change from the Normal (WYSIWYG) view to the HTML view--something that I find myself having to do all too often when I'm converting entries from FrontPage.
  • Publishing the site takes an inordinately long time.  When I click on the Publish button, CityDesk appears to generate each and every article, whether or not it's changed since the last time I published.  As you can imagine, this takes an increasingly long time as I continue to add articles to the database.  I can understand the program having to check each entry in the database to see if it's changed, but I fail to understand why it needs to regenerate every article every time.  A simple disk cache of generated articles (sketched below) would speed things considerably.  The program only uploads changed articles to the Web site, so there's obviously logic in there to determine what's changed.  As it stands, publishing takes several minutes whenever I want to upload a new article, and that's only going to get worse.
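
To illustrate what I mean by a disk cache, here's a rough sketch in Python of the idea.  It's purely my own illustration--I have no idea how CityDesk generates pages internally, and the names here (generate_html, the cache file) are made up for the example: hash each article's source text, and skip regeneration when the hash hasn't changed.

    import hashlib
    import json
    import os

    CACHE_FILE = "generated_cache.json"   # hypothetical on-disk record of content hashes

    def load_cache():
        if os.path.exists(CACHE_FILE):
            with open(CACHE_FILE) as f:
                return json.load(f)
        return {}

    def publish(articles, generate_html, output_dir):
        """Regenerate only the articles whose source text has changed.

        'articles' maps an article id to its source text; 'generate_html' is a
        stand-in for whatever expensive step turns source into a finished page.
        """
        cache = load_cache()
        for article_id, source in articles.items():
            digest = hashlib.md5(source.encode("utf-8")).hexdigest()
            if cache.get(article_id) == digest:
                continue                  # unchanged since the last publish--skip it
            with open(os.path.join(output_dir, article_id + ".html"), "w") as f:
                f.write(generate_html(source))
            cache[article_id] = digest
        with open(CACHE_FILE, "w") as f:
            json.dump(cache, f)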

CityDesk is a good replacement for the way I was maintaining my Web site previously, and I'll keep using it for the near term.  I'm in the market for something better, though.  Maybe their version 3.0 will address enough of the shortcomings to make me want to upgrade.  If you have any suggestions of programs I should try, please let me know.

Wednesday, 18 August, 2004

Syndication Format Standards?

Like laws and sausages, you probably don't want to know how RSS is made.  Unless you're a developer who's trying to write a program that reads RSS feeds.  Then you're in for a rude awakening.  Silly me, I thought that there was an RSS standard that everyone had agreed on, and that implementing a reader would be relatively simple.  That's not the case at all.

There are three competing syndication formats:  RSS, RDF, and Atom.  The original RSS specification was developed at Netscape and released as version 0.90 in March of 1999.  That was quickly followed by another Netscape release, 0.91, in June 1999.  0.91 was much simplified and was intentionally incompatible with 0.90.  Then things got really ugly:  formats diverged to create the RDF-based RSS 1.0, and we ended up with nine different RSS/RDF formats, all incompatible with each other to one degree or another.  See The myth of RSS compatibility for all the gory details.

The RSS world split into two camps:  the RSS group, which wants to freeze RSS development at version 2.0, and the RDF group, which continues to develop the format and add all kinds of bells and whistles.  The flame wars between these two groups are legendary.  I'll let you search them out if you're so inclined.

And then along came Atom, an attempt to create a syndication format that everybody can agree on.  It looks like the RSS/RDF wars are mostly over and intelligent people are putting their differences behind them to work together on the new format.  Atom is still in the early stages of development, although there are feeds available in that format.  The Atom effort got a big boost in June when the IETF announced formation of the Atom Format and Protocol Working Group.

Another problem that faces developers of syndicated news readers is bad XML.  Many site summaries contain malformed XML, which can't be parsed by a standards-compliant XML parser, and repeated messages to the site operators go unacknowledged.  There is an astonishing amount of bad XML out there that nobody will fix.  Developers are forced to either reject feeds that contain malformed XML or attempt to parse them at any cost.  Most do the latter, which leads to "tag soup":  pretty much anything goes, and programs try to figure things out.  This is how we ended up with incompatible Web browsers, weird constructs, and strange rendering of HTML.  It makes Web browsers big, clunky, and unreliable.  Standards exist for a reason.  Unfortunately, competitive forces require that software attempt to make sense of bad data.
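
To make the developer's dilemma concrete, here's a small Python sketch of the two choices.  The strict path uses a standards-compliant parser and simply gives up on a malformed feed; the "at any cost" path falls back to a crude regular-expression scrape that will happily eat tag soup.  The fallback is my own illustration of the tag-soup approach, not how any particular reader actually does it.

    import re
    import xml.etree.ElementTree as ET

    def item_titles(feed_text):
        """Return the item titles from an RSS document, strictly if possible."""
        try:
            root = ET.fromstring(feed_text)        # standards-compliant parse
            return [t.text or "" for t in root.findall(".//item/title")]
        except ET.ParseError:
            # Tag-soup fallback: scrape anything that looks like a title and
            # drop the first one (the channel title).  Crude, lossy, and exactly
            # the kind of guessing that made Web browsers big and clunky.
            return re.findall(r"<title>(.*?)</title>", feed_text, re.DOTALL)[1:]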

One can only hope that Atom is approved relatively quickly and that sites using other syndication formats will convert to Atom once it's fully defined.

Tuesday, 17 August, 2004

Car Trouble

My truck died Thursday morning while I was on my way to work.  I turned a corner and the engine died.  I had just enough momentum to coast into a parking lot, where I called a tow truck and had it hauled to the shop.  I figured it was something simple like the fuel pump, or maybe the computer that controls everything got fried.

No such luck.

I won't go into the reasons why it took the shop four days to diagnose the problem, but it turns out that the timing chain broke.  That's easy enough to replace, it seems, but there's no guarantee that the engine will run well once everything's put back together.  The reason?  In many engines, mine being one of them, when the timing chain breaks it's possible for a piston to hit a valve, which can bend the valve or crack the piston.  There are two ways to determine if that happened:  tear the engine apart and look, or replace the timing chain and try to start the engine.  People I trust who know a lot about cars (none of whom live in the area, unfortunately) tell me that replacing the timing chain takes less time and effort.

So I find myself in the uncomfortable situation of spending a whole lot of money in order to determine if I need to spend more.  It's either that or write the truck off and get a new one.  As much as I'd like a new truck, I'm not really excited about paying for one.  So I'll take the gamble and go from there.

Car trouble ranks right up there with toothaches on my list of aggravating things.

Sunday, 15 August, 2004

Combating RSS Spam

People who know a lot more than I do about RSS have already given some thought to RSS spam:  sites that include advertisements as part of their "new items" list.  The simple-minded spammers would make the entire feed an advertisement, but those feeds would quickly get blackballed.  More subtle approaches are possible and very easy to implement, as the linked article points out.

As annoying as spammers could be to a first-generation RSS aggregator like SharpReader, that's nothing compared to what they could do to an automated reader bot like the one I described yesterday.  There are many methods that a Web site can use to identify and fool a Web crawler, at least temporarily.  It would be impossible for the crawler to identify such a site automatically on the first pass, so spammers are guaranteed that some percentage of people will see their pages.  Combating this problem will require a community-wide effort--one that I hope will be handled better than the email spam effort, what with its over-zealous "black hole" operators.

I doubt that any kind of legislation could be passed to combat RSS spam.  Unlike email spam, which requires spammers to "push" content onto users and thus open themselves to charges of theft of services for bandwidth and storage, RSS spam is entirely a "pull" model:  people (or programs) go to the site and download the spam.  All a Web site does is publish an XML feed and notify a service that the feed has changed.  There's no "push" involved.

I don't know how much of a problem RSS spam could be.  My immediate reaction is that spammers could clog the RSS space as thoroughly as they've clogged email, but upon further reflection I'm not so sure they can.  Granted, they can clog the bandwidth, but I'm not certain that they can get "eyeballs" if the next generation of RSS aggregators implement a filtering scheme similar to the one I've described.  I guess we'll just have to wait a few years to find out.

Saturday, 14 August, 2004

What I Want From Searching Software

Given the existence of a spider that could identify and download news and blog entries almost as quickly as they're posted, the challenge becomes one of filtering the information and presenting it to the user.  Each user's needs are different, but I think most would agree that the software should locate and present articles (changed or added Web pages) of known interest to the user.  The user expresses his interest in three ways:

  1. By identifying individual sites;
  2. By entering key words and key phrases;
  3. By rating and categorizing retrieved articles on a scale ranging from not interesting to very interesting.

Using those criteria, the program evaluates articles using a statistical (Bayesian or similar), genetic, or other such algorithm to determine how "interesting" the article is to the user.  The user interface presents the article headlines in list form, sorted from most interesting to least interesting.  The user then reads and rates the articles, further refining the program's ability to assign the ratings automatically.  It's a process of successive refinement:  training the program to identify articles of interest.
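
Here's a minimal sketch of the kind of scoring I have in mind, along the lines of the Bayesian filtering that POPFile applies to email.  It keeps word counts from articles the user rated interesting and not interesting, scores a new article by a summed log-likelihood ratio, and sorts the headlines best-first.  It's an illustration of the idea, not a finished design.

    import math
    import re
    from collections import Counter

    interesting = Counter()      # word counts from articles rated "interesting"
    boring = Counter()           # word counts from articles rated "not interesting"

    def words(text):
        return re.findall(r"[a-z']+", text.lower())

    def train(text, liked):
        (interesting if liked else boring).update(words(text))

    def score(text):
        """Higher score = more likely to interest the user (log-likelihood ratio)."""
        total_i = sum(interesting.values()) or 1
        total_b = sum(boring.values()) or 1
        s = 0.0
        for w in set(words(text)):
            p_i = (interesting[w] + 1) / (total_i + 1)   # add-one smoothing
            p_b = (boring[w] + 1) / (total_b + 1)
            s += math.log(p_i / p_b)
        return s

    def ranked(articles):
        """Sort (headline, text) pairs from most to least interesting."""
        return sorted(articles, key=lambda a: score(a[1]), reverse=True)

The rating step is simply a call to train() with the user's verdict, so every article the user rates sharpens the next sort.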

One other thing I'd like the program to do is randomly increase the rating of a low-rated article so that it shows up near the top of the list.  Such article headlines would be identified as having been "pushed up," so that the user doesn't think the rating system is somehow flawed.  Such a feature (sketched after the list below) would have two primary benefits:

  1. It would encourage the user to rate content that he normally wouldn't view, thereby giving the program a wider range of rated articles on which to base its analysis;
  2. It would encourage the user to at least view information that is unrelated to what he normally sees.  At minimum this would ensure that I see new and different things from time to time, and I just might discover a new and valuable information source this way.
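
Continuing the sketch above, the "push up" behavior only takes a few lines; the five percent probability and the bottom-half cutoff are numbers I picked purely for illustration.

    import random

    def display_list(scored_articles, push_probability=0.05):
        """Take (headline, score) pairs, already sorted best-first, and
        occasionally promote a low-rated article to the top, marked as such."""
        items = [(headline, score, False) for headline, score in scored_articles]
        if items and random.random() < push_probability:
            # Pull a random article from the bottom half and flag it "pushed up".
            i = random.randrange(len(items) // 2, len(items))
            headline, score, _ = items.pop(i)
            items.insert(0, (headline, score, True))
        return items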

All of this is possible with current technology.  Much of the software is already written.  There are plenty of Web-crawling spiders out there that could easily be modified to search a list of XML feeds.  Statistical analysis of articles for relevance sounds quite similar to analyzing email to determine whether it's spam--something that POPFile does quite well.

An immediate drawback of this scheme is that I'll be flooded with updated articles.  I'm already scanning a couple hundred new headlines every day in SharpReader, and I don't have very many feeds identified.  I'd obviously have to start out slowly, identifying only a few news sources at first and training the software to determine what's of interest to me.  If my experience with POPFile is any indication, it would take a week or so of training to get the program to a 90 percent accuracy level.  Then I could start adding a few feeds each day.  I think I'd have the program reasonably well trained in a month or so, provided I gave it a good range of articles on which to base its analysis.  Fortunately, article classification isn't as critical as email classification:  I can tolerate a certain level of false positives--good articles that are identified as not interesting.

One other drawback that comes immediately to mind is spam, which will undoubtedly evolve to clog this space as well.  That's the subject of tomorrow's entry.

Friday, 13 August, 2004

How Searching Can Be Improved

Blogging is a relatively new phenomenon.  According to The History of Weblogs, the first weblog was the first Web page, published by Tim Berners-Lee in the early 1990s.  The concept caught on slowly, with just a handful of logs being published until 1998 or 1999, when blogging started to become popular in the technical community.  Blogging by the general public (i.e. people outside of the computer industry) didn't really start until sometime between 2000 and 2002.  Since then it's grown tremendously:  Technorati says that the number of weblogs it tracks has increased from 100,000 to almost 3.5 million in just two years.

In the following discussion, I'm going to lump personal weblogs and frequently updated news sites together.  I realize that news sites and news aggregators serve different needs than do personal blogs, but the way that they're updated and searched is almost identical.

It's little surprise that current searching and indexing techniques are inadequate when it comes to searching blog content.  We've spent centuries learning how to index relatively static content from books.  The growth of magazine publishing in the last 60 years or so gave us some idea of how to index and search monthly periodicals.  But indexing information that changes from minute to minute is still experimental.  The problem isn't so much in the indexing itself, but rather in keeping the index up to date.

Current search engines use a brute-force method of keeping information up to date.  They have a bunch of servers (Google has over 10,000) that continually scan known Web sites for changes.  They do that in two ways:  by reading known pages and comparing the contents with a cached copy, and by searching content for links to new pages--new to the search engine, at least.  Most search engines have some logic built in that prioritizes crawling based on the historical frequency of changes to a site.  A site that typically changes only once per week will be crawled much less frequently than a site like Yahoo News that changes many times per day.  Even so, a lot of bandwidth is wasted crawling pages that haven't changed.
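
A crude sketch of that kind of prioritization:  record when changes to a site were actually observed, and schedule the next visit from the average interval between them.  Real search engines are surely far more sophisticated; this just illustrates the idea.

    from datetime import datetime, timedelta

    def next_crawl_time(change_times, now, minimum=timedelta(hours=1),
                        maximum=timedelta(days=7)):
        """Schedule the next visit based on how often the site has changed.

        'change_times' is a list of datetimes at which changes were observed.
        A site that changes many times a day gets revisited within the hour;
        a site that changes weekly waits up to a week.
        """
        if len(change_times) < 2:
            return now + maximum          # no history yet: be conservative
        spans = [b - a for a, b in zip(change_times, change_times[1:])]
        average = sum(spans, timedelta()) / len(spans)
        return now + max(minimum, min(maximum, average))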

There are ways to notify search engines of changes.  One case in point is Weblogs.com, which has a Ping-site form where site owners can notify the search engine of changes.  When it gets a ping, the Weblogs.com server searches the site to verify that it's changed, and then publishes a change notification in a file called changes.xml.  Programs that want to search current information can download changes.xml and search the listed sites for new information.  Note that changes.xml only says that something on the listed site has changed; it's up to the client of changes.xml to figure out exactly what.  There's still some bandwidth wasted searching unchanged pages, but not as much as a blind search of known sites.
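
If I recall correctly, changes.xml is just a flat list of weblog elements carrying a site name, a URL, and a timestamp, so a client needs very little code to consume it.  A rough sketch in Python, assuming that format and the obvious location for the file (adjust if the real thing differs):

    import urllib.request
    import xml.etree.ElementTree as ET

    CHANGES_URL = "http://www.weblogs.com/changes.xml"

    def changed_sites():
        """Return (name, url) pairs for sites Weblogs.com says have changed.

        Assumes the file is a flat list of <weblog name="..." url="..." .../>
        elements; adjust if the actual format differs.
        """
        with urllib.request.urlopen(CHANGES_URL) as response:
            root = ET.fromstring(response.read())
        return [(w.get("name"), w.get("url")) for w in root.iter("weblog")]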

That's where RSS can help.  If there were a service similar to Weblogs.com that listed only updated RSS feeds, then a program could know exactly what had changed on a site.  Imagine a file called changedfeeds.xml that listed the RSS feeds that had changed in the last hour.  A program could download that file and then search the listed site summaries for changed information.  The amount of bandwidth required to maintain an index of current information would be reduced, and clients of the search engine would know that they were getting the most up-to-date information available.

Getting the most current information is only part of the solution.  Presenting the most relevant information to the user is the other big part, and that's going to require some software that doesn't appear to exist yet.  Next time I'll describe what that software has to do.

Thursday, 12 August, 2004

I said on Tuesday that blogging would change the way we use the Web, and possibly the way we get our information.  I'd like to explain why I think that's so, but first I need to explain why the current way of searching the Web is less than ideal when looking for current or balanced information on a topic.

One of the primary criteria that Google and other search engines use to rank results is links:  how many pages link to the page being ranked.  The more links, the higher the rank.  It's not quite that simple, of course, because people tripped to the scheme pretty quickly and started doing all kinds of silly things to increase the number of links to their pages.  Blog spam is perhaps the most common such attempt.  Search engines have implemented techniques to minimize the effects of such spam on the search rankings, but like every other spam battle it's an ongoing and ever escalating arms race.

The primary problem with a rating system that places a high value on the number of links to a page is that older content accumulates links and maintains "relevance" even after it becomes stale or out of date.  This is fine for static content like my TriTryst pages and other information that doesn't change much, if at all, over time.  It is not a good way to rank pages that deal with current issues or information in emerging fields of study.

The other problem I see in using a link-based rating system for current events is that such a scheme leads to an inadvertent reporting bias.  Searching for current events usually returns a result set that includes stories from the major news outlets (CNN, MSNBC, BBC, New York Times, Reuters, etc.) at the top, followed by links to commentary sites that link to one of those stories.  Stories from the top-tier news outlets typically are very similar to each other, meaning that the first few pages of search results will contain essentially homogeneous news and opinion.  Perhaps you'll run across a FreeRepublic or Plastic story that provides commentary.  At best, though, you usually get just two opposing views on the subject, based on wildly different interpretations of the news articles, with neither acknowledging that any other interpretation is possible.

Much of this is caused by people who seem unwilling or unable to form their own opinions from reading a news article, or who are unwilling to hold an opinion as valid until they read something that backs it up.  And since there typically are only two opposing viewpoints presented in the first few pages of search results, the discussion quickly becomes a this-or-that issue, with no room in the middle.  Truthfully, I can't say which is the cause and which is the effect.  Is news reported this way because that's what people want?  Or does reporting news this way cause public opinion to be bipolar?  That's a question for psychologists and sociologists.  Whatever the case, I think we would all benefit from more balanced reporting.

Not that I expect any one news agency to provide impartial or even balanced reporting.  The slant put on news by any organization will reflect the personal views of the writers, editors, publishers, and even the readership.  They are, after all, in the business of selling what they publish, so you can't fault them for publishing what sells.  The idea of unbiased reporting is fine in a world full of totally rational and unemotional beings, but when humans are involved it's better to acknowledge and disclose the bias.  Biased reporting is okay.  Really.  As long as there are many views that are equally accessible.

The "equally accessible" issue is the heart of the problem.  When Microsoft rolled out their MSN Newsbot last month, articles describing it were quick to point out that MSN Newsbot gives preferential treatment to articles that appear on MSNBC.  Google News is a news aggregator that appears to be unbiased, but an article published at Digital Deliverance recently shows that the top five sources of news make up 48% of the headlines on the Google News front page.  The top 100 sources make up 98% of the headlines.  Other news aggregators appear to have similar unintended biases.

One final thing about major search engines:  they tend to attach more relevance to content from known sources.  I'd say that this is a good thing in general, but it tends to push unknown sources to the bottom of the results rankings, even if the information they provide is new and relevant.  Mind you, this isn't criticism, but rather an observation.  It's an artifact of using a ranking scheme designed for static content to search fast-changing information.  It also causes most blogs to be pushed down in the rankings, because most blogs are not consistently relevant--their content, like mine, varies much more than a news site's does.

Those are the problems I see with current search techniques and news aggregators:  freshness, diversity, and visibility.  Tomorrow I'll discuss how using RSS to search blogs and news feeds can improve on that.

Wednesday, 11 August, 2004

Looking at Search Terms

One of the reports that I get from Sectorlink is a list of terms people used to search my site.  Most of the searches are understandable, but there are a few that make me shake my head.  Here are some of my favorites.

where do allroads lead
all roads lead to rome give an explanation
a rabbits lucky foot did you know
lose socks in washer
lifesavers spark
rotational dynamics

These are all obvious matches for a bit of silliness over in my Rants section.  I sure hope whoever was looking for that information doesn't take my word as the definitive reference on the subject.  Two other search strings probably matched that page, too, although I didn't directly address those questions:

where did ceaser salad come from
julias ceaser when he was born and what did he do

If the person who searched for "nude sunbathing barton springs austin tx" is a woman, I'd be happy to join her there.  Please contact me.

"anisomorpha bupestroides" threw me for a loop until I looked it up on Google.  It's the formal name of the American Walking Stick.  I posted a picture of one that I found in the yard a few years back.

A "concrete beer warehouse" sounds like a great idea, by the way.  I don't have one, but if you build one on my property I'd be happy to fill it with homebrew.

I have no idea how to "transform rice to rice krispies."  Maybe Snap, Crackle, and Pop are relatives of the Keebler elves.  I can't imagine why somebody would think I knew that answer. 

These four make something of a surrealistic poem.

ponzi wines employees
smut magazine august 14
wesley snipes haircut pictures
landscaping cigarette butts

And, my favorite:

random jim

Seeing that, I just had to run the search through Google.  I made the top 10.

Tuesday, 10 August, 2004

How Big is the Blogosphere?

My post on Saturday was part of an experiment to make Web logs more visible.  I don't usually take part in that kind of activism, but this one sounded kind of interesting.  I'm not terribly interested in using that method to make Random Notes more visible, but I am very interested to see how the thing spreads over the Web, whether it's from my site or somebody else's.

The post starts with "There are by some estimates more than a million weblogs."  That's understating the number quite a bit.  Technorati, which calls itself "the authority on what's going on in the world of weblogs," is currently tracking almost 3.5 million weblogs.  They don't say how many are active.  blo.gs, another weblog tracker, currently tracks 2.5 million blogs, of which about 1.5 million are considered active.  Technorati says that it was tracking only 100,000 blogs two years ago.  It's hard to say how much of that is growth in the blogosphere, and how much of it is better detection.  Find more information at blogcensus.net or weblogs.com (among others).

Whatever the actual number, blogging is big and it's getting bigger.  It's the hot new thing on the Web.  Corporate CEOs are starting blogs, as are teenagers and bored housewives.  I find it difficult to believe that people actually post some of what I've seen, but I'm not going to tell them not to post it.  If you want a sample, visit blo.gs and pick one of the recently updated blogs, or check out a random LiveJournal user.  Still, with a couple million people blogging out there, we're bound to find a few that we find interesting.  Right?

Probably, but it's not very easy.  Yet.  Sites like Feedster.com, blo.gs, and many others will help you search blog entries for keywords, and will rank the results by "most popular" using much the same type of ranking scheme that Google uses for page relevance.  From there, you can refine your search further.  We're still working with first-generation tools in the blog space because we haven't figured out exactly what we're looking for, or how to find and filter it once we figure out what it is.

But I think blogging is going to change the way we use the Web, and could change the way we get our information.  I'll tell you why in a post later this week.

Monday, 09 August, 2004

What's This RSS Thing?

I posted Thursday's entry about adding an RSS feed and installing an RSS reader without explaining what RSS is or why I'm interested in it.

RSS is Really Simple Syndication.  It's a file format that Web sites use to publish summaries of recently added or changed information.  It's not a human-readable format, though, unless you're really into reading XML documents.  The purpose of RSS (at least in this context) is to support client programs that read the site summaries (called RSS feeds) and present the information to the user.  The idea is that, rather than using your Web browser to visit the dozen or more news and blog sites that you read on a regular basis, you'll employ an RSS reader to automatically scan those sites' RSS feeds, download the summaries, and let you choose the ones you want to read.  As I said in Thursday's post, you let the news come to you instead of having to track it down.  Using an RSS reader will save you time and trouble.
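
If you're curious what a reader does under the hood, fetching and displaying a feed takes only a few lines.  This sketch pulls an RSS 2.0 feed with Python's standard library and prints each item's headline and link; I've pointed it at this site's own feed.

    import urllib.request
    import xml.etree.ElementTree as ET

    def show_headlines(feed_url="http://www.mischel.com/rss.xml"):
        """Download an RSS 2.0 feed and print each item's title and link."""
        with urllib.request.urlopen(feed_url) as response:
            root = ET.fromstring(response.read())
        for item in root.iter("item"):
            title = item.findtext("title", default="(no title)")
            link = item.findtext("link", default="")
            print(title, "-", link)

    if __name__ == "__main__":
        show_headlines()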

Getting started with RSS only takes a couple of minutes.  First you need to download and install an RSS reader, and then tell it what feeds you want to view.  I'm using SharpReader, which is a very simple (and free) program that took less than two minutes to download and install.  The installed program includes a handful of feeds so that you can see how it works.  The only drawback is that it requires the Microsoft .NET Framework, a Windows add-on that is a 25 megabyte download if you don't already have it installed.

A reader that's been getting a lot of press is Pluck.  It's an Internet Explorer add-in that will read RSS feeds, organize your favorites, and do lots of other cool Web stuff from within your browser.  I haven't used it myself, but a couple of people at the office have given it good reviews.  It's also won Editor's Choice awards from c|net and ZDNet.  It, too, is a free program.

One other program that I haven't used but have heard good things about is RSS Bandit.

The first question most people ask once they've installed an RSS reader is "Where do I find feeds?"  You've probably seen those orange "XML" icons on Web sites, like the one in the left column of this page.  That icon is a link to the RSS feed for the site.  If you click on it you'll see what looks like garbage:  the raw feed in XML format.  What you want to do is right-click on the link and select "Copy Shortcut" (in IE) or "Copy Link Location" (in Mozilla), and then paste the link into your RSS reader.  (Pluck lets you drag and drop the link.)  The RSS reader will read the file and show you the article headlines.  You can click on a headline to read an article summary, or double-click on the headline to visit the site and read the entire article.

Many major news sites like Yahoo, CNet News.com, and BBC News have RSS feeds.  Wired, Slashdot, Plastic, and most of the other common commentary sites have feeds.  All blogs on blogger.com and most of the other major blogging sites have the ability to publish feeds, although I think some bloggers choose not to publish a feed.  In general, if you're reading it on the Web there's probably an RSS feed for it.  Sometimes you have to search around to find the darned thing, though.  You'll find yourself searching the FAQ sections of Web sites trying to find the feed URLs.  RSS Bandit and some other reader applications have the ability to find RSS feeds if you point them at a Web site.
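
The hunting can be automated, too.  Most sites that publish a feed advertise it with a link tag (rel="alternate" and an XML content type) in the page header, which is presumably what RSS Bandit keys on.  A rough sketch of that sort of auto-discovery:

    import urllib.request
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class FeedFinder(HTMLParser):
        """Collect feed URLs advertised via <link rel="alternate"> tags."""
        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.feeds = []

        def handle_starttag(self, tag, attrs):
            a = dict(attrs)
            if (tag == "link" and (a.get("rel") or "").lower() == "alternate"
                    and "xml" in (a.get("type") or "")):
                self.feeds.append(urljoin(self.base_url, a.get("href") or ""))

    def find_feeds(page_url):
        with urllib.request.urlopen(page_url) as response:
            html = response.read().decode("utf-8", errors="replace")
        finder = FeedFinder(page_url)
        finder.feed(html)
        return finder.feeds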

You really should give RSS a try.  It'll only take a couple of minutes, and I'll bet you'll be hooked.  You'll probably end up with the same problem I have now:  too much information.  There are ways around that, which I'll talk about in the next few days.

If you're interested in creating an RSS summary for your Web site and want to learn more about it, start with the WebReference.com article Introduction to RSS.

 

Sunday, 08 August, 2004

10,000 Miles on a Bicycle

The odometer on my bicycling computer turned over to 10,000 miles this morning, 12 miles into today's 45-mile ride with Debra.  Wow.  10,000 miles on a bicycle.  Granted, it's over a 4-1/2 year period, but it still works out to over 600 hours in the saddle.  And those are just the miles that are recorded on the bike computer.  It doesn't include mountain bike rides or the time I spent pedaling on the bike trainer in the garage.  It works out to about 2,200 miles per year, but I haven't been that consistent.  The first year I had the computer I put 1,800 miles on it.  I put another 1,200 miles on the bike in August and September of 2002 when I was training for the Waco Century, but didn't ride much more than 2,000 miles total that year.  In contrast, I've put 4,100 miles on it since October 1 of last year.  I'll be right at 5,000 miles before October 1 this year.

That seems like an awful lot, 5,000 miles in a year.  Except that serious amateur cyclists usually do about 10,000 miles in a single year, and the pros put in 20,000 or more.  In cycling circles, I'm still in the recreational rider category.

Saturday, 07 August, 2004

Help Make Blogs More Visible

There are by some estimates more than a million weblogs. But most of them get no visibility in search engines. Only a few "A-List" blogs get into the top search engine results for a given topic, while the majority of blogs just don't get noticed. The reason is that the smaller blogs don't have enough links pointing to them. But this posting could solve that. Let's help the smaller blogs get more visibility!

This posting is GoMeme 4.0. It is part of an experiment to see if we can create a blog posting that helps 1000's of blogs get higher rankings in Google. So far we have tried 3 earlier variations. Our first test, GoMeme 1.0, spread to nearly 740 blogs in 2.5 days. This new version 4.0 is shorter, simpler, and fits more easily into your blog.

Why are we doing this? We want to help thousands of blogs get more visibility in Google and other search engines. How does it work? Just follow the instructions below to re-post this meme in your blog and add your URL to the end of the Path List below. As the meme spreads onwards from your blog, so will your URL. Later, when your blog is indexed by search engines, they will see the links pointing to your blog from all the downstream blogs that got this via you, which will cause them to rank your blog higher in search results. Everyone in the Path List below benefits in a similar way as this meme spreads. Try it!

Instructions: Just copy this entire post and paste it into your blog. Then add your URL to the end of the path list below, and pass it on! (Make sure you add your URLs as live links or HTML code to the Path List below.)

Path List

  1. Minding the Planet
  2. Luke Hutteman's public virtual MemoryStream
  3. Jim's Random Notes
  4. (your URL goes here! But first, please copy this line and move it down to the next line for the next person).

(NOTE: Be sure you paste live links for the Path List or use HTML code.)

Part of my hosting package is the DeepMetrix LiveStats reports that give me information about how many people visit the site, the most commonly viewed pages, number of repeat visitors, browser and operating system stats, keywords used in searches, and all manner of other things.  It's a very nice package of reports when it's working.  It seems like the stats server is kind of unreliable, though.  Back in May it was down for over a week, and last week it was down for a couple of days.  Still, I do like the information that it gives me.

There is one drawback, though.  My Firefox browser doesn't like the login page at http://www.getmystatsnow.com.  If I visit that page with Firefox 0.9.2 from my Windows machine, it goes into an infinite loop.  It works fine from Internet Explorer 6 and from Firefox 0.8 on my Linux machine.  I wonder if the folks over there know about that.  Does anybody know what on that page could be causing the problem?

Friday, 06 August, 2004

Fixing Bad URLs

I was exploring the Web traffic reports that come along with my Web hosting package from Sectorlink and ran across the "Bad Requests" report, which lists all of the bad URLs that have been used to access my site.  If you get a 404 error, this report will list it.  I found that a surprising number of requests for my Random Notes pages still use "/Diary" rather than "/diary".  As I pointed out on March 5, the move from my old hosting provider to Sectorlink put me on a Linux Web server, on which case is significant in URLs.  When I realized that, I converted all of my URLs to lowercase.  I figured that after five months everybody who was linking to my diary would have fixed their links.  No such luck.  There were dozens of links to "/Diary" and a few other common casing errors as well (WinHelp and ToolUtil, for example).  I fixed the problem by creating Redirect commands in the site's .htaccess file.  Now any request for "/Diary" is redirected to "/diary", and the other common casing errors are mapped to their all-lowercase equivalents.
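
For anyone facing the same problem, the .htaccess entries are one-liners using Apache's Redirect directive.  Something along these lines, with the paths adjusted for your own site:

    # .htaccess -- map old mixed-case URLs to their lowercase equivalents
    Redirect permanent /Diary http://www.mischel.com/diary
    Redirect permanent /WinHelp http://www.mischel.com/winhelp
    Redirect permanent /ToolUtil http://www.mischel.com/toolutil

Because Redirect matches on a path prefix, a request for /Diary/anything gets forwarded to /diary/anything as well.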

I also see a lot of requests for "robots.txt" and "favicon.ico".  I know that robots.txt is a file that well-mannered Web crawlers look for, although I'm not sure what it's supposed to contain.  It looks like the crawlers look for favicon.ico, too, because the number of requests for both files is very close to identical.  I guess I'll have to read up on what those files are for and decide if I need to include them on the site.
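
For the record, robots.txt is just a short text file in the site root that tells well-behaved crawlers which paths they're welcome to fetch.  The simplest useful version, which allows everything, is two lines:

    User-agent: *
    Disallow:

Adding a path to the Disallow line (Disallow: /private/, say) blocks that directory.  favicon.ico is simply a small icon that browsers--and apparently some crawlers--request so they can decorate bookmarks and address bars.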

Thursday, 05 August, 2004

RSS Feed

Behind the times as usual, today I downloaded SharpReader and stepped into the world of Really Simple Syndication.  I'm tired of going to the news.  Now I'll let it come to me.  The only excuse I can give for not doing that sooner is laziness.

I also created an RSS feed for my Web site.  It was surprisingly easy to do.  There's an article on the CityDesk Knowledge Base that shows how to create the required XML to publish an RSS feed.  It's just a couple minutes' work.  I'll add an RSS icon link to the Web page template over the weekend.  The link is http://www.mischel.com/rss.xml.
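
The XML itself is nothing exotic.  Outside of CityDesk, a minimal RSS 2.0 document can be stamped out by a few lines of script; this sketch is my own illustration using Python's standard library, with placeholder items standing in for real entries.

    import xml.etree.ElementTree as ET

    def build_feed(site_title, site_link, site_description, items):
        """Build a minimal RSS 2.0 document from (title, link, description) tuples."""
        rss = ET.Element("rss", version="2.0")
        channel = ET.SubElement(rss, "channel")
        ET.SubElement(channel, "title").text = site_title
        ET.SubElement(channel, "link").text = site_link
        ET.SubElement(channel, "description").text = site_description
        for title, link, description in items:
            item = ET.SubElement(channel, "item")
            ET.SubElement(item, "title").text = title
            ET.SubElement(item, "link").text = link
            ET.SubElement(item, "description").text = description
        return ET.tostring(rss, encoding="unicode")

    print(build_feed("Random Notes", "http://www.mischel.com/",
                     "Jim's Random Notes",
                     [("Sample entry", "http://www.mischel.com/",
                       "A placeholder item for illustration.")]))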

The National Association to Advance Fat Acceptance, according to their web site, "is a non-profit human rights organization dedicated to improving the quality of life for fat people."  Their Information Index page goes into much detail about what the organization does and how.  Their primary goal appears to be "to eliminate discrimination based on body size and provide fat people with the tools for self-empowerment."

That's fine as far as a stated goal goes, but I question some of their methods.  In particular, I disagree with their dismissal of the large body of evidence indicating that being grossly overweight is unhealthy.  The correlation of weight with medical problems is very strong, and for the NAAFA to dismiss it with a couple of feel-good paragraphs is just short of criminal.  Research has shown that being "healthy" (i.e. exercising regularly and eating a sensible diet) is more important than being thin, but research also indicates that people of average weight have fewer health problems.

I'm always suspicious of "activist" organizations, but I do like their stand on stomach stapling and similar crash weight loss schemes, and on the diet industry in general.  I'm also impressed that they came out against Medicare's recent decision to have fatness declared a disease.  I am disappointed, though, in their stated position on weight reduction dieting, which they strongly discourage.

The NAAFA has some good points, and I think what they're trying to do is, on balance, good.  They have many decades of social prejudice to overcome, though, and turning a blind eye to studies that disagree with their stated positions is not a good way to do it.

Sunday, 01 August, 2004

Regular Expressions Make My Brain Hurt

Working with regular expressions makes my brain hurt.  Understand, I'm not new to this particular brand of torture:  I've been using regular expressions for at least 20 years.  Not every day, but often enough that I'm reasonably comfortable with them.  And I still find them confusing to write, difficult to decipher, and almost impossible to modify.  The "language" of regular expressions deserves the description "write only" much more than do APL or Perl.

Regular expressions are a powerful and concise way of expressing simple or complex text matching and replacement behavior.  The basics of the language are easy enough to learn and use--even for a novice programmer--but long and complex regular expressions confound even experienced programmers.  Much of working with regular expressions involves trial and error along with a whole lot of head scratching and searching the Internet for examples of regular expressions that do "something like" whatever it is you're trying to do.
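
One habit that helps with the "write only" problem, at least in languages that support it:  most modern regex engines have a verbose or free-spacing mode that lets you put whitespace and comments inside the pattern.  A small Python example, matching a date like "23 August, 2004":

    import re

    # The same pattern, written so a human can read it six months later.
    DATE = re.compile(r"""
        (?P<day>   \d{1,2}) \s+      # day of the month
        (?P<month> [A-Za-z]+) ,? \s+ # month name, optional comma
        (?P<year>  \d{4})            # four-digit year
        """, re.VERBOSE)

    m = DATE.search("Monday, 23 August, 2004")
    print(m.group("day"), m.group("month"), m.group("year"))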

The Internet is chock full of regular expression "tutorials," most of which appear to be derived from the same reference whose origins I am unable to determine.  All of these "tutorials" describe what regular expressions are, provide simple examples of their use, and then launch into highly detailed technical reference information that means almost nothing.  Gleaning practical information like how to use an advanced feature is difficult at best.

The Internet also is chock full of programs that claim to help build and test regular expressions.  I've yet to find one that does anything more than let me enter a regular expression and test its action against a block of text.  When I see the term "regular expression builder" used to describe a program, I expect to see a tool that actually assists in building the regular expression:  helping me construct the correct syntax, validating the expression for correctness, and providing useful explanatory error messages when it finds incorrect syntax.  I'm very surprised that no such tool exists.
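
The validation half, at least, is easy to bolt together from what the libraries already provide; the hard part such a tool would have to add is the guided construction.  A tiny sketch of the checking piece in Python, which at least turns a bad pattern into a plain-English complaint with a position:

    import re

    def check_pattern(pattern):
        """Compile a regex and report problems in plain English instead of a traceback."""
        try:
            re.compile(pattern)
            return "OK"
        except re.error as e:
            where = "" if e.pos is None else " at position %d" % e.pos
            return "Invalid pattern: %s%s" % (e.msg, where)

    print(check_pattern(r"(\d+"))      # reports the unbalanced parenthesis and where it is
    print(check_pattern(r"\d{1,2}"))   # OK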

Granted, such a tool would be very difficult to create from a user interface perspective.  It's either impossibly difficult, or none of the eight people in the world who actually understand regular expression syntax in detail are willing to expend the energy required to create such a program.  I'm thinking that maybe I should become the ninth person to understand regular expressions and then write the tool myself.

One person who certainly understands regular expressions is Jeffrey Friedl, the author of the book Mastering Regular Expressions.  At over 400 pages, this is the book on regular expressions.  Not only does it describe in detail the language of regular expressions and their behavior, but it does so with a focus on solving real problems--something that none of the other references I've seen does.  It includes sections on the different types of regular expression engines, descriptions of how expressions are processed, tips on creating efficient expressions, and individual chapters for Java, Perl, and .NET programmers.  If you need to understand regular expressions in detail (and if you're writing text processing applications, you do), you absolutely should read this book.