[htdig] the mysterious "Deleted, no excerpt" problem

Subject: [htdig] the mysterious "Deleted, no excerpt" problem
From: Patrick Robinson (pgr@ramandu.ext.vt.edu)
Date: Tue May 23 2000 - 07:38:47 PDT

Yesterday I ran into the "Deleted, no excerpt" problem (using
htdig 3.1.5 on Solaris 2.6, built w/gcc 2.95.2). That is, files that
I thought were "normal" HTML were being removed from the db because
there was "no excerpt".

For some reason, it occurred to me to check the file type of these
seemingly harmless HTML files with the "file" command. It turns out
that whereas normal plain text files are reported to be "ascii text",
these files that were being removed were "English text".

I couldn't see anything out of the ordinary, looking at them in vi,
so I thought I'd see what od might tell me. Lo and behold, there
were insidious null bytes mixed in with the regular text!
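
For anyone who wants to run the same check without reading od output
by eye, here is a minimal sketch. The function name count_nulls is
made up for illustration; it uses only standard tr and wc, and I have
not tried it on every platform's tr.

```shell
# Hypothetical helper: print the number of NUL bytes in a file.
# tr -cd '\000' deletes every byte except NUL; wc -c counts what is left.
count_nulls() {
    tr -cd '\000' < "$1" | wc -c | tr -d ' '
}

# Report the HTML files in the current directory that contain NULs.
for f in *.html; do
    [ -f "$f" ] || continue
    n=$(count_nulls "$f")
    if [ "$n" -gt 0 ]; then
        echo "$f: $n null byte(s)"
    fi
done
```

A count of 0 means the file has no nulls, so the loop prints only the
suspect files.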

These HTML files had been prepared on a Mac, using BBEdit (not sure
what version).

I don't know why BBEdit would have strewn nulls through these text
files, and I'm not sure why htdig can't read them. My guess is that
htdig holds the document in a null-terminated string. In my case, the
first byte of the file was typically a null, which would make the
document look empty. Is that it?

By the way, if anyone else is having this problem, the nulls can be
stripped out using something like:

   tr -d '\000' < inputfile.html > outputfile.html
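
To clean a whole batch of files, the same tr command could be wrapped
in a small function along these lines. This is only a sketch:
strip_nulls and the temporary file name are my own inventions, and it
overwrites the original file, so keep a backup.

```shell
# Hypothetical helper: strip NUL bytes from a file in place.
# The cleaned copy goes to a temporary file first, so a failed tr
# leaves the original untouched. Copying back with cat keeps the
# original file's permissions.
strip_nulls() {
    strip_tmp="$1.nonul.$$"
    if tr -d '\000' < "$1" > "$strip_tmp"; then
        cat "$strip_tmp" > "$1"
        rm -f "$strip_tmp"
    else
        rm -f "$strip_tmp"
        return 1
    fi
}
```

It could then be run over a directory with something like
`for f in *.html; do strip_nulls "$f"; done`.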

Patrick Robinson
AHNR Info Technology, Virginia Tech

