Re: [htdig3-dev] Bug#56721: htdig and locale de_DE peculiarities. (fwd)


Subject: Re: [htdig3-dev] Bug#56721: htdig and locale de_DE peculiarities. (fwd)
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Fri Feb 04 2000 - 09:57:39 PST


According to Gergely Madarasz:
> I just got this bugreport in the debian BTS
>
> --
> Madarasz Gergely gorgo@sztaki.hu gorgo@linux.rulez.org
> It's practically impossible to look at a penguin and feel angry.
> Egy pingvinre gyakorlatilag lehetetlen haragosan nezni.
> HuLUG: http://mlf.linux.rulez.org/
>
> ---------- Forwarded message ----------
> Date: Mon, 31 Jan 2000 17:13:54 +0100
> From: Florian Hars <florian@hars.de>
> To: submit@bugs.debian.org
> Subject: Bug#56721: htdig and locale de_DE peculiarities.
> Resent-Date: Mon, 31 Jan 2000 16:18:02 +0000 (GMT)
> Resent-From: Florian Hars <florian@hars.de>
> Resent-To: debian-bugs-dist@lists.debian.org
> Resent-cc: Gergely Madarasz <gorgo@sztaki.hu>
>
> Package: htdig
> Version: 3.1.4-1
>
> This is probably for upstream.

This is very strange. I can't see anything in the code that could
explain the behaviour described below. Does Debian apply any patches to
3.1.4, or is the package built from the unmodified 3.1.4 tarball?
If there are any patches, please send them to us.

> I use htdig with a locale: de_DE setting. It seems unable to find
> occurrences of words containing non-ascii characters that are part of
> titles, <Hn> or emphasis elements. Say, if i look for "bég" in my
> data, it finds an index.html document that contains the line
>
> <a href="beg-islamabad-1990.html">B&eacute;g 1991: From the Quark
> Model to the Stand...</a>
>
> but not the document beg-islamabad-1990.html itself, that starts with:
>
> <html><head><title>B&eacute;g 1991: From the Quark Model to the
> Stand...</title>
> <body>
> <h1>Mirza Abdul Baqi <strong>B&eacute;g</strong>: From the Quark Model
> to the Standard Model: Ten Fateful Years in Particle Physics (1964--74
> C.\,E.).</h1>
> <p>Mirza Abdul Baqi <strong>B&eacute;g</strong> (1991): <em>From the
> Quark Model to the Standard Model: Ten Fateful Years in Particle
> Physics (1964--74 C.\,E.).</em>
>
> It also doesn't find another document containing
>
> <p><a href="beg-islamabad-1990.html">Mirza Abdul Baqi
> <strong>B&eacute;g</strong>: <em>From the Quark Model to the
> Stand...</em> 221-284</a></p>
>
> although it finds both documents if I look for "Mirza".

My first impulse was to blame title_factor and heading_factor_1 through
heading_factor_6 being set to 0, but that would not explain why
unaccented words in headings and titles are found (unless those words
also appear elsewhere in the document). It also wouldn't explain why the
<strong> tag has any effect; htdig normally ignores that tag.
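To make that reasoning concrete, here is a toy model in Python of what a
zero weighting factor would do. It is only a sketch, not htdig's actual
indexing code: the FACTORS table, the index_document helper and the
assumption that a zero factor drops the word entirely are all
hypothetical. The point is that such a setting would treat accented and
unaccented words alike, so it cannot explain the report.

```python
import re
from collections import defaultdict

# Hypothetical weights for illustration only; these are not htdig's defaults.
FACTORS = {"title": 0, "heading": 0, "text": 1}


def index_document(sections):
    """Return word -> accumulated weight for a list of (kind, text) pairs.

    Assumption: a section whose factor is 0 contributes nothing to the index.
    """
    weights = defaultdict(int)
    for kind, text in sections:
        factor = FACTORS[kind]
        if factor == 0:
            continue  # words seen only in zero-weighted sections are never recorded
        for word in re.findall(r"\w+", text.lower()):
            weights[word] += factor
    return dict(weights)


doc = [
    ("title", "Bég 1991: From the Quark Model"),
    ("heading", "Ten Fateful Years in Particle Physics"),
    ("text", "Mirza Abdul Baqi Bég (1991)"),
]
index = index_document(doc)

# "quark" occurs only in the title, so it is missing from the index;
# "bég" and "mirza" also occur in body text, so both are present.
for word in ("quark", "mirza", "bég"):
    print(word, word in index)
```

In this model the accented word is found for exactly the same reason as
"Mirza", which is the opposite of what the bug report describes.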

Given that the e-acute works sometimes, I think we can rule out a problem
with the locale - that would either work consistently or fail consistently.
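As a sanity check outside htdig, the fragments from the report can be
decoded by hand to confirm that the same word comes out of each of them.
This is a minimal sketch in Python using only the standard library; the
snippets are abbreviated from the report, and the words helper is a
hypothetical stand-in that has nothing to do with htdig's own parser.

```python
import html
import re

# Fragments abbreviated from the bug report.
snippets = {
    "index link": '<a href="beg-islamabad-1990.html">B&eacute;g 1991: From the Quark</a>',
    "title": "<title>B&eacute;g 1991: From the Quark Model to the Stand...</title>",
    "heading": "<h1>Mirza Abdul Baqi <strong>B&eacute;g</strong>: From the Quark Model</h1>",
}


def words(markup):
    """Strip tags, decode character entities, and return the lowercased words."""
    text = re.sub(r"<[^>]+>", " ", markup)  # drop tags first so attributes are not indexed
    text = html.unescape(text)              # &eacute; becomes the single character é
    return set(re.findall(r"\w+", text.lower()))


found = {name: "bég" in words(markup) for name, markup in snippets.items()}
for name, hit in found.items():
    print(f"{name}: {hit}")
```

All three fragments yield the same word once the entity is decoded, so
whatever distinguishes the link from the title and heading has to come
from how the surrounding markup is handled, not from the character itself.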

Perhaps you should set start_url to the URLs of the two documents above
that were giving you problems, and run htdig -vvvvv -i -s to see what it
does when parsing these names. You may also want to point database_dir
at a different directory so as to avoid clobbering your current database.
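One way to set up such a test run is sketched below in Python. It writes
a scratch configuration and prints the command to run, without running
htdig itself. The URLs, the file name htdig-debug.conf and the helper
names are placeholders, not anything taken from the report; the sketch
also assumes that start_url accepts a whitespace-separated list and that
the configuration file is passed with -c. Any other attributes from your
real configuration would need to be copied in by hand.

```python
import shlex
import tempfile
from pathlib import Path

# Placeholder URLs; substitute the real locations of the two problem documents.
PROBLEM_URLS = [
    "http://localhost/beg-islamabad-1990.html",
    "http://localhost/other-document.html",
]


def write_debug_config(urls, workdir):
    """Write a scratch htdig config that uses its own database directory."""
    workdir = Path(workdir)
    db_dir = workdir / "db"
    db_dir.mkdir(parents=True, exist_ok=True)
    conf = workdir / "htdig-debug.conf"
    conf.write_text(
        f"database_dir: {db_dir}\n"          # keeps the production database untouched
        f"start_url: {' '.join(urls)}\n"     # assumption: whitespace-separated URL list
        "locale: de_DE\n",                   # same locale as in the bug report
        encoding="utf-8",
    )
    return conf


def htdig_command(conf):
    """Build the verbose initial-dig command line suggested above."""
    return ["htdig", "-vvvvv", "-i", "-s", "-c", str(conf)]


workdir = tempfile.mkdtemp(prefix="htdig-debug-")
conf = write_debug_config(PROBLEM_URLS, workdir)
cmd = htdig_command(conf)
print(shlex.join(cmd))
```

Redirecting the output of the printed command to a file makes it easier
to search for the lines where the title and headings are parsed.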

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930




This archive was generated by hypermail 2b28 : Fri Feb 04 2000 - 09:59:44 PST