htdig: Re: Indexing output of CGIs


William Rhee (willrhee@umich.edu)
Fri, 13 Nov 1998 19:36:41 -0500 (EST)


Hi again,

I haven't received any replies, except from someone else who has
seen the same symptoms. After looking through some of the source
files, though, it appears this behavior is by design, presumably to keep
htdig from careening wildly out of control on dynamically-generated pages
with dynamically-generated links.

In HTML.cc:

               // If a '?' or '#' is present in a quoted URL,
               // treat that as the end of the URL, but we skip
               // past the quote to parse the rest of the anchor.
               //
               // Is there a better way of looking for these?
               //
               if ((t = strchr(position, '#')) != NULL)
                   *t = '\0';
               if ((t = strchr(position, '?')) != NULL)
                   *t = '\0';

To stop the URL from being terminated at a '?', I commented out
               if ((t = strchr(position, '?')) != NULL)
                   *t = '\0';

in a number of places in HTML.cc.
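
For anyone who wants to see the effect outside of htdig, here is a
minimal standalone sketch of the same idea. It is not the htdig code:
trim_url() and its keep_query flag are hypothetical names I made up for
illustration, and it works on a copy of the string rather than on the
parser's buffer.

```cpp
#include <string>

// Hypothetical helper, not part of htdig. It mimics the truncation
// quoted from HTML.cc, but makes the cut at '?' optional.
std::string trim_url(const char *raw, bool keep_query)
{
    std::string url(raw);   // work on a copy, not the parse buffer

    // A '#fragment' is never sent to the server, so always drop it.
    std::string::size_type pos = url.find('#');
    if (pos != std::string::npos)
        url.erase(pos);

    // Stock behavior: also cut at '?'. Skipping this step is the
    // equivalent of commenting out the second strchr() test.
    if (!keep_query) {
        pos = url.find('?');
        if (pos != std::string::npos)
            url.erase(pos);
    }
    return url;
}
```

With keep_query set to false this reproduces the truncated entries seen
in the url list; with it set to true the query string survives and only
the fragment is removed.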

If you decide to do this as well, it is probably a good idea to set
max_hop_count in htdig.conf to something more reasonable than the
default (999999). Also note that by default the 'exclude_urls' directive
is set to ignore /cgi-bin/. If you want htdig to pick up /cgi-bin/ URLs,
don't just comment the directive out; replace its value with some other,
possibly nonsensical, pattern instead.
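
As a rough illustration, the relevant lines in htdig.conf might end up
looking something like this. The hop count of 4 and the placeholder
pattern are assumptions picked for the example, not recommended values;
choose whatever suits your site.

```
# Limit how far htdig follows links from the start page
# (4 is an arbitrary example value).
max_hop_count:  4

# Replace the default /cgi-bin/ exclusion with a pattern that should
# never match a real URL (placeholder, pick your own).
exclude_urls:   /no-such-path-placeholder/
```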

Happy htdigging,
--Will

On Thu, 12 Nov 1998, William Rhee wrote:

> Hi there,
>
> I'm trying to index a single page which has a bunch of links to CGIs with
> urlencoded parameters in the URL's query string, e.g.:
>
> http://someplace.org/cgi-bin/something?ID=1234&blah=foo
>
> I removed the "exclude_urls" directive in the default htdig.conf which
> tells it not to index URLs matching the patterns /cgi-bin/ and .cgi but
> none of the pages get indexed.
>
> Examining the "url_list" of all the URLs which htdig extracts while it is
> running, it appears that the parameters of the query string are being
> truncated. That is, there are many lines in the url list where:
>
> http://someplace.org/cgi-bin/something
>
> appears, but the ? and query string:
>
> ?ID=1234&blah=foo
>
> is missing. Is this by design or has someone out there also
> experienced the symptom (maybe already patched it?! :-) )?
>
> cheers,
> --Will
>
>




This archive was generated by hypermail 2.0b3 on Sat Jan 02 1999 - 16:28:48 PST