[htdig] relative URL retrieval infinite recursive loop

Subject: [htdig] relative URL retrieval infinite recursive loop
From: Glenn Nielsen (glenn@voyager.apg.more.net)
Date: Tue Dec 28 1999 - 04:31:19 PST


The following is a valid URL for a document...

<a href="/parent/parent.html/index.html">Parent Page</a>

where "/parent/parent.html" is a file on the server that is
returned by the webserver from the above URL.

If the document "/parent/parent.html" has any relative references
in it such as...

<a href="./child/child.html">Child Page</a>

The URL constructed by HtDig to retrieve "./child/child.html" gets resolved
to "/parent/parent.html/child/child.html". But the web server will once
again return the contents of the document "/parent/parent.html",
which as its first href contains...

<a href="./child/child.html">Child Page</a>

so HtDig now resolves the relative href for "./child/child.html" as
"/parent/parent.html/child/child/child.html", and so on.

HtDig is now in a recursive infinite loop until the disk partition
HtDig is using for its databases gets 100% full.
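
The growth of the URL can be reproduced outside HtDig. The following is
only a sketch using Python's standard urljoin, not HtDig's own resolver;
the host name "example.com" is an assumption made for illustration.

```python
from urllib.parse import urljoin

# Hypothetical host; only the path structure comes from the report above.
base = "http://example.com/parent/parent.html/index.html"
href = "./child/child.html"

# Each round treats the previously resolved URL as the new base document,
# which is what happens when the server keeps returning the same page.
urls = []
for _ in range(3):
    base = urljoin(base, href)
    urls.append(base)

for u in urls:
    print(u)
```

Every pass adds another "child/" segment, so no two URLs are ever equal
and a check for already visited URLs alone does not stop the loop.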


Since the above construction of URLs is valid and HtDig has no way
of knowing that the parent URL resolves to the file "/parent/parent.html",
HtDig will need to add more intelligence to detect the above problem.

A possible solution would be to compare the contents of the parent and
child documents when the child comes from a relative URL. If the
document contents for the parent and child are identical and have the
same last modification date stamp, ignore the child document and report
an error. Then continue, digging the next href in the parent.
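
A rough sketch of that check, in Python rather than HtDig's own code, might
look like the following. The names Page, same_document, is_relative and dig
are all hypothetical, the fetch function is a stand-in for the real
retrieval, and treating only hrefs without a scheme or leading "/" as
relative is an assumption.

```python
import hashlib
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple
from urllib.parse import urljoin, urlsplit


@dataclass
class Page:
    url: str
    body: bytes
    last_modified: Optional[str]  # raw Last-Modified header value, if any


def same_document(parent: Page, child: Page) -> bool:
    """True when both the modification stamp and the contents match."""
    if parent.last_modified != child.last_modified:
        return False
    return hashlib.sha256(parent.body).digest() == hashlib.sha256(child.body).digest()


def is_relative(href: str) -> bool:
    # Assumption: "relative" means no scheme and no leading slash.
    return not urlsplit(href).scheme and not href.startswith("/")


HREF = re.compile(rb'href="([^"]+)"')


def dig(start: str, fetch: Callable[[str], Page]) -> Tuple[List[str], List[str]]:
    indexed: List[str] = []
    errors: List[str] = []
    seen = {start}
    queue = [(None, start, False)]
    while queue:
        parent, url, relative = queue.pop(0)
        page = fetch(url)
        # The proposed test: a child reached through a relative href that is
        # identical to its parent is reported and not dug any further.
        if relative and parent is not None and same_document(parent, page):
            errors.append(url)
            continue
        indexed.append(url)
        for raw in HREF.findall(page.body):
            href = raw.decode("ascii", "replace")
            target = urljoin(url, href)
            if target not in seen:
                seen.add(target)
                queue.append((page, target, is_relative(href)))
    return indexed, errors


# Stand-in server: every URL returns the same parent document.
BODY = b'<a href="./child/child.html">Child Page</a>'


def fake_fetch(url: str) -> Page:
    return Page(url, BODY, "Tue, 28 Dec 1999 12:00:00 GMT")


indexed, errors = dig("http://example.com/parent/parent.html/index.html", fake_fetch)
print(indexed)
print(errors)
```

With the stand-in server the dig stops after one extra fetch: the parent is
indexed once and the first child URL is reported as an error.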


Glenn Nielsen

Glenn Nielsen glenn@more.net | /* Spelin donut madder |
MOREnet System Programming | * if iz ina coment. |
Missouri Research and Education Network | */ |


This archive was generated by hypermail 2b28 : Tue Dec 28 1999 - 04:43:55 PST