Re: [htdig] Htdig search problem


Torsten Neuer (tneuer@inwise.de)
Fri, 2 Jul 1999 08:57:43 +0200


According to david bernick:
>>Not likely. If you want to submit something, feel free. However, I don't
>>see how you can pick up JavaScript links since there are essentially
>>infinitely many ways of making "links" in JavaScript. If you have some general
>>solution, I'd be interested in seeing it.
>
>I must first assume that one writes "good" JavaScript and abides by certain
>rules when writing it. It also assumes that you are indexing pages within an
>organization that has standardized ways of writing HTML, PDFs and
>JavaScript. If these things are standardized, it's quite easy to pattern
>match for JavaScript links.
>The href property is officially associated with only the location, area and
>link objects. This means that location.href=, area.href=, and link.href= are
>the only standardized ways to put links in JavaScript. We can pattern match
>for location.href= quite easily. This is the most commonly used way to make
>a link in JavaScript. The link itself is either root relative or absolute.

JavaScript is used to make documents dynamic, so the links might (and
probably will) be dynamic as well. There is no way to catch this through
pattern matching. You'd have to implement a JavaScript interpreter and
compute the values that are assigned to those href properties.
However, since this stuff is highly dynamic, you'd also have to predict user
input - which is nearly impossible (or does anyone know of a program that
does a >= 99% guess on what Joe User will do next?).
There are also other ways to link documents in JavaScript, such as going
through the history list of a browser, which would imply that the digger
needs a history list, too.
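
To illustrate the limits of static matching, here is a small Python sketch.
The pattern LITERAL_HREF and the three script snippets are made up for this
example. A literal assignment is found, a computed one yields only a useless
prefix, and history-based navigation yields nothing at all.

```python
import re

# Hypothetical pattern for href assignments with a literal string value.
LITERAL_HREF = re.compile(r"""\blocation\.href\s*=\s*(["'])(.*?)\1""")

static_script = 'location.href = "/news/today.html";'

# The target depends on a form value that only exists at run time.
dynamic_script = '''
var section = document.forms[0].section.value;
location.href = "/docs/" + section + "/index.html";
'''

# Navigation through the browser history contains no URL at all.
history_script = 'history.back();'

def literal_targets(source):
    """Return whatever string literals directly follow location.href=."""
    return [m.group(2) for m in LITERAL_HREF.finditer(source)]

print(literal_targets(static_script))
print(literal_targets(dynamic_script))
print(literal_targets(history_script))
```

For the dynamic script the matcher returns only "/docs/", which is not the
document the user would actually reach.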

Remember: If you can follow JavaScript links, you'll also have to index
JavaScript-generated documents (document.write()). And this is another
"popular" way of placing links in JavaScript!

You'd also have to find a standardized(!) way of excluding JavaScript-linked
documents from digging. I think the robots-exclusion specs are pretty silent
with respect to this.
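
For reference, robots.txt rules match on URL paths and say nothing about how a
link was discovered. The Python sketch below uses the standard library's
urllib.robotparser to show that path check; the robots.txt content, the agent
name "htdig" and the helper may_dig are assumptions for illustration. It does
not provide a way to mark script-generated links as such, which is the gap
described above.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for the example.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

def may_dig(url, agent="htdig"):
    """Check a URL against the parsed rules, however the URL was found."""
    return parser.can_fetch(agent, url)

print(may_dig("http://www.example.com/private/report.html"))
print(may_dig("http://www.example.com/index.html"))
```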

For the sake of not having too many people complaining on this list that
"the JavaScript feature isn't working" for them, I'd happily opt for not
implementing such stuff.

>>It might be easier to do this in Flash, but I don't know much about the
>>format.
>
>This is a lot trickier. Flash movies are compacted and vector based, almost
>like compiled code, and are only readable by a Flash plugin or special Flash
>reader. Still, if you open a Flash movie in a hex editor (even a good text
>editor), you can find the URLs and just follow them. The main issue with htdig
>is parsing the code for hex when encountering an .swf file. I figure (and this
>is only theory so far) you can use the variety of freeware C++ objects that
>do hex parsing. As I said, this is very much theory and I'd like to hear
>some input on it if anyone has any.
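
The hex-editor idea in the quoted text amounts to scanning the raw bytes for
readable URLs. A rough Python sketch follows, assuming the URLs are stored as
plain, uncompressed ASCII text; the pattern, the function urls_in_binary and
the sample bytes are all made up, and this is not a parser for the movie
format.

```python
import re

# Runs of printable ASCII that begin with an http or https scheme.
URL_BYTES = re.compile(rb"https?://[\x21-\x7e]+")

def urls_in_binary(data):
    """Return absolute URLs found as plain text inside binary data."""
    return [m.group(0).decode("ascii") for m in URL_BYTES.finditer(data)]

# Made-up bytes standing in for part of a movie file.
blob = (
    b"FWS\x03\x00\x01http://www.example.com/intro.html\x00"
    b"\x9a\x10/rel/page.html\x00"
)

print(urls_in_binary(blob))
```

Only the absolute URL is found. The relative one has no scheme to anchor on,
so a scan like this cannot tell it apart from other strings in the file.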

Now what is next? OCR for images to get at text that is placed in logos,
buttons or anything like that? IMHO, Flash is good for presentation
programming for exhibitions. Nothing more. If someone can convince me that
Flash is supported on the majority of operating systems, I'd also apply Flash
to Internet or (big) Intranet documents. But since the number of supported
operating systems is limited and support for *nix-based ones is even more
limited...

But that's only my personal opinion ;-)

regards,
  Torsten

--
InWise - Wirtschaftlich-Wissenschaftlicher Internet Service GmbH
Waldhofstraße 14                            Tel: +49-4101-403605
D-25474 Ellerbek                            Fax: +49-4101-403606
E-Mail: info@inwise.de            Internet: http://www.inwise.de


