ht://Dig Copyright © 1995-2004 The ht://Dig Group
Please see the file COPYING for license information.
Here are some of the major features of ht://Dig. They are in no particular order.
- Intranet searching
- ht://Dig has the ability to search through many servers on a network by acting as a WWW browser.
- It is free
- The whole system is released under the GNU Library General Public License (LGPL).
- Robot exclusion is supported
- The Standard for Robot Exclusion is supported by ht://Dig.
- Boolean expression searching
- Searches can be arbitrarily complex using boolean expressions.
- Phrase searching
- A phrase can be searched for by enclosing it in quotes. Phrase searches can be combined with word searches, as in
Linux and "high quality".
- Configurable search results
- The output of a search can easily be tailored to your needs by means of providing HTML templates.
- Fuzzy searching
- Searches can be performed using various configurable algorithms. Currently the following algorithms are supported (in any combination):
- common word endings
- accent stripping
- substring and prefix
- regular expressions
- simple spelling corrections
- Searching of many file formats
- Both HTML documents and plain text files can be searched directly by ht://Dig itself. There is also a mechanism to allow external programs ("external parsers") to be used while building the database so that arbitrary file formats can be searched.
- Document retrieval using many transport services
- Several transport services can be handled by ht://Dig, including http://, ftp:// and file:///. There is also a mechanism to allow external programs ("external protocols") to be used while building the database so that arbitrary transport services can be used.
- Keywords can be added to HTML documents
- Any number of keywords can be added to HTML documents which will not show up when the document is viewed. This is used to make a document more likely to be found and also to make it appear higher in the list of matches.
- Email notification of expired documents
- Special meta information can be added to HTML documents which can be used to notify the maintainer of those documents at a certain time. It is handy to get reminded when to remove the "New" images from a certain page, for example.
- A protected server can be indexed
- ht://Dig can be told to use a specific username and password when it retrieves documents. This can be used to index a server or parts of a server that are protected by a username and password.
- Searches on subsections of the database
- It is easy to set up a search which only returns documents whose URL matches a certain pattern. This becomes very useful for people who want to make their own data searchable without having to use a separate search engine or database.
- Full source code included
- The search engine comes with full source code. The whole system is released under the terms and conditions of the GNU Library General Public License (LGPL) version 2.0.
- The depth of the search can be limited
- Instead of limiting the search to a set of machines, it can also be restricted to documents that are a certain number of "mouse-clicks" away from the start document.
- Full support for the ISO-Latin-1 character set
- Both SGML entities like '&agrave;' and ISO-Latin-1 characters can be indexed and searched.
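The depth limit described above (documents at most a certain number of "mouse-clicks" away from the start document) can be pictured as a breadth-first walk that stops following links once the limit is reached. The following Python sketch only illustrates that idea and is not ht://Dig's actual implementation; the function name `crawl_within_depth` and the in-memory link graph `SITE` are hypothetical stand-ins for real page retrieval.

```python
from collections import deque


def crawl_within_depth(links, start, max_depth):
    """Return {url: clicks_from_start} for pages within max_depth clicks.

    `links` maps each URL to the URLs it links to (a hypothetical,
    in-memory substitute for fetching and parsing real documents).
    """
    depth_of = {start: 0}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        # Pages at the limit are kept, but their links are not followed.
        if depth_of[url] >= max_depth:
            continue
        for target in links.get(url, []):
            # Breadth-first order means the first visit is the shortest path.
            if target not in depth_of:
                depth_of[target] = depth_of[url] + 1
                queue.append(target)
    return depth_of


# Hypothetical link graph used only for illustration.
SITE = {
    "/index.html": ["/a.html", "/b.html"],
    "/a.html": ["/c.html"],
    "/b.html": ["/c.html", "/index.html"],
    "/c.html": ["/d.html"],
}

if __name__ == "__main__":
    for url, depth in crawl_within_depth(SITE, "/index.html", 2).items():
        print(depth, url)
```

With a limit of 2, `/d.html` is left out because it is three clicks from `/index.html`, even though it lives on the same server as the other pages.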
ht://Dig was developed under Unix using C++.
For this reason, you will need a Unix machine, a C compiler and a C++ compiler. (The C compiler is needed to compile some of the GNU libraries.)
Unfortunately, we only have access to a couple of different Unix machines. ht://Dig has been tested on these machines:
If you plan on using g++ to compile ht://Dig, you have to make sure that libstdc++ has been installed. Unfortunately, libstdc++ is a separate package from gcc/g++. You can get libstdc++ from the GNU software archive.
The search engine will require lots of disk space to store its databases. Unfortunately, there is no exact formula to compute the space requirements. It depends on the number of documents you are going to index but also on the various options you use.
As a temporary measure, 3.2 betas use a very inefficient database structure to enable phrase searching. This will be fixed before the release of 3.2.0. Currently, indexing a site of around 10,000 documents gives a database of around 400MB using the default setting for maximum document size and storing the first 50,000 bytes of each document to enable context to be displayed.
Keep in mind that we keep at most 50,000 bytes of each document. This may seem a lot, but most documents aren't very big and it gives us a big enough chunk to almost always show an excerpt of the matches.
You may find that if you store most of each document, the databases are almost the same size as, or even larger than, the documents themselves! Remember that if you're storing a significant portion of each document (say 50,000 bytes as above), you have that requirement, plus the size of the word database and all the additional information about each document (size, URL, date, etc.) required for searching.
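Since there is no exact formula, the figures quoted above can only be turned into a very rough planning estimate. The Python sketch below does that arithmetic under two loudly stated assumptions: that "400MB" means decimal megabytes, and that database size grows linearly with the number of documents, which the text does not claim. The function names are hypothetical and the results are not a guarantee for any real site.

```python
# Figures taken from the text above.
MAX_STORED_BYTES_PER_DOC = 50_000   # at most this much of each document is kept
REFERENCE_DOCS = 10_000             # size of the example site
REFERENCE_DB_BYTES = 400 * 10**6    # "around 400MB" (assumed decimal megabytes)


def excerpt_storage_upper_bound(num_docs):
    """Worst case for stored excerpts: every document hits the 50,000-byte cap."""
    return num_docs * MAX_STORED_BYTES_PER_DOC


def rough_db_size(num_docs):
    """Scale the single quoted data point linearly (an assumption, not a rule)."""
    bytes_per_doc = REFERENCE_DB_BYTES / REFERENCE_DOCS
    return num_docs * bytes_per_doc


if __name__ == "__main__":
    docs = 25_000
    print("excerpt upper bound (MB):", excerpt_storage_upper_bound(docs) / 10**6)
    print("rough database size (MB):", rough_db_size(docs) / 10**6)
```

The quoted data point works out to roughly 40,000 bytes of database per document. That is below the 50,000-byte excerpt cap, which fits the remark that most documents aren't very big.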