Searching a site - including in PDFs and (ugh) DOCs?

Larry Cook lcook at sybase.com
Fri Jun 4 09:07:01 EDT 2004


Bill Sconce wrote:
> Does anyone know of a package which can provide a search capability
> for a Web site - including searching in PDF and .DOC files?

Back in April I did a quick internet search for search engines.  Lucene 
(http://jakarta.apache.org/lucene/docs/index.html) seems to be the most 
advanced and the most active.  There is a good set of converters, including 
PDF and DOC, that have been contributed.  See the jGuru FAQ.  ht://Dig 
(http://www.htdig.org/) also seems pretty popular.  It also uses converters 
and it sounds like there are ones available for PDF and DOC.  See FAQ 
questions 4.8 and 4.9.
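
To illustrate the converter idea mentioned above, here is a minimal sketch in 
Python.  It is not how Lucene or ht://Dig are implemented; it only shows the 
general pattern of mapping a file type to a converter that yields plain text, 
then indexing that text.  The names CONVERTERS, build_index and search are 
made up for this example, and the two converters are stand-ins.  A real setup 
would register a PDF or DOC converter (typically an external tool) in the 
same table.

```python
import os
import re
from collections import defaultdict


def plain_text_converter(raw: bytes) -> str:
    """Stand-in converter: decode the bytes as UTF-8 text."""
    return raw.decode("utf-8", errors="replace")


def html_converter(raw: bytes) -> str:
    """Stand-in converter: crude tag stripping, good enough for a sketch."""
    return re.sub(r"<[^>]+>", " ", plain_text_converter(raw))


# Hypothetical registry: file extension -> function returning plain text.
# A PDF or DOC converter would be added here in the same way.
CONVERTERS = {
    ".txt": plain_text_converter,
    ".html": html_converter,
}


def build_index(documents):
    """Build an inverted index (word -> set of document names).

    Documents whose extension has no registered converter are skipped.
    """
    index = defaultdict(set)
    for name, raw in documents.items():
        ext = os.path.splitext(name)[1].lower()
        converter = CONVERTERS.get(ext)
        if converter is None:
            continue  # no converter for this type, so it cannot be searched
        for word in re.findall(r"[a-z0-9]+", converter(raw).lower()):
            index[word].add(name)
    return index


def search(index, query):
    """Return the names of documents containing every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

In this sketch a file with no registered converter (a .pdf, say) is simply 
left out of the index, which is why having converters for PDF and DOC 
matters for the original question.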

Here is the list of what I found:

http://jakarta.apache.org/lucene/docs/index.html
http://www.htdig.org/
http://openfts.sourceforge.net/
http://www.egothor.org/
http://www.twmacinta.com/bddbot/
http://mg4j.dsi.unimi.it/
http://exist-db.org/
http://search.jxta.org/
http://xqengine.sourceforge.net/
http://search.mnogo.ru/
http://www.javaforu.com/start.htm (SearchAssist - no direct link)
http://www.me.lv/jse/
http://dev.mysql.com/doc/mysql/en/Fulltext_Search.html
http://www.aspseek.org/
http://findmaan.sourceforge.net/
http://harvest.sourceforge.net/
http://linksearch.sourceforge.net/
http://www.perlfect.com/freescripts/search/
http://swish-e.org/
http://www.wrensoft.com/zoom/index.html
http://www.nutch.org/docs/en/
http://www.etymon.com/tr.html

And here are some additional info sites:

http://www.searchtools.com/tools/tools-opensource.html
http://www.searchtools.com/
http://www.hesketh.com/publications/finding_the_right_search_engine.html
http://zez.org/article/articleview/83/
http://www.weberdev.com/ViewArticle.php3?ArticleID=245
http://www.zend.com/zend/tut/tutorial-ferrara1.php

Larry



More information about the gnhlug-discuss mailing list