Overview
Könguló (Icelandic for spider) is a simple web spider plugin for
Google Desktop Search (GDS). It crawls the websites you specify,
e.g. intranet websites, and feeds them into GDS so that their pages
show up in your desktop search results and can be browsed offline
via the GDS cache.
Features include:
- Follows links in HTML frame, image and anchor tags
- Supports HTTP and HTTPS protocols, and HTML and plain text filetypes
- Obeys robots.txt
- Supports basic and digest HTTP authentication, and lets you specify
usernames and passwords for multiple resources
- Can run in a loop, recrawling previously crawled pages every
X minutes
- When recrawling, uses the If-Modified-Since HTTP header to minimize
transfers
- You can specify a regular expression to limit crawls to e.g. your
intranet domain
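As a rough illustration of how link following and the regular-expression limit listed above might fit together, here is a minimal Python sketch. It is not Könguló's actual code: the names `LinkCollector`, `LINK_ATTRS` and `extract_links`, and the example pattern, are assumptions made purely for illustration.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

# Tag -> attribute that carries a followable URL (anchor, frame and image tags).
# This mapping is an assumption, not taken from Könguló's source.
LINK_ATTRS = {"a": "href", "frame": "src", "img": "src"}


class LinkCollector(HTMLParser):
    """Collects absolute URLs from anchor, frame and image tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = LINK_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative links and drop any #fragment.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def extract_links(html, base_url, match_pattern):
    """Return the links in `html` whose URL matches `match_pattern`."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    allowed = re.compile(match_pattern)
    # re.match anchors at the start of the URL, so the pattern acts as a prefix filter.
    return [url for url in collector.links if allowed.match(url)]
```

Calling `extract_links(page_html, "http://mywiki/wiki-index.cgi", r"http://mywiki/")` would keep only links that stay on the `mywiki` host, which is the kind of restriction the regular-expression feature is meant to provide.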
Bad things that would be nice to fix include:
- No GUI, just your friend the command line
- No persistence between sessions; it would be nicer if the record of which
pages have already been fetched, and their last-modified timestamps, were
stored and reused next time
- No support for form-based authentication
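The missing persistence could, for instance, be approached by saving each URL's Last-Modified value to disk and replaying it as an If-Modified-Since header on the next run. The Python sketch below is one hypothetical way to do that. The file name `crawl_state.json` and all helper names are assumptions, and nothing like this is part of Könguló as described here.

```python
import json
import os
import urllib.error
import urllib.request

STATE_FILE = "crawl_state.json"  # hypothetical location for the saved state


def load_state(path=STATE_FILE):
    """Return {url: last_modified_header}, or an empty dict on the first run."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def save_state(state, path=STATE_FILE):
    """Write the {url: last_modified_header} mapping to disk as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(state, f, indent=2)


def conditional_headers(url, state):
    """Build request headers, adding If-Modified-Since for known pages."""
    last_modified = state.get(url)
    return {"If-Modified-Since": last_modified} if last_modified else {}


def fetch_if_changed(url, state):
    """Fetch `url`; return its body, or None if the server answers 304."""
    request = urllib.request.Request(url, headers=conditional_headers(url, state))
    try:
        with urllib.request.urlopen(request) as response:
            last_modified = response.headers.get("Last-Modified")
            if last_modified:
                state[url] = last_modified  # remember for the next crawl
            return response.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: the previously indexed copy is current
            return None
        raise
```

A crawler built this way would call `load_state()` at startup, pass the same dictionary to every `fetch_if_changed()` call, and finish with `save_state()` so the next session can skip unchanged pages.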
Könguló is distributed under the terms of the
BSD License.
For downloads, news, and other information, visit our
Project Page
Example
This is by no means a complete example; it simply gives you a
feel for what Könguló can do.
Index your intranet wiki page one level deep:
kongulo.exe --depth=1 http://mywiki/wiki-index.cgi
Re-crawl every 30 minutes:
kongulo.exe --loop --sleep=30 -d 1 http://mywiki/wiki-index.cgi
Download
For downloads, visit our
Project Page
Installation
See the README file for installation instructions.
Documentation
README
News
See the Project Page for news archives.
Google Groups
Bug reports and patches