=head1 NAME

perl.apache.org Site Indexing and Search Setup

=head1 Description

This document explains how to set up swish-e, index and search the
perl.apache.org site, and configure the search options.

=head1 Setting up search options

To set up the search options that restrict a search to specific
subsections, modify I<src/search/make.pl> and run:

 % cd src/search
 % ./make.pl

Then commit I<make.pl> and the autogenerated files I<search_options>
and I<checkboxes.storable>. The documentation inside I<make.pl>
provides the rest of the details.

=head1 Setting up swish-e

=over 

=item 1

Install the dev version of swish-e.  Currently we use SWISH-E 2.1-dev-25.

=item 2

Make sure that swish-e is in the C<PATH>, so that the scripts are able
to find it.

=back
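
One way to verify the second step from Perl is to walk the directories
in C<PATH> yourself. This is only a sketch: the helper name
C<find_in_path> is made up for this example and is not part of swish-e
or of the site scripts. To be meaningful it should be run as the same
user that runs the indexing or CGI scripts.

    use strict;
    use warnings;
    use File::Spec;

    # Hypothetical helper: return the first executable called $name
    # found in the directories listed in PATH, or undef if there is none.
    sub find_in_path {
        my ($name) = @_;
        for my $dir (File::Spec->path) {
            my $candidate = File::Spec->catfile($dir, $name);
            return $candidate if -f $candidate && -x _;
        }
        return;
    }

    my $swish = find_in_path('swish-e');
    print defined $swish
        ? "swish-e found at $swish\n"
        : "swish-e is not in PATH\n";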

=head1 Indexing

Build the site as usual (add C<-d> to also build the PDFs):

  % bin/build -f

Among other things, this creates the directory I<dst_html/search>.

Now run:

  % bin/makeindex

This script is already adapted for the production machine of
perl.apache.org.

If you run it elsewhere, you need to set the C<MODPERL_SITE>
environment variable to the base URL of the site:

    export MODPERL_SITE='http://perl.apache.org'

or

    export MODPERL_SITE='http://localhost:4000/dst_html'

or, with tcsh:

   setenv MODPERL_SITE http://perl.apache.org

This URL is used as the base for spidering, and also to determine the
sections of the site (for limiting a search to those sections, see
below).

If you didn't use the script, you can spider and index the site
manually:

  % cd dst_html/search
  % swish-e -S prog -c swish.conf

You should see something like:

  Indexing Data Source: "External-Program"
  Indexing "./spider.pl"
  ./spider.pl: Reading parameters from 'default'
  
  Summary for: http://localhost/modperl-site/
      Duplicates:     5,357  (281.9/sec)
  Off-site links:     1,851  (97.4/sec)
     Total Bytes: 8,107,112  (426690.1/sec)
      Total Docs:       351  (18.5/sec)
     Unique URLs:       419  (22.1/sec)
  Removing very common words...
  no words removed.
  Writing main index...
  Sorting words ...
  Sorting 10599 words alphabetically
  Writing header ...
  Writing index entries ...
    Writing word text: Complete
    Writing word hash: Complete
    Writing word data: Complete
  10599 unique words indexed.
  5 properties sorted.                                              
  351 files indexed.  8107112 total bytes.  307356 total words.
  Elapsed time: 00:00:20 CPU time: 00:00:02
  Indexing done!

Now you can search...

=head1 Searching

=over 

=item 1

Go to the search page: ..../search/search.html

=item 2

Search.

If something doesn't work, check the I<error_log> file on the server
where I<swish.cgi> is running. The most common error is that the
swish-e binary cannot be found by the I<swish.cgi> script. Remember
that CGI scripts may run under a different username and therefore may
not have the same C<PATH> environment variable.

=back

=head1 Swish-e related adjustments to the templates

=over

=item *

Since we want to index only the real content, we use:

  <!-- Swishcommand index -->,
       only content here will be indexed
  <!-- Swishcommand noindex -->,

=item *

Since we want to be able to search any sub-section of the site, the
search form includes the hidden variable C<sbm> (mnemonic: 'search by
meta'). For example:

  <input type="checkbox" name="sbm" value="docs/1.0/guide">

will search all the documents under the I<docs/1.0/guide> directory.

The correct values for the C<sbm> variable are set in the template
when the site is created.

The main search page, I</search/swish.cgi>, has multiple checkboxes
for the C<sbm> variable, so you can limit searches to only selected
sections.

The C<$ENV{MODPERL_SITE}> mentioned earlier is matched against the
C<sbm> variable to extract only the wanted subsets of the hits:

  $uri =~ m!$ENV{MODPERL_SITE}/([^/]+)/.+$!

where C<$1> is used as the section name.  In other words, the section
is simply the first directory name below the site root.



=back
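
The section extraction described in the last item can be tried out in
isolation. The sketch below uses a pattern of the same shape, with
C<\Q...\E> added so that the dots in the site URL are matched
literally. The helper name C<section_of> and the sample URLs are
illustrative only and do not come from I<swish.cgi>.

    use strict;
    use warnings;

    # Hypothetical helper: return the top-level directory of $uri
    # relative to $site, or undef if $uri is not inside a section.
    sub section_of {
        my ($site, $uri) = @_;
        return $uri =~ m!^\Q$site\E/([^/]+)/.+$! ? $1 : undef;
    }

    my $site = 'http://perl.apache.org';
    for my $uri ("$site/docs/1.0/guide/intro.html", "$site/index.html") {
        my $section = section_of($site, $uri);
        printf "%s => %s\n", $uri,
            defined $section ? $section : '(no section)';
    }

A page that sits directly under the site root has no directory
component to capture, so the helper returns C<undef> for it.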


=head1 How does indexing work

Swish is run with a config file, and is run in a mode that says to use
an external program to fetch documents.  That external program is
called I<spider.pl> (part of the swish-e distribution).

By default, I<spider.pl> reads its configuration from
I<SwishSpiderConfig.pl>.  This file builds an array of hashes (in this
case a single hash in the array).  This hash is the config.

Part of the config is a set of callback functions that I<spider.pl>
calls while spidering.  One says to skip image files.  Another one is
a bit more tricky.  It splits a document into sections, creates new
"sub-pages" that are complete HTML pages, and calls the function in
I<spider.pl> that sends those off to swish for indexing.  (That
callback then returns false to tell swish not to index the original
document, since its sections have already been indexed.)
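
To make the shape of such a config concrete, here is a stripped-down
sketch. It is not the site's actual I<SwishSpiderConfig.pl>. The
C<@servers> array and the C<base_url>, C<email>, C<test_url> and
C<filter_content> keys follow our reading of the I<spider.pl>
documentation, so check them against the version you have installed.
The C<split_sections> helper, the C<sections_seen> counter and the
choice to split at C<E<lt>h2E<gt>> headings are assumptions made purely
for illustration. The last lines only exercise the callbacks by hand,
which I<spider.pl> normally does itself.

    use strict;
    use warnings;

    # Assumed helper: cut an HTML body at each <h2> heading and wrap
    # every piece in a minimal but complete HTML page.
    sub split_sections {
        my ($html) = @_;
        my @pages;
        for my $chunk (split /(?=<h2\b)/i, $html) {
            next unless $chunk =~ m{<h2\b[^>]*>(.*?)</h2>}is;
            my $title = $1;
            push @pages, "<html><head><title>$title</title></head>"
                       . "<body>$chunk</body></html>";
        }
        return @pages;
    }

    our @servers = (
        {
            base_url => 'http://localhost:4000/dst_html/',
            email    => 'webmaster@example.com',    # placeholder

            # Return false to skip a URL; here image files are skipped.
            test_url => sub {
                my ($uri) = @_;
                return "$uri" !~ /\.(?:gif|jpe?g|png)$/i;
            },

            # Split the page into sub-pages.  The real callback hands
            # each one to spider.pl for indexing; this sketch only
            # counts them.  Returning false tells the spider not to
            # index the page as a whole.
            filter_content => sub {
                my ($uri, $server, $response, $content_ref) = @_;
                my @pages = split_sections($$content_ref);
                $server->{sections_seen} += @pages;  # illustrative only
                return 0;
            },
        },
    );

    # Exercise the callbacks by hand.
    my $server = $servers[0];
    my $html   = '<h1>Guide</h1><h2>Install</h2><p>a</p>'
               . '<h2>Configure</h2><p>b</p>';
    print "skipping logo.png\n"
        unless $server->{test_url}->('http://localhost/logo.png');
    $server->{filter_content}->(
        'http://localhost/guide.html', $server, undef, \$html);
    print "sections: $server->{sections_seen}\n";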

That's about it.

One trick: for debugging, you can run the spider without indexing:

   ./spider.pl > bigfile.out

Another trick: you can send C<SIGHUP> to I<spider.pl> while indexing,
and it will stop spidering but let swish index what has been read so
far.
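
How I<spider.pl> implements this internally is not covered here, but a
common Perl pattern for this kind of graceful stop is a signal handler
that only sets a flag, which the main loop checks between documents.
The sketch below illustrates that pattern with a made-up work list. It
sends C<SIGHUP> to itself so that it can be run standalone, whereas in
practice you would send the signal from another shell with C<kill
-HUP> and the process ID of the spider.

    use strict;
    use warnings;

    my $stop = 0;
    $SIG{HUP} = sub { $stop = 1 };   # only set a flag

    my @queue = map { "page$_.html" } 1 .. 5;   # made-up work list
    my @fetched;

    for my $url (@queue) {
        last if $stop;               # stop fetching, keep what we have
        push @fetched, $url;
        kill 'HUP', $$ if @fetched == 2;   # simulate the operator
    }

    print "fetched ", scalar(@fetched), " of ", scalar(@queue),
          " documents\n";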

=cut