Recently I had to set up an intranet search engine to crawl trough thousands of PDF files. There are a ton of commercial solutions (read: ) out there on the market, ranging from Google Search Appliance to IBM’s OmniFind. There are also a few good Open Source engines, such as Apache’s Lucene. The problem is that these are primarily intended for enterprises with server farms full of data. That’s really not what I was looking for. I was looking something simple that was easy to set up and maintain. That’s when I came across Xapian. It’s Open Source and lightweight. Combine Xapian with Omega and you got exactly what I was looking for — A lightweight intranet search engine.

This howto will walk you trough how to set up Xapian with Omega on FreeBSD. The version I used was FreeBSD 8.1, but I’m sure any recent version of FreeBSD (7.x>) will do. Please note that I do expect you to know your way around FreeBSD, so I’m not going to spend time on simple tasks like how to edit files etc. I also assume you already got your system up and running.

I’ve called the path we’re going to index (recursively) ‘/path/to/something’. This can be either a local path or something mounted from a remote server. Also, as you’ll see below, a lot of dependencies are installed. This is to increase the number of file-format Xapian will index. It should be able to index PDF-files, Word-files, RTF-files, in addition to plain-text files.

Let’s get started.

Note: If you don’t have the ports-tree installed (/usr/ports), you can download it by simply running:

portsnap fetch extract

Install Apache

/usr/ports/www/apache22  
make install  
echo -e “\\napache22_enable=\\”YES\\”” >> /etc/rc.conf

Install Xapian with Xapian-Omega

cd /usr/ports/www/xapian-omega  
make install

Install Xpdf
Make sure to uncheck X11 and DRAW

cd /usr/ports/graphics/xpdf  
make install

Install Catdoc
Uncheck WORDVIEW

cd /usr/ports/textproc/catdoc  
make install

Install Unzip

cd /usr/ports/archivers/unzip  
make install

Install Gzip

cd /usr/ports/archivers/gzip  
make install

Install Antiword

cd /usr/ports/textproc/antiword  
make install

Install Unrtf

cd /usr/ports/textproc/unrtf  
make install

Install Catdvi

cd /usr/ports/print/catdvi  
make install

Next we need to edit Apache’s config-file (/usr/local/etc/apache22/httpd.conf)

Change:

ScriptAlias /cgi-bin/ “/usr/local/www/apache22/cgi-bin/”

Into:

ScriptAlias /cgi-bin/ “/usr/local/www/xapian-omega/cgi-bin/”

We also need to create a new config-file for Xapian. Create the file /usr/local/etc/apache22/Include/xapian.conf

    Alias /something /path/to/something

            Options Indexes
            AllowOverride None
            Order allow,deny
            Allow from all

        AllowOverride None
        Options None
        Order allow,deny
        Allow from all

With all Apache configuration being done, let’s fire up Apache:

/usr/local/etc/rc.d/apache22 start

Create the holding directory

mkdir -p /usr/local/lib/omega/data/

Copy over the templates. For some reason FreeBSD doesn’t do this by default.

cp -rfv /usr/ports/www/xapian-omega/work/xapian-omega-*/templates /usr/local/lib/omega/

We also need to tell Xapian-Omega where to look for the files. Create the file /usr/local/www/xapian-omega/cgi-bin/omega.conf

\# Directory containing Xapian databases:  
database_dir /usr/local/lib/omega/data

\# Directory containing OmegaScript templates:  
template_dir /usr/local/lib/omega/templates

\# Directory to write Omega logs to:  
log_dir /var/log/omega

\# Directory containing any cdb files for the $lookup OmegaScript command:  
cdb_dir /var/lib/omega/cdb

Create a search page. I’ll just use index.html in Apache’s default DocumentRoot (/usr/local/www/apache22/data/index.html).

Match any word
Match all words