Shuimu Tsinghua BBS (水木清华站): Digest Area

From: reden (鱼～君子律己以利人), Board: Linux
Subject: Searching a Web Site with Linux
Posted at: Shuimu Tsinghua BBS (Mon Oct  5 00:18:52 1998) WWW-POST

"Linux Gazette...making Linux just a little more fun!"

                   Searching a Web Site with Linux

                                   By Branden Williams

As your website grows in size, so will the number of people who visit it.
Most of these people are just like you and me in the sense that they want
to go to your site, click a button, and get exactly the information they
were looking for. To serve these kinds of users a bit better, the Internet
community responded with the ``Site Search'': a way to search a single
website for the information you are looking for. As a system administrator,
I have been asked to provide search engines for people to use on their
websites so that their clients can get to their information as fast as
possible.

Now the trick to most search engines (Internet wide included) is that they
index and search entire sites. Say, for instance, you are looking for used
cars. You decide to look for an early 90s model Nissan truck. You get on the
web and go to AltaVista. If you search for ``used Nissan truck'', you will
most likely come up with a few pages that have listings of cars. The pain
comes when you follow one of those links and see a 400K HTML file with text
listings of used trucks. You have to either go line by line until you find
your choice or, like most people, find it on the page using your browser's
find command.

Now wouldn't it be nice if you could just search for your used truck and get
the results you are looking for in one fell swoop?

A recent search CGI that I designed for a company called Resource Spectrum
(http://www.spectrumm.com/) is what precipitated DocSearch. Resource
Spectrum needed a solution similar to my truck analogy. They are a placement
agency for highly skilled jobs that needed an alternative to posting their
job listings to newsgroups. What was proposed was a searchable Internet
listing of the jobs on their new website.

When the job listing came to us, it was a Word document that had been
exported to HTML. I searched (no pun intended) long and hard for something
that I could use, but nothing turned up. All of the search engines I found
only searched sites, not single documents.

This is where the idea for DocSearch came from.

I needed a simple, clean way to search that single HTML document so users
could get the information they needed quickly and easily.

I got out the old Perl Reference and spent a few afternoons working out a
solution to this problem. After a few updates, you see in front of you
DocSearch 1.0.4. You can grab the latest version at
ftp://ftp.inetinc.net/pub/docsearch/docsearch.tar.gz.

Let's go through the code so we can see how this works. Before we really get
into it, though, you need to make sure you have the CGI Library (cgi-lib.pl)
installed. If you do not, you can download it from
http://www.bio.cam.ac.uk/cgi-lib/. This is simply a Perl library that
contains several useful functions for CGIs. Place it in your cgi-bin
directory and make it world readable and executable
(chmod a+rx cgi-lib.pl).

Now you can start to configure DocSearch. First off, there are a few
constants that need to be set. They describe the characteristics of the
document you are searching. For instance...

# The Document you want to search.
$doc = "/path/to/my/list.html";

Set this to the absolute path of the document you are searching.

# Document Title. The text to go inside the <title></title> HTML tags.
$htmltitle = "Nifty Search Results";

Set this to what you want the results page title to be.

# Optional Back link. If you don't want one, make the string null.
# i.e. $backlink = "";
$backlink = "http://www.inetinc.net/some.html";

If you want to provide a ``Go Back'' link, enter the URL of the page it
should point to.

# Record delimiter. The text which separates the records.
$recdelim = "&nbsp;";

This part is one of the most important aspects of the search. The document
you are searching must have something in between the "records" to delimit
the HTML document. In plain English, you will need to place some HTML
comment or other marker in between each possible result of the search. In my
example, MS Word put the &nbsp; entity in between all of the records by
default, so I just used that as a delimiter.

Next we call ReadParse() to read our information from the HTML form that was
used as a front end to our CGI. Then, to simplify things later, we go ahead
and set the variable $query to be the term we are searching for.

$query = $input{'term'};

This step can be repeated for each query item you would like to use to
narrow your search. If you want any of these items to be optional, just add
a line like this in your code.

if ($query eq "") {
$query = " ";
}

A single space will match virtually any record you search.

Now comes a very important step. We need to make sure that any
metacharacters are escaped. Perl's binding operator (=~) treats
metacharacters in the pattern as regular expression syntax rather than as
literal text. We want to make sure that any characters that are entered into
the form are not going to change the output of our search in any way.

$query =~ s/([-+.<>&|^%=*?()\[\]{}\$\\])/\\$1/g;

Boy does that look messy! That is just a regular expression that puts a
backslash in front of each metacharacter. For example, it will change a +
into a \+.
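
Perl also has a built-in, quotemeta (and the equivalent \Q...\E escape
inside a pattern), that backslashes every non-word character. The following
is a minimal sketch of using it instead of a hand-written substitution. It
is not part of DocSearch; the escape_query helper and the sample strings are
made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not in DocSearch): escape a user-supplied search
# term so that it is matched literally. quotemeta() is a Perl built-in.
sub escape_query {
    my ($term) = @_;
    return quotemeta($term);
}

# Made-up sample data.
my $query  = escape_query("C++ (senior)");
my $record = "Wanted: C++ (senior) developer";

# The escaped term matches only the literal text the user typed.
print "match\n" if $record =~ /$query/i;
```

Note that quotemeta also escapes spaces and other punctuation, so the
escaped string contains more backslashes than the substitution above would
add.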

Now we need to move right along and open up our target document. When we do
this, we will need to read the entire file into one
variable. Then we will work from there.

open (SEARCH, "$doc") || die "Cannot open $doc: $!";
undef $/;
$text = <SEARCH>;
close (SEARCH);

The only thing you may not be familiar with is the undef $/; statement you
see there. $/ is the Perl variable that holds the input record separator,
which normally marks the end of each line read from a file. Because we must
read the entire file into one variable, we undefine it; otherwise only one
line would be read.
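
Because undef $/; changes the separator for the rest of the script, a
common variation is to limit the change with local. The sketch below
assumes Perl 5.6 or later (for the three-argument open and the lexical
filehandle); the slurp_file name is hypothetical and does not appear in
DocSearch.

```perl
use strict;
use warnings;

# Hypothetical helper (not in DocSearch): return the whole contents of a
# file as a single string.
sub slurp_file {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    local $/;          # undefined only until this sub returns
    my $text = <$fh>;  # with $/ undefined, this reads the entire file
    close($fh);
    return $text;
}

# Usage, with $doc set as in the configuration section:
#   my $text = slurp_file($doc);
```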

Now we will start the output of the results page. It is good to customize it
and make it appealing somehow to the user. This is free
form HTML so all you HTML guys, go at it.
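
The article does not show this part of the script, so the following is only
a sketch of what the top of a results page might look like. The page_top
helper and the markup are assumptions, not the author's code; the one fixed
requirement for a CGI is that the Content-type header and a blank line come
before any HTML.

```perl
use strict;
use warnings;

# Hypothetical helper (not in DocSearch): build the HTTP header and the
# opening HTML of the results page from the configured title.
sub page_top {
    my ($htmltitle) = @_;
    return "Content-type: text/html\n\n"
         . "<html>\n<head><title>$htmltitle</title></head>\n"
         . "<body>\n<center><h1>$htmltitle</h1></center>\n";
}

print page_top("Nifty Search Results");
```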

Now we will do the real searching job. Here is the meat of our search. You
will notice there are two commented regular expressions in the search. If
you do not want to display any images or show any links, you should
uncomment those lines.

@records = split(/$recdelim/,$text);

We want to split up the file into an array of records. Each record is a
valid search result, but is separate from the rest. This is where the record
delimiter comes into play.
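
To make the effect of the split concrete, here is a small self-contained
sketch. The sample text is invented and stands in for the contents of the
searched document; the delimiter is the entity that MS Word inserted between
records.

```perl
use strict;
use warnings;

# Record delimiter: the entity MS Word placed between the records.
my $recdelim = "&nbsp;";

# Made-up sample text standing in for the contents of the document.
my $text = "<p>Perl programmer, Dallas</p>&nbsp;"
         . "<p>Unix administrator, Austin</p>&nbsp;"
         . "<p>Network engineer, Houston</p>";

# Each element of @records is one candidate search result.
my @records = split(/$recdelim/, $text);

print scalar(@records), " records\n";
print "$_\n" foreach @records;
```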

foreach $record (@records)
{
#   $record =~ s/<a\b.*?<\/a>//igs; # Do not print links inside this doc.
#   $record =~ s/<img\b[^>]*>//ig;  # Do not display images inside this doc.
    if ( $record =~ /$query/i ) {
        print $record;
        $matches++;
    }
}

This basically prints out every $record that matches our search criteria.
Again, you can change the number of search criteria you use by changing that
if statement to something like this.

if ( ($record =~ /$query/i) && ($record =~ /$anotheritem/) ) {

This will try to match both queries against $record and, upon a successful
match, dump that $record to our results page. Notice how we also increment a
variable called $matches every time a match is made. Its purpose is not so
much to tell the user how many records were found as to tell us when no
matches were found, so we can let the user know that no, the system is not
down; we simply did not match any records for that query.
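
If the number of criteria varies, the chain of && tests can be folded into a
loop. This is a sketch only: matches_all is a hypothetical helper that is
not in DocSearch, the records are invented, and it uses \Q...\E to match
each term literally rather than relying on terms having been escaped
beforehand.

```perl
use strict;
use warnings;

# Hypothetical helper (not in DocSearch): true if $record contains every
# term in @terms. Terms are matched literally and case-insensitively.
sub matches_all {
    my ($record, @terms) = @_;
    foreach my $term (@terms) {
        return 0 unless $record =~ /\Q$term\E/i;
    }
    return 1;
}

# Made-up sample records.
my @records = ("Perl programmer, Dallas", "Perl programmer, Austin");
my $matches = 0;

foreach my $record (@records) {
    if (matches_all($record, "perl", "austin")) {
        print "$record\n";
        $matches++;
    }
}

# The counter lets us report an empty result instead of a blank page.
print "No records found\n" if $matches == 0;
```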

Now that we are done searching and displaying the results of our search, we
need to do a few administrative actions to ensure that we have fully
completed our job.

First off, as I was mentioning before, we need to check for zero matches in
our search and let the user know that we could not find anything to match
the query.

if (!$matches) {
$query =~ s/\\//g;

print << "End_Again";

<center>
<h2>Sorry! "$query" was not found!</h2><p>
</center>
End_Again
}

Notice that lovely regular expression. Having taken all of the trouble to
escape those metacharacters, we now need to remove the escape characters.
This way, when users see that their $query was not found, they will not look
at it and say ``But that is not what I entered!'' Then we dump the HTML that
disappoints the user.

The only two things left to do are to end the HTML document cleanly and to
allow for the back link.

if ( $backlink ne "" ) {
    print "<center>";
    print "<h3><a href=\"$backlink\">Go back</a></h3>";
    print "</center>";
}

print << "End_Of_Footer";

</body>
</html>

End_Of_Footer

All done. Now you are happy because the user is happy. Not only have you
streamlined your website by allowing users to search a single page, but you
have increased the site's usefulness by giving them the results they want.
The likely result of this is more hits. By helping your users find the
information they need, you encourage them to tell their friends about your
site. And their friends will tell their friends, and so on. Putting the
customer first sometimes does work!

※ Source: Shuimu Tsinghua BBS bbs.net.tsinghua.edu.cn [FROM: 202.99.18.67]