
Simple Web Page Keyword Matching Tool

Have you ever created a web page with a lot of detailed information and wanted an easy way for your readers to parse and filter page entries based on user-specified keywords? This article explains how to use an HTML form and JavaServer Pages (JSP) technology to do exactly that. And even if you have never wanted to do this, you might find the example demonstration and code walkthrough useful because they cover how to retrieve request values and compare them to values in lines read from a static file.

How it Works

The HTML form is placed on the page you want to search. It provides a list of selectable keywords — words you know are on the page, that do not match your topic headings, and that you believe your users might want to search on. It also provides an input field for your users to type in a keyword in case they cannot find what they want to search on in the list.

When the user presses the Return key or clicks the Go button, a JSP is called that parses an HTML page one line at a time, looks for the user-specified keywords, and returns a page that lists all lines on the HTML page that contain the keywords. The returned results are organized under the keyword(s) where there is a match.

HTML Form

The figure below shows the HTML form on the left, and the HTML code to create the form on the right. The form is live. Go ahead and select or enter keywords and click the Go button.

When you click the Go button, a JSP page is called that parses a copy of a page and returns a list of articles where any part of the entry contains the specified keyword(s). For example, if you remember there was an article by somebody named Steve that you really liked, type in "Steve" and click the Go button to see a list of Steve's articles.

Note: This might take a few seconds to complete because the search page, described below, has a lot of HTML code for the banner, footer, and left navigation that gets read.
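How the Form Looks

The form shows a multiple-selection list labeled "Select keywords from the list below:", followed by a text field labeled "and/or enter a search phrase:" and a Go button.

HTML to Make the Form
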
<form action="findwords.jsp" method="get">
Select keywords from the list below:
<input type="hidden" name="col" value="searchreports">
<select size="4" name="qp" multiple>
<option value="Jakarta">Jakarta
<option value="CachedRowSet">CachedRowSet
<option value="properties">Properties
<option value="Apache">Apache
</select>

and/or enter a search phrase:
<input type="text" name="qt" size="20" maxlength="50" value="">
<input type="image" src="go.gif" border="0">
</form>
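
As a rough illustration of what this form sends, the sketch below parses a query string of the kind a GET submission would produce into a map from parameter names to lists of values. In the JSP itself the container does this work and exposes the result through request.getParameterValues and request.getParameter. The class name QueryStringSketch and the sample query string are assumptions made for this example only.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of findwords.jsp.
final class QueryStringSketch {

    // Splits "name=value&name=value" pairs and groups the values by name.
    static Map<String, List<String>> parse(String query) {
        Map<String, List<String>> params = new LinkedHashMap<>();
        if (query == null || query.isEmpty()) {
            return params;
        }
        for (String pair : query.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String name = eq >= 0 ? pair.substring(0, eq) : pair;
            String value = eq >= 0 ? pair.substring(eq + 1) : "";
            // A repeated name, such as qp from a multiple select, keeps every value.
            params.computeIfAbsent(decode(name), k -> new ArrayList<>()).add(decode(value));
        }
        return params;
    }

    // Undoes URL encoding; "+" becomes a space.
    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Sample query string assumed for illustration.
        System.out.println(parse("col=searchreports&qp=Jakarta&qp=Apache&qt=Steve"));
    }
}
```

A text field is submitted even when it is left empty, so qt normally arrives as an empty string rather than being absent. That is why the JSP code later checks qtValue.length() as well as testing for null.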

Search Page

The code for the JSP page requires each entry on the search page to sit on a single line, with any wrapping left to the editor or browser. For example, a page entry should look like the following. Note that the entry contains no returns, even though adding them would make the HTML more readable to someone viewing the source file:

<P>
<a href="/pathname/">Maintaining State for HTML Form Buttons</a> by Matthias Laux <br>Here's the scoop on using JavaServer Pages custom tags to maintain button state in your HTML forms. <i>(October 2002)</i>
</p>

In contrast, the example below uses a return after "Laux" and before "Here's" to make the source file easier to read:

<P>
<a href="/pathname/">Maintaining State for HTML Form Buttons</a> by Matthias Laux
<br>Here's the scoop on using JavaServer Pages custom tags to maintain button state in your HTML forms. <i>(October 2002)</i>
</p>

Why No Formatting Returns?

If you leave out formatting returns as in the first example above, the JSP code reads the entire entry as one line, parses it for keywords, and if there is a match, returns the entire entry on the results page. In the second example, the JSP code reads the line up to "Laux", parses it, and if there is a match, returns only the line up to "Laux" on the results page. The code then reads the next line, starting at "Here's", as if it were a separate entry.

In short, the entries are not properly returned because they are broken up. In the example above, if the search term is "Maintaining" only the title-link and author are returned without the blurb, and if the search term is "maintain" only the blurb is returned without the title-link and author.
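
The effect can be reproduced outside the JSP with a small, simplified sketch. It matches a keyword against each line exactly as typed and deliberately leaves out the capitalization variants and the check for the start of the line that findwords.jsp performs. The class name LineBreakSketch and its constants are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demonstration class, not part of findwords.jsp.
final class LineBreakSketch {
    static final String TITLE =
        "<a href=\"/pathname/\">Maintaining State for HTML Form Buttons</a> by Matthias Laux";
    static final String BLURB =
        "<br>Here's the scoop on using JavaServer Pages custom tags to maintain "
        + "button state in your HTML forms. <i>(October 2002)</i>";

    // The entry on one line, and the same entry with a formatting return.
    static final String ONE_LINE = TITLE + " " + BLURB;
    static final String TWO_LINES = TITLE + "\n" + BLURB;

    // Returns every line of the page text that contains the keyword as typed.
    static List<String> matchingLines(String page, String keyword) {
        List<String> hits = new ArrayList<>();
        for (String line : page.split("\n")) {
            if (line.contains(keyword)) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // With the formatting return, each keyword finds only half of the entry.
        System.out.println(matchingLines(TWO_LINES, "Maintaining"));
        System.out.println(matchingLines(TWO_LINES, "maintain"));
        // Without it, the whole entry comes back.
        System.out.println(matchingLines(ONE_LINE, "maintain"));
    }
}
```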

Code Walkthrough

This section walks through each part of findwords.jsp, which is the JSP code called when the user clicks the Go button.

Note: The source code is in a file with a "txt" extension so you can view it as text. A "jsp" extension tells the web server to compile the code into a servlet and execute it.

A JSP looks like an HTML page with servlet code segments embedded between JSP tags. There are a number of different kinds of JSP tags, and this code walkthrough touches on a few of them.

Directives

JSP directives are enclosed by the <%@ and %> directive tags, and are instructions processed by the JSP engine when the JSP page is translated to a servlet. The page directive in this example tells the JSP engine that the scripting language is Java (language="java") and that the indicated Java packages should be imported.

<%@ page language="java" import="java.util.*, java.io.*"
%>

Declarations

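JSP declarations are enclosed by the <%! and %> declaration tags, and let you set up variables for later use in the program. Variables declared this way become fields of the servlet generated from the page, so they are visible throughout the JSP page regardless of where the declaration appears, and they are shared by every request the page serves. You can also declare variables in the scriptlet code at the time you use them; those are local to the request being handled. The declarations in this example declare a random access file and a string variable for reading the search page.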

<%! RandomAccessFile in = null; %>
<%! String s = null; %>

Setting up Files and Variables

The next lines of code initialize variables and open a connection to searchpage.html, which is the page to be searched. The initializations get the length of the file to be searched, get the first line of that file, and set some variables to zero or null.

<%
 // Change directory paths to your application
 File inputFile = new File("/pathname/searchpage.html");
 in = new RandomAccessFile(inputFile, "r");
 // Get the length of the file
 long length = in.length();
 // Read the first line from the file
 s = in.readLine();
 // Flag to tell if match is first in category
 int val = 0;
 // Flag to tell if results are found or not
 int results = 0;
 // Values selected in the list (qp) and typed in the input field (qt)
 String[] qpValues = null;
 String qtValue = null;
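
The read loop that the rest of the walkthrough relies on can be tried on its own. The sketch below writes a three-line stand-in for searchpage.html to a temporary file, then reads it twice with RandomAccessFile, using seek(0) to return to the start between passes. The class name ReadLinesSketch, the temporary file, and its contents are assumptions made for this illustration.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demonstration class, not part of findwords.jsp.
final class ReadLinesSketch {

    // Rewinds the file and collects every line; readLine returns null at end of file.
    static List<String> readAllLines(RandomAccessFile in) throws IOException {
        List<String> lines = new ArrayList<>();
        in.seek(0);
        String s = in.readLine();
        while (s != null) {
            lines.add(s);
            s = in.readLine();
        }
        return lines;
    }

    // Creates a small stand-in page and reads it twice.
    static List<String> demo() {
        try {
            File page = File.createTempFile("searchpage", ".html");
            page.deleteOnExit();
            String html = "<P>\n<A HREF=\"/pathname/\">First entry</A>\n</P>\n";
            Files.write(page.toPath(), html.getBytes(StandardCharsets.ISO_8859_1));
            try (RandomAccessFile in = new RandomAccessFile(page, "r")) {
                readAllLines(in);          // first pass, as for one keyword
                return readAllLines(in);   // second pass sees the same lines again
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Unlike the JSP fragment, this sketch closes the file when it is done by using try-with-resources.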

Results Page and Recording Entries

The following code prints the results page heading, then checks for option values passed from the selectable list (getParameterValues("qp")) and for a value from the input field (getParameter("qt")) on the form. All values found are retrieved and used to build the bulleted index that appears at the top of the results page.

 out.println("<h4>Keyword Search Results</h4>");
 // Start bullet list
 out.println("<ul>");

 // Check for option values
 if(request.getParameterValues("qp") != null) {
   qpValues = request.getParameterValues("qp");
   // Retrieve option values
   for(int j = 0; j < qpValues.length; j++) {
     // Make a bullet list item for each option value
     out.println("<li><a href=#" + qpValues[j] + ">" 
           + qpValues[j] + "</a>");
   }
 }

 // Check for input value
 if(request.getParameter("qt") != null) {
   qtValue=request.getParameter("qt");
   if(qtValue.length() > 0) {
      out.println("<li><a href=#" + qtValue + ">" 
           + qtValue + "</a>");
   }
 }

 // End bullet list
 out.println("</ul>");

Capitalization and Reading from the File

Any option values retrieved are matched to characters in the lines read from the file with their original capitalization as typed on the form, and also converted to all lowercase. This is to catch all possibilities in the entry. For example, "Properties" is checked against each line in searchpage.html as initial cap "P" and as all lowercase, "properties" to account for it appearing at the beginning of a sentence and within a sentence. Values from the input field are checked exactly as typed, as all lowercase, as all uppercase, and as initial caps.

Leading and trailing spaces are trimmed from each line, and each line is checked to see if it begins with "<A HREF". The searchpage.html file is formatted with all lines starting flush left, but the trimming accounts for typing mistakes where a line has spaces or tabs in front of it. All lines of interest in the file begin with <A HREF; the test is case-sensitive, so it would have to be changed to work on a file where the lines of interest begin with a lowercase <a href, with <li, or with something else.
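
The capitalization rules described above can be gathered into one small helper. This is a sketch rather than the code in findwords.jsp: the class and method names are hypothetical, it builds all four variants for any keyword, and it uses Locale.ROOT for the case conversions, which is a choice made here and not something the original code does.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of findwords.jsp.
final class CaseVariantsSketch {

    // As typed, all lowercase, all uppercase, and initial cap; duplicates collapse.
    static Set<String> variants(String keyword) {
        Set<String> forms = new LinkedHashSet<>();
        if (keyword == null || keyword.isEmpty()) {
            return forms;
        }
        forms.add(keyword);
        forms.add(keyword.toLowerCase(Locale.ROOT));
        forms.add(keyword.toUpperCase(Locale.ROOT));
        forms.add(keyword.substring(0, 1).toUpperCase(Locale.ROOT) + keyword.substring(1));
        return forms;
    }

    // True when the trimmed line starts with <A HREF and contains any variant.
    static boolean isMatch(String line, String keyword) {
        if (line == null || !line.trim().startsWith("<A HREF")) {
            return false;
        }
        for (String form : variants(keyword)) {
            if (line.indexOf(form) >= 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(variants("steve"));
        System.out.println(isMatch("  <A HREF=\"/pathname/\">Tips</A> by Steve", "steve"));
    }
}
```

Checking a fixed set of variants still misses mixed capitalization such as "sTeVe". Comparing lowercase copies of both the line and the keyword would catch those cases, at the cost of changing how the original matching behaves.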

Option Values: Looking for Matches

The main body of the code compares the option values to lines read from searchpage.html looking for matches. When matches are found, the line is returned in the results page as a bullet item under its correct category.

  if(request.getParameterValues("qp") != null) {
    qpValues = request.getParameterValues("qp");
    // Iterate through option values
    for (int i = 0; i < qpValues.length; i++) {
      // Read lines until the end of the file
      while(s != null) {
        // Skip empty lines
        if(s.length() > 0) {
          // Only lines that begin with <A HREF are entries
          if(s.trim().startsWith("<A HREF")
                && qpValues[i].length() > 0) {
            // Convert to lowercase
            String lower = qpValues[i].toLowerCase();
            // Look for match; indexOf returns -1 when absent
            if(s.indexOf(qpValues[i]) >= 0
               || s.indexOf(lower) >= 0) {
              results=1;
              // Start new list of matches under topic
              if(val == 0) {
                out.println("<a name="
                      + qpValues[i]
                      + "></a>");
                out.println("<h4>");
                out.println(qpValues[i]);
                out.println("</h4>");
                out.println("<ul>");
                val=1;
              }
              // Add match to existing list
              out.println(s);
              out.println("<p>");
            }
          }
        }
        // Read another line
        s = in.readLine();
      }
      // Close the list only if one was started
      if(val == 1) {
        out.println("</ul>");
      }
      val=0;

Option Values: No Results Found

If no matches are found, a message to that effect is returned on the results page. The file is then reset to the beginning and the first line is read, ready to look for matches against the next option value.

    if(results==0) {
      out.println("<a name=" + qpValues[i] 
               + "></a>");
      out.println("<h4>");
      out.println("No results found for " 
               + qpValues[i]);
      out.println("</h4>");
    } else {
      results=0;
    }
    in.seek(0);
    s = in.readLine();
    }
  }
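
The flow of the two fragments above (one full pass over the file per option value, matches grouped under the value, and a message when nothing matches) can be sketched with in-memory lists in place of the file and the out stream. The class and method names and the sample lines in main are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical restructuring, not the code in findwords.jsp.
final class GroupedResultsSketch {

    // For each keyword, scans every line again, as the JSP does after in.seek(0).
    static Map<String, List<String>> group(List<String> lines, String[] keywords) {
        Map<String, List<String>> byKeyword = new LinkedHashMap<>();
        for (String keyword : keywords) {
            List<String> hits = new ArrayList<>();
            String lower = keyword.toLowerCase(Locale.ROOT);
            for (String line : lines) {
                boolean isEntry = line.trim().startsWith("<A HREF");
                if (isEntry && keyword.length() > 0
                        && (line.indexOf(keyword) >= 0 || line.indexOf(lower) >= 0)) {
                    hits.add(line);
                }
            }
            byKeyword.put(keyword, hits);
        }
        return byKeyword;
    }

    // Builds the results markup: a heading and list per keyword, or a message.
    static String render(Map<String, List<String>> byKeyword) {
        StringBuilder html = new StringBuilder();
        for (Map.Entry<String, List<String>> e : byKeyword.entrySet()) {
            html.append("<a name=").append(e.getKey()).append("></a>\n");
            if (e.getValue().isEmpty()) {
                html.append("<h4>No results found for ").append(e.getKey()).append("</h4>\n");
            } else {
                html.append("<h4>").append(e.getKey()).append("</h4>\n<ul>\n");
                for (String line : e.getValue()) {
                    html.append(line).append("\n<p>\n");
                }
                html.append("</ul>\n");
            }
        }
        return html.toString();
    }

    public static void main(String[] args) {
        // Sample lines assumed for illustration.
        List<String> lines = Arrays.asList(
            "<A HREF=\"/a\">Apache Jakarta notes</A>",
            "<A HREF=\"/b\">Using properties files</A>");
        System.out.print(render(group(lines, new String[] {"Jakarta", "CachedRowSet"})));
    }
}
```

Collecting the matches before printing anything removes the need for the val and results flags, because an empty list already says that no heading or list should be opened.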

Input Values: Looking for Matches

The main body of the code compares the input values to lines read from searchpage.html looking for matches. When matches are found, the line is returned in the results page as a bullet item under its correct category.

  if(request.getParameter("qt") != null) {
    // Get input value
    qtValue = request.getParameter("qt");
    val=0;
    // Start at beginning of file and read a line
    in.seek(0);
    s = in.readLine();

    // Read lines until the end of the file
    while(s != null) {
      // Skip empty lines
      if(s.length() > 0) {
        if(s.trim().startsWith("<A HREF")
           && qtValue.length() > 0) {
          // Create uppercase, lowercase, & init. caps
          String uppercase = qtValue.toUpperCase();
          String lowercase = qtValue.toLowerCase();
          String firstletter = qtValue.substring(0,1);
          String lastletters = qtValue.substring(1);
          String upfirst = firstletter.toUpperCase();
          String initcap = upfirst.concat(lastletters);

          // Look for match; indexOf returns -1 when absent
          if(s.indexOf(qtValue) >= 0 ||
             s.indexOf(lowercase) >= 0 ||
             s.indexOf(initcap) >= 0 ||
             s.indexOf(uppercase) >= 0) {
            results=1;
            // Start new list of matches under topic
            if(val == 0) {
              out.println("<a name="
                    + qtValue + "></a>");
              out.println("<h4>");
              out.println(qtValue);
              out.println("</h4>");
              out.println("<ul>");
              val=1;
            }
            // Add match to existing list
            out.println(s);
            out.println("<p>");
          }
        }
      }
      // Read another line
      s = in.readLine();
    }
    // Close the list only if one was started
    if(val == 1) {
      out.println("</ul>");
    }

Input Values: No Results Found

If no matches are found, a message to that effect is returned on the results page.

    if(results==0 && qtValue.length() > 0) {
      out.println("<a name=" 
          + qtValue + "></a>");
      out.println("<h4>");
      out.println("No results found for " 
          + qtValue);
      out.println("</h4>");
      out.println("<p>");
    } else {
      results=0;
    }
  } 

No Keywords Selected

If the user neither selects a keyword from the list nor types one into the input field, a message saying so is returned on the results page.

  if(request.getParameterValues("qp") == null 
     && (qtValue == null || qtValue.length() == 0)) {
    out.println("<h4>");
    out.println("No Keywords were selected or entered.");
    out.println("</h4>");
    out.println("<p>");
  }
%>

Capturing Keywords

You could easily modify this program to capture the keywords the end user either selects or types into the input field. The reason for capturing the keywords is to see which keywords are most often selected or entered by the user. If you see a lot of entries for a particular keyword it could tell you something about the interests of your users or indicate that a keyword that is frequently input should probably be added as an option value to the selectable list.

To capture the keywords, you would modify the code to open a second RandomAccessFile with read-write permissions and write to that file. The code that gets the option and input values goes to the end of the file, writes the value, and adds a newline (\n) character so each value is on a separate line for readability.

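Note: The values from the selection box are written to keywords.txt with writeUTF, which stores each string in a machine-independent modified UTF-8 encoding preceded by two bytes that give its length. The value from the input field is written with writeChars, which stores two bytes for each character. Because of these extra bytes, keywords.txt may not look like plain text in every editor.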

findwords.jsp shows the full source code with this functionality added.

<%! RandomAccessFile outkw = null; %>

  File outputFile = new File("/pathname/keywords.txt");
  outkw = new RandomAccessFile(outputFile, "rw");

.  .  .

  if(request.getParameterValues("qp") != null) {
    qpValues = request.getParameterValues("qp");
    for(int j = 0; j < qpValues.length; j++) {
      out.println("<li><a href=#" 
           + qpValues[j] + ">" + qpValues[j] + "</a>");
      // Go to the end of keywords.txt and append the value
      outkw.seek(outputFile.length());
      outkw.writeUTF(qpValues[j]);
      outkw.writeByte('\n');
    }
  }

  if(request.getParameter("qt") != null) {
    qtValue=request.getParameter("qt");
    if(qtValue.length() > 0) {
      out.println("<li><a href=#" + qtValue + ">" 
                   + qtValue + "</a>");
      outkw.seek(outputFile.length());
      outkw.writeChars(qtValue);
      outkw.writeByte('\n');
    }
  }
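
If the goal is a keywords file that opens cleanly in a text editor, one alternative is to append each value as plain UTF-8 bytes followed by a newline. The sketch below is a variation on the approach above, not the code in findwords.jsp, which uses writeUTF and writeChars. The class name KeywordLogSketch and the temporary file are assumptions made for this illustration.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical alternative, not part of findwords.jsp.
final class KeywordLogSketch {

    // Appends one keyword and a newline to the end of the log file.
    static void append(File log, String keyword) {
        try (RandomAccessFile outkw = new RandomAccessFile(log, "rw")) {
            outkw.seek(outkw.length());
            outkw.write(keyword.getBytes(StandardCharsets.UTF_8));
            outkw.writeByte('\n');
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes two sample keywords to a temporary file and returns its contents.
    static String demo() {
        try {
            File log = File.createTempFile("keywords", ".txt");
            log.deleteOnExit();
            append(log, "Jakarta");
            append(log, "Steve");
            return new String(Files.readAllBytes(log.toPath()), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

This sketch opens and closes the file for every keyword and does nothing to coordinate simultaneous requests, so entries from concurrent users could interleave. A page with real traffic would need some form of synchronization around the write.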

Conclusion

JSP technology makes it easy to write a simple search engine to parse an HTML page for keyword matches. This simple program is specific to a page with a certain formatting, but can easily be adapted to work on pages with different formats.

Exercise 1

A good exercise would be to add code that reads each entry correctly regardless of whether formatting returns are placed within it to make the source easier to read.
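
One possible starting point for this exercise, offered as a sketch and not as the intended solution, is to join the lines between an opening <P> and a closing </p> into a single entry before matching. It assumes the paragraph tags sit on lines of their own, as in the Search Page examples above. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of findwords.jsp.
final class EntryJoinerSketch {

    // Joins the lines found between <P> and </p> into one entry per paragraph.
    static List<String> joinEntries(List<String> lines) {
        List<String> entries = new ArrayList<>();
        StringBuilder current = null;          // null while outside a paragraph
        for (String raw : lines) {
            String line = raw.trim();
            String lower = line.toLowerCase(Locale.ROOT);
            if (lower.equals("<p>")) {
                current = new StringBuilder();
            } else if (lower.equals("</p>")) {
                if (current != null && current.length() > 0) {
                    entries.add(current.toString());
                }
                current = null;
            } else if (current != null && !line.isEmpty()) {
                if (current.length() > 0) {
                    current.append(' ');       // a return becomes a single space
                }
                current.append(line);
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "<P>",
            "<a href=\"/pathname/\">Maintaining State for HTML Form Buttons</a> by Matthias Laux",
            "<br>Here's the scoop.",
            "</p>");
        System.out.println(joinEntries(lines));
    }
}
```

Each joined entry can then be matched against the keywords as a whole, so a formatting return no longer splits the title from the blurb.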

Exercise 2

Adapt the program so it reads from a URL connection. You would use the URL class and can see an example in the Reading Directly from a URL chapter of The Java Tutorial.

© 1994-2005 Sun Microsystems, Inc.