 

Corporate intranets are heterogeneous environments composed of Web servers and search engines from numerous vendors. In such a disparate environment, how do you create a corporate collection of indexed documents for use by a single search facility? One method is to use a catalog or index server, such as Netscape's Compass Server or Microsoft's Index Server. These products employ robots or agents that build collections of indexed documents by crawling your company's intranet via URLs. While this method is very effective, it requires careful planning and administration. An alternative is to write a Java servlet that connects to several search engines and compiles the results into a single document.

What Is a Java Servlet?
The Java servlet API was developed by Sun Microsystems to provide a mechanism for implementing Web server-side logic in the Java programming language. Java servlets are similar to CGI (Common Gateway Interface) programs in that they provide an HTTP-based mechanism for receiving user input and producing output in the form of HTML. Servlets offer a significant performance boost over CGI programs because of their architecture and implementation: a CGI program requires the creation of a separate process to handle each client request, which consumes a significant amount of system resources and processing time per connection. Web servers that implement the Java servlet specification load and instantiate any registered Java servlets upon start-up, and each client HTTP request is handled by a new thread within the server's process space. Thus, servlets perform as if they were developed with native interface APIs such as NSAPI (Netscape Web server) or ISAPI (Microsoft Web server). A Java servlet is created by subclassing the HttpServlet (javax.servlet.http.HttpServlet) class and overriding one or more of the HTTP processing methods.

The Java servlet architecture can also be used to enhance the performance of database applications by implementing database connection pools. In traditional Web-based database applications, the overhead of connecting to a database is incurred during each client request. Servlets that implement database connection pools create a collection of instantiated database connection objects during servlet initialization, so the time required to create the objects and initialize the connections is incurred before any client HTTP requests are handled. Each subsequent HTTP request is assigned one of the pooled connection objects, reducing the required processing time. The size of the pool is set to the number of anticipated concurrent users; if demand exceeds that threshold, a new connection object is instantiated and added to the pool.
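The pooling strategy described above can be sketched in plain Java. The ConnectionPool class and its checkOut/checkIn methods below are illustrative names, not part of the servlet API:

```java
import java.util.Vector;

// A minimal object pool, illustrating the strategy described above:
// a fixed set of connections is created up front during initialization,
// and the pool grows only when demand exceeds the initial size.
class ConnectionPool {
    private final Vector available = new Vector();

    public ConnectionPool(int initialSize) {
        for (int i = 0; i < initialSize; i++)
            available.addElement(createConnection());
    }

    // Stands in for an expensive operation such as
    // DriverManager.getConnection in a real database pool
    protected Object createConnection() {
        return new Object();
    }

    // Hand out a pooled connection, growing the pool if it is exhausted
    public synchronized Object checkOut() {
        if (available.isEmpty())
            return createConnection();
        Object conn = available.firstElement();
        available.removeElementAt(0);
        return conn;
    }

    // Return a connection for reuse by a later request
    public synchronized void checkIn(Object conn) {
        available.addElement(conn);
    }
}
```

A real pool would hold java.sql.Connection objects; plain Object stands in here so the sketch runs anywhere.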

Architecture Overview
The search servlet combines the output from registered search engines by coordinating multiple threads that work in tandem to create a consolidated list of search results. This multithreaded approach increases the overall performance of the application by allowing the search engines to run concurrently. Each engine in the integrated search is registered via a URL entry in a configuration file (see Figure 1). The servlet strips the results from each search engine in round-robin fashion, one result per engine per pass, ensuring that the most relevant results from every engine appear at the top of the final document (see Figure 2). To achieve a consistent look and feel across diverse search engine results, the output from each search engine is parsed to extract the HTML anchor tags. An HTML template file determines the final HTML representation of the results; using simple meta tag substitution, the template file eliminates the need to edit the servlet's Java code when the presentation changes.
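The round-robin stripping scheme can be sketched independently of the servlet plumbing. The ResultMerger class and its sample data are illustrative:

```java
import java.util.Vector;

// Round-robin merge: take one result from each engine per pass, so the
// top-ranked hit of every engine lands near the top of the merged list.
class ResultMerger {
    public static Vector merge(Vector[] engineResults) {
        Vector merged = new Vector();
        int exhausted = 0;
        while (exhausted < engineResults.length) {
            exhausted = 0;
            for (int i = 0; i < engineResults.length; i++) {
                if (engineResults[i].isEmpty()) {
                    exhausted++;   // this engine has no results left
                    continue;
                }
                // move the engine's current best result to the merged list
                merged.addElement(engineResults[i].firstElement());
                engineResults[i].removeElementAt(0);
            }
        }
        return merged;
    }
}
```

With two engines returning ["a1", "a2"] and ["b1"], the merged order is a1, b1, a2: each engine's best hit surfaces before any engine's second-best.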

Figure 1

Figure 2

Servlet Design
The search servlet is composed of three classes, each encapsulating a specific category of functionality in a reusable module. The HTMLAnchorParser class parses an input stream, extracting the HTML anchor tags and storing them in an internal data structure. The SearchEngine class connects to a search engine via a URL, initiates a search request and stores the results. It extends HTMLAnchorParser to provide parsing functionality for the search engine results, and implements the java.lang.Runnable interface to support concurrent processing via threads. Lastly, the SearchManager class coordinates all search engine threads and handles HTTP requests from the Web server by extending the HttpServlet class in the Java servlet package (javax.servlet.http); its doGet method is overridden to implement the necessary HTTP request handling. Figure 3 depicts the class diagram for the servlet, designed using Rational Rose for Java, an object-oriented analysis and design tool that provides powerful design and code generation functionality for Java development. Designing the system in a modular fashion such as this promotes reuse and decreases the time required for development and testing.

Figure 3

Implementing the Servlet
Listing 1 contains the source code for the SearchManager class. HTTP GET requests are handled by overriding the doGet method of the inherited HttpServlet class. When this method is invoked, it stores the output stream obtained from the HttpServletResponse object via the getOutputStream method. Sending output back to the client's browser is simplified by redirecting System.out and System.err to this output stream, which is accomplished by invoking the setOut and setErr methods of the System class; subsequent calls to System.out.println and System.err.println send their output to the client's browser. The HTTP response header's content type field is set to "text/html" via a call to the setContentType method. Without this call, the browser would not know how to interpret the information being sent back from the servlet. The search servlet accepts a single CGI parameter that contains the criteria for the integrated search; the getParameter method of the HttpServletRequest class is used to access this variable by name. In Listing 1 you will notice that the doGet method invokes an initialize method, which reads the search engine configuration file and creates a thread for each entry. A thread group is used to monitor the status of the search engine threads and determine when they have finished processing; this is implemented using a busy-wait loop with the ThreadGroup.activeCount and Thread.sleep methods. The rest of the doGet method processes the HTML template file and outputs the search engine results when the "<<results>>" meta tag is encountered. The results from the search engines are stripped by outputting a single line from each search engine in turn, removing a search engine from consideration once its results have been exhausted.
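The thread-coordination portion of initialize can be sketched on its own. The WaitDemo class and its runAll method are illustrative names:

```java
// Sketch of the busy-wait pattern used by initialize(): worker threads
// run in one ThreadGroup, and the caller polls activeCount() until all
// of them have exited.
class WaitDemo {
    public static int runAll(Runnable[] tasks) throws InterruptedException {
        ThreadGroup group = new ThreadGroup("workers");
        for (int i = 0; i < tasks.length; i++)
            new Thread(group, tasks[i]).start();
        // poll until every thread in the group has finished
        while (group.activeCount() > 0)
            Thread.sleep(50);
        return group.activeCount();
    }
}
```

Polling with a short sleep is simple but imprecise; Thread.join on each worker would be the more exact alternative, at the cost of tracking the Thread objects individually.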

Listing 2 contains the source code for the HTMLAnchorParser class, which provides methods for reading and parsing the HTML anchor tags from a supplied java.io.InputStream object. The ignoreAnchor method provides a mechanism for ignoring HTML anchor tags that contain certain string patterns, enabling the servlet to skip irrelevant or ornamental anchor tags generated by various search engines. A call is made to Thread.yield() while this class reads data from the given input stream, yielding the CPU to other competing threads; without this call, a single thread could dominate the CPU.

The SearchEngine class (see Listing 3) inherits the HTML anchor parsing capabilities of the HTMLAnchorParser class while implementing the java.lang.Runnable interface to provide multithreaded support. This class creates a URLConnection instance from a specified URL, passing the associated InputStream to the HTMLAnchorParser.getContent method to return and store a list of HTML anchor tags.
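The same java.net pattern can be exercised without a live search engine by pointing a URLConnection at a file: URL. The FetchDemo class below is an illustrative name, not part of the article's code:

```java
import java.io.*;
import java.net.*;

// Open a URLConnection and read its content line by line -- the same
// java.net pattern SearchEngine.run() applies to an HTTP search URL.
class FetchDemo {
    public static String fetch(String url) throws IOException {
        URLConnection conn = new URL(url).openConnection();
        conn.connect();
        BufferedReader in = new BufferedReader(
            new InputStreamReader(conn.getInputStream()));
        StringBuffer content = new StringBuffer();
        String line;
        while ((line = in.readLine()) != null)
            content.append(line).append('\n');
        in.close();
        return content.toString();
    }
}
```

Because URLConnection abstracts the protocol, the identical code serves http:, file: and other supported URL schemes.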

The SearchManager strips the output from the SearchEngine instances using calls to the getResultItem and removeResultItem methods. Once a result is read it is removed from the associated SearchEngine object, making the next result available for the subsequent pass. The getResultsCount method is used to determine whether all the results have been read from a specific SearchEngine object and whether it can be ignored in subsequent stripping passes (see Figure 2).

Conclusion
The java.net package provides a feature-rich set of classes for interfacing with and controlling HTTP-based resources. Coupled with the Java servlet API, it lets developers easily create CGI-style applications that combine socket programming, HTTP handling and multithreading to glue together Web and legacy systems across the corporate intranet.

About the Author
Eric Greenfeder currently works for the BASF Corporation as a Senior Internet Architect. He specializes in Java development, Web security, CORBA and object-oriented analysis and design. Eric has been programming in Java since 1996 and currently teaches an internal Java course to BASF developers. You can reach him by e-mail at greenfe@basf.com

	

Listing 1: Search Manager class.
 
import java.io.*; 
import java.util.*; 
import java.lang.*; 
import javax.servlet.*; 
import javax.servlet.http.*; 

// THIS CLASS IS USED TO MANAGE THE SearchEngine THREADS  
// AND PROVIDE JAVA SERVLET FUNCTIONALITY  

public class SearchManager extends HttpServlet 
{  
    private ServletOutputStream out; // THE OUTPUTSTREAM BACK TO THE CLIENT'S BROWSER 
    private String query=null;     // THE QUERY CRITERIA TO USE IN THE SEARCH  
    private Vector searchEngines = new Vector(); // A LIST OF SearchEngine OBJECTS 
    // A ThreadGroup TO MANAGE THE SearchEngine THREADS 
    private ThreadGroup threadgroup = new ThreadGroup("searchEngines");  
  
    // doGet PROCESSES HTTP GET REQUESTS FROM CLIENT CONNECTIONS  
    public void doGet (HttpServletRequest req,  
                       HttpServletResponse res)  
                       throws ServletException, IOException {  
      int cnt = 1,index=0; 
      String link,inline; // BUFFER VARIABLES  
      String currenthost=null; // THE HOST NAME OF THE CURRENT SearchEngine 
      SearchEngine currentse=null; // CURRENT SearchEngine OBJECT 
      Enumeration se=null; 
  
      // GET THE RESPONSE OUTPUT STREAM AND REDIRECT THE out  
      // AND err OUTPUTSTREAMS TO RETURN RESULTS TO THE CLIENT  
      // CONNECTIONS 
      out = res.getOutputStream(); 

      // REDIRECT ERRORS TO THE CLIENT'S BROWSER 
      System.setErr(new PrintStream(out));  

      // REDIRECT STANDARD OUTPUT TO THE CLIENT'S BROWSER  
      System.setOut(new PrintStream(out));  

      // SET THE RESPONSE CONTENT TO HTML TEXT 
      res.setContentType("text/html"); 
  
      // GET THE QUERY PARAMETER 
      try { 
        query = req.getParameter("query"); 
       if(query == null || query.length() == 0) { 
          out.println("<H1 ALIGN=CENTER>Please input a "+ 
                      "search string !</H1>\n"); 
          out.println("</BODY></HTML>\n"); 
          out.flush(); 
          return; 
        } 
      } 
      catch (Exception e) {  
        System.err.println("SearchManager (doGet): "+e);  
        System.err.flush(); } 

      // INITIALIZE THE SEARCH ENGINE THREADS 
      if(!initialize()) return;  

      try { 
      // READ THE TEMPLATE USED TO FORMAT THE SEARCH RESULTS 
      BufferedReader in =  
          new BufferedReader(new FileReader(System.getProperty("user.dir")+ 
                                            "/searchservlet.pat")); 
      // READ IN THE TEMPLATE FILE UNTIL THE <<results>>  
      // TAG IS FOUND, THEN WRITE OUT THE RESULTS. LOOP ACROSS 
      // EACH SEARCH ENGINE GATHERING THE RESULTS FROM EACH SEARCH  
      // ENGINE ONE RESULT AT A TIME. THIS WAY THE MORE PERTINENT  
      // SEARCH RESULTS FROM EACH ENGINE WILL APPEAR AT THE TOP OF  
      // THE DOCUMENT. 
      while((inline=in.readLine()) != null) { 
        if((index=inline.toLowerCase().indexOf("<<results>>")) > -1) { 
          System.out.println(inline.substring(0,index));  
          int exhaustedEngines=0; 
          while(searchEngines.size() > exhaustedEngines) { 
            exhaustedEngines=0;  
            // if all the engines have been exhausted then this  
            // variable will = searchEngines.size() 
            for(int i=0;i<searchEngines.size();i++) { 
              currentse = (SearchEngine)searchEngines.elementAt(i);  
              currenthost = currentse.getHost(); 
              // THIS SEARCH ENGINE'S RESULTS HAVE BEEN EXHAUSTED  
              if(currentse.getResultsCount() == 0) {  
                ++exhaustedEngines; 
                continue; 
              }  
              try {  
                // get the first element 
                link = currentse.getResultItem(0);  
                currentse.removeResultItem(0);  
              } 
              catch (ArrayIndexOutOfBoundsException e) {  
                  ++exhaustedEngines; 
                  continue;  
              }  
  
              // WE DO NOT WANT IMAGES AND BLANK ANCHORS 
              if(link.toLowerCase().indexOf("http://") == -1) { 
                int idx; // SCRATCH VARIABLE 
                if((idx = link.toLowerCase().indexOf("href=\"")) > -1) { 
                  if(link.toLowerCase().indexOf("href=\"/") > -1) 
                    link = link.substring(0,idx+6)+currenthost+ 
                           link.substring(idx+6);  
                  else 
                    link = link.substring(0,idx+6)+currenthost+"/"+ 
                           link.substring(idx+6);  
                }  
                else { 
                  idx = link.toLowerCase().indexOf("href="); 
                  if(link.charAt(idx+5) == '/')  
                    link = link.substring(0,idx+5)+currenthost+ 
                           link.substring(idx+5);  
                  else 
                    link = link.substring(0,idx+5)+currenthost+"/"+ 
                           link.substring(idx+5);  
                }  
              } 
              out.println(String.valueOf(cnt++)+". "+link+"<BR>");  
            } 
          }  
          // PRINT OUT THE REST OF THE LINE AFTER THE  
          // <<results>> META TAG 
          System.out.println(inline.substring(index+11));  
        } 
        else { 
          System.out.println(inline); 
        }  
       }  
      } catch (Exception e) { System.err.println("SearchManager (doGet): "+e); } 
      out.flush();  
      out.close(); 
    } 
    /* THIS METHOD INITIALIZES AND RUNS THE SEARCH ENGINES  
     AND WAITS FOR ALL SEARCH ENGINE THREADS TO COMPLETE.  
     THE FILE searchurls.conf IS READ FROM THE CURRENT WORKING 
     DIRECTORY TO PROVIDE A LIST OF SEARCH ENGINE URL'S. THE  
     URLS LISTED IN THE CONFIGURATION FILE MUST END WITH THE  
     SEARCH ENGINE'S QUERY PARAMETER FOLLOWED BY AN EQUAL SIGN.  
     example:  www.search.com/cgi-bin/search?query= */ 

    public boolean initialize() { 
      String inline;            // A SCRATCH VARIABLE  
      String servlet_dir=null;  // THE SERVLET HOME DIRECTORY 
      SearchEngine se=null;     // A SearchEngine OBJECT 
  
      try { 
        // GET THE SERVLET HOME DIRECTORY 
        servlet_dir = System.getProperty("user.dir"); 

        // OPEN AN INPUT STREAM TO THE CONFIGURATION FILE THAT 
        // LISTS THE URL'S OF THE SEARCH ENGINES TO BE INTEGRATED 
        BufferedReader in =  
          new BufferedReader(new FileReader(servlet_dir.replace('\\','/')+ 
          (servlet_dir.charAt(servlet_dir.length()-1) == 
          '/'?"searchurls.conf":"/searchurls.conf")));  
  
        // INITIALIZE THE LIST OF SEARCH ENGINES 
        searchEngines.removeAllElements(); 
  
        // READ IN THE URL'S FROM THE searchurls.conf FILE 
        while((inline=in.readLine()) != null) {  
          // CREATE A SEARCH ENGINE INSTANCE USING THE URL  
          // READ IN FROM THE searchurls.conf FILE AND  
          // CONCATENATE THE USER SUPPLIED SEARCH CRITERIA  
          // NOTE: THE URL LISTED IN THE CONFIGURATION FILE  
          // MUST END WITH THE SEARCH ENGINES QUERY PARAMETER  
          // FOLLOWED BY AN EQUAL SIGN. 
          // example:  www.search.com/cgi-bin/search?query=  

          se = new SearchEngine(inline.trim()+query); 
          se.addIgnoreTag("<img"); 
          se.addIgnoreTag("<IMG"); 
          se.addIgnoreTag(">_>"); 
          searchEngines.addElement(se); 
          // START THE SEARCH ENGINE THREAD 
          new Thread(threadgroup,se).start();  
        }                                       
  
        // WAIT FOR ALL OF THE SEARCH THREADS TO FINISH PROCESSING 
        while(threadgroup.activeCount() > 0) 
          Thread.sleep(50);  
  
      } catch (Exception e) {  
        System.err.println("initialize: "+e); System.err.flush(); 
        return false; } 
      return true;  
    } 
} 
  

Listing 2: HTMLAnchor Parser class.
 
import java.io.*; 
import java.lang.*; 
import java.util.*; 

public class HTMLAnchorParser { 
  private DataInputStream in=null; // INPUT STREAM TO READ 
  private Vector ignoreTags=new Vector(); // TAG IGNORE CRITERIA 
  
  // CONSTRUCTOR TO INITIALIZE THE CLASS  
  public HTMLAnchorParser() { ; }  
  
  // CONSTRUCTOR: INITIALIZE CLASS USING INPUT STREAM  
  public HTMLAnchorParser(InputStream in) { 
    setInputStream(in);  
  } 
  
  // READS InputStream AND RETURNS A VECTOR OF ANCHOR TAGS 
  public Vector getAnchorTags(InputStream in) throws IOException 
  { 
    setInputStream(in); 
    return(getAnchorTags()); 
  } 
  
  // SETS THE InputStream TO BE PARSED  
  private void setInputStream(InputStream in) 
  { 
    this.in = new DataInputStream(in);  
  } 

  // ADDS A TAG TO THE IGNORE TAG LIST. WHILE READING A URL, 
  // IF A LINK CONTAINS ANY MATCHES FROM THE IGNORE TAGS LIST 
  // IT IS NOT INCLUDED IN THE RESULTS 
  
  public void addIgnoreTag(String tag) { ignoreTags.addElement(tag); } 

  // DETERMINES IF A TAG SHOULD BE IGNORED 
  private boolean ignoreAnchor(String anchor) { 
    for(int i=0;i<ignoreTags.size();i++) 
      if(anchor.indexOf((String)ignoreTags.elementAt(i)) > -1) 
        return true; 
    return false; 
  } 
  
  // READS AN InputStream AND RETURNS A VECTOR OF ANCHOR TAGS 
  public Vector getAnchorTags() throws IOException {  
    int charbuf[] = new int[4]; 
    StringBuffer sb = new StringBuffer(); 
    StringBuffer link = new StringBuffer(); 
    boolean startrecording=false; 
    Vector results = new Vector(); 
  
    try{ 
      if(in == null) 
        return null; 
      while(true) { 
        // YIELD THE CPU TO OTHER THREADS 
        Thread.yield();  
        charbuf[0] = in.readByte(); 
        if(charbuf[0] == '<') {  // START OF TAG  
          // WE ARE RECORDING SO LOOK FOR ENDING TAG </A  
          if(startrecording) {  
            charbuf[1] = in.readByte();  
            charbuf[2] = in.readByte();  
            if(charbuf[1] == '/' && (charbuf[2] == 'a' || 
               charbuf[2] == 'A')) { 
              link.append("</A>"); 
              startrecording = false;  
              if(!ignoreAnchor(link.toString())) 
                results.addElement(link.toString()); 
              link = new StringBuffer(); 
            } 
            else { 
              // APPEND THE THREE CHARACTERS WE JUST READ 
              link.append((char)charbuf[0]); 
              link.append((char)charbuf[1]); 
              link.append((char)charbuf[2]); 
            } 
          } 
          else {  
       // SKIP PAST SPACES 
           while((char)(charbuf[1]=in.readByte()) == ' ') ;  
           charbuf[2] = (char)in.readByte(); 
           // START RECORDING IF THIS TAG IS AN ANCHOR <A 
           if(charbuf[0] == '<' && (charbuf[1] == 'a' ||  
              charbuf[1] == 'A') && charbuf[2] == ' ') {  
             link.append("<a ");  
             startrecording = true;  
           }  
           else {  // RESET NECESSARY VARIABLES 
             link = new StringBuffer(); 
             startrecording = false; 
             continue; 
           }  
          } 
        } 
        else if(startrecording)  
          link.append((char)charbuf[0]);  
      } 
    } 
    // EOFException CAUGHT HERE 
    catch (EOFException e) { ; }  
    return results; 
  } 
} 

Listing 3: SearchEngine class.
 
import java.net.*; 
import java.io.*; 
import java.util.*; 
import java.lang.Runnable; 
import java.util.Vector; 

// THIS CLASS OPENS A CONNECTION TO A SPECIFIED URL AND  
// READS THE CONTENTS PARSING OUT THE ANCHOR TAGS,  
// STORING THEM IN A VECTOR  
public class SearchEngine extends HTMLAnchorParser implements Runnable { 
    private String query; 
    private Vector results = new Vector(); 
    private String serverURL; 
  
    // CONSTRUCTOR TO INSTANTIATE THE CLASS WITH A URL 
    public SearchEngine(String serverURL) { 
      this.serverURL = serverURL;  
    } 
  
    // METHOD THAT GETS CALLED WHEN THE THREAD IS STARTED  
    // USING THE START METHOD  
    public void run() { 
      try { 
        URLConnection urlcon =  
                       (new URL(serverURL)).openConnection(); 
        if(urlcon == null) { 
           System.err.println("SearchEngine (run): "+ 
                              "Error opening URL connection."); 
           System.err.flush(); 
           return; 
        } 
  
        urlcon.connect();  
        getContent(urlcon);  
      } 
      catch (Exception e) {  
        System.err.println("SearchEngine(run): "+e);  
        System.err.flush(); } 
    }  

    // GETS THE CONTENT BY READING THE INPUT STREAM SPECIFIED  
    // IN THE URLConnection OBJECT  
    public void getContent(URLConnection urlc) {  
      try { 
        results = getAnchorTags(urlc.getInputStream());  
      }  
      catch (Exception e) {  
        System.err.println("SearchEngine (getContent): "+e);  
        System.err.flush(); } 
    }  
  
    // OUTPUTS THE RESULTS TO System.out. YOU CAN REDIRECT  
    // THIS OUTPUT USING THE System.setOut METHOD 
    public synchronized void outputResults() { 
       try { 
         for(Enumeration e=results.elements();e.hasMoreElements();) { 
            System.out.println((String)e.nextElement()); 
            System.out.flush(); 
         } 
       } catch (Exception e) {  
           System.err.println("SearchEngine (outputResults): "+e);  
           System.err.flush(); } 
    } 
  
    // SETS THE QUERY CRITERIA FOR A SEARCH 
    public void setQuery(String query) { 
      this.query = query; 
    } 

    // GETS THE QUERY CRITERIA FOR A SEARCH 
    public String getQuery() { 
      return query; 
    } 

    // RETURNS THE RESULTS OF THE SEARCH  
    // getContent MUST BE CALLED BEFORE THIS  
    // FUNCTION IS USED 
    public Vector getResults() { 
      return results; 
    } 

    // RETURNS THE HOST PORTION OF THE CURRENT SEARCH URL 
    public String getHost() { 
      int index = serverURL.indexOf("//"); 
      index = serverURL.indexOf("/",index+3); 
      return serverURL.substring(0,index); 
    } 
  
    // SET THE URL USED TO EXECUTE THE SEARCH 
    public void setServerURL(String serverURL) { 
      this.serverURL = serverURL;  
    } 

    // RETURNS THE URL OF THE SEARCH ENGINE 
    public String getServerURL() { 
      return serverURL; 
    } 

    // RETURNS THE NUMBER OF SEARCH RESULT ITEMS  
    public int getResultsCount() { return results.size(); } 
  
    // RETURNS A SEARCH RESULT GIVEN AN INDEX 
    public String getResultItem(int index) throws ArrayIndexOutOfBoundsException { 
      return (String)results.elementAt(index);  
    } 
  
    // REMOVES A SEARCH RESULT ITEM GIVEN AN INDEX  
    public void removeResultItem(int index) throws ArrayIndexOutOfBoundsException { 
      results.removeElementAt(index);  
    } 
} 
  
      
 

All Rights Reserved
Copyright ©  2004 SYS-CON Media, Inc.
  E-mail: info@sys-con.com

Java and Java-based marks are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. SYS-CON Publications, Inc. is independent of Sun Microsystems, Inc.