Corporate intranets are heterogeneous environments composed of Web servers and search engines from numerous vendors. In such a disparate environment, how do you create a corporate collection of indexed documents for use by a single search facility? One method is to use a catalog or index server, such as Netscape's Compass Server or Microsoft's Index Server. These products employ robots or agents that build collections of indexed documents by crawling through your company's intranet via URLs. While this method is very effective, it requires careful planning and administration. An alternative method is to write a Java servlet that connects to several search engines and compiles the results into a single document.
What Is a Java Servlet?
The Java servlet API was developed by Sun Microsystems to provide a mechanism for implementing Web server-side logic using the Java programming language. Java servlets are similar to CGI (Common Gateway Interface) programs in that they provide an HTTP-based mechanism for receiving user input and producing output in the form of HTML. Servlets offer a significant performance boost over CGI programs due to their architecture and implementation. CGI programs require the creation of a separate process to handle each client request. This approach consumes a significant amount of system resources and processing time for each client connection. Web servers that implement the Java servlet specification load and instantiate any registered Java servlets upon start-up. Client HTTP requests are then handled by creating a new thread within the server's process space. Thus, servlets perform as if they were developed with native interface APIs such as NSAPI (Netscape Web Server) or ISAPI (Microsoft Web Server). Java servlets can be created easily by subclassing the HttpServlet (javax.servlet.http.HttpServlet) class and overriding one of the HTTP processing methods, such as doGet.
The Java servlet architecture can be used to enhance the performance of database applications by implementing database connection pools. In traditional Web-based database applications, the overhead of connecting to a database is incurred during each client request. Servlets that implement database connection pools create a collection of instantiated database connection objects during servlet initialization. Therefore, the time required to create the objects and initialize the connections is incurred before any client HTTP requests are handled. Subsequent HTTP client requests are assigned one of the pooled database connection objects, reducing the required processing time. The size of the pool is set to the number of anticipated concurrent users. If the number of concurrent requests exceeds this predetermined threshold, a new connection object is instantiated and added to the pool.
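The pooling idea can be sketched without a real database. The following is a minimal illustration only, not code from the search servlet: the SimplePool class and its method names are hypothetical, and a plain Supplier stands in for whatever would create a java.sql.Connection. It creates a fixed number of objects up front and creates an extra one only when every pooled object is checked out.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

// Hypothetical sketch of an object pool; a real pool would hold
// java.sql.Connection objects and also handle broken connections.
final class SimplePool<T> {
    private final Deque<T> idle = new ArrayDeque<T>();
    private final Supplier<T> factory;
    private int created = 0;

    // Pre-create initialSize objects so the cost is paid before any request arrives.
    SimplePool(int initialSize, Supplier<T> factory) {
        this.factory = factory;
        for (int i = 0; i < initialSize; i++) {
            idle.push(create());
        }
    }

    private T create() {
        created++;
        return factory.get();
    }

    // Hand out an idle object, or grow the pool when none is left.
    synchronized T acquire() {
        return idle.isEmpty() ? create() : idle.pop();
    }

    // Return an object so that a later request can reuse it.
    synchronized void release(T item) {
        idle.push(item);
    }

    synchronized int createdCount() { return created; }

    synchronized int idleCount() { return idle.size(); }
}
```

A servlet would typically build such a pool once in its init method and wrap each request's work in an acquire/release pair, releasing in a finally block so that a failed request does not leak a connection.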
Architecture Overview
The search servlet combines the output from registered search engines by coordinating multiple threads that work in tandem to create a consolidated list of search results. This multithreaded approach increases the overall performance of the application by allowing the search engines to run concurrently. Each engine in the integrated search is registered via a URL entry in a configuration file (see Figure 1). The servlet stripes (interleaves) the results, taking one result from each search engine in turn, so that the most relevant results from every engine appear at the top of the final document (see Figure 2). To achieve a consistent look and feel across diverse search engine results, the output from each search engine is parsed to extract the HTML anchor tags. An HTML template file determines the final HTML representation of the results. Because the results are inserted through a simple meta tag substitution, the page layout can be changed in the template file without editing the servlet's Java code.
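To make the meta tag substitution concrete, here is a small sketch of how one template line can be split around the "<<results>>" tag, with the numbered results written in between. The TemplateSketch class and its render method are illustrative names, not part of the listings; the tag name, the case-insensitive match and the numbered "<BR>" output follow the approach used in Listing 1.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of the template substitution; not the servlet's own code.
final class TemplateSketch {
    static final String TAG = "<<results>>";

    // Replace the tag in one template line with numbered result lines.
    // A line that does not contain the tag is returned unchanged.
    static String render(String templateLine, List<String> results) {
        int index = templateLine.toLowerCase().indexOf(TAG);
        if (index < 0) {
            return templateLine;
        }
        StringBuilder sb = new StringBuilder(templateLine.substring(0, index));
        int count = 1;
        for (String result : results) {
            sb.append(count++).append(". ").append(result).append("<BR>");
        }
        // Keep whatever follows the tag on the same line.
        sb.append(templateLine.substring(index + TAG.length()));
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> results = Arrays.asList(
            "<a href=\"http://host-a/doc1.html\">Doc 1</a>",
            "<a href=\"http://host-b/doc2.html\">Doc 2</a>");
        System.out.println(render("<P><<results>></P>", results));
    }
}
```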
Figure 1:
Figure 2:
Servlet Design
The search servlet is composed of three classes, each encapsulating a specific category of functionality in a reusable module. The HTMLAnchorParser class parses an input stream, extracting the HTML anchor tags and returning them in a Vector. The SearchEngine class connects to a search engine via a URL, initiates a search request and stores the results. It extends HTMLAnchorParser to provide parsing functionality for the search engine results. This class also implements the java.lang.Runnable interface to support concurrent processing via threads. Lastly, the SearchManager class coordinates all search engine threads and handles the HTTP requests from the Web server by extending the HttpServlet class in the Java servlet package (javax.servlet.http). The doGet method is overridden to implement the necessary HTTP request handling. Figure 3 depicts the class diagram for the servlet, which was designed using Rational Rose for Java. Rational Rose is an object-oriented analysis and design tool that provides powerful design and code generation functionality for Java development. Designing the system in a modular fashion such as this promotes reuse and decreases the time required for development and testing.
Figure 3:
Implementing the Servlet
Listing 1 contains the source code for the SearchManager class. HTTP GET requests are handled by overriding the doGet method of the inherited HttpServlet class. When this method is invoked, it obtains the output stream from the HttpServletResponse object using the getOutputStream method. Sending output back to the client's browser is simplified by redirecting System.out and System.err to this output stream. This is accomplished by invoking the setOut and setErr methods of the System class. Subsequent calls to System.out.println and System.err.println will cause the output to be sent to the client's browser. The HTTP response header's content type field is set to "text/html" via a call to the setContentType method. Without this call, the browser would not know how to interpret the information being sent back from the servlet. The search servlet accepts a single request parameter, named "query", that contains the criteria for the integrated search. The getParameter method of the HttpServletRequest class is used to access this parameter by name. In Listing 1 you will notice that the doGet method invokes an initialize method. This method reads the search engine configuration file and creates a thread for each entry. A thread group is used to monitor the status of the search engine threads and determine when they have finished processing. This is implemented using a busy-wait loop and the ThreadGroup.activeCount and Thread.sleep methods. The rest of the doGet method processes the HTML template file and outputs the search engine results when the "<<results>>" meta tag is encountered. The results from the search engines are striped by outputting a single result from each search engine per pass and skipping a search engine once its results have been exhausted.
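The polling technique used by the initialize method can be isolated in a few lines. The sketch below is a simplified stand-in rather than the servlet's code: WaitSketch and runAll are hypothetical names, and arbitrary Runnable tasks take the place of the SearchEngine objects. It starts one thread per task in a shared ThreadGroup and sleeps in short intervals until the group reports no active threads.

```java
// Illustrative sketch of waiting on a ThreadGroup by polling.
final class WaitSketch {
    // Start one thread per task in a shared group, then poll until none is active.
    static void runAll(Runnable[] tasks) {
        ThreadGroup group = new ThreadGroup("searchEngines");
        for (Runnable task : tasks) {
            new Thread(group, task).start();
        }
        try {
            // activeCount() is documented as an estimate, which is adequate
            // here because no other code adds threads to this group.
            while (group.activeCount() > 0) {
                Thread.sleep(50);
            }
        } catch (InterruptedException e) {
            // Preserve the interrupt status and stop waiting.
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        Runnable task = new Runnable() {
            public void run() {
                System.out.println(Thread.currentThread().getName() + " finished");
            }
        };
        runAll(new Runnable[] { task, task, task });
        System.out.println("all tasks finished");
    }
}
```

Keeping a reference to each Thread and calling join on it would achieve the same result without polling; the polling form is shown here because it mirrors the approach described for Listing 1.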
Listing 2 contains the source code for the HTMLAnchorParser class, which provides methods for reading and parsing the HTML anchor tags from a supplied java.io.InputStream object. The ignoreAnchor method provides a mechanism for ignoring HTML anchor tags that contain certain string patterns, which are registered through the addIgnoreTag method. This enables the servlet to ignore irrelevant or ornamental anchor tags that are generated by various search engines. While this class reads data from the given input stream, it calls Thread.currentThread().yield() to yield the CPU to other competing threads. Without this method call, a single thread could dominate the CPU.
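The ignore mechanism amounts to a substring test against a list of registered patterns. The following stand-alone sketch shows that test in isolation; IgnoreFilterSketch is a hypothetical class name, while addIgnoreTag and ignoreAnchor mirror the method names in Listing 2. Note that the comparison must be "greater than -1", since indexOf returns -1 when the pattern is absent.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the ignore-pattern check; not the parser itself.
final class IgnoreFilterSketch {
    private final List<String> ignorePatterns = new ArrayList<String>();

    // Register a pattern; any anchor containing it will be dropped.
    void addIgnoreTag(String pattern) {
        ignorePatterns.add(pattern);
    }

    // An anchor is ignored when it contains at least one registered pattern.
    boolean ignoreAnchor(String anchor) {
        for (String pattern : ignorePatterns) {
            if (anchor.indexOf(pattern) > -1) {
                return true;
            }
        }
        return false;
    }

    // Keep only the anchors that match none of the patterns.
    List<String> filter(List<String> anchors) {
        List<String> kept = new ArrayList<String>();
        for (String anchor : anchors) {
            if (!ignoreAnchor(anchor)) {
                kept.add(anchor);
            }
        }
        return kept;
    }
}
```

The match is case-sensitive, which is why Listing 1 registers both "<img" and "<IMG".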
The SearchEngine class (see Listing 3) inherits the HTML anchor parsing capabilities of the HTMLAnchorParser class while implementing the java.lang.Runnable interface to provide multithreaded support. Its run method creates a URLConnection instance from the specified URL. The getContent method then passes the associated InputStream to the inherited getAnchorTags method, which returns the list of HTML anchor tags that the class stores as its results.
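Before a connection can be opened, the search URL has to be assembled from the configured prefix and the user's criteria, and the host portion of that URL is later needed to make relative links absolute. The sketch below shows one way to do both with java.net classes, without opening a network connection. SearchUrlSketch, buildSearchUrl and hostPrefix are hypothetical names, search.example.com is a placeholder host, and the URL encoding step is an assumption added for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLEncoder;

// Illustrative sketch of building a search URL and extracting its host part.
final class SearchUrlSketch {
    // Append the URL-encoded criteria to a configured prefix ending in "query=".
    static String buildSearchUrl(String prefix, String criteria) {
        try {
            return prefix.trim() + URLEncoder.encode(criteria, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is required of every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Return the scheme and authority, for example "http://host:8080".
    static String hostPrefix(String url) {
        try {
            URL parsed = new URL(url);
            return parsed.getProtocol() + "://" + parsed.getAuthority();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + url, e);
        }
    }

    public static void main(String[] args) {
        String url = buildSearchUrl(
            "http://search.example.com/cgi-bin/search?query=", "java servlet");
        System.out.println(url);
        System.out.println(hostPrefix(url));
    }
}
```

Parsing the URL with java.net.URL also rejects entries that lack a protocol, which makes a malformed configuration line fail early instead of producing a broken link later.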
The SearchManager stripes the output from the SearchEngine instances using calls to the getResultItem and removeResultItem methods. Once a result is read, it is removed from the associated SearchEngine object, making the next result available for the subsequent pass. The getResultsCount method is used to determine whether all the results have been read from a specific SearchEngine object, in which case that object is skipped in subsequent passes (see Figure 2).
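The pass-by-pass merging can be expressed compactly as a round-robin over lists. The following sketch is a simplified model of that idea, not the servlet's code: StripeSketch is a hypothetical name, and plain lists of strings stand in for the SearchEngine objects and their result vectors.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of round-robin merging of ranked result lists.
final class StripeSketch {
    // Take one result from each list per pass until every list is exhausted.
    static List<String> stripe(List<List<String>> engines) {
        // Work on copies so that the caller's lists are left untouched.
        List<List<String>> remaining = new ArrayList<List<String>>();
        for (List<String> engine : engines) {
            remaining.add(new ArrayList<String>(engine));
        }
        List<String> merged = new ArrayList<String>();
        boolean tookAny = true;
        while (tookAny) {
            tookAny = false;
            for (List<String> engine : remaining) {
                if (engine.isEmpty()) {
                    continue; // this engine is exhausted; skip it
                }
                merged.add(engine.remove(0));
                tookAny = true;
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> engines = new ArrayList<List<String>>();
        engines.add(Arrays.asList("a1", "a2", "a3"));
        engines.add(Arrays.asList("b1"));
        engines.add(Arrays.asList("c1", "c2"));
        System.out.println(stripe(engines)); // prints [a1, b1, c1, a2, c2, a3]
    }
}
```

Because each engine returns its own results in relevance order, this ordering places every engine's best hit ahead of any engine's second-best hit.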
Conclusion
The java.net package provides a feature-rich set of classes for interfacing with and controlling HTTP-based resources. Coupled with the Java servlet API, it lets developers easily create applications in the style of traditional CGI programs that combine socket programming, HTTP handling and multithreading to glue together Web and legacy systems across the corporate intranet.
About the Author
Eric Greenfeder currently works for the BASF Corporation as a Senior Internet Architect. He specializes in Java development, Web security, CORBA and object-oriented analysis and design. Eric has been programming in Java since 1996 and currently teaches an internal Java course to BASF developers. You can reach him by e-mail at greenfe@basf.com.
Listing 1: SearchManager class.
import java.io.*;
import java.util.*;
import java.lang.*;
import javax.servlet.*;
import javax.servlet.http.*;
// THIS CLASS IS USED TO MANAGE THE SearchEngine THREADS
// AND PROVIDE JAVA SERVLET FUNCTIONALITY
public class SearchManager extends HttpServlet
{
private ServletOutputStream out; // THE OUTPUTSTREAM BACK TO THE CLIENT'S BROWSER
private String query=null; // THE QUERY CRITERIA TO USE IN THE SEARCH
private Vector searchEngines = new Vector(); // A LIST OF SearchEngine OBJECTS
// A ThreadGroup TO MANAGE THE SearchEngine THREADS
private ThreadGroup threadgroup = new ThreadGroup("searchEngines");
// doGet PROCESSES HTTP GET REQUESTS FROM CLIENT CONNECTIONS
public void doGet (HttpServletRequest req,
HttpServletResponse res)
throws ServletException, IOException {
int cnt = 1,index=0;
String link,inline; // BUFFER VARIABLES
String currenthost=null; // THE HOST NAME OF THE CURRENT SearchEngine
SearchEngine currentse=null; // CURRENT SearchEngine OBJECT
Enumeration se=null;
// GET THE RESPONSE OUTPUT STREAM AND REDIRECT THE out
// AND err OUTPUTSTREAMS TO RETURN RESULTS TO THE CLIENT
// CONNECTIONS
out = res.getOutputStream();
// REDIRECT ERRORS TO THE CLIENT'S BROWSER
System.setErr(new PrintStream(out));
// REDIRECT STANDARD OUTPUT TO THE CLIENT'S BROWSER
System.setOut(new PrintStream(out));
// SET THE RESPONSE CONTENT TO HTML TEXT
res.setContentType("text/html");
// GET THE QUERY PARAMETER
try {
query = req.getParameter("query");
if(query == null || query.length() == 0) {
out.println("<H1 ALIGN=CENTER>Please input a "+
"search string !</H1>\n");
out.println("</BODY></HTML>\n");
out.flush();
return;
}
}
catch (Exception e) {
System.err.println("SearchManager (doGet): "+e);
System.err.flush(); }
// INITIALIZE THE SEARCH ENGINE THREADS
if(!initialize()) return;
try {
// READ THE TEMPLATE USED TO FORMAT THE SEARCH RESULTS
BufferedReader in =
new BufferedReader(new FileReader(System.getProperty("user.dir")+
"/searchservlet.pat"));
// READ IN THE TEMPLATE FILE UNTIL THE <<results>>
// TAG IS FOUND, THEN WRITE OUT THE RESULTS. LOOP ACROSS
// EACH SEARCH ENGINE GATHERING THE RESULTS FROM EACH SEARCH
// ENGINE ONE RESULT AT A TIME. THIS WAY THE MORE PERTINENT
// SEARCH RESULTS FROM EACH ENGINE WILL APPEAR AT THE TOP OF
// THE DOCUMENT.
while((inline=in.readLine()) != null) {
if((index=inline.toLowerCase().indexOf("<<results>>")) > -1)
{
System.out.println(inline.substring(0,index));
int exhaustedEngines=0;
while(searchEngines.size() > exhaustedEngines) {
exhaustedEngines=0;
// if all the engines have been exhausted then this
// variable will = searchEngines.size()
for(int i=0;i<searchEngines.size();i++) {
currentse = (SearchEngine)searchEngines.elementAt(i);
currenthost = currentse.getHost();
// THIS SEARCH ENGINE'S RESULTS HAVE BEEN EXHAUSTED
if(currentse.getResultsCount() == 0) {
++exhaustedEngines;
continue;
}
try {
// GET THE FIRST REMAINING RESULT AND REMOVE IT FROM THE ENGINE
link = currentse.getResultItem(0);
currentse.removeResultItem(0);
}
catch (ArrayIndexOutOfBoundsException e) {
++exhaustedEngines;
continue;
}
// RELATIVE LINK: MAKE IT ABSOLUTE BY PREPENDING THE SEARCH ENGINE'S HOST
if(link.toLowerCase().indexOf("http://") == -1) {
int idx; // SCRATCH VARIABLE
if((idx = link.toLowerCase().indexOf("href=\"")) > -1)
{
if(link.toLowerCase().indexOf("href=\"/") > -1)
link = link.substring(0,idx+6)+currenthost+
link.substring(idx+6);
else
link = link.substring(0,idx+6)+currenthost+"/"+
link.substring(idx+6);
}
else {
// UNQUOTED HREF: THE PATH STARTS RIGHT AFTER "href=" (5 CHARACTERS)
idx = link.toLowerCase().indexOf("href=");
if(link.charAt(idx+5) == '/')
link = link.substring(0,idx+5)+currenthost+
link.substring(idx+5);
else
link = link.substring(0,idx+5)+currenthost+"/"+
link.substring(idx+5);
}
}
out.println(String.valueOf(cnt++)+". "+link+"<BR>");
}
}
// PRINT OUT THE REST OF THE LINE AFTER THE
// <<results>> META TAG
System.out.println(inline.substring(index+11));
}
else {
System.out.println(inline);
}
}
} catch (Exception e) { System.err.println("SearchManager (doGet): "+e);
}
out.flush();
out.close();
}
/* THIS METHOD INITIALIZES AND RUNS THE SEARCH ENGINES
AND WAITS FOR ALL SEARCH ENGINE THREADS TO COMPLETE.
THE FILE searchurls.conf IS READ FROM THE CURRENT WORKING
DIRECTORY TO PROVIDE A LIST OF SEARCH ENGINE URL'S. THE
URLS LISTED IN THE CONFIGURATION FILE MUST END WITH THE
SEARCH ENGINE'S QUERY PARAMETER FOLLOWED BY AN EQUAL SIGN.
example: http://www.search.com/cgi-bin/search?query= */
public boolean initialize() {
String inline; // A SCRATCH VARIABLE
String servlet_dir=null; // THE SERVLET HOME DIRECTORY
SearchEngine se=null; // A SearchEngine OBJECT
try {
// GET THE SERVLET HOME DIRECTORY
servlet_dir = System.getProperty("user.dir");
// OPEN AN INPUT STREAM TO THE CONFIGURATION FILE THAT
// LISTS THE URL'S OF THE SEARCH ENGINES TO BE INTEGRATED
BufferedReader in =
new BufferedReader(new FileReader(servlet_dir.replace('\\','/')+
(servlet_dir.charAt(servlet_dir.length()-1) ==
'/'?"searchurls.conf":"/searchurls.conf")));
// INITIALIZE THE LIST OF SEARCH ENGINES
searchEngines.removeAllElements();
// READ IN THE URL'S FROM THE searchurls.conf FILE
while((inline=in.readLine()) != null) {
// CREATE A SEARCH ENGINE INSTANCE USING THE URL
// READ IN FROM THE searchurls.conf FILE AND
// CONCATENATE THE USER SUPPLIED SEARCH CRITERIA
// NOTE: THE URL LISTED IN THE CONFIGURATION FILE
// MUST END WITH THE SEARCH ENGINES QUERY PARAMETER
// FOLLOWED BY AN EQUAL SIGN.
// example: http://www.search.com/cgi-bin/search?query=
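// URL-ENCODE THE CRITERIA SO THAT SPACES AND SPECIAL CHARACTERS
// DO NOT BREAK THE SEARCH URL
se = new SearchEngine(inline.trim()+java.net.URLEncoder.encode(query,"UTF-8"));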
se.addIgnoreTag("<img");
se.addIgnoreTag("<IMG");
se.addIgnoreTag(">_>");
searchEngines.addElement(se);
// START THE SEARCH ENGINE THREAD
new Thread(threadgroup,se).start();
}
// WAIT FOR ALL OF THE SEARCH THREADS TO FINISH PROCESSING
while(threadgroup.activeCount() > 0)
Thread.sleep(50);
} catch (Exception e) {
System.err.println("initialize: "+e); System.err.flush();
return false; }
return true;
}
}
Listing 2: HTMLAnchorParser class.
import java.io.*;
import java.lang.*;
import java.util.*;
public class HTMLAnchorParser {
private DataInputStream in=null; // INPUT STREAM TO READ
private Vector ignoreTags=new Vector(); // TAG IGNORE CRITERIA
// CONSTRUCTOR TO INITIALIZE THE CLASS
public HTMLAnchorParser() { ; }
// CONSTRUCTOR: INITIALIZE CLASS USING INPUT STREAM
public HTMLAnchorParser(InputStream in) {
setInputStream(in);
}
// READS InputStream AND RETURNS A VECTOR OF ANCHOR TAGS
public Vector getAnchorTags(InputStream in) throws IOException
{
setInputStream(in);
return(getAnchorTags());
}
// SETS THE InputStream TO BE PARSED
private void setInputStream(InputStream in)
{
this.in = new DataInputStream(in);
}
// ADDS A TAG TO THE IGNORE TAG LIST. WHILE READING A URL,
// IF A LINK CONTAINS ANY MATCHES FROM THE IGNORE TAGS LIST
// IT IS NOT INCLUDED IN THE RESULTS
public void addIgnoreTag(String tag) { ignoreTags.addElement(tag);
}
// DETERMINES IF A TAG SHOULD BE IGNORED
private boolean ignoreAnchor(String anchor) {
for(int i=0;i<ignoreTags.size();i++)
if(anchor.indexOf((String)ignoreTags.elementAt(i)) > -1)
return true;
return false;
}
// READS AN InputStream AND RETURNS A VECTOR OF ANCHOR TAGS
public Vector getAnchorTags() throws IOException {
int charbuf[] = new int[4];
StringBuffer sb = new StringBuffer();
StringBuffer link = new StringBuffer();
boolean startrecording=false;
Vector results = new Vector();
try{
if(in == null)
return null;
while(true) {
// YIELD THE CPU TO OTHER THREADS
Thread.currentThread().yield();
charbuf[0] = in.readByte();
if(charbuf[0] == '<') { // START OF TAG
// WE ARE RECORDING SO LOOK FOR ENDING TAG </A
if(startrecording) {
charbuf[1] = in.readByte();
charbuf[2] = in.readByte();
if(charbuf[1] == '/' && (charbuf[2] == 'a' ||
charbuf[2] == 'A')) {
link.append("</A>");
startrecording = false;
if(!ignoreAnchor(link.toString()))
results.addElement(link.toString());
link = new StringBuffer();
}
else {
// APPEND THE THREE CHARACTERS WE JUST READ
link.append((char)charbuf[0]);
link.append((char)charbuf[1]);
link.append((char)charbuf[2]);
}
}
else {
// SKIP PAST SPACES
while((char)(charbuf[1]=in.readByte()) == ' ') ;
charbuf[2] = (char)in.readByte();
// START RECORDING IF THIS TAG IS AN ANCHOR <A
if(charbuf[0] == '<' && (charbuf[1] == 'a' ||
charbuf[1] == 'A') && charbuf[2] == ' ') {
link.append("<a ");
startrecording = true;
}
else { // RESET NECESSARY VARIABLES
link = new StringBuffer();
startrecording = false;
continue;
}
}
}
else if(startrecording)
link.append((char)charbuf[0]);
}
}
// EOFException CAUGHT HERE
catch (EOFException e) { ; }
return results;
}
}
Listing 3: SearchEngine class.
import java.net.*;
import java.io.*;
import java.util.*;
import java.lang.Runnable;
import java.util.Vector;
// THIS CLASS OPENS A CONNECTION TO A SPECIFIED URL AND
// READS THE CONTENTS PARSING OUT THE ANCHOR TAGS,
// STORING THEM IN A VECTOR
public class SearchEngine extends HTMLAnchorParser implements Runnable
{
private String query;
private Vector results = new Vector();
private String serverURL;
// CONSTRUCTOR TO INSTANTIATE THE CLASS WITH A URL
public SearchEngine(String serverURL) {
this.serverURL = serverURL;
}
// METHOD THAT GETS CALLED WHEN THE THREAD IS STARTED
// USING THE START METHOD
public void run() {
try {
URLConnection urlcon =
(new URL(serverURL)).openConnection();
if(urlcon == null) {
System.err.println("SearchEngine (run): "+
"Error opening URL connection.");
System.err.flush();
return;
}
urlcon.connect();
getContent(urlcon);
}
catch (Exception e) {
System.err.println("SearchEngine(run): "+e);
System.err.flush(); }
}
// GETS THE CONTENT BY READING THE INPUT STREAM SPECIFIED
// IN THE URLConnection OBJECT
public void getContent(URLConnection urlc) {
try {
results = getAnchorTags(urlc.getInputStream());
}
catch (Exception e) {
System.err.println("SearchEngine (getContent): "+e);
System.err.flush(); }
}
// OUTPUTS THE RESULTS TO System.out. YOU CAN REDIRECT
// THIS OUTPUT USING THE System.setOut METHOD
public synchronized void outputResults() {
try {
for(Enumeration e=results.elements();e.hasMoreElements();)
{
System.out.println((String)e.nextElement());
System.out.flush();
}
} catch (Exception e) {
System.err.println("SearchEngine (outputResults): "+e);
System.err.flush(); }
}
// SETS THE QUERY CRITERIA FOR A SEARCH
public void setQuery(String query) {
this.query = query;
}
// GETS THE QUERY CRITERIA FOR A SEARCH
public String getQuery() {
return query;
}
// RETURNS THE RESULTS OF THE SEARCH
// getContent MUST BE CALLED BEFORE THIS
// FUNCTION IS USED
public Vector getResults() {
return results;
}
// RETURNS THE HOST PORTION OF THE CURRENT SEARCH URL
public String getHost() {
int index = serverURL.indexOf("//");
index = serverURL.indexOf("/",index+3);
return serverURL.substring(0,index);
}
// SET THE URL USED TO EXECUTE THE SEARCH
public void setServerURL(String serverURL)
{
this.serverURL = serverURL;
}
// RETURNS THE URL OF THE SEARCH ENGINE
public String getServerURL()
{
return serverURL;
}
// RETURNS THE NUMBER OF SEARCH RESULT ITEMS
public int getResultsCount() { return results.size();
}
// RETURNS A SEARCH RESULT GIVEN AN INDEX
public String getResultItem(int index) throws ArrayIndexOutOfBoundsException
{
return (String)results.elementAt(index);
}
// REMOVES A SEARCH RESULT ITEM GIVEN AN INDEX
public void removeResultItem(int index) throws ArrayIndexOutOfBoundsException
{
results.removeElementAt(index);
}
}