A couple of weeks ago, while I was on my way home, my cell phone rang and I was greeted by one of my favorite customers, who sounded like he had had better days. He had just left a meeting with the CIO and received his annual development budget for the following year. The problem was that the CIO was unable to justify a new set of Web service initiatives around a set of just-completed internal Web sites.
The CIO and upper management felt that it was too early to redevelop these sites. After all, as the CIO explained, "the users had just been trained and were just starting to take advantage of these sites." It wasn't that they didn't see the technical advantages of Web services; the business value just wasn't there yet. "Until we can get some return on our investments for these sites, they will stay as they are," was how the CIO later phrased it to me.
During our conversation I started to realize that we all too often forget how important it is to leverage existing assets in infrastructure and technology - and that we can use a combination of Web services and the .NET Framework to realize that. As I did for this customer, I will demonstrate how, using the built-in HTML parsing solution within .NET, you can parse existing content from a remote HTML page and then programmatically expose the resulting data in a Web service.
Developing a Web service that parses content follows a different paradigm from traditional ASP.NET Web service development. At the core of this development is a service implemented through a Web Services Description Language (WSDL) file. With traditional ASP.NET development we never worry about generating the WSDL; the framework produces it for us. With a parser-based service we spend our time focused almost exclusively on creating the WSDL by hand. The trick, as I will show next, is that additional XML elements are added to specify both the input parameters and the data returned from a parsed page. Even with these additional elements, the resulting XML document must still adhere to the WSDL specification (www.w3.org/TR/wsdl). Within the WSDL file you provide both a target and a regular expression to retrieve the requested data. Once you have created the WSDL file, the .NET Framework provides a utility (wsdl.exe) that generates the proxy classes for ASP.NET applications. This built-in support is important because it allows companies like my customer's to easily transition their existing investments in Web sites into Web services.
To demonstrate this technique, I created a simple HTML page that I will render back into a Web service callable by an ASP.NET application. There are two main caveats that I wanted to pass on. First, always make sure that you get proper permission before trying this on a site. Second, always remember that any changes to the layout of the target Web pages will cause problems within the Web service. In this article, I will show how you can retrieve both the <TITLE> and the <H1> elements of this simple document.
<html>
<head>
<title>Sample title</title>
</head>
<body>
<H1>Some Heading</H1>
</body>
</html>
Creating Custom WSDL
I always like to think of WSDL as an XML format that describes the
network services offered by a server. WSDL by definition is an
XML-based file that identifies the services provided by the server
and the set of operations within each service that the server
supports. Each operation described within the WSDL file includes a
format that the client must follow to request an operation. The
nature of this document sets up a requirement that both the server
and the client must follow, and acts as a form of contract that both
sides agree upon. The server commits to providing its services only
if the client sends a properly formatted request.
With a parsing service, both the parsing and implementation
requirements are part of the WSDL document and these two combined
return the requested information.
Within Visual Studio .NET, creating a custom WSDL file is fairly easy
but not completely straightforward. The problem is that VS.NET
doesn't directly support the creation of a WSDL file as part of its
standard wizards. In order to add a WSDL file after creating an
ASP.NET application, add a text file and then rename it with a *.WSDL
extension. Once this is done you're ready to add the necessary XML
elements.
Within the WSDL file there are a couple of basic elements. First is
the <service> element. A service is a set of <port> elements, each of
which associates a physical or URL location with a <binding> element.
Even though this is a one-to-one relationship, you can specify
additional <port> elements within a <service>. These are used for
alternate locations. It really isn't uncommon to have multiple
<service> elements within a document. This provides a couple of
features, including the ability to group HTTP ports in one service
and SMTP ports in another. This gives client applications the ability
to search for the specific <service> elements they need. This also
provides a built-in redirection mechanism for clients. Client
applications can redirect requests to another <service> element and
continue processing without any changes. For our sample I created a
<service> element whose port points to the local machine. Obviously,
within a production application you would need to reset the URL to a
valid location.
<service name="GetTitle">
  <port name="GetTitleHttpGet"
        binding="s0:GetTitleHttpGet">
    <http:address location="http://localhost/WebInfo" />
  </port>
</service>
Within a WSDL document the <service> "name" attribute is used to
uniquely distinguish one service from another. The <port> "name"
attribute plays the same role inside a service: when a service has
multiple ports, it makes each one unique and distinguishable from the
others.
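The service, port, and binding relationship described above can be checked by reading a WSDL fragment programmatically. The following is a minimal sketch in Python rather than .NET. The wrapping <definitions> element and the s0 namespace URI are assumptions added so the fragment parses on its own, and list_ports is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Minimal WSDL fragment built around the <service> element from the article.
# The s0 namespace URI is an assumption made for illustration.
WSDL = """<definitions xmlns="http://schemas.xmlsoap.org/wsdl/"
    xmlns:http="http://schemas.xmlsoap.org/wsdl/http/"
    xmlns:s0="http://tempuri.org/">
  <service name="GetTitle">
    <port name="GetTitleHttpGet" binding="s0:GetTitleHttpGet">
      <http:address location="http://localhost/WebInfo" />
    </port>
  </service>
</definitions>"""

# Namespace prefixes used in the lookups below.
NS = {
    "wsdl": "http://schemas.xmlsoap.org/wsdl/",
    "http": "http://schemas.xmlsoap.org/wsdl/http/",
}

def list_ports(wsdl_text):
    """Return (service, port, binding, location) tuples found in a WSDL document."""
    root = ET.fromstring(wsdl_text)
    rows = []
    for service in root.findall("wsdl:service", NS):
        for port in service.findall("wsdl:port", NS):
            address = port.find("http:address", NS)
            location = address.get("location") if address is not None else None
            rows.append((service.get("name"), port.get("name"),
                         port.get("binding"), location))
    return rows

for row in list_ports(WSDL):
    print(row)
```

Each tuple pairs one port with the binding it references and the address it exposes, so a service with alternate locations would simply produce several rows.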
Within our WSDL file we also have the <message> elements. These are
used to define the input and output parameters. Within this element
is a <part> child element that represents a particular parameter.
This element contains a name attribute and either a type or an
element attribute. The name attribute contains the unique name of the
parameter being passed; the type attribute names its data type
directly, while the element attribute refers to an element declared
in the schema. WSDL isn't limited to simple types. If you want to
define more complex types using XSD, they can be defined within the
<types> section of the service description and then referenced from
the <part>. For our example I am using a string element and defining
"Body" as the parameter name.
<message name="TestHeadersHttpGetOut">
  <part name="Body" element="s0:string"/>
</message>
Using Regular Expressions
Of course, all elements are important for a properly formatted WSDL
document. The most important element for parsing is the <match>
element. This element contains the actual parsing instruction and
the data elements required by the .NET Framework to properly generate
the proxy classes. The <match> element is a child of the
namespace-qualified <text> element, which in turn sits inside the
<output> element of an <operation> within a specific <binding>.
Within the <match> element there are a variety of attributes (see
Table 1).
By far the most important is the pattern attribute. This contains a
regular expression that will be applied against the retrieved page
and will determine the return value. By definition, a regular
expression is a series of characters that define a pattern. The
pattern is then compared against a target string to determine whether
there is a match to the pattern in the target string.
The real power of these expressions is in the use of metacharacters
to indicate character positioning, grouping and even repetition. The
most familiar metacharacter is the "*" from the old DOS days,
although in a regular expression it repeats the preceding item zero
or more times rather than matching anything on its own. The .NET
Framework supports a fairly extensive regular expression syntax that
can be used when parsing pages. For more information and examples of
syntax, take a look at the .NET SDK. For our example, I attempted to
locate both the <TITLE> and the <H1> tags within the sample HTML
page.
<output>
  <text xmlns="http://microsoft.com/wsdl/mime/textMatching/">
    <match name='Title'
           pattern='TITLE&gt;(.*?)&lt;'/>
    <match name='H1' pattern='H1&gt;(.*?)&lt;'/>
  </text>
</output>
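The "(.*?)" in these patterns is a lazy capture group: it collects as few characters as possible before the next "<". A minimal sketch of why that matters, written in Python as a stand-in for the .NET regular expression engine (the two behave the same way for this pattern). The one-line page fragment is hypothetical and exists only to show the difference.

```python
import re

# Hypothetical one-line page fragment, used only to show the quantifier difference.
line = "<H1>Some Heading</H1><P>More text</P>"

# Pattern from the WSDL: the lazy group stops at the first "<" after "H1>".
lazy = re.search(r"H1>(.*?)<", line)

# Greedy variant for comparison: the group runs on to the last "<" in the line.
greedy = re.search(r"H1>(.*)<", line)

print(lazy.group(1))
print(greedy.group(1))
```

The lazy form returns just the heading text, while the greedy form swallows the closing tag and everything up to the final "<", which is rarely what a parsing service wants.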
One thing I learned while writing this sample is that case
sensitivity is important. So, for this example and your own code,
make sure that you either turn on case insensitivity or are aware of
how the HTML tags are written.
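The case-sensitivity point is easy to reproduce against the sample page, where the title tag is lowercase and the heading tag is uppercase. This is a minimal sketch in Python, not the .NET implementation: re.IGNORECASE stands in for whatever case-insensitivity option you set on the <match> element, and first_group is a hypothetical helper.

```python
import re

# The sample page from this article: lowercase <title>, uppercase <H1>.
PAGE = """<html>
<head>
<title>Sample title</title>
</head>
<body>
<H1>Some Heading</H1>
</body>
</html>"""

def first_group(pattern, text, ignore_case=False):
    """Return the first capture group of the first match, or None if nothing matches."""
    flags = re.IGNORECASE if ignore_case else 0
    found = re.search(pattern, text, flags)
    return found.group(1) if found else None

# Case-sensitive: the uppercase pattern does not find the lowercase tag.
print(first_group(r"TITLE>(.*?)<", PAGE))
# Case-insensitive: the same pattern now captures the title text.
print(first_group(r"TITLE>(.*?)<", PAGE, ignore_case=True))
# The heading tag already matches the case used in the pattern.
print(first_group(r"H1>(.*?)<", PAGE))
```

With a case-sensitive match the Title value comes back empty even though the page clearly has a title, which is exactly the kind of silent failure worth checking for.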
Generating Proxy Classes
The job of the service description file is to define how to
communicate with the Web service. XML Web services allow
communication over a network in a variety of protocols. Most
commonly, the client and Web service communicate using SOAP messages
that encapsulate both the in and the out parameters as XML. It is up
to the proxy class of a Web service client to handle the work of
mapping parameters to the actual XML elements defined within the
service description file and then sending the message over the
network.
Within the .NET Framework a proxy class is generated using the
Wsdl.exe utility. This utility examines the WSDL file and creates
proxy classes that can be invoked to communicate with the target Web
service over the network. The service in turn processes both the
incoming and outgoing SOAP messages. By default, the Wsdl.exe utility
assumes SOAP over HTTP to communicate with Web services. The utility
also provides the ability to generate classes that can communicate
with Web services using either the HTTP-GET or HTTP-POST protocol.
Wsdl.exe is run from the command prompt. The utility supports a wide
variety of switches that allow you to define such things as language
type, passwords, and even namespaces. For a complete listing of the
available options, run "wsdl.exe /?" from the command prompt. For my
example, I was interested in creating a Visual Basic .NET-based class
with a specific output file name. From the command prompt I ran the
following:
Wsdl.exe /l:vb /out:datareturn.vb http://localhost/Webinfo/datareturn.wsdl
Running Wsdl.exe created a file called datareturn.vb. This file
contains a proxy class that exposes both synchronous and asynchronous
methods for each of the methods in the Web service. In this example
the generated methods were TestHeaders, BeginTestHeaders, and
EndTestHeaders. The TestHeaders method provides synchronous
connectivity to the Web service. The BeginTestHeaders and
EndTestHeaders methods can be used together to provide asynchronous
Web service connectivity.
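Conceptually, the synchronous call on a parsing proxy amounts to fetching the target page and applying each <match> pattern to the response. The sketch below illustrates that idea in Python; it is not the code Wsdl.exe generates. HeaderMatches, parse_headers, and canned_page are hypothetical names, case-insensitive matching is an assumption, and a canned page replaces the HTTP GET so the example runs without a network.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class HeaderMatches:
    """Hypothetical stand-in for the generated result class: one field per <match>."""
    Title: Optional[str]
    H1: Optional[str]

# Patterns copied from the <match> elements in the WSDL.
PATTERNS = {"Title": r"TITLE>(.*?)<", "H1": r"H1>(.*?)<"}

def parse_headers(fetch: Callable[[], str]) -> HeaderMatches:
    """Fetch the page text, apply each pattern, and return the captured values."""
    text = fetch()
    values = {}
    for name, pattern in PATTERNS.items():
        # IGNORECASE is an assumption of this sketch, not a documented default.
        found = re.search(pattern, text, re.IGNORECASE)
        values[name] = found.group(1) if found else None
    return HeaderMatches(**values)

def canned_page() -> str:
    """Stand-in for the HTTP GET against the target page."""
    return ("<html><head><title>Sample title</title></head>"
            "<body><H1>Some Heading</H1></body></html>")

result = parse_headers(canned_page)
print(result.Title, result.H1)
```

Passing the fetch step in as a function keeps the parsing logic separate from the transport, which also makes it obvious why a change to the page layout breaks the service: only the text handed to the patterns has changed.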
Consume the Web Service
Once the generated proxy class is added to the project and a Web
reference is set to the WSDL file, you are ready to start using the
service. Within an ASP.NET Web page you can call the proxy class and
return the requested parsed data from the Web service using the code:
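' Create an instance of the proxy class generated by Wsdl.exe
Dim Getdata As New localhost.GetTitle()
' Call the service; each <match> is returned as a member of the result
Dim match As localhost.TestHeadersMatches
match = Getdata.TestHeaders()
' Display the parsed values on the page
TextBox1.Text = match.Title
TextBox2.Text = match.H1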
Summary
As I said at the beginning of this article, this is a simple example
of what you can do. As I spoke with my customer over the next weeks,
he started to understand the value the CIO and upper management were
looking for. He developed a Web services strategy that relied on
current investments and leveraged them where appropriate. His
strategy was centered on a gradual transition that leveraged the full
power of his existing infrastructure.
As you download the source code provided with the article,
I challenge you to do the same thing. Use existing Web sites when
appropriate and integrate and enhance them with the power of a Web
service.
Author Bio
Thomas Robbins is a senior technology specialist with Microsoft in New England. He focuses on .NET development and implementing XML-based solutions. Thom is a regular speaker, writer, and presenter at industry events.
trobbins@microsoft.com
Web Services Made Easy, by Thomas Robbins
WSJ Vol 02 Issue 12 - pg.40