Which EII Solution Is Right for You?
Managing integration tasks
Enterprise Information Integration (EII) represents a new category of software that enables disparate data silos to be integrated into a single virtual database for applications. This approach gives developers a powerful tool for simplifying data integration and building flexible applications. If you haven't heard of EII yet, you will soon as the industry rallies around this concept and more EII projects reach deployment.
This article explains why you should consider EII, describes its advantages for data integration, and discusses the different approaches to implementing EII. The article provides a framework for comparing EII products and choosing a suitable product to begin simplifying data integration in your environment.
Data Integration Issues
The presence of multiple, disparate data sources is a fact of life in enterprise IT environments. The number and types of data sources have only grown over time and made data integration a prominent aspect of IT investments.
To combat the inflexibility induced by information fragmentation, developers have many tools for data integration (see Figure 1). There are adapters for accessing data sources, transformation engines for reformatting data, and data warehouses for aggregating data from multiple sources. To integrate data sources, developers use these tools to program the integration requirements into applications.
Although this approach to data integration works, it requires a programmatic approach that has the following deficiencies:
- Low-level programming: The developer needs to create and maintain the integration code. The tools provide only the building blocks for integration. The developer must write the low-level code to implement the integration requirements, which greatly increases development and maintenance costs.
- Multiple data source APIs: Each data source has its own API and data format. The developer must understand each data source in sufficient detail to manage the data integration. This often requires multiple specialists to implement and maintain the integration, which again drives up complexity.
- Inconsistent integration framework: The developer has to manage all the integration issues such as the relationships between data sources and data formats. This leads to one-off solutions and inconsistencies that are difficult to maintain.
- Tight coupling: The programmatic approach creates hard-coded dependencies between applications and data sources. This severely hampers maintenance since updating a data source can potentially break many applications. Even minor updates must be carefully considered and orchestrated.
- Limited reusability: The integration code tends to be application centric and data source centric. This limits the reusability of investments in data models and integration rules.
A Smarter Approach to Data Integration
A better solution for data integration is EII. EII supports data integration by enabling multiple data silos to be represented to applications as a single virtual database. Instead of integrating data in the application code, the data integration function is pushed out of the application layer into a new EII tier that sits between applications and data sources. This new tier is dedicated to managing integration tasks such as connecting to a data source, transforming data, and integrating data.
In this framework, the developer or analyst creates a logical data model in the EII server that represents the business view of information (that is, the data integration requirements). The physical data models of the target data sources are mapped into the logical data model to create a virtual database schema. Applications interact with data sources through the EII server based on the logical data model. The EII server automatically translates application requests into queries against one or more data sources, integrates the data, and produces results according to the logical view.
EII's holistic approach to data integration addresses the drawbacks of the toolkit approach (see Figure 2). Instead of providing tools for each task, EII establishes a framework that automates the low-level details and exposes a high-level, declarative interface for specifying data integration requirements. Multiple data sources can be integrated without writing any application code. The developer defines the desired logical data model and maps the data sources into the logical model using a GUI tool.
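The declarative idea can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the two in-memory "sources", their field names, and the `fetch_customer` helper stand in for what an EII server and its mapping tool would manage. It is not any product's API.

```python
# Two hypothetical "physical" sources with different shapes and field names.
CRM_ROWS = [
    {"cust_id": 1, "full_name": "Ada Lovelace"},
    {"cust_id": 2, "full_name": "Alan Turing"},
]
BILLING_ROWS = [
    {"customer": 1, "balance_cents": 12500},
    {"customer": 2, "balance_cents": 0},
]
SOURCES = {"crm": CRM_ROWS, "billing": BILLING_ROWS}

# Declarative mapping for a logical "Customer" entity:
# logical field -> (source name, key field in that source, extractor).
CUSTOMER_MAPPING = {
    "id": ("crm", "cust_id", lambda row: row["cust_id"]),
    "name": ("crm", "cust_id", lambda row: row["full_name"]),
    "balance": ("billing", "customer", lambda row: row["balance_cents"] / 100),
}


def fetch_customer(customer_id: int) -> dict:
    """Build one logical Customer by consulting each mapped source."""
    result = {}
    for field, (source, key, extract) in CUSTOMER_MAPPING.items():
        row = next(r for r in SOURCES[source] if r[key] == customer_id)
        result[field] = extract(row)
    return result


customer = fetch_customer(1)
```

The application only sees the logical fields `id`, `name`, and `balance`. If the billing source renamed `balance_cents`, only the mapping entry would change, not the calling code.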
This end-to-end approach generates several advantages:
- No programming: The declarative approach eliminates the need to create and manage integration code. The EII server automatically generates the low-level code to orchestrate the data integration.
- Unified API: Applications access data sources through the EII server's API instead of through data source APIs. Developers do not need to know how every data source works.
- Automation: The EII server manages the disparities between data formats and APIs. The developer uses the GUI tool to create the new data model and the EII server automatically generates the correct API calls and data transformations.
- Loose coupling: Inserting an EII layer between applications and data sources effectively decouples the usual dependencies between them. The mapping between the logical data model and the data source simply needs to be updated to reflect any changes to the data source. This level of flexibility provides some interesting benefits:
  - An application can migrate to a different database of the same or different architecture.
  - A database schema can quickly evolve to reflect new business requirements.
  - Multiple versions of the logical data models can be created to support incremental system migration.
- Common data model: A common data model can be established for the enterprise. It helps different groups understand and share the available information. The investment in a common model compounds over time to improve the strategic utilization of information assets.
- Reusability: The logical data model can be aggressively reused in other projects. This reduces the need for additional data integration and centralizes ongoing administration. Reusability changes the data integration focus from code management to strategic information management.
Approaches to EII
How do you get started with EII? Like all emerging markets, EII offers many different approaches to implementing the solution. In general, the available products can be differentiated based on the underlying logical data model, the data transformation framework, and the query interface.
Logical Data Model
The logical data model is at the heart of every EII solution. Physical data models from target data sources are mapped into the logical data model. This model serves as the schema of the virtual database. The types of logical models include relational, object, and XML. Depending on the approach, the EII servers would appear to applications as an object database, a relational database, or an XML database.
The transformation framework dictates how data from multiple sources are transformed into the logical data model. The approaches to transformation include SQL, XQuery, and proprietary mechanisms. A visual data mapping tool is typically provided to shield the developers from low-level transformation code; however, if the mapping tool doesn't accommodate specific integration requirements, the developer may need to edit or write transformation code directly.
The query interface provides access to data. It dictates how applications read, query, insert, update, and delete the underlying data represented by the logical data model. It also dictates the data format presented to applications. EII query interfaces include SQL, XQuery, and proprietary interfaces.
Although the data model, transformation framework, and query interface can be mixed and matched, in practice, specific transformation frameworks and specific query interfaces work best with specific data models. Therefore, the general approaches to EII can be grouped into three main categories: relational, object, and XML (see Table 1).
Products that take the relational approach use the relational data model as their logical data model. All data sources are represented as a series of tables, and SQL is used to transform the data into the logical model. With this approach, the EII server is a virtual relational database and applications use SQL to interact with the integrated data.
MetaMatrix is an example of this approach.
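As a rough illustration of the relational approach in general (not of any vendor's product), the sketch below uses Python's built-in `sqlite3` module as a stand-in for the EII server. Two attached in-memory databases play the role of separate data sources, and a temporary view plays the role of the logical model. The database, table, and column names are assumptions made for the example.

```python
import sqlite3

# isolation_level=None keeps the connection in autocommit mode, so ATTACH
# is never issued inside an open transaction.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("ATTACH DATABASE ':memory:' AS crm")
conn.execute("ATTACH DATABASE ':memory:' AS billing")

# Each attached database stands in for an independent physical source.
conn.execute("CREATE TABLE crm.customers (cust_id INTEGER, full_name TEXT)")
conn.execute("CREATE TABLE billing.invoices (customer INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO crm.customers VALUES (?, ?)", [(1, "Ada"), (2, "Alan")]
)
conn.executemany(
    "INSERT INTO billing.invoices VALUES (?, ?)",
    [(1, 100.0), (1, 25.0), (2, 40.0)],
)

# The logical model: SQL transforms both sources into one virtual table.
# A TEMP view is used because it may reference tables in any attached database.
conn.execute(
    """
    CREATE TEMP VIEW customer_totals AS
    SELECT c.cust_id AS id, c.full_name AS name, SUM(i.amount) AS total
    FROM crm.customers AS c
    JOIN billing.invoices AS i ON i.customer = c.cust_id
    GROUP BY c.cust_id, c.full_name
    """
)

# The application queries the logical view with plain SQL.
rows = conn.execute(
    "SELECT name, total FROM customer_totals ORDER BY id"
).fetchall()
```

The application never names `crm` or `billing`; it only knows the `customer_totals` view, which is the sense in which the server appears as a single virtual relational database.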
Products that take the object approach use objects as the logical data model. All data sources are represented as objects, and automatically generated code is used to transform the data into the logical model. With this approach, the EII server is a virtual object database and applications use a proprietary interface to obtain data objects.
Journée is an example of this approach.
Products that take the XML approach use XML as the logical data model. All data sources are represented as XML document collections and XQuery is used to transform the data into the logical model. With this approach, the EII server is a virtual XML database and applications use XQuery to interact with the integrated data.
BEA Liquid Data for WebLogic, Ipedo, and Nimble are examples of this approach.
Each approach to EII has its advantages and disadvantages. The primary distinctions are based on the data modeling flexibility, the query flexibility, and the result-processing requirements. These differences affect the suitability of each approach to specific applications and developer predilections.
Data Modeling Flexibility
The distinction in modeling flexibility deals with the expressiveness of the data model that can be created from multiple data sources (see Table 2). This is largely dependent on the underlying logical data model and the ability of the transformation framework to mold data structures into the new logical model.
In this regard, the object and XML approaches have an advantage over the relational approach because of their ability to support hierarchical data relationships. Their logical data models can directly represent hierarchical data, whereas the relational approach must decompose the data structure into tables. In addition, the object and XML approaches can represent hierarchical data sources like XML, whereas the relational approach requires the developer to reconstruct the hierarchical data in application code.
This modeling advantage is important to applications that use data from nonrelational data sources such as message queues, Web services, XML documents, EJBs, and applications.
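The cost of decomposition can be seen in a small Python sketch. The order structure and the `decompose` and `reconstruct` helpers are invented for illustration: a nested order is flattened into two table-like row sets, and the application then has to rebuild the hierarchy itself.

```python
# A hierarchical record, as an object or XML model could represent it directly.
order = {
    "order_id": 7,
    "customer": "Ada",
    "items": [
        {"sku": "A-1", "qty": 2},
        {"sku": "B-9", "qty": 1},
    ],
}


def decompose(order: dict):
    """Flatten one nested order into parent rows and child rows."""
    orders = [(order["order_id"], order["customer"])]
    items = [(order["order_id"], it["sku"], it["qty"]) for it in order["items"]]
    return orders, items


def reconstruct(orders, items):
    """Rebuild the hierarchy in application code by matching on order_id."""
    rebuilt = []
    for order_id, customer in orders:
        rebuilt.append(
            {
                "order_id": order_id,
                "customer": customer,
                "items": [
                    {"sku": sku, "qty": qty}
                    for oid, sku, qty in items
                    if oid == order_id
                ],
            }
        )
    return rebuilt


order_rows, item_rows = decompose(order)
round_trip = reconstruct(order_rows, item_rows)
```

With a hierarchical logical model, neither helper would be needed; with a relational one, some equivalent of `reconstruct` ends up in the application.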
Another key distinction between each approach is query flexibility. This deals with the level of filtering and data processing that can be specified in a query and executed by the EII tier.
The SQL and XQuery query languages have an advantage over simply retrieving objects by criteria. The language approach better supports projection, aggregation, and joins, which enable fine-grained results to be produced by a query. In contrast, query returns of object collections may generate unnecessary data and require additional processing in application code to aggregate and join data.
XQuery has a further advantage over SQL and the object approach with its enhanced data manipulation facilities. It provides a functional programming language that can express complex transformations against any data structure. It supports built-in and external functions, conditional processing, scripting, and the ability to transform results into any text or binary format.
The distinction in query flexibility is important because the more data processing the EII server can perform, the less code the developer needs to create and maintain. Query flexibility also has implications for performance. A query that returns just the desired data minimizes network traffic and improves the response time. To achieve the smallest result set, the query must maximize the level of data selection, projection, aggregation, joins, and transformation that can be expressed and processed in a single query.
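A small sketch makes the difference concrete. It uses Python's built-in `sqlite3` module as a stand-in for the EII tier, with a made-up `orders` table: the first path retrieves every row and aggregates in application code, while the second expresses the aggregation in the query so that only the summary comes back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("east", 10.0), ("east", 15.0), ("west", 7.5), ("west", 2.5), ("west", 5.0)],
)

# Retrieval by criteria: every row is returned, and the application
# performs the aggregation itself.
all_rows = conn.execute("SELECT region, amount FROM orders").fetchall()
totals_in_app = {}
for region, amount in all_rows:
    totals_in_app[region] = totals_in_app.get(region, 0.0) + amount

# Query-language approach: the aggregation is part of the query, so the
# result set contains one row per region.
summary_rows = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
```

Both paths produce the same totals, but the first transfers five rows and the second transfers two. The gap widens as the underlying data grows.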
The last primary distinction between the three approaches is the result-processing model. The relational approach generates tabular data, the object approach generates native program objects, and the XML approach generates XML documents.
The object approach has an advantage over the relational and XML approaches since a native object representation of data is the most convenient to work with. The developer directly accesses the data by simply calling the specific data object's methods. In the relational and XML approaches, the developer must work with generic data structures such as JDBC ResultSets or DOM objects. These require additional work over the object approach to read, update, create, and delete data.
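The three result-processing models can be compared side by side in a short sketch. Python tuples and `xml.etree.ElementTree` stand in here for JDBC ResultSets and DOM objects, and the `Customer` class is a hypothetical data object, so this is an analogy rather than the APIs the article names.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Relational style: a generic row; the caller must know the column positions.
row = (1, "Ada", 125.0)
name_from_row = row[1]

# XML style: a generic tree; the caller navigates by element name and
# converts text to the needed type.
doc = ET.fromstring(
    "<customer id='1'><name>Ada</name><balance>125.0</balance></customer>"
)
name_from_xml = doc.findtext("name")
balance_from_xml = float(doc.findtext("balance"))


# Object style: a typed object with named attributes and its own methods.
@dataclass
class Customer:
    id: int
    name: str
    balance: float

    def has_balance(self) -> bool:
        return self.balance > 0


customer = Customer(1, "Ada", 125.0)
name_from_object = customer.name
```

All three carry the same data; the difference is how much positional or structural knowledge the calling code needs in order to use it.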
Selecting an Approach
The capabilities and limitations of each approach affect their suitability to particular environments, data processing requirements, and data source integration requirements. The best approach truly depends on the unique set of requirements. The following are some general guidelines for selecting an approach.
The relational approach works very well with established data programming practices. Developers can get started quickly and leverage traditional techniques and know-how. Although the relational data model is less flexible, the majority of enterprise data resides in relational databases and many nonrelational data sources can be coaxed into a relational format. The pain of accommodating a few nonrelational data sources may be outweighed by the approachability of relational development.
The object approach provides a flexible, logical data model for integrating diverse data sources. For complex environments, this capability greatly simplifies data integration and produces detailed views of enterprise data assets. In addition, the object interface makes data programming more convenient. Fans of object-to-relational mapping tools and object databases will appreciate the object approach. Unfortunately, binding data to program objects is also the weakness of the object approach. This makes query processing less flexible and requires the application developer to process more data. The resulting inefficiency makes the object approach unsuitable for ad hoc query processing where the data binding can't be tuned for performance.
The XML approach is the most cutting-edge architecture for data integration. Using XML as the logical data model and XQuery as the query interface provides a flexible platform for EII. XML effectively models many enterprise data sources and XQuery provides powerful data processing capabilities. Together these minimize the integration code in applications. These advantages are the most apparent in an environment with heterogeneous data sources and mixed application architectures. Although the XML approach provides a great deal of flexibility, the technology is less mature and requires a shift toward XML-centric application development. For developers ready to experiment and blaze new trails, the payoff will be worthwhile.
The comparisons between different approaches to EII provide a good starting point for exploring the market offerings, but they don't tell the whole story. In addition to exhibiting many different approaches to the solution, emerging markets also exhibit products with different levels of feature coverage. This affects the capability and performance of EII products beyond the architectural differences. Understanding the following key features (adapters, update support, security, and caching) will help to further narrow the list of qualified candidates.
Adapters, also known as connectors, are the software components that make a data source available to the EII server. This is the first item of any checklist since there is no solution if your target data sources are not supported. Most vendors get the bulk of their adapters from third-party providers. The level of integration between the EII server and the adapter can vary, so look for adapters that are well integrated into the graphical data mapping tool and the query interface.
If your target data source is not supported, most vendors offer a development kit for building custom adapters. The effort will vary but it won't be trivial, so consider this option carefully.
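To give a sense of what a custom adapter involves, here is a deliberately small Python sketch. The `Adapter` protocol, its single `read` method, and the `CsvAdapter` class are assumptions invented for this example; a real vendor development kit will define its own, richer contract (metadata, updates, transactions, and so on).

```python
import csv
import io
from typing import Iterable, Protocol


class Adapter(Protocol):
    """Hypothetical contract: expose a source's entities as rows of dicts."""

    def read(self, entity: str) -> Iterable[dict]: ...


class CsvAdapter:
    """Custom adapter for a source that only offers CSV text per entity."""

    def __init__(self, files: dict):
        self._files = files  # entity name -> CSV text

    def read(self, entity: str) -> Iterable[dict]:
        return list(csv.DictReader(io.StringIO(self._files[entity])))


def load(adapter: Adapter, entity: str) -> list:
    """Server-side code depends only on the Adapter contract."""
    return list(adapter.read(entity))


customers = load(CsvAdapter({"customers": "id,name\n1,Ada\n2,Alan\n"}), "customers")
```

Even this toy version leaves open questions such as type conversion (every CSV value arrives as a string), which hints at why the effort of a production adapter is not trivial.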
In addition to accessing data, the ability to propagate updates to the underlying data sources is critical. A bidirectional EII solution completes the virtual database illusion to keep the application decoupled from data integration. Look for the ability to update logical data entities made from multiple data sources and look for strong transaction management capabilities.
The logical data model in an EII system will cross many different data sources, so a good security model is needed to manage access to the integrated information. Look for fine-grained access management and integration with enterprise directory servers like LDAP directories.
Integrating data from multiple sources can be computationally expensive. An EII server's query response time can be an issue depending on the number of data sources, the number of joins, and the level of transformation required to generate a result according to the logical data model. Caching query results is a good strategy for reducing the processing requirements and improving performance.
There are two kinds of caches. The more conventional is a read cache that saves prior query results. The EII server scans the cache for past results identical to the query before going to back-end data sources and building new results. To avoid stale data, a read cache periodically clears the cached results. Alternatively, the read cache can also be synchronized with the source, which enables cached results to be selectively cleared or refreshed. This keeps the cache hit rate high to reduce query response times.
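A read cache of this kind can be sketched in a few lines of Python. The `ReadCache` class, its time-to-live policy, and the `invalidate` hook are illustrative assumptions, not a description of how any particular EII server implements caching.

```python
import time
from typing import Callable


class ReadCache:
    """Caches results keyed by the exact query text, with a time-to-live."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # query text -> (expiry time, result)

    def get(self, query: str, run_query: Callable[[str], object]):
        now = self._clock()
        entry = self._entries.get(query)
        if entry is not None and entry[0] > now:
            return entry[1]  # identical query seen recently: serve from cache
        result = run_query(query)  # miss or expired: go to the back end
        self._entries[query] = (now + self._ttl, result)
        return result

    def invalidate(self, query: str) -> None:
        """Selective clearing, e.g. when the source signals a change."""
        self._entries.pop(query, None)


backend_calls = []


def run_on_backend(query: str):
    backend_calls.append(query)
    return [query.upper()]


cache = ReadCache(ttl_seconds=60)
first = cache.get("select customers", run_on_backend)
second = cache.get("select customers", run_on_backend)  # served from cache
```

The periodic clearing described above corresponds to the time-to-live check, and synchronization with the source corresponds to calling `invalidate` for the affected queries instead of waiting for expiry.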
The other type of cache is a query cache. This approach caches all the data in the logical data format and directly processes arbitrary queries against the cached data. A query cache is like a data mart except with active synchronization. This approach eliminates all the physical-to-logical data mapping and all queries against the underlying data sources during a request. For sophisticated logical data models, a query cache can greatly improve performance. However, a query cache's storage and synchronization requirements can be enormous. Careful data partitioning is critical to keep the cache size manageable and filled with sufficiently current data.
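By contrast, a query cache holds the data itself in the logical format and answers arbitrary queries locally. The sketch below uses an in-memory SQLite database as the cache. The `QueryCache` class, the `customer` table, and the full-refresh `synchronize` method are assumptions chosen to keep the example short; a real implementation would synchronize incrementally and partition the data.

```python
import sqlite3

# Hypothetical rows already transformed into the logical "customer" format.
SOURCE_CUSTOMERS = [(1, "Ada", "east"), (2, "Alan", "west"), (3, "Grace", "east")]


class QueryCache:
    """Holds logical-model data locally and runs arbitrary queries against it."""

    def __init__(self):
        self._db = sqlite3.connect(":memory:")
        self._db.execute(
            "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, region TEXT)"
        )

    def synchronize(self, rows):
        """Naive full refresh from the sources, applied in one transaction."""
        with self._db:
            self._db.execute("DELETE FROM customer")
            self._db.executemany("INSERT INTO customer VALUES (?, ?, ?)", rows)

    def query(self, sql: str, params=()):
        """Ad hoc queries never touch the back-end sources."""
        return self._db.execute(sql, params).fetchall()


cache = QueryCache()
cache.synchronize(SOURCE_CUSTOMERS)
east_names = cache.query(
    "SELECT name FROM customer WHERE region = ? ORDER BY id", ("east",)
)
customer_count = cache.query("SELECT COUNT(*) FROM customer")
```

Because the whole data set is copied, storage and refresh cost grow with the data, which is the trade-off the paragraph above warns about.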
A read cache and a query cache can be used in combination to better match data usage requirements. A read cache is best for situations with predictable and repetitive queries, while a query cache is best for environments with ad hoc queries against relatively static data sources.
Enterprise information integration is still very new but the value proposition is realizable. It can dramatically reduce the complexity of working with multiple data sources and produce much more flexible applications.
A starting point for evaluating the suitability of EII products is to consider the three general approaches to EII: relational, object, and XML. Each approach has its advantages and disadvantages, so the best solution depends on the specific requirements. In general:
- The XML and object approaches provide greater data integration capabilities but do not fit as well into traditional database application development environments.
- The XML and relational approaches provide more flexible query facilities but require more work than the object approach to process query results.
- XML with XQuery provides the most powerful modeling and querying facility but the technology is still evolving.
Once the choices have been narrowed, consider the availability of adapters, support for updates, the security model, and the caching model to make the next cuts. This should identify a suitable product to begin leveraging EII.
References
- BEA Liquid Data for WebLogic: www.bea.com
- Raining Data: www.rainingdata.com
About The Author
Peter Chang has been working with Java, middleware, XML, integration, and the Web in engineering and marketing capacities. Previously he was the director of product management at Raining Data and at Sonic Software. Peter is a graduate of the University of California, Berkeley.