Starting about 3.5 billion years ago with bacteria, nature em- barked on the grandest of all algorithms: the evolution of highly complex and dynamic machines capable of interacting with and adapting to their environments in order to solve problems. We know these machines as plants and animals.
One look at the genetic code of even the simplest living organism reveals a structure that's enormously complex and efficiently tuned, ensuring the survival of the organism in its environment. We might even use the terms fault-tolerant, highly parallel, high performance, and ubiquitous. Don't forget that nature accomplished this extraordinary programming feat without a single developer coding an exhaustive list of if-then rules and switch statements to account for all possible scenarios. It was simply based on a random set of interactions with the fittest organisms surviving to replicate their genetic code into the next generation.
With the advent of the internet over the past decade, an entirely digital world has arisen in which web sites and applications are the organisms fighting for survival in a highly complex, internetworked environment replete with computer viruses, server crashes, and the like - an environment in which only the fittest will survive. As such, it's my belief that more sophisticated means of software development are needed to build web applications capable of interacting with and adapting to the complexities of the new digital world thriving within our computers.
One simple, yet extremely powerful, technique that will likely play a role in the evolution of the internet (and the web applications that live within it) borrows several concepts from the biological world and transforms them into bits and bytes with the goal of building adaptive software systems.
This article is the first of a two-part series that examines a technique from the AI community called genetic algorithms, which borrows concepts from biology to solve complex and often nonlinear problems encountered in the world of computer science. This article will introduce you to the concepts of genetic algorithms and discuss why Java is well suited to their implementation. The next installment will investigate the details of implementing these algorithms in Java. It's my hope that after reading these articles, you'll think a little differently about software development and its future. Genetic algorithms provide a problem-solving technique that's too powerful to ignore.
First a little history. Genetic algorithms were born out of the idea of evolutionary programming introduced by I. Rechenberg in the 1960s. John Holland, a professor at the University of Michigan at the time, is credited with the invention of genetic algorithms following the publication of his 1975 book Adaptation in Natural and Artificial Systems. In his book Holland formulated the basics of genetic algorithms as models of machine learning that derive their behavior from concepts of biology's theory of evolution. It was one of Holland's students, David Goldberg, who popularized the use of genetic algorithms when he was able to solve a difficult problem involving gas-pipeline transmission for his dissertation in 1989.
That said, what exactly is a genetic algorithm? What are they used for? What are the benefits over traditional programming techniques? How does Java fit into this? I'll attempt to answer these questions so you'll have the foundation needed to start implementing genetic algorithms (see Figure 1).
Darwin in Your Computer
A genetic algorithm can be thought of as a model for machine learning in which a population of randomly created individuals goes through a simulated process of evolution - a digital survival of the fittest where each individual represents a point in the problem's solution search space. Using correct terminology, an individual is represented by a chromosome, which consists of several genes. Genes are essentially the parameters of the problem to be solved. A collection of chromosomes is considered a population and is the fundamental unit on which a genetic algorithm operates. Once the algorithm is set into motion, individuals are selected from a population and combined in a process called crossover to create a set of children. The children are randomly mutated to create a new set of chromosomes to be reinserted into the population. Once enough children chromosomes have been created to replace a population, a generation is said to have passed.
With each generation, all the chromosomes are evaluated according to some fitness criterion that's a measure of the strength of the chromosome compared to the rest of the population. Only the fittest chromosomes survive into the next generation where the selection, crossover, and mutate process begins anew. After a number of generations have elapsed, the best chromosome is selected from the population and represents the optimal solution to the problem being solved. Essentially what's happening is that a random set of solutions to a problem within a given search space is created and evolved over an amount of time to find an optimal solution. A concrete example will help clarify the concepts described above.
The Traveling Salesman
The traveling salesman problem (TSP) is a classic computer science problem in which a salesman must traverse a number of cities, visiting each only once, while minimizing the distance traveled. For the case of 20 cities, an exhaustive search method that examines all possible routes dictates a search through over 2.4 billion billion (20!) permutations which, if evaluated at a rate of 500 million per second, would take over 150 years to complete.
Employing a gen- etic algorithm reduces the amount of time to seconds (or a fraction thereof, de- pending on the computing power available) and produces the optimum solution in some cases and a near optimal solution in most others. The representation of this problem in the genetic algorithm domain consists of cities with their x and y coordinates serving as individual genes. A chromosome is a list of cities, in order, that represent one possible solution to the traveling salesman problem. The fitness of the chromosome is then the Cartesian distance between the cities when traversed in order, with the fittest chromosomes being those with the shortest overall distance (see Figure 2).
Typically, genetic algorithms have been utilized in solving complex optimization problems when traditional programming techniques (such as exhaustive search, analytic optimization, and line minimization) fail to arrive at a solution in a reasonable amount of time. genetic algorithms confer the following advantages:
This leads to the next question: Why use Java?
- They evaluate several solutions simultaneously, covering a large search space.
- They work well in parallel implementation.
- They optimize parameters with very complex cost functions.
- They create a list of optimal solutions, not just a single solution.
- They work with various data types.
As you can see, genetic algorithms can become computationally expensive depending on a number of parameters (including the size of the population, the complexity of the fitness function, the size of the chromosome, and the time to converge on an optimal solution. Thus, in choosing a language for implementation, weighing the benefits of using Java versus using a compiled language such as C or C++ is essential. For Java to be a viable language for genetic algorithm implementation, it must present significant advantages to make up for its degraded performance as compared to other compiled languages. And it does! The advantages of Java are particularly evident in the area of distributed computing.
Simple and Object-Oriented
Given the dynamic memory requirements for a genetic algorithm, Java's garbage collector relieves us from having to allocate and deallocate memory for chromosomes in each generation. This allows us to focus specifically on coding the problem at hand and not worrying about memory management details. Also, the use of objects allows us to create an endless number of problem encodings and still use the genetic algorithm framework. This means that once the basic algorithm structure is developed, implementing a genetic algorithm to solve new problems becomes a matter of defining the problem and its encoding. Next month we'll take an in-depth look at what this means during implementation.
Robust and Secure
Java was designed for creating software that's highly reliable and capable of operating in distributed environments. As developers start to move genetic algorithms from a single CPU to a network of parallel and distributed CPUs, robustness and security are essential. Think of partitioning a genetic algorithm into a number of populations and letting them evolve separately in parallel, frequently distributing the most fit from each population into all the populations. JavaSpaces presents itself as an excellent candidate for moving genetic algorithms into a distributed environment.
Architecture-Neutral and Portable
As referenced above, the real power of genetic algorithms can be obtained in parallel and distributed environments. With Java's platform-neutrality, populations and the algorithm to evolve them can be distributed among a network of computers for processing, provided that a JVM is available. Don't worry about the implementations for different operating systems and CPUs. Think of the [email protected] project that utilized over two million PCs connected to the internet to churn through radar data in the search for extraterrestrial intelligence. Genetic algorithms are ideal candidates for use in such a distributed environment, with java being the obvious language of choice given its portability. Computing power is no longer an issue; there will be more than enough to go around.
Now that we've briefly examined the nature of genetic algorithms and why Java makes sense as the development language of choice, let's take a more detailed look at the fundamental components that make up a genetic algorithm. For the sake of simplicity, we'll cover the most basic implementations of genetic algorithms and introduce the essential core concepts. I highly recommend further research and study if genetic algorithms spark a deeper curiosity. A number of resources are available on the web for such study.
A gene can be defined as the encoding of a single parameter in a genetic algorithm. A gene can take many forms depending on the problem definition. For the traveling salesman problem, a gene represents a city and its longitude and latitude coordinates. However, when solving a high-order, nonlinear, partial differential equation, a gene can represent one of the variables to solve for and its range of acceptable values.
This highlights the two main flavors of genetic algorithms: permutation-encoded versus real-parameter. In the former version, the goal is to find the optimal ordering of a set of genes such as in the TSP. As for the latter, an example of a real-parameter genetic algorithm is finding x and y such that the following function is minimized: f(x, y) = 2x * sin(3 * y) + 4y * cos (5 * x).
Historically, genes were represented as sequences of 1's and 0's. However, this approach has not been shown to yield better performance and introduces a layer of complexity as a translation is needed between the actual values of parameters and their binary representation. In addition, handling genes as objects in Java makes the implementation more intuitive and can be extended to make them reusable across different genetic algorithm implementations. (More on this in next month's article.)
Much like its biological equivalent, the gene pool for a genetic algorithm is a collection of all the available genes. From the gene pool, chromosomes are created at the beginning of a genetic algorithm by randomly drawing genes from the gene pool and assembling them to build a chromosome that represents one solution for the search space defined for the genetic algorithm.
Returning to the examples mentioned above, the gene pool for solving the traveling salesman problem consists of one gene per city to be traversed. For the case of 20 cities, there will be 20 genes in the gene pool from which random chromosomes will be created. For real parameter genetic algorithms, such as minimizing the function f(x, y), the gene pool will consist of two genes, one representing the variable x and the other representing the variable y.
Continuing with definitions, a chromosome is a collection of genes representing a single point in the solution search space. The fitness of a chromosome is determined by a cost function determined prior to the execution of the genetic algorithm. Again, returning to the traveling salesman problem, the fitness of a given chromosome is the sum of the distances between the cities when traversed in the order specified by the chromosome. for the real parameter chromosome (f(x, y)), the fitness is the result of substituting the x and y values back into the original function and performing the calculation. Note that the fitness of a chromosome tells you nothing about its strength relative to other chromosomes; rather, it's a raw evaluation of the chromosome's fitness. It's at a higher level that fitnesses are compared and selection proceeds according to the rules of a genetic algorithm. This higher level is the population.
A population is a collection of all the chromosomes being evolved in a genetic algorithm. As new chromosomes are created and reinserted into the population, less fit chromosomes are replaced and only the most fit survive into the next generation. As mentioned previously, it's here that the process of digital evolution occurs, as the fitness of the competing chromosomes is compared in order to select parent chromosomes to reproduce.
Depending on the search space for a given problem, population size can range from a few dozen chromosomes to several hundred, several thousand, or more. Given the fact that a chromosome represents a single point in the solution search space, for problems with extremely large search spaces (such as the 20-city TSP), it makes sense that a large population size is needed to cover as much of the space as possible. Otherwise, the genetic algorithm may approach a local minimum and converge toward it, rather than the global minimum. Convergence is a core issue in genetic algorithm implementation, and I highly recommend further examination outside of this article to gain additional insight.
Genetic Algorithm Operations
Now that we've discussed the requisite components of a genetic algorithm, it's essential to understand how a genetic algorithm operates on each of the components to create a simulated evolutionary environment that combs a search space for an optimal solution. There are five elementary genetic algorithm operations:
- Fitness evaluation: With the examination of a chromosome and its role within a population, we talked briefly about fitness evaluation and its importance. The proper definition and evaluation of a fitness function is critical to the success of the genetic algorithm. It's the means by which chromosomes are compared to one another to determine the most fit individuals. The primary goal here is differentiation between the more fit chromosomes and the less fit chromosomes. Remember, it's survival of the fittest.
- Selection: This is the method by which chromosomes are chosen to reproduce in order to create children for the next generation. The goal of selection is to choose individuals that, on average, are more fit than others to pass on their genes to the next generation while, at the same time, maintaining genetic diversity. If a population consists of identical individuals, genetic diversity is lost and it's difficult for the genetic algorithm to explore different regions of a search space.
Several different methods are available for genetic algorithm selection, but for the sake of simplicity and brevity I'll focus on a technique labeled tournament selection. With this technique, a group of individuals is selected at random and the two most fit are selected for reproduction (i.e., they win the tournament). Keeping the tournament size small (4-8 chromosomes) ensures genetic diversity as the group is small, and what appears to be the most fit within the group may actually be a weak chromosome when compared with the entire population.
- Crossover: Once two parent chromosomes are selected, they reproduce two child chromosomes via the crossover operation. One of the parameters of a genetic algorithm is the crossover probability (typically 75-90%) that represents the statistical chance that two given chromosomes will cross over. For each potential crossover, a random number between 0.0 and 1.0 is generated. If the number is greater than the crossover rate, then crossover doesn't occur and the children chromosomes are exact replicas of their parents. If crossover does occur, then the parents randomly exchange genes to create new chromosomes.
There are three types of crossover covering a wide range of problem encodings:
- Permutation encoding with unique genes: In this case, a gene can appear only once within a chromosome. One example is the TSP. Each city may appear only a single time within the chromosome.
- Crossover operating on the permutation encoding, with the exception that genes don't have to be unique: Let's imagine that we have a genetic algorithm that's evolving a musical piece within the key of C. All the notes in the key of C are viable and can be repeated indefinitely up to the size of the chromosome.
- Real parameter chromosome crossover: In a real parameter chromosome, each gene will represent a parameter to be applied to a given cost function. Building on the function, f(x, y) described earlier, two parent chromosomes will have genes for the x variable, both representing different values. A method for crossing over the two genes might involve creating a new gene for the x variable with the value being the average of the two parent genes.
Crossover is another essential genetic algorithm operator that ensures genetic diversity within a population. The conceptual goal of crossover is, over time, to combine the good portions of chromosomes into newer and better chromosomes. For a better understanding, see Figure 3. I highly recommend further exploration of the crossover operator before attempting to implement your own genetic algorithm.
- Mutation: Similar to crossover in that it randomly modifies chromosomes, it operates on only a single chromosome at a time (see Figure 4). As with crossover, there's a probability associated with the occurrence of mutations, albeit a small one (typically 5-25%). Yet again, returning to the TSP, a typical mutation can include randomly selecting two endpoints within a chromosome and reversing the order of the genes. Several mutation techniques that can be utilized depending on the problem encoding won't be discussed here. It's important to remember that mutation is a fundamental operator for ensuring genetic diversity within a population, which translates into a better coverage of the search space.
- Insertion: This is the final algorithmic step to conclude a generation in a genetic algorithm. Insertion is the process of introducing children chromosomes into a population and removing the less fit chromosomes. One common technique for insertion utilizes a technique called elitism in which the n best chromosomes of a population are kept for the next generation and the rest are replaced with new children. This ensures that the most fit chromosomes survive into the following generation and have the opportunity to reproduce again.
Genetic Algorithm Considerations
By now you should have a basic understanding of what a genetic algorithm is and how it works. Let's now quickly look at some considerations when implementing a genetic algorithm.
The goal of implementing any genetic algorithm is convergence on an optimal solution for a given search space. Convergence will be affected by numerous factors associated with the implementation of the genetic algorithm, such as parameter encoding, population size, crossover and mutation rates, and selection technique. Depending on the problem being solved, these factors are usually determined only by experience working with genetic algorithms of all flavors. My recommendation is to start coding!
Performance is an issue that has constantly plagued genetic algorithms due to their heavy-duty processing power requirements. with the combination of Moore's Law and the increased availability of highly parallel, distributed computing power, I don't think performance will be an issue in the near future.
Here's the number one barrier to acceptance of genetic algorithms as a practical programming technique: real-world applications. Genetic algorithms have resided primarily in academia solving classic computer science problems. Their use in business and commercial environments is highly unproven. As computing power becomes more readily available, I think we'll see an increase in adaptive software systems with genetic algorithms at their core.
One particular area of work that may break down the wall is security. Researchers have begun to develop operating systems modeled after the immune system of animals. As new viruses invade the system, strategies are evolved to combat the virus, remove it from the operating system, and identify similar attacks in the future. And with the proliferation of highly sophisticated attacks on internet sites, such an "immune" system offers a much better (and quicker) response than waiting for a human to recognize the attack and code a patch to fix it or close ports on a firewall to deny it.
Another interesting outbranching of genetic algorithms is the field of genetic programming pioneered by John Koza. Without getting into the details, genetic programming is essentially using genetic algorithms with the genes representing programmatic constructs (e.g., AND, OR, IF, THEN, +, and -). What's evolved are chromosomes representing computer programs. It's an exciting field that's worth a deeper look.
The goal of this article wasn't to encourage you to implement genetic algorithms in your code tomorrow, but rather to inform and educate you about one technique for building software capable of adaptation. As the Internet continues to grow at a furious pace, a new digital world is being created that operates in the language of 0's and 1's. The organisms fighting for survival are the web sites that you and I create on a daily basis. Whether fighting for survival in the sense of attracting new customers or warding off the latest computer hacker, adaptability will be crucial to survival in the complex digital world. Hopefully this article has sparked a newfound interest in software development and its future. If so, stay tuned for the next issue of JDJ, in which I'll demonstrate a simple implementation of a genetic algorithm.
Michael Lacy is an engineer for the platform development group at Shutterfly, an online photo service, where he develops Web-based solutions for digital image printing, enhancing, and sharing. He's also a certified Java programmer and developer.