pugh gecco15


1 Confronting the Challenge of Quality Diversity . New York, NY: ACM. In: Proc. of the 17th Annual Conference on Genetic and Evolutionary Computation (GECCO 2015) Justin K. Pugh, L. B. Soros, Paul A. Szerlip, and Kenneth O. Stanley Department of EECS, Computer Science Division University of Central Florida Orlando, FL 32816 USA { jpugh, lsoros, pszerlip, kstanley } @eecs.ucf.edu mizer is inspired by the feats of evolution in nature, where it ABSTRACT appears to have optimized a myriad of complex yet highly In contrast to the conventional role of evolution in evolution- functional forms. While some niche areas of EC deviate from ary computation (EC) as an optimization algorithm, a new this objective-driven paradigm (such as open-ended evolu- class of evolutionary algorithms has emerged in recent years 23 , 21 , 2 tion in artificial life [ ]), it has nevertheless broadly that instead aim to accumulate as diverse a collection of dis- encompassed even newer research areas such as generative coveries as possible, yet where each variant in the collection is ]) and neuroevolution 25 and developmental systems (GDS [ as fit as it can be. Often applied in both neuroevolution and [8]. (QD) quality diversity morphological evolution, these new One potential downside of this emphasis on objective op- algorithms are particularly well-suited to evolution’s inherent timization is that it has pitted EC as a direct competitor strengths, thereby offering a promising niche for EC within to virtually the entire field of machine learning, which simi- the broader field of machine learning. However, because QD larly focuses largely on the problem of efficient optimization algorithms are so new, until now no comprehensive study 1 , 16 ]. While one consequence of this similar emphasis is a [ has yet attempted to systematically elucidate their relative view among some in machine learning that EC is an inferior strengths and weaknesses under different conditions. Tak- optimization algorithm, even if that view is arguably unin- ing a first step in this direction, this paper introduces a formed, it still leaves evolution in the position of seemingly new benchmark domain designed specifically to compare and “just another optimization algorithm.” contrast QD algorithms. It then shows how the degree of Yet evolution in nature is clearly not only about optimiza- alignment between the measure of quality and the behavior tion. It is notable not for solving one particular problem, but characterization (which is an essential component of all QD problems, and in uncountably diverse ways. In this many algorithms to date) impacts the ultimate performance of sense, evolution is not a mere alternative to (for example) different such algorithms. The hope is that this initial study backpropagation [ 22 ], which 3 ] or support vector machines [ will help to stimulate interest in QD and begin to unify the are not designed to generate a vast diversity of high-quality disparate ideas in the area. solutions. (As explained later, this kind of evolutionary di- Categories and Subject Descriptors: versity is also not the same conceptually as a multi-objective : Learning; I.2.6 [Artificial Intelligence] ] either.) This incongruity between EC in 7 Pareto front [ I.2.6 [Software Engineering] : Metrics— complexity mea- practice and its natural inspiration suggests a misalignment sures, performance measures between the conventional application of EC and its full po- tential as a genuinely unique paradigm for search that can General Terms: Algorithms separate it from much of machine learning. novelty search; non-objective search; quality Keywords: Hints of this broader interpretation of evolution within diversity; neuroevolution EC have begun to emerge with non-objective algorithms ], which searches for novelty 12 like novelty search (NS) [ 14 , 1. INTRODUCTION without an explicit objective. NS has in turn inspired a succession of variant algorithms that elaborate on the idea In evolutionary computation (EC), evolution has conven- of a search through novelty alone [ 19 , 18 , 13 , 9 ]. Yet even tionally played the role of an optimization algorithm, wherein while this new generation of evolutionary algorithms begins the performance of the algorithm is judged by how quickly to deviate from the traditional objective paradigm, such and how close it comes to reaching an objective [ 6 ]. The based on their evaluated algorithms are still predominantly idea that evolution can act as an efficient objective opti- 14 19 , ]. That is, while ability to reach a particular objective [ they may not function internally through the compass of the Permission to make digital or hard copies of all or part of this work for personal or they are often still treated merely as objective, externally classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation alternatives to objective-driven optimization algorithms. on the first page. Copyrights for components of this work owned by others than ACM However, some researchers have begun to take non-objec- must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, tive search in a different direction. They have instead em- to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected] braced the potential for evolution to collect quality diversity 4 , ], which means searching for as many 17 27 (QD) [ , , 5 , 11 GECCO ’15, July 11–15, 2015, Madrid, Spain possible variations (with respect to a chosen characterization) c © 2015 ACM. ISBN 978-1-4503-3472-3/15/07. . . $15.00 http://dx.doi.org/10.1145/2739480.2754664 DOI:

2 as possible, but where each such variation is as good as it can behaviors on its own is not enough because the ideal outcome would be that each variant is not only unique but also as high- possibly be (with respect to a separate measure of quality). performing as it can be. For example, merely discovering QD is particularly intriguing as an application of EC because what most algorithms in machine learning that self-ambulating pogo sticks exist in the morphology not it is explicitly and behavior space of virtual creatures is not as good as are suited to achieving, yet nevertheless well-aligned with self-ambulating pogo sticks, in best possible discovering the the achievements of evolution in nature. In this way QD is addition to the best possible behaviors of all the other viable an opportunity for EC to shine as an essential option for the ]. Note that this notion of QD kinds of problems that require accumulating a multitude of morphologies that exist [ 11 diverse yet high-quality alternatives. is not the same as finding the highest-fitness subset of all In this spirit, the aim of this paper is to begin to in- conceivable creatures because even the very best pogo stick troduce a framework for understanding and relating QD fitness than e.g. quadrupeds may have significantly lower algorithms. While the literature analyzing the nuances of al- and bipeds. However, the point is that we still want to find gorithms in EC and elsewhere for the purpose of optimization that pogo stick and see how well it can do. The diversity components in QD also still differ from the separate objectives is vast, almost no studies to date investigate the tradeoffs in multi-objective optimization because individual diversity among different approaches to QD. This paper begins such components include no intrinsic ranking – for example, all an enterprise by introducing a domain designed explicitly morphologies in the space of creatures are equally of interest to analyze approaches to QD in the context of neuroevolu- as long as they can walk at all. tion, and then compares several leading algorithms in that To empower evolution to collect QD, a new breed of al- new framework, including one new variant algorithm called MAP-Elites+Novelty , elaborated from the original MAP- gorithms have appeared in recent years that augment al- , 5 Elites algorithm [ ]. The result is fresh insight into the 17 gorithms like novelty search to push each unique variant towards its best possible performance, opening up new appli- consequences for different approaches to QD of varying the cations. For example, Lehman and Stanley degree of alignment of the behavior characterization (which [11] introduced is essential to all such algorithms so far) with the notion of the novelty search with local competition (NSLC) algorithm quality in the domain: some algorithms perform best when to evolve a wide diversity of successful ambulating creatures in a single run. Similarly, Szerlip and Stanley [27] such alignment is high, while others excel without it. Be- collect a diversity of high-performing behaviors in a Sodarace-inspired cause QD is such a promising new direction for EC with two-dimensional creature domain. In an intriguing foreshad- many possible future applications (as discussed in the next [4] section), this initial investigation of algorithms in this area owing of future applications, Cully and Mouret show that behavioral helps to lay the groundwork for an emerging new field. QD once collected can be aggregated into a useful repertoire , in this case a collection of diverse robot motion skills. 2. BACKGROUND later introduced an alternative algorithm Cully et al. [5] Early hints of a science of QD within EC first appeared for collecting QD called MAP-Elites, which showcased yet in the 1990s with work on multimodal function optimization another intriguing application of behavioral diversity: a [15] . MMFO (MMFO); a good review is provided by Mahfoud collection of ambulation strategies becomes a source of means aims to discover as many high-scoring (though not neces- to alter ambulation in the event of damage. Yet another sarily global) optima within a fitness landscape as possible. [20] application of QD is demonstrated by Nguyen et al. , who While MMFO spawned a number of important diversity- generate with MAP-Elites a diverse collection of high-quality preservation mechanisms (mainly focusing on genetic diver- spoof images that fool deep networks. This breadth of early sity), a shift more than ten years later towards methods applications of such QD algorithms suggests an important , focused on behavioral diversity [ 12 , 14 18 , 19 ] revived inter- new research area with many applications yet to be realized. est in the potential for diversity to transform the application The aim of this paper is to begin to understand the factors of EC. This new wave of algorithms, starting with novelty that contribute to successfully collecting QD under different 12 ], differed from the older work on MMFO in partic- search [ conditions. A common component of all modern QD algo- ular in its focus on discovering the full spectrum of possible rithms is the (BC), which formal- behavior characterization behaviors within a particular domain, often represented by izes the behavior of a candidate individual in the population evolved neural networks. While much work in this area cen- ]. 11 , and is used to determine its eligibility for selection [ 5 tered on the possibility that a side effect of seeking behavioral However, different approaches to QD treat the BC in differ- diversity is sometimes the discovery of very high performers, ent ways, and the consequences of these choices are currently another view that emerged is that the diversity of discoveries poorly understood. For example, while novelty favors indi- is a desirable byproduct. itself viduals with the most novel BC, MAP-Elites discretizes the Unlike in multi-objective optimization [ 7 ], where the aim behavior space into unique bins from which (when occupied) is to collect a Pareto front of performance trade-offs, these selection is equally probable. This paper focuses in particular diversity-collecting algorithms focus instead on sampling the on the trade-offs created by such differing treatments and full behavior space as comprehensively as possible without begins to explore the possibility of hybridizations that may concern for objective performance. In some cases a collec- combine the benefits of different approaches. tion of diverse behaviors on its own offers direct practical value; for example the recent divergent discriminative fea- 3. EXPERIMENT 28 ] accumulates a vast ture accumulation (DDFA) algorithm [ collection of diverse feature detectors that are then aggre- To effectively tease out the differences between competing quality diversity (QD) algorithms, a test domain is needed gated into a neural network for later training in classification problems. with both (1) a concrete measure of quality, and (2) a diver- However, for many applications a collection of diverse sity of legitimate solutions. No standard such benchmark

3 domain yet exists in the area of QD, and thus one of the con- tributions of this paper is to propose one suited specifically to the concerns of QD. The first experiments with novelty search [ 14 ] intro- , 12 that has since become duced a domain called HardMaze popular within the literature because of its explicit visual S depiction of deception within a fitness landscape. The catch in HardMaze is that for a robot to navigate to the goal point away from the of the maze it is necessary first to navigate goal multiple times along the path. Additionally, there is an easily-reachable cul-de-sac that is close to the goal in terms of Euclidean distance but is actually farther away from the goal in terms of distance along the correct path. G Importantly, searching for behavioral novelty alone (i.e. nov- elty search) in the HardMaze solves the maze much faster than a conventional objective-based search wherein fitness is the Euclidean distance to the goal point at the end of a trial (which, unlike novelty search, is easily drawn into the Figure 1: Quality diversity maze (QD-Maze). Indi- cul-de-sac). Thus, in HardMaze, Euclidean distance serves viduals start at the point S and the highest quality solutions as a concrete measure of quality. However, HardMaze only . While the maze is deceptive enough to G navigate to point contains correct path through the maze (i.e. the diversity one challenge objective-based algorithms, it also contains a vari- of legitimate solutions is limited), diminishing its utility for ety of solutions and thus serves as a benchmark for combining testing QD algorithms that seek many solutions. quality with diversity. Instead of HardMaze, in this paper, we augment the idea of a maze navigation domain that uses Euclidean distance as a concrete measure of quality to also encompass a diversity of of the BC’s alignment with the objective, three alternative possible solutions to the maze. This new domain, QD-Maze BCs are also tested, each less aligned than the previous: (figure 1), still contains several deceptive traps (albeit less FullTrajectoryBC, HalfTrajectoryBC, and DirectionBC. difficult than those in HardMaze: all algorithms tend to FullTrajectoryBC , the agent’s position is sampled Under find at least one solution) while also allowing many possible at three points throughout its trial: one-third, two-thirds, “correct paths” to the goal, thus allowing an explicit diversity and at the end of the trial, and all three ) points are x, y ( of paths to the goal. concatenated together to form the six-dimensional BC vector. Aside from the change in map configuration, the setup This BC is less strongly aligned with the notion of quality for QD-Maze is identical to HardMaze. A robot agent is than EndpointBC because only the final sampled point has given a fixed time to move around the maze before the a strong correlation to the final score within the domain. quality is scored according to the trial ends and the agent’s , wherein the HalfTrajectoryBC Even less aligned again is Euclidean distance between its final position and the goal only agent’s position is sampled at three points throughout point (the score is maximum if the goal point is reached). the first half of its trial. The second half of the trial leaves Reaching the goal point prematurely ends the trial. Agents sufficient time for an agent to reach the goal from any loca- are equipped with a set of six rangefinders, five of which tion within the maze, rendering HalfTrajectoryBC even less are equally spaced across the front-facing 180 degrees and aligned with the objective than FullTrajectoryBC. Highlight- the sixth facing the rear. Additionally, four pie slice sensors ing this point, in theory, every behavior under HalfTrajecto- (front, right, rear, and left -facing) detect the direction of the ryBC can potentially achieve a maximum quality score (i.e. goal by activating the pie slice facing the goal with maximal ultimately solving the maze). value. In all experiments in this paper, agents are driven by , samples DirectionBC The final and least-aligned BC, evolved neural networks. the direction the robot is facing on each tick (north, east, south, or west) and divides the duration of the trial into 3.1 Behavior Characterizations (BCs) five equal-length time slices. The BC then consists of five dimensions, each corresponding to the direction that the The nature of the “diversity of solutions” available for agent was facing most often during each of the five time slices discovery within the search depends heavily on how behaviors , 375 , = 0 east ). 875 . = 0 west , 625 . = 0 south ( north = 0 . 125 . 12 ], are characterized. In the original HardMaze experiments [ This characterization is almost completely agnostic to the agent behaviors are characterized according to their final notion of quality because the goal point of the maze can be position at the end of the trial ( EndpointBC ). This BC is approached from any direction and the agent is free to wander notable for its strong alignment with the implicit objective around the maze and face any direction at any time. Under of a maze. In other words, the neighborhood of behaviors DirectionBC, every possible behavior in the behavior space around any particular trajectory endpoint generally share can potentially achieve a maximum score on the objective of a similar level of quality, thereby aligning behavior with a reaching the goal and there is little to no correlation between notion of quality. Under the pressure of a strongly-aligned quality and behavioral neighborhoods. BC, searching for novelty alone may often be enough to find not only diversity, but also high-quality solutions to the 3.2 QD Algorithms domain, without even the need to explicitly optimize each behavioral neighborhood independently. To investigate the A variety of algorithms can collect QD in a given domain, effect on different approaches to QD of varying the degree some designed explicitly to do so and some more as a side

4 effect. In this paper, we compare the performance of three grows exponentially with BC dimensionality; for this reason existing algorithms (novelty search, novelty search with lo- and to keep the comparisons fair, none of the BCs in this cal competition, and MAP-Elites), the latter two designed paper includes more than six dimensions. explicitly for QD, and a new variant of MAP-Elites (MAP- While the simplicity of selection in ME is appealing, its effect is that evolution is more likely to concentrate resources Elites+Novelty). A conventional fitness-based search (which is not expected to excel at QD) also serves as a baseline for in regions of the behavior space already filled (i.e. because that is where most of the bins are occupied). This approach performance. To normalize comparisons across the algorithms as much proved effective in ME applications so far, but it is easy to 10 augment ME with pressure to concentrate effort on more as possible, the SharpNEAT [ ] distribution of the NEAT MAP- algorithm [24] serves as the underlying neuroevolution plat- novel regions by adding novelty pressure. In this form for all of them. Therefore, the neural networks are able variant tested for the first Elites+Novelty (ME-Nov) to evolve increasing complexity, a central feature of NEAT. time here, the probability of selecting an occupied bin for reproduction is proportional to its novelty, creating a stronger However, except in the pure fitness-driven baseline runs, to pressure towards diversity than in the original ME. However, minimize potentially confounding factors, the conventional genetic diversity component of NEAT (also called once all the bins are filled in ME-Nov, it collapses back to ) speciation an approximation of regular ME. The question is whether is excluded. That way, the diversity facet of QD is exclusively , which is the focus of the majority of behavior the added initial pressure towards novelty might enhance focused on recent innovation in non-objective search algorithms and has ME’s drive towards QD. Also, while ME has effectively filled 5 most bins in experiments published so far [ ], in future 20 exhibited significantly more impact on performance in such , domains in which some bins require very advanced solutions algorithms than genetic diversity [ ]. Furthermore, because 19 of preliminary experimental evidence that generational evolu- and therefore are unlikely all to be filled, the additional novelty pressure in ME-Nov could prove instrumental in tion can lead to instability or “jumping around the behavior space” in non-objective search populations, all algorithms are covering as much of the space as possible. (im- run in steady state Finally, a regular mode, which means that a subset of the fitness-based search (fitness) ]) simply searches for objectively population (in this case 32 individuals evaluated in parallel) 24 plemented as NEAT [ superior performance (though with NEAT’s conventional ge- is replaced at a time instead of a whole generation (which is 26 ] variant of NEAT). In total five netic speciation fitness-sharing mechanism to give it at least similar to the rtNEAT [ some hope of maintaining diversity). This variant, which is algorithms are compared: susceptible to the trap of deception [ 12 , 14 only The classic ] algorithm novelty search (NS) [ ] and has no behav- 14 ioral diversity mechanism, helps to highlight the need for searches for behavioral diversity with no notion of quality. It rewards behaviors that are maximally distant from pre- specialized approaches to the problem of QD. viously discovered behaviors. However, if the BC is aligned with a notion of quality, then it is possible that NS without 3.3 MAP-Elites as a QD assessment mecha- any special augmentation can itself collect QD effectively, a nism possibility thereby investigated in this experiment. To compare the ability of competing QD algorithms to NS is augmented to explicitly seek QD in the novelty collect quality diversity, a quantitative measure of quality ] algo- [ 11 , 27 search plus local competition (NSLC) diversity is needed. Interestingly, the MAP-Elites style of rithm. In this variant, a multi-objective formulation imple- variant all discretized behavior space can be collected for mented through NSGA-II [ 7 ] casts classic novelty as one algorithms, even those that are not MAP-Elites. As long as objective and local competition (LC), which means a quality there are not too many dimensions in the BC (as is true in BC, as another. The similar rank relative only to those with a BCs in this paper), the best behavior found for every bin in LC component encourages unique behaviors to improve in the behavior space (regardless of through which method it quality within their local behavioral neighborhoods, thereby was found) can simply be stored for later analysis. aligning the search explicitly with the notion of QD. The For this purpose, for each BC, the behavior space is dis- LC approach contrasts with the idea of combining novelty cretized into an appropriately small number of bins (i.e. small global competition objective (NSGC), which has been with a enough that MAP-Elites itself can be run efficiently and with previously shown less effective than NSLC for discovering a breeding pool of the same order of magnitude as the size of ] because a diversity of solutions within a search space [ 11 the breeding population in the NS, NSLC, and fitness-based global competition explicitly prunes out diversity, and is algorithmic treatments): 900 bins for EndpointBC, 729 for therefore not considered in this paper. FullTrajectoryBC and HalfTrajectoryBC, and 1,024 for Di- MAP-Elites (ME) Unlike NS and NSLC, the recent rectionBC. For each algorithmic treatment, all individuals 5 , 17 , 20 ] divides the behavior space of the BC algorithm [ are checked for eligibility in this MAP-Elites-style grid as into discrete bins. Each bin then remembers the fittest (elite) they are evaluated. The performance (QD score) of each genome ever to occupy it, and only one genome occupies algorithm at any given point during the run is then mea- any given bin at one time. Thus the elites within each sured as the total fitness across all filled grid bins within the bin capture the notion of quality and the whole set of bins QD collection grid (where higher fitness or quality means a capture the notion of diversity. In the classic formulation final robot position closer to the goal). Throughout a run, of ME, selection is very simple: the chance of producing an algorithms therefore improve their QD score by either (1) dis- offspring is equal for each filled bin. Thus search effort tends covering higher quality solutions for existing bins (increasing to be spread uniformly across the known behavior space quality) or (2) discovering more bins (increasing diversity). (rather than explicitly biased by novelty). One technical To achieve the highest QD scores (which is possible in this limitation of ME is that the BC must be constrained to a domain), an algorithm must therefore excel with respect to low number of dimensions because the total number of bins both quality and diversity.

5 3.4 Experimental setup NS Each of the methods (NS, NSLC, ME, ME-Nov, fitness) NSLC ME-Nov 400k was combined with each of the behavior characterizations ME (EndpointBC, FullTrajectoryBC, HalfTrajectoryBC, Direc- 300k tionBC) to comprehensively test the effects of both factors Fit on QD. Except for the DirectionBC variants, each combi- ,000 eval- 250 nation was evaluated over 20 runs capped at 200k uations. Because the DirectionBC variants converge to sta- ble scores significantly more slowly (collecting QD across 100k such a poorly-aligned BC is harder), they were capped at Quality Diversity evaluations. The population size was 500 for NS, 750 ,000 0 NSLC, and fitness; NS and NSLC also keep an associated 0 100k 200k 150k 250k 50k novelty archive of maximum size 2,500, after which the least Evaluations novel is removed when a new entry is added. For MAP- Elites, as is standard the population at any given time was In this Figure 2: EndpointBC (very high alignment). the size of the number of occupied bins. To enable par- performance comparison and others in this paper, the average allel evaluation, steady state evolution is implemented in QD (taken over 20 runs) for each variant method is shown batches of 32 offspring at one time. Following the settings in across a run. The yellow strip at the top indicates the period the original novelty search experiments [ 14 ], the probabili- during which there is a significant difference at least between ties for mutating connections was 60%, for adding connec- the top and bottom method (exlcuding fitness, which is tions was 10%, and for adding neurons was 0.5%. Evolved always significantly worse than all other methods). The networks are constrained to be strictly feedforward. All method labels are color coded to match with their respective other settings follow standard SharpNEAT 1.0 defaults [ 10 ]. curves and are shown from top to bottom in the their rank Source code for the entire experimental setup is available at order during the period of significance. Note that because http://eplex.cs.ucf.edu/uncategorised/software#QD of the large QD scale, sometimes significant differences exist Performance for each method and BC combination was even when not visually apparent. For EndpointBC, NS measured according to the total fitness across all filled grid performs best, followed by NSLC. bins (Section 3.3). Results were averaged across twenty runs for each of the 20 method-BC combinations (a total of 400 separate runs were performed). Additionally, pairwise com- parisons were performed after every 160 evaluations to deter- as they converge near the end of the run can be difficult to mine when exactly there are significant differences between perceive in the figures, but the QD scale is actually vast, so 05) was used to establish 0 . methods. Tukey’s test (with p < significant difference actually do persist in some cases where significance instead of the Student’s t-test to account for the it is difficult to perceive visually.) The fact that novelty statistical problem of multiple pairwise comparisons. alone (NS) does so well highlights the fact that when the BC is aligned with fitness, additional sophisticated machinery designed to simultaneously reward quality and diversity is 4. RESULTS potentially unnecessary (perhaps depending on the difficulty of the domain), and in such situations novelty alone becomes Figures 2 through 5 show for each BC the total quality a powerful force for QD. diversity (QD) discovered so far by each method over the course of 20 averaged runs for each method. Higher QD is However, as the alignment of the BC with quality re- duces, the story changes. In the FullTrajectoryBC (figure 3), better. Fitness runs, which are included only as a baseline, are unsurprisingly significantly below all other runs. Among which is still fairly aligned with the notion of quality (i.e. be- cause trajectories that lead to the goal are high quality), NS, the other four methods, a yellow bar at the top of each graph NSLC, and ME-Nov are tied (i.e. not significantly different) highlights the period during which significant differences are observed between the best and worst performing methods throughout most of the run. For this BC, only ME performs significantly below the other methods. Thus NS retains its among those four. Note that in terms of both QD and signif- utility as a top choice, but no longer holds an advantage icance testing, numerical comparisons are only meaningful over NSLC or ME-Nov. ME is not well suited to BCs with because the total between methods within the same BC possible QD score depends on the BC. The main result is high alignment because it is agnostic with respect to novelty, leading to a relative delay compared to novelty-driven ap- that the ranking of methods depends on the alignment of the BC with the notion of quality, suggesting that choosing a proaches (including ME-Nov) in exploiting promising areas of behavior space. QD method for a particular problem can potentially become The HalfTrajectoryBC (figure 4) aligns with quality even a principled decision in the future, though some methods are less, which leads to an interesting phenomenon seen only more generally robust across different BCs than others. with this BC: After a very brief period at the start of the With the EndpointBC (figure 2), which is highly aligned run, all methods effectively perform the same (although ME with the notion of quality (i.e. closeness to the endpoint), is still a bit below, the difference is not significant). In effect, the novelty search-based methods (NS and NSLC) signifi- cantly outperform all other methods for most of the run, HalfTrajectoryBC hits just the right level of misalignment with NS ending significantly (though only slightly) above that the advantages and disadvantages of respective meth- NSLC. ME-Nov also significantly outperforms original ME, ods become a wash. Surprisingly, even NS alone remains though is still significantly below NS and NSLC. (It is im- competitive, despite its lack of an explicit quality component. portant to note that the differences between the methods Finally, DirectionBC (figure 5), which tracks facing di-

6 ME 500k 100k ME-Nov NSLC NSLC NS ME-Nov NS ME Fit 400k 80k 300k 60k 200k 40k Quality Diversity Fit 100k Quality5Diversity 20k 0 0 0 150k 100k 50k 250k 200k 0 150k 450k 600k 750k 300k Evaluations Evaluations Figure 5: DirectionBC (low alignment). With an al- For Figure 3: FullTrajectoryBC (high alignment). most complete lack of alignment in this BC, ME and ME-Nov this BC, which is slightly less aligned with quality than tie for first place, and NS trails far behind. EndpointBC, NS, NSLC, and ME-Nov are effectively tied, with ME significantly behind. 4.1 Visualizing Behavior Spaces Recall that for all variant search algorithms, the highest- performing genomes are collected in a discretized behavior space “grid” (inspired by the MAP-Elites grid). Collecting 350k ME-Nov NS elite behaviors in this way enables a convenient visualization NSLC 300k of each algorithm’s ability to collect QD within the domain. ME , uses [5] The visualization technique, proposed by Cully et al. 250k two dimensions to display an entire behavior space grid by 200k nesting two-dimensional grids within other two-dimensional grids, each two-dimensional grid representing two dimensions 150k of the behavior space. As an example, figures 6 through 8 100k Fit depict the final state of the QD collection grid at the end of Quality Diversity 50k one typical run of Fitness, ME, and NS in the FullTrajecto- ryBC. These visualizations reveal that: (1) Fitness (figure 6) 0 100k 0 50k 250k 200k 150k and MAP-Elites (figure 7) exhibit a diminished ability to fill the grid for this BC (corresponding to a diminished pressure Evaluations towards diversity) while the highest performing algorithm (on FullTrajectoryBC), NS (figure 8), is able to fill the grid Figure 4: HalfTrajectoryBC (modest alignment). All almost entirely after 250,000 evaluations. The visualizations the methods except fitness are tied for the vast majority of also expose (2) the relatively strong alignment of the behavior the run for this BC, which is even less aligned than FullTra- space with the fitness measure through the smooth gradient jectoryBC. in the outer two dimensions from low fitness at the top of the grid towards high fitness at the bottom-middle of the grid. rection only, aligns least with quality because it is largely This paper is also accompanied by a website where sample detached from trajectory. For example, agents can arbitrarily interactive behavior space visualizations (including the ability spin around whenever they want without impacting their to browse through the discovered behaviors within each bin) trajectory. Note that such low alignment is not unprece- are available for all 20 method-BC combinations (including dented in serious applications. For example, an algorithm the three shown here): might search for diversity in terms of the height and mass of http://eplex.cs.ucf.edu/QD/GECCO-15/compare.html virtual creatures, which is hardly predictive of their quality ]. The results with DirectionBC suggest that as walkers [ 11 5. DISCUSSION such a complete loss of alignment can significantly upend the utility of NS as a QD mechanism. In this scenario, NS drops The main insight to emerge from the experiment concerns to the bottom of the ranking, performing significantly below the two key forces operating in QD algorithms: the push for other approaches (each of which have quality-seeking compo- novelty and the push for quality. If the BC, which supplies nents). Furthermore, while NSLC exceeds NS significantly the main push for novelty, is sufficiently aligned with a notion of quality, then novelty-based approaches are most (showing that the LC component of NSLC can indeed provide an advantage in QD), both ME and ME-Nov significantly effectively empowered. In contrast, if the BC is orthogonal outperform NSLC. In fact, ME alone effectively ties with to a notion of quality, a strong push for quality (as in the ME-Nov, suggesting that in cases of high misalignment, the elitism of ME) becomes the more instrumental factor in QD performance. The apparent ability under some BCs for NS advantage of the novelty pressure is eclipsed by the advan- to succeed, even without any explicit push for quality, alone tage of a strong elitist mechanism that pushes towards higher in other BCs to succeed without any explicit and ME alone quality.

7 Figure 6: Fitness in the FullTrajectoryBC. Com- Figure 7: MAP-Elites in the FullTrajectoryBC. In this pared to Fitness (figure 6), ME discovers far more QD under example grid, the six-dimensional behavior space (FullTra- FullTrajectoryBC. Some bins remain unfilled (white), corre- jectoryBC) (discretized into three bins per dimension for a total of 729 bins) is visually depicted as a series of nested sponding to the lack of pressure towards diversity within the ME algorithm. 3 two-dimensional grids (each of which are 3). The color of × each grid box corresponds to the quality of the solution found 250 ,000 evaluations: yellow by the search algorithm after corresponds to low quality, dark red to high quality, and white to unfilled bins. Fitness finds very few of the possible behaviors for this BC. push for novelty, highlights just how profoundly this apparent principle applies. Yet it also appears that augmenting these most simple methods with a component to cover what they lack, e.g. augmenting NS with LC and ME with novelty, carries generally little risk. In fact, in some cases it improves their results, especially when they are being applied in their respectively less ideal alignment circumstances, hinting that aiming for “best of both worlds” is not unrealistic and may even be attainable for future QD algorithms yet unrealized. However, the results in this initial study should be qualified by the observation that the QD-Maze domain is relatively easy; all methods except for fitness eventually find diverse solutions. Because more ambitious applications of QD in the future are likely significantly more challenging (imagine e.g. Figure 8: Novelty Search in the FullTrajectoryBC. much more challenging parallel mazes), a natural question is Of the five compared algorithms, NS performs the best under whether the results here will extend to future experiments. the FullTrajectoryBC (featuring relatively high alignment While of course further empirical research is necessary and between the BC and the objective) because it focuses exclu- this study is only a first step, there is still good reason to sively on pursuing diversity. This conclusion is supported believe that the results reported here are at least somewhat in the QD collection grid by almost all bins being filled (i.e. general: First, even in much harder domains, as the length non-white). of runs approaches infinity, all methods are likely eventually to find most QD. The experiment here is simply shorter performance alone that may arise in specific domains. For because the domain is less difficult. Yet the stratification of different methods observed during the course of these short example, because it divides behavior space into discrete bins, runs can conceivably emerge similarly in more challenging ME (and ME-Nov) becomes increasingly difficult to apply to domains, though simply over longer timescales. Second, the higher-dimensional BCs, an issue not present for NS or NSLC. , the ME grid also turns out Yet beyond its use as a method fact that the ranking of methods in QD-Mazes matches well an excellent mechanism for collecting and measuring QD our expectations for each method (i.e. that those more reliant results, even for methods that are not ME. We anticipate on a separate measure of quality do better when the BC is that the results of future QD algorithms even unrelated to poorly aligned with fitness) further supports the credibility ME will ultimately often be visualized and quantified by of the results. Of course, there are some other considerations outside QD placing them within in a classic ME grid.

8 6. CONCLUSION [12] J. Lehman and K. O. Stanley. Exploiting open-endedness to solve problems through the search The aim of this paper was to help to unify the emerg- for novelty. In Proc. of the 11th Intl. Conf. on Artificial quality diversity ing (QD) research area by beginning to Life (Alife XI) , Cambridge, MA, 2008. MIT Press. introduce benchmarks and principles for analyzing and con- [13] J. Lehman and K. O. Stanley. Revising the trasting methods in the area. For this purpose a new domain evolutionary computation abstraction: Minimal criteria with many parallel solutions called the QD-Maze was intro- novelty search. In Proc. of the 12th Annual Conf. on duced, and several QD methods (including a new variant Genetic and Evolutionary Computation (GECCO ’10) , called MAP-Elites+Novelty) were compared under different pages 103–110, New York, NY, USA, 2010. ACM. behavior characterizations in QD-Maze. The results begin [14] J. Lehman and K. O. Stanley. Abandoning objectives: to establish initial principles for understanding QD, namely Evolution through the search for novelty alone. that the alignment of the behavior characterization with the , 19(2):189–223, 2011. Evolutionary Computation notion of quality significantly influences which QD methods [15] S. W. Mahfoud. Niching Methods for Genetic will be most effective in a particular setup. This insight . PhD thesis, University of Illinois at Algorithms gives hope that an even broader understanding of QD can be Urbana-Champaign, Urbana, IL, May 1995. reached in the future, thereby helping to expand the reach [16] T. M. Mitchell. Machine Learning . McGraw-Hill, New and impact of evolutionary computation. York, NY, 1997. [17] J.-B. Mouret and J. Clune. Illuminating search spaces Acknowledgments by mapping elites. arXiv preprint , 2015. [18] J.-B. Mouret and S. Doncieux. Overcoming the This work was supported by the National Science Foundation bootstrap problem in evolutionary robotics using under grant no. IIS-1421925. Any opinions, findings, and behavioral diversity. In Proceedings of the IEEE conclusions or recommendations expressed in this material Congress on Evolutionary Computation (CEC-2009) , are those of the authors and do not necessarily reflect the pages 1161–1168. IEEE, 2009. views of the National Science Foundation. J.-B. Mouret and S. Doncieux. Encouraging behavioral [19] diversity in evolutionary robotics: An empirical study. , 20(1):91–133, 2012. Evolutionary Computation References [20] A. Nguyen, J. Yosinski, and J. Clune. Deep neural [1] C. M. Bishop. Pattern Recognition and Machine networks are easily fooled: High confidence predictions , volume 1. Springer, New York, NY, 2006. Learning , arXiv preprint for unrecognizable images. [2] A. Channon. Improving and still passing the alife test: abs/1412.1897, 2014. Component-normalised activity statistics classify [21] C. Ofria and C. O. Wilke. Avida: A software platform Artificial Life evolution in geb as unbounded. , 8: for research in computational evolutionary biology. 173–181, 2003. Artificial Life , 10(2):191–229, 2004. [3] C. Cortes and V. Vapnik. Support-vector networks. [22] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. , 20(3):273–297, 1995. Machine Learning Learning internal representations by error propagation. [4] A. Cully and J.-B. Mouret. Behavioral repertoire , pages Parallel Distributed Processing, Volume 1 In Proc. of the 15th Annual Conf. learning in robotics. In 318–362. MIT Press, Cambridge, MA, 1986. on Genetic and Evolutionary Computation (GECCO [23] L. Spector, J. Klein, and M. Feinstein. Division blocks ’13) , pages 175–182, New York, NY, USA, 2013. ACM. and the open-ended evolution of development, form, [5] A. Cully, D. Tarapore, J. Clune, and J.-B. Mouret. and behavior. In Proceedings of the Ninth Annual Robots that can adapt like natural animals. arXiv Conference on Genetic and Evolutionary Computation preprint , abs/1407.3501, 2014. (GECCO ’07) , pages 316–323. ACM, 2007. [6] K. A. De Jong. Evolutionary Computation: A Unified [24] K. O. Stanley and R. Miikkulainen. Evolving neural Perspective . MIT Press, Cambridge, MA, 2002. Evolutionary networks through augmenting topologies. [7] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan. A Computation , 10:99–127, 2002. fast and elitist multiobjective genetic algorithm: [25] K. O. Stanley and R. Miikkulainen. A taxonomy for NSGA-II. IEEE Transactions on Evolutionary Artificial Life , 9(2):93–130, 2003. artificial embryogeny. , 6(2):182–197, 2002. Computation [26] K. O. Stanley, B. D. Bryant, and R. Miikkulainen. [8] D. Floreano, P. D urr, and C. Mattiussi. ̈ Real-time neuroevolution in the NERO video game. Neuroevolution: From architectures to learning. IEEE Transactions on Evolutionary Computation , 1:47–62, 2008. Evolutionary Intelligence Special Issue on Evolutionary Computation and Games , [9] J. Gomes and A. L. Christensen. Generic behaviour 9(6):653–668, 2005. similarity measures for evolutionary swarm robotics. In [27] P. Szerlip and K. O. Stanley. Indirectly encoded Proceedings of the Fifteenth Annual Conference on sodarace for artificial life. In Proceedings of the Genetic and Evolutionary Computation (GECCO ’13) , European Conference on Artificial Life (ECAL-2013) , pages 199–206, New York, NY, USA, 2013. ACM. volume 12, pages 218–225, 2013. [10] C. Green. SharpNEAT homepage, 2003–2006. URL P. A. Szerlip, G. Morse, J. K. Pugh, and K. O. Stanley. [28] http://sharpneat.sourceforge.net/. Unsupervised feature learning through divergent [11] J. Lehman and K. O. Stanley. Evolving a diversity of Proc. of the discriminative feature accumulation. In virtual creatures through novelty search and local Twenty-Ninth AAAI Conference on Artificial Proc. of the Thirteenth Annual Conf. competition. In , Menlo Park, CA, 2015. Intelligence (AAAI-2015) on Genetic and Evolutionary Computation (GECCO AAAI Press. ’11) , pages 211–218, Dublin, Ireland, 2011. ACM.

Related documents