Abstract
In this study, the challenge of developing a dissimilarity metric for machine learning pipeline optimization is addressed. Traditional approaches, limited by simplified operator sets and pipeline structures, fail to address the full complexity of this task. Two novel metrics are proposed for measuring structural and hyperparameter dissimilarity in the decision space. A hierarchical approach is employed to integrate these metrics, prioritizing structural over hyperparameter differences. The Tree-based Pipeline Optimization Tool (TPOT) is used as the primary automated machine learning framework and is applied to the abalone dataset. Novel visual representations of TPOT’s search dynamics are also proposed, providing deeper insight into its behaviour and evolutionary trajectories under different search conditions. The effects of altering the population selection mechanism and reducing the population size are explored, highlighting the enhanced understanding these methods provide in automated machine learning pipeline optimization.
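The hierarchical combination described above can be illustrated with a minimal sketch. The abstract does not define the two metrics, so everything here is an assumption made for illustration: the pipeline representation (an ordered list of operator names with hyperparameter dictionaries), the positional-mismatch structural measure, the relative-difference hyperparameter measure, and the offset used to rank structural differences first. It is not the authors' formulation, only one way such a prioritization might look.

```python
from typing import Dict, List, Tuple

# Hypothetical pipeline representation (an assumption, not TPOT's internal form):
# an ordered list of (operator_name, hyperparameters).
Pipeline = List[Tuple[str, Dict[str, float]]]


def structural_dissimilarity(a: Pipeline, b: Pipeline) -> float:
    """Stand-in structural metric: fraction of positions whose operators differ, in [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    mismatches = sum(1 for (op_a, _), (op_b, _) in zip(a, b) if op_a != op_b)
    mismatches += abs(len(a) - len(b))  # unmatched trailing operators count as mismatches
    return mismatches / longest


def hyperparameter_dissimilarity(a: Pipeline, b: Pipeline) -> float:
    """Stand-in hyperparameter metric: mean relative difference over shared keys, in [0, 1]."""
    diffs = []
    for (_, hp_a), (_, hp_b) in zip(a, b):
        for key in hp_a.keys() & hp_b.keys():
            x, y = hp_a[key], hp_b[key]
            scale = abs(x) + abs(y)
            diffs.append(0.0 if scale == 0 else abs(x - y) / scale)
    return sum(diffs) / len(diffs) if diffs else 0.0


def hierarchical_dissimilarity(a: Pipeline, b: Pipeline) -> float:
    """Structure dominates: any structural difference outranks any hyperparameter difference."""
    s = structural_dissimilarity(a, b)
    if s > 0:
        return 1.0 + s  # values in (1, 2]
    return hyperparameter_dissimilarity(a, b)  # values in [0, 1]


if __name__ == "__main__":
    p1 = [("StandardScaler", {}), ("RandomForestRegressor", {"n_estimators": 100.0})]
    p2 = [("StandardScaler", {}), ("RandomForestRegressor", {"n_estimators": 200.0})]
    p3 = [("PCA", {}), ("RandomForestRegressor", {"n_estimators": 100.0})]
    print(hierarchical_dissimilarity(p1, p2))  # same structure, different hyperparameters
    print(hierarchical_dissimilarity(p1, p3))  # different structure
```

In this sketch, two pipelines with identical operators are compared only on their hyperparameters, while any structural mismatch shifts the value above 1, so pairs that differ in structure always rank as more dissimilar than pairs that differ only in hyperparameters.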
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Kenny, A., Ray, T., Limmer, S., Singh, H.K., Rodemann, T., Olhofer, M. (2024). A Hierarchical Dissimilarity Metric for Automated Machine Learning Pipelines, and Visualizing Search Behaviour. In: Smith, S., Correia, J., Cintrano, C. (eds) Applications of Evolutionary Computation. EvoApplications 2024. Lecture Notes in Computer Science, vol 14635. Springer, Cham. https://doi.org/10.1007/978-3-031-56855-8_7
DOI: https://doi.org/10.1007/978-3-031-56855-8_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-56854-1
Online ISBN: 978-3-031-56855-8
eBook Packages: Computer Science, Computer Science (R0)