Reinforcement Learning Based Agents For Improving Layouts of Automotive Crash Structures
https://doi.org/10.1007/s10489-024-05276-6
Abstract
The topology optimization of crash structures in automotive and aeronautical applications is challenging. Purely mathematical methods struggle due to the complexity of determining the sensitivities of the relevant objective functions and restrictions with respect to the design variables. For this reason, the Graph- and Heuristic-based Topology optimization (GHT) was developed, which controls the optimization process with rules derived from expert knowledge. In order to extend the collected expert rules, the use of reinforcement learning (RL) agents for deriving a new optimization rule is proposed in this paper. This heuristic is designed in such a way that it can be applied to many different models and load cases. An environment is introduced in which agents interact with a randomized graph to improve cells of the graph by inserting edges. The graph is derived from a structural frame model. Cells represent localized parts of the graph and delineate the areas where agents can insert edges. A newly developed shape preservation metric is presented to evaluate the performance of topology changes made by agents. This metric evaluates how much a cell has deformed by comparing its shape in the deformed and undeformed state. The training process of the agents is described and their performance is evaluated in the training environment. It is shown how the agents and the environment can be integrated as a new heuristic into the GHT. An optimization of the frame model and a vehicle rocker model with the enhanced GHT is carried out to assess its performance in practical optimizations.
are manifold. One possibility is to obtain useful gradients for the optimization as shown in [1], where topological derivatives via the adjoint equilibrium equation are determined.

The Equivalent Static Loads Method (ESL) transforms the loads of a nonlinear problem into several static problems. The load is determined such that the displacements in an equivalent linear simulation correspond to those in a specific time step of the nonlinear simulation. Sensitivities can be determined for the static problems, which are then used to perform the topology optimization [2–4].

In addition to the calculation of usable sensitivities, there are also methods for topology optimization that do not use direct sensitivity information. Instead, these use engineering knowledge that guides the optimization in the form of heuristic rules. The rules of cellular automata as used in Hybrid Cellular Automata (HCA) [5] can be used to form three-dimensional voxel based structures. In this process, the cellular automaton redistributes material such that each voxel in the model is equally utilized for the particular load case. As a criterion for this, the internal energy density of the respective voxels is used. The process of the Graph- and Heuristic-based Topology optimization (GHT) is also driven by heuristic update rules [6–9]. Those updates are performed on a mathematical graph consisting of nodes and edges, which describes the cross section of an extrusion profile. With the GHT, the real objective and constraints can be considered in the optimization process.

This work presents a reinforcement learning (RL) based approach for topology optimization of crash loaded structures using the GHT to improve local structural cells with respect to their stiffness. The RL model is integrated into the GHT optimization process and functions as an additional heuristic applicable in many different models and load cases. That allows for a more diverse design generation during optimization, which in turn should result in better optima at the cost of a higher simulation count.

The core concept that is presented here is the underlying RL environment which defines the interface between the agent and the crash model that should be improved. For this, a shape preservation metric is proposed that describes the stiffness of a cell by measuring how much the undeformed and deformed cells differ from each other geometrically. While the GHT process is already well described in the literature [6–9], the RL based approach is completely new and will be investigated in this paper. The key contributions of this work are

• the combination of the two research fields of RL and crash optimization,
• the support of the GHT with a new RL based heuristic that increases the stiffness of local cells,
• the concept of cells and their advantages and disadvantages as an interface between the graph based structures and the RL model,
• the implementation of an environment the agents are trained on,
• the calculation of the shape preservation metric describing the stiffness of a cell and
• the assessment of the performance of the trained agents in practical optimizations.

The paper is structured as follows. Section 2 presents related literature and concepts that are used in this paper or have directly influenced the work. In Section 3, the requirements for the RL models, also called agents, the implementation of the RL environment and the training of the agents are introduced. It also describes how the environment and the agents are integrated into the GHT process. The best trained model is selected and evaluated in Section 4 within the training environment. To assess the performance of the agent based heuristic within practical topology optimizations, a frame model and a rocker model are studied with the GHT in Section 5. Finally, the results and findings are summarized in Section 6.

2 Related work

Section 2.1 gives an overview of different approaches to integrate artificial intelligence (AI) into crash simulation and crash optimization. These studies collectively highlight the intersection of AI with crash simulations, illustrating the growing trend in this research area. However, while they offer valuable insights, they do not directly lay the groundwork for the present work. This is followed by Section 2.2, where an introduction to RL is given. Lastly, Section 2.3 discusses how the GHT works, as it is the framework for the RL based method presented in this paper.

2.1 Use of artificial intelligence in crash simulation and optimization

Crash simulation and optimization are integral aspects of modern automotive and aeronautical crash safety analysis. Leveraging AI and machine learning techniques has recently gained momentum in the field of crash analysis, allowing for enhanced computational and predictive capabilities.

A primary focus of recent studies has been the application of dimensionality reduction and clustering techniques for analyzing crash simulations. In [10], a clustering algorithm to discern structural instabilities from reduced crash simulation data is incorporated. The study delves into clustering of nodal displacements derived from finite element (FE) models spanning different simulations. By subsequently
processing these clusters through dimensionality reduction techniques, inherent characteristics of the simulation runs are unraveled, obviating the need for manual sifting of the data. The practicality of their approach is underscored by its effective application to a longitudinal rail example.

[11] presented an automated evaluation technique aimed at discerning anomalous crash behaviours within sets of crash simulations. By calculating an outlier score for each simulation state via a k-nearest-neighbour strategy, the study aggregates these results into a consolidated score for individual simulations. By averaging these scores for a given simulation, the method facilitates the distinction between regular and outlier simulations. The effectiveness of this method is underscored by its high precision and notable recall when evaluated on five distinct datasets.

A geometric approach is presented in [12]. Primarily one-dimensional structures are embedded by a representative regression line used to analyze the deformation behavior of different crash models. Those regression lines are parameterized as Bézier curves. Simulation responses are then projected onto the regression line and smoothed with a kernel density smoothing. Leveraging a discretized version of the smoothed data, it is possible to effectively identify and categorize distinct deformation patterns and find the most influencing parameters regarding the deformation modes through data mining techniques. This method is validated on different important structural components in a full frontal crash.

Crash simulations are intrinsically time dependent. The use of time data for AI in crash simulations is therefore suggestive. The study from [13] offers a novel data analysis methodology for efficiently post-processing bundles of FE data from numerical simulations. Similar to the Fourier transform, which decomposes temporal signals into their individual frequency spectra, [13] propose a method that characterises the geometry of structures using spectral coefficients. The coefficients with the highest values are decisive for the representation of the original geometry. By selecting these predominant spectral coefficients, the geometry can consequently be represented in a low-dimensional way. The method is successfully validated on a full frontal crash by analyzing the behaviour of the vehicle's support structure.

In a simpler approach, [14] bring forth the concept of Oriented Bounding Boxes (OBBs). Those are cuboids that encapsulate FE components at minimum volumes throughout the simulation. This geometric abstraction enabled the estimation of size, rotation and translation of crash structures over time. Moreover, their method, which uses a Long Short-Term Memory (LSTM) [15] autoencoder to generate a low dimensional representation of the original data, paves the way for predicting and stopping simulations that exhibit undesirable deformation modes. The method is validated on 196 simulations with varying material properties in a full frontal crash by analyzing different crash relevant components.

In [16], the impact point in low speed crashes is identified based on the time history of sensor data with conventional feature extracting algorithms. The impact points are classified by 8 different positions around the vehicle. From 3176 extracted features of the time series, the 9 most important features are chosen and passed into a decision tree. Using this method, a cross-validated accuracy of 76 % for the given dataset has been achieved.

2.2 Reinforcement learning overview

RL [17] is a subset of AI that describes the process of learning tasks in an unknown and often dynamic environment. An agent performs actions within the environment over time according to its policy. The actions are selected by the agent depending on an observation of the environment, i.e. the agent's perception of the current state of the environment, with the goal of maximizing a cumulative reward. Applying the action to the environment and generating a new observation based on the new state of the environment is called a step. Depending on the task of the agent, a few steps up to a theoretically infinite number of steps are performed until the environment reaches a terminal state. The steps from the initial state of the environment to the final state are called an episode. For each step performed, a numerical reward is given to the agent. This way the agent learns to understand how beneficial the chosen action has been for the previous state. After an episode, the environment is reset and a new episode starts. This iterative concept of stepping through the environment is referred to as the RL loop.

In many state-of-the-art RL algorithms, the agent itself is a function approximator, usually an artificial neural network (ANN). Depending on the actual algorithm used, the ANN is trained to either predict the value of the observed state or predict an action distribution over the possible actions that maximizes a given objective directly. The value of a state is the expected return starting in the current state and following the current policy. The return is a possibly discounted sum of the rewards gained in each performed step. Choosing an action is then done by finding the action that maximizes the value in the current state.

In case of actor-critic algorithms [18], which are used in this paper, both the action distribution and the state's value are approximated in two distinct ANNs. The network predicting the action distribution is called the actor and the network predicting the value of the states is called the critic. The policy given by the actor network is updated with a gradient based approach with information provided by the critic [18]. This combination of both algorithmic approaches enables a much higher sampling efficiency compared to their individual counterparts.
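To illustrate the RL loop described above, the following minimal sketch steps a placeholder gym task with a random policy. It is only meant to make the cycle of observation, action, reward and episode termination concrete; the task name is a stand-in and the snippet is not part of the crash environment developed later.

```python
import gym

# Minimal RL loop: observe, act, receive a reward, repeat until the episode ends.
env = gym.make("CartPole-v1")  # placeholder task, not the crash environment

for episode in range(3):
    observation = env.reset()
    done = False
    episode_return = 0.0
    while not done:
        # A trained actor would map the observation to an action distribution;
        # here a random sample stands in for the policy.
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        episode_return += reward
    print(f"episode {episode}: return = {episode_return:.1f}")
env.close()
```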
When using an RL model that has been trained with an actor-critic approach, only the actor is necessary for the decision making given an observed state. This interaction between the actor and the environment is visualized in Fig. 1. The best actions will be sampled from the probability distribution over the actions. The environment acts according to the given action and answers with a new state which will be the foundation for the observation passed next to the actor model.

In contrast to supervised learning, where an ANN is trained on a given dataset, the RL training is driven by a trial and error approach. The agent steps through the environment according to its policy, collecting data as it steps. This is the main advantage of using RL especially for complex mechanical problems. For training an RL agent, no information about optimized structures must be known beforehand.

This poses the problem described in the literature as the exploration-exploitation trade-off [17]. While training, the agent should act according to its current policy, such that it is able to use the already learned knowledge about the environment to reach favourable states. This is called exploitation. At the same time, the agent needs to try out new actions to avoid getting stuck in local optima by only acting according to the current policy. This is called exploration.

Typical examples where RL is frequently applied are robotics [20] and video games [19, 21]. New observations can be obtained fast and in a simple and structured format, like scalar sensor data in robotics and a visual representation of the environment in games. Thus, RL is a suitable method for training agents in these application fields.

An example of an application of RL in mechanical problems is given in [22, 23]. In the mentioned work, the volume of planar steel frames is minimized with RL considering stresses, displacements and other engineering relevant constraints. There, the cross sectional size is chosen from a list of discrete sizes. The steel frames are represented by a graph. A tailored graph embedding is used to preprocess the graph data into an RL suitable format.

In this work, Python is used for implementing the heuristic and the RL training. The most important Python modules used are stable-baselines3 [24], which is an RL framework, gym [25], which enables a standardized implementation of environments, networkx [26] for processing graph data, numpy [27] for numerical operations on arrays and qd cae [28] for parsing the simulation results. The crash simulations are carried out with Ls-Dyna [29].

2.3 Crash optimization with the graph- and heuristic-based topology optimization

The GHT is a shape and topology optimization method for crash structures. There are possibilities to optimize the cross section of an extrusion profile (GHT-2D) [6, 9] as well as the layout of different combined profiles (GHT-3D) [7].

In this work, the focus will be on the GHT-2D. For all following references to the GHT, it is always the GHT-2D that is referred to. In the GHT, the cross sections of extrusion profiles are described by graphs. The graph nodes contain the relevant coordinates that describe the geometry of the profile. Edges between the nodes represent the walls of the extruded model. For the automatic translation of a graph into a FE simulation model, the GHT-internal mesher GRAMB (Graph based Mechanics Builder) can be used. This software has been initiated by [8] and further developed in [6]. An example of a graph and its FE counterpart is given in Fig. 2.

As described in [7, 30], the use of a graph allows for

• an easy way of manipulating the structure,
• generating an interpretable structure with little effort,
• a simple check of the manufacturing constraints and
• high quality meshes due to the automatic FE meshing with GRAMB for every design.

Starting from an initial design, the graph is modified over several iterations using heuristics. Heuristics are expert knowledge condensed into formulas that analyze the mechanical behavior of the structure from the simulation. Within an iteration, these heuristics suggest a new topology. If desired, a dimensioning or shape optimization of each new structural proposal is subsequently performed. The heuristics operate in parallel and in competition. Only the best designs are passed on to the next iteration. These new designs are the basis for the following iterations.

The heuristics used in this work have been developed in [6, 7] and are listed below.
• Support Buckling Walls (SBW) identifies FE nodes that are moving rapidly towards each other, detecting that the structure has a buckling tendency. These areas are supported with an additional wall.
• Balance Energy Density (BED) provides a homogeneous distribution of the absorbed energy in the structure by connecting low and high energy areas.
• Use Deformation Space (UDS) has the variants compression (UDSC) and tension (UDST). For this purpose, deformation spaces moving towards and away from each other are identified and supported by a wall.
• Split Long Edges (SLE) reduces the buckling tendency by splitting and connecting the longest edge with another long edge in the graph.

3 Reinforcement learning based heuristic generation

In the following, the framework of the RL based GHT heuristic is formulated. The design of the training environment for the heuristic is described in detail in Section 3.1. Section 3.2 describes how the heuristic is integrated into the GHT optimization process.

In this investigation, the aim of the heuristic is to improve the local stiffness of the structure, hence the heuristic name RLS, which is an abbreviation for “Reinforcement Learning Stiffness”.

3.1 Environment implementation

The environment definition is the most important part when training an agent with RL. It defines what the agent should learn based on the received rewards and how the interaction between the agent and the environment is implemented. Therefore, the main concept of the environment is shown in Section 3.1.1. Section 3.1.2 explains the interaction between the agent and the environment, i.e. the actions the agent can take and the observations the agent receives. The random generation of models and load cases is shown in Section 3.1.3. When the agent inserts an edge into the graph locally, a reward is calculated, which indicates whether the newly added edge improved the local performance of the model. How the reward for each step is calculated is shown in Section 3.1.4. Lastly, in Section 3.1.5, the training process for the agent is described.

3.1.1 Environment concept

The first step in implementing the environment is to clarify what the environment should achieve. It is supposed to give a framework for an agent to increase the stiffness of a structure by manipulating the topology of the graph locally. This should be achieved by sequentially inserting edges into the graph. An optimal topology proposal by the trained agent acting in the environment is desirable, but not mandatory. This is because the RLS heuristic, which is derived from the proposed environment, works in competition with the other heuristics listed in Section 2.3. Therefore, suboptimal topologies are sorted out in the optimization process.

Figure 3 gives an overview of stepping through an episode of the environment. The environment is split into two main modules. When an episode starts, the reset module is activated. This reset module handles a randomized generation of a GHT graph representing the cross section of an extrusion profile, which is then translated into a finite element model and simulated in a randomized load case. Local parts of the graph, referred to as cells, are identified. Edges will be inserted directly into those cells. Based on the results of the FE analysis, an observation consisting of the mechanical properties of the initial and deformed simulation model is built. Using this initial observation, the agent is able to choose its first action, which is passed into the step module. The step module contains similar procedures to the reset module. Additionally, the topology of the cell inside the graph is modified first using the given action. The cell is evaluated based on its updated topology. Then the reward and a termination flag, which decides whether the episode should terminate, are calculated. All of these concepts will be explained in more depth throughout this section.
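As a rough sketch of the reset and step modules just described, a gym-style environment skeleton could look as follows. The class name, space sizes and helper methods are placeholders for the randomized model generation, FE simulation and cell evaluation of this section, not the authors' implementation.

```python
import gym
import numpy as np
from gym import spaces


class CellStiffnessEnv(gym.Env):
    """Skeleton of the described environment: reset() builds and simulates a
    randomized graph and selects a cell, step() inserts an edge into the cell."""

    def __init__(self, n_possible_edges=10, n_observation_features=64):
        super().__init__()
        # Discrete action: which edge to insert into the cell (placeholder size).
        self.action_space = spaces.Discrete(n_possible_edges)
        # Continuous observation built from the mechanical properties of the model.
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf,
                                            shape=(n_observation_features,),
                                            dtype=np.float32)

    def reset(self):
        # Reset module: generate a random graph and load case, run the FE
        # simulation and identify the cell to improve (all omitted here).
        self._generate_random_model_and_load_case()
        return self._build_observation()

    def step(self, action):
        # Step module: modify the cell topology, re-simulate and evaluate it.
        self._insert_edge_into_cell(action)
        reward, done = self._evaluate_cell()
        return self._build_observation(), reward, done, {}

    # Placeholders for the procedures described in the text.
    def _generate_random_model_and_load_case(self):
        pass

    def _insert_edge_into_cell(self, action):
        pass

    def _evaluate_cell(self):
        return 0.0, True

    def _build_observation(self):
        return np.zeros(self.observation_space.shape, dtype=np.float32)
```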
Fig. 7 Example of an aggregated internal energy feature vector, where each of its entries is mapped onto its corresponding edge of the cell. The semi-transparent box highlights the fully constrained side of the model
displacements and energies. The mean and standard deviation, which are used when normalizing, are not known before the training. Therefore, the normalization is done by calculating the running mean and running standard deviation of the distributions of the observation values while training.
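A minimal sketch of such a running normalization is shown below. It uses a simple incremental mean and variance update and is only an illustration of the idea under assumed interfaces (stable-baselines3 also offers a comparable VecNormalize wrapper); it is not the normalization code used for training.

```python
import numpy as np


class RunningNormalizer:
    """Normalizes observations with a running mean and standard deviation
    that are updated while training, since both are unknown beforehand."""

    def __init__(self, size, eps=1e-8):
        self.mean = np.zeros(size)
        self.var = np.ones(size)
        self.count = 0
        self.eps = eps

    def update(self, observation):
        # Incremental (Welford-style) update of mean and population variance.
        self.count += 1
        delta = observation - self.mean
        self.mean += delta / self.count
        self.var += (delta * (observation - self.mean) - self.var) / self.count

    def normalize(self, observation):
        return (observation - self.mean) / np.sqrt(self.var + self.eps)
```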
3.1.3 Model and load case generation

During the training, the agent should be able to analyze as many different deformation modes of various cells as possible. This enables the trained agent to perform useful actions for similar deformation states in GHT optimizations.

The frame model shown in Section 3.1.1 is used as a foundation to derive randomized models for training the agent. Based on the model's graph, about 30000 different graph topologies from previous GHT optimizations were identified to serve as the basis for the randomized model generator. A randomly selected edge segment along the outer frame of the structure is constrained and another edge segment is chosen as the impact segment. An edge segment is a series of edges with the same orientation. The velocity of the spherical impactor is also randomly selected. Here, the sphere can either move perpendicularly towards the selected edge segment, or the sphere can move in the direction of the center of gravity of the graph. With additional random adjustment of the size, the wall thickness and the orientation of the graph in space, an unlimited number of models and different load cases can be built, which results in a large variety of different deformation modes. Figure 8 shows a randomized model in a random load case.

The manufacturing constraints used for training the agents are given in Table 1. Those are derived from [30] and adapted to suit the environment. The edge distance of two edge pairs is calculated for all edge pairs that do not share a node. Since the edges in the cell border are split, their distance to other edges changes. This results in smaller distances than in the unsplit cell, although the geometry is identical. For this specific reason, the minimum distance between edges d is set to the small value of 4 mm.
Table 1 Manufacturing constraints used for training the agents
  Edge length l:              l ≥ 10 mm
  Distance between edges d:   d ≥ 4 mm
  Connection angle α:         α ≥ 15°
  Wall thickness t_wall:      1 mm ≤ t_wall ≤ 4 mm

Fig. 8 Example of a randomized graph and load case built by the environment. The semi-transparent box highlights the fully constrained side of the model
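Based on the limits in Table 1, a manufacturability check along the following lines could be used inside the environment. The function arguments are placeholders for quantities extracted from the graph and are not taken from the original implementation.

```python
import numpy as np

# Limits from Table 1 (training environment).
MIN_EDGE_LENGTH = 10.0        # mm
MIN_EDGE_DISTANCE = 4.0       # mm
MIN_CONNECTION_ANGLE = 15.0   # degrees
WALL_THICKNESS_RANGE = (1.0, 4.0)  # mm


def is_manufacturable(edge_lengths, edge_pair_distances, connection_angles, wall_thickness):
    """Returns True if all manufacturing constraints of Table 1 are satisfied.

    edge_pair_distances is assumed to contain only distances of edge pairs
    that do not share a node, as described in the text.
    """
    if np.min(edge_lengths) < MIN_EDGE_LENGTH:
        return False
    if len(edge_pair_distances) > 0 and np.min(edge_pair_distances) < MIN_EDGE_DISTANCE:
        return False
    if np.min(connection_angles) < MIN_CONNECTION_ANGLE:
        return False
    lower, upper = WALL_THICKNESS_RANGE
    return lower <= wall_thickness <= upper
```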
on the evaluation graph. The evaluation graphs in the undeformed state and in the deformed state at the evaluation time step are superimposed at their respective center of gravity. If the structure preserves its shape, then both cell boundaries lie exactly on top of each other. Otherwise, difference areas emerge, which are summed up and normalized. This process is independent of any inserted edges and works for empty cells and for cells with edges in it. Since rigid body translation and rotation have no influence on the shape preservation of the cell, these are eliminated when creating the evaluation graph. Figure 9 shows the superposition of the evaluation graphs in more detail.

The following original formula for calculating the shape preservation measure Ã is given by

Ã = ( Σ_j A^(j) ) / ( A_{t_0} + A_{t_eval} ).   (1)

The area spanned by the evaluation graph at a given simulation point in time is given by A_t. Difference areas between the superimposed evaluation graphs of the undeformed and deformed state are given by A^(j), where j is the index of the considered difference area.

The shape preservation measure value is bound between 0 and 1 due to the normalization in Ã with A_{t_0} + A_{t_eval}. A value of 0 implies that the shape of the cell did not change from the initial state to the deformed state and a value of 1 means that the structure collapsed into a point, i.e. the structure is infinitely weak. In the case when Ã = 0, no difference areas emerge, setting the numerator to 0 in the formula for Ã. For the collapsed cell, Ã = 1 is true due to the fact that the cell has a cross sectional area of A_{t_eval} = 0. Then only one difference area A^(1) emerges. With A_{t_0} = A^(1) one can see that the value of Ã is indeed 1.

One advantage of using such a metric is that its values are not subject to significant noise, unlike section forces or other crash relevant responses, as they are entirely based on the displacements of the cell nodes in the evaluation graph. In addition, the values of the metric are normalized so that they are model and load case independent. The simple behavior of the metric simplifies the training for the agent.

This measure is also used to identify if an empty cell is a candidate for optimization with the agent. An Ã ≥ 0.03 identifies the cell as being deformed and therefore it makes sense to optimize it. A value of Ã ≤ 0.01 terminates the episode, since the deformation of the cell is small.

With this measure on how well the current cell performs, it is possible to reward or punish the agent by how much the cell performance improved compared to the cell from the previous step for a given episode. A relative improvement is considered instead of the absolute improvement to ensure that all resulting rewards have a similar range independent of the actual load case and deformation of the cell. The relative improvement is given by the formula

δ = clip( (Ã_{s_{i−1}} − Ã_{s_i}) / Ã_{s_0}, −3, 3 ),   (2)

where s_i refers to the evaluation of the metric value at the current environment step. The clipping is just a safety precaution. It is clipped for numerical stability to avoid any outliers generating too small or large rewards.

Using this improvement, the reward function r evaluates to

r = p + δ    if the model is manufacturable,
r = p − 1    else.   (3)
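The following sketch shows one possible way to evaluate Eqs. (1) to (3) when the undeformed and deformed evaluation graphs are available as closed polygons. It uses the shapely package for the difference areas, omits the removal of rigid body rotation mentioned above, and treats the penalty term p and the manufacturability flag as placeholders; it is an illustration of the formulas, not the original implementation.

```python
import numpy as np
from shapely.affinity import translate
from shapely.geometry import Polygon


def shape_preservation(undeformed_points, deformed_points):
    """Eq. (1): summed difference areas of the superimposed evaluation graphs,
    normalized by the sum of both enclosed areas."""
    undeformed = Polygon(undeformed_points)
    deformed = Polygon(deformed_points)
    # Superimpose both cell boundaries at their centers of gravity
    # (rigid body rotation removal is omitted in this sketch).
    deformed = translate(deformed,
                         xoff=undeformed.centroid.x - deformed.centroid.x,
                         yoff=undeformed.centroid.y - deformed.centroid.y)
    difference_area = undeformed.symmetric_difference(deformed).area
    return difference_area / (undeformed.area + deformed.area)


def relative_improvement(a_previous, a_current, a_initial):
    """Eq. (2): relative improvement of the metric, clipped to [-3, 3]."""
    return float(np.clip((a_previous - a_current) / a_initial, -3.0, 3.0))


def reward(delta, manufacturable, p=0.0):
    """Eq. (3): reward with a penalty term p (assumed default value) and a
    fixed punishment of -1 for non-manufacturable designs."""
    return p + (delta if manufacturable else -1.0)
```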
the edges that are inserted into the cell and continuous observation spaces.

It is necessary to determine the best hyperparameter settings, such as the architecture of the underlying ANN, to achieve the best possible performance with the PPO algorithm. Since the best hyperparameter settings are not known beforehand, one would usually do a large hyperparameter tuning, which is too time consuming for the given task. Instead, 12 agents with different hyperparameters are trained simultaneously on the identical task and after training the best agent is manually picked. Table 2 shows the parameters and their set of values which is sampled from to generate the batch of 12 agents. The parameters chosen are the ones that are expected to significantly impact the training behaviour of the agents. Other parameters exist, but are not shown, since they have not been tuned and are set to the default values used in stable-baselines3.

The learning rate is a hyperparameter that determines the step size at which the parameters of the policy network are updated during training via stochastic gradient ascent algorithms. The batch size parameter specifies how many samples of observations and actions will be used to compute the policy gradient and update the policy network. Gamma is a discount factor used in RL algorithms to balance the importance of immediate and future rewards. A Gamma of 0 results in an agent with myopic behaviour, i.e. an agent that does not look into the future and only uses its current state for decision making. A Gamma of 1 means that future rewards are as important as the immediate reward. Using a value close to 0 would help in this environment due to the high uncertainty in the crash simulations. However, this approach might only find mediocre optima. A value closer to 1 would help in finding better optima due to the future planning that is involved in the decision making. In the end, this might result in a worse performance, depending on the unknown uncertainty of the environment and how well the agent can predict the future. The number of rollout steps determines the number of steps taken in the environment before computing the policy update.
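A compact sketch of how one of these PPO agents could be configured and trained with stable-baselines3 on parallel environments is given below. The hyperparameter values, the environment module and the number of timesteps are assumptions for illustration, not the tuned settings reported in Tables 2 to 5.

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv

# Hypothetical module containing the environment skeleton from Section 3.1.1.
from cell_env import CellStiffnessEnv

if __name__ == "__main__":
    # Eight parallel environments per agent, as described in the text.
    vec_env = make_vec_env(CellStiffnessEnv, n_envs=8, vec_env_cls=SubprocVecEnv)

    model = PPO(
        "MlpPolicy",         # actor and critic networks are built by stable-baselines3
        vec_env,
        learning_rate=3e-4,  # assumed value, sampled in the hyperparameter batch
        batch_size=64,       # assumed value
        gamma=0.99,          # assumed discount factor
        n_steps=2048,        # assumed number of rollout steps per environment
        verbose=1,
    )
    model.learn(total_timesteps=1_000_000)
    model.save("rls_agent")
```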
The possible hyperparameter values for the policy that is sampled from to generate the batch of 12 agents are given in Table 3.

The policy parameters define the size of the underlying neural network. Since the PPO is considered an actor-critic algorithm, a neural network for the actor and a neural network for the critic are built with the respective number of hidden layers and neurons. Both the actor and the critic network share the same preceding layer with the specified number of neurons. For the CNN architecture, the default CNN in stable-baselines3 is used, which is the Nature CNN. The hyperparameters of the algorithm and the policy resulting in the best agents will be shown in Section 4, where the top performing agents are selected and evaluated.

Three different cell types are trained. These include cells with 3 sides, with 4 sides and with 5 sides. Every cell type needs an individual agent for training, since the action and observation space is determined by it. In total, 36 agents are trained in parallel.

In order to expedite the training process, parallel computation across eight environments for each agent is used. This approach ensures a rapid provision of data points, enhancing the efficiency of the training phase. With these settings, training a single agent on a compute cluster takes approximately a month. This high computation effort is justified by the fact that the agent only needs to be trained once and can then be used in the new heuristic for a wide variety of problems without any significant additional investment.

3.2 Integration of the agents and the environment into the GHT process

In the previous sections, the structure of the training environment is explained. The environment can almost be directly used as the new RLS heuristic. Differences between the environment used for the RLS heuristic and the training environment are

• the selection of actions, which is now done by the trained agent based on the normalized observations,
• the FE simulation model to optimize, which is given in the GHT process and not randomly generated by the environment and
• the cell selection process.

While in training mode, the cell to optimize was chosen randomly from the set of valid cells inside the graph. In the heuristic counterpart, a cell selection scheme is deployed. According to this scheme, the shape preservation measure Ã for all empty cells within the current graph is calculated. The more a cell is deformed, the more suitable it is to be optimized. At the same time, larger cells should be preferred, since they are more likely to show an influence on the global structural behavior. Therefore, the shape preservation measure of the cell is weighted with the corresponding spanned area of the cell. This results in the importance λ of a cell

λ = Ã · A.   (4)

The cell with the highest λ in the given graph will be chosen for the heuristic activation.

It can happen that the agent worsens the cell performance compared to the previous step due to a wrong decision. Therefore, the heuristic chooses the structure that generates the best shape preservation value of the cell over all steps in the episode.
4 Evaluation and selection of the agents in the training environment

In the following, the best performing agents out of the 12 trained agents per cell type are shown. Tables 4 and 5 show the hyperparameters for the best performing agents for every cell type. The parameters given are the ones that have been tuned in the hyperparameter tuning.

The corresponding training history for the best performing agents is given in Figs. 10 and 11 with a mean return and a mean episode length respectively. In the rollout plots, data is collected and averaged over the last 100 training episodes. Since those episodes are determined randomly and the actions from the agent are sampled from the probability distribution given by the actor network, values at different steps are not directly comparable. In the evaluation plots, 15 different models and load cases are averaged and compared between steps. Those models and load cases are always the same for a specific cell type and the agent always chooses the most promising action according to its current policy. This allows for a better comparison of the agents' performance between the steps.

It can be seen from the plots that all agents were able to achieve a significant improvement of the return, especially in the first steps. The mean return is higher for cell types with a smaller number of cell sides. The episode length for all agents after training is close to 2 and therefore approximates, but does not exactly match, the number of inserted edges, because it is possible that the agent
theoretically inserts the same edge several times in a row, which would not change the structure. But this effect is negligible in the trained state, because agents rarely insert edges multiple times, due to the penalization of inserting them.

Since the mean return only measures the performance quantitatively but not qualitatively, examples of initial cell deformation behaviour compared to their improved counterparts are shown in Fig. 12. The figure includes the shape preservation value Ã for the given examples.

In all examples, the agent manages to improve the structure performance in terms of the shape preservation metric. It is consistent with the episode length evaluation that all agents in these examples insert one or two edges. All example structures can be manufactured, which is not granted due to the high number of possible invalid edge combinations in the cell. Since these are only non-representative examples, it must be mentioned that the agents often, but not always, make reasonable decisions. It is noticeable that the shape preservation value is always close to 0 for the final cell. While this is desirable, it is not always possible, e.g., if the cell has to absorb a lot of energy due to a direct impact of the sphere on the cell.

The final 3 sided cell has a shape preservation value Ã = 0.02. Only one edge is inserted by the agent. For the 4 and 5 sided cells, the intermediate steps and the corresponding cells where only one edge is inserted are also discussed. For the 4 sided cell, the agent has an Ã = 0.016 with only the diagonal edge inserted first. This diagonal edge supports the cell stiffness by utilizing the deformation space along the compression direction of the deformed cell. From an engineering point of view, it might make more sense to support the cell in the tension direction to avoid buckling of the edge. Although the diagonal edge does not buckle in this example, the agent has learned to avoid the risk of buckling and inserts a supporting second edge, which is reasonable for cells that absorb more energy. This means that the agent failed to recognize that the episode could have been terminated earlier. For use in a later optimization, where the overall stiffness of a structure is to be improved, a correct recognition of the terminal step would have been advantageous, as the overall wall thickness would have remained larger due to the mass constraint. Similar behaviour can be observed for the cell with 5 sides. The shape preservation value, where only the first edge connecting the blue and orange edge segment
with the pink and red edges is inserted, is Ã = 0.031. This is only slightly worse than the shape preservation of the final cell. Although the structural performance improved to Ã = 0.026 with the final cell, the reward received for this action and final cell is slightly negative due to the penalization of inserting an edge, implying that the agent should not have inserted the second edge.

constraints and keeping the mass m of the model constant. The global y-direction is identical to the local y-direction ŷ of the profile shown in Fig. 4. It is important to notice that the optimization is performed with heuristic activations only and no shape optimization is done at any point. The number of designs passed into the next generation is set to 5. A maximum of 10 iterations is allowed. In the following, the optimization problem is formulated:
edge splitting process. All designs of the optimization fulfill the constraints.

The optimized design reduces the objective y from 72.36 mm to 7.46 mm. The greatest impact on structural performance comes from combining the RLS and DNW heuristics from iteration 1 to 3 by changing the shape of the overall structure to a triangular shape. This is initiated by the RLS heuristic in iteration 1, where a diagonal edge is inserted into the graph.

Comparing the findings to an optimization without the RLS heuristic active shows that the GHT finds a much more complex structure with a worse objective value. This can be seen in a side by side comparison of the deformed states of the initial, the optimized design with the RLS heuristic active and the optimized design without the RLS heuristic in Fig. 14.

It can be observed that the edges must be much thinner in the optimized design without the RLS heuristic active in order to keep the mass of the model constant. The objective of the optimized design without the RLS heuristic is improved from 72.36 mm to 14.65 mm compared to the initial design. The lower performance can be attributed to the fact that none of the already existing heuristics inserted an edge diagonally in the frame. If a shape optimization was active, the GHT without the RLS heuristic would also find a similar triangular shape.

In total, 273 function calls, i.e. FE simulations, were carried out in the optimization with the RLS heuristic active.

5.2 Optimization of a rocker model

So far, the frame model has been studied, which was also used to train the agent. How the agent based RLS heuristic performs in other models and load cases is examined in this section.

Since the agent is designed to improve the stiffness of a structure, a relevant vehicle component is selected for occupant protection in a side crash. In such side impacts, there is little deformation space until the occupant is struck. It is vital that the vehicle occupants are protected from excessive intrusion by the opposing vehicle or impactor.

Therefore, the performance of the RLS heuristic is investigated based on a model of a rocker in a side crash against a rigid pole, which is presented in Fig. 15.

The rocker is made out of aluminum, has an initial wall thickness of 3.5 mm and has an extrusion length of 600 mm, resulting in a mass of 2.801 kg. The energy of a moving rigid wall is introduced into the rocker through the seat cross member. The rigid wall has a mass of 85 kg and an initial velocity in negative y-direction of 8.056 m/s.

The objective is to find a topology that minimizes the displacement y of the rigid wall and therefore increases the stiffness of the rocker. All manufacturing constraints must be fulfilled and the mass of the model must be equal for all designs. The number of concurrent designs is set to 5 and the maximum number of iterations allowed is set to 12. No shape optimization is performed during the optimization. The exact
optimization problem is formulated as:

min  y
subject to  l ≥ 20 mm,
            d ≥ 10 mm,
            α ≥ 15°,
            1.5 mm ≤ t_wall ≤ 3.5 mm,
            m = 2.801 kg.   (6)

Fig. 16 Optimization path that leads to the optimized rocker design with the RLS heuristic active

The design variations that lead to the optimized structure are shown in Fig. 16. Along this path, the RLS heuristic is activated twice. One time it is activated in iteration 2 and another time in iteration 5. In iterations 3 and 4 the agent proposed the same topology change as in iteration 5, but the activation of other heuristics resulted in a better overall performance of the structure. The design proposals of the RLS heuristic in iterations 6 and 8 increase the shape preservation value of the respective cell, but are not useful with respect to the stiffness of the rocker structure.

The performance of the structure from iteration 4 to 5 gets worse when the RLS heuristic is activated. In iteration 6, the DNW heuristic removes part of an edge added by the RLS heuristic, causing the structure to perform better in the long run. Similar to the optimization of the frame model, the combination of the RLS and DNW heuristics works well.

The optimization was able to reduce the objective y from 68.92 mm to 29.95 mm. Comparing this to the optimization with inactive RLS heuristic, a slightly worse improvement from 68.92 mm to 31.53 mm is achieved. A direct comparison of the deformed models is given in Fig. 17.

Fig. 17 Comparison of the initial and the optimized rocker designs with and without the RLS heuristic active in the optimization
Although the optimized structure with the active RLS heuristic performs slightly better, the emerging pattern of the inserted supporting walls in an offset manner is similar. The optimization with the RLS heuristic was able to fill more space with this pattern successfully.

The optimization with the RLS heuristic active is based on 296 function calls. 213 of those function calls are from the conventional GHT heuristics and an additional 83 function calls are from the RLS heuristic. With an improved implementation of the interface between the RLS heuristic and the GHT, the number of function calls of the RLS heuristic can be reduced to 37. The number of function calls in the optimization without the RLS heuristic is 166.

6 Discussion and conclusion

In this paper, a novel heuristic for the topology optimization of crash structures with the GHT was presented. For this purpose, RL was used to train agents that can locally improve the stiffness of structures. Within the training environment, the agents were able to make plausible decisions about the topology of the cells. It was more difficult for the agents to differentiate if an episode could be terminated early.

The trained agents have been used as a new RL based heuristic in two GHT optimizations. Firstly, an optimization of a frame model was performed, where the new heuristic was able to direct the optimization to a better design compared to the optimization without the new heuristic. Secondly, an optimization of an application-oriented rocker model was performed. The differences between the designs with and without the new heuristic were smaller compared to the frame model optimization.

Table 6 summarizes the results of those two optimizations. In both optimizations the optimized structures performed better with the RLS heuristic active. The use of a new heuristic increases the number of function calls in an optimization. Especially with the RL heuristic, where after every added edge the performance of the cell must be evaluated, many additional simulations must be performed.

Given the results shown, it is valid to state that the presented heuristic is able to help the GHT in the optimization process at the cost of an increased number of function calls. The underlying agents perform reasonably from an engineering perspective with respect to the goal of stiffening the cells. It is not guaranteed that the heuristic will always improve the optimization results. The displacements in crash simulations, which are assumed to play a major role in the decision process of the agent, behave well in crash simulations from a mechanical point of view and therefore one could assume that the agent's decisions are fairly robust. Further research needs to be conducted to substantiate this assumption.

Accordingly, there are some things that should be further explored in future work. To enhance the design diversity of the cells, it is useful to extend the edge splitting process with more nodes along one edge. But there are limitations to this, as very short edges are generated that do not fulfill the manufacturing constraints. Also, only one simulation model for training has been considered with a limited amount of diversity in the load case. It is unclear how different training models will affect the performance in real optimizations. Objective functions other than stiffness were also not investigated in this paper. In crash development, force levels are often used as an optimization objective for crash load cases. An additional RL based heuristic that makes the structure more compliant instead of stiffer could help in those optimizations.

Author Contributions
• Conceptualization: Jens Trilling, Axel Schumacher, Ming Zhou
• Methodology: Jens Trilling, Axel Schumacher, Ming Zhou
• Implementation: Jens Trilling
• Investigation: Jens Trilling
• Writing - original draft preparation: Jens Trilling
• Writing - review and editing: Axel Schumacher, Ming Zhou
• Supervision: Axel Schumacher

Funding Open Access funding enabled and organized by Projekt DEAL.

Data availability The finite element models generated and analysed during the current study are available from the corresponding author on reasonable request.

Declarations

Compliance with Ethical Standards The data used in this study was exclusively generated by the authors. No third parties were involved in the data generation. No research involving human participants or animals has been performed.

Competing interests The authors have no competing interests to declare that are relevant to the content of this article.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

1. Weider K, Schumacher A (2018) Adjoint Method for Topological Derivatives for Optimization Tasks with Material and Geometrical Nonlinearities. In: EngOpt 2018 proceedings of the 6th international conference on engineering optimization, Springer, Cham, pp 867–878. https://doi.org/10.1007/978-3-319-97773-7_75
2. Choi WS, Park GJ (2002) Structural optimization using equivalent static loads at all time intervals. Comput Methods Appl Mech Eng 191(19–20):2105–2122. https://doi.org/10.1016/S0045-7825(01)00373-5
3. Park G-J (2011) Technical overview of the equivalent static loads method for non-linear static response structural optimization. Struct Multidiscip Optim 43(3):319–337. https://doi.org/10.1007/s00158-010-0530-x
4. Triller J, Immel R, Timmer A, Harzheim L (2021) The difference-based equivalent static load method: an improvement of the ESL method's nonlinear approximation quality. Struct Multidiscip Optim 63(6):2705–2720. https://doi.org/10.1007/s00158-020-02830-x
5. Patel NM, Kang B-S, Renaud JE, Tovar A (2009) Crashworthiness Design Using Topology Optimization. J Mech Des 131(6):061013. https://doi.org/10.1115/1.3116256
6. Ortmann C, Schumacher A (2013) Graph and heuristic based topology optimization of crash loaded structures. Struct Multidiscip Optim 47(6):839–854. https://doi.org/10.1007/s00158-012-0872-7
7. Beyer F, Schneider D, Schumacher A (2021) Finding three-dimensional layouts for crashworthiness load cases using the graph and heuristic based topology optimization. Struct Multidiscip Optim 63(1):59–73. https://doi.org/10.1007/s00158-020-02768-0
8. Olschinka C, Schumacher A (2008) Graph Based Topology Optimization of Crashworthiness Structures. In: Proceedings in applied mathematics and mechanics (PAMM), vol 8, pp 10029–10032. https://doi.org/10.1002/pamm.200810029
9. Ortmann C, Sperber J, Schneider D, Link S, Schumacher A (2021) Crashworthiness design of cross-sections with the Graph and Heuristic based Topology Optimization incorporating competing designs. Struct Multidiscip Optim 64(3):1063–1077. https://doi.org/10.1007/s00158-021-02927-x
10. Bohn B, Garcke J, Iza-Teran R, Paprotny A, Peherstorfer B, Schepsmeier U, Thole C-A (2013) Analysis of car crash simulation data with nonlinear machine learning methods. Procedia Computer Science 18:621–630. https://doi.org/10.1016/j.procs.2013.05.226
11. Kracker D, Dhanasekaran RK, Schumacher A, Garcke J (2022) Method for automated detection of outliers in crash simulations. Int J Crashworthiness 28(1):96–107. https://doi.org/10.1080/13588265.2022.2074634
12. Diez C (2019) Process for extraction of knowledge from crash simulations by means of dimensionality reduction and rule mining. Doctoral thesis, University of Wuppertal, Wuppertal. https://d-nb.info/1182555063/34
13. Iza-Teran R, Garcke J (2019) A Geometrical Method for Low-Dimensional Representations of Simulations 7(2). https://doi.org/10.1137/17M1154205
14. Hahner S, Iza-Teran R, Garcke J (2020) Analysis and prediction of deforming 3d shapes using oriented bounding boxes and lstm autoencoders. In: Farkaš I, Masulli P, Wermter S (eds) Artificial neural networks and machine learning – ICANN 2020, Springer, Cham, pp 284–296. https://doi.org/10.1007/978-3-030-61609-0_23
15. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
16. Koch M, Wang H, Bäck T (2018) Machine Learning for Predicting the Damaged Parts of a Low Speed Vehicle Crash. In: 2018 Thirteenth international conference on digital information management (ICDIM), pp 179–184. IEEE, Piscataway, NJ, USA. https://doi.org/10.1109/ICDIM.2018.8846974
17. Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction. MIT Press
18. Konda V, Tsitsiklis J (1999) Actor-critic algorithms. In: Solla S, Todd L, Müller K (eds) Advances in Neural Information Processing Systems (NIPS), vol 12. MIT Press, Denver, CO, USA, pp 1008–1014
19. Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, Riedmiller MA (2013) Playing atari with deep reinforcement learning. In: Neural information processing systems (NIPS) deep learning workshop, Lake Tahoe, NV, USA. https://doi.org/10.48550/arXiv.1312.5602
20. Haarnoja T, Ha S, Zhou A, Tan J, Tucker G, Levine S (2019) Learning to walk via deep reinforcement learning. In: Robotics: science and systems XV, Freiburg im Breisgau, Germany. https://doi.org/10.48550/arXiv.1812.11103
21. Silver D, Hubert T, Schrittwieser J, Antonoglou I, Lai M, Guez A, Lanctot M, Sifre L, Kumaran D, Graepel T, Lillicrap T, Simonyan K, Hassabis D (2018) A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science 362(6419):1140–1144. https://doi.org/10.1126/science.aar6404
22. Hayashi K, Ohsaki M (2020) Reinforcement Learning and Graph Embedding for Binary Truss Topology Optimization Under Stress and Displacement Constraints. Front Built Environ 6. https://doi.org/10.3389/fbuil.2020.00059
23. Hayashi K, Ohsaki M (2022) Graph-based reinforcement learning for discrete cross-section optimization of planar steel frames. Adv Eng Inform 51:101512. https://doi.org/10.1016/j.aei.2021.101512
24. Raffin A, Hill A, Gleave A, Kanervisto A, Ernestus M, Dormann N (2021) Stable-baselines3: reliable reinforcement learning implementations. J Mach Learn Res 22(268):1–8
25. Brockman G, Cheung V, Pettersson L, Schneider J, Schulman J, Tang J, Zaremba W (2016) OpenAI Gym. https://doi.org/10.48550/arXiv.1606.01540
26. Hagberg A, Schult DA, Swart PJ (2008) Exploring network structure, dynamics, and function using networkx. In: Varoquaux G, Vaught T, Millman J (eds) Proceedings of the 7th python in science conference, Pasadena, CA, USA, pp 11–15
27. Harris CR, Millman KJ, van der Walt SJ, Gommers R, Virtanen P, Cournapeau D, Wieser E, Taylor J, Berg S, Smith NJ, Kern R, Picus M, Hoyer S, van Kerkwijk MH, Brett M, Haldane A, Del Río JF, Wiebe M, Peterson P, Gérard-Marchant P, Sheppard K, Reddy T, Weckesser W, Abbasi H, Gohlke C, Oliphant TE (2020) Array programming with numpy. Nature 585(7825):357–362. https://doi.org/10.1038/s41586-020-2649-2
28. Diez C (2018) qd - Build your own LS-DYNA Tool Quickly in Python. In: 15th International LS-DYNA Users Conference, Detroit, MI, USA
29. Livermore Software Technology Corporation (LSTC): Ls-Dyna Manuals. https://www.dynasupport.com/manuals/
30. Ortmann C (2015) Entwicklung eines graphen- und heuristikbasierten Verfahrens zur Topologieoptimierung von Profilquerschnitten für Crashlastfälle. Doctoral thesis, University of Wuppertal, Wuppertal
31. Scarselli F, Gori M, Tsoi AC, Hagenbuchner M, Monfardini G (2009) The graph neural network model. IEEE Trans Neural Networks 20(1):61–80. https://doi.org/10.1109/TNN.2008.2005605
32. Kipf TN, Welling M (2017) Semi-Supervised Classification with Graph Convolutional Networks. In: 5th International conference on learning representations (ICLR), Toulon, France. https://doi.org/10.48550/arXiv.1609.02907
33. Narayanan A, Chandramohan M, Venkatesan R, Chen L, Liu Y, Jaiswal S (2017) graph2vec: Learning Distributed Representations of Graphs. https://doi.org/10.48550/arXiv.1707.05005
34. Trilling J, Schumacher A, Zhou M (2022) Generation of designs for local stiffness increase of crash loaded extrusion profiles with reinforcement learning. In: Machine learning and artificial intelligence in CFD and structural analysis, Wiesbaden, Germany. NAFEMS
35. Fukushima K (1980) Neocognitron: a self organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol Cybern 36(4):193–202. https://doi.org/10.1007/BF00344251
36. Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O (2017) Proximal Policy Optimization Algorithms. https://doi.org/10.48550/arXiv.1707.06347
37. Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G, Petersen S, Beattie C, Sadik A, Antonoglou I, King H, Kumaran D, Wierstra D, Legg S, Hassabis D (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529–533. https://doi.org/10.1038/nature14236

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.