Article
BEM-SM: A BERT-Encoder Model with Symmetry Supervision
Module for Solving Math Word Problem
Yijia Zhang † , Tiancheng Zhang *,† , Peng Xie, Minghe Yu and Ge Yu
School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China;
2171991@stu.neu.edu.cn (Y.Z.)
* Correspondence: tczhang@mail.neu.edu.cn
† These authors contributed equally to this work.
Abstract: To find solutions to math word problems, some modules have been designed to check the generated expressions, but they neither take into account the symmetry between math word problems and their corresponding mathematical expressions, nor do they exploit the strength of pretrained language models in natural language understanding tasks. Moreover, designing fine-tuning tasks for pretrained language models that encourage cooperation with other modules to improve the performance of math word problem solvers remains an unaddressed problem. To solve these problems, in this paper we propose a BERT-based model for solving math word problems with a supervision module. Building on pretrained language models, we present a fine-tuning task that predicts the number of different operators in an expression in order to learn the potential relationships between problems and expressions. Meanwhile, a supervision module is designed to check incorrectly generated expressions and to improve the model's performance by optimizing the encoder. A series of experiments is conducted on three datasets, and the experimental results demonstrate the effectiveness of our model and the designs of its components.
Table 1. Examples of MWPs. Here, ✗ indicates that the mathematical expression inferred by GTS is incorrect and cannot solve the problem of Example 2, while ✓ indicates that the mathematical expression inferred by our model is correct; this expression can be used to calculate the solution of Example 2.

Example 1
  Problem:    A manuscript has 1250 words. Fang typed 32 words in the morning and another 30 words in the afternoon. How many words are left to type?
  Expression: x = 1250 − 32 − 30
  Answer:     1188

Example 2
  Problem:    A manuscript has 1250 words, and Fang types an average of 32 words per minute. After she has typed for 30 min, how many words are left to type?
  Expression: x = 1250 − 32 ∗ 30
  GTS: x = 1250 − 32 − 30 (✗)    Ours: x = 1250 − 32 ∗ 30 (✓)
Although these methods, which focus only on expression generation, have shown good performance on MWP tasks, none of them can ensure that the generated expressions are consistent with the mathematical logic of the problems, nor do they design fine-tuning tasks that adapt pretrained language models to a supervision module so that the module can better check the symmetry between the expression generated by the decoder and the problem text. For example, as shown in Table 1, Example 2 involves only a small change in the wording of the problem, but a much larger change in its logic. The GTS model simply generates the expression directly from the problem text; it fails to infer that the relationship between the quantities "32" and "30" is multiplication, and therefore infers an incorrect expression.
To address these issues, we propose a BERT-encoder model with a supervision module for solving math word problems (BEM-SM). To further confirm the correctness of the generated expression, we design a multihead-attention-based supervision module that supervises the symmetry between the generated expression and the problem text. First, we train the classifier in advance to learn the matching expression of each problem. Then, during the solution process, it checks whether each problem matches the expression generated by the decoder. When a mismatch occurs, we improve the model by optimizing the problem representations computed by the encoder that led to the incorrect solutions. In addition, to exploit the advantages of pretrained language models in natural language understanding, we adopt them to obtain contextual representations of the problems. Moreover, we propose a fine-tuning task that predicts the number of different operators in an expression and cooperates with the supervision module to identify incorrect expressions generated by the decoding module. The main contributions of this paper are summarized as follows:
• We design a multihead-attention-based supervision module to check the generated
incorrect expressions and further improve the performance of the model by optimizing
the encoder.
• We present a fine-tuning task to predict the number of different operators in each
expression, which enables the model to better understand the potential connection
between problems and expressions.
• We conducted extensive experiments on three datasets, and the experimental results show that our model performs better than the current state-of-the-art methods.
2. Related Works
2.1. Math Word Problem
The automatic solution of math word problems, as one of the tools to assist students
in learning, has received widespread attention in recent years. The development of the
MWP task is closely related to the evolution of natural language processing techniques.
Early MWP techniques mainly include rule-based matching methods [13] and statistical
learning-based methods [14], which can only solve problems with limited scenarios due to
their heavy reliance on manual labor and their poor flexibility [15]. In 2017, Wang et al. [1]
proposed the first deep neural network solver, marking the beginning of solving techniques
based on deep learning methods. After that, several deep learning solver models based on
the seq2seq structure have been proposed, such as [1–6], which adopt neural networks to directly convert problem text sequences into expression sequences. Liu et al. [7] proposed a solver model with a seq-tree structure, which generates expression trees by decoding the tree structure top-down. Xie et al. [8] proposed another tree-structured model that generates expression trees based on a goal-driven approach. This tree decoding approach
has better generation results than decoding in sequence form. Zhang et al. [9] combined
the advantages of [8] and proposed a graph-tree model to convert problem information
into a graph structure to encode the quantitative relations in the form of a graph structure
for rich problem representations. Zhang et al. [10] proposed a teacher–student network
with multiple decoders to guide the inference of knowledge, in which a multiple decoding
structure composed of basic and perturbation decoders could obtain multiple solutions.
Wu et al. [11] designed a knowledge-aware solution model based on the seq-tree structure,
which combined commonsense knowledge from an external knowledge base in the problem
encoding part. Lin et al. [12] proposed a hierarchical solution model to understand
and analyze the problem from a word–sentence–problem level, and adopted a pointer
generation network at the decoder to guide the model to replicate the existing information
and infer additional information.
3. Model
3.1. Problem Statement
The input of a math word problem solver is a text sequence of length n, denoted as P = {p_1, p_2, ..., p_n}, where p_i is either a natural-language word or a number. The output is a mathematical expression of length m, denoted as A = {a_1, a_2, ..., a_m}, where a_i belongs to one of the following three parts. The first part, denoted V_num, consists of the numbers that appear in the problem text. The second part, denoted V_con, consists of external auxiliary constants that may be needed to solve math word problems, such as 1 and π. The third part, denoted V_op, consists of operators such as '+', '−', etc.
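As a concrete illustration, the sketch below (our own; the variable names and the choice of constants and operators are assumptions, not the authors' released code) assembles the decoder's output vocabulary from these three parts for Example 2 of Table 1.

```python
# Hypothetical sketch of the decoder's output vocabulary for Example 2 in Table 1.
# V_num: numbers copied from the problem text, V_con: auxiliary constants, V_op: operators.
problem = ("A manuscript has 1250 words, and Fang types an average of 32 words per minute. "
           "After she has typed 30 min, how many words are left to type?")

V_num = ["1250", "32", "30"]        # quantities extracted from the problem text
V_con = ["1", "3.14"]               # external auxiliary constants such as 1 and pi
V_op  = ["+", "-", "*", "/", "^"]   # mathematical operators

# The decoder predicts tokens a_1, ..., a_m from the union of the three parts.
output_vocab = V_op + V_con + V_num
target_expression = ["-", "1250", "*", "32", "30"]  # prefix form of x = 1250 - 32 * 30
assert all(tok in output_vocab for tok in target_expression)
```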
Figure 1. Overall architecture of the BEM-SM (The numbers in the Decoder Module are used to
explain how our decoding process is carried out).
Then, the overall representation vector $\bar{Z}$ of the problem is obtained by averaging the word vectors in matrix Z, that is, by taking the average over all columns of Z:

$\bar{Z} = \frac{1}{n} \sum_{i=1}^{n} z_i \qquad (2)$
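A minimal PyTorch-style sketch of Equation (2), assuming the encoder is a MacBERT checkpoint loaded with the transformers library; the mask handling is our own simplification and is not taken from the paper.

```python
import torch
from transformers import BertTokenizer, BertModel

# Hypothetical encoding step: obtain word vectors Z and their mean Z_bar (Equation (2)).
tokenizer = BertTokenizer.from_pretrained("hfl/chinese-macbert-base")
encoder = BertModel.from_pretrained("hfl/chinese-macbert-base")

inputs = tokenizer("A manuscript has 1250 words ...", return_tensors="pt")
Z = encoder(**inputs).last_hidden_state          # shape: (1, n, hidden)
mask = inputs["attention_mask"].unsqueeze(-1)    # ignore padding positions
Z_bar = (Z * mask).sum(dim=1) / mask.sum(dim=1)  # average over the n word vectors
```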
We enable the encoder to recognize the mathematical relationships present in the problem by introducing the operator prediction task. The decoding module and the supervision module can thus receive a more accurate problem representation, avoiding the generation of incorrect solutions. To verify this idea, we selected fifteen problems corresponding to the equation templates "N/N", "N ∗ N/N", and "N ∗ N + N", encoded them with both the original MacBERT and the MacBERT fine-tuned with our approach, and used t-SNE to reduce the dimensionality of the resulting problem representations; the results are shown as the blue, green, and red points in Figure 2. By observing the distribution of the points corresponding to the different problems, we can see that the fine-tuned encoder better separates problem representations with different numbers of operators: problems with similar mathematical expression templates obtain more similar vector representations, while the representations of unrelated problems are pushed further apart. This makes the matching relationships between the problem representation vectors and the expression representation vectors more apparent and the concatenated feature vectors more discriminative, which reduces the classification difficulty of the supervision module.
Figure 2. Scatter chart of problem representation. Here, points of different colors represent dimen-
sionally reduced representations of problems with different equation templates.
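To make the fine-tuning targets concrete, the following sketch (our own illustration, not the authors' code; the operator set is an assumption) derives operator-count labels from an expression string; these counts serve as the regression targets of the operator prediction task.

```python
from collections import Counter

OPERATORS = ["+", "-", "*", "/", "^"]

def operator_counts(expression: str) -> list[int]:
    """Count how many times each operator appears in a mathematical expression.
    The resulting vector is used as the target of the operator prediction task."""
    counts = Counter(ch for ch in expression if ch in OPERATORS)
    return [counts[op] for op in OPERATORS]

# Templates used in the t-SNE study of Figure 2:
print(operator_counts("N/N"))      # [0, 0, 0, 1, 0]
print(operator_counts("N*N/N"))    # [0, 0, 1, 1, 0]
print(operator_counts("N*N+N"))    # [1, 0, 1, 0, 0]
```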
The decoder decomposes the problem-solving goal into sub-goals combined by an operator in a top-down recursive way. The decoder module in Figure 1 illustrates an expression tree. Here, the feature vector $q_{root}$ of the root node of the expression tree is initialized with the vector $\bar{Z}$, which reflects the overall information of the problem computed by the encoder:

$q_{root} = \bar{Z} \qquad (4)$

After that, the attention mechanism is used to compute the context vector $\hat{c}$, and the mathematical symbol prediction is carried out from $q_{root}$ and $\hat{c}$, shown as steps 1, 3, 5, 7, and 9 in Figure 1:

$\hat{c} = \mathrm{Attention}(q, Z) \qquad (5)$
$\tilde{y} = \mathrm{Predict}(q, \hat{c}) \qquad (6)$

If the predicted token $\tilde{y}$ is a number or a constant, the subtree representation is built directly from $\tilde{y}$. If the predicted token $\tilde{y}$ is an operator, the goal needs to be decomposed into left and right sub-goals, and the left sub-goal $q_{left}$ is computed from the predicted token $\tilde{y}$ of this node, the context vector $\hat{c}$, and the goal vector $q$, shown as steps 2 and 6 in Figure 1:

$q_{left} = \mathrm{LeftChild}(q, \hat{c}, \tilde{y}) \qquad (7)$

$\tilde{y}_{left} = \mathrm{Predict}(q_{left}, \hat{c}_{left}) \qquad (8)$
Then we continue to construct the expression tree with the left child node as the root node in pre-order until the predicted mathematical symbol of a node is a number or a constant. The construction of the right sibling node of that node is then carried out. In addition to the features required to build the node, this step also uses the subtree representation $t_{left}$ of the left sibling node, shown as steps 4 and 8 in Figure 1. The prediction of the right sibling node is then made from the right child goal $q_{right}$ and its context vector $\hat{c}_{right}$:

$q_{right} = \mathrm{RightChild}(q, \hat{c}, \tilde{y}, t_{left}) \qquad (9)$

$\tilde{y}_{right} = \mathrm{Predict}(q_{right}, \hat{c}_{right}) \qquad (10)$
If the predicted symbol is an operator, the goal decomposition of the node continues until the predicted symbol is a number or a constant. The subtree representation $t$ of each node is then constructed upward layer by layer; the subtree representation of an operator node is built from the subtree representations of its left and right subtrees, as shown by steps 10 and 11 in Figure 1:

$t = \mathrm{SubTreeNum}(\tilde{y}), \quad \text{if } \tilde{y} \in V_{num} \cup V_{con} \qquad (11)$

$t = \mathrm{SubTreeOp}(t_{left}, t_{right}, \tilde{y}), \quad \text{if } \tilde{y} \in V_{op} \qquad (12)$
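The following condensed sketch shows how the pre-order decoding loop ties Equations (4)–(12) together. It is a structural illustration only: the calls attention, predict, LeftChild, RightChild, SubTreeNum, and SubTreeOp stand in for the learned networks of the decoder and are not runnable as-is.

```python
# Structural sketch of pre-order expression-tree construction (Equations (4)-(12)).
def decode_node(q, Z):
    c = attention(q, Z)                    # Eq. (5): context vector for the current goal
    y = predict(q, c)                      # Eq. (6): predict a number, constant, or operator
    if y in V_num or y in V_con:
        return SubTreeNum(y)               # Eq. (11): leaf subtree representation
    # y is an operator: decompose the goal into left and right sub-goals
    q_left = LeftChild(q, c, y)            # Eq. (7)
    t_left = decode_node(q_left, Z)        # recurse on the left child first (pre-order)
    q_right = RightChild(q, c, y, t_left)  # Eq. (9): right goal also sees the left subtree
    t_right = decode_node(q_right, Z)
    return SubTreeOp(t_left, t_right, y)   # Eq. (12): merge subtrees bottom-up

# The root goal is initialized with the problem representation (Eq. (4)): q_root = Z_bar.
```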
The expression representation is obtained by applying multihead attention over the embedded expression A:

$h_A = \mathrm{MultiheadAttention}(A, A, A) \qquad (13)$

Here we average $h_A$ and concatenate it with the problem representation $\bar{Z}$. The concatenated vector is then fed into the classifier to determine the consistency between the problem representation and the expression representation:

$\bar{h}_A = \frac{1}{m} \sum_{i=1}^{m} h_{A_i} \qquad (14)$

$u = \mathrm{FC}([\bar{Z} : \bar{h}_A]) \qquad (15)$

In Equation (15), $\bar{Z}$ and $\bar{h}_A$ are the mean vectors of $Z$ and $h_A$, and $[:]$ denotes the concatenation operation.
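A minimal PyTorch sketch of the supervision module described by Equations (13)–(15); the hidden size, number of heads, and classifier layout are our own assumptions, since the paper does not release this code.

```python
import torch
import torch.nn as nn

class SupervisionModule(nn.Module):
    """Checks whether a problem representation matches an expression representation."""
    def __init__(self, hidden: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, num_heads, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, A: torch.Tensor, Z_bar: torch.Tensor) -> torch.Tensor:
        # Eq. (13): self-attention over the expression embeddings A, shape (batch, m, hidden)
        h_A, _ = self.attn(A, A, A)
        # Eq. (14): average the attended expression vectors
        h_A_bar = h_A.mean(dim=1)
        # Eq. (15): concatenate with the problem representation and classify
        u = self.classifier(torch.cat([Z_bar, h_A_bar], dim=-1))
        return torch.sigmoid(u)  # probability that problem and expression match
```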
To improve the classification performance and to avoid errors caused by the limitations of the classifier, we use a negative sampling algorithm similar to that in [23], which provides the classifier with both the real mathematical expression A corresponding to each problem and a negative example expression A_neg generated from A. As shown in Algorithm 1, the parameter λ in the algorithm is a probability threshold between 0 and 1 that decides whether to change a mathematical symbol in A. In this paper, λ is set to 0.1.
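Algorithm 1 is not reproduced here; the sketch below captures the negative-example generation as we understand it from the description above. Details such as the replacement vocabulary and the fallback when no symbol is changed are our assumptions.

```python
import random

OPERATORS = ["+", "-", "*", "/", "^"]

def make_negative(expression: list[str], lam: float = 0.1) -> list[str]:
    """Generate a negative example A_neg by randomly perturbing symbols of A.
    Each symbol is changed with probability lam (set to 0.1 in the paper)."""
    negative = list(expression)
    for i, token in enumerate(negative):
        if random.random() < lam and token in OPERATORS:
            # swap the operator for a different one (numbers could be swapped analogously)
            negative[i] = random.choice([op for op in OPERATORS if op != token])
    if negative == expression:
        # ensure the negative example differs from A in at least one operator
        ops = [i for i, tok in enumerate(negative) if tok in OPERATORS]
        j = random.choice(ops) if ops else 0
        negative[j] = random.choice([op for op in OPERATORS if op != negative[j]])
    return negative

A = ["-", "1250", "*", "32", "30"]
A_neg = make_negative(A)
```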
In Equation (16), MSE stands for the mean squared error, $O_{pre}$ represents the number of each individual mathematical symbol in the expression predicted by the model, and $O_{truth}$ is the actual number of each individual mathematical symbol in the expression.
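A hedged PyTorch sketch of the operator-count prediction loss in Equation (16); the regression head, its input, and the function name are our own illustration rather than the paper's implementation.

```python
import torch
import torch.nn as nn

num_operator_types = 5                        # e.g., +, -, *, /, ^
op_head = nn.Linear(768, num_operator_types)  # hypothetical regression head on top of Z_bar

def operator_prediction_loss(Z_bar: torch.Tensor, O_truth: torch.Tensor) -> torch.Tensor:
    """Mean squared error between predicted and true operator counts (Equation (16))."""
    O_pre = op_head(Z_bar)                    # predicted count of each operator type
    return nn.functional.mse_loss(O_pre, O_truth.float())
```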
Loss Function for Supervision Module: The goal of the supervision module is to
determine whether the problem representation and the expression representation match,
and here our learning goal is to minimize the binary-cross-entropy loss:
$L_{supervision} = -\log P(u = 1 \mid A, \bar{Z}) - \log P(u = 0 \mid A_{neg}, \bar{Z}) \qquad (17)$
Loss Function for the Encoder–Decoder Module: The goal of the encoder–decoder module is to maximize the probability of generating the corresponding mathematical expression given the problem; in addition, under the guidance of the supervision module, the encoder should maximize the consistency between the representation $\bar{Z}$ and the answer expression A. Therefore, the loss function to be minimized is:

$L_{Encoder\text{--}Decoder} = -\log P(A \mid P) - \alpha \cdot \log P(u = 1 \mid A, \bar{Z}) \qquad (18)$
Parameter α is the weight of the loss of the supervision module during training. In
this paper, α is set to 0.05.
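Putting the two objectives together, a simplified sketch of Equations (17) and (18); the probability and generation-loss inputs are placeholders for the actual modules, and the wiring is our own interpretation.

```python
import torch

alpha = 0.05  # weight of the supervision signal in the encoder-decoder loss

def supervision_loss(p_match_pos: torch.Tensor, p_match_neg: torch.Tensor) -> torch.Tensor:
    """Eq. (17): binary cross-entropy over a positive pair (A, Z_bar)
    and a negative pair (A_neg, Z_bar)."""
    return -torch.log(p_match_pos) - torch.log(1.0 - p_match_neg)

def encoder_decoder_loss(log_p_expression: torch.Tensor, p_match_pos: torch.Tensor) -> torch.Tensor:
    """Eq. (18): expression generation loss plus the supervision term weighted by alpha."""
    return -log_p_expression - alpha * torch.log(p_match_pos)
```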
4. The Experiment
4.1. Datasets
In this paper, we use three publicly available datasets in this field as experimental
data. Among them, Math23K [1] and Ape210K [24] are Chinese datasets. MathQA [25]
is an English dataset that consists of large-scale complex English math word problems.
The Math23K dataset contains a total of 23,162 problems. Ape210K is a larger and more
complex MWP dataset, containing a total of 210,488 problems with 56,532 templates. Due
to the presence of noisy data in Ape210K, we used the Ape-clean dataset obtained by
Liang et al. [26] after data filtering, which contains 81,225 questions.
4.3. Baselines
We compared our method with some typical MWP solution models proposed so far:
• Group-ATT [5]: uses different functional multi-head attentions to extract the various relevant features of the problem.
• GTS [8]: generates expression trees in a goal-driven manner and is a widely used MWP benchmark model.
4.4. Evaluation
As with other benchmarking methods, we use the solution accuracy as an evaluation
metric. For the Math23k dataset, we use two validation methods, with the first using the
standard division given by the dataset, denoted Math23K, and the second using a fivefold
cross-validation method, denoted Math23K*. For the Ape-clean dataset, the division given
in [26] is used, which contains 79,388 training problems and 1837 testing problems. We
integrate the training set of Ape-clean and the remaining 129,263 questions from Ape210K
for fine-tuning the BERT. Furthermore, for the MathQA dataset, we use the standard
division given by the dataset, denoted MathQA, which contains 23,703 training problems
and 3540 testing problems.
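Answer accuracy is typically computed by evaluating the generated expression and comparing it with the ground-truth answer; the following sketch of that check is our own, and the numerical tolerance is an assumption chosen for illustration.

```python
def answer_accuracy(predicted_expressions: list[str], gold_answers: list[float],
                    tol: float = 1e-4) -> float:
    """Fraction of problems whose generated expression evaluates to the correct answer."""
    correct = 0
    for expr, gold in zip(predicted_expressions, gold_answers):
        try:
            value = eval(expr)   # expressions contain only numbers and operators
        except Exception:
            continue             # unevaluable expressions count as wrong
        if abs(value - gold) < tol:
            correct += 1
    return correct / len(predicted_expressions)

# Example: the two problems of Table 1
print(answer_accuracy(["1250 - 32 - 30", "1250 - 32 * 30"], [1188, 290]))  # 1.0
```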
Table 2. Comparison of the accuracy of the answers solved by the BEM-SM and the baseline method.
These results prove that our model can learn the potential relationship between problems and expressions by predicting the number of different operators, and can further improve the accuracy of the generated expressions through the supervision module.
We also conducted ablation experiments on the supervision module, and the results are shown in Tables 3 and 4, where BEM-SM w/o SM indicates that the model does not employ the supervision module. When the supervision module is not used, the performance of the model decreases for all three validation methods, which proves that the supervision module, designed to optimize the problem representation, can optimize our encoder and improve the rationality of the expressions generated by the model.
We also conducted experiments on the number of negative expressions, and the results are shown in Table 6. The model performs best when five negative expressions are generated for each correct expression.
Table 6. Expression accuracy (Ex-Acc) and answer accuracy (An-Acc) with different numbers of negative expressions generated for each correct expression.

Number of Negative Samples    Ex-Acc    An-Acc
1                             0.719     0.841
3                             0.719     0.843
5                             0.720     0.845
10                            0.715     0.842
In addition, we tested the performance of the model on problems with different expression lengths, as shown in Figure 3. The green broken line represents the proportion of problems with each expression length. It can be observed that BEM-SM exhibits the best solution performance in all cases, and the gap between our model and the other models becomes larger for long expressions (11+). To some extent, this shows that our model has more of an advantage on more complex problems.
[Figure 3 plots answer accuracy (left axis, Acc) and problem proportion (right axis, Proportion) against expression length (3-, 5, 7, 9, 11+) for GTS, Graph2Tree, and BEM-SM.]
Figure 3. Model accuracy on items of different expression lengths.
5. Conclusions
In this paper, we propose a BERT-based model with a supervision module for the
automatic solving of math word problems. We designed a multihead-attention-based
supervision module, which makes the encoder generate a more accurate problem repre-
sentation by checking the consistency between the problem representation and expression
representation to improve the solution accuracy. Based on pretrained models, we also
designed a fine-tuning task to predict the number of different operators to better ascertain
the relationship between the problems and expressions, which could improve the solution
accuracy. Our experimental results on three datasets also demonstrate the effectiveness of
our model and the design of its various components.
In the future, the supervision module's ability to improve model performance, together with its flexibility, could make it a third core component of MWP models alongside the encoding and decoding modules. In addition, designing other fine-tuning tasks to further improve the solution performance of the model is also worth studying.
On this basis, we can also try our modules and other fine-tuning tasks on other pretrained
language models.
Author Contributions: Conceptualization, Y.Z.; methodology, Y.Z.; software, Y.Z.; validation, Y.Z.,
and P.X.; formal analysis, T.Z.; investigation, Y.Z.; resources, T.Z.; data curation, Y.Z.; writing—
original draft preparation, Y.Z.; writing—review and editing, Y.Z.; visualization, Y.Z.; supervision,
T.Z.; project administration, T.Z., M.Y. and G.Y.; funding acquisition, T.Z. All authors have read and
agreed to the published version of the manuscript.
Funding: This work is supported by National Natural Science Foundation of China under Grant
(Nos.62272093, 62137001).
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: All data used in this manuscript was downloaded from Github.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Wang, Y.; Liu, X.; Shi, S. Deep neural solver for math word problems. In Proceedings of the 2017 Conference on Empirical
Methods in Natural Language Processing, Copenhagen, Denmark, 7–11 September 2017; pp. 845–854.
2. Wang, L.; Wang, Y.; Cai, D.; Zhang, D.; Liu, X. Translating a math word problem to an expression tree. arXiv 2018, arXiv:1811.05632.
3. Wang, L.; Zhang, D.; Zhang, J.; Xu, X.; Gao, L.; Dai, B.T.; Shen, H.T. Template-based math word problem solvers with recursive
neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February
2019; Volume 33, pp. 7144–7151.
4. Chiang, T.R.; Chen, Y.N. Semantically-aligned equation generation for solving and reasoning math word problems. arXiv 2018,
arXiv:1811.00720.
5. Li, J.; Wang, L.; Zhang, J.; Wang, Y.; Dai, B.T.; Zhang, D. Modeling intra-relation in math word problems with different functional
multi-head attentions. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence,
Italy, 28 July–2 August 2019; pp. 6162–6167.
6. Meng, Y.; Rumshisky, A. Solving math word problems with double-decoder transformer. arXiv 2019, arXiv:1908.10924.
7. Liu, Q.; Guan, W.; Li, S.; Kawahara, D. Tree-structured decoding for solving math word problems. In Proceedings of the
2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural
Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 2370–2379.
8. Xie, Z.; Sun, S. A goal-driven tree-structured neural model for math word problems. In Proceedings of the Twenty-Eighth
International Joint Conference on Artificial Intelligence (IJCAI-19), Macao, China, 10–16 August 2019; pp. 5299–5305.
9. Zhang, J.; Wang, L.; Lee, R.K.W.; Bin, Y.; Wang, Y.; Shao, J.; Lim, E.P. Graph-to-tree learning for solving math word problems. In
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020.
10. Zhang, J.; Lee, R.K.W.; Lim, E.P.; Qin, W.; Wang, L.; Shao, J.; Sun, Q. Teacher–student networks with multiple decoders for solving math word problem. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20), Yokohama, Japan, 11–17 July 2020.
11. Wu, Q.; Zhang, Q.; Fu, J.; Huang, X.J. A knowledge-aware sequence-to-tree network for math word problem solving. In
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November
2020; pp. 7137–7146.
12. Lin, X.; Huang, Z.; Zhao, H.; Chen, E.; Liu, Q.; Wang, H.; Wang, S. HMS: A hierarchical solver with dependency-enhanced understanding for math word problem. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; Volume 35, pp. 4232–4240.
13. Mukherjee, A.; Garain, U. A review of methods for automatic understanding of natural language mathematical problems. Artif.
Intell. Rev. 2008, 29, 93–122. [CrossRef]
14. Liang, C.C.; Hsu, K.Y.; Huang, C.T.; Li, C.M.; Miao, S.Y.; Su, K.Y. A tag-based statistical English math word problem solver with understanding, reasoning and explanation. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16), New York, NY, USA, 9–15 July 2016; pp. 4254–4255.
15. Zhang, D.; Wang, L.; Zhang, L.; Dai, B.T.; Shen, H.T. The gap of semantic parsing: A survey on automatic math word problem
solvers. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2287–2305. [CrossRef] [PubMed]
16. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv.
Neural Inf. Process. Syst. 2017, 30, 1–11.
17. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
18. Sun, Y.; Wang, S.; Li, Y.; Feng, S.; Tian, H.; Wu, H.; Wang, H. ERNIE 2.0: A continual pre-training framework for language understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 8968–8975.
19. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv 2019, arXiv:1907.11692.
20. Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv 2019, arXiv:1909.11942.
21. Clark, K.; Luong, M.-T.; Le, Q.V.; Manning, C.D. ELECTRA: Pre-training text encoders as discriminators rather than generators. arXiv 2020, arXiv:2003.10555.
22. Cui, Y.; Che, W.; Liu, T.; Qin, B.; Wang, S.; Hu, G. Revisiting pre-trained models for Chinese natural language processing. arXiv 2020, arXiv:2004.13922.
23. Liang, Z.; Zhang, X. Solving math word problems with teacher supervision. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI-21), Montreal, QC, Canada, 19–27 August 2021; pp. 3522–3528.
24. Zhao, W.; Shang, M.; Liu, Y.; Wang, L.; Liu, J. Ape210K: A large-scale and template-rich dataset of math word problems. arXiv 2020, arXiv:2009.11506.
25. Amini, A.; Gabriel, S.; Lin, P.; Koncel-Kedziorski, R.; Choi, Y.; Hajishirzi, H. MathQA: Towards interpretable math word problem solving with operation-based formalisms. arXiv 2019, arXiv:1905.13319.
26. Liang, Z.; Zhang, J.; Shao, J.; Zhang, X. MWP-BERT: A strong baseline for math word problems. arXiv 2021, arXiv:2107.13435.
27. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.