Programs
Benjamin Paaßen1, Jessica McBroom1, Bryn Jeffries2, Irena Koprinska1, and Kalina Yacef1
1 School of Computer Science, The University of Sydney
2 Grok Learning
arXiv:2103.11614v1 [cs.LG] 22 Mar 2021
This work is a preprint, provided by the authors, and has been submitted to the
Journal of Educational Datamining
Abstract
Educational datamining involves the application of datamining techniques to student
activity. However, in the context of computer programming, many datamining techniques
can not be applied because they expect vector-shaped input whereas computer programs
have the form of syntax trees. In this paper, we present ast2vec, a neural network that
maps Python syntax trees to vectors and back, thereby facilitating datamining on com-
puter programs as well as the interpretation of datamining results. Ast2vec has been
trained on almost half a million programs of novice programmers and is designed to be
applied across learning tasks without re-training, meaning that users can apply it without
any need for (additional) deep learning. We demonstrate the generality of ast2vec in three
settings: First, we provide example analyses using ast2vec on a classroom-sized dataset,
involving visualization, student motion analysis, clustering, and outlier detection, includ-
ing two novel analyses, namely a progress-variance-projection and a dynamical systems
analysis. Second, we consider the ability of ast2vec to recover the original syntax tree from
its vector representation on the training data and two further large-scale programming
datasets. Finally, we evaluate the predictive capability of a simple linear regression on
top of ast2vec, obtaining similar results to techniques that work directly on syntax trees.
We hope ast2vec can augment the educational datamining toolbelt by making analyses
of computer programs easier, richer, and more efficient.
Keywords: computer science education, computer programs, word embeddings, repre-
sentation learning, neural networks, visualization, program vectors
1 Introduction
Techniques for analyzing and utilizing student programs have been the focus of much recent
research in computer science education. Such techniques have included hint systems to provide
automated feedback to students [Paaßen et al., 2018b, Piech et al., 2015b, Price et al., 2019,
Rivers and Koedinger, 2017], as well as visualization and search approaches to help teachers
understand student behavior [McBroom et al., 2018, Nguyen et al., 2014]. Considering that
programming is a key skill in many fields, including science, technology, engineering and
mathematics [Denning, 2017, McCracken et al., 2001, Wiles et al., 2009], and considering the
difficulty students have with learning programming [Denning, 2017, Lahtinen et al., 2005,
Qian and Lehman, 2017, Robins et al., 2003], developing and expanding the range of available
techniques to improve educational outcomes is of great importance.
Unfortunately, computer programs are particularly difficult to analyze for two main rea-
sons. First, programs come in the form of raw code or syntax trees (after compilation), which
few datamining techniques are equipped to handle. Instead, one first has to represent pro-
grams differently to turn them into valid input for data mining techniques [Paaßen et al.,
2018b]. Second, the space of possible programs grows combinatorially with program length
and most possible programs are never written by any student, and even fewer are written more than once [Paaßen et al., 2018b, Rivers and Koedinger, 2012, 2017]. Accordingly, one needs to abstract from meaningless differences between programs to shrink the space and, thus, make it easier to handle with
less risk of overfitting.
Several prior works have addressed both the representation and the abstraction step,
often in conjunction. For example, Rivers and Koedinger [2012] have suggested semantically
motivated transformations of syntax trees to remove syntactic variations that have no semantic
influence (such as unreachable code or direction of binary operators). Peddycord III. et al.
[2014] suggest to represent programs by their output rather than their syntax. Similarly,
Paaßen et al. [2016] suggest to represent programs by their execution behavior. Gulwani et al.
[2018] as well as Gross et al. [2014] perform a clustering of programs to achieve a representation
in terms of a few discrete clusters. We point to the ’related work’ section and to the review of
McBroom et al. [2019] for a more comprehensive list. We also note that many of the possible
abstraction and representation steps are not opposed but can be combined to achieve better
results [McBroom et al., 2019].
The arguably most popular representation of computer programs are pairwise tree edit
distances [Zhang and Shasha, 1989]. For example, Mokbel et al. [2013], Paaßen et al. [2018b],
Price et al. [2017], and Rivers and Koedinger [2017] have applied variations of the standard
tree edit distance for processing programs. Edit distances have the advantage that they do not
only quantify distance between programs, they also specify which parts of the code have to
be changed to transform one program into another, which can be utilized to construct hints
[Paaßen et al., 2018b, Price et al., 2017, Rivers and Koedinger, 2017]. Additionally, many
datamining techniques for visualization, clustering, regression, and classification can deal
with input in terms of pairwise distances [Pekalska and Duin, 2005, Hammer and Hasenfuss,
2010]. Still, a vast majority of techniques can not. For example, of 126 methods contained in
the Python library scikit-learn1 , only 24 can natively deal with pairwise distances, eight further
methods can deal with kernels, which require additional transformations [Gisbrecht and Schleif,
2015], and 94 only work with vector-shaped input. As such, having a vector-formed represen-
tation of computer programs enables a much wider range of analysis techniques. Further, a
distance-based representation depends on a database of reference programs to compare it to.
The computational complexity required for analyzing a program thus scales at least linearly
with the size of the database. By contrast, a parametric model with vector-shaped input can
perform operations independent of the size of the training data. Finally, a representation in
terms of pairwise distances tries to solve the representation problem for computer programs
every time anew for each new learning task. Conceptually, it appears more efficient to share
representational knowledge across tasks. In this paper, we aim to achieve just that: To find a
mapping from computer programs to vectors in Euclidean space (and back) that generalizes
across learning tasks and thus solves the representation problem ahead of time so that we, as
educational dataminers, only need to add a simple model specific to the analysis task we wish
to solve. In other words, we wish to achieve for computer programs what word embeddings
like word2vec [Mikolov et al., 2013] or GloVe [Pennington et al., 2014] offer for natural lan-
guage processing: a re-usable component that solves the representation problem of programs
such that subsequent research can concentrate on other datamining challenges. Our technique
1 Taken from this list: https://scikit-learn.org/stable/modules/classes.html.
Write a program that asks the user, “What are your favourite animals?”. If their
response is “Quokkas”, the program should print “Quokkas are the best!”. If their
response is anything else, the program should print “I agree, x are great animals.”,
where x is their response.
Our four simulated unit tests checked 1) whether the first line of output was “What are
your favourite animals?”, and 2-4) whether the remaining output was correct for the inputs
“Quokkas”, “Koalas” and “Echidnas”, respectively.
Overall, the dataset contains 58 (partial) programs, and these form N = 25 unique syntax trees after compilation. Since ast2vec converts program trees to n = 256 dimensional vectors, this means the data matrix $X \in \mathbb{R}^{N \times n}$ after encoding all programs is of size $N \times n = 25 \times 256$.
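To make the encoding step concrete, the following is a minimal sketch of how such a data matrix could be assembled. The encode_tree stub below merely stands in for the trained ast2vec encoder; the released package exposes this functionality under its own (different) API, and ast2vec abstracts away concrete strings and variable names, so its notion of a unique tree is coarser than a raw comparison of Python ASTs.

import ast
import numpy as np

# Stand-in for the trained ast2vec encoder (AST -> 256-dimensional vector).
# The name and signature here are placeholders, not the real interface.
def encode_tree(tree: ast.AST) -> np.ndarray:
    return np.zeros(256)  # replace with the actual model call

submissions = ["print('What are your favourite animals?')"]  # placeholder; the example dataset has 58 submissions

# Compile each submission and keep one representative per unique syntax tree.
unique = {}
for source in submissions:
    try:
        tree = ast.parse(source)
    except SyntaxError:
        continue  # partial programs that do not compile are skipped
    unique.setdefault(ast.dump(tree), tree)

# Encoding every unique tree yields the data matrix X of shape (N, 256).
X = np.stack([encode_tree(tree) for tree in unique.values()])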
[Figure 1 plot (axes: progress and variance); one of the three most common programs shown beside it is:
    x = input('<string>')
    if x == '<string>':
        print('<string>')
    else:
        print('<string>')]
Figure 1: A progress-variance plot of the example task. The special points (0, 0) and (1, 0)
correspond to the empty program and the reference solution, respectively. The size of points
corresponds to their frequency in the data. Arrows indicate student motion. Different students
are plotted in different colors.
give educators a concise overview of typical paths towards the goal, where students tend to
struggle, as well as how many typical strategies exist [McBroom et al., 2021].
In order to create such a visualization, we combine ast2vec with a second technique that
we call progress-variance projection. Our key idea is to construct a linear projection from
256 dimensions to two dimensions, where the first axis captures the progress from the empty
program towards the goal and the second axis captures as much of the remaining variance
between programs as possible. A detailed description of this approach is given in Section 3.2.
Figure 1 shows the result of applying this process to the sample data. Each circle in the plot
represents a unique syntax tree in the data, with larger circles representing more frequent trees
and the three most common trees shown on the right. Note that (0, 0) represents the empty
program and (1, 0) represents the solution. In addition, the arrows indicate student movement
between programs, with different colors representing different students. More specifically, an
arrow is drawn from vector ~x to ~y if the student made a program submission corresponding
to vector ~x followed by a submission corresponding to vector ~y .
Without the axis labels, this plot is similar to the interaction networks suggested by
Barnes et al. [2016], which have already been shown to provide useful insights into student
learning [McBroom et al., 2018]. However, in contrast to interaction networks, our progress-
variance projection additionally maps vectors to meaningful locations in space. In particular,
the x-axis corresponds to the (linear) progress toward the solution, whereas the y-axis corresponds
to variation between student programs that is orthogonal to the progress direction. This can
provide an intuitive overview of the types of programs students submit and how they progress
through the exercise.
Additionally, the plot reveals to us that most syntax trees only occur a single time and that
student programs differ a lot on a fine-grained level, especially close to the solution around
(0.8, 0.4). However, while exact repetitions of syntax trees are rare, the coarse-grained motion
of students seems to be consistent, following an arc from the origin to the correct solution.
We will analyze this motion in more detail in the next section.
Finally, we notice that multiple students tend to get stuck at a program which only does a single function call (corresponding to point (0.5, 0.55) in the plot). This stems from programs that have the correct syntax tree but not the right string (e.g. due to typos). Accordingly, if students fix the string they still have the same syntax tree and, hence, the same location in the visualization.
In summary, this section demonstrates how ast2vec can be used to visualize student progress through a programming task, how we can interpret such a visualization, and how the vectorial output of ast2vec enabled us to develop a novel visualization technique in the first place.

[Figure 2 plot; the final program decoded along the simulated trace (the reference solution) is:
    x = input('<string>')
    if x == '<string>':
        print('<string>')
    else:
        print('<string>')]
Figure 2: A linear dynamical system f, which has a single attractor at the correct solution and approximates the motion of students through the space of programs (orange). Arrows are scaled with factor 0.3 for better visibility. An example trace starting at the empty program and following the dynamical system is shown in blue. Whenever the decoded program changes along the trace, we show the code on the right. The coordinate system is the same as in Figure 1.
behavior we observe in Figure 1, namely the arc-shaped motion from the empty program
towards the reference solution. We can verify this finding by simulating a new trace that
follows our dynamical system. In particular, we start at the empty program ~x0 and then
set ~x_{t+1} = f(~x_t) until the location does not change much anymore. The resulting motion
is plotted in blue. We further decode all steps ~x_t using ast2vec and inspect the resulting
syntax tree. These trees are shown in Figure 2 on the right. We observe that the simulated
trace corresponds to reasonable student motion, namely to first ask the user for input, then
add a single print statement, and finally to extend the solution with an if-statement that can
distinguish between the input ’Quokka’ and all other inputs.
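The simulation itself only requires iterating the fitted system and decoding each step. The sketch below assumes that f is the fitted linear dynamical system from Section 3.3 and that decode_vector stands in for the ast2vec decoder; both are passed in as callables because the real API names differ.

import numpy as np

def simulate_trace(f, x_empty, decode_vector, tol=1e-3, max_steps=100):
    """Follow the fitted dynamical system from the empty program: iterate
    x_{t+1} = f(x_t) until the location barely changes, decoding each step
    back into a syntax tree for inspection."""
    x = np.asarray(x_empty, dtype=float)
    programs = [decode_vector(x)]
    for _ in range(max_steps):
        x_next = np.asarray(f(x), dtype=float)
        if np.linalg.norm(x_next - x) < tol:  # (approximately) reached the attractor
            break
        x = x_next
        programs.append(decode_vector(x))
    return programs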
where p(~x|k) is the standard Gaussian density. The mean, covariance, and prior p(k) of each
Gaussian are learned according to the data. After training, we assign data vectors ~xi to clus-
ters by evaluating the posterior probability p(k|~xi ) and assigning the Gaussian k with highest
probability. Note that, prior to clustering, we perform a standard principal component anal-
ysis (PCA) because distances tend to degenerate for high dimensionality and thus complicate
distance-based clusterings [Aggarwal et al., 2001]. We set the PCA to preserve 95% of the
data variance, which yielded 12 dimensions. Then, we cluster the data via a Gaussian mixture
model with K = 4 Gaussians.
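This clustering pipeline maps directly onto standard scikit-learn components; a minimal sketch (with placeholder data instead of the actual ast2vec encodings) looks as follows.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

# Placeholder for the (N, 256) matrix of ast2vec encodings.
X = np.random.randn(25, 256)

# Keep as many principal components as needed to preserve 95% of the variance
# (12 dimensions for the example dataset in the text).
pca = PCA(n_components=0.95)
X_low = pca.fit_transform(X)

# Fit a Gaussian mixture model with K = 4 components and assign each program
# to the component with the highest posterior probability.
gmm = GaussianMixture(n_components=4, random_state=0)
clusters = gmm.fit_predict(X_low)

# For interpretation, map the cluster means back to the 256-dimensional coding
# space; these vectors can then be decoded into syntax trees with ast2vec.
means_high_dim = pca.inverse_transform(gmm.means_)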
Figure 3 shows the resulting clustering with clusters indicated by color and cluster means
highlighted as diamonds. We also decode all means back into syntax trees using ast2vec.
Indeed, we notice that the clustering roughly corresponds to the progress through the task
with the blue cluster mean containing an input and a print statement, the yellow cluster mean
storing the input in a variable, the purple cluster mean adding an if statement, and the orange
cluster mean printing out the user input.
[Plot in the progress-variance plane (progress 0 to 1); decoded programs shown beside it include:
    x = input('<string>')
    if x == '<string>':
        print('<string>')
and
    x = input('<string>')
    if x == '<string>':
        print(f('<string>' + x))]
plot corresponds to an input statement without a print statement, which is an unusual path
towards the solution. In this dataset, it is more common to write the print statement first.
The two outliers around (0.9, 0.45) correspond to the following program shape:
Here, the question for the favourite animals is posed as a print statement and the input is
requested with a separate command, which is a misunderstanding of how input works.
The final outlier is the program:
This program does pass all our test cases but does not adhere to the “spirit” of the task,
because it does not generalize to new inputs. Such an outlier may instruct us that we need to
change our test cases to be more general, e.g. by using a hidden test with a case that is not
known to the student.
This concludes our example data analyses using ast2vec. We note that further types of
data analysis are very well possible, e.g. to define an interaction network [Barnes et al., 2016]
on top of a clustering. In the remainder of this paper, we will explain how ast2vec works in
more detail and evaluate it on large-scale data.
[Figure: schematic of the autoencoder. The encoder φ maps an input tree x̂ to a vector ~x, Gaussian noise ~ε is added to obtain ~x + ~ε, and the decoder reconstructs a tree ŷ from the noisy code.]
3 Methods
In this section, we describe the methods employed in this paper in more detail. In Section 3.1,
we provide a summary of the autoencoder approach we used for ast2vec. In Section 3.2, we
describe the progress-variance projection that we used to generate 2D visualizations of student
data. Finally, in Section 3.3, we explain how to learn linear dynamical systems from student
data.
[Figure 5: the abstract syntax tree Module(Expr(Call(Name, Str))) (left) and its recursive encoding via the functions f^Name, f^Str, f^Call, f^Expr, and f^Module (right).]
where ~ε_i is a Gaussian noise vector that is generated randomly in every round of training, β is a hyper-parameter to regulate how smooth we want our coding space to be, and D_KL(φ(x̂_i) + ~ε_i) is the Kullback-Leibler divergence between the distribution of the noisy neural code φ(x̂_i) + ~ε_i and the standard normal distribution, i.e. it penalizes the model if the code distribution becomes too different from a standard normal distribution.
In the next paragraphs, we describe the encoder φ, the decoder ψ, the probability distri-
bution pψ , and the training scheme for ast2vec.
Encoder: The first step of our encoding is to use the Python compiler to generate an
abstract syntax tree for the program. Now, let x(ŷ1 , . . . , ŷK ) be such an abstract syntax tree,
where x is the root syntactic element and ŷ1 , . . . , ŷK are its K child subtrees. For an example
of such a syntax tree, refer to Figure 5 (left). Our encoder φ is then defined as follows.
$$\phi\big(x(\hat y_1, \ldots, \hat y_K)\big) := f^x\big(\phi(\hat y_1), \ldots, \phi(\hat y_K)\big), \qquad (2)$$
where $f^x : \mathbb{R}^{K \times n} \to \mathbb{R}^n$ is a function that takes the vectors $\phi(\hat y_1), \ldots, \phi(\hat y_K)$ for all children as input and returns a vector for the entire tree, including the syntactic element x and all its children. Because Equation 2 is recursively defined, we also call our encoding
recursive. Figure 5 shows an example encoding for the program print(’Hello,␣world!’) with
n = 3 dimensions. We first use the Python compiler to translate this program into the
abstract syntax tree Module(Expr(Call(Name, Str))) and then apply our recursive encod-
ing scheme. In particular, recursively applying Equation 2 to this tree yields the expression
f^Module(f^Expr(f^Call(f^Name(), f^Str()))). We can evaluate this expression by first computing the leaf terms f^Name() and f^Str(), which do not require any inputs because they have no children. This yields two vectors, one representing Name and one representing Str, respectively. Next, we feed these two vectors into the encoding function f^Call, which transforms them into a vector code of the subtree Call(Name, Str). We feed this vector code into f^Expr, whose output we feed into f^Module, which in turn yields the overall vector encoding for the tree. Note that
our computation follows the structure of the tree bottom-up, where each encoding function
receives exactly as many inputs as it has children.
A challenge in this scheme is that we have to know the number of children K of each
syntactic element x in advance to construct a function f x . Fortunately, the grammar of the
programming language3 tells us how many children are permitted for each syntactic element.
For example, an if element in the Python language has three children, one for the condition,
one for the code that gets executed if the condition evaluates to True (the ‘then’ branch), and
one for the code that gets executed if the condition evaluates to False (the ‘else’ branch).
Leaves, like Str or Name, are an important special case. Since these have no children, their
encoding is a constant vector f^x ∈ R^n. This also means that we encode all possible strings
as the same vector (the same holds for all different variable names or all different numbers).
Incorporating content information of variables in the encoding is a topic for future research.
Another important special case are lists. For example, a Python program is defined as a
list of statements of arbitrary length. In our scheme, we encode a list by adding up all vectors
inside it. The empty list is represented as a zero vector.
Note that, up to this point, our procedure is entirely general and has nothing to do
with neural nets. Our approach becomes a (recursive) neural network because we implement
the encoding functions f^x as neural networks. In particular, we use a simple single-layer feedforward network for the encoding function f^x of the syntactic element x:
$$f^x(\vec y_1, \ldots, \vec y_K) = \tanh\big(U^x_1 \cdot \vec y_1 + \ldots + U^x_K \cdot \vec y_K + \vec b^x\big), \qquad (3)$$
where $U^x_k \in \mathbb{R}^{n \times n}$ is the matrix that decides what information flows from the kth child to its parent and $\vec b^x \in \mathbb{R}^n$ represents the syntactic element x. These U matrices and the b vectors are the parameters of the encoder that need to be learned during training.
Importantly, this architecture is still relatively simple. If one would aim to optimize
autoencoding performance, one could imagine implementing f x instead with more advanced
neural networks, such as a Tree-LSTM [Chen et al., 2018, Dai et al., 2018, Tai et al., 2015].
This is a topic for future research.
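To illustrate the recursion of Equations 2 and 3, the following is a minimal sketch built on Python's ast module. It is not the trained ast2vec model: the parameters are random placeholders, leaves are simplified to Name and Constant nodes, and the child structure is read directly from the AST fields rather than from the full grammar handling used by ast2vec.

import ast
import numpy as np

n = 256  # encoding dimensionality
rng = np.random.default_rng(0)
params = {}  # per (symbol, arity): bias b^x and matrices U^x_k of Equation 3

def get_params(symbol, arity):
    key = (symbol, arity)
    if key not in params:
        params[key] = {
            "b": rng.normal(scale=0.1, size=n),
            "U": [rng.normal(scale=0.1, size=(n, n)) for _ in range(arity)],
        }
    return params[key]

def encode(node):
    """Recursive encoder phi (Equation 2): encode all children bottom-up, then
    apply the single-layer network f^x (Equation 3). Lists of statements are
    encoded as the sum of their members' codes; the empty list is the zero vector."""
    if isinstance(node, list):
        return sum((encode(child) for child in node), np.zeros(n))
    if isinstance(node, (ast.Name, ast.Constant)):
        children = []  # leaves: all names, strings and numbers share one code
    else:
        children = [getattr(node, f) for f in node._fields
                    if isinstance(getattr(node, f), (ast.AST, list))]
    p = get_params(type(node).__name__, len(children))
    pre = p["b"].copy()
    for U_k, child in zip(p["U"], children):
        pre = pre + U_k @ encode(child)
    return np.tanh(pre)

# Example: encode print('Hello, world!') as in Figure 5.
vec = encode(ast.parse("print('Hello, world!')"))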
Decoding: To decode a vector ~x recursively back into a syntax tree x(ŷ1 , . . . , ŷK ), we have
to make K + 1 decisions. First, we have to decide the syntactic element x. Then, we have to
decide the vector codes ~y1 , . . . , ~yK for each child. For the first decision we set up a function hA
which computes a numeric score hA (x|~x) for each possible syntactic element x from the vector
~x. We then select the syntactic element x with the highest score. The A in the index of h
refers to the fact that we guide our syntactic decision by the Python grammar. In particular,
we only allow syntactic elements to be chosen that are allowed by the current grammar symbol
A. All non-permitted elements receive a score of −∞. For simplicity, we do not discuss the
details of grammar rules here but point the interested reader to Paaßen et al. [2021].
Once we know the syntactic element, the Python grammar tells us the number of children
K, i.e. how many child codes we need to generate. Accordingly, we use K decoding functions
g^x_k : R^n → R^n, which tell us the vector code for each child based on the parent code ~x. The
precise definition of the decoding procedure is as follows.
$$\psi(\vec x) = x\big(\psi(\vec y_1), \ldots, \psi(\vec y_K)\big) \quad \text{where} \quad x = \arg\max_y h_A(y|\vec x) \;\text{ and }\; \vec y_k = g^x_k(\vec x) \text{ for all } k \in \{1, \ldots, K\} \qquad (4)$$
3 The entire grammar for the Python language can be found at this link: https://docs.python.org/3/library/ast.html.
In other words, we first use h_A to choose the current syntactic element x with maximum score h_A(x|~x), use the decoding functions g^x_1, ..., g^x_K to compute the neural codes for all children, and proceed recursively to decode child subtrees.
An example of the decoding process is shown in Figure 6. As input, we receive some vector ~x, which we first feed into the scoring function h_A. As the Python grammar requires that each program begins with a Module, the only score higher than −∞ is h_A(Module|~x), meaning that the root of our generated tree is Module. Next, we feed the vector ~x into the decoding function g^Module_1, yielding a vector g^Module_1(~x) to be decoded into the child subtree. With this vector, we re-start the process, i.e. we feed the vector g^Module_1(~x) into our scoring function h_A, which this time selects the syntactic element Expr. We then generate the vector representing the child of Expr as ~y = g^Expr_1(g^Module_1(~x)). For this vector, h_A selects Call, which expects two children. We obtain the vectors for these two children as g^Call_1(~y) and g^Call_2(~y). For these vectors, h_A selects Name and Str, respectively. Neither of these has children, such that the process stops, leaving us with the tree Module(Expr(Call(Name, Str))).
Again, we note the special case of lists. To decode a list, we use a similar scheme, where we let h_A make a binary choice whether to continue the list or not, and use the decoding function g^x_k to decide the code for the next list element.
We implement the decoding functions g^x_k with single-layer feedforward neural networks as in Equation 3. A special case are the decoding functions for list-shaped children, which we implement as recurrent neural networks, namely gated recurrent units [Cho et al., 2014]. Again, we note that one could choose to implement all decoding functions g^x_k as recurrent networks, akin to the decoding scheme suggested by Chen et al. [2018]. Here, we opt for a simple version and leave the extension to future work.
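The following sketch illustrates the grammar-guided decoding recursion on a toy grammar fragment. It omits the list decoding with gated recurrent units, the probabilistic sampling discussed below, and of course the trained parameters, so the produced trees are arbitrary; it only shows the control flow of Equation 4.

import numpy as np

n = 256
rng = np.random.default_rng(1)

# Toy grammar fragment mirroring the running example, not the full Python grammar:
# which elements each nonterminal may expand to, and the child slots of each element.
grammar = {"module": ["Module"], "stmt": ["Expr"], "expr": ["Call", "Name", "Str"]}
children_of = {"Module": ["stmt"], "Expr": ["expr"],
               "Call": ["expr", "expr"], "Name": [], "Str": []}

# Placeholder parameters (learned in ast2vec): a linear scoring layer h_A per
# nonterminal and a single-layer decoding function g^x_k per (symbol, child slot).
score_W = {A: rng.normal(size=(len(opts), n)) for A, opts in grammar.items()}
dec_W = {(x, k): rng.normal(scale=0.1, size=(n, n))
         for x, kids in children_of.items() for k in range(len(kids))}
dec_b = {(x, k): rng.normal(scale=0.1, size=n)
         for x, kids in children_of.items() for k in range(len(kids))}

def decode(vec, nonterminal="module", depth=0, max_depth=10):
    """Recursive decoder psi (Equation 4): pick the highest-scoring element the
    grammar allows, then decode one child vector per grammar slot and recurse."""
    options = grammar[nonterminal]
    scores = score_W[nonterminal] @ vec          # h_A(x | vec) for allowed x
    symbol = options[int(np.argmax(scores))]
    if depth >= max_depth:                       # safety cap for the random weights
        leaves = [x for x in options if not children_of[x]]
        symbol = leaves[0] if leaves else symbol
    subtrees = []
    for k, child_nonterminal in enumerate(children_of[symbol]):
        child_vec = np.tanh(dec_W[(symbol, k)] @ vec + dec_b[(symbol, k)])
        subtrees.append(decode(child_vec, child_nonterminal, depth + 1, max_depth))
    return (symbol, subtrees)

tree = decode(rng.normal(size=n))  # e.g. ('Module', [('Expr', [...])])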
For the scoring function h_A, we use a linear layer with n inputs and as many outputs as there are syntactic elements in the programming language. Importantly, this choice process can also be modelled in a probabilistic fashion, where the syntactic element x is chosen with probability
$$p(x|\vec x) = \frac{\exp\big(h_A(x|\vec x)\big)}{\sum_y \exp\big(h_A(y|\vec x)\big)}, \qquad (5)$$
i.e. x is sampled according to a softmax distribution with weights h_A(x|~x). The probability p_ψ(x̂|~x) from the optimization problem 1 is then defined as the product over all these probabilities during decoding. In other words, p_ψ(x̂|~x) is the probability that the tree x̂ is sampled when decoding the vector ~x.
1. ~δ = ~x∗ − ~x0, where ~x∗ and ~x0 are the encodings of the solution and empty program, respectively. This is used as the x-axis of the plot and captures progress towards or away from the solution.
2. ~ν, which is chosen as the unit vector orthogonal to ~δ that preserves as much variance in the dataset as possible. Note that this setup is similar to principal component analysis [Pearson, 1901], but with the first component fixed to ~δ.

4 https://ncss.edu.au
5 https://groklearning.com/challenge/
6 Contrary to other neural networks, recursive neural nets can not be trained on GPUs because the computational graph is unique for each tree.

[Figure 7 plot: loss (log scale, roughly 10^0 down to 10^-2) versus training epoch (0 to 1.2 · 10^5).]
Figure 7: The learning curve for our recursive tree grammar autoencoder trained on 448,992 Python programs recorded in the 2018 NCSS beginners challenge of groklearning. The dark orange curve shows the loss of Equation 1 over the number of training epochs. For better readability we show the mean over 100 epochs, enveloped by the standard deviation (light orange). Note that the plot is in log scaling, indicating little change toward the end.

[Figure 8: geometric construction of the progress-variance projection: a three-dimensional dataset (axes x, y, z) with the empty program ~x0, the reference solution ~x∗, the progress direction ~δ, and the orthogonal variance direction ~ν, projected down to the two-dimensional progress-variance plane.]
The full algorithm to obtain the 2D representation of all vectors is shown in Algorithm 1.
$$\vec y = \begin{pmatrix}\hat\delta & \vec\nu\end{pmatrix}^T \cdot (\vec x - \vec x_0) \,/\, \|\vec\delta\| \qquad \text{(high to low projection)} \qquad (6)$$
$$\vec x' = \begin{pmatrix}\hat\delta & \vec\nu\end{pmatrix} \cdot \vec y \cdot \|\vec\delta\| + \vec x_0, \qquad \text{(low to high embedding)} \qquad (7)$$
where $\hat\delta = \vec\delta / \|\vec\delta\|$ is the unit vector parallel to ~δ and $(\hat\delta \;\; \vec\nu)$ is the matrix with δ̂ and ~ν as columns.
In general, ~x′ will not be equal to the original ~x because the 2D space can not preserve all
information in the n = 256 dimensions. However, our special construction ensures that the
empty program ~x0 corresponds exactly to the origin (0, 0) and that the reference solution ~x∗
corresponds exactly to the point (1, 0), which can be checked by substituting ~x0 and ~x∗ into
the equations above. In other words, the x-axis represents linear progress in the coding space
toward the goal, and the y-axis represents variance orthogonal to progress. An example for
the geometric construction of our progress-variance plot is shown in Figure 8. In this example,
we project a three-dimensional dataset down to 2D.
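A compact numpy sketch of Equations 6 and 7 is given below. The construction of ~ν here (the leading principal direction of the data after removing the progress component, measured relative to the empty program) follows the description above; Algorithm 1 in the paper may differ in details such as centering.

import numpy as np

def progress_variance_projection(X, x_empty, x_star):
    """Project n-dim encodings to 2D: x-axis = progress from the empty program
    to the reference solution, y-axis = dominant remaining variance (Eqs. 6-7)."""
    delta = x_star - x_empty
    delta_hat = delta / np.linalg.norm(delta)
    centered = X - x_empty
    # Remove the progress direction, then take the first principal direction of
    # what remains as the variance axis nu (orthogonal to delta by construction).
    residual = centered - np.outer(centered @ delta_hat, delta_hat)
    _, _, Vt = np.linalg.svd(residual, full_matrices=False)
    nu = Vt[0]
    V = np.stack([delta_hat, nu], axis=1)                 # columns: delta_hat, nu
    Y = (centered @ V) / np.linalg.norm(delta)            # Equation 6
    X_back = Y * np.linalg.norm(delta) @ V.T + x_empty    # Equation 7
    return Y, X_back

# Example with random placeholder encodings:
rng = np.random.default_rng(0)
X = rng.normal(size=(25, 256)); x0 = np.zeros(256); xs = rng.normal(size=256)
Y, _ = progress_variance_projection(np.vstack([X, [x0], [xs]]), x0, xs)
# The last two rows land at (0, 0) and (1, 0), respectively.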
where ~x∗ is the reference solution to the task and W is a matrix of parameters to be learned.
This form of the dynamical system is carefully chosen to ensure two desirable properties.
First, the reference solution is a guaranteed fixed point of our system, i.e. we obtain f (~x∗ ) =
~x∗ . Indeed, we can prove that this system has the reference solution ~x∗ as unique stable
attractor if W has sufficiently small eigenvalues (refer to Theorem 1 in the appendix for
details). This is desirable because it guarantees that the default behavior of our system is to
move towards the correct solution.
Second, we can learn a matrix W that best describes student motion via simple linear
regression. In particular, let ~x1 , . . . , ~xT be a sequence of programs submitted by a student in
their vector encoding provided by ast2vec. Then, we wish to find the matrix W which best
captures the dynamics in the sense that f (~xt ) should be as close as possible to ~xt+1 for all t.
More formally, we obtain the following minimization problem.
$$\min_W \; \sum_{t=1}^{T-1} \big\|f(\vec x_t) - \vec x_{t+1}\big\|^2 + \lambda \cdot \|W\|_F^2, \qquad (9)$$
where $\|W\|_F$ is the Frobenius norm of the matrix W and where λ > 0 is a parameter that can be increased to ensure that the reference solution remains an attractor. This problem has the closed-form solution
$$W = \big(X_{t+1} - X_t\big)^T \cdot X_t \cdot \big(X_t^T \cdot X_t + \lambda \cdot I\big)^{-1}, \qquad (10)$$
where $X_t = (\vec x^* - \vec x_1, \ldots, \vec x^* - \vec x_{T-1})^T \in \mathbb{R}^{(T-1) \times n}$ is the concatenation of all vectors in the trace up to the last one and $X_{t+1} = (\vec x^* - \vec x_2, \ldots, \vec x^* - \vec x_T)^T \in \mathbb{R}^{(T-1) \times n}$ is the concatenation of all successors. Refer to Theorem 2 in the appendix for a proof.
As a side note, we wish to highlight that this technique can readily be extended to multiple
reference solutions by replacing ~x∗ in Equations 8 and 10 with the respective closest correct
solution to the student’s current state. In other words, we partition the space of programs
according to several basins of attraction, one per correct solution. The setup and training
remain the same, otherwise.
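As an illustration only (not the authors' reference implementation), the closed form of Equation 10, as reconstructed above, can be computed directly with numpy:

import numpy as np

def fit_w(trace, x_star, lam=1e-3):
    """Ridge-regression closed form for W (Equation 10, as reconstructed above).

    trace:  array of shape (T, n) with the ast2vec encodings of one student's
            consecutive submissions.
    x_star: encoding of the reference solution (shape (n,)).
    """
    trace = np.asarray(trace, dtype=float)
    X_t = x_star - trace[:-1]      # rows x* - x_t,     t = 1, ..., T-1
    X_next = x_star - trace[1:]    # rows x* - x_{t+1}
    n = X_t.shape[1]
    A = X_t.T @ X_t + lam * np.eye(n)
    return (X_next - X_t).T @ X_t @ np.linalg.inv(A)

# Data from several students can be pooled by stacking the corresponding
# rows of X_t and X_next before applying the same formula.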
This concludes our description of methods. In the next section, we evaluate these methods
on large-scale datasets.
4 Evaluation
In this section, we evaluate the ast2vec model on two large-scale anonymised datasets of
introductory programming, namely the 2018 and 2019 beginners challenge by the National
Computer Science School (NCSS)7 , an Australian educational outreach programme. The
courses were delivered by the Grok Learning platform8 . Each offering was available to (mostly)
Australian school children in Years 5–10, with curriculum-aligned educational slides and sets
of exercises released each week for five weeks. Students received a score for successfully
completing each exercise, with the available score starting at 10 points and reducing by one point for every 5 incorrect submissions, down to a minimum of 5 points. In both datasets, we consider only
submissions, i.e. programs that students deliberately submitted for evaluation against test
cases.
The 2018 dataset contains data of 12,141 students working on 26 different programming tasks, yielding 148,658 compilable programs. This is also part of the data on which ast2vec was trained. The 2019 dataset contains data of 10,558 students working on 24 problems, yielding 194,797 compilable programs overall.

7 https://ncss.edu.au
8 https://groklearning.com/challenge/
For our first analysis, we further broaden our scope and include a third dataset with
a slightly different course format, in particular the Australian Computing Academy digital
technologies (DT) chatbot project9 which consists of 63 problems and teaches students the
skills to program a simple chatbot, and is also delivered via the Grok Learning platform. Our
dataset includes the data of 27,480 students enrolled between May 2017 and August 2020, yielding 1,343,608 compilable programs.
We first check how well ast2vec is able to correctly autoencode trees in all three datasets
and then go on to check its utility for prediction. We close the section by inspecting the
coding space in more detail for the example dataset from Section 2.
[Figure 9: three panels plotting autoencoding error (TED) and dataset coverage [%] against tree size (0 to 70), for the NCSS beginners 2018 data (top), the NCSS beginners 2019 data (middle), and the DT Chatbot data (bottom).]
Figure 9: The autoencoding error as measured by the tree edit distance for the NCSS 2018
beginners challenge (top), NCSS 2019 beginner challenge (middle) and DT chatbot course
(bottom). The error is plotted here versus tree size. The orange line marks the mean, the
black line the median error. The orange region is the standard deviation around the mean.
Additionally, the blue line indicates how many trees in the dataset have a size up to x.
[Figure 10 plot: time in ms versus tree size (0 to 100) for encoding and decoding.]
Figure 10: The time needed for encoding (orange) and decoding (blue) trees of different sizes
for the DT chatbot course. The thick lines mark the means whereas shaded regions indicate
one standard deviation. The dotted lines indicate the best linear fit.
Figure 10 shows the time needed to encode (orange) and decode (blue) a tree from the
chatbot dataset of a given size. As the plot indicates, the empirical time complexity for both
operations is roughly linear in the tree size, which corresponds to the theoretical findings
of Paaßen et al. [2021]. Using a linear regression without intercept, we find that encoding
requires roughly 0.9 milliseconds per ten tree nodes, whereas decoding requires roughly 2.8
milliseconds per ten tree nodes. Given that decoding involves more operations (both element
choice and vector operations), this difference in runtime is to be expected. Fortunately, both
operations remain fast even for relatively large trees.
1. the simple identity, which predicts the next step to be the same as the current one and
is included as a baseline [Hyndman and Koehler, 2006],
2. one-nearest-neighbor prediction (1NN). This involves searching the training data for the
closest tree based on tree edit distance and predicting its successor, which is similar to
the scheme suggested by Gross and Pinkwart [2015], and
3. the continuous hint factory [Paaßen et al., 2018b, CHF], which is also based on the
tree edit distance but instead involves Gaussian process regression and heuristic-driven
reconstruction techniques to predict next steps.
[Figure 11 plots: prediction error (TED) versus task number for the baseline, 1NN, CHF, and ast2vec + linear; top panel: 2018 NCSS challenge (tasks 1 to 26), bottom panel: 2019 NCSS challenge (tasks 1 to 24).]
Figure 11: The average root mean square prediction error (in terms of tree edit distance)
across the entire beginners 2018 (top) and 2019 (bottom) curriculum for different prediction
methods. The x-axis uses the same order of problems as students worked on them.
Both 1NN and CHF are nonlinear predictors with a much higher representational
capacity, such that we would expect them to be more accurate [Paaßen et al., 2018a]. We
investigate this in the following section.
[Figure 12 plot: training and prediction time (log scale) versus number of students (50 to 500).]
Figure 12: The average training (dashed lines) and prediction time (solid lines) for different
training set sizes for the 2019 dataset. Prediction times are measured as the accumulated
time to provide predictions for an entire student trace. Note that the y axis is in log scale.
tasks involved larger programs (task 4 was an early exercise, but it involved a large program
scaffold that students needed to modify). Based on the analysis in the previous section,
larger programs tend to have higher autoencoding error, which we would expect to affect the
prediction error. Task 8, however, is interesting because it involves relatively small programs.
On closer inspection, this task has many different possible solutions, so the error could be
related to the simple design of our model (specifically, that it only makes predictions towards
the most common solution). All in all, this suggests that the simple model performs comparably
to the others, and only performs worse in a small number of specific and explainable cases.
We note in passing that the naive baseline of predicting the current step also performs quite well in both datasets. This may seem surprising but is a common finding in forecasting
settings [Hyndman and Koehler, 2006, Paaßen et al., 2018a]. In particular, a low baseline
error merely indicates that students tend to change their program only a little in each step,
which is to be expected. Importantly, though, the baseline can not be used in an actual
hint system, where one would utilize the difference between a student’s current state and the
predicted next state to generate hints [Rivers and Koedinger, 2017, Paaßen et al., 2018b]. For
the baseline, there is no difference, and hence no hints can be generated.
4.2.2 Runtime
Another interesting aspect is training and prediction time. While training is very fast across
all methods (1NN does not even need training), the prediction time of neighborhood-based
methods like 1NN and CHF scales with the size of the training dataset, whereas the prediction
time for our ast2vec+linear scheme remains constant. Figure 12 displays the average training
and prediction times for all methods and varying training dataset sizes on the 2019 dataset.
As we can see, our proposed scheme always needs about 300ms to make all predictions for
a student trace (including encoding and decoding times), whereas 1NN exceeds this time
starting from ≈ 80 students and CHF is slower across all training set sizes, requiring more than a
second even for small training data sizes. This is an important consideration in large-scale
educational contexts, where many predictions may need to be made very quickly.
neighborhood-based schemes. As noted in Section 3.3, one important property of our simple
model is that the predictions are mathematically guaranteed to lead to a solution, and the
linearity of the model means this is done smoothly. By contrast, neither 1NN nor CHF can
formally guarantee convergence to a correct solution if one follows the predictions. Addition-
ally, the predictions of 1NN are discontinuous, i.e. they change suddenly if the student enters a
different neighborhood. Overall, a smooth and guaranteed motion towards a correct solution
could be valuable for designing a next-step hint system where the generated hints should be
both directed to the desired target and intuitive [McBroom et al., 2019, Paaßen et al., 2018b,
Rivers and Koedinger, 2017].
Furthermore, a dynamical system based on ast2vec only has to solve the problem of
describing student motion in the space of programs for a particular task. Constructing the
space of programs in the first place is already solved by ast2vec. By contrast, a neighborhood-
based model must solve both the representation and the learning problem anew for each
learning task.
Additionally, any neighborhood-based model must store the entire training dataset at all
times and recall it for every prediction, whereas the ast2vec+linear model merely needs to store and recover the predictive parameters, which include the ast2vec parameters and, for each learning task, the vector encoding of the correct solution and the matrix W. This is not only more time-efficient (as we saw above) but also space-efficient, as each new learning task only requires an additional (n + 1) × n = 257 × 256 = 65,792 floating point numbers to be stored, which is roughly half a megabyte.
Finally, if additional predictive accuracy is desired, one can improve the predictive model
by replacing the linear prediction with a nonlinear prediction (e.g. a small neural network),
whereas the predictive accuracy of a neighborhood-based system can only be improved by
using a richer distance measure or more training data.
In summary, this evaluation has shown that a simple linear regression model using the
vector encodings of ast2vec can perform comparably to neighborhood-based next-step pre-
diction techniques in terms of prediction error, but has other desirable properties that may
make it preferable in many settings, such as constant time and space complexity, guaranteed
convergence to a correct solution, smoothness, and separation of concerns between represen-
tation and prediction. As such, we hope that ast2vec can contribute to improving educational datamining pipelines, e.g. for next-step hint prediction, in the future.
[Figure 13 plot (progress 0 to 1, variance −0.5 to 0.5), with the empty program at the origin; one of the trees shown in the boxes is:
    x = input('<string>')
    if x == '<string>':
        print('<string>')
    else:
        print('<string>')]
Figure 13: A linear interpolation between the empty program at (0, 0) and the correct solution
at (1, 0) as well as along the axis of biggest variation. Color indicates the tree that the grid
point decodes to. The three most common trees are shown in the boxes.
This visualization shows several reassuring properties: First, neighboring points in the grid
tend to decode to the same tree. Second, if two points correspond to the same tree, the space
between them also correspond to the same tree, i.e. there are no two points corresponding to
the same tree at completely disparate locations of the grid. Finally, the syntax trees that are
particularly common are also meaningful for the task, in the sense that they actually occur in
student data. Overall, this bolsters our confidence that the encoding space is indeed smooth.
5 Related Work
In this paper, we propose a novel, vectorial encoding for computer programs to support
educational datamining. In doing so, we build on prior work which also proposed alternative
representations for computer programs. In the remainder of this section, we review these
alternative representations and relate them to ast2vec.
Abstract Syntax Trees: An abstract syntax tree (AST) represents the program code as
a tree of syntactic elements, which is also the internal representation used by compilers to
transform human-readable program code to machine code [Aho et al., 2006]. For example,
Figure 5 (left) shows the abstract syntax tree for the Python program print(’Hello,␣world!’).
In most processing pipelines for computer programs, compiling the source code into an AST
is the first step, because it tokenizes the source code meaningfully, establishes the hierar-
chical structure of the program, and separates functionally relevant syntax from functionally
irrelevant syntax, like variable names or comments [Aho et al., 2006, McBroom et al., 2019].
Additionally, several augmentations have been suggested to make syntax trees more meaning-
ful. In particular, Rivers and Koedinger [2012] suggest canonicalization rules to match trees
that are functionally equivalent, such as normalizing the direction of comparison operators,
inlining helper functions, or removing unreachable code. Gross et al. [2014] further suggest to
insert edges in the tree between variable names and their original declaration, thus augment-
ing the tree to a graph. Following prior work, we also use the AST representation as a first
step before we apply a neural network to it. Although we do not use them here explicitly, the
canonicalization rules of Rivers and Koedinger [2012] are directly compatible with ast2vec.
A more difficult match are the additional reference edges suggested by Gross et al. [2014]
because these would require neural networks that can deal with full graphs, i.e. graph neural
networks [Kipf and Welling, 2016, Scarselli et al., 2009].
Tree Patterns: Since many datamining methods can not be applied to trees, several prior
works have suggested to transform the AST to a collection of patterns first. Such repre-
sentations are common in the domain of natural language processing, where texts can be
represented as a collection of the words or n-grams they contain [Broder et al., 1997]. Such
techniques can be extended to trees by representing a tree by the subtree patterns it contains
[Nguyen et al., 2014, Zhi et al., 2018, Zimmerman and Rupakheti, 2015]. For example, the
syntax tree “Module(Expr(Call(Name, Str)))” could be represented as the following collection
of subtree patterns of height two: “Call(Name, Str)”, “Expr(Call)”, and “Module(Expr)”. Once
this collection is computed, we can associate all possible subtree patterns with a unique index
in the range 1, . . . , K and then represent a tree by a K-dimensional vector, where entry k
counts how often the kth subtree pattern occurs in the tree.
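As an illustration of this representation (not of ast2vec itself), the height-two patterns of a Python AST can be counted in a few lines; note that Python's ast module uses Constant for string literals and includes context nodes such as Load, which the simplified trees in the example above omit.

import ast
from collections import Counter

def height_two_patterns(tree):
    """Count all height-two subtree patterns, e.g. 'Expr(Call)' or
    'Call(Name, Constant)', mirroring the example patterns given above."""
    patterns = Counter()
    for parent in ast.walk(tree):
        children = [type(child).__name__ for child in ast.iter_child_nodes(parent)]
        if children:
            patterns[f"{type(parent).__name__}({', '.join(children)})"] += 1
    return patterns

counts = height_two_patterns(ast.parse("print('Hello, world!')"))
# A fixed ordering of all observed patterns then yields the K-dimensional
# count vector described above.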
If the collection of subtree patterns is meaningful for the programming task at hand, this
representation can be both simple and powerful. For example, if a programming task is about
writing a while loop, it is valuable to know whether a student’s program actually contains such
a loop. Zhi et al. [2018] have considered this issue in more detail and considered both expert-
based and data-driven tree patterns as a representation. However, if there is no clear relation
between tree patterns and progress in a task, it may be problematic to use such patterns
as a representation, in particular when all tree patterns are weighted equally. ast2vec is
weakly related to tree patterns because the tree pattern determines which encoding/decoding
functions are called when processing the tree. However, the vector returned by the network
does not simply count tree patterns but, instead, considers the entire tree structure, i.e. how
the tree patterns are nested into each other to form the entire tree. More precisely, ast2vec
is trained to generate vector encodings that still contain enough information to recover the
original tree, whereas tree pattern counting usually does not enable us to recover the original
tree.
enables us to decouple the general problem of representing programs as vectors from task-
specific problems. We can utilize the implicit information of 500,000 student programs in
ast2vec in a new task and may need only little new student data to solve the additional,
task-specific problem at hand.
That being said, the basic logic of distance measures is still crucial to ast2vec. In par-
ticular, the negative log-likelihood in Equation 1 can be interpreted as a measure of distance
between the original program and its autoencoded version, and our notion of a smooth en-
coding can be interpreted as small distances between vectors in the encoding space implying
small distances of the corresponding programs.
Clustering: One of the challenges of computer programs is that the space of programs for
the same task is very large and it is infeasible to design feedback for all possible programs.
Accordingly, several researchers have instead grouped programs into a small number of clus-
ters, for which similar feedback can be given [Choudhury et al., 2016, Glassman et al., 2015,
Gross et al., 2014, Gulwani et al., 2018]. Whenever a new student requests help, one can
simply check which cluster the student’s program belongs to, and assign the feedback for that
cluster. One typical way to perform clustering is by using a pairwise distance measure on trees,
such as the tree edit distance [Choudhury et al., 2016, Gross et al., 2014, Zhang and Shasha,
1989]. However, it is also possible to use clustering strategies more specific to computer
programs, such as grouping programs by their control flow [Gulwani et al., 2018] or by the
unit tests they pass [McBroom et al., 2021]. ast2vec can be seen as a preprocessing step for
clustering. Once all programs are encoded as vectors, standard clustering approaches can be
applied. Additionally, ast2vec provides the benefit of being able to interpret the clustering by
translating cluster centers back into syntax trees (refer to Section 2.4).
Execution Behavior: Most representations discussed so far have focused on the AST of a program. How-
ever, it is also possible to represent programs in terms of their execution behavior. The most
popular example is the representation by test cases, where a program is fed with example in-
put. If the program’s output is equal to the expected output, we say the test has passed, other-
wise failed [Ihantola et al., 2010]. This is particularly useful for automatic grading of programs
because test cases verify the functionality, at least for some examples, while giving students
freedom on how to implement this functionality [Ihantola et al., 2010]. Further, failing a cer-
tain unit test can indicate a specific misconception, warranting functionality-based feedback
[McBroom et al., 2021]. However, for certain types of tasks the computational path toward an
output may be relevant, for example when comparing sorting programs. Paaßen et al. [2016]
therefore use the entire computation trace as a representation, i.e. the sequence of states of all
variables in a program. Unfortunately, a mismatch in the output or in the execution behavior
toward the output is, in general, difficult to relate to a specific change the student would need
to perform in order to correct the problem. To alleviate this challenge, test cases have to be
carefully and densely designed, or the challenge has to be left to the student. As such, we
believe that there is still ample room for AST-based representations, like our proposed vector
encodings, which are closer to the actions a student can actually perform on their own code.
Neural networks: Prior work already has investigated neural networks to encode computer
programs. In particular, Piech et al. [2015a] used a recursive tensor network to encode syn-
tax trees and performed classification on the embeddings to decide whether certain feedback
should be applied or not; Alon et al. [2019] represented syntax trees as a collection of paths,
encoded these paths as vectors and then computed a weighted average to obtain an overall
encoding of the program; Chen et al. [2018] translate programs into other programming lan-
guages by means of an intermediary vector representation; and Dai et al. [2018] proposed an
auto-encoding model for computer programs that incorporates an attribute grammar of the
programming language to not only incorporate syntactic, but also semantic constraints (like
that variables can only be used after declaration).
Both the works of Piech et al. [2015a] and Alon et al. [2019] are different from our contri-
bution because they do not optimize an autoencoding loss but instead train a neural net for
the classification of programs, i.e. into feedback classes or into tags. This scheme is unable to
recover the original program from the network’s output and requires expert labelling for all
training programs. Ast2vec has neither of those limitations. The work of Chen et al. [2018]
is more similar in that both input and output are programs. However, the network does not
include grammar constraints and uses an attention mechanism on the original tree to decide
on its output. This is not possible in our setting where we wish to perform datamining solely
in the vector space of encodings and be able to decode any encoding back into a tree, without
reference to a prior tree. The most similar prior work to our own is the model of Dai et al. [2018], which is also a grammar-constrained autoencoder, albeit with an LSTM-based encoder and decoder. One could frame ast2vec as a combination of
the auto-encoding ability and grammar knowledge of Dai et al. [2018] with the recursive pro-
cessing of Piech et al. [2015a], yielding a recursive tree grammar autoencoder [Paaßen et al.,
2021]. That being said, future work may incorporate more recurrent network concepts and
thus improve autoencoding error further.
Next-step hints: Ample prior work has considered the problem of predicting the next
step a student should take in a learning system, refer e.g. to the review of [McBroom et al.,
2019]. On a high level, this concerns the problem of selecting the right sequence of lessons
to maximize learning gain [Lan et al., 2014, Reddy et al., 2016]. In this paper, we rather
consider the problem of predicting the next code change within a single programming task.
This problem has been considered in more detail, for example, by Rivers and Koedinger [2017],
Price et al. [2017], Paaßen et al. [2018b], and Price et al. [2019]. Here, we compare to two
baselines evaluated by Price et al. [2019], namely the continuous hint factory [Paaßen et al.,
2018b] and a nearest-neighbor prediction [Gross and Pinkwart, 2015]. We note, however, that
we only evaluate the predictive accuracy, not the hint quality, which requires further analysis
[Price et al., 2019].
6 Conclusion
In this paper, we presented ast2vec, a novel autoencoder to translate the syntax trees of
beginner Python programs to vectors in Euclidean space and back. We have trained this
autoencoder on almost 500,000 student Python programs and evaluated it in three settings.
First, we utilized the network for a variety of analyses on a classroom-sized dataset, thereby
demonstrating the flexibility of the approach qualitatively. As part of our qualitative analysis,
we also showed that the encoding space of ast2vec is smooth - at least for the example - and
introduced two novel techniques for analyzing programming data based on ast2vec, namely
progress-variance-projections for a two-dimensional visualization, and a linear dynamical sys-
tems method that guarantees convergence to the correct solution.
In terms of quantitative analyses, we evaluated the autoencoding error of ast2vec as well as
the predictive accuracy of a simple linear model on top of ast2vec. We found that ast2vec had
a low autoencoding error for the majority of programs, including on two unseen, large-scale
datasets, though the error tended to increase with tree size. In addition, the encoding and
decoding times were low and consistent with the theoretical O(n) bounds, suggesting ast2vec
is scalable. Moreover, by coupling ast2vec with our linear dynamical systems method, we
were able to approach the predictive accuracy of existing methods with a very simple model
that is both time- and space-efficient and guarantees convergence to a correct solution.
While we believe that these results are encouraging, we also acknowledge limitations: At
present, ast2vec does not decode the content of variables, although such content may be
decisive to solve a task correctly. Further improvements in terms of autoencoding error are
possible as well, especially for larger programs, perhaps by including more recurrent networks
as encoders and decoders. Finally, the predictive accuracy of our proposed linear dynamical
systems model has not yet achieved the state-of-the-art and nonlinear predictors are likely
necessary to improve performance further.
Still, ast2vec provides the educational datamining community with a novel tool that can
be utilized without any need for further deep learning in a broad variety of analyses. We are
excited to see its applications in the future.
7 Acknowledgements
Funding by the German Research Foundation (DFG) under grant number PA 3460/1-1 is
gratefully acknowledged.
References
Charu C. Aggarwal, Alexander Hinneburg, and Daniel A. Keim. On the surprising behavior
of distance metrics in high dimensional space. In Jan Van den Bussche and Victor Vianu,
editors, Proceedings of the International Conference on Database Theory (ICDT 2001),
pages 420–434, Berlin, Heidelberg, 2001. Springer. doi:10.1007/3-540-44503-X_27.
Alfred Aho, Monica Lam, Ravi Sethi, and Jeffrey Ullman. Compilers: Principles, Techniques,
and Tools. Addison Wesley, Boston, MA, USA, 2 edition, 2006.
Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. Code2vec: Learning distributed
representations of code. Proceedings of the ACM on Programming Languages, 3:40, 2019.
doi:10.1145/3290353.
Tiffany Barnes, Behrooz Mostafavi, and Michael J Eagle. Data-driven domain models for
problem solving, volume 4, pages 137–145. US Army Research Laboratory, Orlando, FL,
USA, 2016.
Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. Syntactic
clustering of the web. Computer Networks and ISDN Systems, 29(8):1157 – 1166, 1997.
doi:10.1016/S0169-7552(97)00031-7.
M.C. Campi and P.R. Kumar. Learning dynamical systems in a stationary environment.
Systems & Control Letters, 34(3):125 – 132, 1998. doi:10.1016/S0167-6911(98)00005-X.
Xinyun Chen, Chang Liu, and Dawn Song. Tree-to-tree neural networks for pro-
gram translation. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-
Bianchi, and R. Garnett, editors, Proceedings of the 31st International Conference
on Advances in Neural Information Processing Systems (NeurIPS 2018), 2018. URL
https://papers.nips.cc/paper/2018/hash/d759175de8ea5b1d9a2660e45554894f-Abstract.html.
Kyunghyun Cho, Bart Van Merrienboer, Çaglar Gülçehre, Fethi Bougares, Holger
Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-
decoder for statistical machine translation. In Alessandro Moschitti, Bo Pang, and
Walter Daelemans, editors, Proceedings of the 2014 Conference on Empirical Meth-
ods in Natural Language Processing (EMNLP 2014), pages 1724–1734, 2014. URL
https://www.aclweb.org/anthology/D14-1179.
Rohan Roy Choudhury, HeZheng Yin, Joseph Moghadam, and Armando Fox. Autostyle:
Toward coding style feedback at scale. In Darren Gergle, Meredith Ringel Morris, Pernille
Bjørn, Joseph Konstan, Gary Hsieh, and Naomi Yamashita, editors, Proceedings of the
19th ACM Conference on Computer Supported Cooperative Work and Social Computing
Companion (CSCW 2016), page 21–24, 2016. doi:10.1145/2818052.2874315.
Hanjun Dai, Yingtao Tian, Bo Dai, Steven Skiena, and Le Song. Syntax-directed varia-
tional autoencoder for structured data. In Yoshua Bengio, Yann LeCun, Tara Sainath,
Ian Murray, Marc Aurelio Ranzato, and Oriol Vinyals, editors, Proceedings of the
6th International Conference on Learning Representations (ICLR 2018), 2018. URL
https://openreview.net/forum?id=SyqShMZRb.
Andrej Gisbrecht and Frank-Michael Schleif. Metric and non-metric proximity transformations
at linear costs. Neurocomputing, 167:643–657, 2015. doi:10.1016/j.neucom.2015.04.017.
Elena L. Glassman, Jeremy Scott, Rishabh Singh, Philip J. Guo, and Robert C. Miller.
Overcode: Visualizing variation in student solutions to programming problems at scale.
ACM Transactions on Computer-Human Interaction, 22(2):7, 2015. doi:10.1145/2699751.
Sebastian Gross and Niels Pinkwart. How do learners behave in help-seeking when given a
choice? In Cristina Conati, Neil Heffernan, Antonija Mitrovic, and M. Felisa Verdejo, edi-
tors, Proceedings of the 17th International Conference on Artificial Intelligence in Education
(AIED 2015), pages 600–603, 2015. doi:10.1007/978-3-319-19773-9_71.
Sebastian Gross, Bassam Mokbel, Benjamin Paaßen, Barbara Hammer, and Niels Pinkwart.
Example-based feedback provision using structured solution spaces. International Journal
of Learning Technology, 9(3):248–280, 2014. doi:10.1504/IJLT.2014.065752.
Sumit Gulwani, Ivan Radiček, and Florian Zuleger. Automated clustering and program repair
for introductory programming assignments. ACM SIGPLAN Notices, 53(4):465–480, 2018.
doi:10.1145/3296979.3192387.
Barbara Hammer and Alexander Hasenfuss. Topographic mapping of large dissimilarity data
sets. Neural Computation, 22(9):2229–2284, 2010. doi:10.1162/NECO_a_00012.
Elad Hazan, Karan Singh, and Cyril Zhang. Learning linear dynamical systems via
spectral filtering. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus,
S. Vishwanathan, and R. Garnett, editors, Proceedings of the 30th Conference on Advances in Neural Information Processing Systems (NIPS 2017), pages 6702–6712, 2017. URL
https://proceedings.neurips.cc/paper/2017/file/165a59f7cf3b5c4396ba65953d679f17-Paper.pdf
Rob J. Hyndman and Anne B. Koehler. Another look at measures of forecast accuracy. Inter-
national Journal of Forecasting, 22(4):679 – 688, 2006. doi:10.1016/j.ijforecast.2006.03.001.
Petri Ihantola, Tuukka Ahoniemi, Ville Karavirta, and Otto Seppälä. Review of recent systems
for automatic assessment of programming assignments. In Carsten Schulte and Jarkko Suho-
nen, editors, Proceedings of the 10th Koli Calling International Conference on Computing
Education Research (Koli Calling 2010), page 86–93, 2010. doi:10.1145/1930464.1930480.
Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In
Yoshua Bengio, Yann LeCun, Brian Kingsbury, Samy Bengio, Nando de Freitas, and Hugo
Larochelle, editors, Proceedings of the 3rd International Conference on Learning Represen-
tations (ICLR 2015), 2015. URL https://arxiv.org/abs/1412.6980.
Diederik Kingma and Max Welling. Auto-encoding variational Bayes. In Yoshua Bengio,
Yann LeCun, Aaron Courville, Rob Fergus, and Chris Manning, editors, Proceedings of
the 1st International Conference on Learning Representations (ICLR 2013), 2013. URL
https://arxiv.org/abs/1312.6114.
Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional
networks. In Hugo Larochelle, Samy Bengio, Brian Kingsbury, Yoshua Bengio, and Yann
LeCun, editors, Proceedings of the 4th International Conference on Learning Representa-
tions (ICLR 2016), 2016. URL https://arxiv.org/abs/1609.02907.
Matt J. Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational
autoencoder. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th In-
ternational Conference on Machine Learning (ICML 2017), pages 1945–1954, 2017. URL
http://proceedings.mlr.press/v70/kusner17a.html.
Essi Lahtinen, Kirsti Ala-Mutka, and Hannu-Matti Järvinen. A study of the difficulties of
novice programmers. SIGCSE Bulletin, 37(3):14–18, 2005. doi:10.1145/1151954.1067453.
Andrew S. Lan, Christoph Studer, and Richard G. Baraniuk. Time-varying learning and con-
tent analytics via sparse factor analysis. In Sofus Macskassy, Claudia Perlich, Jure Leskovec,
Wei Wang, and Rayid Ghani, editors, Proceedings of the 20th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (KDD ’14), page 452–461, 2014.
doi:10.1145/2623330.2623631.
Jessica McBroom, Kalina Yacef, Irena Koprinska, and James R. Curran. A data-
driven method for helping teachers improve feedback in computer programming au-
tomated tutors. In Carolyn Penstein Rosé, Roberto Martínez-Maldonado, H. Ulrich
Hoppe, Rose Luckin, Manolis Mavrikis, Kaska Porayska-Pomsta, Bruce McLaren, and
Benedict du Boulay, editors, Artificial Intelligence in Education, pages 324–337, 2018.
doi:10.1007/978-3-319-93843-1_24.
Jessica McBroom, Irena Koprinska, and Kalina Yacef. A survey of automated pro-
gramming hint generation–the hints framework. arXiv, 1908.11566, 2019. URL
https://arxiv.org/abs/1908.11566.
Jessica McBroom, Benjamin Paaßen, Bryn Jeffries, Irena Koprinska, and Kalina Yacef.
Progress networks as a tool for analysing student programming difficulties. In Claudia
Szabo and Judy Sheard, editors, Proceedings of the Twenty-Third Australasian Conference
on Computing Education (ACE 2021), 2021. accepted.
Michael McCracken, Vicki Almstrum, Danny Diaz, Mark Guzdial, Dianne Hagan, Yifat Ben-
David Kolikant, Cary Laxer, Lynda Thomas, Ian Utting, and Tadeusz Wilusz. A multi-
national, multi-institutional study of assessment of programming skills of first-year cs stu-
dents. In Working Group Reports from ITiCSE on Innovation and Technology in Computer
Science Education, pages 125–180, 2001. doi:10.1145/572133.572137.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Dis-
tributed representations of words and phrases and their compositionality. In
C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger,
editors, Proceedings of the 26th International Conference on Advances in Neu-
ral Information Processing Systems (NIPS 2013), pages 3111–3119, 2013. URL
http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-th
Bassam Mokbel, Sebastian Gross, Benjamin Paaßen, Niels Pinkwart, and Barbara Ham-
mer. Domain-independent proximity measures in intelligent tutoring systems. In
S. K. D’Mello, R. A. Calvo, and A. Olney, editors, Proceedings of the 6th Interna-
tional Conference on Educational Data Mining (EDM 2013), pages 334–335, 2013. URL
http://www.educationaldatamining.org/EDM2013/papers/rn_paper_68.pdf.
Andy Nguyen, Christopher Piech, Jonathan Huang, and Leonidas Guibas. Codewebs: Scal-
able homework search for massive open online programming courses. In Chin-Wan Chung,
Andrei Broder, Kyuseok Shim, and Torsten Suel, editors, Proceedings of the 23rd Inter-
national Conference on World Wide Web (WWW 2014), page 491–502. Association for
Computing Machinery, 2014. doi:10.1145/2566486.2568023.
Benjamin Paaßen, Joris Jensen, and Barbara Hammer. Execution traces as
a powerful data representation for intelligent tutoring systems for program-
ming. In Tiffany Barnes, Min Chi, and Mingyu Feng, editors, Proceedings
of the 9th International Conference on Educational Data Mining (EDM 2016),
pages 183 – 190. International Educational Datamining Society, 2016. URL
http://www.educationaldatamining.org/EDM2016/proceedings/paper_17.pdf.
Benjamin Paaßen, Christina Göpfert, and Barbara Hammer. Time series prediction for
graphs in kernel and dissimilarity spaces. Neural Processing Letters, 48(2):669–689, 2018a.
doi:10.1007/s11063-017-9684-5. URL https://arxiv.org/abs/1704.06498.
Benjamin Paaßen, Barbara Hammer, Thomas Price, Tiffany Barnes, Sebastian Gross, and
Niels Pinkwart. The continuous hint factory - providing hints in vast and sparsely popu-
lated edit distance spaces. Journal of Educational Datamining, 10(1):1–35, 2018b. URL
https://jedm.educationaldatamining.org/index.php/JEDM/article/view/158.
Benjamin Paaßen, Irena Koprinska, and Kalina Yacef. Recursive tree grammar autoencoders.
arXiv, 2012.02097, 2021. URL https://arxiv.org/abs/2012.02097.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory
Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmai-
son, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani,
Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala.
Pytorch: An imperative style, high-performance deep learning library. In Hanna Wallach,
Hugo Larochelle, Alina Beygelzimer, Florence d’Alché Buc, Emily Fox, and Roman
Garnett, editors, Proceedings of the 32nd International Conference on Advances in
Neural Information Processing Systems (NeurIPS 2019), pages 8026–8037, 2019. URL
http://papers.nips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learn
Karl Pearson. On lines and planes of closest fit to systems of points in space. The London,
Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572, 1901.
doi:10.1080/14786440109462720.
Barry Peddycord III., Andrew Hicks, and Tiffany Barnes. Generating hints
for programming problems using intermediate output. In Manolis Mavrikis
and Bruce M. McLaren, editors, Proceedings of the 7th International Con-
ference on Educational Data Mining (EDM 2014), pages 92–98, 2014. URL
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.659.1815&rep=rep1&type=pdf.
Elzbieta Pekalska and Robert Duin. The Dissimilarity Representation for Pattern Recogni-
tion: Foundations And Applications (Machine Perception and Artificial Intelligence). World
Scientific Publishing Co., Inc., River Edge, NJ, USA, 2005. ISBN 9812565302.
Chris Piech, Jonathan Huang, Andy Nguyen, Mike Phulsuksombati, Mehran Sahami,
and Leonidas Guibas. Learning program embeddings to propagate feedback on stu-
dent code. In Francis Bach and David Blei, editors, Proceedings of the 32nd Interna-
tional Conference on Machine Learning (ICML 2015), pages 1093–1102, 2015a. URL
http://proceedings.mlr.press/v37/piech15.html.
Chris Piech, Mehran Sahami, Jonathan Huang, and Leonidas Guibas. Autonomously gener-
ating hints by inferring problem solving policies. In Gregor Kiczales, Daniel Russell, and
Beverly Woolf, editors, Proceedings of the Second ACM Conference on Learning @ Scale
(L@S 2015), page 195–204, 2015b. doi:10.1145/2724660.2724668.
Thomas W Price, Yihuan Dong, Rui Zhi, Benjamin Paaßen, Nicholas Lytle, Veronica Cateté,
and Tiffany Barnes. A comparison of the quality of data-driven programming hint genera-
tion algorithms. International Journal of Artificial Intelligence in Education, 29(3):368–395,
2019. doi:10.1007/s40593-019-00177-z.
Yizhou Qian and James Lehman. Students’ misconceptions and other difficulties in introduc-
tory programming: A literature review. ACM Transactions on Computing Education, 18
(1):1–25, 2017. doi:10.1145/3077618.
Siddharth Reddy, Igor Labutov, and Thorsten Joachims. Latent skill embedding
for personalized lesson sequence recommendation. arXiv, 1602.07029, 2016. URL
http://arxiv.org/abs/1602.07029.
Kelly Rivers and Kenneth R. Koedinger. A canonicalizing model for building programming
tutors. In Stefano A. Cerri, William J. Clancey, Giorgos Papadourakis, and Kitty Panourgia,
editors, Proceedings of the 11th International Conference on Intelligent Tutoring Systems,
(ITS 2012), pages 591–593, 2012. doi:10.1007/978-3-642-30950-2_80.
Kelly Rivers and Kenneth R Koedinger. Data-driven hint generation in vast solution spaces:
a self-improving python programming tutor. International Journal of Artificial Intelligence
in Education, 27(1):37–64, 2017. doi:10.1007/s40593-015-0070-z.
Anthony Robins, Janet Rountree, and Nathan Rountree. Learning and teaching program-
ming: A review and discussion. Computer Science Education, 13(2):137–172, 2003.
doi:10.1076/csed.13.2.137.14200.
Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfar-
dini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):
61–80, 2009. doi:10.1109/TNN.2008.2005605.
Kai Sheng Tai, Richard Socher, and Christopher D Manning. Improved semantic repre-
sentations from tree-structured long short-term memory networks. In Yuji Matsumoto,
Chengqing Zong, and Michael Strube, editors, Proceedings of the 53rd Annual Meeting of
the Association for Computational Linguistics (ACL 2015), pages 1556–1566, 2015. URL
https://www.aclweb.org/anthology/P15-1150.pdf.
Nghi Truong, Paul Roe, and Peter Bancroft. Static analysis of students’ java pro-
grams. In Raymond Lister and Alison Young, editors, Proceedings of the Sixth Aus-
tralasian Conference on Computing Education (ACE 2004), page 317–325, 2004. URL
https://dl.acm.org/doi/abs/10.5555/979968.980011.
Rose Wiles, Gabriele Durrant, Sofie De Broe, and Jackie Powell. Methodological approaches
at phd and skills sought for research posts in academia: a mismatch? International Journal
of Social Research Methodology, 12(3):257–269, 2009. doi:10.1080/13645570701708550.
Kaizhong Zhang and Dennis Shasha. Simple fast algorithms for the editing distance be-
tween trees and related problems. SIAM Journal on Computing, 18(6):1245–1262, 1989.
doi:10.1137/0218082.
Rui Zhi, Thomas W Price, Nicholas Lytle, Yihuan Dong, and Tiffany Barnes.
Reducing the state space of programming problems through data-driven feature
detection. In David Azcona, Sharon Hsiao, Nguyen-Thinh Le, John Stam-
per, and Michael Yudelson, editors, Proceedings of the second Educational Data
Mining in Computer Science Education Workshop (CSEDM 2018), 2018. URL
https://people.engr.ncsu.edu/twprice/website/files/CSEDM%202018.pdf.
Appendix: Proofs

Proof. We note that the theorem follows from general results in stability analysis, especially Lyapunov exponents [Politi, 2013]. Still, we provide a full proof here to give interested readers insight into how such an analysis can be done.

In particular, let $\vec{x}_1$ be any $n$-dimensional real vector. We now wish to show that plugging this vector into $f$ repeatedly yields a sequence $\vec{x}_1, \vec{x}_2, \ldots$ with $\vec{x}_{t+1} = f(\vec{x}_t)$ which asymptotically converges to $\vec{x}^*$ in the sense that

$$\lim_{t \to \infty} \vec{x}_t = \vec{x}^*.$$
To show this, we first define the alternative vector $\hat{x}_t := \vec{x}_t - \vec{x}^*$. For this vector we obtain:

$$\hat{x}_{t+1} = \vec{x}_{t+1} - \vec{x}^* = f(\vec{x}_t) - \vec{x}^* = \vec{x}_t + W \cdot (\vec{x}^* - \vec{x}_t) - \vec{x}^* = (I - W) \cdot \hat{x}_t = (I - W)^t \cdot \hat{x}_1,$$

where the last step follows by applying the recurrence repeatedly. Further, note that our desired convergence of $\vec{x}_t$ to $\vec{x}^*$ is equivalent to stating that $\hat{x}_t$ converges to zero. To show that $\hat{x}_t$ converges to zero, it is sufficient to prove that the matrix $(I - W)^t$ converges to zero.
To show this, we need to consider the eigenvalues of our matrix $W$. In particular, let $V \cdot \Lambda \cdot V^{-1} = W$ be the eigenvalue decomposition of $W$, where $V$ is the matrix of eigenvectors and $\Lambda$ is the diagonal matrix of eigenvalues $\lambda_1, \ldots, \lambda_n$. Then it holds:

$$(I - W)^t = \big(V \cdot V^{-1} - V \cdot \Lambda \cdot V^{-1}\big)^t = \big(V \cdot (I - \Lambda) \cdot V^{-1}\big)^t = V \cdot (I - \Lambda) \cdot V^{-1} \cdot V \cdots V^{-1} \cdot V \cdot (I - \Lambda) \cdot V^{-1} = V \cdot (I - \Lambda)^t \cdot V^{-1}.$$
Now, let us consider the matrix power $(I - \Lambda)^t$ in more detail. Because $I - \Lambda$ is a diagonal matrix, the power is just applied elementwise. In particular, for the $j$th diagonal element we obtain the power $(1 - \lambda_j)^t$. In general, the $j$th eigenvalue can be complex-valued. However, the absolute value still behaves like $|(1 - \lambda_j)^t| = |1 - \lambda_j|^t$. Now, since we required that $|1 - \lambda_j| < 1$, we obtain

$$\lim_{t \to \infty} |1 - \lambda_j|^t = 0 \qquad \text{for all } j,$$

which implies that $(I - \Lambda)^t$, and hence $(I - W)^t = V \cdot (I - \Lambda)^t \cdot V^{-1}$, converges to the zero matrix. This completes the proof.
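To make this convergence result concrete, here is a minimal numpy sketch (our own illustration, not code from the paper): we construct a matrix $W$ whose eigenvalues all satisfy the theorem's condition $|1 - \lambda_j| < 1$, iterate $f(\vec{x}) = \vec{x} + W \cdot (\vec{x}^* - \vec{x})$ from a random starting vector, and observe that the iterates approach $\vec{x}^*$, just as the proof predicts.

```python
import numpy as np

# Minimal sketch (illustration only): draw W with real eigenvalues lambda_j such
# that |1 - lambda_j| < 1, i.e. the spectrum of I - W lies inside the unit circle,
# and check numerically that iterating f(x) = x + W (x* - x) converges to x*.
rng = np.random.default_rng(0)
n = 8

V = rng.standard_normal((n, n))          # random (almost surely invertible) eigenbasis
lam = rng.uniform(0.1, 1.9, size=n)      # eigenvalues with |1 - lambda_j| <= 0.9 < 1
W = V @ np.diag(lam) @ np.linalg.inv(V)  # W = V Lambda V^{-1}

x_star = rng.standard_normal(n)          # the fixed point x* of f
x = rng.standard_normal(n)               # arbitrary starting vector x_1

for _ in range(200):
    x = x + W @ (x_star - x)             # x_{t+1} = f(x_t)

print(np.linalg.norm(x - x_star))        # close to zero: the iterates converged to x*
```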
Proof. We solve the problem by setting the gradient to zero and considering the Hessian.
First, we compute the gradient of the loss with respect to W .
$$\begin{aligned}
&\nabla_W \left[ \sum_{t=1}^{T-1} \left\| \vec{x}_t + W \cdot (\vec{x}^* - \vec{x}_t) - \vec{x}_{t+1} \right\|^2 + \lambda \cdot \|W\|_F^2 \right] \\
&\qquad = 2 \cdot \sum_{t=1}^{T-1} \big( \vec{x}_t + W \cdot (\vec{x}^* - \vec{x}_t) - \vec{x}_{t+1} \big) \cdot (\vec{x}^* - \vec{x}_t)^T + 2 \cdot \lambda \cdot W.
\end{aligned}$$
Setting this gradient to zero yields a linear system in $W$. For $\lambda > 0$, the matrix $X_t^T \cdot X_t + \lambda \cdot I$ is guaranteed to be invertible, which yields our desired solution.
It remains to show that the Hessian for this problem is positive definite. Re-inspecting the gradient, we observe that the matrix $W$ occurs only as a product with the term $\sum_{t=1}^{T-1} (\vec{x}^* - \vec{x}_t) \cdot (\vec{x}^* - \vec{x}_t)^T$, which is positive semi-definite, and the term $\lambda \cdot I$, which is strictly positive definite. Hence, our problem is convex and our solution is the unique global optimum.
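To illustrate the resulting estimator, the following numpy sketch is our own re-derivation from the gradient above: setting it to zero gives $W = \big(\sum_{t=1}^{T-1} (\vec{x}_{t+1} - \vec{x}_t)(\vec{x}^* - \vec{x}_t)^T\big) \big(\sum_{t=1}^{T-1} (\vec{x}^* - \vec{x}_t)(\vec{x}^* - \vec{x}_t)^T + \lambda \cdot I\big)^{-1}$. The function name and the toy data are hypothetical; treat this as a sketch of the regularized least-squares fit, not as the paper's implementation.

```python
import numpy as np

def fit_linear_dynamics(traces, x_star, lam=1e-3):
    """Fit W in x_{t+1} ~ x_t + W (x_star - x_t) via regularized least squares.

    traces: list of (T_i, n) arrays of vector-encoded programs (hypothetical input);
    the closed form below follows from setting the gradient of the loss to zero.
    """
    Delta = np.concatenate([x_star[None, :] - X[:-1] for X in traces])  # rows: x* - x_t
    R = np.concatenate([X[1:] - X[:-1] for X in traces])                # rows: x_{t+1} - x_t
    A = Delta.T @ Delta + lam * np.eye(Delta.shape[1])                  # invertible for lam > 0
    return np.linalg.solve(A, Delta.T @ R).T                            # W = (R^T Delta) A^{-1}

# Toy check with a hypothetical ground-truth matrix: recover W_true from noiseless traces.
rng = np.random.default_rng(1)
n = 8
W_true = 0.5 * np.eye(n)
x_star = rng.standard_normal(n)
traces = []
for _ in range(20):
    x = rng.standard_normal(n)
    trace = [x]
    for _ in range(5):
        x = x + W_true @ (x_star - x)
        trace.append(x)
    traces.append(np.array(trace))

W_hat = fit_linear_dynamics(traces, x_star)
print(np.linalg.norm(W_hat - W_true))  # small, up to the bias introduced by lam
```

Because the regularization term $\lambda \cdot \|W\|_F^2$ makes the problem strictly convex, the fit is unique and requires only a single $n \times n$ matrix inversion, which is in line with the time- and space-efficiency noted in the conclusion.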