
ast2vec: Utilizing Recursive Neural Encodings of Python Programs
Benjamin Paaßen¹, Jessica McBroom¹, Bryn Jeffries², Irena Koprinska¹, and Kalina Yacef¹

¹ School of Computer Science, The University of Sydney
² Grok Learning

arXiv:2103.11614v1 [cs.LG] 22 Mar 2021

This work is a preprint, provided by the authors, and has been submitted to the
Journal of Educational Datamining

Abstract
Educational datamining involves the application of datamining techniques to student activity. However, in the context of computer programming, many datamining techniques cannot be applied because they expect vector-shaped input, whereas computer programs have the form of syntax trees. In this paper, we present ast2vec, a neural network that maps Python syntax trees to vectors and back, thereby facilitating datamining on computer programs as well as the interpretation of datamining results. Ast2vec has been trained on almost half a million programs of novice programmers and is designed to be applied across learning tasks without re-training, meaning that users can apply it without any need for (additional) deep learning. We demonstrate the generality of ast2vec in three settings: First, we provide example analyses using ast2vec on a classroom-sized dataset, involving visualization, student motion analysis, clustering, and outlier detection, including two novel analyses, namely a progress-variance projection and a dynamical systems analysis. Second, we consider the ability of ast2vec to recover the original syntax tree from its vector representation on the training data and two further large-scale programming datasets. Finally, we evaluate the predictive capability of a simple linear regression on top of ast2vec, obtaining similar results to techniques that work directly on syntax trees. We hope ast2vec can augment the educational datamining toolbelt by making analyses of computer programs easier, richer, and more efficient.
Keywords: computer science education, computer programs, word embeddings, representation learning, neural networks, visualization, program vectors

1 Introduction
Techniques for analyzing and utilizing student programs have been the focus of much recent
research in computer science education. Such techniques have included hint systems to provide
automated feedback to students [Paaßen et al., 2018b, Piech et al., 2015b, Price et al., 2019,
Rivers and Koedinger, 2017], as well as visualization and search approaches to help teachers
understand student behavior [McBroom et al., 2018, Nguyen et al., 2014]. Considering that
programming is a key skill in many fields, including science, technology, engineering and
mathematics [Denning, 2017, McCracken et al., 2001, Wiles et al., 2009], and considering the
difficulty students have with learning programming [Denning, 2017, Lahtinen et al., 2005,
Qian and Lehman, 2017, Robins et al., 2003], developing and expanding the range of available
techniques to improve educational outcomes is of great importance.


Unfortunately, computer programs are particularly difficult to analyze for two main reasons. First, programs come in the form of raw code or syntax trees (after compilation), which few datamining techniques are equipped to handle. Instead, one first has to represent programs differently to turn them into valid input for datamining techniques [Paaßen et al., 2018b]. Second, the space of possible programs grows combinatorially with program length, and most possible programs are never written by any student, even fewer more than once [Paaßen et al., 2018b, Rivers and Koedinger, 2012, 2017]. Accordingly, one needs to abstract from meaningless differences between programs to shrink the space and, thus, make it easier to handle with less risk of overfitting.
Several prior works have addressed both the representation and the abstraction step,
often in conjunction. For example, Rivers and Koedinger [2012] have suggested semantically
motivated transformations of syntax trees to remove syntactic variations that have no semantic
influence (such as unreachable code or the direction of binary operators). Peddycord III et al. [2014] suggest representing programs by their output rather than their syntax. Similarly, Paaßen et al. [2016] suggest representing programs by their execution behavior. Gulwani et al. [2018] as well as Gross et al. [2014] perform a clustering of programs to achieve a representation in terms of a few discrete clusters. We point to the ’related work’ section and to the review of
McBroom et al. [2019] for a more comprehensive list. We also note that many of the possible
abstraction and representation steps are not opposed but can be combined to achieve better
results [McBroom et al., 2019].
Arguably the most popular representation of computer programs is the pairwise tree edit distance [Zhang and Shasha, 1989]. For example, Mokbel et al. [2013], Paaßen et al. [2018b], Price et al. [2017], and Rivers and Koedinger [2017] have applied variations of the standard tree edit distance for processing programs. Edit distances have the advantage that they not only quantify the distance between programs but also specify which parts of the code have to be changed to transform one program into another, which can be utilized to construct hints [Paaßen et al., 2018b, Price et al., 2017, Rivers and Koedinger, 2017]. Additionally, many datamining techniques for visualization, clustering, regression, and classification can deal with input in terms of pairwise distances [Pekalska and Duin, 2005, Hammer and Hasenfuss, 2010]. Still, the vast majority of techniques cannot. For example, of 126 methods contained in the Python library scikit-learn (see https://scikit-learn.org/stable/modules/classes.html), only 24 can natively deal with pairwise distances, eight further methods can deal with kernels, which require additional transformations [Gisbrecht and Schleif, 2015], and 94 only work with vector-shaped input. As such, having a vector-formed representation of computer programs enables a much wider range of analysis techniques. Further, a distance-based representation depends on a database of reference programs to compare against. The computational complexity required for analyzing a program thus scales at least linearly with the size of the database. By contrast, a parametric model with vector-shaped input can perform operations independently of the size of the training data. Finally, a representation in terms of pairwise distances tries to solve the representation problem for computer programs anew for each new learning task. Conceptually, it appears more efficient to share representational knowledge across tasks. In this paper, we aim to achieve just that: to find a mapping from computer programs to vectors in Euclidean space (and back) that generalizes across learning tasks and thus solves the representation problem ahead of time, so that we, as educational dataminers, only need to add a simple model specific to the analysis task we wish to solve. In other words, we wish to achieve for computer programs what word embeddings like word2vec [Mikolov et al., 2013] or GloVe [Pennington et al., 2014] offer for natural language processing: a re-usable component that solves the representation problem of programs such that subsequent research can concentrate on other datamining challenges. Our technique is by no means meant to supplant all existing representations; it is merely another tool on the toolbelt for educational datamining in computer science education.
We acknowledge that we are not the first to attempt such a vector-shaped representation. Early work has utilized hand-engineered features such as program length or cyclomatic complexity [Truong et al., 2004]. More recent approaches check whether certain tree patterns occur within a given syntax tree, which gives rise to a vector with one dimension per tree pattern that is zero if the syntax tree does not contain the pattern and nonzero if it does [Nguyen et al., 2014, Zhi et al., 2018, Zimmerman and Rupakheti, 2015]. One can also derive a vectorial representation from pairwise edit distances using linear algebra (although the dependence on a database remains) [Paaßen et al., 2018b]. Finally, prior works have trained neural networks to map programs to vectors [Piech et al., 2015a, Alon et al., 2019]. However, all these representations lose a desirable property of other representations named above, namely the ability to return from a program's representation to the original program. For example, while it is simple to compute the length of a program, it is impossible to identify the original program just from its length. However, such an inverse translation is crucial for the interpretability of a result. For example, if we want to provide a hint to a student, it would not help the student much to tell them that they should increase the length of their program. Instead, one would rather like to know which parts of their code they have to change.
So how does one achieve a mapping from computer programs to vectors that can be inverted and still generalizes across tasks? The solution appears to lie in recent works on autoencoders for tree-structured data [Chen et al., 2018, Dai et al., 2018, Kusner et al., 2017, Paaßen et al., 2021], which train an encoder φ from trees to vectors and a decoder ψ from vectors back to trees such that ψ(φ(x)) is similar to x for as many trees x as possible from the training data. While training such a model requires large amounts of data and a lot of computation time (as is usual in deep learning), our hope is that a trained model can then be used as a representation and abstraction device without any further need for deep learning. Instead, one can just apply vector-based datamining techniques, such as any of the methods in scikit-learn, to the vector representations of computer programs and even transform any vector back into a program if desired.
In this paper, we present ast2vec, an autoencoder for trees trained on almost half a million
Python programs written by novice programmers that encodes beginner Python programs
as 256-dimensional vectors in Euclidean space and decodes vectors as syntax trees. Both
encoding and decoding are guided by the grammar of the Python programming language,
thereby ensuring that decoded trees are syntactically correct. The model and its source code are completely open-source and available at https://gitlab.com/bpaassen/ast2vec/.
The overarching question guiding our research is: Does ast2vec generalize across learning
tasks and across different datamining techniques? To this end, we test ast2vec in three
settings: First, we utilize ast2vec for a variety of analyses on a classroom-sized dataset, thereby
showing the generality qualitatively. Second, we quantitatively investigate the autoencoding
error of ast2vec on its training data and on two similarly large datasets of novice programs
from subsequent years. In all cases, we observe that small trees usually get autoencoded
correctly whereas errors may occur for larger trees. Finally, we investigate the performance
of a simple linear regression model on top of ast2vec for predicting the next step of students
in a learning task and observe that we achieve comparable results to established techniques
while being both more space- and more time-efficient.
This paper is set out as follows: Section 2 begins by utilizing ast2vec for a variety of datamining analyses on a classroom-sized dataset to demonstrate its broad applicability qualitatively. Section 3 then specifies the details of ast2vec, including how it was trained and constructed, as well as the details of two novel analysis techniques used in Section 2, namely progress-variance projections and dynamical analyses. Next, Section 4 provides an evaluation of ast2vec on large-scale datasets, including an investigation of its autoencoding performance and predictive capabilities. Finally, Section 5 describes related work in more detail and Section 6 concludes with a summary of the main ideas of the paper and a discussion of future directions.

2 Example analyses using ast2vec


Before introducing the details of ast2vec, we perform example analyses to investigate how well
the vector representation of ast2vec generalizes across methods. In particular, we perform
four example analyses using ast2vec on a classroom-sized dataset: 1) visualizing student work
and progress on a task, 2) modeling student behavior, 3) clustering student programs, and 4)
outlier detection.
Note that we have used a synthetic dataset for this example analysis because we wanted
to make all of the code and data used in this section publicly available (it can be found
here: https://gitlab.com/bpaassen/ast2vec/). However, we have not limited our analyses
to synthetic data and, indeed, in Section 4 we will discuss experiments on large-scale, real
datasets.

2.1 Dataset Construction


To construct the dataset used in these example analyses, we manually simulated ten students
attempting to solve the following introductory programming task:

Write a program that asks the user, “What are your favourite animals?”. If their
response is “Quokkas”, the program should print “Quokkas are the best!”. If their
response is anything else, the program should print “I agree, x are great animals.”,
where x is their response.

We simulated the development process by starting with an empty program, writing a partial program, receiving feedback from automatic test cases for this partial program, and revising the program in response to the test feedback until a correct solution was reached. The reference solution for the task is the following:

    x = input("What are your favourite animals?")
    if x == "Quokkas":
        print("Quokkas are the best!")
    else:
        print(f"I agree, {x} are great animals.")

Our four simulated unit tests checked 1) whether the first line of output was “What are
your favourite animals?”, and 2-4) whether the remaining output was correct for the inputs
“Quokkas”, “Koalas” and “Echidnas”, respectively.
Overall, the dataset contains 58 (partial) programs, and these form N = 25 unique syntax trees after compilation. Since ast2vec converts program trees to n = 256 dimensional vectors, this means the data matrix X ∈ R^{N×n} after encoding all programs is of size N × n = 25 × 256.
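For readers who wish to reproduce this step, the following is a minimal sketch of how such a data matrix can be assembled. It assumes an encoding function (passed in here as encode_tree) that maps a compiled Python syntax tree to its 256-dimensional ast2vec encoding; the exact name and interface of that function in the released ast2vec package may differ.

    import ast
    import numpy as np

    def build_data_matrix(programs, encode_tree):
        """Compile programs to ASTs, deduplicate identical trees, and encode each unique tree."""
        unique = {}
        for src in programs:
            try:
                tree = ast.parse(src)                   # compile the program to an abstract syntax tree
            except SyntaxError:
                continue                                # skip programs that do not compile
            unique.setdefault(ast.dump(tree), tree)     # keep one representative per unique tree
        return np.stack([encode_tree(t) for t in unique.values()])   # N x 256 data matrix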

2.2 Application 1: Visualizing Student Work and Progress


As a first example application of ast2vec, we consider a visualization of student progress
from the empty program to the correct solution. The purpose of such a visualization is to give educators a concise overview of typical paths towards the goal, where students tend to struggle, as well as how many typical strategies exist [McBroom et al., 2021].

Figure 1: A progress-variance plot of the example task. The special points (0, 0) and (1, 0) correspond to the empty program and the reference solution, respectively. The size of points corresponds to their frequency in the data. Arrows indicate student motion. Different students are plotted in different colors.
In order to create such a visualization, we combine ast2vec with a second technique that
we call progress-variance projection. Our key idea is to construct a linear projection from
256 dimensions to two dimensions, where the first axis captures the progress from the empty
program towards the goal and the second axis captures as much of the remaining variance
between programs as possible. A detailed description of this approach is given in Section 3.2.
Figure 1 shows the result of applying this process to the sample data. Each circle in the plot
represents a unique syntax tree in the data, with larger circles representing more frequent trees
and the three most common trees shown on the right. Note that (0, 0) represents the empty
program and (1, 0) represents the solution. In addition, the arrows indicate student movement
between programs, with different colors representing different students. More specifically, an
arrow is drawn from vector ~x to ~y if the student made a program submission corresponding
to vector ~x followed by a submission corresponding to vector ~y .
Without the axis labels, this plot is similar to the interaction networks suggested by
Barnes et al. [2016], which have already been shown to provide useful insights into student
learning [McBroom et al., 2018]. However, in contrast to interaction networks, our progress-
variance projection additionally maps vectors to meaningful locations in space. In particular, the x-axis corresponds to the (linear) progress toward the solution, whereas the y-axis corresponds
to variation between student programs that is orthogonal to the progress direction. This can
provide an intuitive overview of the types of programs students submit and how they progress
through the exercise.
Additionally, the plot reveals to us that most syntax trees only occur a single time and that
student programs differ a lot on a fine-grained level, especially close to the solution around
(0.8, 0.4). However, while exact repetitions of syntax trees are rare, the coarse-grained motion
of students seems to be consistent, following an arc from the origin to the correct solution.
We will analyze this motion in more detail in the next section.
Finally, we notice that multiple students tend to get stuck at a program which only makes a single function call (corresponding to the point (0.5, 0.55) in the plot). This stems from programs that have the correct syntax tree but not the right string (e.g. due to typos). Accordingly, if students fix the string they still have the same syntax tree and, hence, the same location in the visualization.

Figure 2: A linear dynamical system f, which has a single attractor at the correct solution and approximates the motion of students through the space of programs (orange). Arrows are scaled with factor 0.3 for better visibility. An example trace starting at the empty program and following the dynamical system is shown in blue. Whenever the decoded program changes along the trace, we show the code on the right. The coordinate system is the same as in Figure 1.
In summary, this section demonstrates how ast2vec can be used to visualize student
progress through a programming task, how we can interpret such a visualization, and how
the vectorial output of ast2vec enabled us to develop a novel visualization technique in the
first place.

2.3 Application 2: Dynamical analysis


In Figure 1, we have observed that the high-level motion of students appears to be consistent.
In particular, all students seem to follow an arc-shaped trajectory beginning at (0, 0), then
rising in the y-axis, before approaching the reference solution at (1, 0). This raises the question: can we capture this general motion of students in a simple dynamical system? The purpose of such a prediction could be to provide next-step hints. If a student does not know how to continue, we can provide a prediction of how students would generally proceed, namely following the arc in the plot. Then, we can use ast2vec to decode the predicted position back
into a syntax tree and use an edit distance to construct the hint, similar to prior techniques
[Paaßen et al., 2018b, Price et al., 2017, Rivers and Koedinger, 2017].
In Section 3.3, we discuss one technique to perform such a prediction. In particular, we
learn a linear dynamical system that has the correct solution to the task as a unique stable
attractor.
Figure 2 illustrates the dynamical system we obtain from our example data. In particular,
the orange arrows indicate how a point at the origin of the arrow would be moved by the
dynamical system. As we can see, the dynamical system does indeed capture the qualitative behavior we observe in Figure 1, namely the arc-shaped motion from the empty program
towards the reference solution. We can verify this finding by simulating a new trace that
follows our dynamical system. In particular, we start at the empty program ~x0 and then
set ~xt+1 = f(~xt) until the location does not change much anymore. The resulting motion
is plotted in blue. We further decode all steps ~xt using ast2vec and inspect the resulting
syntax tree. These trees are shown in Figure 2 on the right. We observe that the simulated
trace corresponds to reasonable student motion, namely to first ask the user for input, then
add a single print statement, and finally to extend the solution with an if-statement that can
distinguish between the input ’Quokkas’ and all other inputs.
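For completeness, simulating such a trace only requires iterating the map until the position stops changing. The following is a minimal sketch, assuming x0 and x_star are the ast2vec encodings of the empty program and the reference solution and W has been fitted as described in Section 3.3; decode_vector stands in for ast2vec's decoder and is a hypothetical name.

    import numpy as np

    def simulate_trace(x0, x_star, W, tol=1e-3, max_steps=100):
        """Iterate the linear dynamical system f(x) = x + W (x* - x), starting from x0."""
        trace = [x0]
        x = x0
        for _ in range(max_steps):
            x_next = x + W @ (x_star - x)                # one step of the dynamical system
            trace.append(x_next)
            if np.linalg.norm(x_next - x) < tol:         # stop once the location barely changes
                break
            x = x_next
        return trace

    # Each point can then be decoded back into a syntax tree, e.g.
    # trees = [decode_vector(x) for x in trace]   (decoder name hypothetical)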

2.4 Application 3: Clustering


When inspecting Figure 1, we also notice that student programs seem to fall into clusters,
namely one cluster around the correct solution, one cluster to the top left of it, and further
minor clusters. Finding clusters in programming data is useful both for providing feedback based on clusters [Gross et al., 2014] and for reducing data complexity by summarizing a big dataset of programs with a few representative programs [Glassman et al., 2015].
Given that ast2vec represents our data as vectors, we can use any clustering technique.
In this case, we use a Gaussian mixture model, which also provides us with a notion of
probability [Dempster et al., 1977]. In particular, a Gaussian mixture model expresses the
data as a mixture of K Gaussians,

$$p(\vec{x}) = \sum_{k=1}^{K} p(\vec{x} \mid k) \cdot p(k),$$

where p(~x|k) is a Gaussian density. The mean, covariance, and prior p(k) of each Gaussian are learned from the data. After training, we assign each data vector ~x_i to a cluster by evaluating the posterior probability p(k|~x_i) and choosing the Gaussian k with the highest probability. Note that, prior to clustering, we perform a standard principal component analysis (PCA) because distances tend to degenerate in high dimensions and thus complicate distance-based clusterings [Aggarwal et al., 2001]. We set the PCA to preserve 95% of the data variance, which yielded 12 dimensions. Then, we cluster the data via a Gaussian mixture model with K = 4 Gaussians.
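To illustrate how little task-specific code such an analysis requires on top of ast2vec, the following is a minimal scikit-learn sketch; the random data matrix only stands in for the actual program encodings, and the preprocessing choices mirror the description above.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.mixture import GaussianMixture

    X = np.random.randn(25, 256)               # stand-in for the N x 256 matrix of program encodings

    pca = PCA(n_components=0.95)               # keep 95% of the data variance
    X_reduced = pca.fit_transform(X)

    gmm = GaussianMixture(n_components=4, random_state=0)
    gmm.fit(X_reduced)

    clusters = gmm.predict(X_reduced)          # cluster assignment per program
    log_p = gmm.score_samples(X_reduced)       # log p(x), reused for outlier detection below
    means = pca.inverse_transform(gmm.means_)  # cluster means, mapped back to the 256-dim space

The cluster means can then be decoded back into syntax trees with ast2vec's decoder.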
Figure 3 shows the resulting clustering with clusters indicated by color and cluster means
highlighted as diamonds. We also decode all means back into syntax trees using ast2vec.
Indeed, we notice that the clustering roughly corresponds to the progress through the task
with the blue cluster mean containing an input and a print statement, the yellow cluster mean
storing the input in a variable, the purple cluster mean adding an if statement, and the orange
cluster mean printing out the user input.

2.5 Application 4: Outlier Detection


An additional benefit of a probability-based clustering is that we can use it directly to detect
“outliers”. Detecting such outliers can be useful because they lie outside the range of usual
data and thus may require special attention by educators.
Figure 3: A visualization of the Gaussian mixture clusters of programs, with color indicating cluster assignment. Cluster centers are shown as diamonds. Outliers are highlighted with a red ring. The coordinate system is the same as in Figure 1.

Figure 3 highlights with a red circle all points that received a particularly low probability p(~x) under our Gaussian mixture model (we defined ‘particularly low’ as a log probability of at most half the average when normalizing by the least likely sample). Unsurprisingly, one outlier is the empty program. However, the other outliers are more instructive. In particular, the outlier in the center of the plot corresponds to an input statement without a print statement, which is an unusual path towards the solution. In this dataset, it is more common to write the print statement first. The two outliers around (0.9, 0.45) correspond to the following program shape:

    print('What are your favourite animals?')
    animals = input()
    if animals == 'Quokkas':
        print('Quokkas are the best!')
    else:
        print(f'I agree, {animals} are great animals.')

Here, the question about the favourite animals is posed as a print statement and the input is requested with a separate command, which is a misunderstanding of how input works.
The final outlier is the program:

    animals = input('What are your favourite animals?')
    if animals == 'Koalas':
        print('I agree, Koalas are great animals.')
    elif animals == 'Echidnas':
        print('I agree, Echidnas are great animals.')
    else:
        print('Quokkas are the best!')

This program does pass all our test cases but does not adhere to the “spirit” of the task,
because it does not generalize to new inputs. Such an outlier may instruct us that we need to
change our test cases to be more general, e.g. by using a hidden test with a case that is not
known to the student.
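To make the outlier criterion concrete, the following continues the scikit-learn sketch from the clustering section. Our reading of the threshold (shifting log-probabilities so that the least likely sample sits at zero and flagging everything at or below half the average) is an assumption about the exact normalization, not the authors' reference implementation.

    import numpy as np

    def flag_outliers(log_p):
        """Flag points with particularly low probability under the mixture model.

        log_p is the output of gmm.score_samples; the normalization by the least
        likely sample follows one plausible reading of the criterion above.
        """
        normalized = log_p - log_p.min()              # least likely sample maps to 0
        return normalized <= 0.5 * normalized.mean()  # flag everything far below average

    # Example: outliers = flag_outliers(log_p)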
This concludes our example data analyses using ast2vec. We note that further types of data analysis are certainly possible, e.g. defining an interaction network [Barnes et al., 2016] on top of a clustering. In the remainder of this paper, we will explain how ast2vec works in more detail and evaluate it on large-scale data.
Figure 4: A high-level illustration of the (denoising) autoencoder framework. Syntax tree x̂ (left) gets encoded as a vector ~x = φ(x̂) (right). We then add noise ~ǫ and decode back to a tree ŷ = ψ(~x + ~ǫ). If x̂ ≠ ŷ, we adjust the parameters of both encoder and decoder to increase the probability that ~x + ~ǫ is decoded into x̂ instead of ŷ.

3 Methods
In this section, we describe the methods employed in this paper in more detail. In Section 3.1,
we provide a summary of the autoencoder approach we used for ast2vec. In Section 3.2, we
describe the progress-variance projection that we used to generate 2D visualizations of student
data. Finally, in Section 3.3, we explain how to learn linear dynamical systems from student
data.

3.1 The Architecture of ast2vec


Our neural network ast2vec is an instance of a recursive tree grammar autoencoder model as
proposed by Paaßen et al. [2021]. In this section, we describe this approach as we used it for
ast2vec. If readers are interested in the general approach (and more technical details), we
recommend the original paper.
On a high level, ast2vec is a so-called autoencoder, i.e. a combination of an encoder φ : X → R^n from trees to vectors and a decoder ψ : R^n → X from vectors to trees, such that ψ(φ(x̂))
is equal to x̂ for as many trees x̂ as possible. Unfortunately, this problem has undesirable
solutions if we only consider a finite training dataset. In particular, let x̂1 , . . . , x̂N ∈ X be
a training dataset of trees. Then, we can set φ(x̂i ) = i and ψ(i) = x̂i , i.e. we implement φ
and ψ as a simple lookup table that is only defined exactly on the training data and does not
generalize to any other tree.
A remedy against this issue is the variational autoencoding (VAE) framework of Kingma and Welling [2013]. In this framework, we add a small amount of Gaussian noise to the encoding vector before decoding it again to a program (refer also to Figure 4 for an illustration). Because this noise is different every time, the encoder-decoder pair must be able to generalize at least slightly outside the training data. Additionally, VAEs ensure that the distribution of vectors remains close to a standard normal distribution, which in turn means that the encodings cannot degenerate to large values. In more detail, VAEs rephrase the entire autoencoding problem in a probabilistic fashion, where we replace the deterministic decoder with a probability distribution p_ψ(x̂|~x) of vector ~x getting decoded to tree x̂. We then wish to maximize the probability that a syntax tree gets autoencoded correctly even after adding a noise vector ~ǫ to the encoding vector; and we wish to keep the noisy code vectors standard normally distributed. The precise optimization problem becomes
$$\min_{\substack{\phi: X \to \mathbb{R}^n \\ \psi: \mathbb{R}^n \to X}} \; \sum_{i=1}^{N} \Big[ -\log p_\psi\big(\hat x_i \mid \phi(\hat x_i) + \vec\epsilon_i\big) + \beta \cdot D_{\mathrm{KL}}\big(\phi(\hat x_i) + \vec\epsilon_i\big) \Big], \tag{1}$$

where ~ǫ_i is a Gaussian noise vector that is generated randomly in every round of training, β is a hyper-parameter to regulate how smooth we want our coding space to be, and D_KL(φ(x̂_i) + ~ǫ_i) is the Kullback-Leibler divergence between the distribution of the noisy neural code φ(x̂_i) + ~ǫ_i and the standard normal distribution, i.e. it punishes the model if the code distribution becomes too different from a standard normal distribution.
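To make Equation 1 concrete, the following is a per-batch sketch of the objective in PyTorch. Here, encoder and decoder_log_prob are stand-ins for the corresponding parts of the ast2vec implementation (they return φ(x̂) and log p_ψ(x̂|~z), respectively), and the Kullback-Leibler term is written for the special case of unit-variance Gaussian noise on the code vectors.

    import torch

    def vae_loss(trees, encoder, decoder_log_prob, beta=1e-3):
        """Mini-batch version of the objective in Equation 1 (sketch)."""
        loss = 0.0
        for tree in trees:
            mu = encoder(tree)                       # phi(x_hat), a 256-dimensional code vector
            z = mu + torch.randn_like(mu)            # add fresh Gaussian noise epsilon
            recon = -decoder_log_prob(tree, z)       # -log p_psi(x_hat | phi(x_hat) + epsilon)
            kl = 0.5 * torch.sum(mu ** 2)            # KL(N(mu, I) || N(0, I)) for unit variance
            loss = loss + recon + beta * kl
        return loss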
In the next paragraphs, we describe the encoder φ, the decoder ψ, the probability distribution p_ψ, and the training scheme for ast2vec.

Encoder: The first step of our encoding is to use the Python compiler to generate an abstract syntax tree for the program. Now, let x(ŷ_1, . . . , ŷ_K) be such an abstract syntax tree, where x is the root syntactic element and ŷ_1, . . . , ŷ_K are its K child subtrees. For an example of such a syntax tree, refer to Figure 5 (left). Our encoder φ is then defined as follows.

$$\phi\big(x(\hat y_1, \ldots, \hat y_K)\big) := f^x\big(\phi(\hat y_1), \ldots, \phi(\hat y_K)\big), \tag{2}$$

where f^x : R^{K×n} → R^n is a function that takes the vectors φ(ŷ_1), . . . , φ(ŷ_K) for all children as input and returns a vector for the entire tree, including the syntactic element x and all its children. Because Equation 2 is recursively defined, we also call our encoding recursive. Figure 5 shows an example encoding for the program print(’Hello, world!’) with n = 3 dimensions. We first use the Python compiler to translate this program into the abstract syntax tree Module(Expr(Call(Name, Str))) and then apply our recursive encoding scheme. In particular, recursively applying Equation 2 to this tree yields the expression f^Module(f^Expr(f^Call(f^Name(), f^Str()))). We can evaluate this expression by first computing the leaf terms f^Name() and f^Str(), which do not require any inputs because they have no children. This yields two vectors, one representing Name and one representing Str, respectively. Next, we feed these two vectors into the encoding function f^Call, which transforms them into a vector code of the subtree Call(Name, Str). We feed this vector code into f^Expr, whose output we feed into f^Module, which in turn yields the overall vector encoding for the tree. Note that our computation follows the structure of the tree bottom-up, where each encoding function receives exactly as many inputs as it has children.

Figure 5: An illustration of the encoding process into a vector with n = 3 dimensions. We first use the Python compiler to transform the program print(’Hello, world!’) into the abstract syntax tree Module(Expr(Call(Name, Str))) (left). Then, from left to right, we apply the recursive encoding scheme, where arrows represent the direction of computation and traffic-light elements represent three-dimensional vectors, where color indicates the value of the vector in that dimension.
A challenge in this scheme is that we have to know the number of children K of each syntactic element x in advance to construct a function f^x. Fortunately, the grammar of the programming language³ tells us how many children are permitted for each syntactic element. For example, an if element in the Python language has three children, one for the condition, one for the code that gets executed if the condition evaluates to True (the ‘then’ branch), and one for the code that gets executed if the condition evaluates to False (the ‘else’ branch). Leaves, like Str or Name, are an important special case. Since these have no children, their encoding is a constant vector f^x ∈ R^n. This also means that we encode all possible strings as the same vector (the same holds for all different variable names or all different numbers). Incorporating content information of variables in the encoding is a topic for future research.

³ The entire grammar for the Python language can be found at https://docs.python.org/3/library/ast.html.
Another important special case is that of lists. For example, a Python program is defined as a
list of statements of arbitrary length. In our scheme, we encode a list by adding up all vectors
inside it. The empty list is represented as a zero vector.
Note that, up to this point, our procedure is entirely general and has nothing to do with neural nets. Our approach becomes a (recursive) neural network because we implement the encoding functions f^x as neural networks. In particular, we use a simple single-layer feedforward network for the encoding function f^x of the syntactic element x:

$$f^x(\vec y_1, \ldots, \vec y_K) = \tanh\big(U^x_1 \cdot \vec y_1 + \ldots + U^x_K \cdot \vec y_K + \vec b^x\big), \tag{3}$$

where U^x_k ∈ R^{n×n} is the matrix that decides what information flows from the kth child to its parent and ~b^x ∈ R^n represents the syntactic element x. These U matrices and the ~b vectors are the parameters of the encoder that need to be learned during training.
Importantly, this architecture is still relatively simple. If one aimed to optimize autoencoding performance, one could imagine implementing f^x instead with more advanced neural networks, such as a Tree-LSTM [Chen et al., 2018, Dai et al., 2018, Tai et al., 2015].
This is a topic for future research.
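As a concrete illustration of Equation 3, the following is a minimal PyTorch sketch of one encoding function f^x; the class and parameter names are ours, not those of the released ast2vec code.

    import torch
    import torch.nn as nn

    class EncodingFunction(nn.Module):
        """One encoding function f^x for a syntactic element with K children (Equation 3)."""

        def __init__(self, num_children, dim=256):
            super().__init__()
            # one matrix U^x_k per child position, plus a bias vector b^x for the element itself
            self.U = nn.ParameterList(
                [nn.Parameter(0.01 * torch.randn(dim, dim)) for _ in range(num_children)]
            )
            self.b = nn.Parameter(torch.zeros(dim))

        def forward(self, child_codes):
            # child_codes is a list of K vectors phi(y_1), ..., phi(y_K)
            pre_activation = self.b
            for U_k, y_k in zip(self.U, child_codes):
                pre_activation = pre_activation + U_k @ y_k
            return torch.tanh(pre_activation)

    # A leaf like Name or Str corresponds to num_children=0, i.e. a constant vector tanh(b).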

Decoding: To decode a vector ~x recursively back into a syntax tree x(ŷ1 , . . . , ŷK ), we have
to make K + 1 decisions. First, we have to decide the syntactic element x. Then, we have to
decide the vector codes ~y_1, . . . , ~y_K for each child. For the first decision we set up a function h_A, which computes a numeric score h_A(x|~x) for each possible syntactic element x from the vector ~x. We then select the syntactic element x with the highest score. The A in the index of h
refers to the fact that we guide our syntactic decision by the Python grammar. In particular,
we only allow syntactic elements to be chosen that are allowed by the current grammar symbol
A. All non-permitted elements receive a score of −∞. For simplicity, we do not discuss the
details of grammar rules here but point the interested reader to Paaßen et al. [2021].
Once we know the syntactic element, the Python grammar tells us the number of children K, i.e. how many child codes we need to generate. Accordingly, we use K decoding functions g^x_k : R^n → R^n, which tell us the vector code for each child based on the parent code ~x. The precise definition of the decoding procedure is as follows.

$$\psi(\vec x) = x\big(\psi(\vec y_1), \ldots, \psi(\vec y_K)\big), \quad \text{where } x = \arg\max_y h_A(y \mid \vec x) \text{ and } \vec y_k = g^x_k(\vec x) \text{ for all } k \in \{1, \ldots, K\}. \tag{4}$$

In other words, we first use h_A to choose the current syntactic element x with maximum score h_A(x|~x), use the decoding functions g^x_1, . . . , g^x_K to compute the neural codes for all children, and proceed recursively to decode child subtrees.

Figure 6: An illustration of the decoding process of a vector with n = 3 dimensions into an abstract syntax tree. We first plug the current vector into a scoring function h, based on which we select the current syntactic element x (shown in orange). After this step, we feed the current vector into decoding functions g^x_1, . . . , g^x_K, one for each child the syntactic element x expects. This yields the next vectors, for which the same process is applied again until all remaining vectors decode into leaf elements.
An example of the decoding process is shown in Figure 6. As input, we receive some vector ~x, which we first feed into the scoring function h_A. As the Python grammar requires that each program begins with a Module, the only score higher than −∞ is h(Module|~x), meaning that the root of our generated tree is Module. Next, we feed the vector ~x into the decoding function g^Module_1, yielding a vector g^Module_1(~x) to be decoded into the child subtree. With this vector, we re-start the process, i.e. we feed the vector g^Module_1(~x) into our scoring function h_A, which this time selects the syntactic element Expr. We then generate the vector representing the child of Expr as ~y = g^Expr_1(g^Module_1(~x)). For this vector, h_A selects Call, which expects two children. We obtain the vectors for these two children as g^Call_1(~y) and g^Call_2(~y). For these vectors, h_A selects Name and Str, respectively. Neither of these has children, such that the process stops, leaving us with the tree Module(Expr(Call(Name, Str))).
Again, we note the special case of lists. To decode a list, we use a similar scheme, where we let h_A make a binary choice whether to continue the list or not, and use the decoding function g^x_k to decide the code for the next list element.
We implement the decoding functions g^x_k with single-layer feedforward neural networks as in Equation 3. The decoding functions for list-shaped children are a special case, which we implement as recurrent neural networks, namely gated recurrent units [Cho et al., 2014]. Again, we note that one could choose to implement all decoding functions g^x_k as recurrent networks, akin to the decoding scheme suggested by Chen et al. [2018].
simple version and leave the extension to future work.
For the scoring function h_A, we use a linear layer with n inputs and as many outputs as there are syntactic elements in the programming language. Importantly, this choice process can also be modelled in a probabilistic fashion, where the syntactic element x is chosen with probability

$$p(x \mid \vec x) = \frac{\exp\big(h_A(x \mid \vec x)\big)}{\sum_y \exp\big(h_A(y \mid \vec x)\big)}, \tag{5}$$

i.e. x is sampled according to a softmax distribution with weights h_A(x|~x). The probability p_ψ(x̂|~x) from the optimization problem 1 is then defined as the product over all these probabilities during decoding. In other words, p_ψ(x̂|~x) is the probability that the tree x̂ is sampled via decoder ψ when receiving ~x as input.
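To illustrate how Equations 4 and 5 interact with the grammar, here is a minimal sketch of one grammar-guided choice of a syntactic element; scoring_layer stands in for the linear layer h_A, and allowed_indices lists the elements permitted by the current grammar symbol A.

    import torch

    def choose_element(x_vec, scoring_layer, allowed_indices):
        """Select the next syntactic element from a code vector (sketch of Equations 4 and 5)."""
        scores = scoring_layer(x_vec)                  # one score per syntactic element
        mask = torch.full_like(scores, float('-inf'))  # forbid all elements by default
        mask[allowed_indices] = 0.0                    # elements allowed by grammar symbol A keep their score
        masked_scores = scores + mask
        probs = torch.softmax(masked_scores, dim=0)    # the softmax distribution of Equation 5
        return int(torch.argmax(masked_scores)), probs # greedy choice and its probability distribution

    # Example: scoring_layer = torch.nn.Linear(256, num_syntactic_elements)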


We note in passing that ast2vec only decodes to syntax trees and does not include variable
names, numbers, or strings. However, it is possible to train a simple support vector machine
classifier which maps the vector code for each node of the tree during decoding to the variable
or function it should represent. This is how we obtain the function names and variables shown in our figures, such as Figure 1. We provide the source code for training such a classifier as part
of our code distribution at https://gitlab.com/bpaassen/ast2vec/.

Training: Our training procedure is a variation of stochastic gradient descent. In each training epoch, we randomly sample a mini-batch of N = 32 computer programs from our dataset, compute the abstract syntax trees x̂_i for each of them, then encode them as vectors φ(x̂_i) using the current encoder, add Gaussian noise ~ǫ_i, and then compute the probabilities p_ψ(x̂_i | φ(x̂_i) + ~ǫ_i) as described above. With these probabilities we compute the loss in Equation 1. Note that this loss is differentiable with respect to all our neural network parameters. Accordingly, we can use the automatic differentiation mechanisms of PyTorch [Paszke et al., 2019] to compute the gradients of the loss with respect to all parameters and then perform an Adam update step [Kingma and Ba, 2015]. We set the dimensionality n to 256, the smoothness parameter β to 10^-3, and the learning rate to 10^-3.
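The training loop itself is standard PyTorch; a minimal sketch is shown below, reusing the vae_loss function sketched after Equation 1. Again, encoder and decoder_log_prob are stand-ins for the actual ast2vec modules, and parameters collects all of their trainable parameters.

    import random
    import torch

    def train(trees, encoder, decoder_log_prob, parameters, epochs=130_000, batch_size=32):
        """Stochastic gradient training of the autoencoder (sketch)."""
        optimizer = torch.optim.Adam(parameters, lr=1e-3)
        for epoch in range(epochs):
            batch = random.sample(trees, batch_size)    # mini-batch of abstract syntax trees
            loss = vae_loss(batch, encoder, decoder_log_prob, beta=1e-3)
            optimizer.zero_grad()
            loss.backward()                             # automatic differentiation
            optimizer.step()                            # Adam update step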
We trained ast2vec on 448,992 Python programs recorded as part of the National Computer Science School (NCSS)⁴, an Australian educational outreach programme. The course was delivered by the Grok Learning platform⁵. Each offering was available to mostly Australian school children in Years 5–10, with curriculum-aligned educational slides and sets of exercises released each week for five weeks. For training ast2vec, we used all programs that students tried to run in the 2018 ’beginners’ challenge. After compilation, we were left with 86,991 unique abstract syntax trees. We performed training for 130,000 epochs (each epoch with a mini-batch of 32 trees), which corresponds to roughly ten passes over the training programs (32 · 130,000 = 4,160,000). By inspecting the learning curve of the neural net, we notice that the loss has almost converged (refer to Figure 7). All training was performed on a consumer-grade laptop with a 2017 Intel Core i7 CPU⁶ and took roughly one week of real time.

⁴ https://ncss.edu.au
⁵ https://groklearning.com/challenge/
⁶ Contrary to other neural networks, recursive neural nets cannot be trained on GPUs because the computational graph is unique for each tree.

Figure 7: The learning curve for our recursive tree grammar autoencoder trained on 448,992 Python programs recorded in the 2018 NCSS beginners challenge of Grok Learning. The dark orange curve shows the loss of Equation 1 over the number of training epochs. For better readability, we show the mean over 100 epochs, enveloped by the standard deviation (light orange). Note that the plot is in log scaling, indicating little change toward the end.
We emphasize again that we do not expect teachers to aggregate a similarly huge dataset
to train their own neural net. Instead, we propose to use the pre-trained ast2vec model as a
general-purpose component without retraining (refer to Section 2). In the following sections,
we explain two example techniques that are possible thanks to the vectorial representation
achieved by ast2vec.

3.2 Progress-Variance Projections


In this section, we present a novel technique for visualizing student programs, the progress-
variance projection. This technique, which requires a vector encoding as provided by ast2vec,
involves mapping program vectors to a two dimensional representation where the first axis
captures the progress between empty program and solution and where the second axis captures
the variation orthogonal to the progress axis. In particular, we map our vectors onto the plane
spanned by the following two orthogonal vectors:

1. ~δ = ~x∗ − ~x0, where ~x∗ and ~x0 are the encodings of the solution and the empty program, respectively. This is used as the x-axis of the plot and captures progress towards or away from the solution.

2. ~ν, which is chosen as the unit vector orthogonal to ~δ that preserves as much variance in the dataset as possible. Note that this setup is similar to principal component analysis [Pearson, 1901], but with the first component fixed to ~δ.
The full algorithm to obtain the 2D representation of all vectors is shown in Algorithm 1.

Algorithm 1: An algorithm to map a data matrix of program encodings X ∈ R^{N×n} to a 2D progress-variance representation Y ∈ R^{N×2}, based on an encoding of the empty program ~x0 ∈ R^n and an encoding of the reference solution ~x∗ ∈ R^n.
1: ~δ ← ~x∗ − ~x0
2: δ̂ ← ~δ / ‖~δ‖
3: X̂ ← X − ~x0                     ⊲ row-wise subtraction
4: X̂ ← X̂ − X̂ · δ̂ · δ̂^T            ⊲ project ~δ out, i.e. X̂ lives in the space orthogonal to ~δ
5: C ← X̂^T · X̂                     ⊲ quasi-covariance matrix
6: ~ν ← eigenvector with largest eigenvalue of C
7: Y ← (X − ~x0) / ‖~δ‖ · (δ̂, ~ν)   ⊲ map to 2D
8: return Y
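For convenience, here is a minimal NumPy implementation of Algorithm 1; it follows the pseudocode step by step, with ~ν obtained from an eigendecomposition of the quasi-covariance matrix.

    import numpy as np

    def progress_variance_projection(X, x0, x_star):
        """Map program encodings X (N x n) to 2D progress-variance coordinates (Algorithm 1)."""
        delta = x_star - x0                                        # step 1: progress direction
        delta_hat = delta / np.linalg.norm(delta)                  # step 2: unit progress direction
        X_hat = X - x0                                             # step 3: row-wise subtraction
        X_hat = X_hat - np.outer(X_hat @ delta_hat, delta_hat)     # step 4: project the progress direction out
        C = X_hat.T @ X_hat                                        # step 5: quasi-covariance matrix
        eigvals, eigvecs = np.linalg.eigh(C)                       # step 6: eigendecomposition
        nu = eigvecs[:, np.argmax(eigvals)]                        #         eigenvector with largest eigenvalue
        Y = (X - x0) / np.linalg.norm(delta) @ np.column_stack([delta_hat, nu])   # step 7: map to 2D
        return Y, delta_hat, nu

The returned directions delta_hat and nu can be reused in Equations 6 and 7 below to map further vectors between the two spaces.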

Once ~δ and ~ν are known, we can translate any high-dimensional vector ~x ∈ R^n to a 2D version ~y and back via the following equations:

$$\vec y = \big(\hat\delta \;\; \vec\nu\big)^T \cdot (\vec x - \vec x_0) \,/\, \|\vec\delta\| \qquad \text{(high to low projection)} \tag{6}$$

$$\vec x' = \big(\hat\delta \;\; \vec\nu\big) \cdot \vec y \cdot \|\vec\delta\| + \vec x_0, \qquad \text{(low to high embedding)} \tag{7}$$

where δ̂ = ~δ/‖~δ‖ is the unit vector parallel to ~δ and (δ̂ ~ν) is the matrix with δ̂ and ~ν as columns.
In general, ~x′ will not be equal to the original ~x because the 2D space cannot preserve all information in the n = 256 dimensions. However, our special construction ensures that the empty program ~x0 corresponds exactly to the origin (0, 0) and that the reference solution ~x∗ corresponds exactly to the point (1, 0), which can be checked by substituting ~x0 and ~x∗ into the equations above. In other words, the x-axis represents linear progress in the coding space toward the goal, and the y-axis represents variance orthogonal to progress. An example of the geometric construction of our progress-variance plot is shown in Figure 8. In this example, we project a three-dimensional dataset down to 2D.

Figure 8: An illustration of the progress-variance plotting technique. Left: the high-dimensional space (here 3D), where orange points indicate program encodings. The encodings of the empty program and the reference solution are highlighted in red and annotated with ~x0 and ~x∗, respectively. The x-axis of our progress-variance plot is the vector ~δ = ~x∗ − ~x0. The y-axis is the orthogonal axis ~ν, which covers as much variance as possible of the point distribution. Right: the 2D coordinate system with all points projected down to 2D.

3.3 Learning Dynamical Systems


In this section, we describe how to learn a linear dynamical system that captures the motion
of students in example data. We note that we are not the first to learn (linear) dynamical
systems from data [Campi and Kumar, 1998, Hazan et al., 2017]. However, to our knowledge,
we are the first to apply dynamical systems learning for educational datamining purposes; and
the system we propose here is particularly simple to learn.
In particular, we consider a linear dynamical system f , which predicts the next step of a
student at location ~x as follows.

$$f(\vec x) = \vec x + W \cdot (\vec x_* - \vec x), \tag{8}$$

where ~x∗ is the reference solution to the task and W is a matrix of parameters to be learned.
This form of the dynamical system is carefully chosen to ensure two desirable properties.
First, the reference solution is a guaranteed fixed point of our system, i.e. we obtain f (~x∗ ) =
~x∗ . Indeed, we can prove that this system has the reference solution ~x∗ as unique stable
attractor if W has sufficiently small eigenvalues (refer to Theorem 1 in the appendix for
details). This is desirable because it guarantees that the default behavior of our system is to
move towards the correct solution.
Second, we can learn a matrix W that best describes student motion via simple linear
regression. In particular, let ~x1 , . . . , ~xT be a sequence of programs submitted by a student in
their vector encoding provided by ast2vec. Then, we wish to find the matrix W which best
captures the dynamics in the sense that f (~xt ) should be as close as possible to ~xt+1 for all t.
More formally, we obtain the following minimization problem.
$$\min_W \; \sum_{t=1}^{T-1} \|f(\vec x_t) - \vec x_{t+1}\|^2 + \lambda \cdot \|W\|_F^2, \tag{9}$$

where ‖W‖_F is the Frobenius norm of the matrix W and where λ > 0 is a parameter that can be increased to ensure that the reference solution remains an attractor. This problem has the closed-form solution

$$W = \big(X_t - X_{t+1}\big)^T \cdot X_t \cdot \big(X_t^T \cdot X_t + \lambda \cdot I\big)^{-1}, \tag{10}$$

where X_t = (~x∗ − ~x_1, . . . , ~x∗ − ~x_{T−1})^T ∈ R^{(T−1)×n} is the concatenation of all vectors in the trace up to the last one and X_{t+1} = (~x∗ − ~x_2, . . . , ~x∗ − ~x_T)^T ∈ R^{(T−1)×n} is the concatenation of all successors. Refer to Theorem 2 in the appendix for a proof.
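A minimal NumPy sketch of this closed-form fit is given below. It assumes the student trace is given as an array with one ast2vec encoding per row; the difference matrices are arranged so that the fitted system moves student states toward the reference solution, matching Equations 8 and 10.

    import numpy as np

    def fit_dynamical_system(trace, x_star, lam=1e-2):
        """Fit the matrix W of the linear dynamical system f(x) = x + W (x* - x) (Equation 10)."""
        X_t = x_star - trace[:-1]        # rows x* - x_t     for t = 1, ..., T-1
        X_next = x_star - trace[1:]      # rows x* - x_{t+1} for t = 1, ..., T-1
        n = trace.shape[1]
        return (X_t - X_next).T @ X_t @ np.linalg.inv(X_t.T @ X_t + lam * np.eye(n))

    def predict_next(x, x_star, W):
        """Predict a student's next step with Equation 8."""
        return x + W @ (x_star - x)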
As a side note, we wish to highlight that this technique can readily be extended to multiple
reference solutions by replacing ~x∗ in Equations 8 and 10 with the respective closest correct
solution to the student’s current state. In other words, we partition the space of programs
according to several basins of attraction, one per correct solution. The setup and training
remain the same, otherwise.
This concludes our description of methods. In the next section, we evaluate these methods
on large-scale datasets.

4 Evaluation
In this section, we evaluate the ast2vec model on two large-scale anonymised datasets of introductory programming, namely the 2018 and 2019 beginners challenges by the National Computer Science School (NCSS)⁷, an Australian educational outreach programme. The courses were delivered by the Grok Learning platform⁸. Each offering was available to (mostly) Australian school children in Years 5–10, with curriculum-aligned educational slides and sets of exercises released each week for five weeks. Students received a score for successfully completing each exercise, with the available score starting at 10 points, reducing by one point every 5 incorrect submissions, to a minimum of 5 points. In both datasets, we consider only submissions, i.e. programs that students deliberately submitted for evaluation against test cases.

⁷ https://ncss.edu.au
⁸ https://groklearning.com/challenge/

The 2018 dataset contains data of 12,141 students working on 26 different programming tasks, yielding 148,658 compilable programs. This is also part of the data on which ast2vec was trained. The 2019 dataset contains data of 10,558 students working on 24 problems, yielding 194,797 compilable programs overall.
For our first analysis, we further broaden our scope and include a third dataset with a slightly different course format, namely the Australian Computing Academy digital technologies (DT) chatbot project⁹, which consists of 63 problems, teaches students the skills to program a simple chatbot, and is also delivered via the Grok Learning platform. Our dataset includes the data of 27,480 students enrolled between May 2017 and August 2020, yielding 1,343,608 compilable programs.

⁹ https://aca.edu.au/resources/python-chatbot/
We first check how well ast2vec is able to correctly autoencode trees in all three datasets
and then go on to check its utility for prediction. We close the section by inspecting the
coding space in more detail for the example dataset from Section 2.

4.1 Autoencoding error


Our first evaluation concerns the ability of ast2vec to maintain information in its encoding. In
particular, we evaluate the autoencoding error, i.e. we iterate over all compilable programs
in the dataset, compile them to an abstract syntax tree (AST), use ast2vec to convert the
AST into a vector, decode the vector back into an AST, and compute the tree edit distance
[Zhang and Shasha, 1989] to the original AST.
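The evaluation loop itself is straightforward. The sketch below uses hypothetical encode_tree and decode_vector helpers for ast2vec and a generic tree_edit_distance function (e.g. an implementation of the Zhang and Shasha algorithm); none of these are the exact names used in the released code.

    import ast
    import numpy as np

    def autoencoding_errors(programs, encode_tree, decode_vector, tree_edit_distance):
        """Return the tree edit distance between each AST and its autoencoded reconstruction."""
        errors, sizes = [], []
        for src in programs:
            try:
                tree = ast.parse(src)                          # compile to an abstract syntax tree
            except SyntaxError:
                continue
            reconstructed = decode_vector(encode_tree(tree))   # encode, then decode again
            errors.append(tree_edit_distance(tree, reconstructed))
            sizes.append(sum(1 for _ in ast.walk(tree)))       # tree size in number of nodes
        return np.array(errors), np.array(sizes)

    # Errors can then be aggregated per tree size, e.g. with np.median and np.mean.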
Figure 9 shows the autoencoding error for trees of different sizes in the three datasets. The
black line shows the median autoencoding error, whereas the orange line shows the average,
with the orange region indicating one standard deviation. The blue dashed line indicates the
cumulative distribution, i.e. the fraction of trees in the data that have at most x nodes.
With respect to the 2018 data, we observe that the median is generally lower than the
mean, indicating that the distribution has a long tail of few trees with higher autoencoding
error, whereas most trees have low autoencoding error. Indeed, up to tree size 36 the median
autoencoding error is zero, and 78.92% of all programs in the dataset are smaller than 37
nodes, thus covering a sizeable portion of the dataset. The overall average autoencoding error
across the dataset is 1.86 and the average tree size is 27.98.
With respect to the 2019 dataset, we observe that median and average are more closely
aligned, indicating more symmetric distributions of errors. Additionally, we notice that the
autoencoding error grows more quickly compared to the 2018 data, which indicates that
ast2vec is slightly worse in autoencoding programs outside its training data compared to
programs inside its training data. We obtain an average of 4.10 compared to an average tree
size of 24.28.
Similarly, for the chatbot data, we observe the average overtaking the median and that
the error is generally larger compared to the 2018 data. A specific property of the chatbot
dataset is that there is a long tail of large programs which correspond to the full chatbot.
If we include these in the analysis, we obtain an average autoencoding error of 7.86 and an
average tree size of 27.40. If we only consider trees up to size 85, covering 95% of all trees
in the dataset, we obtain an average autoencoding error of 4.41 and an average tree size of
22.49.
Overall, we observe that ast2vec performs better on smaller trees and better on trees in
the training data. This is not surprising but should be taken into account when applying
ast2vec to new datasets. One may also be able to further reduce the autoencoding error by
adjusting the ast2vec architecture, e.g. by incorporating more recurrent nets such as Tree
LSTMs [Chen et al., 2018, Dai et al., 2018, Tai et al., 2015].
Figure 9: The autoencoding error as measured by the tree edit distance for the NCSS 2018 beginners challenge (top), the NCSS 2019 beginners challenge (middle), and the DT chatbot course (bottom). The error is plotted here versus tree size. The orange line marks the mean, the black line the median error. The orange region is the standard deviation around the mean. Additionally, the blue line indicates how many trees in the dataset have a size up to x.

[Figure 10: x-axis: tree size, y-axis: time [ms]; curves: encoding, decoding.]

Figure 10: The time needed for encoding (orange) and decoding (blue) trees of different sizes
for the DT chatbot course. The thick lines mark the means whereas shaded regions indicate
one standard deviation. The dotted lines indicate the best linear fit.

Figure 10 shows the time needed to encode (orange) and decode (blue) a tree from the
chatbot dataset of a given size. As the plot indicates, the empirical time complexity of both
operations is roughly linear in the tree size, which corresponds to the theoretical findings
of Paaßen et al. [2021]. Using a linear regression without intercept, we find that encoding
requires roughly 0.9 milliseconds per ten tree nodes, whereas decoding requires roughly 2.8
milliseconds per ten tree nodes. Given that decoding involves more operations (both element
choice and vector operations), this difference in runtime is to be expected. Fortunately, both
operations remain fast even for relatively large trees.

4.2 Next-Step Prediction


In the previous section, we evaluated how well ast2vec could translate from programs to
vectors and back, finding that it did well on the majority of beginner programs. In this
section, we investigate the dynamical systems analysis suggested in Sections 2.3 and 3.3. In
particular, we evaluate its ability to predict the next step of a student based on example data
from other students.
Recall that our dynamical systems model is quite simple: It encodes the student’s current
abstract syntax tree as a vector, computes the difference in the encoding space to the most
common correct solution, applies a linear transformation to that difference, adds the result
to the student's current position, and decodes the resulting vector to obtain the next-step
prediction. Given this simplicity, our research objective in this section is not to provide the
best possible next-step prediction. Rather, we wish to investigate how well a very simple
model can perform just by virtue of using the continuous vector space of ast2vec.
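To make this model concrete, the following sketch spells out a single prediction step. Here, x_star denotes the encoding of the most common correct solution and W the task-specific matrix described above; encode_tree and decode_tree again stand in for the ast2vec interface, and none of these names are taken from the actual implementation.

```python
def predict_next_step(current_tree, x_star, W, encode_tree, decode_tree):
    """One step of the linear dynamical system (sketch)."""
    x_t = encode_tree(current_tree)    # student's current position in the coding space
    x_next = x_t + W @ (x_star - x_t)  # linear move towards the correct solution
    return decode_tree(x_next)         # decoded next-step prediction
```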
For comparison, we consider the following reference models from the literature:

1. the simple identity, which predicts the next step to be the same as the current one and
is included as a baseline [Hyndman and Koehler, 2006],

2. one-nearest-neighbor prediction (1NN). This involves searching the training data for the
closest tree based on the tree edit distance and predicting its successor, which is similar to
the scheme suggested by Gross and Pinkwart [2015] (a minimal sketch is given after this list), and

3. the continuous hint factory [Paaßen et al., 2018b, CHF], which is also based on the
tree edit distance but instead involves Gaussian process regression and heuristic-driven
reconstruction techniques to predict next steps.
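The following minimal sketch of the 1NN predictor (our own rendering, not code from Gross and Pinkwart [2015]) makes the scheme explicit. Here, training_pairs is assumed to be a list of (current tree, successor tree) pairs collected from other students, and tree_distance is any tree edit distance implementation.

```python
def one_nn_next_step(current_tree, training_pairs, tree_distance):
    """Predict the successor of the training tree closest to current_tree."""
    nearest_current, nearest_successor = min(
        training_pairs,
        key=lambda pair: tree_distance(current_tree, pair[0]))
    return nearest_successor
```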

[Figure 11: two panels (2018 NCSS challenge, 2019 NCSS challenge); x-axis: task number, y-axis: prediction error (TED); curves: baseline, 1NN, CHF, ast2vec + linear.]

Figure 11: The average root mean square prediction error (in terms of tree edit distance)
across the entire beginners 2018 (top) and 2019 (bottom) curriculum for different prediction
methods. The x-axis lists the tasks in the order in which students worked on them.

Both 1NN and CHF are nonlinear predictors with a much higher representational
capacity, such that we would expect them to be more accurate [Paaßen et al., 2018a]. We
investigate whether this holds in the following section.

4.2.1 Predictive error


We evaluate the error in next-step prediction on two different datasets: the first was the 2018
NCSS challenge data, which was used to train ast2vec, and the second was the 2019 NCSS
challenge data, which contained programs unseen by ast2vec. We considered each learning
task in both challenges separately. For each task, we partitioned the student data into 10
folds and performed 10-fold cross-validation, i.e. using nine folds as training data and one fold
as evaluation data. Additionally, we subsampled the training data to contain only the data
of 30 students in order to simulate a classroom-sized training dataset.
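The following sketch outlines this evaluation protocol. The functions fit_model and trace_rmse are hypothetical stand-ins for training a next-step predictor and computing its root mean square tree edit distance on a held-out trace.

```python
import numpy as np
from sklearn.model_selection import KFold

def evaluate_task(student_traces, fit_model, trace_rmse, n_train=30, seed=0):
    """10-fold cross-validation with classroom-sized training sets (sketch)."""
    rng = np.random.default_rng(seed)
    fold_errors = []
    kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
    for train_idx, test_idx in kfold.split(student_traces):
        # subsample the training folds to simulate a classroom of 30 students
        train_idx = rng.choice(train_idx, size=min(n_train, len(train_idx)),
                               replace=False)
        model = fit_model([student_traces[i] for i in train_idx])
        fold_errors.append(
            np.mean([trace_rmse(model, student_traces[i]) for i in test_idx]))
    return float(np.mean(fold_errors))
```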
Figure 11 shows the average root mean square prediction error in terms of tree edit distance
on the 2018 and 2019 data. In general, our simple linear model performed comparably to the
other models, with the performance being particularly close for the 2018 data. This difference
between the datasets is to be expected given that ast2vec has lower autoencoding error on
the 2018 data set.
With respect to the 2019 data, our model still performed reasonably well for most exercises,
though there were some cases where the error was noticeably higher (i.e. tasks 4, 8, 20,
21 and 23). The error on tasks 4, 20, 21 and 23 can be explained by the fact that these

[Figure 12: x-axis: number of students, y-axis: prediction time [s] (log scale); curves: 1NN, CHF, ast2vec + linear.]

Figure 12: The average training (dashed lines) and prediction time (solid lines) for different
training set sizes for the 2019 dataset. Prediction times are measured as the accumulated
time to provide predictions for an entire student trace. Note that the y-axis is in log scale.

tasks involved larger programs (task 4 was an early exercise, but it involved a large program
scaffold that students needed to modify). Based on the analysis in the previous section,
larger programs tend to have higher autoencoding error, which we would expect to affect the
prediction error. Task 8, however, is interesting because it involves relatively small programs.
On closer inspection, this task has many different possible solutions, so the error could be
related to the simple design of our model (specifically, that it only makes predictions towards
the most common solution). All in all, this suggests that the simple model performs comparably
to the others and only performs worse in a small number of specific and explainable cases.
We note in passing that the naive baseline of predicting the current step also performs
quite well on both datasets. This may seem surprising but is a common finding in forecasting
settings [Hyndman and Koehler, 2006, Paaßen et al., 2018a]. In particular, a low baseline
error merely indicates that students tend to change their program only a little in each step,
which is to be expected. Importantly, though, the baseline can not be used in an actual
hint system, where one would utilize the difference between a student’s current state and the
predicted next state to generate hints [Rivers and Koedinger, 2017, Paaßen et al., 2018b]. For
the baseline, there is no difference, and hence no hints can be generated.

4.2.2 Runtime
Another interesting aspect is training and prediction time. While training is very fast across
all methods (1NN does not even need training), the prediction time of neighborhood-based
methods like 1NN and CHF scales with the size of the training dataset, whereas the prediction
time for our ast2vec+linear scheme remains constant. Figure 12 displays the average training
and prediction times for all methods and varying training dataset sizes on the 2019 dataset.
As we can see, our proposed scheme always needs about 300 ms to make all predictions for
a student trace (including encoding and decoding times), whereas 1NN exceeds this time
starting from roughly 80 students, and CHF is slower for all training set sizes, requiring more than a
second even for small training datasets. This is an important consideration in large-scale
educational contexts, where many predictions may need to be made very quickly.

4.2.3 Qualitative Comparison


Beyond the quantitative comparison in terms of prediction error and runtime, there are
important qualitative differences between our proposed dynamical systems approach and previous,

neighborhood-based schemes. As noted in Section 3.3, one important property of our simple
model is that the predictions are mathematically guaranteed to lead to a solution, and the
linearity of the model means this is done smoothly. By contrast, neither 1NN nor CHF can
formally guarantee convergence to a correct solution if one follows the predictions. Addition-
ally, the predictions of 1NN are discontinuous, i.e. they change suddenly if the student enters a
different neighborhood. Overall, a smooth and guaranteed motion towards a correct solution
could be valuable for designing a next-step hint system where the generated hints should be
both directed to the desired target and intuitive [McBroom et al., 2019, Paaßen et al., 2018b,
Rivers and Koedinger, 2017].
Furthermore, a dynamical system based on ast2vec only has to solve the problem of
describing student motion in the space of programs for a particular task. Constructing the
space of programs in the first place is already solved by ast2vec. By contrast, a neighborhood-
based model must solve both the representation and the learning problem anew for each
learning task.
Additionally, any neighborhood-based model must store the entire training dataset at all
times and recall it for every prediction, whereas the ast2vec+linear model merely needs to
store and recover the predictive parameters, which include the ast2vec parameters and, for
each learning task, the vector encoding of the correct solution and the matrix W. This is not
only more time-efficient (as we saw above) but also more space-efficient, as each new learning task
only requires an additional (n + 1) × n = 65,792 floating point numbers to be stored, which is
roughly half a megabyte.
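To make the storage figure concrete, here is the arithmetic behind it, assuming double-precision floats at 8 bytes per parameter (the precision is our assumption, not stated above):

$$(n + 1) \cdot n = 257 \cdot 256 = 65{,}792 \text{ parameters}, \qquad 65{,}792 \cdot 8 \,\text{bytes} = 526{,}336 \,\text{bytes} \approx 0.5 \,\text{MB}.$$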
Finally, if additional predictive accuracy is desired, one can improve the predictive model
by replacing the linear prediction with a nonlinear prediction (e.g. a small neural network),
whereas the predictive accuracy of a neighborhood-based system can only be improved by
using a richer distance measure or more training data.
In summary, this evaluation has shown that a simple linear regression model using the
vector encodings of ast2vec can perform comparably to neighborhood-based next-step pre-
diction techniques in terms of prediction error, but has other desirable properties that may
make it preferable in many settings, such as constant time and space complexity, guaranteed
convergence to a correct solution, smoothness, and separation of concerns between represen-
tation and prediction. As such, we hope that ast2vec can contribute to improving educational
datamining pipelines, e.g. for next-step hint prediction, in the future.

4.3 Interpolation in the coding space


In the previous sections we have established that ast2vec is generally able to decode vectors
back to fitting syntax trees. However, what remains open is whether the coding space is
smooth, i.e. whether neighboring points in the coding space correspond to similar programs.
This property is important to ensure that typical analyses like visualization and dynamical
systems from Section 2 are meaningful. More specifically, if we look at a visualization of a
dataset, we implicitly assume that points close in the plot also correspond to similar programs.
Similarly, when we perform a dynamical system analysis, we implicitly assume that small
movements in the n = 256-dimensional coding space also correspond to small changes in the
corresponding program.
In principle, the variational autoencoder framework (refer to Section 3) ensures such a
smoothness property. In this section, we wish to validate this for an example. In particular,
Figure 13 shows a 2D grid of points which we sample from the progress-variance projection
for our example dataset from Section 2. Points that decode to the same syntax tree receive
the same color. The programs corresponding to regions with at least 10 grid points are shown
on the right.
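A sketch of how such a grid can be produced is given below. The vectors origin, progress_axis and variance_axis are assumed to span the projection plane in the coding space, and decode_tree again stands in for the ast2vec decoder; the grid ranges mirror the axes of the figure.

```python
import numpy as np

def decode_grid(decode_tree, origin, progress_axis, variance_axis,
                progress_range=(0.0, 1.0), variance_range=(-0.5, 0.5),
                resolution=50):
    """Decode a 2D grid of points in the progress-variance plane (sketch)."""
    decoded = {}
    for p in np.linspace(*progress_range, resolution):
        for v in np.linspace(*variance_range, resolution):
            point = origin + p * progress_axis + v * variance_axis
            decoded[(p, v)] = decode_tree(point)  # grid points are colored by tree
    return decoded
```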

[Figure 13: grid over the progress-variance plane (x-axis: progress, 0 to 1; y-axis: variance, -0.5 to 0.5). The boxed programs are the empty program; the program x = input('<string>') followed by print('<string>'); and the correct solution x = input('<string>') followed by if x == '<string>': print('<string>') else: print('<string>').]

Figure 13: A linear interpolation between the empty program at (0, 0) and the correct solution
at (1, 0) as well as along the axis of biggest variation. Color indicates the tree that the grid
point decodes to. The three most common trees are shown in the boxes.

This visualization shows several reassuring properties: First, neighboring points in the grid
tend to decode to the same tree. Second, if two points correspond to the same tree, the space
between them also corresponds to the same tree, i.e. there are no two points corresponding to
the same tree at completely disparate locations of the grid. Finally, the syntax trees that are
particularly common are also meaningful for the task, in the sense that they actually occur in
student data. Overall, this bolsters our confidence that the encoding space is indeed smooth.

5 Related Work
In this paper, we propose a novel, vectorial encoding for computer programs to support
educational datamining. In doing so, we build on prior work which also proposed alternative
representations for computer programs. In the remainder of this section, we review these
alternative representations and relate them to ast2vec.

Abstract Syntax Trees: An abstract syntax tree (AST) represents the program code as
a tree of syntactic elements, which is also the internal representation used by compilers to
transform human-readable program code to machine code [Aho et al., 2006]. For example,
Figure 5 (left) shows the abstract syntax tree for the Python program print('Hello, world').
In most processing pipelines for computer programs, compiling the source code into an AST
is the first step, because it tokenizes the source code meaningfully, establishes the hierar-
chical structure of the program, and separates functionally relevant syntax from functionally
irrelevant syntax, like variable names or comments [Aho et al., 2006, McBroom et al., 2019].
Additionally, several augmentations have been suggested to make syntax trees more meaning-
ful. In particular, Rivers and Koedinger [2012] suggest canonicalization rules to match trees
that are functionally equivalent, such as normalizing the direction of comparison operators,
inlining helper functions, or removing unreachable code. Gross et al. [2014] further suggest to
insert edges in the tree between variable names and their original declaration, thus augment-
ing the tree to a graph. Following prior work, we also use the AST representation as a first
step before we apply a neural network to it. Although we do not use them here explicitly, the
canonicalization rules of Rivers and Koedinger [2012] are directly compatible with ast2vec.
A more difficult match are the additional reference edges suggested by Gross et al. [2014]

because these would require neural networks that can deal with full graphs, i.e. graph neural
networks [Kipf and Welling, 2016, Scarselli et al., 2009].

Tree Patterns: Since many datamining methods can not be applied to trees, several prior
works have suggested transforming the AST into a collection of patterns first. Such repre-
sentations are common in the domain of natural language processing, where texts can be
represented as a collection of the words or n-grams they contain [Broder et al., 1997]. Such
techniques can be extended to trees by representing a tree by the subtree patterns it contains
[Nguyen et al., 2014, Zhi et al., 2018, Zimmerman and Rupakheti, 2015]. For example, the
syntax tree “Module(Expr(Call(Name, Str)))” could be represented as the following collection
of subtree patterns of height two: “Call(Name, Str)”, “Expr(Call)”, and “Module(Expr)”. Once
this collection is computed, we can associate all possible subtree patterns with a unique index
in the range 1, . . . , K and then represent a tree by a K-dimensional vector, where entry k
counts how often the kth subtree pattern occurs in the tree.
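As an illustration, the following sketch counts height-two subtree patterns of a Python program. Note that on recent Python versions string literals appear as Constant nodes rather than the older Str node used in the example above.

```python
import ast
from collections import Counter

def subtree_patterns(source: str) -> Counter:
    """Count height-two subtree patterns, written as 'Parent(Child1, Child2)'."""
    counts = Counter()
    for node in ast.walk(ast.parse(source)):
        children = ", ".join(type(c).__name__
                             for c in ast.iter_child_nodes(node))
        counts[f"{type(node).__name__}({children})"] += 1
    return counts

# For example, subtree_patterns("print('Hello, world')") contains the
# patterns 'Module(Expr)', 'Expr(Call)' and 'Call(Name, Constant)'.
```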
If the collection of subtree patterns is meaningful for the programming task at hand, this
representation can be both simple and powerful. For example, if a programming task is about
writing a while loop, it is valuable to know whether a student’s program actually contains such
a loop. Zhi et al. [2018] have considered this issue in more detail and considered both expert-
based and data-driven tree patterns as a representation. However, if there is no clear relation
between tree patterns and progress in a task, it may be problematic to use such patterns
as a representation, in particular when all tree patterns are weighted equally. ast2vec is
weakly related to tree patterns because the tree pattern determines which encoding/decoding
functions are called when processing the tree. However, the vector returned by the network
does not simply count tree patterns but, instead, considers the entire tree structure, i.e. how
the tree patterns are nested into each other to form the entire tree. More precisely, ast2vec
is trained to generate vector encodings that still contain enough information to recover the
original tree, whereas tree pattern counting usually does not enable us to recover the original
tree.

Pairwise distances: An alternative to an explicit vector representation is offered by pairwise
distances (or similarities). Most prominently, abstract syntax trees can be represented
by their pairwise tree edit distance d(x, y), which is defined as the number of syntactic el-
ements one has to delete, insert, or re-label to get from tree x to tree y [Mokbel et al.,
2013, Paaßen et al., 2018b, Price et al., 2017, Rivers and Koedinger, 2017, Zhang and Shasha,
1989]. Such a pairwise distance representation is mathematically quite rich. Indeed, one can
prove that a distance representation implicitly assigns vectors to programs and one can facil-
itate this implicit representation for distance-based variants of machine learning methods, e.g.
for visualization, classification, and clustering [Pekalska and Duin, 2005, Hammer and Hasenfuss,
2010]. Additionally, an edit distance enables us to infer the changes we have to apply to a
tree x̂ to get to another tree ŷ, which facilitates hints [Paaßen et al., 2018b, Price et al.,
2017, Rivers and Koedinger, 2017]. However, there are several subtle problems that can
complicate a practical application. First, there may exist exponentially many edit paths
between x̂ and ŷ and it is not always easy to choose among them, especially since many inter-
mediate trees may be syntactically invalid or implausible to students [Paaßen et al., 2018b,
Rivers and Koedinger, 2017]. More generally, distances require reference programs to which
distances can be computed. This makes distance-based methods challenging to scale for larger
datasets where many reference programs exist. By contrast, ast2vec can translate a new pro-
gram into a vector without reference to past data. Only the model parameters have to be
stored, which is constant size and runtime versus linear size and runtime. Finally, ast2vec

enables us to decouple the general problem of representing programs as vectors from task-
specific problems. We can utilize the implicit information of 500,000 student programs in
ast2vec in a new task and may need only little new student data to solve the additional,
task-specific problem at hand.
That being said, the basic logic of distance measures is still crucial to ast2vec. In par-
ticular, the negative log-likelihood in Equation 1 can be interpreted as a measure of distance
between the original program and its autoencoded version, and our notion of a smooth en-
coding can be interpreted as small distances between vectors in the encoding space implying
small distances of the corresponding programs.

Clustering: One of the challenges of computer programs is that the space of programs for
the same task is very large and it is infeasible to design feedback for all possible programs.
Accordingly, several researchers have instead grouped programs into a small number of clus-
ters, for which similar feedback can be given [Choudhury et al., 2016, Glassman et al., 2015,
Gross et al., 2014, Gulwani et al., 2018]. Whenever a new student requests help, one can
simply check which cluster the student’s program belongs to, and assign the feedback for that
cluster. One typical way to perform clustering is by using a pairwise distance measure on trees,
such as the tree edit distance [Choudhury et al., 2016, Gross et al., 2014, Zhang and Shasha,
1989]. However, it is also possible to use clustering strategies more specific to computer
programs, such as grouping programs by their control flow [Gulwani et al., 2018] or by the
unit tests they pass [McBroom et al., 2021]. ast2vec can be seen as a preprocessing step for
clustering. Once all programs are encoded as vectors, standard clustering approaches can be
applied. Additionally, ast2vec provides the benefit of being able to interpret the clustering by
translating cluster centers back into syntax trees (refer to Section 2.4).

Execution Behavior: Most representations so far focused on the AST of a program. How-
ever, it is also possible to represent programs in terms of their execution behavior. The most
popular example is the representation by test cases, where a program is fed with example in-
put. If the program's output is equal to the expected output, we say the test has passed;
otherwise, it has failed [Ihantola et al., 2010]. This is particularly useful for automatic grading of programs
because test cases verify the functionality, at least for some examples, while giving students
freedom on how to implement this functionality [Ihantola et al., 2010]. Further, failing a cer-
tain unit test can indicate a specific misconception, warranting functionality-based feedback
[McBroom et al., 2021]. However, for certain types of tasks the computational path toward an
output may be relevant, for example when comparing sorting programs. Paaßen et al. [2016]
therefore use the entire computation trace as a representation, i.e. the sequence of states of all
variables in a program. Unfortunately, a mismatch in the output or in the execution behavior
toward the output is, in general, difficult to relate to a specific change the student would need
to perform in order to correct the problem. To alleviate this challenge, test cases have to be
carefully and densely designed, or the challenge has to be left to the student. As such, we
believe that there is still ample room for AST-based representations, like our proposed vector
encodings, which are closer to the actions a student can actually perform on their own code.
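To illustrate the test-case representation discussed at the start of this paragraph, the following deliberately simple (and unsandboxed) sketch checks a single test case by feeding example input to a student program and comparing the printed output; real autograders sandbox the execution and use many such cases.

```python
import io
import contextlib

def passes_test(program_source, stdin_lines, expected_output):
    """Run a student program on example input and compare its printed output."""
    inputs = iter(stdin_lines)
    captured = io.StringIO()
    try:
        with contextlib.redirect_stdout(captured):
            # replace input() so the program reads from our example input
            exec(program_source, {"input": lambda prompt="": next(inputs)})
    except Exception:
        return False  # a crashing program counts as a failed test
    return captured.getvalue().strip() == expected_output.strip()
```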

Neural networks: Prior work already has investigated neural networks to encode computer
programs. In particular, Piech et al. [2015a] used a recursive tensor network to encode syn-
tax trees and performed classification on the embeddings to decide whether certain feedback
should be applied or not; Alon et al. [2019] represented syntax trees as a collection of paths,
encoded these paths as vectors and then computed a weighted average to obtain an overall
encoding of the program; Chen et al. [2018] translate programs into other programming lan-

guages by means of an intermediary vector representation; and Dai et al. [2018] proposed an
auto-encoding model for computer programs that incorporates an attribute grammar of the
programming language to not only incorporate syntactic, but also semantic constraints (like
that variables can only be used after declaration).
Both the works of Piech et al. [2015a] and Alon et al. [2019] are different from our contri-
bution because they do not optimize an autoencoding loss but instead train a neural net for
the classification of programs, i.e. into feedback classes or into tags. This scheme is unable to
recover the original program from the network’s output and requires expert labelling for all
training programs. Ast2vec has neither of those limitations. The work of Chen et al. [2018]
is more similar in that both input and output are programs. However, the network does not
include grammar constraints and uses an attention mechanism on the original tree to decide
on its output. This is not possible in our setting where we wish to perform datamining solely
in the vector space of encodings and be able to decode any encoding back into a tree, without
reference to a prior tree. The most similar prior work to our own is the autoencoding model
of Dai et al. [2018], which is also an autoencoding model with grammar constraints, albeit
with an LSTM-based encoder and decoder. One could frame ast2vec as a combination of
the auto-encoding ability and grammar knowledge of Dai et al. [2018] with the recursive pro-
cessing of Piech et al. [2015a], yielding a recursive tree grammar autoencoder [Paaßen et al.,
2021]. That being said, future work may incorporate more recurrent network concepts and
thus improve autoencoding error further.

Next-step hints: Ample prior work has considered the problem of predicting the next
step a student should take in a learning system, refer e.g. to the review of [McBroom et al.,
2019]. On a high level, this concerns the problem of selecting the right sequence of lessons
to maximize learning gain [Lan et al., 2014, Reddy et al., 2016]. In this paper, we rather
consider the problem of predicting the next code change within a single programming task.
This problem has been considered in more detail, for example, by Rivers and Koedinger [2017],
Price et al. [2017], Paaßen et al. [2018b], and Price et al. [2019]. Here, we compare to two
baselines evaluated by Price et al. [2019], namely the continuous hint factory [Paaßen et al.,
2018b] and a nearest-neighbor prediction [Gross and Pinkwart, 2015]. We note, however, that
we only evaluate the predictive accuracy, not the hint quality, which requires further analysis
[Price et al., 2019].

6 Conclusion
In this paper, we presented ast2vec, a novel autoencoder to translate the syntax trees of
beginner Python programs to vectors in Euclidean space and back. We have trained this
autoencoder on almost 500,000 student Python programs and evaluated it in three settings.
First, we utilized the network for a variety of analyses on a classroom-sized dataset, thereby
demonstrating the flexibility of the approach qualitatively. As part of our qualitative analysis,
we also showed that the encoding space of ast2vec is smooth, at least for our example, and
introduced two novel techniques for analyzing programming data based on ast2vec, namely
progress-variance-projections for a two-dimensional visualization, and a linear dynamical sys-
tems method that guarantees convergence to the correct solution.
In terms of quantitative analyses, we evaluated the autoencoding error of ast2vec as well as
the predictive accuracy of a simple linear model on top of ast2vec. We found that ast2vec had
a low autoencoding error for the majority of programs, including on two unseen, large-scale
datasets, though the error tended to increase with tree size. In addition, the encoding and
decoding times were low and consistent with the theoretical O(n) bounds, suggesting ast2vec

is scalable. Moreover, by coupling ast2vec with our linear dynamical systems method, we
were able to approach the predictive accuracy of existing methods with a very simple model
that is both time- and space-efficient and guarantees convergence to a correct solution.
While we believe that these results are encouraging, we also acknowledge limitations: At
present, ast2vec does not decode the content of variables, although such content may be
decisive to solve a task correctly. Further improvements in terms of autoencoding error are
possible as well, especially for larger programs, perhaps by including more recurrent networks
as encoders and decoders. Finally, the predictive accuracy of our proposed linear dynamical
systems model does not yet reach the state of the art, and nonlinear predictors are likely
necessary to improve performance further.
Still, ast2vec provides the educational datamining community with a novel tool that can
be utilized without any need for further deep learning in a broad variety of analyses. We are
excited to see its applications in the future.

7 Acknowledgements
Funding by the German Research Foundation (DFG) under grant number PA 3460/1-1 is
gratefully acknowledged.

References
Charu C. Aggarwal, Alexander Hinneburg, and Daniel A. Keim. On the surprising behavior
of distance metrics in high dimensional space. In Jan Van den Bussche and Victor Vianu,
editors, Proceedings of the International Conference on Database Theory (ICDT 2001),
pages 420–434, Berlin, Heidelberg, 2001. Springer. doi:10.1007/3-540-44503-X_27.

Alfred Aho, Monica Lam, Ravi Sethi, and Jeffrey Ullman. Compilers: Principles, Techniques,
and Tools. Addison Wesley, Boston, MA, USA, 2 edition, 2006.

Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. Code2vec: Learning distributed
representations of code. Proceedings of the ACM on Programming Languages, 3:40, 2019.
doi:10.1145/3290353.

Tiffany Barnes, Behrooz Mostafavi, and Michael J Eagle. Data-driven domain models for
problem solving, volume 4, pages 137–145. US Army Research Laboratory, Orlando, FL,
USA, 2016.

Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. Syntactic
clustering of the web. Computer Networks and ISDN Systems, 29(8):1157 – 1166, 1997.
doi:10.1016/S0169-7552(97)00031-7.

M.C. Campi and P.R. Kumar. Learning dynamical systems in a stationary environment.
Systems & Control Letters, 34(3):125 – 132, 1998. doi:10.1016/S0167-6911(98)00005-X.

Xinyun Chen, Chang Liu, and Dawn Song. Tree-to-tree neural networks for pro-
gram translation. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-
Bianchi, and R. Garnett, editors, Proceedings of the 31st International Conference
on Advances in Neural Information Processing Systems (NeurIPS 2018), 2018. URL
https://papers.nips.cc/paper/2018/hash/d759175de8ea5b1d9a2660e45554894f-Abstract.html.

Kyunghyun Cho, Bart Van Merrienboer, Çaglar Gülçehre, Fethi Bougares, Holger
Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-
decoder for statistical machine translation. In Alessandro Moschitti, Bo Pang, and
Walter Daelemans, editors, Proceedings of the 2014 Conference on Empirical Meth-
ods in Natural Language Processing (EMNLP 2014), pages 1724–1734, 2014. URL
https://www.aclweb.org/anthology/D14-1179.

Rohan Roy Choudhury, HeZheng Yin, Joseph Moghadam, and Armando Fox. Autostyle:
Toward coding style feedback at scale. In Darren Gergle, Meredith Ringel Morris, Pernille
Bjørn, Joseph Konstan, Gary Hsieh, and Naomi Yamashita, editors, Proceedings of the
19th ACM Conference on Computer Supported Cooperative Work and Social Computing
Companion (CSCW 2016), page 21–24, 2016. doi:10.1145/2818052.2874315.

Hanjun Dai, Yingtao Tian, Bo Dai, Steven Skiena, and Le Song. Syntax-directed varia-
tional autoencoder for structured data. In Yoshua Bengio, Yann LeCun, Tara Sainath,
Ian Murray, Marc Aurelio Ranzato, and Oriol Vinyals, editors, Proceedings of the
6th International Conference on Learning Representations (ICLR 2018), 2018. URL
https://openreview.net/forum?id=SyqShMZRb.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data
via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological),
39(1):1–38, 1977. URL http://www.jstor.org/stable/2984875.

Peter J. Denning. Remaining trouble spots with computational thinking. Communications of
the ACM, 60(6):33–39, 2017. doi:10.1145/2998438.

Andrej Gisbrecht and Frank-Michael Schleif. Metric and non-metric proximity transformations
at linear costs. Neurocomputing, 167:643–657, 2015. doi:10.1016/j.neucom.2015.04.017.

Elena L. Glassman, Jeremy Scott, Rishabh Singh, Philip J. Guo, and Robert C. Miller.
Overcode: Visualizing variation in student solutions to programming problems at scale.
ACM Transactions on Computer-Human Interaction, 22(2):7, 2015. doi:10.1145/2699751.

Sebastian Gross and Niels Pinkwart. How do learners behave in help-seeking when given a
choice? In Cristina Conati, Neil Heffernan, Antonija Mitrovic, and M. Felisa Verdejo, edi-
tors, Proceedings of the 17th International Conference on Artificial Intelligence in Education
(AIED 2015), pages 600–603, 2015. doi:10.1007/978-3-319-19773-9_71.

Sebastian Gross, Bassam Mokbel, Benjamin Paaßen, Barbara Hammer, and Niels Pinkwart.
Example-based feedback provision using structured solution spaces. International Journal
of Learning Technology, 9(3):248–280, 2014. doi:10.1504/IJLT.2014.065752.

Sumit Gulwani, Ivan Radiček, and Florian Zuleger. Automated clustering and program repair
for introductory programming assignments. ACM SIGPLAN Notices, 53(4):465–480, 2018.
doi:10.1145/3296979.3192387.

Barbara Hammer and Alexander Hasenfuss. Topographic mapping of large dissimilarity data
sets. Neural Computation, 22(9):2229–2284, 2010. doi:10.1162/NECO_a_00012.

Elad Hazan, Karan Singh, and Cyril Zhang. Learning linear dynamical systems via
spectral filtering. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus,
S. Vishwanathan, and R. Garnett, editors, Proceedings of the 30th Conference on Ad-
vances Neural Information Processing Systems (NIPS 2017), pages 6702–6712, 2017. URL
https://proceedings.neurips.cc/paper/2017/file/165a59f7cf3b5c4396ba65953d679f17-Paper.pdf

Rob J. Hyndman and Anne B. Koehler. Another look at measures of forecast accuracy. Inter-
national Journal of Forecasting, 22(4):679 – 688, 2006. doi:10.1016/j.ijforecast.2006.03.001.

Petri Ihantola, Tuukka Ahoniemi, Ville Karavirta, and Otto Seppälä. Review of recent systems
for automatic assessment of programming assignments. In Casten Schulte and Jarkko Suho-
nen, editors, Proceedings of the 10th Koli Calling International Conference on Computing
Education Research (Koli Calling 2010), page 86–93, 2010. doi:10.1145/1930464.1930480.

Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In
Yoshua Bengio, Yann LeCun, Brian Kingsbury, Samy Bengio, Nando de Freitas, and Hugo
Larochelle, editors, Proceedings of the 3rd International Conference on Learning Represen-
tations (ICLR 2015), 2015. URL https://arxiv.org/abs/1412.6980.

Diederik Kingma and Max Welling. Auto-encoding variational Bayes. In Yoshua Bengio,
Yann LeCun, Aaron Courville, Rob Fergus, and Chris Manning, editors, Proceedings of
the 1st International Conference on Learning Representations (ICLR 2013), 2013. URL
https://arxiv.org/abs/1312.6114.

Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional
networks. In Hugo Larochelle, Samy Bengio, Brian Kingsbury, Yoshua Bengio, and Yann
LeCun, editors, Proceedings of the 4th International Conference on Learning Representa-
tions (ICLR 2016), 2016. URL https://arxiv.org/abs/1609.02907.

Matt J. Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational
autoencoder. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th In-
ternational Conference on Machine Learning (ICML 2017), pages 1945–1954, 2017. URL
http://proceedings.mlr.press/v70/kusner17a.html.

Essi Lahtinen, Kirsti Ala-Mutka, and Hannu-Matti Järvinen. A study of the difficulties of
novice programmers. SIGCSE Bulletin, 37(3):14–18, 2005. doi:10.1145/1151954.1067453.

Andrew S. Lan, Christoph Studer, and Richard G. Baraniuk. Time-varying learning and con-
tent analytics via sparse factor analysis. In Sofus Macskassy, Claudia Perlich, Jure Leskovec,
Wei Wang, and Rayid Ghani, editors, Proceedings of the 20th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (KDD ’14), page 452–461, 2014.
doi:10.1145/2623330.2623631.

Jessica McBroom, Kalina Yacef, Irena Koprinska, and James R. Curran. A data-
driven method for helping teachers improve feedback in computer programming au-
tomated tutors. In Carolyn Penstein Rosé, Roberto Martínez-Maldonado, H. Ulrich
Hoppe, Rose Luckin, Manolis Mavrikis, Kaska Porayska-Pomsta, Bruce McLaren, and
Benedict du Boulay, editors, Artificial Intelligence in Education, pages 324–337, 2018.
doi:10.1007/978-3-319-93843-1_24.

Jessica McBroom, Irena Koprinska, and Kalina Yacef. A survey of automated pro-
gramming hint generation–the hints framework. arXiv, 1908.11566, 2019. URL
https://arxiv.org/abs/1908.11566.

Jessica McBroom, Benjamin Paaßen, Bryn Jeffries, Irena Koprinska, and Kalina Yacef.
Progress networks as a tool for analysing student programming difficulties. In Claudia
Szabo and Judy Sheard, editors, Proceedings of the Twenty-Third Australasian Conference
on Computing Education (ACE 2021), 2021. accepted.

Michael McCracken, Vicki Almstrum, Danny Diaz, Mark Guzdial, Dianne Hagan, Yifat Ben-
David Kolikant, Cary Laxer, Lynda Thomas, Ian Utting, and Tadeusz Wilusz. A multi-
national, multi-institutional study of assessment of programming skills of first-year cs stu-
dents. In Working Group Reports from ITiCSE on Innovation and Technology in Computer
Science Education, pages 125–180, 2001. doi:10.1145/572133.572137.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Dis-
tributed representations of words and phrases and their compositionality. In
C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger,
editors, Proceedings of the 26th International Conference on Advances in Neu-
ral Information Processing Systems (NIPS 2013), pages 3111–3119, 2013. URL
http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-th
Bassam Mokbel, Sebastian Gross, Benjamin Paaßen, Niels Pinkwart, and Barbara Ham-
mer. Domain-independent proximity measures in intelligent tutoring systems. In
S. K. D’Mello, R. A. Calvo, and A. Olney, editors, Proceedings of the 6th Interna-
tional Conference on Educational Data Mining (EDM 2013), pages 334–335, 2013. URL
http://www.educationaldatamining.org/EDM2013/papers/rn_paper_68.pdf.
Andy Nguyen, Christopher Piech, Jonathan Huang, and Leonidas Guibas. Codewebs: Scal-
able homework search for massive open online programming courses. In Chin-Wan Chung,
Andrei Broder, Kyuseok Shim, and Torsten Suel, editors, Proceedings of the 23rd Inter-
national Conference on World Wide Web (WWW 2014), page 491–502. Association for
Computing Machinery, 2014. doi:10.1145/2566486.2568023.
Benjamin Paaßen, Joris Jensen, and Barbara Hammer. Execution traces as
a powerful data representation for intelligent tutoring systems for program-
ming. In Tiffany Barnes, Min Chi, and Mingyu Feng, editors, Proceedings
of the 9th International Conference on Educational Data Mining (EDM 2016),
pages 183 – 190. International Educational Datamining Society, 2016. URL
http://www.educationaldatamining.org/EDM2016/proceedings/paper_17.pdf.
Benjamin Paaßen, Christina Göpfert, and Barbara Hammer. Time series prediction for
graphs in kernel and dissimilarity spaces. Neural Processing Letters, 48(2):669–689, 2018a.
doi:10.1007/s11063-017-9684-5. URL https://arxiv.org/abs/1704.06498.
Benjamin Paaßen, Barbara Hammer, Thomas Price, Tiffany Barnes, Sebastian Gross, and
Niels Pinkwart. The continuous hint factory - providing hints in vast and sparsely popu-
lated edit distance spaces. Journal of Educational Datamining, 10(1):1–35, 2018b. URL
https://jedm.educationaldatamining.org/index.php/JEDM/article/view/158.
Benjamin Paaßen, Irena Koprinska, and Kalina Yacef. Recursive tree grammar autoencoders.
arXiv, 2012.02097, 2021. https://arxiv.org/abs/2012.02097.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory
Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmai-
son, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani,
Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala.
Pytorch: An imperative style, high-performance deep learning library. In Hanna Wallach,
Hugo Larochelle, Alina Beygelzimer, Florence d’Alché Buc, Emily Fox, and Roman
Garnett, editors, Proceedings of the 32nd International Conference on Advances in
Neural Information Processing Systems (NeurIPS 2019), pages 8026–8037, 2019. URL
http://papers.nips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learn

Karl Pearson. On lines and planes of closest fit to systems of points in space. The London,
Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572, 1901.
doi:10.1080/14786440109462720.

Barry Peddycord III., Andrew Hicks, and Tiffany Barnes. Generating hints
for programming problems using intermediate output. In Manolis Mavrikis
and Bruce M. McLaren, editors, Proceedings of the 7th International Con-
ference on Educational Data Mining (EDM 2014), pages 92–98, 2014. URL
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.659.1815&rep=rep1&type=pdf.

Elzbieta Pekalska and Robert Duin. The Dissimilarity Representation for Pattern Recogni-
tion: Foundations And Applications (Machine Perception and Artificial Intelligence). World
Scientific Publishing Co., Inc., River Edge, NJ, USA, 2005. ISBN 9812565302.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global
vectors for word representation. In Alessandro Moschitti, Bo Pang, and Walter Daelemans,
editors, Proceedings of the 2014 Conference on Empirical Methods in Natural Language
Processing (EMNLP 2014), pages 1532–1543, 2014. URL http://www.aclweb.org/anthology/D14-1162.

Chris Piech, Jonathan Huang, Andy Nguyen, Mike Phulsuksombati, Mehran Sahami,
and Leonidas Guibas. Learning program embeddings to propagate feedback on stu-
dent code. In Francis Bach and David Blei, editors, Proceedings of the 37th Interna-
tional Conference on Machine Learning (ICML 2015), pages 1093–1102, 2015a. URL
http://proceedings.mlr.press/v37/piech15.html.

Chris Piech, Mehran Sahami, Jonathan Huang, and Leonidas Guibas. Autonomously gener-
ating hints by inferring problem solving policies. In Gregor Kiczales, Daniel Russell, and
Beverly Woolf, editors, Proceedings of the Second ACM Conference on Learning @ Scale
(L@S 2015), page 195–204, 2015b. doi:10.1145/2724660.2724668.

A. Politi. Lyapunov exponent. Scholarpedia, 8(3):2722, 2013. doi:10.4249/scholarpedia.2722.

Thomas Price, Rui Zhi, and Tiffany Barnes. Evaluation of a data-driven
feedback algorithm for open-ended programming. In Xiangen Hu, Tiffany Barnes, and Paul
Inventado, editors, Proceedings of the 10th International Conference on Educational Data
Mining (EDM 2017), pages 192–197, 2017. URL http://educationaldatamining.org/EDM2017/proc_files/papers/paper_36.pdf.

Thomas W Price, Yihuan Dong, Rui Zhi, Benjamin Paaßen, Nicholas Lytle, Veronica Cateté,
and Tiffany Barnes. A comparison of the quality of data-driven programming hint genera-
tion algorithms. International Journal of Artificial Intelligence in Education, 29(3):368–395,
2019. doi:10.1007/s40593-019-00177-z.

Yizhou Qian and James Lehman. Students’ misconceptions and other difficulties in introduc-
tory programming: A literature review. ACM Transactions on Computing Education, 18
(1):1–25, 2017. doi:10.1145/3077618.

Siddharth Reddy, Igor Labutov, and Thorsten Joachims. Latent skill embedding
for personalized lesson sequence recommendation. arXiv, 1602.07029, 2016. URL
http://arxiv.org/abs/1602.07029.

Kelly Rivers and Kenneth R. Koedinger. A canonicalizing model for building programming
tutors. In Stefano A. Cerri, William J. Clancey, Giorgos Papadourakis, and Kitty Panourgia,
editors, Proceedings of the 11th International Conference on Intelligent Tutoring Systems,
(ITS 2012), pages 591–593, 2012. doi:10.1007/978-3-642-30950-2_80.

Kelly Rivers and Kenneth R Koedinger. Data-driven hint generation in vast solution spaces:
a self-improving python programming tutor. International Journal of Artificial Intelligence
in Education, 27(1):37–64, 2017. doi:10.1007/s40593-015-0070-z.

Anthony Robins, Janet Rountree, and Nathan Rountree. Learning and teaching program-
ming: A review and discussion. Computer Science Education, 13(2):137–172, 2003.
doi:10.1076/csed.13.2.137.14200.

Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfar-
dini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):
61–80, 2009. doi:10.1109/TNN.2008.2005605.

Kai Sheng Tai, Richard Socher, and Christopher D Manning. Improved semantic repre-
sentations from tree-structured long short-term memory networks. In Yuji Matsumoto,
Chengqing Zong, and Michael Strube, editors, Proceedings of the 53rd Annual Meeting of
the Association for Computational Linguistic (ACL 2015), pages 1556–1566, 2015. URL
https://www.aclweb.org/anthology/P15-1150.pdf.

Nghi Truong, Paul Roe, and Peter Bancroft. Static analysis of students’ java pro-
grams. In Raymond Lister and Alison Young, editors, Proceedings of the Sixth Aus-
tralasian Conference on Computing Education (ACE 2004), page 317–325, 2004. URL
https://dl.acm.org/doi/abs/10.5555/979968.980011.

Rose Wiles, Gabriele Durrant, Sofie De Broe, and Jackie Powell. Methodological approaches
at phd and skills sought for research posts in academia: a mismatch? International Journal
of Social Research Methodology, 12(3):257–269, 2009. doi:10.1080/13645570701708550.

Kaizhong Zhang and Dennis Shasha. Simple fast algorithms for the editing distance be-
tween trees and related problems. SIAM Journal on Computing, 18(6):1245–1262, 1989.
doi:10.1137/0218082.

Rui Zhi, Thomas W Price, Nicholas Lytle, Yihuan Dong, and Tiffany Barnes.
Reducing the state space of programming problems through data-driven feature
detection. In David Azcona, Sharon Hsiao, Nguyen-Thinh Le, John Stam-
per, and Michael Yudelson, editors, Proceedings of the second Educational Data
Mining in Computer Science Education Workshop (CSEDM 2018), 2018. URL
https://people.engr.ncsu.edu/twprice/website/files/CSEDM%202018.pdf.

Kurtis Zimmerman and Chandan R. Rupakheti. An automated framework for recommending
program elements to novices (n). In Myra Cohen, Lars Grunske, and Michael Whalen, edi-
tors, Proceedings of the 30th IEEE/ACM International Conference on Automated Software
Engineering (ASE 2015), pages 283–288, 2015. doi:10.1109/ASE.2015.54.

A Proofs for the dynamical system analysis


Theorem 1. Let $f(\vec{x}) = \vec{x} + W \cdot (\vec{x}^* - \vec{x})$ for some matrix $W \in \mathbb{R}^{n \times n}$. If $|1 - \lambda_j| < 1$ for all
eigenvalues $\lambda_j$ of $W$, $f$ asymptotically converges to $\vec{x}^*$.

Proof. We note that the theorem follows from general results in stability analysis, especially
Lyapunov exponents [Politi, 2013]. Still, we provide a full proof here to give interested readers
insight into how such an analysis can be done.
In particular, let $\vec{x}_1$ be any $n$-dimensional real vector. We now wish to show that plugging this vector into $f$ repeatedly yields a sequence $\vec{x}_1, \vec{x}_2, \ldots$ with $\vec{x}_{t+1} = f(\vec{x}_t)$ which asymptotically converges to $\vec{x}^*$ in the sense that

$$\lim_{t \to \infty} \vec{x}_t = \vec{x}^* .$$
To show this, we first define the alternative vector $\hat{x}_t := \vec{x}_t - \vec{x}^*$. For this vector we obtain:

$$\hat{x}_{t+1} = \vec{x}_{t+1} - \vec{x}^* = f(\vec{x}_t) - \vec{x}^* = \vec{x}_t + W \cdot (\vec{x}^* - \vec{x}_t) - \vec{x}^* = (I - W) \cdot \hat{x}_t = (I - W)^t \cdot \hat{x}_1 .$$

Further, note that our desired convergence of $\vec{x}_t$ to $\vec{x}^*$ is equivalent to stating that $\hat{x}_t$ converges to zero. To show that $\hat{x}_t$ converges to zero, it is sufficient to prove that the matrix $(I - W)^t$ converges to zero.
To show this, we need to consider the eigenvalues of our matrix $W$. In particular, let $V \cdot \Lambda \cdot V^{-1} = W$ be the eigenvalue decomposition of $W$, where $V$ is the matrix of eigenvectors and $\Lambda$ is the diagonal matrix of eigenvalues $\lambda_1, \ldots, \lambda_n$. Then it holds:

$$(I - W)^t = \big( V \cdot V^{-1} - V \cdot \Lambda \cdot V^{-1} \big)^t = \big( V \cdot (I - \Lambda) \cdot V^{-1} \big)^t = V \cdot (I - \Lambda) \cdot V^{-1} \cdot V \cdots V^{-1} \cdot V \cdot (I - \Lambda) \cdot V^{-1} = V \cdot (I - \Lambda)^t \cdot V^{-1} .$$

Now, let us consider the matrix power $(I - \Lambda)^t$ in more detail. Because this is a diagonal matrix, the power is just applied elementwise. In particular, for the $j$th diagonal element we obtain the power $(1 - \lambda_j)^t$. In general, the $j$th eigenvalue can be complex-valued. However, the absolute value still behaves like $|(1 - \lambda_j)^t| = |1 - \lambda_j|^t$. Now, since we required that $|1 - \lambda_j| < 1$, we obtain

$$\lim_{t \to \infty} |1 - \lambda_j|^t = 0 \qquad \Rightarrow \qquad \lim_{t \to \infty} V \cdot (I - \Lambda)^t \cdot V^{-1} = 0,$$

which implies our desired result.
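For readers who prefer a numerical check of this statement, the following numpy sketch constructs a matrix W whose eigenvalues satisfy the condition of Theorem 1 (via a random contraction, which is our own choice of construction) and verifies that the iteration approaches the target.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
# Choose W = I - A with spectral norm ||A||_2 = 0.5, so every eigenvalue
# lambda_j of W satisfies |1 - lambda_j| <= 0.5 < 1 as required by Theorem 1.
A = rng.standard_normal((n, n))
A = 0.5 * A / np.linalg.norm(A, 2)
W = np.eye(n) - A

x_star = rng.standard_normal(n)    # target, e.g. the correct solution
x = rng.standard_normal(n)         # arbitrary starting point
for _ in range(200):
    x = x + W @ (x_star - x)       # f(x) = x + W (x_star - x)
print(np.linalg.norm(x - x_star))  # close to zero, as the theorem predicts
```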

Theorem 2. Problem 9 has the unique solution in Equation 10.

Proof. We solve the problem by setting the gradient to zero and considering the Hessian.
First, we compute the gradient of the loss with respect to $W$:

$$\nabla_W \left[ \sum_{t=1}^{T-1} \big\| \vec{x}_t + W \cdot (\vec{x}^* - \vec{x}_t) - \vec{x}_{t+1} \big\|^2 + \lambda \cdot \| W \|_F^2 \right] = 2 \cdot \sum_{t=1}^{T-1} \big( \vec{x}_t + W \cdot (\vec{x}^* - \vec{x}_t) - \vec{x}_{t+1} \big) \cdot (\vec{x}^* - \vec{x}_t)^T + 2 \cdot \lambda \cdot W .$$

Setting this to zero yields

$$W \cdot \left( \sum_{t=1}^{T-1} (\vec{x}^* - \vec{x}_t) \cdot (\vec{x}^* - \vec{x}_t)^T + \lambda \cdot I \right) = \sum_{t=1}^{T-1} (\vec{x}_{t+1} - \vec{x}_t) \cdot (\vec{x}^* - \vec{x}_t)^T \quad \Longleftrightarrow \quad W \cdot \big( X_t^T \cdot X_t + \lambda \cdot I \big) = (X_{t+1} - X_t)^T \cdot X_t .$$

For $\lambda > 0$, the matrix $X_t^T \cdot X_t + \lambda \cdot I$ is guaranteed to be invertible, which yields our desired
solution.

It remains to show that the Hessian for this problem is positive definite. Re-inspecting the gradient, we observe that the matrix $W$ occurs only as a product with the term $\sum_{t=1}^{T-1} (\vec{x}^* - \vec{x}_t) \cdot (\vec{x}^* - \vec{x}_t)^T$, which is positive semi-definite, and the term $\lambda \cdot I$, which is strictly positive definite. Hence, our problem is convex and our solution is the unique global optimum.
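The resulting closed form can be implemented in a few lines of numpy. The sketch below follows the sum form of the optimality condition above; the variable names are ours, and xs is assumed to contain the encodings of one student trace, ordered in time.

```python
import numpy as np

def fit_w(xs, x_star, lam=1e-3):
    """Ridge solution for W from an encoded student trace (sketch)."""
    xs = np.asarray(xs)                  # shape (T, n)
    diffs = x_star - xs[:-1]             # rows: x_star - x_t
    steps = xs[1:] - xs[:-1]             # rows: x_{t+1} - x_t
    lhs = diffs.T @ diffs + lam * np.eye(xs.shape[1])
    rhs = steps.T @ diffs                # sum_t (x_{t+1} - x_t)(x_star - x_t)^T
    # Solve W @ lhs = rhs; lhs is symmetric positive definite for lam > 0.
    return np.linalg.solve(lhs, rhs.T).T
```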
