PhD thesis: GNN XAI
Machine Learning
Location:
GREYC laboratory, CNRS UMR 6072, Université de Caen Normandie, 14000 Caen, France
Scientific context
Pandora. This thesis is funded within the Pandora project, supported by the French ANR (National Research Agency) and underway since February 2025. Pandora is situated in the context of explainable artificial intelligence (XAI) applied to graph neural networks (GNNs). By focusing on the internal functioning of GNNs, the objectives of the project are as follows:
— characterize, understand and clearly explain the internal workings of GNNs using pattern extraction techniques;
— uncover statistically significant patterns of neural activation, called “activation rules,” to determine how networks encode concepts [7, 8] (a toy sketch of this idea is given below);
— translate these activation rules into graph patterns interpretable by a user;
— use this knowledge to improve GNNs by identifying learning biases, generating
additional data, and building explanatory systems.
The thesis will be concerned with the last of those research questions.
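To give a concrete feel for what such analyses look like, the following Python sketch binarizes the hidden activations of a GNN layer and lists frequent co-activation patterns of hidden units. The random activation matrix, the activation threshold and the plain support measure are illustrative placeholders; the actual approach of [7, 8] works on activations collected from a trained GNN and uses a dedicated interestingness measure rather than raw support.

```python
# Minimal, self-contained sketch: frequent co-activation patterns of GNN hidden
# units. The activation matrix is simulated with random numbers as a stand-in
# for activations collected from a trained GNN layer, and plain support replaces
# the dedicated interestingness measure used for activation rules [7, 8].
from itertools import combinations

import numpy as np

rng = np.random.default_rng(0)
activations = rng.random((200, 16))        # 200 nodes x 16 hidden units (stand-in data)
binary = (activations > 0.7).astype(int)   # 1 = unit considered "active" on that node

min_support = 0.05                         # fraction of nodes that must satisfy the pattern
patterns = []
for size in (2, 3):                        # only small co-activation patterns
    for units in combinations(range(binary.shape[1]), size):
        support = binary[:, list(units)].all(axis=1).mean()
        if support >= min_support:
            patterns.append((units, support))

for units, support in sorted(patterns, key=lambda p: -p[1])[:5]:
    print(f"hidden units {units} co-activate on {support:.0%} of the nodes")
```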
The work carried out in this project (and by extension in the thesis) will be partially based on molecular data from biochemical experiments conducted in our collaboration with the CERMN laboratory (Centre d’Études et de Recherche sur le Médicament de Normandie), Université de Caen Normandie.
Problem setting. In machine learning, training data sets are not always sufficiently representative of the real world (for example, chemical/biological experiments often focus only on certain well-explored molecules or certain therapeutic targets). How can we detect that a training data set is insufficient? Two non-exhaustive criteria:
— some parts of the data space are not represented (e.g. certain node/edge combinations never occur; a minimal coverage check along these lines is sketched after this list);
— the learned model is unreliable in some subspaces of the data (the reliability of
a supervised model can be studied, for example, by looking at the importance of
instances in the construction of decision boundaries).
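As a minimal sketch of the first criterion, the Python snippet below enumerates the node-label pairs that occur on the edges of a toy graph dataset and reports the pairs that never appear. The label alphabet and the two graphs are hypothetical stand-ins; an analysis of molecular data would use the actual atom and bond vocabularies of the project data.

```python
# Minimal sketch: node-label combinations that never occur on any edge of the
# dataset, one symptom of an under-represented data space. The label alphabet
# and the graphs are toy placeholders, not project data.
from itertools import combinations_with_replacement

labels = {"C", "N", "O"}                                    # hypothetical node-label alphabet
graphs = [                                                  # each graph: node labels + edge list
    {"nodes": ["C", "C", "O"], "edges": [(0, 1), (1, 2)]},
    {"nodes": ["C", "N"], "edges": [(0, 1)]},
]

observed = set()
for g in graphs:
    for u, v in g["edges"]:
        observed.add(frozenset((g["nodes"][u], g["nodes"][v])))

possible = {frozenset(pair) for pair in combinations_with_replacement(sorted(labels), 2)}
missing = possible - observed
print("edge label combinations never observed:",
      sorted(tuple(sorted(pair)) for pair in missing))
```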
The literature contains methods to characterize data in a model-independent manner [5]
and methods to characterize the behavior of a model based on the components of the
individual graphs considered [9, 2, 6, 3, 4, 1]. However, there is no approach that establishes
the link between the data and the performance of a specific model. Furthermore, there are no approaches that augment the data as a means of improving model performance and reliability. The thesis is intended to address these gaps.
Objectives
This thesis has three objectives. First, we want to characterize graph datasets at a global level, in a way similar to what is already done for vectorial datasets. Second, we want to design one (or more) approaches that use explanations of the behavior of GNNs to identify relevant instances of the training set. Finally, we will leverage the results of the first two points to generate additional data instances that improve the data set and thereby render GNNs more accurate and more robust.
instances and subgraphs linked to the explanatory descriptors/rules will also make it possible to determine how the descriptors characterize different subsets of data.
3. Develop a formalism to extend concepts defined for vector data (density, decision boundaries, value distribution) to graph data. This formalism, in combination with the results of step 2, will make it possible to determine where learning instances are missing in a training dataset and thus where it is useful to generate synthetic data (a toy illustration is sketched after this list).
4. Exploit the information derived from the first three points, as well as others —
for instance graph patterns extracted using pattern mining methods — to define
constraints on symbolic data generators to arrive at data with precise properties
that fill the gaps in the data sets.
5. Evaluate the generated data in the context of project use cases, particularly activity prediction on molecular data.
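To give point 3 a concrete, if simplistic, shape: the Python sketch below maps every graph of a randomly generated dataset to a few descriptor values and fits a kernel density estimate over those vectors, so that graphs lying in low-density regions indicate where training instances are scarce. The descriptors, the scikit-learn KernelDensity estimator and the bandwidth are placeholder choices, not the graph-native formalism the thesis is expected to develop.

```python
# Minimal sketch: locate sparse regions of a graph dataset by estimating a
# density over simple per-graph descriptors. Descriptors and estimator are
# illustrative stand-ins for the formalism to be developed in the thesis.
import networkx as nx
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(1)
graphs = [nx.gnp_random_graph(int(rng.integers(5, 15)), 0.3, seed=int(s))
          for s in rng.integers(0, 1000, 50)]               # toy dataset of 50 random graphs

def describe(g: nx.Graph) -> list:
    """Toy descriptor: size, edge density, mean degree, average clustering."""
    degrees = [d for _, d in g.degree()]
    return [g.number_of_nodes(), nx.density(g),
            float(np.mean(degrees)), nx.average_clustering(g)]

X = np.array([describe(g) for g in graphs])
X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-9)           # standardize descriptors

kde = KernelDensity(bandwidth=0.5).fit(X)
log_density = kde.score_samples(X)                           # log-density of each graph's descriptor
sparsest = np.argsort(log_density)[:3]                       # graphs in the emptiest regions
print("graphs in the sparsest regions of descriptor space:", sparsest.tolist())
```

Graph kernels or learned GNN embeddings could replace the hand-crafted descriptors, but the underlying question of what “density” means for graph data is precisely what the thesis is expected to address.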
Keywords: Statistical learning, graph neural networks, explainable AI, data mining.
Thesis period: starting in autumn 2025.
Remuneration: approximately €2,200 gross per month.
Supervising team:
— Bruno Crémilleux (GREYC – Université de Caen Normandie).
— Marc Plantevit (LRE – EPITA).
— Albrecht Zimmermann (GREYC – Université de Caen Normandie).
Candidate profile
The candidate must be enrolled in the final year of a Master’s degree or an engineering
degree, or hold such a degree, in a field related to computer science or applied mathematics,
and have solid programming skills. Experience in data science, deep learning, etc. would be a plus. The candidate must be able to write scientific reports and communicate research results at conferences in English.
To apply
Application period: from now until the position is filled.
Send the following documents (exclusively in PDF format) to bruno.cremilleux@unicaen.fr, marc.plantevit@epita.fr and albrecht.zimmermann@unicaen.fr:
— a cover letter explaining your qualifications, experience and motivation for this subject;
— a curriculum vitae;
— transcripts of grades (if possible with ranking) for the 3rd year of the Bachelor's degree and the 1st and 2nd years of the Master's degree, or equivalent for engineering schools;
— if possible, the names of people (teachers or others) who can provide information on your skills and your work;
— a link to personal project repositories (e.g. GitHub);
— any other information you consider useful.
References
[1] C. Abrate, G. Preti, and F. Bonchi. Counterfactual explanations for graph classification through the lenses of density. In World Conference on Explainable Artificial Intelligence, pages 324–348. Springer, 2023.
[2] A. Duval and F. D. Malliaros. GraphSVX: Shapley value explanations for graph neural networks. In Machine Learning and Knowledge Discovery in Databases. Research Track: European Conference, ECML PKDD 2021, Bilbao, Spain, September 13–17, 2021, Proceedings, Part II 21, pages 302–318. Springer, 2021.
[3] Q. Huang, M. Yamada, Y. Tian, D. Singh, and Y. Chang. GraphLIME: Local interpretable model explanations for graph neural networks. IEEE Transactions on Knowledge and Data Engineering, 35(7):6968–6972, 2022.
[4] A. Mastropietro, G. Pasculli, C. Feldmann, R. Rodríguez-Pérez, and J. Bajorath. EdgeSHAPer: Bond-centric Shapley value-based explanation method for graph neural networks. iScience, 25(10), 2022.
[5] M. A. Munoz, L. Villanova, D. Baatar, and K. Smith-Miles. Instance spaces for machine learning classification. Machine Learning, 107(1):109–147, 2018.
[6] A. Perotti, P. Bajardi, F. Bonchi, and A. Panisson. GraphSHAP: Explaining identity-aware graph classifiers through the language of motifs. arXiv preprint arXiv:2202.08815, 2022.
[7] L. Veyrin-Forrer, A. Kamal, S. Duffner, M. Plantevit, and C. Robardet. In pursuit of the hidden features of GNN's internal representations. Data & Knowledge Engineering, 142:102097, 2022.
[8] L. Veyrin-Forrer, A. Kamal, S. Duffner, M. Plantevit, and C. Robardet. On GNN explainability with activation rules. Data Mining and Knowledge Discovery, pages 1–35, 2022.
[9] H. Yuan, H. Yu, J. Wang, K. Li, and S. Ji. On explainability of graph neural networks via subgraph explorations. In M. Meila and T. Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 12241–12252. PMLR, 18–24 Jul 2021.