
Explainable AI for Graph Data Augmentation in Machine Learning

PhD position – starting fourth quarter 2025

Location:
GREYC laboratory, CNRS UMR 6072, Université de Caen Normandie, 14000 Caen, France

Scientific context
Pandora This thesis is financed within the Pandora project, funded by the French ANR (National Research Agency) and underway since February 2025. Pandora is situated in the context of explainable artificial intelligence (XAI) as applied to graph neural networks (GNNs). By focusing on the internal functioning of GNNs, the objectives of the project are as follows:
— characterize, understand and clearly explain the internal workings of GNNs using pattern extraction techniques;
— uncover statistically significant patterns of neural activation, called “activation rules,” to determine how networks encode concepts [7, 8];
— translate these activation rules into graph patterns interpretable by a user;
— use this knowledge to improve GNNs by identifying learning biases, generating additional data, and building explanatory systems.
The thesis will be concerned with the last of those research questions.
The work carried out in this project (and by extension in the thesis) will be partially based on molecular data from biochemical experiments conducted within our collaboration with the CERMN laboratory (Centre d’Études et de Recherche sur le Médicament de Normandie) at the University of Caen Normandy.

Problem setting In machine learning, we do not always have training datasets that are sufficiently representative of the real world (for example, chemical/biological experiments often focus only on certain well-explored molecules or certain therapeutic targets). How can we detect that a training dataset is insufficient? Two non-exhaustive criteria:
— parts of the data space may not be represented (e.g. certain node/edge label combinations never occur; a toy illustration of this criterion follows at the end of this subsection);
— the learned model may be unreliable in some subspaces of the data (the reliability of a supervised model can be studied, for example, by looking at the importance of instances in the construction of decision boundaries).
The literature contains methods to characterize data in a model-independent manner [5]
and methods to characterize the behavior of a model based on the components of the
individual graphs considered [9, 2, 6, 3, 4, 1]. However, there is no approach that establishes

the link between data and the performance of a specific model. Furthermore, there exist
no approaches for augmenting the data as a means for improving model performance and
reliability. The thesis is intended to address these gaps.
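
To make the first criterion concrete, the following toy Python sketch enumerates the node/edge label combinations observed in a small dataset and lists those that never occur. The graph encoding and label vocabularies are illustrative assumptions, not project specifications:

    from itertools import product

    # Toy encoding: each graph is (node_labels, edges), where node_labels maps
    # a node id to its label and edges is a list of (u, v, edge_label) triples.
    graphs = [
        ({0: "C", 1: "C", 2: "O"}, [(0, 1, "single"), (1, 2, "double")]),
        ({0: "C", 1: "N"}, [(0, 1, "single")]),
    ]
    node_vocab = {"C", "N", "O"}
    edge_vocab = {"single", "double"}

    def observed_combinations(graphs):
        """Collect every (node label, edge label, node label) triple that occurs."""
        seen = set()
        for node_labels, edges in graphs:
            for u, v, e in edges:
                a, b = sorted((node_labels[u], node_labels[v]))
                seen.add((a, e, b))
        return seen

    seen = observed_combinations(graphs)
    possible = {(min(a, b), e, max(a, b))
                for (a, b), e in product(product(node_vocab, repeat=2), edge_vocab)}
    for triple in sorted(possible - seen):
        print("never observed:", triple)

On real molecular data, the set of possible combinations would instead come from domain knowledge (e.g. chemically valid atom/bond pairings), so that truly impossible combinations are not flagged as coverage gaps.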

Objectives
This thesis has three objectives. First, we want to characterize graph datasets at a global level, in a way similar to what is already done for vectorial datasets; a toy sketch of such dataset-level descriptors follows this paragraph. Second, we want to design one (or more) approaches that use the explanations of the behavior of GNNs to identify relevant instances of the training set. Finally, we will leverage the results of the first two points to generate additional data instances that improve the dataset and therefore render GNNs more accurate and more robust.
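
As a minimal sketch of what such a global characterization could look like (assuming networkx graphs; the chosen descriptors are illustrative and are not the meta-features of [5]):

    import networkx as nx
    import numpy as np

    def graph_features(g):
        """Simple per-graph descriptors: size, density, degree statistics."""
        degrees = [d for _, d in g.degree()]
        return [g.number_of_nodes(), g.number_of_edges(), nx.density(g),
                float(np.mean(degrees)), float(np.max(degrees))]

    def dataset_signature(graphs):
        """Aggregate per-graph descriptors into one dataset-level vector,
        by analogy with meta-features computed on vectorial datasets."""
        feats = np.array([graph_features(g) for g in graphs])
        return np.concatenate([feats.mean(axis=0), feats.std(axis=0)])

    # Toy dataset: random graphs standing in for real molecular graphs.
    dataset = [nx.gnp_random_graph(10, 0.3, seed=0),
               nx.gnp_random_graph(12, 0.2, seed=1)]
    print(dataset_signature(dataset))

Such signatures make whole datasets comparable to one another, which is the starting point of instance-space analysis for vectorial data [5].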

Topic and overview of the work plan of the thesis


In short, the thesis deals with the use of patterns learned from GNNs to improve GNNs by identifying learning biases, generating additional data, and building explanatory systems. More precisely, we wish to develop new methods to improve the learning of graph models by relying on the analysis of the internal functioning of these models via, for example, activation rules expressed in the latent space. This will involve analyzing decision boundaries and characterizing the errors of the model studied, in the data space or in their latent representations, in order to propose corrective solutions. This approach can be broken down into two sub-problems:
Data characterization and bias identification. Characterizing the training data can help identify instances on which the model commits errors, but also detect whether the data themselves are a source of bias in learning. One direction of work is to study the complexity of activation rules and compare them to domain knowledge.
Targeted generation of additional data. Once the model’s limitations have been identified, we want to automatically define "corrective patches" to improve the model’s robustness. A preferred line of work will be the generation of targeted additional data that allows the model to better separate the data according to the class studied in the constructed representation.
The first problem, i.e. data characterization, will start from the knowledge developed in meta-learning for vectorial data, combined with existing work on explaining GNN predictions and on activation rules (a toy sketch of the activation-rule idea follows below).
The second problem poses relatively complex research questions, since realistic graph data with desired properties is rather hard to generate. While a number of graph data generators exist in the literature, the generated data have often been found to lack properties observed in real-world data.
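
The following numerical sketch illustrates the activation-rule idea only in spirit: it uses random stand-ins for GNN activations and a simple support/confidence filter, not the actual mining algorithm of [7, 8]:

    import numpy as np

    rng = np.random.default_rng(0)

    # Stand-ins for real inputs: hidden-layer activations of a trained GNN
    # (one row per graph) and binary class labels.
    activations = rng.normal(size=(200, 16))
    labels = rng.integers(0, 2, size=200)
    activations[labels == 1, 3] += 2.0  # plant one class-correlated component

    # A hidden component is "active" on a graph if above its median value.
    active = activations > np.median(activations, axis=0)

    def rule_stats(active, labels, j):
        """Support and confidence of the rule 'component j active => class 1'."""
        fires = active[:, j]
        return fires.mean(), labels[fires].mean() if fires.any() else 0.0

    for j in range(active.shape[1]):
        support, confidence = rule_stats(active, labels, j)
        if confidence > 0.8:  # keep only strongly class-associated components
            print(f"component {j}: support={support:.2f}, confidence={confidence:.2f}")

In the project, rules of this kind would then be translated back into graph patterns interpretable by a user, as described in the Pandora objectives.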

Preliminary work plan


1. Conduct a literature review of methods for explaining the behavior of GNN models
[9, 2, 8, 7, 6, 3, 4, 1]. The aim of this study is to establish in what sense the different
methods identify certain aspects of the data used to train the model.
2. Design and implement approaches to identify the instances (graphs) implicated by the explanatory descriptors/rules. It is not certain that such approaches will be found for all of them, which will then lead to a selection of descriptors. Highlighting the instances and subgraphs linked to the explanatory descriptors/rules will also make it possible to determine how the descriptors characterize different subsets of data.
3. Develop a formalism to extend concepts defined for vector data (density, decision boundaries, value distribution) to graph data. This formalism, in combination with the results of step 2, will make it possible to determine where learning instances are missing in a training dataset and thus where it is useful to generate synthetic data (a density-based sketch of this idea follows the list).
4. Exploit the information derived from the first three points, as well as others —
for instance graph patterns extracted using pattern mining methods — to define
constraints on symbolic data generators to arrive at data with precise properties
that fill the gaps in the data sets.
5. Evaluate the generated data in the context of project use cases, particularly molecular data activity prediction.
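
As a minimal sketch of the density-based view in step 3 (the embeddings here are random stand-ins for the readout vectors of a trained GNN, not project outputs):

    import numpy as np

    rng = np.random.default_rng(1)

    # Stand-ins: embeddings of the training graphs and of candidate regions
    # of the data space whose coverage we would like to probe.
    train_emb = rng.normal(size=(300, 8))
    candidates = rng.normal(scale=2.0, size=(20, 8))

    def knn_sparsity(candidates, train_emb, k=5):
        """Mean distance to the k nearest training embeddings; large values
        flag regions where learning instances are missing."""
        d = np.linalg.norm(candidates[:, None, :] - train_emb[None, :, :], axis=-1)
        return np.sort(d, axis=1)[:, :k].mean(axis=1)

    sparsity = knn_sparsity(candidates, train_emb)
    worst = np.argsort(sparsity)[::-1][:5]
    print("least-covered candidate regions:", worst)
    print("sparsity scores:", np.round(sparsity[worst], 2))

The under-covered regions identified this way would be the natural targets for the constrained symbolic generators of step 4.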

Keywords: Statistical learning, graph neural networks, explainable AI, data mining.
Thesis period: starting in autumn 2025.
Remuneration: approximately €2,200 gross per month.
Supervising team:
— Bruno Crémilleux (GREYC – Université de Caen Normandie)
— Marc Plantevit (LRE – EPITA)
— Albrecht Zimmermann (GREYC – Université de Caen Normandie)

Candidate profile
The candidate must be enrolled in the final year of a Master’s degree or an engineering degree, or hold such a degree, in a field related to computer science or applied mathematics, and have solid programming skills. Experience in data science, deep learning, etc. would be a plus. The candidate must be able to write scientific reports and communicate research results at conferences in English.

To apply
Application period: from now until the position is filled.
Send the following documents (exclusively in PDF format) to bruno.cremilleux@unicaen.fr, marc.plantevit@epita.fr and albrecht.zimmermann@unicaen.fr:
— a cover letter explaining your qualifications, experience and motivation for this subject;
— a curriculum vitae;
— transcripts of grades (if possible with ranking) for the 3rd year of your Bachelor’s degree and the 1st and 2nd years of your Master’s degree, or the equivalent for engineering schools;
— if possible, the names of people (teachers or others) who can provide information on your skills and your work;
— a link to personal project repositories (e.g. GitHub);
— any other information you consider useful.

References
[1] C. Abrate, G. Preti, and F. Bonchi. Counterfactual explanations for graph classification through the lenses of density. In World Conference on Explainable Artificial Intelligence, pages 324–348. Springer, 2023.
[2] A. Duval and F. D. Malliaros. GraphSVX: Shapley value explanations for graph neural networks. In Machine Learning and Knowledge Discovery in Databases. Research Track: European Conference, ECML PKDD 2021, Bilbao, Spain, September 13–17, 2021, Proceedings, Part II, pages 302–318. Springer, 2021.
[3] Q. Huang, M. Yamada, Y. Tian, D. Singh, and Y. Chang. GraphLIME: Local interpretable model explanations for graph neural networks. IEEE Transactions on Knowledge and Data Engineering, 35(7):6968–6972, 2022.
[4] A. Mastropietro, G. Pasculli, C. Feldmann, R. Rodríguez-Pérez, and J. Bajorath. EdgeSHAPer: Bond-centric Shapley value-based explanation method for graph neural networks. iScience, 25(10), 2022.
[5] M. A. Munoz, L. Villanova, D. Baatar, and K. Smith-Miles. Instance spaces for machine learning classification. Machine Learning, 107(1):109–147, 2018.
[6] A. Perotti, P. Bajardi, F. Bonchi, and A. Panisson. GraphSHAP: Explaining identity-aware graph classifiers through the language of motifs. arXiv preprint arXiv:2202.08815, 2022.
[7] L. Veyrin-Forrer, A. Kamal, S. Duffner, M. Plantevit, and C. Robardet. In pursuit of the hidden features of GNN’s internal representations. Data & Knowledge Engineering, 142:102097, 2022.
[8] L. Veyrin-Forrer, A. Kamal, S. Duffner, M. Plantevit, and C. Robardet. On GNN explainability with activation rules. Data Mining and Knowledge Discovery, pages 1–35, 2022.
[9] H. Yuan, H. Yu, J. Wang, K. Li, and S. Ji. On explainability of graph neural networks via subgraph explorations. In M. Meila and T. Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 12241–12252. PMLR, 2021.
