1 Introduction

Brain-computer interfaces (BCIs) link machines and human brains, using brainwaves as the means of communication for several purposes [1]. Such a link is crucial for automating tasks such as the prediction of epileptic seizures or the detection of neurological pathologies. BCIs also commonly use brain signals as control signals for devices such as keyboards or joysticks, which can improve the quality of life of severely disabled patients, and they serve many non-medical applications such as video games, robot control, or authentication [13]. The most widely used sensing modality is electroencephalography (EEG), which relies on electrodes placed on the scalp to detect variations in electrical activity. A BCI processes the collected data with signal processing techniques to keep the important features. Then, a machine learning model makes a decision depending on the use case.

The best-known applications are related to Motor Imagery (MI) [15], a neural response produced when a person performs a movement or merely imagines it. Unfortunately, the signals are intrinsically non-stationary, non-linear, and noisy [13]. Overcoming those problems requires sophisticated algorithms that need human intervention (e.g. eye-blink elimination) and computational power that can be constraining. Deep learning offers a solution to the previously cited obstacles [9]. It extracts the features automatically, without human-engineered features, and classifies in the same process, which enables end-to-end approaches. Several other advances in activation functions, regularization, training strategies, and data augmentation have led to state-of-the-art performance in several fields [3, 7, 10]. It is also possible to explain the decisions of deep classifiers with advanced visualization methods, such as weight visualization, to discover the learned features.

In this paper, we propose a new convolutional neural network (ConvNet) architecture based on Inception for motor imagery classification. It allows the data to be processed in parallel. In our approach, we use the multivariate raw signal as input, with a bandpass filter as preprocessing. We use the same first block as [12] but with higher complexity, which increases the capacity of the network. Then, an Inception block extracts temporal features more efficiently, which improves performance, speeds up learning despite the depth, and reduces the degradation problem [18]. To test our approach, we use dataset IIa from the BCI Competition IV [19]. As baselines, we compare with FBCSP and ShallowConvNet, which are the state-of-the-art techniques [2]. We also investigate some visualization techniques to examine the ability of our network to extract relevant features.

The rest of the paper is organized as follows. We present related work in Sect. 2 and introduce our method in Sect. 3. In Sect. 4, we evaluate the performance and visualize the learned features. Section 5 discusses the results and concludes the paper.

2 Related Works

The first interesting approach was a ConvNet that uses raw EEG data for a P300 speller application [6]. It uses convolutional layers that extract temporal and spatial features, and it is inspired by Filter Bank Common Spatial Pattern (FBCSP) [2]. A convolution is performed with a kernel of size \((1,n_t)\), followed by another convolution with a kernel of size (C, 1), where C is the number of channels. Then, a softmax layer classifies the extracted features. Similar architectures were introduced for MI in [17]. ShallowConvNet is a shallow ConvNet composed of the two convolutional layers followed by the classification layers. DeepConvNet is a deep architecture that includes more aggregation layers after the convolutional layers. ShallowConvNet outperforms the state-of-the-art FBCSP. EEGNet was proposed in [12] as a compact version of the existing methods. It relies on depthwise and separable convolutions, which reduce the number of parameters to only 796 for EEGNet-4,2. EEGNet performs worse than ShallowConvNet since it was not trained with the same data augmentation (cropped training) suggested by [17]. However, cropped training requires a long training time, even for a single subject, compared with EEGNet, which can be problematic.

3 Method

3.1 EEG Properties and Data Representation

MI gives rise to fluctuations in the amplitude of the neural signals generated in the primary sensorimotor cortex [14]. They appear as increases and decreases of amplitude in specific frequency bands that are related to motor activity, called Event-Related Synchronization (ERS) and Event-Related Desynchronization (ERD), respectively. The targeted patterns lie in the \(\mu \) band, [8, 13] Hz, and the \(\beta \) band, [13, 30] Hz. As input, each trial is turned into a matrix in \( \mathbb {R}^{C \times T}\), where C is the number of electrodes and T is the number of time samples. We sample our data at 128 Hz and use the segment [0.5, 2.5] s after the cue.

3.2 Incep-EEGNet

We propose Incep-EEGNet as it is illustrated in Fig. 1. It is a multistage ConvNet that is based on Inception [18]. It is composed as follows:

The first part is the same as EEGNet from [12]. It is based on two convolutional layers that act as temporal and spatial filters, similarly to FBCSP, which is a widely used approach. We use a temporal convolutional layer with F kernels of size (1, tx) with padding. This layer learns to extract relevant temporal features, as it acts as a FIR filter. We choose a kernel length of 32, which corresponds to a duration of 0.25 s for a signal sampled at 128 Hz. A second convolution extracts the spatial features. It relies on depthwise convolution, which produces a fixed number of feature maps per input feature map and considerably reduces the computational cost. Its kernel has a size of (C, 1), where C is the number of channels. We also use batch normalization after each convolution and an activation after the second one. This layer allows only the important electrodes to contribute to the decision and learns frequency-specific spatial filters, where the depth parameter D controls the number of connections.
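
To make the two filtering steps concrete, the following is a minimal NumPy sketch of a temporal convolution followed by a depthwise spatial convolution. It is an illustration under stated assumptions, not the implementation used in this work: the function names are hypothetical, the weights are random, batch normalization and the activation are omitted, the split of the "same" padding is an assumption, and F and D are set to small values to keep the example light. As in EEGNet, each temporal feature map is assumed to receive D spatial filters of its own.

```python
import numpy as np

def temporal_conv_same(x, kernels):
    """Filter every channel of x (C, T) with each temporal kernel (F, K).

    Zero padding keeps the output length equal to T.
    Returns an array of shape (F, C, T).
    """
    n_kernels, k = kernels.shape
    n_channels, n_times = x.shape
    left = (k - 1) // 2                      # assumed padding split
    padded = np.pad(x, ((0, 0), (left, k - 1 - left)))
    out = np.empty((n_kernels, n_channels, n_times))
    for f in range(n_kernels):
        for c in range(n_channels):
            # Cross-correlation, which is what convolutional layers compute.
            out[f, c] = np.correlate(padded[c], kernels[f], mode="valid")
    return out

def depthwise_spatial_conv(maps, spatial):
    """Apply D spatial filters of size (C, 1) to each temporal feature map.

    maps: (F, C, T), spatial: (F, D, C). Returns (F * D, T).
    """
    out = np.einsum("fdc,fct->fdt", spatial, maps)
    return out.reshape(-1, maps.shape[2])

rng = np.random.default_rng(0)
C, T, K = 22, 256, 32        # electrodes, samples, kernel length (0.25 s at 128 Hz)
F, D = 4, 2                  # illustrative values only
x = rng.standard_normal((C, T))
kernels = rng.standard_normal((F, K))
spatial = rng.standard_normal((F, D, C))

features = depthwise_spatial_conv(temporal_conv_same(x, kernels), spatial)
print(features.shape)        # (F * D, T)
```

Because every spatial filter only sees the output of one temporal kernel, each one is tied to the frequency content selected by that kernel, which is what makes the spatial filters frequency-specific.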

In the second part, we introduce the novelty of this architecture, which is an Inception-based block. This block addresses a drawback of EEGNet, which is too shallow and too compact; this restricts the capacity of the network and leads to overfitting in most cases. Even with a deeper network, the performance remains low because of the degradation problem, as with DeepConvNet. Hence, we suggest using an Inception-based stage that learns features from several branches:

  • A convolutional branch with a convolution with a kernel size of (1, 7).

  • A convolutional branch with a convolution with a kernel size of (1, 9).

  • A branch with a pointwise convolution with a kernel size of (1, 1) and a stride of (1, 2).

  • A branch with an average pooling with a kernel size of

We merge the outputs of the different branches by stacking them along the feature-map dimension. We then apply batch normalization and an activation. We restrict dropout to after the final activation because we observed no improvement from using it elsewhere. Each convolutional branch includes a pointwise convolution that reduces the number of feature maps to 64 and an average pooling layer with a size of (1, 2).
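
The branch-and-merge structure described above can be sketched in NumPy as follows. This is a rough illustration under several assumptions, not the authors' code: the function and parameter names are hypothetical, the order of operations inside a convolutional branch (pointwise reduction, temporal convolution, pooling) is assumed, the convolutions use zero padding to preserve length, the numbers of input and output feature maps are illustrative, and batch normalization and the activation are left out. The average-pooling branch is omitted because its kernel size is not specified in the text.

```python
import numpy as np

def pointwise(x, w, stride=1):
    """(1, 1) convolution: mixes feature maps at each time step.

    x: (M, T), w: (N, M). A stride > 1 subsamples the time axis.
    """
    return w @ x[:, ::stride]

def temporal_conv(x, w):
    """Length-preserving temporal convolution. x: (M, T), w: (N, M, K)."""
    k = w.shape[2]
    left = (k - 1) // 2
    padded = np.pad(x, ((0, 0), (left, k - 1 - left)))
    # windows has shape (M, T, K): one length-K window per time step.
    windows = np.lib.stride_tricks.sliding_window_view(padded, k, axis=1)
    return np.einsum("nmk,mtk->nt", w, windows)

def avg_pool(x, size=2):
    """Non-overlapping average pooling along the time axis."""
    n_keep = (x.shape[1] // size) * size
    return x[:, :n_keep].reshape(x.shape[0], -1, size).mean(axis=2)

def conv_branch(x, w_reduce, w_conv):
    """Pointwise reduction, temporal convolution, then (1, 2) pooling."""
    return avg_pool(temporal_conv(pointwise(x, w_reduce), w_conv), 2)

def inception_block(x, p):
    branches = [
        conv_branch(x, p["reduce7"], p["conv7"]),   # kernel (1, 7)
        conv_branch(x, p["reduce9"], p["conv9"]),   # kernel (1, 9)
        pointwise(x, p["point"], stride=2),         # kernel (1, 1), stride (1, 2)
    ]
    # Stack the branch outputs along the feature-map dimension.
    return np.concatenate(branches, axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 64))                   # illustrative input: 16 maps, 64 samples
params = {
    "reduce7": rng.standard_normal((64, 16)),       # reduce to 64 feature maps
    "conv7": rng.standard_normal((8, 64, 7)),
    "reduce9": rng.standard_normal((64, 16)),
    "conv9": rng.standard_normal((8, 64, 9)),
    "point": rng.standard_normal((8, 16)),
}
out = inception_block(x, params)
print(out.shape)
```

Note that the (1, 2) pooling in the convolutional branches and the (1, 2) stride in the pointwise branch both halve the time axis, so the outputs have matching lengths and can be stacked.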

In the final part, we use an additional convolutional layer with \(F*D\) kernels of size (1, 5), along with batch normalization, activation, and dropout. We use a global average pooling layer to reduce the number of parameters to \(2*F\). Then, we use a softmax classification layer with 4 units that represent the 4 classes of the dataset.
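
As an illustration of the classification head, the sketch below applies global average pooling and a 4-unit softmax layer to a stack of feature maps. It is a simplified example with hypothetical function names and random weights; the number of feature maps (128, i.e. \(2*F\) with \(F=64\)) and the time length are chosen only for the demonstration.

```python
import numpy as np

def global_average_pool(maps):
    """Collapse each feature map (row) to its mean over time: (N, T) -> (N,)."""
    return maps.mean(axis=1)

def softmax(z):
    """Numerically stable softmax over a vector of class scores."""
    e = np.exp(z - z.max())
    return e / e.sum()

def classify(maps, w, b):
    """Dense layer on the pooled features followed by softmax."""
    return softmax(w @ global_average_pool(maps) + b)

rng = np.random.default_rng(0)
maps = rng.standard_normal((128, 16))      # illustrative: 128 feature maps, 16 time steps
w = rng.standard_normal((4, 128)) * 0.1    # 4 output units, one per MI class
b = np.zeros(4)

probs = classify(maps, w, b)
print(probs.shape, probs.sum())
```

Global average pooling keeps one value per feature map, so the dense layer needs only one weight per feature map and class, regardless of the temporal length.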

Fig. 1. Architecture of the proposed system with layer hyperparameters

3.3 Hyperparameters and Training

Our implementation uses publicly available preprocessing code based on braindecode [17]. We trained deep learning methods on a NVIDIA P100 1.12.0. We train our method by optimizing the categorical cross-entropy with the Adam optimizer [11] with Nesterov momentum. The dropout probability is 0.5, as advised by [3]. We use a batch size of 64, as for EEGNet [12]. We fix the network parameters to \(F=64\) and \(D=4\). The Exponential Linear Unit (ELU) is chosen as the activation [7]. We train our ConvNets in three stages. We first train for 100 epochs with a learning rate (Lr) of \(5\times 10^{-4}\). At the end of this stage, we retrain the network for 50 epochs with Lr set to \(1\times 10^{-4}\) on the merged training and validation sets. We then repeat the same operation for 30 epochs with Lr set to \(2\times 10^{-5}\). ShallowConvNet [17] was trained in a similar way.
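
The staged training procedure can be written down as a small schedule. The sketch below only encodes the epochs and learning rates given above and drives a user-supplied training callback; the names `Stage`, `SCHEDULE`, and `run_schedule` are hypothetical, and the assumption that the third stage also uses the merged training and validation sets follows from reading "the same operation" literally.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    epochs: int
    learning_rate: float
    merge_validation: bool   # train on training + validation sets if True

SCHEDULE = (
    Stage(epochs=100, learning_rate=5e-4, merge_validation=False),
    Stage(epochs=50, learning_rate=1e-4, merge_validation=True),
    Stage(epochs=30, learning_rate=2e-5, merge_validation=True),  # assumed merged
)

def run_schedule(train_one_epoch, schedule=SCHEDULE):
    """Call train_one_epoch(learning_rate, merge_validation) once per epoch.

    Returns the total number of epochs that were run.
    """
    total = 0
    for stage in schedule:
        for _ in range(stage.epochs):
            train_one_epoch(stage.learning_rate, stage.merge_validation)
            total += 1
    return total

# Usage with a placeholder callback that just records what it was asked to do.
calls = []
n_epochs = run_schedule(lambda lr, merged: calls.append((lr, merged)))
print(n_epochs)
```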

4 Experiment

4.1 Dataset

We use dataset IIa from the BCI Competition IV [19]. It contains EEG data of four MI tasks (right hand, left hand, foot, and tongue imagined movements) from nine subjects, recorded with a set of 22 electrodes placed on the scalp. The recording took place in two different sessions, where the first was defined as the training set and the second as the testing set. The subjects were asked to perform 288 MI tasks per session (72 trials for each class) after a cue. The original data is sampled at 240 Hz and filtered with a bandpass filter between 0.1 Hz and 100 Hz. We apply additional preprocessing to the data as described in [17]. We resample the signals at 128 Hz and apply a bandpass filter between 1 Hz and 32 Hz. We use \(20\%\) of the training set as a validation set. We use a cropping data augmentation by extracting the segments [0.3, 2.3] s, [0.4, 2.4] s, [0.5, 2.5] s, [0.6, 2.6] s, and [0.7, 2.7] s post cue, only on the training set (1152 trials). The validation and testing sets contain only the [0.5, 2.5] s segment to prevent leakage (for the validation set) that could compromise the training. Therefore, the input has a shape of \(22 \times 256\).
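
The cropping augmentation amounts to slicing five shifted 2 s windows out of each training trial. The following sketch shows one way to do this in NumPy, assuming trials are stored as an array of shape (trials, channels, samples) at 128 Hz with the cue at a known sample index; the function names, the rounding of window starts to the nearest sample, and the placeholder data are assumptions made for illustration.

```python
import numpy as np

FS = 128                                   # sampling rate after resampling (Hz)
CROP_STARTS = (0.3, 0.4, 0.5, 0.6, 0.7)    # window starts in seconds after the cue
CROP_LENGTH = 2.0                          # window length in seconds

def crop(trials, start_s, cue_index=0, fs=FS, length_s=CROP_LENGTH):
    """Extract one window from every trial. trials: (N, C, T_full)."""
    first = cue_index + int(round(start_s * fs))
    n_samples = int(round(length_s * fs))
    return trials[:, :, first:first + n_samples]

def augment(trials, labels):
    """Stack all shifted crops; labels are repeated once per crop."""
    crops = [crop(trials, start) for start in CROP_STARTS]
    return np.concatenate(crops, axis=0), np.tile(labels, len(CROP_STARTS))

# Placeholder data: 10 trials, 22 channels, 3 s after the cue.
trials = np.zeros((10, 22, 3 * FS))
labels = np.arange(10) % 4

x_train, y_train = augment(trials, labels)
x_eval = crop(trials, 0.5)                 # validation/test use only [0.5, 2.5] s
print(x_train.shape, x_eval.shape)
```

The split into training and validation trials has to happen before augmentation, so that crops of the same trial never end up on both sides.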

4.2 Results

To assess the performance of our method, we compare it with FBCSP, Riemannian geometry (RG) [4], Bayesian optimization (BO) [5], and ShallowNet [17]. Table 1 shows the classification results of our method and the baselines in terms of accuracy. The proposed method outperforms the baselines for several subjects (S2, S3, S5, S6, S7, S9). However, BO obtained better results for S1 and S8, whereas ShallowNet performs better for S4. FBCSP2 and RG, on the other hand, did not achieve higher results. For a more advanced evaluation, we conduct statistical testing with the Wilcoxon test to evaluate the significance of the results on the mean value. It shows that our method has a statistically significant difference compared with BO, with \(p < 0.05\). Compared with FBCSP2 and RG, the difference is highly significant, with \(p < 0.01\).

Table 1. Classification accuracy (%) comparison of our method and the baselines

Table 2 shows the classification results of our method and the baselines in terms of kappa. Our method outperforms the baselines for most of the subjects. It only failed to outperform FBCSP1 for S2 and ShallowNet for S4. Once again, FBCSP2 and RG obtained poor results. Statistical testing shows that the increase in mean kappa is statistically significant with \(p < 0.05\) for FBCSP1, MDRM, and ShallowNet. For the other methods, the difference is highly significant at \(p < 0.01\).
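
For readers unfamiliar with the kappa metric, the sketch below computes Cohen's kappa from a confusion matrix as \(\kappa = (p_o - p_e)/(1 - p_e)\), where \(p_o\) is the observed accuracy and \(p_e\) is the agreement expected by chance. The function name is hypothetical and the confusion matrix is synthetic, constructed only to illustrate the computation; it does not reproduce any value from the tables.

```python
import numpy as np

def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows: true, columns: predicted)."""
    confusion = np.asarray(confusion, dtype=float)
    total = confusion.sum()
    p_observed = np.trace(confusion) / total
    # Chance agreement from the row and column marginals.
    p_chance = (confusion.sum(axis=0) * confusion.sum(axis=1)).sum() / total**2
    return (p_observed - p_chance) / (1.0 - p_chance)

# Synthetic example: 4 classes, 72 trials each, 54 correct and 6 in each wrong class.
synthetic = np.full((4, 4), 6)
np.fill_diagonal(synthetic, 54)

print(round(cohen_kappa(synthetic), 4))
```

With four balanced classes the chance level is 0.25, so an accuracy of 75% corresponds to a kappa of about 0.67 in this synthetic case.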

Table 3 and Table 4 show the confusion matrices of Incep-EEGNet and FBCSP2, respectively. They show that both methods have difficulty classifying the foot class. Both also confuse the right-hand and left-hand classes. Overall, our method performs better than the reference.

Table 2. Kappa values comparison of our methods and the baselines
Table 3. Confusion matrix of Incep-EEGNet
Table 4. Confusion matrix of FBCSP
Fig. 2. Sample of relevant convolutional weights.

Figure 2a represents the Fourier transform of a temporal filter learned in the first convolution, which was designed to extract the temporal features of the EEG signals. As expected, Incep-EEGNet learned the frequencies that are involved in the MI neural response. We also observe a peak at 55 Hz, which may indicate that MI is also characterized by this band, as reported by [8]. Figure 2b shows a spatial filter reconstructed by interpolation of the weights. The scale on the right ranges from 1 to \(-1\). It shows that Incep-EEGNet extracts the signals from the electrodes C3, Cz, and C4. Those electrodes cover the part of the brain that is responsible for the movement of the hands and the feet.
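
Inspecting a temporal kernel in the frequency domain can be done with a plain FFT of its weights. The sketch below shows the idea on a synthetic 32-tap kernel (a 12 Hz sinusoid standing in for learned weights, which are not available here); the function name and the kernel are assumptions for illustration only.

```python
import numpy as np

def kernel_spectrum(kernel, fs=128.0, n_fft=None):
    """Magnitude spectrum of a 1-D temporal kernel.

    Returns (frequencies in Hz, magnitudes). n_fft > len(kernel) zero-pads
    the kernel to obtain a finer frequency grid.
    """
    kernel = np.asarray(kernel, dtype=float)
    n = kernel.size if n_fft is None else n_fft
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    magnitude = np.abs(np.fft.rfft(kernel, n=n))
    return freqs, magnitude

# Synthetic stand-in for a learned kernel: 32 taps at 128 Hz, tuned to 12 Hz.
t = np.arange(32) / 128.0
kernel = np.sin(2 * np.pi * 12.0 * t)

freqs, magnitude = kernel_spectrum(kernel)
print(freqs[np.argmax(magnitude)])
```

A 32-tap kernel at 128 Hz gives a frequency resolution of 4 Hz without zero-padding, so neighbouring peaks closer than that cannot be separated in such a plot.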

5 Discussion and Conclusion

Designing ConvNets for BCI applications can be problematic. Existing approaches need intensive data augmentation and have to be shallow, while deep ConvNets perform poorly. Therefore, we built Incep-EEGNet, a modified EEGNet with a greater number of feature maps, which increases the complexity of the model and allows it to outperform state-of-the-art methods. To mitigate the degradation problem, we use an Inception block whose several branches offer an efficient feature extraction layer. The pointwise convolution works as a residual connection that helps prevent vanishing gradients. Incep-EEGNet outperforms FBCSP, RG, and several ConvNets. Indeed, CSP techniques are considered state of the art for their efficiency, but as drawbacks they are sensitive to noise and artifacts and need larger datasets [16]. RG relies on a representation of the data that does not take frequency features into account, which its authors present as an advantage. However, this lowers its performance compared with FBCSP and ConvNets. ConvNet methods perform better and faster under the same conditions if they are used wisely. The overall performance is still low for several subjects, highlighting a strong incompatibility between some subjects.