1 Introduction

According to Statista (2022) estimates, the number of internet users globally will grow to 5.3 billion by 2023. The digital world promotes myriad downloads and data uploads, especially in text format. Given the significant amount of text data produced, manual curation of this content is time-consuming, expensive, and difficult, so algorithms that automatically classify documents and assist text mining tasks are highly desirable (Hassani et al. 2020).

Since texts are a rich source of information, techniques that automatically analyze and structure text cost-effectively are of great interest for academic and business applications. Text classification aims to assign a predefined category to a text. It is one of the principal tasks in Natural Language Processing (NLP), with several applications. Text classification is usually divided into supervised, unsupervised, and semi-supervised approaches (Thangaraj and Sivakami 2018). Supervised learning is the most expensive since it depends on labeled data; the most common algorithms explored are SVM, decision trees, KNN, and neural networks. Unsupervised learning is used when labeled data is not accessible, and its performance is not always good; the most common algorithms explored are K-Means, hierarchical clustering, and fuzzy c-means. Semi-supervised approaches are used when there is little labeled data and a lot of unlabeled data; common algorithms explored are co-training, self-training, transductive SVM, and graph-based methods.

Training data is a bottleneck in text classification, and a great challenge is the labeling process, which involves a human annotator who interprets and categorizes the content. This is time-consuming and expensive. Thus, machine learning techniques such as semi-supervised learning (SSL), which require few labeled data and scale to many applications, can enable real-time analysis. Semi-supervised approaches have therefore become a hot research topic, using the few available labeled data to classify unlabeled documents (Zhou 2021; Van Engelen and Hoos 2020). NLP and SSL techniques have been combined and employed in different domains such as sentiment analysis (Silva et al. 2016; Han et al. 2020; Lee et al. 2019), word sense disambiguation (Duarte et al. 2021; Li et al. 2019), fake news detection (Benamira et al. 2019), and text classification (Linmei et al. 2019; Alam et al. 2018), reaching interesting results.

The idea of combining labeled and unlabeled data has been investigated for a long time; it started in statistics, when some authors proposed building classifiers with likelihood maximization by testing all possible class assignments (Hartley and Rao 1968; Day 1969). Since then, different approaches have been proposed for SSL. Van Engelen and Hoos (2020) proposed a taxonomy dividing the techniques into wrapper methods, unsupervised preprocessing, intrinsically semi-supervised, and graph-based. In recent years, some surveys have presented text classification algorithms (Kowsari et al. 2019; Kadhim 2019), the main deep learning approaches used in text classification (Minaee et al. 2021), and feature selection techniques (Deng et al. 2019); however, as far as we know, no review has focused on SSL techniques for text classification, which are our focus.

One of the problems investigated in SSL is the degradation of classifier performance as unlabeled data is added to a fixed set of labeled data. Nigam et al. (2000) used expectation-maximization (EM) combined with a generative classifier to investigate unlabeled and labeled samples in text classification. Cozman and Cohen (2002) analyzed the maximum-likelihood estimator and a generative classifier, focusing on modeling errors to evaluate the effect of unlabeled samples. More recently, Banitalebi-Dehkordi et al. (2022) showed that unlabeled data from unconstrained distributions can cause a drop in the accuracy of SSL methods.

Text classification is the basis of many applications already mentioned, such as sentiment analysis, spam and fraud detection, and word sense disambiguation, making it a major topic in artificial intelligence. This paper aims to retrieve and analyze the main approaches for text classification, especially those employing SSL, and to present their strengths and weaknesses. This study is relevant to computer science because it helps researchers and practitioners learn the current research trends, develop customized models, and support project development and knowledge discovery. The information condensed here can help to optimize resources and maximize accuracy.

This work is an endeavor to retrieve and contextualize the main SSL approaches for text classification, as well as their recent advances. We accessed publications from the last 5 years in four digital libraries (ACM, IEEE Xplore, Science Direct, and Springer). Initially, we selected 1794 articles; after applying exclusion criteria, 157 articles were chosen for inclusion in this review. The main contributions of this work are: (i) identify the main languages, domains, and tasks explored; (ii) retrieve the datasets used; (iii) identify the primary text representations used; (iv) detect the main algorithms used; (v) organize the works into the SSL techniques; (vi) find the percentage of labeled data and the results achieved by the SSL approaches on the datasets; (vii) present the strengths, limitations, and current research trends in SSL text classification.

Section 2 presents the methodology employed to retrieve and select the articles included in this review. Section 3 shows the results in graphic format to facilitate viewing and interpretation, along with a brief discussion. Section 4 presents the works divided into the main SSL approaches. Section 5 shows a comparative analysis of the results obtained by the works on the main tasks and datasets. Section 6 presents the benefits and limitations of the techniques. Section 7 presents future opportunities in the area, and Section 8 concludes the paper.

2 Methodology

This section presents the methodology employed to perform the review and retrieve the works. Section 2.1 presents the research question that guided the work. Section 2.2 presents the sources and search terms used. Section 2.3 presents the selection process and exclusion criteria. Section 2.4 presents the main information we extracted from the articles.

2.1 Research question

Principal question: Which semi-supervised text classification approaches have achieved relevant results in recent years?

To answer the main question, we constructed the knowledge base considering the semi-supervised approaches, text representations, text classification tasks, machine learning algorithms, languages, domains, and datasets used.

2.2 Sources and search terms

First, we performed the search at the end of March 2021 on four digital libraries, taking into account publications from the previous 5 years: ACM Digital Library, IEEE Xplore, Science Direct, and Springer. Title, abstract, and keywords are the fields we used in the search expression to select the articles on ACM and Science Direct. We used the field All Metadata on IEEE Xplore. However, the Springer library returned many more articles than the others because it does not separate the articles by fields; instead, all the text is analyzed. The search expression used on the libraries was: (text classification) AND (semi-supervised). At the end of February 2022, we updated the review process, using the same libraries and keywords in this second stage.

2.3 Articles selection procedure

We screened the articles returned by the search expression, reading their titles, abstracts, and keywords. We read the method and experiments when in doubt about the suitability of an article for the proposed survey. We rejected articles that met at least one exclusion criterion. The exclusion criteria were:

  • Publication date of the article was before the initial date of the search;

  • Language of the article other than English;

  • Systematic Review, Survey, and Chapter publication;

  • Article without experiments;

  • No access;

  • Not suitable for the proposed objective.

2.4 Information extraction strategy

We did a full reading of the selected articles considering the following items that guided the information extraction:

  • Title, publication year, country, library;

  • Application domain;

  • Objective;

  • Dataset language;

  • Text representation;

  • Semi-supervised approach;

  • Machine learning algorithms and/or deep learning method;

  • Binary, and/or multi-class, and/or multi-label classification;

  • Evaluation metrics;

  • Classification results.

3 Results and discussion

This section presents the results obtained through the review. Figure 1 depicts the survey process. We retrieved 1794 articles with the search strategy. From this group, 1637 articles were rejected according to the exclusion criteria. Thus, 157 articles were selected for full reading and information extraction. Then, we performed a quantitative analysis to summarize the semi-supervised text classification information.

Fig. 1 Survey process of SSL for text classification. We accessed four digital libraries: ACM, IEEE Xplore, Science Direct, and Springer. Applying the search strategy, we retrieved 1794 articles; after applying the exclusion criteria, 157 articles were selected for inclusion in this review

In the following, Sect. 3.1 shows the number of publications per year and per country. Section 3.2 presents the main languages, domains, and tasks researched. Section 3.3 shows the common datasets used. Section 3.4 presents the text representations and Sect. 3.5 the SSL approaches investigated.

3.1 Publications per year and per country

Figure 2 depicts the scientific production per year. Since 2019 there has been an increase in the application of artificial neural networks (ANN) in the semi-supervised process. The years 2016, 2017, and 2018 had 17 publications that explored feature engineering or assisted semi-supervised approaches using ANN. In 2019, 19 publications used ANN; 2020, 2021, and the first two months of 2022 had a total of 40 articles exploring ANN.

Fig. 2 Number of articles published by year

We identified 33 countries that published semi-supervised text classification articles. Figure 3 shows the number of articles published per country; for visual and aesthetic reasons, we included only countries with at least four articles. China, the United States (USA), and India are the countries that produced the most articles. China published 55 articles, which represents 35.02% of the total articles cataloged in this survey. Next come the USA with 23 articles and India with 15 articles, representing 14.65% and 9.55% of the total published articles, respectively.

Brazil, Italy, and the United Kingdom published five articles each; Germany, Iran, Japan, Korea, and Vietnam published four articles each. Turkey published three articles, and the remaining 21 countries published one or two articles each. It is known that China has overtaken the United States as the world's largest producer of scientific articles. However, the USA is still considered a scientific powerhouse with high-level publications (Tollefson 2018).

Fig. 3 Publications per country

3.2 Explored languages, domains and tasks

Most of the NLP research was applied to the English language: we identified 127 articles, which correspond to 77.43% of the analyzed articles, as shown in Fig. 4. Despite the small numbers of published articles, we identified 15 languages other than English for which text classification was investigated. Chinese had 18 (10.98%) published articles, Vietnamese 4 (2.44%), and Arabic, Italian, and Brazilian Portuguese had 2 articles each. Each of the 11 remaining languages had one published article.

Fig. 4 Languages explored in the papers

We distinguished 16 domains across the SSL applications. Since some articles are associated with more than one domain, the number of articles distributed over the domains was 220, as shown in Fig. 5. For readability, we present only seven domains. News was the most used domain in text classification, with 56 (25.45%) articles. The majority of News datasets are accessible benchmarks with known and verified outcomes, such as 20 Newsgroups and Reuters 21578. In e-commerce, before customers make a purchase or hire a service, it is common for them to seek information from other consumers about certain brands or services. Sentiment analysis helps e-commerce companies understand how consumers feel about their products, supporting decision-making. We observed an increasing number of articles published in the Product and Service Reviews domain during the years analyzed, counting a total of 48 (21.82%) articles, where Amazon, Yelp, TripAdvisor, Movie Review, and IMDB were the prevalent datasets.

Currently, there are approximately 400 million Twitter users in the world, with 206 million active users per day (Dean 2022). Therefore, social networks generate abundant material that can be explored for understanding social behavior and its implications, e.g. sentiment analysis, emergency event detection, political purposes, fake news detection, and epidemiological studies. The Social Network domain had 25 (11.36%) articles dealing with sentiment analysis or short text classification. Forum Discussion had 7 (3.18%) articles; this domain encompasses online discussions on web platforms where users share their knowledge and argue about a given topic, and the generated textual data can be applied to text classification tasks, e.g. question classification.

In the Web Pages domain the articles mainly used the WebKB or DBpedia datasets, totaling 26 (11.82%) articles. The Scientific Articles domain had 22 (10.00%) articles, mainly related to node classification. The Health Area domain had 14 (6.36%) articles, and the Ohsumed dataset of medical abstracts was the most used. Domains other than the aforementioned had fewer articles: email, patent documents, internet advertisement, quotation, law, and education. These domains, represented as Others, had a total of 22 (10.00%) articles.

Fig. 5 Number of published articles per domain

We organized text classification into seven tasks, as shown in Fig. 6. Since some articles addressed more than one task, 177 articles were distributed over the tasks. The Generic Text Classification task, related to news, web pages, scientific articles, and documents, was explored in 72 (40.68%) articles. Then, Sentiment Analysis with 42 articles represents 23.73% of the total. The Short Text Classification task had 33 (18.64%) articles; in this case, we considered sentences and microblogging when not used for the Sentiment Analysis task, e.g. sarcasm detection, intention detection, misinformation detection, rumor detection, irony detection, fake comments, and so on. Question, Node, Topic, and Multi-Lingual Classification had 11 (6.21%), 9 (5.09%), 8 (4.52%), and 2 (1.13%) articles, respectively.

Fig. 6 Text classification tasks

3.3 Datasets

Table 1 Datasets used per domain (zh: Chinese, vi: Vietnamese)

Table 1 presents the benchmark datasets most used in the experiments, totaling 22 datasets. We identified another 114 datasets, but they are too specific to allow a meaningful comparison among the semi-supervised methods. The news domain had a total of 60 (34.68%) articles, where 20 Newsgroups and Reuters, with 50 articles, prevailed over the AG News and Sogou News datasets, the last one being a Chinese benchmark. Concerning short text, the Social Network and Product and Service Review domains had 66 (38.15%) articles used for sentiment analysis, social bot detection, deceptive review detection, and spam classification. In the Social Network domain, Twitter was widely explored in the English language with 12 articles, and Weibo in the Chinese language with 4 articles. In the Product and Service Review domain, the Movie Review and IMDB datasets had 21 (12.14%) articles, and the Amazon product categories (Books, DVD, Electronics, Kitchen, Music, Video) had 15 (8.67%) articles. Yelp, TripAdvisor, and Vietnamese datasets had a total of 14 articles; they are user reviews of restaurants, hotels, and places. The Scientific and Medical domain includes 34 (19.65%) articles related to scientific publications, and most of them experimented with the graph-based approach because the benchmark dataset structures (CiteSeer, PubMed, and DBLP) are appropriate for node classification. In the Web Page domain, 7 (4.05%) articles used the WebKB dataset, which is formed by web pages from computer science departments of various universities. Lastly, the TREC dataset was used in 6 (3.47%) articles for question classification.

3.4 Text representations

Table 2 displays the different types of text representation or feature engineering methods applied in the text classification process and their frequencies, in descending order. Despite bag of words (BoW) and term frequency–inverse document frequency (TF–IDF) being simple methods, they are still widely used. Word2Vec, fastText, and GloVe are language models that handle lexical semantics: the first two are based on ANN, and the last one is based on word co-occurrence. Word2Vec and its extensions, e.g. Sent2Vec and Doc2Vec, had 22 related articles. FastText had five articles, and GloVe had four articles. BERT, DistilBERT, ALBERT, and ELMo are context-sensitive word embedding methods; we identified 13 articles referring to the first three methods and 2 articles to the last one. We identified 35 articles that implemented deep learning methods to generate or improve word embeddings. Latent semantic analysis (LSA) and latent Dirichlet allocation (LDA) had eight articles. Information gain and mutual information had three articles.

Table 2 The most used text representation
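To make the contrast between term-frequency representations and neural word embeddings concrete, the sketch below (our illustration, not drawn from any surveyed article; the toy corpus and all names are hypothetical) builds sparse TF–IDF vectors with scikit-learn and dense Word2Vec vectors with Gensim, averaging word vectors into a simple document representation.

```python
# Minimal sketch contrasting TF-IDF and Word2Vec text representations.
# Assumes scikit-learn and Gensim (4.x); the toy corpus is hypothetical.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec

corpus = [
    "stocks fell as markets reacted to the news",
    "the team won the match in the final minute",
    "markets rallied after the economic news",
]

# Sparse term-frequency representation: one dimension per vocabulary term.
X_tfidf = TfidfVectorizer().fit_transform(corpus)

# Dense neural representation: Word2Vec learns one vector per word; a simple
# document vector is the mean of its word vectors.
tokens = [doc.split() for doc in corpus]
w2v = Word2Vec(sentences=tokens, vector_size=50, window=3, min_count=1, epochs=50)
X_w2v = np.array([w2v.wv[doc].mean(axis=0) for doc in tokens])

print(X_tfidf.shape, X_w2v.shape)  # (3, |vocabulary|) versus (3, 50)
```

Context-sensitive models such as BERT differ from both in that the vector for a word depends on its sentence, rather than being fixed per word.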

A comparison between the feature engineering methods based on ANN and those based on term occurrence/frequency is shown in Fig. 7.

Fig. 7 Feature engineering methods based on neural networks and term occurrence

ANN saw increasing application over the years, with four articles in 2017 and nine in 2018. 2019 had a sharp increase with 16 articles, followed by 19 articles in 2020 and 24 in 2021. The simplest text representation methods have decreased in use since 2017, although in 2021 the number of articles using traditional methods doubled compared with 2020. In many cases, traditional text representation methods were used for comparison with contextualized vector representations and/or as input to an ANN.

Figure 8 exhibits in more detail the frequency of published articles over the years using ANN methods and methods based on term occurrence and term frequency. Word2Vec and its extensions had an increase between 2016 and 2018, but remained practically stable in the following years. Context-sensitive pre-trained models appeared in the experiments in 2019 with BERT; in 2020 and 2021, experiments using ELMo appeared as well. Experiments with deep learning algorithms using their own embedding layer had two articles in 2016, one article in 2017, three in 2018, and an expressive number of published articles in the following years. GloVe is based on co-occurrence matrices built from a corpus and, like Word2Vec and fastText, is not context-sensitive. GloVe had one article each in 2018 and 2019, and two articles in 2022.

Fig. 8 Feature engineering methods per year

3.5 Semi-supervised approaches

We followed the taxonomy proposed by van Engelen and Hoos (2019) to categorize the semi-supervised approaches, as shown in Fig. 9. However, we took the liberty of inserting new approaches into the spectrum of semi-supervised algorithms to ensure categorization when a method did not match the taxonomy. Thus, considering the main method feature, we grouped the remaining articles into transfer learning, transductive support vector machines (TSVM), and not identified (N.I.) approaches. For the transfer learning approach, we separated articles that jointly used a limited number of labeled and a large amount of unlabeled target data in training. Two articles were related to TSVM, and for three articles there was no consensus about the approach used.

Fig. 9 Semi-supervised approaches per number of published articles

With 31 articles, Graph-based was the most used technique. Until 2018, 15 of 17 articles using graph methods were not related to ANN. Nevertheless, since 2019, 11 of 16 articles have combined ANN and graphs. Next, 30 articles employed the Self-training approach; of these, 21 applied traditional methods in feature engineering and text classification, and nine employed ANN. The third most used approach was Generative models, with a total of 22 articles, of which 14 applied ANN. Then come Feature extraction, Cluster-then-label, Co-training, Transfer learning, Perturbation-based, Boosting, TSVM and N.I., and Manifolds with 17, 16, 13, 10, 8, 5, 5, and 2 articles, respectively. Excluding the three most used approaches, there is a total of 76 articles in the remaining approaches, of which 35 applied ANN.

As can be seen in the previous paragraph, there has been a trend toward semi-supervised approaches using ANN over the years. Figure 10 clearly shows the behavior of traditional and ANN algorithms since 2016: there has been an increase in ANN and a decrease in traditional algorithms in the semi-supervised area. Although the use of traditional algorithms has been shrinking, until 2018 they still outnumbered ANN. The year 2019 seems to be the inversion point; thenceforth ANN predominated in the semi-supervised approaches.

Fig. 10 Traditional algorithms versus neural network algorithms applied to the semi-supervised approaches

A comparison of the ANN and traditional algorithms applied in the semi-supervised approaches is shown in Fig. 11. Concerning articles with traditional algorithms, SVM appeared in 33 (18.54%), Naive Bayes in 22 (12.36%), and decision trees in 14 (7.87%). Decision trees include the CART, J48, random forest, and C4.5 algorithms. Then come Logistic Regression, k-nearest neighbors (kNN), K-Means, and EM, with 13, 11, 4, and 4 articles, respectively.

Fig. 11 Traditional and neural network algorithms

ANN algorithms were grouped by their methods: long short-term memory (LSTM) and gated recurrent unit (GRU), Graph Neural Network (GNN), Convolutional Neural Network (CNN), bi-LSTM and bi-GRU, BERT, and neural network (misc), i.e. miscellaneous but less numerous algorithms. Neural network (misc), with 21 (11.80%) articles, includes different types of algorithms, e.g. multi-layer perceptron (MLP), autoencoder, ladder network, deep belief network (DBN), and capsule network. LSTM and GRU were used in 16 articles, while bi-LSTM and bi-GRU were used in 8 articles; recurrent neural network (RNN) algorithms thus comprise 24 (13.48%) articles. CNN, GNN, and BERT were applied in 14, 12, and 6 articles, respectively.

4 Semi-supervised learning for text classification

In this section, we present the main works using SSL and text mining. We divide the topics following the taxonomy proposed by Van Engelen and Hoos (2020). Section 4.1 presents the graph-based approaches. Section 4.2 presents the unsupervised preprocessing approaches, especially the feature extraction and cluster-then-label methods. Section 4.3 presents the wrapper methods, especially self-training, co-training, and boosting. Section 4.4 presents the intrinsically SSL approaches, especially perturbation-based, manifolds, and generative models. We also include the transfer learning methods in Sect. 4.5 and other approaches in Sect. 4.6.

4.1 Graph-based

Graph-based SSL methods propagate labels to unlabeled nodes in a constructed graph \(G=(V, E)\), where \(V=V_{l} \cup V_{u}\) is the set formed by the labeled nodes \(V_{l}\) and the unlabeled nodes \(V_{u}\). The nodes \(V=\{v_{1}, v_{2}, \ldots , v_{n}\}\) represent the data points. E is associated with an \(n \times n\) matrix W containing, for each pair of nodes \(v_{i}\) and \(v_{j}\), a non-negative edge weight \(w_{i,j}\) that represents the similarity between the nodes.
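As a concrete illustration of this formulation, the sketch below (a minimal NumPy implementation, assuming a Gaussian kernel for the edge weights \(w_{i,j}\) and simple iterative propagation with clamping; not taken from any specific surveyed work) propagates labels from \(V_{l}\) to \(V_{u}\) over the weight matrix W.

```python
# Minimal iterative label propagation over a similarity graph (illustrative
# sketch; Gaussian-kernel weights are one common choice, not the only one).
import numpy as np

def label_propagation(X, y, n_classes, sigma=1.0, n_iter=100):
    """X: (n, d) features; y: (n,) labels, with -1 marking unlabeled nodes."""
    n = X.shape[0]
    # Edge weights w_ij = exp(-||x_i - x_j||^2 / sigma^2): the matrix W.
    dist2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-dist2 / sigma ** 2)
    np.fill_diagonal(W, 0.0)
    P = W / W.sum(axis=1, keepdims=True)       # row-normalized transition matrix

    labeled = y != -1
    F = np.zeros((n, n_classes))
    F[labeled, y[labeled]] = 1.0               # one-hot scores for labeled nodes
    for _ in range(n_iter):
        F = P @ F                              # propagate scores to neighbors
        F[labeled] = 0.0                       # clamp: labeled nodes keep labels
        F[labeled, y[labeled]] = 1.0
    return F.argmax(axis=1)

# Hypothetical usage: two labeled points, four unlabeled.
X = np.array([[0., 0.], [0.1, 0.], [2., 2.], [1.9, 2.1], [0.2, 0.1], [2.1, 1.9]])
y = np.array([0, -1, 1, -1, -1, -1])
print(label_propagation(X, y, n_classes=2))    # expected: [0 0 1 1 0 1]
```

In the works below, X would typically hold document or node embeddings, and the graph construction (kNN, k-partite, heterogeneous) and propagation rule are where the methods differ.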

Graph-based methods have been used in various contexts, e.g. news, web pages, health, and scientific articles. We group the articles by context or text classification task to describe their methods. Regarding news classification, a method based on Positive and Unlabeled Learning (PUL) with Label Propagation (LP) to minimize the news labeling effort was proposed by Souza et al. (2021). Negative document extraction with graph paths based on Dijkstra's algorithm was proposed by Carnevali et al. (2021); they used sparse graphs for graph construction and the Gaussian Field Harmonic Functions (GFHF) and Local and Global Consistency (LLGC) algorithms for classification. The authors in Yadav et al. (2019) compared distance/similarity metrics (Euclidean L2 norm; cosine similarity; improved sqrt-cosine similarity) to measure their effect on the quality of graph construction (average node degree and standard deviation of the node degree). The extraction of relevant content from news web pages was carried out by Bose and Mukherjee (2019). The web page was represented as a graph, where text elements are nodes and the edge weights represent the similarity between nodes. A few nodes were labeled in the graph using heuristics, and the remaining nodes were labeled by a weighted measure of similarity to the labeled nodes.

A graph-based algorithm to solve label insufficiency by means of LP in a news dataset was studied by Gong et al. (2017). They explored two measures, i.e. Graph Trend Filtering and Smooth Eigenbase Pursuit, to handle label inaccuracy by filtering out initial noisy labels. Widmann and Verberne (2017) constructed a graph employing document nodes and feature nodes where word order was preserved. Connections were formed in two ways: between document nodes and feature nodes, and among feature nodes based on words. A matrix representation of the graph was constructed with extracted features for LP based on context similarity using the Jaccard index. In a multi-head-pooling-based Graph Convolutional Network (GCN) applied to news text classification, Zhao et al. (2022) focused on the structural information of the text graph, pre-training word embeddings as the initial node features. Important nodes were evaluated and selected from multiple perspectives through multi-head pooling.

In the news context, some works explored k-partite graphs for text classification, where the vertices are partitioned into k different sets. A tripartite graph was developed by Ganiz (2016). Semantics in higher-order co-occurrence paths between words were exploited, linking terms in unlabeled documents to terms in labeled documents. Furthermore, the method was able to estimate class-conditional probabilities for the terms in unlabeled documents. Rossi et al. (2017) represented text collections by a bipartite heterogeneous network, where the objects were documents and terms, and a term and a document were connected if the term occurred in the document. The labels of connected terms were propagated to a new document using a weighted linear function.

In Chinese news text classification, Zhu et al. (2018) developed a method based on Wikipedia sample extension (WSE). A network graph was constructed with concepts and their links extracted from Wikipedia. The extension was generated by correlating the labeled sample data with the concepts in Wikipedia by means of TF–IDF, and then the significance value of each concept for each category was calculated. Besides, to further expand the sample, WSE with links (WSE-L), an enhanced sample extension method, was proposed. A limiting condition was then placed on WSE-L to control the number of training samples. Zhang et al. (2019a) investigated news text classification based on a domain ontology graph with semi-supervised conceptual clustering. To deal with the problem of word sense disambiguation (WSD), a framework for ontology learning of Chinese classification in accordance with the structural model of the domain ontology graph was developed.

A semi-supervised fake news detection method based on GNN was investigated by Benamira et al. (2019). GloVe embeddings of news articles were generated, and contextual similarities among texts were produced by kNN along with Euclidean distances in the embedding space. GCN and Attention GNN were used for the classification task. For the misinformation detection task, Abdali et al. (2021) studied three aspects of a news article, which were combined and modeled as a tensor/matrix, with one model per aspect. A hierarchical approach for finding latent patterns derived from those aspects was proposed. A nearest-neighbors graph was constructed with the articles in the embedding space for the semi-supervised label inference of unknown news articles.

Dealing with the short text classification task, Ji et al. (2021) proposed streaming social traffic event detection via multiple edge computing based on a heterogeneous information network (HIN) and a clustering method. A GNN along with the HIN was applied to obtain the optimal meta-path weights for traffic event detection and to measure the relationships between social texts. A binary sample GCN and a binary sample graph-attention network (GAT) were constructed to address the problem of a large number of traffic event categories with a small number of samples in each category. Zhao et al. (2022), beyond news classification as described previously, applied their method to short text classification too. The smoothness assumption was employed for transductive multi-label learning in Sun et al. (2018) with the purpose of exploiting the correlation in the feature space and label space. A non-negative matrix factorization (NMF) based modeling and training algorithm, which learned from adjacencies of the instances and labels in the training set, was proposed. Employing a non-negative least squares optimization algorithm, the labels were exploited and propagated.

Still in the short text classification task context, kernel-based GNN for graph classification in social networks and movie reviews was investigated by Ju et al. (2022). Graph kernels were combined with GNNs to effectively learn graph representations, and graph similarity was used for prediction. WordNet for WSD was used in Billal et al. (2017), who created a weakly connected graph from the words of the corpus and their synsets to extract connected components, where the components are nodes (words) and the edges are semantic relations among them. Furthermore, in a multi-label classification setting, semi-supervised graph methods were proposed for the extraction of subjects from the social network. Classification Maximization Deep Multi-label and Classification Maximization Deep Back Propagation Neural Networks were applied in the experiments.

For the short text classification task in Yang et al. (2021a), heterogeneous information embedding was carried out by a heterogeneous GAT. A dual-level attention mechanism was applied to learn the weights and to capture the importance of different types of neighboring nodes. Xu and Li (2017) developed a sentiment classification method based on an LP algorithm. The method combined text content and user information to construct similarity based on the reviewer's score preference and text features. High similarity between scoring preference and text features enabled the propagation of scores to unlabeled reviews.

Dealing with short text classification in languages other than English, Wang et al. (2017) performed a comparative study of algorithm performance with Chinese online reviews from multiple domains to address the problems of robustness and field dependence. Charalampakis et al. (2016) detected irony in a corpus of Greek political tweets, investigating self-training and collective classification. The goal was to find a relation between the ironic tweets referring to the political parties and leaders in Greece in the pre-election period of May 2012 and their actual election results. Guo et al. (2016) focused on analyzing the credibility of influenza posts published on Sina Weibo by means of user, content, and post features. An undirected graph Markov Network with random variables was used to model dependencies among nodes and to capture interactions among features.

In the scientific context, considering the importance of external node information for improving representation learning, Liu et al. (2018a) applied the Hierarchical Attention Network Embedding method, which integrated text and label features of nodes to learn hierarchical relational network embeddings for scientific articles. Two layers of bi-GRU were applied for hierarchical learning: one layer extracted latent features of words with word-level attention to obtain lexical features, and the other extracted latent features of sentences with sentence-level attention to obtain textual features. Zhu et al. (2021) researched random walks and GNN using global and local information to handle scientific articles. Global information was preserved via global features; a set of parallel kernel GNNs was used to learn different aspects of pre-trained global features and the raw attributes of the graph. Yang et al. (2021b) explored multilayer GCNs to handle the complexity, redundant calculations, and overfitting problems of GCNs. A simplified multilayer GCN with dropout, which extends shallow GCNs, was applied to scientific texts.

In scientific texts, Xu et al. (2020) investigated label consistency with a GNN that generated a label distribution for each node in addition to the similarity-based aggregation weight between two nodes. The method benefited from the proportion of neighboring nodes with the same label, and from target nodes and unconnected nodes that shared the same labels. Akujuobi et al. (2020) applied a recurrent-attention strategy to handle the problem of a large number of neighboring nodes to be analyzed, and used inductive properties in semi-supervised node classification of scientific articles. The walk on the graph was learned based on recurrent attention, which reduced noise, made the decision-making process interpretable, and inferred class label dependency. GAT for label propagation was applied by Huang et al. (2021), with the graph constructed considering citation dataset properties. The embedding vector of each node was generated based on its neighborhood. An attention mechanism was applied to learn the representation of the neighbor nodes of target nodes, so nodes with high similarity to target nodes had higher weights, and low-similarity nodes had lower weights.

For scientific text classification, a dynamic anchor graph to learn local and global features jointly was elaborated by Wang et al. (2021). A two-branch architecture was built: one branch was single-sample consistency, which learned local features through a consistency regularization term, and the other used the outputs of the previous branch to dynamically construct an anchor graph. The graph embedding branch learned global features in the graph through a context prediction log loss. Timsina et al. (2016) investigated various SSL methods, including label spreading with a Radial Basis Function kernel, to select articles for medical systematic reviews. In Kontonatsios et al. (2017), an active learning method was proposed to support citation screening in clinical and public health reviews. The approach was based on the cluster assumption and used label propagation to neighboring unlabeled citations, supported by the cosine similarity measure applied in the feature space.

4.2 Unsupervised pre-processing

4.2.1 Feature extraction

Unsupervised preprocessing is a category of inductive methods that use unlabeled and labeled data in dissociated stages, where the unsupervised stage can be performed by feature extraction. In NLP, feature extraction converts the raw text data into numeric features that are able to improve the performance of the classifier. As an SSL method, feature extraction is carried out on unlabeled data, seeking to extract relevant information from the raw data, and is followed by a supervised fine-tuning stage (van Engelen and Hoos 2019).
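The sketch below illustrates this two-stage pattern (our illustration using scikit-learn, with LSA via truncated SVD standing in for the feature extractor; any of the embedding methods discussed in Sect. 3.4 could take its place). The extractor is fitted on labeled and unlabeled text together, while the classifier sees only the labeled subset.

```python
# Unsupervised preprocessing sketch: fit the feature extractor on ALL text
# (labeled + unlabeled), then train a supervised classifier on the labeled
# subset only. Toy data and names are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

labeled_docs = ["great movie, loved it", "terrible film, waste of time"]
labels = [1, 0]
unlabeled_docs = ["an amazing and touching story", "boring plot and weak acting"]

# Stage 1 (unsupervised): learn the representation from all available text.
all_docs = labeled_docs + unlabeled_docs
tfidf = TfidfVectorizer().fit(all_docs)
lsa = TruncatedSVD(n_components=2).fit(tfidf.transform(all_docs))

# Stage 2 (supervised fine-tuning): train only on the labeled subset.
X_labeled = lsa.transform(tfidf.transform(labeled_docs))
clf = LogisticRegression().fit(X_labeled, labels)

# The classifier can now score previously unseen (or unlabeled) documents.
print(clf.predict(lsa.transform(tfidf.transform(unlabeled_docs))))
```

The unlabeled data never contributes labels here; it only shapes the representation the classifier is trained on.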

In the news context, using CNN for multi-label classification, Li et al. (2018) presented the following process: words were extracted from legal documents, and Word2Vec generated the word embeddings; two-view embedding learning generated training data; target regions were predicted from feature regions by training; and the two-view embeddings were integrated into a CNN for text classification. Jiang et al. (2018) combined DBN and softmax regression into a hybrid algorithm, where the features were learned by the DBN and the softmax regression was trained with a few labeled samples. In the fine-tuning step, the system parameters were optimized with the limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) algorithm, which used an estimate of the inverse Hessian matrix and a second-order Taylor expansion of the cost function.

With an approach to the sentiment analysis task, Pan et al. (2020) used a Ladder Network that effectively integrated a small amount of labeled data with a large number of unlabeled reviews and augmented data. The method has two models: the first leverages contextual features from unlabeled data using either Word2Vec, BERT, DistilBERT, or ALBERT; the second is a Ladder Network with an encoder and decoder. For sentiment classification, Sun et al. (2019a) explored fine-tuning methods for BERT. Within-task and in-domain further pre-training boosted text classification performance and improved tasks with small-size data. For affect classification, Chawla et al. (2019) introduced a deep neural network for an environment with limited labeled data; the method was a gated sequence-to-sequence convolutional–deconvolutional autoencoder. The classification of tweets according to their polarity, considering both textual and visual information, was addressed in Baecchi et al. (2015). A novel schema was proposed incorporating CBOW with negative sampling and denoising autoencoders to exploit web-scale source corpora and robust visual features obtained from unsupervised learning.

For sentiment analysis tasks in languages other than English, Jahanbakhsh et al. (2020) researched a model based on content and context features to verify Persian rumors. The content-based features were a set of writing style features, and the context-based features were speech acts of rumor documents and contextual word embeddings extracted by two parallel BERT models. In Guellil et al. (2021), an approach for sentiment analysis of Arabic messages extracted from social media was proposed. Both Arabic and Arabizi (using Arabizi transliteration and Arabizi translation to Arabic) were considered, and Word2Vec associated with classical algorithms was applied, as well as Word2Vec and fastText with deep learning algorithms (CNN, LSTM, bi-LSTM). Di Capua and Petrosino (2017) experimented with an ANN model based on DBN that learned feature representations from labeled and unlabeled data. The method was built to deal with data uncertainty for sentiment analysis and adopted the Italian language. Yadav and Bhojane (2019) developed an SSL approach to sentiment analysis in Hindi language documents. The authors worked with three approaches: ANN with pre-classified words; classification using Hindi SentiWordNet; and classification with ANN and pre-classified sentences.

This paragraph summarizes text classification in the Japanese and Chinese languages. For the Japanese language, automatic section identification of request-for-quotation documents was developed; novel features derived from unlabeled data were introduced to enhance performance, e.g. lexicon features, word cluster features with Word2Vec, and cluster features with constraints (Hidetaka and Wang 2019). For Chinese charge prediction, He et al. (2019) elaborated a Sequence Enhanced Capsule Model constituted by: an input layer, where the words of the fact description of a case were transformed into the primary capsule; multiple seq-caps layers, one producing an advanced semantic representation of the fact description and another restoring its sequence information; an attention mechanism, with a new residual unit that improved generalization and provided auxiliary information for charge prediction; and the output layer, where all the feature vectors from the multiple seq-caps layers were flattened and concatenated with the global context vector, and then a fully connected network and softmax function were used to generate the probability.

In the web page domain, Geraci and Papini (2018) automatically built a set of examples to use as the training set. The method exploited the strong correlation between the URL's text representation and the text of the web page; therefore, a set of web pages per class was constructed. Feature vectors were built per class/URL pair and were used to label URLs by ranking the classes. In McNulty et al. (2021), an approach that classifies HTML documents into research and non-research based on structural, content, and formality features was explored. In Lieder et al. (2019), millions of public business web pages were mined, multi-lingual BERT was used to obtain contextualized representations of the texts, and a CNN performed multi-label text classification. Because missing labels affect classification performance in multi-label learning, Cheng et al. (2021) approached missing multi-label learning with non-equilibrium based on a two-level autoencoder for web page classification. The two-level autoencoder was constructed considering the noise interference in the feature space and the correlation between features and labels.

For sentence classification, the fastText model was analyzed by Agibetov et al. (2018) on biomedical sentences. SSL models were pre-trained through unsupervised training on word context prediction or sentence reconstruction tasks and then used for downstream supervised classification.

4.2.2 Cluster-then-label

Cluster-then-label is an inductive method that, like feature extraction, uses unlabeled and labeled data in two stages; here, the unsupervised stage comprises clustering the data.
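A minimal sketch of the idea is shown below (our illustration, assuming k-means for the clustering stage and a majority vote among the labeled members of each cluster; the surveyed works vary in both choices).

```python
# Cluster-then-label sketch: cluster all samples, then give each cluster the
# majority label of its labeled members. Toy 2-D data stands in for text
# vectors; k-means is one clustering choice among those used in the survey.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0., 0.], [0.1, 0.2], [0.2, 0.1],     # cluster A
              [3., 3.], [3.1, 2.9], [2.9, 3.1]])    # cluster B
y = np.array([0, -1, -1, 1, -1, -1])                # -1 marks unlabeled samples

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

y_pred = y.copy()
for c in np.unique(clusters):
    members = clusters == c
    known = y[members][y[members] != -1]             # labeled members of cluster c
    if known.size:                                   # majority vote among them
        y_pred[members & (y == -1)] = np.bincount(known).argmax()
print(y_pred)                                        # expected: [0 0 0 1 1 1]
```

In practice the feature vectors would come from one of the text representations of Sect. 3.4 rather than toy points.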

In the news context, Jedrzejowicz and Zakrzewska (2020) proposed a hybrid approach using the LDA algorithm and Word2Vec. The method clustered the documents into categories using topics in an unsupervised way. The results of collapsed Gibbs sampling for LDA were acquired, and each topic was expanded with word embeddings similar to the topic's most representative words, using the cosine distance metric. For a new document, the word-topic distribution was calculated for each word, and the document was assigned the topic with the highest number of word-topic assignments. For news classification, Barman and Chowdhury (2018) used Kohonen's self-organizing map to extract groups from texts, and the unlabeled samples of each group were labeled based on a vote over the class labels of the group's labeled members. New classes were detected during the clustering process for news text categorization in Guru et al. (2016). Samples too far from all clusters in the clustering process formed one or more new clusters. A new cluster formed entirely of unlabeled samples represented a new class, and its samples were labeled accordingly.

For online news article classification, Krishnamoorthy et al. (2018) used two incremental clustering methods. Method I calculated, for each new document, its cosine similarity with all of the original documents. Method II used the centroids of the original clusters rather than all the data points when calculating the cosine similarity values of the new document with the clusters. A selective seeding technique to obtain a coherent set of initial centroids based on maximum feature coverage was implemented. Vilhagra et al. (2020) elaborated a deep clustering approach to document clustering and feature learning using the K-Means algorithm, a convolutional Siamese network (CSN), and pairwise constraints (cannot-link and must-link constraints). The CSN and pairwise constraints were used to learn a low-dimensional representation; the feature vectors were guided by the \(L_{1}\) norm, which brings them closer together or pushes them farther apart according to semantic distance.

Still in the news domain, Thomas and Resmipriya (2016) formed clusters of samples with the same labels, identified by their centroids and labels. The distance between the unlabeled samples and the centroids of the labeled clusters was calculated, and the minimum distance defined the target cluster to which the unlabeled sample was added and labeled. The similarity metrics were Euclidean distance, cosine similarity, the similarity measure for text processing (SMTP), and the dice coefficient. For news classification, an unbiased semi-supervised cluster (SSC) tree was proposed by Sun et al. (2020), in which the learning process used only very few labeled data and a confidence-error-based pruning algorithm. The K-Means algorithm was applied to generate the SSC tree, where each level of this hierarchical tree was built in a top-down manner, and the confidence error was used to prune the tree. With a global strategy based on the weak cluster assumption to explore the unlabeled data, the proposed method resolved the local maxima problem.

For the short text classification task, Ng and Carley (2021) examined coronavirus-related fact-checked stories. In K-Means clustering, six topics were chosen, and each story was assigned to a cluster number based on its Euclidean distance to the cluster center in the projected space. A BoW classifier was constructed to label the story type by means of cosine distance, and a BERT classifier labeled the target story using the closest vector embedding found through the smallest cosine distance. Buza and Revina (2020) improved the classification of time series and applied it to short text classification. First, labeled and unlabeled samples were clustered with constrained single-linkage hierarchical agglomerative clustering. Then, in the top-level clusters generated, the unlabeled samples in each cluster were labeled by their seeds. However, the complexity of the distance computations was \({\mathcal {O}}({n^{2}})\). When the dataset was divided into \(c\) parts and the distances were computed within each part, the complexity became \({\mathcal {O}}(n^{2}/c)\). The authors relied on this logic to reduce the computational cost.
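The complexity reduction follows from simple counting (our restatement of the argument above, assuming the pairwise distances are computed separately within each of the \(c\) parts of size \(n/c\)):

\[ c \cdot {\mathcal {O}}\!\left( \left( \frac{n}{c}\right) ^{2}\right) = {\mathcal {O}}\!\left( c \cdot \frac{n^{2}}{c^{2}}\right) = {\mathcal {O}}\!\left( \frac{n^{2}}{c}\right) \]

so the quadratic cost is cut by a factor of \(c\), at the price of never comparing samples across different parts.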

A short text classification method based on weighted word vector representation was proposed by Zhang et al. (2019b). Expected cross-entropy was used on the labeled data to extract strong category feature sets. To reduce the high-dimensional sparseness of short text features, word vectors were generated and used to represent the feature vectors, increasing the semantic information of short texts. The method calculated the cosine similarity between the whole feature vectors and the virtual class center, where the virtual class center represented the mean value of the feature vectors. The real class center of the labeled samples was calculated based on normalized similarity. The similarity between the clustering center and the real class center of the labeled data was used to classify the unlabeled samples.

In the social media sentiment analysis task, Nguyen (2016) exploited the concept of emotional consistency with spectral-based LP and distant supervision (noisy) labels. The LP was based on a similarity matrix that used a Gaussian kernel over textual features. In the emotional clustering, consistency was built on three different predictors based on three lexicon resources using the lexicon-ratio method. The final sentiment classifier was built from the reference predictions and the labeled data of the target domain. Namrutha Sridhar et al. (2020) identified and associated social media text with multiple emotions of varying degrees. Word embeddings were trained on the entire Twitter dataset, and then tweet-level similarity was calculated between unlabeled and labeled tweets using the word mover's distance.

Two works performed short text classification in the Vietnamese language. First, Ha et al. (2018a) proposed a recursive adaptation multi-label classification algorithm with semi-supervised clustering. The method finds the first label (\(\lambda \)) as the one with the greatest number of occurrences in \(L_{2}\), the set of possible labels of the labeled dataset. The clusters were created based on \(\lambda \), generating three macro labels \(\lambda _{1}\), \(\lambda _{2}\), and \(\lambda _{3}\) as a simulated label set. A set of clusters (\(D_{1}\), \(D_{2}\), and \(D_{3}\)) related to the labels \(\lambda _{1}\), \(\lambda _{2}\), \(\lambda _{3}\) was produced. Second, Ha et al. (2018b) proposed a lifelong topic modeling method focused on learning bias at the domain level based on a proposed domain closeness measure, and an application framework for multi-label classification based on semi-supervised clustering of Vietnamese texts.

In the scientific classification domain, Varghese et al. (2018) employed an unsupervised clustering algorithm with a minimal training dataset to support the labeling process and reduce the manual effort in a systematic review of toxicological studies.

4.3 Wrapper methods

4.3.1 Self-training

The self-training approach is part of the wrapper methods, whose logic is to generate pseudo-labels for unlabeled data and add the newly labeled data to the existing labeled data to train an inductive classifier.
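A minimal self-training loop looks like the sketch below (our illustration, assuming a logistic regression base classifier and a fixed confidence threshold for accepting pseudo-labels; the surveyed works differ in the base classifier, the selection criterion, and the stopping rule).

```python
# Self-training sketch: iteratively pseudo-label the most confident unlabeled
# samples and retrain. Base classifier and threshold are illustrative choices.
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, threshold=0.9, max_rounds=10):
    clf = LogisticRegression()
    for _ in range(max_rounds):
        if len(X_unlab) == 0:
            break
        clf.fit(X_lab, y_lab)
        proba = clf.predict_proba(X_unlab)
        confident = proba.max(axis=1) >= threshold   # confident predictions only
        if not confident.any():                      # stop when nothing qualifies
            break
        # Move confidently pseudo-labeled samples into the labeled set.
        pseudo = clf.classes_[proba[confident].argmax(axis=1)]
        X_lab = np.vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab, pseudo])
        X_unlab = X_unlab[~confident]
    return clf.fit(X_lab, y_lab)

# Hypothetical usage with toy 2-D features standing in for text vectors.
rng = np.random.default_rng(0)
X_unlab = rng.normal(loc=[[0.0, 0.0]] * 20 + [[2.0, 2.0]] * 20, scale=0.3)
model = self_train(np.array([[0.0, 0.0], [2.0, 2.0]]), np.array([0, 1]), X_unlab)
```

scikit-learn ships a comparable wrapper, sklearn.semi_supervised.SelfTrainingClassifier, implementing the same pattern around any probabilistic classifier.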

In the news context, a modification of the self-training method was performed by Villatoro-Tello et al. (2016) to reduce the sensitivity of the learning algorithm to the noise contained in the labeled data by means of automatically generated summaries. Another contribution of that research was a new distance-based strategy to select confidently labeled instances in every iteration of self-training, which helped preserve high homogeneity among classes. In Pavlinek and Podgorelec (2017), topic models for text representation were investigated with the aim of improving the performance of the SSL method. A news text classification method based on self-training and LDA topic models was proposed to augment very small labeled datasets with unlabeled content. In Kumar et al. (2021), a novel framework of binary classifiers eliminated the threshold issue to improve the performance of pseudo-labeling in conventional SSL for text classification using a new dataset.

Dealing with news and sentiment contexts, a new hybrid classification method was built that used class-based meaning values and term weights (Altınel and Ganiz 2016). The meanings of the words for each class were calculated, and the meaning score defined the labels for unlabeled samples. After that, a Class Weighting Kernel constructed the class-based matrix representing the weights of the words for each class. Then, based on the class-based matrix, a symmetric term-by-term semantic smoothing matrix was generated to calculate the similarity/kernel between documents. The kernel function was embedded into the implementation of the SVM algorithm used along with Platt's Sequential Minimal Optimization classifier. Also for news and sentiment classification, Altınel et al. (2017) computed the words' meaning scores in the scope of classes. Instance labeling used the meaning calculation in a semi-supervised way to construct a semantic smoothing kernel for SVM.

In the sentiment analysis task context, Khan et al. (2017) incorporated machine learning along with a sentiment lexicon in order to alleviate the existing problems of data unavailability, data sparsity, and domain dependence. A sentiment knowledge base was constructed, resulting in two sentiment lexicons named Senti-IG and Senti-Cosine, by applying mathematical models such as Information Gain and Cosine Similarity to the SentiWordNet lexicon to generate revised sentiment scores. A system was developed by Zaghdoudi and Glomann (2021) to automate user research activities on the web; the synonym replacement method was used for data augmentation, and LSTM was applied for sentiment analysis. For sentiment and topic classification, Xiang and Yin (2021) combined a bi-GRU deep neural network with an extended temporal ensembling in which unlabeled samples were given pseudo-labels. A sarcasm-unlabeled method was proposed by Li et al. (2020) for contextual sarcasm detection in social networks using the concatenation of a CNN-based content representation and sarcastic preference embedding, along with main-balanced and main-unbalanced datasets.

Li and Ye (2018), inspired by self-training, framed SSL for sentiment classification as a model-based reinforcement learning problem. An adversarial network-based framework was proposed; however, unlike most other generative adversarial network (GAN)-based SSL approaches, the framework did not need to reconstruct input data and hence could be applied to semi-supervised text classification. In Banerjee et al. (2018), sentiment classification was handled through positive and unlabeled data, where the positive class was a rare event in customer reviews. Stage I sought to label data for Non-Reportable and new kinds of Reportable cases and estimated the prior class probabilities by means of a sentiment score, a keyword score, and a similarity score (using LSA or GloVe embeddings). Stage II used an entropy-regularized logistic classifier that penalized the entropy of the posterior measured on the unlabeled samples.

In Lee and Kim (2017), a sentiment labeling method was explored to generate confident pseudo-labeled samples via a threshold parameter; these samples were added to the training corpus to enrich the initial sentiment classifier. In each iteration, self-training with concatenated embedding vectors was conducted. Four experiments were carried out: sentiment classification to prove the effect of sentiment labeling; an experiment to identify whether sentiment labeling with a lower confidence threshold could improve classification accuracy and to determine whether there was a correlation between the joint sentiment/topic model variance and classification accuracy; an experiment for further validation; and an experiment with increasing sizes of initial human-labeled data to analyze the performance of the proposed method.

In the short text context, Shulman and Simo (2021) proposed a deep-learning-based method for helping users of online social networks avoid regrettable posts and disclosure of sensitive information. A semi-supervised self-training approach was employed to incrementally label messages from online social networks and create a large-scale corpus. Word2Vec and fastText were used to generate domain-specific word embeddings. User information to alleviate data sparsity in sentence classification in social scenarios was used by Ma et al. (2020). The up-based regularization term was applied to assist prediction, and in the self-training, the pseudo-labeled samples had their noise reduced by a sample selector. A pre-trained ELMo was used to contextualize word embeddings, and a softmax layer output the probability distribution over classes. Deocadez et al. (2017) applied semi-supervised algorithms to automate the classification of functional and non-functional requirements contained in App Store reviews.

For short text, a label prediction method was proposed by Stanojevic et al. (2019) that predicted probabilities to guide the choice of labels for each post from unlabeled data based on a small number of labeled samples. The method captured additional contexts from the unlabeled data with model learning, e.g. fastText and deep learning models. With an SSL framework for short text, Ghosh and Desarkar (2020) improved the performance of a classifier trained on a small labeled set by incorporating highly confident samples from the unlabeled data into the labeled training data. One criterion for class assignment and sample selection was a restriction on the number of samples per class, and the other was a class-specific threshold restricting the assignment of samples to a class.

For short text classification, Karisani and Karisani (2021) proposed a neural SSL model based on a classic self-training algorithm that was threshold-free, to cope with social network data. The method handled the semantic drift problem and revised previously labeled documents. The approach was iterative and formed by two neural network classifiers that alternately trained each other: in each iteration, one classifier obtained a random set of unlabeled documents and labeled them, and this set was used to initialize the other classifier, which was then further trained on the set of labeled documents. Three semi-supervised methods to classify tickets from bug tracking system data in a binary fashion were employed in Pohl et al. (2020), with sentiment polarities used as a feature of the self-training. Wulan and Supangkat (2017) proposed semi-supervised self-training to classify motivational messages that may motivate the learner to study.

In the context of languages other than English, Duong and Anh (2021) used Easy Data Augmentation, e.g. synonym replacement, random swap, random insertion, and random deletion, for sentiment analysis of Vietnamese texts, along with syntax-tree transformation and back-translation data augmentation techniques. For sentiment analysis, Nguyen Nhat Dang and Duong (2019) carried out various experiments involving many pre-processing techniques and semantic lexicon complementation; furthermore, the synonym replacement and random swap data augmentation techniques improved the accuracies of the classifiers. Yin et al. (2018) applied an SSL method with an SVM classifier (SLAS) and a CART model for sentiment classification.

In Li et al. (2017a), a novel semi-supervised Chinese short text classification algorithm based on fused similarity and class centers was developed, in which many unlabeled samples in the dataset were labeled iteratively based on the similarity between samples. Khan and Zubair (2020) proposed a model for the multi-lingual (English and Roman Urdu) multi-class classification of tweets. The SSL method was based on a feature set from the labeled dataset; the unlabeled samples were labeled, and the model was re-trained jointly with them and the small previously labeled set. Omar et al. (2021) focused on short text classification on social networks and constructed a standard Arabic dataset using manual annotation and semi-supervised annotation techniques. Among several experiments, self-training was used to label the remaining unlabeled posts with a sentiment class.

In the health domain, a comparative analysis of various SSL methods was performed to address the problem of small training datasets for text classification algorithms in medical systematic reviews (Liu et al. 2018b). Self-training with label spreading to identify the most confident unlabeled instances was one of the semi-supervised methods used. Hasan et al. (2020) identified adverse drug reactions and side effects from patient reports on social media using a semi-supervised method. The method was based on a Conditional Random Field with a small labeled dataset and iteratively augmented the training set with high-confidence labeled sentences coming from a large set of unlabeled data. Furthermore, the method incrementally augmented the symptom and side-effect dictionaries with the most confident medical terms; with these terms correctly classified, sentences that had previously been rejected could be added to the training data.

In the web page context, Lin et al. (2017) elaborated a competitive perspective identification method based on user-level perspective consistency, which selected high-quality classified texts from the unlabeled corpus and iteratively boosted the classifier. The method refined the perspective classifiers with document-topic distributions mined from the texts using NMF. A semi-supervised multi-view similarity method for web page classification was designed by Wu et al. (2019). The method learned multiple view-individual transformations and one shareable transformation, so both the particularity and the commonality of the different views were explored. Label information from labeled samples and similarity information from unlabeled samples were used in both intra-view and inter-view aspects. The overall objective combined terms for semi-supervised multi-view similarity preserving, a multi-view statistically uncorrelated design (to reduce redundant information across views and learn view-specific features with the view-individual transformations using the covariance matrix), and the classification loss. An \(l_{2,1}\)-norm-based regularizer made the view-specific transformations sparse in rows, so that discriminant features could be selected for each view.
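
For reference, the standard definition of this norm for a transformation matrix \(W \in \mathbb{R}^{d \times k}\) with rows \(w^i\) is
\[
\|W\|_{2,1} = \sum_{i=1}^{d} \sqrt{\sum_{j=1}^{k} W_{ij}^{2}} = \sum_{i=1}^{d} \|w^i\|_2,
\]
an \(l_2\)-norm over each row followed by an \(l_1\)-norm over the rows; penalizing it drives entire rows to zero, which is what produces row sparsity and enables feature selection per view.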

4.3.2 Co-training

Co-training is a semi-supervised approach and part of the wrapper methods, which use supervised algorithms to iteratively label unlabeled samples. Co-training is characterized by the use of two or more distinct views of the labeled data to iteratively train the classifiers. At each iteration, the most confident predictions of each classifier are passed to the labeled data of the other classifiers.
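
A minimal sketch of this loop (a generic Blum-and-Mitchell-style illustration under assumed count-based feature views `X1` and `X2`, not any specific surveyed system):

```python
# Generic co-training sketch: two classifiers, one per feature view, each
# promoting its k most confident pseudo-labels into the shared labeled set.
# X1, X2 are two views of the same documents (e.g. nonnegative count matrices).
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train(X1, X2, y, labeled, unlabeled, rounds=10, k=5):
    labeled, unlabeled, y = list(labeled), list(unlabeled), y.copy()
    clf1, clf2 = MultinomialNB(), MultinomialNB()
    for _ in range(rounds):
        clf1.fit(X1[labeled], y[labeled])
        clf2.fit(X2[labeled], y[labeled])
        for clf, X in ((clf1, X1), (clf2, X2)):
            if not unlabeled:
                break
            proba = clf.predict_proba(X[unlabeled])
            top = np.argsort(proba.max(axis=1))[-k:]     # most confident
            picked = [unlabeled[i] for i in top]
            y[picked] = clf.classes_[proba[top].argmax(axis=1)]
            labeled.extend(picked)                        # teach the other view
            unlabeled = [i for i in unlabeled if i not in picked]
    return clf1, clf2, y
```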

In the news context, collaborative text classification was combined with a supervised topic model by Zhang et al. (2021a) to identify the semantic relation between topic and category. The views were generated by different feature representations for training two classifiers, and the approach adopted a confidence calculation method based on the posterior distribution distance and a sampling strategy to select credible unlabeled samples. Xu et al. (2016) dealt with weakly labeled learning problems with multi-view training data, where pseudo-label vectors were used to pass information among different views. A projection operator was proposed that converted the predictions to pseudo-label vectors, considering the different constraints of weakly labeled data in different learning scenarios. A multi-view semi-supervised co-training algorithm was applied to news text classification by Iglesias et al. (2016), where a BoW view and a new view derived from the BoW based on hidden Markov models (HMMs) were generated. A document group was constructed for each label, and HMMs represented the groups. A new document was classified by the maximum probability value after analyzing the probability of the document being generated by each of the HMMs.

For sentiment analysis, Graef (2021) investigated a new hybrid approach that combined context-dependent embeddings based on the ELMo language model with co-training in an integrated perspective; the classification was carried out in an online social network of a German direct banking institution. Alnashwan et al. (2019) adapted the co-training method to multi-class sentiment classification in online medical forums about Lyme disease and lupus.

In the scope of question and short text classification, a drug treatment question classification task in medical forums using the co-training method with bi-LSTM and bi-GRU was explored by Wang and Ren (2019). The random subspace method for co-training (RASCO) and relevant random subspace co-training (Rel-RASCO) were applied by Deocadez et al. (2017) to automate the classification of App Store reviews. RASCO performed random feature splits, while Rel-RASCO modified RASCO's random feature-subspace idea by seeking to select relevant feature subspaces. A novel CNN design for SSL short text classification was presented by Shayegh et al. (2019). The dataset was partitioned into independent views via topic modeling to train independent classifiers, and kNN grouped the views into unique categories based on their topic similarity to auxiliary classifiers to predict the label of documents. The method leveraged word synonyms to augment the dataset in addition to the original labeled training data. A novel framework for learning from text-rich networks was proposed by Zhang et al. (2021b). With a co-training algorithm and feature sharing, two modules were trained jointly: a text analysis module for text embedding by BERT, and a GNN module for categorical information propagation. The GNN module used neighborhood sampling and attention-based aggregation, and the two modules had different inductive biases. SSL was applied in Jing (2018) for online fake comment detection, with dynamic and static feature representations as views.

With a web page dataset, Gokhale and Fasli (2017) proposed a co-training SSL approach to the multi-class recognition problem of classifying human rights abuses. A multi-label deep method that combined two views for text classification by implementing two deep neural networks was proposed by Kihlman and Fasli (2021) to classify human rights violations. The method added noisy data so the classifiers would learn to differentiate noisy from correct data and thereby improve classification accuracy.

In the scientific context, the view insufficiency problem was addressed in Guo (2018); the method sought to identify harmful data and modify them, reducing their effect, i.e. decreasing their weights in the training set, for scientific classification.

4.3.3 Boosting

In pseudo-labeled boosting methods, the classifier ensemble is formed by dependent base learners. The method trains models with supervised base learners using unlabeled samples: in each learning iteration it generates pseudo-labels, which are incorporated with the labeled samples. Furthermore, in each learning iteration, the models are combined to build a single classification model (van Engelen and Hoos 2019).
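
A minimal sketch of this scheme (a generic illustration with shallow decision trees and a hypothetical per-round promotion budget `k`; not a specific surveyed algorithm):

```python
# Generic pseudo-label boosting sketch: each round fits a dependent base
# learner on labeled plus previously pseudo-labeled data, promotes its k
# most confident unlabeled predictions, and the learners vote at the end.
# Assumes dense features and integer class labels 0..C-1.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def ssl_boost(X_lab, y_lab, X_unl, rounds=5, k=20):
    learners, X_train, y_train = [], X_lab, y_lab
    pool = np.arange(len(X_unl))
    for _ in range(rounds):
        base = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)
        learners.append(base)
        if len(pool) == 0:
            continue
        proba = base.predict_proba(X_unl[pool])
        top = np.argsort(proba.max(axis=1))[-k:]          # most confident
        pseudo_y = base.classes_[proba[top].argmax(axis=1)]
        X_train = np.vstack([X_train, X_unl[pool[top]]])  # grow training set
        y_train = np.concatenate([y_train, pseudo_y])
        pool = np.delete(pool, top)
    return learners

def predict(learners, X):
    # Combine the dependent learners into a single model by majority vote.
    votes = np.stack([m.predict(X) for m in learners])    # shape (rounds, n)
    return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, votes)
```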

In the news context, Tanha (2019) investigated a new multiclass loss function using new codewords to address the multiclass semi-supervised text classification problem. In the multiclass loss function, one term was the margin cost of the labeled data and the other was a regularization term over the unlabeled data. To guide the base learner in assigning pseudo-labels to the unlabeled data, the loss function combined the pairwise similarity and the classifier predictions. A set of new similarity functions, obtained with different distance/metric learning algorithms, was applied to improve classification performance, and a boosting framework was used to derive an algorithm from the proposed loss function.

A new boosting framework for learning an optimal similarity function for the multiclass news text classification problem was proposed by Tanha (2018). The method combined the similarity information between labeled and unlabeled data with the classifier predictions to assign pseudo-labels to unlabeled examples. Based on the cluster assumption and a margin-maximization approach for the multiclass case, a new risk function for the multiclass semi-supervised classification problem was introduced. Weights were assigned to all data points and used to find a new optimal classifier that decreased the risk function, and a boosting framework was used to learn weak similarity functions. The final classification model was a combination of weak classifiers and similarity functions. For news classification, Liu et al. (2016) elaborated an extension of AdaBoost with Universum examples, where the training error was bounded by the product of the normalization factors.

In the sentiment analysis task, Hanafy et al. (2018) auto-labeled unlabeled tweets gathered by location from the USA using emoticons to generate the training data. Features were extracted from the labeled data by statistical and unsupervised approaches, e.g. TF–IDF and Word2Vec, respectively. Classical (SVM, MaxEnt) and deep learning (LSTM, CNN) methods were combined into a unified model.

In languages other than English, Li et al. (2017b) employed an ensemble classifier based on the Bagging and AdaBoost methods for Chinese question classification. A simple kNN-based data editing technique was applied so that erroneously predicted labels from unlabeled samples would not prejudice the classification model. TF–IDF and lexical-semantic extension methods derived from Tongyici Cilin were used with Naive Bayes, J48graft, and J48 classifiers, and the semantic extension method was compared with TF–IDF in supervised and semi-supervised settings.

4.4 Intrinsically semi-supervised

4.4.1 Perturbation-based

Intrinsically semi-supervised methods add unlabeled samples to the objective function and perform a direct optimization of that objective. Because they modify the objective function to include unlabeled data, they are considered extensions of supervised methods and do not depend on supervised base learners. Another feature of these methods is their dependence on one of the SSL assumptions: maximum-margin methods depend on the low-density assumption, requiring the decision boundary to remain in a low-density area, while perturbation-based methods directly incorporate the smoothness assumption (van Engelen and Hoos 2019).

Local perturbations generate adversarial examples, which result from imperceptible changes to samples. Robustness of the prediction model to local perturbations presupposes the smoothness assumption: the predictions for a sample and for its imperceptibly changed or noisy version should be similar. Unlabeled samples can be used because this similarity does not depend on the true labels of the samples.
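
In generic form (our notation, not that of any single surveyed paper), such smoothness-based training combines a supervised loss on the labeled set \(\mathcal{D}_L\) with a consistency term on the unlabeled set \(\mathcal{D}_U\):
\[
\mathcal{L}(\theta) = \sum_{(x,y) \in \mathcal{D}_L} \ell\big(f_\theta(x), y\big) + \lambda \sum_{x \in \mathcal{D}_U} d\big(f_\theta(x), f_\theta(\tilde{x})\big),
\]
where \(\tilde{x}\) is a perturbed or noisy version of \(x\), \(d\) is a divergence such as the squared error or the Kullback–Leibler divergence, and \(\lambda\) weights the unsupervised term.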

In the news and product and service review context, a multi-label classification method that integrated label correlations into consistency regularization was elaborated in Qiu et al. (2020). Consistency regularization encouraged the model to predict the same class for an unlabeled sample even when it was perturbed. The method leveraged an Exponential Moving Average model and the label correlation matrix to generate an accurate target for each unlabeled instance, and applied the mixup technique to compute the consistency regularization. Miyato et al. (2017) extended virtual adversarial training (VAT) from images to text classification. Because VAT requires continuous inputs, the perturbations were applied to the text embeddings; the virtual adversarial perturbation was approximated with a second-order Taylor expansion and computed with the power method.
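
Up to notational details, the virtual adversarial perturbation for an input \(x\) (here, a text embedding) is
\[
r_{\text{vadv}} = \mathop{\arg\max}_{\|r\|_2 \le \epsilon} \mathrm{KL}\big( p(\cdot \mid x; \hat{\theta}) \,\|\, p(\cdot \mid x + r; \theta) \big),
\]
and the VAT loss adds this KL divergence evaluated at \(x + r_{\text{vadv}}\) to the supervised loss. Since the inner maximization is intractable, it is approximated by the second-order Taylor expansion of the divergence, whose dominant eigenvector is estimated with the power method, as mentioned above.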

In the product and service review context, based on the CBOW model, Zhang et al. (2020) analyzed the perturbations appropriate for generating readable adversarial texts that deceive human observers, controlling the perturbation direction vectors so that the perturbations fit the context in the neighborhood of the words. They then used adversarial product and movie review texts to enhance the robustness of the model, applying Adversarial Training to regularize the classification model and extending it to semi-supervised tasks with VAT. The method demonstrated that the generated adversarial texts and the original texts had similar meanings, that they were interpretable yet confusing to humans, and that VAT improved the robustness of the model; the resulting model was trained to defend against readable adversarial text attacks. Li and Sethy (2020) proposed a layer partitioning framework for discrete text input, combined with the \(\Pi\)-Model or temporal ensembling, for short text classification. A neural network was split into two parts: the lower layers served as a feature extractor and added systematic noise to the input, while the higher layers were trained with the perturbed input by the SSL method employing the \(\Pi\)-Model or temporal ensembling.

In the scientific context, Sun et al. (2019b) investigated adding VAT to the supervised loss of a GCN to improve performance in scientific article classification. The resulting GCN Sparse VAT (GCNSVAT) and GCN Dense VAT (GCNDVAT) algorithms inserted virtual adversarial perturbations into sparse and dense features, respectively. Also in the context of scientific articles, given the susceptibility of GCNs to perturbations, Hu et al. (2021) used Adversarial Training that considered the graph structure to decrease the impact of feature perturbations from neighboring nodes.

For the Chinese language, considering the smoothness assumption, a semi-supervised multi-class short text classifier to detect and classify emergency events with a deep learning architecture was proposed by Liu et al. (2021); the Kullback–Leibler divergence measured the distance between two predictions, on clean samples and on their perturbed versions. In Huang et al. (2020a), a two-stage SSL framework for Chinese patent classification based on the theory of inventive problem solving was elaborated. The method used a standard LSTM and a pooling layer with soft attention and k-Max pooling for feature extraction. It pre-trained the model with unlabeled data and then used a mixed objective function, combining cross-entropy, entropy minimization, and adversarial and virtual adversarial loss functions, to train the text classification model.

4.4.2 Manifolds

Manifold methods are part of the intrinsically semi-supervised methods. The manifold assumption states that the data points are located on multiple lower-dimensional manifolds that make up the input space, and that data points located on the same lower-dimensional manifold have the same label (van Engelen and Hoos 2019).
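
A standard way of encoding this assumption, shown here for illustration rather than taken from a specific surveyed paper, is the graph Laplacian regularizer
\[
\Omega(f) = \frac{1}{2} \sum_{i,j} W_{ij} \big( f(x_i) - f(x_j) \big)^2 = \mathbf{f}^{\top} L \, \mathbf{f},
\]
where \(W_{ij}\) measures the similarity between samples \(x_i\) and \(x_j\) (labeled or unlabeled), \(L = D - W\) is the graph Laplacian with degree matrix \(D\), and \(\mathbf{f}\) is the vector of predictions. Adding \(\Omega(f)\) to a supervised loss forces the classifier to vary smoothly along the manifold approximated by the graph.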

For sentiment analysis, Gupta et al. (2018) employed feature representations learned with Doc2Vec, pre-training, and manifold regularization to train a sentiment classification model. The manifold regularization used a mix of external and in-domain data and was applied to train a statistical model exploiting both the labeled and unlabeled data resources. Park et al. (2019) proposed a semi-supervised distributed representation method that reflected the differences in document distributions depending on the sentiments, using partially labeled documents. A new objective function produced document embeddings best suited to sentiment information for sentiment classification; the embeddings were acquired under one restriction related to the manifold assumption and another related to the smoothness assumption of the sentiment classifier in the learned representations.

4.4.3 Generative models

Like manifold and perturbation-based methods, generative models are intrinsically semi-supervised. However, unlike those methods, whose only objective is to deduce a function that classifies data points, generative methods have the primary objective of modeling the process that generated the data. Mixture models, GANs, and variational autoencoders (VAE) are examples of generative model methods.

Suppose each observation in the dataset comes from one specific distribution, e.g. a Gaussian distribution. Maximum likelihood or EM is used to infer the parameters of the distribution, such as the mean and variance. Then, with the mixture generative model method, the joint distribution \(p(x, y)\) is modeled, samples can be drawn, and the model can be used for classification. GANs are deep learning architectures for training generative models. GANs approach the learning of the distribution with a loss function based on a zero-sum game between two players (Generator and Discriminator), where the sum of the players' costs is zero. The Generator is trained to deceive the Discriminator by producing samples similar to the training data distribution, while the Discriminator classifies the samples as real or fake in a supervised way (Goodfellow 2017).
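
For reference, this zero-sum game corresponds to the standard GAN objective
\[
\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log \big(1 - D(G(z))\big)\big],
\]
where the Discriminator \(D\) maximizes its ability to separate real from generated samples and the Generator \(G\) minimizes it. In semi-supervised variants, \(D\) is typically extended to output the \(K\) real classes plus an additional fake class.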

A VAE is formed by an encoder and a decoder; it is a deep generative model that can generate samples from the latent space. Each data point x is treated as being generated from a vector of latent variables z. VAEs restrict p(z) to a simple distribution to facilitate sampling, e.g. a standard multivariate Gaussian distribution. Based on a data point x, the encoder establishes the parameters of the \(p(z \mid x)\) distribution, while the decoder performs the transformation from p(z) to a more complex distribution \(p(x \mid z)\). To generate reconstructions of x, a sample z is drawn from the distribution p(z) and passed through the decoder, where it is multiplied by the weights, a bias is added, and an activation function is applied. The encoder and decoder are trained simultaneously to minimize a combined cost function: the Kullback–Leibler divergence between the posterior distribution \(p(z \mid x)\) and some simple prior distribution p(z), plus the reconstruction cost of the autoencoder output for the input data. The decoder is used as a generative model (van Engelen and Hoos 2019).
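
Writing \(q_\phi(z \mid x)\) for the encoder's approximation to the posterior, this combined cost is the negative of the standard evidence lower bound,
\[
\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big),
\]
which the encoder (parameters \(\phi\)) and decoder (parameters \(\theta\)) are trained jointly to maximize.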

In the news context, a generative process for both words and response variables was employed by Soleimani and Miller (2016a). The approach was a mixture of class-conditioned topic models that discovered topics and predicted class labels in a semi-supervised fashion, based on the assumption that documents from the same class have similar topic proportions. The manifold and cluster assumptions were introduced by Xie et al. (2019) to regularize the classifier in deep generative models. The method encouraged classifier invariance to local perturbations on the data sub-manifold of each cluster and distinct classification outputs for data points in different clusters, producing a more discriminative classifier. Data augmentation methods using a Generator and a Filter for topic classification and sentiment analysis were proposed by Queiroz Abonizio and Barbon Junior (2020): the Generator synthesized new samples and the Filter retained the high-quality ones.

Still in the news context, BERT and semi-supervised GANs were combined in Croce et al. (2020) for text classification: the Generator produced fake samples based on the data distribution, and the BERT model was used as the discriminator. By leveraging information from hierarchical labels to generate the topics, Agarwal (2021) implemented a semi-supervised hierarchical LDA, a probabilistic graphical model that discovered latent topics in news documents via a Gibbs sampler. In textual anomaly detection, Steyn and de Waal (2016) enhanced the Multinomial Naive Bayes classifier with an augmented EM algorithm. For hierarchical text classification based on a generative model, Xiao et al. (2019) proposed a path cost-sensitive learning algorithm; the approach applied EM, and local maxima were obtained based on the parameters of the Naive Bayes classifier on the labeled data.

For the short text classification task, using a Kernel-based Deep Architecture combined with a semi-supervised GAN, Croce et al. (2019) investigated how to improve the robustness of deep architectures by exploiting an expressive space that encodes rich linguistic information. Najari et al. (2022) customized the GAN for text-based social bot detection, using a common LSTM layer as a shared channel between the generator and the classifier to handle the convergence limitation of the traditional SeqGAN. Spam detection based on GANs was addressed in Stanton and Irissappane (2019): the features were learned by an ANN, the method generated spam/non-spam reviews similar to the training set, and a multi-layer RNN with gated recurrent units was the base cell representing both the generator and the discriminator. Aghakhani et al. (2018) modified the GAN for detecting deceptive reviews by means of two discriminator models and one generative model, avoiding mode collapse issues by learning from the distributions of both truthful and deceptive reviews. A regularized GAN (ScoreGAN) was developed in Shehnepoor et al. (2022) for fraud review detection, owing to the limitations of GANs on this task; the text representation was given by GloVe concatenated with a score, and the discriminator was trained to label the reviews coming from the generator.

In the context of languages other than English, Song et al. (2016) proposed a new text categorization algorithm for the Chinese language based on a deep learning structure and a semi-supervised DBN. DBNs are based on Restricted Boltzmann Machines, ANNs trained in an unsupervised way with a fast learning algorithm called contrastive divergence. In the fine-tuning stage, a softmax regression classifier received the output of the DBN and used the backpropagation algorithm to construct an optimized network. Liu et al. (2020) developed a cross-domain patent retrieval approach with functional, technical, and domain properties. The approach applied a Chinese word segmentation tool owing to the particularities of the language; Naive Bayes was used as a classifier trained according to the primary level of the functional basis, and the EM algorithm served as the final classifier. An automatic Chinese patent classification method was proposed in Li et al. (2017); based on the functional basis and Naive Bayes theory, it aimed to effectively extract the hidden information from patent texts and provide this information to support the product innovation design process.

For the sentiment analysis task, Duan et al. (2020) proposed a method for sentiment classification in stock message comments. The method considered the training and test sets together to mitigate the effect of short messages; the inferred features were more comprehensive than those of traditional learning methods, which use only the training set. A generative emotion model was employed that defined a text as a probability distribution over a seven-dimensional emotion space and represented each emotion as a probability distribution over words. Semi-supervised aspect-level sentiment classification based on a VAE, with aspect information in the encoder/decoder and an aspect-level emotion classifier, was proposed by Fu et al. (2019). The method considered only the aspect-category-level task, and a Topic Word Embedding model learned aspect-specific word embeddings. The method was supported by an attention-based LSTM with aspect embedding as the feature representation and classifier; besides, a conditional LSTM was applied as the VAE decoder to introduce the text label into the decoder. Sentiment classification based on a conditional VAE with an attention mechanism was elaborated by Yu et al. (2019): the latent semantic information of the but-clause was integrated into the model through the attention mechanism in the conditional VAE to improve classification.

In the scientific context, for multi-label learning problems on attributed graphs in scientific document classification, Akujuobi et al. (2018) proposed a deep generative model; based on GANs, Anokye and Kahanda (2021) developed a novel method called BioSGAN for the protein-phenotype co-mention classification task; to improve the performance of AUC-optimized classifiers on scientific texts, Fujino and Ueda (2016) applied generative models to assist the incorporation of unlabeled samples into the model; and for document- and sentence-level class inference, Soleimani and Miller (2016b) investigated a multi-label topic model. The latter method found the topics present in the corpus, learned the association between topics and class labels, predicted labels for new documents, and performed label associations for each sentence in the documents.

4.5 Transfer learning

Domain adaptation, a method of transfer learning, can be divided into unsupervised and semi-supervised approaches according to the availability of labeled target data (Abdi and Hasehmi 2021). In this survey, we define transfer learning as a semi-supervised approach when a method uses a small amount of labeled target data together with a large, sufficient amount of unlabeled target data.

In the news context, for binary logistic regression, Wang et al. (2019) applied a multiple-source differentially private hypothesis transfer learning method. The scarce labeled target data were treated using unlabeled data with a rigorous differential privacy guarantee; the weight assigned to each source hypothesis was determined by its relationship with the target, thus attenuating negative transfer. Li and Dai (2018) overcame the problem of small amounts of labeled target data for forming a validation set by extracting samples from the source dataset based on dynamic dataset regrouping. A new inductive knowledge transfer learning algorithm, integrated with a modified Rank-based Reduce Error ensemble selection approach to address the different distributions of the source and target domains, was used for news text classification.

In the cross-lingual task, Moon and Carbonell (2016) sought to learn new target tasks with limited label information by leveraging source datasets with heterogeneous features and label spaces. The approach mapped heterogeneous source and target labels into the same Skip-gram word embedding to obtain their semantic class relation. In cross-lingual text categorization, Huang et al. (2020b) elaborated a novel algorithm, heterogeneous discriminative feature learning with LP, which learned discriminative features with label consistency through two domain-specific projections and performed LP by exploiting the structural information of the data.

Still in the cross-lingual task, for heterogeneous transfer learning, Sukhija and Krishnan (2019) employed a new approach, Web-induced Heterogeneous Transfer Learning with sample selection, for multilingual text classification. A novel Feature Space Remapping algorithm associated the domains with heterogeneous feature and label spaces without relying on instance or feature correspondences between the source and target domains. Based on web-induced knowledge, labels across the two domains were semantically aligned, which in turn yielded the correspondences for aligning the heterogeneous features of the source and target domains. With a novel semi-supervised discriminative transfer learning method, Kang et al. (2019) tackled cross-language text classification. The unlabeled data in the source and target languages were used to adjust for the different distributions of the features in the labeled target data, in addition to a monolingual classifier, trained with labeled data in the source language, for an efficient transition.

In the sentiment analysis task, Mathapati et al. (2019) experimented with a semi-supervised method for dual sentiment analysis addressing the polarity shift problem, with domain adaptation that conducted training with scarce labeled data adapted across different domains. The approach applied collaborative deep learning to handle the dependency between distant terms in reviews: LSTM addressed sequence prediction and CNN extracted features. For sentiment analysis, Abdi and Hasehmi (2021) learned a new discriminative representation of the data through an innovative domain adaptation technique. The instances of the source and target domains were embedded into a new feature space; with the samples in a common latent feature space, the method minimized the discrepancy between the source and target distributions while preserving the structural information of the data.

A domain-adaptable lexicon for sentiment analysis using maximum entropy with bipartite clustering was built by Deshmukh and Tripathy (2017). The preprocessed source and target datasets were taken as input, an adapted entropy classifier was applied, and a bipartite graph clustering between common and uncommon words was constructed; the clustering handled the mismatch between the domain-specific words of the source and target domains. Song and Park (2018) identified intention posts in multiple domains with specialized multiple-source transfer learning based on multi-instance learning. The method used positive instances to transfer knowledge across domains, thus treating the false negatives that affect multi-instance learning.

4.6 Others

In this subsection, we describe semi-supervised methods that do not fit the taxonomy proposed by van Engelen and Hoos (2019).

This paragraph describes articles whose methods were applied in the news context. A TSVM algorithm based on Ant Colony Optimization to solve the transductive inference optimization problem of SVMs was proposed by Yu et al. (2016). Based on PUL, Sakai et al. (2017) applied an area under the curve (AUC) optimization method; the unlabeled data contributed to improving the generalization performance of the PU and semi-supervised AUC optimization methods without restrictive distributional assumptions. Cheeks et al. (2016) developed a process for discovering communication frames found in online news articles with relevant socio-environmental issue contexts; NMF was combined with TF–IDF to discover frames by revealing latent relationships in the articles. Customer disputes were automatically classified according to their root causes in Severin et al. (2019): categories and their keywords were defined in a supervised step of the method, and the disputes were then placed into the appropriate categories, thus reducing the manual labeling of a training dataset.

In the Chinese news context, a small portion of documents was automatically labeled with high accuracy based on lexical databases as external semantic resources (Xu et al. 2017). The labeled documents were combined with many unlabeled documents to form the training data, and a TSVM with Deterministic Annealing built the SSL approach.

5 Results analysis per datasets

A comparison among machine learning methods does not produce a reliable answer because several parameters are involved in the learning process. In semi-supervised methods, for example, the amounts of labeled and unlabeled data, the evaluation metrics, and the subsets of the datasets used in the experiments were not always equal. We certainly do not presume to judge the semi-supervised methods; rather, our goal is to shed some light on the area through observation. The following subsections present the semi-supervised approaches per dataset and the results reported by the articles' authors. Section 5.1 presents the 20 Newsgroups dataset. Section 5.2 presents the Reuters 21578 dataset. Section 5.3 presents the Reuters RCV1 and RCV2 datasets. Section 5.4 presents the movie review datasets. Section 5.5 presents the Twitter datasets. Section 5.6 presents the Amazon, Yelp, and TripAdvisor datasets. Section 5.7 presents the scientific datasets. Section 5.8 presents the medical datasets. Section 5.9 presents the AG News, DBpedia, and WebKB datasets. Section 5.10 presents the TREC datasets. Section 5.11 presents the Chinese and Vietnamese datasets.

5.1 20 Newsgroups dataset

Results of experiments on the 20 Newsgroups dataset are shown in Table 3, which covers 24 articles, five of which performed experiments with ANNs. SSL approaches combined with ANNs were researched by Zhao et al. (2022), whose GCN-based method outperformed state-of-the-art models across five benchmark datasets.

In Jiang et al. (2018), a DBN surpassed the classical baseline algorithms at different data scales on datasets beyond 20 Newsgroups; in fine-tuning optimization, L-BFGS was more adequate than gradient descent. In Vilhagra et al. (2020), a CSN for the deep neural representation of the input data based on pairwise constraints outperformed MPC-KMeans and the ordinary K-Means algorithm on six datasets, and its performance increased with the number of constraints provided. LDA and Word2Vec overcame baselines in Jedrzejowicz and Zakrzewska (2020). GAN–BERT, developed by Croce et al. (2020), demonstrated superior results compared to BERT: with 1% of labeled data, GAN–BERT achieved an F\(_{1}\)-Score higher than 40% while BERT's result was below 20%, and GAN–BERT remained superior to the baseline up to 40% of labeled data.

Table 3 20 Newsgroups dataset by SSL approach

The remaining 19 articles used algorithms other than ANNs both in the text representation and in the development of the classification model. In Widmann and Verberne (2017), the results could not prove an advantage of graph-based SSL over the supervised learning baseline. Guru et al. (2016) demonstrated the efficacy and robustness of the proposed model in detecting unknown classes efficiently. Sun et al. (2020) achieved superior classification accuracy over state-of-the-art SSL algorithms. Pavlinek and Podgorelec (2017) demonstrated that self-training with LDA, when used in combination with Multinomial Naive Bayes, achieved better accuracy than comparable methods. Altınel et al. (2017) labeled unlabeled instances based on the meaning scores of words to augment the training set, which proved valuable and increased the accuracy on previously unseen test instances. Altınel and Ganiz (2016) utilized abundant sources of unlabeled instances to improve accuracy, especially when the number of labeled instances was limited. Iglesias et al. (2016) improved the accuracy of the text classifiers. Comparative experiments in Zhang et al. (2021a) demonstrated that the method had good classification performance. Yadav et al. (2019) compared the sqrt-cosine similarity metric with the Euclidean L2 norm and cosine similarity, demonstrating superior results in the quality of graph construction and in classification/inference. Barman and Chowdhury (2018) showed effectiveness in assigning labels to a large set of unlabeled data with the help of a very small labeled dataset.

In Liu et al. (2016), the Universum supported the classifiers when few labels were available. Fujino and Ueda (2016) outperformed the baseline methods; the approach improved imbalanced binary classification performance. Soleimani and Miller (2016a) surpassed the performance of both standard semi-supervised and supervised topic models. Steyn and de Waal (2016) had good text classification performance; however, accuracy decreased in the identification of anomalous text documents because the unlabeled data increased the magnitude of the class imbalance through EM. Xiao et al. (2019) demonstrated improvements in the algorithm's effectiveness. Wang et al. (2019) improved over the baselines, and Li and Dai (2018) outperformed the baseline non-transfer algorithms and state-of-the-art transfer learning algorithms, with lower storage requirements and higher classification speed. Yu et al. (2016) overcame the baseline TSVM algorithms in terms of classification precision and running-efficiency indexes. Sakai et al. (2017) exceeded the baseline algorithms with short computation times.

5.2 Reuters 21578 dataset

The results on the Reuters 21578 dataset, a collection of news articles, are presented in Table 4. ANNs were applied by four authors, three of whom were already described previously. Kumar et al. (2021), using an MLP, achieved competitive performance gains in SSL-based classifiers over the SSL baseline: Cascading (gain of 7%) and Rank-based (gain of 5%).

Table 4 Reuters 21578 dataset by SSL approach

The remaining 10 articles used algorithms other than ANNs both in the text representation and in the development of the classification model; four of them already had their results summarized previously. Carnevali et al. (2021) outperformed state-of-the-art algorithms based on the vector space model or on graphs in terms of F\(_{1}\)-Score; the method improved the classification performance from 10% when using only 1 labeled document to 28% with 30 labeled documents. Rossi et al. (2017) facilitated graph construction. Villatoro-Tello et al. (2016) demonstrated that selecting confidently labeled documents improved performance across iterations when short text summaries were used as the set of labeled data. In Tanha (2019), a Decision Tree as the base learner outperformed the supervised and semi-supervised baseline algorithms, and Tanha (2018) surpassed state-of-the-art boosting methods for multiclass SSL. Thomas and Resmipriya (2016) achieved better accuracy using SMTP for the distance calculation.

5.3 Reuters RCV1 and RCV2 datasets

The Reuters RCV1 and RCV2 datasets are collections of news articles used for cross-lingual and multi-label classification; the results of the SSL approaches are presented in Table 5. Five authors employed the ANN approach. Li et al. (2018) used a CNN for multi-label classification and improved the performance compared with traditional ANNs. Shayegh et al. (2019) applied a CNN and achieved results on par with several state-of-the-art supervised and SSL algorithms. In Qiu et al. (2020), a pre-trained 300-dimensional fastText language model and a CNN as the multi-label text classifier outperformed two supervised multi-label learning solutions and, compared with two SSL methods based on consistency regularization, overcame them in 19 and 16 evaluation indicators, respectively. Miyato et al. (2017), with LSTM and bi-LSTM, achieved state-of-the-art performance on the RCV1 dataset with a 5.54% error rate, besides achieving state-of-the-art results on various other text classification tasks. Moon and Carbonell (2016) improved the hetero-lingual text classification task.

Table 5 Reuters RCV1 and RCV2 datasets by SSL approach

The remaining seven articles used algorithms other than ANNs, and their results are summarized in sequence. Gong et al. (2017) overcame the baseline methods on the accuracy metric; besides, the method outperformed the GFHF baseline when label noise was present. Xu et al. (2016), with CoL(2-layer) (71.73%) and CoL(3-layer) (72.45%), outperformed the existing SSL methods, whose best result was 69.34%. Sukhija and Krishnan (2019) outperformed the baselines SHFR-RF by 3.5–7%, SHDA-RF by 2.5–3%, DAMA by 7–15%, and Co-HTL by 1.5–3.5% in every cross-lingual transfer setting; for the cross-lingual Reuters Multilingual dataset, the method improved over the baseline Random Forest and overcame state-of-the-art transfer approaches on three diverse real-world transfer tasks. Huang et al. (2020b) outperformed several baseline adaptation methods even when the distribution difference was substantially large. Kang et al. (2019) demonstrated overall significant performance, with 89.2% and 85.4% accuracy in over 20 one-vs.-one classification tasks and in one-vs.-all classification, respectively, while the best baseline achieved 88.4% and 84.2%, respectively.

5.4 Movie review datasets

Table 6 shows the results of experiments on movie review datasets; experiments with ANNs were carried out in 14 articles. Ju et al. (2022) applied a GNN to learn graph representations. On the IMDB multi-class dataset, it was slightly below the baseline, which had 43.7% accuracy; on the IMDB binary dataset, varying the amount of labeled data, the method achieved the best performance compared to the baseline algorithms, reaching roughly 67.0% accuracy with only 5% of the labeled data. The GAT implemented by Yang et al. (2021a) outperformed state-of-the-art methods under both transductive and inductive learning. Pan et al. (2020) applied an encoder and decoder model with LN and Word2Vec, BERT, DistilBERT, or ALBERT. The method was effective for sentiment analysis: ALBERT achieved 83.4% accuracy with 4% of labeled data and outperformed supervised LSTM and SVM. The cost function reduced the difference between the clean encoder and the noisy encoder–decoder.

Fine-tuning the pre-trained language model BERT for sentiment classification was employed by Sun et al. (2019a). Within-task and in-domain further pre-training boosted text classification performance and improved tasks with small-size data; the proposed approach achieved a new state of the art on eight text classification datasets. Li and Ye (2018), with the GAN approach, neural word embeddings for text representation, and an LSTM as the discriminator, outperformed competing state-of-the-art methods. A bi-GRU was implemented by Xiang and Yin (2021), and the method was compared with semi-supervised baselines, demonstrating an improvement of 7% while some baselines, such as Virtual Adversarial, improved by 2%; however, the model achieved an accuracy of 89.0% versus 94% for the Virtual Adversarial model. To generate the adversarial texts, Zhang et al. (2020) used CBOW and applied a bi-LSTM, which outperformed the methods based on adversarial training, VAT, and the baseline without perturbations. With the BERT language model and an ANN, Li and Sethy (2020) had results comparable to the supervised baseline.

Table 6 IMDB and Movie Review (MR) datasets by SSL approach

ANNs and Doc2Vec in the manifold approach were used by Gupta et al. (2018). The method showed gains in a single-corpus setting as well as in two cross-corpora settings, particularly when a smaller fraction of the training data was labeled; in the two cross-corpora settings, the semi-supervised regularization outperformed baseline supervised training. With a VAE and an attention mechanism applied in the generative model approach, Yu et al. (2019) outperformed the baseline semi-supervised methods, achieving an accuracy of 80.7% against the best baseline, Aux-LSTM (79.5%), with 10k unlabeled samples. Aux-LSTM performed better with 1k, 2k, and 4k unlabeled samples, but CVAE-Attention achieved the best performance with 10k unlabeled samples.

The remaining five of the 17 articles investigated algorithms other than ANNs. Ganiz (2016), with \(\lambda = 1\), achieved 88.00% accuracy on the IMDB dataset, more than 10% above its closest competitor, when the training dataset size was only 1.0% and the unlabeled data size was 79.0%. On the 1150haber dataset, the method with \(\lambda = 1\) achieved an accuracy of more than 90.0% with 1% of the data as the labeled training set. The method also outperformed the baseline semi-supervised algorithm on the WebKB4 dataset, with \(\lambda = 0.5\) achieving an accuracy of about 77.0% with 1.0% as the labeled training set. Khan et al. (2017) obtained an accuracy improvement of 2–3% on average when the model selection procedure was introduced; the approach outperformed the state-of-the-art semi-supervised and supervised approaches on the Cornell MR dataset.

5.5 Twitter datasets

Table 7 presents the Twitter datasets used in the experiments, where 9 of the 12 articles applied ANNs. Namrutha Sridhar et al. (2020) produced word embeddings for the entire Twitter dataset with Word2Vec, and one of the base learners was an MLP; the method had the best results among the baselines for both overall and individual class labels. In Baecchi et al. (2015), CBOW with negative sampling and Logistic Regression improved the accuracy compared to the plain CBOW representation. Using the fastText language model and deep learning models, Stanojevic et al. (2019) outperformed alternative algorithms by capturing additional context from the unlabeled data; the method was on par with state-of-the-art classification models.

In Karisani and Karisani (2021), BERT and ANNs overcame the baseline algorithms on the ADR dataset, on the Earthquake dataset when N = 500 samples were labeled, and on the Product dataset; the approach outperformed the existing state-of-the-art semi-supervised classifiers across multiple settings. With Word2Vec, LSTM, and CNN, Hanafy et al. (2018) improved the accuracy of the individual models by more than 1% using a simple voting ensemble; the method achieved accuracy near state-of-the-art results with 170K training samples, i.e. only 10% of the data used by the baseline models. The GAN with a common LSTM implemented by Najari et al. (2022) had appropriate results for bot detection. Queiroz Abonizio and Barbon Junior (2020) used DistilGPT-2 as a generator and DistilBERT as a discriminator to augment real-world social media datasets, overcoming recent text augmentation techniques.

The following three authors did not use ANNs. Nguyen (2016) outperformed all other baseline methods in accuracy when only a few labeled instances were used. Ghosh and Desarkar (2020) achieved a Macro-F1 of 61.18% against 58.68% for the baseline, both models with SVM, on the FIRE16 dataset, and 86.60% versus 85.23% for the baseline on the SMERP17 dataset; experiments on three disaster-related datasets demonstrated overall performance improvements over a standard supervised approach. In Hasan et al. (2020), the score was further improved for MedHelp and Twitter when the symptom and side-effect classes were combined into one single class; the improvement in the Macro-F\(_{1}\) and Micro-F\(_{1}\) scores by the semi-supervised model was about 1% when the symptom and side-effect dictionaries were not used and the training size was less than 50%.

Table 7 Twitter datasets by SSL approach

5.6 Amazon, Yelp and TripAdvisor datasets

Table 8 shows the results of experiments with the Amazon, Yelp, and TripAdvisor datasets by SSL approach. Of the 22 articles, 16 performed experiments with ANNs at some stage of the classification task. With convolutional–deconvolutional auto-encoding, Chawla et al. (2019) outperformed the baseline for sentiment classification on the Yelp dataset with 1% of labeled data, and the state of the art for text reconstruction on the Hotel review dataset as well as the Enron email data; joint learning, with pre-training and data-relevant language features, improved the performance of the model for effect prediction on the Enron-FFP dataset. In Zaghdoudi and Glomann (2021), an LSTM achieved an accuracy of about 87.0% in multi-label classification. Zhang et al. (2021b) applied BERT for the embedding in addition to classification, and a GNN with attention-based aggregation; on the product categorization dataset with 683 categories and only three seed documents per category, the method achieved an accuracy less than 2% below a supervised BERT model trained with about 50K labeled documents. Using Word2Vec, Park et al. (2019) achieved better sentiment prediction than traditional representation methods on both the Amazon and Yelp datasets.

Using a GAN with an LSTM as the generator and a CNN as the discriminator, Shehnepoor et al. (2022) outperformed the baseline methods. Aghakhani et al. (2018), with Word2Vec and a GAN with an LSTM generator and CNN discriminator, demonstrated the same accuracy as state-of-the-art approaches that applied supervised machine learning. Stanton and Irissappane (2019) used word embeddings generated by an ANN, a multi-layer RNN with GRUs as the base cell for the generator, and an RNN for the discriminator; experiments demonstrated that the method surpassed state-of-the-art supervised and semi-supervised techniques when labeled data were limited. Using LSTM to address sequence prediction and CNN to extract features, Mathapati et al. (2019) demonstrated that deep collaboration achieved better accuracy than Naive Bayes, CNN, or LSTM alone. Using ANN word embeddings, Abdi and Hasehmi (2021) achieved superior results in comparison with unsupervised and semi-supervised state-of-the-art domain adaptation approaches.

The remaining articles did not use an ANN at any stage of the text classification, and some of them already had their results summarized previously. Sentiment classification was improved by leveraging reviewer information, according to Xu and Li (2017). In Deshmukh and Tripathy (2017), the accuracy achieved by the baseline method ranged from 78.14% to 80.04%, whereas the accuracy of the proposed approach ranged from 71.65% to 96.89%.

Table 8 Amazon, Yelp, and TripAdvisor datasets by SSL approach

5.7 Scientific datasets

Table 9 shows the results of the SSL approaches on the scientific datasets. In the graph-based approach with ANNs, Zhu et al. (2021) applied a GNN to learn different aspects of pre-trained global features and the raw attributes of the graph; the method achieved SSL state-of-the-art results on both plain and attributed graphs. With a label-consistency GNN, Xu et al. (2020) outperformed traditional GNNs in node classification. Wang et al. (2021), with a CNN and a graph embedding branch to learn global features, outperformed comparable approaches on the CiteSeer and Cora datasets with accuracy improvements of 2.4% and 3.9%, respectively; on PubMed, the performance of the proposed model was only 0.7% lower than the baseline. Yang et al. (2021b) employed a simplified multilayer GCN in which redundant computation was handled by removing nonlinearities and merging the weight matrices between graph convolutional layers; the method matched the running speed of simple graph convolution (SGC) and outperformed GCN and SGC in five downstream tasks.

Overfitting was reduced through feature augmentation from the dropout layer by Hu et al. (2021) with a CNN; besides, the method effectively improved the robustness and generalization performance of GCNs and improved performance in scenarios where very few labels were available for training. The GCNSVAT and GCNDVAT algorithms were applied by Sun et al. (2019b) and demonstrated effectiveness under different training sizes across the scientific datasets. Huang et al. (2021), with a GAT, surpassed benchmarks and achieved the most advanced performance on Cora, CiteSeer, and PubMed. Attention network embedding with two layers of bi-GRU was applied by Liu et al. (2018a) and outperformed the baseline methods. Akujuobi et al. (2020) used a recurrent-attention strategy, and the method was flexible enough to work in both transductive and inductive settings; in the transductive setting, the model exhibited performance similar to GCN but outperformed all other baseline methods in all settings, and extensive experiments on four datasets demonstrated that the proposed method outperformed several state-of-the-art methods. Akujuobi et al. (2018) applied ANNs and overcame the baselines. Anokye and Kahanda (2021), using an MLP and a bi-LSTM, achieved state-of-the-art performance in classifying the validity of a given sentence-level co-mention from the biomedical literature, outperforming traditional machine-learning-based methods with an F-max of 81.0%.

Five of the 18 articles used different methods. Guo (2018) achieved an error rate of about 8% after 40 iterations with View 1 and 10% after 45 iterations with View 2 on the Courses dataset, and an error rate of about 9% after 30 iterations with View 1 and 5% after 30 iterations with View 2 on the ads dataset; these results demonstrated that the proposed approach outperformed the original co-training and DCPE co-training on the Courses and ads datasets. The remaining articles were described previously.

Table 9 Scientific datasets by SSL approach

5.8 Medical datasets

The results on the medical datasets by SSL approach are presented in Table 10. Two authors implemented GNNs. Without ANNs, Soleimani and Miller (2016b) achieved better labeling performance than the baseline methods and increased the quality of the topics (higher likelihood of unseen data), even compared to other semi-supervised methods such as LDA; besides, the proposed approach outperformed several baseline methods in both document and sentence labeling as well as in test-set log-likelihood.

Table 10 Medical datasets by SSL approach

5.9 AG News, DBpedia, WebKB datasets

The results on the AG News, DBpedia, and WebKB datasets are presented in Table 11; 7 of the 14 articles implemented ANNs. In Xie et al. (2019), the encoder and classifier implemented were vanilla LSTM networks, and the decoder applied a conditioned LSTM. Without ANNs, Wu et al. (2019) performed web page classification against baselines with the ratio of labeled training samples to the total number of training samples increasing from 10 to 90%; experiments on widely used web page datasets demonstrated that the proposed approach significantly outperformed state-of-the-art semi-supervised multi-view feature learning methods.

Table 11 AG News, DBpedia, WebKB datasets by SSL approach

5.10 TREC dataset

Table 12 presents the TREC dataset and the SSL approaches, where two of the six articles used ANNs. With a deep neural representation, Liu et al. (2018a) outperformed the MPC-KMeans and ordinary K-Means algorithms. With the BERT language model and only 60 labeled samples, Li and Sethy (2020) had a better result than the semi-supervised ULMFiT with 100 labeled samples.

Table 12 TREC dataset by SSL approach

5.11 Chinese and Vietnamese datasets

Table 13 presents the Chinese and Vietnamese datasets and the SSL approaches. Ji et al. (2021) applied GNNs and GNN variants, e.g. binary-sample GCN and binary-sample GAT; the proposed method was superior to most text classification methods in streaming social traffic event detection. With a DBN, Song et al. (2016) extracted abstract features, resulting in improved classifier performance that was better than the SVM algorithm. The BERT language model and deep learning were employed by Liu et al. (2021): with 700 labeled examples, BERT achieved 87.5% accuracy and the proposed approach 92.1% on the Weibo dataset.

The remaining 12 articles did not apply ANN. Guo et al. (2016) achieved an accuracy improvement of 2.8% and an overall improvement of 5.2%, outperforming the baselines in detecting credible influenza posts on Sina Weibo. Zhang et al. (2019a) achieved classification accuracies of 96.7% and 98.1% on the PKU and FD datasets, respectively, outperforming the best baseline algorithm. Using Sohu News and texts from Fudan University, Zhu et al. (2018) outperformed the baseline method with 30% sample expansion. Based on an expansion of 100 samples, WSE with Naive Bayes achieved the best result on the Sohu dataset, with an F-measure of approximately 72.5%, while WSE with SVM achieved the best result on the Fudan University texts, with an F\(_{1}\)-measure of approximately 75%. In Ha et al. (2018b), when the current dataset was small, the improvement was about 2%; the proposed approach outperformed the baseline for all groups of experiments, with an improvement of about 1%. The features built by the approach supported the classification and achieved the best result of 78.77% with 20 topics.

Table 13 Chinese -zh, and Vietnamese -vi datasets by SSL approach

In Ha et al. (2018a), experiments on two datasets, Vietnamese reviews and the English Enron emails, demonstrated positive effects. Nguyen Nhat Dang and Duong (2019) improved classifier accuracy on almost all the datasets tested. Duong and Anh (2021) improved Vietnamese sentiment polarity classification with Easy Data Augmentation techniques, raising the F\(_{1}\)-score from 85.2 to 86.2%. Yin et al. (2018) achieved better results than the kNN and SLAS algorithms in five domains: politics, economy, education, entertainment, and science and technology. Xu et al. (2017) achieved more than 95% accuracy with up to 10% of labeled documents. TSVM (96.3% accuracy) and DA (96.6%) achieved the best results on Netease Dataset 1, versus 86.8% for the baseline SVM. On Netease Dataset 2, TSVM reached 95.8% and DA 96.7%, versus 92.3% for the baseline SVM. On Sogou Dataset 1, TSVM reached 92.6% and DA 94.6%, compared with 94.7% for the baseline SVM. Lastly, on Sogou Dataset 2, TSVM reached 96.5% and DA 96.4%, versus 93.2% for the baseline SVM.

6 Benefits and limitations of the works

The benefits and limitations of each category of SSL approaches are described as follows.

GNNs require a large amount of labeled data to learn graph representations effective enough to support similarity-based prediction. According to Xu et al. (2020), GCN is limited in aggregating information from nodes with similar features or attributes because its aggregation matrix depends exclusively on the graph structure. Despite GAT's strong results, its aggregation matrix is based exclusively on neighboring nodes, and Brody et al. (2021) demonstrated that the attention GAT computes is limited, i.e., it is static attention.
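
To make the structural dependence concrete, the sketch below (a minimal NumPy illustration of our own, not drawn from any surveyed implementation) shows a single GCN layer: the normalized aggregation matrix is computed from the adjacency matrix alone, so nodes with similar features but no connecting edge are never mixed.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer. The aggregation matrix A_norm is built purely from
    the adjacency matrix A, so nodes are mixed by graph structure alone,
    never by feature similarity (the limitation noted above)."""
    A_hat = A + np.eye(A.shape[0])                    # add self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt          # symmetric normalization
    return np.maximum(0, A_norm @ H @ W)              # ReLU activation

# toy example: 3 documents as nodes, 4-dim features, 2 output dims
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
H = np.random.rand(3, 4)
W = np.random.rand(4, 2)
print(gcn_layer(A, H, W).shape)  # (3, 2)
```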

In the cluster-then-label approach, a low-dimensional, dense feature space is the appropriate setting for clustering algorithms, since high dimensionality and sparsity in document clustering degrade text classification performance. For short texts the problem is accentuated: the features are high-dimensional and extremely sparse. Another problem relates to side information, since constraint quality is fundamental in semi-supervised clustering algorithms.
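
As an illustration of the dimensionality issue, the following sketch (the function name and parameter values are our own, built on standard scikit-learn components) reduces sparse TF-IDF vectors to a dense low-dimensional space before clustering, then labels each cluster by majority vote over its labeled members:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

def cluster_then_label(texts, labels, n_clusters=4, n_components=50):
    """labels holds -1 for unlabeled documents. All documents are
    clustered in a dense, low-dimensional space; each cluster then takes
    the majority label among its labeled members."""
    X = TfidfVectorizer().fit_transform(texts)            # sparse, high-dim
    k = min(n_components, X.shape[1] - 1)                 # guard tiny vocabularies
    X_dense = TruncatedSVD(n_components=k).fit_transform(X)
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X_dense)
    pred = np.array(labels, dtype=object)
    for c in range(n_clusters):
        members = clusters == c
        seeds = [l for l in pred[members] if l != -1]
        if seeds:                                         # majority vote
            pred[members] = max(set(seeds), key=seeds.count)
    return pred
```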

In the feature extraction approach, embeddings are learned from text regions of the unlabeled data and then fed to a neural network in the supervised part. The unsupervised part takes advantage of contextual or static features and integrates them into a supervised ANN. The approach has used pre-trained language models (Word2Vec, BERT, among others) or ANNs such as CNNs and DBNs with embedding layers to handle text input. However, static embeddings such as Word2Vec are limited in preserving the full meaning of documents: they do not recognize elements with the same meaning in different sentences and do not handle polysemy. Furthermore, they depend on a huge corpus and do not cover words outside the training vocabulary, with the exception of fastText, which solved the unknown-word problem using character-level n-grams (Kowsari et al. 2019). With the emergence of contextual text representations and transformer-based language models, words began to be interpreted from their contexts. However, transformers and the attention mechanism struggle to track long sequences, and the large number of parameters (millions or billions) and/or corpus size make training expensive and slow.
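
A minimal sketch of this integration, assuming PyTorch (the class name and toy dimensions are illustrative), plugs embeddings pre-trained on unlabeled text into a supervised network as a frozen layer:

```python
import torch
import torch.nn as nn

class EmbeddingClassifier(nn.Module):
    """Embeddings pre-trained on unlabeled text (e.g. by Word2Vec) are
    plugged into a supervised classifier as a frozen embedding layer."""
    def __init__(self, pretrained, n_classes):
        super().__init__()
        # pretrained: (vocab_size, emb_dim) tensor learned on unlabeled data
        self.emb = nn.Embedding.from_pretrained(pretrained, freeze=True)
        self.fc = nn.Linear(pretrained.shape[1], n_classes)

    def forward(self, token_ids):            # token_ids: (batch, seq_len)
        x = self.emb(token_ids).mean(dim=1)  # average pooling over tokens
        return self.fc(x)

# toy usage: vocabulary of 1000 words, 100-dim vectors, 4 classes
model = EmbeddingClassifier(torch.randn(1000, 100), n_classes=4)
logits = model(torch.randint(0, 1000, (8, 20)))  # batch of 8 documents
```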

The self-training approach selects confidently predicted samples to augment the training set. However, the threshold parameter is not always suitable for sample selection, and without confident pseudo-label selection the classifier learns from noise, i.e., errors reinforce themselves. Another limitation is the scarcity of labeled data. In the transfer learning approach, a problem in domain adaptation is the discrepancy between labeled source and target instances; pseudo-labeling strategies for unlabeled target samples are one way to handle it, but pseudo-labels are subject to noisy information.
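
The threshold mechanics can be sketched as follows. This is an illustrative loop around any scikit-learn-style probabilistic classifier operating on NumPy arrays, not a reference implementation from the surveyed works (scikit-learn also ships a comparable SelfTrainingClassifier):

```python
import numpy as np
from sklearn.base import clone

def self_train(clf, X_lab, y_lab, X_unl, threshold=0.9, max_iter=10):
    """Repeatedly add unlabeled samples whose predicted probability exceeds
    `threshold`. A threshold that is too low admits noisy pseudo-labels,
    letting the classifier's own errors reinforce themselves."""
    for _ in range(max_iter):
        model = clone(clf).fit(X_lab, y_lab)
        if len(X_unl) == 0:
            break
        proba = model.predict_proba(X_unl)
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break                             # nothing passes the threshold
        pseudo = model.classes_[proba[confident].argmax(axis=1)]
        X_lab = np.vstack([X_lab, X_unl[confident]])
        y_lab = np.concatenate([y_lab, pseudo])
        X_unl = X_unl[~confident]
    return model
```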

The co-training approach trains two classifiers on the same training data with a different view for each classifier, based on the assumption that the training data has two independent views. The views are limited by how representative the text features are, and the integration of contextual text representations to generate independent views within co-training is still little investigated (Graef 2021), as is the use of deep networks as the base learner. Furthermore, co-training shares self-training's problem of adding unconfident unlabeled samples to the labeled training set. In the boosting approach, a pairwise similarity function is applied to labeled and unlabeled data to assign more reliable pseudo-labels to unlabeled examples; however, an inappropriate similarity measure compromises the algorithm's performance (Tanha 2019).
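
A hedged sketch of the two-view loop follows; the selection rule (each view nominating its k most confident unlabeled samples per round) is an illustrative choice of our own, not a prescription from the surveyed works:

```python
import numpy as np
from sklearn.base import clone

def co_train(clf, X1, X2, y, X1_u, X2_u, k=5, rounds=10):
    """(X1, X2) are two feature views of the same labeled documents and
    (X1_u, X2_u) the views of the unlabeled ones. Each round, the two
    classifiers jointly pseudo-label the most confident unlabeled samples."""
    for _ in range(rounds):
        c1, c2 = clone(clf).fit(X1, y), clone(clf).fit(X2, y)
        if len(X1_u) == 0:
            break
        p1, p2 = c1.predict_proba(X1_u), c2.predict_proba(X2_u)
        # each view nominates its k most confident unlabeled samples
        pick = np.union1d(np.argsort(p1.max(1))[-k:], np.argsort(p2.max(1))[-k:])
        # label each picked sample with the more confident view's prediction
        pseudo = np.where(p1.max(1)[pick] >= p2.max(1)[pick],
                          c1.classes_[p1[pick].argmax(1)],
                          c2.classes_[p2[pick].argmax(1)])
        X1, X2 = np.vstack([X1, X1_u[pick]]), np.vstack([X2, X2_u[pick]])
        y = np.concatenate([y, pseudo])
        keep = np.setdiff1d(np.arange(len(X1_u)), pick)
        X1_u, X2_u = X1_u[keep], X2_u[keep]
    return c1, c2
```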

In the perturbation-based approach, adversarial training is applied to continuous word embeddings to allow infinitesimal perturbations, given the discrete nature of text and its representation as high-dimensional one-hot vectors. Perturbing text is more difficult than perturbing images, whose space is continuous, and text perturbations affect example quality, yielding non-interpretable adversarial examples. Models are trained to be smooth against examples generated along the adversarial direction, i.e., the direction where the model is most vulnerable. In a white-box attack, adversaries are generated by a gradient-based method on word embeddings, so their quality is tied to the distance metrics used. In VAT, the generated perturbation is rigid due to the randomly initialized perturbation and constraint problems.
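
In the spirit of gradient-based adversarial training on embeddings, a minimal PyTorch sketch (the function signature, pooling of the norm, and epsilon value are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def adversarial_loss(model, emb, labels, epsilon=1.0):
    """Perturb continuous word embeddings along the gradient direction,
    i.e. where the model is most vulnerable, and train on the perturbed
    input. `model` is assumed to map embeddings (batch, seq, dim) to logits."""
    emb = emb.detach().requires_grad_(True)
    loss = F.cross_entropy(model(emb), labels)
    grad, = torch.autograd.grad(loss, emb)
    # normalize so the perturbation stays small in embedding space
    r_adv = epsilon * grad / (grad.norm(dim=(1, 2), keepdim=True) + 1e-12)
    return F.cross_entropy(model(emb + r_adv), labels)
```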

In the generative model approach, GANs were the most applied technique (36.36%). GANs have issues that remain unresolved, e.g., text quality, mode collapse, training instability, and vanishing gradients. Partial collapse is more common than full mode collapse; it occurs when the generator produces realistic and diverse samples but with much less diversity than the real data distribution. GANs also have convergence problems: parameter updates change the cost functions of the discriminator and the generator, and the gradient may ascend for one player while descending for the other. For some games the gradients converge and equilibrium is achieved; depending on the game, however, equilibrium cannot always be reached.

We observed that 78 (49.68%) articles were published in the context of short texts from social networks, product and service reviews, and forum discussions, investigating tasks such as sentiment analysis, emerging event detection, fake news detection, and question classification. Short texts are sparse and have scant language structure, which keeps them a challenging problem for deep neural networks, whose performance derives from structured corpora. If the constructed feature set does not fully represent the text, tasks such as sentiment analysis suffer. The high-dimensional sparseness of short-text features can therefore be further explored.

Another limitation observed in the area concerns languages other than English. Only 23% of the works explored other languages, such as Chinese, Vietnamese, Italian, and Portuguese. It can be difficult to find resources for these languages, such as pre-trained language models and corpora. Related to this is the small number of works exploring multilingual classification (around 1%). Moreover, oriental languages differ considerably from occidental ones, so current language models may not be effective for them.

The percentage of labeled data varies widely, from less than 1% to 50%. Even on the same dataset there is no consensus on a fixed percentage of labeled data, which makes comparing works difficult. The same happens with evaluation metrics: accuracy is by far the most used, and almost no paper reports precision and recall. This is a limitation in the area, since many papers explore multiclass classification, for which accuracy is not the indicated measure.
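
A small, self-contained example (with made-up labels) of why accuracy misleads on imbalanced multiclass data, using standard scikit-learn metrics:

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 2]   # imbalanced 3-class labels
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # classifier ignoring minority classes
print(accuracy_score(y_true, y_pred))                              # 0.8, looks fine
print(f1_score(y_true, y_pred, average='macro', zero_division=0))  # ~0.30
```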

7 Current research trends in SSL text classification

We identified six main future trends: ANN language models for text representation; algorithms for hyper-parameter optimization; explainable artificial intelligence (XAI), i.e., methods that allow users to better understand the results and outputs of machine learning algorithms; regularization methods; the development of resources for languages other than English; and the analysis of SSL performance degradation proportional to the number of unlabeled samples.

Considering the techniques employed for text representation, there has been a growth of ANN models for generating word embeddings. Especially after 2019, the number of ANN papers surpassed the traditional algorithms, as shown in Fig. 10. Since Word2Vec, different models have been proposed, such as ELMo, BERT, ALBERT, GPT-2, and GPT-3. According to Fig. 8, Word2Vec and its extensions grew from 2016 and practically stabilized from 2019. The context-sensitive pre-trained model BERT appeared in 2019 and ELMo in 2020, totaling 16 articles. However, experiments using word embeddings as a part/layer of the deep learning model were more common than those using a word-embedding language model.

The surveyed works provide visualizations and analyses showing that learned word embeddings have improved in quality and that models have become less prone to overfitting. There has been a strong focus on ANN for text representation, and it remains a current trend. These models can capture semantic and syntactic information in local sequences of consecutive words; however, they may not capture global co-occurrences of words. New approaches using GNNs can overcome some of these problems and constitute a new field to be explored. These models can reach high accuracy by capturing contextual, semantic, and syntactic properties of texts. However, one must consider the limitation of GNNs in integrating information from nodes with similar features, because the adjacency matrix depends exclusively on the graph structure. In addition, ignoring word order may lead to inappropriate performance in sentiment analysis.

GNNs have been used with an attention mechanism to construct an aggregation matrix based on embedding information. The method can benefit from improvements to the language model and could better capture the relationships among nodes in the graph structure. Nevertheless, more investigation is needed to go beyond neighboring nodes when forming the adjacency matrix. Besides, GNNs are computationally expensive to train and require large corpora.

Due to the discrete nature of textual data, perturbations are applied to continuous word embeddings, which leads to a lack of interpretability; in this case, adversarial training is applied as a regularization method. VAT is an extension of adversarial training to semi-supervised text classification; problems with VAT were investigated by Li and Qiu (2020), with results demonstrating improvement. VAT can be further investigated considering contextual perturbation in texts and the gradient-based method.
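
For reference, a minimal VAT-style sketch on word embeddings; the hyper-parameters xi and epsilon and the single power iteration are conventional illustrative choices, not taken from a specific surveyed paper:

```python
import torch
import torch.nn.functional as F

def vat_loss(model, emb, epsilon=1.0, xi=1e-6, n_power=1):
    """Find, by power iteration, the small perturbation that most changes
    the model's prediction, then penalize the KL divergence between the
    clean and perturbed predictions. No labels are required."""
    with torch.no_grad():
        p = F.softmax(model(emb), dim=-1)        # clean prediction
    d = torch.randn_like(emb)                    # random initial direction
    for _ in range(n_power):
        d = (xi * F.normalize(d.flatten(1), dim=1).view_as(emb)).detach()
        d.requires_grad_(True)
        kl = F.kl_div(F.log_softmax(model(emb + d), dim=-1), p,
                      reduction='batchmean')
        d, = torch.autograd.grad(kl, d)          # direction of largest change
    r_adv = epsilon * F.normalize(d.flatten(1), dim=1).view_as(emb)
    return F.kl_div(F.log_softmax(model(emb + r_adv), dim=-1), p,
                    reduction='batchmean')
```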

Adversarial training, GANs, and contextual embeddings can be combined and exploited in semi-supervised text classification. GANs suffer from instability, and research efforts have been devoted to stabilizing them; among these, GANs have been combined with adversarial training to improve the robustness of the discriminator and to stabilize training on image datasets (Sajeeda and Hossain 2022). However, we did not find such mixed methods in the semi-supervised text classification domain. Furthermore, domain-specific pre-trained language models could bring improvements over general-domain ones; the BERT, ELECTRA, and GPT families, in both general and specific domains, could be investigated along with adversarial training and GANs.

Few articles investigated performance degradation with respect to unlabeled data. Self-training suffers from the semantic drift problem; Karisani and Karisani (2021) used two-stage training to cope with it and showed that performance did not drop as the number of unlabeled samples grew. Altınel and Ganiz (2016) and Altınel et al. (2017) took advantage of unlabeled samples; however, analyses across various datasets showed better performance when the number of labeled samples grew and the number of unlabeled samples decreased. Using GANs for opinion spam detection, Stanton and Irissappane (2019) demonstrated a slight decrease in performance when the number of unlabeled samples increased. There is thus still room to investigate potential performance degradation arising from unlabeled data.

Other subjects that need more investigation in semi-supervised text classification are algorithms for hyper-parameter optimization and XAI. We identified a gap in studies regarding automated hyper-parameter tuning and interpretability for understanding model behavior. Interest in explainability has increased in domains such as medical diagnosis and the legal area. Although some explainable machine learning models exist for models trained on text, we found few works exploring a conceptual understanding of embedding generation and SSL models, or exploring explainable AI for text classification.

Additionally, the long road ahead demands the exploration of new languages and the development of resources for languages other than English. Interdisciplinary research involving applications in multiple fields will probably increase as well.

8 Conclusions

Semi-supervised text classification is gaining prominence due to its ability to reduce annotation costs while achieving competitive results. This survey filled the gap on this topic by selecting 157 articles from 2017 to 2022. We presented the main classification algorithms and results, datasets, and SSL approaches, as well as their limitations.

This study focused only on SSL-based techniques for text classification and did not address purely supervised or unsupervised approaches. From the papers retrieved, it is impractical to indicate a specific classifier for a particular problem. However, various text classification techniques have been identified across different applications, and the information provided in this study can help guide the choice of the best approaches to consider.

This survey also helps to disseminate the datasets used in SSL text mining, presenting in Tables 3-13 all the datasets cited in the papers, together with information on the approaches and results obtained by the works. In particular, Table 13 presents datasets in languages other than English, to encourage more researchers to use them.

Finally, we also present many research trends that can be taken into consideration by researchers and professionals in the area.