1 Introduction

Hundreds of tickets are raised every day on a ticketing system, which serves as an input for Information Technology Infrastructure Library (ITIL) services such as problem management and configuration management. Incident tickets record the symptom description of an issue as well as details of its resolution. If tickets are grouped into clusters based on their textual content, they can provide a better understanding of the types of issues present in the system. This clustering of tickets can be achieved through topic modeling, using popular topic derivation methods such as Latent Dirichlet Allocation (LDA) [2], Probabilistic Latent Semantic Analysis (PLSA) [6] and Non-negative Matrix Factorization (NMF) [8].

Ticket summaries are short, typically restricted to 2–3 sentences. Hence the frequency of co-occurring terms among tickets is normally low, which results in an extremely sparse relationship matrix between tickets and the terms occurring in them. As with Twitter data, traditional topic learning algorithms applied to this matrix generate poor-quality topics. To alleviate this problem we propose a novel approach to derive topics, called T2-NMF, which works in two stages. First, we build a ticket-ticket relationship matrix from the content similarity of each pair of tickets and subject this matrix to symmetric NMF (Non-negative Matrix Factorization) to produce a ticket-topic matrix. In this matrix each topic is represented as a non-negative linear combination of tickets, and each ticket as an additive combination of base topics. In the second stage we factorize the ticket-term relationship matrix as the product of the previously obtained ticket-topic matrix and a topic-term matrix, using NMF again. The topic-term matrix represents each topic as a non-negative combination of terms. From the ticket-topic relationship matrix the cluster membership of each ticket can easily be determined by finding the base topic with which the ticket has the largest projection value, assuming that under the T2-NMF approach each topic corresponds to exactly one cluster.

Assigning tickets to topics (and thus clusters) in the above-mentioned way does not produce the desired result. A closer look at the clusters reveals that quite a few of them contain heterogeneous tickets and, correspondingly, their Silhouette indices are low. As our aim is to generate a coherent topic model in which only similar tickets are grouped in the same cluster, we adopt a heuristic-based approach on top of T2-NMF, which we call hT2-NMF. Here we assign a ticket to a topic (and hence a cluster) based on its contribution to the topic content. If this contribution is no less than a threshold value, we assign the ticket to the topic; otherwise the ticket is bucketed into an anonymous (unnamed) cluster, the idea being that if a ticket cannot be assigned properly to a topic we leave it unlabeled. This way we obtain more homogeneous clusters of tickets and improved Silhouette indices. Moreover, our experiments show that in most cases the labels generated and chosen for the clusters using T2-NMF remain the same under hT2-NMF and attain similar scores. The details of this method can be found in [13].

1.1 Related Work

A vast amount of literature is available on learning topics from lengthy documents using techniques such as PLSA [6], LDA [2] and NMF [14]. These techniques do not produce fruitful results for short texts such as tweets, as the term-document matrix is very sparse, and consequently the quality of the topics suffers. Although some extensions of them have been proposed to deal with tweets [3, 4, 12], sparsity remains a problem. To address it, some studies have used a two-step approach to derive topics from tweet data. For example, the researchers in [3] create a term correlation matrix by computing the positive mutual information value for each pair of unique terms. The term correlation matrix is then factorized along with the tweet-term matrix to learn topics for the tweets and keyword representations for the topics. In another work [12] the authors use tweet text similarity and an interaction measure to derive a tweet-tweet relationship matrix, which is then factorized along with the tweet-term matrix to obtain clusters of tweets and the keyword representation of topics. In this work we adopt similar techniques to learn topics from ticket data. We factorize the ticket-ticket relationship matrix along with the ticket-term matrix to derive the topics (and hence clusters) for tickets and the term representation of topics. Additionally, we incorporate the interaction between ticket vectors (using term frequency) and the relevant topic vectors through a suitable heuristic to finally assign tickets to topics.

2 Ticket Model

We consider incident tickets with a similar schema, which are frequent in IT maintenance. These tickets usually consist of two kinds of fields, fixed and free-form. Fixed fields are customized and filled in a menu-driven fashion; examples are the category of a ticket, its sub-category, the application name, the incident type, etc. There is no standard value for free-form fields. These fields can contain the description or summary of a ticket, which captures the issue behind its creation on the maintenance system. This can be a single sentence that summarizes the reported problem, or a short text describing the incident.

The fixed-field entries of a ticket can be represented using a relational schema. We consider only a limited number of fixed fields of a ticket, choosing attributes that reflect its main characteristics (the domain experts’ comments play an important role in this choice). They can be represented as a tuple: Ticket (application name, category and sub-category). Each instance of the tuple, corresponding to entries in the fixed fields of a ticket, can be thought of as an instantiation of the schema.

We consider the collection of ticket summaries for extracting features using light natural language processing. As pre-processing, we remove tickets that do not contain a summary and delete unnecessary characters from each summary so that it contains only alphanumeric characters. We remove stop words from the collection and then use the Stanford NLP tool to tokenize the summaries and perform POS tagging on the tokens. Once POS tagging is performed we lemmatize the tokens. All nouns are added as unigrams (also called keywords). Certain combinations of adjectives, verbs and nouns form the bigrams and trigrams, with the help of heuristics such as the All Words Heuristic and the Any Word Heuristic. The final list of keywords and keyphrases (together also called terms) consists of the unigrams, bigrams and trigrams extracted as above. We discard terms with document frequency (DF) smaller than 3 to remove rare terms and noise that do not contribute to the content of a ticket summary.
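
To make the above pipeline concrete, the following minimal sketch approximates it in Python, with NLTK substituted for the Stanford NLP tool. The POS patterns below only stand in for the All Words/Any Word Heuristics, whose exact rules are not reproduced here, and the function names (extract_terms, build_vocabulary) are ours, for illustration only.

    # Minimal preprocessing sketch; requires the NLTK data packages
    # punkt, stopwords, wordnet and averaged_perceptron_tagger.
    import re
    from collections import Counter

    import nltk
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    CONTENT_TAGS = ("NN", "JJ", "VB")  # noun/adjective/verb POS prefixes

    def extract_terms(summary):
        """Return the unigram/bigram/trigram terms of one ticket summary."""
        text = re.sub(r"[^0-9a-zA-Z ]", " ", summary.lower())  # alphanumeric only
        stop = set(nltk.corpus.stopwords.words("english"))
        tagged = [(lemmatizer.lemmatize(w), t)
                  for w, t in nltk.pos_tag(nltk.word_tokenize(text))
                  if w not in stop]
        terms = [w for w, t in tagged if t.startswith("NN")]  # nouns as unigrams
        for n in (2, 3):  # bigrams and trigrams
            for gram in nltk.ngrams(tagged, n):
                # stand-in for the All Words Heuristic: every word in the
                # ngram is a content word (adjective, verb or noun)
                if all(t.startswith(CONTENT_TAGS) for _, t in gram):
                    terms.append(" ".join(w for w, _ in gram))
        return terms

    def build_vocabulary(summaries, min_df=3):
        """Keep only terms with document frequency of at least min_df."""
        df = Counter()
        for s in summaries:
            if s.strip():  # drop tickets without a summary
                df.update(set(extract_terms(s)))
        return {t for t, c in df.items() if c >= min_df}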

Thus we can model a ticket as a vector \(T = (x_1, \ldots , x_m)\), where each element \(x_l, 1 \le l \le m\), represents the importance or weight of term l with respect to the ticket T. Here we take \(x_l = tf*idf(T,l)\), where \(tf*idf(T,l)\) denotes the TF * IDF weight of term l with respect to ticket T (we assume smoothed IDF).
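
Given the illustrative extract_terms above, the ticket-term weights can be computed with, for instance, scikit-learn's TfidfVectorizer, whose default smooth_idf=True matches the smoothed IDF assumed here (up to the library's particular IDF formula); summaries denotes the assumed list of cleaned ticket summaries:

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Reuse the illustrative extract_terms as the analyzer and enforce
    # DF >= 3; X has one row per ticket T_i and one column per term.
    vectorizer = TfidfVectorizer(analyzer=extract_terms, min_df=3, smooth_idf=True)
    X = vectorizer.fit_transform(summaries)  # n x m ticket-term matrix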

3 The NMF-Based Approach

We assume that a collection of tickets can be grouped into k clusters, each of which corresponds to a coherent topic. In this work we assume hard clustering, in which each ticket is assigned to one cluster only. In normal practice one applies non-negative matrix factorization (NMF) to the ticket-term matrix to factor it into a ticket-topic and a topic-term matrix. While the former captures each ticket as a non-negative linear combination of topics, the latter produces each topic as an additive mixture of terms.

However, the ticket-term matrix, which captures term occurrence, is extremely sparse, as each ticket contains very few lines of text [3, 12]. To alleviate this problem, we propose a novel approach to derive topics from a collection of tickets by considering ticket content similarity through a two-stage method; for details see [13].

Relationship Between Tickets. Recall that a ticket \(T_i\) can be represented as a vector \(T_i = (x_{i1}, \cdots , x_{im})\), where \(x_{ip}\) indicates the weight (typically the TF * IDF value) of term \(t_p\) in ticket \(T_i\). Ticket \(T_j\) is represented likewise as \(T_j = (x_{j1}, \cdots , x_{jm})\). We define the cosine similarity between two tickets \(T_i\) and \(T_j\) as:

          \( \mathrm{sim}_C (T_i,T_j) = \frac{\sum _{p=1}^{m} x_{ip}\,x_{jp}}{\sqrt{\sum _{p=1}^{m} x_{ip}^2}\,\sqrt{\sum _{p=1}^{m} x_{jp}^2}} \)

Assuming that we have n tickets in the collection, we define a symmetric matrix \(\mathbf {R} = [r_{ij}]\) of order n by \(r_{ij} = 1\) if \(i=j\) and \(r_{ij} = \mathrm{sim}_C (T_i,T_j)\) otherwise. \(\mathbf {R}\) is called the ticket-ticket relationship matrix or, simply, the ticket-ticket matrix.

Assuming that a ticket is represented as a vector of terms as above we build the ticket-term matrix denoted as \(\mathbf {X} = [x_{ij}]\) of dimension \(n \times m\), where \(x_{ij}\) represents the TF*IDF value of term j for ticket i.
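
Both matrices can be built directly from the Sect. 2 pipeline; with the TF*IDF matrix X produced by the vectorizer sketch above, for example:

    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    R = cosine_similarity(X)   # n x n symmetric matrix of sim_C values
    np.fill_diagonal(R, 1.0)   # r_ii = 1 by definition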

Topic Learning for Tickets. The ticket-ticket relationship is modeled by capturing content similarity. The derived ticket-topic matrix will model the thematic structure of the relationships between tickets. Subsequently, using techniques similar to graph clustering, the topic learning problem is formulated as finding the ticket-topic matrix \(\mathbf {U}\) by minimizing the objective function \( L(\mathbf {U}) = \frac{1}{2} \big \Vert \mathbf {R} - \mathbf {U} \mathbf {U}^{T} \big \Vert ^2_{F}, \,\,\, \text{ s.t. }\,\, \mathbf {U} \ge 0,\) where \(\mathbf {U}\) is of dimension \(n \times k\) (assuming a fixed number of topics equal to k). Each column of \(\mathbf {U}\) represents a topic by a vector of weighted contributions of the tickets. This special form of non-negative matrix factorization is known as symmetric non-negative matrix factorization [3], a special case of NMF decomposition. We use a parallel multiplicative update algorithm from [5] to solve this problem; it works by first minimizing the Euclidean distance and then adopting an \(\alpha \)-update rule.
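
The sketch below illustrates symmetric NMF with a simple damped multiplicative update, a standard variant shown here only for exposition; it is not the parallel \(\alpha \)-update algorithm of [5] used in our implementation. The value k = 7 is a placeholder for the number of topics, which in practice is chosen as in Sect. 7.

    def symnmf(R, k, iters=200, beta=0.5, eps=1e-9, seed=0):
        """Minimize ||R - U U^T||_F^2 subject to U >= 0."""
        rng = np.random.default_rng(seed)
        U = rng.random((R.shape[0], k))
        for _ in range(iters):
            RU = R @ U
            UUtU = U @ (U.T @ U)
            U *= (1.0 - beta) + beta * RU / (UUtU + eps)  # damped update
        return U

    k = 7                      # placeholder; chosen via the Silhouette index
    U = symnmf(R, k)           # n x k ticket-topic matrix
    labels = U.argmax(axis=1)  # provisional T2-NMF assignment (largest projection)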

Deriving Representative Terms for a Topic. In the second step we generate a term representation for each topic. One obtains these representative terms by factoring the ticket-term matrix into the ticket-topic matrix and the topic-term matrix. Our method utilizes the already derived ticket-topic matrix to this end. In particular, we consider the ticket-term matrix \(\mathbf {X}\) and project the terms into the latent topic space. In the NMF framework, this problem is formulated as finding a non-negative matrix \(\mathbf {V}\) (of dimension \(m \times k\)) by minimizing the loss function \( L_{\varGamma }(\mathbf {V}) = \frac{1}{2} \big \Vert \mathbf {X} - \mathbf {U} \mathbf {V}^T\big \Vert _F^2, \ \text{ s.t. }\,\, \mathbf {V} \ge 0, \,\, \mathrm{and} \,\, \mathbf {U}\,\, \text{ is } \text{ given }\).

The above equation is an NNLS (Non-negative Least Squares) problem. We use the multiplicative update algorithm (MUA) [7, 14] to solve it (the rows of \(\mathbf {X}\) are initialized to have Euclidean length 1). Since \(\mathbf {U}\) is already given, we use only one update rule: \(v_{ij} \leftarrow v_{ij} \frac{{({\mathbf {X}}^T \mathbf {U})}_{ij}}{{(\mathbf {V}\mathbf {U}^T\mathbf {U})}_{ij}}\). Accordingly, we normalize only the entries of \(\mathbf {V}\), as follows: \(v_{ij} \leftarrow \frac{v_{ij}}{\sqrt{\sum _{j} v_{ij}^2}}\).
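
Continuing the sketch, the update and normalization above amount to a few lines (with X densified and its rows scaled to unit length, as stated):

    from sklearn.preprocessing import normalize

    Xd = normalize(X.toarray())      # rows of X with Euclidean length 1
    V = np.random.default_rng(0).random((Xd.shape[1], k))
    XtU, UtU = Xd.T @ U, U.T @ U
    for _ in range(200):
        V *= XtU / (V @ UtU + 1e-9)  # single MUA update rule, U fixed
    V = normalize(V, axis=1)         # v_ij <- v_ij / sqrt(sum_j v_ij^2)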

4 Ticket Clustering Using Derived Topics Aided by Heuristic

So far we have learned two matrices \(\mathbf {U}\) and \(\mathbf {V}\) that can be used to partition the tickets into clusters. In the ticket-topic matrix \(\mathbf {U}\) each column captures a topic axis, while each row corresponds to a ticket, its elements depicting the share of each topic \(\theta \) in the ticket. Let us denote the set of generated topics by \(\varTheta \). Assuming that each topic gives rise to a unique cluster, one can think of assigning a ticket to a cluster by examining the ticket-topic matrix \(\mathbf {U}\), in which each element \(u_{ij}\) denotes the fraction of the contribution of topic \(\theta _j \in \varTheta \) to ticket \(T_i\). Should ticket \(T_i\) belong solely to topic \(\theta _j\), then \(u_{ij}\) would be the maximum among the entries \(u_{i1}, u_{i2}, \ldots , u_{ik}\). However, this kind of assignment does not produce the desired results, as quite a few heterogeneous tickets may be grouped in the same cluster. This may be attributed to the fact that the value \(u_{ij}\) (for \(\theta _j\)) can be small (and it is very difficult to determine a threshold for it), so a ticket cannot be reliably assigned to a topic (cluster) based on this value alone. To overcome this problem we consider the interaction of a ticket and a topic through terms and adopt the following heuristic (as a mechanism to remove outlier tickets from a cluster).

In the topic-term matrix \(\mathbf {V}\) each element \(v_{ij}\) represents the degree to which term \(t_i \in \varGamma \) belongs to topic \(\theta _j \in \varTheta \). We utilize this information to find the contribution of a ticket T to a topic \(\theta \). For this we adopt a binary representation of a ticket as \(T_i^b = (y_{i1}, \cdots , y_{im})\) over m terms, where \(y_{ij} = 1\) if term \(t_j\) is present in ticket \(T_i\) and \(y_{ij} = 0\) otherwise. Subsequently, we compute the contribution of ticket \(T_i\) to topic \(\theta _j\) as \(\xi (T_i, \theta _j) = \mathbf {y}_{i \cdot } \cdot \mathbf {V}_{\cdot j} = \sum _{p=1}^m y_{ip}\, v_{pj}\).

For a fixed topic (and hence cluster) \(\theta _j\) we omit \(\theta _j\) and write \(\xi (T_i)\) for the contribution of ticket \(T_i\) to the topic in question. We also compute the mean and standard deviation of the contributions of tickets to the topic, denoted \(\mu _{\xi }\) and \(\sigma _{\xi }\) respectively. Based on these, for a fixed topic \(\theta _j\) we empirically fix a threshold value for the contribution as \(\kappa = 0.75\,*\,\mu _{\xi }\,+\,0.5\,*\,\frac{\mu _{\xi }}{\sigma _{\xi }}\). We arrived at this heuristic expression by examining the contribution values of tickets assigned to different clusters (using the T2-NMF method) together with their means and variances.

Given a tuple in the ticket schema, we say that a ticket \(T_i\) is finally assigned to topic \(\theta _j\) in the tuple if \(T_i\) is associated with \(\theta _j\) by T2-NMF and \(\xi (T_i, \theta _j) \ge \kappa \). Otherwise \(T_i\) is assigned to an anonymous topic \(\theta _{\mathrm{an}}\); we assume each tuple has exactly one such anonymous topic. Thus some tickets are not assigned to any proper topic, being placed in \(\theta _\mathrm{an}\) as warranted by the heuristic.
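
A minimal sketch of this assignment step, continuing the code above (the label -1 is our placeholder for \(\theta _{\mathrm{an}}\)):

    Y = (Xd > 0).astype(float)  # binary ticket representation [y_ij]
    Xi = Y @ V                  # Xi[i, j] = xi(T_i, theta_j)
    final = labels.copy()
    for j in range(k):
        members = np.where(labels == j)[0]
        contrib = Xi[members, j]
        mu, sigma = contrib.mean(), contrib.std()  # sigma = 0 needs guarding
        kappa = 0.75 * mu + 0.5 * mu / sigma
        final[members[contrib < kappa]] = -1       # demote to theta_an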

5 Automatic Labeling of Clusters

For automatic labeling of clusters we extract semantic labels for each topic [11] and then formulate a semantic scoring function of each label with respect to the topic corresponding to a cluster; for details cf. [13]. Finally we use two-layered labels for a ticket: the first-layer label corresponds to the tuple that the ticket belongs to (refer to Sect. 2), and for the second layer we use the generated label assigned to the ticket following the method described below.

Label Generation for Topics. We consider meaningful phrases/terms (as ngrams) as candidate labels which are semantically relevant and discriminative across topics. We extract ngrams (bigrams and trigrams) from the ticket summaries using the feature extraction technique described in Sect. 2. Once the ngrams have been identified we apply a measure of association to check their collocation, that is, whether the words in an ngram are more likely to occur together. We use a simple frequency-based approach for finding collocations in a text corpus, based on counting; the details can be found in [10].
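
As a minimal illustration of the counting approach, candidate labels can be ranked by the raw corpus frequency of the extracted ngrams (reusing the illustrative extract_terms from Sect. 2):

    from collections import Counter

    # count only multi-word terms, i.e. the bigram/trigram candidates
    ngram_counts = Counter(
        t for s in summaries for t in extract_terms(s) if " " in t
    )
    candidate_labels = [g for g, _ in ngram_counts.most_common(100)]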

Semantic Relevance Scoring. We adopt a scoring function called First-order Relevance [11] to score the generated labels against topics. The relevance scoring function is computed using the Kullback-Leibler (KL) divergence [11]; the details are in [13].
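
A hedged sketch of one common form of first-order relevance scores a candidate label l against a topic \(\theta \) by the expected pointwise mutual information of l with the topic's words. The probability estimates p_w, p_l and p_wl below are assumed to come from corpus co-occurrence counts; the exact KL-based estimator we use is the one detailed in [13].

    import math

    def first_order_relevance(label, topic_word_probs, p_w, p_l, p_wl):
        """score(l, theta) = sum_w p(w|theta) * log(p(w, l) / (p(w) p(l)))."""
        score = 0.0
        for w, p_w_theta in topic_word_probs.items():
            joint = p_wl.get((w, label), 1e-12)  # smoothed joint probability
            score += p_w_theta * math.log(joint / (p_w[w] * p_l[label]))
        return score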

6 Experimental Framework

We conduct performance evaluations on Infosys internal ticket data sets that follow the schema described in Sect. 2. The data sets come from five different domains: Application Maintenance and Development (AMD), Supply and Trading (SnT), Office Supply and Products (OSP), Telecommunication (Telecom) and Banking and Finance (BnF). The SnT domain has the highest number of tickets (35014), while AMD has the lowest (4510). Using the NLP techniques discussed in Sect. 2 we extract terms from the tickets in each domain.

Our industrial data are not partitioned a priori and hence not labeled, so we use metrics for the quality of topic derivation that do not require labeled data. We consider the Silhouette index (S.I.) and the Davies-Bouldin index (D-B.I.) [1] for this purpose. Moreover, we want to ensure that not many tickets remain unassigned to any topic for a particular tuple. For this we introduce a metric called the “Mismatch Index (M.I.)”: the ratio of the number of tickets assigned to the anonymous topic (i.e., not associated with any cluster) to the total number of tickets in a tuple.
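
For a single tuple, these metrics can be computed, for example, with scikit-learn, continuing the earlier sketch (final holds the hT2-NMF assignments, with -1 standing for \(\theta _{\mathrm{an}}\)):

    from sklearn.metrics import silhouette_score, davies_bouldin_score

    assigned = final != -1                  # tickets not in theta_an
    si = silhouette_score(Xd[assigned], final[assigned])
    dbi = davies_bouldin_score(Xd[assigned], final[assigned])
    mi = (~assigned).sum() / len(final)     # Mismatch Index for the tuple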

We compare our proposed T2-NMF and hT2-NMF approaches with the following approaches: NMF [8, 14], pNMF [9] and LDA [2].

7 Evaluation and Discussion

We determine the number of topics a priori by computing Silhouette indices and choosing the number for which the index is highest [13]. For further experimentation we select only tuples containing at least 50 tickets (otherwise too few tickets remain for analysis after the natural language processing). First we use the T2-NMF approach to derive topics from this selection. We compute S.I. and D-B.I. for each tuple by averaging the S.I. and D-B.I. values of the clusters in the tuple. Afterward we perform hT2-NMF on the same tuples for each of the datasets. When we apply the heuristic to the clusters generated by T2-NMF, some tickets remain unassigned and are put in an anonymous cluster (topic). We compute the fraction of unassigned tickets for each tuple to calculate the M.I. values and then average them over the tuples. These average M.I. values are shown in Fig. 1. In most cases few tickets remain unassigned, as is evident from the average M.I. values; the maximum M.I. value of 35% is observed for the AMD and OSP datasets.

Fig. 1. Comparison of M.I. values for different domains

We now compare our approach with the baseline approaches on a couple of tuples from different data sets. We consider one such tuple from Telecom which has the highest number of tickets (1732) among all tuples across the datasets. As shown in Fig. 2, both T2-NMF and hT2-NMF perform better than the other approaches on the whole. For clusters of sizes 6 and 7, hT2-NMF outperforms T2-NMF in terms of S.I. values (Fig. 2(a)). For D-B.I. values, however, T2-NMF performs better than hT2-NMF for clusters of sizes 5, 6 and 7, while the latter still performs better than the rest (Fig. 2(b)). We found this trend in most of the tuples with a higher number of tickets, but cannot report them for lack of space.

Fig. 2. Comparison of S.I. and D-B.I. values for a tuple from the Telecom domain with different approaches

Fig. 3. Box plot for average scores of labels for tuples in all domains

Lastly we discuss how applying the heuristic on top of NMF does not affect the labeling of clusters. We use the three labels with the highest scores as the designated (second-layer) labels for a cluster. We compute the mean of the scores of these three labels for each cluster in a tuple, and then take the average of these mean scores over each tuple. This average score per tuple is used to draw the box plot in Fig. 3. The top labels attain almost the same scores under T2-NMF and hT2-NMF for each tuple. The median values for all domains change little when the heuristic is used, implying that the divergence between the labels and the topic models remains almost unchanged.

8 Conclusions

Topic learning can play an important role in providing services for many applications of IT management: grouping the tickets arising in maintenance into clusters helps identify the root causes behind the generation of these tickets. In the future we plan to propose a ticket clustering pipeline that can group tickets in real time as they arrive in streams.