
Data Discretization Unification

Ruoming Jin Yuri Breitbart


Department of Computer Science
Kent State University, Kent, OH 44241
{jin,yuri}@cs.kent.edu

ABSTRACT

Data discretization is defined as a process of converting continuous data attribute values into a finite set of intervals with minimal loss of information. In this paper, we prove that discretization methods based on information theoretical complexity and the methods based on statistical measures of data dependency of merged data are asymptotically equivalent. Furthermore, we define a notion of generalized entropy and prove that discretization methods based on MDLP, Gini Index, AIC, BIC, and Pearson's X² and G² statistics are all derivable from the generalized entropy function. Finally, we design a dynamic programming algorithm that guarantees the best discretization based on the generalized entropy notion.

Keywords

Discretization, Entropy, Gini index, MDLP, Chi-Square Test, G² Test

1. INTRODUCTION

Many real-world data mining tasks involve continuous attributes. However, many of the existing data mining systems cannot handle such attributes. Furthermore, even if a data mining task can handle a continuous attribute, its performance can be significantly improved by replacing a continuous attribute with its discretized values. Data discretization is defined as a process of converting continuous data attribute values into a finite set of intervals and associating with each interval some specific data value. There are no restrictions on discrete values associated with a given data interval except that these values must induce some ordering on the discretized attribute domain. Discretization significantly improves the quality of discovered knowledge [8, 30] and also reduces the running time of various data mining tasks such as association rule discovery, classification, and prediction. Catlett in [8] reported a ten-fold performance improvement for domains with a large number of continuous attributes with little or no loss of accuracy.

In this paper, we propose to treat the discretization of a single continuous attribute as a 1-dimensional classification problem. Good discretization may lead to new and more accurate knowledge. On the other hand, bad discretization leads to unnecessary loss of information or in some cases to false information with disastrous consequences. Any discretization process generally leads to a loss of information. The goal of a good discretization algorithm is to minimize such information loss. If discretization leads to an unreasonably small number of data intervals, then it may result in significant information loss. If a discretization method generates too many data intervals, it may lead to false information.

Discretization of continuous attributes has been extensively studied [5, 8, 9, 10, 13, 15, 24, 25]. There is a wide variety of discretization methods, starting with the naive methods (often referred to as unsupervised methods) such as equal-width and equal-frequency [26], up to much more sophisticated methods (often referred to as supervised methods) such as MDLP [15] and Pearson's X² or Wilks' G² statistics based discretization algorithms [18, 5]. Unsupervised discretization methods are not provided with class label information, whereas supervised discretization methods are supplied with a class label for each data item value.

Both unsupervised and supervised discretization methods can be further subdivided into top-down or bottom-up methods. A top-down method starts with a single interval that includes all data attribute values and then generates a set of intervals by splitting the initial interval into two or more intervals. A bottom-up method initially considers each data point as a separate interval. It then selects one or more adjacent data points, merging them into a new interval. For instance, the methods based on statistical independence tests, such as Pearson's X² statistics [23, 27, 5], are examples of bottom-up methods. On the other hand, the method based on information theoretical measures, such as entropy and MDLP [25], is an example of a top-down method. Liu et al. [26] introduce a nice categorization of a large number of existing discretization methods.

Regardless of the discretization method, a compromise must be found between the information quality of the resulting intervals and their statistical quality. The former is generally achieved by considering a notion of entropy and a method's ability to find a set of intervals with a minimum of information loss. The latter, however, is achieved by resorting to a specific statistic to evaluate the level of independence of merged data.

In spite of the wealth of literature on discretization, there are very few attempts to analytically compare different discretization methods. Typically, researchers compare the performance of different algorithms by providing experimental results of running these algorithms on publicly available data sets.

In [13], Dougherty et al. compare discretization results obtained by unsupervised discretization versus a supervised method proposed by [19] and the entropy based method proposed by [15]. They conclude that supervised methods are better than unsupervised discretization methods in that they generate fewer classification errors. In [25], Kohavi and Sahami report that the number of classification errors generated by the discretization method of [15] is comparatively smaller than the number of errors generated by the discretization algorithm of [3]. They
conclude that entropy based discretization methods are usually better than other supervised discretization algorithms.

Recently, many researchers have concentrated on the generation of new discretization algorithms [38, 24, 5, 6]. The goal of the CAIM algorithm [24] is to find the minimum number of intervals that minimize the loss between class-attribute interdependency. The authors of [24] report that their algorithm generates fewer classification errors than two naive unsupervised methods (equal-width and equal-frequency) as well as four supervised methods (max entropy, Patterson-Niblett, IEM, and CADD).

Boulle [5] has proposed a new discretization method called Khiops. The method uses Pearson's X² statistic. It merges the two intervals that maximize the value of X², and the two intervals are replaced with the result of their merge. He then shows that Khiops is as accurate as other methods based on X² statistics but performs much faster.

MODL is another recent discretization method proposed by Boulle [6]. This method builds an optimal criterion based on a Bayesian model. A dynamic programming approach and a greedy heuristic approach are developed to find the optimal criterion. The experimental results show MODL can produce fewer intervals than other discretization methods with better or comparable accuracy.

Finally, Yang and Webb have studied discretization for naive-Bayes classifiers [38]. They have proposed a couple of methods, such as proportional k-interval discretization and equal size discretization, to manage the discretization bias and variance. They report their discretization can achieve lower classification error for naive-Bayes classifiers than other alternative discretization methods.

To summarize, comparisons of the different discretization methods that have appeared so far have been done by running discretization methods on publicly available data sets and comparing certain performance criteria, such as the number of data intervals produced by a discretization method and the number of classification errors. Several fundamental questions of discretization, however, remain to be answered. How are these different methods related to each other, and how different or how similar are they? Can we analytically evaluate and compare these different discretization algorithms without resorting to experiments on different data sets? In other words, is there an objective function which can measure the goodness of different approaches? If so, how would this function look? If such a function exists, what is the relationship between it and the existing discretization criteria? In this paper we provide a list of positive results toward answering these questions.

1.1 Problem Statement

For the purpose of discretization, the entire dataset is projected onto the targeted continuous attribute. The result of such a projection is a two dimensional contingency table C with I rows and J columns. Each row corresponds to either a point in the continuous domain or an initial data interval. We treat each row as an atomic unit which cannot be further subdivided. Each column corresponds to a different class, and we assume that the dataset has a total of J classes. A cell c_ij represents the number of points with the j-th class label falling in the i-th point (or interval) of the targeted continuous domain. Table 1 lists the basic notations for the contingency table C.

In the most straightforward way, each continuous point (or initial data interval) corresponds to a row of a contingency table. Generally, in the initially given set of intervals each interval contains points from different classes and thus c_ij may be more than zero for several columns in the same row.

Table 1: Notations for Contingency Table C_{I'×J}

  Intervals   | Class 1 | Class 2 | ... | Class J | Row Sum
  S_1         | c_11    | c_12    | ... | c_1J    | N_1
  S_2         | c_21    | c_22    | ... | c_2J    | N_2
  ...         | ...     | ...     | ... | ...     | ...
  S_I'        | c_I'1   | c_I'2   | ... | c_I'J   | N_I'
  Column Sum  | M_1     | M_2     | ... | M_J     | N (Total)

The goal of a discretization method is to find another contingency table, C', with I' << I, where each row in the new table C' is the combination of several consecutive rows in the original table C, and each row in the original table is covered by exactly one row in the new table.

Thus, we define the discretization as a function g mapping each row of the new table to a set of rows in the original table, such that g : {1, 2, ..., I'} → 2^{1,2,...,I} with the following properties:

1. ∀i, 1 ≤ i ≤ I', g(i) ≠ ∅;
2. ∀i, 1 ≤ i < I', g(i) = {x, x + 1, ..., x + k};
3. ∪_{i=1}^{I'} g(i) = {1, 2, ..., I}.

In terms of the cell counts, for the i-th row in the new table, c'_ij = Σ_{y=x}^{x+k} c_yj, assuming g(i) = {x, x + 1, ..., x + k}. Potentially, the number of valid g functions (the ways to do the discretization) is 2^{I−1}.

The discretization problem is then defined as selecting the best discretization function. The quality of the discretization function is measured by a goodness function we propose here that depends on two parameters. The first parameter (termed cost(data)) reflects the number of classification errors generated by the discretization function, whereas the second one (termed penalty(model)) is the complexity of the discretization, which reflects the number of discretization intervals generated by the discretization function. Clearly, the more discretization intervals created, the fewer the number of classification errors, and thus the lower the cost of the data. That is, if one is interested only in minimizing the number of classification errors, the best discretization function would generate I intervals, the number of data points in the initial contingency table. Conversely, if one is only interested in minimizing the number of intervals (and therefore reducing the penalty of the model), then the best discretization function would generate a single interval by merging all data points into one interval. Thus, finding the best discretization function is to find the best trade-off between cost(data) and penalty(model).

1.2 Our Contribution

Our results can be summarized as follows:

1. We demonstrate a somewhat unexpected connection between discretization methods based on information theoretical complexity, on one hand, and the methods which are based on statistical measures of the data dependency of the contingency table, such as Pearson's X² or G² statistics, on the other hand. Namely, we prove that each goodness function defined in [15, 16, 5, 23, 27, 4] is a combination of G² defined by Wilks' statistic [1] and the degrees of freedom of the contingency table multiplied by a function that is bounded by O(log N), where N is the number of data samples in the contingency table.

2. We define a notion of generalized entropy and introduce a notion of generalized goodness function. We prove that
goodness functions for discretization methods based on MDLP, Gini Index, AIC, BIC, Pearson's X², and the G² statistic are all derivable from the generalized goodness function.

3. Finally, we design a dynamic programming algorithm that guarantees the best discretization based on a generalized goodness function.

2. GOODNESS FUNCTIONS

In this section we introduce a list of goodness functions which are used to evaluate different discretizations of numerical attributes. These goodness functions intend to measure three different qualities of a contingency table: the information quality (Subsection 2.1), the fitness of statistical models (Subsection 2.2), and the confidence level for statistical independence tests (Subsection 2.3).

2.1 Information Theoretical Approach and MDLP

In the information theoretical approach, we treat discretization of a single continuous attribute as a 1-dimensional classification problem. The Minimal Description Length Principle (MDLP) is a commonly used approach for choosing the best classification model [31, 18]. It considers two factors: how well the discretization fits the data, and the penalty for the discretization, which is based on the complexity of the discretization. Formally, MDLP associates a cost with each discretization, which has the following form:

  cost(model) = cost(data|model) + penalty(model)

where both terms correspond to these two factors, respectively. Intuitively, when the classification error increases, the penalty decreases, and vice versa.

In MDLP, the cost of discretization (cost(model)) is calculated under the assumption that there are a sender and a receiver. Each of them knows all continuous points, but the receiver is without the knowledge of their labels. The cost of using a discretization model to describe the available data is then equal to the length of the shortest message to transfer the label of each continuous point. Thus, the first term (cost(data|model)) corresponds to the shortest message to transfer the labels of all data points of each interval of a given discretization. The second term penalty(model) corresponds to the coding book and delimiters to identify and translate the message for each interval at the receiver site. Given this, the cost of discretization based on MDLP (Cost_MDLP) is derived as follows:

  Cost_MDLP = Σ_{i=1}^{I'} N_i H(S_i) + (I'−1) log₂ (N / (I'−1)) + I'(J−1) log₂ J    (1)

where H(S_i) is the entropy of interval S_i, the first term corresponds to cost(data|model), and the rest corresponds to penalty(model). The detailed derivation is given in the Appendix.

In the following, we formally introduce a notion of entropy and show how a merge of some adjacent data intervals results in information loss as well as in an increase of cost(data|model).

DEFINITION 1. [12] The entropy of an ensemble X is defined to be the average Shannon information content of an outcome:

  H(X) = Σ_{x ∈ A_x} P(x) log₂ (1 / P(x))

where A_x is the set of possible outcomes of x.

Let the i-th interval be S_i, which corresponds to the i-th row in the contingency table C. For simplicity, consider that we have only two intervals S_1 and S_2 in the contingency table; then the entropies of the individual intervals are defined as follows:

  H(S_1) = −Σ_{j=1}^{J} (c_1j / N_1) log₂ (c_1j / N_1),    H(S_2) = −Σ_{j=1}^{J} (c_2j / N_2) log₂ (c_2j / N_2)

If we merge these intervals into a single interval (denoted by S_1 ∪ S_2) following the same rule, we have the entropy:

  H(S_1 ∪ S_2) = −Σ_{j=1}^{J} ((c_1j + c_2j) / N) log₂ ((c_1j + c_2j) / N)

Further, if we treat each interval independently (without merging), the total entropy of these two intervals is expressed as H(S_1, S_2), which is the weighted average of both individual entropies. Formally, we have

  H(S_1, S_2) = (N_1 / N) H(S_1) + (N_2 / N) H(S_2)

LEMMA 1. There always exists information loss for the merged intervals: H(S_1, S_2) ≤ H(S_1 ∪ S_2).

Proof: This can be easily proven by the concavity of the entropy function. □

Thus, every merge operation leads to information loss. The entropy gives the lower bound of the cost to transfer the label per data point. This means that it takes a longer message to send all data points in these two intervals if they are merged (N × H(S_1 ∪ S_2)) than sending both intervals independently (N × H(S_1, S_2)). However, after we merge, the number of intervals is reduced. Therefore, the discretization becomes simpler and the penalty of the model in Cost_MDLP becomes smaller.

Goodness Function based on MDLP: To facilitate the comparison with other cost functions, we formally define the goodness function of an MDLP based discretization method applied to contingency table C to be the difference between the cost of C⁰, which is the resulting table after merging all the rows of C into a single row, and the cost of C. We will also use the natural log instead of the log₂ function. Formally, we denote the goodness function based on MDLP as GF_MDLP.

  GF_MDLP(C) = Cost_MDLP(C⁰) − Cost_MDLP(C)
             = N × H(S_1 ∪ ... ∪ S_I') − N × H(S_1, ..., S_I')
               − ((I'−1) log (N / (I'−1)) + (I'−1)(J−1) log J)    (2)

Note that for a discretization problem, any discretization method shares the same C⁰. Thus, the least cost of transferring a contingency table corresponds to the maximum of the goodness function.

2.2 Statistical Model Selection (AIC and BIC)

A different way to look at a contingency table is to assume that all data points are generated from certain distributions (models) with unknown parameters. Given a distribution, the maximal likelihood principle (MLP) can help us to find the best parameters to fit the data [16]. However, to provide a better data fitting, more expensive models (including more parameters) are needed. Statistical model selection tries to find the right balance between the complexity of a model, corresponding to the number of parameters, and the fitness of the data to the selected model, which corresponds to the likelihood of the data being generated by the given model.
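The entropy quantities of Section 2.1 and the MDLP cost of Formula (1) are easy to compute directly. The following small Python sketch is my own illustration, not code from the paper; the toy table and all helper names are hypothetical.

    import numpy as np

    # Toy contingency table: rows are initial intervals S_i, columns are classes.
    C = np.array([[8, 2],
                  [7, 3],
                  [1, 9]])
    N_i = C.sum(axis=1)          # row sums N_i
    N = C.sum()                  # total number of data points

    def H(counts):
        # Shannon entropy (base 2) of one interval's class counts.
        p = counts / counts.sum()
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    # Weighted entropy H(S_1,...,S_I) versus merged entropy H(S_1 ∪ ... ∪ S_I).
    H_split = sum(N_i[i] / N * H(C[i]) for i in range(len(C)))
    H_merged = H(C.sum(axis=0))
    assert H_split <= H_merged + 1e-12      # Lemma 1: merging loses information

    def cost_mdlp(table):
        # Formula (1): sum_i N_i H(S_i) + (I'-1) log2(N/(I'-1)) + I'(J-1) log2 J.
        I, J = table.shape
        rows = table.sum(axis=1)
        n = table.sum()
        data_cost = sum(rows[i] * H(table[i]) for i in range(I))
        penalty = (I - 1) * np.log2(n / (I - 1)) if I > 1 else 0.0
        penalty += I * (J - 1) * np.log2(J)
        return data_cost + penalty

    print(cost_mdlp(C))                                   # keep all three intervals
    print(cost_mdlp(np.vstack([C[0] + C[1], C[2]])))      # merge the two similar rows
    print(cost_mdlp(C.sum(axis=0, keepdims=True)))        # merge everything into one row

On this toy table the two-interval discretization has the lowest Cost_MDLP, and hence the highest GF_MDLP of the three candidates, matching the intuition that the first two rows carry nearly the same class distribution.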
In statistics, the multinomial distribution is commonly used to model a contingency table. Here, we assume the data in each interval (or row) of the contingency table are independent and all intervals are independent. Thus, the kernel of the likelihood function for the entire contingency table is:

  L(π) = Π_{i=1}^{I'} Π_{j=1}^{J} (π_{j|i})^{c_ij}

where π = (π_{1|1}, π_{2|1}, ..., π_{J|1}, ..., π_{J|I'}) are the unknown parameters. Applying the maximal likelihood principle, we identify the best fitting parameters as π_{j|i} = c_ij / N_i, 1 ≤ i ≤ I', 1 ≤ j ≤ J. We commonly transform the likelihood to the log-likelihood as follows:

  S_L(D|π) = −log L(π) = −Σ_{i=1}^{I'} Σ_{j=1}^{J} c_ij log (c_ij / N_i)

According to [16], S_L(D|π) is treated as a type of entropy term that measures how well the parameters π can compress (or predict) the training data.

Clearly, different discretizations correspond to different multinomial distributions (models). For choosing the best discretization model, the Akaike information criterion, or AIC [16], can be used; it is defined as follows:

  Cost_AIC = 2 S_L(D|π) + 2 (I' × (J−1))    (3)

where the first term corresponds to the fitness of the data given the discretization model, and the second term corresponds to the complexity of the model. Note that in this model, for each row we have the constraint π_{1|i} + ... + π_{J|i} = 1. Therefore, the number of parameters for each row is J − 1.

Alternatively, choosing the best discretization model based on Bayesian arguments that take into account the size of the training set N is also frequently used. The Bayesian information criterion, or BIC [16], is defined as follows:

  Cost_BIC = 2 S_L(D|π) + (I' × (J−1)) log N    (4)

In the BIC definition, the penalty of the model is higher than the one in the AIC by a factor of log N / 2.

Goodness Function based on AIC and BIC: For the same reason as MDLP, we define the goodness function of a given contingency table based on AIC and BIC as the difference between the cost of C⁰ (the resulting table after merging all the rows of C into a single row) and the cost of C.

  GF_AIC(C) = Cost_AIC(C⁰) − Cost_AIC(C)    (5)
  GF_BIC(C) = Cost_BIC(C⁰) − Cost_BIC(C)    (6)

2.3 Confidence Level from Independence Tests

Another way to treat discretization is to merge intervals so that the rows (intervals) and columns (classes) of the entire contingency table become more statistically dependent. In other words, the goodness function of a contingency table measures its statistical quality in terms of independence tests.

Pearson's X²: In the existing discretization approaches, the Pearson statistic X² [1] is commonly used to test statistical independence. The X² statistic is as follows:

  X² = Σ_i Σ_j (c_ij − m̂_ij)² / m̂_ij

where m̂_ij = N (N_i / N)(M_j / N) is the expected frequency. It is well known that the Pearson X² statistic has an asymptotic χ² distribution with degrees of freedom df = (I'−1)(J−1), where I' is the total number of rows. Consider a null hypothesis H_0 (the rows and columns are statistically independent) against an alternative hypothesis H_a. Consequently, we obtain the confidence level of the statistical test to reject the independence hypothesis (H_0). The confidence level is calculated as

  F_{χ²_df}(X²) = ∫_0^{X²} s^{df/2 − 1} e^{−s/2} / (2^{df/2} Γ(df/2)) ds

where F_{χ²_df} is the cumulative χ² distribution function. We use the calculated confidence level as our goodness function to compare different discretization methods that use Pearson's X² statistic. Our goodness function is formally defined as

  GF_{X²}(C) = F_{χ²_df}(X²)    (7)

We note that 1 − F_{χ²_df}(X²) is essentially the P-value of the aforementioned statistical independence test [7]. The lower the P-value (or, equivalently, the higher the goodness), the more confidently we can reject the independence hypothesis (H_0). This approach has been used in Khiops [5], which describes a heuristic algorithm to perform discretization.

Wilks' G²: In addition to Pearson's chi-square statistic, another statistic, called the likelihood-ratio χ² statistic or Wilks' statistic [1], is used for the independence test. This statistic is derived from the likelihood-ratio test, which is a general-purpose way of testing a null hypothesis H_0 against an alternative hypothesis H_a. In this case we treat both the intervals (rows) and the classes (columns) equally as two categorical variables, denoted as X and Y. Given this, the null hypothesis of statistical independence is H_0 : π_ij = π_{i+} π_{+j} for all rows i and columns j, where {π_ij} is the joint distribution of X and Y, and π_{i+} and π_{+j} are the marginal distributions for row i and column j, respectively.

Based on the multinomial sampling assumption (a common assumption for a contingency table) and the maximal likelihood principle, these parameters can be estimated as π̂_{i+} = N_i / N, π̂_{+j} = M_j / N, and π̂_ij = N_i × M_j / N² (under H_0). In the general case under H_a, the likelihood is maximized when π̂_ij = c_ij / N. Thus the statistical independence between the rows and the columns of a contingency table can be expressed as the ratio of the likelihoods:

  Λ = [Π_{i=1}^{I'} Π_{j=1}^{J} (N_i M_j / N²)^{c_ij}] / [Π_{i=1}^{I'} Π_{j=1}^{J} (c_ij / N)^{c_ij}]

where the denominator corresponds to the likelihood under H_a, and the numerator corresponds to the likelihood under H_0.

Wilks has shown that −2 log Λ, denoted by G², has a limiting null chi-squared distribution as N → ∞:

  G² = −2 log Λ = 2 Σ_{i=1}^{I'} Σ_{j=1}^{J} c_ij log (c_ij / (N_i M_j / N))    (8)

For large samples, G² has a chi-squared null distribution with degrees of freedom equal to (I'−1)(J−1). Clearly, we can use G² to replace X² for calculating the confidence level of the entire contingency table, which serves as our goodness function

  GF_{G²}(C) = F_{χ²_df}(G²)    (9)

Indeed, this statistic has been applied in discretization (though not as a global goodness function), where it is referred to as class-attribute interdependency information [37].
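Both statistics, and the confidence-level goodness functions GF_X² (7) and GF_G² (9), can be computed with standard tools. The sketch below is my own illustration, not the authors' code; the function and variable names are hypothetical, and scipy is only one possible implementation choice.

    import numpy as np
    from scipy.stats import chi2

    def x2_g2(table):
        # Return (X^2, G^2, df) for a contingency table whose rows are intervals.
        table = np.asarray(table, dtype=float)
        N = table.sum()
        expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / N   # m_hat_ij = N_i M_j / N
        X2 = ((table - expected) ** 2 / expected).sum()
        mask = table > 0                                                # treat 0 * log 0 as 0
        G2 = 2.0 * (table[mask] * np.log(table[mask] / expected[mask])).sum()
        df = (table.shape[0] - 1) * (table.shape[1] - 1)
        return X2, G2, df

    C = [[8, 2], [7, 3], [1, 9]]
    X2, G2, df = x2_g2(C)
    GF_X2 = chi2.cdf(X2, df)    # Formula (7): confidence level for rejecting independence
    GF_G2 = chi2.cdf(G2, df)    # Formula (9): same, with the likelihood-ratio statistic
    print(X2, G2, df, GF_X2, GF_G2)

For tables that are close to row-column independence the two statistics nearly coincide, which Section 3 makes precise.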
2.4 Properties of Proposed Goodness Functions

An important theoretical question we address is how these methods are related to each other and how different they are. Answering these questions helps to understand the scope of these approaches and sheds light on the ultimate goal: for a given dataset, automatically find the best discretization method.

We first investigate some simple properties shared by the aforementioned goodness functions (Theorem 1). We now describe four basic principles we believe any goodness function for discretization must satisfy.

1. Merging Principle (P1): Let S_i = <c_i1, ..., c_iJ> and S_{i+1} = <c_{i+1,1}, ..., c_{i+1,J}> be two adjacent rows in the contingency table C. If c_ij / N_i = c_{i+1,j} / N_{i+1}, ∀j, 1 ≤ j ≤ J, then GF(C') > GF(C), where N_i and N_{i+1} are the row sums, GF is a goodness function, and C' is the resulting contingency table after we merge these rows.

Intuitively, this principle reflects a main goal of discretization, which is to transform the continuous attribute into a compact interval-based representation with minimal loss of information. As we discussed before, a good discretization reduces the number of intervals without generating too many classification errors. Clearly, if two consecutive intervals have exactly the same data distribution, we cannot differentiate between them. In other words, we can merge them without information loss. Therefore, any goodness function should prefer to merge such consecutive intervals.

We note that the merging principle (P1) echoes the cut point candidate pruning techniques for discretization which have been studied by Fayyad and Irani [15] and Elomaa and Rousu [14]. However, they did not explicitly define a global goodness function for discretization. Instead, their focus is either on evaluating the goodness of each single cut or on the goodness when the total target number of intervals for discretization is given. As we mentioned in Section 1, the goodness function discussed in this paper is to capture the tradeoff between the information/statistical quality and the complexity of the discretization. In addition, this principle can be directly applied to reduce the size of the original contingency table, since we can simply merge the consecutive rows with the same class distribution.

2. Symmetric Principle (P2): Let C_j be the j-th column of contingency table C. GF(C) = GF(C'), where C = <C_1, ..., C_J> and C' is obtained from C by an arbitrary permutation of C's columns.

This principle asserts that the order of class labels should not impact the goodness function that measures the quality of the discretization. Discretization results must be the same for both tables.

3. MIN Principle (P3): Consider all contingency tables C which have I rows and J columns, and the same marginal distribution for classes (columns). If for every row S_i in C, S_i = <c_i1, ..., c_iJ>, c_ij / N_i = M_j / N, then the contingency table C reaches the minimum of any goodness function.

This principle determines what is the worst possible discretization for any contingency table. This is the case when each row shares exactly the same class distribution in a contingency table, and thus the entire table has the maximal redundancy.

4. MAX Principle (P4): Consider all the contingency tables C which have I rows and J columns. If for every row S_i in C, S_i = <c_i1, ..., c_iJ>, there exists one cell count such that c_ij ≠ 0 while the others c_ik, k ≠ j, satisfy c_ik = 0, then the contingency table C achieves the maximum of a goodness function over all I × J contingency tables.

This principle determines what is the best possible discretization when the number of intervals is fixed. Clearly, the best discretization is achieved if we have the maximal discriminating power in each interval. This is the case where all the data points in each interval belong to only one class.

The following theorem states that all the aforementioned goodness functions satisfy these four principles.

THEOREM 1. GF_MDLP, GF_AIC, GF_BIC, GF_{X²}, GF_{G²} satisfy all four principles, P1, P2, P3, and P4.

Proof: In Appendix. □

3. EQUIVALENCE OF GOODNESS FUNCTIONS

In this section, we analytically compare the different discretization goodness functions introduced in Section 2. In particular, we find a rather surprising connection between these seemingly quite different approaches: the information theoretical complexity (Subsection 2.1), the statistical fitness (Subsection 2.2), and the statistical independence tests (Subsection 2.3). We basically prove that all these functions can be expressed in a uniform format as follows:

  GF = G² − df × f(G², N, I, J)    (10)

where df is the degrees of freedom of the contingency table, N is the number of data points, I is the number of data rows in the contingency table, J is the number of class labels, and f is bounded by O(log N). The first term G² corresponds to the cost of the data given a discretization model (cost(data|model)), and the second corresponds to the penalty, or the complexity, of the model (penalty(model)).

To derive this expression, we first derive an expression for the cost of the data for the different goodness functions discussed in Section 2 (Subsection 3.1). This is achieved by expressing the G² statistic through information entropy (Theorem 3). Then, using a result of Wallace [35, 36] on approximating the χ² distribution with a normal distribution, we transform the goodness function based on statistical independence tests into the format of Formula 10. Further, a detailed analysis of the function f reveals a deeper relationship shared by these different goodness functions (Subsection 3.3). Finally, we compare the methods based on global independence tests, such as Khiops [5] (GF_{X²}), and those based on local independence tests, such as ChiMerge [23] and Chi2 [27] (Subsection 3.4).

3.1 Unifying the Cost of Data (cost(data|model)) to G²

In the following, we establish the relationship among entropy, log-likelihood and G². This is the first step for an analytical comparison of goodness functions based on the information theoretical, the statistical model selection, and the statistical independence test approaches.

First, it is easy to see that for a given contingency table, the cost of the data transfer (cost(data|model), a key term in the
information theoretical approach) is equivalent to the log-likelihood S_L(D|π) (used in the statistical model selection approach), as the following theorem asserts.

THEOREM 2. For a given contingency table C_{I'×J}, the cost of data transfer (cost(data|model)) is equal to the log-likelihood S_L(D|π), i.e.,

  N × H(S_1, ..., S_I') = −log L(π)

Proof:

  N × H(S_1, ..., S_I') = Σ_{i=1}^{I'} N_i × H(S_i)
                        = −Σ_{i=1}^{I'} N_i × Σ_{j=1}^{J} (c_ij / N_i) log (c_ij / N_i)
                        = −Σ_{i=1}^{I'} Σ_{j=1}^{J} c_ij log (c_ij / N_i) = −log L(π)    □

The next theorem establishes a relationship between the entropy criteria and the likelihood independence test statistic G². This is the key to discovering the connection between the information theoretical and the statistical independence test approaches.

THEOREM 3. Let C be a contingency table. Then

  G²/2 = N × H(S_1 ∪ ... ∪ S_I') − N × H(S_1, ..., S_I')

Proof:

  G²/2 = −log Λ = Σ_{i=1}^{I'} Σ_{j=1}^{J} c_ij log (c_ij / (N_i M_j / N))
       = Σ_{i=1}^{I'} Σ_{j=1}^{J} (c_ij log (c_ij / N_i) + c_ij log (N / M_j))
       = Σ_{i=1}^{I'} Σ_{j=1}^{J} c_ij log (c_ij / N_i) − Σ_{j=1}^{J} log (M_j / N) × Σ_{i=1}^{I'} c_ij
       = Σ_{i=1}^{I'} Σ_{j=1}^{J} c_ij log (c_ij / N_i) − Σ_{j=1}^{J} M_j log (M_j / N)
       = −N × H(S_1, ..., S_I') + N × H(S_1 ∪ ... ∪ S_I')    □

Theorem 3 can be generalized as follows.

THEOREM 4. Assume we have k consecutive rows S_i, S_{i+1}, ..., S_{i+k−1}. Let G²_{(i,i+k−1)} be the likelihood independence test statistic for the k rows. Then we have

  G²_{(i,i+k−1)}/2 = N_{(i,i+k−1)} (H(S_i ∪ ... ∪ S_{i+k−1}) − H(S_i, ..., S_{i+k−1}))

Proof: Omitted for simplicity. □

Consequently, we rewrite the goodness functions GF_MDLP, GF_AIC and GF_BIC as follows.

  GF_MDLP = G² − 2(I'−1) log (N / (I'−1)) − 2(I'−1)(J−1) log J    (11)
  GF_AIC = G² − (I'−1)(J−1)    (12)
  GF_BIC = G² − (I'−1)(J−1) log N / 2    (13)

For the rest of the paper we use the above formulas for GF_MDLP, GF_AIC and GF_BIC.

It has long been known that X² and G² are asymptotically equivalent. The next theorem provides a tool to connect the information theoretical approach and the statistical independence test approach based on Pearson's chi-square (X²) statistic.

THEOREM 5. [1] Let N be the total number of data values in a contingency table T of I × J dimensions. If the rows (columns) of the contingency table are independent, then the probability of X² − G² = 0 converges to one as N → ∞.

In the following, we mainly focus on the asymptotic properties shared by X² and G² based cost functions. Thus, our further discussions of G² can also be applied to X².

Note that Theorems 2 and 3 basically establish the basis for Formula 10 for the goodness functions based on the information theoretical and the statistical model selection approaches. Even though Theorems 4 and 5 relate the information theoretical approach (based on entropy) to the statistical independence test approach (based on G² and X²), it is still unclear how to compare them directly, since the goodness function of the former is based on the total cost of transferring the data and the goodness function of the latter is the confidence level for a hypothesis test. Subsection 3.2 presents our approach to tackling this issue.

3.2 Unifying Statistical Independence Tests

In order to compare the quality of different goodness functions, we introduce a notion of equivalent goodness functions. Intuitively, the equivalence between goodness functions means that these functions rank different discretizations of the same contingency table identically.

DEFINITION 2. Let C be a contingency table and GF_1(C), GF_2(C) be two different goodness functions. GF_1 and GF_2 are equivalent if and only if for any two contingency tables C_1 and C_2, GF_1(C_1) ≤ GF_1(C_2) ⟹ GF_2(C_1) ≤ GF_2(C_2) and GF_2(C_1) ≤ GF_2(C_2) ⟹ GF_1(C_1) ≤ GF_1(C_2).

Using the equivalence notion, we transform goodness functions to different scales and/or to different formats. In the sequel, we apply this notion to compare the seemingly different goodness functions based on statistical confidence and those that are based on MDLP, AIC, and BIC.

The relationship between G² and the confidence level is rather complicated. It is clearly not a simple one-to-one mapping, as the same G² may correspond to very different confidence levels depending on the degrees of freedom of the χ² distribution and, vice versa, the same confidence level may correspond to very different G² values. Interestingly enough, such a many-to-many mapping actually holds the key for the aforementioned transformation. Intuitively, we have to transform the confidence level to a scale of entropy or G², parameterized by the degrees of freedom of the χ² distribution.

Our proposed transformation is as follows.

DEFINITION 3. Let u(t) be the normal deviate corresponding to the chi-square distributed variable t. That is, the following equality holds:

  F_{χ²_df}(t) = Φ(u(t))

where F_{χ²_df} is the cumulative χ² distribution with df degrees of freedom, and Φ is the cumulative normal distribution function. For a given contingency table C, which has the log likelihood ratio G², we define

  GF'_{G²} = u(G²)    (14)

as a new goodness function for C.

The next theorem establishes the equivalence between the goodness functions GF_{G²} and GF'_{G²}.
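Definition 3 is straightforward to evaluate numerically: u is just the composition of the χ² CDF with the inverse normal CDF. The small sketch below is my own (scipy is an implementation choice, not something the paper prescribes).

    from scipy.stats import chi2, norm

    def u(t, df):
        # Normal deviate with Phi(u(t)) = F_{chi2, df}(t), as in Definition 3.
        return norm.ppf(chi2.cdf(t, df))

    # GF'_G2 = u(G^2): since Phi and F_{chi2, df} are both strictly increasing,
    # ranking discretizations by u(G^2) agrees with ranking them by F_{chi2, df}(G^2),
    # while tables with different degrees of freedom are mapped onto a common scale.
    print(u(25.0, df=4), u(25.0, df=10))   # same G^2, very different confidence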
THEOREM 6. The goodness function GF'_{G²} = u(G²) is equivalent to the goodness function GF_{G²} = F_{χ²_df}(G²).

Proof: Assume we have two contingency tables C_1 and C_2 with degrees of freedom df_1 and df_2, respectively. Their respective G² statistics are denoted as G²_1 and G²_2. Clearly, we have

  F_{χ²_{df_1}}(G²_1) ≤ F_{χ²_{df_2}}(G²_2) ⟺ Φ(u(G²_1)) ≤ Φ(u(G²_2)) ⟺ u(G²_1) ≤ u(G²_2)

This establishes the equivalence of these two goodness functions. □

The newly introduced goodness function GF'_{G²} is rather complicated and it is hard to find a closed form expression for it. In the following, we use a theorem from Wallace [35, 36] to derive an asymptotically accurate closed form expression for a simple variant of GF'_{G²}.

THEOREM 7. [35, 36] For all t > df, all df > 0.37, and with w(t) = [t − df − df log(t/df)]^{1/2},

  0 < w(t) ≤ u(t) ≤ w(t) + 0.60 df^{−1/2}

Note that if u(G²) ≥ 0, then u²(G²) is equivalent to u(G²). Here, we limit our attention to the case when G² > df, which is the condition for Theorem 7. This condition implies that u(G²) ≥ 0.¹ We show that under some conditions, u²(G²) can be approximated by w²(G²):

  w²(G²) ≤ u²(G²) ≤ w²(G²) + 0.36/df + 1.2 w(G²)/√df

  1 ≤ u²(G²)/w²(G²) ≤ 1 + 0.36/(w²(G²) × df) + 1.2/(w(G²) √df)

If df → ∞ and w(t) >> 0, then 0.36/(w²(G²) × df) → 0 and 1.2/(w(G²) √df) → 0, and therefore u²(G²)/w²(G²) → 1.

Thus, we can use the following goodness function:

  GF''_{G²} = u²(G²) = G² − df (1 + log(G²/df))    (15)

Similarly, the function GF''_{X²} is obtained from GF''_{G²} by replacing G² with X² in the GF''_{G²} expression. Formulas 11, 12, 13 and 15 indicate that all goodness functions introduced in Section 2 can be (asymptotically) expressed in the same closed form (Formula 10). Specifically, all of them can be decomposed into two parts. The first part contains G², which corresponds to the cost of transferring the data in the information theoretical view. The second part is a linear function of the degrees of freedom, and can be treated as the penalty of the model in the same view.

3.3 Penalty Analysis

In this section, we perform a detailed analysis of the relationship between the penalty functions of these different goodness functions. Our analysis reveals a deeper similarity shared by these functions and at the same time reveals differences between them. Simply put, the penalties of these goodness functions are essentially bounded by two extremes. On the lower end, which is represented by AIC, the penalty is on the order of the degrees of freedom, O(df). On the higher end, which is represented by BIC, the penalty is O(df log N).

Penalty of GF''_{G²} (Formula 15): The penalty of our new goodness function GF''_{G²} = u²(G²) is between O(df) and O(df log N). The lower bound is achieved provided that G² is strictly higher than df (G² > df). Lemma 2 gives the upper bound.

LEMMA 2. G² is bounded by 2N log J (G² ≤ 2N × log J).

Proof:

  G² = 2N × (H(S_1 ∪ ... ∪ S_I) − H(S_1, ..., S_I))
     ≤ 2N × (−J × (1/J × log(1/J)) − 0)
     ≤ 2N × log J    □

In the following, we consider two cases for the penalty of GF''_{G²} = u²(G²). Note that these two cases correspond to the lower bound and the upper bound of G², respectively.

1. If G² = c_1 × df, where c_1 > 1, the penalty of this goodness function is (1 + log c_1) df, which is O(df).

2. If G² = c_2 × N log J, where c_2 ≤ 2 and c_2 >> 0, the penalty of the goodness function is df (1 + log(c_2 N log J / df)).

The second case is further subdivided into two subcases.

1. If N/df ≈ N/(IJ) = c, where c is some constant, the penalty is O(df).

2. If N → ∞ and N/df ≈ N/(IJ) → ∞, the penalty is df (1 + log(c_2 N log J / df)) ≈ df (1 + log(N/df)) ≈ df (log N).

Penalty of GF_MDLP (Formula 11): The penalty function f derived in the goodness function based on the information theoretical approach can be written as

  (df / (J−1)) log (N / (I−1)) + df log J = df (log(N / (I−1)) / (J−1) + log J)

Here, we again consider two cases:

1. If N/(I−1) = c, where c is some constant, the penalty of MDLP is O(df).

2. If N >> I and N → ∞, the penalty of MDLP is O(df log N).

Note that in the first case the contingency table is very sparse (N/(IJ) is small). In the second case the contingency table is very dense (N/(IJ) is very large).

To summarize, the penalty can be represented in a generic form as df × f(G², N, I, J) (Formula 10). This function f is bounded by O(log N). Finally, we observe that different penalties clearly result in different discretizations. A higher penalty in the goodness function results in a smaller number of intervals in the discretization result. For instance, we can state the following theorem.

THEOREM 8. Given an initial contingency table C with log N ≥ 2,² let I_AIC be the number of intervals of the discretization generated by using GF_AIC and I_BIC be the number of intervals of the discretization generated by using GF_BIC. Then I_AIC ≥ I_BIC.

Note that this is essentially a direct application of a well-known fact from statistical machine learning research: a higher penalty will result in more concise models [16].

¹ If u(G²) < 0, it becomes very hard to reject the hypothesis that the entire table is statistically independent. Here, we basically focus on the cases where this hypothesis is likely to be rejected.
² This condition ensures that the penalty of BIC is higher than that of AIC.
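To make the comparison of the penalty terms concrete, the following sketch (my own illustration; the table shape and the chosen G² values are arbitrary) evaluates the penalties of GF_AIC (12), GF_BIC (13) and GF''_{G²} (15) in the two regimes discussed above.

    import numpy as np

    def penalties(G2, N, I_rows, J_cols):
        df = (I_rows - 1) * (J_cols - 1)
        return {"AIC":  df,                            # O(df), Formula (12)
                "BIC":  df * np.log(N) / 2.0,          # O(df log N), Formula (13)
                "G2''": df * (1.0 + np.log(G2 / df))}  # Formula (15), requires G2 > df

    N, I_rows, J_cols = 10_000, 6, 3
    df = (I_rows - 1) * (J_cols - 1)
    print(penalties(G2=3 * df, N=N, I_rows=I_rows, J_cols=J_cols))                  # G2 a small multiple of df
    print(penalties(G2=2 * N * np.log(J_cols), N=N, I_rows=I_rows, J_cols=J_cols))  # G2 at the Lemma 2 bound 2N log J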
3.4 Global and Local Independence Tests

This subsection compares the discretization methods based on global statistical independence tests with those based on local statistical independence tests. Note that the latter do not have a global goodness function for the discretization. Instead, they treat each local statistical test as an indication for a merge action. Well-known discretization algorithms based on local independence tests include ChiMerge [23] and Chi2 [27], among others. Specifically, for consecutive intervals, these algorithms perform a statistical independence test based on Pearson's X² or G². If they cannot reject the independence hypothesis for those intervals, they merge them into one row. Given such constraints, they usually try to find the best discretization with the minimal number of intervals. A natural question to ask is how such local independence tests relate to global independence tests, as well as to the goodness functions GF_{X²} and GF_{G²}.

Formally, let X²_{(i,i+k)} be the Pearson's chi-square statistic for the k+1 consecutive rows (from i to i+k). Let F_{χ²_{k×(J−1)}}(X²_{(i,i+k)}) be the confidence level for these rows and their corresponding columns being statistically independent. If

  F_{χ²_{k×(J−1)}}(X²_{(i,i+k)}) < δ,

we can merge these k + 1 rows into one row.

We only summarize our main results here. The detailed discussion is in the Appendix. A typical local hypothesis test can be rewritten as follows:

  G²_{(i,i+k−1)} < df = k × (J − 1)

In other words, as long as the above condition holds, we can merge these consecutive rows into one.

This suggests that the local condition essentially shares a penalty of the same order of magnitude as GF_AIC. In addition, we note that a penalty of O(df log N) allows us to combine consecutive rows even if they are likely to be statistically dependent based on the G² or X² statistic. In other words, a penalty of O(df) in the goodness function is a stricter condition for merging consecutive rows than O(df log N). Therefore, it results in more intervals in the best discretization identified by a goodness function using a penalty of O(df) than in the one identified by a goodness function using a penalty of O(df log N). This essentially provides an intuitive argument for Theorem 8.

4. PARAMETRIZED GOODNESS FUNCTION

The goodness functions discussed so far are based either on entropy or on the χ² or G² statistics. In this section we introduce a new goodness function which is based on the gini index [4]. The gini index based goodness function is strikingly different from the goodness functions introduced so far. In this section we show that the newly introduced goodness function GF_gini, along with the goodness functions discussed in Section 2, can all be derived from a generalized notion of entropy [29].

4.1 Gini Based Goodness Function

Let S_i be a row in contingency table C. The gini index of row S_i is defined as follows [4]:

  Gini(S_i) = Σ_{j=1}^{J} (c_ij / N_i) [1 − c_ij / N_i]

and Cost_Gini(C) = Σ_{i=1}^{I'} N_i × Gini(S_i).

The penalty of the model based on the gini index can be approximated as 2I' − 1 (see the detailed derivation in the Appendix). The basic idea is to apply a generalized MDLP principle in such a way that the cost of transferring the data (cost(data|model)) and the cost of transferring the coding book as well as the necessary delimiters (penalty(model)) are treated as the complexity measure. Therefore, the gini index can be utilized to provide such a measure. Thus, the goodness function based on the gini index is as follows:

  GF_gini(C) = −Σ_{i=1}^{I'} Σ_{j=1}^{J} c_ij² / N_i + Σ_{j=1}^{J} M_j² / N + 2(I' − 1)    (16)

4.2 Generalized Entropy

In this subsection, we introduce a notion of generalized entropy, which is used to uniformly represent a variety of complexity measures, including both information entropy and the gini index, by assigning different values to the parameters of the generalized entropy expression. Thus, it serves as the basis to derive the parameterized goodness function which represents all the aforementioned goodness functions, such as GF_MDLP, GF_AIC, GF_BIC, GF_{G²}, and GF_gini, in a closed form.

DEFINITION 4. [32, 29] For a given interval S_i, the generalized entropy is defined as

  H_β(S_i) = Σ_{j=1}^{J} (c_ij / N_i) [1 − (c_ij / N_i)^β] / β,    β > 0

When β = 1, we can see that

  H_1(S_i) = Σ_{j=1}^{J} (c_ij / N_i) [1 − c_ij / N_i] = Gini(S_i)

When β → 0,

  H_{β→0}(S_i) = lim_{β→0} Σ_{j=1}^{J} (c_ij / N_i) [1 − (c_ij / N_i)^β] / β
               = −Σ_{j=1}^{J} (c_ij / N_i) log (c_ij / N_i) = H(S_i)

LEMMA 3. H_β[p_1, ..., p_J] = Σ_{j=1}^{J} p_j (1 − p_j^β) / β is concave when β > 0.

Proof:

  ∂H_β/∂p_j = (1 − (1 + β) p_j^β) / β
  ∂²H_β/∂p_j² = −(1 + β) p_j^{β−1} < 0
  ∂²H_β/(∂p_j ∂p_l) = 0 for j ≠ l

Thus, the Hessian ∇²H_β[p_1, ..., p_J] is a diagonal matrix whose j-th diagonal entry is ∂²H_β/∂p_j² < 0 and whose off-diagonal entries are all zero. Clearly, ∇²H_β[p_1, ..., p_J] is negative definite. Therefore, H_β[p_1, ..., p_J] is concave. □
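The two limiting cases of Definition 4 are easy to verify numerically. A minimal sketch (my own code, not from the paper; the class counts are arbitrary):

    import numpy as np

    def H_beta(counts, beta):
        # Generalized entropy of Definition 4 for one interval's class counts.
        p = np.asarray(counts, dtype=float)
        p = p / p.sum()
        p = p[p > 0]
        return (p * (1.0 - p ** beta)).sum() / beta

    counts = [8, 2, 5]
    p = np.array(counts) / np.sum(counts)
    gini = (p * (1 - p)).sum()
    shannon = -(p * np.log(p)).sum()
    print(H_beta(counts, 1.0), gini)        # identical: H_1 is the gini index
    print(H_beta(counts, 1e-6), shannon)    # nearly identical: H_beta -> H as beta -> 0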
Let CI×J be a contingency table., We define the generalized The parameterized goodness function not only allows us to
entropy for C as follows. represent the existing goodness functions in a closed uniform
I
form, but, more importantly, it provides a new way to understand
X Ni and handle discretization. First, the parameterized approach pro-
Hβ (S1 , · · · , SI ) = Hβ (Si ) =
i=1
N vides a flexible framework to access a large collection (poten-
I J
tially infinite) of goodness functions. Suppose we have a two di-
X Ni X cij cij β mension space where α is represented in the X-axis and β is rep-
× [1 − ( ) ]/β
i=1
N j=1
Ni Ni resented in the Y -axis. Then, each point in the two-dimensional
space for α > 0 and 0 < β ≤ 1 corresponds to a potential
Similarly, we have goodness function. The existing goodness functions corresponds
J to certain points in the two-dimensional space. These points are
X Mj Mj β specified by the aforementioned parameter choices. Note that
Hβ (S1 ∪ · · · ∪ SI ) = [1 − ( ) ]/β
j=1
N N this treatment is in the same spirit of regularization theory devel-
oped in the statistical machine learning field [17, 34]. Secondly,
T HEOREM 9. There always exists information loss for the finding the best discretization for different data mining tasks for
merged intervals: Hβ (S1 , S2 ) ≤ Hβ (S1 ∪ S2 ) a given dataset is transformed into a parameter selection prob-
Proof:This is the direct application of the concaveness of the lem. Ultimately, we would like to identify the parameter selec-
generalized entropy. 2 tion which optimizes the targeted data mining task. For instance,
suppose we are discretizing a given dataset for a Naive Bayesian
4.3 Parameterized Goodness Function classifier. Clearly, a typical goal of the discretization is to build
Based on the discussion in Section 3, we derive that different a Bayesian classifier with the minimal classification error. As
goodness functions basically can be decomposed into two parts. described in regularization theory [17, 34], the methods based
The first part is for G2 , which corresponds to the information on cross-validation, can be applied here. However, it is an open
theoretical difference between the contingency table under con- problem how we may automatically select the parameters with-
sideration and the marginal distribution along classes. The sec- out running the targeted data mining task. In other words, can we
ond part is the penalty which counts the difference of complexity analytically determine the best discretization for different data
for the model between the contingency table under consideration tasks for a given dataset? This problem is beyond the scope of
and the one-row contingency table. The different goodness func- this paper and we plan to investigate it in future work. Finally,
tions essentially have different penalties ranging from O(df ) to the unification of goodness functions allows to develop efficient
O(df logN ). algorithms to discretize the continuous attributes with respect to
In the following, we propose a parameterized goodness func- different parameters in a uniform way. This is the topic of the
tion which treats all the aforementioned goodness functions in a next subsection.
uniform way.
D EFINITION 5. Given two parameters, α and β, where 0 <
β ≤ 1 and 0 < α, the parameterized goodness function for
contingency table C is represented as
4.4 Dynamic Programming for Discretiza-
tion
I ′
X This section presents a dynamic programming approach to find
GFα,β (C) = N × Hβ (S1 ∪ · · · ∪ SI ′ ) − Ni × Hβ (Si ) the best discretization function to maximize the parameterized
i=1 goodness function. Note that the dynamic programming has been
′ 1 β used in discretization before [14]. However, the existing ap-
−α × (I − 1)(J − 1)[1 − ( ) ]/β (17)
N proaches do not have a global goodness function to optimize,
The following theorem states the basic properties of the pa- and almost all of them have to require the knowledge of targeted
rameterized goodness function. number of intervals. In other words, the user has to define the
T HEOREM 10. The parameter goodness function GFα,β , with number of intervals for discretization. Thus, the existing ap-
α > 0 and 0 < β ≤ 1, satisfies all four principles, P1, P2, P3, proaches can not be directly applied to discretization for maxi-
and P4. mizing the parameterized goodness function.
Proof:In Appendix. 2 In the following, we introduce our dynamic programming ap-
By adjusting different parameter values, we show how good- proach for discretization. To facilitate our discussion, we use GF
ness functions defined in section 2 can be obtained from the for GFα,β , and we simplify the GF formula as follows. Since
parametrized goodness function. We consider several cases: a given table C, N × Hβ (S1 ∪ · · · ∪ SI ) (the first term in GF ,
Formula 17) is fixed, we define
1. Let β = 1 and α = 2(N − 1)/(N (J − 1)). Then
GF2(N−1)/(N(J −1)),1 = GFgini . F (C) = N × Hβ (S1 ∪ · · · ∪ SI )GF (C) =
2. Let α = 1/logN and β → 0. Then GF1/logNβ→0 = I ′
X 1 β
GFAIC . Ni × Hβ (Si ) + α × (I ′ − 1)(J − 1)[1 − ( ) ]/β
i=1
N
3. Let α = 1/2 and β → 0. Then GF1/2,β→0 = GFBIC .
4. Let α = const, β → 0 and N >> I. Then GFconst,β→0 = Clearly, the minimization of the new function F is equivalent
G2 − O(df logN ) = GFM DLP . to maximizing GF . In the following, we will focus on finding
the best discretization to minimize F . First, we define a sub-
5. Let α = const, β → 0, and G2 = O(N logJ), N/(IJ) → contingency table of C as C[i : i + k] = {Si , · · · , Si+k }, and
∞. Then GFconst,β→0 = G2 − O(df logN ) = GFG′′2 ≈ let C 0 [i : i+k] = Si ∪· · ·∪Si+k be the merged column sum for
′′
GFX 2. the sub-contingency table C[i : i + k]. Thus, the new function F
of the row C 0 [i : i + k] is: for them? Clearly for these applications, misclassification can be
i+k
very costly. But the number of intervals generated by the dis-
X cretization may not be that important. Pursuing these questions,
F (C 0 [i : i + k]) = ( Nr ) × Hβ (Si ∪ · · · ∪ Si+k )
we plan to conduct experimental studies to compare different
r=i
goodness functions, and evaluate the effect of parameter selec-
Let C be the input contingency table for discretization. Let tion for the generalized goodness function on discretization.
Opt(i, i + k) be the minimum of the F function from the partial
contingency table from row i to i + k, k > 1. The optimum 6. REFERENCES
which corresponds to the best discretization can be calculated
recursively as follows: [1] A. Agresti Categorical Data Analysis. Wiley, New York,
1990.
Opt(i, i + k) = min(F (C 0 [i : i + k]), [2] H. Akaike. Information Theory and an Extension of the
min1≤l≤k−1 (Opt(i, i + l) + Opt(i + l + 1, i + k) + Maximum Likelihood Principle. In Second International
1 Symposium on Information Theory, 267-281, Armenia, 1973.
α × (J − 1)[1 − ( )β ]/β)) [3] P. Auer, R. Holte, W. Maass. Theory and Applications of
N
Agnostic Pac-Learning with Small Decision Trees. In
where k > 0 and Opt(i, i) = F (C 0 [i : i]). Given this, we can Machine Learning: Proceedings of the Twelth International
apply the dynamic programming to find the discretization with Conference, Morgan Kaufmann, 1995.
the minimum of the goodness function, which are described in [4] L. Breiman, J. Friedman, R. Olshen, C. Stone Classification
Algorithm 1. The complexity of the algorithm is O(I 3 ), where I and Regression Trees. CRC Press, 1998.
is the number of intervals of the input contingency table C. [5] M. Boulle. Khiops: A Statistical Discretization Method of
Continuous Attributes. Machine Learning, 55, 53-69, 2004.
Algorithm 1 Discretization(Contingency Table C_{I×J})
  for i = 1 to I do
    for j = i downto 1 do
      Opt(j, i) = F(C⁰[j : i])
      for k = j to i − 1 do
        Opt(j, i) = min(Opt(j, i), Opt(j, k) + Opt(k + 1, i) + α(J − 1)[1 − (1/N)^β]/β)
      end for
    end for
  end for
  return Opt(1, I)
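As a concrete illustration of Algorithm 1, here is a minimal dynamic-programming sketch in Python (our own code, not the authors' implementation; the names optimal_discretization, f_merged, and opt are hypothetical). It memoizes Opt(j, i) exactly as in the pseudocode and charges the per-cut penalty α(J − 1)[1 − (1/N)^β]/β once for every additional cut point.

from functools import lru_cache

def optimal_discretization(table, alpha, beta):
    # table: list of I rows of class counts. Returns the minimal cost Opt(1, I).
    I, J = len(table), len(table[0])
    N = float(sum(sum(r) for r in table))
    cut_penalty = alpha * (J - 1) * (1.0 - (1.0 / N) ** beta) / beta

    def h_beta(counts):
        n = float(sum(counts))
        return sum((c / n) * (1.0 - (c / n) ** beta) / beta for c in counts if c > 0)

    def f_merged(j, i):
        # F(C0[j:i]): cost of keeping rows j..i (0-based, inclusive) as one interval.
        s = [sum(table[r][c] for r in range(j, i + 1)) for c in range(J)]
        return sum(s) * h_beta(s)

    @lru_cache(maxsize=None)
    def opt(j, i):
        best = f_merged(j, i)
        for k in range(j, i):  # try a cut between rows k and k+1
            best = min(best, opt(j, k) + opt(k + 1, i) + cut_penalty)
        return best

    return opt(0, I - 1)

print(optimal_discretization([[9, 1], [8, 2], [1, 9], [2, 8]], alpha=1.0, beta=1.0))

With memoization over the O(I²) pairs (j, i) and an O(I) inner loop, the running time matches the O(I³) bound stated above (ignoring the cost of forming the merged rows).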
5. CONCLUSIONS
In this paper we introduced a generalized goodness function to evaluate the quality of a discretization method. We have shown that seemingly disparate goodness functions based on entropy, AIC, BIC, Pearson's X² and Wilks' G² statistics, as well as the Gini index, are all derivable from our generalized goodness function. Furthermore, the choice of different parameters for the generalized goodness function explains why there is such a wide variety of discretization methods. Indeed, the difficulties of comparing different discretization methods are widely known. Our results provide a theoretical foundation for approaching these difficulties and offer a rationale as to why evaluating different discretization methods on an arbitrary contingency table is difficult. Our generalized goodness function gives an affirmative answer to the question: is there an objective function to evaluate different discretization methods? Another contribution of this paper is a dynamic programming algorithm that provides an optimal discretization, one achieving the minimum of the generalized goodness function.
There are, however, several questions that remain open. First of all, even if an objective goodness function exists, different parameter choices will result in different discretizations. Therefore, the question is: for a particular set of applications, what are the best parameters for the discretization? Further, can we classify user applications into different categories and identify the optimal parameters for each category? Consider, for example, medical applications: what is the best discretization function for them? Clearly, for these applications misclassification can be very costly, but the number of intervals generated by the discretization may not be that important. Pursuing these questions, we plan to conduct experimental studies to compare different goodness functions and to evaluate the effect of parameter selection for the generalized goodness function on discretization.
6. REFERENCES
[1] A. Agresti. Categorical Data Analysis. Wiley, New York, 1990.
[2] H. Akaike. Information Theory and an Extension of the Maximum Likelihood Principle. In Second International Symposium on Information Theory, 267-281, Armenia, 1973.
[3] P. Auer, R. Holte, W. Maass. Theory and Applications of Agnostic PAC-Learning with Small Decision Trees. In Machine Learning: Proceedings of the Twelfth International Conference, Morgan Kaufmann, 1995.
[4] L. Breiman, J. Friedman, R. Olshen, C. Stone. Classification and Regression Trees. CRC Press, 1998.
[5] M. Boulle. Khiops: A Statistical Discretization Method of Continuous Attributes. Machine Learning, 55, 53-69, 2004.
[6] M. Boulle. MODL: A Bayes Optimal Discretization Method for Continuous Attributes. Machine Learning, 65, 1 (Oct. 2006), 131-165.
[7] George Casella and Roger L. Berger. Statistical Inference (2nd Edition). Duxbury Press, 2001.
[8] J. Catlett. On Changing Continuous Attributes into Ordered Discrete Attributes. In Proceedings of the European Working Session on Learning, 164-178, 1991.
[9] J. Y. Ching, A. K. C. Wong, K. C. C. Chan. Class-Dependent Discretization for Inductive Learning from Continuous and Mixed-Mode Data. IEEE Transactions on Pattern Analysis and Machine Intelligence, V. 17, No. 7, 641-651, 1995.
[10] M. R. Chmielewski, J. W. Grzymala-Busse. Global Discretization of Continuous Attributes as Preprocessing for Machine Learning. International Journal of Approximate Reasoning, 15, 1996.
[11] Y. S. Choi, B. R. Moon, S. Y. Seo. Genetic Fuzzy Discretization with Adaptive Intervals for Classification Problems. Proceedings of the 2005 Conference on Genetic and Evolutionary Computation, 2037-2043, 2005.
[12] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory, Second Edition. John Wiley & Sons, Inc., 2006.
[13] J. Dougherty, R. Kohavi, M. Sahami. Supervised and Unsupervised Discretization of Continuous Attributes. Proceedings of the 12th International Conference on Machine Learning, 194-202, 1995.
[14] Tapio Elomaa and Juho Rousu. Efficient Multisplitting Revisited: Optima-Preserving Elimination of Partition Candidates. Data Mining and Knowledge Discovery, 8, 97-126, 2004.
[15] U. M. Fayyad and K. B. Irani. Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning. In Proceedings of the 13th Joint Conference on Artificial Intelligence, 1022-1029, 1993.
[16] David Hand, Heikki Mannila, Padhraic Smyth. Principles of Data Mining. MIT Press, 2001.
[17] Federico Girosi, Michael Jones, and Tomaso Poggio. Regularization Theory and Neural Networks Architectures. Neural Computation, Volume 7, Issue 2 (March 1995), 219-269.
[18] M. H. Hansen, B. Yu. Model Selection and the Principle of Minimum Description Length. Journal of the American Statistical Association, 96, p. 454, 2001.
[19] R. C. Holte. Very Simple Classification Rules Perform Well on Most Commonly Used Datasets. Machine Learning, 11, 63-90, 1993.
[20] D. Janssens, T. Brijs, K. Vanhoof, and G. Wets. Evaluating the Performance of Cost-Based Discretization versus Entropy- and Error-Based Discretization. Computers & Operations Research, 33, 11 (Nov. 2006), 3107-3123.
[21] N. Johnson, S. Kotz, N. Balakrishnan. Continuous Univariate Distributions, Second Edition. John Wiley & Sons, Inc., 1994.
[22] Ruoming Jin and Yuri Breitbart. Data Discretization Unification. Technical Report (http://www.cs.kent.edu/research/techrpts.html), Department of Computer Science, Kent State University, 2007.
[23] Randy Kerber. ChiMerge: Discretization of Numeric Attributes. National Conference on Artificial Intelligence, 1992.
[24] L. A. Kurgan, K. J. Cios. CAIM Discretization Algorithm. IEEE Transactions on Knowledge and Data Engineering, V. 16, No. 2, 145-153, 2004.
[25] R. Kohavi, M. Sahami. Error-Based and Entropy-Based Discretization of Continuous Features. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 114-119, Menlo Park, CA, AAAI Press, 1996.
[26] Huan Liu, Farhad Hussain, Chew Lim Tan, Manoranjan Dash. Discretization: An Enabling Technique. Data Mining and Knowledge Discovery, 6, 393-423, 2002.
[27] H. Liu and R. Setiono. Chi2: Feature Selection and Discretization of Numeric Attributes. Proceedings of the 7th IEEE Int'l Conference on Tools with Artificial Intelligence, 1995.
[28] X. Liu, H. Wang. A Discretization Algorithm Based on a Heterogeneity Criterion. IEEE Transactions on Knowledge and Data Engineering, V. 17, No. 9, 1166-1173, 2005.
[29] S. Mussard, F. Seyte, M. Terraza. Decomposition of Gini and the Generalized Entropy Inequality Measures. Economic Bulletin, Vol. 4, No. 7, 1-6, 2003.
[30] B. Pfahringer. Supervised and Unsupervised Discretization of Continuous Features. Proceedings of the 12th International Conference on Machine Learning, 456-463, 1995.
[31] J. Rissanen. Modeling by Shortest Data Description. Automatica, 14, 465-471, 1978.
[32] D. A. Simovici and S. Jaroszewicz. An Axiomatization of Partition Entropy. IEEE Transactions on Information Theory, Vol. 48, Issue 7, 2138-2142, 2002.
[33] Robert A. Stine. Model Selection Using Information Theory and the MDL Principle. Sociological Methods & Research, Vol. 33, No. 2, 230-260, 2004.
[34] Trevor Hastie, Robert Tibshirani and Jerome Friedman. The Elements of Statistical Learning. Springer-Verlag, 2001.
[35] David L. Wallace. Bounds on Normal Approximations to Student's and the Chi-Square Distributions. The Annals of Mathematical Statistics, Vol. 30, No. 4, 1121-1130, 1959.
[36] David L. Wallace. Correction to "Bounds on Normal Approximations to Student's and the Chi-Square Distributions". The Annals of Mathematical Statistics, Vol. 31, No. 3, p. 810, 1960.
[37] A. K. C. Wong, D. K. Y. Chiu. Synthesizing Statistical Knowledge from Incomplete Mixed-Mode Data. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 9, No. 6, 796-805, 1987.
[38] Ying Yang and Geoffrey I. Webb. Weighted Proportional k-Interval Discretization for Naive-Bayes Classifiers. In Advances in Knowledge Discovery and Data Mining: 7th Pacific-Asia Conference, PAKDD, 501-512, 2003.
Appendix

Derivation of the Goodness Function based on MDLP

For an interval S_1, the best way to transfer the labeling information of each point in the interval is bounded by a fundamental theorem of information theory, which states that the average length of the shortest message is at least N_1 × H(S_1). Though we could apply Huffman coding to obtain the optimal code for each interval, we are not interested in the absolutely minimal coding. Therefore, we apply the above quantity as the cost to transfer each interval. Given this, we can easily derive the total cost to transfer all I′ intervals as follows:

$$cost_1(data|model) = N \times H(S_1, \cdots, S_{I'}) = N_1 \times H(S_1) + N_2 \times H(S_2) + \cdots + N_{I'} \times H(S_{I'})$$

In the meantime, we have to transfer the model itself, which includes all the intervals and the coding book for transferring the point labels of each interval. The length of the message that transfers the model serves as the penalty function for the model. Transferring all the intervals requires a $\log_2\binom{N+I'-1}{I'-1}$-bit message. This cost, denoted as L_1(I′, N), can be approximated as

$$L_1(I', N) = \log_2\binom{N+I'-1}{I'-1} \approx (N+I'-1)\, H\!\Big(\frac{N}{N+I'-1}, \frac{I'-1}{N+I'-1}\Big)$$
$$= -\Big(N\log_2\frac{N}{N+I'-1} + (I'-1)\log_2\frac{I'-1}{N+I'-1}\Big) \approx (I'-1)\log_2\frac{N+I'-1}{I'-1} \approx (I'-1)\log_2\frac{N}{I'-1}$$

(since $\log_2\frac{N}{N+I'-1} \to 0$ as $N \to \infty$).

Next, we have to consider the transfer of the coding book for each interval. For a given interval S_i, each code corresponds to a class, which can be coded in log₂ J bits. We need to transfer such codes at most J − 1 times for each interval, since after knowing J − 1 classes the remaining class can be inferred. Therefore, the total cost for the coding book, denoted as L_2, can be written as

$$L_2 = I' \times (J-1) \times \log_2 J$$

Given this, the penalty of the discretization from the information-theoretical viewpoint is

$$penalty_1(model) = L_1(I', N) + L_2 = (I'-1)\log_2\frac{N}{I'-1} + I' \times (J-1) \times \log_2 J$$

Put together, the cost of the discretization based on MDLP is

$$Cost_{MDLP} = \sum_{i=1}^{I'} N_i H(S_i) + (I'-1)\log_2\frac{N}{I'-1} + I'(J-1)\log_2 J$$
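For illustration, the sketch below (Python; our own code rather than anything from the paper, with cost_mdlp and shannon_entropy as hypothetical names) evaluates Cost_MDLP for a discretization given as a list of class-count rows, using the data cost Σ N_i H(S_i) and the penalty (I′−1) log₂(N/(I′−1)) + I′(J−1) log₂ J derived above.

import math

def shannon_entropy(counts):
    n = float(sum(counts))
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def cost_mdlp(intervals):
    # intervals: list of I' rows of class counts, one row per interval.
    N = float(sum(sum(s) for s in intervals))
    I_prime, J = len(intervals), len(intervals[0])
    data_cost = sum(sum(s) * shannon_entropy(s) for s in intervals)
    penalty = I_prime * (J - 1) * math.log2(J)
    if I_prime > 1:
        penalty += (I_prime - 1) * math.log2(N / (I_prime - 1))
    return data_cost + penalty

# Two nearly pure intervals keep the data cost low at a modest model penalty:
print(cost_mdlp([[19, 1], [2, 18]]))
print(cost_mdlp([[21, 19]]))   # the fully merged table C0, for comparison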
Proof of Theorem 1

Proof: We will first focus on the proof for GF_MDLP. The proofs for GF_AIC and GF_BIC can be derived similarly.

Merging Principle (P1) for GF_MDLP: Assume we have two consecutive rows i and i+1 in the contingency table C, S_i = ⟨c_{i1}, · · · , c_{iJ}⟩ and S_{i+1} = ⟨c_{i+1,1}, · · · , c_{i+1,J}⟩, where c_{ij} = c_{i+1,j} for all j, 1 ≤ j ≤ J. Let C′ be the resulting contingency table after we merge these two rows. Then we have

$$\sum_{k=1}^{I} N_k \times H(S_k) = \sum_{k=1}^{i-1} N_k \times H(S_k) + N_i \times H(S_i) + N_{i+1} \times H(S_{i+1}) + \sum_{k=i+2}^{I} N_k \times H(S_k)$$
$$= \sum_{k=1}^{i-1} N_k \times H(S_k) + (N_i+N_{i+1}) \times H(S_i) + \sum_{k=i+2}^{I} N_k \times H(S_k) = \sum_{k=1}^{I-1} N'_k H(S'_k)$$

In addition, we have

$$(I-1)\log_2\frac{N}{I-1} + (I-1)\times J\times\log_2 J - \Big((I-2)\log_2\frac{N}{I-2} + (I-2)\times J\times\log_2 J\Big)$$
$$= (I-1)\log_2 N - (I-1)\log_2(I-1) + (I-1)\times J\times\log_2 J - \big((I-2)\log_2 N - (I-2)\log_2(I-2) + (I-2)\times J\times\log_2 J\big)$$
$$> \log_2 N - \log_2(I-1) + J\times\log_2 J > 0 \qquad (N \ge I)$$

Adding these together, we have Cost_MDLP(C) > Cost_MDLP(C′), and GF_MDLP(C) < GF_MDLP(C′).

Symmetric Principle (P2) for GF_MDLP: This can be directly derived from the symmetry property of entropy.

MIN Principle (P3) for GF_MDLP: Since the number of rows (I), the number of samples (N), and the number of classes (J) are fixed, we only need to maximize N × H(S_1, · · · , S_I).

$$N \times H(S_1, \cdots, S_I) \le N \times H(S_1 \cup \cdots \cup S_I)$$
$$N \times H(S_1, \cdots, S_I) = \sum_{k=1}^{I} N_k \times H(S_k) = \sum_{k=1}^{I} N_k \times H(S_1 \cup \cdots \cup S_I) = N \times H(S_1 \cup \cdots \cup S_I)$$

MAX Principle (P4) for GF_MDLP: Since the number of rows (I), the number of samples (N), and the number of classes (J) are fixed, we only need to minimize N × H(S_1, · · · , S_I).

$$N \times H(S_1, \cdots, S_I) = \sum_{k=1}^{I} N_k \times H(S_k) \ge \sum_{k=1}^{I} N_k \times \log_2 1 \ge 0$$

Now, we prove the four properties for GF_{X²}.

Merging Principle (P1) for GF_{X²}: Assume we have two consecutive rows i and i+1 in the contingency table C, S_i = ⟨c_{i1}, · · · , c_{iJ}⟩ and S_{i+1} = ⟨c_{i+1,1}, · · · , c_{i+1,J}⟩, where c_{ij} = c_{i+1,j} for all j, 1 ≤ j ≤ J. Let C′ be the resulting contingency table after we merge these two rows. Then we have
$$X_C^2 - X_{C'}^2 = \sum_k\sum_j \frac{(c_{kj} - N_k \times M_j/N)^2}{N_k \times M_j/N} - \sum_k\sum_j \frac{(c'_{kj} - N'_k \times M_j/N)^2}{N'_k \times M_j/N}$$
$$= \sum_{j=1}^{J} \frac{(c_{ij} - N_i \times M_j/N)^2}{N_i \times M_j/N} + \sum_{j=1}^{J} \frac{(c_{i+1,j} - N_{i+1} \times M_j/N)^2}{N_{i+1} \times M_j/N} - \sum_{j=1}^{J} \frac{\big((c_{ij}+c_{i+1,j}) - (N_i+N_{i+1}) \times M_j/N\big)^2}{(N_i+N_{i+1}) \times M_j/N}$$
$$= 2\times\sum_{j=1}^{J} \frac{(c_{ij} - N_i \times M_j/N)^2}{N_i \times M_j/N} - \sum_{j=1}^{J} \frac{(2c_{ij} - 2N_i \times M_j/N)^2}{2N_i \times M_j/N}$$
$$= 2\times\sum_{j=1}^{J} \frac{(c_{ij} - N_i \times M_j/N)^2}{N_i \times M_j/N} - \sum_{j=1}^{J} \frac{4\times(c_{ij} - N_i \times M_j/N)^2}{2N_i \times M_j/N} = 0$$

We note that the degrees of freedom of the original contingency table are (I−1)(J−1), while those of the second are (I−2)(J−1). In addition, for any t > 0 we have $F_{\chi^2_{(I-1)(J-1)}}(t) < F_{\chi^2_{(I-2)(J-1)}}(t)$. Therefore, the second table is better than the first one.
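A quick numerical check of this argument (our own sketch in Python, with numpy and scipy assumed available; not part of the paper): a table containing two identical adjacent rows and the table obtained by merging them have exactly the same X², but the merged table has J − 1 fewer degrees of freedom and therefore a larger confidence value F_{χ²_df}(X²).

import numpy as np
from scipy.stats import chi2

def pearson_x2(table):
    # X^2 = sum_kj (c_kj - N_k * M_j / N)^2 / (N_k * M_j / N)
    t = np.asarray(table, dtype=float)
    expected = np.outer(t.sum(axis=1), t.sum(axis=0)) / t.sum()
    return ((t - expected) ** 2 / expected).sum()

C       = [[10, 2], [10, 2], [3, 9]]   # rows 1 and 2 are identical
C_prime = [[20, 4], [3, 9]]            # C' after merging them

for t in (C, C_prime):
    df = (len(t) - 1) * (len(t[0]) - 1)
    x2 = pearson_x2(t)
    print(round(x2, 4), df, round(chi2.cdf(x2, df), 4))   # same X2, smaller df, larger confidence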
Symmetric Principle (P2) for GF_{X²}: This can be directly derived from the symmetry property of X².

MIN Principle (P3) for GF_{X²}: Since the number of rows (I), the number of samples (N), and the number of classes (J) are fixed, we only need to minimize X². Since c_{kj} = 1/J × N_k, we can see that M_j = N/J.

$$X^2 = \sum_k\sum_j \frac{(c_{kj} - N_k \times M_j/N)^2}{N_k \times M_j/N} = \sum_k\sum_j \frac{(M_j/N \times N_k - N_k \times M_j/N)^2}{N_k \times M_j/N} = 0$$

Since X² ≥ 0, we achieve the minimum of X².

MAX Principle (P4) for GF_{X²}: Since the number of rows (I), the number of samples (N), and the number of classes (J) are fixed, we only need to maximize X².

$$X^2 = \sum_k\sum_j \frac{(c_{kj} - N_k \times M_j/N)^2}{N_k \times M_j/N} = \sum_k\sum_j \frac{c_{kj}^2 + (N_k \times M_j/N)^2 - 2\times c_{kj}\times N_k\times M_j/N}{N_k \times M_j/N}$$
$$= \sum_k\sum_j \Big(\frac{c_{kj}^2}{N_k \times M_j/N} + N_k \times M_j/N - 2\times c_{kj}\Big) = \sum_{j=1}^{J} N/M_j \times \Big[\sum_k c_{kj}\,\frac{c_{kj}}{N_k}\Big] + N - 2N \qquad \Big(\frac{c_{kj}}{N_k} \le 1\Big)$$
$$\le \sum_{j=1}^{J} (N/M_j) \times \sum_k c_{kj} - N = \sum_{j=1}^{J} (N/M_j) \times M_j - N = (J-1)\times N$$

Note that this bound can be achieved under our condition: in any row k, one cell has c_{kj} = N_k. Therefore, $F_{\chi^2_{(I-1)(J-1)}}(X^2)$ is maximized. In other words, we have the best possible discretization given the existing conditions.

The proof for GF_{G²} can be derived similarly from GF_MDLP and GF_{X²}. □

Details of Global and Local Independence Tests

To facilitate our investigation, we first formally describe the local independence tests. Let $X^2_{(i,i+k)}$ be the Pearson chi-square statistic for the k+1 consecutive rows (from i to i+k), and let $F_{\chi^2_{k\times(J-1)}}(X^2_{(i,i+k)})$ be the confidence level for these rows and their corresponding columns being statistically independent. The lower this confidence level, the harder it is to reject the independence hypothesis. Given this, and assuming a user-specified threshold δ (usually less than 50%), if $F_{\chi^2_{k\times(J-1)}}(X^2_{(i,i+k)}) < \delta$, we can merge these k+1 rows into one row. Usually, the user defines a threshold for this purpose, such as 50%. If the confidence level derived from the independence test is lower than this threshold, we cannot reject H₀, the independence hypothesis. Therefore, we treat the rows as statistically independent and allow them to be merged.

Now, to relate the global goodness function to the local independence test, we map the global function to the local conditions. Consider two contingency tables C₁ and C₂, where the only difference between them is that k consecutive rows of C₁ are merged into a single row in C₂. Given this, we can express the global difference as a local difference:

$$Cost(C_1) - Cost(C_2) = -G^2_{i,i+k-1} + O((k-1)(J-1)\log N)$$

where we assume the penalty of the global goodness function is O(df log N). Since the discretization should reduce the value of the global goodness function, i.e., Cost(C₁) > Cost(C₂), we need the local condition $-G^2_{i,i+k-1} + O((k-1)(J-1)\log N) > 0$. In other words, as long as

$$G^2_{i,i+k-1} < O((k-1)(J-1)\log N)$$

holds, we can combine the k rows into one.

Let us now focus on the local independence test, which has been used in the goodness function based on the number of intervals of the discretization. Note that we require

$$F_{\chi^2_{(k-1)\times(J-1)}}(G^2_{(i,i+k-1)}) < \delta$$

We would like the G² statistic of the k consecutive rows to be as small as possible (as close to 0 as possible). The threshold δ is usually chosen to be at most 50%, which means we have at least as much confidence in accepting the independence hypothesis as in rejecting it. Based on the approximation results of Fisher and of Wilson & Hilferty [21], the 50% percentile point of $\chi^2_{df}$ is

$$\chi^2_{df,0.5} \approx df = (k-1)(J-1)$$

Given this, we can rewrite our local hypothesis test as

$$G^2_{i,i+k-1} < O(df)$$

In other words, as long as the above condition holds, we can merge these consecutive rows into one.
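A sketch of the resulting local merge test (Python; our own illustration, not the paper's code — g2_statistic and can_merge are hypothetical names, and G² is computed with the standard likelihood-ratio formula 2 Σ c ln(c/e)): a block of consecutive rows is merged when its G² stays below the χ² median, approximated here by the degrees of freedom.

import math

def g2_statistic(block):
    # Wilks' G2 for a sub-table given as a list of consecutive rows of class counts.
    N = float(sum(sum(r) for r in block))
    row_sums = [sum(r) for r in block]
    col_sums = [sum(c) for c in zip(*block)]
    g2 = 0.0
    for row, nr in zip(block, row_sums):
        for c, mj in zip(row, col_sums):
            if c > 0:
                g2 += 2.0 * c * math.log(c / (nr * mj / N))
    return g2

def can_merge(block):
    # Local test: merge when G2 < df, with df approximating the 50% point of chi^2_df.
    df = (len(block) - 1) * (len(block[0]) - 1)
    return g2_statistic(block) < df

print(can_merge([[10, 2], [11, 2]]))   # near-identical class distributions -> True
print(can_merge([[10, 2], [2, 10]]))   # clearly different distributions    -> False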
Derivation of Goodness Function based on Generalized Entropy (and Gini)

As discussed in Section 2, MDLP provides a general way to consider the trade-off between the complexity of the data given a model and the complexity of the model itself (also referred to as the penalty of the model). Here, we apply the generalized entropy to describe both complexities and use their sum as the goodness function.

First, for a given interval S_i, we use N_i × H_β(S_i) to describe the labeling information of its points. Note that this is an analog of the traditional information-theoretic statement that the average length of the shortest message cannot be lower than N_i × H(S_i).

$$cost_\beta(data|model) = N \times H_\beta(S_1, \cdots, S_{I'}) = N_1 \times H_\beta(S_1) + N_2 \times H_\beta(S_2) + \cdots + N_{I'} \times H_\beta(S_{I'})$$

In the meantime, we have to transfer the model itself, which includes all the intervals and the coding book for transferring the point labels of each interval. For transferring the interval information, we consider a message of the following format: supposing there are I′ intervals, we first transfer N_1 zeros followed by a stop symbol 1, then N_2 zeros, and so on, up to the last N_{I'} zeros. Once again, this information corresponds to the impurity of the message measured by the generalized entropy, and this cost serves as the penalty function for the model. The impurity of all the intervals, denoted as L1_β(I′, N), is as follows:

$$L1_\beta(I', N) = (N+I'-1)\,H_\beta\!\Big(\frac{N}{N+I'-1}, \frac{I'-1}{N+I'-1}\Big)$$
$$= (N+I'-1)\times\Big\{\frac{N}{N+I'-1}\Big[1-\Big(\frac{N}{N+I'-1}\Big)^\beta\Big]/\beta + \frac{I'-1}{N+I'-1}\Big[1-\Big(\frac{I'-1}{N+I'-1}\Big)^\beta\Big]/\beta\Big\}$$
$$= N\times\Big[1-\Big(\frac{N}{N+I'-1}\Big)^\beta\Big]/\beta + (I'-1)\times\Big[1-\Big(\frac{I'-1}{N+I'-1}\Big)^\beta\Big]/\beta$$

Clearly, when β → 0, this is the traditional entropy measure L_1; when β = 1, it is the impurity measure based on Gini.

Next, we have to consider transferring the coding book for each interval. For a given interval S_i, each code corresponds to a class; therefore, there are a total of J! ways of coding. Given this, the total cost for the coding book, denoted as L2_β, can be written as

$$L2_\beta = I' \times J! \times H_\beta\!\Big(\frac{1}{J!}, \frac{J!-1}{J!}\Big) = I' \times J! \times \Big\{\frac{1}{J!}\Big[1-\Big(\frac{1}{J!}\Big)^\beta\Big]/\beta + \frac{J!-1}{J!}\Big[1-\Big(\frac{J!-1}{J!}\Big)^\beta\Big]/\beta\Big\}$$

Given this, the penalty of the discretization based on the generalized entropy is

$$penalty_\beta(model) = L1_\beta(I', N) + L2_\beta \approx (I'-1)\times\Big[1-\Big(\frac{I'-1}{N+I'-1}\Big)^\beta\Big]/\beta + I'\Big[1-\Big(\frac{1}{J!}\Big)^\beta\Big]/\beta$$
$$\approx (I'-1)\times\Big[1-\Big(\frac{I'-1}{N}\Big)^\beta\Big]/\beta + I'\Big[1-\Big(\frac{1}{J!}\Big)^\beta\Big]/\beta$$

Note that when β = 1, we have penalty_β(model) ≈ 2I′ − 1. When β → 0, we have penalty_β(model) ≈ (I′−1) log(N/(I′−1)) + I′(J−1) log J.

Put together, the cost of the discretization is

$$Cost_\beta = \sum_{i=1}^{I'} N_i H_\beta(S_i) + L1_\beta(I', N) + L2_\beta = \sum_{i=1}^{I'} N_i H_\beta(S_i) + (I'-1)\times\Big[1-\Big(\frac{I'-1}{N}\Big)^\beta\Big]/\beta + I'\Big[1-\Big(\frac{1}{J!}\Big)^\beta\Big]/\beta$$

Similar to the treatment in Section 2, we define the generalized goodness function as the cost difference between contingency table C⁰ and contingency table C:

$$GF_\beta(C) = Cost_\beta(C^0) - Cost_\beta(C)$$

Note that the goodness function based on Gini can be derived simply by fixing β = 1.

Proof of Theorem 10

Proof: Merging Principle (P1) for GF_{α,β}: Assume we have two consecutive rows i and i+1 in the contingency table C, S_i = ⟨c_{i1}, · · · , c_{iJ}⟩ and S_{i+1} = ⟨c_{i+1,1}, · · · , c_{i+1,J}⟩, where c_{ij} = c_{i+1,j} for all j, 1 ≤ j ≤ J. Let C′ be the resulting contingency table after we merge these two rows. Then we have

$$\sum_{k=1}^{I} N_k \times H_\beta(S_k) = \sum_{k=1}^{i-1} N_k \times H_\beta(S_k) + N_i \times H_\beta(S_i) + N_{i+1} \times H_\beta(S_{i+1}) + \sum_{k=i+2}^{I} N_k \times H_\beta(S_k)$$
$$= \sum_{k=1}^{i-1} N_k \times H_\beta(S_k) + (N_i+N_{i+1}) \times H_\beta(S_i) + \sum_{k=i+2}^{I} N_k \times H_\beta(S_k) = \sum_{k=1}^{I-1} N'_k H_\beta(S'_k)$$

In addition, we have

$$\alpha\times(I-1)(J-1)\Big[1-\Big(\frac{1}{N}\Big)^\beta\Big]/\beta - \alpha\times(I-2)(J-1)\Big[1-\Big(\frac{1}{N}\Big)^\beta\Big]/\beta = \alpha\times(J-1)\Big[1-\Big(\frac{1}{N}\Big)^\beta\Big]/\beta > 0$$

Thus, we have GF_{α,β}(C) < GF_{α,β}(C′).

Symmetric Principle (P2) for GF_{α,β}: This can be directly derived from the symmetry property of the generalized entropy.

MIN Principle (P3) for GF_{α,β}: Since the number of rows (I), the number of samples (N), and the number of classes (J) are fixed, we only need to maximize N × H_β(S_1, · · · , S_I). By the concaveness of H_β (Theorem 9),

$$N \times H_\beta(S_1, \cdots, S_I) \le N \times H_\beta(S_1 \cup \cdots \cup S_I)$$
$$N \times H_\beta(S_1, \cdots, S_I) = \sum_{k=1}^{I} N_k \times H_\beta(S_k) = \sum_{k=1}^{I} N_k \times H_\beta(S_1 \cup \cdots \cup S_I) = N \times H_\beta(S_1 \cup \cdots \cup S_I)$$

MAX Principle (P4) for GF_{α,β}: Since the number of rows (I), the number of samples (N), and the number of classes (J) are fixed, we only need to minimize N × H_β(S_1, · · · , S_I).

$$N \times H_\beta(S_1, \cdots, S_I) = \sum_{k=1}^{I} N_k \times H_\beta(S_k) \ge \sum_{k=1}^{I} N_k \times 0 \ge 0$$

Note that the proof for GF_{α,β} immediately implies that the four principles hold for GF_AIC and GF_BIC. □
