where H(S_i) is the entropy of interval S_i, the first term corresponds to cost(data|model), and the rest corresponds to penalty(model). The detailed derivation is given in the Appendix. In the following, we formally introduce the notion of entropy and show how a merge of adjacent data intervals results in information loss as well as in an increase of cost(data|model).

DEFINITION 1. [12] The entropy of an ensemble X is defined to be the average Shannon information content of an outcome:

H(X) = \sum_{x \in A_x} P(x) \log_2 \frac{1}{P(x)}

where A_x is the set of possible outcomes of x.

2.2 Statistical Model Selection (AIC and BIC)
A different way to look at a contingency table is to assume that all data points are generated from certain distributions (models) with unknown parameters. Given a distribution, the maximal likelihood principle (MLP) can help us find the best parameters to fit the data [16]. However, to provide a better data fit, more expensive models (with more parameters) are needed. Statistical model selection tries to find the right balance between the complexity of a model, which corresponds to the number of parameters, and the fitness of the data to the selected model, which corresponds to the likelihood of the data being generated by the given model.
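As a concrete illustration of Definition 1, the short Python sketch below computes the entropy of a single interval and the weighted table entropy H(S_1, ..., S_{I'}); the helper names and the toy table are ours, not part of the original formulation.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (bits) of one interval, per Definition 1."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

def table_entropy(table):
    """H(S1,...,SI'): row entropies weighted by the row sums Ni/N."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    return float(sum(row.sum() / n * entropy(row) for row in table))

# example: a 3-interval, 2-class contingency table
C = [[8, 2], [5, 5], [1, 9]]
print(entropy(C[0]), table_entropy(C))
```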
In statistics, the multinomial distribution is commonly used to model a contingency table. Here, we assume the data in each interval (or row) of the contingency table are independent and all intervals are independent. Thus, the kernel of the likelihood function for the entire contingency table is:

L(\vec{\pi}) = \prod_{i=1}^{I'} \prod_{j=1}^{J} \pi_{j|i}^{c_{ij}}

where \vec{\pi} = (\pi_{1|1}, \pi_{2|1}, \ldots, \pi_{J|1}, \cdots, \pi_{J|I'}) are the unknown parameters. Applying the maximal likelihood principle, we identify the best-fitting parameters as \pi_{j|i} = c_{ij}/N_i, 1 \le i \le I', 1 \le j \le J. We commonly transform the likelihood to the log-likelihood as follows:

S_L(D|\vec{\pi}) = -\log L(\vec{\pi}) = -\sum_{i=1}^{I'} \sum_{j=1}^{J} c_{ij} \log \frac{c_{ij}}{N_i}

According to [16], S_L(\vec{\pi}) is treated as a type of entropy term that measures how well the parameters \vec{\pi} can compress (or predict) the training data.
Clearly, different discretizations correspond to different multinomial distributions (models). For choosing the best discretization model, the Akaike information criterion, or AIC [16], can be used; it is defined as follows:

Cost_{AIC} = 2 S_L(D|\vec{\pi}) + 2(I' \times (J-1))    (3)

where the first term corresponds to the fitness of the data given the discretization model, and the second term corresponds to the complexity of the model. Note that in this model, for each row we have the constraint \pi_{1|i} + \cdots + \pi_{J|i} = 1. Therefore, the number of parameters for each row is J-1.
Alternatively, choosing the best discretization model based on Bayesian arguments that take into account the size N of the training set is also frequently done. The Bayesian information criterion, or BIC [16], is defined as follows:

Cost_{BIC} = 2 S_L(D|\vec{\pi}) + (I' \times (J-1)) \log N    (4)

In the BIC definition, the penalty of the model is higher than the one in the AIC by a factor of \log N / 2.
Goodness Function based on AIC and BIC: For the same reason as for MDLP, we define the goodness function of a given contingency table based on AIC and BIC as the difference between the cost of C^0 (the table resulting from merging all the rows of C into a single row) and the cost of C:

GF_{AIC}(C) = Cost_{AIC}(C^0) - Cost_{AIC}(C)    (5)

GF_{BIC}(C) = Cost_{BIC}(C^0) - Cost_{BIC}(C)    (6)

2.3 Confidence Level from Independence Tests
Another way to treat discretization is to merge intervals so that the rows (intervals) and columns (classes) of the entire contingency table become more statistically dependent. In other words, the goodness function of a contingency table measures its statistical quality in terms of independence tests.
Pearson's X^2: In existing discretization approaches, the Pearson statistic X^2 [1] is commonly used to test statistical independence. The X^2 statistic is as follows:

X^2 = \sum_{i} \sum_{j} \frac{(c_{ij} - \hat{m}_{ij})^2}{\hat{m}_{ij}}

where \hat{m}_{ij} = N(N_i/N)(M_j/N) are the expected frequencies. It is well known that the Pearson X^2 statistic has an asymptotic \chi^2 distribution with degrees of freedom df = (I'-1)(J-1), where I' is the total number of rows. Consider a null hypothesis H_0 (the rows and columns are statistically independent) against an alternative hypothesis H_a. Consequently, we obtain the confidence level of the statistical test to reject the independence hypothesis (H_0). The confidence level is calculated as

F_{\chi^2_{df}}(X^2) = \frac{1}{2^{df/2}\,\Gamma(df/2)} \int_0^{X^2} s^{df/2-1} e^{-s/2}\, ds

where F_{\chi^2_{df}} is the cumulative \chi^2 distribution function. We use the calculated confidence level as our goodness function to compare different discretization methods that use Pearson's X^2 statistic. Our goodness function is formally defined as

GF_{X^2}(C) = F_{\chi^2_{df}}(X^2)    (7)

We note that 1 - F_{\chi^2_{df}}(X^2) is essentially the P-value of the aforementioned statistical independence test [7]. The lower the P-value (or, equivalently, the higher the goodness), the more confidently we can reject the independence hypothesis (H_0). This approach has been used in Khiops [5], which describes a heuristic algorithm to perform discretization.
Wilks' G^2: In addition to Pearson's chi-square statistic, another statistic, called the likelihood-ratio \chi^2 statistic or Wilks' statistic [1], is used for the independence test. This statistic is derived from the likelihood-ratio test, which is a general-purpose way of testing a null hypothesis H_0 against an alternative hypothesis H_a. In this case we treat both the intervals (rows) and the classes (columns) equally as two categorical variables, denoted X and Y. Given this, the null hypothesis of statistical independence is H_0: \pi_{ij} = \pi_{i+}\pi_{+j} for every row i and column j, where \{\pi_{ij}\} is the joint distribution of X and Y, and \pi_{i+} and \pi_{+j} are the marginal distributions of row i and column j, respectively.
Based on the multinomial sampling assumption (a common assumption for a contingency table) and the maximal likelihood principle, these parameters can be estimated as \hat{\pi}_{i+} = N_i/N, \hat{\pi}_{+j} = M_j/N, and \hat{\pi}_{ij} = N_i M_j / N^2 (under H_0). In the general case under H_a, the likelihood is maximized when \hat{\pi}_{ij} = c_{ij}/N. Thus the statistical independence between the rows and the columns of a contingency table can be expressed as the ratio of the likelihoods:

\Lambda = \frac{\prod_{i=1}^{I'} \prod_{j=1}^{J} (N_i M_j / N^2)^{c_{ij}}}{\prod_{i=1}^{I'} \prod_{j=1}^{J} (c_{ij}/N)^{c_{ij}}}

where the denominator corresponds to the likelihood under H_a, and the numerator corresponds to the likelihood under H_0. Wilks has shown that -2\log\Lambda, denoted by G^2, has a limiting null chi-squared distribution as N \to \infty:

G^2 = -2\log\Lambda = 2 \sum_{i=1}^{I'} \sum_{j=1}^{J} c_{ij} \log \frac{c_{ij}}{N_i M_j / N}    (8)

For large samples, G^2 has a chi-squared null distribution with degrees of freedom equal to (I'-1)(J-1). Clearly, we can use G^2 to replace X^2 for calculating the confidence level of the entire contingency table, which serves as our goodness function:

GF_{G^2}(C) = F_{\chi^2_{df}}(G^2)    (9)

Indeed, this statistic has been applied in discretization (though not as a global goodness function), and is referred to as class-attribute interdependency information [37].
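The goodness functions of Formulas (3)-(9) are straightforward to compute. Below is a minimal Python sketch using numpy/scipy, assuming the contingency table is given as a list of per-interval class counts; all function names and the toy table are ours, not part of the paper.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

def log_likelihood_term(table):
    """S_L(D|pi_hat) = -sum_ij c_ij log(c_ij / N_i), natural log."""
    t = np.asarray(table, dtype=float)
    n_i = t.sum(axis=1, keepdims=True)
    mask = t > 0
    return float(-(t[mask] * np.log((t / n_i)[mask])).sum())

def cost_aic(table):                                            # Formula (3)
    I, J = np.asarray(table).shape
    return 2 * log_likelihood_term(table) + 2 * I * (J - 1)

def cost_bic(table):                                            # Formula (4)
    t = np.asarray(table, dtype=float)
    I, J = t.shape
    return 2 * log_likelihood_term(t) + I * (J - 1) * np.log(t.sum())

def gf_aic(table):                                              # Formula (5)
    merged = np.asarray(table, dtype=float).sum(axis=0, keepdims=True)
    return cost_aic(merged) - cost_aic(table)

def gf_bic(table):                                              # Formula (6)
    merged = np.asarray(table, dtype=float).sum(axis=0, keepdims=True)
    return cost_bic(merged) - cost_bic(table)

def gf_pearson(table):
    """GF_X2(C) = F_chi2_df(X2), Formula (7)."""
    x2, _, dof, _ = chi2_contingency(np.asarray(table, float), correction=False)
    return chi2.cdf(x2, dof)

def gf_wilks(table):
    """GF_G2(C) = F_chi2_df(G2), Formulas (8)-(9)."""
    g2, _, dof, _ = chi2_contingency(np.asarray(table, float),
                                     correction=False, lambda_="log-likelihood")
    return chi2.cdf(g2, dof)

C = [[8, 2], [5, 5], [1, 9]]
print(gf_aic(C), gf_bic(C), gf_pearson(C), gf_wilks(C))
```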
2.4 Properties of Proposed Goodness Functions
An important theoretical question we address is how these methods are related to each other and how different they are. Answering these questions helps to understand the scope of these approaches and sheds light on the ultimate goal: for a given dataset, automatically find the best discretization method.
We first investigate some simple properties shared by the aforementioned goodness functions (Theorem 1). We now describe four basic principles we believe any goodness function for discretization must satisfy.

1. Merging Principle (P1): Let S_i = <c_{i1}, \cdots, c_{iJ}> and S_{i+1} = <c_{i+1,1}, \cdots, c_{i+1,J}> be two adjacent rows in the contingency table C. If c_{ij}/N_i = c_{(i+1)j}/N_{i+1} for all j, 1 \le j \le J, then GF(C') > GF(C), where N_i and N_{i+1} are the row sums, GF is a goodness function, and C' is the resulting contingency table after we merge these rows.
Intuitively, this principle reflects a main goal of discretization, which is to transform the continuous attribute into a compact interval-based representation with minimal loss of information. As we discussed before, a good discretization reduces the number of intervals without generating too many classification errors. Clearly, if two consecutive intervals have exactly the same data distribution, we cannot differentiate between them; in other words, we can merge them without information loss. Therefore, any goodness function should prefer to merge such consecutive intervals.
We note that the merging principle (P1) echoes the cut-point candidate pruning techniques for discretization studied by Fayyad and Irani [15] and Elomaa and Rousu [14]. However, they did not explicitly define a global goodness function for discretization. Instead, their focus is either on evaluating the goodness of each single cut or on the goodness when the target number of intervals for the discretization is given. As we mentioned in Section 1, the goodness function discussed in this paper captures the tradeoff between the information/statistical quality and the complexity of the discretization. In addition, this principle can be directly applied to reduce the size of the original contingency table, since we can simply merge consecutive rows with the same class distribution.

2. Symmetric Principle (P2): Let C_j be the j-th column of contingency table C. Then GF(C) = GF(C'), where C = <C_1, \ldots, C_J> and C' is obtained from C by an arbitrary permutation of C's columns.
This principle asserts that the order of class labels should not impact the goodness function that measures the quality of the discretization. Discretization results must be the same for both tables.

3. MIN Principle (P3): Consider all contingency tables C which have I rows and J columns and the same marginal distribution for classes (columns). If for every row S_i in C, S_i = <c_{i1}, \cdots, c_{iJ}>, c_{ij}/N_i = M_j/N, then the contingency table C reaches the minimum of any goodness function.
This principle determines the worst possible discretization for any contingency table. This is the case when each row shares exactly the same class distribution, and thus the entire table has maximal redundancy.

4. MAX Principle (P4): Consider all contingency tables C which have I rows and J columns. If for every row S_i in C, S_i = <c_{i1}, \cdots, c_{iJ}>, there exists one cell count such that c_{ij} \ne 0 while the other cells c_{ik}, k \ne j, satisfy c_{ik} = 0, then the contingency table C achieves the maximum of a goodness function over all I \times J contingency tables.
This principle determines the best possible discretization when the number of intervals is fixed. Clearly, the best discretization is achieved if we have the maximal discriminating power in each interval. This is the case where all the data points in each interval belong to only one class.

The following theorem states that all aforementioned goodness functions satisfy these four principles.

THEOREM 1. GF_{MDLP}, GF_{AIC}, GF_{BIC}, GF_{X^2}, and GF_{G^2} satisfy all four principles, P1, P2, P3, and P4.
Proof: In Appendix. □

3. EQUIVALENCE OF GOODNESS FUNCTIONS
In this section, we analytically compare the different discretization goodness functions introduced in Section 2. In particular, we find some rather surprising connections between these seemingly quite different approaches: the information theoretical complexity (Subsection 2.1), the statistical fitness (Subsection 2.2), and the statistical independence tests (Subsection 2.3). We basically prove that all these functions can be expressed in a uniform format as follows:

GF = G^2 - df \times f(G^2, N, I, J)    (10)

where df is the degrees of freedom of the contingency table, N is the number of data points, I is the number of data rows in the contingency table, J is the number of class labels, and f is bounded by O(\log N). The first term G^2 corresponds to the cost of the data given a discretization model (cost(data|model)), and the second corresponds to the penalty, or the complexity, of the model (penalty(model)).
To derive this expression, we first derive an expression for the cost of the data for the different goodness functions discussed in Section 2 (Subsection 3.1). This is achieved by expressing the G^2 statistic through information entropy (Theorem 3). Then, using Wallace's result [35, 36] on approximating the \chi^2 distribution with a normal distribution, we transform the goodness functions based on statistical independence tests into the format of Formula 10. Further, a detailed analysis of the function f reveals a deeper relationship shared by these different goodness functions (Subsection 3.3). Finally, we compare the methods based on global independence tests, such as Khiops [5] (GF_{X^2}), with those based on local independence tests, such as ChiMerge [23] and Chi2 [27] (Subsection 3.4).

3.1 Unifying the Cost of Data (cost(data|model)) to G^2
In the following, we establish the relationship among entropy, log-likelihood, and G^2. This is the first step for an analytical comparison of the goodness functions based on the information theoretical, statistical model selection, and statistical independence test approaches.
First, it is easy to see that for a given contingency table, the cost of the data transfer (cost(data|model), a key term in the information theoretical approach) is equivalent to the log-likelihood S_L(D|\vec{\pi}) (used in the statistical model selection approach), as the following theorem asserts.
THEOREM 2. For a given contingency table C_{I' \times J}, the cost of data transfer (cost_1(data|model)) is equal to the log-likelihood S_L(D|\vec{\pi}), i.e.,

N \times H(S_1, \cdots, S_{I'}) = -\log L(\vec{\pi})

Proof:

N \times H(S_1, \cdots, S_{I'}) = \sum_{i=1}^{I'} N_i \times H(S_i)
= -\sum_{i=1}^{I'} \sum_{j=1}^{J} N_i \times \frac{c_{ij}}{N_i} \log \frac{c_{ij}}{N_i}
= -\sum_{i=1}^{I'} \sum_{j=1}^{J} c_{ij} \log \frac{c_{ij}}{N_i} = -\log L(\vec{\pi})
□

The next theorem establishes a relationship between the entropy criterion and the likelihood-ratio independence test statistic G^2. This is the key to discovering the connection between the information theoretical and the statistical independence test approaches.
THEOREM 3. Let C be a contingency table. Then

G^2/2 = N \times H(S_1 \cup \cdots \cup S_{I'}) - N \times H(S_1, \cdots, S_{I'})

Proof:

G^2/2 = -\log\Lambda = \sum_{i=1}^{I'} \sum_{j=1}^{J} c_{ij} \log \frac{c_{ij}}{N_i M_j / N}
= \sum_{i=1}^{I'} \sum_{j=1}^{J} \left( c_{ij} \log \frac{c_{ij}}{N_i} + c_{ij} \log \frac{N}{M_j} \right)
= \sum_{i=1}^{I'} \sum_{j=1}^{J} c_{ij} \log \frac{c_{ij}}{N_i} - \sum_{j=1}^{J} \log \frac{M_j}{N} \times \sum_{i=1}^{I'} c_{ij}
= \sum_{i=1}^{I'} \sum_{j=1}^{J} c_{ij} \log \frac{c_{ij}}{N_i} - \sum_{j=1}^{J} M_j \log \frac{M_j}{N}
= N \times \big( H(S_1 \cup \cdots \cup S_{I'}) - H(S_1, \cdots, S_{I'}) \big)
□
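As a quick numerical sanity check of Theorem 3 (not part of the original paper), the sketch below compares G^2 from scipy's likelihood-ratio test with 2N(H(S_1 \cup \cdots \cup S_{I'}) - H(S_1, \cdots, S_{I'})) on a toy table; natural logarithms are used on both sides, and the helper and table are ours.

```python
import numpy as np
from scipy.stats import chi2_contingency

def h_nats(counts):
    """Shannon entropy in nats of a vector of counts."""
    p = np.asarray(counts, float)
    p = p[p > 0] / p.sum()
    return float(-(p * np.log(p)).sum())

C = np.array([[8.0, 2.0], [5.0, 5.0], [1.0, 9.0]])
N = C.sum()
rows_term = sum(row.sum() / N * h_nats(row) for row in C)   # H(S1,...,SI')
merged_term = h_nats(C.sum(axis=0))                         # H(S1 U ... U SI')

g2, _, dof, _ = chi2_contingency(C, correction=False, lambda_="log-likelihood")
print(g2, 2 * N * (merged_term - rows_term))                # the two values agree
```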
Theorem 3 can be generalized as follows.

THEOREM 4. Assume we have k consecutive rows S_i, S_{i+1}, \cdots, S_{i+k-1}. Let G^2_{(i,i+k-1)} be the likelihood-ratio independence test statistic for the k rows. Then we have

G^2_{(i,i+k-1)}/2 = N_{(i,i+k-1)} \big( H(S_i \cup \cdots \cup S_{i+k-1}) - H(S_i, \cdots, S_{i+k-1}) \big)

Proof: Omitted for simplicity. □

Consequently, we rewrite the goodness functions GF_{MDLP}, GF_{AIC} and GF_{BIC} as follows.

GF_{MDLP} = G^2 - 2(I'-1)\log\frac{N}{I'-1} - 2(I'-1)(J-1)\log J    (11)

GF_{AIC} = G^2 - (I'-1)(J-1)    (12)

GF_{BIC} = G^2 - (I'-1)(J-1)\log N / 2    (13)

For the rest of the paper we use the above formulas for GF_{MDLP}, GF_{AIC} and GF_{BIC}.
It has long been known that X^2 and G^2 are asymptotically equivalent. The next theorem provides a tool to connect the information theoretical approach and the statistical independence test approach based on Pearson's chi-square (X^2) statistic.

THEOREM 5. [1] Let N be the total number of data values in a contingency table T of I \times J dimensions. If the rows (columns) of the contingency table are independent, then X^2 - G^2 converges to 0 in probability as N \to \infty.

In the following, we mainly focus on the asymptotic properties shared by the X^2- and G^2-based cost functions. Thus, our further discussion of G^2 can also be applied to X^2.
Note that Theorems 2 and 3 basically establish the basis of Formula 10 for the goodness functions based on the information theoretical and statistical model selection approaches. Even though Theorems 4 and 5 relate the information theoretical approach (based on entropy) to the statistical independence test approach (based on G^2 and X^2), it is still unclear how to compare them directly, since the goodness function of the former is based on the total cost of transferring the data while the goodness function of the latter is the confidence level of a hypothesis test. Subsection 3.2 presents our approach to tackling this issue.

3.2 Unifying Statistical Independence Tests
In order to compare the quality of different goodness functions, we introduce a notion of equivalent goodness functions. Intuitively, the equivalence between goodness functions means that these functions rank different discretizations of the same contingency table identically.

DEFINITION 2. Let C be a contingency table and GF_1(C), GF_2(C) be two different goodness functions. GF_1 and GF_2 are equivalent if and only if for any two contingency tables C_1 and C_2, GF_1(C_1) \le GF_1(C_2) \Longrightarrow GF_2(C_1) \le GF_2(C_2) and GF_2(C_1) \le GF_2(C_2) \Longrightarrow GF_1(C_1) \le GF_1(C_2).

Using the equivalence notion, we can transform goodness functions to different scales and/or different formats. In the sequel, we apply this notion to compare the seemingly different goodness functions based on a statistical confidence level with those based on MDLP, AIC, and BIC.
The relationship between G^2 and the confidence level is rather complicated. It is clearly not a simple one-to-one mapping, as the same G^2 may correspond to very different confidence levels depending on the degrees of freedom of the \chi^2 distribution and, vice versa, the same confidence level may correspond to very different G^2 values. Interestingly enough, such a many-to-many mapping actually holds the key to the aforementioned transformation. Intuitively, we have to transform the confidence level to a scale of entropy or G^2 parameterized by the degrees of freedom of the \chi^2 distribution.
Our proposed transformation is as follows.

DEFINITION 3. Let u(t) be the normal deviate corresponding to the chi-square distributed variable t. That is, the following equality holds:

F_{\chi^2_{df}}(t) = \Phi(u(t))

where F_{\chi^2_{df}} is the cumulative \chi^2 distribution with df degrees of freedom, and \Phi is the cumulative normal distribution function. For a given contingency table C, which has the log-likelihood ratio G^2, we define

GF'_{G^2} = u(G^2)    (14)

as a new goodness function for C.
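To make Formulas (11)-(14) concrete, here is a hedged Python sketch that computes G^2 for a table and then the rewritten goodness functions GF_MDLP, GF_AIC, GF_BIC, together with the normal-deviate transform u(G^2) of Definition 3 (computed numerically via scipy rather than via Wallace's closed-form approximation); all function names and the example table are ours.

```python
import numpy as np
from scipy.stats import chi2, norm, chi2_contingency

def g2_and_df(table):
    """Likelihood-ratio statistic G2 and degrees of freedom of a table."""
    t = np.asarray(table, dtype=float)
    g2, _, dof, _ = chi2_contingency(t, correction=False,
                                     lambda_="log-likelihood")
    return g2, dof

def gf_mdlp(table):                                   # Formula (11)
    t = np.asarray(table, dtype=float)
    g2, _ = g2_and_df(t)
    i_rows, j_cols = t.shape
    n = t.sum()
    return (g2 - 2 * (i_rows - 1) * np.log(n / (i_rows - 1))
               - 2 * (i_rows - 1) * (j_cols - 1) * np.log(j_cols))

def gf_aic(table):                                    # Formula (12), rewritten form
    g2, dof = g2_and_df(table)
    return g2 - dof

def gf_bic(table):                                    # Formula (13), rewritten form
    g2, dof = g2_and_df(table)
    return g2 - dof * np.log(np.asarray(table, float).sum()) / 2

def u_of_g2(table):                                   # Definition 3 / Formula (14)
    g2, dof = g2_and_df(table)
    return norm.ppf(chi2.cdf(g2, dof))                # u such that F_chi2(G2) = Phi(u)

C = [[8, 2], [5, 5], [1, 9]]
print(gf_mdlp(C), gf_aic(C), gf_bic(C), u_of_g2(C))
```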
The next theorem establishes the equivalence between the goodness functions GF_{G^2} and GF'_{G^2}.

THEOREM 6. The goodness function GF'_{G^2} = u(G^2) is equivalent to the goodness function GF_{G^2} = F_{\chi^2_{df}}(G^2).

Proof: Assume we have two contingency tables C_1 and C_2 with degrees of freedom df_1 and df_2, respectively. Their respective G^2 statistics are denoted G^2_1 and G^2_2. Clearly, we have

F_{\chi^2_{df_1}}(G^2_1) \le F_{\chi^2_{df_2}}(G^2_2) \iff \Phi(u(G^2_1)) \le \Phi(u(G^2_2)) \iff u(G^2_1) \le u(G^2_2)

This basically establishes the equivalence of these two goodness functions. □

The newly introduced goodness function GF'_{G^2} is rather complicated, and it is hard to find a closed-form expression for it. In the following, we use a theorem from Wallace [35, 36] to derive an asymptotically accurate closed-form expression for a simpler goodness function, GF''_{G^2} = u^2(G^2).
On the lower end, which is represented by AIC, the penalty is on the order of the degrees of freedom, O(df). On the higher end, which is represented by BIC, the penalty is O(df \log N).
Penalty of GF''_{G^2} (Formula 15): The penalty of our new goodness function GF''_{G^2} = u^2(G^2) is between O(df) and O(df \log N). The lower bound is achieved provided that G^2 is strictly higher than df (G^2 > df). Lemma 2 gives the upper bound.

LEMMA 2. G^2 is bounded by 2N \log J (G^2 \le 2N \times \log J).
Proof:

G^2 = 2N \times \big( H(S_1 \cup \cdots \cup S_I) - H(S_1, \cdots, S_I) \big)
\le 2N \times \big( -J \times (1/J \times \log(1/J)) - 0 \big)
\le 2N \times \log J
□

In the following, we consider two cases for the penalty of GF''_{G^2} = u^2(G^2). Note that these two cases correspond to the lower and upper bounds of the penalty.
4. PARAMETRIZED GOODNESS FUNCTION
The goodness functions discussed so far are based on either entropy or the \chi^2 or G^2 statistics. In this section we introduce a new goodness function based on the Gini index [4]. The Gini-based goodness function is strikingly different from the goodness functions introduced so far. In this section we show that the newly introduced goodness function GF_{gini}, along with the goodness functions discussed in Section 2, can all be derived from a generalized notion of entropy [29].

4.1 Gini Based Goodness Function
Let S_i be a row in contingency table C. The Gini index of row S_i is defined as follows [4]:

Gini(S_i) = \sum_{j=1}^{J} \frac{c_{ij}}{N_i} \left[ 1 - \frac{c_{ij}}{N_i} \right]

and Cost_{Gini}(C) = \sum_{i=1}^{I'} N_i \times Gini(S_i).
Let C_{I \times J} be a contingency table. We define the generalized entropy for C as follows:

H_\beta(S_1, \cdots, S_I) = \sum_{i=1}^{I} \frac{N_i}{N} H_\beta(S_i) = \sum_{i=1}^{I} \frac{N_i}{N} \times \sum_{j=1}^{J} \frac{c_{ij}}{N_i} \left[ 1 - \left( \frac{c_{ij}}{N_i} \right)^{\beta} \right] / \beta

Similarly, we have

H_\beta(S_1 \cup \cdots \cup S_I) = \sum_{j=1}^{J} \frac{M_j}{N} \left[ 1 - \left( \frac{M_j}{N} \right)^{\beta} \right] / \beta

LEMMA 3. H_\beta[p_1, \cdots, p_J] = \sum_{j=1}^{J} p_j (1 - p_j^{\beta}) / \beta is concave when \beta > 0.
Proof:

\frac{\partial H_\beta}{\partial p_j} = \big( 1 - (1+\beta) p_j^{\beta} \big) / \beta
\frac{\partial^2 H_\beta}{\partial p_j^2} = -(1+\beta) p_j^{\beta-1} < 0
\frac{\partial^2 H_\beta}{\partial p_i \partial p_j} = 0 \quad (i \ne j)

Thus, the Hessian \nabla^2 H_\beta[p_1, \cdots, p_J] is a diagonal matrix whose diagonal entries are the negative second derivatives above. Clearly, \nabla^2 H_\beta[p_1, \cdots, p_J] is negative definite. Therefore, H_\beta[p_1, \cdots, p_J] is concave. □
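A small Python sketch of the Gini index and the generalized entropy H_\beta (helper names and the toy table are ours): setting \beta = 1 reproduces the Gini index, and letting \beta \to 0 approaches Shannon entropy in nats, which is how the generalized entropy unifies the earlier goodness functions.

```python
import numpy as np

def gini(counts):
    """Gini index of one interval: sum_j p_j (1 - p_j)."""
    p = np.asarray(counts, float)
    p = p / p.sum()
    return float((p * (1.0 - p)).sum())

def h_beta(counts, beta):
    """Generalized entropy of one interval: sum_j p_j (1 - p_j^beta) / beta."""
    p = np.asarray(counts, float)
    p = p[p > 0] / p.sum()
    return float((p * (1.0 - p ** beta)).sum() / beta)

def h_beta_table(table, beta):
    """H_beta(S1,...,SI): per-row values weighted by Ni/N."""
    t = np.asarray(table, float)
    n = t.sum()
    return float(sum(row.sum() / n * h_beta(row, beta) for row in t))

C = [[8, 2], [5, 5], [1, 9]]
print(gini(C[0]), h_beta(C[0], 1.0))   # beta = 1 gives the Gini index
print(h_beta(C[0], 1e-6))              # beta -> 0 approaches Shannon entropy (nats)
```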
THEOREM 9. There is always information loss for merged intervals: H_\beta(S_1, S_2) \le H_\beta(S_1 \cup S_2).
Proof: This is a direct application of the concavity of the generalized entropy. □

4.3 Parameterized Goodness Function
Based on the discussion in Section 3, we derived that the different goodness functions can basically be decomposed into two parts. The first part is G^2, which corresponds to the information theoretical difference between the contingency table under consideration and the marginal distribution along the classes. The second part is the penalty, which counts the difference in model complexity between the contingency table under consideration and the one-row contingency table. The different goodness functions essentially have different penalties, ranging from O(df) to O(df \log N).
In the following, we propose a parameterized goodness function which treats all the aforementioned goodness functions in a uniform way.
DEFINITION 5. Given two parameters \alpha and \beta, where 0 < \beta \le 1 and 0 < \alpha, the parameterized goodness function for contingency table C is

GF_{\alpha,\beta}(C) = N \times H_\beta(S_1 \cup \cdots \cup S_{I'}) - \sum_{i=1}^{I'} N_i \times H_\beta(S_i) - \alpha \times (I'-1)(J-1)\left[ 1 - \left( \frac{1}{N} \right)^{\beta} \right] / \beta    (17)
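A direct Python transcription of Formula (17), under the same conventions as the earlier sketches (the function name gf_alpha_beta and the parameter choice in the example are ours; the example uses the \alpha, \beta setting that recovers the Gini case discussed below):

```python
import numpy as np

def h_beta(counts, beta):
    p = np.asarray(counts, float)
    p = p[p > 0] / p.sum()
    return float((p * (1.0 - p ** beta)).sum() / beta)

def gf_alpha_beta(table, alpha, beta):
    """Parameterized goodness function of Definition 5 (Formula 17)."""
    t = np.asarray(table, float)
    n = t.sum()
    i_rows, j_cols = t.shape
    merged = n * h_beta(t.sum(axis=0), beta)                   # N * H_beta(S1 U ... U SI')
    rows = sum(row.sum() * h_beta(row, beta) for row in t)     # sum_i Ni * H_beta(Si)
    penalty = alpha * (i_rows - 1) * (j_cols - 1) * (1 - (1 / n) ** beta) / beta
    return merged - rows - penalty

C = [[8, 2], [5, 5], [1, 9]]
n, j = np.asarray(C, float).sum(), 2
print(gf_alpha_beta(C, alpha=2 * (n - 1) / (n * (j - 1)), beta=1.0))
```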
The following theorem states the basic properties of the parameterized goodness function.

THEOREM 10. The parameterized goodness function GF_{\alpha,\beta}, with \alpha > 0 and 0 < \beta \le 1, satisfies all four principles, P1, P2, P3, and P4.
Proof: In Appendix. □

By adjusting the parameter values, we show how the goodness functions defined in Section 2 can be obtained from the parameterized goodness function. We consider several cases:

1. Let \beta = 1 and \alpha = 2(N-1)/(N(J-1)). Then GF_{2(N-1)/(N(J-1)),\,1} = GF_{gini}.
2. Let \alpha = 1/\log N and \beta \to 0. Then GF_{1/\log N,\,\beta \to 0} = GF_{AIC}.
3. Let \alpha = 1/2 and \beta \to 0. Then GF_{1/2,\,\beta \to 0} = GF_{BIC}.
4. Let \alpha = const and \beta \to 0, with N \gg I. Then GF_{const,\,\beta \to 0} = G^2 - O(df \log N) = GF_{MDLP}.
5. Let \alpha = const, \beta \to 0, and G^2 = O(N \log J), N/(IJ) \to \infty. Then GF_{const,\,\beta \to 0} = G^2 - O(df \log N) = GF''_{G^2} \approx GF''_{X^2}.

The parameterized goodness function not only allows us to represent the existing goodness functions in a closed, uniform form but, more importantly, provides a new way to understand and handle discretization. First, the parameterized approach provides a flexible framework to access a large collection (potentially infinite) of goodness functions. Suppose we have a two-dimensional space where \alpha is represented on the X-axis and \beta on the Y-axis. Then each point in this space with \alpha > 0 and 0 < \beta \le 1 corresponds to a potential goodness function, and the existing goodness functions correspond to certain points specified by the aforementioned parameter choices. Note that this treatment is in the same spirit as regularization theory developed in the statistical machine learning field [17, 34]. Secondly, finding the best discretization of a given dataset for different data mining tasks is transformed into a parameter selection problem. Ultimately, we would like to identify the parameter selection which optimizes the targeted data mining task. For instance, suppose we are discretizing a given dataset for a Naive Bayes classifier. Clearly, a typical goal of the discretization is to build a Bayesian classifier with minimal classification error. As described in regularization theory [17, 34], methods based on cross-validation can be applied here. However, it is an open problem how we may automatically select the parameters without running the targeted data mining task. In other words, can we analytically determine the best discretization for different data mining tasks on a given dataset? This problem is beyond the scope of this paper, and we plan to investigate it in future work. Finally, the unification of goodness functions allows us to develop efficient algorithms to discretize continuous attributes with respect to different parameters in a uniform way. This is the topic of the next subsection.

4.4 Dynamic Programming for Discretization
This section presents a dynamic programming approach to find the discretization that maximizes the parameterized goodness function. Note that dynamic programming has been used in discretization before [14]. However, the existing approaches do not have a global goodness function to optimize, and almost all of them require knowledge of the targeted number of intervals; in other words, the user has to define the number of intervals for the discretization. Thus, the existing approaches cannot be directly applied to discretization that maximizes the parameterized goodness function.
In the following, we introduce our dynamic programming approach for discretization. To facilitate our discussion, we write GF for GF_{\alpha,\beta}, and we simplify the GF formula as follows. Since, for a given table C, N \times H_\beta(S_1 \cup \cdots \cup S_I) (the first term in GF, Formula 17) is fixed, we define

F(C) = N \times H_\beta(S_1 \cup \cdots \cup S_I) - GF(C) = \sum_{i=1}^{I'} N_i \times H_\beta(S_i) + \alpha \times (I'-1)(J-1)\left[ 1 - \left( \frac{1}{N} \right)^{\beta} \right] / \beta

Clearly, minimizing the new function F is equivalent to maximizing GF. In the following, we focus on finding the best discretization to minimize F. First, we define a sub-contingency table of C as C[i : i+k] = \{S_i, \cdots, S_{i+k}\}, and let C^0[i : i+k] = S_i \cup \cdots \cup S_{i+k} be the merged row (column sums) of the sub-contingency table C[i : i+k]. Thus, the new function F of the single row C^0[i : i+k] is:

F(C^0[i : i+k]) = \left( \sum_{r=i}^{i+k} N_r \right) \times H_\beta(S_i \cup \cdots \cup S_{i+k})

Let C be the input contingency table for discretization. Let Opt(i, i+k) be the minimum of the F function over the partial contingency table from row i to row i+k, k > 1. The optimum, which corresponds to the best discretization, can be calculated recursively as follows:

Opt(i, i+k) = \min\Big( F(C^0[i : i+k]),\ \min_{1 \le l \le k-1}\big( Opt(i, i+l) + Opt(i+l+1, i+k) + \alpha \times (J-1)[1 - (1/N)^{\beta}]/\beta \big) \Big)

where k > 0 and Opt(i, i) = F(C^0[i : i]). Given this, we can apply dynamic programming to find the discretization that minimizes F, as described in Algorithm 1. The complexity of the algorithm is O(I^3), where I is the number of intervals of the input contingency table C.
Algorithm 1 Discretization(Contingency Table C_{I×J})
  for i = 1 to I do
    for j = i downto 1 do
      Opt(j, i) = F(C^0[j : i])
      for k = j to i − 1 do
        Opt(j, i) = min(Opt(j, i), Opt(j, k) + Opt(k + 1, i) + α(J − 1)[1 − (1/N)^β]/β)
      end for
    end for
  end for
  return Opt(1, I)
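For readers who prefer runnable code, here is a hedged Python sketch of the same dynamic program. It follows the O(I^3) structure of Algorithm 1; the memo layout, helper names, parameter values, and the toy table are ours, not the authors'.

```python
import numpy as np

def h_beta(counts, beta):
    p = np.asarray(counts, float)
    p = p[p > 0] / p.sum()
    return float((p * (1.0 - p ** beta)).sum() / beta)

def discretize(table, alpha, beta):
    """Dynamic program in the spirit of Algorithm 1: minimize F over row partitions."""
    t = np.asarray(table, float)
    I, J = t.shape
    N = t.sum()
    merge_cost = alpha * (J - 1) * (1.0 - (1.0 / N) ** beta) / beta

    def f_merged(j, i):              # F(C^0[j:i]) for merged rows j..i (inclusive)
        block = t[j:i + 1].sum(axis=0)
        return block.sum() * h_beta(block, beta)

    opt = {}                         # opt[(j, i)] = (best F value, best split or None)
    for i in range(I):
        for j in range(i, -1, -1):
            best, split = f_merged(j, i), None
            for k in range(j, i):
                cand = opt[(j, k)][0] + opt[(k + 1, i)][0] + merge_cost
                if cand < best:
                    best, split = cand, k
            opt[(j, i)] = (best, split)

    def cuts(j, i):                  # recover the chosen cut points
        split = opt[(j, i)][1]
        return [] if split is None else cuts(j, split) + [split] + cuts(split + 1, i)

    return opt[(0, I - 1)][0], cuts(0, I - 1)

C = [[8, 2], [7, 3], [1, 9], [2, 8]]
print(discretize(C, alpha=0.5, beta=1e-6))   # a BIC-like setting (alpha=1/2, beta -> 0)
```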
5. CONCLUSIONS
In this paper we introduced a generalized goodness function to evaluate the quality of a discretization method. We have shown that seemingly disparate goodness functions based on entropy, AIC, BIC, Pearson's X^2 and Wilks' G^2 statistics, as well as the Gini index, are all derivable from our generalized goodness function. Furthermore, the choice of different parameters for the generalized goodness function explains why there is such a wide variety of discretization methods. Indeed, the difficulties in comparing different discretization methods are widely known. Our results provide a theoretical foundation to approach these difficulties and offer a rationale as to why evaluating different discretization methods for an arbitrary contingency table is difficult. Our generalized goodness function gives an affirmative answer to the question: is there an objective function to evaluate different discretization methods? Another contribution of this paper is a dynamic programming algorithm that provides an optimal discretization with respect to the generalized goodness function.
There are, however, several questions that remain open. First of all, even if an objective goodness function exists, different parameter choices will result in different discretizations. Therefore, the question is: for a particular set of applications, what are the best parameters for the discretization? Further, can we classify user applications into different categories and identify the optimal parameters for each category? For example, consider medical applications: what is the best discretization function for them? Clearly, for these applications misclassification can be very costly, but the number of intervals generated by the discretization may not be that important. Pursuing these questions, we plan to conduct experimental studies to compare different goodness functions and to evaluate the effect of parameter selection for the generalized goodness function on discretization.
6. REFERENCES
[1] A. Agresti. Categorical Data Analysis. Wiley, New York, 1990.
[2] H. Akaike. Information Theory and an Extension of the Maximum Likelihood Principle. In Second International Symposium on Information Theory, 267-281, Armenia, 1973.
[3] P. Auer, R. Holte, W. Maass. Theory and Applications of Agnostic PAC-Learning with Small Decision Trees. In Machine Learning: Proceedings of the Twelfth International Conference, Morgan Kaufmann, 1995.
[4] L. Breiman, J. Friedman, R. Olshen, C. Stone. Classification and Regression Trees. CRC Press, 1998.
[5] M. Boulle. Khiops: A Statistical Discretization Method of Continuous Attributes. Machine Learning, 55, 53-69, 2004.
[6] M. Boulle. MODL: A Bayes Optimal Discretization Method for Continuous Attributes. Machine Learning, 65, 1 (Oct. 2006), 131-165.
[7] George Casella and Roger L. Berger. Statistical Inference (2nd Edition). Duxbury Press, 2001.
[8] J. Catlett. On Changing Continuous Attributes into Ordered Discrete Attributes. In Proceedings of the European Working Session on Learning, pp. 164-178, 1991.
[9] J. Y. Ching, A.K.C. Wong, K.C.C. Chan. Class-Dependent Discretization for Inductive Learning from Continuous and Mixed-Mode Data. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 17, No. 7, 641-651, 1995.
[10] M.R. Chmielewski, J.W. Grzymala-Busse. Global Discretization of Continuous Attributes as Preprocessing for Machine Learning. International Journal of Approximate Reasoning, 15, 1996.
[11] Y.S. Choi, B.R. Moon, S.Y. Seo. Genetic Fuzzy Discretization with Adaptive Intervals for Classification Problems. Proceedings of the 2005 Conference on Genetic and Evolutionary Computation, pp. 2037-2043, 2005.
[12] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory, Second Edition. John Wiley & Sons, Inc., 2006.
[13] J. Dougherty, R. Kohavi, M. Sahami. Supervised and Unsupervised Discretization of Continuous Attributes. Proceedings of the 12th International Conference on Machine Learning, pp. 194-202, 1995.
[14] Tapio Elomaa and Juho Rousu. Efficient Multisplitting Revisited: Optima-Preserving Elimination of Partition Candidates. Data Mining and Knowledge Discovery, 8, 97-126, 2004.
[15] U.M. Fayyad and K.B. Irani. Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning. In Proceedings of the 13th Joint Conference on Artificial Intelligence, 1022-1029, 1993.
[16] David Hand, Heikki Mannila, Padhraic Smyth. Principles of Data Mining. MIT Press, 2001.
[17] Federico Girosi, Michael Jones, and Tomaso Poggio. Regularization Theory and Neural Networks Architectures. Neural Computation, Vol. 7, Issue 2 (March 1995), pp. 219-269.
[18] M.H. Hansen, B. Yu. Model Selection and the Principle of Minimum Description Length. Journal of the American Statistical Association, 96, p. 454, 2001.
[19] R.C. Holte. Very Simple Classification Rules Perform Well on Most Commonly Used Datasets. Machine Learning, 11, pp. 63-90, 1993.
[20] D. Janssens, T. Brijs, K. Vanhoof, and G. Wets. Evaluating the Performance of Cost-Based Discretization versus Entropy- and Error-Based Discretization. Computers & Operations Research, 33, 11 (Nov. 2006), 3107-3123.
[21] N. Johnson, S. Kotz, N. Balakrishnan. Continuous Univariate Distributions, Second Edition. John Wiley & Sons, Inc., 1994.
[22] Ruoming Jin and Yuri Breitbart. Data Discretization Unification. Technical Report (http://www.cs.kent.edu/research/techrpts.html), Department of Computer Science, Kent State University, 2007.
[23] Randy Kerber. ChiMerge: Discretization of Numeric Attributes. National Conference on Artificial Intelligence, 1992.
[24] L.A. Kurgan, K.J. Cios. CAIM Discretization Algorithm. IEEE Transactions on Knowledge and Data Engineering, Vol. 16, No. 2, 145-153, 2004.
[25] R. Kohavi, M. Sahami. Error-Based and Entropy-Based Discretization of Continuous Features. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pp. 114-119, Menlo Park, CA, AAAI Press, 1996.
[26] Huan Liu, Farhad Hussain, Chew Lim Tan, Manoranjan Dash. Discretization: An Enabling Technique. Data Mining and Knowledge Discovery, 6, 393-423, 2002.
[27] H. Liu and R. Setiono. Chi2: Feature Selection and Discretization of Numeric Attributes. Proceedings of the 7th IEEE International Conference on Tools with Artificial Intelligence, 1995.
[28] X. Liu, H. Wang. A Discretization Algorithm Based on a Heterogeneity Criterion. IEEE Transactions on Knowledge and Data Engineering, Vol. 17, No. 9, 1166-1173, 2005.
[29] S. Mussard, F. Seyte, M. Terraza. Decomposition of Gini and the Generalized Entropy Inequality Measures. Economics Bulletin, Vol. 4, No. 7, 1-6, 2003.
[30] B. Pfahringer. Supervised and Unsupervised Discretization of Continuous Features. Proceedings of the 12th International Conference on Machine Learning, pp. 456-463, 1995.
[31] J. Rissanen. Modeling by Shortest Data Description. Automatica, 14, pp. 465-471, 1978.
[32] D.A. Simovici and S. Jaroszewicz. An Axiomatization of Partition Entropy. IEEE Transactions on Information Theory, Vol. 48, Issue 7, 2138-2142, 2002.
[33] Robert A. Stine. Model Selection Using Information Theory and the MDL Principle. Sociological Methods & Research, Vol. 33, No. 2, 230-260, 2004.
[34] Trevor Hastie, Robert Tibshirani and Jerome Friedman. The Elements of Statistical Learning. Springer-Verlag, 2001.
[35] David L. Wallace. Bounds on Normal Approximations to Student's and the Chi-Square Distributions. The Annals of Mathematical Statistics, Vol. 30, No. 4, pp. 1121-1130, 1959.
[36] David L. Wallace. Correction to "Bounds on Normal Approximations to Student's and the Chi-Square Distributions". The Annals of Mathematical Statistics, Vol. 31, No. 3, p. 810, 1960.
[37] A.K.C. Wong, D.K.Y. Chiu. Synthesizing Statistical Knowledge from Incomplete Mixed-Mode Data. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 9, No. 6, pp. 796-805, 1987.
[38] Ying Yang and Geoffrey I. Webb. Weighted Proportional k-Interval Discretization for Naive-Bayes Classifiers. In Advances in Knowledge Discovery and Data Mining: 7th Pacific-Asia Conference, PAKDD, pp. 501-512, 2003.
Appendix

Derivation of the Goodness Function based on MDLP
For an interval S_1, the best way to transfer the labeling information of each point in the interval is bounded by a fundamental theorem of information theory, stating that the average length of the shortest message is at least N_1 \times H(S_1). Though we could apply Huffman coding to get the optimal code for each interval, we are not interested in the absolute minimal coding. Therefore, we apply the above formula as the cost of transferring each interval. Given this, we can easily derive the total cost of transferring all I' intervals as follows:

cost_1(data|model) = N \times H(S_1, \cdots, S_{I'}) = N_1 \times H(S_1) + N_2 \times H(S_2) + \cdots + N_{I'} \times H(S_{I'})

In the meantime, we have to transfer the model itself, which includes all the intervals and the code book for transferring the point labels of each interval. The length of the message transferring the model serves as the penalty function of the model. Transferring all the intervals requires a \log_2 \binom{N+I'-1}{I'-1}-bit message. This cost, denoted L_1(I', N), can be approximated as

L_1(I', N) = \log_2 \binom{N+I'-1}{I'-1}
\approx (N+I'-1)\, H\!\left( \frac{N}{N+I'-1}, \frac{I'-1}{N+I'-1} \right)
= -\left( N \log_2 \frac{N}{N+I'-1} + (I'-1)\log_2 \frac{I'-1}{N+I'-1} \right)
\approx (I'-1)\log_2 \frac{N+I'-1}{I'-1} \quad \left( \log_2 \frac{N}{N+I'-1} \to 0 \text{ as } N \to \infty \right)
\approx (I'-1)\log_2 \frac{N}{I'-1}

Next, we have to consider the transfer of the code book for each interval. For a given interval S_i, each code corresponds to a class, which can be coded in \log_2 J bits. We need to transfer such codes at most J-1 times for each interval, since after knowing J-1 classes the remaining class can be inferred. Therefore, the total cost of the code book, denoted L_2, can be written as

L_2 = I' \times (J-1) \times \log_2 J

Given this, the penalty of the discretization from the information theoretical viewpoint is

penalty_1(model) = L_1(I', N) + L_2 = (I'-1)\log_2 \frac{N}{I'-1} + I' \times (J-1) \times \log_2 J

Put together, the cost of the discretization based on MDLP is

Cost_{MDLP} = \sum_{i=1}^{I'} N_i H(S_i) + (I'-1)\log_2 \frac{N}{I'-1} + I'(J-1)\log_2 J

Proof of Theorem 1
Proof: We first focus on proving the result for GF_{MDLP}. The proofs for GF_{AIC} and GF_{BIC} can be derived similarly.
Merging Principle (P1) for GF_{MDLP}: Assume we have two consecutive rows i and i+1 in the contingency table C, S_i = <c_{i1}, \cdots, c_{iJ}> and S_{i+1} = <c_{i+1,1}, \cdots, c_{i+1,J}>, where c_{ij} = c_{i+1,j} for all j, 1 \le j \le J. Let C' be the resulting contingency table after we merge these two rows. Then we have

\sum_{k=1}^{I} N_k \times H(S_k) = \sum_{k=1}^{i-1} N_k \times H(S_k) + N_i \times H(S_i) + N_{i+1} \times H(S_{i+1}) + \sum_{k=i+2}^{I} N_k \times H(S_k)
= \sum_{k=1}^{i-1} N_k \times H(S_k) + (N_i + N_{i+1}) \times H(S_i) + \sum_{k=i+2}^{I} N_k \times H(S_k)
= \sum_{k=1}^{I-1} N'_k\, H(S'_k)

In addition, we have

(I-1)\log_2 \frac{N}{I-1} + (I-1) \times J \times \log_2 J - \left( (I-2)\log_2 \frac{N}{I-2} + (I-2) \times J \times \log_2 J \right)
= (I-1)\log_2 N - (I-1)\log_2(I-1) + (I-1) \times J \times \log_2 J - \left( (I-2)\log_2 N - (I-2)\log_2(I-2) + (I-2) \times J \times \log_2 J \right)
> \log_2 N - \log_2(I-1) + J \times \log_2 J > 0 \quad (N \ge I)

Adding these together, we have Cost_{MDLP}(C) > Cost_{MDLP}(C'), and GF_{MDLP}(C) < GF_{MDLP}(C').
Symmetric Principle (P2) for GF_{MDLP}: This can be derived directly from the symmetry property of entropy.
MIN Principle (P3) for GF_{MDLP}: Since the number of rows (I), the number of samples (N), and the number of classes (J) are fixed, we only need to maximize N \times H(S_1, \cdots, S_I). We have

N \times H(S_1, \cdots, S_I) \le N \times H(S_1 \cup \cdots \cup S_I)

and, under the condition of P3 (every row has the marginal class distribution),

N \times H(S_1, \cdots, S_I) = \sum_{k=1}^{I} N_k \times H(S_k) = \sum_{k=1}^{I} N_k \times H(S_1 \cup \cdots \cup S_I) = N \times H(S_1 \cup \cdots \cup S_I)

MAX Principle (P4) for GF_{MDLP}: Since the number of rows (I), the number of samples (N), and the number of classes (J) are fixed, we only need to minimize N \times H(S_1, \cdots, S_I). We have

N \times H(S_1, \cdots, S_I) = \sum_{k=1}^{I} N_k \times H(S_k) \ge \sum_{k=1}^{I} N_k \times \log_2 1 \ge 0

Now, we prove the four properties for GF_{X^2}.
Merging Principle (P1) for GF_{X^2}: Assume we have two consecutive rows i and i+1 in the contingency table C, S_i = <c_{i1}, \cdots, c_{iJ}> and S_{i+1} = <c_{i+1,1}, \cdots, c_{i+1,J}>, where c_{ij} = c_{i+1,j} for all j, 1 \le j \le J. Let C' be the resulting contingency table after we merge these two rows. Then the confidence level F_{\chi^2_{(I-1)(J-1)}}(X^2) is maximized.

For GF_{\alpha,\beta}, the corresponding penalty terms L_{1\beta}(I', N) and L_{2\beta} (the generalized-entropy analogues of L_1 and L_2) can be approximated as

penalty_\beta(model) \approx (I'-1) \times \left[ 1 - \left( \frac{I'-1}{N+I'-1} \right)^{\beta} \right] / \beta + I'\left[ 1 - \left( \frac{1}{J!} \right)^{\beta} \right] / \beta
\approx (I'-1) \times \left[ 1 - \left( \frac{I'-1}{N} \right)^{\beta} \right] / \beta + I'\left[ 1 - \left( \frac{1}{J!} \right)^{\beta} \right] / \beta

Note that when \beta = 1, we have penalty_\beta(model) \approx 2I' - 1. When \beta \to 0, we have penalty_\beta(model) \approx (I'-1)\log(N/(I'-1)) + I'(J-1)\log J.
Put together, the cost of the discretization based on the generalized entropy is

Cost_\beta = \sum_{i=1}^{I'} N_i H_\beta(S_i) + L_{1\beta}(I', N) + L_{2\beta} = \sum_{i=1}^{I'} N_i H_\beta(S_i) + (I'-1) \times \left[ 1 - \left( \frac{I'-1}{N} \right)^{\beta} \right] / \beta + I'\left[ 1 - \left( \frac{1}{J!} \right)^{\beta} \right] / \beta

MIN Principle (P3) for GF_{\alpha,\beta}: Under the condition of P3,

\sum_{k=1}^{I} N_k \times H_\beta(S_k) = \sum_{k=1}^{I} N_k \times H_\beta(S_1 \cup \cdots \cup S_I) = N \times H_\beta(S_1 \cup \cdots \cup S_I)

MAX Principle (P4) for GF_{\alpha,\beta}: Since the number of rows (I), the number of samples (N), and the number of classes (J) are fixed, we only need to minimize N \times H_\beta(S_1, \cdots, S_I). We have

N \times H_\beta(S_1, \cdots, S_I) = \sum_{k=1}^{I} N_k \times H_\beta(S_k) \ge \sum_{k=1}^{I} N_k \times 0 \ge 0

Note that the proof for GF_{\alpha,\beta} immediately implies that the four principles hold for GF_{AIC} and GF_{BIC}.