Babes-Bolyai University, Cluj Napoca Faculty of Economics

Need a Home? Start the Data Mining!


A Data Mining Application in Weka on Real Estate Market

Authors: Tit-Liviu Leontin, Mircea Moca, Darie Moldovan, Manuela Rusu, Daniela Secar, Corina Trifu

Scientific project coordinators: Professor Ștefan Nichi, PhD; Teaching Assistant Gheorghe Silaghi

Cluj Napoca, May 20, 2004

Objectives
According to a recent survey published by the business magazine "Capital" in issue 18 (April 24, 2004), there are about 200 real estate agencies in Cluj Napoca, the highest number per capita in the country. Local industries have grown rapidly in recent years, and both foreign and national investors consider Cluj Napoca one of the fastest economically developing regions in Romania. Given the high demand for residential apartments and the soaring prices in this sector, the aim of this project is to make buying a flat easier. We do this by analysing real estate market data with data mining techniques. For each apartment we determine an estimated price and a rating on a scale from 1 to 5, which shows whether the apartment is worth considering.

Motivation
Because of the very high number of real estate agencies in Cluj Napoca, finding a flat can become a troublesome and wearying business. With so many options to choose from, it is difficult to visit every available apartment and decide whether it suits one's wishes. That is where our data mining application comes in handy: it provides a simple mechanism to rank the apartments, so that a buyer can quickly eliminate the low-ranked flats and focus on the promising ones. Another reason for this project is the extreme dynamism of the real estate market. The continuously growing demand drives unjustified price increases, which makes periodic re-evaluation of apartments necessary; this kind of data mining analysis makes such re-evaluation easier.

Tools and Methods


For this data mining project we used Weka, a collection of machine learning algorithms for data mining tasks, written in Java by the Computer Science department of the University of Waikato, New Zealand. Weka can be run both in command-line mode and through a graphical user interface. From the algorithms included in Weka we selected the two that best match our application: linear regression and J48.

Data Set
The data was provided by the weekly newspaper "Piata de la A la Z" ("A to Z Market"). It contained information on all apartment sale advertisements published in January 2001 and was received in Excel spreadsheet format. To analyse this database with Weka and reach our stated objectives, we had to make a series of changes to the initial format of the data. The database stored each record in a single memo field, so the first step was to extract the relevant data from that field (floor number, balcony, TV cable, telephone, central heating unit, garage etc.) using Excel string-search formulas. Based on these keywords, we then created a new spreadsheet containing all 18 final attributes of the database used for training in Weka. Fields that had no value for an instance were marked with a question mark.
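The Excel formulas themselves are not reproduced in the text. As a rough illustration only, the same keyword search can be sketched in Python. The keyword patterns, the function name and the sample advertisement are hypothetical and are not taken from the original data; the only convention borrowed from the project is the question mark for a field without a value.

```python
import re

# Hypothetical keyword patterns (assumption): the original Excel
# string-search formulas are not given in the text.
KEYWORDS = {
    "balcon": r"\bbalcon",
    "telefon": r"\btelefon",
    "cablu": r"\bcablu",
    "centrala": r"\bcentrala",
    "garaj": r"\bgaraj",
}


def extract_attributes(memo: str) -> dict:
    """Map one free-text advertisement to attribute values.

    An attribute takes its own name as value when its keyword occurs in
    the memo text, and '?' (missing value) otherwise.
    """
    text = memo.lower()
    row = {}
    for name, pattern in KEYWORDS.items():
        row[name] = name if re.search(pattern, text) else "?"
    return row


# Hypothetical advertisement text, for illustration only.
sample = "Vand apartament 2 camere, balcon, telefon, cablu TV"
row = extract_attributes(sample)
```

For the sample text, `row` holds a value for balcon, telefon and cablu, and a question mark for centrala and garaj.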

A few other adjustments had to be made before starting the tests. The algorithms we wanted to use in Weka establish a relationship between input and output attributes. Every instance in the spreadsheet therefore needed at least two attributes filled in, one for input and one for output, because no relationship can be established from a single attribute. As a result, we deleted from the database all instances that had only one attribute filled in.

Another issue was the currency. The advertised prices were expressed in four different currencies, so we converted all prices into euros using the official exchange rates of January 7, 2001.

Finally, Weka needs the database in "ARFF" format, so we converted the Excel spreadsheet into "CSV" (Comma Separated Values) format and then added the required attribute descriptor header to create the "ARFF" file. As shown in the table in the appendix, the final database is composed of 18 attributes that describe the apartments and contains 1981 instances. The table also shows the number of instances for each value of every discrete attribute.

One more point about the database needs to be clarified. To obtain the best results with the J48 classifier, we adjusted the database: only the 8 relevant attributes were kept, and a new attribute expressing an overall rating of each apartment's characteristics was added. The rating was calculated with an expert function applied to all 18 attributes, giving a score for each apartment based on the average buyer's preferences. The rating is a number between 1 and 5, where 1 is the lowest rating and represents a bad buying decision, and 5 is the highest rating and represents an excellent apartment.
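The CSV-to-ARFF step can be sketched as follows. This is a minimal illustration, not the procedure the authors actually followed (they describe adding the header by hand); the function name, the three-column sample and the value lists passed in are assumptions made for the example.

```python
import csv
import io


def csv_to_arff(csv_text, relation, nominal_values):
    """Build ARFF text from CSV text whose first row is the header.

    nominal_values maps an attribute name to its list of allowed values;
    attributes that are not listed are declared numeric (assumption).
    Missing values are expected to be '?' already.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]

    lines = [f"@relation {relation}", ""]
    for name in header:
        if name in nominal_values:
            values = ",".join(nominal_values[name])
            lines.append(f"@attribute {name} {{{values}}}")
        else:
            lines.append(f"@attribute {name} numeric")

    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


# Tiny hypothetical sample with three of the 18 attributes.
sample_csv = "camere,garaj,pret\n3,?,24000\n2,garaj,15000\n"
arff = csv_to_arff(
    sample_csv,
    relation="pret_apartament",
    nominal_values={
        "camere": ["1", "2", "3", "4", "5"],
        "garaj": ["garaj", "parcare"],
    },
)
```

The header lines (`@relation`, `@attribute`, `@data`) are the attribute descriptor that Weka expects in front of the comma-separated rows.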

Linear Regression
The relationship between an apartment's attributes and its price can intuitively be assumed to be linear. This means that modifying one of the attributes results in a proportional change in the price. The corresponding algorithm for a linear relationship between input attributes and an output attribute is the linear regression model, which belongs to Weka's function-based classifiers. Linear regression fits a numeric coefficient for each attribute on the training data set, and the quality of the fit is reported through the correlation coefficient and several error measures. The linear regression algorithm implemented in Weka is not restricted to numeric attributes. It is therefore well suited to our training database, in which 17 attributes (characteristics) determine the 18th, numeric attribute (the price).

The most important parameter for linear regression is the attribute selection method, which has three possible options. At the extremes are "None", which gives fast but less selective results, and "Greedy", which is considerably slower but more accurate. Between them is the "M5" option, a compromise between speed and correctness. One restriction must be taken into consideration, though: the Greedy method determines the most precise formula, but it requires the relationship between input and output attributes to be as close to linear as possible in order to determine local maximum values properly. If the relationship is not linear, the Greedy method will give an erroneous formula.

When we applied the linear regression classifier to the complete database of 1981 records, the Greedy and M5 options returned exactly the same formula. The results of the machine learning algorithm are shown in the appendix.

The correlation coefficient is 0.702, which means that the relationship between apartment characteristics and prices is reasonably close to linear. A correlation coefficient of 1 indicates a perfect linear relationship, while a correlation coefficient of 0 indicates no linear relationship between input and output attributes. The algorithm considered only 6 attributes relevant to the linear relationship. The formula in the appendix gives a simple method of estimating the price of an apartment from its characteristics. For example, for an apartment with the following attributes:

camere = 3; decomandat = decomandat; confort = sporit; finisare = finisat; garaj = none; cartierul = Zorilor

the estimated price is obtained by simply adding the corresponding coefficients:

pret = 4537.7439 + 3851.0755 + 2015.4919 + 5328.5996 + 4467.4402 + 1201.8261 + 2520.7528 - 595.0813 + 1265.0314 = 24,592.8801
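The worked example can be checked with a short Python sketch. It assumes that each coefficient in the appendix output pairs with the indicator term listed in the same position, and that the last number (1265.0314) is the intercept; this pairing is a reading of the flattened Weka output, chosen because it agrees with the sum above. The names `TERMS`, `INTERCEPT` and `estimate_price` are illustrative.

```python
# Intercept and indicator terms as read from the appendix output
# (assumption: coefficients and terms pair up in listed order).
INTERCEPT = 1265.0314

# (coefficient, attribute, values for which the indicator equals 1)
TERMS = [
    (4537.7439, "camere", {2, 3, 4, 5}),
    (3851.0755, "camere", {3, 4, 5}),
    (2897.5552, "camere", {4, 5}),
    (28548.9403, "camere", {5}),
    (2015.4919, "decomandat", {"decomandat"}),
    (5328.5996, "confort", {"unu", "sporit"}),
    (4467.4402, "confort", {"sporit"}),
    (1201.8261, "finisare", {"finisat", "superfinisat"}),
    (3754.5847, "finisare", {"superfinisat"}),
    (1682.1089, "garaj", {"garaj"}),
    (2520.7528, "cartierul", {"Gheorgheni", "Grigorescu", "Zorilor", "Centru"}),
    (-595.0813, "cartierul", {"Grigorescu", "Zorilor", "Centru"}),
    (6855.378, "cartierul", {"Centru"}),
]


def estimate_price(apartment: dict) -> float:
    """Add every coefficient whose indicator term matches the apartment."""
    price = INTERCEPT
    for coefficient, attribute, values in TERMS:
        if apartment.get(attribute) in values:
            price += coefficient
    return price


example = {
    "camere": 3,
    "decomandat": "decomandat",
    "confort": "sporit",
    "finisare": "finisat",
    "garaj": None,  # no garage: none of the garaj terms apply
    "cartierul": "Zorilor",
}
estimated = estimate_price(example)
```

Rounded to four decimals, `estimated` matches the 24,592.8801 euros computed by hand. A missing or unlisted attribute value simply contributes nothing to the sum.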

J48
J48 is Weka's implementation of J. Ross Quinlan's C4.5 algorithm, which builds decision trees by top-down induction. Each internal node of the tree tests one attribute. To classify an instance, the algorithm starts at the root, evaluates the attribute of the current node and, depending on its value, follows one of the node's branches. When there are no more nodes left to evaluate, the instance has reached a leaf and is assigned the class of that leaf. If a subtree turns out not to separate the classes significantly better than a single leaf would, it is replaced by that leaf. This process is called "pruning".

Since J48 performs classification with a discrete output variable, we had to rate each apartment, taking into consideration every attribute except the number of rooms, which should not influence the rating. The result of the evaluation is a number between 1 and 5, where 1 is the lowest rating and represents a bad buying decision, and 5 is the highest rating and represents an excellent apartment.

Attribute evaluation was necessary in order to obtain a high-quality classification. For this we chose Weka's "ChiSquaredAttributeEval" attribute evaluator together with the "Ranker" search method. "ChiSquaredAttributeEval" measures the strength of the association between each attribute and the class using the chi-squared test, and "Ranker" sorts the attributes by that score. The evaluation is done based on the apartment's type, that is, the number of rooms. The results are shown in the appendix. The attributes ranked 9 to 18 received a score of 0, so we removed them as insignificant in this case. Leaving them out of the classification algorithm speeds up model creation and improves accuracy.
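The idea behind chi-squared attribute ranking can be illustrated with a small Python sketch. It computes the Pearson chi-squared statistic of one nominal attribute against the class and sorts attributes by it. This is not Weka's code, the toy columns are invented for the example, and numeric attributes such as pret would first have to be discretised, which the sketch does not attempt.

```python
from collections import Counter


def chi_squared(attribute_values, class_values):
    """Pearson chi-squared statistic of a nominal attribute vs. the class."""
    n = len(attribute_values)
    observed = Counter(zip(attribute_values, class_values))
    attribute_totals = Counter(attribute_values)
    class_totals = Counter(class_values)

    statistic = 0.0
    for a, a_count in attribute_totals.items():
        for c, c_count in class_totals.items():
            # Count expected in cell (a, c) if attribute and class were independent.
            expected = a_count * c_count / n
            statistic += (observed[(a, c)] - expected) ** 2 / expected
    return statistic


def rank_attributes(columns, class_values):
    """Return (score, name) pairs sorted from most to least informative."""
    scores = [(chi_squared(values, class_values), name)
              for name, values in columns.items()]
    return sorted(scores, reverse=True)


# Toy data (hypothetical): 'a' determines the class, 'b' is unrelated to it.
toy_class = ["1", "1", "2", "2"]
toy_columns = {
    "a": ["x", "x", "y", "y"],
    "b": ["x", "y", "x", "y"],
}
ranking = rank_attributes(toy_columns, toy_class)
```

An attribute whose values are spread independently of the class scores 0, which is the situation reported for the ten attributes that were removed.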
We applied J48 to the entire data set (the training set); testing takes place on separate data, the test set. The score is the class attribute used for classification. The result is a decision tree whose leaves give the score assigned to each category of apartments. See the appendix for detailed results.

The accuracy on the training set is 87.1%, which means that 1726 out of 1981 instances were correctly classified. The root of our decision tree tests the number of rooms, which is the main criterion for differentiating the apartments. Other important attributes are price, level of finishing, comfort level and residential area. The model could not be applied successfully to apartments with five rooms, because there are only six such records in our data set. To test the model we used a test set containing 200 records; the results are shown in the appendix. The accuracy obtained on the test set (84%, or 168 out of 200 instances) is close to the training accuracy, which indicates a good-quality model. Here is an example of interpreting the decision tree: "If the apartment has three rooms, its price is between 30187 and 51430 euro and it is located in the city centre (Centru), then its score is 4."
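Reading the tree as a set of rules can be made concrete with a Python sketch that hard-codes the pruned tree listed in the appendix. The function name is illustrative. Unlike J48, the sketch does not handle a missing residential area and raises an error instead.

```python
def classify(camere, pret, cartierul=None):
    """Return the score (1-5) by following the pruned J48 tree in the appendix."""
    if camere == 1:
        if pret <= 14904:
            return 4 if pret <= 8944 else 3
        return 2
    if camere == 2:
        if pret <= 30187:  # both leaves up to 30187 carry score 4
            return 4
        return 3 if pret <= 54784 else 2
    if camere == 3:
        if pret <= 30187:  # both leaves up to 30187 carry score 4
            return 4
        if pret > 51430:
            return 2
        # Between 30187 and 51430 the residential area decides.
        if cartierul == "Gheorgheni":
            return 4 if pret <= 33542 else 3
        by_area = {"Manastur": 4, "Zorilor": 3, "Marasti": 4,
                   "Grigorescu": 3, "Centru": 4}
        if cartierul not in by_area:
            raise ValueError("residential area required for this price range")
        return by_area[cartierul]
    if camere == 4:
        if pret <= 21802:
            return 5 if pret <= 17665 else 4
        if pret <= 33318:
            return 4
        return 3 if pret <= 46958 else 2
    if camere == 5:
        return 5 if pret <= 51430 else 1
    raise ValueError("camere must be between 1 and 5")


# A three-room apartment in Centru priced at 40000 euro.
score = classify(3, 40000, "Centru")
```

Each return statement corresponds to one leaf of the tree, so a path from the top of the function to a return reads as one "if ... then score" rule.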

Conclusion
The results of this project show that Weka is an appropriate tool for extracting information about apartment prices from their characteristics and for using that information in decision making. Because the application deals with an economic setting that changes every day, in an imperfect market with imperfect competition, errors are inevitable. The number of instances is not very high and many attribute values are missing; the prices are the asking prices requested by sellers, and they change during negotiations. Another source of error is the exchange rate: in 2001 apartments were offered for sale in United States dollars, euros, German marks and Romanian lei, so converting to a common currency introduced exchange errors.

Perspectives for Further Development


A practical application of this project would be to implement it on a web server that estimates the price of an apartment and helps buyers, sellers and real estate agencies evaluate and decide upon a transaction. Such a database can also support other statistical research and data mining, including predicting the future demand for apartments. This information can be of great value to real estate agencies, city officials responsible for urban development, architects and construction entrepreneurs. The project could also be used to build models and charts of the seasonal and annual fluctuation of prices caused by factors such as the movement of students. A comparative evaluation of the real estate market across different cities in our country could help people and companies make the best decision when choosing a city. Another possible direction is a comparative study of the European real estate market, to determine whether buying an apartment is within reach of all European citizens.

Bibliography
Weka Data Mining application, Computer Science department at the University of Waikato, New Zealand - www.cs.waikato.ac.nz/ml/weka/
Ross Quinlan (1993). "C4.5: Programs for Machine Learning", Morgan Kaufmann Publishers, San Mateo, CA.
"Clujul tace și le face", Capital no. 18/2004 - www.capital.ro/index.jsp?page=archive&magazine_id=279&article_id=14048
"Piata de la A la Z", Celina Prodcom Ltd., Cluj Napoca - www.piata-az.ro

Appendix - Database Attributes


Attribute    Description                    Data Type  Values (Count)
camere       Number of rooms                nominal    1 (173), 2 (758), 3 (668), 4 (376), 5 (6)
decomandat   Presence of a central hallway  nominal    decomandat (658), semidecomandat (77)
confort      Comfort level                  nominal    sporit (259), unu (795), doi (85)
etaj         Floor                          nominal    intermediar (875), parter (121)
balcon       Presence of balcony            binary     balcon (653)
finisare     Level of finishing             nominal    superfinisat (114), finisat (500), semifinisat (305), nefinisat (115)
parchet      Parquet                        binary     parchet (598)
faianta      Faience                        binary     faianta (634)
gresie       Gritstone                      binary     gresie (582)
termopan     Insulated windows              binary     termopan (37)
modificari   Customizations                 binary     modificari (115)
centrala     Central heating unit           binary     centrala (68)
contorizat   Gas and water meters           binary     contorizat (261)
telefon      Telephone                      binary     telefon (598)
cablu        TV cable                       binary     cablu (58)
garaj        Garage or parking place        nominal    garaj (175), parcare (75)
cartierul    Residential area               nominal    Manastur (451), Gheorgheni (255), Zorilor (184), Marasti (260), Grigorescu (176), Centru (108)
pret         Price                          numeric    see below

Total number of instances: 1981

Attribute    Minimum    Maximum    Mean
pret         2236       78264      19742.161

Appendix - Linear Regression Results


=== Run information ===

Scheme:       weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
Relation:     pret_apartament
Instances:    1981
Attributes:   18
              camere
              decomandat
              confort
              etaj
              balcon
              finisare
              parchet
              faianta
              gresie
              termopan
              modificari
              centrala
              contorizat
              telefon
              cablu
              garaj
              cartierul
              pret
Test mode:    evaluate on training data

=== Classifier model (full training set) ===

Linear Regression Model

pret =

   4537.7439 * camere=2,3,4,5 +
   3851.0755 * camere=3,4,5 +
   2897.5552 * camere=4,5 +
  28548.9403 * camere=5 +
   2015.4919 * decomandat=decomandat +
   5328.5996 * confort=unu,sporit +
   4467.4402 * confort=sporit +
   1201.8261 * finisare=finisat,superfinisat +
   3754.5847 * finisare=superfinisat +
   1682.1089 * garaj=garaj +
   2520.7528 * cartierul=Gheorgheni,Grigorescu,Zorilor,Centru +
   -595.0813 * cartierul=Grigorescu,Zorilor,Centru +
   6855.378  * cartierul=Centru +
   1265.0314

Time taken to build model: 0.62 seconds

=== Evaluation on training set ===
=== Summary ===

Correlation coefficient                  0.702
Mean absolute error                   3091.6882
Root mean squared error               4919.9781
Relative absolute error                 65.6611 %
Root relative squared error             71.2214 %
Total Number of Instances             1981

Appendix - Attribute Ranking


=== Run information ===

Evaluator:    weka.attributeSelection.ChiSquaredAttributeEval
Search:       weka.attributeSelection.Ranker -T -1.7976931348623157E308 -N -1
Relation:     pret_apartament
Instances:    1981
Attributes:   19
              camere
              decomandat
              confort
              etaj
              balcon
              finisare
              parchet
              faianta
              gresie
              termopan
              modificari
              centrala
              contorizat
              telefon
              cablu
              garaj
              cartierul
              pret
              scor
Evaluation mode:    evaluate on all training data

=== Attribute Selection on all input data ===

Search Method:
        Attribute ranking.

Attribute Evaluator (supervised, Class (nominal): 19 scor):
        Chi-squared Ranking Filter

Ranked attributes:
 2398.843   18 pret
 1760.281    1 camere
  120.16    17 cartierul
   44.849    3 confort
   10.946    6 finisare
    8.646    4 etaj
    0.475   16 garaj
    0.296    2 decomandat
    0        7 parchet
    0        5 balcon
    0        8 faianta
    0       13 contorizat
    0       14 telefon
    0       15 cablu
    0       12 centrala
    0        9 gresie
    0       10 termopan
    0       11 modificari

Selected attributes: 18,1,17,3,6,4,16,2,7,5,8,13,14,15,12,9,10,11 : 18

Appendix - J48 Training Results


=== Run information ===

Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
Relation:     scor_apartament
Instances:    1981
Attributes:   9
              camere
              pret
              decomandat
              confort
              etaj
              finisare
              garaj
              cartierul
              scor
Test mode:    evaluate on training data

=== Classifier model (full training set) ===

J48 pruned tree
------------------

camere = 1
|   pret <= 14904
|   |   pret <= 8944: 4 (6.0/1.0)
|   |   pret > 8944: 3 (117.0/8.0)
|   pret > 14904: 2 (50.0/17.0)
camere = 2
|   pret <= 23255: 4 (712.0/92.0)
|   pret > 23255
|   |   pret <= 30187: 4 (36.0/9.0)
|   |   pret > 30187
|   |   |   pret <= 54784: 3 (8.0)
|   |   |   pret > 54784: 2 (2.0)
camere = 3
|   pret <= 24262: 4 (512.0/44.0)
|   pret > 24262
|   |   pret <= 30187: 4 (118.0/11.0)
|   |   pret > 30187
|   |   |   pret <= 51430
|   |   |   |   cartierul = Manastur: 4 (1.5/0.45)
|   |   |   |   cartierul = Gheorgheni
|   |   |   |   |   pret <= 33542: 4 (4.82/1.59)
|   |   |   |   |   pret > 33542: 3 (2.68)
|   |   |   |   cartierul = Zorilor: 3 (0.0)
|   |   |   |   cartierul = Marasti: 4 (1.5/0.45)
|   |   |   |   cartierul = Grigorescu: 3 (7.5/1.23)
|   |   |   |   cartierul = Centru: 4 (15.0/5.55)
|   |   |   pret > 51430: 2 (5.0/1.0)
camere = 4
|   pret <= 21802
|   |   pret <= 17665: 5 (60.0/10.0)
|   |   pret > 17665: 4 (102.0/34.0)
|   pret > 21802
|   |   pret <= 33318: 4 (181.0/6.0)
|   |   pret > 33318
|   |   |   pret <= 46958: 3 (23.0/10.0)
|   |   |   pret > 46958: 2 (10.0/1.0)
camere = 5
|   pret <= 51430: 5 (3.0/1.0)
|   pret > 51430: 1 (3.0/1.0)

Number of Leaves  :     24
Size of the tree :      40

Time taken to build model: 0.3 seconds

=== Evaluation on training set ===
=== Summary ===

Correctly Classified Instances        1726               87.1277 %
Incorrectly Classified Instances       255               12.8723 %
Kappa statistic                          0.6201
Mean absolute error                      0.0843
Root mean squared error                  0.205
Relative absolute error                 52.7333 %
Root relative squared error             72.6086 %
Total Number of Instances             1981

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   Class
  0.667     0.001      0.667     0.667      0.667    1
  0.842     0.01       0.716     0.842      0.774    2
  0.743     0.011      0.88      0.743      0.806    3
  0.985     0.432      0.879     0.985      0.929    4
  0.229     0.006      0.825     0.229      0.359    5

=== Confusion Matrix ===

    a    b    c    d    e   <-- classified as
    2    1    0    0    0 |    a = 1
    1   48    8    0    0 |    b = 2
    0   18  139   30    0 |    c = 3
    0    0   11 1485   11 |    d = 4
    0    0    0  175   52 |    e = 5
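The summary figures follow directly from the confusion matrix. The Python sketch below recomputes accuracy, recall (TP rate) and precision from the training confusion matrix, with rows as actual classes and columns as predicted classes. The row layout is reconstructed from the flattened listing and agrees with the reported per-class rates; the function names are illustrative.

```python
# Training confusion matrix: rows = actual class 1..5, columns = predicted class 1..5.
CONFUSION = [
    [2, 1, 0, 0, 0],
    [1, 48, 8, 0, 0],
    [0, 18, 139, 30, 0],
    [0, 0, 11, 1485, 11],
    [0, 0, 0, 175, 52],
]


def accuracy(matrix):
    """Share of instances on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def recall(matrix, k):
    """TP rate of class index k: correct predictions / actual members."""
    return matrix[k][k] / sum(matrix[k])


def precision(matrix, k):
    """Correct predictions of class index k / all predictions of that class."""
    return matrix[k][k] / sum(row[k] for row in matrix)


train_accuracy = accuracy(CONFUSION)     # 1726 of 1981 instances
recall_class_5 = recall(CONFUSION, 4)    # 52 of 227 five-star apartments found
precision_class_5 = precision(CONFUSION, 4)
```

The low recall for class 5 shows that most apartments rated 5 are predicted as 4, which the overall accuracy alone does not reveal.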

Appendix - J48 Test Results


=== Run information ===

Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
Relation:     scor_apartament
Instances:    1981
Attributes:   9
              camere
              pret
              decomandat
              confort
              etaj
              finisare
              garaj
              cartierul
              scor
Test mode:    user supplied test set: 200 instances

=== Classifier model (full training set) ===

=== Evaluation on test set ===
=== Summary ===

Correctly Classified Instances         168               84      %
Incorrectly Classified Instances        32               16      %
Kappa statistic                          0.515
Mean absolute error                      0.0918
Root mean squared error                  0.226
Relative absolute error                 55.6739 %
Root relative squared error             77.6152 %
Total Number of Instances              200

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   Class
  0         0          0         0          0        1
  1         0.005      0.833     1          0.909    2
  0.765     0          1         0.765      0.867    3
  1         0.596      0.827     1          0.905    4
  0.067     0          1         0.067      0.125    5

=== Confusion Matrix ===

    a    b    c    d    e   <-- classified as
    0    0    0    0    0 |    a = 1
    0    5    0    0    0 |    b = 2
    0    1   13    3    0 |    c = 3
    0    0    0  148    0 |    d = 4
    0    0    0   28    2 |    e = 5

Babes-Bolyai University, Cluj Napoca Faculty of Economics

Need a Home? Start the Data Mining!


A Data Mining Application in Weka on Real Estate Market

Authors Tit-Liviu Leontin Darie Moldovan Manuela Rusu Daniela Secar Corina Trifu

Scientific project coordinators Professor tefan Nichi, PhD. Teaching Assistant Gheorghe Silaghi

Cluj Napoca, May 20 2004

Objectives
According to a recent survey conducted by "Capital" business magazine in edition 18 (April 24th, 2004), there are about 200 real estate agencies in the city, the highest number per capita in the country. Local industries have exploded in recent years, foreign as well as national investors considering Cluj Napoca to be one of the fastest economically developing regions in Romania. Given the high demand for residential apartments and the soaring prices in this sector, the aim of this project is to facilitate the acquisition of a flat. This will be accomplished by analysing data from the real estate market with the aid of data mining techniques. There will be determined the price and rating for each apartment, thus showing on a scale from 1 to 5 whether the apartment is worthwhile considering or not.

Motivation
Due to the very high number of real estate agencies in Cluj Napoca, finding a flat can become a troublesome and wearying business. Moreover, having so many possibilities to choose from, it may become difficult to visit every single available apartment and try to decide if it suits or not one's wishes. That is where our data mining application comes in handy. It provides a simple mechanism to rank the apartments. This way, one can easily find and eliminate the low-ranked flats and focus on the important ones. Another reason for this project is the extreme dynamism of the real estate market. Also the continuously growing demand entails the unjustified increase of prices, thus leading to the necessity of periodic reevaluation of the apartments which can be more easily achieved using this kind of data mining analysis.

Tools and Methods


For this data mining project, we used Weka - a collection of machine learning algorithms for data mining tasks written in Java by the Computer Science department of University of Waikato, New Zeeland. Weka can be ran both in command-line mode as well as with a graphical user interface. From the algorithms included in Weka, we have selected two which best match our application: linear regression and J48.

Data Set
The data is provided by "Piata de la A la Z" ("A to Z Market") weekly newspaper. It contained, of course, information on all apartment sales advertisments published in January 2001. The data was received in Excel spreadsheet format. In order to be able to analyse this database with Weka, so that we could reach our stated objectives, we had to make a series of changes on the initial format of the data. First of all we have to mention that the database contained the data in only one memo field for each record. So, the first step we made was to separate from the memo field the relevant data: floor number, balcony, TV cable, telephone, central heating unit, garrage etc., using Excel string searching formulas. Afterwards, based on these keywords, we created a new spreadsheet that contained all the 18 final attributes of the database we used for training in Weka. The fields that had no value for an instance were marked with a question mark.

There were a few other adjustments that had to be made before starting the tests. We had to take into account that the algorithms we wanted to use in Weka establish the relationship between input and output attributes in the database. That is why all the instances in the spreadsheet needed to have at least two attributes filled in, one for input and one for output, because it is impossible to establish a relationship with a single attribute. As a result, we have deleted from the database all the instances that had only one attribute filled in. Another issue that we needed to take care of was the currency, because the prices quoted for the apartments were expressed in four different currencies, so we decided to convert all the prices into Euros using the official exchange rates of January 7, 2001. Finally, the Weka environment needs the database to be formatted under "ARFF" format, so we converted the Excel spreadsheet into "CSV" format (Comma Separated Values) then added the required attribute descriptor header to create the "ARFF" file. As shown in the attached table, the final form of the database is composed of 18 attributes that describe the information on the apartments, and contains 1981 instances. The table also shows the number of instances for each value of discrete attribute. As far as the database is concerned, there is one more issue that needs to be clarified: in order to obtain the best results with the J48 classifier we have made some adjustments to the database: there were 7 relevant attributes left and a new attribute expressing an overall rating of each apartments characteristics was added. The rating was calculated using an expert function applied on all 18 attributes, thus obtaining a more personalized score of each apartment based on the average buyers preferences; the rating is a number between 1 and 5, 1 being the lowest rating and representing a bad buying decision and 5 being the highest rating and representing an excellent apartment.

Linear Regression
The relationship between apartments' attributes and prices can be intuitively estimated to be linear. This means that modifying one of the attributes results in a proportional change in the resulting price. The corresponding algorithm for a linear relationship between input attributes and output attribute is the linear regression model, which is a part of the Weka functional algorithms. Using statistic clasiffication, linear regression determines the numeric coefficient of each attribute on the training data set and reports the statistical errors of the algorithm through the correlation coefficient. The linear regression algorithm implemented in Weka is not restricted to numeric attributes alone. Thus, it is an excellent algorithm for our training database, in which 17 attributes (characteristics) determine the 18th numeric attribute (the price). The most important parameter for linear regression is the attribute selection method, with three possible options. On the extremes are "None" which provides fast results but less selective, and "Greedy" which is considerably slower but more accurate. Between them is the "M5" option which compromises between speed and correctness. One restriction must be taken in consideration, though; Greedy algorithm determines the most precise formula, but it requires the relationship between input and output attributes to be as close as possible to linear in order to properly determine local maximum values. If the relationship is not linear, the Greedy method will give an erroneous formula. By applying the linear regression classifier algorithm on the complete database of 1981 records, both Greedy and M5 options returned exactly the same formula. The results of the machine learning algorithm are shown in appendix.

The correlation coefficient is 0.702, which means that the relationship between apartment characteristics and prices resembles relatively well a linear relationship. A correlation coefficient of 1 indicates a perfect linear relationship, while a correlation coefficient of 0 indicates no relationship between input and output attributes. The algorithm considered only 6 attributes relevant in the linear relationship. The above formula gives a simple method of estimating the price of an apartment based on its characteristics. For example, for an apartment with the following attributes: camere = 3; decomandat = decomandat; confort = sporit; finisare = finisat; garaj = none; cartierul = Zorilor the estimated price is determined by simply adding the corresponding coefficients: pret = 4537.7439 + 3851.0755 + 2015.4919 + 5328.5996 + 4467.4402 + 1201.8261 + 2520.7528 - 595.0813 + 1265.0314 = 24,592.8801

J48
J48 is Weka's implementation of J. Ross Quinlan's C4.5 algorithm, which uses the top-down inductive method for the construction of decision trees. Each node represents an attribute. Starting from the root, each record is tested at each node: the algorithm evaluates the attribute for the current node and, depending on its value, the instance follows one of the tree's branches. In this way the algorithm tries to place every instance in an existing class using similar characteristics. When there are no more nodes left for evaluation, the instance is classified. If, after the insertion of several records, a certain class turns out not to be significantly different from another one, the two classes are merged. This process is called "pruning".

Since J48 performs a classification with a discrete output variable, we had to rate each apartment, taking into consideration every attribute except the number of rooms. This attribute should not influence the rating. The result of the evaluation is a number between 1 and 5, 1 being the lowest rating and representing a bad buying decision, and 5 being the highest rating and representing an excellent apartment.

Attribute evaluation was necessary in order to obtain a high quality classification. To do this, we used Weka's "ChiSquaredAttributeEval" attribute evaluator together with the "Ranker" search method. "ChiSquaredAttributeEval" measures the strength of the association between attributes using the chi-squared test. The "Ranker" method sorts the attributes according to this evaluation. The evaluation is done based on the apartment's type, that is, the number of rooms. The results are shown in the appendix. The attributes ranked 9 to 18 received a score of 0, so we removed them because their importance is insignificant in this case. Withholding them from the classification algorithm speeds up model creation and improves accuracy.

We applied J48 to the entire data set (the training set). The testing takes place on different data: the test set. The score is the attribute used for classification. The result is a decision tree whose leaves give the score assigned to each category. See the appendix for detailed results.

The accuracy is 87.1%, which means that 1726 out of 1981 instances have been correctly classified. The root of our decision tree represents the number of rooms. This is the main means of differentiating between the apartments. Other important attributes are price, finishing rating, comfort level and residential area. The model could not be applied successfully to apartments with five rooms, because there are only six such records in our data set. For testing the model we used a test set containing 200 records. The results are shown in the appendix. The accuracy obtained (84%) indicates the high quality of the model. Here is an example of interpreting the decision tree: "If the apartment has four rooms, its price is between 31864 and 44722 euro and it is located downtown, then its score is 4."
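Reading a decision tree amounts to following nested conditions from the root to a leaf. The sketch below writes out the pruned tree listed in the appendix (J48 training results) as plain Python conditions; the function name `classify` and its signature are hypothetical, and the thresholds are copied from that listing.

```python
def classify(camere: int, pret: float, cartierul: str = "") -> int:
    """Return the score (1-5) of the leaf reached in the appendix tree."""
    if camere == 1:
        if pret <= 14904:
            return 4 if pret <= 8944 else 3
        return 2
    if camere == 2:
        if pret <= 30187:
            return 4  # both sub-branches below 30187 end in score 4
        return 3 if pret <= 54784 else 2
    if camere == 3:
        if pret <= 30187:
            return 4  # both sub-branches below 30187 end in score 4
        if pret > 51430:
            return 2
        # Between 30187 and 51430 the tree splits on the residential area.
        if cartierul == "Gheorgheni":
            return 4 if pret <= 33542 else 3
        if cartierul in ("Zorilor", "Grigorescu"):
            return 3
        if cartierul in ("Manastur", "Marasti", "Centru"):
            return 4
        raise ValueError(f"unknown residential area: {cartierul!r}")
    if camere == 4:
        if pret <= 17665:
            return 5
        if pret <= 33318:
            return 4
        return 3 if pret <= 46958 else 2
    if camere == 5:
        return 5 if pret <= 51430 else 1
    raise ValueError(f"unsupported number of rooms: {camere}")


print(classify(2, 20000))                # two rooms, low price
print(classify(3, 40000, "Gheorgheni"))  # three rooms, split on area
```

The residential area is only consulted on the three-room branch, which is why it is an optional argument here.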

Conclusion
The results of this project show that Weka is an appropriate tool for extracting information about apartment prices based on their characteristics and for using that information in decision making. Because our application deals with economic data, errors are inevitable: changes occur every day in economic life and can lead to an imperfect market and imperfect competition. The number of instances is not very high and a lot of attributes are missing; the prices are the initial ones requested by sellers, and they change during negotiations. Another source of errors is the exchange rate: because in 2001 the apartments offered for sale had prices in United States Dollars, Euros, German Marks and Romanian Lei, converting them to a common currency generated exchange errors.

Perspectives for Further Development


A practical application of this project is to implement it on a web server, to estimate the price of an apartment and to assist buyers, sellers and real estate agencies in evaluating and deciding upon a transaction. Such a database can provide excellent results in other statistical research and data mining, including the prediction of future demand for apartments, information that can be of great value to real estate agencies, city officials responsible for urban development, architects and construction entrepreneurs. Using this project we can build models and charts of the seasonal and annual fluctuation of prices caused by several factors, for example the movement of students. A comparative evaluation of the real estate industry and market across different cities in our country can also help people and companies make the best decisions when choosing a city. Another possible perspective is a comparative study of the European real estate market, in order to determine whether it is possible for all European citizens to buy apartments.

Bibliography
Weka Data Mining application, Computer Science Department, University of Waikato, New Zealand - www.cs.waikato.ac.nz/ml/weka/
Ross Quinlan (1993). "C4.5: Programs for Machine Learning", Morgan Kaufmann Publishers, San Mateo, CA.
"Clujul tace și le face", Capital no. 18/2004 - www.capital.ro/index.jsp?page=archive&magazine_id=279&article_id=14048
"Piata de la A la Z", Celina Prodcom Ltd., Cluj Napoca - www.piata-az.ro

Appendix - Database Attributes


Attribute    Description                     Data type   Values (count)
camere       Number of rooms                 nominal     1 (173), 2 (758), 3 (668), 4 (376), 5 (6)
decomandat   Presence of a central hallway   nominal     decomandat (658), semidecomandat (77)
confort      Comfort level                   nominal     sporit (259), unu (795), doi (85)
etaj         Floor                           nominal     intermediar (875), parter (121)
balcon       Presence of balcony             binary      balcon (653)
finisare     Level of finishing              nominal     superfinisat (114), finisat (500), semifinisat (305), nefinisat (115)
parchet      Parquet                         binary      parchet (598)
faianta      Faience                         binary      faianta (634)
gresie       Gritstone                       binary      gresie (582)
termopan     Insulated windows               binary      termopan (37)
modificari   Customizations                  binary      modificari (115)
centrala     Central heating unit            binary      centrala (68)
contorizat   Gas and water meters            binary      contorizat (261)
telefon      Telephone                       binary      telefon (598)
cablu        TV cable                        binary      cablu (58)
garaj        Garage or parking place         nominal     garaj (175), parcare (75)
cartierul    Residential area                nominal     Manastur (451), Gheorgheni (255), Zorilor (184), Marasti (260), Grigorescu (176), Centru (108)
pret         Price                           numeric

Total number of instances: 1981

Attribute    Minimum   Maximum   Mean
Price        2236      78264     19742.161

Appendix - Linear Regression Results


=== Run information ===

Scheme:       weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
Relation:     pret_apartament
Instances:    1981
Attributes:   18
              camere
              decomandat
              confort
              etaj
              balcon
              finisare
              parchet
              faianta
              gresie
              termopan
              modificari
              centrala
              contorizat
              telefon
              cablu
              garaj
              cartierul
              pret
Test mode:    evaluate on training data

=== Classifier model (full training set) ===

Linear Regression Model

pret =

   4537.7439 * camere=2,3,4,5 +
   3851.0755 * camere=3,4,5 +
   2897.5552 * camere=4,5 +
  28548.9403 * camere=5 +
   2015.4919 * decomandat=decomandat +
   5328.5996 * confort=unu,sporit +
   4467.4402 * confort=sporit +
   1201.8261 * finisare=finisat,superfinisat +
   3754.5847 * finisare=superfinisat +
   1682.1089 * garaj=garaj +
   2520.7528 * cartierul=Gheorgheni,Grigorescu,Zorilor,Centru +
   -595.0813 * cartierul=Grigorescu,Zorilor,Centru +
   6855.378  * cartierul=Centru +
   1265.0314

Time taken to build model: 0.62 seconds

=== Evaluation on training set ===
=== Summary ===

Correlation coefficient                  0.702
Mean absolute error                   3091.6882
Root mean squared error               4919.9781
Relative absolute error                 65.6611 %
Root relative squared error             71.2214 %
Total Number of Instances             1981

Appendix - Attribute Ranking


=== Run information ===

Evaluator:    weka.attributeSelection.ChiSquaredAttributeEval
Search:       weka.attributeSelection.Ranker -T -1.7976931348623157E308 -N -1
Relation:     pret_apartament
Instances:    1981
Attributes:   19
              camere
              decomandat
              confort
              etaj
              balcon
              finisare
              parchet
              faianta
              gresie
              termopan
              modificari
              centrala
              contorizat
              telefon
              cablu
              garaj
              cartierul
              pret
              scor
Evaluation mode:    evaluate on all training data

=== Attribute Selection on all input data ===

Search Method:
    Attribute ranking.

Attribute Evaluator (supervised, Class (nominal): 19 scor):
    Chi-squared Ranking Filter

Ranked attributes:
2398.843   18 pret
1760.281    1 camere
 120.16    17 cartierul
  44.849    3 confort
  10.946    6 finisare
   8.646    4 etaj
   0.475   16 garaj
   0.296    2 decomandat
   0        7 parchet
   0        5 balcon
   0        8 faianta
   0       13 contorizat
   0       14 telefon
   0       15 cablu
   0       12 centrala
   0        9 gresie
   0       10 termopan
   0       11 modificari

Selected attributes: 18,1,17,3,6,4,16,2,7,5,8,13,14,15,12,9,10,11 : 18

Appendix - J48 Training Results


=== Run information ===

Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
Relation:     scor_apartament
Instances:    1981
Attributes:   9
              camere
              pret
              decomandat
              confort
              etaj
              finisare
              garaj
              cartierul
              scor
Test mode:    evaluate on training data

=== Classifier model (full training set) ===

J48 pruned tree
------------------

camere = 1
|   pret <= 14904
|   |   pret <= 8944: 4 (6.0/1.0)
|   |   pret > 8944: 3 (117.0/8.0)
|   pret > 14904: 2 (50.0/17.0)
camere = 2
|   pret <= 23255: 4 (712.0/92.0)
|   pret > 23255
|   |   pret <= 30187: 4 (36.0/9.0)
|   |   pret > 30187
|   |   |   pret <= 54784: 3 (8.0)
|   |   |   pret > 54784: 2 (2.0)
camere = 3
|   pret <= 24262: 4 (512.0/44.0)
|   pret > 24262
|   |   pret <= 30187: 4 (118.0/11.0)
|   |   pret > 30187
|   |   |   pret <= 51430
|   |   |   |   cartierul = Manastur: 4 (1.5/0.45)
|   |   |   |   cartierul = Gheorgheni
|   |   |   |   |   pret <= 33542: 4 (4.82/1.59)
|   |   |   |   |   pret > 33542: 3 (2.68)
|   |   |   |   cartierul = Zorilor: 3 (0.0)
|   |   |   |   cartierul = Marasti: 4 (1.5/0.45)
|   |   |   |   cartierul = Grigorescu: 3 (7.5/1.23)
|   |   |   |   cartierul = Centru: 4 (15.0/5.55)
|   |   |   pret > 51430: 2 (5.0/1.0)
camere = 4
|   pret <= 21802
|   |   pret <= 17665: 5 (60.0/10.0)
|   |   pret > 17665: 4 (102.0/34.0)
|   pret > 21802
|   |   pret <= 33318: 4 (181.0/6.0)
|   |   pret > 33318
|   |   |   pret <= 46958: 3 (23.0/10.0)
|   |   |   pret > 46958: 2 (10.0/1.0)
camere = 5
|   pret <= 51430: 5 (3.0/1.0)
|   pret > 51430: 1 (3.0/1.0)


Number of Leaves  :     24

Size of the tree :      40

Time taken to build model: 0.3 seconds

=== Evaluation on training set ===
=== Summary ===

Correctly Classified Instances        1726               87.1277 %
Incorrectly Classified Instances       255               12.8723 %
Kappa statistic                          0.6201
Mean absolute error                      0.0843
Root mean squared error                  0.205
Relative absolute error                 52.7333 %
Root relative squared error             72.6086 %
Total Number of Instances             1981

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   Class
  0.667     0.001      0.667     0.667      0.667     1
  0.842     0.01       0.716     0.842      0.774     2
  0.743     0.011      0.88      0.743      0.806     3
  0.985     0.432      0.879     0.985      0.929     4
  0.229     0.006      0.825     0.229      0.359     5

=== Confusion Matrix ===

    a    b    c    d    e   <-- classified as
    2    1    0    0    0 |    a = 1
    1   48    8    0    0 |    b = 2
    0   18  139   30    0 |    c = 3
    0    0   11 1485   11 |    d = 4
    0    0    0  175   52 |    e = 5
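
As a cross-check, the headline figures in the summary can be recomputed from the training confusion matrix above (rows are the actual classes, columns the predicted ones). This is a minimal sketch, not Weka's code; the function name `accuracy_and_kappa` is hypothetical, and the kappa formula used is the standard Cohen's kappa.

```python
def accuracy_and_kappa(matrix):
    """Return (correct count, accuracy, Cohen's kappa) for a square matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    observed = correct / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total**2
    kappa = (observed - expected) / (1 - expected)
    return correct, observed, kappa


# Training confusion matrix from the J48 output above.
TRAINING = [
    [2, 1, 0, 0, 0],
    [1, 48, 8, 0, 0],
    [0, 18, 139, 30, 0],
    [0, 0, 11, 1485, 11],
    [0, 0, 0, 175, 52],
]

correct, accuracy, kappa = accuracy_and_kappa(TRAINING)
print(correct, f"{accuracy:.4%}", f"{kappa:.4f}")
```

The diagonal of the matrix sums to the number of correctly classified instances, and the result should agree with the accuracy and kappa statistic reported in the summary.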

Appendix - J48 Test Results


=== Run information ===

Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
Relation:     scor_apartament
Instances:    1981
Attributes:   9
              camere
              pret
              decomandat
              confort
              etaj
              finisare
              garaj
              cartierul
              scor
Test mode:    user supplied test set: 200 instances

=== Classifier model (full training set) ===

=== Evaluation on test set ===
=== Summary ===

Correctly Classified Instances         168               84      %
Incorrectly Classified Instances        32               16      %
Kappa statistic                          0.515
Mean absolute error                      0.0918
Root mean squared error                  0.226
Relative absolute error                 55.6739 %
Root relative squared error             77.6152 %
Total Number of Instances              200

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   Class
  0         0          0         0          0         1
  1         0.005      0.833     1          0.909     2
  0.765     0          1         0.765      0.867     3
  1         0.596      0.827     1          0.905     4
  0.067     0          1         0.067      0.125     5

=== Confusion Matrix ===

    a    b    c    d    e   <-- classified as
    0    0    0    0    0 |    a = 1
    0    5    0    0    0 |    b = 2
    0    1   13    3    0 |    c = 3
    0    0    0  148    0 |    d = 4
    0    0    0   28    2 |    e = 5
