CHAPTER 1
PREAMBLE
1.1 Introduction
The House Price Prediction project is a practical application of machine learning that
aims to estimate the market value of residential properties. This project utilizes
historical data and various factors influencing house prices, such as location, size,
number of rooms, amenities, and proximity to key infrastructure, to create a predictive
model.
In this project, we focus on data from a metropolitan city known for its dynamic real
estate market and diverse property features. The predictive model is built using the
XGBoost and linear regression algorithms. XGBoost is a powerful and efficient gradient
boosting framework known for its high accuracy and ability to handle complex datasets;
its strength lies in uncovering intricate patterns in data, making it an excellent
choice for predicting house prices.
This project not only provides valuable insights into the real estate market but also
serves as a decision-making tool for buyers, sellers, and investors. By leveraging
advanced machine learning techniques, the project demonstrates the potential of data-
driven approaches in addressing real-world challenges, especially in rapidly growing
urban areas.
CHAPTER 2
INTRODUCTION TO WEB SCRAPING AND
MACHINE LEARNING
2.1 Web Scraping
Web scraping, also referred to as web extraction or harvesting, is a technique for
extracting data from the World Wide Web (WWW) and storing it in a structured format, such
as a file system or database, for later retrieval or analysis. This process involves utilizing
Hypertext Transfer Protocol (HTTP) or web browsers to fetch data either manually by a user
or automatically through a bot or web crawler. Given the immense amount of heterogeneous
data continuously generated on the WWW, web scraping has become an indispensable and
efficient method for collecting and analysing big data.
2.1.1 Key Characteristics of Web Scraping
• Automation: Allows for the systematic extraction of large datasets in minimal
time.
• Versatility: Can extract data from multiple formats, such as HTML, JSON, XML,
and multimedia files like images and videos.
• Scalability: Adaptable for tasks ranging from small-scale, ad hoc data collection to
large-scale automated systems.
2.1.2 The Web Scraping Process
The process of web scraping can be divided into two sequential steps:
Acquiring Web Resources:
• A web scraper sends an HTTP request to the target website.
• The request may be a GET query (URL-based) or a POST query (message-based).
• Once processed, the requested resource (HTML, XML, JSON, or multimedia data)
is retrieved from the website.
Extracting Desired Information:
• The downloaded web data is parsed and organized into a structured format.
• Tools such as Beautiful Soup and Pyquery are commonly used to extract
information from raw HTML or XML.
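The two steps above can be illustrated with a minimal Python sketch; the URL, headers, and tag names below are placeholders for illustration only, and a real target site may require different selectors and permission under its terms of use.

import requests
from bs4 import BeautifulSoup

URL = "https://example.com/products"        # hypothetical target page
HEADERS = {"User-Agent": "Mozilla/5.0"}     # identify the client to the server

# Step 1: acquire the web resource with an HTTP GET request
response = requests.get(URL, headers=HEADERS, timeout=10)
response.raise_for_status()                 # stop early if the request failed

# Step 2: parse the raw HTML and extract the desired information
soup = BeautifulSoup(response.text, "html.parser")
titles = [tag.get_text(strip=True) for tag in soup.find_all("h2")]
print(titles)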
2.2 Web Scraping with BeautifulSoup4 (bs4)
BeautifulSoup4 (bs4) is a powerful and easy-to-use Python library designed for web
scraping. It allows users to parse HTML and XML documents, navigate the document tree,
and extract specific data. Thanks to its simplicity and flexibility, it is widely used to
automate the extraction of data from websites, an essential step in many data-driven
applications.
2.2.1 Core Components:
The core components of bs4 are
• HTML Parsing: BeautifulSoup4 parses HTML and XML documents and
allows for easy navigation and data extraction.
• Search and Extraction: It provides methods like find(), find_all(), and select()
to locate and extract data from the document using CSS selectors, tags,
attributes, etc.
• Data Cleanliness: BeautifulSoup automatically handles poorly formatted
HTML and still provides accurate results.
2.2.2 Benefits of Using BeautifulSoup4:
• Ease of Use: BeautifulSoup4's API is designed to be easy to understand, even
for beginners in Python and web scraping.
• Compatibility: It works well with different parsers, including html.parser, lxml,
and html5lib.
• Handles Malformed HTML: BeautifulSoup can handle poorly formatted or
malformed HTML, which makes it highly useful when scraping real-world
web data.
• Search Capabilities: BeautifulSoup provides robust methods to search for and
extract elements by tag name, CSS class, ID, and attributes.
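The search methods mentioned above (find(), find_all(), and select()) can be demonstrated on a small inline HTML snippet; the tag names and classes below are made up so that the example stays self-contained.

from bs4 import BeautifulSoup

html = """
<div class="product" id="p1"><h2 class="title">Laptop</h2><span class="price">49,999</span></div>
<div class="product" id="p2"><h2 class="title">Headphones</h2><span class="price">1,499</span></div>
"""

soup = BeautifulSoup(html, "html.parser")
first = soup.find("div", class_="product")          # first matching tag
titles = soup.find_all("h2", class_="title")        # every matching tag
prices = soup.select("div.product > span.price")    # CSS selector syntax

print(first["id"])                                  # p1
print([t.get_text(strip=True) for t in titles])     # ['Laptop', 'Headphones']
print([p.get_text(strip=True) for p in prices])     # ['49,999', '1,499']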
2.3 Web Scraping with Selenium
Selenium is a powerful and widely-used web scraping tool that is primarily designed
for automating web browsers. Unlike traditional scraping tools that deal with static HTML
content, Selenium enables interaction with dynamic content that is generated by JavaScript.
This makes it ideal for scraping web pages where content is loaded asynchronously or
requires user interactions like form submissions, button clicks, or scrolling.
2.3.1 Core Components of Selenium:
• WebDriver: The WebDriver is the core component of Selenium that interacts with
the browser. It can simulate user actions like navigating between pages, clicking
on buttons, or filling out forms.
• Browser Automation: Selenium supports multiple web browsers including
Chrome, Firefox, Safari, and Edge, allowing cross-browser scraping.
• Element Interaction: Selenium allows you to locate and interact with page
elements using various locators such as XPath, CSS Selectors, or element IDs.
Once an element is located, you can perform actions like clicking, typing, or
submitting.
• JavaScript Rendering: Selenium works with real web browsers, allowing it to
render JavaScript and interact with dynamically loaded content.
2.3.2 Benefits of Using Selenium:
• JavaScript Rendering Support: Selenium is particularly useful for scraping content
that is dynamically generated using JavaScript or AJAX. While libraries like
BeautifulSoup only scrape static HTML, Selenium can interact with web pages
that require JavaScript to load content. It can render the page, execute JavaScript,
and extract the necessary data.
• Browser Automation: Selenium can fully automate browser tasks, making it
possible to interact with websites that require form submissions, user
authentication, or complex actions like dropdown selections. This allows for
automated data collection from websites that require these interactions.
• Simulates Real User Behavior: Selenium mimics the way a real user would interact
with a webpage, including clicking buttons, filling out forms, scrolling, and even
handling user input. This capability is especially useful for scraping websites that
detect and block non-human traffic, such as bot protections and CAPTCHA
challenges.
• Cross-Browser Compatibility: Selenium supports several browsers, including
Chrome, Firefox, Safari, and Edge. This cross-browser compatibility ensures that
your scraping scripts will work consistently across different platforms and
environments.
• Handles Complex Websites: Selenium is well-suited for scraping complex websites
that use dynamic content, infinite scrolling, or multiple page interactions. It can
simulate user behaviors like scrolling or clicking through multiple pages to load
additional content, making it an effective tool for scraping modern web
applications.
• Headless Mode: Selenium can run browsers in headless mode, which means the
browser operates without a graphical user interface (GUI). This is useful when
running scraping scripts in a server environment or when you don't need to see the
browser window. Running in headless mode can improve performance and reduce
resource usage.
• Element Location and Interaction: Selenium supports multiple strategies for
locating elements, including using XPath, CSS Selectors, and IDs. This flexibility
allows you to write reliable scraping scripts even for complex and dynamic web
pages.
• Real-Time Debugging: Since Selenium operates with a real browser, it provides the
benefit of real-time debugging. You can visually observe how the browser
interacts with the page, making it easier to troubleshoot issues and understand the
behavior of the webpage.
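A minimal Selenium sketch of the workflow described in this section is given below. It assumes Chrome is installed (Selenium 4 can fetch a matching driver automatically); the URL and CSS selectors are hypothetical.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")          # run without a visible browser window
driver = webdriver.Chrome(options=options)

try:
    driver.get("https://example.com/search?q=laptop")   # hypothetical search page
    # Locate dynamically rendered product cards once the page has loaded
    cards = driver.find_elements(By.CSS_SELECTOR, "div.product-card")
    for card in cards:
        title = card.find_element(By.CSS_SELECTOR, "h2.title").text
        price = card.find_element(By.CSS_SELECTOR, "span.price").text
        print(title, price)
finally:
    driver.quit()                            # always release the browser session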
Figure 3.1
Figure 3.1 presents the high-level design of the project. Its main components are listed below.
1. User Interface Layer
• Components:
o Frontend: accepts the user's product query and displays the results.
o API Endpoint: forwards the query to the backend for processing.
2. Data Collection Layer
• Components:
o Scraping Engine: gathers product data from the e-commerce sites using BeautifulSoup and Selenium.
o Data Preprocessing: handles missing or incorrect data.
3. Analysis Layer
• Components:
o K-Means Clustering: groups products by normalized price and rating.
o Product Scoring: ranks products within each cluster.
4. Prediction Layer
• Components:
o Linear Regression: forecasts future price trends from historical prices.
o Visualization: presents the predicted trends as graphs.
3.2.2 Module 2
As shown in Figure 3.2, the process begins when the user submits a
search query (e.g., a product name) through the web interface. Based on this query,
the system scrapes data from four e-commerce websites. For Amazon and Ajio, which
have dynamic content, the system uses Selenium WebDriver to interact with the page
elements and capture the product details such as title, price, rating, and product link.
For Snapdeal and eBay, which feature static content, BeautifulSoup is used to parse
and extract the required information more efficiently.
Once the data is gathered, it undergoes a cleaning process. Prices are cleaned by
removing any unwanted characters (such as the ₹ symbol and commas) and are then
converted to float values. Ratings are converted to numeric values, with any missing
ratings replaced by a default value (e.g., 0). The next step involves data normalization
using the MinMaxScaler. This ensures that both the price and rating are scaled to a
range between 0 and 1, making the data comparable across products.
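A small sketch of this cleaning and normalization step is shown below, assuming each scraped row is a dictionary with raw price and rating strings; the sample values are illustrative.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

raw = [
    {"title": "Phone A", "price": "₹12,999", "rating": "4.3"},
    {"title": "Phone B", "price": "₹9,499", "rating": None},   # missing rating
]

df = pd.DataFrame(raw)
# Remove the currency symbol and commas, then convert prices to float
df["price"] = (df["price"].str.replace("₹", "", regex=False)
                          .str.replace(",", "", regex=False)
                          .astype(float))
# Convert ratings to numbers, replacing missing values with a default of 0
df["rating"] = pd.to_numeric(df["rating"], errors="coerce").fillna(0)

# Scale price and rating to the 0-1 range so products are comparable
scaler = MinMaxScaler()
df[["price_norm", "rating_norm"]] = scaler.fit_transform(df[["price", "rating"]])
print(df)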
Figure 3.3
The system then applies K-Means clustering to group products based on their
normalized price and rating. Each product is assigned to one of three clusters, and the
system calculates the distance to the centroid of each cluster. This distance, along with
the product’s rating, is used to generate a predicted score for each product. The
predicted score helps rank the products, with those having the highest scores
appearing first.
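A compact sketch of this clustering and scoring step is given below. The scoring rule used here (rating minus distance to the assigned centroid) is only an assumed illustration of the idea, not the exact weighting used in the project.

import numpy as np
from sklearn.cluster import KMeans

# Normalized (price, rating) pairs for a handful of products
X = np.array([[0.10, 0.90], [0.15, 0.80], [0.60, 0.40],
              [0.65, 0.35], [0.95, 0.10], [0.90, 0.20]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
labels = kmeans.labels_

# Distance from each product to the centroid of its own cluster
dists = np.linalg.norm(X - kmeans.cluster_centers_[labels], axis=1)

# Higher rating and smaller distance to the centroid give a higher score
scores = X[:, 1] - dists
ranking = np.argsort(scores)[::-1]          # indices of the best products first
print(labels, scores.round(3), ranking)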
Finally, the best three products from each platform are displayed to the user. For each
platform (Amazon, Ajio, eBay, and Snapdeal), the system shows the product title,
price, rating, and a link to the product. This ranking process allows users to view the
best products across different e-commerce platforms based on price and rating,
optimized using machine learning techniques.
3.2.3 Module 3
Figure 3.3 outlines the systematic process for predicting future price trends using
web scraping and linear regression. This method involves distinct stages, each playing
a crucial role in the overall workflow. Linear regression, a simple yet powerful
machine learning technique, is employed to uncover patterns and relationships in
historical price data. By analyzing past trends, the model determines how prices have
fluctuated over time.
Figure 3.4
The insights generated from these predictions are invaluable. These predictions can be
displayed as graphs, dashboards, or interactive interfaces, making them easy to
interpret and utilize. Visualization tools help highlight key trends, such as price drops,
surges, or seasonal patterns, allowing users to explore and understand the data.
In summary, this workflow combines the automation of web scraping, the analytical
power of linear regression, and the clarity of visualizations to create a robust system
for price trend prediction. This approach not only saves time and effort but also
empowers users to make informed decisions based on reliable, data-driven insights.
3.3 Overall Flow of the Project
Figure 3.4 illustrates the overall flow of the project, which begins with user input, where the
user provides a product query, such as a product name or category. Using web scraping tools
like BeautifulSoup and Selenium, the system retrieves unstructured product data, including
details like titles, prices, and ratings, from e-commerce platforms. This unstructured data is
then converted into structured form for further processing. The structured data undergoes
cleaning and preprocessing using Pandas to handle missing values and ensure consistency.
Once cleaned, the data is analyzed using the K-Means clustering algorithm to extract
insights, such as grouping products based on price and ratings. Finally,
linear regression is applied to predict future prices, enabling refined product rankings and
predictive insights. The process concludes by presenting the results to the user.
Figure 3.5
3.4 Implementation
• User Input
The project begins with the user providing the product title they wish to
analyze. This input is sent to the backend for further processing.
• Scraping Product Data
Based on the product title, relevant product data such as price, ratings, and
reviews are extracted from e-commerce websites. BeautifulSoup is used to
parse static HTML pages for data extraction, while Selenium handles
dynamically rendered pages. Selenium automates browser interactions to
retrieve fully rendered HTML, which is then processed using BeautifulSoup.
• Data Structuring and Cleaning
The scraped data is initially in an unstructured format and is converted into a
structured tabular form using Pandas. Fields such as product name, price,
rating, and reviews are organized for analysis. A data cleaning process is
performed to handle missing or inconsistent values, normalize currencies, and
remove redundant or irrelevant entries, as sketched after this list.
• Analysis Using KMeans Clustering
The cleaned data is analyzed using KMeans clustering to group similar
products based on attributes like price and rating. This process identifies
distinct market segments, such as budget-friendly, mid-range, and premium-
priced products.
• Price Prediction Using Linear Regression
Linear Regression is applied to predict future prices for the specified product
title. Historical pricing data is analyzed, with time as the independent variable
and price as the dependent variable. The model forecasts potential price
trends, providing insights into future pricing.
• Frontend Display:
The results are presented on a user-friendly front end built with HTML, CSS,
and JavaScript. The front end accepts the product title as input and displays
the clustering insights and price predictions in a visually appealing manner.
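The sketch below illustrates the data structuring and cleaning step described above, assuming each scraper returns a list of dictionaries; the USD-to-INR rate is a placeholder value.

import pandas as pd

USD_TO_INR = 83.0   # placeholder conversion rate used for illustration

records = [
    {"platform": "ebay", "title": "Mouse", "price": 12.5, "currency": "USD", "rating": 4.1},
    {"platform": "snapdeal", "title": "Mouse", "price": 999.0, "currency": "INR", "rating": None},
    {"platform": "snapdeal", "title": "Mouse", "price": 999.0, "currency": "INR", "rating": None},  # duplicate
]

df = pd.DataFrame(records)
df = df.drop_duplicates()                        # remove redundant entries
df["rating"] = df["rating"].fillna(0)            # handle missing ratings
# Normalize every price to INR so the platforms are directly comparable
df.loc[df["currency"] == "USD", "price"] *= USD_TO_INR
df["currency"] = "INR"
print(df)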
3.5 Pseudocodes
3.5.1 Module 1: Scraping the Data using bs4 and Selenium
IMPORT necessary libraries (requests, BeautifulSoup, selenium, etc.)
DEFINE constants for user-agent headers
DEFINE initialize_driver() -> return a WebDriver configured for headless browsing

DEFINE scrape_amazon(query):
    - Replace spaces with '+'
    - Initialize the driver and load the Amazon search page
    - Extract product details (title, price, rating, link)
    - Move to the next page if available
    - Save scraped products to the database

DEFINE scrape_ebay(query):
    - Replace spaces with '+'
    - Send a request to the eBay search URL
    - Parse the response and extract product details (title, price, rating, link)
    - Convert USD prices to INR using usd_to_inr()
    - Save scraped products to the database

DEFINE scrape_snapdeal(query):
    - Replace spaces with '+'
    - Send a request to the Snapdeal search URL
    - Parse the response and extract product details (title, price, rating, link)
    - Save scraped products to the database

DEFINE scrape_ajio(query):
    - Replace spaces with '+'
    - Initialize the driver and load the Ajio search page
    - Wait for products to load and extract product details
    - Save scraped products to the database

DEFINE is_valid_rating(rating):
    - Try to convert the rating to float
    - If valid, return True, else return False

FUNCTION scrape_products(request):
    - Retrieve the search query from the request
    IF query is provided:
        - Scrape products from each platform (Amazon, eBay, Snapdeal, Ajio)
        - Rank the scraped products from each platform using rank_products_ml
        - Extract the best 3 products from each platform
        - Return the rendered page with the context containing product data
    ELSE:
        - Return a rendered page with an error message ("No query provided")
    END IF
END
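As a concrete illustration, a hedged sketch of the scrape_products view is given below. It assumes a Django project in which the scraper functions and rank_products_ml are implemented in separate modules; the module paths, template names, and query parameter are hypothetical and simply mirror the pseudocode.

from django.shortcuts import render

from .scrapers import scrape_ajio, scrape_amazon, scrape_ebay, scrape_snapdeal
from .ranking import rank_products_ml


def scrape_products(request):
    query = request.GET.get("query", "").strip()
    if not query:
        # No query supplied: re-render the search page with an error message
        return render(request, "search.html", {"error": "No query provided"})

    context = {}
    scrapers = {
        "amazon": scrape_amazon,
        "ebay": scrape_ebay,
        "snapdeal": scrape_snapdeal,
        "ajio": scrape_ajio,
    }
    for platform, scraper in scrapers.items():
        products = scraper(query)            # list of product dictionaries
        ranked = rank_products_ml(products)  # highest predicted score first
        context[platform] = ranked[:3]       # keep the best three products

    return render(request, "results.html", context)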
3.5.3 Module 3: Future Price Prediction using Linear Regression
FUNCTION visualize_price_data(prices, labels=None, future_steps=10):
    - Create a DataFrame from the `prices` list
    - Add an additional column:
        - Use `labels` for the X-axis if provided
        - Otherwise, use default indices as labels
    - Train a Linear Regression model:
        - Use the `label` column as the features (X)
        - Use the `price` column as the target (y)
    - Predict future prices:
        - Generate future labels starting from the last index
        - Use the model to predict prices for the next `future_steps`
    - Plot the data:
        - Plot actual prices vs. labels as a blue line with markers
        - Plot predicted future prices as a red dashed line
        - Add axis labels, a title, and rotate the X-axis labels for readability
    - Save the plot:
        - Save the plot image to a temporary buffer in PNG format
        - Encode the image in Base64 format
    - RETURN:
        - The Base64-encoded image string of the plot
        - The predicted future prices
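A runnable Python sketch of this pseudocode is given below. It assumes numeric labels (for example, day indices) on the X-axis and requires numpy, scikit-learn, and matplotlib; variable names are chosen to mirror the pseudocode.

import base64
import io

import matplotlib
matplotlib.use("Agg")          # render off-screen, no GUI needed
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression


def visualize_price_data(prices, labels=None, future_steps=10):
    labels = np.arange(len(prices)) if labels is None else np.asarray(labels)
    X = labels.reshape(-1, 1)                      # features: time index
    y = np.asarray(prices, dtype=float)            # target: price

    model = LinearRegression().fit(X, y)
    future_labels = np.arange(labels[-1] + 1, labels[-1] + 1 + future_steps)
    future_prices = model.predict(future_labels.reshape(-1, 1))

    # Plot the actual prices and the predicted future trend
    plt.figure(figsize=(8, 4))
    plt.plot(labels, y, "bo-", label="Actual price")
    plt.plot(future_labels, future_prices, "r--", label="Predicted price")
    plt.xlabel("Time")
    plt.ylabel("Price")
    plt.title("Price trend prediction")
    plt.xticks(rotation=45)
    plt.legend()

    # Save the figure to an in-memory buffer and encode it as Base64
    buf = io.BytesIO()
    plt.savefig(buf, format="png", bbox_inches="tight")
    plt.close()
    encoded = base64.b64encode(buf.getvalue()).decode("ascii")
    return encoded, future_prices.tolist()


# Example usage with a small synthetic price history
image_b64, forecast = visualize_price_data([100, 102, 101, 105, 107], future_steps=5)
print(forecast)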
CHAPTER 4
RESULTS AND SNAPSHOTS
Figure 4.1
This page depicts the home page of our project, which includes a description of the project.
Figure 4.2
This image shows the different cities covered, with brief descriptions, on the home page.
Figure 4.3
This graph shows the overall analysis of the preferred cities.
Figure 4.4
This graph shows the overall analysis of the cost of living in the cities covered.
Figure 4.5
This graph shows the overall analysis of spaciousness of the cities covered.
Figure 4.6
This graph shows the overall analysis of affordability of the cities covered.
Figure 4.7
This image shows the analysis and graph for Mumbai.
Figure 4.8
This image shows the analysis and graph for Kolkata.
Figure 4.9
This image shows the analysis and graph for Delhi.
Figure 4.10
This image shows the analysis and graph for Pune.
Figure 4.11
This image shows the analysis and graph for Chennai.
Figure 4.12
This image shows the analysis and graph for Bangalore.
Figure 4.13
This image shows the analysis and graph for Ahmedabad.
Figure 4.14
This image displays the predict page, where input is taken from the user.
Figure 4.15
This image shows the result page, which displays the prediction for the chosen city.
CHAPTER 5
CONCLUSION & FUTURE SCOPE
5.1 Conclusion
House price prediction using machine learning is a valuable tool for the real estate market.
By analyzing factors like location, size, and property features, machine learning models can
accurately estimate property prices. This helps buyers, sellers, and investors make better
decisions based on data. The success of the system depends on using good quality data,
selecting the right features, and updating the model regularly to reflect market changes.
Overall, it shows how technology can make price estimation more efficient, reliable, and
useful for everyone involved.
5.2 Future Scope
• Offer personalized product recommendations based on user behaviour.
• Provide real-time price tracking, stock updates, and trending product insights.
• Predict market trends and optimal pricing using advanced analytics.
• Launch a mobile app with features like voice and image search for convenience.
• Expand to support international markets with multi-currency options.
• Deliver competitive analysis and insights to help businesses make strategic decisions.
• Use interactive dashboards for clear and actionable data visualization.
CHAPTER 6
REFERENCES
[1] A. Shaikh, A. Sonmali, and S. Wakade, "Product Comparison Website using Web Scraping
and Machine Learning," Department of Information Technology, Atharva College of
Engineering, Mumbai, Maharashtra, India.
[2] V. Srividhya and P. Megala, "Scraping and Visualization of Product Data from E-
commerce Websites," Dept of Computer Science, Avinashilingam Institute for Home Science
and Higher Education for Women, Coimbatore, Tamilnadu, India.
[3] C. Lotfi, S. Srinivasan, M. Ertz, and I. Latrous, "Web Scraping Techniques and
Applications: A Literature Review," Labo NFC, University of Quebec at Chicoutimi, 555
Boulevard de l’Université, Saguenay (QC), Canada.
[4] M. Fraňo, "Web Scraping as a Data Source for Machine Learning Models and the
Importance of Preprocessing Web Scraped Data," University of Twente, The Netherlands.
[5] B. Zhao, "Web Scraping," College of Earth, Ocean, and Atmospheric Sciences, Oregon
State University, Corvallis, OR, USA.
[6] S. C. M. de S. Sirisuriya, "Importance of Web Scraping as a Data Source for Machine
Learning Algorithms - Review," Department of Computer Science, Faculty of Computing,
General Sir John Kotelawala Defence University, Sri Lanka.
[7] R. Praba, G. Darshan, K. T. Roshanraj, and B. Surya Prakash, "Study On Machine
Learning Algorithms," Department of Information Technology, Dr. N.G.P. Arts and Science
College, Coimbatore, Tamil Nadu, India.
[8] T. S. Singh and H. Kaur, "Review Paper on Django Web Development," Department of
Computer Application, Rayat Bahra University.
[9] T. Grobmann and M. Dobler, The Data Visualization Workshop, Packt Publishing, ISBN
9781800568112.
[10] https://docs.python.org/3/
[11] https://pypi.org/project/beautifulsoup4/
[12] https://pypi.org/project/requests/
[13] https://www.djangoproject.com/
[14] https://www.selenium.dev/documentation/
[15] https://docs.djangoproject.com/en/stable/
[16] A. Holovaty and J. Kaplan-Moss, The Definitive Guide to Django: Web Development
Done Right, 2nd ed., Springer-Verlag Berlin and Heidelberg GmbH & Co. KG, 2009.
[17] S. Sridhar and M. Vijayalakshmi, Machine Learning, Oxford University Press, 2021.
[18] https://scikit-learn.org/stable/documentation.html
[19] https://matplotlib.org/stable/contents.html