International Journal of Electrical and Computer Engineering (IJECE)

Vol. 15, No. 1, February 2025, pp. 711~718

ISSN: 2088-8708, DOI: 10.11591/ijece.v15i1.pp711-718

Analysis of big data from New York taxi trip 2023: revenue
prediction using ordinary least squares solution and limited-
memory Broyden-Fletcher-Goldfarb-Shanno algorithms

Sara Rhouas, Norelislam El Hami

Engineering Science Laboratory, National School of Applied Sciences, Ibn Tofail University, Kenitra, Morocco

Article Info

Article history:
Received Jun 14, 2024
Revised Sep 6, 2024
Accepted Oct 1, 2024

Keywords:
Big data
Data analysis
Linear regression
Machine learning
Spark

ABSTRACT
This study explores the prediction of taxi trip fares using two linear
regression methods: normal equations (ordinary least squares solution
(OLS)) and limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS).
Utilizing a dataset of New York City yellow taxi trips from 2023, the
analysis involves data cleaning, feature engineering, and model training. The
data consists of over 12 million records and is managed and processed in
Spark, whose driver and executor memory are configured to efficiently
process the Parquet-format data stored on the Hadoop Distributed File System
(HDFS). Key features influencing fare amount, such as passenger count, trip
distance, fare amount, and tip amount, were analyzed for correlation. Models
were trained on an 80-20 train-test split, and their performance was
evaluated using root-mean-square error (RMSE) and mean squared error
(MSE). Results show that both methods provide comparable accuracy, with
slight differences in coefficients and training time. Additionally, vendor
performance metrics, including total trips, average trip distance, fare
amount, and tip amount, were analyzed to reveal trends and inform strategic
decisions for fleet management. This comprehensive analysis demonstrates
the efficacy of linear regression techniques in predicting taxi fares and offers
valuable insights for optimizing taxi operations.
This is an open access article under the CC BY-SA license.

Corresponding Author:
Rhouas Sara
Engineering Science Laboratory, National School of Applied Sciences, Ibn Tofail University
Av. de L'Université, Kénitra, Morocco
Email: rhouas.sara@gmail.com

1. INTRODUCTION
Predicting taxi trip fares accurately is a critical task for both fleet operators and passengers in urban
transportation systems. With the advent of big data and advanced analytical tools, it is now possible to
leverage extensive datasets to gain insights into fare determinants and improve fare prediction models [1].
This study focuses on utilizing linear regression techniques to predict taxi trip fares using data from New
York City's yellow taxi fleet for the entire year of 2023 [2]. By comparing two prominent regression
methods, the normal-equations solution of ordinary least squares (OLS) and limited-memory Broyden-Fletcher-
Goldfarb-Shanno (L-BFGS), we aim to identify the most effective approach for fare prediction. Accurate fare
predictions can enhance operational efficiency, optimize pricing strategies, and improve customer
satisfaction by providing transparent and predictable fare estimates [3], [4].
New York City’s yellow taxi dataset provides a rich source of information, encompassing millions
of trip records with diverse attributes such as trip distance, passenger count, fare amount, tip amount, and
temporal details [5]. The large volume of data allows for a detailed analysis of fare determinants and the
development of robust predictive models. However, the presence of null values and outliers necessitates
rigorous data cleaning and preprocessing. This study systematically addresses these challenges, ensuring the
integrity and reliability of the dataset. Feature engineering techniques are employed to extract meaningful
insights from the data, such as temporal patterns in trip frequencies and fare variations across different
vendors [6], [7].
In addition to building predictive models, this study conducts a comprehensive correlation analysis
to understand the relationships between various trip attributes and the fare amount. By examining these
correlations, we identify the most significant features influencing fare predictions. The performance of the
regression models is evaluated using metrics such as root-mean-square error (RMSE) and mean squared error
(MSE), providing a quantitative measure of their accuracy. Furthermore, the study delves into vendor
performance analysis, comparing key performance indicators like total trips, average trip distance, fare
amount, and tip amount across different vendors. This holistic approach not only highlights the effectiveness
of linear regression techniques in fare prediction but also offers valuable insights into vendor operations,
contributing to the overall optimization of taxi services in New York City [8].

Journal homepage: http://ijece.iaescore.com

2. METHOD
Our approach to analyzing New York City taxi trip data in 2023 combines the Apache Spark
platform and linear regression models for fare prediction. Spark handles large datasets, enabling efficient data
cleaning, transformation, and analysis. After loading the data from Parquet files and filtering invalid records,
we use linear regression to predict fares based on features like passenger count, trip distance, fare amount, and
tip amount. We implement two solvers for linear regression: the normal equations, suited to smaller problems,
and L-BFGS, suited to large, high-dimensional ones [9]. The data is split into training and test sets to evaluate performance using RMSE
and MSE. We also assess taxi vendors' performance by analyzing metrics such as trip counts, average
distance, fare, and tips, visualized through bar charts to highlight performance differences. This integrated
approach enhances taxi service efficiency and supports strategic decision-making in transportation [10].

2.1. Work methodology

In this section, we will explore the various methodologies and tools employed to handle big data,
focusing on techniques that enable efficient processing and analysis of large datasets. We will delve into the
application of linear regression in machine learning, discussing how different approaches, such as ordinary
least squares (OLS) and limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS), can be utilized to
optimize model parameters for predictive accuracy. Additionally, we will examine the metrics used for
evaluating model performance, shedding light on how they measure the effectiveness of predictive models,
identify areas for improvement, and ensure that the chosen algorithms align with the goals of data-driven
decision-making. Through this exploration, we aim to provide a comprehensive understanding of how big
data processing tools, such as Apache Spark, and linear regression techniques can be leveraged to build,
optimize, and evaluate predictive models.

2.1.1. Tools for handling big data

Our approach to managing big data relies on Apache Spark, a distributed computing system known
for its efficiency and scalability. Apache Spark excels in processing large datasets by distributing tasks across
a cluster of computers, which enables parallel processing. This capability significantly reduces data
processing time compared to traditional single-machine methods [11].
Apache Spark uses resilient distributed datasets (RDDs) to ensure fault tolerance and enhance
performance. RDDs can be cached in memory, allowing iterative algorithms to reuse intermediate results across
multiple computations. This feature greatly speeds up machine learning algorithms and other tasks that
require multiple data passes [12].
Spark's unified analytics engine supports diverse data processing needs, including batch processing,
real-time stream processing, and machine learning. It includes several specialized libraries, such as Spark
SQL for SQL queries, Spark Streaming for real-time data, MLlib for machine learning, and GraphX for graph
processing. These libraries extend Spark's functionality and make it versatile for various data tasks [13].
One of Spark's notable advantages is its in-memory computing capability, which allows for rapid
data processing by keeping data in memory rather than on disk [14]. This feature is particularly beneficial for
iterative algorithms and interactive data exploration [15]. Additionally, Spark's user-friendly APIs in Java,
Scala, Python, and R simplify the creation of complex workflows and data pipelines while providing
advanced controls for experienced users. Spark's efficient processing and versatile capabilities make it
essential for modern data analytics and machine learning [16], [17].

2.1.2. Linear regression in machine learning

Linear regression is one of the most fundamental and widely used techniques in machine learning
for predicting a continuous target variable based on one or more predictor variables [18]. At its core, linear
regression aims to model the relationship between the dependent variable (the target) and the independent
variables (the predictors) by fitting a linear equation to observed data. The primary objective of linear
regression is to determine the optimal values for the coefficients of this equation such that the sum of the squared
differences between the observed actual values and the values predicted by the linear model (known as the
residual sum of squares) is minimized. In our study, we utilized two specific methods to perform linear
regression: OLS and the L-BFGS algorithm [19].
The ordinary least squares (OLS) method is a fundamental approach in linear regression used to
estimate the coefficients that minimize the residual sum of squares between the observed values and the
values predicted by the model. The goal is to find the best-fit line that captures the relationship between the
independent variables (predictors) and the dependent variable (target) [20]. The OLS solution is derived
using the normal equations (1):

$$\beta = (X^{T} X)^{-1} X^{T} y \tag{1}$$

where $\beta$ represents the vector of coefficients, $X$ is the matrix of input features (including a column of ones
for the intercept term), $y$ is the vector of observed values, and $X^{T}$ is the transpose of $X$. This method
provides an exact solution by solving the above equation, making it straightforward and computationally
efficient for smaller datasets. However, for very large datasets, the matrix inversion can become
computationally expensive, which is a limitation of this approach [21].
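As a minimal illustration of equation (1), the sketch below forms $X^{T}X$ and $X^{T}y$ for a tiny synthetic dataset and solves the resulting linear system by Gaussian elimination instead of computing an explicit inverse. The function names and the data are assumptions made for this example; they are not the implementation used in the study, which relies on Spark.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; features may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols_normal_equations(features, targets):
    """Return [intercept, coef_1, ...] solving (X^T X) beta = X^T y."""
    rows = [[1.0] + list(f) for f in features]  # prepend the column of ones
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Synthetic data generated from y = 2 + 3x (illustrative only).
features = [[0.0], [1.0], [2.0], [3.0], [4.0]]
targets = [2.0, 5.0, 8.0, 11.0, 14.0]
beta = ols_normal_equations(features, targets)
print(beta)  # intercept close to 2, slope close to 3
```

The cost is dominated by building and solving a p-by-p system, where p is the number of coefficients, which is why the closed form is attractive when there are few features even if the number of rows is large.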
The L-BFGS algorithm is an iterative optimization technique particularly well-suited for large-scale
and high-dimensional datasets [22]. It is a variant of the BFGS algorithm that uses limited memory to
approximate the inverse Hessian matrix, which determines the search direction at each iteration of the
optimization. The iterative process follows the update rule (2):

$$\beta_{k+1} = \beta_{k} - \alpha_{k} H_{k}^{-1} \nabla f(\beta_{k}) \tag{2}$$

where $\beta_{k}$ is the coefficient vector at iteration $k$, $\alpha_{k}$ is the step size (learning rate), $H_{k}^{-1}$ is the inverse Hessian
matrix approximation at iteration $k$, and $\nabla f(\beta_{k})$ is the gradient of the cost function at $\beta_{k}$. Unlike the OLS
method, L-BFGS does not require matrix inversion, making it more scalable and efficient for handling large
datasets. It iteratively adjusts the coefficients by following the gradient of the cost function, gradually
converging to the optimal solution. This makes L-BFGS particularly advantageous for scenarios where the
dataset size or the number of features is large [23]. Despite the simplicity and interpretability of linear
regression, it is essential to evaluate the underlying assumptions—such as linearity, independence,
homoscedasticity (constant variance of errors), and normality of error terms—to ensure the validity and
reliability of the model’s predictions. By carefully selecting the appropriate method and validating the
assumptions, linear regression remains a powerful tool for understanding and predicting the relationships
within the data across various domains [24].
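The update in equation (2) can be sketched in a few dozen lines of plain Python. The version below is a simplified L-BFGS with the standard two-loop recursion and a backtracking line search, applied to a least-squares cost on synthetic data. The memory size, tolerances, and data are assumptions chosen for illustration, and this is not Spark's implementation.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def lbfgs(f, grad, x0, memory=5, max_iter=200, tol=1e-8):
    """Minimize f with a simplified limited-memory BFGS iteration."""
    x = list(x0)
    g = grad(x)
    pairs = []  # recent (s, y) pairs, oldest first
    for _ in range(max_iter):
        if dot(g, g) ** 0.5 < tol:
            break
        # Two-loop recursion: r approximates H_k^{-1} * gradient.
        q = list(g)
        alphas = []
        for s, y in reversed(pairs):
            a = dot(s, q) / dot(y, s)
            alphas.append(a)
            q = [qi - a * yi for qi, yi in zip(q, y)]
        if pairs:
            s, y = pairs[-1]
            scale = dot(s, y) / dot(y, y)  # initial inverse-Hessian scaling
        else:
            scale = 1.0
        r = [scale * qi for qi in q]
        for (s, y), a in zip(pairs, reversed(alphas)):
            b = dot(y, r) / dot(y, s)
            r = [ri + (a - b) * si for ri, si in zip(r, s)]
        direction = [-ri for ri in r]
        # Backtracking line search (Armijo condition) picks the step size.
        step = 1.0
        fx = f(x)
        slope = dot(g, direction)
        x_new = [xi + step * di for xi, di in zip(x, direction)]
        while f(x_new) > fx + 1e-4 * step * slope and step > 1e-12:
            step *= 0.5
            x_new = [xi + step * di for xi, di in zip(x, direction)]
        g_new = grad(x_new)
        s = [a - b for a, b in zip(x_new, x)]
        y = [a - b for a, b in zip(g_new, g)]
        if dot(s, y) > 0.0:  # keep only pairs satisfying the curvature condition
            pairs.append((s, y))
            pairs = pairs[-memory:]
        x, g = x_new, g_new
    return x


# Least-squares cost for y = b0 + b1 * x on synthetic data (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0, 5.0, 8.0, 11.0, 14.0]
n = len(xs)


def cost(b):
    return sum((b[0] + b[1] * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * n)


def gradient(b):
    residuals = [b[0] + b[1] * x - y for x, y in zip(xs, ys)]
    return [sum(residuals) / n, sum(r * x for r, x in zip(residuals, xs)) / n]


beta = lbfgs(cost, gradient, [0.0, 0.0])
print(beta)  # intercept close to 2, slope close to 3
```

Only the last few `(s, y)` pairs are stored, so the memory footprint grows with the number of coefficients times `memory` instead of with the square of the number of coefficients.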

2.1.3. Scoring metrics

To fit the linear regression model using these methods, we first prepare the data by consolidating the
selected features into a single vector using a VectorAssembler. The dataset is then divided into training and
test sets, which allows us to evaluate the model's performance. Evaluation metrics such as RMSE and MSE
are used to assess how well the model generalizes to unseen data. These metrics are essential for determining
the accuracy of our predictions, offering insights into the model’s effectiveness and its ability to handle new
data [25].
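The preparation and evaluation steps described above might look roughly like the following PySpark sketch. It assumes an already cleaned DataFrame `df` and a label column named `total_amount`; the function name, the seed, and the column names are illustrative assumptions, not the authors' code. The `solver` values `"normal"` and `"l-bfgs"` are the MLlib options that correspond to the two methods compared in this study.

```python
import time

SOLVERS = ("normal", "l-bfgs")


def train_and_evaluate(df, features, label="total_amount", seed=42):
    """Fit one linear regression per solver and report test-set metrics."""
    # Imported lazily so this module can be loaded without a Spark installation.
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    # Consolidate the selected feature columns into a single vector column.
    assembler = VectorAssembler(inputCols=features, outputCol="features")
    data = assembler.transform(df).select("features", label)

    # 80-20 train-test split, as described in the abstract.
    train, test = data.randomSplit([0.8, 0.2], seed=seed)

    results = {}
    for solver in SOLVERS:
        lr = LinearRegression(featuresCol="features", labelCol=label, solver=solver)
        start = time.perf_counter()
        model = lr.fit(train)
        elapsed = time.perf_counter() - start

        predictions = model.transform(test)
        metrics = {}
        for name in ("rmse", "mse"):
            evaluator = RegressionEvaluator(
                labelCol=label, predictionCol="prediction", metricName=name
            )
            metrics[name] = evaluator.evaluate(predictions)

        results[solver] = {
            "seconds": elapsed,
            "coefficients": model.coefficients.toArray().tolist(),
            "intercept": model.intercept,
            **metrics,
        }
    return results
```

Calling `train_and_evaluate(df, ["passenger_count", "trip_distance", "fare_amount", "tip_amount"])` would return one entry per solver with the quantities compared in the results section: training time, RMSE, MSE, coefficients, and intercept.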
MSE is a widely used metric for evaluating the accuracy of predictive models. It quantifies the mean
of the squared differences between predicted and observed values. MSE essentially measures the average
magnitude of the squared deviations across all data points, providing a detailed assessment of model
performance. This metric is valuable for understanding the overall quality of the model's predictions, as it
captures the extent of prediction errors in a continuous manner [26].
RMSE is another key metric that offers a straightforward measure of prediction error. By taking the
square root of the MSE, RMSE presents an error metric that maintains the same units as the target variable,
making it more intuitive. RMSE places a higher emphasis on larger errors due to the squaring of differences,
which means it penalizes significant deviations more. This characteristic makes RMSE particularly useful for
understanding the model's performance with respect to outlier predictions [27].
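To make the two definitions concrete, here is a small sketch that computes MSE and RMSE for a handful of made-up fares. The values are illustrative only and are not taken from the dataset.

```python
import math


def mse(observed, predicted):
    """Mean of the squared differences between observed and predicted values."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def rmse(observed, predicted):
    """Square root of the MSE, expressed in the units of the target variable."""
    return math.sqrt(mse(observed, predicted))


# Hypothetical fares in dollars.
observed = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 33.0]

print(mse(observed, predicted))   # (4 + 4 + 9) / 3
print(rmse(observed, predicted))  # square root of the value above
```

Because the errors are squared before averaging, the single error of 3 contributes more than half of the total here, which is the sensitivity to large deviations described above.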

2.2. Application method

In this analysis, linear regression, run on
Spark's distributed computing platform, was employed to predict taxi trip revenues in New York City for
the year 2023. Leveraging Spark's powerful data processing platform, the analysis aimed to provide accurate
revenue predictions by incorporating key features such as passenger count, trip distance, fare amount, and tip
amount. The utilization of linear regression, a well-established and interpretable modeling technique, ensured
a comprehensive and effective approach to revenue prediction. Furthermore, Spark's distributed computing
capabilities enabled the efficient handling of large-scale datasets, allowing for timely and accurate
predictions even with massive amounts of data.

2.2.1. Data used

The dataset utilized in this analysis consisted of New York City taxi trip data for the year 2023,
sourced from Parquet files. These files contain detailed information about taxi trips, including attributes such
as pickup datetime, passenger count, trip distance, fare amount, tip amount, and total amount. The dataset
was meticulously cleaned and preprocessed to ensure data quality and reliability for subsequent analysis.
Invalid records and missing values were filtered out, and the pickup datetime column was cast to a date type
for temporal analysis. This refined dataset served as the foundation for building and training the linear
regression model for revenue prediction [28].
Through exploratory data analysis and feature engineering, insights were extracted from the dataset
to enhance the predictive model's performance. Key features such as passenger count, trip distance, fare
amount, and tip amount were identified based on their potential impact on trip revenues. These features were
then used to train the linear regression model, which served as the predictive engine for estimating taxi trip
revenues. By leveraging Spark's distributed computing capabilities, the model was able to efficiently process
and analyze large-scale datasets, providing stakeholders with accurate and timely revenue predictions.

2.2.2. Process
This process outlines a data-driven approach for predicting taxi trip revenues in New York City for
2023. It begins with data loading and initial processing, where Spark is configured for efficient handling of
large datasets. The data, stored in Parquet format on the Hadoop Distributed File System (HDFS), is verified,
loaded, and combined into a single DataFrame for the entire year. Following this, data cleaning and
preprocessing ensure the dataset's integrity by removing rows with null or invalid values, reducing the data to
37,000,870 rows. Feature selection and engineering identify key insights, including peak operational periods
and feature correlations, setting the stage for model training.
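A hedged sketch of the loading and cleaning steps is shown below. The paper does not report its memory settings, HDFS path, or filter thresholds, so the values here are placeholders. The column names follow the public NYC TLC yellow taxi schema and are assumed, not confirmed by the text.

```python
# Placeholder values: none of these are reported in the paper.
HDFS_PATH = "hdfs:///data/yellow_tripdata_2023/*.parquet"  # hypothetical location
FEATURES = ["passenger_count", "trip_distance", "fare_amount", "tip_amount"]
LABEL = "total_amount"  # assumed target column


def build_session(driver_memory="8g", executor_memory="8g"):
    """Create a Spark session with explicit driver and executor memory."""
    # Imported lazily so this module can be loaded without a Spark installation.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName("nyc-taxi-2023")
        .config("spark.driver.memory", driver_memory)
        .config("spark.executor.memory", executor_memory)
        .getOrCreate()
    )


def load_and_clean(spark, path=HDFS_PATH):
    """Load the Parquet files and drop rows with null or invalid values."""
    from pyspark.sql import functions as F

    df = spark.read.parquet(path)
    required = FEATURES + [LABEL, "VendorID", "tpep_pickup_datetime"]
    df = df.dropna(subset=required)
    # The validity rules below are assumptions, not the authors' exact filters.
    df = df.filter(
        (F.col("passenger_count") > 0)
        & (F.col("trip_distance") > 0)
        & (F.col("fare_amount") > 0)
        & (F.col("tip_amount") >= 0)
        & (F.col(LABEL) > 0)
    )
    # Cast the pickup timestamp to a date for temporal analysis.
    return df.withColumn("pickup_date", F.to_date("tpep_pickup_datetime"))
```

If the monthly files differ in column types, reading them one by one, casting to a common schema, and combining them with `unionByName` may be needed instead of a single glob read.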
Model training and evaluation involves splitting the data into training and testing sets and applying
two linear regression methods—normal equations and L-BFGS. The models are evaluated using RMSE and
MSE metrics for accuracy. Performance evaluation and visualization examine feature impacts and vendor
metrics, such as trip counts, average fares, and tips, while insights and decision-making leverage these results
to optimize taxi operations and enhance customer satisfaction. This structured analysis offers actionable
insights to improve service efficiency and profitability.

3. RESULTS AND DISCUSSION

The analysis focuses on evaluating the performance of two major taxi vendors in New York City.
Using comprehensive trip data, key metrics such as the total number of trips, average trip distance, average
fare amount, and average tip amount are analyzed to assess each vendor's operational efficiency and market
positioning. The following sections provide a detailed examination of these metrics, highlighting the
strengths and weaknesses of vendor 1 and vendor 2.

3.1. Performance analysis for each vendor

In this analysis, we examine the performance of two major taxi vendors in New York City using key
metrics derived from comprehensive trip data. By evaluating the total number of trips, average trip distance,
average fare amount, and average tip amount, we aim to understand the operational efficiency and market
positioning of each vendor. The data spans a significant period and provides a robust foundation for
comparing these vendors' effectiveness in meeting passenger demand and generating revenue. The following
paragraphs delve into each metric, offering insights into the strengths and weaknesses of vendor 1 and
vendor 2.
As shown in Figure 1, vendor 2 demonstrates a significantly higher volume of total trips compared
to vendor 1. Specifically, vendor 2 recorded 27,471,887 trips, whereas vendor 1 recorded 9,528,983 trips.
This disparity indicates that vendor 2 has a larger share of the market, which could be due to a variety of
factors such as a more extensive fleet, more efficient dispatch and routing systems, or stronger brand
recognition. The higher trip volume also suggests that vendor 2 is better at meeting passenger demand and
potentially has wider operational coverage across New York City. This large volume of trips provides vendor
2 with a robust revenue base and enhances its ability to generate significant income from a high number of
service transactions.
In Figure 2 we can see that the average trip distance for vendor 2 is slightly longer than that for
vendor 1, with vendor 2 averaging 3.64 miles per trip and vendor 1 averaging 3.42 miles. While the
difference may seem minimal, it has important implications for revenue. Longer trips typically result in
higher fares, contributing more significantly to total revenue. Vendor 2’s slightly longer average trip distance
could indicate that they serve areas with greater distances between common pick-up and drop-off points or
that they attract trips that tend to cover more distance. This could be a result of strategic operational decisions
or a focus on areas with higher fare potential. The longer trip distances might also suggest that vendor 2 has a
higher proportion of trips to and from major hubs like airports or business districts, which typically involve
greater distances.
Vendor 2 also outperforms vendor 1 in terms of average fare amount, with an average fare of $19.67
compared to vendor 1’s $18.71. This difference in fare amounts is likely linked to the longer average trip
distances mentioned earlier. Higher average fares not only boost per-trip revenue but also suggest that vendor
2 may be operating more in premium segments of the market where passengers are willing to pay more for
better service or convenience. Additionally, the higher fares could be a result of effective dynamic pricing
strategies, where vendor 2 adjusts prices based on demand and supply conditions to maximize revenue. This
ability to command higher fares strengthens vendor 2’s overall financial performance and competitive
advantage in the market.
The average tip amount is another area where vendor 2 leads, with an average tip of $3.65 compared
to vendor 1’s $3.26. Tips are often indicative of customer satisfaction and service quality. The higher average
tips for vendor 2 suggest that passengers perceive the service quality to be better or feel more satisfied with
their rides. This could be due to various factors such as cleaner vehicles, more courteous drivers, better ride
experiences, or more reliable service. Higher tips contribute directly to the drivers' earnings and can also
boost overall driver morale and retention. From a business perspective, higher tips indicate a positive
customer experience, which is crucial for customer loyalty and repeat business.
Vendor 2’s higher trip volume, longer average trip distance, higher average fare amount, and greater
average tip amount collectively paint a picture of a more dominant and financially successful operator. The
higher trip volume indicates a larger operational scale and better market penetration, while the longer trip
distances and higher fare amounts suggest a focus on higher-value segments of the market. The greater
average tips reflect superior service quality, leading to higher customer satisfaction and loyalty. These factors
combined position vendor 2 as a more robust and competitive player in New York City's taxi industry, with a
stronger ability to generate revenue and sustain long-term growth compared to vendor 1.
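The figures quoted in this section can be put side by side with a few lines of arithmetic. The sketch below uses only the numbers reported above; the derived ratios are ratios of averages, not averages of per-trip ratios, so they should be read as rough indicators only.

```python
# Vendor metrics as reported in this section.
vendors = {
    1: {"trips": 9_528_983, "avg_distance": 3.42, "avg_fare": 18.71, "avg_tip": 3.26},
    2: {"trips": 27_471_887, "avg_distance": 3.64, "avg_fare": 19.67, "avg_tip": 3.65},
}

total_trips = sum(v["trips"] for v in vendors.values())

summary = {}
for vendor_id, v in vendors.items():
    summary[vendor_id] = {
        "trip_share": v["trips"] / total_trips,
        "fare_per_mile": v["avg_fare"] / v["avg_distance"],  # ratio of averages
        "tip_to_fare": v["avg_tip"] / v["avg_fare"],         # ratio of averages
    }

for vendor_id, s in summary.items():
    print(
        f"vendor {vendor_id}: share={s['trip_share']:.1%}, "
        f"fare/mile={s['fare_per_mile']:.2f}, tip/fare={s['tip_to_fare']:.1%}"
    )
```

The two trip counts sum to 37,000,870, which matches the number of rows retained after cleaning.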

Figure 1. Total trips per vendor ID
Figure 2. Average trip distance per vendor ID

3.2. Analysis of the regression performances

In analyzing yellow taxi trip fare prediction, linear regression models were employed to understand
the impact of various factors on the total fare. Two methods, OLS and L-BFGS, were used to build these
models. Both methods offer distinct advantages in terms of computational efficiency and scalability, making
them suitable for different contexts depending on the size and complexity of the dataset. This section delves
into the results obtained from both regression methods, providing a detailed comparison of their performance
metrics, computational requirements, and the significance of the derived coefficients. By examining the
coefficients and their implications, we gain insights into the primary drivers of taxi fares, enhancing our
understanding of fare structures and customer behaviors.
As shown in Table 1, both the OLS and L-BFGS linear regression models yielded nearly identical
coefficients, demonstrating the robustness of the findings. The coefficient for passenger count is
approximately 0.0702, indicating that each additional passenger has a small but positive impact on the total
fare. This suggests that while having more passengers slightly increases the fare, their influence is relatively
minimal compared to other factors. The trip distance coefficient, around 0.0010, also shows a very small
impact on the total fare, indicating that trip distance contributes marginally to fare calculations. This small
impact might reflect a fare structure where fixed costs or time-based charges are more significant than
distance, potentially due to minimum fare policies or the inclusion of initial service fees that overshadow the
distance-based component.

Table 1. The results of each method

Metric                       OLS               L-BFGS
Training time (seconds)      37.96             122.69
RMSE                         4.691598018       4.6915980186
MSE                          22.01109196       22.01109197
Passenger count coefficient  0.070184284       0.070184285
Trip distance coefficient    0.0009725934876   0.00097259340671
Fare amount coefficient      1.0036740054      1.0036740051
Tip amount coefficient       1.35752071718     1.357520719
Intercept                    4.008011703       4.008011700

The fare amount, with a coefficient of about 1.0037, shows a near one-to-one relationship with the
total fare, confirming that base fare calculations are the primary determinant of the total fare. In contrast, the
tip amount, with a coefficient of approximately 1.3575, indicates that tips significantly boost the total fare.
This higher coefficient suggests that tipping not only adds directly to the fare but also correlates with
scenarios involving higher service quality or more expensive rides. The intercept, around 4.0080, represents
the baseline total fare, ensuring a minimum charge regardless of other factors. This baseline underscores the
importance of initial fees in the fare structure. Collectively, these coefficients reveal that while passenger
count and trip distance play secondary roles, the fare amount and tips are crucial drivers of the total fare,
reflecting a fare structure heavily influenced by base charges and customer tipping behavior.
The fare amount, with a coefficient of 1.0037, shows a near one-to-one relationship with the total
fare, confirming that base fare calculations are the primary factor. Meanwhile, the tip amount, with a
coefficient of 1.3575, has a more significant influence, indicating that tips not only increase the fare directly
but also correlate with scenarios involving higher service quality or more expensive rides. The intercept,
around 4.0080, ensures a minimum fare, emphasizing the importance of base charges. Overall, fare amount
and tips are the main drivers of the total fare, with passenger count and trip distance playing smaller roles.
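To see how the fitted coefficients combine, the sketch below plugs the OLS column of Table 1 into the linear model for one made-up trip. The trip values are hypothetical, and the function name is introduced here for illustration only.

```python
# OLS coefficients and intercept as reported in Table 1.
INTERCEPT = 4.008011703
COEFFICIENTS = {
    "passenger_count": 0.070184284,
    "trip_distance": 0.0009725934876,
    "fare_amount": 1.0036740054,
    "tip_amount": 1.35752071718,
}


def predict_total(trip):
    """Apply the fitted linear model: intercept + sum(coefficient * feature)."""
    return INTERCEPT + sum(COEFFICIENTS[name] * trip[name] for name in COEFFICIENTS)


# Hypothetical trip: 1 passenger, 3 miles, $15 fare, $3 tip.
trip = {"passenger_count": 1, "trip_distance": 3.0, "fare_amount": 15.0, "tip_amount": 3.0}
contributions = {name: COEFFICIENTS[name] * trip[name] for name in COEFFICIENTS}

print(contributions)
print(round(predict_total(trip), 2))  # about 23.21
```

For this trip the fare and tip terms contribute roughly 15.06 and 4.07 dollars, while the passenger and distance terms together add less than 0.10 dollars, which mirrors the relative importance described above.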
The OLS and L-BFGS linear regression models were used to predict taxi fares, with both showing
nearly identical performance metrics. The OLS model, using the normal equations method, had an RMSE of
4.6916 and an MSE of 22.0111, and completed in 37.96 seconds, making it efficient for datasets that fit
within memory limits. This efficiency comes from the closed-form solution of the normal equations, which
allows for quick calculations when data size is manageable.
The L-BFGS model, an iterative optimization method for larger datasets, achieved the same RMSE
and MSE as the OLS model. However, its computational time was significantly longer, at 122.69 seconds,
reflecting its iterative nature. Despite this, the L-BFGS method is more flexible and scalable, making it
suitable for large datasets that exceed memory limits. Its performance and coefficient alignment with the
OLS model confirm its effectiveness in capturing the dataset’s linear relationships.
Comparing the two models, both showed similar predictive accuracy, but the OLS method was
faster and more efficient for smaller datasets, while the L-BFGS method excelled in handling larger, more
complex datasets. The choice between the two depends on the dataset size and computational needs, with
OLS favored for speed and L-BFGS for scalability. Understanding these trade-offs ensures the appropriate
model is used for efficient and accurate analysis.

4. CONCLUSION
The growing interest in big data and machine learning has revolutionized numerous industries,
including urban transportation. Leveraging these advanced technologies allows for more informed
decision-making, operational efficiency, and enhanced customer experiences. In this context, analyzing
extensive datasets, such as those generated by New York City's yellow taxi services, provides valuable
insights into the performance and market dynamics of competing vendors. This study harnesses the power of
big data and machine learning to evaluate the operational metrics of two major taxi vendors, offering a
detailed comparison of their effectiveness in meeting passenger demand and generating revenue.
In conclusion, the integration of big data and machine learning in analyzing New York City's yellow
taxi industry reveals vendor 2 as the more dominant and financially successful operator. Higher trip volumes,
longer average trip distances, higher fare amounts, and greater tips position vendor 2 as a stronger competitor
with a better ability to meet passenger demands and generate revenue. These insights are instrumental for
both vendors in optimizing their operations, improving service quality, and making data-driven decisions that
enhance customer satisfaction and operational efficiency. This study exemplifies the transformative potential
of big data and machine learning in urban transportation, paving the way for more effective and competitive
service delivery.

REFERENCES
[1] B. Itri, Y. Mohamed, B. Omar, E. M. Latifa, M. Lahcen, and O. Adil, “Hybrid machine learning for stock price
prediction in the Moroccan banking sector,” International Journal of Electrical and Computer Engineering,
vol. 14, no. 3, pp. 3197–3207, Jun. 2024, doi: 10.11591/ijece.v14i3.pp3197-3207.
[2] Q. Hu, L. Zhu, C. Chang, and W. Zhang, “A truncated three-term conjugate gradient method with complexity
guarantees with applications to nonconvex regression problem,” Applied Numerical Mathematics, vol. 194,
pp. 82–96, Dec. 2023, doi: 10.1016/j.apnum.2023.08.006.
[3] A. L. Burton, “Ordinary least squares (linear) regression,” in The Encyclopedia of Research Methods in
Criminology and Criminal Justice, Wiley, 2021, pp. 509–514.
[4] S. Rhouas, A. El Attaoui, and N. El Hami, “Optimization of the prediction performance in the future exchange
rate,” in 2023 9th International Conference on Optimization and Applications (ICOA), Oct. 2023, pp. 1–6,
doi: 10.1109/ICOA58279.2023.10308858.
[5] A. El Attaoui, S. Rhouas, and N. El Hami, “ETL applied to klarna e-commerce dataset,” in 2023 9th International
Conference on Optimization and Applications (ICOA), Oct. 2023, pp. 1–4, doi: 10.1109/ICOA58279.2023.10308808.
[6] M. B. Ulak, A. Yazici, and M. Aljarrah, “Value of convenience for taxi trips in New York City,” Transportation
Research Part A: Policy and Practice, vol. 142, pp. 85–100, Dec. 2020, doi: 10.1016/j.tra.2020.10.016.
[7] M. S. Ansar, Y. Ma, S. Chen, K. Tang, and Z. Zhang, “Investigating the trip configured causal effect of distracted
driving on aggressive driving behavior for e-hailing taxi drivers,” Journal of Traffic and Transportation
Engineering (English Edition), vol. 8, no. 5, pp. 725–734, Oct. 2021, doi: 10.1016/j.jtte.2020.12.001.
[8] X. Dong, E. Guerra, and M. S. Ryerson, “Investigating the recovery of for-hire-vehicle, taxi, and airtrain at two
New York City airports during the COVID-19 pandemic,” Travel Behaviour and Society, vol. 33, Oct. 2023,
doi: 10.1016/j.tbs.2023.100646.
[9] D. Katić, H. Krstić, I. Ištoka Otković, and H. Begić Juričić, “Comparing multiple linear regression and neural
network models for predicting heating energy consumption in school buildings in the Federation of Bosnia and
Herzegovina,” Journal of Building Engineering, vol. 97, Nov. 2024, doi: 10.1016/j.jobe.2024.110728.
[10] B. V. Surya Vardhan, M. Khedkar, I. Srivastava, and S. K. Patro, “Impact of integrated classifier — regression
mapped short term load forecasting on power system management in a grid connected multi energy systems,”
Electric Power Systems Research, vol. 230, May 2024, doi: 10.1016/j.epsr.2024.110222.
[11] M. Armanur Rahman, A. Hossen, J. Hossen, V. C, T. Bhuvaneswari, and A. Sultana, “Towards machine
learning-based self-tuning of hadoop-spark system,” Indonesian Journal of Electrical Engineering and Computer
Science, vol. 15, no. 2, pp. 1076–1085, Aug. 2019, doi: 10.11591/ijeecs.v15.i2.pp1076-1085.
[12] A. Manconi, M. Gnocchi, L. Milanesi, O. Marullo, and G. Armano, “Framing Apache spark in life sciences,”
Heliyon, vol. 9, no. 2, Feb. 2023, doi: 10.1016/j.heliyon.2023.e13368.
[13] P. Jha, A. Tiwari, N. Bharill, M. Ratnaparkhe, M. Mounika, and N. Nagendra, “Apache spark based kernelized
fuzzy clustering framework for single nucleotide polymorphism sequence analysis,” Computational Biology and
Chemistry, vol. 92, Jun. 2021, doi: 10.1016/j.compbiolchem.2021.107454.
[14] M. B. Al-Masadeh, M. S. Azmi, and S. S. Syed Ahmad, “Tiny datablock in saving hadoop distributed file system
wasted memory,” International Journal of Electrical and Computer Engineering, vol. 13, no. 2,
pp. 1757–1772, Apr. 2023, doi: 10.11591/ijece.v13i2.pp1757-1772.
[15] F. Ashkouti, K. Khamforoosh, and A. Sheikhahmadi, “DI-Mondrian: distributed improved Mondrian for
satisfaction of the L-diversity privacy model using Apache spark,” Information Sciences, vol. 546, pp. 1–24,
Feb. 2021, doi: 10.1016/j.ins.2020.07.066.
[16] Y. Liu and S. Cao, “The analysis of aerobics intelligent fitness system for neurorobotics based on big data and
machine learning,” Heliyon, vol. 10, no. 12, Jun. 2024, doi: 10.1016/j.heliyon.2024.e33191.
[17] S. G Purohit and V. Swamy, “Enhancing data publishing privacy: split-and-mould, an algorithm for equivalent
specification,” Indonesian Journal of Electrical Engineering and Computer Science, vol. 33, no. 2,
Feb. 2024, doi: 10.11591/ijeecs.v33.i2.pp1273-1282.
Analysis of big data from New York taxi trip 2023: revenue prediction using … (Sara Rhouas)
718  ISSN: 2088-8708

[18] Y. Huang, W. Xu, P. Sukjairungwattana, and Z. Yu, “Learners’ continuance intention in multimodal language
learning education: an innovative multiple linear regression model,” Heliyon, vol. 10, no. 6, Mar. 2024, doi:
10.1016/j.heliyon.2024.e28104.
[19] C. Kleiber, “Finite sample efficiency of OLS in linear regression models with long-memory disturbances,”
Economics Letters, vol. 72, no. 2, pp. 131–136, Aug. 2001, doi: 10.1016/S0165-1765(01)00423-2.
[20] I. Ahmad et al., “Spatial configuration of groundwater potential zones using OLS regression method,” Journal of
African Earth Sciences, vol. 177, May 2021, doi: 10.1016/j.jafrearsci.2021.104147.
[21] E. Ghysels and H. Qian, “Estimating MIDAS regressions via OLS with polynomial parameter profiling,”
Econometrics and Statistics, vol. 9, pp. 1–16, Jan. 2019, doi: 10.1016/j.ecosta.2018.02.001.
[22] A. Bemporad, “An L-BFGS-B approach for linear and nonlinear system identification under l1 and group-Lasso
regularization,” arXiv:2403.03827, Mar. 2024.
[23] F. Alpak et al., “A machine-learning-accelerated distributed LBFGS method for field development optimization:
algorithm, validation, and applications,” Computational Geosciences, vol. 27, no. 3, pp. 425–450, Jun. 2023,
doi: 10.1007/s10596-023-10197-3.
[24] D. Chang, S. Sun, and C. Zhang, “An accelerated linearly convergent stochastic L-BFGS algorithm,” IEEE
Transactions on Neural Networks and Learning Systems, vol. 30, no. 11, pp. 3338–3346, Nov. 2019,
doi: 10.1109/TNNLS.2019.2891088.
[25] M. W. Liemohn, A. D. Shane, A. R. Azari, A. K. Petersen, B. M. Swiger, and A. Mukhopadhyay, “RMSE is not
enough: guidelines to robust data-model comparisons for magnetospheric physics,” Journal of Atmospheric and
Solar-Terrestrial Physics, vol. 218, Jul. 2021, doi: 10.1016/j.jastp.2021.105624.
[26] S. Rhouas, A. El Attaoui, and N. El Hami, “Enhancing currency prediction in international e-commerce: Bayesian-
optimized random forest approach using the Klarna dataset,” International Journal of Electrical and Computer
Engineering, vol. 14, no. 3, Jun. 2024, doi: 10.11591/ijece.v14i3.pp3177-3186.
[27] S. Hadiyoso, H. Nugroho, T. L. Erawati Rajab, and K. Surendro, “Data prediction for cases of incorrect data in
multi-node electrocardiogram monitoring,” International Journal of Electrical and Computer Engineering, vol. 12,
no. 2, pp. 1540–1547, Apr. 2022, doi: 10.11591/ijece.v12i2.pp1540-1547.
[28] “TLC trip record data,” Taxi and Limousine Commission. https://www.nyc.gov/site/tlc/about/tlc-trip-record-
data.page (accessed Jun. 13, 2024).

BIOGRAPHIES OF AUTHORS

Sara Rhouas is currently pursuing her Ph.D. in computer science at the National
School of Applied Sciences, Ibn Tofail University in Morocco. She earned her engineering
degree in Industrial Engineering in 2019 from the same institution. Her academic experience
includes a strong focus on automobile technologies, with a particular interest in braking
systems. She has carried out research in the field of optimization algorithms and her research
interests extend to areas such as big data, interoperability, artificial intelligence, machine
learning, and deep learning. She has authored and co-authored several publications in both
conferences and scientific journals. She can be contacted at email: rhouas.sara@gmail.com.

Norelislam El Hami is a professor of computer science at the National School of
Applied Sciences, Ibn Tofail University, in Kenitra, Morocco. He earned a diploma of state
engineer in 2000, specializing in computer and telecommunications from the Polytechnic
Faculty of Mons (FPMS) in Belgium. He holds a Ph.D. in computer science from the National
Institute of Applied Sciences (INSA) of Rouen, France, as well as a Ph.D. in applied
mathematics and computer science from Mohammed V University in Rabat, Morocco. His
work includes numerous scholarly publications in conferences and journals. He can be
contacted at email: norelislam@outlook.com.

