G AIA 200 Cuddle
Preliminaries
✓ The totality of your source files, except all useless files (binary, temp files, obj files, ...),
must be included in your delivery.
✓ All the bonus files (including a potential specific Makefile) should be in a directory
named bonus.
✓ Error messages have to be written on the error output, and the program should then
exit with the 84 error code (0 if there is no error).
The Cuddle project aims to re-code the popular pandas library, a cornerstone tool in the data
manipulation ecosystem, especially for handling CSV files and complex data structures.
Pandas is renowned for its robust capabilities in data cleaning, transformation, and analysis,
offering a user-friendly interface that makes data science more accessible and efficient.
We highly recommend that you explore the Python pandas library to get a practical sense of its
functionalities before diving into the recoding process.
Getting started
For this project, the input file will be a CSV (Comma Separated Values) file, which is a common
file format for storing tabular data. The CSV file will contain a header row with the column names
and subsequent rows with the data values.
The goal of the “My Panda” project is to create a program that can:
Your program should be able to handle CSV files with different column types (e.g., strings,
integers, floats) and provide a flexible interface for interacting with the data. This interface
will be a static library (libcuddle.a) that can be linked to other programs for data analysis.
Think about the data structure you will use to store the CSV data and how you will implement
the various operations efficiently. Linked lists, arrays, ...?
Core functions
The first step is to create a function that reads a CSV file and stores the data in the custom data
structure. You can start by defining the data structure dataframe_t and the function df_read_csv that
reads the CSV file and populates the dataframe_t structure.
Your program must be able to read a CSV file with a custom separator (e.g., comma, semicolon,
tab). If separator is NULL, the default separator is the comma.
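The separator rule can be sketched with a small line splitter. This is only an illustration: `split_line` and `free_fields` are hypothetical helper names, not part of the required API, and quoted fields are deliberately ignored.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: frees a NULL-terminated array of fields. */
void free_fields(char **fields)
{
    if (fields == NULL)
        return;
    for (size_t i = 0; fields[i] != NULL; i++)
        free(fields[i]);
    free(fields);
}

/* Hypothetical helper: splits one CSV line into newly allocated fields.
 * A NULL separator falls back to ",". Returns a NULL-terminated array,
 * or NULL on allocation failure. Quoted fields are not handled here. */
char **split_line(const char *line, const char *separator)
{
    const char *sep = (separator != NULL) ? separator : ",";
    size_t sep_len = strlen(sep);
    size_t count = 1;
    char **fields;

    assert(line != NULL && sep_len > 0);
    /* Count the separators to know how many fields to allocate. */
    for (const char *p = strstr(line, sep); p != NULL;
        p = strstr(p + sep_len, sep))
        count++;
    fields = calloc(count + 1, sizeof(*fields));
    if (fields == NULL)
        return NULL;
    for (size_t i = 0; i < count; i++) {
        const char *end = strstr(line, sep);
        size_t len = (end != NULL) ? (size_t)(end - line) : strlen(line);

        fields[i] = malloc(len + 1);
        if (fields[i] == NULL) {
            free_fields(fields);
            return NULL;
        }
        memcpy(fields[i], line, len);
        fields[i][len] = '\0';
        line += len + ((end != NULL) ? sep_len : 0);
    }
    return fields;
}
```

Calling `split_line("Alice;25;Paris", ";")` yields three fields, while passing NULL as the separator splits on commas.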
If you need some help and/or feel lost, maybe you should check the Bootstrap.
The df_write_csv function should write the data from the dataframe_t structure to a CSV file,
using the same separator as the input file, and return 0 if successful.
int df_write_csv(dataframe_t *dataframe, const char *filename);
✓ Int: if the column contains at least one negative value; otherwise it is an unsigned int.
A value is considered a number only if it contains nothing but digits, an optional leading minus
sign, and at most a single dot (.) for float values. For instance, “-25” is an integer, “25.0” is a
float, and “25.0.0” or “10 000” are strings.
Basic operations
Once you have the core functions implemented, you can start adding basic operations to
manipulate the data in the dataframe_t structure.
∇ Terminal - + x
~/G-AIA-200> cat data.csv
name,age,city
Alice,25,Paris
Bob,30,London
Charlie,35,Berlin
Léo,25,Paris
Nathan,30,London
Alex,35,Berlin
Paul,25,Paris
head
The df_head function should return a new dataframe containing the first nb_rows rows of the
given dataframe.
dataframe_t *df_head(dataframe_t *dataframe, int nb_rows);
∇ Terminal - + x
~/G-AIA-200> ./cuddle && cat head.csv
name,age,city
Alice,25,Paris
Bob,30,London
Charlie,35,Berlin
tail
The df_tail function should return a new dataframe containing the last nb_rows rows of the
given dataframe.
dataframe_t *df_tail(dataframe_t *dataframe, int nb_rows);
∇ Terminal - + x
~/G-AIA-200> ./cuddle && cat tail.csv
name,age,city
Nathan,30,London
Alex,35,Berlin
Paul,25,Paris
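Both head and tail reduce to picking a contiguous range of rows. The sketch below computes that range; `row_range_t`, `head_range` and `tail_range` are hypothetical names, and clamping out-of-range requests is an assumption, since the subject does not say how such values should be handled.

```c
#include <assert.h>

/* Hypothetical type: a contiguous range of rows to copy. */
typedef struct {
    int start;
    int count;
} row_range_t;

/* Clamps the requested number of rows to [0, total_rows] (assumed policy). */
static int clamp_rows(int total_rows, int nb_rows)
{
    if (nb_rows < 0)
        return 0;
    return (nb_rows > total_rows) ? total_rows : nb_rows;
}

/* Rows to copy for a head operation: the first nb_rows rows. */
row_range_t head_range(int total_rows, int nb_rows)
{
    row_range_t range;

    assert(total_rows >= 0);
    range.start = 0;
    range.count = clamp_rows(total_rows, nb_rows);
    return range;
}

/* Rows to copy for a tail operation: the last nb_rows rows. */
row_range_t tail_range(int total_rows, int nb_rows)
{
    row_range_t range;

    assert(total_rows >= 0);
    range.count = clamp_rows(total_rows, nb_rows);
    range.start = total_rows - range.count;
    return range;
}
```

With the 7-row data.csv and nb_rows set to 3, head covers rows 0 to 2 and tail covers rows 4 to 6, which matches the two outputs shown above.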
shape
The df_shape function should print the number of rows and columns in the dataframe using a
dataframe_shape_t structure that contains two fields: nb_rows and nb_columns.
typedef struct dataframe_shape_s {
    int nb_rows;
    int nb_columns;
} dataframe_shape_t;
    printf("Shape: %d rows, %d columns\n", shape.nb_rows, shape.nb_columns);
    return 0;
}
∇ Terminal - + x
~/G-AIA-200> ./cuddle
Shape: 7 rows, 3 columns
info
The df_info function should print information about the dataframe, including the column names
and types.
void df_info(dataframe_t *dataframe);
∇ Terminal - + x
~/G-AIA-200> ./cuddle
3 columns:
- name: string
- age: unsigned int
- city: string
Types should be displayed in lowercase: bool, int, unsigned int, float, string.
describe
The df_describe function should provide summary statistics for the numerical columns in the
dataframe, including count, mean, standard deviation, minimum and maximum.
void df_describe(dataframe_t *dataframe);
∇ Terminal - + x
~/G-AIA-200> ./cuddle
Column: age
Count: 7
Mean: 29.29
Std: 4.16
Min: 25.00
Max: 35.00
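The figures above can be reproduced with a short sketch. Note that the expected Std of 4.16 corresponds to the population standard deviation (dividing by count); dividing by count - 1 would give about 4.50 for this column. The names `column_stats_t` and `describe_values` are hypothetical, and the square root uses a small Newton iteration only to keep the example free of a libm dependency; sqrt from <math.h> would be the usual choice.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for the statistics of one numeric column. */
typedef struct {
    size_t count;
    double mean;
    double std;
    double min;
    double max;
} column_stats_t;

/* Newton iteration for the square root of a non-negative number. */
static double newton_sqrt(double x)
{
    double guess = (x > 1.0) ? x : 1.0;

    if (x <= 0.0)
        return 0.0;
    for (int i = 0; i < 60; i++)
        guess = 0.5 * (guess + x / guess);
    return guess;
}

/* Computes count, mean, population std, min and max of the values. */
column_stats_t describe_values(const double *values, size_t count)
{
    column_stats_t stats = {count, 0.0, 0.0, 0.0, 0.0};
    double sum = 0.0;
    double squares = 0.0;

    assert(values != NULL || count == 0);
    if (count == 0)
        return stats;
    stats.min = values[0];
    stats.max = values[0];
    for (size_t i = 0; i < count; i++) {
        sum += values[i];
        if (values[i] < stats.min)
            stats.min = values[i];
        if (values[i] > stats.max)
            stats.max = values[i];
    }
    stats.mean = sum / (double)count;
    for (size_t i = 0; i < count; i++)
        squares += (values[i] - stats.mean) * (values[i] - stats.mean);
    stats.std = newton_sqrt(squares / (double)count);
    return stats;
}
```

For the ages 25, 30, 35, 25, 30, 35, 25, the sum is 205, so the mean is 205 / 7, and the squared deviations add up to roughly 121.43, which gives the population standard deviation printed above once divided by 7 and square-rooted.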
Filtering
The next step is to add filtering capabilities to the dataframe. The df_filter function should return
a new dataframe that contains only the rows that satisfy a specified condition.
dataframe_t *df_filter(dataframe_t *dataframe, const char *column, bool (*filter_func)(void *value));
∇ Terminal - + x
~/G-AIA-200> ./cuddle && cat filtered.csv
name,age,city
Charlie,35,Berlin
Alex,35,Berlin
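A predicate matching the filter_func signature could look like the sketch below. It assumes that the condition behind the output above is "age greater than 30" and that each cell of an integer column is passed as a pointer to int; `age_above_30` and `select_rows` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Example predicate: keeps rows whose age is strictly above 30.
 * Assumes the cell is stored as an int. */
bool age_above_30(void *value)
{
    return *(int *)value > 30;
}

/* Hypothetical helper: stores the indices of the rows accepted by
 * filter_func into kept (which must hold nb_values entries) and
 * returns how many rows were kept. */
size_t select_rows(void **values, size_t nb_values,
    bool (*filter_func)(void *value), size_t *kept)
{
    size_t nb_kept = 0;

    assert(values != NULL && filter_func != NULL && kept != NULL);
    for (size_t i = 0; i < nb_values; i++) {
        if (filter_func(values[i]))
            kept[nb_kept++] = i;
    }
    return nb_kept;
}
```

On the age column of data.csv this keeps rows 2 and 5 (Charlie and Alex); a real df_filter would then copy those rows into a new dataframe.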
Sorting
The df_sort function should return a new dataframe that contains the same data as the original
dataframe but sorted based on a specified column.
dataframe_t *df_sort(dataframe_t *dataframe, const char *column, bool (*sort_func)(void *value1, void *value2));
∇ Terminal - + x
~/G-AIA-200> ./cuddle && cat sorted.csv
name,age,city
Alice,25,Paris
Léo,25,Paris
Paul,25,Paris
Bob,30,London
Nathan,30,London
Charlie,35,Berlin
Alex,35,Berlin
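The output above keeps rows with equal ages in their original order, which a stable sort provides. The sketch below uses an insertion sort over row indices. It assumes that sort_func returns true when value1 must be placed after value2, which the subject does not state explicitly; `int_greater` and `sort_row_order` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Example comparator: true when value1 must come after value2. */
bool int_greater(void *value1, void *value2)
{
    return *(int *)value1 > *(int *)value2;
}

/* Hypothetical helper: fills order with the row indices in sorted order.
 * Insertion sort is stable, so equal values keep their original order. */
void sort_row_order(void **values, size_t nb_values,
    bool (*sort_func)(void *value1, void *value2), size_t *order)
{
    assert(values != NULL && sort_func != NULL && order != NULL);
    for (size_t i = 0; i < nb_values; i++)
        order[i] = i;
    for (size_t i = 1; i < nb_values; i++) {
        size_t current = order[i];
        size_t j = i;

        /* Shift rows that must come after the current one. */
        while (j > 0 && sort_func(values[order[j - 1]], values[current])) {
            order[j] = order[j - 1];
            j--;
        }
        order[j] = current;
    }
}
```

Applied to the age column of data.csv, the resulting order is rows 0, 3, 6, 1, 4, 2, 5, that is Alice, Léo, Paul, Bob, Nathan, Charlie, Alex.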
Aggregation
The df_groupby function should return a new dataframe that groups the rows of the dataframe
by the values of a given column and aggregates the values of other columns using an aggregation
function.
dataframe_t *df_groupby(dataframe_t *dataframe, const char *aggregate_by, const char **to_aggregate, void *(*agg_func)(void **values, int nb_values));
∇ Terminal - + x
~/G-AIA-200> ./cuddle && cat grouped.csv
city,age
Paris,75
London,60
Berlin,70
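The totals above are consistent with a sum over each city's ages. An aggregation callback matching the agg_func signature might be sketched as follows; `agg_sum_int` is a hypothetical name, and both the int storage of the values and the caller-frees ownership of the result are assumptions.

```c
#include <assert.h>
#include <stdlib.h>

/* Example aggregator: sums a group of int values.
 * Returns a newly allocated int that the caller must free,
 * or NULL if the allocation fails. */
void *agg_sum_int(void **values, int nb_values)
{
    int *total;

    assert(values != NULL || nb_values <= 0);
    total = malloc(sizeof(*total));
    if (total == NULL)
        return NULL;
    *total = 0;
    for (int i = 0; i < nb_values; i++)
        *total += *(int *)values[i];
    return total;
}
```

For the Paris group (Alice, Léo and Paul, all aged 25) the callback returns 75, and for Berlin (two rows aged 35) it returns 70.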
Transformation
apply
The df_apply function should return a new dataframe with a column transformed by an apply
function that takes a value from the column and returns a new value.
dataframe_t *df_apply(dataframe_t *dataframe, const char *column, void *(*apply_func)(void *value));
∇ Terminal - + x
~/G-AIA-200> ./cuddle && cat applied.csv
name,age,city
Alice,50,Paris
Bob,60,London
Charlie,70,Berlin
Léo,50,Paris
Nathan,60,London
Alex,70,Berlin
Paul,50,Paris
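In the output above every age has been doubled. A callback with the apply_func signature, and a helper that maps it over a column, could be sketched as below. `double_age` and `apply_column` are hypothetical names; returning freshly allocated values instead of modifying the input in place is a design assumption.

```c
#include <assert.h>
#include <stdlib.h>

/* Example transformation: returns a newly allocated int holding twice
 * the input value (assumed to be an int), or NULL on failure. */
void *double_age(void *value)
{
    int *result = malloc(sizeof(*result));

    if (result == NULL)
        return NULL;
    *result = *(int *)value * 2;
    return result;
}

/* Hypothetical helper: builds a new NULL-terminated array holding
 * apply_func(values[i]) for each cell. The input is left untouched. */
void **apply_column(void **values, size_t nb_values,
    void *(*apply_func)(void *value))
{
    void **result;

    assert(values != NULL && apply_func != NULL);
    result = calloc(nb_values + 1, sizeof(*result));
    if (result == NULL)
        return NULL;
    for (size_t i = 0; i < nb_values; i++)
        result[i] = apply_func(values[i]);
    return result;
}
```

Leaving the original values untouched fits the requirement that df_apply returns a new dataframe, at the cost of one allocation per cell.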
to_type
The df_to_type function should return a new dataframe with a column converted to another type
based on the downcast parameter.
dataframe_t *df_to_type(dataframe_t *dataframe, const char *column, column_type_t downcast);
∇ Terminal - + x
~/G-AIA-200> cat money.csv
name,amount
Alice,25e
Bob,30e
Léo,25e
~/G-AIA-200> ./cuddle && cat numeric.csv
2 columns:
- name: string
- amount: int
name,amount
Alice,25
Bob,30
Léo,25
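One possible reading of the example above is that the leading number of each cell is kept and any trailing characters are dropped. Under that assumption, a single cell could be converted with strtol as sketched below; `parse_leading_int` is a hypothetical helper name.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: parses the integer at the start of text.
 * Trailing characters are ignored. Returns false when text does not
 * start with a number or when the number does not fit in an int. */
bool parse_leading_int(const char *text, int *out)
{
    char *end = NULL;
    long parsed;

    assert(text != NULL && out != NULL);
    errno = 0;
    parsed = strtol(text, &end, 10);
    if (end == text || errno == ERANGE)
        return false; /* no digits consumed, or long overflow */
    if (parsed < INT_MIN || parsed > INT_MAX)
        return false;
    *out = (int)parsed;
    return true;
}
```

With this helper the amount cells of money.csv convert to 25, 30 and 25, while a cell such as "Alice" is rejected.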
Utilities
get_value
The df_get_value function should return the value at the specified row index and column name
in the dataframe.
void *df_get_value(dataframe_t *dataframe, int row, const char *column);
∇ Terminal - + x
~/G-AIA-200> ./cuddle
Value: 25
get_values
The df_get_values function should return an array of values in the specified column of the
dataframe.
void **df_get_values(dataframe_t *dataframe, const char *column);
    while (*values) {
        printf("Value: %d\n", *(int *)*values);
        values++;
    }
}
∇ Terminal - + x
~/G-AIA-200> ./cuddle
Value: 25
Value: 30
Value: 35
Value: 25
Value: 30
Value: 35
Value: 25
get_unique_values
The df_get_unique_values function should return an array of unique values in the specified column
of the dataframe.
void **df_get_unique_values(dataframe_t *dataframe, const char *column);
    while (*values) {
        printf("Value: %s\n", (char *)*values);
        values++;
    }
    return 0;
}
∇ Terminal - + x
~/G-AIA-200> ./cuddle
Value: Paris
Value: London
Value: Berlin
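The output above lists each city once, in order of first appearance. For a string column, that behaviour can be sketched with a simple quadratic scan. `unique_strings` is a hypothetical helper; it returns borrowed pointers into the input rather than copies, which is a design assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a NULL-terminated array with the distinct
 * strings of values (itself NULL-terminated), in first-occurrence order.
 * The strings are not copied: free only the returned array. */
char **unique_strings(char **values)
{
    size_t total = 0;
    size_t nb_unique = 0;
    char **unique;

    assert(values != NULL);
    while (values[total] != NULL)
        total++;
    unique = calloc(total + 1, sizeof(*unique));
    if (unique == NULL)
        return NULL;
    for (size_t i = 0; i < total; i++) {
        size_t j = 0;

        /* Look for the value among those already kept. */
        while (j < nb_unique && strcmp(unique[j], values[i]) != 0)
            j++;
        if (j == nb_unique)
            unique[nb_unique++] = values[i];
    }
    return unique;
}
```

The scan is quadratic in the number of rows, which is fine for small files; a hash set would scale better on large columns.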
free
The df_free function should free the memory allocated for the dataframe and its components.
void df_free(dataframe_t *dataframe);
    df_free(dataframe);
    return 0;
}
Be careful to implement the df_read_csv and df_write_csv functions first, as they will be used
to test your project.
Bonus
Here are some additional features you can implement to enhance the functionality of your
library:
✓ Join:
Implement a function to join two dataframes based on a common column.
✓ Performance optimization:
Optimize the performance of the library by using efficient data structures and algorithms.
✓ Error handling:
Add error handling to the library to provide informative messages when an operation fails.
✓ Interactive interface:
Create an interactive CLI for users to interact with the dataframe and perform operations.
✓ Documentation:
Document your library, its functions, data structures, and usage examples.
✓ ...
Conclusion
The Cuddle project provides an opportunity to develop your programming skills and deepen
your understanding of data manipulation techniques. By re-coding the pandas library in C, you
will gain valuable experience in handling tabular data, implementing data structures, and
designing efficient algorithms.
We encourage you to explore the full range of pandas functionalities and experiment with
different data sets to test the robustness and flexibility of your implementation. You can
also compare your results with the pandas library to ensure that your version produces
accurate and consistent output.
Unit tests are essential to verify the correctness of your implementation and ensure that
it behaves as expected in various scenarios.
Don’t forget to split your code into manageable functions and modules to facilitate testing
and debugging.
v 1.0.2