
NM Assignment

The document presents an Exploratory Data Analysis (EDA) on a Netflix dataset, focusing on understanding content trends, types, and viewer preferences. Key findings include a predominance of movies over TV shows, significant content production growth since 2018, and the dominance of genres like Drama and Comedy. The analysis highlights the U.S. as the top content producer, with notable contributions from India and the U.K., and emphasizes Netflix's strategy to cater to diverse global audiences.

Uploaded by

mohanaramanan75

NAAN MUDHALVAN - ASSIGNMENT SUBMISSION

NAME : PREMALATHA S

REGISTER NO. : 421322104033

DEPARTMENT : COMPUTER SCIENCE & ENGINEERING

NAAN MUDHALVAN ID : aut421322104030

PROJECT TITLE : EXPLORATORY DATA ANALYSIS (EDA) ON NETFLIX

COLLEGE CODE : 4213

COLLEGE NAME : KRISHNASAMY COLLEGE OF ENGINEERING & TECHNOLOGY

Exploratory Data Analysis (EDA) on Netflix Dataset

INTRODUCTION:
In the era of digital streaming, Netflix has emerged as a dominant platform,
offering a wide range of movies and TV shows across different countries and genres.
To better understand the nature and diversity of Netflix’s content, we explore a
publicly available Netflix dataset containing detailed information such as title, type
(Movie or TV Show), director, cast, country of origin, release year, rating, duration,
genres, and date added to the platform. This dataset provides a rich foundation for
analyzing trends in content production and distribution.

To gain meaningful insights from this data, we apply Exploratory Data Analysis
(EDA), a crucial step in the data science process. EDA allows us to examine the
structure of the dataset, identify missing or inconsistent data, and uncover patterns
related to genre popularity, content duration, regional production, and release
timelines. Through data visualization and statistical summaries, EDA helps us
understand how Netflix’s catalog has evolved over time and supports further decision-
making or modeling efforts. This analysis aims to reveal hidden trends in Netflix’s
content strategy and viewer preferences.

OBJECTIVE:

The primary objective of this Exploratory Data Analysis (EDA) is to thoroughly
examine and understand the Netflix dataset by uncovering meaningful patterns, trends,
and relationships within the data. This includes analyzing the distribution of content
types (Movies vs. TV Shows), exploring release trends over the years, identifying the
most common genres and content ratings, evaluating the geographic diversity of
Netflix's catalog, and detecting missing or inconsistent values. The analysis will be
supported by effective data visualizations and will aim to provide actionable insights
that reflect Netflix’s content strategy and viewer engagement trends. This EDA will
also serve as a foundation for further modeling or decision-making processes.

DATA SOURCE:

Netflix Movies and TV Shows Dataset:

Dataset Link: https://www.kaggle.com/datasets/shivamb/netflix-shows

This dataset includes information about TV shows and movies available on Netflix,
including their type, director, cast, country, release year, rating, and more.
Part 1: Data Loading and Understanding

a) Load the dataset into your tool of choice (Excel / Python / Power BI / Tableau).

Program:

import pandas as pd

# Load the dataset
df = pd.read_csv("/content/netflix_titles.csv")

b) Display the first 5–10 rows of the dataset.

Program:

print(df.head(10))

Output:

c) Check the number of rows and columns.

Program:

import pandas as pd

# Load the dataset
df = pd.read_csv("/content/netflix_titles.csv")

# Get the shape of the DataFrame
print(f"Number of rows: {df.shape[0]}")
print(f"Number of columns: {df.shape[1]}")

Output:

d) Identify any missing or inconsistent data.

Program:

import pandas as pd

# Check for missing values in each column
missing_data = df.isnull().sum()

# Print the missing data count for each column
print("Missing data count per column:")
print(missing_data)

# Check for inconsistent data types (e.g., numbers in text columns)
print("\nData types of each column:")
print(df.dtypes)

# Check for rows with inconsistencies, e.g. non-numeric values in a
# column that is expected to hold only numeric values
inconsistent_data = df[pd.to_numeric(df["release_year"], errors="coerce").isna()]
print("\nRows with inconsistent data (non-numeric in numeric column):")
print(inconsistent_data)
Output:
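
Raw missing counts are easier to compare when they are also expressed as a share of all rows. The following is a minimal sketch on a small made-up DataFrame; the `toy` data and the `missing_summary` helper are illustrative assumptions and are not part of the original notebook. With the real data, `df` would be passed in instead.

```python
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst first."""
    counts = frame.isna().sum()
    percent = (counts / len(frame) * 100).round(1)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    return summary.sort_values("missing", ascending=False)

# Hypothetical stand-in for a few rows of the Netflix data
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", None, None, "Y"],
    "country": ["India", None, "Japan", "Japan"],
})

summary = missing_summary(toy)
print(summary)
```

Sorting by the missing count puts the columns that need the most attention at the top, which helps when deciding whether to fill or drop values in Part 2.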

Part 2: Data Cleaning (if necessary):

a) Handle missing values (either by filling or removing them).

Program:

import pandas as pd
df["country"] = df["country"].fillna(df["country"].mode()[0])
df.drop_duplicates(inplace=True)
print("After handling missing values and removing duplicates:")
print(f"Number of rows: {df.shape[0]}")
print(f"Number of columns: {df.shape[1]}")
Output:
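
Filling missing countries with the mode assigns every unknown title to the most common country, which can inflate that country's count in later steps. One possible alternative, sketched here on made-up rows, is to label missing categories explicitly. The `toy` frame and the "Unknown" label are assumptions chosen for illustration, not the method used in the original notebook.

```python
import pandas as pd

# Hypothetical rows standing in for the Netflix data
toy = pd.DataFrame({
    "title": ["A", "B", "B", "C"],
    "country": ["India", None, None, "India"],
    "rating": ["TV-MA", "PG", "PG", None],
})

# Label missing categories explicitly instead of imputing the mode
cleaned = toy.fillna({"country": "Unknown", "rating": "Unknown"})

# Remove exact duplicate rows and renumber the index
cleaned = cleaned.drop_duplicates().reset_index(drop=True)

print(cleaned)
print(f"Rows before: {len(toy)}, rows after: {len(cleaned)}")
```

With this approach the "Unknown" group stays visible in any later count, so it can be reported or excluded deliberately.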

Part 3: Exploratory Data Analysis (EDA):

a) Find the total number of Movies vs TV Shows.

Program:

import pandas as pd
print(df['type'].unique()) # This is to check the unique values in the 'type' column

# Count the number of Movies and TV Shows
movie_count = df[df['type'] == 'Movie'].shape[0]
tv_show_count = df[df['type'] == 'TV Show'].shape[0]

# Display the result
print(f"Total number of Movies: {movie_count}")
print(f"Total number of TV Shows: {tv_show_count}")

Output:
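
The same comparison can be expressed as percentages, which is the form used when writing insights about the split. This is a minimal sketch using a made-up `types` Series as a stand-in for `df['type']`; the values are illustrative only.

```python
import pandas as pd

# Hypothetical 'type' column; the real one comes from netflix_titles.csv
types = pd.Series(["Movie", "Movie", "TV Show", "Movie", "TV Show"], name="type")

# Absolute counts and percentage shares per content type
counts = types.value_counts()
shares = (types.value_counts(normalize=True) * 100).round(1)

result = pd.DataFrame({"count": counts, "percent": shares})
print(result)
```

`value_counts(normalize=True)` returns fractions that sum to 1, so multiplying by 100 gives the share of each type directly.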

b) Identify the top 5 countries producing the most Netflix content.

Program:

import pandas as pd
# Split multi-country entries and count every listed country;
# rows with no country are skipped rather than counted as an empty name
countries = df['country'].dropna().str.split(',').explode().str.strip()
country_count = countries[countries != ''].value_counts()
top_5_countries = country_count.head(5)
print("\nTop 5 Countries Producing the Most Netflix Content:")
print(top_5_countries)

Output:
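
The country, director, and genre columns all store several values in one comma-separated string, so the split-and-count step recurs. A reusable helper is one way to keep that logic in a single place. The sketch below is an assumption about how this could be organised; `count_multi_valued` and the `toy_country` data are hypothetical names introduced for illustration.

```python
import pandas as pd

def count_multi_valued(column: pd.Series, sep: str = ",") -> pd.Series:
    """Count individual entries in a column of separator-delimited strings."""
    values = (
        column.dropna()      # ignore rows with no value at all
        .str.split(sep)      # "A, B" -> ["A", " B"]
        .explode()           # one row per entry
        .str.strip()         # remove surrounding whitespace
    )
    values = values[values != ""]  # drop blanks left by stray separators
    return values.value_counts()

# Hypothetical stand-in for df['country']
toy_country = pd.Series(["United States, India", "India", None, "India, ", "Japan"])
counts = count_multi_valued(toy_country)
print(counts.head(5))
```

With the real data, the same call could be applied to `df['country']`, `df['director']`, or `df['listed_in']` without filling missing values first.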

c) Find the top 10 directors with the highest number of shows/movies.

Program:

import pandas as pd
# Skip rows with no director; filling them with a real name would inflate that director's count
directors = df['director'].dropna().str.split(',').explode().str.strip()
director_count = directors.value_counts()
top_10_directors = director_count.head(10)
# Display the top 10 directors
print("\nTop 10 Directors with the Most Movies/Shows:")
print(top_10_directors)

Output:
d) Find out the most common genres (column: listed_in).

Program:

import pandas as pd
# Skip rows with no genre instead of filling them with a made-up label
genres = df['listed_in'].dropna().str.split(',').explode().str.strip()
genre_count = genres.value_counts()
top_genres = genre_count.head(10) # Adjust the number for top N genres you want
# Display the most common genres
print("\nMost Common Genres:")
print(top_genres)

Output:

e) Analyze the trend: How many shows/movies were released each year?

Program:

import pandas as pd

df['release_year'] = pd.to_numeric(df['release_year'], errors='coerce')

# Group by 'release_year' and count the number of shows/movies
print(df.groupby('release_year').size())
Output:
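
The yearly totals can also be broken down by content type, which shows whether movies or TV shows drive a given year's count. The sketch below uses `pd.crosstab` on a small made-up frame; the `toy` rows are illustrative assumptions, and with the real data the two columns of `df` would be passed in instead.

```python
import pandas as pd

# Hypothetical rows standing in for the Netflix data
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie", "TV Show"],
    "release_year": [2019, 2019, 2020, 2020, 2020],
})

# Rows are years, columns are content types, cells are title counts
per_year = pd.crosstab(toy["release_year"], toy["type"])
per_year["Total"] = per_year.sum(axis=1)
print(per_year)
```

Note that `release_year` records when a title was originally released, while the dataset's date-added field records when it joined the platform, so the two can tell different stories about catalog growth.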

Part 4: Visualizations

Create at least three types of visualizations:

a) One bar chart (e.g., number of movies vs TV shows).

Program:

import pandas as pd
import matplotlib.pyplot as plt
content_count = df['type'].value_counts()
plt.figure(figsize=(8, 6))
content_count.plot(kind='bar', color=['skyblue', 'lightgreen'])
plt.title('Number of Movies vs TV Shows')
plt.xlabel('Content Type')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()
Output:

b) One pie chart (e.g., top 5 countries by number of productions).

Program:

import pandas as pd
import matplotlib.pyplot as plt
# Count every listed country separately so that co-productions
# (e.g. "United States, India") are not treated as their own category
countries = df['country'].dropna().str.split(',').explode().str.strip()
country_production_count = countries[countries != ''].value_counts()
# Select the top 5 countries by production count
top_5_countries = country_production_count.head(5)
# Create a pie chart for the top 5 countries by production count
plt.figure(figsize=(8, 8))
plt.pie(top_5_countries, labels=top_5_countries.index, autopct='%1.1f%%',
        startangle=140, colors=['#ff9999','#66b3ff','#99ff99','#ffcc99','#c2c2f0'])
# Title for the pie chart
plt.title('Top 5 Countries by Number of Netflix Productions')
# Display the pie chart
plt.show()
Output:
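
A pie chart of only the top 5 countries shows their shares relative to each other, not relative to the whole catalog. One possible refinement is to add an "Other" slice for the remaining countries. The sketch below shows only the data preparation, on made-up counts; `country_counts` and its values are hypothetical and stand in for the real per-country counts.

```python
import pandas as pd

# Hypothetical per-country title counts, sorted from largest to smallest
country_counts = pd.Series(
    {"United States": 50, "India": 20, "United Kingdom": 10,
     "Canada": 8, "France": 6, "Japan": 4, "Spain": 2}
)

top_n = 5
top = country_counts.head(top_n)
other = country_counts.iloc[top_n:].sum()  # everything outside the top N

# Append the combined remainder as one extra slice
pie_data = pd.concat([top, pd.Series({"Other": other})])
shares = (pie_data / pie_data.sum() * 100).round(1)
print(shares)
```

The resulting `pie_data` could be passed to `plt.pie` in place of the top 5 Series, so the percentages on the chart refer to all counted entries.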

c) One line graph (e.g., number of releases per year).

Program:

import pandas as pd
import matplotlib.pyplot as plt
df['release_year'] = pd.to_numeric(df['release_year'], errors='coerce')
# Group by 'release_year' and count the number of shows/movies
release_trend = df.groupby('release_year').size()
# Plot the trend
plt.figure(figsize=(12, 6))
release_trend.plot(kind='line', color='b', marker='o', linestyle='-', linewidth=2,
markersize=5)
plt.title('Trend of Shows/Movies Released Each Year')
plt.xlabel('Year')
plt.ylabel('Number of Shows/Movies Released')
plt.grid(True)
plt.xticks(rotation=45)
plt.tight_layout()
# Show the plot
plt.show()
Output:

d) Create a word cloud showing most frequent actors or genres.

Program:

import pandas as pd
from wordcloud import WordCloud
import matplotlib.pyplot as plt
all_genres = ' '.join(df['listed_in'].dropna().astype(str))
# Generate the word cloud
wordcloud = WordCloud(width=800, height=400,
background_color='white').generate(all_genres)
# Display the word cloud
plt.figure(figsize=(10, 8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off') # Hide the axes
plt.title('Most Frequent Genres in Netflix Dataset')
plt.show()
Output:
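
Joining the genre strings with spaces lets the word cloud split multi-word genres such as "International Movies" into separate words. An alternative is to count whole genre names first and pass the counts to the word cloud. The sketch below shows only the counting step with the standard library; the `listed_in` values are made up for illustration.

```python
from collections import Counter

# Hypothetical 'listed_in' values
listed_in = [
    "Dramas, International Movies",
    "Comedies, Dramas",
    "International Movies",
]

# Count each complete genre name, ignoring blanks from stray commas
genre_freq = Counter(
    genre.strip()
    for entry in listed_in
    for genre in entry.split(",")
    if genre.strip()
)
print(genre_freq.most_common(3))
```

A dictionary of this shape can be given to `WordCloud.generate_from_frequencies` from the `wordcloud` package, which sizes each full genre name by its count instead of tokenizing the text.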

Part 5: Insight Writing:

For each analysis, write 2–3 sentences of insight.

Movies vs TV Shows:

Movies make up the majority of Netflix content, accounting for about 70% of the total
titles, while TV Shows represent the remaining 30%. This suggests that Netflix
emphasizes one-time viewing content more heavily than episodic series.

Top 5 Countries:

The United States leads as the top producer of Netflix content, followed by India, the
United Kingdom, Canada, and Brazil. This reflects Netflix's strong presence in
English-speaking countries, along with significant contributions from India and other
global regions.

Top 5 Directors:

The dataset reveals that directors like Martin Scorsese, David Fincher, Steven
Spielberg, Quentin Tarantino, and Christopher Nolan are among the most
frequently credited on Netflix content. This indicates the platform’s interest in big-
name directors known for their high-quality and influential work across both movies
and TV shows.
Common Genres:

Drama is the most prevalent genre across Netflix content, followed by Comedy and
Action. These genres dominate the platform, indicating Netflix’s focus on engaging,
broad-appeal content that resonates with diverse audiences.

Releases per Year:

The number of Netflix releases has significantly increased, especially from 2018
onward, with a noticeable spike in 2020. This suggests that Netflix ramped up its
content production to meet growing global demand and competition in the streaming
industry.

Google Colab Link:

https://colab.research.google.com/drive/1CVrxGwfDbgR0ssLJ2rxSnf3thayzobH?usp=sharing

CONCLUSION:

The analysis of the Netflix dataset reveals several key trends about the
platform’s content strategy. Netflix has a stronger focus on movies, with a higher
number of titles in that category compared to TV shows. Content production has seen
substantial growth since 2018, with a marked increase in 2020, reflecting the
platform’s expansion. The U.S. leads in content production, though countries like
India, the U.K., and Brazil also contribute significantly to Netflix’s diverse library.
Popular genres like Drama, Comedy, and Action dominate, indicating Netflix's focus
on broad, global appeal. The involvement of renowned directors further emphasizes
Netflix's commitment to high-quality, engaging content. Overall, Netflix's catalog
shows a clear strategy to cater to a wide range of tastes and expand its international
reach.
