NM Assignment
NAME : PREMALATHA S
INTRODUCTION:
In the era of digital streaming, Netflix has emerged as a dominant platform,
offering a wide range of movies and TV shows across different countries and genres.
To better understand the nature and diversity of Netflix’s content, we explore a
publicly available Netflix dataset containing detailed information such as title, type
(Movie or TV Show), director, cast, country of origin, release year, rating, duration,
genres, and date added to the platform. This dataset provides a rich foundation for
analyzing trends in content production and distribution.
To gain meaningful insights from this data, we apply Exploratory Data Analysis
(EDA), a crucial step in the data science process. EDA allows us to examine the
structure of the dataset, identify missing or inconsistent data, and uncover patterns
related to genre popularity, content duration, regional production, and release
timelines. Through data visualization and statistical summaries, EDA helps us
understand how Netflix’s catalog has evolved over time and supports further decision-
making or modeling efforts. This analysis aims to reveal hidden trends in Netflix’s
content strategy and viewer preferences.
OBJECTIVE:
DATA SOURCE:
This dataset includes information about TV shows and movies available on Netflix,
including their type, director, cast, country, release year, rating, and more.
Part 1: Data Loading and Understanding
a) Load the dataset into your tool of choice (Excel / Python / Power BI / Tableau).
Program:
import pandas as pd
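The loading step itself is not shown in the program above. The following is a minimal sketch, assuming the dataset is a CSV file. The file name `netflix_titles.csv` and the helper `load_netflix` are assumptions made for illustration and do not come from the assignment. A made-up two-row sample stands in for the real file so that the snippet runs on its own.

```python
import io
import pandas as pd

def load_netflix(source):
    """Read the dataset from a path or file-like object (hypothetical helper)."""
    return pd.read_csv(source)

# Made-up sample standing in for the real file.
# With the actual data this might be: df = load_netflix("netflix_titles.csv")  (assumed file name)
sample_csv = io.StringIO(
    "title,type,country,release_year\n"
    "Sample Movie,Movie,India,2019\n"
    "Sample Show,TV Show,,2020\n"
)
df = load_netflix(sample_csv)

print(df.shape)  # (rows, columns)
df.info()        # column names, dtypes, and non-null counts
```

Accepting a file-like object as well as a path keeps the same function usable for both the real file and small test samples.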
Program:
print(df.head(10))
Output:
Program:
import pandas as pd
Output:
Program:
import pandas as pd
Program:
import pandas as pd
df["country"] = df["country"].fillna(df["country"].mode()[0])
df.drop_duplicates(inplace=True)
print("After handling missing values and removing duplicates:")
print(f"Number of rows: {df.shape[0]}")
print(f"Number of columns: {df.shape[1]}")
Output:
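Before filling or dropping anything, it can help to see how many values are actually missing and how many rows are duplicated. The following is a minimal sketch on a made-up four-row sample (not real Netflix data); the column names follow the dataset description, and the variable names are chosen for illustration.

```python
import pandas as pd

# Made-up sample with gaps and one exact duplicate row
df = pd.DataFrame({
    "title": ["A", "B", "B", "C"],
    "country": ["India", "Canada", "Canada", None],
    "director": [None, "X", "X", None],
})

missing_per_column = df.isna().sum()         # number of missing values in each column
duplicate_rows = int(df.duplicated().sum())  # rows identical to an earlier row

print(missing_per_column)
print(f"Duplicate rows: {duplicate_rows}")
```

Running this check first shows whether a fill strategy such as the mode touches a handful of rows or a large share of the column.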
Program:
import pandas as pd
print(df['type'].unique()) # This is to check the unique values in the 'type' column
Output:
Program:
import pandas as pd
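df['country'] = df['country'].fillna('') # Treat missing countries as empty strings
# Split multi-country entries, trim whitespace, and drop empty placeholders before counting
countries = df['country'].str.split(',', expand=True).stack().str.strip()
country_count = countries[countries != ''].value_counts()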
top_5_countries = country_count.head(5)
print("\nTop 5 Countries Producing the Most Netflix Content:")
print(top_5_countries)
Output:
Program:
import pandas as pd
# Exclude titles with no listed director; filling the gaps with a real name would inflate that director's count
directors = df['director'].dropna().str.split(',', expand=True).stack().str.strip()
director_count = directors.value_counts()
top_10_directors = director_count.head(10)
# Display the top 10 directors
print("\nTop 10 Directors with the Most Movies/Shows:")
print(top_10_directors)
Output:
d) Find out the most common genres (column: listed_in).
Program:
import pandas as pd
# Exclude titles with no listed genre rather than filling them with a placeholder label
genres = df['listed_in'].dropna().str.split(',', expand=True).stack().str.strip()
genre_count = genres.value_counts()
top_genres = genre_count.head(10) # Adjust the number for top N genres you want
# Display the most common genres
print("\nMost Common Genres:")
print(top_genres)
Output:
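The country, director, and genre programs all repeat the same pattern: split a comma-separated column, trim whitespace, and count the individual entries. That pattern could be factored into one helper. The following is a sketch under that assumption; the function name `count_multi_valued` is hypothetical, and the sample values are made up.

```python
import pandas as pd

def count_multi_valued(series, sep=","):
    """Count individual entries in a column of separator-joined strings (hypothetical helper)."""
    # Drop missing cells, split each cell into a list, then give every entry its own row
    values = series.dropna().str.split(sep).explode().str.strip()
    # Ignore empty strings left behind by stray separators
    return values[values != ""].value_counts()

# Made-up sample in the style of the 'listed_in' column
sample = pd.Series(["Dramas, Comedies", "Dramas", None, "Comedies, Dramas , Action"])
genre_count = count_multi_valued(sample)
print(genre_count.head(10))
```

With the real data, the same call could be applied to `df['country']`, `df['director']`, or `df['listed_in']`.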
e) Analyze the trend: How many shows/movies were released each year?
Program:
import pandas as pd
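The body of this program is not shown. One way to count titles per release year is sketched below on a made-up sample; the variable names are chosen for illustration, and the real analysis would use `df['release_year']` from the loaded dataset.

```python
import pandas as pd

# Made-up sample; one year is stored as text and one is missing
sample = pd.DataFrame({"release_year": [2018, 2019, 2019, "2020", 2020, 2020, None]})

# Coerce to numbers, drop unparseable or missing years, and count titles per year
years = pd.to_numeric(sample["release_year"], errors="coerce").dropna().astype(int)
releases_per_year = years.value_counts().sort_index()

print(releases_per_year)
```

Sorting by index puts the counts in chronological order, which is the order a trend table or line plot needs.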
Part 4: Visualizations
Program:
import pandas as pd
import matplotlib.pyplot as plt
content_count = df['type'].value_counts()
plt.figure(figsize=(8, 6))
content_count.plot(kind='bar', color=['skyblue', 'lightgreen'])
plt.title('Number of Movies vs TV Shows')
plt.xlabel('Content Type')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()
Output:
Program:
import pandas as pd
import matplotlib.pyplot as plt
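# Split multi-country entries so each country is counted separately
countries = df['country'].dropna().str.split(',').explode().str.strip()
country_production_count = countries[countries != ''].value_counts()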
# Select the top 5 countries by production count
top_5_countries = country_production_count.head(5)
# Create a pie chart for the top 5 countries by production count
plt.figure(figsize=(8, 8))
plt.pie(top_5_countries, labels=top_5_countries.index, autopct='%1.1f%%',
startangle=140, colors=['#ff9999','#66b3ff','#99ff99','#ffcc99','#c2c2f0'])
# Title for the pie chart
plt.title('Top 5 Countries by Number of Netflix Productions')
# Display the pie chart
plt.show()
Output:
Program:
import pandas as pd
import matplotlib.pyplot as plt
df['release_year'] = pd.to_numeric(df['release_year'], errors='coerce')
# Group by 'release_year' and count the number of shows/movies
release_trend = df.groupby('release_year').size()
# Plot the trend
plt.figure(figsize=(12, 6))
release_trend.plot(kind='line', color='b', marker='o', linestyle='-', linewidth=2,
markersize=5)
plt.title('Trend of Shows/Movies Released Each Year')
plt.xlabel('Year')
plt.ylabel('Number of Shows/Movies Released')
plt.grid(True)
plt.xticks(rotation=45)
plt.tight_layout()
# Show the plot
plt.show()
Output:
Program:
import pandas as pd
from wordcloud import WordCloud
import matplotlib.pyplot as plt
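# Count each genre per title so multi-word genres stay intact in the cloud
genre_freq = df['listed_in'].dropna().str.split(',').explode().str.strip().value_counts().to_dict()
# Generate the word cloud from the genre frequencies
wordcloud = WordCloud(width=800, height=400,
background_color='white').generate_from_frequencies(genre_freq)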
# Display the word cloud
plt.figure(figsize=(10, 8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off') # Hide the axes
plt.title('Most Frequent Genres in Netflix Dataset')
plt.show()
Output:
Movies vs TV Shows:
Movies make up the majority of Netflix content, accounting for about 70% of the total
titles, while TV Shows represent the remaining 30%. This suggests that Netflix
emphasizes one-time viewing content more heavily than episodic series.
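A percentage split like the one quoted above can be read directly from `value_counts(normalize=True)`. The following is a minimal sketch on a made-up ten-row sample chosen to mirror the reported ratio; it is not the actual dataset, and the variable names are illustrative.

```python
import pandas as pd

# Made-up sample: 7 movies and 3 TV shows
sample_types = pd.Series(["Movie"] * 7 + ["TV Show"] * 3, name="type")

# Fraction of each type, converted to a percentage and rounded to one decimal place
share = sample_types.value_counts(normalize=True).mul(100).round(1)
print(share)
```

Applied to the real data, `df['type']` would take the place of `sample_types`.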
Top 5 Countries:
The United States leads as the top producer of Netflix content, followed by India, the
United Kingdom, Canada, and Brazil. This reflects Netflix's strong presence in
English-speaking countries, along with significant contributions from India and other
global regions.
Top 5 Directors:
The dataset reveals that directors like Martin Scorsese, David Fincher, Steven
Spielberg, Quentin Tarantino, and Christopher Nolan are among the most
frequently credited on Netflix content. This indicates the platform’s interest in big-
name directors known for their high-quality and influential work across both movies
and TV shows.
Common Genres:
Drama is the most prevalent genre across Netflix content, followed by Comedy and
Action. These genres dominate the platform, indicating Netflix’s focus on engaging,
broad-appeal content that resonates with diverse audiences.
Release Trend:
The number of Netflix releases has significantly increased, especially from 2018
onward, with a noticeable spike in 2020. This suggests that Netflix ramped up its
content production to meet growing global demand and competition in the streaming
industry.
https://colab.research.google.com/drive/1CVrxGwfDbgR0ssLJ2rxSnf3thayzobH?usp=sharing
CONCLUSION:
The analysis of the Netflix dataset reveals several key trends about the
platform’s content strategy. Netflix has a stronger focus on movies, with a higher
number of titles in that category compared to TV shows. Content production has seen
substantial growth since 2018, with a marked increase in 2020, reflecting the
platform’s expansion. The U.S. leads in content production, though countries like
India, the U.K., and Brazil also contribute significantly to Netflix’s diverse library.
Popular genres like Drama, Comedy, and Action dominate, indicating Netflix's focus
on broad, global appeal. The involvement of renowned directors further emphasizes
Netflix's commitment to high-quality, engaging content. Overall, Netflix's catalog
shows a clear strategy to cater to a wide range of tastes and expand its international
reach.