
PCED_Solution

October 31, 2024

[1]: import pandas as pd
import numpy as np
from scipy.stats import pearsonr
import matplotlib.pyplot as plt
import seaborn as sns

0.1 1.1.1 Understand different data collection methods and their roles in decision-making and research
0.1.1 Task 1
Answer: C (The combination of name, e-mail and date of birth is sufficient to uniquely identify a person
and must therefore be anonymized).

0.1.2 Task 2
Answer: B (The telephone numbers are hashed so that they cannot be restored, which constitutes secure anonymization).
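
For illustration only, a minimal sketch of such one-way anonymization using Python's hashlib (the salt handling and the example number are assumptions, not part of the task):

import hashlib

def anonymize_phone(phone: str, salt: str = "fixed-salt") -> str:
    # One-way hash: the original number cannot be recovered from the digest
    return hashlib.sha256((salt + phone).encode("utf-8")).hexdigest()

print(anonymize_phone("+49 170 1234567"))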

0.1.3 Task 3
Answer: C (For sufficient anonymization, the first name, surname and address should be made
unrecognizable. The diagnosis code can remain as it does not establish a direct personal reference).

0.2 1.1.2 Explain the data gathering process and various data sources.
0.2.1 Task 1
Answer: B (Stratified sampling makes it possible to target different population groups in different
cities and thus obtain a more comprehensive and representative data set).

0.2.2 Task 2
Answer: D (By surveying from each department in proportion to department size, different
perspectives and opinions are taken into account, resulting in a comprehensive data set).

0.2.3 Task 3
Answer: C (A quota sample that specifically includes participants from both groups ensures that
opinions from urban and rural areas are recorded equally, thus creating a balanced data set).

0.3 1.1.3 Aggregate data from multiple sources and integrate them into datasets.
0.3.1 Task 1
Answer: B (By converting all satisfaction values to a common scale, the data can be compared and
combined).

0.3.2 Task 2
Answer: B (The use of a unique key such as the customer ID enables a correct link and ensures that
each customer can be uniquely identified).

0.3.3 Task 3
Answer: B (By converting to a common date format or to consistent quarterly or annual values, the
purchase histories can be merged and compared more easily).

0.4 1.1.4 Explain various data storage solutions.


0.4.1 Task 1
Answer: B (SQL Database is well suited for structured data and supports complex queries and
analyses).

0.4.2 Task 2
Answer: A (A data lake stores large amounts of raw data without prior structuring).

0.4.3 Task 3
Answer: C (A .txt file stores data simply and without any special structure, which makes it
ideal for notes or logs).

0.4.4 Task 4
Answer: B (The cloud enables secure and flexible access to data from anywhere).

0.4.5 Task 5
Answer: D (Excel is known for its tabular and user-friendly functions, which are ideal for smaller data
sets).

0.5 1.2.1 Understand structured and unstructured data and their implications in data analysis.
0.5.1 Task 1
Answer: A (Structured Data, as the data is completely organized and in a tabular form).

0.5.2 Task 2
Answer: B (Semi-Structured Data, as the emails contain a mixture of structured fields and free
text).

0.5.3 Task 3
Answer: C (Unstructured data, as images and videos have no defined structure and are difficult
to search or analyze).

0.6 1.2.2 Identify, rectify, or remove erroneous data.


0.6.1 Task 1
Answer: A (Since the e-mail is crucial for communication, rows without this information could be deleted to avoid incorrect data records).

0.6.2 Task 2
Answer: A (The missing values can be replaced by the average value, which is a common method
to close gaps without distorting the data set).

0.6.3 Task 3
Answer: B (The median is often useful because it is less sensitive to outliers and is a more realistic
estimate).

0.7 1.2.3 Understand data normalization and scaling.


0.7.1 Task 1
Answer: B (Normalization to 0-1 ensures that columns with large scales such as purchase price
and satisfaction have comparable value ranges and the analysis is not dominated by large numbers).
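
For illustration, a minimal min-max scaling sketch in pandas (the column names are invented):

df = pd.DataFrame({"purchase_price": [100, 250, 900], "satisfaction": [3, 5, 1]})

# Min-max normalization: every numeric column is scaled to the range [0, 1]
df_scaled = (df - df.min()) / (df.max() - df.min())
print(df_scaled)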

0.7.2 Task 2
Answer: B (One-hot encoding is a common method to convert categorical variables into a format
readable by machine learning without assuming numerical relationships).
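
For illustration, a minimal one-hot encoding sketch with pd.get_dummies (the 'color' column is invented):

df = pd.DataFrame({"color": ["red", "green", "red", "blue"]})

# One-hot encoding: each category becomes its own 0/1 column
print(pd.get_dummies(df, columns=["color"]))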

0.7.3 Task 3
Answer: C (The email column should be deleted as it is irrelevant for the analysis and does not contain
any useful information for pattern recognition).

0.8 1.2.4 Apply data cleaning and standardization techniques.


0.8.1 Task 1
Answer: B (Using .replace(), "missing" values can be replaced by np.nan, which enables uniform
handling of all missing data).

0.8.2 Task 2
Answer: B (Marking the negative values as missing data and then applying a suitable strategy for
handling the missing values makes sense, as negative values are considered data errors here).

0.8.3 Task 3
Answer: B (Using .replace(), "Missing" and -1 can be marked as np.nan and thus treated as
missing values, which facilitates cleanup and analysis).
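
For illustration, a minimal sketch that maps both placeholders to np.nan in a single .replace() call (the 'income' column is invented):

df = pd.DataFrame({"income": [52000, "Missing", -1, 61000]})

# Map both placeholders to np.nan, then handle them like any other missing value
df["income"] = df["income"].replace(["Missing", -1], np.nan)
print(df["income"].isna().sum())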

0.8.4 1.3.1 Execute and understand basic data validation methods.


0.8.5 1.3.2 Establish and maintain data integrity through clear validation rules.

[2]: order = pd.read_csv("Order.csv")
order

[2]:    CustomerID  OrderNo  Quantity  Payment  Cancellation
     0           2   232432         2     55.0            -1
     1           3   123125         1     37.5            -1
     2           7   295242         3     99.6            -1
     3           3   543212         1    -37.5        123125
Task: Write a function that checks whether this DataFrame is OK. It must be checked whether CustomerID, Quantity and OrderNo are integers and Payment is a float. In addition, Payment must be positive if there is no cancellation. Quantity must be positive.
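
One possible sketch of such a check function. It assumes that -1 in the Cancellation column means "no cancellation" (as the sample rows suggest) and that Payment was parsed as a float (e.g. by reading the CSV with decimal=','):

def check_order(df: pd.DataFrame) -> bool:
    # CustomerID, OrderNo and Quantity must be integer columns, Payment must be float
    if not all(pd.api.types.is_integer_dtype(df[col]) for col in ["CustomerID", "OrderNo", "Quantity"]):
        return False
    if not pd.api.types.is_float_dtype(df["Payment"]):
        return False
    # Quantity must always be positive
    if (df["Quantity"] <= 0).any():
        return False
    # Payment must be positive whenever there is no cancellation (assumed marker: -1)
    no_cancellation = df["Cancellation"] == -1
    if (df.loc[no_cancellation, "Payment"] <= 0).any():
        return False
    return True

print(check_order(order))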

0.9 1.4.1 Understand File Formats in Data Acquisition.


Answer: A (TXT stores raw text data, XLS is table-based and XML uses tags for structured
data).

0.9.1 Task 2
Answer: A (CSV uses commas as separators, XML uses tags, and XLS is table-based).

0.9.2 Task 3
Answer: A (JSON is hierarchical and API-friendly, TXT is unstructured, and CSV uses commas
for tabular data).

0.10 1.4.2 Access, manage, and effectively utilize datasets.


0.10.1 Task 1
Answer: A (The code filters the list data and only keeps values greater than 1000).

0.10.2 Task 2
Answer: A (The code filters words with more than 5 characters and converts them to capital
letters).

0.10.3 Task 3
Answer: A (The code filters all even numbers and calculates their square).
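
The original code snippets are not reproduced here; purely illustrative equivalents of the three descriptions could look like this:

data = [500, 1200, 3400, 800]
large_values = [x for x in data if x > 1000]            # Task 1: keep only values greater than 1000

words = ["data", "python", "analysis", "plot"]
long_upper = [w.upper() for w in words if len(w) > 5]   # Task 2: words with more than 5 characters, uppercased

numbers = range(1, 11)
even_squares = [n ** 2 for n in numbers if n % 2 == 0]  # Task 3: squares of the even numbers

print(large_values, long_upper, even_squares)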

0.11 1.4.3 Extract data from various sources.


0.11.1 Task 1
Answer: B (By using JOIN statements, the tables in the relational database can be linked efficiently
and relevant information can be merged).

0.11.2 Task 2
Answer: B (ETL tools are ideal for extracting data from different formats and transforming it into a
standardized format, which facilitates analysis).

0.11.3 Task 3
Answer: B (A script for regular and incremental extraction of API data into the data warehouse
is efficient and facilitates daily data integration).

0.11.4 Task 4
Answer: A (A data warehouse already contains consolidated and cleansed data, which makes data analysis more efficient and reliable).

0.11.5 Task 5
Answer: B (The application of transformation rules in an ETL process or script helps to standardize data
formats and establish compatibility).

0.11.6 1.4.4 Enhance data readability and format in spreadsheets.

0.11.7 Task 1
Answer: B (Bold headings and an adjusted column width improve the readability of important
information).

0.11.8 Task 2
Answer: A (Conditional formatting emphasizes important numerical values and makes trends easier
to see).

0.11.9 Task 3
Answer: A (Dividing the data by department into separate tabs provides clarity and makes navigation
easier).

0.12 1.4.5 Prepare, adapt, and pre-process data for analysis
0.12.1 Task 1
Answer: B (A mapping step helps to standardize the spellings so that the region
"North America" can be used consistently in the analysis).

0.12.2 Task 2
Answer: B (The conversion to a standardized format and the extraction of year and month facilitate
the temporal analysis).

0.12.3 Task 3
Answer: B (The standardization of spellings and the filling in of missing values ensure that the
regions are correctly and completely taken into account in the analysis).

0.12.4 2.1.1 Apply Python syntax and control structures to solve data-related problems.
Maximum PCAP level

0.12.5 2.1.2 Analyze and create Python functions.


Maximum PCAP level

0.12.6 2.1.3 Evaluate and navigate the Python Data Science ecosystem.
0.12.7 1. data analysis and manipulation
• Pandas: One of the most basic libraries for working with structured data, such as DataFrames. Enables data manipulation, cleansing and aggregation.
• NumPy: Provides basic support for large, multidimensional arrays and matrices as well as
mathematical functions that work on these structures.

0.12.8 2. data visualization


• Matplotlib: A comprehensive library for creating static, animated and interactive
visualizations. It is very flexible and often the basis for other visualization libraries.
• Seaborn: Is based on Matplotlib and offers a simpler interface for statistical visualizations.
Well suited for creating appealing graphics and visualizations with just a few lines of code.
• Plotly: A library for interactive, web-based visualizations that also supports complex diagrams
and dashboards.

0.12.9 3. machine learning and statistics


• Scikit-Learn: One of the most widely used libraries for machine learning in Python, with a
variety of algorithms for classification, regression, clustering and model evaluation.
• Statsmodels: Provides statistical models and tests as well as advanced analysis tools that are
particularly suitable for time series analyses and regression models.

0.12.10 4. deep learning


• TensorFlow: An open source library for machine learning and deep learning, which is used in
particular for the creation and training of neural networks.
• PyTorch: A deep learning library that is characterized by flexibility and user-friendliness and supports dynamic computation graphs.
• Keras: A user-friendly deep learning framework that is often used together with TensorFlow and facilitates the rapid creation and training of neural networks.

0.12.11 5. database and query


• SQLAlchemy: A library for working with relational databases in Python that enables SQL
queries and database manipulations.
• Psycopg2: An adapter specially developed for working with PostgreSQL databases in Python.
• PyODBC: Enables the connection and querying of databases via ODBC, which supports
work with many different database systems.

0.12.12 6. big data and distributed computing


• Dask: A library for parallel and distributed processing that enables Pandas- and NumPy-
compatible operations for large data sets on clusters.

• PySpark: An interface to Apache Spark designed for processing large amounts of data and
creating machine learning pipelines on Spark environments.

0.12.13 7. data preparation and cleansing


• BeautifulSoup: Used to extract data from HTML and XML files, useful for web scraping and
data preparation.
• OpenCV: A computer vision library for processing and analyzing image and video data.

0.12.14 8. jupyter notebooks for interactive analyses


• Jupyter Notebook: An interactive development environment that enables the writing,
execution and sharing of Python code, especially for data analysis and machine learning.
• JupyterLab: An extended version of the Jupyter Notebook that offers additional functions and
a flexible user interface.

0.12.15 1. clean up tables


• Pandas: Pandas is the standard library for working with tabular data and enables a variety of
data cleansing tasks such as removing missing values, cleansing and reformatting columns and
filtering data. It offers DataFrames that make data manipulation and cleansing efficient.

0.12.16 2. regression calculations


• Scikit-Learn: Scikit-Learn offers many methods for performing linear and logistic regression
as well as advanced regression methods such as ridge, lasso and elastic net regression.
• Statsmodels: This library is particularly useful for statistical analyses and offers detailed
model evaluations as well as statistical tests and diagnostic tools for regression analyses.

0.12.17 3. time series analyses


• Statsmodels: Statsmodels contains specialized time series models such as ARIMA, SARIMA
and state-space models and is particularly well suited for the investigation of time series.
• Prophet: Developed by Facebook, Prophet is an open-source library specifically designed for
predicting and modeling time series, especially seasonal patterns.
• Pandas: Pandas is an excellent tool for basic time series operations such as resampling,
aggregation and indexing.

0.12.18 4. natural language processing (NLP)


• NLTK (Natural Language Toolkit): One of the oldest NLP libraries in Python, providing a variety of tools for tokenization, stemming, part-of-speech tagging and more.

• spaCy: A powerful NLP library specially developed for speed and efficiency. It is well suited
for Named Entity Recognition (NER), dependency parsing and text classification.
• Hugging Face Transformers: This library contains pre-trained models such as BERT, GPT
and other advanced models used for many NLP tasks such as text generation, translation and
sentiment analysis.

0.12.19 5. convolutional neural networks (CNN)


• TensorFlow and Keras: These frameworks are ideal for the development and training of
CNNs. Keras offers a user-friendly API that is particularly suitable for the rapid prototyping of
CNN models.
• PyTorch: Another popular library for training CNNs, which is valued for its dynamic computation graphs and flexibility. PyTorch is particularly suitable for research and development.

0.12.20 6. recurrent neural networks (RNN)


• TensorFlow/Keras: These frameworks offer native support for RNN models such as LSTMs
and GRUs and are therefore particularly suitable for tasks such as time series analysis and
language modeling.
• PyTorch: PyTorch is also suitable for the implementation of RNNs and is often used in
research for working with recurrent networks and hybrid model architectures.

0.12.21 7. deep learning in general


• TensorFlow: One of the leading deep learning frameworks with comprehensive support for
training and deploying neural networks. TensorFlow is suitable for a wide range of deep
learning applications, from computer vision to NLP.
• Keras: A high-level API that builds on TensorFlow and is particularly useful for rapid
prototyping. Keras makes the creation and training of deep learning models easier and clearer.
• PyTorch: Particularly popular with researchers, PyTorch offers a flexible architecture for deep
learning models and is often used for experiments and the implementation of new model
architectures.
SciPy is mainly used for scientific and technical calculations and builds on NumPy to support more complex mathematical and statistical operations. Here are the main application areas of SciPy:
1. Numerical calculations and optimization:
• SciPy offers functions for optimization problems, such as minimization and optimization
of functions, often in applications of statistics, physics and engineering.
2. Linear Algebra:
• SciPy contains many advanced linear algebra operations such as matrix factorization,
eigenvalues and eigenvectors and solving linear systems of equations.
3. Signal and image processing:

• There are special modules for Fourier transforms, signal processing and filters, which are
used in areas such as image processing and time series analysis.
4. Integration and differential equations:
• SciPy has tools for numerical integration and for solving differential equations that are
frequently encountered in scientific and technical fields.
5. Statistics:
• SciPy offers a large number of statistical functions, e.g. distributions, statistical tests and
random number generation, which are useful for data analysis and statistical modeling.

0.12.22 2.1.4 Organize and manipulate data using Python's core data structures.
Lists, tuples, dictionaries etc.

0.12.23 2.1.5 Explain and implement Python scripting best practices.


e.g.: PEP 8

0.12.24 2.2.1 Import modules and manage Python packages using PIP.
from numpy.random import randint
pip install pandas

0.12.25 2.2.2 Apply basic exception handling and maintain script robustness.
Recognize the try-except block correctly!
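
A minimal example of the structure to recognize:

try:
    value = int("not a number")
except ValueError as err:
    # The specific exception is caught, so the script keeps running
    print(f"Conversion failed: {err}")
finally:
    print("This block always runs.")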

0.12.26 2.3.1 Perform SQL queries to retrieve and manipulate data.


0.12.27 2.3.2 Execute fundamental SQL commands to create, read, update, and
delete data in database tables.
0.12.28 2.3.3 Establish connections to databases using Python.
0.12.29 2.3.4 Execute parameterized SQL queries through Python to safely interact
with databases.
0.12.30 2.3.5 Understand, manage and convert SQL data types appropriately within
Python scripts.
0.12.31 2.3.6 Understand essential database security concepts, including strategies to
prevent SQL query injection
https://en.wikipedia.org/wiki/SQL_injection
https://en.wikipedia.org/wiki/Prepared_statement
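
For illustration, a short sketch with Python's built-in sqlite3 module showing a parameterized query (the table and values are invented; placeholder syntax differs between drivers, '?' is the sqlite3 style):

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("INSERT INTO users (name) VALUES (?)", ("Alice",))

user_input = "Alice'; DROP TABLE users; --"   # malicious input is treated as plain data, not SQL
cur.execute("SELECT * FROM users WHERE name = ?", (user_input,))
print(cur.fetchall())                         # [] - no match, and the table still exists
conn.close()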

0.12.32 3.1.1 Understand and apply statistical measures in data analysis.


What is the mean, median, mode and range of the temperatures?

[3]: df1 = pd.DataFrame({"Month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug",
                         "Sep", "Oct", "Nov", "Dec"]})
df2 = pd.DataFrame({"Temperature": [5.6, 7.7, 10.4, 18, 22.3, 27.5, 29.4, 33.4,
                    27.4, 20.1, 7.7, 2.3]})

monat_temp = pd.concat([df1,df2], axis=1)


monat_temp.set_index("Month", inplace=True)
monat_temp
[3]: Temperature
Month
Jan 5.6
Feb 7.7
Mar 10.4
Apr 18.0
May 22.3
Jun 27.5
Jul 29.4
Aug 33.4
Sep 27.4
Oct 20.1
Nov 7.7
Dec 2.3

[4]: monat_temp.describe()

[4]: Temperature
count 12.000000
mean 17.650000
std 10.605959
min 2.300000
25% 7.700000
50% 19.050000
75% 27.425000
max 33.400000

Mean: 17.65
Median: 19.05
Mode: 7.7
Range: 33.4 - 2.3 = 31.1

0.12.33 3.1.2 Analyze and evaluate data relationships.


Given is the table and the Pearson R coefficient 0.78!
Are there any unusual values and what does the Pearson coefficient mean?
There is one outlier: "Acer saccharum", which reaches a height of only 9 cm after 7 hours of sunlight, which seems unusual. The Pearson R coefficient is 0.78 > 0.5 and positive. This means that there is a STRONG positive correlation between the number of hours of sunshine and the height.

Pearson R    Strength

> 0.5        strong
0.3 to 0.5   medium
0 to 0.3     weak
0            none

Then, of course, there is "positive" or "negative": positive if both variables increase together, negative if one rises while the other falls.
[5]: flowers = pd.read_csv("Flowers.csv")
flowers

[5]:                     Name  Sunshine  Height
     0        Acer circinatum         8      20
     1    Acer pseudoplatanus         3       9
     2         Acer saccharum         7       9
     3       Acer tegmentosum         9      23
     4  Quercus castaneifolia         7      17
     5        Quercus dentata         6      16
     6        Quercus petraea         8      21
[6]: pearsonr(flowers['Sunshine'], flowers['Height'])

[6]: (0.7851358838853206, 0.03647130365762987)

0.12.34 Task 1
Answer: C (It can be seen that the higher the price, the lower the sales figures, which indicates a
negative correlation).

0.12.35 Task 2
Answer: C (Age and income are positively correlated, as income also rises with increasing age).

0.12.36 Task 3
Answer: A (There is a tendency for more study hours to correlate with better grades, indicating
a positive relationship between study hours and grade quality).

0.12.37 3.2.1 Understand and apply bootstrapping for sampling distributions.

0.12.38 Task 1
Answer: B (In bootstrapping, many samples are drawn from the original sample with replacement in order to create a distribution of mean values).

0.12.39 Task 2
Answer: C (The standard deviation of the bootstrapped mean is the standard error that quantifies the
uncertainty of the mean).

0.12.40 Task 3
Answer: B (The 95% confidence interval is determined by the values of the 2.5% and 97.5%
percentiles of the distribution of the 1,000 medians).

0.12.41 Task 4
What does a 95% confidence interval mean?
If you create a lot of confidence intervals for different samples of this data, then the ACTUAL value
of the median should be in about 95% of these confidence intervals.
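
A compact sketch of the bootstrap procedure described above, using the temperature values from earlier purely as example data:

rng = np.random.default_rng(42)
sample = np.array([5.6, 7.7, 10.4, 18, 22.3, 27.5, 29.4, 33.4, 27.4, 20.1, 7.7, 2.3])

# Draw 1,000 resamples with replacement and record the mean of each one
boot_means = [rng.choice(sample, size=sample.size, replace=True).mean() for _ in range(1000)]

standard_error = np.std(boot_means)                        # Task 2: uncertainty of the mean
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])   # Tasks 3/4: 95% confidence interval
print(standard_error, ci_low, ci_high)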

0.12.42 3.2.2 Explain when and how to use linear and logistic regression, including
appropriateness and limitations.

0.12.43 Task 1
Answer: A (Linear regression is best suited here because the score is a continuous variable and a
linear relationship to the learning hours is assumed).

0.12.44 Task 2
Answer: B (Logistic regression is best suited because it can model probabilities and was developed for binary target variables).

0.12.45 Task 3
Answer: A (Linear regression is not suitable if the target variable is binary, as it cannot model
probabilities correctly and logistic regression should be used instead).
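
For illustration, a minimal scikit-learn sketch contrasting the two models (the toy data is invented):

from sklearn.linear_model import LinearRegression, LogisticRegression

hours = np.array([[1], [2], [3], [4], [5], [6]])

# Continuous target (exam score) -> linear regression
scores = np.array([52, 57, 63, 70, 74, 81])
lin = LinearRegression().fit(hours, scores)
print(lin.predict([[4.5]]))

# Binary target (passed yes/no) -> logistic regression, which models probabilities
passed = np.array([0, 0, 0, 1, 1, 1])
log = LogisticRegression().fit(hours, passed)
print(log.predict_proba([[4.5]]))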

0.12.46 4.1.1 Manage data effectively with Pandas.


What do you do with "fillna", "dropna", "drop_duplicates", "apply" in Pandas?

0.12.47 1. fillna()
• Usage: The fillna() method is used to replace missing values (NaN) in a DataFrame or a
series.
• Examples:
– Fill with a fixed value: df['Column'].fillna(0), replaces all NaN values in the
column with 0.
– Fill with the average value: df['Column'].fillna(df['Column'].mean()), replaces
all NaN values with the average value of the column.

0.12.48 2. dropna()
• Usage: dropna() is used to remove rows or columns with missing values.
• Examples:
– Remove rows with NaN values: df.dropna(), removes all rows that contain at least one NaN value.
– Only remove rows where all values are NaN: df.dropna(how='all').
– Remove columns with missing values: df.dropna(axis=1).

0.12.49 3. drop_duplicates()
• Usage: This method removes duplicate rows in the DataFrame so that only the first occurrences are retained.
• Examples:
– Remove duplicate rows: df.drop_duplicates(), keeps only the first occurrence of each row.
– Remove duplicate entries based on a specific column:
df.drop_duplicates(subset='column'), removes all duplicates based on the values
in the specified column.

0.12.50 4. apply()
• Usage: apply() is used to apply a function to each row or column of a DataFrame or to each
element of a series. This is useful for calculations
and transformations.
• Examples:
– Apply function to a column: df['Column'].apply(lambda x: x * 2), doubles all values in the specified column.

– Apply function to each row: df.apply(lambda row: row['Column1'] +
row['Column2'], axis=1), calculates the sum of column1 and column2 for each
row.
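
A small end-to-end sketch tying the four methods together (the DataFrame is invented):

df = pd.DataFrame({
    "Column":  [1.0, np.nan, 3.0, 1.0],
    "Column1": [10, 20, 30, 10],
    "Column2": [1.0, 2.0, np.nan, 1.0],
})

df["Column"] = df["Column"].fillna(df["Column"].mean())   # fill missing values with the column mean
df = df.dropna()                                          # drop the remaining rows that contain NaN
df = df.drop_duplicates()                                 # remove duplicate rows
df["Sum"] = df.apply(lambda row: row["Column1"] + row["Column2"], axis=1)  # row-wise calculation
print(df)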

0.12.51 4.1.2 Understand and Utilize the Relationship Between DataFrame and Series
in Pandas.

0.12.52 Task 1
Answer: B (With df['Age'] a single column is extracted as a Series, while df[['Age']] returns a DataFrame).

0.12.53 Task 2
Answer: A (With df['Price'] = df['Price'] * 0.9 he can directly reference the column as a Series and apply the discount).

0.12.54 Task 3
Answer: C (A Series is one-dimensional, similar to a single column, while a DataFrame is two-
dimensional and can contain multiple columns).

0.12.55 Task 4
Answer: D (a Series!)
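
A quick check of the distinction (the 'Age' column is invented):

df = pd.DataFrame({"Age": [25, 32, 47]})
print(type(df["Age"]))     # <class 'pandas.core.series.Series'>   - one-dimensional
print(type(df[["Age"]]))   # <class 'pandas.core.frame.DataFrame'> - two-dimensional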

0.12.56 4.1.3 Perform Array Operations and Differentiate Data Structures with
NumPy.
Answer: B (a * b calculates the element-by-element product and gives [4, 10, 18]).

0.12.57 Task 2
Answer: D (All these options calculate the transpose of A and return [[1, 3], [2, 4]]).

0.12.58 Task 3
Answer: A (np.linalg.det(B) calculates the determinant of the matrix B, which results in 4 here).

0.12.59 Task 4
How do you correctly calculate the matrix multiplication of A = np.array([[2, 0], [0, 2], [1, 2]]) and B = np.array([[1, 3, 2], [2, -3, 1]])? There are 2 correct answers!
1) A: np.dot(A,B)
2) B: np.multiply(A,B)
3) C: A @ B
4) D: np.dot(A,B.T)
A and C
[7] : A = np.array([[2, 0], [0, 2], [1,2]])
B = np.array([[1,3,2],[2,-3,1]])

print(np.dot(A,B))
print(A @ B)

[[ 2 6 4]
[ 4 -6 2]
[ 5 -3 4]]
[[ 2 6 4]
[ 4 -6 2]
[ 5 -3 4]]

[8] : print(np.multiply(A,B))

ValueError Traceback (most recent call last)


~\AppData\Local\Temp\ipykernel_6536\1187759338.py in <module>
----> 1 print(np.multiply(A,B))

ValueError: operands could not be broadcast together with shapes (3,2) (2,3)

[ ]: print(np.dot(A,B.T))

0.12.60 4.1.4 Apply and Analyze Data Organization Techniques in Pandas and
NumPy.

0.12.61 Task 1
Answer: B (groupby with sum() groups the data by region and product and totals the sales for
each combination).

0.12.62 Task 2
Answer: A (data.reshape(3, 3) reshapes the 1D array into a 3x3 matrix so that row and column
operations are possible).

0.12.63 Task 3
Answer: C (drop_duplicates(subset='values') removes duplicate entries in the values
column and leaves only the unique values in the DataFrame).
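
For illustration, a short sketch of the operations from Tasks 1 and 2 (the data is invented):

sales = pd.DataFrame({
    "Region":  ["North", "North", "South", "South"],
    "Product": ["A", "A", "A", "B"],
    "Sales":   [100, 150, 200, 50],
})
print(sales.groupby(["Region", "Product"])["Sales"].sum())   # Task 1: total sales per combination

data = np.arange(1, 10)
print(data.reshape(3, 3))                                    # Task 2: 1D array reshaped into a 3x3 matrix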

0.12.64 4.2.1 Apply Python's descriptive statistics for dataset analysis.

0.12.65 Task 1
Answer: B (data.mode() calculates the mode and returns 4 in this case, since 4 occurs most
frequently).

0.12.66 Task 2
Answer: B (sales.min() returns the minimum value and sales.max() the maximum value of the
array).

0.12.67 Task 3
Answer: B (df['Age'].mean() calculates the average age in the Age column).

0.12.68 Task 4
How do you calculate the mode if you don't have a Pandas DataFrame, but only a normal Python
list?
[ ]: data = [1, 2, 2, 2, 2, 2, 3, 4, 4, 4, 5]
result = max([(data.count(i), i) for i in data])
print(f "The number of {result[1]} is the largest and equal to {result[0]}!")

0.12.69 4.2.2 Recognize the importance of test datasets in model evaluation.

0.12.70 Task 1
Answer: B (The test data set helps to evaluate the performance of the model on new data and avoid
overfitting).

0.12.71 Task 2
Answer: B (A high accuracy on the training data, but a low accuracy on the test data, indicates overfitting).

0.12.72 Task 3
Answer: B (The test data set is used for the final evaluation of the model after training and
optimization are completed).

0.12.73 4.2.3 Analyze and Evaluate Supervised Learning Algorithms and Model Accuracy.

0.12.74 Task 1
Answer: B (Accuracy is often misleading for unbalanced classes. Metrics such as precision and recall
are more appropriate in such cases).

0.12.75 Task 2
Answer: C (A lower MSE on the test dataset indicates that the linear regressor is better on this
data, but data understanding and applicability of the model should also be considered for model
selection).

0.12.76 Task 3
Answer: B (cross-validation reduces variance and provides a more robust estimate of model
performance, especially on new data).

0.12.77 Task 4
Answer: B (High precision means that when the model predicts "cancer" it is usually correct, i.e. it generates few false alarms in which "no cancer" cases are flagged as "cancer").

0.12.78 Task 5
Answer: A (A recall value of 85% means that the model has recognized 85% of all actual spam emails, which indicates a good ability to recognize the relevant class).

0.12.79 Task 6
Answer: C (A high precision and low recall mean that the model is accurate in its positive predictions,
but misses many true positive cases).
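
For illustration, a compact sketch of how these metrics can be computed with scikit-learn (the labels are invented):

from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # 1 = positive class (e.g. spam / "cancer")
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))
print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))   # share of predicted positives that are correct
print("Recall:   ", recall_score(y_true, y_pred))      # share of actual positives that were found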

0.12.80 5.1.1 Demonstrate essential proficiency in data visualization with Matplotlib and Seaborn.

0.12.81 Task 1
Answer: B (The command sns.histplot(data=df, x='age') creates a histogram of the age distribution,
which is ideal for visualizing frequencies).

0.12.82 Task 2
Answer: B (sns.scatterplot(x='height', y='weight', data=df) generates a scatter plot, which is ideally suited for visualizing the relationship between two numeric variables).

0.12.83 Task 3
Answer: B (sns.barplot(x='category', y='sales', data=df, estimator=np.mean)
creates a bar chart that shows the average sales figures per category, which
is ideal for comparing group averages).
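
For illustration, the three commands from the tasks collected into one runnable sketch (the DataFrame is invented; the column names follow the answers above):

df = pd.DataFrame({
    "age":      [23, 35, 45, 35, 52, 29],
    "height":   [170, 165, 180, 175, 160, 172],
    "weight":   [65, 59, 85, 78, 55, 70],
    "category": ["A", "B", "A", "B", "A", "B"],
    "sales":    [100, 150, 120, 130, 90, 160],
})

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
sns.histplot(data=df, x="age", ax=axes[0])                                     # Task 1: frequency distribution
sns.scatterplot(x="height", y="weight", data=df, ax=axes[1])                   # Task 2: relationship of two numeric variables
sns.barplot(x="category", y="sales", data=df, estimator=np.mean, ax=axes[2])   # Task 3: group averages
plt.tight_layout()
plt.show()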

0.12.84 5.1.2 Assess the pros and cons of different data representations.

0.12.85 Task 1
Answer: A (A line chart shows the trend over the months well for data set A, while a bar chart is
ideal to show the frequency of age groups in data set B).

0.12.86 Task 2
Answer: A (A scatter plot is suitable for the correlation between height and weight in data set A, while a
histogram visualizes the income distribution in data set B well).

0.12.87 Task 3
Answer: A (The line graph is good for visualizing the average temperatures over the year, while the histogram is good for showing the frequency of grades in data set B).

0.12.88 Task 4
Answer: B (A pie chart is suitable for showing the market shares of car brands as a percentage,
while a scatter chart can visualize the relationship between kilometers driven and fuel consumption).

0.12.89 5.1.3 Label, annotate, and test insights from data visualizations.

Task 1
Answer: A (Testing the marketing strategy for product B and comparing the results could provide
valuable insights on how to optimize the strategy for both products).

Task 2
Answer: C (A detailed analysis of the different effects on product A and B could help to optimize the
strategy and achieve better results).

0.12.90 Task 3
Answer: C (With plt.text(x, y, label) or ax.bar_label(bars) the exact values can be placed
directly on the bars).

0.12.91 Task 4
Answer: B (With plt.annotate() the analyst can mark and annotate the specific points in the scatter
plot).

0.12.92 Task 5
Answer: B (With plt.text() or plt.annotate() he can mark the conspicuous age group and
highlight the finding directly in the visualization).
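
For illustration, a small sketch of the annotation tools mentioned in Tasks 3 to 5 (the data is invented; ax.bar_label requires Matplotlib 3.4 or newer):

fig, ax = plt.subplots()
bars = ax.bar(["A", "B", "C"], [10, 25, 17])
ax.bar_label(bars)                                          # exact values on top of the bars
ax.annotate("conspicuous group", xy=(1, 25), xytext=(1.5, 22),
            arrowprops=dict(arrowstyle="->"))               # arrow plus label for a specific point
ax.text(0, 12, "baseline", ha="center")                     # free text placed inside the plot
plt.show()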

0.12.93 5.1.4 Improve the clarity and accuracy of data interpretation by managing
display features such as colors, labels and legends.
0.12.94 Task 1
Which graph is created with the following code?
The first graph! The line is red and the markers are blue and slightly transparent (alpha=0.7)!
x = [1,2,3,4,5,6]
y = [2,3,5,7,10,14]

plt.plot(x, y, marker='o', markersize=9, markerfacecolor="blue", linewidth=4,
         alpha=0.7, label="Rising plot", linestyle="--", color="red")

0.12.95 Task 2
Answer: A (The use of contrasting colors such as blue and red makes it easier to distinguish between
the two products).

0.12.96 Task 3
Answer: B (A legend at the top right or outside the diagram improves readability and avoids
overlapping with the data lines).

0.12.97 Task 4
Answer: A (By rotating the x-axis labels, age groups are clearly indicated, even if the names are
longer).

0.12.98 5.2.1 Tailor communication to different audience needs, and combine visualizations and text for clear data presentation.

0.12.99 Task 1
Answer: B (Giving the management an overview and providing the specialist department with
detailed programs optimally adapts the presentation to the needs of both target groups).

0.12.100 Task 2
Answer: B (A bar chart with a short text summary is easy to understand for a general audience and clearly highlights the decrease in satisfaction).

0.12.101 Task 3
Answer: B (Separate sections for the overview and the detailed analysis enable the needs of both
target groups to be optimally met).

0.12.102 5.2.2 Summarize key findings and support claims with evidence and reasoning.

0.12.103 Task 1
Answer: B (The comparison of sales figures with the previous quarters provides a clear basis and
supports the statement about the increase).

0.12.104 Task 2
Answer: B (By showing the regional differences and analyzing the influencing factors, the results are
clearly and comprehensibly supported).

0.12.105 Task 3
Answer: A (The comparison of the sales figures before and after the campaign and the consideration of
possible influencing factors support the statement about the effect of the campaign).

