PCED - Solutions
0.1 1.1.1 Understand different data collection methods and their roles in decision-making and research
0.1.1 Task 1
Answer: C (The combination of name, e-mail and date of birth is sufficient to uniquely identify a person
and must therefore be anonymized).
0.1.2 Task 2
Answer: B (The telephone numbers are hashed so that they cannot be restored, which constitutes secure anonymization).
0.1.3 Task 3
Answer: C (For sufficient anonymization, the first name, surname and address should be made
unrecognizable. The diagnosis code can remain as it does not establish a direct personal reference).
0.2 1.1.2 Explain the data gathering process and various data sources.
0.2.1 Task 1
Answer: B (Stratified sampling makes it possible to target different population groups in different
cities and thus obtain a more comprehensive and representative data set).
0.2.2 Task 2
Answer: D (By surveying employees from each department in proportion to department size, different perspectives and opinions are taken into account, resulting in a comprehensive data set).
0.2.3 Task 3
Answer: C (A quota sample that specifically includes participants from both groups ensures that
opinions from urban and rural areas are recorded equally, thus creating a balanced data set).
0.3 1.1.3 Aggregate data from multiple sources and integrate them into datasets.
0.3.1 Task 1
Answer: B (By converting all satisfaction values to a common scale, the data can be compared and
combined).
0.3.2 Task 2
Answer: B (The use of a unique key such as the customer ID enables a correct link and ensures that
each customer can be uniquely identified).
0.3.3 Task 3
Answer: B (By converting to a common date format or to consistent quarterly or annual values, the
purchase histories can be merged and compared more easily).
0.4.2 Task 2
Answer: A (A data lake stores large amounts of raw data without prior structuring).
0.4.3 Task 3
Answer: C (A .txt file stores data simply and without any special structure, which makes it
ideal for notes or logs).
0.4.4 Task 4
Answer: B (The cloud enables secure and flexible access to data from anywhere).
0.4.5 Task 5
Answer: D (Excel is known for its tabular and user-friendly functions, which are ideal for smaller data
sets).
0.5 1.2.1 Understand structured and unstructured data and their implications in data analysis.
0.5.1 Task 1
Answer: A (Structured Data, as the data is completely organized and in a tabular form).
0.5.2 Task 2
Answer: B (Semi-Structured Data, as the emails contain a mixture of structured fields and free
text).
0.5.3 Task 3
Answer: C (Unstructured data, as images and videos have no defined structure and are difficult
to search or analyze).
0.6.2 Task 2
Answer: A (The missing values can be replaced by the average value, which is a common method
to close gaps without distorting the data set).
0.6.3 Task 3
Answer: B (The median is often useful because it is less sensitive to outliers and is a more realistic
estimate).
0.7.2 Task 2
Answer: B (One-hot encoding is a common method to convert categorical variables into a format
readable by machine learning without assuming numerical relationships).
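A minimal pandas sketch of one-hot encoding (the column and categories are invented for illustration):

import pandas as pd

# Hypothetical categorical column
df = pd.DataFrame({"Color": ["red", "blue", "green", "blue"]})

# One-hot encoding: each category becomes its own 0/1 column,
# so no numerical order is implied between the categories
encoded = pd.get_dummies(df, columns=["Color"])
print(encoded)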
0.7.3 Task 3
Answer: C (The email column should be deleted as it is irrelevant for the analysis and does not contain
any useful information for pattern recognition).
0.8.2 Task 2
Answer: B (Marking the negative values as missing data and then applying a suitable strategy for
handling the missing values makes sense, as negative values are considered data errors here).
0.8.3 Task 3
Answer: B (Using .replace(), "Missing" and -1 can be marked as np.nan and thus treated as
missing values, which facilitates cleanup and analysis).
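A short sketch of the approach described above (the column and values are made up):

import numpy as np
import pandas as pd

# Hypothetical column containing the placeholder "Missing" and the error code -1
df = pd.DataFrame({"income": [52000, "Missing", 61000, -1, 58000]})

# Mark both placeholders as proper missing values, then convert to numeric
df["income"] = pd.to_numeric(df["income"].replace(["Missing", -1], np.nan))

# Now the usual missing-value strategies apply, e.g. filling with the mean
df["income"] = df["income"].fillna(df["income"].mean())
print(df)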
0.9.1 Task 2
Answer: A (CSV uses commas as separators, XML uses tags, and XLS is table-based).
0.9.2 Task 3
Answer: A (JSON is hierarchical and API-friendly, TXT is unstructured, and CSV uses commas
for tabular data).
0.10.2 Task 2
Answer: A (The code filters words with more than 5 characters and converts them to capital
letters).
0.10.3 Task 3
Answer: A (The code filters all even numbers and calculates their square).
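Sketches of the two patterns referred to in Tasks 2 and 3 (input data invented for illustration):

# Words longer than 5 characters, converted to upper case
words = ["table", "elephant", "sun", "laptop", "banana"]
long_upper = [w.upper() for w in words if len(w) > 5]
print(long_upper)    # ['ELEPHANT', 'LAPTOP', 'BANANA']

# Squares of all even numbers
numbers = [1, 2, 3, 4, 5, 6]
even_squares = [n ** 2 for n in numbers if n % 2 == 0]
print(even_squares)  # [4, 16, 36]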
0.11.2 Task 2
Answer: B (ETL tools are ideal for extracting data from different formats and transforming it into a
standardized format, which facilitates analysis).
0.11.3 Task 3
Answer: B (A script for regular and incremental extraction of API data into the data warehouse
is efficient and facilitates daily data integration).
0.11.4 Task 4
Answer: A (A data warehouse already contains consolidated and cleansed data, which makes data analysis more efficient and reliable).
0.11.5 Task 5
Answer: B (The application of transformation rules in an ETL process or script helps to standardize data
formats and establish compatibility).
0.11.7 Task 1
Answer: B (Bold headings and an adjusted column width improve the readability of important
information).
0.11.8 Task 2
Answer: A (Conditional formatting emphasizes important numerical values and makes trends easier
to see).
0.11.9 Task 3
Answer: A (Dividing the data by department into separate tabs provides clarity and makes navigation
easier).
0.12 1.4.5 Prepare, adapt, and pre-process data for analysis
0.12.1 Task 1
Answer: B (A mapping step helps to standardize the spellings so that the region
"North America" can be used consistently in the analysis).
0.12.2 Task 2
Answer: B (The conversion to a standardized format and the extraction of year and month facilitate
the temporal analysis).
0.12.3 Task 3
Answer: B (The standardization of spellings and the filling in of missing values ensure that the
regions are correctly and completely taken into account in the analysis).
0.12.4 2.1.1 Apply Python syntax and control structures to solve data-related problems.
Maximum PCAP level
0.12.6 2.1.3 Evaluate and navigate the Python Data Science ecosystem.
0.12.7 1. Data analysis and manipulation
• Pandas: One of the most fundamental libraries for working with structured data, such as DataFrames. Enables data manipulation, cleansing and aggregation.
• NumPy: Provides basic support for large, multidimensional arrays and matrices as well as
mathematical functions that work on these structures.
• PySpark: An interface to Apache Spark designed for processing large amounts of data and
creating machine learning pipelines on Spark environments.
• spaCy: A powerful NLP library specially developed for speed and efficiency. It is well suited
for Named Entity Recognition (NER), dependency parsing and text classification.
• Hugging Face Transformers: This library contains pre-trained models such as BERT, GPT
and other advanced models used for many NLP tasks such as text generation, translation and
sentiment analysis.
• There are special modules for Fourier transforms, signal processing and filters, which are
used in areas such as image processing and time series analysis.
4. Integration and differential equations:
• SciPy has tools for numerical integration and for solving differential equations that are
frequently encountered in scientific and technical fields.
5. Statistics:
• SciPy offers a large number of statistical functions, e.g. distributions, statistical tests and random number generation, which are useful for data analysis and statistical modeling.
0.12.22 2.1.4 Organize and manipulate data using Python's core data structures.
Lists, tuples, dictionaries etc.
0.12.24 2.2.1 Import modules and manage Python packages using PIP.
from numpy.random import randint
pip install pandas
0.12.25 2.2.2 Apply basic exception handling and maintain script robustness.
Recognize try/except blocks correctly!
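A minimal example of the try/except pattern meant here (function and values are illustrative):

def safe_divide(a, b):
    try:
        return a / b
    except ZeroDivisionError:
        # Handle the specific error instead of letting the script crash
        print("Division by zero - returning None instead.")
        return None

print(safe_divide(10, 2))  # 5.0
print(safe_divide(10, 0))  # None (after the warning message)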
[3]: df1 = pd.DataFrame({"Month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                                   "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]})
     df2 = pd.DataFrame({"Temperature": [5.6, 7.7, 10.4, 18, 22.3, 27.5, 29.4,
                                         33.4, 27.4, 20.1, 7.7, 2.3]})

[4]: month_temp.describe()
[4]: Temperature
count 12.000000
mean 17.650000
std 10.605959
min 2.300000
25% 7.700000
50% 19.050000
75% 27.425000
max 33.400000
Mean: 17.65
Median: 19.05
Mode: 7.7
Range: 33.4 - 2.3 = 31.1
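The cells above never show how month_temp is built; a plausible reconstruction (an assumption, not taken from the original notebook) is a column-wise concatenation of df1 and df2:

import pandas as pd

# Assumed construction of month_temp from the two DataFrames defined above
month_temp = pd.concat([df1, df2], axis=1)

print(month_temp["Temperature"].mean())     # 17.65
print(month_temp["Temperature"].median())   # 19.05
print(month_temp["Temperature"].mode()[0])  # 7.7
print(month_temp["Temperature"].max() - month_temp["Temperature"].min())  # ~31.1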
The Pearson r coefficient is 0.78 > 0.5 and positive. This means that there is a STRONG positive correlation between the number of hours of sunshine and the altitude.
Pearson r       Strength
> 0.5           strong
0.3 to 0.5      medium
0 to 0.3        light
0               none
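A quick way to compute such a coefficient in Python (the arrays below are placeholders, not the original data):

import numpy as np

# Placeholder data: hours of sunshine and altitude of several stations
sunshine = np.array([1600, 1750, 1900, 2100, 2300])
altitude = np.array([200, 450, 800, 1200, 1500])

# Pearson correlation coefficient (off-diagonal element of the 2x2 correlation matrix)
r = np.corrcoef(sunshine, altitude)[0, 1]
print(round(r, 2))  # close to 1 for this invented, almost linear data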
0.12.34 Task 1
Answer: C (It can be seen that the higher the price, the lower the sales figures, which indicates a
negative correlation).
0.12.35 Task 2
Answer: C (Age and income are positively correlated, as income also rises with increasing age).
0.12.36 Task 3
Answer: A (There is a tendency for more study hours to correlate with better grades, indicating
a positive relationship between study hours and grade quality).
0.12.37 3.2.1 Understand and apply bootstrapping for sampling distributions.
0.12.38 Task 1
Answer: B (In bootstrapping, many samples are drawn from the original sample with replacement in order to create a distribution of mean values).
0.12.39 Task 2
Answer: C (The standard deviation of the bootstrapped mean is the standard error that quantifies the
uncertainty of the mean).
0.12.40 Task 3
Answer: B (The 95% confidence interval is determined by the values of the 2.5% and 97.5%
percentiles of the distribution of the 1,000 medians).
0.12.41 Task 4
What does a 95% confidence interval mean?
If you create many confidence intervals for different samples of this data, then the TRUE value of the median should lie in about 95% of these confidence intervals.
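A compact sketch of the bootstrap procedure described in Tasks 1-4 (sample data and the number of resamples are illustrative):

import numpy as np

rng = np.random.default_rng(42)
sample = rng.normal(loc=50, scale=10, size=200)   # illustrative original sample

# Draw many resamples of the same size WITH replacement and store their means
boot_means = [rng.choice(sample, size=sample.size, replace=True).mean()
              for _ in range(1000)]

# Standard error of the mean = standard deviation of the bootstrapped means
print("Standard error:", np.std(boot_means))

# 95% confidence interval from the 2.5% and 97.5% percentiles
print("95% CI:", np.percentile(boot_means, [2.5, 97.5]))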
0.12.42 3.2.2 Explain when and how to use linear and logistic regression, including
appropriateness and limitations.
0.12.43 Task 1
Answer: A (Linear regression is best suited here because the score is a continuous variable and a
linear relationship to the learning hours is assumed).
0.12.44 Task 2
Answer: B (Logistic regression is best suited because it can model probabilities and was developed for binary target variables).
0.12.45 Task 3
Answer: A (Linear regression is not suitable if the target variable is binary, as it cannot model
probabilities correctly and logistic regression should be used instead).
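A minimal scikit-learn sketch contrasting the two model types (toy data, not taken from the tasks):

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

hours = np.array([1, 2, 3, 4, 5, 6, 7, 8]).reshape(-1, 1)

# Continuous target (exam score): linear regression
scores = np.array([35, 42, 50, 55, 63, 70, 74, 82])
lin = LinearRegression().fit(hours, scores)
print(lin.predict([[5.5]]))        # predicted score

# Binary target (pass / fail): logistic regression, which outputs probabilities
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])
log = LogisticRegression().fit(hours, passed)
print(log.predict_proba([[5.5]]))  # probability of fail / pass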
0.12.47 1. fillna()
• Usage: The fillna() method is used to replace missing values (NaN) in a DataFrame or a
series.
• Examples:
– Fill with a fixed value: df['Column'].fillna(0), replaces all NaN values in the
column with 0.
– Fill with the average value: df['Column'].fillna(df['Column'].mean()), replaces
all NaN values with the average value of the column.
0.12.48 2. dropna()
• Usage: dropna() is used to remove rows or columns with missing values.
• Examples:
– Remove rows with NaN values: df.dropna(), removes all rows that contain at least one NaN value.
– Only remove rows where all values are NaN: df.dropna(how='all').
– Remove columns with missing values: df.dropna(axis=1).
0.12.49 3. drop_duplicates()
• Usage: This method removes duplicate rows from the DataFrame so that only the first occurrences are retained.
• Examples:
– Remove duplicate rows: df.drop_duplicates(), keeps only the first occurrence of each row.
– Remove duplicate entries based on a specific column:
df.drop_duplicates(subset='column'), removes all duplicates based on the values
in the specified column.
0.12.50 4. apply()
• Usage: apply() is used to apply a function to each row or column of a DataFrame or to each
element of a series. This is useful for calculations
and transformations.
• Examples:
– Apply function to a column: df['Column'].apply(lambda x: x * 2), doubles all values in the specified column.
– Apply function to each row: df.apply(lambda row: row['Column1'] +
row['Column2'], axis=1), calculates the sum of column1 and column2 for each
row.
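The four methods above, combined in one small, self-contained cleaning sketch (the DataFrame and column names are invented for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Column":  [10.0, np.nan, 30.0, 30.0],
    "Column1": [1, 2, 3, 3],
    "Column2": [5, 6, 7, 7],
})

df["Column"] = df["Column"].fillna(df["Column"].mean())  # fill NaN with the column mean
df = df.dropna()                                         # drop any remaining rows with NaN
df = df.drop_duplicates()                                # drop duplicate rows
df["Sum"] = df.apply(lambda row: row["Column1"] + row["Column2"], axis=1)
print(df)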
0.12.51 4.1.2 Understand and Utilize the Relationship Between DataFrame and Series
in Pandas.
0.12.52 Task 1
Answer: B (With df['Age'] a single column is extracted as a Series, while df[['Age']] returns a DataFrame).
0.12.53 Task 2
Answer: A (With df['Price'] = df['Price'] * 0.9 the column can be referenced directly as a Series and the discount applied).
0.12.54 Task 3
Answer: C (A Series is one-dimensional, similar to a single column, while a DataFrame is two-
dimensional and can contain multiple columns).
0.12.55 Task 4
Answer: D (a Series!)
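The distinction from Tasks 1-4 in a few lines (hypothetical DataFrame):

import pandas as pd

df = pd.DataFrame({"Age": [25, 32, 41], "Price": [10.0, 20.0, 30.0]})

print(type(df["Age"]))    # <class 'pandas.core.series.Series'>   (one-dimensional)
print(type(df[["Age"]]))  # <class 'pandas.core.frame.DataFrame'> (two-dimensional)

# Working on a single column as a Series, e.g. the 10% discount from Task 2
df["Price"] = df["Price"] * 0.9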
0.12.56 4.1.3 Perform Array Operations and Differentiate Data Structures with
NumPy.
Answer: B (a * b calculates the element-by-element product and gives [4, 10, 18]).
0.12.57 Task 2
Answer: D (All these options calculate the transpose of A and return [[1, 3], [2, 4]]).
0.12.58 Task 3
Answer: A (np.linalg.det(B) calculates the determinant of the matrix B, which results in 4 here).
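The three operations from this section, verified on small arrays (a, b and B are assumed inputs consistent with the stated results):

import numpy as np

a = np.array([1, 2, 3])          # assumed so that a * b gives [4, 10, 18]
b = np.array([4, 5, 6])
print(a * b)                     # element-wise product: [ 4 10 18]

A = np.array([[1, 2], [3, 4]])
print(A.T)                       # transpose: [[1 3] [2 4]]

B = np.array([[2, 0], [0, 2]])   # any matrix with determinant 4 would do
print(np.linalg.det(B))          # ~4.0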
0.12.59 Task 4
How do you correctly calculate the matrix multiplication of A = np.array([[2, 0], [0, 2], [1, 2]]) and B = np.array([[1, 3, 2], [2, -3, 1]])? There are 2 correct answers!
A: np.dot(A, B)
B: np.multiply(A, B)
C: A @ B
D: np.dot(A, B.T)
Answer: A and C
[7]: A = np.array([[2, 0], [0, 2], [1, 2]])
     B = np.array([[1, 3, 2], [2, -3, 1]])
     print(np.dot(A, B))
     print(A @ B)

[[ 2  6  4]
 [ 4 -6  2]
 [ 5 -3  4]]
[[ 2  6  4]
 [ 4 -6  2]
 [ 5 -3  4]]

[8]: print(np.multiply(A, B))   # element-wise multiplication requires matching shapes

ValueError: operands could not be broadcast together with shapes (3,2) (2,3)

[ ]: print(np.dot(A, B.T))      # would also fail: shapes (3,2) and (3,2) are not aligned
0.12.60 4.1.4 Apply and Analyze Data Organization Techniques in Pandas and
NumPy.
0.12.61 Task 1
Answer: B (groupby with sum() groups the data by region and product and totals the sales for
each combination).
0.12.62 Task 2
Answer: A (data.reshape(3, 3) reshapes the 1D array into a 3x3 matrix so that row and column
operations are possible).
0.12.63 Task 3
Answer: C (drop_duplicates(subset='values') removes duplicate entries in the values
column and leaves only the unique values in the DataFrame).
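Short sketches of the three operations (column names mirror the task descriptions, the data is invented):

import numpy as np
import pandas as pd

# groupby + sum: total sales per region/product combination
sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "product": ["A", "A", "A", "B"],
    "sales":   [100, 150, 200, 50],
})
print(sales.groupby(["region", "product"]).sum())

# reshape: turn a 1D array of 9 values into a 3x3 matrix
data = np.arange(9)
print(data.reshape(3, 3))

# drop_duplicates on a single column
df = pd.DataFrame({"values": [1, 2, 2, 3, 3, 3]})
print(df.drop_duplicates(subset="values"))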
0.12.65 Task 1
Answer: B (data.mode() calculates the mode and returns 4 in this case, since 4 occurs most
frequently).
0.12.66 Task 2
Answer: B (sales.min() returns the minimum value and sales.max() the maximum value of the
array).
0.12.67 Task 3
Answer: B (df['Age'].mean() calculates the average age in the Age column).
0.12.68 Task 4
How do you calculate the mode if you don't have a Pandas DataFrame, but only a normal Python
list?
[ ]: data = [1, 2, 2, 2, 2, 2, 3, 4, 4, 4, 5]
     # max over (count, value) pairs picks the most frequent value
     result = max([(data.count(i), i) for i in data])
     print(f"The count of {result[1]} is the largest and equal to {result[0]}!")
0.12.70 Task 1
Answer: B (The test data set helps to evaluate the performance of the model on new data and avoid
overfitting).
0.12.71 Task 2
Answer: B (A high accuracy on the training data, but a low accuracy on the test data, indicates overfitting).
0.12.72 Task 3
Answer: B (The test data set is used for the final evaluation of the model after training and
optimization are completed).
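A typical train/test workflow in scikit-learn, illustrating the reasoning behind Tasks 1-3 (dataset and model choice are placeholders, not from the tasks):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold back 25% of the data as an unseen test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)

# A large gap between these two scores would point to overfitting
print("Train accuracy:", model.score(X_train, y_train))
print("Test accuracy:",  model.score(X_test, y_test))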
0.12.73 4.2.3 Analyze and Evaluate Supervised Learning Algorithms and Model Accuracy.
0.12.74 Task 1
Answer: B (Accuracy is often misleading for unbalanced classes. Metrics such as precision and recall
are more appropriate in such cases).
0.12.75 Task 2
Answer: C (A lower MSE on the test dataset indicates that the linear regressor is better on this
data, but data understanding and applicability of the model should also be considered for model
selection).
0.12.76 Task 3
Answer: B (cross-validation reduces variance and provides a more robust estimate of model
performance, especially on new data).
0.12.77 Task 4
Answer: B (High precision means that the model's "cancer" predictions are usually correct, i.e. it generates few false alarms for patients who do not have cancer).
0.12.78 Task 5
Answer: A (A recall value of 85% means that the model has recognized 85% of all actual spam emails, which indicates a good ability to recognize the relevant class).
0.12.79 Task 6
Answer: C (A high precision and low recall mean that the model is accurate in its positive predictions,
but misses many true positive cases).
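A small sketch of how these metrics are computed with scikit-learn (the label vectors are invented):

from sklearn.metrics import accuracy_score, precision_score, recall_score

# 1 = positive class (e.g. "spam" or "cancer"), 0 = negative class
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1, 1, 0]

print("Accuracy:",  accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))  # how many predicted positives are correct
print("Recall:",    recall_score(y_true, y_pred))     # how many actual positives were found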
0.12.81 Task 1
Answer: B (The command sns.histplot(data=df, x='age') creates a histogram of the age distribution,
which is ideal for visualizing frequencies).
0.12.82 Task 2
Answer: B (sns.scatterplot(x='height', y='weight', data=df) generates a scatter plot that is ideally suited for visualizing the relationship between two numeric variables).
0.12.83 Task 3
Answer: B (sns.barplot(x='category', y='sales', data=df, estimator=np.mean)
creates a bar chart that shows the average sales figures per category, which
is ideal for comparing group averages).
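The three seaborn calls from Tasks 1-3, combined in one runnable sketch (the DataFrame is synthetic):

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age":      rng.integers(18, 70, 200),
    "height":   rng.normal(170, 10, 200),
    "weight":   rng.normal(70, 12, 200),
    "category": rng.choice(["A", "B", "C"], 200),
    "sales":    rng.normal(1000, 150, 200),
})

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
sns.histplot(data=df, x="age", ax=axes[0])                                    # distribution of a numeric variable
sns.scatterplot(x="height", y="weight", data=df, ax=axes[1])                  # relationship between two numeric variables
sns.barplot(x="category", y="sales", data=df, estimator=np.mean, ax=axes[2])  # group averages
plt.tight_layout()
plt.show()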
0.12.84 5.1.2 Assess the pros and cons of different data representations.
0.12.85 Task 1
Answer: A (A line chart shows the trend over the months well for data set A, while a bar chart is
ideal to show the frequency of age groups in data set B).
0.12.86 Task 2
Answer: A (A scatter plot is suitable for the correlation between height and weight in data set A, while a
histogram visualizes the income distribution in data set B well).
0.12.87 Task 3
Answer: A (The line graph is good for visualizing the average temperatures over the year, while the histogram is good for showing the frequency of grades in data set B).
0.12.88 Task 4
Answer: B (A pie chart is suitable for showing the market shares of car brands as a percentage,
while a scatter chart can visualize the relationship between kilometers driven and fuel consumption).
Task 1
Answer: A (Testing the marketing strategy for product B and comparing the results could provide
valuable insights on how to optimize the strategy for both products).
Task 2
Answer: C (A detailed analysis of the different effects on product A and B could help to optimize the
strategy and achieve better results).
0.12.90 Task 3
Answer: C (With plt.text(x, y, label) or ax.bar_label(bars) the exact values can be placed
directly on the bars).
0.12.91 Task 4
Answer: B (With plt.annotate() the analyst can mark and annotate the specific points in the scatter
plot).
0.12.92 Task 5
Answer: B (With plt.text() or plt.annotate() he can mark the conspicuous age group and
highlight the finding directly in the visualization).
0.12.93 5.1.4 Improve the clarity and accuracy of data interpretation by managing
display features such as colors, labels and legends.
0.12.94 Task 1
Which graph is created with the following code?
The first graph! The line is red and the markers are blue and slightly transparent (alpha=0.7)!
x = [1,2,3,4,5,6]
y = [2,3,5,7,10,14]
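The plotting call itself is not reproduced above; a sketch consistent with the description in the answer (red line, blue markers with alpha=0.7) might look like this - an assumption, not the original code:

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5, 6]
y = [2, 3, 5, 7, 10, 14]

# Red line with blue, slightly transparent markers (alpha=0.7)
plt.plot(x, y, color="red")
plt.scatter(x, y, color="blue", alpha=0.7)
plt.show()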
0.12.95 Task 2
Answer: A (The use of contrasting colors such as blue and red makes it easier to distinguish between
the two products).
0.12.96 Task 3
Answer: B (A legend at the top right or outside the diagram improves readability and avoids
overlapping with the data lines).
0.12.97 Task 4
Answer: A (By rotating the x-axis labels, age groups are clearly indicated, even if the names are
longer).
0.12.98 5.2.1 Tailor communication to different audience needs, and combine visual-
izations and text for clear data presentation.
0.12.99 Task 1
Answer: B (Giving the management an overview and providing the specialist department with
detailed programs optimally adapts the presentation to the needs of both target groups).
0.12.100 Task 2
Answer: B (A bar chart with a short text summary is easy to understand for a general audience and clearly highlights the decrease in satisfaction).
0.12.101 Task 3
Answer: B (Separate sections for the overview and the detailed analysis enable the needs of both
target groups to be optimally met).
0.12.102 5.2.2 Summarize key findings and support claims with evidence and reason-
ing.
0.12.103 Task 1
Answer: B (The comparison of sales figures with the previous quarters provides a clear basis and
supports the statement about the increase).
0.12.104 Task 2
Answer: B (By showing the regional differences and analyzing the influencing factors, the results are
clearly and comprehensibly supported).
0.12.105 Task 3
Answer: A (The comparison of the sales figures before and after the campaign and the consideration of
possible influencing factors support the statement about the effect of the campaign).