Combined Exam 29.06.2020
Duration: 120 minutes. Provide the solutions on the designated pages. Good Luck!
Question 1: (12)
For each of the statements below, decide if it is true or false and tick the corresponding circle. You get +2
credits for each correct answer, −2 credits for each wrong answer and 0 credit if you leave the answer open.
Each of the parts (a) – (c) is graded separately and you always get ≥ 0 credits on each part of exam question 1,
that is: 0 – 4 credits for each of the parts (a) – (c).
1. Consider two relations R(A, B) and S(C, A), where S.A is a foreign key from S to R, and suppose that
there exists a B*-index for each primary key. Then the implementation of the natural join R ⋈ S by an
index nested loops join always requires a smaller number of I/O operations than the implementation of the
same join by a hash join. true false ×
2. Consider two relations R(ABC) and S(ABD). Then the following equalities always hold:
R ⋈ (πAB(R) ∩ πAB(S)) = πABC(R ⋈ S) = πABC(S ⋈ R) true × false
1. HDFS splits files into blocks with a recommended block size between 64 and 128 kBytes. true false ×
2. In MapReduce, combiners allow the user to reduce the communication cost by pushing some of the map
logic into the reduce task. true false ×
1. If a distributed database management system provides linear scale up, then doubling the number of CPUs
allows one to halve the time needed for a given operation. true false ×
2. In a document store, the data may have an arbitrarily deeply nested structure (e.g., JSON, XML). In
contrast, in key-value stores, no nesting of values is allowed. true false ×
Question 2: (8)
The American presidential elections 2020 are coming closer and the parties are already making plans for their
campaigns in the next months. Above all, the parties want to make sure of the support of their own members.
To this end, they maintain a database of volunteers, members, and the activities for contacting the members. The
database contains the following relations:
Volunteers(Id, Name, Age, Sex) (short v)
Members(Id, Name, Education, Age, State) (short m)
Contact(Id, VolId, MemId, Method, Date) (short c).
Further, suppose that the following query has to be processed:
i.e., we are interested in information on female volunteers who are supposed to contact elderly party members.
a) In the first box below, draw the query tree of the canonical translation. To save space, you may use the
abbreviations v, m, and c, for the relations Volunteers, Members, and Contact, respectively. You may also
abbreviate any attribute name to a unique prefix (e.g., m.N for Members.Name or c.Mem for Contact.MemId).
b) In the second box below, draw the query tree of the optimized Relational Algebra expression. For the
optimization, apply the following rules:
• Push selections as deep in the tree as possible,
• project out attributes as soon as they are not needed anymore,
• replace cross products by joins.
We see that the selection digital = "besser" has a selectivity of 1/300. Because
we always assume uniformity, the tuples in Meta have a 1/300 chance of keeping
their join partner (they have exactly one before the selection because of the FK
constraint). Hence, only 1000/300 ≈ 3.3 tuples are left in the join. Since tuples
cannot be fractional, we arrive at a cardinality of 3 for the expression.
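The estimate above can be checked with a short calculation; the figures (|Meta| = 1000, selectivity 1/300, one join partner per tuple due to the FK constraint) are taken from the solution text:

```python
import math

# Assumptions from the solution text above:
meta_tuples = 1000        # |Meta|
selectivity = 1 / 300     # selectivity of digital = "besser"

# Each Meta tuple keeps its single join partner with probability 1/300.
estimate = meta_tuples * selectivity   # 3.33...

# Tuples cannot be fractional, so round down to a whole number.
cardinality = math.floor(estimate)
print(cardinality)  # 3
```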
b) [4 credits]
Now consider a DBMS that maintains equi-depth histograms to support selectivity calculation for query
planning. In particular, for the column score of integers of a table adbsexam with 280 rows, the following 5
values divide the column values into 6 groups of equal size: 41, 52, 67, 73, 85. The maximum value in score is
100. Assume that the dividing values in the histogram are always part of the left bucket, i.e., the
value 41 is in the first bucket, 52 in the second, and so on.
Estimate the selectivity for the following relational algebra expression. Assume that values are evenly
distributed inside the buckets.
σscore<52∨score≥91 (adbsexam)
The disjuncts clearly capture disjoint events and we can therefore simply add
their respective individual selectivities.
i) score < 52: the whole first bucket contributes 1/6. In addition, the range
from 42 to 51 (10 values) covers part of the second bucket, which ranges from
42 to 52 (11 values), contributing (1/6) · r with

    r = (51 − 41) / (52 − 41) = 10/11.

Thus, the selectivity of the first disjunct is (1/6) · (1 + 10/11).

ii) score ≥ 91: part of the last bucket ((1/6) · r′). The range [91, 100] contains 10 values.
The total bucket spans the values [86, 100], i.e., 15 values. Hence, assuming
uniformity,

    r′ = 10/15 = 2/3.

The overall selectivity is therefore (1/6) · (1 + 10/11 + 2/3).
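The same calculation, done with exact fractions (bucket boundaries as given in the exam):

```python
from fractions import Fraction

# Equi-depth histogram: 6 buckets of equal depth over 280 rows.
# Dividing values 41, 52, 67, 73, 85 belong to the left bucket, so the
# buckets cover (-inf, 41], [42, 52], [53, 67], [68, 73], [74, 85], [86, 100].

# Disjunct 1: score < 52
# Whole first bucket, plus 10 of the 11 values of the second bucket.
sel1 = Fraction(1, 6) * (1 + Fraction(10, 11))

# Disjunct 2: score >= 91
# 10 of the 15 values of the last bucket [86, 100].
sel2 = Fraction(1, 6) * Fraction(10, 15)

# The disjuncts are disjoint, so the selectivities simply add up.
selectivity = sel1 + sel2
print(selectivity)         # 85/198
print(float(selectivity))  # ~0.429
```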
Question 4: (8)
The symmetric set difference operation is defined as A △ B = (A \ B) ∪ (B \ A), i.e., the set of all elements that occur in exactly one of A and B.
Sketch out a MapReduce algorithm for computing the symmetric set difference A 4 B. In addition
to this, also identify the communication cost of your algorithm, as a function of the input sizes.
Mapper
For each input t from A produce a tuple (t, ’A’) and for each input t from B
produce a tuple (t, ’B’)
Reducer
Count the occurrences of 'A', called a, and the occurrences of 'B', called b, in
the value list of t. If a = 0 and b > 0, or if a > 0 and b = 0 (i.e., t occurs
in exactly one of the two inputs), then emit t. In all other cases, do nothing.
Communication cost
The mapper emits for every tuple from A and every tuple from B one output
record, therefore the communication cost is |A| + |B|.
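A minimal single-machine sketch of this algorithm in Python (the real framework performs the grouping by key in its shuffle phase; here it is simulated with a dictionary):

```python
from collections import defaultdict

def mapper(source_tag, records):
    # Tag every record with the relation it came from ('A' or 'B').
    for t in records:
        yield (t, source_tag)

def reducer(t, tags):
    # Emit t iff it occurs in exactly one of the two inputs.
    a = tags.count('A')
    b = tags.count('B')
    if (a == 0) != (b == 0):
        yield t

def symmetric_difference(A, B):
    # Simulated shuffle phase: group mapper output by key.
    groups = defaultdict(list)
    for t, tag in list(mapper('A', A)) + list(mapper('B', B)):
        groups[t].append(tag)
    # Reduce phase.
    return {t for key, tags in groups.items() for t in reducer(key, tags)}

print(sorted(symmetric_difference([1, 2, 3], [3, 4])))  # [1, 2, 4]
```

Every input tuple produces exactly one mapper output record, which matches the stated communication cost of |A| + |B|.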
Question 5: (8)
You are given a database for a pharmaceutical testing facility with the following relational schema.
Note: underlines represent primary keys, italics represent foreign keys. We do not state the type of attributes
here, just assume some reasonable defaults (string, int, float, . . . ).
Note: You may shorten dataframe names (e.g.: “pcDF” instead of “pharma compDF”), as long as it is
unambiguous what the shortened name refers to.
d) val query4DF = testingDF.groupBy(testingDF("company"),testingDF("drug")).agg(count("*"))
.except(
testingDF.join(test_groupDF, testingDF("title") === test_groupDF("title") &&
testingDF("date") === test_groupDF("date"))
.select(testingDF("*")).distinct()
.groupBy(testingDF("company"),testingDF("drug")).agg(count("*"))
)
Users
  id  name   email     balance
  u1  Alice  a@gmx.at  -150.00
  u2  Bob    b@gmx.at  0.00
  u3  Carol  c@gmx.at  200.00

Movies
  id  title   genre   released
  m1  title1  drama   2019
  m2  title2  action  1998
  m3  title3  comedy  2010

Watched
  uid  mid  date        time
  u1   m1   2020-04-01  20:15
  u1   m3   2020-04-02  21:00
  u2   m1   2020-05-01  22:00
  u3   m2   2020-06-01  23:00
In part (a), this relational database should be transformed into a document store (in JSON). Recall that for
the data design it is sometimes advantageous to apply some form of denormalization if you want to speed up
certain queries or update operations. In your transformation of the relational database into a document store,
you will be requested to apply at least one denormalization. Moreover, you will be requested to discuss the pros
(in part b) and cons (in part c) of the chosen denormalization.
(a) Give a representation of the relational database as a document store (in JSON) and make sure that you
apply at least one denormalization. [4 credits]
Remark. Many students solved this problem by realizing the watched-information as an array
inside the users- or movies-collection. Strictly speaking, this is not a denormalization; it is
just one possible way of realizing an m:n relationship in a document store. However, this was
only mildly penalized (-1 credit); moreover, if the reasoning in parts (b) and (c) made sense,
full credits were given for those parts.
(b) Present a query or update operation that should profit from your denormalization and explain why it
should profit, i.e., what kind of work by the database engine can be avoided because of your denormalization. You
may use plain text for the presentation of your query or update operation; no formal query language is required
here. [2 credits]
Database access: a query that asks for the most popular genre, i.e.: movies of which genre
are watched most often?
(c) Present a query or update operation that might suffer from your denormalization and explain why it
might suffer, i.e., what kind of extra work by the database engine might be required because of your
denormalization. You may use plain text for the presentation of your query or update operation; no formal
query language is required here. [2 credits]
(a) Evaluate the following Cypher query on the Database given on the last sheet of the exam:
match (f)<-[:likes]-(b:Band)-[:plays]->(v:Venue)-[:blocks]->(f)
return f,b,v
(5kHD,Dives,Flex), (GY!BE,RTJ,B72)
(b) Evaluate the following Cypher query on the Database given on the last sheet of the exam:
(Flex, Arena, 5),(Flex, Werk, 2), (Flex, B72, 4), (B72, Arena, 2), (B72, Werk, 3)
(c) Assume the data model described on the last sheet of the exam. Write a Cypher query that returns all
bands with the number of venues that they have been blocked from. Sort the output by the number of
venues that have blocked the band.
(d) Assume the data model described on the last sheet of the exam. Write a Cypher query that finds all bands
that have played at or been blocked from a venue that does not contain a cool attribute. Return the band
name, the kind of connection (played or blocked) and the venue for each occurrence. If a band has both a
played and a blocked connection to a venue, return a tuple for both cases, i.e., (band1, played, venue1) and
(band1, blocked, venue1).
MATCH (b:Band)-[r:blocks|:plays]-(v:Venue)
WHERE v.cool IS NULL
RETURN b.name, type(r), v.name
Good luck!
Overall: 60 points
Graph DB Data Model
The graph contains bands (:Band) and venues (:Venue). A venue always has an attribute cap that stores the
capacity.
There are three types of relationships. A band :likes other bands. A band :plays at venues. A venue :blocks
bands. A visual representation of the data model is given in Figure ??.
Graph Database
The nodes contain the name attribute for Band nodes and the name and cap attributes for Venue nodes.
Figure 2: Database