Assignment - Big Data Management
Identify any ONE business that uses one or more NoSQL databases (simple key-value, column family,
document or graph databases). Critically analyse its use of NoSQL databases using secondary data
sources. Prepare a report outlining its business goals, the methodology adopted and the outcomes
realized, along with your insights and recommendations, in not more than 4 pages (about 1200-1400 words).
Suppose that you are given a set of customer purchase transactions. Each transaction contains a
basket identifier and a set of items. Assume that the items in individual transactions are not
repeated and occur only once. A subset of customer transactions is stored in the data nodes of the
Hadoop cluster. You are expected to compute the support and confidence of rules of the form X =>
Y, where X and Y are individual items in the transaction database. Generate all rules with a support
value greater than or equal to 20%. Assume that the total number of transactions (N) is known in
advance and is available to all the data nodes in the cluster. A sample input, a sample output and the
formulas are provided below. The samples are provided only for illustrative purposes; your solution
should handle any large-scale transactional database.
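One possible shape for the computation is sketched below as a single map and reduce pass, simulated in memory rather than run on Hadoop. This is only an illustration under stated assumptions: the input line format ("basket_id<TAB>item1, item2, ..."), the function names mapper, reducer and run_job, and the in-process shuffle are all hypothetical and not prescribed by the assignment.

```python
from collections import defaultdict
from itertools import permutations


def mapper(line):
    """Emit (key, 1) for every item and every ordered item pair in one basket.

    Assumed (hypothetical) input format: "basket_id<TAB>item1, item2, ...".
    """
    _, items_field = line.split("\t", 1)
    # Items occur at most once per basket; the set guards against dirty input.
    items = sorted({item.strip() for item in items_field.split(",") if item.strip()})
    for item in items:
        yield (item,), 1          # count of transactions containing the item
    for x, y in permutations(items, 2):
        yield (x, y), 1           # count of transactions containing both X and Y


def reducer(key, values):
    """Sum the partial counts collected for one key."""
    return key, sum(values)


def run_job(lines, n_transactions, min_support=0.20):
    """Simulate map, shuffle and reduce, then derive rules X => Y."""
    grouped = defaultdict(list)   # in-memory stand-in for the shuffle phase
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    counts = dict(reducer(key, values) for key, values in grouped.items())

    rules = {}
    for key, pair_count in counts.items():
        if len(key) != 2:
            continue
        x, _ = key
        support = pair_count / n_transactions
        if support >= min_support:
            # Confidence(X => Y) = count(X and Y) / count(X)
            rules[key] = (support, pair_count / counts[(x,)])
    return rules
```

Calling run_job(lines, n_transactions=N) returns a dictionary that maps each rule (X, Y) to its (support, confidence) pair. On a real cluster the item counts and pair counts would more likely come from separate jobs or a combiner, so treat the single pass here as a simplification.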
Sample transactions
Basket Id    Transactions
1            Bread, Diaper, Milk
2            Beer, Bread
3            Milk, Diaper, Beer, Coke
4            Bread, Milk, Diaper, Beer
5            Milk, Diaper, Coke
Formulas
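Support(X) = (number of transactions that contain X) / N
Support(X => Y) = (number of transactions that contain both X and Y) / N
Confidence(X => Y) = Support(X => Y) / Support(X)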
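As a quick check of the definitions, the short script below evaluates them directly on the five sample transactions. It is a minimal sketch that assumes the usual definitions (support is the fraction of the N transactions containing the given items, and Confidence(X => Y) is Support(X => Y) divided by Support(X)); the helper names support and confidence are illustrative only.

```python
# The five sample baskets from the assignment, keyed by basket id.
transactions = {
    1: {"Bread", "Diaper", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Milk", "Diaper", "Beer", "Coke"},
    4: {"Bread", "Milk", "Diaper", "Beer"},
    5: {"Milk", "Diaper", "Coke"},
}
N = len(transactions)


def support(*items):
    """Fraction of the N transactions that contain every item in `items`."""
    wanted = set(items)
    return sum(1 for basket in transactions.values() if wanted <= basket) / N


def confidence(x, y):
    """Confidence(X => Y) = Support(X => Y) / Support(X)."""
    return support(x, y) / support(x)


print(support("Milk"))               # 0.8 (baskets 1, 3, 4, 5)
print(support("Milk", "Diaper"))     # 0.8 (baskets 1, 3, 4, 5)
print(confidence("Milk", "Diaper"))  # 1.0
print(confidence("Bread", "Beer"))   # about 2/3: 2 of the 3 Bread baskets contain Beer
```

With the 20% threshold, Milk => Diaper (support 0.8) qualifies as a rule, while a pair that appears together in no basket has a support of 0 and is dropped.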
Sample output
Submission Instructions
Your submission should consist of the following components (in a single zip file):
Part B – (1) MapReduce pseudo-code, (2) a MapReduce Python program, submitted as a Python notebook
that displays the execution results of the individual steps, and (3) the input transactional database
files used in your program evaluation.