Data in Enterprise End Term Cheat Sheet

Compiled by: Tanisha Khandelwal, Aryan Patel, Suzy Paladiya, Yashvi Patel, Neha Bhansali & Vanshika Modi

Module 1

1. What is data? Explain structured and unstructured data types with examples
Answer - Data refers to information in the form of facts, statistics, or raw observations. There are
two types of data: structured and unstructured. Structured data has been predefined and
formatted to a set structure before being placed in data storage, while unstructured data is
stored in its native format and not processed until it is used. An example of structured data is a
relational database; an example of unstructured data is media and entertainment content.

2. Explain briefly about Flat file formats and their limitations


Answer- Flat file formats are a type of data storage format that stores data in a plain text file,
where each line represents a single record and each field is separated by a delimiter, such as a
comma or tab. Limitations include difficulty working with large amounts of data or with
complex data structures.
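As a minimal sketch (the fields here are illustrative), a comma-delimited flat file can be read record by record with Python's standard `csv` module:

```python
import csv
import io

# A flat file: each line is one record, fields separated by a comma delimiter.
# io.StringIO stands in for an actual file on disk.
flat_file = io.StringIO("id,name,age\n1,Alice,30\n2,Bob,25\n")

reader = csv.DictReader(flat_file)   # first line is treated as the header
records = list(reader)

print(records[0]["name"])  # Alice
```

Every field comes back as a string, and there is no indexing or schema enforcement, which illustrates why flat files scale poorly to large or complex data.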

3. Differentiate between Data Warehouse and Data Lake


Answer- A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile
collection of data in support of management’s decision-making process. A data lake is a central
location that holds a large amount of data in its native, raw format.

4. What is cloud computing? Explain Elasticity and Scalability in cloud computing


Answer-Cloud computing is the delivery of computing services—including servers, storage,
databases, networking, software, analytics, and intelligence—over the Internet to offer faster
innovation, flexible resources, and economies of scale.
ELASTICITY: As your workload changes, resources can be changed to compensate (up or
down). Example: seasonal demand for a retail website on Black Friday.
SCALABILITY: Increase or decrease resources based on workload demand. There are two types
of scalability: vertical and horizontal.

5. Short notes on i) Big data ii) NoSQL Databases


Answer i) Big Data- Big data refers to the massive amounts of structured and unstructured data
that are generated every day. This data comes from a variety of sources, including social media,
sensors, and other digital devices. Example: the New York Stock Exchange generates 1 terabyte
of new trade data per day.
ii) NoSQL Databases- It refers to distributed databases with dynamic schema. It is horizontally
scalable and is highly available.

6. Explain different types of Machine Learning Techniques


Answer - Machine learning encompasses various techniques and algorithms to train models and
make predictions or decisions based on data. Here are some different types of machine learning
techniques:

Supervised Learning:

Involves training a model on labeled data to make predictions or classifications.


Types include regression (predicting continuous values) and classification (categorizing data into
classes).
Unsupervised Learning:

Deals with unlabeled data, aiming to find patterns, structures, or groupings within the data.
Clustering and dimensionality reduction are common techniques.

7. Explain any 3 Key challenges in context with data enterprise


● Data Quality and Consistency: Enterprises often deal with data from multiple sources and
systems, leading to inconsistencies and inaccuracies. Ensuring data quality and
consistency across the organization is challenging.
● Data Security and Privacy: Enterprises handle sensitive customer and business data.
Protecting this data from breaches and ensuring compliance with data privacy regulations
(e.g., GDPR, CCPA) is a significant challenge.
● Data Integration: Enterprises use a variety of applications and systems, resulting in data
silos. Integrating these disparate data sources to get a unified view can be complex and
time-consuming.
● Scalability: With the growth of data volume, enterprises need to scale their infrastructure
to handle and process large amounts of data effectively.

8. Explain any 3 Key opportunities in context with data enterprise


● Improved Decision-Making: Data-driven insights enable more informed and accurate
decision-making, helping organizations stay competitive and responsive to market
changes.
● Innovation: Analyzing data can uncover new opportunities for innovation, product
development, and process improvement.
● Personalization: Enterprises can use data to understand customer preferences and
behaviors, allowing them to provide tailored products and services.
● Operational Efficiency: Data can be used to optimize processes, reduce inefficiencies,
and enhance overall operational performance.
Module 2 & 5

1. List the different types of Data preprocessing techniques


● Data Cleaning
● Data Integration
● Data Reduction
● Data Transformation and Data Discretization

2. Explain Equal width Discretization

Discretization is used to divide the range of a continuous attribute into intervals; interval labels
can then be used to replace actual data values.

● Equal-width (distance) partitioning


● It divides the range into N intervals of equal size (a uniform grid)
● If A and B are the lowest and highest values of the attribute, the width of the intervals will
be: W = (B − A)/N
● It is the most straightforward method
● The drawback is that outliers may dominate the presentation, and skewed data is not handled
well
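A minimal sketch of equal-width binning in plain Python (the `equal_width_bins` helper and sample data are illustrative, not from a library):

```python
def equal_width_bins(values, n):
    """Split [min, max] into n intervals of width W = (B - A) / n
    and return each value's bin index (0 .. n-1)."""
    a, b = min(values), max(values)
    w = (b - a) / n
    # The maximum value belongs to the last bin, not a new one.
    return [min(int((v - a) / w), n - 1) for v in values]

data = [1, 3, 5, 7, 9, 100]  # the outlier 100 dominates the binning
print(equal_width_bins(data, 4))  # [0, 0, 0, 0, 0, 3]
```

Note how the single outlier pushes all the other values into bin 0, which is exactly the drawback described above.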

3. Explain Equal depth Discretization

Discretization is used to divide the range of a continuous attribute into intervals; interval labels
can then be used to replace actual data values.

● Equal-depth (frequency) partitioning


● Divides the range into N intervals, each containing approximately the same number of
samples
● It has good data scaling
● Managing categorical attributes can be tricky
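A minimal sketch of equal-depth (frequency) binning in plain Python (the `equal_depth_bins` helper and sample data are illustrative):

```python
def equal_depth_bins(values, n):
    """Partition the sorted values into n bins with (approximately)
    equal numbers of samples."""
    s = sorted(values)
    size, rem = divmod(len(s), n)
    bins, start = [], 0
    for i in range(n):
        # The first `rem` bins absorb one extra sample each.
        end = start + size + (1 if i < rem else 0)
        bins.append(s[start:end])
        start = end
    return bins

data = [4, 8, 15, 16, 23, 42]
print(equal_depth_bins(data, 3))  # [[4, 8], [15, 16], [23, 42]]
```

Each bin holds the same number of samples regardless of how wide its value range is, which is why this method handles skewed data better than equal-width partitioning.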

4. Explain any 2 data manipulation techniques

● Data manipulation involves making changes or transformations to data to extract useful
information, clean the data, or prepare it for analysis. Here are some common examples
of data manipulation tasks:
● Filtering Data:
Example: Removing rows from a dataset where the age of individuals is less than 18.
● Sorting Data:
Example: Sorting a list of sales transactions by date.
● Merging Data:
Example: Combining data from two different datasets based on a common key, such as
merging customer data with order data using a customer ID, e.g. in pandas:
pd.merge(customer_data, order_data, on='customer_id')

● Cleaning Data:
Example: Replacing missing values with a default value or removing rows with missing
data.
● Transforming Data:
Example: Converting data types, such as converting a string date to a datetime object.
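The filtering, sorting, and merging tasks above can be sketched in plain Python (pandas would be the usual tool in practice; the records and field names here are illustrative):

```python
# Filtering: remove records where age is less than 18.
people = [{"name": "Ana", "age": 17}, {"name": "Raj", "age": 34}]
adults = [p for p in people if p["age"] >= 18]

# Sorting: order sales transactions by date (ISO dates sort lexically).
sales = [{"date": "2024-03-01", "amount": 50},
         {"date": "2024-01-15", "amount": 80}]
sales.sort(key=lambda s: s["date"])

# Merging: join customer and order records on a common customer_id key.
customers = {1: {"name": "Ana"}, 2: {"name": "Raj"}}
orders = [{"customer_id": 2, "total": 99}]
merged = [{**customers[o["customer_id"]], **o} for o in orders]
print(merged)  # [{'name': 'Raj', 'customer_id': 2, 'total': 99}]
```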

5. Explain any 2 data cleaning techniques

● Handling Missing Data:

Identify and handle missing values, which can be represented as NaN, NULL, or other
placeholders.

Options include removing rows with missing data, imputing missing values with means,
medians, or modes (e.g., replacing each missing value with the column mean), or using more
advanced imputation techniques.

● Removing Duplicates:

Identify and remove duplicate rows from the dataset.

● Dealing with Outliers:

Detect and handle outliers in the data, either by removing them or transforming them.

● Standardizing Data:

Ensure consistency in data format, like converting text to lowercase or uppercase.
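Two of the techniques above, mean imputation and duplicate removal, can be sketched in plain Python (the sample values are illustrative):

```python
from statistics import mean

raw = [23.0, None, 25.0, 25.0, None, 400.0]  # None marks missing values

# 1. Handling missing data: impute with the mean of the observed values.
observed = [v for v in raw if v is not None]
filled = [v if v is not None else mean(observed) for v in raw]

# 2. Removing duplicates while preserving the original order.
deduped = list(dict.fromkeys(filled))
print(deduped)  # [23.0, 118.25, 25.0, 400.0]
```

Note that the outlier 400.0 drags the imputed mean up to 118.25, which is why outlier handling is usually done alongside imputation.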

6. Explain about Data Transformation in detail


Data transformation is the process of converting data from one format, such as a database file,
XML document or Excel spreadsheet, into another.

Transformations typically involve converting a raw data source into a cleansed, validated and
ready-to-use format. Data transformation is crucial to data management processes that include
data integration, data migration, data warehousing and data preparation.
It is a critical component for any organization seeking to leverage its data to generate timely
business insights.
As the volume of data has proliferated, organizations must have an efficient way to harness data
to effectively put it to business use. Data transformation is one element of harnessing this data
because, when done properly, it ensures data is easy to access, consistent, secure and
ultimately trusted by the intended business users.
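A small sketch of such a transformation, converting a raw CSV source into a cleansed, ready-to-use JSON format (the field names and values are illustrative):

```python
import csv
import io
import json

# Raw source with stray whitespace; io.StringIO stands in for a real file.
csv_source = io.StringIO("product, price \nWidget, 9.99 \n")

# Cleanse: strip whitespace from headers and values.
rows = [{k.strip(): v.strip() for k, v in row.items()}
        for row in csv.DictReader(csv_source)]

# Validate/convert: prices become numbers, not strings.
for row in rows:
    row["price"] = float(row["price"])

print(json.dumps(rows))  # [{"product": "Widget", "price": 9.99}]
```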

7. Draw the block diagram to show the data hierarchy in data governance

● MDM – Master Data Management

● RDM – Reference Data Management

● CM – Content Management

● RM – Record Management


8. List any 5 unique properties of data governance

9. Write the importance of Data governance and Data management

Data Governance:

● Ensures data accuracy, compliance, and accountability.


● Manages data policies, ownership, and risk.
● Improves decision-making and efficiency.

Data Management:

● Organizes, integrates, and secures data.


● Cleans and transforms data for accuracy.
● Supports data analytics and reporting.
● Enables effective data governance.
Module 3

1. What do you mean by business intelligence? Explain in brief the BI process


Business intelligence (BI) is a technology-driven process for analyzing data and delivering
actionable information that helps executives, managers and workers make informed business
decisions.
Process
Collect data from internal IT systems and external sources, prepare it for analysis, run queries
against the data
Create data visualizations, BI dashboards and reports to make the analytics results available to
business users for operational decision-making and strategic planning.

2. What are the benefits of BI


● Speed up and improve decision-making

● Optimize internal business processes

● Increase operational efficiency and productivity

● Spot business problems that need to be addressed


● BI enables C-suite executives and department managers to monitor business performance
on an ongoing basis so they can act quickly when issues or opportunities arise.
● Analyzing customer data helps make marketing, sales and customer service efforts more
effective.
● Supply chain, manufacturing and distribution bottlenecks can be detected before they
cause financial harm.
● HR managers are better able to monitor employee productivity, labor costs and other
workforce data.

3. How does the BI process work? Explain.


There are 5 steps to the BI process and they are:
1) Data from source systems is integrated and loaded into a data warehouse or other
analytics repository.
2) Data sets are organized into analysis data models or OLAP cubes to prepare them for
analysis.
3) BI analysts, other analytics professionals and business users run analytical queries against
the data.
4) The query results are built into data visualizations, dashboards, reports and online portals.
5) Business executives and workers use the information for decision-making and strategic
planning.
4. What is a Data Warehouse?
A data warehouse is a type of data repository used to store large amounts of structured data from
various data sources.

5. What is the primary purpose of a Data Warehouse


Data warehouses are designed to feed information into decision support systems, business
intelligence (BI) software, data dashboards, and other types of analytics and reporting tools.
Enables an organization to easily access and analyze relevant data to extract key business
insights and plan for the future.

6. What is a cloud data warehouse?


A cloud data warehouse is a type of data warehouse that is managed and hosted by a cloud
service provider (CSP).

7. What are the different components of Data warehouse architecture


● Central Database
● Data integration tools
● Metadata
● Data access tools

8. Compare ETL and ELT in detail

● ETL is Extract Transform Load


Takes raw data, transforms it into a predetermined format, then loads it into the target
data warehouse.
ETL is slower than ELT.

● ELT is Extract Load Transform


Takes raw data, loads it into the target data warehouse, then transforms it just before
analytics.
ELT is faster than ETL as it can use the internal resources of the data warehouse.
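The ETL sequence can be sketched as a toy pipeline in plain Python (the functions, records, and in-memory "warehouse" are illustrative; real pipelines use dedicated tools):

```python
def extract():
    """Pull raw, messy rows from a source system."""
    return ["  Alice,30 ", "Bob,25"]

def transform(rows):
    """Convert raw rows into the predetermined target format."""
    out = []
    for row in rows:
        name, age = row.strip().split(",")
        out.append({"name": name, "age": int(age)})
    return out

warehouse = []  # stands in for the target data warehouse

def load(records):
    warehouse.extend(records)

# ETL order: extract -> transform -> load.
load(transform(extract()))
print(warehouse)  # [{'name': 'Alice', 'age': 30}, {'name': 'Bob', 'age': 25}]
```

In ELT the `load` step would run before `transform`, with the transformation executed inside the warehouse itself, which is why ELT can exploit the warehouse's own compute resources.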
9. List down the different steps in data pipeline development process
The data pipeline development process starts by defining what data is needed and where it will
come from, then proceeds through the following steps:
● Data ingestion
● Data integration
● Data cleansing
● Data filtering
● Data transformation
● Data enrichment
● Data validation
● Data loading

10. Draw the data pipeline architecture.


11. Map each component of data pipeline architecture with a data pipeline process (Refer to the
block diagram)

Data ingestion → Data cleaning and integration → Data loading → Data visualization

Module 4
1. What is Data ethics in context with data in an enterprise?
Ans)
1. Responsible Data Use
- Adherence to ethical guidelines and principles governing the responsible collection, processing,
and use of data within the organization.
2. Privacy and Confidentiality
- Ensuring the protection of individuals' privacy rights and sensitive information through proper
data anonymization, encryption, and access controls.
3. Transparency and Accountability
- Promoting transparency by clearly communicating data practices, policies, and purposes to
stakeholders.
- Holding individuals and departments accountable for ethical data handling and
decision-making.
4. Fairness and Impartiality
- Avoiding biases in data collection, analysis, and decision-making to ensure fairness and
impartiality in outcomes.
- Mitigating algorithmic biases to prevent discrimination in automated decision systems.
5. Informed Consent and Control
- Obtaining informed consent from individuals regarding data collection, use, and sharing,
providing individuals control over their data.
6. Compliance with Regulations
- Abiding by legal and regulatory frameworks related to data protection, privacy, and security
(e.g., GDPR, HIPAA, CCPA).
7. Ethical AI and Machine Learning
- Integrating ethical considerations into the development and deployment of AI and machine
learning models, addressing issues of bias, fairness, and interpretability.
8. Continuous Education and Improvement
- Providing ongoing training and education to employees regarding ethical data practices and
evolving ethical standards.
- Regularly reviewing and updating policies and procedures to align with emerging ethical
challenges and changes in regulations.

2. What are the 5 principles of data ethics


Ans) 1. Transparency:
➢ Openness: Ensure clear and understandable communication about data practices, including
how data is collected, used, and shared within the organization.
➢ Disclosure: Inform stakeholders about the purposes, methods, and potential impacts of data
usage, fostering trust and accountability.
2. Fairness:
➢ Equality: Strive to eliminate biases in data collection, analysis, and decision-making
processes, ensuring fair treatment of individuals and groups.
➢ Avoidance of Discrimination: Mitigate algorithmic biases to prevent discriminatory outcomes
in automated decision systems or AI models.
3. Accountability:
➢ Responsibility: Hold individuals and departments accountable for ethical data handling,
ensuring compliance with regulations and internal policies.
➢ Oversight and Governance: Establish mechanisms for oversight, auditing, and governance to
monitor adherence to data ethics standards.
4. Privacy:
➢ Data Protection: Respect individuals' privacy rights by safeguarding their personal and
sensitive information through encryption, anonymization, and access controls.
➢ Informed Consent: Obtain informed consent from individuals regarding data collection,
usage, and sharing, granting individuals control over their data.
5. Beneficence:
➢ Positive Impact: Strive to use data for the benefit of individuals, society, and the organization,
considering the broader impact of data practices on stakeholders.
➢ Ethical Decision-Making: Prioritize ethical considerations in data-related decisions, aiming
for positive social outcomes and responsible data stewardship.

3. List any 5 data security best practices


Ans) Five data security best practices:
1) Regular data backups
2) Access control and user permissions
3) Data Encryption
4) Regular Software Updates and Patch Management
5) Employee Training and Awareness
