8-9 Spatial Data Maintenance
Lecture 06-07
Spatial Data Management
Overview of Spatial Data Management
Spatial database management deals with the storage, indexing, and
querying of data with spatial features, such as location and geometric
extent.
Many applications require the efficient management of spatial data,
including Geographic Information Systems, Computer Aided Design, and
Location Based Services.
Spatial indices are used by spatial databases (databases which store
information related to objects in space) to optimize spatial queries.
Conventional index types do not handle spatial queries efficiently, such as
how far apart two points are, or whether a point falls within a spatial area
of interest.
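A minimal sketch of the kind of query a spatial index accelerates: a uniform grid index that answers a rectangular range query by visiting only the cells that overlap the query window, instead of scanning every point. The class, cell size, and data here are illustrative, not part of any real spatial database.

```python
# Minimal uniform-grid spatial index (illustrative only).
# Points are bucketed by grid cell; a range query visits only the cells
# that overlap the query rectangle instead of scanning every point.

class GridIndex:
    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = {}          # (cx, cy) -> list of (x, y) points

    def _cell(self, x, y):
        return (int(x // self.cell_size), int(y // self.cell_size))

    def insert(self, x, y):
        self.cells.setdefault(self._cell(x, y), []).append((x, y))

    def range_query(self, xmin, ymin, xmax, ymax):
        cx0, cy0 = self._cell(xmin, ymin)
        cx1, cy1 = self._cell(xmax, ymax)
        hits = []
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                for (x, y) in self.cells.get((cx, cy), []):
                    if xmin <= x <= xmax and ymin <= y <= ymax:
                        hits.append((x, y))
        return hits

idx = GridIndex(cell_size=10)
for p in [(1, 1), (5, 7), (12, 3), (25, 25)]:
    idx.insert(*p)
print(sorted(idx.range_query(0, 0, 10, 10)))   # [(1, 1), (5, 7)]
```

Real spatial databases use more sophisticated structures such as R-trees, but the principle is the same: prune the search space using geometry before testing individual objects.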
Data management means more than simply handling updates. Here are
five scenarios that we often see.
Data migration
Data transformation
Data enhancement
Data integration
Data conflation
Data Migration
The process of selecting, preparing, extracting, and transforming data
and permanently transferring it from one computer storage system to
another.
It is a key consideration for any system implementation, upgrade, or
consolidation, and it is typically performed in such a way as to be as
automated as possible, freeing up human resources from tedious tasks.
It occurs for a variety of reasons, including server or storage equipment
replacements, maintenance or upgrades, application migration, website
consolidation, disaster recovery, and data center relocation.
Categories
Storage migration: Moving physical blocks of data from one disk to another,
often using virtualization techniques. The data format and content are not
usually changed in the process.
Database migration: Similarly, it may be necessary to move from one database vendor
to another, or to upgrade the version of database software being used.
Application migration: Changing application vendors, such as moving to a new CRM or ERP platform.
Business process migration: Business processes operate through a combination of
human and application systems actions. When these change they can require the
movement of data from one store, database or application to another to reflect the
changes to the organization.
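The database migration category above can be sketched on a toy example: copy records from a legacy store into a new schema, renaming fields and validating each record on the way. The stores, field names, and validation rule are hypothetical stand-ins for real systems.

```python
# Illustrative migration step: legacy records are extracted, field names are
# mapped to the new schema, and each record is validated before loading.
# All names here are hypothetical.

legacy_rows = [
    {"CUST_ID": "001", "CUST_NM": "Acme", "REGION_CD": "EU"},
    {"CUST_ID": "002", "CUST_NM": "Globex", "REGION_CD": "NA"},
]

FIELD_MAP = {"CUST_ID": "customer_id", "CUST_NM": "name", "REGION_CD": "region"}

def migrate(rows):
    migrated, rejected = [], []
    for row in rows:
        new_row = {FIELD_MAP[k]: v for k, v in row.items()}
        if new_row["customer_id"] and new_row["name"]:   # basic validation
            migrated.append(new_row)
        else:
            rejected.append(row)                         # kept for manual review
    return migrated, rejected

migrated, rejected = migrate(legacy_rows)
```

Automating the mapping and validation like this is what lets migrations run with minimal manual effort, as the slide notes.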
Disadvantages of Data Migration
Migration addresses the possible obsolescence of the data carrier but
does not address the fact that certain technologies which run the data
may be abandoned altogether, leaving migration useless.
Time-consuming – migration is a continual process, which must be
repeated every time a medium reaches obsolescence, for all data objects
stored on that medium.
Costly – an institution must purchase additional data storage media at
each migration.
If your old data management system has poor data quality and you migrate
that data into your new system as-is, the new system will most likely
inherit the same challenges, headaches, and poor data quality.
Data Transformation
Data transformation is the process of converting data from one format,
such as a database file, XML document or Excel spreadsheet, into
another.
Transformations often involve converting a raw data source into a
cleansed, validated and ready-to-use format.
Data transformation can be simple or complex, based on the required
changes to the data between the source (initial) data and the target
(final) data.
Data transformation can be divided into the following steps, each
applicable as needed based on the complexity of the transformation
required.
Data discovery: Typically the data is profiled using profiling tools or sometimes using
manually written profiling scripts to better understand the structure and characteristics
of the data and decide how it needs to be transformed.
Data mapping: The process of defining how individual fields are mapped, modified,
joined, filtered, aggregated, etc. to produce the final desired output.
Code generation: The process of generating executable code (e.g. SQL, Python, R, or
other executable instructions) that will transform the data based on the desired and
defined data mapping rules.
Data Transformation…
Code execution: Step whereby the generated code is executed against the data to
create the desired output. The executed code may be tightly integrated into the
transformation tool, or it may require separate steps by the developer to manually
execute the generated code
Data review: The final step in the process, which focuses on ensuring the output data
meets the transformation requirements. Any anomalies or errors found in the data
are communicated back to the developer or data analyst as new requirements
to be implemented in the transformation process.
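The steps above can be sketched as a small pipeline: profile the source (discovery), apply a mapping (mapping, code generation, and execution collapsed into one function for brevity), and flag anomalies for review. The dataset, field names, and conversion are illustrative assumptions, not part of any real tool.

```python
# Toy transformation pipeline mirroring the steps in the slides.
# Source field and values are made up for illustration.

source = [{"temp_f": "68"}, {"temp_f": "86"}, {"temp_f": "bad"}]

# Data discovery: profile the field to see what kinds of values occur.
def profile(rows, field):
    values = [r.get(field) for r in rows]
    numeric = sum(1 for v in values if str(v).lstrip("-").isdigit())
    return {"count": len(values), "numeric": numeric}

# Data mapping + code execution: convert Fahrenheit strings to Celsius floats.
def transform(rows):
    out, errors = [], []
    for r in rows:
        try:
            out.append({"temp_c": round((float(r["temp_f"]) - 32) * 5 / 9, 1)})
        except ValueError:
            errors.append(r)      # data review: anomalies go back to the analyst
    return out, errors

stats = profile(source, "temp_f")
clean, errors = transform(source)
```

The profiling step would have warned up front that one of the three values is non-numeric, which is exactly the kind of finding that feeds the data review loop.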
Types of Data Transformation
Batch Data Transformation
Developers write code or implement transformation rules in a data
integration tool, then execute that code or those rules on large volumes of data.
Batch data transformation is the cornerstone of virtually all data integration
technologies such as data warehousing, data migration and application integration.
Interactive Data Transformation
This is an emerging capability that allows business analysts and business users the
ability to directly interact with large datasets through a visual interface, understand the
characteristics of the data (via automated data profiling or visualization), and change or
correct the data through simple interactions such as clicking or selecting certain
elements of the data.
Data Conflation
Geospatial data conflation is the compilation or reconciliation of two
different geospatial datasets covering overlapping regions (Saalfeld
1988).
In general, the goal of conflation is to combine the best quality elements
of both datasets to create a composite dataset that is better than either
of them.
The consolidated dataset can then provide additional information that
cannot be gathered from any single dataset.
Based on the types of geospatial datasets dealt with, the conflation
technologies can be categorized into the following three groups.
Vector to vector data conflation: A typical example is the conflation of two road
networks of different accuracy levels.
Raster to raster data conflation
Raster to vector data conflation
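A toy sketch of vector-to-vector conflation, assuming two point datasets where dataset A has the more accurate geometry and dataset B the richer attributes. The datasets, the distance tolerance, and the merge rule are all illustrative assumptions.

```python
# Toy vector-to-vector conflation: match features across two datasets by
# proximity, then build a composite record that keeps the better geometry
# from A and the richer attributes from B. All values are illustrative.

import math

ds_a = [{"id": "a1", "x": 10.02, "y": 5.01}]                      # accurate geometry
ds_b = [{"id": "b7", "x": 10.3, "y": 5.2, "name": "Main St / 1st Ave"}]

def conflate(a_feats, b_feats, tolerance=0.5):
    merged = []
    for a in a_feats:
        for b in b_feats:
            # Match features whose locations agree within the tolerance.
            if math.hypot(a["x"] - b["x"], a["y"] - b["y"]) <= tolerance:
                merged.append({"x": a["x"], "y": a["y"], "name": b["name"]})
    return merged

result = conflate(ds_a, ds_b)
```

Real conflation systems use far more robust matching (topology, names, line geometry), but the composite-of-the-best idea is the same as in the slide's definition.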
Introduction: Geodatabase
At its most basic level, a geodatabase is a collection of geographic
datasets of various types held in a common file system folder, a Microsoft
Access database, or a multiuser relational DBMS (such as Oracle,
Microsoft SQL Server, PostgreSQL, Informix, or IBM DB2).
Geodatabases come in many sizes, have varying numbers of users and
can scale from small, single-user databases built on files up to larger
workgroup, department, and enterprise geodatabases accessed by many
users.
It is the physical store of geographic information, primarily using a
database management system or file system.
Geodatabases have a comprehensive information model for representing
and managing geographic information.
This comprehensive information model is implemented as a series of tables holding
feature classes, raster datasets, and attributes.
In addition, advanced GIS data objects add GIS behavior; rules for managing spatial
integrity; and tools for working with numerous spatial relationships of the core
features, rasters, and attributes
Geodatabases have a transaction model for managing GIS data workflows.
Types: Personal geodatabases
Personal geodatabases—All datasets are stored within a Microsoft Access
data file, which is limited in size to 2 GB.
The original data format for ArcGIS geodatabases, stored and managed in
Microsoft Access data files. (This is limited in size and tied to the Windows
operating system.)
Single user and small workgroups with smaller datasets: some readers and one writer.
Concurrent use eventually degrades for large numbers of readers.
All the contents in each personal geodatabase are held in a single Microsoft Access file
(.mdb).
Two GB per Access database. The effective limit before performance degrades is
typically between 250 and 500 MB per Access database file.
Often used as an attribute table manager (via Microsoft Access). Users like the string
handling for text attributes.
File Geodatabases
A collection of various types of GIS datasets held in a file system folder
Stored as folders in a file system.
Each dataset is held as a file that can scale up to 1 TB in size. The file
geodatabase is recommended over personal geodatabases.
Features:
Provide a widely available, simple, and scalable geodatabase solution for all users.
Provide a portable geodatabase that works across operating systems.
Scale up to handle very large datasets.
Use an efficient data structure that is optimized for performance and storage.
File geodatabases also allow users to compress vector data to a read-only format to
reduce storage requirements even further.
Outperform shapefiles for operations involving attributes and scale the data size limits
way beyond shapefile limits.
Enterprise geodatabases
Enterprise geodatabases—Also known as multiuser geodatabases, they can
be unlimited in size and number of users.
A collection of various types of GIS datasets held as tables in a relational
database
Stored in a relational database using Oracle, Microsoft SQL Server, IBM DB2,
IBM Informix, or PostgreSQL.
Features:
Extremely large, continuous GIS databases
Many simultaneous users
Long transactions and versioned workflows
Relational database support for GIS data management (providing the benefits of a relational
database for scalability, reliability, security, backup, integrity, and so forth)
SQL types for Spatial in all supported DBMSs (Oracle, SQL Server, PostgreSQL, Informix, and
DB2)
High performance that can scale to a very large number of users
The architecture of a geodatabase
The geodatabase is object relational
Based on a series of simple yet essential relational database concepts and
leverages the strengths of the underlying database management system.
Simple tables and well-defined attribute types are used to store the schema,
rule, base, and spatial attribute data for each geographic dataset.
Through this approach, structured query language (SQL)—a series of
relational functions and operators—can be used to create, modify, and query
tables and their data elements.
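The relational approach can be illustrated with plain SQL over a table that stores feature coordinates as ordinary columns. This sketch uses Python's sqlite3 module purely for illustration; a real geodatabase uses dedicated spatial types and spatial indexes rather than bare REAL columns.

```python
# Sketch of the object-relational idea: geometry stored in ordinary table
# columns and queried with plain SQL. Table and data are illustrative.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (name TEXT, x REAL, y REAL)")
conn.executemany("INSERT INTO cities VALUES (?, ?, ?)",
                 [("Alpha", 1.0, 1.0), ("Beta", 9.0, 9.0), ("Gamma", 3.0, 4.0)])

# A bounding-box query expressed in ordinary SQL.
rows = conn.execute(
    "SELECT name FROM cities WHERE x BETWEEN 0 AND 5 AND y BETWEEN 0 AND 5 "
    "ORDER BY name"
).fetchall()
print([r[0] for r in rows])   # ['Alpha', 'Gamma']
conn.close()
```

Supported DBMSs expose true SQL spatial types (e.g. geometry columns with spatial operators), so queries like this can use spatial functions and indexes instead of plain coordinate comparisons.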