Page382

Database Integrity

In addition to the previously discussed relational database integrity issues of semantic, referential, and entity integrity, databases must also ensure data integrity: the integrity of the entries in the database tables. This treats integrity as a more general issue: mitigating unauthorized modifications of data. The primary challenge associated with data integrity within a database is simultaneous attempted modifications of data. A database server typically runs multiple threads (lightweight processes), each capable of altering data. What happens if two threads attempt to alter the same record?

DBMSs may attempt to commit updates: make the pending changes permanent. If the commit is unsuccessful, the DBMSs can rollback (also called abort) and restore from a savepoint (clean snapshot of the database tables).

A database journal is a log of all database transactions. Should a database become corrupted, the database can be reverted to a backup copy, and then subsequent transactions can be “replayed” from the journal, restoring database integrity.

Database Replication and Shadowing

Databases may be highly available (HA), replicated with multiple servers containing multiple copies of tables. Integrity is the primary concern with replicated databases: if a record is updated in one table, it must be simultaneously updated in all tables. Also, what happens if two processes attempt to update the same tuple simultaneously on two different servers? They both cannot be successful; this would violate the integrity of the tuple.

Database replication mirrors a live database, allowing simultaneous reads and writes to multiple replicated databases by clients. Replicated databases pose additional integrity challenges. A two-phase (or multiphase) commit can be used to assure integrity: before committing, the DBMS requests a vote. If the DBMSs on each server agree to commit, the changes are made permanent. If any DBMSs disagree, the vote fails, and the changes are not committed (not made permanent).

A shadow database is similar to a replicated database, with one key difference: a shadow database mirrors all changes made to a primary database, but clients do not access the shadow. Unlike replicated databases, the shadow database is one-way (data flows from primary to shadow): it serves as a live data backup of the primary.

Data Warehousing and Data Mining

As the name implies, a data warehouse is a large collection of data. Modern data warehouses may store many terabytes (1000 gigabytes) or even petabytes (1000 terabytes) of data. This requires large scalable storage solutions.

The storage must be high performance, and allow analysis and searches of the data.

Once data is collected in a warehouse, data mining is used to search for patterns. Commonly sought patterns include signs of fraud. Credit card companies manage some of the world’s largest data warehouses, tracking billions of transactions per year. Fraudulent transactions are a primary concern of credit card companies that lead to millions of dollars in lost revenue. No human could possibly monitor all of those transactions, so the credit card companies use data mining to separate the signal from noise. A common data mining fraud rule monitors multiple purchases on one card in different states or countries in a short period of time. A violation record can be produced when this occurs, leading to suspension of the card or a phone call to the card owner’s home.