File Structure Data Storage Query Evaluation Indexing and Hashing
File Structure Data Storage Query Evaluation Indexing and Hashing
Chapter 6
File structure
Data storage
Query Evaluation
Indexing and hashing
Fixed-Length Records
Suppose we have a table that has the following organization:
branch-name : char(22); 22
account-number : char(10); 10 40
balance : real; 08
File structure & Data storage End
1
Fixed-Length Records Fixed-Length Records
record. How it can be used for another record ? Record 2 Deleted and All Records
Moved
If fix the size then it may possible some records will cross block
boundaries and it would require two block access to read or write such a
record.
2
Fixed-Length Records Variable-Length Records
Storage of multiple record types in a file. Attach an end-of-record (^) control character to the end of each record
Difficulty with deletion and growth (how to reuse deleted space?)
Record types that allow variable lengths for one or more fields.
No space, in general, for a record to grow
Record types that allow repeating fields (used in some older data
models).
Fixed-length representation
3
Variable-Length Records Variable-Length Records
Pointer method
Wastage
of space
4
Variable-Length Records Sequential File Organization (Cont.)
Pointer method Deletion – use pointer chains
Insertion –locate the position where the record is to be inserted
Disadvantage to pointer structure; space is wasted in all records except the first in a chain.
if there is free space insert there
Solution is to allow two kinds of block in file:
if no free space, insert the record in an overflow block
Anchor block – contains the first records of chain
In either case, pointer chain must be updated
Overflow block – contains records other than those that are the first records
Need to reorganize the file from time to time to restore sequential order
of chairs.
Suitable for applications that require sequential processing of the entire file
Query Evaluation
5
Query Processing & Optimization Processing Steps
What is Query Processing?
Steps required to transform high level SQL query into a correct and
“efficient” strategy for execution and retrieval.
E.g.
σbalance<2500(Πbalance(account)) is equivalent to
Πbalance(σbalance<2500(account))
6
Three Major Steps of Processing Measures of Query Cost
Parsing and Translation Cost is generally measured as total elapsed time for answering query.
EVALUAION PRIMITIVE: Relational algebra expression annotated with instruction
Factors contribute to time cost including:
specifying how to evaluate those operation
Disk accesses
Evaluation plan: Annotated expression specifying detailed evaluation
strategy. “ a sequence of primitive operations that can be used to evaluate a How does the index/hashing approach impact?
query is a query execution plan or query evaluation plan” CPU time
E.g., use an index on balance to find accounts with balance < 2500, or perform
Network communication
complete relation scan and discard accounts with balance ≥ 2500
Measured by taking into account
Πbalance
Number of seeks * average-seek-cost
σbalance<2500 ; use index 1(e.g. index 1 is index on balance) Number of blocks read * average-blockread-cost
To choose among different query evaluation plans, the optimizer has to number of seeks as the cost measures.
estimate the cost of each evaluation plan. Cost is estimated using tT – time to transfer one block
statistical information from the database catalog. E.g. number of tuples in tS – time for one seek
each relation, size of tuples, etc. Cost for b block transfers plus S seeks:
b * tT + S * tS
for simplicity we ignore.
Evaluation
CPU costs (Real systems do take CPU cost into account)
Once the query plan is chosen, the query is evaluated with that plan, and Costs of writing to disk. (Taken into account separately where necessary)
the result of the query is output.
7
Index Evaluation Metrics
Access types
Access time
Insertion time
Space overhead
Search Key - attribute to set of attributes used to look up records in a Primary index: in a sequentially ordered file, the index whose search key
An index file consists of records (called index entries) of the form Also called clustering index
search-key pointer The search key of a primary index is usually but not necessarily the
Index files are typically much smaller than the original file primary key.
Two basic kinds of indices: Secondary index: an index whose search key specifies an order different
Ordered indices: search keys are stored in sorted order from the sequential order of the file. Also called non-clustering index.
8
Ordered Indices…Primary Index Primary Index…Sparse Index Files
Sparse Index: contains index records for only some search-key values.
Index-sequential file: ordered sequential file with a primary index.
Applicable when records are sequentially ordered on search-key
Good tradeoff: sparse index with an index entry for every block in file,
corresponding to least search-key value in the block.
9
Primary Index…Sparse Index Files Primary Index…Multilevel Indices
Even with a sparse index, index size may still grow too large.
For 100,000 records, 10 per block, at one index record per block, that's
10,000 index records! Even if we can fit 100 index records per block, this is
100 blocks.
Binary search method can be used for search : for b blocks blocks
to be read
outer index – a sparse index of primary index Dense indices – deletion of search-key: similar to file record deletion.
If even outer index is too large to fit in main memory, yet another level of if an entry for the search key exists in the index, it is deleted by
index can be created, and so on. replacing the entry in the index with the next search-key value in the
file (in search-key order).
Indices at all levels must be updated on insertion or deletion from the file.
If the next search-key value already has an index entry, the entry is
deleted instead of being replaced.
10
Primary Index…Index Update: Deletion Primary Index…final words
Sparse indices – if index stores an entry for each block of the file, no
change needs to be made to the index unless a new block is created. Can we perform sequential search if we use dense index for
If a new block is created, the first search-key value appearing in secondary index?
the new block is inserted into the index. Can we use sparse index for secondary index?
11
Ordered Indices…Secondary Indices Primary and Secondary Indices
It is not enough to point to just the first record with each search-key value Indices offer substantial benefits when searching for records.
because the remaining records with the same search-key value could be
BUT: Updating indices imposes overhead on database modification
anywhere in the file.
--when a file is modified, every index on the file must be updated,
Therefore, a secondary index must contain pointers to all the
Sequential scan using primary index is efficient, but a sequential scan
records.
using a secondary index is expensive
Use an extra-level of indirection to implement secondary indices on
Each record access may fetch a new block from disk
search keys that are not candidate keys. A pointer does not point directly
Block fetch requires about 5 to 10 micro seconds, versus about 100
to the file but to a bucket that contains pointers to the file.
nanoseconds for memory access
Hashing
Index record points to a bucket that contains pointers to all the actual
records with that particular search-key value.
12
Static Hashing Hash Functions
Worst hash function maps all search-key values to the same bucket;
A bucket is a unit of storage containing one or more records (a bucket
search time proportional to number of search key values.
is typically a disk block).
Two properties of hash function
In a hash file organization we obtain the bucket of a record directly
Uniform : each bucket is assigned the same number of search-key
from its search-key value using a hash function.
values
Hash function h is a function from the set of all search-key values K to
Random : each bucket will have the same number of records
the set of all bucket addresses B. h: K B
assigned.
Hash function is used to locate records for access, insertion as well as
Typical hash functions perform computation on the internal binary
deletion.
representation of the search-key.
Records with different search-key values may be mapped to the same For example, for a string search-key, the binary representations of all
bucket; thus entire bucket has to be searched sequentially to locate a the characters in the string could be added and the sum modulo the
record. number of buckets could be returned. .
13
Handling of Bucket Overflows (Cont.)
Overflow chaining – the overflow buckets of a given bucket are chained
together in a linked list.
14