File Structure, Data Storage, Query Evaluation, Indexing and Hashing

The document discusses file organization and query processing. It covers fixed-length records; variable-length records represented using byte strings, pointers, and reserved space; and sequential file organization, including deletion, insertion, and the need to reorganize files over time. Query processing involves parsing, optimization (choosing the most efficient way to evaluate a query), and evaluation of the chosen query execution plan.

File Organization

Chapter 6

File structure
Data storage
Query Evaluation
Indexing and hashing

Fixed-Length Records
Suppose we have a table with the following record type:

type deposit = record
    branch-name : char(22);     -- 22 bytes
    account-number : char(10);  -- 10 bytes
    balance : real;             -- 8 bytes
end

Assumptions: if each character occupies 1 byte and a real occupies 8 bytes, then this record occupies 40 bytes. The first record occupies the first 40 bytes of the file, the second record occupies the next 40 bytes, and so on.
Fixed-Length Records

Problems with this approach:

It is difficult to delete a record, because there is no easy way to mark the deleted slot and reuse it for another record. Two simple solutions: move every following record up one slot ("record 2 deleted and all records moved"), or move only the final record into the freed slot ("record 2 deleted and final record moved").

Unless the record size evenly divides the block size, some records will cross block boundaries, and reading or writing such a record would require two block accesses.
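As a minimal sketch of this layout, the fixed-length file above can be modeled in Python with an in-memory bytearray standing in for the disk file. The class and method names are hypothetical; deletion uses the "final record moved" strategy from the slide.

```python
import struct

RECORD_SIZE = 40
FMT = "<22s10sd"  # char(22), char(10), real (8-byte double), no padding

class FixedLengthFile:
    def __init__(self):
        self.data = bytearray()  # stands in for the disk file

    def num_records(self):
        return len(self.data) // RECORD_SIZE

    def insert(self, branch, account, balance):
        # struct pads/truncates the strings to their fixed widths
        self.data += struct.pack(FMT, branch.encode(), account.encode(), balance)

    def read(self, i):
        branch, account, balance = struct.unpack_from(FMT, self.data, i * RECORD_SIZE)
        return branch.rstrip(b"\0").decode(), account.rstrip(b"\0").decode(), balance

    def delete(self, i):
        # "Record i deleted and final record moved": copy the last record
        # into slot i, then shrink the file by one record.
        last = self.num_records() - 1
        start = i * RECORD_SIZE
        self.data[start:start + RECORD_SIZE] = self.data[last * RECORD_SIZE:]
        del self.data[last * RECORD_SIZE:]
```

With three 40-byte records, deleting slot 1 moves the final record into it and the file shrinks from 120 to 80 bytes.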
Variable-Length Records

Variable-length records arise in database systems in several ways:

Storage of multiple record types in a file.
Record types that allow variable lengths for one or more fields.
Record types that allow repeating fields (used in some older data models).

Different methods to represent variable-length records:

Byte string representation
Slotted page structure
Fixed-length representation

Byte string representation: attach an end-of-record (^) control character to the end of each record. Difficulties: deletion and growth (how do we reuse deleted space?), and there is, in general, no space for a record to grow.
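The byte string representation can be sketched as follows; this is a simplified illustration that assumes the '^' symbol never appears inside a record (a real system would use an unprintable sentinel or a length prefix).

```python
EOR = "^"  # end-of-record control character

def pack_records(records):
    """Concatenate records, attaching the end-of-record symbol to each."""
    return "".join(r + EOR for r in records)

def unpack_records(byte_string):
    """Split the byte string back into records by scanning for EOR."""
    return byte_string.split(EOR)[:-1]  # drop the empty tail after the final EOR

packed = pack_records(["Perryridge|A-102|400", "Round Hill|A-305|350"])
```

Note that locating the i-th record now requires scanning from the start, which is one reason deletion and reuse of space are awkward in this scheme.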

Variable-Length Records

Use one or more fixed-length records:

Reserved space – can use fixed-length records of a known maximum length; unused space in shorter records is filled with a null or end-of-record symbol. Disadvantage: this is useful when most records have lengths near the maximum; otherwise space is wasted.

Pointer method – a variable-length record is represented by a list of fixed-length records, chained together via pointers. This can be used even if the maximum record length is not known. Disadvantage of the pointer structure: space is wasted in all records except the first in a chain.
Variable-Length Records

Pointer method (cont.): to avoid wasting space in all records except the first in a chain, allow two kinds of block in the file:

Anchor block – contains the first record of each chain.
Overflow block – contains records other than the first records of chains.

Sequential File Organization

Suitable for applications that require sequential processing of the entire file.
The records in the file are ordered by a search-key.
Example: account ( account-number, branch-name, balance )

Deletion – use pointer chains.
Insertion – locate the position where the record is to be inserted:
if there is free space, insert there;
if there is no free space, insert the record in an overflow block.
In either case, the pointer chain must be updated.
The file needs to be reorganized from time to time to restore sequential order.
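The insertion rule above (insert in place if there is room, otherwise into an overflow area) can be sketched as follows. This is a simplified in-memory model, assuming fixed-capacity blocks and a merge-on-read stand-in for periodic reorganization; the class and method names are hypothetical.

```python
class SequentialFile:
    def __init__(self, block_capacity=2):
        self.blocks = [[]]              # main area, kept in search-key order
        self.overflow = []              # overflow block for records that do not fit
        self.capacity = block_capacity

    def insert(self, key, record):
        # Locate the block where the record belongs (linear scan for brevity).
        target = self.blocks[-1]
        for block in self.blocks:
            if block and key <= block[-1][0]:
                target = block
                break
        if len(target) < self.capacity:
            target.append((key, record))
            target.sort()               # keep the block in search-key order
        else:
            self.overflow.append((key, record))  # no free space: overflow block

    def scan(self):
        # Reorganization would merge the overflow area back into sequence;
        # here we simply merge when reading.
        return sorted([r for b in self.blocks for r in b] + self.overflow)
```

With block capacity 2, a third insertion lands in the overflow block, yet a scan still yields the records in search-key order.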

Query Evaluation

Query Processing & Optimization

What is Query Processing?
The steps required to transform a high-level SQL query into a correct and "efficient" strategy for execution and retrieval.

What is Query Optimization?
The activity of choosing a single "efficient" execution strategy (from hundreds) as determined by database catalog statistics.

Three Major Steps of Processing

1. Parsing and translation
2. Optimization
3. Evaluation

Parsing and Translation
Translate the query into its internal form; this is then translated into relational algebra.
The parser checks syntax and verifies relations.
A relational algebra expression may have many equivalent expressions.

E.g.

σbalance<2500(Πbalance(account)) is equivalent to

Πbalance(σbalance<2500(account))
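The equivalence above can be sketched over an in-memory account relation (a list of dicts; the sample rows are illustrative). Since the predicate uses only balance, selecting before or after the projection yields the same result.

```python
account = [
    {"account_number": "A-101", "branch_name": "Downtown",   "balance": 500},
    {"account_number": "A-102", "branch_name": "Perryridge", "balance": 400},
    {"account_number": "A-201", "branch_name": "Brighton",   "balance": 7000},
]

# σ balance<2500 ( Π balance (account) ): project first, then select
plan1 = [b for b in (row["balance"] for row in account) if b < 2500]

# Π balance ( σ balance<2500 (account) ): select first, then project
plan2 = [row["balance"] for row in account if row["balance"] < 2500]
```

An optimizer exploits such equivalences: the second form discards rows before projecting, which matters once real disk-resident relations are involved.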

Three Major Steps of Processing

Parsing and Translation (cont.)
Evaluation primitive: a relational algebra operation annotated with instructions specifying how to evaluate that operation.
Evaluation plan: an annotated expression specifying a detailed evaluation strategy; "a sequence of primitive operations that can be used to evaluate a query is a query execution plan or query evaluation plan".
E.g., use an index on balance to find accounts with balance < 2500, or perform a complete relation scan and discard accounts with balance ≥ 2500:

Πbalance
  |
σbalance<2500 ; use index 1 (e.g. index 1 is an index on balance)

Optimization
The process of selecting the most efficient query evaluation plan for a query.
To choose among different query evaluation plans, the optimizer has to estimate the cost of each plan. Cost is estimated using statistical information from the database catalog, e.g. the number of tuples in each relation, the size of tuples, etc.

Evaluation
Once the query plan is chosen, the query is evaluated with that plan, and the result of the query is output.

Measures of Query Cost

Cost is generally measured as total elapsed time for answering a query. Factors contributing to time cost include:

Disk accesses (how does the index/hashing approach impact this?)
CPU time
Network communication

Disk cost is measured by taking into account:

Number of seeks * average seek cost
Number of blocks read * average block-read cost
Number of blocks written * average block-write cost

The cost to write a block is greater than the cost to read a block, because data is read back after being written to ensure that the write was successful.

For simplicity, we use just the number of block transfers from disk and the number of seeks as the cost measures:
tT – time to transfer one block
tS – time for one seek
Cost for b block transfers plus S seeks: b * tT + S * tS

For simplicity we ignore:
CPU costs (real systems do take CPU cost into account)
Costs of writing to disk (taken into account separately where necessary)
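The cost measure b * tT + S * tS can be sketched as below. The timing values are assumptions chosen only for illustration (0.1 ms per block transfer, 4 ms per seek), not measurements from the slides.

```python
def query_cost(b, S, tT=0.0001, tS=0.004):
    """Estimated cost in seconds for b block transfers plus S seeks."""
    return b * tT + S * tS

# A plan that scans 1000 blocks sequentially (one initial seek) versus an
# index plan that reads 4 blocks, each preceded by a seek:
scan_cost = query_cost(b=1000, S=1)
index_cost = query_cost(b=4, S=4)
```

Under these assumed timings the index plan is much cheaper, which is exactly the kind of comparison the optimizer makes when choosing between a relation scan and an index lookup.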

Indexing and Hashing

Index Evaluation Metrics

Access types
Access time
Insertion time
Deletion time
Space overhead

Basic Concepts

Indexing mechanisms are used to speed up access to desired data. E.g., the author catalog in a library.
Search key – an attribute or set of attributes used to look up records in a file.
An index file consists of records (called index entries) of the form (search-key, pointer).
Index files are typically much smaller than the original file.
Two basic kinds of indices:
Ordered indices: search keys are stored in sorted order.
Hash indices: search keys are distributed uniformly across "buckets" using a "hash function".

Ordered Indices

In an ordered index, index entries are stored sorted on the search-key value.
Primary index: in a sequentially ordered file, the index whose search key specifies the sequential order of the file. Also called a clustering index. The search key of a primary index is usually, but not necessarily, the primary key.
Secondary index: an index whose search key specifies an order different from the sequential order of the file. Also called a non-clustering index.
Ordered Indices…Primary Index

Index-sequential file: an ordered sequential file with a primary index.
(Figure: sequential file for account records, sorted by branch-name.)

Two types of ordered indices can be used: 1) Dense Index 2) Sparse Index

Dense index – an index record (or index entry) appears for every search-key value in the file.

Sparse index – contains index records for only some search-key values; applicable when records are sequentially ordered on the search key.
To locate a record with search-key value K:
Find the index record with the largest search-key value ≤ K.
Search the file sequentially starting at the record to which that index record points.

Compared to dense indices, sparse indices offer:
Less space and less maintenance overhead for insertions and deletions.
But they are generally slower than a dense index for locating records.
Good tradeoff: a sparse index with an index entry for every block in the file, corresponding to the least search-key value in the block.
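The block-per-entry sparse index just described can be sketched as below; the branch names and record pointers are illustrative sample data.

```python
import bisect

# One index entry per block, holding the least search-key value in that block.
index = ["Brighton", "Mianus", "Redwood"]
blocks = [
    [("Brighton", "A-217"), ("Downtown", "A-101")],
    [("Mianus", "A-215"), ("Perryridge", "A-102")],
    [("Redwood", "A-222"), ("Round Hill", "A-305")],
]

def lookup(key):
    # Find the index record with the largest search-key value <= key.
    i = bisect.bisect_right(index, key) - 1
    if i < 0:
        return None                    # key sorts before every indexed block
    for k, rec in blocks[i]:           # sequential scan within that block
        if k == key:
            return rec
    return None
```

The index holds 3 entries instead of 6, at the cost of the short sequential scan inside the chosen block.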

Primary Index…Sparse Index Files (cont.)

Even with a sparse index, the index size may still grow too large. For 100,000 records at 10 per block, with one index record per block, that is 10,000 index records; even if we can fit 100 index records per block, the index occupies 100 blocks.

If the index is too large to be kept in main memory, a search results in several disk reads. Binary search can be used: for an index of b blocks, about ⌈log2(b)⌉ blocks must be read.

Primary Index…Multilevel Indices

If the primary index does not fit in memory, access becomes expensive.
Solution: treat the primary index kept on disk as a sequential file and construct a sparse index on it:
outer index – a sparse index of the primary index
inner index – the primary index file
If even the outer index is too large to fit in main memory, yet another level of index can be created, and so on.
Indices at all levels must be updated on insertion into or deletion from the file.

Primary Index…Index Update: Deletion

If the deleted record was the only record in the file with its particular search-key value, the search key is deleted from the index also.
Single-level index deletion:
Dense indices – deletion of the search key is similar to file record deletion.
Sparse indices –
If an entry for the search key exists in the index, it is deleted by replacing the entry with the next search-key value in the file (in search-key order).
If the next search-key value already has an index entry, the entry is deleted instead of being replaced.

Primary Index…final words

In a primary index, sequential search works because the record file is sorted according to the search key.
So we can also use the sparse form of index for a primary index.

Primary Index…Index Update: Insertion

Single-level index insertion:
Perform a lookup using the search-key value appearing in the record to be inserted.
Dense indices – if the search-key value does not appear in the index, insert it.
Sparse indices – if the index stores an entry for each block of the file, no change needs to be made to the index unless a new block is created.
If a new block is created, the first search-key value appearing in the new block is inserted into the index.
Multilevel insertion (as well as deletion) algorithms are simple extensions of the single-level algorithms.

Ordered Indices…Secondary Indices

Can we perform sequential search if we use a dense index as a secondary index?
Can we use a sparse index for a secondary index?

Ordered Indices…Secondary Indices

It is not enough to point to just the first record with each search-key value, because the remaining records with the same search-key value could be anywhere in the file. Therefore, a secondary index must contain pointers to all such records.

Use an extra level of indirection to implement secondary indices on search keys that are not candidate keys: a pointer does not point directly into the file but to a bucket that contains pointers into the file.

Primary and Secondary Indices

Indices offer substantial benefits when searching for records.
BUT: updating indices imposes overhead on database modification; when a file is modified, every index on the file must be updated.
A sequential scan using the primary index is efficient, but a sequential scan using a secondary index is expensive:
Each record access may fetch a new block from disk.
A block fetch requires about 5 to 10 milliseconds, versus about 100 nanoseconds for a memory access.
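The extra level of indirection can be sketched as below: a secondary index on balance maps each balance value to a bucket of pointers (here, record numbers) to all records with that value, since equal balances may be scattered anywhere in the file. The sample rows are illustrative.

```python
file_records = [
    ("A-101", "Downtown", 500),
    ("A-215", "Mianus", 700),
    ("A-102", "Perryridge", 400),
    ("A-305", "Round Hill", 700),   # same balance as A-215, far away in the file
]

# Build the secondary index: balance -> bucket of pointers into the file.
secondary = {}
for rec_no, (_, _, balance) in enumerate(file_records):
    secondary.setdefault(balance, []).append(rec_no)

def find_by_balance(balance):
    # Follow each pointer in the bucket to the actual record.
    return [file_records[p] for p in secondary.get(balance, [])]
```

Note that the index is dense: every balance value present in the file has an entry, because the file itself is not ordered on balance.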

Ordered Indices…Secondary Indices (cont.)

(Figure: secondary index on the balance field of account.)
Each index record points to a bucket that contains pointers to all the actual records with that particular search-key value.
Secondary indices have to be dense.

Hashing

Static Hashing

A bucket is a unit of storage containing one or more records (a bucket is typically a disk block).
In a hash file organization we obtain the bucket of a record directly from its search-key value, using a hash function.
A hash function h is a function from the set of all search-key values K to the set of all bucket addresses B: h: K → B.
The hash function is used to locate records for access, insertion, and deletion.
Records with different search-key values may be mapped to the same bucket; thus the entire bucket has to be searched sequentially to locate a record.

Hash Functions

The worst hash function maps all search-key values to the same bucket; search time is then proportional to the number of search-key values.
An ideal hash function has two properties:
Uniform: each bucket is assigned the same number of search-key values from the set of all possible values.
Random: each bucket has approximately the same number of records assigned to it, whatever the actual distribution of search-key values.
Typical hash functions perform computation on the internal binary representation of the search key. For example, for a string search key, the binary representations of all the characters in the string could be added and the sum modulo the number of buckets returned.

Example of Hash File Organization

(Figure: hash file organization of the account file, using branch_name as the key.)

Handling of Bucket Overflows

Bucket overflow can occur because of:
Insufficient buckets: the number of buckets nB must satisfy nB > nr / fr, where nr is the total number of records and fr is the number of records that fit in a bucket; otherwise overflow occurs.
Skew in the distribution of records (some buckets overflow while others remain nearly empty). This can occur for two reasons:
multiple records have the same search-key value;
the chosen hash function produces a non-uniform distribution of key values.
Although the probability of bucket overflow can be reduced, it cannot be eliminated; it is handled by using overflow buckets.

Handling of Bucket Overflows (Cont.)

Overflow chaining – the overflow buckets of a given bucket are chained together in a linked list.
The above scheme is called closed hashing.
An alternative, called open hashing, which does not use overflow buckets, is not suitable for database applications.
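Overflow chaining can be sketched as follows; the bucket size, bucket count, and hash function are illustrative assumptions.

```python
BUCKET_SIZE = 2
NUM_BUCKETS = 4

def hash_fn(key):
    return sum(str(key).encode()) % NUM_BUCKETS

class Bucket:
    def __init__(self):
        self.entries = []
        self.overflow = None   # link to the next overflow bucket in the chain

table = [Bucket() for _ in range(NUM_BUCKETS)]

def insert(key):
    b = table[hash_fn(key)]
    while len(b.entries) >= BUCKET_SIZE:
        if b.overflow is None:
            b.overflow = Bucket()   # allocate an overflow bucket on demand
        b = b.overflow              # follow the chain to the first free bucket
    b.entries.append(key)

def lookup(key):
    b = table[hash_fn(key)]
    while b is not None:            # walk the overflow chain
        if key in b.entries:
            return True
        b = b.overflow
    return False
```

Inserting three keys that hash to the same bucket fills the primary bucket and pushes the third key into a chained overflow bucket, yet lookup still finds all of them.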
