Dynamic Fragmentation and Heuristic Join Algorithm

Aditya C Awalkar
Dept. of Computer Engg.
Vidyalankar Inst. of Tech.
Mumbai, India
awalkaradi95@gmail.com
Ahmed Sabeeh Quadri
Dept. of Computer Engg.
Vidyalankar Inst. of Tech.
Mumbai, India
asabeeh18@hotmail.com
Aayush Khator
Dept. of Computer Engg.
Vidyalankar Inst. of Tech.
Mumbai, India
aayush.khator@vit.edu.in
ABSTRACT: Distributed database systems improve communication and data processing by distributing data across different network sites. The fragmentation of the database and the query processing and query optimization techniques of the distributed database system are the most important factors contributing to its efficiency. In this paper we present a model for dynamically fragmenting a distributed database in order to keep it optimized. We then discuss an improved join algorithm, which we call the heuristic join algorithm, and compare it with the traditional join and semijoin algorithms. For this comparison we take a variety of cases and plot the results on a graph, which lets us analyze the cases in which the algorithm works better than the others.
Keywords: Distributed Database, Query, Dynamic Fragmentation, Join, Semijoin, Heuristic, Optimization.
I. INTRODUCTION
A database management system is software that manages the storage, retrieval and updating of data in a computer system. When the database is kept at the same location from which it is accessed, it is called a centralized database. However, a centralized database has major disadvantages; for example, it becomes a bottleneck when demand is high, i.e. when many users access the data at the same time. To overcome this, the distributed database scheme was adopted. In a distributed database, the data is divided among several sites that are logically interrelated, and it can be accessed from any site connected to the network. Distributed database system (DDBS) technology is the combination of two diametrically opposed approaches to data processing: database system and computer network technologies. Distributed processing better corresponds to the organizational structure of today's widely distributed enterprises, and such a system is therefore more reliable as well as more responsive. Also, most everyday applications of computer technology are distributed.
The promises of Distributed Database Management Systems are as follows:
1) Transparent Management of Distributed and
Replicated Data
2) Reliability through Distributed Transactions
3) Improved Performance
4) Easier System Expansion
5) Replication
In distributed databases, a table is fragmented and the fragments are stored at different sites. Fragmentation is done according to minterm predicates, and the resulting fragments should satisfy the lossless join condition, i.e. the original table must be reconstructible from them.
There are 3 types of fragmentation:
1) Horizontal Fragmentation
2) Vertical Fragmentation
3) Hybrid Fragmentation
Since fragmentation determines the data stored at
each site, it becomes an important aspect in
distributed database management.
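As an illustration of horizontal fragmentation, the following minimal Python sketch splits a small table by predicates and checks that the union of the fragments gives back the original rows. The sample rows, the fragment names and the income cut-off of 1,000,000 are assumptions made for the example and are not taken from any real schema.

```python
# Illustrative sketch: horizontal fragmentation of a small table by predicates.
# Rows, fragment names and the 1,000,000 cut-off are hypothetical.

def fragment(rows, predicates):
    """Place each row in the fragment of the first predicate it satisfies."""
    fragments = {name: [] for name in predicates}
    for row in rows:
        for name, pred in predicates.items():
            if pred(row):
                fragments[name].append(row)
                break
        else:
            # No predicate matched: the fragmentation would not be complete.
            raise ValueError(f"no predicate covers row {row}")
    return fragments

def reconstruct(fragments):
    """Rebuild the table as the union of its horizontal fragments."""
    return [row for frag in fragments.values() for row in frag]

taxpayers = [
    {"id": 1, "income": 400_000},
    {"id": 2, "income": 1_500_000},
    {"id": 3, "income": 950_000},
]
predicates = {
    "low_income": lambda r: r["income"] < 1_000_000,
    "high_income": lambda r: r["income"] >= 1_000_000,
}
fragments = fragment(taxpayers, predicates)
rebuilt = sorted(reconstruct(fragments), key=lambda r: r["id"])
```

Because the two predicates are mutually exclusive and together cover every row, each row lands in exactly one fragment and `rebuilt` contains the same rows as `taxpayers`.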
II. DYNAMIC FRAGMENTATION MODEL
Issues with existing Methodology
In a distributed database, tables or fragments are stored at different sites to ensure optimization. A user fires a query to fetch results from the database without being aware of the internal architecture, i.e. the fragmentation and distribution. The query processor sends the query to the site where the relevant data is stored. However, for a variety of reasons, data access at one site may decrease while data access at other sites increases. The former site then becomes nearly redundant, while the increased load on the latter slows down the overall operation and ultimately reduces overall efficiency.
Consider an example where there are two sites holding the tables tax_1 and tax_2 respectively.
Site 1: tax_1 (tax details for annual income less than Rs 10 lacs)
Site 2: tax_2 (tax details for annual income more than Rs 10 lacs)
Due to an increase in income tax slab rates, the usage of tax details for income less than Rs 10 lacs fell while that for income greater than Rs 10 lacs rose, i.e.
usage(tax_1) << usage(tax_2)
This leads to a higher load on site 2 while site 1 becomes almost redundant. Efficiency drops, and the user gets the result after a longer period of time. To solve this problem, we propose our model.
Model Architecture
Figure: Dynamic Fragmentation Model showing 2 loops (in reality the loop repeats indefinitely).
The figure shows the architecture of the iterative model of fragmentation. It consists of 2 modules:
Fragmentation Module
It fragments the database and distributes it over the various sites. It takes its input from the computation module, together with the database statistics and the details of the previous fragmentation.
Computation Module
It is the most important module. It stores query statistics and maintains a record of which fragments were accessed over a given period of time. It has access to database statistics from the time of database creation. In each loop, it performs a computation over the query statistics and produces a result. If the result exceeds a set threshold, the database needs to be re-fragmented.
Basic Working
Each computation is done after a specific period of time, since efficiency would drop if the computation were done for every query or after a short time interval. The database of the computation module, however, keeps storing query statistics and the data accessed for every query fired on the given site. At every computation, the computation module calculates the result for one or more sites using a set of parameters whose records it has stored. The parameters are the fragments accessed, the number of queries fired per week, the average and peak CPU load, the queries fired on a particular site and the cost incurred in executing them. The result obtained is compared with the threshold that was set initially. If the result exceeds the threshold, re-fragmentation takes place. Re-fragmentation is done by the fragmentation module, which uses the basic fragmentation methodology together with database statistics for optimization. It takes into account the last fragmentation and transfers only those records that are extraneous on the site being re-fragmented. Since re-fragmentation considers the last fragmentation and transfers only the additional records, it saves much of the communication cost.
Advantages
1) Redundancy Elimination
2) Faster Query Processing
3) Improved reliability, i.e. decreased chances of a server crash
4) Distribution of load across all sites
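The periodic check made by the computation module, and the idea of transferring only the records whose placement changes, could be sketched as follows. This is a minimal sketch under stated assumptions: the paper does not define how the result is computed from the parameters, so the score used here (the busiest site's query cost divided by the mean per-site cost), the `SiteStats` fields, the threshold value and the sample figures are all hypothetical.

```python
# Hypothetical sketch of the computation module's periodic check.
# The scoring formula and the threshold are assumptions, not the paper's definition.
from dataclasses import dataclass

@dataclass
class SiteStats:
    site: str
    queries_per_week: int
    avg_cpu_load: float       # fraction between 0.0 and 1.0
    total_query_cost: float   # cost incurred by queries executed on this site

def imbalance_score(stats):
    """Ratio of the highest per-site query cost to the mean per-site cost."""
    costs = [s.total_query_cost for s in stats]
    mean = sum(costs) / len(costs)
    return max(costs) / mean if mean > 0 else 0.0

def needs_refragmentation(stats, threshold):
    """Re-fragment only when the score exceeds the preset threshold."""
    return imbalance_score(stats) > threshold

def records_to_move(current, target):
    """Return only the records whose site differs from the last fragmentation."""
    return {rec: site for rec, site in target.items() if current.get(rec) != site}

stats = [
    SiteStats("site1", queries_per_week=50, avg_cpu_load=0.05, total_query_cost=100.0),
    SiteStats("site2", queries_per_week=5000, avg_cpu_load=0.90, total_query_cost=900.0),
]
refragment = needs_refragmentation(stats, threshold=1.5)

current = {"r1": "site1", "r2": "site2", "r3": "site2"}
target = {"r1": "site1", "r2": "site1", "r3": "site2"}
moves = records_to_move(current, target)
```

In this example only `r2` changes site, so only that record would be shipped, which is the source of the communication saving described above.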
III. JOIN BASED ALGORITHMS
Join and semijoin based algorithms are frequently used. Every database has relations between tables; hence joins are widely used when deriving specific information. In distributed databases, the join contributes much of the cost of a query. Hence, optimizing the join operation becomes a crucial step for faster query processing. The drawback of the join approach is that entire operand relations must be transferred between sites. The benefit of the semijoin is that it reduces the number of tuples needed to form the join. In distributed databases this is of great importance, since it reduces the network cost.
The semijoin operator has the vital property of shrinking the size of the operand relation. When the primary cost component considered by the query processor is communication, a semijoin proves very useful since it significantly reduces the data sent between sites. However, the use of semijoins may increase the number of messages and the local processing time.
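To make the reducing effect of the semijoin concrete, here is a small Python sketch over lists of dictionaries. The helper name `semijoin` and the sample rows are illustrative assumptions; a real query processor would operate on stored relations rather than in-memory lists.

```python
# Sketch of a semijoin: keep the tuples of one relation that match the other.
# The helper name and the sample data are hypothetical.

def semijoin(r_rows, s_rows, r_key, s_key):
    """Return the tuples of r_rows that have a matching tuple in s_rows."""
    # Only this set of join-attribute values has to be sent to r's site.
    s_values = {row[s_key] for row in s_rows}
    return [row for row in r_rows if row[r_key] in s_values]

employee = [
    {"ENo": 1, "EName": "Asha", "DeptNo": 10},
    {"ENo": 2, "EName": "Ravi", "DeptNo": 20},
    {"ENo": 3, "EName": "Meera", "DeptNo": 99},  # no matching department
]
department = [
    {"DNo": 10, "DName": "Sales"},
    {"DNo": 20, "DName": "HR"},
]
reduced = semijoin(employee, department, "DeptNo", "DNo")
```

Only the employees with a matching department remain in `reduced`, so fewer tuples need to be shipped for the final join.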
In certain cases, however, such as those involving relations of contrasting sizes, a different approach proves much more efficient than the traditional semijoin approach. Let us look at all three approaches.
Example
Consider two relations, Employee and Department, with the following schema:
1. Employee (ENo, EName, DeptNo, ……)
where each data item in
1. ENo is of 2 Bytes
2. EName is of 32 Bytes
3. DeptNo is of 2 Bytes
Total size of the Employee relation attributes is 100 Bytes. Number of tuples in the Employee relation = 10,000
2. Department (DNo, DName, ……)
where each data item in
1. DNo is of 2 Bytes
2. DName is of 16 Bytes
Total size of the Department relation attributes is 30 Bytes. Number of tuples in the Department relation = 100
The Employee relation is at Site 1 and the Department relation is at Site 2. Both sites are connected over the network, and the query is fired from a third site, Site 3, which is also part of the network and is the result site.
Figure: Employee relation at Site 1, Department relation at Site 2, and the result site, Site 3.
SQL Query:
SELECT Employee.EName, Department.DName
FROM Department, Employee
WHERE Employee.DeptNo = Department.DNo
We consider only network cost, and we consider the best case for each of the 3 methods.
Join Method
1) Transfer the Department relation to Site 1.
2) Perform the join operation as follows:
R = Π_{DName,EName} (Department ⋈_{Employee.DeptNo = Department.DNo} Employee)
3) Transfer the result R to Site 3.
Cost:
Transferring the Department relation from Site 2 to Site 1
+ Transferring the computed result of Site 1 to Site 3
= 100*(30)
+ 10,000*(32+16)
= 483,000
Semijoin Method
1) Transfer the Department relation to Site 3.
2) Transfer DNo to Site 1 as R1.
3) Perform the join operation as follows:
R2 = Π_{DeptNo,EName} (R1 ⋈_{Employee.DeptNo = R1.DNo} Employee)
4) Transfer R2 to Site 3.
5) R = Π_{DName,EName} (R2 ⋈_{R2.DeptNo = Department.DNo} Department)
Cost:
Transferring the Department relation from Site 2 to Site 3
+ Transferring Department.DNo (R1) from Site 2 to Site 1
+ Transferring the computed result (R2) of Site 1 to Site 3
= 100*(30)
+ 100*2
+ 10,000*(32+2)
= 343,200
Heuristic Method
1) Transfer Department.DNo and Department.DName as R1 to Site 3:
R1 = Π_{DName,DNo} (Department)
2) Transfer Employee.DeptNo and Employee.EName as R2 to Site 3:
R2 = Π_{EName,DeptNo} (Employee)
3) Perform the join operation as follows:
R = Π_{DName,EName} (R1 ⋈_{R1.DNo = R2.DeptNo} R2)
Cost:
Transferring R1 from Site 2 to Site 3
+ Transferring R2 from Site 1 to Site 3
= 100*(16+2)
+ 10,000*(32+2)
= 341,800
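The three cost calculations can be checked with a short Python sketch. It counts only bytes sent over the network, as in the example; the constant and function names are chosen here for illustration.

```python
# Network cost (bytes transferred) of the three methods in the worked example.
# Names are illustrative; sizes and tuple counts are those given in the example.
EMP_TUPLES, DEPT_TUPLES = 10_000, 100
ENAME, DEPTNO = 32, 2   # bytes per Employee attribute used by the query
DNO, DNAME = 2, 16      # bytes per Department attribute used by the query
DEPT_ROW = 30           # total bytes per Department tuple

def join_cost():
    # Whole Department relation to Site 1, then (EName, DName) result to Site 3.
    return DEPT_TUPLES * DEPT_ROW + EMP_TUPLES * (ENAME + DNAME)

def semijoin_cost():
    # Department to Site 3, DNo to Site 1, then (EName, DeptNo) result to Site 3.
    return DEPT_TUPLES * DEPT_ROW + DEPT_TUPLES * DNO + EMP_TUPLES * (ENAME + DEPTNO)

def heuristic_cost():
    # Projected Department and projected Employee both sent to Site 3.
    return DEPT_TUPLES * (DNAME + DNO) + EMP_TUPLES * (ENAME + DEPTNO)

saving = semijoin_cost() - heuristic_cost()
```

The saving of the heuristic method over the semijoin method comes entirely from sending the 18-byte projection of Department instead of the full 30-byte tuples plus a separate copy of DNo.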
From the example, we can conclude that semijoin improves efficiency compared to join, and that the heuristic method improves it further. Although the difference here is only 1,400 Bytes, in real cases where the relations are very large in terms of both tuples and attributes, the difference will be much larger. Hence, in such cases the heuristic method is better than both join methods.
Algorithm
1) For each site participating in the query:
2) Select the required attributes as mentioned in the query and also the joining attributes.
3) Transfer the resultant relations to a centralized site, preferably the site at which the result is expected.
4) Perform a natural join between all the relations.
Comparison
We compare the algorithms on various test cases by varying one quantity while keeping the others constant.
Case 1: Changing the size of the join attribute DNo
Figure: Cost versus size of DNo for Join, Semijoin and Heuristic.
Result: Join cost remains constant. Heuristic performs slightly better than Semijoin, and both increase linearly.
Case 2: Changing the number of tuples of Department
Figure: Cost versus number of tuples of Department for Join, Semijoin and Heuristic.
Result: Heuristic performs better than the other two algorithms.
Case 3: Changing the size of the DName attribute
Figure: Cost versus DName size for Join, Semijoin and Heuristic.
Result: Join cost increases rapidly, while semijoin cost remains constant. Heuristic cost keeps increasing and at a certain point crosses the semijoin cost.
Case 4: Changing the total size of Department attributes
Figure: Cost versus total size of Department attributes for Join, Semijoin and Heuristic.
Result: Heuristic cost remains constant while join and semijoin costs keep increasing.
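The trends reported in the four cases can be reproduced with a small parameter sweep. The sketch below is written under two assumptions that the text does not state explicitly: the total Department tuple size stays fixed at 30 Bytes when only DNo or DName is varied, and DeptNo always has the same size as DNo. The function and parameter names are hypothetical, and the swept values are chosen only for illustration.

```python
# Parameter sweep over the cost formulas of the worked example.
# Assumptions: dept_row stays fixed unless swept; DeptNo has the same size as DNo.

def costs(emp_tuples=10_000, dept_tuples=100, ename=32, key=2, dname=16, dept_row=30):
    """Return (join, semijoin, heuristic) network cost in bytes.

    `key` is the size of the join attribute (DNo in Department, DeptNo in Employee).
    """
    join = dept_tuples * dept_row + emp_tuples * (ename + dname)
    semijoin = dept_tuples * dept_row + dept_tuples * key + emp_tuples * (ename + key)
    heuristic = dept_tuples * (dname + key) + emp_tuples * (ename + key)
    return join, semijoin, heuristic

def sweep(param, values):
    """Vary one parameter and keep all others at their default values."""
    return {v: costs(**{param: v}) for v in values}

case1 = sweep("key", [2, 4, 6, 8, 10])              # size of DNo
case2 = sweep("dept_tuples", [100, 500, 1000, 2000])  # tuples of Department
case3 = sweep("dname", [16, 30, 40, 80])            # size of DName
case4 = sweep("dept_row", [30, 100, 200, 400])      # total size of Department attributes
```

Under these assumptions, the heuristic cost in Case 3 equals the semijoin cost when DName is 30 Bytes and exceeds it beyond that, which matches the crossover described in the result for that case.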
CONCLUSION
This paper proposes a model which dynamically re-fragments the database when the load on a site increases to a large extent. Re-fragmentation is done in such a way that the load is distributed across all sites, thus increasing efficiency. In the second part of the paper, our analysis showed that the heuristic algorithm works better than the traditional join and semijoin algorithms in the majority of the cases considered.
Acknowledgment
We would like to thank Prof. Vipul Dalal
(Distributed Database Management System
Professor, VIT) for guiding us in writing this paper.
We would also like to thank Prof. Sachin
Deshpande, head of Computer Engg. Dept. for
motivating us.