Hi, This is the abstract of my thesis. You also have the detailled plan. I've translated the section I.5 for the contributions and the associated bibliography. ABSTRACT In this thesis, we tackle the problem of query optimization in object oriented database systems. We especially take interest in the transformation of queries into an executable code and in their cost estimation. Object oriented databases are coming from object programming languages. Query optimization in this context faces many problems : the search space is very big, the placement model of objects is extremely flexible and the execution model is powerful. In this thesis we propose a complete physical optimization system able to take into account the features of the object model. Most of the contributions of this thesis have been implemented in an environment for object query optimization : the Flora prototype. As part of this prototype, we studied the different principles of dynamic optimization. This thesis first introduce a very general placement model easily adaptable to existing systems. The programmer use the schema of the database to specify the placement of objects. The system is then able to deduce the physical storage structures and their correspondent statistics. The optimizer rewrite queries by using this structures to improve their performances. We also define an object execution model based on the relational model and the object programming language. We present a new type of path expressions : the qualified path expressions. This study deals with a lot of experiments on the O2 system by using the OO7 benchmark. By the definition of these two models, we deduce the most important component of a physical optimizer : the cost model. This model is used to choose the best way to execute a query. We have built three levels of cost formulas adapted for each phases of the optimization process. Our cost model has the ability to take into account a large variety of features of the object model. We have validated this model with a lot of experiments on commercial object oriented database systems. Key Word : object oriented database systems, query optimization, data clustering, path expression, cost model, experimental validation, O2, OO7, physical optimization, dynamic optimization. ********************************************************************* Plan I Introduction 1 The query optimization problem ......................................2 1.1 General principles .........................................2 1.2 Physical optimization principles ...........................3 2 Problems introduced by the object concept ...........................3 2.1 The object model ...........................................4 2.2 The main problems ..........................................4 3 Some object query optimizers ........................................5 3.1 GOM ........................................................5 3.2 Epoq .......................................................6 3.3 ORION ......................................................7 3.4 O2 .........................................................7 3.5 Volcano ....................................................8 3.6 Cascades ...................................................9 3.7 Open OODB ..................................................9 3.8 Some OODBMS ...............................................10 3.9 Conclusion ................................................10 4 Context of the thesis ..............................................11 5 Contribution of the thesis .........................................12 6 Organization .......................................................12 II The Flora optimizer ......................................................15 1 Introduction .......................................................15 2 The algebra ........................................................16 2.1 The data model ............................................17 2.1.1 The schema ..........................................17 2.1.2 The database ........................................18 2.2 The operators .............................................18 2.3 The terms .................................................20 2.3.1 The different types of terms ........................20 2.3.2 The normal form of terms ............................21 3 The Flora system architecture ......................................21 3.1 The Flora optimizer architecture ..........................22 3.1.1 The search space ....................................23 3.1.2 The modular architecture ............................23 3.1.3 The main modules ....................................24 3.1.4 The code generator ..................................25 3.2 The meta-base .............................................26 3.2.1 The content .........................................26 3.2.2 The reliability problems ............................28 4 Physical optimization ..............................................29 4.1 Local optimization versus global optimization ............29 4.2 Physical optimization principles ..........................31 5 Dynamic optimization ...............................................34 5.1 Dynamic optimization principles ...........................34 5.2 Dynamic parameters ........................................35 5.3 State of art ..............................................35 5.3.1 Dynamic execution model .............................35 5.3.2 Two phases optimization .............................36 5.3.3 Dynamic optimization ................................37 5.4 Proposition ...............................................38 6 Conclusion .........................................................39 III Placement model .........................................................41 1 Introduction .......................................................41 2 The storage structures .............................................42 2.1 The relational model ......................................42 2.2 The object storage model ..................................43 2.2.1 The object model ....................................43 2.2.2 The storage model ...................................45 3 The different model of data organization ...........................46 3.1 The data distribution models ..............................46 3.2 The object clustering models ..............................47 3.2.1 The problems ........................................48 3.2.2 Some placement models ...............................48 4 A general clustering model and distribution model ..................49 4.1 Physical clustering of objects ............................50 4.1.1 Hypotheses ..........................................50 4.1.2 Definition ..........................................50 4.1.3 Some examples .......................................51 4.1.4 Uniform placement ...................................53 4.1.5 Fragmentation .......................................54 4.1.6 Placement statistics ................................56 4.1.7 Optimization ........................................57 4.1.8 The Yao' function ...................................58 4.2 Physical distribution of data .............................59 4.2.1 Mode of specification ...............................59 4.2.2 Level of defragmentation ............................61 5 Conclusion .........................................................62 IV Execution model ..........................................................63 1 Introduction .......................................................63 2 Physical algebra ...................................................64 2.1 Low level operators .......................................64 2.2 Access methods ............................................65 2.3 Algorithms for selection ..................................67 2.4 Algorithms for join .......................................67 2.5 Execution trees ...........................................68 2.6 The other operators .......................................69 3 The algorithms coming from the object model ........................70 3.1 The qualified path expressions ............................70 3.2 Existing evaluation methods ...............................71 3.2.1 Depth first fetch evaluation ........................72 3.2.2 Pointer Based Joins evaluation ......................73 3.2.3 Relational join evaluation ..........................74 3.3 Projections in qualified path expressions .................74 3.4 Execution tree ............................................75 3.5 A heuristic to reduce the search space ....................76 3.6 The methods ...............................................77 4 An execution model for distributed and parallel DBMSs ..............78 4.1 Definition of the transmit operator .......................78 4.2 Generation rules ..........................................79 5 Conclusion .........................................................81 V Cost model ................................................................83 1 Introduction .......................................................83 2 Three cost models ..................................................84 3 Statistics for cost calculating ....................................85 3.1 Machine parameters ........................................85 3.2 Statistics on objects .....................................85 3.3 Statistics on placement ...................................87 3.4 Statistics of the access methods ..........................88 4 Selectivity estimation of predicates ...............................88 4.1 Selectivity estimation of simple predicates ...............89 4.2 Selectivity estimation of complex predicates ..............90 4.3 Selectivity of qualified path expressions .................91 5 Cost of qualified path expressions .................................91 6 Cost of algorithms ................................................92 6.1 Cost of predicates ........................................93 6.1.1 Selection predicates ................................93 6.1.2 Join predicates .....................................94 6.1.3 Normal form of predicates ...........................94 6.1.4 Cost of a conjunction ..............................95 6.1.5 Cost of a disjunction ..............................95 6.1.6 Predicate placement .................................95 7 Cost of projections ...............................................95 8 Cost of access methods ...........................................96 8.1 The Scan ..................................................97 8.2 The Index Scan ............................................97 8.3 The Nested Join ...........................................97 8.4 The Index Join ............................................98 8.5 The Hash Join .............................................98 8.6 The Merge Join ............................................99 8.7 The Pointer Based Join ....................................99 9 Cost of the transmit operator ....................................100 9.1 Round robin ..............................................100 9.2 Range ....................................................100 9.3 Exact ....................................................100 9.4 Spread ...................................................101 10 Cost of execution plans .........................................101 11 Conclusion .......................................................102 VI Validation ..............................................................105 1 Introduction ......................................................105 1.1 The OO7 Benchmark ........................................105 1.1.1 Validation hypotheses ..............................107 2 Validation of the placement strategy ..............................108 2.1 Validation of the clustering techniques ..................108 2.2 Validation of the Yao' formula ...........................110 2.3 Validation of object placement in path expressions .......110 3 Validation of the execution model .................................112 3.1 Comparison of the DFF and the RJ .........................112 3.2 Comparison of the DFF, BFF and the RJ ....................115 3.3 Selectivity, FanOut and length of the path expression ....115 3.4 Comparison of mixed strategies ...........................116 4 Conclusion ........................................................120 VII Conclusion .............................................................121 Appendix A : Internal structure of the Flora optimizer .....................125 Appendix B : Meta-base schema ..............................................129 Appendix C : Statistics of placement calculation ...........................131 Appendix D : Algorithms description in O2C/O2API ...........................135 Bibliography ...............................................................138 ********************************************************************* I.5 Contribution of the thesis The most important module of the physical optimizer is the module of cost estimation. The preponderant component of this model is the time to read data on disk. Most of the existing cost models try to estimate this number with over-simple placement hypotheses. However, in OODBMSs, the placement of objects is quiet uncertain. Cost models coming from RDBMSs are not accurate enough for OODBMSs. The execution model of OODBMSs propose new algorithms for query evaluation. A small number of query optimization are able to calculate the cost of these algorithms. Furthermore, the usual relational techniques are not really developed in these systems. Most of the optimizers are not able to take into account the dynamic parameters of the system. Static optimization can't be totally inefficient in a real system. In order to solve these problems, this thesis propose some solutions : * A general placement model able to take into account most a the usual placement techniques in OODBMSs and in distributed systems [IDEA23N]. * A novel model of object clustering on disk able to improve the system performances [Gruser95]. * A cost model for an object optimizer able to take into account the placement information [VLDB95]. * A new way of expressing queries by the way of path expressions. * A cost model for the different possibilities of execution of these path expressions [VLDB96]. * An execution model for distributed database systems [IDEA4N]. * An extensible prototype proposing an environment for query optimization in OODBMSs [Flora96, ISI96]. * A proposition of implementation of a dynamic optimizer. ********************************************************************* Bibliography [ISI96], "Flora : A query optimizer for Object DataBase Systems" in ISI (Engineering of Information Systems), to appear, 1996 (with D. Florescu, H. Naacke, M. Ziane and P. Valduriez). [Flora96], "Design and Implementation of Flora, a Language for Object Algebra", in Information Science, Vol.87, Num.1-3, November 1996 (with D. Florescu, M. Novak, P. Valduriez and M. Ziane). [VLDB96] "Cost-Based Selection of Path Expression Processing in Object-Oriented Databases", in International Conference on Very Large DataBase, VLDB'96, Bombay, September 1996 (with G. Gardarin and Z. H. Tang). [VLDB95] "A Cost Model for Clustered Object-Oriented Databases", in International Conference on Very Large DataBase, VLDB'95, Zurich, September 1995 (with G. Gardarin and Z. H. Tang). [Gruser95] "A Cost Model for Evaluating Path Expression using Placement Information", in Int. Conf. for Young Computer Scientists, ICYCS'95, Beijing, July 1995 (with Z. H. Tang). [IDEA4N] "The Flora-opt language", IDEA technical report IDEA4N, June 1994 (M. Ziane and C. Chachaty). [IDEA23N] "A Cost Model for the Flora Optimizer", IDEA technical report IDEA23N, June 1994 (with Z. H. Tang).