TY - CONF
T1 - The Provenance of WINE
T2 - Dependable Computing Conference (EDCC), 2012 Ninth European
Y1 - 2012
A1 - Tudor Dumitras
A1 - Efstathopoulos, P.
KW - Benchmark testing
KW - CYBER SECURITY
KW - cyber security experiments
KW - data attacks
KW - data collection
KW - dependability benchmarking
KW - distributed databases
KW - distributed sensors
KW - experimental research
KW - field data
KW - information quality
KW - MALWARE
KW - Pipelines
KW - provenance
KW - provenance information
KW - raw data sharing
KW - research groups
KW - security of data
KW - self-documenting experimental process
KW - sensor fusion
KW - software
KW - variable standards
KW - WINE
KW - WINE benchmark
AB - The results of cyber security experiments are often impossible to reproduce, owing to the lack of adequate descriptions of the data collection and experimental processes. Such provenance information is difficult to record consistently when collecting data from distributed sensors and when sharing raw data among research groups with variable standards for documenting the steps that produce the final experimental result. In the WINE benchmark, which provides field data for cyber security experiments, we aim to make the experimental process self-documenting. The data collected includes provenance information – such as when, where and how an attack was first observed or detected – and allows researchers to gauge information quality. Experiments are conducted on a common test bed, which provides tools for recording each procedural step. The ability to understand the provenance of research results enables rigorous cyber security experiments, conducted at scale.
JA - Dependable Computing Conference (EDCC), 2012 Ninth European
ER -

TY - JOUR
T1 - A Dual Framework and Algorithms for Targeted Online Data Delivery
JF - IEEE Transactions on Knowledge and Data Engineering
Y1 - 2011
A1 - Roitman, Haggai
A1 - Gal, Avigdor
A1 - Raschid, Louiqa
KW - client/server multitier systems
KW - distributed databases
KW - online data delivery
KW - online information services
AB - A variety of emerging online data delivery applications challenge existing techniques for data delivery to human users, applications, or middleware that are accessing data from multiple autonomous servers. In this paper, we develop a framework for formalizing and comparing pull-based solutions and present dual optimization approaches. The first approach, most commonly used nowadays, maximizes user utility under the strict setting of meeting a priori constraints on the usage of system resources. We present an alternative and more flexible approach that maximizes user utility by satisfying all users. It does this while minimizing the usage of system resources. We discuss the benefits of this latter approach and develop an adaptive monitoring solution Satisfy User Profiles (SUPs). Through formal analysis, we identify sufficient optimality conditions for SUP. Using real (RSS feeds) and synthetic traces, we empirically analyze the behavior of SUP under varying conditions. Our experiments show that we can achieve a high degree of satisfaction of user utility when the estimations of SUP closely estimate the real event stream, and that SUP has the potential to save a significant amount of system resources. We further show that SUP can exploit feedback to improve user utility with only a moderate increase in resource utilization.
VL - 23
SN - 1041-4347
CP - 1
M3 - http://doi.ieeecomputersociety.org/10.1109/TKDE.2010.15
ER -

TY - CONF
T1 - Minimizing Communication Cost in Distributed Multi-query Processing
T2 - IEEE 25th International Conference on Data Engineering, 2009.
ICDE '09
Y1 - 2009
A1 - Li, Jian
A1 - Deshpande, Amol
A1 - Khuller, Samir
KW - Approximation algorithms
KW - Communication networks
KW - Computer science
KW - Cost function
KW - Data engineering
KW - distributed communication network
KW - distributed databases
KW - distributed multi-query processing
KW - grid computing
KW - Large-scale systems
KW - NP-hard
KW - optimisation
KW - Polynomials
KW - Publish-subscribe
KW - publish-subscribe systems
KW - Query optimization
KW - Query processing
KW - sensor networks
KW - Steiner tree problem
KW - Tree graphs
KW - trees (mathematics)
AB - Increasing prevalence of large-scale distributed monitoring and computing environments such as sensor networks, scientific federations, Grids, etc., has led to a renewed interest in the area of distributed query processing and optimization. In this paper we address a general, distributed multi-query processing problem motivated by the need to minimize the communication cost in these environments. Specifically, we address the problem of optimally sharing data movement across the communication edges in a distributed communication network given a set of overlapping queries and query plans for them (specifying the operations to be executed). Most of the problem variations of our general problem can be shown to be NP-Hard by a reduction from the Steiner tree problem. However, we show that the problem can be solved optimally if the communication network is a tree, and present a novel algorithm for finding an optimal data movement plan. For general communication networks, we present efficient approximation algorithms for several variations of the problem. Finally, we present an experimental study over synthetic datasets showing both the need for exploiting the sharing of data movement and the effectiveness of our algorithms at finding such plans.
JA - IEEE 25th International Conference on Data Engineering, 2009.
ICDE '09
PB - IEEE
SN - 978-1-4244-3422-0
M3 - 10.1109/ICDE.2009.85
ER -

TY - CONF
T1 - Spatial indexing of distributed multidimensional datasets
T2 - IEEE International Symposium on Cluster Computing and the Grid, 2005. CCGrid 2005
Y1 - 2005
A1 - Nam, B.
A1 - Sussman, Alan
KW - centralized global index algorithm
KW - centralized index server
KW - Computer science
KW - database indexing
KW - distributed databases
KW - distributed multidimensional dataset
KW - Educational institutions
KW - File servers
KW - Indexing
KW - Large-scale systems
KW - Multidimensional systems
KW - Network servers
KW - replication protocol
KW - replication techniques
KW - scalability
KW - Sensor systems
KW - spatial data structures
KW - spatial indexing
KW - two-level hierarchical index algorithm
KW - wide area networks
AB - While declustering methods for distributed multidimensional indexing of large datasets have been researched widely in the past, replication techniques for multidimensional indexes have not been investigated deeply. In general, a centralized index server may become the performance bottleneck in a wide area network rather than the data servers, since the index is likely to be accessed more often than any of the datasets in the servers. In this paper, we present two different multidimensional indexing algorithms for a distributed environment - a centralized global index and a two-level hierarchical index. Our experimental results show that the centralized scheme does not scale well for either insertion or searching the index. In order to improve the scalability of the index server, we have employed a replication protocol for both the centralized and two-level index schemes that allows some inconsistency between replicas without affecting correctness.
Our experiments show that the two-level hierarchical index scheme shows better scalability for both building and searching the index than the non-replicated centralized index, but replication can make the centralized index faster than the two-level hierarchical index for searching in some cases.
JA - IEEE International Symposium on Cluster Computing and the Grid, 2005. CCGrid 2005
PB - IEEE
VL - 2
SN - 0-7803-9074-1
M3 - 10.1109/CCGRID.2005.1558637
ER -

TY - CONF
T1 - Exploiting multiple paths to express scientific queries
T2 - 16th International Conference on Scientific and Statistical Database Management, 2004. Proceedings
Y1 - 2004
A1 - Lacroix, Z.
A1 - Moths, T.
A1 - Parekh, K.
A1 - Raschid, Louiqa
A1 - Vidal, M.-E
KW - access protocols
KW - biology computing
KW - BioNavigation system
KW - complex queries
KW - Costs
KW - Data analysis
KW - data handling
KW - Data visualization
KW - data warehouse
KW - Data warehouses
KW - Databases
KW - diseases
KW - distributed databases
KW - hard-coded scripts
KW - information resources
KW - Information retrieval
KW - mediation-based data integration system
KW - multiple paths
KW - query evaluation
KW - Query processing
KW - scientific data collection
KW - scientific discovery
KW - scientific information
KW - scientific information systems
KW - scientific object of interest
KW - scientific queries
KW - sequences
KW - Web resources
AB - The purpose of this demonstration is to present the main features of the BioNavigation system. Scientific data collection needed in various stages of scientific discovery is typically performed manually. For each scientific object of interest (e.g., a gene, a sequence), scientists query a succession of Web resources following links between retrieved entries. Each of the steps provides part of the intended characterization of the scientific object.
This process is sometimes partially supported by hard-coded scripts or complex queries that will be evaluated by a mediation-based data integration system or against a data warehouse. These approaches fail to guide the scientists during the collection process. In contrast, the BioNavigation approach presented in the paper provides the scientists with information on the available alternative resources, their provenance, and the costs of data collection. The BioNavigation system enhances a mediation-based integration system and provides scientists with support for the following: to ask queries at a high conceptual level; to visualize the multiple alternative resources that may be exploited to execute their data collection queries; to choose the final execution path to evaluate their queries.
JA - 16th International Conference on Scientific and Statistical Database Management, 2004. Proceedings
PB - IEEE
SN - 0-7695-2146-0
M3 - 10.1109/SSDM.2004.1311231
ER -

TY - CONF
T1 - Improving access to multi-dimensional self-describing scientific datasets
T2 - 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid, 2003. Proceedings. CCGrid 2003
Y1 - 2003
A1 - Nam, B.
A1 - Sussman, Alan
KW - Application software
KW - application-specific semantic metadata
KW - Bandwidth
KW - Computer science
KW - database indexing
KW - disk I/O bandwidth
KW - distributed databases
KW - Educational institutions
KW - Indexing
KW - indexing structures
KW - Libraries
KW - meta data
KW - Middleware
KW - multidimensional arrays
KW - multidimensional datasets
KW - Multidimensional systems
KW - NASA
KW - NASA remote sensing data
KW - Navigation
KW - query formulation
KW - self-describing scientific data file formats
KW - structural metadata
KW - very large databases
AB - Applications that query into very large multidimensional datasets are becoming more common.
Many self-describing scientific data file formats have also emerged, which have structural metadata to help navigate the multi-dimensional arrays that are stored in the files. The files may also contain application-specific semantic metadata. In this paper, we discuss efficient methods for performing searches for subsets of multi-dimensional data objects, using semantic information to build multidimensional indexes, and grouping data items into properly sized chunks to maximize disk I/O bandwidth. This work is the first step in the design and implementation of a generic indexing library that will work with various high-dimension scientific data file formats containing semantic information about the stored data. To validate the approach, we have implemented indexing structures for NASA remote sensing data stored in the HDF format with a specific schema (HDF-EOS), and show the performance improvements that are gained from indexing the datasets, compared to using the existing HDF library for accessing the data.
JA - 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid, 2003. Proceedings. CCGrid 2003
PB - IEEE
SN - 0-7695-1919-9
M3 - 10.1109/CCGRID.2003.1199366
ER -

TY - CONF
T1 - Decoupled query optimization for federated database systems
T2 - 18th International Conference on Data Engineering, 2002. Proceedings
Y1 - 2002
A1 - Deshpande, Amol
A1 - Hellerstein, J. M
KW - Algorithm design and analysis
KW - Cohera federated database
KW - Computer science
KW - Corporate acquisitions
KW - Cost function
KW - Database systems
KW - decoupled optimization
KW - Design optimization
KW - distributed databases
KW - federated databases
KW - federated relational database systems
KW - Internet
KW - Query optimization
KW - query optimizer
KW - Query processing
KW - Relational databases
KW - Space exploration
AB - We study the problem of query optimization in federated relational database systems.
The nature of federated databases explicitly decouples many aspects of the optimization process, often making it imperative for the optimizer to consult underlying data sources while doing cost-based optimization. This not only increases the cost of optimization, but also changes the trade-offs involved in the optimization process significantly. The dominant cost in the decoupled optimization process is the "cost of costing" that traditionally has been considered insignificant. The optimizer can only afford a few rounds of messages to the underlying data sources and hence the optimization techniques in this environment must be geared toward gathering all the required cost information with minimal communication. In this paper, we explore the design space for a query optimizer in this environment and demonstrate the need for decoupling various aspects of the optimization process. We present minimum-communication decoupled variants of various query optimization techniques, and discuss trade-offs in their performance in this scenario. We have implemented these techniques in the Cohera federated database system and our experimental results, somewhat surprisingly, indicate that a simple two-phase optimization scheme performs fairly well as long as the physical database design is known to the optimizer, though more aggressive algorithms are required otherwise.
JA - 18th International Conference on Data Engineering, 2002. Proceedings
PB - IEEE
SN - 0-7695-1531-2
M3 - 10.1109/ICDE.2002.994788
ER -

TY - CONF
T1 - Integrating distributed scientific data sources with MOCHA and XRoaster
T2 - Thirteenth International Conference on Scientific and Statistical Database Management, 2001. SSDBM 2001. Proceedings
Y1 - 2001
A1 - Rodriguez-Martinez, M.
A1 - Roussopoulos, Nick
A1 - McGann, J. M
A1 - Kelley, S.
A1 - Mokwa, J.
A1 - White, B.
A1 - Jala, J.
KW - client-server systems
KW - data sets
KW - data sites
KW - Databases
KW - Distributed computing
KW - distributed databases
KW - distributed scientific data source integration
KW - Educational institutions
KW - graphical tool
KW - hypermedia markup languages
KW - IP networks
KW - java
KW - Large-scale systems
KW - Maintenance engineering
KW - meta data
KW - metadata
KW - Middleware
KW - middleware system
KW - MOCHA
KW - Query processing
KW - remote sites
KW - scientific information systems
KW - user-defined types
KW - visual programming
KW - XML
KW - XML metadata elements
KW - XML-based framework
KW - XRoaster
AB - MOCHA is a novel middleware system for integrating distributed data sources that we have developed at the University of Maryland. MOCHA is based on the idea that the code that implements user-defined types and functions should be automatically deployed to remote sites by the middleware system itself. To this end, we have developed an XML-based framework to specify metadata about data sites, data sets, and user-defined types and functions. XRoaster is a graphical tool that we have developed to help the user create all the XML metadata elements to be used in MOCHA.
JA - Thirteenth International Conference on Scientific and Statistical Database Management, 2001. SSDBM 2001. Proceedings
PB - IEEE
SN - 0-7695-1218-6
M3 - 10.1109/SSDM.2001.938560
ER -

TY - JOUR
T1 - Techniques for update handling in the enhanced client-server DBMS
JF - IEEE Transactions on Knowledge and Data Engineering
Y1 - 1998
A1 - Delis, A.
A1 - Roussopoulos, Nick
KW - client disk managers
KW - client resources
KW - client-server computing paradigm
KW - client-server systems
KW - Computational modeling
KW - Computer architecture
KW - concurrency control
KW - data pages
KW - Database systems
KW - distributed databases
KW - enhanced client-server DBMS
KW - Hardware
KW - Local area networks
KW - long-term memory
KW - main-memory caches
KW - Network servers
KW - operational spaces
KW - Personal communication networks
KW - server update propagation techniques
KW - Transaction databases
KW - update handling
KW - Workstations
KW - Yarn
AB - The Client-Server computing paradigm has significantly influenced the way modern Database Management Systems are designed and built. In such systems, clients maintain data pages in their main-memory caches, originating from the server's database. The Enhanced Client-Server architecture takes advantage of all the available client resources, including their long-term memory. Clients can cache server data into their own disk units if these data are part of their operational spaces. However, when updates occur at the server, a number of clients may need not only to be notified about these changes, but also to obtain portions of the updates. In this paper, we examine the problem of managing server-imposed updates that affect data cached on client disk managers. We propose a number of server update propagation techniques in the context of the Enhanced Client-Server DBMS architecture, and examine the performance of these strategies through detailed simulation experiments. In addition, we study how the various settings of the network affect the performance of these policies.
VL - 10
SN - 1041-4347
CP - 3
M3 - 10.1109/69.687978
ER -

TY - JOUR
T1 - ADMS: a testbed for incremental access methods
JF - IEEE Transactions on Knowledge and Data Engineering
Y1 - 1993
A1 - Roussopoulos, Nick
A1 - Economou, N.
A1 - Stamenas, A.
KW - Add-drop multiplexers
KW - ADMS
KW - advanced database management system
KW - client-server architecture
KW - commercial database management systems
KW - Computational modeling
KW - Database systems
KW - distributed databases
KW - heterogeneous DBMS
KW - incremental access methods
KW - incremental gateway
KW - Information retrieval
KW - interoperability
KW - join index
KW - large databases
KW - Navigation
KW - network operating systems
KW - Object oriented databases
KW - Object oriented modeling
KW - Query processing
KW - System testing
KW - very large databases
KW - view index
KW - Workstations
AB - ADMS is an advanced database management system developed to experiment with incremental access methods for large and distributed databases. It has been developed over the past eight years at the University of Maryland. The paper provides an overview of ADMS, and describes its capabilities and the performance attained by its incremental access methods. This paper also describes an enhanced client-server architecture that allows an incremental gateway access to multiple heterogeneous commercial database management systems.
VL - 5
SN - 1041-4347
CP - 5
M3 - 10.1109/69.243508
ER -

TY - CONF
T1 - An algebra and calculus for relational multidatabase systems
T2 - First International Workshop on Interoperability in Multidatabase Systems, 1991. IMS '91. Proceedings
Y1 - 1991
A1 - Grant, J.
A1 - Litwin, W.
A1 - Roussopoulos, Nick
A1 - Sellis, T.
KW - Algebra
KW - autonomous databases
KW - Calculus
KW - Computer networks
KW - Computer science
KW - Data models
KW - Data structures
KW - Database systems
KW - database theory
KW - distributed databases
KW - Military computing
KW - multidatabase manipulation language
KW - multidatabase system
KW - multirelational algebra
KW - query languages
KW - relational algebra
KW - Relational databases
KW - Spatial databases
KW - theoretical foundation
AB - With the existence of many autonomous databases widely accessible through computer networks, users will require the capability to jointly manipulate data in different databases. A multidatabase system provides such a capability through a multidatabase manipulation language. The authors propose a theoretical foundation for such languages by presenting a multirelational algebra and calculus based on the relational algebra and calculus. The proposal is illustrated by various queries on an example multidatabase.
JA - First International Workshop on Interoperability in Multidatabase Systems, 1991. IMS '91. Proceedings
PB - IEEE
SN - 0-8186-2205-9
M3 - 10.1109/IMS.1991.153694
ER -

TY - JOUR
T1 - A pipeline N-way join algorithm based on the 2-way semijoin program
JF - IEEE Transactions on Knowledge and Data Engineering
Y1 - 1991
A1 - Roussopoulos, Nick
A1 - Kang, H.
KW - 2-way semijoin program
KW - backward size reduction
KW - Bandwidth
KW - Computer networks
KW - Costs
KW - Data communication
KW - data transmission
KW - Database systems
KW - database theory
KW - Delay
KW - distributed databases
KW - distributed query
KW - forward size reduction
KW - intermediate results
KW - Local area networks
KW - network
KW - Parallel algorithms
KW - pipeline N-way join algorithm
KW - pipeline processing
KW - Pipelines
KW - programming theory
KW - Query processing
KW - Relational databases
KW - relational operator
KW - SITES
KW - Workstations
AB - The semijoin has been used as an effective operator in reducing data transmission and processing over a network that allows forward size reduction of relations and intermediate results generated during the processing of a distributed query. The authors propose a relational operator, two-way semijoin, which enhances the semijoin with backward size reduction capability for more cost-effective query processing. A pipeline N-way join algorithm for joining the reduced relations residing on N sites is introduced. The main advantage of this algorithm is that it eliminates the need for transferring and storing intermediate results among the sites. A set of experiments showing that the proposed algorithm outperforms all known conventional join algorithms that generate intermediate results is included.
VL - 3
SN - 1041-4347
CP - 4
M3 - 10.1109/69.109109
ER -