TY - CONF T1 - Lineage processing over correlated probabilistic databases T2 - Proceedings of the 2010 international conference on Management of data Y1 - 2010 A1 - Kanagal,Bhargav A1 - Deshpande, Amol KW - conjunctive queries KW - Indexing KW - junction trees KW - lineage KW - Probabilistic databases AB - In this paper, we address the problem of scalably evaluating conjunctive queries over correlated probabilistic databases containing tuple or attribute uncertainties. Like previous work, we adopt a two-phase approach where we first compute lineages of the output tuples, and then compute the probabilities of the lineage formulas. However, unlike previous work, we allow for arbitrary and complex correlations to be present in the data, captured via a forest of junction trees. We observe that evaluating even read-once (tree-structured) lineages (e.g., those generated by hierarchical conjunctive queries), polynomially computable over tuple-independent probabilistic databases, is #P-complete for lightly correlated probabilistic databases like Markov sequences. We characterize the complexity of exact computation of the probability of the lineage formula on a correlated database using a parameter called lwidth (analogous to the notion of treewidth). For lineages that result in low lwidth, we compute exact probabilities using a novel message passing algorithm, and for lineages that induce large lwidths, we develop approximate Monte Carlo algorithms to estimate the result probabilities. We scale our algorithms to very large correlated probabilistic databases using the previously proposed INDSEP data structure. To mitigate the complexity of lineage evaluation, we develop optimization techniques to process a batch of lineages by sharing computation across formulas, and to exploit any independence relationships that may exist in the data. 
Our experimental study illustrates the benefits of using our algorithms for processing lineage formulas over correlated probabilistic databases. JA - Proceedings of the 2010 international conference on Management of data T3 - SIGMOD '10 PB - ACM CY - New York, NY, USA SN - 978-1-4503-0032-2 UR - http://doi.acm.org/10.1145/1807167.1807241 M3 - 10.1145/1807167.1807241 ER - TY - CONF T1 - Indexing correlated probabilistic databases T2 - Proceedings of the 35th SIGMOD international conference on Management of data Y1 - 2009 A1 - Kanagal,Bhargav A1 - Deshpande, Amol KW - caching KW - Indexing KW - inference queries KW - junction trees KW - Probabilistic databases AB - With large amounts of correlated probabilistic data being generated in a wide range of application domains including sensor networks, information extraction, event detection, etc., effectively managing and querying them has become an important research direction. While there is an extensive body of literature on querying independent probabilistic data, supporting efficient queries over large-scale, correlated databases remains a challenge. In this paper, we develop efficient data structures and indexes for supporting inference and decision support queries over such databases. Our proposed hierarchical data structure is suitable both for in-memory and disk-resident databases. We represent the correlations in the probabilistic database using a junction tree over the tuple-existence or attribute-value random variables, and use tree partitioning techniques to build an index structure over it. We show how to efficiently answer inference and aggregation queries using such an index, resulting in orders of magnitude performance benefits in most cases. In addition, we develop novel algorithms for efficiently keeping the index structure up-to-date as changes (inserts, updates) are made to the probabilistic database. 
We present a comprehensive experimental study illustrating the benefits of our approach to query processing in probabilistic databases. JA - Proceedings of the 35th SIGMOD international conference on Management of data T3 - SIGMOD '09 PB - ACM CY - New York, NY, USA SN - 978-1-60558-551-2 UR - http://doi.acm.org/10.1145/1559845.1559894 M3 - 10.1145/1559845.1559894 ER - TY - CONF T1 - Real-time shape retrieval for robotics using skip Tri-Grams T2 - IEEE/RSJ International Conference on Intelligent Robots and Systems, 2009. IROS 2009 Y1 - 2009 A1 - Li,Yi A1 - Bitsakos,K. A1 - Fermüller, Cornelia A1 - Aloimonos, J. KW - Bullseye retrieval test KW - Clocks KW - closed contour shape retrieval KW - Image retrieval KW - Image segmentation KW - Indexing KW - Information retrieval KW - Intelligent robots KW - Jacobian matrices KW - mobile robot KW - Mobile robots KW - MPEG 7 shape dataset KW - piecewise linear segments KW - Piecewise linear techniques KW - Real time systems KW - real-time shape retrieval KW - robot vision KW - SHAPE KW - shape recognition KW - shape representation KW - skip Tri-Grams KW - Testing AB - The real time requirement is an additional constraint on many intelligent applications in robotics, such as shape recognition and retrieval using a mobile robot platform. In this paper, we present a scalable approach for efficiently retrieving closed contour shapes. The contour of an object is represented by piecewise linear segments. A skip Tri-Gram is obtained by selecting three segments in the clockwise order while allowing a constant number of segments to be "skipped" in between. The main idea is to use skip Tri-Grams of the segments to implicitly encode the distant dependency of the shape. All skip Tri-Grams are used for efficiently retrieving closed contour shapes without pairwise matching feature points from two shapes. The retrieval is at least an order of magnitude faster than other state-of-the-art algorithms. 
We score 80% in the Bullseye retrieval test on the whole MPEG 7 shape dataset. We further test the algorithm using a mobile robot platform in an indoor environment. Eight objects from different viewing directions are used for testing, and we achieve 82% accuracy. JA - IEEE/RSJ International Conference on Intelligent Robots and Systems, 2009. IROS 2009 PB - IEEE SN - 978-1-4244-3803-7 M3 - 10.1109/IROS.2009.5354738 ER - TY - CONF T1 - DiST: fully decentralized indexing for querying distributed multidimensional datasets T2 - Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International Y1 - 2006 A1 - Nam,Beomseok A1 - Sussman, Alan KW - Computer network management KW - distributed multidimensional dataset querying KW - failure recovery KW - fault tolerant computing KW - fully decentralized multidimensional indexing KW - grid computing KW - Indexing KW - large scale distributed resource management KW - Large-scale systems KW - Multidimensional systems KW - Network servers KW - P2P systems KW - Peer to peer computing KW - peer-to-peer computing KW - peer-to-peer systems KW - Publishing KW - Query processing KW - query routing KW - resource allocation KW - Resource management KW - telecommunication network routing KW - wide area networks AB - Grid computing and peer-to-peer (P2P) systems are emerging as new paradigms for managing large scale distributed resources across wide area networks. While grid computing focuses on managing heterogeneous resources and relies on centralized managers for resource and data discovery, P2P systems target scalable, decentralized methods for publishing and searching for data. In large distributed systems, a centralized resource manager is a potential performance bottleneck and decentralization can help avoid this bottleneck, as is done in P2P systems. However, the query functionality provided by most existing P2P systems is very rudimentary, and is not directly applicable to grid resource management. 
In this paper, we propose a fully decentralized multidimensional indexing structure, called DiST, that operates in a fully distributed environment with no centralized control. In DiST, each data server only acquires information about data on other servers from executing and routing queries. We describe the DiST algorithms for maintaining the decentralized network of data servers, including adding and deleting servers, the query routing algorithm, and failure recovery algorithms. We also evaluate the performance of the decentralized scheme against a more structured hierarchical indexing scheme that we have previously shown to perform well in distributed grid environments. JA - Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International PB - IEEE SN - 1-4244-0054-6 M3 - 10.1109/IPDPS.2006.1639280 ER - TY - CONF T1 - Spatial indexing of distributed multidimensional datasets T2 - IEEE International Symposium on Cluster Computing and the Grid, 2005. CCGrid 2005 Y1 - 2005 A1 - Nam,B. A1 - Sussman, Alan KW - centralized global index algorithm KW - centralized index server KW - Computer science KW - database indexing KW - distributed databases KW - distributed multidimensional dataset KW - Educational institutions KW - File servers KW - Indexing KW - Large-scale systems KW - Multidimensional systems KW - Network servers KW - replication protocol KW - replication techniques KW - scalability KW - Sensor systems KW - spatial data structures KW - spatial indexing KW - two-level hierarchical index algorithm KW - wide area networks AB - While declustering methods for distributed multidimensional indexing of large datasets have been researched widely in the past, replication techniques for multidimensional indexes have not been investigated deeply. 
In general, a centralized index server may become the performance bottleneck in a wide area network rather than the data servers, since the index is likely to be accessed more often than any of the datasets in the servers. In this paper, we present two different multidimensional indexing algorithms for a distributed environment - a centralized global index and a two-level hierarchical index. Our experimental results show that the centralized scheme does not scale well for either inserting into or searching the index. In order to improve the scalability of the index server, we have employed a replication protocol for both the centralized and two-level index schemes that allows some inconsistency between replicas without affecting correctness. Our experiments show that the two-level hierarchical index scheme scales better than the non-replicated centralized index for both building and searching the index, but replication can make the centralized index faster than the two-level hierarchical index for searching in some cases. JA - IEEE International Symposium on Cluster Computing and the Grid, 2005. CCGrid 2005 PB - IEEE VL - 2 SN - 0-7803-9074-1 M3 - 10.1109/CCGRID.2005.1558637 ER - TY - CONF T1 - Strategies for exploring large scale data T2 - Parallel Architectures, Algorithms and Networks, 2004. Proceedings. 7th International Symposium on Y1 - 2004 A1 - JaJa, Joseph F. 
KW - asymptotic bounds KW - business data KW - data mining KW - database indexing KW - demographic data KW - Indexing KW - indexing techniques KW - information discovery KW - knowledge discovery KW - large scale data KW - linear space data structures KW - multidimensional objects KW - multidimensional time series KW - parallel processing KW - query processing KW - scientific data KW - search algorithms KW - serial processing KW - simulation data KW - temporal range value querying KW - temporal window KW - time series KW - very large databases AB - We consider the problem of querying large scale multidimensional time series data to discover events of interest, test and validate hypotheses, or to associate temporal patterns with specific events. This type of data currently dominates most other types of available data, and will very likely become even more prevalent in the future given the current trends in collecting time series of business, scientific, demographic, and simulation data. The ability to explore such collections interactively, even at a coarse level, will be critical in discovering the information and knowledge embedded in such collections. We develop indexing techniques and search algorithms to efficiently handle temporal range value querying of multidimensional time series data. Our indexing uses linear space data structures that enable the handling of queries in I/O time that is essentially the same as that of handling a single time slice, assuming the availability of a logarithmic number of processors as a function of the temporal window. A data structure with provably almost optimal asymptotic bounds is also presented for the case when the number of multidimensional objects is relatively small. These techniques improve significantly over standard techniques for either serial or parallel processing, and are evaluated by extensive experimental results that confirm their superior performance. 
JA - Parallel Architectures, Algorithms and Networks, 2004. Proceedings. 7th International Symposium on M3 - 10.1109/ISPAN.2004.1300447 ER - TY - CONF T1 - Improving access to multi-dimensional self-describing scientific datasets T2 - 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid, 2003. Proceedings. CCGrid 2003 Y1 - 2003 A1 - Nam,B. A1 - Sussman, Alan KW - Application software KW - application-specific semantic metadata KW - Bandwidth KW - Computer science KW - database indexing KW - disk I/O bandwidth KW - distributed databases KW - Educational institutions KW - Indexing KW - indexing structures KW - Libraries KW - meta data KW - Middleware KW - multidimensional arrays KW - multidimensional datasets KW - Multidimensional systems KW - NASA KW - NASA remote sensing data KW - Navigation KW - query formulation KW - self-describing scientific data file formats KW - structural metadata KW - very large databases AB - Applications that query into very large multidimensional datasets are becoming more common. Many self-describing scientific data file formats have also emerged, which have structural metadata to help navigate the multi-dimensional arrays that are stored in the files. The files may also contain application-specific semantic metadata. In this paper, we discuss efficient methods for performing searches for subsets of multi-dimensional data objects, using semantic information to build multidimensional indexes, and group data items into properly sized chunks to maximize disk I/O bandwidth. This work is the first step in the design and implementation of a generic indexing library that will work with various high-dimension scientific data file formats containing semantic information about the stored data. 
To validate the approach, we have implemented indexing structures for NASA remote sensing data stored in the HDF format with a specific schema (HDF-EOS), and show the performance improvements that are gained from indexing the datasets, compared to using the existing HDF library for accessing the data. JA - 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid, 2003. Proceedings. CCGrid 2003 PB - IEEE SN - 0-7695-1919-9 M3 - 10.1109/CCGRID.2003.1199366 ER - TY - JOUR T1 - Visualization of large data sets with the Active Data Repository JF - IEEE Computer Graphics and Applications Y1 - 2001 A1 - Kurc, T. A1 - Catalyurek,U. A1 - Chang,Chialin A1 - Sussman, Alan A1 - Saltz, J. KW - active data repository KW - ADR runtime system KW - Algorithm design and analysis KW - application-specific processing KW - C++ KW - data mining KW - data retrieval KW - data visualisation KW - Data visualization KW - distributed-memory parallel machines KW - Indexing KW - Information retrieval KW - isosurface rendering KW - Isosurfaces KW - large data sets KW - large-scale multidimensional data KW - Memory management KW - modular services KW - out-of-core data sets KW - parallel machine KW - Parallel machines KW - Partitioning algorithms KW - ray-casting-based volume rendering KW - Rendering (computer graphics) KW - Runtime KW - software libraries KW - storage management AB - We implement ray-casting-based volume rendering and isosurface rendering methods using the Active Data Repository (ADR) for visualizing out-of-core data sets. We have developed the ADR object-oriented framework to provide support for applications that employ range queries with user-defined mapping and aggregation operations on large-scale multidimensional data. ADR targets distributed-memory parallel machines with one or more disks attached to each node. It is designed as a set of modular services implemented in C++, which can be customized for application-specific processing. 
The ADR runtime system supports common operations such as memory management, data retrieval, and scheduling of processing across a parallel machine. VL - 21 SN - 0272-1716 CP - 4 M3 - 10.1109/38.933521 ER - TY - CONF T1 - Developing the next generation of Earth science data systems: the Global Land Cover Facility T2 - Geoscience and Remote Sensing Symposium, 1999. IGARSS '99 Proceedings. IEEE 1999 International Y1 - 1999 A1 - Lindsay,F.E. A1 - Townshend,J.R.G. A1 - JaJa, Joseph F. A1 - Humphries,J. A1 - Plaisant, Catherine A1 - Shneiderman, Ben KW - Computer architecture KW - data archiving KW - data distribution system KW - Data systems KW - Distributed computing KW - Earth science data products KW - Earth science data system KW - ESIP KW - geographic information system KW - geographic information systems KW - Geography KW - geophysical measurement technique KW - geophysical signal processing KW - geophysical techniques KW - Geoscience KW - GIS KW - GLCF KW - Global Land Cover Facility KW - High performance computing KW - Indexing KW - information service KW - Information services KW - Institute for Advanced Computer Studies KW - land cover KW - NASA KW - next generation KW - PACS KW - Remote sensing KW - terrain mapping KW - UMIACS KW - University of Maryland KW - User interfaces KW - web-based interface AB - A recent initiative by NASA has resulted in the formation of a federation of Earth science data partners. These Earth Science Information Partners (ESIPs) have been tasked with creating novel Earth science data products and services as well as distributing new and existing data sets to the Earth science community and the general public. The University of Maryland established its ESIP activities with the creation of the Global Land Cover Facility (GLCF). 
This joint effort of the Institute for Advanced Computer Studies (UMIACS) and the Department of Geography has developed an operational data archiving and distribution system aimed at advancing current land cover research efforts. The success of the GLCF is tied closely to assessing user needs as well as to the timely delivery of data products to the research community. This paper discusses the development and implementation of a web-based interface that allows users to query the authors' data holdings and perform user-requested processing tasks on demand. The GLCF takes advantage of a scalable, high performance computing architecture for the manipulation of very large remote sensing data sets and the rapid spatial indexing of multiple format data types. The user interface has been developed with the cooperation of the Human-Computer Interaction Laboratory (HCIL) and demonstrates advances in spatial and temporal querying tools as well as the ability to overlay multiple raster and vector data sets. Their work provides one perspective concerning how critical earth science data may be handled in the near future by a coalition of distributed data centers. JA - Geoscience and Remote Sensing Symposium, 1999. IGARSS '99 Proceedings. IEEE 1999 International PB - IEEE VL - 1 SN - 0-7803-5207-6 M3 - 10.1109/IGARSS.1999.773583 ER -