%0 Conference Paper %B 2011 IEEE 27th International Conference on Data Engineering Workshops (ICDEW) %D 2011 %T Declarative analysis of noisy information networks %A Moustafa,W. E %A Namata,G. %A Deshpande, Amol %A Getoor, Lise %K Cleaning %K Data analysis %K data cleaning operations %K data management system %K data mining %K Databases %K Datalog %K declarative analysis %K graph structure %K information networks %K Noise measurement %K noisy information networks %K Prediction algorithms %K semantics %K Syntactics %X There is a growing interest in methods for analyzing data describing networks of all types, including information, biological, physical, and social networks. Typically the data describing these networks is observational, and thus noisy and incomplete; it is often at the wrong level of fidelity and abstraction for meaningful data analysis. This has resulted in a growing body of work on extracting, cleaning, and annotating network data. Unfortunately, much of this work is ad hoc and domain-specific. In this paper, we present the architecture of a data management system that enables efficient, declarative analysis of large-scale information networks. We identify a set of primitives to support the extraction and inference of a network from observational data, and describe a framework that enables a network analyst to easily implement and combine new extraction and analysis techniques, and efficiently apply them to large observation networks. The key insight behind our approach is to decouple, to the extent possible, (a) the operations that require traversing the graph structure (typically the computationally expensive step), from (b) the operations that do the modification and update of the extracted network. We present an analysis language based on Datalog, and show how to use it to cleanly achieve such decoupling. We briefly describe our prototype system that supports these abstractions. 
We include a preliminary performance evaluation of the system and show that our approach scales well and can efficiently handle a wide spectrum of data cleaning operations on network data. %B 2011 IEEE 27th International Conference on Data Engineering Workshops (ICDEW) %I IEEE %P 106 - 111 %8 2011/04/11/16 %@ 978-1-4244-9195-7 %G eng %R 10.1109/ICDEW.2011.5767619 %0 Conference Paper %B 2010 IEEE Symposium on Visual Analytics Science and Technology (VAST) %D 2010 %T The state of visual analytics: Views on what visual analytics is and where it is going %A May,R. %A Hanrahan,P. %A Keim,D. A %A Shneiderman, Ben %A Card,S. %K analytical reasoning %K Data analysis %K data mining %K data visualisation %K European VisMaster program %K interactive visual interfaces %K knowledge discovery %K United States %K visual analytics %K VisWeek community %X In the 2005 publication "Illuminating the Path" visual analytics was defined as "the science of analytical reasoning facilitated by interactive visual interfaces." A lot of work has been done in visual analytics over the intervening five years. While visual analytics started in the United States with a focus on security, it is now a worldwide research agenda with a broad range of application domains. This is evidenced by efforts like the European VisMaster program and the upcoming Visual Analytics and Knowledge Discovery (VAKD) workshop, just to name two. There are still questions concerning where and how visual analytics fits in the large body of research and applications represented by the VisWeek community. This panel will present distinct viewpoints on what visual analytics is and its role in understanding complex information in a complex world. The goal of this panel is to engender discussion from the audience on the emergence and continued advancement of visual analytics and its role relative to fields of related research.
Four distinguished panelists will provide their perspective on visual analytics focusing on what it is, what it should be, and thoughts about a development path between these two states. The purpose of the presentations is not to give a critical review of the literature but rather to give a review on the field and to provide a contextual perspective based on the panelists' years of experience and accumulated knowledge. %B 2010 IEEE Symposium on Visual Analytics Science and Technology (VAST) %I IEEE %P 257 - 259 %8 2010/10/25/26 %@ 978-1-4244-9488-0 %G eng %R 10.1109/VAST.2010.5649078 %0 Journal Article %J IEEE Computer Graphics and Applications %D 2009 %T Integrating Statistics and Visualization for Exploratory Power: From Long-Term Case Studies to Design Guidelines %A Perer,A. %A Shneiderman, Ben %K case studies %K Control systems %K Data analysis %K data mining %K data visualisation %K Data visualization %K data-mining %K design guidelines %K Employment %K exploration %K Filters %K Guidelines %K Information Visualization %K insights %K laboratory-based controlled experiments %K Performance analysis %K social network analysis %K Social network services %K social networking (online) %K social networks %K SocialAction %K statistical analysis %K Statistics %K visual analytics %K visual-analytics systems %K Visualization %X Evaluating visual-analytics systems is challenging because laboratory-based controlled experiments might not effectively represent analytical tasks. One such system, SocialAction, integrates statistics and visualization in an interactive exploratory tool for social network analysis. This article describes results from long-term case studies with domain experts and extends established design goals for information visualization.
%B IEEE Computer Graphics and Applications %V 29 %P 39 - 51 %8 2009/06//May %@ 0272-1716 %G eng %N 3 %R 10.1109/MCG.2009.44 %0 Journal Article %J IEEE Transactions on Visualization and Computer Graphics %D 2009 %T Temporal Summaries: Supporting Temporal Categorical Searching, Aggregation and Comparison %A Wang,T. D %A Plaisant, Catherine %A Shneiderman, Ben %A Spring, Neil %A Roseman,D. %A Marchand,G. %A Mukherjee,V. %A Smith,M. %K Aggregates %K Collaborative work %K Computational Biology %K Computer Graphics %K Data analysis %K data visualisation %K Data visualization %K Databases, Factual %K Displays %K Event detection %K Filters %K Heparin %K History %K Human computer interaction %K Human-computer interaction %K HUMANS %K Information Visualization %K Interaction design %K interactive visualization technique %K Medical Records Systems, Computerized %K Pattern Recognition, Automated %K Performance analysis %K Springs %K temporal categorical data visualization %K temporal categorical searching %K temporal ordering %K temporal summaries %K Thrombocytopenia %K Time factors %X When analyzing thousands of event histories, analysts often want to see the events as an aggregate to detect insights and generate new hypotheses about the data. An analysis tool must emphasize both the prevalence and the temporal ordering of these events. Additionally, the analysis tool must also support flexible comparisons to allow analysts to gather visual evidence. In a previous work, we introduced align, rank, and filter (ARF) to accentuate temporal ordering. In this paper, we present temporal summaries, an interactive visualization technique that highlights the prevalence of event occurrences. Temporal summaries dynamically aggregate events in multiple granularities (year, month, week, day, hour, etc.) for the purpose of spotting trends over time and comparing several groups of records. They provide affordances for analysts to perform temporal range filters. 
We demonstrate the applicability of this approach in two extensive case studies with analysts who applied temporal summaries to search, filter, and look for patterns in electronic health records and academic records. %B IEEE Transactions on Visualization and Computer Graphics %V 15 %P 1049 - 1056 %8 2009/12//Nov %@ 1077-2626 %G eng %N 6 %R 10.1109/TVCG.2009.187 %0 Conference Paper %B IEEE 24th International Conference on Data Engineering, 2008. ICDE 2008 %D 2008 %T Online Filtering, Smoothing and Probabilistic Modeling of Streaming data %A Kanagal,B. %A Deshpande, Amol %K Data analysis %K data streaming %K declarative query %K dynamic probabilistic model %K Filtering %K Global Positioning System %K hidden Markov models %K Monitoring %K Monte Carlo methods %K Noise generators %K Noise measurement %K online filtering %K particle filter %K particle filtering (numerical methods) %K probabilistic database view %K probability %K Real time systems %K real-time application %K relational database system %K Relational databases %K sequential Monte Carlo algorithm %K Smoothing methods %K SQL %X In this paper, we address the problem of extending a relational database system to facilitate efficient real-time application of dynamic probabilistic models to streaming data. We use the recently proposed abstraction of model-based views for this purpose, by allowing users to declaratively specify the model to be applied, and by presenting the output of the models to the user as a probabilistic database view. We support declarative querying over such views using an extended version of SQL that allows for querying probabilistic data. Underneath we use particle filters, a class of sequential Monte Carlo algorithms, to represent the present and historical states of the model as sets of weighted samples (particles) that are kept up-to-date as new data arrives. 
We develop novel techniques to convert the queries on the model-based view directly into queries over particle tables, enabling highly efficient query processing. Finally, we present experimental evaluation of our prototype implementation over several synthetic and real datasets, that demonstrates the feasibility of online modeling of streaming data using our system and establishes the advantages of tight integration between dynamic probabilistic models and databases. %B IEEE 24th International Conference on Data Engineering, 2008. ICDE 2008 %I IEEE %P 1160 - 1169 %8 2008/04/07/12 %@ 978-1-4244-1836-7 %G eng %R 10.1109/ICDE.2008.4497525 %0 Conference Paper %D 2007 %T A Comparison between Internal and External Malicious Traffic %A Michel Cukier %A Panjwani,S. %K Computer networks %K Data analysis %K external traffic %K honeypot target computers %K internal traffic %K malicious traffic data %K security of data %K user activity profile %X This paper empirically compares malicious traffic originating inside an organization (i.e., internal traffic) with malicious traffic originating outside an organization (i.e., external traffic). Two honeypot target computers were deployed to collect malicious traffic data over a period of fifteen weeks. In the first study we showed that there was a weak correlation between internal and external traffic based on the number of malicious connections. Since the type of malicious activity is linked to the port that was targeted, we focused on the most frequently targeted ports. We observed that internal malicious traffic often contained different malicious content compared to that of external traffic. In the third study, we discovered that the volume of malicious traffic was linked to the day of the week. We showed that internal and external malicious activities differ: where the external malicious activity is quite stable over the week, the internal traffic varied as a function of the users' activity profile. 
%P 109 - 114 %8 2007/// %G eng %R 10.1109/ISSRE.2007.32 %0 Conference Paper %B X-RAY ABSORPTION FINE STRUCTURE - XAFS13: 13th International Conference %D 2007 %T Geometry and Charge State of Mixed‐Ligand Au13 Nanoclusters %A Frenkel, A. I. %A Menard, L. D. %A Northrup, P. %A Rodriguez, J. A. %A Zypman, F. %A Dana Dachman-Soled %A Gao, S.-P. %A Xu, H. %A Yang, J. C. %A Nuzzo, R. G. %K Atom surface interactions %K Charge transfer %K Data analysis %K Extended X-ray absorption fine structure spectroscopy %K Gold %K nanoparticles %K Scanning transmission electron microscopy %K Surface strains %K Total energy calculations %K X-ray absorption near edge structure %X The integration of synthetic, experimental and theoretical tools into a self‐consistent data analysis methodology allowed us to develop unique new levels of detail in nanoparticle characterization. We describe our methods using an example of Au 13 monolayer‐protected clusters (MPCs), synthesized by ligand exchange methods. The combination of atom counting methods of scanning transmission electron microscopy and Au L3‐edge EXAFS allowed us to characterize these clusters as icosahedral, with surface strain reduced from 5% (as in ideal, regular icosahedra) to 3%, due to the interaction with ligands. Charge transfer from Au to the thiol and phosphine ligands was evidenced by S and P K‐edge XANES. A comparison of total energies of bare clusters of different geometries was performed by equivalent crystal theory calculations. %B X-RAY ABSORPTION FINE STRUCTURE - XAFS13: 13th International Conference %I AIP Publishing %V 882 %P 749 - 751 %8 2007/02/02/ %G eng %U http://scitation.aip.org/content/aip/proceeding/aipcp/10.1063/1.2644652 %0 Journal Article %J IEEE Transactions on Visualization and Computer Graphics %D 2006 %T Balancing Systematic and Flexible Exploration of Social Networks %A Perer,A. 
%A Shneiderman, Ben %K Aggregates %K algorithms %K attribute ranking %K Cluster Analysis %K Computer Graphics %K Computer simulation %K Coordinate measuring machines %K coordinated views %K Data analysis %K data visualisation %K Data visualization %K exploratory data analysis %K Filters %K Gain measurement %K graph theory %K Graphical user interfaces %K Information Storage and Retrieval %K interactive graph visualization %K matrix algebra %K matrix overview %K Models, Biological %K Navigation %K network visualization %K Pattern analysis %K Population Dynamics %K Social Behavior %K social network analysis %K Social network services %K social networks %K social sciences computing %K Social Support %K SocialAction %K software %K statistical analysis %K statistical methods %K User-Computer Interface %X Social network analysis (SNA) has emerged as a powerful method for understanding the importance of relationships in networks. However, interactive exploration of networks is currently challenging because: (1) it is difficult to find patterns and comprehend the structure of networks with many nodes and links, and (2) current systems are often a medley of statistical methods and overwhelming visual output which leaves many analysts uncertain about how to explore in an orderly manner. This results in exploration that is largely opportunistic. Our contributions are techniques to help structural analysts understand social networks more effectively. We present SocialAction, a system that uses attribute ranking and coordinated views to help users systematically examine numerous SNA measures. Users can (1) flexibly iterate through visualizations of measures to gain an overview, filter nodes, and find outliers, (2) aggregate networks using link structure, find cohesive subgroups, and focus on communities of interest, and (3) untangle networks by viewing different link types separately, or find patterns across different link types using a matrix overview. 
For each operation, a stable node layout is maintained in the network visualization so users can make comparisons. SocialAction offers analysts a strategy beyond opportunism, as it provides systematic, yet flexible, techniques for exploring social networks %B IEEE Transactions on Visualization and Computer Graphics %V 12 %P 693 - 700 %8 2006/10//Sept %@ 1077-2626 %G eng %N 5 %R 10.1109/TVCG.2006.122 %0 Journal Article %J IEEE Transactions on Visualization and Computer Graphics %D 2006 %T Knowledge discovery in high-dimensional data: case studies and a user survey for the rank-by-feature framework %A Seo,Jinwook %A Shneiderman, Ben %K case study %K Computer aided software engineering %K Computer Society %K Data analysis %K data mining %K data visualisation %K Data visualization %K database management systems %K e-mail user survey %K Genomics %K Helium %K Hierarchical Clustering Explorer %K hierarchical clustering explorer. %K high-dimensional data %K Histograms %K Information visualization evaluation %K interactive systems %K interactive tool %K knowledge discovery %K multivariate data %K Rank-by-feature framework %K Scattering %K Testing %K user interface %K User interfaces %K user survey %K visual analytic tools %K visual analytics %K visualization tools %X Knowledge discovery in high-dimensional data is a challenging enterprise, but new visual analytic tools appear to offer users remarkable powers if they are ready to learn new concepts and interfaces. Our three-year effort to develop versions of the hierarchical clustering explorer (HCE) began with building an interactive tool for exploring clustering results. It expanded, based on user needs, to include other potent analytic and visualization tools for multivariate data, especially the rank-by-feature framework. Our own successes using HCE provided some testimonial evidence of its utility, but we felt it necessary to get beyond our subjective impressions. 
This paper presents an evaluation of the hierarchical clustering explorer (HCE) using three case studies and an e-mail user survey (n=57) to focus on skill acquisition with the novel concepts and interface for the rank-by-feature framework. Knowledgeable and motivated users in diverse fields provided multiple perspectives that refined our understanding of strengths and weaknesses. A user survey confirmed the benefits of HCE, but gave less guidance about improvements. Both evaluations suggested improved training methods %B IEEE Transactions on Visualization and Computer Graphics %V 12 %P 311 - 322 %8 2006/06//May %@ 1077-2626 %G eng %N 3 %R 10.1109/TVCG.2006.50 %0 Conference Paper %D 2006 %T A Statistical Analysis of Attack Data to Separate Attacks %A Michel Cukier %A Berthier,R. %A Panjwani,S. %A Tan,S. %K attack data statistical analysis %K attack separation %K computer crime %K Data analysis %K data mining %K ICMP scans %K K-Means algorithm %K pattern clustering %K port scans %K statistical analysis %K vulnerability scans %X This paper analyzes malicious activity collected from a test-bed, consisting of two target computers dedicated solely to the purpose of being attacked, over a 109 day time period. We separated port scans, ICMP scans, and vulnerability scans from the malicious activity. In the remaining attack data, over 78% (i.e., 3,677 attacks) targeted port 445, which was then statistically analyzed. The goal was to find the characteristics that most efficiently separate the attacks. First, we separated the attacks by analyzing their messages. Then we separated the attacks by clustering characteristics using the K-Means algorithm. 
The comparison between the analysis of the messages and the outcome of the K-Means algorithm showed that 1) the mean of the distributions of packets, bytes and message lengths over time are poor characteristics to separate attacks and 2) the number of bytes, the mean of the distribution of bytes and message lengths as a function of the number of packets are the best characteristics for separating attacks %P 383 - 392 %8 2006/06// %G eng %R 10.1109/DSN.2006.9 %0 Journal Article %J Computing in Science & Engineering %D 2006 %T A telescope for high-dimensional data %A Shneiderman, Ben %K Chemistry %K Data analysis %K data mining %K data visualisation %K degenerative disease %K Degenerative diseases %K diseases %K Finance %K genetic process %K Genetics %K high-dimensional data %K interactive data analysis %K medical computing %K Meteorology %K muscle %K muscle development %K Muscles %K muscular dystrophy %K Rank-by-feature framework %K software %K software libraries %K software tools %K Telescopes %K visual data %K visual data analysis %X Muscular dystrophy is a degenerative disease that destroys muscles and ultimately kills its victims. Researchers worldwide are racing to find a cure by trying to uncover the genetic processes that cause it. Given that a key process is muscle development, researchers at a consortium of 10 institutions are studying 1,000 men and women, ages 18 to 40 years, to see how their muscles enlarge with exercise. The 150 variables collected for each participant will make this data analysis task challenging for users of traditional statistical software tools. However, a new approach to visual data analysis is helping these researchers speed up their work. At the University of Maryland's Human-Computer Interaction Laboratory, we developed an interactive approach to let researchers explore high-dimensional data in an orderly manner, focusing on specific features one at a time.
The rank-by-feature framework lets them adjust controls to specify what they're looking for, and then, with only a glance, they can spot strong relationships among variables, find tight data clusters, or identify unexpected gaps. Sometimes surprising outliers invite further study as to whether they represent errors or an unusual outcome. Similar data analysis problems come up in meteorology, finance, chemistry, and other sciences in which complex relationships among many variables govern outcomes. The rank-by-feature framework could be helpful to many researchers, engineers, and managers because they can then steer their analyses toward the action %B Computing in Science & Engineering %V 8 %P 48 - 53 %8 2006/04//March %@ 1521-9615 %G eng %N 2 %R 10.1109/MCSE.2006.21 %0 Conference Paper %B Visual Analytics Science And Technology, 2006 IEEE Symposium On %D 2006 %T A Visual Interface for Multivariate Temporal Data: Finding Patterns of Events across Multiple Histories %A Fails,J. A %A Karlson,A. %A Shahamat,L. %A Shneiderman, Ben %K ball-and-chain visualization %K Chromium %K Computer science %K Data analysis %K data visualisation %K Data visualization %K Database languages %K event pattern discovery %K Graphical user interfaces %K History %K Information Visualization %K Medical treatment %K multivariate temporal data %K Pattern analysis %K pattern recognition %K PatternFinder integrated interface %K Query processing %K query visualization %K result-set visualization %K Spatial databases %K tabular visualization %K temporal pattern discovery %K temporal pattern searching %K Temporal query %K user interface %K User interfaces %K visual databases %K visual interface %X Finding patterns of events over time is important in searching patient histories, Web logs, news stories, and criminal activities. 
This paper presents PatternFinder, an integrated interface for query and result-set visualization for search and discovery of temporal patterns within multivariate and categorical data sets. We define temporal patterns as sequences of events with inter-event time spans. PatternFinder allows users to specify the attributes of events and time spans to produce powerful pattern queries that are difficult to express with other formalisms. We characterize the range of queries PatternFinder supports as users vary the specificity at which events and time spans are defined. PatternFinder's query capabilities together with coupled ball-and-chain and tabular visualizations enable users to effectively query, explore and analyze event patterns both within and across data entities (e.g. patient histories, terrorist groups, Web logs, etc.) %B Visual Analytics Science And Technology, 2006 IEEE Symposium On %I IEEE %P 167 - 174 %8 2006/11/31/Oct. %@ 1-4244-0591-2 %G eng %R 10.1109/VAST.2006.261421 %0 Conference Paper %B Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International %D 2005 %T Comparing the Performance of High-Level Middleware Systems in Shared and Distributed Memory Parallel Environments %A Kim,Jik-Soo %A Andrade,H. %A Sussman, Alan %K Application software %K Computer science %K Computer vision %K Data analysis %K Distributed computing %K distributed computing environment %K distributed memory parallel environment %K distributed shared memory systems %K Educational institutions %K high-level middleware system %K I/O-intensive data analysis application %K Libraries %K Middleware %K parallel computing environment %K parallel library support %K parallel memories %K programming language %K programming languages %K Runtime environment %K shared memory parallel environment %K Writing %X The utilization of toolkits for writing parallel and/or distributed applications has been shown to greatly enhance developers' productivity.
Such an approach hides many of the complexities associated with writing these applications, rather than relying solely on programming language aids and parallel library support, such as MPI or PVM. In this work, we evaluate three different middleware systems that have been used to implement a computation and I/O-intensive data analysis application from the domain of computer vision. This study shows the benefits and overheads associated with each of the middleware systems, in different homogeneous computational environments and with different workloads. Our results lead the way toward being able to make better decisions for tuning the application environment, for selecting the appropriate middleware, and also for designing more powerful middleware systems to efficiently build and run highly complex applications in both parallel and distributed computing environments. %B Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International %I IEEE %P 30 - 30 %8 2005/04// %@ 0-7695-2312-9 %G eng %R 10.1109/IPDPS.2005.144 %0 Conference Paper %B Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International %D 2005 %T High Performance Communication between Parallel Programs %A Jae-Yong Lee %A Sussman, Alan %K Adaptive arrays %K Analytical models %K Chaotic communication %K Computational modeling %K Computer science %K Data analysis %K data distribution %K Educational institutions %K high performance communication %K image data analysis %K image resolution %K inter-program communication patterns %K InterComm %K Libraries %K Message passing %K parallel languages %K parallel libraries %K parallel programming %K parallel programs %K performance evaluation %K Wind %X We present algorithms for high performance communication between message-passing parallel programs, and evaluate the algorithms as implemented in InterComm. 
InterComm is a framework to couple parallel programs in the presence of complex data distributions within a coupled application. Multiple parallel libraries and languages may be used in the different programs of a single coupled application. The ability to couple such programs is required in many emerging application areas, such as complex simulations that model physical phenomena at multiple scales and resolutions, and image data analysis applications. We describe the new algorithms we have developed for computing inter-program communication patterns. We present experimental results showing the performance of various algorithmic tradeoffs, and also compare performance against an earlier system. %B Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International %I IEEE %P 177b - 177b %8 2005/04// %@ 0-7695-2312-9 %G eng %R 10.1109/IPDPS.2005.243 %0 Conference Paper %B 16th International Conference on Scientific and Statistical Database Management, 2004. Proceedings %D 2004 %T Exploiting multiple paths to express scientific queries %A Lacroix,Z. %A Moths,T. %A Parekh,K. %A Raschid, Louiqa %A Vidal,M. -E %K access protocols %K biology computing %K BioNavigation system %K complex queries %K Costs %K Data analysis %K data handling %K Data visualization %K data warehouse %K Data warehouses %K Databases %K diseases %K distributed databases %K hard-coded scripts %K information resources %K Information retrieval %K mediation-based data integration system %K multiple paths %K query evaluation %K Query processing %K scientific data collection %K scientific discovery %K scientific information %K scientific information systems %K scientific object of interest %K scientific queries %K sequences %K Web resources %X The purpose of this demonstration is to present the main features of the BioNavigation system. Scientific data collection needed in various stages of scientific discovery is typically performed manually.
For each scientific object of interest (e.g., a gene, a sequence), scientists query a succession of Web resources following links between retrieved entries. Each of the steps provides part of the intended characterization of the scientific object. This process is sometimes partially supported by hard-coded scripts or complex queries that will be evaluated by a mediation-based data integration system or against a data warehouse. These approaches fail in guiding the scientists during the collection process. In contrast, the BioNavigation approach presented in the paper provides the scientists with information on the available alternative resources, their provenance, and the costs of data collection. The BioNavigation system enhances a mediation-based integration system and provides scientists with support for the following: to ask queries at a high conceptual level; to visualize the multiple alternative resources that may be exploited to execute their data collection queries; to choose the final execution path to evaluate their queries. %B 16th International Conference on Scientific and Statistical Database Management, 2004. Proceedings %I IEEE %P 357 - 360 %8 2004/06/21/23 %@ 0-7695-2146-0 %G eng %R 10.1109/SSDM.2004.1311231 %0 Conference Paper %B IEEE Symposium on Information Visualization, 2004. INFOVIS 2004 %D 2004 %T A Rank-by-Feature Framework for Unsupervised Multidimensional Data Exploration Using Low Dimensional Projections %A Seo,J. 
%A Shneiderman, Ben %K axis-parallel projections %K boxplot %K color-coded lower-triangular matrix %K computational complexity %K computational geometry %K Computer displays %K Computer science %K Computer vision %K Data analysis %K data mining %K data visualisation %K Data visualization %K Displays %K dynamic query %K Educational institutions %K exploratory data analysis %K feature detection %K feature detection/selection %K Feature extraction %K feature selection %K graph theory %K graphical displays %K histogram %K Information Visualization %K interactive systems %K Laboratories %K Multidimensional systems %K Principal component analysis %K rank-by-feature prism %K scatterplot %K statistical analysis %K statistical graphics %K statistical graphs %K unsupervised multidimensional data exploration %K very large databases %X Exploratory analysis of multidimensional data sets is challenging because of the difficulty in comprehending more than three dimensions. Two fundamental statistical principles for the exploratory analysis are (1) to examine each dimension first and then find relationships among dimensions, and (2) to try graphical displays first and then find numerical summaries (D.S. Moore, 1999). We implement these principles in a novel conceptual framework called the rank-by-feature framework. In the framework, users can choose a ranking criterion interesting to them and sort 1D or 2D axis-parallel projections according to the criterion. We introduce the rank-by-feature prism that is a color-coded lower-triangular matrix that guides users to desired features. Statistical graphs (histogram, boxplot, and scatterplot) and information visualization techniques (overview, coordination, and dynamic query) are combined to help users effectively traverse 1D and 2D axis-parallel projections, and finally to help them interactively find interesting features %B IEEE Symposium on Information Visualization, 2004. 
INFOVIS 2004 %I IEEE %P 65 - 72 %8 2004/// %@ 0-7803-8779-3 %G eng %R 10.1109/INFVIS.2004.3 %0 Journal Article %J Computer %D 2002 %T Interactively exploring hierarchical clustering results [gene identification] %A Seo,Jinwook %A Shneiderman, Ben %K algorithmic methods %K arrays %K Bioinformatics %K biological data sets %K biology computing %K Data analysis %K data mining %K data visualisation %K Data visualization %K DNA %K Fluorescence %K gene functions %K gene identification %K gene profiles %K Genetics %K Genomics %K Hierarchical Clustering Explorer %K hierarchical systems %K interactive exploration %K interactive information visualization tool %K interactive systems %K Large screen displays %K meaningful cluster identification %K metrics %K microarray data analysis %K pattern clustering %K pattern extraction %K Process control %K Sensor arrays %K sequenced genomes %K Tiles %X To date, work in microarrays, sequenced genomes and bioinformatics has focused largely on algorithmic methods for processing and manipulating vast biological data sets. Future improvements will likely provide users with guidance in selecting the most appropriate algorithms and metrics for identifying meaningful clusters-interesting patterns in large data sets, such as groups of genes with similar profiles. Hierarchical clustering has been shown to be effective in microarray data analysis for identifying genes with similar profiles and thus possibly with similar functions. Users also need an efficient visualization tool, however, to facilitate pattern extraction from microarray data sets. The Hierarchical Clustering Explorer integrates four interactive features to provide information visualization techniques that allow users to control the processes and interact with the results. 
Thus, hybrid approaches that combine powerful algorithms with interactive visualization tools will join the strengths of fast processors with the detailed understanding of domain experts %B Computer %V 35 %P 80 - 86 %8 2002/07// %@ 0018-9162 %G eng %N 7 %R 10.1109/MC.2002.1016905 %0 Conference Paper %B 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid, 2002 %D 2002 %T Multiple Query Optimization for Data Analysis Applications on Clusters of SMPs %A Andrade,H. %A Kurc, T. %A Sussman, Alan %A Saltz, J. %K Aggregates %K Application software %K Bandwidth %K Data analysis %K Data structures %K Delay %K Query processing %K scheduling %K Subcontracting %K Switched-mode power supply %X This paper is concerned with the efficient execution of multiple query workloads on a cluster of SMPs. We target applications that access and manipulate large scientific datasets. Queries in these applications involve user-defined processing operations and distributed data structures to hold intermediate and final results. Our goal is to implement system components to leverage previously computed query results and to effectively utilize processing power and aggregated I/O bandwidth on SMP nodes so that both single queries and multi-query batches can be efficiently executed. %B 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid, 2002 %I IEEE %P 154 - 154 %8 2002/05/21/24 %@ 0-7695-1582-7 %G eng %R 10.1109/CCGRID.2002.1017123 %0 Conference Paper %B Parallel and Distributed Processing Symposium., Proceedings International, IPDPS 2002, Abstracts and CD-ROM %D 2002 %T Scheduling multiple data visualization query workloads on a shared memory machine %A Andrade,H. %A Kurc, T. %A Sussman, Alan %A Saltz, J. 
%K Atomic force microscopy %K Biomedical informatics %K Computer science %K Data analysis %K data visualisation %K Data visualization %K datasets %K deductive databases %K digitized microscopy image browsing %K directed graph %K directed graphs %K dynamic query scheduling model %K Educational institutions %K high workloads %K image database %K limited resources %K multiple data visualization query workloads %K multiple query optimization %K performance %K priority queue %K Processor scheduling %K Query processing %K query ranking %K Relational databases %K scheduling %K shared memory machine %K shared memory systems %K Virtual Microscope %K visual databases %X Query scheduling plays an important role when systems are faced with limited resources and high workloads. It becomes even more relevant for servers applying multiple query optimization techniques to batches of queries, in which portions of datasets as well as intermediate results are maintained in memory to speed up query evaluation. We present a dynamic query scheduling model based on a priority queue implementation using a directed graph and a strategy for ranking queries. We examine the relative performance of several ranking strategies on a shared-memory machine using two different versions of an application, called the Virtual Microscope, for browsing digitized microscopy images %B Parallel and Distributed Processing Symposium., Proceedings International, IPDPS 2002, Abstracts and CD-ROM %I IEEE %P 11 - 18 %8 2002/// %@ 0-7695-1573-8 %G eng %R 10.1109/IPDPS.2002.1015482 %0 Journal Article %J Parallel Computing %D 2001 %T Distributed processing of very large datasets with DataCutter %A Beynon,Michael D. 
%A Kurc,Tahsin %A Catalyurek,Umit %A Chang,Chialin %A Sussman, Alan %A Saltz,Joel %K Component architectures %K Data analysis %K Distributed computing %K Multi-dimensional datasets %K Runtime systems %X We describe a framework, called DataCutter, that is designed to provide support for subsetting and processing of datasets in a distributed and heterogeneous environment. We illustrate the use of DataCutter with several data-intensive applications from diverse fields, and present experimental results. %B Parallel Computing %V 27 %P 1457 - 1478 %8 2001/10// %@ 0167-8191 %G eng %U http://www.sciencedirect.com/science/article/pii/S0167819101000990 %N 11 %R 10.1016/S0167-8191(01)00099-0 %0 Conference Paper %B Proceedings of the IEEE 2nd International Symposium on Bioinformatics and Bioengineering Conference, 2001 %D 2001 %T Optimized seamless integration of biomolecular data %A Eckman,B. A %A Lacroix,Z. %A Raschid, Louiqa %K analysis %K Bioinformatics %K biology computing %K cost based knowledge %K Costs %K Data analysis %K data mining %K data visualisation %K Data visualization %K Data warehouses %K decision support %K digital library %K Educational institutions %K information resources %K Internet %K low cost query evaluation plans %K Mediation %K meta data %K metadata %K molecular biophysics %K multiple local heterogeneous data sources %K multiple remote heterogeneous data sources %K optimized seamless biomolecular data integration %K scientific discovery %K scientific information systems %K semantic knowledge %K software libraries %K visual databases %K Visualization %X Today, scientific data is inevitably digitized, stored in a variety of heterogeneous formats, and is accessible over the Internet. Scientists need to access an integrated view of multiple remote or local heterogeneous data sources. They then integrate the results of complex queries and apply further analysis and visualization to support the task of scientific discovery. 
Building a digital library for scientific discovery requires accessing and manipulating data extracted from flat files or databases, documents retrieved from the Web, as well as data that is locally materialized in warehouses or is generated by software. We consider several tasks to provide optimized and seamless integration of biomolecular data. Challenges to be addressed include capturing and representing source capabilities; developing a methodology to acquire and represent metadata about source contents and access costs; and decision support to select sources and capabilities using cost based and semantic knowledge, and generating low cost query evaluation plans %B Proceedings of the IEEE 2nd International Symposium on Bioinformatics and Bioengineering Conference, 2001 %I IEEE %P 23 - 32 %8 2001/11/04/6 %@ 0-7695-1423-5 %G eng %R 10.1109/BIBE.2001.974408 %0 Conference Paper %B 2001 IEEE Symposium on Security and Privacy, 2001. S&P 2001. Proceedings %D 2001 %T A trend analysis of exploitations %A Browne,H. K %A Arbaugh, William A. %A McHugh,J. %A Fithen,W. L %K Computer science %K computer security exploits %K Data analysis %K data mining %K Educational institutions %K exploitations %K Performance analysis %K Predictive models %K Regression analysis %K Risk management %K security of data %K software engineering %K system intrusions %K System software %K trend analysis %K vulnerabilities %K vulnerability exploitation %X We have conducted an empirical study of a number of computer security exploits and determined that the rates at which incidents involving the exploit are reported to CERT can be modeled using a common mathematical framework. 
Data associated with three significant exploits involving vulnerabilities in phf, imap, and bind can all be modeled using the formula C = I + S×√M, where C is the cumulative count of reported incidents, M is the time since the start of the exploit cycle, and I and S are the regression coefficients determined by analysis of the incident report data. Further analysis of two additional exploits involving vulnerabilities in mountd and statd confirms the model. We believe that the models will aid in predicting the severity of subsequent vulnerability exploitations, based on the rate of early incident reports. %B 2001 IEEE Symposium on Security and Privacy, 2001. S&P 2001. Proceedings %I IEEE %P 214 - 229 %8 2001/// %@ 0-7695-1046-9 %G eng %R 10.1109/SECPRI.2001.924300 %0 Conference Paper %B Parallel and Distributed Processing Symposium, 2000. IPDPS 2000. Proceedings. 14th International %D 2000 %T Optimizing retrieval and processing of multi-dimensional scientific datasets %A Chang,Chialin %A Kurc, T. %A Sussman, Alan %A Saltz, J. %K active data repository %K Area measurement %K Computer science %K Data analysis %K distributed memory parallel machines %K Educational institutions %K Information retrieval %K Information Storage and Retrieval %K infrastructure %K Microscopy %K Microwave integrated circuits %K multi-dimensional scientific datasets retrieval %K PARALLEL PROCESSING %K Pathology %K range queries %K regular d-dimensional array %K Satellites %K Tomography %X We have developed the Active Data Repository (ADR), an infrastructure that integrates storage, retrieval, and processing of large multi-dimensional scientific datasets on distributed memory parallel machines with multiple disks attached to each node. In earlier work, we proposed three strategies for processing range queries within the ADR framework. Our experimental results show that the relative performance of the strategies changes under varying application characteristics and machine configurations. 
In this work we investigate approaches to guide and automate the selection of the best strategy for a given application and machine configuration. We describe analytical models to predict the relative performance of the strategies where input data elements are uniformly distributed in the attribute space of the output dataset, restricting the output dataset to be a regular d-dimensional array %B Parallel and Distributed Processing Symposium, 2000. IPDPS 2000. Proceedings. 14th International %I IEEE %P 405 - 410 %8 2000/// %@ 0-7695-0574-0 %G eng %R 10.1109/IPDPS.2000.846013 %0 Journal Article %J IEEE Transactions on Parallel and Distributed Systems %D 1995 %T Going beyond integer programming with the Omega test to eliminate false data dependences %A Pugh, William %A Wonnacott,D. %K Algorithm design and analysis %K Arithmetic %K Computer science %K Data analysis %K false data dependences %K integer programming %K Linear programming %K Omega test %K Privatization %K Production %K production compilers %K program compilers %K Program processors %K program testing %K program transformations %K Testing %X Array data dependence analysis methods currently in use generate false dependences that can prevent useful program transformations. These false dependences arise because the questions asked are conservative approximations to the questions we really should be asking. Unfortunately, the questions we really should be asking go beyond integer programming and require decision procedures for a subclass of Presburger formulas. In this paper, we describe how to extend the Omega test so that it can answer these queries and allow us to eliminate these false data dependences. 
We have implemented the techniques described here and believe they are suitable for use in production compilers %B IEEE Transactions on Parallel and Distributed Systems %V 6 %P 204 - 211 %8 1995/02// %@ 1045-9219 %G eng %N 2 %R 10.1109/71.342135 %0 Journal Article %J IEEE Transactions on Parallel and Distributed Systems %D 1995 %T An integrated runtime and compile-time approach for parallelizing structured and block structured applications %A Agrawal,G. %A Sussman, Alan %A Saltz, J. %K Bandwidth %K block structured applications %K block structured codes %K compile-time approach %K compiling applications %K data access patterns %K Data analysis %K Delay %K distributed memory machines %K distributed memory systems %K FORTRAN %K Fortran 90D/HPF compiler %K High performance computing %K HPF-like parallel programming languages %K integrated runtime approach %K irregularly coupled regular mesh problems %K multigrid code %K Navier-Stokes solver template %K Parallel machines %K parallel programming %K Pattern analysis %K performance evaluation %K program compilers %K Program processors %K Runtime library %K Uninterruptible power systems %X In compiling applications for distributed memory machines, runtime analysis is required when data to be communicated cannot be determined at compile-time. One such class of applications requiring runtime analysis is block structured codes. These codes employ multiple structured meshes, which may be nested (for multigrid codes) and/or irregularly coupled (called multiblock or irregularly coupled regular mesh problems). In this paper, we present runtime and compile-time analysis for compiling such applications on distributed memory parallel machines in an efficient and machine-independent fashion. We have designed and implemented a runtime library which supports the runtime analysis required. The library is currently implemented on several different systems. 
We have also developed compiler analysis for determining data access patterns at compile time and inserting calls to the appropriate runtime routines. Our methods can be used by compilers for HPF-like parallel programming languages in compiling codes in which data distribution, loop bounds and/or strides are unknown at compile-time. To demonstrate the efficacy of our approach, we have implemented our compiler analysis in the Fortran 90D/HPF compiler developed at Syracuse University. We have experimented with a multiblock Navier-Stokes solver template and a multigrid code. Our experimental results show that our primitives have low runtime communication overheads and the compiler-parallelized codes perform within 20% of the codes parallelized by manually inserting calls to the runtime library. %B IEEE Transactions on Parallel and Distributed Systems %V 6 %P 747 - 754 %8 1995/07// %@ 1045-9219 %G eng %N 7 %R 10.1109/71.395403 %0 Conference Paper %B IEEE Conference on Visualization, 1991. Visualization '91, Proceedings %D 1991 %T Tree-maps: a space-filling approach to the visualization of hierarchical information structures %A Johnson,B. %A Shneiderman, Ben %K Computer displays %K Computer Graphics %K Computer science %K Data analysis %K display space %K Educational institutions %K Feedback %K hierarchical information structures %K HUMANS %K Laboratories %K Libraries %K Marine vehicles %K rectangular region %K semantic information %K space-filling approach %K tree-map visualization technique %K trees (mathematics) %K Two dimensional displays %K Visualization %X A method for visualizing hierarchically structured information is described. The tree-map visualization technique makes 100% use of the available display space, mapping the full hierarchy onto a rectangular region in a space-filling manner. This efficient use of space allows very large hierarchies to be displayed in their entirety and facilitates the presentation of semantic information. 
Tree-maps can depict both the structure and content of the hierarchy. However, the approach is best suited to hierarchies in which the content of the leaf nodes and the structure of the hierarchy are of primary importance, and the content information associated with internal nodes is largely derived from their children. %B IEEE Conference on Visualization, 1991. Visualization '91, Proceedings %I IEEE %P 284 - 291 %8 1991/10/22/25 %@ 0-8186-2245-8 %G eng %R 10.1109/VISUAL.1991.175815 %0 Journal Article %J IEEE Transactions on Software Engineering %D 1988 %T Learning from examples: generation and evaluation of decision trees for software resource analysis %A Selby,R. W %A Porter, Adam %K Analysis of variance %K Artificial intelligence %K Classification tree analysis %K Data analysis %K decision theory %K Decision trees %K Fault diagnosis %K Information analysis %K machine learning %K metrics %K NASA %K production environment %K software engineering %K software modules %K software resource analysis %K Software systems %K Termination of employment %K trees (mathematics) %X A general solution method for the automatic generation of decision (or classification) trees is investigated. The approach is to provide insights through in-depth empirical characterization and evaluation of decision trees for one problem domain, specifically, that of software resource data analysis. The purpose of the decision trees is to identify classes of objects (software modules) that had high development effort, i.e. in the uppermost quartile relative to past data. Sixteen software systems ranging from 3000 to 112000 source lines have been selected for analysis from a NASA production environment. The collection and analysis of 74 attributes (or metrics), for over 4700 objects, capture a multitude of information about the objects: development effort, faults, changes, design style, and implementation style. A total of 9600 decision trees are automatically generated and evaluated. 
The analysis focuses on the characterization and evaluation of decision tree accuracy, complexity, and composition. The decision trees correctly identified 79.3% of the software modules that had high development effort or faults, on the average across all 9600 trees. The decision trees generated from the best parameter combinations correctly identified 88.4% of the modules on the average. Visualization of the results is emphasized, and sample decision trees are included %B IEEE Transactions on Software Engineering %V 14 %P 1743 - 1757 %8 1988/12// %@ 0098-5589 %G eng %N 12 %R 10.1109/32.9061