In this dissertation, we study the problem of supporting schema evolution in snapshot databases and historical databases. Schema evolution requires several error-prone and time-consuming tasks, including migration of stored data, rewriting of applications, and, in historical databases, management of heterogeneous data. Thus, our primary goal is to relieve users from the burden of performing these tasks manually by having the system perform them automatically.
The outline and the main contributions of this dissertation are as follows. First, we design schema modification operators (SMOs) for the high-level specification of schema changes. Then, based on SMOs, we develop a workbench system called PRISM, where users can evaluate the effect of schema changes ("what-if" scenarios) and (semi-)automatically execute data migration and application query translation. For query translation, the system computes the possible inverse schema mappings and performs automatic query translation according to the inverse selected by the user.
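As an illustration of how an SMO and its inverse drive both data migration and query translation, here is a minimal Python sketch assuming a hypothetical RENAME COLUMN operator; the table, column names, and helper functions are illustrative only, not PRISM's actual interface.

```python
# Hypothetical RENAME COLUMN schema modification operator (SMO).
# Forward direction migrates stored data; the inverse mapping is used
# to rewrite legacy queries against the new schema.

def rename_column(rows, old, new):
    """Forward migration: rename a column in every stored row."""
    return [{(new if k == old else k): v for k, v in r.items()} for r in rows]

def inverse_rewrite(query_columns, old, new):
    """Rewrite a query posed on the old schema onto the new schema,
    following the inverse mapping of the SMO."""
    return [new if c == old else c for c in query_columns]

# Migrate a toy table and translate a query written on the old schema.
table = [{"emp": "ann", "tel": "555-1"}, {"emp": "bob", "tel": "555-2"}]
migrated = rename_column(table, "tel", "phone")
legacy_query = ["emp", "tel"]            # column list of a legacy query
rewritten = inverse_rewrite(legacy_query, "tel", "phone")
```

Real SMOs (e.g., table decomposition or merging) require richer inverses, which is why PRISM lets the user choose among candidate inverse mappings.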
Evolution of schemas also affects historical data management: repeated schema changes lead to the accumulation of successive schema versions in history, since historical data must be archived according to the schema version under which they were originally created, so as to assure faithful preservation of the database history. To write a historical query on such databases, users have to refer to every schema version containing relevant parts of the historical database. Since this is too cumbersome for typical applications, we let users write a single historical query on the current schema version, as if all historical data were stored under that version. Thus, our PRIMA system uses efficient rewriting techniques to map queries formulated against the current schema onto the relevant schema versions.
We also study the problem of efficient support for transaction-time databases with evolving schemas and propose novel optimization techniques for temporal queries. In particular, we propose an efficient temporal coalescing technique called CNesT, which outperforms the best coalescing algorithm to date by almost two orders of magnitude.
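For context, temporal coalescing merges value-equivalent tuples whose time periods overlap or are adjacent into maximal periods. The following minimal sketch illustrates the operation itself, not the CNesT algorithm; the sample history and the half-open period convention are illustrative.

```python
# Minimal temporal coalescing over (value, start, end) tuples with
# half-open periods [start, end). Value-equivalent tuples whose periods
# overlap or meet are merged into maximal periods.

def coalesce(tuples):
    out = []
    for value, start, end in sorted(tuples):   # group by value, then time
        if out and out[-1][0] == value and start <= out[-1][2]:
            prev = out[-1]                     # extend the previous period
            out[-1] = (value, prev[1], max(prev[2], end))
        else:
            out.append((value, start, end))
    return out

history = [("mgr", 1, 5), ("mgr", 5, 9), ("clerk", 9, 12), ("mgr", 12, 15)]
coalesce(history)   # the periods [1,5) and [5,9) of "mgr" merge into [1,9)
```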
Many modern applications require continuous processing of massive data streams, creating difficult research challenges on multiple fronts. Foremost among these is the design and implementation of Data Stream Management Systems (DSMSs) that optimize the execution of multiple continuous queries on massive and bursty data streams to produce fast, near-real-time responses upon tuple arrival. Existing query constructs and query processing techniques of Database Management Systems must be greatly extended and revised to support continuous queries efficiently in DSMSs.
In this dissertation, we propose novel query scheduling algorithms and query constructs whose effectiveness was demonstrated in the Stream Mill DSMS prototype. Thus, we introduce on-demand timestamp generation and propagation mechanisms based on a flexible query-execution model, to solve the idle-waiting problem, improve query response times, and reduce memory consumption. Moreover, we introduce practical scheduling algorithms, which assume uniform cost for tuples on the same operator, to handle query graphs of arbitrary complexity. Finally, we compare the performance of these practical algorithms against theoretically optimal ones that assume perfect knowledge of the costs of the individual tuples being processed. Our experiments show that the performance of the former closely approximates that of the latter.
In order to utilize DBMS languages for data stream processing, stream-oriented constructs, such as sliding windows, have to be defined and integrated into the query language. Thus, we introduce novel constructs and techniques for supporting sliding windows on arbitrary User Defined Aggregates, so as to achieve their incremental maintenance on logical and physical windows.
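The idea of incremental maintenance can be illustrated on a simple windowed sum, where each arriving tuple updates the aggregate and the oldest tuple is expired, instead of recomputing over the whole window. The class and method names below are illustrative, not Stream Mill's actual syntax.

```python
from collections import deque

# Incremental maintenance of a windowed aggregate: the running total is
# updated on tuple arrival and expiry, never recomputed from scratch.

class WindowedSum:
    def __init__(self, size):                  # INITIALIZE the window state
        self.size = size
        self.window = deque()
        self.total = 0

    def iterate(self, value):                  # ITERATE: a new tuple arrives
        self.window.append(value)
        self.total += value
        if len(self.window) > self.size:       # EXPIRE the oldest tuple
            self.total -= self.window.popleft()
        return self.total

agg = WindowedSum(3)                           # logical window of 3 tuples
results = [agg.iterate(v) for v in [1, 2, 3, 4, 5]]
# results == [1, 3, 6, 9, 12]
```

The same arrive/expire discipline extends to physical (time-based) windows, where expiry is driven by timestamps instead of tuple counts.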
Finally, we explore some performance optimization issues in advanced applications, including RFID (Radio Frequency IDentification) applications, and load-shedding in multi-source stream classifiers.
The eXtensible Markup Language (XML) is rapidly becoming the de facto standard for retrieving and exchanging web information, and is having a significant impact on the evolution of existing database management systems. Commercial SQL-compliant database systems are moving aggressively to support the integrated management of XML documents and XML-published relational data. This is achieved by supporting XML query languages, such as XQuery, along with extensions of relational languages such as SQL/XML.
In this dissertation, we focus our work on three areas that present research issues of particular significance and interest. The first area involves managing and querying the temporal history of databases using XML, since its data model and query languages are more supportive of temporal information than relational databases and SQL. We then discuss the problem of supporting this XML historical view efficiently by mapping it back to relational databases, since these achieve better scalability. We finally introduce an algorithm that can process temporal coalescing efficiently within the current SQL framework.
The second and fast-growing opportunity area is represented by data streams, whereby query languages need to be extended to support continuous query applications on incoming relational streams and XML streams. While current approaches design separate data stream management systems for relational and XML data, the focus of my research has been on providing unified support for both kinds of data streams. Toward this goal, XML SAX event streams and relational streams need to be integrated, and SQL and XQuery must be extended to overcome their limitations with certain data stream queries. We also overview the state-of-the-art technologies based on the Finite State Automata (FSA) model to process multiple queries, and use UDAs to simulate such FSAs with comparable performance. These techniques allow us to build a system that integrates the management of relational and XML streams.
The last but not least research problem in my thesis is OLAP applications on XML data. With XML gaining importance as the standard for representing business data, XQuery must support the types of queries that are common in business analytics. One such class of queries is OLAP-style aggregation queries. We review several recent research proposals that introduce new constructs into XQuery, pinpoint their disadvantages, and bring forward our function import mechanism. Basically, we allow XQuery UDFs to import functions written in SQL:2003 and UDAs. All complex moving-window queries and grouping-set queries can thus be specified easily in SQL using current SQL standards, and optimized using the mature query optimization technology of relational DBMSs. We show that our approach provides clear expressive power and efficient query performance.
A new generation of data-intensive applications is emerging for managing and querying information that, rather than residing in databases, flows continuously through the network in the form of massive data streams. Hence, there is much research work on designing Data Stream Management Systems, and the approach favored by many research projects consists of extending database languages and technology to data streams. However, the new computational environment brings significant research challenges in areas such as query languages, query processing, and advanced applications.
In this dissertation, we first focus on the limitations of relational languages in expressing continuous stream queries. A main limitation follows from the fact that only nonblocking operators can be used in continuous queries, which makes relational languages incomplete on these queries. To address this problem, we investigate user-defined aggregates natively defined in SQL itself, and prove that these make SQL (i) Turing-complete on stored data, and (ii) complete on data streams. Furthermore, we illustrate the effectiveness of the proposed extensions on complex applications involving time-series queries and mining queries.
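A UDA is typically specified by INITIALIZE, ITERATE, and TERMINATE blocks; the following Python sketch mimics that structure for an online average, with the nonblocking behavior coming from emitting results in ITERATE rather than waiting for TERMINATE. The function is an illustration of the execution model, not actual SQL syntax.

```python
# Sketch of a user-defined aggregate (UDA) in the INITIALIZE/ITERATE/
# TERMINATE style. A nonblocking UDA emits results from ITERATE, so it
# can produce output on an unbounded stream without ever terminating.

def online_avg(stream):
    count, total = 0, 0.0            # INITIALIZE the aggregate state
    for value in stream:             # ITERATE on each arriving tuple
        count += 1
        total += value
        yield total / count          # emit early: nonblocking behavior
    # A blocking aggregate would instead emit once here, in TERMINATE.

list(online_avg([2, 4, 6]))  # → [2.0, 3.0, 4.0]
```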
For advanced applications, we focus on data-stream mining algorithms, which must now be redesigned to make lighter demands on resources and display greater adaptability than algorithms for stored data. In this dissertation, we introduce ANNCAD, which uses a multi-resolution data representation to classify new test points using the nearest-neighbors principle. Its incremental property and very fast update speed make ANNCAD very suitable for mining data streams. Our experiments show that ANNCAD is adaptive and works well in many applications, including image recognition and census surveying.
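The multi-resolution idea can be conveyed by a much-simplified sketch (not the actual ANNCAD algorithm): training points are binned into grids of decreasing resolution, and a test point takes the majority label of the finest non-empty cell that contains it. The grid resolutions and data below are illustrative.

```python
from collections import Counter

# Simplified multi-resolution grid classification: search the finest
# grid first and fall back to coarser grids until a non-empty cell
# containing the test point is found.

def cell(point, resolution):
    """Map a point in [0,1)^d to its grid cell at the given resolution."""
    return tuple(int(coordinate * resolution) for coordinate in point)

def classify(train, point, resolutions=(8, 4, 2, 1)):
    for resolution in resolutions:            # finest resolution first
        labels = [label for p, label in train
                  if cell(p, resolution) == cell(point, resolution)]
        if labels:                            # non-empty cell found
            return Counter(labels).most_common(1)[0][0]
    return None

train = [((0.1, 0.1), "a"), ((0.15, 0.12), "a"), ((0.9, 0.9), "b")]
classify(train, (0.12, 0.11))   # → "a"
```

The grid counts can be maintained incrementally as new points arrive, which is what makes this style of classifier attractive for streams.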
We then study the problem of stream query processing. We propose a load-shedding technique for multi-way joins, called MSketch, which makes decisions based on the productivity of tuples, rather than only on the content of the joined stream pairs. A thorough study shows that MSketch outperforms other existing algorithms. Finally, we propose general techniques for optimizing the accuracy of window aggregates, statistical aggregates, and mining queries in the presence of sampling. Our method incorporates prior knowledge into an error model that is used to reduce the uncertainty introduced by sampling. We also extend the method to adjust to concept shifts.
Data stream mining and sequence mining have many applications and pose challenging research problems. Typical applications, such as network monitoring, web searching, telephone services, and credit card purchases, are characterized by the need to continuously mine massive data streams to discover up-to-date patterns, which are invaluable for timely strategic decisions. These new requirements call for the design of new mining methods to replace the traditional ones, since those would require the data to be first stored and then processed off-line using complex algorithms that make several passes over the data. Therefore, a first research challenge is designing fast and light mining methods for data streams, e.g., algorithms that only require one pass over the data and work with limited memory. Another challenge is created by the highly dynamic nature of data streams, whereby stream mining algorithms need to promptly detect changing concepts and data distributions and adapt to them. While noise represents a general problem in data mining, it poses new challenges on data streams insofar as adaptability becomes more difficult when the data stream contains noise.
The main limitation of data stream mining methods is that they cannot reveal long-term trends, as they only keep a small snapshot of the most recent data. Neither can they discover very complicated patterns that can be detected by methods requiring extensive computational resources. However, these patterns can be discovered by off-line mining after data streams are stored as sequences. Sequence mining, in general, can reveal long-term trends and more complicated patterns, defined in a multidimensional space via some similarity criteria. The key research challenges that arise in this context include (i) designing metrics that measure the similarity of sequences, (ii) dealing with high dimensionality, and (iii) achieving scalability.
This dissertation makes a number of contributions toward the solution of these problems, including the following ones:
1. Adaptive Boosting: A stream ensemble method is proposed that maintains a very accurate predictive model with fast learning and light memory consumption. The method is also highly adaptive through novel change detection techniques.
2. Robust Regression Ensemble: This method enhances ensemble methods with outlier detection and statistical learning theory.
3. SeqClus: A pattern-based subspace clustering algorithm is introduced along with a novel pattern similarity metric for sequences. The algorithm is scalable and efficient.
4. Mining Quality: To deal with noise and improve mining quality, a general approach is introduced based on data dependency. The approach exploits local data dependency between samples using pairwise Markov Networks and Bayesian belief propagation techniques.
The efficacy of the techniques proposed was demonstrated through extensive experiments, both on synthetic and on real-life data.
Current database systems do not provide effective means for archiving the database history and querying past snapshots of the database and its temporal evolution. Better support for temporal applications by database systems represents an important objective that is difficult to achieve, since it requires an integrated solution for technical problems that are challenging on their own, including (i) expressive temporal representations and data models, (ii) powerful languages for temporal queries and snapshot queries, (iii) indexing, clustering and query optimization techniques for managing temporal information efficiently, and (iv) architectures that bring together the different pieces of enabling technology into a robust system.
There is much current interest in publishing and viewing databases as XML documents. The general benefits of this approach follow from the popularity of XML and the tool set available for processing information encoded in this universal standard. In this dissertation, we explore the additional and unique benefits achieved by this approach on temporal database applications. We show that XML with XQuery can provide surprisingly effective solutions to the problem of supporting historical queries on past contents of database relations and their evolution. Indeed, using XML, the histories of database relations can be represented naturally using a temporally grouped data model, and complex temporal queries can be expressed in XQuery without requiring extensions to the current standard.
Therefore, we present the ArchIS system that achieves these benefits. ArchIS' architecture uses (a) XML to support a temporally grouped (virtual) representation of the database history, (b) XQuery to express powerful temporal queries on such views, (c) temporal clustering and indexing techniques for managing the actual historical data in an RDBMS, and (d) SQL/XML for executing the queries on the XML views as equivalent queries on the relational database. Extensive performance studies show that ArchIS is quite effective at storing and retrieving under complex query conditions the transaction-time history of relational databases. By supporting database compression as an option, ArchIS also achieves excellent storage efficiency for archived histories. Therefore, ArchIS delivers full-functionality transaction-time databases without requiring temporal extensions in XML or database standards. Moreover, these techniques can be extended to valid-time and bitemporal databases, where complex temporal queries can also be expressed in standard XQuery.
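As an illustration of a temporally grouped representation, the following sketch builds an XML view of one employee's history, in which each attribute carries its own sequence of versions with validity periods; the element names and the sentinel end time are illustrative rather than ArchIS' actual schema.

```python
import xml.etree.ElementTree as ET

# Temporally grouped view of one tuple's history: versions are grouped
# under each attribute, with validity periods as tstart/tend attributes
# ("9999" is a now-sentinel for still-current values).

history = {"name": [("Bob", "2001", "9999")],
           "salary": [("60K", "2001", "2003"), ("70K", "2003", "9999")]}

employee = ET.Element("employee")
for attribute, versions in history.items():
    node = ET.SubElement(employee, attribute)
    for value, tstart, tend in versions:
        version = ET.SubElement(node, "version", tstart=tstart, tend=tend)
        version.text = value

xml_text = ET.tostring(employee, encoding="unicode")
```

On such a view, a query like "Bob's salary in 2002" becomes a simple XQuery predicate on `tstart` and `tend`, with no coalescing needed, since each attribute's history is already grouped.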
The temporal modeling and querying approach proposed in this thesis is very general and can be extended to arbitrary XML documents. In our extension, we manage the XML document revision history effectively by (i) representing concisely the successive versions of a document as another XML document that implements a temporally grouped data model, (ii) using structured diff algorithms to build such documents, and (iii) using XQuery to express complex temporal queries on the evolution of the document structure and its content.
In this dissertation, we extend database models and query languages to support spatio-temporal information, including representations for changing positions and shapes. Furthermore, we propose techniques to support spatio-temporal extensions on Object-Relational systems and achieve end-user extensibility.
We begin with temporal data models and query languages. Our approach is based upon a point-based representation of time, enhanced with user-defined aggregates that support interval-oriented operators such as duration and during. This approach provides several advantages: it solves the coalescing problem, minimizes the extensions required of SQL, and is applicable to all query languages.
Then, we extend the time model at the physical level to spatio-temporal databases. As the basis of our implementation, we introduce counterclockwise-directed triangles as our core spatial abstract data type. Then, user-defined aggregates are introduced to support spatial operators such as contain and overlap, and spatio-temporal operators such as moving_distance. Finally, we represent moving objects as sequences of snapshots of spatial objects such as points, lines, and polygons.
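One reason counterclockwise-directed triangles make a convenient primitive is that point containment reduces to three signed-area tests; the sketch below illustrates that geometric test, not our actual SQL-level implementation.

```python
# A point lies inside a counterclockwise-directed triangle iff it lies
# to the left of (or on) each directed edge, which reduces to three
# signed-area computations.

def ccw(a, b, c):
    """Twice the signed area of triangle abc; positive iff a->b->c is CCW."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def contains(triangle, point):
    a, b, c = triangle
    assert ccw(a, b, c) > 0, "triangle must be counterclockwise-directed"
    return all(ccw(p, q, point) >= 0 for p, q in ((a, b), (b, c), (c, a)))

triangle = ((0, 0), (4, 0), (0, 4))        # CCW-directed vertices
contains(triangle, (1, 1))   # → True
contains(triangle, (5, 5))   # → False
```

Because any polygon can be triangulated, operators like contain and overlap over polygons can be expressed as aggregates of such per-triangle tests.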
To reconcile logical and physical requirements, we follow an implementation approach based on a layered architecture. Thus, for temporal information, point-based representations are mapped to an interval-oriented representation at the internal level. Likewise, we map polygon-based representations into a triangle-based internal representation. We implement these mappings through query transformations, and user-defined aggregates.
A final advantage of our approach is that it allows end-users to further extend and customize their system: in fact, in our SQLST system, users can extend database functionality by writing new user-defined aggregates in SQLST itself.
In summary, the spatio-temporal data models and query languages introduced in this dissertation offer several conceptual advantages; furthermore, our implementation approach provides considerable practical benefits, including compatibility with Object-Relational systems, flexibility, robustness, and ease of customization by end-users.
The need to search for complex and recurring patterns in database sequences is shared by many applications. In this work, we discuss how to express and efficiently support sophisticated sequential pattern queries in relational database systems. Thus, we first introduce SQL-TS, an extension of SQL, to express these patterns, and then we study how to optimize search queries for this language. We take the optimal text search algorithm of Knuth, Morris and Pratt, and generalize it to handle complex queries on sequences. Our algorithm exploits the interdependencies between the elements of a sequential pattern to minimize repeated passes over the same data. We then present extensions of the algorithm for detecting repeated patterns and disjunctive patterns. We also provide methods for finding the interdependencies between pattern elements for important domains, including intervals and vector time-series. In addition, a logic-based semantics for SQL-TS is given. Experimental results on typical sequence queries, such as double-bottom queries, confirm that substantial speedups are achieved by our new optimization techniques.
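For reference, the classical Knuth-Morris-Pratt algorithm that SQL-TS generalizes precomputes a failure function from the pattern, so the search never re-examines input already known to match; SQL-TS derives analogous shift information from the interdependencies of pattern predicates rather than from character equality.

```python
# Classical KMP string search: the failure function records, for each
# pattern prefix, the length of its longest proper prefix that is also
# a suffix, allowing the search to shift without backing up in the text.

def failure_function(pattern):
    fail, k = [0] * len(pattern), 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_search(text, pattern):
    fail, k, matches = failure_function(pattern), 0, []
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]                  # shift using the failure function
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):                # full match found
            matches.append(i - k + 1)
            k = fail[k - 1]
    return matches

kmp_search("ababcabab", "abab")  # → [0, 5]
```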