Schema Evolution Benchmark

From Schema Evolution
Jump to: navigation, search

Logo SCHEVOBEN.png

This webpage publishes the results of an intense analysis of the MediaWiki DB backend, which constitutes the core of the Pantha Rei Schema Evolution Benchmark. These results have been presented at ICEIS 2008 [1] in the paper "Schema Evolution in Wikipedia: toward a Web Information System Benchmark". The paper is available here: [2]

4.5 year of development have been analyzed and over 170 schema versions compared and studied. In this website we report the results of our analysis and provide the entire dataset we collected, to the purpose of defining a unified Benchmark for Schema Evolution.

Author:

  • Carlo A. Curino (contact author) [3]
  • Hyun J. Moon[4]
  • Letizia Tanca[5]
  • Carlo Zaniolo[6]


Contents


If you are only interested in downloading the dataset please visit our Downloads section.

Presentation and Screencast

A PDF version of the presentation given at ICEIS 2008 is available at [7].

And the corresponding Screencast is available at: [8].


Introduction

Every Information System (IS) is the subject of a constant evolution process to adapt the system to many factors such as changing requirements, new functionalities, compliance to new regulations, integration with other systems, and new security and privacy measures. The data management core of an IS is one of the most critical portions to evolve. Often based on Relational DataBase (DB) technology, the data management core of a system needs to evolve whenever the revision process requires modifications in the logical and physical organization of the data. Given its fundamental role, the evolution of the DB underlying an IS has a very strong impact on the applications accessing the data; thus, support for graceful evolution is of paramount importance. The complexity of DB and software maintenance, clearly, grows with the size and complexity of the system. Furthermore, when moving from intra-company systems -- typically managed by rather small and stable teams of developers/administrators -- to collaboratively-developed-and-maintained public systems, the need for a well-managed evolution becomes indispensable. Leading-edge web projects, characterized by massive collaborations and fast growth, experience a relentless drive for changes, which in turn generates a critical need for widespread consensus and rich documentation. Schema evolution has been extensively studied in the scenario of traditional information systems. An authoritative and comprehensive survey of the approaches to relational schema evolution and schema versioning is presented in <bibref f="defbib.bib">roddick95schema</bibref>. More recently, <bibref f="defbib.bib">rametal03</bibref> has surveyed schema evolution on the object-oriented, relational, and conceptual data models. Case studies on schema evolution on various application domains appear in <bibref f="defbib.bib">sjoberg93quantifying</bibref><bibref f="defbib.bib">marche93measuring</bibref>. Schema evolution has also been studied in the context of model management -- research which aims at developing a systematic approach to schema management and mapping <bibref f="defbib.bib">bernstein03applying</bibref>. Other interesting approaches tackled the problem of schema evolution in XML <bibref f="defbib.bib">MoroML07</bibref>, data warehouse <bibref f="defbib.bib">Rizzi07</bibref> and object-oriented databases <bibref f="defbib.bib">Galante:2005eq</bibref><bibref f="defbib.bib">Franconi:2001da</bibref>.

Of particular interest, are Web Information Systems (WIS), often based on open-source solutions. This large and fast-growing class include, among many other examples: Content Management Systems, Wiki-based web portals, E-commerce systems, Blog, and Public Scientific Databases from `Big Science' Projects. The common denominator among these systems is the collaborative and distributed nature of their development and content management. Among the best known examples we have: MediaWiki [9], a website software underlying a huge number of web portals, including Wikipedia [10] and this one, Joomla [11], a complete Content Management System (CMS) and Web Application Framework, and TikiWiki [12], an open source groupware and CMS solution.

Moreover, inasmuch as large collaborative projects are now very common in natural science research, their reliance on databases and web systems as the venue needed to promptly shared results and data has created many large Scientific Databases, including the Human Genome DB [13], HGVS [14], and many others. Although different in many ways, these all share a common evolution problem for which the slow labor-intensive solutions of the past have become inadequate. New conceptual and operational tools are needed to enable graceful evolution by systematically supporting the migration of the DB and the maintenance of its applications. Among the desiderata in such a scenario, we seek systems that preserve and manage the past contents of a database and the history of its schema, while allowing legacy applications to access new contents by means of old schemas <bibref f="defbib.bib">vldb2008a</bibref><bibref f="defbib.bib">vldb2008b</bibref>.

In the rest of this paper, we shall analyze the case of MediaWiki, a data-intensive, open-source, collaborative, web-portal software, originally developed to run Wikipedia, a multilingual, web-based, free-content encyclopedia: this platform is currently used by over 30,000 wikis, for a grand total of over 100 million pages (See http://s23.org/wikistats/). While the Wikipedia content evolution has been analyzed previously <bibref f="defbib.bib">barzan</bibref>, this report is the first that focuses on the problem of DB schema evolution. MediaWiki has seen, during its 4 years and 7 months of life, 171 different DB schema versions released to the public by means of a CVS/Subversion versioning system [15]. As one can easily imagine, every schema change has a profound impact on the application queries and the code managing the results, which must thus be revised. In the case of MediaWiki, we observed in our analysis that only a small fraction (about 22% queries designed to run on old schema versions are still valid throughout the schema evolution (see discussion in Section ). Our analysis was made possible by the collaborative, public, and open-source nature of the development, documentation and release of MediaWiki and Wikipedia.

Contributions The main contributions of this paper are the following: (i) we present the first schema evolution analysis of a real-life Web Information System DB, by studying the MediaWiki DB backend. This provides a deep insight on Wikipedia, one of the ten most popular websites to date [16] and reveals the need for DB schema evolution and versioning techniques, and (ii) we provide and plant the seeds of the first public, real-life-based, benchmark for schema evolution, which will offer to researchers and practitioners a rich data-set to evaluate their approaches and solutions. As a part of the benchmark, we also release a simple but effective tool-suite for evolution analysis.

Why Wikipedia

Google Search statistics on wikipedia popularity
Alexa.com statistics on wikipedia popularity

Wikipedia represent one of the 10 most popular websites in the WWW, is a DB-centric Web Information System, and is released under open-source license.

As shown by the two following graphs, showing respectively the google search popularity and the percentage of user visiting Wikipedia by http://www.alexa.com. Moreover, the PHP back-end underlying wikipedia is shared by other 30,000 wikis. Both software and content are released under open-source licenses. This make Wikipedia, and thus MediaWiki a perfect starting point for our goals.

MediaWiki Schema Evolution: a short Introduction

Evolving the database that is at the core of an Information System represents a difficult maintenance problem that has only been studied in the framework of traditional information systems. However, the problem is likely to be even more severe in web information systems, where open-source software is often developed through the contributions and collaboration of many groups and individuals. Therefore, in this paper, we present an in-depth analysis of the evolution history of the Wikipedia database and its schema; Wikipedia is the best-known example of a large family of web information systems built using the open-source MediaWiki software. Our study is based on: (i) a set of Schema Modification Operators that provide a simple conceptual representation for complex schema changes, and (ii) simple software tools to automate the analysis. This framework allowed us to dissect and analyze the 4.5 years of Wikipedia history, which was short in time, but intense in terms of growth and evolution. Beyond confirming the initial hunch about the severity of the problem, our analysis suggests the need for developing better methods and tools to support graceful schema evolution (see Prism for the solution we propose). Therefore, we briefly discuss documentation and automation support systems for database evolution, and suggest that the Wikipedia case study can provide the kernel of a benchmark for testing and improving such systems.

MediaWiki Architecture

MediaWiki Architecture

The MediaWiki software is a browser-based web-application, whose architecture is described in details in [Help:MediaWikiarchitecture] and in the MediaWiki Workbook2007 [17]. As shown in Figure, the users interact with the PHP frontend through a standard web browser, submitting a page request (e.g., a search for pages describing ``Paris). The frontend software consists of a simple presentation and management layer (MediaWiki PHP Scripts) interpreted by the Apache PHP engine. The user requests are carried out by generating appropriate SQL queries (or updates), that are then issued against the data stored in the backend DB (e.g., the database is queried looking for article's text containing the term ``Paris). The backend DB can be stored in any DBMS: MySQL, being open-source and scalable, is the default DBMS for the MediaWiki software. The results returned by the DBMS are rendered in XHTML and delivered to the user's browser to be displayed (e.g., a set of of links to pages mentioning ``Paris is rendered as an XHTML list). Due to the heavy load of the Wikipedia installation of this software, much of effort has been devoted to performance optimization, introducing several levels of caching (Rendered Web Page, DB caches, Media caches), which is particularly effective thanks to the very low rate (0.04\%) of updates w.r.t. queries. Obviously, every modification of the DB schema has a strong impact on the queries the frontend can pose. Typically each schema evolution step can require several queries to be modified, and so several PHP scripts (cooperating to interrogate the DB and render a page) to be manually fixed, in order to balance the schema changes.

MediaWiki Growth

Attributes and Tables growth of MediaWiki Schema

In this section, we analyze the schema evolution of MediaWiki based on its 171 schema versions, as committed to SVN between April 2003 (first schema revision) and November 2007 (date of this analysis).

Schema Size Growth In Figures, we report the size of MediaWiki DB schema in history, in terms of the number of tables and columns, respectively. The graphs show an evident trend of growth in sizes, where the number of tables has increased from 17 to 34 (100% increase) and the number of columns from 100 to 242 (142%). Sudden drops in the graphs are due to schema versions with syntax errors, i.e., schema versions that could not be properly installed. In both graphs we observe different rates of growth over time, which seem to be related to the time periods preceding or following official releases of the overall software (see Table in section Available Software Version).

Schema growth is due to three main driving forces as follows:

  • performance improvement, e.g., introduction of dedicated cache tables,
  • addition of new features, e.g., support for logging and content validation,
  • the growing need for preservation of DB content history, i.e., introduction of tables and columns to store outdated multimedia content such as the

'filearchive' table. The Figure shows a histogram representation of the table lifetimes, in terms of number of versions. The lifetimes range from very long ones, e.g., the user table that was alive throughout the entire history, to short ones, e.g., random table that only survived for two revisions. On average, each table lasted 103.3 versions (60.4% of the total DB history). Figure 5 presents lifetimes of columns in histogram, where columns lasted 97.17 versions on average (56.8% of the total DB history). Interestingly, both figures show that there are two main groups of tables and columns: “short-living” and “long-living”. The former might be due to the fact that the schema has been growing lately so a significant portion of tables and columns has been introduced only recently. The latter can be explained noting that the core tables/columns tend to be rather stable throughout the entire history.

Some historical statistics on Wikipedia content growth are available at http://stats.wikimedia.org/EN/TablesDatabaseSize.htm .

Per-Month Revision Count


The Figure Per-month Revision Count shows how many schema versions were committed during each month in history, providing an estimation of the development effort devoted to the DB backend over time, while Table "MediaWiki Release History" shows the main releases of the sofware front-end.

MediaWiki Relase History

More Advanced Statistics

Macro-Classification of Changes

"Macro Schema changes"

We group the 170 evolution steps based on the types of evolution they present as in the Table. While the “actual schema changes” have an impact on the queries, as they modify the schema layout, the evolution of the DBMS engine, indexes, and data types, (while being relevant to performance) does not require any query correction, because of the physical data-independence provided by the DBMS. The table shows the frequencies of the types of changes among the 170 evolution steps. In particular, the table highlights that:

  • almost 55% of the evolution steps involve actual schema changes (further discussed in Section 3.3);
  • over 40% of the evolution steps involve index/key

adjustments and this is due to the performance-critical role of the DB in a data-intensive, high-load, website such as Wikipedia;

  • 8.8% of the evolution steps were rollbacks to previous schema versions;
  • 7.6% of the analyzed evolution steps present only

documentation changes.

Micro-Classification of Changes

SMO Table

Schema Modification Operators: To better understand the Relational DB schema evolution, we introduce a classification of the “actual schema changes”. Different formalisms can be exploited for this purpose. Shneiderman and Thomas proposed in [Shneiderman and Thomas, 1982] a comprehensive set of schema changes, including structural schema changes and also changes regarding the keys and dependencies. More recently, Bernstein et al. have also proposed a set of schema evolution primitives using algebra-based constraints as their primitives [Bernstein et al., 2006, Bernstein et al., 2008]. Among several options, we chose the Schema Modification Operators (SMOs) that we proposed in <bibref f="defbib.bib">vldb2008a</bibref><bibref f="defbib.bib">vldb2008b</bibref> (briefly described in Table 3). SMOs capture the essence of the existing works, but can also express schema changes not modeled by previous approaches. For example, by using function in the ADD COLUMN operator, SMOs can support semantic conversion of columns (e.g., currency exchange), column concatenation/split (e.g., different address formats), and other similar changes that have been heavily exploited in modeling MediaWiki schema changes. The effectiveness of SMOs have been validated in <bibref f="defbib.bib">vldb2008a</bibref><bibref f="defbib.bib">vldb2008b</bibref>, where the PRISM and PRIMA systems used SMOs to describe schema evolution in transaction-time databases and to support historical query reformulations over multi-schema-version transaction-time databases. The syntax of SMO is similar to that of SQL DDL [ISO/IEC 9075-*: 2003, 2003, Eisenberg et al., 2004], and provides a concise way to describe typical modifications of a database schema and the corresponding data migration. Every SMO takes as input a schema and produces as output a new version of the same schema. Table 3 presents a list of SMOs, operating on tables (the first six) and on columns (the last five) of a given DB schema, together with a brief explanation. Note that simple SMOs can be arbitrarily combined in a sequence, to describe complex structural changes, as those occured in the MediaWiki DB schema evolution. Classification Using SMOs In this context we exploit SMOs as a pure classification instrument to provide a fine-grained analysis of the types of change the schema has been subject to. While there might be several ways to describe a schema evolution step by means of SMOs, we carefully select, analyzing the available documentation, the most natural set of SMOs describing each schema change in the MediaWiki history.

SMO Frequency

Table "SMO Frequency" shows the distribution of the SMOs, presenting, for each type, how many times it has been used in the entire schema evolution history. Is interesting to notice that the more sophisticated SMOs (e.g., MERGE TABLE) while being indispensable are not very common. The balance between column/table additions and deletions highlights the “content preserving” attitude of Wikipedia14 .

SMO per version

In Figure we show the number of SMOs (overall) for each evolution step. The curve shows how the schema evolution has been mainly a continuous process of adjustment, with few exceptions shown as spikes in the figure, coinciding with major evolution steps, such as:

  • v6696 (41st version) - v6710 (42nd version), 92

SMOs: a change in the storage strategy of the ar- ticle versions,

  • v9116 (61st version) - v9199 (62nd version), 12

SMOs: a change in link management,

  • v20393(138th version) - v20468 (139th version), 9 SMOs: history management (deletion and log

features added to several tables).

The Impact on Applications

Real Queries Impact (backward)


In order to study the effect of schema evolution on the frontend application, we analyze the impact of the schema changes on six representative sets of queries. Each experiment tests the success or failure of a set of queries, originally designed to run on a specific schema version, when issued against other schema versions.

Experimental Setting

To simulate a case where current applications are run on databases under older schema versions, we test three sets of queries, valid on the last schema version, on all the previous schema versions (first Figure). Also, to study how legacy applications succeed or fail on newer versions of the database schema, we test three sets of legacy queries on all the subsequent schema versions (second Figure). The six sets considered in our experiments are as follows:

  • real-world templates, current (first Figure): the 500 most common query templates (extracted from over 780 millions of query instances), derived from the Wikipedia on-line profiler and post-processed for cleaning;
  • lab-gen queries, current (first Figure): 2496 query instances generated by a local installation of the current version of MediaWiki (release 1.11, schema version 171), interacting with the frontend and logging the queries issued against the underlying MySQL DBMS;
  • lab-gen templates, current (first Figure): 148 templates of queries extracted from the above lab-gen queries, current;
Lab-generated Queries Impact (forward)
  • lab-gen queries, legacy (second Figure): 4175 query instances generated by a local installation of an old version of MediaWiki (release 1.3\footnote{The oldest version compatible with the environment of our experimental setting.}, schema version 28), interacting with the frontend and logging the queries issued against the underlying MySQL DBMS;
  • lab-gen templates, legacy (second Figure): 74 templates extracted from the above lab-gen queries, legacy;
  • synthetic probe queries, legacy (second Figure): 133 synthetic queries accessing single columns (i.e., select tab_j.att_i from tab_j) of schema version 28, designed to highlight the schema portion affected by the evolution.


The Results

Each set has been tested against all schema versions: the resulting query execution success rates are shown in the first Figure (for the first three sets) and second Figure (for the last three sets). The outliers in the graphs (sudden and extremely low values) are due to syntactically incorrect DB schema versions.

The first three sets are shown in the first Figure. It is interesting to notice that:

  • proceeding from right to left, a series of descending steps illustrates that more and more of the current queries become incorrect as we move to older schemata.
  • the sudden drop in query success -- of about 30% -- which appears between commit revisions v6696 (41st schema version) and v6710 (42nd schema version) for SVN commit version to ordinal numbers conversion. highlights one of the most intense evolution steps of the MediaWiki data management core, involving a deep change in the management of article revisions;
  • the lab-generated and real-world templates carry very similar information. This seems to indicate that our local query generation method is capable of producing a representative set of queries.

The second Figure shows a graph of the average execution success rates for the latter three query sets. Some interesting observations are as follows:

  • the synthetic probe queries, by failing systematically when a column or a table are modified, highlight the portion of the schema affected (changed in such a way that makes query to fail) by the evolution. The figure shows how the schema evolution invalidates (in the worst case) only the 32% of the schema.
  • in the last version, a very large portion (77%) of the lab-gen templates fails due to schema evolution.
  • for lab-gen templates, the big evolution step between commit revisions v6696 (41st schema version) and v6710 (42nd schema version) invalidates over 70% of the queries.
  • lab-gen templates failure rate compared to synthetic probe queries failure rate (representing the affected portion of the schema) exposes that the schema modifications mainly affected the portion of the schema heavily used by the application (32% of the schema being affected invalidates 77% of the query templates).
  • the gap between the success rate of legacy query instances (2.9%) and legacy query templates (22%) shows that the failing templates actually correspond to the most common query instances (in our distribution).

Finally it is interesting to notice that the number of features of the MediaWiki software has grown in time; this explains the growth in the number of the query templates extracted from legacy queries (74) and current queries (148). This also affects the percentage (but not the absolute number) of queries failing due to each schema evolution (the current-query graph appear smoother).

All in all these experiments provide a clear evidence of the strong impact of schema changes on applications, and support the claim for better schema evolution support.

The Tool-suite

To collect the statistics described in this paper, we developed a set of tools, organized in a tool-suite. This step-by-step process, primarily designed to help researchers to gain better insight in the schema evolution of existing Information Systems, can be effectively exploited by:


  • DB administrators and developers, in any data-centric scenario, to analyze the history of the DB schema and create a (summarized) view of its evolution history. The tool suite will support the analysis of the evolution process and help to highlight possible flaws in the design and maintenance of the Information System.
  • Researchers and designers of support methods and tools for DB evolution and versioning, to test their approaches against real-life scenarios.

We now discuss some of the features of our tool-suite referring to its application to the MediaWiki DB.

First of all, by means of an appropriate tool, the 171 MediaWiki DB schema versions have been downloaded from SVN repository and batch-installed in a MySQL DBMS (MySQL version 5.0.22-Debian). We developed a tool, named statistics_collection, that can be applied on this data to derive the basic statistics of schema versions, such as schema size and average table/column lifetime. The statistics_collection tool queries the MySQL data dictionary (the information_schema meta-database) to gather the statistical measures presented so far.

For fine-grained view of the schema evolution we also provide the Schema Modification Operators extraction tool. This tool, by operating on the differences between subsequent schema versions, semiautomatically extracts a set of candidate SMOs describing the schema evolution, minimizing the user effort (Complex evolution patterns as the one appeared from the 41st and 42nd schema versions in MediaWiki, require the user to refine the set of SMOs according to his/her understanding of the schema evolution.). To estimate query success against different schema versions, the users can exploit a tool named query_success analyzer. This tool performs a query success rate analysis by batch-running its input queries against all schema versions. The tool, relying on MySQL query engine, measures and computes both per-query and aggregate success ratios (this is clearly a rough-estimation).

For users' convenience, we also provide a log_analyzer which can be used to extract and clean the SQL query instances and templates from the raw mysql_log format.

Every component of the tool-suite stores the collected information, in a non-aggregated form, in a database, named evolution_metadb. This database is later queried to provide statistical measures of the schema evolution. This approach, relying on the SQL aggregation operators, offers the user a flexible interface. The graphs and tables presented in this paper have been derived by means of appropriate SQL queries on the evolution_metadb; all the data collected for our MediaWiki analysis are released to the public and available in our Downloads section.

Downloads

All the data, schemas, queries, tools are available to be download. Moreover, a couple of useful services are running on our servers providing a sandbox for testing and investingating schema evolution.

Please visit our Downloads section


References

<bibreferences />

Personal tools