Difference between revisions of "Schema Evolution Benchmark"

From Schema Evolution
Jump to: navigation, search
(Micro-Classification of Changes)
Line 152: Line 152:
 
[Moon et al., 2008, Curino et al., 2008c] (briefly described in Table 3). SMOs capture the essence of the  
 
[Moon et al., 2008, Curino et al., 2008c] (briefly described in Table 3). SMOs capture the essence of the  
 
existing works, but can also express schema changes  
 
existing works, but can also express schema changes  
not modeled by previous approaches. For example, by using function13 in the ADD COLUMN operator,  
+
not modeled by previous approaches. For example, by using function in the ADD COLUMN operator,  
 
SMOs can support semantic conversion of columns  
 
SMOs can support semantic conversion of columns  
 
(e.g., currency exchange), column concatenation/split (e.g., different address formats), and other similar changes that have been heavily exploited in modeling MediaWiki schema changes. The effectiveness of SMOs have been validated in [Moon et al.,  
 
(e.g., currency exchange), column concatenation/split (e.g., different address formats), and other similar changes that have been heavily exploited in modeling MediaWiki schema changes. The effectiveness of SMOs have been validated in [Moon et al.,  
Line 169: Line 169:
 
in a sequence, to describe complex structural changes,  
 
in a sequence, to describe complex structural changes,  
 
as those occured in the MediaWiki DB schema evolution.  
 
as those occured in the MediaWiki DB schema evolution.  
Classification Using SMOs In this context we exploit SMOs as a pure classification instrument to provide a fine-grained analysis of the types of change  
+
Classification Using SMOs In this context we exploit SMOs as a pure classification instrument to provide a fine-grained analysis of the types of change  
 
the schema has been subject to. While there might  
 
the schema has been subject to. While there might  
 
be several ways to describe a schema evolution step  
 
be several ways to describe a schema evolution step  

Revision as of 17:51, 12 May 2008

This webpage is dedicated to publish the results of an intense analysis of the MediaWiki DB backend, which constitutes the core of the Pantha Rei Schema Evolution Benchmark. These results have been presented at ICEIS 2008 [1] in the paper "Schema Evolution in Wikipedia: toward a Web Information System Benchmark" authored by Carlo A. Curino [2], Hyun J. Moon[3], Letizia Tanca[4] and Carlo Zaniolo[5]. The paper is available here: [6]

4.5 year of development have been analyzed and over 170 schema versions compared and studied. In this website we report the results of our analysis and provide the entire dataset we collected, to the purpose of defining a unified Benchmark for Schema Evolution.

Contents


If you are only interested in downloading the dataset please visit our Downloads section.


Why Wikipedia

Google Search statistics on wikipedia popularity
Alexa.com statistics on wikipedia popularity

Wikipedia represent one of the 10 most popular websites in the WWW, is a DB-centric Web Information System, and is released under open-source license.

As shown by the two following graphs, showing respectively the google search popularity and the percentage of user visiting Wikipedia by http://www.alexa.com. Moreover, the PHP back-end underlying wikipedia is shared by other 30,000 wikis. Both software and content are released under open-source licenses. This make Wikipedia, and thus MediaWiki a perfect starting point for our goals.

MediaWiki Schema Evolution: a short Introduction

Evolving the database that is at the core of an Information System represents a difficult maintenance problem that has only been studied in the framework of traditional information systems. However, the problem is likely to be even more severe in web information systems, where open-source software is often developed through the contributions and collaboration of many groups and individuals. Therefore, in this paper, we present an in-depth analysis of the evolution history of the Wikipedia database and its schema; Wikipedia is the best-known example of a large family of web information systems built using the open-source MediaWiki software. Our study is based on: (i) a set of Schema Modification Operators that provide a simple conceptual representation for complex schema changes, and (ii) simple software tools to automate the analysis. This framework allowed us to dissect and analyze the 4.5 years of Wikipedia history, which was short in time, but intense in terms of growth and evolution. Beyond confirming the initial hunch about the severity of the problem, our analysis suggests the need for developing better methods and tools to support graceful schema evolution. Therefore, we briefly discuss documentation and automation support systems for database evolution, and suggest that the Wikipedia case study can provide the kernel of a benchmark for testing and improving such systems.

MediaWiki Architecture

MediaWiki Architecture

The MediaWiki software is a browser-based web-application, whose architecture is described in details in [Help:MediaWikiarchitecture] and in the MediaWiki Workbook2007 [7]. As shown in Figure, the users interact with the PHP frontend through a standard web browser, submitting a page request (e.g., a search for pages describing ``Paris). The frontend software consists of a simple presentation and management layer (MediaWiki PHP Scripts) interpreted by the Apache PHP engine. The user requests are carried out by generating appropriate SQL queries (or updates), that are then issued against the data stored in the backend DB (e.g., the database is queried looking for article's text containing the term ``Paris). The backend DB can be stored in any DBMS: MySQL, being open-source and scalable, is the default DBMS for the MediaWiki software. The results returned by the DBMS are rendered in XHTML and delivered to the user's browser to be displayed (e.g., a set of of links to pages mentioning ``Paris is rendered as an XHTML list). Due to the heavy load of the Wikipedia installation of this software, much of effort has been devoted to performance optimization, introducing several levels of caching (Rendered Web Page, DB caches, Media caches), which is particularly effective thanks to the very low rate (0.04\%) of updates w.r.t. queries. Obviously, every modification of the DB schema has a strong impact on the queries the frontend can pose. Typically each schema evolution step can require several queries to be modified, and so several PHP scripts (cooperating to interrogate the DB and render a page) to be manually fixed, in order to balance the schema changes.

MediaWiki Growth

Attributes and Tables growth of MediaWiki Schema

In this section, we analyze the schema evolution of MediaWiki based on its 171 schema versions, as committed to SVN between April 2003 (first schema revision) and November 2007 (date of this analysis).

Schema Size Growth In Figures, we report the size of MediaWiki DB schema in history, in terms of the number of tables and columns, respectively. The graphs show an evident trend of growth in sizes, where the number of tables has increased from 17 to 34 (100% increase) and the number of columns from 100 to 242 (142%). Sudden drops in the graphs are due to schema versions with syntax errors, i.e., schema versions that could not be properly installed. In both graphs we observe different rates of growth over time, which seem to be related to the time periods preceding or following official releases of the overall software (see Table in section Available Software Version).

Schema growth is due to three main driving forces as follows:

  • performance improvement, e.g., introduction of dedicated cache tables,
  • addition of new features, e.g., support for logging and content validation,
  • the growing need for preservation of DB content history, i.e., introduction of tables and columns to store outdated multimedia content such as the

'filearchive' table. The Figure shows a histogram representation of the table lifetimes, in terms of number of versions. The lifetimes range from very long ones, e.g., the user table that was alive throughout the entire history, to short ones, e.g., random table that only survived for two revisions. On average, each table lasted 103.3 versions (60.4% of the total DB history). Figure 5 presents lifetimes of columns in histogram, where columns lasted 97.17 versions on average (56.8% of the total DB history). Interestingly, both figures show that there are two main groups of tables and columns: “short-living” and “long-living”. The former might be due to the fact that the schema has been growing lately so a significant portion of tables and columns has been introduced only recently. The latter can be explained noting that the core tables/columns tend to be rather stable throughout the entire history.

Some historical statistics on Wikipedia content growth are available at http://stats.wikimedia.org/EN/TablesDatabaseSize.htm .

Per-Month Revision Count


The Figure Per-month Revision Count shows how many schema versions were committed during each month in history, providing an estimation of the development effort devoted to the DB backend over time, while Table "MediaWiki Release History" shows the main releases of the sofware front-end.

MediaWiki Relase History

More Advanced Statistics

Macro-Classification of Changes

"Macro Schema changes"

We group the 170 evolution steps based on the types of evolution they present as in the Table. While the “actual schema changes” have an impact on the queries, as they modify the schema layout, the evolution of the DBMS engine, indexes, and data types, (while being relevant to performance) does not require any query correction, because of the physical data-independence provided by the DBMS. The table shows the frequencies of the types of changes among the 170 evolution steps. In particular, the table highlights that:

  • almost 55% of the evolution steps involve actual schema changes (further discussed in Section 3.3);
  • over 40% of the evolution steps involve index/key

adjustments and this is due to the performance-critical role of the DB in a data-intensive, high-load, website such as Wikipedia;

  • 8.8% of the evolution steps were rollbacks to previous schema versions;
  • 7.6% of the analyzed evolution steps present only

documentation changes.

Micro-Classification of Changes

SMO Table

Schema Modification Operators: To better understand the Relational DB schema evolution, we introduce a classification of the “actual schema changes”. Different formalisms can be exploited for this purpose. Shneiderman and Thomas proposed in [Shneiderman and Thomas, 1982] a comprehensive set of schema changes, including structural schema changes and also changes regarding the keys and dependencies. More recently, Bernstein et al. have also proposed a set of schema evolution primitives using algebra-based constraints as their primitives [Bernstein et al., 2006, Bernstein et al., 2008]. Among several options, we chose the Schema Modification Operators (SMOs) that we proposed in [Moon et al., 2008, Curino et al., 2008c] (briefly described in Table 3). SMOs capture the essence of the existing works, but can also express schema changes not modeled by previous approaches. For example, by using function in the ADD COLUMN operator, SMOs can support semantic conversion of columns (e.g., currency exchange), column concatenation/split (e.g., different address formats), and other similar changes that have been heavily exploited in modeling MediaWiki schema changes. The effectiveness of SMOs have been validated in [Moon et al., 2008, Curino et al., 2008c], where the PRISM and PRIMA systems used SMOs to describe schema evolution in transaction-time databases and to support historical query reformulations over multi-schema-version transaction-time databases. The syntax of SMO is similar to that of SQL DDL [ISO/IEC 9075-*: 2003, 2003, Eisenberg et al., 2004], and provides a concise way to describe typical modifications of a database schema and the corresponding data migration. Every SMO takes as input a schema and produces as output a new version of the same schema. Table 3 presents a list of SMOs, operating on tables (the first six) and on columns (the last five) of a given DB schema, together with a brief explanation. Note that simple SMOs can be arbitrarily combined in a sequence, to describe complex structural changes, as those occured in the MediaWiki DB schema evolution. Classification Using SMOs In this context we exploit SMOs as a pure classification instrument to provide a fine-grained analysis of the types of change the schema has been subject to. While there might be several ways to describe a schema evolution step by means of SMOs, we carefully select, analyzing the available documentation, the most natural set of SMOs describing each schema change in the MediaWiki history.

SMO Frequency

Table "SMO Frequency" shows the distribution of the SMOs, presenting, for each type, how many times it has been used in the entire schema evolution history. Is interesting to notice that the more sophisticated SMOs (e.g., MERGE TABLE) while being indispensable are not very common. The balance between column/table additions and deletions highlights the “content preserving” attitude of Wikipedia14 .

SMO per version

In Figure we show the number of SMOs (overall) for each evolution step. The curve shows how the schema evolution has been mainly a continuous process of adjustment, with few exceptions shown as spikes in the figure, coinciding with major evolution steps, such as:

  • v6696 (41st version) - v6710 (42nd version), 92

SMOs: a change in the storage strategy of the ar- ticle versions,

  • v9116 (61st version) - v9199 (62nd version), 12

SMOs: a change in link management,

  • v20393(138th version) - v20468 (139th version), 9 SMOs: history management (deletion and log

features added to several tables).

The Impact on Applications

Downloads

All the data, schemas, queries, tools are available to be download. Moreover, a couple of useful services are running on our servers providing a sandbox for testing and investingating schema evolution.

Please visit our Downloads section

Personal tools