Schema Evolution Benchmark


This webpage is dedicated to publishing the results of an in-depth analysis of the MediaWiki DB backend, which constitutes the core of the Pantha Rei Schema Evolution Benchmark. These results were presented at ICEIS 2008 [1] in the paper "Schema Evolution in Wikipedia: toward a Web Information System Benchmark", authored by Carlo A. Curino [2], Hyun J. Moon [3], Letizia Tanca [4] and Carlo Zaniolo [5]. The paper is available here: [6]

4.5 years of development have been analyzed, and over 170 schema versions compared and studied. On this website we report the results of our analysis and provide the entire dataset we collected, with the goal of defining a unified benchmark for schema evolution.


Why Wikipedia

Google Search statistics on wikipedia popularity
Alexa.com statistics on wikipedia popularity

Wikipedia is one of the 10 most popular websites on the Web, is a DB-centric Web Information System, and is released under an open-source license.

This popularity is shown by the two graphs above, which report respectively the Google search popularity of Wikipedia and the percentage of Web users visiting Wikipedia according to http://www.alexa.com. Moreover, the PHP back-end underlying Wikipedia is shared by over 30,000 other wikis. Both the software and the content are released under open-source licenses. This makes Wikipedia, and thus MediaWiki, a perfect starting point for our goals.

MediaWiki Schema Evolution: a short Introduction

Evolving the database that is at the core of an Information System represents a difficult maintenance problem that has only been studied in the framework of traditional information systems. However, the problem is likely to be even more severe in web information systems, where open-source software is often developed through the contributions and collaboration of many groups and individuals. Therefore, in this paper, we present an in-depth analysis of the evolution history of the Wikipedia database and its schema; Wikipedia is the best-known example of a large family of web information systems built using the open-source MediaWiki software. Our study is based on: (i) a set of Schema Modification Operators that provide a simple conceptual representation for complex schema changes, and (ii) simple software tools to automate the analysis. This framework allowed us to dissect and analyze the 4.5 years of Wikipedia history, which was short in time, but intense in terms of growth and evolution. Beyond confirming the initial hunch about the severity of the problem, our analysis suggests the need for developing better methods and tools to support graceful schema evolution. Therefore, we briefly discuss documentation and automation support systems for database evolution, and suggest that the Wikipedia case study can provide the kernel of a benchmark for testing and improving such systems.
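The Schema Modification Operators mentioned in the abstract can be sketched concretely. The operator signatures and the toy schema below are illustrative assumptions for this page, not the exact operator set defined in the paper:

```python
# Illustrative sketch: schema versions as dicts mapping table name -> set of columns,
# and Schema Modification Operators (SMOs) as pure functions that produce a new
# schema version. Operator names follow the usual SMO vocabulary (CREATE TABLE,
# ADD COLUMN, ...); the toy schema itself is invented for this example.

def create_table(schema, table, columns):
    new = dict(schema)
    new[table] = set(columns)
    return new

def drop_table(schema, table):
    new = dict(schema)
    del new[table]
    return new

def add_column(schema, table, column):
    new = dict(schema)
    new[table] = schema[table] | {column}
    return new

def rename_column(schema, table, old, new_name):
    new = dict(schema)
    new[table] = (schema[table] - {old}) | {new_name}
    return new

# A tiny evolution history expressed as a sequence of SMOs.
v1 = {"user": {"user_id", "user_name"}}
v2 = create_table(v1, "logging", {"log_id", "log_action"})
v3 = add_column(v2, "user", "user_email")

print(sorted(v3))          # ['logging', 'user']
print(sorted(v3["user"]))  # ['user_email', 'user_id', 'user_name']
```

Representing each evolution step as an operator, rather than a raw SQL diff, is what makes histories like Wikipedia's amenable to automated analysis.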

MediaWiki Architecture


The MediaWiki software is a browser-based web application, whose architecture is described in detail in [Help:MediaWikiarchitecture] and in the MediaWiki Workbook2007 [7]. As shown in the Figure, users interact with the PHP frontend through a standard web browser, submitting a page request (e.g., a search for pages describing "Paris"). The frontend software consists of a simple presentation and management layer (MediaWiki PHP Scripts) interpreted by the Apache PHP engine. The user requests are carried out by generating appropriate SQL queries (or updates), which are then issued against the data stored in the backend DB (e.g., the database is queried for articles whose text contains the term "Paris"). The backend DB can be stored in any DBMS: MySQL, being open-source and scalable, is the default DBMS for the MediaWiki software. The results returned by the DBMS are rendered in XHTML and delivered to the user's browser to be displayed (e.g., a set of links to pages mentioning "Paris" is rendered as an XHTML list).

Due to the heavy load of the Wikipedia installation of this software, much effort has been devoted to performance optimization, introducing several levels of caching (rendered web pages, DB caches, media caches), which is particularly effective thanks to the very low rate (0.04%) of updates w.r.t. queries. Obviously, every modification of the DB schema has a strong impact on the queries the frontend can pose. Typically, each schema evolution step requires several queries to be modified, and thus several PHP scripts (cooperating to interrogate the DB and render a page) to be manually fixed to accommodate the schema changes.
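The request flow described above can be sketched in a few lines. This is a minimal stand-in, assuming an in-memory SQLite database and an invented, simplified 'page' table in place of MediaWiki's real PHP scripts and MySQL schema:

```python
import sqlite3

# Stand-in for the backend DB: an in-memory SQLite database with a toy 'page'
# table (MediaWiki actually defaults to MySQL; table/column names and contents
# here are simplified inventions for the sketch).
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT, page_text TEXT)"
)
db.executemany(
    "INSERT INTO page (page_title, page_text) VALUES (?, ?)",
    [
        ("Paris", "Paris is the capital of France."),
        ("Lyon", "Lyon is a city in France."),
        ("Paris_Commune", "The Paris Commune of 1871..."),
    ],
)

def handle_request(term):
    """Frontend role: turn a page request into SQL, then render results as XHTML."""
    rows = db.execute(
        "SELECT page_title FROM page WHERE page_text LIKE ?", (f"%{term}%",)
    ).fetchall()
    items = "".join(f"<li>{title}</li>" for (title,) in rows)
    return f"<ul>{items}</ul>"

print(handle_request("Paris"))  # <ul><li>Paris</li><li>Paris_Commune</li></ul>
```

The sketch also shows why schema changes ripple upward: renaming `page_text` would break the SQL string inside `handle_request`, just as each MediaWiki schema change forces manual fixes to the PHP scripts that embed queries.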

MediaWiki Growth

Attributes and Tables growth of MediaWiki Schema

In this section, we analyze the schema evolution of MediaWiki based on its 171 schema versions, as committed to SVN between April 2003 (first schema revision) and November 2007 (date of this analysis).

Schema Size Growth. In the figures, we report the size of the MediaWiki DB schema over its history, in terms of the number of tables and columns, respectively. The graphs show an evident growth trend: the number of tables has increased from 17 to 34 (a 100% increase) and the number of columns from 100 to 242 (142%). Sudden drops in the graphs are due to schema versions with syntax errors, i.e., schema versions that could not be properly installed. In both graphs we observe different rates of growth over time, which seem to be related to the time periods preceding or following official releases of the overall software (see the Table in section Available Software Version).
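The growth percentages quoted above follow directly from the raw counts; a quick check:

```python
# Reproduce the growth percentages quoted in the text from the raw counts.
tables_first, tables_last = 17, 34
cols_first, cols_last = 100, 242

table_growth = 100 * (tables_last - tables_first) / tables_first
col_growth = 100 * (cols_last - cols_first) / cols_first

print(f"tables: +{table_growth:.0f}%")   # tables: +100%
print(f"columns: +{col_growth:.0f}%")    # columns: +142%
```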

Schema growth is due to three main driving forces:

  • performance improvement, e.g., the introduction of dedicated cache tables;
  • addition of new features, e.g., support for logging and content validation;
  • the growing need for preservation of DB content history, i.e., the introduction of tables and columns to store outdated multimedia content, such as the 'filearchive' table.

The Figure shows a histogram of table lifetimes, measured in number of versions. The lifetimes range from very long, e.g., the user table, which was alive throughout the entire history, to very short, e.g., the random table, which only survived for two revisions. On average, each table lasted 103.3 versions (60.4% of the total DB history). Figure 5 presents a histogram of column lifetimes, where columns lasted 97.17 versions on average (56.8% of the total DB history). Interestingly, both figures show that there are two main groups of tables and columns: "short-living" and "long-living". The former might be due to the fact that the schema has been growing lately, so a significant portion of tables and columns has been introduced only recently. The latter can be explained by noting that the core tables/columns tend to be rather stable throughout the entire history.
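The average lifetimes quoted above can be cross-checked against the 171-version history:

```python
# Check that the average lifetimes quoted in the text match the stated
# fractions of the 171-version schema history.
total_versions = 171
avg_table_life = 103.3   # average table lifetime, in versions
avg_col_life = 97.17     # average column lifetime, in versions

print(f"tables:  {100 * avg_table_life / total_versions:.1f}%")   # tables:  60.4%
print(f"columns: {100 * avg_col_life / total_versions:.1f}%")     # columns: 56.8%
```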

Some historical statistics on Wikipedia content growth are available at http://stats.wikimedia.org/EN/TablesDatabaseSize.htm .

More Advanced Statistics

Downloads

All the data, schemas, queries, and tools are available for download. Moreover, a couple of useful services are running on our servers, providing a sandbox for testing and investigating schema evolution.

Please visit our Downloads section
