Main Page

From Schema Evolution
Revision as of 03:44, 3 March 2008 by Schemaevolution (Talk | contribs)

This webpage publishes the results of an intensive analysis of the MediaWiki DB backend. These results were presented at ICEIS 2008 [1] in the paper "Schema Evolution in Wikipedia: toward a Web Information System Benchmark", authored by Carlo A. Curino [2], Hyun J. Moon [3], Letizia Tanca [4] and Carlo Zaniolo [5].

We analyzed 4.5 years of development, comparing and studying over 170 schema versions. On this website we report the results of our analysis and provide the entire dataset we collected, with the aim of defining a unified benchmark for schema evolution.

MediaWiki Schema Evolution: a short Introduction

Evolving the database that is at the core of an Information System represents a difficult maintenance problem that has only been studied in the framework of traditional information systems. However, the problem is likely to be even more severe in web information systems, where open-source software is often developed through the contributions and collaboration of many groups and individuals. Therefore, in this paper, we present an in-depth analysis of the evolution history of the Wikipedia database and its schema; Wikipedia is the best-known example of a large family of web information systems built using the open-source MediaWiki software. Our study is based on: (i) a set of Schema Modification Operators that provide a simple conceptual representation for complex schema changes, and (ii) simple software tools to automate the analysis. This framework allowed us to dissect and analyze the 4.5 years of Wikipedia history, which was short in time, but intense in terms of growth and evolution. Beyond confirming the initial hunch about the severity of the problem, our analysis suggests the need for developing better methods and tools to support graceful schema evolution. Therefore, we briefly discuss documentation and automation support systems for database evolution, and suggest that the Wikipedia case study can provide the kernel of a benchmark for testing and improving such systems.
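To make the idea of Schema Modification Operators more concrete, the following is a minimal sketch of how a sequence of operators could be applied to a schema. It is not the operator set or the tooling used in the paper: the three operator classes, the `evolve` helper, the dictionary-based schema model, and the table and column names are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

# Assumption of this sketch: a schema is a mapping table name -> column names.
Schema = Dict[str, List[str]]


def _copy(schema: Schema) -> Schema:
    """Return a copy so that operators never mutate their input."""
    return {table: list(columns) for table, columns in schema.items()}


@dataclass(frozen=True)
class RenameTable:
    old: str
    new: str

    def apply(self, schema: Schema) -> Schema:
        out = _copy(schema)
        out[self.new] = out.pop(self.old)
        return out


@dataclass(frozen=True)
class AddColumn:
    table: str
    column: str

    def apply(self, schema: Schema) -> Schema:
        out = _copy(schema)
        out[self.table].append(self.column)
        return out


@dataclass(frozen=True)
class DropColumn:
    table: str
    column: str

    def apply(self, schema: Schema) -> Schema:
        out = _copy(schema)
        out[self.table].remove(self.column)
        return out


def evolve(schema: Schema, operators) -> Schema:
    """Apply the operators in order and return the resulting schema."""
    for op in operators:
        schema = op.apply(schema)
    return schema


# Hypothetical table and column names, used only to exercise the operators.
v1 = {"cur": ["cur_id", "cur_title", "cur_text"]}
steps = [
    RenameTable("cur", "page"),
    AddColumn("page", "page_len"),
    DropColumn("page", "cur_text"),
]
v2 = evolve(v1, steps)
```

Representing each change as a small, explicit operator is what allows a complex schema change to be recorded and replayed as a sequence of simple steps.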

MediaWiki Architecture

MediaWikiArchitecture.png


The MediaWiki software is a browser-based web application, whose architecture is described in detail in [Help:MediaWikiarchitecture] and in the MediaWiki Workbook2007 [6]. As shown in the figure above, users interact with the PHP frontend through a standard web browser, submitting a page request (e.g., a search for pages describing "Paris"). The frontend software consists of a simple presentation and management layer (MediaWiki PHP Scripts) interpreted by the Apache PHP engine. User requests are carried out by generating appropriate SQL queries (or updates), which are then issued against the data stored in the backend DB (e.g., the database is queried for articles whose text contains the term "Paris"). The backend DB can be stored in any DBMS: MySQL, being open-source and scalable, is the default DBMS for the MediaWiki software. The results returned by the DBMS are rendered in XHTML and delivered to the user's browser to be displayed (e.g., a set of links to pages mentioning "Paris" is rendered as an XHTML list).

Due to the heavy load on the Wikipedia installation of this software, much effort has been devoted to performance optimization, introducing several levels of caching (Rendered Web Page, DB caches, Media caches), which are particularly effective thanks to the very low rate (0.04%) of updates w.r.t. queries. Every modification of the DB schema has a strong impact on the queries the frontend can pose. Typically, each schema evolution step can require several queries to be modified, and therefore several PHP scripts (which cooperate to query the DB and render a page) to be fixed manually to compensate for the schema changes.
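The request flow described above, and the way a schema change breaks it, can be sketched in a few lines. This is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` module rather than PHP and MySQL so that it runs on its own, and the table `cur`, its columns, the renamed table `page`, and both helper functions are hypothetical.

```python
import sqlite3
from html import escape


def run_legacy_query(conn, term):
    """Frontend-style search query written against the old schema."""
    sql = "SELECT cur_title FROM cur WHERE cur_text LIKE ?"
    return [row[0] for row in conn.execute(sql, (f"%{term}%",))]


def render_xhtml_list(titles):
    """Render the query result as a simple XHTML list."""
    items = "".join(f"<li>{escape(title)}</li>" for title in titles)
    return f"<ul>{items}</ul>"


# Backend DB with the old (hypothetical) schema and two articles.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cur (cur_title TEXT, cur_text TEXT)")
conn.execute("INSERT INTO cur VALUES ('Paris', 'Paris is the capital of France.')")
conn.execute("INSERT INTO cur VALUES ('Rome', 'Rome is the capital of Italy.')")

# Before the schema change the request is answered normally.
before = run_legacy_query(conn, "Paris")
page_fragment = render_xhtml_list(before)

# A schema evolution step: the table is renamed.
conn.execute("ALTER TABLE cur RENAME TO page")

# The unchanged frontend query now fails and has to be fixed by hand.
try:
    run_legacy_query(conn, "Paris")
    broken = False
except sqlite3.OperationalError:
    broken = True
```

Even this single rename invalidates the query, which is why one evolution step can force changes in several frontend scripts at once.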

Available Schema

The base source of information is the MediaWiki SVN, freely browsable at:

http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/maintenance/tables.sql?view=markup

However, for convenience we provide a .tar.gz download of all the schema versions. The archive also contains a set of scripts to create, load and delete the entire MediaWiki history.

http://yellowstone.cs.ucla.edu/schema-evolution/documents/mediawiki-schema.tar.gz

Available Queries

In our dataset we have a mix of synthetic and real queries.

Synthetic Queries

The synthetic queries fall into two classes: queries generated by installing different versions of MediaWiki and logging the queries issued during typical user sessions, and completely synthetic queries. The latter set contains queries operating on entire tables and on single attributes, and can be used to obtain a rough estimate of the portion of the schema affected by an evolution step.
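The use of completely synthetic queries to estimate the affected portion of a schema can be illustrated as follows. This is a rough sketch, not the procedure or scripts from the dataset: it assumes one synthetic query per table and one per attribute, models a schema as a dictionary, and treats a query as broken when the table or attribute it touches no longer exists. All function, table, and column names are hypothetical.

```python
def synthetic_queries(schema):
    """One query per table (column is None) and one per attribute."""
    queries = []
    for table, columns in schema.items():
        queries.append((table, None))
        for column in columns:
            queries.append((table, column))
    return queries


def still_valid(query, schema):
    """A query survives if its table (and attribute, if any) still exists."""
    table, column = query
    if table not in schema:
        return False
    return column is None or column in schema[table]


def affected_fraction(old_schema, new_schema):
    """Fraction of the synthetic queries on the old schema broken by the step."""
    queries = synthetic_queries(old_schema)
    broken = [q for q in queries if not still_valid(q, new_schema)]
    return len(broken) / len(queries)


# Hypothetical schema versions: one column is dropped between them.
old = {
    "cur": ["cur_id", "cur_title", "cur_text"],
    "user": ["user_id", "user_name"],
}
new = {
    "cur": ["cur_id", "cur_title"],
    "user": ["user_id", "user_name"],
}
fraction = affected_fraction(old, new)  # 1 broken query out of 7
```

The estimate is coarse by design: it says how much of the schema surface changed, not how many real application queries depend on the changed parts.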

Lab-Generated MediaWiki Queries:

MediaWiki 1.3 (4,175 query + update instances):

MediaWiki 1.3 (1,948 distinct instances):

MediaWiki 1.3 (1,657 distinct "SELECT"-only instances):

MediaWiki 1.3 (74 query templates): http://yellowstone.cs.ucla.edu/schema-evolution/documents/mw_13_legacy_template.sql
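The relationship between logged instances, distinct instances, and query templates can be sketched as below. The page does not state how the templates were derived, so this is only one plausible approach, assumed for illustration: string and numeric literals are replaced by a `?` placeholder, and queries that become identical are counted as one template. The function names and the sample log are hypothetical.

```python
import re

# Single-quoted SQL string literals (with backslash escapes) and bare numbers.
_STRING = re.compile(r"'(?:[^'\\]|\\.)*'")
_NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")


def to_template(sql):
    """Replace literals with '?' and normalize whitespace."""
    sql = _STRING.sub("?", sql)
    sql = _NUMBER.sub("?", sql)
    return " ".join(sql.split())


def summarize(log):
    """Count instances, distinct instances, SELECT-only ones, and templates."""
    distinct = set(log)
    selects = {q for q in distinct if q.lstrip().upper().startswith("SELECT")}
    templates = {to_template(q) for q in distinct}
    return {
        "instances": len(log),
        "distinct": len(distinct),
        "distinct_select": len(selects),
        "templates": len(templates),
    }


# Hypothetical query log from a user session.
log = [
    "SELECT cur_title FROM cur WHERE cur_id = 12",
    "SELECT cur_title FROM cur WHERE cur_id = 12",
    "SELECT cur_title FROM cur WHERE cur_id = 47",
    "SELECT cur_id FROM cur WHERE cur_title = 'Paris'",
    "UPDATE cur SET cur_counter = 3 WHERE cur_id = 12",
]
summary = summarize(log)
```

Collapsing instances into templates is what lets a few dozen templates stand in for thousands of logged queries when checking which queries an evolution step affects.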


Synthetic Data:

Available Data

To ease testing of the MediaWiki backend, we provide a set of data and freely accessible installed versions of MediaWiki.

Installed Versions of MediaWiki

To allow a comparison of the features available in the main MediaWiki software releases, we installed all of them and made them available for testing. [7]


SoftwareRelease.png


http://yellowstone.cs.ucla.edu/mediawiki/mediawiki-1.2.0/ 13-Mar-2004 08:08

http://yellowstone.cs.ucla.edu/mediawiki/mediawiki-1.3.0/ 02-Aug-2004 09:51

http://yellowstone.cs.ucla.edu/mediawiki/mediawiki-1.4.0/ 07-Mar-2005 18:07

http://yellowstone.cs.ucla.edu/mediawiki/mediawiki-1.5.1/ 22-Jul-2005 23:30

http://yellowstone.cs.ucla.edu/mediawiki/mediawiki-1.6.0/ 05-Apr-2006 03:11

http://yellowstone.cs.ucla.edu/mediawiki/mediawiki-1.7.0/ 07-Jul-2006 10:30

http://yellowstone.cs.ucla.edu/mediawiki/mediawiki-1.8.0/ 10-Oct-2006 15:37

http://yellowstone.cs.ucla.edu/mediawiki/mediawiki-1.9.0/ 10-Jan-2007 12:38

http://yellowstone.cs.ucla.edu/mediawiki/mediawiki-1.10.0/ 22-Apr-2007 14:17

http://yellowstone.cs.ucla.edu/mediawiki/mediawiki-1.11.0/ 28-Jun-2007 18:19


Full Access to the Backend MySQL DB

To provide further insight, we set up access to the MySQL backend of the above installations, so that users can freely access the MediaWiki backend and test simple queries. Access is read-only, in order to avoid vandalism.

user: "mediawikireder"
password: "imareader"

The phpMyAdmin web access is available at: http://yellowstone.cs.ucla.edu/phpMyAdmin/
