Difference between revisions of "Main Page"

From Schema Evolution
Jump to: navigation, search
 
(42 intermediate revisions by 2 users not shown)
Line 1: Line 1:
This webpage is dedicated to publish the results of an intense analysis of the MediaWiki DB backend, which constitutes the core of the '''Pantha Rei Schema Evolution Benchmark'''.
+
This wiki reports the research advances of the '''Panta Rhei''', a research project for data management under schema evolution.
These results have been presented at  ICEIS 2008 [http://www.iceis.org/] in the paper "Schema Evolution in Wikipedia: toward a Web Information System Benchmark" authored
+
by Carlo A. Curino [http://carlo.curino.us/], Hyun J. Moon[http://www.cs.ucla.edu/~hjmoon/], Letizia Tanca[http://home.dei.polimi.it/tanca/] and Carlo Zaniolo[http://www.cs.ucla.edu/~zaniolo/]. The paper is available here: [http://yellowstone.cs.ucla.edu/schema-evolution/documents/curino-schema-evolution.pdf]
+
  
4.5 year of development have been analyzed and over 170 schema versions compared and studied.
+
The Panta Rhei project aims at providing powerful
In this website we report the results of our analysis and provide the entire dataset we collected, to the purpose
+
tools that: (i) facilitate schema evolution and guide the Database
of defining a unified Benchmark for Schema Evolution.
+
Administrator in planning and evaluating changes, (ii) support
 +
automatic rewriting of legacy queries against the current schema
 +
version, (iii) enable efficient archiving of the histories of data
 +
and metadata, and (iv) support complex temporal queries over such
 +
histories, and (v) automate the documentation and querying of metadata histories.
  
__TOC__
 
  
== '''Why Wikipedia''' ==
+
== People ==
 +
* MIT
 +
** Carlo A. Curino [http://carlo.curino.us/]
  
[[Image:documents/viz.png]]
+
* Politecnico di Milano
 +
** Letizia Tanca [http://home.dei.polimi.it/tanca/]
 +
** Fabrizio Moroni
  
== '''MediaWiki Schema Evolution''': a short Introduction ==
+
* UCLA
Evolving the database that is at the core of an Information System represents a difficult maintenance problem
+
** Hyun J. Moon [http://yellowstone.cs.ucla.edu/~hjmoon/]
that has only been studied in the framework of traditional information systems. However, the problem is likely
+
** Carlo Zaniolo [http://www.cs.ucla.edu/~zaniolo/]
to be even more severe in web information systems, where open-source software is often developed through
+
** Myungwon Ham
the contributions and collaboration of many groups and individuals. Therefore, in this paper, we present an in-
+
depth analysis of the evolution history of the Wikipedia database and its schema; Wikipedia is the best-known
+
example of a large family of web information systems built using the open-source '''MediaWiki''' software. Our
+
study is based on: (i) a set of Schema Modification Operators that provide a simple conceptual representation
+
for complex schema changes, and (ii) simple software tools to automate the analysis. This framework allowed
+
us to dissect and analyze the 4.5 years of Wikipedia history, which was short in time, but intense in terms of
+
growth and evolution. Beyond confirming the initial hunch about the severity of the problem, our analysis
+
suggests the need for developing better methods and tools to support graceful schema evolution. Therefore,
+
we briefly discuss documentation and automation support systems for database evolution, and suggest that the
+
Wikipedia case study can provide the kernel of a benchmark for testing and improving such systems.
+
  
== MediaWiki Architecture ==
+
* UC San Diego
[[Image:MediaWikiArchitecture.png]]
+
** Alin Deutsch [http://db.ucsd.edu/People/alin/]
 +
** Chien-Yi Hou
  
 +
== Projects ==
 +
Within this macro-project the following projects have been developed (please follow the links for further details):
  
The MediaWiki software is a browser-based web-application, whose
+
* The [[Schema_Evolution_Benchmark | '''Pantha Rei Schema Evolution Benchmark''']] <bibref f="defbib.bib">iceis2008</bibref>, a benchmark for schema evolution developed from the actual evolution of the MediaWiki DB backend.
architecture is described in details in [Help:MediaWikiarchitecture] and in the MediaWiki Workbook2007 [http://www.scribd.com/doc/43868/Wikipedia-site-internals-workbook-2007?ga_related_doc=1].
+
* The [[HMM | '''History Metadata Manager''']] <bibref f="defbib.bib">ecdm2008</bibref>, a tool to support temporal queries over metadata histories, and its Semantic Web Extension the [[SHMM | '''Semantic HMM''']] <bibref f="defbib.bib">stsm2008</bibref>
As shown in Figure, the users interact with the PHP frontend
+
* The prototype system [[Prism|'''PRISM: tool for Graceful Schema Evolution''']] <bibref f="defbib.bib">vldb2008a</bibref>, <bibref f="defbib.bib">curinovldb2011update</bibref>, and its demo [[PrismDemo| '''Prism demo page''' ]]<bibref f="defbib.bib">icde2009demo</bibref>
through a standard web browser, submitting a page request (e.g.,
+
a search for pages describing ``Paris''). The frontend software consists of a
+
simple presentation and management layer (MediaWiki PHP Scripts)
+
interpreted by the Apache PHP engine.
+
The user requests are carried out by generating appropriate SQL queries
+
(or updates), that are then issued against the data stored in the backend DB
+
(e.g., the database is queried looking for article's text containing the term ``Paris''). The backend DB can
+
be stored in any DBMS: MySQL, being open-source and scalable, is the
+
default DBMS for the MediaWiki software. The results returned by
+
the DBMS are rendered in XHTML and delivered to the user's
+
browser to be displayed (e.g., a set of of links to pages mentioning ``Paris''
+
is rendered as an XHTML list). Due to the heavy load of the Wikipedia
+
installation of this software, much of effort has been devoted to
+
performance optimization, introducing several levels of caching
+
(Rendered Web Page, DB caches, Media caches), which is particularly
+
effective thanks to the very low rate (0.04\%) of updates w.r.t.
+
queries. Obviously, every modification of the DB schema
+
has a strong impact on the queries the frontend can pose. Typically
+
each schema evolution step can require several queries to be
+
modified, and so several PHP scripts (cooperating to interrogate the DB and render a page) to be manually fixed, in order to balance
+
the schema changes.
+
  
== Available Schema ==
+
* The prototype system [[Prima|'''PRIMA: a system for querying Transaction-Time DB under evolving schema''']] <bibref f="defbib.bib">Vldb2008b</bibref> and its demo<bibref f="defbib.bib">moon09sigmoddemo</bibref>.
  
The base source of information is the MediaWiki SVN, freely browsable at:
+
* The extension of PRIMA named [[AIMS|'''Scalable Architecture and Query Optimization for Transaction-time DBs with Evolving Schemas''']] <bibref f="defbib.bib">Sigmod2010</bibref>.
+
http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/maintenance/tables.sql?view=markup
+
  
However to simplify life we provide a .tar.gz download of all the schema versions. It also contains a set of scripts to create, load and delete the entire MediaWiki history.
+
== Funding ==
 +
This work was supported in part by '''NSF-IIS''' award '''0705345''': ''“III-COR: Collaborative Research: Graceful Evolution and Historical Queries in Information Systems – A Unified Approach"''
  
http://yellowstone.cs.ucla.edu/schema-evolution/documents/mediawiki-schema.tar.gz
+
== References ==
 
+
<bibreferences />
== Available Queries ==
+
 
+
In our dataset we have a mix of synthetic and real queries.
+
 
+
=== Synthetic Queries ===
+
 
+
The synthetic queries we have are divided into two classes, a set of queries generated by installing MediaWiki (different versions) and logging the query generating during typical user sessions, and completely synthetic queries. This last set contains queries operating on entire queries and on single attributes and can be use to obtain a rough estimation of the portion of the schema being affected by an evolution step.
+
 
+
Lab-Generated MediaWiki Queries:
+
 
+
* Mediawiki 1.3 (~4,175 query + update instances): http://yellowstone.cs.ucla.edu/schema-evolution/documents/mw13_query_update_all.sql
+
* Mediawiki 1.3 (~1,948 distinct instances): http://yellowstone.cs.ucla.edu/schema-evolution/documents/mw13_query_update_distinct.sql
+
* Mediawiki 1.3 (~1,657 distinct query only instances): http://yellowstone.cs.ucla.edu/schema-evolution/documents/mw13_query_only_distinct.sql
+
* Mediawiki 1.3 (~75 query templates): http://yellowstone.cs.ucla.edu/schema-evolution/documents/mw13_legacy_template.sql
+
 
+
* Mediawiki 1.11 (~2,346 query update instances):http://yellowstone.cs.ucla.edu/schema-evolution/documents/mw1.11_query_update_all.sql
+
* Mediawiki 1.11 (~385 query update distinct instances):http://yellowstone.cs.ucla.edu/schema-evolution/documents/mw1.11_query_update_distinct.sql
+
* Mediawiki 1.11 (~147 query distinct instances):http://yellowstone.cs.ucla.edu/schema-evolution/documents/mw1.11_query_only_distinct.sql
+
* Mediawiki 1.11 (~256 query templates):http://yellowstone.cs.ucla.edu/schema-evolution/documents/mw1.11_256_query_templates.sql
+
 
+
Syntethic Data:
+
* MediaWiki 1.3 (Version 28, SVN commit 4696) synthetic 1-attribute (133) queries: http://yellowstone.cs.ucla.edu/schema-evolution/documents/v28_synthetic.sql
+
 
+
 
+
=== Real Queries ===
+
 
+
The real queries have been obtain by cleaning the log of the Wikipedia on-line profiler available at: http://noc.wikimedia.org/cgi-bin/report.py?db=enwiki&sort=real&limit=50000 
+
 
+
The available data are the templates, as extracted by the MediaWiki profiler.
+
 
+
* (~1,945) query templates, cross version: http://yellowstone.cs.ucla.edu/schema-evolution/documents/wikipedia_real_all.sql
+
 
+
Moreover we report the data available from the Wikipedia profiler, i.e., number of execution of each query, cpu and real execution time.
+
 
+
From the Wikipedia profiler the information on the unit measure and exact column semantics is not available, therefore we report the data as-is, the interested user can refer to the on-line profiler for further details.
+
 
+
The data are reported in a CSV file.
+
 
+
* (~1,945) query templates and workload data: http://yellowstone.cs.ucla.edu/schema-evolution/documents/wikipedia_profiler_data.csv
+
 
+
==Available Data ==
+
 
+
=== Wikipedia Dump Download ===
+
Since MediaWiki is used by over 30.000 wiki websites around the world including Wikipedia the availability of DB data is impressive.
+
In particular the Wikimedia Foundation releases the entire Wikipedia DB dump bi-weekly.
+
The user can experience with small data-set for some non-popular language (<10Mb) or work on the entire English Wikipedia *enwiki* or even install the entire dataset > 700Gb.
+
 
+
To obtain updated data contents please refer to the official Wikimedia page: http://download.wikimedia.org/backup-index.html
+
the downloads are xml and sql and appropriate tools are offered to simplify the import of the data.
+
Please bare in mind that the prefered DBMS for MediaWiki is MySQL.
+
 
+
If for your research you are interested into older backup, please contact us.
+
 
+
== Available Installed Softwares ==
+
 
+
=== Installed Versions of MediaWiki===
+
In order to ease the testing of MediaWiki backend we provide a set of data and installed version of MediaWiki freely accessible.
+
To provide a comparison of the features available in the main MediaWiki Software Release we installed all of them, and they are available to test.
+
[http://yellowstone.cs.ucla.edu/mediawiki/]
+
 
+
[[Image:SoftwareRelease.png]]
+
 
+
 
+
* http://yellowstone.cs.ucla.edu/mediawiki/mediawiki-1.2.0/        13-Mar-2004 08:08   
+
* http://yellowstone.cs.ucla.edu/mediawiki/mediawiki-1.3.0/        02-Aug-2004 09:51   
+
* http://yellowstone.cs.ucla.edu/mediawiki/mediawiki-1.4.0/        07-Mar-2005 18:07   
+
* http://yellowstone.cs.ucla.edu/mediawiki/mediawiki-1.5.1/        22-Jul-2005 23:30     
+
* http://yellowstone.cs.ucla.edu/mediawiki/mediawiki-1.6.0/        05-Apr-2006 03:11   
+
* http://yellowstone.cs.ucla.edu/mediawiki/mediawiki-1.7.0/        07-Jul-2006 10:30     
+
* http://yellowstone.cs.ucla.edu/mediawiki/mediawiki-1.8.0/        10-Oct-2006 15:37     
+
* http://yellowstone.cs.ucla.edu/mediawiki/mediawiki-1.9.0/        10-Jan-2007 12:38     
+
* http://yellowstone.cs.ucla.edu/mediawiki/mediawiki-1.10.0/      22-Apr-2007 14:17     
+
* http://yellowstone.cs.ucla.edu/mediawiki/mediawiki-1.11.0/      28-Jun-2007 18:19
+
 
+
 
+
==== Full Access to the Backend MySQL DB ====
+
 
+
To provide further insight we setup an access to the MySQL backend for the above installations. In this way the user can freely access the MediaWiki bakend and test simple queries.
+
The access is limited to reading, in order to avoid vandalism.
+
'''user:''' "mediawikireder"
+
'''password:''' "imareader"
+
The phpMyAdmin web access is the following: http://yellowstone.cs.ucla.edu/phpMyAdmin/
+

Latest revision as of 17:27, 6 December 2010

This wiki reports the research advances of the Panta Rhei, a research project for data management under schema evolution.

The Panta Rhei project aims at providing powerful tools that: (i) facilitate schema evolution and guide the Database Administrator in planning and evaluating changes, (ii) support automatic rewriting of legacy queries against the current schema version, (iii) enable efficient archiving of the histories of data and metadata, and (iv) support complex temporal queries over such histories, and (v) automate the documentation and querying of metadata histories.


Contents

[edit] People

  • MIT
    • Carlo A. Curino [1]
  • Politecnico di Milano
    • Letizia Tanca [2]
    • Fabrizio Moroni
  • UCLA
    • Hyun J. Moon [3]
    • Carlo Zaniolo [4]
    • Myungwon Ham
  • UC San Diego
    • Alin Deutsch [5]
    • Chien-Yi Hou

[edit] Projects

Within this macro-project the following projects have been developed (please follow the links for further details):

[edit] Funding

This work was supported in part by NSF-IIS award 0705345: “III-COR: Collaborative Research: Graceful Evolution and Historical Queries in Information Systems – A Unified Approach"

[edit] References

<bibreferences />

Personal tools