This wiki reports the research advances of the '''Panta Rhei''', a research project for data management under schema evolution.

The Panta Rhei project aims at providing powerful tools that: (i) facilitate schema evolution and guide the Database Administrator in planning and evaluating changes, (ii) support automatic rewriting of legacy queries against the current schema version, (iii) enable efficient archiving of the histories of data and metadata, (iv) support complex temporal queries over such histories, and (v) automate the documentation and querying of metadata histories.

This website also publishes the results of an in-depth analysis of the MediaWiki DB backend, which constitutes the core of the '''Pantha Rei Schema Evolution Benchmark'''. These results were presented at ICEIS 2008 [http://www.iceis.org/] in the paper "Schema Evolution in Wikipedia: toward a Web Information System Benchmark", authored by Carlo A. Curino [http://carlo.curino.us/], Hyun J. Moon [http://www.cs.ucla.edu/~hjmoon/], Letizia Tanca [http://home.dei.polimi.it/tanca/], and Carlo Zaniolo [http://www.cs.ucla.edu/~zaniolo/]. The paper is available here: [http://yellowstone.cs.ucla.edu/schema-evolution/documents/curino-schema-evolution.pdf]

4.5 years of development have been analyzed, and over 170 schema versions compared and studied. On this website we report the results of our analysis and provide the entire dataset we collected, with the purpose of defining a unified benchmark for schema evolution.
  
__TOC__
 
  
== People ==

* MIT
** Carlo A. Curino [http://carlo.curino.us/]
* Politecnico di Milano
** Letizia Tanca [http://home.dei.polimi.it/tanca/]
** Fabrizio Moroni
* UCLA
** Hyun J. Moon [http://yellowstone.cs.ucla.edu/~hjmoon/]
** Carlo Zaniolo [http://www.cs.ucla.edu/~zaniolo/]
** Myungwon Ham
* UC San Diego
** Alin Deutsch [http://db.ucsd.edu/People/alin/]
** Chien-Yi Hou

== '''Why Wikipedia''' ==
[[Image:Viz.png|thumb|Google Search statistics on Wikipedia popularity]]
[[Image:Alexa.png|thumb|Alexa.com statistics on Wikipedia popularity]]

Wikipedia is one of the 10 most popular websites on the WWW, as shown by the two graphs on the right, which report, respectively, the Google search popularity of Wikipedia and the percentage of users visiting it according to [http://www.alexa.com http://www.alexa.com].
Moreover, the PHP back-end underlying Wikipedia is shared by over 30,000 other wikis, and both software and content are released under open-source licenses. This makes Wikipedia, and thus MediaWiki, a perfect starting point for our goals.

== '''MediaWiki Schema Evolution''': a short Introduction ==
Evolving the database that is at the core of an Information System represents a difficult maintenance problem that has only been studied in the framework of traditional information systems. However, the problem is likely to be even more severe in web information systems, where open-source software is often developed through the contributions and collaboration of many groups and individuals. Therefore, in our paper, we present an in-depth analysis of the evolution history of the Wikipedia database and its schema; Wikipedia is the best-known example of a large family of web information systems built using the open-source '''MediaWiki''' software. Our study is based on: (i) a set of Schema Modification Operators that provide a simple conceptual representation for complex schema changes, and (ii) simple software tools to automate the analysis. This framework allowed us to dissect and analyze the 4.5 years of Wikipedia history, which was short in time, but intense in terms of growth and evolution. Beyond confirming the initial hunch about the severity of the problem, our analysis suggests the need for developing better methods and tools to support graceful schema evolution. Therefore, we briefly discuss documentation and automation support systems for database evolution, and suggest that the Wikipedia case study can provide the kernel of a benchmark for testing and improving such systems.
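To make the idea of a Schema Modification Operator concrete, here is a minimal sketch of a single evolution step stated once as a conceptual operator and once as plain DDL. The table and column names are hypothetical, and the operator syntax is merely illustrative:

<pre>
-- Conceptual SMO (illustrative syntax): one operator captures the intent
-- of the change, independently of the SQL dialect:
--   RENAME COLUMN art_name IN article TO art_title
-- A plain-SQL rendering of the same step against a MySQL backend
-- (hypothetical table, column, and type):
ALTER TABLE article CHANGE COLUMN art_name art_title VARCHAR(255) NOT NULL;
</pre>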
  
== Projects ==
Within this macro-project, the following projects have been developed (please follow the links for further details):

* The [[Schema_Evolution_Benchmark | '''Pantha Rei Schema Evolution Benchmark''']] <bibref f="defbib.bib">iceis2008</bibref>, a benchmark for schema evolution developed from the actual evolution of the MediaWiki DB backend.
* The [[HMM | '''History Metadata Manager''']] <bibref f="defbib.bib">ecdm2008</bibref>, a tool to support temporal queries over metadata histories, and its Semantic Web extension, the [[SHMM | '''Semantic HMM''']] <bibref f="defbib.bib">stsm2008</bibref>.
* The prototype system [[Prism|'''PRISM: a tool for Graceful Schema Evolution''']] <bibref f="defbib.bib">vldb2008a</bibref>, <bibref f="defbib.bib">curinovldb2011update</bibref>, and its demo, the [[PrismDemo| '''Prism demo page''']] <bibref f="defbib.bib">icde2009demo</bibref>.
* The prototype system [[Prima|'''PRIMA: a system for querying Transaction-Time DBs under evolving schemas''']] <bibref f="defbib.bib">Vldb2008b</bibref> and its demo <bibref f="defbib.bib">moon09sigmoddemo</bibref>.
* The extension of PRIMA named [[AIMS|'''Scalable Architecture and Query Optimization for Transaction-time DBs with Evolving Schemas''']] <bibref f="defbib.bib">Sigmod2010</bibref>.

== MediaWiki Architecture ==
[[Image:MediaWikiArchitecture.png|MediaWiki Architecture]]

The MediaWiki software is a browser-based web application whose architecture is described in detail in [Help:MediaWikiarchitecture] and in the MediaWiki Workbook 2007 [http://www.scribd.com/doc/43868/Wikipedia-site-internals-workbook-2007?ga_related_doc=1].
As shown in the figure, users interact with the PHP frontend through a standard web browser, submitting a page request (e.g., a search for pages describing "Paris"). The frontend software consists of a simple presentation and management layer (the MediaWiki PHP scripts) interpreted by the Apache PHP engine.
The user requests are carried out by generating appropriate SQL queries (or updates), which are then issued against the data stored in the backend DB (e.g., the database is queried for articles whose text contains the term "Paris"). The backend DB can be stored in any DBMS: MySQL, being open-source and scalable, is the default DBMS for the MediaWiki software. The results returned by the DBMS are rendered in XHTML and delivered to the user's browser to be displayed (e.g., a set of links to pages mentioning "Paris" is rendered as an XHTML list). Due to the heavy load of the Wikipedia installation of this software, much effort has been devoted to performance optimization, introducing several levels of caching (rendered web pages, DB caches, media caches), which is particularly effective thanks to the very low rate (0.04%) of updates w.r.t. queries. Obviously, every modification of the DB schema has a strong impact on the queries the frontend can pose. Typically, each schema evolution step requires several queries to be modified, and thus several PHP scripts (cooperating to interrogate the DB and render a page) to be manually fixed to accommodate the schema changes.
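To see why a schema change ripples into the PHP scripts, consider how a page-text lookup changed when MediaWiki 1.5 split the old ''cur'' table into ''page'', ''revision'', and ''text''. The sketch below is simplified (the real queries carry more columns and compression flags), but it shows the kind of manual rewriting each evolution step demands:

<pre>
-- Legacy query (1.4-era schema, simplified): article text lived in 'cur'.
SELECT cur_text
FROM cur
WHERE cur_namespace = 0 AND cur_title = 'Paris';

-- After the 1.5 split, the same request needs a three-way join
-- (again simplified; actual column names and flags differ):
SELECT old_text
FROM page
JOIN revision ON rev_id = page_latest
JOIN text     ON old_id = rev_text_id
WHERE page_namespace = 0 AND page_title = 'Paris';
</pre>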
  
== MediaWiki Growth ==
[[Image:Att-tab3.png|thumb|Attribute and table growth of the MediaWiki schema]]
In this section, we analyze the schema evolution of MediaWiki based on its 171 schema versions, as committed to SVN between April 2003 (first schema revision) and November 2007 (the date of this analysis).

'''Schema size growth.''' The figures report the size of the MediaWiki DB schema over its history, in terms of the number of tables and columns, respectively. The graphs show an evident growth trend: the number of tables has increased from 17 to 34 (a 100% increase) and the number of columns from 100 to 242 (a 142% increase). Sudden drops in the graphs are due to schema versions with syntax errors, i.e., schema versions that could not be properly installed. In both graphs we observe different rates of growth over time, which seem to be related to the time periods preceding or following official releases of the overall software (see the table in the section Available Installed Software).

Schema growth is due to three main driving forces:
* performance improvement, e.g., the introduction of dedicated cache tables;
* the addition of new features, e.g., support for logging and content validation;
* the growing need for preservation of DB content history, e.g., the introduction of tables and columns, such as the 'filearchive' table, to store outdated multimedia content.

'''Table and column lifetimes.''' A histogram of the table lifetimes, measured in number of versions, shows that lifetimes range from very long ones, e.g., the user table, which was alive throughout the entire history, to short ones, e.g., the random table, which survived for only two revisions. On average, each table lasted 103.3 versions (60.4% of the total DB history). The corresponding histogram for columns shows an average lifetime of 97.17 versions (56.8% of the total DB history). Interestingly, both histograms show two main groups of tables and columns: "short-living" and "long-living". The former might be due to the fact that the schema has been growing lately, so a significant portion of tables and columns was introduced only recently. The latter can be explained by noting that the core tables/columns tend to be rather stable throughout the entire history.
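For readers who download the dataset, the lifetime statistics above are easy to reproduce; a minimal sketch, assuming the version history has been loaded into a hypothetical catalog table <tt>table_history(version_id, table_name)</tt> with one row per table per schema version in which it appears:

<pre>
-- Lifetime of each table, measured in schema versions. Assumes the
-- hypothetical table_history catalog described above, and that a table's
-- presence has no gaps between its first and last version.
SELECT table_name,
       MAX(version_id) - MIN(version_id) + 1 AS lifetime_in_versions
FROM table_history
GROUP BY table_name
ORDER BY lifetime_in_versions DESC;
</pre>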
  
== Available Schema ==

The base source of information is the MediaWiki SVN, freely browsable at:

http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/maintenance/tables.sql?view=markup

However, to simplify life, we provide a .tar.gz download of all the schema versions. It also contains a set of scripts to create, load, and delete the entire MediaWiki schema history:

http://yellowstone.cs.ucla.edu/schema-evolution/documents/mediawiki-schema.tar.gz
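Once the provided scripts have installed the versions, a quick way to retrace the growth curves is to count tables per version from the MySQL catalog; a minimal sketch, assuming each version is loaded into its own database with a name like <tt>mediawiki_v028</tt> (the naming scheme is an assumption; check the scripts in the archive):

<pre>
-- Number of tables in each installed schema version (MySQL).
-- The LIKE pattern reflects the assumed per-version database naming.
SELECT table_schema, COUNT(*) AS num_tables
FROM information_schema.TABLES
WHERE table_schema LIKE 'mediawiki_v%'
GROUP BY table_schema
ORDER BY table_schema;
</pre>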
== Available Queries ==

Our dataset contains a mix of synthetic and real queries.

=== Synthetic Queries ===

The synthetic queries are divided into two classes: a set of queries obtained by installing various versions of MediaWiki and logging the queries generated during typical user sessions, and completely synthetic queries. The latter set contains queries operating on entire tables and on single attributes, and can be used to obtain a rough estimate of the portion of the schema affected by an evolution step (see the sketch after this list).

Lab-generated MediaWiki queries:

* MediaWiki 1.3 (~4,175 query + update instances): http://yellowstone.cs.ucla.edu/schema-evolution/documents/mw13_query_update_all.sql
* MediaWiki 1.3 (~1,948 distinct instances): http://yellowstone.cs.ucla.edu/schema-evolution/documents/mw13_query_update_distinct.sql
* MediaWiki 1.3 (~1,657 distinct query-only instances): http://yellowstone.cs.ucla.edu/schema-evolution/documents/mw13_query_only_distinct.sql
* MediaWiki 1.3 (~75 query templates): http://yellowstone.cs.ucla.edu/schema-evolution/documents/mw13_legacy_template.sql

* MediaWiki 1.11 (~2,346 query + update instances): http://yellowstone.cs.ucla.edu/schema-evolution/documents/mw1.11_query_update_all.sql
* MediaWiki 1.11 (~385 distinct query + update instances): http://yellowstone.cs.ucla.edu/schema-evolution/documents/mw1.11_query_update_distinct.sql
* MediaWiki 1.11 (~147 distinct query-only instances): http://yellowstone.cs.ucla.edu/schema-evolution/documents/mw1.11_query_only_distinct.sql
* MediaWiki 1.11 (~256 query templates): http://yellowstone.cs.ucla.edu/schema-evolution/documents/mw1.11_256_query_templates.sql

Synthetic data:
* MediaWiki 1.3 (version 28, SVN commit 4696), 133 synthetic 1-attribute queries: http://yellowstone.cs.ucla.edu/schema-evolution/documents/v28_synthetic.sql
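To illustrate, a 1-attribute synthetic query touches exactly one column of one table, so whether it still runs after an evolution step flags whether that column was affected. A hypothetical example in the style of the version-28 file (the actual files cover every attribute of the schema):

<pre>
-- Hypothetical 1-attribute probe: exercises only user.user_name, so it
-- fails exactly when that column is renamed, dropped, or moved.
SELECT user_name FROM user;
</pre>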
=== Real Queries ===

The real queries have been obtained by cleaning the log of the Wikipedia on-line profiler, available at: http://noc.wikimedia.org/cgi-bin/report.py?db=enwiki&sort=real&limit=50000

The available data are the query templates, as extracted by the MediaWiki profiler:

* (~1,945) query templates, cross-version: http://yellowstone.cs.ucla.edu/schema-evolution/documents/wikipedia_real_all.sql

Moreover, we report the data available from the Wikipedia profiler, i.e., the number of executions of each query and its CPU and real execution times.

The Wikipedia profiler does not document the units of measure or the exact column semantics, so we report the data as-is; interested users can refer to the on-line profiler for further details.

The data are reported in a CSV file:

* (~1,945) query templates and workload data: http://yellowstone.cs.ucla.edu/schema-evolution/documents/wikipedia_profiler_data.csv
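Query templates are queries in which the profiler abstracts constants away, so that repeated executions with different parameters collapse onto one entry. The following template is illustrative only, not taken from the file, and the placeholder convention may differ from the profiler's actual output:

<pre>
-- Illustrative query template: 'X' stands for a constant stripped by the
-- profiler; the workload data (executions, CPU/real time) attach per template.
SELECT page_id FROM page WHERE page_namespace = 'X' AND page_title = 'X' LIMIT 1;
</pre>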
== Available Data ==

=== Wikipedia Dump Download ===
Since MediaWiki is used by over 30,000 wiki websites around the world, including Wikipedia, the availability of DB data is impressive.
In particular, the Wikimedia Foundation releases the entire Wikipedia DB dump bi-weekly.
Users can experiment with a small dataset for a non-popular language (<10 MB), work on the entire English Wikipedia (''enwiki''), or even install the entire dataset (>700 GB).

To obtain updated data contents, please refer to the official Wikimedia page: http://download.wikimedia.org/backup-index.html
The downloads are in XML and SQL formats, and appropriate tools are offered to simplify the import of the data (see the sketch below).
Please bear in mind that the preferred DBMS for MediaWiki is MySQL.
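For the SQL dumps, a small import boils down to creating a database and sourcing the file from the mysql command-line client; a minimal sketch, with a hypothetical dump file name (the XML dumps require the import tools mentioned above instead):

<pre>
-- Run inside the mysql command-line client; the dump file name below is
-- hypothetical, see the backup-index page for the real per-language names.
CREATE DATABASE wikidump DEFAULT CHARACTER SET utf8;
USE wikidump;
SOURCE xxwiki-latest-dump.sql;
</pre>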
 
If your research requires older backups, please contact us.
== Available Installed Software ==

=== Installed Versions of MediaWiki ===
To ease testing of the MediaWiki backend, we provide a set of data and freely accessible installed versions of MediaWiki.
To allow a comparison of the features available in the main MediaWiki software releases, we installed all of them, and they are available for testing:
[http://yellowstone.cs.ucla.edu/mediawiki/]

[[Image:SoftwareRelease.png]]

* http://yellowstone.cs.ucla.edu/mediawiki/mediawiki-1.2.0/ (13-Mar-2004 08:08)
* http://yellowstone.cs.ucla.edu/mediawiki/mediawiki-1.3.0/ (02-Aug-2004 09:51)
* http://yellowstone.cs.ucla.edu/mediawiki/mediawiki-1.4.0/ (07-Mar-2005 18:07)
* http://yellowstone.cs.ucla.edu/mediawiki/mediawiki-1.5.1/ (22-Jul-2005 23:30)
* http://yellowstone.cs.ucla.edu/mediawiki/mediawiki-1.6.0/ (05-Apr-2006 03:11)
* http://yellowstone.cs.ucla.edu/mediawiki/mediawiki-1.7.0/ (07-Jul-2006 10:30)
* http://yellowstone.cs.ucla.edu/mediawiki/mediawiki-1.8.0/ (10-Oct-2006 15:37)
* http://yellowstone.cs.ucla.edu/mediawiki/mediawiki-1.9.0/ (10-Jan-2007 12:38)
* http://yellowstone.cs.ucla.edu/mediawiki/mediawiki-1.10.0/ (22-Apr-2007 14:17)
* http://yellowstone.cs.ucla.edu/mediawiki/mediawiki-1.11.0/ (28-Jun-2007 18:19)
==== Full Access to the Backend MySQL DB ====

To provide further insight, we have set up access to the MySQL backend of the above installations, so that users can freely explore the MediaWiki backend and test simple queries.
Access is limited to reading, in order to avoid vandalism.

'''user:''' "mediawikireder"

'''password:''' "imareader"

The phpMyAdmin web access is the following: http://yellowstone.cs.ucla.edu/phpMyAdmin/
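Once logged in through phpMyAdmin, a quick sanity check is to list a few pages from one of the installed versions; a minimal example, where the database name is a placeholder (the actual database names, and the table layout, depend on the MediaWiki version you pick):

<pre>
-- Read-only sanity check; 'mediawiki_1_11' is a hypothetical database name.
SELECT page_id, page_title
FROM mediawiki_1_11.page
ORDER BY page_id
LIMIT 10;
</pre>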

== Funding ==

This work was supported in part by '''NSF-IIS''' award '''0705345''': ''"III-COR: Collaborative Research: Graceful Evolution and Historical Queries in Information Systems – A Unified Approach"''

== References ==

<bibreferences />
