Course Project: Wikipedia Analysis


You will analyze the Wikipedia documents and generate interesting statistics about Wikipedia content and structure. The project will be done in two phases:

Phase 1: Data pre-processing

The class will be split into groups; each group will be assigned a task and will produce output that will be shared with all students. The tasks are the following:

  • 1. Identify pages that have infoboxes. You will scan the Wikipedia pages and generate a CSV file named infobox.csv in which each line corresponds to a page that contains an infobox and has the following format: page_id, infobox_text

If a page does not have an infobox, it will not have a line in the CSV file. The infobox_text should contain all text for the infobox, including the template name, attribute names and values.
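
All four tasks scan the same Wikipedia XML dump, so the sketches in this page share a small page iterator. Here is a minimal sketch for task 1 in Python, assuming the pages come from a standard MediaWiki pages-articles XML dump and using the third-party mwparserfromhell parser; the dump path, the iter_pages helper, and the "template name starts with Infobox" test are illustrative assumptions, not requirements of the assignment.

  import csv
  import xml.etree.ElementTree as ET

  import mwparserfromhell  # third-party wikitext parser; one possible choice

  DUMP_PATH = "enwiki-pages-articles.xml"  # illustrative path

  def iter_pages(dump_path):
      """Yield (page_id, title, wikitext) for every page in a MediaWiki XML dump."""
      for _, elem in ET.iterparse(dump_path, events=("end",)):
          # Dump elements carry an export namespace; match on the local name only.
          if not elem.tag.endswith("}page") and elem.tag != "page":
              continue
          ns = elem.tag[:-len("page")]  # "{http://...}" prefix, or "" if none
          page_id = elem.findtext(ns + "id")
          title = elem.findtext(ns + "title")
          text = elem.findtext(ns + "revision/" + ns + "text") or ""
          yield page_id, title, text
          elem.clear()  # free memory; the dump is large

  def extract_infoboxes(dump_path, out_path="infobox.csv"):
      with open(out_path, "w", newline="", encoding="utf-8") as f:
          writer = csv.writer(f)
          for page_id, _, text in iter_pages(dump_path):
              wikicode = mwparserfromhell.parse(text)
              boxes = [t for t in wikicode.filter_templates()
                       if str(t.name).strip().lower().startswith("infobox")]
              if boxes:
                  # Keep the full template text: name, attribute names, and values.
                  writer.writerow([page_id, " ".join(str(b) for b in boxes)])

  if __name__ == "__main__":
      extract_infoboxes(DUMP_PATH)
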

  • 2. Extract links from Wikipedia pages. You will scan the Wikipedia pages and generate a CSV file named links.csv where each line corresponds to an internal Wikipedia link, i.e., a link that points to another Wikipedia page. Each line in the file should have the following format: page_id, url (a sketch covering tasks 2 and 3 appears after the metadata samples below)
  • 3. Extract text from Wikipedia pages. You will scan the Wikipedia pages and generate a CSV file named content.csv where each line corresponds to a Wikipedia page and contains the page id and its content. The content should be pre-processed to remove the Wiki markup. Each line in the file should have the following format: page_id, text
  • 4. Extract metadata from Wikipedia pages. You will scan the Wikipedia pages and generate an XML file named metadata.xml where each element corresponds to a Wikipedia page. The metadata is represented in the file using XML markup. You can use the same schema as the original Wikipedia dump, e.g.:

  <page>
    <title>Albedo</title>
    <ns>0</ns>
    <id>39</id>
    <revision>
      <id>519638683</id>
      <parentid>519635723</parentid>
      <timestamp>2012-10-24T20:53:50Z</timestamp>
      <contributor>
        <username>Jenny2017</username>
        <id>17023873</id>
      </contributor>
    </revision>
  </page>

In addition, you will add information about the categories assigned to the page as well as the cross-language links. Here's a sample of the categories:

  [[Category:Climate forcing]]
  [[Category:Climatology]]
  [[Category:Electromagnetic radiation]]
  [[Category:Radiometry]]
  [[Category:Scattering, absorption and radiative transfer (optics)]]
  [[Category:Radiation]]

And here's a sample of the cross-language links:

  [[als:Albedo]] [[ar:بياض]] [[an:Albedo]] [[ast:Albedu]] [[bn:প্রতিফলন অনুপাত]] [[bg:Албедо]] [[bs:Albedo]] [[ca:Albedo]] [[cs:Albedo]] [[cy:Albedo]] [[da:Albedo]] [[de:Albedo]] [[et:Albeedo]]
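
For tasks 2 and 3, the same pass over the dump can collect the internal links and the markup-free text. Below is a minimal sketch under the same assumptions as the task-1 sketch (it reuses the illustrative iter_pages helper and mwparserfromhell); the https://en.wikipedia.org/wiki/ prefix used to turn a link target into a URL is also an assumption.

  import csv
  from urllib.parse import quote

  import mwparserfromhell

  WIKI_PREFIX = "https://en.wikipedia.org/wiki/"  # assumed base URL for internal links

  def extract_links_and_text(dump_path, links_path="links.csv", content_path="content.csv"):
      with open(links_path, "w", newline="", encoding="utf-8") as lf, \
           open(content_path, "w", newline="", encoding="utf-8") as cf:
          link_writer = csv.writer(lf)
          content_writer = csv.writer(cf)
          for page_id, _, text in iter_pages(dump_path):  # iter_pages from the task-1 sketch
              wikicode = mwparserfromhell.parse(text)
              # Task 2: internal links, i.e. [[Target|label]] wikilinks.
              for link in wikicode.filter_wikilinks():
                  target = str(link.title).strip()
                  if target and not target.lower().startswith(("category:", "file:", "image:")):
                      link_writer.writerow([page_id, WIKI_PREFIX + quote(target.replace(" ", "_"))])
              # Task 3: the page content with the wiki markup stripped.
              content_writer.writerow([page_id, wikicode.strip_code()])
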
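For task 4, here is a sketch of how metadata.xml could be assembled, again reusing the illustrative iter_pages helper. The regular expressions for categories and cross-language links and the output element names are assumptions; the revision and contributor fields from the dump are omitted for brevity and would be copied over in the same way.

  import re
  import xml.etree.ElementTree as ET

  CATEGORY_RE = re.compile(r"\[\[Category:([^\]|]+)", re.IGNORECASE)
  # Cross-language links look like [[de:Albedo]]; short language codes are assumed here.
  LANGLINK_RE = re.compile(r"\[\[([a-z]{2,3}(?:-[a-z]+)?):([^\]]+)\]\]")

  def extract_metadata(dump_path, out_path="metadata.xml"):
      root = ET.Element("pages")
      for page_id, title, text in iter_pages(dump_path):  # iter_pages from the task-1 sketch
          page = ET.SubElement(root, "page")
          ET.SubElement(page, "title").text = title
          ET.SubElement(page, "id").text = page_id
          categories = ET.SubElement(page, "categories")
          for name in CATEGORY_RE.findall(text):
              ET.SubElement(categories, "category").text = name.strip()
          langlinks = ET.SubElement(page, "languagelinks")
          for code, target in LANGLINK_RE.findall(text):
              ET.SubElement(langlinks, "languagelink", lang=code).text = target.strip()
      ET.ElementTree(root).write(out_path, encoding="utf-8", xml_declaration=True)
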


Phase 2: Data Analysis


  • Count the number of pages that have an infobox
  • Group pages according to the kind of infobox they contain
  • Count the number of tables
  • Group pages according to the kind of table they contain
  • Histogram of cross-language links per language
  • Histogram of the last date of change for all pages
  • Extra credit: tag cloud for groups of old articles
  • Compute PageRank for all Wikipedia pages and print the top 100 (see the PageRank sketch after this list)
  • Students can come up with their own questions and share them with the instructors (extra credit, if we agree)
  • Distribution of pages across categories (histogram of categories and the number of pages in each category)
  • Number of pages with multiple categories
  • Number of infoboxes that do not use a template
  • Compute the word co-occurrence matrix, i.e., the number of times word w_i occurs with word w_j within an article (see the co-occurrence sketch after this list)
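
For the PageRank item, here is a minimal power-iteration sketch over links.csv. It assumes the url column has already been resolved back to page ids (for example via the title/id pairs in metadata.xml); the damping factor 0.85 and 50 iterations are conventional defaults, not requirements.

  import csv
  from collections import defaultdict

  def pagerank(edges, damping=0.85, iterations=50):
      """Power-iteration PageRank over an iterable of (source, target) edges."""
      out_links = defaultdict(list)
      nodes = set()
      for src, dst in edges:
          out_links[src].append(dst)
          nodes.update((src, dst))
      n = len(nodes)
      rank = {node: 1.0 / n for node in nodes}
      for _ in range(iterations):
          new_rank = {node: (1.0 - damping) / n for node in nodes}
          dangling = 0.0
          for node in nodes:
              targets = out_links.get(node)
              if targets:
                  share = damping * rank[node] / len(targets)
                  for dst in targets:
                      new_rank[dst] += share
              else:
                  dangling += damping * rank[node]
          for node in nodes:
              # Redistribute the rank of pages with no outgoing links uniformly.
              new_rank[node] += dangling / n
          rank = new_rank
      return rank

  def top_pages(links_path="links.csv", k=100):
      with open(links_path, newline="", encoding="utf-8") as f:
          # Assumes link targets have been resolved to page ids beforehand.
          edges = [(row[0], row[1]) for row in csv.reader(f) if len(row) == 2]
      ranks = pagerank(edges)
      return sorted(ranks.items(), key=lambda kv: kv[1], reverse=True)[:k]

  if __name__ == "__main__":
      for page, score in top_pages():
          print(page, score)
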
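For the word co-occurrence item, here is a sketch over content.csv that counts, for each article, every unordered pair of distinct words that appears in it. The simple letters-only tokenizer and the "count a pair once per article" reading are assumptions; in practice the vocabulary would be restricted (e.g., to frequent words) because the number of pairs grows quadratically.

  import csv
  import itertools
  import re
  from collections import Counter

  csv.field_size_limit(10**7)  # content.csv rows hold full article texts
  TOKEN_RE = re.compile(r"[a-z']+")  # assumed simple tokenizer

  def cooccurrence(content_path="content.csv"):
      """Count how often word w_i occurs together with word w_j within an article."""
      counts = Counter()
      with open(content_path, newline="", encoding="utf-8") as f:
          for row in csv.reader(f):
              if len(row) != 2:
                  continue
              _, text = row
              # In practice, restrict this to a fixed vocabulary of frequent words.
              words = sorted(set(TOKEN_RE.findall(text.lower())))
              for w_i, w_j in itertools.combinations(words, 2):
                  counts[(w_i, w_j)] += 1
      return counts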