Course Project: Wikipedia Analysis


Revision as of 21:55, 7 November 2012

You will analyze the Wikipedia documents and generate interesting statistics about Wikipedia content and structure. The project will be done in two phases:

Phase 1: Data Pre-processing

The class will be split into groups; each group will be assigned a task and produce output that will be shared with all students. The tasks are the following:

  • 1. Identify pages that have infoboxes. You will scan the Wikipedia pages and generate a CSV file named infobox.csv, where each line corresponds to a page that contains an infobox and has the following format: page_id, infobox_text

If a page does not have an infobox, it will not have a line in the CSV file. The infobox_text should contain all text for the infobox, including the template name, attribute names and values.
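As a minimal sketch of how this scan might work, assuming pages are available as (page_id, wikitext) pairs (how you obtain them from the dump is up to your group), an infobox can be located by finding a "{{Infobox" template and matching nested braces; note that real templates can nest in more complicated ways, so this is a simplification. The csv writer handles quoting of commas and newlines inside the infobox text.

```python
import csv

def extract_infobox(wikitext):
    """Return the text of the first infobox template, or None.

    Finds "{{Infobox" and matches nested {{ }} pairs to locate the
    end of the template (a simplification of real template syntax).
    """
    start = wikitext.find("{{Infobox")
    if start == -1:
        return None
    depth = 0
    i = start
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                return wikitext[start:i]
        else:
            i += 1
    return None

def write_infobox_csv(pages, path="infobox.csv"):
    """pages: iterable of (page_id, wikitext); writes page_id, infobox_text
    for pages that contain an infobox, skipping the rest."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for page_id, text in pages:
            box = extract_infobox(text)
            if box is not None:
                writer.writerow([page_id, box])
```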

  • 2. Extract links from Wikipedia pages. You will scan the Wikipedia pages and generate a CSV file named links.csv where each line corresponds to an internal Wikipedia link, i.e., a link that points to a Wikipedia page. Each line in the file should have the following format: page_id, url
  • 3. Extract text from Wikipedia pages. You will scan the Wikipedia pages and generate a CSV file named content.csv where each line corresponds to a Wikipedia page and contains the page id and its content. The content should be pre-processed to remove the Wiki markup. Each line in the file should have the following format: page_id, text
  • 4. Extract metadata from Wikipedia pages. You will scan the Wikipedia pages and generate an XML file named metadata.xml where each element corresponds to a Wikipedia page. The metadata is represented in the file using XML markup. You can use the same schema as the original dump, e.g.:

<page>
  <title>Albedo</title>
  <ns>0</ns>
  <id>39</id>
  <revision>
    <id>519638683</id>
    <parentid>519635723</parentid>
    <timestamp>2012-10-24T20:53:50Z</timestamp>
    <contributor>
      <username>Jenny2017</username>
      <id>17023873</id>
    </contributor>
  </revision>
</page>

In addition, you will add information about the categories assigned to the page as well as the cross-language links. Here's a sample of the categories:

[[Category:Climate forcing]]
[[Category:Climatology]]
[[Category:Electromagnetic radiation]]
[[Category:Radiometry]]
[[Category:Scattering, absorption and radiative transfer (optics)]]
[[Category:Radiation]]

And here's a sample of the cross-language links:

[[als:Albedo]]
[[ar:بياض]]
[[an:Albedo]]
[[ast:Albedu]]
[[bn:প্রতিফলন অনুপাত]]
[[bg:Албедо]]
[[bs:Albedo]]
[[ca:Albedo]]
[[cs:Albedo]]
[[cy:Albedo]]
[[da:Albedo]]
[[de:Albedo]]
[[et:Albeedo]]
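Task 2 (link extraction) can be sketched as follows, assuming pages are available as (page_id, wikitext) pairs. The URL prefix and the colon-based namespace filter are simplifying assumptions: real dumps also contain piped links, section anchors, and namespaced links such as [[File:...]], which this crude regex only partially handles.

```python
import csv
import re

# Internal wiki links look like [[Target]] or [[Target|label]].
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:[|#][^\[\]]*)?\]\]")

def internal_links(wikitext):
    """Yield URLs for internal links, skipping namespaced links
    such as [[Category:...]] and cross-language links."""
    for m in WIKILINK.finditer(wikitext):
        target = m.group(1).strip()
        if ":" in target:  # crude namespace / interlanguage filter
            continue
        yield "https://en.wikipedia.org/wiki/" + target.replace(" ", "_")

def write_links_csv(pages, path="links.csv"):
    """pages: iterable of (page_id, wikitext); one line per link."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for page_id, text in pages:
            for url in internal_links(text):
                writer.writerow([page_id, url])
```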

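For task 3, removing Wiki markup could start from a few substitutions like the ones below. This is a rough sketch, not a full parser: nested templates, tables, and references need additional passes, and a real solution might prefer an existing wikitext parsing library.

```python
import re

def strip_markup(wikitext):
    """Very rough wiki-markup removal (a sketch, not a full parser)."""
    text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)                      # innermost templates
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]]*)\]\]", r"\1", text)  # links -> visible label
    text = re.sub(r"'{2,}", "", text)                                  # bold / italics quotes
    text = re.sub(r"={2,}([^=]+)={2,}", r"\1", text)                   # section headings
    return re.sub(r"\s+", " ", text).strip()                           # collapse whitespace
```

Each page would then be written to content.csv as page_id, strip_markup(wikitext), mirroring the format of the other tasks.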

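For task 4, the category and cross-language samples above can be recognized with simple patterns and attached to the page's metadata element. The regexes and the element names categories/langlinks are assumptions for illustration; the dump's own <page> schema is kept as-is.

```python
import re
import xml.etree.ElementTree as ET

# Simplified patterns for the two kinds of links shown in the samples.
CATEGORY = re.compile(r"\[\[Category:([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")
LANGLINK = re.compile(r"\[\[([a-z]{2,3}(?:-[a-z]+)?):([^\[\]]+)\]\]")

def page_metadata(page_elem, wikitext):
    """Append <categories> and <langlinks> children to the dump's
    <page> element (page_elem: an ElementTree Element)."""
    cats = ET.SubElement(page_elem, "categories")
    for m in CATEGORY.finditer(wikitext):
        ET.SubElement(cats, "category").text = m.group(1).strip()
    langs = ET.SubElement(page_elem, "langlinks")
    for m in LANGLINK.finditer(wikitext):
        link = ET.SubElement(langs, "langlink")
        link.set("lang", m.group(1))
        link.text = m.group(2)
    return page_elem
```

Writing every augmented <page> element under a single root then yields metadata.xml.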
Phase 2: Data Analysis


  • Count the number of pages that have an infobox
  • Group pages according to the kind of infobox they contain
  • Count the number of tables
  • Group pages according to the kind of table they contain
  • Histogram of cross-language links per language
  • Histogram of the last date of change for all pages
  • Extra credit: tag cloud for groups of old articles
  • Compute PageRank for all Wikipedia pages; print the top 100
  • Students can propose their own questions: share them with the instructors (extra credit, if we agree)
  • Distribution of pages across categories (histogram of categories and the number of pages in each category)
  • Number of pages with multiple categories
  • Number of infoboxes that do not use a template
  • Compute the word co-occurrence matrix (i.e., the number of times word w_i occurs with word w_j within an article)
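As one example from the list above, the PageRank computation can be sketched with plain power iteration over the edge list produced in Phase 1 (links.csv). The damping factor 0.85 and 50 iterations are conventional defaults, not requirements; dangling pages (no outgoing links) have their rank redistributed uniformly.

```python
from collections import defaultdict

def pagerank(edges, damping=0.85, iterations=50):
    """edges: iterable of (src, dst) pairs; returns {node: score}."""
    out = defaultdict(list)
    nodes = set()
    for src, dst in edges:
        out[src].append(dst)
        nodes.update((src, dst))
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Base teleportation mass for every node.
        new = {v: (1.0 - damping) / n for v in nodes}
        # Rank held by dangling nodes, spread uniformly.
        dangling = sum(rank[v] for v in nodes if not out[v])
        for v in nodes:
            for w in out[v]:
                new[w] += damping * rank[v] / len(out[v])
        rank = {v: new[v] + damping * dangling / n for v in nodes}
    return rank
```

Printing the top 100 is then sorted(rank.items(), key=lambda kv: kv[1], reverse=True)[:100].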