CS6093/Projects

From VistrailsWiki
Revision as of 05:22, 14 February 2012 by Juliana (talk | contribs)

Matching Entities and News

In this project, you will build a real-time service that matches entities with news. Given a set of entities mentioned in some input text (e.g., tweets), this service will identify and rank a set of relevant news documents. To build it, you will have to complete the following three tasks:

  • 1 Given an input text I (e.g., a tweet or a news article) with a timestamp, identify the set of entities E (PERSON, LOCATION, ORGANIZATION, MISC) present in I. To find the entities, you can use LbjNerTagger, the tool described in L. Ratinov and D. Roth, "Design Challenges and Misconceptions in Named Entity Recognition", CoNLL 2009. It is freely available: http://cogcomp.cs.illinois.edu/page/download_view/NETagger
  • 2 Given the entities found in the input text and the timestamp, find the related news stories. You can do this by submitting a query to news APIs, such as Google News, Bing News, the NYTimes, and digg.com, in order to obtain the titles, content, links, publisher, and publication date of news stories that mention the given entities.

  • 3 Create a method that ranks the retrieved news stories by relevance. In addition to the actual entities, you should also consider the available metadata, including the input text and other features obtained automatically from the news stories, e.g., content, news title, publisher.
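As a starting point for task 3, here is a minimal ranking sketch in Python. It is only one possible baseline, not a prescribed method: the article fields (`title`, `content`, `published`), the three features, the weights, and the `half_life_hours` parameter are all assumptions made for illustration.

```python
import re
from datetime import datetime


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score(article, entities, input_text, input_time, half_life_hours=24.0):
    """Combine entity coverage, text overlap, and recency into one score in [0, 1].

    `article` is assumed to be a dict with "title", "content", and
    "published" (a datetime). The weights below are arbitrary placeholders.
    """
    body = tokenize(article["title"] + " " + article["content"])

    # Fraction of entities whose tokens all appear in the article.
    hits = sum(1 for e in entities if tokenize(e) <= body)
    entity_cov = hits / len(entities) if entities else 0.0

    # Jaccard overlap between the input text and the article text.
    query = tokenize(input_text)
    union = query | body
    jaccard = len(query & body) / len(union) if union else 0.0

    # Exponential decay on the gap between input time and publication time.
    gap_hours = abs((input_time - article["published"]).total_seconds()) / 3600.0
    recency = 0.5 ** (gap_hours / half_life_hours)

    return 0.5 * entity_cov + 0.3 * jaccard + 0.2 * recency


def rank(articles, entities, input_text, input_time):
    """Return the articles sorted from most to least relevant."""
    return sorted(
        articles,
        key=lambda a: score(a, entities, input_text, input_time),
        reverse=True,
    )
```

A real solution would likely replace the hand-set weights with values tuned on the provided tweets, and could add features such as the publisher.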

Here's a sample of the input for this task, which consists of a set of tweets: http://vgc.poly.edu/~juliana/courses/cs6093/Projects/TweetNews/sample.data.json

(Additional data consisting of millions of tweets will be provided to students that select this project.)

Organizing Geo-Temporal Data

Given a set of documents D, each document d in D consists of:

  • Textual data (i.e., the contents of the document)
  • Coordinates (x,y) that indicate the geographical position of d
  • A timestamp t

The goal of this project is to design a data structure and associated indexes to store these data, with the goal of efficiently supporting the following queries:

  • (Q1) Given a set of keywords, a spatial region (x0,y0)-(x1,y1), and a time interval [t0,t1], return all documents d(content,x,y,t) whose content matches the keywords, such that d lies in the spatial region and its timestamp lies within the time interval, i.e.,

x0 <= x <= x1, y0 <= y <= y1, t0 <= t <= t1

  • (Q2) Given Q1 and an integer k, return the k most frequent keywords that occur in the content of the documents returned by Q1.
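To make the semantics of Q1 and Q2 concrete, here is a small baseline sketch in Python. It is not the efficient structure the project asks for: it uses a plain inverted index for the keywords and then filters candidates by region and time one at a time. The names `Doc`, `GeoTemporalIndex`, `q1`, and `q2` are hypothetical, and treating "matches the keywords" as "contains every keyword" is an assumption.

```python
import re
from collections import Counter, defaultdict, namedtuple

# A document d(content, x, y, t); t can be any comparable timestamp.
Doc = namedtuple("Doc", ["content", "x", "y", "t"])


def tokenize(text):
    """Lowercase the text and return its alphanumeric tokens in order."""
    return re.findall(r"[a-z0-9]+", text.lower())


class GeoTemporalIndex:
    """Baseline: inverted index on keywords, linear filter on space and time."""

    def __init__(self, docs):
        self.docs = list(docs)
        self.postings = defaultdict(set)  # keyword -> set of document ids
        for i, d in enumerate(self.docs):
            for w in set(tokenize(d.content)):
                self.postings[w].add(i)

    def q1(self, keywords, x0, y0, x1, y1, t0, t1):
        """Return documents containing every keyword (assumed semantics)
        with x0 <= x <= x1, y0 <= y <= y1, and t0 <= t <= t1."""
        sets = [self.postings.get(k.lower(), set()) for k in keywords]
        if sets:
            candidates = set.intersection(*sets)
        else:
            candidates = set(range(len(self.docs)))
        return [
            self.docs[i]
            for i in sorted(candidates)
            if x0 <= self.docs[i].x <= x1
            and y0 <= self.docs[i].y <= y1
            and t0 <= self.docs[i].t <= t1
        ]

    def q2(self, k, keywords, x0, y0, x1, y1, t0, t1):
        """Return the k most frequent tokens in the documents returned by Q1."""
        counts = Counter()
        for d in self.q1(keywords, x0, y0, x1, y1, t0, t1):
            counts.update(tokenize(d.content))
        return [w for w, _ in counts.most_common(k)]
```

A baseline like this is mainly useful for checking the answers of a faster design, for example one that combines the inverted index with a spatial or temporal index so that the region and interval filters do not touch every candidate.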

Here's a sample of the input for this task: http://vgc.poly.edu/files/llins/twitter/twitter_us_2012-02-13_12h.json.tar.gz

(Additional data consisting of millions of tweets will be provided to students that select this project.)