Difference between revisions of "CS6093/Projects"

From VistrailsWiki

Revision as of 05:22, 14 February 2012

Matching Entities and News

In this project, you will build a real-time service that matches entities with news. Given a set of entities mentioned in some input text (e.g., tweets), this service will identify and rank a set of relevant news documents. To do this, you will have to complete the following three tasks:

  • 1 Given an input text I (a tweet, news item, or article) with a timestamp, identify the set of entities E (PERSON, LOCATION, ORGANIZATION, MISC) present in I. To find the entities, you can use LbjNerTagger, the tool described in L. Ratinov and D. Roth, "Design Challenges and Misconceptions in Named Entity Recognition", CoNLL 2009, which is freely available: http://cogcomp.cs.illinois.edu/page/download_view/NETagger
  • 2 Given the entities found in the input text and the timestamp, find the related news stories. You can do this by submitting a query to news APIs, such as Google News, Bing News, the NYTimes and digg.com, in order to obtain the titles, content, links, publisher and publication date of news that have mentioned the given entities.

  • 3 Create a method that ranks the most relevant news. In addition to the actual entities, you should also consider the available metadata, including the input text and other features that were automatically extracted from the news items, e.g., content, news title, publisher.
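
As a starting point for task 3, here is a minimal ranking sketch. It is not a prescribed method: the NewsItem fields, the title weight of 2, and the 24-hour half-life are all assumptions made for illustration. It scores each news item by how many of the extracted entities appear in its title and content, and discounts items published far from the timestamp of the input text.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class NewsItem:
    # Hypothetical record; field names are assumptions, not a given schema.
    title: str
    content: str
    publisher: str
    published: datetime


def score(item, entities, input_time, half_life_hours=24.0):
    """Entity-match score, halved for every half_life_hours of time distance."""
    title = item.title.lower()
    content = item.content.lower()
    match = 0.0
    for entity in entities:
        e = entity.lower()
        if e in title:
            match += 2.0  # assumed weight: a title mention counts double
        if e in content:
            match += 1.0
    hours = abs((item.published - input_time).total_seconds()) / 3600.0
    return match * 0.5 ** (hours / half_life_hours)


def rank_news(items, entities, input_time):
    """Return the items sorted from most to least relevant."""
    return sorted(items, key=lambda it: score(it, entities, input_time), reverse=True)
```

Substring matching is crude (it ignores entity types and aliases), so a real solution would likely replace it with features learned from the metadata, such as the publisher.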

Here's a sample of the input for this task, which consists of a set of tweets: http://vgc.poly.edu/~juliana/courses/cs6093/Projects/TweetNews/sample.data.json

(Additional data consisting of millions of tweets will be provided to students that select this project.)

Organizing Geo-Temporal Data

Given a set of documents D, each document d in D consists of:

  • textual data (i.e., the contents of the document)
  • coordinates (x,y) that indicate the geographical position of d
  • a timestamp t

The goal of this project is to design a data structure and associated indexes to store these data, with the goal of efficiently supporting the following queries:

  • (Q1) Given a set of keywords, a spatial region (x0,y0)-(x1,y1) and a time interval [t0,t1], return all documents d(content,x,y,t) whose content matches the keywords, such that d lies in the spatial region and its time lies within the time interval, i.e.,

x0 <= x <= x1, y0 <= y <= y1, t0 <= t <= t1

  • (Q2) Given Q1 and an integer k, return the k most frequent keywords that occur in the content of the documents returned by Q1.
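
To make the semantics of Q1 and Q2 concrete, here is a naive linear-scan baseline. It is not the indexed structure the project asks for, only a reference against which an index can be checked. Several choices are assumptions: the Doc record, whitespace tokenization, treating "match" as "contains every keyword", and counting every token of the result documents as a keyword in Q2.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Doc:
    # Hypothetical record for d(content, x, y, t).
    content: str
    x: float
    y: float
    t: float


def tokenize(text):
    """Assumed tokenization: lowercase, split on whitespace."""
    return text.lower().split()


def q1(docs, keywords, x0, y0, x1, y1, t0, t1):
    """Scan all documents; keep those inside the box and interval with all keywords."""
    wanted = {k.lower() for k in keywords}
    return [
        d for d in docs
        if x0 <= d.x <= x1 and y0 <= d.y <= y1 and t0 <= d.t <= t1
        and wanted <= set(tokenize(d.content))
    ]


def q2(docs, keywords, x0, y0, x1, y1, t0, t1, k):
    """Return the k most frequent tokens in the Q1 result as (token, count) pairs."""
    counts = Counter()
    for d in q1(docs, keywords, x0, y0, x1, y1, t0, t1):
        counts.update(tokenize(d.content))
    return counts.most_common(k)
```

Each query costs time linear in the number of documents, which is what a spatial, temporal, or inverted index is meant to avoid on millions of tweets.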

Here's a sample of the input for this task: http://vgc.poly.edu/files/llins/twitter/twitter_us_2012-02-13_12h.json.tar.gz

(Additional data consisting of millions of tweets will be provided to students that select this project.)