CS6093/Projects

Matching Entities and News

In this project, you will build a real-time service that matches entities with news. Given a set of entities mentioned in some input text (e.g., tweets), this service will identify and rank a set of relevant news documents. To do so, you will have to complete the following three tasks:

1 Given an input text I (a tweet, news story, or article) with a timestamp, you need to identify the set of entities E (PERSON, LOCATION, ORGANIZATION, MISC) present in I. To find the entities, you can use LbjNerTagger, the tool described in L. Ratinov and D. Roth, "Design Challenges and Misconceptions in Named Entity Recognition", CoNLL 2009, which is freely available: http://cogcomp.cs.illinois.edu/page/download_view/NETagger.

2 Given the entities found in the input text and the timestamp, find the related news stories. You can do this by submitting a query to news APIs, such as Google News, Bing News, the NYTimes, and digg.com, in order to obtain the titles, content, links, publisher, and publication date of news stories that mention the given entities.

3 Create a method that ranks the news stories by relevance. In addition to the entities themselves, you should also consider the available metadata, including the input text and other features that can be extracted automatically from the news stories, e.g., content, news title, publisher.
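
As a starting point for task 3, here is a minimal sketch of one possible ranking baseline. It is not a prescribed method: the function name rank_news, the article fields "title" and "content", the title_weight parameter, and the equal weighting of the two scores are all assumptions made for illustration. It combines the cosine similarity between the input text and each article with the fraction of entities the article mentions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_news(input_text, entities, articles, title_weight=2.0):
    """Return (score, article) pairs sorted from most to least relevant.

    articles is a list of dicts with hypothetical "title" and "content"
    keys; real news APIs will use their own field names.
    """
    query = Counter(tokenize(input_text))
    entity_names = {e.lower() for e in entities}
    scored = []
    for art in articles:
        title = art.get("title", "")
        content = art.get("content", "")
        doc = Counter(tokenize(content))
        for tok in tokenize(title):
            doc[tok] += title_weight  # title terms count more than body terms
        text_sim = cosine(query, doc)
        full_text = (title + " " + content).lower()
        hits = sum(1 for e in entity_names if e in full_text)
        entity_score = hits / len(entity_names) if entity_names else 0.0
        # Equal weighting is an arbitrary choice for this sketch.
        scored.append((0.5 * text_sim + 0.5 * entity_score, art))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored
```

Other metadata, such as the publisher or the gap between the input timestamp and the publication date, could be added as further terms in the score.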

Here's a sample of the input for this task, which consists of a set of tweets: http://vgc.poly.edu/~juliana/courses/cs6093/Projects/TweetNews/sample.data.json

(Additional data consisting of millions of tweets will be provided to students that select this project.)

Organizing Geo-Temporal Data

Given a set of documents D, each document d in D consists of

textual data (i.e., the contents of the document)
coordinates (x,y) that indicate the geographical position of d
a timestamp t

The goal of this project is to design a data structure and associated indexes to store these data, with the goal of efficiently supporting the following queries:

(Q1) Given a set of keywords, a spatial region (x0,y0)-(x1,y1) and a time interval [t0,t1], return all documents d(content,x,y,t) whose content matches the keywords, such that d lies in the spatial region and its timestamp lies within the time interval, i.e.,

x0 <= x <= x1, y0 <= y <= y1, t0 <= t <= t1

(Q2) Given Q1 and an integer k, return the k most frequent keywords that occur in the content of the documents returned by Q1.
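
To make the two queries concrete, here is a minimal in-memory sketch. It is a naive baseline rather than the data structure the project asks you to design. The class name GeoTemporalIndex and its methods q1 and q2 are hypothetical, and it assumes that "matching the keywords" means a document contains every keyword. Documents are sorted by timestamp so that an inverted index can be narrowed to the time interval by binary search; the spatial test is a plain filter.

```python
import bisect
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class GeoTemporalIndex:
    def __init__(self, docs):
        # docs: iterable of (content, x, y, t) tuples, as in d(content,x,y,t).
        # Sorting by t makes document ids increase with time.
        self.docs = sorted(docs, key=lambda d: d[3])
        self.times = [d[3] for d in self.docs]
        self.tokens = [set(tokenize(d[0])) for d in self.docs]
        self.postings = defaultdict(list)  # keyword -> ascending doc ids
        for doc_id, toks in enumerate(self.tokens):
            for tok in toks:
                self.postings[tok].append(doc_id)

    def q1(self, keywords, x0, y0, x1, y1, t0, t1):
        """Documents containing all keywords inside the box and interval."""
        lo = bisect.bisect_left(self.times, t0)
        hi = bisect.bisect_right(self.times, t1)
        kws = [k.lower() for k in keywords]
        if kws:
            # Start from the shortest postings list, cut to ids in [lo, hi).
            rarest = min(kws, key=lambda k: len(self.postings.get(k, ())))
            plist = self.postings.get(rarest, [])
            start = bisect.bisect_left(plist, lo)
            end = bisect.bisect_left(plist, hi)
            candidates = plist[start:end]
        else:
            candidates = range(lo, hi)
        results = []
        for doc_id in candidates:
            _, x, y, _ = self.docs[doc_id]
            in_region = x0 <= x <= x1 and y0 <= y <= y1
            if in_region and all(k in self.tokens[doc_id] for k in kws):
                results.append(self.docs[doc_id])
        return results

    def q2(self, q1_results, k):
        """The k most frequent terms in the content of the Q1 results."""
        counts = Counter()
        for content, _, _, _ in q1_results:
            counts.update(tokenize(content))
        return counts.most_common(k)
```

For example, idx.q2(idx.q1(["storm"], 0, 0, 3, 3, 5, 25), 10) would return the ten most frequent terms among matching documents. A real solution would likely replace the linear spatial filter with a spatial index (for instance a grid or an R-tree) and precompute term statistics so that Q2 does not rescan the content.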

Here's a sample of the input for this task: http://vgc.poly.edu/files/llins/twitter/twitter_us_2012-02-13_12h.json.tar.gz

(Additional data consisting of millions of tweets will be provided to students that select this project.)

Analyzing Web Forms

Forms are ubiquitous on the Web. They serve as interfaces to Web services as well as entry points to online databases. Several applications have emerged that manipulate and interact with these forms, from information integration systems to hidden-Web crawlers. But doing so is challenging. There is wide variation in how forms are designed, in their complexity, and in the underlying technology they use. In addition, these applications need to deal with the dynamic nature of forms, as they change over time. However, to date, there has been no large-scale study of what Web forms look like and how they evolve. To help inform developers in the design of effective applications, in this project you will analyze a large collection of over 1.5 million forms, which was tracked over a period of five months.

As part of your project you will:

formulate the questions and hypotheses for the data analysis
design the experimental setup, e.g., a relational database on a single machine or a MapReduce-based solution
report the results

Suggested reading: Accessing the deep web. He et al., CACM 2007.
