DataVis2012/Projects/Hillegass

From VistrailsWiki
Revision as of 22:25, 1 May 2012 by Rhille01 (talk | contribs)

Overview: The goal of this project is to collect tweets hashtagged with a specific college name (e.g., #Harvard) and find the most frequently used words in the tweets about that college. These words will then be rendered as a word cloud, which will help prospective students learn about the atmosphere of each college.

Implementation: The first step will be to write a script that will pull out tweets associated with the college and then search for terms from a predetermined library of descriptors. The word clouds of colleges can then be compared to each other.

Another part of the project will use geotagged tweets from the area surrounding each college. The word clouds for the surrounding area will be compared with those built from the college's own hashtags, to see how the atmosphere of the area compares with the atmosphere of the college and how intertwined the two are.
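One way to put a number on how "intertwined" two word clouds are is cosine similarity over their word counts. The project does not name a metric, so this is only a sketch under that assumption; the function name and the sample counts are invented for illustration.

```python
import math
from collections import Counter


def cosine_similarity(counts_a, counts_b):
    """Return the cosine similarity of two word-count mappings (0.0 to 1.0)."""
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty cloud shares nothing with anything
    return dot / (norm_a * norm_b)


# Hypothetical counts, invented purely for illustration.
college = Counter({"football": 4, "exam": 3, "party": 1})
area = Counter({"football": 2, "concert": 5})
score = cosine_similarity(college, area)
```

A score near 1 would suggest the college and its surroundings talk about the same things in similar proportions; a score near 0 would suggest they share little vocabulary.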

The word clouds will update live (the tweets will be streamed in), and a log of the successive word clouds will be kept to see whether they evolve steadily or whether the outcome is random and thus unhelpful. I will also plot the data and analyze the resulting graphs.

Notes/Ideas: The evolution of the graphs may prove more reliable for some colleges than for others. For example, the surrounding-area tweets and the college tweets may be more intertwined for a school in a very rural area, such as Penn State, than for a school in a city, such as NYU.

The evolution may be steadier over longer periods of time. For example, week-to-week changes may be unreliable, while season-to-season changes may be more consistent.


Update #1 Using the "tweepy" library to search for each college as a hashtag. Wrote a Python script that prints each tweet hashtagged with the college name. I can also print just the hashtags at this point. Made a dictionary of the words I plan on searching for:

   # Each descriptor maps to a unique integer ID ('exam' previously reused ID 22).
   tel = {'hate': 1, 'love': 2, 'smart': 3, 'party': 4, 'hope': 5, 'stupid': 6, 'drunk': 7, 'homework': 8,
          'drugs': 9, 'weed': 10, 'class': 11, 'hard': 12, 'drink': 13, 'bored': 14, 'alcohol': 15, 'crazy': 16,
          'great': 17, 'dream': 18, 'innovation': 19, 'preppy': 20, 'chill': 21, 'stress': 22, 'exam': 23, 'fail': 24,
          'pass': 25, 'football': 26, 'sports': 27, 'pride': 28, 'win': 29, 'basketball': 30, 'music': 31, 'performance': 32,
          'recital': 33, 'game': 34, 'lost': 35, 'concert': 36, 'volleyball': 37, 'struggle': 38, 'wasted': 39, 'abroad': 40,
          'excited': 41, 'terrible': 42, 'books': 43, 'learning': 44, 'graduation': 45, 'over': 46}
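Printing just the hashtags can be sketched without touching the Twitter API at all, by pulling them out of tweet text that has already been retrieved. This is a minimal sketch, not the project's script: the regular expression, the function name, and the sample tweet are assumptions, and the tweepy search call itself is left out because its name and signature differ between tweepy versions.

```python
import re

# A '#' followed by one or more word characters (letters, digits, underscore).
HASHTAG_RE = re.compile(r"#(\w+)")


def extract_hashtags(tweet_text):
    """Return the hashtags in a tweet, lowercased and without the '#'."""
    return [tag.lower() for tag in HASHTAG_RE.findall(tweet_text)]


# Made-up tweet text, for illustration only.
sample = "Finals week at #Harvard... so much #homework #stress"
tags = extract_hashtags(sample)
print(tags)
```

Lowercasing here means #Harvard and #harvard are treated as the same tag in any later counting.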

Update #2 Wrote a function that takes in the list of hashtags (more concise than full tweets) from tweets that include the hashtag of the college name. This list is passed into a second function, which looks for each word from the predetermined dictionary in the list and prints how many times each word was found and at which positions in the list.
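The lookup described in this update might look roughly like the following. It is a sketch that assumes exact, case-insensitive matching of whole hashtags against the dictionary keys; the function names and sample data are hypothetical.

```python
def find_terms(hashtags, terms):
    """Map each term found in `hashtags` to the list positions where it occurs."""
    positions = {}
    for index, tag in enumerate(hashtags):
        word = tag.lower()
        if word in terms:
            positions.setdefault(word, []).append(index)
    return positions


def print_report(positions):
    """Print how many times each term was found and at which positions."""
    for word, where in sorted(positions.items()):
        print(f"{word}: found {len(where)} time(s) at {where}")


# Invented sample data; `terms` stands in for the keys of the descriptor dictionary.
terms = {"love", "exam", "party"}
hashtags = ["harvard", "exam", "love", "Exam", "finals"]
found = find_terms(hashtags, terms)
print_report(found)
```

Exact matching will miss variants such as "exams" or "partying"; substring or stemmed matching would catch more at the cost of false hits.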

Update #3 Scrapped the predetermined library idea. Wrote an algorithm that makes a list of all of the hashtags associated with each college and then counts how many times each word appears. With this algorithm, I won't leave out any data that could be useful in determining the school's atmosphere. It will also be unbiased, because I will not be coming up with a finite list of terms. I'm going to run this program at the end of each day, searching through the most recent 5000 tweets and writing the output to a file. I will make a word cloud every day for each college. At the end, I will compare each college's daily word clouds to one another, and I will compare the colleges to each other by their word clouds. I will also create a word cloud of the most used words throughout the entire week (for each college) and then compare this word cloud to each day's word cloud to determine overall consistency as opposed to day-to-day consistency.
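The counting step and the daily-versus-weekly comparison can be sketched with `collections.Counter`. The overlap measure below (Jaccard overlap of the top-k words) is an assumption chosen for illustration, since the project does not say how consistency is measured; the function names and the sample hashtag lists are likewise invented.

```python
from collections import Counter


def count_hashtags(hashtags):
    """Count how often each hashtag appears, ignoring case."""
    return Counter(tag.lower() for tag in hashtags)


def top_words(counts, k):
    """Return the set of the k most frequent words."""
    return {word for word, _ in counts.most_common(k)}


def top_k_overlap(day_counts, week_counts, k=10):
    """Jaccard overlap between a day's top-k words and the week's top-k words."""
    day_top = top_words(day_counts, k)
    week_top = top_words(week_counts, k)
    union = day_top | week_top
    return len(day_top & week_top) / len(union) if union else 0.0


# Invented daily hashtag lists, for illustration only.
daily = [
    count_hashtags(["exam", "exam", "library"]),
    count_hashtags(["exam", "football", "football"]),
]
weekly = sum(daily, Counter())  # weekly cloud = sum of the daily counts
overlaps = [top_k_overlap(day, weekly, k=2) for day in daily]
```

An overlap of 1.0 means a day's most prominent words match the week's exactly; values that swing widely from day to day would point toward the "random and thus unhelpful" outcome.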


Findings/problems to consider

-Findings:

  +the college name always has the highest number of appearances, so it can be factored out when deciding which words are most prominent
  +very useful when something big is happening at the college that everyone tweets about

-Problems:

  +tweets may use the full name of the college or an abbreviation, so it's impossible to amass all of the tweets regarding every college
  +slang
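Two of the points above, factoring out the college name and coping with abbreviations, can be partly handled with a small alias table. This is a sketch only: the table, its spellings, and the function names are hypothetical, abbreviations can be ambiguous between schools, and slang is not addressed at all.

```python
from collections import Counter

# Hypothetical alias table; the spellings listed are illustrative guesses.
ALIASES = {
    "nyu": {"nyu", "newyorkuniversity"},
    "pennstate": {"pennstate", "psu"},
}


def canonical_college(tag):
    """Return the canonical college key for a hashtag, or None if it is not a known alias."""
    word = tag.lower()
    for college, names in ALIASES.items():
        if word in names:
            return college
    return None


def counts_without_college(hashtags, college):
    """Count hashtags, dropping every known alias of `college` itself."""
    return Counter(
        tag.lower() for tag in hashtags if canonical_college(tag) != college
    )


# Invented hashtag list, for illustration only.
counts = counts_without_college(
    ["PSU", "football", "pennstate", "football", "win"], "pennstate"
)
```

The same table could drive the search step, running one query per alias and merging the results, though it would still only cover the spellings someone thought to list.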