This is your first real assignment for CS 5630/6630.
The assignment is due at midnight on September ??, 2007. You will need to use the CADE handin functionality to turn in your assignment. The class account is "cs5630".
The purpose of this initial assignment is to make sure you understand the basic plotting concepts covered in class. Examples of plotting were provided after the lectures and can be found here: PlottingVistrails.zip. As you work on the assignment, we encourage you to read the available documentation on both matplotlib and Python.
Here is the initial vistrail file hw1.vt and plotting data hw1_data.zip that you should use for completing your work. You may need to update the paths in the existing File modules for the existing nodes to execute correctly. Build upon this vistrail to do your assignment. As before, show your work by submitting the complete vistrail you used to solve the problem.
The data we will be using for this assignment comes from weather measurements near Snowbird Ski Resort in Little Cottonwood Canyon (original data found here and here). To make things simpler, the data we provide has been reformatted so that it is easy to parse. The measurements were taken daily (or monthly) for a water year (starting Oct 1 and ending Sep 30).
This problem deals with simple connected symbol plots, as shown in the MaunaLoa.vt example. The "Precip" node in the history tree plots a list of accumulated precipitation values in inches for monthly measurements in 2007. Start with this node and make the following changes. Label them "Problem 1a", "Problem 1b", etc.
a. Apply the principles of plotting described in the notes to improve the vision and the understanding of the plot. In the notes, list the principles that were addressed and how they were addressed.
b. The "Precip" pipeline reads data for 2007 from precip07.dat. Directly compare this with the 2006 measurements found in precip06.dat by Superposition (on the same plot).
c. Repeat part b, but compare using Juxtaposition (each plot in a different spreadsheet cell). In the notes, describe which technique (superposition vs. juxtaposition) makes the most sense for this data and why.
This problem deals with histograms and showing distributions of data (for an example, see Histogram.vt). The data file snowdepth07.dat contains snow depths in inches for the entire water year (one entry per line). Show the distribution of snow depths using a histogram. In the notes, describe how you chose the number of bins that were used.
This problem deals with dot plots for labeled data (for an example, see DeathRate.vt). The annual_snowfall.dat file lists all the Utah ski resorts and their average annual snowfall in inches (in the form string:int, just like the DeathRate data). Interestingly, there is no correlation between snowfall and ticket cost. Plot the data on a dot plot and, in the notes, describe what you had to do to the plot.
This problem deals with correlation (for an example, see the Correlation.vt example). The temp_precip07.dat file contains a line for each day of the year, which includes the air temperature in Celsius and the amount of precipitation in inches (in the form "10:0.5" for 10 degrees C and 0.5 inches). Note that this format is similar to that of the labeled data in the MammalScaling.vt example, so you can use a similar parser. Perform the following tasks and label the nodes "Problem4a", "Problem4b", etc.
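As one possible starting point for choosing the number of bins, the sketch below computes two common rules of thumb, Sturges' rule and the Freedman-Diaconis rule. Neither rule comes from the course notes; the function names and the sample depths are made up for illustration, and you should still judge the resulting histogram by eye.

```python
import math


def sturges_bins(n):
    """Sturges' rule: ceil(log2(n)) + 1 bins for n samples."""
    return math.ceil(math.log2(n)) + 1


def freedman_diaconis_bins(values):
    """Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3)."""
    data = sorted(values)
    n = len(data)

    def quantile(q):
        # Quartile estimate by linear interpolation between sorted samples.
        pos = q * (n - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, n - 1)
        return data[lo] + (data[hi] - data[lo]) * (pos - lo)

    iqr = quantile(0.75) - quantile(0.25)
    if iqr == 0:
        # Degenerate spread: fall back to Sturges' rule.
        return sturges_bins(n)
    width = 2 * iqr / n ** (1 / 3)
    return max(1, math.ceil((data[-1] - data[0]) / width))


# Hypothetical snow depths in inches (NOT the contents of snowdepth07.dat).
depths = [0, 0, 0, 2, 5, 9, 14, 20, 31, 40, 52, 60, 71, 80, 88, 90]
print(sturges_bins(len(depths)), freedman_diaconis_bins(depths))
```

For a full year of daily values, sturges_bins(365) gives 10. The two rules can disagree noticeably on skewed data, which is worth mentioning in your notes if you use either one.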
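A minimal sketch of a parser for the "temperature:precipitation" format described above. It assumes one pair per line and skips blank lines; the function name and the sample lines are hypothetical and are not taken from MammalScaling.vt.

```python
def parse_temp_precip(lines):
    """Split 'temp:precip' lines into two parallel lists of floats."""
    temps, precips = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        temp_text, precip_text = line.split(":")
        temps.append(float(temp_text))
        precips.append(float(precip_text))
    return temps, precips


# Made-up sample lines in the same format as temp_precip07.dat.
sample = ["10:0.5", "-3:0.0", "", "7.5:1.2"]
temps, precips = parse_temp_precip(sample)
print(temps, precips)
```

Because the function only iterates over lines, an open file object for temp_precip07.dat can be passed in place of the sample list.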
a. (Grads and UGrads) Plot the data using a scatterplot with temperature on the X axis and precipitation on the Y axis. Be sure to use the basic principles of plotting. In the notes for this node, describe any correlation that you can perceive (rough judgement, not calculated) and any conclusions that could be drawn.
b. (Grads only) Because of the limited resolution of the measurements, the data fall on a regular grid and many points are stacked on top of one another. This makes it difficult to see where the data are concentrated. Resolve this problem by using one of the following techniques:
- jittering: Perturb the points by a small random amount such that the overlap is reduced.
- symbols: Find stacked points and represent them using one point that is drawn differently (heavier weight or different symbol).
- colormap: Find stacked points and color them differently depending on how many are in the stack.
In the notes for the node, describe what you did.
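The techniques listed above share two building blocks: perturbing coordinates, and counting how many points land on the same location. The sketch below shows one way to do each with the standard library. The helper names, jitter amounts, and sample data are illustrative assumptions, and the plotting step is left out.

```python
import random
from collections import Counter


def jitter(values, amount, rng):
    """Return a copy of values, each shifted uniformly within +/- amount."""
    return [v + rng.uniform(-amount, amount) for v in values]


def count_stacks(xs, ys):
    """Map each distinct (x, y) location to the number of points on it."""
    return Counter(zip(xs, ys))


# Made-up measurements with deliberate repeats.
xs = [10, 10, 10, 12, 12, 15]
ys = [0.5, 0.5, 0.5, 0.0, 0.0, 1.0]

rng = random.Random(0)  # fixed seed so the jittered plot is reproducible
jittered_x = jitter(xs, 0.25, rng)
jittered_y = jitter(ys, 0.02, rng)

stacks = count_stacks(xs, ys)
for (x, y), count in sorted(stacks.items()):
    print(x, y, count)
```

The counts can drive either the symbol or the colormap approach, for example by passing them as marker sizes or colors to a scatter call. For jittering, keep the amount well below the grid spacing so that points do not drift toward a neighboring grid location.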
c. (Grads only) Perform a linear regression to fit a line through the data. Is a degree 1 polynomial (line) sufficient? What happens with a higher degree polynomial such as a cubic (degree 3) polynomial? Note that the 3rd parameter of the scipy.polyfit function defines the degree of the polynomial, and the number of coefficients returned is determined by the degree: a degree n fit returns n+1 coefficients. Thus, for a degree 2 fit, (ar,br) = scipy.polyfit(x,y,1) would need to become (ar,br,cr) = scipy.polyfit(x,y,2). The call to the polyval function would need to be changed in a similar way. Also note that the data may need to be sorted along the x axis so that the polyval points are monotonic in x (and the fitted curve does not double back on itself). In the notes, describe which fit you settled on and why.
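The sketch below illustrates the points in part c on synthetic data: the number of coefficients grows with the degree, and sorting by x keeps the evaluated curve in order. It uses numpy.polyfit and numpy.polyval rather than the scipy names in the text, on the assumption that your SciPy version may not re-export them; the data are made up and are not the temperature measurements.

```python
import numpy as np

# Synthetic, unsorted data lying exactly on a cubic (NOT the course data).
x = np.array([3.0, -1.0, 0.0, 2.0, 1.0, -2.0])
y = x**3 - 2.0 * x + 1.0

# Sort by x so the fitted curve is drawn left to right.
order = np.argsort(x)
xs, ys = x[order], y[order]

# A degree n fit returns n + 1 coefficients, highest power first.
line_coeffs = np.polyfit(xs, ys, 1)   # 2 coefficients
cubic_coeffs = np.polyfit(xs, ys, 3)  # 4 coefficients

# polyval accepts the whole coefficient array, so no unpacking is needed.
line_fit = np.polyval(line_coeffs, xs)
cubic_fit = np.polyval(cubic_coeffs, xs)

# Sum of squared residuals as a rough measure of fit quality.
line_sse = float(np.sum((ys - line_fit) ** 2))
cubic_sse = float(np.sum((ys - cubic_fit) ** 2))
print(len(line_coeffs), len(cubic_coeffs), line_sse, cubic_sse)
```

A lower residual alone does not justify a higher degree, since adding coefficients never makes the least-squares residual worse. Look at whether the extra bends in the curve are plausible for the data before settling on a fit.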