SciVisFall2008/Assignment 1

Latest revision as of 16:02, 25 September 2008

This is your second assignment for CS 5630/6630.

The assignment is due at midnight on October 6th, 2008. You will need to use the CADE handin functionality to turn in your assignment. The class account is "cs5630".

This assignment was successfully tested in release 1.2.1rev1336 and should work in that release and any later one. Check your release before starting your work and upgrade it if necessary.

The Vistrails User's Guide will probably be helpful to you in this assignment.

The purpose of this assignment is to make sure you understand the basic plotting concepts covered in class and learn to use matplotlib, Python, and Vistrails as tools to produce plots. Examples of plotting were provided in the lectures and can be found here: PlottingVistrails.zip. As you work on the assignment, we encourage you to read the available documentation on both matplotlib and Python.

The data for the four problems of this assignment are in four files: stocks.dat (problem 1), actions-fall-2007.dat (problem 2), microprocessors.dat (problem 3) and genes.dat (problem 4). These four files are packed into a single zip file called Hw1data.zip. The task of unzipping and locating these files is already done in the starting vistrails file for this assignment: Assignment1.vt. You should solve the problems by working directly in this vistrails file. When you open Assignment1.vt you will see four tagged versions, each of which loads the raw data needed for one of the four problems. As before, show your work by submitting the complete vistrail you used to solve the problems.

Exercise 1: Principles of plotting and connected symbol plots

The file stocks.dat has the first quote of each month, from January 2006 to September 2008, for the stocks of Apple Inc. (AAPL) and Microsoft Corporation (MSFT). Below we present the first three lines and the last two lines of this file.

month,apple,microsoft
2008-09,140.91,25.16
2008-08,169.53,27.29
...
2006-02,68.49,25.92
2006-01,75.51,27.06

(a) Apply the principles of plotting described in class and in the class notes to generate a simple connected symbol plot for all Apple's quotes in the file stocks.dat. Tag the final version of this plot as "Problem 1a" and annotate it with an explanation of the plotting principles you used to make this a clear plot.

(b) Using the quote of January 2006 as the reference, directly compare the progress of Apple's and Microsoft's stocks by generating a plot using superposition (both curves in the same plot). Tag this final plot as "Problem 1b" and annotate it with the conclusions you can draw from this plot.

(c) Repeat item (b), but now using juxtaposition: split the two curves (i.e. Apple's stock progress relative to January 2006 and Microsoft's stock progress relative to January 2006) into two different plots (each plot in a different spreadsheet cell). Tag the final version as "Problem 1c" and annotate it with a description of which technique (superposition vs. juxtaposition) makes more sense for this data and why.
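One possible reading of "progress relative to January 2006" is to divide every quote by that stock's January 2006 quote, so both curves start at 1.0 and can share an axis. The sketch below is only an illustration under that assumption: the function name relative_progress is hypothetical, the sample rows are the ones quoted above, and the plotting itself (and how the data reaches your pipeline in Vistrails) is left out.

```python
import csv
import io

# Rows copied from the excerpt of stocks.dat shown above. The real file has
# one row per month and lists the newest month first.
SAMPLE = """month,apple,microsoft
2008-09,140.91,25.16
2008-08,169.53,27.29
2006-02,68.49,25.92
2006-01,75.51,27.06
"""


def relative_progress(text, baseline_month="2006-01"):
    """Return (months, apple, microsoft) in chronological order, with each
    quote divided by the same stock's quote in the baseline month.

    Dividing by the baseline is an assumed interpretation of "relative to
    January 2006"; a percentage change would work just as well.
    """
    # "YYYY-MM" strings sort chronologically, so a plain string sort suffices.
    rows = sorted(csv.DictReader(io.StringIO(text)), key=lambda r: r["month"])
    base = next(r for r in rows if r["month"] == baseline_month)
    months = [r["month"] for r in rows]
    apple = [float(r["apple"]) / float(base["apple"]) for r in rows]
    microsoft = [float(r["microsoft"]) / float(base["microsoft"]) for r in rows]
    return months, apple, microsoft


months, apple, microsoft = relative_progress(SAMPLE)
for m, a, s in zip(months, apple, microsoft):
    print(f"{m}  AAPL x{a:.2f}  MSFT x{s:.2f}")
```

The two returned series can be drawn on one set of axes for superposition, or on two sets of axes for juxtaposition.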

Exercise 2: Histogram and number of bins

As we are doing this year, during the Fall 2007 Scientific Visualization course we collected all of the students' assignments in Vistrails format. The file actions_fall_2007.dat has the timestamps of all the actions of all the students in all the assignments: a total of 132131 actions. The first three lines of this file are:

timestamp
2007-09-15 21:24:56
2007-09-15 21:25:16
...

Create a histogram for the distribution of these timestamps and highlight the following due dates in the histogram. (Note that, for some reason, the due dates of assignments 5 and 6 are not in numerical order: in the table below, assignment 6 is due before assignment 5.)

| Assignment | Due Date            |
|------------+---------------------|
|          0 | 2007-09-18 12:00:00 |
|          1 | 2007-09-18 12:00:00 |
|          2 | 2007-10-04 12:00:00 |
|          3 | 2007-10-25 12:00:00 |
|          4 | 2007-11-27 12:00:00 |
|          5 | 2007-12-15 12:00:00 |
|          6 | 2007-12-11 12:00:00 |

When you finish your histogram, tag its pipeline version as "Problem 2" and annotate it with answers to the following questions:

(a) How did you select the bins for the histogram and why?

(b) What hypothesis can you make about the amount of work (i.e. number of actions) for the different assignments just by looking at this histogram?

(c) What pattern can you observe for the amount of work (i.e. number of actions) close to the deadlines?
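Choosing bins for timestamps usually starts with turning each timestamp into a position on a time axis. The sketch below shows one option, one bin per calendar day, using only the standard library. It is an assumption for illustration, not the required binning. The helper names daily_bins and bin_index are hypothetical, and the third sample timestamp is made up so the gap between days is visible.

```python
from collections import Counter
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"  # timestamp layout seen in the file excerpt above


def daily_bins(stamps):
    """Count actions per calendar day.

    Returns (first_day, counts), where counts[i] is the number of
    timestamps that fall on first_day + i days. Days with no actions
    are kept as zeros so the histogram has no hidden gaps.
    """
    days = [datetime.strptime(s, FMT).date() for s in stamps]
    first, last = min(days), max(days)
    per_day = Counter(days)
    n_days = (last - first).days + 1
    counts = [per_day[first + timedelta(days=i)] for i in range(n_days)]
    return first, counts


def bin_index(first_day, stamp):
    """Index of the daily bin containing `stamp`, e.g. to mark a due date."""
    return (datetime.strptime(stamp, FMT).date() - first_day).days


stamps = [
    "2007-09-15 21:24:56",  # from the file excerpt
    "2007-09-15 21:25:16",  # from the file excerpt
    "2007-09-17 10:00:00",  # made-up value, for illustration only
]
first_day, counts = daily_bins(stamps)
due_bin = bin_index(first_day, "2007-09-18 12:00:00")  # due date of assignment 0
print(first_day, counts, due_bin)
```

Coarser bins (weeks) or finer bins (hours) follow the same pattern by changing how a timestamp is mapped to an index. Which width shows the deadline pattern best is the question to answer in your annotation.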

Exercise 3: Dot plots for labeled data

Each line of the file microprocessors.dat (except for the header line) has two quantitative values associated with a label. The quantitative values are "year of introduction" and "number of transistors" and the label is the name of a "microprocessor" (e.g. 286, 386, 486, Pentium 4). See the first three lines of this file:

Processor,Year of Introduction,Transistors
Pentium 4 processor,2000,42000000
286,1982,120000
...

Generate two horizontally juxtaposed dot plots for these microprocessors: one for "year of introduction" and the other for "number of transistors". For the "number of transistors" dot plot, use a log base 10 scale. The two plots should be in the same spreadsheet cell. Tag your final pipeline version as "Problem 3".
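Because the two dot plots share the same labels, it helps to prepare one ordered list of labels and two aligned value lists before plotting. The sketch below does this for the two rows quoted above and takes the base-10 logarithm of the transistor counts. Ordering the rows by year is an assumption made here for readability, and dot_plot_data is a hypothetical helper name.

```python
import csv
import io
import math

# Rows copied from the excerpt of microprocessors.dat shown above.
SAMPLE = """Processor,Year of Introduction,Transistors
Pentium 4 processor,2000,42000000
286,1982,120000
"""


def dot_plot_data(text):
    """Return (labels, years, log10_transistors), all in the same row order.

    Rows are sorted by year of introduction (an assumed ordering) so the
    same label order can be reused on both juxtaposed dot plots.
    """
    rows = sorted(
        csv.DictReader(io.StringIO(text)),
        key=lambda r: int(r["Year of Introduction"]),
    )
    labels = [r["Processor"] for r in rows]
    years = [int(r["Year of Introduction"]) for r in rows]
    log_transistors = [math.log10(int(r["Transistors"])) for r in rows]
    return labels, years, log_transistors


labels, years, log_transistors = dot_plot_data(SAMPLE)
for name, year, lt in zip(labels, years, log_transistors):
    print(f"{name:22s} {year}  log10(transistors) = {lt:.2f}")
```

An alternative to transforming the values is to plot the raw counts and set the axis itself to a logarithmic scale, which keeps the tick labels in transistor units.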

Exercise 4: Correlation, scatterplots and regression

Let A, B, C, D be four genes. A scientist measured the activity (i.e. the expression) of these genes in 100 different conditions. The results are given in the file genes.dat. Here are the first three lines of this file:

A,B,C,D
0.636244,0.239430,0.745650,0.900198
0.342974,0.800676,0.375399,0.457818
...

Generate a 4 x 4 matrix of scatter plots to understand the correlations between the four genes. Visually analyze the plot and rank the genes B, C, D in decreasing order of correlation with A. Now draw a linear best fit line in the plots of A with its most correlated gene, a cubic best fit curve in the plots of A with its second most correlated gene, and a degree-5 polynomial best fit curve in the plots of A with its least correlated gene. Tag the final pipeline version that produces all these plots (in a single spreadsheet cell) as "Problem 4".
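The exercise asks for a visual ranking, but a numeric check can help confirm what the scatter plots suggest. The sketch below computes the Pearson correlation coefficient by hand and ranks columns by its absolute value. Using the absolute value is an assumption here, since a strong negative relationship is also a strong correlation. The data is made up for illustration and is not taken from genes.dat, and rank_by_correlation is a hypothetical helper name.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rank_by_correlation(columns, target="A"):
    """Names of the other columns, most to least correlated with `target`.

    Ranking uses |r|, which is an assumption about what "correlation"
    should mean for this ranking.
    """
    others = [name for name in columns if name != target]
    return sorted(
        others,
        key=lambda name: abs(pearson(columns[target], columns[name])),
        reverse=True,
    )


# Made-up values for illustration only; NOT the contents of genes.dat.
columns = {
    "A": [1, 2, 3, 4, 5],
    "B": [2, 4, 6, 8, 10],  # exactly linear in A
    "C": [1, 3, 2, 5, 4],   # loosely increasing with A
    "D": [2, 1, 2, 1, 2],   # no linear trend with A
}
ranking = rank_by_correlation(columns)
print(ranking)
```

For the best fit curves themselves, a least-squares polynomial fit such as numpy.polyfit(x, y, deg) with deg set to 1, 3, or 5 is one commonly used option, assuming NumPy is available in your installation.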