Fitting Models to Data (Part 2)#
Prof. Ryan Trainor, Franklin & Marshall College
Objectives#
To introduce the role of measurement uncertainties in model evaluation
To compare logarithmic and linear scaling for visualizing data
To estimate the uncertainty on inferred model parameters
Learning Outcomes#
By the end of this workbook, students should be able to
Include measurement uncertainties in the presentation of data and model fitting
Compute the Chi-square value for a model and use it to evaluate the goodness of fit
Use logarithmic scaling to recognize and evaluate power-law relationships
Use comments to document their code and functions
Prerequisites#
This notebook assumes that the student has already completed Part 1 of this Model Fitting activity and is familiar with its prerequisites
Dependencies#
This notebook requires the standard libraries for scientific Python (NumPy, SciPy) as well as the Astropy library. Astropy is included in the full installation of the Anaconda Distribution of Python, or it can be installed on its own by running conda install astropy or pip install astropy.
Note for instructors: this notebook has been written as an activity to be completed by a group of students in a lab or classroom setting. A solution with filled-in answers is available upon email request.
Review: Plotting the data#
Our first step is to create the same plot we made in the final step of Part 1 of the Model Fitting activity: i.e., to make a plot of luminosity vs. mass that includes the best-fit power-law function. You do not need to include the linear or quadratic curves we considered in that activity in this plot.
NOTES:
Feel free to copy and paste code from Part 1, but try to leave out any unnecessary commands: this is a good exercise to see how little code you actually need to make this plot!
You will need the data file from Part 1 if you do not have it. You can download the data at this link or by using the wget command in the cell below.
As you write or copy your code, practice using comments. Remember that comments are sections of code that are not read by Python; they only exist to help you (or other people) understand what your code is doing. A comment starts with a # sign, and everything after the # on the same line is ignored by Python. For each line or section of 2-3 lines that you copy from your Fitting Models to Data (Part 1) code, add a comment to explain what your code is doing in that section and why you need it.
While you’re looking at your previous lab, make sure you refresh your memory about the Residual Sum of Squares (RSS) and how it relates to the best-fit line. You’ll need this information later for Checkpoint 1.
# Download the file 'star-mass-lum-lumerr.dat' using wget (use this if you are running on Google Colab)
!wget --no-check-certificate 'https://drive.google.com/uc?export=download&id=1BWet8_t2gyNGtFNMP_1opqcYBHDinEfG' -O star-mass-lum-lumerr.dat
# Re-create the final plot from Part 1, omitting unnecessary steps
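If you get stuck, here is one possible minimal sketch. The column names 'Mass_Msun' and 'Lum_Lsun', the function name powerlaw_func, and the variable names A_orig and alpha_orig are assumptions (only 'Lum_err_Lsun' is named explicitly in this activity), so adjust them to match your own file and your Part 1 code.

```python
# A minimal re-creation of the Part 1 plot (column and variable names are assumptions)
import numpy as np
import matplotlib.pyplot as plt
from astropy.table import Table
from scipy.optimize import curve_fit

# Read the data file into an Astropy Table
data = Table.read('star-mass-lum-lumerr.dat', format='ascii')
mass = data['Mass_Msun']   # stellar masses (assumed column name)
lum = data['Lum_Lsun']     # stellar luminosities (assumed column name)

# Power-law model from Part 1: L = A * M**alpha
def powerlaw_func(x, A, alpha):
    return A * x**alpha

# Re-fit the power law (or reuse the values you found in Part 1)
popt, pcov = curve_fit(powerlaw_func, mass, lum)
A_orig, alpha_orig = popt   # hypothetical names for the Part 1 (unweighted) fit

# Plot the data and the best-fit power law
xmodel = np.linspace(mass.min(), mass.max(), 100)
plt.plot(mass, lum, 'o')
plt.plot(xmodel, powerlaw_func(xmodel, A_orig, alpha_orig))
plt.xlabel('Mass (solar masses)')
plt.ylabel('Luminosity (solar luminosities)')
plt.show()
```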
Displaying error bars#
Once you have reproduced your plot from Part 1, look at the data compared to the best-fit line. You should see that the line does not perfectly go through all of the points. Does this mean our model is a bad model? Maybe, but maybe not—if there are measurement errors in our data, then it may be that our measurement errors are causing the data and model to disagree.
Thankfully, our data include an estimate of the uncertainty on each luminosity measurement (which we ignored in Part 1). These uncertainties are given in the fourth column of the Astropy Table that contains our data. This column is named 'Lum_err_Lsun', so we can print its contents by indexing the Table with the column name. Test this below:
# Print the column named 'Lum_err_Lsun' in your Astropy Table
print(?)
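For example, if you named your Table data when you read it in, one way to do this is:

```python
# Print the luminosity uncertainty column by indexing the Table with its column name
print(data['Lum_err_Lsun'])
```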
Given these uncertainties, we can now add error bars to the plot we created previously. You can do this by replacing the plt.plot() command that you used to plot the measured masses and luminosities with the plt.errorbar() command. This command has the following syntax:
plt.errorbar(x,y,yerr,marker=?,ls=?,color=?)
Here, x is your array of masses, y is your array of luminosities, and yerr is your array of luminosity uncertainties. marker allows you to specify a symbol to mark the points, ls allows you to set a linestyle that connects the points, and color allows you to specify the color. Let's use orange squares, and let's not use a line to connect them. Remember that you can find information on format options by googling "matplotlib plot formats" or going to the following link: https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.plot.html. You may also find ls='none' useful.
Also, continue adding comments to explain what you are doing and changing at each step.
# Create your plot with errorbars here:
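Here is one possible version, assuming the mass, lum, xmodel, powerlaw_func, A_orig, and alpha_orig names from the earlier sketch:

```python
# Luminosity uncertainties from the table (column name given above)
lum_err = data['Lum_err_Lsun']

# Plot the data as orange squares with error bars and no connecting line
plt.errorbar(mass, lum, yerr=lum_err, marker='s', ls='none', color='orange')

# Overplot the best-fit power law from Part 1
plt.plot(xmodel, powerlaw_func(xmodel, A_orig, alpha_orig), color='k')
plt.xlabel('Mass (solar masses)')
plt.ylabel('Luminosity (solar luminosities)')
plt.show()
```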
Error bars are useful for determining whether the model is statistically consistent with the data. Note that we cannot know the actual error on each measurement: if we knew the error on a given measurement, we would simply subtract it to get the true value!
Instead, our uncertainty (as shown by our error bars) represents the magnitude of the typical error we expect for a single measurement: i.e., it represents the expected magnitude of the residual (the difference between the data and the model, introduced in Part 1) for a true model. For a good model, we expect that 68% of the points will have error bars that include the model line, and 95% of the points will be less than 2x the error bar away from the model line.
CHECKPOINT 1#
If you are completing this activity in class, stop here and check your plot with your instructor or TA before you continue. You may want to think about the following questions in advance:
What does the “residual sum of squares” (RSS) mean, and how does it relate to your best-fit line?
Which stars in your plot have the most precisely determined luminosities? Which have the most uncertain luminosities?
Does your best-fit line seem consistent with your data, given the uncertainties in your fit?
Using a logarithmic plot#
One thing you might notice about your plot above is that most of the points are all clustered together at very small values in the bottom-left corner. This is a common issue in astronomy, where we often deal with numbers that cover many orders of magnitude in scale—if you make your plot big enough to accommodate the big numbers, then you can barely see the differences among the small numbers!
One way to deal with this is to use a plot with logarithmic scaling: this will change our x and y axes to show powers of 10 instead of linear values. Python has an easy way to do this by adding a single command to the script for our plot.
Sometimes we have data that are evenly distributed on one axis but heavily skewed on the other. In these situations, you can use plt.semilogx() to make the x axis logarithmic while leaving the y axis on a linear scale, while plt.semilogy() does the opposite. If both axes should be shown logarithmically, we can use plt.loglog(). Re-create your plot from above (with the errorbars) and try adding one of these commands. Test all three commands and decide as a group which one makes the data easiest to see.
# Create your plot with errorbars and logarithmic scaling here:
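A possible version of the logarithmic plot, again using the variable names assumed in the earlier sketches:

```python
# Same errorbar plot as before, but with logarithmic scaling on both axes
plt.errorbar(mass, lum, yerr=lum_err, marker='s', ls='none', color='orange')
plt.plot(xmodel, powerlaw_func(xmodel, A_orig, alpha_orig), color='k')
plt.loglog()   # also try plt.semilogx() or plt.semilogy() and compare
plt.xlabel('Mass (solar masses)')
plt.ylabel('Luminosity (solar luminosities)')
plt.show()
```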
If you have not already included your best-fit power-law relationship, make sure to display it as well. What shape does it make? Hopefully you will find that it looks like a straight line (i.e., the power law looks linear when plotted logarithmically).
One other thing you may find is that our power-law model that looked so good in linear space suddenly looks like a poor fit, especially for the low masses: it systematically misses all of the low-mass points! What’s going on here?
Remember that we made our fit to minimize the RSS: we wanted a line that was “as close as possible” to all of the points, and each point was treated equally. This means that being off by 1 L\(_\odot\) from our largest luminosity was just as bad as being off by 1 L\(_\odot\) from our smallest luminosity. However, that doesn’t really make sense—1 L\(_\odot\) is a tiny fraction of the total luminosity of our bright stars, but it’s a huge fraction for our faint stars. Also, and more importantly, remember what we saw for our error bars: the largest luminosities are the most uncertain, whereas the small values are more precisely known in our data. With this in mind, we should focus on trying to get a better fit to the points we know precisely, even if it means we get a slightly worse fit to the other points. That is, we need to give the points with small uncertainties additional weight, so that they count more in the fit.
In order to create a fit that weights points according to their uncertainties, we will need to discuss the \(\chi^2\) value (pronounced “\(\chi\)-squared”, where \(\chi\) is a Greek letter that physicists pronounce like “KAI”).
\(\chi^2\) values#
The \(\chi^2\) value of a fit is very similar to the RSS. The only difference is that each residual is weighted by the inverse of its uncertainty, so that points with small uncertainties are given larger weight. It is defined as follows:

$$
\chi^2 = \sum_i \frac{(\Delta y_i)^2}{\sigma_i^2} = \sum_i \frac{(y_{i,\mathrm{obs}} - y_{i,\mathrm{pred}})^2}{\sigma_i^2}
$$

As in Part 1, \(\Delta y_i\) is the residual of point i, which is the difference between the observed value at point i (\(y_{i,\mathrm{obs}}\)) and the value at point i predicted by the model (\(y_{i,\mathrm{pred}}=f(x_i)\), where \(f(x)\) represents our model). \(\sigma_i\) is the uncertainty in the y value for that particular data point. Note that if we don't divide by \(\sigma_i\) (or if \(\sigma_i=1\) for every point), then the \(\chi^2\) value is equivalent to the RSS.
In the box below, do the following three things:
1. Create a function that will calculate the \(\chi^2\) value for arbitrary arrays yobs, ypred, and sigma.
2. Also create a second function that computes our old friend the RSS based on yobs and ypred.
3. Finally, use your function to find the \(\chi^2\) value for the best-fit power-law model you found above.
# Fill in the functions below
def chi2(?,?,?):
return ?
def RSS(?,?):
return ?
chi2_orig = ?
print(chi2_orig)
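If you need a starting point, here is one possible implementation; the last two lines assume the lum, mass, lum_err, A_orig, and alpha_orig variables from the earlier sketches.

```python
def chi2(yobs, ypred, sigma):
    '''Return the chi-squared value: each residual is divided by its
    uncertainty before being squared and summed.'''
    return np.sum(((yobs - ypred) / sigma)**2)

def RSS(yobs, ypred):
    '''Return the (unweighted) residual sum of squares.'''
    return np.sum((yobs - ypred)**2)

# Chi-squared of the original (unweighted) power-law fit
chi2_orig = chi2(lum, powerlaw_func(mass, A_orig, alpha_orig), lum_err)
print(chi2_orig)
```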
If your calculation is working, you should find a \(\chi^2\) value of approximately 5000. This is a lot! Remember: if our uncertainties and our model are both accurate, then on average our residuals will be equal in magnitude to our uncertainties: \((\Delta y)^2=\sigma^2\), and thus for a true model:

$$
\chi^2 = \sum_i \frac{(\Delta y_i)^2}{\sigma_i^2} \approx N,
$$

where \(N\) is the number of data points.
Since our table has 235 stars in it (you can check this with print(len(data)) if you named your Table data), we should have \(\chi^2\approx 235\). The fact that the model above has a much larger \(\chi^2\) value means that our residuals are much larger than our uncertainties on average, and thus that there is a significant disagreement between our data and the model that cannot be explained by measurement errors alone.
\(\chi^2\) minimization#
Now our goal is to create a new fit that takes our uncertainties into account. We can do this by passing an optional argument called sigma to curve_fit. In the cell below, use curve_fit() as you have previously, but this time set sigma equal to the array of luminosity uncertainties:
popt,pcov=?
A_new,alpha_new=popt
print(A_new,alpha_new)
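One possible way to fill in the cell above, assuming powerlaw_func, mass, lum, and lum_err are defined as in the earlier sketches:

```python
# Weighted fit: points with small uncertainties now count more heavily
popt, pcov = curve_fit(powerlaw_func, mass, lum, sigma=lum_err)
A_new, alpha_new = popt
print(A_new, alpha_new)
```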
In the cell below, now create a plot showing the data (with errorbars) and two power-law curves: one with your original best-fit parameters, and one with your new best-fit parameters. You may want to look at your plot in the log-log view and the linear view as you consider the questions below.
# Plot the luminosity vs. mass for the real data with errorbars
# include the model lines from the original fit and the fit weighted by the uncertainties
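A possible version of this plot, using the variable names from the earlier sketches (A_orig and alpha_orig are the hypothetical names used there for the Part 1 fit):

```python
# Data with error bars
plt.errorbar(mass, lum, yerr=lum_err, marker='s', ls='none', color='orange')

# Original (unweighted) fit and the new fit weighted by the uncertainties
plt.plot(xmodel, powerlaw_func(xmodel, A_orig, alpha_orig), color='k', label='original fit')
plt.plot(xmodel, powerlaw_func(xmodel, A_new, alpha_new), color='b', label='weighted fit')

plt.loglog()   # switch between log-log and linear views as you discuss the questions
plt.legend()
plt.xlabel('Mass (solar masses)')
plt.ylabel('Luminosity (solar luminosities)')
plt.show()
```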
CHECKPOINT 2#
If you are completing this activity in class, stop here and check your plot with your instructor or TA before you continue. You may want to think about the following questions in advance:
Which line seems to be consistent with more of your points (i.e., within the uncertainties)?
Which line more precisely predicts the luminosity of the star with \(M=40\) M\(_\odot\)? Is this the best fit overall?
Which line has a smaller \(\chi^2\) value? Which has a smaller sum of squared residuals? You can use the cell below to do some calculations to check. Is either \(\chi^2\) value consistent with a “good” fit?
# Check chi^2 and RSS of original model and new model
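A possible set of calculations for this checkpoint, reusing the chi2() and RSS() functions and the variables from the earlier sketches:

```python
# Model predictions at the measured masses for both sets of parameters
lum_pred_orig = powerlaw_func(mass, A_orig, alpha_orig)
lum_pred_new = powerlaw_func(mass, A_new, alpha_new)

# Compare the goodness-of-fit statistics for the two fits
print('original fit: chi2 =', chi2(lum, lum_pred_orig, lum_err), 'RSS =', RSS(lum, lum_pred_orig))
print('weighted fit: chi2 =', chi2(lum, lum_pred_new, lum_err), 'RSS =', RSS(lum, lum_pred_new))
```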
Uncertainties in the model#
Because our data have some uncertainty, our model parameters have some uncertainty as well. What does this mean? Let's imagine that we made new measurements of all of the luminosities of these stars. We might expect to find slightly different luminosities the second time, with the differences between the first and second measurements (due to the different measurement errors) being similar in size to our uncertainties.
If we then fit a best-fit line to our new measurements, we would expect that we might get a slightly different "best" set of parameters A and \(\alpha\) as well. How different might we expect them to be? This question is asking about the uncertainties on our best-fit parameter values. If we can calculate uncertainties on the parameters, then we can be reasonably sure that anyone else calculating a similar fit would get parameters that are consistent with ours to within ~2x the uncertainty (technically, the values should be within 2x uncertainties 95% of the time).
The uncertainties on our best-fit parameters are given by the covariance matrix, which is the pcov variable we've been ignoring so far. A covariance matrix is an \(N\times N\) matrix, where \(N\) is the number of parameters in our model. In our case, our parameters are A and alpha, so \(N=2\) and the covariance matrix looks like this:

$$
\mathrm{pcov} = \begin{pmatrix} \sigma_{AA} & \sigma_{A\alpha} \\ \sigma_{\alpha A} & \sigma_{\alpha\alpha} \end{pmatrix}
$$
The diagonal elements of this matrix (\(\sigma_{AA}\) and \(\sigma_{\alpha\alpha}\)) represent the variance (i.e., the square of the uncertainty) on our estimated parameters, which is to say that the uncertainties on our parameters estimated by our \(\chi^2\) minimization are the square roots of pcov[0,0] and pcov[1,1] for A and alpha, respectively.
Given the above, complete the code block below to print the uncertainties:
A_err = ?
alpha_err = ?
print("The best-fit parameters are A={:4.2f}+/-{:4.2f} and alpha={:4.2f}+/-{:4.2f}".format(A_new,
A_err,
alpha_new,
alpha_err))
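One possible way to fill in the blanks, using the pcov returned by the weighted fit above:

```python
# The parameter uncertainties are the square roots of the diagonal of pcov
A_err = np.sqrt(pcov[0, 0])
alpha_err = np.sqrt(pcov[1, 1])
print(A_err, alpha_err)
```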
One way to visualize the uncertainties on A and alpha is to plot many different variations of our power law model where we slightly change the values of A and alpha in a manner consistent with the uncertainties on those parameters. The function below generates 100 random pairs (A,alpha) that are consistent with the uncertainties on those parameters:
def rand_A_alpha(A,alpha,A_err,alpha_err):
'''Given input values of A and alpha and uncertainties on each value, generate
100 pairs (A,alpha) such that the generated parameters form a gaussian
distribution about the input values with a 1-sigma width given by the
uncertainty.'''
n_rand = 100
rand_A_array = A + A_err * np.random.normal(size=n_rand)
rand_alpha_array = alpha + alpha_err * np.random.normal(size=n_rand)
return zip(rand_A_array,rand_alpha_array)
In the cell below, create a new version of your plot with the logarithmic scaling, and use the rand_A_alpha() function above to generate 100 model lines that each use randomly generated values of A and alpha.
NOTES:
You can use the function rand_A_alpha() to generate the fake values and store them as a series of pairs, after which you can plot a line for each pair of values by including a for loop in your plot. Try something like this:
rand_A_alpha_pairs=rand_A_alpha(A_new,alpha_new,A_err,alpha_err)
for (A_i,alpha_i) in rand_A_alpha_pairs:
plt.plot(xmodel,powerlaw_func(xmodel,A_i,alpha_i),alpha=0.1,color='k')
The alpha=0.1 portion of the plotting command above sets the model lines to be partially transparent so that they don't all cover each other (or your data). Note that "alpha" here is an old printing term, where alpha=1 means fully opaque (the default) and alpha=0 means fully transparent; it is completely unrelated to the fact that one of our variables in this activity also happens to be named alpha!
# Plot the luminosity vs. mass for the real data with errorbars
# include 100 versions of the model line consistent with the parameter uncertainties
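A possible version of this final plot, assuming the variables from the earlier sketches and the A_err and alpha_err values computed above:

```python
# Data with error bars, on logarithmic axes
plt.errorbar(mass, lum, yerr=lum_err, marker='s', ls='none', color='orange')
plt.loglog()

# Overplot 100 model lines drawn from the parameter uncertainties
rand_A_alpha_pairs = rand_A_alpha(A_new, alpha_new, A_err, alpha_err)
for (A_i, alpha_i) in rand_A_alpha_pairs:
    plt.plot(xmodel, powerlaw_func(xmodel, A_i, alpha_i), alpha=0.1, color='k')

plt.xlabel('Mass (solar masses)')
plt.ylabel('Luminosity (solar luminosities)')
plt.show()
```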
FINAL CHECKPOINT#
Congratulations! You have completed this activity. Call a TA or your instructor to check you off. You may want to think about the following questions in advance:
Where does the prediction of the fit appear to be most certain? Least certain? Does this make sense, given where the data are?
Does the uncertainty in the fit curve seem to be as large as the uncertainty in the surrounding data? If not, why do you think this is?
What are the differences and similarities between the RSS and \(\chi^2\) value for defining “goodness of fit”? When might it make sense to use one rather than the other?