Fitting Models to Data (Part 1)#

Prof. Ryan Trainor, Franklin & Marshall College

Objectives#

  • To introduce how mathematical models describe relationships within a dataset

  • To introduce model-fitting techniques and the meaning of a “best fit” model

  • To practice defining and using functions

Learning Outcomes#

By the end of this workbook, students should be able to

  • Use plotting techniques from matplotlib to visually evaluate relationships within a dataset and the utility of a mathematical model for describing those relationships

  • Use the curve_fit function from scipy.optimize to fit custom models to data

  • Explain how the “residual sum of squares” between a model and data relates to the best-fit model parameters

Prerequisites#

This notebook assumes that the student has already learned

  • Basic Python skills related to variable types, loops, and functions

  • Tools for creating and performing mathematical operations on numpy arrays

  • How to read in data from an ASCII file into an Astropy Table object

  • How to make basic plots using the plot function from matplotlib.pyplot

Dependencies#

This notebook requires the standard libraries for scientific Python (numpy, SciPy) as well as the Astropy library. Astropy is included in the full installation of the Anaconda Distribution of Python, or it can be installed on its own by running conda install astropy or pip install astropy.

Note for instructors: this notebook has been written as an activity to be completed by a group of students in a lab or classroom setting. A solution with filled-in answers is available upon email request.

Masses and Luminosities of Stars#

In many areas of astronomy, we find that two properties of objects are correlated: an increase in one parameter typically corresponds to an increase or decrease in a second parameter.

In this activity, we will explore the relationship between two important properties of stars: their masses and their luminosities. In particular, we will try to figure out how to predict the luminosity of a star from its mass by finding a mathematical relationship between mass and luminosity.

Plotting the data#

Our first step is to read in a file that contains the measured masses and luminosities of a large number of stars, and then we will plot the luminosities of these stars against their masses in order to visually determine whether there is a relationship between them. Note: this activity uses simulated (fake) data to explore techniques that can be applied to real data—the true relationship between stellar mass and luminosity is more complicated than the mathematical relationships we will use here.

We’ll need to use numpy, matplotlib.pyplot, and the ascii package from astropy.io, so we will import those libraries below:

# Import necessary libraries

import numpy as np
import matplotlib.pyplot as plt
from astropy.io import ascii

Now we need to download the file of stellar masses and luminosities that we will be using for this activity. You can download the data file at this link, or you can run the cell below to download it to your working directory automatically using wget.

Note that running the cell below is particularly useful if you are running this notebook on Google Colab, where any files in your working directory are periodically deleted: it allows you to re-download the file automatically every time you run the notebook. wget can be installed in an Anaconda environment by entering conda install -c anaconda wget at the command line, or with Homebrew via brew install wget.

# Download the file 'star-mass-lum-lumerr.dat' using wget (use this if you are running on Google Colab)
!wget --no-check-certificate 'https://drive.google.com/uc?export=download&id=1BWet8_t2gyNGtFNMP_1opqcYBHDinEfG' -O star-mass-lum-lumerr.dat

Now we will read in the file we just downloaded (“star-mass-lum-lumerr.dat”) using the ascii.read function and then print it:

data=ascii.read('star-mass-lum-lumerr.dat')
print(data)

You should have an Astropy Table object with four columns that looks like this:

   ID   Mass_Msun Lum_Lsun Lum_err_Lsun
------- --------- -------- ------------
STAR000     0.524    0.046       0.0347
STAR001     3.259    130.0         32.3
STAR002     0.393    0.039       0.0111
STAR003     0.533    0.062       0.0345
...
Length = 235 rows

Each row of the table represents one star, and each column represents information about each star: an ID number for that star, the mass of that star (in units of solar masses, \(M_\odot\)), the luminosity of that star, and the measurement uncertainty on that luminosity (both in units of solar luminosities, \(L_\odot\)).

For this activity, we will only use the ‘Mass_Msun’ and ‘Lum_Lsun’ columns, and we can access those individual columns of the table using the name of the column as a string:

print(data['Mass_Msun'])
print(data['Lum_Lsun'])

The above syntax for isolating a single column of our Astropy table lets us access just the masses or just the luminosities and treat each column like an array: we can use data['Mass_Msun'] in most of the same ways you would use a numpy array, including for performing mathematical operations or for plotting.
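
For example (purely as an illustration of this array-like behavior, not a step we need for this activity), we could compute a rough mass-to-light ratio for every star in a single line:

# Illustration only: array-style math on table columns
mass_to_light = data['Mass_Msun'] / data['Lum_Lsun']  # element-wise division, one value per star
print(mass_to_light[0:5])  # look at the first five values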

To get a better sense of our data, make a plot of the luminosities vs. the masses in the cell below. Note that when you plot “A vs. B”, quantity A should always go on the vertical axis, and quantity B should go on the horizontal axis.

Fill in the question marks below to plot the data, and see if you can use green pentagons as your markers. Note that appropriate axis labels have already been added for you.

NOTES:

  • The plt.xlabel and plt.ylabel commands produce the axis labels. The first argument of those functions should be a string containing the text you want plotted on that axis. Give them a descriptive name as well as the appropriate units in parentheses.

  • Remember that the plt.plot command takes in two arrays for the x and y parameters, and then the third argument of the function is a “format string” that specifies how the plotted points will look.

  • You can find information about how to use formatting strings to make green pentagons (or any other color/shape combination) by going to this website and scrolling down to the section on “Format Strings”. A generic example with a different color and marker is shown just after these notes.
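
In this illustration (using made-up arrays, and deliberately not the color/marker combination asked for above), the format string 'rs' plots red squares:

# Illustration only: 'r' = red, 's' = square markers (made-up data)
x_demo = np.array([1, 2, 3])
y_demo = np.array([2, 4, 6])
plt.plot(x_demo, y_demo, 'rs')
plt.show()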

# Fill in the question marks below to make your plot

plt.rcParams['figure.figsize'] = (20.0, 10.0)

plt.plot(?,?,?) # plot the mass on the x axis and luminosity on the y axis

plt.xlabel("Mass ($M_\odot$)",size=20)
plt.ylabel("Luminosity ($L_\odot$)",size=20)
plt.xticks(size=18)
plt.yticks(size=18)
plt.show()

You should have a scatterplot that shows a positive correlation between mass and luminosity.

CHECKPOINT 1#

If you are completing this activity in class, stop here and check your plot with your instructor or TA before you continue. You may want to think about the following questions in advance:

  • Are the mass and luminosity on the correct axes and labeled correctly?

  • Are the points plotted with the correct shape and color?

Creating a model#

Based on these measurements of mass and luminosity for a sample of stars, we might want to find a model for our data: an equation that would allow us to predict the luminosity of another star (one not in our current sample) for which we have measured a mass. For example, based on our data, what do you think the luminosity of a star with a mass of 30 \(M_\odot\) would be, given the measurements you have for other stars? Look at your graph above as a group and decide on a value of the luminosity that you think would be realistic for this mass.

We’ll start with a simple linear model of the data: we’ll assume that mass and luminosity are related by an equation of the following form:

\[ y=ax+b \]

where \(y\) will represent our luminosity and \(x\) will represent our mass. In order to do this, let’s start by defining a function that takes in an array of \(x\) values, a value of \(a\), and a value of \(b\), and then returns an array of \(y\) values:

def linear_func(x,a,b):
    '''Function that evaluates a linear model with slope a and intercept b at the given x values.'''
    y = ? # write an expression for y in terms of x, a, and b
    return y
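
Once you have filled in the expression, it is worth sanity-checking your function with a couple of made-up numbers (the values below are arbitrary):

# Sanity check with made-up numbers: a slope of 2 and an intercept of 1 at x = 3 should give y = 7
print(linear_func(3, 2, 1))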

Ok, now we have a generic function for describing a linear relationship. In order to apply it to our data, we need to figure out what the best values of \(a\) (the slope) and \(b\) (the y intercept) would be to describe our data. As a group look at your plotted data above and try to estimate what might be a reasonable slope and y-intercept.

Next, we will test your line to see how it looks compared to your data. In order to plot a line, we’ll need to make an array of \(x\) values (i.e., an array of “fake” mass values) at which to plot our model. We will then use our function to calculate the \(y\) values we’d expect at those positions.

Below, use the np.linspace() command to create an array of \(x\) values, and then use the linear_func function and your guessed slope and intercept to create a list of \(y\) values:

xmodel = np.linspace(?,?,100) # remember the syntax: np.linspace(start, stop, number of points)
ymodel = linear_func(?,?,?) # remember the syntax: linear_func(x values, a, b)

In the box below, copy and paste the code you used to make your first plot above in order to plot your data again. This time, add a second plot command immediately after the first one that plots ymodel vs. xmodel using a magenta line (you may need to look again at the website that lists formatting options for plt.plot).

# copy your code to make the plot from above with the additional line added

How does it look? Since the data doesn’t make a straight line, it probably doesn’t look great, but if you think you can improve your guesses, choose a new intercept and/or slope and re-run the last two cells above to see how well you can make your line fit the data.

CHECKPOINT 2#

Once you feel your line fits about as well as it can, check your plot with your TA or instructor before going further. You may want to think about the following questions in advance:

  • Does your plot include axis labels, points, and the linear function you created?

  • How does changing the values of \(a\) and \(b\) affect the agreement between your data and the model? Is the model more sensitive to one of them than the other?

  • You should have two plt.plot commands that each plot some y values vs. some x values. Which one represents the actual data? Which one represents the predictions of your model?

  • We used np.linspace to generate your xmodel array, while we used linear_func to generate the ymodel array. Why couldn’t we use np.linspace to generate your ymodel array? Why couldn’t we use linear_func to generate the xmodel array?

  • Once we had decided to use the xmodel array for our plot of the linear model, why did we use xmodel as an input to linear_func in order to generate ymodel? Since xmodel represents values of the mass, could we have used the array of actual masses instead?

Evaluating the model#

Estimating the best-fit line by trial and error can obviously be a time-consuming process. Instead of doing it by hand, we can let Python fit the line for us. To do this, we need an objective standard for what we mean by a “best fit,” since a visual inspection is a subjective measure of fit.

The usual method of calculating a best-fit line is called least-squares fitting. This method chooses the line (or other equation) that minimizes the sum of the squared differences between the predictions of your equation and the measured data points.

Another way of saying this is that least-squares fitting minimizes the sum of the squares of the residuals of the fit. The residuals of a fit are the differences between the real (measured) values and the values your equation predicts (the fit values).
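
As a small worked example with made-up numbers (not from our dataset): suppose the measured values at three points are 2.0, 3.0, and 5.5, while the model predicts 1.5, 3.5, and 5.0. The residuals are \(0.5\), \(-0.5\), and \(0.5\), so the sum of their squares is \(0.5^2 + (-0.5)^2 + 0.5^2 = 0.75\). Parameter values that bring the predictions closer to the measurements make this sum smaller, and least-squares fitting finds the parameter values that make it as small as possible.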

Let’s calculate the residuals and the sum of the squared residuals by hand to start. In the cell below, use linear_func and your chosen values of \(a\) and \(b\) to generate an array containing the luminosity values that your equation would predict for each of the masses in your data set (you should be able to do this in one line).

NOTES:

  • You are calculating a predicted luminosity for each of the masses in our dataset—you could do this with a loop, but see if you can calculate all the predicted luminosities in a single line using the power of numpy arrays!

  • Remember how we generated y values (i.e., predicted luminosities, based on our linear model) for each of the “fake” masses in our xmodel array. How can we use linear_func to generate predicted luminosities for each of the “real” masses in our dataset?

# Replace the question mark with an expression to calculate the luminosities predicted by our model for each of the masses in our data
predicted_lum = ?

Calculating residuals#

Now let’s calculate an array of residuals for each star in our dataset, recalling that the residual array is defined as:

\[ \textrm{residuals} = \textrm{real values} - \textrm{predicted (model) values} \]

or equivalently:

\[ \Delta y_i = y_i - f(x_i) \]

where \(y_i\) is the observed y value (i.e., the luminosity) for star \(i\), \(x_i\) is the observed x value (i.e., the mass) for star \(i\), \(f(x)\) is the mathematical function that describes our predictive model, and \(\Delta y_i\) is the residual for star \(i\).

As in the previous cell, try creating your residual array in a single line below using the arrays you’ve already defined:

residuals = ?

Now we will plot our residuals to see if they make sense.

  1. Make a plot similar to your luminosity vs. mass plot, but this time plot the residuals of the luminosities on the y axis instead of the luminosities themselves. You will want to change the y-axis label of your plot accordingly.

  2. On your plot, add a horizontal line at \(y=0\) so that we can easily tell which residuals are positive and which are negative.

NOTES:

  • You can use the command plt.axhline(yval,ls=?,color=?) to make a horizontal line at \(y=\) yval that spans the entire axis. There is an analogous command plt.axvline(xval) to make a vertical line at \(x=\) xval.

  • Both plt.axhline and plt.axvline accept optional keyword arguments that let you change the properties of the line drawn. For example, you can use plt.axhline(yval,ls='--',color='brown') to make a dashed line (ls = “linestyle”) that is brown. Google “matplotlib colors” or “python linestyles” and then pick a color and linestyle you like.

# copy your code to make the plot from above and then make your changes and additions to the plot

Look at your plot of residuals and compare it to your plot of luminosity vs. mass with the line for the model you calculated. Hopefully you can see that the residuals are positive when the observed luminosities were above the model line, and the residuals are negative when the observed luminosities were below the line.

Residual Sum of Squares#

At the end of the day, it’s nice to have a single number we can use to evaluate how well our model describes the relationship in our data. The value we will use is the residual sum of squares (RSS):

\[ \textrm{RSS} = \sum_{i=1}^N \Delta y_i^2 = \sum_{i=1}^N (y_i - f(x_i))^2 \]

where the variables above are the same as in our definition of the residuals.

In the cell below, calculate the RSS value using the residual array you already have and some mathematical operations.

NOTES:

  • The np.sum() function will be useful, but be careful with your order of operations!

  • The residual sum of squares should be small for a good fit.

  • In the cell below, the print command uses some special string formatting to print the result in scientific notation, which may be useful if your RSS is very large (a quick demonstration follows just below these notes).
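
For instance, applied to an arbitrary number:

print("{:e}".format(1234.5))  # prints 1.234500e+03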

RSS = ?
print("{:e}".format(RSS))

CHECKPOINT 3#

Before going on, check your plot with your TA or instructor. You may want to think about the following questions in advance:

  • Conceptually, how does the RSS relate to the residual plot and data + model plot you made previously? Why should the RSS be small for a good fit?

  • In calculating the RSS, did you square the residuals before or after summing them? Why?

  • What do you think the sum of the unsquared residuals would be if your best-fit line were just a flat line at \(y=y_\textrm{avg}\), where \(y_\textrm{avg}\) is the average luminosity?

  • In the case of a model that was a constant \(y=y_\textrm{avg}\), would the sum of the unsquared residuals be small? Would it be a “good” fit, subjectively speaking?

Fitting the model automatically#

Now that we have a quantitative definition of a “best-fit” line, we can tell Python to figure out the slope and y-intercept that will produce the best-fit line (i.e., the values of \(a\) and \(b\) that minimize the RSS).

For this, we’ll need to use the curve_fit function from the scipy.optimize package, so we will import that function below:

from scipy.optimize import curve_fit

The syntax for curve_fit is curve_fit(func,x,y), where ‘func’ is the name of the function for which we are estimating the best parameter values (i.e., linear_func in our case), ‘x’ is our array of observed x values (the masses), and ‘y’ is our array of observed y values (the luminosities).
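
Before applying it to our star data, here is a minimal, self-contained sketch of the syntax using a handful of made-up points (the arrays and variable names below are invented purely for illustration):

# Toy demonstration of curve_fit with made-up data (not our star data)
x_toy = np.array([0.0, 1.0, 2.0, 3.0])
y_toy = np.array([1.1, 2.9, 5.2, 6.8])  # roughly follows y = 2x + 1
popt_toy, pcov_toy = curve_fit(linear_func, x_toy, y_toy)
print(popt_toy)  # the fitted slope and intercept should be close to 2 and 1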

Fill in the correct variables in the code below and then run it:

popt,pcov=curve_fit(?,?,?) # fill in these variables
a_best,b_best=popt
print(a_best,b_best)

The output of curve_fit() has two parts: an array of optimal best-fit parameters (which we’ve stored in popt) and a covariance matrix (which we’ve stored in pcov). We won’t worry about the covariance matrix for now, but it is useful for calculating the uncertainties on our fit parameters.
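
As an aside (a sketch we will not need for this activity), a common way to turn the covariance matrix into approximate one-sigma uncertainties on the fitted parameters is to take the square root of its diagonal:

# Approximate 1-sigma uncertainties on a_best and b_best from the covariance matrix
a_err, b_err = np.sqrt(np.diag(pcov))
print(a_err, b_err)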

Note that in the cell above we have done a funny thing multiple times: we have set two variables (separated by a comma) equal to a single variable or a function. What’s going on here?

In the first line above, curve_fit returns an object with length 2, so popt,pcov=curve_fit(func,x,y) sets popt equal to the first element returned by curve_fit and pcov equal to the second element.

It turns out that popt is itself an array with two elements (a slope and an intercept). Thus, the second line (a_best,b_best=popt) assigns the first element of popt (the best-fit slope) to the variable a_best and assigns the second element of popt (the best-fit y-intercept) to the variable b_best. This syntax is a handy way to define multiple variables at the same time.
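
Here is the same multiple-assignment syntax in a tiny standalone example with made-up values:

first, second = [10, 20]  # unpack a two-element list into two variables
print(first, second)      # prints: 10 20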

Ok, now let’s create another plot of our mass and luminosity data just like we did before, but this time we will include two lines:

  1. the line we made earlier by adjusting the parameters by eye

  2. a new best-fit line that uses a_best and b_best as our slope and y-intercept

NOTES:

  • Make sure to make the lines different colors and/or linestyles so you know which is which.

  • You will need to make a new version of ymodel to hold the y-values of your new best-fit line. Perhaps you can call this ymodel_best or something similar. Do you need to make a new version of xmodel?

# copy your code to make the plot from above and then make your changes and additions to the plot

Maybe the two lines look similar, or perhaps they do not. To compare them quantitatively, calculate the RSS again using your new best-fit parameters.

Note that your new best-fit values of the slope and y intercept mean that the predicted values and residuals are different for the updated model! You can calculate the new RSS value step-by-step as we did before, or you can calculate the new RSS in a single line if you prefer.

# Calculate the residual sum of squares for the data and your best-fit linear model

CHECKPOINT 4#

Before going on, check your new and old RSS value with your TA or instructor. You may want to think about the following questions in advance:

  • How do the model parameters generated by curve_fit compare to the values you selected yourself?

  • Is the model generated by curve_fit better or worse than the one you generated by hand? How do you know?

  • Looking at the plotted lines, you will likely see that each model fits some of the data better and some of the data worse than the other model. Why does the best-fit model seem to “prioritize” fitting certain parts of your plot?

Fitting complex models#

So far we have done a lot of work to find the best-fit linear model to our data, but there’s an elephant in the room: our data doesn’t really look like a straight line! We need to consider some other types of equations that might be a better fit to our data.

If we want to try a model that is a little more complicated than a linear model, a quadratic model might be a logical next step. A quadratic model takes the following form:

\[ y = ax^2 + bx + c \]

Below, let’s define quadratic_func to be a function that takes in an array of \(x\) values and parameters for \(a\), \(b\), and \(c\) and then returns an array of y values:

Note: you may want to look back at your definition for linear_func to remind yourself how we defined it and why.

def quadratic_func(?):
    return ?

Now we can find the best-fit values of \(a\), \(b\), and \(c\) for this expression. Let’s not bother doing it by hand this time—in the cell below, use curve_fit to find the best-fit values of these parameters in a similar way as we found the best values for our best-fit linear parameters:

popt,pcov = ?
a_quad,b_quad,c_quad = ?

As before, let’s make a plot of our data and include our best-fit linear and quadratic curves. Again, use different colors and/or linestyles for the two lines.

# copy your code to make the plot from above and then make your changes and additions to the plot

Which curve appears to have a better fit? Let’s check quantitatively as well. Calculate the RSS for the quadratic fit, and compare the number you found to the sum of squared residuals from the best-fit linear relationship:

# Calculate the sum of the squared residuals to the best-fit equation

CHECKPOINT 5#

Before going on, check in with your TA or instructor. You may want to think about the following questions in advance:

  • Is the linear model or the quadratic model a better fit to your data? How do you know? Do you think that would always be the case when comparing a linear model to a quadratic model?

  • Are the “best” slope and y-intercept the same for both models? Why or why not?

  • Are there any areas of your plot where neither model seems to fit particularly well?

Power-law models#

Let’s try one more type of equation: a power-law model. Power-law equations show up a lot in astronomy, and they take the following form:

\[ y = Ax^\alpha \]

Define a new function called powerlaw_func that takes in an array of \(x\) values and parameter values for \(A\) and \(\alpha\) and returns an array of y values. Note: you can just call the second variable “alpha” instead of “\(\alpha\)”.

# fill in the ? as needed to define your function
def powerlaw_func(?):
    return ?

Now use curve_fit to calculate the best-fit values of \(A\) and \(\alpha\):

# your curve_fit code here

Plot the data (again) along with all three lines using different colors and/or linestyles:

# your copied, pasted, and edited plotting code here

Finally, calculate the residual sum of squares for the power-law fit:
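
# Calculate the residual sum of squares for the data and your best-fit power-law model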

Which fit is best by our quantitative metric? Does this agree with what your eyes tell you based on your plot?

FINAL CHECKPOINT#

Congratulations! You have completed this activity. Call a TA or your instructor over to check you off. You may want to think about the following questions in advance:

  • Which model is the best fit according to our quantitative metric, the RSS?

  • Does the best model according to the RSS match what looks best visually as you look at the plot?

  • Can you explain in words why the RSS is useful for defining the “best” model parameters for a given model, as well as for deciding between two different models?