SDSS Tutorial #1: Data Querying and Visualization#

Prof. Britt Lundgren, University of North Carolina Asheville

Objectives#

  • To introduce the SciServer cloud computing platform and the Sloan Digital Sky Survey (SDSS) astronomical database

  • To demonstrate ways of querying and visualizing data from the SDSS, using Python and SQL on the SciServer platform

Learning Outcomes#

By the end of this notebook, you should be able to use the SciServer cloud computing platform to:

  • visualize data in scatter plots, histograms, and heatmaps using the Python library matplotlib

  • wrap SQL queries in Python to fetch astronomical data from the Sloan Digital Sky Survey (SDSS) database

  • create a color-mangitude diagram with SDSS galaxy data

Prerequisites#

  • This notebook assumes that the user has basic proficiency in the Python programming language. For an excellent introduction, we recommend the two-part Python Bootcamp tutorials and accompanying free eText “Python for Astronomers” by Imad Pascha and Chris Agostino. A more extensive tutorial and complete documentation for Python 3 can be accessed here.

  • This notebook uses SQL queries to select data from the SDSS database. If the SQL language and/or the SDSS database is new to you, we recommend this brief tutorial.

  • This notebook is intended to be run in the SciServer Compute platform. If you have not used SciServer before, please review the following:

I. What is SciServer?#

The SciServer is a cloud-based computing platform that enables users to easily interact with the vast astronomical imaging and spectroscopy data from the Sloan Digital Sky Survey. The following notebook provides a tutorial for using SciServer Compute for your own SDSS research from anywhere with access to the internet. All of the computing is done in the cloud, so there’s no need to download anything to a local computer. The only thing required to get started exploring the universe is a web browser. For this tutorial we will be using the programming language Python 3 to interface with the SciServer, although you can also write scripts using R.

II. Getting Started with SciServer#

  1. The first thing you need to do is to set up an account with SciServer Compute, following the instructions provided here.

  2. Once you are logged in to your account and have created a container for your work, navigate to your persistent directory. This is where you can write and run your code. You can upload existing notebooks or create your own.

  3. Load this notebook into the persistent directory in your SciServer container to edit and run it.

  4. Remember to click the “play” button (alternatively, “Shift+Return”) through each code block as you go.

Visualizing Data#

The standard Python library for plotting and visualization is called matplotlib. Below are a few examples showing what it can do. An excellent longer tutorial by Nicolas P. Rougier can be found here.

For your reference, here is a list of matplotlib codes for different plot colors, and here is a full list of the different shapes you can use as markers in scatter plots.

We’ll start off by importing the matplotlib library and some of its modules:

import matplotlib
%config InlineBackend.rc = {}
from matplotlib import pylab as plt
%matplotlib inline

# If everything imports without a problem, the next line should print...
print('OK')
OK

In the cell below we’ll make an example of a simple scatter plot. (For an example with full documentation, see here.)

Note: This code below plots new randomly generated data every time you run it. You might notice how it changes each time you play it.

# We'll need to import the numerical Python (Numpy) library if we want to generate random data
import numpy as np

# Here we'll use a function in the Numpy library to generate some random numbers between zero and one, in arrays 50 elements long...
N = 50
x = np.random.rand(N)
y = np.random.rand(N)

# We can also generate an array of numbers (the same length as the others) to represent random colors for our points
colors = np.random.rand(N)

# We can generate another array of the same length that makes circles of various random sizes
area = np.pi * (15 * np.random.rand(N))**2  # 0 to 15 point radii

# Here we'll make our scatter plot of x vs. y, with circles described by those random colors and sizes
# The 'alpha' keyword is used to make the markers semi-transparent. 0 = totally transparent; 1 = fully opaque
#
plt.scatter(x, y, s=area, c=colors, alpha=0.5)

# display the plot to the screen
plt.show()
../../_images/8839452b4052a19c545e27b7c31a0a6f88093c9f17806696f3a30ecd2ff50da7.png

In the case where you have two sets of data that you want to show on the same plot, you can also do that easily with the scatter function. To help make it understandable, you’ll want to add a labels and a legend. An example of how to do that is shown below.

# We don't need to import the numpy library again, since we already did it once in this notebook.

# Let's make two samples of simulated data, which represent the locations of dogs and cats in a neighborhood

# Here are the locations of 15 dogs 
N_dogs = 15
x_dogs = np.random.rand(N_dogs)
y_dogs = np.random.rand(N_dogs)

# Here are the locations of 30 cats
N_cats=30
x_cats = np.random.rand(N_cats)
y_cats = np.random.rand(N_cats)

# Let's plot their locations in the same figure
plt.scatter(x_dogs, y_dogs, marker='s', color='b', alpha=0.5, label='dogs')
plt.scatter(x_cats, y_cats, marker='v', color='r', alpha=0.5, label='cats')

# label the axes
plt.xlabel('X coordinate')
plt.ylabel('Y coordinate')

# add a legend and a title
plt.title('Dogs and cats, living together!')
plt.legend(loc="upper right")

# display it
plt.show()
../../_images/de0e376b70dc2eedea0e31fa1f26afff5a91a8d3c458d80bebf6feb2635c3221.png

If you wanted to compare the distribution of cats and dogs along the x-axis, a histogram is probably the easiest way to do that…

(More documentation on histograms can be found here.)

# let's make the histogram have 10 equally spaced bins, spanning the range of 0-1:
bins=np.linspace(0,1,10)
plt.hist(x_dogs, bins=bins, color='b', alpha=0.5, label='dogs')
plt.hist(x_cats, bins=bins, color='r', alpha=0.5, label='cats')

plt.xlabel('X coordinate')
plt.ylabel('Number')
plt.legend()
plt.show()
../../_images/ec87198b85cfb855ad38ee7670a4720438ff8de5e74b0207a2d43291a092babc.png

If the overlap in the samples is too confusing, you can change the format of the histogram (histtype) so that it’s not filled in. Different line styles (ls) will make sure the plot is readable even if it’s printed in black and white.

# let's make the histogram have 10 equally spaced bins, spanning the range of 0-1:
bins=np.linspace(0,1,10)
plt.hist(x_dogs, bins=bins, color='b', histtype='step', ls='dashed', alpha=0.5, label='dogs')
plt.hist(x_cats, bins=bins, color='r', histtype='step', ls='solid', alpha=0.5, label='cats')

plt.xlabel('X coordinate')
plt.ylabel('Number')
plt.legend()
plt.show()
../../_images/ca4869d3dce748fd8d3244de9b3310f8c204d4249032530cff35b3b456e950f3.png

Normalizing the histograms (using the keyword: density=‘True’) makes it easier to compare the shapes of distributions even when one sample is much more numerous than the other. This is calculated by dividing each bin by the total area of the histogram.

# let's make the histogram have 10 equally spaced bins, spanning the range of 0-1:
bins=np.linspace(0,1,10)
plt.hist(x_dogs, bins=bins, color='b', histtype='step', ls='dashed', density='True', alpha=0.5, label='dogs')
plt.hist(x_cats, bins=bins, color='r', histtype='step', ls='solid', density='True', alpha=0.5, label='cats')

plt.xlabel('X coordinate')
plt.ylabel('Normalized Number')
plt.legend()
plt.show()
../../_images/e786c0fcf32def0de02d97b6d3d2be3f424c38d44d3c831e06bc1d5f17309fa5.png

YOUR TURN: What happens if you change the style of the plot?#

Try dramatically changing the formatting of one of the example plots from above in the code block below. Extra points for creativity and bling will be awarded!

I have provided only a few examples of figures that can be made using Matplotlib. This thumbnail gallery is useful for finding example code for making just about any kind of plot or visualization you can think of. If you’re a fan of XKCD, you can also XKCD-ify your plots with the code found here :) You can find examples for inspiration here

# your code here

Importing SciServer and other important libraries#

Now let’s get our hands on some real astronomy data to plot.

The SciServer team has written a number of libraries, generally prefixed with “SciServer”, that assist in various functions like connecting to and querying the SDSS database. As with all Python libraries, they must be actively imported before being used.

The next code block imports those, together with some standard Python libraries helpful for scientific analysis. The code block below applies some settings you may find helpful.

# Import Python libraries to work with SciServer (see http://www.sciserver.org/docs/sciscript-python/)

import SciServer.CasJobs as CasJobs # query with CasJobs, the primary database for the SDSS
import SciServer.SkyServer as SkyServer # show individual objects through SkyServer
import SciServer.SciDrive           # read/write to/from SciDrive, cloud-based storage for your work
import numpy as np                  # standard Python library for math operations
import pandas                       # package for structuring and visualizing data
import matplotlib.pyplot as plt     # package for graphing and visualization 
print('SciServer libraries imported')
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[8], line 3
      1 # Import Python libraries to work with SciServer (see http://www.sciserver.org/docs/sciscript-python/)
----> 3 import SciServer.CasJobs as CasJobs # query with CasJobs, the primary database for the SDSS
      4 import SciServer.SkyServer as SkyServer # show individual objects through SkyServer
      5 import SciServer.SciDrive           # read/write to/from SciDrive, cloud-based storage for your work

ModuleNotFoundError: No module named 'SciServer'
# Apply some recommended settings to the imported libraries

# ensures columns get written completely in notebook
pandas.set_option('display.max_colwidth', -1)
# do *not* show python warnings 
import warnings
warnings.filterwarnings('ignore')
print('Settings applied')
Settings applied

If you don’t receive any errors after running the blocks of code above, the libraries will be successfully loaded. More information about the libraries, which may be useful for the unusually curious or expert user, can be found on the SciServer Python documentation site.

Querying the SDSS database#

The next code block searches the SDSS Data Release 16 database with lightning speed, using the CasJobs REST API.

A very helpful tutorial on how to use SQL to query the SDSS database can be found here. Please take 15 minutes at this time to review the tutorial if SQL is new to you.

A description of the SDSS database structure and the specific information available in different defined Tables and Views, can be found here.

The simple search coded below returns data for the first 16 galaxies with spectroscopic information in the SDSS “galaxy” table that meet some specific criteria.

If you don’t recognize the names for the data types listed in the “galaxy” table (e.g., “p.ra”, “p.g”), you can look them up here.

# Find objects in the Sloan Digital Sky Survey's Data Release 16.
#
# Query the Sloan Digital Sky Servey's Data Release 16.
#
# This query finds "nice-looking galaxies": 
#   galaxies in the SDSS database that have a spectrum 
#   and have a size (petror90_r) larger than 10 arcsec.
#   We are also requiring that the galaxies have apparent
#   model magnitudes (u,g,r,i,z) in the u-band filter between 0 and 19.6
#   and in the g-band filter between 0 and 17
#
# The query is stored in an object called "query"
#   and then executed with a function in the CasJobs library
#
# The returned data is stored in a table called "gals"
#
query="""
SELECT TOP 16 p.objId,p.ra,p.dec,p.petror90_r, p.g, p.r
  FROM galaxy AS p
   JOIN SpecObj AS s ON s.bestobjid = p.objid
WHERE p.u BETWEEN 0 AND 19.6
  AND p.g BETWEEN 0 AND 17  AND p.petror90_r > 10
"""
gals = CasJobs.executeQuery(query, "dr16")

# this line requires that the unique object name 'objId' is the first column in the returned data table "gals"
gals = gals.set_index('objId')
gals
ra dec petror90_r g r
objId
1237667781775327268 219.594953 18.111218 12.58201 15.60288 14.90701
1237662264855625834 219.588623 9.227637 14.91277 15.31813 14.61456
1237662264855625968 219.580353 9.291655 10.88518 16.84558 15.97781
1237665351318503492 219.591756 27.336304 17.26094 15.94505 15.18246
1237661850943947119 219.585008 39.393224 14.90473 16.72122 16.38783
1237654879665324133 219.578818 3.639681 10.83427 15.93781 15.30992
1237667781238456442 219.577525 17.609425 18.30117 15.18190 14.29627
1237665532251013352 219.592920 23.428761 14.47842 15.66112 14.86994
1237667735059956026 219.614557 18.717180 11.17537 16.82319 16.02281
1237668332564513027 219.604100 14.967911 10.65178 16.44650 15.59398
1237662239091982527 219.606238 9.382254 11.72686 16.56822 16.07388
1237654879128453333 219.614435 3.274918 11.88382 16.46257 16.06645
1237665567145918782 219.603206 21.670006 11.70972 16.91023 15.91467
1237667782312132766 219.607141 18.572540 15.30089 16.94920 15.97377
1237661387606524108 219.646867 52.706440 12.95517 15.60632 15.14363
1237659325482008768 219.652172 54.923958 13.29521 16.18980 15.27536

Visualizing Returned Data#

Now that we have run the query and stored the results, we can start analyzing the results.

Let’s start by making a simple plot of the galaxy coordinates (RA vs. Dec) returned by our query:

plt.scatter(gals['ra'], gals['dec'])
plt.xlabel('RA')
plt.ylabel('Dec')
plt.show()
../../_images/1d73d6a376844c1ea143f1724f3f0086c55e0f262aeabf1cd6c9cd130d76b64b.png

The figure above visualizes your data in the simplets possible way: as a scatterplot of RA and Dec positions with default graphing parameters. But you could make many other possible graphs, and you can always add labels, change axis scales, or adjust many other features of graphs. You practiced making some of these display changes in the sections above. Full details on all the possible graph options are described in the online documentation for the pyplot module at https://matplotlib.org/stable/api/pyplot_summary.html.

In addition to containing tables of cataloged measurements for millions of objects, the SDSS database also includes millions of spectra and color imaging covering the 14,000 square degree footprint of the survey. It is often helpful to be able to view the images of objects selected from the SDSS tables.

The code below shows how to display image cutouts for the 16 galaxies we selected from the database in our last query:

# set thumbnail parameters
width=200           # image width
height=200          # height
pixelsize=0.396     # image scale
plt.figure(figsize=(15, 15))   # display in a 4x4 grid
subPlotNum = 1

i = 0
nGalaxies = len(gals)
for index,gal in gals.iterrows():           # iterate through rows in the DataFrame
    i = i + 1
    if i<17:
        print('Getting image '+str(i)+' of '+str(nGalaxies)+'...')
        if (i == nGalaxies):
            print('Plotting images...')
        scale=2*gal['petror90_r']/pixelsize/width
        img= SciServer.SkyServer.getJpegImgCutout(ra=gal['ra'], dec=gal['dec'], width=width, height=height, scale=scale,dataRelease='DR13')
        plt.subplot(4,4,subPlotNum)
        subPlotNum += 1
        plt.imshow(img)                               # show images in grid
        plt.title(index)                              # show the object identifier (objId) above the image.
Getting image 1 of 16...
Getting image 2 of 16...
Getting image 3 of 16...
Getting image 4 of 16...
Getting image 5 of 16...
Getting image 6 of 16...
Getting image 7 of 16...
Getting image 8 of 16...
Getting image 9 of 16...
Getting image 10 of 16...
Getting image 11 of 16...
Getting image 12 of 16...
Getting image 13 of 16...
Getting image 14 of 16...
Getting image 15 of 16...
Getting image 16 of 16...
Plotting images...
../../_images/9283c04066c819ee9f481f2f4610e1668ae4706fd157852bc4f57cc1f21d83bf.png

YOUR TURN: Applying what you’ve learned#

Edit the example query (from a few cells earlier) to retrieve and display a set of small, red galaxies. Recall that “color” can be defined as the difference between any two magnitudes measured in different filters (e.g., g-r, i-z, etc.,), and angular size can be approximated using the measured radius enclosing 90 percent of an object’s light (p.petror90_r).

But what counts as “small”, and what counts as “red”?
(Hint: It may be helpful to start by making histograms of the distribution of color indices and the r90 sizes. It may also be helpful to review the descriptions of the five SDSS filters, and what magnitudes mean in astronomy.)

# your code here

Producing a Galaxy Color-Magnitude Diagram#

Now let’s grab some different data and make a scientifically useful plot.

In the code block below, write a SQL query that returns magnitude information for 1,000 galaxies between redshifts 0.2 and 0.3.

Hints:

#1: Recall that a galaxy’s redshift is deterimined by taking a spectrum and measuring the shift of identified emission and absorption lines. So this is information that’s provided in the SpecObj table, not the “galaxy” table, which contains only photometric (i.e., imaging) information.

#2: If your query times out, you may be asking for the impossible… are your magnitude, redshift, and size limits all compatible with a sample of real galaxies? Keep modifying the query until you are able to retrieve 1000 galaxies in the redshift range 0.2<z<0.3

# your code here

Now, use your returned galaxy data to generate a color-magnitude diagram (e.g., g vs. g-r) in the cell below.

# your code here

Interpreting your plot:#

As we will explore in more detail in future activities, galaxies in the modern universe (redshifts z<0.5) generally group into two general areas of the color-magnitude diagram (“the red sequence” and the “blue cloud”). Do you see two distinct populations in your plot? If not, does anything change if you try refining the signal to noise of your data (i.e., requiring a minimum ratio of g / err_g)? What about if you require a narrower slice in redshift?

Try requiring higher signal to noise and narrowing the range in redshift in the code block below.

Reminder: as described in the SDSS schema of the “galaxy” table, the uncertainty on each magnitude is also available and can be fetched in the SQL query.

# your code here

By experimenting with different slices in redshift, and a requirement that the signal to noise in both the g and r magnitudes be greater than 10, it’s possible to see a fairly strong red sequence and blue cloud. It helps to pull over even larger numbers of galaxies (>2000).

Your scatter plot is probably pretty crowded. If so, it can also be helpful to convert your scatter plot into a heatmap. See this example to get started.

# your code here

Comment on your observations in the cell below.

  • Describe what you see in the plots you’ve made, and whether it agrees with what you were expecting.

  • If the results are surprising, do you have an explanation why?