- Understand the main steps involved in exploratory data analysis
- Visualise geographical data with seaborn
- Slice, mask, and index pandas.Series and pandas.DataFrame objects
- Merge pandas.DataFrame objects together on a common key
- Apply the DataFrame.groupby() operation to aggregate data across different groups of interest
This lesson draws heavily on the following textbook chapters:
- Chapter 2 of Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron
- Chapter 3 (pp. 146-170) of Python Data Science Handbook by Jake Vanderplas
You may also find the following blog post useful:
Homework
- Solve the exercises included in this notebook
- Work through lessons 2 and 4 from Kaggle Learn's introduction to pandas
If you get stuck on understanding a particular pandas technique, you might find their docs to be helpful.
In data science we apply the scientific method to data with the goal of gaining insights. This means that we state a hypothesis about the data, test it, and refine it if necessary. In this framework, exploratory data analysis (EDA) is the step where we explore the data before actually building models. This helps us understand what information is actually contained in the data and what insights could be gained from it.
Formally, the goals of EDA are:
- Suggest hypotheses about the phenomena of interest
- Check if necessary data is available to test these hypotheses
- Make a selection of appropriate methods and models to achieve the goal
- Suggest what data should be gathered for further investigation
This exploratory phase lays out the path for the rest of a data science project and is therefore a crucial part of the process.
In this lesson we will analyse two datasets:
housing.csv
housing_addresses.csv
The first is the California housing dataset we saw in lesson 1, while the second provides information about the neighbourhoods associated with each house. This auxiliary data was generated using the reverse geocoding functionality from Google Maps, where the latitude and longitude coordinates for each house are converted into the closest, human-readable address.
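If you are curious how such address data can be produced, the sketch below shows the general idea of reverse geocoding using the free Nominatim service via the geopy library. This is only an illustration with example coordinates from the dataset; it is not the Google Maps pipeline that was actually used to build housing_addresses.csv.
# illustrative reverse-geocoding sketch (geopy/Nominatim instead of Google Maps)
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="dslectures-example")  # hypothetical app name
location = geolocator.reverse("37.88, -122.23")  # an example latitude, longitude pair
print(location.address)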
The type of questions we will try to find answers to include:
- Which cities have the most houses?
- Which cities have the most expensive houses?
- What is the breakdown of the house prices by proximity to the ocean?
As in lesson 1, we will be making use of the pandas and seaborn libraries. It is often a good idea to import all your libraries in a single cell block near the top of your notebooks so that your collaborators can quickly see whether they need to install new libraries or not.
# reload modules before executing user code
%load_ext autoreload
# reload all modules every time before executing the Python code
%autoreload 2
# render plots in notebook
%matplotlib inline
# data wrangling
import pandas as pd
import numpy as np
from pathlib import Path
from dslectures.core import *
# data viz
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
# these commands define the color scheme
sns.set(color_codes=True)
sns.set_palette(sns.color_palette('muted'))
As usual, we can download our datasets using our helper function get_dataset:
get_dataset('housing.csv')
get_dataset('housing_addresses.csv')
We also make use of the pathlib library to handle our filepaths:
DATA = Path('../data/')
!ls {DATA}
housing_data = pd.read_csv(DATA/'housing.csv'); housing_data.head()
housing_addresses = pd.read_csv(DATA/'housing_addresses.csv'); housing_addresses.head()
Whenever we have a new dataset it is handy to begin by getting an idea of how large the DataFrame is. This can be done with either the len function or the DataFrame.shape attribute:
# get number of rows
len(housing_data), len(housing_addresses)
# get tuples of (n_rows, n_columns)
housing_data.shape, housing_addresses.shape
Usually one finds that the column headers in the raw data are either ambiguous or appear in multiple DataFrame objects, in which case it is handy to give them the same name. Although it's obvious from the DataFrame.head() method what the column headers are for our housing and address data, in most cases one has tens or hundreds of columns and the fastest way to see their names is as follows:
housing_addresses.columns
Let's rename the route and locality-political columns to something more transparent:
housing_addresses.rename(columns={'route':'street_name', 'locality-political':'city'}, inplace=True)
# check that the renaming worked
housing_addresses.head()
DataFrame.rename() can manipulate an object in place without returning a new object. In other words, when inplace=True is passed, the data is renamed in place as data_frame.an_operation(inplace=True), so we don't need the usual assignment LHS = RHS. By default inplace=False, so you have to set it explicitly if needed.
Since we are dealing with data about California, we should check that the city column contains a reasonable number of unique entries. In pandas we can check this with the Series.nunique() method:
housing_addresses['city'].nunique()
In this lesson we will be focusing on how the house location affects its price, so let's make a scatterplot of the latitude and longitude values to see if we can identify any interesting patterns:
sns.scatterplot(x="longitude", y="latitude", data=housing_data);
Although the points look like the shape of California, we see that many are overlapping, which obscures potentially interesting substructure. We can fix this by configuring the transparency of the points with the alpha argument:
sns.scatterplot(x="longitude", y="latitude", data=housing_data, alpha=0.1);
This is much better as we can now see distinct clusters of houses. To make this plot even more informative, let's colour the points according to the median house value; we will use the viridis colourmap (palette) as this has been carefully designed for data that has a sequential nature (i.e. low to high values):
fig = sns.scatterplot(
x="longitude",
y="latitude",
data=housing_data,
alpha=0.1,
hue="median_house_value",
palette="viridis",
size=housing_data["population"] / 100
)
# place legend outside of figure
fig.legend(loc="center left", bbox_to_anchor=(1.01, 0.6), ncol=1);
Finally, to make our visualisation a little more intuitive, we can also overlay the scatter plot on an actual map of California:
fig = sns.scatterplot(
x="longitude",
y="latitude",
data=housing_data,
alpha=0.1,
hue="median_house_value",
palette="viridis",
size=housing_data["population"] / 100
)
# place legend outside of figure
fig.legend(loc="center left", bbox_to_anchor=(1.01, 0.6), ncol=1);
# ref - www.kaggle.com/camnugent/geospatial-feature-engineering-and-visualization/
california_img=mpimg.imread('images/california.png')
plt.imshow(california_img, extent=[-124.55, -113.80, 32.45, 42.05], alpha=0.5);
Although the housing_data and housing_addresses DataFrames contain interesting information, it would be nice if there was a way to join the two tables.
More generally, one of the most common operations in pandas (or data science for that matter) is the combination of data contained in various objects. In particular, merge or join operations combine datasets by linking rows using one or more keys. These operations are central to relational databases (e.g. SQL-based). The pandas.merge() function is the main entry point for using these algorithms on your data.
Let's use this idea to combine the housing_data and housing_addresses pandas.DataFrame objects via their common latitude and longitude coordinates. First we need to combine the latitude and longitude columns of housing_data into the same lat,lon format as our housing_addresses pandas.DataFrame. To do so, we will use the Series.astype() function to convert the numerical column values to strings, then use string concatenation to create a new latitude_longitude column with the desired format:
# create new column
housing_data['latitude_longitude'] = housing_data['latitude'].astype(str) + ',' + housing_data['longitude'].astype(str)
# check the column was created
housing_data.head()
Now that we have latitude_longitude present in both pandas.DataFrame objects, we can merge the two tables together in pandas as follows:
# create a new DataFrame from the merge
housing_merged = pd.merge(
housing_data, housing_addresses, how="left", on="latitude_longitude"
)
# check merge worked
housing_merged.head()
Boom! We now have a single pandas.DataFrame that links information for house prices and attributes and their addresses.
Note that by passing how="left" we kept every value of the key latitude_longitude found in the left table housing_data. In general the 'how' argument of pandas.merge() allows four different join types:
- 'left': Use all key combinations found in the left table
- 'right': Use all key combinations found in the right table
- 'inner': Use only the key combinations observed in both tables
- 'outer': Use all key combinations observed in both tables together
A visual example of these different merges in action is shown in the figure below.
Figure: Graphical representation of the different types of merges between two DataFrames df1 and df2 that are possible in pandas.
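To make the four options concrete, here is a minimal runnable sketch with two toy tables named df1 and df2. Their contents are made up for illustration and may differ from the values shown in the figure:
# toy tables sharing a 'key' column
df1 = pd.DataFrame({"key": ["a", "b", "c"], "value_left": [1, 2, 3]})
df2 = pd.DataFrame({"key": ["b", "c", "d"], "value_right": [10, 20, 30]})
# 'left' keeps all keys from df1, 'right' all keys from df2,
# 'inner' only the shared keys, 'outer' the union of all keys
for how in ["left", "right", "inner", "outer"]:
    print(f"how='{how}':")
    print(pd.merge(df1, df2, how=how, on="key"), "\n")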
Note that if the merge key is duplicated in the right pandas.DataFrame, you can expect a larger table to appear after you do a left join. To avoid this behaviour, you can run DataFrame.drop_duplicates() before doing the merge.
It is not uncommon for a dataset to have tens or hundreds of columns, and in practice we may only want to focus our attention on a smaller subset. One way to remove unwanted columns is via the DataFrame.drop() method. Since we have duplicate information about the latitude and longitude coordinates, we may as well drop the latitude_longitude column from our merged pandas.DataFrame:
# inplace=True manipulates an object without returning a new object
housing_merged.drop(['latitude_longitude'], axis=1, inplace=True); housing_merged.head()
Note that the axis parameter appears in many operations on pandas.DataFrame and pandas.Series objects. This parameter is used to specify along which dimension we want to apply a given transformation - see the figure below for a graphical representation.
Figure: Visualisation of the axis parameter in pandas.
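As a small sketch of what axis means in practice (the df_axis object here is made up purely for illustration):
# a tiny DataFrame to illustrate the axis parameter
df_axis = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
# axis=0 aggregates down the rows (one result per column)
df_axis.sum(axis=0)
# axis=1 aggregates across the columns (one result per row)
df_axis.sum(axis=1)
# axis=1 is also what tells DataFrame.drop() to remove a column rather than a row
df_axis.drop(["b"], axis=1)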
At this point, we have a unified table with housing data and their addresses. This is usually a good point to save the intermediate results to disk so that we can reload them without having to run all the preprocessing steps. To do so in pandas, we can make use of the DataFrame.to_csv() function:
# use index=False to avoid adding an extra column for the index
housing_merged.to_csv(path_or_buf=DATA/'housing_merged.csv', index=False)
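Later on, we can reload this intermediate result instead of re-running the preprocessing steps; a minimal sketch, assuming the same DATA path:
# read the merged table back from disk
housing_merged = pd.read_csv(DATA/'housing_merged.csv')
housing_merged.head()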
Now that we have a tidy pandas.DataFrame, let's apply some of the most common pandas methods to make queries on the data.
A pandas.Series object provides array-style item selection that allows one to perform slicing, masking, and fancy indexing. For example, we can study the housing_median_age column from a variety of angles:
# define a new Series object
age = housing_merged['housing_median_age']
age
# get first entry
age[0]
# slicing by index
age[2:5]
# slice and just get values
age[2:5].values
age.describe()
# masking (filtering)
age[(age > 10.0) & (age < 20.0)]
# fancy indexing
age[[0, 4]]
Since a pandas.DataFrame acts like a two-dimensional array, pandas provides special indexing operators called DataFrame.iloc[] and DataFrame.loc[] to slice and dice our data. As an example, let's select a single row and multiple columns by label:
housing_merged.loc[2, ['city', 'population']]
We can perform similar selections with integers using DataFrame.iloc[]:
housing_merged.iloc[2, [3, 11]]
Masks or filters are especially common to use, e.g. let's select the subset of expensive houses:
housing_merged.loc[housing_merged['median_house_value'] > 450000]
We can also use pandas' string methods (via the Series.str accessor) to filter on text columns, e.g. to select the rows whose ocean_proximity value contains the letter 'O':
housing_merged.loc[housing_merged['ocean_proximity'].str.contains('O')]
Note that one special feature of pandas.Series and pandas.DataFrame objects is the fact that they contain an index that lets us slice and modify the data. This Index object can be accessed via Series.index or DataFrame.index and is typically an array of integers that denote the location of each row.
For example, our age object has an index for each row or house in the dataset:
age.index
We can also sort the values according to the index, which in our example amounts to reversing the order:
age.sort_index(ascending=False)
Finally, there are times when you want to reset the index of the pandas.Series or pandas.DataFrame objects; this can be achieved as follows:
age.reset_index()
Note this creates a new index column and resets the order of the pandas.Series object in ascending order.
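If you don't want the old index kept as an extra column, reset_index() also accepts a drop argument; a small sketch:
# reset the index without adding the old index as a column
age.reset_index(drop=True)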
Whenever you need to quickly find the frequencies associated with categorical data, the Series.value_counts() and Series.nlargest() functions come in handy. For example, if we want to see which city has the most houses, we can run the following:
housing_merged['city'].value_counts().nlargest(10)
This seems to make sense, since Los Angeles, San Diego, and San Francisco have some of the largest populations. We can check whether this is indeed the case by aggregating the data to calculate the total population across each group of cities. pandas provides a flexible DataFrame.groupby() interface that enables us to slice, dice, and summarise datasets in a natural way. In particular, pandas allows one to:
- Split a pandas object into pieces using one or more keys
- Calculate group summary statistics, like count, mean, standard deviation, or a user-defined function
- Apply within-group transformations or other manipulations, like normalisation, rank, or subset selection
- Compute pivot tables and cross-tabulations (see the short sketch below).
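As a quick illustration of the last point, here is a minimal pivot-table sketch on our merged data; the choice of values, index, and aggregation function is just an assumption for demonstration:
# hedged example: average house value per city, broken down by ocean proximity
pd.pivot_table(
    housing_merged,
    values="median_house_value",
    index="city",
    columns="ocean_proximity",
    aggfunc="mean",
).head()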
Let's combine these ideas to answer our question, followed by an explanation of how the GroupBy mechanics really work:
housing_merged.groupby('city').agg({'population':'sum'})
This seems to work - we now have the total number of people residing in each city, but the result is sorted alphabetically. To get the cities with the largest populations, we can use the DataFrame.sort_values() method as follows:
housing_merged.groupby("city").agg({"population": "sum"}).sort_values(
by="population", ascending=False
)
That's much better! We can store the result as a new pandas.DataFrame and plot the distribution:
# use outer parentheses () to add line breaks
top10_largest_cities = (
# group by key=city
housing_merged.groupby("city")
# calculate total population
.agg({"population": "sum"})
# select 10 largest
.nlargest(10, columns="population")
# reset index so city becomes column
.reset_index()
)
top10_largest_cities.head()
sns.barplot(x='population', y='city', data=top10_largest_cities, color='b');
The group operation above can best be understood by Hadley Wickham's split-apply-combine strategy, where you break up a big problem into manageable pieces, operate on each piece independently, and then put all the pieces back together.
Split: In the first stage of the process, data contained in a pandas.Series or pandas.DataFrame object is split into groups based on one or more specified keys. The splitting is performed on a particular axis of the object (see the notebook from lesson 2). For example, a pandas.DataFrame can be grouped on its rows (axis=0) or its columns (axis=1).
Apply: Once the split is done, a function is applied to each group, producing a new value.
Combine: Finally, the results of all those function applications are combined into a result object. The form of the resulting object will usually depend on what's being done to the data.
See the figure below for an example of a simple group aggregation.
Figure: Illustration of a group aggregation.
In general, the grouping key can take many forms and the keys do not all have to be of the same type. Frequently, the grouping information is found in the same pandas.DataFrame as the data you want to work on, so the key is usually a column name. For example, let's create a simple pandas.DataFrame as follows:
df_foo = pd.DataFrame(
{
"key1": ["a", "a", "b", "b", "a"],
"key2": ["one", "two", "one", "two", "one"],
"data1": np.random.randn(5),
"data2": np.random.randn(5),
}
)
df_foo
We can then use the column names as the group keys (similar to what we did above with the cities):
grouped = df_foo.groupby('key1'); grouped
This grouped variable is now a GroupBy object. It has not actually calculated anything yet except for some intermediate data about the group key df_foo['key1']. The main idea is that this object has all of the information needed to then apply some operation to each of the groups. For example, we can get the mean per group as follows:
df_grouped = grouped.mean(); df_grouped
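Since groupby() accepts one or more keys, we can also group by both columns at once; a small sketch using the df_foo object defined above:
# group by two keys; the result has a hierarchical (MultiIndex) row index
df_foo.groupby(["key1", "key2"]).mean()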
Exercise #5
- Use the above DataFrame.groupby() techniques to find the top 10 cities which have the most expensive houses on average. Look up some of the names on the web - do the results make sense?
- Use the DataFrame.loc[] method to filter out the houses with the capped values of over $500,000. Repeat the same step as above.