References
This primer draws heavily on materials from Stanford's CS231n Convolutional Neural Networks for Visual Recognition:
You may also find the Python section of Kaggle Learn to be helpful.
Why Python for Data Analysis?
Python is currently one of the fastest growing programming languages in the world. There are many reasons for this trend, but it largely comes down to the fact that Python is easy to learn, has a library or framework for almost everything, and has a large and active community of developers in scientific computing and data analysis. (This last point is very important: if you get stuck, there's a good chance that somebody has already asked your question on StackOverflow!)
These features have led Python to become one of the most important languages in data science, machine learning, and general software development in academia and industry. Thus, if you're going to learn a new programming language in 2020, then Python is a great choice!
But I don't know how to program!
Don't worry, we will help you!
Since this is a course on practical data science, we believe that the most effective way to learn data science is to actually do it on a recurring basis. By analogy, you do not need to know the theoretical basis for music in order to learn how to play an instrument! Thus the main goal throughout this semester is to spend the majority of our time working directly with data and code; by the end the semantics of Python and its data analysis libraries will be almost second-nature.
Essential Python libraries
In this course we will introduce several core libraries that underpin much of the work done in data science and machine learning. We will provide more details later in the course, but for now a summary is provided below.
NumPy
NumPy (short for Numerical Python) provides the data structures, algorithms, and glue needed for most scientific applications involving numerical data. It provides fast array-processing capabilities and makes linear algebra operations easy to express. (The latter aspect is in part why NumPy underpins many machine learning and deep learning applications.)
pandas
pandas provides data structures and functions that make working with structured or tabular data fast and simple. The primary objects that we will encounter in this course are:
- DataFrame: a tabular, column-oriented data structure with row and column labels;
- Series: a 1D labelled array object.
matplotlib
matplotlib is the de facto Python library for producing plots and other data visualisations.
scikit-learn
scikit-learn is the de facto machine learning toolkit for Python. It includes tools for tasks such as:
- Classification: SVM, nearest neighbours, random forest, logistic regression etc;
- Regression: Lasso, ridge regression;
- Clustering: $k$-means, etc;
- Model selection: Grid search, cross-validation, metrics
- Preprocessing: Feature extraction, normalisation
Don't worry if some of these terms are unfamiliar at this stage - we will explain them in detail later in the course.
Jupyter notebooks
Jupyter notebooks are an essential tool for any data science project. As described on Jupyter's website:
The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.
Jupyter notebooks are made of cells and there are two basic types of cell that you'll encounter:
- Text cells for comments, images etc;
- Code cells for executing Python commands.
You can edit a cell by double-clicking on it and you can execute the code in it by either clicking on the "play" button ▶ or simply pressing Shift + Enter. You can also stop a running cell by pressing the "stop" button ◼︎.
Tab completion
One of the very cool things about Jupyter notebooks is that they provide tab completion, similar to many IDEs. If you enter an expression in a cell, the Tab key will search the namespace for any variables that match the characters you've typed:
the_answer_to_life_the_unverse_and_everything = 42
the_beatles = ['john', 'paul', 'george', 'ringo']
# tab complete
the_beatles
the_beatles?
Introspection can also be used to access the docstring of a function, e.g.
def add_numbers(a, b):
    """
    Add two numbers together
    Returns:
        the_sum : type of arguments
    """
    return a + b
Then using ? or pressing Shift + Tab will show us the docstring:
add_numbers?
Using ?? will also show the source code if possible:
add_numbers??
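The ? and ?? shortcuts are specific to IPython and Jupyter. As a rough sketch of what to do in a plain Python script, the docstring can be read with the standard library's inspect.getdoc (or the function's __doc__ attribute). The function is repeated here only so that the block is self-contained.

```python
import inspect


def add_numbers(a, b):
    """
    Add two numbers together

    Returns:
        the_sum : type of arguments
    """
    return a + b


# inspect.getdoc returns the docstring with its indentation cleaned up
doc = inspect.getdoc(add_numbers)
print(doc)

# The raw, uncleaned docstring is stored on the function itself
raw_doc = add_numbers.__doc__
```

Calling help(add_numbers) prints the same information together with the function signature.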
Basic data types
This section is based on Stanford's excellent NumPy tutorial and Wes McKinney's Python for Data Analysis book - go check them out for more details!
Like most languages, Python has a number of basic data types including:
- Integers
- Floats (double-precision 64-bit floating point number)
- Booleans (True or False)
- Strings
These data types behave in ways that are familiar from other programming languages. However, there are some important differences. For example, statically typed languages like C or Java require the type of each variable to be explicitly declared. By contrast, dynamically typed languages like Python skip this specification. For example, in Java you might specify an operation as follows:
/* Java code */
int result = 0;
for(int i=0; i<100; i++){
    result += i;
}
while in Python the same operation could be written this way:
# Python code
result = 0
for i in range(100):
    result += i
Note the main difference: in Java the data types of each variable are explicitly declared, while in Python the types are dynamically inferred. This means we can assign any kind of data to any variable:
# Python code
x = 4
x = "four"
Here we switched the contents of x from an integer to a string. The same thing in Java would lead to a compilation error or some other disaster:
/* Java code */
int x = 4;
x = "four"; // FAILS!
This sort of flexibility makes Python easy (and sometimes dangerous!) to use.
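As a small illustration of the "dangerous" side, the sketch below rebinds x to a string and then attempts arithmetic that would have worked on the integer. The error only appears when the offending line actually runs, which is the trade-off of dynamic typing.

```python
x = 4
print(type(x).__name__)  # the name x currently refers to an int

x = "four"
print(type(x).__name__)  # the same name now refers to a str

# Adding an integer to a string is not defined, so Python raises
# a TypeError at runtime rather than at compile time.
try:
    result = x + 1
except TypeError as error:
    result = None
    print('TypeError:', error)
```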
some_integer = 73
some_float = 3.14159
We can print out these variables and their types as follows:
print(some_integer, type(some_integer))
print(some_float, type(some_float))
What if you want to print some text and then some numbers? One way to do this is to cast the number as a string and then print it:
print('My integer was ' + str(some_integer))
print('My float was ' + str(some_float))
However, I often find it more convenient to pass comma-separated values to the print function:
print('My integer was', some_integer)
print('My float was', some_float)
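A third option, sketched here as one possibility, is an f-string, which embeds the value of an expression directly in the text. The variable names message and rounded are just for illustration.

```python
some_integer = 73
some_float = 3.14159

# Expressions inside curly braces are evaluated and converted to text
message = f'My integer was {some_integer} and my float was {some_float}'
print(message)

# A format specifier after the colon controls the number of decimals
rounded = f'{some_float:.2f}'
print(rounded)
```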
A natural thing to want to do with numbers is perform arithmetic. We've seen the + operator for addition, and the * operator for multiplication (of a sort). Python also has us covered for the rest of the basic binary operations we might be interested in:
Operator | Name | Description |
---|---|---|
a + b | Addition | Sum of a and b |
a - b | Subtraction | Difference of a and b |
a * b | Multiplication | Product of a and b |
a / b | True division | Quotient of a and b |
a // b | Floor division | Quotient of a and b, removing fractional parts |
a % b | Modulus | Integer remainder after division of a by b |
a ** b | Exponentiation | a raised to the power of b |
-a | Negation | The negative of a |
print('Sum:', some_integer + some_float)
print('Multiplication:', some_integer * some_float)
print('Division:', some_integer / some_float)
print('Power:', 10 ** some_integer)
We can also store the result from math operations in new variables, e.g.
my_sum = some_integer + some_float
print('My sum was', my_sum)
Warning!
A common bug that can creep into your code is confusing true division with floor division. In Python 3, true division of two integers always yields a float
3 / 2
while the floor division operator drops the fractional part
3 // 2
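The sketch below collects a few cases that tend to surprise newcomers: true division returns a float even when the result is a whole number, and floor division rounds towards negative infinity rather than towards zero. The variable names are only illustrative.

```python
whole_result = 4 / 2       # true division: a float, even for a whole number
floor_result = 7 // 2      # floor division of two integers: an integer
negative_floor = -7 // 2   # rounds down towards negative infinity
remainder = 7 % 2          # the part that floor division discards

print(whole_result, type(whole_result))
print(floor_result, type(floor_result))
print(negative_floor)
print(remainder)

# Floor division and modulus always fit together like this
print(7 == (7 // 2) * 2 + 7 % 2)
```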
t = True
f = False
print(type(t), type(f))
# logical AND
print(t and f)
# logical OR
print(t or f)
# logical NOT
print(not t)
# logical XOR
print(t != f)
a = 'one way of writing a string'
b = "another way"
We can also get the number of elements in a string sequence as follows
hello = 'hello'
len(hello)
We can also access each character in a string and print its value:
for letter in hello:
    print(letter)
Adding two strings together concatenates them and produces a new string
world = 'world'
hello + ' ' + world
String objects also come with a range of built-in methods to convert them into various forms:
hello.capitalize()
hello.upper()
# replace all instances of one substring with another
s = 'hitchhiker'
s.replace('hi', 'ma')
some_list = [1,1,2,3,5,8,13,21,34,55,89]
print('This is a list:', some_list)
How do we access the individual elements in a list? In Python, the index of elements in a list starts at zero. Thus we should look at the zeroth index to get the first element:
print(some_list[0])
To get elements at the end of a list, we use negative indices, e.g.
print(some_list[-1])
As noted above, Python lists can contain elements of different types. Let's add a new element to the end of the list:
some_list.append('fibonacci')
print(some_list)
We can also replace values in a list based on their index, e.g.
some_list[-1] = 148
print(some_list)
Finally, we can use the pop method to remove and return the last element of a list:
last_element = some_list.pop()
print(last_element)
L = list(range(10))
L
To get a slice from $[2,4)$ we run
L[2:4]
while to get a slice from index 2 to the end of the list we run
L[2:]
To get a slice from the start to index 5 (exclusive) we run
L[:5]
To get a slice of the whole list:
L[:]
Slice indices can also be negative, in which case they count from the end of the list:
L[:-1]
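A couple of further cases, given as a rough sketch: negative indices work at either end of a slice, and a slice is a new list, so changing it does not affect the original. The names last_three, without_ends and copy_of_L are only illustrative.

```python
L = list(range(10))

last_three = L[-3:]      # the final three elements
without_ends = L[1:-1]   # everything except the first and last element

# A slice creates a new list, so the original is left untouched
copy_of_L = L[:]
copy_of_L[0] = 'changed'

print(last_three)
print(without_ends)
print(L[0], copy_of_L[0])
```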
L[2:4] = ['a', 'b']
L
animals = ['cat', 'dog', 'monkey']
for animal in animals:
    print(animal)
nums = [0, 1, 2, 3, 4]
squares = []
for x in nums:
    squares.append(x ** 2)
squares
nums = [0, 1, 2, 3, 4]
squares = [x ** 2 for x in nums]; squares
nums = [0, 1, 2, 3, 4]
even_squares = [x ** 2 for x in nums if x % 2 == 0]; even_squares
for idx, animal in enumerate(animals):
    print(f'#{idx + 1}: {animal}')
seq1 = ['foo', 'bar', 'baz']
seq2 = ['one', 'two', 'three']
zipped = zip(seq1, seq2)
list(zipped)
empty_dict = {}
d = {'cat': 'cute', 'dog': 'furry'}; d
We can access, insert, or set elements using the same approach as for lists:
d['cat']
d['fish'] = 'wet'; d
You can check if a key exists as follows
'cat' in d
Finally, you can delete values using the del keyword:
del d['fish']; d
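Checking for a key matters because indexing with a missing key raises a KeyError. As a minimal sketch, the get method offers an alternative that returns a fallback value instead; the fallback 'N/A' is an arbitrary choice for this example.

```python
d = {'cat': 'cute', 'dog': 'furry'}

# Indexing with a key that is not present raises a KeyError
try:
    d['monkey']
except KeyError:
    print('No entry for monkey')

# get() returns the supplied default when the key is missing
monkey = d.get('monkey', 'N/A')
cat = d.get('cat', 'N/A')
print(monkey, cat)

# Without a default, get() returns None for a missing key
print(d.get('fish'))
```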
d = {'person': 2, 'cat': 4, 'spider': 8}
for animal in d:
    legs = d[animal]
    print(f'A {animal} has {legs} legs')
d = {'person': 2, 'cat': 4, 'spider': 8}
for animal, legs in d.items():
    print(f'A {animal} has {legs} legs')
nums = [0, 1, 2, 3, 4]
even_num_to_square = {x: x ** 2 for x in nums if x % 2 == 0}
print(even_num_to_square)
animals = {'cat', 'dog'}; animals
Sets allow us to perform the standard set operations like union, intersection, difference and symmetric difference. For example:
felines = {'cat', 'tiger', 'lion'}
animals.union(felines)
animals.intersection(felines)
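The remaining two operations can be sketched in the same way. Each method also has an operator form, shown in the final lines. Sets are unordered, so the printed order of elements may vary.

```python
animals = {'cat', 'dog'}
felines = {'cat', 'tiger', 'lion'}

# Elements of animals that are not in felines
only_animals = animals.difference(felines)

# Elements that appear in exactly one of the two sets
in_one_only = animals.symmetric_difference(felines)

print(only_animals)
print(in_one_only)

# Operator equivalents: - for difference, ^ for symmetric difference
print(animals - felines == only_animals)
print(animals ^ felines == in_one_only)
```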
tup = 4, 5, 6; tup
nested_tup = (4, 5, 6), (7, 8); nested_tup
Multiplying a tuple by an integer has the effect of concatenating together copies of the tuple:
('foo', 'bar') * 4
string = 'This is a string!\n(But not a very interesting one)\n\n\tEnd.'
print(string)
In Python, strings are sequences of characters and as such one can iterate through them like lists:
for character in string:
    print(character)
Check their length like lists:
len(string)
Check if they contain certain elements like lists:
'!' in string
'?' not in string
We can also check if a substring is present in a string:
'very' in string
Capitalisation: There are different ways to manipulate the casing of strings:
'test'.upper()
'TEST'.lower()
'test'.capitalize()
Adding strings:
result = 'a'+'b'
print(result)
Splitting strings:
Often we need to split sentences into words or file paths into components. For this task we can use the split() method. By default a string is split wherever there is whitespace (this could be a normal space, a tab \t or a newline \n).
string.split()
'path/to/file/image.jpg'.split('/')
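One subtlety worth a quick sketch: calling split() with no argument treats any run of whitespace as a single separator, whereas passing an explicit ' ' splits on every individual space and leaves tabs and newlines alone. The example string text is made up for illustration.

```python
text = 'a  b\tc\n'  # note the two spaces between a and b

# No argument: runs of any whitespace are treated as one separator
default_split = text.split()

# Explicit separator: every single space splits, tabs and newlines stay
space_split = text.split(' ')

print(default_split)
print(space_split)
```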
Stripping strings:
Sometimes strings contain leading or trailing characters that we want to get rid of, such as whitespace or unnecessary characters. We can remove them with the strip() method. Like the split() method, it works on whitespace by default, but we can specify any characters we want:
'_path/to/file/image.jpg_'.strip('_')
'_-_path/to/file/image.jpg,_,'.strip(',_-')
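Two details are easy to miss, so here is a short sketch: the argument is treated as a set of characters rather than a fixed prefix or suffix, and only the ends of the string are affected. The related lstrip() and rstrip() methods restrict the stripping to one side.

```python
path = '_-_path/to/file/image.jpg,_,'

both_sides = path.strip(',_-')    # strip matching characters from both ends
left_only = path.lstrip(',_-')    # strip from the start only
right_only = path.rstrip(',_-')   # strip from the end only

# Characters in the middle of the string are never removed
inner = '_a_b_'.strip('_')

print(both_sides)
print(left_only)
print(right_only)
print(inner)
```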
Replacing:
With the replace() method one can replace substrings in a string.
'one plus one equals two!'.replace('two','three')
Joining strings
Sometimes we split strings into a list of words for processing (like stemming or stop word removal) and then want to join them back into a single string. To do this we can use the join() method:
' '.join(['this', 'is', 'a', 'list', 'of', 'words'])
'-'.join(['this', 'is', 'a', 'list', 'of', 'words'])
Exercise 2: Write a function that performs the following on string_1:
- split the string into words with spaces
- then strip the special character / from each word
- join the words back together with single spaces (' ')
- make the whole string lower-case
string_1 = 'This is a string!\n/(But not a very interesting one)/\n\n\tEnd.'
print(string_1)
Functions
Functions are a very important method of code reuse in Python. In general, if you find yourself repeating the same code more than once, it can be useful to move it into a reusable function. Doing so also makes your code more readable, which is very important when collaborating with others (or even your future self!).
Functions are declared with the def keyword and hand back a value with the return keyword, e.g.
def area_of_a_circle(radius):
    area = 3.14159 * radius ** 2
    return area
area_of_a_circle(5)
Note that we can have multiple return statements, e.g. based on the result of some conditional statements:
def sign(x):
    if x > 0:
        return 'positive'
    elif x < 0:
        return 'negative'
    else:
        return 'zero'
for x in [-1, 0, 1]:
    print(sign(x))
In addition to positional arguments, functions can have keyword arguments that are typically used to specify default values:
def hello(name, loud=False):
    if loud:
        print('G\'day, %s!' % name.upper())
    else:
        print('G\'day, %s' % name)
hello('Bob')
hello('Fred', loud=True)
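To make the role of the default value explicit, here is a minimal sketch using a hypothetical power function (not part of the course material). The default is used when the argument is omitted, and arguments passed by keyword can appear in any order.

```python
def power(base, exponent=2):
    """Raise base to exponent, squaring by default (illustrative only)."""
    return base ** exponent


squared = power(3)                    # exponent falls back to its default
cubed = power(3, exponent=3)          # the default is overridden by keyword
swapped = power(exponent=2, base=5)   # keywords may be given in any order

print(squared, cubed, swapped)
```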
Finally, it is quite common to write functions that return multiple objects as a tuple. For example we can write something like
def is_number_positive(number):
    if number > 0:
        return True, number
    else:
        return False, number
is_number_positive(-1)
is_number_positive(42)
One cool aspect of such functions is that the returned objects can be unpacked in two different ways:
tup = is_number_positive(3); tup
is_positive, number = is_number_positive(-10)
print(is_positive)
print(number)
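The same pattern appears whenever a function computes several related values at once. The sketch below uses a hypothetical helper, min_and_max, to show both ways of receiving the result: as a single tuple or unpacked into separate names.

```python
def min_and_max(values):
    """Return the smallest and largest element (illustrative helper)."""
    return min(values), max(values)


# Option 1: keep the returned tuple as a single object
result = min_and_max([3, 1, 4, 1, 5])
print(result)

# Option 2: unpack the tuple into one variable per returned value
smallest, largest = min_and_max([3, 1, 4, 1, 5])
print(smallest, largest)
```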
Ordered pairs
You are given three integers $x, y$ and $z$. You need to find the ordered pairs $(i, j )$ , such that $( i + j )$ is equal to $z$ and print them in lexicographic order. Here $i$ and $j$ are constrained to lie in the intervals $ 0 \leq i \leq x $ and $ 0 \leq j \leq y $. Below is a solution if we do not use list comprehension - can you find a solution that does?
# initialise input variables
x = 5
y = 4
z = 3
# initialise array for ordered pairs and counter
arr = []
pair_counter = 0
for i in range(x + 1):
    for j in range(y + 1):
        if i + j == z:
            arr.append([])
            arr[pair_counter] = [i, j]
            pair_counter += 1
print(arr)
Squirrel cigar party!
When squirrels get together for a party, they like to have cigars. A squirrel party is successful when the number of cigars is between 40 and 60, inclusive. Unless it is the weekend, in which case the squirrels go wild and there is no upper bound on the number of cigars. Complete the function below to return True if the party with the given values is successful, or False otherwise. Example output is shown below:
cigar_party(30, False) → False
cigar_party(50, False) → True
cigar_party(70, True) → True