Python Tools for Data Analysis

1. Python Installation: Anaconda

If you would rather not set up a local environment, you can try Google Colab for a free online Python environment. The examples are also available on Colab:
Open in Colab

Skip this section if you already have a Python environment.

Anaconda is a complete, open-source data science package with a community of over 6 million users. It is easy to download and install, and it supports Linux, macOS, and Windows (source).

In this tutorial, we’ll use Miniconda for minimal installation. Please refer to this page for the difference between Anaconda and Miniconda and which one to choose.

1.1. Windows and macOS

  1. Download the latest Miniconda installer (with Python 3.9) from the official website.
  2. Install the package according to the instructions.
  3. Start using conda from the Anaconda Prompt, or from other shells if you enabled that feature during installation.

Notice: To use the conda command in other shells/prompts, you need to add the conda directory to your PATH environment variable.

Reference: [1, 2].

1.2. Linux with terminal

  1. Start the terminal.
  2. Switch to ~/Download/ with the command cd ~/Download/. If the path does not exist, create it using mkdir ~/Download/.
  3. Download the latest Linux Miniconda distribution using wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh.
  4. Once the download is complete, install Miniconda using bash Miniconda3-latest-Linux-x86_64.sh.
  5. Follow the prompts on the installer screens. If you are unsure about any setting, accept the defaults. You can change them later.
  6. To make the changes take effect, close and then re-open your terminal window or use the command source ~/.bashrc.
  7. If you are using zsh or another shell, make sure conda is initialized for it. To do this, switch back to bash and run the command conda init <shell name>.

Reference: [1].

1.3. Verify your installation

You can use the command conda list to check your conda installation. If the terminal prints a list of installed Python packages, your installation was successful.

Reference: [1].

1.4. Conda environment

With conda, you can create, remove, and update environments, each with its own Python interpreter and its own set of packages. Switching or moving between environments is called activating the environment.

This part is optional, as you can directly use the base environment, which is the default conda environment. For those who want to know more, please refer to conda: managing environments for details and instructions.
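
For example, a minimal workflow might look like this (the environment name myenv is just a placeholder):

conda create -n myenv python=3.9   # create an environment named myenv
conda activate myenv               # switch to it
conda deactivate                   # switch back to the previous environment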

2. Package Installation

If you are using Anaconda or Miniconda, you can use the Anaconda package manager conda. You can also use other managers such as pip when the packages are not provided by any conda sources. However, in this tutorial, we’ll only cover how to install packages with conda commands.

To look for a specific package, you can visit this website and type the name of that package in the search box. For today’s tutorial, we need to install numpy, matplotlib, scikit-learn, and pandas.

First, switch to your conda environment using conda activate <env name> (not necessary if you are using the default base environment), then install those packages by executing these commands:

conda install -c conda-forge numpy
conda install -c conda-forge matplotlib
conda install -c conda-forge scikit-learn
conda install -c conda-forge pandas

The package manager will automatically install all dependencies, so if you choose to install scikit-learn first, you don’t have to install numpy manually, since scikit-learn depends on numpy.

If you prefer a fancier and more powerful Python shell, you can install ipython, or even jupyter notebook, which lets you run your code in the browser.

conda install -c conda-forge ipython
conda install jupyter

3. Basic Python Concepts

A more comprehensive tutorial can be found at the Stanford CS231n website. Due to time limitations, this and the following sections only introduce the basic concepts.

Open in Colab

We use Python 3.9 in this tutorial.

Notice that earlier Python versions may behave differently. Please refer to the official documentation for more details.

First, in your terminal, type python, ipython, or jupyter notebook to start an interactive Python shell; ipython or jupyter notebook is recommended.

3.1. Variable definition, input and output (print)

Python is dynamically typed: there is no type constraint on a variable, i.e., a variable can hold a value of any type.

a = 123
b = '123'
c = "1234"
print(a, b, c, type(a), type(b), type(c))

A variable can be reassigned to a value of a different type:

a = 123.456
print(type(a))
a = '123'
print(type(a))

Input some strings interactively:

x = input('Input something: ')
print(x, type(x))

Notice that this input method is rarely used in big data scenarios. A more practical input method is argparse.
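
As a minimal sketch of argparse (the script name, argument names, and defaults below are made up for illustration):

import argparse

# A toy command-line interface; run it as a script, not interactively.
parser = argparse.ArgumentParser(description='Toy argument parser.')
parser.add_argument('--input', type=str, default='data.csv', help='path to the input file')
parser.add_argument('--epochs', type=int, default=10, help='number of iterations')
args = parser.parse_args()
print(args.input, args.epochs)

Save it as, say, parse_demo.py and run python parse_demo.py --epochs 20 to see the parsed values.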

3.2. List, tuple, set and dictionary

  • List is a collection that is ordered and changeable. It allows duplicate members.
  • Tuple is a collection that is ordered but not changeable. It also allows duplicate members.
  • Set is a collection that is unordered and unindexed. It does not allow duplicate members, and its elements cannot be retrieved by index.
  • Dictionary is a collection that is ordered, changeable, and indexed by keys. It does not allow duplicate keys.

Notice that Dictionary used to be unordered before Python 3.7.

_list = [1, 2, 1.2, '1', '2', 1]  # this is a list
_tuple = (1, 2, 1.2, '1', '2', 1)  # this is a tuple
_set = {1, 2, 1.2, '1', '2', 1}  # this is a set
_dict = {  # this is a dict
    1: '111',
    2: '222',
    '1': 567,
    2.2: ['J', 'Q', 'K']
}
print(_list, '\n', _tuple, '\n', _set, '\n', _dict)

Access elements

print(_list[0], _list[-2], _list[1: 3])
print(_tuple[1], _tuple[-2])
# print(_set[0], _set[-1])  # TypeError: a set does not support indexing
print(_dict[1], _dict['1'], _dict[2.2])

Assignment does not copy a list; it only binds another name to the same object:

a = _list
a[0] = 888             # modifies the single shared list
print(a, '\n', _list)  # both names show the change
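
To get an actual copy that does not share memory with the original, you can use the list’s copy method (or copy.deepcopy for nested structures):

import copy

b = _list.copy()           # shallow copy: a new list holding the same element objects
b[0] = 1
print(b, '\n', _list)      # only b changed; _list still starts with 888

c = copy.deepcopy(_list)   # deep copy: also copies nested containers, if any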

3.3. If else

if 888 not in _dict.keys():
    _dict[888] = '???'
elif 999 not in _dict.keys():
    _dict[999] = '!@#$%'
else:
    _dict['qwert'] = 'poiuy'

3.4. Loops

for loop:

for x in _list:
    print(x)

for i in range(len(_list)):
    print(_list[i])

while loop:

i = 0
while i != len(_list):
    print(_list[i])
    i += 1

3.5. Functions

Define a function:

def my_func(x):
    x += 1
    print('in function: ', x)
    return x

Call the function:

t = 10
tt = my_func(t)
print(f'out of function, t: {t}, tt: {tt}')

4. Basic Numpy Usage

4.1. Array creation

A numpy array is a grid of values, all of the same type, and is indexed by a tuple of nonnegative integers. The number of dimensions is the rank of the array; the shape of an array is a tuple of integers giving the size of the array along each dimension.

We can initialize numpy arrays from nested Python lists, and access elements using square brackets:

import numpy as np

a = np.array([1, 2, 3])   # Create a rank 1 array
print(type(a), a.dtype)
print(a.shape)
print(a[1])

b = np.array([[1,2,3],[4,5,6]])    # Create a rank 2 array
print(b.shape)
print(b[0, 0], b[0, 1], b[1, 0])

Change the type of an array:

print(a.dtype)
a = a.astype(float)
print(a.dtype)

Other array creation methods:

a = np.zeros((2,2))   # Create an array of all zeros
print(a)
b = np.ones((1,2))    # Create an array of all ones
print(b)
c = np.full((2,2), 7, dtype=np.float32)  # Create a constant array
print(c)
d = np.eye(3)         # Create a 3x3 identity matrix
print(d)
e = np.random.random((3,3))  # Create an array filled with random values
print(e)

4.2. Array indexing

Similar to Python lists, numpy arrays can be sliced. Since arrays may be multidimensional, you must specify a slice for each dimension of the array:

# Create a rank 1 array and reshape it to a 3x4 matrix
a = np.arange(12).reshape(3, 4)
b = a[:2, 1:3]
print(a)
print(b)

# Slicing returns a view into the same data, not a copy
b[0, 0] = 888
print(a)  # a is modified as well
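
If you want a slice that owns its own data instead of a view, call .copy() on it:

b = a[:2, 1:3].copy()  # b no longer shares memory with a
b[0, 0] = -1
print(a)               # a is unchanged this time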

You can mix integer indexing with slice indexing. However, integer indexing will yield an array of lower rank than the original array:

row_r1 = a[1, :]    # Rank 1 view of the second row of a
row_r2 = a[1:2, :]  # Rank 2 view of the second row of a
print(row_r1, row_r1.shape)
print(row_r2, row_r2.shape)

You can also access elements of the array through lists of indices (integer array indexing):

x = [0, 1, 2]
y = [3, 1, 0]
print(a[x, y])

Or through a boolean array:

b = a > 4
print(b)
print(a[b])

4.3. Array math

Basic mathematical functions operate element-wise on arrays, and are available both as operator overloads and as functions in the numpy module:

x = np.arange(1, 5, dtype=float).reshape(2, 2)  # use the builtin float; np.float was removed in newer NumPy
y = np.arange(5, 9, dtype=float).reshape(2, 2)
print(x)
print(y)

# Elementwise sum
print(x + y)
print(np.add(x, y))

# Elementwise difference
print(x - y)
print(np.subtract(x, y))

# Elementwise product
print(x * y)
print(np.multiply(x, y))

# Elementwise division
print(x / y)
print(np.divide(x, y))

# Elementwise square
print(x ** 2)
print(np.power(x, 2))

# Elementwise square root
print(x ** 0.5)
print(np.sqrt(x))

Matrix multiplication

x = np.arange(1, 5, dtype=float).reshape(2, 2)
y = np.arange(5, 9, dtype=float).reshape(2, 2)
print(x)
print(y)

v = np.array([9, 10], dtype=float)
w = np.array([11, 12], dtype=float)

# Inner product
print(v.dot(w))
print(np.dot(v, w))
print(v @ w)

# Matrix / vector product
print(x.dot(v))
print(np.dot(x, v))
print(x @ v)

# Matrix / matrix product
print(x.dot(y))
print(np.dot(x, y))
print(x @ y)

Attention: np.dot() and @ behave differently when the arrays have more than two dimensions.
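
For example, with two stacks of matrices, @ performs a batched matrix multiplication, while np.dot sums over the last axis of the first array and the second-to-last axis of the second, yielding a higher-dimensional result:

p = np.ones((2, 3, 4))
q = np.ones((2, 4, 5))
print((p @ q).shape)       # (2, 3, 5): matrix product applied to each of the 2 stacked pairs
print(np.dot(p, q).shape)  # (2, 3, 2, 5): pairs every stack of p with every stack of q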

Numpy provides many useful functions for performing computations on arrays such as sum:

print(np.sum(x))  # Compute sum of all elements; prints "10.0"
print(x.sum())  # same as above
print(np.sum(x, axis=0))  # Compute sum of each column; prints "[4. 6.]"
print(np.sum(x, axis=1))  # Compute sum of each row; prints "[3. 7.]"

To transpose a matrix, use the T attribute of an array object:

print(x.T)

# Note that taking the transpose of a rank one array does nothing:
print(v)
print(v.T)
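
If you need an explicit column vector, reshape the rank 1 array first (v[:, np.newaxis] is an equivalent idiom):

col = v.reshape(-1, 1)  # shape (2, 1)
print(col)
print(col.T)            # transposing now has an effect: shape (1, 2)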

5. Using Matplotlib for visualization

import numpy as np
import matplotlib.pyplot as plt
# %matplotlib qt

# Compute the x and y coordinates for points on a sine curve
x = np.arange(0, 3 * np.pi, 0.1)
y = np.sin(x)

# Plot the points using matplotlib
plt.plot(x, y)
plt.show()  # You must call plt.show() to make graphics appear.

Note: in a jupyter notebook, you can use the magic command %matplotlib inline to embed graphics in the notebook, or %matplotlib qt to open them in separate windows.
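
If you are running a plain Python script on a machine without a display, a common alternative is to save the figure to a file instead of showing it (the file name below is arbitrary):

plt.plot(x, y)
plt.savefig('sine.png', dpi=150)  # write the figure to disk instead of opening a window
plt.close()                       # release the figure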

To plot multiple lines at once, and add a title, legend, and axis labels:

x = np.arange(0, 3 * np.pi, 0.1)
y_sin = np.sin(x)
y_cos = np.cos(x)

# Plot the points using matplotlib
plt.plot(x, y_sin)
plt.plot(x, y_cos)
plt.xlabel('x axis label')
plt.ylabel('y axis label')
plt.title('Sine and Cosine')
plt.legend(['Sine', 'Cosine'])
plt.show()

You can plot different things in the same figure using the subplot function. Here is an example:

# Set up a subplot grid that has height 2 and width 1,
# and set the first such subplot as active.
plt.subplot(2, 1, 1)

# Make the first plot
plt.plot(x, y_sin)
plt.title('Sine')

# Set the second subplot as active, and make the second plot.
plt.subplot(2, 1, 2)
plt.plot(x, y_cos)
plt.title('Cosine')

# Show the figure.
plt.show()

6. Pandas and Scikit-Learn for Data Science

In this section, we will look at a data science example that uses pandas as the data management tool and scikit-learn (sklearn) for the algorithm implementations. This section is adapted from this tutorial.

6.1. Import packages

import numpy as np
import pandas as pd

# automatically split the data into training and test set
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

# classifiers and regressors
from sklearn.ensemble import RandomForestRegressor
# Construct a Pipeline from the given estimators
from sklearn.pipeline import make_pipeline
# Exhaustive search over specified parameter values for an estimator.
from sklearn.model_selection import GridSearchCV

# Training objective and evaluation metrics
from sklearn.metrics import mean_squared_error, r2_score
# For model persistence
# you can use `from sklearn.externals import joblib` if your sklearn version is earlier than 0.23
import joblib

6.2. Load data

You can download the data by clicking the link or using wget (wget https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv) and moving the file to your current folder. Then, load the CSV data into memory through pandas:

data = pd.read_csv('winequality-red.csv', sep=';')

Or, you can load the data directly from the URL:

dataset_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
data = pd.read_csv(dataset_url, sep=';')

You can also load datasets stored in other formats with pandas. Detailed documentation is at pandas: io.
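
For instance, a few of the other readers pandas provides (the file names here are placeholders, and some readers need extra packages installed, e.g., openpyxl for Excel or pyarrow for Parquet):

# df = pd.read_excel('data.xlsx')       # Excel spreadsheets
# df = pd.read_json('data.json')        # JSON files
# df = pd.read_parquet('data.parquet')  # Parquet files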

6.3. Take a look at the loaded data

The loaded data is stored as a pandas.core.frame.DataFrame.

To get a peek at the data, we can use

print(data)

This will return a nice-looking preview of the elements in the DataFrame.
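
If you only want the first few rows rather than the whole frame, DataFrame.head is handy:

print(data.head())    # first 5 rows by default
print(data.head(10))  # first 10 rows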

To view the names of the features of a DataFrame, one can use

print(data.keys())

To access one column, i.e., all instances of a feature, e.g., pH, one can use

# These will return the same result
print(data['pH'])
print(data.pH)

To access a row, you need the DataFrame.iloc attribute:

print(data.iloc[10])

We can also easily print some summary statistics:

print(data.describe())

6.4. Split data

First, let’s separate our target (y) feature from our input (X) features and divide the dataset into training and test sets using the train_test_split function:

y = data.quality
X = data.drop('quality', axis=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

Stratifying your sample by the target variable ensures that your training set looks similar to your test set, making your evaluation metrics more reliable.
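
You can verify this by comparing the class proportions of the two splits, which should be nearly identical:

print(y_train.value_counts(normalize=True).sort_index())
print(y_test.value_counts(normalize=True).sort_index())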

6.5. Pre-processing

Standardization is the process of subtracting from each feature its mean and then dividing by its standard deviation. It is a common requirement for machine learning tasks: many algorithms assume that all features are centered around zero and have approximately the same variance.

scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# To show that the training and test sets have (nearly) zero mean and unit standard deviation.
# The test-set statistics are only approximate, because the scaler was fit on the training set.
print(X_train_scaled.mean(axis=0))
print(X_train_scaled.std(axis=0))
print(X_test_scaled.mean(axis=0))
print(X_test_scaled.std(axis=0))

6.6. Fit the model

If we do not need to fine-tune the hyperparameters, we can define a random forest regression model with the default hyperparameters and fit the model using

regr = RandomForestRegressor()
regr.fit(X_train_scaled, y_train)

To examine the performance, we use the test set to calculate the scores

pred = regr.predict(X_test_scaled)

print(r2_score(y_test, pred))
print(mean_squared_error(y_test, pred))

6.7. Define the cross-validation pipeline

Fine-tuning hyperparameters is an important job in Machine Learning since a set of carefully chosen hyperparameters may greatly improve the performance of the model.

In practice, when we set up the cross-validation pipeline, we won’t even need to manually fit the data. Instead, we’ll simply declare the class object, like so:

pipeline = make_pipeline(
    preprocessing.StandardScaler(),
    RandomForestRegressor(n_estimators=100)
)

To check the hyperparameters, we may use

print(pipeline.get_params())

or refer to the official document.

Now, let’s declare the hyperparameters we want to tune through cross-validation.

hyperparameters = {
    # note: 'auto' is accepted by older scikit-learn versions only; it was removed in 1.3
    'randomforestregressor__max_features': ['auto', 'sqrt', 'log2'],
    'randomforestregressor__max_depth': [None, 5, 3, 1]
}

Then, we can set up a 10-fold cross-validation as simply as

clf = GridSearchCV(pipeline, hyperparameters, cv=10)

Finally, we can automatically fine-tune the model using

clf.fit(X_train, y_train)

After the model fitting, if we want to check the best hyperparameters, we can use

print(clf.best_params_)
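
GridSearchCV also exposes the mean cross-validated score of the best parameter combination, as well as the full search results:

print(clf.best_score_)                     # mean cross-validated score of clf.best_params_
print(clf.cv_results_['mean_test_score'])  # one score per hyperparameter combination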

As before, we evaluate the fitted model on the test set:

pred = clf.predict(X_test)

print(r2_score(y_test, pred))
print(mean_squared_error(y_test, pred))

6.8. Save and load models

After training, we may want to save the trained model for future use. For this purpose, we can use

joblib.dump(clf, 'rf_regressor.pkl')

When you want to load the model again, simply use this function:

clf2 = joblib.load('rf_regressor.pkl')
 
# Predict data set using loaded model
clf2.predict(X_test)

A more comprehensive example of scikit-learn can be found here.