0% found this document useful (0 votes)
933 views

Python Introduction

A Python intro

Uploaded by

chhackl
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
933 views

Python Introduction

A Python intro

Uploaded by

chhackl
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 405

Introduction to Python for Econometrics, Statistics and Data Analysis

Kevin Sheppard
University of Oxford
Tuesday 5th August, 2014

2012, 2013, 2014 Kevin Sheppard

Changes since the Second Edition


Version 2.2.1 (August 2014)

Fixed typos reported by a reader thanks to Ilya Sorvachev


Version 2.2 (July 2014)

Code verified against Anaconda 2.0.1.


Added diagnostic tools and a simple method to use external code in the Cython section.
Updated the Numba section to reflect recent changes.
Fixed some typos in the chapter on Performance and Optimization.
Added examples of joblib and IPythons cluster to the chapter on running code in parallel
Version 2.1 (February 2014)

New chapter introducing object oriented programming as a method to provide structure and organization to related code.
Added seaborn to the recommended package list, and have included it be default in the graphics
chapter.
Based on experience teaching Python to economics students, the recommended installation has
been simplified by removing the suggestion to use virtual environment. The discussion of virtual
environments as been moved to the appendix.
Rewrote parts of the pandas chapter.
Code verified against Anaconda 1.9.1.
Version 2.02 (November 2013)

Changed the Anaconda install to use both create and install, which shows how to install additional
packages.
Fixed some missing packages in the direct install.
Changed the configuration of IPython to reflect best practices.
Added subsection covering IPython profiles.
i

Version 2.01 (October 2013)

Updated Anaconda to 1.8 and added some additional packages to the installation for Spyder.
Small section about Spyder as a good starting IDE.

ii

Notes to the 2nd Edition


This edition includes the following changes from the first edition (March 2012):
The preferred installation method is now Continuum Analytics Anaconda. Anaconda is a complete
scientific stack and is available for all major platforms.
New chapter on pandas. pandas provides a simple but powerful tool to manage data and perform
basic analysis. It also greatly simplifies importing and exporting data.
New chapter on advanced selection of elements from an array.
Numba provides just-in-time compilation for numeric Python code which often produces large performance gains when pure NumPy solutions are not available (e.g. looping code).
Dictionary, set and tuple comprehensions
Numerous typos
All code has been verified working against Anaconda 1.7.0.

iii

iv

Contents

Introduction

1.1

Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1.2

Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1.3

Important Components of the Python Scientific Stack . . . . . . . . . . . . . . . . . . . . . . . .

1.4

Setup

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1.5

Using Python

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1.6

Exercises

1.A

Frequently Encountered Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

1.B

register_python.py . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

1.C

Advanced Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

Python 2.7 vs. 3 (and the rest)

27

2.1

Python 2.7 vs. 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

2.2

Intel Math Kernel Library and AMD Core Math Library . . . . . . . . . . . . . . . . . . . . . . . . 27

2.3

Other Variants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

2.A

Relevant Differences between Python 2.7 and 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

Built-in Data Types

31

3.1

Variable Names . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.2

Core Native Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.3

Python and Memory Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

3.4

Exercises

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

Arrays and Matrices

47

4.1

Array . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

4.2

Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

4.3

1-dimensional Arrays

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

4.4

2-dimensional Arrays

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

4.5

Multidimensional Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

4.6

Concatenation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

4.7

Accessing Elements of an Array . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

4.8

Slicing and Memory Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

4.9

import and Modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

4.10 Calling Functions


4.11 Exercises
5

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

Basic Math

63

5.1

Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

5.2

Broadcasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

5.3

Array and Matrix Addition (+) and Subtraction (-) . . . . . . . . . . . . . . . . . . . . . . . . . . 65

5.4

Array Multiplication (*) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

5.5

Matrix Multiplication (*) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

5.6

Array and Matrix Division (/) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

5.7

Array Exponentiation (**) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

5.8

Matrix Exponentiation (**) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

5.9

Parentheses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

5.10 Transpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.11 Operator Precedence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.12 Exercises
6

Basic Functions and Numerical Indexing

71

6.1

Generating Arrays and Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

6.2

Rounding

6.3

Mathematics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

6.4

Complex Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

6.5

Set Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

6.6

Sorting and Extreme Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

6.7

Nan Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

6.8

Functions and Methods/Properties

6.9

Exercises

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

Special Arrays

83

7.1
8

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

Exercises

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

Array and Matrix Functions

85

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

8.1

Views

8.2

Shape Information and Transformation

8.3

Linear Algebra Functions

8.4

Exercises

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

Importing and Exporting Data

99

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

9.1

Importing Data using pandas

9.2

Importing Data without pandas

9.3

Saving or Exporting Data using pandas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

vi

9.4

Saving or Exporting Data without pandas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

9.5

Exercises

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

10 Inf, NaN and Numeric Limits

109

10.1 inf and NaN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109


10.2 Floating point precision
10.3 Exercises

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

11 Logical Operators and Find

113

11.1 >, >=, <, <=, ==, != . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113


11.2 and, or, not and xor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
11.3 Multiple tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
11.4 is* . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
11.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
12 Advanced Selection and Assignment
12.1 Numerical Indexing

119

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

12.2 Logical Indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124


12.3 Performance Considerations and Memory Management . . . . . . . . . . . . . . . . . . . . . . . 128
12.4 Assignment with Broadcasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
12.5 Exercises

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130

13 Flow Control, Loops and Exception Handling

133

13.1 Whitespace and Flow Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133


13.2 if . . . elif . . . else . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
13.3 for

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

13.4 while . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137


13.5 try . . . except . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
13.6 List Comprehensions

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

13.7 Tuple, Dictionary and Set Comprehensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141


13.8 Exercises

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141

14 Dates and Times

143

14.1 Creating Dates and Times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143


14.2 Dates Mathematics
14.3 Numpy datetime64

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

15 Graphics

147

15.1 seaborn . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147


15.2 2D Plotting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
15.3 Advanced 2D Plotting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
15.4 3D Plotting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162

vii

15.5 General Plotting Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165


15.6 Exporting Plots
15.7 Exercises

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166

16 Structured Arrays

167

16.1 Mixed Arrays with Column Names . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167


16.2 Record Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
17 pandas

171

17.1 Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171


17.2 Statistical Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
17.3 Time-series Data

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192

17.4 Importing and Exporting Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196


17.5 Graphics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
17.6 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
18 Custom Function and Modules
18.1 Functions

207

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207

18.2 Variable Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214


18.3 Example: Least Squares with Newey-West Covariance . . . . . . . . . . . . . . . . . . . . . . . 215
18.4 Anonymous Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
18.5 Modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
18.6 Packages

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217

18.7 PYTHONPATH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219


18.8 Python Coding Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
18.9 Exercises

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220

18.A Listing of econometrics.py . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221


19 Probability and Statistics Functions

225

19.1 Simulating Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225


19.2 Simulation and Random Number Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
19.3 Statistics Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
19.4 Continuous Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
19.5 Select Statistics Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
19.6 Select Statistical Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
19.7 Exercises

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241

20 Non-linear Function Optimization

243

20.1 Unconstrained Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244


20.2 Derivative-free Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
20.3 Constrained Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
20.4 Scalar Function Minimization

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
viii

20.5 Nonlinear Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253


20.6 Exercises

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254

21 String Manipulation

255

21.1 String Building . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255


21.2 String Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
21.3 Formatting Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
21.4 Regular Expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
21.5 Safe Conversion of Strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
22 File System Operations

267

22.1 Changing the Working Directory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267


22.2 Creating and Deleting Directories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
22.3 Listing the Contents of a Directory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
22.4 Copying, Moving and Deleting Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
22.5 Executing Other Programs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
22.6 Creating and Opening Archives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
22.7 Reading and Writing Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
22.8 Exercises

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272

23 Performance and Code Optimization


23.1 Getting Started

273

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273

23.2 Timing Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273


23.3 Vectorize to Avoid Unnecessary Loops

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274

23.4 Alter the loop dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275


23.5 Utilize Broadcasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
23.6 Use In-place Assignment

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276

23.7 Avoid Allocating Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276


23.8 Inline Frequent Function Calls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
23.9 Consider Data Locality in Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
23.10Profile Long Running Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
23.11Numba . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
23.12Cython . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
23.13External Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
23.14Exercises

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302

24 Executing Code in Parallel

303

24.1 map and related functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303


24.2 multiprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
24.3 joblib . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
24.4 IPythons Parallel Cluster

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308

24.5 Converting a Serial Program to Parallel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314

ix

24.6 Other Concerns when executing in Parallel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316


25 Object Oriented Programming (OOP)

319

25.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319


25.2 Class basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
25.3 Building a class for Autoregressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329

25.4 Exercises

26 Other Interesting Python Packages

331

26.1 statsmodels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331


26.2 pytz and babel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
26.3 rpy2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
26.4 PyTables and h5py . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
27 Examples

333

27.1 Estimating the Parameters of a GARCH Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 333


27.2 Estimating the Risk Premia using Fama-MacBeth Regressions . . . . . . . . . . . . . . . . . . . 338
27.3 Estimating the Risk Premia using GMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
27.4 Outputting LATEX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
28 Quick Reference

347

28.1 Built-ins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347


28.2 NumPy (numpy) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
28.3 SciPy

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369

28.4 Matplotlib

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372

28.5 Pandas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374


28.6 IPython

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 378

Chapter 1

Introduction
1.1

Background

These notes are designed for someone new to statistical computing wishing to develop a set of skills necessary to perform original research using Python. They should also be useful for students, researchers or
practitioners who require a versatile platform for econometrics, statistics or general numerical analysis
(e.g. numeric solutions to economic models or model simulation).
Python is a popular general purpose programming language which is well suited to a wide range of
problems.1 Recent developments have extended Pythons range of applicability to econometrics, statistics
and general numerical analysis. Python with the right set of add-ons is comparable to domain-specific
languages such as R, MATLAB or Julia. If you are wondering whether you should bother with Python (or
another language), a very incomplete list of considerations includes:
You might want to consider R if:
You want to apply statistical methods. The statistics library of R is second to none, and R is clearly
at the forefront in new statistical algorithm development meaning you are most likely to find that
new(ish) procedure in R.
Performance is of secondary importance.
Free is important.
You might want to consider MATLAB if:
Commercial support, and a clean channel to report issues, is important.
Documentation and organization of modules is more important than raw routine availability.
Performance is more important than scope of available packages. MATLAB has optimizations, such
as Just-in-Time (JIT) compilation of loops, which is not automatically available in most other packages.
You might want to consider Julia if:
1

According to the ranking onhttp://www.tiobe.com/, Python is the 8th most popular language. http://langpop.corger.
nl/ ranks Python as 5th or 6th , and on http://langpop.com/, Python is 6th .

Performance in an interactive based language is your most important concern.


You dont mind learning enough Python to interface with Python packages. The Julia ecosystem is
in its infancy and a bridge to Python is used to provide important missing features.
You like living on the bleeding edge, and arent worried about code breaking across new versions of
Julia.
You like to do most things yourself.
Having read the reasons to choose another package, you may wonder why you should consider Python.
You need a language which can act as an end-to-end solution so that everything from accessing webbased services and database servers, data management and processing and statistical computation
can be accomplished in a single language. Python can even be used to write server-side apps such as
dynamic website (see e.g. http://stackoverflow.com), apps for desktop-class operating systems
with graphical user interfaces and even tablets and phones apps (iOS and Android).
Data handling and manipulation especially cleaning and reformatting is an important concern.
Python is substantially more capable at data set construction than either R or MATLAB.
Performance is a concern, but not at the top of the list.2
Free is an important consideration Python can be freely deployed, even to 100s of servers in a
compute cluster or in the cloud (e.g. Amazon Web Services or Azure).
Knowledge of Python, as a general purpose language, is complementary to R/MATLAB/Julia/Ox/GAUSS/Stata.

1.2

Conventions

These notes will follow two conventions.


1. Code blocks will be used throughout.
"""A docstring
"""
# Comments appear in a different color
# Reserved keywords are highlighted
and as assert break class continue def del elif else
except exec finally for from global if import in is
lambda not or pass print raise return try while with yield
# Common functions and classes are highlighted in a
# different color. Note that these are not reserved,
2

Python performance can be made arbitrarily close to C using a variety of methods, including Numba (pure python), Cython
(C/Python creole language) or directly calling C code. Moreover, recent advances have substantially closed the gap with respect
to other Just-in-Time compiled languages such as MATLAB.

# and can be used although best practice would be


# to avoid them if possible
array matrix xrange list True False None
# Long lines are indented
some_text = This is a very, very, very, very, very, very, very, very, very, very, very
, very long line.

2. When a code block contains >>>, this indicates that the command is running an interactive IPython
session. Output will often appear after the console command, and will not be preceded by a command indicator.
>>> x = 1.0
>>> x + 2
3.0

If the code block does not contain the console session indicator, the code contained in the block is
intended to be executed in a standalone Python file.
from __future__ import print_function
import numpy as np
x = np.array([1,2,3,4])
y = np.sum(x)
print(x)
print(y)

1.3
1.3.1

Important Components of the Python Scientific Stack


Python

Python 2.7.6 (or later, but in the Python 2.7.x family) is required. This provides the core Python interpreter.

1.3.2

NumPy

NumPy provides a set of array and matrix data types which are essential for statistics, econometrics and
data analysis.

1.3.3

SciPy

SciPy contains a large number of routines needed for analysis of data. The most important include a wide
range of random number generators, linear algebra routines and optimizers. SciPy depends on NumPy.

1.3.4

IPython

IPython provides an interactive Python environment which enhances productivity when developing code
or performing interactive data analysis.
3

1.3.5

matplotlib and seaborn

matplotlib provides a plotting environment for 2D plots, with limited support for 3D plotting. seaborn is
a Python package that improves the default appearance of matplotlib plots without any additional code.

1.3.6

pandas

pandas provides high-performance data structures.

1.3.7

Performance Modules

A number of modules are available to help with performance. These include Cython and Numba. Cython
is a Python module which facilitates using a simple Python-derived creole to write functions that can be
compiled to native (C code) Python extensions. Numba uses a method of just-in-time compilation to
translate a subset of Python to native code using Low-Level Virtual Machine (LLVM).

1.4

Setup

The recommended method to install the Python scientific stack is to use Continuum Analytics Anaconda.
Appendix 1.C describes a more complex installation procedure with instructions for directly installing
Python and the required modules when it is not possible to install Anaconda. The appendix also discusses
using virtual environments, which are considered best practices when using Python.

1.4.1

Continuum Analytics Anaconda

Anaconda, a free product of Continuum Analytics (www.continuum.io), is a virtually complete scientific


stack for Python. It includes both the core Python interpreter and standard libraries as well as most
modules required for data analysis. Anaconda is free to use and modules for accelerating the performance of linear algebra on Intel processors using the Math Kernel Library (MKL) are available (free to
academic users and for a small cost to non-academic users). Continuum Analytics also provides other
high-performance modules for reading large data files or using the GPU to further accelerate performance
for an additional, modest charge. Most importantly, installation is extraordinarily easy on Windows, Linux
and OS X. Anaconda is also simple to update to the latest version using
conda update conda
conda update anaconda

Windows

Installation on Windows requires downloading the installer and running. These instructions use ANACONDA to indicate the Anaconda installation directory (e.g. the default is C:\Anaconda). Once the setup
has completed, open a command prompt (cmd.exe) and run
cd ANACONDA\Scripts
conda update conda
conda update anaconda
conda install mkl

which will first ensure that Anaconda is up-to-date. The final line installs the recommended Intel Math
Kernel Library to accelerate linear algebra routines. Using MKL requires a license which is available for
free to academic uses and for a modest charge otherwise. If acquiring a license is not possible, omit this
line. conda install can be used later to install other packages that may be of interest. Next, change to
and then run
cd ANACONDA\Scripts
pip install pylint html5lib seaborn

which installs additional packages not directly available in Anaconda. Note that if Anaconda is installed
into a directory other than the default, the full path should not contain unicode characters or spaces.
Notes

The recommended settings for installing Anaconda on Windows are:


Install for all users, which requires admin privileges. If these are not available, then choose the Just
for me option, but be aware of installing on a path that contains non-ASCII characters which can
cause issues.
Add Anaconda to the System PATH - This is important to ensure that Anaconda commands can be
run from the command prompt.
Register Anaconda as the system Python - If Anaconda is the only Python installed, then select this
option.
If Anaconda is not added to the system path, it is necessary to add the ANACONDA and ANACONDA\Scripts
directories to the PATH using
set PATH=ANACONDA;ANACONDA\Scripts;%PATH%

before running Python programs.


Linux and OS X

Installation on Linux requires executing


bash Anaconda-x.y.z-Linux-ISA.sh

where x.y.z will depend on the version being installed and ISA will be either x86 or more likely x86_64.
The OS X installer is available either in a GUI installed (pkg format) or as a bash installer which is installed
in an identical manner to the Linux installation. It is strongly recommended that the anaconda/bin is
prepended to the path. This can be performed in a session-by-session basis by entering
export PATH=/home/python/anaconda/bin;$PATH

On Linux this change can be made permanent by entering this line in .bashrc which is a hidden file located
in ~/. On OS X, this line can be added to .bash_profile which is located in the home directory (~/).
After installation completes, change to the folder where Anaconda installed (written here as ANACONDA, default ~/anaconda) and execute
conda update conda
conda update anaconda
conda install mkl

which will first ensure that Anaconda is up-to-date and then to install the Intel Math Kernel library-linked
modules, which provide substantial performance improvements this package requires a license which
is free to academic users and low cost to others. If acquiring a license is not possible, omit this line.
conda install can be used later to install other packages that may be of interest. Finally, run the command
pip install pylint html5lib seaborn

to install some packages not included in Anaconda.


Notes

All instructions for OS X and Linux assume that ANACONDA/bin has been added to the path. If this is not
the case, it is necessary to run
cd ANACONDA
cd bin

and then all commands must be prepended by a . as in


.conda update conda

1.5

Using Python

Python can be programmed using an interactive session using IPython or by directly executing Python
scripts text files that end in the extension .py using the Python interpreter.

1.5.1

Python and IPython

Most of this introduction focuses on interactive programming, which has some distinct advantages when
learning a language. The standard Python interactive console is very basic and does not support useful
features such as tab completion. IPython, and especially the QtConsole version of IPython, transforms
the console into a highly productive environment which supports a number of useful features:
Tab completion - After entering 1 or more characters, pressing the tab button will bring up a list of
functions, packages and variables which match the typed text. If the list of matches is large, pressing
tab again allows the arrow keys can be used to browse and select a completion.
Magic function which make tasks such as navigating the local file system (using %cd ~/directory/
or just cd ~/directory/ assuming that %automagic is on) or running other Python programs (using
run program.py) simple. Entering %magic inside and IPython session will produce a detailed description of the available functions. Alternatively, %lsmagic produces a succinct list of available
magic commands. The most useful magic functions are
cd - change directory
edit filename - launch an editor to edit filename
ls or ls pattern - list the contents of a directory
6

run filename - run the Python file filename


timeit - time the execution of a piece of code or function
Integrated help - When using the QtConsole, calling a function provides a view of the top of the help
function. For example, entering mean( will produce a view of the top 20 lines of its help text.
Inline figures - The QtConsole can also display figure inline which produces a tidy, self-contained
environment. (when using the --pylab=inline switch when starting, or when using the configuration option _c.IPKernelApp.pylab="inline").
The special variable _ contains the last result in the console, and so the most recent result can be
saved to a new variable using the syntax x = _.
Support for profiles, which provide further customization of sessions.

1.5.2

IPython Profiles

IPython supports using profiles which allows for alternative environments (at launch), either in appearance or in terms of packages which have been loaded into the IPython session. Profiles are configured
using a set of files located in
%USERPROFILE%\.ipython\

on Windows and
~/.config/ipython/

on OS X or Linux. There should be one directory in this location, profile_default, that is mostly empty. To
configure a profile open a terminal or command prompt and run
ipython profile create econometrics

This will create a directory named profile_econometrics and populate it with 4 files:
File

Purpose

ipython_config.py

General IPython setting for all IPython sessions


Settings used by the Notebook converter
Settings specific to IPython Notebook (browser) sessions
Settings specific to QtConsole sessions

ipython_nbconvert_config.py
ipython_notebook_config.py
ipython_qtconsole_config.py

The two most important are ipython_config and ipython_qtconsole_config. Opening these files in a text
editor will reveal a vast array of options, all which are commented out using #. A full discussion of these
files would require a chapter or more, and so please refer to the online IPython documentation for details
about a specific setting (although most settings have a short comment containing an explanation and
possible values).
ipython_config

The settings in this file apply to all IPython sessions using this profile, irrespective of whether they are in
the terminal, QtConsole or Notebook. One of the most useful settings is
c.InteractiveShellApp.exec_lines

which allows commands to be executed each time an IPython session is open. This is useful, for example,
to import specific packages commonly used in a project. Another useful configuration options is
c.InteractiveShellApp.pylab

which can be used to load pylab in the session, and is identical to launching an IPython session using the
command line switch --pylab=backend. An alternative is to use
c.InteractiveShellApp.matplotlib

which will only load matplotlib and not the rest of pylab.
ipython_qtconsole_config

The settings in this file only apply to QtConsole sessions, and the most useful affect the appearance of the
console. The first two can be used to set the font size (a number) and font family (a string, containing the
name of the font).
c.IPythonWidget.font_size

c.IPythonWidget.font_family

The next setting sets the model for pylab, which can in particular be set to "inline" which is identical to
using the command line switch --pylab=inline when starting IPython using the QtConsole. This setting
is similar to the previous pylab setting, but since this is specific to QtConsole sessions, it will override the
general setting (only) in using QtConsole, and so it is possible to use, for example, "qt4", for terminalbased IPython sessions, and to use "inline" for QtConsole sessions.
c.IPKernelApp.pylab

This final setting is identical to the command-line switch --colors and can be set to "linux" to produce
a console with a dark background and light characters.
c.ZMQInteractiveShell.colors

1.5.3

Configuring IPython

These notes assume that two imports are made when running code in IPython or as stand-alone Python
programs. These imports are
from __future__ import print_function, division

which imports the future versions of print and / (division). Open ipython_config.py in the directory profile_econometrics and set the values
c.InteractiveShellApp.exec_lines=["from __future__ import print_function, division",
"import os",
"os.chdir(c:\\dir\\to\\start\\in)"]

and
c.InteractiveShellApp.pylab="qt4"

This code does two things. First, it imports two future features (which are standard in Python 3.x+), the
print function and division, which are useful for numerical programming.
In Python 2.7, print is not a standard function and is used like print string to print. Python 3.x
changes this behavior to be a standard function call, print(string to print). I prefer the latter
since it will make the move to 3.x easier, and find it more coherent with other function in Python.
In Python 2.7, division of integers always produces an integer so that the result is truncated (i.e.
9/5=1). In Python 3.x, division of integers does not produce an integer if the integers are not even
multiples (i.e. 9/5=1.8). Additionally, Python 3.x uses the syntax 9//5 to force integer division with
truncation (i.e. 11/5=2.2, while 11//5=2).
Second, pylab will be loaded by default using the qt4 backend.
Changing settings in ipython_qtconsole_config.py is optional, although I recommend using

c.IPythonWidget.font_size=11
c.IPythonWidget.font_family="Bitstream Vera Sans Mono"
c.IPKernelApp.pylab="inline"
c.ZMQInteractiveShell.colors="linux"

These commands assume that the Bitstream Vera fonts have been locally installed, which are available
from http://ftp.gnome.org/pub/GNOME/sources/ttf-bitstream-vera/1.10/.

1.5.4

Launching IPython

OS X and Linux

IPython can be started by running


ipython --profile=econometrics

in the terminal. Starting IPython using the QtConsole is virtually identical.


ipython qtconsole --profile=econometrics

A single line launcher on OS X or Linux can be constructed using


bash -c "ipython qtconsole --profile=econometrics"

This single line launcher can be saved as filename.command where filename is a meaningful name (e.g.
IPython-Terminal) to create a launcher on OS X by entering the command
chmod 755 /FULL/PATH/TO/filename.command

The same command can to create a Desktop launcher on Ubuntu by running


sudo apt-get install --no-install-recommends gnome-panel
gnome-desktop-item-edit ~/Desktop/ --create-new

and then using the command as the Command in the dialog that appears.
9

Figure 1.1: IPython running in the standard Windows console (cmd.exe).

Windows (Anaconda)

To run IPython open cmd and enter


ipython --profile=econometrics

Starting IPython using the QtConsole is similar.


ipython qtconsole --profile=econometrics

Launchers can be created for these shortcuts. Start by creating a launcher to run IPython in the standard
Windows cmd.exe console. Open a text editor enter
cmd "/c cd ANACONDA\Scripts\ && start "" "ipython.exe" --profile=econometrics"

and save the file as ANACONDA\ipython-plain.bat. Finally, right click on ipython-plain.bat select Sent To, Desktop (Create Shortcut). The icon of the shortcut will be generic, and if you want a more meaningful icon,
select the properties of the shortcut, and then Change Icon, and navigate to
c:\Anaconda\Menu\ and select IPython.ico. Opening the batch file should create a window similar to that in
figure 1.1.
Launching the QtConsole is similar. Start by entering the following command in a text editor
cmd "/c cd ANACONDA\Scripts &&

start "" "pythonw" ANACONDA\Scripts\ipython-script.py

qtconsole --profile=econometrics"

and then saving the file as ANACONDA\ipython-qtconsole.bat. Create a shortcut for this batch file, and change
the icon if desired. Opening the batch file should create a window similar to that in figure 1.2 (although
the appearance might differ).

1.5.5

Getting Help

Help is available in IPython sessions using help(function). Some functions (and modules) have very long
help files. When using IPython, these can be paged using the command ?function or function? so that the
10

Figure 1.2: IPython running in a QtConsole session.

11

text can be scrolled using page up and down and q to quit. ??function or function?? can be used to type
the entire function including both the docstring and the code.

1.5.6

Running Python programs

While interactive programing is useful for learning a language or quickly developing some simple code,
complex projects require the use of complete programs. Programs can be run either using the IPython
magic work %run program.py or by directly launching the Python program using the standard interpreter
using python program.py. The advantage of using the IPython environment is that the variables used in
the program can be inspected after the program run has completed. Directly calling Python will run the
program and then terminate, and so it is necessary to output any important results to a file so that they
can be viewed later.3
To test that you can successfully execute a Python program, input the code in the block below into a
text file and save it as firstprogram.py.
# First Python program
from __future__ import print_function, division
import time
print(Welcome to your first Python program.)
raw_input(Press enter to exit the program.)
print(Bye!)
time.sleep(2)

Once you have saved this file, open the console, navigate to the directory you saved the file and enter
python firstprogram.py. Finally, run the program in IPython by first launching IPython, and the using
%cd to change to the location of the program, and finally executing the program using %run firstprogram.py.

1.5.7

Testing the Environment

To make sure that you have successfully installed the required components, run IPython using the shortcut
previously created on windows, or by running ipython --pylab or ipython qtconsole --pylab in a
Unix terminal window. Enter the following commands, one at a time (the meaning of the commands will
be covered later in these notes).
>>> x = randn(100,100)
>>> y = mean(x,0)
>>> plot(y)
>>> import scipy as sp

If everything was successfully installed, you should see something similar to figure 1.3.

1.5.8

IPython Notebook

IPython notebooks are a useful method to share code with others. Notebooks allow for a fluid synthesis
of formatted text, typeset mathematics (using LATEX via MathJax) and Python. The primary method for
using IPython notebooks is through a web interface. The web interface allow creation, deletion, export
3

Programs can also be run in the standard Python interpreter using the command:

exec(compile(open(filename.py).read(),filename.py,exec))

12

Figure 1.3: A successful test that matplotlib, IPython, NumPy and SciPy were all correctly installed.

13

and interactive editing of notebooks. Before running IPython Notebook for the first time, it is useful to
open IPython and run the following two commands.

>>> from IPython.external.mathjax import install_mathjax


>>> install_mathjax()

These commands download a local copy of MathJax, a Javascript library for typesetting LATEX math on web
pages.
To launch the IPython notebook server on Anaconda/Windows, open a text editor, enter

cmd "/c cd ANACONDA\Scripts && start "" "ipython.exe" notebook --matplotlib=inline


--notebook-dir=uc:\\PATH\\TO\\NOTEBOOKS\\"

and save the file as ipython-notebook.bat.


If using Linux or OS X, run

ipython notebook --matplotlib=inline --notebook-dir=/PATH/TO/NOTEBOOKS/

The command uses two optional argument. --matplotlib=inline launches IPython with inline figures
so that they show in the browser, and is highly recommended. --notebook-dir=/PATH/TO/NOTEBOOKS/
allows the default path for storing the notebooks to be set. This can be set to any location, and if not
set, a default value is used. Note that both of these options can be set in ipython_notebook_config.py in
profile_econometrics using

c.IPKernelApp.matplotlib = inline
c.FileNotebookManager.notebook_dir = /PATH/TO/NOTEBOOKS/

and then the notebook should be started using only --profile=econometrics.


These commands will start the server and open the default browser which should be a modern version
of Chrome (preferable) Chromium or Firefox. If the default browser is Safari, Internet Explorer or Opera,
the URL can be copied into the Chrome address bar. The first screen that appears will look similar to figure
1.4, except that the list of notebooks will be empty. Clicking on New Notebook will create a new notebook,
which, after a bit of typing, can be transformed to resemble figure 1.5. Notebooks can be imported by
dragging and dropping and exported from the menu inside a notebook.

1.5.9

Integrated Development Environments

As you progress in Python and begin writing more sophisticated programs, you will find that using an Integrated Development Environment (IDE) will increase your productivity. Most contain productivity enhancements such as built-in consoles, code completion (or intellisense, for completing function names)
and integrated debugging. Discussion of IDEs is beyond the scope of these notes, although Spyder is a
reasonable choice (free, cross-platform). Aptana Studio is another free alternative. My preferred IDE is
PyCharm, which has a community edition that is free for use (the professional edition is low cost for academics).
14

Figure 1.4: The default IPython Notebook screen showing two notebooks.

Figure 1.5: An IPython notebook showing formatted markdown, LATEX math and cells containing code.

15

Figure 1.6: The default Spyder IDE on Windows.

Spyder

Spyder is an IDE specialized for use in scientific application rather than for general purpose Python application development. This is both an advantage and a disadvantage when compared to more full featured
IDEs such as PyCharm, PyDev or Aptana Studio. The main advantage is that many powerful but complex
features are not integrated into Spyder, and so the learning curve is much shallower. The disadvantage is
similar - in more complex projects, or if developing something that is not straight scientific Python, Spyder is less capable. However, netting these two, Spyder is almost certainly the IDE to use when starting
Python, and it is always relatively simple to migrate to a sophisticated IDE if needed.
Spyder is started by entering spyder in the terminal or command prompt. A window similar to that
in figure 1.6 should appear. The main components are the the editor (1), the object inspector (2), which
dynamically will show help for functions that are used in the editor, and the console (3). By default Spyder
opens a standard Python console, although it also supports using the more powerful IPython console. The
object inspector window, by default, is grouped with a variable explorer, which shows the variables that
are in memory and the file explorer, which can be used to navigate the file system. The console is grouped
with an IPython console window (needs to be activated first using the Interpreters menu along the top
edge), and the history log which contains a list of commands executed. The buttons along the top edge
facilitate saving code, running code and debugging.
16

1.6

Exercises

1. Install Python.
2. Test the installation using the code in section 1.5.7.
3. Configure IPython using the start-up script in section 1.5.3.
4. Customize IPython QtConsole using a font or color scheme. More customizations can be found by
running ipython -h.
5. Explore tab completion in IPython by entering a<TAB> to see the list of functions which start with
a and are loaded by pylab. Next try i<TAB>, which will produce a list longer than the screen press
ESC to exit the pager.
6. Launch IPython Notebook and run code in the testing section.
7. Open Spyder and explore its features.

1.A

Frequently Encountered Problems

All
Whitespace sensitivity

Python is whitespace sensitive and so indentation, either spaces or tabs, affects how Python interprets
files. The configuration files, e.g. ipython_config.py, are plain Python files and so are sensitive to whitespace.
Introducing white space before the start of a configuration option will produce an error, so ensure there
is no whitespace before configuration lines such as c.InteractiveShellApp.exec_lines.

Windows
Spaces in path

Python does not generally work when directories have spaces.


Unicode in path

Python 2.7 does not work well when a path contains unicode characters, such as in a user name. While this
isnt an issue for installing Python or Anaconda, it is an issue for IPython which looks in c:\user\username\.ipython
for configuration files. The solution is to define the HOME variable before launching IPython to a path that
has only ASCII characters.
mkdir c:\anaconda\ipython_config
set HOME=c:\anaconda\ipython_config
c:\Anaconda\Scripts\activate econometrics
ipython profile create econometrics
ipython --profile=econometrics

17

The set HOME=c:\anaconda\ipython_config can point to any path with directories containing only ASCII
characters, and can also be added to any batch file to achieve the same effect.

OS X
Installing Anaconda to the root of the partition

If the user account used is running as root, then Anaconda may install to /anaconda and not ~/anaconda by
default. Best practice is not to run as root, although in principle this is not a problem, and /anaconda can
be used in place of ~/anaconda in any of the instructions.
Unable to create profile for IPython

Non-ASCII characters can create issues for IPython since it look in $HOME/.ipython which is normally
/Users/username/.ipython. If username has non-ASCII characters, this can create difficulties. The solution is

to define an environment variable to a path that only contains ASCII characters.


mkdir /tmp/ipython_config
export IPYTHONDIR=/tmp/ipython_config
source ~/anacound/bin/activate econometrics
ipython profile create econometrics
ipython --profile=econometrics

These commands should create a profile directory in /tmp/ipython_config (which can be any directory with
only ASCII characters in the path). These changes can be made permanent by editing ~/.bash_profile and
adding the line
export IPYTHONDIR=/tmp/ipython_config

in which case no further modifications are needed to the commands previously discussed. Note that
~/.bash_profile is hidden and may not exist, so nano ~/.bash_profile can be used to create and edit this
file.

1.B

register_python.py

A complete listing of register_python.py is included in this appendix.


# -*- encoding: utf-8 -*#
# Script to register Python 2.0 or later for use with win32all
# and other extensions that require Python registry settings
#
# Adapted by Ned Batchelder from a script
# written by Joakim Law for Secret Labs AB/PythonWare
#
# source:
# http://www.pythonware.com/products/works/articles/regpy20.htm

18

import sys
from _winreg import *
# tweak as necessary
version = sys.version[:3]
installpath = sys.prefix
regpath = "SOFTWARE\\Python\\Pythoncore\\%s\\" % (version)
installkey = "InstallPath"
pythonkey = "PythonPath"
pythonpath = "%s;%s\\Lib\\;%s\\DLLs\\" % (
installpath, installpath, installpath
)
def RegisterPy():
try:
reg = OpenKey(HKEY_LOCAL_MACHINE, regpath)
except EnvironmentError:
try:
reg = CreateKey(HKEY_LOCAL_MACHINE, regpath)
except Exception, e:
print "*** Unable to register: %s" % e
return
SetValue(reg, installkey, REG_SZ, installpath)
SetValue(reg, pythonkey, REG_SZ, pythonpath)
CloseKey(reg)
print "--- Python %s at %s is now registered!" % (version, installpath)
if __name__ == "__main__":
RegisterPy()

1.C

Advanced Setup

The simplest method to install the Python scientific stack is to use directly Continuum Analytics Anaconda. These instructions describe alternative installation options using virtual environments, which are
considered best practices when using Python.

1.C.1

Using Virtual Environments with Anaconda

Windows

Installation on Windows requires downloading the installer and running. These instructions use ANACONDA to indicate the Anaconda installation directory (e.g. the default is C:\Anaconda). Once the setup
has completed, open a command prompt (cmd.exe) and run
cd ANACONDA
conda update conda
conda update anaconda

19

conda create -n econometrics ipython-qtconsole ipython-notebook scikit-learn matplotlib


numpy pandas scipy spyder statsmodels
conda install -n econometrics cython distribute lxml nose numba numexpr openpyxl pep8 pip
psutil pyflakes pytables pywin32 rope sphinx xlrd xlwt
conda install -n econometrics mkl

which will first ensure that Anaconda is up-to-date and then create a virtual environment named econometrics. The virtual environment provides a set of components which will not change even if Anaconda
is updated. Using a virtual environment is a best practice and is important since component updates can
lead to errors in otherwise working programs due to backward incompatible changes in a module. The
long list of modules in the conda create command includes the core modules. The first conda install
contains the remaining packages, and is shown as an example of how to add packages to a virtual environment after it has been created. The second conda install installs the Intel Math Kernel library linkedmodules which provide large performance gains in Intel systems this package requires a license from
Continuum which is is free to academic users (and low cost otherwise). I recommend acquiring a license
as the performance gains are substantial, even on dual core machines. If you will not be purchasing a
license, this line should be omitted. It is also possible to install all available packages using the command
conda create -n econometrics anaconda.
The econometrics environment must be activated before use. This is accomplished by running
ANACONDA\Scripts\activate.bat econometrics

from the command prompt, which prepends [econometrics] to the prompt as an indication that virtual
environment is active. Activate the econometrics environment and then run
pip install pylint html5lib seaborn

which installs one package not directly available in Anaconda.


Linux and OS X

Installation on Linux requires executing


bash Anaconda-x.y.z-Linux-ISA.sh

where x.y.z will depend on the version being installed and ISA will be either x86 or more likely x86_64.
The OS X installer is available either in a GUI installed (pkg format) or as a bash installer which is installed
in an identical manner to the Linux installation. After installation completes, change to the folder where
Anaconda installed (written here as ANACONDA, default ~/anaconda) and execute
cd ANACONDA
cd bin
./conda update conda
./conda update anaconda
./conda create -n econometrics ipython-qtconsole ipython-notebook matplotlib numpy pandas
scikit-learn scipy spyder statsmodels
./conda install -n econometrics cython distribute lxml nose numba numexpr openpyxl pep8 pip
psutil pyflakes pytables rope sphinx xlrd xlwt
./conda install -n econometrics mkl

which will first ensure that Anaconda is up-to-date and then create a virtual environment named econometrics with the required packages. conda create creates the environment and conda install installs
20

additional packages to the existing environment. The second invocation of conda install is used to install the Intel Math Kernel library-linked modules, which provide substantial performance improvements
this package requires a license which is free to academic users and low cost to others. If acquiring a
license is not possible, omit this line. conda install can be used later to install other packages that may
be of interest. To activate the newly created environment, run
source ANACONDA/bin/activate econometrics

and then run the command


pip install pylint html5lib seaborn

to install one package not included in Anaconda.

1.C.2

Installation without Anaconda

Anaconda greatly simplifies installing the scientific Python stack. However, there may be situations where
installing Anaconda is not possible, and so (substantially more complicated) instructions are included for
both Windows and Linux.
Windows

The list of required windows binary packages, along with the version and Windows installation file, required for these notes include:
Package

Version

File name

Python
Setuptools
Pip
Virtualenv
pywin32
Jinja2
Tornado
PyCairo
PyZMQ
PyQt
NumPy
SciPy
MatplotLib
pandas
IPython
scikit-learn
statsmodels
PyTables
lxml
psutil

2.7.5
2.2.0
1.5.4
1.11.1
218.5
2.7.2
3.2.0
1.10.0
14.0.1
4.9.6-1
1.8.0
0.13.3
1.3.1
0.13.0
1.2.0
0.14.1
0.5.0
3.1.0
3.3.1
1.2.1

python-2.7.5.amd64
setuptools-2.2.win-amd64-py2.7
pip-1.5.4.win-amd64-py2.7
virtualenv-1.11.4.win-amd64-py2.7
pywin32-218.5.win-amd64-py2.7
Jinja2-2.7.2.win-amd64-py2.7.exe
tornado-3.2.0.win-amd64-py2.7.exe
pycairo-1.10.0.win-amd64-py2.7
pyzmq-14.0.1.win-amd64-py2.7
PyQt-Py2.7-x64-gpl-4.9.6-1
numpy-MKL-1.8.0.win-amd64-py2.7
scipy-0.13.3.win-amd64-py2.7
matplotlib-1.3.1.win-amd64-py2.7
pandas-0.13.1.win-amd64-py2.7
ipython-1.2.0.win-amd64-py2.7
scikit-learn-0.14.1.win-amd64-py2.7
statsmodels-0.5.0.win-amd64-py2.7
tables-3.1.0.win-amd64-py2.7
lxml-3.3.1.win-amd64-py2.7
psutil-1.2.1.win-amd64-py2.7

21

These remaining packages are optional and are only discussed in the final chapters related to performance.
Package
Performance
Cython
Cython
Numba
LLVMPy
LLVMMath
Numba
pandas (Optional)
Bottleneck
NumExpr

Version

File name

0.20.1

Cython-0.20.1.win-amd64-py2.7

0.12.3
0.1.2
0.12.1

llvmpy-0.12.3.win-amd64-py2.7
llvmmath-0.1.2.win-amd64-py2.7
numba-0.12.1.win-amd64-py2.7

0.8.0
2.3.1

Bottleneck-0.8.0.win-amd64-py2.7
numexpr-2.3.1.win-amd64-py2.7

Begin by installing Python, setuptools, pip and virtualenv. After these four packages are installed, open
an elevated command prompt (cmd.exe with administrator privileges) and initialized the virtual environment using the command:
cd C:\Dropbox
virtualenv econometrics

I prefer to use my Dropbox as the location for virtual environments and have named the virtual environment econometrics. The virtual environment can be located anywhere (although best practice is to
use a path without spaces) and can have a different name. Throughout the remainder of this section, VIRTUALENV will refer to the complete directory containing the virtual environment (e.g. C:\Dropbox\econometrics).
Once the virtual environment setup is complete, run
cd VIRTUALENV\Scripts
activate.bat
pip install beautifulsoup4 html5lib meta nose openpyxl patsy pep8 pyflakes pygments pylint
pylint pyparsing pyreadline python-dateutil pytz==2013d rope seaborn sphinx spyder
wsgiref xlrd xlwt

which activates the virtual environment and installs some additional required packages. Finally, before
installing the remaining packages, it is necessary to register the virtual environment as the default Python
environment by running the script register_python.py4 , which is available on the website. Once the correct
version of Python is registered, install the remaining packages in order, including any optional packages.
Finally, run one final command in the prompt.

xcopy c:\Python27\tcl VIRTUALENV\tcl /S /E /I

This file registers the virtual environment as the default python in Windows. To restore the main Python installation (normally C:\Python27) run register_python.py with the main Python interpreter (normally C:\Python27\python.exe) in an elevated
command prompt.

22

Linux (Ubuntu 12.04 LTS)

To install on Ubuntu 12.04 LTS, begin by updating the system using


sudo apt-get update
sudo apt-get upgrade

Next, install the system packages required using


sudo apt-get install python-pip libzmq-dev python-all-dev build-essential gfortran libatlasbase-dev libatlas-dev libatlas3-base pyqt4-dev-tools libfreetype6-dev libpng12-dev
python-qt4 python-qt4-dev python-cairo python-cairo-dev hdf5-tools libhdf5-serial-dev
texlive-full dvipng pandoc

Finally, install virtualenv using


sudo pip install virtualenv

The next step is to initialize the virtual environment, which is assumed to be in your home directory
and named econometrics.
cd ~
virtualenv econometrics

The virtual environment can be activated using


source ~/econometrics/bin/activate

Once the virtual environment has been initialized, the remaining packages can be installed using the
commands
mkdir ~/econometrics/lib/python2.7/site-packages/PyQt4/
mkdir ~/econometrics/lib/python2.7/site-packages/cairo/
cp -r /usr/lib/python2.7/dist-packages/PyQt4/* ~/econometrics/lib/python2.7/site-packages/
PyQt4/
cp -r /usr/lib/python2.7/dist-packages/cairo/* ~/econometrics/lib/python2.7/site-packages/
cairo/
cp /usr/lib/python2.7/dist-packages/sip* ~/econometrics/lib/python2.7/site-packages/
pip install Cython
pip install numpy
pip install scipy
pip install matplotlib
pip install ipython[/*all*/]
pip install scikit-learn
pip install beautifulsoup4 html5lib lxml openpyxl pytz==2013d xlrd xlwt
pip install patsy bottleneck numexpr
pip install tables
pip install pandas
pip install statsmodels
pip install distribute meta rope pep8 pexpect pylint pyflakes psutil seaborn sphinx spyder

The three cp lines copy files from the default Python installation which are more difficult to build using
pip. Next, if interested in Numba, a package which can be used to enhance the performance of Python,
enter the following commands. Note: The correct version of llvm might change as llvmpy and numba
progress.
23

wget http://llvm.org/releases/3.2/llvm-3.2.src.tar.gz
tar -zxf llvm-3.2.src.tar.gz
cd llvm-3.2.src
./configure --enable-optimizations --prefix=/home/username/llvm
REQUIRES_RTTI=1 make
make install
cd ..
LLVM_CONFIG_PATH=/home/username/llvm/bin/llvm-config pip install llvmpy
pip install llvmmath
pip install numba

1.C.3

Launching IPython

OS X and Linux

Starting IPython requires activating the virtual environment and then starting IPython with the correct
profile.
source ANACONDA/bin/activate econometrics
ipython --profile=econometrics

Starting IPython using the QtConsole is virtually identical.


source ANACONDA/bin/activate econometrics
ipython qtconsole --profile=econometrics

A single line launcher on OS X or Linux can be constructed using


bash -c "source ANACONDA/bin/activate econometrics && ipython qtconsole --profile=
econometrics"

This single line launcher can be saved as filename.command where filename is a meaningful name (e.g.
IPython-Terminal) to create a launcher on OS X by entering the command
chmod 755 /FULL/PATH/TO/filename.command

The same command can to create a Desktop launcher on Ubuntu by running


sudo apt-get install --no-install-recommends gnome-panel
gnome-desktop-item-edit ~/Desktop/ --create-new

and then using the command as the Command in the dialog that appears.
Note that if Python was directly installed, launching IPython is identical only replacing the Anaconda
virtual environment activation line with the activation line for the directly created virtual environment,
as in
source VIRTUALENV/bin/activate econometrics
ipython qtconsole --profile=econometrics

Windows (Anaconda)

Starting IPython requires activating the virtual environment and the starting IPython with the correct profile using cmd.
24

ANACONDA/Scripts/activate.bat econometrics
ipython --profile=econometrics

Starting using the QtConsole is similar.


ANACONDA/Scripts/activate.bat econometrics
ipython qtconsole --profile=econometrics

Launchers can be created for the both the virtual environment and the IPython interactive Python
console. First, open a text editor, enter
cmd /k "ANACONDA\Scripts\activate econometrics"

and save the file as ANACONDA\envs\econometrics\python-econometrics.bat. The batch file will open a command prompt in the econometrics virtual environment. Right click on the batch file and select Send To,
Desktop (Create Shortcut) which will place a shortcut on the desktop. Next, create a launcher to run
IPython in the standard Windows cmd.exe console. Open a text editor enter
cmd "/c ANACONDA\Scripts\activate econometrics && start "" "ipython.exe" --profile=
econometrics"

and save the file as ANACONDA\envs\econometrics\ipython-plain.bat. Finally, right click on ipython-plain.bat


select Sent To, Desktop (Create Shortcut). The icon of the shortcut will be generic, and if you want a more
meaningful icon, select the properties of the shortcut, and then Change Icon, and navigate to
c:\Anaconda\envs\econometrics\Menu\ and select IPython.ico. Opening the batch file should create a window
similar to that in figure 1.1.
Launching the QtConsole is similar. Start by entering the following command in a text editor
cmd "/c ANACONDA\Scripts\activate econometrics &&

start "" "pythonw" ANACONDA\envs\

econometrics\Scripts\ipython-script.py qtconsole --profile=econometrics"

and then saving the file as ANACONDA\envs\econometrics\ipython-qtconsole.bat. Create a shortcut for this
batch file, and change the icon if desired.
Windows (Direct)

If using the direct installation method on Windows, open a text editor, enter the following text
cmd "/c VIRTUALENV\Scripts\activate.bat && start "" "python" VIRTUALENV\Scripts\
ipython-script.py --profile=econometrics"

and save the file in VIRTUALENV as ipython.bat. Right-click on ipython.bat and Send To, Desktop (Create
Shortcut). The icon of the shortcut will be generic, and if you want a nice icon, select the properties of the
shortcut, and then Change Icon, and navigate to VIRTUALENV\Scripts\ and select IPython.ico.
The QtConsole can be configured to run by entering
cmd "/c VIRTUALENV\Scripts\activate.bat && start "" "pythonw" VIRTUALENV\Scripts\
ipython-script.py qtconsole --profile=econometrics"

saving the file as VIRTUALENV\ipython-qtconsole.bat and finally right-click and Sent To, Desktop (Create
Shortcut). The icon can be changed using the same technique as the basic IPython shell.

25

26

Chapter 2

Python 2.7 vs. 3 (and the rest)


Python comes in a number of flavors which may be suitable for econometrics, statistics and numerical
analysis. This chapter explains why 2.7 was chosen for these notes and highlights some of the available
alternatives.

2.1

Python 2.7 vs. 3

Python 2.7 is the final version of the Python 2.x line all future development work will focus on Python 3.
It may seem strange to learn an old language. The reasons for using 2.7 are:
There are more modules available for Python 2.7. While all of the core python modules are available
for both Python 2.7 and 3, some of the more esoteric modules are either only available for 2.7 or
have not been extensively tested in Python 3. Over time, many of these modules will be available for
Python 3, but they arent ready today.
The language changes relevant for numerical computing are very small and these notes explicitly
minimize these so that there should few changes needed to run against Python 3+ in the future
(ideally none).
Configuring and installing 2.7 is easier.
Anaconda defaults to 2.7 and the selection of packages available for Python 3 is limited.
Learning Python 3 has some advantages:
No need to update in the future.
Some improved out-of-box behavior for numerical applications.

2.2

Intel Math Kernel Library and AMD Core Math Library

Intels MKL and AMDs CML provide optimized linear algebra routines. The functions in these libraries
execute faster than basic those in linear algebra libraries and are, by default, multithreaded so that a many
linear algebra operations will automatically make use all of the processors on your system. Most standard
builds of NumPy do not include these, and so it is important to use a Python distribution built with an
27

appropriate linear algebra library (especially if computing inverses or eigenvalues of large matrices). The
three primary methods to access NumPy built with the Intel MKL are:
Use Anaconda on any platform and secure a license for MKL (free for academic use, otherwise $29
at the time of writing).
Use the pre-built NumPy binaries made available by Christoph Gohlke for Windows.
Follow instructions for building NumPy on Linux with MKL, which is free on Linux.
There are no pre-built libraries using AMDs CML, and so it is necessary to build NumPy from scratch if
using an AMD processor (or buy an Intel system, which is an easier solution).

2.3

Other Variants

Some other variants of the recommended version of Python are worth mentioning.

2.3.1

Enthought Canopy

Enthought Canopy is an alternative to Anaconda. It is available for Windows, Linux and OS X. Canopy
is regularly updated and is currently freely available in its basic version. The full version is also freely
available to academic users. Canopy is built using MKL, and so matrix algebra performance is very fast.

2.3.2

IronPython

IronPython is a variant which runs on the Common Language Runtime (CLR , aka Windows .NET). The
core modules NumPy and SciPy are available for IronPython, and so it is a viable alternative for numerical computing, especially if already familiar with the C# or interoperation with .NET components
is important. Other libraries, for example, matplotlib (plotting) are not available, and so there are some
important limitations.

2.3.3

Jython

Jython is a variant which runs on the Java Runtime Environment (JRE). NumPy is not available in Jython
which severely limits Jythons usefulness for numeric work. While the limitation is important, one advantage of Python over other languages is that it is possible to run (mostly unaltered) Python code on a JVM
and to call other Java libraries.

2.3.4

PyPy

PyPy is a new implementation of Python which uses Just-in-time compilation to accelerate code, especially loops (which are common in numerical computing). It may be anywhere between 2 - 500 times
faster than standard Python. Unfortunately, at the time of writing, the core library, NumPy is only partially implemented, and so it is not ready for use. Current plans are to have a version ready in the near
future, and if so, PyPy may quickly become the preferred version of Python for numerical computing.
28

2.A

Relevant Differences between Python 2.7 and 3

Most differences between Python 2.7 and 3 are not important for using Python in econometrics, statistics
and numerical analysis. I will make three common assumptions which will allow 2.7 and 3 to be used
interchangeable. The configuration instructions in the previous chapter for IPython will produce the expected behavior when run interactively. Note that these differences are important in stand-alone Python
programs.

2.A.1 print
print is a function used to display test in the console when running programs. In Python 2.7, print is a

keyword which behaves differently from other functions. In Python 3, print behaves like most functions.
The standard use in Python 2.7 is
print String to Print

while in Python 3 the standard use is


print(String to Print)

which resembles calling a standard function. Python 2.7 contains a version of the Python 3 print, which
can be used in any program by including
from __future__ import print_function

at the top of the file. I prefer the Python 3 version of print, and so I assume that all programs will include
this statement.

2.A.2 division
Python 3 changes the way integers are divided. In Python 2.7, the ratio of two integers was always an
integer, and so results are truncated towards 0 if the result was fractional. For example, in Python 2.7, 9/5
is 1. Python 3 gracefully converts the result to a floating point number, and so in Python 3, 9/5 is 1.8. When
working with numerical data, automatically converting ratios avoids some rare errors. Python 2.7 can use
the Python 3 behavior by including
from __future__ import division

at the top of the program. I assume that all programs will include this statement.

2.A.3 range and xrange


It is often useful to generate a sequence of number for use when iterating over the some data. In Python
2.7, the best practice is to use the keyword xrange to do this, while in Python 3, this keyword has been
renamed range. I will always use xrange and so it is necessary to replace xrange with range if using Python
3.

2.A.4

Unicode strings

Unicode is an industry standard for consistently encoding text. The computer alphabet was originally limited to 128 characters which is insufficient to contain the vast array of characters in all written languages.
29

Unicode expands the possible space to be up to 231 characters (depending on encoding). Python 3 treats
all strings as unicode unlike Python 2.7 where characters are a single byte, and unicode strings require the
special syntax uunicode string or unicode(unicode string). In practice this is unlikely to impact
most numeric code written in Python except possibly when reading or writing data. If working in a language where characters outside of the standard but limited 128 character set are commonly encountered,
it may be useful to use
from __future__ import unicode_literals

to will help with future compatibility when moving to Python 3.

30

Chapter 3

Built-in Data Types


Before diving into Python for analyzing data or running Monte Carlos, it is necessary to understand some
basic concepts about the core Python data types. Unlike domain-specific languages such as MATLAB or
R, where the default data type has been chosen for numerical work, Python is a general purpose programming language which is also well suited to data analysis, econometrics and statistics. For example,
the basic numeric type in MATLAB is an array (using double precision, which is useful for floating point
mathematics), while the basic numeric data type in Python is a 1-dimensional scalar which may be either
an integer or a double-precision floating point, depending on the formatting of the number when input.

3.1

Variable Names

Variable names can take many forms, although they can only contain numbers, letters (both upper and
lower), and underscores (_). They must begin with a letter or an underscore and are CaSe SeNsItIve.
Additionally, some words are reserved in Python and so cannot be used for variable names (e.g. import or
for). For example,
x = 1.0
X = 1.0
X1 = 1.0
X1 = 1.0
x1 = 1.0
dell = 1.0
dellreturns = 1.0
dellReturns = 1.0
_x = 1.0
x_ = 1.0

are all legal and distinct variable names. Note that names which begin or end with an underscore, while
legal, are not normally used since by convention these convey special meaning.1 Illegal names do not
follow these rules.
1

Variable names with a single leading underscores, for example _some_internal_value, indicate that the variable is for
internal use by a module or class. While indicated to be private, this variable will generally be accessible by calling code. Double leading underscores, for example __some_private_value indicate that a value is actually private and is not accessible.
Variable names with trailing underscores are used to avoid conflicts with reserved Python words such as class_ or lambda_.
Double leading and trailing underscores are reserved for magic variable (e.g. __init__) , and so should be avoided except
when specifically accessing a feature.

31

# Not allowed
x: = 1.0
1X = 1
X-1 = 1
for = 1

Multiple variables can be assigned on the same line using commas,


x, y, z = 1, 3.1415, a

3.2
3.2.1

Core Native Data Types


Numeric

Simple numbers in Python can be either integers, floats or complex. Integers correspond to either 32
bit or 64-bit integers, depending on whether the python interpreter was compiled for a 32-bit or 64-bit
operating system, and floats are always 64-bit (corresponding to doubles in C/C++). Long integers, on the
other hand, do not have a fixed size and so can accommodate numbers which are larger than maximum
the basic integer type can handle. This chapter does not cover all Python data types, and instead focuses
on those which are most relevant for numerical analysis, econometrics and statistics. The byte, bytearray
and memoryview data types are not described.
3.2.1.1

Floating Point (float)

The most important (scalar) data type for numerical analysis is the float. Unfortunately, not all noncomplex numeric data types are floats. To input a floating data type, it is necessary to include a . (period,
dot) in the expression. This example uses the function type() to determine the data type of a variable.
>>> x = 1
>>> type(x)
int
>>> x = 1.0
>>> type(x)
float
>>> x = float(1)
>>> type(x)
float

This example shows that using the expression that x = 1 produces an integer-valued variable while x = 1.0
produces a float-valued variable. Using integers can produce unexpected results and so it is important to
include .0 when expecting a float.
3.2.1.2

Complex (complex)

Complex numbers are also important for numerical analysis. Complex numbers are created in Python
using j or the function complex().
32

>>> x = 1.0
>>> type(x)
float
>>> x = 1j
>>> type(x)
complex
>>> x = 2 + 3j
>>> x
(2+3j)
>>> x = complex(1)
>>> x
(1+0j)

Note that a +b j is the same as complex(a ,b ), while complex(a ) is the same as a +0j.
3.2.1.3

Integers (int and long)

Floats use an approximation to represent numbers which may contain a decimal portion. The integer data
type stores numbers using an exact representation, so that no approximation is needed. The cost of the
exact representation is that the integer data type cannot express anything that isnt an integer, rendering
integers of limited use in most numerical work.
Basic integers can be entered either by excluding the decimal (see float), or explicitly using the int()
function. The int() function can also be used to convert a float to an integer by round towards 0.
>>> x = 1
>>> type(x)
int
>>> x = 1.0
>>> type(x)
float
>>> x = int(x)
>>> type(x)
int

Integers can range from 231 to 231 1. Python contains another type of integer, known as a long
integer, which has no effective range limitation. Long integers are entered using the syntax x = 1L or by
calling long(). Additionally python will automatically convert integers outside of the standard integer
range to long integers.
>>> x = 1
>>> x
1
>>> type(x)
int

33

>>> x = 1L
>>> x
1L
>>> type(x)
long
>>> x = long(2)
>>> type(x)
long
>>> y = 2
>>> type(y)
int
>>> x = y ** 64
>>> x

# ** is denotes exponentiation, y^64 in TeX

18446744073709551616L

The trailing L after the number indicates that it is a long integer, rather than a standard integer.

3.2.2

Boolean (bool)

The Boolean data type is used to represent true and false, using the reserved keywords True and False.
Boolean variables are important for program flow control (see Chapter 13) and are typically created as a
result of logical operations (see Chapter 11), although they can be entered directly.
>>> x = True
>>> type(x)
bool
>>> x = bool(1)
>>> x
True
>>> x = bool(0)
>>> x
False

Non-zero, non-empty values generally evaluate to true when evaluated by bool(). Zero or empty values
such as bool(0), bool(0.0), bool(0.0j), bool(None), bool() and bool([]) are all false.

3.2.3

Strings (str)

Strings are not usually important for numerical analysis, although they are frequently encountered when
dealing with data files, especially when importing or when formatting output for human consumption.
Strings are delimited using or "" but not using combination of the two delimiters (i.e. do not try ") in
a single string, except when used to express a quotation.
>>> x = abc
>>> type(x)

34

str
>>> y = "A quotation!"
>>> print(y)
"A quotation!"

String manipulation is further discussed in Chapter 21.


3.2.3.1

Slicing Strings

Substrings within a string can be accessed using slicing. Slicing uses [] to contain the indices of the characters in a string, where the first index is 0, and the last is n 1 (assuming the string has n letters). The
following table describes the types of slices which are available. The most useful are s[i ], which will return the character in position i , s[:i ], which return the leading characters from positions 0 to i 1, and
s[i :] which returns the trailing characters from positions i to n 1. The table below provides a list of the
types of slices which can be used. The second column shows that slicing can use negative indices which
essentially index the string backward.

Slice

Behavior

s[:]

Entire string
Charactersi
Charactersi , . . . , n 1
Characters0, . . . , i 1
Charactersi , . . . , j 1
j i 1
Charactersi ,i + m,. . .i + m b m c

s[i ]
s[i :]
s[:i ]
s[i : j ]
s[i : j :m ]

Slice
s[i ]
s[i :]
s[:i ]
s[ j :i ]
s[ j :i :m ]

>>> text = Python strings are sliceable.


>>> text[0]
P
>>> text[10]
i
>>> L = len(text)
>>> text[L] # Error
IndexError: string index out of range
>>> text[L-1]
.
>>> text[:10]
Python str
>>> text[10:]
ings are sliceable.

35

Behavior

Characters n i
Charactersn i , . . . , n 1
Characters0, . . . , n i 1
Characters n j , . . . , n i 1, j < i
j i
Characters n j ,n j + m,. . .,n j + m b m

3.2.4

Lists (list)

Lists are a built-in data type which require other data types to be useful. A list is a collection of other objects
floats, integers, complex numbers, strings or even other lists. Lists are essential to Python programming
and are used to store collections of other values. For example, a list of floats can be used to express a vector
(although the NumPy data types array and matrix are better suited). Lists also support slicing to retrieve
one or more elements. Basic lists are constructed using square braces, [], and values are separated using
commas.
>>> x = []
>>> type(x)
builtins.list
>>> x=[1,2,3,4]
>>> x
[1,2,3,4]
# 2-dimensional list (list of lists)
>>> x = [[1,2,3,4], [5,6,7,8]]
>>> x
[[1, 2, 3, 4], [5, 6, 7, 8]]
# Jagged list, not rectangular
>>> x = [[1,2,3,4] , [5,6,7]]
>>> x
[[1, 2, 3, 4], [5, 6, 7]]
# Mixed data types
>>> x = [1,1.0,1+0j,one,None,True]
>>> x
[1, 1.0, (1+0j), one, None, True]

These examples show that lists can be regular, nested and can contain any mix of data types including
other lists.

3.2.4.1

Slicing Lists

Lists, like strings, can be sliced. Slicing is similar, although lists can be sliced in more ways than strings.
The difference arises since lists can be multi-dimensional while strings are always 1 n. Basic list slicing
is identical to slicing strings, and operations such as x[:], x[1:], x[:1] and x[-3:] can all be used. To
understand slicing, assume x is a 1-dimensional list with n elements and i 0, j > 0, i < j ,m 1.
Python uses 0-based indices, and so the n elements of x can be thought of as x0 , x1 , . . . , xn 1 .
36

Slice

Behavior,

x[:]

Return all x
Return xi
Return xi , . . . xn 1
Return x0 , . . . , xi 1
Return xi , xi +1 , . . . x j 1
Returns xi ,xi +m ,. . .xi +m b j i 1 c

x[i ]
x[i :]
x[:i ]
x[i : j ]
x[i : j :m ]

Slice

Behavior

x[i ]

Return xi
Returns xn i except when i = 0
Return xn i , . . . , xn 1
Return x0 , . . . , xn i
Return xn j , . . . , xn i
Returns xn j ,xn j +m ,. . .,xn j +m b j i 1 c

x[i ]
x[i :]
x[:i ]
x[ j :i ]
x[ j :i :m ]

The default list slice uses a unit stride (step size of one) . It is possible to use other strides using a third
input in the slice so that the slice takes the form x[i:j:m] where i is the index to start, j is the index to end
(exclusive) and m is the stride length. For example x[::2] will select every second element of a list and is
equivalent to x[0:n:2] where n = len(x). The stride can also be negative which can be used to select the
elements of a list in reverse order. For example, x[::-1] will reverse a list and is equivalent to x[0:n:-1] .
Examples of accessing elements of 1-dimensional lists are presented below.
>>> x = [0,1,2,3,4,5,6,7,8,9]
>>> x[0]
0
>>> x[5]
5
>>> x[10] # Error
IndexError: list index out of range
>>> x[4:]
[4, 5, 6, 7, 8, 9]
>>> x[:4]
[0, 1, 2, 3]
>>> x[1:4]
[1, 2, 3]
>>> x[-0]
0
>>> x[-1]
9
>>> x[-10:-1]
[0, 1, 2, 3, 4, 5, 6, 7, 8]

List can be multidimensional, and slicing can be done directly in higher dimensions. For simplicity, consider slicing a 2-dimensional list x = [[1,2,3,4], [5,6,7,8]]. If single indexing is used, x[0] will return
the first (inner) list, and x[1] will return the second (inner) list. Since the list returned by x[0] is sliceable,
the inner list can be directly sliced using x[0][0] or x[0][1:4].
>>> x = [[1,2,3,4], [5,6,7,8]]
>>> x[0]
[1, 2, 3, 4]
>>> x[1]
[5, 6, 7, 8]
>>> x[0][0]
1

37

>>> x[0][1:4]
[2, 3, 4]
>>> x[1][-4:-1]
[5, 6, 7]

3.2.4.2

List Functions

A number of functions are available for manipulating lists. The most useful are
Function

Method

Description

list.append(x,value)

x.append(value)

len(x)

list.extend(x,list )

x.extend(list )

Appends value to the end of the list.


Returns the number of elements in the list.
Appends the values in list to the existing list.2
Removes the value in position index and returns the value.
Removes the first occurrence of value from the list.
Counts the number of occurrences of value in the list.
Deletes the elements in slice.

list.pop(x,index)

x.pop(index)

list.remove(x,value)

x.remove(value)

list.count(x,value)

x.count(value)

del x[slice]

>>> x = [0,1,2,3,4,5,6,7,8,9]
>>> x.append(0)
>>> x
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0]
>>> len(x)
11
>>> x.extend([11,12,13])
>>> x
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 11, 12, 13]
>>> x.pop(1)
1
>>> x
[0, 2, 3, 4, 5, 6, 7, 8, 9, 0, 11, 12, 13]
>>> x.remove(0)
>>> x
[2, 3, 4, 5, 6, 7, 8, 9, 0, 11, 12, 13]

Elements can also be deleted from lists using the keyword del in combination with a slice.
>>> x = [0,1,2,3,4,5,6,7,8,9]
>>> del x[0]
>>> x
[1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> x[:3]
[1, 2, 3]

38

>>> del x[:3]


>>> x
[4, 5, 6, 7, 8, 9]
>>> del x[1:3]
>>> x
[4, 7, 8, 9]
>>> del x[:]
>>> x
[]

3.2.5

Tuples (tuple)

A tuple is in many ways like a list tuple contain multiple pieces of data which may contain a mix of data
types. Aside from using a different syntax to construct a tuple, they are close enough to lists to ignore the
difference except that tuples are immutable. Immutability means that the elements the comprise a tuple
cannot be changed. It is not possible to add or remove elements form a tuple. However, if a tuple contains
a mutable data type, for example a tuple that contains a list, the contents mutable data type can change.
Tuples are constructed using parentheses (()), rather than square braces ([]) of lists. Tuples can be
sliced in an identical manner as lists. A list can be converted into a tuple using tuple() (Similarly, a tuple
can be converted to list using list()).
>>> x =(0,1,2,3,4,5,6,7,8,9)
>>> type(x)
tuple
>>> x[0]
0
>>> x[-10:-5]
(0, 1, 2, 3, 4)
>>> x = list(x)
>>> type(x)
list
>>> x = tuple(x)
>>> type(x)
tuple
>>> x= ([1,2],[3,4])
>>> x[0][1] = -10
>>> x # Contents can change, elements cannot
([1, -10], [3, 4])

Note that tuples containing a single element must contain a comma when created, so that x = (2,) is
assign a tuple to x, while x=(2) will assign 2 to x. The latter interprets the parentheses as if they are part of
a mathematical formula rather than being used to construct a tuple. x = tuple([2]) can also be used to
39

create a single element tuple. Lists do not have this issue since square brackets do not have this ambiguity.
>>> x =(2)
>>> type(x)
int
>>> x = (2,)
>>> type(x)
tuple
>>> x = tuple([2])
>>> type(x)
tuple

3.2.5.1

Tuple Functions

Tuples are immutable, and so only have the methods index and count, which behave in an identical manner to their list counterparts.

3.2.6

Xrange (xrange)

A xrange is a useful data type which is most commonly encountered when using a for loop. xrange(a,b,i)
a
e. In other
creates the sequences that follows the pattern a , a + i , a + 2i , . . . , a + (m 1)i where m = d b
i
words, it find all integers x starting with a such a x < b and where two consecutive values are separated by i . xrange can be called with 1 or two parameters xrange(a,b) is the same as xrange(a,b,1)
and xrange(b) is the same as xrange(0,b,1).
>>> x = xrange(10)
>>> type(x)
xrange
>>> print(x)
xrange(0, 10)
>>> list(x)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> x = xrange(3,10)
>>> list(x)
[3, 4, 5, 6, 7, 8, 9]
>>> x = xrange(3,10,3)
>>> list(x)
[3, 6, 9]
>>> y = range(10)
>>> type(y)
list
>>> y

40

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

xrange is not technically a list, which is why the statement print(x) returns xrange(0,10). Explicitly

converting with list produces a list which allows the values to be printed. Technically xrange is an iterator which does not actually require the storage space of a list. This can be seen in the differences between
using y = range(10), which returns a list and y=xrange(10) which returns an xrange object. Best practice is to use xrange instead of range.3

3.2.7

Dictionary (dict)

Dictionaries are encountered far less frequently than then any of the previously described data types in
numerical Python. They are, however, commonly used to pass options into other functions such as optimizers, and so familiarity with dictionaries is important. Dictionaries in Python are composed of keys
(words) and values (definitions). Dictionaries keys must be unique primitive data types (e.g. strings, the
most common key), and values can contain any valid Python data type.4 Values are accessed using keys.
>>> data = {age: 34, children : [1,2], 1: apple}
>>> type(data)
dict
>>> data[age]
34

Values associated with an existing key can be updated by making an assignment to the key in the dictionary.
>>> data[age] = xyz
>>> data[age]
xyz

New key-value pairs can be added by defining a new key and assigning a value to it.
>>> data[name] = abc
>>> data
{1: apple, age: xyz, children: [1, 2], name: abc}

Key-value pairs can be deleted using the reserved keyword del.


>>> del data[age]
>>> data
{1: apple, children: [1, 2], name: abc}

3.2.8

Sets (set, frozenset)

Sets are collections which contain all unique elements of a collection. set and frozenset only differ in
that the latter is immutable (and so has higher performance), and so set is similar to a unique list while
frozenset is similar to a unique tuple . While sets are generally not important in numerical analysis, they
can be very useful when working with messy data for example, finding the set of unique tickers in a long
list of tickers.
3

xrange has been removed in Python 3 and range is always an iterator.


Formally dictionary keys must support the __hash__ function, equality comparison and it must be the case that different
keys have different hashes.
4

41

3.2.8.1

Set Functions

A number of methods are available for manipulating sets. The most useful are
Function

Method

Description

set.add(x,element )

x.add(element )

len(x)

set.difference(x,set )

x.difference(set )

set.intersection(x,set )

x.intersection(set )

Appends element to a set.


Returns the number of elements in the set.
Returns the elements in x which are not in set.
Returns the elements of x which are also in set.
Removes element from the set.
Returns the set containing all elements of x and set.

set.remove(x,element )

x.remove(element )

set.union(x,set )

x.union(set )

The code below demonstrates the use of set. Note that MSFT is repeated in the list used to initialize
the set, but only appears once in the set since all elements must be unique.
>>> x = set([MSFT,GOOG,AAPL,HPQ,MSFT])
>>> x
{AAPL, GOOG, HPQ, MSFT}
>>> x.add(CSCO)
>>> x
{AAPL, CSCO, GOOG, HPQ, MSFT}
>>> y = set([XOM, GOOG])
>>> x.intersection(y)
{GOOG}
>>> x = x.union(y)
>>> x
{AAPL, CSCO, GOOG, HPQ, MSFT, XOM}
>>> x.remove(XOM)
{AAPL, CSCO, GOOG, HPQ, MSFT}

A frozenset supports the same methods except add and remove.

3.3

Python and Memory Management

Python uses a highly optimized memory allocation system which attempts to avoid allocating unnecessary memory. As a result, when one variable is assigned to another (e.g. to y = x), these will actually point
to the same data in the computers memory. To verify this, id() can be used to determine the unique
identification number of a piece of data.5
>>> x = 1
>>> y = x
>>> id(x)
82970264L
>>> id(y)
5

The ID numbers on your system will likely differ from those in the code listing.

42

82970264L
>>> x = 2.0
>>> id(x)
93850568L
>>> id(y)
82970264L

In the above example, the initial assignment of y = x produced two variables with the same ID. However,
once x was changed, its ID changed while the ID of y did not, indicating that the data in each variable was
stored in different locations. This behavior is both safe and efficient, and is common to the basic Python
immutable types: int, long, float, complex, string, tuple, frozenset and xrange.

3.3.1

Example: Lists

Lists are mutable and so assignment does not create a copy , and so changes to either variable affect both.
>>> x = [1, 2, 3]
>>> y = x
>>> y[0] = -10
>>> y
[-10, 2, 3]
>>> x
[-10, 2, 3]

Slicing a list creates a copy of the list and any immutable types in the list but not mutable elements in
the list.
>>> x = [1, 2, 3]
>>> y = x[:]
>>> id(x)
86245960L
>>> id(y)
86240776L

To see that the inner lists are not copied, consider the behavior of changing one element in a nested list.
>>> x=[[0,1],[2,3]]
>>> y = x[:]
>>> y
[[0, 1], [2, 3]]
>>> id(x[0])
117011656L
>>> id(y[0])
117011656L
>>> x[0][0]

43

0.0
>>> id(x[0][0])
30390080L
>>> id(y[0][0])
30390080L
>>> y[0][0] = -10.0
>>> y
[[-10.0, 1], [2, 3]]
>>> x
[[-10.0, 1], [2, 3]]

When lists are nested or contain other mutable objects (which do not copy), slicing copies the outermost
list to a new ID, but the inner lists (or other objects) are still linked. In order to copy nested lists, it is
necessary to explicitly call deepcopy(), which is in the module copy.
>>> import copy as cp
>>> x=[[0,1],[2,3]]
>>> y = cp.deepcopy(x)
>>> y[0][0] = -10.0
>>> y
[[-10.0, 1], [2, 3]]
>>> x
[[0, 1], [2, 3]]

3.4

Exercises

1. Enter the following into Python, assigning each to a unique variable name:
(a) 4
(b) 3.1415
(c) 1.0
(d) 2+4j
(e) Hello
(f) World
2. What is the type of each variable? Use type if you arent sure.
3. Which of the 6 types can be:
(a) Added +
(b) Subtracted (c) Multiplied *
44

(d) Divided /
4. What are the types of the output (when an error is not produced) in the above operations?
5. Input the variable ex = Python is an interesting and useful language for numerical computing!.
Using slicing, extract:
(a) Python
(b) !
(c) computing
(d) in
Note: There are multiple answers for all.
(e) !gnitupmoc laciremun rof egaugnal lufesu dna gnitseretni na si nohtyP (Reversed)
(f) nohtyP
(g) Pto sa neetn n sfllnug o ueia optn!
6. What are the direct 2 methods to construct a tuple that has only a single item? How many ways are
there to construct a list with a single item?
7. Construct a nested list to hold the matrix

"

1 .5
.5 1

so that item [i][j] corresponds to the position in the matrix (Remember that Python uses 0 indexing).
8. Assign the matrix you just created first to x, and then assign y=x. Change y[0][0] to 1.61. What
happens to x?
9. Next assign z=x[:] using a simple slice. Repeat the same exercise using y[0][0] = 1j. What happens to x and z ? What are the ids of x, y and z? What about x[0], y[0] and z[0]?
10. How could you create w from x so that w can be changed without affecting x?
11. Initialize a list containing 4, 3.1415, 1.0, 2+4j, Hello, World. How could you:
(a) Delete 1.0 if you knew its position? What if you didnt know its position?
(b) How can the list [1.0, 2+4j, Hello] be added to the existing list?
(c) How can the list be reversed?
(d) In the extended list, how can you count the occurrence of Hello?
12. Construct a dictionary with the keyword-value pairs: alpha and 1.0, beta and 3.1415, gamma
and -99. How can the value of alpha be retrieved?
13. Convert the final list at the end of problem 11 to a set. How is the set different from the list?

45

46

Chapter 4

Arrays and Matrices


NumPy provides the most important data types for econometrics, statistics and numerical analysis arrays and matrices. The difference between these two data types are:
Arrays can have 1, 2, 3 or more dimensions, and matrices always have 2 dimensions. This means
that a 1 by n vector stored as an array has 1 dimension and n elements, while the same vector stored
as a matrix has 2-dimensions where the sizes of the dimensions are 1 and n (in either order).
Standard mathematical operators on arrays operate element-by-element. This is not the case for
matrices, where multiplication (*) follows the rules of linear algebra. 2-dimensional arrays can be
multiplied using the rules of linear algebra using dot. Similarly, the function multiply can be used
on two matrices for element-by-element multiplication.
Arrays are more common than matrices, and all functions are thoroughly tested with arrays. Functions should also work with matrices, but an occasional strange result may be encountered.
Arrays can be quickly treated as a matrix using either asmatrix or mat without copying the underlying
data.
The best practice is to use arrays and to use the asmatrix view when writing linear algebra-heavy code. It is
also important to test any custom function with both arrays and matrices to ensure that false assumptions
about the behavior of multiplication have not been made.

4.1

Array

Arrays are the base data type in NumPy, are are arrays in some ways similar to lists since they both contain
collections of elements. The focus of this section is on homogeneous arrays containing numeric data
that is, an array where all elements have the same numeric type (heterogeneous arrays are covered in
Chapters 16 and 17). Additionally, arrays, unlike lists, are always rectangular so that all rows have the same
number of elements.
Arrays are initialized from lists (or tuples) using array. Two-dimensional arrays are initialized using
lists of lists (or tuples of tuples, or lists of tuples, etc.), and higher dimensional arrays can be initialized by
further nesting lists or tuples.
47

>>> x = [0.0, 1, 2, 3, 4]
>>> y = array(x)
>>> y
array([ 0.,

1.,

2.,

3.,

4.])

>>> type(y)
numpy.ndarray

Two (or higher) -dimensional arrays are initialized using nested lists.
>>> y = array([[0.0, 1, 2, 3, 4], [5, 6, 7, 8, 9]])
>>> y
array([[ 0.,

1.,

2.,

3.,

4.],

[ 5.,

6.,

7.,

8.,

9.]])

>>> shape(y)
(2L, 5L)
>>> y = array([[[1,2],[3,4]],[[5,6],[7,8]]])
>>> y
array([[[1, 2],
[3, 4]],
[[5, 6],
[7, 8]]])
>>> shape(y)
(2L, 2L, 2L)

4.1.1

Array dtypes

Homogeneous arrays can contain a variety of numeric data types. The most useful is float64, which corresponds to the python built-in data type of float (and C/C++ double). By default, calls to array will preserve
the type of the input, if possible. If an input contains all integers, it will have a dtype of int32 (similar to
the built in data type int). If an input contains integers, floats, or a mix of the two, the arrays data type will
be float64. If the input contains a mix of integers, floats and complex types, the array will be initialized
to hold complex data.
>>> x = [0, 1, 2, 3, 4] # Integers
>>> y = array(x)
>>> y.dtype
dtype(int32)
>>> x = [0.0, 1, 2, 3, 4] # 0.0 is a float
>>> y = array(x)
>>> y.dtype
dtype(float64)
>>> x = [0.0 + 1j, 1, 2, 3, 4] # (0.0 + 1j) is a complex
>>> y = array(x)

48

>>> y
array([ 0.+1.j,

1.+0.j,

2.+0.j,

3.+0.j,

4.+0.j])

>>> y.dtype
dtype(complex128)

NumPy attempts to find the smallest data type which can represent the data when constructing an array.
It is possible to force NumPy to select a particular dtype by using the keyword argument dtype=datetype
when initializing the array.
>>> x = [0, 1, 2, 3, 4] # Integers
>>> y = array(x)
>>> y.dtype
dtype(int32)
>>> y = array(x, dtype=float64) # String dtype
>>> y.dtype
dtype(float64)
>>> y = array(x, dtype=float64) # NumPy type dtype
>>> y.dtype
dtype(float64)

4.2

Matrix

Matrices are essentially a subset of arrays, and behave in a virtually identical manner. The two important
differences are:
Matrices always have 2 dimensions
Matrices follow the rules of linear algebra for *
1- and 2-dimensional arrays can be copied to a matrix by calling matrix on an array. Alternatively, calling
mat or asmatrix provides a faster method where an array can behave like a matrix without copying any
data.
>>> x = [0.0, 1, 2, 3, 4] # Any float makes all float
>>> y = array(x)
>>> type(y)
numpy.ndarray
>>> y * y # Element-by-element
array([ 0.,
1.,
4.,
9.,

16.])

>>> z = asmatrix(x)
>>> type(z)
numpy.matrixlib.defmatrix.matrix
>>> z * z # Error
ValueError: matrices are not aligned

49

4.3

1-dimensional Arrays

A vector
x = [1

5]

is entered as a 1-dimensional array using


>>> x=array([1.0,2.0,3.0,4.0,5.0])
array([ 1.,

2.,

3.,

4.,

5.])

>>> ndim(x)
1

If an array with 2-dimensions is required, it is necessary to use a trivial nested list.


>>> x=array([[1.0,2.0,3.0,4.0,5.0]])
array([[ 1.,

2.,

3.,

4.,

5.]])

>>> ndim(x)
2

A matrix is always 2-dimensional and so a nested list is not required to initialize a a row matrix
>>> x=matrix([1.0,2.0,3.0,4.0,5.0])
>>> x
matrix([[ 1.,

2.,

3.,

4.,

5.]])

>>> ndim(x)
2

Notice that the output matrix representation uses nested lists ([[ ]]) to emphasize the 2-dimensional
structure of all matrices. The column vector,

x =

1
2
3
4
5

is entered as a matrix or 2-dimensional array using a set of nested lists


>>> x=matrix([[1.0],[2.0],[3.0],[4.0],[5.0]])
>>> x
matrix([[ 1.],
[ 2.],
[ 3.],
[ 4.],
[ 5.]])
>>> x = array([[1.0],[2.0],[3.0],[4.0],[5.0]])
>>> x
array([[ 1.],

50

[ 2.],
[ 3.],
[ 4.],
[ 5.]])

4.4

2-dimensional Arrays

Matrices and 2-dimensional arrays are rows of columns, and so

1 2 3

x = 4 5 6 ,
7 8 9
is input by enter the matrix one row at a time, each in a list, and then encapsulate the row lists in another
list.
>>> x = array([[1.0,2.0,3.0],[4.0,5.0,6.0],[7.0,8.0,9.0]])
>>> x
array([[ 1.,

2.,

3.],

[ 4.,

5.,

6.],

[ 7.,

8.,

9.]])

4.5

Multidimensional Arrays

Higher dimensional arrays are useful when tracking matrix valued processes through time, such as a timevarying covariance matrices. Multidimensional (N -dimensional) arrays are available for N up to about 30,
depending on the size of each matrix dimension. Manually initializing higher dimension arrays is tedious
and error prone, and so it is better to use functions such as zeros((2, 2, 2)) or empty((2, 2, 2)).

4.6

Concatenation

Concatenation is the process by which one vector or matrix is appended to another. Arrays and matrices
can be concatenation horizontally or vertically. For example, suppose

"
x =

1 2
3 4

"
and y =

5 6
7 8

"
and z =

x
y

needs to be constructed. This can be accomplished by treating x and y as elements of a new matrix and
using the function concatenate to join them. The inputs to concatenate must be grouped in a tuple and
the keyword argument axis specifies whether the arrays are to be vertically (axis = 0) or horizontally
(axis = 1) concatenated.
>>> x = array([[1.0,2.0],[3.0,4.0]])
>>> y = array([[5.0,6.0],[7.0,8.0]])
>>> z = concatenate((x,y),axis = 0)
>>> z
array([[ 1.,

2.],

51

[ 3.,

4.],

[ 5.,

6.],

[ 7.,

8.]])

>>> z = concatenate((x,y),axis = 1)
>>> z
array([[ 1.,

2.,

5.,

6.],

[ 3.,

4.,

7.,

8.]])

Concatenating is the code equivalent of block-matrix forms in standard matrix algebra. Alternatively, the
functions vstack and hstack can be used to vertically or horizontally stack arrays, respectively.
>>> z = vstack((x,y)) # Same as z = concatenate((x,y),axis = 0)
>>> z = hstack((x,y)) # Same as z = concatenate((x,y),axis = 1)

4.7

Accessing Elements of an Array

Four methods are available for accessing elements contained within an array: scalar selection, slicing,
numerical indexing and logical (or Boolean) indexing. Scalar selection and slicing are the simplest and so
are presented first. Numerical indexing and logical indexing both depends on specialized functions and
so these methods are discussed in Chapter 12.

4.7.1

Scalar Selection

Pure scalar selection is the simplest method to select elements from an array, and is implemented using
[i ] for 1-dimensional arrays, [i , j ] for 2-dimensional arrays and [i 1 ,i 2 ,. . .,i n ] for general n-dimensional
arrays.
>>> x = array([1.0,2.0,3.0,4.0,5.0])
>>> x[0]
1.0
>>> x = array([[1.0,2,3],[4,5,6]])
>>> x[1,2]
6.0
>>> type(x[1,2])
numpy.float64

Pure scalar selection always returns a single element which is not an array. The data type of the selected
element matches the data type of the array used in the selection. Scalar selection can also be used to
assign values in an array.
>>> x = array([1.0,2.0,3.0,4.0,5.0])
>>> x[0] = -5
>>> x
array([-5.,

2.,

3.,

4.,

5.])

52

4.7.2

Array Slicing

Arrays, like lists and tuples, can be sliced. Arrays slicing is virtually identical to lists slicing except that a
simpler slicing syntax is available since arrays are explicitly multidimensional and rectangular. Arrays are
sliced using the syntax [:,:,. . .,:] (where the number of dimensions of the arrays determines the size
of the slice).1 Recall that the slice notation a:b:s will select every sth element where the indices i satisfy
a i < b so that the starting value a is always included in the list and the ending value b is always
excluded. Additionally, a number of shorthand notations are commonly encountered
: and :: are the same as 0:n:1 where n is the length of the array (or list).
a: and a:n are the same as a:n:1 where n is the length of the array (or list).
:b is the same as 0:b:1.
::s is the same as 0:n:s where n is the length of the array (or list).
Basic slicing of 1-dimensional arrays is identical to slicing a simple list, and the returned type of all slicing
operations matches the array being sliced.
>>> x = array([1.0,2.0,3.0,4.0,5.0])
>>> y = x[:]
array([ 1.,

2.,

3.,

4.,

5.])

>>> y = x[:2]
array([ 1.,

2.])

>>> y = x[1::2]
array([ 2.,

4.])

In 2-dimensional arrays, the first dimension specifies the row or rows of the slice and the second dimension specifies the the column or columns. Note that the 2-dimensional slice syntax y[a:b,c:d] is the
same as y[a:b,:][:,c:d] or y[a:b][:,c:d], although clearly the shorter form is preferred. In the case
where only row slicing in needed y[a:b], which is the equivalent to y[a:b,:], is the shortest syntax.
>>> y = array([[0.0, 1, 2, 3, 4],[5, 6, 7, 8, 9]])
>>> y
array([[ 0.,

1.,

2.,

3.,

4.],

[ 5.,

6.,

7.,

8.,

9.]])

>>> y[:1,:] # Row 0, all columns


array([[ 0.,

1.,

2.,

3.,

4.]])

>> y[:1] # Same as y[:1,:]


array([[ 0.,

1.,

2.,

3.,

4.]])

>>> y[:,:1] # all rows, column 0


array([[ 0.],
1

It is not necessary to include all trailing slice dimensions, and any omitted trailing slices are set to select all elements (the
slice :). For example, if x is a 3-dimensional array, x[0:2] is the same as x[0:2,:,:] and x[0:2,0:2] is the same as
x[0:2,0:2,:].

53

[ 5.]])
>>> y[:1,0:3] # Row 0, columns 0 to 2
array([[ 0.,

1.,

2.]])

>>> y[:1][:,0:3] # Same as previous


array([[ 0.,

1.,

2.]])

>>> y[:,3:] # All rows, columns 3 and 4


array([[ 3.,

4.],

[ 8.,

9.]])

>>> y = array([[[1.0,2],[3,4]],[[5,6],[7,8]]])
>>> y[:1,:,:] # Panel 0 of 3D y
array([[[ 1.,

2.],

[ 3.,

4.]]])

In the previous examples, slice notation was always used even when only selecting 1 row or column. This
was done to emphasize the difference between using slice notation, which always returns an array with
the same dimension and using a scalar selector which will perform dimension reduction.

4.7.3

Mixed Selection using Scalar and Slice Selectors

When arrays have more than 1-dimension, it is often useful to mix scalar and slice selectors to select an
entire row, column or panel of a 3-dimensional array. This is similar to pure slicing with one important
caveat dimensions selected using scalar selectors are eliminated. For example, if x is a 2-dimensional
array, then x[0,:] will select the first row. However, unlike the 2-dimensional array constructed using the
slice x[:1,:], x[0,:] will be a 1-dimensional array.
>>> x = array([[1.0,2],[3,4]])
>>> x[:1,:] # Row 1, all columns, 2-dimensional
array([[ 1.,

2.]])

>>> x[0,:] # Row 1, all columns, dimension reduced


array([ 1.,

2.])

While these two selections appear similar, the first produces a 2-dimensional array (note the [[ ]] syntax)
while the second is a 1-dimensional array. In most cases where a single row or column is required, using
scalar selectors such as y[0,:] is the best practice. It is important to be aware of the dimension reduction since scalar selections from a 2-dimensional arrays will no longer have 2-dimensions. This type of
dimension reduction may matter when evaluating linear algebra expression.
The principle adopted by NumPy is that slicing should always preserve the dimension of the underlying array, while scalar indexing should always collapse the dimension(s). This is consistent with x[0,0]
returning a scalar (or 0-dimensional array) since both selections are scalar. This is demonstrated in the
next example which highlights the differences between pure slicing, mixed slicing and pure scalar selection. Note that the function ndim returns the number of dimensions of an array.
>>> x = array([[0.0, 1, 2, 3, 4],[5, 6, 7, 8, 9]])
>>> x[:1,:] # Row 0, all columns, 2-dimensional
array([[ 0.,

1.,

2.,

3.,

4.]])

54

>>> ndim(x[:1,:])
2
>>> x[0,:] # Row 0, all column, dim reduction to 1-d array
array([ 0.,

1.,

2.,

3.,

4.])

>>> ndim(x[0,:])
1
>>> x[0,0] # Top left element, dim reduction to scalar (0-d array)
0.0
>>> ndim(x[0,0])
0
>>> x[:,0] # All rows, 1 column, dim reduction to 1-d array
array([ 0., 5.])

4.7.4

Assignment using Slicing

Slicing and scalar selection can be used to assign arrays that have the same dimension as the slice.2
>>> x = array([[0.0]*3]*3)
>>> x

# *3 repeats the list 3 times

array([[0, 0, 0],
[0, 0, 0],
[0, 0, 0]])
>>> x[0,:] = array([1.0, 2.0, 3.0])
>>> x
array([[ 1.,

2.,

3.],

[ 0.,

0.,

0.],

[ 0.,

0.,

0.]])

>>> x[::2,::2] =

array([[-99.0,-99],[-99,-99]]) # 2 by 2

>>> x
array([[-99.,
[

0.,

[-99.,

2., -99.],
0.,

0.],

0., -99.]])

>>> x[1,1] = pi
>>> x
array([[-99.

2.

0.

3.14159265,

[-99.

0.

, -99.

],

0.

],

, -99.

]])

Formally, the array to be assigned must be broadcastable to the size of the slice. Broadcasting is described in Chapter 5, and
assignment using broadcasting is discussed in Chapter 12.

55

NumPy attempts to automatic (silent) data type conversion if an element with one data type is inserted
into an array wit a different data type. For example, if an array has an integer data type, place a float into
the array results in the float being truncated and stored as an integer. This is dangerous, and so in most
cases, arrays should be initialized to contain floats unless a considered decision is taken to use a different
data type.
>>> x = [0, 1, 2, 3, 4] # Integers
>>> y = array(x)
>>> y.dtype
dtype(int32)
>>> y[0] = 3.141592
>>> y
array([3, 1, 2, 3, 4])
>>> x = [0.0,1, 2, 3, 4] # 1 Float makes all float
>>> y = array(x)
>>> y.dtype
dtype(float64)
>>> y[0] = 3.141592
>>> y
array([ 3.141592,

4.7.5

1.

2.

3.

4.

])

Linear Slicing using flat

Data in matrices is stored in row-major order elements are indexed by first counting across rows and
then down columns. For example, in the matrix

1 2 3

x = 4 5 6
7 8 9
the first element of x is 1, the second element is 2, the third is 3, the fourth is 4, and so on.
In addition to slicing using the [:,:,. . .,:] syntax, k -dimensional arrays can be linear sliced. Linear
slicing assigns an index to each element of the array, starting with the first (0), the second (1), and so on
until the final element (n 1). In 2-dimensions, linear slicing works by first counting across rows, and
then down columns. To use linear slicing, the method or function flat must first be used.
>>> y = reshape(arange(25.0),(5,5))
>>> y
array([[

0.,

1.,

2.,

3.,

5.,

6.,

7.,

8.,

4.],
9.],

[ 10.,

11.,

12.,

13.,

14.],

[ 15.,

16.,

17.,

18.,

19.],

[ 20.,

21.,

22.,

23.,

24.]])

>>> y[0] # Same as y[0,:], first row


array([ 0.,

1.,

2.,

3.,

4.])

56

>>> y.flat[0] # Scalar slice, flat is 1-dimensional


0
>>> y[6] # Error
IndexError: index out of bounds
>>> y.flat[6] # Element 6
6.0
>>> y.flat[12:15]
array([ 12.,

13.,

14.])

>>> y.flat[:] # All element slice


array([[

0.,

1.,

2.,

3.,

4.,

5.,

6.,

7.,

8.,

9.,

10.,

11.,

12.,

13.,

14.,

15.,

16.,

17.,

18.,

19.,

20.,

21.,

22.,

23.,

24.]])

Note that arange and reshape are useful functions are described in later chapters.

4.8

Slicing and Memory Management

Unlike lists, slices of arrays are do not copy the underlying data. Instead a slice of an array returns a view of
the array which shares the data in the sliced array. This is important since changes in slices will propagate
to the underlying array and to any other slices which share the same element.
>>> x = reshape(arange(4.0),(2,2))
>>> x
array([[ 0.,

1.],

[ 2.,

3.]])

>>> s1 = x[0,:] # First row


>>> s2 = x[:,0] # First column
>>> s1[0] = -3.14 # Assign first element
>>> s1
array([-3.14,

1.

])

2.

])

>>> s2
array([-3.14,
>>> x
array([[-3.14,
[ 2.

1.

],

3.

]])

If changes should not propagate to parent and sibling arrays, it is necessary to call copy on the slice. Alternatively, they can also be copied by calling array on arrays, or matrix on matrices.
>>> x = reshape(arange(4.0),(2,2))
>>> s1 = copy(x[0,:]) # Function copy
>>> s2 = x[:,0].copy() # Method copy
>>> s3 = array(x[0,:]) # Create a new array

57

>>> s1[0] = -3.14


>>> s1
array([-3.14,

1.])

>>> s2
array([ 0.,

2.])

>>> s3
array([0.,

1.])

>>> x[0,0]
array([[ 0.,
[ 2.,

1.],
3.]])

There is one notable exception to this rule when using pure scalar selection the (scalar) value returned
is always a copy.
>>> x = arange(5.0)
>>> y = x[0] # Pure scalar selection
>>> z = x[:1] # A pure slice
>>> y = -3.14
>>> y # y Changes
-3.14
>>> x # No propagation
array([ 0.,
>>> z

1.,

2.,

3.,

4.])

# No changes to z either

array([ 0.])
>>> z[0] = -2.79
>>> y # No propagation since y used pure scalar selection
-3.14
>>> x # z is a view of x, so changes propagate
array([-2.79,

1.

2.

3.

4.

])

Finally, assignments from functions which change values will automatically create a copy of the underlying array.
>>> x = array([[0.0, 1.0],[2.0,3.0]])
>>> y = x
>>> print(id(x),id(y)) # Same
129186368 129186368
>>> y = x + 1.0
>>> y
array([[ 1.,
[ 3.,

2.],
4.]])

>>> print(id(x),id(y)) # Different


129186368 129183104

58

>>> x # Unchanged
array([[ 0.,

1.],

[ 2.,

3.]])

>>> y = exp(x)
>>> print(id(x),id(y)) # Also Different
129186368 129185120

Even trivial function such as y = x + 0.0 create a copy of x, and so the only scenario where explicit copying
is required is when y is directly assigned using a slice of x, and changes to y should not propagate to x.

4.9 import and Modules


Python, by default, only has access to a small number of built-in types and functions. The vast majority of
functions are located in modules, and before a function can be accessed, the module which contains the
function must be imported. For example, when using ipython --pylab (or any variants), a large number
of modules are automatically imported, including NumPy and matplotlib. This is style of importing useful
for learning and interactive use, but care is needed to make sure that the correct module is imported when
designing more complex programs.
import can be used in a variety of ways. The simplest is to use from module import * which imports
all functions in module. This method of using import can dangerous since if you use it more than once,
it is possible for functions to be hidden by later imports. A better method is to just import the required
functions. This still places functions at the top level of the namespace, but can be used to avoid conflicts.
from pylab import log2 # Will import log2 only
from scipy import log10 # Will not import the log2 from SciPy

The functions log2 and log10 can both be called in subsequent code. An alternative, and more common,
method is to use import in the form
import pylab
import scipy
import numpy

which allows functions to be accessed using dot-notation and the module name, for example scipy.log2.
It is also possible to rename modules when imported using as
import pylab as pl
import scipy as sp
import numpy as np

The only difference between these two is that import scipy is implicitly calling import scipy as scipy.
When this form of import is used, functions are used with the as name. For example, the load provided
by NumPy is accessed using sp.log2, while the pylab load is pl.log2 and both can be used where appropriate. While this method is the most general, it does require slightly more typing.

4.10

Calling Functions

Functions calls have different conventions than most other expressions. The most important difference
is that functions can take more than one input and return more than one output. The generic structure
59

of a function call is out1, out2, out3, . . . = functionname(in1, in2, in3, . . .). The important aspects of this
structure are
If multiple outputs are returned, but only one output variable is provided, the output will (generally)
be a tuple.
If more than one output variable is given in a function call, the number of output must match the
number of output provided by the function. It is not possible to ask for two output if a function returns three using an incorrect number of outputs results in ValueError: too many values to unpack.
Both inputs and outputs must be separated by commas (,)
Inputs can be the result of other functions as long only one output is returned. For example, the
following are equivalent,
>>> y = var(x)
>>> mean(y)

and
>>> mean(var(x))

Required Arguments

Most functions have required arguments. For example, consider the definition of array from help(array),
array(object, dtype=None, copy=True, order=None, subok=False, ndmin=0)

Array has 1 required input, object, which is the list or tuple which contains values to use when creating
the array. Required arguments can be determined by inspecting the function signature since all of the
input follow the patters keyword=default except object required arguments will not have a default value
provided. The other arguments can be called in order (array accepts at most 2 non-keyword arguments).
>>> array([[1.0,2.0],[3.0,4.0]])
array([[ 1.,

2.],

[ 3.,

4.]])

>>> array([[1.0,2.0],[3.0,4.0]], int32)


array([[1, 2],
[3, 4]])

Keyword Arguments

All of the arguments to array can be called by their keyword, which is listed in the help file definition.
array(object=[[1.0,2.0],[3.0,4.0]])
array([[1.0,2.0],[3.0,4.0]], dtype=None, copy=True, order=None, subok=False, ndmin=0)

Keyword arguments have two important advantages. First, they do not have to appear in any order (Note:
randomly ordering arguments is not good practice, and this is only an example), and second, keyword
arguments can be used only when needed since a default value is always given.
60

>>> array(dtype=complex64, object = [[1.0,2.0],[3.0,4.0]], copy=True)


array([[ 1.+0.j,

2.+0.j],

[ 3.+0.j,

4.+0.j]], dtype=complex64)

Default Arguments

Functions have defaults for optional arguments. These are listed in the function definition and appear
in the help in the form keyword=default. Returning to array, all inputs have default arguments except
object the only required input.
Multiple Outputs

Some functions can have more than 1 output. These functions can be used in a single output mode or in
multiple output mode. For example, shape can be used on an array to determine the size of each dimension.
>>> x = array([[1.0,2.0],[3.0,4.0]])
>>> s = shape(x)
>>> s
(2L, 2L)

Since shape will return as many outputs as there are dimensions, it can be called with 2 outputs when the
input is a 2-dimensional array.
>>> x = array([[1.0,2.0],[3.0,4.0]])
>>> M,N = shape(x)
>>> M
2L
>>> N
2L

Requesting more outputs than are required will produce an error.


>>> M,N,P = shape(x) # Error
ValueError: need more than 2 values to unpack

Similarly, providing two few output can also produce an error. Consider the case where the argument used
with shape is a 3-dimensional array.
>>> x = randn(10,10,10)
>>> shape(x)
(10L, 10L, 10L)
>>> M,N = shape(x) # Error
ValueError: too many values to unpack

4.11

Exercises

1. Input the following mathematical expressions into Python as both arrays and matrices.
u = [1

1
61

8]

v =

1
1
2
3
5
8

"

1 0
0 1

"

1 2
3 4

x =

y =

1 2 1 2

z = 3 4 3 4
1 2 1 2

"
w =

x
y

x
y

Note: A column vector must be entered as a 2-dimensional array.


2. What command would pull x out of w ? (Hint: w[?,?] is the same as x .)

0

3. What command would pull x 0 y 0 out of w? Is there more than one? If there are, list all alternatives.

4. What command would pull y out of z ? List all alternatives.


5. Explore the options for creating an array using keyword arguments. Create an array containing

"
y =

1 2
3 4

with combination of keyword arguments in:


(a) dtype in float, float64, int32 (32-bit integers), uint32 (32-bit unsigned integers) and complex128
(double precision complex numbers).
(b) copy either True or False.
(c) ndim either 3 or 4. Use shape(y) to see the effect of this argument.
6. Enter y = [1.6180 2.7182 3.1415] as an array. Define x = mat(y). How is x different from y ?

62

Chapter 5

Basic Math
Note: Python contains a math module providing functions which operate on built-in scalar data types
(e.g. float and complex). This and subsequent chapters assume mathematical functions must operate on
arrays and matrices, and so are imported from NumPy.

5.1

Operators

These standard operators are available:


Operator

Meaning

Example

Algebraic

+
*
/

Addition
Subtraction
Multiplication
Division (Left divide)

x + y

x+y
xy
xy

x/y

x
y

**

Exponentiation

x**y

xy

x - y
x * y

When x and y are scalars, the behavior of these operators is obvious. The only possible exception
occurs when both x and y are integers for division, where x/y returns the smallest integer less than the
ratio (e.g. b xy c). The simplest method to avoid this problem is use from __future__ import division
which changes the default behavior. Alternatively, declaring numeric values to be floats using 5.0 rather
than 5 will also mitigate this issue as well explicitly casting integers to floats before dividing.
>>> x = 9
>>> y = 5
>>> (type(x), type(y))
(int, int)
>>> x/y # Since division imported
1.8
>>> float(x)/y
1.8

When x and y are arrays or matrices, the behavior of mathematical operations is more complex. The
examples in this chapter refer to arrays, and except where explicit differences are noted, it is safe to assume
that the behavior of 2-dimensional arrays and matrices is identical.
63

I recommend using the import command from __future__ import division in all programs
and IPython. The future division avoids this issue by always casting division to floating point
when the result is not an exact integer.

5.2

Broadcasting

Under the normal rules of array mathematics, addition and subtraction are only defined for arrays with the
same shape or between an array and a scalar. For example, there is no obvious method to add a 5-element
vector and a 5 by 4 matrix. NumPy uses a technique called broadcasting to allow element-by-element
mathematical operations on arrays (and matrices) which would not be compatible under the standard
rules of array mathematics.
Arrays can be used in element-by-element mathematics if x is broadcastable to y. Suppose x is an mdimensional array with dimensions d = [d 1 , d 2 . . . d m ], and y is an n-dimensional array with dimensions
f = [f1 , f2 . . . fn ] where m n. Formally, two arrays are broadcastable if the following two conditions
hold.
1. If m > n, then treat y as a m-dimensional array with size g = [1, 1, . . . , 1, f1 , f2 . . . fn ] where the
number of 1s prepended is m n. The dimensions are g i = 1 for i = 1, . . . m n and g i = fi m +n
for i > m n.
2. For i = 1, . . . , m, max (d i , g i ) / min (d i , g i ) {1, max (d i , g i )}.
The first rule is simply states that if one array has fewer dimensions, it is treated as having the same number of dimensions as the larger array by prepending 1s. The second rule states that arrays will only be
broadcastable if either (a) they have the same dimension along axis i or (b) one has dimension 1 along
axis i . When 2 arrays are broadcastable, the dimension of the output array is max (d i , g i ) for i = 1, . . . n.
Consider the following examples where m, n and p are assumed to have different values.
x

Broadcastable

Output Size

x Operation

y Operation

Any
m, 1
m, 1
m, n
m, n, 1
m, n, p
m, n, 1
m, 1, p

Scalar
1, n or n
n, 1
1, n or n
1, 1, p or 1, p or p
1, 1, p or 1, p or p
p, 1
1, n, 1, 1, n, p or n, 1

Same as x
m, n

tile(y,shape(x))

tile(x,(1,n))

tile(y,(m,1))

m, n
m, n, p
m, n, p

tile(y,(m,1))

tile(x,(1,1,p))

tile(y,(m,n,1))

tile(y,(m,n,1))

m, n, p

tile(x,(1,n,1))

tile(y,(m,1,p))

One simple method to visualize broadcasting is to use an add and subtract operation where the addition
causes the smaller array to be broadcast, and then the subtract removes the values in the larger array. This
will produce a replicated version of the smaller array which shows the nature of the broadcasting.
64

>>> x = array([[1,2,3.0]])
>>> x
array([[ 1.,

2.,

3.]])

>>> y = array([[0],[0],[0.0]])
>>> y
array([[ 0.],
[ 0.],
[ 0.]])
>>> x + y

# Adding 0 produces broadcast

array([[ 1.,

2.,

3.],

[ 1.,

2.,

3.],

[ 1.,

2.,

3.]])

In the next example, x is 3 by 5, so y must be either scalar or a 5-element array or a 1 5 array to be


broadcastable. When y is a 3-element array (and so matches the leading dimension), an error occurs.
>>> x = reshape(arange(15),(3,5))
>>> x
array([[ 0,

1,

2,

3,

4],

[ 5,

6,

7,

8,

9],

[10, 11, 12, 13, 14]])


>>> y = 1
>>> x + y - x
array([[5, 5, 5, 5, 5],
[5, 5, 5, 5, 5],
[5, 5, 5, 5, 5]])
>>> y = arange(5)
>>> y
array([0, 1, 2, 3, 4])
>>> x + y - x
array([[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4]])
>>> y = arange(3)
>>> y
array([0, 1, 2])
>>> x + y - x # Error
ValueError: operands could not be broadcast together with shapes (3,5) (3)

5.3

Array and Matrix Addition (+) and Subtraction (-)

Subject to broadcasting restrictions, addition and subtraction works element-by-element.


65

5.4

Array Multiplication (*)

The standard multiplication operator differs for variables with type array and matrix. For arrays, * performs element-by-element multiplication and so inputs must be broadcastable. For matrices, * is matrix
multiplication as defined by linear algebra, and there is no broadcasting.
Conformable arrays can be multiplied according to the rules of matrix algebra using the function
dot(). For simplicity, assume x is N by M and y is K by L . If M = K , dot(x,y) will produce the array
N by L array z[i,j] = =dot( x[i,:], y[:,j]) where dot on 1-dimensional arrays is the usual vector
dot-product. The behavior of dot() is described as:
y

Scalar
x

Array

Scalar
Any
z = xy
Any
z i j = y xi j

Array
Any
z i j = x yi j
Inside Dimensions Match
P
zi j = M
k =1 x i k yk j

These rules conform to the standard rules of matrix multiplication. dot() can also be used on higher
dimensional arrays, and is useful if x is T by M by N and y is N by P to produce an output matrix which
is T by M by P , where each of the M by P (T in total) have the form dot(x[i],y).

5.5

Matrix Multiplication (*)

If x is N by M and y is K by L and both are non-scalar matrices, x*y requires M = K . Similarly, y*x requires
L = N . If x is scalar and y is a matrix, then z=x*y produces z(i,j)=x*y(i,j). Suppose z=x * y where both
x and y are matrices:
y

Scalar
x

Matrix

Scalar
Any
z = xy
Any
z i j = y xi j

Matrix
Any
z i j = x yi j
Inside Dimensions Match
P
zi j = M
k =1 x i k yk j

Note: These conform to the standard rules of matrix multiplication.


multiply() performs element-by-element multiplication of matrices, and will use broadcasting if necessary. Matrices are identical to 2-dimensional arrays when performing element-by-element multiplication.

5.6

Array and Matrix Division (/)

Division is always element-by-element, and the rules of broadcasting are used.

5.7

Array Exponentiation (**)

Array exponentiation operates element-by-element, and the rules of broadcasting are used.
66

5.8

Matrix Exponentiation (**)

Matrix exponentiation differs from array exponentiation, and can only be used on square matrices. When
x is a square matrix and y is a positive integer, x**y produces x*x*...*x (y times). When y is a negative integer, x**y produces inv(x**abs(y)) where inv produces the inverse, and so x must have full rank. Python
does not support non-integer values for y, although x p can be defined (in linear algebra) using eigenvalues
and eigenvectors for a subset of all matrices.

5.9

Parentheses

Parentheses can be used in the usual way to control the order in which mathematical expressions are
evaluated, and can be nested to create complex expressions. See section 5.11 on Operator Precedence for
more information on the order mathematical expressions are evaluated.

5.10

Transpose

Matrix transpose is expressed using either the transpose function, or the shortcut .T. For instance, if x is
an M by N matrix, transpose(x), x.transpose() and x.T are all its transpose with dimensions N by M .
In practice, using the .T is the preferred method and will improve readability of code. Consider
>>> x = asmatrix(randn(2,2))
>>> xpx1 = x.T * x
>>> xpx2 = x.transpose() * x
>>> xpx3 = transpose(x) * x

Transpose has no effect on 1-dimensaional arrays. In 2-dimensions, transpose switches indices so that
if z=x.T, z[j,i] is that same as x[i,j]. In higher dimensions, transpose reverses the order or the indices.
For example, if x has 3 dimensions and z=x.T, then x[i,j,k] is the same as z[k,j,i]. Transpose takes an
optional second argument to specify the axis to use when permuting the array.

5.11

Operator Precedence

Computer math, like standard math, has operator precedence which determined how mathematical expressions such as

2**3+3**2/7*13

are evaluated. Best practice is to always use parentheses to avoid ambiguity in the order or operations.
The order of evaluation is:
67

Operator
( ), [ ] , ( ,)
**
~
+,,
/
,
//
,%
*
+,&
^
<, <=, >, >=
==, !=
in, not in
is, is not
not
and
or
=,+=,-=,/=,*=,**=

Name

Rank

Parentheses, Lists, Tuples


Exponentiation
Bitwise NOT
Unary Plus, Unary Minus
Multiply, Divide, Modulo
Addition and Subtraction
Bitwise AND
Bitwise XOR
Bitwise OR
Comparison operators
Equality operators
Identity Operators
Membership Operators
Boolean NOT
Boolean AND
Boolean OR
Assignment Operators

1
2
3
3
4
5
6
7
8
9
9
9
9
10
11
12
13

Note that some rows of the table have the same precedence, and are only separated since they are conceptually different. In the case of a tie, operations are executed left-to-right. For example, x**y**z is interpreted as (x**y)**z. This table has omitted some operators available in Python which are not generally
useful in numerical analysis (e.g. shift operators).
Note: Unary operators are + or - operations that apply to a single element. For example, consider the
expression (-4). This is an instance of a unary negation since there is only a single operation and so
(-4)**2 produces 16. On the other hand, -4**2 produces -16 since has higher precedence than unary
negation and so is interpreted as -(4**2). -4 * -4 produces 16 since it is interpreted as (-4) * (-4), since
unary negation has higher precedence than multiplication.

5.12

Exercises

1. Using the arrays entered in exercise 1 of chapter 4, compute the values of u + v 0 , v + u 0 , v u , u v and
x y (where the multiplication is as defined as linear algebra).
2. Repeat exercise 1 treating the inputs as matrices.
3. Which of the arrays in exercise 1 are broadcastable with:
a = [3 2],

"
b =

3
2

#
,

c = [3 2 1 0] ,
68

d =

3
2
1
0

4. Is x/1 legal? If not, why not. What about 1/x?


5. Compute the values (x+y)**2 and x**2+x*y+y*x+y**2. Are they the same when x and y are arrays?
What if they are matrices?
6. Is x**2+2*x*y+y**2 the same as any of the above?
7. When will x**y for matrices be the same as x**y for vectors?
8. For conformable arrays, is a*b+a*c the same as a*b+c? If so, show with an example. If not, how can
the second be changed so they are equal.
9. Suppose a command x**y*w+z was entered. What restrictions on the dimensions of w, x, y and z
must be true for this to be a valid statement?
10. What is the value of -2**4? What about (-2)**4? What about -2*-2*-2*-2?

69

70

Chapter 6

Basic Functions and Numerical Indexing


6.1

Generating Arrays and Matrices

linspace
linspace(l,u,n) generates a set of n points uniformly spaced between l, a lower bound (inclusive) and u,

an upper bound (inclusive).


>>> x = linspace(0, 10, 11)
>>> x
array([

0.,

1.,

2.,

3.,

4.,

5.,

6.,

7.,

8.,

9.,

10.])

logspace
logspace(l,u,n) produces a set of logarithmically spaced points between 10l and 10u . It is identical to
10**linspace(l,u,n).

arange
arange(l,u,s) produces a set of points spaced by s between l, a lower bound (inclusive) and u, an up-

per bound (exclusive). arange can be used with a single parameter, so that arange(n) is equivalent to
arange(0,n,1). Note that arange will return integer data type if all inputs are integer.
>>> x = arange(11)
array([

0,

1,

2,

3,

4,

5,

6,

4.,

5.,

7,

8,

9,

10])

>>> x = arange(11.0)
array([

0.,

1.,

2.,

3.,

6.,

7.,

8.,

9.,

10.])

>>> x = arange(4, 10, 1.25)


array([ 4.

5.25,

6.5 ,

7.75,

9.

])

meshgrid
meshgrid broadcasts two vectors to produce two 2-dimensional arrays, and is a useful function when plot-

ting 3-dimensional functions.


71

>>> x = arange(5)
>>> y = arange(3)
>>> X,Y = meshgrid(x,y)
>>> X
array([[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4]])
>>> Y
array([[0, 0, 0, 0, 0],
[1, 1, 1, 1, 1],
[2, 2, 2, 2, 2]])

r_
r_ is a convenience function which generates 1-dimensional arrays from slice notation. While r_ is highly

flexible, the most common use it r_[ start : end : stepOrCount ] where start and end are the start and end
points, and stepOrCount can be either a step size, if a real value, or a count, if complex.
>>> r_[0:10:1] # arange equiv
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> r_[0:10:.5] # arange equiv
array([ 0. ,

0.5,

1. ,

1.5,

2. ,

2.5,

3. ,

3.5,

4. ,

5.5,

6. ,

6.5,

7. ,

7.5,

8. ,

8.5,

9. ,

9.5])

4.5,

5. ,

>>> r_[0:10:5j] # linspace equiv, includes end point


array([

0. ,

2.5,

5. ,

7.5,

10. ])

r_ can also be used to concatenate slices using commas to separate slice notation blocks.
>>> r_[0:2, 7:11, 1:4]
array([ 0,

1,

7,

8,

9, 10,

1,

2,

3])

Note that r_ is not a function and that is used with [].

c_
c_ is virtually identical to r_ except that column arrays are generates, which are 2-dimensional (second

dimension has size 1)


>>> c_[0:5:2]
array([[0],
[2],
[4]])
>>> c_[1:5:4j]
array([[ 1.

],

[ 2.33333333],
[ 3.66666667],
[ 5.

]])

c_, like r_, is not a function and is used with [].

72

ix_
ix_(a,b) constructs an n-dimensional open mesh from n 1-dimensional lists or arrays. The output of
ix_ is an n-element tuple containing 1-dimensional arrays. The primary use of ix_ is to simplify selecting

slabs inside a matrix. Slicing can also be used to select elements from an array as long as the slice pattern is
regular. ix_ is particularly useful for selecting elements from an array using indices which are not regularly
spaced, as in the final example.
>>> x = reshape(arange(25.0),(5,5))
>>> x
array([[

0.,

1.,

2.,

3.,

5.,

6.,

7.,

8.,

4.],
9.],

[ 10.,

11.,

12.,

13.,

14.],

[ 15.,

16.,

17.,

18.,

19.],

[ 20.,

21.,

22.,

23.,

24.]])

>>> x[ix_([2,3],[0,1,2])] # Rows 2 & 3, cols 0, 1 and 2


array([[ 10.,

11.,

12.],

[ 15.,

16.,

17.]])

>>> x[2:4,:3] # Same, standard slice


array([[ 10.,

11.,

12.],

[ 15.,

16.,

17.]])

>>> x[ix_([0,3],[0,1,4])] # No slice equiv

mgrid
mgrid is very similar to meshgrid but behaves like r_ and c_ in that it takes slices as input, and uses a

real valued variable to denote step size and complex to denote number of values. The output is an n + 1
dimensional vector where the first index of the output indexes the meshes.
>>> mgrid[0:3,0:2:.5]
array([[[ 0. ,

0. ,

0. ,

0. ],

[ 1. ,

1. ,

1. ,

1. ],

[ 2. ,

2. ,

2. ,

2. ]],

[[ 0. ,

0.5,

1. ,

1.5],

[ 0. ,

0.5,

1. ,

1.5],

[ 0. ,

0.5,

1. ,

1.5]]])

>>> mgrid[0:3:3j,0:2:5j]
array([[[ 0. ,

0. ,

0. ,

0. ,

0. ],

[ 1.5,

1.5,

1.5,

1.5,

1.5],

[ 3. ,

3. ,

3. ,

3. ,

3. ]],

[[ 0. ,

0.5,

1. ,

1.5,

2. ],

[ 0. ,

0.5,

1. ,

1.5,

2. ],

[ 0. ,

0.5,

1. ,

1.5,

2. ]]])

73

ogrid
ogrid is identical to mgrid except that the arrays returned are always 1-dimensional. ogrid output is gen-

erally more appropriate for looping code, while mgrid is usually more appropriate for vectorized code.
When the size of the arrays is large, then ogrid uses much less memory.
>>> ogrid[0:3,0:2:.5]
[array([[ 0.],
[ 1.],
[ 2.]]), array([[ 0. ,

0.5,

1. ,

1.5]])]

>>> ogrid[0:3:3j,0:2:5j]
[array([[ 0. ],
[ 1.5],
[ 3. ]]),
array([[ 0. ,

6.2

0.5,

1. ,

1.5,

2. ]])]

Rounding

around, round
around rounds to the nearest integer, or to a particular decimal place when called with two arguments.
>>> x = randn(3)
array([ 0.60675173, -0.3361189 , -0.56688485])
>>> around(x)
array([ 1.,

0., -1.])

>>> around(x, 2)
array([ 0.61, -0.34, -0.57])
around can also be used as a method on an ndarray except that the method is named round. For example,
x.round(2) is identical to around(x, 2). The change of names is needed to avoid conflicting with the

Python built-in function round.

floor
floor rounds to the next smallest integer.
>>> x = randn(3)
array([ 0.60675173, -0.3361189 , -0.56688485])
>>> floor(x)
array([ 0., -1., -1.])

ceil
ceil rounds to the next largest integer.

74

>>> x = randn(3)
array([ 0.60675173, -0.3361189 , -0.56688485])
>>> ceil(x)
array([ 1., -0., -0.])

Note that the values returned are still floating points and so -0. is the same as 0..

6.3

Mathematics

sum, cumsum
sum sums elements in an array. By default, it will sum all elements in the array, and so the second argument

is normally used to provide the axis to use 0 to sum down columns, 1 to sum across rows. cumsum produces
the cumulative sum of the values in the array, and is also usually used with the second argument to indicate
the axis to use.
>>> x = randn(3,4)
>>> x
array([[-0.08542071, -2.05598312,
[-0.17576066,

2.1114733 ,

0.7986635 ],

0.83327885, -0.64064119, -0.25631728],

[-0.38226593, -1.09519101,

0.29416551,

0.03059909]])

>>> sum(x) # all elements


-0.62339964288008698
>>> sum(x, 0) # Down rows, 4 elements
array([-0.6434473 , -2.31789529,

1.76499762,

0.57294532])

>>> sum(x, 1) # Across columns, 3 elements


array([ 0.76873297, -0.23944028, -1.15269233])
>>> cumsum(x,0) # Down rows
array([[-0.08542071, -2.05598312,

2.1114733 ,

[-0.26118137, -1.22270427,

1.47083211,

0.7986635 ],
0.54234622],

[-0.6434473 , -2.31789529,

1.76499762,

0.57294532]])

sum and cumsum can both be used as function or as methods. When used as methods, the first input is the

axis so that sum(x,0) is the same as x.sum(0).

prod, cumprod
prod and cumprod behave similarly to sum and cumsum except that the product and cumulative product are

returned. prod and cumprod can be called as function or methods.

diff
diff computes the finite difference of a vector (also array) and returns n-1 an element vector when used on

an n element vector. diff operates on the last axis by default, and so diff(x) operates across columns and
returns x[:,1:size(x,1)]-x[:,:size(x,1)-1] for a 2-dimensional array. diff takes an optional keyword
75

argument axis so that diff(x, axis=0) will operate across rows. diff can also be used to produce higher
order differences (e.g. double difference).
>>> x= randn(3,4)
>>> x
array([[-0.08542071, -2.05598312,
[-0.17576066,

2.1114733 ,

0.7986635 ],

0.83327885, -0.64064119, -0.25631728],

[-0.38226593, -1.09519101,

0.29416551,

0.03059909]])

>>> diff(x) # Same as diff(x,1)


-0.62339964288008698
>>> diff(x, axis=0)
array([[-0.09033996,

2.88926197, -2.75211449, -1.05498078],

[-0.20650526, -1.92846986,

0.9348067 ,

0.28691637]])

>>> diff(x, 2, axis=0) # Double difference, column-by-column


array([[-0.11616531, -4.81773183,

3.68692119,

1.34189715]])

exp
exp returns the element-by-element exponential (e x ) for an array.

log
log returns the element-by-element natural logarithm (ln(x )) for an array.

log10
log10 returns the element-by-element base-10 logarithm (log10 (x )) for an array.

sqrt

sqrt returns the element-by-element square root (

x ) for an array.

square
square returns the element-by-element square (x 2 ) for an array, and is equivalent to calling x**2.0 when
x is an array (but not a matrix)

absolute, abs
abs and absolute returns the element-by-element absolute value for an array. Complex modulus is re-

turned when the input is complex valued (|a + b i | =

a 2 + b 2 ).

sign
sign returns the element-by-element sign function, defined as 0 if x = 0, and x /| x | otherwise.

76

6.4

Complex Values

real
real returns the real elements of a complex array. real can be called either as a function real(x) or as an

attribute x.real.

imag
imag returns the complex elements of a complex array. imag can be called either as a function imag(x) or

as an attribute x.imag.

conj, conjugate
conj returns the element-by-element complex conjugate for a complex array. conj can be called either as

a function conj(x) or as a method x.conj(). conjugate is identical to conj.

6.5

Set Functions

unique
unique returns the unique elements in an array. It only operates on the entire array. An optional second

argument can be returned which contains the original indices of the unique elements.
>>> x = repeat(randn(3),(2))
array([ 0.11335982,

0.11335982,

0.26617443,

0.26617443,

0.26617443,

1.34424621])

1.34424621,

1.34424621])
>>> unique(x)
array([ 0.11335982,

>>> y,ind = unique(x, True)


>>> ind
array([0, 2, 4], dtype=int64)
>>> x.flat[ind]
array([ 0.11335982,

0.26617443,

1.34424621])

in1d
in1d returns a Boolean array with the same size as the first input array indicating the elements which are

also in a second array.


>>> x = arange(10.0)
>>> y = arange(5.0,15.0)
>>> in1d(x,y)
array([False, False, False, False, False,

True,

77

True,

True,

True,

True], dtype=bool)

intersect1d
intersect1d is similar to in1d, except that it returns the elements rather than a Boolean array, and only

unique elements are returned. It is equivalent to unique(x.flat[in1d(x,y)]).


>>> x = arange(10.0)
>>> y = arange(5.0,15.0)
>>> intersect1d(x,y)
array([ 5.,

6.,

7.,

8.,

9.])

union1d
union1d returns the unique set of elements in 2 arrays.
>>> x = arange(10.0)
>>> y = arange(5.0,15.0)
>>> union1d(x,y)
array([

0.,

1.,

2.,

3.,

11.,

12.,

13.,

14.])

4.,

5.,

6.,

7.,

8.,

9.,

10.,

setdiff1d
setdiff1d returns the set of the elements which are only in the first array but not in the second array.
>>> x = arange(10.0)
>>> y = arange(5.0,15.0)
>>> setdiff1d(x,y)
array([ 0.,

1.,

2.,

3.,

4.])

setxor1d
setxor1d returns the set of elements which are in one (and only one) of two arrays.
>>> x = arange(10.0)
>>> y = arange(5.0,15.0)
>>> setxor1d(x,y)
array([

6.6

0.,

1.,

2.,

3.,

4.,

10.,

11.,

12.,

13.,

14.])

Sorting and Extreme Values

sort
sort sorts the elements of an array. By default, it sorts using the last axis of x. It uses an optional second

argument to indicate the axis to use for sorting (i.e. 0 for column-by-column, None for sorting all elements).
sort does not alter the input when called as function, unlike the method version of sort.
>>> x = randn(4,2)
>>> x
array([[ 1.29185667,

0.28150618],

[ 0.15985346, -0.93551769],

78

[ 0.12670061,

0.6705467 ],

[ 2.77186969, -0.85239722]])
>>> sort(x)
array([[ 0.28150618,

1.29185667],

[-0.93551769,

0.15985346],

[ 0.12670061,

0.6705467 ],

[-0.85239722,

2.77186969]])

>>> sort(x, 0)
array([[ 0.12670061, -0.93551769],
[ 0.15985346, -0.85239722],
[ 1.29185667,

0.28150618],

[ 2.77186969,

0.6705467 ]])

>>> sort(x, axis=None)


array([-0.93551769, -0.85239722,
0.6705467 ,

1.29185667,

0.12670061,

0.15985346,

0.28150618,

2.77186969])

ndarray.sort, argsort
ndarray.sort is a method for ndarrays which performs an in-place sort. It economizes on memory use,

although x.sort() is different from x after the function, unlike a call to sort(x). x.sort() sorts along
the last axis by default, and takes the same optional arguments as sort(x). argsort returns the indices
necessary to produce a sorted array, but does not actually sort the data. It is otherwise identical to sort,
and can be used either as a function or a method.
>>> x = randn(3)
>>> x
array([ 2.70362768, -0.80380223, -0.10376901])
>>> sort(x)
array([-0.80380223, -0.10376901,

2.70362768])

>>> x
array([ 2.70362768, -0.80380223, -0.10376901])
>>> x.sort() # In-place, changes x
>>> x
array([-0.80380223, -0.10376901,

2.70362768])

max, amax, argmax, min, amin, argmin


max and min return the maximum and minimum values from an array. They take an optional second ar-

gument which indicates the axis to use.


>>> x = randn(3,4)
>>> x
array([[-0.71604847,

0.35276614, -0.95762144,

79

0.48490885],

[-0.47737217,

1.57781686, -0.36853876,

[ 0.44921571, -0.03030771,

2.42351936],

1.28081091, -0.97422539]])

>>> amax(x)
2.4235193583347918
>>> x.max()
2.4235193583347918
>>> x.max(0)
array([ 0.44921571,

1.57781686,

1.28081091,

2.42351936])

2.42351936,

1.28081091])

>>> x.max(1)
array([ 0.48490885,

max and min can only be used on arrays as methods. When used as a function, amax and amin must be used
to avoid conflicts with the built-in functions max and min. This behavior is also seen in around and round.
argmax and argmin return the index or indices of the maximum or minimum element(s). They are used in
an identical manner to max and min, and can be used either as a function or method.

minimum, maximum
maximum and minimum can be used to compute the maximum and minimum of two arrays which are broad-

castable.
>>> x = randn(4)
>>> x
array([-0.00672734,

0.16735647,

0.00154181, -0.98676201])

array([-0.69137963, -2.03640622,

0.71255975, -0.60003157])

>>> y = randn(4)

>>> maximum(x,y)
array([-0.00672734,

6.7

0.16735647,

0.71255975, -0.60003157])

Nan Functions

NaN function are convenience function which act similarly to their non-NaN versions, only ignoring NaN
values (rather than propagating) when computing the function.

nansum
nansum is identical sum, except that NaNs are ignored. nansum can be used to easily generate other NaN-

functions, such as nanstd (standard deviation, ignoring nans) since variance can be implemented using 2
sums.
>>> x = randn(4)
>>> x[1] = nan
>>> x

80

array([-0.00672734,

nan,

0.00154181, -0.98676201])

>>> sum(x)
nan
>>> nansum(x)
-0.99194753275859726
>>> nansum(x) / sum(x[logical_not(isnan(x))])
1.0
>>> nansum(x) / sum(1-isnan(x)) # nanmean
-0.33064917999999999

nanmax, nanargmax, nanmin, nanargmin


nanmax, nanmin, nanargmax and nanargmin are identical to their non-NaN counterparts, except that NaNs
are ignored.

6.8

Functions and Methods/Properties

Many operations on NumPy arrays and matrices can be performed using a function or as a method of the
array. For example, consider reshape.
>>> x = arange(25.0)
>>> y = x.reshape((5,5))
>>> y
array([[

0.,

1.,

2.,

3.,

5.,

6.,

7.,

8.,

4.],
9.],

[ 10.,

11.,

12.,

13.,

14.],

[ 15.,

16.,

17.,

18.,

19.],

[ 20.,

21.,

22.,

23.,

24.]])

4.],

>>> z = reshape(x,(5,5))
>>> z
array([[

0.,

1.,

2.,

3.,

5.,

6.,

7.,

8.,

9.],

[ 10.,

11.,

12.,

13.,

14.],

[ 15.,

16.,

17.,

18.,

19.],

[ 20.,

21.,

22.,

23.,

24.]])

Both the function and method produce the same output and the choice of which to use is ultimately a
personal decision. I use both and the choice primarily depends on the context. For example, to get the
shape of an array, my preference is for x.shape over shape(x) since shape appears to be integral to x.1 On
the other hand, I prefer shape(y+z) over (y+z).shape due to the presence of the mathematical operation.
1

Formally shape is a property of an array, not a method since it does not require a function call.

81

6.9

Exercises

1. Construct each of the following sequences using linspace, arange and r_:
0, 1, . . . , 10
4, 5, 6, . . . , 13
0, .25, .5, .75, 1
0, 1, 2, . . . , 5
2. Show that logspace(0,2,21) can be constructed using linspace and 10 (and **). Similarly, show
how linsapce(2,10,51) can be constructed with logspace and log10.
3. Determine the differences between the rounding by applying round (or around), ceil and floor to
y = [0, 0.5, 1.5, 2.5, 1.0, 1.0001, 0.5, 1, 1.5, 2.5]
4. Prove the relationship that
math on an array.

Pn

j =1

j = n(n + 1)/2 for 0 n 10 using cumsum and directly using

5. randn(20) will generate an array containing draws from a standard normal random variable. If
x=randn(20), which element of y=cumsum(x) is the same as sum(x)?
6. cumsum computes the cumulative sum while diff computes the difference. Is diff(cumsum(x)) the
same as x? If not, how can a small modification be made to the this statement to recover x?
7. Compute the exp of
y = [ln 0.5 ln 1 ln e ]
Note: You should use log and the constant numpy.e to construct y.
8. What is absolute of 0.0, -3.14, and 3+4j?
9. Suppose x = [4 2 9 8 10]. What is the difference between y = sort(x) and x.sort()?
10. Using the same x as in the previous problem, find the max. Also, using argmax and a slice, retrieve
the same value.
11. Show that setdiff1d could be replaced with in1d and intersect1d using x = [1 2 3 4 5] and y =
[1 2 4 6]? How could setxor1d be replaced with union1d, intersect1d and in1d?
12. Suppose y = [nan 2.2 3.9 4.6 nan 2.4 6.1 1.8] . How can nansum be used to compute the variance or
the data? Note: sum(1-isnan(y)) will return the count of non-NaN values.

82

Chapter 7

Special Arrays
Functions are available to construct a number of useful, frequently encountered arrays.

ones
ones generates an array of 1s and is generally called with one argument, a tuple, containing the size of

each dimension. ones takes an optional second argument (dtype) to specify the data type. If omitted, the
data type is float.
>>> M, N = 5, 5
>>> x = ones((M,N)) # M by N array of 1s
>>> x =

ones((M,M,N)) # 3D array

>>> x =

ones((M,N), dtype=int32) # 32-bit integers

ones_like creates an array with the same shape and data type as the input. Calling ones_like(x) is equiv-

alent to calling ones(x.shape,x.dtype).

zeros
zeros produces an array of 0s in the same way ones produces an array of 1s, and commonly used to initialize an array to hold values generated by another procedure. zeros takes an optional second argument
(dtype) to specify the data type. If omitted, the data type is float.
>>> x = zeros((M,N)) # M by N array of 0s
>>> x = zeros((M,M,N)) # 3D array of 0s
>>> x = zeros((M,N),dtype=int64) # 64 bit integers
zeros_like creates an array with the same size and shape as the input. Calling zeros_like(x) is equivalent

to calling zeros(x.shape,x.dtype).

empty
empty produces an empty (uninitialized) array to hold values generated by another procedure. empty takes

an optional second argument (dtype) which specifies the data type. If omitted, the data type is float.
83

>>> x = empty((M,N)) # M by N empty array


>>> x = empty((N,N,N,N)) # 4D empty array
>>> x = empty((M,N),dtype=float32) # 32-bit floats (single precision)

Using empty is slightly faster than calling zeros since it does not assign 0 to all elements of the array
the empty array created will be populated with (essential random) non-zero values. empty_like creates an array with the same size and shape as the input. Calling empty_like(x) is equivalent to calling
empty(x.shape,x.dtype).

eye, identity
eye generates an identity array an array with ones on the diagonal, zeros everywhere else. Normally,

an identity array is square and so usually only 1 input is required. More complex zero-padded arrays
containing an identity matrix can be produced using optional inputs.
>>> In = eye(N)
identity is a virtually identical function with similar use, In = identity(N).

7.1

Exercises

1. Produce two arrays, one containing all zeros and one containing only ones, of size 10 5.
2. Multiply (linear algebra) these two arrays in both possible ways.
3. Produce an identity matrix of size 5. Take the exponential of this matrix, element-by-element.
4. How could ones and zeros be replaced with tile?
5. How could eye be replaced with diag and ones?
6. What is the value of y=empty((1,))? Is it the same as any element in y=empty((10,))?

84

Chapter 8

Array and Matrix Functions


Many functions operate exclusively on array inputs, including functions which are mathematical in nature, for example computing the eigenvalues and eigenvectors and functions for manipulating the elements of an array.

8.1

Views

Views are computationally efficient methods to produce objects of one type which behave as other objects
of another type without copying data. For example, an array x can always be converted to a matrix using
matrix(x), which will copy the elements in x. View fakes the call to matrix and only inserts a thin layer
so that x viewed as a matrix behaves like a matrix.

view
view can be used to produce a representation of an array, matrix or recarray as another type without copy-

ing the data. Using view is faster than copying data into a new class.
>>> x = arange(5)
>>> type(x)
numpy.ndarray
>>> x.view(matrix)
matrix([[0, 1, 2, 3, 4]])
>>> x.view(recarray)
rec.array([0, 1, 2, 3, 4])

asmatrix, mat
asmatrix and mat can be used to view an array as a matrix. This view is useful since matrix views will use

matrix multiplication by default.


>>> x = array([[1,2],[3,4]])
>>> x * x
array([[ 1,

# Element-by-element
4],

[ 9, 16]])

85

>>> mat(x) * mat(x) # Matrix multiplication


matrix([[ 7, 10],
[15, 22]])

Both commands are equivalent to using view(matrix).

asarray
asarray work in a similar matter as asmatrix, only that the view produced is that of ndarray. Calling
asarray is equivalent to using view(ndarray)

8.2

Shape Information and Transformation

shape
shape returns the size of all dimensions or an array or matrix as a tuple. shape can be called as a function

or an attribute. shape can also be used to reshape an array by entering a tuple of sizes. Additionally, the
new shape can contain -1 which indicates to expand along this dimension to satisfy the constraint that
the number of elements cannot change.
>>> x = randn(4,3)
>>> x.shape
(4L, 3L)
>>> shape(x)
(4L, 3L)
>>> M,N = shape(x)
>>> x.shape = 3,4
>>> x.shape
(3L, 4L)
>>> x.shape = 6,-1
>>> x.shape
(6L, 2L)

reshape
reshape transforms an array with one set of dimensions and to one with a different set, preserving the

number of elements. Arrays with dimensions M by N can be reshaped into an array with dimensions K
by L as long as M N = K L . The most useful call to reshape switches an array into a vector or vice versa.
>>> x = array([[1,2],[3,4]])
>>> y = reshape(x,(4,1))
>>> y
array([[1],
[2],
[3],

86

[4]])
>>> z=reshape(y,(1,4))
>>> z
array([[1, 2, 3, 4]])
>>> w = reshape(z,(2,2))
array([[1, 2],
[3, 4]])

The crucial implementation detail of reshape is that arrays are stored using row-major notation. Elements
in arrays are counted first across rows and then then down columns. reshape will place elements of the
old array into the same position in the new array and so after calling reshape, x (1) = y (1), x (2) = y (2),
and so on.

size
size returns the total number of elements in an array or matrix. size can be used as a function or an

attribute.
>>> x = randn(4,3)
>>> size(x)
12
>>> x.size
12

ndim
ndim returns the size of all dimensions or an array or matrix as a tuple. ndim can be used as a function or

an attribute .
>>> x = randn(4,3)
>>> ndim(x)
2
>>> x.ndim
2

tile
tile, along with reshape, are two of the most useful non-mathematical functions. tile replicates an array

according to a specified size vector. To understand how tile functions, imagine forming an array composed of blocks. The generic form of tile is tile(X , (M , N ) ) where X is the array to be replicated, M is
the number of rows in the new block array, and N is the number of columns in the new block array. For
example, suppose X was an array

"
X =

1 2
3 4

87

and the block array

"
Y =

X
X

X
X

X
X

was required. This could be accomplished by manually constructing y using hstack and vstack.
>>> x = array([[1,2],[3,4]])
>>> z = hstack((x,x,x))
>>> y = vstack((z,z))

However, tile provides a much easier method to construct y


>>> w = tile(x,(2,3))
>>> y - w
array([[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0]])
tile has two clear advantages over manual allocation: First, tile can be executed using parameters de-

termined at run-time, such as the number of explanatory variables in a model and second tile can be
used for arbitrary dimensions. Manual array construction becomes tedious and error prone with as few
as 3 rows and columns. repeat is a related function which copies data is a less useful manner.

ravel
ravel returns a flattened view (1-dimensional) of an array or matrix. ravel does not copy the underlying

data (when possible), and so it is very fast.


>>> x = array([[1,2],[3,4]])
>>> x
array([[ 1,

2],

[ 3, 4]])
>>> x.ravel()
array([1, 2, 3, 4])
>>> x.T.ravel()
array([1, 3, 2, 4])

flatten
flatten works much like ravel, only that is copies the array when producing the flattened version.

flat
flat produces a numpy.flatiter object (flat iterator) which is an iterator over a flattened view of an array.

Because it is an iterator, it is especially fast and memory friendly. flat can be used as an iterator in a for
loop or with slicing notation.
88

>>> x = array([[1,2],[3,4]])
>>> x.flat
<numpy.flatiter at 0x6f569d0>
>>> x.flat[2]
3
>>> x.flat[1:4] = -1
>>> x
array([[ 1, -1],
[-1, -1]])

broadcast, broadcast_arrays
broadcast can be used to broadcast two broadcastable arrays without actually copying any data. It returns

a broadcast object, which works like an iterator.


>>> x = array([[1,2,3,4]])
>>> y = reshape(x,(4,1))
>>> b = broadcast(x,y)
>>> b.shape
(4L, 4L)
>>> for u,v in b:
...

print(x: , u, y: ,v)

x:

y:

x:

y:

x:

y:

x:

y:

x:

y:

... ... ...


broadcast_arrays works similarly to broadcast, except that it copies the broadcast arrays into new arrays.
broadcast_arrays is generally slower than broadcast, and should be avoided if possible.
>>> x = array([[1,2,3,4]])
>>> y = reshape(x,(4,1))
>>> b = broadcast_arrays(x,y)
>>> b[0]
array([[1, 2, 3, 4],
[1, 2, 3, 4],
[1, 2, 3, 4],
[1, 2, 3, 4]])
>>> b[1]
array([[1, 1, 1, 1],
[2, 2, 2, 2],
[3, 3, 3, 3],
[4, 4, 4, 4]])

89

vstack, hstack
vstack, and hstack stack compatible arrays and matrices vertically and horizontally, respectively. Arrays

are vstack compatible if they have the same number of columns, and are hstack compatible if they have
the same number of rows. Any number of arrays can be stacked by placing the input arrays in a list or
tuple, e.g. (x,y,z).
>>> x = reshape(arange(6),(2,3))
>>> y = x
>>> vstack((x,y))
array([[0, 1, 2],
[3, 4, 5],
[0, 1, 2],
[3, 4, 5]])
>>> hstack((x,y))
array([[0, 1, 2, 0, 1, 2],
[3, 4, 5, 3, 4, 5]])

concatenate
concatenate generalizes vstack and hsplit to allow concatenation along any axis using the keyword ar-

gument axis.

split, vsplit, hsplit


vsplit and hsplit split arrays and matrices vertically and horizontally, respectively. Both can be used to
split an array into n equal parts or into arbitrary segments, depending on the second argument. If scalar,
the array is split into n equal sized parts. If a 1 dimensional array, the array is split using the elements of
the array as break points. For example, if the array was [2,5,8], the array would be split into 4 pieces using
[:2] , [2:5], [5:8] and [8:]. Both vsplit and hsplit are special cases of split, which can split along an
arbitrary axis.
>>> x = reshape(arange(20),(4,5))
>>> y = vsplit(x,2)
>>> len(y)
2
>>> y[0]
array([[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9]])
>>> y = hsplit(x,[1,3])
>>> len(y)
3
>>> y[0]
array([[ 0],
[ 5],

90

[10],
[15]])
>>> y[1]
array([[ 1,

2],

[ 6,

7],

[11, 12],
[16, 17]])

delete
delete removes values from an array, and is similar to splitting an array, and then concatenating the values
which are not deleted. The form of delete is delete(x,rc, axis) where rc are the row or column indices to
delete, and axis is the axis to use (0 or 1 for a 2-dimensional array). If axis is omitted, delete operated on
the flattened array.
>>> x = reshape(arange(20),(4,5))
>>> delete(x,1,0) # Same as x[[0,2,3]]
array([[ 0,

1,

2,

3,

4],

[10, 11, 12, 13, 14],


[15, 16, 17, 18, 19]])
>>> delete(x,[2,3],1) # Same as x[:,[0,1,4]]
array([[ 0,

1,

4],

[ 5,

6,

9],

[10, 11, 14],


[15, 16, 19]])
>>> delete(x,[2,3]) # Same as hstack((x.flat[:2],x.flat[4:]))
array([ 0,

1,

4,

5,

6,

7,

8,

9, 10, 11, 12, 13, 14, 15, 16, 17, 18,

19])

squeeze
squeeze removes singleton dimensions from an array, and can be called as a function or a method.
>>> x = ones((5,1,5,1))
>>> shape(x)
(5L, 1L, 5L, 1L)
>>> y = x.squeeze()
>>> shape(y)
(5L, 5L)
>>> y = squeeze(x)

fliplr, flipud
fliplr and flipud flip arrays in a left-to-right and up-to-down directions, respectively. flipud reverses

the elements in a 1-dimensional array, and flipud(x) is identical to x[::-1]. fliplr cannot be used with
91

1-dimensional arrays.
>>> x = reshape(arange(4),(2,2))
>>> x
array([[0, 1],
[2, 3]])
>>> fliplr(x)
array([[1, 0],
[3, 2]])
>>> flipud(x)
array([[2, 3],
[0, 1]])

diag
The behavior of diag differs depending depending on the form of the input. If the input is a square array, it
will return a column vector containing the elements of the diagonal. If the input is an vector, it will return
an array containing the elements of the vector along its diagonal. Consider the following example:
>>> x = array([[1,2],[3,4]])
>>> x
array([[1, 2],
[3, 4]])
>>> y = diag(x)
>>> y
array([1, 4])
>>> z = diag(y)
>>> z
array([[1, 0],
[0, 4]])

triu, tril
triu and tril produce upper and lower triangular arrays, respectively.
>>> x = array([[1,2],[3,4]])
>>> triu(x)
array([[1, 2],
[0, 4]])
>>> tril(x)
array([[1, 0],
[3, 4]])

92

8.3

Linear Algebra Functions

matrix_power
matrix_power raises a square array or matrix to an integer power, and matrix_power(x,n) is identical to
x**n.

svd
svd computes the singular value decomposition of a matrix X , defined as

X = U V
where is diagonal, and U and V are unitary arrays (orthonormal if real valued). SVDs are closely related
to eigenvalue decompositions when X is a real, positive definite matrix. The returned value is a tuple
containing (U,s,V) where = diag (s ).

cond
cond computes the condition number of a matrix, which measures how close to singular a matrix is. Lower

numbers indicate that the input is better conditioned (further from singular).
>>> x = matrix([[1.0,0.5],[.5,1]])
>>> cond(x)
3
>>> x = matrix([[1.0,2.0],[1.0,2.0]]) # Singular
>>> cond(x)
inf

slogdet
slogdet computes the sign and log of the absolute value of the determinant. slogdet is useful for com-

puting determinants which may be very large or small to avoid numerical problems.

solve
solve solves the system X = y when X is square and invertible so that the solution is exact.
>>> X = array([[1.0,2.0,3.0],[3.0,3.0,4.0],[1.0,1.0,4.0]])
>>> y = array([[1.0],[2.0],[3.0]])
>>> solve(X,y)
array([[ 0.625],
[-1.125],
[ 0.875]])

93

lstsq
lstsq solves the system X = y when X is n by k , n > k by finding the least squares solution. lstsq

returns a 4-element tuple where the first element is and the second element is the sum of squared residuals. The final two outputs are diagnostic the third is the rank of X and the fourth contains the singular
values of X .
>>> X = randn(100,2)
>>> y = randn(100)
>>> lstsq(X,y)
(array([ 0.03414346,

0.02881763]),

array([ 3.59331858]),
2,
array([ 3.045516

1.99327863]))array([[ 0.625],

[-1.125],
[ 0.875]])

cholesky
cholesky computes the Cholesky factor of a positive definite matrix or array. The Cholesky factor is a lower

triangular matrix and is defined as C in


CC0 =
where is a positive definite matrix.
>>> x = matrix([[1,.5],[.5,1]])
>>> C = cholesky(x)
>>> C*C.T - x
matrix([[ 1. ,
[ 0.5,

0.5],
1. ]])

det
det computes the determinant of a square matrix or array.
>>> x = matrix([[1,.5],[.5,1]])
>>> det(x)
0.75

eig
eig computes the eigenvalues and eigenvectors of a square matrix. When used with one output, the eigen-

values and eigenvectors are returned as a tuple.


>>> x = matrix([[1,.5],[.5,1]])
>>> val,vec = eig(x)
>>> vec*diag(val)*vec.T
matrix([[ 1. , 0.5],
[ 0.5,

1. ]])

eigvals can be used if only eigenvalues are needed.

94

eigh
eigh computes the eigenvalues and eigenvectors of a symmetric array. When used with one output, the

eigenvalues and eigenvectors are returned as a tuple. eigh is faster than eig for symmetrix inputs since it
exploits the symmetry of the input. eigvalsh can be used if only eigenvalues are needed from a symmetric
array.

inv
inv computes the inverse of an array. inv(R) can alternatively be computed using x**(-1) when x is a
matrix.
>>> x = array([[1,.5],[.5,1]])
>>> xInv = inv(x)
>>> dot(x,xInv)
array([[ 1.,
[ 0.,

0.],
1.]])

>>> x = asmatrix(x)
>>> x**(-1)*x
matrix([[ 1.,
[ 0.,

0.],
1.]])

kron
kron computes the Kronecker product of two arrays,

z =xy
and is written as z = kron(x,y).

trace
trace computes the trace of a square array (sum of diagonal elements). trace(x) equals sum(diag(x)).

matrix_rank
matrix_rank computes the rank of an array using a SVD.
>>> x = array([[1,.5],[1,.5]])
>>> x
array([[ 1. ,
[ 1. ,

0.5],
0.5]])

>>> matrix_rank(x)
1

95

8.4

Exercises

1. Let x = arange(12.0). Use both shape and reshape to produce 1 12, 2 6, 3 4,4 3, 6 2 and
2 2 3 versions or the array. Finally, return x to its original size.
2. Let x = reshape(arange(12.0),(4,3)). Use ravel, flatten and flat to extract elements 1, 3, . . ., 11
from the array (using a 0 index).
3. Let x be 2 by 2 array, y be a 1 by 1 array, and z be a 3 by 2 array. Construct

y
y

w =

y
y
z0
y

y
y

using hstack, vstack, and tile.


4. Let x = reshape(arange(12.0),(2,2,3)). What does squeeze do to x ?
5. How can a diagonal matrix containing the diagonal elements of

"
y =

2 .5
.5 4

be constructed using diag?


6. Using the y array from the previous problem, verify that cholesky work by computing the Cholesky
factor, and then multiplying to get y again.
7. Using the y array from the previous problem, verify that the sum of the eigenvalues is the same as
the trace, and the product of the eigenvalues is the determinant.
8. Using the y array from the previous problem, verify that the inverse of y is equal to V D 1 V 0 where
V is the array containing the eigenvectors, and D is a diagonal array containing the eigenvalues.
9. Simulate some data where x = randn(100,2), e = randn(100,1), B = array([[1],[0.5]]) and y =
x + . Use lstsq to estimate from x and y .
10. Suppose

5
1.5 3.5

y = 1.5
2
0.5
3.5 0.5
4
use matrix_rank to determine the rank of this array. Verify the results by inspecting the eigenvalues
using eig and check that the determinant is 0 using det.
11. Let x = randn(100,2). Use kron to compute
I 2 X
96

where X is the 2 by 2 covariance matrix of x .

97

98

Chapter 9

Importing and Exporting Data


9.1

Importing Data using pandas

Pandas is an increasingly important component of the Python scientific stack, and a complete discussion
of its main features is included in Chapter 17. All of the data readers in pandas load data into a pandas
DataFrame (see Section 17.1.2), and so these examples all make use of the values property to extract a
NumPy array. In practice, the DataFrame is much more useful since it includes useful information such
as column names read from the data source. In addition to the three formats presented here, pandas can
also read json, SQL, html tables or from the clipboard, which is particularly useful for interactive work
since virtually any source that can be copied to the clipboard can be imported.

9.1.1

CSV and other formatted text files

Comma-separated value (CSV) files can be read using read_csv. When the CSV file contains mixed data,
the default behavior will read the file into an array with an object data type, and so further processing is
usually required to extract the individual series.
>>> from pandas import read_csv
>>> csv_data = read_csv(FTSE_1984_2012.csv)
>>> csv_data = csv_data.values
>>> csv_data[:4]
array([[2012-02-15, 5899.9, 5923.8, 5880.6, 5892.2, 801550000L, 5892.2],
[2012-02-14, 5905.7, 5920.6, 5877.2, 5899.9, 832567200L, 5899.9],
[2012-02-13, 5852.4, 5920.1, 5852.4, 5905.7, 643543000L, 5905.7],
[2012-02-10, 5895.5, 5895.5, 5839.9, 5852.4, 948790200L, 5852.4]], dtype=object)
>>> open = csv_data[:,1]

When the entire file is numeric, the data will be stored as a homogeneous array using one of the numeric
data types, typically float64. In this example, the first column contains Excel dates as numbers, which are
the number of days past January 1, 1900.
>>> csv_data = read_csv(FTSE_1984_2012_numeric.csv)
>>> csv_data = csv_data.values
>>> csv_data[:4,:2]
array([[ 40954. ,

5899.9],

[ 40953. ,

5905.7],

99

9.1.2

[ 40952. ,

5852.4],

[ 40949. ,

5895.5]])

Excel files

Excel files, both 97/2003 (xls) and 2007/10/13 (xlsx), can be imported using read_excel. Two inputs are
required to use read_excel, the filename and the sheet name containing the data. In this example, pandas
makes use of the information in the Excel workbook that the first column contains dates and converts
these to datetimes. Like the mixed CSV data, the array returned has object data type.
>>> from pandas import read_excel
>>> excel_data = read_excel(FTSE_1984_2012.xls,FTSE_1984_2012)
>>> excel_data = excel_data.values
>>> excel_data[:4,:2]
array([[datetime.datetime(2012, 2, 15, 0, 0), 5899.9],
[datetime.datetime(2012, 2, 14, 0, 0), 5905.7],
[datetime.datetime(2012, 2, 13, 0, 0), 5852.4],
[datetime.datetime(2012, 2, 10, 0, 0), 5895.5]], dtype=object)
>>> open = excel_data[:,1]

9.1.3

STATA files

Pandas also contains a method to read STATA files.


>>> from pandas import read_stata
>>> stata_data = read_stata(FTSE_1984_2012.dta)
>>> stata_data = stata_data.values
>>> stata_data[:4,:2]
array([[

0.00000000e+00,

4.09540000e+04],

1.00000000e+00,

4.09530000e+04],

2.00000000e+00,

4.09520000e+04],

3.00000000e+00,

4.09490000e+04]])

9.2

Importing Data without pandas

Importing data without pandas ranges from easy when files contain only numbers to difficult, depending
on the data size and format. A few principles can simplify this task:
The file imported should contain numbers only, with the exception of the first row which may contain the variable names.
Use another program, such as Microsoft Excel, to manipulate data before importing.
Each column of the spreadsheet should contain a single variable.
Dates should be converted to YYYYMMDD, a numeric format, before importing. This can be done
in Excel using the formula:
=10000*YEAR(A1)+100*MONTH(A1)+DAY(A1)+(A1-FLOOR(A1,1))

100

Store times separately from dates using a numeric format such as seconds past midnight or HHmmSS.sss.

9.2.1

CSV and other formatted text files

A number of importers are available for regular (e.g. all rows have the same number of columns) commaseparated value (CSV) data. The choice of which importer to use depends on the complexity and size of the
file. Purely numeric files are the simplest to import, although most files which have a repeated structure
can be directly imported (unless they are very large).

loadtxt
loadtxt is a simple, fast text importer. The basic use is loadtxt(filename), which will attempt to load the

data in file name as floats. Other useful named arguments include delim, which allow the file delimiter to
be specified, and skiprows which allows one or more rows to be skipped.
loadtxt requires the data to be numeric and so is only useful for the simplest files.
>>> data = loadtxt(FTSE_1984_2012.csv,delimiter=,) # Error
ValueError: could not convert string to float: Date
# Fails since CSV has a header
>>> data = loadtxt(FTSE_1984_2012_numeric.csv,delimiter=,) # Error
ValueError: could not convert string to float: Date
>>> data = loadtxt(FTSE_1984_2012_numeric.csv,delimiter=,,skiprows=1)
>>> data[0]
array([ 4.09540000e+04, 5.89990000e+03, 5.92380000e+03, 5.88060000e+03, 5.89220000e+03,
8.01550000e+08, 5.89220000e+03])

genfromtxt
genfromtxt is a slightly slower, more robust importer. genfromtxt is called using the same syntax as loadtxt,

but will not fail if a non-numeric type is encountered. Instead, genfromtxt will return a NaN (not-anumber) for fields in the file it cannot read.
>>> data = genfromtxt(FTSE_1984_2012.csv,delimiter=,)
>>> data[0]
array([ nan,

nan,

nan,

nan,

nan,

nan,

nan])

>>> data[1]
array([ nan, 5.89990000e+03, 5.92380000e+03, 5.88060000e+03, 5.89220000e+03, 8.01550000e+08,
5.89220000e+03])

Tab delimited data can be read in a similar manner using delimiter=\t.


>>> data = genfromtxt(FTSE_1984_2012_numeric_tab.txt,delimiter=\t)

101

csv2rec
csv2rec is an even more robust and slower CSV importer which imports non-numeric data such as

dates. It attempts to find the best data type for each column. Note that when pandas is available, read_csv
is a better option than csv2rec.
>>> data = csv2rec(FTSE_1984_2012.csv,delimiter=,)
>>> data[0]
(datetime.date(2012, 2, 15), 5899.9, 5923.8, 5880.6, 5892.2, 801550000L, 5892.2)

Unlike loadtxt and genfromtxt, which both return an array, csv2rec returns a record array (see Chapter 16) which is, in many ways, like a list. csv2rec converted each row of the input file into a datetime
(see Chapter 14), followed by 4 floats for open, high, low and close, then a long integer for volume, and
finally a float for the adjusted close. When the data contain non-numeric values, returned array is not
homogeneous, and so it is necessary to create an array to store the numeric content of the imported data.
>>> open = data[open]
>>> open
array([ 5899.9,

9.2.2

5905.7,

5852.4, ...,

1095.4,

1095.4,

1108.1])

Excel Files

xlrd
Reading Excel files in Python is more involved, and it is simpler to convert the xls to CSV. Excel files can be
read using xlrd (which is part of xlutils).
from __future__ import print_function
import xlrd
wb = xlrd.open_workbook(FTSE_1984_2012.xls)
# To read xlsx change the filename
# wb = xlrd.open_workbook(FTSE_1984_2012.xlsx)
sheetNames = wb.sheet_names()
# Assumes 1 sheet name
sheet = wb.sheet_by_name(sheetNames[0])
excelData = [] # List to hold data
for i in xrange(sheet.nrows):
excelData.append(sheet.row_values(i))
# Subtract 1 since excelData has the header row
open = empty(len(excelData) - 1)
for i in xrange(len(excelData) - 1):
open[i] = excelData[i+1][1]

The listing does a few things. First, it opens the workbook for reading (open_workbook(FTSE_1984_2012.xls)),
then it gets the sheet names (wb.sheet_names()) and opens a sheet (wb.sheet_by_name(sheetNames[0])).
From the sheet, it gets the number of rows (sheet.nrows), and fills a list with the values, row-by-row. Once
the data has been read-in, the final block fills an array with the opening prices. This is substantially more
complicated than importing from a CSV file, although reading Excel files is useful for automated work (e.g.
you have no choice but to import from an Excel file since it is produced by some other software).
102

openpyxl
openpyxl reads and write the modern Excel file format that is the default in Office 2007 or later. openpyxl
also supports a reader and writer which is optimized for large files, a feature not available in xlrd. Unfortunately, openpyxl uses a different syntax from xlrd, and so some modifications are required when using
openpyxl.
from __future__ import print_function
import openpyxl
wb = openpyxl.load_workbook(FTSE_1984_2012.xlsx)
sheetNames = wb.get_sheet_names()
# Assumes 1 sheet name
sheet = wb.get_sheet_by_name(sheetNames[0])
rows = sheet.rows
# Subtract 1 since excelData has the header row
open = empty(len(rows) - 1)
for i in xrange(len(excelData) - 1):
open[i] = rows[i+1][1].value

The strategy with 2007/10/13 xlsx files is essentially the same as with 97/2003 files. The main difference is
that the command sheet.rows() returns a tuple containing the all of the rows in the selected sheet. Each
row is itself a tuple which contains Cells (which are a type created by openpyxl), and each cell has a value
(Cells also have other useful attributes such as data_type and methods such as is_date()) .
Using the optimized reader is similar. The primary differences are:
The workbook must be opened using the keyword argument use_iterators = True
The rows are sequentially accessible using iter_rows().
value is not available, and so internal_value must be used.
The number of rows is not known, and so it isnt possible to pre-allocate the storage variable with
the correct number of rows.
from __future__ import print_function
import openpyxl
wb = openpyxl.load_workbook(FTSE_1984_2012.xlsx, use_iterators = True)
sheetNames = wb.get_sheet_names()
# Assumes 1 sheet name
sheet = wb.get_sheet_by_name(sheetNames[0])
# Use list to store data
open = []
# Changes since access is via memory efficient iterator
# Note () on iter_rows
for row in sheet.iter_rows():
# Must use internal_value

103

open.append(row[1].internal_value)
# Burn first row and convert to array
open = array(open[1:])

9.2.3

MATLAB Data Files (.mat)

SciPy enables MATLAB data files (mat files) to be read excluding except the latest V7.3 format, which can
be read using PyTables or h5py. Data from compatible mat files can be loaded using loadmat. The data is
loaded into a dictionary, and individual variables are accessed using the keys of the dictionary.
>>> import scipy.io as sio
>>> matData = sio.loadmat(FTSE_1984_2012.mat)
>>> type(matData)
dict
>>> matData.keys()
[volume,
__header__,
__globals__,
high,
adjclose,
low,
close,
__version__,
open]
>>> open = matData[open]

MATLAB data files in the newest V7.3 format can be easily read using PyTables.
>>> import tables
>>> matfile = tables.openFile(FTSE_1984_2012_v73.mat)
>>> matfile.root
/ (RootGroup)
children := [volume (CArray), high (CArray), adjclose (CArray), low (CArray),
close (CArray), open (CArray)]
>>> matfile.root.open
/open (CArray(1, 7042), zlib(3))
atom := Float64Atom(shape=(), dflt=0.0)
maindim := 0
flavor := numpy
byteorder := little
chunkshape := (1, 7042)
>>> open = matfile.root.open.read()
open = matfile.root.open.read()
>>> matfile.close() # Close the file

104

9.2.4

Reading Complex Files

Python can be programmed to read any text file format since it contains functions for directly accessing
files and parsing strings. Reading poorly formatted data files is an advanced technique and should be
avoided if possible. However, some data is only available in formats where reading in data line-by-line is
the only option. For example, the standard import methods fail if the raw data is very large (too large for
Excel) and is poorly formatted. In this case, the only possibility may be to write a program to read the file
line-by-line (or in blocks) and to directly process the raw text.
The file IBM_TAQ.txt contains a simple example of data that is difficult to import. This file was downloaded from Wharton Research Data Services and contains all prices for IBM from the TAQ database between January 1, 2001 and January 31, 2001. It is too large to use in Excel and has both numbers, dates
and text on each line. The following code block shows one method for importing this data set.
import io
from numpy import array
f = io.open(IBM_TAQ.txt, r)
line = f.readline()
# Burn the first list as a header
line = f.readline()
date = []
time = []
price = []
volume = []
while line:
data = line.split(,)
date.append(int(data[1]))
price.append(float(data[3]))
volume.append(int(data[4]))
t = data[2]
time.append(int(t.replace(:,)))
line = f.readline()
# Convert to arrays, which are more useful than lists
# for numeric data
date = array(date)
price = array(price)
volume = array(volume)
time = array(time)
allData = array([date,price,volume,time])
f.close()

This block of code does a few thing:


Open the file directly using file
Reads the file line by line using readline
105

Initializes lists for all of the data


Rereads the file parsing each line by the location of the commas using split(,) to split the line at
each comma into a list
Uses replace(:,) to remove colons from the times
Uses int() and float() to convert strings to numbers
Closes the file directly using close()

9.3

Saving or Exporting Data using pandas

Pandas supports writing to CSV, general delimited text files, Excel files, json, html tables, HDF5 and STATA.
An understanding of the pandas DataFrame is required prior to using pandas file writing facilities, and
Chapter 17 provides further information.

9.4

Saving or Exporting Data without pandas

Native NumPy Format


A number of options are available for saving data. These include using native npz data files, MATLAB data
files, CSV or plain text. Multiple numpy arrays can be saved using savez_compressed (numpy.savez_compressed).
x = arange(10)
y = zeros((100,100))
savez_compressed(test,x,y)
data = load(test.npz)
# If no name is given, arrays are generic names arr_1, arr_2, etc
x = data[arr_1]
savez_compressed(test,x=x,otherData=y)
data = load(test.npz)
# x=x provides the name x for the data in x
x = data[x]
# otherDate = y saves the data in y as otherData
y = data[otherData]

A version which does not compress data but is otherwise identical is savez. Compression is usually a good
idea and is very helpful for storing arrays which have repeated values and are large.

9.4.1

Writing MATLAB Data Files (.mat)

SciPy enables MATLAB data files to be written. Data can be written using savemat, which takes two inputs,
a file name and a dictionary containing data, in its simplest form.
from __future__ import print_function
import scipy.io as sio

106

x = array([1.0,2.0,3.0])
y = zeros((10,10))
# Set up the dictionary
saveData = {x:x, y:y}
sio.savemat(test,saveData,do_compression=True)
# Read the data back in
matData = sio.loadmat(test.mat)
savemat uses the optional argument do_compression = True, which compresses the data, and is generally
a good idea on modern computers and/or for large datasets.

9.4.2

Exporting Data to Text Files

Data can be exported to a tab-delimited text files using savetxt. By default, savetxt produces tab delimited files, although then can be changed using the names argument delimiter.
x = randn(10,10)
# Save using tabs
savetxt(tabs.txt,x)
# Save to CSV
savetxt(commas.csv,x,delimiter=,)
# Reread the data
xData = loadtxt(commas.csv,delimiter=,)

9.5

Exercises

Note: There are no exercises using pandas in this chapter. For exercises using pandas to read or write data,
see Chapter 17.
1. The file exercise3.xls contains three columns of data, the date, the return on the S&P 500, and the
return on XOM (ExxonMobil). Using Excel, convert the date to YYYYMMDD format and save the
file.
2. Save the file as both CSV and tab delimited. Use the three text readers to read the file, and compare
the arrays returned.
3. Parse loaded data into three variables, dates, SP500 and XOM.
4. Save NumPy, compressed NumPy and MATLAB data files with all three variables. Which files is the
smallest?
5. Construct a new variable, sumreturns as the sum of SP500 and XOM. Create another new variable,
outputdata as a horizontal concatenation of dates and sumreturns.
6. Export the variable outputdata to a new CSV file using savetxt.
7. (Difficult) Read in exercise3.xls directly using xlrd.
8. (Difficult) Save exercise3.xls as exercise3.xlsx and read in directly using openpyxl.

107

108

Chapter 10

Inf, NaN and Numeric Limits


10.1 inf and NaN
inf represents infinity and inf is distinct from -inf. inf can be constructed in a number for ways, for

example or exp(710). nan stands for Not a Number, and nans are created whenever a function produces
a result that cannot be clearly evaluated to produce a number or infinity. For example, inf/inf results in
nan. nans often cause problems since most mathematical operations involving a nan produce a nan.
>>> x = nan
>>> 1.0 + x
nan
>>> 1.0 * x
nan
>>> 0.0 * x
nan
>>> mean(x)
nan

10.2

Floating point precision

All numeric software has limited precision; Python is no different. The easiest to understand the upper
and lower limits, which are 1.7976 10308 (see finfo(float).max) and 1.7976 10308 (finfo(float).min).
Numbers larger (in absolute value) than these are inf. The smallest positive number that can be expressed
is 2.2250 10308 (see finfo(float).tiny). Numbers between 2.2251 10308 and 2.2251 10308 are
numerically 0.
However, the hardest concept to understand about numerical accuracy is the limited relative precision which is 2.2204 1016 on most x86 and x86_64 systems. This value is returned from the command
finfo(float).eps and may vary based on the type of CPU and/or the operating system used. Numbers
which differ by less than 2.2204 1016 are numerically the same. To explore the role of precision, examine
the results of the following:
>>> x = 1.0

109

>>> eps = finfo(float).eps


>>> x = x+eps/2
>>> x == 1
True
>>> x-1
0.0
>>> x = 1 + 2*eps
>>> x == 1
False
>>> x-1
ans = 4.4408920985006262e-16

Moreover, any number y where y < x 2.2204 1016 is treated as 0 when added or subtracted.
This is referred to as relative range.

>>> x=10
>>> x+2*eps
>>> x-10
0
>>> (x-10) == 0
True
>>> (1e120 - 1e103) == 1e120
True
>>> 1e103 / 1e120
1e-17

In the first example, eps/2<eps when compared to 1 so it has no effect while 2*eps>eps and so this value
is different from 1. In the second example, 2*eps/10<eps, it has no effect when added. The final example
subtracts 10103 from 10120 and shows that this is numerically the same as 10120 again, this occurs since
10103 /10120 = 1017 <eps. While numeric limits is a tricky concept to understand, failure to understand
these limits can results in errors in code that appears to be otherwise correct. The practical usefulness of
limited precision is to consider data scaling since many variables have natural scales which are differ by
many orders of magnitude.

10.3

Exercises

Let eps = finfo(float).eps in the following exercises.


1. What is the value of log(exp(1000)) both analytically and in Python? Why do these differ?
2. Is eps/10 different from 0? If x = 1 + eps/10 - 1, is x different from 0?
3. Is 1-eps/10-1 difference from 0? What about 1-1-eps/10?
4. Is .1 different from .1+eps/10?
110

5. Is x = 10.0**120 (1 10120 ) different from y = 10.0**120 + 10.0**102? (Hint: Test with x == y)


6. Why is x = 10**120 (1 10120 ) different from y = 10**120 + 10**102?
7. Suppose x = 2.0. How many times (n) can x = 1.0 + (x-1.0)/2.0 be run before x==1 shows True?
What is the value of 2.0**(-n). Is this value surprising?

111

112

Chapter 11

Logical Operators and Find


Logical operators are useful when writing batch files or custom functions. Logical operators, when combined with flow control, allow for complex choices to be compactly expressed.

11.1 >, >=, <, <=, ==, !=


The core logical operators are
Symbol
>
>=
<

Function
greater
greater_equal
less

<=

less_equal

==

equal

!=

not_equal

Definition
Greater than
Greater than or equal to
Less than
Less than or equal to
Equal to
Not equal to

Logical operators can be used on scalars, arrays or matrices. All comparisons are done element-byelement and return either True or False. For example, suppose x and y are arrays which are broadcastable.
z= x < y will be an array of the same size as broadcast(x,y).shape composed of True and False. Alternatively, if one is scalar, say y, then the elements of z are z[i,j] = x[i,j] < y. For instance, suppose
z = xL y where L is one of the logical operators above such as < or ==. The following table examines the
behavior when x and/or y are scalars or arrays. Suppose z = x < y:
y

Scalar
x

Array

Scalar
Any
z =x <y
Any
z i j = xi j < y

Array
Any
z i j = x < yi j
Broadcastable
z i j = xi j < yi j

where x and y are the post-broadcasting versions of x and y . Logical operators are frequently used in portions of programs known as flow control (e.g. if ... else ... blocks) which are be discussed in Chapter
13. It is important to remember that array logical operations return arrays and that flow control blocks
require scalar logical expressions.
113

>>> x = array([[1,2],[-3,-4]])
>>> x > 0
array([[ True,

True],

[False, False]], dtype=bool)


>>> x == -3
array([[False, False],
[ True, False]], dtype=bool)
>>> y = array([1,-1])
>>> x < y # y broadcast to be (2,2)
array([[False, False],
[ True,

True]], dtype=bool)

>>> z = array([[1,1],[-1,-1]]) # Same as broadcast y


>>> x < z
array([[False, False],
[ True,

11.2

True]], dtype=bool)

and, or, not and xor

Logical expressions can be combined using four logical devices,


Keyword (Scalar)

Function

Bitwise

and

logical_and

&

or

logical_or

not

logical_not

logical_xor

True if . . .
Both True
Either or Both True
Not True
One True and One False

There are three versions of all operators except XOR. The keyword version (e.g. and) can only be used
with scalars and so it not useful when working with NumPy. Both the function and bitwise operators
can be used with NumPy arrays, although care is requires when using the bitwise operators. Bitwise operators have high priority higher than logical comparisons and so parentheses are requires around
comparisons. For example, (x>1) & (x<5) is a valid statement, while x>1 & x<5, which is evaluated as
(x>(1 & x))<5, produces an error.
>>> x = arange(-2.0,4)
>>> y = x >= 0
>>> z = x < 2
>>> logical_and(y, z)
array([False, False,

True,

True, False, False], dtype=bool)

True,

True, False, False], dtype=bool)

True,

True, False, False], dtype=bool)

>>> y & z
array([False, False,
>>> (x > 0) & (x < 2)
array([False, False,

114

>>> x > 0 & x < 4 # Error


TypeError: ufunc bitwise_and not supported for the input types, and the inputs could not
be safely coerced to any supported types according to the casting rule safe
>>> ~(y & z) # Not
array([ True,

True, False, False,

True,

True], dtype=bool)

These operators follow the same rules as most mathematical operators on arrays, and so require the broadcastable input arrays.

11.3

Multiple tests

all and any


The commands all and any take logical input and are self-descriptive. all returns True if all logical elements in an array are 1. If all is called without any additional arguments on an array, it returns True if all
elements of the array are logical true and 0 otherwise. any returns logical(True) if any element of an array
is True. Both all and any can be also be used along a specific dimension using a second argument or the
keyword argument axis to indicate the axis of operation (0 is column-wise and 1 is row-wise). When used
column- or row-wise, the output is an array with one less dimension than the input, where each element
of the output contains the truth value of the operation on a column or row.
>>> x = array([[1,2][3,4]])
>>> y = x <= 2
>>> y
array([[ True,

True],

[False, False]], dtype=bool)


>>> any(y)
True
>>> any(y,0)
array([[ True,

True]], dtype=bool)

>>> any(y,1)
array([[ True],
[False]], dtype=bool)

allclose
allclose can be used to compare two arrays for near equality. This type of function is important when

comparing floating point values which may be effectively the same although not identical.
>>> eps = np.finfo(np.float64).eps
>>> eps
2.2204460492503131e-16
>>> x = randn(2)
>>> y = x + eps

115

>>> x == y
array([False, False], dtype=bool)
>>> allclose(x,y)
True

The tolerance for being close can be set using keyword arguments either relatively (rtol) or absolutely
(atol).

array_equal
array_equal tests if two arrays have the same shape and elements. It is safer than comparing arrays di-

rectly since comparing arrays which are not broadcastable produces an error.

array_equiv
array_equiv tests if two arrays are equivalent, even if they do not have the exact same shape. Equivalence

is defined as one array being broadcastable to produce the other.


>>> x = randn(10,1)
>>> y = tile(x,2)
>>> array_equal(x,y)
False
>>> array_equiv(x,y)
True

11.4 is*

A number of special purpose logical tests are provided to determine if an array has special characteristics.
Some operate element-by-element and produce an array of the same dimension as the input while other
produce only scalars. These functions all begin with is.
Operator
isnan
isinf
isfinite
isposfin,isnegfin
isreal
iscomplex
isreal
is_string_like
is_numlike
isscalar
isvector

True if . . .
1 if nan
1 if inf
1 if not inf and not nan
1 for positive or negative inf
1 if not complex valued
1 if complex valued
1 if real valued
1 if argument is a string
1 if is a numeric type
1 if scalar
1 if input is a vector
116

Method of operation
element-by-element
element-by-element
element-by-element
element-by-element
element-by-element
element-by-element
element-by-element
scalar
scalar
scalar
scalar

x=array([4,pi,inf,inf/inf])
isnan(x)
array([[False, False, False,

True]], dtype=bool)

isinf(x)
array([[False, False,

True, False]], dtype=bool)

isfinite(x)
array([[ True,

True, False, False]], dtype=bool)

isnan(x) isinf(x) isfinite(x) always equals True for elements of a numeric array, implying any ele-

ment falls into one (and only one) of these categories.

11.5

Exercises

1. Using the data file created in Chapter 9, count the number of negative returns in both the S&P 500
and ExxonMobil.
2. For both series, create an indicator variable that takes the value 1 if the return is larger than 2 standard deviations or smaller than -2 standard deviations. What is the average return conditional on
falling each range for both returns.
3. Construct an indicator variable that takes the value of 1 when both returns are negative. Compute
the correlation of the returns conditional on this indicator variable. How does this compare to the
correlation of all returns?
4. What is the correlation when at least 1 of the returns is negative?
5. What is the relationship between all and any. Write down a logical expression that allows one or
the other to be avoided (i.e. write def myany(x) and def myall(y)).

117

118

Chapter 12

Advanced Selection and Assignment


Elements from NumPy arrays can be selected using four methods: scalar selection, slicing, numerical (or
list-of-locations) indexing and logical (or Boolean) indexing. Chapter 4 described scalar selection and slicing, which are the basic methods to access elements in an array. Numerical indexing and logical indexing
are closely related and allow for more flexible selection. Numerical indexing uses lists or arrays of locations
to select elements while logical indexing uses arrays containing Boolean values to select elements.

12.1

Numerical Indexing

Numerical indexing, also called list-of-location indexing, is an alternative to slice notation. The fundamental idea underlying numerical indexing is to use coordinates to select elements, which is similar to
the underlying idea behind slicing. Numerical indexing differs from standard slicing in three important
ways:
Arrays created using numerical indexing are copies of the underlying data, while slices are views
(and so do not copy the data). This means that while changing elements in a slice also changes
elements in the slices parent, changing elements in an array constructed using numerical indexing
does not. This also can create performance concerns and slicing should generally be used whenever
it is capable of selecting the required elements.
Numerical indices can contain repeated values and are not required to be monotonic, allowing for
more flexible selection. The sequences produced using slice notation are always monotonic with
unique values.
The shape of the array selected is determined by the shape of the numerical indices. Slices are similar to 1-dimensional arrays but the shape of the slice is determined by the slice inputs.
Numerical indexing in 1-dimensional arrays uses the numerical index values as locations in the array (0based indexing) and returns an array with the same dimensions as the numerical index. To understand the
core concept behind numerical indexing, consider the case of selecting 4 elements form a 1-dimensional
array with locations i 1 , . . ., i 4 . Numerical indexing uses the four indices and arranges them to determine
the shape (and order) of the output. For example, if the order was

"

i3
i4

i2
i1

119

then the array selected would be 2 by 2 with elements

"

xi 3
xi 4

xi 2
xi 1

#
.

Numerical indexing allows for arbitrary shapes and repetition, and so the selection matrix

i3

i4
i4

i2
i1
i1

i3
i3
i4

i2

i2
i1

could be used to produce a 4 by 2 array containing the corresponding elements of x . In these examples
the indices are not used in any particular order and are repeated to highlight the flexibility of numerical
indexing.
Note that the numerical index can be either a list or a NumPy array and must contain integer data.
>>> x = 10 * arange(5.0)
>>> x[[0]] # List with 1 element
array([ 0.])
>>> x[[0,2,1]] # List
array([ 0.,

20.,

10.])

>>> sel = array([4,2,3,1,4,4]) # Array with repetition


>>> x[sel]
array([ 40.,

20.,

30.,

10.,

40.,

40.])

>>> sel = array([[4,2],[3,1]]) # 2 by 2 array


>>> x[sel] # Selection has same size as sel
array([[ 40.,
[ 30.,

20.],
10.]])

>>> sel = array([0.0,1]) # Floating point data


>>> x[sel] # Error
IndexError: arrays used as indices must be of integer (or boolean) type
>>> x[sel.astype(int)] # No error
array([ 10.,

20.])

>>> x[0] # Scalar selection, not numerical indexing


1.0

These examples show that the numerical indices determine the element location and the shape of the
numerical index array determines the shape of the output. The final three examples show slightly different
behavior. The first two of these demonstrate that only integer arrays can be used in numerical indexing,
while the final example shows that there is a subtle difference between x[[0]] (or x[array([0])]), which is
using numerical indexing and x[0] which is using a scalar selector. x[[0]] returns a 1-dimensional array
since the list has 1 dimension while x[0] returns a non-array (or scalar or 0-dimensional array) since the
input is not a list or array.
120

Numerical indexing in 2- or higher-dimensional arrays uses numerical index arrays for each dimension. The fundamental idea behind numerical indexing in 2-dimensional arrays is to format coordinate
pairs of the form (i k , jk ) into separate arrays. The size of the arrays will determine the shape of the array
selected. For example, if the two selection arrays were
[i 1 , i 3 , i 2 , i 4 ] and [ j1 , j3 , j2 , j4 ]
then a 1-dimensional array would be selected containing the elements
[x (i i , ji ) , x (i 3 , j3 ) , x (i 2 , j2 ) , x (i 4 , j4 )] .
In practice multidimensional indexing is more flexible that this simple example since the arrays used as
selectors can have either the same shape or can be broadcastable (see Section 5.2).
Consider the following four examples.
>>> x = reshape(arange(10.0), (2,5))
>>> x
array([[ 0.,

1.,

2.,

3.,

4.],

[ 5.,

6.,

7.,

8.,

9.]])

>>> sel = array([0,1])


>>> x[sel,sel] # 1-dim arrays, no broadcasting
array([ 0.,

6.])

>>> x[sel, sel+1]


array([ 1.,

7.])

>>> sel_row = array([[0,0],[1,1]])


>>> sel_col = array([[0,1],[0,1]])
>>> x[sel_row,sel_col] # 2 by 2, no broadcasting
array([[ 0.,
[ 5.,

1.],
6.]])

>>> sel_row = array([[0],[1]])


>>> sel_col = array([[0,1]])
>>> x[sel_row,sel_col] # 2 by 1 and 1 by 2 - difference shapes, broadcasted as 2 by 2
array([[ 0.,
[ 5.,

1.],
6.]])

In the first example, sel is a 1-dimensional array containing [0,1], and so the returned value is also a
1-dimensional array containing the (0, 0) and (1, 1) elements of x. Numerical indexing uses the array in
the first position to determine row locations and the array in the second position to determine column
locations. The first element of the row selection is paired with the first element of column selection (as is
the second element). This is why x[sel,sel+1] selects the elements in the (0, 1) and (1, 2) positions (1 and
7, respectively). The third example uses 2-dimensional arrays and selects the elements (0, 0), (0, 1), (1, 0)
and (1, 1). The final example also uses 2-dimensional arrays but with different sizes 2 by 1 and 1 by 2
which are broadcastable to a common shape of 2 by 2 arrays.
Next, consider what happens when non-broadcastable arrays are used in as numerical indexing.
121

>>> sel_row = array([0,1]) # 1-dimensional with shape (2,)


>>> sel_col = array([1,2,3]) # 1-dimensional with shape (3,)
>>> x[sel_row,sel_col] # Error
ValueError: shape mismatch: objects cannot be broadcast to a single shape

An error occurs since these two 1-dimensional arrays are not broadcastable. ix_ can be used to easily select rows and columns using numerical indexing by translating the 1-dimesnional arrays to be the correct
size for broadcasting.
>>> x[ix_([0,1],[1,2,3])]
array([[ 2.,

3.,

4.],

[ 7.,

8.,

9.]])

12.1.1

Mixing Numerical Indexing with Scalar Selection

NumPy permits using difference types of indexing in the same expression. Mixing numerical indexing
with scalar selection is trivial since any scalar can be broadcast to any array shape.
>>> sel=array([[1],[2]]) # 2 by 1
>>> x[0,sel] # Row 0, elements sel
array([[ 1.],
[ 2.]])
>>> sel_row = array([[0],[0]])
>>> x[sel_row,sel] # Identical
array([[ 1.],
[ 2.]])

12.1.2

Mixing Numerical Indexing with Slicing

Mixing numerical indexing and slicing allow for entire rows or columns to be selected.
>>> x[:,[1]]
array([[ 2.],
[ 7.]])
>>> x[[1],:]
array([[

6.,

7.,

8.,

9.,

10.]])

Note that the mixed numerical indexing and slicing uses a list ([1]) so that it is not a scalar. This is important since using a scalar will result in dimension reduction.
>>> x[:,1] # 1-dimensional
array([ 2.,

7.])

Numerical indexing and slicing can be mixed in more than 2-dimensions, although some care is required.
In the simplest case where only one numerical index is used which is 1-dimensional, then the selection is
equivalent to calling ix_ where the slice a:b:s is replaced with arange(a,b,s).
>>> x = reshape(arange(3**3), (3,3,3)) # 3-d array
>>> sel1 = x[::2,[1,0],:1]
>>> sel2 = x[ix_(arange(0,3,2),[1,0],arange(0,1))]

122

>>> sel1.shape
(2L, 2L, 1L)
>>> sel2.shape
(2L, 2L, 1L)
>>> amax(abs(sel1-sel2))
0

When more than 1 numerical index is used, the selection can be viewed as a 2-step process.
1. Select using only slice notation where the dimensions using numerical indexing use the slice :.
2. Apply the numerical indexing to the array produced in step 1.
>>> sel1 = x[[0,0],[1,0],:1]
>>> step1 = x[:,:,:1]
>>> step2 = x[[0,0],[1,0],:]
>>> step2.shape
(2L, 1L)
>>> amax(abs(sel1-step2))
0

In the previous example, the shape of the output was (2L, 1L) which may seem surprising since the
numerical indices where both 1-dimensional arrays with 2 elements. The extra dimension comes from
the slice notation which always preserves its dimension. In the next example, the output is 3-dimensional
since the numerical indices are 1-dimensional and the 2 slices preserve their dimension.
>>> x = reshape(arange(4**4), (4,4,4,4))
>>> sel = x[[0,1],[0,1],:2,:2] # 1-dimensional numerical and 2 slices
>>> sel.shape
(2L, 2L, 2L)

It is possible to mix multidimensional numerical indexing with slicing and multidimensional arrays. This
type of selection is not explicitly covered since describing the output is complicated and this type of selection is rarely encountered.

12.1.3

Linear Numerical Indexing using flat

Like slicing, numerical indexing can be combined with flat to select elements from an array using the
row-major ordering of the array. The behavior of numerical indexing with flat is identical to that of using
numerical indexing on a flattened version of the underlying array.
>>> x.flat[[3,4,9]]
array([

4.,

5.,

10.])

>>> x.flat[[[3,4,9],[1,5,3]]]
array([[

4.,

5.,

2.,

6.,

10.],
4.]])

123

12.1.4

Mixing Numerical Indexing with Slicing and Scalar Selection

Mixing the three is identical to using numerical indexing and slicing since the scalar selection is always
broadcast to be compatible with the numerical indices.

12.2

Logical Indexing

Logical indexing differs from slicing and numeric indexing by using logical indices to select elements, rows
or columns. Logical indices act as light switches and are either on (True) or off (False). Pure logical
indexing uses a logical indexing array with the same size as the array being used for selection and always
returns a 1-dimensional array.
>>> x = arange(-3,3)
>>> x < 0
array([ True,

True,

True, False, False, False], dtype=bool)

>>> x[x < 0]


array([-3, -2, -1])
>>> x[abs(x) >= 2]
array([-3, -2,

2])

>>> x = reshape(arange(-8, 8), (4,4))


>>> x[x < 0]
array([-8, -7, -6, -5, -4, -3, -2, -1])

It is tempting to use two 1-dimensional logical arrays to act as row and column masks on a 2-dimensional
array. This does not work, and it is necessary to use ix_ if interested in this type of indexing.
>>> x = reshape(arange(-8,8),(4,4))
>>> cols = any(x < -6, 0)
>>> rows = any(x < 0, 1)
>>> cols
array([ True,

True, False, False], dtype=bool

>>> rows
array([ True,

True, False, False], dtype=bool)

>>> x[cols,rows] # Not upper 2 by 2


array([-8, -3])
>>> x[ix_(cols,rows)] # Upper 2 by 2
array([[-8, -7],
[-4, -3]])

The difference between the final 2 commands is due to how logical indexing operates when more than
logical array is used. When using 2 or more logical indices, they are first transformed to numerical indices using nonzero which returns the locations of the non-zero elements (which correspond to the True
elements of a Boolean array).
>>> cols.nonzero()

124

(array([0, 1], dtype=int64),)


>>> rows.nonzero()
(array([0, 1], dtype=int64),)

The corresponding numerical index arrays have compatible sizes both are 2-element, 1-dimensional
arrays and so numeric selection is possible. Attempting to use two logical index arrays which have
non-broadcastable dimensions produces the same error as using two numerical index arrays with nonbroadcastable sizes.
>>> cols = any(x < -6, 0)
>>> rows = any(x < 4, 1)
>>> rows
array([ True,

True,

True, False], dtype=bool)

>>> x[cols,rows] # Error


ValueError: shape mismatch: objects cannot be broadcast to a single shape

12.2.1

Mixing Logical Indexing with Scalar Selection

Logical indexing can be combined with scalar selection to select elements from a specific row or column
in a 2-dimensional array. Combining these two types of indexing is no different from first applying the
scalar selection to the array and then applying the logical indexing.
>>> x = reshape(arange(-8,8), (4,4))
>>> x
array([[-8, -7, -6, -5],
[-4, -3, -2, -1],
[ 0,

1,

2,

3],

[ 4,

5,

6,

7]])

>>> sum(x, 0)
array([-8, -4,

0,

4])

>>> sum(x, 0) >= 0


array([False, False,

True,

True], dtype=bool)

>>> x[0,sum(x, 0) >= 0]


array([-6, -5])

12.2.2

Mixing Logical Indexing with Slicing

Logical indexing can be freely mixed with slices by using 1-dimensional logical index arrays which act as
selectors for columns or rows.
>>> sel = sum(x < -1, 0) >= 2
>>> sel
array([ True,

True,

True, False], dtype=bool)

>>> x[:,sel] # All rows, sel selects columns

125

array([[-8, -7, -6],


[-4, -3, -2],
[ 0,

1,

2],

[ 4,

5,

6]])

>>> x[1:3,sel] # Rows 1 and 2, sel selects columns


array([[-4, -3, -2],
[ 0,

1,

2]])

>>> x[sel,2:] # sel selects rows, columns 2 and 3


array([[-6, -5],
[-2, -1],
[ 2,

12.2.3

3]])

Mixing Logical Indexing with Numerical Indexing

Mixing numerical indexing and logical indexing behaves identically to numerically indexing where the
logical index is converted to a numerical index using nonzero. It must be the case that the array returned
by nonzero and the numerical index arrays are broadcastable.
>>> sel = array([True,True,False,False])
>>> sel.nonzero()
(array([0, 1], dtype=int64),)
>>> x[[2,3],sel] # Elements (2,0) and (3,1)
array([0, 5])
>>> x[[2,3],[0,1]] # Identical
array([0, 5])

12.2.4

Logical Indexing Functions

nonzero and flatnonzero


nonzero is an useful function for working with multiple data series. nonzero takes logical inputs and re-

turns a tuple containing the indices where the logical statement is true. This tuple is suitable for indexing
so that the corresponding elements can be accessed using x[indices].
>>> x = array([[1,2],[3,4]])
>>> sel = x <= 3
>>> indices = nonzero(sel)
>>> indices
(array([0, 0, 1], dtype=int64), array([0, 1, 0], dtype=int64))
>>> x[indices]
array([[1, 2, 3]])
flatnonzero is similar to nonzero except that the indices returned are for the flattened version of the input.
>>> flatnonzero(sel)
array([0, 1, 2], dtype=int64)

126

>>> x.flat[flatnonzero(sel)]
array([1, 2, 3])

argwhere
argwhere returns an array containing the locations of elements where a logical condition is True. It is the
same as transpose(nonzero(x))
>>> x = randn(3)
>>> x
array([-0.5910316 ,

0.51475905,

0.68231135])

>>> argwhere(x<0.6)
array([[0],
[1]], dtype=int64)
>>> argwhere(x<-10.0) # Empty array
array([], shape=(0L, 1L), dtype=int64)
>>> x = randn(3,2)
>>> x
array([[ 0.72945913,

1.2135989 ],

[ 0.74005449, -1.60231553],
[ 0.16862077,

1.0589899 ]])

>>> argwhere(x<0)
array([[1, 1]], dtype=int64)
>>> argwhere(x<1)
array([[0, 0],
[1, 0],
[1, 1],
[2, 0]], dtype=int64)

extract
extract is similar to argwhere except that it returns the values where the condition is true rather than the

indices.
>>> x = randn(3)
>>> x
array([-0.5910316 ,

0.51475905,

0.68231135])

>>> extract(x<0, x)
array([-0.5910316])
>>> extract(x<-10.0, x) # Empty array
array([], dtype=float64)

127

>>> x = randn(3,2)
>>> x
array([[ 0.72945913,

1.2135989 ],

[ 0.74005449, -1.60231553],
[ 0.16862077,

1.0589899 ]])

>>> extract(x>0,x)
array([ 0.72945913,

12.3

1.2135989 ,

0.74005449,

0.16862077,

1.0589899 ])

Performance Considerations and Memory Management

Arrays constructed using any numerical indexing and/or logical indexing are always copies of the underlying array. This is different from the behavior of slicing and scalar selection which returns a view, not a
copy, of an array. This is easily verified by selecting the same elements using different types of selectors.
>>> x = reshape(arange(9), (3,3))
>>> s_slice = x[:1,:] # Pure slice
>>> s_scalar = x[0] # Scalar selection
>>> s_numeric = x[[0],:] # Numeric indexing
>>> s_logical = x[array([True,False,False]),:] # Logical indexing
>>> s_logical[0,0] = -40
>>> s_numeric[0,0] = -30
>>> s_numeric # -30
array([[-10,

1,

2]])

>>> s_logical # -40, not -30


array([[-40,

1,

2]])

>>> s_scalar[0] = -10


>>> s_scalar
array([-10,

1,

2])

>>> x # Has a -10


array([[-10,

1,

2],

3,

4,

5],

6,

7,

8]])

>>> s_slice # Has a -10


array([[-10,

1,

2]])

Since both numerical and logical indexing produce copies, some care is needed when using these selectors
on large arrays.

12.4

Assignment with Broadcasting

Any of the selection methods can be used for assignment. When the shape of the array to be assigned is
the same as the selection, the assignment simply replaces elements using an element-by-element correspondence.
128

>>> x = arange(-2,2.0)
>>> x
array([-2., -1.,

0.,

1.])

>>> x[0] = 999 # Scalar


>>> x
array([999., -1.,

0.,

1.])

>>> x[:2] = array([99.0,99]) # Slice


>>> x
array([ 99.,

99.,

0.,

1.])

>>> x[[0,1,2]] = array([-3.14,-3.14,-3.14]) # Numerical indexing


>>> x
array([-3.14, -3.14, -3.14,

1.

])

>>> x[x<0] = zeros(3) # Logical indexing


array([ 0.,

0.,

0.,

1.])

Assignment is not limited to arrays with exact shape matches, and any assignment where two conditions
are met is allowed:
Each dimension of the array to be assigned is either 1 or matches the selection.
The array to be assigned and the selection are broadcastable.
These two conditions ensure that the array to be assigned can be broadcast up to the shape of the selection
it is not sufficient that the selection and the array to be assigned are simply broadcastable. The simplest
form of broadcasting assigns a scalar to a selection, and since a scalar can always be broadcast to any
shape this is always possible.
>>> x = arange(-2,2.0)
>>> x[:2] = 99.0
>>> x
array([ 99.,

99.,

0.,

1.])

>>> x = log(x-2.0)
>>> x
array([ 4.57471098,

4.57471098,

nan,

nan])

>>> x[isnan(x)] = 0 # Logical indexing


>>> x
array([ 4.57471098,

4.57471098,

0.

0.

])

>>> x.shape = (2,2)


>>> x[:,:] = 3.14 # Could also use x[:]
>>> x
array([[ 3.14,
[ 3.14,

3.14],
3.14]])

While broadcasting a scalar is the most frequently encountered case, there are useful applications of vector
(or 1-dimensional array) to 2-dimensional array assignment. For example, it may be necessary to replace
all rows in an array where some criteria is met in the row.
129

>>> x = reshape(arange(-10,10.0),(4,5))
array([[-10.,

-9.,

-8.,

-7.,

-6.],

[ -5.,

-4.,

-3.,

-2.,

-1.],

0.,

1.,

2.,

3.,

4.],

5.,

6.,

7.,

8.,

9.]])

>>> x[sum(x,1)<0,:] = arange(5.0) # Replace rows w/ negative sum


>>> x = reshape(arange(-10,10.0),(4,5))
>>> x[:,sum(x,1)<0] = arange(4.0) # Error
ValueError: array is not broadcastable to correct shape
>>> x[:,sum(x,1)<0] = reshape(arange(4.0),(4,1)) # Correct col replacement
array([[ 0.,

0., -8., -7., -6.],

[ 1.,

1., -3., -2., -1.],

[ 2.,

2.,

2.,

3.,

4.],

[ 3.,

3.,

7.,

8.,

9.]])

The error in the previous example occurs because the slice selects a 4 by 2 array, but the array to be assigned
is 1-dimensional with 4 elements. The rules of broadcasting always prepend 1s when determining whether
two arrays are broadcastable, and so the 1-dimensional array is considered to be a 1 by 4 array, which is
not broadcastable to a 4 by 2 array. Using an explicitly 2-dimensional array with shape 4 by 1 allows for
broadcasting.

12.5

Exercises

Let x=arange(10.0), y=reshape(arange(25.0),(5,5)) and z=reshape(arange(64.0),(4,4,4)) in all exercises.


1. List all methods to select 4.0 from x.
2. List all methods to select the first 5 elements of x.
3. List all methods to select every second element of x.
4. List all methods to select the row 2 from y.
5. List all methods to select the rows 2 and 4 from y.
6. List all methods to select the rows 2 and 4 and columns 2, 3 and 4 from y.
7. Select all rows of y which have at least one number divisible by 5 and at least one divisible by 7.
8. List all the methods to select panel 1 from z.
9. List all the methods to select rows 2 and 3 from all panels of z.
10. Assign 0 to every second element of z. List the alternative methods.
11. Assign [1, 1, 1, 1] to all rows of z which have at least one number divisible by 4 and one divisible by 6. For example, the row containing [16, 17, 18, 19] satisfies this criteria.
130

12. (Difficult) Define sel = array([[0,1],[1,0]]), What shape does y[sel,:] have? Can this be explained?

131

132

Chapter 13

Flow Control, Loops and Exception Handling


The previous chapter explored one use of logical variables, selecting elements from an array. Flow control
also utilizes logical variables to allow different code to be executed depending on whether certain conditions are met. Flow control in Python comes in two forms - conditional statement and loops.

13.1

Whitespace and Flow Control

Python uses white space changes to indicate the start and end of flow control blocks, and so indention
matters. For example, when using if . . . elif . . . else blocks, all of the control blocks must have the same
indentation level and all of the statements inside the control blocks should have the same level of indentation. Returning to the previous indentation level instructs Python that the block is complete. Best practice
is to only use spaces (and not tabs), and to use 4 spaces when starting a indented level, which is a good
balance between readability and wasted space.

13.2 if . . . elif . . . else


if . . . elif . . . else blocks always begin with an if statement immediately followed by a scalar logical

expression. elif and else are optional and can always be replicated using nested if statements at the
expense of more complex logic and deeper nesting. The generic form of an if . . . elif . . . else block is
if logical_1:
Code to run if logical_1
logical_2:

elif

Code to run if logical_2 and not logical_1


logical_3:

elif

Code to run if logical_3 and not logical_1 or logical_2


...
...
else:
Code to run if all previous logicals are false

However, simpler forms are more common,


if logical:
Code to run if logical true

133

or
if logical:
Code to run if logical true
else:
Code to run if logical false

Important: Remember that all logicals should be scalar logical values. While it is possible to use arrays
containing a single element, attempting to use an array with more than 1 element results in an error.
A few simple examples
>>> x = 5
>>> if x<5:
...

x += 1

... else:
...

x -= 1

>>> x
4

and
>>> x = 5;
>>> if x<5:
...

x = x + 1

... elif x>5:


...

x = x - 1

... else:
...

x = x * 2

>>> x
10

These examples have all used simple logical expressions. However, any scalar logical expressions, such
as (y<0 or y>1), (x<0 or x>1) and (y<0 or y>1) or isinf(x) or isnan(x), can be used in if . . . elif . . .
else blocks.

13.3 for
for loops begin with for item in iterable:, and the generic structure of a for loop is
for item in iterable:
Code to run

item is an element from iterable, and iterable can be anything that is iterable in Python. The most common
examples are xrange or range, lists, tuples, arrays or matrices. The for loop will iterate across all items in
iterable, beginning with item 0 and continuing until the final item. When using multidimensional arrays,
only the outside dimension is directly iterable. For example, if x is a 2-dimensional array, then the iterable
elements are x[0], x[1] and so on.
count = 0
for i in xrange(100):

134

count += i
count = 0
x = linspace(0,500,50)
for i in x:
count += i
count = 0
x = list(arange(-20,21))
for i in x:
count += i

The first loop will iterate over i = 0, 1, 2,. . . , 99. The second loops over the values produced by the
function linspace, which returns an array with 50 uniformly spaced points between 0 and 500, inclusive.
The final loops over x, a vector constructed from a call to list(arange(-20,21)), which produces a list
containing the series 20,19,. . . , 0, . . .19,20. All three range, arrays, and lists are iterable. The key to
understanding for loop behavior is that for always iterates over the elements of the iterable in the order
they are presented (i.e. iterable[0], iterable[1], . . .).

Note: This chapter exclusively uses xrange in loops rather than range.
xrange is the preferred iterator in Python 2.7 since it avoids large memory allocations. range
has replaced xrange in Python 3.

Python 2.7 vs. 3

Loops can also be nested


count = 0
for i in xrange(10):
for j in xrange(10):
count += j

or can contain flow control variables


returns = randn(100)
count = 0
for ret in returns:
if ret<0:
count += 1

This for expression can be equivalently expressed using xrange as the iterator and len to get the number
of items in the iterable.
returns = randn(100)
count = 0
for i in xrange(len(returns)):
if returns[i]<0:
count += 1

Finally, these ideas can be combined to produce nested loops with flow control.
x = zeros((10,10))
for i in xrange(size(x,0)):

135

for j in xrange(size(x,1)):
if i<j:
x[i,j]=i+j;
else:
x[i,j]=i-j

or loops containing nested loops that are executed based on a flow control statement.
x = zeros((10,10))
for i in xrange(size(x,0)):
if (i % 2) == 1:
for j in xrange(size(x,1)):
x[i,j] = i+j
else:
for j in xrange(int(i/2)):
x[i,j] = i-j

Important: The iterable variable should not be reassigned once inside the loop. Consider, for example,
x = range(10)
for i in x:
print(i)
print(Length of x:, len(x))
x = range(5)

This produces the output


# Output
0
Length of x: 10
1
Length of x: 5
2
Length of x: 5
3
...
8
Length of x: 5
9
Length of x: 5

It is not safe to modify the sequence of the iterable when looping over it. The means that the iterable
should not change size, which can occur when using a list and the functions pop(), insert() or append()
or the keyword del. The loop below would never terminate (except for the if statement that breaks the
loop) since L is being extended each iteration.
L = [1, 2]
for i in L:
print(i)
L.append(i+2)
if i>5:
break

136

Finally, for loops can be used with 2 items when the iterable is wrapped in enumerate, which allows the
elements of the iterable to be directly accessed, as well as their index in the iterable.
x = linspace(0,100,11)
for i,y in enumerate(x):
print(i is :, i)
print(y is :, y)

13.3.1

Whitespace

Like if . . . elif . . . else flow control blocks, for loops are whitespace sensitive. The indentation of the line
immediately below the for statement determines the indentation that all statements in the block must
have.

13.3.2 break
A loop can be terminated early using break. break is usually used after an if statement to terminate the
loop prematurely if some condition has been met.
x = randn(1000)
for i in x:
print(i)
if i > 2:
break

Since for loops iterate over an iterable with a fixed size, break is generally more useful in while loops.

13.3.3 continue
continue can be used to skip an iteration of a loop, immediately returning to the top of the loop using the

next item in iterable. continue is commonly used to avoid a level of nesting, such as in the following two
examples.
x = randn(10)
for i in x:
if i < 0:
print(i)
for i in x:
if i >= 0:
continue
print(i)

Avoiding excessive levels of indentation is essential in Python programming 4 is usually considered the
maximum reasonable level. continue is particularly useful since it can be used to in a for loop to avoid
one level of indentation.

13.4 while
while loops are useful when the number of iterations needed depends on the outcome of the loop con-

tents. while loops are commonly used when a loop should only stop if a certain condition is met, such as
137

when the change in some parameter is small. The generic structure of a while loop is
while logical:
Code to run
Update logical

Two things are crucial when using a while loop: first, the logical expression should evaluate to true
when the loop begins (or the loop will be ignored) and second, the inputs to the logical expression must
be updated inside the loop. If they are not, the loop will continue indefinitely (hit CTRL+C to break an
interminable loop in IPython). The simplest while loops are (wordy) drop-in replacements for for loops:
count = 0
i = 1
while i<10:
count += i
i += 1

which produces the same results as


count=0;
for i in xrange(0,10):
count += i
while loops should generally be avoided when for loops are sufficient. However, there are situations where

no for loop equivalent exists.


# randn generates a standard normal random number
mu = abs(100*randn(1))
index = 1
while abs(mu) > .0001:
mu = (mu+randn(1))/index
index=index+1

In the block above, the number of iterations required is not known in advance and since randn is a standard
normal pseudo-random number, it may take many iterations until this criteria is met. Any finite for loop
cannot be guaranteed to meet the criteria.

13.4.1 break
break can be used in a while loop to immediately terminate execution. Normally, break should not be

used in a while loop instead the logical condition should be set to False to terminate the loop. However,
break can be used to avoid running code below the break statement even if the logical condition is False.
condition = True
i = 0
x = randn(1000000)
while condition:
if x[i] > 3.0:
break # No printing if x[i] > 3
print(x[i])
i += 1

It is better to update the logical statement which determines whether the while loop should execute.
138

i = 0
while x[i] <= 3:
print(x[i])
i += 1

13.4.2 continue
continue can be used in a while loop to skip any remaining code in the loop, immediately returning to the

top of the loop, which then checks the while condition, and executes the loop if it still true. Using continue
when the logical condition in the while loop is False is the same as using break.

13.5 try . . . except


Exception handling is an advanced programming technique which can be used to make code more resilient (often at the cost of speed). try . . . except blocks are useful for running code which may fail for
reasons outside of the programmers control. In most numerical applications, code should be deterministic and so dangerous code can usually be avoided. When it cant, for example, if reading data from a data
source which isnt always available (e.g. a website), then try . . . except can be used to attempt to execute
the code, and then to do something if the code fails to execute. The generic structure of a try . . . except
block is
try:
Dangerous Code
except ExceptionType1:
Code to run if ExceptionType1 is raised
except ExceptionType2:
Code to run if ExceptionType1 is raised
...
...
except:
Code to run if an unlisted exception type is raised

A simple example of exception handling occurs when attempting to convert text to numbers.
text = (a,1,54.1,43.a)
for t in text:
try:
temp = float(t)
print(temp)
except ValueError:
print(Not convertable to a float)

13.6

List Comprehensions

List comprehensions are an optimized method of building a list which may simplify code when an iterable
object is looped across and the results are saved to a list, possibly conditional on some logical test. Simple
list can be used to convert a for loop which includes an append into a single line statement.
139

>>> x = arange(5.0)
>>> y = []
>>> for i in xrange(len(x)):
...

y.append(exp(x[i]))

>>> y
[1.0,
2.7182818284590451,
7.3890560989306504,
20.085536923187668,
54.598150033144236]
>>> z = [exp(x[i]) for i in xrange(len(x))]
>>> z
[1.0,
2.7182818284590451,
7.3890560989306504,
20.085536923187668,
54.598150033144236]

This simple list comprehension saves 2 lines of typing. List comprehensions can also be extended to include a logical test.
>>> x = arange(5.0)
>>> y = []
>>> for i in xrange(len(x)):
...

if floor(i/2)==i/2:

...

y.append(x[i]**2)

>>> y
[0.0, 4.0, 16.0]
>>> z = [x[i]**2 for i in xrange(len(x)) if floor(i/2)==i/2]
>>> z
[0.0, 4.0, 16.0]

List comprehensions can also be used to loop over multiple iterable inputs.
>>> x1 = arange(5.0)
>>> x2 = arange(3.0)
>>> y = []
>>> for i in xrange(len(x1)):
...

for j in xrange(len(x2)):

...

y.append(x1[i]*x2[j])

>>> y
[0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 0.0, 2.0, 4.0, 0.0, 3.0, 6.0, 0.0, 4.0, 8.0]
>>> z = [x1[i]*x2[j] for i in xrange(len(x1)) for j in xrange(len(x2))]
[0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 0.0, 2.0, 4.0, 0.0, 3.0, 6.0, 0.0, 4.0, 8.0]
>>> # Only when i==j
>>> z = [x1[i]*x2[j] for i in xrange(len(x1)) for j in xrange(len(x2)) if i==j]
[0.0, 1.0, 4.0]

140

While list comprehensions are powerful methods to compactly express complex operations, they are never
essential to Python programming.

13.7

Tuple, Dictionary and Set Comprehensions

The other mutable Python structures, the dictionary and the set, support construction using comprehension, as does the immutable type tuple. Set and dictionary comprehensions use {} while tuple comprehensions require an explicit call to tuple since () has another meaning.
>>> x = arange(-5.0,5.0)
>>> z_set = {x[i]**2.0 for i in xrange(len(x))}
>>> z_set
{0.0, 1.0, 4.0, 9.0, 16.0, 25.0}
>>> z_dict = {i:exp(i) for i in x}
{-5.0: 0.006737946999085467,
-4.0: 0.018315638888734179,
-3.0: 0.049787068367863944,
-2.0: 0.1353352832366127,
-1.0: 0.36787944117144233,
...
>>> z_tuple = tuple(i**3 for i in x)
(-125.0, -64.0, -27.0, -8.0, -1.0, 0.0, 1.0, 8.0, 27.0, 64.0)

13.8

Exercises

1. Write a code block that would take a different path depending on whether the returns on two series
are simultaneously positive, both are negative, or they have different signs using an if . . . elif . . .
else block.
2. Simulate 1000 observations from an ARMA(2,2) where t are independent standard normal innovations. The process of an ARMA(2,2) is given by
yt = 1 yt 1 + 2 yt 2 + 1 t 1 + 2 t 2 + t
Use the values 1 = 1.4, 2 = .8, 1 = .4 and 2 = .8. Note: A T vector containing standard
normal random variables can be simulated using e = randn(T). When simulating a process, always
simulate more data than needed and throw away the first block of observations to avoid start-up
biases. This process is fairly persistent, at least 100 extra observations should be computed.
3. Simulate a GARCH(1,1) process where t are independent standard normal innovations. A GARCH(1,1)
process is given by
yt = t t

2t = + yt21 + 2t 1
141

Use the values = 0.05, = 0.05 and = 0.9, and set h0 = / (1 ).


4. Simulate a GJR-GARCH(1,1,1) process where t are independent standard normal innovations. A
GJR-GARCH(1,1) process is given by
yt = t t

2t = + yt21 + yt21 I[yt 1 <0] + 2t 1


Use the values = 0.05, = 0.02 = 0.07 and = 0.9 and set h0 = / 1 21 . Note
that some form of logical expression is needed in the loop. I[t 1 <0] is an indicator variable that takes
the value 1 if the expression inside the [ ] is true.

5. Simulate a ARMA(1,1)-GJR-GARCH(1,1)-in-mean process,


yt = 1 yt 1 + 1 t 1 t 1 + 2t + t t

2t = + 2t 1 2t 1 + 2t 1 2t 1 I[t 1 <0] + 2t 1
Use the values from Exercise 4 for the GJR-GARCH model and use the 1 = 0.1, 1 = 0.4 and
= 0.03.
6. Find two different methods to use a for loop to fill a 5 5 array with i j where i is the row index,
and j is the column index. One will use xrange as the iterable, and the other should directly iterate
on the rows, and then the columns of the matrix.
7. Using a while loop, write a bit of code that will do a bisection search to invert a normal CDF. A
bisection search cuts the interval in half repeatedly, only keeping the sub interval with the target
in it. Hint: keep track of the upper and lower bounds of the random variable value and use flow
control. This problem requires stats.norm.cdf.
8. Test out the loop using by finding the inverse CDF of 0.01, 0.5 and 0.975. Verify it is working by taking
the absolute value of the difference between the final value and the value produced by stats.norm.ppf.
9. Write a list comprehension that will iterate over a 1-dimensional array and extract the negative elements to a list. How can this be done using only logical functions (no explicit loop), without the list
comprehension (and returning an array)?

142

Chapter 14

Dates and Times


Date and time manipulation is provided by a built-in Python module datetime. This chapter assumes that
datetime has been imported using import datetime as dt.

14.1

Creating Dates and Times

Dates are created using date by providing integer values for year, month and day and times are created
using time using hours, minutes, seconds and microseconds.
>>> import datetime as dt
>>> yr, mo, dd = 2012, 12, 21
>>> dt.date(yr, mo, dd)
datetime.date(2012, 12, 21)
>>> hr, mm, ss, ms= 12, 21, 12, 21
>>> dt.time(hr, mm, ss, ms)
dt.time(12,21,12,21)

Dates created using date do not allow times, and dates which require a time stamp can be created using
datetime, which combine the inputs from date and time, in the same order.
>>> dt.datetime(yr, mo, dd, hr, mm, ss, ms)
datetime.datetime(2012, 12, 21, 12, 21, 12, 21)

14.2

Dates Mathematics

Date-times and dates (but not times, and only within the same type) can be subtracted to produce a
timedelta, which consists of three values, days, seconds and microseconds. Time deltas can also be added
to dates and times compute different dates although date types will ignore any information in the time
delta hour or millisecond fields.
>>> d1 = dt.datetime(yr, mo, dd, hr, mm, ss, ms)
>>> d2 = dt.datetime(yr + 1, mo, dd, hr, mm, ss, ms)
>>> d2-d1
datetime.timedelta(365)

143

Date Unit

Common Name

Range

Time Unit

Common Name

Range

Y
M
W
D

Year
Month
Week
Day

9.2 1018 years


7.6 1017 years
1.7 1017 years
2.5 1016 years

h
m
s
ms
us
ns
ps
fs
as

Hour
Minute
Second
Millisecond
Microsecond
Nanosecond
Picosecond
Femtosecond
Attosecond

1.0 1015 years


1.7 1013 years
2.9 1012 years
2.9 109 years
2.9 106 years
292 years
106 days
2.6 hours
9.2 seconds

Table 14.1: NumPy datetime64 range. The absolute range is January 1, 1970 plus the range.
>>> d2 + dt.timedelta(30,0,0)
datetime.datetime(2014, 1, 20, 12, 21, 12, 20)
>>> dt.date(2012,12,21) + dt.timedelta(30,12,0)
datetime.date(2013, 1, 20)

If times stamps are important, date types can be promoted to datetime using combine and a time.
>>> d3 = dt.date(2012,12,21)
>>> dt.datetime.combine(d3, dt.time(0))
datetime.datetime(2012, 12, 21, 0, 0)

Values in dates, times and datetimes can be modified using replace through keyword arguments.
>>> d3 = dt.datetime(2012,12,21,12,21,12,21)
>>> d3.replace(month=11,day=10,hour=9,minute=8,second=7,microsecond=6)
datetime.datetime(2012, 11, 10, 9, 8, 7, 6)

14.3

Numpy datetime64

Version 1.7.0 of NumPy introduces a NumPy native datetime type known as datetime64 (to distinguish it
from the usual datetime type). The NumPy datetime type is considered experimental and is not fully supported in the scientific python stack at the time of writing these notes. This said, it is already widely used
and should see complete support in the near future. Additionally, the native NumPy data type is generally
better suited to data storage and analysis and extends the Python datetime with additional features such
as business day functionality.
NumPy contains both datetime (datetime64) and timedelta (timedelta64) objects. These differ from
the standard Python datetime since they always store the datetime or timedelta using a 64-bit integer plus
a date or time unit. The choice of the date/time unit affects both the resolution of the datetime as well
as the permissible range. The unit directly determines the resolution - using a date unit of a day (D)
limits to resolution to days. Using a date unit of a week (W) will allow a minimum of 1 week difference.
Similarly, using a time unit of a second (s) will allow resolution up to the second (but not millisecond).
The set of date and time units, and their range are presented in Table 14.1.
NumPy datetimes can be initialized using either human readable strings or using numeric values. The
string initialization is simple and datetimes can be initialized using year only, year and month, the complete date or the complete date including a time (and optional timezone). The default time resolution is
nanoseconds (109 ) and T is used to separate the time from the date.
144

>>> datetime64(2013)
numpy.datetime64(2013)
>>> datetime64(2013-09)
numpy.datetime64(2013-09)
>>> datetime64(2013-09-01)
numpy.datetime64(2013-09-01)
>>> datetime64(2013-09-01T12:00) # Time
numpy.datetime64(2013-09-01T12:00+0100)
>>> datetime64(2013-09-01T12:00:01) # Seconds
numpy.datetime64(2013-09-01T12:00:01+0100)
>>> datetime64(2013-09-01T12:00:01.123456789) # Nanoseconds
numpy.datetime64(2013-09-01T12:00:01.123456789+0100)

Date or time units can be explicitly included as the second input. The final example shows that rounding
can occur if the date input is not exactly representable using the date unit chosen.
>>> datetime64(2013-01-01T00,h)
numpy.datetime64(2013-01-01T00:00+0000,h)
>>> datetime64(2013-01-01T00,s)
numpy.datetime64(2013-01-01T00:00:00+0000)
>>> datetime64(2013-01-01T00,ms)
numpy.datetime64(2013-01-01T00:00:00.000+0000)
>>> datetime64(2013-01-01,W)
numpy.datetime64(2012-12-27)

NumPy datetimes can also be initialized from arrays.


>>> dates = array([2013-09-01,2013-09-02],dtype=datetime64)
>>> dates
array([2013-09-01, 2013-09-02], dtype=datetime64[D])
>>> dates[0]
numpy.datetime64(2013-09-01)

The NumPy datetime type also supports including timezone information, and when no timezone is
provided the local timezone is used (currently BST on this computer, which is GMT+0100). These two
commands show a time in US/Central (using -0600) and in GMT (using Z for Zulu). Note that the returned
time is always displayed in the local time zone and so the time stamp is changed. Warning: datetime64
that have times always include a timezone this may be problematic in some situations.
>>> datetime64(2013-09-01T12:00:00-0600)
numpy.datetime64(2013-09-01T19:00:00+0100)
>>> datetime64(2013-09-01T19:00:00Z)

145

numpy.datetime64(2013-09-01T20:00:00+0100)

Dates which are initialized using one of the shorter forms are initialized at the earliest date (and time) in
the period.
>>> datetime64(2013)==datetime64(2013-01-01)
True
>>> datetime64(2013-09)==datetime64(2013-09-01)
True

However, dates which contain time information are not always equal to dates which have no time information. This occurs since time information forces a timezone onto the datetime while the pure date has
no timezone information.
>>> datetime64(2013-09-01)==datetime64(2013-09-01T00:00:00)
False
>>> datetime64(2013-09-01)==datetime64(2013-09-01T00:00:00Z)
True
>>> datetime64(2013-09-01T00:00:00) # Time is 00:00:00+0100
numpy.datetime64(2013-09-01T00:00:00+0100)
>>> datetime64(2013-09-01T00:00:00Z) # Time is 01:00:00+0100
numpy.datetime64(2013-09-01T01:00:00+0100)

A corresponding timedelta class, similarly named timedelta64, is created when dates are differenced.
The second example shows why the previous equality test returned False the dates differ by 1 hour due
to the timezone difference.
>>> datetime64(2013-09-02) - datetime64(2013-09-01)
numpy.timedelta64(1,D)
>>> datetime64(2013-09-01) - datetime64(2013-09-01T00:00:00)
numpy.timedelta64(3600,s)
timedelta64 types contain two pieces of information, a number indicating the number of steps between
the two dates and the size of the step.

146

Chapter 15

Graphics
Matplotlib is a complete plotting library capable of high-quality graphics. Matplotlib contains both high
level functions which produce specific types of figures, for example a simple line plot or a bar chart, as
well as a low level API for creating highly customized charts. This chapter covers the basics of producing
plots and only scratches the surface of the capabilities of matplotlib. Further information is available on
the matplotlib website or in books dedicated to producing print quality graphics using matplotlib.
Throughout this chapter, the following modules have been imported.
>>> import matplotlib.pyplot as plt
>>> import scipy.stats as stats

Other modules will be included only when needed for a specific graphic.

15.1

seaborn

seaborn is a Python package which provides a number of advanced data visualized plots. It also provides a
general improvement in the default appearance of matplotlib-produced plots, and so I recommend using
it by default.
>>> import seaborn as sns

All figure in this chapter were produced with seaborn loaded, using the default options. The dark grid
background can be swapped to a light grid or no grid using sns.set(stype=whitegrid) (light grid) or
sns.set(stype=nogrid) (no grid, most similar to matplotlib).

15.2

2D Plotting

15.2.1

autoscale and tight_layout

Two funciton, plt.autoscale and plt.tight_layout will generally improve the appearance of figures.
autoscale can be used to set tight limits within a figures axes and tight_layout will remove wasted space
around a figure. These were used in figures that appear in this chapter, although they have been omitted
the code listings (aside from the first)
147

15.2.2

Line Plots

The most basic, and often most useful 2D graphic is a line plot. Basic line plots are produced using plot
using a single input containing a 1-dimensional array.
>>> y = randn(100)
>>> plot(y)
>>> autoscale(tight=x)
>>> tight_layout()

The output of this command is presented in panel (a) of figure 15.1. A more flexible form adds a format
string which has 1 to 3 elements: a color, represented using a letter (e.g. g for green), a marker symbol
which is either a letter of a symbol (e.g. s for square, ^ for triangle up), and a line style, which is always a
symbol or series of symbols. In the next example, g-- indicates green (g) and dashed line ().
>>> plot(y,g--)

Format strings may contain any of the elements in the next table.
Color
Blue
Green
Red
Cyan
Magenta
Yellow
Black
White

Marker
b
g
r
c
m
y
k
w

Line Style

Point
Pixel
Circle
Square
Diamond
Thin diamond
Cross
Plus
Star
Hexagon
Alt. Hexagon
Pentagon
Triangles
Vertical Line
Horizontal Line

.
,
o
s

Solid
Dashed
Dash-dot
Dotted

--.
:

D
d
x
+
*
H
h
p
^, v, <, >
_

The default behavior is to use a blue solid line with no marker (unless there is more than one line, in
which case the colors will alter, in order, through those in the Colors column, skipping white). The format
string contains 1 or more or the three categories of formatting information. For example, kx-- would
produce a black dashed line with crosses marking the points, *: would produce a dotted line with the
default color using stars to mark points and yH would produce a solid yellow line with a hexagon marker.
When plot is called with one array, the default x-axis values 1,2, . . . are used. plot(x,y) can be used
to plot specific x values against y values. Panel (c) shows the results of running the following code.
>>> x = cumsum(rand(100))
>>> plot(x,y,r-)

148

While format strings are useful for quickly adding meaningful colors or line styles to a plot, they only
expose a limited range of the available customizations. The next example shows how keyword arguments
are used to add customizations to a plot. Panel (d) contains the plot produced by the following code.
>>> plot(x,y,alpha = 0.5, color = #FF7F00, \
...

label = Line Label, linestyle = -., \

...

linewidth = 3, marker = o, markeredgecolor = #000000, \

...

markeredgewidth = 2, markerfacecolor = #FF7F00, \

...

markersize=30)

Note that in the previous example, \ is used to indicate to the Python interpreter that a statement is spanning multiple lines. Some of the more useful keyword arguments are listed in the table below.
Keyword

Description

alpha
color
label
linestyle
linewidth
marker
markeredgecolor
markeredgewidth
markerfacecolor
markersize

Alpha (transparency) of the plot default is 1 (no transparency)


Color description for the line.1
Label for the line used when creating legends
A line style symbol
A positive integer indicating the width of the line
A marker shape symbol or character
Color of the edge (a line) around the marker
Width of the edge (a line) around the marker
Face color of the marker
A positive integer indicating the size of the marker

Many more keyword arguments are available for a plot. The full list can be found in the docstring or
by running the following code. The functions getp and setp can be used to get the list of properties for a
line (or any matplotlib object), and setp can also be used to set a particular property.
>>> h = plot(randn(10))
>>> getp(h)
agg_filter = None
alpha = None
animated = False
...
>>> setp(h, alpha)
alpha: float (0.0 transparent through 1.0 opaque)
>>> setp(h, color)
color: any matplotlib color
>>> setp(h, linestyle)
linestyle: [ - | -- | -. | : | None | | ]
and any drawstyle in combination with a linestyle, e.g. steps--.
>>> setp(h, linestyle, --) # Change the line style

Note that setp(h,prop) returns a description of the property and setp(h,prop,value) sets prop to value.
149

(a)

(b)