
Some Notes From High Performance Python

CHAPTER 1

  • numexpr improves numerical computation by compiling (https://numexpr.readthedocs.io/en/latest/intro.html)

CHAPTER 2

  • Use the Unix command /usr/bin/time to get the time spent inside/outside the Python process
    • /usr/bin/time --verbose
      • Use the full path rather than bare time to avoid the shell built-in, which behaves differently
    • Major page faults give information about file I/O?
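
The notes above can be sketched as a shell invocation (assuming GNU time is installed at /usr/bin/time, as on most Linux systems):

```shell
# Full path avoids the shell's built-in `time`, which lacks --verbose.
# The report includes wall/user/sys time, "Maximum resident set size",
# and major/minor page-fault counts (the file-I/O hint mentioned above).
/usr/bin/time --verbose python3 -c "print(sum(range(10**6)))"
```
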
  • cProfile
  • Write unit tests for code
  • pstats library can load *.prof files
    • Can show which functions call which; this could be used to build a call graph
  • There are other visualizers of *.prof files
    • snakeviz
    • runsnakerun
      • requires wxpython
  • UML would visualize structure but not performance
  • line_profiler package provides a decorator to give detailed line-by-line function performance
    • kernprof.py -l -v
  • Use timeit package to profile snippets of code
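
A minimal timeit sketch (the snippet and sizes are illustrative): the setup string runs once, the statement runs `number` times, and the total elapsed time comes back as a float.

```python
import timeit

# Time 100k set-membership tests; setup builds the data once.
t = timeit.timeit(
    "x in data",
    setup="data = set(range(1000)); x = 999",
    number=100_000,
)
print(f"{t:.4f} s for 100k membership tests")
```
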
  • In compound “if” statements, put faster comparisons further to the left
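
A small sketch of why ordering matters (the predicate names are made up): `and` short-circuits, so the expensive test only runs when the cheap one passes.

```python
calls = 0

def expensive_check(x):
    # Stand-in for a slow predicate; count how often it actually runs.
    global calls
    calls += 1
    return x % 7 == 0

data = list(range(100))
# Cheap test first: expensive_check runs only for the 50 values with x >= 50.
hits = [x for x in data if x >= 50 and expensive_check(x)]
print(calls, len(hits))  # 50 2, not 100 calls
```

Wait, checking: multiples of 7 in [50, 99] are 56, 63, 70, 77, 84, 91, 98, so `hits` has 7 elements; the point is that `calls` is 50 rather than 100.
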
  • memory_profiler package gives line-by-line memory usage
    • Runs faster if psutil is installed
    • Reports MiB (mebibyte) not MB
    • Shows changes in memory allocated to process, not size of objects
    • python -m memory_profiler
    • Uses @profile decorator
    • Also has a utility mprof to give memory usage vs time data
      • mprof run
      • profile.timestamp context manager allows further annotation of time graph
  • guppy package contains heapy tool that gives detailed memory usage of different datatypes
    • hpy.setrelheap allows deltas of usage from set point in code
  • dowser (and dozer?) package with cherryPy package provide live memory usage of instantiated variables
  • dis package disassembles code which can provide info on what the program is really doing
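
A quick dis sketch (function is illustrative): disassembling shows the bytecode the CPython VM actually executes for a line of source.

```python
import dis

def midpoint(a, b):
    # Plain function whose bytecode we inspect.
    return (a + b) / 2

# Print the bytecode listing, then collect opcode names programmatically.
dis.dis(midpoint)
opnames = [ins.opname for ins in dis.get_instructions(midpoint)]
print(opnames)
```
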
  • Use coverage.py to learn which parts of your code participate in unit tests
  • Use no-op @decorators found on page 58 of “High Performance Python” to make unit tests compatible with profiler decorators
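
A minimal sketch in the spirit of that trick (not the book's exact listing): define a pass-through @profile only when no profiler has injected one, so the same file runs under kernprof/memory_profiler and under plain unit tests.

```python
import builtins

# line_profiler/memory_profiler inject `profile` as a builtin when they
# run the script; fall back to a no-op decorator otherwise.
if not hasattr(builtins, "profile"):
    def profile(func):
        return func

@profile
def add(a, b):
    return a + b

print(add(2, 3))  # 5
```
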
  • Use unix diff to quickly check differences in profile outputs
  • Use git to version control every change made to project files
  • Consider the following to stabilize performance testing
    • Disable TurboBoost in BIOS
    • Disable the OS’s ability to override SpeedStep (option in BIOS)
    • Use only mains power, never battery
    • Disable background tools (backups/Dropbox)
    • Run experiments many times to obtain statistics
    • Drop to run level 1 (Unix) to minimize running processes
    • Reboot and rerun all experiments to duplicate results

CHAPTER 3

  • Use bisect standard library
    • Construct sorted lists one element at a time
    • Uses efficient bisection algorithm
    • O(log n) search, O(n log n) sorted construction
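
A short bisect sketch (the data is illustrative): insort keeps the list sorted on every insert, and bisect_left finds positions in O(log n).

```python
import bisect

# Build a sorted list one element at a time.
ranking = []
for score in [42, 7, 99, 23]:
    bisect.insort(ranking, score)
print(ranking)  # [7, 23, 42, 99]

# Binary search for where 42 lives (or would be inserted).
print(bisect.bisect_left(ranking, 42))  # 2
```
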
  • Tuples vs lists
    • lists are dynamic, tuples are static
    • tuples (of sizes 1-20) are cached by Python runtime
    • https://stackoverflow.com/questions/14135542/how-is-tuple-implemented-in-cpython
  • Try not to have mixed data types within tuples/lists
  • Built-in list.append method allocates more memory than just enough for the appended element
    • How does this compare to n * [0] or [0 for i in range(n)]?
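
One way to probe that question (CPython-specific behavior, so treat the exact numbers as implementation detail): getsizeof plateaus between appends because append over-allocates, while a list built in one step has no spare capacity.

```python
import sys

appended = []
sizes = []
for i in range(20):
    appended.append(i)
    sizes.append(sys.getsizeof(appended))

# Fewer distinct sizes than appends: capacity grows in jumps (amortized O(1)).
print(sorted(set(sizes)))

exact = sys.getsizeof([0] * 20)  # built in one step: exactly-sized
print(sys.getsizeof(appended), exact)
```
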
  • While tuples are immutable, concatenating them creates a new tuple of exactly the needed length (no over-allocation)
  • Lists are always bigger than tuples of the same length
  • Instantiating tuples is faster than instantiating lists
    • tuples require less communication with operating system
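
The size claim above can be checked directly (exact byte counts are CPython/version specific): a tuple stores its items inline in one block, while a list keeps a header plus a separately sized pointer array.

```python
import sys

t = (1, 2, 3)
l = [1, 2, 3]
# The list is always larger than the tuple with the same contents.
print(sys.getsizeof(t), sys.getsizeof(l))
assert sys.getsizeof(t) < sys.getsizeof(l)
```
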

CHAPTER 4

  • Dictionaries are only resized upon insertion of elements
  • Python’s namespace lookup follows this order: local -> global -> builtin
    • an explicit "from module import name" should be faster than repeated module.name attribute lookups
    • store things as local variables, all else being equal
  • Implement the __hash__ (and matching __eq__) magic methods to make objects hashable
    • Do NOT change the objects once they’re in a dictionary/set!
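
A minimal hashable-object sketch (the class is illustrative): equal objects hash alike, so dicts and sets deduplicate them; mutating x or y after insertion would silently break lookups.

```python
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, other):
        return (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Hash the same fields __eq__ compares.
        return hash((self.x, self.y))

seen = {Point(1, 2), Point(1, 2)}
print(len(seen))  # 1 -- the set treats the two equal points as one
```
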

CHAPTER 5

  • use iter() to obtain an iterator from an iterable object
  • define the __iter__ magic method in custom objects to make them iterable
  • use ‘generator comprehensions’ instead of list comprehensions when possible
  • map, filter, and zip are lazy, generator-like built-ins in Python 3; reduce now lives in functools, and lazy slicing (islice) lives in itertools
  • itertools.chain combines multiple generators
  • itertools.takewhile adds a condition that will end a generator
  • itertools.cycle makes a finite generator infinite by repeating it
  • https://pypi.org/project/more-itertools/ has a bunch of ready-to-use itertool recipes
  • collections module has lots of ready-to-deploy data structures (deque, defaultdict, Counter, OrderedDict, namedtuple, etc)

CHAPTER 6

  • Always profile before optimizing
  • Minimize operations performed
  • Minimize memory allocations
  • Prefer to reuse allocated space rather than allocate new space
  • Von Neumann bottleneck
  • branch prediction and pipelining are performed by the CPU
  • perf gives information about a program’s CPU performance
    • https://en.wikipedia.org/wiki/Perf_(Linux)
    • example: perf stat -e cycles,stalled-cycles-frontend,stalled-cycles-backend,instructions,cache-references,cache-misses,branches,branch-misses,task-clock,faults,minor-faults,cs,migrations -r 3 python3
      • The -r flag sets the number of repeated runs to average over
    • task-clock: how much time the task spent on the CPU (reported in milliseconds)
    • context-switches: the program is halted to allow another program to run
      • tends to happen when I/O (memory, disk, network) is being performed
      • Can use nice to increase the program’s priority on Linux
        • https://en.wikipedia.org/wiki/Nice_(Unix)
    • CPU-migrations: program is moved to a new CPU
      • Each CPU has its own L1 cache
    • (minor) page-fault: kernel gives reference to memory when allocations are requested. When that memory block is first used, it throws a page fault interrupt, pausing the current program to properly allocate that memory.
      • lazy allocation system
    • (major) page-fault: disk/network data request not only pauses current program, but takes time to access that data
    • cache-reference: measure usage of cached data
    • cache-miss: request for cache data that doesn’t exist in cache but can be pulled from memory
    • instructions: how many instructions our code issued to the CPU
      • insns per cycle: instructions per clock cycle
    • pipelining allows the CPU to run current operations while fetching and preparing the next one
      • frontend is the fetching and preparing of operations
      • backend is the running of current operation
      • stalling is wasted cycles caused by cache misses, mispredicted branches, or resource conflicts
        • stalled-cycles-frontend: how many cycles our program waited at the frontend of the pipeline
        • stalled-cycles-backend: how many cycles our program waited at the backend of the pipeline
    • A branch is a part of the code where the execution flow changes
      • if/else statements cause branching
      • can result in stalled-cycles or branch-misses when branches mispredicted
  • Tutorial: http://web.cs.iastate.edu/~prabhu/Tutorial/title.html
  • Some terms to review:
    • https://en.wikipedia.org/wiki/Hyper-threading
    • https://en.wikipedia.org/wiki/Instruction_pipelining
    • https://en.wikipedia.org/wiki/Branch_predictor
    • https://en.wikipedia.org/wiki/Branch_(computer_science)
    • https://en.wikipedia.org/wiki/CPU_cache
    • https://en.wikipedia.org/wiki/Clock_signal
    • https://en.wikipedia.org/wiki/Instructions_per_cycle
    • https://en.wikipedia.org/wiki/Fragmentation_(computing)
    • https://en.wikipedia.org/wiki/Array_programming
  • lists are fragmented in memory, but array.array is not
    • however, array.array will not be vectorized, and its creation is actually slower than making lists
    • array.array is suitable for non-math storage of fixed data types
    • numpy.array is contiguous in memory, vectorized, can require fewer CPU instructions, and has fewer cache misses compared to lists
  • call builtin id() on things to check if they’re referencing the same memory
  • prefer using in-place operations (+=, *=, etc) over non-in-place operations (a = a + n)
    • one in-place form is ‘swapping’, like a, b = b, a, because it swaps references rather than copying the underlying data
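
id() makes the in-place distinction visible (plain lists used here to keep the sketch dependency-free; the same applies to numpy arrays):

```python
a = [1, 2, 3]
before = id(a)
a += [4]            # in-place: extends the same list object
assert id(a) == before

b = [1, 2, 3]
before = id(b)
b = b + [4]         # allocates a brand-new list for the result
assert id(b) != before

print(a, b)  # [1, 2, 3, 4] [1, 2, 3, 4]
```
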
  • ‘sometimes’ it is best to rewrite numpy functions with fewer allocations, sacrificing human understandability
  • use numexpr on vector expressions b/c it will compile them into code that minimizes cache misses, temp space used, and allow multicore utilization
    • requires rewriting expressions as strings
    • Works best if the problem cannot be contained within cache; otherwise numexpr might take ‘more’ time
      • How can I automatically check cache size of system in Python?
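
A minimal numexpr sketch (assumes numpy is installed; a plain-NumPy fallback is included in case numexpr is not): the whole expression is passed as a string, and numexpr evaluates it in cache-sized chunks without building large temporaries like a * b.

```python
import numpy as np

a = np.arange(1e6)
b = np.arange(1e6)

try:
    import numexpr as ne
    # Compiled, chunked evaluation: fewer temporaries and cache misses.
    result = ne.evaluate("a * b + 4 * a")
except ImportError:
    # Fallback so the sketch runs without numexpr installed.
    result = a * b + 4 * a

print(result[:3])  # [ 0.  5. 12.]
```
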

CHAPTER 7

  • Main reason to compile Python is to statically type
  • AOT (ahead-of-time) compilers:
    • https://pypi.org/project/pythran/
    • https://cython.org/
      • Has OpenMP support
      • cython -a produces an annotated report highlighting lines that interact with the Python runtime
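
A minimal Pythran-style sketch (the function is illustrative): the #pythran export comment declares static types for ahead-of-time compilation, yet the file still runs as ordinary Python when uncompiled.

```python
#pythran export dot(float list, float list)
def dot(xs, ys):
    # Plain-Python kernel that Pythran could compile to native code.
    total = 0.0
    for x, y in zip(xs, ys):
        total += x * y
    return total

print(dot([1.0, 2.0], [3.0, 4.0]))  # 11.0
```
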

CHAPTER 8

  • asynchronous programming reduces I/O wait
    • Therefore tends to improve performance of I/O bound programs
  • https://en.wikipedia.org/wiki/Concurrency_(computer_science)

  • event loops help implement asynchronous approaches
    • can use callbacks
      • https://en.wikipedia.org/wiki/Callback_(computer_programming)
    • can use futures/promises
      • uses partial from functools
    • having too many things on the event loop can slow performance
  • asyncio
    • A library for asynchronous programming in Python
    • https://docs.python.org/3/library/asyncio.html
    • https://docs.python.org/3/library/asyncio-task.html
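
A minimal asyncio sketch (the coroutine names and delays are illustrative): two simulated I/O waits overlap on the event loop instead of running back-to-back, which is the I/O-bound win described above.

```python
import asyncio

async def fetch(name, delay):
    # Stand-in for an I/O-bound call; await yields control to the loop.
    await asyncio.sleep(delay)
    return name

async def main():
    # gather schedules both coroutines, so their waits overlap.
    return await asyncio.gather(fetch("a", 0.1), fetch("b", 0.1))

print(asyncio.run(main()))  # ['a', 'b']
```
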
  • gevent
    • coroutine-based library for Python
      • https://en.wikipedia.org/wiki/Coroutine
    • monkey-patches standard I/O functions to have asynchronous behaviour
    • https://pypi.org/project/gevent/
    • a semaphore can be used to control the number of concurrent greenlets (lightweight coroutine-based pseudo-threads)
      • https://en.wikipedia.org/wiki/Semaphore_(programming)
      • the semaphore can be used as a context manager
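
A hedged sketch of the same pattern using the stdlib's threading.Semaphore as a stand-in for gevent's (gevent may not be installed): used as a context manager, the semaphore caps how many workers occupy the guarded section at once.

```python
import threading
import time

sem = threading.Semaphore(2)   # at most 2 workers inside at a time
lock = threading.Lock()
active = 0
peak = 0

def worker():
    global active, peak
    with sem:                  # blocks when both slots are taken
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)       # simulate I/O while holding a slot
        with lock:
            active -= 1

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("peak concurrency:", peak)  # never exceeds 2
```
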
  • grequests
    • library that combines making HTTP requests with gevent library
    • handles semaphore logic for you
    • https://pypi.org/project/grequests/
  • tornado
    • library that makes asynchronous HTTP requests
    • based on callbacks rather than coroutines
  • sometimes set membership in streamed data can be stochastically estimated
    • https://en.wikipedia.org/wiki/Bloom_filter
    • https://en.wikipedia.org/wiki/Counting_Bloom_filter
    • https://github.com/mynameisfiber/fuggetaboutit
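
A toy Bloom-filter sketch (illustrative only, not fuggetaboutit's API): k hash functions set k bits per item, so lookups can false-positive with small probability but never false-negative.

```python
import hashlib

class Bloom:
    def __init__(self, m=1024, k=3):
        # m bits of state, k hash functions per item.
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item):
        # Derive k deterministic bit positions from sha256.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        # True only if all k bits are set (may rarely false-positive).
        return all(self.bits >> p & 1 for p in self._positions(item))

bf = Bloom()
bf.add("alice")
print("alice" in bf)  # True
print("zo" in bf)     # False with high probability
```
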
  • choose tool that is appropriate to domain and level of abstraction
This post is licensed under CC BY 4.0 by the author.
