
Some Notes From High Performance Python

CHAPTER 1

  • numexpr improves numerical computation by compiling (https://numexpr.readthedocs.io/en/latest/intro.html)

CHAPTER 2

  • Use the Unix command /usr/bin/time to get the time spent inside/outside the Python process
    • /usr/bin/time --verbose
      • Use the full path rather than bare time to avoid the shell built-in, which behaves differently
    • Major page faults give information about file I/O?
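
The notes above can be sketched as a shell invocation (assuming GNU time is installed at /usr/bin/time, as on most Linux systems):

```shell
# Full path avoids the shell's built-in `time`, which lacks --verbose.
# The report includes wall/user/sys time, "Maximum resident set size",
# and major/minor page-fault counts (the file-I/O hint mentioned above).
/usr/bin/time --verbose python3 -c "print(sum(range(10**6)))"
```
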
  • cProfile
  • Write unit tests for code
  • pstats library can load *.prof files
    • Can show which functions call which; this could be used to build a call graph
  • There are other visualizers of *.prof files
    • snakeviz
    • runsnakerun
      • requires wxpython
  • UML would visualize structure but not performance
  • line_profiler package provides a decorator to give detailed line-by-line function performance
    • kernprof.py -l -v
  • Use timeit package to profile snippets of code
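
A minimal timeit sketch (the snippet and sizes are illustrative): the setup string runs once, the statement runs `number` times, and the total elapsed time comes back as a float.

```python
import timeit

# Time 100k set-membership tests; setup builds the data once.
t = timeit.timeit(
    "x in data",
    setup="data = set(range(1000)); x = 999",
    number=100_000,
)
print(f"{t:.4f} s for 100k membership tests")
```
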
  • In compound “if” statements, put faster comparisons further to the left
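
A small sketch of why ordering matters (the predicate names are made up): `and` short-circuits, so the expensive test only runs when the cheap one passes.

```python
calls = 0

def expensive_check(x):
    # Stand-in for a slow predicate; count how often it actually runs.
    global calls
    calls += 1
    return x % 7 == 0

data = list(range(100))
# Cheap test first: expensive_check runs only for the 50 values with x >= 50.
hits = [x for x in data if x >= 50 and expensive_check(x)]
print(calls, len(hits))  # 50 2, not 100 calls
```

Wait, checking: multiples of 7 in [50, 99] are 56, 63, 70, 77, 84, 91, 98, so `hits` has 7 elements; the point is that `calls` is 50 rather than 100.
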
  • memory_profiler package gives line-by-line memory usage
    • Runs faster if psutil is installed
    • Reports MiB (mebibyte) not MB
    • Shows changes in memory allocated to process, not size of objects
    • python -m memory_profiler
    • Uses @profile decorator
    • Also has a utility mprof to give memory usage vs time data
      • mprof run
      • profile.timestamp context manager allows further annotation of time graph
  • guppy package contains heapy tool that gives detailed memory usage of different datatypes
    • hpy.setrelheap allows deltas of usage from set point in code
  • dowser (and dozer?) package with cherryPy package provide live memory usage of instantiated variables
  • dis package disassembles code which can provide info on what the program is really doing
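
A quick dis sketch (function is illustrative): disassembling shows the bytecode the CPython VM actually executes for a line of source.

```python
import dis

def midpoint(a, b):
    # Plain function whose bytecode we inspect.
    return (a + b) / 2

# Print the bytecode listing, then collect opcode names programmatically.
dis.dis(midpoint)
opnames = [ins.opname for ins in dis.get_instructions(midpoint)]
print(opnames)
```
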
  • Use coverage.py to learn which parts of your code participate in unit tests
  • Use no-op @decorators found on page 58 of “High Performance Python” to make unit tests compatible with profiler decorators
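
A minimal sketch in the spirit of that trick (not the book's exact listing): define a pass-through @profile only when no profiler has injected one, so the same file runs under kernprof/memory_profiler and under plain unit tests.

```python
import builtins

# line_profiler/memory_profiler inject `profile` as a builtin when they
# run the script; fall back to a no-op decorator otherwise.
if not hasattr(builtins, "profile"):
    def profile(func):
        return func

@profile
def add(a, b):
    return a + b

print(add(2, 3))  # 5
```
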
  • Use unix diff to quickly check differences in profile outputs
  • Use git to version control every change made to project files
  • Consider the following to stabilize performance testing
    • Disable TurboBoost in BIOS
    • Disable the OS’s ability to override SpeedStep (option in BIOS)
    • Use only mains power, never battery
    • Disable background tools (backups/Dropbox)
    • Run experiments many times to obtain statistics
    • Drop to run level 1 (Unix) to minimize running processes
    • Reboot and rerun all experiments to duplicate results

CHAPTER 3

  • Use bisect standard library
    • Construct sorted lists one element at a time
    • Uses efficient bisection algorithm
    • O(log n) search, O(n log n) sorted construction
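
A short bisect sketch (the data is illustrative): insort keeps the list sorted on every insert, and bisect_left finds positions in O(log n).

```python
import bisect

# Build a sorted list one element at a time.
ranking = []
for score in [42, 7, 99, 23]:
    bisect.insort(ranking, score)
print(ranking)  # [7, 23, 42, 99]

# Binary search for where 42 lives (or would be inserted).
print(bisect.bisect_left(ranking, 42))  # 2
```
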
  • Tuples vs lists
    • lists are dynamic, tuples are static
    • tuples (of sizes 1-20) are cached by Python runtime
    • https://stackoverflow.com/questions/14135542/how-is-tuple-implemented-in-cpython
  • Try not to have mixed data types within tuples/lists
  • Built-in list.append method allocates more memory than just enough for the appended element
    • How does this compare to n * [0] or [0 for i in range(n)]?
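
One way to probe that question (CPython-specific behavior, so treat the exact numbers as implementation detail): getsizeof plateaus between appends because append over-allocates, while a list built in one step has no spare capacity.

```python
import sys

appended = []
sizes = []
for i in range(20):
    appended.append(i)
    sizes.append(sys.getsizeof(appended))

# Fewer distinct sizes than appends: capacity grows in jumps (amortized O(1)).
print(sorted(set(sizes)))

exact = sys.getsizeof([0] * 20)  # built in one step: exactly-sized
print(sys.getsizeof(appended), exact)
```
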
  • While tuples are immutable, concatenating them creates a new tuple of exactly the needed length (no over-allocation)
  • Lists are always bigger than tuples of the same length
  • Instantiating tuples is faster than instantiating lists
    • tuples require less communication with operating system
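
The size claim above can be checked directly (exact byte counts are CPython/version specific): a tuple stores its items inline in one block, while a list keeps a header plus a separately sized pointer array.

```python
import sys

t = (1, 2, 3)
l = [1, 2, 3]
# The list is always larger than the tuple with the same contents.
print(sys.getsizeof(t), sys.getsizeof(l))
assert sys.getsizeof(t) < sys.getsizeof(l)
```
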

CHAPTER 4

  • Dictionaries are only resized upon insertion of elements
  • Python’s namespace lookup follows this order: local -> global -> builtin
    • an explicit "from module import name" should be faster than repeated module.name attribute lookups
    • store things as local variables, all else being equal
  • Implement the __hash__ (and matching __eq__) magic methods to make objects hashable
    • Do NOT change the objects once they’re in a dictionary/set!
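
A minimal hashable-object sketch (the class is illustrative): equal objects hash alike, so dicts and sets deduplicate them; mutating x or y after insertion would silently break lookups.

```python
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, other):
        return (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Hash the same fields __eq__ compares.
        return hash((self.x, self.y))

seen = {Point(1, 2), Point(1, 2)}
print(len(seen))  # 1 -- the set treats the two equal points as one
```
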

CHAPTER 5

  • use iter() to obtain an iterator from an iterable object
  • define the __iter__ magic method in custom objects to make them iterable
  • use ‘generator comprehensions’ instead of list comprehensions when possible
  • map, filter, and zip are lazy, generator-like built-ins in Python 3; reduce now lives in functools, and lazy slicing (islice) lives in itertools
  • itertools.chain combines multiple generators
  • itertools.takewhile adds a condition that will end a generator
  • itertools.cycle makes a finite generator infinite by repeating it
  • https://pypi.org/project/more-itertools/ has a bunch of ready-to-use itertool recipes
  • collections module has lots of ready-to-deploy data structures (deque, defaultdict, Counter, OrderedDict, namedtuple, etc)

CHAPTER 6

  • Always profile before optimizing
  • Minimize operations performed
  • Minimize memory allocations
  • Prefer to reuse allocated space rather than allocate new space
  • Von Neumann bottleneck
  • branch prediction and pipelining are performed by the CPU
  • perf gives information about a program’s CPU performance
    • https://en.wikipedia.org/wiki/Perf_(Linux)
    • example: perf stat -e cycles,stalled-cycles-frontend,stalled-cycles-backend,instructions,cache-references,cache-misses,branches,branch-misses,task-clock,faults,minor-faults,cs,migrations -r 3 python3
      • The -r flag sets the number of repeated runs to average over
    • task-clock: how much time the task spent on the CPU (reported in milliseconds)
    • context-switches: the program is halted to allow another program to run
      • tends to happen when I/O (memory, disk, network) is being performed
      • Can use nice to increase the program’s priority on Linux
        • https://en.wikipedia.org/wiki/Nice_(Unix)
    • CPU-migrations: program is moved to a new CPU
      • Each CPU has its own L1 cache
    • (minor) page-fault: kernel gives reference to memory when allocations are requested. When that memory block is first used, it throws a page fault interrupt, pausing the current program to properly allocate that memory.
      • lazy allocation system
    • (major) page-fault: disk/network data request not only pauses current program, but takes time to access that data
    • cache-reference: measure usage of cached data
    • cache-miss: request for cache data that doesn’t exist in cache but can be pulled from memory
    • instructions: how many instructions our code issued to the CPU
      • insns per cycle: instructions per clock cycle
    • pipelining allows the CPU to run current operations while fetching and preparing the next one
      • frontend is the fetching and preparing of operations
      • backend is the running of current operation
      • stalling is wasted cycles caused by cache misses, mispredicted branches, or resource conflicts
        • stalled-cycles-frontend: how many cycles our program waited at the frontend of the pipeline
        • stalled-cycles-backend: how many cycles our program waited at the backend of the pipeline
    • A branch is a part of the code where the execution flow changes
      • if/else statements cause branching
      • can result in stalled-cycles or branch-misses when branches mispredicted
  • Tutorial: http://web.cs.iastate.edu/~prabhu/Tutorial/title.html
  • Some terms to review:
    • https://en.wikipedia.org/wiki/Hyper-threading
    • https://en.wikipedia.org/wiki/Instruction_pipelining
    • https://en.wikipedia.org/wiki/Branch_predictor
    • https://en.wikipedia.org/wiki/Branch_(computer_science)
    • https://en.wikipedia.org/wiki/CPU_cache
    • https://en.wikipedia.org/wiki/Clock_signal
    • https://en.wikipedia.org/wiki/Instructions_per_cycle
    • https://en.wikipedia.org/wiki/Fragmentation_(computing)
    • https://en.wikipedia.org/wiki/Array_programming
  • lists are fragmented in memory, but array.array is not
    • however, array.array will not be vectorized, and its creation is actually slower than making lists
    • array.array is suitable for non-math storage of fixed data types
    • numpy.array is contiguous in memory, vectorized, can require fewer CPU instructions, and has fewer cache misses compared to lists
  • call builtin id() on things to check if they’re referencing the same memory
  • prefer using in-place operations (+=, *=, etc) over non-in-place operations (a = a + n)
    • one in-place form is ‘swapping’, like a, b = b, a, because it swaps references rather than copying the underlying data
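
id() makes the in-place distinction visible (plain lists used here to keep the sketch dependency-free; the same applies to numpy arrays):

```python
a = [1, 2, 3]
before = id(a)
a += [4]            # in-place: extends the same list object
assert id(a) == before

b = [1, 2, 3]
before = id(b)
b = b + [4]         # allocates a brand-new list for the result
assert id(b) != before

print(a, b)  # [1, 2, 3, 4] [1, 2, 3, 4]
```
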
  • ‘sometimes’ it is best to rewrite numpy functions with fewer allocations, sacrificing human understandability
  • use numexpr on vector expressions b/c it will compile them into code that minimizes cache misses, temp space used, and allow multicore utilization
    • requires rewriting expressions as strings
    • Works best if the problem cannot be contained within cache; otherwise numexpr might take ‘more’ time
      • How can I automatically check cache size of system in Python?
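
A minimal numexpr sketch (assumes numpy is installed; a plain-NumPy fallback is included in case numexpr is not): the whole expression is passed as a string, and numexpr evaluates it in cache-sized chunks without building large temporaries like a * b.

```python
import numpy as np

a = np.arange(1e6)
b = np.arange(1e6)

try:
    import numexpr as ne
    # Compiled, chunked evaluation: fewer temporaries and cache misses.
    result = ne.evaluate("a * b + 4 * a")
except ImportError:
    # Fallback so the sketch runs without numexpr installed.
    result = a * b + 4 * a

print(result[:3])  # [ 0.  5. 12.]
```
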

CHAPTER 7

  • Main reason to compile Python is to statically type
  • AOT (ahead-of-time) compilers:
    • https://pypi.org/project/pythran/
    • https://cython.org/
      • Has OpenMP support
      • cython -a produces an annotated report highlighting lines that interact with the Python runtime
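
A minimal Pythran-style sketch (the function is illustrative): the #pythran export comment declares static types for ahead-of-time compilation, yet the file still runs as ordinary Python when uncompiled.

```python
#pythran export dot(float list, float list)
def dot(xs, ys):
    # Plain-Python kernel that Pythran could compile to native code.
    total = 0.0
    for x, y in zip(xs, ys):
        total += x * y
    return total

print(dot([1.0, 2.0], [3.0, 4.0]))  # 11.0
```
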

CHAPTER 8

  • asynchronous programming reduces I/O wait
    • Therefore tends to improve performance of I/O bound programs
  • https://en.wikipedia.org/wiki/Concurrency_(computer_science)

  • event loops help implement asynchronous approaches
    • can use callbacks
      • https://en.wikipedia.org/wiki/Callback_(computer_programming)
    • can use futures/promises
      • uses partial from functools
    • having too many things on the event loop can slow performance
  • asyncio
    • A library for asynchronous programming in Python
    • https://docs.python.org/3/library/asyncio.html
    • https://docs.python.org/3/library/asyncio-task.html
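
A minimal asyncio sketch (the coroutine names and delays are illustrative): two simulated I/O waits overlap on the event loop instead of running back-to-back, which is the I/O-bound win described above.

```python
import asyncio

async def fetch(name, delay):
    # Stand-in for an I/O-bound call; await yields control to the loop.
    await asyncio.sleep(delay)
    return name

async def main():
    # gather schedules both coroutines, so their waits overlap.
    return await asyncio.gather(fetch("a", 0.1), fetch("b", 0.1))

print(asyncio.run(main()))  # ['a', 'b']
```
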
  • gevent
    • coroutine-based library for Python
      • https://en.wikipedia.org/wiki/Coroutine
    • monkey-patches standard I/O functions to have asynchronous behaviour
    • https://pypi.org/project/gevent/
    • a semaphore can be used to control the number of concurrent greenlets (lightweight coroutine-based pseudo-threads)
      • https://en.wikipedia.org/wiki/Semaphore_(programming)
      • the semaphore can be used as a context manager
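
A hedged sketch of the same pattern using the stdlib's threading.Semaphore as a stand-in for gevent's (gevent may not be installed): used as a context manager, the semaphore caps how many workers occupy the guarded section at once.

```python
import threading
import time

sem = threading.Semaphore(2)   # at most 2 workers inside at a time
lock = threading.Lock()
active = 0
peak = 0

def worker():
    global active, peak
    with sem:                  # blocks when both slots are taken
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)       # simulate I/O while holding a slot
        with lock:
            active -= 1

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("peak concurrency:", peak)  # never exceeds 2
```
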
  • grequests
    • library that combines making HTTP requests with gevent library
    • handles semaphore logic for you
    • https://pypi.org/project/grequests/
  • tornado
    • library that makes asynchronous HTTP requests
    • based on callbacks rather than coroutines
  • sometimes set membership in streamed data can be stochastically estimated
    • https://en.wikipedia.org/wiki/Bloom_filter
    • https://en.wikipedia.org/wiki/Counting_Bloom_filter
    • https://github.com/mynameisfiber/fuggetaboutit
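
A toy Bloom-filter sketch (illustrative only, not fuggetaboutit's API): k hash functions set k bits per item, so lookups can false-positive with small probability but never false-negative.

```python
import hashlib

class Bloom:
    def __init__(self, m=1024, k=3):
        # m bits of state, k hash functions per item.
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item):
        # Derive k deterministic bit positions from sha256.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        # True only if all k bits are set (may rarely false-positive).
        return all(self.bits >> p & 1 for p in self._positions(item))

bf = Bloom()
bf.add("alice")
print("alice" in bf)  # True
print("zo" in bf)     # False with high probability
```
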
  • choose tool that is appropriate to domain and level of abstraction
This post is licensed under CC BY 4.0 by the author.
