

Python Analytics
Expertise with Python packages, modules, and libraries for data science analysis.
Scientific Computation
SciPy
A collection of mathematical, scientific, and engineering packages that includes NumPy, SymPy, Matplotlib, IPython, and pandas.
NumPy
A fundamental scientific computing package that offers an N-dimensional array object, broadcasting functions, integration with C/C++ and Fortran code, linear algebra, Fourier transforms, a multidimensional container of generic data, arbitrary data types, and database integration.
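As a small illustration of the N-dimensional array, broadcasting, and linear algebra features listed above, here is a minimal sketch. The array values are made up for the example.

```python
import numpy as np

# Hypothetical 3x2 array of measurements (values chosen only for illustration).
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0]])

# Column means have shape (2,); broadcasting stretches them across all 3 rows.
col_means = data.mean(axis=0)
centered = data - col_means

# Linear algebra: solve the 2x2 system A x = b.
A = np.array([[2.0, 0.0],
              [0.0, 4.0]])
b = np.array([2.0, 8.0])
x = np.linalg.solve(A, b)
```

After centering, each column of `centered` has a mean of zero, and `x` holds the solution of the linear system.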
pandas
Data structures and data analysis tools similar to what is offered by default in the R language. Features include reading and writing data in various formats (e.g. CSV, Excel, and SQL databases), data alignment, handling of missing data, pivoting of data sets, size mutability, slicing, subsetting large data sets, split-apply-combine operations, merging and joining data sets, and time-series functionality.
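The missing-data handling, split-apply-combine, and merge/join features can be sketched as follows. The two tables and their column names are hypothetical, invented only for this example.

```python
import pandas as pd

# Hypothetical sales records; None marks a missing value.
sales = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "amount": [100.0, None, 50.0, 70.0],
})
managers = pd.DataFrame({
    "region": ["east", "west"],
    "manager": ["Ann", "Bob"],
})

# Handle missing data: replace the missing amount with 0.
sales["amount"] = sales["amount"].fillna(0.0)

# Split-apply-combine: split by region, sum each group, combine into one table.
totals = sales.groupby("region", as_index=False)["amount"].sum()

# Merge/join: attach each region's manager to its total.
report = totals.merge(managers, on="region")
```

Filling with zero is only one possible policy for missing values; dropping the rows or interpolating may suit other data better.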
IPython
Provides an interactive shell, data visualization, GUI tools, and parallel computing. Used for advanced statistics and quantum mechanics. Also acts as a kernel for Jupyter.
Math and Statistics
SymPy
Symbolic mathematics and a full-featured computer algebra system. Statistics capabilities include probability, probability density, expected value/variance, and random variable types. The package is used for solving equations, calculus, matrices, and discrete math.
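A brief sketch of the equation solving, calculus, matrix, and statistics capabilities described above. The specific expressions are arbitrary examples.

```python
import sympy as sp
from sympy.stats import Die, E

x = sp.symbols("x")

# Solve the equation x**2 - 4 = 0 symbolically.
roots = sp.solve(sp.Eq(x**2 - 4, 0), x)

# Calculus: symbolic derivative and antiderivative.
derivative = sp.diff(x**3, x)
integral = sp.integrate(2 * x, x)

# Matrices: exact (not floating-point) determinant.
M = sp.Matrix([[1, 2], [3, 4]])
det = M.det()

# Statistics: expected value of a fair six-sided die as an exact rational.
expected_roll = E(Die("D", 6))
```

All results stay symbolic or exact, so `expected_roll` is the rational 7/2 and not the float 3.5.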
Statsmodels
Explore data, estimate statistical models and perform statistical tests. This includes descriptive statistics, statistical tests, plotting, and result statistics. Features: linear regression, time series, nonparametric estimators, and unit tests for correctness of results.
Machine Learning
scikit-learn
Machine learning capabilities built on NumPy, SciPy, and matplotlib. Used for data mining and data analysis: classification (identifying which category an object belongs to), regression (predicting a continuous-valued attribute associated with an object), clustering (grouping similar items into sets), dimensionality reduction (reducing the number of random variables), model selection (comparing, validating, and choosing parameters/models), and preprocessing (feature extraction and normalization).
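Preprocessing and classification can be combined in a few lines. The sketch below uses a nearest-neighbor classifier on hypothetical toy data. Both the data and the choice of classifier are assumptions made for illustration.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two well-separated groups in two dimensions.
X = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
     [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
y = [0, 0, 0, 1, 1, 1]

# Preprocessing (normalization) followed by classification in one pipeline.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=1))
model.fit(X, y)

# Classify two new points, one near each group.
predictions = model.predict([[0.1, 0.1], [5.1, 5.0]])
```

Wrapping the scaler and the classifier in one pipeline ensures that new points are normalized with the same statistics as the training data.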
SHOGUN
Designed for unified large-scale learning for classification, regression, and explorative data analysis. A primary feature is the unified interface from multiple languages, such as Python, R, Java, and C++. Other benefits include clustering, metric learning, structured output, and online learning algorithms.
PyBrain
Offers flexible and easy-to-use algorithms and the ability to test/compare these methods. The software is designed for both entry-level students and state-of-the-art research. Algorithms include neural networks, reinforcement learning, unsupervised learning, black box optimization, and evolution.
PyMC
Implements the Metropolis-Hastings algorithm as a statistical package for Markov Chain Monte Carlo sampling. Includes methods for summarizing output, plotting, goodness-of-fit, and convergence diagnostics. Intended to provide efficient Bayesian analysis.
Plotting and Visualization
matplotlib
2D plotting for publication-quality figures in hardcopy formats. Used in Python scripts, the Python and IPython shells (in the style of MATLAB or Mathematica), web application servers, and graphical user interfaces. Generates plots, histograms, power spectra, bar charts, error charts, and scatterplots.
Bokeh
Interactive visualization library for web browsers. Used to build elegant graphics with interactivity over very large or streaming datasets.
ggplot
Plotting system based on ggplot2, which is available for R. Used to make professional quality plots with minimal code. Not intended for highly customized data visualizations. Multiple layers can be combined, such as points, lines, and trend lines.
Plotly
Used for dashboards, scatter plots, charts (line, bubble, bar, pie), time series, treemaps, and tables. Statistical features include error bars, box plots, histograms, 2D density plots, and distplots. 3D plots include wireframe, point clustering, parametric, scatter, surface, ribbon, and filled line.
prettyplotlib
Used to enhance matplotlib plots through color perception and information design.
Seaborn
Visualization library based on matplotlib for drawing attractive statistical graphics. Also supports NumPy and pandas data structures along with statistical routines from SciPy and statsmodels. Benefits include the ability to reveal patterns in data, compare subsets, discover structure in matrices, and represent uncertainty in time series estimation.
Data Formatting and Storage
csvkit
Suite of tools for converting and working with CSV files, such as from Excel or JSON to CSV. Features include selecting a subset of columns, finding rows with matching cells, reordering columns, summary statistics, and SQL queries.
PyTables
Used to manage hierarchical datasets and handle extremely large amounts of data. Allows the ability to interactively browse, process, and search data by optimizing memory and disk resources. Features include table entities, multidimensional/nested table cells, indexing for columns of tables, numerical arrays, and variable length arrays.
SQLite3
SQLite is a C library that provides a lightweight disk-based database that doesn't require a separate server process; the sqlite3 module is Python's interface to it. SQLite can be used for internal data storage. Methods exist for added security, such as passing values through query placeholders in place of Python string operations, which are vulnerable to SQL injection attacks.
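The difference between placeholder substitution and Python string operations can be sketched as follows. The table, column names, and input string are hypothetical, and the unsafe query is included only for contrast.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.executemany(
    "INSERT INTO users (name, role) VALUES (?, ?)",
    [("alice", "admin"), ("bob", "analyst")],
)

# Hypothetical untrusted input containing an injection attempt.
user_input = "bob' OR '1'='1"

# Safe: the ? placeholder passes the value as data, never as SQL text.
safe_rows = conn.execute(
    "SELECT name FROM users WHERE name = ?", (user_input,)
).fetchall()

# Unsafe (shown only for contrast): string formatting splices input into SQL.
unsafe_rows = conn.execute(
    "SELECT name FROM users WHERE name = '%s'" % user_input
).fetchall()

conn.close()
```

With the placeholder, the input is compared literally and matches no row. With string formatting, the injected `OR '1'='1'` clause makes the query return every row in the table.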
Other useful packages
mrjob
Used to write and run Hadoop Streaming jobs and supports the AWS Elastic MapReduce service (EMR). Used to run jobs on EMR, a private Hadoop cluster, or locally for testing.
PyParsing
Alternative approach to simple grammars compared to the traditional lex/yacc method or regular expressions.
dateutil
Extends the standard datetime Python module. Used for relative deltas, computing dates based on recurrence rules, parsing, time zones, and determining the date of Easter Sunday.
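A short sketch of the relative delta, recurrence rule, parsing, and Easter features. The dates are arbitrary examples.

```python
from datetime import date, datetime

from dateutil.easter import easter
from dateutil.parser import parse
from dateutil.relativedelta import relativedelta
from dateutil.rrule import MONTHLY, rrule

# Relative delta: one month after 31 January clamps to the end of February.
end_of_feb = date(2016, 1, 31) + relativedelta(months=1)

# Recurrence rule: the first three monthly occurrences starting 1 Jan 2016.
monthly = list(rrule(MONTHLY, dtstart=datetime(2016, 1, 1), count=3))

# Parsing a free-form date string into a datetime.
parsed = parse("March 5, 2016 14:30")

# Date of Western Easter Sunday for a given year.
easter_2016 = easter(2016)
```

Because 2016 is a leap year, `end_of_feb` lands on 29 February. The standard `timedelta` has no notion of months, so it cannot express this calculation directly.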
Copyright © 2016. Michael E. Byczek. All Rights Reserved.
