What you'll learn in this course
At the end of the Data Science with Python training course, participants will be able to:
• Understand the difference between Python basic data types
• Know when to use different Python collections
• Implement Python functions
• Understand control flow constructs in Python
• Handle errors via exception handling constructs
• Quantitatively define an answerable, actionable question
• Import both structured and unstructured data into Python
• Parse unstructured data into structured formats
• Understand the differences between NumPy arrays and pandas dataframes
• Understand where Python fits in the Python/Hadoop/Spark ecosystem
• Simulate data through random number generation
• Understand mechanisms for missing data and analytic implications
• Explore and clean data
• Create compelling graphics to reveal analytic results
• Reshape and merge data to prepare for advanced analytics
• Test for group differences using inferential statistics
• Implement linear regression from a frequentist perspective
• Understand non-linear terms, confounding, and interaction in linear regression
• Extend to logistic regression to model binary outcomes
• Understand the difference between machine learning and frequentist approaches to statistics
• Implement classification and regression models using machine learning
• Score new datasets, evaluate model fit, and quantify variable importance
Prerequisites
All attendees should have prior programming experience and an understanding of basic statistics.
Course Curriculum
• History and current use
o Installing the Software
o Python Distributions
• String Literals and numeric objects
• Collections (lists, tuples, dicts)
• Datetime classes in Python
• Memory Management in Python
• Control Flow
• Functions
• Exception Handling
• Defining the quantitative construct used to make inferences about the question
• Identifying the data needed to support the constructs
• Identifying limitations to the data and analytic approach
• Constructing Sensitivity analyses
• Structured Data
o Structured Text Files
o Excel workbooks
o SQL databases
• Working with Unstructured Text Data
o Reading Unstructured Text
o Introduction to Natural Language Processing with Python
• Introduction to the ndarray
• NumPy operations
• Broadcasting
• Missing data in NumPy (masked array)
• NumPy Structured arrays
• Random number generation
• Filtering
• Creating and deleting variables
• Discretization of Continuous Data
• Scaling and standardizing data
• Identifying Duplicates
• Dummy Coding
• Combining Datasets
• Transposing Data
• Long to wide and back
• Univariate Statistical Summaries and Detecting Outliers
• Multivariate Statistical Summaries and Outlier Detection
• Group-wise calculations using pandas
• Pivot Tables
• Histogram
• Box-and-whiskers plot
• Scatter plots
• Forest Plots
• Group-by plotting
• Introduction to the differences between Python, Hadoop, and Spark
• Importing data from Spark and Hadoop to Python
• Parallel execution leveraging Spark or Hadoop
• Exploring and understanding patterns in missing data
• Missing at Random
• Missing Not at Random
• Missing Completely at Random
• Data imputation methods
• Comparing Groups
o P-Values, summary statistics, sufficient statistics, inferential targets
o T-Tests (equal and unequal variances)
o ANOVA
o Chi-Square Tests
• Correlation
• Linear Regression
o Multivariate linear regression
o Capturing Non-linear Relationships
o Comparing Model Fits
o Scoring new data
o Poisson Regression Extension
• Logistic regression
o Logistic Regression Example
o Classification Metrics
• Machine Learning Theory
• Data pre-processing
o Missing Data
o Dummy Coding
o Standardization
o Training/Test data
• Supervised Versus Unsupervised Learning
• Unsupervised Learning: Clustering
o Clustering Algorithms
o Evaluating Cluster Performance
• Dimensionality Reduction
o A-priori
o Principal Components Analysis
o Penalized Regression
• Linear Regression
• Penalized Linear Regression
• Stochastic Gradient Descent
• Scoring New Data Sets
• Cross Validation
• Bias-Variance Tradeoff
• Feature Importance
• Logistic Regression
• LASSO
• Random Forest
• Ensemble Methods
• Feature Importance
• Scoring New Data Sets
• Cross Validation
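
As an illustration of the cross-validation topic that closes the curriculum, here is a minimal sketch of k-fold cross-validation using only the Python standard library. It is not taken from the course materials. The helper names (k_fold_indices, cross_validate_mean_model), the toy values in y, and the choice of a "predict the training mean" baseline model are all assumptions made for illustration; in practice a library such as scikit-learn would typically be used.

```python
import random
import statistics


def k_fold_indices(n_samples, k, seed=0):
    """Split range(n_samples) into k shuffled, nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Every k-th index (with a different offset) goes to the same fold.
    return [indices[i::k] for i in range(k)]


def cross_validate_mean_model(y, k=5, seed=0):
    """Return one mean squared error per fold for a baseline model
    that always predicts the mean of the training outcomes."""
    folds = k_fold_indices(len(y), k, seed)
    scores = []
    for held_out in folds:
        held_out_set = set(held_out)
        # Train on everything outside the held-out fold.
        train = [y[i] for i in range(len(y)) if i not in held_out_set]
        prediction = statistics.fmean(train)
        # Evaluate on the held-out fold only.
        mse = statistics.fmean((y[i] - prediction) ** 2 for i in held_out)
        scores.append(mse)
    return scores


# Hypothetical toy outcome values, for illustration only.
y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
fold_scores = cross_validate_mean_model(y, k=5)
average_mse = statistics.fmean(fold_scores)
print(f"Per-fold MSE: {fold_scores}")
print(f"Average MSE across folds: {average_mse:.2f}")
```

Each observation is held out exactly once, so the average across folds estimates how the model would perform on data it has not seen. Replacing the baseline with a regression or classification model leaves the surrounding loop unchanged.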