Exploratory Data Analysis (EDA) and Feature Engineering Training Course

This intensive five-day training course develops core skills in Exploratory Data Analysis (EDA) and Feature Engineering, two foundational steps in any data science and machine learning project. Participants will learn how to systematically investigate datasets, summarize their main characteristics, detect anomalies, and transform raw data into high-quality features that improve the performance and robustness of predictive models. The course adopts a practical, tool-agnostic approach, focusing on universally applicable statistical, visual, and computational techniques.

The curriculum is structured across 10 progressive modules, covering the entire workflow from initial data inspection to advanced feature creation. Key topics include calculating and visualizing descriptive statistics; assessing data quality, imputing missing values, and detecting outliers; applying categorical encoding schemes; implementing feature scaling and transformation techniques; and using both statistical and machine learning methods for dimensionality reduction. Every module includes a mandatory practical session designed to reinforce theoretical concepts through hands-on data manipulation and analysis.

Who should attend the training

·       Aspiring Data Scientists and Machine Learning Engineers

·       Business Intelligence Analysts

·       Research Analysts

·       Statisticians

·       Data Engineers

Objectives of the training

·       Personal benefits

o   Master the complete EDA workflow to gain deep insights from any dataset

o   Confidently identify, diagnose, and treat data quality issues (missing data, outliers)

o   Develop a strong intuition for creating high-impact features from raw variables

o   Correctly apply various encoding and scaling techniques to prepare data for modeling

o   Significantly improve the performance of machine learning models through superior feature sets

·       Organizational benefits

o   Reduce the time spent in the data cleaning and preparation phase of projects

o   Increase the predictive power and accuracy of deployed machine learning models

o   Standardize best practices for data validation and pre-processing

o   Improve collaboration between data engineering and data science teams

o   Ensure data integrity and quality across analytical workflows

 

Course Duration: 5 days

Training fee: USD 1500

Training methodology

·       Expert-led concept explanations and statistical foundation lectures

·       Hands-on coding exercises (using Python/Pandas/Scikit-learn concepts)

·       Interactive data visualization and interpretation labs

·       Group problem-solving focused on real-world business case studies

Trainer Experience

Our trainers are seasoned data scientists and machine learning practitioners who have designed, built, and deployed predictive models in commercial and research settings. They specialize in practical application, guiding participants through the nuances of data preparation, which often accounts for a large share of the effort in real-world data science projects.

Quality Statement

We are committed to delivering a high-quality, technically focused training program that provides actionable and fundamental data science skills. Our curriculum is designed to move participants beyond theoretical knowledge to practical mastery of data exploration and feature engineering techniques essential for building robust and scalable analytical solutions.

Tailor-made courses

This course can be customized to focus on specific programming languages (e.g., R), industry-specific data types (e.g., text, geospatial, time-series data), or advanced feature selection methods. We offer flexible delivery options, including on-site, virtual, and blended learning solutions to meet your organizational needs.

Module 1: Foundations of Data Exploration and the EDA Process

·       Defining EDA and its role in the data science lifecycle

·       The CRISP-DM framework overview and where EDA fits

·       Understanding data types (Nominal, Ordinal, Interval, Ratio)

·       The five-step EDA workflow

·       Identifying the business context and defining analysis goals

·       Practical session: Loading a raw dataset and generating initial statistical summaries (e.g., using Python/R/Excel)

Module 2: Descriptive Statistics for Data Understanding

·       Measures of Central Tendency (Mean, Median, Mode) and when to use each

·       Measures of Dispersion (Variance, Standard Deviation, IQR)

·       Understanding and interpreting Skewness and Kurtosis

·       Calculating correlation and covariance for bivariate analysis

·       Using quantiles and percentiles to analyze distribution

·       Practical session: Calculating and comparing descriptive statistics for continuous and categorical variables
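
As an illustration of the kind of exercise this module covers, here is a minimal sketch in Python, assuming pandas is available. The data and column names are hypothetical and are not taken from the course materials.

```python
import pandas as pd

# Hypothetical sample data (an assumption for illustration only)
df = pd.DataFrame({
    "income": [32, 35, 38, 41, 45, 52, 120],          # continuous, right-skewed
    "segment": ["A", "B", "A", "C", "A", "B", "C"],   # categorical
})

income = df["income"]
q1, q3 = income.quantile([0.25, 0.75])

summary = {
    "mean": income.mean(),
    "median": income.median(),
    "std": income.std(),    # sample standard deviation (ddof=1)
    "iqr": q3 - q1,         # interquartile range
    "skew": income.skew(),  # positive for a right-skewed variable
}

# For a categorical variable, the mode and frequency counts are the usual summaries
segment_mode = df["segment"].mode()[0]
segment_counts = df["segment"].value_counts()

print(summary)
print(segment_mode, segment_counts.to_dict())
```

In this sample the single large value pulls the mean above the median, which is one reason the median is often preferred for skewed variables.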

Module 3: Univariate and Bivariate Data Visualization

·       Best practices for data visualization (chart selection, color theory)

·       Creating and interpreting Histograms and Density plots for distribution

·       Using Box Plots and Violin Plots to identify spread and outliers

·       Generating Bar Charts and Pie Charts for categorical data

·       Visualizing relationships with Scatter Plots and Heatmaps (correlation matrix)

·       Practical session: Generating a comprehensive set of univariate and bivariate visualizations to profile a new dataset

Module 4: Data Quality Assessment and Missing Value Imputation

·       Identifying data quality issues (missingness, inconsistencies, duplication)

·       Types of missing data (MCAR, MAR, MNAR) and their implications

·       Common imputation techniques: Mean/Median/Mode imputation

·       Advanced imputation: K-Nearest Neighbors (KNN) and regression imputation

·       Handling duplicate and inconsistent records

·       Practical session: Identifying missing data patterns and applying two different imputation methods (e.g., Mean and KNN) to compare results
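
The comparison described in this practical session might look like the following sketch, assuming pandas and scikit-learn are available. The dataset, column names, and the choice of two neighbors are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical data with two clusters of people and two missing values
df = pd.DataFrame({
    "age":            [25, 27, np.nan, 51, 53, 55],
    "income":         [30, 32, 31, 80, np.nan, 84],
    "years_employed": [2, 3, 3, 25, 27, 30],
})

# Share of missing values per column
missing_share = df.isna().mean()

# Mean imputation: every gap in a column receives that column's mean
mean_imputed = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(df), columns=df.columns
)

# KNN imputation: each gap receives the average of its 2 most similar rows
knn_imputed = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns
)

print(missing_share)
print(mean_imputed.loc[[2, 4]])
print(knn_imputed.loc[[2, 4]])
```

Here mean imputation assigns the overall average to both gaps, while KNN fills each gap from rows in the same cluster. In practice the features would usually be scaled before KNN imputation, since it relies on distances.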

Module 5: Outlier Detection and Treatment

·       Defining outliers and anomalies in different contexts

·       Statistical methods for outlier detection: Z-Score and IQR methods

·       Visualization techniques for outlier identification (Box Plots, Scatter Plots)

·       Data transformation techniques (Log, Square Root) to manage outliers

·       Strategies for treating outliers: Capping, Trimming, and Separation

·       Practical session: Implementing the IQR method to flag outliers in a numerical feature and applying capping to mitigate their effect
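
A minimal sketch of the IQR method followed by capping, assuming pandas and the conventional 1.5 x IQR fences. The values and the feature name are hypothetical.

```python
import pandas as pd

# Hypothetical numerical feature with one extreme value
values = pd.Series([12, 14, 15, 15, 16, 17, 18, 19, 95], name="delivery_days")

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Conventional fences: 1.5 * IQR beyond the quartiles
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag observations outside the fences
is_outlier = (values < lower) | (values > upper)

# Capping: pull flagged values back to the nearest fence
capped = values.clip(lower=lower, upper=upper)

print(values[is_outlier].tolist())
print(capped.tolist())
```

Capping keeps every row in the dataset, whereas trimming would drop the flagged rows. Which is appropriate depends on whether the extreme values are errors or genuine observations.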

Module 6: Foundations of Feature Engineering

·       Defining Feature Engineering and its impact on model performance

·       The difference between raw data, features, and variables

·       Domain expertise as the primary driver of effective feature creation

·       Creating simple interaction features (multiplication, division)

·       The feature creation workflow

·       Practical session: Creating new interaction features (e.g., density = mass/volume) based on domain knowledge of a dataset
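
The density example from this practical session could be sketched as follows, assuming pandas and NumPy. The column names, units, and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical raw variables
df = pd.DataFrame({
    "mass_kg":   [2.0, 5.0, 7.5, 1.5],
    "volume_m3": [0.5, 2.0, 0.0, 0.5],
})

# Ratio feature: density = mass / volume.
# A zero volume is replaced with NaN first, so the result is NaN rather than inf.
df["density_kg_m3"] = df["mass_kg"] / df["volume_m3"].replace(0, np.nan)

print(df)
```

Ratio features need a rule for zero or missing denominators. Returning NaN, as in this sketch, leaves that decision to the imputation step.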

Module 7: Handling Categorical Variables

·       Strategies for encoding nominal categorical features (One-Hot Encoding)

·       Strategies for encoding ordinal categorical features (Label/Ordinal Encoding)

·       Dealing with high-cardinality categorical variables (Target/Frequency Encoding)

·       Binary encoding and its use cases

·       Handling text features using Bag-of-Words or TF-IDF overview

·       Practical session: Applying One-Hot Encoding to a nominal feature and Target Encoding to a high-cardinality feature
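
As a rough sketch of the two encodings named in this practical session, assuming pandas: one-hot encoding uses `pd.get_dummies`, and target encoding is written by hand as a per-category mean of the target. The dataset and column names are hypothetical.

```python
import pandas as pd

# Hypothetical data: "color" is nominal, "city" stands in for a high-cardinality feature
df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue", "red", "green", "blue", "red"],
    "city": ["Nairobi", "Kisumu", "Nairobi", "Mombasa",
             "Kisumu", "Nairobi", "Nairobi", "Mombasa"],
    "purchased": [1, 0, 0, 1, 0, 1, 1, 0],
})

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(df["color"], prefix="color")

# Target encoding: replace each category with the mean of the target for that category
city_means = df.groupby("city")["purchased"].mean()
df["city_target_enc"] = df["city"].map(city_means)

print(one_hot.columns.tolist())
print(df[["city", "city_target_enc"]])
```

This simplified target encoding uses the full dataset. In a modeling workflow the category means would normally be computed on training data only, often with smoothing, to avoid leaking the target into the features.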

Module 8: Feature Scaling and Transformation

·       The necessity of feature scaling for distance-based algorithms

·       Implementing Standardization (Z-score scaling)

·       Implementing Normalization (Min-Max scaling)

·       Power transformations (Box-Cox and Yeo-Johnson) for achieving normality

·       Discretization (Binning) of continuous features into categorical bins

·       Practical session: Comparing the effects of Standardization and Normalization on a skewed numerical feature using Python
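
The two scalings in this module can be written directly from their formulas. The sketch below assumes NumPy and uses a hypothetical skewed feature; the formulas correspond to what scikit-learn's `StandardScaler` and `MinMaxScaler` compute with their default settings.

```python
import numpy as np

# Hypothetical right-skewed feature
x = np.array([1.0, 2.0, 2.0, 3.0, 4.0, 50.0])

# Standardization (Z-score): subtract the mean, divide by the standard deviation
standardized = (x - x.mean()) / x.std()   # population std (ddof=0)

# Normalization (Min-Max): rescale to the range [0, 1]
normalized = (x - x.min()) / (x.max() - x.min())

print(np.round(standardized, 3))
print(np.round(normalized, 3))
```

Both are linear rescalings, so the shape of the distribution, including its skew, is unchanged. Reducing skew requires a non-linear transformation such as a log or power transform.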

Module 9: Dimensionality Reduction and Feature Selection

·       The Curse of Dimensionality and its effects on model performance

·       Feature Selection methods: Filter (Chi-squared, Correlation), Wrapper (Forward/Backward Selection), and Embedded (Lasso)

·       Introduction to Principal Component Analysis (PCA) for dimensionality reduction

·       Interpreting PCA components and selecting the optimal number of components

·       Feature importance extraction from machine learning models

·       Practical session: Implementing PCA to reduce the dimensionality of a dataset and visualizing the explained variance ratio
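
A minimal sketch of this practical session, assuming NumPy and scikit-learn. The synthetic dataset, the mixing matrix, and the 95% variance threshold are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 5 observed features driven by 2 hidden factors plus small noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = np.array([
    [1.0, 0.8,  0.5, 0.0, -0.6],
    [0.0, 0.5, -0.7, 1.0,  0.9],
])
X = latent @ mixing + 0.05 * rng.normal(size=(200, 5))

# PCA is sensitive to feature scale, so standardize first
X_scaled = StandardScaler().fit_transform(X)

pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative explained variance reaches 95%
n_components = int(np.searchsorted(cumulative, 0.95) + 1)

print(np.round(pca.explained_variance_ratio_, 3))
print(n_components)
```

Plotting `cumulative` against the component index with any plotting library gives the usual explained-variance curve. The threshold is a judgment call rather than a fixed rule.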

Module 10: Time-Based and Automated Feature Engineering

·       Creating date and time-based features (day of week, month, holiday flags)

·       Generating lagged features and rolling window statistics (mean, sum, std dev)

·       Aggregating features for sequential and transactional data

·       Automating Feature Engineering using tools or libraries overview

·       Documenting the final feature set and preparing for modeling

·       Practical session: Creating lagged features and a 7-day rolling average for a time-series feature in the dataset
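
The lagged and rolling features in this practical session could be built along these lines, assuming pandas. The daily series, its dates, and the column names are hypothetical.

```python
import pandas as pd

# Hypothetical daily series
dates = pd.date_range("2026-01-01", periods=10, freq="D")
ts = pd.DataFrame(
    {"sales": [10, 12, 11, 13, 15, 14, 16, 18, 17, 19]}, index=dates
)

# Lagged features: the value 1 day and 7 days earlier
ts["sales_lag_1"] = ts["sales"].shift(1)
ts["sales_lag_7"] = ts["sales"].shift(7)

# 7-day rolling average (current day plus the previous 6 days)
ts["sales_roll7_mean"] = ts["sales"].rolling(window=7).mean()

# Calendar feature: Monday = 0, Sunday = 6
ts["day_of_week"] = ts.index.dayofweek

print(ts)
```

The first rows of each lagged or rolling column are NaN because there is not enough history. When the rolling average is used to predict the same day's value, the series is usually shifted by one day first so that the feature contains only past information.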

 

Requirements:

·       Participants should be reasonably proficient in English.

·       Applicants must meet the Armstrong Global Institute admission criteria.

Terms and Conditions

1. Discounts: Organizations sponsoring four participants will have a fifth participant attend free.

2. What the course fees cover: The fees cover all requirements for the training: learning materials, lunches, teas, snacks, and certification. Participants cover their own travel and accommodation expenses, visa application, insurance, and other personal expenses.

3. Certificate Awarded: Participants are awarded Certificates of Participation at the end of the training.

4. The program content shown here is for guidance purposes only. Our continuous course improvement process may lead to changes in topics and course structure.

5. Approval of Course: Our Programs are NITA Approved. Participating organizations can therefore claim reimbursement on fees paid in accordance with NITA Rules.

Booking for Training

Simply send an email to the Training Officer at training@armstrongglobalinstitute.com and we will send you a registration form. We advise you to book early to secure a seat in this training.

Or call us on +254720272325 / +254725012095 / +254724452588

Payment Options

We provide 3 payment options. Choose the one most convenient for you, and kindly make payment at least 5 days before the training start date to reserve your seat:

1. Groups of 5 people and above: Cheque payments to Armstrong Global Training & Development Center Limited should be made in advance, at least 5 days before the training.

2. Invoice: We can send a bill directly to you or your company.

3. Deposit directly into Bank Account (Account details provided upon request)

Cancellation Policy

1. Payment for all courses includes a non-refundable registration fee equal to 15% of the course fee.

2. Participants may cancel attendance 14 days or more prior to the training commencement date.

3. No refunds will be made 14 days or less before the training commencement date. However, participants who are unable to attend may opt to attend a similar training course at a later date or send a substitute participant provided the participation criteria have been met.

Tailor Made Courses

This training course can also be customized for your institution upon request for a minimum of 5 participants. You can have it conducted at our Training Centre or at a convenient location. For further inquiries, please contact us on Tel: +254720272325 / +254725012095 / +254724452588 or Email training@armstrongglobalinstitute.com

Accommodation and Airport Transfer

Accommodation and airport transfers are arranged upon request and at extra cost. For reservations, contact the Training Officer on Email: training@armstrongglobalinstitute.com or on Tel: +254720272325 / +254725012095 / +254724452588

 

Instructor-led Training Schedule

Course Dates | Venue | Fee (USD)
Jul 13 - Jul 17 2026 | Zoom | $1,300
May 11 - May 15 2026 | Nairobi | $1,500
Apr 20 - Apr 24 2026 | Nakuru | $1,500
Mar 02 - Mar 06 2026 | Naivasha | $1,500
May 18 - May 22 2026 | Nanyuki | $1,500
Jun 01 - Jun 05 2026 | Mombasa | $1,500
May 04 - May 08 2026 | Kisumu | $1,500
Apr 13 - Apr 17 2026 | Cape Town | $4,500
Aug 24 - Aug 28 2026 | Kigali | $2,500
Mar 02 - Mar 06 2026 | Kampala | $2,500
Oct 05 - Oct 09 2026 | Johannesburg | $4,500
May 04 - May 08 2026 | Addis Ababa | $4,500
Nov 02 - Nov 06 2026 | Casablanca | $4,500
Apr 20 - Apr 24 2026 | Dubai | $5,000
Jun 08 - Jun 12 2026 | Doha | $5,000
Mar 09 - Mar 13 2026 | Riyadh | $5,000
Jul 06 - Jul 10 2026 | London | $6,500
Aug 03 - Aug 07 2026 | Paris | $6,500
Jul 13 - Jul 17 2026 | Berlin | $6,500
Apr 20 - Apr 24 2026 | New York | $6,950
May 04 - May 08 2026 | Washington DC | $6,950
Jul 20 - Jul 24 2026 | Vancouver | $7,000