This intensive five-day training course is focused on developing core skills in Exploratory Data Analysis (EDA) and Feature Engineering, which are crucial foundational steps in any successful data science and machine learning project. Participants will learn how to systematically investigate datasets, summarize their main characteristics, detect anomalies, and transform raw data into high-quality features that dramatically improve the performance and robustness of predictive models. The course adopts a practical, tool-agnostic approach, focusing on universally applicable statistical, visual, and computational techniques.
The curriculum is structured across 10 progressive modules, covering the entire workflow from initial data inspection to advanced feature creation. Key topics include calculating and visualizing descriptive statistics; assessing data quality, imputing missing values, and detecting outliers; applying categorical encoding schemes; implementing feature scaling and transformation techniques; and using both statistical and machine learning methods for dimensionality reduction. Every module includes a mandatory practical session designed to reinforce theoretical concepts through hands-on data manipulation and analysis.
Who should attend the training
· Aspiring Data Scientists and Machine Learning Engineers
· Business Intelligence Analysts
· Research Analysts
· Statisticians
· Data Engineers
Objectives of the training
· Personal benefits
o Master the complete EDA workflow to gain deep insights from any dataset
o Confidently identify, diagnose, and treat data quality issues (missing data, outliers)
o Develop a strong intuition for creating high-impact features from raw variables
o Correctly apply various encoding and scaling techniques to prepare data for modeling
o Significantly improve the performance of machine learning models through superior feature sets
· Organizational benefits
o Reduce the time spent in the data cleaning and preparation phase of projects
o Increase the predictive power and accuracy of deployed machine learning models
o Standardize best practices for data validation and pre-processing
o Improve collaboration between data engineering and data science teams
o Ensure data integrity and quality across analytical workflows
Course Duration: 5 days
Training fee: USD 1500
Training methodology
· Expert-led concept explanations and statistical foundation lectures
· Hands-on coding exercises (using Python/Pandas/Scikit-learn concepts)
· Interactive data visualization and interpretation labs
· Group problem-solving focused on real-world business case studies
Trainer Experience
Our trainers are seasoned data scientists and machine learning practitioners who have designed, built, and deployed predictive models in commercial and research settings. They specialize in practical application, guiding participants through the nuances of data preparation, which accounts for the majority of effort in real-world data science projects.
Quality Statement
We are committed to delivering a high-quality, technically focused training program that provides actionable and fundamental data science skills. Our curriculum is designed to move participants beyond theoretical knowledge to practical mastery of data exploration and feature engineering techniques essential for building robust and scalable analytical solutions.
Tailor-made courses
This course can be customized to focus on specific programming languages (e.g., R), industry-specific data types (e.g., text, geospatial, time-series data), or advanced feature selection methods. We offer flexible delivery options, including on-site, virtual, and blended learning solutions to meet your organizational needs.
· Defining EDA and its role in the data science lifecycle
· The CRISP-DM framework overview and where EDA fits
· Understanding data types (Nominal, Ordinal, Interval, Ratio)
· The five-step EDA workflow
· Identifying the business context and defining analysis goals
· Practical session: Loading a raw dataset and generating initial statistical summaries (e.g., using Python/R/Excel)
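As an illustration of the practical session above, here is a minimal sketch in Python with pandas. The inline CSV and its column names (`customer_id`, `age`, `plan`, `monthly_spend`) are hypothetical stand-ins for whatever raw dataset is used in class.

```python
import io

import pandas as pd

# Hypothetical raw data; in practice this would be a file path passed to read_csv.
raw_csv = io.StringIO(
    "customer_id,age,plan,monthly_spend\n"
    "1,34,basic,20.5\n"
    "2,,premium,55.0\n"
    "3,45,basic,\n"
    "4,29,standard,32.0\n"
)
df = pd.read_csv(raw_csv)

shape = df.shape                      # (rows, columns)
dtypes = df.dtypes                    # inferred type of each column
missing_counts = df.isna().sum()      # missing values per column
numeric_summary = df.describe()       # count, mean, std, min, quartiles, max
categorical_summary = df["plan"].value_counts()  # frequency of each category

print(shape)
print(missing_counts)
print(numeric_summary)
```

Looking at shape, types, and missing counts before any statistics helps catch columns that were parsed with the wrong type.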
· Measures of Central Tendency (Mean, Median, Mode) and when to use each
· Measures of Dispersion (Variance, Standard Deviation, IQR)
· Understanding and interpreting Skewness and Kurtosis
· Calculating correlation and covariance for bivariate analysis
· Using quantiles and percentiles to analyze distribution
· Practical session: Calculating and comparing descriptive statistics for continuous and categorical variables
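The descriptive statistics listed above can be sketched as follows in Python with pandas. The toy values and the column names `income` and `segment` are assumptions made for illustration; the single large value is there to show how the mean and median diverge on skewed data.

```python
import pandas as pd

# Hypothetical toy data with one extreme income value.
df = pd.DataFrame({
    "income": [32, 35, 36, 38, 40, 41, 45, 120],
    "segment": ["a", "b", "a", "a", "c", "b", "a", "b"],
})

income = df["income"]
summary = {
    "mean": income.mean(),
    "median": income.median(),
    "std": income.std(),        # sample standard deviation (ddof=1)
    "iqr": income.quantile(0.75) - income.quantile(0.25),
    "skewness": income.skew(),  # positive values indicate a right tail
    "kurtosis": income.kurt(),  # excess kurtosis (normal distribution = 0)
}

# Categorical variable: the mode and relative frequencies replace mean and spread.
segment_mode = df["segment"].mode().iloc[0]
segment_freq = df["segment"].value_counts(normalize=True)

print(summary)
print(segment_mode)
```

Here the mean (48.375) sits well above the median (39.0), which is the usual signal that the median is the safer measure of central tendency for this variable.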
· Best practices for data visualization (chart selection, color theory)
· Creating and interpreting Histograms and Density plots for distribution
· Using Box Plots and Violin Plots to identify spread and outliers
· Generating Bar Charts and Pie Charts for categorical data
· Visualizing relationships with Scatter Plots and Heatmaps (correlation matrix)
· Practical session: Generating a comprehensive set of univariate and bivariate visualizations to profile a new dataset
· Identifying data quality issues (missingness, inconsistencies, duplication)
· Types of missing data (MCAR, MAR, MNAR) and their implications
· Common imputation techniques: Mean/Median/Mode imputation
· Advanced imputation: K-Nearest Neighbors (KNN) and regression imputation
· Handling duplicate and inconsistent records
· Practical session: Identifying missing data patterns and applying two different imputation methods (e.g., Mean and KNN) to compare results
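A minimal sketch of the comparison described in the practical session above, assuming scikit-learn's `SimpleImputer` and `KNNImputer` are the tools used. The data, the column names, and the choice of `n_neighbors=2` are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical data with one missing value per column.
df = pd.DataFrame({
    "age": [25, 30, np.nan, 45, 50],
    "income": [30, 40, 50, np.nan, 100],
})

# Share of missing values per column, a first look at the missingness pattern.
missing_share = df.isna().mean()

# Mean imputation: every gap in a column receives that column's mean.
mean_imputed = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(df), columns=df.columns
)

# KNN imputation: each gap is filled from the most similar rows.
knn_imputed = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns
)

comparison = pd.DataFrame({
    "age_mean": mean_imputed["age"],
    "age_knn": knn_imputed["age"],
})
print(missing_share)
print(comparison)
```

Mean imputation ignores the other columns, while KNN imputation uses them to find similar rows, so the two methods generally fill the same gap with different values.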
· Defining outliers and anomalies in different contexts
· Statistical methods for outlier detection: Z-Score and IQR methods
· Visualization techniques for outlier identification (Box Plots, Scatter Plots)
· Data transformation techniques (Log, Square Root) to manage outliers
· Strategies for treating outliers: Capping, Trimming, and Separation
· Practical session: Implementing the IQR method to flag outliers in a numerical feature and applying capping to mitigate their effect
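The IQR method and capping named above can be sketched in a few lines of pandas. The series values and the conventional 1.5 multiplier are assumptions for illustration; the multiplier is a common default rather than a fixed rule.

```python
import pandas as pd

# Hypothetical numerical feature with one extreme value.
values = pd.Series([10, 12, 12, 13, 14, 15, 16, 95], name="amount")

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr   # lower fence
upper = q3 + 1.5 * iqr   # upper fence

# Flag values outside the fences, then cap them to the fence values.
is_outlier = (values < lower) | (values > upper)
capped = values.clip(lower=lower, upper=upper)

print(lower, upper)
print(values[is_outlier])
print(capped)
```

Capping keeps every row but limits the influence of extreme values, whereas trimming would drop the flagged rows entirely.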
· Defining Feature Engineering and its impact on model performance
· The difference between raw data, features, and variables
· Domain expertise as the primary driver of effective feature creation
· Creating simple interaction features (multiplication, division)
· The feature creation workflow
· Practical session: Creating new interaction features (e.g., density = mass/volume) based on domain knowledge of a dataset
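The density example from the practical session above could look like this in pandas. The column names `mass_kg` and `volume_m3` and the zero-volume row are hypothetical; the guard against division by zero is one possible design choice, not a prescribed one.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements; the last row has a zero volume on purpose.
df = pd.DataFrame({
    "mass_kg": [2.0, 5.0, 1.0],
    "volume_m3": [1.0, 2.0, 0.0],
})

# Ratio feature: treat a zero denominator as missing instead of producing inf.
df["density"] = df["mass_kg"] / df["volume_m3"].replace(0, np.nan)

# Product feature: a simple multiplicative interaction of the same two columns.
df["mass_x_volume"] = df["mass_kg"] * df["volume_m3"]

print(df)
```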
· Strategies for encoding nominal categorical features (One-Hot Encoding)
· Strategies for encoding ordinal categorical features (Label/Ordinal Encoding)
· Dealing with high-cardinality categorical variables (Target/Frequency Encoding)
· Binary encoding and its use cases
· Handling text features using Bag-of-Words or TF-IDF overview
· Practical session: Applying One-Hot Encoding to a nominal feature and Target Encoding to a high-cardinality feature
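As a sketch of the two encodings in the practical session above: one-hot encoding via `pandas.get_dummies`, and a hand-written smoothed target (mean) encoding. The data, the column names, and the smoothing strength `m` are assumptions. On real data the target statistics should be computed on training folds only, otherwise the encoding leaks the target.

```python
import pandas as pd

# Hypothetical data: "color" is a nominal feature, "city" stands in for a
# high-cardinality feature, and "target" is a binary label.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "city": ["nairobi", "kisumu", "nairobi", "mombasa"],
    "target": [1, 0, 0, 1],
})

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Smoothed target encoding: blend each category's mean target with the
# global mean, weighting by how many rows the category has.
m = 2  # smoothing strength (assumed value)
global_mean = df["target"].mean()
stats = df.groupby("city")["target"].agg(["mean", "count"])
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
df["city_te"] = df["city"].map(smoothed)

print(one_hot)
print(df[["city", "city_te"]])
```

The smoothing pulls categories with few rows toward the global mean, which reduces the risk of overfitting to rare categories.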
· The necessity of feature scaling for distance-based algorithms
· Implementing Standardization (Z-score scaling)
· Implementing Normalization (Min-Max scaling)
· Power transformations (Box-Cox and Yeo-Johnson) for achieving normality
· Discretization (Binning) of continuous features into categorical bins
· Practical session: Comparing the effects of Standardization and Normalization on a skewed numerical feature using Python
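The comparison in the practical session above can be sketched directly from the formulas, here with pandas and NumPy on a hypothetical skewed series. A log transform is included only as a contrast, because standardization and min-max normalization are both linear rescalings and therefore leave the shape of the distribution unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed feature.
x = pd.Series([1.0, 2.0, 2.0, 3.0, 4.0, 50.0])

# Standardization (z-score): mean 0, standard deviation 1.
standardized = (x - x.mean()) / x.std(ddof=0)

# Normalization (min-max): values rescaled to the range [0, 1].
normalized = (x - x.min()) / (x.max() - x.min())

# Log transform for contrast: non-linear, so it changes the shape.
log_scaled = np.log1p(x)

skewness = pd.Series({
    "raw": x.skew(),
    "standardized": standardized.skew(),
    "normalized": normalized.skew(),
    "log1p": log_scaled.skew(),
})
print(skewness)
```

The raw, standardized, and normalized series share the same skewness, while the log-transformed series is less skewed. Scaling changes the units, not the distribution.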
· The Curse of Dimensionality and its effects on model performance
· Feature Selection methods: Filter (Chi-squared, Correlation), Wrapper (Forward/Backward Selection), and Embedded (Lasso)
· Introduction to Principal Component Analysis (PCA) for dimensionality reduction
· Interpreting PCA components and selecting the optimal number of components
· Feature importance extraction from machine learning models
· Practical session: Implementing PCA to reduce the dimensionality of a dataset and visualizing the explained variance ratio
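A minimal sketch of the PCA practical above, assuming scikit-learn. The synthetic dataset (five features driven by two hidden factors) and the 95% variance threshold are illustrative assumptions; plotting is left out, but `cumulative` is the series one would chart.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: five features built from two latent factors plus small noise.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
X = np.column_stack([
    base[:, 0],
    base[:, 1],
    base[:, 0] + 0.1 * rng.normal(size=200),
    base[:, 1] + 0.1 * rng.normal(size=200),
    base[:, 0] - base[:, 1] + 0.1 * rng.normal(size=200),
])

# PCA is sensitive to scale, so standardize the features first.
X_scaled = StandardScaler().fit_transform(X)

# Fit with all components to inspect the explained variance ratio.
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 95%.
n_components = int(np.searchsorted(cumulative, 0.95) + 1)
X_reduced = PCA(n_components=n_components).fit_transform(X_scaled)

print(pca.explained_variance_ratio_.round(3))
print(n_components, X_reduced.shape)
```

Because the five features here carry only two underlying sources of variation, two components are enough to pass the threshold.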
· Creating date and time-based features (day of week, month, holiday flags)
· Generating lagged features and rolling window statistics (mean, sum, std dev)
· Aggregating features for sequential and transactional data
· Automating Feature Engineering using tools or libraries overview
· Documenting the final feature set and preparing for modeling
· Practical session: Creating lagged features and a 7-day rolling average for a time-series feature in the dataset
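The lagged and rolling features in the practical session above can be sketched with pandas as follows. The daily `sales` series and its dates are hypothetical.

```python
import pandas as pd

# Hypothetical daily series: 10, 20, ..., 100 over ten consecutive days.
dates = pd.date_range("2026-01-01", periods=10, freq="D")
df = pd.DataFrame({"sales": list(range(10, 110, 10))}, index=dates)

# Calendar feature derived from the timestamp (Monday=0 ... Sunday=6).
df["day_of_week"] = df.index.dayofweek

# Lagged features: the value observed 1 and 7 days earlier.
df["sales_lag_1"] = df["sales"].shift(1)
df["sales_lag_7"] = df["sales"].shift(7)

# 7-day rolling average; the first six rows are NaN until the window fills.
df["sales_roll7_mean"] = df["sales"].rolling(window=7).mean()

print(df)
```

This rolling window includes the current day. When the feature is used to predict that same day's value, applying `shift(1)` before `rolling` keeps the window strictly in the past and avoids leaking the target.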
Requirements:
· Participants should be reasonably proficient in English.
· Applicants must meet the Armstrong Global Institute admission criteria.
Terms and Conditions
1. Discounts: Organizations sponsoring four participants will have a fifth attend free.
2. What the course fees cover: Fees cover all requirements for the training: learning materials, lunches, teas, snacks, and certification. Participants meet their own travel and accommodation expenses, visa application, insurance, and other personal expenses.
3. Certificate Awarded: Participants are awarded Certificates of Participation at the end of the training.
4. The program content shown here is for guidance purposes only. Our continuous course improvement process may lead to changes in topics and course structure.
5. Approval of Course: Our Programs are NITA Approved. Participating organizations can therefore claim reimbursement on fees paid in accordance with NITA Rules.
Booking for Training
Simply send an email to the Training Officer at training@armstrongglobalinstitute.com and we will send you a registration form. We advise you to book early to secure a seat in this training.
Or call us on +254720272325 / +254725012095 / +254724452588
Payment Options
We provide 3 payment options. Choose the one most convenient for you, and kindly make payment at least 5 days before the training start date to reserve your seat:
1. Groups of 5 people and above: Cheque payments to Armstrong Global Training & Development Center Limited should be made at least 5 days before the training.
2. Invoice: We can send a bill directly to you or your company.
3. Deposit directly into Bank Account (Account details provided upon request)
Cancellation Policy
1. Payment for all courses includes a non-refundable registration fee equal to 15% of the course fee.
2. Participants may cancel attendance 14 days or more prior to the training commencement date.
3. No refunds will be made 14 days or less before the training commencement date. However, participants who are unable to attend may opt to attend a similar training course at a later date or send a substitute participant provided the participation criteria have been met.
Tailor Made Courses
This training course can also be customized for your institution upon request for a minimum of 5 participants. You can have it conducted at our Training Centre or at a convenient location. For further inquiries, please contact us on Tel: +254720272325 / +254725012095 / +254724452588 or Email training@armstrongglobalinstitute.com
Accommodation and Airport Transfer
Accommodation and airport transfer are arranged upon request and at extra cost. For reservations, contact the Training Officer by email at training@armstrongglobalinstitute.com or by telephone on +254720272325 / +254725012095 / +254724452588
| Course Dates | Venue | Fees |
|---|---|---|
| Jul 13 - Jul 17 2026 | Zoom | $1,300 |
| May 11 - May 15 2026 | Nairobi | $1,500 |
| Apr 20 - Apr 24 2026 | Nakuru | $1,500 |
| Mar 02 - Mar 06 2026 | Naivasha | $1,500 |
| May 18 - May 22 2026 | Nanyuki | $1,500 |
| Jun 01 - Jun 05 2026 | Mombasa | $1,500 |
| May 04 - May 08 2026 | Kisumu | $1,500 |
| Apr 13 - Apr 17 2026 | Cape Town | $4,500 |
| Aug 24 - Aug 28 2026 | Kigali | $2,500 |
| Mar 02 - Mar 06 2026 | Kampala | $2,500 |
| Oct 05 - Oct 09 2026 | Johannesburg | $4,500 |
| May 04 - May 08 2026 | Addis Ababa | $4,500 |
| Nov 02 - Nov 06 2026 | Casablanca | $4,500 |
| Apr 20 - Apr 24 2026 | Dubai | $5,000 |
| Jun 08 - Jun 12 2026 | Doha | $5,000 |
| Mar 09 - Mar 13 2026 | Riyadh | $5,000 |
| Jul 06 - Jul 10 2026 | London | $6,500 |
| Aug 03 - Aug 07 2026 | Paris | $6,500 |
| Jul 13 - Jul 17 2026 | Berlin | $6,500 |
| Apr 20 - Apr 24 2026 | New York | $6,950 |
| May 04 - May 08 2026 | Washington DC | $6,950 |
| Jul 20 - Jul 24 2026 | Vancouver | $7,000 |