Common Concepts in Machine Learning
Features (Attributes)
Usually the columns of the training sample set, which can be viewed as column names for each column. For example, to distinguish bird species, features such as weight and back color can be used for differentiation.
Feature Instance
The data within a specific feature column
Types of Features
- Numerical
- Binary (similar to Boolean type)
- Categorical (color values)
Training Set
A collection of data samples used to train machine learning algorithms
Training Sample
Each row of the sample set is a training sample
Target Variable
The prediction result of the machine learning algorithm
Classification:
- In classification algorithms, typically nominal, the target variable is called
class - In regression algorithms, typically continuous
Before training the sample set, the value of the target variable must be determined. For example, based on features like bird height, weight, and color, the specific bird species can be determined. The species is the target variable, and the species name is the specific value of the target variable.
Knowledge Representation
Refers to the work of presenting machine classification results
Forms include: rule sets probability distributions instances in training sample sets
Why?
So that machine learning algorithms can discover relationships between features and target variables
The main task of machine learning is classification
Dividing instance data into appropriate categories
Basic classification process:
- Obtain all feature information
- Algorithm training (learning how to classify)
- Testing the effectiveness of machine learning algorithms
How to test algorithm effectiveness?
To test effectiveness, two separate sample sets are typically used: training data and test data
Another important task in machine learning: Regression
Mainly used for predicting numerical data
Program starts running:
Training sample set provides target variable => Input to algorithm => Training completed => Input test data (without target variable) => Compare test results with actual target variable differences (regression fitting) => Obtain actual accuracy of the algorithm
Supervised Learning vs Unsupervised Learning
Supervised learning means the algorithm knows what to predict, i.e., has clear objectives
Regression and classification both belong to supervised learning
Compared to supervised learning, unsupervised learning has no target values
Unsupervised Learning
Purpose:
- Reduce the dimensionality of data features
Clustering (Analysis)
Divide similar objects into different groups or more subsets through static classification methods. This is equivalent to dividing one training set into multiple training sets, where the data features in each new training set are similar.
Density Estimation
Refers to the process of finding statistical descriptions of data. It can be understood as obtaining a frequency distribution histogram that describes the data, reducing feature dimensions to make it more intuitive.
Brief Discussion on Algorithm Selection
- Goal-oriented: Choose supervised or unsupervised learning algorithms based on the task to be accomplished
- Data source-oriented: Analyze or collect what kind of data is needed
General Steps for Developing Machine Learning Applications
Collect data
Prepare input data
Analyze input data (manual)
Train algorithm (machine learning)
Test algorithm
Use algorithm
Views