Splitting your dataset is a fundamental practice in building reliable AI and machine learning models. Without a proper data split, your model might overfit, underperform, or give misleading results.
What are Data Splits and Why are They Important?
What Are the Data Splits?
| Split | Purpose |
| --- | --- |
| Training Set | Teaches the model the patterns in the data |
| Validation Set | Used to tune model hyperparameters and guard against overfitting |
| Test Set | Evaluates the final model's performance on unseen data |
Why It Matters
- Prevents Overfitting – Helps ensure that your model generalizes well to new data.
- Improves Accuracy – Validation feedback lets you tune the model to perform better on future predictions.
- Enables Fair Evaluation – The test set gives an unbiased performance estimate.
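The overfitting point above is easy to see in practice: compare accuracy on the training data with accuracy on held-out data. A minimal sketch, using a synthetic dataset and an unconstrained decision tree (chosen here only because it overfits readily):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic dataset for illustration.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# An unconstrained tree can memorize the training set.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
val_acc = accuracy_score(y_val, model.predict(X_val))
print(f"Train accuracy: {train_acc:.2f}, Validation accuracy: {val_acc:.2f}")
# A noticeably lower validation accuracy is the classic overfitting signal.
```

Without the held-out split, the (perfect) training accuracy alone would give a misleading picture of the model's quality.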
How to Split the Data?
Recommended Ratios (can vary based on dataset size):
- Training: 60–70%
- Validation: 15–20%
- Test: 15–20%
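Note that when you create these ratios with two chained splits, the second split's fraction must be rescaled, because it applies to the remainder rather than the full dataset. A quick sketch of the arithmetic for a 60/20/20 target:

```python
# Target fractions of the FULL dataset.
test_frac = 0.20   # held out for final testing
val_frac = 0.20    # held out for validation

# After the test set is removed, only (1 - test_frac) of the data remains,
# so the validation fraction must be rescaled relative to that remainder.
second_split = val_frac / (1 - test_frac)
print(second_split)  # 0.25 -> the test_size to pass to the second train_test_split
```

This is why the code example below uses `test_size=0.2` for the first split but `test_size=0.25` for the second.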
General Steps to Split Data:
- Load and preprocess your dataset.
- Split off the test set, keeping the remainder for training and validation.
- Split the remainder into training and validation sets.
Code Example – Train/Validation/Test Split
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Sample dataset
data = {
'Age': [22, 25, 47, 52, 46, 56, 48, 33, 27, 29],
'Salary': [20000, 25000, 47000, 52000, 46000, 56000, 48000, 33000, 27000, 29000],
'Purchased': [0, 0, 1, 1, 1, 1, 1, 0, 0, 0]
}
df = pd.DataFrame(data)
# Features and target
X = df[['Age', 'Salary']]
y = df['Purchased']
# Step 1: Train-Test split (80% train_val, 20% test)
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 2: Train-Validation split (from the 80%)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.25, random_state=42)
# (Now: 60% train, 20% val, 20% test)
# Step 3: Train model on training set
model = LogisticRegression(max_iter=1000)  # higher max_iter avoids convergence warnings on unscaled features
model.fit(X_train, y_train)
# Step 4: Tune using validation set (example: simple accuracy check)
val_predictions = model.predict(X_val)
print("Validation Accuracy:", accuracy_score(y_val, val_predictions))
# Step 5: Final evaluation on test set
test_predictions = model.predict(X_test)
print("Test Accuracy:", accuracy_score(y_test, test_predictions))
Sample Output:
Validation Accuracy: 1.0
Test Accuracy: 1.0
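One refinement worth knowing: with imbalanced targets, a plain random split can leave a set with very few (or zero) minority-class examples. scikit-learn's `train_test_split` accepts a `stratify` argument that preserves class proportions in every split. A hedged sketch on a synthetic imbalanced dataset:

```python
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset: roughly 90% class 0, 10% class 1.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# stratify=y keeps the class ratio consistent across the splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print("Full:", Counter(y), "Test:", Counter(y_test))
```

The same argument can be passed to both splits in the three-way procedure above.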
Conclusion
Divide your data into proper training, validation, and test sets to build trustworthy AI systems. Doing so:
- Helps detect and prevent overfitting.
- Improves model tuning and performance.
- Provides a real-world performance estimate.
Always treat your test set as “untouched” data. That’s how you’ll know your model works when it truly matters — in production.
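On very small datasets (like the 10-row example above), a single validation split can be noisy. A common alternative is k-fold cross-validation on the training portion, which still leaves the test set untouched. A minimal sketch using scikit-learn's `cross_val_score` on the Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_iris(return_X_y=True)
# Hold out the test set first, exactly as before.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 5-fold cross-validation on the training portion only;
# the test set stays untouched until the very end.
scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=5)
print("CV accuracy per fold:", scores)
print("Mean CV accuracy:", scores.mean())
```

Each fold takes a turn as the validation set, so every training example contributes to both fitting and validation.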