Jun 4, 2025

Building a Custom Intrusion Detection System (IDS) with Python and Machine Learning

 
Develop a custom IDS with Python and machine learning. Detect network anomalies and potential security threats using scikit-learn for real-time monitoring.


Introduction to Intrusion Detection Systems and Machine Learning

Intrusion Detection Systems (IDSs) are crucial components of any robust network security infrastructure. Their primary function is to monitor network traffic and system activity for malicious behavior or policy violations. Traditional IDSs often rely on signature-based detection, comparing observed events against a database of known attack signatures. However, these systems struggle against novel or zero-day exploits, for which no signature exists yet. This is where machine learning (ML) steps in, offering a complementary approach based on anomaly detection and adaptive threat identification.

This article explores the development of a custom IDS using Python and machine learning techniques, specifically leveraging the popular scikit-learn library. We'll delve into the core concepts, implementation details, and practical considerations for building an effective anomaly-based intrusion detection system.

Understanding Anomaly Detection with Machine Learning

Anomaly detection, also known as outlier detection, involves identifying data points that deviate significantly from the normal or expected behavior. In the context of network security, an anomaly might represent a suspicious network connection, an unusual login attempt, or unexpected system resource consumption. Machine learning algorithms provide various techniques for automatically learning the patterns of normal behavior and flagging deviations as potential intrusions.
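
To make the idea concrete, here is a minimal sketch of the simplest possible anomaly detector: it learns the mean and standard deviation of a single metric during normal operation and flags values whose z-score exceeds a threshold. The login counts, the function names, and the threshold of 3 are all illustrative assumptions, not part of any library.

```python
import numpy as np

def fit_baseline(normal_values):
    """Learn the mean and standard deviation of normal behaviour."""
    values = np.asarray(normal_values, dtype=float)
    return values.mean(), values.std()

def is_anomalous(value, mean, std, threshold=3.0):
    """Flag a value whose z-score exceeds the threshold (illustrative default: 3)."""
    if std == 0:
        return value != mean
    return abs(value - mean) / std > threshold

# Hypothetical history of login attempts per minute during normal operation
normal_logins = [4, 5, 6, 5, 4, 6, 5, 5, 4, 6]
mean, std = fit_baseline(normal_logins)

print(is_anomalous(5, mean, std))    # a typical value
print(is_anomalous(40, mean, std))   # a burst far outside the learned baseline
```

The ML algorithms discussed later generalize this idea to many features at once and to patterns that a single mean and standard deviation cannot capture.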

Key benefits of using machine learning for intrusion detection include:

  • Adaptability: ML models can be retrained as network traffic patterns evolve, so they can pick up new attack behaviors over time.
  • Zero-day attack detection: Anomaly-based detection can flag attacks that have no known signatures, provided they deviate from the learned normal behavior.
  • Reduced false positives: A well-trained and tuned model can help separate benign anomalies from malicious activity, although anomaly-based detectors usually need careful threshold tuning to keep false alarms manageable.
  • Scalability: ML algorithms can handle large volumes of network data.

Building Your Custom IDS: A Step-by-Step Guide

Let's outline the process of building a custom IDS with Python and machine learning. We'll cover the key steps involved, from data collection and preprocessing to model training and evaluation.

1. Data Collection and Feature Engineering

The foundation of any machine learning-based IDS is the quality and relevance of the data used for training. You'll need to collect network traffic data, system logs, or other relevant security information. Publicly available datasets like the NSL-KDD dataset, CIC-IDS2017, or UNSW-NB15 can be used for experimentation and prototyping.

Feature engineering is the process of extracting meaningful features from the raw data. These features should capture the characteristics of network traffic or system activity that are indicative of normal or anomalous behavior. Examples of relevant features include:

  • Network traffic features: Source/destination IP addresses, source/destination ports, protocol, packet size, packet rate, flow duration, number of packets per flow.
  • System log features: Usernames, login times, system commands executed, file access patterns.
  • Statistical features: Mean, standard deviation, median, and other statistical measures calculated over a window of network traffic or system activity.

The choice of features depends on the specific data source and the type of attacks you want to detect. Proper feature selection is crucial for achieving good performance.
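
As an illustration of the statistical features mentioned above, the following sketch aggregates per-packet records into fixed 10-second windows with pandas. The column names, timestamps, and window length are assumptions chosen for the example; a real capture would supply its own.

```python
import pandas as pd

# Hypothetical per-packet records; a real capture would supply these columns
packets = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2025-01-01 00:00:00", "2025-01-01 00:00:01", "2025-01-01 00:00:02",
        "2025-01-01 00:00:12", "2025-01-01 00:00:13",
    ]),
    "packet_size": [100, 200, 300, 1500, 1500],
}).set_index("timestamp")

# Aggregate packets into fixed 10-second windows and compute summary statistics
windowed = (
    packets["packet_size"]
    .resample(pd.Timedelta(seconds=10))
    .agg(["count", "mean", "std", "median"])
    .rename(columns={"count": "packet_count"})
)
print(windowed)
```

Each row of `windowed` can then serve as one training example, which lets the model see bursts and rate changes rather than individual packets.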


import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the dataset
data = pd.read_csv('network_traffic.csv')

# Select relevant features
features = ['src_port', 'dst_port', 'protocol', 'packet_size', 'flow_duration']
X = data[features]

# Handle missing values (if any)
X = X.fillna(X.mean())

# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled)
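
One caveat with the snippet above: it scales `protocol` as if it were a continuous quantity, although protocol numbers are really category labels. A possible alternative, sketched here on a small hypothetical DataFrame, is to one-hot encode the categorical column and scale only the numeric ones using scikit-learn's `ColumnTransformer`. The split into numeric and categorical columns is an assumption made for this example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical flow records using the same column names as the example above
flows = pd.DataFrame({
    "src_port": [443, 51234, 22, 8080],
    "dst_port": [51234, 443, 40022, 53],
    "protocol": [6, 6, 6, 17],          # IANA protocol numbers: 6 = TCP, 17 = UDP
    "packet_size": [1500, 60, 120, 80],
    "flow_duration": [1.2, 0.4, 30.0, 0.1],
})

numeric = ["src_port", "dst_port", "packet_size", "flow_duration"]
categorical = ["protocol"]

preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), numeric),
        # handle_unknown="ignore" encodes protocols unseen in training as all zeros
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ],
    sparse_threshold=0,  # always return a dense array
)
X_prepared = preprocess.fit_transform(flows)
print(X_prepared.shape)
```

The fitted `preprocess` object can be saved and reused at prediction time in the same way as a plain scaler.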

2. Choosing a Machine Learning Algorithm

Several machine learning algorithms are suitable for anomaly detection in IDSs. Some popular choices include:

  • Isolation Forest: An ensemble learning algorithm that isolates anomalies by randomly partitioning the data space.
  • One-Class SVM: A support vector machine that learns a boundary around the normal data points.
  • Local Outlier Factor (LOF): A density-based algorithm that identifies anomalies as points with significantly lower density than their neighbors.
  • Autoencoders: Neural networks that learn to reconstruct the input data. Anomalies are identified as data points with high reconstruction error.

The best algorithm depends on the characteristics of your data and the specific requirements of your application. Experimentation with different algorithms is often necessary to find the optimal solution.
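
A quick way to start that experimentation is to fit several detectors on the same normal data and compare how many known-bad samples each one flags. The sketch below does this for three of the algorithms listed above using purely synthetic data; the distributions and hyperparameters (`nu`, `n_neighbors`) are arbitrary assumptions, and results on real traffic will differ.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(42)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))   # synthetic "normal" traffic
X_attack = rng.uniform(low=6.0, high=8.0, size=(20, 2))    # synthetic points far from normal

detectors = {
    "IsolationForest": IsolationForest(n_estimators=100, random_state=42),
    "OneClassSVM": OneClassSVM(nu=0.05, gamma="scale"),
    # novelty=True is required to call predict() on data not seen during fit()
    "LocalOutlierFactor": LocalOutlierFactor(n_neighbors=20, novelty=True),
}

detection_rates = {}
for name, detector in detectors.items():
    detector.fit(X_normal)
    predictions = detector.predict(X_attack)        # -1 marks an anomaly
    detection_rates[name] = float(np.mean(predictions == -1))
    print(f"{name}: flagged {detection_rates[name]:.0%} of synthetic attacks")
```

Because the synthetic attacks here are deliberately easy to separate, this only demonstrates the comparison workflow; a meaningful comparison needs a labelled dataset such as the public ones mentioned earlier.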

3. Model Training and Evaluation

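Once you've chosen an algorithm, you need to train it on a dataset of normal network traffic or system activity. This training data should ideally be free of any intrusions or anomalies. Split your data into training and testing sets (e.g., 80% training, 20% testing). To measure recall, the test set also needs labelled examples of intrusions; a test set that contains only normal traffic can tell you the false positive rate, but nothing about how many attacks are caught.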

After training, you need to evaluate the performance of your model on a separate test dataset. Key metrics for evaluating IDS performance include:

  • Precision: The proportion of correctly identified intrusions out of all instances flagged as intrusions.
  • Recall: The proportion of actual intrusions that are correctly identified.
  • F1-score: The harmonic mean of precision and recall.
  • False positive rate (FPR): The proportion of normal instances that are incorrectly flagged as intrusions.
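  • Accuracy: The overall proportion of correctly classified instances (both normal and intrusion). Because intrusions are usually rare, accuracy alone can look high even when most attacks are missed.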

Aim for high precision and recall, while minimizing the false positive rate.
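
The metrics listed above can all be derived from a confusion matrix. Here is a small sketch on hand-made labels, assuming the convention 1 = intrusion and 0 = normal; the label vectors are invented for illustration only.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical ground truth and predictions: 1 = intrusion, 0 = normal
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0, 1, 1, 1, 0]

# With labels=[0, 1], ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

precision = precision_score(y_true, y_pred)          # tp / (tp + fp)
recall = recall_score(y_true, y_pred)                # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)                        # harmonic mean of the two
fpr = fp / (fp + tn)                                 # false positive rate
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} "
      f"fpr={fpr:.2f} accuracy={accuracy:.2f}")
```

Note that scikit-learn has no dedicated false positive rate function, so it is computed directly from the confusion matrix counts.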


from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Split data into training and testing sets
X_train, X_test = train_test_split(X_scaled, test_size=0.2, random_state=42)

# Train the Isolation Forest model
model = IsolationForest(n_estimators=100, contamination='auto', random_state=42)
model.fit(X_train)

# Predict anomalies on the test set
y_pred = model.predict(X_test)

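# Map scikit-learn's output (1 = normal, -1 = anomaly) to binary labels (1 = normal, 0 = anomaly)
y_true = [1] * len(X_test)  # Assumes the test set contains only normal data
y_pred_binary = [1 if x == 1 else 0 for x in y_pred]

# Print classification report; zero_division=0 silences warnings for the anomaly class,
# which has no true samples under the assumption above
print(classification_report(y_true, y_pred_binary, labels=[0, 1], zero_division=0))

# With only normal test data, the informative number is the false positive rate
false_positive_rate = y_pred_binary.count(0) / len(y_pred_binary)
print(f"False positive rate: {false_positive_rate:.3f}")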

4. Implementing the IDS in Python

Now, let's put it all together and create a basic intrusion detection system using Python.


import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import IsolationForest
import joblib

class IDS:
    def __init__(self, model_path='ids_model.joblib', scaler_path='scaler.joblib'):
        self.model_path = model_path
        self.scaler_path = scaler_path
        self.model = None
        self.scaler = None
        self.features = ['src_port', 'dst_port', 'protocol', 'packet_size', 'flow_duration']  # Define your features here

    def train(self, data_path, contamination='auto'):
        """Trains the IDS model."""
        data = pd.read_csv(data_path)
        X = data[self.features].fillna(data[self.features].mean())  # Ensure only selected features are used

        # Scaling the data
        self.scaler = StandardScaler()
        X_scaled = self.scaler.fit_transform(X)

        # Train Isolation Forest model
        self.model = IsolationForest(n_estimators=100, contamination=contamination, random_state=42)
        self.model.fit(X_scaled)

        # Save the model and scaler
        joblib.dump(self.model, self.model_path)
        joblib.dump(self.scaler, self.scaler_path)
        print("Model trained and saved.")

    def load(self):
        """Loads a pre-trained model and scaler."""
        try:
            self.model = joblib.load(self.model_path)
            self.scaler = joblib.load(self.scaler_path)
            print("Model and scaler loaded.")
        except FileNotFoundError:
            print("Model or scaler file not found. Ensure they exist or train a new model.")
            return False
        return True


    def predict(self, data):
        """Predicts whether the given data is an anomaly."""
        if self.model is None or self.scaler is None:
            print("Model not loaded. Please train or load a model first.")
            return None

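        X = pd.DataFrame([data], columns=self.features)  # expects a dictionary keyed by feature name
        # A single row has no meaningful mean, so fill gaps with the training means stored in the scaler
        X = X.fillna(pd.Series(self.scaler.mean_, index=self.features))
        X_scaled = self.scaler.transform(X)
        prediction = self.model.predict(X_scaled)[0]  # returns 1 (normal) or -1 (anomaly)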

        return prediction


# Example usage
if __name__ == "__main__":
    # Train a new model (if you have training data)
    ids = IDS()
    ids.train('network_traffic.csv', contamination=0.05) #train the model

    #or load an existing model:
    #ids = IDS()
    #if ids.load(): #load existing model
    #    print("IDS loaded successfully")
    #else:
    #    print("Failed to load IDS model.")

    # Now predict using the IDS instance
    new_data = {'src_port': 8080, 'dst_port': 22, 'protocol': 6, 'packet_size': 1500, 'flow_duration': 60}
    prediction = ids.predict(new_data)

    if prediction == 1:
        print("Normal traffic.")
    elif prediction == -1:
        print("Anomaly detected!")
    else:
        print("Prediction failed.")
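
The `predict` method above returns only 1 or -1. When alerts need to be ranked by severity, Isolation Forest's `decision_function` offers a continuous score, where lower values are more anomalous and negative values correspond to a prediction of -1. The following is a rough sketch on synthetic data, not an extension of the `IDS` class itself; the data and the contamination value are assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_train = rng.normal(size=(500, 3))      # synthetic stand-in for scaled normal flows
model = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)
model.fit(X_train)

candidates = np.array([
    [0.1, -0.2, 0.0],    # close to the training distribution
    [9.0, 9.0, 9.0],     # far from it
])

# Lower scores are more anomalous; scores below 0 correspond to predict() == -1
scores = model.decision_function(candidates)
for row, score in zip(candidates, scores):
    status = "ANOMALY" if score < 0 else "normal"
    print(f"{row} score={score:+.3f} {status}")
```

Sorting alerts by this score lets analysts look at the most unusual events first instead of treating every flagged flow equally.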

5. Integration and Deployment

Integrating your custom IDS into your existing network infrastructure can be achieved using various methods. One approach is to deploy the IDS as a network sensor, capturing network traffic using tools like `tcpdump` or `Wireshark`. The captured data can then be fed into the Python-based IDS for real-time analysis. Alternatively, you can integrate the IDS with a Security Information and Event Management (SIEM) system for centralized monitoring and alerting.
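
As a rough sketch of what that real-time path could look like, the generator below scores flow records one at a time and emits a JSON alert for each anomaly, which a SIEM could ingest as a log line. It assumes that some upstream component has already turned captured packets into flow dictionaries; the field names, the `score_stream` helper, and the alert format are all hypothetical.

```python
import json

import numpy as np
from sklearn.ensemble import IsolationForest

FEATURES = ["packet_size", "flow_duration"]   # hypothetical feature subset

def score_stream(flows, model):
    """Yield a JSON alert string for every flow the model flags as anomalous."""
    for flow in flows:
        x = np.array([[flow[name] for name in FEATURES]], dtype=float)
        if model.predict(x)[0] == -1:
            yield json.dumps({
                "event": "ids_anomaly",
                "score": float(model.decision_function(x)[0]),
                "flow": flow,
            })

# Synthetic baseline standing in for historical normal flows
rng = np.random.RandomState(1)
baseline = np.column_stack([rng.normal(500, 50, 1000), rng.normal(2.0, 0.3, 1000)])
model = IsolationForest(random_state=1).fit(baseline)

# In a deployment, this list would be replaced by a live feed of flow records
incoming = [
    {"src_ip": "10.0.0.5", "packet_size": 510, "flow_duration": 2.1},
    {"src_ip": "10.0.0.9", "packet_size": 9000, "flow_duration": 120.0},
]
alerts = list(score_stream(incoming, model))
for alert in alerts:
    print(alert)
```

Because `score_stream` is a generator, it processes records lazily and can wrap any iterable source, such as a message queue consumer or a log tailer.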

6. Continuous Improvement and Retraining

Machine learning models can become outdated over time as network traffic patterns and attack techniques evolve. Regularly retrain your model with new data to maintain its accuracy and effectiveness. Implement a feedback loop where security analysts can review alerts generated by the IDS and provide feedback on whether they are true positives or false positives. This feedback can be used to further improve the model's performance.
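
One simple way to use that feedback, sketched below under stated assumptions, is to add alerts that analysts marked as false positives to the pool of normal data and refit the model, while keeping confirmed intrusions out of the training set. The `retrain_with_feedback` function and the data are hypothetical; real pipelines often also age out old data and validate the new model before swapping it in.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def retrain_with_feedback(X_normal, X_alerts, verdicts, random_state=42):
    """Fold analyst-reviewed false positives into the normal pool and refit.

    verdicts[i] is True when alert i was a confirmed intrusion (true positive)
    and False when an analyst marked it as benign (false positive).
    """
    verdicts = np.asarray(verdicts, dtype=bool)
    benign_alerts = X_alerts[~verdicts]              # confirmed intrusions are excluded
    X_updated = np.vstack([X_normal, benign_alerts])
    model = IsolationForest(n_estimators=100, random_state=random_state)
    model.fit(X_updated)
    return model, X_updated

# Synthetic example: 200 normal samples and 3 reviewed alerts
rng = np.random.RandomState(7)
X_normal = rng.normal(size=(200, 2))
X_alerts = np.array([[3.5, 3.5], [8.0, 8.0], [3.2, -3.4]])
verdicts = [False, True, False]   # the second alert was a real intrusion

model, X_updated = retrain_with_feedback(X_normal, X_alerts, verdicts)
print(X_updated.shape)
```

Only the two benign alerts are appended here, so the retrained model learns that similar traffic is acceptable while the confirmed intrusion stays outside the normal profile.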
