Research & Results

Comprehensive analysis of our AI-powered phishing detection system

98.59%

Model Accuracy

450K+

URLs Analyzed

28

Features Extracted

Model Performance Analysis

Visual representation of our model's performance metrics and comparisons

Model Accuracy Graph

Accuracy Analysis

This graph shows how the model's accuracy progressed across training epochs.

Model Metrics Graph

Performance Metrics

Breakdown of the model's precision, recall, and F1 score.

Parameter Comparison Graph

Parameter Comparison

Comparison of model parameter settings and their effect on overall performance, used to identify the best configuration.

Abstract

PhishShield applies machine learning to phishing detection. Starting from a raw collection of 450K+ URLs, refined to 186,262 labeled samples, we built a feature extraction system that analyzes 28 characteristics across URL structure, page content, and security indicators. Our XGBoost model achieves 98.59% accuracy and a 99.85% AUC-ROC score. Implemented as a Chrome extension on Manifest V3, PhishShield provides real-time URL monitoring, visual threat alerts, and scan history tracking, showing that a machine learning model can be integrated into a practical browser-based security tool.

Data Collection Journey

Initial Dataset

Started with 11,000 URLs from Kaggle, each with 30 associated features

Source Expansion

  • PhishTank: ~65,000 live phishing URLs
  • OpenPhish: ~500 live phishing URLs
  • UCI Dataset: 110,000 labeled instances
  • Additional Kaggle Datasets: ~34,000 valid URLs
  • UCI Repository: 134,000 valid URLs

Data Cleaning

Initial cleaning reduced 450,000 URLs to 245,000 unique URLs. Further refinement left 186,262 samples (87,667 legitimate, 98,595 phishing).

Feature Extraction

URL-Based Features

  • URL Length Analysis
  • Domain Length
  • TLD Length
  • Letter Ratio in URL
  • Digit Ratio in URL
  • Special Character Ratio
  • Abnormal URL Patterns
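
The lexical features above can be computed from the URL string alone. The following is a minimal sketch, not the project's actual extractor: the function name `url_features`, the dictionary keys, and the exact definitions (for example, treating the last dot-separated label as the TLD and counting every non-alphanumeric character as special) are assumptions made for illustration.

```python
from urllib.parse import urlsplit


def url_features(url: str) -> dict:
    """Compute a few lexical URL features (illustrative definitions)."""
    # hostname is None when the URL has no scheme; fall back to "".
    host = urlsplit(url).hostname or ""
    # Assumption: the TLD is the last dot-separated label. A production
    # pipeline would likely use a public-suffix list (e.g. for "co.uk").
    tld = host.rsplit(".", 1)[-1] if "." in host else ""
    n = len(url) or 1  # avoid division by zero on empty input
    letters = sum(c.isalpha() for c in url)
    digits = sum(c.isdigit() for c in url)
    return {
        "url_length": len(url),
        "domain_length": len(host),
        "tld_length": len(tld),
        "letter_ratio": letters / n,
        "digit_ratio": digits / n,
        # Everything that is neither a letter nor a digit.
        "special_char_ratio": (len(url) - letters - digits) / n,
    }


features = url_features("http://paypa1-login.example.com/verify?id=123")
```

The three ratios sum to 1 by construction, so a model sees them as a composition of the URL rather than as independent counts.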

Page Content Features

  • Largest Line Length
  • Number of Images
  • JavaScript Files Count
  • CSS Files Count
  • Self-Referencing Links
  • External References
  • Iframe Presence
  • Pop-Up Windows
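
Several of the page content features are simple counts over the HTML. The sketch below uses Python's standard `html.parser` to show one way such counts could be gathered; the class `PageStats`, the function `page_features`, and the key names are hypothetical, and features such as pop-up detection or self-referencing links are left out.

```python
from html.parser import HTMLParser


class PageStats(HTMLParser):
    """Count a few tags that the page-content features refer to."""

    def __init__(self):
        super().__init__()
        self.counts = {"images": 0, "js_files": 0, "css_files": 0, "iframes": 0}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            self.counts["images"] += 1
        elif tag == "script" and attrs.get("src"):
            # Only external scripts count as "JavaScript files" here.
            self.counts["js_files"] += 1
        elif tag == "link" and "stylesheet" in (attrs.get("rel") or "").lower():
            self.counts["css_files"] += 1
        elif tag == "iframe":
            self.counts["iframes"] += 1


def page_features(html: str) -> dict:
    parser = PageStats()
    parser.feed(html)
    stats = dict(parser.counts)
    # Length of the longest line in the raw HTML source.
    stats["largest_line_length"] = max((len(line) for line in html.splitlines()), default=0)
    stats["has_iframe"] = stats["iframes"] > 0
    return stats


sample = """<html><head>
<link rel="stylesheet" href="style.css">
<script src="app.js"></script>
</head><body>
<img src="logo.png"><img src="banner.png">
<iframe src="https://example.com/frame"></iframe>
</body></html>"""
stats = page_features(sample)
```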

Security Features

  • HTTPS Status
  • Obfuscation Detection
  • Title Tag Presence
  • Meta Description
  • Submit Button Analysis
  • Password Field Detection
  • Domain IP Address
  • Right-Click Disabled Check
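
Three of the checks above (HTTPS status, an IP address used in place of a domain, and password field detection) are easy to illustrate with the standard library. This is a sketch under assumed definitions; the function names are hypothetical, and the regular expression is a rough heuristic rather than a full HTML parse.

```python
import ipaddress
import re
from urllib.parse import urlsplit


def uses_https(url: str) -> bool:
    """True when the URL scheme is https."""
    return urlsplit(url).scheme == "https"


def host_is_ip(url: str) -> bool:
    """True when the host part is a literal IPv4 or IPv6 address."""
    host = urlsplit(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


# Heuristic: an <input> tag whose type attribute is "password".
_PASSWORD_INPUT = re.compile(
    r"""<input\b[^>]*\btype\s*=\s*["']?password""", re.IGNORECASE
)


def has_password_field(html: str) -> bool:
    return _PASSWORD_INPUT.search(html) is not None
```

For example, `host_is_ip("http://192.168.0.1/login")` is true while `host_is_ip("https://example.com")` is false.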

Data Preprocessing Journey

Our preprocessing pipeline cleans the collected URLs, engineers features, and balances the classes before model training.

Initial Cleaning

  • Removal of duplicate URLs
  • Handling of missing values
  • Standardization of URL formats
  • Verification of URL accessibility
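
Standardizing URL formats before removing duplicates matters because the same address can appear in several spellings. The sketch below shows one possible normalization (lowercase scheme and host, default path `/`, fragment dropped); these rules and the names `normalize_url` and `deduplicate` are assumptions, not the rules the project actually applied.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Return a canonical spelling of the URL (illustrative rules)."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    path = parts.path or "/"
    # Keep the query string, drop the fragment.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def deduplicate(urls):
    """Keep the first occurrence of each normalized URL, in input order."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


cleaned = deduplicate([
    "HTTP://Example.com",
    "http://example.com/",
    " http://example.com/#top",
    "http://example.com/a",
])
```

Here the first three inputs collapse to a single entry, so `cleaned` holds two URLs.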

Feature Engineering

  • URL structure analysis
  • Domain-based feature extraction
  • HTML content parsing
  • Security certificate validation

Data Balancing

  • Class distribution analysis
  • Undersampling of majority class
  • SMOTE for minority class
  • Validation split preparation
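
Random undersampling of the majority class can be sketched in a few lines. The function `undersample` and its parameters are hypothetical and only illustrate the idea; SMOTE, which synthesizes new minority-class points by interpolating between neighbors, is usually taken from a dedicated library and is not shown here.

```python
import random


def undersample(samples, labels, seed=0):
    """Randomly drop rows until every class matches the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    # Size of the smallest class is the target for all classes.
    target = min(len(rows) for rows in by_class.values())
    balanced = []
    for label, rows in by_class.items():
        for sample in rng.sample(rows, target):
            balanced.append((sample, label))
    rng.shuffle(balanced)
    return balanced


# Toy data: 7 rows labeled 1 (majority) and 3 rows labeled 0 (minority).
balanced = undersample(list(range(10)), [1] * 7 + [0] * 3)
```

Fixing the seed keeps the selection reproducible, which helps when comparing models trained on the same balanced subset.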

Model Development

Model Performance Comparison

Best Performer

XGBoost

  • Accuracy: 98.59%
  • AUC-ROC: 99.85%
  • Precision: 99.31%
  • Recall: 98.55%
  • F1 Score: 98.93%
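
As a consistency check, the F1 score is the harmonic mean of precision and recall, so it can be recomputed from the two values listed above. The helper name `f1_score` is introduced here only for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for XGBoost, in percent.
f1 = f1_score(99.31, 98.55)
print(round(f1, 2))  # 98.93
```

The result agrees with the reported F1 score of 98.93%.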

Stacking Ensemble

  • Accuracy: 98.57%
  • Enhanced Robustness
  • Balanced Performance
  • Multi-Model Integration
  • Reduced Variance

AutoGluon

  • Accuracy: 98.43%
  • AUC-ROC: 99.82%
  • Automated Optimization
  • Quick Deployment
  • Self-Tuning

Chrome Extension Implementation

Key Features

  • Real-time URL monitoring and threat detection
  • Built on Chrome's Manifest V3 standard
  • Service worker-based architecture
  • Local storage for scan history (last 10 scans)
  • Visual alerts for detected threats
  • Integration with FastAPI backend
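
The "last 10 scans" history behaves like a bounded queue: each new scan pushes the oldest one out. The extension itself would implement this in JavaScript on top of Chrome's storage; the Python sketch below only illustrates the retention logic, and the `ScanHistory` class and its field names are hypothetical.

```python
from collections import deque

MAX_HISTORY = 10  # matches the "last 10 scans" limit described above


class ScanHistory:
    """Keep only the most recent scan results, newest first."""

    def __init__(self, limit=MAX_HISTORY):
        # A deque with maxlen silently discards the oldest entry when full.
        self._items = deque(maxlen=limit)

    def add(self, url, verdict):
        self._items.appendleft({"url": url, "verdict": verdict})

    def recent(self):
        return list(self._items)


history = ScanHistory()
for i in range(12):
    history.add(f"https://example.com/page{i}", "safe")
```

After twelve scans the history holds ten entries, with `page11` first and `page2` last.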

Future Scope

Multi-Modal Analysis

Integrating visual and textual content analysis with URL features for enhanced detection.

Cross-Language Support

Expanding detection capabilities to handle multiple languages and regional variations.

Mobile & IoT Protection

Optimizing detection for mobile browsers and IoT devices.

Federated Learning

Implementing privacy-preserving collaborative learning across devices.