Phishing URL Detection with Python and ML

Phishing is a form of fraudulent attack in which the attacker tries to obtain sensitive information by posing as a reputable source. In a typical phishing attack, a victim opens a compromised link that poses as a credible website. The victim is then asked to enter their credentials, but since the site is “fake,” the sensitive information is routed to the hacker and the victim gets “hacked.”

Phishing is popular since it is a low effort, high reward attack. Most modern web browsers, antivirus software and email clients are pretty good at detecting phishing websites at the source, helping to prevent attacks. To understand how they work, this blog post will walk you through a tutorial that shows you how to build your own phishing URL detector using Python and machine learning:

  1. Identify the criteria that can recognize fake URLs
  2. Build a decision tree that can iterate through the criteria
  3. Train our model to recognize fake vs real URLs
  4. Evaluate our model to see how it performs
  5. Check for false positives/negatives

Get Started: Install ML Tools With This Ready-To-Use Python Environment

To follow along with the code in this Python phishing detection tutorial, you’ll need to have a recent version of Python installed, along with all the packages used in this post. The quickest way to get up and running is to install the Phishing URL Detection runtime for Windows or Linux, which contains a version of Python and all the packages you’ll need. 

Required Packages

In order to download the ready-to-use phishing detection Python environment, you will need to create an ActiveState Platform account. Just use your GitHub credentials or your email address to register. Signing up is easy and it unlocks the ActiveState Platform’s many benefits for you!

Supported Platforms

For Windows users, run the following at a CMD prompt to automatically download and install our CLI, the State Tool, along with the Phishing URL Detection runtime into a virtual environment:

powershell -Command "& $([scriptblock]::Create((New-Object Net.WebClient).DownloadString('https://platform.activestate.com/dl/cli/install.ps1'))) -activate-default Pizza-Team/Phishing-URL-Detection"

For Linux users, run the following to automatically download and install our CLI, the State Tool, along with the Phishing URL Detection runtime into a virtual environment:

sh <(curl -q https://platform.activestate.com/dl/cli/install.sh) --activate-default Pizza-Team/Phishing-URL-Detection

1 — How to Identify A Fraudulent URL

A fraudulent or phishing domain is a URL that looks suspicious for a variety of reasons. Most commonly, the URL: 

  • Is misspelled
  • Points to the wrong top-level domain
  • Combines a valid URL with a fraudulent one
  • Is incredibly long 
  • Is just an IP address
  • Has a low PageRank
  • Has a young domain age
  • Ranks poorly on the Alexa Top 1 Million Sites

All these are characteristics of a phishing URL that can help us distinguish it from a valid URL. These characteristics can be converted into machine learning feature sets such as numbers, labels and booleans.
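
As a rough illustration of that conversion (this is not part of the UCI dataset used below, and the function name and exact checks are just assumptions for demonstration), here is a minimal sketch of how a few of these characteristics could be pulled out of a raw URL string using only the Python standard library:

import re
from urllib.parse import urlparse

def extract_basic_features(url):
    """Toy example: turn a raw URL into a few numeric/boolean features."""
    host = urlparse(url).netloc
    return {
        # Very long URLs are a common phishing trait
        "url_length": len(url),
        # The host is a bare IP address instead of a domain name
        "has_ip_address": bool(re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}(:\d+)?", host)),
        # An '@' symbol can hide the real destination
        "has_at_symbol": "@" in url,
        # A '-' in the domain is often used to mimic legitimate brands
        "has_prefix_suffix": "-" in host,
        # Many nested subdomains, e.g. paypal.com.secure-login.example.com
        "subdomain_count": max(host.count(".") - 1, 0),
    }

print(extract_basic_features("http://192.168.0.1/login@secure-paypal.com"))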

The University of California, Irvine put together a dataset identifying fraudulent versus valid URLs. Feature sets are divided into four main categories:

  1. Address Bar-Based Features – these are features extracted from the URL itself, like URL length >54 characters, or whether it contains an IP address, uses a URL shortening service like TinyURL or Bitly, or employs redirection. Additional features may also include:

    • Adding a prefix or suffix separated by (-) to the domain
    • Having sub-domain and multi-sub-domains
    • Existence of HTTPS
    • Domain registration age
    • Favicon loading from a different domain
    • Using a non-standard port
  2. Abnormal Features – these may include: 
    • Images in the body loaded from a different URL
    • Minimal use of meta tags
    • The use of a Server Form Handler (SFH)
    • Submitting information to email
    • An abnormal URL
  3. HTML and JavaScript-Based Features – these can include things like: 
    • Website forwarding 
    • Status bar customization, typically using JavaScript to display a fake URL 
    • Disabling the ability to right-click so users can’t view page source code
    • Using pop-up windows
    • iFrame redirection
  4. Domain-Based Features – these can include:
    • Unusually young domains
    • Suspicious DNS record
    • Low volume of website traffic
    • PageRank, where 95% of phishing webpages have no PageRank
    • Whether the site has been indexed by Google

2 — Building A Decision Tree

Given all the criteria that can help us identify phishing URLs, we can use a machine learning algorithm, such as a decision tree classifier, to help us decide whether a URL is valid or not. 

First, let’s download the UC Irvine dataset and explore its contents. The feature list contains:

  • having_IP_Address  { -1,1 }
  • URL_Length   { 1,0,-1 }
  • Shortining_Service { 1,-1 }
  • having_At_Symbol   { 1,-1 }
  • double_slash_redirecting { -1,1 }
  • Prefix_Suffix  { -1,1 }
  • having_Sub_Domain  { -1,0,1 }
  • SSLfinal_State  { -1,1,0 }
  • Domain_registeration_length { -1,1 }
  • Favicon { 1,-1 }
  • port { 1,-1 }
  • HTTPS_token { -1,1 }
  • Request_URL  { 1,-1 }
  • URL_of_Anchor { -1,0,1 }
  • Links_in_tags { 1,-1,0 }
  • SFH  { -1,1,0 }
  • Submitting_to_email { -1,1 }
  • Abnormal_URL { -1,1 }
  • Redirect  { 0,1 }
  • on_mouseover  { 1,-1 }
  • RightClick  { 1,-1 }
  • popUpWidnow  { 1,-1 }
  • Iframe { 1,-1 }
  • age_of_domain  { -1,1 }
  • DNSRecord   { -1,1 }
  • web_traffic  { -1,0,1 }
  • Page_Rank { -1,1 }
  • Google_Index { 1,-1 }
  • Links_pointing_to_page { 1,0,-1 }
  • Statistical_report { -1,1 }

And finally, the Result designates whether the URL is valid or not:

  • Result  { -1,1 }

Where -1 denotes an invalid URL and 1 is a valid URL.
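
To make the encoding concrete, here is a hedged sketch of how a couple of these features might be mapped to the dataset’s {-1, 0, 1} convention. The thresholds follow the dataset’s published feature descriptions, but treat them as assumptions rather than the exact extraction code used to build the CSV:

import re
from urllib.parse import urlparse

def encode_url_length(url):
    # Per the dataset description: short URLs look legitimate (1),
    # mid-length URLs are suspicious (0), very long ones phishing (-1)
    if len(url) < 54:
        return 1
    elif len(url) <= 75:
        return 0
    return -1

def encode_having_ip_address(url):
    # -1 (phishing indicator) when the host is a bare IP address
    host = urlparse(url).netloc
    return -1 if re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}(:\d+)?", host) else 1

print(encode_url_length("http://example.com"), encode_having_ip_address("http://10.0.0.1/login"))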

Now let’s jump into the code. First, we load the required modules:

# To perform operations on dataset
import pandas as pd
import numpy as np
# Machine learning model
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# Visualization
from sklearn import metrics
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import export_graphviz

Next, we read the dataset and set up the output file paths:

df = pd.read_csv('.../dataset.csv')
dot_file = '.../tree.dot'
confusion_matrix_file = '.../confusion_matrix.png'

And then print the first few rows:

print(df.head())
-1  1  1.1  1.2  -1.1  -1.2  -1.3  -1.4  -1.5  1.3  1.4  -1.6  1.5  -1.7  1.6     ...    -1.9  -1.10  0  1.7  1.8  1.9  1.10  -1.11  -1.12  -1.13  -1.14  1.11  1.12  -1.15  -1.16
0   1  1    1    1     1    -1     0     1    -1    1    1    -1    1     0   -1  ...       1      1  0    1    1    1     1     -1     -1      0     -1     1     1      1     -1
1   1  0    1    1     1    -1    -1    -1    -1    1    1    -1    1     0   -1  ...      -1     -1  0    1    1    1     1      1     -1      1     -1     1     0     -1     -1
2   1  0    1    1     1    -1    -1    -1     1    1    1    -1   -1     0    0  ...       1      1  0    1    1    1     1     -1     -1      1     -1     1    -1      1     -1
3   1  0   -1    1     1    -1     1     1    -1    1    1     1    1     0    0  ...       1      1  0   -1    1   -1     1     -1     -1      0     -1     1     1      1      1
4  -1  0   -1    1    -1    -1     1     1    -1    1    1    -1    1     0    0  ...      -1     -1  0    1    1    1     1      1      1      1     -1     1    -1     -1      1

The printout shows the first 5 rows of the dataset, which has 31 columns: one for each of the attributes we discussed in the section above, plus the Result column.
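
Note that the column headers in the printout above look like data values (-1, 1, 1.1, …), which suggests this copy of the CSV has no header row, so pandas promoted the first data row to column names. If that’s the case for your copy too, you can pass explicit column names when loading. The names list below is just the feature list from the previous section; adjust it to match your file:

# Only needed if your dataset.csv has no header row
feature_names = [
    'having_IP_Address', 'URL_Length', 'Shortining_Service', 'having_At_Symbol',
    'double_slash_redirecting', 'Prefix_Suffix', 'having_Sub_Domain', 'SSLfinal_State',
    'Domain_registeration_length', 'Favicon', 'port', 'HTTPS_token', 'Request_URL',
    'URL_of_Anchor', 'Links_in_tags', 'SFH', 'Submitting_to_email', 'Abnormal_URL',
    'Redirect', 'on_mouseover', 'RightClick', 'popUpWidnow', 'Iframe', 'age_of_domain',
    'DNSRecord', 'web_traffic', 'Page_Rank', 'Google_Index', 'Links_pointing_to_page',
    'Statistical_report', 'Result',
]
df = pd.read_csv('.../dataset.csv', header=None, names=feature_names)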

3 — Train the Model

As always, the first step in training a machine learning model is to split the dataset into testing and training data:

X = df.iloc[:, :-1]
y = df.iloc[:, -1]
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)
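
By default, train_test_split holds out 25% of the rows for testing. If you want to control the split size, or keep the proportion of phishing vs. valid URLs the same in both sets, you can be explicit (a small optional tweak, not part of the original code):

# Optional: explicit test size and class-balanced (stratified) split
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
print(Xtrain.shape, Xtest.shape)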

Since the dataset consists of categorical features (each encoded as -1, 0 or 1), classification algorithms such as a decision tree, random forest or logistic regression are a natural fit. In this case, I chose to work with a decision tree, because it’s straightforward, quick to train, and easy to interpret.

model = DecisionTreeClassifier()
model.fit(Xtrain, ytrain)
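
Once the model is fit, a quick sanity check (not in the original post) is to look at which features the tree relies on most; scikit-learn exposes this through the feature_importances_ attribute:

# Rank features by how much they contribute to the tree's splits
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))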

4 — Evaluate the Model

Now that the model is trained, let’s see how well it does on the test data:

ypred = model.predict(Xtest)
print(metrics.classification_report(ytest, ypred))
print("\n\nAccuracy Score:", metrics.accuracy_score(ytest, ypred).round(2)*100, "%")

We used the model to predict Xtest data. Now let’s compare the results to ytest and see how well we did:

             precision    recall  f1-score   support
         -1       0.95      0.95      0.95      1176
          1       0.96      0.96      0.96      1588
  micro avg       0.96      0.96      0.96      2764
  macro avg       0.96      0.96      0.96      2764
weighted avg       0.96      0.96      0.96      2764
Accuracy Score: 96.0 %

Not bad! We made no modifications to the data and achieved an accuracy score of 96%. From here, you can dive deeper into the data and see if there’s any transformation that can be done to further improve the accuracy of prediction.
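
One low-effort place to start is tuning the tree itself. An unconstrained decision tree can overfit, so a small grid search over its depth and leaf size (a sketch with arbitrary parameter ranges, not part of the original post) can sometimes buy an extra point or two of accuracy:

from sklearn.model_selection import GridSearchCV

# Try a few tree sizes with 5-fold cross-validation on the training set
param_grid = {
    'max_depth': [5, 10, 20, None],
    'min_samples_leaf': [1, 5, 10],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(Xtrain, ytrain)
print("Best parameters:", search.best_params_)
print("Cross-validated accuracy:", round(search.best_score_ * 100, 2), "%")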

5 — Identify False Positives & False Negatives

The results of any decision tree evaluation are likely to contain both false positives (URLs that are actually valid, but that our model indicates are not), as well as false negatives (URLs that are actually bad, but our model indicates are fine). To help resolve these instances, let’s draw out a confusion matrix (a table with 4 different combinations of predicted and actual values) for our results. The matrix will help us identify:

  • True Positives
  • True Negatives
  • False Positives (Type 1 Error)
  • False Negatives (Type 2 Error)

mat = confusion_matrix(ytest, ypred)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False)
plt.xlabel('true label')
plt.ylabel('predicted label');
plt.savefig(confusion_matrix_file)

As you can see, the number of false positives and false negatives are pretty low compared to our true positives and negatives, so we can be pretty sure of our results.
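
If you want the raw counts behind the heatmap, you can unpack the matrix directly. To match the definitions above (a “positive” being a URL flagged as phishing), order the labels so that -1 (phishing) is treated as the positive class:

# Order labels as [valid, phishing] so the familiar tn/fp/fn/tp unpacking
# treats "flagged as phishing" as the positive prediction
tn, fp, fn, tp = confusion_matrix(ytest, ypred, labels=[1, -1]).ravel()
print("True positives (phishing caught):    ", tp)
print("True negatives (valid passed):       ", tn)
print("False positives (valid flagged bad): ", fp)
print("False negatives (phishing missed):   ", fn)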

To see how the decision tree panned out in making these decisions, we can visualize it using scikit-learn’s export_graphviz and the Graphviz dot command-line tool.

export_graphviz(model, out_file=dot_file, feature_names=X.columns.values)
>> dot -Tpng tree.dot -o tree.png

We use export_graphviz to create a dot file of the decision tree, which is a text file that lets us visualize the actual bifurcations in decisions. Then, using the command line tool dot we convert the text file to a PNG image which shows our final “tree” of decisions (open it in a new tab to view the details):

Decision Tree
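
If you don’t have the Graphviz dot tool installed, recent versions of scikit-learn can also render the tree directly with matplotlib via plot_tree, which skips the separate conversion step (the output file name below is just an example):

from sklearn.tree import plot_tree

# Render the (large) tree straight to a PNG without Graphviz
plt.figure(figsize=(40, 20))
plot_tree(model, feature_names=X.columns.values, filled=True, fontsize=6)
plt.savefig('tree_sklearn.png', dpi=150)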

Phishing URL Detection with Python: Summary

These days, when everyone is working from home, there’s a lot less opportunity to just casually ask your office colleagues if they’ve received a suspicious email like the one you just got. And attackers know it, driving a 300% increase in cybercrime since the start of the pandemic. It’s always good practice to check every link before you click on it, but of course, busy employees can get careless.

This blog post showed you how, given a set of criteria that can typically identify phishing URLs, you can build and train a simple decision tree model to evaluate any given URL, and indicate whether it is actually valid or not with 96% accuracy. Now, if only it was as easy as this to prevent people from clicking fraudulent links in the first place!

Related Reads

Top 5 Cybersecurity Tools for a Work-from-Home World

Using Python for CyberSecurity Testing
