Phishing is popular since it is a low effort, high reward attack. Most modern web browsers, antivirus software and email clients are pretty good at detecting phishing websites at the source, helping to prevent attacks. To understand how they work, this blog post will walk you through a tutorial that shows you how to build your own phishing URL detector using Python and machine learning:
- Identify the criteria that can recognize fake URLs
- Build a decision tree that can iterate through the criteria
- Train our model to recognize fake vs real URLs
- Evaluate our model to see how it performs
- Check for false positives/negatives
Get Started: Install ML Tools With This Ready-To-Use Python Environment
To follow along with the code in this Python phishing detection tutorial, you’ll need to have a recent version of Python installed, along with all the packages used in this post. The quickest way to get up and running is to install the Phishing URL Detection runtime for Windows or Linux, which contains a version of Python and all the packages you’ll need.
In order to download the ready-to-use phishing detection Python environment, you will need to create an ActiveState Platform account. Just use your GitHub credentials or your email address to register. Signing up is easy and it unlocks the ActiveState Platform’s many benefits for you!
For Windows users, run the following at a CMD prompt to automatically download and install our CLI, the State Tool along with the COVID Simulation runtime into a virtual environment:
powershell -Command "& $([scriptblock]::Create((New-Object Net.WebClient).DownloadString('https://platform.activestate.com/dl/cli/install.ps1'))) -activate-default Pizza-Team/Phishing-URL-Detection"
For Linux users, run the following to automatically download and install our CLI, the State Tool along with the COVID Simulation runtime into a virtual environment:
sh <(curl -q https://platform.activestate.com/dl/cli/install.sh) --activate-default Pizza-Team/Phishing-URL-Detection
1 — How to Identify A Fraudulent URL
A fraudulent domain or phishing domain is an URL scheme that looks suspicious for a variety of reasons. Most commonly, the URL:
- Is misspelled
- Points to the wrong top-level domain
- A combination of a valid and a fraudulent URL
- Is incredibly long
- Is just be an IP address
- Has a low pagerank
- Has a young domain age
- Ranks poorly on the Alexa Top 1 Million Sites
All these are characteristics of a phishing URL that can help us distinguish it from a valid URL. These characteristics can be converted into machine learning feature sets such as numbers, labels and booleans.
The University of California, Irvine put together a dataset identifying fraudulent versus valid URLs. Feature sets are divided into four main categories:
- Address Bar-Based Features – these are features extracted from the URL itself, like URL length >54 characters, or whether it contains an IP address, uses an URL shortening service like TinyURL or Bitly, or employs redirection. Addition features may also include:
- Adding a prefix or suffix separated by (-) to the domain
- Having sub-domain and multi-sub-domains
- Existence of HTTPS
- Domain registration age
- Favicon loading from a different domain
- Using a non-standard port
- Abnormal Features – these may include:
- Loading images loaded in the body from a different URL
- Minimal use of meta tags
- The use of a Server Form Handler (SFH)
- Submitting information to email
- An abnormal URL
- HTML and JavaScript-Based Features – these can include things like:
- Website forwarding
- Status bar customization typically using JavaScript to display a fake URL
- Disabling the ability to right-click so users can’t view page source code
- Using pop-up windows
- iFrame redirection
- Domain-Based Features – these can include:
- Unusually young domains
- Suspicious DNS record
- Low volume of website traffic
- PageRank, where 95% of phishing webpages have no PageRank
- Whether the site has been indexed by Google
2 — Building A Decision Tree
Given all the criteria that can help us identify phishing URLs, we can use a machine learning algorithm, such as a decision tree classifier to help us decide whether an URL is valid or not.
First, let’s download the UC Irvine dataset and explore its contents. The feature list contains:
- having_IP_Address { -1,1 }
- URL_Length { 1,0,-1 }
- Shortining_Service { 1,-1 }
- having_At_Symbol { 1,-1 }
- double_slash_redirecting { -1,1 }
- Prefix_Suffix { -1,1 }
- having_Sub_Domain { -1,0,1 }
- SSLfinal_State { -1,1,0 }
- Domain_registeration_length { -1,1 }
- Favicon { 1,-1 }
- port { 1,-1 }
- HTTPS_token { -1,1 }
- Request_URL { 1,-1 }
- URL_of_Anchor { -1,0,1 }
- Links_in_tags { 1,-1,0 }
- SFH { -1,1,0 }
- Submitting_to_email { -1,1 }
- Abnormal_URL { -1,1 }
- Redirect { 0,1 }
- on_mouseover { 1,-1 }
- RightClick { 1,-1 }
- popUpWidnow { 1,-1 }
- Iframe { 1,-1 }
- age_of_domain { -1,1 }
- DNSRecord { -1,1 }
- web_traffic { -1,0,1 }
- Page_Rank { -1,1 }
- Google_Index { 1,-1 }
- Links_pointing_to_page { 1,0,-1 }
- Statistical_report { -1,1 }
And finally, the Result designates whether the URL is valid or not:
- Result { -1,1 }
Where -1 denotes an invalid URL and 1 is a valid URL.
Now let’s now jump into the code. First, we load the required modules:
# To perform operations on dataset import pandas as pd import numpy as np # Machine learning model from sklearn.model_selection import train_test_split from sklearn.tree import DecisionTreeClassifier # Visualization from sklearn import metrics from sklearn.metrics import confusion_matrix import matplotlib.pyplot as plt import seaborn as sns from sklearn.tree import export_graphviz
Next we read and split the dataset:
df = pd.read_csv('.../dataset.csv') dot_file = '.../tree.dot' confusion_matrix_file = '.../confusion_matrix.png'
And then print the results:
print(df.head())
-1 1 1.1 1.2 -1.1 -1.2 -1.3 -1.4 -1.5 1.3 1.4 -1.6 1.5 -1.7 1.6 ... -1.9 -1.10 0 1.7 1.8 1.9 1.10 -1.11 -1.12 -1.13 -1.14 1.11 1.12 -1.15 -1.16 0 1 1 1 1 1 -1 0 1 -1 1 1 -1 1 0 -1 ... 1 1 0 1 1 1 1 -1 -1 0 -1 1 1 1 -1 1 1 0 1 1 1 -1 -1 -1 -1 1 1 -1 1 0 -1 ... -1 -1 0 1 1 1 1 1 -1 1 -1 1 0 -1 -1 2 1 0 1 1 1 -1 -1 -1 1 1 1 -1 -1 0 0 ... 1 1 0 1 1 1 1 -1 -1 1 -1 1 -1 1 -1 3 1 0 -1 1 1 -1 1 1 -1 1 1 1 1 0 0 ... 1 1 0 -1 1 -1 1 -1 -1 0 -1 1 1 1 1 4 -1 0 -1 1 -1 -1 1 1 -1 1 1 -1 1 0 0 ... -1 -1 0 1 1 1 1 1 1 1 -1 1 -1 -1 1
This dataset contains 5 rows and 31 columns, where each column contains a value for each of the attributes we discussed in the above section.
3 — Train the Model
As always, the first step in training a machine learning model is to split the dataset into testing and training data:
X = df.iloc[:, :-1] y = df.iloc[:, -1] Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)
Since the dataset contains boolean data, it’s always best to use a Decision Tree, RandomForest Classifier or Logistic Regression algorithm since these models work best for classification. In this case, I chose to work with a Decision Tree, because it’s straightforward and generally gives the best results when trying to classify data.
model = DecisionTreeClassifier() model.fit(Xtrain, ytrain)
4 — Evaluate the Model
Now that the model is trained, let’s see how well it does on the test data:
ypred = model.predict(Xtest) print(metrics.classification_report(ypred, ytest)) print("\n\nAccuracy Score:", metrics.accuracy_score(ytest, ypred).round(2)*100, "%")
We used the model to predict Xtest data. Now let’s compare the results to ytest and see how well we did:
precision recall f1-score support -1 0.95 0.95 0.95 1176 1 0.96 0.96 0.96 1588 micro avg 0.96 0.96 0.96 2764 macro avg 0.96 0.96 0.96 2764 weighted avg 0.96 0.96 0.96 2764 Accuracy Score: 96.0 %
Not bad! We made literally no modifications to the data and achieved an accuracy score of 96%. From here, you can dive deeper into the data and see if there’s any transformation that can be done to further improve the accuracy of prediction.
5 — Identify False Positives & False Negatives
The results of any decision tree evaluation are likely to contain both false positives (URLs that are actually valid, but that our model indicates are not), as well as false negatives (URLs that are actually bad, but our model indicates are fine). To help resolve these instances, let’s draw out a confusion matrix (a table with 4 different combinations of predicted and actual values) for our results. The matrix will help us identify:
- True Positives
- True Negatives
- False Positives (Type 1 Error)
- False Negatives (Type 2 Error)
mat = confusion_matrix(ytest, ypred) sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False) plt.xlabel('true label') plt.ylabel('predicted label'); plt.savefig(confusion_matrix_file)
As you can see, the number of false positives and false negatives are pretty low compared to our true positives and negatives, so we can be pretty sure of our results.
To see how the decision tree panned out in making these decisions, we can visualize it with sklearn, matplotlib and sns.
export_graphviz(model, out_file=dot_file, feature_names=X.columns.values) >> dot -Tpng tree.dot -o tree.png
We use export_graphviz to create a dot file of the decision tree, which is a text file that lets us visualize the actual bifurcations in decisions. Then, using the command line tool dot we convert the text file to a PNG image which shows our final “tree” of decisions (open it in a new tab to view the details):
Phishing URL Detection with Python: Summary
These days, when everyone is working for home, there’s a lot less opportunity to just casually ask your office colleagues if they’ve received a suspicious email like the one you just got. And attackers know it, driving a 300% increase in cybercrime since the start of the pandemic. It’s always good practice to check every link before you click on it, but of course, busy employees can get careless.
This blog post showed you how, given a set of criteria that can typically identify phishing URLs, you can build and train a simple decision tree model to evaluate any given URL, and indicate whether it is actually valid or not with 96% accuracy. Now, if only it was as easy as this to prevent people from clicking fraudulent links in the first place!
- You can find the criteria for evaluating phishing URLs in UC Irvine’s dataset.
- To get started building your own URL phishing detector, sign up for a free ActiveState Platform account so you can download our Phishing URL Detection runtime environment and get started faster.
Related Reads
Recent Posts
Automating Vulnerability Management
Automating vul’n remediation is still limited by code coverage & breaking changes, but ActiveState closes some gaps to remediating at scale.
Regulatory Compliance & Open Source Software
Open source is rarely built with regulatory compliance in mind. Learn how to create & enforce compliance for OSS during software development.
Software Supply Chain Security for FinServ
Learn how financial services can secure their software supply chain with best practices and key insights.