Number Usage in Passwords

Published: 2024-01-17. Last Updated: 2024-01-17 00:15:53 UTC
by Jesse La Grew (Version: 1)
2 comment(s)

Numbers are often used in passwords to add complexity. Passwords submitted to honeypots are also often found within pre-existing passwords lists, containing compromised credentials. What numbers are most commonly used?

First, unique passwords were extracted using 'jq'.

# read cowrie JSON files
# cat /logs/cowrie.json*

# select any values with the password key present
# jq 'select(.password)'

# get password values without quotes (raw)
# jq -r .password 

# get unique values and save them to a text file
# sort | uniq > 2024-01-07_unique_passwords.txt

cat /logs/cowrie.json* | jq 'select(.password)' | jq -r .password | sort | uniq > 2024-01-07_unique_passwords.txt

 

Date range of data: 04/13/2022 - 01/05/2024
Number of unique passwords: 275,811

Python was usued to extract the numbers and chart the most common results.

import re
from collections import Counter
import matplotlib.pyplot as plt

filehandle = open("2024-01-07_unique_passwords.txt", "r", encoding="utf8")
passwords = []
for line in filehandle.readlines():
    passwords.append(line.replace("\n", ""))

individual_digits = []
contiguous_numbers = []
for each_password in passwords:
    individual_digits += re.findall(r'\d', each_password)
    contiguous_numbers += re.findall(r'\d+', each_password)

individual_digit_counts = Counter(individual_digits)
contiguous_number_counts = Counter(contiguous_numbers)

x = []
y = []
for number, frequency in contiguous_number_counts.most_common(10):
    x.append(number)
    y.append(frequency)

plt.bar(x, y)
plt.title(f"Top {len(x)} Numbers Used in Unique Passwords Submitted to Honeypot")
plt.xlabel("Number")
plt.ylabel("Frequency")
plt.show()

x = []
y = []
for number, frequency in individual_digit_counts.items():
    x.append(number)
    y.append(frequency)

plt.bar(x, y)
plt.title(f"Numbers Used in Unique Passwords Submitted to Honeypot")
plt.xlabel("Number")
plt.ylabel("Frequency")
plt.show()



Figure 1: Top 10 contiguous numbers used in passwords submitted to a honeypot

The most commonly used number within passwords submitted to one of my honeypots was "123". Of the contiguous numbers submitted, they are either the individual digits between 1 and 4 or numbers added sequentially to 1, such as "12", "123", or "1234".


Figure 2: Top 10 digits used most often within passwords submitted to a honeypot

Of the individual digits used, whether alone or within a larger number, the most commonly used in order are 1, 2 and 3. Generally, the lower the number, the more frequently it is used. 0 is a bit of an outlier. This may mean that most of the time, passwords may be used with incrementing numbers when changed, or simply appended with the next highest number. This isn't too surprising, but interesting to see within the data.

 

What about years? Well, the same method was used, but looking for contiguous numbers of length 4.


Figure 3: Top 10 numbers used within passwords containing only 4 digits

 

Outside of "1234", most of the values to appear to be years, with higher numbers generally seen for more recent years. Is there any difference in these values if submitted in a different year? While this could also be done in python, values for specific years were extacted with 'jq'.

# read all cowrie JSON files
# cat /logs/cowrie.json* 

# select data with the password key present and with timestamps between 1/1/2022 and 12/31/2022
# jq 'select((.password) and (.timestamp >= "2022-01-01") and (.timestamp <= "2022-12-31"))'

# select password values without quotes (raw)
# jq -r .password

# get unique password values and save to a file
# sort | uniq > 2022_unique_passwords_submitted.txt


cat /logs/cowrie.json* | jq 'select((.password) and (.timestamp >= "2022-01-01") and (.timestamp <= "2022-12-31"))' | jq -r .password | sort | uniq > 2022_unique_passwords_submitted.txt

cat /logs/cowrie.json* | jq 'select((.password) and (.timestamp >= "2023-01-01") and (.timestamp <= "2023-12-31"))' | jq -r .password | sort | uniq > 2023_unique_passwords_submitted.txt

cat /logs/cowrie.json* | jq 'select((.password) and (.timestamp >= "2024-01-01") and (.timestamp <= "2024-12-31"))' | jq -r .password | sort | uniq > 2024_unique_passwords_submitted.txt

 


Figure 4: Top 10 4-digit numbers contained in passwords submitted to a honeypot in 2022

 


Figure 5: Top 10 4-digit numbers contained in passwords submitted to a honeypot in 2023

 


Figure 6: Top 10 4-digit numbers contained in passwords submitted to a honeypot in 2024

 

Overall, the 4-digit numbers used within passwords submitted appear to be heavily influenced by the year in which they were submitted. The number 2024 is not heavily reflected, but is present in the data, with a total of 9 passwords containing this value. This was as of 1/7/2024.


Figure 7: Counts of 4-digit numbers seen within passwords, highlighting "2024" occurences

 

With some updated data including the last week, I went ahead and graphed when the numbers "2022", "2023" and "2024" were present within a submitted password. Unlike the previous data used, this was not deduplicated since it would truncate the dates identified when these values were submitted.

# read the cowrie JSON logs files
# cat /logs/cowrie.json*

# select data where the password key was present
# jq 'select(.password)'

# select submission timestamp and password
# delimited with " DELIMIT " since passwords may use common delimiting characters
# save passwords and submission timestamps to file
# jq '.timestamp + " DELIMIT " + .password' > 2024-01-13_passwords_submitted.txt

cat /logs/cowrie.json* | jq 'select(.password)' | jq '.timestamp + " DELIMIT " + .password' > 2024-01-13_passwords_submitted.txt

 

import pandas as pd
import re
import matplotlib.pyplot as plt

def return_matched_timestamps(string_to_match):
    incrementer = 0
    matched_timestamps = []
    while incrementer < len(passwords):
        match = 0
        #print(passwords[incrementer])
        numbers = re.findall(r'\d+', passwords[incrementer])
        for each_number in numbers:
            if str(each_number) == string_to_match:
                match = 1
        if match == 1:
            matched_timestamps.append(timestamps[incrementer])
        incrementer += 1
    return matched_timestamps

df = pd.read_csv("2024-01-13_passwords_submitted.txt", names=["Date", "Password"], sep=" DELIMIT", engine="python", parse_dates=["Date"], na_filter=False)

timestamps = df['Date'].tolist()
passwords = df['Password'].tolist()

timestamps_2024 = return_matched_timestamps("2024")
timestamps_2023 = return_matched_timestamps("2023")
timestamps_2022 = return_matched_timestamps("2022")

df_2022 = pd.DataFrame(timestamps_2022)
by_month_2022 = pd.to_datetime(df_2022[0]).dt.to_period('M').value_counts().sort_index()
df_month_2022 = by_month_2022.rename_axis('month').reset_index(name='counts')

df_2023 = pd.DataFrame(timestamps_2023)
by_month_2023 = pd.to_datetime(df_2023[0]).dt.to_period('M').value_counts().sort_index()
df_month_2023 = by_month_2023.rename_axis('month').reset_index(name='counts')

df_2024 = pd.DataFrame(timestamps_2024)
by_month_2024 = pd.to_datetime(df_2024[0]).dt.to_period('M').value_counts().sort_index()
df_month_2024 = by_month_2024.rename_axis('month').reset_index(name='counts')

plt.plot_date(df_month_2022['month'], df_month_2022['counts'], 'b-', label="Contains '2022'")
plt.plot_date(df_month_2023['month'], df_month_2023['counts'], 'r-', label="Contains '2023'")
plt.plot_date(df_month_2024['month'], df_month_2024['counts'], 'g-', label="Contains '2024'")
plt.xlabel("Date (broken down by month)")
plt.ylabel("Passwords Submitted")
plt.legend(loc="upper left")
plt.show()


Figure 8: Graph of individual passwords submitted each month including "2022", "2023" or "2024"

The incorporation of "2023" and "2024" increase close to the beginning of that year. These was a steep increase in the use of "2023" seen on this honeypot in November of 2023.

Probably a good idea to avoid using the current year in your password, or any recent year in general.

--
Jesse La Grew
Handler

2 comment(s)

Comments

It would be interesting to know how often a digit is used as a substitute for a similar looking letter or similar sounding word or substring. E.g. 0 for O, 1 for I or l, 2 for to or too, 3 for E, 4 for for, 5 for S, 6 for G, 7 for T, 8 for ate, 9 for g. I don't know if or how well you would be able to automate parsing for this, though.
Thanks for reading and for the feedback! When numbers replace alphabetic characters, I would anticipate the surrounding characters to have a higher frequency of being alphabetic rather than numeric. That could be a method to isolate possible alphabetic character replacements. I'll dig into that.

Diary Archives