A Study of Personal Information in Human-Chosen
hnw@udel.edu
Abstract: Though not recommended, Internet users often include parts of personal information in their passwords for easy memorization. However, the use of personal information in passwords and its security implications have not yet been studied systematically. In this paper, we first dissect user passwords from a leaked dataset to investigate how and to what extent user personal information resides in a password. In particular, we extract the most popular password structures expressed by personal information and show the usage of personal information. Then we introduce a new metric called Coverage to quantify the correlation between passwords and personal information. Afterwards, based on our analysis, we extend the Probabilistic Context-Free Grammars (PCFG) method to be semantics-rich and propose Personal-PCFG to crack passwords by generating personalized guesses. Through offline and online attack scenarios, we demonstrate that Personal-PCFG cracks passwords much faster than PCFG and makes online attacks much more likely to succeed.

I. INTRODUCTION

Text-based passwords remain a dominant and irreplaceable authentication method for the foreseeable future. Although different authentication mechanisms have been proposed, no alternative brings all the benefits of passwords without introducing an extra burden to users [1]. However, passwords have long been criticized as one of the weakest links in authentication. Because passwords must be memorable to humans, user passwords are usually far from truly random strings [2]–[6]. In other words, human users are prone to choosing weak passwords simply because they are easy to remember. As a result, most passwords are chosen within only a small portion of the entire password space, which leaves them vulnerable to brute-force and dictionary attacks.

To increase password security, online authentication systems have started to enforce stricter password policies. Meanwhile, many websites deploy password strength meters to help users choose secure passwords. However, these meters have been shown to be ad hoc and inconsistent [7], [8]. To better assess the strength of passwords, we need a deeper understanding of how users construct their passwords. If an attacker knows exactly how users create their passwords, guessing their passwords becomes much easier. Conversely, if a user is aware of the potential vulnerability induced by a commonly used password creation method, the user can avoid that method when creating passwords.

Toward this end, researchers have made significant efforts to unveil the structures of passwords. Traditional dictionary attacks on passwords have shown that users tend to use simple dictionary words to construct their passwords [9]. Language is also vital since users tend to use their first languages when constructing passwords [2]. Besides, passwords are mostly phonetically memorable [4] even though they are not simple dictionary words. It has also been shown that users may use keyboard and date strings in their passwords [5], [10], [11]. However, most studies discover only superficial password patterns, and the semantically rich composition of passwords has yet to be fully uncovered. Fortunately, an enlightening work investigates how users generate their passwords by learning the semantic patterns in passwords [12].

In this paper, we study password semantics from a different perspective: the use of personal information. We utilize a leaked password dataset, which contains personal information, from a Chinese website for this study. We first measure the usage of personal information in password creation and present interesting observations. We are able to obtain the most popular password structures with personal information embedded. We also observe that males and females behave differently when using personal information in password creation. Next, we introduce a new metric called Coverage to accurately quantify the correlation between personal information and user password. Since it considers both the length and continuation of personal information in a password, Coverage is a useful metric to measure the strength of a password. Our quantification results using the Coverage metric confirm our direct measurement results on the dataset, showing the efficacy of Coverage. Moreover, Coverage is easy to integrate with existing tools, such as password strength meters, to help users create more secure passwords.

To demonstrate the security vulnerability induced by using personal information in passwords, we propose a semantics-rich Probabilistic Context-Free Grammars (PCFG) method called Personal-PCFG, which extends PCFG [13] by considering those symbols linked to personal information in password structures. Personal-PCFG is able to crack passwords much faster than PCFG. It also makes an online attack more feasible by drastically increasing the guess success rate. Finally, we discuss potential solutions to defend against semantics-aware attacks like Personal-PCFG.

Our study is based on a dataset collected from a Chinese website. Although measurement results could differ for other datasets, our observations still shed some light on how personal information is used in passwords. As long as memorability plays an important role in password creation, the
correlation between personal information and user password remains, regardless of which language users speak. We believe that our work on personal information quantification, password cracking, and password protection could be applicable to any other text-based password datasets from different websites.

The remainder of this paper is organized as follows. Section II measures how personal information resides in user passwords and shows the gender difference in password creation. Section III introduces the new metric, Coverage, to accurately quantify the correlation between personal information and user password. Section IV details Personal-PCFG and shows cracking results compared with the original PCFG. Section V discusses limitations and potential defenses. Section VI surveys related work, and finally Section VII concludes this paper.

II. PERSONAL INFORMATION IN PASSWORDS

Intuitively, people tend to create passwords based on their personal information because human beings are limited by their memory capacities and random passwords are much harder to remember. We show that users’ personal information plays an important role in human-chosen password generation by dissecting passwords in a mid-sized leaked password dataset. Understanding the usage of personal information in passwords and its security implications can help us to further enhance password security. To start, we introduce the dataset used throughout this study.

A. 12306 Dataset

A number of password datasets have been exposed to the public in recent years, usually containing several thousands to millions of real passwords. As a result, there are several password measurement or password cracking studies based on analyzing those datasets [2], [10]. In this paper, a dataset called 12306 is used to illustrate how personal information is involved in password creation.

1) Introduction to Dataset: At the end of 2014, a Chinese dataset was leaked to the public by anonymous attackers. It is reported that the dataset was collected by trying usernames and passwords from other leaked datasets online. We call this dataset 12306 because all passwords are from the website www.12306.cn, which is the official site of the online railway ticket reservation system in China. There is no data available on the exact number of users of the 12306 website; however, we infer that the system has at least tens of millions of registered users since it is the only official website for the entire Chinese railway system.

The 12306 dataset contains more than 130,000 Chinese passwords. Compared with the many large datasets leaked in recent years, the 12306 dataset is medium-sized. What makes it special is that together with plaintext passwords, the dataset also includes several types of user personal information, such as a user’s name and the government-issued unique ID number (similar to the U.S. Social Security Number). As the website requires a real ID number to register and people must provide real personal information to book a ticket, we consider the information in this dataset to be reliable.

TABLE I: Most Frequent Passwords.
Rank  Password    Amount  Percentage
1     123456      389     0.296%
2     a123456     280     0.213%
3     123456a     165     0.125%
4     5201314     160     0.121%
5     111111      156     0.118%
6     woaini1314  134     0.101%
7     qq123456    98      0.074%
8     123123      97      0.073%
9     000000      96      0.073%
10    1qaz2wsx    92      0.070%

2) Basic Analysis: We first conduct a simple analysis to reveal some general characteristics of the 12306 dataset. For data consistency, we remove users whose ID number is not 18 digits long. These users may have used other IDs (e.g., passport number) to register on the system, and they account for 0.2% of the whole dataset. After this cleansing, the dataset contains 131,389 passwords for analysis. Note that various websites may have different password creation policies. For instance, with a strict password policy, users may apply mangling rules (e.g., abc -> @bc or abc1) to their passwords to fulfill the policy requirement [14]. Since the 12306 website has changed its password policy after the password leak, we do not know the exact password policy when the dataset was first compromised. However, from the leaked dataset, we infer that the password policy is quite simple: passwords cannot be shorter than six symbols. There is no restriction on what type of symbols can be used. Therefore, users are not required to apply any mangling rules to their passwords.

The average length of passwords in the 12306 dataset is 8.44. The most common passwords in the 12306 dataset are listed in Table I. The dominating passwords are trivial passwords (e.g., 123456, a123456, etc.), keyboard passwords (e.g., 1qaz2wsx, 1q2w3e4r, etc.), and “iloveyou” passwords. Both “5201314” and “woaini1314” mean “I love you forever” in Chinese. The most commonly used Chinese passwords are similar to those in a previous study [10]; however, the 12306 dataset is much more sparse. The most popular password “123456” accounts for less than 0.3% of all passwords, while the number is 2.17% in [10]. We believe that the password sparsity is due to the importance of the website; users are less prone to use trivial passwords like “123456” and there are fewer sybil accounts because a real ID number is needed for registration.

Similar to [10], we measure the resistance to guessing of the 12306 dataset in terms of various metrics including the worst-case security bit representation (H∞), the guesswork bit representation (G̃), the α-guesswork bit representations (G̃0.25 and G̃0.5), and the β-success rates (λ5 and λ10). The result is shown in Table II. We found that users of 12306 avoid using extremely guessable passwords such as “123456” because 12306 has a substantially higher worst-case security and the β-success rate for β = 5 and 10. We believe users have certain password security concerns when creating passwords for critical service systems like 12306. However, their concern seems to be limited to avoiding only extremely easy passwords. As indicated by values of α-guesswork, the overall password sparsity of the 12306 dataset is no higher
than previously studied datasets.

TABLE II: Resistance to guessing.
H∞    G̃      λ5     λ10    G̃0.25  G̃0.5
8.4   16.85  0.25%  0.44%  16.65  16.8

TABLE III: Most Frequent Password Structures.
Rank  Structure  Amount  Percentage
1     D7         10,893  8.290%
2     D8         9,442   7.186%
3     D6         9,084   6.913%
4     L2D7       5,065   3.854%
5     L3D6       4,820   3.668%
6     L1D7       4,770   3.630%
7     L2D6       4,261   3.243%
8     L3D7       3,883   2.955%
9     D9         3,590   2.732%
10    L2D8       3,362   2.558%
“D” represents digits and “L” represents English letters. The number indicates the segment length. For example, L2D7 means the password contains 2 letters followed by 7 digits.

We also study the basic structures of the passwords in 12306. The most popular password structures are shown in Table III. Similar to a previous study [10], our results again show that Chinese users prefer digits in their passwords, whereas English-speaking users prefer letters. The top five structures all have a significant portion of digits, and at most 2 or 3 letters are placed in front. The reason behind this may be that Chinese characters are logogram-based, and digits seem to be the best alternative when creating a password.

In summary, the 12306 dataset is a Chinese password dataset that has general Chinese password characteristics. Users show certain security concerns by choosing less trivial passwords. However, the overall sparsity of the 12306 dataset is no higher than previously studied datasets.

B. Personal Information

The 12306 dataset contains not only user passwords but also multiple types of personal information, listed in Table IV.

TABLE IV: Personal Information.
Type           Description
Name           User’s Chinese name
Email address  User’s registered email address
Cell phone     User’s registered cell phone number
Account name   The username used to log in to the system
ID number      Government-issued ID number

Note that the government-issued ID number is a unique 18-digit number, which itself includes personal information. Digits 1-6 represent the birthplace of the owner, digits 7-14 represent the birthdate of the owner, and digit 17 represents the gender of the owner: odd means male and even means female. We take out the 8-digit birthdate and treat it separately since birthdate is very important personal information in password creation. Therefore, we finally have six types of personal information: name, birthdate, email address, cell phone number, account name, and ID number (birthdate excluded).

1) New Password Representation: To better illustrate how personal information correlates to user passwords, we develop a new representation of a password by adding more semantic symbols besides the conventional “D”, “L” and “S” symbols, which stand for digit, letter, and special symbol, respectively. We try to match parts of a user’s password to the six types of personal information, and express the password in terms of this personal information. For example, a password “alice1987abc” can be represented as [Name][Birthdate]L3, instead of L5D4L3 as in a traditional representation. The matched personal information is denoted by corresponding tags, [Name] and [Birthdate] in this example; for segments that are not matched, we still use “D”, “L”, and “S” to describe the symbol types.

We believe that representations like [Name][Birthdate]L3 are better than L5D4L3 since they more accurately describe the composition of a user password with more detailed semantic information. Using this representation, we apply the following matching method to the entire 12306 dataset to see how these personal information tags appear in password structures.

2) Matching Method: We propose a matching method to locate personal information in a user password. The basic idea is that we first generate all substrings of the password and sort them in descending length order. Then we match these substrings, from the longest to the shortest, against all types of personal information. If a match is found, the match function is recursively applied over the remaining password segments until no further match is found. We require that a segment be at least 2 symbols long to be matched. The segments that are not matched to any personal information are then labeled using the traditional “LDS” tags.

We describe the methods for matching each type of personal information as follows. For Chinese names, we convert them into Pinyin form, which is an alphabetic representation of Chinese characters. Then we compare password segments to 10 possible permutations of a name, such as lastname+firstname and last initial+firstname. If the segment is exactly the same as one of the permutations, we consider it a match. For birthdate, we list 17 possible permutations and compare a password segment with these permutations. If the segment is the same as any permutation, we consider it a match. For account name, email address, cell phone number, and ID number, we further constrain the length of a segment to be at least 3 to avoid mismatching by coincidence. Besides, as people tend to memorize a sequence of numbers by dividing it into 3-digit groups, we believe that a match of at least 3 is likely to be a real match.

Note that a password segment may match multiple types of personal information. In such cases, all possible matches are counted in the results.

3) Matching Results: After applying the matching method to the 12306 dataset, we find that 78,975 out of 131,389 (60.1%) of the passwords contain at least one of the six types of personal
TABLE V: Most Frequent Password Structures.
Rank  Structure   Amount  Percentage
1     [ACCT]      6,820   5.190%
2     D7          6,224   4.737%
3     [NAME][BD]  5,410   4.117%
4     [BD]        4,470   3.402%
5     D6          4,326   3.292%
6     [EMAIL]     3,807   2.897%
7     D8          3,745   2.850%
8     L1D7        2,829   2.153%
9     [NAME]D7    2,504   1.905%
10    [ACCT][BD]  2,191   1.667%

TABLE VII: Most Frequent Structures in Different Genders.
Rank  Male Structure  Male Percentage  Female Structure  Female Percentage
1     [ACCT]          4.647%           D6                3.909%
2     D7              4.325%           [ACCT]            3.729%
3     [NAME][BD]      3.594%           D7                3.172%
4     [BD]            3.080%           D8                2.453%
5     D6              2.645%           [EMAIL]           2.372%
6     [EMAIL]         2.541%           [NAME][BD]        2.309%
7     D8              2.158%           [BD]              1.968%
8     L1D7            2.088%           L2D6              1.518%
9     [NAME]D7        1.749%           L1D7              1.267%
10    [ACCT][BD]      1.557%           L2D7              1.240%
NA    TOTAL           28.384%          TOTAL             23.937%
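The matching method of Section II-B2, which produces structures such as those in Table V, can be illustrated with a minimal Python sketch. It is only an approximation under stated assumptions. The function names (lds, find_tag, represent), the tag spellings, and the candidate sets passed in are hypothetical. The 10 name permutations and 17 birthdate permutations are not listed in this section, so the caller must supply its own candidate strings. Where a segment could match several types, this sketch keeps only the first matching tag, whereas the paper counts all possible matches.

```python
import string
from itertools import groupby

# Minimum segment length per tag: 2 in general, 3 for the four types that the
# text constrains further. Tag names are illustrative, not the authors' own.
MIN_LEN = {"NAME": 2, "BD": 2, "ACCT": 3, "EMAIL": 3, "PHONE": 3, "ID": 3}


def lds(segment):
    """Label an unmatched segment with traditional L/D/S tags, e.g. 'abc12' -> 'L3D2'."""
    def kind(ch):
        if ch in string.digits:
            return "D"
        if ch in string.ascii_letters:
            return "L"
        return "S"
    return "".join(f"{k}{len(list(g))}" for k, g in groupby(segment, key=kind))


def find_tag(segment, candidates):
    """Return the first tag whose candidate strings match `segment`, else None.

    Assumption: NAME and BD candidates are pre-expanded permutations that must
    match exactly; the other types match if the segment is a substring of the
    field. Candidate strings are expected to be lowercase.
    """
    seg = segment.lower()
    for tag, values in candidates.items():
        if len(seg) < MIN_LEN[tag]:
            continue
        if tag in ("NAME", "BD"):
            if seg in values:
                return tag
        elif any(seg in value for value in values):
            return tag
    return None


def represent(password, candidates):
    """Tag the longest matching substring, then recurse on both remainders."""
    n = len(password)
    for length in range(n, 1, -1):          # longest substrings first, minimum 2
        for start in range(n - length + 1):
            tag = find_tag(password[start:start + length], candidates)
            if tag:
                left = represent(password[:start], candidates)
                right = represent(password[start + length:], candidates)
                return f"{left}[{tag}]{right}"
    return lds(password) if password else ""


# Hypothetical user record; the permutations shown are a small hand-picked subset.
info = {
    "NAME": {"alice", "smithalice", "salice"},
    "BD": {"19870412", "1987", "0412"},
    "ACCT": ["alice_s87"],
}
print(represent("alice1987abc", info))
```

With the hypothetical record above, the call prints [NAME][BD]L3, which mirrors the “alice1987abc” example in Section II-B1. Scanning lengths from longest to shortest stands in for sorting all substrings in descending length order, and it avoids materializing the full substring list.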