Thursday, 5 January 2012

A brief comparison study of leaked CSDN and RenRen data

So, here comes the biggest ever shock to more than 457 million Chinese Internet users. Just days before the 2012 new year, tens of millions of user accounts and passwords from major Chinese websites were leaked on 21 December 2011. These websites include:

- CSDN.net (over 6 million leaked) The full name is Chinese Software Developer Network. It was founded in 1999 and provides IT news, forums for software developers, and other services such as IT training. Most importantly, CSDN hosts the largest IT community, with 18 million registered users (including 10,000 CTOs and hundreds of thousands of architects, team leaders, and project managers) and 500,000 registered companies according to its website. CSDN.net ranks 30th in China and 230th globally.

- RenRen.com (over 4.7 million leaked) It was formerly known as XiaoNei Network and is a popular social networking website offering Facebook-like services to Chinese users. It had 160 million registered users, and "a total of 31 million active monthly users", as stated in its pre-IPO reports. RenRen.com ranks 19th in China and 97th globally according to the Alexa.com traffic analyser.

- Tianya.cn (over 31 million leaked) It is an extremely popular forum-like community website. Tianya.cn ranks 25th in China and 136th globally.

- Dodonew.com (over 16 million leaked) It is an alternative social networking site to RenRen.com. Dodonew.com ranks 25,849th in China and 580,866th globally.

- 178.com (over 9 million leaked) It is a website specialising in online games, especially WoW. It ranks 148th in China and 1,000th globally.

- 7k7k.com (over 19 million leaked) It offers a huge collection of online flash games. It ranks 161st in China and 1,118th globally.

- Some other websites are also rumoured to be victims of this systematic hack, but this has not been confirmed yet.

It is a disaster from the perspective of the security of these Chinese websites, since the user information was not encrypted and was simply stored in plain text. However, such data offers a unique opportunity to study the password patterns of Chinese online users at a massive scale and across different categories (software developers, social users, gamers, etc.). Nevertheless, the most important thing is to learn how to choose proper passwords and registration habits so as to best protect ourselves on the Internet.


Data
====


Leaked user data from CSDN.net and RenRen.com are selected for analysis because they distinctively represent programmers and social users respectively. Note that it would be interesting to compare programmers, social users and gamers (i.e. data from 7k7k.com) all together. However, due to a lack of resources, such analysis has to be postponed for now. Tianya.cn data is also a good candidate, but it contains no email information and so does not fit the purpose of the analysis here.

- CSDN.net data is 287.2MB after decompressing and contains 6,428,632 lines of user information, each including account name, password, and user email address separated by '#'. Users can use either their account name or their email address to log in.

- RenRen.com data is 164MB after decompressing and contains 4,768,600 lines of user email address and password separated by a tab. Users can only use their email addresses to log in.
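
As a rough illustration of the two line formats described above, here is a minimal parsing sketch. It is not the script used for this analysis. The `Record` type and the function names are made up for illustration, and the handling of stray whitespace around '#', of passwords that themselves contain '#', and of malformed lines are all assumptions.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class Record(NamedTuple):
    account: Optional[str]  # None for RenRen.com, which has no account name
    password: str
    email: str


def parse_csdn(lines: Iterable[str]) -> Iterator[Record]:
    """Yield records from lines shaped like 'account#password#email'."""
    for line in lines:
        parts = line.rstrip("\r\n").split("#")
        if len(parts) < 3:
            continue  # assumption: silently skip malformed lines
        # Assumption: if a password contains '#', the first field is still
        # the account and the last one the email; the middle is the password.
        account = parts[0].strip()
        email = parts[-1].strip().lower()
        password = "#".join(parts[1:-1]).strip()
        yield Record(account, password, email)


def parse_renren(lines: Iterable[str]) -> Iterator[Record]:
    """Yield records from lines shaped like 'email<TAB>password'."""
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 2:
            continue  # assumption: silently skip malformed lines
        email, password = fields
        yield Record(None, password, email.strip().lower())
```

Both functions accept any iterable of lines, so an open file object can be passed directly and the data never has to be loaded into memory at once.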



Password Length
============

This is the routine analysis of the length of the passwords used by programmers and casual users, to see if there is any difference between these two categories.


It is clear from the figures below that programmers are likely to choose longer passwords than casual users; most of the programmers choose passwords between 9 and 15 characters long, while most of the casual users choose passwords between 6 and 11 characters long.


[Figure: CSDN.net Password Length]
[Figure: RenRen.com Password Length]
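
A length distribution like the ones in the figures can be computed with a simple counter. The sketch below is only an illustration; the function names are hypothetical and the sample passwords in the usage lines are invented, not taken from the leaked data.

```python
from collections import Counter
from typing import Iterable


def length_histogram(passwords: Iterable[str]) -> Counter:
    """Count how many passwords have each length."""
    return Counter(len(p) for p in passwords)


def share_in_range(hist: Counter, low: int, high: int) -> float:
    """Fraction of passwords whose length is between low and high, inclusive."""
    total = sum(hist.values())
    if total == 0:
        return 0.0
    inside = sum(count for length, count in hist.items() if low <= length <= high)
    return inside / total


# Invented sample, for illustration only.
sample = ["123456", "abcdefghi", "abcdefghij", "pw"]
hist = length_histogram(sample)
print(sorted(hist.items()))          # (length, count) pairs
print(share_in_range(hist, 9, 15))   # share of passwords with 9 to 15 characters
```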


Password Uniqueness
================


A well-known password dictionary (http://dazzlepod.com/site_media/txt/passwords.txt) is employed for this part of the analysis. The analysis aims to answer two questions:

- In general, how likely are users to choose passwords that appear in the password dictionary?
- Is there any behavioural difference among users of different email providers? In other words, are hotmail users more likely to choose passwords from the password dictionary than 163.com users, or vice versa?

The analysis shows that

- In the CSDN.net dataset, over 36.5% of the passwords (4,034,934 unique passwords, 1,474,161 found in dictionary) are in the dictionary, while in the RenRen.com dataset, more than 54.3% of the passwords (1,905,069 unique passwords, 1,036,086 found in dictionary) can be found in the dictionary.



[Figure: CSDN.net Passwords in dictionary]
[Figure: RenRen.com Passwords in dictionary]
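
The percentages above are the number of unique passwords found in the dictionary divided by the number of unique passwords. A minimal sketch of that calculation follows; the exact, case-sensitive matching and the function name are assumptions, since the matching rule used in the analysis is not spelled out here.

```python
from typing import Iterable, Tuple


def dictionary_hit_rate(
    passwords: Iterable[str], dictionary: Iterable[str]
) -> Tuple[int, int, float]:
    """Return (unique passwords, unique passwords in dictionary, hit rate)."""
    unique = set(passwords)
    if not unique:
        return 0, 0, 0.0
    # Assumption: a password counts as "in the dictionary" only on an exact,
    # case-sensitive match with one dictionary entry.
    found = len(unique & set(dictionary))
    return len(unique), found, found / len(unique)


# Re-deriving the reported percentages from the reported counts.
csdn_rate = 1_474_161 / 4_034_934
renren_rate = 1_036_086 / 1_905_069
print(f"CSDN.net: {csdn_rate:.1%}, RenRen.com: {renren_rate:.1%}")
```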


It is interesting to notice that casual users are less likely than programmers to choose distinctive passwords: there are only 1,905,069 unique passwords among 4,768,600 RenRen.com users (i.e. on average, the RenRen.com password-user ratio is around 2:5), while CSDN.net has 4,034,934 unique passwords among 6,428,632 users (the CSDN.net password-user ratio is around 2:3).

Now let's see if the users behave differently across the different email providers. QQ.com, 163.com and hotmail are chosen for this part of the analysis.

- In the CSDN.net dataset, over 31.8% of the QQ.com passwords (1,379,672 unique passwords, 439,008 found in dictionary) are in the dictionary file, over 33.5% of the 163.com passwords (1,200,993 unique passwords, 402,960 found in dictionary) are in the dictionary file, and about 29.3% of the hotmail passwords (159,522 unique passwords, 46,724 found in dictionary) are in the dictionary file.


- In the RenRen.com dataset, over 38% of the QQ.com passwords (883,476 unique passwords, 335,810 found in dictionary) are in the dictionary file, over 39.5% of the 163.com passwords (260,892 unique passwords, 103,141 found in dictionary) are in the dictionary file, and over 54.5% of the hotmail passwords (81,962 unique passwords, 44,736 found in dictionary) are in the dictionary file.
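
The per-provider breakdown amounts to grouping the unique passwords by email domain before checking them against the dictionary. The sketch below shows one way to do it. The function name is hypothetical, and matching on the exact domain after '@' (for example "hotmail.com" only, ignoring other hotmail domains) is an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def hit_rate_by_domain(
    records: Iterable[Tuple[str, str]],
    dictionary: Iterable[str],
    domains: Set[str],
) -> Dict[str, Tuple[int, int, float]]:
    """Map each domain to (unique passwords, found in dictionary, hit rate).

    records is an iterable of (email, password) pairs.
    """
    words = set(dictionary)
    unique_by_domain = defaultdict(set)
    for email, password in records:
        # Assumption: the provider is the exact text after the last '@'.
        domain = email.rpartition("@")[2].lower()
        if domain in domains:
            unique_by_domain[domain].add(password)
    result = {}
    for domain, passwords in unique_by_domain.items():
        found = len(passwords & words)
        result[domain] = (len(passwords), found, found / len(passwords))
    return result
```

Calling it with `domains={"qq.com", "163.com", "hotmail.com"}` would produce one row per provider in the same shape as the figures quoted above.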


It is straightforward to observe that, within each dataset, QQ.com and 163.com users show almost the same password-in-dictionary rate. However, the users in RenRen.com are less concerned with the uniqueness of their passwords than those in CSDN.net, most visibly among hotmail users (54.5% vs. 29.3%).





Another interesting thing is that birthdays seem to be a popular choice. 769,819 users (i.e. 16.1%) in RenRen.com chose to use someone's birthday as their password. Software developers do better in this case: 486,283 users (7.5%) chose a birthday pattern, but the number is still huge.
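
The exact birthday pattern used for these counts is not given in this post. As a hedged illustration only, the sketch below assumes that a "birthday password" is an all-digit string in YYYYMMDD or YYMMDD form that is a valid calendar date in a plausible year range. The function name, the two formats, the reading of YY as 19YY, and the year bounds are all assumptions, so this will not necessarily reproduce the numbers above.

```python
from datetime import date


def looks_like_birthday(password: str, min_year: int = 1940, max_year: int = 2011) -> bool:
    """Return True if the password looks like a date in YYYYMMDD or YYMMDD form."""
    if not (password.isascii() and password.isdigit()):
        return False
    if len(password) == 8:
        year = int(password[:4])
    elif len(password) == 6:
        # Assumption: two-digit years are read as 19YY.
        year = 1900 + int(password[:2])
    else:
        return False
    if not min_year <= year <= max_year:
        return False
    try:
        # date() raises ValueError for impossible months or days.
        date(year, int(password[-4:-2]), int(password[-2:]))
    except ValueError:
        return False
    return True
```

With these defaults, a very common password such as "123456" is not counted, because it would correspond to the year 1912.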

Cross providers password reuse
==============================

It is interesting to verify whether users are likely to reuse their email passwords when registering with CSDN.net or RenRen.com, since an email address is compulsory during the registration process. Moreover, it would be really exciting to check for potential differences in how users treat different email providers. In other words, would users avoid reusing their 163.com password but be more likely to reuse their hotmail.com password, because 163.com is treated as the main communication channel by Chinese users, who might pay extra attention to protecting that account, while hotmail is just treated as a one-off place to receive verification emails?

The experiments observe that

- Some users (but not a majority!) do reuse their email passwords to register with CSDN.net and RenRen.com. Hotmail is selected to test the password reuse behaviour. Random user information is selected from both the CSDN.net and RenRen.com data. 15.5% of the users (41,879 selected, 6,516 verified) who used a hotmail address to register with CSDN.net reused their passwords, while 20.2% of the users (35,031 selected, 7,088 verified) in RenRen.com reused their hotmail passwords.

- Users are less likely to reuse their 163.com email passwords to register with CSDN.net and RenRen.com than their hotmail passwords. Random user information is again selected from both datasets. In the CSDN.net dataset, only 7.08% of the users (25,708 randomly selected, 1,822 verified) reused their 163.com passwords, while 4.2% of the users (22,483 selected, 961 verified) in RenRen.com reused their 163.com passwords.



Discussions
=========


Is a user's password pattern implied by his/her account name pattern?
-------------------------------------------------------------------------------------

Actually, it would be good to study user account name patterns as a valuable add-on to this analysis. Such analysis may offer some insight into the question. The rationale or, more properly, the assumption is that if a user is likely to (partially) reuse his/her email address name (the part before '@') as his/her website account name, he/she would also be likely to reuse the email password. Psychologically, the user may be inclined to use this exact name to demonstrate his/her unique identity in a community. This assumption may be especially true in forum-based online communities, where people rely purely on IDs to recognise active members. Such analysis can be carried out on the CSDN.net and Dodonew.com datasets, since these two offer user account name, email address and password information.
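
The first step of that add-on analysis, checking whether an account name (partially) reuses the part of the email address before '@', could be sketched as follows. This was not part of the analysis in this post. The function name, the three labels and the minimum overlap length of 4 characters are all assumptions chosen for illustration.

```python
def name_overlap(account: str, email: str, min_len: int = 4) -> str:
    """Classify how an account name relates to the email local part.

    Returns "exact", "partial" or "none".
    """
    local = email.partition("@")[0].lower()
    account = account.lower()
    if not account or not local:
        return "none"
    if account == local:
        return "exact"
    shorter, longer = sorted((account, local), key=len)
    # Assumption: "partial reuse" means the shorter name appears inside the
    # longer one and is long enough not to match by accident.
    if len(shorter) >= min_len and shorter in longer:
        return "partial"
    return "none"
```

The resulting labels could then be cross-tabulated against the password statistics to see whether the assumption holds.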



Does data leaking pose real threats to users?
-----------------------------------------------------

The experiments demonstrate that Chinese users are inclined not to protect hotmail accounts (compared to 163.com accounts). Actually, email accounts that are used for verification purposes only are called "马甲" (literally "vest", slang for a throwaway alternate account) in China. It would be interesting to verify whether these hotmail accounts are the users' "马甲" and are deliberately left unprotected, in which case they would not represent real security risks to the users at all. Some factors (not a definitive list) that could contribute to the identification process are

- timestamp of last read email
- timestamp of last sent email
- number of unread emails
- number of read emails
- number of sent emails
- frequency of sending emails
- categories of senders' domains of unread emails
- categories of senders' domains of read emails

However, such an identification process is utterly unethical even for research purposes and was not carried out at all. It is discussed here simply because, as pure speculation, well-known email providers might have carried out, or be interested in, such analysis for advertising or spam-filtering purposes.


How far can hackers go to obtain the user's information?
--------------------------------------------------------------------

If the email address is mainly used for registration purposes, a leak won't cause much damage to the user. However, if the email address is mainly used for online purchasing, private communications, etc., the damage is pretty scary. The hacker may get the user's purchase history (e.g. from online stores), bank details, home address, telephone number, work address, resumes, online chat history (e.g. Google Talk), etc. The breach of your life and online information is unimaginable.


So, back to the main point, what do we really learn from this leak?

- Change your private email password regularly,
- Don't use your private email address (especially one associated with your online purchases) to register with online communities unless it is really necessary,
- Keep an eye on hacking news.


Finally, here are some personal criticisms of certain email features.

- Why do some email providers allow users to mark an email as unread? It may be convenient for users, but it enables hackers to read the emails and cover their tracks, at least from a normal email user's perspective.



- Why do email providers not block thousands of requests (i.e. different email addresses with valid/invalid passwords) from the same IP address? NOT EVEN HOTMAIL! Personally, I think email providers should go the extra mile to detect such exploitation by checking whether a potential hacking process is being carried out!
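
To make the last point concrete, here is a minimal sketch of the kind of check a provider could run: flag an IP address once it has tried to log in to too many distinct accounts within a time window. This is an illustration under stated assumptions, not a description of how any real provider works. The class name, the thresholds, and the requirement that timestamps arrive in non-decreasing order per IP are all assumptions.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Tuple


class LoginAttemptMonitor:
    """Flag IPs that try many distinct accounts within a sliding time window."""

    def __init__(self, max_accounts: int = 20, window_seconds: float = 600.0) -> None:
        self.max_accounts = max_accounts
        self.window_seconds = window_seconds
        # ip -> queue of (timestamp, account), oldest first
        self._attempts: Dict[str, Deque[Tuple[float, str]]] = defaultdict(deque)

    def record(self, ip: str, account: str, timestamp: float) -> bool:
        """Record one login attempt; return True if the IP looks suspicious.

        Assumption: timestamps for a given IP never go backwards.
        """
        attempts = self._attempts[ip]
        attempts.append((timestamp, account))
        # Drop attempts that have fallen out of the window.
        while attempts and attempts[0][0] <= timestamp - self.window_seconds:
            attempts.popleft()
        distinct_accounts = {acct for _, acct in attempts}
        return len(distinct_accounts) > self.max_accounts


monitor = LoginAttemptMonitor(max_accounts=3, window_seconds=60.0)
for i in range(5):
    suspicious = monitor.record("192.0.2.1", f"user{i}@example.com", float(i))
    print(i, suspicious)
```

Counting distinct accounts rather than raw requests means a single user mistyping his/her own password several times is not flagged, while one IP cycling through many addresses is.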