In my last week at Princeton, I tried to finish as much as possible before I left. I prepared for my Work in Progress talk and presented it on Tuesday. I received good feedback from professors to help guide the rest of this project: they asked about my n-gram search, whether it should be altered, and what the training set for my classifier would be.
To create an n-gram classifier, I looked at the keywords that showed up in the frequency count and identified which words frequently appeared together. With this information, I created a simple classifier that checked whether those pairs were present. I used 18 sites: 9 malicious and the top 9 from Alexa (it would have been 10, but weibo.com came back as HTTPS or returned a connection error when loaded in all locations). With this, I got a 16.67% error rate, with 1 malicious site classified as benign and 2 Alexa sites classified as malicious. This is under the assumption that the Alexa sites are benign, though.

For the rest of the week, I worked on creating a crawler to gather all of Alexa's top sites that are served over HTTPS, to use as examples of benign sites when building a classifier; since those pages are encrypted in transit, they shouldn't have any malicious code injected into them. I also need to create a crawler to find more malicious sites to broaden the information gathered from the few sites I've looked at so far.

The professor and Ph.D. students have asked me to continue working on this project after I leave Princeton, and I'm really happy about how the project turned out. Since it isn't finished yet, I won't be able to link the final project on my website, but we are looking at writing a paper and submitting it to a conference. I had a great experience researching at Princeton and joining the STORMSHIP(s) project this summer. I'm glad I applied to the DREU program and had the opportunity to do research. This experience has shown me that I do want to go to graduate school for cybersecurity or digital forensics.
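The pair-presence check and error-rate calculation described above could look something like this minimal sketch; the keyword pairs and sample pages here are hypothetical stand-ins, not the project's real data.

```python
# Hypothetical keyword pairs that co-occurred in known malicious pages.
MALICIOUS_PAIRS = [
    ("eval", "unescape"),
    ("document.write", "iframe"),
]

def classify(html):
    """Label a page 'malicious' if any keyword pair co-occurs in it."""
    for a, b in MALICIOUS_PAIRS:
        if a in html and b in html:
            return "malicious"
    return "benign"

def error_rate(labeled_pages):
    """labeled_pages: list of (html, true_label) tuples."""
    wrong = sum(1 for html, label in labeled_pages if classify(html) != label)
    return wrong / len(labeled_pages)
```

With the 18 real sites, 3 misclassifications out of 18 gives the 16.67% error rate reported above.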
This week I ran the keyword search, built from known malicious sites, against Alexa's top 20 sites. I was surprised to see that a fair number of keywords were found in some of these sites. It showed me that a keyword search alone wouldn't be enough to classify whether a site is malicious, which is what we expected.
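A keyword search of this kind comes down to counting occurrences of each keyword per page; a minimal sketch, with a hypothetical keyword list:

```python
import re

# Hypothetical keywords drawn from known malicious pages.
KEYWORDS = ["eval", "unescape", "document.cookie", "iframe"]

def keyword_frequencies(html):
    """Count how often each keyword appears in a page's source."""
    return {kw: len(re.findall(re.escape(kw), html)) for kw in KEYWORDS}
```

Running this over each of the top 20 sites' HTML would produce the per-site counts the search reports.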
I started to create a naïve classifier, hopefully a linear one. However, I soon discovered that the keyword counts weren't consistent across malicious and benign sites: benign sites usually had either 0 or over 20 keyword occurrences, while malicious sites had below 10. Since a naïve classifier won't work, I started to read tutorials on machine learning libraries. However, the material is fairly complicated, so I will get a crash course on how to create a classifier with a machine learning library next week.

I also started to work on the presentation I will give during my last week here. It's interesting to see all that I've worked on and accomplished in the short time I've been researching. I'm excited to present my findings and see whether other researchers and professors have suggestions on where the project can go from here.

This week I continued to test malicious websites' code with my malicious-string frequency script. If a website returned all 0's, I looked at the code manually and tried to identify where the malicious code was so that I could add the keyword to our search and not miss that code again. Once I felt the list was long enough, I ran through the websites one more time, this time outputting the line on which each malicious string was found.
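Reporting the line where each string was found might look like this sketch (the keyword list passed in would be the one built up above; the example here is hypothetical):

```python
def find_keyword_lines(html, keywords):
    """Return (line_number, keyword, line_text) for every keyword hit."""
    hits = []
    for lineno, line in enumerate(html.splitlines(), start=1):
        for kw in keywords:
            if kw in line:
                hits.append((lineno, kw, line.strip()))
    return hits
```

Those hit lines are exactly the snippets that can later feed a classifier, rather than just the raw counts.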
Now I am collaborating with a Ph.D. student who specializes in machine learning to use those lines of code to create a classifier that determines whether code is malicious or benign. The two graduate students I am working with and I don't know much about machine learning, so this other student is going to give us a crash course in the subject. I'm excited to learn about machine learning from her since I haven't had any experience with it. At my weekly meeting, my mentor mentioned he would like me to present our research at a luncheon for the department. So next week I will be preparing my presentation so that I can give my talk the week I leave. Princeton's Center for Information Technology Policy usually has talks every week during the school year, but they've only had one this summer. So, along with two other undergraduates, I will present how far I've come in my research and hear suggestions from other employees about how the research should continue.

At the beginning of this week, I worked on improving the performance of the Virtual Machine (VM) I was using. It ran very slowly and would lag by several minutes, so I felt I needed to fix its performance before attempting any work with it. While trying to fix the VM, I lost some of the data I had already collected because it froze and the only solution was to force-shut it down. However, by Tuesday I had fixed the machine so that it ran well without lagging, and I had redone my previous work so that I had the information I needed.
On Wednesday, I met with the two Ph.D. students and discussed what I should work on. We decided it was best that I keep looking at malicious sites and improving the frequency script that checks HTML files for possibly malicious code. By studying known malicious HTML files, I've added many more elements to search for, which is making our program more effective. I enjoy this part of the project because we had discussed at the beginning that it would be the most time-consuming and hardest portion, since we had to decide which content to flag as malicious and which to ignore. Examining malicious code and determining which sections are cause for concern helps me understand what we are looking for and how we can classify this code in the future. I'm working on the first steps of classification right now, and it's exciting to see my code working and counting possibly malicious scripts.

I finished the decoder for the HTML files at the beginning of the week and started working on a script that removes all of the script tags from an HTML file so that each tag section can be compared to sections in other files. This wasn't too confusing, but I did have difficulty figuring out whether I was removing the correct scripts. Some scripts are contained in comments and conditional statements, and the program wasn't grabbing those. While I panicked at first, I realized it wasn't grabbing them because they were scripts that only run when the page is loaded in Internet Explorer, and I was working with Chrome.
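One way to grab every script block, including those wrapped in IE conditional comments (e.g. `<!--[if lt IE 9]> ... <![endif]-->`) that a naive comment-stripping pass would skip, is a sketch like this; the project's actual script may work differently:

```python
import re

# Match <script ...>...</script> blocks anywhere in the page, including
# inside conditional comments a naive comment-stripper would discard.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script>", re.DOTALL | re.IGNORECASE)

def extract_scripts(html):
    """Return every script block found in the page source."""
    return SCRIPT_RE.findall(html)
```

Each returned block can then be compared against the corresponding blocks in the other copies of the page.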
Later, I met with one of the Ph.D. students to discuss what I should work on next. I was told to read a paper on classifying obfuscated JavaScript. The paper gave a lot of insight into what we would need to include in our project: since we are trying to find malicious code in webpages, we'll have to know how to classify obfuscated code, because malicious code usually hides behind obfuscation so that users can't tell what it's doing just by looking at it. I was then given a list of traits commonly found in malicious code, like injected iframes and access to 'document.cookie', and was told to change my variable-frequency program to count these possibly malicious signs. It was difficult figuring out an easy and efficient way to search for certain signs, but I think I found a decent approach.

At the end of this week I was setting up a Virtual Machine so that I could look at malicious websites without affecting my computer. I was also going to download a program called JSDetox, which simplifies obfuscated code, to see how it could help with our project. However, the Virtual Machine was running very slowly and freezing my computer, so it took me two full days to get the names of 5 websites and to start the JSDetox download. Next week, I plan on figuring out how to make the VM run faster without lagging.

This week started with a meeting with one of the Ph.D. students and my mentor. We talked about the best course of action and decided what I should work on for the rest of the week. I created a program that continues off the HTML-file comparison code I created before. This new program looks at the differences between the files and reports the frequency of the variables and the code blocks in which the differences occurred.
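Counting suspicious traits like the injected iframes and 'document.cookie' access mentioned above might be sketched as follows; the patterns are illustrative guesses, not the actual list I was given:

```python
import re

# Illustrative patterns for traits often reported in injected code.
SIGNS = {
    "hidden_iframe": re.compile(r"<iframe[^>]*(?:hidden|display:\s*none)", re.I),
    "cookie_access": re.compile(r"document\.cookie"),
    "eval_unescape": re.compile(r"eval\s*\(\s*unescape", re.I),
}

def count_signs(html):
    """Count occurrences of each suspicious trait in a page."""
    return {name: len(rx.findall(html)) for name, rx in SIGNS.items()}
```

A page scoring high on several of these signs is worth a closer manual look, even before any classifier is involved.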
After that was completed, I met with another researcher at Princeton and talked with him about similarities and patterns in malicious code. He provided advice and links to help me learn more about the topic and apply that knowledge to the project. At the end of the week, I met with one of the Ph.D. students and we discussed what I should work on for the remainder of this week and the beginning of next. We created a long list, and I started by creating a decoder that takes in the encoded HTML files and returns a standard version a human is able to read. I will work on the rest of the list next week.

During this week, I was able to attend a Ph.D. student research panel. I had the opportunity to listen to Computer Science Ph.D. students talk about their research and the feedback they received from other students and professors. I learned a lot more about Bitcoin, online learning applications replacing traditional learning, and IoT hacking. I was glad I was able to sit in on this meeting and learn more about the research others in the department are working on.

I began this week with a meeting with the two Ph.D. students who lead this project. We discussed what I should start working on now that I had familiarized myself with their code and data. Since our project compares the same webpage loaded from various locations to find the differences, and then decides whether those differences are malicious, we decided it would be best if I wrote a new comparison script to compare the files and improve their previous method.
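The post doesn't say which encodings the decoder handles; assuming the stored bodies may be gzip-compressed and in varying character sets, a sketch could look like this:

```python
import gzip

def decode_body(raw, content_encoding="", charset="utf-8"):
    """Turn a raw HTTP body (bytes) into readable text.

    Assumes the body may be gzip-compressed and in an arbitrary
    charset; the real pipeline's encodings may differ.
    """
    # Honor an explicit Content-Encoding header, or sniff the gzip magic.
    if content_encoding == "gzip" or raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    return raw.decode(charset, errors="replace")
```

The `errors="replace"` choice keeps the decoder from crashing on a mislabeled charset, at the cost of a few substitution characters.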
I worked on this for part of the week, along with reading published papers on research similar to our problem. At our mid-week meeting, they looked at my code and provided feedback. For the rest of this week and the beginning of next, I will be adding parts to my code so that it can be swapped into the project in place of their old code. I will also be looking up and taking notes on examples of malicious code to find similarities we can use to distinguish between malicious and benign changes. We're working on defining that distinction, since some of the changes the code currently flags are just location differences or different user IDs. Because the setup and variable names change between sites, we are trying to find a universal way to ignore those changes, since they don't count as malicious.

I met with my professor this week after he returned from conferences. We discussed three possible research topics I could study: vulnerability risks of toys that connect to the internet, insecure websites susceptible to man-in-the-middle attacks, and Internet censorship. We decided to go with insecure websites and their vulnerability to malicious attacks.
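One way to keep location- and session-specific noise like user IDs out of the comparison is to normalize it away before diffing; the patterns below are illustrative guesses, not the project's actual filters:

```python
import difflib
import re

# Illustrative patterns for fields that legitimately differ per
# location or session (session tokens, long numeric timestamps).
NOISE = [
    (re.compile(r"sessionid=\w+"), "sessionid=X"),
    (re.compile(r"\b\d{10,13}\b"), "TIMESTAMP"),
]

def normalize(html):
    """Replace per-session fields with fixed placeholders."""
    for rx, repl in NOISE:
        html = rx.sub(repl, html)
    return html

def diff_pages(a, b):
    """Unified diff of two normalized copies of the same page."""
    return list(difflib.unified_diff(
        normalize(a).splitlines(), normalize(b).splitlines(), lineterm=""))
```

Anything that survives normalization and still shows up in the diff is a candidate for the malicious-vs-benign decision.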
Two students here at Princeton had started a project on this years ago and eventually stopped working on it. While I am here for the summer, we are going to pick the project back up and try to make more progress. We are beginning by using Amazon's Alexa rankings to give us the top 100 sites, loaded from various locations globally, to see the differences in the HTTP responses. Then we will use this information to build a system that can notify users when the website they are visiting has been tampered with by a man in the middle. Our program will hopefully be able to tell the difference between benign and malicious attacks, so that users are only notified when there is real risk involved. I spent the later portion of this week meeting with my professor and other professors at Princeton, as well as with the students who began this website project, aka Stormship. I have been going over their data and trying to understand what they have done so far and what I will do for the rest of this summer to build on their work.
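Seeing whether the same page differs across vantage points can start with something as simple as hashing each copy; a sketch (the location names are hypothetical):

```python
import hashlib

def group_by_body(responses):
    """Group vantage points that received byte-identical bodies.

    responses: {location_name: body_bytes}. More than one bucket means
    the page differed somewhere along the path to at least one location.
    """
    buckets = {}
    for loc, body in responses.items():
        digest = hashlib.sha256(body).hexdigest()
        buckets.setdefault(digest, []).append(loc)
    return buckets
```

A page whose copies all hash identically needs no further attention; the interesting cases are the ones that split into multiple buckets, which is where the benign-vs-malicious question comes in.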