TY - GEN
T1 - Binary analysis with architecture and code section detection using supervised machine learning
AU - Beckman, Bryan
AU - Haile, Jed
AU - Foster, Rita
N1 - Funding Information:
The authors gratefully acknowledge the Department of Energy (DOE) and their Cybersecurity, Energy Security, and Emergency Response - Cybersecurity for Energy Delivery Systems CESER-CEDS mission for funding this research intended to help to protect our nation’s electric grid from cyber-attack.
Publisher Copyright:
© 2020 IEEE.
PY - 2020/5
Y1 - 2020/5
N2 - When presented with an unknown binary, which may or may not be complete, having the ability to determine information about it is critical to future reverse engineering, particularly in discovering the binary's intended use and potential malicious nature. This paper details techniques to both identify the machine architecture of the binary, as well as to locate the important code segments within the file. This identification of unknown binaries makes use of a technique called byte histogram in addition to various machine learning (ML) techniques, which we call 'What is it Binary' or WiiBin. Benefits of byte histograms reflect the simplicity of calculation and do not rely on file headers or metadata, allowing for acceptable results when only a small portion of the original file is available; e.g., when encrypted and/or compressed sections are present in a binary. Utilizing WiiBin, we were able to accurately (>80%) determine the architecture of test binaries with as little as a 20% contiguous portion of the file present. We were also able to determine the location of code sections within a binary by utilizing the WiiBin framework. Ultimately, the more information that can be gleaned from a binary file, the easier it is to successfully reverse engineer.
AB - When presented with an unknown binary, which may or may not be complete, having the ability to determine information about it is critical to future reverse engineering, particularly in discovering the binary's intended use and potential malicious nature. This paper details techniques to both identify the machine architecture of the binary, as well as to locate the important code segments within the file. This identification of unknown binaries makes use of a technique called byte histogram in addition to various machine learning (ML) techniques, which we call 'What is it Binary' or WiiBin. Benefits of byte histograms reflect the simplicity of calculation and do not rely on file headers or metadata, allowing for acceptable results when only a small portion of the original file is available; e.g., when encrypted and/or compressed sections are present in a binary. Utilizing WiiBin, we were able to accurately (>80%) determine the architecture of test binaries with as little as a 20% contiguous portion of the file present. We were also able to determine the location of code sections within a binary by utilizing the WiiBin framework. Ultimately, the more information that can be gleaned from a binary file, the easier it is to successfully reverse engineer.
KW - Algorithms
KW - Architecture Identification
KW - Binary
KW - Byte Histogram
KW - Endianness
KW - Entropy
KW - Machine Learning
UR - http://www.scopus.com/inward/record.url?scp=85099728331&partnerID=8YFLogxK
U2 - 10.1109/SPW50608.2020.00041
DO - 10.1109/SPW50608.2020.00041
M3 - Conference contribution
AN - SCOPUS:85099728331
SN - 9781728193465
T3 - Proceedings - 2020 IEEE Symposium on Security and Privacy Workshops, SPW 2020
SP - 152
EP - 156
BT - Proceedings - 2020 IEEE Symposium on Security and Privacy Workshops, SPW 2020
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2020 IEEE Symposium on Security and Privacy Workshops, SPW 2020
Y2 - 21 May 2020
ER -