The security industry as a whole loves collecting data, and researchers are no different. With more data, they commonly become more confident in their statements about a threat. However, large volumes of data require more processing resources, as extracting meaningful and useful information from highly unstructured data is particularly difficult. As a result, manual data analysis is often the only choice, forcing security professionals like investigators, penetration testers, reverse engineers, and analysts to process data through tedious and repetitive operations.
We have created a flexible toolkit based on open-source libraries for efficiently analyzing millions of defaced web pages. It can also be used more generally on web pages planted as a result of an attack. Called DefPloreX (a play on words from “Defacement eXplorer”), our tool uses a combination of machine-learning and visualization techniques to turn unstructured data into meaningful high-level descriptions. Real-time information on incidents, breaches, attacks, and vulnerabilities is processed and condensed into browsable objects that are suitable for efficient large-scale e-crime forensics and investigations.
DefPloreX ingests plain, flat tabular files (e.g., CSV files) containing metadata records of the web incidents under analysis (e.g., URLs), explores their resources with headless browsers, extracts features from the defaced web pages, and stores the resulting data in an Elastic index. The distributed headless browsers, as well as any large-scale data-processing operation, are coordinated via Celery, the de facto standard for distributed task coordination. Using a multitude of Python-based data-analysis techniques and tools, DefPloreX creates offline “views” of the data, allowing easy pivoting and exploration.
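To make the ingest-and-extract flow more concrete, here is a minimal sketch that uses only the Python standard library. It is not DefPloreX's actual code: the column names (`id`, `url`), the feature set (page title, image count, link count), and the helpers `extract_features` and `process_records` are all assumptions chosen for illustration. The `fetch` callable stands in for a headless browser, and the returned dictionaries stand in for the documents that would be written to the index.

```python
import csv
import io
from html.parser import HTMLParser


class FeatureParser(HTMLParser):
    """Collects a few simple page features (a hypothetical feature set)."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.n_images = 0
        self.n_links = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "img":
            self.n_images += 1
        elif tag == "a":
            self.n_links += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text inside <title> is accumulated.
        if self._in_title:
            self.title += data


def extract_features(html):
    """Turn raw HTML into a flat dictionary of features."""
    parser = FeatureParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "n_images": parser.n_images,
        "n_links": parser.n_links,
    }


def process_records(csv_text, fetch):
    """Read incident records from CSV text and enrich each with page features.

    `fetch` is any callable mapping a URL to its HTML; in a real deployment
    a headless browser would play this role (assumption for this sketch).
    """
    docs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        doc = dict(row)  # keep the original metadata columns
        doc.update(extract_features(fetch(row["url"])))
        docs.append(doc)  # stand-in for indexing the document
    return docs
```

In a distributed setup, the body of the loop in `process_records` is the natural unit of work to hand to a task queue, since each record can be fetched and analyzed independently of the others.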