fakecineaste : file carving algorithms

4. STRING MATCHING ALGORITHMS

Since   there   is   no   guarantee   that   file-system metadata exists to provide the location of each file within a file system, searching every byte of a raw disk    image    is    unavoidable   for    file    carving applications to identify the locations of structured files.
Therefore, employing a fast string matching algorithm   is   indispensable   for   minimizing   the overhead    of    file    carving    applications.The objective of string matching algorithms is to find one   or   more   occurrences   of   pattern   in   a   text through the sliding window mechanism.

4.1 The Brute Force Algorithm
The brute force algorithm checks for the pattern by shifting the window by exactly one position with the time complexity O(m×n).

4.2 The Boyer-Moore Algorithm
The Boyer-Moore algorithm (Boyer & Moore, 1977)    and    the    Knuth-Morris-Pratt    algorithm (Knuth et al., 1977) are among the most widely
used single pattern matching algorithms, in which each   pattern   is   searched   within a   given   text separately.     The     Boyer-Moore     algorithm     is
considered as the most efficient string searching algorithm in both theory and practice,and it has become the standard for practical string searching.

4.3 The Knuth-Morris-Pratt Algorithm
Knuth-Morris-Pratt    algorithm    with    the    time complexity proportional to the sum of the lengths of pattern and text, O(m+n), which is independent of the alphabet size. The algorithm performs the string matching from left to right, and it needs a preprocessing phase to construct a partial-match table with the time complexity O(m).

4.4 The Karp-Rabin Algorithm
Since hashing is able to provide a simple method to    avoid a    quadratic    number    of    character comparisons, Karp and Rabin (Karp & Rabin, 1987)   propose   an   efficient   randomized   pattern matching algorithm that only checks if the window of text similar to the pattern through the hashing function.

4.5 The Horspool Algorithm
The Horspool algorithm (Horspool, 1980) is a simplified version of the Boyer-Moore algorithm (Boyer & Moore, 1977), which only utilizes the precomputed “bad-character shift” function for shifts in the window.

4.6 The Quick Search Algorithm
Similar   to   the   Horspool   algorithm   (Horspool, 1980), the Quick Search algorithm (Sunday, 1990) is also a simplified version of the Boyer-Moore algorithm (Boyer & Moore, 1977), which only utilizes the precomputed “bad-character shift” function for shifts in the window.

4.7 The Shift-Or Algorithm
The   main   idea   of   the   Shift-Or   algorithm   (R. Baeza-Yates & Gonnet, 1992) is to represent the search state as a number, and each search attempt performs a small number of arithmetic and logical operations

4.8 The Smith Algorithm
It utilizes the precomputed “bad-character shift” function for shifts in the window from the Horspool algorithm (Horspool, 1980) and the Quick Search algorithm (Sunday, 1990).

4.9 The Raita Algorithm
Raita   (Raita,   1992)   proposes   a   new implementation that makes use of the dependencies between successive symbols.

4.10 The Berry-Ravindran Algorithm
The Berry-Ravindran algorithm is a composite of the Quick Search   algorithm   (Sunday,   1990)   and   another variant of the Boyer-Moore algorithm (Boyer & Moore,   197
7),   the   Zhu-Takaoka   algorithms

To provide comparisons between multiple string matching algorithms described from the perspective of forensic analysis

experimental testbed implemented with VMware Workstation and Ubuntu 12.04.3 based on the AMD64 architecture.
The virtual machine utilizes a single CPU core with 1GB of memory
two test images for Scalpel 2.0 to extract various file formats.
The first image a raw partition image of a 65 MB FAT32 file system,
the second image a raw partition image of a 129.4 MB EXT2 file system.
the file formats within the two images include doc, gif, jpg, mov, pdf, wav, and wmv
12 known header and footer patterns in the configuration file ”scalpel.conf"

http://commons.erau.edu/cgi/viewcontent.cgi?article=1318&context=adfsl

B.Multiple Pattern String Matching Algorithms

Multiple pattern matching is an important problem in text processing and is commonly used to locate all the positions of an input string (the so called “text”) where
one or more keywords (the so called “patterns”) from a finite   set   of   keywords   occur.   Multi-pattern   string   matching arises in a number of applications including network intrusion detection, digital forensics, business analytics, and natural language processing

1) Aho-Corasick   String   Matching   Algorithm:   Aho-Corasick algorithm is one of the earliest multi-pattern exact matching algorithms.
2) Commentz – Walter String Matching Algorithm:    Commentz-Walter    algorithm    combines    the    Boyer-Moore technique with the Aho-Corasick algorithm.

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.227.7016&rep=rep1&type=pdf

Multi-pattern string matching is widely used in applications such as network

intrusion detection, digital forensics and full text search.

We focus on Scalpel, a popular open source file recovery tool which performs file
carving using the Boyer-Moore string search algorithm to locate headers and footers in a

we propose fast in-place carving techniques of a popular digital forensics tool Scalpel as well as our
GPU adaptations of the Aho-Corasick and multipattern Boyer-Moore string matching algorithms for the two cases GPU-to-GPU and host-to-host.

FAST IN-PLACE FILE CARVING FOR DIGITAL FORENSICS
We show that the time required for file carving
may be reduced significantly by employing multi-pattern search algorithms such as the
multipattern Boyer-Moore and Aho-Corasick algorithms as well as asynchronous disk
reads and multithreading as typically supported on multicore commodity PC

3. We are exploring multi-pattern string matching algorithms on Graphics Processing
Units (GPUs) that comprise a large number of processors. The target applications are
digital forensics and intrusion detection. We have analyzed the performance of the
popular file-carving software Scalpel 1.6 and determined that this software spend
almost all of its time reading from disk and searching for headers and footers. The
time spent on the latter activity may be drastically reduced (by a factor of 17 when we
have 48 rules) by replacing Scalpel’s current search algorithm (Boyer Moore) by the
Aho-Corasick algorithm
http://etd.fcla.edu/UF/UFE0041979/zha_x.pdf

File carving is a forensics technique that recovers files based merely on file structure and content and without any matching file sys-

tem meta-data.
http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=4806206&tag=1

Block-Based Carving Any carving method (algorithm) that analyzes the input on block-by-block basis to determine if a block is part of a possible output file. This method assumes that each block can only be part of a single file (or embedded file).

Statistical Carving Any carving method (algorithm) that analyzes the input on characteristic or statistic for example, entropy) to determine if the input is part of a possible output file.

Header/Footer Carving A method for carving files out of raw data using a distinct header (start of file marker) and footer (end of file marker).

Header/Maximum (file) size Carving A method for carving files out of raw data using a distinct header (start of file marker) and a maximum (file) size. This approach works because many file formats (e.g. JPEG, MP3) do not care if additional junk is appended to the end of a valid file.

Header/Embedded Length Carving A method for carving files out of raw data using a distinct header and a file length (size) which is embedded in the file format File structure based Carving A method for carving files out of raw data using a certain level of knowledge of the internal structure of file types. Garfinkel called this approach "Semantic Carving" in his DFRWS2006 carving challenge submission, while Metz and Mora called the approach "Deep Carving." Semantic Carving A method for carving files based on a linguistic analysis of the file's content. For example, a semantic carver might conclude that six blocks of french in the middle of a long HTML file written in English is a fragment left from a previous allocated file, and not from the English-language HTML file.

Carving with Validation A method for carving files out of raw data where the carved files are validated using a file type specific validator.

Fragment Recovery Carving A carving method in which two or more fragments are reassembled to form the original file or object. Garfinkel previously called this approach "Split Carving."

Repackaging Carving A carving method that modifies the extracted data by adding new headers, footers, or other information so that it can be viewed with standard utilities. For example, Garfinkel's ZIP Carver looks for individual components of a ZIP file and repackages them with a new Central Directory so that they can be opened with a standard unzip utility.

http://www.cse.scu.edu/~tschwarz/COEN252_13/Labs/lab2.html

Scalpel, a popular open source file recovery tool,performs file carving using the Boyer-Moore string search algorithm to locate headers and footers in a disk image.

We show that the time required for file carving may be reduced significantly by employing multi-pattern search algorithms such as the multipattern Boyer-Moore and Aho-Corasick algorithms as
well as asynchronous disk reads and multithreading as typically supported on multicore commodity PCs.

BoyerMoore.java

https://algs4.cs.princeton.edu/53substring/BoyerMoore.java.html

File Carving, or sometimes simply Carving, is the practice of searching an input for files or other kinds of objects based on content, rather than on metadata

http://www.forensicswiki.org/wiki/File_Carving#File_Carving_challenges_and_test_images

Tools:Data Recovery

http://www.forensicswiki.org/wiki/Tools:Data_Recovery

File carving is a well known computer forensics term used to describe the identification and extraction of file types from unallocated clusters using file signatures.

A file signature, also commonly referred to as a magic number, is a constant numerical or text value used to identify a file format.
The object of carving is to identify and extract (carve) the file based on this signature information alone.

http://www.forensicexplorer.com/data-carve.php

Data carving, or file carving is a process of reading files without reference to an file system.

The technique can be applied to any type if disk that stores data on sector boundaries
It is based on the fact that most files start with a recognisable data signature

https://www.cnwrecovery.com/html/data_carving.html

fakecineaste

Tuesday, October 24, 2017

file carving algorithms

No comments:

Post a Comment

Labels

Blog Archive

Followers

Search This Blog