Textual File Fragments Dataset and Code
- Fatemeh Mansouri Hanis
- Mehdi Teimouri
Category: Project
Description: In this study, we present a dataset of file fragments from five textual file formats: the binary file format for Word 97 to Word 2003 (DOC), the Microsoft Word Open XML format (DOCX), Portable Document Format (PDF), Rich Text Format (RTF), and standard text document (TXT). The fragments come from files in three languages: English, Persian, and Chinese. Each of the 15 pairs of file format and language provides 1,500 file fragments, so the dataset contains 22,500 file fragments in total.
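
The dataset size stated above can be checked with a short sketch. The format and language labels and the per-pair count come from the description. The names `FORMATS`, `LANGUAGES`, `FRAGMENTS_PER_PAIR`, and `fragment_counts` are illustrative assumptions and are not part of the published code.

```python
from itertools import product

# Labels taken from the description; the constant names are hypothetical.
FORMATS = ["DOC", "DOCX", "PDF", "RTF", "TXT"]
LANGUAGES = ["English", "Persian", "Chinese"]
FRAGMENTS_PER_PAIR = 1500


def fragment_counts():
    """Map each (format, language) pair to its stated number of fragments."""
    return {
        (fmt, lang): FRAGMENTS_PER_PAIR
        for fmt, lang in product(FORMATS, LANGUAGES)
    }


counts = fragment_counts()
total = sum(counts.values())
print(f"{len(counts)} pairs, {total} fragments in total")
```

This only reproduces the arithmetic in the description (5 formats, 3 languages, 1,500 fragments per pair). It says nothing about how the files are actually laid out in the dataset.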