1. Random access in nondelimited variable-length record collections for parallel reading with Hadoop
- Author
- Christopher Gropp, Jason Anderson, Amy Apon, and Linh B. Ngo
- Subjects
Correctness; business.industry; Network packet; Heuristic; Computer science; Search engine indexing; Byte; 020206 networking & telecommunications; 02 engineering and technology; 020204 information systems; Packet analyzer; Scalability; 0202 electrical engineering, electronic engineering, information engineering; business; Random access; Computer network
- Abstract
The industry-standard Packet CAPture (PCAP) format for storing network packet traces is normally readable only serially because it lacks delimiters, indexing, and blocking. This presents a challenge for parallel analysis of large networks, where packet traces can be many gigabytes in size. In this work we present RAPCAP, a novel method for random access into variable-length record collections like PCAP that identifies a record boundary within a small number of bytes of the access point. Unlike related heuristic methods, whose nonzero probability of error can limit scalability, the new method offers a correctness guarantee given a well-formed file and does not rely on prior knowledge of the contents. We include a practical implementation of the algorithm as an extension to the Hadoop framework, and a performance comparison to serial ingestion. Finally, we present a number of similar storage types that could use a modified version of RAPCAP for random access.
- Published
- 2017
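The abstract's remark that PCAP is normally readable only serially can be illustrated with a minimal Python sketch. It is not the RAPCAP algorithm, which the abstract does not specify; it only shows the serial walk that RAPCAP is meant to avoid. It assumes the classic PCAP layout (a 24-byte file header followed by records that each start with a 16-byte header whose third field is the captured length) with microsecond timestamps. The names `record_offsets` and `build_pcap` are hypothetical helpers introduced for this illustration.

```python
import io
import struct

GLOBAL_HEADER_LEN = 24  # classic pcap file header
RECORD_HEADER_LEN = 16  # ts_sec, ts_usec, incl_len, orig_len


def record_offsets(stream):
    """Yield the byte offset of each record header by walking the file serially.

    Each record's position depends on the lengths of all records before it,
    which is why a reader cannot start at an arbitrary byte offset.
    """
    header = stream.read(GLOBAL_HEADER_LEN)
    if len(header) < GLOBAL_HEADER_LEN:
        raise ValueError("truncated pcap file header")
    magic = header[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"  # file written in little-endian byte order
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"  # file written in big-endian byte order
    else:
        raise ValueError("not a classic microsecond-resolution pcap file")

    offset = GLOBAL_HEADER_LEN
    while True:
        rec = stream.read(RECORD_HEADER_LEN)
        if len(rec) < RECORD_HEADER_LEN:
            return  # end of file
        _ts_sec, _ts_usec, incl_len, _orig_len = struct.unpack(endian + "IIII", rec)
        yield offset
        stream.seek(incl_len, io.SEEK_CUR)  # skip the packet bytes
        offset += RECORD_HEADER_LEN + incl_len


def build_pcap(payloads):
    """Build a tiny little-endian pcap image in memory (illustration only)."""
    out = struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)
    for payload in payloads:
        out += struct.pack("<IIII", 0, 0, len(payload), len(payload)) + payload
    return out


if __name__ == "__main__":
    image = build_pcap([b"abc", b"", b"hello"])
    print(list(record_offsets(io.BytesIO(image))))
```

Because nothing in the byte stream marks where a record begins, a worker handed the middle of such a file has no direct way to locate the next header; this is the gap that, per the abstract, RAPCAP closes with a correctness guarantee.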