Linux dd + grep: binary search in a large file

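The Linux dd command reads, converts, and writes data.

dd reads data from standard input or a file, converts it according to the given operands, and writes the result to a file, a device, or standard output.

Parameter reference (dd --help)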

Usage: dd [OPERAND]...
  or:  dd OPTION
Copy a file, converting and formatting according to the operands.

  bs=BYTES        read and write BYTES bytes at a time (also see ibs=,obs=)
  cbs=BYTES       convert BYTES bytes at a time
  conv=CONVS      convert the file as per the comma separated symbol list
  count=N         copy only N input blocks
  ibs=BYTES       read BYTES bytes at a time (default: 512)
  if=FILE         read from FILE instead of stdin
  iflag=FLAGS     read as per the comma separated symbol list
  obs=BYTES       write BYTES bytes at a time (default: 512)
  of=FILE         write to FILE instead of stdout
  oflag=FLAGS     write as per the comma separated symbol list
  seek=BLOCKS     skip BLOCKS obs-sized blocks at start of output
  skip=BLOCKS     skip BLOCKS ibs-sized blocks at start of input
  status=WHICH    WHICH info to suppress outputting to stderr;
                  'noxfer' suppresses transfer stats, 'none' suppresses all

BLOCKS and BYTES may be followed by the following multiplicative suffixes:
c =1, w =2, b =512, kB =1000, K =1024, MB =1000*1000, M =1024*1024, xM =M
GB =1000*1000*1000, G =1024*1024*1024, and so on for T, P, E, Z, Y.

Each CONV symbol may be:

  ascii     from EBCDIC to ASCII
  ebcdic    from ASCII to EBCDIC
  ibm       from ASCII to alternate EBCDIC
  block     pad newline-terminated records with spaces to cbs-size
  unblock   replace trailing spaces in cbs-size records with newline
  lcase     change upper case to lower case
  nocreat   do not create the output file
  excl      fail if the output file already exists
  notrunc   do not truncate the output file
  ucase     change lower case to upper case
  sparse    try to seek rather than write the output for NUL input blocks
  swab      swap every pair of input bytes
  noerror   continue after read errors
  sync      pad every input block with NULs to ibs-size; when used
            with block or unblock, pad with spaces rather than NULs
  fdatasync  physically write output file data before finishing
  fsync     likewise, but also write metadata

Each FLAG symbol may be:

  append    append mode (makes sense only for output; conv=notrunc suggested)
  direct    use direct I/O for data
  directory  fail unless a directory
  dsync     use synchronized I/O for data
  sync      likewise, but also for metadata
  fullblock  accumulate full blocks of input (iflag only)
  nonblock  use non-blocking I/O
  noatime   do not update access time
  noctty    do not assign controlling terminal from file
  nofollow  do not follow symlinks
  count_bytes  treat 'count=N' as a byte count (iflag only)

Sending a USR1 signal to a running `dd' process makes it
print I/O statistics to standard error and then resume copying.

  $ dd if=/dev/zero of=/dev/null& pid=$!
  $ kill -USR1 $pid; sleep 1; kill $pid
  18335302+0 records in
  18335302+0 records out
  9387674624 bytes (9.4 GB) copied, 34.6279 seconds, 271 MB/s

Parameters to focus on: if, of, bs, skip, count

Examples

Prepare the source data

cat>dd_in.txt<<EOF
a
b
c
d
EOF

Bisect: view the first half of the data

dd bs=1 count=4 if=dd_in.txt

a
b
4+0 records in
4+0 records out
4 bytes (4 B) copied, 9.903e-05 s, 40.4 kB/s

Bisect: match against the first half

dd bs=1 count=4 if=dd_in.txt | grep b # matches
dd bs=1 count=4 if=dd_in.txt | grep c # no match

Bisect: view the second half of the data

dd bs=1 skip=4 count=4 if=dd_in.txt

c
d
4+0 records in
4+0 records out
4 bytes (4 B) copied, 0.00013476 s, 29.7 kB/s

Bisect: match against the second half

dd bs=1 skip=4 count=4 if=dd_in.txt | grep b # no match
dd bs=1 skip=4 count=4 if=dd_in.txt | grep c # matches
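
When dd feeds grep, the transfer statistics still show up on the terminal because dd writes them to stderr. The sketch below assumes a dd that accepts status=none, as listed in the help text above; the variable names first and second are only illustrative.

```shell
printf 'a\nb\nc\nd\n' > dd_in.txt   # same 8-byte sample file as above

# status=none suppresses the "records in/out" lines that dd prints on stderr.
# grep -c prints the number of matching lines; "|| true" keeps a zero count
# from being treated as a failure.
first=$(dd bs=1 count=4 if=dd_in.txt status=none | grep -c b || true)
second=$(dd bs=1 skip=4 count=4 if=dd_in.txt status=none | grep -c b || true)

echo "first half: $first, second half: $second"
```

On a dd without status=none, redirecting stderr with 2>/dev/null gives a similar result, at the cost of hiding real error messages too.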

Explanation of the example

dd bs=1 count=4 if=dd_in.txt

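bs=1 sets the block size to 1 byte (the data here is tiny, so 1 byte keeps the explanation simple; on a large file, use 1024 (1 KB) or a larger value to speed up the scan, keeping in mind that skip and count are then counted in blocks of that size)
count=4 copies 4 blocks, which is the first half here because the sample file is 8 bytes long; this value has to be calculated from the file size.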
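
With a block size larger than 1, skip and count are measured in blocks, so the midpoint has to be computed in blocks as well. The sketch below is one way to derive it from the file size. bs=2 is chosen only so that the 8-byte sample file splits evenly, and the variable names are illustrative. It assumes wc -c is available and that dd accepts status=none.

```shell
printf 'a\nb\nc\nd\n' > dd_in.txt          # same 8-byte sample file as above

bs=2                                      # illustrative; e.g. 1024 or more for a large file
size=$(wc -c < dd_in.txt)                 # file size in bytes
blocks=$(( (size + bs - 1) / bs ))        # number of blocks, rounded up
half=$(( (blocks + 1) / 2 ))              # blocks that make up the first half

first=$(dd bs="$bs" count="$half" if=dd_in.txt status=none)   # blocks 0 .. half-1
second=$(dd bs="$bs" skip="$half" if=dd_in.txt status=none)   # from block "half" to end of file
printf 'first half:\n%s\nsecond half:\n%s\n' "$first" "$second"
```

On real data a line can straddle the block boundary at the midpoint, in which case grep sees it cut in two and may miss it in both halves.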

That covers the first half of the file: pipe the dd output into grep to search it.
To search the second half, add skip to skip over the leading blocks.
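
Repeating the same split on whichever half matches narrows the search down step by step. The function below is a rough sketch of that idea, not a definitive implementation. The name narrow and its MIN argument are hypothetical, and it assumes GNU dd, whose iflag=skip_bytes,count_bytes lets skip and count be given in bytes while bs stays large.

```shell
# narrow FILE PATTERN [MIN]: hypothetical helper, not part of dd or grep.
# Prints "OFFSET LENGTH" of the smallest byte range found to contain PATTERN.
# Assumes GNU dd: iflag=skip_bytes,count_bytes and status=none.
narrow() {
    file=$1; pattern=$2; min=${3:-4096}
    grep -q -- "$pattern" "$file" || return 1    # pattern is not in the file at all
    size=$(wc -c < "$file")
    lo=0; len=$((size))
    while [ "$len" -gt "$min" ] && [ "$len" -gt 1 ]; do
        half=$(( (len + 1) / 2 ))
        rest=$(( len - half ))
        if dd if="$file" bs=64K iflag=skip_bytes,count_bytes \
              skip="$lo" count="$half" status=none | grep -q -- "$pattern"; then
            len=$half                            # match is in the first half
        elif dd if="$file" bs=64K iflag=skip_bytes,count_bytes \
              skip=$((lo + half)) count="$rest" status=none | grep -q -- "$pattern"; then
            lo=$((lo + half)); len=$rest         # match is in the second half
        else
            break                                # matching line straddles the midpoint
        fi
    done
    echo "$lo $len"
}

printf 'a\nb\nc\nd\n' > dd_in.txt
narrow dd_in.txt c 2    # prints "4 2": the match lies in the 2 bytes starting at offset 4
```

The loop stops early when neither half matches, which happens when the matching line is cut by the midpoint; the range printed at that point still contains the match.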

If you repost this article, please credit the source, thanks: http://lukachen.com/archives/417/
