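Linux dd + grep: binary search in large files
The Linux dd command reads, converts, and writes data.
dd reads from standard input or a file, converts the data according to the specified operands, and writes the result to a file, a device, or standard output.
Parameter reference (dd --help)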
Usage: dd [OPERAND]...
or: dd OPTION
Copy a file, converting and formatting according to the operands.
bs=BYTES read and write BYTES bytes at a time (also see ibs=,obs=)
cbs=BYTES convert BYTES bytes at a time
conv=CONVS convert the file as per the comma separated symbol list
count=N copy only N input blocks
ibs=BYTES read BYTES bytes at a time (default: 512)
if=FILE read from FILE instead of stdin
iflag=FLAGS read as per the comma separated symbol list
obs=BYTES write BYTES bytes at a time (default: 512)
of=FILE write to FILE instead of stdout
oflag=FLAGS write as per the comma separated symbol list
seek=BLOCKS skip BLOCKS obs-sized blocks at start of output
skip=BLOCKS skip BLOCKS ibs-sized blocks at start of input
status=WHICH WHICH info to suppress outputting to stderr;
'noxfer' suppresses transfer stats, 'none' suppresses all
BLOCKS and BYTES may be followed by the following multiplicative suffixes:
c =1, w =2, b =512, kB =1000, K =1024, MB =1000*1000, M =1024*1024, xM =M
GB =1000*1000*1000, G =1024*1024*1024, and so on for T, P, E, Z, Y.
Each CONV symbol may be:
ascii from EBCDIC to ASCII
ebcdic from ASCII to EBCDIC
ibm from ASCII to alternate EBCDIC
block pad newline-terminated records with spaces to cbs-size
unblock replace trailing spaces in cbs-size records with newline
lcase change upper case to lower case
nocreat do not create the output file
excl fail if the output file already exists
notrunc do not truncate the output file
ucase change lower case to upper case
sparse try to seek rather than write the output for NUL input blocks
swab swap every pair of input bytes
noerror continue after read errors
sync pad every input block with NULs to ibs-size; when used
with block or unblock, pad with spaces rather than NULs
fdatasync physically write output file data before finishing
fsync likewise, but also write metadata
Each FLAG symbol may be:
append append mode (makes sense only for output; conv=notrunc suggested)
direct use direct I/O for data
directory fail unless a directory
dsync use synchronized I/O for data
sync likewise, but also for metadata
fullblock accumulate full blocks of input (iflag only)
nonblock use non-blocking I/O
noatime do not update access time
noctty do not assign controlling terminal from file
nofollow do not follow symlinks
count_bytes treat 'count=N' as a byte count (iflag only)
Sending a USR1 signal to a running `dd' process makes it
print I/O statistics to standard error and then resume copying.
$ dd if=/dev/zero of=/dev/null& pid=$!
$ kill -USR1 $pid; sleep 1; kill $pid
18335302+0 records in
18335302+0 records out
9387674624 bytes (9.4 GB) copied, 34.6279 seconds, 271 MB/s
The parameters that matter most here: if, of, bs, skip, count
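As a quick illustration of how these parameters interact, the sketch below shows that skip and count are measured in blocks of bs bytes rather than in bytes. The file name dd_units_demo.txt is a hypothetical name chosen for this sketch, and dd's statistics on stderr are discarded.

```shell
# Hypothetical demo file: 8 bytes, one letter per line.
printf 'a\nb\nc\nd\n' > dd_units_demo.txt

# With bs=2, skip=1 skips one 2-byte block ("a\n") and
# count=2 copies the next two 2-byte blocks ("b\n" and "c\n").
out=$(dd if=dd_units_demo.txt bs=2 skip=1 count=2 2>/dev/null)
printf '%s\n' "$out"
```

Changing bs therefore changes what a given skip or count value means, so the three values have to be chosen together.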
Examples
Preparing the source data
cat>dd_in.txt<<EOF
a
b
c
d
EOF
Binary search: view the first half of the data
dd bs=1 count=4 if=dd_in.txt
a
b
4+0 records in
4+0 records out
4 bytes (4 B) copied, 9.903e-05 s, 40.4 kB/s
Binary search: match within the first half
dd bs=1 count=4 if=dd_in.txt | grep b # matches
dd bs=1 count=4 if=dd_in.txt | grep c # no match
Binary search: view the second half of the data
dd bs=1 skip=4 count=4 if=dd_in.txt
c
d
4+0 records in
4+0 records out
4 bytes (4 B) copied, 0.00013476 s, 29.7 kB/s
Binary search: match within the second half
dd bs=1 skip=4 count=4 if=dd_in.txt | grep b # no match
dd bs=1 skip=4 count=4 if=dd_in.txt | grep c # matches
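The two manual checks above could be wrapped in a small helper that derives the midpoint from the file size. This is only a sketch: which_half is a hypothetical function name, the size is taken with wc -c, and the file is assumed to span at least two blocks of bs bytes.

```shell
# Hypothetical helper: report which half of FILE contains PATTERN.
# Usage: which_half FILE PATTERN [BS]
# Assumption: FILE holds at least two bs-sized blocks.
which_half() {
    file=$1 pattern=$2 bs=${3:-1024}
    size=$(wc -c < "$file")                 # file size in bytes
    blocks=$(( (size + bs - 1) / bs ))      # number of blocks, rounded up
    half=$(( blocks / 2 ))                  # blocks in the first half
    first=no second=no

    # First half: the first $half blocks.
    dd if="$file" bs="$bs" count="$half" 2>/dev/null |
        grep -q -- "$pattern" && first=yes

    # Second half: everything after the first $half blocks.
    dd if="$file" bs="$bs" skip="$half" 2>/dev/null |
        grep -q -- "$pattern" && second=yes

    echo "first=$first second=$second"
}

# Example with the dd_in.txt file created earlier:
#   which_half dd_in.txt c 1    # prints: first=no second=yes
```

A line that happens to be cut in two at the midpoint is seen by neither grep in full, so a match on such a line can be missed.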
Explanation of the example
dd bs=1 count=4 if=dd_in.txt
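bs=1 sets the block size to 1 byte. The sample data is tiny, so 1 byte keeps the explanation simple; for a large file, use 1024 (1 KB) or a larger value to speed up the scan, and remember that count and skip are counted in blocks of that size.
count=4 reads 4 blocks. This value has to be calculated from the file size so that it covers the first half: the sample file is 8 bytes, so with bs=1 the first half is 8 / 2 = 4 blocks.
That selects the first half of the file for a binary search; piping the output to grep performs the actual lookup.
Searching the second half additionally requires skip, which skips the given number of blocks at the start of the input.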
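Repeating the halving step by hand gets tedious on a large file, so the procedure could be automated along the following lines. This is a minimal sketch, not a definitive implementation: find_block is a hypothetical function name, and it assumes the matching line does not straddle a block boundary and that grep's exit status is enough to decide which half to keep.

```shell
# Hypothetical helper: print the 0-based index of a bs-sized block of FILE
# that contains PATTERN; return 1 if PATTERN is not found at all.
# Usage: find_block FILE PATTERN [BS]
find_block() {
    file=$1 pattern=$2 bs=${3:-1024}
    size=$(wc -c < "$file")
    lo=0                                # first block of the current range
    n=$(( (size + bs - 1) / bs ))       # number of blocks in the range

    # Check the whole file once so that a miss is reported up front.
    dd if="$file" bs="$bs" 2>/dev/null | grep -q -- "$pattern" || return 1

    while [ "$n" -gt 1 ]; do
        half=$(( n / 2 ))
        if dd if="$file" bs="$bs" skip="$lo" count="$half" 2>/dev/null |
            grep -q -- "$pattern"; then
            n=$half                     # match is in the first half
        else
            lo=$(( lo + half ))         # otherwise keep the second half
            n=$(( n - half ))
        fi
    done
    echo "$lo"
}

# Example with the dd_in.txt file created earlier:
#   find_block dd_in.txt c 2    # prints: 2
```

Once the block index is known, dd bs=BS skip=INDEX count=1 if=FILE shows just that block. When both halves contain the pattern, this sketch follows the first half.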
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.