Fun in the Linux Terminal: Playing with Word and Character Counts

Before we can use scripts to analyze a text file, we need a text file. For consistency, we will create one from the output of the man command, as shown below.

$ man man > man.txt

The command above redirects the manual page of the man command into the file man.txt.

To find the most common words, run the following pipeline on the file we just created.

$ cat man.txt | tr ' ' '\012' | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | grep -v '[^a-z]' | sort | uniq -c | sort -rn | head

Sample Output

7557 
262 the 
163 to 
112 is 
112 a 
78 of 
78 manual 
76 and 
64 if 
63 be

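The pipeline above prints the ten most frequent entries. The first line, which has a count but no word, is the number of empty lines left over after splitting, so it is followed by the nine most common words.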
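
To see what each stage contributes, the same idea can be tried on a short, known input. This is only a sketch: the function name top_words is hypothetical, and two details differ from the command above as deliberate assumptions (splitting on any whitespace with tr -s, and dropping empty lines with grep -E so that they are not counted).

```shell
# Hypothetical helper: print the N most frequent words read from stdin.
top_words() {
    tr -s '[:space:]' '\n' |        # one word per line, splitting on any whitespace
    tr '[:upper:]' '[:lower:]' |    # fold case so "The" and "the" count together
    tr -d '[:punct:]' |             # strip punctuation
    grep -E '^[a-z]+$' |            # keep non-empty, purely alphabetic words
    sort | uniq -c | sort -rn |     # count identical words and rank by count
    head -n "${1:-10}"              # keep the top N (default 10)
}

# "the" appears three times and "dog" twice in this sample sentence.
printf 'The cat saw the dog. The dog ran!\n' | top_words 2
```

Wrapping the pipeline in a function makes it easy to point at any file, for example top_words 10 < man.txt.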

How do we break text into individual letters? Use the following command.

$ echo 'tecmint team' | fold -w1

Sample Output

t 
e 
c 
m 
i 
n 
t 
t 
e 
a 
m

Note: -w1 sets the line width to one column, so fold puts each character on its own line.
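
The effect of the width is easier to see with a value other than 1. As a small illustration (the width 4 is an arbitrary choice, not something from the original article), the same string is wrapped every four columns:

```shell
# Wrap the 12-character string every 4 columns: this yields three lines,
# "tecm", "int " (with a trailing space) and "team".
echo 'tecmint team' | fold -w4
```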

Now we split the text file into single characters, sort them, and count them to get the ten most frequent characters.

$ fold -w1 < man.txt | sort | uniq -c | sort -rn | head
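
The counting step can be checked by hand on a tiny input. The sketch below uses a hypothetical helper named char_freq and, as an assumption beyond the command above, filters out the empty lines that fold leaves for blank input lines.

```shell
# Hypothetical helper: rank the characters arriving on stdin by frequency.
char_freq() {
    fold -w1 |          # one character per line
    grep -v '^$' |      # drop empty lines produced by blank input lines
    sort | uniq -c |    # group identical characters and count them
    sort -rn            # most frequent first
}

# "banana" contains three a, two n and one b.
printf 'banana\n' | char_freq
```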

What about uppercase and lowercase? So far they have been counted separately. To count them together, convert everything to uppercase before sorting:

$ fold -w1 < man.txt | tr '[:lower:]' '[:upper:]' | sort | uniq -c | sort -rn | head -20
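
Because uniq -c only merges adjacent identical lines, the position of the case-folding tr relative to sort can change the counts. The sketch below makes this visible on four characters; forcing the C locale with LC_ALL=C is an assumption made here so that the sort order is reproducible, and the function names are hypothetical.

```shell
# Four characters, one per line: a, A, B, b
sample() { printf 'a\nA\nB\nb\n'; }

# Sort first, fold case afterwards: in the C locale the sorted order is
# A B a b, so after tr the stream is A B A B and nothing is adjacent.
sort_then_fold() { sample | LC_ALL=C sort | tr '[:lower:]' '[:upper:]' | uniq -c; }

# Fold case first, then sort: equal letters end up next to each other.
fold_then_sort() { sample | tr '[:lower:]' '[:upper:]' | LC_ALL=C sort | uniq -c; }

sort_then_fold    # four lines, each with a count of 1
echo '---'
fold_then_sort    # two lines: A counted twice, B counted twice
```

In locales whose collation keeps a and A next to each other the first variant may happen to work, which is why folding case before sorting is the safer order.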

Look at the output of that command: punctuation is included. Let's get rid of it with tr:

$ fold -w1 < man.txt | tr '[:lower:]' '[:upper:]' | tr -d '[:punct:]' | sort | uniq -c | sort -rn | head -20

Now that we have three text files, let's run the following command over all of them and look at the combined result.

$ cat *.txt | fold -w1 | tr '[:lower:]' '[:upper:]' | tr -d '[:punct:]' | sort | uniq -c | sort -rn | head -8
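
Since cat *.txt concatenates every matching file before the rest of the pipeline sees any data, the counts are totals across all files. The following sketch demonstrates this with three made-up files in a scratch directory; the file names, their contents, and the helper letters_in_dir are all hypothetical, and grep '[[:alpha:]]' is used here as an assumption to keep letters only.

```shell
# Build three small, hypothetical text files in a scratch directory.
workdir=$(mktemp -d)
printf 'aab\n'      > "$workdir/one.txt"
printf 'abc\n'      > "$workdir/two.txt"
printf 'a, c! c?\n' > "$workdir/three.txt"

# Hypothetical helper: letter frequencies across every .txt file in a directory.
letters_in_dir() {
    cat "$1"/*.txt |
    fold -w1 |                      # one character per line
    grep '[[:alpha:]]' |            # keep letters; drops punctuation, spaces, blanks
    tr '[:lower:]' '[:upper:]' |    # fold case before sorting
    sort | uniq -c | sort -rn       # count and rank
}

# Totals over the three files: A four times, C three times, B twice.
letters_in_dir "$workdir"
rm -r "$workdir"
```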

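Next we will list the rarest words that are at least ten letters long. Here is a simple script:

$ cat man.txt | tr ' ' '\012' | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | tr -d '[0-9]' | sort | uniq -c | sort -n | grep -E '..................' | head

The sample output below shows whole lines rather than single words, which suggests it was captured from a run in which the first tr stage did not split the text on spaces.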

Sample Output

1        ────────────────────────────────────────── 
1        a all 
1        abc             any or all arguments within   are optional 
1               able  see setlocale for precise details 
1        ab              options delimited by  cannot be used together 
1               achieved by using the less environment variable 
1              a child process returned a nonzero exit status 
1               act as if this option was supplied using the name as a filename 
1               activate local mode  format and display  local  manual  files 
1               acute accent

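Note: the long run of dots can be written more compactly as .{18}. With GNU uniq -c each line typically starts with an eight-character count column, so 18 characters correspond to a word of at least ten letters.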
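
Matching a fixed number of characters on the output of uniq -c ties the filter to the width of the count column. One alternative, shown here only as a sketch and not as the article's method, is to test the word length before counting. The helper name rare_long_words and its optional minimum-length argument are hypothetical.

```shell
# Hypothetical helper: the rarest words of at least N letters (default 10).
rare_long_words() {
    min=${1:-10}
    tr -s '[:space:]' '\n' |            # one word per line
    tr '[:upper:]' '[:lower:]' |        # fold case
    tr -d '[:punct:][:digit:]' |        # strip punctuation and digits
    grep -E "^[a-z]{$min,}\$" |         # length test on the bare word
    sort | uniq -c | sort -n |          # count, rarest first
    head
}

# "configuration" occurs once and "documentation" twice; "is" and "too" are too short.
printf 'Documentation is documentation; configuration, too.\n' | rare_long_words
```

Because the length test runs on the bare word, it behaves the same regardless of how a particular uniq implementation pads its counts.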

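These simple scripts show us the most frequent words and characters in English text.

That's all for now. Next time I will cover another interesting topic here that you should enjoy reading. And don't forget to leave us your valuable feedback.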

via: http://www.tecmint.com/play-with-word-and-character-counts-in-linux/

Author: Avishek Kumar  Translator: MikeCoder  Proofreader: wxy

Translated by LCTT and proudly presented by Linux中国.

Reposted from Linux 中国: https://github.com/Linux-CN/archive
