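Fun in the Linux Terminal: Playing with Word and Character Counts

Before using scripts to analyze a text file, we need a text file to work with. To keep things consistent, we will create one from the output of the man command, as shown below.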

$ man man > man.txt

The command above redirects the manual page for man into the file man.txt.

To find the most common words, run the following script on the file we just created.

$ cat man.txt | tr ' ' '\012' | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | grep -v '[^a-z]' | sort | uniq -c | sort -rn | head

Sample Output

7557 
262 the 
163 to 
112 is 
112 a 
78 of 
78 manual 
76 and 
64 if 
63 be

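The script above prints the ten most frequent entries. The first line, which shows a count but no word, is the number of empty lines produced when runs of consecutive spaces are turned into newlines; the other nine lines are the most common words.

What about individual letters? Use the following command.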
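To see what each stage contributes, the same pipeline can be wrapped in a small function and run on a tiny input. This is only a sketch: the function name top_words and the sample sentence are made up for illustration, and the extra grep -v '^$' stage is an assumption added here (it is not part of the command above) to drop the empty-line entry.

```shell
# Hypothetical helper: print word frequencies for text read from stdin.
top_words() {
    tr ' ' '\012' |                 # one word per line (\012 is a newline)
    tr '[:upper:]' '[:lower:]' |    # fold everything to lowercase
    tr -d '[:punct:]' |             # strip punctuation
    grep -v '[^a-z]' |              # keep only lines made purely of a-z
    grep -v '^$' |                  # assumption: also drop empty lines
    LC_ALL=C sort | uniq -c | LC_ALL=C sort -rn | head
}

printf 'The cat saw the dog. The dog ran.\n' | top_words
```

On this input the first line of output is the count 3 for "the", followed by 2 for "dog".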

$ echo 'tecmint team' | fold -w1

Sample Output

t 
e 
c 
m 
i 
n 
t 
t 
e 
a 
m

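Note: -w1 simply sets the line width to one character. The space between the two words is also emitted on a line of its own, although it is omitted from the listing above.

Now we break every character of the text file onto its own line, sort the result, and get the ten most frequent characters.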
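As a quick check of how the width option behaves, the sketch below counts the lines fold produces for the same string and then tries a wider setting. The width of 4 is an arbitrary value chosen for illustration.

```shell
# Twelve characters in, twelve one-character lines out (the space counts too).
echo 'tecmint team' | fold -w1 | wc -l

# A wider setting cuts the text into chunks of at most four characters.
echo 'tecmint team' | fold -w4
```

The second command cuts the string into three chunks, the first of which is "tecm".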

$ fold -w1 < man.txt | sort | uniq -c | sort -rn | head
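The command above is easier to follow on a one-word input. The sketch below wraps it in a function; the name char_freq is hypothetical, and LC_ALL=C is an assumption added so the sort order does not depend on the local language settings.

```shell
# Hypothetical helper: character frequencies for text read from stdin.
char_freq() {
    fold -w1 |                      # one character per line
    LC_ALL=C sort |                 # group identical characters together
    uniq -c |                       # prefix each distinct character with its count
    LC_ALL=C sort -rn |             # highest count first
    head
}

printf 'banana\n' | char_freq
```

For "banana" this reports 3 for a, 2 for n and 1 for b, in that order.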

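What about upper and lower case? So far the character count has treated them as different characters. To count them together, convert everything to uppercase before sorting, as follows.

$ fold -w1 < man.txt | tr '[:lower:]' '[:upper:]' | sort | uniq -c | sort -rn | head -20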

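Look at the output above: punctuation is counted as well. Let's get rid of it with tr. Note that the deletion has to happen before sort, otherwise the emptied lines end up scattered through the sorted list and uniq counts them in several separate groups.

$ fold -w1 < man.txt | tr '[:lower:]' '[:upper:]' | tr -d '[:punct:]' | sort | uniq -c | sort -rn | head -20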
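One possible variation, shown here only as a sketch, deletes punctuation and whitespace before fold runs, so that no empty lines are produced at all. The function name letter_freq is hypothetical, and removing whitespace is an assumption that goes beyond the command above.

```shell
# Hypothetical helper: case-insensitive letter frequencies for text on stdin.
letter_freq() {
    tr -d '[:punct:][:space:]' |    # assumption: drop punctuation and whitespace first
    tr '[:lower:]' '[:upper:]' |    # merge upper and lower case
    fold -w1 |                      # one character per line
    LC_ALL=C sort | uniq -c | LC_ALL=C sort -rn | head -20
}

printf 'Hello, World!\n' | letter_freq
```

For this input the top two lines are 3 for L and 2 for O; the comma, the space and the exclamation mark do not appear.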

Now suppose we have three text files. Run the same one-liner over all of them to see the combined result.

$ cat *.txt | fold -w1 | tr '[:lower:]' '[:upper:]' | tr -d '[:punct:]' | sort | uniq -c | sort -rn | head -8
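To try the multi-file case without touching real data, the sketch below builds three throwaway files and counts their letters together. The helper name freq_all_txt, the file names and their contents are all made up for illustration, and deleting punctuation before fold is a variation on the command above rather than a copy of it.

```shell
# Hypothetical helper: letter frequencies across every .txt file in a directory.
freq_all_txt() {
    cat "$1"/*.txt |
    tr -d '[:punct:]' |             # variation: delete punctuation before folding
    fold -w1 |                      # one character per line
    tr '[:lower:]' '[:upper:]' |    # merge upper and lower case
    LC_ALL=C sort | uniq -c | LC_ALL=C sort -rn | head -8
}

dir=$(mktemp -d)                    # throwaway directory for the example
printf 'aab\n' > "$dir/one.txt"
printf 'abc\n' > "$dir/two.txt"
printf 'a.c\n' > "$dir/three.txt"

freq_all_txt "$dir"
rm -r "$dir"
```

Across the three files the letter A occurs four times, so it heads the list.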

Next we will list infrequent words that are at least ten letters long. Here is a simple script:

$ cat man.txt | tr ' ' '\012' | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | tr -d '[0-9]' | sort | uniq -c | sort -n | grep -E '..................' | head

Sample Output (from a run in which the first tr argument was empty, so whole lines rather than single words were counted)

1        ────────────────────────────────────────── 
1        a all 
1        abc             any or all arguments within   are optional 
1               able  see setlocale for precise details 
1        ab              options delimited by  cannot be used together 
1               achieved by using the less environment variable 
1              a child process returned a nonzero exit status 
1               act as if this option was supplied using the name as a filename 
1               activate local mode  format and display  local  manual  files 
1               acute accent

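Note: instead of typing an ever longer run of dots, you can use a repetition count. The 18 dots above are equivalent to .{18}; the pattern has to be longer than ten because, with GNU uniq, the count column added by uniq -c takes up the first eight characters of each line.

These simple scripts reveal the most frequently occurring words and characters in English text.

That's all for now. Next time I will be back here with another interesting topic that you should enjoy reading. And don't forget to give us your valuable feedback.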
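A repetition count also makes it possible to filter on word length before counting, which avoids having to allow for the count column. The following is a minimal sketch under that assumption; the function name long_words, its optional length argument and the sample sentence are hypothetical.

```shell
# Hypothetical helper: words of at least N letters (default 10), rarest first.
long_words() {
    min=${1:-10}
    tr -s '[:space:]' '\012' |          # one word per line, repeated blanks squeezed
    tr '[:upper:]' '[:lower:]' |        # fold to lowercase
    tr -d '[:punct:][:digit:]' |        # strip punctuation and digits
    grep -E "^[a-z]{$min,}\$" |         # keep words with at least $min letters
    LC_ALL=C sort | uniq -c | LC_ALL=C sort -n | head
}

printf 'The documentation explains configuration; documentation helps.\n' | long_words
```

Here "configuration" (seen once) is listed before "documentation" (seen twice), and the shorter words are dropped.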

via: http://www.tecmint.com/play-with-word-and-character-counts-in-linux/

Author: Avishek Kumar, Translator: MikeCoder, Proofreader: wxy

Translated by LCTT and proudly presented by Linux China (Linux中國)

Reposted from Linux China: https://github.com/Linux-CN/archive
