MIT-Missing-Semester4: Data Wrangling

2024-11-23 Source: 个人技术集锦

A. Lecture Notes: Data Wrangling

Context: with shell scripts and common command-line text tools such as awk and sed, you can do a great deal of data processing, much like using numpy and pandas in Python.

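1. Special characters and their meanings

2. Capture groups

In a regular expression, parentheses () mark one or more parts of the pattern as capture groups, so that the matched text can be reused later.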

| sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/'

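This example has three capture groups, which can be referenced as \1, \2, and \3. The replacement is \2, so each matching line is reduced to the text captured by the second group (the username).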
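To make the substitution concrete, here is a minimal sketch that runs the same sed command on a single log line. The log line itself is made up for illustration (the hostname, username, and IP address are assumptions, not taken from the lecture data).

```shell
# Hypothetical sshd-style log line; host, user, and IP are invented for this example.
line='Jan 17 03:13:00 host sshd[2631]: Disconnected from invalid user alice 192.0.2.10 port 51234 [preauth]'

# Group 1: optional "invalid " or "authenticating " prefix
# Group 2: the username (the text between "user " and the address)
# Group 3: optional " [preauth]" suffix
# The replacement \2 keeps only group 2.
user=$(printf '%s\n' "$line" |
  sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/')

echo "$user"
```

Because the whole line is matched from `.*` to `$`, everything except the second group is discarded.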

3. Regex debugger

Tool: https://regex101.com/

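1. More uses of sed

2. sort

sort is a commonly used Linux command. You can choose which column to sort on (-k), whether the order is ascending or descending (-r reverses it), and whether the comparison is lexicographic or numeric (-n).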
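The options above can be sketched on a small two-column input. The data is invented for illustration, and `LC_ALL=C` is set only so the text comparison is the same on every machine.

```shell
# Made-up data: a name and a count per line.
data=$(printf '%s\n' 'carol 9' 'alice 10' 'bob 2')

# -k2,2 restricts the sort key to the second field; n compares it as a number.
numeric=$(printf '%s\n' "$data" | sort -k2,2n)

# r reverses the order, giving a descending numeric sort.
descending=$(printf '%s\n' "$data" | sort -k2,2nr)

# Without n the field is compared as text, so "10" sorts before "2".
lexical=$(printf '%s\n' "$data" | LC_ALL=C sort -k2,2)

printf 'numeric:\n%s\ndescending:\n%s\nlexical:\n%s\n' "$numeric" "$descending" "$lexical"
```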

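3. uniq

uniq -c collapses adjacent identical lines and prefixes each with its number of occurrences. On sorted input it behaves like the value_counts() function in pandas.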
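A minimal sketch of the usual counting pipeline, using an invented word list. The input is sorted first because uniq only merges lines that are next to each other.

```shell
# Made-up input with repeated values in arbitrary order.
words=$(printf '%s\n' apple banana apple cherry apple banana)

# sort groups identical lines together, uniq -c counts each group,
# and sort -nr puts the most frequent value first.
counts=$(printf '%s\n' "$words" | sort | uniq -c | sort -nr)

echo "$counts"
```

Each output line holds a count followed by the value; the amount of leading padding before the count depends on the uniq implementation.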

4. head and tail

Select the first few or the last few lines of the output.
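As a quick illustration, the sketch below uses `seq` to generate ten numbered lines (an arbitrary stand-in for real pipeline output) and keeps only the ends.

```shell
# seq 1 10 prints the numbers 1 through 10, one per line.
first3=$(seq 1 10 | head -n 3)   # first three lines
last2=$(seq 1 10 | tail -n 2)    # last two lines

printf 'head:\n%s\ntail:\n%s\n' "$first3" "$last2"
```

In practice these usually sit at the end of a pipeline, for example after a sort, to keep only the top or bottom entries.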

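awk is another text-processing tool. Strictly speaking it is a small programming language rather than a stream editor, and it is very good at working with text column by column.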
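A minimal sketch of what an awk program looks like, on invented count/name data. awk splits each line into fields ($1, $2, ...), runs the block for lines that match the pattern, and runs the END block once after the last line.

```shell
# Made-up data: a count and a name per line.
data=$(printf '%s\n' '1 alice' '3 bob' '1 carol')

# Pattern + action: print the second field of lines whose first field is 1.
once=$(printf '%s\n' "$data" | awk '$1 == 1 { print $2 }')

# Accumulate the first column across all lines, then print the sum at the end.
total=$(printf '%s\n' "$data" | awk '{ sum += $1 } END { print sum }')

printf 'seen once:\n%s\ntotal: %s\n' "$once" "$total"
```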
