짜리몽땅 매거진
[Python] Applying Text-Processing Syntax
import pandas, re
join, apply, split
Applied Text Analysis in Practice
Let's apply Python syntax to analyze some text!
pip install pandas
Requirement already satisfied: pandas in c:\users\rnrwnsgh\anaconda3\lib\site-packages (1.4.4)
Requirement already satisfied: pytz>=2020.1 in c:\users\rnrwnsgh\anaconda3\lib\site-packages (from pandas) (2022.1)
Requirement already satisfied: numpy>=1.18.5 in c:\users\rnrwnsgh\anaconda3\lib\site-packages (from pandas) (1.21.5)
Requirement already satisfied: python-dateutil>=2.8.1 in c:\users\rnrwnsgh\anaconda3\lib\site-packages (from pandas) (2.8.2)
Requirement already satisfied: six>=1.5 in c:\users\rnrwnsgh\anaconda3\lib\site-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)
Note: you may need to restart the kernel to use updated packages.
import pandas as pd
import re  # regular expression module
df=pd.read_csv("C:/Users/rnrwnsgh/Desktop/spam.csv")
df
Unnamed: 0 target text
0 0 ham Go until jurong point, crazy.. Available only ...
1 1 ham Ok lar... Joking wif u oni...
2 2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 3 ham U dun say so early hor... U c already then say...
4 4 ham Nah I don't think he goes to usf, he lives aro...
... ... ... ...
5569 5569 spam This is the 2nd time we have tried 2 contact u...
5570 5570 ham Will ü b going to esplanade fr home?
5571 5571 ham Pity, * was in mood for that. So...any other s...
5572 5572 ham The guy did some bitching but I acted like i'd...
5573 5573 ham Rofl. Its true to its name
5574 rows × 3 columns
df=df[['target','text']]
df
target text
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...
... ... ...
5569 spam This is the 2nd time we have tried 2 contact u...
5570 ham Will ü b going to esplanade fr home?
5571 ham Pity, * was in mood for that. So...any other s...
5572 ham The guy did some bitching but I acted like i'd...
5573 ham Rofl. Its true to its name
5574 rows × 2 columns
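- String data needs to be cleanly preprocessed.
- A computer cannot understand the meaning of text by itself.
- The text therefore has to be converted into numbers.
- Regular expressions can be used for this preprocessing.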
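The last point can be sketched as follows. This is a minimal illustration rather than the method used in the rest of this post: the names `PUNCT_PATTERN` and `strip_punctuation` are hypothetical, and the sketch assumes that "preprocessing" here means removing ASCII punctuation.

```python
import re
import string

# Character class matching any single ASCII punctuation mark.
# re.escape keeps characters such as ] and \ from being read as regex syntax.
PUNCT_PATTERN = re.compile("[" + re.escape(string.punctuation) + "]")

def strip_punctuation(text):
    """Return text with every ASCII punctuation character removed."""
    return PUNCT_PATTERN.sub("", text)

sample = "Ok lar... Joking wif u oni..."  # second message in the table above
cleaned = strip_punctuation(sample)
print(cleaned)
```

Note that only the punctuation is deleted. The spaces around it stay, so a standalone symbol such as `*` leaves two consecutive spaces behind.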
import string
string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
sp_df=df['text'][0]
sp_df
'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'
a = []
for i in sp_df:
    if i not in string.punctuation:
        a.append(i)
str_1 = ''.join(a)  # join concatenates the strings in a list into one string; it is the inverse of split
str_1
'Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amore wat'
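To make the relationship between join and split concrete, here is a small sketch on a hand-written list (the variable names are only for illustration):

```python
words = ["Go", "until", "jurong", "point"]

# join: list of strings -> one string, with the separator placed between items
sentence = " ".join(words)

# split: one string -> list of strings, cut at the same separator
round_trip = sentence.split(" ")

print(sentence)
print(round_trip == words)

# With an empty separator, join simply glues the pieces together,
# which is how the loop above rebuilds the cleaned string.
glued = "".join(["a", "b", "c"])
print(glued)
```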
# Define the same steps as a function
def pre_str(x):
    new_str = []
    for i in x:
        if i not in string.punctuation:
            new_str.append(i)
    new_str = ''.join(new_str)
    return new_str
pre_str(sp_df)
'Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amore wat'
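As an aside, the same result can probably be reached without an explicit loop by using `str.translate`. This is an alternative sketch, not the approach this post uses. The names `DELETE_PUNCT` and `pre_str_translate` are hypothetical, and the sample is a shortened prefix of the first message.

```python
import string

# Translation table whose third argument lists the characters to delete.
DELETE_PUNCT = str.maketrans("", "", string.punctuation)

def pre_str_translate(x):
    """Remove ASCII punctuation from x in a single pass."""
    return x.translate(DELETE_PUNCT)

sample = "Go until jurong point, crazy.. Available only in bugis"
result = pre_str_translate(sample)
print(result)
```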
- apply: a function that applies another function to the data during preprocessing.
- It is used to run a function over every value in the data.
df['text_n']=df['text'].apply(pre_str)
df['text_n']
0 Go until jurong point crazy Available only in ...
1 Ok lar Joking wif u oni
2 Free entry in 2 a wkly comp to win FA Cup fina...
3 U dun say so early hor U c already then say
4 Nah I dont think he goes to usf he lives aroun...
...
5569 This is the 2nd time we have tried 2 contact u...
5570 Will ü b going to esplanade fr home
5571 Pity was in mood for that Soany other suggest...
5572 The guy did some bitching but I acted like id ...
5573 Rofl Its true to its name
Name: text_n, Length: 5574, dtype: object
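To show what `apply` does here, the sketch below runs it on a tiny hand-made Series and compares it with an explicit loop. The data in `toy` is made up for illustration and does not come from spam.csv. `pre_str` is rewritten compactly but follows the same logic as the function defined above.

```python
import string
import pandas as pd

def pre_str(x):
    # Same logic as the function defined above: keep non-punctuation characters.
    return "".join(ch for ch in x if ch not in string.punctuation)

# Toy data (hypothetical, not taken from spam.csv)
toy = pd.Series(["Hello, world!", "Free entry!!!", "ok"])

by_apply = toy.apply(pre_str)        # calls pre_str once per element
by_loop = [pre_str(t) for t in toy]  # the equivalent explicit loop

print(by_apply.tolist())
print(by_apply.tolist() == by_loop)
```

The result of `apply` is a new Series with the same index as the input, which is why it can be assigned directly to a new column such as `df['text_n']`.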
df
target text text_n
0 ham Go until jurong point, crazy.. Available only ... Go until jurong point crazy Available only in ...
1 ham Ok lar... Joking wif u oni... Ok lar Joking wif u oni
2 spam Free entry in 2 a wkly comp to win FA Cup fina... Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say... U dun say so early hor U c already then say
4 ham Nah I don't think he goes to usf, he lives aro... Nah I dont think he goes to usf he lives aroun...
... ... ... ...
5569 spam This is the 2nd time we have tried 2 contact u... This is the 2nd time we have tried 2 contact u...
5570 ham Will ü b going to esplanade fr home? Will ü b going to esplanade fr home
5571 ham Pity, * was in mood for that. So...any other s... Pity was in mood for that Soany other suggest...
5572 ham The guy did some bitching but I acted like i'd... The guy did some bitching but I acted like id ...
5573 ham Rofl. Its true to its name Rofl Its true to its name
5574 rows × 3 columns
df['text_n'][0]
'Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amore wat'
df_text_n_1=df['text_n'][0].split(' ')
df_text_n_1
['Go',
'until',
'jurong',
'point',
'crazy',
'Available',
'only',
'in',
'bugis',
'n',
'great',
'world',
'la',
'e',
'buffet',
'Cine',
'there',
'got',
'amore',
'wat']
Applying list comprehensions with split, len, and replace
df_all=[i.split(' ') for i in df['text_n']]
sp_len_all=[len(k) for k in df_all]
sp_len_all_2 = [len(k.replace(' ','')) for k in df['text_n']]
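The three comprehensions above build, for each message, a token list, a token count, and a character count that ignores spaces. The sketch below repeats them on two hand-written strings so the results are easy to check. The list `texts` is made up for illustration, and its second item keeps the double space that is left behind when a standalone symbol is removed.

```python
texts = ["Ok lar Joking wif u oni", "Pity  was in mood"]  # second has a double space

tokens = [t.split(" ") for t in texts]                  # like df_all
token_counts = [len(k) for k in tokens]                 # like sp_len_all
char_counts = [len(t.replace(" ", "")) for t in texts]  # like sp_len_all_2

print(tokens[1])
print(token_counts)
print(char_counts)

# split() with no argument treats a run of whitespace as one separator,
# so it does not produce the empty token that split(" ") does.
whitespace_counts = [len(t.split()) for t in texts]
print(whitespace_counts)
```

Whether the empty tokens matter depends on the analysis. If the counts are meant to be word counts, `split()` without an argument may be the safer choice.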