
짜리몽땅 매거진

[Python] Applied Text Syntax

쿡국 2023. 8. 29. 15:53

import pandas, re
join, apply, split 

Applied text analysis in practice
Let's put Python syntax to work on a real text dataset!

pip install pandas
Requirement already satisfied: pandas in c:\users\rnrwnsgh\anaconda3\lib\site-packages (1.4.4)
Requirement already satisfied: pytz>=2020.1 in c:\users\rnrwnsgh\anaconda3\lib\site-packages (from pandas) (2022.1)
Requirement already satisfied: numpy>=1.18.5 in c:\users\rnrwnsgh\anaconda3\lib\site-packages (from pandas) (1.21.5)
Requirement already satisfied: python-dateutil>=2.8.1 in c:\users\rnrwnsgh\anaconda3\lib\site-packages (from pandas) (2.8.2)
Requirement already satisfied: six>=1.5 in c:\users\rnrwnsgh\anaconda3\lib\site-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)
Note: you may need to restart the kernel to use updated packages.

import pandas as pd
import re  # regular expression module
df=pd.read_csv("C:/Users/rnrwnsgh/Desktop/spam.csv")
df
Unnamed: 0	target	text
0	0	ham	Go until jurong point, crazy.. Available only ...
1	1	ham	Ok lar... Joking wif u oni...
2	2	spam	Free entry in 2 a wkly comp to win FA Cup fina...
3	3	ham	U dun say so early hor... U c already then say...
4	4	ham	Nah I don't think he goes to usf, he lives aro...
...	...	...	...
5569	5569	spam	This is the 2nd time we have tried 2 contact u...
5570	5570	ham	Will ü b going to esplanade fr home?
5571	5571	ham	Pity, * was in mood for that. So...any other s...
5572	5572	ham	The guy did some bitching but I acted like i'd...
5573	5573	ham	Rofl. Its true to its name
5574 rows × 3 columns
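The `Unnamed: 0` column in the output above is what pandas shows when the first CSV column has an empty header, which usually happens when a DataFrame was saved together with its index. Here is a minimal sketch of how `index_col=0` avoids it. The two-row `csv_text` is a hypothetical stand-in, not the real spam.csv.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for spam.csv (the real file is not reproduced here).
csv_text = (
    ",target,text\n"
    "0,ham,Ok lar... Joking wif u oni...\n"
    "1,spam,Free entry in 2 a wkly comp\n"
)

# Default read: the header-less first column is loaded as "Unnamed: 0".
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses that first column as the row index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))      # ['Unnamed: 0', 'target', 'text']
print(list(indexed.columns))  # ['target', 'text']
```

Selecting only the needed columns afterwards, as in the next cell, gives the same result. `index_col=0` just skips that extra step.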

df=df[['target','text']]
df
target	text
0	ham	Go until jurong point, crazy.. Available only ...
1	ham	Ok lar... Joking wif u oni...
2	spam	Free entry in 2 a wkly comp to win FA Cup fina...
3	ham	U dun say so early hor... U c already then say...
4	ham	Nah I don't think he goes to usf, he lives aro...
...	...	...
5569	spam	This is the 2nd time we have tried 2 contact u...
5570	ham	Will ü b going to esplanade fr home?
5571	ham	Pity, * was in mood for that. So...any other s...
5572	ham	The guy did some bitching but I acted like i'd...
5573	ham	Rofl. Its true to its name
5574 rows × 2 columns
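  • String data has to be cleaned up through preprocessing.
  • A computer cannot understand what the characters mean.
  • So the text eventually has to be converted into numbers.
  • Preprocessing can be done with regular expressions.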
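The last point mentions regular expressions, while the cells that follow use a plain loop. For comparison, here is a minimal sketch of the regex route. The function name `strip_punct_regex` is made up for this example and is not part of the original notebook.

```python
import re
import string

# Character class matching any single character in string.punctuation.
# re.escape protects characters such as ] ^ \ - that are special inside [...].
PUNCT_RE = re.compile("[" + re.escape(string.punctuation) + "]")

def strip_punct_regex(text):
    """Return text with every ASCII punctuation character removed."""
    return PUNCT_RE.sub("", text)

sample = "Ok lar... Joking wif u oni..."
print(strip_punct_regex(sample))  # Ok lar Joking wif u oni
```

Like the loop version, this only removes ASCII punctuation. Characters such as ü are left alone.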
import string
string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

sp_df=df['text'][0]

sp_df
'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'

a=[]
for i in sp_df:
    if i not in string.punctuation:
        a.append(i)
str_1=''.join(a)  # join concatenates the strings in a list; it is the opposite of split
str_1
'Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amore wat'

# wrap the same logic in a function with def
def pre_str(x):
    new_str=[]
    for i in x:
        if i not in string.punctuation:
            new_str.append(i)
    new_str = ''.join(new_str)
    return new_str

pre_str(sp_df)
'Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amore wat'
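The same cleanup can also be done without an explicit loop. This is a sketch of an alternative using `str.maketrans` and `str.translate` from the standard library. `pre_str_translate` is a hypothetical name chosen here so it does not clash with `pre_str` above.

```python
import string

# Translation table that maps every punctuation character to None (deletion).
_DELETE_PUNCT = str.maketrans("", "", string.punctuation)

def pre_str_translate(x):
    """Remove ASCII punctuation from x in a single pass."""
    return x.translate(_DELETE_PUNCT)

text = "Go until jurong point, crazy.. Available only"
print(pre_str_translate(text))  # Go until jurong point crazy Available only
```

The loop version is easier to read when you are learning `for` and `join`. The translate version is shorter and does the same job.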
  • apply: calls a function on every value of a column, which is handy for preprocessing.
  • Use it to run one function over the entire dataset.
df['text_n']=df['text'].apply(pre_str)
df['text_n']
0       Go until jurong point crazy Available only in ...
1                                 Ok lar Joking wif u oni
2       Free entry in 2 a wkly comp to win FA Cup fina...
3             U dun say so early hor U c already then say
4       Nah I dont think he goes to usf he lives aroun...
                              ...                        
5569    This is the 2nd time we have tried 2 contact u...
5570                  Will ü b going to esplanade fr home
5571    Pity  was in mood for that Soany other suggest...
5572    The guy did some bitching but I acted like id ...
5573                            Rofl Its true to its name
Name: text_n, Length: 5574, dtype: object
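To see what `apply` is doing, it helps to compare it with a list comprehension that calls the same function on each value. This is a small sketch on a made-up two-row Series. The `pre_str` here is a compact rewrite of the function above, not the original definition.

```python
import string
import pandas as pd

def pre_str(x):
    # Compact rewrite of the loop-based pre_str defined earlier.
    return "".join(ch for ch in x if ch not in string.punctuation)

# Made-up sample data for illustration only.
s = pd.Series(["Ok lar... Joking wif u oni...", "Rofl. Its true to its name"])

via_apply = s.apply(pre_str)          # pandas calls pre_str once per value
via_loop = [pre_str(v) for v in s]    # the same calls, written by hand

print(via_apply.tolist() == via_loop)  # True
```

The difference is that `apply` returns a Series that keeps the original index, so the result can be assigned straight back to a new column.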

df
target	text	text_n
0	ham	Go until jurong point, crazy.. Available only ...	Go until jurong point crazy Available only in ...
1	ham	Ok lar... Joking wif u oni...	Ok lar Joking wif u oni
2	spam	Free entry in 2 a wkly comp to win FA Cup fina...	Free entry in 2 a wkly comp to win FA Cup fina...
3	ham	U dun say so early hor... U c already then say...	U dun say so early hor U c already then say
4	ham	Nah I don't think he goes to usf, he lives aro...	Nah I dont think he goes to usf he lives aroun...
...	...	...	...
5569	spam	This is the 2nd time we have tried 2 contact u...	This is the 2nd time we have tried 2 contact u...
5570	ham	Will ü b going to esplanade fr home?	Will ü b going to esplanade fr home
5571	ham	Pity, * was in mood for that. So...any other s...	Pity was in mood for that Soany other suggest...
5572	ham	The guy did some bitching but I acted like i'd...	The guy did some bitching but I acted like id ...
5573	ham	Rofl. Its true to its name	Rofl Its true to its name
5574 rows × 3 columns

df['text_n'][0]
'Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amore wat'

df_text_n_1=df['text_n'][0].split(' ')

df_text_n_1
['Go',
 'until',
 'jurong',
 'point',
 'crazy',
 'Available',
 'only',
 'in',
 'bugis',
 'n',
 'great',
 'world',
 'la',
 'e',
 'buffet',
 'Cine',
 'there',
 'got',
 'amore',
 'wat']
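One thing to watch with `split(' ')`: when punctuation is removed, the spaces around it stay, so some rows end up with two spaces in a row (row 5571 of `text_n` above appears to be one of them). A short sketch of how `split(' ')` and `split()` differ in that case:

```python
# Two spaces after "Pity", similar to row 5571 of text_n.
text = "Pity  was in mood for that"

# split(' ') cuts at every single space, so it yields an empty string.
print(text.split(" "))  # ['Pity', '', 'was', 'in', 'mood', 'for', 'that']

# split() with no argument treats any run of whitespace as one separator.
print(text.split())     # ['Pity', 'was', 'in', 'mood', 'for', 'that']
```

If the goal is to count words, `split()` is usually the safer choice because the empty strings would otherwise be counted as tokens.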

Applying list comprehensions with split, len, and replace

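# split every cleaned message into a list of tokens
df_all=[i.split(' ') for i in df['text_n']]

# number of tokens in each message
sp_len_all=[len(k) for k in df_all]

# number of characters in each message, not counting spaces
sp_len_all_2 = [len(k.replace(' ','')) for k in df['text_n']]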
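The same two counts can be computed with the pandas `.str` accessor instead of list comprehensions. This is a sketch on a made-up two-row Series, and the names `token_counts` and `char_counts` are hypothetical.

```python
import pandas as pd

# Made-up sample that stands in for df['text_n'].
s = pd.Series(["Ok lar Joking wif u oni", "Rofl Its true to its name"])

# Tokens per message: split on a single space, then take each list's length.
token_counts = s.str.split(" ").str.len()

# Characters per message without spaces: drop the spaces, then measure.
char_counts = s.str.replace(" ", "", regex=False).str.len()

print(token_counts.tolist())  # [6, 6]
print(char_counts.tolist())   # [18, 20]
```

Because the results are Series, they can be stored directly as new columns, for example as length features next to `target`.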
