Hi, in this post we’ll talk about basic text analysis using Python. This is something that I learned from ‘Applied Text Analysis with Python’ by Benjamin Bengfort, a good book for getting started with text analysis.
We will talk about 3 features of language:
- Language Feature
- Contextual Feature
- Structural Feature
Language Feature
This type of analysis is based on language features, which means that the words themselves determine whether a text counts as positive or negative. A word match does not necessarily reflect the context or sentiment of the sentence. Hence the language/word feature is only one part of text analysis.
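To see why a word match alone can mislead, here is a minimal sketch. The names `positive_words` and `has_positive_word` are made up for this illustration and do not come from the book.

```python
# Hypothetical mini word list, only for this illustration.
positive_words = {"positive", "gain", "higher"}

def has_positive_word(sentence):
    """Return True if any token of the sentence is in positive_words."""
    tokens = sentence.lower().split()
    return len(positive_words.intersection(tokens)) > 0

# Both sentences contain the token "positive", so both match,
# even though the second one clearly carries a negative meaning.
print(has_positive_word("The outlook is positive"))             # True
print(has_positive_word("The outlook is not positive at all"))  # True
```

The word feature sees the same token in both sentences, which is why context has to be handled separately.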
Basically we have 2 word lists (Good_Words and Bad_Words), stored as Python sets, that contain word tokens.
The sample text (my_text) that we will analyze is part of a financial article from Yahoo Finance. The Python script will tell us whether the news is more positive or negative. The technique is to split my_text into words, use the set intersection function to find the words it shares with each of the 2 lists, and count the distinct matches. Below is the code:
Good_News = 'Good News'
Bad_News = 'Bad News'
Not_Sure = 'Not Sure'
Mix_Of_Good_Bad = 'Mixed'

# Word lists are stored as sets so that intersection() can be used
Good_Words = {'positive', 'exciting', 'upbeat', 'bullish', 'higher',
              'gain', 'gained', 'rise', 'rose'}
Bad_Words = {'negative', 'dull', 'bearish', 'slump', 'lower', 'retrench',
             'layoff', 'layoffs', 'worry', 'worries'}

# Adjacent string literals are joined together, so each piece ends with a space
my_text = ("U.S. stocks drifted higher on Friday to log gains across the board, capping the final full trading week of 2022. "
           "When the closing bell rang on Wall Street, the S&P 500 (^GSPC) rose 0.6%, the Dow Jones Industrial Average (^DJI) "
           "rose 0.5, while the technology-heavy Nasdaq Composite (^IXIC) gained 0.2%.")

def good_bad_news(words):
    # Globals so that the counts can be printed after the call
    global good_len
    global bad_len
    tokens = words.split(" ")
    # Count the distinct words shared with each word list
    good_len = len(Good_Words.intersection(tokens))
    bad_len = len(Bad_Words.intersection(tokens))
    if good_len > 0 and bad_len == 0:
        return Good_News
    elif bad_len > 0 and good_len == 0:
        return Bad_News
    elif good_len > 0 and bad_len > 0:
        return Mix_Of_Good_Bad
    else:
        return Not_Sure

print(good_bad_news(my_text))
print("good_len : " + str(good_len))
print("bad_len : " + str(bad_len))
The output is:
Good News
good_len : 3
bad_len : 0
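One thing to note about the script: `split(" ")` keeps punctuation attached to words and is case sensitive, so tokens such as "gained." or "Higher" do not match the word lists. A minimal sketch of a more forgiving tokenizer, using the standard `re` module, could look like this. The names `tokenize` and `count_matches`, the short word list and the sample sentence are my own for illustration, and the regular expression is deliberately crude (it splits words like "don't" in two).

```python
import re

def tokenize(text):
    """Lowercase the text and return its alphabetic words as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))

def count_matches(word_set, text):
    """Count how many distinct words from word_set appear in the text."""
    return len(word_set & tokenize(text))

# Hypothetical word list and sentence, only for this illustration.
good_words = {"higher", "gain", "gained", "rise", "rose"}
sample = "Stocks closed Higher. The index rose and tech shares gained."

# Plain split only finds "rose"; "Higher." and "gained." are missed.
print(len(good_words.intersection(sample.split(" "))))  # 1

# After lowercasing and stripping punctuation all three words match.
print(count_matches(good_words, sample))  # 3
```

Whether this normalization is wanted depends on the text: for ticker symbols or other case-sensitive tokens, lowercasing everything may hide information.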