Text Analysis using Python

Hi, in this post we’ll talk about basic text analysis using Python. This is something that I learned from ‘Applied Text Analysis with Python’ by Benjamin Bengfort, a good book for getting started with text analysis.

We will talk about 3 features of language:

  • Language Feature
  • Contextual Feature
  • Structural Feature

Language Feature

This type of analysis is based on language features, which means that the words themselves determine whether a text counts as positive or negative. It does not necessarily reflect the context or sentiment of the sentence. Hence the language (word) feature is only one part of text analysis.
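To make that limitation concrete, here is a minimal sketch. The word set, the helper count_positive and the two sentences are made up for illustration and are not part of the script in this post. It shows that a word-level check gives a negated sentence the same score as a plain one.

```python
# Hypothetical word set, for illustration only.
positive_words = {"higher", "gain", "rose"}

def count_positive(sentence):
    # Count distinct positive words that appear as space-separated tokens.
    return len(positive_words.intersection(sentence.split(" ")))

plain = "stocks moved higher today"
negated = "stocks did not move higher today"

# Both sentences contain the token "higher", so a word-level
# check scores them the same even though their meanings differ.
print(count_positive(plain))
print(count_positive(negated))
```

Both calls return the same count, which is why word matching alone cannot tell us what a sentence really says.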

Basically, we have 2 word lists (Good_Words and Bad_Words), stored as Python sets, that each contain some word tokens.

The sample text (my_text) that we will analyze is part of a financial article from Yahoo Finance. The Python script tells us whether the news is more positive or more negative. The technique is to use the set intersection function to find the words that my_text shares with each of the 2 lists and return the number of matches. Below is the code:

Good_News = 'Good News'
Bad_News = 'Bad News'
Not_Sure = 'Not Sure'
Mix_Of_Good_Bad = 'Mixed'

# Words that suggest good news (a set ignores duplicates, so each word is listed once)
Good_Words = set(['positive', 'exciting', 'upbeat', 'bullish', 'higher', 'gain', 'gained', 'rise', 'rose'])

# Words that suggest bad news
Bad_Words = set(['negative', 'dull', 'bearish', 'slump', 'lower', 'retrench', 'layoff', 'layoffs', 'worry', 'worries'])

# Adjacent string literals are joined as-is, so each piece ends with a space
my_text = ("U.S. stocks drifted higher on Friday to log gains across the board, capping the final full trading week of 2022. "
"When the closing bell rang on Wall Street, the S&P 500 (^GSPC) rose 0.6%, the Dow Jones Industrial Average (^DJI) "
"rose 0.5, while the technology-heavy Nasdaq Composite (^IXIC) gained 0.2%.")

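def good_bad_news(words):
    # The counts are stored as globals so they can be printed after the call
    global good_len
    global bad_len
    # Split the text on spaces; punctuation stays attached to the tokens
    tokens = words.split(" ")
    # Count the distinct words that the text shares with each list
    good_len = len(Good_Words.intersection(tokens))
    bad_len = len(Bad_Words.intersection(tokens))
    if good_len > 0 and bad_len == 0:
        return Good_News
    elif bad_len > 0 and good_len == 0:
        return Bad_News
    elif good_len > 0 and bad_len > 0:
        return Mix_Of_Good_Bad
    else:
        # No word from either list was found
        return Not_Sure

print(good_bad_news(my_text))
print("good_len : "  + str(good_len))
print("bad_len : " + str(bad_len))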

The output is:

Good News
good_len : 3
bad_len : 0
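
One thing to keep in mind is that splitting on spaces leaves punctuation and capital letters attached to the tokens, so a word like "gained," does not match 'gained'. Below is a minimal sketch of one possible way around this, using a regular expression to tokenize. The helper tokenize, the small good_words set and the sample sentence are assumptions for illustration only and are not from the book or the script above.

```python
import re

# Hypothetical subset of the good-word list, for illustration only.
good_words = {"higher", "gained", "rose"}

def tokenize(text):
    # Lowercase the text and keep only runs of letters and apostrophes,
    # so "gained," and "rose." become "gained" and "rose".
    return re.findall(r"[a-z']+", text.lower())

sample = "Stocks gained, then rose."

# Splitting on spaces keeps the punctuation, so nothing matches.
naive_hits = good_words.intersection(sample.split(" "))
# Regex tokens are clean, so both words are found.
clean_hits = good_words.intersection(tokenize(sample))

print(sorted(naive_hits))
print(sorted(clean_hits))
```

With this sample, the space-based split finds no matches while the regex tokens match both 'gained' and 'rose'. Note that this simple pattern also drops numbers and symbols, which may or may not be what you want for financial text.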