This post is about manipulating the text of a paragraph. The script below puts each sentence on its own line: whenever it meets a word containing a period (.), it starts a new line after that word. It skips abbreviations such as ‘Mr.’ and ‘U.S.A’, as long as you include them in the key_word list.
I learned this approach from Stack Overflow, at this link: https://stackoverflow.com/questions/38831457/regex-splitting-strings-at-full-stops-unless-its-part-of-an-honorific, from the user ‘ospahiu’ (https://stackoverflow.com/users/6320655/ospahiu).
Below is the code:
# Path to the text file that contains the text
text_file_location = r"C:\Users\abcd\Documents\Blog_Materials\news.txt"
# Read the file and turn newlines into spaces, so that the last word of one
# line is not glued to the first word of the next line
with open(text_file_location, 'r') as file:
    text = file.read().replace('\n', ' ')
# Break the text into word tokens on whitespace
text_tokens = text.split()
print(text_tokens)
# List of words that contain a period but should not start a new line
key_word = ["U.S.", "Mr.", "Mrs.", "U.S.A", "U.S"]
new_text_tokens = []
for x in text_tokens:
    # A token that contains a period and is not in key_word ends a sentence
    if "." in x and x not in key_word:
        new_text_tokens.append(x + "\n")
    else:
        new_text_tokens.append(x + " ")
# Finally, join the tokens back together; every token that ends with \n
# makes the next sentence start on a new line.
# rstrip() removes the trailing \n (or space) left after the last token.
final_text = "".join(new_text_tokens).rstrip()
print("=======================================")
print(new_text_tokens)
print("=======================================")
print(final_text)
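To try the idea without a text file, the same logic can be wrapped in a small function. This is only a minimal sketch: the function name break_sentences and the sample sentence are made up for illustration and are not part of the original script. It also shows what happens when the key_word list is left empty.

```python
def break_sentences(text, key_words):
    """Put a newline after every token that contains a period,
    unless the token is listed in key_words (hypothetical helper)."""
    pieces = []
    for token in text.split():
        if "." in token and token not in key_words:
            pieces.append(token + "\n")   # end of a sentence
        else:
            pieces.append(token + " ")    # ordinary word or protected abbreviation
    # Remove the trailing newline or space left by the last token
    return "".join(pieces).rstrip()


# Made-up sample text, not the Yahoo Finance paragraph
sample = "Mr. Smith visited the U.S. last year. He liked it."

result = break_sentences(sample, ["U.S.", "Mr.", "Mrs."])
unprotected = break_sentences(sample, [])

print(result)
print("---")
print(unprotected)
```

With the list filled in, only the real sentence ends start a new line. With an empty list, ‘Mr.’ and ‘U.S.’ are also treated as sentence ends, which is exactly what the key_word list is there to prevent.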
The original text from Yahoo Finance is:
This coming holiday-shortened week will round out a brutal year for Wall Street as 2022 comes to an end. The U.S. stock and bond markets will be closed on Monday, December 26, in observance of Christmas Day. The earnings and economic calendars will be light, with much of the business world off until next year. Traders who are working through the holiday period will get readings on wholesale and retail inventories, weekly jobless claims, and the latest S&P CoreLogic Case-Shiller home price index. When investors return from a long weekend Tuesday, hopes will be high for a Santa Claus Rally – a seasonal rise in the stock market that occurs at the end of December.
But with selling pressures remaining in place over fears about a looming recession, the favorable season pattern may take this year off.
After the manipulation, the text is:
This coming holiday-shortened week will round out a brutal year for Wall Street as 2022 comes to an end.
The U.S. stock and bond markets will be closed on Monday, December 26, in observance of Christmas Day.
The earnings and economic calendars will be light, with much of the business world off until next year.
Traders who are working through the holiday period will get readings on wholesale and retail inventories, weekly jobless claims, and the latest S&P CoreLogic Case-Shiller home price index.
When investors return from a long weekend Tuesday, hopes will be high for a Santa Claus Rally – a seasonal rise in the stock market that occurs at the end of December.
But with selling pressures remaining in place over fears about a looming recession, the favorable season pattern may take this year off.
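Since the Stack Overflow question is about regex splitting, a regex version is also possible. The sketch below is not the code from that answer; it is just one way to do it with Python's re module, and the name split_with_regex is made up for illustration. It builds one negative lookbehind per protected word (Python requires each lookbehind to have a fixed width, which is why they are chained one by one) and splits on the whitespace that follows a period.

```python
import re


def split_with_regex(text, key_words):
    """Split after a period followed by whitespace, unless the period
    belongs to one of key_words (hypothetical helper)."""
    # One fixed-width negative lookbehind per protected word, e.g. (?<!Mr\.)
    guards = "".join(f"(?<!{re.escape(word)})" for word in key_words)
    # Whitespace that comes right after a period and passes all the guards
    pattern = re.compile(guards + r"(?<=\.)\s+")
    return "\n".join(pattern.split(text.strip()))


# Made-up sample text, not the Yahoo Finance paragraph
sample = "Mr. Smith visited the U.S. last year. He liked it."
print(split_with_regex(sample, ["U.S.", "Mr.", "Mrs."]))
```

One difference from the token-based script: this version only splits when a word ends with a period, so a number such as 3.5 in the middle of a sentence does not start a new line. Both versions share the same limitation: if a protected word such as ‘U.S.’ happens to be the last word of a sentence, the next sentence stays on the same line.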