Day 20: Natural Language Processing (NLP) using Python.

Hello guys,

Previously, we learned about OpenCV and web scraping using Python.

Today we will learn about Natural Language Processing (NLP) using Python.

What is Natural Language Processing (NLP)?

Natural language processing (NLP) is about developing applications and services that can understand human languages. Some practical examples of NLP are speech recognition (e.g. Google voice search), understanding what a piece of content is about, and sentiment analysis.



  • Advantages of rule-based approaches:
    • Training data is not required.
    • High precision.
    • Can be a good way to collect data: one can start the system with rules and let data accumulate naturally as people use the system.
  • Disadvantages of rule-based approaches:
    • Lower recall.
    • Difficult and tedious to list all the rules.
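As a concrete illustration of this trade-off, here is a single hand-written rule, a regex that extracts dates like "24 March 2020". This is only a sketch of the rule-based idea described above; the rule and function names are my own for illustration. It needs no training data and is precise on what it matches, but covering every possible date format would mean writing many more rules.

```python
import re

# One hand-written rule: match dates of the form "<day> <month name> <year>".
DATE_RULE = re.compile(
    r'\b\d{1,2}\s+(?:January|February|March|April|May|June|'
    r'July|August|September|October|November|December)\s+\d{4}\b'
)

def find_dates(text):
    """Return every date-like phrase the rule matches."""
    return [m.group(0) for m in DATE_RULE.finditer(text)]

print(find_dates("Lockdown began on 24 March 2020 in India."))
# -> ['24 March 2020']
```

High precision, low recall in action: "24/03/2020" or "March 24, 2020" would slip past this rule until someone writes more rules for those formats.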



Applications of NLP
1) Automatic summarization: produce a readable summary of a chunk of text (newspaper articles, for example).
2) Coreference resolution: given a sentence or a larger chunk of text, determine which words refer to the same objects.
3) Discourse analysis: this covers a number of related tasks; one is identifying the discourse structure of connected text (the discourse relationships between sentences).
NLP using Python
The Natural Language Toolkit (NLTK) is the most popular library for natural language processing. It is written in Python and has a big community behind it.
NLTK is also very easy to learn; in fact, it's probably the easiest NLP library you'll ever use.
In this NLP Tutorial, we will use Python NLTK library.

How to install NLTK?

pip install nltk

To check whether NLTK installed correctly, open a Python terminal and type the following:

import nltk

If everything goes fine, that means you've successfully installed the NLTK library. Once NLTK is installed, you should install the NLTK data packages by running the following code:

import nltk
nltk.download()


This will download a huge amount of data and can take quite a while. If you only need specific packages, you can name them instead, e.g. nltk.download('stopwords').


Let's start.
First, we have to get some data.
I have taken this data from this link.

>>> data="""With the addition of 1,553 cases and 36 deaths in 24 hours, India's total number of coronavirus-positive cases reached 17,615 on Monday. Globally, 2,414,098 people have been infected and 165,153 have died so far, according to Worldometer. The central government on Monday said the Covid-19 situation was "especially serious" in Mumbai, Pune, Indore, Jaipur and Kolkata. In what could be abother cause for concern in India's fight against coronavirus, it has come to light that over 80 per cent of the cases in India are asymptomatic, a TV new channel has reported, quoting senior ICMR scientist Dr Raman R Gangakhedkar."""

We have to split this data on whitespace.
Basically, we have to split the sentences into words.

>>> tokens = [t for t in data.split()]

>>> tokens
['With', 'the', 'addition', 'of', '1,553', 'cases', 'and', '36', 'deaths', 'in', '24', 'hours,', "India's", 'total', 'number', 'of', 'coronavirus-positive', 'cases', 'reached', '17,615', 'on', 'Monday.', 'Globally,', '2,414,098', 'people', 'have', 'been', 'infected', 'and', '165,153', 'have', 'died', 'so', 'far,', 'according', 'to', 'Worldometer.', 'The', 'central', 'government', 'on', 'Monday', 'said', 'the', 'Covid-19', 'situation', 'was', '"especially', 'serious"', 'in', 'Mumbai,', 'Pune,', 'Indore,', 'Jaipur', 'and', 'Kolkata.', 'In', 'what', 'could', 'be', 'abother', 'cause', 'for', 'concern', 'in', "India's", 'fight', 'against', 'coronavirus,', 'it', 'has', 'come', 'to', 'light', 'that', 'over', '80', 'per', 'cent', 'of', 'the', 'cases', 'in', 'India', 'are', 'asymptomatic,', 'a', 'TV', 'new', 'channel', 'has', 'reported,', 'quoting', 'senior', 'ICMR', 'scientist', 'Dr', 'Raman', 'R', 'Gangakhedkar.']
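Note that plain split() keeps punctuation attached to words ('hours,', 'Monday.'), so 'Monday' and 'Monday.' would be counted as different words later on. One lightweight fix, sketched here with only the standard library (NLTK's word_tokenize would also handle this, but needs the punkt package), is to strip punctuation from the ends of each token:

```python
import string

tokens = ['With', 'the', 'addition', 'of', '1,553', 'cases', 'in', '24',
          'hours,', 'reached', '17,615', 'on', 'Monday.']

# Strip punctuation only from the ends of each token, so numbers like
# '1,553' keep their internal comma.
cleaned = [t.strip(string.punctuation) for t in tokens]
print(cleaned)
# 'hours,' -> 'hours', 'Monday.' -> 'Monday', '1,553' stays '1,553'
```

str.strip removes only leading and trailing characters, which is exactly what we want here.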

Count word Frequency

NLTK offers a function FreqDist() which will do the job for us. We will also remove stop words ('a', 'at', 'the', 'for', etc.) from our text, as we don't want them to skew our word frequency count.


>>> from nltk.corpus import stopwords
>>> sr= stopwords.words('english')

These are the stop words that are predefined in NLTK:

>>> sr
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

Make a copy of the actual tokens (so we can remove stop words without modifying the original list).
>>> clean_tokens = tokens[:]
>>> clean_tokens
['With', 'the', 'addition', 'of', '1,553', 'cases', 'and', '36', 'deaths', 'in', '24', 'hours,', "India's", 'total', 'number', 'of', 'coronavirus-positive', 'cases', 'reached', '17,615', 'on', 'Monday.', 'Globally,', '2,414,098', 'people', 'have', 'been', 'infected', 'and', '165,153', 'have', 'died', 'so', 'far,', 'according', 'to', 'Worldometer.', 'The', 'central', 'government', 'on', 'Monday', 'said', 'the', 'Covid-19', 'situation', 'was', '"especially', 'serious"', 'in', 'Mumbai,', 'Pune,', 'Indore,', 'Jaipur', 'and', 'Kolkata.', 'In', 'what', 'could', 'be', 'abother', 'cause', 'for', 'concern', 'in', "India's", 'fight', 'against', 'coronavirus,', 'it', 'has', 'come', 'to', 'light', 'that', 'over', '80', 'per', 'cent', 'of', 'the', 'cases', 'in', 'India', 'are', 'asymptomatic,', 'a', 'TV', 'new', 'channel', 'has', 'reported,', 'quoting', 'senior', 'ICMR', 'scientist', 'Dr', 'Raman', 'R', 'Gangakhedkar.']

Here we remove the stop words from the tokens.
>>> for token in tokens:
...     if token in sr:
...         clean_tokens.remove(token)

After removing stop words:
>>> clean_tokens
['With', 'addition', '1,553', 'cases', '36', 'deaths', '24', 'hours,', "India's", 'total', 'number', 'coronavirus-positive', 'cases', 'reached', '17,615', 'Monday.', 'Globally,', '2,414,098', 'people', 'infected', '165,153', 'died', 'far,', 'according', 'Worldometer.', 'The', 'central', 'government', 'Monday', 'said', 'Covid-19', 'situation', '"especially', 'serious"', 'Mumbai,', 'Pune,', 'Indore,', 'Jaipur', 'Kolkata.', 'In', 'could', 'abother', 'cause', 'concern', "India's", 'fight', 'coronavirus,', 'come', 'light', '80', 'per', 'cent', 'cases', 'India', 'asymptomatic,', 'TV', 'new', 'channel', 'reported,', 'quoting', 'senior', 'ICMR', 'scientist', 'Dr', 'Raman', 'R', 'Gangakhedkar.']
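One thing to notice in the output above: 'The' and 'In' survived the filter, because the NLTK stop-word list is all lowercase. Lowercasing each token before the membership test fixes that. Here is a minimal sketch; the tiny stop-word set is just a stand-in for stopwords.words('english') to keep the example self-contained:

```python
# Stand-in stop-word set (lowercase, like NLTK's real list).
stop = {'the', 'in', 'of', 'and', 'with'}

tokens = ['The', 'central', 'government', 'in', 'Mumbai', 'and', 'Pune']

# Lowercase each token only for the membership test, keeping the
# original casing in the result.
clean_tokens = [t for t in tokens if t.lower() not in stop]
print(clean_tokens)
# -> ['central', 'government', 'Mumbai', 'Pune']
```

A list comprehension also sidesteps a subtle bug in remove-while-iterating code: list.remove deletes only the first matching element, which can misbehave when a stop word appears several times.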


Count the frequency of words in the paragraph.
>>> freq = nltk.FreqDist(clean_tokens)
>>> freq
FreqDist({'cases': 3, "India's": 2, 'With': 1, 'addition': 1, '1,553': 1, '36': 1, 'deaths': 1, '24': 1, 'hours,': 1, 'total': 1, ...})
>>> for key, val in freq.items():
...     print(str(key) + ':' + str(val))

With:1
addition:1
1,553:1
cases:3
36:1
deaths:1
24:1
hours,:1
India's:2
total:1
number:1
coronavirus-positive:1
reached:1
17,615:1
Monday.:1
Globally,:1
2,414,098:1
people:1
infected:1
165,153:1
died:1
far,:1
according:1
Worldometer.:1
The:1
central:1
government:1
Monday:1
said:1
Covid-19:1
situation:1
"especially:1
serious":1
Mumbai,:1
Pune,:1
Indore,:1
Jaipur:1
Kolkata.:1
In:1
could:1
abother:1
cause:1
concern:1
fight:1
coronavirus,:1
come:1
light:1
80:1
per:1
cent:1
India:1
asymptomatic,:1
TV:1
new:1
channel:1
reported,:1
quoting:1
senior:1
ICMR:1
scientist:1
Dr:1
Raman:1
R:1
Gangakhedkar.:1
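By the way, FreqDist behaves much like Python's built-in collections.Counter, so if you only need the counts (or the top-N words) you can sketch the same idea with the standard library alone:

```python
from collections import Counter

# Stand-in for nltk.FreqDist: count cleaned tokens with the stdlib.
clean_tokens = ['cases', "India's", 'With', 'cases', "India's", 'cases']
freq = Counter(clean_tokens)

print(freq.most_common(2))  # the two most frequent words with their counts
# -> [('cases', 3), ("India's", 2)]
```

FreqDist actually subclasses Counter in recent NLTK versions, so methods like most_common work on it too.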


You can apply the same steps to YouTube comments or to Twitter data for sentiment analysis.

❤❤Quarantine python group link ❤❤

8805271377 WhatsApp

Follow here ❤

@mr._mephisto_ Instagram 

There will be no restrictions; just feel free to learn.

Share it, and take one more step toward spreading knowledge to others.

Believe in yourself 🤟 you are awesome. 

Be safe, Be happy😁
Take care of yourself and your family 
Of course watch movies and series🤟😉 

And follow the government's guidelines.

