Lecture-2: Tokenization in NLP with Code
What do you understand by Tokenization?
Tokenization is the process of splitting raw text into minimal meaningful units, called tokens. A token may be a word, part of a word, or a single character such as a punctuation mark. In other words, tokenization breaks the raw text into small chunks.
Example: the sentence "Hello, world!" can be split into the tokens ["Hello", ",", "world", "!"].
Now let's come to the practical coding and walk through the procedure step by step with Python.
First of all, we need Google Colaboratory for coding. If you don't know what Google Colaboratory is, here is a short description for you.
Google Colaboratory:
Google Colab is a free Jupyter notebook environment that runs entirely in the cloud. Most importantly, it requires no setup, and the notebooks you create can be edited simultaneously by your team members. Google Colab supports many popular machine learning libraries, which can be easily loaded into your notebook. I suggest Google Colab because it is easier than a local Jupyter notebook: you don't need to set up any environment. Just open your browser, go to colab.research.google.com, and create a new notebook. That's a short description of Google Colab.
Implementation of Tokenization with Python:
Import NLTK:
NLTK stands for Natural Language Toolkit. First of all, you need to import NLTK in Google Colab. Using this toolkit you can implement many natural language processing tasks; you can think of it as an NLP library.
Type the following code into your notebook.
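The original post does not include the code cell itself; a minimal setup sketch, assuming NLTK is preinstalled on Colab (it usually is), could look like this:

```python
import nltk

# Download the models used by NLTK's default word tokenizer.
# Newer NLTK versions need "punkt_tab"; older ones use "punkt",
# so downloading both covers either case.
nltk.download("punkt")
nltk.download("punkt_tab")
```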
Now we come to the main point, our topic: tokenization. We have already discussed what it is and how it works. For tokenization you need to import NLTK's tokenizer functions; after importing them, you can split a text into tokens. Import and use the tokenizers with the following code.
If you want to see the type of the token list and also the number of tokens, you can simply write code like this.
Now we come to another important point: frequency measurement. If you want to check the frequency of every word in the token list, you can. Frequency means the repetition of a word: you can see how many times each word is repeated in the list. To measure token frequencies, you need to import one more class, called FreqDist. With the following code you can measure the frequency of tokens.
One more task is finding the most common words in the list. To do this, you need to write the following code.
That's all for today. In the next blog I will discuss unigrams, bigrams, trigrams, and n-grams in general.
Thanks for your time, and keep supporting for new updates.