The tags will be used for filtering later on. Any word from the lemmatized text which isn't a noun, adjective, or gerund (or a 'foreign word') is here considered a stopword (non-content). This is based on the assumption that keywords are usually nouns, adjectives, or gerunds. Punctuation marks are added to the stopword list too: stopwords = stopwords + punctuations.

Complete stopword generation

Even if we remove the aforementioned stopwords, some extremely common nouns, adjectives, or gerunds may remain which are very bad candidates for being keywords (or part of one). An external file containing a long list of stopwords is loaded, and all of its words are added to the previous stopwords to create the final list 'stopwords-plus', which is then converted into a set. Stopwords-plus constitutes the sum total of all stopwords and potential phrase delimiters. (The contents of this set will later be used to partition the lemmatized text into n-gram phrases. But for now, I will simply remove the stopwords and work with a 'bag-of-words' approach; I will be developing the graph using unigram texts as vertices.)

TextRank is a graph-based model, and thus it requires us to build a graph. Each word in the vocabulary will serve as a vertex of the graph. The words will be represented in the vertices by their index in the vocabulary list. I am building a graph with weighted undirected edges. The weighted_edge matrix contains the information about edge connections among all vertices: weighted_edge contains the weight of the connecting edge between the word vertex represented by vocabulary index i and the word vertex represented by vocabulary index j. If weighted_edge is zero, it means no edge or connection is present between the words represented by indices i and j. There is a connection between the words (and thus between the i and j which represent them) if the words co-occur within a window of a specified 'window_size' in the processed_text.
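The graph construction described above can be sketched as follows. This is a minimal illustration, not the original notebook's code: the toy 'processed_text', the choice of 1.0 as the increment per co-occurring pair, and the 'covered_cooccurrences' bookkeeping (so each positional pair is counted once) are assumptions made for the example.

```python
import numpy as np

# Toy stand-in for the stopword-filtered, lemmatized token sequence.
processed_text = ["textrank", "graph", "model", "graph", "keyword"]
vocabulary = sorted(set(processed_text))
vocab_len = len(vocabulary)
window_size = 3

# weighted_edge[i][j] holds the weight of the edge between the word at
# vocabulary index i and the word at vocabulary index j; zero means no edge.
weighted_edge = np.zeros((vocab_len, vocab_len), dtype=np.float32)

covered_cooccurrences = set()
for start in range(len(processed_text) - window_size + 1):
    window = processed_text[start:start + window_size]
    for a in range(window_size):
        for b in range(a + 1, window_size):
            i = vocabulary.index(window[a])
            j = vocabulary.index(window[b])
            if i == j:
                continue  # a word co-occurring with itself adds no edge
            # Count each absolute position pair only once, even though
            # consecutive windows overlap.
            pair = (start + a, start + b)
            if pair not in covered_cooccurrences:
                weighted_edge[i][j] += 1.0
                weighted_edge[j][i] += 1.0  # undirected: keep the matrix symmetric
                covered_cooccurrences.add(pair)
```

With window_size = 3, 'textrank' (position 0) and 'keyword' (position 4) never share a window, so their entry in weighted_edge stays zero, while 'textrank' and 'graph' are connected.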
NLTK is again used for POS tagging the input text so that the words can be lemmatized based on their POS tags. The tokenized text (mainly the nouns and adjectives) is normalized by lemmatization. In lemmatization, different grammatical forms of a word are replaced by a single basic lemma. For example, 'glasses' may be replaced by 'glass'.

Text tokens after lemmatization of adjectives and nouns: