{"id":299,"date":"2016-07-26T00:18:11","date_gmt":"2016-07-26T09:18:11","guid":{"rendered":"https:\/\/posuer000.wordpress.com\/?p=299"},"modified":"2016-09-10T13:00:28","modified_gmt":"2016-09-10T04:00:28","slug":"lemmatization-of-english-words-in-sentences-in-xml-format-by-python","status":"publish","type":"post","link":"https:\/\/wanggengyu.com\/?p=299","title":{"rendered":"Lemmatization of English words in sentences in XML format by Python"},"content":{"rendered":"<p>This post shows how to lemmatize the English words in sentences stored in an XML file, using Python.<\/p>\n<p>Python 2.7, NLTK 3.0<br \/>\nThe input XML file looks like this:<\/p>\n<p><code><br \/>\n&lt;?xml version=\"1.0\" encoding=\"UTF-8\"?&gt;<br \/>\n&lt;sentences version=\"1.0\"&gt;<br \/>\n&lt;item id=\"1\" asks-for=\"cause\" most-plausible-alternative=\"1\"&gt;<br \/>\n&lt;p&gt;my body cast a shadow over the grass . &lt;\/p&gt;<br \/>\n&lt;a1&gt;the sun be rise . &lt;\/a1&gt;<br \/>\n&lt;a2&gt;the grass be cut . &lt;\/a2&gt;<br \/>\n&lt;\/item&gt;<\/p>\n<p>&lt;item id=\"2\" asks-for=\"cause\" most-plausible-alternative=\"1\"&gt;<br \/>\n&lt;p&gt;the woman tolerate the woman friend 's difficult behavior . &lt;\/p&gt;<br \/>\n&lt;a1&gt;the woman know the woman friend be go through a hard time . &lt;\/a1&gt;<br \/>\n&lt;a2&gt;the woman felt that the woman friend take advantage of her kindness . 
&lt;\/a2&gt;<br \/>\n&lt;\/item&gt;<br \/>\n...<\/p>\n<p>&lt;\/sentences&gt;<br \/>\n<\/code><br \/>\n<!--more Read More--><\/p>\n<h3>Python Code<\/h3>\n<p><code><br \/>\n# This setting is only needed if you get an error about 'encoding utf-8' (Python 2 only)<br \/>\nimport sys<br \/>\nreload(sys)<br \/>\nsys.setdefaultencoding(&quot;utf-8&quot;)<\/p>\n<p>import xml.etree.cElementTree as ET  # library for XML processing<\/p>\n<p># the tokenizer and lemmatizer need the NLTK 'punkt' and 'wordnet' data (see Installing NLTK Data)<br \/>\nfrom nltk.tokenize import word_tokenize  # word tokenizer<\/p>\n<p>from nltk.stem import WordNetLemmatizer  # WordNet-based lemmatizer<br \/>\nwordnet_lemmatizer = WordNetLemmatizer()<\/p>\n<p>tree = ET.parse('input.xml')  # parse the XML tree from input.xml<br \/>\nroot = tree.getroot()  # get the root element of the tree<\/p>\n<p>for item_of_root in root:  # for each item<br \/>\n    for sentence in item_of_root:  # for each sentence in the item<br \/>\n        words = word_tokenize(sentence.text)  # split the sentence into words<br \/>\n        sentenceNew = &quot;&quot;  # container for the new lemmatized sentence<br \/>\n        for word in words:  # for each word in the sentence<br \/>\n            lamWord = wordnet_lemmatizer.lemmatize(word, pos='v')  # lemmatize the word as a verb<br \/>\n            sentenceNew += lamWord + ' '  # append the lemmatized word to the container<br \/>\n        sentence.text = sentenceNew  # store the new sentence in the tree<\/p>\n<p>tree.write('output.xml')  # write the lemmatized tree to a file<br \/>\n<\/code><br \/>\n&nbsp;<\/p>\n<h3>Reference<\/h3>\n<p><a href=\"https:\/\/docs.python.org\/2\/library\/xml.etree.elementtree.html\" target=\"_blank\">The ElementTree XML API &#8211; Python 2.7.12 Documentation<\/a><\/p>\n<p><a href=\"http:\/\/www.nltk.org\/data.html\" target=\"_blank\">Installing NLTK Data<\/a><\/p>\n<p><a href=\"http:\/\/textminingonline.com\/dive-into-nltk-part-i-getting-started-with-nltk\" target=\"_blank\">Dive Into NLTK, Part I: Getting Started with NLTK<\/a><\/p>\n<p><a href=\"http:\/\/textminingonline.com\/dive-into-nltk-part-ii-sentence-tokenize-and-word-tokenize\" 
target=\"_blank\">Dive Into NLTK, Part II: Sentence Tokenize and Word Tokenize<\/a><\/p>\n<p><a href=\"http:\/\/textminingonline.com\/dive-into-nltk-part-iv-stemming-and-lemmatization\" target=\"_blank\">Dive Into NLTK, Part IV: Stemming and Lemmatization<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Lemmatization of English words in sentences in XML format by Python Python 2.7, NLTK 3.0 The input XML file looks like this: &lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt; &lt;sentences version=&quot;1.0&quot;&gt; &lt;item id=&quot;1&quot; asks-for=&quot;cause&quot; most-plausible-alternative=&quot;1&quot;&gt; &lt;p&gt;my body cast a shadow over the grass . &lt;\/p&gt; &lt;a1&gt;the sun be rise . &lt;\/a1&gt; &lt;a2&gt;the grass be&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[5],"tags":[],"class_list":["post-299","post","type-post","status-publish","format-standard","hentry","category-techniques"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/wanggengyu.com\/index.php?rest_route=\/wp\/v2\/posts\/299","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/wanggengyu.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wanggengyu.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/wanggengyu.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/wanggengyu.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=299"}],"version-history":[{"count":0,"href":"https:\/\/wanggengyu.com\/index.php?rest_route=\/wp\/v2\/posts\/299\/revisions"}],"wp:attachment":[{"href":"https:\/\/wanggengyu.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=299"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wanggengyu.com\/in
dex.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=299"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wanggengyu.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=299"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}