Why would you want to do that?
Well, if you are web scraping with Python (using Scrapy, for instance), you may need to extract reviews or comments that are loaded by JavaScript. That means you cannot use your CSS or XPath selectors the way you can with regular HTML.
Parse
Instead, check in your browser whether you can parse the page source directly: press Ctrl + F, search for “json”, and track down some JSON that will load as a Python dictionary. You ‘just’ need to isolate it.
The response is not nice to look at, but you can gradually shrink it down in the Scrapy shell or a Python shell…
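As a minimal sketch of that shrinking-down step, you can split on a marker and print the start of each chunk to see which index holds your data. The page string below is made up to stand in for response.text, and the "doctor" key is only an assumption about what you are looking for; your page will look different.

```python
# Hypothetical page source standing in for response.text; a real page will differ.
page = (
    '<script>var a = JSON.parse("{}");</script>'
    '<script>var b = JSON.parse("[1, 2]");</script>'
    '<script>var c = JSON.parse("{\\"doctor\\": {\\"sample_rating_comment\\": \\"Great\\"}}");</script>'
)

# Splitting on the marker gives one chunk per JSON.parse call, plus whatever came before the first one.
chunks = page.split("JSON.parse")

# Print the index and the first few characters of each chunk to spot the one you want.
for i, chunk in enumerate(chunks):
    print(i, chunk[:40])

# Or search for a key you expect to find ("doctor" is an assumed key name here).
target_index = next(i for i, chunk in enumerate(chunks) if "doctor" in chunk)
print("Use index:", target_index)
```

Once you know the index, you can carry on trimming just that one chunk.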
Split, strip, replace
From within Scrapy, or your own Python code, you can split, strip, and replace with the built-in Python string methods until you are left with just a string that json.loads can turn into a dictionary.
x = response.text.split('JSON.parse')[3].replace("\u0022","\"").replace("\u2019m","'").lstrip("(").split(" ")[0].strip().replace("\"","",1).replace("\");","")
Master replace, strip, and split and you won’t need regular expressions!
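If the one-liner is hard to follow, here is the same idea broken into single steps on a made-up string. This is only a sketch: the sample text, the "Very kind" comment, and the index used are assumptions, and it assumes the page source contains the literal characters \u0022 where the quotes should be.

```python
import json

# Hypothetical snippet of page source; the raw string keeps \u0022 as six literal characters.
raw = r'var data = JSON.parse("{\u0022doctor\u0022: {\u0022sample_rating_comment\u0022: \u0022Very kind\u0022}}");'

chunk = raw.split("JSON.parse")[1]      # keep everything after the marker
chunk = chunk.lstrip("(")               # drop the opening bracket of the call
chunk = chunk.replace('");', "")        # drop the closing quote, bracket and semicolon
chunk = chunk.replace('"', "", 1)       # drop the one remaining outer quote at the start
chunk = chunk.replace("\\u0022", '"')   # turn the literal \u0022 sequences into real quotes

data = json.loads(chunk)
print(data["doctor"]["sample_rating_comment"])
```

Doing it one step at a time in the shell lets you print chunk after each line and see exactly what is left.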
With the response text now trimmed down to a JSON-friendly string, you can do this:
import json
q = json.loads(x)
comment = q['doctor']['sample_rating_comment']
# Strings are immutable, so assign the result of replace back to comment
comment = comment.replace("\u2019", "'")
print(comment)
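To try this without a live page, here is a small self-contained sketch. The JSON text and the comment wording are invented for illustration. One thing worth knowing: json.loads decodes \uXXXX escapes inside string values itself, so the only thing left to replace afterwards is the curly apostrophe character.

```python
import json

# Hypothetical JSON text; the raw string keeps \u2019 escaped, as it might be in page source.
x = r'{"doctor": {"sample_rating_comment": "I\u2019m very happy"}}'

q = json.loads(x)                                # json.loads decodes \u2019 into a curly apostrophe
comment = q["doctor"]["sample_rating_comment"]
comment = comment.replace("\u2019", "'")         # swap the curly apostrophe for a plain one
print(comment)
```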
The key things to remember when parsing the response text are to use the index to pick out the section you want, and to use the backslash (“\”) to escape characters when you are working with quotes and with actual backslashes in the text you’re parsing.
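A quick sketch of why the backslashes matter, using a made-up string. Whether your page source holds a real quote character or the literal text \u0022 depends on the site, so treat this as something to check in the shell rather than a rule.

```python
one_char = "\u0022"     # Python decodes the escape: a single double-quote character
six_chars = "\\u0022"   # escaped backslash: the six literal characters \ u 0 0 2 2
raw_form = r"\u0022"    # raw string: the same six literal characters

print(len(one_char), len(six_chars), six_chars == raw_form)

# Hypothetical text containing the literal escape sequences.
text = r'{\u0022name\u0022: 1}'

unchanged = text.replace("\u0022", '"')   # looks for a real quote, finds none, changes nothing
fixed = text.replace("\\u0022", '"')      # looks for the literal sequence and replaces it

print(unchanged)
print(fixed)
```

If a replace seems to do nothing, checking len() on the thing you are searching for is a fast way to see what Python thinks it is.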
Conclusion
Rendering to HTML with Splash or Selenium, or resorting to regular expressions, is not always essential. Hope this helps illustrate how you can extract values FROM a Python dictionary FROM JSON FROM JavaScript!
You may see a mass of text on your screen to begin with, but persevere and you can arrive at the dictionary contained within…