
Custom Python Workshop - TripAdvisor Web Scraping (BeautifulSoup)

42 ratings | 10284 views
Code: https://github.com/kiengiv/TripAdvisorPython
Want to learn Python? Start here: https://www.youtube.com/playlist?list=PLjrXzkmqZGHJo1dp9GTXTsCwDKPnjxG0q
A walkthrough of a script designed to scrape TripAdvisor.
Text Comments (42)
lin lin (2 years ago)
Thanks for the video. I ran into a serious problem. Running the program gives the error below:
    line 31, in <module>
      counter = image.split("helpful vote", 1)[0].split("|", 1)[1][-4:].replace("|", "").strip()
    IndexError: list index out of range
How can I fix it? I can't figure it out. Can someone help me out? Thanks in advance.
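A note on the IndexError above: the chained `.split("|", 1)[1]` fails whenever the text before "helpful vote" contains no "|", because the split then returns a one-element list. A minimal sketch of a guarded version follows. The function name `helpful_votes` and the sample strings are hypothetical, and returning a default when "helpful vote" is missing is an assumption about the desired behavior, not what the original script does.

```python
def helpful_votes(text, default=""):
    """Return the characters just before 'helpful vote', or default if the layout differs."""
    # partition never raises; marker is "" when the phrase is absent
    before, marker, _ = text.partition("helpful vote")
    if not marker:
        return default
    parts = before.split("|", 1)
    if len(parts) < 2:
        # no "|" separator on this review, so there is nothing to index
        return default
    # same slicing and clean-up as the original one-liner
    return parts[1][-4:].replace("|", "").strip()


print(helpful_votes("5 reviews | 12 helpful votes"))
print(repr(helpful_votes("no separator here helpful vote")))
```

With this shape a review that lacks the expected layout yields an empty cell instead of stopping the whole run.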
Kaiyang Huang (2 years ago)
Thanks for this. Very helpful. Is there a way to get the whole written review (the one you get after clicking "More") and not just a partial review? Thanks!
Daniel Diaz (2 years ago)
Hi Saf, I get this error:
    Traceback (most recent call last):
      File "C:\Users\Diaz\Desktop\password.txt", line 106, in <module>
        Rating = altarray[x][:1]
    IndexError: string index out of range
Is this still working or is it my fault?
naive boy (2 years ago)
Hey all, I would highly appreciate it if anyone could help, thanks in advance :) Can anyone please send working code to my email ([email protected])? I just want to extract the reviews. I'm using Python 3.5. I get the following error when this line of code runs:
    Reviewer = Reviewer.replace(',', ' ').replace('”', '').replace('“', '').replace('"', '').strip()
    SyntaxError: Non-UTF-8 code starting with '\x94' in file D:/python/trip1.py on line 105, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
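A note on the SyntaxError above: Python 3 reads source files as UTF-8 by default, and byte `\x94` is how a curly closing quote is stored in the Windows-1252 encoding, so the script file was most likely saved in that encoding. Re-saving the file as UTF-8 is one fix. Another is to keep the source pure ASCII by writing the curly quotes as escapes, sketched below. The function name `clean_reviewer` and the sample input are hypothetical.

```python
def clean_reviewer(name):
    """Strip commas and straight or curly double quotes from a reviewer name."""
    return (
        name.replace(",", " ")
        .replace("\u201d", "")  # right curly quote, written as an escape
        .replace("\u201c", "")  # left curly quote, written as an escape
        .replace('"', "")
        .strip()
    )


print(clean_reviewer(" \u201cJohn, D\u201d "))
```

Because the escapes are plain ASCII, the file parses the same way regardless of which encoding the editor saves it in.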
pycoh salva (2 years ago)
Nice video, very informative. Do you have a video tutorial on getting the address, phone, website and email from TripAdvisor?
KyungBae Park (2 years ago)
DEAR SAF: First, thank you for your video and the entire code for scraping TripAdvisor! This is just amazing to me since I have been struggling for quite a while. After fixing some minor issues and experimenting with one hotel, your scraper now successfully scrapes all of the reviews, and it has just finished scraping over 7,000 reviews for one hotel! I do appreciate it.
Now I am double-checking the csv file, specifically the actual review content. I am referring to this line from your code:
    Review = soup.findAll(attrs={"class": "entry"})[x].text.replace(',', ' ').replace('\n', ' ').strip()
The scraped review content includes "More", which is the button TripAdvisor uses to shorten long reviews. So my question is: how do I scrape the entire review hidden behind this "More" button? I played around trying to find the proper tag but failed to find a way. If there is any chance, could you kindly explain how to figure this out? Again, I appreciate your help and time. Thank you.
Jose Alatrista (2 years ago)
Hi - would you be able to share how you managed to tweak the code to scrape the entire reviews? I am only able to scrape the brief review at the moment. Thanks!
Felipe Navarrete (3 years ago)
Hi SAF, thanks for the code, everything is working well, but I can only obtain reviews in English and I need them in every language. If you can help me I would really appreciate it. Thanks in advance!
SAF Business Analytics (3 years ago)
Send me an email and we can figure this out
Maria Leonardi (3 years ago)
I have the same problem. Can you help me?
SAF Business Analytics (3 years ago)
+Felipe Navarrete Send me an email (see the about section of this channel) and let's figure this out. Might be the unicode vs non-unicode
Nguyen Duc Nhat Y (3 years ago)
Hi SAF, I have one question, please help. Each TripAdvisor review has some optional ratings (click "More" to view them): Location, Cleanliness, Service, Room, Value and so on, each rated 1-5 stars. Can I crawl these with BeautifulSoup? If so, please suggest how to do it starting from your code. Thanks very much.
John Posada (2 years ago)
Hi @Nguyen Duc Nhat Y were you able to extract all of the desired data?
John Posada (2 years ago)
Hi SAF, I have another problem. I did exactly what you said, but after scraping the data for the first few pages it seems that TripAdvisor replaces the HTML for each individual review with the following snippet:
    <div id="review_1234567" class="  reviewSelector "> </div>
And it corresponds to this snippet:
    <script type='text/javascript'>injShowReviewBlock(1234567, 'rblock', 5, false);</script>
I get the feeling that they switched the HTML for a JavaScript call. Is there a way around this? Right now I can't scrape the reviews, and I don't have any new HTML that contains the desired data to extract.
SAF Business Analytics (2 years ago)
That's correct. I would have the scraper pull the data for each rating, then sort it into the appropriate column based on some form of logic (e.g. does that row say "Cleanliness"?), and repeat. The ratings that a review doesn't use can then be left blank.
John Posada (2 years ago)
Thank you! I have one last question. In your response above you mentioned that a rating for Cleanliness, Service, Room or Value isn't always available, so we would need to apply try and except to retrieve it. Would I then have to write separate code, similar to the overall rating code, for each individual rating, so that when the data is exported to an xls file each rating name has its own column after the original review, consistent across reviews whenever a rating is available?
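A note on the column question above: one way to get one consistent column per rating without writing separate code for each is to map whatever label/score pairs a review has onto a fixed list of category names and leave the rest blank. The sketch below assumes the pairs have already been pulled out of the page. The function name `rating_row`, the sample data, and the use of a dictionary lookup in place of try/except are all illustrative choices, not the approach in the original script.

```python
import csv
import io

# Category names taken from the thread; the order fixes the column order.
CATEGORIES = ["Location", "Cleanliness", "Service", "Room", "Value"]


def rating_row(pairs):
    """Map (label, score) pairs onto the fixed columns, blank when a rating is missing."""
    found = {}
    for label, score in pairs:
        for category in CATEGORIES:
            # substring match, so a label such as "Rooms" still lands under "Room"
            if category.lower() in label.lower():
                found[category] = score
                break
    return [found.get(category, "") for category in CATEGORIES]


# Hypothetical reviews: the second one only has two of the five ratings.
reviews = [
    ("Great stay", [("Location", 5), ("Cleanliness", 4), ("Service", 5), ("Rooms", 4), ("Value", 3)]),
    ("Too noisy", [("Cleanliness", 2), ("Value", 1)]),
]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Review"] + CATEGORIES)
for text, pairs in reviews:
    writer.writerow([text] + rating_row(pairs))
print(buffer.getvalue())
```

Every row ends up with the same number of cells, so the columns line up when the file is opened in a spreadsheet.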
SAF Business Analytics (2 years ago)
Use this link - https://www.tripadvisor.ca/ShowUserReviews-g60922-d225341-r418438894-Grand_America_Hotel-Salt_Lake_City_Utah.html#CHECK_RATES_CONT
Nguyen Duc Nhat Y (3 years ago)
Great, thank you very much, it works well (y)
SAF Business Analytics (3 years ago)
+Nguyen Duc Nhat Y That's great to hear! If you are interested in learning more Python, check out my Python playlist: https://www.youtube.com/playlist?list=PLjrXzkmqZGHJo1dp9GTXTsCwDKPnjxG0q I am going to be posting a more in-depth series on web scraping soon, so stay tuned.
יניב גבאי (3 years ago)
Hi, do you already have a working script? The one from GitHub is not working :(
SAF Business Analytics (3 years ago)
+‫יניב גבאי‬‎ I am glad we were able to get it fixed! Just remember to pick the url that ends with "#REVIEWS". Best of luck!
SAF Business Analytics (3 years ago)
+‫יניב גבאי‬‎ It does work. I just tested it out. A few things - what version of Python are you using and what are you trying to web scrape?
Shanthi Marie Alexis (3 years ago)
Hey, I appreciate this video. I really learnt a bit from it. I'm trying to capture only the counts of the visitor ratings, but the nested elements are posing a challenge. Would you by any chance be able to offer some pointers? I can capture soup.findAll('div', attrs={'class':'colTitle'}), which returns "Traveller rating", but I am unsuccessful at getting the nested ratings with that class. Any ideas?
SAF Business Analytics (3 years ago)
+Shanthi Marie Alexis I am glad we were able to resolve the problem. Best of luck with your python adventures :)
SAF Business Analytics (3 years ago)
+Shanthi Marie Alexis do you want to email me a screenshot what section you want to capture? [email protected]
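A note on the nested-elements question above: when the element you can find is only a title, a common pattern is to step up to its parent container and then search inside that container. The sketch below shows the pattern on a made-up fragment. Only the `colTitle` class comes from the thread; the rest of the markup, including the `label` and `count` classes, is hypothetical and will need to be matched to the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the "colTitle" class is taken from the thread.
html = """
<div class="col">
  <div class="colTitle">Traveller rating</div>
  <ul>
    <li><span class="label">Excellent</span><span class="count">120</span></li>
    <li><span class="label">Poor</span><span class="count">3</span></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
counts = {}
for title in soup.find_all("div", attrs={"class": "colTitle"}):
    container = title.parent  # step up from the title to the block that holds the rows
    for row in container.find_all("li"):
        label = row.find("span", attrs={"class": "label"})
        count = row.find("span", attrs={"class": "count"})
        if label is not None and count is not None:
            counts[label.get_text(strip=True)] = int(count.get_text(strip=True))

print(counts)
```

Searching from `container` instead of from `soup` keeps the lookup scoped to the rating block, so similarly named elements elsewhere on the page are ignored.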
Besim Ismaili (3 years ago)
Why do I get this in version 3.4.3?
    File "C:\Python34\lib\site-packages\bs4\__init__.py", line 175
      except Exception, e:
Besim Ismaili (3 years ago)
+SAF Business Analytics Send me your email, It does not fit here! :)
SAF Business Analytics (3 years ago)
+Besim Ismaili I updated the script on github, can you try it now? Thanks
SAF Business Analytics (3 years ago)
+Besim Ismaili can you paste your script here or send it to me via message?
Besim Ismaili (3 years ago)
+SAF Business Analytics Thanks for your help, but it still does not work. I get these errors:
    Traceback (most recent call last):
      File "C:/Python34/newtest.py", line 27, in <module>
        soup=BeautifulSoup(thepage)
      File "C:\Python34\lib\site-packages\bs4\__init__.py", line 172, in __init__
        self._feed()
      File "C:\Python34\lib\site-packages\bs4\__init__.py", line 185, in _feed
        self.builder.feed(self.markup)
      File "C:\Python34\lib\site-packages\bs4\builder\_htmlparser.py", line 146, in feed
        parser.feed(markup)
      File "C:\Python34\lib\html\parser.py", line 165, in feed
        self.goahead(0)
      File "C:\Python34\lib\html\parser.py", line 222, in goahead
        k = self.parse_starttag(i)
      File "C:\Python34\lib\html\parser.py", line 411, in parse_starttag
        self.handle_startendtag(tag, attrs)
      File "C:\Python34\lib\html\parser.py", line 506, in handle_startendtag
        self.handle_starttag(tag, attrs)
      File "C:\Python34\lib\site-packages\bs4\builder\_htmlparser.py", line 48, in handle_starttag
        self.soup.handle_starttag(name, None, None, dict(attrs))
      File "C:\Python34\lib\site-packages\bs4\__init__.py", line 298, in handle_starttag
        self.currentTag, self.previous_element)
      File "C:\Python34\lib\site-packages\bs4\element.py", line 749, in __init__
        self.name, attrs)
      File "C:\Python34\lib\site-packages\bs4\builder\__init__.py", line 162, in _replace_cdata_list_attribute_values
        if isinstance(value, basestring):
    NameError: name 'basestring' is not defined
SAF Business Analytics (3 years ago)
+Besim Ismaili Potentially - try uninstalling 2.7 and just have 3.4.3
Besim Ismaili (3 years ago)
Which version of Python is this?
SAF Business Analytics (2 years ago)
Python 3.4.3
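A note on the two bs4 errors in this thread: `except Exception, e:` and `basestring` are both Python 2 constructs, so the tracebacks suggest that the copy of bs4 under `C:\Python34\lib\site-packages` contains Python 2 source that was never converted for Python 3. Reinstalling BeautifulSoup with the Python 3 interpreter (for example `pip install beautifulsoup4`) is the likely remedy. The sketch below only illustrates the two language differences involved; the function name `parse_int` and the variable `string_types` are hypothetical.

```python
import sys


def parse_int(text):
    """Convert text to int, returning None on failure."""
    try:
        return int(text)
    except Exception as e:  # Python 3 form; `except Exception, e:` is a SyntaxError here
        print("could not parse %r: %s" % (text, e))
        return None


# Python 2 has `basestring`; Python 3 only has `str`, hence the NameError above.
try:
    string_types = basestring  # exists on Python 2 only
except NameError:
    string_types = str  # fallback on Python 3

print(sys.version_info[:2], isinstance("abc", string_types))
print(parse_int("42"), parse_int("n/a"))
```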
