How to pick content from HTML using BeautifulSoup
- From: Sheetal Singh <sheetalsingh@xxxxxxxxxxxxx>
- Date: Tue, 10 Jul 2012 04:02:28 +0000
Hi,
I am a newbie in Python. I need to fetch the names of the side filters from a page and save them in a CSV file [please find attached a screenshot].
The following is a snippet from my code:
soup = BeautifulStoneSoup(html)
# for e in soup.findAll('div'):
#     for c in e.findAll('h3'):
#         for d in c.findAll('li'):
#             print '@@@@@@@', d.extract()
#
# #select_pod = soup.findAll('div', {"class": "win aboutUs"})
# #promeg = select_pod[0].findAll("p")[0]
#
#
# for dv in soup.findAll('div', {"class": "attribution"}):
#     ds = dv.findAll("h3")
#     print ds
select_pod = soup.findAll('div')
print select_pod
for j in select_pod:
    if j is not None:
        # The method is findAll (capital A); tag names go in without angle brackets
        print j.findAll('a')
        # Search inside the current div; the list returned by findAll has no findAll of its own
        promeg = j.findAll('h3')
        #print '--', promeg
        #hreflist = [each.get('value') for each in soup.findAll('h3')]
        for m in promeg:
            if m:
                print 'Data values', m
                # fd1 (a csv writer), x and i come from the rest of the script
                fd1.writerow([x[2], m, i[0], "Data Found"])
Structure of HTML:
<div class="attribution">
  <div>
    <h3>By Brand</h3>
    <ul>
      <li>
        <a href="http://www.xyz.com/cellphones/nokia/nokia/259-33902/buy">Nokia</a>
      </li>
      <li>
      <li>
      <li>
      <li>
      <li>
      <li>
      <li>
      <li class="more">
    </ul>
  </div>
  <div>
    <h3>By Seller</h3>
    <ul>
      <li>
        <a id="att_296935_184059" class="attributeUrlReplacementTarget" href="http://www.xyz.com/cellphones/nokia/amazon-marketplace/296935-184059/buy">Amazon Marketplace</a>
        <input id="att_296935_184059_replacement" type="hidden" value="http://www.xyz.com/cellphones/nokia/amazon-marketplace/296935-184059/buy">
      </li>
      <li>
      <li>
      <li>
      <li>
      <li>
      <li>
      <li>
      <li class="more">
    </ul>
  </div>
  <div>
  <div>
</div>
Output required in CSV:
By Brand
Nokia
Samsung
..
..
By Seller
Amazon
Buy.com
..
..
..
Please suggest how to fetch these details.
Sheetal Singh
Attachment:
filters.png
Description: filters.png
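
For reference, here is a minimal sketch of one way to get from the HTML structure above to the grouped CSV output. It does not use BeautifulSoup: to stay self-contained it swaps in the standard library's HTMLParser, and it is written for Python 3 rather than the Python 2 of the snippet. The names SAMPLE_HTML, FilterParser and filters_to_rows are made up for this illustration, and the sample markup is a trimmed copy of the structure in the post with only the expanded <li> items. It assumes the <div> tags inside the "attribution" block are balanced, which may not hold for the real page.

```python
import csv
import io
from html.parser import HTMLParser

# Trimmed copy of the structure from the post (assumption: balanced <div> tags).
SAMPLE_HTML = """
<div class="attribution">
  <div>
    <h3>By Brand</h3>
    <ul>
      <li><a href="http://www.xyz.com/cellphones/nokia/nokia/259-33902/buy">Nokia</a></li>
    </ul>
  </div>
  <div>
    <h3>By Seller</h3>
    <ul>
      <li>
        <a id="att_296935_184059" class="attributeUrlReplacementTarget" href="http://www.xyz.com/cellphones/nokia/amazon-marketplace/296935-184059/buy">Amazon Marketplace</a>
        <input id="att_296935_184059_replacement" type="hidden" value="http://www.xyz.com/cellphones/nokia/amazon-marketplace/296935-184059/buy">
      </li>
    </ul>
  </div>
</div>
"""


class FilterParser(HTMLParser):
    """Collect (heading, [link texts]) pairs inside <div class="attribution">."""

    def __init__(self):
        super().__init__()
        self.groups = []      # list of (h3 text, [link texts])
        self._div_depth = 0   # > 0 while inside the attribution div
        self._current = None  # "h3" or "a" while its text is being collected
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if self._div_depth or "attribution" in classes:
                self._div_depth += 1
        elif self._div_depth and tag in ("h3", "a"):
            self._current = tag
            self._text = []

    def handle_data(self, data):
        if self._current is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self._div_depth:
            self._div_depth -= 1
            return
        if tag != self._current:
            return
        text = "".join(self._text).strip()
        if tag == "h3":
            # Each heading starts a new group of filter names
            self.groups.append((text, []))
        elif self.groups and text:
            # Link text belongs to the most recent heading
            self.groups[-1][1].append(text)
        self._current = None


def filters_to_rows(html):
    """Return one-column rows: each heading followed by its filter names."""
    parser = FilterParser()
    parser.feed(html)
    parser.close()
    rows = []
    for heading, names in parser.groups:
        rows.append([heading])
        rows.extend([name] for name in names)
    return rows


rows = filters_to_rows(SAMPLE_HTML)

# Written to an in-memory buffer here; a real script would open a file instead.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
print(buffer.getvalue())
```

With BeautifulSoup itself, the same grouping would amount to looping over the h3 tags inside the "attribution" div and reading the text of the a tags in the ul that follows each heading.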