[beautifulsoup] 01. Parsing HTML with the beautifulsoup module

SiLaure's Data


data_soin 2021. 7. 29. 20:00

Parsing?
Extracting only the values you want from an HTML document.

- Parsing an HTML string

Parse HTML data that is defined as a string.

(Example)

html = '''
<html>

  <head>
    <title>BeautifulSoup test</title>
  </head>

  <body>
    <div id='upper' class='test' custom='good'>
      <h3 title='Good Content Title'>Contents Title</h3>
      <p>Test contents</p>
    </div>

    <div id='lower' class='test' custom='nice'>
      <p>Test Test Test 1</p>
      <p>Test Test Test 2</p>
      <p>Test Test Test 3</p>
    </div>
  </body>
  
</html>'''

- find() function

: Searches for a specific HTML tag. You can also pass search conditions to narrow down which tag is returned.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid a warning

You can search by tag name.

soup.find('h3')

Output: <h3 title="Good Content Title">Contents Title</h3>

When several tags match, find() returns the first one.

soup.find('p')

Output: <p>Test contents</p>
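
The "first match" behavior has a counterpart worth knowing: when nothing matches, find() returns None rather than raising an error. A minimal sketch, assuming a small made-up document (the `sample` string and variable names are only for illustration):

```python
from bs4 import BeautifulSoup

# A small stand-in document; this markup is invented for the example.
sample = "<div><p>first</p><p>second</p></div>"
soup = BeautifulSoup(sample, "html.parser")

first_p = soup.find("p")      # first <p> in document order
missing = soup.find("table")  # there is no <table>, so this is None

print(first_p)  # <p>first</p>
print(missing)  # None

# Check for None before calling methods on the result.
if missing is None:
    print("no <table> found")
```

Checking for None first avoids an AttributeError when a later call such as get_text() is made on a tag that was never found.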

You can use attributes to pick out a different tag that shares the same name.

soup.find('div', custom='nice')
soup.find('div', id='lower')

Output:
<div class="test" custom="nice" id="lower">
<p>Test Test Test 1</p>
<p>Test Test Test 2</p>
<p>Test Test Test 3</p>
</div>
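
As a rough check that both attribute filters land on the same element, the two results can be compared directly. This is a sketch on a shortened, made-up version of the example markup:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the example document (markup invented here).
sample = (
    "<div id='upper' custom='good'><p>A</p></div>"
    "<div id='lower' custom='nice'><p>B</p></div>"
)
soup = BeautifulSoup(sample, "html.parser")

by_id = soup.find("div", id="lower")         # filter on the id attribute
by_custom = soup.find("div", custom="nice")  # filter on a non-standard attribute

print(by_id is by_custom)  # True
print(by_id["id"])         # lower
```

Any keyword argument that find() does not recognize as one of its own parameters is treated as an attribute filter, which is why `custom=` works here.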

Because class is a reserved keyword in Python, append an underscore (_) and write class_ instead.

soup.find('div', class_='test')

Output:
<div class="test" custom="good" id="upper">
<h3 title="Good Content Title">Contents Title</h3>
<p>Test contents</p>
</div>
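
One detail that is easy to miss: a tag can carry several classes, and class_ matches if any one of them equals the given value. A small sketch with invented markup:

```python
from bs4 import BeautifulSoup

# Invented markup: the first <p> has two classes, the second has one.
sample = "<p class='note warning'>x</p><p class='note'>y</p>"
soup = BeautifulSoup(sample, "html.parser")

hit = soup.find("p", class_="warning")  # matches one of the two classes

print(hit.get_text())  # x
print(hit["class"])    # ['note', 'warning']
```

Note that the class attribute comes back as a list, since it is multi-valued in HTML.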

To match on several attributes at once, put them in a dictionary and pass it through the attrs parameter.

attrs = {'id' : 'upper', 'class' : 'test'}
soup.find('div', attrs=attrs)

Output:
<div class="test" custom="good" id="upper">
<h3 title="Good Content Title">Contents Title</h3>
<p>Test contents</p>
</div>
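
The attrs dictionary is also the way around attribute names that are not valid Python identifiers, such as data-* attributes. A sketch, assuming a made-up `data-role` attribute:

```python
from bs4 import BeautifulSoup

# Invented markup with a hyphenated attribute name.
sample = (
    "<div data-role='card' id='a'>A</div>"
    "<div data-role='list' id='b'>B</div>"
)
soup = BeautifulSoup(sample, "html.parser")

# soup.find("div", data-role="list") would be a SyntaxError,
# so the attribute name goes into the attrs dictionary as a string key.
found = soup.find("div", attrs={"data-role": "list", "id": "b"})

print(found.get_text())  # B
```

Every key in the dictionary has to match for a tag to be returned.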

- find_all() function

: Where find() returns only one tag that satisfies the condition, find_all() returns every matching tag as a list.

soup.find_all('div', class_='test')

Output:
[<div class="test" custom="good" id="upper">
<h3 title="Good Content Title">Contents Title</h3>
<p>Test contents</p>
</div>, <div class="test" custom="nice" id="lower">
<p>Test Test Test 1</p>
<p>Test Test Test 2</p>
<p>Test Test Test 3</p>
</div>]
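
Since the result behaves like a list, it can be looped over, sliced, or measured with len(). A short sketch on invented markup, which also shows the limit parameter and the empty result when nothing matches:

```python
from bs4 import BeautifulSoup

# Invented markup for the example.
sample = "<ul><li>a</li><li>b</li><li>c</li></ul>"
soup = BeautifulSoup(sample, "html.parser")

items = soup.find_all("li")
texts = [li.get_text() for li in items]   # pull the text out of each match

first_two = soup.find_all("li", limit=2)  # stop after two matches
none_found = soup.find_all("table")       # no match gives an empty list

print(texts)           # ['a', 'b', 'c']
print(len(first_two))  # 2
print(len(none_found)) # 0
```

Unlike find(), which returns None on no match, find_all() returns an empty list, so a for loop over the result is safe without an extra check.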

- get_text() function

: Extracts the text inside a tag. When called on a parent tag, it returns the text of all its child tags as well.

tag = soup.find('h3')
print(tag)
tag.get_text()

Output:
<h3 title="Good Content Title">Contents Title</h3>
Contents Title

For a parent tag, the text of its child tags is included as well.

tag = soup.find('div', id='upper')
print(tag)
tag.get_text()

Output (surrounding whitespace omitted):
<div class="test" custom="good" id="upper">
<h3 title="Good Content Title">Contents Title</h3>
<p>Test contents</p>
</div>
Contents Title\nTest contents
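
The newlines in the combined text come from the whitespace between the child tags. get_text() accepts separator and strip arguments to tidy this up. A sketch, assuming a compact copy of the upper div:

```python
from bs4 import BeautifulSoup

# Compact copy of the upper div, with a newline between child tags.
sample = "<div>\n<h3>Contents Title</h3>\n<p>Test contents</p>\n</div>"
soup = BeautifulSoup(sample, "html.parser")
div = soup.find("div")

raw = div.get_text()                             # whitespace kept as-is
clean = div.get_text(separator=" ", strip=True)  # strip each piece, join with a space

print(repr(raw))  # '\nContents Title\nTest contents\n'
print(clean)      # Contents Title Test contents
```

With strip=True, pieces that are only whitespace are dropped before joining, which is usually what you want when the text goes into a table or CSV.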

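Extracting attribute values (key-value)

Sometimes the value you want is stored in an attribute rather than in the tag's text.
In that case, index the tag you found with the attribute name using the [ ] operator.
e.g. div.find('h3')['title']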

tag = soup.find('h3')
print(tag)
tag['title']

Output:
<h3 title="Good Content Title">Contents Title</h3>
Good Content Title
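
The [ ] operator raises a KeyError when the attribute is absent. When that is a possibility, the tag's get() method may be the safer choice. A sketch on a one-line copy of the h3 tag (the `lang` attribute is deliberately one that does not exist):

```python
from bs4 import BeautifulSoup

sample = "<h3 title='Good Content Title'>Contents Title</h3>"
soup = BeautifulSoup(sample, "html.parser")
tag = soup.find("h3")

title = tag["title"]                   # raises KeyError if the attribute is missing
lang = tag.get("lang")                 # returns None instead of raising
lang_or_default = tag.get("lang", "unknown")

print(title)            # Good Content Title
print(lang)             # None
print(lang_or_default)  # unknown
print(dict(tag.attrs))  # {'title': 'Good Content Title'}
```

tag.attrs holds all attributes of the tag as a dictionary, which helps when you do not know the attribute names in advance.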

License: Attribution, NonCommercial, NoDerivatives
