Python Lecture 6 - Crawling

상맹 2021. 9. 20. 19:05

pip install flask
pip install requests
pip install beautifulsoup4

1. URL analysis → Naver News offers no API for this, so the URLs have to be analyzed by hand.

* URLs by news category

① Naver News Politics category URL

https://news.naver.com/main/main.naver?mode=LSD&mid=shm&sid1=100

② Naver News Economy category URL

https://news.naver.com/main/main.naver?mode=LSD&mid=shm&sid1=101

③ Naver News Society category URL

https://news.naver.com/main/main.naver?mode=LSD&mid=shm&sid1=102

④ Naver News Life/Culture category URL

https://news.naver.com/main/main.naver?mode=LSD&mid=shm&sid1=103

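→ Analysis result

Comparing the Naver News URLs category by category gave the following result.

sid1 = category: Politics (100), Economy (101), Society (102), Life/Culture (103)

oid = news outlet (005 = 국민일보, Kukmin Ilbo)

aid = article number (may be missing, so be careful)

To fetch only politics news with requests, request the URL with sid1=100.

=> Print the politics news grouped by outlet (outlets that do not exist are filtered out)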
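
The sid1 mapping above can be turned into a small URL builder. This is only a sketch: the English category keys and the helper name category_url are made up for illustration, while the base URL and the query parameters (mode, mid, sid1) are copied from the URLs listed earlier.

```python
from urllib.parse import urlencode

# Category codes taken from the URL analysis above.
# The English keys are illustrative names, not anything Naver defines.
SID1_BY_CATEGORY = {
    "politics": 100,
    "economy": 101,
    "society": 102,
    "life_culture": 103,
}

BASE_URL = "https://news.naver.com/main/main.naver"


def category_url(category: str) -> str:
    """Build the category page URL for one of the keys above (hypothetical helper)."""
    params = {"mode": "LSD", "mid": "shm", "sid1": SID1_BY_CATEGORY[category]}
    return f"{BASE_URL}?{urlencode(params)}"


print(category_url("politics"))
# https://news.naver.com/main/main.naver?mode=LSD&mid=shm&sid1=100
```

The resulting string can be passed straight to requests.get().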

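2. Collecting oids → collect the oids, then print them by outlet name

① oids start at 1 and run up to roughly three digits. Values with one or two digits are zero-padded in the URL, so '00' or '0' has to be prepended before the value is used (the zfill() string method does this).

Ex) For 5 → '00' + '5'. Calling str(5).zfill(3) gives '005'.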
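
A quick check of how zfill() pads values of different lengths. The helper name to_oid is hypothetical and only wraps the call for readability.

```python
def to_oid(number: int) -> str:
    """Zero-pad a number to three characters, e.g. 5 -> '005' (hypothetical helper)."""
    return str(number).zfill(3)


print(to_oid(5))    # 005
print(to_oid(42))   # 042
print(to_oid(123))  # 123
```

Note that zfill() is a string method, so the number has to be converted with str() first.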

import requests
from bs4 import BeautifulSoup

# 1. Analyze the URL
# 2. Algorithm: loop over the candidate ids and download each page
# 3. The response is text/html, so it is awkward to parse

# Goal: collect the valid oids (takes roughly 1-2 minutes)

oid_list = []

for start_oid in range(1, 1000):  # 1..999, i.e. at most three digits
    start_oid_str = str(start_oid).zfill(3)  # e.g. 5 -> "005"
    uri = f"https://newsstand.naver.com/?list=&pcode={start_oid_str}"

    response = requests.get(uri)

    if response.status_code == 200:  # the server answered normally
        oid_list.append(start_oid_str)  # keep this oid


print(f"Total number of oids: {len(oid_list)}")

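② Each candidate oid (001 to 999) is substituted into the URL, and an oid is treated as valid when the server responds normally, that is, with status code 200.

response = requests.get(uri)
if response.status_code == 200:
    oid_list.append(start_oid_str)

③ The oids that pass this check are collected in a list. Each oid is the unique number of one outlet, so the total number of oids equals the number of outlets.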
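
The collection step can also be wrapped in a function so that the status check is replaceable, which makes it possible to try the logic without hitting the server. This is a sketch under assumptions: the names fetch_status and collect_oids, the 5-second timeout, and the optional delay between requests are not from the lecture.

```python
import time
from typing import Callable, Iterable, List


def fetch_status(uri: str) -> int:
    """Return the HTTP status code for uri, or 0 if the request fails."""
    import requests  # imported here so collect_oids can be tried offline

    try:
        return requests.get(uri, timeout=5).status_code
    except requests.RequestException:
        return 0


def collect_oids(
    candidates: Iterable[int],
    get_status: Callable[[str], int] = fetch_status,
    delay: float = 0.0,
) -> List[str]:
    """Return the zero-padded oids whose Newsstand URL answers with 200."""
    oids = []
    for number in candidates:
        oid = str(number).zfill(3)
        uri = f"https://newsstand.naver.com/?list=&pcode={oid}"
        if get_status(uri) == 200:
            oids.append(oid)
        time.sleep(delay)  # optional pause between requests
    return oids


# Real run (network access needed): collect_oids(range(1, 1000), delay=0.1)
```

Passing a fake get_status, for example one that returns 200 only for a few known ids, exercises the loop without any network traffic.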

3. Crawling the outlet names

import requests
from bs4 import BeautifulSoup

# 1. Analyze the URL
# 2. Algorithm: loop over the candidate ids and download each page
# 3. The response is text/html, so it has to be parsed as HTML

# Goal: crawl the outlet names (takes roughly 1-2 minutes)

oid_names = []

for start_oid in range(1, 1000):  # 1..999, i.e. at most three digits
    start_oid_str = str(start_oid).zfill(3)
    uri = f"https://newsstand.naver.com/?list=&pcode={start_oid_str}"

    response = requests.get(uri)

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        # Browser-console equivalent:
        # document.querySelector("#focusPanelCenter .panel_inner img")
        target = soup.select_one("#focusPanelCenter .panel_inner img")
        # select_one returns None when nothing matches, so check before indexing
        if target is not None and target.has_attr("alt"):
            oid_names.append(target["alt"])  # the alt text holds the outlet name

print(oid_names)

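★ In a CSS selector, an id is written with # and a class with . ★

- let temp = document.querySelector("#focusPanelCenter .panel_inner img");

- temp

- temp.alt

→ In JavaScript (the browser console) the element can be looked up directly, but Python cannot query the page this way on its own, so the BeautifulSoup library is used.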
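
The same selector can be tried in Python on a small piece of HTML. The markup below is a made-up, simplified stand-in for the Newsstand page (the real page is much larger and may differ), and "Example Daily" is a placeholder outlet name.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup used only to demonstrate the selector.
html = """
<div id="focusPanelCenter">
  <div class="panel_inner">
    <img src="logo.png" alt="Example Daily">
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# '#' selects by id, '.' selects by class, as in document.querySelector
target = soup.select_one("#focusPanelCenter .panel_inner img")
name = target["alt"] if target is not None else None

# A selector that matches nothing yields None instead of raising
missing = soup.select_one("#noSuchPanel img")

print(name)     # Example Daily
print(missing)  # None
```

Because select_one() returns None when nothing matches, checking the result before reading ["alt"] avoids a TypeError on pages that lack the element.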

Reference: Beautiful Soup Documentation (Beautiful Soup 4.4.0 documentation), beautiful-soup-4.readthedocs.io
