Python Project (Web Scraping the Amazon Website) 6 | Collecting Product Information with Selenium (2)

올리브한입 2025. 3. 21. 05:17

# Extract Product Title
try:
    title = driver.find_element(By.ID, "productTitle").text.strip()
except Exception:  # element missing or page not laid out as expected
    title = "N/A"

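By.ID: By is the Selenium class that names the locator strategies used to find elements. By.ID locates an element by its HTML id attribute; here it finds the element whose id is "productTitle". The strip() method removes whitespace from both ends of the string. If the lookup fails, the except branch sets title to "N/A".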
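Every field in this post follows the same "try the lookup, fall back to N/A" shape, so the pattern could be factored into a small helper. The sketch below is one possible way to do that. The name text_or_default is hypothetical (it is not part of Selenium), and the lambda and missing_element stand in for a real driver.find_element(...) call so the example runs without a browser.

```python
def text_or_default(getter, default="N/A"):
    """Call getter() and return its stripped text, or default on any error."""
    try:
        return getter().strip()
    except Exception:
        return default


def missing_element():
    # Stand-in for a lookup that fails because the element is absent
    raise LookupError("no such element")


print(text_or_default(lambda: "  Example Product Title  "))
print(text_or_default(missing_element))
```

With a real driver, the call might look like text_or_default(lambda: driver.find_element(By.ID, "productTitle").text).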

# Extract Brand Name
try:
    # Byline text looks like "Visit the <brand> Store"; drop the wrapper words
    brand = driver.find_element(By.ID, "bylineInfo").text.replace("Visit the ", "").replace(" Store", "")
except Exception:
    brand = "N/A"

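The brand name is found the same way, this time from the element whose id is "bylineInfo". The replace() method deletes the parts that are not needed ("Visit the " and " Store"), leaving only the brand.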
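One thing to watch with chained replace() calls is that they remove the pattern everywhere in the string, so a brand whose own name contains " Store" would be shortened. A minimal sketch of an alternative that trims only the leading and trailing wrapper follows. The function name clean_brand and the sample brand names are made up for illustration.

```python
def clean_brand(byline):
    """Strip the 'Visit the ... Store' wrapper from a byline string."""
    text = byline.strip()
    prefix, suffix = "Visit the ", " Store"
    if text.startswith(prefix) and text.endswith(suffix):
        # Remove the wrapper only at the two ends, leaving the middle untouched
        return text[len(prefix):-len(suffix)]
    return text


print(clean_brand("Visit the Acme Store"))
print(clean_brand("Visit the Home Store Store"))
```

The second call keeps "Home Store" intact, whereas the chained replace() version would return just "Home".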

# Check for Amazon's Choice label
try:
    # find_element raises if the badge is absent, so a returned element means it exists
    amazon_choice = driver.find_element(By.CLASS_NAME, 'ac-badge-wrapper')
    amazon_choice_exists = "Yes" if amazon_choice else "No"
except Exception:
    amazon_choice_exists = "No"

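This step checks the Amazon product page for the Amazon's Choice badge and records whether it is present. By.CLASS_NAME is another locator strategy; it finds an element by its HTML class name, here the class ac-badge-wrapper. A conditional expression assigns "Yes" to amazon_choice_exists if amazon_choice exists and "No" otherwise. Note that find_element raises an exception instead of returning an empty value when nothing matches, so in practice the "No" result comes from the except branch.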
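An alternative that avoids relying on an exception is find_elements (plural), which returns an empty list when nothing matches. The sketch below illustrates the idea under a few assumptions: StubDriver is a made-up stand-in so the example runs without a browser, amazon_choice_label is a hypothetical name, and the locator is passed as the string "class name", which is the value of By.CLASS_NAME.

```python
class StubDriver:
    """Minimal stand-in for a WebDriver, used only to make this sketch runnable."""

    def __init__(self, elements):
        self._elements = elements

    def find_elements(self, by, value):
        # Like Selenium's find_elements, return a (possibly empty) list
        return list(self._elements)


def amazon_choice_label(driver):
    # "class name" is the string value of By.CLASS_NAME
    badges = driver.find_elements("class name", "ac-badge-wrapper")
    # An empty list is falsy, so no try/except is needed for the absent case
    return "Yes" if badges else "No"


print(amazon_choice_label(StubDriver(["badge"])))
print(amazon_choice_label(StubDriver([])))
```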

# Extract Star Rating
try:
    rating_element = driver.find_element(By.CSS_SELECTOR, "#acrPopover span.a-size-base.a-color-base")
    star_rating = float(rating_element.text.strip())
except Exception:  # element missing, or its text is not a plain number
    star_rating = "N/A"

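This step extracts the star rating from the Amazon product page. CSS selectors are the syntax used in HTML documents to pick out the elements a style applies to, and Selenium accepts the same syntax for locating elements.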

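ID selector (#):

Selects an element by its id attribute.
Example: #productTitle selects the element with id="productTitle".

Class selector (.):

Selects elements by their class attribute.
Example: .product-price selects elements with class="product-price".

Tag selector (tag name):

Selects elements with a given tag.
Example: div selects every <div> element.

Attribute selector ([attribute=value]):

Selects elements that have a given attribute value.
Example: [href='https://www.example.com'] selects elements with href="https://www.example.com".

Child, descendant, and sibling selectors:

> selects direct children.
Example: div > p selects every <p> that is a direct child of a <div>.
A space selects descendants at any depth.
Example: div p selects every <p> nested anywhere inside a <div>.
+ selects the immediately following sibling.
Example: h2 + p selects a <p> that comes immediately after an <h2>.
~ selects later siblings that share the same parent.
Example: h2 ~ p selects every <p> that follows an <h2> under the same parent.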

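#acrPopover selects the element whose id is "acrPopover". span.a-size-base.a-color-base selects a span tag that has both the "a-size-base" and "a-color-base" classes. The space between the two parts means the span must be nested somewhere inside the #acrPopover element.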
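Calling float() directly works when the element text is a bare number such as "4.6". If the text ever carried extra words, the conversion would fail and the value would fall back to "N/A". The sketch below shows a slightly more tolerant parser. The name parse_star_rating is hypothetical, and the "4.6 out of 5 stars" input is an assumed format used only for illustration.

```python
import re


def parse_star_rating(text):
    """Return the first number in text as a float, or "N/A" if there is none."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else "N/A"


print(parse_star_rating(" 4.6 "))
print(parse_star_rating("4.6 out of 5 stars"))
print(parse_star_rating(""))
```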

# Extract the second Rufus question
try:
    # Find all the question buttons
    question_buttons = driver.find_elements(By.CSS_SELECTOR, 'span.a-declarative button.small-widget-pill')
    # Index 1 is the second button; fall back when fewer than two were found
    second_rufus_question = question_buttons[1].text if len(question_buttons) > 1 else "N/A"
except Exception:
    second_rufus_question = "N/A"

'span.a-declarative button.small-widget-pill' matches button tags with the class small-widget-pill that sit inside a span whose class is a-declarative. find_elements returns every match as a list, and the code takes the second item of question_buttons (index 1), which is the second question button. The button's text is read only when the list holds at least two buttons; otherwise the value is "N/A".
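The "take item N if the list is long enough" check can be written once and reused for other positions. The following is a minimal sketch. StubButton is a made-up stand-in that models only the .text attribute of a WebElement, nth_text is a hypothetical helper name, and the question strings are invented.

```python
from dataclasses import dataclass


@dataclass
class StubButton:
    """Stand-in for a WebElement; only the .text attribute is modeled."""
    text: str


def nth_text(elements, index, default="N/A"):
    """Return elements[index].text, or default if the list is too short."""
    if 0 <= index < len(elements):
        return elements[index].text.strip()
    return default


buttons = [StubButton("Is it waterproof?"), StubButton("How long is the cable?")]
print(nth_text(buttons, 1))
print(nth_text(buttons[:1], 1))
```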

# Extract Coupon Discount
try:
    coupon_discount_text = driver.find_element(By.XPATH, "//label[contains(@id, 'couponText')]").text.strip()

    # Split the text on whitespace and check the second word for a percentage
    # e.g. parts = ["Apply", "25%", "off", "coupon"]
    parts = coupon_discount_text.split()
    if len(parts) > 1 and '%' in parts[1]:
        coupon_discount = parts[1]
    else:
        coupon_discount = "N/A"
except Exception:
    coupon_discount = "N/A"

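//label[contains(@id, 'couponText')] is an XPath expression that finds a label tag whose id attribute contains 'couponText'. split() then breaks the string on whitespace and returns the words as a list. The condition len(parts) > 1 holds when the list has at least two items, so the check runs only when the coupon text contains at least two words. The code then tests whether the second item of parts (index 1) contains %. If it does, that item becomes coupon_discount; otherwise the value is "N/A".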
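The index-based check assumes the percentage is always the second word. If the wording on the page were different, a regular expression that looks for a percentage anywhere in the text would be more forgiving. The sketch below shows that idea. The name extract_coupon_percent is hypothetical, and the sample strings other than the "Apply 25% off coupon" shape are invented for illustration.

```python
import re


def extract_coupon_percent(text):
    """Return the first percentage in text (e.g. "25%"), or "N/A"."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*%", text)
    if not match:
        return "N/A"
    # Rebuild the value so "25 %" and "25%" both come out as "25%"
    return f"{match.group(1)}%"


print(extract_coupon_percent("Apply 25% off coupon"))
print(extract_coupon_percent("Save an extra 10% with coupon"))
print(extract_coupon_percent("Apply $5 coupon"))
```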

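# Return the collected values as a list
# (asin, rating_count, and scrape_time are not set in this excerpt; they must be assigned elsewhere in the function)
return [asin, title, brand, amazon_choice_exists, star_rating, rating_count, second_rufus_question, coupon_discount, scrape_time]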

The function returns the collected values as a single list, one item per field.
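Because the list is positional, each value is identified only by its index. One possible way to make the result easier to read is to pair the values with field names. The sketch below does that with zip. FIELDS simply mirrors the variable names in the return statement, row_to_record is a hypothetical helper, and every value in sample_row is a placeholder invented for illustration.

```python
FIELDS = [
    "asin", "title", "brand", "amazon_choice_exists", "star_rating",
    "rating_count", "second_rufus_question", "coupon_discount", "scrape_time",
]


def row_to_record(row):
    """Pair each positional value with its field name."""
    if len(row) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} values, got {len(row)}")
    return dict(zip(FIELDS, row))


# Placeholder values, not real scraped data
sample_row = ["B000000000", "Example Product", "Acme", "No", 4.6,
              "1,234", "N/A", "25%", "2025-03-21 05:17:00"]
record = row_to_record(sample_row)
print(record["brand"])
print(record["coupon_discount"])
```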

To be continued in the next post.