LLM evaluation on Google Vertex AI

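This guide shows how to evaluate large language models (LLMs) with Google Vertex AI and W&B Weave, focusing on comparing different Gemini models on a text summarization task. This article is an AI translation; if you suspect a mistranslation, please let us know in the comments.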
Brett Young, Christian Williams
Created on September 12|Last edited on September 12
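In this article we build a robust LLM evaluation framework on Google Vertex AI so that you can rigorously measure and compare the performance of language models. With a focus on building an end-to-end pipeline, the tutorial shows how to combine Vertex AI's scalable infrastructure with W&B Weave's evaluation tooling.
We use text summarization as the running use case, but the principles and methods covered here apply to a wide range of LLM evaluation tasks. You will learn how to set up a Vertex AI environment, use different Gemini models, and define detailed evaluation metrics to build a reproducible workflow.
By the end of the tutorial you will be able to quantify model performance with metrics such as ROUGE and BERTScore, and also to analyze outputs qualitatively, so that you can choose the model that best fits your requirements.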
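Table of contents
Foundation models available on Google Vertex AI
Evaluating LLM summarization on Vertex AI with W&B Weave
Evaluating Gemini models for text summarization
Step 1: Create a Google Cloud project
Step 2: Enable the Vertex AI API
Step 3: Set up the Google Cloud CLI
Step 4: Configure IAM roles
Generating ground-truth summaries
Generating summaries with Gemini models
Defining Weave evaluation metrics for summarization
Catching bugs with Weave
Why choose Google Vertex AI?
Conclusion
Related articles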
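Foundation models available on Google Vertex AI
Google Vertex AI offers a range of models suited to summarization, the use case we focus on here; the same LLM evaluation approach carries over to other AI tasks. Among them, the Gemini series is particularly well suited to summarization.
Gemini 1.5 Flash: produces high-quality summaries quickly, which makes it a good fit for processing large volumes of text efficiently. As of February 15, 2025, it is priced at $0.075 per 1 million input tokens and $0.30 per 1 million output tokens. These rates apply to prompts of up to 128,000 tokens; longer inputs are billed at a higher rate.
Gemini 2.0 Flash: builds on these capabilities with faster response times and stronger multimodal features. It is priced at $0.10 per 1 million input tokens and $0.40 per 1 million output tokens, and is a good fit for applications that need fast, high-quality responses at scale.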
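To make these prices concrete, here is a minimal sketch that estimates the cost of a single request from token counts. The prices are the ones quoted above (as of February 15, 2025). The function name estimate_cost and the example token counts are hypothetical, and the higher rate for Gemini 1.5 Flash prompts over 128,000 tokens is not modeled because it is not given here, so the function simply rejects such inputs.

```python
# Prices in USD per 1 million tokens, as quoted in this article (Feb 15, 2025).
PRICES = {
    "gemini-1.5-flash": {"input": 0.075, "output": 0.30},
    "gemini-2.0-flash": {"input": 0.10, "output": 0.40},
}

# Prompt size up to which the Gemini 1.5 Flash rates above apply.
STANDARD_RATE_LIMIT = 128_000


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request (hypothetical helper)."""
    if model == "gemini-1.5-flash" and input_tokens > STANDARD_RATE_LIMIT:
        # The long-prompt rate is not stated in the article, so do not guess it.
        raise ValueError("prompt exceeds 128,000 tokens; a higher rate applies")
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


# Example (made-up sizes): a 20,000-token paper summarized into a 250-token abstract.
for name in PRICES:
    print(name, f"${estimate_cost(name, 20_000, 250):.6f}")
```

At these sizes a single summary costs a fraction of a cent with either model, so the price difference only becomes noticeable at large volumes.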
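Beyond the Gemini family, Vertex AI provides access to models from other providers:
Meta's Llama series: a versatile alternative for text summarization, with performance characteristics that differ from Google's native models.
Anthropic's Claude models: models such as Claude 3.5 Haiku are efficient and strong at conversational output, which makes them well suited to interactive summarization where fast responses matter.
Text embedding models (for example, E5 Text Embedding): these convert text into vector representations, enabling tasks such as semantic search, classification, and clustering.
Beyond text models, Vertex AI also supports other modalities, such as Stable Diffusion, a text-to-image model that generates or edits images from text prompts.
This breadth lets you pick the most suitable model for text processing, image generation, or embedding-based applications, which gives Vertex AI flexibility across a wide range of AI tasks.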
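Evaluating LLM summarization on Vertex AI with W&B Weave
When you evaluate several language models on a task such as text summarization, you need solid tooling to measure and compare performance. W&B Weave provides an LLM evaluation framework that simplifies this by letting you define evaluation criteria, collecting results automatically, and visualizing model performance along multiple dimensions.
Weave supports tracing through the @weave.op decorator, which lets you monitor LLM inputs, outputs, and model behavior throughout a workflow. Adding @weave.op above any function automatically logs its inputs and outputs and produces a detailed execution trace. Because each step of the data flow is captured, you can see exactly how inputs are transformed into outputs, which is very useful for debugging. Trace data is logged and visualized in Weave, giving insight into model behavior and highlighting places that may need tuning.
In addition to tracing, Weave Evaluations let you define metrics systematically and collect results without manual setup or a custom evaluation loop, which simplifies comparing language model outputs.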
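The evaluation process in Weave has a few core components:
Model: you define a model by subclassing the Model class and implementing a predict function that takes an input example and returns an output. This lets you version and track model attributes such as the prompt and temperature.
Dataset: a collection of examples, often representing failure cases or specific scenarios. These examples serve as test cases for evaluating the model systematically.
Scorer: metrics are defined with scorers, which can be simple Python functions decorated with @weave.op or more complex classes that inherit from weave.Scorer. A scorer analyzes the model output and returns a dictionary of metrics, so you can assess different aspects of model performance.
Evaluation: you combine a dataset and scorers into an Evaluation object that manages the process, then pass a model to its evaluate method. The method runs the model's predict function on each example in the dataset and applies the defined scorers to the outputs.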
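This structured approach makes evaluations consistent and repeatable, so different models or model versions are easy to compare. Weave's interactive dashboards display the evaluation results, letting you drill into specific examples, analyze metrics, and identify areas that need improvement.
Integrating Weave Evaluations into your workflow lets you build fair, rigorous, apples-to-apples evaluations for your language model use case, organize the information produced across your LLM workflow, and iterate on your application with confidence.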
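The following is a dependency-free sketch of what an evaluation loop does conceptually: call predict on each dataset row, apply each scorer, and average the resulting metrics. It is not Weave's implementation or API. The names ToyModel, exact_match_scorer, length_scorer, and run_evaluation are hypothetical and exist only for illustration.

```python
from statistics import mean


class ToyModel:
    """Stand-in for a model: returns the first sentence as the 'summary'."""

    def predict(self, text: str) -> dict:
        return {"model_output": text.split(".")[0].strip() + "."}


def exact_match_scorer(reference: str, model_output: dict) -> dict:
    """1.0 when the output equals the reference exactly, else 0.0."""
    return {"exact_match": float(model_output["model_output"] == reference)}


def length_scorer(reference: str, model_output: dict) -> dict:
    """Number of whitespace-separated words in the output."""
    return {"output_words": float(len(model_output["model_output"].split()))}


def run_evaluation(model, dataset, scorers) -> dict:
    """Run the model on every row, apply every scorer, and average each metric."""
    collected = {}
    for row in dataset:
        output = model.predict(row["text"])
        for scorer in scorers:
            for metric, value in scorer(row["reference"], output).items():
                collected.setdefault(metric, []).append(value)
    return {metric: mean(values) for metric, values in collected.items()}


dataset = [
    {"text": "Cats sleep a lot. They also purr.", "reference": "Cats sleep a lot."},
    {"text": "Rain is wet. Snow is cold.", "reference": "Rain falls."},
]
summary = run_evaluation(ToyModel(), dataset, [exact_match_scorer, length_scorer])
print(summary)
```

A framework such as Weave takes over this loop and additionally records every call, so the per-example inputs, outputs, and scores can be inspected afterwards.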
Evaluating Gemini models for text summarization
Next, we run an LLM evaluation with Google Vertex AI and W&B Weave, comparing two Gemini models, Gemini-1.5-Flash and Gemini-2.0-Flash, on a text summarization task.
We first create a ground-truth dataset to serve as a reference, which gives us a baseline for assessing each model's strengths and limitations. We then use Weave's metrics and visualization tools to analyze the model outputs and determine which model best fits our summarization needs.
We start with the Vertex AI setup: creating a Google Cloud project, enabling the required APIs, and configuring the Google Cloud CLI. This gives you the tools and permissions needed to use Vertex AI. The main steps are described below.
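Step 1: Create a Google Cloud project
Start by creating a new project in the Google Cloud console. Go to the project selector page and either select an existing project or create a new one. Billing must be enabled on the project to use Vertex AI services. If you have not created a project yet, search for 'create project' in the Google Cloud search bar and click the first result to reach the project creation flow.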
Step 2: Enable the Vertex AI API
Next, enable the Vertex AI API for your project. Type "Vertex AI" into the search bar of the Google Cloud console and select Vertex AI from the results to open the Vertex AI dashboard. Select "Enable all recommended APIs" to enable the APIs that Vertex AI needs. (This can take a little while to complete.)
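Step 3: Set up the Google Cloud CLI
To use Google Cloud services from your local development environment, install the Google Cloud CLI. Download and install it by following the Google Cloud documentation. Once installed, initialize the CLI by running gcloud init in a terminal. This command walks you through selecting a project and configuring settings.
Run the following commands to update the CLI components so you have the latest tools and features, and to authenticate:
gcloud components update
gcloud components install beta
gcloud auth login
If the Python scripts later in this tutorial fail with a credentials error, you may also need to set up Application Default Credentials by running gcloud auth application-default login.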
Step 4: Configure IAM roles
An administrator must make sure the appropriate IAM roles are assigned. These include:
Vertex AI User or Vertex AI Administrator, and
Service Account User
Which roles you need depends on how you plan to use Vertex AI. For this tutorial, we recommend the Vertex AI Administrator and Service Account User roles.
To assign them, search for "IAM" in the Google Cloud search bar, select the edit button next to your user account, and assign the appropriate roles.
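To support the summarization evaluation workflow with Google Vertex AI and W&B Weave, install a few Python packages. The following command installs the main libraries, including bert-score, which the evaluation script imports:
pip install google-cloud google-cloud-aiplatform openai wandb weave arxiv pymupdf rouge-score bert-score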
After the initial setup, you can open "Model Garden" in the Google Cloud console to see all the models available on Vertex AI. We are almost ready to use some of them in our evaluation, but first we need to create the dataset we will use to compare the models.
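Generating ground-truth summaries
We benchmark the Gemini models' summarization ability by testing whether they can accurately generate a research paper's abstract when the original abstract has been removed. This directly measures how concisely and relevantly a model can summarize the key information in the body of each paper.
The task mirrors what a person does when summarizing complex material: condensing a paper's objectives, methods, and main results into a short, coherent abstract. Because the abstract is withheld, we can see whether the model identifies and conveys the most important elements on its own, and we evaluate both how accurately it captures information and how concisely it organizes it.
To build the benchmark dataset, we first collect research papers on artificial intelligence and machine learning from arXiv. From each paper we extract only the first page, where the abstract usually appears, use Gemini 1.5 Pro to isolate the abstract, and structure it as a JSON object. The extracted abstracts serve as the "gold standard" reference and are saved in JSONL format for easy loading and consistent evaluation.
The JSONL format makes the reference abstracts easy to load and process when evaluating the Gemini models, and a single plain-text file is easy to version and share.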
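As a small illustration of the JSONL format, the sketch below writes one JSON object per line and reads the records back. The record fields mirror the ones used by the scripts in this tutorial (title, abstract, word_count); the file name example_abstracts.jsonl and the sample record values are made up for the example.

```python
import json
import os
import tempfile

records = [
    {"title": "Example paper A", "abstract": "We study X.", "word_count": 3},
    {"title": "Example paper B", "abstract": "We propose Y and test it.", "word_count": 6},
]

# Hypothetical file name, written to the system temp directory.
path = os.path.join(tempfile.gettempdir(), "example_abstracts.jsonl")

# Write: one JSON object per line.
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read: parse each non-empty line independently.
with open(path, "r", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(loaded == records)
```

Because each line is an independent JSON document, a file can be appended to or streamed line by line without loading everything into memory.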
The following code downloads the papers and then extracts the abstract from the first page of each one using Gemini 1.5 Pro.
import os
import arxiv
import fitz  # PyMuPDF
import json
from vertexai.generative_models import GenerativeModel, GenerationConfig
import vertexai
import weave; weave.init('paper_abstract_gen')
import re
from time import sleep


# Set up Vertex AI
PROJECT_ID = "dsports-6ab79"
LOCATION = "us-central1"
vertexai.init(project=PROJECT_ID, location=LOCATION)


# Directory to save downloaded papers
download_dir = "arxiv_papers"
os.makedirs(download_dir, exist_ok=True)


# Define AI-specific search queries
search_queries = [
    "Large Language Models for vision tasks AND cat:cs.AI",
    "Multimodal AI techniques AND cat:cs.CV",
    "Applications of Transformers in healthcare AI AND cat:cs.LG",
    "Few-shot learning in AI and ML AND cat:cs.LG",
    "Vision and language models integration AND cat:cs.CV",
    "Domain-specific fine-tuning for ML models AND cat:cs.LG",
    "Foundational models in AI and CV applications AND cat:cs.AI",
    "NLP in robotics and vision systems AND cat:cs.AI",
    "Bias and fairness in AI for CV AND cat:cs.CV",
    "Evaluation metrics for multimodal AI AND cat:cs.LG"
]


def download_papers(max_pages=15, max_attempts_per_query=20):
    """Download papers or use existing ones from the download directory."""
    papers = []
    downloaded_titles = set()
    
    # First check for existing papers
    if os.path.exists(download_dir):
        existing_pdfs = [f for f in os.listdir(download_dir) if f.endswith('.pdf')]
        if existing_pdfs:
            print(f"Found {len(existing_pdfs)} existing papers, checking validity...")
            for pdf_file in existing_pdfs:
                pdf_path = os.path.join(download_dir, pdf_file)
                try:
                    with fitz.open(pdf_path) as pdf:
                        if pdf.page_count <= max_pages:
                            arxiv_id = pdf_file.replace('.pdf', '')
                            # Try to get a clean title from the first page
                            title = pdf[0].get_text().split('\n')[0].strip()
                            papers.append({
                                "title": title,
                                "file_path": pdf_path,
                                "arxiv_id": arxiv_id
                            })
                            downloaded_titles.add(title)
                            print(f"Using existing paper: {title}")
                except Exception as e:
                    print(f"Error checking existing PDF {pdf_path}: {e}")
                    if os.path.exists(pdf_path):
                        os.remove(pdf_path)
    
    # If we have enough papers (one per query), return early
    if len(papers) >= len(search_queries):
        print(f"\nUsing {len(papers)} existing papers")
        return papers[:len(search_queries)]  # Return only what we need
    
    # Otherwise, download remaining papers
    print(f"\nNeed {len(search_queries) - len(papers)} more papers, downloading...")
    
    client = arxiv.Client()
    
    for query in search_queries[len(papers):]:  # Only process remaining queries
        paper_found = False
        attempt = 0
        
        while not paper_found and attempt < max_attempts_per_query:
            search = arxiv.Search(
                query=query,
                max_results=100,
                sort_by=arxiv.SortCriterion.SubmittedDate
            )
            
            try:
                results = list(client.results(search))
                start_idx = attempt * 5
                end_idx = start_idx + 5
                current_batch = results[start_idx:end_idx]
                
                for result in current_batch:
                    if result.title not in downloaded_titles:
                        print(f"Downloading: {result.title}")
                        paper_id = result.entry_id.split('/')[-1]
                        pdf_filename = f"{paper_id}.pdf"
                        pdf_path = os.path.join(download_dir, pdf_filename)
                        
                        result.download_pdf(dirpath=download_dir, filename=pdf_filename)
                        
                        try:
                            with fitz.open(pdf_path) as pdf:
                                if pdf.page_count <= max_pages:
                                    papers.append({
                                        "title": result.title,
                                        "file_path": pdf_path,
                                        "arxiv_id": paper_id
                                    })
                                    downloaded_titles.add(result.title)
                                    print(f"Accepted: {result.title}")
                                    paper_found = True
                                    break
                                else:
                                    os.remove(pdf_path)
                                    print(f"Skipped (too many pages: {pdf.page_count}): {result.title}")
                        except Exception as e:
                            print(f"Error checking PDF {pdf_path}: {e}")
                            if os.path.exists(pdf_path):
                                os.remove(pdf_path)
                
                attempt += 1
                if not paper_found:
                    print(f"Attempt {attempt}/{max_attempts_per_query} for query: {query}")
                    sleep(3)
                    
            except Exception as e:
                print(f"Error during download: {e}")
                sleep(3)
                attempt += 1
                continue
        
        if not paper_found:
            print(f"Failed to find suitable paper for query after {max_attempts_per_query} attempts: {query}")
    
    # Report only the total: existing and new papers are not tracked separately here.
    print(f"\nTotal papers available: {len(papers)}")
    return papers


def extract_first_page_text(pdf_path):
    """Extract text from only the first page of the PDF."""
    with fitz.open(pdf_path) as pdf:
        if pdf.page_count > 0:
            page = pdf[0]
            return page.get_text()
    return ""


def extract_abstract_with_gemini(text, title, max_retries=3):
    """Extract abstract using Gemini with simple retry logic."""
    model = GenerativeModel("gemini-1.5-pro-002")
    
    prompt = (
        f"From the following first page of the research paper titled '{title}', "
        f"extract ONLY the abstract section. Return the result in JSON format with 'abstract' as the key. "
        f"If you cannot find the abstract, return an empty string as the value.\n\n"
        f"Paper content:\n\n{text}"
    )
    
    for attempt in range(max_retries):
        try:
            response = model.generate_content(
                prompt,
                generation_config=GenerationConfig(
                    temperature=0,
                    response_mime_type="application/json"
                )
            )
            return json.loads(response.text)
            
        except Exception as e:
            if attempt == max_retries - 1:  # Last attempt
                print(f"Error extracting abstract for {title}: {e}")
                return {"abstract": ""}
            
            # Simple exponential backoff: 5s, then 10s (the last attempt does not wait)
            sleep_time = 5 * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed, retrying in {sleep_time}s...")
            sleep(sleep_time)
    
    return {"abstract": ""}


def count_words(text):
    """Count words excluding punctuation and special characters."""
    cleaned_text = re.sub(r'[^\w\s]', ' ', text.lower())
    words = [word for word in cleaned_text.split() if word.strip()]
    return len(words)


def main():
    # Download papers
    papers = download_papers()
    print(f"\nDownloaded {len(papers)} papers. Processing abstracts...\n")
    
    # Process papers and extract abstracts
    paper_data = []
    for paper in papers:
        title = paper["title"]
        pdf_path = paper["file_path"]
        
        print(f"Processing: {title}")
        first_page_text = extract_first_page_text(pdf_path)
        abstract_json = extract_abstract_with_gemini(first_page_text, title)
        abstract_text = abstract_json.get('abstract', '')
        word_count = count_words(abstract_text)
        
        paper_data.append({
            "title": title,
            "file_path": pdf_path,
            "abstract": abstract_text,
            "word_count": word_count,
            "arxiv_id": paper["arxiv_id"]
        })
        
        sleep(2)

    # Save to JSONL file
    output_file = "paper_abstracts.jsonl"
    with open(output_file, "w") as f:
        for entry in paper_data:
            json.dump(entry, f)
            f.write("\n")

    print(f"\nProcessed {len(paper_data)} papers. Results saved to {output_file}")


if __name__ == "__main__":
    main()
This script builds the reference dataset for evaluating the Gemini models. It downloads AI research papers from arXiv, extracts the original abstract from each PDF, and saves the abstracts in JSONL format as the benchmark for comparison. With this ground-truth dataset in place, we can compare the summaries generated by the Gemini models side by side with the original abstracts.
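Generating summaries with Gemini models
Using Google Vertex AI and W&B Weave, we compare the performance and cost of two Gemini models, Gemini-1.5-Flash and Gemini-2.0-Flash, on the summarization task. Gemini-2.0-Flash is the newer and slightly more expensive version of the Gemini Flash model, with potential improvements in efficiency and multimodal capability.
For consistency, both models receive the same structured prompt asking them to summarize each paper's key content as a concise abstract. To exclude the original abstract, we extract the body text starting from the second page, so the model has to write the summary on its own. The prompt also asks for an abstract that is self-contained and clearly communicates the paper's contribution.
Both models run with temperature set to 0.0, which makes their outputs as deterministic as possible and keeps the comparison direct. The abstracts generated by each model are saved to JSONL files, gemini_1_5_flash_abstract_predictions.jsonl and gemini_2_flash_abstract_predictions.jsonl, in the same format as the reference abstracts. This structured output makes the models easy to compare side by side.
As a safeguard against API rate limits, the script waits 2 seconds between requests. Processing the same set of papers with both models gives us a dataset for systematically evaluating abstract generation, showing which model captures the key content better and which types of content each model handles well.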
import json
import fitz  # PyMuPDF
import time
from vertexai.generative_models import GenerativeModel, GenerationConfig
import vertexai
import weave; weave.init('vertex_abstract_prediction')


# Configuration
PROJECT_ID = "dsports-6ab79"
LOCATION = "us-central1"
vertexai.init(project=PROJECT_ID, location=LOCATION)


# Model configurations
MODELS = {
    "gemini_2_flash": {
        "name": "gemini-2.0-flash-001",
        "type": "vertex",
        "delay": 2,
        "temperature": 0.0
    },
    "gemini_1_5_flash": {
        "name": "gemini-1.5-flash-002",
        "type": "vertex",
        "delay": 2,
        "temperature": 0.0
    }
}


def load_paper_data():
    """Load papers with their abstracts and word counts."""
    with open("paper_abstracts.jsonl", "r") as f:
        return [json.loads(line) for line in f]


def extract_text_after_page_one(pdf_path):
    """Extract text from page 2 onwards."""
    text = ""
    try:
        with fitz.open(pdf_path) as pdf:
            if pdf.page_count > 1:
                for page_num in range(1, pdf.page_count):  # Start from page 2
                    page = pdf[page_num]
                    text += page.get_text()
    except Exception as e:
        print(f"Error extracting text from PDF {pdf_path}: {e}")
    return text


def create_abstract_prompt(text, title, target_length):
    """Create a prompt for abstract generation with target length."""
    return (
        f"You are tasked with generating an abstract for a research paper titled '{title}'. "
        f"The abstract should be approximately {target_length} words long.\n\n"
        f"Generate an abstract that summarizes the key points of the paper, including the "
        f"research objective, methodology, and main findings. The abstract should be "
        f"self-contained and clearly communicate the paper's contribution. Respond only with the ABSTRACT, and NOT the title\n\n"
        # Separate the paper text from the final instruction so they do not run together.
        f"Paper content:\n\n{text}\n\n"
        f"Respond only with the ABSTRACT!"
    )


@weave.op
def predict_abstract_with_gemini(text, title, target_length, model_info):
    """Generate abstract prediction using specified Gemini model."""
    model = GenerativeModel(model_info["name"])
    generation_config = GenerationConfig(
        temperature=model_info["temperature"]
    )
    
    try:
        response = model.generate_content(
            create_abstract_prompt(text, title, target_length),
            generation_config=generation_config
        )
        time.sleep(model_info["delay"])
        return response.text
    except Exception as e:
        print(f"Error generating abstract with {model_info['name']}: {e}")
        return ""


def process_papers(start_index=0, end_index=None):
    """Process papers and generate abstract predictions."""
    papers = load_paper_data()
    model_predictions = {model_name: [] for model_name in MODELS.keys()}
    
    papers_to_process = papers[start_index:end_index] if end_index else papers[start_index:]
    
    for i, paper in enumerate(papers_to_process, start=start_index):
        title = paper["title"]
        pdf_path = paper["file_path"]
        target_length = paper["word_count"]
        
        print(f"\nProcessing paper {i+1}/{len(papers)}: {title}")

        # Extract text from page 2 onwards
        paper_text = extract_text_after_page_one(pdf_path)
        if not paper_text:
            print(f"Skipping file {pdf_path} - no text found after page 1.")
            continue

        # Generate predictions using each model
        for model_name, model_info in MODELS.items():
            try:
                print(f"Generating abstract prediction using {model_name}...")
                predicted_abstract = predict_abstract_with_gemini(
                    paper_text, 
                    title, 
                    target_length, 
                    model_info
                )
                
                model_predictions[model_name].append({
                    "title": title,
                    "file_path": pdf_path,
                    "abstract": predicted_abstract
                })
                
                print(f"Successfully generated {model_name} abstract prediction")
                
            except Exception as e:
                print(f"Error processing paper {title} with {model_name}: {e}")

        # Save progress after each paper
        for model_name, predictions in model_predictions.items():
            output_file = f"{model_name}_abstract_predictions.jsonl"
            with open(output_file, "w") as f:
                for entry in predictions:
                    json.dump(entry, f)
                    f.write("\n")

    return model_predictions


if __name__ == "__main__":
    print("Starting abstract prediction...")
    print(f"Using temperature settings:")
    for model, config in MODELS.items():
        print(f"- {model}: temperature={config['temperature']}, delay={config['delay']}s")
    
    predictions = process_papers()
    print("\nProcessing completed. Predictions saved to individual JSONL files.")
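In this script we used Gemini-1.5-Flash and Gemini-2.0-Flash to generate summaries of the research papers. Both models were given the same task with the same prompt and settings, which keeps the comparison fair. The outputs are saved as JSONL files in a format similar to the ground-truth dataset, which keeps the evaluation simple and gives us a solid basis for analyzing each model against the summarization criteria in the next step.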
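Defining Weave evaluation metrics for summarization
To compare the ground-truth abstracts from the original papers with the abstracts generated by our Gemini models, we use several metrics in Weave. They combine traditional text-similarity scores, neural semantic similarity, and an LLM-based grading system to assess summary quality from multiple angles.
One key metric uses Gemini 1.5 Pro as an automated LLM evaluator. We prompt it to rate each generated abstract from 1 to 5 according to how accurately it captures the substance of the original abstract, considering elements such as the research objective, methodology, datasets, findings, implications, limitations, and future directions. Because this LLM-as-a-judge approach takes the meaning of each abstract into account, it goes beyond simple word overlap.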
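Since the judge returns structured output, its response should be validated before it is used as a score. The sketch below parses a JSON string and checks that the score is an integer from 1 to 5. parse_judge_score is a hypothetical helper, and the sample strings stand in for model responses rather than real Gemini output.

```python
import json
from typing import Optional


def parse_judge_score(raw: str) -> Optional[int]:
    """Return the 1-5 score from a JSON response, or None if it is invalid."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    score = payload.get("score")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        return None
    return score if 1 <= score <= 5 else None


# Made-up responses: one valid, one out of range, one wrong type, one not JSON.
samples = ['{"score": 4}', '{"score": 9}', '{"score": "high"}', "not json"]
print([parse_judge_score(s) for s in samples])
```

Returning None for invalid responses keeps failures distinguishable from a genuinely low score, which matters when scores are averaged.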
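For neural semantic similarity we use BERTScore, which uses contextual embeddings from a BERT-style model to measure the similarity between the generated and original abstracts. BERTScore is especially useful because it can capture semantic similarity even when the same concept is expressed in different words, which matters for research summaries where technical terminology varies.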
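The matching step behind BERTScore can be sketched without any model: each token is greedily matched to its most similar token in the other text by cosine similarity, and precision, recall, and F1 are averages of those best matches. The two-dimensional vectors below are made-up stand-ins for contextual embeddings, and the sketch omits refinements such as IDF weighting and baseline rescaling. The actual metric in this tutorial comes from the bert_score package.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def greedy_match_scores(cand_vecs, ref_vecs):
    """Return (precision, recall, f1) from greedy cosine matching."""
    # Precision: each candidate token is matched to its best reference token.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: each reference token is matched to its best candidate token.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy "embeddings": the candidate covers one reference direction but not the other.
reference_vectors = [(1.0, 0.0), (0.0, 1.0)]
candidate_vectors = [(1.0, 0.0)]

p, r, f1 = greedy_match_scores(candidate_vectors, reference_vectors)
print(p, r, round(f1, 3))
```

In this toy case everything the candidate says is supported by the reference (high precision), but half of the reference is left uncovered (lower recall).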
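We also use ROUGE (Recall-Oriented Understudy for Gisting Evaluation), a standard metric for summarization. ROUGE-1 measures unigram (single-word) overlap, ROUGE-2 measures bigram (two-word sequence) overlap, and ROUGE-L measures the longest common subsequence between the generated and original abstracts. Together they give insight into lexical overlap and structural similarity.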
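To show what these scores measure, here is a simplified, dependency-free sketch of the ROUGE-1 and ROUGE-L F-measures using whitespace tokenization. It is for illustration only: the evaluation script in this tutorial uses the rouge_score package, which also applies stemming and its own tokenizer, so its numbers will differ. The function names are hypothetical.

```python
from collections import Counter


def f_measure(matches: int, ref_len: int, cand_len: int) -> float:
    """Harmonic mean of precision (matches/candidate) and recall (matches/reference)."""
    if matches == 0:
        return 0.0
    precision = matches / cand_len
    recall = matches / ref_len
    return 2 * precision * recall / (precision + recall)


def rouge1_f(reference: str, candidate: str) -> float:
    """Unigram overlap F-measure (counts clipped per word)."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    overlap = sum((Counter(ref) & Counter(cand)).values())
    return f_measure(overlap, len(ref), len(cand))


def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rougeL_f(reference: str, candidate: str) -> float:
    """Longest-common-subsequence F-measure."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    return f_measure(lcs_length(ref, cand), len(ref), len(cand))


reference = "the model improves summarization quality"
candidate = "the model improves quality of summarization"
print(round(rouge1_f(reference, candidate), 3), round(rougeL_f(reference, candidate), 3))
```

The candidate contains every reference word, so ROUGE-1 is high, but the changed word order shortens the common subsequence and lowers ROUGE-L.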
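For additional insight into summary quality we implement coverage and compression scores. Coverage is computed with Jaccard similarity, comparing the word sets of the generated and original abstracts to assess how much information is preserved. A high coverage score suggests the generated summary retains most of the original's key content, but keep in mind that it is based on word overlap rather than deeper semantic similarity.
Finally, we use a compression ratio to look at length and concision. It compares the lengths of the generated and original abstracts by dividing the shorter by the longer, giving a score between 0 and 1. A score close to 1 means the generated summary's length closely matches the original; a low score means the lengths differ substantially, which can indicate over-compression with missing information or an overly long summary.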
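A small worked example of both formulas, under the definitions given above: coverage is the Jaccard similarity of the two word sets, and the compression ratio is the shorter word count divided by the longer. The helper names and sample sentences are hypothetical.

```python
def coverage(reference: str, candidate: str) -> float:
    """Jaccard similarity of the lowercase word sets."""
    ref_words = set(reference.lower().split())
    cand_words = set(candidate.lower().split())
    union = ref_words | cand_words
    return len(ref_words & cand_words) / len(union) if union else 0.0


def compression_ratio(reference: str, candidate: str) -> float:
    """Shorter word count divided by the longer one (0 when either is empty)."""
    ref_len, cand_len = len(reference.split()), len(candidate.split())
    longer = max(ref_len, cand_len)
    return min(ref_len, cand_len) / longer if longer else 0.0


reference = "we propose a new method for summarization"
candidate = "a new summarization method"

# Shared words: a, new, method, summarization (4); the union has 7 distinct words.
print(coverage(reference, candidate))
# Word counts are 7 and 4, so the ratio is 4 / 7.
print(compression_ratio(reference, candidate))
```

Both scores come out at 4/7 here even though the candidate is a reasonable paraphrase, which is why these two metrics are best read alongside the semantic ones.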
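Viewed together in Weave's evaluation dashboard, these metrics give a multidimensional picture of each Gemini model. Combining LLM-based semantic evaluation, neural similarity, and traditional overlap measures shows which model performs better overall and also where each model is strong and where it needs improvement.
Before running the following code, we recommend setting the WEAVE_PARALLELISM environment variable to a low value, depending on your system's capacity. On my M1 MacBook Pro, setting it to 1 avoided memory problems during BERTScore computation. You can set it with export WEAVE_PARALLELISM=1. Here is the code we will use for the evaluation: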
import weave
from weave import Model
import json
import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig
from time import sleep
import asyncio
from rouge_score.rouge_scorer import RougeScorer
from typing import Dict, Any
import bert_score


# Initialize Vertex AI and Weave
PROJECT_ID = "dsports-6ab79"
LOCATION = "us-central1"
vertexai.init(project=PROJECT_ID, location=LOCATION)
weave.init('abstract_metrics_eval')


class BaseJsonModel(Model):
    """Base model class for loading abstracts from JSON files."""
    abstract_file: str = ""

    def get_abstracts(self) -> dict:
        """Load abstracts from the JSON file."""
        abstracts = {}
        with open(self.abstract_file, 'r') as f:
            for line in f:
                entry = json.loads(line)
                abstracts[entry['title']] = entry['abstract']
        return abstracts


class GeminiFlash_1_5_Model(BaseJsonModel):
    """Specific model class for Gemini Flash 1.5 abstracts."""
    @weave.op
    def predict(self, title: str) -> dict:
        """Return the pre-generated abstract for a given title."""
        abstracts = self.get_abstracts()
        return {"model_output": abstracts.get(title, "")}


class GeminiFlash2Model(BaseJsonModel):
    """Specific model class for Gemini Flash 2.0 abstracts."""

    @weave.op
    def predict(self, title: str) -> dict:
        """Return the pre-generated abstract for a given title."""
        abstracts = self.get_abstracts()
        return {"model_output": abstracts.get(title, "")}


@weave.op
def bert_scorer(gt_abstract: str, model_output: dict) -> Dict[str, float]:
    """Calculate BERTScore for the abstract."""
    if not model_output or 'model_output' not in model_output:
        return {'bert_score': 0.0}
    
    try:
        P, R, F1 = bert_score.score(
            [model_output['model_output']],
            [gt_abstract],
            lang='en',
            model_type='microsoft/deberta-xlarge-mnli'
        )
        return {'bert_score': float(F1.mean())}
    except Exception as e:
        print(f"Error calculating BERTScore: {e}")
        return {'bert_score': 0.0}


@weave.op
async def gemini_scorer(gt_abstract: str, model_output: dict) -> dict:
    """Evaluate abstract using Gemini model."""
    if not model_output or 'model_output' not in model_output:
        print("Invalid model output")
        return {'gemini_score': 0}

    model_id = "gemini-1.5-pro-002"
    model = GenerativeModel(model_id)
    
    response_schema = {
        "type": "object",
        "properties": {
            "score": {"type": "integer", "minimum": 1, "maximum": 5}
        },
        "required": ["score"]
    }

    formatted_text = (
        f"Given these two research paper abstracts:\n\n"
        f"Ground Truth Abstract:\n{gt_abstract}\n\n"
        f"Generated Abstract:\n{model_output['model_output']}\n\n"
        f"Rate how well the generated abstract captures the key information from the ground truth abstract "
        f"on a scale from 1-5, where 1 is poor and 5 is excellent. Consider the research objective, "
        f"methodology, datasets, findings, implications, limitations, and future directions.\n"
        f"Respond with ONLY a JSON object in this format: {{\"score\": X}} where X is your integer rating."
    )
    
    max_attempts = 10
    attempt = 0
    base_delay = 3
    
    while attempt < max_attempts:
        print(f"evaluating with gemini attempt: {attempt}")
        try:
            response = model.generate_content(
                formatted_text,
                generation_config=GenerationConfig(
                    temperature=0.0,
                    response_mime_type="application/json",
                    response_schema=response_schema
                )
            )
            
            eval_result = json.loads(response.text)
            score = int(eval_result.get('score', 0))
            
            if not 1 <= score <= 5:
                raise ValueError(f"Invalid score: {score}")
                
            print(f"Sleeping for {base_delay}s between calls")
            sleep(base_delay)
            return {'gemini_score': score}
            
        except Exception as e:
            attempt += 1
            if "429" in str(e) and attempt < max_attempts:
                delay = base_delay * (2 ** attempt)
                print(f"Rate limit hit. Attempt {attempt}/{max_attempts}. "
                      f"Retrying in {delay}s...")
                sleep(delay)
            else:
                if attempt == max_attempts:
                    print(f"Max attempts reached")
                print(f"Error in evaluation: {e}")
                return {'gemini_score': 0}
    
    return {'gemini_score': 0}


@weave.op
def rouge_scorer(gt_abstract: str, model_output: dict) -> Dict[str, float]:
    """Calculate ROUGE scores for the abstract."""
    if not model_output or 'model_output' not in model_output:
        return {
            'rouge1_f': 0.0,
            'rouge2_f': 0.0,
            'rougeL_f': 0.0
        }
    
    try:
        scorer = RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
        scores = scorer.score(gt_abstract, model_output['model_output'])
        
        return {
            'rouge1_f': float(scores['rouge1'].fmeasure),
            'rouge2_f': float(scores['rouge2'].fmeasure),
            'rougeL_f': float(scores['rougeL'].fmeasure)
        }
    except Exception as e:
        print(f"Error calculating ROUGE scores: {e}")
        return {
            'rouge1_f': 0.0,
            'rouge2_f': 0.0,
            'rougeL_f': 0.0
        }


@weave.op
def compression_scorer(gt_abstract: str, model_output: dict) -> Dict[str, float]:
    """Calculate compression ratio of the abstract."""
    if not model_output or 'model_output' not in model_output:
        return {'compression_ratio': 0.0}
    
    try:
        gt_words = len(gt_abstract.split())
        generated_words = len(model_output['model_output'].split())
        
        compression_ratio = min(gt_words, generated_words) / max(gt_words, generated_words)
        
        return {'compression_ratio': float(compression_ratio)}
    except Exception as e:
        print(f"Error calculating compression ratio: {e}")
        return {'compression_ratio': 0.0}


@weave.op
def coverage_scorer(gt_abstract: str, model_output: dict) -> Dict[str, float]:
    """Calculate content coverage using word overlap."""
    if not model_output or 'model_output' not in model_output:
        return {'coverage_score': 0.0}
    
    try:
        gt_words = set(gt_abstract.lower().split())
        generated_words = set(model_output['model_output'].lower().split())
        
        intersection = len(gt_words.intersection(generated_words))
        union = len(gt_words.union(generated_words))
        
        coverage_score = intersection / union if union > 0 else 0.0
        
        return {'coverage_score': float(coverage_score)}
    except Exception as e:
        print(f"Error calculating coverage score: {e}")
        return {'coverage_score': 0.0}


def create_evaluation_dataset(gt_file: str):
    """Create dataset from ground truth file."""
    dataset = []
    with open(gt_file, 'r') as f:
        for line in f:
            entry = json.loads(line)
            dataset.append({
                "title": entry["title"],
                "gt_abstract": entry["abstract"]
            })
    return dataset


async def run_evaluations(gt_file: str):
    """Run separate evaluations for each model."""
    eval_dataset = create_evaluation_dataset(gt_file)
    scorers = [
        gemini_scorer,
        rouge_scorer,
        compression_scorer,
        coverage_scorer,
        bert_scorer
    ]
    
    # Create and evaluate Gemini flash 1.5 model
    print("\nEvaluating Gemini Flash 1.5 abstracts...")
    flash_1_5_model = GeminiFlash_1_5_Model(abstract_file="gemini_1_5_flash_abstract_predictions.jsonl")
    flash_1_5_evaluation = weave.Evaluation(
        dataset=eval_dataset,
        scorers=scorers
    )
    flash1_5_results = await flash_1_5_evaluation.evaluate(flash_1_5_model)
    
    # Create and evaluate Gemini Flash 2.0 model
    print("\nEvaluating Gemini Flash 2.0 abstracts...")
    flash_model = GeminiFlash2Model(abstract_file="gemini_2_flash_abstract_predictions.jsonl")
    flash_2_evaluation = weave.Evaluation(
        dataset=eval_dataset,
        scorers=scorers
    )
    flash2_results = await flash_2_evaluation.evaluate(flash_model)
    
    # Print results
    print("\nEvaluation Results:")
    print("\nGemini Flash 1.5 Results:")
    print(json.dumps(flash1_5_results, indent=2))
    print("\nGemini Flash 2.0 Results:")
    print(json.dumps(flash2_results, indent=2))
    
    return {
        "gemini_flash1_5": flash1_5_results,
        "gemini_flash2": flash2_results
    }


if __name__ == "__main__":
    gt_file = "paper_abstracts.jsonl"
    asyncio.run(run_evaluations(gt_file))
💡 If you run this evaluation without a GPU, computing BERTScore can be slow, depending on your CPU. Use a GPU if you can; if you have to run on a CPU, consider removing bert_scorer from the list of scorers.
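One way to act on this note is to build the scorer list conditionally. The sketch below is only an illustration: `gpu_available` and `build_scorers` are hypothetical helpers, not part of the tutorial code, and the check assumes BERTScore would run on PyTorch with CUDA.

```python
from typing import Callable, List, Sequence


def gpu_available() -> bool:
    """Return True only if PyTorch is installed and reports a CUDA device."""
    try:
        import torch  # optional dependency; absence is treated as "no GPU"
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


def build_scorers(
    base: Sequence[Callable],
    gpu_only: Sequence[Callable],
    has_gpu: bool,
) -> List[Callable]:
    """Always keep the cheap scorers; add the expensive ones only with a GPU."""
    return list(base) + (list(gpu_only) if has_gpu else [])


# Stand-in scorers so the sketch runs on its own.
def cheap_metric() -> None: ...
def heavy_metric() -> None: ...


cpu_scorers = build_scorers([cheap_metric], [heavy_metric], has_gpu=False)
gpu_scorers = build_scorers([cheap_metric], [heavy_metric], has_gpu=True)
```

In the evaluation script, this would correspond to something like `build_scorers([gemini_scorer, rouge_scorer, compression_scorer, coverage_scorer], [bert_scorer], gpu_available())`.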
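Our evaluation uses the Weave framework to compare the summaries generated by the two Gemini models against the ground-truth abstracts. By creating model classes that read previously generated abstracts (gemini_1_5_flash_abstract_predictions.jsonl and gemini_2_flash_abstract_predictions.jsonl), we separate model inference from evaluation. This separation has several benefits: the evaluation can be rerun many times without additional API costs, model outputs are easier to debug and analyze, and the evaluation metrics can be refined iteratively without regenerating the summaries.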
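As a minimal sketch of this decoupling, the snippet below loads pre-generated abstracts from a JSONL file once and serves them by title, so an evaluation can be rerun without calling the model again. The `CachedPredictions` class and the `title`/`abstract` field names are assumptions for illustration (they mirror the ground-truth file format read by `create_evaluation_dataset`); this is not the tutorial's `BaseJsonModel` implementation.

```python
import json
import os
import tempfile
from typing import Dict


class CachedPredictions:
    """Serve pre-generated abstracts from a JSONL file, keyed by paper title."""

    def __init__(self, path: str) -> None:
        self._by_title: Dict[str, str] = {}
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines in the file
                entry = json.loads(line)
                # Assumed record shape: {"title": ..., "abstract": ...}
                self._by_title[entry["title"]] = entry["abstract"]

    def predict(self, title: str) -> Dict[str, str]:
        """Return the cached abstract, or an empty string for unknown titles."""
        return {"model_output": self._by_title.get(title, "")}


def demo() -> Dict[str, str]:
    """Write a tiny predictions file, then read it back through the cache."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "predictions.jsonl")
        with open(path, "w", encoding="utf-8") as f:
            f.write(json.dumps({"title": "Paper A", "abstract": "We study X."}) + "\n")
        return CachedPredictions(path).predict("Paper A")
```

The returned dictionary uses the `model_output` key because that is the key the scorers above look up.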
We create two specific model classes, GeminiFlash_1_5_Model and GeminiFlash2Model, both of which inherit from BaseJsonModel. With this structure, each model shows up as a clearly separate entry in the Weave evaluation dashboard.
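Both models are evaluated with the same set of metrics: ROUGE scores, which measure lexical overlap; a coverage score, which assesses content preservation; a compression ratio for length analysis; BERTScore, which captures semantic similarity; and an LLM-judged similarity score that uses Gemini-1.5-Pro as the judge. The Weave dashboard presents these results in an interactive format, so the models can be compared in detail across individual papers and metrics.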
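To make two of the simpler metrics concrete, here is a small sketch that applies them to toy strings. The coverage function follows the same word-set overlap (intersection over union) used in coverage_scorer. The compression ratio definition (summary word count divided by source word count) is an assumption for illustration; the tutorial's compression_scorer may define it differently.

```python
def jaccard_coverage(reference: str, summary: str) -> float:
    """Word-set overlap: |intersection| / |union| of lowercased tokens."""
    ref_words = set(reference.lower().split())
    sum_words = set(summary.lower().split())
    union = ref_words | sum_words
    return len(ref_words & sum_words) / len(union) if union else 0.0


def compression_ratio(source: str, summary: str) -> float:
    """Assumed definition: summary length over source length, in words."""
    n_source = len(source.split())
    return len(summary.split()) / n_source if n_source else 0.0


reference = "the model improves summary quality"
source = "the model improves summary quality on long papers"
summary = "the model improves quality"

coverage = jaccard_coverage(reference, summary)  # 4 shared words out of 5 distinct
ratio = compression_ratio(source, summary)       # 4 summary words over 8 source words
```

Because the score divides by the union, extra words in the summary lower it just as missing words do, which is worth keeping in mind when reading the coverage numbers.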
After running the evaluation script, you can analyze the results in Weave. The screenshot below shows an example of the evaluation view, which makes it easy to see how the models compare across the different metrics.
﻿
﻿
﻿
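Measured against the ground-truth abstracts, Gemini-2.0-Flash outperforms Gemini-1.5-Flash across the main evaluation metrics, including the LLM judge score, ROUGE-1, ROUGE-2, ROUGE-L, BERTScore, and coverage. Both models remain efficient, but Gemini-2.0-Flash produces summaries with better semantic preservation and structural fidelity, indicating improved coherence and alignment with the original abstracts.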
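The same comparison can also be made programmatically from the dictionaries returned by run_evaluations. The helper below is a hedged sketch: the nested shape of the example dictionaries is hypothetical, and the numbers are synthetic placeholders rather than results from this evaluation, so inspect the actual summaries before relying on particular keys.

```python
from typing import Any, Dict


def flatten(d: Dict[str, Any], prefix: str = "") -> Dict[str, float]:
    """Flatten nested dicts into dotted keys, keeping only numeric leaves."""
    flat: Dict[str, float] = {}
    for key, value in d.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            flat[name] = float(value)
    return flat


def metric_deltas(baseline: Dict[str, Any], candidate: Dict[str, Any]) -> Dict[str, float]:
    """Return candidate minus baseline for every metric present in both."""
    base, cand = flatten(baseline), flatten(candidate)
    return {k: cand[k] - base[k] for k in sorted(base.keys() & cand.keys())}


# Synthetic placeholder summaries (hypothetical shape, made-up values).
model_a = {"coverage_scorer": {"coverage_score": {"mean": 0.25}}}
model_b = {"coverage_scorer": {"coverage_score": {"mean": 0.5}}, "extra": {"x": 1}}

deltas = metric_deltas(model_a, model_b)
```

Metrics that appear in only one summary (such as `extra.x` here) are skipped rather than treated as zero.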
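Weave Evaluations also provides a dashboard for directly comparing model outputs on each item in the dataset. This gives a much more practical sense of how different models behave when given the same input.
Choosing an LLM is a lot like hiring an employee. When hiring, you typically look not only at basics such as qualifications and experience but also at your "feel" for how well the candidate fits the role. Combining quantitative and qualitative assessment helps you pick someone who has the skills and also fits the environment and expectations. Likewise, when choosing an LLM, you should look at numbers such as performance on key metrics alongside the quality of the outputs the model actually produces.
This per-item comparison is what makes model selection easier: ML engineers can see each model's strengths and weaknesses on the same task in a practical way. Metrics provide an important quantitative foundation, while the comparison dashboard serves as the "qualitative interview" for an LLM, letting you assess the nuances of each model's approach. A screenshot of the comparison view is shown below.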
﻿
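Catching bugs with Weave
I initially ran this LLM evaluation with Gemini 1.0 Pro, an earlier model in the Gemini family. Looking through some of its outputs, I noticed that the abstracts read more like section-by-section summaries of the paper, even though my prompt asked only for an abstract. Gemini Flash, by contrast, did not seem to have this problem. Without the Weave comparison dashboard I probably would not have noticed the issue, but the dashboard made it very obvious.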
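Looking more closely at the code, the prompt itself seemed mostly fine, but because the full paper was appended after the original instructions, the model appeared to "forget" them. Here is the original prompt that caused the bug: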
def create_abstract_prompt(text, title, target_length):
    """Create a prompt for abstract generation with target length."""
    return (
        f"You are tasked with generating an abstract for a research paper titled '{title}'. "
        f"The abstract should be approximately {target_length} words long.\n\n"
        f"Generate an abstract that summarizes the key points of the paper, including the "
        f"research objective, methodology, and main findings. The abstract should be "
        f"self-contained and clearly communicate the paper's contribution. Respond only with the ABSTRACT, and NOT the title\n\n"
        f"Paper content:\n\n{text}"
    )
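As you can see, the last piece of text added to the prompt is the body of the paper. Gemini Flash evidently handles this well, but Gemini 1.0 Pro struggles with this much context. To fix it, I restated the "respond only with the abstract" instruction at the end of the prompt, which resolved the problem. The new prompt is: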
def create_abstract_prompt(text, title, target_length):
    """Create a prompt for abstract generation with target length."""
    return (
        f"You are tasked with generating an abstract for a research paper titled '{title}'. "
        f"The abstract should be approximately {target_length} words long.\n\n"
        f"Generate an abstract that summarizes the key points of the paper, including the "
        f"research objective, methodology, and main findings. The abstract should be "
        f"self-contained and clearly communicate the paper's contribution. Respond only with the ABSTRACT, and NOT the title\n\n"
        f"Paper content:\n\n{text}"
        f"\n\nRespond only with the ABSTRACT!"  # prompt 'engineering': restate the instruction after the paper text
    )
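The fix follows a simple pattern: state the key instruction both before and after a long context. The sketch below is a hypothetical helper (`sandwich_prompt` is not part of the tutorial code, and the example strings are placeholders) that applies the pattern and keeps a blank line between the context and the repeated instruction.

```python
def sandwich_prompt(instruction: str, context: str, reminder: str) -> str:
    """Place the instruction before the long context and a short reminder after it."""
    parts = [
        instruction.strip(),
        "Paper content:",
        context.strip(),
        reminder.strip(),  # repeated last so it is the final thing the model reads
    ]
    return "\n\n".join(parts)


prompt = sandwich_prompt(
    instruction="Generate an abstract of about 150 words. Respond only with the ABSTRACT.",
    context="Section 1 ... Section 2 ...",  # stand-in for the full paper text
    reminder="Respond only with the ABSTRACT!",
)
```

Joining the parts with explicit blank lines also avoids the reminder running straight into the last sentence of the paper.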
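Why choose Google Vertex AI?
Google Vertex AI provides a comprehensive platform for working with large language models, with access to a broad range of models suited to different tasks. This range lets users pick the model that best fits their requirements. The platform also integrates with the rest of Google Cloud, which simplifies deployment and data management within the Google ecosystem and improves operational efficiency.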
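Vertex AI's scalability makes it easy to adjust model deployments as demand changes, which helps maintain performance across a variety of workloads. Security and compliance are also treated as top priorities, with strong data protection measures that follow Google Cloud's strict standards to keep sensitive information safe. In addition, the high-performance infrastructure backed by Google's cloud technology supports reliable, efficient model operation and consistent results.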
Conclusion
Using Google Vertex AI and W&B Weave, we evaluated Gemini-2.0-Flash and Gemini-1.5-Flash across key metrics, including an LLM-based judge score. The analysis shows that Gemini-2.0-Flash consistently outperforms Gemini-1.5-Flash on ROUGE, coverage, and BERTScore, producing summaries that reflect the reference content more accurately. The LLM judge score also favors Gemini-2.0-Flash, suggesting improved semantic alignment and overall coherence.
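Gemini-1.5-Flash trails somewhat on these metrics but remains a competitive option for use cases that need a balance of efficiency and quality. Overall, the progress of the models available on Vertex AI, together with Weave's evaluation framework, provides meaningful insight that helps users choose the model best suited to their summarization needs, whether they prioritize accuracy, semantic richness, or processing speed.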
Related articles
Supercharging LLM summarization 
A guide to making the most of LLMs for summarization tasks
Building an LLM Python debugger agent with the new Claude 3.5 Sonnet  
Building an AI-powered coding agent with Claude 3.5 Sonnet!
Claude 3.5 Sonnet on Vertex AI: Python quickstart
Here's how to get up and running with the newest model from Anthropic
Building and evaluating a RAG system with DSPy and W&B Weave 
A guide to building a RAG system with DSPy, and evaluating it with W&B Weave.
﻿
﻿
﻿
﻿