Better Data Extraction Using Pydantic and OpenAI Function Calls

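A fresh take on structured extraction and prompt engineering. This article is an AI translation. If you spot a possible mistranslation, please let us know in the comments.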
Created on September 12|Last edited on September 12
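Introduction
Struggling to get reliable structured output from OpenAI? Fortunately, last month OpenAI introduced function calling to address exactly this problem. Function calling lets developers define a schema and get JSON back in a more transparent and accessible way.
I have been building a lightweight library called openai_function_call that builds on Pydantic and OpenAI function calling. Its goal is to keep abstraction to a minimum and let developers fully customize their solutions.
Many of the libraries you know and love also implement these ideas, including LangChain, LlamaIndex, and MarvinAI. However, the goal of this library is to minimize unnecessary abstraction and to serve as a playground for trying out useful ones.
In this article, I will look at the difficulties structured output caused in the past, explain how function calling solves them, and show how the OpenAISchema class improves the developer experience. I will also walk through examples of using these schemas to write better prompts. I have also put together a Colab in case you want to try a few things yourself: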
﻿
Whether you are an experienced developer or just getting started, this article offers useful insight into how OpenAI function calling can benefit your projects. Read on to learn more!
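Table of Contents
Introduction
Part 1: Structured Output Has Always Been Tricky for AI Engineers
Solution 1: Basic OpenAI Function Calling
Solution 2: Pydantic-Based OpenAISchema
What Are the Benefits of Using Pydantic?
Part 2: Use Cases
Citation Extraction
Query Planning and Execution
Converting Text into Dataframes
The End of Prompt Engineering?
Wrapping Up
Related Reading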
﻿
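Part 1: Structured Output Has Always Been Tricky for AI Engineers
Getting structured data such as JSON out of an LLM has been a challenge for many AI engineers since the early days. Structured data is a make-or-break capability for many use cases, and OpenAI's function calling mechanism is a differentiator that is likely to lead more developers to choose OpenAI.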
﻿
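Before function calling, you not only had to beg the AI for structured data, but once you finally got a raw string back, you had to hope that your regex would match and that json.loads would not blow up. On top of that, you hoped no data was missing, hallucinated, or invalid, and by the time everything was barely working, a new, chattier GPT version would come out and break it all again.
Solution 1: Basic OpenAI Function Calling
Once function calling arrived, I realized I no longer needed bizarre prompts or regex to parse data. Instead, I only had to define a JSON Schema. But there is a catch: I do not want to write JSON Schema. It is finicky, and writing a complex schema from scratch is tedious.
If you have a User with a name and an age, the schema is very simple: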
functions = [{
 "name": "ExtractUser",
 "description": "Extract User information",
 "parameters": {
  "type": "object",
  "properties": {
   "name": {
    "type": "string"
   },
   "age": {
    "type": "integer"
   }
  },
  "required": ["age","name"]
 }
}]
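For context, here is roughly what comes back when a schema like this is sent along with a request. This is a minimal sketch, assuming the legacy ChatCompletion response shape in which the arguments arrive as a JSON string under `function_call`. The `mock_response` dictionary is hand-written for illustration and is not real API output.

```python
import json

# Hand-written stand-in for an API response (an assumption for illustration).
mock_response = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": None,
                "function_call": {
                    "name": "ExtractUser",
                    # The arguments arrive as a JSON *string*, not a dict.
                    "arguments": '{"name": "Jason", "age": 25}',
                },
            }
        }
    ]
}

function_call = mock_response["choices"][0]["message"]["function_call"]
user = json.loads(function_call["arguments"])  # still an untyped dict
print(function_call["name"], user)
```

Note that `json.loads` only guarantees valid JSON. Nothing here checks that `age` is actually an integer, which is the gap the rest of this article is about.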
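What if you want to do something more complex? An example I worked on last week was extracting multiple search queries from a customer request. The search queries could target several sources, such as videos, documents, and transcripts. To improve performance, I also added richer descriptions to better guide the model.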
functions = [{
 "name": "MultiSearch",
 "description": "Correct segmentation of `Search` tasks", # prompting 
 "parameters": {
  "type": "object",
  "properties": {
   "tasks": {
    "type": "array",
    "items": {"$ref": "#/definitions/Search"}
   }
  },
  "definitions": {
   "Source": {
    "description": "An enumeration.",
    "enum": ["VIDEO","TRANSCRIPT","DOCUMENT"]
   },
   "Search": {
    "type": "object",
     "properties": {
      "title": {"type": "string"},
     "query": {
      "description": "Detailed, comprehensive, and specific query to be used for semantic search", # prompting 
      "type": "string"
     },
     "source": { "$ref": "#/definitions/Source"}
    },
    "required": ["title","query","source"]
   }
  },
  "required": ["tasks"]
 }
}]
The more output you need to trust, the messier this kind of code gets. And once you have the data, how do you make sure it is actually valid? Passing Python dictionaries around and hoping the data is right is not reliable.
What happens when a new source comes along?
import json

# Writing code like this gets crazier the more you need to trust your outputs
data = json.loads(completion...["function_call"]["arguments"])
assert "tasks" in data, "tasks missing"
for task in data["tasks"]:
    assert isinstance(task, dict), "task is not a dict"
    assert set(task.keys()) == {"title", "query", "source"}, "task missing keys"
    assert task["source"] in {"VIDEO", "TRANSCRIPT", "DOCUMENT"}, "source not valid"
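Solution 2: Pydantic-Based OpenAISchema
Instead of writing schemas as untyped dictionaries and passing data around as dictionaries, Pydantic lets you define models in code and generate the schema from them. Beyond that, you can parse the JSON back into the model, which gives you easy access to objects, their attributes, and even their methods. Here, OpenAISchema extends Pydantic's BaseModel and adds a few helper methods.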
from openai_function_call import OpenAISchema
from pydantic import Field
import enum
import openai
﻿
class Source(enum.Enum):
    VIDEO = "VIDEO"
    TRANSCRIPT = "TRANSCRIPT"
    DOCUMENT = "DOCUMENT"
﻿
class Search(OpenAISchema):
    title: str
    query: str = Field(
        ..., description="Detailed, comprehensive, and specific query to be used for semantic search",
    )
    source: Source

    def search(self):
        # any logic can go here since it's just Python
        return f"Fake results: `{self.query}` from {self.source}"
﻿
class MultiSearch(OpenAISchema):
    "Correct segmentation of `Search` tasks"
    tasks: list[Search]
﻿
completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    functions=[MultiSearch.openai_schema],
    function_call={"name": MultiSearch.openai_schema["name"]},
    messages=[
        {
            "role": "user",
            "content": "Can you show me the cat video you found last week and the onboarding documents as well?",
        },
    ],
)
﻿
response = MultiSearch.from_response(completion)
response 
>>> MultiSearch(tasks=[
	Search(title="cat videos", query="cat videos from last week", source=Source.VIDEO),
	Search(title="onboarding documents", query="onboarding documents", source=Source.DOCUMENT),
])
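What Are the Benefits of Using Pydantic?
Instead of writing JSON Schema as plain Python dictionaries, pydantic.BaseModel, and by extension OpenAISchema, lets you model your data as Python objects. Pydantic has a great ecosystem of tools trusted by thousands of developers, along with excellent documentation. With new features coming on its roadmap, there is plenty of room to improve your LLM stack further without writing a single line of custom code.
Code Becomes the Prompt
By running Schema.openai_schema, you can see exactly what the API will see. Notice that the docstring, attributes, types, and field descriptions are now part of the schema.
When the prompt, the model, and the deserialization live together with your code instead of being kept apart, the prompt becomes better documentation for both users and the AI.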
from openai_function_call import OpenAISchema
from pydantic import Field 
﻿
class ExtractUser(OpenAISchema):
    "Correctly extracted user information" #(1)
    name: str = Field(..., description="User's full name") #(2)
    age: int
﻿
>>> ExtractUser.openai_schema
{
    "name": "ExtractUser",
    "description": "Correctly extracted user information", #(1)
    "parameters": {
        "type": "object",
        "properties": {
            "name": {
                "description": "User's full name", #(2)
                "type": "string"
            },
            "age": {
                "type": "integer"
            }
        },
        "required": ["age","name"]
    }
}
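To make the "code becomes the prompt" idea concrete without depending on any library, here is a rough sketch that uses standard-library dataclasses as a stand-in for Pydantic. It is not how openai_function_call or Pydantic is implemented. The helper `to_function_schema` and the `JSON_TYPES` table are hypothetical names introduced only for this illustration.

```python
import dataclasses
from dataclasses import dataclass, field

# Map a few Python types to JSON Schema type names (deliberately incomplete).
JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def to_function_schema(cls) -> dict:
    """Build an OpenAI-style function schema from a dataclass (hypothetical helper)."""
    properties = {}
    for f in dataclasses.fields(cls):
        prop = {"type": JSON_TYPES[f.type]}
        # Field descriptions travel with the code via dataclass metadata.
        if "description" in f.metadata:
            prop["description"] = f.metadata["description"]
        properties[f.name] = prop
    return {
        "name": cls.__name__,
        # The class docstring doubles as the function description.
        "description": (cls.__doc__ or "").strip(),
        "parameters": {
            "type": "object",
            "properties": properties,
            "required": [f.name for f in dataclasses.fields(cls)],
        },
    }


@dataclass
class ExtractUser:
    "Correctly extracted user information"
    name: str = field(metadata={"description": "User's full name"})
    age: int


schema = to_function_schema(ExtractUser)
print(schema)
```

The point of the sketch is only that the docstring and field descriptions end up in the payload the model sees, so editing the class is editing the prompt.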
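Pydantic Can Validate the Output
When you use Schema.from_response to extract data from the completion response, Pydantic validates it against the schema and returns a Python object instead of a dictionary. This makes it easy to write reliable code that takes advantage of type hints and validation errors.
Going back to the MultiSearch example, you can confirm that every attribute (for example, tasks and source) has the correct type. And if you use an IDE, you even get autocomplete suggestions!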
response = MultiSearch.from_response(completion)
assert isinstance(response, MultiSearch)
assert isinstance(response.tasks[0], Search)
assert isinstance(response.tasks[0].source, Source)
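The failure case is just as useful as the success case. The sketch below uses only the standard library as a stand-in for what a validating parser does with an out-of-range value. `Search.from_dict` is a hypothetical helper written for this illustration, not part of Pydantic or openai_function_call, and Pydantic's real error type differs.

```python
import enum
from dataclasses import dataclass


class Source(enum.Enum):
    VIDEO = "VIDEO"
    TRANSCRIPT = "TRANSCRIPT"
    DOCUMENT = "DOCUMENT"


@dataclass
class Search:
    query: str
    source: Source

    @classmethod
    def from_dict(cls, raw: dict) -> "Search":
        # Hypothetical parser: fail loudly instead of passing bad data along.
        if not isinstance(raw.get("query"), str):
            raise ValueError("query must be a string")
        # Source(...) raises ValueError for values outside the enum.
        return cls(query=raw["query"], source=Source(raw["source"]))


good = Search.from_dict({"query": "cat videos", "source": "VIDEO"})

try:
    Search.from_dict({"query": "cat videos", "source": "PODCAST"})
    rejected = False
except ValueError:
    rejected = True

print(good, rejected)
```

An invalid source is stopped at the parsing boundary, so downstream code never has to re-check it.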
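Complex Schemas Can Be Visualized with Erdantic
Because Pydantic (JSON Schema) supports complex nesting, and because OpenAISchema extends pydantic.BaseModel, you can tap into the whole ecosystem of tools, such as LangChain, LlamaIndex, and Marvin, and use Erdantic to visualize your models. This is a great opportunity to get feedback and improve your documentation further.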
﻿
Object-Oriented Programming
Piling 45 methods onto a schema class is not a great idea, but this does mean you can write methods on the class that perform computation in a structured way:
# adding methods to your schema is totally fine
# (video_index and doc_index are assumed to be defined elsewhere)
class Search(OpenAISchema):
    ...
    def search(self) -> list:
        if self.source == Source.VIDEO:
            return video_index.search(self.query)
        elif self.source == Source.DOCUMENT:
            return doc_index.search(self.query)

ms = MultiSearch.from_response(completion)
>>> [s.search() for s in ms.tasks]
You can also simply treat the schema as a struct. Type hints make the code more readable, too!
def search(query: Search) -> list:
    if query.source == Source.VIDEO:
        return video_index.search(query.query)
    elif query.source == Source.DOCUMENT:
        return doc_index.search(query.query)

ms = MultiSearch.from_response(completion)
>>> [search(s) for s in ms.tasks]
Now that we have covered the overall philosophy, let's look at a few examples:
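Part 2: Use Cases
As a reminder, if you would like to run the examples yourself, check out the Colab.
Citation Extraction
This example shows how to use OpenAI Function Call to ask an AI a question and get back an answer with exact citations. We will use Pydantic to define the necessary data structures and demonstrate how to look up the citation for each answer.
Motivation
When using AI models to answer questions, it is important to provide accurate and trustworthy information with proper citations. Including a source for each statement ensures the information is backed by reliable evidence and lets readers verify it for themselves.
Defining the Data Structures
Let's first define the data structures this task needs: Fact and QuestionAnswer.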
from pydantic import Field
from openai_function_call import OpenAISchema
﻿
﻿
class Fact(OpenAISchema):
    """
    Each fact has a body and a list of sources.
    If there are multiple facts, make sure to break them apart such that each one only uses a set of sources that are relevant to it.
    """
﻿
    fact: str = Field(..., description="Body of the sentence as part of a response")
    substring_quote: list[str] = Field(
        ...,
        description="Each source should be a direct quote from the context, as a substring of the original content",
    )
﻿
    def _get_span(self, quote, context, errs=100):
        import regex
﻿
        minor = quote
        major = context
﻿
        errs_ = 0
        s = regex.search(f"({minor}){{e<={errs_}}}", major)
        while s is None and errs_ <= errs:
            errs_ += 1
            s = regex.search(f"({minor}){{e<={errs_}}}", major)
﻿
        if s is not None:
            yield from s.spans()
﻿
    def get_spans(self, context):
        for quote in self.substring_quote:
            yield from self._get_span(quote, context)
﻿
﻿
class QuestionAnswer(OpenAISchema):
    """
    Class representing a question and its answer as a list of facts, where each fact should have a source.
    Each sentence contains a body and a list of sources.
    """
﻿
    question: str = Field(..., description="Question that was asked")
    answer: list[Fact] = Field(
        ...,
        description="Body of the answer, each fact should be its separate object with a body and a list of sources",
    )
As in the search example, we implement a method called `get_spans` that helps find exactly where each quote appears in the original text. Now let's define a function that calls OpenAI and see what results we get.
def ask_ai(question: str, context: str) -> QuestionAnswer:
    # Making a request to the hypothetical 'openai' module
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",
        temperature=0.2,
        max_tokens=1000,
        functions=[QuestionAnswer.openai_schema],
        function_call={"name": QuestionAnswer.openai_schema["name"]},
        messages=[
            {
                "role": "system",
                "content": f"You are a world class algorithm to answer questions with correct and exact citations. ",
            },
            {"role": "user", "content": f"Answer question using the following context"},
            {"role": "user", "content": f"{context}"},
            {"role": "user", "content": f"Question: {question}"},
            {
                "role": "user",
                "content": f"Tips: Make sure to cite your sources, and use the exact words from the context.",
            },
        ],
    )
﻿
    # Creating an Answer object from the completion response
    return QuestionAnswer.from_response(completion)
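Evaluating the Citations
Let's evaluate the example. We will ask the AI a question and get back an answer with citations. Using the given context, we will ask, “What did the author do during college?”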
def highlight(text, span):
    return ...
﻿
question = "What did the author do during college?"
context = """
My name is Jason Liu, and I grew up in Toronto Canada but I was born in China.
I went to an arts high school but in university I studied Computational Mathematics and physics.
As part of coop I worked at many companies including Stitchfix, Facebook.
I also started the Data Science club at the University of Waterloo and I was the president of the club for 2 years.
"""
﻿
answer = ask_ai(question, context)
﻿
print("Question:", question)
print()
for fact in answer.answer:
    print("Statement:", fact.fact)
    for span in fact.get_spans(context):
        print("Citation:", highlight(context, span))
    print()
﻿
Question: What did the author do during college?
﻿
Statement: The author studied Computational Mathematics and physics in university.
Citation: ...rts high school but <in university I studied Computational Mathematics and physics> .As part of coop I ...
﻿
Statement: The author started the Data Science club at the University of Waterloo.
Citation: ...titchfix, Facebook.<I also started the Data Science club at the University of Waterloo>  and I was the presi...
﻿
Statement: The author was the president of the Data Science club for 2 years.
Citation: ...ity of Waterloo and <I was the president of the club for 2 years> ....
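The body of `highlight` is elided in the listing above. One possible version, offered purely as a sketch, wraps the cited span in angle brackets and shows a little surrounding context. The `window` parameter and the exact output format are assumptions inferred from the printed citations, not the author's code.

```python
def highlight(text: str, span: tuple, window: int = 20) -> str:
    """Wrap the cited span in <> and show a little context on each side."""
    start, end = span
    before = text[max(0, start - window):start]
    after = text[end:end + window]
    snippet = f"...{before}<{text[start:end]}>{after}..."
    # Collapse newlines so each citation prints on a single line.
    return snippet.replace("\n", " ")


context = "I went to an arts high school but in university I studied physics."
quote = "in university I studied physics"
start = context.index(quote)
result = highlight(context, (start, start + len(quote)))
print(result)
```

Here the span is located with `str.index` for simplicity, whereas the article's `get_spans` uses fuzzy matching to tolerate small differences in the quote.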
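Query Planning and Execution
This example shows how to use the OpenAI Function Call ChatCompletion model to plan and execute queries in a question-answering system. By decomposing a complex question into smaller sub-questions with explicit dependencies, the system can systematically gather the information needed to answer the main question.
Motivation
The goal of this example is to show how query planning can handle complex questions, facilitate iterative information gathering, automate workflows, and optimize processes. With the OpenAI Function Call model, you can build and execute a systematic plan to find answers effectively.
Use Cases
Complex question answering
Iterative information gathering
Workflow automation
Process optimization
With the OpenAI Function Call model, you can customize the planning process and integrate it into specific applications to meet your own requirements.
Defining the Data Structures
Let's define the Pydantic models needed to represent a query plan and its individual queries.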
import enum
from typing import List


class QueryType(str, enum.Enum):
    """Enumeration representing the types of queries that can be asked to a question answer system."""
﻿
    SINGLE_QUESTION = "SINGLE"
    MERGE_MULTIPLE_RESPONSES = "MERGE_MULTIPLE_RESPONSES"
﻿
﻿
class Query(OpenAISchema):
    """Class representing a single question in a query plan."""
﻿
    id: int = Field(..., description="Unique id of the query")
    question: str = Field(
        ...,
        description="Question asked using a question answering system",
    )
    dependancies: List[int] = Field(
        default_factory=list,
        description="List of sub questions that need to be answered before asking this question",
    )
    node_type: QueryType = Field(
        default=QueryType.SINGLE_QUESTION,
        description="Type of question, either a single question or a multi-question merge",
    )
﻿
﻿
class QueryPlan(OpenAISchema):
    """Container class representing a tree of questions to ask a question answering system."""
﻿
    query_graph: List[Query] = Field(
        ..., description="The query graph representing the plan"
    )
﻿
    def _dependencies(self, ids: List[int]) -> List[Query]:
        """Returns the dependencies of a query given their ids."""
        return [q for q in self.query_graph if q.id in ids]
Planning the Query
Now let's demonstrate how to plan a query using the models we defined and the OpenAI API.
def query_planner(question: str) -> QueryPlan:
    PLANNING_MODEL = "gpt-4-0613"
﻿
    messages = [
        {
            "role": "system",
            "content": "You are a world class query planning algorithm capable of breaking apart questions into its dependency queries such that the answers can be used to inform the parent question. Do not answer the questions, simply provide a correct compute graph with good specific questions to ask and relevant dependencies. Before you call the function, think step-by-step to get a better understanding of the problem.",
        },
        {
            "role": "user",
            "content": f"Consider: {question}\nGenerate the correct query plan.",
        },
    ]
﻿
    completion = openai.ChatCompletion.create(
        model=PLANNING_MODEL,
        temperature=.2,
        functions=[QueryPlan.openai_schema],
        function_call={"name": QueryPlan.openai_schema["name"]},
        messages=messages,
        max_tokens=1000,
    )
    return QueryPlan.from_response(completion)
Now we can ask a question that requires multi-hop reasoning:
plan = query_planner(
    "What is the difference in populations of Canada and the Jason's home country?"
)
plan.dict()
And see how the query was decomposed:
{'query_graph': [{'id': 1,
   'question': 'What is the population of Canada?',
   'dependancies': [],
   'node_type': <QueryType.SINGLE_QUESTION: 'SINGLE'>},
  {'id': 2,
   'question': "What is Jason's home country?",
   'dependancies': [],
   'node_type': <QueryType.SINGLE_QUESTION: 'SINGLE'>},
  {'id': 3,
   'question': "What is the population of Jason's home country?",
   'dependancies': [2],
   'node_type': <QueryType.SINGLE_QUESTION: 'SINGLE'>},
  {'id': 4,
   'question': "What is the difference in populations of Canada and Jason's home country?",
   'dependancies': [1, 3],
   'node_type': <QueryType.MERGE_MULTIPLE_RESPONSES: 'MERGE_MULTIPLE_RESPONSES'>}]}
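This example builds a query plan, but it does not propose how to actually answer the questions. Instead, you can implement your own answer function that performs retrieval and calls OpenAI for retrieval-augmented generation. That step would also use function calls, but it is beyond the scope of this example.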
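For readers who want a starting point, the execution side could look roughly like the sketch below: answer each query only after its dependencies have answers. This is an illustration under stated assumptions, not part of the library. `execute_plan` and `fake_answer` are hypothetical names, a plain dataclass stands in for the Pydantic `Query` model, and the answer function is a placeholder for real retrieval plus an LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Query:
    id: int
    question: str
    # Spelling follows the field name used in the article's schema.
    dependancies: List[int] = field(default_factory=list)


def execute_plan(
    queries: List[Query],
    answer_fn: Callable[[str, Dict[int, str]], str],
) -> Dict[int, str]:
    """Answer each query once all of its dependencies have answers."""
    answers: Dict[int, str] = {}
    pending = list(queries)
    while pending:
        ready = [q for q in pending if all(d in answers for d in q.dependancies)]
        if not ready:
            raise ValueError("query plan has a cycle or an unknown dependency")
        for q in ready:
            upstream = {d: answers[d] for d in q.dependancies}
            answers[q.id] = answer_fn(q.question, upstream)
        pending = [q for q in pending if q.id not in answers]
    return answers


def fake_answer(question: str, upstream: Dict[int, str]) -> str:
    # Placeholder for retrieval + an LLM call; it only records what it was given.
    return f"answer({question!r}, uses={sorted(upstream)})"


plan = [
    Query(1, "What is the population of Canada?"),
    Query(2, "What is Jason's home country?"),
    Query(3, "What is the population of Jason's home country?", [2]),
    Query(4, "What is the difference in populations?", [1, 3]),
]
answers = execute_plan(plan, fake_answer)
for query_id, text in answers.items():
    print(query_id, text)
```

Queries whose dependencies are already satisfied are independent of each other, so a real implementation could run each `ready` batch concurrently.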
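Converting Text into Dataframes
This example demonstrates how to use OpenAI Function Call to convert text into dataframes. We define the necessary data structures with Pydantic and walk through the conversion.
Motivation
When parsing data, we often have the opportunity to extract structured data. What if we could extract an arbitrary number of tables with arbitrary schemas? Extracted dataframes can be stored as tables or CSV files and attached to the retrieved data.
Defining the Data Structures
We first define the data structures this task needs: RowData, Dataframe, and Database.
Read through the prompts and descriptions carefully to get a feel for how to prompt effectively.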
from typing import Any, List
﻿
class RowData(OpenAISchema):
    row: list[Any] = Field(..., description="The values for each row")
    citation: str = Field(
        ..., description="The citation for this row from the original source data"
    )
﻿
﻿
class Dataframe(OpenAISchema):
    """
    Class representing a dataframe. This class is used to convert
    data into a frame that can be used by pandas.
    """
﻿
    name: str = Field(..., description="The name of the dataframe")
    data: List[RowData] = Field(
        ...,
        description="Correct rows of data aligned to column names, Nones are allowed",
    )
    columns: list[str] = Field(
        ...,
        description="Column names relevant from source data, should be in snake_case",
    )
﻿
    def to_pandas(self):
        import pandas as pd
﻿
        columns = self.columns + ["citation"]
        data = [row.row + [row.citation] for row in self.data]
﻿
        return pd.DataFrame(data=data, columns=columns)
﻿
﻿
class Database(OpenAISchema):
    """
    A set of correct named and defined tables as dataframes
    """
﻿
    tables: list[Dataframe] = Field(
        ...,
        description="List of tables in the database",
    )
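The RowData class represents a single row in a dataframe. It has a row attribute that holds the values for the row and a citation attribute that holds the citation from the original source data.
The Dataframe class represents a dataframe. It consists of a name attribute, a list of RowData objects in the data attribute, and a list of column names in the columns attribute. It also provides a to_pandas method that converts the dataframe into a Pandas DataFrame.
The Database class represents a set of tables in a database. It holds a list of Dataframe objects in the tables attribute.
Now, as usual, let's define our own extraction function and see what we get.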
def dataframe(data: str) -> Database:
    completion = openai.ChatCompletion.create(
        model="gpt-4-0613", # Notice I have to use gpt-4 here, this task is pretty hard
        temperature=0.1,
        functions=[Database.openai_schema],
        function_call={"name": Database.openai_schema["name"]},
        messages=[
            {
                "role": "system",
                "content": """Map this data into a dataframe and correctly define the correct columns and rows""",
            },
            {
                "role": "user",
                "content": f"{data}",
            },
        ],
        max_tokens=1000,
    )
    return Database.from_response(completion)
Evaluating the Extraction
Let's evaluate the example by using the dataframe function to convert text into dataframes and printing the results.
dfs = dataframe("""My name is John and I am 25 years old. I live in
New York and I like to play basketball. His name is
Mike and he is 30 years old. He lives in San Francisco
and he likes to play baseball. Sarah is 20 years old
and she lives in Los Angeles. She likes to play tennis.
Her name is Mary and she is 35 years old.
She lives in Chicago.
﻿
On one team 'Tigers' the captain is John and there are 12 players.
On the other team 'Lions' the captain is Mike and there are 10 players.
""")
for df in dfs.tables:
    print(df.to_pandas())
From the example, we got two extracted dataframes!
﻿
    People
    Name  Age           City Favorite Sport
    0   John   25       New York     Basketball
    1   Mike   30  San Francisco       Baseball
    2  Sarah   20    Los Angeles         Tennis
    3   Mary   35        Chicago           None
﻿
    Teams
    Team Name Captain  Number of Players
    0    Tigers    John                 12
    1     Lions    Mike                 10
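The motivation mentioned storing extracted tables as CSV files. A minimal standard-library sketch of that step might look like the following. `table_to_csv` is a hypothetical helper, and the rows and citations are hand-written to mirror the Teams table rather than taken from an actual model response.

```python
import csv
import io


def table_to_csv(columns, rows, citations) -> str:
    """Serialize one extracted table to CSV, keeping each row's citation."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(list(columns) + ["citation"])
    for row, citation in zip(rows, citations):
        # Guard against rows the model failed to align with the columns.
        if len(row) != len(columns):
            raise ValueError(f"row {row!r} does not match columns {columns!r}")
        writer.writerow(list(row) + [citation])
    return buffer.getvalue()


csv_text = table_to_csv(
    columns=["team_name", "captain", "number_of_players"],
    rows=[["Tigers", "John", 12], ["Lions", "Mike", 10]],
    citations=[
        "On one team 'Tigers' the captain is John and there are 12 players.",
        "On the other team 'Lions' the captain is Mike and there are 10 players.",
    ],
)
print(csv_text)
```

Keeping the citation as a column means every stored row can still be traced back to the sentence it came from.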
﻿
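The End of Prompt Engineering?
No, it is not.
When you build your own examples, how you choose variable names, docstrings, and descriptions matters a great deal. Now that both humans and the AI rely on names and documentation, they matter even more.
Tips for Writing Good Schemas
If your schema is not being parsed correctly, consider the following tips:
1. Avoid generic attribute names.
2. Write a docstring for every class.
3. Include tips and short examples in docstrings where needed.
4. Adjectives in descriptions matter. For example, a “short and concise” query is different from a “detailed and specific” query that includes additional keywords.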
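Some of these tips can be checked mechanically. The sketch below is a small, hypothetical linter over a function schema dictionary. `lint_schema` and the `GENERIC_NAMES` list are illustrative assumptions rather than part of any library, and the check cannot judge whether a description is actually good.

```python
# Names that say little about what a field holds (illustrative, not exhaustive).
GENERIC_NAMES = {"data", "value", "info", "item", "result", "text"}


def lint_schema(schema: dict) -> list:
    """Return human-readable warnings for a function schema (hypothetical helper)."""
    problems = []
    if not schema.get("description"):
        problems.append(f"{schema.get('name', '<unnamed>')}: missing description")
    properties = schema.get("parameters", {}).get("properties", {})
    for name, prop in properties.items():
        if name in GENERIC_NAMES:
            problems.append(f"{name}: generic attribute name")
        if not prop.get("description"):
            problems.append(f"{name}: missing description")
    return problems


schema = {
    "name": "ExtractUser",
    "description": "",
    "parameters": {
        "type": "object",
        "properties": {
            "data": {"type": "string"},
            "age": {"type": "integer", "description": "Age in years"},
        },
    },
}
problems = lint_schema(schema)
for problem in problems:
    print(problem)
```

A check like this could run in a test suite so that a schema without docstrings or descriptions is caught before it reaches the model.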
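Wrapping Up
This article shows how to use OpenAI Function Call for a variety of tasks, including citation extraction, query planning and execution, and converting text into dataframes. It also stresses the importance of using Pydantic to define clear, specific data structures and of providing accurate, detailed documentation.
Language models can automate many tasks, but they do not replace human expertise and judgment. Modeling data structures with Pydantic is a great way to start a conversation about how to design output structures and how to express code, schema, and prompt together.
Prompt engineering remains a key step in using AI models effectively to solve problems and answer questions. Presenting clear, specific data structures and providing accurate, detailed documentation is essential to building successful applications.
Related Reading
Visit the docs to learn more about our latest feature, streaming multitask.
If you want to try this in TypeScript, check out Microsoft's TypeChat.
Also find me on Twitter, or rather X.com: @jxnlco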
Building Advanced Query Engine and Evaluation with LlamaIndex and W&B
This report showcases a few cool evaluation strategies and touches upon a few advanced features in LlamaIndex that can be used to build LLM-based QA bots. It also shows, the usefulness of W&B for building such a system.
What Do LLMs Say When You Tell Them What They Can't Say?
An exploration of token banning on GPT's vocabulary.
How to Run LLMs Locally With llama.cpp and GGML
This article explores how to run LLMs locally on your computer using llama.cpp — a repository that enables you to run a model locally in no time with consumer hardware. 
Prompt Engineering LLMs with LangChain and W&B
Join us for tips and tricks to improve your prompt engineering for LLMs. Then, stick around and find out how LangChain and W&B can make your life a whole lot easier.
﻿
﻿