apify/langchain-apify

πŸŽ‰ Apify MCP server released! πŸŽ‰

Apify has released its MCP (Model Context Protocol) server, which offers more features. You can use it through the LangChain MCP Adapter. It allows you to run Apify Actors, access Apify storage, search and read Apify documentation, and much more.

πŸ‘‰ https://mcp.apify.com πŸ‘ˆ


LangChain Apify: A full-stack scraping platform built on Apify's infrastructure and LangChain's AI tools. Maintained by Apify.



Build web scraping and automation workflows in Python by connecting Apify Actors with LangChain. This package gives you programmatic access to Apify's infrastructure: run scraping tasks, handle datasets, and use the API directly through LangChain's tools.

Agentic LLMs

If you are an agent or an LLM, refer to the llms.txt file to get package context and learn how to work with this package.

Installation

pip install langchain-apify

Prerequisites

You should configure credentials by setting the following environment variables:

  • APIFY_API_TOKEN - Apify API token

Register your free Apify account here and learn how to get your API token in the Apify documentation.
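Because every client in this package reads APIFY_API_TOKEN at construction time, it can help to fail fast when the variable is missing. A minimal stdlib-only sketch of that pattern (the helper name is our own, not part of the package):

```python
import os


def require_apify_token() -> str:
    """Return the Apify API token from the environment, or fail fast."""
    token = os.environ.get("APIFY_API_TOKEN")
    if not token:
        raise RuntimeError(
            "APIFY_API_TOKEN is not set; create a token in the Apify console "
            "and export it before running."
        )
    return token
```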

Tools

ApifyActorsTool class provides access to Apify Actors, which are cloud-based web-scraping and automation programs that you can run without managing any infrastructure. For more detailed information, see the Apify Actors documentation.

ApifyActorsTool is useful when you need to run an Apify Actor as a tool in LangChain. You can use the tool to interact with the Actor manually or as part of an agent workflow.

Example usage of ApifyActorsTool with the RAG Web Browser Actor, which searches for information on the web:

import os
import json
from langchain_apify import ApifyActorsTool

os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"
os.environ["APIFY_API_TOKEN"] = "YOUR_APIFY_API_TOKEN"

browser = ApifyActorsTool("apify/rag-web-browser")
search_results = browser.invoke(input={
    "run_input": {"query": "what is Apify Actor?", "maxResults": 3}
})

# use the tool with an agent
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

model = ChatOpenAI(model="gpt-4o-mini")
tools = [browser]
agent = create_react_agent(model, tools)

for chunk in agent.stream(
    {"messages": [("human", "search for what is Apify?")]},
    stream_mode="values"
):
    chunk["messages"][-1].pretty_print()
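The tool returns the Actor run's dataset items. A minimal sketch of post-processing those items locally, assuming each item carries a metadata dict with title and url plus a markdown field (the schema belongs to the RAG Web Browser Actor, so treat these keys as assumptions and check the Actor's documentation):

```python
def summarize_results(items: list[dict]) -> list[str]:
    """Format RAG Web Browser items as one-line summaries (assumed schema)."""
    lines = []
    for item in items:
        meta = item.get("metadata", {})
        title = meta.get("title", "untitled")
        url = meta.get("url", "unknown URL")
        size = len(item.get("markdown") or "")
        lines.append(f"{title} ({url}, {size} chars)")
    return lines


sample = [{
    "metadata": {"title": "Apify Actors", "url": "https://docs.apify.com/actors"},
    "markdown": "# Actors\nServerless cloud programs...",
}]
print(summarize_results(sample))
```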

Document loaders

⚠️ Note for Actor Developers: If you're building an Apify Actor, use Actor.open_dataset() from the Apify SDK instead of this loader. See the Note for Apify Actor developers section for details.

ApifyDatasetLoader class provides access to Apify datasets as document loaders. Datasets store results from web scraping, crawling, or data processing runs.

ApifyDatasetLoader is useful when you need to process data from an Apify Actor run from outside the Actor runtime (e.g., in an external script, notebook, or application). If you are extracting webpage content, you would typically use this loader after running an Apify Actor manually from the Apify console, where you can access the results stored in the dataset.

Example usage of ApifyDatasetLoader with a custom dataset mapping function that converts each dataset item into a Document object containing the page content and source URL:

import os
from langchain_apify import ApifyDatasetLoader
from langchain_core.documents import Document

os.environ["APIFY_API_TOKEN"] = "YOUR_APIFY_API_TOKEN"

# Example dataset structure
# [
#     {
#         "text": "Example text from the website.",
#         "url": "http://example.com"
#     },
#     ...
# ]

loader = ApifyDatasetLoader(
    dataset_id="your-dataset-id",
    dataset_mapping_function=lambda dataset_item: Document(
        page_content=dataset_item["text"],
        metadata={"source": dataset_item["url"]}
    ),
)
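Dataset items scraped from real pages can miss fields, so mapping functions often guard against absent keys and None values. A stdlib-only sketch of that pattern, with a plain dataclass standing in for langchain's Document so it runs without extra dependencies (the stand-in and sample items are ours):

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for langchain_core.documents.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def map_item(item: dict) -> Doc:
    """Map a raw dataset item to a document, tolerating missing fields."""
    return Doc(
        page_content=item.get("text") or "",
        metadata={"source": item.get("url", "")},
    )


docs = [map_item(i) for i in [
    {"text": "Example text from the website.", "url": "http://example.com"},
    {"url": "http://example.com/empty"},  # no "text" key
]]
```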

Wrappers

ApifyWrapper class wraps the Apify API to easily convert Apify datasets into documents. It is useful when you need to run an Apify Actor programmatically and process the results in LangChain. Available methods include:

  • call_actor: Runs an Apify Actor and returns an ApifyDatasetLoader for the results.
  • acall_actor: Asynchronous version of call_actor.
  • call_actor_task: Runs a saved Actor task and returns an ApifyDatasetLoader for the results. Actor tasks allow you to create and reuse multiple configurations of a single Actor for different use cases.
  • acall_actor_task: Asynchronous version of call_actor_task.

For more information, see the Apify LangChain integration documentation.

Example usage of call_actor, running the Website Content Crawler Actor to extract content from webpages and return the results as a list of Document objects containing the page content and source URL:

import os
from langchain_apify import ApifyWrapper
from langchain_core.documents import Document

os.environ["APIFY_API_TOKEN"] = "YOUR_APIFY_API_TOKEN"

apify = ApifyWrapper()

loader = apify.call_actor(
    actor_id="apify/website-content-crawler",
    run_input={
        "startUrls": [{"url": "https://python.langchain.com/docs/get_started/introduction"}],
        "maxCrawlPages": 10,
        "crawlerType": "cheerio"
    },
    dataset_mapping_function=lambda item: Document(
        page_content=item["text"] or "",
        metadata={"source": item["url"]}
    ),
)
documents = loader.load()

Note for Apify Actor developers

If you are building an Apify Actor that will run on the Apify platform, you should NOT use this package for dataset loading. Instead:

  • Use the Apify Actor SDK directly with Actor.open_dataset()
  • Do NOT use ApifyDatasetLoader from this package

Why?

  1. Security & permissions: Actors should run with LIMITED_PERMISSIONS and use scoped tokens that grant access only to specific resources. The Actor SDK's Actor.open_dataset() method respects these scoped tokens.
  2. Best practices: Using the Actor SDK is the proper way to access Apify resources within an Actor runtime environment.
  3. No external dependencies: Your Actor doesn't need to depend on langchain-apify for basic dataset operations.

Example: Loading dataset in an Actor

from apify import Actor
from langchain_core.documents import Document

async def main():
    async with Actor:
        # Get dataset ID from input or integration payload
        actor_input = await Actor.get_input() or {}
        dataset_id = actor_input.get("datasetId")

        # Open dataset using Actor SDK (respects LIMITED_PERMISSIONS)
        dataset = await Actor.open_dataset(id=dataset_id)

        # Transform items to Documents
        documents = []
        async for item in dataset.iterate_items():
            doc = Document(page_content=item.get("text", ""), metadata={"url": item.get("url")})
            documents.append(doc)
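The item-to-document step inside the async loop above can be factored into a pure function that is easy to unit-test outside the Actor runtime. A stdlib-only sketch using plain dicts in place of Document objects (the function name is ours):

```python
def items_to_documents(items: list[dict]) -> list[dict]:
    """Pure version of the iteration loop: dataset items -> document dicts."""
    return [
        {
            "page_content": item.get("text", ""),
            "metadata": {"url": item.get("url")},
        }
        for item in items
    ]


result = items_to_documents([
    {"text": "Example text from the website.", "url": "http://example.com"},
    {"url": "http://example.com/no-text"},
])
```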

When to use langchain-apify

This package is designed for:

  • External scripts and applications that need to access Apify from outside the Actor runtime
  • LangChain agents that use Apify Actors as tools
  • Data processing pipelines that consume Apify datasets

It is NOT designed for:

  • Code running inside an Apify Actor (use Actor SDK instead)
