How to Use Natural Language to Build a Script That Batch-Screenshots Web Pages into Notion
2024-8-16
| 2024-8-19

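Preface

I built two navigation sites to collect the AI products and productivity tools I follow, so that I can look things up quickly.
To make the navigation sites more visual, I needed to batch-screenshot the home pages of a large number of linked websites and save the screenshots to Notion.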

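Approach

Use LLMs to turn natural language into an automation script, with Chrome DevTools as the automated screenshot tool, Cloudinary as the external image service, and the Notion API as the interface.
The LLMs used here are Claude, ChatGPT, and Mistral.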

Result

[Image]
[Image]

Script run demo

[Video: Chrome Screenshot to Notion Script Run Demo]

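Implementation

Six rounds of conversation with Claude to get a first working script

Conversation 1: Basics, how to take screenshots with the Microlink API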

K
I want to use an API to take screenshots of multiple URLs. But I can't understand the content below. Can you explain this and help me learn to use the Microlink API to take screenshots?
{
screenshot Type: <boolean> Default: false It generates a screenshot over the target url. * CLI * JavaScript * Shell * Python * Ruby When it's enabled, a new screenshot data field will be part of the response payload.
You can configure different specific options, such as element or overlay: * CLI * JavaScript * Shell * Python * Ruby Also, combine it with embed for inserting it as HTML markup and refresh it asynchronously in the background (known as stale).
or inside CSS stylesheets:
even in Markdown:
}
Claude
I'd be happy to explain how to use the Microlink API to take screenshots of multiple URLs. Let's break this down step by step:
  1. Understanding the Microlink API: Microlink is a service that allows you to extract information from any website, including taking screenshots. The API endpoint for this service is https://api.microlink.io.
  2. Basic Usage: To take a screenshot of a URL, you need to make a GET request to the Microlink API with the following parameters:
      • url: The website URL you want to screenshot
      • screenshot: Set to true to enable screenshot generation
      Example: https://api.microlink.io?url=https://example.com&screenshot=true
  3. API Response: When successful, the API will return a JSON response containing the screenshot information, including the URL of the generated screenshot, its dimensions, file type, and size.
  4. Taking Screenshots of Multiple URLs: To screenshot multiple URLs, you'll need to make separate API calls for each URL. This can be done using a script in your preferred programming language.
Here's a simple Python script to demonstrate how to take screenshots of multiple URLs:
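The script from this reply is not reproduced on this page. What follows is a minimal sketch of the approach it describes, not Claude's original code. The endpoint and the `data.screenshot.url` response path come from the conversation itself; the helper names `build_request_url` and `extract_screenshot_url` and the 60-second timeout are assumptions added for illustration.

```python
import urllib.parse

MICROLINK_ENDPOINT = "https://api.microlink.io"


def build_request_url(target_url):
    """Build the GET URL that asks Microlink to screenshot target_url."""
    query = urllib.parse.urlencode({"url": target_url, "screenshot": "true"})
    return f"{MICROLINK_ENDPOINT}?{query}"


def extract_screenshot_url(payload):
    """Return data.screenshot.url from a decoded response, or None if absent."""
    data = payload.get("data") or {}
    screenshot = data.get("screenshot") or {}
    return screenshot.get("url")


def take_screenshot(target_url):
    """Call Microlink once and return the screenshot URL, or None on failure."""
    # Imported here so the pure helpers above work without the dependency.
    import requests  # third-party: pip install requests

    response = requests.get(build_request_url(target_url), timeout=60)
    if response.status_code != 200:
        return None
    return extract_screenshot_url(response.json())
```

Calling `take_screenshot(url)` in a loop over a list of URLs makes one real API request per URL and returns either the screenshot link or `None`.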
This script does the following:
  1. Defines a function take_screenshot that makes a request to the Microlink API for a given URL.
  2. If successful, it extracts the screenshot URL from the API response.
  3. We then loop through a list of URLs, calling take_screenshot for each one.
  4. The script prints the screenshot URL for each successful request, or an error message for failed requests.
To use this script, you'll need to install the requests library (pip install requests) if you haven't already.
Additional Notes:
  • The API offers more customization options, such as specifying the screenshot dimensions, format, and more. These can be added as additional parameters to the API call.
  • For high-volume usage, you may need to sign up for a Microlink account and use authentication.
  • Be mindful of rate limits and terms of service when using the API.
Would you like me to explain any part of this in more detail or help you implement this in a different programming language?

Conversation 2: Closing the loop, how to send the screenshot to Notion through the Notion API

K
After that, how do I connect to the Notion API and put this screenshot into a specific page in a Notion database?
Claude
Certainly! I can help you connect the Microlink screenshot functionality with the Notion API to add the screenshots to a specific page in a Notion database. Let's break this down into steps:
  1. Set up Notion API
  2. Create a function to add content to a Notion page
  3. Combine the screenshot and Notion functionality
Here's how we can do this:
  1. Set up Notion API: First, you need to set up your Notion integration and get your API key. Here are the steps:
      • Go to https://www.notion.so/my-integrations
      • Click "New integration" and give it a name
      • Select the workspace where you'll use it
      • Copy the "Internal Integration Token" (this is your API key)
      • In your Notion workspace, share the database with your integration
  2. Create a function to add content to a Notion page: We'll create a function that adds a new block (in this case, an image) to a specific Notion page.
  3. Combine the screenshot and Notion functionality: We'll modify our previous script to include the Notion integration.
Here's the updated Python script that accomplishes this:
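The code from this reply is likewise missing here. As a hedged sketch of the Notion half only: it appends an external image block to a page through the Notion API's "append block children" endpoint. The function name `add_image_to_notion_page` comes from the reply; `notion_headers`, `image_block`, and the pinned `Notion-Version` value are assumptions for illustration.

```python
NOTION_API_KEY = "your_notion_api_key_here"
NOTION_PAGE_ID = "your_notion_page_id_here"
NOTION_VERSION = "2022-06-28"  # assumed API version header


def notion_headers(api_key):
    """Headers every Notion API request needs."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Notion-Version": NOTION_VERSION,
        "Content-Type": "application/json",
    }


def image_block(image_url):
    """Build an image block that points at an externally hosted file."""
    return {
        "object": "block",
        "type": "image",
        "image": {"type": "external", "external": {"url": image_url}},
    }


def add_image_to_notion_page(page_id, image_url, api_key=NOTION_API_KEY):
    """Append the image to the end of the page; True on HTTP 200."""
    import requests  # third-party: pip install requests

    response = requests.patch(
        f"https://api.notion.com/v1/blocks/{page_id}/children",
        headers=notion_headers(api_key),
        json={"children": [image_block(image_url)]},
        timeout=30,
    )
    return response.status_code == 200
```

Combined with a screenshot function, the main loop would call `add_image_to_notion_page(NOTION_PAGE_ID, screenshot_url)` once per URL.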
To use this script, you need to:
  1. Install the required Python library: pip install requests
  2. Replace "your_notion_api_key_here" with your actual Notion API key.
  3. Replace "your_notion_page_id_here" with the ID of the Notion page where you want to add the screenshots. You can find this ID in the URL of your Notion page.
  4. Customize the urls list with the websites you want to screenshot.
This script does the following:
  1. It uses the take_screenshot function from our previous example to get screenshot URLs from Microlink.
  2. It defines a new function add_image_to_notion_page that uses the Notion API to add an image block to a specific page.
  3. In the main function, it loops through each URL, takes a screenshot, and then adds that screenshot to the specified Notion page.
A few important notes:
  • This script adds each screenshot as a new block at the end of the specified Notion page. If you want to add them to a database instead, you'll need to modify the Notion API call to create new pages in the database.
  • The Notion API has rate limits, so if you're processing many URLs, you might need to add delays between requests.
  • Error handling in this script is basic. In a production environment, you'd want to add more robust error handling and possibly retries for failed requests.
  • Make sure to keep your Notion API key secure and never share it publicly.
Would you like me to explain any part of this in more detail or help you customize it further?

Conversation 3: Refinement, how to screenshot the site in each database page's URL property into that page's File property

K
There are a lot of pages in a Notion database. They all have a url property. Can I take screenshots of the URLs in the url property and put them into the same page's file property?
Claude
Certainly! I understand you want to:
  1. Access a Notion database
  2. For each page in the database, get the URL from a "url" property
  3. Take a screenshot of that URL
  4. Add the screenshot to the same page's "file" property
This is a great use case. Let's modify our script to accomplish this task. Here's how we can do it:
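The script itself is not preserved here. The sketch below shows the two Notion calls the reply goes on to describe: querying the database and writing an external file into a page's "file" property. The function names `get_database_pages` and `update_page_with_screenshot` come from the reply; `page_url`, `file_property_update`, and the file name "screenshot.png" are assumptions for illustration.

```python
API_ROOT = "https://api.notion.com/v1"
NOTION_VERSION = "2022-06-28"  # assumed API version header


def notion_headers(api_key):
    return {
        "Authorization": f"Bearer {api_key}",
        "Notion-Version": NOTION_VERSION,
        "Content-Type": "application/json",
    }


def page_url(page, prop="url"):
    """Read the link stored in a page's URL-type property, or None."""
    value = page.get("properties", {}).get(prop) or {}
    return value.get("url")


def file_property_update(screenshot_url, prop="file", name="screenshot.png"):
    """Request body that sets a files property to one external file."""
    external_file = {
        "name": name,
        "type": "external",
        "external": {"url": screenshot_url},
    }
    return {"properties": {prop: {"files": [external_file]}}}


def get_database_pages(database_id, api_key):
    """Return the first batch (at most 100) of pages in the database."""
    import requests  # third-party: pip install requests

    response = requests.post(
        f"{API_ROOT}/databases/{database_id}/query",
        headers=notion_headers(api_key),
        json={"page_size": 100},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["results"]


def update_page_with_screenshot(page_id, screenshot_url, api_key):
    """Write the screenshot link into the page's file property."""
    import requests

    response = requests.patch(
        f"{API_ROOT}/pages/{page_id}",
        headers=notion_headers(api_key),
        json=file_property_update(screenshot_url),
        timeout=30,
    )
    return response.status_code == 200
```

Note that `file_property_update` replaces whatever the property held before, because the Notion API sets a files property to exactly the list it is given.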
To use this script:
  1. Install the required Python library if you haven't already: pip install requests
  2. Replace "your_notion_api_key_here" with your actual Notion API key.
  3. Replace "your_database_id_here" with the ID of your Notion database. You can find this ID in the URL of your database page.
Here's what this script does:
  1. get_database_pages: This function retrieves all pages from the specified Notion database.
  2. update_page_with_screenshot: This function updates a specific page in the Notion database, adding the screenshot URL to the "file" property.
  3. In the main function:
      • It gets all pages from the database.
      • For each page, it checks for a "url" property.
      • If a URL is found, it takes a screenshot using the Microlink API.
      • It then updates the page, adding the screenshot URL to the "file" property.
      • There's a 1-second delay between processing each page to respect rate limits.
Important notes:
  1. Error handling: The script includes basic error handling. In a production environment, you might want to add more robust error handling and logging.
  2. Rate limits: Both Notion and Microlink have rate limits. The script includes a basic 1-second delay between operations, but you might need to adjust this based on your specific usage and the rate limits of your account.
  3. Pagination: If your database has more than 100 pages, you'll need to implement pagination in the get_database_pages function to retrieve all pages.
  4. Property names: The script assumes your URL property is named "url" and your file property is named "file". If these are named differently in your database, you'll need to adjust the property names in the script.
  5. Existing files: This script will add a new file to the "file" property. If you want to replace an existing file or handle cases where there might already be files, you'll need to modify the update_page_with_screenshot function.
Would you like me to explain any part of this further or help you with any modifications?

Conversation 4: Refinement, how to process 50 pages per run and skip pages whose File property already has a screenshot

K
My database has more than 100 pages, and I want the script to run only when the file property is empty. Can I take 50 screenshots per run and have it skip pages that already have a screenshot in the file property? That way I can screenshot all my 200+ pages by running it a few times. Help me improve it.
Claude
Certainly! I understand your requirements:
  1. Handle pagination for databases with more than 100 pages
  2. Only take screenshots if the file property is empty
  3. Limit the script to process 50 pages at a time
  4. Skip pages that already have screenshots
Let's modify the script to accommodate these requirements. Here's an improved version:
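The improved script is not shown on this page. A minimal sketch of the three ideas in it (cursor pagination, the empty-file check, and the 50-per-run cap) could look like the following. The `has_more` / `next_cursor` fields and `max_screenshots` come from the reply; `iter_database_pages`, `needs_screenshot`, `select_pages`, and `notion_fetcher` are hypothetical names.

```python
from itertools import islice


def needs_screenshot(page, url_prop="url", file_prop="file"):
    """True when the page has a URL but its files property is still empty."""
    props = page.get("properties", {})
    has_url = bool((props.get(url_prop) or {}).get("url"))
    has_file = bool((props.get(file_prop) or {}).get("files"))
    return has_url and not has_file


def iter_database_pages(fetch_batch):
    """Yield every page, following next_cursor while has_more is true.

    fetch_batch(cursor) must return one database-query response as a dict.
    """
    cursor = None
    while True:
        batch = fetch_batch(cursor)
        yield from batch.get("results", [])
        if not batch.get("has_more"):
            break
        cursor = batch.get("next_cursor")


def select_pages(pages, max_screenshots=50):
    """Pick at most max_screenshots pages that still need a screenshot."""
    return list(islice(filter(needs_screenshot, pages), max_screenshots))


def notion_fetcher(database_id, api_key):
    """Build a fetch_batch function backed by the real Notion API."""
    import requests  # third-party: pip install requests

    headers = {
        "Authorization": f"Bearer {api_key}",
        "Notion-Version": "2022-06-28",  # assumed API version header
        "Content-Type": "application/json",
    }

    def fetch_batch(cursor):
        body = {"page_size": 100}
        if cursor:
            body["start_cursor"] = cursor
        response = requests.post(
            f"https://api.notion.com/v1/databases/{database_id}/query",
            headers=headers, json=body, timeout=30,
        )
        response.raise_for_status()
        return response.json()

    return fetch_batch
```

Because the generator is lazy, `select_pages(iter_database_pages(notion_fetcher(db_id, key)))` stops requesting further batches once 50 eligible pages have been found. In this sketch a later run resumes simply because pages filled in earlier no longer pass `needs_screenshot`.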
Here's what's new in this improved version:
  1. Pagination: The get_database_pages function now supports pagination. It returns the next_cursor and has_more flag along with the results.
  2. Limit processing: The script now has a max_screenshots variable (set to 50) to limit the number of screenshots taken in one run.
  3. Empty file property check: The script now checks if the file property is empty before processing a page.
  4. Cursor-based iteration: The main loop now uses a cursor to iterate through all pages in the database, continuing where it left off in the previous run.
To use this script:
  1. Replace "your_notion_api_key_here" with your actual Notion API key.
  2. Replace "your_database_id_here" with your database ID.
  3. Run the script multiple times to process all pages in your database.
Key points:
  • The script will process up to 50 pages that need screenshots in each run.
  • It will skip pages that already have files in the file property.
  • You can run this script multiple times to eventually process all pages in your database.
  • The script respects rate limits by waiting 1 second between operations.
To process all 200+ pages in your database:
  1. Run the script once. It will process up to 50 pages.
  2. Wait a bit (maybe a few minutes) to respect rate limits.
  3. Run the script again. It will start where it left off and process up to 50 more pages.
  4. Repeat steps 2-3 until all pages have been processed.
This approach allows you to process your entire database in manageable chunks, respecting rate limits and avoiding unnecessary API calls for pages that already have screenshots.
Would you like me to explain any part of this further or help you with any additional modifications?

Conversation 5: Troubleshooting, how to skip pages that require login

K
{
Screenshot taken: https://iad.microlink.io/e3bFoUcPA2-yv0mffLhp9N3gAeVNI0XeYUcXMOQLqh2TI9oQ6vuSVsoRvv1N6uW5r2s4y4qjto0XMLm5wF-_Kg.png
Screenshot added to Notion page
Processing URL: https://cloud.mongodb.com/v2#/org/663ee0a28e5df13806dad550/projects
Traceback (most recent call last):
  File "C:\Users\Kinpoe·Ray\Downloads\screenshot-to-notion.py", line 115, in <module>
    main()
  File "C:\Users\Kinpoe·Ray\Downloads\screenshot-to-notion.py", line 86, in main
    screenshot_url = take_screenshot(url)
                     ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Kinpoe·Ray\Downloads\screenshot-to-notion.py", line 12, in take_screenshot
    return data['data']['screenshot']['url']
           ~~~~~~~~~~~~^^^^^^^^^^^^^^
KeyError: 'screenshot'
}
I ran into an error like this. Can the script skip this kind of URL? Or do you have a way to take screenshots of this kind of URL?
Claude
I understand you're encountering an error when trying to take a screenshot of certain URLs, specifically for https://cloud.mongodb.com/v2#/org/663ee0a28e5df13806dad550/projects in this case. This error occurs because the Microlink API couldn't generate a screenshot for this URL, likely due to authentication requirements or other access restrictions.
Let's modify the script to handle these cases gracefully. We'll add error handling to skip URLs that can't be screenshotted and optionally log them for manual review. Here's an improved version of the take_screenshot function and some modifications to the main loop:
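The revised function is missing from this page. The sketch below illustrates the fix being described: a missing `screenshot` field is treated as "skip and log" instead of raising the `KeyError` shown in the traceback. The log file name screenshot_errors.log comes from the reply; `configure_logging`, `extract_screenshot_url`, and `process_urls` are hypothetical names.

```python
import logging

logger = logging.getLogger("screenshot")


def configure_logging(path="screenshot_errors.log"):
    """Send error records to a log file next to the script."""
    logging.basicConfig(
        filename=path,
        level=logging.ERROR,
        format="%(asctime)s %(levelname)s %(message)s",
    )


def extract_screenshot_url(payload):
    """Return data.screenshot.url, or None instead of raising KeyError."""
    data = payload.get("data") or {}
    screenshot = data.get("screenshot") or {}
    return screenshot.get("url")


def take_screenshot(target_url):
    """Return the screenshot URL, or None after logging why it failed."""
    import requests  # third-party: pip install requests

    try:
        response = requests.get(
            "https://api.microlink.io",
            params={"url": target_url, "screenshot": "true"},
            timeout=60,
        )
        response.raise_for_status()
        payload = response.json()
    except (requests.RequestException, ValueError) as exc:
        logger.error("Request failed for %s: %s", target_url, exc)
        return None
    shot = extract_screenshot_url(payload)
    if shot is None:
        logger.error("No screenshot in response for %s", target_url)
    return shot


def process_urls(urls, take=take_screenshot):
    """Try every URL; return (successes, skipped) without stopping on failure."""
    done, skipped = [], []
    for url in urls:
        shot = take(url)
        if shot:
            done.append((url, shot))
        else:
            skipped.append(url)
    return done, skipped
```

The `skipped` list gives the same information as the log file in memory, which is convenient for printing a summary at the end of a run.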
Key changes in this updated script:
  1. Error Handling: The take_screenshot function now has better error handling. It will catch any RequestException (which includes connection errors, timeouts, etc.) and log the error.
  2. Logging: We've added logging functionality. Errors will be written to a file named screenshot_errors.log in the same directory as the script.
  3. Skipping Problematic URLs: If a screenshot can't be taken (either due to an error or because the API response doesn't contain a screenshot URL), the script will log the issue and continue to the next URL.
  4. Graceful Continuation: Even if a screenshot fails, the script will continue processing other URLs.
To use this updated script:
  1. Replace "your_notion_api_key_here" with your actual Notion API key.
  2. Replace "your_database_id_here" with your database ID.
  3. Run the script as before.
This improved version will:
  • Skip over URLs that can't be screenshotted (like the MongoDB URL you encountered)
  • Log errors to screenshot_errors.log for your review
  • Continue processing other URLs even if some fail
After running the script, you can check the screenshot_errors.log file to see which URLs couldn't be screenshotted. For these URLs, you might need to:
  1. Manually take screenshots if they require authentication
  2. Review if they're internal URLs that the Microlink API can't access
  3. Check if they're valid URLs or if they need updating in your Notion database
Remember, some URLs (especially those requiring authentication or accessing internal systems) may not be accessible to the Microlink API. For these, you might need to explore alternative screenshot methods or manually add screenshots to your Notion pages.
Is there anything else you'd like me to explain or modify in the script?

Conversation 6: Upgrade, automated screenshots with Chrome DevTools

K
{
How to screenshot a webpage right in Chrome Taking a screenshot of an entire webpage can be incredibly useful, whether you're capturing a bug to report, saving a snapshot for reference, or gathering information for research. While there are various tools and extensions available for this purpose, Google Chrome's Developer Tools offer a built-in, straightforward method to capture full-page screenshots without the need for additional software. This guide will walk you through the steps to utilize this feature effectively. We give an overview of more manual ways to screenshot webpages in our post here. Access Chrome developer tools To begin capturing a screenshot using Chrome's Developer Tools, you first need to access the tools themselves. This can be done through keyboard shortcuts or the Chrome menu. 1. Using Keyboard Shortcuts: 2. On a Mac, press Cmd + Opt + I. 3. On Windows, press Ctrl + Shift + I. 4. Using Chrome Menu: 5. Click the three-dot menu icon at the top-right corner of Chrome. 6. Navigate to More Tools > Developer Tools. Taking a full-page screenshot Once you have the Developer Tools open, you're just a few steps away from capturing a full-page screenshot. 1. Open the Command Menu: 2. On a Mac, press Cmd + Shift + P. 3. On Windows, press Ctrl + Shift + P. 4. In the Command Menu, type "screenshot" to bring up the screenshot options. 5. Select Capture area screenshot/Capture full size screenshot/Capture node screenshot/Capture screenshot from the list of options. Chrome will then capture the webpage.
}
How can I use this method to automatically take screenshots of multiple web pages and send them to Notion?
Note: the content inside { } is from Bardeen.
Claude
To automate taking screenshots of multiple webpages using Chrome's Developer Tools method and then upload them to Notion, we'll need to use a combination of browser automation (with Selenium or Playwright) and the Notion API. This approach will allow us to simulate the process of opening Developer Tools, running commands, and capturing screenshots programmatically. Here's how we can modify our script to accomplish this:
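The Selenium script from this reply is not reproduced here. One plausible way to get a full-page capture programmatically is the Chrome DevTools Protocol method `Page.captureScreenshot`, which Selenium exposes on Chrome through `execute_cdp_cmd`. The sketch below assumes Selenium 4.6+ (which locates ChromeDriver on its own) and headless Chrome. `clip_from_metrics` and `decode_screenshot` are hypothetical helpers, and the original script may have worked differently.

```python
import base64


def setup_chrome_driver():
    """Start headless Chrome through Selenium."""
    from selenium import webdriver  # third-party: pip install selenium

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    options.add_argument("--window-size=1280,800")
    return webdriver.Chrome(options=options)


def clip_from_metrics(metrics):
    """Turn Page.getLayoutMetrics output into a clip covering the whole page."""
    size = metrics.get("cssContentSize") or metrics["contentSize"]
    return {
        "x": 0,
        "y": 0,
        "width": size["width"],
        "height": size["height"],
        "scale": 1,
    }


def decode_screenshot(result):
    """Page.captureScreenshot returns base64 text; convert it to PNG bytes."""
    return base64.b64decode(result["data"])


def take_screenshot(driver, url, path):
    """Load url, capture the full page, and write it to path as a PNG."""
    driver.get(url)
    metrics = driver.execute_cdp_cmd("Page.getLayoutMetrics", {})
    result = driver.execute_cdp_cmd(
        "Page.captureScreenshot",
        {
            "format": "png",
            "captureBeyondViewport": True,
            "clip": clip_from_metrics(metrics),
        },
    )
    with open(path, "wb") as fh:
        fh.write(decode_screenshot(result))
    return path
```

A run would create one driver with `setup_chrome_driver()`, call `take_screenshot(driver, url, "shot.png")` for each URL, and finish with `driver.quit()`.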
This script uses Selenium to automate Chrome and take screenshots using the Developer Tools method you described. Here's a breakdown of the key changes and how to use this script:
  1. Setup:
      • Install required libraries: pip install selenium requests
      • Download ChromeDriver that matches your Chrome version and update the path in setup_chrome_driver()
  2. Key Functions:
      • setup_chrome_driver(): Initializes a headless Chrome browser
      • take_screenshot(): Uses Selenium to open Developer Tools and capture a full-page screenshot
      • upload_to_notion(): Uploads the screenshot to Notion and updates the page
  3. Usage:
      • Replace "your_notion_api_key_here" with your Notion API key
      • Replace "your_database_id_here" with your Notion database ID
      • Run the script
  4. Features:
      • Takes full-page screenshots using Chrome's Developer Tools
      • Uploads screenshots directly to Notion
      • Handles pagination for large databases
      • Skips pages that already have files
      • Logs errors for review
  5. Limitations and Considerations:
      • This method may be slower than using a dedicated screenshot API
      • It requires Chrome and ChromeDriver to be installed
      • Some websites may detect and block automated browsers
      • Error handling for complex web applications might need refinement
To use this script:
  1. Ensure you have Chrome installed
  2. Download the appropriate ChromeDriver for your Chrome version
  3. Install the required Python libraries
  4. Update the Notion API key and database ID in the script
  5. Run the script
This approach should work for most websites, including those that were problematic with the Microlink API. However, it may still face challenges with websites that require authentication or have complex dynamic content.
Would you like me to explain any part of this script in more detail or help you with any modifications?
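That was the first stage of the conversation, in which I moved from the Microlink API to using Chrome DevTools directly.
The reason: the free tier of the Microlink API allows only 50 requests per day, while Chrome DevTools has no such limit.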

Finishing the final version with ChatGPT and Mistral

After I used up Claude's free quota, I handed the code and its runtime problems over to ChatGPT.
Optimization: upload the images to an external service. I used Cloudinary here.
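As a rough sketch of this step (not the code from the final repository): the Cloudinary Python SDK can upload a local screenshot and return a public HTTPS link, which can then be written into the Notion page as an external file. The environment variable names, the "notion-screenshots" folder, and the `public_id_for` helper are assumptions for illustration.

```python
import os
import re
from urllib.parse import urlparse


def public_id_for(page_url):
    """Derive a stable, readable image name from the site's address."""
    parsed = urlparse(page_url)
    slug = re.sub(r"[^A-Za-z0-9]+", "-", parsed.netloc + parsed.path).strip("-")
    return slug or "screenshot"


def upload_screenshot(path, page_url):
    """Upload a local PNG to Cloudinary and return its HTTPS address."""
    # third-party: pip install cloudinary
    import cloudinary
    import cloudinary.uploader

    cloudinary.config(
        cloud_name=os.environ["CLOUDINARY_CLOUD_NAME"],  # hypothetical variable names
        api_key=os.environ["CLOUDINARY_API_KEY"],
        api_secret=os.environ["CLOUDINARY_API_SECRET"],
        secure=True,
    )
    result = cloudinary.uploader.upload(
        path,
        folder="notion-screenshots",  # assumed folder name
        public_id=public_id_for(page_url),
        overwrite=True,
    )
    return result["secure_url"]
```

Naming the upload after the site means a second run for the same URL overwrites the earlier image instead of piling up duplicates.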
After the upgrade, the code was too long for the free versions of both Claude and ChatGPT.

Final code

Chrome-Screenshot-to-Notion
kinpoe-ray, updated Sep 20, 2024

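Common issues

Issue 1: a few sites show a warning popup with terms-of-use and privacy-policy information; handle these manually

The final script still runs into the following problem on a very small number of sites. Of the two sites here, one is an LLM arena and the other identifies whether something was made by AI or by a human.
[Image]
[Image]
According to the error message, the script hit an unexpected warning popup when trying to screenshot certain sites (such as https://chat.lmsys.org/ and https://www.aiornot.com/). These popups contain terms-of-use and privacy-policy information and require the user to agree before continuing to the site.
Specifically, these sites show a warning dialog on access, with content roughly as follows:
  • The service is a research preview
  • It provides only limited safety measures and may generate offensive content
  • It must not be used for any illegal, harmful, violent, racist, or sexual purposes
  • Please do not upload any private information
  • The service collects user dialogue data (including text and images) and reserves the right to distribute it under a Creative Commons Attribution (CC-BY) or similar license
The automation script cannot handle this kind of unexpected warning popup, so it cannot go on to take the screenshot, which causes the error.
Sites like these need to be screenshotted and uploaded manually.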

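Issue 2: Notion API connection limit; wait a while and run again

On the third run, the connection problem shown below appeared. My guess is that the Notion API limits the number of connections; waiting a while and trying again is enough.
After waiting a while, the script ran again.
[Image]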
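If the failure really is Notion's rate limit (an assumption, since the error itself is only visible in the image), the API signals it with HTTP 429 and a `Retry-After` header, and the waiting can be automated. This is a minimal sketch; `retry_delay` and `call_with_retry` are hypothetical helpers, and `send` stands for any function that performs one request and returns a response object with `status_code` and `headers`.

```python
import time


def retry_delay(headers, attempt, base_delay=1.0):
    """Seconds to wait: Retry-After if the server sent it, else backoff."""
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        try:
            return float(retry_after)
        except ValueError:
            pass  # header present but not a number of seconds
    return base_delay * (2 ** attempt)


def call_with_retry(send, max_attempts=5, sleep=time.sleep):
    """Call send() until the response is not 429 or attempts run out."""
    response = None
    for attempt in range(max_attempts):
        response = send()
        if response.status_code != 429:
            return response
        if attempt < max_attempts - 1:
            sleep(retry_delay(response.headers, attempt))
    return response
```

With `requests`, a call could be wrapped as `call_with_retry(lambda: requests.patch(url, headers=headers, json=body, timeout=30))`.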

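Closing thoughts

One last note: to make the code easier to share, I spent half a day learning Git, GitHub, and VS Code, then uploaded the first code I had organized and published myself, and wrote a README based on a reference template. In my next post I will share that whole process, including learning resources.
As a designer (currently going all in on AI), I completed a web-page screenshot automation script this time just by interacting with AI assistants such as Claude, ChatGPT, and Mistral…
Choosing the right LLM based on each model's capability limits, defining the problem clearly, guiding the AI, and verifying and refining its output: this may well become the most common basic skill in the future.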