Source Code for Downloading an Online Article to Markdown
Below is the source code for downloading an online article to Markdown (.md) using Newspaper3k. GitHub repo.
- Dependencies to install
newspaper3k==0.2.8
- How to run it: `python app1.py` or `python3 app1.py`
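As a rough illustration of what such a script might look like (this is a minimal sketch, not the code from the repo), the example below fetches an article with Newspaper3k's `Article` class and writes it to a `.md` file. The helper names `slugify`, `article_to_markdown`, and `download_article`, as well as the Markdown layout, are assumptions made for this example.

```python
import re
from pathlib import Path


def slugify(title):
    """Turn a title into a safe file name (hypothetical helper)."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-") or "article"


def article_to_markdown(title, text, authors=(), url=""):
    """Render already-extracted article fields as a Markdown string."""
    parts = [f"# {title}"]
    if authors:
        parts.append("By " + ", ".join(authors))
    if url:
        parts.append(f"Source: <{url}>")
    # One Markdown paragraph per non-empty line of the extracted body text.
    parts.extend(p.strip() for p in text.split("\n") if p.strip())
    return "\n\n".join(parts) + "\n"


def download_article(url, out_dir="."):
    """Download one article and save it as Markdown; returns the file path."""
    # Imported here so the formatting helpers work without newspaper3k installed.
    from newspaper import Article

    article = Article(url)
    article.download()  # fetch the HTML
    article.parse()     # extract title, authors, and body text
    markdown = article_to_markdown(article.title, article.text, article.authors, url)
    out_path = Path(out_dir) / (slugify(article.title) + ".md")
    out_path.write_text(markdown, encoding="utf-8")
    return out_path
```

Calling `download_article("https://example.com/some-article")` (a placeholder URL) would write one Markdown file into the current directory.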
Source Code for Downloading Online Articles to Spreadsheets
Below is the source code for downloading an online article as a spreadsheet (.csv) using Newspaper3k and Pandas. To download it as an Excel file (.xlsx), don't forget to install the openpyxl library. GitHub repo.
- Dependencies to install
rich==13.8.1
bs4==0.0.2
scrapy==2.11.2
pandas==2.2.2
requests==2.32.3
- How to run it: `python app2.py` or `python3 app2.py`
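The export step could be sketched roughly as follows. This is not the repo's code: it assumes the articles have already been scraped into dictionaries, and the column names and the helpers `articles_to_rows` and `save_table` are hypothetical. The `.xlsx` branch is the part that needs openpyxl.

```python
def articles_to_rows(articles):
    """Flatten article dicts into uniform rows, one row per article."""
    columns = ("title", "authors", "published", "url", "text")
    rows = []
    for article in articles:
        # Missing fields become empty cells so every row has the same columns.
        row = {col: article.get(col, "") for col in columns}
        # A list of authors is joined into a single cell.
        if isinstance(row["authors"], (list, tuple)):
            row["authors"] = ", ".join(row["authors"])
        rows.append(row)
    return rows


def save_table(rows, path):
    """Write rows to .csv, or to .xlsx when the path ends with that suffix."""
    # Imported here so articles_to_rows works without pandas installed.
    import pandas as pd

    frame = pd.DataFrame(rows)
    if str(path).endswith(".xlsx"):
        frame.to_excel(path, index=False)  # requires openpyxl
    else:
        frame.to_csv(path, index=False)
    return frame
```

For example, `save_table(articles_to_rows(scraped), "articles.csv")` would produce the CSV, and changing the file name to `articles.xlsx` would produce the Excel version.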
Source Code for Data Scraping Pipeline to Vector DB for LLM Using a RAG System
![[images/pyjo-2/data-scraping-to-chatbot.png|Data scraping to vector db diagram flow]]
Last but not least, below is the source code for the data scraping pipeline that feeds an LLM RAG chatbot. GitHub repo.
- Dependencies to install
langchain==0.3.0
langchain-chroma==0.1.4
langchain-huggingface==0.1.0
langchain-openai==0.2.0
python-decouple==3.8
requests==2.32.3
streamlit==1.38.0
scrapy==2.11.2
- Run pip install with `-r` to install all dependencies: `pip install -r requirements.txt`
- Environment variable (`.env`): `OPENAI_API_KEY = ""`
- How to run Scrapy
scrapy crawl mojok_co
- How to run the Streamlit chatbot
streamlit run chatbot.py
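To give a feel for the retrieval step behind the chatbot, here is a minimal sketch, assuming the scraped article texts are already available as a list of strings. It is not the repo's implementation: the helper names `build_prompt` and `answer`, the prompt wording, and the model names (`sentence-transformers/all-MiniLM-L6-v2`, `gpt-4o-mini`) are assumptions for illustration. The API key is read from `.env` through python-decouple.

```python
def build_prompt(question, contexts):
    """Combine retrieved passages and the user question into one prompt."""
    context_block = "\n\n".join(
        f"[{i}] {passage}" for i, passage in enumerate(contexts, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}"
    )


def answer(question, texts, k=3):
    """Embed texts into Chroma, retrieve the k closest, and ask the LLM."""
    # Imported here so build_prompt can be used without these packages.
    from decouple import config
    from langchain_chroma import Chroma
    from langchain_huggingface import HuggingFaceEmbeddings
    from langchain_openai import ChatOpenAI

    # Assumed embedding model; any sentence-transformers model should work.
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )
    store = Chroma.from_texts(texts, embeddings)  # in-memory vector store
    docs = store.similarity_search(question, k=k)

    # Assumed chat model; the key comes from OPENAI_API_KEY in .env.
    llm = ChatOpenAI(model="gpt-4o-mini", api_key=config("OPENAI_API_KEY"))
    prompt = build_prompt(question, [doc.page_content for doc in docs])
    return llm.invoke(prompt).content
```

In a Streamlit app, a function like `answer` would typically be called with the text the user types into the chat input, and the vector store would be built once rather than on every question.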
Note: a video tutorial will follow ✌🏼. You can also read the story about this event here.