Quickstart


This page documents a Google Colab script that processes source files, generates embeddings, and uploads them to a FaithCopilot server. The script supports two vector storage providers: Pinecone and Upstash.

1. Setup and Installation

The script begins by mounting Google Drive, where the source files are stored, and installs the required Python packages: python-docx for DOCX handling, PyMuPDF for PDF parsing, and Whisper for audio transcription.

from google.colab import drive
drive.mount('/content/drive')

%%capture
%pip install python-docx
%pip install -q PyMuPDF
%pip install -q git+https://github.com/openai/whisper.git

2. Imports and Configurations

The script imports necessary Python libraries for various tasks such as file handling, HTTP requests, document processing, and model loading.

2.1 Imported Libraries

  • os, shutil, re, json, time - General utilities for file and string operations.
  • requests - HTTP library for API requests.
  • fitz (PyMuPDF) - PDF processing library.
  • whisper - Automatic speech recognition (ASR) model by OpenAI.
  • subprocess - Used for running system commands.
  • tqdm - Progress bars for loops.
  • Document from python-docx - Processing DOCX files.

2.2 Configuration Variables

The configuration includes metadata for vector storage providers, API keys, file paths, and batch processing settings.

VECTOR_STORE_PROVIDER = 'pinecone'  # Can be 'pinecone' or 'upstash'
INDEX_NAME = 'my-index'
FC_NAME = 'My Sermon Notes'
DIRECTORY_PATH = '/content/drive/MyDrive/Datastore Test/My Sermon Notes'

FC_TAGS = ['sermons', 'notes', 'english']
BATCH_SIZE = 50

index_metadata = {'language': 'en'}
NAMESPACE = FC_NAME.lower().replace(' ', '-')

3. Temporary Folder Creation

The script creates a temporary folder to store intermediate files for processing. If the folder already exists, it will be removed and re-created.

current_dir = os.getcwd()
TEMP_FOLDER_PATH = os.path.join(current_dir, 'temp')

if os.path.exists(TEMP_FOLDER_PATH):
    shutil.rmtree(TEMP_FOLDER_PATH)
os.makedirs(TEMP_FOLDER_PATH, exist_ok=True)

4. API Tokens and Keys

The script uses multiple external APIs (FaithCopilot, Cloudflare, Pinecone, and Upstash) which require API keys for authentication. These keys are stored in variables and masked in the documentation for security reasons.
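For reference, the credential variables the later snippets assume look roughly like this (values masked; the exact variable set in the notebook may differ):

```python
# Masked placeholders -- substitute real credentials in the notebook.
FC_TOKEN = 'fc_********'            # FaithCopilot API token
FC_ENDPOINT = 'https://...'         # FaithCopilot upload endpoint

CF_ACCOUNT_ID = '****************'  # Cloudflare account ID
CF_TOKEN = 'cf_********'            # Cloudflare Workers AI token

PINECONE_API_KEY = 'pc_********'

UPSTASH_EMAIL_OR_ID = 'user@example.com'
UPSTASH_API_KEY = 'up_********'
```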

5. Helper Functions

5.1 Pinecone Functions

Functions that interact with the Pinecone API: listing indexes, creating indexes, and upserting vectors.

def get_pinecone_index_list():
  headers = {'Api-Key': PINECONE_API_KEY, 'X-Pinecone-API-Version': '2024-07'}
  response = requests.get('https://api.pinecone.io/indexes', headers=headers)
  return response.json()
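The upsert side is not shown above; a sketch along these lines would match Pinecone's REST API (`POST https://{INDEX_HOST}/vectors/upsert`). `build_pinecone_payload` and `upsert_to_pinecone` are illustrative names, not functions from the script:

```python
import requests

PINECONE_API_KEY = 'pc_********'  # placeholder; defined with the other keys above

def build_pinecone_payload(ids, vectors, metadatas, namespace):
  # Assemble the JSON body Pinecone expects: one object per vector, plus a namespace.
  return {
    'vectors': [
      {'id': i, 'values': v, 'metadata': m}
      for i, v, m in zip(ids, vectors, metadatas)
    ],
    'namespace': namespace,
  }

def upsert_to_pinecone(index_host, payload):
  headers = {'Api-Key': PINECONE_API_KEY, 'Content-Type': 'application/json'}
  response = requests.post(f'https://{index_host}/vectors/upsert', headers=headers, json=payload)
  return response.json()
```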

5.2 Upstash Functions

Functions that interact with the Upstash API, mirroring the Pinecone helpers: index creation and vector upserting.

def get_upstash_index_list():
  response = requests.get('https://api.upstash.com/v2/vector/index/', auth=(UPSTASH_EMAIL_OR_ID, UPSTASH_API_KEY))
  return response.json()
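Note that the call above hits Upstash's management API (basic auth with the account email and API key); upserts go to the index's own REST URL with a bearer token instead. A hedged sketch, with `build_upstash_payload` and `upsert_to_upstash` as illustrative names:

```python
import requests

def build_upstash_payload(ids, vectors, metadatas):
  # Upstash Vector accepts a flat list of vector objects.
  return [
    {'id': i, 'vector': v, 'metadata': m}
    for i, v, m in zip(ids, vectors, metadatas)
  ]

def upsert_to_upstash(rest_url, rest_token, namespace, payload):
  # rest_url/rest_token are the per-index data-plane credentials from the Upstash console.
  headers = {'Authorization': f'Bearer {rest_token}'}
  response = requests.post(f'{rest_url}/upsert/{namespace}', headers=headers, json=payload)
  return response.json()
```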

5.3 Metadata Helper Functions

These functions extract metadata from different file types, such as blog posts, sermons, and documents.

def get_metadata(content):
  # The first two lines of the file are expected to hold the title and source URL
  lines = content.split('\n')
  title = lines[0].strip() if lines else ''
  url = lines[1].strip() if len(lines) > 1 else ''

  return {'title': title, 'url': url}
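For a file whose first two lines hold the title and URL (illustrative values below), the extraction reduces to:

```python
sample = "Faith and Works\nhttps://example.com/sermons/faith-and-works\n\nSermon body..."

lines = sample.split('\n')
metadata = {'title': lines[0], 'url': lines[1]}
print(metadata)
# {'title': 'Faith and Works', 'url': 'https://example.com/sermons/faith-and-works'}
```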

5.4 Miscellaneous Helper Functions

These include batch processing, namespace generation, and file embedding extraction.

def get_batches(object_to_batch, batch_size=100):
  if isinstance(object_to_batch, dict):
    keys = list(object_to_batch.keys())
    return [{key: object_to_batch[key] for key in keys[i:i+batch_size]} for i in range(0, len(keys), batch_size)]
  elif isinstance(object_to_batch, list):
    return [object_to_batch[i:i+batch_size] for i in range(0, len(object_to_batch), batch_size)]

6. File Processing and Transcription

The script supports different file types such as TXT, PDF, DOCX, and audio files (MP3/WAV). For audio files, Whisper transcribes the speech to text.

def voice_to_plaintext(audio_file_path):
  # The "base" model balances speed and accuracy; larger models transcribe better but run slower
  model = whisper.load_model("base")
  result = model.transcribe(audio_file_path)
  return result['text'] if result else None
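How the per-type extractors fit together is not shown explicitly; a hypothetical dispatcher (`file_to_plaintext` is an illustrative name, not from the script) might route files by extension:

```python
import os

def file_to_plaintext(path):
  # Hypothetical dispatcher: route each file to the matching extractor by extension.
  ext = os.path.splitext(path)[1].lower()
  if ext == '.txt':
    with open(path, encoding='utf-8') as f:
      return f.read()
  if ext == '.pdf':
    import fitz  # PyMuPDF, installed in section 1
    return '\n'.join(page.get_text() for page in fitz.open(path))
  if ext == '.docx':
    from docx import Document
    return '\n'.join(p.text for p in Document(path).paragraphs)
  if ext in ('.mp3', '.wav'):
    return voice_to_plaintext(path)  # Whisper transcription, defined above
  return None  # unsupported type
```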

7. Vector Store Selection

The script checks whether Pinecone or Upstash is being used for vector storage, and fetches the index host accordingly. If no index is found, it creates one.

if VECTOR_STORE_PROVIDER == 'pinecone':
  INDEX_HOST = get_pinecone_index_host(INDEX_NAME)
elif VECTOR_STORE_PROVIDER == 'upstash':
  INDEX_HOST = get_upstash_index_host(INDEX_NAME)
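The create-if-missing fallback mentioned above can be factored as a small wrapper; `resolve_index_host` is an illustrative name, with the provider-specific lookup and creation functions passed in:

```python
def resolve_index_host(index_name, get_host, create_index):
  # Look up the index host; if the index does not exist yet, create it and re-fetch.
  host = get_host(index_name)
  if not host:
    create_index(index_name)
    host = get_host(index_name)
  return host
```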

8. Cloudflare Embeddings

For each file, embeddings are generated with the BAAI bge-base-en-v1.5 model running on Cloudflare Workers AI. If the API call fails, the script retries the request.

def get_cloudflare_embeddings(text):
  headers = {"Authorization": f"Bearer {CF_TOKEN}"}
  url = f"https://api.cloudflare.com/client/v4/accounts/{CF_ACCOUNT_ID}/ai/run/@cf/baai/bge-base-en-v1.5"
  response = requests.post(url, headers=headers, json={"text": text})
  return response.json()['result']['data']
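The retry logic is not visible in the snippet; a generic helper along these lines would cover it (`with_retries` is an illustrative name, not from the script):

```python
import time

def with_retries(fn, attempts=3, delay_seconds=2):
  # Call fn, retrying on any exception; re-raise after the final attempt.
  for attempt in range(attempts):
    try:
      return fn()
    except Exception:
      if attempt == attempts - 1:
        raise
      time.sleep(delay_seconds)
```

Used, for example, as `with_retries(lambda: get_cloudflare_embeddings(chunk))`.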

9. Upload to FaithCopilot

The processed files, along with their metadata and embeddings, are uploaded to the FaithCopilot server via an API call.

def upload_to_faithcopilot(files, index_metadata):
  faithcopilot_data = {'name': FC_NAME, 'tags': json.dumps(FC_TAGS), 'metadata': json.dumps(index_metadata)}
  headers = {"Authorization": f"Bearer {FC_TOKEN}"}
  response = requests.post(FC_ENDPOINT, headers=headers, files=files, data=faithcopilot_data)
  return response
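The `files` argument is the multipart mapping that `requests` expects; a hypothetical builder (the helper and the `'files'` field name are assumptions, not taken from the script):

```python
import os

def build_upload_files(paths):
  # Map local paths to multipart entries for requests.post(..., files=...).
  return [('files', (os.path.basename(p), open(p, 'rb'))) for p in paths]
```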

10. Cleanup and Script Completion

After processing, the temporary folder is deleted and the script execution time is printed.

shutil.rmtree(TEMP_FOLDER_PATH)
script_end_time = time.time()
script_total_time = script_end_time - script_start_time
print(f"Script ran fully in {script_total_time / 60:.2f} minutes.")

11. Conclusion

This script automates the process of extracting text, generating embeddings, and uploading files to FaithCopilot, utilizing cloud services like Pinecone, Upstash, and Cloudflare.