DIY AI part 7: How to clean, organize, and accelerate your file scans


If you have been following along with our DIY modular offline AI, you will have a scheduler script that runs regular scans of your Desktop folder and lists the files in a CSV file. Now, we're going to look at how to clean up that data and make it more manageable and easier for the AI to access.

What we can do so far

While there are a lot of steps to creating a DIY AI, the good news is that many of them make you more productive along the way. For instance, you can use the central scheduler we made in part five to automatically schedule the Facebook media posts we've shown you how to create here at GeekSided.

You can also use it to schedule internet speed tests and log the results at different times of the day, using another script we've covered here at GeekSided. Incorporating these ideas can help you get real value from the program right from the start.

Cleaning up

In the last guide, we started outputting our Desktop file information to a CSV, which gives you an idea of what you are asking the AI to look at. If you have let the script run a few times, open the CSV files and see the data they contain. If they are extremely large, look for files or folders that are unlikely to change and either move them to another location or tell the script programmatically to ignore them during the scan. For instance, I keep the code for my DIY AI on my Desktop and don't want the script to scan my virtual environment folder, because its Python libraries contain thousands of files.

Once you narrow the scan down to just the important files, it's also better to store the results in a database table instead of a CSV file so the AI can access all of the information much faster.

Finally, since we will be accessing the database from multiple scripts, it's better to create a common utility script that keeps the database logic in one place, so it is easy to update if we need to.

Moving ahead

To make use of the database, we will modify our script. We will use a full scan script to scan our entire Desktop and add the results to the database. This script will also create a hash for each file, which acts like a fingerprint, and store it in the database.
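To see why the hash works as a fingerprint, here is a quick standard-library illustration (the sample byte strings are just made up for the demo): identical contents always produce the same SHA-256 digest, while changing even one character produces a completely different one.

import hashlib

# Identical contents always yield the same digest...
print(hashlib.sha256(b"meeting notes v1").hexdigest())
print(hashlib.sha256(b"meeting notes v1").hexdigest())  # same fingerprint

# ...while a one-character change yields a completely different digest.
print(hashlib.sha256(b"meeting notes v2").hexdigest())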

Next, we'll create an incremental scan that you can run much more frequently than the script that created the CSV files. It will only look for changes, including new, changed, and deleted files, and log any it finds in the database.


Benefits of this method

Since we are only looking for changes, running the script requires much less processing power. It also collects far less data without sacrificing the ability to know when any file changed. Hashing also allows us to find duplicates easily and to tell when a file has moved.

How to add this system to your DIY AI project

Create the database table

First, you will need to create a table in your database to store the file information, with the following fields. We made them all text fields so they work easily with different databases, but you might want to switch to integer and datetime types later. A sample script for creating the table follows the field list.

id AUTOINCREMENT PRIMARY KEY,
path TEXT(255) NOT NULL UNIQUE,
file_name TEXT(255),
extension TEXT(50),
size_bytes TEXT,
last_modified TEXT(255),
hash TEXT(255),
created_at TEXT(255),
scanned_at TEXT(255)
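If you would rather create the table from code than in your database's GUI, a one-off script along these lines should do it. This is only a sketch: it assumes an Access database, the Mailop_FileTrack table name used by the scripts below, and a hypothetical file name create_table.py, so adjust the DDL and field sizes for your own database.

# create_table.py -- hypothetical one-off sketch, assuming Access SQL and the
# Mailop_FileTrack table name used by the scripts below; adjust for your database
import pyodbc

DB_PATH = r"C:\path\to\file_tracking.accdb"  # same path you will set in common_utils.py

conn = pyodbc.connect(
    r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};"
    rf"DBQ={DB_PATH};"
)
cur = conn.cursor()
cur.execute("""
    CREATE TABLE Mailop_FileTrack (
        id AUTOINCREMENT CONSTRAINT pk_filetrack PRIMARY KEY,
        path TEXT(255) NOT NULL CONSTRAINT uq_filetrack_path UNIQUE,
        file_name TEXT(255),
        extension TEXT(50),
        size_bytes TEXT(50),
        last_modified TEXT(255),
        hash TEXT(255),
        created_at TEXT(255),
        scanned_at TEXT(255)
    )
""")
conn.commit()
conn.close()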

Create the common_utils.py file

This file contains the connection details for your database and also handles hashing and extracting metadata from the files. Remember to update the path to your database.

# common_utils.py
import os
import hashlib
from pathlib import Path
from datetime import datetime

import pyodbc

DB_PATH = r"C:\path\to\file_tracking.accdb"  # Update this to your actual Access DB path
DIRECTORY_TO_SCAN = Path.home() / "Desktop"  # Adjust if needed


def get_connection():
    conn_str = (
        r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};"
        rf"DBQ={DB_PATH};"
    )
    return pyodbc.connect(conn_str)


def compute_hash(file_path: Path) -> str:
    """Compute SHA-256 hash of a file."""
    hash_sha256 = hashlib.sha256()
    with open(file_path, 'rb') as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_sha256.update(chunk)
    return hash_sha256.hexdigest()


def file_metadata(file_path: Path):
    """Return (file_name, extension, size_bytes, last_modified) for a file."""
    stat = file_path.stat()
    size = stat.st_size
    last_mod = datetime.fromtimestamp(stat.st_mtime)
    ext = file_path.suffix
    file_name = file_path.stem
    return file_name, ext, size, last_mod
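As a quick sanity check before moving on, you can point the helpers at any file on your Desktop; the file name below is just a hypothetical placeholder.

# Quick test of the helpers above; "example.txt" is a hypothetical placeholder
from pathlib import Path
from common_utils import compute_hash, file_metadata

sample = Path.home() / "Desktop" / "example.txt"
if sample.exists():
    print(file_metadata(sample))  # (file_name, extension, size_bytes, last_modified)
    print(compute_hash(sample))   # 64-character SHA-256 hex digest
else:
    print("Pick a file that exists on your Desktop.")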

Create the full_scan.py file

After creating this file, run it once to populate the database initially. It also includes an option to skip a directory. If you don't want it to leave out any directories, just set SKIP_DIR to a path that doesn't exist.

# full_scan.py
import os
from datetime import datetime
from pathlib import Path

import pyodbc

from common_utils import get_connection, compute_hash, file_metadata, DIRECTORY_TO_SCAN

# Directory to skip
SKIP_DIR = Path(r"C:\Path\to\folder\to\skip")


def sanitize(value: str) -> str:
    # Double up any single quotes to avoid SQL errors
    return value.replace("'", "''")


if __name__ == "__main__":
    conn = get_connection()
    cur = conn.cursor()
    now = datetime.now()
    now_str = now.strftime("%Y-%m-%d %H:%M:%S")

    for root, dirs, files in os.walk(DIRECTORY_TO_SCAN):
        root_path = Path(root)

        # Remove SKIP_DIR if encountered so os.walk does not descend into it
        dirs[:] = [d for d in dirs if (root_path / d) != SKIP_DIR]

        for fname in files:
            file_path = root_path / fname
            file_name, ext, size, last_mod = file_metadata(file_path)
            last_mod_str = last_mod.strftime("%Y-%m-%d %H:%M:%S")
            file_hash = compute_hash(file_path)

            # Convert all fields to strings and sanitize
            path_str = sanitize(str(file_path))
            file_name_str = sanitize(file_name)
            ext_str = sanitize(ext)
            size_str = str(size)  # size stored as text
            last_mod_str = sanitize(last_mod_str)
            file_hash_str = sanitize(file_hash)
            now_str_sanitized = sanitize(now_str)

            # Build the SQL statement directly
            sql = f"""
                INSERT INTO Mailop_FileTrack (
                    path, file_name, extension, size_bytes, last_modified, hash, created_at, scanned_at
                ) VALUES (
                    '{path_str}', '{file_name_str}', '{ext_str}', '{size_str}',
                    '{last_mod_str}', '{file_hash_str}', '{now_str_sanitized}', '{now_str_sanitized}'
                )
            """

            # Attempt to insert
            try:
                cur.execute(sql)
            except pyodbc.IntegrityError:
                # If the file path already exists, skip it
                pass

    conn.commit()
    conn.close()
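The full scan builds its SQL with f-strings, which is why every value goes through sanitize(). As an alternative sketch (not what the script above uses), pyodbc also accepts ? parameter placeholders, which keep the values out of the SQL text entirely. A hypothetical helper you could swap into the insert step might look like this:

# Hypothetical alternative: a parameterized insert using pyodbc's ? placeholders,
# so the values never touch the SQL string and sanitize() is not needed for them.
INSERT_SQL = (
    "INSERT INTO Mailop_FileTrack "
    "(path, file_name, extension, size_bytes, last_modified, hash, created_at, scanned_at) "
    "VALUES (?, ?, ?, ?, ?, ?, ?, ?)"
)

def insert_file(cur, file_path, file_name, ext, size, last_mod_str, file_hash, now_str):
    # cur is the pyodbc cursor from get_connection(); all values are passed as parameters
    cur.execute(INSERT_SQL, (str(file_path), file_name, ext, str(size),
                             last_mod_str, file_hash, now_str, now_str))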

Create the incremental scan file

Next, we create the incremental scan file that looks for changes and logs them. Add it to your scheduler and run it daily, or even more often, to record any changes the AI can use later.

# Incremental scan: only records changes since the last run
import os
from datetime import datetime
from pathlib import Path

import pyodbc

from common_utils import get_connection, compute_hash, file_metadata, DIRECTORY_TO_SCAN

# Directory to skip
SKIP_DIR = Path(r"C:\Path\to\folder\to\skip")


def sanitize(value: str) -> str:
    # Double up any single quotes to avoid SQL errors
    return value.replace("'", "''")


if __name__ == "__main__":
    conn = get_connection()
    cur = conn.cursor()
    now = datetime.now()
    now_str = now.strftime("%Y-%m-%d %H:%M:%S")
    now_str_sanitized = sanitize(now_str)

    # Get known files
    cur.execute("SELECT path, size_bytes, last_modified, hash FROM Mailop_FileTrack")
    known_files = {}
    for row in cur.fetchall():
        # row.path, row.size_bytes, row.last_modified, row.hash are TEXT;
        # store them in a dict keyed by path
        known_files[row.path] = (row.size_bytes, row.last_modified, row.hash)

    current_files = []

    for root, dirs, files in os.walk(DIRECTORY_TO_SCAN):
        root_path = Path(root)

        # Remove SKIP_DIR if encountered so os.walk does not descend into it
        dirs[:] = [d for d in dirs if (root_path / d) != SKIP_DIR]

        for fname in files:
            file_path = root_path / fname
            current_files.append(str(file_path))

            file_name, ext, size, last_mod = file_metadata(file_path)

            # Convert to strings
            file_name_str = sanitize(file_name)
            ext_str = sanitize(ext)
            size_str = str(size)
            last_mod_str = last_mod.strftime("%Y-%m-%d %H:%M:%S")
            last_mod_str_sanitized = sanitize(last_mod_str)
            file_hash = compute_hash(file_path)
            file_hash_str = sanitize(file_hash)
            path_str = sanitize(str(file_path))

            if str(file_path) in known_files:
                old_size, old_last_mod, old_hash = known_files[str(file_path)]

                # Compare as strings
                if (size_str != old_size) or (last_mod_str != old_last_mod):
                    # File changed; store the new hash and metadata
                    sql = f"""
                        UPDATE Mailop_FileTrack
                        SET size_bytes='{size_str}',
                            last_modified='{last_mod_str_sanitized}',
                            hash='{file_hash_str}',
                            scanned_at='{now_str_sanitized}'
                        WHERE path='{path_str}'
                    """
                    cur.execute(sql)
                else:
                    # Unchanged, just update scanned_at
                    sql = f"""
                        UPDATE Mailop_FileTrack
                        SET scanned_at='{now_str_sanitized}'
                        WHERE path='{path_str}'
                    """
                    cur.execute(sql)
            else:
                # New file, insert
                sql = f"""
                    INSERT INTO Mailop_FileTrack (
                        path, file_name, extension, size_bytes, last_modified, hash, created_at, scanned_at
                    ) VALUES (
                        '{path_str}', '{file_name_str}', '{ext_str}', '{size_str}',
                        '{last_mod_str_sanitized}', '{file_hash_str}', '{now_str_sanitized}', '{now_str_sanitized}'
                    )
                """
                try:
                    cur.execute(sql)
                except pyodbc.IntegrityError:
                    # If the file path already exists, skip it
                    pass

    # Optional: handle deleted files if needed
    # missing_files = set(known_files.keys()) - set(current_files)
    # For now, do nothing with deleted files.

    conn.commit()
    conn.close()
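If you do want to act on deleted files, the commented-out set difference already identifies them. One option, sketched here under the assumption that you simply want the missing rows removed rather than flagged, would replace that comment block just before conn.commit():

# Sketch: delete rows for files that were in the table but not seen in this scan.
# Assumes you want missing files removed; you could instead flag them if you add
# a status column to the table.
missing_files = set(known_files.keys()) - set(current_files)
for missing_path in missing_files:
    cur.execute("DELETE FROM Mailop_FileTrack WHERE path = ?", (missing_path,))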

What’s next?

Now that the scripts are running, they will collect important information and save it in the database. This work will make it easy to find duplicate files. You will also be able to use this information to keep track of the work you do on your Desktop, allowing the AI to provide insights into your daily routine and much more, which we'll get to soon.
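For example, a short script like this hypothetical sketch (the find_duplicates.py name is just a suggestion), using the common_utils.py helper from above, lists every hash that appears more than once in the table:

# find_duplicates.py -- hypothetical sketch: group rows by hash; any hash with
# more than one row means two or more files share identical contents
from common_utils import get_connection

conn = get_connection()
cur = conn.cursor()
cur.execute(
    "SELECT hash, COUNT(*) AS copies "
    "FROM Mailop_FileTrack "
    "GROUP BY hash "
    "HAVING COUNT(*) > 1"
)
for row in cur.fetchall():
    print(f"{row.copies} files share hash {row.hash}")
conn.close()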

Follow GeekSided to keep up with our DIY AI project