DIY AI part 8: Add smarter file tracking for better AI learning


If you have been following along with our DIY AI project, you should now have a script that regularly scans the Desktop and records the files it finds in a database. This goes a long way toward getting organized, and you may have already started organizing the files yourself, which is a good thing.

In this guide, we’ll add two fields to our database table and modify the script to track when we access files and how often. These additions will help with organizing, and they give our AI information it can use to learn about our habits and file usage, which will be helpful later.

Before you begin

It’s a good idea to start at the beginning of this series because we cover a lot of topics, including installing the main Python libraries, setting up the development environment, and creating the scheduler script.

In the last guide, we created a database table to store file information and a Python script that regularly scans the Desktop (or any folder) and stores any changes in that table. If you find that the scanning process is taking too long, it’s a good idea to try moving some files to a better location to lighten the workload.

The new fields

accessed_at

In this guide, we will add two fields to our database table. The first is the accessed_at field, which tracks the last time you accessed a file. This information will be useful for telling active files apart from idle ones. Use “Date & Time (DATETIME)” in Microsoft Access for the accessed_at field.
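If you would rather add the field with SQL than click through the Access table designer, a one-time ALTER TABLE statement run through pyodbc does the same job. This is a minimal sketch, assuming the Your_Database_Table name and the get_connection() helper from the earlier guides:

from common_utilities import get_connection

# One-time setup: add the accessed_at column (DATETIME in Access SQL)
conn = get_connection()
cur = conn.cursor()
cur.execute("ALTER TABLE Your_Database_Table ADD COLUMN accessed_at DATETIME")
conn.commit()
conn.close()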

usage_count

The second field we need to add is the usage_count. This field will be a “Number (INTEGER)” in Microsoft Access. The usage_count field is a powerful metric that tracks how frequently a file has been accessed or modified. It can help you (and the AI) understand which files are important to your workflow. Files with high usage_count are likely critical to your daily tasks or long-term projects, while files with low or zero usage_count might be candidates for archiving or deletion.
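The same one-time approach works here; INTEGER is the Access SQL spelling of the “Number (INTEGER)” type. One extra step worth noting: existing rows get a NULL usage_count by default, and NULL + 1 stays NULL in Access, so backfill zeros before the first scan. Again a sketch under the same assumptions as above:

from common_utilities import get_connection

# One-time setup: add the usage_count column (Number/INTEGER in Access)
conn = get_connection()
cur = conn.cursor()
cur.execute("ALTER TABLE Your_Database_Table ADD COLUMN usage_count INTEGER")
# Backfill zeros so usage_count + 1 works on rows that existed before this guide
cur.execute("UPDATE Your_Database_Table SET usage_count = 0")
conn.commit()
conn.close()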

You can prioritize files with high usage_count for deeper analysis, previews, or metadata extraction, and you can use this information to focus your resources better when scanning, training, etc.
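Once the script below has run a few times, a quick query will show which files bubble to the top. Here is a sketch using Access’s TOP syntax, assuming the same table name as above (the secondary sort on path just breaks ties, since TOP returns all tied rows):

from common_utilities import get_connection

# List the ten most frequently touched files
conn = get_connection()
cur = conn.cursor()
cur.execute(
    "SELECT TOP 10 path, usage_count FROM Your_Database_Table "
    "ORDER BY usage_count DESC, path"
)
for row in cur.fetchall():
    print(row.usage_count, row.path)
conn.close()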

Updating your script

Once you have the fields added to your table, updating the system requires changing a few lines of code in your intermittent_scan.py file. You won’t even need to redo a complete scan.

Complete code

Here is the updated code. The changes are small and confined to the main loop, where the update and insert statements now set accessed_at and usage_count.

import os
from datetime import datetime
from pathlib import Path

import pyodbc

from common_utilities import get_connection, compute_hash, file_metadata, DIRECTORY_TO_SCAN

# Directory to skip
SKIP_DIR = Path(r"C:\Users\yourname\Desktop\Directory_To_Skip")


def sanitize(value: str) -> str:
    # Double up any single quotes to avoid SQL errors
    return value.replace("'", "''")


if __name__ == "__main__":
    conn = get_connection()
    cur = conn.cursor()

    now = datetime.now()
    now_str = now.strftime("%Y-%m-%d %H:%M:%S")
    now_str_sanitized = sanitize(now_str)

    # Get known files, keyed by path
    cur.execute("SELECT path, size_bytes, last_modified, hash FROM Your_Database_Table")
    known_files = {}
    for row in cur.fetchall():
        # row.path, row.size_bytes, row.last_modified, row.hash are TEXT
        known_files[row.path] = (row.size_bytes, row.last_modified, row.hash)

    current_files = []
    for root, dirs, files in os.walk(DIRECTORY_TO_SCAN):
        root_path = Path(root)

        # Remove SKIP_DIR if encountered, so os.walk never descends into it
        dirs[:] = [d for d in dirs if (root_path / d) != SKIP_DIR]

        for fname in files:
            file_path = root_path / fname
            current_files.append(str(file_path))

            file_name, ext, size, last_mod = file_metadata(file_path)

            # Convert everything to strings
            file_name_str = sanitize(file_name)
            ext_str = sanitize(ext)
            size_str = str(size)
            last_mod_str = last_mod.strftime("%Y-%m-%d %H:%M:%S")
            last_mod_str_sanitized = sanitize(last_mod_str)

            file_hash = compute_hash(file_path)
            file_hash_str = sanitize(file_hash)
            path_str = sanitize(str(file_path))

            if str(file_path) in known_files:
                old_size, old_last_mod, old_hash = known_files[str(file_path)]

                # Compare as strings
                if (size_str != old_size) or (last_mod_str != old_last_mod):
                    # File changed (re-hash done above): update the record,
                    # stamp accessed_at, and bump usage_count
                    sql = f"""
                        UPDATE Your_Database_Table
                        SET size_bytes='{size_str}',
                            last_modified='{last_mod_str_sanitized}',
                            hash='{file_hash_str}',
                            accessed_at='{now_str_sanitized}',
                            usage_count = usage_count + 1,
                            scanned_at='{now_str_sanitized}'
                        WHERE path='{path_str}'
                    """
                    cur.execute(sql)
                else:
                    # Unchanged, just update scanned_at
                    sql = f"""
                        UPDATE Your_Database_Table
                        SET scanned_at='{now_str_sanitized}'
                        WHERE path='{path_str}'
                    """
                    cur.execute(sql)
            else:
                # New file: insert with usage_count starting at 1
                sql = f"""
                    INSERT INTO Your_Database_Table (
                        path, file_name, extension, size_bytes, last_modified,
                        hash, created_at, scanned_at, accessed_at, usage_count
                    ) VALUES (
                        '{path_str}', '{file_name_str}', '{ext_str}', '{size_str}',
                        '{last_mod_str_sanitized}', '{file_hash_str}',
                        '{now_str_sanitized}', '{now_str_sanitized}', '{now_str_sanitized}', 1
                    )
                """
                try:
                    cur.execute(sql)
                except pyodbc.IntegrityError:
                    # If the file path already exists, skip it
                    pass

    # Optional: handle deleted files if needed
    # missing_files = set(known_files.keys()) - set(current_files)
    # For now, do nothing with deleted files.

    conn.commit()
    conn.close()
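One design note on the SQL above: the script escapes values with sanitize() and builds each statement as an f-string, which matches the style from the previous guide. pyodbc also accepts parameterized queries with ? placeholders, which skip the manual escaping entirely; the unchanged-file branch would look like this:

# Equivalent of the "unchanged" branch, using a parameterized query
cur.execute(
    "UPDATE Your_Database_Table SET scanned_at = ? WHERE path = ?",
    (now_str, str(file_path)),
)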

Summary

After making these changes to the code, we’ll let it run several more times to start collecting the information we need next.
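To confirm the new fields are filling in, you can spot-check the table between runs. Here is a sketch that lists everything the scanner has seen change at least once (new files start at a usage_count of 1, so anything above that was modified after it was first recorded):

from common_utilities import get_connection

# Spot-check: show files the scanner has seen change since they were added
conn = get_connection()
cur = conn.cursor()
cur.execute(
    "SELECT path, accessed_at, usage_count FROM Your_Database_Table "
    "WHERE usage_count > 1 ORDER BY accessed_at DESC"
)
for row in cur.fetchall():
    print(row.accessed_at, row.usage_count, row.path)
conn.close()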

Follow GeekSided for more fun projects with tech.