What do you do if you have to do loops in Python?
For example: I have a file with 2000 lines and a DB of 6000 lines.
For each line in the file I have to check whether any of the 6000 elements appears in the line.
I currently have a solution in C# and Java, but I'm curious how you would go about it in Python, since the loops are very slow.
>Python
Please, leave ...
... just be gone
One loop to create an array of 2000 elements
One database query to check if a row exists with any such element
Maybe have to chunk it a bit in case 2000 is too many for an IN() clause.
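The chunked IN() idea could be sketched like this with sqlite3; the `names` table and column here are made up for illustration, and 500 is just a safe chunk size well under typical parameter limits:

```python
import sqlite3

def find_existing(conn, values, chunk_size=500):
    """Return the subset of `values` present in the (hypothetical) names table,
    splitting the IN() list into chunks to stay under parameter limits."""
    found = set()
    for i in range(0, len(values), chunk_size):
        chunk = values[i:i + chunk_size]
        # One "?" placeholder per value in this chunk
        placeholders = ",".join("?" * len(chunk))
        rows = conn.execute(
            f"SELECT name FROM names WHERE name IN ({placeholders})", chunk)
        found.update(row[0] for row in rows)
    return found
```

This issues one query per chunk instead of one per line, which is the point of the post above.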
I don't get what you mean?
The problem is simple.
You have a file and a DB of names, or just "strings", and you have to find whether any of them appears in each line. Is my approach moronic? Two for loops, and then a scan over the string.
it's what the industry uses
And besides, why in the world would you loop through records and execute a query per record in a loop? Are you insane?
def check_substrings(file_path, db_path):
    # Read lines from db.txt
    with open(db_path, 'r') as db_file:
        db_lines = db_file.readlines()
    # Read lines from file.txt
    with open(file_path, 'r') as file:
        file_lines = file.readlines()
    # Check if any line from db.txt is a substring of any line in file.txt
    for db_line_num, db_line in enumerate(db_lines, 1):
        for file_line_num, file_line in enumerate(file_lines, 1):
            if db_line.strip().upper() in file_line.upper():
                print(f"Line {db_line_num} in {db_path} matches a substring in line {file_line_num} in {file_path}")
Generated by ChatGPT; I'm asking if this is too slow and if it can be made faster.
no it's perfect
in fact, ChatGPT is so good you should use it instead of killing a thread for this
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_end_of_word = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for char in word:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
        node.is_end_of_word = True

    def search(self, word):
        node = self.root
        matched_string = ""
        for char in word:
            if char not in node.children:
                break
            node = node.children[char]
            matched_string += char
            if node.is_end_of_word:
                return matched_string
        return None

def build_trie(db_lines):
    trie = Trie()
    for line in db_lines:
        trie.insert(line)
    return trie

def check_substrings(file_path, db_path):
    # Read lines from db.txt and convert to a list with .strip().upper()
    with open(db_path, 'r') as db_file:
        db_lines = [line.strip().upper() for line in db_file]
    # Build a trie from db lines
    trie = build_trie(db_lines)
    # Read lines from file.txt
    with open(file_path, 'r') as file:
        for file_line_num, file_line in enumerate(file, 1):
            file_line = file_line.strip().upper()
            current_index = 0
            while current_index < len(file_line):
                matched_string = trie.search(file_line[current_index:])
                if matched_string:
                    print(f"Found a match '{matched_string}' in line {file_line_num}")
                    break
                current_index += 1
This is the code that ChatGPT came up with; it dropped the runtime from 8 seconds to 1 second.
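For comparison, a common pure-Python shortcut for the same multi-pattern substring scan (not from the thread, just a hedged alternative) is to compile all the DB lines into one regex alternation, so the per-line scanning happens in C rather than in a Python loop:

```python
import re

def find_matches(db_lines, file_lines):
    # One alternation of all escaped patterns; assumes db_lines is non-empty.
    # re.search scans each line once, case-insensitively like the .upper() code.
    pattern = re.compile("|".join(re.escape(s) for s in db_lines), re.IGNORECASE)
    hits = []
    for line_num, line in enumerate(file_lines, 1):
        match = pattern.search(line)
        if match:
            hits.append((line_num, match.group(0)))
    return hits
```

For thousands of patterns the compiled alternation can be slow to build but cheap per line, so it is worth timing against the trie on the same data.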
Test my solution here:
then let me know the runtime.
And for god's sake, stop polluting IQfy with your AI slop.
For the extreme case, 22 seconds down to 10 seconds with your thing.
For smaller cases, the difference is 0.04 s (trie) vs 0.12 s (yours).
def check_substrings(file_path, db_path):
    # Read lines from db.txt
    with open(db_path, 'r') as db_file:
        db_lines = [line.strip().upper() for line in db_file]
    # Read lines from file.txt and convert to a set of uppercase lines
    with open(file_path, 'r') as file:
        file_lines_set = {line.strip().upper() for line in file}
    # Check if any line from db.txt is a substring of any line in file.txt
    matches = [(db_line_num, db_line, file_line_num)
               for db_line_num, db_line in enumerate(db_lines, 1)
               for file_line_num, file_line in enumerate(file_lines_set, 1)
               if db_line in file_line]
    # Print matches
    for db_line_num, db_line, file_line_num in matches:
        print(f"Line {db_line_num} in {db_path} matches a substring in line {file_line_num}: {db_line}")
You did it wrong. You must use the "in" operation on the SET, not the LIST. Sets have O(1) lookup time, lists have O(n). Fricking dummy.
I'm leaving this thread, have fun with your AI slop.
# Read lines from db.txt
with open("db.txt", 'r') as db_file:
    db_lines = db_file.readlines()
# Read lines from file.txt
with open("file.txt", 'r') as file:
    file_lines = file.readlines()
file_set = set(file_lines)
[ print(f"{line_num}: {line}") for line_num, line in enumerate(db_lines, 1) if line in file_set ]
Here you go, since you clearly can't manage it I did all the work for you. Just run this script and time it.
Still doesn't work, it doesn't match anything.
I don't mean to be toxic, but the generated version worked, meanwhile yours doesn't.
I just tested my version and it does work, I tested it on
for (( i = 0; i < 5000; i++ )); do echo "$(head -c 300 /dev/urandom | tr -dc 'a-z0-9 ')" >> db.txt; done
for (( i = 0; i < 5000; i++ )); do echo "$(head -c 300 /dev/urandom | tr -dc 'a-z0-9 ')" >> file.txt; done
cat file.txt >> db.txt
shuf -o db.txt db.txt
time python3 script.py
real 0m0.110s
user 0m0.070s
sys 0m0.041s
Meanwhile your first solution took 18 seconds, so mine is about 160x faster. I don't know how you managed to break my code when I literally gave you a complete script to run.
[ print(f"{line_num}: {line}") for line_num, line in enumerate(db_lines, 1) if line in file_set ]
this checks whether "X" is in file_set; if my file_set contains "aaa X aaa" it will not match. Meanwhile this
matches = [(db_line_num, db_line, file_line_num)
           for db_line_num, db_line in enumerate(db_lines, 1)
           for file_line_num, file_line in enumerate(file_lines_set, 1)
           if db_line in file_line]
does. Why do you have to be so toxic if you didn't understand what I asked?
I don't even understand what you're trying to say.
file_set1 --> "xxxFOOxxx"
file_set2 --> "FOO"
Your code will match file_set2 but not file_set1 given the item "FOO". What don't you understand? And stop calling me names.
Sorry, from 22 seconds with the first implementation to 1 second with the second one,
for a file with 10k lines and a DB with 2.4k lines.
It takes 7 seconds to do it with the code I provided.
it's a sunny day, why are you seething?
In Python, if you're using a for loop it almost always means you're doing something wrong. In this case there is a trivial solution that will also be orders of magnitude faster: convert both lists to a set, then take the intersection.
I imagine the fastest solution that could reasonably be implemented while preserving the features of the current program (such as keeping track of line numbers) would be to convert file_lines to a set, then use a list comprehension over db_lines. Something like this:
file_set = set(file_lines)
[ print("some_bullshit") for line_num, line in enumerate(db_lines, 1) if line in file_set ]
Haven't tested it so you might need to make some syntax corrections, my Python is a little rusty. But I really doubt you'll find a faster solution than this while keeping things so simple.
Benefits:
- sets offer average lookup time of O(1)
- list comprehensions use optimised C code so they're much faster than for loops
- entire algorithm is O(n), compared to the O(n^2) algorithm you posted
You're welcome.
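Written out as runnable code, the idea above might look like this (variable names assumed, and case-normalized like the earlier snippets):

```python
def exact_line_matches(db_lines, file_lines):
    # Build the set once; each membership test is then O(1) on average.
    file_set = {line.strip().upper() for line in file_lines}
    return [(line_num, line) for line_num, line in enumerate(db_lines, 1)
            if line.strip().upper() in file_set]
```

Note that this finds whole-line matches only, not substrings, which is the exact distinction argued over elsewhere in the thread.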
Put all 6000 elements in a map.
For every line in the 2000-line file, see if the element is already there.
For bigger sets I don't really know.
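A minimal sketch of the map idea above; in Python a set covers the membership-only case, but a dict works too if you later need to carry a value per element (function names here are made up):

```python
def build_lookup(elements):
    # Dict keyed by element; lookups are O(1) on average, same as a set.
    return {element: True for element in elements}

def lines_present(lines, lookup):
    # Keep only the lines that appear as keys in the lookup.
    return [line for line in lines if line in lookup]
```

Like the set version, this tests exact equality per line, not substring containment.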
>not coding in pregabalin
Ngmi