I have data in an excel file and I need a way to identify if there are some rows that are identicals.
My data are more than 1000 rows and I want to highlight the double rows in the excel document.
This is the code that I use to identify the double lines:
file = pd.read_excel('Path'.xlsx)
for i in range(1112):
for y in range(1113):
if file.iloc[i].equals(file.iloc[y]) & (i != y) & (i<y) :
print(file.iloc[i])
print(file.iloc[y])
file.to_excel(r'Path2', index = False)
The problem is that I also need to highligth the double row and I am not managing to do that.
This is a small example that I found on-line and I tried to change some parts in order to adapt the script to my needs. It does not work, but I do not know how to fix it.
import pandas as pd
def highlight_max(x):
for i in range(3):
for y in range(4):
if ((x.iloc[i]) == (x.iloc[y])) & (i != y) & (i<y) :
return ['background-color: yellow' if x.iloc[i] == x.iloc[y] else '' for a in x]
cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4','Toyota Corolla'],
'Price': [22000,25000,27000,35000,25000], 'Year': [2000,2003,2005,2007,2003]}
df = pd.DataFrame(cars, columns = ['Brand', 'Price','Year'])
df.style.apply(highlight_max)
file.to_excel(r'path', index = False)
Source: Python Questions