python - How to load a dataframe from a printed dataframe string?

Tuesday 18 October 2016


People often ask questions that include the output of print(dataframe). It is convenient to have a quick way of loading that printed data back into a pandas.DataFrame object.

What are the recommended ways of loading a dataframe from a dataframe string (which may or may not be properly formatted)?

Example-1

If you want to load the following string as a dataframe what would you do?

# Dummy Data
s1 = """
Client NumberOfProducts ID

A      1                2
A      5                1
B      1                2
B      6                1
C      9                1
"""

Example-2

This type is closer to what you find in a CSV file.

# Dummy Data
s2 = """
Client, NumberOfProducts, ID
 A, 1, 2
 A, 5, 1
 B, 1, 2
 B, 6, 1
 C, 9, 1

"""

Expected Output

Note: The following two links do not address the specific situation presented in Example-1. I do not think my question is a duplicate, because none of the solutions posted on those links (at the time of writing) can load the string in Example-1.

Create Pandas DataFrame from a string. Note that pd.read_csv(StringIO(s1), sep), as suggested here, doesn't really work for Example-1. You get the following output.

This question was marked as a duplicate of two links. One of them is the one above, which fails to address the case presented in Example-1. And the second one is . Among all the answers presented there, only one looked like it might work for Example-1, but it did not.

# could not read the clipboard and threw error
pd.read_clipboard(sep=r'\s\s+')

Error Thrown:

PyperclipException: 
    Pyperclip could not find a copy/paste mechanism for your system.
    For more information, please visit https://pyperclip.readthedocs.org

Answer

I can suggest two methods to approach this problem.

Method-1

Process the string with regex and numpy to make the dataframe. In my experience this works most of the time. It would work for the case presented in "Example-1".

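import pandas as pd
import numpy as np
import re

# Make Dataframe
s = s1  # the string from Example-1

ncols = 3  # number of columns
# Collapse every run of whitespace (spaces and newlines) into a single comma
ss = re.sub(r'\s+', ',', s.strip())
# Split into tokens and reshape into rows of ncols fields each
sa = np.array(ss.split(',')).reshape(-1, ncols)
# The first row holds the column names; the rest is data (values stay strings)
df = pd.DataFrame(sa[1:], columns=sa[0])
df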
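
The reshape approach above hard-codes the number of columns and leaves every value as a string. As a rough sketch, the same idea can be wrapped in a helper that infers the column count from the header line and converts numeric columns afterwards. The name frame_from_printed is hypothetical, and the sketch assumes the text has no index column and no spaces inside any value.

```python
import re

import numpy as np
import pandas as pd


def frame_from_printed(s):
    """Rebuild a DataFrame from whitespace-aligned text (hypothetical helper)."""
    # Keep non-empty lines only; the first one is taken to be the header.
    lines = [ln for ln in s.strip().splitlines() if ln.strip()]
    ncols = len(lines[0].split())  # infer the column count from the header
    # Every run of whitespace (spaces or newlines) separates two tokens.
    tokens = re.split(r"\s+", s.strip())
    # reshape raises ValueError if the token count is not a multiple of ncols.
    sa = np.array(tokens).reshape(-1, ncols)
    df = pd.DataFrame(sa[1:], columns=sa[0])
    # All values are strings at this point; convert columns that parse as numbers.
    for col in df.columns:
        try:
            df[col] = pd.to_numeric(df[col])
        except (ValueError, TypeError):
            pass  # leave non-numeric columns (such as Client) as strings
    return df


s1 = """
Client NumberOfProducts ID

A      1                2
A      5                1
B      1                2
B      6                1
C      9                1
"""

df1 = frame_from_printed(s1)
```

If the printed text does include an index column, the header has fewer tokens than the data rows and the reshape fails or misaligns, so this sketch would need adjusting for that case.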

Method-2

Use io.StringIO to feed the string into pandas.read_csv(). This only works if the separator is well defined, for instance if your data looks similar to "Example-2". Source credit

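import pandas as pd
from io import StringIO

# Make Dataframe
s = s2  # the string from Example-2
# skipinitialspace=True drops the spaces that follow each comma,
# so the column names come out as 'NumberOfProducts', not ' NumberOfProducts'
df = pd.read_csv(StringIO(s), sep=',', skipinitialspace=True)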
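
A run of whitespace can also be treated as a well-defined separator, so a regex separator may be worth trying on text shaped like Example-1 as well. The sketch below is one possible way to read both example strings with read_csv. It assumes a reasonably recent pandas version and values that contain no embedded spaces, and the intermediate name cleaned is only illustrative.

```python
from io import StringIO

import pandas as pd

s1 = """
Client NumberOfProducts ID

A      1                2
A      5                1
B      1                2
B      6                1
C      9                1
"""

s2 = """
Client, NumberOfProducts, ID
 A, 1, 2
 A, 5, 1
 B, 1, 2
 B, 6, 1
 C, 9, 1

"""

# Whitespace-aligned text: the regex separator treats any run of whitespace
# as one delimiter; blank lines are skipped by read_csv's default settings.
df1 = pd.read_csv(StringIO(s1), sep=r"\s+")

# Comma-separated text with stray spaces: strip each line first, then let
# skipinitialspace drop the spaces that follow each comma.
cleaned = "\n".join(line.strip() for line in s2.strip().splitlines())
df2 = pd.read_csv(StringIO(cleaned), sep=",", skipinitialspace=True)
```

Unlike the reshape approach, read_csv infers the numeric columns on its own, so NumberOfProducts and ID come back as integers without a separate conversion step.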
