Re: Unicode problem in ucs4



On Mar 23, 3:04 pm, "M.-A. Lemburg" <m...@xxxxxxxxxx> wrote:
On 2009-03-23 08:18, abhi wrote:



On Mar 20, 5:47 pm, "M.-A. Lemburg" <m...@xxxxxxxxxx> wrote:
unicodeTest.c
#include<Python.h>
static PyObject *unicode_helper(PyObject *self,PyObject *args){
    PyObject *sampleObj = NULL;
    Py_UNICODE *sample = NULL;
    if (!PyArg_ParseTuple(args, "O", &sampleObj)){
        return NULL;
    }
    // Explicitly convert it to unicode and get Py_UNICODE value
    sampleObj = PyUnicode_FromObject(sampleObj);
    sample = PyUnicode_AS_UNICODE(sampleObj);
    wprintf(L"database value after unicode conversion is : %s\n", sample);
You have to use PyUnicode_AsWideChar() to convert a Python
Unicode object to a wchar_t representation.

Please don't make any assumptions about what Py_UNICODE maps
to, and always use the Unicode API for this. It is designed
to provide a portable interface and will not do more conversion
work than necessary.

Hi Marc,
     Thanks for the help. I tried PyUnicode_AsWideChar(), but I am
getting the same result, i.e. only the first letter is printed.

sample code:

#include<Python.h>

static PyObject *unicode_helper(PyObject *self,PyObject *args){
    PyObject *sampleObj = NULL;
    wchar_t *sample = NULL;
    int size = 0;

    if (!PyArg_ParseTuple(args, "O", &sampleObj)){
        return NULL;
    }

    // use wide char function
    size = PyUnicode_AsWideChar(databaseObj, sample, PyUnicode_GetSize(databaseObj));

The 3rd argument is the buffer size in wchar_t units, not bytes: as the
header comment below says, at most size wchar_t characters are copied.
The buffer itself must hold sizeof(wchar_t) * PyUnicode_GetSize(databaseObj)
bytes without a trailing NUL, or sizeof(wchar_t) *
(PyUnicode_GetSize(databaseObj) + 1) bytes with one.

You also have to allocate the buffer to store the wchar_t data in.
Passing in a NULL pointer will result in a seg fault. The function
does not allocate a buffer for you:

/* Copies the Unicode Object contents into the wchar_t buffer w.  At
   most size wchar_t characters are copied.

   Note that the resulting wchar_t string may or may not be
   0-terminated.  It is the responsibility of the caller to make sure
   that the wchar_t string is 0-terminated in case this is required by
   the application.

   Returns the number of wchar_t characters copied (excluding a
   possibly trailing 0-termination character) or -1 in case of an
   error. */

PyAPI_FUNC(Py_ssize_t) PyUnicode_AsWideChar(
    PyUnicodeObject *unicode,   /* Unicode object */
    register wchar_t *w,        /* wchar_t buffer */
    Py_ssize_t size             /* size of buffer */
    );
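
A minimal sketch of the allocate-then-terminate pattern that the header comment above implies. It deliberately avoids Python.h so it can be compiled on its own: copy_wide() is a hypothetical stand-in that only mimics the documented contract (copy at most size wchar_t characters, no guaranteed terminator), and to_terminated_wide() is likewise an illustrative name, not part of the Python C API.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical stand-in with the same contract as the header comment:
   copies at most `size` wchar_t characters into `w`, does NOT add a
   terminating 0, and returns the number of characters copied. */
static long copy_wide(const wchar_t *src, size_t src_len,
                      wchar_t *w, size_t size)
{
    size_t n = src_len < size ? src_len : size;
    assert(src != NULL && w != NULL);
    wmemcpy(w, src, n);
    return (long)n;
}

/* Allocates room for src_len characters plus a terminator, copies the
   data, and terminates the string explicitly. The caller must free()
   the result. Returns NULL if the allocation fails. */
wchar_t *to_terminated_wide(const wchar_t *src, size_t src_len)
{
    /* Allocation is in bytes: number of elements times sizeof(wchar_t). */
    wchar_t *w = malloc((src_len + 1) * sizeof(wchar_t));
    long copied;

    if (w == NULL)
        return NULL;

    /* The size passed to the copy routine is in wchar_t units. */
    copied = copy_wide(src, src_len, w, src_len);
    w[copied] = L'\0';   /* termination is the caller's responsibility */
    return w;
}
```

In an extension module, the call to copy_wide() would be replaced by the real PyUnicode_AsWideChar() call, with its return value checked for -1 before it is used as an index.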



    printf("%d chars are copied to sample\n", size);
    wprintf(L"database value after unicode conversion is : %s\n", sample);
    return Py_BuildValue("");

}

static PyMethodDef funcs[]={{"unicodeTest",(PyCFunction)unicode_helper,METH_VARARGS,"test ucs2, ucs4"},{NULL}};

void initunicodeTest(void){
    Py_InitModule3("unicodeTest",funcs,"");

}

This prints the following when input value is given as "test":
4 chars are copied to sample
database value after unicode conversion is : t

Any ideas?
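
One likely cause of the single-letter output, assuming a little-endian platform with a 4-byte wchar_t: in the wprintf family, %s without the l modifier expects a char * string, so the wide buffer is read byte by byte and stops at the first zero byte, right after 't'. The conversion for a wchar_t * argument is %ls. The sketch below uses swprintf() so the result can be inspected; format_sample() is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: formats a wide string into `out`, which has room
   for `n` wchar_t elements including the terminator. Returns the number
   of wide characters written, or a negative value on error. */
int format_sample(wchar_t *out, size_t n, const wchar_t *sample)
{
    assert(out != NULL && sample != NULL);
    /* %ls consumes a wchar_t *; a plain %s would expect a char *. */
    return swprintf(out, n, L"value is : %ls", sample);
}
```

The same conversion applies to direct output, e.g. wprintf(L"value is : %ls\n", sample). For non-ASCII text, a call to setlocale(LC_ALL, "") before printing is usually needed as well.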

-
Abhigyan

--
Marc-Andre Lemburg
eGenix.com


Thanks Marc, John,
With your help, I am at least getting somewhere. I rewrote the code
to compare the Py_UNICODE and wchar_t outputs, and they look exactly
the same.

#include<Python.h>

static PyObject *unicode_helper(PyObject *self,PyObject *args){
    const char *name;
    PyObject *sampleObj = NULL;
    Py_UNICODE *sample = NULL;
    wchar_t * w=NULL;
    int size = 0;
    int i;

    if (!PyArg_ParseTuple(args, "O", &sampleObj)){
        return NULL;
    }

    // Explicitly convert it to unicode and get Py_UNICODE value
    sampleObj = PyUnicode_FromObject(sampleObj);
    sample = PyUnicode_AS_UNICODE(sampleObj);
    printf("size of sampleObj is : %d\n",PyUnicode_GET_SIZE(sampleObj));
    w = (wchar_t *) malloc((PyUnicode_GET_SIZE(sampleObj)+1)*sizeof(wchar_t));
    size = PyUnicode_AsWideChar(sampleObj,w,(PyUnicode_GET_SIZE(sampleObj)+1)*sizeof(wchar_t));
    printf("%d chars are copied to w\n",size);
    printf("size of wchar_t is : %d\n", sizeof(wchar_t));
    printf("size of Py_UNICODE is: %d\n",sizeof(Py_UNICODE));
    for(i=0;i<PyUnicode_GET_SIZE(sampleObj);i++){
        printf("sample is : %c\n",sample[i]);
        printf("w is : %c\n",w[i]);
    }
    return sampleObj;
}

static PyMethodDef funcs[]={{"unicodeTest",(PyCFunction)unicode_helper,METH_VARARGS,"test ucs2, ucs4"},{NULL}};

void initunicodeTest(void){
    Py_InitModule3("unicodeTest",funcs,"");
}

This gives the following output when I pass "abc" as input:

size of sampleObj is : 3
3 chars are copied to w
size of wchar_t is : 4
size of Py_UNICODE is: 4
sample is : a
w is : a
sample is : b
w is : b
sample is : c
w is : c

So both Py_UNICODE and wchar_t are 4 bytes wide, and since each
character is followed by three \0 bytes, printf or wprintf prints only
one letter. I need to process the data further, and the libraries I use
for that need it in UCS2 format (2 bytes per character); otherwise they
fail. Is there any way to force wchar_t to be 2 bytes, or can I convert
this UCS4 data to UCS2 explicitly?
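
A rough sketch of an explicit conversion, assuming the UCS4 data is available as an array of 32-bit code points and that the consuming library accepts 16-bit code units. ucs4_to_utf16() is a hypothetical helper, not part of the Python C API. Code points above U+FFFF do not fit in a single 16-bit unit, so this version emits UTF-16 surrogate pairs for them; a library that is strictly UCS2 would have to reject such input instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: converts `len` UCS4 code points in `src` into
   16-bit units in `dst`. `cap` is the capacity of `dst` in 16-bit units,
   including room for the terminating 0. Returns the number of units
   written (excluding the terminator), or -1 if a code point is invalid
   or `dst` is too small. */
long ucs4_to_utf16(const uint32_t *src, size_t len,
                   uint16_t *dst, size_t cap)
{
    size_t out = 0;
    size_t i;

    assert(src != NULL && dst != NULL);
    if (cap == 0)
        return -1;

    for (i = 0; i < len; i++) {
        uint32_t cp = src[i];

        /* Reject values outside Unicode and lone surrogates. */
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return -1;

        if (cp < 0x10000) {
            /* Fits in one unit; keep one slot free for the terminator. */
            if (out + 1 >= cap)
                return -1;
            dst[out++] = (uint16_t)cp;
        } else {
            /* Needs a surrogate pair (two units) plus the terminator. */
            if (out + 2 >= cap)
                return -1;
            cp -= 0x10000;
            dst[out++] = (uint16_t)(0xD800 + (cp >> 10));
            dst[out++] = (uint16_t)(0xDC00 + (cp & 0x3FF));
        }
    }
    dst[out] = 0;
    return (long)out;
}
```

For "abc" this yields three 16-bit units followed by a terminating 0. On the Python side, encoding the object with PyUnicode_AsUTF16String() may be a simpler route, since it returns the data as a byte string and leaves wchar_t out of the picture.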

-
Abhigyan


